Most AI coding tools work well when the problem is small and the next step is obvious. As soon as requirements change, or the design gets a bit messy, their usefulness tends to drop off. So instead of another surface‑level trial, I put Augment Code through its paces the same way I’d test any development tool: build something real, then deliberately make it harder.
Before getting into the details, it’s worth setting some context. Tools like GitHub Copilot and Claude Code have moved well beyond simple autocomplete — both now offer chat and agent‑style workflows that can plan and apply changes across multiple files. They’re solid tools, and in many cases they’re already good enough for day‑to‑day development.
Augment takes a slightly different angle. Rather than focusing on how to generate code or orchestrate changes, it puts most of its effort into understanding the structure of your entire codebase — how the pieces fit together, what depends on what, and what’s likely to break when something changes. Think less “smart autocomplete with agents” and more “teammate who’s already read the repo.”
The setup
The project itself was intentionally boring:
- FastAPI backend
- React + TypeScript frontend
- A simple “weather + advice” domain
The app didn’t matter. What mattered was how the code changed over time. I wasn’t testing “can it write code?” — I was testing whether it could keep up with real engineering work as requirements evolved and the shape of the system shifted.
Example: starting structure
weather-plus/
├─ backend/
│  ├─ app/
│  │  ├─ api/
│  │  ├─ models/
│  │  ├─ services/
│  │  └─ rules/
├─ frontend/
│  ├─ src/
│  │  ├─ api/
│  │  └─ components/
└─ .augment/
   └─ rules/
Phase 1 — Start clean or don’t bother
Phase 1 was about structure, not features. I focused on:
- Thin API routes
- Explicit models
- No business logic in the frontend
- Stubbed data instead of real APIs
Nothing exciting here. Just laying the groundwork before adding any real features.
Example: thin FastAPI route
router.get("/weather", response_model=WeatherResponse)def get_weather(): weather = weather_service.get_current_weather() advice = advice_service.evaluate(weather) return WeatherResponse(weather=weather, advice=advice)
Phase 2 — Does it actually understand the code?
Before letting Augment change anything, I asked it to explain the system. Not generically — this repo:
- Where data came from
- How it flowed
- What would break if something changed
This was the first real signal. Augment could reason across files and layers in a way that felt closer to a junior engineer reading the codebase than a fancy autocomplete. If it had failed here, the rest wouldn’t have been worth doing.
Example: General design question
Explain how this system works
Augment walked through how the frontend calls a single /weather endpoint, the backend fetches raw weather data, maps it to an internal model, evaluates it against a set of rules, and returns both the weather and the advice together.
Example: Change impact question
If I change the Weather model, what else needs to update?
Augment gave a more thorough answer — covering the API response shape in the backend, the TypeScript types in the frontend, and any components that render those fields.
Phase 3 — Add logic, but don’t let it sprawl
Next, I introduced derived “weather advice”:
- Biking conditions
- Laundry windows
- Alerts
The rules themselves were simple. What I was watching was where the logic ended up.
Good signs:
- Logic stayed in the backend
- API routes stayed thin
- Frontend stayed dumb
Bad signs (which thankfully didn’t happen much):
- Logic creeping into React
- Interpretation duplicated across layers
With clear guardrails, Augment behaved sensibly. It still needed direction, though: it followed the structure I’d set rather than inventing one of its own.
Example: Advice evaluation logic
def evaluate_biking(weather: Weather) -> BikingAdvice:
    if weather.heavy_rain:
        return BikingAdvice.NO
    if weather.humidity > HUMIDITY_THRESHOLD:
        return BikingAdvice.NO
    if weather.wind_speed > WIND_THRESHOLD:
        return BikingAdvice.MAYBE
    return BikingAdvice.YES
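To show where that composition lives, here is a rough sketch of the advice service that the Phase 1 route calls. It reuses Weather and evaluate_biking from the examples above; only evaluate_biking appears in the post, so the enum values, the Advice container, the laundry rule, and the threshold value are my own illustration of the shape, not the repo's actual code.

from enum import Enum

from pydantic import BaseModel

HUMIDITY_THRESHOLD = 80  # illustrative; the post never states the real values


class BikingAdvice(Enum):
    YES = "yes"
    MAYBE = "maybe"
    NO = "no"


class LaundryAdvice(Enum):
    OK = "ok"
    RISKY = "risky"


class Advice(BaseModel):
    biking: BikingAdvice
    laundry: LaundryAdvice
    # alerts omitted for brevity


def evaluate_laundry(weather: Weather) -> LaundryAdvice:
    # Hypothetical second rule, here only to show composition across domains.
    if weather.rain > 0 or weather.humidity > HUMIDITY_THRESHOLD:
        return LaundryAdvice.RISKY
    return LaundryAdvice.OK


def evaluate(weather: Weather) -> Advice:
    # advice_service.evaluate() only composes the individual rules; each rule
    # stays a small pure function, which keeps the API route thin.
    return Advice(
        biking=evaluate_biking(weather),
        laundry=evaluate_laundry(weather),
    )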
Phase 4 — Real APIs, real mess
Phase 4 replaced stubbed data with a real weather API. When I mentioned Open-Meteo, Augment read the documentation and suggested how to consume it — I didn’t have to point it to anything. That was a nice surprise.
This is where things usually get messy though:
- Weird JSON shapes
- Leaky abstractions
- “Just pass it through for now” shortcuts that turn into tech debt
The earlier structure paid off. The integration stayed contained, the domain model stayed stable, and the rest of the system didn’t care. That wasn’t AI magic — it was boundaries doing their job. Augment just didn’t undermine them.
Example: Mapping external data to internal model
def map_open_meteo(response: dict) -> Weather:
    return Weather(
        temperature=response["current"]["temperature_2m"],
        wind_speed=response["current"]["wind_speed_10m"],
        humidity=response["current"]["relative_humidity_2m"],
        rain=response["current"].get("rain", 0),
    )
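For completeness, here is roughly what the service side of that integration might look like: fetch current conditions from Open-Meteo's forecast endpoint and hand the raw JSON to the mapper above. The use of httpx and the example coordinates are assumptions; the query parameters simply request the fields map_open_meteo reads.

import httpx

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def get_current_weather(latitude: float = 52.52, longitude: float = 13.41) -> Weather:
    # Ask Open-Meteo for exactly the current-conditions fields the mapper expects.
    # Coordinates here are illustrative defaults, not the app's real location.
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "current": "temperature_2m,relative_humidity_2m,rain,wind_speed_10m",
    }
    response = httpx.get(OPEN_METEO_URL, params=params, timeout=10)
    response.raise_for_status()
    return map_open_meteo(response.json())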
Phase 5 — Change pressure (the important bit)
This was the most revealing phase. I deliberately introduced:
- Overlapping rules
- Ambiguous conditions
- A third “MAYBE” state
- A forced rename (laundry → drying)
This is where most AI tools fall over. What I found:
- Augment is very good at mechanical refactors
- It reliably propagates changes across layers
- It does not decide semantics or precedence for you
That last point is worth sitting with. Augment won’t tell you which rule should win when two conditions overlap, or what “MAYBE” should actually mean in your domain. That’s not a flaw — that’s exactly where a human should still be in charge.
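One hypothetical way to keep that human decision visible is to encode the precedence as an explicit, ordered list rather than a pile of ifs. The ordering below is the judgment call Augment won't make for you; the conditions and thresholds are illustrative (reusing the names from the earlier rule), not the repo's actual rules.

from typing import Callable

# Ordered: hard "NO" conditions first, then "MAYBE", so overlapping rules
# can't silently change the outcome. The ordering itself is a human decision.
BIKING_RULES: list[tuple[Callable[[Weather], bool], BikingAdvice]] = [
    (lambda w: w.heavy_rain, BikingAdvice.NO),
    (lambda w: w.humidity > HUMIDITY_THRESHOLD, BikingAdvice.NO),
    (lambda w: w.wind_speed > WIND_THRESHOLD, BikingAdvice.MAYBE),
]


def evaluate_biking(weather: Weather) -> BikingAdvice:
    for condition, advice in BIKING_RULES:
        if condition(weather):
            return advice
    return BikingAdvice.YES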
Example: Rename propagated safely
- class LaundryAdvice(Enum):
+ class DryingAdvice(Enum):
      OK = "ok"
      RISKY = "risky"

- advice.laundry
+ advice.drying
Phase 6 — Tests and trust
The final phase was about confidence. I added focused backend tests around the advice logic, then intentionally broke things.
At one point I told Augment: “I’ve updated the rules, now the tests are failing — can you fix them?” It didn’t just blindly update the tests to pass. It spotted which rule had changed, understood how that affected the expected behaviour, and updated the tests to match. It also explained what it had changed and why.
With tests in place:
- Failures were obvious
- Fixes were safer
- Augment became more useful, not less
That last point surprised me a bit. The more structure and tests that existed, the more confidently Augment could operate — it had enough context to understand what “correct” actually meant.
AI without tests feels risky. AI with tests feels like a reasonable engineering choice.
Example: Behaviour‑focused test
def test_biking_maybe_when_windy_but_dry():
    weather = Weather(wind_speed=35, rain=0, humidity=40)
    assert evaluate_biking(weather) == BikingAdvice.MAYBE
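For the overlapping rules from Phase 5, a similar test can pin the precedence decision down. This one asserts, illustratively, that a hard "NO" beats the windy "MAYBE"; the values are made up.

def test_biking_no_wins_over_maybe_when_raining_and_windy():
    # Precedence is a human decision; the test records it explicitly.
    weather = Weather(wind_speed=35, rain=8, humidity=40)
    assert evaluate_biking(weather) == BikingAdvice.NO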
So, is Augment worth it?
After my evaluation, my honest take is: yes — but only in the right context.
Most of the time I’m working across multiple projects, and most of those codebases are relatively small. In that world, GitHub Copilot is usually more than enough. It’s quick, lightweight, and fits neatly into day-to-day work without much setup or mental overhead. For small changes, scripts, and one-off tasks, it gets out of the way and does its job.
If you want something that can take on larger chunks of work more autonomously — “here’s a feature, go build it” — Claude Code is worth looking at. It behaves more like an agent than an assistant, and that’s useful when you’re happy to delegate a task and review the outcome.
Augment sits in a different spot. It’s not trying to take over the task, and it’s not especially well suited to lots of small scripts or short-lived projects. Where it shines is in larger, longer-lived codebases where understanding structure, dependencies, and knock-on effects actually matters. Changing a model, tracing impact across layers, renaming a concept cleanly — that’s where it earns its keep.
It works best if:
- You spend most of your time in one or two large codebases
- Changes regularly touch multiple layers of the system
- There’s already some structure and tests in place
It’s probably not the right tool if:
- You’re constantly context-switching between small projects
- A lot of your work is scripting or ad-hoc automation
- You want an AI to just “take over” and do everything
The bigger lesson from this exercise is that the tool is only as useful as the codebase you point it at. Get the structure right first, pick the tool that matches how you actually work, and you’ll get far more value than trying to force one tool to fit every situation.