AI agents can generate code.
That part is becoming obvious.
They can create components, wire APIs, refactor modules, add tests, open pull requests, and implement features from written specifications.
This changes software development.
But I think the most interesting change is not simply that code becomes cheaper.
The more important shift is this:
AI reduces the cost of implementation, but it does not remove the cost of judgement.
And in many cases, it makes that cost more visible.
The first implementation becomes cheap.
The refinement loop becomes the real work.
The Promise: Specs Become Pull Requests
The attractive idea is simple.
You write a specification. An AI agent reads it. The agent changes the codebase. A pull request appears. You review it. The project moves forward.
This is the promise behind many AI-native development workflows.
It is also one of the ideas behind my own experiments with spec-driven coding and AI-assisted project automation.
At first, it feels like a huge unlock.
Instead of manually implementing every detail, you can describe what you want. Instead of spending hours on boilerplate, you can delegate. Instead of treating every small feature as a manual coding session, you can imagine a factory of changes.
Specification in. Pull request out.
But after using this style of development, I keep noticing a more complicated reality.
The agent implements the task.
Then you open the product.
And the real work begins.
The Reality: The Feature Exists, But Something Feels Wrong
An AI agent can implement a feature that is technically correct.
The code compiles. The tests pass. The page renders. The API is connected. The pull request looks reasonable.
But when you interact with the result, you start noticing small flaws everywhere.
A button should have been a link.
A menu becomes overloaded.
A long value breaks the layout.
A filter does not persist.
A navigation path feels awkward.
An empty state is missing.
A label is technically accurate but visually heavy.
A new feature feels bolted onto the product rather than integrated into it.
None of these problems is catastrophic on its own.
But together they create the real cost of AI-assisted development.
The agent produced something that was almost right.
And “almost right” is expensive.
The Specification Is Never Complete
The common answer is:
Write better specifications.
That is true, but incomplete.
Better specifications help. Clearer acceptance criteria help. More examples help. More context helps.
But a specification is not a complete description of the desired product.
A specification is a compressed expression of intent.
When a human developer reads a task, they bring a lot of invisible context with them:
- experience;
- taste;
- common sense;
- product memory;
- understanding of existing patterns;
- awareness of user expectations;
- a sense of what would be embarrassing to deliver;
- a sense of what is technically correct but wrong for the product.
This is why a human can often take an imperfect ticket and still produce something reasonable.
They fill in the gaps.
The task says:
Add a popover showing who changed the record and when.
A human developer may ask:
Should this be compact? Should it use explicit labels? Is this a high-fidelity design or a rough sketch? Will the email be long? Should the timestamp be formatted for the user’s locale? Is this metadata important enough to take two lines? Does this match other popovers in the product?
Some of these questions may be explicit. Many are not.
The human still brings them into the work.
An AI agent also fills in gaps.
But it fills them differently.
It uses what is explicit in the task, what is available in context, and what it can infer from patterns.
Sometimes this works very well.
Sometimes the result is plausible, but wrong.
The agent did not fail because it could not code.
It failed because the task depended on unwritten judgement.
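To make the gap concrete, here is a minimal sketch of two readings of the popover task. Everything in it is hypothetical: the function names, the 24-character limit, and the timestamp format are assumptions for illustration, not a recommended design. The first function does exactly what the task says. The second answers two of the unwritten questions (long emails, compactness) and still leaves others open, such as locale-aware formatting.

```python
from datetime import datetime

# Hypothetical limit; the real value depends on the popover's width.
MAX_EMAIL_CHARS = 24


def truncate_middle(value: str, max_chars: int) -> str:
    """Shorten a long value while keeping its start and end recognizable."""
    if len(value) <= max_chars:
        return value
    keep = max_chars - 1  # leave room for the ellipsis character
    head = (keep + 1) // 2
    tail = keep // 2
    return value[:head] + "…" + value[len(value) - tail:]


def literal_popover_text(email: str, changed_at: datetime) -> str:
    """The literal reading of the task: who and when, nothing more."""
    return f"Changed by {email} at {changed_at.isoformat()}"


def considered_popover_text(email: str, changed_at: datetime) -> str:
    """One possible reading that handles long emails and stays compact."""
    who = truncate_middle(email, MAX_EMAIL_CHARS)
    when = changed_at.strftime("%Y-%m-%d %H:%M")  # fixed format, not locale-aware
    return f"{who} · {when}"
```

Both functions satisfy the ticket. Only one of them survives a long email address, and nothing in the task text says which one was wanted.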
AI Exposes the Hidden Human Layer
A lot of software development was never fully written down.
It lived in people.
In their taste. In their habits. In their memory of previous decisions. In their sense of what belongs in the product. In their ability to notice when something technically works but does not feel right.
AI agents expose this hidden layer.
When the agent misses something, we often say:
I should have written a better prompt.
Sometimes that is true.
But sometimes the deeper truth is:
I did not realize this requirement existed until I saw the wrong result.
Taste is often reactive.
You do not always know every detail of what you want upfront.
You discover it when something feels off.
You see the generated UI and immediately think:
No, not like this.
This menu should not grow again. This should be a secondary link. This branch name needs truncation. This filter state should survive navigation. This is too visually loud. This feels like a patch, not a product decision.
Before seeing the wrong version, you may not have written any of that into the spec.
It felt obvious.
But it was only obvious to the human carrying the product in their head.
“Almost Right” Creates a Long Tail of Corrections
AI makes it easier to produce a first draft.
But the first draft is not the product.
After the first implementation, the refinement loop begins:
- change this button into a link;
- move this action out of the top-level menu;
- preserve this state in the URL;
- add an empty state;
- handle long values;
- make this consistent with the rest of the product;
- simplify the layout;
- update the copy;
- add missing edge cases;
- explain the decision in the PR;
- add screenshots;
- adjust the tests;
- rewrite the specification.
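Each item on this list is small, but each is a real piece of work. As an illustration, "preserve this state in the URL" might look roughly like the sketch below. The filter names in `KNOWN_FILTERS` and both function names are assumptions made up for the example; only `urlencode` and `parse_qs` come from Python's standard library.

```python
from __future__ import annotations

from urllib.parse import parse_qs, urlencode

# Hypothetical filter keys; a real product would define its own.
KNOWN_FILTERS = ("status", "owner")


def filters_to_query(filters: dict[str, list[str]]) -> str:
    """Serialize active filters into a query string, skipping empty ones."""
    active = {key: values for key, values in sorted(filters.items()) if values}
    return urlencode(active, doseq=True)


def query_to_filters(query: str) -> dict[str, list[str]]:
    """Restore filters from a query string, ignoring keys we do not know."""
    parsed = parse_qs(query)
    return {key: parsed[key] for key in KNOWN_FILTERS if key in parsed}
```

The round trip is the point: if the filters survive being written into the URL and read back, they also survive navigation, reloads, and shared links.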
This is a new kind of work.
The developer is no longer only writing code.
The developer is debugging interpretation.
Sometimes they are debugging taste.
The code may compile. The tests may pass. The feature may exist.
But the result still needs to be corrected because the agent misunderstood the implicit standard of “good”.
This is why “almost right” matters.
It looks like progress.
But it creates a long tail of small corrections that require human judgement.
AI Has No Taste Unless Taste Is Operationalized
AI does not know what you care about unless the workflow makes that care explicit.
This does not mean AI cannot help with taste.
It can.
It can critique. It can compare alternatives. It can review a UI flow. It can ask good questions. It can identify inconsistencies. It can help generate refinement tasks. It can act as a product reviewer, UX reviewer, or architecture reviewer.
But this needs to be part of the workflow.
If the workflow is only:
implement this task
then the result will often reflect only the visible task.
A better workflow asks the agent to do more than produce code.
For example:
- Implement the feature.
- Explain the assumptions made.
- Describe what changed in the user experience.
- Identify possible product risks.
- Review the change as a senior engineer.
- Review the change as a UX critic.
- Identify taste debt.
- Suggest follow-up refinements.
- Ask the human which trade-offs are acceptable.
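One simple way to make this part of the workflow is to attach the checklist to every task before it reaches the agent. The sketch below assumes a plain-text brief. `REVIEW_STEPS` repeats the list above, and `build_agent_brief` is a hypothetical helper, not part of any particular agent tool.

```python
REVIEW_STEPS = [
    "Implement the feature.",
    "Explain the assumptions made.",
    "Describe what changed in the user experience.",
    "Identify possible product risks.",
    "Review the change as a senior engineer.",
    "Review the change as a UX critic.",
    "Identify taste debt.",
    "Suggest follow-up refinements.",
    "Ask the human which trade-offs are acceptable.",
]


def build_agent_brief(task: str) -> str:
    """Wrap a bare task in the structured steps the agent should follow."""
    lines = [f"Task: {task}", "", "Do all of the following, in order:"]
    lines += [f"{number}. {step}" for number, step in enumerate(REVIEW_STEPS, start=1)]
    return "\n".join(lines)
```

Calling `build_agent_brief("Add a popover showing who changed the record and when.")` produces a brief in which the implementation is only the first of nine steps.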
This turns AI from a blind executor into a participant in a structured product conversation.
But the human remains central.
The human still decides what matters.
Code Review Is Not Enough
In AI-native development, it is tempting to rely on automated review.
Types. Tests. Linting. Security checks. Complexity analysis. Dependency scanning. Performance warnings.
All of this is useful.
But code review was never only about finding bugs.
At its best, code review is a stress test of decisions.
Why this approach? Why this abstraction? Why this interaction? Why this visual structure? Why this data flow? Why this trade-off?
A good reviewer does not only ask whether the code works.
They ask whether the change deserves to become part of the system.
That question is technical, but not only technical.
It is architectural. It is product-oriented. It is sometimes aesthetic.
Automated review can detect code debt.
But it rarely detects taste debt.
Taste debt looks like this:
- too many top-level actions;
- inconsistent button/link semantics;
- visual overflow;
- unclear hierarchy;
- missing feedback after user actions;
- overloaded menus;
- flows that technically work but feel awkward;
- features that feel attached rather than integrated.
This debt matters.
It makes the product harder to understand. It makes future work harder. It reduces trust. It makes the project less pleasant to return to.
AI can create taste debt very quickly.
Not because it is bad at coding, but because it is very good at adding things.
And adding things is not the same as improving the product.
A Project Is Not a File Tree
Another mistake is treating a software project as just a repository of files.
A project is not a file tree.
A project is an environment.
When you work deeply in a project, you develop orientation inside it.
You know where things are. You know which patterns exist. You know what feels natural. You know which parts are dangerous. You know what belongs. You know which affordances already exist.
It almost feels physical.
In one project, you reach for a deployment indicator and it is there. In another project, you reach for the same affordance and find an empty shelf.
This matters because every change modifies that environment.
A pull request is not just a code diff.
A pull request changes the project’s reality.
It can add confidence. It can reduce friction. It can create a new affordance. It can improve orientation. Or it can introduce confusion.
AI agents can generate diffs.
But someone still has to ask:
Did this change make the project world better?
That question cannot be answered by the diff alone.
Freelance Briefs Have the Same Problem
This is not only about software.
The same issue appears when people imagine replacing freelance work with AI agents.
The experiment sounds simple:
Take a task that would normally go to a freelancer. Give it to an AI agent. See whether the agent can complete it.
But freelance tasks are usually written for humans.
They rely on shared cultural assumptions.
When you ask a human for an eight-minute video, you do not specify every detail:
narrative structure, pacing, continuity, scene coherence, visual variety, topic relevance, intro, development, ending.
The human brings those assumptions.
An AI agent may technically produce a video, but miss the underlying expectation.
You ask for an eight-minute video.
You get a one-minute clip with eight cats.
Or you ask for a 3D room.
A human understands that the room should remain stable.
An AI may produce something where the furniture shifts, the walls change, objects appear and disappear, and the result still resembles “a room” at the surface level.
The output is plausible.
But it does not preserve the intended world.
Again, the problem is not simply execution.
The problem is hidden context.
The Human Role Is Changing, Not Disappearing
AI-native development changes the role of the human.
But it does not remove the human.
The human role shifts from:
writing every line
towards:
directing, reviewing, refining, judging, integrating.
Less typing.
More judgement.
Less mechanical implementation.
More responsibility for intent.
This is not necessarily easier.
In some ways, it is harder.
Because now the human must inspect outputs that look plausible. They must detect subtle wrongness. They must articulate taste. They must turn vague dissatisfaction into actionable refinement. They must decide what is good enough to become part of the product.
The developer becomes less like a typist and more like an editor, reviewer, architect, product thinker, and director.
What AI-Native Workflows Need
The naive workflow is:
task → agent → pull request
The better workflow is:
intent → specification → implementation → reality diff → review packet → taste review → human judgement → refinement → merge
The agent should not only produce code.
It should produce a reviewable change.
For example, every AI-generated pull request could include a review packet:
- What changed: a plain-language summary of the change.
- Why it changed: the original intent.
- Product impact: how the user experience is different now.
- New affordances: what became easier, clearer, or safer.
- Assumptions made: where the agent filled in gaps.
- Possible regressions: what may have become worse.
- Design decisions: why the implementation took this shape.
- Alternatives considered: what could have been done differently.
- Taste risks: anything that may feel awkward, ugly, overloaded, or inconsistent.
- Questions for the human reviewer: where judgement is needed.
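The packet can also be treated as a data structure rather than free text, so a pipeline can notice when the agent skipped a section. The sketch below is one possible shape. The class, its field names, and the rule that an empty section counts as missing are all assumptions for illustration. A real workflow might accept an explicit "none" for sections such as new affordances.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class ReviewPacket:
    """Hypothetical container for the review packet sections listed above."""

    what_changed: str = ""
    why_it_changed: str = ""
    product_impact: str = ""
    new_affordances: list[str] = field(default_factory=list)
    assumptions_made: list[str] = field(default_factory=list)
    possible_regressions: list[str] = field(default_factory=list)
    design_decisions: list[str] = field(default_factory=list)
    alternatives_considered: list[str] = field(default_factory=list)
    taste_risks: list[str] = field(default_factory=list)
    questions_for_reviewer: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections the agent left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_reviewable(self) -> bool:
        """A packet is reviewable only when every section has content."""
        return not self.missing_sections()
```

A check like `is_reviewable()` cannot judge whether the answers are good. It only makes sure the questions were asked before the human starts reviewing.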
This is the kind of structure AI-native development needs.
Not only more generation.
Better loops for judgement.
The Real Bottleneck
The current AI coding conversation often focuses on productivity.
How much faster can we generate code? How many tasks can agents complete? How many pull requests can they open? How much implementation work can be automated?
These are useful questions.
But they are not the whole story.
The real bottleneck is moving.
It is moving from typing code to deciding what “good” means.
AI can execute. AI can generate. AI can approximate. AI can suggest. AI can critique.
But someone still has to care.
Someone has to say:
This works, but it is ugly. This satisfies the ticket, but misses the intention. This is clever, but not maintainable. This is fast, but not coherent. This looks done, but it is not good enough. This change should not become part of the product.
That someone is still human.
Conclusion
AI coding is powerful.
It can reduce implementation cost dramatically.
But implementation was never the whole job.
Good software depends on the world around the task:
context, taste, standards, constraints, review culture, product memory, human judgement.
If we give AI agents shallow tasks and expect them to behave like experienced human collaborators, we will often get outputs that are technically plausible but fundamentally incomplete.
The future of AI-native development is not about replacing humans with agents.
It is building workflows where agents generate, critique, explain, and refine — while humans remain the carriers of intent, taste, and responsibility.
The best teams will not be the ones that generate the most code.
They will be the ones that build the best loops for judgement.
Because the hardest part of work is not always doing the task.
Sometimes the hardest part is knowing what “good” means.