AI · Future of Work · 30 May 2026

Review Is the New Bottleneck — What AI Actually Changed in Software Engineering

By Yavuz Bogazci

The productivity decks are right about one number and wrong about the system around it. AI compressed the part of software that was already the cheapest. The discipline that ships defendable, operable, auditable code did not get cheaper — it got more visible. For the executive deciding what to build, what to buy, what to acquire, and what to staff, this is what the dashboard is missing in May 2026.

The number in every consulting deck this year is N× faster: 30%, 55%, 100×. All of them describe the typing step. None of them describe the work. Software engineering, broken into its actual disciplines (requirements, architecture, code, testing, security, deployment, operations, observability, documentation, evolution, accountability), has a typing step that is roughly ten to twenty percent of total effort. The other eighty to ninety percent did not change category. It changed visibility. The cost of skipping it shifted from invisible to immediate.

What Vibe-Coding Actually Is

Andrej Karpathy named the trend in February 2025: “fully give in to the vibes, embrace exponentials, and forget that the code even exists.” Collins Dictionary made vibe coding its 2025 Word of the Year.

The credible coding-agent tools in May 2026 are Claude Code, OpenAI Codex, GitHub Copilot, and Amazon Kiro — AWS’s spec-driven agentic IDE, launched July 2025, that prompts the agent to write a spec first and only generates code after the human approves. The IDE that ends vibe coding, sold to people who paid for vibe coding twelve months ago.

In honest use, vibe-coding compresses the typing step. First drafts in minutes. Sprints from two-to-three weeks to 48 hours. Real — and a partial story.

What Vibe-Coding Does Not Replace

What professional software engineering still requires when an agent is producing the code:

Requirements engineering. If the spec lives only in the prompt, you do not have requirements — you have improv.
Architecture with a rationale. Three valid answers: the team can run it, the strategy needs it, legacy forces it. “The model thought it was a nice idea” is not on the list.
Testing, security, CI/CD, observability, documentation. Coverage is now an AI-output verification problem. The prompt is not the documentation.
Legal, data protection, accessibility. EU AI Act fit, personal-data scope, WCAG-grade accessibility tested with people who actually rely on assistive technology.
Liability. Whose name is on the production deploy?

The discipline of engineering did not get cheaper when the typing got faster. The cost just moved.

The Brownfield Reality No One Puts on the Slide

The 55%, the 100×, “the senior who hasn’t written code since December” are greenfield numbers. In a grown brownfield enterprise — twenty years of acquisitions, a partially-documented monolith — the realistic productivity gain from a coding assistant is more like ten to twenty percent. The senior who has been on that system since 2010 can be woken at 02:00 and tell you which of three places the bug almost certainly is. The LLM cannot. The half-fixed-in-2014 race condition, the comment in German that says “DO NOT TOUCH”, the integration constraint nobody documented because everyone already knew — that knowledge lives in people, not in the repo.

Very large code bases cannot be fed to an agent. “Here are five million lines, understand them and modify” does not fit the context window — and even when it does, the agent has no model of the call graph or which modules are load-bearing. The practical answer is code graphs: symbolic indexing, dependency-graph extraction, static-analysis pipelines that pre-digest the codebase into chunks. Real engineering work. No prompt replaces it.
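To make "code graph" concrete, here is a minimal sketch of the smallest possible version: extract an import graph from source text and rank modules by how many others depend on them. It assumes a Python codebase and uses only the standard-library ast module. The four-module repo, the module names, and the use of fan-in as a proxy for "load-bearing" are illustrative assumptions, not a description of any particular product.

```python
import ast


def import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the set of in-repo modules it imports."""
    graph: dict[str, set[str]] = {name: set() for name in sources}
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                root = target.split(".")[0]
                # Keep only edges that point at modules inside the repo.
                if root in sources and root != name:
                    graph[name].add(root)
    return graph


def fan_in(graph: dict[str, set[str]]) -> dict[str, int]:
    """Count how many modules depend on each module."""
    counts = {name: 0 for name in graph}
    for deps in graph.values():
        for dep in deps:
            counts[dep] += 1
    return counts


# Hypothetical repo: module name -> source text.
SOURCES = {
    "billing": "import ledger\nfrom auth import check\n",
    "reports": "import ledger\nimport json\n",
    "ledger": "import auth\n",
    "auth": "import hashlib\n",
}

graph = import_graph(SOURCES)
# Highest fan-in first; ties broken alphabetically.
load_bearing = sorted(fan_in(graph).items(), key=lambda kv: (-kv[1], kv[0]))
```

A real pipeline would add call-level edges, cross-language parsing, and ownership data. The point of the sketch is only that the ranking is computed deterministically before any agent sees a chunk of the code.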

The Quiet Casualty: Low-Code, Not Engineering

Everyone asks whether AI will replace software engineers. Almost no one asks the more uncomfortable question: what happens to low-code and no-code?

Low-code was built on a single economic premise: professional developers are scarce and expensive, so let’s abstract them away. That premise powered a remarkable run. Gartner predicted that 70% of new enterprise apps would utilize low-code/no-code by 2025, and forecasts the market to exceed $30 billion in 2026. Platforms like Power Apps, OutSystems, Mendix, and ServiceNow became the answer to the developer shortage.

Agentic AI doesn’t compete with that answer. It dissolves the question.

When an agent can generate real, tested, portable production code from a specification in hours, the drag-and-drop canvas stops being an accelerator and starts being a ceiling. You traded flexibility, portability, and scale for speed — and now pro-code is just as fast. The abstraction was never the product. It was a workaround for scarcity, and the scarcity of code production is gone.

Remember how we got here. Microsoft built the entire citizen developer movement on an IDC prediction it repeated in every keynote: 500 million new applications in five years, against a shortfall of 4 million developers — Microsoft even branded it the “App Gap Challenge.” Citizen developers were the answer, and every enterprise was told to unleash them. Six years later, at Microsoft’s own Power Platform conference, Charles Lamanna — the executive who built that business — declared from the stage: “Low code as we know it is dead”, repositioning the platform around agents and Copilot. When the company that wrote the citizen developer gospel starts rewriting it, the shift isn’t coming. It has already happened.

The other vendors know it too. Gartner is now publishing reports titled “Why AI Won’t Replace the Need for Low-Code Application Platforms” — distributed, tellingly, by OutSystems. When the defense is being run by the defendants, pay attention. ServiceNow spent Knowledge 2026 repositioning itself away from being the most capable development environment toward being the most governable platform — shipping its Build Agent into third-party IDEs and opening its Action Fabric to external agents. Read that carefully: the visual development layer is no longer the value proposition. The governed runtime, the data model, the workflow context — that’s what survives.

And here’s where this essay’s core argument returns with force. In professional engineering, AI shifted the bottleneck from writing to reviewing — painful, but survivable, because senior engineers can review. Citizen development has no equivalent. A business user prompting an agent can review exactly one dimension of the result: functionality. Does the screen look right, does the button do what I asked. That’s acceptance testing, not review.

Everything that actually determines whether software survives contact with production remains invisible to them: the architecture decisions that decide whether the system scales or collapses at 10x load. The security posture — injection paths, permission models, exposed secrets. Deployment and operations — how it ships, how it’s monitored, how it fails, how it recovers. Dependency hygiene — the agent happily pulling in libraries deprecated two years ago, with known CVEs, because they dominated its training data. A citizen developer doesn’t miss these issues. They can’t see them. There is no prompt for judgment you don’t possess.

Generation without the ability to review isn’t democratization. It’s unreviewed liability at scale — shipped with confidence, because it looks like it works.

So no — agentic AI won’t replace professional software engineering. Engineering is where the judgment lives, and judgment is the one thing that didn’t get cheaper. What agentic AI actually replaces is the layer we invented to avoid engineering. Low-code will survive as a governed execution substrate. As a development paradigm, its premise expired the day code stopped being scarce.

Architecture Is Still a Conscious Decision

The hardest part of an enterprise software decision is not what to type; it is what to choose. The implicit knowledge of what exists, who owns it, who signs for what, lives in the seniors who have been there for a decade. The decision is still the human’s. AI proposes, validates, challenges; it does not decide. The Solution Architect role is where the highest-leverage cognitive work now happens. The same is true for UI and UX: generative agents produce a recognisable visual signature, so a designer is required for a product that needs to look like itself.

The McKinsey Lesson

On March 9, 2026, the red-team firm CodeWall hacked McKinsey’s internal AI assistant Lilli end-to-end, with read and write access, in two hours. The route in was banal: of more than 200 publicly documented API endpoints, 22 required no authentication at all, and JSON field names were concatenated into SQL queries without sanitisation. CodeWall reported access to 46.5M chat messages, 728k files, 57k user accounts, and the 95 system prompts that controlled Lilli’s behaviour. Those prompts were stored in the same database as the user data, so an attacker with write access could quietly rewrite how the assistant answered every consultant. McKinsey patched within 24 hours, with no client-data access. Worth reading: CodeWall’s report; The Register.

Lilli was internal, not client-facing — fair caveat. The underlying point stands: the failure mode itself is exactly the one vibe-coded production systems produce by default. The API surface grew faster than the security review; the prompt store and the user data shared a perimeter.
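The class of bug is easy to show in miniature. The sketch below is not Lilli’s code. It is an illustrative example, using Python’s standard-library sqlite3, of what a reviewer should expect to see instead of string concatenation: field names checked against an allowlist, values passed as bound parameters. The table, the columns, and the build_filter helper are hypothetical.

```python
import sqlite3

# Hypothetical allowlist: the only column names a caller may filter on.
ALLOWED_FIELDS = {"author", "channel"}


def build_filter(filters: dict[str, str]) -> tuple[str, list[str]]:
    """Build a parameterised query from caller-supplied JSON-style filters."""
    clauses, params = [], []
    for field, value in filters.items():
        # Identifiers cannot be bound as parameters, so they are allowlisted.
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown filter field: {field!r}")
        clauses.append(f"{field} = ?")
        params.append(value)  # values travel as bound parameters, never as SQL text
    where = " AND ".join(clauses) or "1 = 1"
    return f"SELECT id, body FROM messages WHERE {where}", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, author TEXT, channel TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (author, channel, body) VALUES (?, ?, ?)",
    [("ana", "ops", "hello"), ("ben", "ops", "secret")],
)

sql, params = build_filter({"author": "ana"})
rows = conn.execute(sql, params).fetchall()

# A field name carrying SQL is rejected before any query is built.
try:
    build_filter({"author = 'x' OR 1=1 --": "ignored"})
    rejected = False
except ValueError:
    rejected = True
```

Ten lines of discipline, and exactly the kind of ten lines that disappears when nobody reviews what the agent generated.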

BCG booked $3.6 billion — 25% of 2025 revenue — from AI consulting. Both firms have published software-engineering plays this year. Their core business is advice and transformation, not building and operating production software systems. Ask which production system they themselves built and operate. The silences are the procurement filter.

In the Same Week — McKinsey’s Rewiring Paper

While I was finishing this post, McKinsey published Rewiring software delivery for the agentic era — the closest thing to a public roadmap for the system this piece argues for: 24-hour sprints, multi-agent workflows over shared knowledge graphs, “two-shift digital factory”, −60% time and −60% team. Direction of travel: the same. Four lines belong next to McKinsey’s charts:

Brownfield reality. −60% is greenfield. Reading it as a universal benchmark queues up the next pilot-purgatory wave.
Review bottleneck. Eliminating human handoffs is right. Handoffs were also verification checkpoints. The cost moves; it does not vanish.
Knowledge-graph build. An engineering programme measured in months, not in prompts. Naming the need is not the same as shipping the system.
Liability column. Whose name is on the production deploy when the agent’s spec was wrong?

Twelve weeks before this paper was published, McKinsey’s own Lilli was hacked end-to-end in two hours via 22 unauthenticated endpoints. Buy the framework from someone who has shipped one — not from someone whose own internal chatbot was open to the public internet ninety days ago.

The New Bottleneck — Review

Once typing is no longer the bottleneck, review is. Hundreds of thousands of lines generated overnight must be read. Thousands of test cases need checking against the failure modes that matter. Dozens of library updates per merge need confirming as compatible and free of known CVEs. Every spec the agent produces contains a new architectural decision that must survive a steerco minute.

And review is only as honest as the disciplines beneath it: requirements engineering, architecture, coding patterns, testing, security, DevOps, observability, documentation, run. Has the reviewer kept command of all nine, or has the agent quietly taken each one in turn? Do they know what was built? Do they trust it? Would they sign for it?

Two paradoxes follow: reviewing AI output is harder than producing the original — the reviewer must understand the domain more deeply than the original author would have needed to. And the better the AI, the worse the human oversight — at 95% accuracy, people stop reading carefully. The human in the loop becomes the human asleep at the wheel.

Generation is cheap now. Review is the work. Review-throughput per senior is the right capacity metric in 2026.
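A back-of-envelope version of that metric, as a sketch: divide what the agents generate per day by what one senior can honestly review per day, and watch what happens to the queue when the bench is too small. Every number below is a hypothetical placeholder, not a benchmark, and lines of code is a deliberately crude unit.

```python
import math


def seniors_needed(generated_per_day: int, review_rate_per_senior: int) -> int:
    """Smallest senior bench that keeps up with daily generation."""
    return math.ceil(generated_per_day / review_rate_per_senior)


def backlog_after(days: int, generated_per_day: int, seniors: int,
                  review_rate_per_senior: int) -> int:
    """Unreviewed lines left in the queue after a number of working days."""
    backlog = 0
    for _ in range(days):
        reviewed = seniors * review_rate_per_senior
        backlog = max(0, backlog + generated_per_day - reviewed)
    return backlog


# Hypothetical inputs: 20,000 generated lines a day, 1,500 reviewable per senior.
needed = seniors_needed(20_000, 1_500)
queue_with_ten = backlog_after(5, 20_000, 10, 1_500)
queue_with_needed = backlog_after(5, 20_000, needed, 1_500)
```

With these placeholder inputs a bench of ten seniors ends the week with a growing unreviewed queue, which is the capacity problem the dashboard does not show.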

The Spotify Test — What Does “Done” Mean?

On Spotify’s Q4 2025 earnings call, co-CEO Gustav Söderström said the company’s best developers have not written a single line of code since December. They use an internal system called Honk that integrates Claude Code. As reported by TechCrunch: an engineer on the morning commute opens Slack on the phone, tells Claude to fix a bug or add a feature to the iOS app; Claude pushes a new build back into Slack; the engineer merges to production before reaching the office.

The headline is true and incomplete. Customers expect maximum stability: an overnight release that breaks playback on Monday costs more than the productivity gain that produced it. The senior who has not written a line of code since December is still the person who signs that the output does what the spec promised, that it did not break the playlist engine, that it does not leak data, that it is safe to ship. The keyboard moved. The accountability did not. Senior engineering work changed from production to verification, which needs more juniors in the pipeline, not fewer.

Bringing the Developers Along

The change-management cost of moving an engineering organisation from writing to reviewing is the most under-budgeted variable in the transition. Developers are passionate — they love writing code. Telling a fifteen-year senior that their job is now to read what an LLM produced and sign for it is a professional identity shift, and done poorly it triggers the dynamic that has killed every off-shoring programme of the last twenty years: quiet attrition of the best people, who join the next company before the transformation team notices. What works is participation in designing the new workflow — which gates, which guardrails, which review queue, which kill-switches. There is no version of the AI-augmented SDLC that succeeds with a hostile senior bench.

AI-Shoring — A Different Staffing Geography

For two decades the European cost-out playbook was the same: hire an army in Bangalore or Sofia, accept the cultural distance, defend the rate-card savings. The bottleneck moved. The lever moved with it.

AI-Shoring — onshore + small + senior + AI — increasingly beats offshore + large + mixed-seniority on durable delivery. Thirty engineers become ten, not zero. A realistic pattern is 10 onshore + 20 nearshore + AI rather than 100 offshore; the 10 are where the architectural decisions live. Sprints compress from three weeks to one. The customer becomes the bottleneck.

Forward-Deployed Is a Formation, Not a Hire

The labs have started hiring for the last mile. OpenAI stood up a roughly $4 billion “Deployment Company”; Google, Anthropic, and Meta are hiring forward-deployed engineers — FDEs — by the hundreds, in the same quarter. The market has finally priced the thing this piece has been arguing: the model was never the bottleneck. But the popular picture of the FDE is already wrong in the way that matters. It imagines one person — an AI specialist, embedded in your operations, who makes the model work. That person does not exist. And not because good engineers are rare.

The reason is the same one that makes review the bottleneck. Reviewing an agent’s output is harder than producing it, because the reviewer has to understand each domain more deeply than the author needed to. And the agent does not produce in one domain. It produces across all of them at once — architecture, data, security, integration, the business logic itself. No single human holds that much depth across that much technology. The full-stack engineer who can vouch for all of it is a myth — not because nobody is senior enough, but because review-depth does not compress the way typing did. Breadth got cheap. Depth did not.

So the forward-deployed unit is not a person. It is a small senior cell, onsite. Depth is distributed — specialists senior enough to review the agent’s work in their own domain and catch what it quietly got wrong. Breadth is concentrated — one architect who integrates the parts, sequences the work, carries the liability, and translates between the business and the build. The specialists make the output safe. The architect makes the decision, and sits close enough to the business to be heard when the change is uncomfortable. Neither half works alone: depth without an integrator ships nine correct components that do not add up; breadth without depth is a confident reviewer asleep at the wheel.

This is what the AI-Shoring section was pointing at. The 10 onshore are not a smaller version of the offshore army. They are a different animal: senior, full-stack as a cell rather than a person, close to the business. The cost-out playbook sent typing offshore because typing was the bottleneck. When the bottleneck moves to review, distance becomes the most expensive thing in the system — and the formation moves back onsite, next to the people who own the process it is about to change.

The forward-deployed engineer is not a hire. It is a formation: breadth concentrated in one architect, review-depth distributed across the seniors — onsite, close enough to decide.

Custom Software and the M&A Window

Two macro consequences follow.

First, the appetite for custom solutions rises. A system genuinely tailored to one company’s workflow is now within budget. The signal is visible in the incumbents: Salesforce CEO Marc Benioff publicly claimed AI agents had replaced 4,000 customer-support workers; days later the company filed to permanently lay off 262 employees from its San Francisco HQ. Revenue still growing, but the posture has shifted. SaaS incumbents are using AI to compress their own cost base while simultaneously enabling the custom alternatives that compete with their core product.

Second — and this is the executive call almost nobody is making — the consolidation window is open now. The vibe-coding hype has driven a wave of “we don’t need software companies anymore” commentary. Most boards are listening. Many are quietly cancelling acquisitions of software firms on the assumption that target values are about to collapse. The reverse is true. A competent software firm — small, senior, pipeline-mature, AI-fluent — is worth more right now than it has been in a decade, because the market is mispricing it on the headline narrative. If a software boutique is on your strategy slide, the time is now, not when the market has corrected.

AI-FinOps — The Meter Becomes a Cost Object

Every AI-augmented system carries an operating cost the line item never had ten years ago: the meter that runs every time the model is called — priced by a third party, changeable at their discretion, invisible to the steerco until the invoice arrives. Four directions can move per-request cost by a third or more without warning:

Per-token rates can rise. Multi-model architectures are multi-billing architectures.
The tokenizer can change. Opus 4.7’s new tokenizer consumes up to 35% more tokens for the same input. Per-request cost up by a third with no rate change.
Usage patterns shift faster than budgets. Token-per-feature drifts upward weeks before finance notices.
Uncontrolled use in the dev environment is the most expensive variant. A single careless while true against a frontier model can burn a five-figure invoice over a weekend — the dashboard only registers it when the invoice arrives.
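The tokenizer effect is plain arithmetic, sketched below. The per-million-token rates and the request size are hypothetical placeholders, and the sketch assumes the inflation applies to input and output tokens alike; only the 35% figure comes from the text above.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost of one request; rates are prices per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def retokenize(tokens: int, inflation: float) -> int:
    """Token count for the same text under a tokenizer that emits more tokens."""
    return round(tokens * (1 + inflation))


# Hypothetical rate card, unchanged before and after the tokenizer switch.
IN_RATE, OUT_RATE = 5.0, 25.0

before = request_cost(10_000, 2_000, IN_RATE, OUT_RATE)
after = request_cost(retokenize(10_000, 0.35), retokenize(2_000, 0.35),
                     IN_RATE, OUT_RATE)
increase = after / before - 1  # relative cost increase per request
```

The rate card never moved, and the per-request cost still rose by the full inflation factor. A budget that tracks only the price list will not see it.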

For a €10M programme, €250–500k of tokens is a rounding error. For a €10,000 fixed-price engagement, €10,000 of tokens is the entire margin.

The operational answer is AI-FinOps — per-team quotas with hard ceilings, weekly tokens-per-feature publishing, model-routing owned by an architect, anomaly alerts at incident severity. Into the SDLC pipeline from the start, alongside the security and observability gates. And one discipline above all of it: stop, route, cache. The cheapest token is the one never spent — deterministic problems belong in deterministic code, not in an agentic loop. Tasks get routed to the cheapest model that changes the outcome, not reflexively to the frontier. Reusable context — system prompts, policies, standards — gets cached and read back at a steep discount. Three moves, no model change required, and together they routinely cut the bill by a third.
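A minimal sketch of how those controls can look in code, assuming a three-tier model line-up, a difficulty score assigned upstream, and a cached-context discount of 90%. All names and numbers are hypothetical; no vendor API is being described.

```python
from dataclasses import dataclass
from typing import Optional


class QuotaExceeded(RuntimeError):
    """Raised when a team would cross its hard token ceiling."""


@dataclass
class TeamQuota:
    ceiling_tokens: int
    used_tokens: int = 0

    def charge(self, tokens: int) -> None:
        # Hard ceiling: refuse the call instead of alerting after the fact.
        if self.used_tokens + tokens > self.ceiling_tokens:
            raise QuotaExceeded(f"{self.used_tokens + tokens} > {self.ceiling_tokens}")
        self.used_tokens += tokens


# Hypothetical tiers, cheapest first: (model name, capability level).
MODEL_TIERS = [("small", 1), ("mid", 2), ("frontier", 3)]


def route(task_difficulty: int) -> Optional[str]:
    """Stop for deterministic work, else pick the cheapest sufficient model."""
    if task_difficulty == 0:
        return None  # stop: this belongs in deterministic code, no model call
    for name, capability in MODEL_TIERS:
        if capability >= task_difficulty:
            return name
    return MODEL_TIERS[-1][0]


def billable_tokens(context_tokens: int, task_tokens: int, cache_hit: bool,
                    cached_rate: float = 0.1) -> int:
    """Reusable context is billed at a discount when it is read from cache."""
    context_cost = context_tokens * (cached_rate if cache_hit else 1.0)
    return round(context_cost + task_tokens)


quota = TeamQuota(ceiling_tokens=10_000)
cold = billable_tokens(8_000, 500, cache_hit=False)
warm = billable_tokens(8_000, 500, cache_hit=True)
quota.charge(cold)
quota.charge(warm)
```

The design choice that matters is that the quota raises rather than logs. An alert that fires after the spend is a report; a ceiling that refuses the call is a control.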

That was the 2025 conversation: cost control. The 2026 conversation is cost accounting — where does AI sit in the P&L? The question stopped being academic this spring. Two years ago, 31% of FinOps teams managed AI spend; today it is 98% — not because AI became a boardroom priority, but because the invoices arrived and nobody was ready. In June, the Linux Foundation stood up a Tokenomics Foundation to standardise how enterprise AI consumption is measured and billed — backed by Oracle, Google, Microsoft, SAP, JPMorganChase. Anthropic repriced its enterprise tier in April; GitHub moved Copilot to usage-based billing after agentic loops turned its flat-rate heavy users margin-negative. The meter is not going away. It has to be booked somewhere. And two wrong answers dominate.

The first wrong answer buries the agent in overhead. Tokens priced into the fully loaded day rate, next to the laptop and the licences — a surcharge nobody owns. The cost drifts invisibly, no project, no product, no customer can be held to account, and the €10,000 that was the entire margin never appears in the bid calculation that lost it. Burying inference in a generic hosting bucket is precisely how organisations discover their margin problem on the invoice.

The second wrong answer treats the agent as another employee. Its own cost centre, its own rate card, a line in the staffing plan. This model has momentum: the workforce planners have extended buy, build, borrow with bot; McKinsey calls agents a parallel workforce and floats the zero-FTE department; the vendors are shipping the billing units to match — per resolution, per agentic work unit. It sounds rigorous. It is a fiction. An employee has fixed capacity, fixed cost, linear utilisation. An agent has none of the three: its cost is variable, superlinear when loops run, and repriced at a third party’s discretion. A digital FTE with a fixed hourly rate is exactly the accounting construct that hides the cost explosion until the invoice arrives — the while true weekend, formalised.

The defensible answer splits the token bill three ways. Tokens that build reusable capability — agents, workflows, code graphs — are an investment, and behave like capitalised software. Tokens that run internal work are opex, budgeted per function and per workflow like any other operating cost. Tokens inside a delivery or a product are direct cost — part of the engagement’s cost basis, managed as gross margin. For a services business, that last line is the one that matters: the token spend of a project is a direct cost of that project — like a subcontractor with variable volume — not an overhead surcharge on top of the day rates. It belongs in the bid calculation, in its own line, next to the staffing model. A fixed-price offer without a token budget per work package is a fixed-price offer with an unpriced supplier in it.
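The three-way split, and what it does to a bid, can be sketched in a few lines. The ledger entries, the bucket labels, and the engagement figures are hypothetical and exist only to show the mechanics.

```python
from collections import defaultdict
from enum import Enum


class Bucket(Enum):
    INVESTMENT = "reusable capability, treated like capitalised software"
    OPEX = "internal work, budgeted per function"
    DIRECT = "delivery cost, managed as gross margin"


def split_bill(entries: list[tuple[str, Bucket, int]]) -> dict[Bucket, int]:
    """Total token spend (in euros) per accounting bucket."""
    totals: dict[Bucket, int] = defaultdict(int)
    for _label, bucket, euros in entries:
        totals[bucket] += euros
    return dict(totals)


def project_margin(price: int, staffing_cost: int, token_cost: int) -> int:
    """Fixed-price margin with tokens booked as a direct cost of the project."""
    return price - staffing_cost - token_cost


# Hypothetical monthly token ledger: (what it paid for, bucket, euros).
LEDGER = [
    ("code-graph build", Bucket.INVESTMENT, 4_000),
    ("internal helpdesk agent", Bucket.OPEX, 1_500),
    ("client project A, sprint 1", Bucket.DIRECT, 10_000),
    ("client project A, sprint 2", Bucket.DIRECT, 2_000),
]

totals = split_bill(LEDGER)

# Hypothetical fixed-price engagement, with and without the token line.
margin_on_paper = project_margin(100_000, 80_000, 0)
margin_in_fact = project_margin(100_000, 80_000, totals[Bucket.DIRECT])
```

The gap between the two margin figures is the unpriced supplier. It only becomes visible when the token line sits in the bid next to the staffing model.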

And the metric that unifies all three is cost per outcome — per shipped feature, per resolved ticket, per accepted deliverable — with the human cost inside the denominator. Because in every workflow that ships, a person starts the work, steers it, and signs for it. Which closes the loop of this essay in the language finance understands: the review is not overhead on the AI’s output. It is part of its unit cost. Generation made the numerator cheap. Review is what the denominator still costs.
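One way to compute the metric, as a sketch: total cost, meaning tokens plus the human time to start, steer, and sign, divided by the number of accepted outcomes. That reading of the formula is an assumption, and the figures are hypothetical placeholders.

```python
def cost_per_outcome(token_cost: float, human_hours: float,
                     hourly_rate: float, accepted_outcomes: int) -> float:
    """Unit cost of an accepted outcome, with review time counted in."""
    if accepted_outcomes <= 0:
        raise ValueError("no accepted outcomes, unit cost is undefined")
    return (token_cost + human_hours * hourly_rate) / accepted_outcomes


# Hypothetical sprint: 300 euros of tokens, 6 hours of senior time, 3 features.
tokens_only = cost_per_outcome(300.0, 0.0, 120.0, 3)
with_review = cost_per_outcome(300.0, 6.0, 120.0, 3)
```

In this made-up sprint the tokens-only figure understates the unit cost by more than a factor of three, which is the gap between the slide and the ledger.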

The agent belongs in the calculation, not in the overhead surcharge. And the review is not overhead on its output — it is part of its unit cost.

The Real Complexity — Everything Has to Plug Together

Every assertion of “100× faster with AI” runs into the same reality the moment it leaves the slide. An AI-augmented SDLC has to integrate with the toolchain the enterprise already operates: ServiceNow, Jira, GitHub Enterprise, CI/CD, observability, identity and access management, data classification, works-council (Betriebsrat) agreements. None of it is glamorous; all of it is non-negotiable. An “agentic SDLC pipeline” is a system-integration programme, not a tooling decision. Anyone selling it as the second is selling you the slide.

Many firms now write about agentic SDLC pipelines. Many talk about them. Many pitch them. Building and operating one that is functional, secure, and reliable enough to run a regulated business on is a different story. It is a craft — years of operating production software, scars from incidents that never made it onto a slide, judgment that compounds across hundreds of deployments. Writing about a pipeline is not building one. Pitching one is not running one. The distance between a published playbook and a pipeline that survives 02:00 on a Sunday is the entire game.

Junior Engineers — Pair-Programming with the Agent

A pipeline that produces only seniors has a fifteen-year half-life. The organisations winning quietly are the ones that took AI savings and reinvested in a junior pipeline — juniors who pair-program with the agent as their daily practice, learn what the agent gets wrong, and become tomorrow’s reviewers. Universities need a mandatory new curriculum on top of the fundamentals: how to develop software with agentic AI. You cannot develop software with AI if you do not know how to develop software. The organisations that take AI savings straight to the bottom line will find, in five years, that they have nobody who can review what the agents produce — and the agent’s output will be ungoverned.

The Bottom Line

Vibe-coding is not the enemy of software engineering. It is one part of the pipeline that finally got cheap. The other parts — the ones that separate a working demo from a system that can run a regulated business — got more important, not less. A team that ships vibe-coded code into production without engineering discipline is running into a debt schedule it has not yet seen. A consulting firm that sells you a pipeline it has never operated is selling you a slide.

Compression is genuine. Liability did not move. Review is the new bottleneck. The pipeline is the product. The seniors are the moat. The window to acquire one is open now. Sprints went from three weeks to one day. The art is the review.

Engineering is not the bottleneck on the typing. It is the discipline that lets a business defend what came out the other end. The firms that internalise that will be the ones still shipping in 2030. The firms that do not will be the ones whose chatbots get hacked in two hours, and whose strategy decks no longer match the system the business is actually running.