Dynamic workflows: parallel verification at scale

The short version

Dynamic workflows let Claude Code run many subagents in parallel and, more to the point, have them check each other before anything reaches you. The parallelism is not the win. The checking is.

A workflow is a script the runtime runs in the background, not a smarter single agent.
It caps at 16 agents running at once and 1,000 across a full run.
I am using one to re-verify around 250 posts on this site, with independent verifiers and a consensus gate.
It earns its cost on high-volume work where being wrong is expensive; a single agent is cheaper for everything else.

Tens to hundreds of agents, all running at once. That is the headline for dynamic workflows in Claude Code. It is also the least interesting thing about them.

The number that matters is not how many agents run. It is how many of them exist to check the others.

A dynamic workflow is a script that Claude writes and a runtime executes in the background, spawning subagents to do the work while your session stays free. Anthropic describes it as running “tens to hundreds of parallel subagents in a single session, checking its work before anything reaches you.” Read that last clause again. Checking its work. That is the product.

What a dynamic workflow actually is

Three tools in Claude Code can run a multi-step job: subagents, skills, and workflows. The docs draw the line by asking who holds the plan. With subagents and skills, Claude is the orchestrator. It decides turn by turn what to spawn next, and every result lands back in its context window. A workflow moves the plan into code. The script holds the loop, the branching, and the half-finished results, so Claude’s context holds only the final answer.

That sounds like a small distinction. It is the whole thing. When the plan lives in a script, the run survives interruption. A job that gets stopped picks up where it left off instead of starting over. It can run for hours. And because the results live in script variables instead of one context window, the work can fan much wider than a single conversation could ever track.

The runtime sets two hard limits: up to 16 agents running at once, fewer on a machine with limited cores, and 1,000 agents total across a run. The first number bounds what your laptop is doing this second. The second is a ceiling on the whole job, not a crowd that shows up together. People read “1,000 agents” and picture a stadium. It is closer to a turnstile that lets sixteen through at a time, up to a thousand by closing.
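The turnstile picture can be sketched in a few lines. This is not the Claude Code runtime, whose internals I have not seen. It is a minimal illustration of the two caps using a semaphore, and `run_capped` and `fake_agent` are hypothetical names invented for the example.

```python
import asyncio

MAX_CONCURRENT = 16   # agents running at once (figure from the text)
MAX_TOTAL = 1000      # agents across the whole run (figure from the text)


async def run_capped(items, worker, max_concurrent=MAX_CONCURRENT, max_total=MAX_TOTAL):
    """Run worker(item) for every item without exceeding either cap.

    Returns the results plus the peak number of workers seen running together.
    """
    if len(items) > max_total:
        raise ValueError(f"{len(items)} agents requested, ceiling is {max_total}")

    gate = asyncio.Semaphore(max_concurrent)  # the turnstile
    running = 0
    peak = 0

    async def guarded(item):
        nonlocal running, peak
        async with gate:  # wait here until a slot is free
            running += 1
            peak = max(peak, running)
            try:
                return await worker(item)
            finally:
                running -= 1

    results = await asyncio.gather(*(guarded(item) for item in items))
    return results, peak


async def fake_agent(post_id):
    # Hypothetical stand-in for a real subagent call.
    await asyncio.sleep(0.001)
    return f"checked post {post_id}"


if __name__ == "__main__":
    results, peak = asyncio.run(run_capped(list(range(250)), fake_agent))
    print(len(results), "posts checked, peak concurrency", peak)
```

All 250 items get processed, but the peak never passes sixteen. The total is a budget for the job, not a headcount at any one moment.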

This is not the bag of agents problem, where you throw a pile of models at a task with no topology and watch them fall into hallucination loops with nothing checking them. A workflow has a checking plane. The script is it.

It is also not the same as pointing one agent at a whole process and hoping. Dynamic workflows are in research preview as I write this, on the paid Claude plans. You start one by asking for it directly, or by turning on the ultracode setting and letting Claude decide when a task is big enough to deserve one.

Twelve days later, the details firmed up. The trigger keyword is now literally ultracode, renamed from workflow in v2.1.160, and /effort ultracode makes it session-wide: xhigh reasoning, plus a workflow planned for every substantive task without you asking. A quieter detail matters more. Workflow subagents always run in acceptEdits mode with file edits auto-approved, and with ultracode on in auto permission mode, the launch prompt is skipped as well, so the built-in gates loosen exactly as the fan-out widens. That leaves the habit this post keeps insisting on, a human reading the diff, as the one check the runtime never removes.

The real point is parallel verification

Here is the problem every long AI task runs into. Reliability compounds. An agent that is 95% reliable on a single step is not 95% reliable across twenty of them. It is 0.95 to the twentieth power, which is about 36%. I worked through that math in AI does tasks, not jobs, and it is the reason a single agent grinding through a hundred-step job tends to drift, then confidently hand you something wrong.
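The arithmetic is easy to check for yourself. The sketch below assumes each step succeeds or fails independently with the same probability, which is a simplification, and `chain_reliability` is a name made up for the example.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20, 100):
        print(f"{steps:>3} steps: {chain_reliability(0.95, steps):.1%}")
```

At twenty steps the function returns roughly 0.358, which is the 36% quoted above. At a hundred steps it drops below 1%.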

Parallel verification attacks the problem from the other side. Instead of one chain that has to be right at every step, you do the work independently and then you check it independently. The launch post puts the pattern plainly: “agents address the problem from independent angles, other agents try to refute what they found, and the run keeps iterating until the answers converge.” Generate, then refute, then converge. That is a different shape from generate-and-hope.

[Figure: A single long chain lets a wrong step reach output; a verified workflow fans out and refutes, catching the error first.]

The refute step is the one people skip, and it is the one that earns the whole thing. A single agent reviewing its own output is a weak check. It is invested in being right. An independent agent told to break the claim has no such loyalty. Run three of those and take the majority, and you have something closer to a verdict than an opinion.
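Taking the majority is the simplest possible consensus gate. Here is a rough sketch of one, assuming each refuter returns a plain string verdict. The function name and the verdict labels are illustrative, not anything the runtime defines.

```python
from collections import Counter


def consensus(verdicts, quorum=None):
    """Return the majority verdict, or None when nothing reaches the quorum.

    With no quorum given, a strict majority of the verdicts is required.
    """
    if not verdicts:
        return None
    if quorum is None:
        quorum = len(verdicts) // 2 + 1
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count >= quorum else None


if __name__ == "__main__":
    print(consensus(["refuted", "holds", "refuted"]))  # two of three agree
    print(consensus(["holds", "refuted"]))             # split, so no verdict
```

Returning `None` on a split matters as much as returning a winner. A claim with no verdict should fall back to the original text, not get waved through.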

Better base models help. Anthropic says Claude Opus 4.8, which shipped on May 28, 2026, is around four times less likely than the model before it to let a flaw in its own code pass unremarked. A model that catches more of its own mistakes makes a better verifier. But notice the model is not what makes the result trustworthy. The architecture is. You could run this pattern on a weaker model and still beat a single strong agent, because three skeptical passes catch what one confident pass misses.

A real run: refreshing this site

Let me make this concrete with a job I am actually running, on this very blog.

The site has around 250 posts. Some are years old. Links rot. Model names go stale, and a statistic that was current when I wrote it is now two model generations out of date. I wanted every factual claim re-checked against a live primary source and every dead link caught, without rewriting a single post’s voice or touching its publish date.

Think about doing that with one agent. It reads post one, makes some edits, reads post two, and by post forty it has forgotten what it decided on post three and is quietly inventing sources to justify changes nobody asked for. The compounding-error math again. A hundred small judgment calls in a row, with no memory and no check.

So I am running it as a workflow instead.

[Figure: Dynamic workflow run: a planner fans out to independent verifiers, a refute-and-agree gate, one commit, then human review.]

A planner agent audited the whole set first and wrote a per-post work list. Then the run fans out: independent agents, each taking one post, each re-checking its claims against live primary sources. A second pass follows behind them with one instruction: refute. Find the edit that is wrong. Only changes that survive that second look, and that a primary source confirms word for word, are allowed through. Anything unconfirmed keeps the original text. One commit per batch of five posts. Nothing gets pushed anywhere until I have read the diff myself. No two agents end up on the same post, because claiming a post is a single file move on disk that only one agent can win. What is left to do is counted from the files themselves rather than held in a tally that could drift. The same trick lets several sessions run the job at once.
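The claim-by-move idea is worth seeing in code. This is a sketch of the general technique, not the script from my run. It assumes a `todo` and a `claimed` directory on the same filesystem, where a rename is a single atomic step on POSIX systems. The directory layout and function names are invented for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def claim(post: Path, claimed_dir: Path) -> Optional[Path]:
    """Try to claim a post by moving it out of the todo directory.

    Returns the new path on success, or None if another agent got there first.
    """
    target = claimed_dir / post.name
    try:
        os.rename(post, target)  # one move; only one caller can win it
    except FileNotFoundError:
        return None  # the file is already gone, so someone else claimed it
    return target


def remaining(todo_dir: Path) -> int:
    """Count outstanding work from the files themselves, not from a tally."""
    return sum(1 for entry in todo_dir.iterdir() if entry.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        todo, claimed = Path(root, "todo"), Path(root, "claimed")
        todo.mkdir()
        claimed.mkdir()
        for n in range(3):
            (todo / f"post-{n}.md").write_text("body")

        first = claim(todo / "post-0.md", claimed)
        second = claim(todo / "post-0.md", claimed)
        print(first is not None, second is None, remaining(todo))
```

Because the count comes from a directory listing, a crashed session cannot leave the tally wrong. Whatever is still in `todo` is still to do.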

The part I did not expect, and the reason the whole structure earns its keep, is this. A real fraction of the first pass’s proposed edits were made up. Old text it claimed was in the file was not actually there. Sources it cited did not say what it claimed they said. If I had trusted the first pass, I would have published fabrications under my own name. The refute pass caught them. The human gate at the end caught the rest. That is not a knock on the model. It is the whole argument for checking as a separate, adversarial step, rather than a box the same agent ticks on its way out the door.
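Both kinds of fabrication can be caught mechanically before a refuter even looks. The sketch below assumes each proposed edit carries the text it claims to replace and the quote it claims the source contains. `ProposedEdit`, `vet`, and `apply_if_clean` are hypothetical names, and a real check would also need to fetch the source and normalize whitespace.

```python
from dataclasses import dataclass


@dataclass
class ProposedEdit:
    old_text: str      # text the agent claims is in the post
    new_text: str      # replacement it wants to make
    source_quote: str  # quote the agent claims the source contains


def vet(edit: ProposedEdit, post_body: str, source_body: str) -> list:
    """Return the reasons to reject an edit; an empty list means it may proceed."""
    problems = []
    if not edit.old_text or edit.old_text not in post_body:
        problems.append("old_text not found in post")
    if not edit.source_quote or edit.source_quote not in source_body:
        problems.append("source_quote not found in source")
    return problems


def apply_if_clean(edit: ProposedEdit, post_body: str, source_body: str) -> str:
    """Apply the edit only when both checks pass; otherwise keep the original."""
    if vet(edit, post_body, source_body):
        return post_body
    return post_body.replace(edit.old_text, edit.new_text, 1)


if __name__ == "__main__":
    post = "Model X shipped in 2023."
    source = "Model X was released in March 2024."
    good = ProposedEdit("shipped in 2023", "shipped in 2024", "released in March 2024")
    made_up = ProposedEdit("launched in 2022", "launched in 2025", "released in 2025")
    print(apply_if_clean(good, post, source))
    print(vet(made_up, post, source))
```

An exact substring match is deliberately strict. It will reject some honest paraphrases, and that is the right way to fail when the alternative is publishing something invented.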

This is also, in plain terms, the kind of work I now take on through Blue Sheen, the advisory firm I run with Pravina. I am not going to pretend an orchestration diagram sells itself. Most companies do not need this. But when the work is high volume and being wrong is expensive, the pattern is worth knowing.

When it earns its cost

A workflow spawns a lot of agents, so a single run can use a lot more tokens than doing the same task in a normal conversation. That cost is the thing to be clear-eyed about. You are paying for the checking. The question is whether the checking is worth more than it costs.

Three conditions have to line up before it is.

Volume. If you have five things to check, check them yourself, or ask one agent. The overhead of planning and fanning out only pays back across dozens or hundreds of items.

Being wrong has to hurt. If a mistake is cheap and reversible, skip the refute pass and save the tokens. If a mistake means publishing a made-up statistic, or merging a security hole, or telling a client something false, the second look is the cheapest insurance you will buy that week.

The work has to split. The items have to be checkable on their own. Re-verifying 250 posts splits cleanly, because post 12 does not depend on post 200. A task where every step feeds the next does not split, and a workflow buys you nothing there.

Miss any of those and reach for something simpler. A small, sequential, low-stakes job wants a single capable agent. A process you repeat the exact same way every time wants a plain script, not an AI at all. I keep coming back to the same point I made about agentic AI use cases: build the check before you build the automation, because a workflow with no way to catch its own errors just multiplies them faster. For the full decision, including the cases where the answer is a flat no, see when to reach for a dynamic workflow.

None of this is about replacing the person at the end. The opposite. The whole design assumes a human reads the diff before anything ships. What changes is what that person spends their attention on. Not re-checking 250 posts by hand. Reading a clean, already-refuted result and deciding whether to publish it.

Tens to hundreds of agents is a fun headline. Checking you can afford to run in parallel is the actual reason to care.

If your team is staring at a job that is too big for one agent and too costly to get wrong, that is worth a conversation.


About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.

