AI does tasks. It does not do jobs.

Everybody wants AI to do their job. I’ve spent ten years building software that quietly bets the opposite way, and I think the people chasing the whole job are going to keep getting burned.

Here’s the distinction I keep coming back to. A task is one defined unit of work: draft this email, check this number against the contract, file this form. A job is a long chain of those tasks that adds up to an outcome, like onboarding a client or closing the books. AI is brilliant at the first kind and shaky at the second. And the reason isn’t that the models are thick. It’s multiplication.

Why AI needs one defined task

Tasks in the job: 10
AI reliability per task: 90%
AI alone, chained end to end: 35%
With Tallyfy (defined, tracked, gated): 99%

0.90¹⁰ ≈ 0.35. A 10-step job run blind is a coin flip you lose.


Why does the whole job fall apart?

Because reliability compounds, and it compounds downward.

Say an agent does each task at 90% reliability. Respectable for one task. But the job only finishes if every task in the chain lands, so you multiply 0.9 by itself once per task. Three tasks and you’re near 73%. Ten tasks and you’re at 35%. Twenty and you slide under 13%. The model never got worse at any single step. The chain ate the reliability.

That’s the whole problem in one line.
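To put a number on how fast the chain eats reliability, here is a minimal sketch that runs the same arithmetic backwards: given a per-task reliability and a floor for the whole job, how many tasks can you chain before you drop below the floor? The function name and the 50% floor are my own illustrative choices, and it assumes every task succeeds or fails independently at the same rate.

```python
import math

def max_chain_length(r, target):
    """Largest n with r**n >= target, assuming independent tasks of equal reliability r."""
    if not (0 < r < 1) or not (0 < target <= 1):
        raise ValueError("r must be in (0, 1) and target in (0, 1]")
    # r**n >= target  <=>  n <= log(target) / log(r), because log(r) is negative
    n = math.floor(math.log(target) / math.log(r))
    # Nudge n in case floating-point rounding landed on the wrong side of a boundary
    while r ** (n + 1) >= target:
        n += 1
    while n > 0 and r ** n < target:
        n -= 1
    return n

for r in (0.90, 0.95, 0.99):
    n = max_chain_length(r, 0.50)
    print(f"r={r:.2f}: up to {n} tasks keep the job at or above 50% ({r**n:.1%})")
```

At 90% per task the answer is six. The seventh task is the one that takes the job below even odds.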

There’s a measurement from METR that stuck with me. Frontier models hit almost 100% on tasks a human could do in under four minutes, and under 10% on tasks that run past four hours. Short and defined, reliable. Long and sprawling, not yet, and maybe not for a while. The thing is, most real jobs are the long kind, which is exactly why aiming an agent at a whole job and walking away tends to end in a mess.

I felt this directly last year. I built an MCP server so AI agents could run Tallyfy tasks on their own. Hand an agent one well-scoped task and it’s a joy to watch. Ask it to carry an eight-step process end to end with nobody checking between steps, and it drifts. Same model. Different unit of work.

Watch it fall apart yourself

I don’t expect you to take my word for the arithmetic, so here’s a tiny simulation. It flips a weighted coin per task across a hundred thousand runs. The simulated numbers land right on top of the predicted ones, because the math really is that boring.

Figure: simulation output showing a 10-task job at 90% per-task reliability succeeding about 35% of the time, while a gated retry pattern holds near 99%.

import random

random.seed(42)
TRIALS = 100_000

def chain_success(n, r, trials=TRIALS):
    # Autonomous chain: the job succeeds only if all n tasks succeed
    wins = 0
    for _ in range(trials):
        if all(random.random() < r for _ in range(n)):
            wins += 1
    return wins / trials

def gated_success(n, r, attempts=3, trials=TRIALS):
    # Gated chain: each task gets up to `attempts` tries before the job fails
    wins = 0
    for _ in range(trials):
        # A task passes if any of its attempts succeeds; the job needs every task to pass
        if all(any(random.random() < r for _ in range(attempts)) for _ in range(n)):
            wins += 1
    return wins / trials

R = 0.90
for n in (1, 3, 5, 10, 20):
    print(f"{n:>2} tasks  predicted {R**n:>6.1%}  simulated {chain_success(n, R):>6.1%}")

print(f"gated 10 tasks: {gated_success(10, R):.1%}")

Download it and change the inputs. The chained number falls off a cliff. The gated number, where each task gets a couple of retries before the job is allowed to fail, holds near 99%. The model is identical in both. The only thing that changed is structure. Scale the same trick across a few hundred tasks and you get dynamic workflows in Claude Code, where the gate is other agents, checking in parallel.
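The gated figure has a closed form too, and it shows how much each extra attempt buys. This is a small sketch under the same assumptions the simulation makes: attempts are independent, and the gate always catches a failed attempt. Real gates miss things and retries of the same step are often correlated, so I would treat these numbers as an upper bound. The function name is mine.

```python
def gated_job_reliability(n, r, attempts):
    """Probability that all n tasks pass when each gets up to `attempts` independent tries."""
    per_task = 1 - (1 - r) ** attempts  # a task fails only if every attempt fails
    return per_task ** n

R, N = 0.90, 10
for attempts in (1, 2, 3):
    print(f"{attempts} attempt(s) per task: {gated_job_reliability(N, R, attempts):.1%}")
```

With one attempt this collapses to the plain chain, 0.90¹⁰. Under these assumptions, a second attempt gets a 10-task job to about 90%, and the third gets it to about 99%.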

What a task actually is

So the move is almost a no-brainer once you’ve seen the numbers: stop handing AI jobs, start handing it tasks.

Economists worked out that the task was the unit a long time ago. Acemoglu and Restrepo model automation as something that acts on specific tasks inside a role, never the whole role in one go. A job is a basket of tasks. Machines take some, people keep others, new ones appear. “Job” is an HR word. “Task” is the real economic unit, and it always was.

The engineers landed in the same spot. Anthropic’s guide to building agents calls the reliable pattern a workflow of predefined steps, and tells you to make each call an easier task on purpose. Smaller task, higher hit rate. Mind you, that isn’t a workaround. It’s the design.

This is the bet I made with Tallyfy ten years ago, before any of this was fashionable. The atom is a single defined task, with an owner, a deadline, and a check before the next one starts. Turns out that’s the exact shape AI needs to be useful. We didn’t build it for the agents. The agents just happen to need what a good process already had.
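As a rough illustration of that shape, here is a hypothetical sketch of a task as a data structure. It is not Tallyfy's data model or API; the class, field, and function names are invented for the example. The only point is that the check lives on the task, and the runner refuses to start the next task until it passes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass
class Task:
    name: str
    owner: str                      # who is on the hook if the check fails
    deadline: date
    do: Callable[[dict], dict]      # performs the work and returns the new state
    check: Callable[[dict], bool]   # gate: must pass before the next task starts

def run_job(tasks, state):
    """Run tasks in order and stop at the first task whose check fails."""
    for task in tasks:
        state = task.do(state)
        if not task.check(state):
            return state, f"stopped at '{task.name}', escalate to {task.owner}"
    return state, "done"

tasks = [
    Task("draft email", "ops lead", date(2030, 1, 15),
         do=lambda s: {**s, "draft": "Hello, your contract is attached."},
         check=lambda s: bool(s.get("draft"))),
    Task("check amount against contract", "ops lead", date(2030, 1, 16),
         do=lambda s: {**s, "amount_matches": s["invoice"] == s["contract"]},
         check=lambda s: s["amount_matches"]),
]

print(run_job(tasks, {"invoice": 1200, "contract": 1200})[1])
print(run_job(tasks, {"invoice": 1200, "contract": 1500})[1])
```

The second run stops at the mismatch instead of carrying a bad number into whatever comes next.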

Where this leaves the job

I don’t think AI is coming for your job in one clean sweep. I think it’s going to swallow more and more of the tasks inside your job, and the roles that survive will be the ones that get good at defining and supervising those tasks. That’s a less dramatic story than the headlines sell, and a more demanding one.

The fix was never a cleverer model. It’s structure. Define each task. Track it. Gate it. Keep a human on the hook where it matters. Do that and a fragile chain turns into something that actually finishes, whether the doer is a person, an agent, or a rule.

Gating is the word doing the heavy lifting there, so here is what it means in practice for an AI task. The failure that actually bites is not the agent producing a wrong answer. It is the agent reporting that it finished while nothing real changed. So the gate I lean on checks two cheap things before a task is allowed to count as done. Did any actual work land, or did the step just narrate success? And is the change real or cosmetic? A task that cannot pass both gets retried or handed to a person. It never moves downstream on its own word.
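Here is one way those two checks could look in code, as a sketch rather than the gate I actually run. It assumes the task's effect can be captured as text before and after the step (a file, a record, a document), and it treats a difference in whitespace or letter case alone as cosmetic. The function names and that definition of cosmetic are assumptions made for the example.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace and case so purely cosmetic edits compare equal."""
    return " ".join(text.split()).lower()

def gate(before: str, after: str, agent_says_done: bool) -> str:
    """Return 'pass', or a reason to retry or hand the task to a person."""
    if not agent_says_done:
        return "retry: agent did not report completion"
    # Check 1: did any actual work land, or did the step just narrate success?
    if after == before:
        return "retry: reported done but nothing changed"
    # Check 2: is the change real or cosmetic?
    if _normalize(after) == _normalize(before):
        return "retry: change is cosmetic only"
    return "pass"

print(gate("total: 100", "total: 100", True))    # nothing landed
print(gate("total: 100", "Total:  100 ", True))  # cosmetic only
print(gate("total: 100", "total: 120", True))    # a real change
```

Passing this gate does not prove the answer is right. It only rules out the cheap failure, where the agent says it finished and nothing real happened.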

If you want the longer, product-flavored version of this argument, I wrote one over on Tallyfy. And if you just want to drag the sliders until the job collapses, the full calculator is here. Either way, the lesson is the same one I’ve been living with for a decade. Give AI a task. Never give it a job.


About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.

