CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

AI does tasks. It does not do jobs.

Ten years building Tallyfy, and a year pointing AI agents at it, taught me one blunt thing. A job is a chain of tasks, and AI reliability multiplies down that chain until the whole thing is a coin flip. The fix is not a smarter model.

Everybody wants AI to do their job. I’ve spent ten years building software that quietly bets the opposite way, and I think the people chasing the whole job are going to keep getting burned.

Here’s the distinction I keep coming back to. A task is one defined unit of work: draft this email, check this number against the contract, file this form. A job is a long chain of those tasks that adds up to an outcome, like onboarding a client or closing the books. AI is brilliant at the first kind and shaky at the second. And the reason isn’t that the models are thick. It’s multiplication.

Why AI needs one defined task

AI alone, chained end to end: 35%
With Tallyfy (defined, tracked, gated): 99%

0.90^10 ≈ 0.35. A 10-step job run blind is a coin flip you lose.

Why does the whole job fall apart?

Because reliability compounds, and it compounds downward.

Say an agent does each task at 90% reliability. Respectable for one task. But the job only finishes if every task in the chain lands, so if each task is an independent shot you multiply 0.9 by itself once per task. Three tasks and you’re near 73%. Ten tasks and you’re at 35%. Twenty and you slide under 13%. The model never got worse at any single step. The chain ate the reliability.
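
The same arithmetic can be run backwards, which is a rough way to see how demanding a long chain is. This is a minimal sketch, assuming every task succeeds independently and at the same rate. The helper names `job_reliability` and `required_task_reliability` are made up for illustration.

```python
def job_reliability(r, n):
    # Chance an n-task chain finishes when every task must succeed
    return r ** n

def required_task_reliability(target, n):
    # Per-task reliability needed for an n-task chain to reach `target`
    return target ** (1 / n)

for n in (3, 10, 20):
    print(f"{n:>2} tasks at 90% each -> job finishes {job_reliability(0.90, n):.1%}")

for n in (3, 10, 20):
    needed = required_task_reliability(0.90, n)
    print(f"{n:>2} tasks, 90% job target -> each task needs {needed:.2%}")
```

Under those assumptions, a ten-task job that should finish nine times in ten needs each task to land roughly 99% of the time.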

That’s the whole problem in one line.

There’s a measurement from METR that stuck with me. Frontier models hit almost 100% on tasks a human could do in under four minutes, and under 10% on tasks that run past four hours. Short and defined, reliable. Long and sprawling, not yet, and maybe not for a while. The thing is, most real jobs are the long kind, which is exactly why aiming an agent at a whole job and walking away tends to end in a mess.

I felt this directly last year. I built an MCP server so AI agents could run Tallyfy tasks on their own. Hand an agent one well-scoped task and it’s a joy to watch. Ask it to carry an eight-step process end to end with nobody checking between steps, and it drifts. Same model. Different unit of work.

Watch it fall apart yourself

I don’t expect you to take my word for the arithmetic, so here’s a tiny simulation. It flips a weighted coin per task across a hundred thousand runs. The simulated numbers land right on top of the predicted ones, because the maths really is that boring.

[Figure: simulation output. A 10-task job at 90% per-task reliability succeeds about 35% of the time, while a gated retry pattern holds near 99%.]
```python
import random

random.seed(42)
TRIALS = 100_000

def chain_success(n, r, trials=TRIALS):
    # Autonomous chain: the job succeeds only if all n tasks succeed
    wins = 0
    for _ in range(trials):
        if all(random.random() < r for _ in range(n)):
            wins += 1
    return wins / trials

def gated_success(n, r, attempts=3, trials=TRIALS):
    # Gated chain: each task gets up to `attempts` tries before the job fails
    wins = 0
    for _ in range(trials):
        if all(any(random.random() < r for _ in range(attempts)) for _ in range(n)):
            wins += 1
    return wins / trials

R = 0.90
for n in (1, 3, 5, 10, 20):
    print(f"{n:>2} tasks  predicted {R**n:>6.1%}  simulated {chain_success(n, R):>6.1%}")

print(f"gated 10 tasks: {gated_success(10, R):.1%}")
```

Download it and change the inputs. The chained number falls off a cliff. The gated number, where each task gets up to three attempts before the job is allowed to fail, holds near 99%. The model is identical in both. The only thing that changed is structure.
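
The gated figure has a closed form too. This is a sketch, assuming retries are independent of each other. The simulation makes the same assumption, and it is a generous one if an agent tends to repeat the same mistake. `gated_predicted` is an illustrative name, not part of the download.

```python
def gated_predicted(n, r, attempts=3):
    # A task fails only if every one of its attempts fails
    task_ok = 1 - (1 - r) ** attempts
    # The job still needs all n gated tasks to land
    return task_ok ** n

for attempts in (1, 2, 3):
    print(f"{attempts} attempt(s) per task: {gated_predicted(10, 0.90, attempts):.1%}")
```

With one attempt this collapses to the plain chain. Under the independence assumption, the second attempt lifts a ten-task job to about 90%, and the third carries it to about 99%.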

What a task actually is

So the move is almost a no-brainer once you’ve seen the numbers: stop handing AI jobs, start handing it tasks.

Economists worked out that the task was the unit a long time ago. Acemoglu and Restrepo model automation as something that acts on specific tasks inside a role, never the whole role in one go. A job is a basket of tasks. Machines take some, people keep others, new ones appear. “Job” is an HR word. “Task” is the real economic unit, and it always was.

The engineers landed in the same spot. Anthropic’s guide to building agents calls the reliable pattern a workflow of predefined steps, and tells you to make each call an easier task on purpose. Smaller task, higher hit rate. Mind you, that isn’t a workaround. It’s the design.

This is the bet I made with Tallyfy ten years ago, before any of this was fashionable. The atom is a single defined task, with an owner, a deadline, and a check before the next one starts. Turns out that’s the exact shape AI needs to be useful. We didn’t build it for the agents. The agents just happen to need what a good process already had.
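
To make that shape concrete, here is a minimal sketch of a task with an owner, a deadline, and a check that gates the next step. It is an illustration only: `Task`, `run_process`, and their fields are hypothetical names, not Tallyfy's data model or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable

@dataclass
class Task:
    name: str
    owner: str                      # person, agent, or rule on the hook
    deadline: date                  # recorded only; not enforced in this sketch
    do: Callable[[Any], Any]        # performs the work on the incoming value
    check: Callable[[Any], bool]    # gate: must pass before the next task starts

def run_process(tasks, value, attempts=3):
    # Run tasks in order; stop at the first task whose check never passes
    log = []
    for task in tasks:
        for attempt in range(1, attempts + 1):
            result = task.do(value)
            if task.check(result):
                log.append((task.name, task.owner, attempt, "passed"))
                value = result
                break
        else:
            # Gate never opened: hand back to the owner instead of pressing on
            log.append((task.name, task.owner, attempts, "escalated"))
            return value, log
    return value, log

# Toy process: both steps are stand-ins for real work
tasks = [
    Task("draft email", "agent", date(2030, 1, 1),
         lambda v: v + " draft", lambda r: r.endswith("draft")),
    Task("file form", "agent", date(2030, 1, 2),
         lambda v: v + " filed", lambda r: False),  # check rigged to fail
]
final, log = run_process(tasks, "client:")
for entry in log:
    print(entry)
```

The second task's check is rigged to fail, so the run stops there and the log shows who is on the hook, rather than carrying a bad result down the chain.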

Where this leaves the job

I don’t think AI is coming for your job in one clean sweep. I think it’s going to swallow more and more of the tasks inside your job, and the roles that survive will be the ones that get good at defining and supervising those tasks. That’s a less dramatic story than the headlines sell, and a more demanding one.

The fix was never a cleverer model. It’s structure. Define each task. Track it. Gate it. Keep a human on the hook where it matters. Do that and a fragile chain turns into something that actually finishes, whether the doer is a person, an agent, or a rule.

If you want the longer, product-flavoured version of this argument, I wrote one over on Tallyfy. And if you just want to drag the sliders until the job collapses, the full calculator is here. Either way, the lesson is the same one I’ve been living with for a decade. Give AI a task. Never give it a job.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
