
CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

Few-shot learning: common challenges with this technique

Bad examples teach AI boundaries better than good ones. Testing hundreds of few-shot prompts in production at Tallyfy reveals why negative examples consistently improve AI performance by showing what not to do. The key is teaching systems what to avoid, not just what to do.

What you will learn

  1. Negative examples outperform positive-only approaches - showing AI what not to do improves accuracy by up to 20% compared to positive examples alone
  2. Quality beats quantity in example selection - 3 carefully chosen negative examples work better than 20 random positive ones
  3. The 70/30 rule works - mixing 70% positive with 30% negative examples creates optimal decision boundaries
  4. Format consistency is your hidden multiplier - standardized example structure can improve performance more than adding examples

Almost everyone does few-shot learning backwards. They pile on perfect examples and hope the AI figures out the pattern.

After three years building AI systems at Tallyfy and watching implementations fail in ways that surprised me, I finally understood what research on learning from negative examples had already figured out: bad examples teach better than good ones. This is one of the most underrated aspects of prompt engineering.

The worst part? Most people don’t realize they’re doing it wrong.

The problem with positive-only training

AI models don’t learn patterns from positive examples alone. They learn boundaries from negative ones.

Think about teaching someone to identify a dog. Fifty photos of dogs helps. But they’ll probably still point at a wolf and say “dog.” Show them 3 dogs and 2 wolves with clear labels? Suddenly the boundary clicks. The distinction matters. The edge cases become visible.

The data backs this up. Models trained with negative examples performed much better than positive-only ones, with the gains plateauing at around 15 negative examples per positive.

I saw this pattern clearly when building customer service automation. Hundreds of perfect response examples still produced nonsense 30% of the time. Adding examples of terrible responses changed everything. Accuracy jumped to 94%. The model finally understood what to avoid.

Why boundaries matter more than patterns

Cognitive science, building on Eleanor Rosch’s categorization research, has known this for decades. Humans learn boundaries better than patterns. We notice what doesn’t belong before we can articulate what does.

AI models work the same way. When you only show positive examples, the model has to infer where the edges are. It guesses. Usually wrong.

Negative examples define those boundaries explicitly. No guessing required.

A classification system for Tallyfy needed to categorize support tickets. Showing examples of “bug reports” wasn’t working. The model kept misclassifying feature requests as bugs. Adding negative examples (“This is NOT a bug report, it’s a feature request”) made the distinction clear overnight.

MLflow 3.0’s research-backed evaluators let you measure this properly, scoring factuality and groundedness systematically. In practice, a good mix of examples, negatives included, often matches or beats fine-tuning. You don’t need thousands of examples. You need the right mix.

Finding the right balance

After analyzing hundreds of production prompts, one ratio kept emerging: 70% positive, 30% negative.

Facebook’s search team ran the numbers: blending random and hard negatives improved model recall, with gains continuing up to a 100:1 easy-to-hard ratio. For few-shot learning, the sweet spot is simpler than that.

Show 7 examples of correct behavior. These teach the main pattern. Then show 3 examples of incorrect behavior. These define the edges.

Not 10 positives and 1 negative. Not 5 and 5. The 70/30 ratio consistently delivers better performance across different tasks and models, from content generation to data extraction. I think this is probably one of the most underappreciated levers in prompt engineering.
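A minimal sketch of assembling that mix, assuming you already have two ranked candidate lists. The function name `mix_examples` and its parameters are hypothetical, chosen for illustration only.

```python
def mix_examples(positives, negatives, n_positive=7, n_negative=3):
    """Build a labeled example set with a fixed positive/negative split.

    Assumption: both lists are already ranked best-first by the caller,
    so taking from the front keeps the strongest candidates.
    """
    if len(positives) < n_positive or len(negatives) < n_negative:
        raise ValueError("not enough candidates for the requested mix")
    chosen = [("positive", ex) for ex in positives[:n_positive]]
    chosen += [("negative", ex) for ex in negatives[:n_negative]]
    return chosen


# Illustrative placeholder data, not real prompts.
positives = [f"good response {i}" for i in range(9)]
negatives = [f"bad response {i}" for i in range(4)]
mixed = mix_examples(positives, negatives)
print(len(mixed))  # prints 10
```

Counts are passed as integers rather than a 0.7 fraction so the split never depends on floating-point rounding.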

But the negative examples still need to be chosen carefully. Random negatives are nearly useless.


Choosing negative examples that actually teach something

Random negative examples don’t work. You need examples that sit right at the boundary of correctness.

One comparison of one-class and two-class methods tells the story: careful negative sampling improved accuracy from 70% to 90%. Same number of examples. Totally different selection criteria.

Three approaches work consistently.

Edge cases that almost work. For email classification, don’t use obviously wrong examples. Use emails that are almost spam but not quite. These teach the subtle boundaries that actually trip models up.

Common failure modes. Track where your model fails most often. Convert those failures into negative examples. This directly addresses your real weak points, not imagined ones.

Boundary violations. Find examples that break one specific rule while following all others. These isolate and clarify individual constraints without overwhelming the model.

Building a content moderation system showed this clearly. Random inappropriate content as negative examples produced 72% accuracy. Deliberately selected edge cases produced 89%. Same number of examples, totally different outcomes.
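To make the boundary-violation approach concrete, here is a rough sketch that keeps only candidates breaking exactly one rule. The rules in `RULES` are made-up stand-ins for a support-reply style guide, and `boundary_negatives` is a hypothetical helper, not something from a library.

```python
# Hypothetical rules: each maps a name to a check that returns True when
# the text FOLLOWS the rule.
RULES = {
    "has_greeting": lambda t: t.lower().startswith(("hi", "hello")),
    "under_200_chars": lambda t: len(t) <= 200,
    "no_all_caps": lambda t: not t.isupper(),
}


def rule_violations(text, rules):
    """Return the names of the rules that `text` breaks."""
    return [name for name, check in rules.items() if not check(text)]


def boundary_negatives(candidates, rules):
    """Keep candidates that break exactly one rule, paired with that rule."""
    selected = []
    for text in candidates:
        broken = rule_violations(text, rules)
        if len(broken) == 1:
            selected.append((text, broken[0]))
    return selected


candidates = [
    "Hello, your refund is on its way.",  # breaks nothing: not a negative
    "Your refund is on its way.",         # breaks only has_greeting
    "YOUR REFUND IS ON ITS WAY.",         # breaks two rules: too noisy
]
for text, rule in boundary_negatives(candidates, RULES):
    print(f"{rule}: {text}")
```

The name of the single broken rule doubles as the "why this is wrong" explanation for the negative example.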

Diversity also matters more than volume here. Three diverse examples outperform twenty similar ones. Every time.

In document classification systems, 50 examples from similar documents often perform worse than 5 examples from totally different document types. The model needs to see the full range. This connects directly to the fragmentation problem in AI implementations. Narrow training examples produce narrow, fragile systems.
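One simple way to approximate "diverse" is greedy farthest-point selection over a crude text distance. The sketch below assumes word-level Jaccard distance is a good enough proxy, which it may not be for your data (embeddings would be the usual upgrade). `pick_diverse` is a hypothetical name.

```python
def jaccard_distance(a, b):
    """1 minus the word-set overlap of two texts (0 = same words, 1 = disjoint)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def pick_diverse(candidates, k):
    """Greedy farthest-point selection.

    Start from the first candidate, then repeatedly add the candidate
    whose closest already-chosen example is farthest away.
    """
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


docs = [
    "invoice total due march",
    "invoice total due april",
    "employee onboarding checklist",
]
print(pick_diverse(docs, 2))
```

With these three inputs the second pick is the onboarding checklist, not the near-duplicate invoice.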

Format consistency kills more implementations than bad examples do

[Figure: few-shot example selection flow, from task to 7 positives plus 3 boundary negatives, then standardize, validate, ship]

Inconsistent formatting destroys most few-shot implementations. Quietly. Without obvious error messages.

Your examples might be perfect. Your selection might be careful. But if the format varies, the model gets confused trying to separate format signals from content signals.

I spent weeks debugging a data extraction system before finding the issue. Some examples used JSON. Others used XML. Some had comments, others didn’t. The model couldn’t separate format from content. Once we standardized everything, performance jumped without changing a single example.

Industry best practices confirm this. Treating prompts like code, with version control and consistent formatting, can improve performance more than adding more examples.
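A mixed-format problem like the JSON-versus-XML one is cheap to catch before a prompt ships. This sketch assumes every example output is supposed to be a JSON object with the same keys. `format_problems` is a hypothetical helper and the sample outputs are invented.

```python
import json


def format_problems(outputs):
    """Return (index, problem) pairs for example outputs that are not JSON
    objects, or whose keys differ from the first valid example."""
    problems = []
    expected_keys = None
    for i, raw in enumerate(outputs):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((i, "not valid JSON"))
            continue
        if not isinstance(parsed, dict):
            problems.append((i, "not a JSON object"))
            continue
        keys = sorted(parsed)
        if expected_keys is None:
            expected_keys = keys  # first valid example sets the schema
        elif keys != expected_keys:
            problems.append((i, f"keys {keys} differ from {expected_keys}"))
    return problems


example_outputs = [
    '{"name": "Acme", "total": 120}',
    "<name>Acme</name>",   # XML slipped in
    '{"name": "Globex"}',  # missing a key
]
for index, problem in format_problems(example_outputs):
    print(index, problem)
```

Running a check like this in CI alongside the prompt file is one way to treat prompts like code.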

This template works reliably across implementations:

POSITIVE EXAMPLE 1:
Input: [exact format they'll use]
Output: [exact format you want]
Why this is correct: [brief explanation]

NEGATIVE EXAMPLE 1:
Input: [similar but wrong]
Output: [incorrect output]
Why this is wrong: [specific violation]

Same structure every time. The model learns the pattern, not the formatting chaos.
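Generating the examples from one function, rather than typing them by hand, is one way to guarantee the structure never drifts. A small sketch that reproduces the template above; the function names and the sample ticket text are illustrative assumptions.

```python
def render_example(index, positive, input_text, output_text, reason):
    """Render one example in the fixed four-line template."""
    kind = "POSITIVE" if positive else "NEGATIVE"
    verdict = "correct" if positive else "wrong"
    return (
        f"{kind} EXAMPLE {index}:\n"
        f"Input: {input_text}\n"
        f"Output: {output_text}\n"
        f"Why this is {verdict}: {reason}"
    )


def render_block(examples):
    """Render (positive, input, output, reason) tuples, numbering positives
    and negatives separately and separating examples with a blank line."""
    counts = {True: 0, False: 0}
    parts = []
    for positive, input_text, output_text, reason in examples:
        counts[positive] += 1
        parts.append(
            render_example(counts[positive], positive, input_text, output_text, reason)
        )
    return "\n\n".join(parts)


print(render_block([
    (True, "App crashes on export", "bug", "describes broken behavior"),
    (False, "Please add dark mode", "bug", "this is a feature request, not a bug"),
]))
```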

Testing and knowing when to stop

Most people test few-shot prompts wrong. They try a few inputs, see decent results, and ship it. Then it fails in production with real users.

Error rates compound fast. 95% reliability per step yields only 36% success over 20 steps. That is a brutal number. Yet industry data paints a mixed picture: 89% of teams have implemented observability while only 52% have proper evaluation in place. That gap is where things fall apart.
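The compounding arithmetic is easy to check, assuming each step succeeds independently with the same probability (real pipelines rarely satisfy that exactly, so treat it as a rough model):

```python
def chain_success(per_step, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


print(f"{chain_success(0.95, 20):.1%}")  # prints 35.8%
```

That is the 36% figure: 0.95 multiplied by itself 20 times.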

A few things that actually work for testing:

Holdout validation. Never test with data similar to the examples in your prompt. Use different data to verify the model is generalizing, not memorizing.
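A rough sketch of enforcing that separation: drop any held-out case that closely resembles a prompt example, then score what is left. The 0.8 threshold is an arbitrary assumption, `predict` stands in for whatever calls your model, and character-level similarity from `difflib` is only a crude proxy for "similar".

```python
from difflib import SequenceMatcher


def too_similar(a, b, threshold=0.8):
    """True when two texts are near-duplicates by character-level ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def clean_holdout(holdout, prompt_inputs, threshold=0.8):
    """Drop held-out (text, expected) cases that resemble any prompt input."""
    return [
        (text, expected)
        for text, expected in holdout
        if not any(too_similar(text, p, threshold) for p in prompt_inputs)
    ]


def accuracy(predict, cases):
    """Fraction of cases where predict(text) matches the expected label."""
    if not cases:
        raise ValueError("no held-out cases left to score")
    hits = sum(1 for text, expected in cases if predict(text) == expected)
    return hits / len(cases)


prompt_inputs = ["The export button crashes the app"]
holdout = [
    ("The export button crashes the app!", "bug"),  # near-duplicate, dropped
    ("Please add dark mode", "feature"),
]
print(clean_holdout(holdout, prompt_inputs))
```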

Adversarial testing. Try to break your prompt. Use edge cases, malformed inputs, weird formatting. If it survives this, it might survive production.

A/B testing in production. This is where prompt engineering discipline becomes important. Tools like Helicone have processed over 2 billion LLM interactions, enabling teams to test variations with real traffic and measure actual performance rather than synthetic benchmarks.

Progressive rollout. Start with 5% of traffic. Monitor closely. Scale gradually. At Tallyfy, this methodology caught a prompt that seemed 95% accurate in testing but was actually 67% accurate with real user input.
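One common way to implement a percentage rollout is stable hash bucketing, so the same user always sees the same prompt version as the percentage grows. This is a generic sketch, not a description of how Tallyfy does it; `rollout_bucket` and `use_new_prompt` are hypothetical names.

```python
import hashlib


def rollout_bucket(user_id):
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_prompt(user_id, rollout_percent):
    """True when this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < rollout_percent


# Start at 5%, then raise the number as monitoring stays clean.
for user in ["user-1", "user-2", "user-3"]:
    print(user, use_new_prompt(user, rollout_percent=5))
```

Because buckets are fixed per user, anyone included at 5% stays included at 20% or 50%, which keeps before-and-after comparisons clean.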

The common mistakes I see repeatedly: using only positive examples (like teaching someone to drive by only showing correct driving), selecting negatives that are too hard (which can collapse the very distinctions you are trying to teach), ignoring format consistency, and assuming a prompt that works on GPT-5.5 will work on Claude. It won’t always.

Sometimes few-shot learning just isn’t enough. The task is too complex. The variations are too numerous. You know you’ve hit the limit when accuracy plateaus despite better examples, edge cases multiply faster than you can document them, the prompt balloons past 50 examples, or performance varies wildly between similar inputs. This often connects to deeper issues like security vulnerabilities in RAG systems or fundamental architecture problems that more examples can’t fix.

The production track record is rough: plenty of agentic AI projects get abandoned when teams hit complexity they never anticipated. Sometimes fine-tuning is the right answer.

What I keep coming back to

The field is moving fast. Evaluation platforms are evolving from niche utilities into core infrastructure. Models keep getting better at learning from fewer examples.

(Update, June 2026: the economics shifted since I wrote this. Million-token context windows on the current model lineup mean you can pack far more examples into a single prompt than you used to, and adaptive thinking lets a model reason about boundary cases on its own. That makes fitting your 7 positives and 3 negatives cheap and easy. It does not change the ratio or the principle below. More room for examples is not a reason to dump in fifty.)

But the underlying principle stays constant: negative examples define boundaries better than positive examples define patterns.

Is this the only thing that matters? No. But it is what most people skip.

Stop teaching AI what to do. Start teaching it what not to do.

That’s where the actual learning happens.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
