Product management broke when AI features stopped being deterministic

Traditional product management assumes features work the same way every time. AI features do not. They drift, hallucinate, and behave differently for different users. This creates a new discipline where the PM must understand both how to use AI for PM work and how to manage products with AI inside them. Most companies have neither skill.

The short version

AI-native product management is a new discipline that combines two skills most companies lack: using AI to accelerate PM work, and knowing how to manage products whose AI features behave non-deterministically. Mid-size companies need this capability but can't justify a full-time hire.

  • Traditional PM breaks when features give different answers to different users
  • AI accelerates PM busywork but can't replace PM judgment
  • The gap between engineering building AI features and nobody managing them grows silently

Every product management framework you’ve ever used assumes one thing: when a user does X, the system does Y. Every time. Deterministically.

AI broke that assumption.

Your AI-powered search gives different results to different users for the same query. Your AI chatbot answers the same question three different ways depending on when you ask it. Your recommendation engine surfaces different items even when the user profile hasn’t changed. The feature isn’t broken. It’s working as designed. But nobody on your product team has ever managed something that works like this.

This isn’t a niche problem. Google’s PAIR team has published extensively on how non-deterministic behaviour fundamentally changes UX design assumptions. And it touches every company shipping AI features, which at this point is basically all of them.

Why traditional PM skills fall apart with AI features

The core PM discipline depends on acceptance criteria. “When user clicks Submit, form data saves to the database and confirmation appears.” That’s testable. Repeatable. You either built it right or you didn’t.

Now write acceptance criteria for: “When user asks a question, AI provides a helpful and accurate response.”

You can’t. Not really.

Helpful to whom? Accurate compared to what baseline? What about the 4% of the time it confidently states something wrong? What about when the underlying model gets updated and the behaviour shifts overnight without anyone touching the code? Nielsen Norman Group’s research on AI UX patterns found that users build mental models of how features work, and inconsistent behaviour from AI features erodes trust in ways that traditional PM metrics completely miss.

Roadmaps fall apart too. Traditional roadmaps assume feature completion as a milestone. You build the feature, it works, you move on. AI features are never really “done.” They drift. The model that performed brilliantly in testing starts producing subtly different outputs three months later because the data distribution shifted or Anthropic updated the underlying weights. This is called concept drift, and research from the University of Manchester shows it affects practically every production ML system.

A/B testing assumptions break. Traditional A/B tests assume both variants behave consistently within their group. When the feature itself produces variable outputs, your test groups aren’t actually controlled. You’re measuring noise on top of noise. Something I keep noticing across industries: teams run A/B tests on AI features, get inconclusive results, and blame the sample size when the real problem is non-deterministic behaviour within each variant.
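To make the A/B problem concrete, here is a minimal sketch using a standard sample-size rule of thumb (n ≈ 16·(σ/δ)² per group for 80% power at α=0.05). The effect size and noise figures are hypothetical: the point is that when model variability adds its own noise on top of normal user-level noise, the sample you need to detect the same improvement balloons.

```python
import math

def required_n_per_group(effect_size: float, stddev: float) -> int:
    """Approximate n per group for 80% power, alpha=0.05 (two-sided):
    n ~= 16 * (sigma / delta)^2 -- a standard rule of thumb."""
    return math.ceil(16 * (stddev / effect_size) ** 2)

# Deterministic feature: user-level noise only (sigma = 0.5 on a 1-5 scale),
# detecting a 0.1-point improvement
print(required_n_per_group(effect_size=0.1, stddev=0.5))   # 400 per group

# AI feature: the model's own output variability inflates sigma to 1.2,
# same 0.1-point improvement to detect
print(required_n_per_group(effect_size=0.1, stddev=1.2))   # 2304 per group
```

Same effect, nearly six times the traffic required, and that is before accounting for drift changing the feature mid-test.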

In building Tallyfy, we hit this directly when adding AI-powered workflow suggestions. The suggestion quality varied by context, by user history, by time of day. Traditional PM metrics told us the feature was performing well on average. But averages hide the distribution. The frustrated users at the tail end weren’t showing up in our dashboards.

Using AI to do product management faster

The flip side of this problem is genuinely exciting.

AI doesn’t just create PM challenges. It also makes a huge chunk of PM work faster. The discipline of AI-native product management runs in both directions: managing AI products AND using AI as a PM tool.

User research synthesis is the obvious one. Dump 50 interview transcripts into Claude and ask for patterns. What used to take a product manager two weeks of highlighting and affinity mapping now takes an afternoon. But here’s the catch: the PM still needs to know which questions to ask, which patterns actually matter, and which ones are just noise the model latched onto because they appeared frequently.

Competitive analysis compresses from weeks to hours. PRDs and specs can be drafted in minutes. RICE prioritization and weighted scoring frameworks can be modelled conversationally. The busywork melts away.
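Frameworks like RICE are mechanical enough that the arithmetic is trivial to automate; the judgment lives in the inputs. A minimal sketch, with entirely made-up feature names and scores:

```python
# RICE score = Reach * Impact * Confidence / Effort. All values hypothetical.
def rice(reach: int, impact: float, confidence: float, effort: float) -> float:
    return reach * impact * confidence / effort

features = {
    "ai_search_filters": rice(reach=4000, impact=1.0, confidence=0.8, effort=4),
    "bulk_export":       rice(reach=900,  impact=2.0, confidence=1.0, effort=2),
    "dark_mode":         rice(reach=6000, impact=0.5, confidence=0.5, effort=3),
}

# Rank highest-scoring first
for name, score in sorted(features.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0f}")
```

An LLM can fill in a table like this conversationally in minutes. Whether the confidence column deserves a 0.8 or a 0.5 is exactly the judgment it cannot supply.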

The risk that nobody talks about: companies start thinking AI replaces PM judgment. It doesn’t. It replaces PM busywork. The judgment, the strategic thinking, the ability to say “we’re not building that even though users keep asking for it” because you understand the bigger picture, that’s where the value lives. AI can synthesize 200 feature requests. It can’t tell you which three actually matter for your next quarter.

When explaining this to MBA students at OneDay, I use a simple frame: AI is the research assistant who works at 10x speed. You’re the principal investigator who decides what the research means. If you let the assistant run the lab, you get a lot of experiments and no conclusions.

Managing products that have AI inside them

This is the harder half. And honestly, most PM teams aren’t equipped for it.

Prompt management becomes a PM discipline. The system prompts powering your AI features are product configuration, full stop. They need version control. They need testing against evaluation sets before deployment. They need approval workflows, just like code changes. Managing prompts in production is as much a PM responsibility as it is an engineering one, because the prompt directly shapes the user experience.

Evaluation loops replace traditional QA. You can’t write a deterministic test for a probabilistic output. Instead, PMs need to understand evaluation metrics that didn’t exist in traditional product work: relevance scores, hallucination rates, latency distributions, confidence thresholds, fallback trigger rates. When your AI feature decides it’s not confident enough to answer and escalates to a human, how often is that happening? Is 12% acceptable or is that a sign the feature is struggling? These are PM questions, not engineering questions, because they directly affect the user experience and the business case.
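The metrics pull behind questions like "is 12% escalation acceptable?" can be this simple. A sketch over a hypothetical interaction log, with an explicit, agreed escalation budget instead of a gut feel:

```python
# Hypothetical log: (confidence, escalated_to_human, flagged_hallucination)
interactions = [
    (0.94, False, False),
    (0.41, True,  False),
    (0.88, False, True),
    (0.35, True,  False),
    (0.91, False, False),
]

n = len(interactions)
escalation_rate    = sum(1 for _, esc, _ in interactions if esc) / n
hallucination_rate = sum(1 for _, _, hal in interactions if hal) / n

print(f"escalation rate:    {escalation_rate:.0%}")     # 40%
print(f"hallucination rate: {hallucination_rate:.0%}")  # 20%

# Encode the product judgment explicitly rather than eyeballing dashboards
ESCALATION_BUDGET = 0.15
if escalation_rate > ESCALATION_BUDGET:
    print("escalation above agreed budget -- review this week")
```

The numbers are invented; the pattern is the point. Once the budget is written down, "is the feature struggling?" becomes a checkable question instead of a debate.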

The success metrics change entirely. Traditional PM tracks adoption, task completion, NPS. AI features add a whole layer: confidence scores by query type, human escalation percentages, output quality over time, user correction rates. If users keep editing the AI’s suggestions before accepting them, that’s a signal. But only if someone is watching.
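The correction-rate signal mentioned above is cheap to compute if you log one extra boolean per suggestion. A hypothetical sketch:

```python
# Hypothetical events: (suggestion_accepted, edited_before_accepting)
events = [
    (True, True), (True, False), (True, True), (False, False), (True, True),
]

accepted = [edited for ok, edited in events if ok]
correction_rate = sum(accepted) / len(accepted)
print(f"correction rate among accepted suggestions: {correction_rate:.0%}")  # 75%
```

Three out of four accepted suggestions edited first is a strong "close but not trusted" signal, and it never shows up in plain adoption numbers.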

Graceful degradation is a core PM concern now. What happens when the AI gives a bad answer? What’s the fallback? How does the UI communicate uncertainty to the user? Does the feature silently fail or does it acknowledge its limitations? These design decisions used to be edge cases. With AI features, they’re the main case. The discipline of AI operations barely exists at most companies, and the PM is often the only person positioned to bridge the gap between “the model works in testing” and “the model helps users in production.”
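One way to make the degradation path an explicit product decision rather than an accident is a confidence-tiered response policy. The thresholds and wording below are assumptions, and in a real system the tiers would come out of the evaluation data:

```python
def respond(answer: str, confidence: float) -> str:
    """Confidence-tiered fallback policy (illustrative thresholds)."""
    if confidence >= 0.85:
        return answer                                               # show normally
    if confidence >= 0.60:
        return f"{answer}\n\n(Low confidence -- please verify.)"    # acknowledge uncertainty
    return "I'm not confident enough to answer this. Routing to a human."  # graceful fallback

print(respond("Use the bulk-import template.", 0.92))
print(respond("Use the bulk-import template.", 0.40))
```

Where those two thresholds sit is a UX and business decision, not a model parameter, which is exactly why it belongs to the PM.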

The gap most companies are sitting in right now

Here’s what I see at 50 to 500 person companies. Engineering builds the AI feature. Product says “ship it.” Marketing announces it. And then… nobody owns it.

Nobody is monitoring whether the AI recommendations are getting worse over time. Nobody is tracking the hallucination rate trend line. Nobody is running prompt iterations against evaluation sets. Nobody is collecting user feedback specifically about AI behaviour versus general product satisfaction. The feature shipped. The sprint closed. The team moved on.

This gap grows silently. Model drift doesn’t announce itself. Prompt rot, where system prompts gradually become less effective as the model’s behaviour shifts through updates, is invisible unless someone is actively checking. The AI feature that was brilliant at launch quietly degrades until a customer complaint surfaces months later.
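Because drift does not announce itself, the only defence is re-running the same frozen evaluation set on a schedule and comparing against the launch baseline. A minimal sketch with hypothetical scores:

```python
def drift_alerts(weekly_scores: list[float], baseline: float,
                 tolerance: float = 0.05) -> list[int]:
    """Return indexes of weeks where quality fell more than `tolerance`
    below the launch baseline on the same frozen eval set."""
    return [i for i, s in enumerate(weekly_scores) if baseline - s > tolerance]

baseline = 0.91                              # eval score at launch
weekly   = [0.90, 0.91, 0.88, 0.84, 0.83]   # same eval set, re-run each week

print(drift_alerts(weekly, baseline))   # [3, 4] -- weeks 3 and 4 breached tolerance
```

Ten lines of monitoring is the difference between catching degradation in week 3 and hearing about it from a customer in month 6.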

The problem isn’t that companies don’t care. It’s that nobody has the right combination of skills. Your existing PMs are brilliant at traditional product work. They understand users, they prioritize well, they ship features. But they’ve never managed something non-deterministic. They’ve never evaluated prompt quality. They’ve never thought about confidence thresholds as a UX variable.

If you want to dig into this for your company, my door is open.

Full-time AI product managers exist, but they’re expensive and hard to find. The talent market is brutal. Teaching your existing PMs the AI-specific skills takes time you might not have if you’re shipping AI features now.

What this role actually looks like in practice

It’s not a full-time job for most mid-size companies. Not yet.

It’s 10 to 20 hours a week of specialized work layered on top of existing PM capacity. Weekly AI feature quality reviews: pulling evaluation metrics, checking output samples, reviewing user feedback that mentions AI behaviour specifically. Prompt iteration cycles: testing changes against evaluation sets before they go to production, exactly like code review but for the words that shape your AI’s personality.

AI-specific user research is part of it too. How do your users actually interact with non-deterministic features? Do they trust the AI suggestions? Do they verify them? Do they use the feature differently when they know it’s AI-powered versus when they don’t? These questions are different from standard user research, and the answers directly inform product decisions about confidence displays, explanation features, and fallback designs.

Building internal capability matters most. The goal isn’t to create a permanent dependency on a specialist. It’s to train your existing PM team to handle the AI-specific aspects so the company builds this muscle internally. Teaching founders about product decisions at the Skandalaris Center, I keep coming back to this point: the fractional model works because it transfers knowledge, not just delivers output.

The companies that figure this out early have a proper advantage. Their AI features improve over time instead of degrading. Their PMs understand what they’re managing. Their users trust the product because someone is actively maintaining that trust.

The companies that don’t? They’ll keep shipping AI features that work great on demo day and quietly disappoint six months later. And they won’t understand why, because nobody was watching.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.

Want to discuss this for your company?

Contact me