Stop Shipping Features: Why AI Products Need an Experiment Mindset

Sketch of experiment hypothesis board replacing a traditional product roadmap

Shipped 12 features. None moved the metric. Here is why.

We shipped 12 features in Q3.

I remember the sprint review. Green checkmarks everywhere. The PM was happy. The engineers were exhausted but proud. We had a beautiful roadmap that was 100% complete.

And then we looked at the metrics.

Retention: flat. Task completion: flat. The one quality score we actually cared about: marginally worse than when we started.

Twelve features. Zero meaningful movement. And the brutal part? We had no idea which of those 12 things had helped, which had hurt, and which had done absolutely nothing. We had shipped ourselves into a fog.

That was my healthcare AI product two years ago. We were running it like a software project. We were wrong.

The Lie at the Center of Every AI Roadmap

Traditional software roadmaps work because the relationship between output and outcome is legible. You ship a checkout flow, and conversion either goes up or it does not. The feature is deterministic. You can trace cause to effect.

AI products break this contract immediately.

The model does not behave the same way twice. The failure modes are not bugs — they are probability distributions. The thing that works for 90% of users fails completely for the other 10%, and you will not know which 10% until you are already in production. You cannot unit test your way to quality. You cannot spec your way to reliability.

Here's the framing I've come to believe: the right metric for an AI product team is not features shipped. It is experiments run. That is a fundamentally different operating model, and most teams — including mine, for longer than I want to admit — never make the switch.

What "Experiment Mindset" Actually Means

Here is what changes when you treat your AI product like a science program instead of a software project.

1. The roadmap becomes a hypothesis board.

Every item on the roadmap gets reframed as a falsifiable hypothesis. Not "add structured output formatting" but "we believe structured output formatting will reduce user re-queries by 20% for complex tasks." That reframe forces two things that most product teams avoid: deciding what success looks like before you build, and committing to a measurement plan before you ship.

At OpenLoop we eventually landed on a simple format: hypothesis, metric, baseline, threshold, and expiration date. If the experiment does not move the metric past the threshold within the time window, it fails. You do not negotiate with the data.
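To make the format concrete, here is a rough sketch of what one entry can look like as code. Every field name and number below is invented for illustration; it is not our actual board or template.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Hypothesis:
    """One roadmap item, reframed as a falsifiable bet."""
    statement: str          # what we believe will happen, and for whom
    metric: str             # the single number that decides it
    baseline: float         # where that number sits today
    threshold: float        # what it must reach for the experiment to pass
    expires: date           # when we stop waiting and call it failed
    higher_is_better: bool = True

    def verdict(self, observed: float, today: date) -> str:
        met = observed >= self.threshold if self.higher_is_better else observed <= self.threshold
        if met:
            return "pass"
        return "fail" if today >= self.expires else "pending"

# Illustrative entry only; the numbers are made up.
structured_output = Hypothesis(
    statement="Structured output formatting will cut user re-queries by 20% on complex tasks",
    metric="re-queries per complex task",
    baseline=1.8,
    threshold=1.44,            # 20% below baseline
    expires=date(2025, 3, 31),
    higher_is_better=False,    # fewer re-queries is the win
)
```

The verdict function is the point: pass, fail, or still pending, and nothing to argue about once the window closes.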

2. Evaluation is built first, not bolted on later.

This is the one that hurts the most to admit. For the first year of building healthcare AI, we had no systematic evaluation. We had vibes. Engineers would look at a sample of outputs and say "yeah, this feels better." Product would approve the release. We shipped.

The right order is: define what good looks like, build an evaluator that can detect it, then build the feature. Evaluation-first development is not a nice-to-have. It is the only way to know whether anything you ship actually works.

For practical evaluation setup: start with 50 to 100 labeled examples of good and bad outputs, reviewed by a domain expert — ideally one person, not a committee. Binary pass/fail labels beat 1-to-5 rating scales every time. You want signal, not sentiment. Then automate the evaluator so it runs on every change, like a test suite, but for quality instead of correctness.
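If you want a feel for how small that harness can be, here is a minimal sketch. It assumes a JSONL file of the expert-labeled examples and a judge function you supply that returns True when an output passes; every name in it is illustrative, not a specific eval framework.

```python
# eval_suite.py -- minimal evaluation harness, run on every change like a test suite.
import json

def load_examples(path: str = "labeled_examples.jsonl") -> list[dict]:
    """Expert-labeled examples: {"input": ..., "output": ..., "label": "pass" | "fail"}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def judge_agreement(examples: list[dict], judge) -> float:
    """Fraction of expert labels the automated judge reproduces.
    Trust the judge only once this number is high on the labeled set."""
    hits = sum(
        judge(ex["input"], ex["output"]) == (ex["label"] == "pass")
        for ex in examples
    )
    return hits / len(examples)

def pass_rate(examples: list[dict], generate, judge) -> float:
    """Score the current system: regenerate each output and apply the judge."""
    hits = sum(judge(ex["input"], generate(ex["input"])) for ex in examples)
    return hits / len(examples)
```

The specifics matter less than the order: this file exists before the feature does, and it returns one number you can compare across every change.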

If you are passing 100% of your evals, your evals are not hard enough.

3. Traces are your primary observability primitive.

In a normal web application, you instrument requests. You log errors. You track latency. That is table stakes.

In an AI product, the equivalent is trace-based observability. Every call to your model needs to log the full input, the full output, any retrieval context, and the metadata that explains what the system was trying to do. Not summaries. Full traces.
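Here is a sketch of what one trace record can look like, assuming you are rolling your own logging to a local file rather than using a dedicated tracing tool; the field names are illustrative.

```python
import json
import time
import uuid

def log_trace(*, intent: str, system_prompt: str, user_input: str,
              retrieved: list[dict], output: str, model: str,
              path: str = "traces.jsonl") -> None:
    """Append one full trace per model call: inputs, retrieval context,
    output, and the metadata that explains what the system was trying to do."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "intent": intent,                # what the system was trying to do
        "model": model,
        "system_prompt": system_prompt,  # the full prompt, not a summary
        "user_input": user_input,
        "retrieved": retrieved,          # every chunk, with its source and date
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

In production you would ship these records to whatever trace store you use; the shape of the record is what matters.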

This matters because AI failures are almost never what you expect. The bug is usually upstream — a retrieval step that pulled the wrong document, a prompt template that collapsed context in a subtle way, a system message that conflicted with the user turn. You cannot diagnose these failures from aggregate metrics. You have to read the actual traces.

We spent three months thinking our model was hallucinating. It was not. The retrieval layer was returning outdated clinical guidelines because our chunking strategy was stripping document dates. We only found this by reading 200 traces one afternoon. Zero dashboards would have caught it.

4. Experiment velocity is the team's north star.

How many experiments did your team run this week? How many hypotheses did you invalidate? These are the questions worth asking in your weekly review, not how many tickets were closed.

The teams building the best AI products right now are running 10 to 20 experiments per month. Not 10 to 20 features — experiments. Many of them fail. That is the point. A failed experiment is not a failure; it is a piece of evidence. A shipped feature that you cannot evaluate is the real failure, because now you have no idea whether you helped or hurt.

One practical forcing function: implement a lightweight experiment log. Date, hypothesis, metric, result, what you learned. Ours lives in a shared doc. We review it monthly. It is the most valuable artifact the team produces.
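A shared doc is enough. If you also want the log machine-readable, the same five fields fit in one line of JSONL; here is a sketch, with illustrative names.

```python
import json
from datetime import date

def log_experiment(hypothesis: str, metric: str, result: str, learned: str,
                   path: str = "experiment_log.jsonl") -> None:
    """Append one line per experiment: date, hypothesis, metric, result, what you learned."""
    entry = {
        "date": date.today().isoformat(),
        "hypothesis": hypothesis,
        "metric": metric,
        "result": result,      # "pass", "fail", or "inconclusive"
        "learned": learned,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```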

5. Stop separating the "what" from the "how."

On traditional software projects, the product team defines what to build and the engineering team defines how to build it. On AI projects, this split breaks down. The "how" — model choice, prompt strategy, retrieval architecture, chunking approach — often determines whether the "what" is even achievable.

The practical fix: engineers need to be in the hypothesis-setting conversation from the beginning. They are not implementation resources; they are co-researchers. A PM who hands engineers a spec without involving them in the experiment design is burning cycles. The people closest to the model behavior are the ones most likely to spot which hypotheses are worth testing.

To the PM or CTO Reading This

If you are leading an AI product and your team is shipping features but the metrics are not moving, the problem is almost certainly the operating model, not the people.

Your roadmap is a backlog of guesses. Some guesses are better than others, but they are all guesses until you have data. The only way to generate that data is to run experiments fast, measure them honestly, and let the results drive what you build next.

That requires:

  • An evaluation system you trust before you ship anything
  • Trace-level observability so you can diagnose failures, not just observe them
  • A team culture where failing an experiment is celebrated, not apologized for
  • A roadmap format that makes hypotheses explicit and outcomes measurable

The last one is the hardest. Stakeholders want features. Investors want features. Feature delivery is visible in a way that "we invalidated three bad hypotheses this month" is not. Part of your job as a product leader is translating experiment outcomes into a narrative that makes sense to people who think in roadmaps.

But internally, with your team: run the science.

What This Looks Like in Practice

For healthcare AI specifically, the stakes make this non-negotiable. A bad recommendation is not a UX annoyance — it is a patient safety event. We cannot ship-and-iterate the way a consumer app can. Every change needs to be evaluated against a defined safety and quality threshold before it goes to production.

That constraint, which felt like a burden at first, is actually the right forcing function for any AI product. It demands evaluation before shipping. It demands trace observability. It demands documented hypotheses. Healthcare just makes the cost of skipping these steps visible.
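One way to make that threshold mechanical rather than aspirational is a release gate in your pipeline: a script that takes the measured eval pass rate and refuses to let the change ship if it falls short. A sketch, with an invented threshold; the real number belongs to your safety and quality reviewers, not your build system.

```python
# release_gate.py -- a sketch of a CI gate: fail the build if the eval pass rate
# falls below the agreed safety/quality threshold. Usage: python release_gate.py 0.974
import sys

SAFETY_THRESHOLD = 0.98   # illustrative; set with your safety reviewers, not by engineering

def gate(pass_rate: float, threshold: float = SAFETY_THRESHOLD) -> bool:
    """Return True only if the measured eval pass rate clears the threshold."""
    return pass_rate >= threshold

if __name__ == "__main__":
    rate = float(sys.argv[1])   # produced by the eval harness in the earlier step
    print(f"eval pass rate {rate:.3f} vs threshold {SAFETY_THRESHOLD:.3f}")
    sys.exit(0 if gate(rate) else 1)
```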

The teams who figure out experiment-driven development early will compound their advantage. Every experiment they run adds to a body of evidence about what works in their specific domain, with their specific users, for their specific task. That evidence does not expire. It is organizational knowledge that makes every future experiment faster and sharper.

The teams who stay in feature-shipping mode are spinning their wheels. They are generating activity without generating learning.


After that Q3 with 12 features and zero movement, we stopped writing a roadmap and started writing a hypothesis board. Our next quarter had half the feature output and three times the metric movement.

Fewer things shipped. More things learned. The product actually got better.

That is the trade. Make it.