What Separates the 5% of AI Pilots That Actually Show P&L Impact

Davinci AI Team • 7 min read • May 2026

Most pilots return nothing. The ones that work all do four things.

MIT studied enterprise generative AI pilots last year and found that 95% of them showed no measurable impact on the bottom line.

McKinsey’s 2025 State of AI: only 39% of companies saw any EBIT lift. Just 6% qualify as “high performers.” KPMG Canada’s number is the one that keeps showing up in our conversations — 93% of Canadian business leaders are piloting AI, but only 2% can point to measurable ROI.

If you’re a mid-market operator, those numbers feel familiar. You ran something. It demoed well. A few people loved it. The board meeting came up six months later and you couldn’t answer the question: what did this actually do for us?

The natural reaction is to blame the technology. Wrong vendor. Wrong model. Wrong moment. But after working through enough of these pilots, the pattern is hard to miss: the failed ones look almost identical to each other. So do the successful ones.

It’s not a technology problem. It’s a design problem.

The four things the 5% always do

We’ve reviewed enough mid-market AI pilots — ours and other people’s — to spot the through-line. Every pilot that produced a real P&L number had these four characteristics. Every one that didn’t was missing at least two.

1. They picked a workflow, not a “use case.”

The 95% start with something fuzzy. “Customer service.” “Operations.” “Reduce manual work.” That’s a category, not a target.

The 5% start with a workflow that has a name, an owner, a frequency, and a number attached to it. Not “improve sales” — “reduce the time our four inside-sales reps spend on post-meeting CRM updates, currently averaging 47 minutes per rep per day.” That’s a workflow. You can measure it. You can rebuild it. You can prove what changed.

When the scope is fuzzy, every result is debatable. When the scope is specific, the result tells you whether to expand or kill it.
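To make "a name, an owner, a frequency, and a number" concrete, here is a minimal sketch of the CRM example written down as data. The `Workflow` class, the owner title, and the five-day working week are our assumptions for illustration; only the four reps and the 47 minutes come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """A pilot target with a name, an owner, a frequency, and a number."""
    name: str
    owner: str
    people: int                          # how many people perform the workflow
    minutes_per_person_per_day: float    # the measured number
    working_days_per_week: int = 5       # assumption, not from the article

    def hours_per_week(self) -> float:
        """Total team time spent on this workflow in one week."""
        minutes = (
            self.people
            * self.minutes_per_person_per_day
            * self.working_days_per_week
        )
        return minutes / 60


crm_updates = Workflow(
    name="Post-meeting CRM updates",
    owner="Inside sales lead",           # hypothetical owner
    people=4,
    minutes_per_person_per_day=47,
)

print(f"{crm_updates.hours_per_week():.1f} hours per week")
```

Under those assumptions the team spends a little under sixteen hours a week on the task. That is the figure a pilot has to move, and it is specific enough that nobody can argue about whether it moved.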

2. They wrote the ROI baseline before they built anything.

Here’s the most reliable predictor of failure: nobody captured the “before” number. The pilot launches, runs for three months, everyone agrees it “feels faster” — and there’s nothing to compare it to.

The 5% spend the first week of a pilot measuring the status quo. Cost per ticket today. Hours per report today. Error rate today. Time-to-decision today. They write it down. Then they build.

This isn’t a finance exercise — it’s a discipline exercise. The act of measuring the current state forces you to define what success looks like before you fall in love with the demo.
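A baseline can be as simple as a dated record of the numbers you intend to move. The sketch below is one possible shape, not a prescribed format. Every metric name, date, and value in it is hypothetical, chosen only to show the before-and-after comparison.

```python
from datetime import date

# Hypothetical "before" numbers, captured in week one of the pilot.
baseline = {
    "captured_on": date(2026, 1, 5),
    "metrics": {
        "cost_per_ticket_usd": 12.50,
        "hours_per_report": 6.0,
        "error_rate_pct": 4.0,
    },
}

# Hypothetical numbers, measured the same way at day ninety.
day_90 = {
    "cost_per_ticket_usd": 10.00,
    "hours_per_report": 4.5,
    "error_rate_pct": 4.2,
}


def percent_change(before: float, after: float) -> float:
    """Relative change versus the baseline; negative means the number fell."""
    return (after - before) / before * 100


def compare(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Compare only the metrics that have a recorded baseline."""
    return {
        name: round(percent_change(value, after[name]), 1)
        for name, value in before.items()
        if name in after
    }


changes = compare(baseline["metrics"], day_90)
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

In this made-up data, cost and hours fall while the error rate creeps up. A written baseline surfaces that kind of mixed result; "feels faster" hides it.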

3. They engineered the human review loop. They didn’t bolt it on.

We’ve written before about why augmentation beats end-to-end autonomy in 2026. The 5% take that further. They don’t stop at having a human in the loop — they design the loop on purpose.

Who reviews the AI’s output, and in what role? At what threshold? What happens to edge cases? Where do corrections get logged so the system learns? In the 5%, those questions have written answers before the AI processes a single request. In the 95%, they get figured out in the moment, usually after something goes sideways.

The review loop is not overhead. It’s where the trust gets built and where the data for tomorrow’s improvement comes from. Skipping it doesn’t make the pilot lighter — it makes it disposable.
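Writing the answers down can be as literal as encoding them. The sketch below shows one way a designed review loop might look, assuming the AI system exposes a confidence score for each output (many do not, in which case the routing rule has to key off something else). The `ReviewLoop` class, the 0.85 threshold, the role name, and the ticket data are all hypothetical.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.85  # hypothetical: below this, a human reviews the output


@dataclass
class ReviewLoop:
    reviewer_role: str
    threshold: float = REVIEW_THRESHOLD
    corrections: list[dict] = field(default_factory=list)

    def route(self, confidence: float, is_edge_case: bool = False) -> str:
        """Decide who handles an AI output before it is used."""
        if is_edge_case:
            return "escalate"        # edge cases always go to a person
        if confidence < self.threshold:
            return "human_review"
        return "auto_approve"

    def log_correction(self, item_id: str, ai_output: str, corrected: str) -> None:
        """Keep every human correction so it can feed later improvements."""
        self.corrections.append(
            {"item_id": item_id, "ai_output": ai_output, "corrected": corrected}
        )


loop = ReviewLoop(reviewer_role="Support team lead")  # hypothetical role
print(loop.route(confidence=0.95))
print(loop.route(confidence=0.60))
print(loop.route(confidence=0.99, is_edge_case=True))
loop.log_correction("ticket-101", "Refund approved", "Refund denied: outside window")
print(len(loop.corrections), "correction logged")
```

The point is not the code. The reviewer, the threshold, the edge-case rule, and the correction log each exist as an explicit decision before the first request is processed.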

4. They put one operator on it for ninety days.

This is the step everyone skips. A pilot is announced, a vendor is signed, a workshop happens — and then nobody owns it. The original sponsor moves on to the next priority. The team using it asks questions of an empty Slack channel. Tweaks don’t happen. Drift sets in.

The 5% assign one person, not a committee and not a vendor, who owns the pilot for ninety days. Their job is unglamorous: review the outputs, talk to the users, log the corrections, adjust the workflow, report the numbers monthly. After ninety days, that person decides whether to expand, refine, or kill it.

No owner, no operator, no honest review at day ninety — and you don’t have a pilot. You have a subscription.

Why the 95% miss it

Each of the four has an obvious inverse, and the failed pilots tend to make all four mistakes at once:

They scope by category instead of workflow, so nothing is provable.
They skip the baseline, so the answer to “did it work?” is a feeling.
They treat the review loop as overhead instead of design, so trust never compounds.
They announce a pilot and forget to staff it, so it dies of neglect rather than failure.

None of these are technology issues. They’re operating issues. The 5% are not running better AI. They’re running better pilots.

The fastest test you can run

If you’re sitting on an AI pilot right now — running, planned, or recently quiet — there’s a one-paragraph test. Write down the answers to these:

Which named workflow does this pilot target?
What was the baseline number, in real units, before we started?
Who reviews the output, and where do their corrections get logged?
Who owns this for the next ninety days, and when do they report?

If you can answer all four cleanly, you’re in the 5%. If two or more are blank or vague, that’s not a status update — that’s the diagnostic. The pilot isn’t failing because of the model. It’s failing because the design is incomplete.
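For teams tracking several pilots at once, the test is simple enough to script. The sketch below only detects blank answers; whether an answer is vague still takes a human read. The `diagnose` function and its return labels are our own illustration, and the "one gap" label covers a case the rule above leaves open.

```python
QUESTIONS = [
    "Which named workflow does this pilot target?",
    "What was the baseline number, in real units, before we started?",
    "Who reviews the output, and where do their corrections get logged?",
    "Who owns this for the next ninety days, and when do they report?",
]


def diagnose(answers: dict[str, str]) -> str:
    """Count unanswered questions and apply the two-or-more rule of thumb."""
    blanks = [q for q in QUESTIONS if not answers.get(q, "").strip()]
    if not blanks:
        return "all four answered"
    if len(blanks) >= 2:
        return "design incomplete"
    return "one gap to close"  # assumption: the rule does not label this case


# Hypothetical pilot: scope and baseline are written down, the rest is not.
answers = {
    QUESTIONS[0]: "Post-meeting CRM updates for inside sales",
    QUESTIONS[1]: "47 minutes per rep per day",
    QUESTIONS[2]: "",
    QUESTIONS[3]: "",
}
print(diagnose(answers))
```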

We’ve helped clients fix every one of those gaps after the fact, and the answer is almost never “more AI.” It’s tightening the scope, capturing the missing baseline, putting one operator in the chair, and running it for another ninety days the right way.

That’s the part most articles skip. The 5% pattern isn’t a clever framework. It’s the unglamorous operational discipline that turns a demo into a result.

You don’t need a better model. You need a better pilot.

Turn Your AI Demo Into a Result

Do not settle for a feeling when your business requires a hard P&L impact. Take our AI Readiness Assessment today to evaluate your AI adoption, capture your missing baseline, and build the unglamorous discipline needed to run your pilot the right way.