GET AI Labs
Adoption · RN-016

Measuring ROI on enterprise AI investments

Most AI return figures are unfalsifiable. A method for costing the full system, classifying which benefits are measurable, and instrumenting an initiative so its return can be defended after deployment.

Published: 2026 · 05
Read: 8 min
Author: GET Team
Category: Adoption

Most published figures on AI return on investment are not measurements. They are estimates produced by the team that wanted the budget, computed after the fact against a baseline no one recorded before the work began. The number is rarely false on purpose. It is unfalsifiable by construction: there is no controlled comparison, the costs counted stop at the model bill, and the benefit is attributed entirely to the AI even when three other things changed the same quarter. A figure that cannot be wrong is not evidence.

Measuring AI ROI honestly is harder than measuring the return on most enterprise software, because the costs are diffuse and recurring while the benefits are indirect and shared across systems. This is the maximize-return half of an AI strategy: funding initiatives against a defensible bar rather than a hope. What follows is a method — how to cost the full system, classify the kinds of return and which can actually be measured, estimate return before building, and instrument an initiative so the question is answerable once it is live.

Why AI ROI is hard to measure honestly

Three structural problems make AI return harder to measure than it looks. The first is hidden cost: the model API or training run is usually a minority of lifetime spend. The bulk goes to data preparation, integration, the evaluation suite, monitoring, human review, and operations, and most of those costs recur monthly. A case that counts only the model therefore understates the cost side of the ROI ratio by a large multiple.

The second is diffuse benefit: AI rarely produces a clean new revenue line, but instead shaves minutes off a workflow, raises a decision's hit rate, or lowers the variance of an outcome — gains spread across many transactions and people, easy to feel and hard to isolate. The third is attribution: when a metric moves after a deployment, the AI is one of several plausible causes, alongside a process change shipped with it, shifted headcount, or moved demand. Without a baseline recorded beforehand and a comparison group held aside, attributing the full delta to the model is an assumption.

The full cost stack of an AI system

The model is one line item in a system of many. Before estimating return, write down what the system actually costs to build and to keep running, because the recurring components are the ones that quietly erase a thin margin. In our experience the irreducible cost stack includes:

  • Data preparation and pipeline — sourcing, cleaning, labeling, and the engineering to keep inputs flowing as upstream systems change. Frequently the largest line, and almost entirely recurring.
  • Integration — wiring into the systems of record, identity, and workflow tools where the work happens, including certified connectors and change management.
  • Evaluation suite — the test sets, failure taxonomy, and thresholds needed to know whether the system works, maintained for every model version.
  • Model and inference — API spend, hosting, or training and serving cost. The most visible line, and usually not the biggest.
  • Monitoring and operations — drift detection, incident response, retraining, and the on-call burden of running a probabilistic system in production.
  • Human-in-the-loop — reviewers who handle low-confidence outputs, exceptions, and escalations. A real and often permanent labor cost, not a temporary scaffold.
  • Governance — model risk review, audit and lineage, documentation, and the periodic re-approval that regulated deployments require.

Data preparation and human-in-the-loop are the lines teams most often omit, and the ones most likely to decide whether the system is profitable. A defensible estimate separates one-time build cost from monthly run cost and projects the run cost across the life of the system, not the length of the pilot.
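The split between one-time build cost and monthly run cost can be sketched in a few lines of Python. This is a minimal illustration, not a costing tool: the `CostLine` type, every dollar figure, and the 36-month system life are hypothetical placeholders chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLine:
    """One line of the cost stack. All figures below are illustrative."""
    name: str
    build: float    # one-time cost to stand the component up
    monthly: float  # recurring cost to keep it running


def lifetime_cost(lines: list[CostLine], months: int) -> float:
    """All build costs plus the summed monthly run cost for every month."""
    build = sum(line.build for line in lines)
    run = sum(line.monthly for line in lines) * months
    return build + run


# Hypothetical stack mirroring the seven lines above; replace with real figures.
stack = [
    CostLine("data preparation and pipeline", build=120_000, monthly=15_000),
    CostLine("integration", build=80_000, monthly=3_000),
    CostLine("evaluation suite", build=40_000, monthly=4_000),
    CostLine("model and inference", build=10_000, monthly=6_000),
    CostLine("monitoring and operations", build=20_000, monthly=5_000),
    CostLine("human-in-the-loop", build=0, monthly=12_000),
    CostLine("governance", build=30_000, monthly=2_000),
]

pilot_view = lifetime_cost(stack, months=3)         # cost if you stop counting at the pilot
life_view = lifetime_cost(stack, months=36)         # cost across an assumed 36-month life
model_only = lifetime_cost([stack[3]], months=36)   # what a model-bill-only case would count
```

With these made-up inputs, the pilot-length view comes to 441,000, the 36-month view to 1,992,000, and the model line alone to 226,000. The numbers themselves mean nothing; the gap between the three views is the point.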

The categories of AI return, and which are measurable

Return arrives in distinct forms that differ sharply in how cleanly they can be measured. Conflating them produces business cases that mix a hard number with a hopeful one and present the sum as fact. Separate five categories, each held to its own evidentiary standard:

  • Cost reduction — fewer hours, lower unit cost, deflected volume. The most measurable, provided the pre-change baseline was recorded and the AI's operating cost is netted out.
  • Revenue — higher conversion, retention, or cross-sell. Measurable only with a holdout or staged rollout; without a comparison group, revenue is the least trustworthy claim of all.
  • Risk reduction — fewer errors, lower fraud loss, better compliance. Estimable through avoided-loss modeling, rarely directly measurable, because the counterfactual loss did not occur.
  • Cycle-time — faster turnaround, shorter queues. Directly measurable and a strong leading indicator, though time saved becomes money only when the freed capacity is redeployed or removed.
  • Capability and optionality — work that was previously impossible, or the option to build on a new foundation. Real but not quantifiable; argue it as strategy, never as a dollar figure.
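One way to keep the categories from being blended is to tag every claimed benefit with its evidentiary standard and report the totals separately. The sketch below is an assumption about how that could look, not a prescribed schema: the `Evidence` enum, the category keys, and the sample claims are hypothetical, and demoting a claim to "modeled" when its baseline or holdout is missing is our illustrative reading of the list above.

```python
from enum import Enum


class Evidence(Enum):
    MEASURED = "measured"    # observed against a recorded baseline or holdout
    MODELED = "modeled"      # estimated, e.g. through avoided-loss modeling
    STRATEGIC = "strategic"  # argued qualitatively, never given a dollar figure


def evidence_for(category: str, has_baseline: bool, has_holdout: bool) -> Evidence:
    """Map a benefit category to the strongest evidence class it can claim."""
    if category == "capability":
        return Evidence.STRATEGIC
    if category == "risk_reduction":
        return Evidence.MODELED
    if category == "revenue":
        # Assumption: without a comparison group, revenue is only a modeled claim.
        return Evidence.MEASURED if has_holdout else Evidence.MODELED
    if category in ("cost_reduction", "cycle_time"):
        return Evidence.MEASURED if has_baseline else Evidence.MODELED
    raise ValueError(f"unknown category: {category}")


def summarize(claims: list[dict]) -> dict[Evidence, float]:
    """Sum annual dollar claims per evidence class; never one blended total."""
    totals = {e: 0.0 for e in Evidence}
    for c in claims:
        ev = evidence_for(c["category"], c["has_baseline"], c["has_holdout"])
        if ev is Evidence.STRATEGIC:
            continue  # strategic benefits carry no dollar figure
        totals[ev] += c["annual_usd"]
    return totals


# Hypothetical claims for illustration only.
claims = [
    {"category": "cost_reduction", "annual_usd": 400_000, "has_baseline": True, "has_holdout": False},
    {"category": "revenue", "annual_usd": 250_000, "has_baseline": True, "has_holdout": False},
    {"category": "risk_reduction", "annual_usd": 150_000, "has_baseline": True, "has_holdout": False},
    {"category": "capability", "annual_usd": 0, "has_baseline": False, "has_holdout": False},
]
totals = summarize(claims)
```

Here the revenue claim has no holdout, so it lands in the modeled total alongside risk reduction. The business case would then report 400,000 measured and 400,000 modeled as two separate figures rather than a single 800,000.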

Estimating return before you build

The strongest time to estimate return is before committing the full budget, when the estimate can still change the decision. The aim is not a precise forecast — it is a bar. Decide in advance what return would justify the fully loaded cost, then ask what evidence would tell you whether the system can clear it. This connects directly to feasibility work: a small prototype scored against a real evaluation suite yields evidence on achievable accuracy, escalation rate, and the share of cases that still need a human — the inputs that drive every benefit estimate.

A defensible pre-build estimate has a recognizable shape. It states the baseline explicitly, expresses return as a range with named assumptions rather than a point, and nets the full operating cost against the gross benefit so the figure is a return and not a revenue line. Above all it names the one or two assumptions the whole case rests on — usually the achievable automation rate or the volume the system will see — so the prototype can test exactly those, and the project is funded against a bar rather than a hope.
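The shape of such an estimate can be sketched as follows. Everything here is hypothetical: the `Scenario` fields, the volumes, the saving per case, the run cost, and the bar are placeholders. For brevity the sketch treats the fully loaded cost as a single annual figure, which in practice would include amortized build cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Named assumptions behind one end of the range (all values illustrative)."""
    annual_volume: int       # cases the system will see per year
    automation_rate: float   # share of cases handled without a human
    saving_per_case: float   # gross saving per automated case, in dollars


def net_return(s: Scenario, annual_run_cost: float) -> float:
    """Gross benefit minus operating cost, so the result is a return, not a revenue line."""
    gross = s.annual_volume * s.automation_rate * s.saving_per_case
    return gross - annual_run_cost


def verdict(low_net: float, high_net: float, bar: float) -> str:
    """Compare the whole range, not a point estimate, against the bar."""
    if low_net >= bar:
        return "clears the bar"
    if high_net < bar:
        return "fails the bar"
    return "depends on the named assumptions"


def breakeven_automation_rate(s: Scenario, annual_run_cost: float, bar: float) -> float:
    """Automation rate the prototype must demonstrate for this scenario to hit the bar."""
    return (annual_run_cost + bar) / (s.annual_volume * s.saving_per_case)


ANNUAL_RUN_COST = 300_000.0  # hypothetical fully loaded annual cost
BAR = 100_000.0              # hypothetical net return that would justify funding

pessimistic = Scenario(annual_volume=100_000, automation_rate=0.25, saving_per_case=8.0)
optimistic = Scenario(annual_volume=150_000, automation_rate=0.50, saving_per_case=8.0)

low_net = net_return(pessimistic, ANNUAL_RUN_COST)
high_net = net_return(optimistic, ANNUAL_RUN_COST)
outcome = verdict(low_net, high_net, BAR)
required_rate = breakeven_automation_rate(pessimistic, ANNUAL_RUN_COST, BAR)
```

With these inputs the range runs from a loss of 100,000 to a gain of 300,000, so the verdict depends on the assumptions. At the pessimistic volume the system would need a 50% automation rate to reach the bar, which is exactly the figure a prototype can be asked to test.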

Instrumenting an initiative so return is measurable

Whether a return can be measured after deployment is decided before deployment, by what gets instrumented. The single most valuable and most neglected step is recording the baseline before anything changes; a baseline captured after launch is a guess wearing the costume of one. Where the risk of misattribution is high, a staged rollout or a held-aside comparison group turns a plausible story into evidence by showing what would have happened without the system.
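A common way to use a recorded baseline together with a held-aside group is a difference-in-differences comparison. The article does not prescribe this technique; it is offered here as one possible sketch. The handle-time samples are invented, and the method rests on the assumption that both groups would have followed the same trend without the system.

```python
from statistics import mean


def diff_in_diff(
    treated_before: list[float],
    treated_after: list[float],
    holdout_before: list[float],
    holdout_after: list[float],
) -> float:
    """Change in the treated group minus the change in the holdout group.

    The holdout's change stands in for what would have happened anyway
    (process changes, demand shifts), so it is subtracted out.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    holdout_change = mean(holdout_after) - mean(holdout_before)
    return treated_change - holdout_change


# Hypothetical average handle times in minutes, recorded before and after launch.
treated_before = [20, 22, 18, 20]
treated_after = [14, 16, 15, 15]
holdout_before = [21, 19, 20, 20]
holdout_after = [18, 19, 17, 18]

naive_effect = mean(treated_after) - mean(treated_before)
attributed_effect = diff_in_diff(
    treated_before, treated_after, holdout_before, holdout_after
)
```

In this invented data the treated group improved by 5 minutes, but the holdout improved by 2 minutes without the system, so only 3 minutes can reasonably be attributed to it. A real analysis would also need enough observations to judge whether the difference is distinguishable from noise.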

Instrumentation also means watching the right indicator at the right time. Leading indicators move within weeks and tell you whether the system is on track: accuracy against the evaluation suite, automation rate, human escalation rate, adoption, latency. Lagging indicators move over quarters and tell you whether the return materialized: realized cost per unit, end-to-end cycle time, revenue or retention against the holdout, error rates. Leading indicators catch a stalling initiative early; lagging indicators settle the ROI question. Tracking both against a recorded baseline is what makes the answer defensible rather than asserted.
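Tracking both kinds of indicator against the recorded baseline might look like the sketch below. The `Indicator` type, the sample indicators, their baselines and targets, and the 30% progress threshold are all hypothetical; the only idea taken from the text is that leading indicators are checked early and every reading is compared with a pre-launch baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    name: str
    kind: str         # "leading" (moves within weeks) or "lagging" (moves over quarters)
    baseline: float   # value recorded before anything changed
    target: float     # value the business case assumes


def gap_closed(ind: Indicator, current: float) -> float:
    """Fraction of the baseline-to-target gap closed so far.

    Works for metrics that should rise or fall, because the sign of the
    gap follows the direction of the target. Can be negative or exceed 1.
    """
    gap = ind.target - ind.baseline
    if gap == 0:
        raise ValueError(f"{ind.name}: target equals baseline")
    return (current - ind.baseline) / gap


def stalling(indicators: list[Indicator], readings: dict[str, float],
             min_progress: float) -> list[str]:
    """Names of leading indicators that have closed less than min_progress of their gap."""
    return [
        ind.name
        for ind in indicators
        if ind.kind == "leading" and gap_closed(ind, readings[ind.name]) < min_progress
    ]


# Hypothetical indicators and readings for illustration.
indicators = [
    Indicator("human escalation rate (%)", "leading", baseline=40.0, target=20.0),
    Indicator("automation rate (%)", "leading", baseline=0.0, target=60.0),
    Indicator("cost per unit (USD)", "lagging", baseline=10.0, target=6.0),
]
readings = {
    "human escalation rate (%)": 30.0,
    "automation rate (%)": 12.0,
    "cost per unit (USD)": 9.0,
}
stalled = stalling(indicators, readings, min_progress=0.3)
```

In this example the escalation rate has closed half its gap while the automation rate has closed only a fifth, so the automation rate is flagged early. The lagging cost-per-unit figure is left out of the early-warning check and judged later against the same baseline.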

Why most ROI claims are unreliable

Most AI ROI claims fail in the same four ways: the cost stack stops at the model bill, the baseline is never recorded, estimable and strategic benefits are folded into one measured-looking number, and no comparison group rules out what else changed. A defensible case avoids all four and is willing to be wrong. It usually shows a smaller number than the optimistic version — and that is the point: a figure that survives scrutiny is the one you can act on, the one that tells you which investments to fund, which to defer, and which to stop.

Bottom line: what to do next

Treat AI return as something estimated against a bar before building and confirmed against a baseline after deployment, not narrated afterward. The organizations that know whether their AI is paying off are not the ones with the most impressive headline figures. They are the ones who counted the whole cost stack, recorded the baseline before they changed anything, and kept what they measured separate from what they modeled.

Before approving the next AI investment, require three artifacts: a fully loaded cost projection across the life of the system, a recorded baseline with the leading and lagging indicators to be tracked against it, and a return estimate stated as a range with its load-bearing assumptions named. If those three exist and agree, the investment can be defended. If they do not, the figure is a hope — and a hope is not a business case.

Authored by GET Team · GET AI Labs