---
title: "How carriers should measure AI ROI in fraud investigation (hint: it's not NPS or cycle time)"
description: "NPS, CSAT, and cycle time are the wrong fraud investigation AI ROI metrics. Score the agent on yield, leakage recovered, cost per case, and defensibility."
date: "2026-06-09"
lastModified: "2026-06-09"
author: "Pankaj Dhariwal"
tags: ["Guides"]
canonical: "https://gethesperai.com/blog/measuring-fraud-investigation-ai-roi/"
---

# How carriers should measure AI ROI in fraud investigation (hint: it's not NPS or cycle time)

> **TL;DR** NPS, CSAT, and cycle time are customer-experience metrics - the right scorecard for a claims-handling AI and the wrong one for fraud investigation. An investigation serves the SIU, the CFO, the reserving actuary, and the state DOI, not the claimant, so its ROI has to be scored on loss avoidance and defensibility: fraud yield, leakage recovered, cost per investigation, capacity, false-positive reduction, and time-to-defensible-decision.
>
> - Six outcome metrics replace NPS, CSAT, and raw cycle time
> - Yield is gated by coverage: ~25% manual vs 100% with AI
> - Per-case cost ~$2,500 manual vs ~$150 sizes the denominator

- **$122B** - Annual US P&C fraud loss (Deloitte 2025; an estimated 10% of claims)
- **7-14%** - Leakage as % of claims spend (EY claims quality assessments)
- **14+ days → 2-4 hrs** - Investigation time per case (Manual SIU vs AI, Hesper benchmark)
- **~25% → 100%** - Flagged-claim coverage (The yield denominator, manual vs AI)

NPS, CSAT, and cycle time are the wrong metrics for fraud investigation AI, because they measure the claimant's experience and a fraud investigation does not exist to please the claimant. In a May 2026 Claims Journal interview, Openly's Senior Director of Claims described measuring the carrier's claims AI on "NPS and CSAT surveys ... while also measuring cycle time and reviewing our internal quality results," and cited a 30-plus day cycle-time reduction. That is exactly the right scorecard - for a contents-pricing and claims-handling AI. It is exactly the wrong scorecard for an AI that investigates suspected fraud.

The reason is a category boundary, not a quibble over which survey to run. The customer of a contents-pricing AI is the policyholder. The customer of a fraud investigation is the SIU, the CFO, the reserving actuary, and the state DOI examiner. Its objective is not satisfaction or speed for its own sake - it is loss avoidance and a decision the carrier can defend. A faster, friendlier process that confirms less fraud and recovers fewer dollars is a worse outcome, not a better one. Scored on NPS, the optimal move is to pay questionable claims so the score stays high. That is the opposite of what the function is for.

This post defines the six outcome metrics that actually measure a fraud investigation agent - fraud yield rate, claims leakage recovered, cost per investigation, SIU capacity, false-positive-rate reduction, and time-to-defensible-decision - and shows how to instrument each one. It is the metric-definitions layer beneath the board case in [the CFO ROI memo for AI claims investigation](/blog/cfo-roi-memo-ai-claims-investigation), and it sits inside the broader picture of [how uninvestigated claims drain profitability](/blog/claims-fraud-leakage-pillar).

## Why NPS, CSAT, and cycle time are the wrong yardstick for fraud investigation

Start with what those metrics measure and who they serve. NPS and CSAT score how a policyholder feels about the claims experience. Cycle time measures how fast a case moves from notice to close. For a claims-handling AI - triage, contents inventory, pricing, document intake - those are the correct yardsticks, because the policyholder is the customer and a faster, smoother experience is the product. The Openly example is well chosen and accurate for that use case. The error is transferring the same scorecard to a different layer of the stack.

Fraud investigation is a loss-avoidance function. Its job is to take a flagged claim and determine - with evidence a carrier can stand behind - whether the claim is what it appears to be, and to recover or deny the dollars that should not be paid. None of the people this work serves will ever fill out an NPS survey. The reserving actuary needs an accurate exposure number. The CFO needs the leakage off the book. The state DOI examiner needs a documented decision under the carrier's antifraud plan. The claimant, in a confirmed-fraud case, is the adverse party. Measuring that function on claimant satisfaction is a category error.

It is also actively misleading, because optimizing the wrong metric degrades the right outcome. EY puts claims leakage - the gap between what a carrier paid and what it owed - at approximately 7% to 14% of total claims spend, and names "inadequate investigation of injury causation and liability" as one of four root causes, per [EY](https://www.ey.com/en_us/insights/insurance/claims-litigation). Read that carefully: leakage is an investigation-quality problem, not a cycle-time problem. A faster process that investigates less thoroughly widens the exact gap the function exists to close. Speed that comes at the cost of evidence completeness is negative ROI dressed up as efficiency.

This is why the layer matters. Detection is upstream; investigation is downstream. Detection vendors answer "did we flag the right claims?" and are correctly measured on precision and recall. Claims-handling AI answers "did we move the claim smoothly?" and is correctly measured on cycle time and CSAT. Fraud investigation answers a third question - "of the claims we flagged, how many did we resolve, recover on, and defend?" - and that question has no published measurement framework. Borrowing detection or handling metrics to fill the gap flatters speed while ignoring whether fraud was caught. The investigation layer needs its own scorecard.

## The six ROI metrics that actually measure a fraud investigation agent

Investigation ROI is outcome- and loss-based. Six metrics carry it, and each maps to a question one of the function's real customers is already asking. None of them is a satisfaction delta, and none of them can be gamed by going faster at the expense of thoroughness.

### The six metrics

1. Fraud yield rate - confirmed-fraud cases divided by investigated cases. The single most important number, because it tells you whether the investigations are finding anything. It is gated by coverage, which is why it gets its own section below.
2. Claims leakage recovered, in dollars - the difference between what was paid and what was owed per the contract, identified and recovered or denied. This is the board-facing numerator, sized by EY at 7-14% of spend.
3. Cost per investigation and cost per SIU referral - the fully loaded cost to work one flagged claim, and the marginal cost of working one more. This is the denominator that determines whether full coverage is affordable.
4. SIU capacity per investigator - investigations completed per investigator per period, which determines how much of the flag pile can be worked at all. Manual throughput runs about 10 cases per investigator per month.
5. False-positive-rate reduction - measured against a rules-based detection baseline of 60-85%, this captures how much investigator attention stops being wasted on claims that were never fraud.
6. Time-to-defensible-decision - not time-to-close, but time to an audit-ready finding with an evidence chain and explicit support for the action taken.
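The definitions above reduce to simple ratios over counts a carrier already logs. The following is a minimal sketch, not a reference implementation: the `PeriodCounts` fields, the `scorecard` function, and the example inputs are all illustrative assumptions, and the numbers are placeholders rather than carrier data. It covers five of the six metrics and adds coverage as the companion to yield; time-to-defensible-decision needs timestamps rather than counts, so it is left out here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodCounts:
    """Aggregates for one reporting period. All field names are illustrative."""
    flagged: int               # claims flagged by the detection layer
    investigated: int          # flagged claims worked to a finding
    confirmed_fraud: int       # investigated claims confirmed as fraud
    cleared: int               # investigated claims dispositioned as not fraud
    leakage_recovered: float   # dollars recovered or denied on investigated claims
    siu_cost: float            # fully loaded SIU cost for the period
    investigators: int         # investigator headcount


def ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def scorecard(c: PeriodCounts) -> dict[str, float]:
    """Compute the count-based metrics from one period's aggregates."""
    return {
        "coverage": ratio(c.investigated, c.flagged),
        "fraud_yield_rate": ratio(c.confirmed_fraud, c.investigated),
        "leakage_recovered": c.leakage_recovered,
        "cost_per_investigation": ratio(c.siu_cost, c.investigated),
        "cases_per_investigator": ratio(c.investigated, c.investigators),
        "false_positive_share": ratio(c.cleared, c.investigated),
    }


# Placeholder inputs chosen only to exercise the arithmetic.
example = PeriodCounts(
    flagged=1_000,
    investigated=250,
    confirmed_fraud=50,
    cleared=180,
    leakage_recovered=400_000.0,
    siu_cost=625_000.0,
    investigators=25,
)

for name, value in scorecard(example).items():
    print(f"{name}: {value:,.2f}")
```

Tracking `false_positive_share` before and after deployment gives the reduction; a single period only gives the level.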

Notice what these six have in common: each survives the Bain leakage test. Bain & Company defines leakage as "the difference between what is paid vs. what is owed" and links generative AI in claims to a 30% to 50% reduction in total leakage and a 20% to 25% decrease in loss-adjusting expenses, per [Bain](https://www.bain.com/insights/100-billion-dollar-opportunity-for-generative-ai-in-p-and-c-claims-handling/). The six investigation metrics are the operational decomposition of that 30-50% range. NPS captures none of it. Cycle time captures the LAE side at best and says nothing about leakage. The right metrics are precisely the ones that move when leakage moves.

The table below is the direct swap. For each customer-experience metric a carrier might be tempted to carry over, it gives the reason that metric misleads when applied to investigation, and the outcome metric to use in its place. This is the core argument of the post in one view: do not reach for a new survey, reach for a different category of number.

| Customer-experience metric | Why it misleads for fraud investigation | Investigation-ROI metric to use instead |
| --- | --- | --- |
| NPS | Scores claimant sentiment; the claimant is the adverse party in a confirmed-fraud case | Fraud yield rate |
| CSAT | A satisfied claimant on an overpaid claim is leakage, not success | Claims leakage recovered ($) |
| Cycle time / time-to-close | Rewards a fast close on a thin file; a closed case is not a resolved one | Time-to-defensible-decision |
| % of cases auto-closed | Auto-closing without investigation pays the uninvestigated tail | Cost per investigation / per referral |
| Cases closed per month | Counts noise; 60-85% of rules-based alerts are false positives | False-positive-rate reduction |
| Handler throughput | Measures handling speed, not how much of the flag pile is worked | SIU capacity per investigator |

> **The distinction this post turns on**
>
> Customer-experience metrics ask whether the claimant was happy and the case moved fast. Investigation-ROI metrics ask whether fraud was found, recovered, and defended. The two are not different points on the same scale - they measure different functions with different customers. A friendlier, faster process that confirms less fraud scores higher on the first set and lower on the second. When they conflict, the second set is the one tied to the loss ratio.

## Fraud yield rate - the metric cycle time hides

Fraud yield rate is the share of investigated claims that turn out to be confirmed fraud: confirmed-fraud cases divided by investigated cases. It is the metric cycle time hides, because the two can move in opposite directions. An agent that closes cases in minutes but confirms nothing has not created value - it has destroyed it, while reporting a beautiful cycle-time number. Yield is the check that keeps speed honest. A scorecard that tracks cycle time without yield can show green while the unit confirms less fraud every quarter.

The external evidence says most fraud is never confirmed today. Deloitte estimates current detection rates at 20% to 40% for soft fraud and 40% to 80% for hard fraud, against an estimated 10% of P&C claims being fraudulent and a $122 billion annual P&C fraud loss, per [Deloitte](https://www.deloitte.com/us/en/insights/industry/financial-services/financial-services-industry-predictions/2025/ai-to-fight-insurance-fraud.html). Those are detection rates - the share of fraud that gets flagged in the first place. Yield sits one layer further down: of the claims that do get flagged, what share get investigated to a confirmed conclusion. When most fraud is not even flagged and only a fraction of flags are worked, portfolio yield is structurally capped long before any cycle-time improvement enters the picture.

The cap is coverage. Yield is a rate, and a rate has a denominator. If a manual SIU investigates only about 25% of the claims it flags - because each case takes 14+ days and an investigator already carries 200+ cases - then 75% of the flag pile never enters the yield calculation at all. Investigating 100% of flagged claims at an honest yield beats investigating 25% at any speed, because the 25% case leaves three-quarters of the recoverable exposure untouched. Coverage is the hidden multiplier sitting behind the yield number, and it is the single biggest loss-cost lever in the stack.

| Flagged-claim coverage (the yield denominator) | Share of flagged claims fully investigated |
| --- | --- |
| Manual SIU | ~25% |
| AI investigation | 100% |

This is the coverage shift from roughly 25% to 100% of flagged claims, and it is why yield is the metric to instrument first. An AI investigation agent does not raise yield by being cleverer on the cases a human already works - it raises portfolio-level confirmed fraud by working the cases a human never reaches. For SIU leaders building the operational view around this, the four-quadrant scorecard in [the 12 SIU KPIs every director should track](/blog/siu-kpis-what-to-track-2026) defines yield, coverage, and the capacity metrics at the team level; this post is the ROI-facing complement.
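The coverage argument can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: the flag volume and the yield rate are hypothetical placeholders, and the yield is held equal across both scenarios. If a manual team already works its highest-risk flags first, yield on the remaining flags may be lower, so treat the ratio as an upper bound rather than a forecast.

```python
def confirmed_cases(flagged: int, coverage: float, yield_rate: float) -> float:
    """Expected confirmed-fraud cases: flags x share investigated x yield on those."""
    return flagged * coverage * yield_rate


FLAGGED = 1_000     # hypothetical flag volume for one period
YIELD_RATE = 0.20   # hypothetical yield, held constant across scenarios (an assumption)

manual = confirmed_cases(FLAGGED, coverage=0.25, yield_rate=YIELD_RATE)
full = confirmed_cases(FLAGGED, coverage=1.00, yield_rate=YIELD_RATE)

print(f"manual coverage (25%): {manual:.0f} confirmed cases")
print(f"full coverage (100%):  {full:.0f} confirmed cases")
print(f"multiple: {full / manual:.1f}x")
```

The yield rate is identical in both runs; only the denominator changes. That is the sense in which coverage sits behind the yield number as a multiplier.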

## Putting dollars on it - leakage recovered and cost per investigation

Two numbers carry the dollar case to a finance reviewer: leakage recovered as the numerator and cost per investigation as the denominator. Neither is a satisfaction metric, and both are defensible line by line in a board setting. The first sizes the prize; the second proves the prize is reachable at a cost the loss-recovery story justifies.

On the numerator, the pool is large and well documented. EY puts leakage at 7% to 14% of total claims spend and notes that defense and cost containment alone run more than $23 billion a year across the industry, per [EY](https://www.ey.com/en_us/insights/insurance/claims-litigation). Bain links generative AI in claims to a 30% to 50% reduction in total leakage and more than $100 billion in economic benefit globally, per [Bain](https://www.bain.com/insights/100-billion-dollar-opportunity-for-generative-ai-in-p-and-c-claims-handling/). These are industry figures, not a Hesper outcome - the carrier-specific number is its own flagged-claim volume run against its own recovery rate. But they establish that the recoverable pool is measured in points of total spend, not basis points.

On the denominator, unit cost is what historically made full coverage unaffordable. A manual SIU investigation runs about $2,500 per case at 14+ days of investigator attention, and one investigator completes around 10 investigations per month. At that cost and throughput, investigating 100% of flags would require a headcount bill the recovery story cannot justify, so carriers ration coverage down to the roughly 25% a human team can reach. AI investigation runs about $150 per case and lifts throughput toward 800+ cases per investigator per month. The marginal cost of investigating one more flagged claim - one more SIU referral - falls by roughly 94%, which follows directly from those two per-case costs, and that drop is the lever that makes full coverage economic.
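The 94% figure and the headcount problem both follow from the per-case costs and throughputs quoted above. A small sketch of that arithmetic, where the monthly flag volume is a hypothetical placeholder and the function names are illustrative:

```python
import math

MANUAL_COST_PER_CASE = 2_500   # dollars, from the text
AI_COST_PER_CASE = 150         # dollars, from the text
MANUAL_THROUGHPUT = 10         # cases per investigator per month, from the text
AI_THROUGHPUT = 800            # cases per investigator per month, from the text
MONTHLY_FLAGS = 1_000          # hypothetical flag volume


def cost_reduction(old: float, new: float) -> float:
    """Fractional drop in per-case cost."""
    return (old - new) / old


def investigators_needed(monthly_flags: int, per_investigator: int) -> int:
    """Headcount required to work every flag in the month."""
    return math.ceil(monthly_flags / per_investigator)


print(f"per-case cost reduction: {cost_reduction(MANUAL_COST_PER_CASE, AI_COST_PER_CASE):.0%}")
print("investigators for 100% coverage:",
      investigators_needed(MONTHLY_FLAGS, MANUAL_THROUGHPUT), "manual vs",
      investigators_needed(MONTHLY_FLAGS, AI_THROUGHPUT), "with AI")
print(f"monthly cost at 100% coverage: ${MONTHLY_FLAGS * MANUAL_COST_PER_CASE:,} manual "
      f"vs ${MONTHLY_FLAGS * AI_COST_PER_CASE:,} with AI")
```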

For a finance reviewer, the ROI is the incremental leakage recovered from the previously uninvestigated flag volume, net of platform spend - modeled on the carrier's own inputs, not on a Hesper-published savings figure. Keep deterrence and reputational effects out of the headline number; they are real but hard to attribute. The payback math, IRR on incremental investigated cases, and a pilot structure that funds proof rather than a leap are worked through in [the CFO ROI memo for AI claims investigation](/blog/cfo-roi-memo-ai-claims-investigation), and three carrier-tier scenarios that apply these metrics live in [the ROI case studies for AI claims investigation](/blog/roi-ai-claims-investigation-case-studies).

| ROI input | Manual SIU | AI investigation (Hesper) |
| --- | --- | --- |
| Flagged-claim coverage | ~25% | 100% |
| Throughput per investigator / month | ~10 cases | 800+ cases |
| Cost per investigation | ~$2,500 | ~$150 |
| Cycle time per case | 14+ days | 2-4 hours |
| False-positive handling | Each alert worked by hand; 60-85% are not fraud | Every alert triaged; investigator attention reserved for confirmed signal |
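One way to turn those inputs into the finance reviewer's number is sketched below. Every value passed in is a hypothetical placeholder that a carrier would replace with its own flag volume, yield, recovery per confirmed case, and platform spend; none of them is a Hesper outcome or a published savings figure. The function name and parameters are illustrative, and the yield on newly investigated flags is an input precisely because it may differ from the yield on the flags a manual team already works.

```python
def incremental_net_recovery(
    flagged: int,
    baseline_coverage: float,
    yield_on_new: float,
    avg_recovery_per_confirmed: float,
    platform_spend: float,
) -> float:
    """Leakage recovered from previously uninvestigated flags, net of platform spend."""
    newly_investigated = flagged * (1 - baseline_coverage)
    recovered = newly_investigated * yield_on_new * avg_recovery_per_confirmed
    return recovered - platform_spend


# Hypothetical inputs for illustration only.
net = incremental_net_recovery(
    flagged=1_000,
    baseline_coverage=0.25,
    yield_on_new=0.10,
    avg_recovery_per_confirmed=8_000.0,
    platform_spend=750 * 150.0,  # 750 newly investigated cases at ~$150 each
)
print(f"incremental net recovery: ${net:,.0f}")
```

Deterrence and reputational effects are deliberately absent from the function, which keeps the headline number attributable line by line.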

> Leakage recovered over cost per investigation is the only ROI ratio that survives a board review. NPS tells you the claimant was happy on a claim you may have overpaid. The investigation scorecard tells you what you stopped paying and whether the file would survive a DOI audit.
>
> - Hesper AI product research

## Time-to-defensible-decision, not time-to-close

Speed does belong on the investigation scorecard - but as time-to-defensible-decision, not time-to-close. Time-to-close measures handling speed: how long until a case is marked done. Time-to-defensible-decision measures how long it takes to produce an output the carrier can stand behind - an audit-ready finding with an evidence chain and explicit support for the action taken. A fast close on a thin file is a liability, because the speed is purchased by skipping the evidence that makes the conclusion defensible. The two metrics can point in opposite directions, and only one is tied to ROI.

The stakes are why the distinction matters. A denial or SIU referral that is not backed by a documented, reasonable investigation is bad-faith and unfair-claims-practices exposure. The output has to survive an examination under oath, a bad-faith suit, and a state DOI audit under the antifraud-plan filing requirements of NAIC Model Act #680 - adopted in 48 states - and documented-decision rules such as California 10 CCR 2698.36. None of that is captured by time-to-close. A case can close fast and still fail the audit. Time-to-defensible-decision is the only speed metric that prices in whether the fast output would hold up.

This is where speed and defensibility stop trading off. A manual investigation is slow because a human investigator's attention is the bottleneck and the phases run one after another. An AI investigation agent runs 15+ investigation phases in parallel on every flagged claim - document forensics, OSINT, statement cross-referencing, timeline reconstruction, financial-pattern analysis - and logs every step with sources, reasoning, and timestamps. That compresses the cycle from 14+ days to 2-4 hours while making the output more complete, not less. The audit trail is native to the process, not bolted on after, so the fast finding is also the defensible one.

This is the move from fraud detection to fraud resolution. Detection ends at a score; resolution ends at an audit-ready decision. The investigator's role shifts from execution to decision-making - the agent assembles the evidence and the finding, and a human SIU lead reviews, overrides where needed, and owns the call. As Deloitte's claims-practice leader put it in coverage of the same fraud predictions, "AI plus human is going to be better than human alone or AI alone," per [Insurance Journal](https://www.insurancejournal.com/magazines/mag-features/2025/06/16/827428.htm). Time-to-defensible-decision is the metric that captures that combined output; time-to-close captures only half of it.

## Building the scorecard and instrumenting the metrics

The six metrics are only useful if they can be instrumented from data a carrier already has, and assigned to the audience that acts on them. The good news is that none of them requires a new data warehouse. The hard part is discipline, not plumbing: baseline before the pilot, or the post-deployment numbers have nothing to move against.

### Where each metric comes from

- Fraud yield rate and coverage come from the claims system and SIU case management - flags raised, cases investigated, and dispositions are already logged events.
- Leakage recovered comes from reserve changes, denials, and settlement reductions tagged to investigated cases; it requires tagging discipline, not new infrastructure.
- Cost per investigation and capacity come from fully loaded SIU cost over case volume - finance already has the inputs.
- False-positive-rate reduction comes from the detection layer's alert volume against confirmed dispositions downstream.
- Time-to-defensible-decision comes from the agent's audit log - the timestamped record of when an audit-ready finding was produced, which is a richer source than a case-closed flag.
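The difference between a case-closed flag and an audit-ready finding shows up clearly once both are read from a timestamped log. The sketch below assumes a flat event log with hypothetical event names (`flagged`, `finding_ready`, `closed`) and made-up case IDs; a real case-management system or agent audit log will have its own schema.

```python
from datetime import datetime, timedelta

Event = tuple[str, str, datetime]  # (case_id, event_name, timestamp)

# Hypothetical log: C-1 gets an audit-ready finding; C-2 is closed without one.
events: list[Event] = [
    ("C-1", "flagged", datetime(2026, 1, 5, 9, 0)),
    ("C-1", "finding_ready", datetime(2026, 1, 5, 12, 0)),
    ("C-1", "closed", datetime(2026, 1, 6, 9, 0)),
    ("C-2", "flagged", datetime(2026, 1, 5, 10, 0)),
    ("C-2", "closed", datetime(2026, 1, 5, 11, 0)),
]


def elapsed(log: list[Event], start: str, end: str) -> dict[str, timedelta]:
    """Per-case time from the earliest `start` event to the earliest `end` event.

    Cases missing either event are omitted rather than counted as zero.
    """
    first_start: dict[str, datetime] = {}
    first_end: dict[str, datetime] = {}
    for case_id, name, ts in log:
        if name == start:
            first_start[case_id] = min(ts, first_start.get(case_id, ts))
        elif name == end:
            first_end[case_id] = min(ts, first_end.get(case_id, ts))
    return {
        cid: first_end[cid] - first_start[cid]
        for cid in first_start
        if cid in first_end
    }


time_to_close = elapsed(events, "flagged", "closed")
time_to_defensible = elapsed(events, "flagged", "finding_ready")
closed_without_finding = sorted(set(time_to_close) - set(time_to_defensible))

print("time to close:", time_to_close)
print("time to defensible decision:", time_to_defensible)
print("closed without an audit-ready finding:", closed_without_finding)
```

In this toy log, C-2 has the fastest close and no defensible decision at all. A scorecard built only on time-to-close would rank it first.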

The instrumentation gap is real and worth naming. Roughly 80% of insurers already use predictive modeling to detect fraud, up from 55% in 2018, per the [Insurance Information Institute](https://www.iii.org/fact-statistic/facts-and-statistics-insurance-fraud) summary of Coalition Against Insurance Fraud data, and 35% of insurance executives rank fraud detection a top-five priority for generative AI, per [Deloitte](https://www.deloitte.com/us/en/insights/industry/financial-services/financial-services-industry-predictions/2025/ai-to-fight-insurance-fraud.html). Detection is instrumented and adopted. The investigation layer - where flagged claims actually get resolved - is rarely instrumented at all. That is the gap this scorecard fills, and it is why borrowing detection precision or handling cycle time as a stand-in produces a number that looks fine while the loss-cost lever sits untouched.

### Which metric belongs to which audience

Match the metric to the person who acts on it, and the cadence to the decision it informs. The SIU Director tracks yield, coverage, false-positive reduction, and capacity weekly - those are operational dials. The Claims VP and CFO see leakage recovered and cost per investigation quarterly and on the board deck - those are the loss-ratio and unit-economics numbers. Time-to-defensible-decision is shared: the SIU lead watches it operationally, and compliance reviews it against the antifraud plan. The audit log feeds all of them from one source. The point of the scorecard is not more tiles - it is replacing activity counts and satisfaction surveys with the handful of numbers that actually predict loss avoidance and defensibility.
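The assignment above is small enough to keep as a single lookup that dashboards and reports can share. This is one possible encoding, with illustrative metric keys and structure; cadence is left as `None` wherever the text does not state one.

```python
from typing import NamedTuple, Optional


class Assignment(NamedTuple):
    metric: str
    audience: str
    cadence: Optional[str]  # None where no cadence is stated in the text


ASSIGNMENTS = [
    Assignment("fraud_yield_rate", "SIU Director", "weekly"),
    Assignment("coverage", "SIU Director", "weekly"),
    Assignment("false_positive_rate_reduction", "SIU Director", "weekly"),
    Assignment("siu_capacity_per_investigator", "SIU Director", "weekly"),
    Assignment("claims_leakage_recovered", "Claims VP", "quarterly"),
    Assignment("claims_leakage_recovered", "CFO", "quarterly"),
    Assignment("cost_per_investigation", "Claims VP", "quarterly"),
    Assignment("cost_per_investigation", "CFO", "quarterly"),
    Assignment("time_to_defensible_decision", "SIU lead", None),
    Assignment("time_to_defensible_decision", "Compliance", None),
]


def metrics_for(audience: str) -> list[str]:
    """Metrics a given audience acts on, in scorecard order."""
    return [a.metric for a in ASSIGNMENTS if a.audience == audience]


def audiences_for(metric: str) -> list[str]:
    """Audiences that act on a given metric."""
    return [a.audience for a in ASSIGNMENTS if a.metric == metric]


print(metrics_for("CFO"))
print(audiences_for("time_to_defensible_decision"))
```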

One closing note on the competitive frame, because it is easy to misread. Handler-assist and detection vendors publish real, valid metrics - loss reduction, percent automation, precision, recall, cross-carrier match rates. Those measure their layers, and Hesper is complementary to FRISS, Shift Technology, and Verisk, not a replacement. The argument here is narrower and harder to dispute: those metrics measure detection precision and handling throughput, and neither one answers whether a flagged claim was investigated to a confirmed, recovered, and defensible conclusion. The investigation layer has no incumbent except manual SIU teams, and it needs its own scorecard - the six metrics above.

## Frequently asked questions

### What metrics should carriers use to measure AI ROI in fraud investigation?

Use outcome- and loss-based metrics, not customer-experience ones. The six that matter: fraud yield rate - confirmed fraud divided by investigated claims; claims leakage recovered in dollars - the gap between what was paid and what was owed, which EY puts at 7-14% of total claims spend; cost per investigation and cost per SIU referral; SIU capacity per investigator; false-positive-rate reduction against a rules-based baseline of 60-85%; and time-to-defensible-decision, meaning an audit-ready finding, not just a fast close. NPS, CSAT, and raw cycle time measure the claimant experience. Fraud investigation serves the SIU, the CFO, the reserving actuary, and the state DOI examiner, so its ROI has to be scored on loss avoidance and defensibility.

### Why aren't NPS and CSAT good metrics for fraud investigation AI?

Because the claimant is not the customer of a fraud investigation. NPS and CSAT measure how a policyholder feels about the claims experience, which is the right yardstick for a contents-pricing or handling AI - that is exactly how Openly described measuring its tool in a May 2026 Claims Journal interview. But a fraud investigation exists to confirm fraud, recover leakage, and produce a defensible decision. A faster, friendlier process that confirms less fraud is a worse outcome, not a better one. Optimizing investigation for claimant satisfaction would push toward paying questionable claims to keep scores high. The audiences an investigation actually serves - SIU, finance, reserving, regulators - never appear in an NPS survey, so the metric structurally cannot capture investigation value.

### What is fraud yield rate and why does it matter more than cycle time?

Fraud yield rate is the share of investigated claims that turn out to be confirmed fraud - confirmed-fraud cases divided by investigated cases. It matters more than cycle time because cycle time can improve while yield collapses: an agent that closes cases in minutes but confirms nothing has destroyed value, not created it. Deloitte estimates current detection rates at just 20-40% for soft fraud and 40-80% for hard fraud, meaning most fraud is never confirmed today. Yield is also gated by coverage - investigating 100% of flagged claims at an honest yield beats investigating 25% at any speed. That is why coverage, where manual SIUs reach roughly 25% of flags and AI investigation can reach 100%, is the multiplier sitting behind the yield number.

### How do you calculate the dollar ROI of AI claims investigation?

Two board-facing numbers carry it: leakage recovered as the numerator and cost per investigation as the denominator. Leakage is the difference between what a carrier paid and what it owed per the contract; Bain links generative AI in claims to a 30-50% reduction in total leakage, and EY sizes leakage at 7-14% of total claims spend. On the cost side, Hesper's per-case investigation cost runs roughly $150 versus about $2,500 for a manual SIU workup, which drops the marginal cost of investigating one more referral by about 94%. The ROI is the incremental leakage recovered from the previously uninvestigated flag volume, net of platform spend - modeled on the carrier's own inputs, not a customer-satisfaction delta. Keep deterrence and reputational effects out of the headline number; they are real but hard to attribute.

### What is time-to-defensible-decision and how is it different from cycle time?

Time-to-defensible-decision is how long it takes to produce an investigation output a carrier can stand behind - an audit-ready finding with an evidence chain and explicit support for the action taken - not just how long until a case is marked closed. Cycle time measures handling speed; it says nothing about whether the conclusion would survive a bad-faith suit, an examination under oath, or a state DOI audit under NAIC Model Act #680 and rules like California 10 CCR 2698.36. A fast close on a thin file is a liability. Hesper compresses investigation from 14+ days to 2-4 hours while running 15+ evidence-gathering phases in parallel and logging every step, so speed and defensibility move together rather than trading off.

### How is measuring fraud investigation ROI different from measuring claims automation ROI?

Claims automation and fraud investigation sit at different layers, so they need different scorecards. Claims-handling automation - triage, contents pricing, document intake - is a customer-experience and efficiency function, correctly measured on cycle time, loss-adjusting-expense reduction (Bain cites 20-25%), and CSAT. Fraud investigation is a loss-avoidance and defensibility function, measured on yield, leakage recovered, false-positive reduction, and time-to-defensible-decision. Detection is upstream and investigation is downstream of it; 80% of insurers already run predictive detection, but the investigation layer where flagged claims get resolved has no published ROI framework. Borrowing handling metrics for investigation is a category error that flatters speed while ignoring whether fraud was actually caught and recovered.

## Key takeaways

- NPS, CSAT, and cycle time are customer-experience metrics; fraud investigation is a loss-avoidance function, so scoring an investigation agent on claimant satisfaction is a category error.
- The six metrics that measure investigation ROI are fraud yield rate, claims leakage recovered, cost per investigation, SIU capacity per investigator, false-positive-rate reduction, and time-to-defensible-decision.
- Fraud yield - confirmed fraud divided by investigated claims - matters more than speed, and it is gated by coverage: Deloitte puts current detection at 20-40% soft and 40-80% hard, while manual SIUs investigate only about 25% of flags.
- The dollar case rests on leakage recovered (EY: 7-14% of total claims spend; Bain: 30-50% reducible with AI) over cost per investigation (about $2,500 manual versus about $150 AI), not on a satisfaction delta.
- Speed only counts as time-to-defensible-decision: a 2-4 hour, audit-ready investigation across 15+ phases beats a fast close that cannot survive a bad-faith suit or a state DOI exam.