Every carrier running a modern fraud platform - FRISS, Shift Technology, Verisk, or an internal rules engine - knows the trade-off. The system surfaces suspicious claims. Alerts pile up. Investigators triage, investigate a small slice, and close the rest without resolution.
The reason is structural. Rules-based fraud detection was engineered to be sensitive, not precise. When the underlying fraud rate across P&C claims is 10-15% and the cost of missing fraud is high, the optimisation target becomes recall, not precision. The result is a high false positive rate - and, downstream of it, a capacity problem.
Autonomous AI investigation agents sit downstream of detection. Instead of producing alerts for humans to evaluate, they investigate flagged claims end-to-end and return investigation-ready findings. The false positive problem stops mattering in the same way a tentative diagnosis stops mattering once the test results are in - the question shifts from 'is this likely fraud?' to 'what did the investigation find?'
This guide is for claims executives, SIU directors, and technology buyers evaluating how AI fits into existing fraud infrastructure. It covers the architectural difference between detection and investigation, a side-by-side comparison of legacy platforms and autonomous agents, and a buyer evaluation framework. For the economics of uninvestigated claims, see how uninvestigated claims drain profitability.
The false positive problem
False positives in fraud detection are not a measurement error. They are a direct consequence of how rules-based detection works. A scoring model evaluates a claim against a library of red flag indicators - late reporting, high loss amount, recent policy inception, prior claims history, network overlap with known fraudsters. Each match contributes to a risk score. Claims above a threshold trigger an alert.
The problem is that 'suspicious' and 'fraudulent' are not the same thing. Many legitimate claims share characteristics with fraudulent ones. A legitimate claimant who reports a theft 10 days after it occurred because they were travelling looks, to a scoring model, identical to one who waited in order to fabricate the loss. A policy purchased three weeks before a claim is statistically overrepresented in fraud, but most such claims are still legitimate.
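A minimal sketch makes the precision cap concrete. The rule names, weights, and threshold below are illustrative assumptions, not any vendor's actual rule library - the point is only that two claims with identical observable signals must receive identical scores:

```python
# Hypothetical weighted red-flag scorer - illustrative, not a vendor's rules.
WEIGHTS = {
    "late_reporting": 25,    # reported more than 7 days after the loss
    "recent_inception": 30,  # policy less than 30 days old at loss
    "high_loss_amount": 20,  # loss above a configured limit
    "prior_claims": 25,      # two or more prior claims
}

def signals(claim):
    return {
        "late_reporting": claim["days_to_report"] > 7,
        "recent_inception": claim["policy_age_days"] < 30,
        "high_loss_amount": claim["loss_amount"] > 25_000,
        "prior_claims": claim["prior_claims"] >= 2,
    }

def risk_score(claim):
    fired = signals(claim)
    return sum(w for name, w in WEIGHTS.items() if fired[name])

# The travelling claimant and the fabricated loss present identical
# observable signals - so the scorer cannot tell them apart.
travelling = {"days_to_report": 10, "policy_age_days": 21,
              "loss_amount": 8_000, "prior_claims": 0}
fabricated = dict(travelling)  # same claim data, different ground truth
assert risk_score(travelling) == risk_score(fabricated) == 55
```

At a threshold of 50, both claims alert. Nothing in the scoring layer can separate them, because the distinguishing evidence is not in the claim data.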
According to the Coalition Against Insurance Fraud, roughly 10% of P&C claims involve some form of fraud, but the precision of rules-based detection runs much lower. Most SIU teams report confirmation rates between 15% and 40% on referred cases - meaning 60-85% of alerts do not result in confirmed fraud when fully investigated.
Why precision is capped
A rules-based system sees ~30 signals per claim: claim data, policy data, prior claims, basic external data. That signal density is not enough to distinguish 'unusual' from 'fraudulent'. The distinguishing evidence typically lives in documents, statements, public records, and OSINT sources - all of which require investigation, not scoring.
The consequence of a 60-85% false positive rate is not just alert fatigue. It is capacity erosion. An investigator handling 200 active cases at a 20% true positive rate is doing 160 cases of work for 40 confirmed outcomes. Most carriers resolve this by investigating only the highest-severity alerts. The rest are closed with an abbreviated review or no review at all - which is why roughly 75% of flagged claims are never fully investigated.
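The capacity arithmetic above, spelled out (the figures are the ones quoted in the text, not measurements):

```python
# Work spent vs confirmed outcomes at the quoted true positive rate.
active_cases = 200
true_positive_rate = 0.20
investigated_share = 0.25  # share of flagged claims carriers actually investigate

confirmed = int(active_cases * true_positive_rate)  # 40 confirmed outcomes
false_positives = active_cases - confirmed          # 160 cases of work, no finding
never_investigated = 1 - investigated_share         # 0.75 of flags closed unreviewed

print(confirmed, false_positives, never_investigated)  # 40 160 0.75
```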
How rules-based detection works
Modern fraud detection platforms combine rules, statistical models, and network analysis. The core output - a risk score and an alert - is consistent across vendors. The anatomy of a rules-based detection pipeline looks like this:
- Data ingest: claim data from the claims system, policy data from the policy admin system, and for some vendors, external data feeds like NICB, ISO ClaimSearch, and public records.
- Signal extraction: the platform evaluates the claim against a library of red flag rules. Each rule generates a binary or weighted signal.
- Scoring: signals are combined into a risk score, typically 0-100 or 0-1. Some platforms apply machine learning on top of rules to improve ranking; most still rely on weighted rule combinations.
- Network analysis: claims are cross-referenced against historical claims, known fraud rings, and provider networks. Vendors like FRISS and Shift Technology have invested heavily in this layer.
- Alert generation: claims above a configured threshold route to the SIU queue. Each alert includes the risk score, triggered rules, and available context.
- Manual investigation: an SIU investigator evaluates the alert, decides whether to investigate, and if yes, executes the manual investigation workflow.
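The pipeline above can be collapsed into a single schematic function. Every helper, field name, threshold, and weight here is an illustrative assumption - real platforms run far richer rule libraries and network graphs - but the shape is the same: signals in, score out, alert routed when the threshold is crossed:

```python
# Stand-in for the network-analysis layer's known-ring data (hypothetical).
KNOWN_RING_PARTIES = {"body_shop_17"}

def detect(claim, threshold=50):
    # Signal extraction: evaluate red-flag rules against ingested claim data.
    signals = {
        "late_reporting": claim["days_to_report"] > 7,
        "recent_inception": claim["policy_age_days"] < 30,
    }
    # Scoring: weighted combination into a 0-100 risk score.
    score = 30 * signals["late_reporting"] + 30 * signals["recent_inception"]
    # Network analysis: cross-reference claim parties against known rings.
    overlap = sorted(set(claim["parties"]) & KNOWN_RING_PARTIES)
    score += 40 * bool(overlap)
    # Alert generation: above-threshold claims route to the SIU queue.
    if score < threshold:
        return None  # claim proceeds through normal handling
    return {"score": score,
            "triggered": [name for name, fired in signals.items() if fired],
            "network_overlap": overlap}

alert = detect({"days_to_report": 12, "policy_age_days": 400,
                "parties": ["claimant_a", "body_shop_17"]})
# 30 (late reporting) + 40 (network overlap) = 70 -> routed to the SIU queue
```

Note where the function ends: at the alert. The final bullet - manual investigation - is precisely the step the pipeline hands off rather than performs.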
This pipeline is effective at what it was designed to do: surface suspicious activity for human evaluation. It is not designed to confirm fraud, gather evidence, or resolve cases. That work happens downstream, in the SIU, and it is where legacy fraud infrastructure stops.
'Legacy' is not a pejorative here. Rules-based detection serves a real purpose: it is fast, interpretable, auditable, and regulator-friendly. Platforms like FRISS, Shift Technology, and Verisk are category standards for precisely these reasons. The question is not whether detection is useful. The question is whether detection is sufficient.
How autonomous AI investigation works
Autonomous AI investigation agents operate downstream of detection. The input is a flagged claim (from any detection source - rules engine, adjuster referral, external tip). The output is an investigation-ready report with confirmed findings, evidence, and a recommendation. The workflow in between is what legacy platforms do not automate.
A properly designed autonomous agent runs 15+ investigation phases in parallel. For a suspected auto injury fraud claim, this includes:
- Document forensics: pixel-level analysis of claim documents, medical records, repair estimates, and photographs for signs of manipulation.
- Medical record analysis: reviewing treatment records against the injury mechanism and timeline, flagging inconsistencies, identifying upcoding patterns.
- Database cross-referencing: NICB, ISO ClaimSearch, state DMV, prior claims history, provider billing patterns.
- Public records and OSINT: court records, social media, business filings, property records where relevant.
- Statement analysis: recorded statements and examinations under oath cross-referenced for internal consistency and against submitted documentation.
- Timeline reconstruction: a chronological reconstruction of the claim events from all available sources.
- Financial analysis: loss calculation review, billing pattern analysis, motive indicators.
- Network analysis: connections to known fraud rings, provider collusion patterns, repeat claimant networks.
- Report generation: a structured investigation report with citations, confidence scores, and a recommendation.
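The parallel phase structure above can be sketched with asyncio. The phase names mirror the list; the phase bodies here are placeholders for the real asynchronous work (document APIs, database queries, OSINT lookups), and the claim ID is invented for the example:

```python
import asyncio

PHASES = ["document_forensics", "medical_record_analysis",
          "database_cross_reference", "public_records_osint",
          "statement_analysis", "timeline_reconstruction",
          "financial_analysis", "network_analysis", "report_generation"]

async def run_phase(name, claim_id):
    await asyncio.sleep(0)  # placeholder for real I/O-bound phase work
    return name, f"{name} findings for {claim_id}, with citations"

async def investigate(claim_id):
    # Evidence-gathering phases run concurrently; report generation
    # is assembled afterwards from their combined findings.
    gather_phases = [p for p in PHASES if p != "report_generation"]
    findings = dict(await asyncio.gather(
        *(run_phase(p, claim_id) for p in gather_phases)))
    findings["report_generation"] = (
        f"structured report for {claim_id} from {len(gather_phases)} phases")
    return findings

report = asyncio.run(investigate("CLM-0001"))
```

The design point is that phases are independent until report generation, so wall-clock time is bounded by the slowest phase, not the sum of all of them - which is why end-to-end turnaround can be hours rather than weeks.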
The total signal density per case is 200+ data points rather than the ~30 signals a rules engine evaluates. The difference is not incremental. A detection system tells you a claim is suspicious. An investigation agent tells you why - with citations, evidence, and a structured timeline.
"The false positive rate conversation stops being interesting once you automate investigation. If every flagged claim gets fully investigated in a few hours, it no longer matters whether detection was 20% precise or 60% precise. You find the real cases, and the false ones are resolved as part of the workflow."
- Hesper AI product research, Q1 2026
Head-to-head comparison
The table below summarises the architectural difference between rules-based detection platforms and autonomous AI investigation agents. These are not competing categories - detection sits upstream of investigation - but they are frequently evaluated together, and the differences matter.
Confirmed fraud cases per 100 flagged claims (by workflow)
The 72 figure is not a precision claim for detection. It is what happens when the investigation-capacity ceiling is removed. Most flagged claims that are currently closed without investigation turn out, when investigated, to fall in a broad middle: some legitimate, some minor soft fraud, and a meaningful minority confirmed as hard fraud. The yield from investigating the remaining 75% is typically higher than the yield from re-investigating the top 25% a second time.
What changes in the SIU workflow
Autonomous investigation does not replace SIU teams. It changes what investigators do. The shift is from execution to judgement.
Before: investigator as executor
The manual SIU workflow is execution-heavy. Investigators spend an estimated 88% of their time on evidence gathering, documentation, administrative logging, and vendor coordination. Only ~12% of their time goes to analysis and decision-making. A typical investigator working 200 active cases completes roughly 10 investigations per month at 14+ days each.
After: investigator as decision-maker
With autonomous investigation agents, the AI handles stages 2-5 of the six-stage investigation process (planning, evidence gathering, analysis, report generation). The investigator's role shifts to reviewing findings, making the final determination, and coordinating with claims, legal, and law enforcement where required. A review-oriented investigator can handle 800+ cases per month - an 80x step-up in throughput.
What stays human
The final decision on every claim remains with a human investigator. Autonomous agents surface findings, cite evidence, and recommend outcomes. They do not deny claims, file SARs, or trigger recovery actions autonomously. The investigator reviews and approves. Regulators and claims law require human accountability for fraud determinations; autonomous investigation preserves it.
For the full six-stage investigation workflow and how AI maps to it, see how insurance companies investigate fraud.
Evaluation criteria for buyers
When evaluating AI-driven fraud infrastructure - either as a replacement for legacy detection or, more commonly, as a complement to it - the following criteria matter most. They map to the questions decision-makers ask during RFP and procurement.
1. Scope: detection only, or detection plus investigation?
This is the most important question, and the one most frequently confused. A vendor that 'uses AI' may still only do detection - in which case you get a better alert queue, but the capacity bottleneck remains. Ask what the system produces: a score, or a report? If a score, you are buying detection. If a report, you are buying investigation.
2. Signal density per case
Rules-based detection evaluates ~30 signals. Autonomous investigation should evaluate 200+, including document forensics, statement analysis, OSINT, and timeline reconstruction. Ask for the full list of data sources and signals evaluated per case type.
3. Evidence and auditability
Every finding should be backed by cited evidence that will stand up to regulator review, claim litigation, or criminal prosecution. Black-box scoring is a red flag. Ask to see a sample investigation report on a claim similar to your portfolio, with full citations.
4. Integration with existing systems
Most carriers already run detection platforms and claims systems. A modern investigation agent should sit downstream of whatever detection stack is in place - FRISS, Shift, Verisk, ISO ClaimSearch, internal rules - without requiring replacement. Ask about integration path, time to first investigation, and data model compatibility.
5. Data retention and privacy
Claim data is sensitive: PHI, financial, and personal information. Ask whether the vendor retains claim data beyond the active investigation. Zero-retention architectures are now available and should be the default expectation for enterprise carriers.
6. Regulatory posture
State DOI requirements and the NAIC model SIU regulation require human decision-making on fraud determinations. A system that makes autonomous denial decisions is non-compliant. Confirm that the vendor's workflow preserves human sign-off on fraud outcomes.
7. Deployment time
Legacy fraud platforms typically take 6-18 months to deploy due to rule library customisation, integration, and user training. A well-designed investigation agent should be in production on real claims within 30-90 days. Ask for reference deployments with realistic timelines.
8. Measurement and ROI
The right metrics for investigation differ from those for detection. Detection measures alert quality (precision, recall). Investigation measures outcome - confirmed fraud per flagged case, investigation coverage rate, average time-to-close, claim leakage reduction, and regulatory filing accuracy. Ask how the vendor reports on these metrics.
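As a hedged sketch, the investigation-side metrics named above can all be computed from closed-case records. The field names below are illustrative, not a vendor's reporting schema:

```python
# Toy closed-case records (invented for illustration).
cases = [
    {"investigated": True,  "confirmed": True,  "days_to_close": 3},
    {"investigated": True,  "confirmed": False, "days_to_close": 2},
    {"investigated": True,  "confirmed": True,  "days_to_close": 4},
    {"investigated": False, "confirmed": False, "days_to_close": None},
]

flagged = len(cases)
investigated = [c for c in cases if c["investigated"]]

coverage_rate = len(investigated) / flagged                           # 0.75
confirmed_per_flagged = sum(c["confirmed"] for c in cases) / flagged  # 0.5
avg_days_to_close = (sum(c["days_to_close"] for c in investigated)
                     / len(investigated))                             # 3.0
```

These are outcome metrics: they move only when investigations actually close, which is why they cannot be gamed by tuning alert thresholds the way precision and recall can.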
Key takeaways
- Rules-based fraud detection platforms (FRISS, Shift Technology, Verisk) run at 60-85% false positive rates by design - they are engineered for recall, not precision.
- The false positive rate is only a problem because manual investigation capacity is limited. Most carriers investigate only 25% of flagged claims.
- Autonomous AI investigation agents sit downstream of detection and investigate flagged claims end-to-end: 200+ signals per case, 2-4 hour turnaround, investigation-ready report.
- The investigator's role shifts from execution (88% of current time) to decision-making. Throughput per investigator moves from ~10 cases/month to 800+.
- Detection and investigation are complementary, not competing. Most carriers will keep their detection stack and add autonomous investigation downstream. Evaluate vendors on scope, signal density, evidence citations, integration, and regulatory compliance.
Related reading: why 75% of flagged claims are never fully investigated, how insurance companies investigate fraud: inside the SIU process, and Hesper AI vs. manual investigation.