Evaluating AI fraud infrastructure is harder than it should be. Most vendors use the same vocabulary: 'AI-powered', 'automated', 'fraud detection'. Beneath the vocabulary, the products do fundamentally different things - some score claims, some score documents, some actually investigate. Buying the wrong category is a 12-month mistake.
This is a working evaluation framework built from conversations with SIU directors, claims executives, and procurement leaders at US P&C carriers. It is organised around 12 criteria in four categories: technical, operational, compliance, and economic. Use it during vendor RFPs, shortlist reviews, or internal build-vs-buy analysis.
If you have not yet framed the detection-vs-investigation distinction, start by comparing legacy rules-based systems with autonomous AI; that framing is the prerequisite. This checklist assumes you have already decided you want investigation, not just better detection.
Why the evaluation framework matters
Three structural forces are reshaping SIU technology purchasing in 2026: the detection-vs-investigation distinction, rising regulatory scrutiny on AI use in claims decisions, and board-level expectations for quantifiable fraud outcomes. Evaluation frameworks that worked in 2020 - focused on rule libraries, alert quality, and detection precision - miss the dimensions that matter now.
The checklist below gives equal weight to technical capabilities, operational fit, compliance posture, and economic impact. Each criterion maps to a procurement question with a clear answer format. The goal is to make the evaluation defensible: to peers, to auditors, and to regulators.
Step 1: define scope before comparing vendors
The first question is not 'which vendor?' but 'what problem are we buying a solution for?'. The three common scopes are:
- Detection only - better alerts, better triage. Replaces or augments a rules engine. Examples: FRISS, Shift Technology, Verisk.
- Investigation only - investigates flagged claims end-to-end, regardless of detection source. Example: Hesper AI.
- Detection + investigation - single vendor for both layers. Rare in 2026; most carriers assemble best-of-breed.
Decide the scope before comparing products. Mixing detection vendors and investigation vendors in the same shortlist is apples-to-oranges - they should both be in your stack, but in different roles.
Technical criteria (1-4)
1. Signal density per case
Question: how many signals does the system evaluate per case? What are they? Rules-based detection evaluates ~30 signals. Autonomous investigation should evaluate 200+, covering document forensics, medical record analysis, statement cross-referencing, OSINT, public records, financial analysis, timeline reconstruction, and network analysis.
2. Evidence and citations
Question: can the vendor show me an investigation report on a claim similar to ours, with citations to each piece of evidence? Every finding should trace back to a specific document, statement, database query, or public record. Black-box scoring without citations is a red flag - not because the model is wrong, but because you cannot defend a claim denial on unverifiable output.
3. Output type: alert vs. report
Question: what does the system produce? A risk score is detection. An investigation report with evidence, findings, timeline, and recommendation is investigation. Ask to see the actual output format. If it is a score or an alert list, you are buying detection regardless of how the vendor markets it.
4. Integration with existing systems
Question: what is the integration path to our claims management system (Guidewire, Duck Creek, Majesco, Snapsheet)? What happens to our existing detection stack (FRISS, Shift, Verisk, ISO ClaimSearch)? A modern investigation agent should sit downstream of any detection stack and integrate via claims-system APIs without requiring replacement. Integration timelines of 30-90 days are achievable; 6-18 months suggests a heavyweight implementation that is hard to reverse.
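The 'sits downstream, replaces nothing' requirement can be made concrete with a minimal sketch. All names here (the alert fields, the request payload) are hypothetical, not any vendor's real API; the point is that the investigation layer consumes upstream alerts as-is and only adds routing metadata.

```python
from dataclasses import dataclass

# Hypothetical shapes: field names are illustrative, not a real vendor schema.
@dataclass
class DetectionAlert:
    claim_id: str
    source: str        # e.g. "FRISS", "Shift", "internal-rules"
    risk_score: float  # upstream score, passed through untouched

def to_investigation_request(alert: DetectionAlert) -> dict:
    """Map an upstream detection alert onto a downstream investigation request.

    The investigation agent sits after the detection stack: it consumes the
    alert unchanged, so no existing detection system is replaced or modified.
    """
    return {
        "claim_id": alert.claim_id,
        "detection_source": alert.source,
        "upstream_score": alert.risk_score,
        "action": "open_investigation",
    }

req = to_investigation_request(DetectionAlert("CLM-1042", "FRISS", 0.87))
print(req["detection_source"])  # prints "FRISS" - the upstream system is preserved
```

In an RFP, the analogous question is whether the vendor can accept your detection stack's alert format without a rip-and-replace.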
Operational criteria (5-8)
5. Investigation coverage
Question: what percentage of flagged claims will be investigated? Legacy manual workflows cap at ~25%. An autonomous investigation agent should enable 100% coverage of flagged claims at current SIU headcount. If the vendor cannot commit to coverage numbers, the economics are unclear.
6. Throughput per investigator
Question: how does this change cases-per-investigator capacity? Current manual average is ~10 investigations per investigator per month. With autonomous investigation, review-oriented investigators can handle 800+ cases per month. Ask for reference deployments with before/after throughput metrics.
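The coverage and throughput claims above reduce to simple arithmetic. The sketch below uses this article's own figures (~10 manual investigations per investigator per month vs 800+ with autonomous investigation); the monthly flagged-claim volume is illustrative, not from any specific carrier.

```python
def coverage_rate(investigated: int, flagged: int) -> float:
    """Fraction of flagged claims that actually get investigated."""
    return investigated / flagged

def investigators_needed(flagged_per_month: int, cases_per_investigator: int) -> int:
    """Headcount required to investigate every flagged claim (ceiling division)."""
    return -(-flagged_per_month // cases_per_investigator)

flagged = 4000  # illustrative monthly volume
print(investigators_needed(flagged, 10))   # manual throughput: 400 investigators
print(investigators_needed(flagged, 800))  # autonomous investigation: 5
print(coverage_rate(1000, flagged))        # 0.25 - the ~25% manual cap
```

Running this math against your own flagged-claim volume is the fastest way to pressure-test a vendor's throughput claims.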
7. Time to deployment
Question: when will this be investigating live claims? Modern agents should be deployed within 30-90 days, including integration, data security review, and pilot investigations. Legacy platform rollouts of 6-18 months are often the result of rule library customisation and heavy services engagement.
8. Report quality and format
Question: what does a finished report look like? Is it audit-ready (structured sections, full citations, timeline, recommendation) or will the investigator still need to write the narrative? Audit-ready output is the measure - if an investigator has to rewrite the report, throughput gains collapse.
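The 'audit-ready' test can be expressed as a structure check: every finding cites evidence and a recommendation is present. The section names below follow the criteria in this checklist, not any vendor's actual report schema, and the sample finding is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    statement: str
    citation: str  # source document, query, or public record backing it

@dataclass
class InvestigationReport:
    claim_id: str
    findings: list
    timeline: list
    recommendation: str

def is_audit_ready(report: InvestigationReport) -> bool:
    """Audit-ready: every finding carries a citation and a recommendation exists."""
    return bool(report.recommendation) and all(f.citation for f in report.findings)

r = InvestigationReport(
    claim_id="CLM-1042",
    findings=[Finding("Repair invoice metadata predates loss date",
                      "invoice PDF, EXIF CreateDate field")],
    timeline=["2026-01-04 loss reported", "2026-01-02 invoice created"],
    recommendation="Refer to SIU investigator for determination",
)
print(is_audit_ready(r))  # True
```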
Compliance criteria (9-11)
9. Data retention and privacy
Question: what is retained, where, and for how long? Zero-retention architectures (claim data loaded into memory, analysed, discarded) are the enterprise default in 2026. PHI and PII handling should be documented and auditable. SOC 2 Type II or pending certification is a minimum. Data residency (US-only, EU-only) should be configurable.
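A zero-retention architecture can be sketched as a scoped session: claim data exists in memory for the duration of the analysis and is discarded on exit. This is a minimal illustration of the pattern, not a security implementation; the field names are placeholders.

```python
from contextlib import contextmanager

@contextmanager
def claim_session(claim_record: dict):
    """Zero-retention sketch: the working copy lives only during the analysis.

    Nothing is written to disk; on exit the working copy is cleared so claim
    data does not outlive the active investigation.
    """
    working_copy = dict(claim_record)  # in-memory only
    try:
        yield working_copy
    finally:
        working_copy.clear()  # discard when the analysis ends

record = {"claim_id": "CLM-1042", "phi_fields": "placeholder"}
with claim_session(record) as data:
    held = data  # analysis would happen here
print(held)  # {} - the working copy is empty once the session closes
```

The procurement question is whether the vendor can document an equivalent guarantee at the architecture level, not just in a policy statement.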
10. Regulatory posture
Question: who makes the fraud determination? The NAIC model SIU regulation and state DOI rules require human decision-making on fraud determinations. Any vendor whose workflow includes autonomous claim denial is non-compliant. The AI produces findings and recommendations; the investigator signs off. Confirm this in writing.
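The human-in-the-loop requirement is simple to encode as a workflow gate: no determination proceeds without a named investigator. This is a hypothetical sketch of the pattern to confirm in writing, not any vendor's actual workflow.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recommendation:
    claim_id: str
    ai_finding: str                              # the AI produces findings only
    investigator_signoff: Optional[str] = None   # name of the human approver

def can_act_on(rec: Recommendation) -> bool:
    """No fraud determination proceeds without a named human investigator."""
    return rec.investigator_signoff is not None

rec = Recommendation("CLM-1042", "Staged-accident indicators across 3 claims")
print(can_act_on(rec))  # False - findings alone never trigger a denial
rec.investigator_signoff = "J. Alvarez"
print(can_act_on(rec))  # True - a human has made the determination
```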
11. Audit trail
Question: if a regulator or litigant requests the investigation record, can we produce it? The full chain - which signals were evaluated, which sources were queried, which findings were surfaced, which the investigator acted on - should be auditable and exportable. Black-box workflows with no audit trail fail regulatory review.
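The exportable chain can be sketched as an append-only event log. The event types below mirror the chain named in this criterion (signals evaluated, sources queried, investigator actions); the structure itself is illustrative, not a real product's schema.

```python
import json

class AuditTrail:
    """Append-only record of an investigation, exportable on request."""

    def __init__(self, claim_id: str):
        self.claim_id = claim_id
        self.events = []

    def log(self, step: str, detail: str) -> None:
        self.events.append({"step": step, "detail": detail})

    def export(self) -> str:
        """Produce the full record a regulator or litigant could request."""
        return json.dumps({"claim_id": self.claim_id, "events": self.events},
                          indent=2)

trail = AuditTrail("CLM-1042")
trail.log("signal_evaluated", "invoice metadata vs loss date")
trail.log("source_queried", "county property records")
trail.log("investigator_action", "referred for human determination")
print(len(json.loads(trail.export())["events"]))  # 3 - every step is recoverable
```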
Economic criterion (12)
12. Measurement and ROI
Question: how will we measure this? The right metrics for investigation are outcome-based, not alert-based. Detection measures alert quality (precision, recall). Investigation measures outcome - confirmed fraud per flagged case, investigation coverage rate, average time-to-close, claim leakage reduction, recovery rate, SAR filing accuracy.
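The outcome-based metrics above can be computed directly from claims data. The aggregates in this sketch are illustrative inputs a carrier would pull from its own systems; none of the numbers come from a real deployment.

```python
def outcome_metrics(flagged: int, investigated: int, confirmed: int,
                    total_close_days: float) -> dict:
    """Outcome-based investigation metrics, per the criteria above."""
    return {
        "coverage_rate": investigated / flagged,
        "confirmed_fraud_per_flagged": confirmed / flagged,
        "avg_time_to_close_days": total_close_days / investigated,
    }

# Illustrative monthly aggregates, not real deployment data.
m = outcome_metrics(flagged=4000, investigated=4000, confirmed=320,
                    total_close_days=20000)
print(m["coverage_rate"])                # 1.0 - every flagged claim investigated
print(m["confirmed_fraud_per_flagged"])  # 0.08
print(m["avg_time_to_close_days"])       # 5.0
```

Note what is absent: precision and recall. Those are alert-quality metrics, and they measure the detection layer, not the investigation.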
Red flags in vendor claims
Over the last year, SIU leaders have described recurring patterns that separate marketing from reality. Watch for:
- 'AI-powered' without a clear answer to 'what does it do?'. If the answer is vague, the product is probably detection in a new wrapper.
- Unwillingness to show a sample investigation report on a real (anonymised) claim. A real report, with citations, is the single best diagnostic.
- 'Replaces your SIU team' claims. No compliant vendor replaces the investigator's decision authority; the AI's job is to produce findings, not to deny claims.
- Long deployment timelines (6+ months) for what should be a data-in, report-out workflow.
- Data retention that extends beyond the active investigation without a documented reason.
- Inability to provide reference deployments with measurable outcomes (coverage, throughput, confirmed fraud lift).
- Vague answers to the regulatory question. 'The AI makes the decision' should fail the RFP on the spot.
Key takeaways
- Evaluate vendors across 12 criteria in four categories: technical (scope, signals, evidence, integration), operational (coverage, throughput, deployment, output quality), compliance (retention, regulatory posture, audit trail), and economic (outcome-based ROI).
- Decide scope first - detection, investigation, or both. Mixing categories in the same shortlist produces meaningless comparisons.
- Signal density (30 vs 200+), output type (score vs report), and audit-ready citations are the technical tests that separate investigation platforms from detection-in-a-wrapper.
- Human decision authority on fraud determinations is a regulatory requirement. Any vendor who describes autonomous denial should be disqualified.
- Measurement should be outcome-based: confirmed fraud per flagged case, investigation coverage, time-to-close, leakage reduction. Alert-quality metrics miss the real value.