Claims investigation is a natural fit for an agentic AI architecture, because an SIU investigation is already a multi-step, multi-source, parallelizable workflow with a defined end state - an audit-ready resolution. The catch is that only an architecture built for investigation actually resolves a flag. A chatbot answers questions and summarizes documents. A rules engine or ML scorer outputs a flag or a score. Neither gathers new evidence, reasons about the facts of a specific claim, or produces a defensible finding. That is a different class of system.
This post is written for the technical seat that has to decide whether an investigation agent holds up: where the data sits, what the components are, and whether the output is something an SIU lead can read, override, and defend. It walks through what agent architecture means in a claims context, why an SIU investigation maps onto it so cleanly, the five components that make up a production investigation agent, how that differs from the chatbots and rules engines it gets confused with, and why most enterprise agentic projects fail while a scoped investigation agent does not.
It is the engineering-depth companion to the autonomous AI claims investigation pillar, which frames the 15+ phases end-to-end. Here the focus is the architecture beneath that framing - orchestration, memory, evals, and the audit trail - and why it is specific to SIU investigation rather than a generic agent design.
What AI agent architecture means for claims investigation
An AI agent architecture for claims investigation is a system that plans an investigation, calls specialized tools, maintains an evidence state across parallel phases, evaluates its own findings against guardrails, and outputs an audit-ready report. That is the whole definition, and every clause of it distinguishes an investigation agent from a chatbot that answers questions or a rules engine that outputs a score. The output is a resolution, not a suggestion.
This is where the enterprise-software trend and the insurance reality diverge in a useful way. Gartner predicts 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from under 5% in 2025, and that agentic AI will be in 33% of enterprise software by 2028, up from under 1% in 2024. The direction is clear, and the word that matters is task-specific: purpose-built agents scoped to one job beat general-purpose assistants. In insurance, the highest-value task-specific target is investigation, because it is the one high-cost, high-judgment workflow that broad automation has not touched.
Most of what carriers today call insurance AI is not investigation. Insurance Journal, reporting on a Sedgwick study, notes that 82% of carriers use AI for routine tasks such as data extraction and automated customer interaction. That is task-level automation, and it is genuinely useful at the front of the funnel - the same reporting cites intake automation cutting average claim processing from 10 days to 36 hours. But routine work is not investigation. An investigation is not a task; it is a plan made of tasks, run to a conclusion. This is the shift Hesper describes as moving from fraud detection to fraud resolution: detection scores a claim, and an agent architecture is what resolves it.
Why claims investigation fits an agentic architecture
Claims investigation fits an agentic architecture because an SIU investigation is already a structured, multi-source, parallelizable workflow with a defined end state. It has distinct phases - document forensics, OSINT, timeline reconstruction, statement cross-reference, financial pattern analysis, cross-carrier checks - each producing evidence that feeds a single audit-ready conclusion. That shape is precisely what agent architectures are good at. The workflow did not need reinventing; the constraint that capped it was human attention, and that is the constraint software removes.
The workflow is already a plan
An SIU playbook is a plan before any AI touches it. A trained investigator working a flagged claim knows the phases to run, the order dependencies between them, and what a finished file has to contain. That is the definition of a plannable task: a known decomposition into steps with a known end state. Detection is already mainstream at the top of this funnel - the Insurance Information Institute reports 80% of carriers use predictive modeling to detect fraud, up from 55% in 2018. Scoring is solved. The open frontier is the plan that runs after the score, on the roughly 10% of P&C claims that involve fraud and the far larger set that get flagged as suspicious. Our companion post on running 15 investigation phases in parallel walks the phase list in detail.
The bottleneck is attention, not method
A human runs the phases one at a time because human attention is the bottleneck, not because the method requires it. Manual investigation takes 14+ days per case and an investigator carries 200+ cases, so SIU teams reach only about 25% of flagged claims. The rest are paid, denied without a full investigation, or queued indefinitely, against a backdrop of $308 billion in annual US insurance fraud losses. An agent architecture runs independent phases simultaneously - 15+ at once - because per-case attention is not a constraint for software. That compresses cycle time to 2-4 hours, lifts throughput from roughly 10 investigations per investigator per month to 800+, and moves coverage toward 100% of flagged claims. The investigator's role shifts from execution to decision-making.
Figure: Investigations per investigator per month (manual vs agent architecture)
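As a quick consistency check, the ratios implied by the figures quoted in this section can be worked out directly. This is only arithmetic on the stated numbers, with "14+ days" and "800+" taken at their lower bounds; the variable names are illustrative.

```python
HOURS_PER_DAY = 24

manual_cycle_hours = 14 * HOURS_PER_DAY  # "14+ days", taken at its lower bound
agent_cycle_hours = (2, 4)               # "2-4 hours"

manual_per_month = 10                    # roughly 10 investigations per investigator
agent_per_month = 800                    # "800+", taken at its lower bound

# Speedup for the fastest (2 h) and slowest (4 h) agent cycle time.
cycle_speedup = tuple(manual_cycle_hours / h for h in agent_cycle_hours)
throughput_gain = agent_per_month / manual_per_month

print(f"Cycle time speedup: {cycle_speedup[1]:.0f}x to {cycle_speedup[0]:.0f}x")
print(f"Throughput gain: {throughput_gain:.0f}x")
```

The cycle-time ratio and the throughput ratio land in the same order of magnitude, which is what one would expect if the gain comes mainly from removing the serial bottleneck.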
Parallelism is the architecture story
The core architectural claim is not that the agent is smart; it is that the agent runs the investigation plan without an attention limit. A human investigator serializes phases because they can give full attention to only one at a time. Software does not have that limit. Running 15+ phases in parallel on every flagged claim is what turns a 14+ day, ~25%-coverage workflow into a 2-4 hour, 100%-coverage one. The method is unchanged; the constraint is gone.
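The serial-versus-parallel point can be illustrated with a toy timing sketch. The phase names and durations below are made up, and `asyncio.sleep` stands in for real investigative work; the only thing the sketch shows is that serial wall-clock time tracks the sum of phase durations while parallel wall-clock time tracks the longest phase.

```python
import asyncio
import time

# Hypothetical phases with made-up durations (seconds standing in for hours).
PHASES = {
    "document_forensics": 0.03,
    "osint": 0.05,
    "financial_pattern_analysis": 0.04,
    "cross_carrier_check": 0.02,
}

async def run_phase(name: str, duration: float) -> str:
    await asyncio.sleep(duration)  # placeholder for real work
    return name

async def run_serial() -> float:
    """Run phases one after another, as a single investigator would."""
    start = time.perf_counter()
    for name, duration in PHASES.items():
        await run_phase(name, duration)
    return time.perf_counter() - start

async def run_parallel() -> float:
    """Run all independent phases concurrently."""
    start = time.perf_counter()
    await asyncio.gather(*(run_phase(n, d) for n, d in PHASES.items()))
    return time.perf_counter() - start

serial_elapsed = asyncio.run(run_serial())
parallel_elapsed = asyncio.run(run_parallel())
print(f"serial:   {serial_elapsed:.2f}s (about the sum of phase durations)")
print(f"parallel: {parallel_elapsed:.2f}s (about the longest single phase)")
```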
The five architecture components
A production investigation agent has five components: an orchestrator that plans and sequences the work, tools the agent calls to gather evidence, a memory and retrieval layer that persists an evidence state across phases, an evaluation layer that scores the agent's own outputs, and guardrails plus an audit trail that make the result defensible. Each maps onto a specific part of the SIU playbook, and each is where a generic agent design breaks when applied to insurance.
Orchestration and planning
Orchestration decomposes a flagged claim into an investigation plan, then sequences it. Some phases depend on others - a timeline reconstruction needs the document set parsed first - and those run in order. Many phases are independent - OSINT on a claimant, cross-carrier retrieval, financial pattern analysis - and those run in parallel. The orchestrator is what holds that dependency graph and drives 15+ phases to completion. This is the difference between an agent and a workflow tool: a fixed workflow runs the same steps every time, while an orchestrator adapts the plan to what the specific claim needs and what earlier phases surface.
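A minimal sketch of the dependency-graph idea, using Python's standard-library `graphlib`. The phase names and most of the edges are assumptions for illustration; only "timeline reconstruction needs the documents parsed first" comes from the text. It covers static planning only, not the adaptive replanning described above.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each phase maps to the phases it must wait for.
DEPENDENCIES = {
    "document_parsing": set(),
    "osint": set(),
    "cross_carrier_retrieval": set(),
    "financial_pattern_analysis": set(),
    "document_forensics": {"document_parsing"},
    "timeline_reconstruction": {"document_parsing"},
    "statement_cross_reference": {"timeline_reconstruction"},  # assumed edge
}

def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group phases into waves; every phase in a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # phases whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = plan_waves(DEPENDENCIES)
for i, wave in enumerate(waves, start=1):
    print(f"wave {i}: {', '.join(wave)}")
```

In this toy graph the four independent phases form the first wave, and the dependent phases follow as their prerequisites complete.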
Tool use
Each investigation phase is a tool call. The tools are insurance-specific: document forensics on claim forms and photos, OSINT on the claimant and providers, cross-carrier retrieval against contributory data sources such as ISO ClaimSearch, statement and EUO cross-reference, financial pattern analysis, and timeline reconstruction. This is the layer that separates an investigation agent from a chatbot - a chatbot only retrieves and summarizes what is already in front of it, while an agent goes and gathers new evidence the claim file did not contain. Tool use is how the agent does investigation rather than description.
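One way to picture "each phase is a tool call" is a registry that maps phase names to functions returning findings. Everything below is a hypothetical sketch: the `Finding` type, the `tool` decorator, and both stub tools are invented for illustration, and a real cross-carrier tool would query an external data source rather than read a field on the claim.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Finding:
    phase: str
    summary: str
    source: str  # where the evidence came from

# Registry mapping a phase name to the function that performs it.
TOOLS: dict[str, Callable[[dict], list[Finding]]] = {}

def tool(name: str):
    """Register a function as the tool for one investigation phase."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("document_forensics")
def document_forensics(claim: dict) -> list[Finding]:
    # Stub: flag photos whose metadata date falls before the reported loss date.
    return [
        Finding("document_forensics", f"{p['file']} predates the loss date", p["file"])
        for p in claim["photos"]
        if p["taken"] < claim["loss_date"]  # ISO dates compare correctly as strings
    ]

@tool("cross_carrier_retrieval")
def cross_carrier_retrieval(claim: dict) -> list[Finding]:
    # Stub: prior claims are assumed to be supplied with the claim record.
    return [
        Finding("cross_carrier_retrieval", f"prior claim {c}", "prior_claims")
        for c in claim.get("prior_claims", [])
    ]

claim = {
    "loss_date": "2025-03-10",
    "photos": [
        {"file": "damage1.jpg", "taken": "2025-02-01"},
        {"file": "damage2.jpg", "taken": "2025-03-11"},
    ],
    "prior_claims": ["PC-001"],
}
findings = [f for name in TOOLS for f in TOOLS[name](claim)]
for f in findings:
    print(f.phase, "|", f.summary, "|", f.source)
```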
Memory and retrieval
Memory is the evidence state the investigation is built on. As phases run, each writes findings - a metadata anomaly, an OSINT match, a timeline conflict - to a shared evidence graph that later phases read, so the agent reasons over the whole claim rather than one document at a time. Retrieval pulls the claim file, policy terms, and prior-claim history into that state. This is not the same as a long prompt. Research on agent memory, such as Hu, Wang and McAuley's 2025 MemoryAgentBench study, tests four competencies - accurate retrieval, test-time learning, long-range understanding, and selective forgetting - and finds current methods fall short across all four. Memory is a measured, unsolved engineering problem, which is exactly why an investigation agent needs an explicit, evaluated evidence architecture rather than relying on a model's context window.
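A shared evidence state can be sketched as a store that phases write to and later phases read from. The version below is a flat, in-memory stand-in for the evidence graph, keyed by entity; the class name, method names, and sample findings are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class EvidenceState:
    """Shared store: phases write findings keyed by entity, later phases read them."""
    by_entity: dict = field(default_factory=lambda: defaultdict(list))

    def write(self, phase: str, entity: str, fact: str, source: str) -> None:
        self.by_entity[entity].append({"phase": phase, "fact": fact, "source": source})

    def read(self, entity: str) -> list[dict]:
        # .get avoids creating an empty entry for unknown entities.
        return list(self.by_entity.get(entity, []))

    def corroborated(self, min_phases: int = 2) -> list[str]:
        """Entities touched by findings from at least `min_phases` distinct phases."""
        return sorted(
            entity
            for entity, items in self.by_entity.items()
            if len({item["phase"] for item in items}) >= min_phases
        )

state = EvidenceState()
state.write("document_forensics", "photo:damage1.jpg",
            "metadata date precedes loss", "damage1.jpg EXIF")
state.write("osint", "claimant",
            "vehicle listed for sale before loss date", "public listing")
state.write("timeline_reconstruction", "claimant",
            "statement conflicts with listing date", "recorded statement")

print(state.corroborated())
```

The point of keying by entity is that a later phase can see what earlier phases already found about the same claimant, provider, or document, instead of reasoning over one document at a time.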
Evaluation
Evals score the agent's own findings before a human ever sees them. That means checking that every cited fact traces to a real source, catching hallucinated citations, flagging low-confidence conclusions, and routing uncertain cases to human review rather than reporting them as resolved. This is not optional polish; it is what makes the output trustworthy enough to act on. The Sedgwick study reported by Insurance Journal found 75% of claims professionals believe AI needs human oversight, and the eval layer is what makes that oversight tractable - it surfaces the cases that need a human instead of asking a human to re-check everything.
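The checks described above can be sketched as a gate that every finding passes through before the report is marked resolved. The threshold, field names, and sample findings are assumptions; a production eval layer would be considerably richer than a set lookup and a confidence floor.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    claim_text: str
    cited_source: str
    confidence: float  # 0.0 to 1.0, as reported by the phase that produced it

CONFIDENCE_FLOOR = 0.7  # hypothetical threshold

def evaluate(findings: list[Finding], known_sources: set[str]) -> dict:
    """Split findings into accepted ones and ones that need a human."""
    accepted, needs_review = [], []
    for f in findings:
        if f.cited_source not in known_sources:
            needs_review.append((f, "citation not found in evidence set"))
        elif f.confidence < CONFIDENCE_FLOOR:
            needs_review.append((f, "low confidence"))
        else:
            accepted.append(f)
    status = "human_review" if needs_review else "ready_for_decision"
    return {"status": status, "accepted": accepted, "needs_review": needs_review}

known_sources = {"damage1.jpg EXIF", "recorded statement"}
findings = [
    Finding("Photo metadata predates the loss", "damage1.jpg EXIF", 0.92),
    Finding("Claimant was out of state on the loss date", "airline manifest", 0.88),
    Finding("Statement conflicts with repair invoice", "recorded statement", 0.55),
]
result = evaluate(findings, known_sources)
print(result["status"])
for finding, reason in result["needs_review"]:
    print(f"- {finding.claim_text}: {reason}")
```

Here the second finding cites a source that is not in the evidence set and the third falls below the confidence floor, so the case is routed to human review rather than reported as resolved.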
Guardrails and audit trail
Guardrails enforce PII and HIPAA handling and log every decision with its source, reasoning, and timestamp. This is the audit-trail-native part of the architecture, and in insurance it is not a nice-to-have - it is a filing requirement. The output has to satisfy California 10 CCR 2698.36's documented-decision requirement and the antifraud-plan filing requirements of NAIC Model Act 680, adopted in 48 states. An investigation the agent cannot show its work on is not a defensible investigation, whatever its conclusion. For the full treatment of what makes an AI finding hold up, see the defensibility standard for fraud investigation AI. This is the component that turns a plausible answer into a record an SIU lead can override, a compliance officer can file, and a deposition can survive.
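A minimal sketch of an audit-trail entry with source, reasoning, and timestamp, plus a simple redaction step. Two things here are assumptions that go beyond the text: the single SSN regex is only a placeholder for real PII/PHI handling, and the hash chaining is one possible way to make tampering detectable. Nothing in this sketch by itself establishes compliance with any regulation.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # illustrative PII pattern only

def redact(text: str) -> str:
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)

class AuditTrail:
    """Append-only log; each entry carries the hash of the previous entry."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash every field except the hash itself, in a stable key order.
        body = {k: v for k, v in entry.items() if k != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, decision: str, source: str, reasoning: str) -> dict:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "decision": redact(decision),
            "source": source,
            "reasoning": redact(reasoning),
            "prev_hash": self.entries[-1]["hash"] if self.entries else "",
        }
        entry["hash"] = self._digest(entry)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Check that no entry was altered and the chain is unbroken."""
        prev = ""
        for entry in self.entries:
            if entry["prev_hash"] != prev or entry["hash"] != self._digest(entry):
                return False
            prev = entry["hash"]
        return True

trail = AuditTrail()
trail.record("flag photo as inconsistent", "damage1.jpg EXIF",
             "Metadata date precedes loss date; claimant SSN 123-45-6789 on form")
trail.record("route to human review", "eval layer", "One citation could not be verified")
print(trail.verify())
print(trail.entries[0]["reasoning"])
```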
Figure: A flagged claim fans out into 15+ parallel investigation phases, each writing to a shared evidence state, converging on an audit-ready resolution the investigator reviews.
Agentic investigation vs chatbots vs rules engines
Three systems get conflated in insurance AI conversations, and they do different jobs: an LLM chatbot answers questions, a rules engine or ML scorer flags a claim, and an agentic investigation architecture resolves a flag end-to-end. Only the third one gathers new evidence and produces a defensible conclusion. Conflating them is how carriers end up expecting a copilot to do an investigator's work, or a detection score to stand in for a documented finding.
The chatbot is single-turn or conversational: it retrieves and summarizes, has no plan, calls no investigative tools, keeps no evidence state, and does not evaluate its output against a standard. It is the class of tool behind that 82% routine-task adoption and the 80% faster processing on low-severity claims - real value at the simple end of the funnel. The rules engine is deterministic thresholds or a scoring model with high recall and a 60-85% false-positive rate; it outputs a flag or a 0-999 score and does not gather new evidence about the specific claim. The agent plans, calls tools, maintains memory across 15+ parallel phases, evaluates its own findings, and produces an audit-ready report in 2-4 hours covering 100% of flagged claims. Only the last one resolves a flag.
The contrast with rules engines is worth holding onto because it is the one carriers feel most directly: a 60-85% false-positive rate means most flags are not fraud, and every one of them still needs a human to investigate it. A score creates work; it does not do work. We treat that distinction in depth in legacy rules versus autonomous AI fraud detection. Detection is upstream; investigation is downstream.
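The "a score creates work" point is easy to put in numbers. The flag volume below is hypothetical; the false-positive range is the 60-85% quoted above.

```python
def non_fraud_flags(flags: int, false_positive_rate: float) -> int:
    """Number of flags that turn out not to be fraud."""
    return round(flags * false_positive_rate)

flags = 1_000  # hypothetical flag volume
for rate in (0.60, 0.85):
    not_fraud = non_fraud_flags(flags, rate)
    print(f"FP rate {rate:.0%}: {not_fraud} of {flags} flags are not fraud, "
          f"{flags - not_fraud} are")
```

Every one of those flags, fraud or not, still has to be investigated before anyone knows which group it belongs to.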
Where agent architectures fail - and how investigation architecture avoids it
Most enterprise agentic AI projects fail for a specific, honest reason: they are unscoped. Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027 because of escalating costs, unclear business value, and inadequate risk controls - many are hype-driven experiments with no defined job. An investigation agent avoids that trap not by being cleverer but by being narrowly scoped to one high-value task with a measurable payoff and a defensible output. Scope, measurability, and defensibility are the difference between a shipped agent and a canceled proof of concept.
Insurance mirrors the broader pattern. The Sedgwick study reported by Insurance Journal found that between 58% and 82% of insurers use AI tools, but only 12% describe their AI capabilities as fully mature and just 7% have achieved scalable AI success. Nearly two-thirds report a gap between their AI vision and their reality. That gap is not at intake, where automation already works; it is downstream, at investigation, where broad-but-shallow adoption runs out of road. Adoption is easy; scaling a system that produces a defensible result is the hard part, and it is the part architecture discipline decides.
The design choices that keep an investigation agent on the shipped side of that line follow from using the five components deliberately. The job is scoped: investigate a flagged claim, not "do claims". Evals and guardrails catch bad output before a human acts on it. The audit trail is defensible and satisfies the filing requirements, rather than leaving a black-box conclusion. And a human stays in the decision seat, which is what 75% of claims professionals say they want. The unit economics close the case for finance: roughly $150 per AI investigation against roughly $2,500 for a manual case makes 100% coverage affordable, and a measurable payoff is exactly what the canceled-project cohort lacked. Hesper is purpose-built for SIU investigation, not a generic agent platform - the evidence graph, cross-carrier retrieval, and audit trail are specific to the job, which is why the scope is defined and the value is measurable.
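The unit-economics argument can be checked with a few lines of arithmetic. The per-case costs and the 25% manual coverage figure come from this post; the volume of 1,000 flagged claims is a hypothetical, and the sketch ignores investigator review time on the agent side.

```python
COST_MANUAL = 2_500      # roughly, per manual case
COST_AGENT = 150         # roughly, per AI investigation
MANUAL_COVERAGE = 0.25   # share of flagged claims SIU teams reach today

flagged = 1_000  # hypothetical number of flagged claims

manual_spend = flagged * MANUAL_COVERAGE * COST_MANUAL  # today: 25% coverage
agent_spend = flagged * COST_AGENT                      # agent: 100% coverage
manual_full_coverage = flagged * COST_MANUAL            # manual at 100% coverage

print(f"Manual, 25% coverage:  ${manual_spend:,.0f}")
print(f"Agent, 100% coverage:  ${agent_spend:,.0f}")
print(f"Manual, 100% coverage: ${manual_full_coverage:,.0f}")
```

On these assumptions, investigating every flagged claim with the agent costs less than investigating a quarter of them manually.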
How the architecture fits the existing SIU stack
An investigation agent sits downstream of detection and inside the claims-management system: a flagged claim flows out of the claims system, the agent investigates it, and an audit-ready report flows back as a case attachment. It does not replace the detection scorers, the handler-assist tools, or the claims suite already in place. Mapping the layers by architecture type shows why - each of those systems is built for a different job, and only the investigation layer produces a resolution.
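The data flow described above (flagged claim out, report back as a case attachment) can be sketched against a hypothetical interface. `ClaimsSystem`, its two methods, and the in-memory stand-in are invented for illustration and do not correspond to any real claims suite's API.

```python
from dataclasses import dataclass, field
from typing import Protocol

class ClaimsSystem(Protocol):
    """Hypothetical interface; real claims suites expose their own APIs."""
    def flagged_claims(self) -> list[str]: ...
    def attach_report(self, claim_id: str, report: str) -> None: ...

@dataclass
class InMemoryClaimsSystem:
    """Test double standing in for a claims-management system."""
    flagged: list[str]
    attachments: dict[str, list[str]] = field(default_factory=dict)

    def flagged_claims(self) -> list[str]:
        return list(self.flagged)

    def attach_report(self, claim_id: str, report: str) -> None:
        self.attachments.setdefault(claim_id, []).append(report)

def investigate(claim_id: str) -> str:
    # Placeholder for the full multi-phase investigation.
    return f"Audit-ready report for {claim_id} (placeholder)"

def process_flagged(system: ClaimsSystem) -> int:
    """Pull flagged claims, investigate each, attach the report back to the case."""
    claim_ids = system.flagged_claims()
    for claim_id in claim_ids:
        system.attach_report(claim_id, investigate(claim_id))
    return len(claim_ids)

system = InMemoryClaimsSystem(flagged=["CLM-1001", "CLM-1002"])
processed = process_flagged(system)
print(processed, sorted(system.attachments))
```

Keeping the integration surface this narrow is what lets the agent sit beside the existing detection and claims-management layers rather than replace them.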
Detection scorers such as FRISS, Verisk ClaimDirector, and SAS Fraud Framework are ML-scoring architectures - they output a flag or a 0-999 score on cross-carrier data and hand it off. Handler-assist agents such as Shift Claims augment a human adjuster's workflow; per Shift's announcement that includes loss reduction and faster handling, but the architecture keeps a human in the execution loop rather than running the SIU playbook autonomously. Claims suites such as Guidewire ClaimCenter and Duck Creek route and manage the case. None of them ships an autonomous, multi-phase, audit-trail-native investigation agent, because that is a different architecture with a different output. Hesper is complementary to FRISS, Shift Technology, and Verisk - not a replacement.
The architectural point beneath the even-handedness holds for every one of those vendors: a detection agent and an investigation agent are built for different jobs because scoring a claim and resolving a claim are different problems. A scorer optimizes recall against a threshold. An investigator optimizes toward a defensible conclusion. You cannot get the second by tuning the first. The investigation layer is the layer no other vendor occupies, and it is architecturally distinct because its output is an audit-ready resolution, not a score or a suggestion. Make every flagged claim investigable.
Key takeaways
- An AI agent architecture for claims investigation plans an investigation, calls tools, maintains evidence memory across parallel phases, evaluates its own findings, and outputs an audit-ready report - not a score or an answer.
- Claims investigation fits an agentic architecture because an SIU playbook is already a plannable, multi-source workflow; the constraint was human attention, and running 15+ phases in parallel removes it.
- The five components - orchestration, tool use, memory and retrieval, evals, and guardrails plus audit trail - each map onto a specific part of the SIU playbook and are where generic agent designs break in insurance.
- Only an agentic investigation architecture resolves a flag: a chatbot answers questions and a rules engine outputs a score with a 60-85% false-positive rate, and neither gathers new evidence about the specific claim.
- Scope, measurability, and defensibility separate a shipped investigation agent from the more than 40% of agentic AI projects Gartner expects to be canceled by the end of 2027 - which is why a purpose-built SIU agent survives where generic platforms stall.