AI agent architecture for claims investigation: orchestration, memory, evals

40%: enterprise apps with task-specific agents by end of 2026 (Gartner, up from under 5% in 2025)
15+: investigation phases run in parallel per flagged claim, not one at a time
>40%: agentic AI projects canceled by 2027 (Gartner, citing cost and weak risk controls)
2-4 hrs: AI investigation cycle time, vs 14+ days manual per case

Claims investigation is a natural fit for agentic AI architecture, because an SIU investigation is already a multi-step, multi-source, parallelizable workflow with a defined end state - an audit-ready resolution. The catch is that only an architecture built for investigation actually resolves a flag. A chatbot answers questions and summarizes documents. A rules engine or ML scorer outputs a flag or a score. Neither gathers new evidence, reasons about the facts of a specific claim, or produces a defensible finding. That is a different class of system.

This post is written for the technical seat that has to decide whether an investigation agent holds up: where the data sits, what the components are, and whether the output is something an SIU lead can read, override, and defend. It walks through what agent architecture means in a claims context, why an SIU investigation maps onto it so cleanly, the five components that make up a production investigation agent, how that differs from the chatbots and rules engines it gets confused with, and why most enterprise agentic projects fail while a scoped investigation agent does not.

It is the engineering-depth companion to the autonomous AI claims investigation pillar, which frames the 15+ phases end-to-end. Here the focus is the architecture beneath that framing - orchestration, memory, evals, and the audit trail - and why it is specific to SIU investigation rather than a generic agent design.

What AI agent architecture means for claims investigation

An AI agent architecture for claims investigation is a system that plans an investigation, calls specialized tools, maintains an evidence state across parallel phases, evaluates its own findings against guardrails, and outputs an audit-ready report. That is the whole definition, and every clause of it distinguishes an investigation agent from a chatbot that answers questions or a rules engine that outputs a score. The output is a resolution, not a suggestion.

This is where the enterprise-software trend and the insurance reality diverge in a useful way. Gartner predicts 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from under 5% in 2025, and that agentic AI will be in 33% of enterprise software by 2028, up from under 1% in 2024. The direction is clear, and the word that matters is task-specific: purpose-built agents scoped to one job beat general-purpose assistants. In insurance, the highest-value task-specific target is investigation, because it is the one high-cost, high-judgment workflow that broad automation has not touched.

Most of what carriers today call insurance AI is not that. Insurance Journal, reporting on a Sedgwick study, notes that 82% of carriers use AI for routine tasks such as data extraction and automated customer interaction. That is task-level automation, and it is genuinely useful at the front of the funnel - the same reporting cites intake automation cutting average claim processing from 10 days to 36 hours. But routine is not investigation. An investigation is not a task; it is a plan made of tasks, run to a conclusion. This is the shift Hesper describes as moving from fraud detection to fraud resolution: detection scores a claim, and an agent architecture is what resolves it.

Why claims investigation fits an agentic architecture

Claims investigation fits an agentic architecture because an SIU investigation is already a structured, multi-source, parallelizable workflow with a defined end state. It has distinct phases - document forensics, OSINT, timeline reconstruction, statement cross-reference, financial pattern analysis, cross-carrier checks - each producing evidence that feeds a single audit-ready conclusion. That shape is precisely what agent architectures are good at. The workflow did not need reinventing; the constraint that capped it was human attention, and that is the constraint software removes.

The workflow is already a plan

An SIU playbook is a plan before any AI touches it. A trained investigator working a flagged claim knows the phases to run, the order dependencies between them, and what a finished file has to contain. That is the definition of a plannable task: a known decomposition into steps with a known end state. Detection is already mainstream at the top of this funnel - the Insurance Information Institute reports 80% of carriers use predictive modeling to detect fraud, up from 55% in 2018. Scoring is solved. The open frontier is the plan that runs after the score, on the roughly 10% of P&C claims that involve fraud and the far larger set that get flagged as suspicious. Our companion post on running 15 investigation phases in parallel walks the phase list in detail.

The bottleneck is attention, not method

A human runs the phases one at a time because human attention is the bottleneck, not because the method requires it. Manual investigation takes 14+ days per case, an investigator carries 200+ cases, and the arithmetic means SIU teams reach only about 25% of flagged claims - the rest are paid, denied without full work, or queued indefinitely against $308 billion in annual US insurance fraud loss. An agent architecture runs independent phases simultaneously - 15+ at once - because per-case attention is not a constraint for software. That compresses cycle time to 2-4 hours, lifts throughput from roughly 10 investigations per investigator per month to 800+, and moves coverage toward 100% of flagged claims. The investigator's role shifts from execution to decision-making.

Investigations per investigator per month (manual vs agent architecture)

Manual SIU workflow: ~10
Agent architecture (15+ parallel phases): 800+
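
The coverage arithmetic behind those two throughput figures can be sketched in a few lines. This is an illustration under stated assumptions, not a sizing model: the team of 10 investigators and the 400 flagged claims a month are hypothetical numbers chosen so the manual case lands on the roughly 25% coverage cited above, and the function name `coverage` is ours.

```python
def coverage(flagged_per_month: int, investigators: int, cases_per_investigator: float) -> float:
    """Fraction of flagged claims a team can investigate in a month, capped at 1.0."""
    capacity = investigators * cases_per_investigator
    return min(1.0, capacity / flagged_per_month)


# Hypothetical team: 10 investigators facing 400 flagged claims a month.
manual = coverage(400, 10, 10)    # ~10 investigations per investigator per month
agent = coverage(400, 10, 800)    # 800+ per investigator with parallel phases

print(f"manual coverage: {manual:.0%}, agent coverage: {agent:.0%}")
```

With these inputs the manual team reaches a quarter of its flags, while the agent-backed team has capacity to spare. That gap is why the constraint is described as attention rather than method.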

Parallelism is the architecture story

The core architectural claim is not that the agent is smart; it is that the agent runs the investigation plan without an attention limit. A human investigator serializes phases because they can hold one case at a time. Software does not. Running 15+ phases in parallel on every flagged claim is what turns a 14+ day, ~25%-coverage workflow into a 2-4 hour, 100%-coverage one. The method is unchanged; the constraint is gone.

The five architecture components

A production investigation agent has five components: an orchestrator that plans and sequences the work, tools the agent calls to gather evidence, a memory and retrieval layer that persists an evidence state across phases, an evaluation layer that scores the agent's own outputs, and guardrails plus an audit trail that make the result defensible. Each maps onto a specific part of the SIU playbook, and each is where a generic agent design breaks when applied to insurance.

Orchestration and planning

Orchestration decomposes a flagged claim into an investigation plan, then sequences it. Some phases depend on others - a timeline reconstruction needs the document set parsed first - and those run in order. Many phases are independent - OSINT on a claimant, cross-carrier retrieval, financial pattern analysis - and those run in parallel. The orchestrator is what holds that dependency graph and drives 15+ phases to completion. This is the difference between an agent and a workflow tool: a fixed workflow runs the same steps every time, while an orchestrator adapts the plan to what the specific claim needs and what earlier phases surface.
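
A minimal sketch of that dependency-graph scheduling, using Python's standard `graphlib` and `asyncio`. The phase names, the `PHASES` graph, and the `run_phase` stand-in are all hypothetical. A real orchestrator would call tools inside each phase and could revise the plan as findings arrive, which this sketch does not attempt.

```python
import asyncio
from graphlib import TopologicalSorter

# Hypothetical phase graph: each phase maps to the set of phases it depends on.
PHASES = {
    "parse_documents": set(),
    "osint": set(),
    "cross_carrier": set(),
    "financial_patterns": set(),
    "timeline": {"parse_documents"},
    "statement_cross_reference": {"parse_documents", "timeline"},
}


async def run_phase(name: str, log: list[str]) -> None:
    """Stand-in for a real phase; a production phase would call a tool here."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    log.append(name)


async def orchestrate(phases: dict[str, set[str]]) -> list[str]:
    """Start every phase as soon as all of its dependencies have finished."""
    sorter = TopologicalSorter(phases)
    sorter.prepare()  # also raises CycleError if the plan is circular
    log: list[str] = []
    running: dict[asyncio.Task, str] = {}
    while sorter.is_active():
        # Launch everything that is currently unblocked, concurrently.
        for name in sorter.get_ready():
            running[asyncio.create_task(run_phase(name, log))] = name
        done, _ = await asyncio.wait(set(running), return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            task.result()  # re-raise any phase failure
            sorter.done(running.pop(task))  # unblocks dependent phases
    return log


order = asyncio.run(orchestrate(PHASES))
print(order)
```

The independent phases start together in the first pass. `timeline` is only scheduled once `parse_documents` has been marked done, and `statement_cross_reference` waits for both.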

Tool use

Each investigation phase is a tool call. The tools are insurance-specific: document forensics on claim forms and photos, OSINT on the claimant and providers, cross-carrier retrieval against contributory data sources such as ISO ClaimSearch, statement and EUO cross-reference, financial pattern analysis, and timeline reconstruction. This is the layer that separates an investigation agent from a chatbot - a chatbot only retrieves and summarizes what is already in front of it, while an agent goes and gathers new evidence the claim file did not contain. Tool use is how the agent does investigation rather than description.
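
One way to picture "each phase is a tool call" is a small registry that maps phase names to functions. Everything here is an assumption made for illustration: the `Finding` shape, the `tool` decorator, the claim dictionary layout, and the toy forensics check, which only flags damage photos whose metadata date predates the reported loss.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    phase: str
    summary: str
    source: str  # where the evidence came from, kept for the audit trail


# Hypothetical tool signature: a claim record in, a list of findings out.
Tool = Callable[[dict], list[Finding]]
TOOLS: dict[str, Tool] = {}


def tool(name: str) -> Callable[[Tool], Tool]:
    """Register a function as the tool behind one investigation phase."""
    def register(fn: Tool) -> Tool:
        TOOLS[name] = fn
        return fn
    return register


@tool("document_forensics")
def document_forensics(claim: dict) -> list[Finding]:
    # Toy check: a damage photo whose metadata date predates the reported loss.
    return [
        Finding("document_forensics", f"photo {p['id']} predates the loss date", p["id"])
        for p in claim["photos"]
        if p["taken"] < claim["loss_date"]  # ISO dates compare correctly as strings
    ]


def call_tool(name: str, claim: dict) -> list[Finding]:
    """Dispatch one phase to its registered tool."""
    return TOOLS[name](claim)


claim = {
    "loss_date": "2025-03-01",
    "photos": [
        {"id": "IMG_1", "taken": "2025-02-27"},
        {"id": "IMG_2", "taken": "2025-03-04"},
    ],
}
findings = call_tool("document_forensics", claim)
print(findings)
```

Each finding carries its source. That detail lets the later evaluation and audit layers check the result rather than take it on trust.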

Memory and retrieval

Memory is the evidence state the investigation is built on. As phases run, each writes findings - a metadata anomaly, an OSINT match, a timeline conflict - to a shared evidence graph that later phases read, so the agent reasons over the whole claim rather than one document at a time. Retrieval pulls the claim file, policy terms, and prior-claim history into that state. This is not the same as a long prompt. Research on agent memory, such as Hu, Wang and McAuley's 2025 MemoryAgentBench study, tests four competencies - accurate retrieval, test-time learning, long-range understanding, and selective forgetting - and finds current methods fall short across all four. Memory is a measured, unsolved engineering problem, which is exactly why an investigation agent needs an explicit, evaluated evidence architecture rather than relying on a model's context window.
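
As a rough sketch of a shared evidence state, the structure below indexes findings by the entity they concern, so a later phase can read what earlier phases wrote about the same document or person. The class names, fields, and the `corroborated` helper are hypothetical. A production evidence graph would also need typed relationships, persistence, and access controls.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    phase: str    # phase that produced the finding
    entity: str   # what the finding is about, e.g. a person or a document
    note: str
    source: str   # pointer back to the underlying record


@dataclass
class EvidenceState:
    """Hypothetical shared store: findings indexed by the entity they concern."""
    by_entity: dict[str, list[Evidence]] = field(default_factory=lambda: defaultdict(list))

    def write(self, item: Evidence) -> None:
        self.by_entity[item.entity].append(item)

    def about(self, entity: str) -> list[Evidence]:
        """Everything any phase has recorded about one entity."""
        return list(self.by_entity.get(entity, []))

    def corroborated(self) -> list[str]:
        """Entities that two or more distinct phases have findings on."""
        return [
            entity
            for entity, items in self.by_entity.items()
            if len({i.phase for i in items}) > 1
        ]


state = EvidenceState()
state.write(Evidence("document_forensics", "invoice-17", "metadata edited after submission", "doc:invoice-17"))
state.write(Evidence("financial_patterns", "invoice-17", "amount matches a prior claim", "claim:prior-1"))
state.write(Evidence("osint", "claimant", "vehicle listed for sale before the loss", "web:listing-1"))

print(state.corroborated())
```

The design choice worth noticing is that the state is queried, not replayed. A phase asks what is known about `invoice-17` instead of rereading every prior output, which is the practical difference from a long prompt.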

Evaluation

Evals score the agent's own findings before a human ever sees them. That means checking that every cited fact traces to a real source, catching hallucinated citations, flagging low-confidence conclusions, and routing uncertain cases to human review rather than reporting them as resolved. This is not optional polish; it is what makes the output trustworthy enough to act on. The Sedgwick study reported by Insurance Journal found 75% of claims professionals believe AI needs human oversight, and the eval layer is what makes that oversight tractable - it surfaces the cases that need a human instead of asking a human to re-check everything.
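
The routing logic described above can be sketched as a single gate that every conclusion passes through before it reaches a report. The `Conclusion` shape, the 0.8 confidence threshold, and the source identifiers are assumptions for illustration. How confidence is estimated is deliberately left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conclusion:
    text: str
    cited_sources: tuple[str, ...]
    confidence: float  # 0.0 to 1.0, however the system estimates it


def evaluate(
    conclusion: Conclusion,
    known_sources: set[str],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> tuple[str, list[str]]:
    """Return ('report', []) or ('human_review', reasons) for one conclusion."""
    reasons: list[str] = []
    if not conclusion.cited_sources:
        reasons.append("no sources cited")
    # A citation that is not in the evidence gathered for this claim is treated as hallucinated.
    missing = [s for s in conclusion.cited_sources if s not in known_sources]
    if missing:
        reasons.append(f"unverifiable citations: {missing}")
    if conclusion.confidence < min_confidence:
        reasons.append(f"confidence {conclusion.confidence:.2f} below {min_confidence:.2f}")
    return ("human_review" if reasons else "report", reasons)


known = {"doc:invoice-17", "web:listing-1"}
solid = Conclusion("Invoice was altered after submission", ("doc:invoice-17",), 0.93)
shaky = Conclusion("Claimant staged the loss", ("web:listing-9",), 0.55)

print(evaluate(solid, known))
print(evaluate(shaky, known))
```

The second conclusion is routed to a human with both reasons attached. The reviewer sees why the case was escalated instead of having to re-derive it.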

Guardrails and audit trail

Guardrails enforce PII and HIPAA handling and log every decision with its source, reasoning, and timestamp. This is the audit-trail-native part of the architecture, and in insurance it is not a nice-to-have - it is a filing requirement. The output has to satisfy California 10 CCR 2698.36's documented-decision requirement and the antifraud-plan filing requirements of NAIC Model Act 680, adopted in 48 states. An investigation the agent cannot show its work on is not a defensible investigation, whatever its conclusion. For the full treatment of what makes an AI finding hold up, see the defensibility standard for fraud investigation AI. This is the component that turns a plausible answer into a record an SIU lead can override, a compliance officer can file, and a deposition can survive.
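
A minimal sketch of what one audit-trail entry could look like: a JSON line carrying the decision, its source, the reasoning, and a UTC timestamp, with a redaction pass applied before anything is written. The field names and the `audit_record` helper are hypothetical. The single SSN-shaped regex is a toy stand-in, and real PII and HIPAA handling requires far more than pattern matching.

```python
import json
import re
from datetime import datetime, timezone

# Toy pattern for US SSN-shaped strings; an assumption for illustration only.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Mask SSN-shaped substrings before the text is logged."""
    return SSN_LIKE.sub("[REDACTED]", text)


def audit_record(decision: str, source: str, reasoning: str) -> str:
    """Serialize one decision as a JSON line with source, reasoning, and UTC timestamp."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "source": source,
        "reasoning": redact(reasoning),
    })


line = audit_record(
    decision="route_to_human_review",
    source="doc:invoice-17",
    reasoning="Invoice lists SSN 123-45-6789 and was edited after submission.",
)
record = json.loads(line)
print(record)
```

Writing one self-describing line per decision keeps the trail append-only and easy to export, so an SIU lead or compliance reviewer can read it without the system that produced it.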

A flagged claim fans out into 15+ parallel investigation phases, each writing to a shared evidence state, converging on an audit-ready resolution the investigator reviews.

"The mistake is treating memory and evals as features you bolt on. In an investigation agent they are load-bearing. Memory is the evidence graph the whole case reasons over, and evals are what let the system say 'I am not sure, send this to a human' instead of confidently reporting a hallucination as a finding."
- Hesper AI product research

Agentic investigation vs chatbots vs rules engines

Three systems get conflated in insurance AI conversations, and they do different jobs: an LLM chatbot answers questions, a rules engine or ML scorer flags a claim, and an agentic investigation architecture resolves a flag end-to-end. Only the third one gathers new evidence and produces a defensible conclusion. Conflating them is how carriers end up expecting a copilot to do an investigator's work, or a detection score to stand in for a documented finding.

The chatbot is single-turn or conversational: it retrieves and summarizes, has no plan, calls no investigative tools, keeps no evidence state, and does not evaluate its output against a standard. It is the class of tool behind that 82% routine-task adoption and the 80% faster processing on low-severity claims - real value at the simple end of the funnel. The rules engine is deterministic thresholds or a scoring model with high recall and a 60-85% false-positive rate; it outputs a flag or a 0-999 score and does not gather new evidence about the specific claim. The agent plans, calls tools, maintains memory across 15+ parallel phases, evaluates its own findings, and produces an audit-ready report in 2-4 hours covering 100% of flagged claims. Only the last one resolves a flag.

Dimension	LLM chatbot / copilot	Rules engine / ML scorer	Agentic investigation architecture
Core job	Answer questions, summarize	Flag or score a claim	Resolve a flagged claim end-to-end
Planning	None	None (fixed thresholds)	Decomposes claim into an investigation plan
Tool use	Retrieval only	None	Forensics, OSINT, cross-carrier, statements, financial
State / memory	Conversation window	Stateless per rule	Evidence graph across 15+ parallel phases
Self-evaluation	None	Score only	Evals, confidence thresholds, human routing
Output	Text answer	Flag / 0-999 score	Audit-ready report with sourced decision trail
False-positive posture	n/a	60-85% (rules-based)	Flags investigated, not just raised
Layer	Assist	Detection	Investigation

The contrast with rules engines is worth holding onto because it is the one carriers feel most directly: a 60-85% false-positive rate means most flags are not fraud, and every one of them still needs a human to investigate it. A score creates work; it does not do work. We treat that distinction in depth in legacy rules versus autonomous AI fraud detection. Detection is upstream; investigation is downstream.
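
To make "a score creates work" concrete, here is the arithmetic for a hypothetical batch of 1,000 flags at both ends of the 60-85% false-positive range. The batch size and the function name `split_flags` are assumptions, and only the rates come from the text above.

```python
def split_flags(flags: int, false_positive_rate: float) -> tuple[int, int]:
    """Split a batch of flags into (likely fraud, false positives)."""
    false_positives = round(flags * false_positive_rate)
    return flags - false_positives, false_positives


for rate in (0.60, 0.85):
    fraud, noise = split_flags(1000, rate)
    print(f"FP rate {rate:.0%}: {fraud} likely fraud, {noise} false positives, 1000 to investigate")
```

Whichever end of the range applies, all 1,000 flags still need investigating to find out which group each one belongs to.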

Where agent architectures fail - and how investigation architecture avoids it

Most enterprise agentic AI projects fail for a specific, honest reason: they are unscoped. Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027 because of escalating costs, unclear business value, and inadequate risk controls - many are hype-driven experiments with no defined job. An investigation agent avoids that trap not by being cleverer but by being narrowly scoped to one high-value task with a measurable payoff and a defensible output. Scope, measurability, and defensibility are the difference between a shipped agent and a canceled proof of concept.

Insurance mirrors the broader pattern. The Sedgwick study reported by Insurance Journal found that between 58% and 82% of insurers use AI tools, but only 12% describe their AI capabilities as fully mature and just 7% have achieved scalable AI success. Nearly two-thirds report a gap between their AI vision and their reality. That gap is not at intake, where automation already works; it is downstream, at investigation, where broad-but-shallow adoption runs out of road. Adoption is easy; scaling a system that produces a defensible result is the hard part, and it is the part architecture discipline decides.

The design choices that keep an investigation agent on the shipped side of that line are the five components used deliberately. A scoped job: investigate a flagged claim, not do claims. Evals and guardrails that catch bad output before a human acts on it. A defensible audit trail that satisfies the filing requirements rather than a black-box conclusion. And a human in the decision seat, which is what 75% of claims professionals say they want. The unit economics close the case for finance: roughly $150 per AI investigation against roughly $2,500 for a manual case makes 100% coverage affordable, and a measurable payoff is exactly what the canceled-project cohort lacked. Hesper is purpose-built for SIU investigation, not a generic agent platform - the evidence graph, cross-carrier retrieval, and audit trail are specific to the job, which is why the scope is defined and the value is measurable.
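
The unit-economics claim can be checked with a short calculation. The per-case costs and the coverage figures are the ones quoted in this post. The 400 flagged claims a month is a hypothetical volume, and the helper name `monthly_cost` is ours.

```python
MANUAL_COST_PER_CASE = 2_500  # roughly, per manual investigation
AGENT_COST_PER_CASE = 150     # roughly, per AI investigation


def monthly_cost(flagged: int, coverage: float, cost_per_case: int) -> int:
    """Spend on investigations given how much of the flagged queue is worked."""
    return round(flagged * coverage) * cost_per_case


flagged = 400  # hypothetical monthly volume of flagged claims
manual_spend = monthly_cost(flagged, 0.25, MANUAL_COST_PER_CASE)  # ~25% coverage
agent_spend = monthly_cost(flagged, 1.00, AGENT_COST_PER_CASE)    # 100% coverage

print(f"manual: ${manual_spend:,} for 100 cases")
print(f"agent:  ${agent_spend:,} for 400 cases")
```

Under these assumptions, full coverage with the agent costs less than quarter coverage done manually. That comparison is the measurable payoff the paragraph above refers to.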

How the architecture fits the existing SIU stack

An investigation agent sits downstream of detection and inside the claims-management system: a flagged claim flows out of the claims system, the agent investigates it, and an audit-ready report flows back as a case attachment. It does not replace the detection scorers, the handler-assist tools, or the claims suite already in place. Mapping the layers by architecture type shows why - each of those systems is built for a different job, and only the investigation layer produces a resolution.

Detection scorers such as FRISS, Verisk ClaimDirector, and SAS Fraud Framework are ML-scoring architectures - they output a flag or a 0-999 score on cross-carrier data and hand it off. Handler-assist agents such as Shift Claims augment a human adjuster's workflow. Per Shift's announcement, the benefits include loss reduction and faster handling, but the architecture keeps a human in the execution loop rather than running the SIU playbook autonomously. Claims suites such as Guidewire ClaimCenter and Duck Creek route and manage the case. None of them ships an autonomous, multi-phase, audit-trail-native investigation agent, because that is a different architecture with a different output. Hesper is complementary to FRISS, Shift Technology, and Verisk - not a replacement.

Layer	Representative vendors	Architecture type	Output	Resolves a flag?
Prevention	LexisNexis, underwriting tools	Rules plus data	Fewer bad claims filed	No
Detection	FRISS, Verisk ClaimDirector, SAS, Shift	ML scoring / network analysis	Flag or 0-999 score	No
Claim handling	Shift Claims, Guidewire / Duck Creek AI	Handler-assist agent / suite feature	Faster human handling	No
Investigation	Hesper AI (manual SIU is the incumbent)	Autonomous multi-agent, 15+ parallel phases	Audit-ready resolution	Yes

The architectural point beneath the even-handedness is the one that survives copy-paste to any of those vendors: a detection agent and an investigation agent are built for different jobs because scoring a claim and resolving a claim are different problems. A scorer optimizes recall against a threshold. An investigator optimizes toward a defensible conclusion. You cannot get the second by tuning the first. The investigation layer is the layer no other vendor occupies, and it is architecturally distinct because its output is an audit-ready resolution, not a score or a suggestion. Make every flagged claim investigable.

Key takeaways

An AI agent architecture for claims investigation plans an investigation, calls tools, maintains evidence memory across parallel phases, evaluates its own findings, and outputs an audit-ready report - not a score or an answer.
Claims investigation fits an agentic architecture because an SIU playbook is already a plannable, multi-source workflow; the constraint was human attention, and running 15+ phases in parallel removes it.
The five components - orchestration, tool use, memory and retrieval, evals, and guardrails plus audit trail - each map onto a specific part of the SIU playbook and are where generic agent designs break in insurance.
Only an agentic investigation architecture resolves a flag: a chatbot answers questions and a rules engine outputs a score with a 60-85% false-positive rate, and neither gathers new evidence about the specific claim.
Scope, measurability, and defensibility separate a shipped investigation agent from the over-40% of agentic projects Gartner expects canceled by 2027 - which is why a purpose-built SIU agent survives where generic platforms stall.

AI agent architecture in insurance is the design of software agents that plan and carry out a multi-step task - such as investigating a flagged claim - rather than just answering a question or returning a score. A production agent architecture has five parts: an orchestrator that plans and sequences the work, tools the agent can call such as document forensics, cross-carrier data lookups, OSINT, and statement analysis, a memory or evidence state that persists across steps, an evaluation layer that scores the agent's own outputs, and guardrails plus an audit trail. Gartner predicts 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from under 5% in 2025. In insurance, the most valuable target for this architecture is claims investigation, a workflow that is already multi-step and multi-source.

A chatbot answers questions and summarizes documents in a conversation; it has no plan, calls no investigative tools, keeps no evidence state, and does not evaluate its own output against a standard. An investigation agent does all four. It decomposes a flagged claim into an investigation plan, runs tools like document forensics and cross-carrier retrieval across 15+ phases, maintains an evidence graph the phases read and write, and produces an audit-ready report an SIU lead reviews. Insurance Journal reports 82% of carriers already use AI for routine tasks like data extraction and customer interaction, which is chatbot-class work. Investigation is a different job. The distinction matters because a chatbot can describe a claim, but only an agent architecture can resolve the flag on it.

A rules engine or ML scorer applies thresholds or a model to output a flag or a score - Verisk's ClaimDirector scores 0-999, FRISS scores at FNOL. That is the detection layer, and it is mainstream: the Insurance Information Institute reports 80% of carriers use predictive modeling to detect fraud, up from 55% in 2018. But a score is not an investigation. Rules-based fraud systems also carry a 60-85% false-positive rate, so most flags are not fraud, and each one still needs a human to investigate it. An agentic investigation architecture takes that flag and resolves it: it gathers new evidence, reasons about the specific claim's facts, and documents a conclusion. Detection is upstream; investigation is downstream.

Claims investigation suits an agentic architecture because an SIU investigation is already a structured, multi-source, parallelizable workflow with a defined end state - an audit-ready resolution. It has distinct phases: document forensics, OSINT, timeline reconstruction, statement cross-reference, financial pattern analysis, cross-carrier checks. A human runs them one at a time because human attention is the bottleneck; manual investigation takes 14+ days per case and covers only about 25% of flagged claims. An agent architecture runs independent phases in parallel - 15+ at once - because per-case attention is not a constraint for software. That compresses cycle time to 2-4 hours and lifts coverage toward 100% of flagged claims. The workflow did not need to be reinvented; the architecture removed the attention limit that capped it.

Memory is the evidence state the investigation is built on. As phases run, each writes findings - a metadata anomaly, an OSINT match, a timeline conflict - to a shared evidence graph that later phases read, so the agent reasons over the whole claim rather than one document at a time. Retrieval pulls in the claim file, policy terms, and cross-carrier history. This is not the same as a long prompt: research on agent memory, such as Hu, Wang and McAuley's 2025 MemoryAgentBench, shows current methods still fall short across accurate retrieval, long-range understanding, and selective forgetting - memory is a measured, unsolved engineering problem. That is exactly why an investigation agent needs an explicit, evaluated memory architecture rather than relying on a model's context window.

An investigation agent's output is kept reliable through evals and guardrails, plus a human in the loop. Evals score the agent's own findings - checking that every cited fact traces to a real source, flagging low-confidence conclusions, and routing uncertain cases to human review. Guardrails enforce PII and HIPAA handling and log every decision with its source, reasoning, and timestamp, producing an audit trail that satisfies California 10 CCR 2698.36 and NAIC Model Act 680 filings. This discipline is what separates a shipped agent from a failed one: Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027 over unclear value and weak risk controls, and 75% of claims professionals say AI needs human oversight. The investigator reviews and can override every output; the agent handles execution.

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027 because of escalating costs, unclear business value, and inadequate risk controls - many are hype-driven experiments with no scoped job. Insurance mirrors this: Insurance Journal reports between 58% and 82% of insurers use AI, but only 12% call their capabilities mature and just 7% have scaled successfully. An investigation agent avoids the trap by being narrowly scoped to one high-value job with a measurable payoff - resolve a flagged claim, at roughly $150 versus roughly $2,500 manual and 2-4 hours versus 14+ days - with evals, guardrails, a defensible audit trail, and a human decision-maker. Scope, measurability, and defensibility are the difference between a production agent and a canceled proof of concept.

No, an investigation agent does not replace human investigators. The architecture is built to keep a human in the decision seat. The agent runs the execution-heavy phases - evidence gathering, cross-referencing, timeline building, drafting the report - and produces an audit-ready package. The investigator reviews it, overrides where judgment is needed, and makes the call on referral, denial, or SIU escalation. This matches how carriers want to deploy AI: 75% of claims professionals say AI needs human oversight. Headcount does not shrink in this model; it gets re-aimed. Manual SIU teams can only investigate about 25% of flagged claims; an agent architecture lifts that toward 100%, so investigators spend their time on judgment across far more cases rather than manual execution of a few. The investigator's role shifts from execution to decision-making.