ARISE levels of autonomy vs. investigation-specific AI: why Shift's claims taxonomy misses the fraud dimension

5: Levels in Shift's ARISE taxonomy (Answers to Exceeds, modeled on SAE J3016)
$308B: Annual US insurance fraud loss (Coalition Against Insurance Fraud)
15+: Investigation phases Hesper runs (in parallel per flagged claim, the second axis)
2-4 hrs: Hesper time to a defensible finding (vs 14+ days manual, Hesper internal benchmark)

On June 3, 2026, Shift Technology published ARISE, a five-level taxonomy for AI-agent autonomy in insurance, modeled explicitly on the SAE J3016 self-driving standard. It is a useful contribution and carriers should use it. ARISE gives the sector a clean, vendor-neutral vocabulary for one thing that genuinely needed one: how much human involvement a given AI step requires. The argument in this post is not that ARISE is wrong. It is that an autonomy-level axis is necessary but not sufficient for fraud investigation, which needs a second, orthogonal axis - evidence-synthesis depth and whether the resulting finding can be defended.

The distinction matters most to an SIU director, who does not ask "what automation level is this?" but "can I defend this finding in a deposition, an examination under oath, or a SAR filing?" Those are different questions. An autonomy level tells you how independently an agent acts. It says nothing about how hard the underlying problem is or whether the output survives scrutiny. A fully autonomous agent that replaces a windshield and a fully autonomous agent that investigates a suspicious fire loss can sit at the same ARISE level and still be categorically different problems.

This post lays out ARISE fairly from Shift's own report, takes the SAE analogy seriously enough to see what it actually measures, and then adds the second axis fraud investigation requires. It is the framework-level companion to Shift Claims vs. Hesper, which draws the handler-assist versus autonomous-investigation line at the product level. This is a cluster post under our AI fraud platforms compared buyer's guide.

What ARISE actually says, and why it is a useful contribution

Start with the framework on its own terms. Per Shift Technology's ARISE report, ARISE defines five levels of AI-agent autonomy in insurance: Answers, Recommends, Initiates, Solves, and Exceeds (Level 1 through Level 5). The levels climb from an agent that answers questions, to one that recommends actions, to one that initiates them, to Level 4 "Solves" - an agent that handles a task end-to-end with no human intervention at 99% or higher accuracy - to Level 5 "Exceeds," an agent that surpasses the top 1% of human performers.
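As a minimal sketch, the five levels can be written down as an ordered enumeration. Only the level names, their order, and the Level 4 and Level 5 definitions come from the ARISE report as quoted above; the class and function names are illustrative assumptions.

```python
from enum import IntEnum


class AriseLevel(IntEnum):
    """The five ARISE autonomy levels, in the order the report names them."""
    ANSWERS = 1
    RECOMMENDS = 2
    INITIATES = 3
    SOLVES = 4    # end-to-end, no human intervention, 99% or higher accuracy
    EXCEEDS = 5   # surpasses the top 1% of human performers


def acronym() -> str:
    """Spell the acronym from the first letter of each level, in order."""
    return "".join(level.name[0] for level in AriseLevel)


print(acronym())                               # ARISE
print(AriseLevel.SOLVES < AriseLevel.EXCEEDS)  # True
```

An ordered type is enough here because the taxonomy ranks one thing only: how independently the agent acts.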

Shift frames ARISE as "the precise, vendor-neutral vocabulary [the sector] has been missing: a way to evaluate, procure, and govern AI agent capabilities." That framing is the right one, and it is the opening this post extends rather than attacks. Buyers do need a shared vocabulary for autonomy level, and ARISE supplies one. The taxonomy is modeled on SAE J3016, the self-driving standard, which is a sensible analogy to reach for when the question is "how much of the task does the machine do versus the human."

Shift positions its own claims agents at Level 4 "Solves" in production today, for named use cases: auto glass, auto physical damage repairs, property electronic-device claims, workers' compensation coverage, medical bill review, and travel. It targets Level 5 "Exceeds" for 2026 on use cases such as auto liability and property building and content loss. Take those self-positioning claims at face value; this post does not contest them. The point worth noticing is what every one of those named Level 4 use cases has in common, which is the subject of the next two sections.

What ARISE measures, stated plainly

ARISE measures autonomy level - how much human touch each step of a claims-handling task requires. That is a real and useful dimension. It is not a measure of how difficult the task is, how deep the evidence synthesis runs, or whether the resulting decision can be defended to a regulator. Those are separate questions, and an autonomy level does not answer them.

The SAE analogy, taken seriously, reveals a single axis

The SAE analogy is worth taking seriously, because it clarifies exactly what an autonomy taxonomy does and does not measure. SAE J3016, "Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles" (2021 revision), defines six levels of driving automation: Level 0 No Driving Automation, Level 1 Driver Assistance, Level 2 Partial Driving Automation, Level 3 Conditional Driving Automation, Level 4 High Driving Automation, and Level 5 Full Driving Automation.

Here is the part the analogy makes obvious. SAE J3016 measures one thing - how much of the driving task the system performs versus the human. It deliberately does not rank how hard the driving is. A Level 4 shuttle running a fixed campus loop and a Level 4 robotaxi negotiating dense city traffic share an autonomy level, but they are categorically different problems. The shuttle operates in a constrained, predictable environment; the robotaxi handles open-ended, adversarial conditions where other actors behave unpredictably. The level number tells you who does the driving. It does not tell you how difficult the driving is.

Carry that straight into insurance and the same property holds for ARISE. The taxonomy measures human-touch-per-step. It does not, and is not designed to, rank problem difficulty. An L4 agent that replaces a windshield and an L4 agent that investigates a suspicious total loss are equally autonomous by the definition, yet the difficulty and the stakes are nowhere near the same. This is not a flaw in ARISE; a single-axis taxonomy is exactly what SAE built and exactly what ARISE inherited. It is a reason to add a second axis when the task is investigation, not a reason to discard the first.

The campus shuttle and the robotaxi

This is an analogy, not a sourced statistic. A Level 4 campus shuttle and a Level 4 robotaxi sit at the same SAE level and solve problems of wildly different difficulty. The autonomy level is silent on that difference. In claims, the equivalent is an L4 glass-replacement agent and an L4 fraud-investigation agent: same autonomy level, different problem class.

Why fraud investigation needs a second axis - evidence depth and defensibility

The reason a second axis is needed comes down to a difference in problem class. Claims handling - in-policy glass replacement, a standard auto physical damage repair, medical bill review against a fee schedule - is largely a deterministic problem. The correct outcome is set by contractual and regulatory logic, so the job is to apply that logic consistently and quickly. There is no contested fact. That is precisely the kind of task an autonomy level describes well, and it is why every Level 4 use case Shift names is a handling or straight-through-processing problem.

Fraud investigation is a different problem class. It is adversarial and contested. The claimant is actively trying to pass; the answer is not written in the policy document; and the finding has to be defended downstream - in an examination under oath, a suspicious-activity report, or a state Department of Insurance audit. The stakes are large: roughly 10% of property-casualty losses involve fraud, per the Coalition Against Insurance Fraud, which also puts the annual US insurance fraud loss at $308 billion. A flagged claim, in other words, needs real investigation, not just fast handling.

Fraud also surfaces slowly, which is what makes investigation depth and time matter. Across fraud generally, the median scheme runs about 12 months before it is caught, per the ACFE Report to the Nations. That figure is from occupational fraud, not P&C claims specifically, so read it only as evidence that fraud is structurally hard to surface and resolve - not as a claims number. The general point stands: a flagged claim is the start of an investigation, not its conclusion, and the depth of that investigation is the thing an autonomy level does not capture.

Depth is the volume of evidence synthesis, not the level of autonomy

Investigation depth is a measure of how much evidence synthesis a claim receives. At Hesper, the unit of that depth is 15+ investigation phases run in parallel on every flagged claim: document forensics, OSINT, statement cross-referencing, timeline reconstruction, financial-pattern analysis, and more, running simultaneously rather than one after another. A handling agent does not need that depth, because there is no contested fact to resolve. An investigation agent that skips it has not investigated - it has triaged.
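A minimal sketch of what "running simultaneously rather than one after another" can look like, using Python's asyncio. This is an illustration under stated assumptions, not Hesper's implementation: the phase names are taken from the list above, while `run_phase`, `investigate`, and the claim ID are hypothetical and each phase body is a placeholder.

```python
import asyncio

# Phase names come from the paragraph above; everything else is hypothetical.
PHASES = [
    "document_forensics",
    "osint",
    "statement_cross_referencing",
    "timeline_reconstruction",
    "financial_pattern_analysis",
]


async def run_phase(name: str, claim_id: str) -> dict:
    """Placeholder phase runner standing in for real evidence gathering."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work such as a lookup
    return {"phase": name, "claim_id": claim_id, "status": "complete"}


async def investigate(claim_id: str) -> list[dict]:
    """Start every phase at once and wait for all of them to finish."""
    tasks = [run_phase(name, claim_id) for name in PHASES]
    # gather returns results in the same order the tasks were passed in
    return await asyncio.gather(*tasks)


results = asyncio.run(investigate("CLM-0001"))
print([r["phase"] for r in results])
```

Because the phases overlap, total elapsed time in this sketch is close to the slowest single phase rather than the sum of all of them.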

Defensibility is whether the finding survives scrutiny

The second half of the axis is defensibility. A fraud finding is only useful if a human SIU lead can stand behind it when a regulator pulls the file. That requires an evidence chain, a documented decision trail, and support for the action taken - what a defensible finding actually requires is covered in the defensibility standard for fraud investigation AI. Hesper is audit-trail-native by design: every decision the agent makes is logged with sources, reasoning, and timestamps, which is what lets a Hesper-investigated case satisfy California 10 CCR 2698.36 and the antifraud-plan filing requirements of NAIC Model Act 680. An autonomy level says nothing about any of this.
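To make "logged with sources, reasoning, and timestamps" concrete, here is a hedged sketch of what a single audit-trail entry could contain. The field names, the `has_evidence_chain` check, and the sample values are assumptions for illustration; they are not Hesper's schema and do not by themselves establish regulatory compliance.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One logged decision. Field names are hypothetical."""
    claim_id: str
    decision: str
    reasoning: str
    sources: tuple[str, ...]
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def has_evidence_chain(self) -> bool:
        """Minimal check: the entry cites a source and states its reasoning."""
        return bool(self.sources) and bool(self.reasoning.strip())


entry = AuditEntry(
    claim_id="CLM-0001",
    decision="refer_to_siu_lead",
    reasoning="Invoice date precedes the reported loss date.",
    sources=("invoice_scan.pdf", "fnol_statement.txt"),
)
print(json.dumps(asdict(entry), indent=2))
print(entry.has_evidence_chain())
```

Making the record immutable (`frozen=True`) reflects the idea that a decision trail is appended to, not edited after the fact.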

There is one more reason depth matters downstream of detection: noise. Rules-based detection produces a false-positive rate in the range of 60-85%, which means most flagged claims are not fraud at all. Resolving that noise - separating the genuinely fraudulent claims from the false flags - is itself an investigation task. An autonomy level on the handling side tells you nothing about whether the investigation behind a decision was deep enough to clear the noise defensibly.
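A quick worked example of what that range implies, assuming the false-positive rate is read as the share of flags that turn out not to be fraud. The volume of 1,000 flags is an illustrative number, not a figure from any source.

```python
def genuine_flags(flagged: int, false_positive_rate: float) -> int:
    """Flags left over once the false positives are removed."""
    return round(flagged * (1 - false_positive_rate))


FLAGGED = 1_000  # illustrative volume, an assumption

low = genuine_flags(FLAGGED, 0.85)   # pessimistic end of the 60-85% range
high = genuine_flags(FLAGGED, 0.60)  # optimistic end of the range

print(f"{low} to {high} of {FLAGGED} flags are genuine")  # 150 to 400 of 1000 flags are genuine
```

Under that assumption, someone still has to work through all 1,000 flags to find the 150 to 400 that matter, which is the investigation workload the paragraph describes.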

The two-axis map - ARISE autonomy level x investigation-depth tier

Put the two axes together and the orthogonality becomes visible. Where an agent sits on the autonomy axis tells you nothing about where it sits on the investigation-depth axis. To make the second axis concrete, define four tiers of investigation depth:

Tier 0 - No investigation: the claim's correct outcome is determined by contractual and regulatory logic (glass replacement, in-policy APD repair). There is no contested fact to resolve.
Tier 1 - Triage and scoring: surface anomalies and flag suspicion. This is the detection layer - FRISS, Verisk, Shift detection - and it raises a question rather than answering it.
Tier 2 - Assisted investigation: a human investigator does the multi-source work with software assistance. The depth still depends on human attention; the tool speeds the human.
Tier 3 - Autonomous defensible investigation: end-to-end multi-phase evidence synthesis producing an audit-ready, defensible finding a human SIU lead reviews. This is the layer Hesper occupies.

With those tiers defined, the map shows why a single number is not enough. A glass-claim agent can be Level 4 on ARISE and Tier 0 on investigation depth - fully autonomous and with nothing to investigate. A suspicious fire-loss agent can also be Level 4 on ARISE and must be Tier 3 on investigation depth - the same autonomy level, far deeper synthesis, and a finding that has to be defended. The axes do not move together.

Example agent	ARISE autonomy level	Investigation-depth tier	Why the two diverge
Auto-glass replacement agent	L4 Solves	Tier 0	Outcome is contractual; nothing to investigate
Straight-through APD repair agent	L4 Solves	Tier 0	Deterministic estimate logic; no contested fact
FNOL fraud-scoring model	L1-L2	Tier 1	Flags suspicion; does not resolve it
Handler-assist claims agent	L3-L4	Tier 2	Speeds the handler; investigation depth still human
Suspicious fire-loss investigation agent (Hesper)	L4 Solves	Tier 3	Same autonomy level, far deeper synthesis, must be defensible
Organized auto-fraud-ring investigation (Hesper)	L4 Solves	Tier 3	Adversarial, multi-claim, audit-trail-native finding

Read the table by row pairs. The first two rows and the last two rows all sit at Level 4 Solves on the autonomy axis - identical by that measure - yet they occupy the extremes of the depth axis. The glass and APD agents are Tier 0; the fire-loss and fraud-ring agents are Tier 3. That is the orthogonality in one picture. Knowing an agent is Level 4 tells you it acts without human intervention. It does not tell you whether the problem it solves has a contested fact, requires multi-phase evidence synthesis, or produces a finding a state DOI auditor can read.
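The table can be expressed as a small data structure to check the orthogonality claim directly. This is a sketch: the `Agent` type and `tiers_at_level` helper are illustrative names, and the rows are transcribed from the table above, with level ranges such as L3-L4 stored as every level in the range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    arise_levels: tuple[int, ...]  # every ARISE level the table lists
    depth_tier: int                # investigation-depth tier, 0 to 3


AGENTS = [
    Agent("Auto-glass replacement agent", (4,), 0),
    Agent("Straight-through APD repair agent", (4,), 0),
    Agent("FNOL fraud-scoring model", (1, 2), 1),
    Agent("Handler-assist claims agent", (3, 4), 2),
    Agent("Suspicious fire-loss investigation agent", (4,), 3),
    Agent("Organized auto-fraud-ring investigation", (4,), 3),
]


def tiers_at_level(level: int) -> list[int]:
    """Every depth tier that appears at a given autonomy level."""
    return sorted({a.depth_tier for a in AGENTS if level in a.arise_levels})


print(tiers_at_level(4))  # [0, 2, 3]
```

A single autonomy level maps to several depth tiers, so knowing the level alone does not determine the tier.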

Hesper is the only point on the map that is both Level 4 on autonomy and Tier 3 on depth. That combination is the whole positioning: from fraud detection to fraud resolution. Hesper takes a flagged claim and runs the full investigation to a defensible, audit-ready finding in 2-4 hours, lifting flagged-claim coverage from about 25% to 100% by running 15+ phases in parallel. No autonomy-level axis on its own can describe that, because the depth and defensibility are on the second axis.

The depth axis an autonomy level cannot see: flagged-claim investigation coverage (Hesper internal benchmark)

Manual SIU, regardless of handling-agent autonomy level: ~25%
Autonomous Tier 3 investigation (Hesper): 100%

That coverage gap is exactly what an autonomy level cannot see. A carrier can deploy Level 4 handling agents across glass and APD and still investigate only about 25% of its flagged claims, because flagged-claim coverage lives on the depth axis, not the autonomy axis. Manual SIU teams reach roughly 25% because each case takes 14+ days and an investigator carries 200+ cases; running the investigation autonomously in 2-4 hours is what lifts coverage to 100%. The autonomy level of the handling agent upstream does not move that number.

"An autonomy level tells you how independently the agent acts. It is silent on how deep the evidence synthesis runs and whether the finding can be defended. For fraud, you have to score both axes - and the second one is the one that survives a deposition."
- Hesper AI product research

How to evaluate an investigation agent - the question ARISE does not ask

The practical takeaway for a buyer is that asking only "what ARISE level is it?" can leave a gap. You can buy a Level 4 handling agent, automate glass and APD end-to-end, and still have an uninvestigated-flag problem downstream, because the handling agent never sat on the investigation-depth axis at all. The fix is to score both axes in the RFP. The autonomy axis - something like ARISE - tells you how much human touch each step needs. The depth axis tells you whether the flag actually gets investigated.

Timing makes this worth getting right. An AM Best survey reported by Insurance Journal in May 2026 found only about one in five carriers (20%) describe their AI implementation as being at an advanced stage, 53% call themselves cautious pacesetters rather than first movers, and just 13% feel very confident measuring AI ROI. Most buyers are early and want a defensible evaluation lens, not hype. A procurement vocabulary like ARISE is timely for exactly that reason - and so is adding the second-axis question, because the measurement gap the survey describes is partly a question of which axis you are scoring on.

Concretely, the depth-axis questions an SIU director should add to any investigation-agent RFP:

Across how many evidence-synthesis phases does it investigate each claim, and do they run in parallel or sequentially?
Is it audit-trail-native - does every decision come logged with sources, reasoning, and timestamps?
Does it produce an evidence chain and a documented decision trail, or only a recommendation and a routing action?
Can my investigator defend the output in a deposition, an EUO, or a state DOI audit under California 10 CCR 2698.36 and NAIC Model Act 680?
Does it investigate 100% of flagged claims, or accelerate the ones a human still works?

Those questions map onto the depth axis the way ARISE maps onto the autonomy axis. A handling agent answers the autonomy questions cleanly and the depth questions not at all, which is exactly correct for handling. An investigation agent has to answer both. In this model the investigator's role shifts from execution to decision-making: the agent runs the 15+ phases and produces the finding, and the human reviews and stands behind it. For the wider field of where each vendor sits, the AI fraud platforms compared buyer's guide places detection, handling, and investigation on the same map.
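One way to carry the depth-axis questions into an RFP scorecard is a simple yes/no record per vendor. The sketch below is an assumption about how a buyer might structure it, not a standard instrument; the field names paraphrase the five questions above and the sample vendor is hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class DepthAxisAnswers:
    """Yes/no answers to the five depth-axis RFP questions (names paraphrased)."""
    runs_phases_in_parallel: bool
    audit_trail_native: bool
    produces_evidence_chain: bool
    defensible_under_scrutiny: bool
    investigates_all_flagged_claims: bool


def open_gaps(answers: DepthAxisAnswers) -> list[str]:
    """List the questions a vendor answered 'no' to, in question order."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name)]


# Hypothetical vendor response, for illustration only
vendor = DepthAxisAnswers(
    runs_phases_in_parallel=True,
    audit_trail_native=True,
    produces_evidence_chain=False,
    defensible_under_scrutiny=False,
    investigates_all_flagged_claims=True,
)
print(open_gaps(vendor))
```

Scoring this alongside the ARISE level keeps the two axes separate on the page, so a strong autonomy score cannot mask an unanswered depth question.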

Where Hesper sits relative to ARISE and Shift's agents

None of this makes Hesper a competitor to Shift's ARISE agents. The layers are complementary. Shift's Level 4 handling agents operate on the handling side: detect, assist the handler, and automate straight-through use cases like glass and APD repair. Hesper operates downstream on the investigation side: it takes a flagged claim and runs the full investigation to a defensible, audit-ready finding. A carrier can run Shift's handling agents and Hesper's investigation agent at the same time; they address different points in the workflow and different axes of the problem.

Hesper is complementary to FRISS, Shift Technology, and Verisk - not a replacement. FRISS and Verisk are detection-layer: FRISS scores claims, and Verisk flags through cross-carrier data and ClaimSearch. Neither has published an autonomy taxonomy, and they do not need to, because the whole detection layer sits upstream of the investigation-depth axis - it raises the flag that an investigation agent then resolves. Detection is upstream; investigation is downstream. ARISE describes autonomy level across claims work generally; the investigation-depth axis is the one Hesper is purpose-built for.

Shift explicitly invited the sector to adopt ARISE as shared vocabulary. The posture of this post is to accept the invitation and add the axis fraud investigation requires. The autonomy axis is real, useful, and now well-named. The depth-and-defensibility axis is the one that determines whether a finding survives an EUO, a SAR, or a DOI exam. A buyer who scores both will not mistake a fully automated handling agent for a fully investigated flag. That is the difference between detecting fraud and resolving it.

Key takeaways

ARISE is a useful, vendor-neutral vocabulary for AI-agent autonomy in claims handling, modeled on SAE J3016, with Shift positioning its claims agents at Level 4 today and targeting Level 5 in 2026.
An autonomy-level axis measures how much human touch a step needs; it is silent on how deep the evidence synthesis is and whether the resulting finding can be defended.
Claims handling is largely deterministic - apply contractual and regulatory logic - while fraud investigation is adversarial and contested, so it needs a second axis: evidence-synthesis depth and defensibility.
The two axes are orthogonal: an L4 glass-claim agent (Tier 0 depth) and an L4 suspicious-fire-loss agent (Tier 3 depth) are categorically different problems at the same autonomy level.
Hesper should be evaluated on investigation depth - 15+ parallel phases, audit-trail-native, defensible finding, flagged-claim coverage lifted from about 25% to 100% - not only on where it sits on ARISE, and it is complementary to Shift, not a replacement.

Frequently asked questions

What is ARISE?

ARISE is a five-level taxonomy for AI-agent autonomy in insurance, published by Shift Technology on June 3, 2026 and modeled on the SAE J3016 self-driving standard. The levels are Answers, Recommends, Initiates, Solves, and Exceeds. Level 4 Solves describes an agent that acts end-to-end without human intervention at 99% or higher accuracy; Level 5 Exceeds describes an agent that surpasses the top 1% of human performers. Shift positions its claims agents at Level 4 today for use cases like auto glass and medical bill review, targeting Level 5 in 2026. Shift frames ARISE as a vendor-neutral vocabulary for evaluating, procuring, and governing AI agents. It is a useful lens for automation level in claims handling.

Does an ARISE autonomy level tell you how good a fraud investigation is?

No. An autonomy level measures how much human involvement a step requires - it does not measure how deep the evidence synthesis is or whether the resulting finding can be defended. The same point holds in SAE J3016: a Level 4 vehicle on a fixed campus loop and a Level 4 robotaxi in dense traffic share an autonomy level but are categorically different problems. In claims, an L4 auto-glass agent and an L4 suspicious-fire-loss agent are equally autonomous yet face entirely different evidentiary burdens. For fraud investigation specifically, you have to score a second axis: evidence-synthesis depth and defensibility. Autonomy level alone can let you buy a fully automated handling agent and still have an uninvestigated-flag problem downstream.

Why is fraud investigation different from claims handling?

Claims handling is largely deterministic: for an in-policy glass replacement or a standard auto physical damage repair, the correct outcome is set by contractual and regulatory logic, so the job is to apply that logic consistently. Fraud investigation is adversarial and contested: the claimant is actively trying to pass, the answer is not written in the policy document, and the finding must be defended in an examination under oath, a suspicious-activity report, or a state DOI audit. Roughly 10% of property-casualty losses involve fraud, per the Coalition Against Insurance Fraud, so a flagged claim needs real investigation, not just fast handling. That is why an autonomy taxonomy built for handling does not, on its own, describe investigation quality.

How should a buyer evaluate an investigation agent?

Use two axes, not one. First, the autonomy axis (something like ARISE): how much human touch each step needs. Second, and the one ARISE does not measure, the investigation-depth axis: how many evidence-synthesis phases run on each claim, whether the agent is audit-trail-native, whether it produces an evidence chain, and whether the finding is defensible. An AM Best survey reported by Insurance Journal in May 2026 found only about one in five carriers at an advanced stage of AI adoption and just 13% very confident measuring AI ROI, so the evaluation lens matters. Concretely, ask whether it runs the full investigation playbook end-to-end and whether your investigator can defend the output in a deposition. Hesper runs 15+ phases in parallel and is built audit-trail-native for that reason.

Is Hesper a competitor to Shift's ARISE agents?

No - the layers are complementary. Shift's claims agents operate on the handling side: detect, assist the handler, and automate straight-through use cases like glass and APD repair. Hesper operates downstream on the investigation side: it takes a flagged claim and runs the full investigation to a defensible, audit-ready finding. A carrier can run Shift's handling agents and an investigation agent at the same time; they address different points in the workflow. Hesper is complementary to FRISS, Shift Technology, and Verisk - not a replacement. The distinction is that ARISE describes autonomy level across claims work generally, while investigation requires a second axis - evidence-synthesis depth and defensibility - that Hesper is purpose-built for.

What is SAE J3016, and why does the analogy matter?

SAE J3016, Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (2021 revision), defines six levels of driving automation: Level 0 No Driving Automation, Level 1 Driver Assistance, Level 2 Partial Driving Automation, Level 3 Conditional Driving Automation, Level 4 High Driving Automation, and Level 5 Full Driving Automation. Shift's ARISE explicitly borrows this structure to define AI-agent autonomy in insurance. The useful thing the analogy clarifies is that SAE measures a single dimension - how much of the driving task the system performs versus the human. It deliberately does not rank how difficult the driving is. Carried into insurance, that means an autonomy axis is necessary but not sufficient to describe a fraud investigation's depth or defensibility.