There is a quiet, dangerous failure mode lurking inside the end-to-end autonomous-driving planners that learn to drive by imitating human experts. The model watches expert demonstrations and learns to map scenes to actions, but it has no built-in understanding of why the expert acted. So it learns whatever correlates with the action, and correlation is not causation. If a particular building facade or a roadside object happened to appear in many scenes where the expert braked or turned, the planner may learn to associate that scenery with the decision, even though the object had nothing to do with why the expert acted. This is called causal confusion, and a new paper, CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners (arXiv:2606.14438) by Zikun Guo, is devoted to detecting and fixing it.

The reason this matters so much is that the failure is silent. As the paper puts it, causal confusion silently compromises reliability in long-tail scenarios, the rare, unusual situations that are precisely where you most need the planner to be reasoning about the right things. A planner that drives well on common roads because it happens to have latched onto cues that usually correlate with correct behavior can fail abruptly when the spurious cue and the correct action come apart, which is exactly what a long-tail scenario does.

Why you cannot see the problem with the usual metrics

The deeper issue the paper identifies is that the standard evaluation tools are blind to this. The prevailing open-loop metrics for driving planners are L2 displacement (how far the planned trajectory deviates from the expert's) and collision rate, and both are, as Guo notes, dominated by ego status. The planner's own current state and motion are so predictive of the immediate next trajectory that these metrics look good almost regardless of what the planner is actually attending to in the scene. A planner could be making decisions for entirely spurious reasons and still post strong L2 and collision numbers, because those numbers are largely determined by the continuity of the ego vehicle's own motion. The metrics that the field uses to declare a planner good therefore do not indicate whether it depends on spurious cues. That is a profound problem: the standard yardstick cannot measure the standard flaw.
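To make the ego-status point concrete, here is a minimal sketch of the L2 displacement metric applied to a "planner" that never looks at the scene and simply extrapolates the ego vehicle's current velocity. The waypoints, the 0.5 s time step, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math

def l2_displacement(plan, expert):
    """Mean Euclidean distance (metres) between matching waypoints."""
    if len(plan) != len(expert):
        raise ValueError("trajectories must have the same number of waypoints")
    return sum(math.dist(p, e) for p, e in zip(plan, expert)) / len(plan)

def constant_velocity_plan(position, velocity, horizon, dt):
    """Extrapolate the ego state only; no scene input is consulted at all."""
    x, y = position
    vx, vy = velocity
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, horizon + 1)]

# Hypothetical expert trajectory: roughly straight at 10 m/s, drifting slightly left.
expert = [(5.0, 0.0), (10.0, 0.1), (15.0, 0.2), (20.0, 0.4), (25.0, 0.6), (30.0, 0.9)]

# Scene-blind baseline built from the ego position and velocity alone.
ego_only = constant_velocity_plan((0.0, 0.0), (10.0, 0.0), horizon=6, dt=0.5)

error = l2_displacement(ego_only, expert)
print(f"ego-only L2 displacement: {error:.2f} m")
```

On this made-up trajectory the scene-blind baseline is off by well under a metre on average, which is the sense in which a low L2 number says little about what a planner is attending to.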

The deployment blind spot in existing fixes

There is prior work on causal confusion, but it has a practical limitation CADET targets directly. Existing remedies based on causal-intervention training (methods that try to break spurious dependencies by intervening during training) require retraining large models. For modern end-to-end driving planners, retraining is enormously expensive, and more importantly, it cannot help with a planner that is already deployed. If you have a planner running in the field and you want to know whether it is causally confused, retraining-based methods offer no diagnosis of the model as it stands; the only remedy they provide is to train it again. There is a whole category of already-fielded planners that current causal-intervention methods simply cannot audit.

CADET's defining feature is that it is training-free. It audits, benchmarks, and repairs spurious reliance in pretrained end-to-end planners without any parameter update. Those three verbs describe a complete workflow. Audit: determine whether a given pretrained planner is relying on spurious, non-causal cues, which is the diagnostic step the standard metrics cannot perform. Benchmark: quantify and compare that spurious reliance, presumably so different planners can be ranked on a dimension that L2 and collision rate ignore. Repair: reduce the spurious dependence (that is, deconfound the planner), again without touching its weights. Doing all three with no parameter update is what makes CADET applicable to the deployed-model case that retraining methods abandon.
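The summary above does not spell out CADET's actual mechanisms, so the following is only a generic sketch of what a training-free audit and repair can look like, not the paper's procedure. It assumes a toy black-box planner over hand-named scene cues (`lead_gap_m` and `facade_visible` are hypothetical), audits it by intervening on one cue at a time, and repairs it by averaging the frozen planner's output over the nuisance cue at inference time. A real planner consumes sensor data, where intervening on a cue would mean editing the input or its features.

```python
from statistics import mean

def frozen_planner(scene):
    """Stand-in for a pretrained planner: a black box whose logic is never changed.
    It is deliberately confused and brakes for the facade as well as the lead car."""
    brake = 0.0
    if scene["lead_gap_m"] < 20.0:
        brake += 3.0   # legitimate cue: a close lead vehicle
    if scene["facade_visible"]:
        brake += 1.5   # spurious cue: scenery that co-occurred with braking
    return brake       # commanded deceleration in m/s^2

def sensitivity(planner, scenes, cue, values):
    """Audit step: mean output swing when only `cue` is intervened on."""
    swings = []
    for scene in scenes:
        outputs = [planner({**scene, cue: v}) for v in values]
        swings.append(max(outputs) - min(outputs))
    return mean(swings)

def marginalised(planner, cue, values):
    """Repair sketch: wrap the frozen planner and average over a nuisance cue."""
    def wrapped(scene):
        return mean(planner({**scene, cue: v}) for v in values)
    return wrapped

# Hypothetical audit scenes.
scenes = [
    {"lead_gap_m": 12.0, "facade_visible": True},
    {"lead_gap_m": 45.0, "facade_visible": True},
    {"lead_gap_m": 15.0, "facade_visible": False},
    {"lead_gap_m": 60.0, "facade_visible": False},
]

facade_before = sensitivity(frozen_planner, scenes, "facade_visible", [False, True])
gap_before = sensitivity(frozen_planner, scenes, "lead_gap_m", [10.0, 50.0])

repaired = marginalised(frozen_planner, "facade_visible", [False, True])
facade_after = sensitivity(repaired, scenes, "facade_visible", [False, True])
gap_after = sensitivity(repaired, scenes, "lead_gap_m", [10.0, 50.0])

print("facade sensitivity:", facade_before, "->", facade_after)
print("lead-gap sensitivity:", gap_before, "->", gap_after)
```

In this toy setting the wrapper removes the dependence on the facade while leaving the response to the lead vehicle intact, and the per-cue sensitivity scores are the kind of number a benchmark could compare across planners. The wrapper does assume you already know which cue is the nuisance, which is the hard part in a real scene.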

What 'physics-grounded' signals

The title's other operative phrase is physics-grounded. The distinction between a causal and a spurious cue in driving is not arbitrary; it is anchored in the physics of the scene. The variables that causally determine a driving decision are the ones with a real physical bearing on the vehicle's safe motion, such as the dynamics of nearby agents and the geometry of the drivable space, while spurious cues like a building facade have no physical role in why the expert acted. Grounding the auditing in physics gives CADET a principled criterion for separating the cues that should matter from the ones that merely co-occur, rather than relying on the planner's own learned, possibly confused, sense of relevance. It is the physics of the driving scene, not the planner's internal statistics, that defines what counts as a genuine cause.
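One way to picture a physics-based relevance test, offered purely as an assumption since the criterion CADET actually uses is not described here, is to ask whether an object could come close to the ego vehicle within a short horizon under constant relative velocity. The `SceneObject` fields, the 4 s horizon, and the 3 m safety radius below are all hypothetical choices for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    x: float   # metres ahead of the ego vehicle
    y: float   # metres to the left of the ego centreline
    vx: float  # velocity relative to the ego vehicle, m/s
    vy: float

def min_future_distance(obj, horizon_s):
    """Closest approach to the ego (at the origin) within the horizon,
    assuming the relative velocity stays constant."""
    speed_sq = obj.vx ** 2 + obj.vy ** 2
    if speed_sq == 0.0:
        t_star = 0.0
    else:
        # Unconstrained time of closest approach, clamped to [0, horizon].
        t_star = -(obj.x * obj.vx + obj.y * obj.vy) / speed_sq
        t_star = min(max(t_star, 0.0), horizon_s)
    return math.hypot(obj.x + obj.vx * t_star, obj.y + obj.vy * t_star)

def physically_relevant(obj, horizon_s=4.0, safety_radius_m=3.0):
    """Assumed criterion: the object can enter the safety radius in time."""
    return min_future_distance(obj, horizon_s) <= safety_radius_m

# Ego travels straight at 10 m/s, so static objects have relative vx = -10.
objects = [
    SceneObject("lead_car", x=15.0, y=0.0, vx=-5.0, vy=0.0),
    SceneObject("pedestrian", x=20.0, y=5.0, vx=-10.0, vy=-2.0),
    SceneObject("building_facade", x=30.0, y=12.0, vx=-10.0, vy=0.0),
]

relevance = {obj.name: physically_relevant(obj) for obj in objects}
print(relevance)
```

Under these assumptions the lead car and the pedestrian stepping toward the lane count as physically relevant, while the facade, which passes 12 m to the side, does not. The label comes from scene geometry and motion alone, with no reference to what the planner has learned.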

Why this is the right way to think about AV reliability

The significance of this work is as much conceptual as technical. It reframes a core safety question for autonomous driving away from how closely a planner imitates experts on average and toward whether it is making decisions for the right reasons. Those are very different questions, and the gap between them is exactly where long-tail failures live. A planner that scores well on L2 and collision rate but is causally confused is a planner that will behave unpredictably when the world stops cooperating with its spurious correlations, and current evaluation will not warn you in advance.

By making causal reliance auditable, benchmarkable, and repairable without retraining, CADET addresses two practical realities the field cannot ignore: that retraining giant planners is costly, and that many planners are already deployed and need to be assessed as they are. A training-free audit that you can run against a fielded model, and that exposes a flaw the standard metrics structurally cannot, is exactly the kind of safety tooling autonomous driving needs as end-to-end planners move from research benchmarks toward real roads.

The honest caveats follow from the framing. Causal confusion is a slippery target, and identifying which cues are spurious in a complex driving scene is itself a hard problem. CADET's physics grounding gives it a principled criterion, but the practical effectiveness of the audit and the repair will depend on how well that grounding captures the messy reality of real scenes. The value of the benchmark likewise depends on its tests genuinely stressing the spurious-versus-causal distinction. The training-free repair, too, is constrained by what can be achieved without changing the model's weights. But the contribution is real and well-aimed: it names a failure the field's metrics hide, builds tooling to expose it in the planners that are already on the road, and does so without the prohibitive cost of retraining. For autonomous driving, getting the planner to decide for the right reasons, not merely to match the expert's trajectory, is the difference that long-tail safety turns on.