Diffusion-based visuomotor policies have become a default recipe for learning robot manipulation from demonstrations, but deploying them in the real world surfaces a problem that is mostly invisible in a benchmark table. To run fast enough to control a robot, these policies are executed asynchronously: the policy generates a chunk of future actions, the robot starts executing it, and the next chunk is computed in parallel. The trouble is that successive chunks are generated from different observations at different times, and they do not necessarily agree where they meet. A new arXiv paper posted June 16, 2026 — LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation, by Guowei Shi, Xupeng Xie, Yiming Luo, Jian Guo, Jun Ma and Boyu Zhou — names the two symptoms precisely and proposes a unified fix.

The symptoms are inter-chunk discontinuities and a lack of explicit obstacle-aware execution, and the authors are direct about their consequences: "jerky motions and collisions that hinder reliable manipulation in real-world scenes." Both follow from the same root cause. A diffusion policy is trained to imitate demonstrated actions; it has no built-in notion of executing across the seam between two independently sampled chunks, and no built-in notion of obstacles it must avoid that were not salient in the demonstrations. Asynchronous execution exposes the first; messy real scenes expose the second. The jerk is not just an aesthetic complaint — discontinuous acceleration commands stress hardware and degrade precision — and the collisions are an outright safety failure.

"LAGO Policy improves inter-chunk consistency via latency-aware classifier-free guidance conditioning on future actions." (arXiv)

That sentence describes the first of three mechanisms, and it is the cleverest in how it repurposes an existing tool. Classifier-free guidance is the standard knob in diffusion models for steering generation toward a condition; here the condition is future actions, and the guidance is made latency-aware. The insight is that the discontinuity at a chunk boundary is fundamentally a timing problem: by the time the new chunk is ready, the robot has already moved partway through the old one, so the new chunk should be generated to be consistent with where execution will actually be, not where it was when generation started. Conditioning the diffusion sampling on the relevant future actions, with explicit awareness of the inference latency, pulls each new chunk toward agreement with the one it is supposed to continue. It is a targeted fix that addresses the seam at its source rather than smoothing over it after the fact.
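To make the timing argument concrete, here is a minimal sketch of the two ingredients, not the paper's implementation. It assumes latency is measured in whole control steps, that the conditioning signal is simply the slice of the previous chunk still ahead of the robot when the new chunk arrives, and that guidance uses the common convention eps_uncond + w * (eps_cond - eps_uncond). The function names, the 16-step chunk, and the timing numbers are all hypothetical.

```python
def latency_in_steps(latency_ms, control_period_ms):
    """Control steps that elapse while the next chunk is computed (rounded up)."""
    return -(-latency_ms // control_period_ms)  # integer ceiling division


def future_action_condition(prev_chunk, step_at_inference_start,
                            latency_ms, control_period_ms):
    """Actions from the previous chunk that will still lie ahead of the robot
    when the new chunk becomes available (assumed conditioning signal)."""
    start = step_at_inference_start + latency_in_steps(latency_ms, control_period_ms)
    return prev_chunk[start:]


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance on two noise predictions:
    scale 0 gives the unconditional prediction, scale 1 the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical numbers: a 16-step chunk, inference starts at step 4,
# inference takes 100 ms, and the controller runs every 20 ms.
prev_chunk = list(range(16))  # stand-in action indices
condition = future_action_condition(prev_chunk, 4, 100, 20)
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

With these numbers the robot consumes five more actions while inference runs, so the new chunk is conditioned on steps 9 through 15 of the old one rather than on the stale steps that followed the observation.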

Adding a goal the policy can plan toward

The second mechanism addresses the obstacle-awareness gap, and it does so by giving the policy something diffusion imitation does not natively provide: a goal. LAGO enables goal-directed collision-free trajectory planning by predicting a task-relevant interaction goal from demonstrations. This is a meaningful architectural addition. A pure diffusion policy generates actions reactively; it does not necessarily represent where, ultimately, the manipulation is trying to get to. By extracting a task-relevant interaction goal — say, the pose at which the gripper should make contact with an object — the framework gives the planning layer an anchor it can route toward while avoiding obstacles. That separation matters: the learned policy supplies the rich, demonstration-grounded behavior, and an explicit goal lets a planner reason about collision-free paths to that behavior's endpoint, which is precisely the kind of geometric reasoning diffusion imitation is weakest at.
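The value of an explicit goal is easiest to see in a toy planner. The sketch below is an illustration only: it assumes a 2D occupancy grid, a hypothetical predicted contact cell, and breadth-first search, none of which come from the paper, whose planner operates on real manipulator trajectories. The point is simply that once a goal exists, a planner can search for an obstacle-free route to it.

```python
from collections import deque


def plan_to_goal(start, goal, obstacles, width, height):
    """Breadth-first search on a 4-connected grid.
    Returns the shortest obstacle-free list of cells, or None if unreachable."""
    if start in obstacles or goal in obstacles:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_bounds = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_bounds and nxt not in obstacles and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


# Hypothetical scene: a wall stands between the gripper and the predicted
# interaction goal, so the straight-line route is blocked.
obstacles = {(2, 0), (2, 1), (2, 2)}
path = plan_to_goal(start=(0, 0), goal=(4, 0), obstacles=obstacles,
                    width=5, height=5)
```

Without the goal cell there is nothing for the search to terminate on, which is the sense in which a purely reactive policy leaves a planner with nothing to route toward.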

The third mechanism is the one that turns intent into clean motion: spatial-temporal trajectory optimization refines the actions to be executed for low-jerk and feasible motion. This is the trajectory-optimization half of the "unified asynchronous action-generation framework" the authors describe, and it is the stage meant to ensure that the final commands the robot receives are not just collision-free and consistent at the seams but also dynamically smooth, with jerk minimized and feasibility limits respected. Pairing a learned generative policy with a classical optimization back end is an increasingly common and increasingly sensible pattern: let the network propose behavior shaped by data, then let an optimizer enforce the smoothness and feasibility constraints that networks are bad at honoring on their own.
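As a rough picture of what a low-jerk refinement does, consider a one-dimensional sketch under strong simplifying assumptions: a single joint, uniformly spaced samples, jerk approximated by third finite differences, a quadratic penalty for straying from the policy's proposal, and plain gradient descent with pinned endpoints. The paper's spatial-temporal optimizer is certainly richer (it also handles timing, feasibility, and collisions), and every name and number below is hypothetical.

```python
def jerk_cost(x):
    """Sum of squared third finite differences, a discrete proxy for jerk."""
    return sum((x[i + 3] - 3 * x[i + 2] + 3 * x[i + 1] - x[i]) ** 2
               for i in range(len(x) - 3))


def smooth_low_jerk(x0, fidelity=1.0, iterations=500):
    """Minimize jerk_cost(x) + fidelity * ||x - x0||^2 by gradient descent,
    keeping the first and last samples fixed."""
    x = list(x0)
    n = len(x)
    # Step size 1/L, where 128 + 2 * fidelity bounds the gradient's Lipschitz
    # constant for this quadratic cost, so every iteration lowers the cost.
    step = 1.0 / (128.0 + 2.0 * fidelity)
    for _ in range(iterations):
        grad = [2.0 * fidelity * (xi - x0i) for xi, x0i in zip(x, x0)]
        for i in range(n - 3):
            d = x[i + 3] - 3 * x[i + 2] + 3 * x[i + 1] - x[i]
            grad[i] -= 2 * d
            grad[i + 1] += 6 * d
            grad[i + 2] -= 6 * d
            grad[i + 3] += 2 * d
        for i in range(1, n - 1):  # endpoints stay pinned
            x[i] -= step * grad[i]
    return x


# Two hypothetical chunks that disagree at the seam (a jump from 0.4 to 0.7).
stitched = [0.0, 0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9, 1.0, 1.1]
refined = smooth_low_jerk(stitched)
```

The fidelity weight is the design choice that matters here: set it high and the refined trajectory hugs the policy's proposal, seam and all; set it low and the optimizer spreads the seam's jump across neighboring samples at the cost of departing from what the policy asked for.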

Why the composition is the contribution

Read as a whole, LAGO Policy is an argument that the gap between a diffusion policy that scores well offline and one that runs reliably on hardware is bridged by three coordinated additions, not one. The latency-aware guidance closes the temporal seam; the predicted interaction goal supplies the spatial target for collision avoidance; the spatial-temporal optimization delivers low-jerk, feasible commands. Each addresses a distinct failure of the naive asynchronous-diffusion deployment, and they compose into a single pipeline rather than competing. From an actuation-and-manipulation standpoint, the most durable idea here is the explicit recognition that asynchronous execution is not a free implementation detail — it injects discontinuities that have to be designed against — and that the right place to design against them is partly in the generative model (via conditioning) and partly in a classical refinement layer.

The validation is real-world rather than simulated: the authors report extensive real-world experiments showing that LAGO Policy achieves smooth collision-free execution with high task success across challenging manipulation tasks. For a method whose entire motivation is the gap between offline scores and real hardware behavior, real-world evaluation is not optional, and it is the right bar to hold the work to. The appropriate caveats are the ones any such paper invites: "high task success" and "challenging" are relative to the chosen task suite, and the abstract does not quantify how much the latency-aware conditioning alone contributes versus the trajectory optimizer — a decomposition a careful reader would want from the ablations. There is also a practical question of how much additional inference and optimization budget the three-stage pipeline costs, since the whole point of asynchronous execution was to keep up with control rates in the first place. But as a clearly diagnosed and coherently engineered response to two real deployment failures of the dominant manipulation-learning recipe, LAGO Policy is a notable data point on how learned policies are being hardened for the physical world. The full preprint and project page are linked from arXiv.