SCDP: Single-Camera Diffusion Policy for Robot Manipulation

A new diffusion-based manipulation policy uses the robot's own planned end-effector trajectory as a visual attention anchor, matching multi-camera systems while running from a single global RGB view.

The quiet orthodoxy of modern robot manipulation is that you need a lot of cameras. Most visual imitation-learning systems that have worked well in the last few years lean on a multi-camera rig, typically one or two cameras watching the workspace from a fixed global vantage and at least one camera bolted to the robot's wrist that travels with the gripper. The wrist camera is doing heavy lifting that is easy to overlook: as the hand approaches an object, it delivers a close, stabilized, occlusion-resistant view of the exact contact region the policy needs to get right. Strip that camera away and the problem changes character. A new paper, Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera (arXiv:2606.14535), argues that the wrist camera is solving a problem you can solve differently, in software, and that doing so unlocks a much cheaper and simpler sensing setup.

The authors, Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, and Min Jun Kim, frame the difficulty plainly. Manipulation from a single global view is hard because the policy has to capture fine-grained interaction details (the millimeter-scale alignment of a peg and a hole, the precise lip of a container) while also figuring out which part of a cluttered scene actually matters for the task. The wrist camera previously answered both questions at once by physically pointing at the action. Without it, a single distant camera sees everything and emphasizes nothing.

“Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard.” (arXiv:2606.14535)

The core idea: the trajectory is the attention

SCDP's central insight is elegant enough to state in a sentence: the robot's own planned end-effector trajectory is a near-perfect signal for where in the image the policy should be looking. If the gripper is about to move toward a particular point in space, then the visual region around that point is, almost by definition, the task-relevant region. The policy does not need an external mechanism to learn attention from scratch; the motion it is already generating tells it where the action is.

Building on that observation, SCDP couples two components. The first is a visual encoder that produces multi-scale feature maps, deliberately preserving both broad scene context and fine-grained local detail rather than collapsing the image into a single global vector. That multi-scale design matters because the two halves of the manipulation problem live at different resolutions: identifying the right object is a coarse, contextual judgment, while inserting or aligning it is a fine one. The second component is a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories inside the diffusion loop. This is the part that makes the method distinctive.
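To make the sampling step concrete, here is a minimal NumPy sketch of point-wise feature lookup along a trajectory across several feature-map scales. It is an illustration under stated assumptions, not the authors' implementation: the function names, the bilinear interpolation, the concatenation of per-scale features, and the convention that trajectory points already arrive as normalized image coordinates in [0, 1] (for example, after projecting 3D waypoints through a calibrated camera) are all choices made here for the example.

```python
import numpy as np


def bilinear_sample(feature_map, points_xy):
    """Bilinearly interpolate a (H, W, C) map at (N, 2) pixel coords (x, y)."""
    h, w, _ = feature_map.shape
    x = np.clip(points_xy[:, 0], 0, w - 1)
    y = np.clip(points_xy[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]  # horizontal interpolation weight, shape (N, 1)
    wy = (y - y0)[:, None]  # vertical interpolation weight, shape (N, 1)
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom


def sample_multiscale(feature_maps, points_uv):
    """Sample every scale at the same normalized points and concatenate.

    feature_maps: list of (H_i, W_i, C_i) arrays (hypothetical encoder output).
    points_uv:    (N, 2) trajectory points in normalized [0, 1] image coords.
    Returns an (N, sum(C_i)) array: one feature vector per waypoint.
    """
    per_scale = []
    for fmap in feature_maps:
        h, w, _ = fmap.shape
        pixel_xy = points_uv * np.array([w - 1, h - 1])  # rescale per map
        per_scale.append(bilinear_sample(fmap, pixel_xy))
    return np.concatenate(per_scale, axis=1)


# Two made-up scales: a fine map with few channels, a coarse map with many.
rng = np.random.default_rng(0)
fine = rng.normal(size=(32, 32, 16))
coarse = rng.normal(size=(8, 8, 64))

# Six waypoints of an assumed end-effector path, already in image coordinates.
trajectory_uv = np.linspace([0.1, 0.2], [0.8, 0.6], num=6)
waypoint_features = sample_multiscale([fine, coarse], trajectory_uv)
print(waypoint_features.shape)
```

Each waypoint ends up with one vector that mixes fine local detail and coarse scene context, which is the property the multi-scale design is meant to preserve.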

Why doing this inside the diffusion loop is the clever part

Diffusion policies generate an action, here a trajectory, by starting from noise and iteratively denoising it over many steps. A naive way to add spatial conditioning would be to compute it once, up front, from a fixed guess of where the robot is going. SCDP instead samples visual features along the intermediate trajectory at points during the denoising process. The consequence is a feedback loop between the action being generated and the perception conditioning it: as the diffusion process refines its estimate of where the end-effector will go, the trajectory anchors move, and the policy re-samples visual features from the updated, more accurate locations. Perception and action co-refine instead of perception being frozen before action planning begins.
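The structure of that feedback loop can be sketched in a few lines. This is a toy under loud assumptions, not the paper's algorithm: the "feature map" is a hand-built field storing each pixel's offset to a goal, `toy_denoise_step` is a hypothetical stand-in for a learned denoising network, and the loop is deterministic, with none of the noise schedule a real diffusion sampler uses. The only point is the ordering: features are re-sampled at the current trajectory estimate on every iteration instead of once up front.

```python
import numpy as np


def make_offset_field(height, width, goal_xy):
    """Toy feature map: each pixel stores its (dx, dy) offset to a goal pixel."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.stack([goal_xy[0] - xs, goal_xy[1] - ys], axis=-1).astype(float)


def sample_nearest(feature_map, points_xy):
    """Look up the feature at the pixel nearest to each (x, y) point."""
    h, w, _ = feature_map.shape
    cols = np.clip(np.rint(points_xy[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(points_xy[:, 1]).astype(int), 0, h - 1)
    return feature_map[rows, cols]


def toy_denoise_step(trajectory_xy, cond_features, step_size=0.5):
    """Hypothetical denoiser: nudge waypoints using the sampled features."""
    return trajectory_xy + step_size * cond_features


def refine(feature_map, init_trajectory_xy, num_steps=10):
    """Alternate between sampling features and updating the trajectory."""
    trajectory = init_trajectory_xy.copy()
    sampled_at = []  # where the conditioning was read from at each step
    for _ in range(num_steps):
        sampled_at.append(trajectory.copy())
        cond = sample_nearest(feature_map, trajectory)  # re-sampled each step
        trajectory = toy_denoise_step(trajectory, cond)
    return trajectory, sampled_at


goal = np.array([40.0, 20.0])
feature_map = make_offset_field(64, 64, goal)
rng = np.random.default_rng(0)
noisy_start = rng.uniform(0, 63, size=(8, 2))  # random initial waypoints
final_traj, sampled_at = refine(feature_map, noisy_start)
print(np.abs(final_traj - goal).max())
```

Comparing `sampled_at[0]` with `sampled_at[-1]` shows the sampling locations drifting from the random starting points toward the goal, which is the toy analogue of the trajectory anchors moving as the estimate sharpens. Computing the conditioning once from `noisy_start` would instead leave the policy reading features from places the hand never visits.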

This is the mechanism that lets a single global camera approximate what a wrist camera provided physically. A wrist camera gives a close-up of the contact region because the hardware is attached to the hand. SCDP gives a close-up of the contact region because it crops, in feature space, exactly the image patches the predicted hand path runs through. One solution moves a sensor; the other moves attention. The second is far cheaper to deploy, and because there is no wrist camera, there is no hand-mounted sensor to be occluded, smeared by motion, or knocked out of calibration.

What the results suggest, and what they don't

The empirical claim has two parts. In extensive simulation experiments, SCDP consistently outperforms strong single-view baselines (other methods that also use only one camera) and reaches performance comparable to multi-camera baselines. That second comparison is the headline result: it suggests the spatial conditioning recovers most of the value the extra cameras were providing, at least on the evaluated tasks. In the real world, the authors report precise manipulation and, importantly, robustness to visual distractors: additional objects in the scene that a less focused policy might fixate on. Distractor robustness is a natural consequence of the design, because the trajectory anchor directs attention to the region the hand will actually visit, not to whatever happens to be visually salient elsewhere in the frame.

It is worth being disciplined about scope. The strongest claim, parity with multi-camera setups, is established primarily in simulation, with real-world experiments demonstrating precision and distractor robustness rather than a full head-to-head against multi-camera rigs across a large task suite. Single-view sensing also inherits structural limits that no amount of clever conditioning fully removes: a global camera cannot see what is occluded from its single vantage, and depth ambiguity from one RGB view is a real constraint for certain insertion and stacking tasks. SCDP's contribution is not a claim that one camera is always sufficient; it is a demonstration that a large fraction of the multi-camera advantage was about attention and contact-region detail, and that those can be supplied algorithmically.

Why it matters for where manipulation is heading

The practical significance is about cost, simplicity, and deployability. Multi-camera, wrist-mounted rigs add hardware, wiring, calibration burden, failure modes, and integration complexity to every robot. They also complicate the data side: collecting demonstrations and synchronizing several camera streams is more work than collecting from one. If single-camera policies can be made precise and robust, the barrier to deploying learned manipulation on inexpensive, simply instrumented arms drops sharply, which matters most for the long tail of practical applications where a fleet of cameras per workcell is uneconomic.

There is also a conceptual takeaway that outlasts this specific architecture. SCDP is part of a broader move in robot learning toward letting the action representation guide perception rather than treating perception as a fixed front-end. Using the predicted trajectory as a query into the visual feature space is a clean instance of that principle, and it generalizes: any policy that produces a structured spatial output can, in principle, use that output to decide where to look. The single-camera result is the immediately useful payoff, but the design pattern (action-conditioned, iteratively refined spatial attention sampled inside a generative loop) is the part most likely to show up in future systems. For a field that has spent several years quietly assuming more sensors are the answer, the more interesting message is that a better-placed question about where to attend can substitute for a camera.
