The 3D reconstruction field has a dirty secret: most systems are trained and tested on short, tidy clips that bear little resemblance to the continuous video streams that real-world applications actually produce. A camera moving through a kilometer-scale environment, a robot navigating a building, a self-driving car logging an extended route — none of these fit the constraints that competitive models are built around.
LoGeR, developed by researchers at Google DeepMind and UC Berkeley and published to arXiv on March 3, takes direct aim at this gap. The system processes sequences up to 19,000 frames through a chunk-based architecture that handles long video without the usual tradeoffs. Full bidirectional attention models like VGGT and π3 scale quadratically with sequence length, making them impractical beyond a few hundred frames. Recurrent approaches like CUT3R and TTT3R handle long sequences but tend to lose geometric precision over time. LoGeR avoids both failure modes.
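A back-of-the-envelope sketch makes the scaling argument concrete. The function names and the 128-frame chunk size below are illustrative (the chunk size matches the paper's training length, but the cost model is ours, not LoGeR's): full bidirectional attention pays a cost quadratic in the whole sequence, while chunked processing pays quadratic cost only within each fixed-size chunk, so total cost grows linearly with sequence length.

```python
# Illustrative cost model, not LoGeR's actual implementation.

def attention_cost_full(n_frames: int) -> int:
    """Full bidirectional attention: every frame attends to every frame."""
    return n_frames * n_frames

def attention_cost_chunked(n_frames: int, chunk: int = 128) -> int:
    """Chunked attention: quadratic only within each fixed-size chunk."""
    n_chunks = -(-n_frames // chunk)  # ceiling division
    return n_chunks * chunk * chunk

for n in (128, 1_000, 19_000):
    full = attention_cost_full(n)
    chunked = attention_cost_chunked(n)
    print(f"{n:>6} frames: full {full:>12,} vs chunked {chunked:>12,}")
```

At 19,000 frames the full-attention cost is roughly 150 times the chunked cost in this toy model, and the gap keeps widening linearly with sequence length.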
The trick is a hybrid memory module embedded in each residual block. Sparse Sliding Window Attention handles local alignment at chunk boundaries — preserving the fine geometric detail that recurrent models sacrifice. Test-Time Training maintains a compressed global state across chunks, preventing the scale drift that accumulates over long sequences. Bidirectional attention handles dense reasoning within each chunk. Together they achieve sub-quadratic scaling without trading away accuracy.
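The division of labor among the three mechanisms can be illustrated with a toy connectivity rule. Everything here is our own simplification, not code from the paper, and the chunk and window sizes are arbitrary: frames in the same chunk interact through dense bidirectional attention, frames on opposite sides of a chunk boundary but within the sliding window interact locally, and all other pairs influence each other only through the compressed global state.

```python
# Toy illustration of the three attention scopes; names and sizes are
# ours, not LoGeR's.

def link_type(i: int, j: int, chunk: int = 4, window: int = 2) -> str:
    """Which mechanism connects frames i and j in this toy layout."""
    if i // chunk == j // chunk:
        return "bidirectional"    # dense reasoning within one chunk
    if abs(i - j) <= window:
        return "sliding-window"   # local alignment at the boundary
    return "global-state"         # only via the compressed TTT memory

print(link_type(0, 3))   # bidirectional: same chunk of 4
print(link_type(3, 4))   # sliding-window: adjacent chunks, inside window
print(link_type(0, 9))   # global-state: far apart, different chunks
```

The point of the split is that the expensive dense attention never spans more than one chunk, while the cheap global state is the only component whose reach grows with sequence length.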
The benchmark results are unambiguous. On VBR — a dataset of sequences up to 19,000 frames across kilometer-scale trajectories — LoGeR is 30.8% ahead of the previous best feedforward method, and the margin grows as sequences get longer. On KITTI, Absolute Trajectory Error drops to 18.65, a 74% improvement over prior baselines. On 7-Scenes the gain is 69.2%. All of this from a model trained exclusively on 128-frame sequences. The out-of-distribution generalization — to sequences two orders of magnitude longer than anything seen during training — is the result that will get the most attention.
Code and a project page are already live alongside the paper. For anyone building systems that need accurate 3D maps from extended video — robotics, autonomous vehicles, persistent AR — the gap LoGeR closes is less an academic milestone than an engineering one.