Nathan Clonts built a vision-language-action model to track faces with pan/tilt servos. It barely worked. Twenty degrees of average error in a sixty-degree field of view. The architecture was fine. The training data was the real problem.
The setup: a 1.5-million-parameter action model trained through behavioral cloning on top of frozen foundation models (a CLIP text encoder and DINOv2 visual features). Clonts ran ten thousand simulated episodes with five hundred AI-generated faces and language commands like "track the face with glasses" or "look at the right-most person." But analysis showed that only three percent of training frames contained meaningful error. The oracle started each episode at a random offset, caught its target within a few frames, then coasted. The model spent ninety-seven percent of its training time learning what a perfectly centered face looks like, and almost none of it learning how to correct a large error. Deployed, it could rarely reach that centered state at all.
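The imbalance is easy to see in miniature. Here's a hedged sketch of the dynamic, assuming a proportional oracle that removes roughly forty percent of the remaining pointing error each frame, 100-frame episodes, and a three-degree threshold for "meaningful" error; those specific numbers are illustrative guesses, not Clonts's simulator settings.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EPISODES = 10_000    # matches the ten thousand simulated episodes
EPISODE_LEN = 100      # frames per episode (assumed)
FIELD_OF_VIEW = 60.0   # degrees, so offsets start anywhere in +/- 30
GAIN = 0.4             # fraction of remaining error the oracle removes per frame (assumed)
MEANINGFUL = 3.0       # degrees of error below which the face is effectively centered (assumed)

meaningful_frames = 0
total_frames = 0

for _ in range(N_EPISODES):
    # The oracle starts each episode at a random pan offset, then corrects proportionally.
    error = rng.uniform(-FIELD_OF_VIEW / 2, FIELD_OF_VIEW / 2)
    for _ in range(EPISODE_LEN):
        total_frames += 1
        if abs(error) > MEANINGFUL:
            meaningful_frames += 1
        error *= (1.0 - GAIN)   # catch the target within a few frames, then coast

print(f"frames with meaningful error: {meaningful_frames / total_frames:.1%}")
```

Under those assumptions the fraction of frames with meaningful error lands in the same low-single-digit range as Clonts's three percent; the exact number moves with the gain and the threshold, but the shape of the problem doesn't.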
The field has developed methods like DAgger and Hindsight Experience Replay specifically to get failure into training pipelines: DAgger folds the learner's own mistakes back into the dataset along with expert corrections, and HER relabels failed rollouts so they still carry useful signal. Clonts's face-tracking experiment is a tidy illustration of why that work matters. Perfect demonstrations teach almost nothing. Models need to see what going wrong looks like if they're going to learn how to recover.
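A DAgger-style loop for a setup like this is short enough to sketch. The version below is a toy, not Clonts's code: the state is a single pan error in degrees, the oracle is a hypothetical proportional controller, and the learned policy is just a fitted gain. What matters is the loop structure: roll out the current learner, let it drift into its own error states, ask the oracle what it would have done there, aggregate, and refit.

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_action(error):
    """Expert controller: command a servo move that cancels half the current pointing error."""
    return -0.5 * error

def rollout(policy, n_episodes=200, episode_len=50):
    """Run a policy from random initial offsets and record every state it visits."""
    states = []
    for _ in range(n_episodes):
        error = rng.uniform(-30.0, 30.0)                   # random starting offset, degrees
        for _ in range(episode_len):
            states.append(error)
            error += policy(error) + rng.normal(0.0, 0.2)  # apply action plus actuation noise
    return np.array(states)

def fit_policy(states, actions):
    """Behavioral-cloning step: least-squares fit of action = k * error."""
    k = np.dot(states, actions) / np.dot(states, states)
    return lambda e: k * e

# Round 0: plain behavioral cloning on oracle rollouts (dominated by near-centered states).
states = rollout(oracle_action)
actions = oracle_action(states)
policy = fit_policy(states, actions)

# DAgger rounds: roll out the *learner*, relabel the states it actually visits with the
# oracle's corrections, aggregate, and refit. The learner's own mistakes supply the
# off-center states the original dataset barely contained.
for _ in range(3):
    new_states = rollout(policy)
    states = np.concatenate([states, new_states])
    actions = np.concatenate([actions, oracle_action(new_states)])
    policy = fit_policy(states, actions)

print("learned gain:", policy(1.0))   # converges toward the oracle's -0.5
```

Round zero is plain behavioral cloning; every later round adds exactly the off-center states the cloned dataset was missing, which is the whole point of querying the expert on the learner's own trajectories.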