Robots Learn to Act from Video

Video2Act diverges from conventional video-language action models by adopting an asynchronous dual-system architecture: a slow perceptual system extracts nuanced spatial and motion information, while a fast-system action decoder leverages that information to achieve both high-frequency responsiveness and stable robotic control. This design sidesteps the limitations of static image-token concatenation and direct feature conditioning, as sketched below.
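To make the dual-system idea concrete, here is a minimal sketch of an asynchronous slow/fast loop: a heavy perception module refreshes a shared latent at low frequency, while a lightweight action decoder reads the latest cached latent and emits commands at control rate without ever blocking on the slow model. All class names, shapes, and rates here are illustrative assumptions, not the actual Video2Act implementation.

```python
# Minimal sketch of an asynchronous dual-system control loop. The class and
# method names below are hypothetical stand-ins, not the Video2Act API.

import threading
import time

import numpy as np


class SlowPerceptionSystem:
    """Stand-in for a heavy video encoder extracting spatial/motion features."""

    def extract_features(self, frames: np.ndarray) -> np.ndarray:
        time.sleep(0.2)  # simulate the latency of a large perceptual model (~5 Hz)
        return frames.mean(axis=(0, 1, 2))  # toy latent feature vector


class FastActionDecoder:
    """Stand-in for a lightweight policy head producing robot actions."""

    def decode(self, features: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        return 0.01 * (features[:7] - proprio)  # toy 7-DoF delta action


latest_features = np.zeros(8)      # most recent slow-system output
feature_lock = threading.Lock()    # guards reads/writes of the shared latent
stop_event = threading.Event()


def perception_loop(camera_stream):
    """Slow loop: refresh the shared latent whenever a new video clip arrives."""
    global latest_features
    perception = SlowPerceptionSystem()
    while not stop_event.is_set():
        frames = camera_stream()                       # short clip of recent frames
        features = perception.extract_features(frames)
        with feature_lock:
            latest_features = features


def control_loop(read_proprio, send_action, hz: float = 50.0):
    """Fast loop: decode actions at high frequency from the latest cached latent."""
    decoder = FastActionDecoder()
    period = 1.0 / hz
    while not stop_event.is_set():
        with feature_lock:
            features = latest_features.copy()          # never blocks on the slow model
        send_action(decoder.decode(features, read_proprio()))
        time.sleep(period)


if __name__ == "__main__":
    camera = lambda: np.random.rand(8, 64, 64, 8)      # fake clip: (T, H, W, C)
    proprio = lambda: np.zeros(7)                      # fake joint state
    actuate = lambda action: None                      # fake actuator interface

    threads = [
        threading.Thread(target=perception_loop, args=(camera,), daemon=True),
        threading.Thread(target=control_loop, args=(proprio, actuate), daemon=True),
    ]
    for t in threads:
        t.start()
    time.sleep(1.0)                                    # run briefly for demonstration
    stop_event.set()
```

The key design point the sketch illustrates is the decoupling of rates: the control loop keeps running at its own frequency on whatever latent is currently available, so slow perception updates never stall the robot's actions.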

The framework thus bridges the gap between visual perception and physical manipulation, empowering robots to understand and replicate actions observed in video.