Seeing is Understanding: Robots Learn to Interact with the World Through Language
![The system iteratively refines its understanding of a scene through interactive perception, where a robot evaluates observations and queries-leveraging a memory [latex]\mathcal{M}_{t}[/latex] to avoid repetition-and, when necessary, annotates images with segmented elements like push lines, keypoints, or grid patterns to inform subsequent actions [latex]a_{t}=\pi(\mathcal{M}_{t},\textbf{x}_{t},z_{t},\tilde{o}_{t})[/latex], ultimately building a contextual history [latex]S_{t}[/latex] of images and scene descriptions to enhance task efficiency.](https://arxiv.org/html/2602.18374v1/Figures/rss25/InterPer2.png)
A new framework empowers robots to manipulate objects and answer questions about their environment, even when vision is limited, by combining language understanding with enhanced visual perception.



