Can AI Truly Read the Room?

As Table 1 shows, models consistently surpass their initial scores after a period of reflection, suggesting that even initially imperfect systems can refine their own judgments.

A new benchmark reveals that while large language models excel at following instructions, they struggle with the spatial awareness and social reasoning needed to navigate real-world interactions.

Can AI Truly Understand Physics?

QuantiPhy constructs its benchmark in three stages: first, diverse videos are acquired and their backgrounds segmented; second, each source is meticulously annotated, with annotations tailored to capture nuanced physical properties; finally, benchmark tasks are formulated and categorized as either 2D or 3D based on the object's movement relative to the camera.
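The 2D/3D split described above could be sketched as a simple rule over an object's per-frame distance to the camera: roughly constant depth suggests in-plane (2D) motion, while changing depth suggests 3D motion. The function and threshold below are purely illustrative assumptions, not QuantiPhy's actual criterion.

```python
# Illustrative sketch (not QuantiPhy's actual code): classify a task as
# "2D" or "3D" from an object's per-frame distance to the camera.
def classify_task(camera_distances, rel_tolerance=0.05):
    """camera_distances: per-frame object-to-camera distances (e.g. meters)."""
    if not camera_distances:
        raise ValueError("empty trajectory")
    spread = max(camera_distances) - min(camera_distances)
    mean = sum(camera_distances) / len(camera_distances)
    # If depth varies by less than rel_tolerance of the mean, treat the
    # motion as in-plane (2D); otherwise the object moves in depth (3D).
    return "2D" if spread <= rel_tolerance * mean else "3D"

print(classify_task([2.0, 2.01, 1.99, 2.02]))  # near-constant depth -> 2D
print(classify_task([2.0, 2.5, 3.1, 3.8]))     # receding object -> 3D
```

The threshold value is a free parameter here; a real benchmark would likely fix it from annotation guidelines rather than a hard-coded tolerance.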

A new benchmark reveals that vision-language models struggle with basic physical reasoning, often relying on memorization rather than genuine understanding of visual scenes.