Scene graphs provide a structured, interpretable representation of the objects, attributes, and relationships in 2D, 3D, and even 4D scenes. By bridging raw visual data and high-level understanding, they are critical for tasks such as visual reasoning, navigation, and embodied AI. With the rapid rise of multimodal foundation models, integrating scene graphs with these models has become a timely and essential task, offering controllability, explainability, and stronger generalization across domains and modalities.
This workshop will highlight the latest advances in scene graph generation, representation learning, and their applications to vision–language reasoning, multimodal generation, and robotics. We aim to establish new benchmarks, foster interdisciplinary collaboration, and chart future directions toward structured multimodal intelligence. By uniting researchers from computer vision, NLP, and robotics, the workshop will stimulate impactful discussions and accelerate progress toward trustworthy, general-purpose AI systems.