A scene graph is a topological structure representing a scene described in text, image, video, or other modalities. Nodes encode object categories, attributes, or regions; edges encode pair-wise relationships. That explicit structure makes scene graphs a natural bridge from perception to reasoning and generation.
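As a rough illustration of the node-and-edge structure just described, the following minimal Python sketch stores objects with attributes as nodes and pairwise relationships as directed, labeled edges. The class name `SceneGraph`, its methods, and the example scene ("man riding horse") are hypothetical choices made for illustration only; they do not come from any particular scene graph library or dataset.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph: nodes are objects, edges are relations."""

    # node id -> {"category": str, "attributes": set of str}
    nodes: dict = field(default_factory=dict)
    # directed edges stored as (subject id, predicate, object id)
    edges: list = field(default_factory=list)

    def add_object(self, node_id, category, attributes=()):
        # A node carries an object category plus optional attributes.
        self.nodes[node_id] = {"category": category, "attributes": set(attributes)}

    def relate(self, subject, predicate, obj):
        # An edge encodes a pairwise relationship between two existing nodes.
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.append((subject, predicate, obj))

    def triples(self):
        # Flatten the graph into (subject category, predicate, object category).
        return [
            (self.nodes[s]["category"], p, self.nodes[o]["category"])
            for s, p, o in self.edges
        ]


# Assumed example scene: a tall man riding a brown horse.
graph = SceneGraph()
graph.add_object("n1", "man", ["tall"])
graph.add_object("n2", "horse", ["brown"])
graph.relate("n1", "riding", "n2")
print(graph.triples())  # prints [('man', 'riding', 'horse')]
```

Keeping attributes on nodes and predicates on edges means a single object or relation can be inspected or changed without touching the rest of the scene.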
Scene graphs provide a structured and interpretable representation of objects, attributes, and relationships in 2D, 3D, and even 4D scenes. They make complex scenes decomposable, editable, and easier to align across modalities.
As multimodal foundation models scale rapidly, integrating scene graphs can be beneficial, offering controllability, explainability, and stronger generalization across domains, viewpoints, and input types.
Improving cross-modal alignment: more fine-grained vision-text matching.
Enhancing multimodal fusion: semantic-level feature learning.
More controllable task modeling: highly structured modal representation.