Research Series

Scene Graph Structured Intelligence

A research thread on scene graphs as explicit structural representations that connect and facilitate comprehension, reasoning, and generation across text, image, video, 3D, and 4D worlds. As multimodal foundation models grow more capable, scene graphs are expected to supply the missing structural bias toward better controllability, explainability, and generalization.

Scene graph examples across image, text, and video inputs. *Scene graphs (SGs) unify objects, attributes, and relationships across various modalities, serving as a structured interface for image, text, video, 3D, and 4D representations.*

A scene graph is a topological structure representing a scene described in text, image, video, or other modalities. Nodes encode object categories, attributes, or regions; edges encode pairwise relationships between nodes. That explicit structure makes scene graphs a natural bridge from perception to reasoning and generation.
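The node-and-edge description above can be made concrete with a minimal sketch. The class names (`Node`, `SceneGraph`), the method names, and the example scene are all hypothetical, chosen only for illustration. They do not correspond to any particular scene graph library or to the representation used in the work listed on this page. The sketch also assumes the simplest layout, where attributes are stored on object nodes instead of as separate nodes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """An object in the scene, with an optional list of attributes."""
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Nodes are objects; edges are (subject index, predicate, object index)."""
    nodes: List[Node] = field(default_factory=list)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, name: str, *attributes: str) -> int:
        """Append an object node and return its index."""
        self.nodes.append(Node(name, list(attributes)))
        return len(self.nodes) - 1

    def add_edge(self, subj: int, predicate: str, obj: int) -> None:
        """Record a directed pairwise relationship between two nodes."""
        self.edges.append((subj, predicate, obj))

    def triples(self) -> List[Tuple[str, str, str]]:
        """Flatten the graph into readable (subject, predicate, object) triples."""
        return [(self.nodes[s].name, p, self.nodes[o].name)
                for s, p, o in self.edges]


# Hypothetical scene: "a smiling man riding a brown horse in a grassy meadow".
graph = SceneGraph()
man = graph.add_node("man", "smiling")
horse = graph.add_node("horse", "brown")
meadow = graph.add_node("meadow", "grassy")
graph.add_edge(man, "riding", horse)
graph.add_edge(horse, "standing in", meadow)

print(graph.triples())
# [('man', 'riding', 'horse'), ('horse', 'standing in', 'meadow')]
```

Because objects, attributes, and relationships are stored separately, changing one element (for example, replacing the predicate "riding" with "leading") leaves the rest of the scene description untouched.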

Scene graphs provide a structured and interpretable representation of objects, attributes, and relationships in 2D, 3D, and even 4D scenes. They make complex scenes decomposable, editable, and easier to align across modalities.

With multimodal foundation models rapidly scaling, integrating scene graphs can be beneficial, offering controllability, explainability, and stronger generalization across domains, viewpoints, and input types.

Improving cross-modal alignment

More fine-grained vision-text matching.

Enhancing multimodal fusion

Semantic-level feature learning.

More controllable task modeling

Highly structured modal representation.

Flagship Research

Track 01

Scene Graph Generation / Parsing

From 4D dynamic scenes to universal cross-modal parsers.

Track 02

Cross-modal Comprehension / Reasoning

Use structure to align language, video, and 3D scenes, and to support translation and information extraction (IE).

Track 03

Cross-modal Generation

Turn scene structure into explicit guidance for image and video synthesis.

Workshops

WACV26

Scene Graph for Structured Intelligence

Workshop series page and community hub

Featured workshop · Community event

SG4SI @ WACV 2026

A focal workshop that brings the SG-SI thread into public view: structured scene representation, multimodal reasoning, controllable generation, and embodied intelligence under one community-facing program.

It serves as the public face of the series, creating a visible entry point for collaborators, benchmark discussions, invited talks, and future positioning of scene graphs in structured multimodal intelligence.

Research visibility · Community building · Future directions

Workshop page

Survey

Coming soon SG-SI Survey

A forthcoming synthesis of the SG-SI landscape: tasks, benchmarks, methods, and open problems across structured multimodal intelligence.

Survey paper · In preparation

Systematic landscape review

Taxonomy, methods, and open challenges

This module is reserved for the survey paper that will organize the SG-SI thread from a higher level: what counts as scene graph structured intelligence, how existing approaches differ, which evaluation settings are still missing, and where the most important opportunities lie.

Taxonomy · Method landscape · Open problems

Link will be added when available

Community curated Awesome Scene Graph Generation

A living repository that tracks the broader scene graph landscape, including papers, applications, and community resources.

Papers · Applications · Resources

Community resource · GitHub repository

Awesome Scene Graph Generation

Curated resources for the broader scene graph landscape

This repository systematizes scene graph generation and adjacent SG-SI directions through a continuously maintained collection of resources: paper lists, datasets, codebases, and broader community updates.

Paper collections · Datasets and tools · Community updates

GitHub repo