LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction


1University of Oxford      2Seoul National University      3University of Edinburgh
*Indicates Equal Contribution

LEXI-SG is the first dense monocular mapping system to build open-vocabulary 3D scene graphs from RGB input alone, without depth sensors or ground-truth poses.

Abstract

Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction until each room is fully observed—enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D dataset and on self-collected egocentric office sequences. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone.

System Overview

LEXI-SG System Overview

RGB frames are segmented into rooms using DINO features. Upon detecting a room transition, the accumulated batch is passed through a feed-forward model (MapAnything) to produce per-frame depths and poses in a local room frame. New rooms are checked for loop closures and the pose graph is globally optimized over Sim(3).
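The room transition detection described above can be sketched as a running comparison between each new frame's global descriptor and the current room's feature history. This is a minimal illustration, not the paper's implementation: the feature extractor, window size, and cosine threshold are all assumptions for the sketch.

```python
import numpy as np

def detect_room_transition(frame_feats, window=5, thresh=0.8):
    """Return the index of the first frame whose global descriptor
    (e.g. a DINO-style embedding) diverges from the running mean of
    the current room's descriptors, or None if no transition is found.

    frame_feats: (N, D) array of per-frame global feature vectors.
    window:      number of recent in-room frames to average over
                 (hypothetical parameter for this sketch).
    thresh:      cosine-similarity threshold below which a frame is
                 treated as belonging to a new room (assumed value).
    """
    room_feats = [frame_feats[0]]
    for i in range(1, len(frame_feats)):
        mean = np.mean(room_feats[-window:], axis=0)
        f = frame_feats[i]
        cos = float(mean @ f / (np.linalg.norm(mean) * np.linalg.norm(f)))
        if cos < thresh:
            return i  # room boundary: hand the accumulated batch to reconstruction
        room_feats.append(f)
    return None
```

In the full pipeline, the frames accumulated before the detected boundary would form the batch passed to the feed-forward reconstruction model.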

Key Contributions

  • First monocular RGB scene graph SLAM system. LEXI-SG builds open-vocabulary 3D scene graphs without depth sensors or ground-truth poses.
  • Vision-only room detection. An online room transition detector using DINO features identifies room boundaries with no geometric priors.
  • Room-deferred reconstruction. Feed-forward inference (MapAnything) is deferred until a room is fully observed, giving each batch maximal co-visibility and avoiding sliding-window scale drift.
  • Sim(3) room-level factor graph. A pose graph over room nodes globally aligns per-room reconstructions while preserving local geometric consistency and correcting monocular scale ambiguity.
  • Open-vocabulary object segmentation. 2D mask tracklets are lifted into the scene graph as 3D object nodes, supporting natural-language queries.
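The Sim(3) factor graph above jointly optimizes many room-to-room constraints; as a minimal single-edge illustration, the similarity transform between two overlapping room reconstructions can be estimated in closed form from corresponding 3D points using Umeyama's method. This sketch uses standard NumPy and is not the paper's solver; the correspondence source (e.g. co-visible landmarks) is assumed.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity transform (scale s, rotation R, translation t)
    such that dst ≈ s * R @ src + t, via Umeyama's method.

    src, dst: (N, 3) corresponding points, e.g. landmarks shared by two
    room reconstructions (correspondences assumed given in this sketch).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d          # centered point sets
    cov = xd.T @ xs / len(src)               # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # reflection correction
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src   # recovers monocular scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The recovered scale factor is what corrects the per-room monocular scale ambiguity before the rooms are fused into the global scene graph.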

Results

We evaluate across four tasks: camera pose estimation (absolute trajectory error, ATE), dense reconstruction (Chamfer distance), room segmentation, and open-vocabulary object segmentation. LEXI-SG achieves the lowest average trajectory error on both HM3D and our self-collected egocentric office dataset (AOD), and the lowest reconstruction error on AOD. For full quantitative results, see the paper.
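For reference, the ATE metric used above is the RMSE of per-frame translational differences between the estimated and ground-truth trajectories. In standard practice the estimate is first aligned to ground truth (e.g. with a Sim(3) fit to absorb monocular scale); that alignment step is omitted in this minimal sketch.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error: RMSE of per-frame translation errors
    between an (already aligned) estimated trajectory and ground truth.

    est_xyz, gt_xyz: (N, 3) arrays of camera positions per frame.
    """
    err = est_xyz - gt_xyz
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```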

Reconstruction Quality

Room-based segmentation yields cleaner reconstructions and more accurate pose estimates than sliding-window baselines.

HM3D sequence 829 — rotate, pan, scroll to explore

LEXI-SG

MASt3R-SLAM

VGGT-SLAM2

ViSTA-SLAM

Room Segmentation

We perform room segmentation online from RGB alone using DINO feature cues, across a variety of indoor layouts.

Room segmentation results

Open-Vocabulary Segmentation

We evaluate on the OpenLex3D benchmark and achieve competitive object segmentation results (green is better).

Open-vocabulary segmentation comparison

BibTeX

@article{kassab2026lexisg,
  title   = {LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction},
  author  = {Kassab, Christina and Gil, Hyeonjae and Mattamala, Matias and Kim, Ayoung and Fallon, Maurice},
  year    = {2026},
}

Acknowledgements

The work at the University of Oxford was supported by a Royal Society University Research Fellowship (Fallon, Kassab), and the work at Seoul National University (Kim, Gil) was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00461409).