Shaohui Liu
ETH Zurich

Simultaneous localization and mapping (SLAM) is a fundamental technique with applications spanning robotics, spatial AI, and autonomous navigation. It addresses two tightly coupled challenges: localizing the device while incrementally building a coherent map of its surroundings. Localization, or positioning, involves estimating a 6 Degrees-of-Freedom (6-DoF) pose for each image in a continuous sequence, typically aided by other sensor data, while mapping involves constructing an evolving representation of the surrounding environment. The two tasks reinforce each other: accurate localization helps the device track its movement and improves the map's quality, while a better map in turn refines the device's pose estimates. Positioning is also crucial in real-world applications: it ensures the persistence of digital content and enables seamless sharing across devices, which is especially important for augmented reality, where precise placement enhances user experience and interaction.
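To make the notion of a 6-DoF pose concrete, the following is a minimal sketch (not part of any particular SLAM system; helper names are illustrative) of representing a pose as a 4x4 rigid transform in homogeneous coordinates, with 3 rotational and 3 translational degrees of freedom:

```python
import numpy as np

def pose_matrix(R, t):
    """Assemble a 4x4 rigid transform (6-DoF pose) from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(theta):
    """Rotation about the z-axis by theta radians (3 of the 6 DoF come from rotations like this)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical camera pose in the world frame: 90-degree yaw, 1 m along x.
T_wc = pose_matrix(rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))

# A point observed in the camera frame maps to world coordinates through the pose.
p_c = np.array([0.0, 1.0, 0.0, 1.0])  # homogeneous coordinates
p_w = T_wc @ p_c

# Relative motion between frames composes by matrix multiplication,
# which is how a pose is propagated along an image sequence:
# T_w_next = T_wc @ T_c_next
```

Estimating `T_wc` for every image in the stream, while jointly refining the map points it is measured against, is exactly the coupled problem SLAM solves.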
Recent advancements in mobile computing have fueled the development of wearable devices equipped with multiple color or depth cameras, inertial measurement units, and GPS. These devices capture egocentric, multi-modal data that pose challenges often overlooked by traditional SLAM research, which typically relies on curated datasets featuring controlled viewpoints and restricted motion patterns. In contrast, egocentric data exhibits significantly more diversity in motion patterns, viewpoints, and environments. These devices aspire to be all-day wearables that capture data over extended durations, during which factors like sensor calibration can drift. In this tutorial, we address the task of accurate positioning for large-scale egocentric data using visual-inertial SLAM and visual-inertial odometry (VIO).
As the academic community has been driven mainly by benchmarks disconnected from the specifics of egocentric data, we introduce LaMAria, a city-scale egocentric dataset collected with Project Aria devices to track progress in egocentric VIO/SLAM. These devices capture rich multi-sensor streams in a glasses-like form factor, so they can be worn over extended durations and distances without impeding the wearer's motion. The dataset exhibits the key characteristics of egocentric data, with a focus on challenges that break existing algorithms: long trajectories, extremely low illumination, fast motion, time-varying calibration, and travel on a moving platform or vehicle. In this tutorial, we aim to provide hands-on experience with the new dataset, while laying the groundwork for a forthcoming benchmark that will offer insights for research on accurate localization in the context of egocentric VIO/SLAM.
The tutorial will be structured as follows: