Documentation - LaMAria

Welcome to the LaMAria documentation! This documentation provides an overview of our sequence and calibration formats, ground truths, evaluation, and result submission.

Data formats
    Raw data
    ASL dataset
    ROS1 bag
Calibration data
Ground truth
    Sparse control points
    Dense pseudo ground truth poses
Evaluation
    Sparse evaluation
    Dense evaluation
Result submission
References

Data formats

We release the complete LaMAria recordings in three complementary formats to fit a wide range of research purposes. The primary distribution is Meta's VRS file format, which is the recording format utilized by the Aria glasses. For ease in SLAM benchmarking, we also provide our sequence data in the ASL dataset and ROS1 bag formats, discussed below. However, it is important to note that the image data in the ASL dataset and ROS1 bag formats have been undistorted to the PINHOLE model. We believe this step was necessary as the native Aria camera model is not supported by academic SLAM baselines. This is in contrast to the VRS file format, which retains the camera calibration. The undistortion to the pinhole model is performed using colmap image_undistorter and can be replicated using the script here.

Raw data

All sequences are natively recorded in Meta's VRS format, designed for long duration, multimodal XR data capture. Each VRS file contains timestamped streams from two global-shutter grayscale cameras (640 x 480 @ 20 FPS), one rolling-shutter RGB camera (1408 × 1408 @ 10 FPS), two IMUs (1 kHz & 800 Hz), magnetometer, barometer, GPS, WiFi, and Bluetooth sensors. To learn more about the VRS file format, please refer to the official documentation.

Although the VRS files record a wide array of sensors, our ground-truth generation and SLAM benchmarking rely exclusively on the two grayscale cameras (camera-slam-left, camera-slam-right) and the 1 kHz IMU (imu-right) streams.

ASL dataset

The ASL dataset format was originally introduced by the EuRoC MAV dataset [1] and is widely used in the SLAM community. It is a simple folder structure that contains the images, IMU data, calibration parameters, and ground truth poses. For the purpose of our benchmarking and evaluation, we release a modified version of the ASL dataset format that is described as follows:

    aria/
    ├── cam0/
    │   ├── data/
    │   │   ├── <timestamp1>.png
    │   │   ├── <timestamp2>.png
    │   │   └── ..
    │   └── data.csv
    ├── cam1/
    │   ├── data/
    │   │   ├── <timestamp1>.png
    │   │   ├── <timestamp2>.png
    │   │   └── ..
    │   └── data.csv
    ├── imu0/
    │   └── data.csv
    └── <sequence_name>.txt

The aria folder holds the grayscale cameras and 1kHz IMU data. The cam0 and cam1 folders refer to the left and right cameras respectively, and the data subfolder contains the images corresponding to the capture timestamps, stored as .png files. The data.csv file contains the mapping of capture timestamps to image filenames. The data.csv file of the imu0 folder stores the IMU data in the following format:

    timestamp, gyro_x, gyro_y, gyro_z, accel_x, accel_y, accel_z

Each aria folder also contains a text file with the name of the specific sequence. This file is used to store the nanosecond timestamps of the images.

ROS1 bag

The ROS1 bag format is a single .bag file that carries the grayscale cameras and 1kHz IMU data in three ROS topics. It contains two image topics, /cam0/image_raw and /cam1/image_raw, each carrying sensor_msgs/Image messages with 8-bit, mono-grayscale frames converted from the original .png files using cv_bridge. Every image message is stamped with a nanosecond timestamp and is assigned the frame_id cam0 or cam1, depending on the camera. The third ROS topic, /imu0, carries sensor_msgs/Imu messages at 1 kHz where the linear_acceleration and angular_velocity parameter fields are populated from the IMU data. Similarly to the camera messages, each IMU message is stamped with a nanosecond timestamp and is assigned the frame_id imu0.

Calibration data

We provide the calibration data for the grayscale cameras and 1kHz IMU in the following ways. This data can be found for each sequence in the datasets section.

aria calibration - Contains the intrinsic parameters and the factory transformation between the sensor to body frame for the original Aria cameras and right IMU. If you are working with the original Aria images, please use this calibration for the cameras.
pinhole calibration - Contains the intrinsic parameters and the factory transformation between the sensor to body frame for the pinhole (undistorted from original factory Aria model) cameras and right IMU.

The aria calibration contains intrinsic camera parameters ordered as follows:

    fx, fy, cx, cy, k0, k1, k2, k3, k4, k5, p0, p1, s0, s1, s2, s3

k0-k4 represent the radial distortion parameters, p0, p1 represent the tangential distortion parameters, and s0-s3 represent the thin-prism distortion parameters. The Aria camera model can be utilized directly within COLMAP by building the Camera object using the RAD_TAN_THIN_PRISM_FISHEYE model. To learn more about the Aria camera model, please refer to the official documentation.

Similarly, the pinhole calibration contains intrinsic camera parameters ordered as follows:

    fx, fy, cx, cy

The pinhole camera model can be utilized directly within COLMAP by building the Camera object using the PINHOLE model.

The calibration files also contain the right IMU noise and random walk parameters. This was obtained from the Aria sensor measurement and IMU noise model discussed in the official documentation. The IMU calibration parameters follow the commonly used Kalibr format described here.

Ground truth

Our dataset provides two complementary forms of ground truth to support evaluation of SLAM and odometry systems. First, we offer sparse control points measured with survey-grade instruments at centimeter accuracy, enabling highly precise but spatially sparse trajectory evaluation. Second, we provide dense pseudo ground truth poses, generated via joint visual-inertial control point optimization, which propagate these high-accuracy measurements to all keyframes for fine-grained analysis.

Sparse control points

Sparse control points (CPs) are fixed 3D locations measured with centimeter-level accuracy using GNSS-RTK surveying. Each CP is marked with an AprilTag [2] fiducial marker for automatic detection in the sequence imagery, triangulated from multiple viewpoints, and aligned to its surveyed position using a similarity transformation. The resulting transformation and alignment errors provide precise estimates of translation and scale drift allowing high accuracy trajectory evaluation.

The sparse control points are openly provided for the training sequences and can be found in the datasets section. The format for the control point data is as follows:

    {
      "filename": <sequence_name>,
      "timestamps": 
      {
        "camera-slam-left": 
        {
          "<timestamp>": "<image_name>",
          ...
        },
        "camera-slam-right":
        {
          "<timestamp>": "<image_name>",
          ...
        }
      },
      "control_points": 
      {
        "<cp_name>": 
        {
          "tag_id": [],
          "image_names": ["<image_name>", ...],
          "measurement": [x, y, z],
          "uncertainty": [x, y, z]
        },
        ...
      },
      "images": 
      {
        "<image_name>": 
        {
          "timestamp": <timestamp>,
          "control_point": "<cp_name>",
          "detection": [x, y]
        },
        ...
      }
    }

The timestamps field contains the mapping of capture timestamps to image filenames for both grayscale cameras. The control_points field contains the 3D position of each control point in the LV95/LN02 coordinate system, along with the associated uncertainty and the list of images in which the control point is observed. The images field contains the mapping of each image filename to its capture timestamp, the control point observed in the image, and the 2D pixel location of the control point in the image.

Dense pseudo ground truth poses

Dense pseudo ground truth poses are obtained by extending the sparse CP constraints to all the keyframes of a sequence in a joint bundle adjustment framework where visual, inertial and CP information are optimized. The resulting trajectories are less precise than the sparse control points, but sufficiently accurate for evaluating larger pose errors across the entire sequence.

The format for the dense pseudo ground truth poses is as follows:

    timestamp, tx, ty, tz, qx, qy, qz, qw

Here, timestamp is in nanoseconds and represents the capture timestamp of the keyframe, tx, ty, tz form the translation vector of the keyframe pose, and qx, qy, qz, qw form the quaternion rotation of the keyframe pose.

The body reference frame of the per-keyframe pseudo ground truth poses is the left camera frame (camera-slam-left) of the Aria glasses.

The dense pseudo ground truth poses are openly provided for the training sequences and can be found in the datasets section. Additionally, as discussed in the paper, we cannot guarantee sufficient accuracy of the dense pseudo ground truth poses for the moving platform sequences. If a moving platform sequence is missing a pose file, it is due to the lack of a good initialization for our optimization framework. In such cases, we recommend using only the sparse control points for evaluation.

Evaluation

The benchmark evaluates various metrics depending on the set that each sequence belongs to. We have three sets of sequences, namely the controlled experimental set, the additional set, and the main dataset. Our controlled experimental set and additional set of sequences serve as our training data, whereas our main dataset serves as our test data. To learn more about individual sequences and their set categories, please refer to the datasets section or the paper.

Sparse evaluation

Sparse evaluation measures trajectory accuracy by aligning the estimated poses to centimeter-accurate surveyed CPs located throughout the city. Each CP is automatically detected via fiducial markers in the images and triangulated from multiple views, then aligned to its surveyed position using a similarity transformation. The resulting alignment error quantifies the drift with a precision exceeding that of state-of-the-art SLAM systems, enabling reliable benchmarking even on kilometer-long sequences.

The sparse evaluation utilizes the following metrics:

Score 2D: Per-sequence core computed using the piecewise linear scoring function over the control point alignment error, evaluated in 2D. To learn more about the scoring function, refer to Section 5.2 and Figure 7 of our paper.
CP Recall @ 1m: Percentage of control points with alignment error less than 1 meter, evaluated in 2D.

We provide these metrics for all sequences that observe control points.

Dense evaluation

Dense evaluation leverages our dense pseudo ground truth poses, obtained by jointly optimizing visual, inertial, and control point constraints. While not as precise as the sparse CP-based evaluation, the dense pseudo poses are suitable for evaluating large keyframe pose errors.

For the controlled experimental set, where sequences are short and mostly do not include control points for sparse alignment, we evaluate using the common Absolute Trajectory Error (ATE) after Umeyama alignment to the dense pseudo ground truth poses.

For the additional set and main dataset, we instead report pose recall after sparse alignment, which measures the percentage of keyframes whose position error is less than 5 meters with respect to the dense pseudo ground truth poses.

The dense evaluation utilizes the following metrics:

Absolute Trajectory Error (ATE): Measures the error between estimated and ground-truth camera positions after Umeyama alignment. Used for the controlled experimental set.
Pose Recall @ 5m: Percentage of keyframes whose position error is less than 5 meters with respect to the dense pseudo ground truth poses.

It is possible to evaluate both training and test set results on this website by uploading the results as described below. For evaluating training results locally, the evaluation program is provided as a part of the source code (on GitHub).

Result submission

Important details

To submit your results, you need to sign up for an account and login. Both the training and test sequences can be evaluated online. We strongly recommend that participants use only the training sequences while tuning and developing their algorithms, and submit to the test sequences only once they are confident in their final approach.

To preserve the integrity of the benchmark and prevent overfitting to the test data, we do not allow continuous submissions to the test set. Test results can be updated for a method after 24 hours from the previous test set upload time. Partial submissions (on a subset of sequences) are possible, but for the main leaderboard we only compute challenge averages (and therefore rank a method in a category) if results for all of the sequences within a challenge are provided.

Submission format

Results for a method must be uploaded as a single .zip file, which must in its root contain a folder named slam with the results. There must not be any extra files or folders in the archive, or the submission will be rejected. Upper- / lowercase spelling matters for all directories and sequence names.

For a given sequence, the trajectory estimated by the VIO/SLAM algorithm must be provided as a text file with the same name as the sequence name, with file extension .txt. For example, for the sequence "R_01_easy", the result file must be named "R_01_easy.txt". Each line in the result file specifies the camera pose for one image. Lines must be ordered by increasing timestamp order and timestamps must be in nanoseconds. The format of each line is the same as the format of the dense pseudo ground truth data:

    timestamp, tx, ty, tz, qx, qy, qz, qw

For methods that utilize the IMU (monocular/binocular), we expect the IMU pose to be provided in the trajectory estimate file (world_from_imu). For visual-only methods, we expect the left camera pose to be provided in the trajectory estimate file (world_from_cam0).

Sparse evaluation is performed by triangulating the control points from the provided poses and aligning them to the measured control points. Therefore, it is important that the trajectory estimate file contains the poses for all timestamps of a sequence. Dense evaluation is performed against the keyframe dense pseudo ground truth poses.

The file structure of a submission should look like this:

    slam/
    ├── R_01_easy.txt
    ├── R_02_easy.txt
    ├── R_04_medium.txt
    ├── R_11_5cp.txt
    ├── sequence_1_1.txt
    ├── sequence_2_2.txt
    ├── .
    ├── .
    ├── .
    ├── .
    └── sequence_5_5.txt

References

[1] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. Achtelik and R. Siegwart, The EuRoC micro aerial vehicle datasets, International Journal of Robotic Research, DOI: 10.1177/0278364915620033, early 2016.
[2] E. Olson, "AprilTag: A robust and flexible visual fiducial system," 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 2011, pp. 3400-3407, doi: 10.1109/ICRA.2011.5979561.,