Continual SLAM

Beyond Lifelong Simultaneous Localization and Mapping through Continual Learning

An essential task for an autonomous robot deployed in the open world without prior knowledge about its environment is to perform Simultaneous Localization and Mapping (SLAM) to facilitate planning and navigation. Classical methods typically rely on handcrafted, low-level features, which tend to fail under challenging conditions, e.g., textureless regions. Deep learning-based approaches mitigate such problems due to their ability to learn high-level features. However, they generalize poorly to data that is out-of-distribution with respect to the training set. For visual SLAM, such out-of-distribution data can, for instance, correspond to images sourced from cities in different countries or captured under substantially different environmental conditions, e.g., summer versus winter.

In the context of this work, lifelong SLAM considers the long-term operation of a robot in a dynamically changing environment. Although this environment can be altered over time, the robot is constrained to stay within a single bounded environment, e.g., to obtain continuous map updates within a city. Recent works attempt to relax this assumption by leveraging domain adaptation techniques for deep neural networks. While a naive solution for adapting to a new environment is to source additional data, this is not feasible when the goal is to ensure the uninterrupted operation of the robot. Moreover, changes in environments can be sudden, e.g., rapid weather changes, and data collection and annotation often come at a high cost. Therefore, adaptation methods should be trainable in an unsupervised or self-supervised manner without the need for ground truth data. As illustrated in the figure above, the setting addressed in domain adaptation only considers unidirectional knowledge transfer from a single known to a single unknown environment. It thus does not represent the open world, where the number of new environments that a robot can encounter is infinite and previously known environments can be revisited.

To overcome this gap, we propose a new task called continual SLAM, where the robot is deployed on a sequence of diverse scenes from different environments. Ideally, a method addressing the continual SLAM problem should be able to achieve the following goals: 1) quickly adapt to unseen environments during deployment, 2) leverage knowledge from previously seen environments to speed up the adaptation, and 3) effectively memorize knowledge from previously seen environments to minimize the required adaptation when revisiting them.

Technical Approach

Continual SLAM Metrics

To address these challenges, we propose two novel metrics: adaptation quality (AQ), which measures the short-term adaptation capability when deployed in a new environment, and retention quality (RQ), which captures the long-term memory retention when revisiting a previously encountered environment. The AQ measures the ability of a method to effectively adapt to a new environment based on experiences from previously seen environments. Although it is inspired by forward transfer, in our setting we additionally apply online adaptation to the current task: mapping and localization in a new environment. To account for the opposing challenge of the continual SLAM setting, the RQ measures the ability of an algorithm to preserve long-term knowledge when re-deployed in a previously encountered environment. It is inspired by the backward transfer used in other continual learning settings. For the mathematical details, please consult our paper.

CL-SLAM Architecture

The core of CL-SLAM is the dual-network architecture of the visual odometry (VO) model that consists of an expert that produces the myopic online odometry estimates and a generalizer that focuses on the long-term learning across environments. Both networks are trained in a self-supervised manner where the weights of the expert are updated only based on online data, whereas the weights of the generalizer are updated based on a combination of data from both the online stream and a replay buffer. In this work, we utilize Monodepth2 for obtaining VO estimates, although the general approach is not limited to a specific architecture.

Figure: Overview of our proposed CL-SLAM that is constructed as a dual-network architecture including a generalizer (left) and an expert (right). While the expert focuses on the short-term adaptation to the current scene, the generalizer avoids catastrophic forgetting by employing a replay buffer comprising samples from the past and the present. Note that both subnetworks contain a single PoseNet, shown two times to illustrate the self-supervised training scheme.

Before deployment, i.e., before performing continual adaptation, we pre-train the DepthNet and the PoseNet using the standard self-supervised training procedure based on photometric consistency loss functions. When deployed in a new environment, we continuously update the weights of both the expert and the generalizer in an online manner, following a scheme similar to CoMoDA:

(1) Create an image triplet composed of the latest and the two previous frames. Similarly, batch the corresponding velocity measurements.
(2) Estimate the camera motion between both pairs of subsequent images.
(3) Generate the depth estimate of the previous image.
(4) Compute the self-supervised loss and backpropagate to update the weights of the DepthNet and PoseNet.
(5) Loop over steps (2) to (4) for multiple iterations.
(6) Repeat the previous steps for the next image triplet.
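The control flow of this update scheme can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function name `online_adaptation`, the `update_step` callback (standing in for steps (2) to (4), i.e., pose estimation, depth estimation, and the loss with backpropagation), and the default of two iterations are all assumptions made for this example.

```python
from collections import deque
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def online_adaptation(
    stream: Iterable[Tuple[Any, float]],
    update_step: Callable[[Sequence[Any], Sequence[float]], float],
    num_iterations: int = 2,  # hypothetical default, not taken from the paper
) -> List[float]:
    """Slide a three-frame window over the stream and update on every triplet.

    `update_step` is a hypothetical callback that is assumed to estimate the
    camera motion, predict depth, compute the self-supervised loss, and update
    the network weights. It returns the loss value of that update.
    """
    window: deque = deque(maxlen=3)  # latest frame plus the two previous ones
    losses: List[float] = []
    for image, velocity in stream:
        window.append((image, velocity))
        if len(window) < 3:
            continue  # a full triplet is not available yet
        # Step (1): build the image triplet and batch the velocity readings.
        images = [img for img, _ in window]
        velocities = [vel for _, vel in window]
        # Step (5): repeat the update several times on the same triplet.
        for _ in range(num_iterations):
            losses.append(update_step(images, velocities))
        # Step (6): the next loop iteration moves on to the next triplet.
    return losses
```

With a stream of five frames and two iterations per triplet, the callback is invoked six times, once per iteration for each of the three triplets.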

Upon deployment, both the expert and the generalizer are initialized with the same set of parameter weights, initially obtained from pre-training and later replaced by the memory of the generalizer. The weights of the expert are updated as described in the algorithm above. Additionally, every new frame from the online image stream is added to a replay buffer along with the corresponding velocity reading. Using only the online images, the expert quickly adapts to the current environment. This behavior can be described as a desired form of overfitting for a myopic increase in performance. The generalizer, on the other hand, acts as the long-term memory of CL-SLAM, circumventing the problem of catastrophic forgetting in continual learning settings. Here, in step (1), we augment the online data by adding image triplets from the replay buffer to rehearse experiences made in the past. After deployment, the stored parameters used for initialization are replaced by the weights of the generalizer, thus preserving the continuous learning process of CL-SLAM. The weights of the expert are then discarded.
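The lifecycle of the two networks during one deployment can be illustrated with the following sketch. It is a simplification under stated assumptions: weights are represented as a plain dictionary, `update` is a hypothetical callback that modifies weights in place given a batch of samples, the number of replayed samples per step (`num_replay`) is an illustrative parameter, and replayed samples are drawn uniformly at random before the current sample is stored. None of these details are taken from the released code.

```python
import copy
import random
from typing import Any, Callable, Dict, Iterable, List

Weights = Dict[str, float]  # stand-in for real network parameters


def run_deployment(
    stored: Weights,
    stream: Iterable[Any],
    replay_buffer: List[Any],
    update: Callable[[Weights, List[Any]], None],
    num_replay: int = 2,  # hypothetical number of replayed samples per step
    seed: int = 0,
) -> Weights:
    """Run one deployment and return the weights to store for the next one."""
    rng = random.Random(seed)
    # Both networks start from the same stored parameters.
    expert = copy.deepcopy(stored)
    generalizer = copy.deepcopy(stored)
    for sample in stream:
        # The expert sees only the online data and adapts to the current scene.
        update(expert, [sample])
        # The generalizer rehearses past experiences alongside the online data.
        k = min(num_replay, len(replay_buffer))
        replayed = rng.sample(replay_buffer, k)
        update(generalizer, [sample] + replayed)
        # Every online sample is stored for future rehearsal.
        replay_buffer.append(sample)
    # The expert is discarded; the generalizer becomes the new initialization.
    return generalizer
```

In this sketch the expert's weights never leave the function, which mirrors the idea that its overfitting to the current scene is useful only while the robot is operating there.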

Experience Replay

In CoVIO, we present a new version of the replay buffer that explicitly addresses the limited storage capacity on mobile devices and robotic platforms. In detail, the replay buffer is set to a fixed size and images are added/removed based on the cosine similarity of image features. The current online sample is only added if the similarity to the most similar image in the buffer is below a threshold. If adding a new image results in exceeding the allowed size of the replay buffer, the image that is the most similar with respect to all other images is removed.
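A fixed-size buffer with these two rules can be sketched as follows. This is an illustrative sketch rather than the CoVIO implementation: the class name `DiversityReplayBuffer`, its parameters, and the use of plain feature lists are assumptions. In particular, "most similar with respect to all other images" is interpreted here as the entry with the highest summed cosine similarity to the rest of the buffer, which is one plausible reading.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class DiversityReplayBuffer:
    """Fixed-size buffer that keeps a diverse set of image features."""

    def __init__(self, max_size: int, threshold: float) -> None:
        self.max_size = max_size      # hypothetical capacity parameter
        self.threshold = threshold    # hypothetical similarity threshold
        self.features: List[List[float]] = []

    def add(self, feature: Sequence[float]) -> bool:
        """Try to insert a feature; return False if it was rejected as redundant."""
        if self.features:
            closest = max(cosine_similarity(feature, f) for f in self.features)
            if closest >= self.threshold:
                return False  # too similar to an image already in the buffer
        self.features.append(list(feature))
        if len(self.features) > self.max_size:
            self._remove_most_redundant()
        return True

    def _remove_most_redundant(self) -> None:
        """Drop the entry with the highest summed similarity to all others."""
        def total_similarity(i: int) -> float:
            return sum(
                cosine_similarity(self.features[i], f)
                for j, f in enumerate(self.features)
                if j != i
            )

        index = max(range(len(self.features)), key=total_similarity)
        del self.features[index]
```

A real system would store the images and velocity readings alongside the features; only the features are kept here to keep the selection logic visible.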

Code

A software implementation of this project based on PyTorch including trained models can be found in our GitHub repository for academic usage and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, and Abhinav Valada,
"CoVIO: Online Continual Learning for Visual-Inertial Odometry"
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023.

(PDF) (BibTex)

Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, and Abhinav Valada,
"Continual SLAM: Beyond Lifelong Simultaneous Localization and Mapping through Continual Learning"
International Symposium on Robotics Research (ISRR), 2022.

(PDF) (BibTex)