3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for
Robust 6D Pose Estimation

1Google DeepMind    2MIT   3IBM
* Equal Contribution

3D Neural Embedding Likelihood (3DNEL) defines a likelihood of real, noisy RGB-D images given a 3D scene description through a principled combination of RGB and depth information. We formulate 6D pose estimation as posterior inference using 3DNEL in an inverse graphics framework, and develop an efficient multi-stage inverse graphics pipeline (MSIGP) based on coarse enumerative pose hypothesis generation and stochastic search. In addition to achieving SOTA performance for sim-to-real 6D pose estimation, 3DNEL's probabilistic formulation also improves pose estimation robustness, naturally quantifies uncertainty, and supports extension to additional tasks using principled probabilistic inference without task-specific retraining.

Abstract

The ability to perceive and understand 3D scenes is crucial for many applications in computer vision and robotics. Inverse graphics is an appealing approach to 3D scene understanding that aims to infer the 3D scene structure from 2D images. In this paper, we introduce probabilistic modeling to the inverse graphics framework to quantify uncertainty and achieve robustness in 6D pose estimation tasks. Specifically, we propose 3D Neural Embedding Likelihood (3DNEL) as a unified probabilistic model over RGB-D images, and develop efficient inference procedures on 3D scene descriptions. 3DNEL effectively combines learned neural embeddings from RGB with depth information to improve robustness in sim-to-real 6D object pose estimation from RGB-D images. Performance on the YCB-Video dataset is on par with state-of-the-art yet is much more robust in challenging regimes. In contrast to discriminative approaches, 3DNEL's probabilistic generative formulation jointly models multiple objects in a scene, quantifies uncertainty in a principled way, and handles object pose tracking under heavy occlusion. Finally, 3DNEL provides a principled framework for incorporating prior knowledge about the scene and objects, which allows natural extension to additional tasks like camera pose tracking from video.



3D Neural Embedding Likelihood (3DNEL)


3DNEL defines the probability of an observed RGB-D image conditioned on a 3D scene description. We first render the 3D scene description into: (1) a depth image, which is transformed to a rendered point cloud image, (2) a semantic segmentation map, and (3) the object coordinate image (each pixel contains the object frame coordinate of the object surface point from which the pixel originates). The object coordinate image is transformed, via the key models, into key embeddings. The observed RGB image is transformed, via the query models, into query embeddings. The observed depth is transformed into an observed point cloud image. The 3DNEL Energy Function is evaluated using the rendered point cloud image, semantic segmentation, key embeddings, the observed point cloud image, and query embeddings.
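
To make this concrete, the following is a minimal NumPy sketch of the kind of per-pixel computation the 3DNEL Energy Function performs: for every observed pixel, a mixture over rendered foreground points combines a spatial term (how close the observed 3D point is to a rendered 3D point) with an appearance term (how well the query embedding matches the key embedding), plus a uniform outlier component. The function name, noise scale, outlier probability, and exact mixture form are illustrative assumptions, not the paper's precise formulation.

import numpy as np

def threednel_energy(
    observed_xyz,      # (H, W, 3) observed point cloud image (from depth)
    query_emb,         # (H, W, D) query embeddings (from observed RGB)
    rendered_xyz,      # (H, W, 3) rendered point cloud image
    rendered_seg,      # (H, W)    rendered semantic segmentation, 0 = background
    key_emb,           # (H, W, D) key embeddings (from the object coordinate image)
    sigma=0.01,        # assumed depth noise scale (meters)
    p_outlier=0.01,    # assumed uniform outlier probability
):
    """Return a scalar energy (negative log-likelihood, constants dropped)."""
    obs = observed_xyz.reshape(-1, 3)
    q = query_emb.reshape(-1, query_emb.shape[-1])

    mask = rendered_seg.reshape(-1) > 0                         # rendered foreground pixels
    rend = rendered_xyz.reshape(-1, 3)[mask]
    k = key_emb.reshape(-1, key_emb.shape[-1])[mask]

    # Spatial term: unnormalized Gaussian on the 3D distance between each
    # observed point and each rendered model point.
    d2 = ((obs[:, None, :] - rend[None, :, :]) ** 2).sum(-1)    # (HW, M)
    spatial = np.exp(-d2 / (2.0 * sigma ** 2))

    # Appearance term: query/key similarity, normalized over rendered points
    # so it acts as a matching distribution per observed pixel.
    sim = q @ k.T                                               # (HW, M)
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    appearance = sim / sim.sum(axis=1, keepdims=True)

    # Per-pixel likelihood: mixture over rendered points plus an outlier term.
    per_pixel = (1.0 - p_outlier) * (spatial * appearance).sum(axis=1) + p_outlier
    return -np.log(per_pixel).sum()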



Accurate Sim-to-real Object Pose Estimation


We evaluate on the YCB-V dataset from the Benchmark for 6D Object Pose Estimation (BOP) in the sim-to-real setup. The reported average recall metric measures pose estimation accuracy and is calculated using the BOP Toolkit; higher is better.
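
For reference, the average recall reported by the BOP Toolkit aggregates the recalls computed under the three BOP pose-error functions (VSD, MSSD, and MSPD), each already averaged over a range of correctness thresholds:

\[ \mathrm{AR} = \tfrac{1}{3}\left(\mathrm{AR}_{\mathrm{VSD}} + \mathrm{AR}_{\mathrm{MSSD}} + \mathrm{AR}_{\mathrm{MSPD}}\right) \]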

SurfEMB is the previous SOTA for sim-to-real 6D object pose estimation. For SurfEMB, FFB6D, and MegaPose, we use the numbers reported by the authors. For CosyPose and Coupled Iterative Refinement, we use the publicly available codebases to re-evaluate in the sim-to-real setup. Our proposed 3DNEL Multi-stage Inverse Graphics Pipeline (MSIGP) significantly outperforms previous SOTA.


Method                                                Average recall on YCB-V
3DNEL Multi-stage Inverse Graphics Pipeline (MSIGP)   84.85%
SurfEMB                                               80.00%
FFB6D                                                 75.80%
MegaPose                                              63.3%
CosyPose                                              71.42%
Coupled Iterative Refinement                          76.58%


Additional Benefits from 3DNEL's Probabilistic Formulation

More Robust Object Pose Estimation

Compared with existing, more discriminative approaches, 3DNEL's probabilistic formulation significantly reduces large-error pose estimates and improves robustness. It is especially helpful in challenging situations where similar-looking objects are present, RGB information is less informative, or 2D detections are missing, as shown in the following examples.


Natural Support for Uncertainty Quantification

Pose uncertainty may arise from partial observability, viewpoint ambiguity, and inherent symmetries of the object. 3DNEL naturally quantifies such pose uncertainty thanks to its probabilistic formulation. Panel (a) illustrates how 3DNEL identifies the pose uncertainty of the red bowl due to its inherent symmetry. Panels (b) and (c) consider the red mug from the YCB object set: 3DNEL accurately captures that there is no pose uncertainty when the mug handle is visible, yet there is a range of equally likely poses when the handle is not visible.
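
As an illustration of how such a posterior could be approximated with a likelihood like 3DNEL, the sketch below scores a grid of candidate rotations of one object about a single symmetry axis and normalizes the resulting log-likelihoods. Here render_scene, threednel_energy, and scene.with_object_rotation are assumed placeholder APIs (not the released implementation), and the 1D grid over a symmetry axis is an illustrative simplification.

import numpy as np

def rotation_about_z(theta):
    """3x3 rotation matrix about the object's z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose_posterior_over_symmetry(observed_xyz, query_emb, scene, obj_id,
                                 render_scene, threednel_energy, n=72):
    """Approximate p(theta | image) for rotations of one object about z."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    log_probs = []
    for theta in thetas:
        # Hypothetical API: copy of the scene with the object rotated by theta.
        candidate = scene.with_object_rotation(obj_id, rotation_about_z(theta))
        # Hypothetical API: render point cloud, segmentation, and key embeddings.
        rendered_xyz, rendered_seg, key_emb = render_scene(candidate)
        energy = threednel_energy(observed_xyz, query_emb,
                                  rendered_xyz, rendered_seg, key_emb)
        log_probs.append(-energy)                    # log-likelihood up to a constant
    log_probs = np.array(log_probs)
    weights = np.exp(log_probs - log_probs.max())    # numerically stable normalization
    return thetas, weights / weights.sum()           # discrete posterior over angles

Under this kind of approximation, a symmetric object like the bowl yields a near-uniform posterior over angles, while the mug with a visible handle concentrates on a single angle.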


Object Pose Tracking Through Occlusion

3DNEL's probabilistic formulation allows us to formulate object pose tracking from video as probabilistic inference in a state-space model. Given a sequence of RGB-D frames from a video, we use the Sampling Importance Resampling (SIR) particle filter to infer the posterior distribution, and take the maximum a posteriori probability (MAP) estimate as our tracking estimate.
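
The following is a minimal sketch of the SIR loop described above, assuming placeholder log_likelihood (a 3DNEL-style frame likelihood) and propose (a random-walk dynamics prior) functions; the actual system jointly tracks all objects in the scene, whereas this sketch tracks a single pose for brevity.

import numpy as np

def sir_particle_filter(frames, init_pose, log_likelihood, propose,
                        n_particles=200, seed=0):
    """Track a pose over frames; return the per-frame MAP particle."""
    rng = np.random.default_rng(seed)
    particles = [init_pose] * n_particles
    map_estimates = []
    for frame in frames:
        # 1. Propagate particles through the dynamics prior.
        particles = [propose(p, rng) for p in particles]
        # 2. Weight particles by the likelihood of the new frame.
        log_w = np.array([log_likelihood(p, frame) for p in particles])
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # 3. Report the highest-weight particle as the tracking estimate.
        map_estimates.append(particles[int(np.argmax(log_w))])
        # 4. Resample particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = [particles[i] for i in idx]
    return map_estimates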


Here we visualize tracking with 3DNEL using 200 particles on a representative YCB-V video in which the tomato can becomes fully occluded before reappearing. 3DNEL's joint modeling of multiple objects in a scene naturally handles occlusion through rendering and can reliably track through occlusion. The tomato can is briefly occluded by a narrow occluder, and the estimated posterior from particle filtering indicates there is little uncertainty about where the can is, even while it is fully occluded.


We can additionally leverage 3DNEL's ability to quantify uncertainty to track an object through extended occlusion. The above visualizes tracking with 3DNEL using 400 particles in a challenging video with extended occlusion. 3DNEL can accurately quantify uncertainty with particle filtering: the estimated posterior concentrates on the actual pose when the sugar box is visible, yet spreads to cover a range of possible poses when the sugar box becomes occluded. Such modeling of the full posterior helps 3DNEL regain track when the sugar box reappears, after which the posterior again concentrates on the actual pose.


Extension to Camera Pose Tracking from Video

3DNEL's probabilistic formulation provides a principled framework for incorporating prior knowledge about the scene and objects, and enables easy extension to camera pose tracking from video using probabilistic inference in the same model without task-specific retraining. Given a video of a static scene in which only the camera moves, the static-scene assumption translates into jointly updating all object poses in the scene by the same rigid transformation during inference.
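
Operationally, this means a single proposed camera motion updates every object pose at once. A minimal sketch, assuming poses are stored as 4x4 object-to-camera transforms (a representation choice made here for illustration):

import numpy as np

def apply_camera_motion(object_poses_in_camera, delta_camera):
    """object_poses_in_camera: {object_id: 4x4 object-to-camera transform}.
    delta_camera: 4x4 pose of the new camera expressed in the previous camera frame."""
    # Moving the camera by delta_camera moves every object by its inverse in the
    # camera frame, so all object poses are updated by the same rigid transform.
    delta_inv = np.linalg.inv(delta_camera)
    return {obj: delta_inv @ pose for obj, pose in object_poses_in_camera.items()}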


YCB-V Scene ID   Single Frame   Camera Pose Tracking
48               71.9%          81.5%
49               77.5%          94.7%
50               83.1%          97.5%
51               87.7%          97.0%
52               87.5%          97.0%
53               84.1%          97.0%
54               88.4%          97.2%
55               80.4%          97.5%
56               82.8%          96.8%
57               85.3%          92.2%
58               94.3%          98.0%
59               86.4%          97.0%

The above table compares pose estimation average recall when we estimate poses for each frame independently and when we run camera pose tracking. The results show that the same inference procedure readily handles this extension, taking the dynamics prior and the knowledge of a static scene into account within the same probabilistic model. More importantly, incorporating the additional prior knowledge improves over single-frame predictions on every scene.

BibTeX

@inproceedings{zhou20233d,
  title       = {{3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation}},
  author      = {Zhou, Guangyao and Gothoskar, Nishad and Wang, Lirui and Tenenbaum, Joshua B and Gutfreund, Dan and L{\'a}zaro-Gredilla, Miguel and George, Dileep and Mansinghka, Vikash K},
  booktitle   = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year        = {2023}
}