VFM-6D uses cost-effective synthetic data to adapt robust object representations from pre-trained vision foundation models to the task of object pose estimation. VFM-6D supports both instance-level unseen object pose estimation and category-level pose estimation for novel object categories.
Object pose estimation plays a crucial role in robotic manipulation; however, its practical applicability still suffers from limited generalizability. This paper addresses the challenge of generalizable object pose estimation, focusing in particular on category-level pose estimation for unseen object categories. Current methods either require impractical instance-level training or are confined to predefined categories, which limits their applicability. We propose VFM-6D, a novel framework that harnesses existing vision and language models to decompose object pose estimation into two stages: category-level object viewpoint estimation and object coordinate map estimation. Building on this two-stage framework, we introduce a 2D-to-3D feature lifting module and a shape-matching module, both of which leverage pre-trained vision foundation models to improve object representation and matching accuracy. VFM-6D is trained on cost-effective synthetic data and exhibits superior generalization capabilities. It can be applied to both instance-level unseen object pose estimation and category-level pose estimation for novel categories. Evaluations on benchmark datasets demonstrate the effectiveness and versatility of VFM-6D in various real-world scenarios.
VFM-6D decomposes object pose estimation into two stages: foundation-model-based object viewpoint estimation and reference-based object coordinate map estimation. First, a 2D-to-3D foundation feature lifting module adapts 2D pre-trained features into 3D view-aware object representations for precise query-reference matching and object viewpoint estimation. Building on the estimated object viewpoint and the matched reference image, we further introduce a foundation-feature-based object 3D shape representation module, which enables robust shape matching and coordinate map estimation across a wide variety of objects for generalizable object pose estimation.
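To make the two-stage flow concrete, here is a minimal sketch of stage one under simplifying assumptions: per-pixel features from a frozen foundation backbone (assumed already upsampled to the depth-map resolution) are lifted to 3D using the camera intrinsics, and the query is matched against a set of reference views by pooled feature similarity. Function names such as `lift_2d_to_3d` and `match_viewpoint` are illustrative placeholders, not the released modules; stage two would then estimate the object coordinate map against the retrieved reference view.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K):
    """Back-project a depth map (H, W) into camera-frame points (H*W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1).float() - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1).float() - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=-1)

def lift_2d_to_3d(feat_2d, depth, K):
    """Pair each per-pixel foundation feature (C, H, W) with its back-projected
    3D point, yielding a view-aware point/feature set for the observed object."""
    C, H, W = feat_2d.shape
    points = backproject(depth, K)              # (H*W, 3)
    features = feat_2d.reshape(C, -1).t()       # (H*W, C)
    return points, features

def match_viewpoint(query_features, ref_features_per_view):
    """Select the reference view whose pooled feature best matches the query."""
    q = F.normalize(query_features.mean(dim=0), dim=0)   # (C,)
    refs = F.normalize(
        torch.stack([f.mean(dim=0) for f in ref_features_per_view]), dim=1)
    return int(torch.argmax(refs @ q))          # index of the best-matching view
```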
We evaluate VFM-6D on the Wild6D benchmark, which comprises 162 object instances across 5 table-top categories, and on the CO3D dataset, which comprises 200 object instances across 20 categories. We also evaluate VFM-6D on the instance-level LINEMOD benchmark, which consists of 13 texture-less object instances. Our experiments demonstrate the superior generalization capability of VFM-6D.
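For readers unfamiliar with the metrics commonly reported on these benchmarks, the snippet below sketches the standard 5°5cm pose accuracy computation; the thresholds and the lack of symmetry handling are illustrative defaults, not necessarily the exact evaluation protocol used in the paper.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic rotation error in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_accuracy(preds, gts, rot_thresh_deg=5.0, trans_thresh_m=0.05):
    """Fraction of samples whose rotation and translation errors are both
    within the thresholds. preds / gts are lists of (R, t) with t in meters."""
    hits = 0
    for (R_p, t_p), (R_g, t_g) in zip(preds, gts):
        r_err = rotation_error_deg(R_p, R_g)
        t_err = np.linalg.norm(t_p - t_g)
        hits += (r_err <= rot_thresh_deg) and (t_err <= trans_thresh_m)
    return hits / len(preds)
```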
Use the tabs and the dropdown menu to select an object category or instance from the corresponding benchmark. For category-level evaluation, the shape template used is shown on the left side of the image; the first row shows the predicted object coordinate map, and the second row visualizes the predicted object pose. For instance-level evaluation, we use the corresponding instance-level object CAD model.
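As a rough illustration of how such pose visualizations can be rendered (the rendering code used for this page is not shown here), the following sketch overlays an estimated pose as a 3D coordinate frame on the RGB image with OpenCV; the intrinsics, distortion assumption, and axis length are placeholders.

```python
import cv2
import numpy as np

def draw_pose(image, R, t, K, axis_len=0.05):
    """Draw the estimated object pose (R: 3x3 rotation, t: translation in
    meters) as XYZ axes on the image, given camera intrinsics K."""
    rvec, _ = cv2.Rodrigues(R.astype(np.float64))
    tvec = t.reshape(3, 1).astype(np.float64)
    dist = np.zeros(5)  # assume an undistorted image
    cv2.drawFrameAxes(image, K.astype(np.float64), dist, rvec, tvec, axis_len)
    return image
```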
In open-world scenarios, the shape template for the observed object's category may not have been collected in advance. By integrating VFM-6D with GPT-4V and a text-to-3D model, we demonstrate the robustness and generalization capability of VFM-6D in handling open-world scenarios involving novel object categories. In our evaluation, we use the text-to-3D generation provided by Meshy.
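A minimal sketch of how such an open-world pipeline could be wired together is shown below. `recognize_category_with_gpt4v`, `generate_template_with_meshy`, and `vfm6d_estimate_pose` are hypothetical helper functions standing in for the GPT-4V query, the Meshy text-to-3D call, and the VFM-6D inference step; they do not reflect the actual service APIs or the released code.

```python
def open_world_pose(rgb, depth, K):
    """Hypothetical open-world pipeline: recognize the category, generate a
    shape template on the fly, then run VFM-6D with that template."""
    # 1) Ask a vision-language model (e.g., GPT-4V) for the object category.
    category = recognize_category_with_gpt4v(rgb)              # e.g., "mug"

    # 2) Generate a category-level shape template with a text-to-3D service
    #    such as Meshy, prompted with the recognized category name.
    template_mesh = generate_template_with_meshy(prompt=category)

    # 3) Estimate the 6D pose with VFM-6D, conditioned on the generated template.
    R, t = vfm6d_estimate_pose(rgb, depth, K, template_mesh)
    return category, R, t
```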
Single-category static open-world scenes:
Multi-category dynamic manipulation scenes:
@inproceedings{kai2024vfm,
  title={Vision Foundation Model Enables Generalizable Object Pose Estimation},
  author={Kai Chen and Yiyao Ma and Xingyu Lin and Stephen James and Jianshu Zhou and Yun-Hui Liu and Pieter Abbeel and Qi Dou},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  month={December},
  year={2024}
}