Dream up accurate 3D bounding boxes for objects in the wild!
- [2025.10.06] BoxDreamer demo upgraded! Now supports CLI usage for easier interaction!
- [2025.06.26] Paper accepted by ICCV 2025! Code is now open-sourced on GitHub!
- [2025.04.10] BoxDreamer paper released on arXiv!
- Table of Contents
- Method Overview
- Installation
- CLI Demo
- Gradio Demo
- Dataset Preparation
- Reference Database Creation (Optional)
- Reconstruction
- Training
- Evaluation
- Model Zoo
- Frequently Asked Questions
- Citation
- License
- Acknowledgements
BoxDreamer supports two installation methods: a fast automated script or manual step-by-step installation. Choose the method that best suits your needs.
If your system is compatible with PyTorch 2.5.1 + CUDA 12.1, use our automated installation script:
```bash
bash install.sh
```

After successful installation, you can immediately start using BoxDreamer via the CLI or Gradio demo.
For custom configurations or troubleshooting, follow these steps:
```bash
# Create and activate conda environment
conda create -n boxdreamer python=3.11
conda activate boxdreamer
```

We recommend using uv for faster dependency installation:

```bash
pip install uv

# Install PyTorch (adjust CUDA version if needed)
uv pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# Install PyTorch3D
pip install "git+https://github.com/facebookresearch/pytorch3d.git"

# Install Flash Attention
pip install flash_attn

# Install xformers
pip install xformers==0.0.28.post3

# Install required packages
uv pip install -r requirements.txt

# Install BoxDreamer
pip install -e .
```

For CUDA 12.1 + Python 3.11 + PyTorch 2.5.1:

```bash
# Download the pre-built wheel from https://miropsota.github.io/torch_packages_builder/sam-2/
pip install https://github.com/MiroPsota/torch_packages_builder/releases/download/SAM_2-1.0%2Bc2ec8e1/SAM_2-1.0%2Bc2ec8e1pt2.5.1cu121-cp311-cp311-linux_x86_64.whl

# Install additional demo dependencies
pip install decord pyqt5 gradio transformers
```

Initialize the submodules and configure the environment file:

```bash
git submodule update --init --recursive

touch .env
echo "PYTHONPATH=three/dust3r" >> .env
```

Download the DUSt3R and GroundingDINO weights:

```bash
mkdir -p weights && cd weights
wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```

Test your installation with the CLI:

```bash
boxdreamer-cli --help
```

If you see the help menu, installation was successful! Proceed to CLI Usage or try the Gradio demo.
```bash
# Display help and available options
boxdreamer-cli --help

# Process video with text prompt (automatic object detection)
boxdreamer-cli --video src/demo/examples/mode1/mode1-4.mp4 \
    --show_point_cloud --interactive --use_grounding_dino \
    --text_prompt "Controller"

# Manual object annotation mode
boxdreamer-cli --video src/demo/examples/mode1/mode1-4.mp4 \
    --show_point_cloud --interactive

# Auto reference frame selection
boxdreamer-cli --video src/demo/examples/mode1/mode1-4.mp4 \
    --show_point_cloud

# Quick processing (without point cloud rendering)
boxdreamer-cli --video src/demo/examples/mode1/mode1-4.mp4
```

Launch the interactive web interface:
```bash
# Using a local checkpoint
python -m src.demo.gradio_demo --ckpt path_to_boxdreamer_ckpt

# Or load the checkpoint from Hugging Face
python -m src.demo.gradio_demo --hf
```

LINEMOD: You can download the dataset from CDPN. Then extract it to the data/lm folder.
OnePose & OnePose-LowTexture: Download the OnePose dataset from OpenDataLab OnePose and the OnePose-LowTexture dataset from here. Then extract them to the data/onepose and data/onepose_lowtexture folders, respectively.
LINEMOD-Occlusion: Download the dataset from here. Then extract it to the data/lmo folder.
YCB-Video: Download the dataset from OpenDataLab YCB-Video. Then extract it and move the YCB_Video_Dataset folder into data/ycbv.
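For convenience, the sketch below creates the dataset roots named above in one go; the archive filenames are hypothetical placeholders, since the actual names depend on the download source:

```bash
# Create the dataset roots referenced above
mkdir -p data/lm data/onepose data/onepose_lowtexture data/lmo data/ycbv

# Example extraction (archive names are placeholders, not the real download filenames)
tar -xf linemod_archive.tar -C data/lm
tar -xf onepose_archive.tar -C data/onepose
```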
You can get the FoundationPose reference database from here.
```bash
# Process the FoundationPose reference database
python src/datasets/utils/ycbv/foundationpose_ref_process.py

# Preprocess YCB-Video
python src/datasets/utils/ycbv/ycbv_preprocess.py

# Preprocess LINEMOD-Occlusion
python src/datasets/utils/linemod_utils/linemod_o_process.py

# Create FPS 5 views database for LINEMOD
python -m src.datasets.utils.view_sampler --dataset linemod --method fps --num_views 5 --root data/lm
```

```bash
# Basic usage: Reconstruct LINEMOD with DUSt3R
python -m src.reconstruction.main --dataset LINEMOD --reconstructor dust3r --ref_suffix _fps_5
```

Key Parameters:
- --dataset: Dataset name (LINEMOD, OnePose, etc.)
- --reconstructor: Reconstruction method (dust3r, etc.)
- --ref_suffix: Suffix for reference views database
```bash
# Basic usage: Train on OnePose with 5 reference views
python run.py --config-name=train.yaml \
    datamodule.train_datasets=[OnePose] \
    datamodule.val_datasets=[OnePose] \
    length=6
```

Note: For zsh, escape the brackets with backslashes: \[OnePose\]
Key Parameters:
- datamodule.train_datasets: List of training datasets
- datamodule.val_datasets: List of validation datasets
- length: Number of reference views + 1 query view (e.g., 6 means 5 reference views)
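As a concrete sketch of the length convention (same OnePose setup as above, only the view count changes), training with 10 reference views would use length=11:

```bash
# 10 reference views + 1 query view -> length=11
python run.py --config-name=train.yaml \
    datamodule.train_datasets=[OnePose] \
    datamodule.val_datasets=[OnePose] \
    length=11
```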
```bash
# Basic usage: Evaluate on LINEMOD using the FPS 5-view database
python run.py --config-name=test.yaml \
    pretrain_name=subfolder \
    exp_name=lm \
    datamodule.test_datasets=[LINEMOD] \
    datamodule.LINEMOD.config.model_suffix=_dust3r_5 \
    datamodule.LINEMOD.config.reference_suffix=_fps_5 \
    length=6

# Or, load the checkpoint from Hugging Face (latest checkpoint)
python run.py --hf --config-name=test.yaml \
    exp_name=lm \
    datamodule.test_datasets=[LINEMOD] \
    datamodule.LINEMOD.config.model_suffix=_dust3r_5 \
    datamodule.LINEMOD.config.reference_suffix=_fps_5 \
    length=6

# Use the reproducible-version checkpoint
python run.py --hf --reproducibility --config-name=test.yaml \
    exp_name=lm \
    datamodule.test_datasets=[LINEMOD] \
    datamodule.LINEMOD.config.model_suffix=_dust3r_5 \
    datamodule.LINEMOD.config.reference_suffix=_fps_5 \
    length=6
```
Key Parameters:
- pretrain_name: Name of the pretrained model folder
- datamodule.test_datasets: List of test datasets
- datamodule.LINEMOD.config.model_suffix: Suffix for model files. If not provided, ground-truth models are used for bounding box extraction
- datamodule.LINEMOD.config.reference_suffix: Suffix for the reference database. If not provided, the full-view database is used
- length: Number of reference views + 1 query view
For evaluation with a dense reference database, set length to the total number of reference images plus one. Enabling the DINO feature filter (model.modules.dense_cfg.enable=True) will further assist in selecting the most relevant neighbor views for the decoder input.
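For illustration, here is a sketch of such a dense evaluation run, assuming the reference database holds 100 images (so length=101) and leaving reference_suffix unset so the full-view database is used; the remaining values mirror the example above:

```bash
# Dense-reference evaluation sketch: 100 reference images + 1 query view -> length=101
# reference_suffix is omitted so the full-view database is used (see the notes above)
python run.py --config-name=test.yaml \
    pretrain_name=subfolder \
    exp_name=lm \
    datamodule.test_datasets=[LINEMOD] \
    datamodule.LINEMOD.config.model_suffix=_dust3r_5 \
    model.modules.dense_cfg.enable=True \
    length=101
```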
| Version | Training Data | Params | Download |
|---|---|---|---|
| Latest | Objaverse + OnePose | 88.6M | Download |
| Pretrained | Objaverse | 88.6M | Coming soon |
Download the checkpoint, place it in the models/checkpoints/subfolder folder, and rename it to last.ckpt.
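A minimal sketch of that step, where boxdreamer_latest.ckpt stands in for whatever filename the download provides:

```bash
# Move the downloaded checkpoint to the location expected by the test config
# (boxdreamer_latest.ckpt is a placeholder filename)
mkdir -p models/checkpoints/subfolder
mv boxdreamer_latest.ckpt models/checkpoints/subfolder/last.ckpt
```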
Does BoxDreamer require CAD models or mesh representations of objects?
No, BoxDreamer does not require any 3D CAD models or mesh representations of objects during inference. This is a key advantage of our approach, as it enables generalization to novel objects without access to their 3D models. During training, we do use bounding box annotations, but no detailed 3D models are required.
How computationally expensive is BoxDreamer during inference?
The BoxDreamer-Base model runs at over 40 FPS on a single NVIDIA RTX 4090 GPU with 5 reference images.
Can BoxDreamer work with RGB-D images?
Yes! While the base version of BoxDreamer works with RGB images only, depth information additionally provides access to real-world object coordinates. We plan to introduce a variant of BoxDreamer that incorporates depth information in the future.
If you find BoxDreamer useful in your research, please consider citing our paper:
```bibtex
@article{yu2025boxdreamer,
  title={BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation},
  author={Yu, Yuanhong and He, Xingyi and Zhao, Chen and Yu, Junhao and Yang, Jiaqi and Hu, Ruizhen and Shen, Yujun and Zhu, Xing and Zhou, Xiaowei and Peng, Sida},
  journal={arXiv preprint arXiv:2504.07955},
  year={2025}
}
```
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Our implementation is based on several open-source repositories. We thank the authors of these repositories for making their code available.
We would also like to thank Yating Wang, Chengrui Dong, and Yiguo Fan for their sincere suggestions and valuable live demonstrations.
