Official repository for the project "SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners".
[🌐 Webpage] [🤗 HuggingFace Demo] [📖 arXiv Report]
We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor scenes, and raw LiDAR.
To the best of our knowledge, SAM2Point presents the most faithful implementation of SAM in 3D, demonstrating superior implementation efficiency, promptable flexibility, and generalization capabilities for 3D segmentation.
We showcase the multi-directional videos generated during SAM2Point's segmentation.
Clone the repository:

```bash
git clone https://github.com/ZiyuGuo99/SAM2Point.git
cd SAM2Point
```
Create a conda environment:

```bash
conda create -n sam2point python=3.10
conda activate sam2point
```

SAM2Point requires Python >= 3.10, PyTorch >= 2.3.1, and TorchVision >= 0.18.1. Please follow the official PyTorch installation instructions to install both the PyTorch and TorchVision dependencies.
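As a rough example, an install matching these version requirements might look like the sketch below; the exact command depends on your CUDA setup, so defer to the official PyTorch installation guide for your platform:

```bash
# Example only: install PyTorch/TorchVision versions meeting SAM2Point's requirements.
# Pick the command matching your CUDA version from the official PyTorch installation guide.
pip install "torch>=2.3.1" "torchvision>=0.18.1"
```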
Install additional dependencies:

```bash
pip install -r requirements.txt
```
Download the checkpoint of SAM 2:

```bash
cd checkpoints
bash download_ckpts.sh
cd ..
```
We provide 3D data samples from different datasets for testing SAM2Point:

```bash
gdown --id 1hIyjBCd2lsLnP_GYw-AMkxJnvNtyxBYq
unzip data.zip
```
Alternatively, you can download the samples directly from this link.
Code for custom 3D input and prompts will be released soon.
Modify `DATASET`, `SAMPLE_IDX`, `PROMPT_TYPE`, and `PROMPT_IDX` in `run.sh` to specify the 3D input and prompt.
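For reference, the edited variables in `run.sh` might look something like the sketch below; the values shown are only illustrative, and the valid options are defined in `run.sh` itself to match the provided data samples:

```bash
# Illustrative example of the variables to edit in run.sh (values are hypothetical;
# consult run.sh for the options that match the provided data samples).
DATASET=S3DIS        # which dataset the 3D sample comes from
SAMPLE_IDX=0         # index of the 3D sample within that dataset
PROMPT_TYPE=point    # prompt type, e.g., point, box, or mask
PROMPT_IDX=0         # index of the prompt to use
```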
Run the segmentation script:

```bash
bash run.sh
```
The segmentation results will be saved under `./results/`, and the corresponding multi-directional videos will be saved under `./video/`.
If you find SAM2Point useful for your research or applications, please kindly cite using this BibTeX:
```bibtex
@article{guo2024sam2point,
  title={SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners},
  author={Guo, Ziyu and Zhang, Renrui and Zhu, Xiangyang and Tong, Chengzhuo and Gao, Peng and Li, Chunyuan and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2408.16768},
  year={2024}
}
```
Explore our additional research on 3D, SAM, and Multi-modal Large Language Models:
- [Point-Bind & Point-LLM] Multi-modality 3D Understanding, Generation, and Instruction Following
- [Personalized SAM] Personalize Segment Anything Model with One Shot
- [Point-NN & Point-PN] Starting from Non-Parametric Networks for 3D Analysis
- [PointCLIP] 3D Point Cloud Understanding by CLIP
- [Any2Point] Empowering Any-modality Large Models for 3D
- [LLaVA-OneVision] The Latest Generation of LLaVA Models
- [LLaMA-Adapter] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?