OpenCV University (@OpenCVUniverse)
2025-06-30 | ❤️ 139 | 🔁 34
📢SAM4D: Segment Anything in Camera and LiDAR Streams
SAM4D introduces a 4D foundation model for promptable segmentation across camera and LiDAR streams, addressing the limitations of frame-centric and modality-isolated approaches in autonomous driving.
Key Highlights:
✅Promptable Multi-modal Segmentation (PMS) – Enables interactive segmentation across sequences from both modalities using diverse prompts (points, boxes, masks), allowing cross-modal propagation and long-term object tracking.
✅Unified Multi-modal Positional Encoding (UMPE) – Aligns image and LiDAR features in a shared 3D space using sinusoidal and MLP-based encoding for seamless cross-modal interaction while preserving modality-specific structure (see the positional-encoding sketch after this list).
✅Motion-aware Cross-modal Memory Attention (MCMA) – Incorporates ego-motion compensation into memory attention, enabling temporally consistent retrieval and robust segmentation in dynamic scenes (see the memory-attention sketch after this list).
✅Multi-modal Architecture – Builds on SAM2 with Hiera for image encoding and MinkUNet (via TorchSparse) for LiDAR voxelization, allowing efficient 2D-3D joint segmentation.
✅Efficient Prompt Handling – Supports point, box, and mask prompts from either modality, using a unified decoder to produce temporally consistent masks across the stream.
✅Waymo-4DSeg Dataset – A large-scale pseudo-labeled dataset containing 15M image masks, 30M LiDAR masks, and 300k cross-modal masklets, generated via VFM segmentation, 4D LiDAR reconstruction, and ray casting.
✅Cross-Modal Label Fusion Pipeline – Builds dense pixel-to-voxel mappings, filters noisy masklets using DBSCAN clustering, and merges multi-view data into high-quality voxel masklets (see the filtering sketch after this list).
✅Cross-Dataset Generalization – Demonstrates strong zero-shot and fine-tuned performance on nuScenes, validating robust transferability across sensor configurations and environments.
✅Quantitative Performance – Achieves 69.8% mIoU on images and 55.7% on LiDAR with 80.1% J&F, significantly outperforming single-modality and projection-based baselines.
✅Scalable & Efficient Design – A 119.88M-parameter model optimized with memory banks, FIFO queues, and prompt imitation logic for high-throughput 4D segmentation.
✅Future-Proof Foundation – Roadmap includes natural language prompting via LLMs, multi-sensor scaling, weak/self-supervised learning, and improved memory and compute efficiency.
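To make the UMPE idea concrete, here is a minimal PyTorch sketch of a unified 3D positional encoding: 3D positions of image tokens and LiDAR voxels are expanded with a sinusoidal encoding and then projected by small per-modality MLPs into one shared embedding space. All names, dimensions, and the exact way SAM4D combines the sinusoidal and MLP terms are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of a unified 3D positional encoding (UMPE-style), not the SAM4D implementation.
import math
import torch
import torch.nn as nn

def sinusoidal_3d(xyz: torch.Tensor, num_freqs: int = 16) -> torch.Tensor:
    """Encode (N, 3) coordinates with sin/cos at multiple frequencies -> (N, 6 * num_freqs)."""
    freqs = (2.0 ** torch.arange(num_freqs, device=xyz.device, dtype=xyz.dtype)) * math.pi
    angles = xyz.unsqueeze(-1) * freqs                      # (N, 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (N, 3, 2 * num_freqs)
    return enc.flatten(1)                                   # (N, 6 * num_freqs)

class UnifiedPositionalEncoding(nn.Module):
    """Project 3D positions of image tokens and LiDAR voxels into one shared embedding space."""
    def __init__(self, embed_dim: int = 256, num_freqs: int = 16):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 6 * num_freqs
        # Separate lightweight MLP heads keep some modality-specific structure
        # while mapping both modalities into the same embedding dimension.
        self.image_mlp = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))
        self.lidar_mlp = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))

    def forward(self, xyz: torch.Tensor, modality: str) -> torch.Tensor:
        enc = sinusoidal_3d(xyz, self.num_freqs)
        return self.image_mlp(enc) if modality == "image" else self.lidar_mlp(enc)
```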
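Along the same lines, a hedged sketch of the MCMA idea: stored memory positions are warped into the current ego frame using the relative ego pose before cross-attention, so retrieval stays temporally consistent as the vehicle moves. The 4x4 world-from-ego pose convention, the pos_encoder hook (e.g. the UMPE sketch above), and the attention layout are all assumptions, not the paper's exact design.

```python
# Hedged sketch of ego-motion-compensated memory attention (MCMA-style).
import torch
import torch.nn as nn

def compensate_ego_motion(mem_xyz: torch.Tensor,
                          pose_mem: torch.Tensor,
                          pose_cur: torch.Tensor) -> torch.Tensor:
    """Warp memory-frame points (N, 3) into the current ego frame.

    pose_mem / pose_cur: 4x4 world-from-ego transforms of the memory and current frames (assumed convention).
    """
    rel = torch.linalg.inv(pose_cur) @ pose_mem                               # current-from-memory
    homo = torch.cat([mem_xyz, torch.ones_like(mem_xyz[:, :1])], dim=1)       # (N, 4)
    return (homo @ rel.T)[:, :3]

class MotionAwareMemoryAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur_feats, cur_pos, mem_feats, mem_xyz, pose_mem, pose_cur, pos_encoder):
        # Re-encode memory positions after warping them into the current frame,
        # so queries and keys share one temporally consistent coordinate system.
        warped_xyz = compensate_ego_motion(mem_xyz, pose_mem, pose_cur)
        mem_pos = pos_encoder(warped_xyz, modality="lidar")
        out, _ = self.attn(query=(cur_feats + cur_pos).unsqueeze(0),
                           key=(mem_feats + mem_pos).unsqueeze(0),
                           value=mem_feats.unsqueeze(0))
        return out.squeeze(0)
```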
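And for the label-fusion pipeline, a small illustration of the DBSCAN cleanup step: keep only the dominant spatial cluster of voxels in a candidate masklet and discard outliers. The eps/min_samples values and the "largest cluster" rule are assumptions for illustration, not the paper's exact filtering criteria.

```python
# Illustrative DBSCAN-based masklet cleanup, under assumed thresholds.
import numpy as np
from sklearn.cluster import DBSCAN

def filter_masklet_voxels(voxel_xyz: np.ndarray, eps: float = 0.5, min_samples: int = 10) -> np.ndarray:
    """Return a boolean mask keeping voxels in the largest DBSCAN cluster (noise removed)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(voxel_xyz)
    keep = np.zeros(len(voxel_xyz), dtype=bool)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return keep                      # everything was noise; reject the masklet
    largest = np.bincount(valid).argmax()
    keep[labels == largest] = True
    return keep
```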
➡️Project: https://sam4d-project.github.io/
➡️GitHub Repo: https://github.com/CN-ADLab/SAM4D
➡️LearnOpenCV blog post: https://learnopencv.com/sam-2/
#SegmentAnything #SAM4D #LiDAR #Camera #4DPerception #AutonomousDriving #MultiModal #PromptableSegmentation