Published on 1 Aug 2024

Getting Started with SAM 2: A Comprehensive Guide to Meta’s Latest Model for Videos and Images

Updated: 7 Aug 2024

In an open letter last week, Mark Zuckerberg said, “AI has more potential than any other modern technology to increase human productivity, creativity and quality of life.” In line with this open-science approach, Meta has released the code and model weights for its latest model, SAM 2, under a permissive Apache 2.0 license, together with the SA-V dataset of roughly 51,000 videos and more than 600,000 masklets. SAM 2 extends its predecessor's promptable segmentation from images to video and runs in real time, accurately segmenting objects in images and videos, including objects it hasn't been specifically trained on. This zero-shot capability makes SAM 2 versatile: it can enhance video effects and creative projects when combined with generative video models, and it can streamline the annotation of visual data, which could significantly accelerate the development of advanced computer vision systems.

Continue reading as we explore the capabilities of SAM 2 and guide you on getting started with this model on Hyperstack.

What is SAM 2?

SAM 2, short for Segment Anything Model 2, is Meta's latest advancement in computer vision technology. While the previous SAM model was known for segmenting objects in images, SAM 2 extends this capability to videos, creating a unified model for real-time promptable object segmentation across static and moving visual content. What sets SAM 2 apart is its ability to perform this task with remarkable accuracy and speed, even on objects and visual domains it has not seen previously.

Key Features of SAM 2

SAM 2 helps you build better computer vision systems. It improves on the original SAM with the following features:

Unified Image and Video Segmentation

One of SAM 2's most significant advancements is its ability to handle images and videos within a single and unified architecture. The model treats images as short videos with a single frame, allowing seamless application across different visual media. This unified approach enables consistent performance and user experience whether working with static images or complex video sequences.

Real-Time Processing

SAM 2 is designed for real-time operation, processing video frames at approximately 44 frames per second. This speed makes it suitable for live video applications, interactive editing, and other time-sensitive tasks where immediate feedback is crucial.
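To put the frame rate in perspective, the short calculation below converts frames per second into a per-frame time budget. It is only a back-of-the-envelope sketch: the 44 FPS figure is the approximate number quoted above, the 30 FPS source rate is an assumed camera rate, and the function names are illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to process one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def keeps_up(model_fps: float, source_fps: float) -> bool:
    """True if the model processes frames at least as fast as the source produces them."""
    return model_fps >= source_fps


# 44 FPS is the approximate figure quoted above; 30 FPS is an assumed camera rate.
print(f"{frame_budget_ms(44):.1f} ms per frame")
print(keeps_up(44, 30))
```

At roughly 44 FPS the model spends about 22.7 ms on each frame, which leaves headroom for a 30 FPS source but not for a 60 FPS one. Actual throughput depends on the GPU, model size, and video resolution.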

Promptable Segmentation

Building on SAM's foundation, SAM 2 allows users to specify objects of interest through various prompt types, including clicks, bounding boxes, or masks. These prompts can be applied to any frame in a video, with the model then propagating the segmentation across all frames. This interactive approach allows for precise control and refinement of segmentations.
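As a rough illustration of what a click prompt looks like as data, the sketch below turns a list of clicks into parallel coordinate and label lists, using the SAM convention of 1 for a foreground click and 0 for a background click. The `Click` class and `clicks_to_arrays` helper are hypothetical names introduced here, not part of the SAM 2 package; the example notebooks in the repository show the exact arguments the predictors expect.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Click:
    x: float
    y: float
    positive: bool = True  # True = "this is the object", False = "this is not"


def clicks_to_arrays(clicks: List[Click]) -> Tuple[List[List[float]], List[int]]:
    """Convert clicks into parallel coordinate and label lists.

    Labels follow the SAM convention: 1 for foreground, 0 for background.
    """
    coords = [[c.x, c.y] for c in clicks]
    labels = [1 if c.positive else 0 for c in clicks]
    return coords, labels


# One click on the object and one corrective click on the background.
coords, labels = clicks_to_arrays([Click(210, 350), Click(250, 220, positive=False)])
print(coords, labels)
```

A negative click is how you refine a mask that spilled onto the wrong region: it tells the model which pixels to exclude without starting over.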

Memory Mechanism

To handle the temporal aspects of video segmentation, SAM 2 introduces a sophisticated memory mechanism. This consists of a memory encoder, a memory bank, and a memory attention module. These components allow the model to store and recall information about objects and user interactions across video frames, enabling consistent tracking and segmentation of objects throughout a video sequence.
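The idea of a memory bank can be sketched with a toy data structure: prompted frames are kept, while unprompted frames pass through a bounded first-in, first-out queue. This is a simplified illustration rather than the model's actual implementation; the class name, the queue size, and the string stand-ins for encoded features are all assumptions.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ToyMemoryBank:
    """Keeps prompted frames plus a bounded FIFO of recent frames."""

    def __init__(self, max_recent: int = 6) -> None:
        self.prompted: Dict[int, str] = {}  # frame index -> stored features
        self.recent: Deque[Tuple[int, str]] = deque(maxlen=max_recent)

    def add(self, frame_idx: int, features: str, prompted: bool = False) -> None:
        if prompted:
            self.prompted[frame_idx] = features
        else:
            # When the deque is full, the oldest entry drops out automatically.
            self.recent.append((frame_idx, features))

    def context(self) -> List[int]:
        """Frame indices the current frame would attend to."""
        return sorted(set(self.prompted) | {idx for idx, _ in self.recent})


bank = ToyMemoryBank(max_recent=3)
bank.add(0, "f0", prompted=True)  # the user clicked on frame 0
for i in range(1, 6):
    bank.add(i, f"f{i}")
print(bank.context())
```

After six frames, the bank still holds the prompted frame 0 alongside the three most recent frames, which is why an object specified once at the start can still be tracked much later in the video.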

Occlusion Handling

SAM 2 includes an "occlusion head" that predicts whether an object of interest is present in each frame. This feature allows the model to handle scenarios where objects become temporarily hidden or move out of view, a common challenge in video segmentation tasks.
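Downstream code can use a per-frame presence prediction to skip frames where the object is hidden. The sketch below assumes the prediction has already been turned into a score between 0 and 1 for each frame; the scores, the 0.5 threshold, and the function name are hypothetical.

```python
from typing import List


def visible_frames(presence_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return indices of frames where the object is predicted to be present."""
    return [i for i, score in enumerate(presence_scores) if score >= threshold]


# Hypothetical per-frame presence scores: the object is hidden in frames 2 and 3.
scores = [0.97, 0.91, 0.12, 0.08, 0.88]
print(visible_frames(scores))
```

Frames 2 and 3 are dropped, and tracking resumes at frame 4 once the object reappears.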

Ambiguity Resolution

The model can output multiple valid masks when faced with ambiguous prompts, such as a click that could refer to a part of an object or the entire object. SAM 2 handles this by creating multiple masks and selecting the most confident one for propagation if the ambiguity isn't resolved through additional prompts.
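The selection step described above amounts to taking the candidate with the highest confidence score. A minimal sketch, using string labels as stand-ins for masks and made-up scores:

```python
from typing import List, Tuple


def pick_most_confident(candidates: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the (mask, score) pair with the highest confidence score."""
    if not candidates:
        raise ValueError("need at least one candidate mask")
    return max(candidates, key=lambda pair: pair[1])


# A click on a cyclist's shirt could mean the shirt, the person, or person plus bike.
candidates = [("shirt", 0.71), ("person", 0.93), ("person+bike", 0.64)]
print(pick_most_confident(candidates))
```

If the highest-scoring mask is not the one you intended, an extra click or a bounding box narrows the choice before the mask is propagated through the video.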

How Does SAM 2 Perform?

SAM 2 has shown impressive performance across different benchmarks and real-world scenarios. The main results are:

  • Interactive Video Segmentation: SAM 2 outperforms previous approaches across 17 zero-shot video datasets, requiring approximately three times fewer human-in-the-loop interactions.
  • Image Segmentation: On SAM's original 23-dataset zero-shot benchmark suite, SAM 2 surpasses its predecessor while being six times faster.
  • Video Object Segmentation: SAM 2 excels on existing benchmarks such as DAVIS, MOSE, LVOS, and YouTube-VOS, compared to prior state-of-the-art models.
  • Annotation Speed: When used for video segmentation annotation, SAM 2 is 8.4 times faster than manual per-frame annotation with the original SAM.
  • Fairness: Evaluations show minimal performance discrepancy across perceived gender and age groups (18-25, 26-50, 50+), indicating good fairness characteristics.

Limitations of SAM 2

While SAM 2 is a major advancement in object segmentation technology, it does have some limitations:

  • Extended Video Tracking: The model may lose track of objects in scenarios involving drastic camera viewpoint changes, long occlusions, crowded scenes, or extended videos.
  • Object Confusion: In crowded scenes with similar-looking objects, SAM 2 can sometimes confuse the target object, especially if it's only specified in one frame.
  • Fine Detail in Fast Motion: For complex, fast-moving objects, the model can miss fine details, and predictions may be unstable across frames.
  • Multiple Object Efficiency: While SAM 2 can segment multiple individual objects simultaneously, doing so decreases the model's efficiency significantly.
  • Temporal Smoothness: The model doesn't enforce penalties for jittery predictions between frames, which can lead to a lack of temporal smoothness in some cases.
  • Human Intervention: Despite advancements in automatic masklet generation, human annotators are still required for some steps in the data annotation process.

Is SAM 2 Open-Source?

SAM 2 is released as an open-source model, continuing Meta's commitment to open science and collaborative AI development. The SAM 2 code and model weights are shared under a permissive Apache 2.0 license, and the SAM 2 evaluation code is released under a BSD-3 license. This allows researchers and developers not only to use the model but also to thoroughly assess its performance and compare it with other solutions.

The open-source release of SAM 2 also includes:

  1. The SA-V dataset, featuring approximately 51,000 real-world videos and over 600,000 masklet annotations, released under a CC BY 4.0 license.
  2. A web-based demo that shows real-time interactive segmentation of short videos and applies video effects on the model predictions.


Open source will ensure that more people around the world have access to the benefits and opportunities of AI, that power isn’t concentrated in the hands of a small number of companies, and that the technology can be deployed more evenly and safely across society

- Mark Zuckerberg, Founder and CEO, Meta


Getting Started with SAM 2

To get started with SAM 2, you can leverage the high-performance computing capabilities offered by Hyperstack. Here's how you can begin:

Step 1: Choose the Right Hardware

SAM 2 benefits from high-end GPUs for optimal performance, especially when processing videos or large batches of images. We recommend using powerful GPUs like the NVIDIA A100 and the NVIDIA H100 PCIe. Hyperstack offers access to these GPUs at cost-effective prices, so you can run SAM 2 efficiently without investing in expensive hardware.

Step 2: Set Up Environment on Hyperstack

To leverage Hyperstack's high-performance GPUs, you'll need to set up your environment on our platform. Check out our platform video demo to get started with Hyperstack.

Step 3: Set Up the Python Environment

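Run the commands below on your VM to create a virtual environment and install SAM 2 with its demo dependencies. The final step also downloads the SAM 2 model checkpoints.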

# Install python3-pip, python3-venv
sudo apt-get install python3-pip python3-venv -y


# Configure virtual environment
python3 -m venv venv
source venv/bin/activate


# Clone the repository
git clone https://github.com/facebookresearch/segment-anything-2.git
cd segment-anything-2


# Install SAM 2 and its core requirements in editable mode
pip install -e .
# Install the demo requirements manually (the repo's own command, pip install -e ".[demo]", is broken)
pip install jupyter==1.0.0 matplotlib==3.9.0 opencv-python==4.10.0.84


# Download checkpoints
cd checkpoints
./download_ckpts.sh
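Before moving on, it can help to confirm that the download finished. The script below is a small standard-library sketch that reports which checkpoint files are missing or empty. The file names in `EXPECTED_CHECKPOINTS` are assumptions about what `download_ckpts.sh` fetches, so compare them with the contents of your `checkpoints` directory and adjust the list if they differ.

```python
from pathlib import Path
from typing import List

# Assumed file names; compare them with what download_ckpts.sh actually fetched.
EXPECTED_CHECKPOINTS = [
    "sam2_hiera_tiny.pt",
    "sam2_hiera_small.pt",
    "sam2_hiera_base_plus.pt",
    "sam2_hiera_large.pt",
]


def missing_checkpoints(directory: str, expected: List[str] = EXPECTED_CHECKPOINTS) -> List[str]:
    """Return the expected files that are absent or empty in `directory`."""
    folder = Path(directory)
    return [
        name for name in expected
        if not (folder / name).is_file() or (folder / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_checkpoints(".")  # run from the checkpoints directory
    print("All checkpoints present" if not missing else f"Missing: {missing}")
```

An empty file usually means the download was interrupted, so the check treats zero-byte files as missing.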

Step 4: Run SAM 2 on Images or Videos

To run the segmentation on images or videos, see this video example notebook and this image example notebook. To run these notebooks, you first need to set up a Jupyter notebook server. Follow the steps below to do so:

1. Start a Jupyter Notebook server by running the commands below in your VM. Copy the text after '?token=' in the URL printed in your terminal (e.g. http://localhost:8888/lab?token=a919e699f41dcc0c34754464cbaf55e0faa59bde96361b85)

source /home/ubuntu/venv/bin/activate
jupyter lab

2. Open an SSH tunnel by running this command from your local terminal (NOT in your VM). Replace the keypair path and IP address with your own.

ssh -i [path-to-your-keypair] -L 8888:localhost:8888 ubuntu@[vm-ip-address]

3. Open http://localhost:8888 in your browser to view your notebooks

4. The Jupyter Notebook server will ask you for your token (see step 1). You only need the text after '?token=' (e.g. a919e699f41dcc0c34754464cbaf55e0faa59bde96361b85)

5. Go to the 'notebooks' directory and execute the example notebook you need. Please note that you may need to add a pip install cell at the top for the required dependencies.
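If you would rather not pick the token out of the URL by hand, it is simply the value of the `token` query parameter. The helper below is a small convenience sketch using only the Python standard library; `extract_token` is a name introduced here, and the URL is the example from step 1.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(jupyter_url: str) -> str:
    """Return the value of the `token` query parameter from a Jupyter URL."""
    query = parse_qs(urlparse(jupyter_url).query)
    tokens = query.get("token")
    if not tokens:
        raise ValueError("no token parameter found in URL")
    return tokens[0]


url = "http://localhost:8888/lab?token=a919e699f41dcc0c34754464cbaf55e0faa59bde96361b85"
print(extract_token(url))
```

Treat the token like a password: anyone who has it and can reach the port can run code on your VM.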

FAQs

What is the difference between SAM 2 and SAM?

The key difference is that SAM 2 extends object segmentation capabilities to videos, while the original SAM was designed for images only. SAM 2 also offers improved performance and speed for image segmentation tasks.

Is SAM 2 open-source?

Yes, SAM 2 is an open-source model released by Meta. The SAM 2 code and model weights are shared under a permissive Apache 2.0 license.

Can SAM 2 be used for real-time video processing applications?

SAM 2 is designed for real-time operation, processing approximately 44 frames per second, making it suitable for live video applications and interactive editing.

Does SAM 2 require extensive training for new objects or scenes?

No, SAM 2 features zero-shot generalisation, allowing it to segment objects it hasn't seen during training without custom adaptation.

Is SAM 2 suitable for scientific or medical imaging applications?

Yes, SAM 2 has potential applications in various scientific fields, including medical imaging. It has been used to segment cellular images and aid in tasks like skin cancer detection.
