
Updated on 2 Dec 2025

Run DeepSeek OCR on Hyperstack with your Own UI

Summary

In our latest tutorial, we show how to set up DeepSeek-OCR on a Hyperstack VM to create a high-performance, private OCR workflow. DeepSeek-OCR is a 3-billion-parameter multimodal model that combines a vision encoder and language decoder to extract text and preserve document structure, including tables and complex layouts. Using vLLM for GPU-accelerated serving, you can run PDFs and images through a simple Gradio UI or build a custom REST API. 

Take Control of Your Own OCR Workflow with DeepSeek-OCR and Hyperstack

Optical Character Recognition (OCR) is the process of recognising and extracting text from a source such as an image or PDF using only the visual information - it's what we do when we read!

Methods for performing OCR have existed for a while, but in the past few years (or even months, rather), transformer-based models have become incredibly competent at it. DeepSeek, one of the world's leading AI foundation model labs, has released its DeepSeek-OCR 3B-parameter model for quickly and easily creating your own OCR workflows.


Why is it harder to run than other DeepSeek models?

You might be used to running other AI models, like DeepSeek's LLMs, which are often available via a simple API call or a straightforward Python library like transformers. We've even made tutorials in the past that you can follow to get DeepSeek V3 running. DeepSeek-OCR is a bit more hands-on because it's not just a language model; it's a specialised multi-modal system.

It essentially has two parts: a sophisticated vision encoder that sees and understands the layout of a page (just like our eyes), and a 3-billion-parameter language decoder that reads and interprets the text from that visual information. This two-stage process is what makes it so powerful, but it also requires a more complex stack of software to run efficiently.

The setup in this guide uses vLLM, a high-throughput serving engine, to get the best possible performance. This is what adds most of the setup steps - we need to install a particular version of it along with dependencies like flash-attn. It's this requirement for a high-performance, GPU-accelerated serving environment that makes it more complex than a simple pip install package, but the payoff in speed and accuracy is well worth it.

How good is DeepSeek-OCR? 

In short: it's exceptionally good. It represents the current state-of-the-art for open-source OCR in its size group, especially when it comes to understanding real-world, complex documents.

Where traditional OCR tools might just extract a "wall of text" that loses all formatting, DeepSeek-OCR understands the structure of the document. This is its key advantage. It excels at:

  • Complex Layouts: Accurately reading multi-column articles, magazine pages, and scientific papers.

  • Tables: It doesn't just see text in a table; it understands the table's rows and columns and formats the output (as markdown) to match.

  • Mixed Content: It's highly adept at handling pages with a mix of text, code blocks, and even mathematical equations.

Because it outputs structured markdown, you're not just getting the raw text; you're getting the document's semantic structure. This makes its output immediately useful for feeding into other systems, like a RAG pipeline or a summarisation model. For its 3B-parameter size, it hits a perfect sweet spot of being incredibly accurate while still being fast enough to interpret huge documents on a single H100 GPU.

How to set up DeepSeek-OCR on your own Hyperstack VM, step-by-step

We'll take you through the whole process from start to finish to get a really simple, basic OCR workflow running on your own Hyperstack VM.

Step 0: Getting a Hyperstack VM

This guide assumes you've just spun up a new Linux VM on our platform and can access it via SSH. If you haven't done this before, please see our getting started guide in our documentation.

Step 1: Clone the DeepSeek-OCR repo 

# Clone the DeepSeek-OCR repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git

Step 2: Install UV (the package manager):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Step 3: Create a python virtual environment:

uv venv deepseek-ocr --python 3.12.9
source deepseek-ocr/bin/activate

Step 4: Install vLLM and other requirements

cd DeepSeek-OCR

# Get vllm whl
wget https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
unzip vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl -d vllm-0.8.5+cu118-whl

# Install requirements
uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install uvicorn fastapi gradio --upgrade
uv pip install transformers==4.57.1 --upgrade

This step may take a while; there are a lot of dependencies!
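Once everything has installed, you can optionally confirm that the key packages import cleanly and that the GPU is visible before moving on. This quick check isn't part of the official setup; run it inside the activated deepseek-ocr environment:

# Optional sanity check (run with the deepseek-ocr venv activated)
import torch
import vllm

print("torch:", torch.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())

If CUDA shows as unavailable, double-check that your VM has a GPU attached and the NVIDIA drivers installed before continuing.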

Step 5: Download the Python code

main.py 

This is a standalone Python file that sets up the web server and hosts it on your VM. We recommend you have a quick read-through before you attempt to run it, just to familiarise yourself with what it does (more on this later).
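If you're curious about the general shape of such a file, the pattern is a Gradio interface mounted onto a FastAPI app, with the DeepSeek-OCR/vLLM inference sitting behind the upload handler. The snippet below is only an illustrative skeleton with placeholder names (ocr_file, demo), not the actual main.py:

# Illustrative skeleton only - the real main.py is more involved.
import gradio as gr
from fastapi import FastAPI

def ocr_file(file):
    # Placeholder: this is where the uploaded file would be passed
    # to DeepSeek-OCR via vLLM and the extracted markdown returned.
    return "extracted markdown goes here"

demo = gr.Interface(fn=ocr_file, inputs=gr.File(), outputs=gr.Markdown())

app = FastAPI()
app = gr.mount_gradio_app(app, demo, path="/")

Because the Gradio app is mounted onto a FastAPI instance named app, it can be served with uvicorn exactly as shown in Step 7.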

Step 6: Get the code into your VM:

# Create the "web" dir and put main.py in there
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
mkdir -p web

# Quote 'EOF' so the shell doesn't expand $variables or backticks in the pasted code
cat <<'EOF' > web/main.py
<paste the contents of main.py here>
EOF

Alternatively, you can use an editor like nano or vim, or connect to the VM over SSH from a more interactive editor such as VS Code, to make this part easier.

Step 7: Start the server and access via your browser

# Start the server
uvicorn web.main:app --host 0.0.0.0 --port 3000

You should now be able to open the UI by navigating to http://<your-VMs-ip>:3000 in your browser and start interacting with it!

NOTE: Remember to open port 3000 for inbound TCP traffic via your VM's firewall on Hyperstack! For more info on this, see our documentation here 
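If the page doesn't load, a quick check from the VM itself (standard library only, nothing extra to install) helps narrow down whether the problem is the server or the firewall rule:

# Run this on the VM while the uvicorn server is running.
# If it prints 200 but your browser can't reach the page,
# the firewall/port rule is the most likely culprit.
from urllib.request import urlopen

print(urlopen("http://localhost:3000", timeout=10).status)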

Once loaded, it should look something like this:

[Screenshot: the barebones DeepSeek-OCR web UI]

In this simple, barebones UI, you can upload PDFs or images and DeepSeek-OCR will automatically run on them.

The results will be visible in the lower box, with the option to see (and download) the labelled input and the extracted text in markdown format. 

To re-run, simply delete the existing input and upload something new!

Here's an example of a PDF article processed by DeepSeek-OCR:

[Screenshot: example PDF page alongside its DeepSeek-OCR markdown output]

Troubleshooting

As stated, this is a very minimal, quickly put-together UI; it is not maintained or updated by Hyperstack, and is certainly not bug-free! However, feel free to modify the code in the main.py file to solve any issues or add any features you like.

One bug we are aware of from our early testing is the UI's inability to replace old inputs when new ones are uploaded. In this case, simply press Ctrl+C to terminate the server, re-run the same uvicorn command and reload the web page; this starts a fresh instance of the UI without the issue.

What's Next?

Congratulations! You've now got your own private, high-performance OCR server running. This Gradio UI is a fantastic sandbox for testing, but the real power comes from what you can build on top of it.

The most logical next step is to adapt the web/main.py file. Instead of launching a Gradio UI, you could modify it to create a simple, robust REST API endpoint using FastAPI. Imagine an endpoint where you can POST an image or PDF file and get a clean JSON response containing the extracted markdown.
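As a rough illustration of that idea, here is a minimal sketch of what such an endpoint could look like. The route name (/ocr) and the run_ocr helper are hypothetical placeholders, not part of the shipped main.py; you would wire run_ocr up to the same vLLM inference code the existing UI already uses:

# Hypothetical sketch of a REST-style OCR endpoint (not the shipped main.py)
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def run_ocr(data: bytes, filename: str) -> str:
    # Placeholder: pass the uploaded bytes to DeepSeek-OCR via vLLM
    # and return the extracted markdown.
    raise NotImplementedError

@app.post("/ocr")
async def ocr_endpoint(file: UploadFile = File(...)):
    data = await file.read()
    markdown = run_ocr(data, file.filename)
    return {"filename": file.filename, "markdown": markdown}

You could serve it with the same uvicorn command from Step 7 (pointing at the new module) and POST files to it from any language or tool.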

Once you have that API, the possibilities are endless:

  • Build a RAG Pipeline: This is the big one. You can now programmatically feed your entire library of PDFs and documents through this API, storing the clean markdown output in a vector database (see the sketch after this list).

  • Create a "Chat with your Docs" App: Combine your new OCR API with a conversational LLM (like DeepSeek-LLM) to build a powerful application that lets you ask questions about your documents.

  • Automate Data Entry: Create a workflow that watches a specific folder or email inbox, runs any new attachments through your OCR API, and then parses the structured output to populate a database or spreadsheet.
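For the first of these, the starting point can be as small as a loop over a folder of PDFs. The sketch below is hypothetical: it assumes you have exposed a /ocr endpoint like the one shown earlier, and that the requests library is available in your environment (uv pip install requests if it isn't):

# Hypothetical client: POST every PDF in a folder to the /ocr endpoint
# sketched above and save the returned markdown alongside each file.
from pathlib import Path

import requests

API_URL = "http://<your-VMs-ip>:3000/ocr"  # replace with your VM's address

for pdf in Path("documents").glob("*.pdf"):
    with open(pdf, "rb") as f:
        resp = requests.post(API_URL, files={"file": (pdf.name, f, "application/pdf")})
    resp.raise_for_status()
    pdf.with_suffix(".md").write_text(resp.json()["markdown"])
    print(f"Processed {pdf.name}")

From there, chunking the markdown and pushing it into a vector database is standard RAG plumbing.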

You've done the hard part by setting up the core engine. Now you can use your Hyperstack VM as a stable, private microservice to power all kinds of intelligent document-processing workflows.

Launch Your VM today and Get Started with Hyperstack!

FAQs

What type of model is DeepSeek-OCR?

DeepSeek-OCR is a multimodal model combining vision and language understanding, designed to extract text and structure from documents efficiently.

What format does DeepSeek-OCR output?

It outputs structured markdown that preserves tables, layout, and semantic information, making it ready for downstream processing or RAG pipelines.

Which engine is used for high-throughput serving?

vLLM is used as a high-throughput serving engine, optimised for GPU acceleration to deliver fast, efficient OCR performance.

Which package manager is required for setup?

The setup requires UV, a modern package manager, to create virtual environments and install all dependencies reliably on Hyperstack.
