In our latest tutorial, we show how to set up DeepSeek-OCR on a Hyperstack VM to create a high-performance, private OCR workflow. DeepSeek-OCR is a 3-billion-parameter multimodal model that combines a vision encoder and language decoder to extract text and preserve document structure, including tables and complex layouts. Using vLLM for GPU-accelerated serving, you can run PDFs and images through a simple Gradio UI or build a custom REST API.
Take Control of Your Own OCR Workflow with DeepSeek-OCR and Hyperstack
Optical Character Recognition (OCR) is the process of recognising and extracting text from a visual source, such as an image or PDF, using only what can be seen on the page - it's what we do when we read!
Methods for performing OCR have existed for a while, but in the past few years (and especially in recent months), transformer-based models have become incredibly competent at it. DeepSeek, one of the world's leading AI foundation model labs, has released DeepSeek-OCR, a 3-billion-parameter model for quickly and easily creating your own OCR workflows.

Why is it harder to run than other DeepSeek models?
You might be used to running other AI models, like DeepSeek's LLMs, which are often available via a simple API call or a straightforward Python library like transformers. We've even made tutorials in the past that you can follow to get DeepSeek V3 up and running. DeepSeek-OCR is a bit more hands-on because it's not just a language model; it's a specialised multimodal system.
It essentially has two parts: a sophisticated vision encoder that sees and understands the layout of a page (just like our eyes), and a 3-billion-parameter language decoder that reads and interprets the text from that visual information. This two-stage process is what makes it so powerful, but it also requires a more complex stack of software to run efficiently.
The setup in this guide uses vLLM, a high-throughput serving engine, to get the best possible performance. This is what adds most of the setup steps - we need to install a particular version of it along with dependencies like flash-attn. It's this requirement for a high-performance, GPU-accelerated serving environment that makes it more complex than a simple pip install package, but the payoff in speed and accuracy is well worth it.
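To make that concrete, here's a minimal sketch of what driving DeepSeek-OCR through vLLM's offline Python API can look like. Treat the prompt string and sampling settings below as illustrative assumptions rather than the exact values used later in this guide - the main.py you'll download below wraps all of this for you.

# Minimal sketch: running DeepSeek-OCR through vLLM's offline Python API.
# NOTE: the prompt format and sampling values here are assumptions for illustration;
# check the DeepSeek-OCR repo's own examples for the exact settings.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",  # Hugging Face model ID
    trust_remote_code=True,            # custom vision encoder + decoder code
)

image = Image.open("page_scan.png").convert("RGB")
params = SamplingParams(temperature=0.0, max_tokens=4096)

outputs = llm.generate(
    {
        "prompt": "<image>\nConvert this page to markdown.",  # assumed prompt format
        "multi_modal_data": {"image": image},
    },
    params,
)
print(outputs[0].outputs[0].text)  # the extracted markdown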
How good is DeepSeek-OCR?
In short: it's exceptionally good. It represents the current state-of-the-art for open-source OCR in its size group, especially when it comes to understanding real-world, complex documents.
Where traditional OCR tools might just extract a "wall of text" that loses all formatting, DeepSeek-OCR understands the structure of the document. This is its key advantage. It excels at:
- Complex Layouts: Accurately reading multi-column articles, magazine pages, and scientific papers.
- Tables: It doesn't just see text in a table; it understands the table's rows and columns and formats the output (as markdown) to match.
- Mixed Content: It's highly adept at handling pages with a mix of text, code blocks, and even mathematical equations.
Because it outputs structured markdown, you're not just getting the raw text; you're getting the document's semantic structure. This makes its output immediately useful for feeding into other systems, like a RAG pipeline or a summarisation model. For its 3B-parameter size, it hits a perfect sweet spot of being incredibly accurate while still being fast enough to interpret huge documents on a single H100 GPU.
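As a purely hypothetical illustration of that structured output (not actual model output), a scanned invoice page containing a small table might come back as markdown along these lines:

## Invoice #1042

| Item     | Qty | Price  |
|----------|-----|--------|
| Widget A | 2   | £10.00 |
| Widget B | 1   | £4.50  |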
How to set up DeepSeek-OCR on your own Hyperstack VM, step-by-step
We'll take you through the whole process from start to finish to get a simple, basic OCR workflow running on your own Hyperstack VM.
Step 0: Getting a Hyperstack VM
This guide assumes you've just spun up a new Linux VM on our platform and can access it via SSH. If you haven't done this before, please see our getting started guide in our documentation.
Step 1: Clone the DeepSeek-OCR repo
# Clone the DeepSeek-OCR repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
Step 2: Install UV (the package manager):
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
Step 3: Create a python virtual environment:
uv venv deepseek-ocr --python 3.12.9
source deepseek-ocr/bin/activate
Step 4: Install vLLM and other requirements
cd DeepSeek-OCR

# Get vllm whl
wget https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
unzip vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl -d vllm-0.8.5+cu118-whl

# Install requirements
uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install uvicorn fastapi gradio --upgrade
uv pip install transformers==4.57.1 --upgrade
This step may take a while - there are a lot of dependencies!
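Once everything has installed, it can be worth running a quick, optional sanity check (not part of the official setup) that the key packages import and that the GPU is visible from inside the deepseek-ocr environment:

# Optional sanity check - run with `python` inside the activated deepseek-ocr environment
import torch
import vllm
import flash_attn  # just confirming the build installed correctly

print("torch:", torch.__version__)                    # expect 2.6.0
print("vllm:", vllm.__version__)                      # expect 0.8.5
print("CUDA available:", torch.cuda.is_available())   # expect True on the GPU VM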
Step 5: Download the Python code
main.py
This is a standalone Python file that sets up the web server and hosts it on your VM. We recommend you have a quick read through it before you attempt to run it, just to familiarise yourself with what it does (more on this later).
Step 6: Get the code into your VM:
# Create the "web" dir and put main.py in therecd DeepSeek-OCR-master/DeepSeek-OCR-vllmmkdir -p webcat <<EOF > web/main.py<paste the contents of main.py here>EOF
You can alternatively use an editor like nano or vim, or connect to the VM over SSH from a more interactive tool like VS Code to make this part easier.
Step 7: Start the server and access via your browser
# Start the server
uvicorn web.main:app --host 0.0.0.0 --port 3000
You should now be able to navigate to http://<your-VMs-ip>:3000 in your browser and interact with the UI!
NOTE: Remember to open port 3000 for inbound TCP traffic via your VM's firewall on Hyperstack! For more info on this, see our documentation here
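If the browser can't reach the page, a quick way to confirm the server itself is up (independently of the firewall) is to fetch it locally on the VM from a second SSH session, for example with this small stdlib-only snippet:

# Optional check from a second SSH session on the VM itself
import urllib.request

with urllib.request.urlopen("http://localhost:3000", timeout=10) as resp:
    print(resp.status)  # 200 means the app is serving locally, so a browser issue is likely the firewall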
Once loaded, it should look something like this:
[Screenshot: the DeepSeek-OCR web UI]
In this simple, barebones UI, you can upload PDFs or images and DeepSeek-OCR will automatically run on them.
The results will be visible in the lower box, with the option to see (and download) the labelled input and the extracted text in markdown format.
To re-run, simply delete the existing input and upload something new!
Here's an example of a PDF article processed by DeepSeek-OCR:

Troubleshooting
As stated, this is a very minimal, quickly-put-together UI, and hence is not maintained and updated by Hyperstack, and is certainly not bug-free! However, feel free to modify the code in the main.py file to solve any issues or add any features you like.
One bug we are aware of from our early testing is the UI's inability to replace old inputs when new ones are uploaded. In this case, simply press Ctrl+C to terminate the server, re-run the same uvicorn command and reload the web page - this starts a fresh instance of the UI without the issue.
What's Next?
Congratulations! You've now got your own private, high-performance OCR server running. This Gradio UI is a fantastic sandbox for testing, but the real power comes from what you can build on top of it.
The most logical next step is to adapt the web/main.py file. Instead of launching a Gradio UI, you could modify it to create a simple, robust REST API endpoint using FastAPI. Imagine an endpoint where you can POST an image or PDF file and get a clean JSON response containing the extracted markdown.
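As a rough sketch of what that could look like, here is a minimal FastAPI endpoint. Note that run_deepseek_ocr is a hypothetical stand-in for whatever function in your main.py actually runs inference through vLLM, and the endpoint name and response fields are our own choices, not part of the provided code:

# Sketch of a REST endpoint for OCR - run_deepseek_ocr is a hypothetical stand-in
# for whatever function in your main.py actually runs inference through vLLM.
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def run_deepseek_ocr(file_bytes: bytes, filename: str) -> str:
    """Hypothetical helper: feed the uploaded image/PDF to DeepSeek-OCR and return markdown."""
    raise NotImplementedError("wire this up to your existing inference code")

@app.post("/ocr")
async def ocr(file: UploadFile = File(...)):
    contents = await file.read()
    markdown = run_deepseek_ocr(contents, file.filename)
    return {"filename": file.filename, "markdown": markdown}

You'd launch it with the same uvicorn command as before, then POST files to http://<your-VMs-ip>:3000/ocr.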
Once you have that API, the possibilities are endless:
- Build a RAG Pipeline: This is the big one. You can now programmatically feed your entire library of PDFs and documents through this API, storing the clean markdown output in a vector database (see the sketch after this list).
- Create a "Chat with your Docs" App: Combine your new OCR API with a conversational LLM (like DeepSeek-LLM) to build a powerful application that lets you ask questions about your documents.
- Automate Data Entry: Create a workflow that watches a specific folder or email inbox, runs any new attachments through your OCR API, and then parses the structured output to populate a database or spreadsheet.
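To make the first idea a little more concrete, here is a rough client-side sketch that pushes a folder of PDFs through the hypothetical /ocr endpoint from the previous section. The endpoint, the response fields and the requests library (which you may need to install separately) are all assumptions for illustration:

# Sketch: feed every PDF in a folder through the hypothetical /ocr endpoint
# and collect the markdown for downstream use (e.g. chunking into a vector database).
from pathlib import Path
import requests  # not installed by default, e.g. `uv pip install requests`

OCR_URL = "http://<your-VMs-ip>:3000/ocr"  # replace with your VM's IP

results = {}
for pdf in Path("documents").glob("*.pdf"):
    with open(pdf, "rb") as f:
        resp = requests.post(OCR_URL, files={"file": (pdf.name, f, "application/pdf")})
    resp.raise_for_status()
    results[pdf.name] = resp.json()["markdown"]
    print(f"OCR'd {pdf.name}: {len(results[pdf.name])} characters of markdown")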
You've done the hard part by setting up the core engine. Now you can use your Hyperstack VM as a stable, private microservice to power all kinds of intelligent document-processing workflows.
Launch Your VM today and Get Started with Hyperstack!
FAQs
What type of model is DeepSeek-OCR?
DeepSeek-OCR is a multimodal model combining vision and language understanding, designed to extract text and structure from documents efficiently.
What format does DeepSeek-OCR output?
It outputs structured markdown that preserves tables, layout, and semantic information, making it ready for downstream processing or RAG pipelines.
Which engine is used for high-throughput serving?
vLLM is used as a high-throughput serving engine, optimised for GPU acceleration to deliver fast, efficient OCR performance.
Which package manager is required for setup?
The setup requires UV, a modern package manager, to create virtual environments and install all dependencies reliably on Hyperstack.