Deploying Multi-Agent AI Workloads on GPU Clusters

Written by Damanpreet Kaur Vohra | Jul 3, 2026 9:10:43 AM

Multi-agent systems do not just multiply your capability. They multiply your inference bill.

Anthropic's own production data puts numbers on it. Agents use about 4x the tokens of a single chat interaction. Multi-agent systems use about 15x. That is not a rounding error. A single multi-agent research session can consume millions of tokens, which works out to dollars per query when billed at frontier API rates.

Once token volume is sustained and concurrency is real, per-token API billing stops being a convenience and starts being a constraint. Self-hosting open models on a GPU cluster changes the cost structure entirely: a fixed GPU-hour rate rather than per-token billing, no data leaving your tenancy, and full control over quantisation, fine-tuning, and model selection.

The architecture to do this is not especially complex. But the sequencing matters.

What Multi-Agent Systems Mean

A multi-agent system is several LLM-powered agents collaborating towards a goal. Each agent is an augmented LLM with its own instructions, tools, and memory. They are not monolithic.

The canonical production pattern is orchestrator-worker. A central LLM decomposes a task, delegates to worker LLMs with independent context windows, and synthesises the results. Anthropic's own research system uses exactly this: a LeadResearcher orchestrator (Claude Opus 4), Claude Sonnet 4 subagents, and a synthesis pass. That system outperformed single-agent Claude Opus 4 by 90.2% on their internal research evaluation. In Anthropic's analysis, token usage alone explained 80% of the variance in performance.

Other patterns exist: prompt chaining, routing, parallelisation (sectioning or voting), and evaluator-optimiser loops. All of them share the same economic structure. Each agent step is one or more LLM calls. Sequential steps compound latency. Fan-out multiplies cost.

Anthropic's own guidance is deliberately conservative on when to use multi-agent at all: find the simplest solution possible, and only increase complexity when the task value is high enough to pay for the increased performance. Multi-agent is a poor fit when all agents need to share the same context or when there are many dependencies between agents. Get that assessment right first.

Why the Cost Structure Breaks at Scale

The 4x and 15x token multipliers are not theoretical. They compound against frontier API pricing directly. At $3 per million input tokens and $15 per million output tokens for Sonnet-class models, and $5/$25 for Opus-class, the per-query cost of a multi-agent session scales fast.

Self-hosting on a GPU cluster addresses three things at once: cost (fixed GPU-hour rate instead of per-token billing), data privacy (no data leaves your tenancy), and model control (fine-tunes, LoRA adapters, quantisation, and open-weight model selection). The breakeven favours self-hosting once sustained token volume is high and GPU utilisation stays strong. APIs win for spiky, low-volume, or quality-critical frontier work where only a proprietary model will do.

The honest variable is GPU utilisation. Idle GPUs destroy the economics. The architecture has to keep them fed.

The Two-Layer Architecture

The winning pattern separates orchestration from model serving. Physically, not just logically.

Layer 1 is the agent runtime. It lives on CPU nodes. This is where the orchestration framework runs (LangGraph, CrewAI, OpenAI Agents SDK, or Microsoft Agent Framework), alongside the state and memory stores: Redis for short-term and session memory, Postgres for durable state and checkpoints, and a vector database for long-term memory and RAG. These are CPU-bound processes. Putting them on GPU nodes wastes expensive compute.
Layer 2 is the model-serving layer. It lives on GPU nodes. vLLM or SGLang pods, one deployment per model, scheduled via node affinity: large planner models to NVIDIA H100 nodes, smaller worker models to NVIDIA A100 or NVIDIA L40 nodes.

Between the two layers sits a LiteLLM router: one OpenAI-compatible endpoint in front of all model deployments, with per-role routing, load balancing, fallbacks, rate limits, and cost tracking. Each agent role points to its own vLLM endpoint. The orchestration layer never needs to know which physical GPU is serving which call.

Anthropic's own production pattern externalises state: state lives outside the agents so they can reset without losing it, with the lead agent saving its plan to memory before the 200K context window truncates. That principle applies regardless of which orchestration framework you choose.

The Serving Stack: vLLM, SGLang, and Prefix Caching

Prefix caching is not a nice-to-have for agent workloads. It is the single biggest performance lever available.

Agent loops re-send the same system prompt, tool definitions, and accumulated history on every step. Without caching, all of that is recomputed from scratch each time. vLLM's Automatic Prefix Caching (APC) and SGLang's RadixAttention both reuse the KV cache for shared prefixes. SGLang reports cache hit rates of 50% to 99% across its benchmarks, and throughput improvements of up to 6.4x over baseline inference systems.

vLLM is the default OpenAI-compatible server. In current vLLM V1, prefix caching is on by default. Key flags for agent deployments: --tensor-parallel-size for splitting large models across GPUs, --quantization fp8 and --kv-cache-dtype fp8 for memory efficiency on NVIDIA H100 and NVIDIA L40 hardware (FP8 reduces model memory requirements by roughly 2x with up to 1.6x throughput improvement), and guided decoding for reliable JSON and tool-call output, which is essential for agents that need to call tools deterministically.

SGLang is the stronger choice when prefix overlap is very high: shared tool definitions and system prompts across all agent sessions. As a rule of thumb, use it where the cache hit rate consistently exceeds 70%.

Ray Serve LLM wraps vLLM (with SGLang support emerging) for multi-model composition, prefix-aware routing, and queue-depth autoscaling, at roughly 1 to 2ms of per-request overhead. For autoscaling, do not use HPA on CPU. LLM serving is GPU-bound while CPU sits low. Use KEDA with a Prometheus trigger on vllm:num_requests_waiting (queue depth). KEDA also supports scale-to-zero for idle models.

GPU Sizing and On-Demand Pricing on Hyperstack

The right-sizing logic is straightforward: weights (in GB) equal parameters × 2 bytes for FP16, or × 1 byte for FP8. Add roughly 25% for KV cache, then set --gpu-memory-utilization between 0.85 and 0.92.

For a heterogeneous agent fleet:

Planner/orchestrator (70B FP8): 2× NVIDIA H100
On Hyperstack, NVIDIA H100 PCIe (80GB) is available on-demand at $2.50/hour, while NVIDIA H100 SXM (80GB) is $3.20/hour. Two NVIDIA H100 PCIe GPUs for the planner come to $5.00/hour.
Worker/executor (7B to 8B FP16): 1× NVIDIA A100 or NVIDIA L40 per replica
NVIDIA A100 PCIe (80GB) is $1.35/hour on-demand, while NVIDIA L40 (48GB) is $1.00/hour. Two worker replicas on NVIDIA A100 PCIe run at $2.70/hour.
Embeddings and reranker: 1× NVIDIA L40 (shared)
$1.00/hour.

A full heterogeneous agent fleet runs at approximately $8.70/hour, comprising two H100 PCIe GPUs for the planner ($5.00/hour), two A100 PCIe GPUs for worker replicas ($2.70/hour), and one shared L40 GPU for embeddings and reranking ($1.00/hour).

Where Multi-Agent Systems Break in Production

Most of these problems have a specific signal that shows up before the system fails visibly. The teams that catch them early are watching metrics, not waiting for complaints.

KV cache pressure

When vllm:gpu_cache_usage_perc sits consistently above 90%, the serving layer is running out of room to hold concurrent sessions. The underlying cause is usually too little memory headroom left for the KV cache once the weights are loaded: size weights at parameters × 2 bytes for FP16, or × 1 byte for FP8, then allow roughly 25% on top for the KV cache. Switching to FP8 KV cache roughly halves the pressure. PagedAttention already handles fragmentation well, but it cannot fix a fundamentally undersized allocation.

Cold starts when scaling

Scale-out events feel fast until a new pod takes longer to become ready than users are willing to wait. With weights pre-cached on a PVC, a 7B model takes roughly 30 to 60 seconds to load; a 70B model takes 2 to 5 minutes. Without pre-caching, those numbers become 5 to 10 minutes for 7B and 20 minutes or more for 70B. Pre-caching weights on a PVC or baking them into the container image is the standard fix. For hot models that cannot afford any scale-to-zero, setting KEDA minReplicaCount to 1 and cooldownPeriod to 300 keeps at least one replica warm at all times.

Context window blowup

Output quality degrading in long agent loops, or truncation errors appearing mid-run, is usually the first sign. Anthropic describes this as context rot: recall degrades as the context window fills, not all at once but gradually. The right response is to trim and summarise history actively, externalise state using the plan-to-memory pattern so agents can reset without losing their progress, and offload KV cache to CPU or NVMe where the framework supports it.

Runaway fan-out

Token spend per session spiking 10x or more versus baseline is the signal. A subagent recursively spawning subagents, or a tool returning oversized results, multiplies cost with no corresponding quality gain. The fix is architectural: explicit per-run fan-out caps and circuit breakers baked into the system before it goes to production. These safeguards do not come switched on by default, so they have to be added deliberately.

Observability gaps

Non-deterministic agent behaviour means failures are often hard to reproduce without a full trace. If the team cannot reliably reproduce a failure, the system is not observable enough. LangSmith, Langfuse, or OpenTelemetry tracing should be in place from the first production deployment, not added after the first incident. At the infrastructure layer, the NVIDIA DCGM Exporter exposes per-GPU Prometheus metrics on port 9400, and vLLM's /metrics endpoint surfaces vllm:time_to_first_token_seconds, vllm:num_requests_waiting, and vllm:gpu_cache_usage_perc. All three matter for diagnosing whether a slow agent step is a model problem or a GPU problem.

Security

Multi-agent systems have a larger attack surface than single-agent systems because of agent-to-agent communication. The OWASP GenAI Security Project's agentic AI threat guide names the specific risks: tool misuse, intent manipulation, and privilege compromise. Sandboxed tools, input and output guardrails, least-privilege tool scopes, and validation of inter-agent messages are the standard mitigations. None of these are optional in production.

How to Stage the Build

The staged adoption logic matters as much as the architecture.

Stage 1: Prove value before building the infrastructure. Start with a single augmented LLM or a simple workflow. Only adopt multi-agent when the task is high value, decomposes into parallel independent threads, and a single agent has measurably hit a quality ceiling. The threshold to escalate: single-agent quality plateaus and the task value justifies a 4 to 15x token premium.

Stage 2: Stand up the two-layer architecture. Deploy the agent runtime on CPU nodes with Redis, Postgres, and a vector database. Serve one model per role with vLLM on Hyperstack's managed Kubernetes via Secure Private Cloud: 70B FP8 planner on 2× NVIDIA H100, 8B workers on NVIDIA A100 or NVIDIA L40, embeddings and reranker on NVIDIA L40. Put LiteLLM in front. Turn on prefix caching everywhere. If the cache hit rate is low, reorder prompts so fixed content (system prompt and tool definitions) is byte-identical and at the front of every call.

Stage 3: Make it elastic and observable. Add KEDA autoscaling on queue depth, pre-cache weights on a PVC, deploy DCGM Exporter with Prometheus and Grafana, and wire up agent tracing. Add a cost-per-token recording rule as a first-class metric. Scale up a role when sustained queue depth exceeds 5 per replica or p95 TTFT breaches your SLA.

Stage 4: Harden. Add fan-out caps, circuit breakers, tool sandboxing, guardrails, and inter-agent message validation per the OWASP agentic guidance. Run a cost model monthly. If GPU utilisation is consistently low, consolidate models onto MIG slices or move bursty, low-volume roles back to an API.

The Decision Is Economic

Multi-agent systems earn their cost when the task value is high, the work decomposes into independent parallel threads, and token volume is sustained. At that point, per-token API billing is the wrong pricing model.

Self-hosting on a GPU cluster with a two-layer architecture, prefix caching, FP8 quantisation, and role-based model routing changes the unit economics. At roughly 3,000 tokens per second aggregate for a 70B FP8 model on 2× NVIDIA H100 ($5.00/hour), self-hosted throughput reaches approximately 2.16 million tokens per dollar. Frontier API billing works out to roughly 0.04 to 0.33 million tokens per dollar.

The architecture is buildable today on Hyperstack's Kubernetes. The GPU mix maps directly onto agent roles, our documentation covers the serving stack end to end, and on-demand pricing means there is no minimum commitment to get started. What remains is the engineering judgement about when to build it versus when to stay on an API. That judgement should be data-driven.

Multi-agent AI demands infrastructure that can keep pace with growing token volumes, increasing concurrency, and tightening latency targets.

Start Building on Hyperstack

With on-demand GPU clusters, managed Kubernetes, and support for vLLM, SGLang, LiteLLM, and modern agent frameworks, Hyperstack provides everything you need to deploy production AI infrastructure.

Explore Hyperstack GPU Cloud and start deploying production-ready AI infrastructure today.

FAQs

What is a multi-agent AI system?

A multi-agent AI system consists of multiple AI agents collaborating, sharing tasks, and coordinating to solve complex problems.

Why do multi-agent AI systems cost more?

Multi-agent systems generate many additional model calls, increasing token usage, inference time, and overall compute requirements.

What is the best architecture for multi-agent AI workloads?

The recommended architecture separates agent orchestration on CPUs from model serving on dedicated GPU infrastructure for efficiency.

Why should enterprises self-host multi-agent AI workloads?

Self-hosting provides predictable costs, stronger data privacy, infrastructure control, and flexibility in model selection and tuning.

What is prefix caching in AI inference?

Prefix caching reuses previously computed prompt segments, reducing latency and significantly improving throughput for repetitive workloads.

What is vLLM and why is it used for AI serving?

vLLM is an open-source inference server designed for high throughput, efficient memory usage, and OpenAI-compatible APIs.

How do GPU clusters improve multi-agent AI performance?

GPU clusters enable parallel inference, support multiple agents simultaneously, and deliver lower latency for large-scale workloads.

When should organisations use multi-agent systems?

Organisations should adopt multi-agent systems when single agents cannot deliver required quality or tasks need parallel execution.

What are the biggest production challenges in multi-agent AI?

Common challenges include cache pressure, context growth, cold starts, observability gaps, security risks, and uncontrolled costs.

Is self-hosting AI cheaper than using APIs?

Self-hosting often becomes cheaper when workloads are continuous, token volumes are high, and GPU utilisation remains strong.

View full post