SLURM vs Kubernetes for Model Training: How to Choose the Right Orchestrator

Written by Damanpreet Kaur Vohra | Jun 17, 2026 9:48:41 AM

The team picked Kubernetes. They had containers, they had pipelines, it felt like the right call. Six weeks into a large distributed training run, jobs were getting preempted mid-epoch, checkpoint recovery was unreliable, and the GPU utilisation numbers looked nothing like what the benchmarks had promised. Switching to SLURM mid-project was not an option. Rebuilding the workflow was.

This is the conversation that plays out repeatedly in AI infrastructure teams. Not because one scheduler is wrong. Because each was built for a different set of problems and the wrong choice at the architecture stage costs you weeks later.

The decision matters more than it did when training runs were measured in hours, not days. Training runs are longer, clusters are larger and the cost of instability compounds across both dimensions. This post covers what each scheduler was actually built to do, where each wins and what the infrastructure needs to support either.

What Each Scheduler Was Built to Do

SLURM (Simple Linux Utility for Resource Management) originated in HPC. It was designed to schedule batch jobs across homogeneous compute clusters, allocate resources at the node and GPU levels with precision and handle the long-running, tightly coupled workloads that HPC teams run. Job definitions are bash scripts. Resource allocation is explicit. The scheduler's mental model is a cluster of mostly identical machines doing mostly predictable work.

Kubernetes was built to orchestrate containerised microservices. Its scheduler thinks in pods, not nodes. It handles dynamic workloads, variable resource demand, and service availability. The cloud-native ecosystem grew around it: Helm, Argo, Kubeflow, Ray and horizontal pod autoscaling. The mental model is a pool of resources being allocated dynamically to heterogeneous services with varying lifespans.

Neither was built for large-scale distributed model training. Both have adapted. The question is how well each adaptation holds up under the actual conditions of a serious training run.

Where SLURM Wins

For large-scale distributed training, SLURM's design assumptions match the workload requirements more closely.

Gang scheduling is the clearest example. A distributed training job needs all participating nodes to start simultaneously. SLURM allocates the full resource block before any part of the job begins. Kubernetes, by default, schedules pods individually. Under resource pressure, pods start at different times and the training job either waits or proceeds with partial allocation. Neither outcome is acceptable for a multi-node all-reduce operation.
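
To make the difference concrete, here is a minimal Python sketch of the two placement strategies under resource pressure. It is a toy model, not how either scheduler is implemented: the Job class, the function names and the round-robin placement order are all assumptions chosen to make the interleaving visible.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int  # GPUs the job needs, one worker per GPU (hypothetical)


def schedule_individually(jobs, free_gpus):
    """Place workers one at a time, as a pod-by-pod scheduler might."""
    placed = {job.name: 0 for job in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for job in jobs:  # round-robin: each job gets one worker per pass
            if free_gpus > 0 and placed[job.name] < job.workers:
                placed[job.name] += 1
                free_gpus -= 1
                progress = True
    return placed


def schedule_as_gang(jobs, free_gpus):
    """Admit a job only if its whole worker set fits; otherwise place nothing."""
    placed = {}
    for job in jobs:
        if job.workers <= free_gpus:
            placed[job.name] = job.workers
            free_gpus -= job.workers
        else:
            placed[job.name] = 0  # job stays queued and holds no GPUs
    return placed


def runnable(jobs, placed):
    """A distributed job can start only when every worker is placed."""
    return [job.name for job in jobs if placed[job.name] == job.workers]


# Two 8-GPU jobs competing for 12 free GPUs.
jobs = [Job("train-a", 8), Job("train-b", 8)]
individual = schedule_individually(jobs, free_gpus=12)
gang = schedule_as_gang(jobs, free_gpus=12)
print(individual, runnable(jobs, individual))
print(gang, runnable(jobs, gang))
```

In this example the individual strategy hands six GPUs to each job, so all twelve GPUs are held and neither job can start. The all-or-nothing strategy starts one job and leaves the other queued without holding anything.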

SLURM's GPU allocation model is also closer to what large training runs need. Allocation happens at the physical level: specific GPUs on specific nodes, with awareness of NVLink topology, PCIe bandwidth, and NUMA boundaries. Kubernetes allocates GPUs as generic resources. It does not natively understand GPU topology, which means two GPUs allocated to the same pod may have meaningfully different communication latency depending on how they sit on the hardware.
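
The effect of topology awareness can be sketched in a few lines of Python. Everything here is hypothetical: the four-GPU node layout, the link names and the relative link costs are illustrative assumptions, not measurements and not the data model of either scheduler.

```python
from itertools import combinations

# Hypothetical relative costs per link type (lower is faster).
LINK_COST = {"NVLINK": 1, "PCIE": 4, "CROSS_NUMA": 10}

# Hypothetical 4-GPU node: GPUs 0-1 and 2-3 are NVLink pairs,
# every other pair has to cross a NUMA boundary.
TOPOLOGY = {
    (0, 1): "NVLINK",
    (2, 3): "NVLINK",
    (0, 2): "CROSS_NUMA",
    (0, 3): "CROSS_NUMA",
    (1, 2): "CROSS_NUMA",
    (1, 3): "CROSS_NUMA",
}


def group_cost(gpus):
    """Sum the link cost over every pair of GPUs in the group."""
    return sum(LINK_COST[TOPOLOGY[pair]] for pair in combinations(sorted(gpus), 2))


def pick_generic(free, count):
    """Treat GPUs as interchangeable: take the lowest-numbered free ones."""
    return tuple(sorted(free)[:count])


def pick_topology_aware(free, count):
    """Choose the free group with the cheapest internal links."""
    return min(combinations(sorted(free), count), key=group_cost)


free_gpus = {1, 2, 3}  # GPU 0 is already in use
generic = pick_generic(free_gpus, 2)
aware = pick_topology_aware(free_gpus, 2)
print(generic, group_cost(generic))
print(aware, group_cost(aware))
```

With GPU 0 busy, the count-based pick pairs GPUs 1 and 2 across the NUMA boundary, while the topology-aware pick returns the NVLink pair 2 and 3.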

Checkpoint recovery under SLURM is more straightforward. Jobs write to shared filesystems, restart from a defined state, and the scheduler handles re-queuing without container lifecycle complexity. In Kubernetes, checkpoint logic has to account for pod state, persistent volume claims, and restart policies. That overhead is manageable for short jobs. For training runs measured in days, it adds up.
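
The restart-from-a-defined-state pattern is scheduler-independent at its core. The sketch below shows one way to do it in Python, assuming checkpoints are small JSON files in a shared directory; the file naming, the toy training loop and the simulated failure are all hypothetical stand-ins for a real framework's checkpoint API.

```python
import json
import os
import tempfile


def save_checkpoint(directory, step, state):
    """Write to a temp file, then rename, so readers never see a partial file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"step_{step:08d}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # Rename is atomic on POSIX when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(directory):
    """Return the newest complete checkpoint, or None on a fresh start."""
    if not os.path.isdir(directory):
        return None
    names = sorted(n for n in os.listdir(directory) if n.endswith(".json"))
    if not names:
        return None
    with open(os.path.join(directory, names[-1])) as handle:
        return json.load(handle)


def train(directory, total_steps, checkpoint_every, fail_at=None):
    """Toy loop: 'state' is a running sum standing in for model weights."""
    checkpoint = load_latest_checkpoint(directory)
    step = checkpoint["step"] if checkpoint else 0
    state = checkpoint["state"] if checkpoint else 0
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        step += 1
        state += step
        if step % checkpoint_every == 0:
            save_checkpoint(directory, step, state)
    return step, state


with tempfile.TemporaryDirectory() as ckpt_dir:
    try:
        train(ckpt_dir, total_steps=10, checkpoint_every=4, fail_at=6)
    except RuntimeError:
        pass  # a scheduler would requeue the job at this point
    resumed_from = load_latest_checkpoint(ckpt_dir)["step"]
    final_step, final_state = train(ckpt_dir, total_steps=10, checkpoint_every=4)
    print(resumed_from, final_step, final_state)
```

The second call to train stands in for the requeued job: it finds the last complete checkpoint, skips the work already done and loses only the steps taken since that checkpoint.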

MPI integration is native in SLURM: srun launches one task per rank and exposes rank and node information through the environment. Workloads that use NCCL all-reduce follow the same one-process-per-GPU pattern, so the process group maps directly onto how SLURM distributes tasks across nodes. Kubernetes MPI operator implementations exist, but they add abstraction layers between the scheduler and the communication fabric that can affect performance at scale.

Where Kubernetes Wins

Kubernetes is the better choice when the workload is heterogeneous, the lifecycle is short, or the team needs to run training and serving infrastructure from the same cluster.

MLOps pipelines are the clearest case. A pipeline that preprocesses data, runs a fine-tuning job, evaluates the output, and deploys a model endpoint involves multiple workload types with different resource profiles. Kubernetes handles this natively. SLURM was not designed for long-lived services or event-driven workflows, and retrofitting it for that purpose creates complexity rather than removing it.

GPU fragmentation is a real problem at scale, and Kubernetes handles it better for mixed workloads. Bin-packing algorithms in the Kubernetes scheduler can place smaller jobs across available GPU capacity more efficiently when job sizes vary. SLURM's gang scheduling model assumes homogeneous resource blocks; under mixed workload conditions, that assumption can leave capacity stranded while larger allocations wait for resources.
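
As an illustration of what bin-packing means here, the Python sketch below uses first-fit decreasing, a generic textbook heuristic. It is not the scoring logic of the Kubernetes scheduler, and the job names, GPU counts and node sizes are hypothetical.

```python
def first_fit_decreasing(requests, gpus_per_node, node_count):
    """Place GPU requests largest first, each on the first node with room."""
    free = [gpus_per_node] * node_count
    placement, unplaced = {}, []
    for name, gpus in sorted(requests.items(), key=lambda item: -item[1]):
        for node, available in enumerate(free):
            if gpus <= available:
                free[node] -= gpus
                placement[name] = node
                break
        else:
            unplaced.append(name)  # no node had enough free GPUs
    return placement, unplaced, free


# Hypothetical mixed workload on two 8-GPU nodes.
requests = {
    "eval": 1,
    "finetune-a": 4,
    "finetune-b": 3,
    "notebook": 2,
    "preprocess": 2,
    "finetune-c": 4,
}
placement, unplaced, free = first_fit_decreasing(requests, gpus_per_node=8, node_count=2)
print(placement, unplaced, free)
```

Here six jobs of varying size fill both nodes completely. A scheduler that only hands out fixed eight-GPU blocks would need six nodes' worth of blocks for the same six jobs.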

The ecosystem is one of Kubernetes' strongest advantages. Tools such as Kubeflow, Argo Workflows, Ray on Kubernetes, KServe, MLflow integrations, and service meshes provide a mature foundation for building end-to-end AI platforms. Teams can manage experiment pipelines, model registries, feature services, inference endpoints, and operational workflows within the same environment. For organisations standardising on cloud-native infrastructure, this reduces the number of separate systems that need to be operated and maintained.

Kubernetes also offers significant advantages for platform teams managing shared AI infrastructure. Namespace isolation, RBAC, resource quotas, admission controls, and policy frameworks allow multiple teams to safely share the same cluster while maintaining governance and operational boundaries. This becomes increasingly valuable as AI initiatives expand beyond a single research group and require support for data science, engineering, MLOps, and inference workloads on the same platform.

Another advantage is consistency across the AI lifecycle. The same orchestration platform can run data processing jobs, training workloads, batch inference, online serving, monitoring, and supporting services. While SLURM excels at scheduling large-scale training jobs, Kubernetes enables organisations to consolidate a broader range of workloads onto a common operational model, reducing platform sprawl and simplifying automation.

When Teams Run Both

The hybrid architecture is not a compromise. It reflects a genuine mismatch in what each scheduler is optimised for, applied to workloads that have different requirements by design.

The typical split: SLURM owns the training cluster, Kubernetes owns everything else. The reason is not preference. It is that gang scheduling, bare-metal GPU allocation, and MPI-native job distribution are non-negotiable requirements for serious distributed training. No Kubernetes operator fully replicates those guarantees. At the same time, SLURM cannot sensibly manage a model serving endpoint, a feature pipeline or a CI system for ML code. Those workloads need dynamic scaling, container lifecycle management, and ecosystem tooling that SLURM was not built to provide.

The operational implication: the infrastructure underneath both schedulers must support the requirements of each. The storage layer needs to be accessible from both environments. The network fabric needs to handle all-reduce traffic from the SLURM cluster without introducing congestion that affects inference latency on the Kubernetes side. The two schedulers run on the same physical hardware, but their performance requirements are not the same, and the infrastructure design has to account for both.

What the Infrastructure Supporting SLURM and Kubernetes Needs

The scheduler choice is the visible decision. The infrastructure underneath is where the constraints live.

Distributed training at scale needs predictable GPU-to-GPU communication latency. A single congestion event on the fabric during an all-reduce operation stalls every participating node. On shared infrastructure, that congestion can come from another tenant's workload. On single-tenant infrastructure, you control the fabric. The difference in training stability is significant, and it compounds across long runs.
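
A toy model shows why one slow link matters so much. In a synchronous step, every rank waits for the slowest one, so the step time is the maximum across ranks and the rest sit idle for the difference. The per-rank timings below are hypothetical, and the model ignores any overlap of compute and communication.

```python
def synchronous_step(per_rank_ms):
    """Return (step time, total idle time) for one synchronous training step."""
    step_ms = max(per_rank_ms)  # every rank waits for the slowest
    idle_ms = sum(step_ms - t for t in per_rank_ms)
    return step_ms, idle_ms


# Hypothetical 8-rank job: 400 ms per step when the fabric is healthy.
healthy = [400] * 8
# Same job with one rank delayed by 300 ms of fabric congestion.
congested = [400] * 7 + [700]

healthy_step, healthy_idle = synchronous_step(healthy)
congested_step, congested_idle = synchronous_step(congested)
print(healthy_step, healthy_idle)
print(congested_step, congested_idle)
```

In this example a 300 ms delay on one rank costs 2,100 ms of idle GPU time across the other seven, and that cost scales with the number of participating ranks.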

Storage requirements are split across workload types. Training jobs need high-throughput NVMe scratch for data staging and checkpoint writes mid-run. Persistent datasets and model artefacts need durable storage that survives node restarts. Parallel file access across multiple nodes, common in both HPC-style SLURM workloads and large distributed training jobs, requires a shared filesystem that can sustain throughput under concurrency without degrading.

Hyperstack Secure Private Cloud is built to run both schedulers on single-tenant, dedicated infrastructure. Under the Managed Platform deployment model, Hyperstack operates the cluster at the orchestrator layer with SLURM and Kubernetes available as scheduler options. The storage stack covers NVMe scratch, Shared Storage Volumes (SSVs), Secure Object Storage and a parallel filesystem. Networking uses RoCE (Ethernet) or InfiniBand fabrics, with NVIDIA ConnectX-8 SuperNICs for deployments where GPU-to-GPU communication bandwidth is the limiting factor.

Run SLURM, Kubernetes or Both on Dedicated Infrastructure

Hyperstack Secure Private Cloud supports both SLURM and Kubernetes under the Managed Platform deployment model on single-tenant, dedicated infrastructure with RoCE or InfiniBand networking, NVMe scratch, SSVs, Secure Object Storage and a parallel filesystem. No shared-tenancy noise. No GPU allocation compromises.

Talk to the team about how Secure Private Cloud fits your training architecture.

FAQ

Can Kubernetes reliably run distributed model training?

Yes, with the right operators and configuration. Kubeflow's PyTorchJob and MPI Operator handle distributed training on Kubernetes. The limitations show at scale: gang scheduling guarantees are weaker than SLURM's, GPU topology awareness requires explicit plugin configuration, and checkpoint recovery adds container lifecycle overhead. For smaller distributed jobs or fine-tuning workloads, Kubernetes is workable. For multi-node pre-training runs measured in days, SLURM's design assumptions are a closer match.

What is gang scheduling and why does it matter for training?

Gang scheduling starts all tasks in a distributed job simultaneously, across all allocated nodes. For all-reduce operations in distributed training, all participating processes need to be running at the same time. If workers start at different times, the job either waits for the full allocation (holding the GPUs it already has idle) or tries to proceed with a partial allocation, in which case collective operations stall or time out because they block until every rank has joined. SLURM grants a job's full allocation before launching any task. Kubernetes needs an additional scheduler or plugin, such as Volcano, Kueue or the coscheduling scheduler plugin, to approximate it.

How does GPU topology awareness differ between SLURM and Kubernetes?

SLURM allocates GPUs with awareness of NVLink connectivity, PCIe topology, and NUMA boundaries. A job can request GPUs that are physically connected via NVLink, which directly affects all-reduce bandwidth. Kubernetes treats GPUs as generic countable resources by default. NVIDIA GPU Feature Discovery labels nodes with GPU properties, and the kubelet's Topology Manager can align CPU and device allocations to the same NUMA node, but the default scheduler does not reason about GPU-to-GPU communication paths the way SLURM does. For large multi-GPU training jobs, this matters.

Is it operationally complex to run SLURM and Kubernetes on the same cluster?

It adds operational surface area, but it is a well-established pattern. The standard approach uses SLURM for the training partition and Kubernetes for serving and pipeline workloads, with shared storage accessible from both. The complexity lives in the storage layer (ensuring volumes are accessible from both schedulers), the network configuration (fabric segmentation where needed), and access control. On managed infrastructure where the cluster is operated at the orchestrator layer, most of that complexity is handled by the infrastructure provider.

Does SLURM support containerised workloads?

Yes. SLURM supports containers via Singularity/Apptainer and, more recently, through built-in OCI container support and plugins such as NVIDIA's Pyxis (with Enroot). Many HPC environments run containerised training jobs under SLURM. The difference from Kubernetes is the execution model: SLURM jobs still run as batch processes with explicit resource allocation, not as pods managed by a control plane. Teams already invested in container workflows do not need to abandon them to use SLURM.

How does checkpoint recovery compare between SLURM and Kubernetes?

In SLURM, a requeued job reads from the last checkpoint written to shared storage and restarts from that state. The process is explicit and script-driven. In Kubernetes, checkpoint recovery involves pod restart policies, persistent volume claims, and init container logic to load state. For a job that runs for 72 hours, SLURM's model is operationally simpler. For a shorter fine-tuning job that runs as part of a larger pipeline, Kubernetes handles the failure and restart within the existing workflow without manual intervention.

What storage setup does distributed training actually need?

Three tiers are typically required. The first is high-throughput NVMe scratch for staging training data locally and writing checkpoints during a run, where latency and write bandwidth matter. The second is persistent block storage (SSVs) for datasets and artefacts that need to survive node restarts or be accessed across jobs. The third is durable object storage for long-term retention of model artefacts, logs, and experiment data. For multi-node jobs accessing shared data simultaneously, a parallel filesystem that sustains throughput under concurrent read load is also necessary. The wrong storage configuration is one of the most common causes of training stalls that get misattributed to the scheduler.
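
A back-of-the-envelope calculation helps when sizing the scratch tier for checkpoint writes. The Python sketch below is only arithmetic under stated assumptions: the 7-billion-parameter model, the 14 bytes of saved state per parameter (weights plus optimizer state), the two bandwidth figures and the 30-minute interval are hypothetical inputs, not recommendations or measurements.

```python
def checkpoint_bytes(parameters, bytes_per_parameter):
    """Approximate checkpoint size: parameter count times saved bytes per parameter."""
    return parameters * bytes_per_parameter


def write_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to write a checkpoint at a sustained aggregate bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


def stall_fraction(write_s, interval_s):
    """Share of wall-clock time spent blocked, if writes are synchronous."""
    return write_s / (interval_s + write_s)


# Hypothetical inputs, chosen only to illustrate the arithmetic.
size = checkpoint_bytes(parameters=7_000_000_000, bytes_per_parameter=14)
fast = write_seconds(size, bandwidth_bytes_per_s=5_000_000_000)  # 5 GB/s
slow = write_seconds(size, bandwidth_bytes_per_s=1_000_000_000)  # 1 GB/s

print(f"checkpoint size: {size / 1e9:.0f} GB")
print(f"write time at 5 GB/s: {fast:.1f} s, stall {stall_fraction(fast, 1800):.2%}")
print(f"write time at 1 GB/s: {slow:.1f} s, stall {stall_fraction(slow, 1800):.2%}")
```

The same functions can be rerun with a team's own model size, measured write bandwidth and checkpoint interval to see whether the storage tier or the scheduler is the more likely source of a stall.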

When should a team choose Kubernetes over SLURM for training?

When the training workloads are shorter, job sizes vary significantly, the team runs training and serving from the same cluster, or the MLOps tooling investment is already built around the Kubernetes ecosystem. Kubernetes is also the better fit when multi-tenancy, namespace isolation, and dynamic resource allocation across heterogeneous workload types matter more than raw gang scheduling performance. For teams running fine-tuning at scale rather than full pre-training, the Kubernetes ecosystem frequently covers requirements adequately.
