The team picked Kubernetes. They had containers, they had pipelines, it felt like the right call. Six weeks into a large distributed training run, jobs were getting preempted mid-epoch, checkpoint recovery was unreliable, and the GPU utilisation numbers looked nothing like what the benchmarks had promised. Switching to SLURM mid-project was not an option. Rebuilding the workflow was.
This is the conversation that plays out repeatedly in AI infrastructure teams. Not because one scheduler is wrong. Because each was built for a different set of problems and the wrong choice at the architecture stage costs you weeks later.
The decision matters more than it did when training runs were measured in hours, not days. Training runs are longer, clusters are larger and the cost of instability compounds across both dimensions. Our latest blog covers what each scheduler was actually built to do, where each wins and what the infrastructure needs to support either.
SLURM (Simple Linux Utility for Resource Management) originated in HPC. It was designed to schedule batch jobs across homogeneous compute clusters, allocate resources at the node and GPU levels with precision and handle the long-running, tightly coupled workloads that HPC teams run. Job definitions are bash scripts. Resource allocation is explicit. The scheduler's mental model is a cluster of mostly identical machines doing mostly predictable work.
Kubernetes was built to orchestrate containerised microservices. Its scheduler thinks in pods, not nodes. It handles dynamic workloads, variable resource demand, and service availability. The cloud-native ecosystem grew around it: Helm, Argo, Kubeflow, Ray and horizontal pod autoscaling. The mental model is a pool of resources being allocated dynamically to heterogeneous services with varying lifespans.
Neither was built for large-scale distributed model training. Both have adapted. The question is how well each adaptation holds up under the actual conditions of a serious training run.
For large-scale distributed training, SLURM's design assumptions match the workload requirements more closely.
Gang scheduling is the clearest example. A distributed training job needs all participating nodes to start simultaneously. SLURM allocates the full resource block before any part of the job begins. Kubernetes, by default, schedules pods individually. Under resource pressure, pods start at different times and the training job either waits or proceeds with partial allocation. Neither outcome is acceptable for a multi-node all-reduce operation.
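The difference between the two placement behaviours can be sketched in a few lines. This is a toy model, not either scheduler's real algorithm: the node names, the `free_gpus` map and both function names are hypothetical, and real schedulers also weigh priorities, partitions and queues.

```python
from typing import Dict, List, Optional


def gang_allocate(free_gpus: Dict[str, int], nodes_needed: int,
                  gpus_per_node: int) -> Optional[List[str]]:
    """All-or-nothing: return the full block of nodes, or None and start nothing."""
    eligible = sorted(n for n, free in free_gpus.items() if free >= gpus_per_node)
    if len(eligible) < nodes_needed:
        return None  # the job stays queued; no GPUs are held
    return eligible[:nodes_needed]


def per_worker_allocate(free_gpus: Dict[str, int], nodes_needed: int,
                        gpus_per_node: int) -> List[str]:
    """Place workers one by one; may return fewer workers than requested."""
    eligible = sorted(n for n, free in free_gpus.items() if free >= gpus_per_node)
    return eligible[:nodes_needed]


# Hypothetical cluster: node-c has only 4 free GPUs.
cluster = {"node-a": 8, "node-b": 8, "node-c": 4, "node-d": 8}

print(gang_allocate(cluster, nodes_needed=4, gpus_per_node=8))
print(per_worker_allocate(cluster, nodes_needed=4, gpus_per_node=8))
```

In the per-worker case, three workers hold 24 GPUs while the fourth is still pending, which is the partial-start situation described above.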
SLURM's GPU allocation model is also closer to what large training runs need. Allocation happens at the physical level: specific GPUs on specific nodes, with awareness of NVLink topology, PCIe bandwidth, and NUMA boundaries. Kubernetes allocates GPUs as generic resources. It does not natively understand GPU topology, which means two GPUs allocated to the same pod may have meaningfully different communication latency depending on how they sit on the hardware.
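To make the topology point concrete, here is a minimal sketch of what "topology-aware" selection means: choose the group of GPUs whose slowest pairwise link is fastest. The `LINK_GBPS` table is entirely hypothetical (illustrative numbers, not measurements), and the function names are assumptions for this example rather than any scheduler's API.

```python
from itertools import combinations
from typing import Iterable, Tuple

# Hypothetical per-node link table in GB/s, keyed by (lower GPU index, higher GPU index).
LINK_GBPS = {
    (0, 1): 300, (2, 3): 300,                         # pairs with a fast direct link
    (0, 2): 32, (0, 3): 32, (1, 2): 32, (1, 3): 32,   # pairs that cross the host bus
}


def bandwidth(a: int, b: int) -> int:
    """Look up the link speed between two GPUs, regardless of argument order."""
    return LINK_GBPS[(min(a, b), max(a, b))]


def best_group(gpus: Iterable[int], k: int) -> Tuple[int, ...]:
    """Pick k GPUs (k >= 2) that maximise the slowest link inside the group."""
    def bottleneck(group: Tuple[int, ...]) -> int:
        return min(bandwidth(a, b) for a, b in combinations(group, 2))
    return max(combinations(sorted(gpus), k), key=bottleneck)


print(best_group([0, 1, 2, 3], 2))
print(bandwidth(0, 2), "GB/s if a topology-blind allocator hands out GPUs 0 and 2")
```

A scheduler that only counts GPUs could hand out either pair; the bottleneck link differs by roughly an order of magnitude in this made-up table.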
Checkpoint recovery under SLURM is more straightforward. Jobs write to shared filesystems, restart from a defined state, and the scheduler handles re-queuing without container lifecycle complexity. In Kubernetes, checkpoint logic has to account for pod state, persistent volume claims, and restart policies. That overhead is manageable for short jobs. For training runs measured in days, it adds up.
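The restart-from-a-defined-state pattern looks roughly like this in a training script. It is a sketch under stated assumptions: the file naming scheme, the JSON state and the `stop_at` parameter (used here to simulate a preemption) are hypothetical, and a real job would save model and optimiser state with its framework's own checkpoint API.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write to a temp file, then rename, so a crash never leaves a half-written checkpoint."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"step_{step:08d}.json"  # zero-padded so names sort by step
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename within the same directory
    return final


def load_latest(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest complete checkpoint, or None on a fresh start."""
    if not ckpt_dir.is_dir():
        return None
    candidates = sorted(ckpt_dir.glob("step_*.json"))
    return json.loads(candidates[-1].read_text()) if candidates else None


def train(ckpt_dir: Path, total_steps: int, save_every: int,
          stop_at: Optional[int] = None) -> int:
    """Resume from the latest checkpoint if one exists; return the last step reached."""
    ckpt = load_latest(ckpt_dir)
    step = ckpt["step"] if ckpt else 0
    loss = ckpt["state"]["loss"] if ckpt else 1.0
    while step < total_steps:
        step += 1
        loss *= 0.9  # stand-in for a real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": loss})
        if stop_at is not None and step == stop_at:
            return step  # simulated preemption
    return step


with tempfile.TemporaryDirectory() as d:
    ckpts = Path(d) / "ckpts"
    train(ckpts, total_steps=10, save_every=3, stop_at=7)   # preempted at step 7
    print("resuming from step", load_latest(ckpts)["step"])
    print("finished at step", train(ckpts, total_steps=10, save_every=3))
```

The same script works under either scheduler as long as `ckpt_dir` points at storage that outlives the process; what changes is the machinery around it that restarts the process.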
MPI integration is native in SLURM. For workloads using NCCL all-reduce, the process group model aligns with how SLURM distributes work across nodes. Kubernetes MPI operator implementations exist but they add abstraction layers between the scheduler and the communication fabric that can affect performance at scale.
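One way to see the alignment: each task launched by `srun` gets environment variables that map directly onto the rank layout a distributed training process group expects. The sketch below only reads those variables; the `rank_info` helper and the fake environment are hypothetical, and how the values are handed to a framework is left out.

```python
import os
from typing import Dict, Mapping


def rank_info(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Derive process-group coordinates from variables Slurm sets for each task."""
    return {
        "rank": int(env["SLURM_PROCID"]),         # global task index
        "world_size": int(env["SLURM_NTASKS"]),   # total tasks in the job step
        "local_rank": int(env["SLURM_LOCALID"]),  # task index within its node
        "node_rank": int(env["SLURM_NODEID"]),    # index of the node in the allocation
    }


# Fake environment for illustration: task 9 of 16, second task on the second node.
fake_env = {
    "SLURM_PROCID": "9",
    "SLURM_NTASKS": "16",
    "SLURM_LOCALID": "1",
    "SLURM_NODEID": "1",
}
print(rank_info(fake_env))
```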
Kubernetes is the better choice when the workload is heterogeneous, the lifecycle is short, or the team needs to run training and serving infrastructure from the same cluster.
MLOps pipelines are the clearest case. A pipeline that preprocesses data, runs a fine-tuning job, evaluates the output, and deploys a model endpoint involves multiple workload types with different resource profiles. Kubernetes handles this natively. SLURM was not designed for long-lived services or event-driven workflows, and retrofitting it for that purpose creates complexity rather than removing it.
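The pipeline in that example is a small dependency graph whose stages ask for very different resources. A minimal sketch, with hypothetical stage names and GPU counts, and with Python's standard-library `graphlib` standing in for a workflow engine:

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Hypothetical stages: name -> (upstream stages, GPUs requested).
STAGES: Dict[str, Tuple[Set[str], int]] = {
    "preprocess": (set(), 0),
    "fine_tune": ({"preprocess"}, 8),
    "evaluate": ({"fine_tune"}, 1),
    "deploy": ({"evaluate"}, 1),
}


def run_order(stages: Dict[str, Tuple[Set[str], int]]) -> List[str]:
    """Return the stages in an order that respects their dependencies."""
    graph = {name: deps for name, (deps, _gpus) in stages.items()}
    return list(TopologicalSorter(graph).static_order())


for name in run_order(STAGES):
    print(f"{name}: {STAGES[name][1]} GPU(s)")
```

A workflow engine on Kubernetes would launch each stage as its own pod with its own resource request; a batch scheduler would need the same chain expressed as dependent jobs, with the long-lived endpoint handled somewhere else.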
GPU fragmentation is a real problem at scale, and Kubernetes handles it better for mixed workloads. When the Kubernetes scheduler is configured for bin-packing (its default scoring favours spreading pods across nodes), it can place smaller jobs across available GPU capacity more efficiently when job sizes vary. SLURM's gang scheduling model assumes homogeneous resource blocks; under mixed workload conditions, that assumption can leave capacity stranded while larger allocations wait for resources.
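The packing idea can be illustrated with first-fit decreasing, a textbook bin-packing heuristic. This is not the Kubernetes scheduler's actual scoring logic; the job names, sizes and node capacity are hypothetical.

```python
from typing import Dict, List, Tuple


def first_fit_decreasing(job_gpus: Dict[str, int], node_capacity: int,
                         node_count: int) -> Tuple[Dict[str, int], List[int]]:
    """Place the largest jobs first, each on the first node with enough free GPUs.

    Returns (job -> node index, free GPUs left per node). Jobs that fit
    nowhere are simply absent from the placement map.
    """
    free = [node_capacity] * node_count
    placement: Dict[str, int] = {}
    for job, need in sorted(job_gpus.items(), key=lambda kv: -kv[1]):
        for node, avail in enumerate(free):
            if avail >= need:
                free[node] -= need
                placement[job] = node
                break
    return placement, free


# Hypothetical mixed queue on two 8-GPU nodes.
jobs = {"a": 4, "b": 2, "c": 2, "d": 1, "e": 6, "f": 1}
placement, free = first_fit_decreasing(jobs, node_capacity=8, node_count=2)
print(placement)
print("free GPUs per node:", free)
```

With these sizes every GPU ends up in use. A policy that only hands out whole nodes would run two of these jobs at a time and leave the rest of each node idle.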
The ecosystem is one of Kubernetes' strongest advantages. Tools such as Kubeflow, Argo Workflows, Ray on Kubernetes, KServe, MLflow integrations, and service meshes provide a mature foundation for building end-to-end AI platforms. Teams can manage experiment pipelines, model registries, feature services, inference endpoints, and operational workflows within the same environment. For organisations standardising on cloud-native infrastructure, this reduces the number of separate systems that need to be operated and maintained.
Kubernetes also offers significant advantages for platform teams managing shared AI infrastructure. Namespace isolation, RBAC, resource quotas, admission controls, and policy frameworks allow multiple teams to safely share the same cluster while maintaining governance and operational boundaries. This becomes increasingly valuable as AI initiatives expand beyond a single research group and require support for data science, engineering, MLOps, and inference workloads on the same platform.
Another advantage is consistency across the AI lifecycle. The same orchestration platform can run data processing jobs, training workloads, batch inference, online serving, monitoring, and supporting services. While SLURM excels at scheduling large-scale training jobs, Kubernetes enables organisations to consolidate a broader range of workloads onto a common operational model, reducing platform sprawl and simplifying automation.
A hybrid architecture, with SLURM scheduling training and Kubernetes running pipelines and serving, is not a compromise. It reflects a genuine mismatch in what each scheduler is optimised for, applied to workloads that have different requirements by design.
The scheduler choice is the visible decision. The infrastructure underneath is where the constraints live.
Distributed training at scale needs predictable GPU-to-GPU communication latency. A single congestion event on the fabric during an all-reduce operation stalls every participating node. On shared infrastructure, that congestion can come from another tenant's workload. On single-tenant infrastructure, you control the fabric. The difference in training stability is significant, and it compounds across long runs.
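The reason one congestion event stalls everyone is that a synchronous step ends only when the slowest rank finishes its all-reduce. A back-of-the-envelope sketch, with hypothetical timings and a deliberately simple model (no compute and communication overlap):

```python
from typing import Sequence


def sync_step_seconds(compute_s: float, allreduce_s: Sequence[float]) -> float:
    """Step time under synchronous training: compute plus the slowest all-reduce."""
    return compute_s + max(allreduce_s)


ranks = 64
healthy = [0.10] * ranks   # hypothetical per-rank all-reduce time in seconds
congested = healthy.copy()
congested[17] = 0.50       # a single rank behind a congested link

base = sync_step_seconds(0.40, healthy)
slow = sync_step_seconds(0.40, congested)
print(f"healthy {base:.2f}s, congested {slow:.2f}s, slowdown {slow / base:.2f}x")
```

One slow rank out of 64 sets the pace for all of them, so in this model a 1.8x slowdown on a single link is a 1.8x slowdown for the whole job while the congestion lasts.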
Storage requirements are split across workload types. Training jobs need high-throughput NVMe scratch for data staging and checkpoint writes mid-run. Persistent datasets and model artefacts need durable storage that survives node restarts. Parallel file access across multiple nodes, common in both HPC-style SLURM workloads and large distributed training jobs, requires a shared filesystem that can sustain throughput under concurrency without degrading.
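A rough way to size the scratch tier is to ask what share of wall-clock time goes to checkpoint writes. The sketch below assumes checkpoints are written synchronously, so training pauses during each write; all sizes and bandwidths are hypothetical.

```python
def checkpoint_stall_fraction(ckpt_gb: float, write_gb_per_s: float,
                              interval_s: float) -> float:
    """Share of wall-clock time spent blocked on checkpoint writes.

    Assumes each checkpoint pauses training for ckpt_gb / write_gb_per_s
    seconds after every interval_s seconds of training.
    """
    write_s = ckpt_gb / write_gb_per_s
    return write_s / (interval_s + write_s)


# Hypothetical 500 GB checkpoint every 30 minutes, at two write speeds.
for gbps in (5.0, 50.0):
    frac = checkpoint_stall_fraction(500.0, gbps, 30 * 60)
    print(f"{gbps:>4.0f} GB/s -> {frac:.2%} of the run spent writing checkpoints")
```

Asynchronous or sharded checkpointing changes the arithmetic, but the same calculation shows quickly whether storage, rather than the scheduler, is the likelier source of a stall.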
Hyperstack Secure Private Cloud is built to run both schedulers on single-tenant, dedicated infrastructure. Under the Managed Platform deployment model, Hyperstack operates the cluster at the orchestrator layer with SLURM and Kubernetes available as scheduler options. The storage stack covers NVMe scratch, Shared Storage Volumes (SSVs), Secure Object Storage and a parallel filesystem. Networking uses RoCE (Ethernet) or InfiniBand fabrics, with NVIDIA ConnectX-8 SuperNICs for deployments where GPU-to-GPU communication bandwidth is the limiting factor.
Hyperstack Secure Private Cloud supports both SLURM and Kubernetes under the Managed Platform deployment model on single-tenant, dedicated infrastructure with RoCE or InfiniBand networking, NVMe scratch, SSVs, Secure Object Storage and a parallel filesystem. No shared-tenancy noise. No GPU allocation compromises.
Talk to the team about how Secure Private Cloud fits your training architecture.
Kubernetes can run distributed training, with the right operators and configuration. Kubeflow's PyTorchJob and MPI Operator handle distributed training on Kubernetes. The limitations show at scale: gang scheduling guarantees are weaker than SLURM's, GPU topology awareness requires explicit plugin configuration, and checkpoint recovery adds container lifecycle overhead. For smaller distributed jobs or fine-tuning workloads, Kubernetes is workable. For multi-node pre-training runs measured in days, SLURM's design assumptions are a closer match.
Gang scheduling starts all tasks in a distributed job simultaneously, across all allocated nodes. For all-reduce operations in distributed training, all participating processes need to be running at the same time. If nodes start at different times, the job either waits for the full allocation (wasting GPU time on the ranks already running) or, if it is allowed to proceed with fewer workers, trains with a different effective batch size than the run was tuned for. SLURM implements gang scheduling natively. Kubernetes requires additional components to approximate it, such as the coscheduling scheduler plugin, Volcano, or Kueue.
SLURM allocates GPUs with awareness of NVLink connectivity, PCIe topology, and NUMA boundaries. A job can request GPUs that are physically connected via NVLink, which directly affects all-reduce bandwidth. Kubernetes treats GPUs as generic countable resources by default. NVIDIA GPU Feature Discovery adds node labels describing the GPU hardware, and the kubelet's Topology Manager can align CPU and device allocations to the same NUMA node, but the native scheduler does not reason about GPU-to-GPU communication paths the way SLURM does. For large multi-GPU training jobs, this matters.
Running both schedulers adds operational surface area, but it is a well-established pattern. The standard approach uses SLURM for the training partition and Kubernetes for serving and pipeline workloads, with shared storage accessible from both. The complexity lives in the storage layer (ensuring volumes are accessible from both schedulers), the network configuration (fabric segmentation where needed), and access control. On managed infrastructure where the cluster is operated at the orchestrator layer, most of that complexity is handled by the infrastructure provider.
SLURM supports containers via Singularity/Apptainer and, more recently, OCI container images, either through built-in OCI runtime support or through plugins. Many HPC environments run containerised training jobs under SLURM. The difference from Kubernetes is the execution model: SLURM jobs still run as batch processes with explicit resource allocation, not as pods managed by a control plane. Teams already invested in container workflows do not need to abandon them to use SLURM.
In SLURM, a requeued job reads from the last checkpoint written to shared storage and restarts from that state. The process is explicit and script-driven. In Kubernetes, checkpoint recovery involves pod restart policies, persistent volume claims, and init container logic to load state. For a job that runs for 72 hours, SLURM's model is operationally simpler. For a shorter fine-tuning job that runs as part of a larger pipeline, Kubernetes handles the failure and restart within the existing workflow without manual intervention.
Three tiers are typically required. High-throughput NVMe scratch for staging training data locally and writing checkpoints during a run, where latency and write bandwidth matter. Persistent block storage (SSVs) for datasets and artefacts that need to survive node restarts or be accessed across jobs. Durable object storage for long-term retention of model artefacts, logs, and experiment data. For multi-node jobs accessing shared data simultaneously, a parallel filesystem that sustains throughput under concurrent read load is also necessary. The wrong storage configuration is one of the most common causes of training stalls that get misattributed to the scheduler.
Kubernetes is the better choice when training workloads are shorter, job sizes vary significantly, the team runs training and serving from the same cluster, or the MLOps tooling investment is already built around the Kubernetes ecosystem. It is also the better fit when multi-tenancy, namespace isolation, and dynamic resource allocation across heterogeneous workload types matter more than raw gang scheduling performance. For teams running fine-tuning at scale rather than full pre-training, the Kubernetes ecosystem frequently covers requirements adequately.