Updated on 20 May 2026

The Inference SLA Kept Breaking. Nothing in the Model Explained It: Here's Why

Key Takeaways

Shared-tenancy GPU infrastructure introduces unpredictable network contention that can break real-time inference SLAs even when models, pipelines and application code are correctly optimised and production-ready.
Latency spikes in regulated inference workloads are often infrastructure-layer problems hidden beneath tenant visibility, making root-cause analysis slow, expensive and operationally disruptive for engineering teams.
Under DORA, regulated organisations must manage ICT risk across third-party infrastructure, including concentration risk, processing locations, and the controls relied on to evidence operational resilience.
Single-tenant infrastructure materially reduces contention and compliance ambiguity by providing dedicated compute, networking and operational visibility that shared public cloud environments cannot consistently guarantee under load.
Production AI systems handling regulated financial workloads require infrastructure designed around deterministic performance, sovereignty requirements and operational auditability rather than generic self-serve cloud deployment flexibility.

Consider two engineering teams at European payments processors. Same model architecture. Same transaction volumes. Same contractual SLA with their card networks: sub-50 millisecond inference response time on every fraud score. Both are running on public cloud GPU infrastructure.

Team A's latency is stable. Deployments are clean. The compliance team has crisp answers for the regulator. The fraud detection platform ships on schedule.

Team B is three weeks into a latency investigation with no resolution. The model has been re-benchmarked four times. The feature pipeline has been refactored twice. Every fix holds for a few days, then the spikes return, and the SLA has been breached repeatedly. The compliance team has received a request tied to DORA obligations around ICT resilience and third-party infrastructure: explain where inference runs, what operational controls exist, and how third-party infrastructure risk is documented. Nobody can answer those questions cleanly.

The difference between these two teams is not the model. It is not the data. It is not the engineering quality. It is the infrastructure assumption they made before they wrote a single line of code.

The Latency Problem Nobody Expects

Public cloud GPU infrastructure is provisioned in shared tenancy. The VM your team runs on is physically co-located with resources belonging to other tenants. The network fabric is shared. The GPU memory bus is dedicated to your allocation, but the communication paths between nodes, the storage fabric and the supporting physical switching infrastructure are not.

For most workloads, this does not matter. Batch training jobs, exploratory inference and development environments tolerate the noise of shared tenancy because their latency requirements are approximate.

Real-time fraud detection at transaction scale is not that kind of workload.

At sub-50ms SLA requirements, the variance introduced by multi-tenant congestion is not a background nuisance. It is a direct SLA risk. When another tenant runs a heavy workload on the same physical fabric, the network latency on your inference request increases. Your model scores the transaction correctly. The response just arrives late. And late, in this context, means a breach.
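The effect described above can be sketched numerically. The following is a minimal simulation, not a measurement: the compute and network timings, the 3% congestion probability and the function names are all illustrative assumptions. It shows how a small share of congested requests leaves the median almost untouched while pushing the 99th percentile past a 50 ms target.

```python
import random
import statistics

SLA_MS = 50.0  # sub-50 ms target taken from the scenario in the text


def simulate_latencies(n, congestion_prob, seed=0):
    """Return n end-to-end latencies (ms) = model compute + network transit.

    All distributions and magnitudes are illustrative assumptions.
    """
    rng = random.Random(seed)
    latencies = []
    for _ in range(n):
        compute_ms = rng.gauss(20.0, 1.5)   # stable model scoring time
        network_ms = rng.gauss(8.0, 1.0)    # transit time on a quiet fabric
        if rng.random() < congestion_prob:  # another tenant loads the fabric
            network_ms += rng.uniform(20.0, 60.0)
        latencies.append(compute_ms + network_ms)
    return latencies


def summarise(latencies):
    """Median, 99th percentile and fraction of requests that miss the SLA."""
    ordered = sorted(latencies)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    breach_rate = sum(1 for x in latencies if x >= SLA_MS) / len(latencies)
    return {
        "p50": statistics.median(latencies),
        "p99": p99,
        "breach_rate": breach_rate,
    }


quiet = summarise(simulate_latencies(10_000, congestion_prob=0.0))
noisy = summarise(simulate_latencies(10_000, congestion_prob=0.03))
print(quiet)
print(noisy)
```

In this sketch the model's scoring time is identical in both runs. Only the network term changes, which is why a dashboard built on averages or medians can look healthy while the SLA is being breached in the tail.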

The reason this takes weeks to diagnose is that the monitoring tools available to tenants report on the tenant's stack. They show you your inference latency. They do not show you what is happening on the shared physical hardware underneath. The variance looks like a model problem, a pipeline problem or a data problem, because those are the layers you can see. The real cause sits one level below your visibility.

During a spike window, Team B's infrastructure engineer finds that the evidence increasingly points away from the model and toward contention somewhere below the tenant-visible layer. The fix is not a software change. It is an infrastructure change.
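One way a tenant can gather that kind of evidence is to compare the variance of the model's own compute time with the variance of everything else in the request path. The sketch below assumes both timings are logged per request. The function name, the tuple layout and the sample values are hypothetical, and the residual lumps together network, queueing and storage time because the tenant cannot separate them further.

```python
import statistics


def attribute_variance(requests):
    """Split end-to-end latency into model compute and everything else.

    `requests` is a list of (end_to_end_ms, compute_ms) pairs. Logging
    both timings per request is an assumption made for illustration.
    """
    compute = [c for _, c in requests]
    residual = [e - c for e, c in requests]  # network, queueing, storage
    return {
        "compute_stdev_ms": statistics.pstdev(compute),
        "residual_stdev_ms": statistics.pstdev(residual),
    }


# Hypothetical samples from a spike window: compute is flat, the rest is not.
spike_window = [
    (29.0, 20.1),
    (31.5, 19.8),
    (68.2, 20.3),
    (28.4, 20.0),
    (74.9, 19.9),
]
report = attribute_variance(spike_window)
suspect = (
    "below the model"
    if report["residual_stdev_ms"] > report["compute_stdev_ms"]
    else "model"
)
print(report, suspect)
```

A result like this does not identify the noisy neighbour. It only rules out the layers the tenant can see, which is as far as tenant-side telemetry can go.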

The Compliance Problem That Arrives Next

Two days after the infrastructure engineer's finding, Team B's compliance lead receives a formal request. Their regulator, exercising oversight under the Digital Operational Resilience Act, wants visibility into where the fraud detection inference runs, what third-party infrastructure controls are in place, and how cross-tenant exposure is being managed.

These are not unreasonable questions for a regulated payments processor to receive. Under DORA, ICT risk management includes third-party infrastructure, and the obligation to understand and document the operational resilience of critical systems sits firmly with the financial entity, not the cloud provider.

The problem is that Team B cannot answer the questions cleanly.

They can name the cloud provider. They can name the region. But visibility into the underlying infrastructure, tenancy model and operational dependencies is harder to document. By design, tenants on shared public cloud have no line of sight into which co-tenants share the same physical resources. They can point to the provider's general compliance certifications, but those certifications cover the provider's operations, not the specific isolation characteristics of a shared-tenancy deployment.

The compliance team escalates. Legal gets involved. The audit timeline starts moving. The engineering team is now split between fixing the latency problem and building documentation for a compliance review that cannot be resolved at the infrastructure layer they currently operate on.

This is not a security incident. No data has been exposed. But the inability to answer basic questions about where a regulated workload runs and who shares that environment is, in a DORA context, a gap that requires remediation. The remediation is not a policy document. It is an infrastructure change.

What Shared Tenancy Cannot Provide

The two problems Team B is dealing with are different in surface form but identical in root cause. Latency variance and compliance opacity are both consequences of the same architectural fact: shared tenancy means you do not control, and often cannot see, the environment your workload runs in.

For inference workloads with hard latency requirements, control over the network fabric is not a nice-to-have. It is the variable that determines whether the SLA holds under load. When the fabric is shared, that variable is outside your control. You can tune the model, optimise the pipeline and right-size the instance type. None of it addresses congestion that you cannot see and cannot prevent.

For regulated workloads subject to operational resilience requirements, this opacity is more than a compliance checkbox. Shared tenancy can also complicate concentration-risk assessments under frameworks like DORA. By design, the set of co-tenants on shared physical infrastructure is dynamic and not visible to the regulated entity, which makes a clean answer to the question of who shares the environment structurally difficult to produce.

The teams operating production AI in regulated environments that hit these walls are not making naive mistakes. They are encountering the natural ceiling of what shared infrastructure can deliver for this category of workload. Choosing the public cloud is not the mistake. The mistake is failing to re-evaluate when the workload's requirements move beyond what shared tenancy can support.

What The Right Environment Looks Like

The requirements for a fraud detection inference platform at the scale and compliance profile described above are not exotic. But they are specific.

Single-tenant infrastructure: dedicated physical hardware with no other tenants on the same network fabric or GPU allocation. Not logical isolation. Physical separation. The kind where “who else runs on your hardware” has a one-word answer.
Dedicated network fabric: RoCE Ethernet or InfiniBand, selected based on the throughput and latency profile of the workload, sized to support peak transaction volumes without contention. For a distributed inference deployment at this scale, the network is not secondary to the compute. It is the compute.
Jurisdiction-specific deployment: the ability to specify not just a cloud region but a legal jurisdiction, and to receive documentation confirming that inference runs within it. For a regulated European payments processor, this means the difference between a clean answer to a regulator and a months-long compliance engagement.
Operational auditability: access logs, operational trails, and incident records structured for regulated environments. Not generic cloud logging. Audit-ready documentation of who accessed what, under what controls, and when.
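To make "who accessed what, under what controls, and when" concrete, the sketch below models a single audit-trail entry and a check against an allowed set of jurisdictions. It is an illustration under stated assumptions, not any provider's log format: the `AccessRecord` fields, the `out_of_jurisdiction` helper, the actor and resource names and the country codes are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessRecord:
    """One audit-trail entry: who accessed what, under what control, when."""
    actor: str
    resource: str
    control: str       # the approval or policy the access relied on
    jurisdiction: str  # where the accessed system physically runs
    at: datetime


def out_of_jurisdiction(records, allowed):
    """Return the records whose jurisdiction is not in the allowed set."""
    return [r for r in records if r.jurisdiction not in allowed]


# Hypothetical entries for illustration only.
records = [
    AccessRecord(
        "ops-engineer-1", "inference-node-03", "change-ticket-approved",
        "ES", datetime(2026, 5, 1, 9, 30, tzinfo=timezone.utc),
    ),
    AccessRecord(
        "ops-engineer-2", "inference-node-07", "break-glass",
        "US", datetime(2026, 5, 2, 14, 5, tzinfo=timezone.utc),
    ),
]
violations = out_of_jurisdiction(records, allowed={"ES", "DE"})
print(violations)
```

The point of the structure is that every field is something a regulator might ask for directly. If a field cannot be populated from the provider's records, that gap surfaces before the audit rather than during it.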

None of this requires building and operating infrastructure in-house. It requires a deployment model where the infrastructure provider operates a dedicated environment on the customer's behalf, built to their compliance and performance requirements, with governance defined at the contract stage rather than discovered during an audit.

That is a different commercial and operational model from the self-serve public cloud. For the workloads described above, it is the model best positioned to address both problems together.

How Hyperstack Secure Private Cloud addresses this

Hyperstack's Secure Private Cloud is a dedicated, single-tenant private cloud for enterprises and regulated industries that need strong isolation, controlled access, and region-specific deployment.

On the performance side: each customer receives fully reserved GPUs, CPU, memory, and networking. Resources are not shared or oversubscribed with other customers. For inference workloads with hard latency requirements, this means the network fabric behaves more predictably under load because contention from neighbouring tenants is removed by design. Throughput is more predictable. Latency variance caused by shared tenancy is materially reduced at the infrastructure level.
On the compliance side: single-tenant isolation means there is a direct answer to 'who else runs on your hardware.' Region and sovereignty options allow deployments to be located in specific legal jurisdictions, with reduced exposure to cross-border data transfer obligations. Access trails and operational logs are designed for regulated environments. Hyperstack is designed to support GDPR-regulated environments through deployment controls, regional flexibility, and operational transparency. For organisations operating under DORA or similar frameworks, the deployment model produces documentation that maps to what regulators ask for, rather than approximations built from a shared-tenancy provider's general compliance materials.

Unlike our on-demand platform, Hyperstack Secure Private Cloud is not self-serve. Environments are designed, built, validated and operated according to agreed architectural and operational requirements. That is not a limitation. For a regulated payments processor running production inference at scale, it is the specification.

High-speed networking is deployment-dependent: RoCE Ethernet or InfiniBand fabrics are selected based on workload scale and performance requirements, with NVIDIA ConnectX-8 SuperNICs used where ultra-high bandwidth and low-latency GPU-to-GPU communication are required.
Storage is configured per workload: local NVMe for high-throughput staging and checkpoints, persistent block volumes for datasets and artefacts across restarts, and secure object storage for durable retention and ingress/egress.

If inference SLAs are slipping and the model is not the reason, the question worth asking is: who else is running on that hardware?

See how Hyperstack Secure Private Cloud delivers dedicated single-tenant GPU infrastructure with predictable networking, jurisdiction-specific deployment and audit-ready operational controls for regulated AI inference workloads at scale.

FAQs

Why can shared-tenancy infrastructure affect inference latency?

In shared environments, network fabrics, storage paths and switching infrastructure are used by multiple tenants simultaneously. During congestion spikes, inference responses may arrive late even though the model and application pipeline are operating correctly.

Why are latency issues difficult to diagnose in public cloud environments?

Tenant monitoring tools expose application and instance-level telemetry but rarely provide visibility into underlying physical network contention, so infrastructure congestion looks like model or pipeline instability.

What does DORA require from AI infrastructure deployments?

DORA requires regulated entities to manage ICT risk across critical functions, including third-party infrastructure arrangements, concentration risk, and contractual provisions covering subcontracting and processing locations. The controls a financial entity relies on to evidence this sit with the financial entity, not the provider.

Why is single-tenancy important for regulated AI inference?

Single-tenancy provides dedicated infrastructure without neighbouring workloads sharing physical resources, enabling predictable performance, cleaner audit responses, and a structural answer to concentration-risk and processing-location questions under DORA and similar frameworks.

How does dedicated networking improve inference stability?

Dedicated RoCE Ethernet or InfiniBand fabrics remove contention from other tenants' workloads, reducing latency variance and helping distributed inference systems maintain predictable throughput under sustained transaction load.

Why is jurisdiction-specific deployment important for compliance?

Regulated organisations often need evidence that workloads, datasets and operational access remain inside specific legal jurisdictions to satisfy sovereignty obligations, audit requirements and cross-border data transfer restrictions.
