Updated on 3 Dec 2025

ECC on NVIDIA H100 PCIe VMs: How to Enable or Disable It and Why It Matters

Summary

In this tutorial, we explain why ECC (Error-Correcting Code) matters on NVIDIA H100 PCIe VMs. ECC detects and corrects memory errors, protecting data integrity for AI, HPC, and scientific workloads. We walk you through checking, enabling, and disabling ECC safely on Hyperstack so you can balance performance against reliability.

Error-Correcting Codes and Why They're More Important Than You Think

You've just spun up an NVIDIA H100 PCIe virtual machine on Hyperstack, ready to tackle massive AI training or complex HPC workloads. You open a terminal, run nvidia-smi to check the status, and you see this:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:00:08.0 Off |                    0 |
| N/A   30C    P0             77W /  310W |   17959MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

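In the right-most column, under "Volatile Uncorr. ECC," it says "0." A number here means ECC is enabled (it is the count of uncorrectable errors); if ECC were disabled, the column would read "Off" instead.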

Believe it or not, this setting is quite important, sometimes even critical, depending on what you're doing. Should you disable it or keep it enabled, and why? What are the pros and cons of each? What is ECC anyway?

This guide will walk you through all of that and more, and provide you with a method to configure it on your Hyperstack H100 VMs.

What is ECC in Simple Terms?

ECC stands for Error-Correcting Code. It's a feature of memory that can automatically detect and correct most errors in data in real-time.

Think of it as a vigilant proofreading system for your GPU's memory. As data moves in and out of memory, ECC checks for a specific type of error, called a bit flip, and fixes it on the fly.
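
To make "detect and correct" concrete, here is a toy sketch of a classic error-correcting code, Hamming(7,4), written in plain shell arithmetic. Treat it purely as an illustration: the function names are made up for this example, and this is not the scheme the H100's memory uses in hardware. It protects 4 data bits with 3 parity bits, so a single flipped bit can be located and flipped back.

```shell
# Toy Hamming(7,4) code: an illustration only, NOT the H100's hardware ECC.
# Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4 (position i lives in bit i).

# Encode a 4-bit value (0..15) into a 7-bit codeword.
encode() {
  local d=$1
  local d1=$(( (d >> 3) & 1 ))
  local d2=$(( (d >> 2) & 1 ))
  local d3=$(( (d >> 1) & 1 ))
  local d4=$(( d & 1 ))
  local p1=$(( d1 ^ d2 ^ d4 ))   # covers positions 3, 5, 7
  local p2=$(( d1 ^ d3 ^ d4 ))   # covers positions 3, 6, 7
  local p3=$(( d2 ^ d3 ^ d4 ))   # covers positions 5, 6, 7
  echo $(( (p1 << 1) | (p2 << 2) | (d1 << 3) | (p3 << 4) | (d2 << 5) | (d3 << 6) | (d4 << 7) ))
}

# XOR together the positions of all set bits.
# The result is 0 for a valid codeword, else the position of the flipped bit.
syndrome() {
  local w=$1 s=0 pos
  for pos in 1 2 3 4 5 6 7; do
    if [ $(( (w >> pos) & 1 )) -eq 1 ]; then
      s=$(( s ^ pos ))
    fi
  done
  echo "$s"
}

# Flip back the bit that the syndrome points at (if any).
correct() {
  local w=$1 s
  s=$(syndrome "$w")
  if [ "$s" -ne 0 ]; then
    w=$(( w ^ (1 << s) ))
  fi
  echo "$w"
}

# Extract the 4 data bits from a codeword.
decode() {
  local w=$1
  echo $(( ( ((w >> 3) & 1) << 3 ) | ( ((w >> 5) & 1) << 2 ) | ( ((w >> 6) & 1) << 1 ) | ((w >> 7) & 1) ))
}

# Demo: store the value 11, simulate a bit flip at position 5, then repair it.
stored=$(encode 11)
damaged=$(( stored ^ (1 << 5) ))
echo "syndrome points at position: $(syndrome "$damaged")"
echo "recovered value: $(decode "$(correct "$damaged")")"
```

Flip two bits instead of one and this toy code will "repair" the wrong position. Detecting double-bit errors, which the comparison later in this article mentions, needs extra check bits beyond this sketch.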

Why Do Memory Errors Happen?

A "bit flip" is a rare event, but with the speed and size of an H100 and the conditions that it's deployed in, "rare" can become "inevitable." These errors are not typically a sign of a "bad" GPU but are a physical reality of high-performance silicon.

  • Cosmic Rays: High-energy particles from space, and the secondary particles they create in the atmosphere (at ground level, mainly neutrons, along with muons), can physically strike a memory cell and flip its value. This is a primary cause of random, single-bit errors.

  • Background Radiation: Trace radioactive elements in the chip packaging or the surrounding environment can emit particles that do the same.

  • Voltage Fluctuations: Tiny, imperceptible changes in the power delivery to the memory chip can sometimes cause a cell to lose its state.

  • Manufacturing Imperfections: At the nanometer scale of modern VRAM, there may be tiny, stable imperfections that are within manufacturing tolerance but make a specific memory cell slightly more susceptible to flipping under certain conditions.

For everyday gaming, a single bit-flip might go unnoticed. But for a scientific simulation, a financial model, or training an LLM for days, a single uncorrected error can corrupt your entire dataset, invalidate your results, or crash your job.

The Great Debate: ECC Pros vs. Cons

Choosing whether to enable or disable ECC is a trade-off between integrity and raw performance. Here’s a breakdown to help you decide what's right for your workload.

Primary Goal

  • ECC Enabled: Data Integrity & Stability.

  • ECC Disabled: Maximum Performance & VRAM.

Pros of ECC Enabled

  • Automatically detects & corrects single-bit errors.

  • Detects (but can't correct) double-bit errors.

  • Essential for scientific, financial, or medical data.

  • Prevents silent data corruption that ruins models.

  • Makes debugging easier (rules out random memory errors).

Pros of ECC Disabled

  • Recaptures a minor percentage of memory bandwidth (ECC calculations add latency).

  • Recaptures a small amount of VRAM (ECC uses some capacity for its parity bits).

  • May prevent some specific bugs from crashing the entire hypervisor (the VM crashes instead).

Cons of ECC Enabled

  • Incurs a minor performance overhead (a few percent).

  • Uses a small portion of the total VRAM.

Cons of ECC Disabled

  • High risk of silent data corruption.

  • Potential for instability, especially on long-running jobs (days/weeks).

  • Much harder to debug crashes (is it my code, my data, or a bit-flip?).

  • Unsuitable for workloads where results must be 100% correct.

A Quick Note on Our Support Policy

Before you make any changes, it's important to understand our policy.

IMPORTANT: While you, as the customer, have the flexibility to disable ECC, we do not officially support this configuration. Neither NVIDIA nor our hardware vendor supports it for this class of GPU.

Issues that you encounter as a direct result of manually disabling ECC (such as data corruption, model instability, or crashes) will not be covered by our SLA. We strongly recommend enabling ECC for most production AI and HPC workloads.

How to Check Your Current ECC Status

That nvidia-smi output is your first clue. For a more direct confirmation, you can query a specific GPU (e.g., GPU 0) using this command:

$ nvidia-smi -q -d ECC -i 0

If it's disabled, you'll see this:

==============NVSMI LOG==============
...
ECC Mode
Current : Disabled
Pending : Disabled
...

If it's enabled, you'll see:

==============NVSMI LOG==============
...
ECC Mode
Current : Enabled
Pending : Enabled
...
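
If your VM has more than one GPU, or you want to check ECC from a script, nvidia-smi's CSV query mode is easier to parse than the -q report. The sketch below assumes the ecc.mode.current and ecc.mode.pending query fields (listed by nvidia-smi --help-query-gpu) are available in your driver version. The helper name ecc_mode_summary is made up for this example.

```shell
# Reads lines such as "0, Enabled, Enabled" (index, current mode, pending mode)
# on stdin and prints one readable line per GPU.
ecc_mode_summary() {
  while IFS=, read -r idx current pending; do
    # nvidia-smi's CSV output puts a space after each comma; strip it
    current=${current# }
    pending=${pending# }
    echo "GPU ${idx}: current=${current} pending=${pending}"
  done
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
             --format=csv,noheader | ecc_mode_summary
fi
```

Parsing is kept in its own function so you can try it on a saved line of output without touching the GPU.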

Why Toggling ECC Fails (The "Stuck" Problem)

On some VMs, you might try the standard command: sudo nvidia-smi -e 1 (to enable) or sudo nvidia-smi -e 0 (to disable). The command reports success and asks for a reboot. However, after rebooting, you find the setting hasn't changed.

This is the real issue: your VM is likely running the NVIDIA open kernel driver (you can check with dpkg -l | grep nvidia and look for nvidia-driver-XXX-open). This driver package includes a module called nvidia_drm which, while useful for desktop graphics, is not required for Tesla-class cards like the H100. When it is loaded, nvidia_drm effectively blocks the pending ECC setting from being applied on reboot.
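
Before changing anything, you can confirm both conditions in one go. This is a minimal sketch with made-up helper names: it assumes a Debian/Ubuntu image (dpkg), the nvidia-driver-XXX-open package naming mentioned above, and that lsmod is available.

```shell
# Reads `lsmod` output on stdin; succeeds if the nvidia_drm module is listed.
nvidia_drm_loaded() {
  awk '$1 == "nvidia_drm" { found = 1 } END { exit !found }'
}

# Reads `dpkg -l` output on stdin; prints installed packages whose name
# looks like nvidia-driver-XXX-open.
open_driver_packages() {
  awk '$1 == "ii" && $2 ~ /^nvidia-driver-.*-open$/ { print $2 }'
}

# Live checks, skipped when the tools are not present.
if command -v dpkg >/dev/null 2>&1; then
  dpkg -l | open_driver_packages
fi
if command -v lsmod >/dev/null 2>&1 && lsmod | nvidia_drm_loaded; then
  echo "nvidia_drm is loaded"
fi
```

If a package name is printed and nvidia_drm is loaded, you are likely in the "stuck" situation described here.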

How to Change Your ECC Status

Here are the correct procedures, accounting for the nvidia_drm issue.

Before you start: Remove your VM from any production workloads or schedulers. This process requires at least one reboot, and the "stuck" procedure requires two.

Procedure A: How to Properly Enable ECC (If It's "Stuck" Off)

If your ECC is "Stuck" off, follow these steps. This involves blacklisting the problematic module, rebooting, setting the ECC flag, and rebooting again.

Step 1. Blacklist the nvidia_drm Module

Tell the system not to load this module.

# Remove the old config file (-f: no error if it is already gone)
sudo rm -f /etc/modprobe.d/nvidia-graphics-drivers-kms.conf

# Create a new blacklist file for the nvidia_drm module
echo "blacklist nvidia_drm" | sudo tee /etc/modprobe.d/blacklist-nvidia_drm.conf

Step 2. Update initramfs

Apply this change so the kernel knows about it on the next boot.

sudo update-initramfs -u

Step 3. First Reboot

Reboot your VM. This will load the OS without the nvidia_drm module.

sudo reboot

Step 4. Set the ECC Enabled Flag

Once your VM is back online, log in. Now, run the command to enable ECC.

sudo nvidia-smi -e 1

This command will set a pending change, which will be applied on the next reboot.

Step 5. Second Reboot

Reboot the VM one final time to apply the pending ECC change.

sudo reboot
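
A useful checkpoint sits between the first reboot (Step 3) and setting the flag (Step 4): confirm the blacklist entry exists and that nvidia_drm really stayed unloaded. The sketch below is one way to do that. The helper name blacklist_has_module is hypothetical, and the file path is the one created in Step 1.

```shell
# Succeeds if FILE contains an active (uncommented) "blacklist MODULE" line.
blacklist_has_module() {
  local file=$1 module=$2
  [ -r "$file" ] &&
    grep -Eq "^[[:space:]]*blacklist[[:space:]]+${module}([[:space:]]|\$)" "$file"
}

conf=/etc/modprobe.d/blacklist-nvidia_drm.conf
if blacklist_has_module "$conf" nvidia_drm; then
  echo "blacklist entry present in $conf"
else
  echo "no blacklist entry found in $conf"
fi

# /proc/modules lists the currently loaded kernel modules, one per line.
if [ -r /proc/modules ] && grep -q '^nvidia_drm ' /proc/modules; then
  echo "nvidia_drm is still loaded; the ECC change may not stick yet"
fi
```

If the second message appears after the first reboot, revisit Steps 1 and 2 before running nvidia-smi -e 1.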

Procedure B: How to Disable ECC

If you have weighed the risks and your workload benefits from disabling ECC, the process is simpler (and generally doesn't require the nvidia_drm fix).

# Set ECC to disabled (0)
sudo nvidia-smi -e 0

# Reboot to apply the change
sudo reboot

(Note: If you used Procedure A to enable ECC, you can use this simple command to disable it again. The blacklist file does not need to be removed.)
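
Because nvidia-smi -e only sets a pending mode, a quick way to tell whether a reboot is still outstanding is to compare the current and pending values. The following sketch assumes the ecc.mode.current and ecc.mode.pending query fields are supported by your driver. ecc_reboot_needed is a name invented for this example.

```shell
# Reads lines such as "0, Enabled, Disabled" (index, current, pending) on stdin.
# Prints every GPU whose pending ECC mode differs from its current mode and
# exits with status 0 if at least one such GPU was found, 1 otherwise.
ecc_reboot_needed() {
  awk -F', *' '
    $2 != $3 { printf "GPU %s: %s -> %s after reboot\n", $1, $2, $3; n++ }
    END      { exit (n == 0) }
  '
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
                --format=csv,noheader | ecc_reboot_needed; then
    echo "Reboot required to apply the pending ECC mode."
  else
    echo "No ECC change is pending."
  fi
fi
```

Run it right after nvidia-smi -e 0 (or -e 1) and again after the reboot. The second run should report that nothing is pending.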

How to Verify the Change

After your VM comes back online, it's time to confirm.

If you ENABLED ECC: Run nvidia-smi. You should now see a "0" in the ECC column. This "0" has two meanings:

  1. ECC is Enabled.

  2. The "Volatile Uncorrectable" error count is 0 (a good thing!).

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:00:08.0 Off |                    0 |
| N/A   30C    P0             77W /  310W |   17959MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

You can also confirm by running nvidia-smi -q -d ECC and looking for Current : Enabled.

If you DISABLED ECC: Run nvidia-smi. You will see "Off" in the ECC column instead of an error count.
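
If you prefer a scripted check over reading the table by eye, the sketch below turns the same information into one line per GPU. It assumes the ecc.mode.current and ecc.errors.uncorrected.volatile.total query fields exist in your driver version. The exact placeholder printed when ECC is off (shown here as [N/A]) is also an assumption, so any non-numeric count is simply reported as-is. ecc_verify is a made-up helper name.

```shell
# Reads lines such as "0, Enabled, 0" (index, ECC mode, uncorrectable count)
# on stdin and prints a one-line verdict per GPU.
ecc_verify() {
  while IFS=, read -r idx mode errors; do
    # strip the space nvidia-smi puts after each comma
    mode=${mode# }
    errors=${errors# }
    case "$mode:$errors" in
      Enabled:0)
        echo "GPU $idx: ECC on, no uncorrectable errors" ;;
      Enabled:|Enabled:*[!0-9]*)
        echo "GPU $idx: ECC on, error count not reported ($errors)" ;;
      Enabled:*)
        echo "GPU $idx: ECC on, $errors uncorrectable error(s)" ;;
      *)
        echo "GPU $idx: ECC mode is $mode" ;;
    esac
  done
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,ecc.mode.current,ecc.errors.uncorrected.volatile.total \
             --format=csv,noheader | ecc_verify
fi
```

A non-zero uncorrectable count is worth raising with support rather than ignoring.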

What's Next?

You now have the context and the technical steps to make an informed decision about ECC. You can configure your H100 VM to match your specific workload's requirements, whether that's the raw performance of ECC-off or the data integrity of ECC-on.

Take Control of Your GPU Performance

Configure ECC on your Hyperstack NVIDIA H100 PCIe VM today and strike the perfect balance between speed and data integrity.

FAQs

What does ECC stand for?

ECC stands for Error-Correcting Code, a GPU memory feature that detects and corrects bit-flip errors automatically during computation.

Why is ECC important?

ECC prevents silent memory errors that could corrupt AI models, scientific simulations, or financial computations, ensuring reliable results and stability.

Does ECC affect performance?

Enabling ECC slightly reduces available VRAM and adds minor latency, trading maximum performance for improved data integrity and reliability.

Which GPU supports ECC in this guide?

The NVIDIA H100 PCIe supports ECC, providing error detection and correction, ideal for high-performance AI or HPC workloads.
