In this tutorial, we explain why ECC (Error-Correcting Code) matters on NVIDIA H100 PCIe VMs. ECC detects and corrects memory errors, protecting data integrity for AI, HPC, and scientific workloads. We walk through checking, enabling, and disabling ECC safely on Hyperstack so you can balance performance and reliability.
Error-Correcting Codes and Why They're More Important Than You Think
You've just spun up an NVIDIA H100 PCIe virtual machine on Hyperstack, ready to tackle massive AI training or complex HPC workloads. You open a terminal, run nvidia-smi to check the status, and you see this:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:00:08.0 Off |                    0 |
| N/A   30C    P0             77W /  310W |   17959MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
In the right-most column, under "Volatile Uncorr. ECC," it says "0."
Believe it or not, this setting is quite important, sometimes even critical, depending on what you're doing. Should you disable it or keep it enabled, and why? What are the pros and cons of each? What is ECC anyway?
This guide will walk you through all of that and more, and provide you with a method to configure it on your Hyperstack H100 VMs.
What is ECC in Simple Terms?
ECC stands for Error-Correcting Code. It's a memory feature that automatically detects and corrects most data errors in real time.
Think of it as a vigilant proofreading system for your GPU's memory. As data moves in and out of memory, ECC checks for one specific type of error, the bit flip, and fixes it on the fly.
Why Do Memory Errors Happen?
A "bit flip" is a rare event, but with the speed and size of an H100 and the conditions that it's deployed in, "rare" can become "inevitable." These errors are not typically a sign of a "bad" GPU but are a physical reality of high-performance silicon.
- Cosmic Rays: High-energy particles produced by cosmic rays (such as neutrons and muons) can physically strike a memory cell and flip its value. This is a primary cause of random, single-bit errors.
- Background Radiation: Trace radioactive elements in the chip packaging or the surrounding environment can emit particles that do the same.
- Voltage Fluctuations: Tiny, imperceptible changes in the power delivery to the memory chip can sometimes cause a cell to lose its state.
- Manufacturing Imperfections: At the nanometer scale of modern VRAM, there may be tiny, stable imperfections that are within manufacturing tolerance but make a specific memory cell slightly more susceptible to flipping under certain conditions.
For everyday gaming, a single bit flip might go unnoticed. But for a scientific simulation, a financial model, or an LLM training run that lasts for days, a single uncorrected error can corrupt your entire dataset, invalidate your results, or crash your job.
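To get a feel for how much damage one flipped bit can do, here is a minimal sketch using shell integer arithmetic. It only illustrates the arithmetic of a bit flip; `flip_bit` is a hypothetical helper name, and nothing here touches GPU memory.

```shell
# flip_bit: hypothetical helper that inverts one bit of an integer,
# the way a single-bit memory error would.
flip_bit() {
  # $1 = original value, $2 = bit position (0 = least significant)
  echo $(( $1 ^ (1 << $2) ))
}

original=1000
echo "original value:  $original"
echo "bit 0 flipped:   $(flip_bit "$original" 0)"    # 1001
echo "bit 30 flipped:  $(flip_bit "$original" 30)"   # 1073742824
```

A flip in a low-order bit nudges the value slightly, while a flip in a high-order bit changes it by orders of magnitude, which is why a single uncorrected error can quietly ruin a long computation.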
The Great Debate: ECC Pros vs. Cons
Choosing whether to enable or disable ECC is a trade-off between integrity and raw performance. Here’s a breakdown to help you decide what's right for your workload.
| Feature | ECC Enabled | ECC Disabled |
|---|---|---|
| Primary Goal | Data Integrity & Stability. | Maximum Performance & VRAM. |
| Pros | Automatically detects & corrects single-bit errors. Detects (but can't correct) double-bit errors. Essential for scientific, financial, or medical data. Prevents silent data corruption that ruins models. Makes debugging easier (rules out random memory errors). | Recaptures a minor percentage of memory bandwidth (ECC calculations add latency). Recaptures a small amount of VRAM (ECC uses some capacity for its parity bits). May prevent some specific bugs from crashing the entire hypervisor (the VM crashes instead). |
| Cons | Incurs a minor performance overhead (a few per cent). Uses a small portion of the total VRAM. | High risk of silent data corruption. Potential for instability, especially on long-running jobs (days/weeks). Much harder to debug crashes (is it my code, my data, or a bit flip?). Unsuitable for workloads where results must be 100% correct. |
A Quick Note on Our Support Policy
Before you make any changes, it's important to understand our policy.
IMPORTANT: While you, as the customer, have the flexibility to disable ECC, we do not officially support this configuration, and neither NVIDIA nor our hardware vendor supports it for this class of GPU.
Issues that you encounter as a direct result of manually disabling ECC (such as data corruption, model instability, or crashes) will not be covered by our SLA. We strongly recommend enabling ECC for most production AI and HPC workloads.
How to Check Your Current ECC Status
That nvidia-smi output is your first clue. For a more direct confirmation, you can query a specific GPU (e.g., GPU 0) using this command:
$ nvidia-smi -q -d ECC -i 0
If it's disabled, you'll see this:
==============NVSMI LOG==============
...
ECC Mode
Current : Disabled
Pending : Disabled
...
If it's enabled, you'll see:
==============NVSMI LOG==============
...
ECC Mode
Current : Enabled
Pending : Enabled
...
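For scripting, the same check can be reduced to a one-line summary. The following is a minimal sketch that assumes the log layout shown above; `parse_ecc_mode` is a hypothetical helper name, not part of nvidia-smi.

```shell
# parse_ecc_mode: hypothetical helper. Reads `nvidia-smi -q -d ECC` text on stdin
# and prints the first Current/Pending pair it finds (the "ECC Mode" block).
parse_ecc_mode() {
  awk -F':' '
    /^[ \t]*Current[ \t]*:/ && !c { gsub(/[ \t]/, "", $2); c = $2 }
    /^[ \t]*Pending[ \t]*:/ && !p { gsub(/[ \t]/, "", $2); p = $2 }
    END { printf "current=%s pending=%s\n", c, p }
  '
}

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q -d ECC -i 0 | parse_ecc_mode
fi
```

If the two values differ, a change has been requested but will not take effect until the next reboot.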
Why Toggling ECC Fails (The "Stuck" Problem)
On some VMs, you might try the standard command: sudo nvidia-smi -e 1 (to enable) or sudo nvidia-smi -e 0 (to disable). The command reports success and asks for a reboot. However, after rebooting, you find the setting hasn't changed.
This is the real issue: Your VM is likely running the Nvidia open kernel driver (you can check with sudo dpkg -l | grep nvidia and look for nvidia-driver-XXX-open). This driver package includes a module called nvidia_drm which, while useful for desktop graphics, is not required for Tesla-class cards like the H100. This nvidia_drm module effectively blocks the ECC setting from being applied correctly on reboot.
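The package check can be wrapped in a small test. This is a sketch under assumptions: `has_open_driver` is a hypothetical helper, and the optional `-server` part of the pattern is a guess at how some Ubuntu driver packages are named, so adjust it to whatever `dpkg -l` shows on your VM.

```shell
# has_open_driver: hypothetical helper. Reads `dpkg -l` text on stdin and
# succeeds if a package named like nvidia-driver-XXX-open is listed.
# The "(-server)?" part is an assumption about package naming.
has_open_driver() {
  grep -E 'nvidia-driver-[0-9]+(-server)?-open' >/dev/null
}

# Only run the check on systems that have dpkg.
if command -v dpkg >/dev/null 2>&1; then
  if dpkg -l | has_open_driver; then
    echo "open kernel driver package found"
  else
    echo "no open kernel driver package found"
  fi
fi
```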
How to Change Your ECC Status
Here are the correct procedures, accounting for the nvidia_drm issue.
Before you start: Remove your VM from any production workloads or schedulers. This process requires at least one reboot, and the "stuck" procedure requires two.
Procedure A: How to Properly Enable ECC (If It's "Stuck" Off)
If your ECC is "Stuck" off, follow these steps. This involves blacklisting the problematic module, rebooting, setting the ECC flag, and rebooting again.
Step 1. Blacklist the nvidia_drm Module
Tell the system not to load this module.
# Remove the old config file (-f avoids an error if it does not exist)
sudo rm -f /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# Create a new blacklist file for the nvidia_drm module
echo "blacklist nvidia_drm" | sudo tee /etc/modprobe.d/blacklist-nvidia_drm.conf
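Before moving on, it may be worth confirming that the blacklist entry is actually in place. A minimal sketch, assuming the default /etc/modprobe.d location used above; `blacklist_present` is a hypothetical helper name.

```shell
# blacklist_present: hypothetical helper. Succeeds if any .conf file in the
# given directory contains a "blacklist <module>" line.
blacklist_present() {
  # $1 = module name, $2 = config directory (defaults to /etc/modprobe.d)
  grep -Eqs "^[[:space:]]*blacklist[[:space:]]+$1[[:space:]]*$" "${2:-/etc/modprobe.d}"/*.conf
}

if blacklist_present nvidia_drm; then
  echo "nvidia_drm is blacklisted"
else
  echo "no blacklist entry for nvidia_drm found"
fi
```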
Step 2. Update initramfs
Apply this change so the kernel knows about it on the next boot.
sudo update-initramfs -u
Step 3. First Reboot
Reboot your VM. This will load the OS without the nvidia_drm module.
sudo reboot
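Once the VM is back up, you may want to confirm that the module really stayed out of the kernel before setting the ECC flag. A small sketch that filters `lsmod` output; `module_loaded` is a hypothetical helper name.

```shell
# module_loaded: hypothetical helper. Reads `lsmod`-style text on stdin and
# succeeds if the named module appears in the first column.
module_loaded() {
  # $1 = module name
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Only run the check where lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
  if lsmod | module_loaded nvidia_drm; then
    echo "nvidia_drm is still loaded: recheck the blacklist and initramfs steps"
  else
    echo "nvidia_drm is not loaded"
  fi
fi
```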
Step 4. Set the ECC Enabled Flag
Once your VM is back online, log in. Now, run the command to enable ECC.
sudo nvidia-smi -e 1
This command will set a pending change, which will be applied on the next reboot.
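To see whether a change is waiting for a reboot, you can compare the current and pending modes. The sketch below uses nvidia-smi's `--query-gpu` fields `ecc.mode.current` and `ecc.mode.pending`; `ecc_reboot_needed` is a hypothetical helper name.

```shell
# ecc_reboot_needed: hypothetical helper. Reads "current, pending" lines
# (one per GPU) on stdin and succeeds if any GPU has a pending change.
ecc_reboot_needed() {
  awk -F', *' '$1 != $2 { pending = 1 } END { exit !pending }'
}

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-gpu=ecc.mode.current,ecc.mode.pending --format=csv,noheader | ecc_reboot_needed; then
    echo "ECC change pending: reboot required"
  else
    echo "current ECC mode matches pending mode"
  fi
fi
```

If the two modes still match right after running the enable command, the flag was not accepted, and rebooting will not change anything.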
Step 5. Second Reboot
Reboot the VM one final time to apply the pending ECC change.
sudo reboot
Procedure B: How to Disable ECC
If you have weighed the risks and your workload benefits from disabling ECC, the process is simpler (and generally doesn't require the nvidia_drm fix).
# Set ECC to disabled (0)
sudo nvidia-smi -e 0
# Reboot to apply the change
sudo reboot
(Note: If you used Procedure A to enable ECC, you can use this simple command to disable it again. The blacklist file does not need to be removed.)
How to Verify the Change
After your VM comes back online, it's time to confirm.
If you ENABLED ECC: Run nvidia-smi. You should now see a "0" in the ECC column. This "0" has two meanings:
- ECC is Enabled.
- The "Volatile Uncorrectable" error count is 0 (a good thing!).
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:00:08.0 Off |                    0 |
| N/A   30C    P0             77W /  310W |   17959MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
You can also confirm with $ nvidia-smi -q -d ECC and look for Current: Enabled.
If you DISABLED ECC: Run nvidia-smi. You will see "Off" in the ECC column instead of an error count.
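The two readings of that column can be captured in a few lines. This is a sketch only: `describe_ecc_column` is a hypothetical helper, and it handles just the two cases described here (a number or "Off"), reporting anything else as unrecognized.

```shell
# describe_ecc_column: hypothetical helper. Interprets the value shown in the
# "Volatile Uncorr. ECC" column of nvidia-smi.
describe_ecc_column() {
  case "$1" in
    Off)         echo "ECC disabled" ;;
    ''|*[!0-9]*) echo "unrecognized value: $1" ;;
    *)           echo "ECC enabled, $1 volatile uncorrectable error(s)" ;;
  esac
}

describe_ecc_column 0
describe_ecc_column Off
```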
What's Next?
You now have the context and the technical steps to make an informed decision about ECC. You can configure your H100 VM to match your specific workload's requirements, whether that's the raw performance of ECC-off or the data integrity of ECC-on.
Take Control of Your GPU Performance
Configure ECC on your Hyperstack NVIDIA H100 PCIe VM today and strike the perfect balance between speed and data integrity.
FAQs
What does ECC stand for?
ECC stands for Error-Correcting Code, a GPU memory feature that detects and corrects bit-flip errors automatically during computation.
Why is ECC important?
ECC prevents silent memory errors that could corrupt AI models, scientific simulations, or financial computations, ensuring reliable results and stability.
Does ECC affect performance?
Enabling ECC slightly reduces available VRAM and adds minor latency, trading maximum performance for improved data integrity and reliability.
Which GPU supports ECC in this guide?
The NVIDIA H100 PCIe supports ECC, providing error detection and correction, ideal for high-performance AI or HPC workloads.