Key Takeaways
• Training generative AI for 3D models starts with collecting and preprocessing high-quality 3D data such as meshes, point clouds, or multi-view images.
• Different model architectures are used depending on representation, including voxel-based, point-based, and neural implicit methods.
• Large GPU memory and compute resources are required to handle complex 3D structures and high-resolution outputs.
• Data augmentation and normalisation are critical for improving model generalisation and stability.
• Training workflows often combine 2D supervision with 3D consistency constraints to improve realism.
• Evaluation focuses on geometry accuracy, visual quality, and consistency across multiple viewpoints.
Training generative AI for 3D models has changed dramatically in the past two years. GANs and VAEs, once the standard approach, have largely been superseded by diffusion models, Neural Radiance Fields (NeRF), and 3D Gaussian Splatting. This tutorial reflects that shift. You will find working environment setup commands, a current architecture comparison and a complete walkthrough of provisioning a Hyperstack GPU VM and running a real training script using Nerfstudio.
By the end, you will have a functioning training pipeline running on Hyperstack.
Prerequisites
Before starting, make sure you have the following:
- A Hyperstack account with billing set up (log in on the Hyperstack website; if you are new, our documentation can guide you through the initial setup)
- An SSH keypair added to your Hyperstack profile
- Basic familiarity with Python and the Linux command line
- A Hugging Face account (for accessing pretrained checkpoints)
- Your training data: images, a video, or an existing 3D dataset (ShapeNet, Objaverse, or your own captures)
Tools and libraries used in this tutorial:
- Nerfstudio: modular NeRF and Gaussian Splatting training framework
- 3D Gaussian Splatting (Kerbl et al.): fast, high-quality scene reconstruction
- Shap-E (OpenAI): text/image-to-3D diffusion model
- PyTorch 2.x with CUDA 12.2
- COLMAP: for structure-from-motion preprocessing
Choosing the Right Architecture for 3D Generation
The right architecture depends on your input data, output format and quality requirements. Here is a current comparison of the main approaches:
| Architecture | Best for | Output format | GPU requirement | Training time |
|---|---|---|---|---|
| NeRF (Neural Radiance Field) | Photorealistic scene reconstruction from images/video | Implicit volumetric representation | 1-4x A100 / H100 | Minutes to hours (Instant-NGP, Nerfacto) |
| 3D Gaussian Splatting | Real-time rendering, fast scene reconstruction | Point cloud with Gaussian splats | 1-2x A100 / H100 | 20-45 minutes per scene |
| Diffusion-based (Shap-E, Zero123, One-2-3-45) | Text-to-3D or image-to-3D generation | Mesh, point cloud, NeRF | 2-8x A100 / H100 | Hours to days for full training |
| 3D VAE / Point-E | Fast low-res prototyping from text | Point cloud | 1x A100 | Minutes for inference; days for training |
| GAN (GET3D, EG3D) | Category-specific object generation (cars, chairs) | Mesh + texture | 4-8x A100 | Days to weeks |
Recommendation for most use cases: Start with 3D Gaussian Splatting via Nerfstudio if you have image/video input and need fast, high-quality results. Use Shap-E or Zero123++ for text-to-3D workflows. Reserve GAN-based methods for category-specific generation tasks where you have a large, labelled dataset.
Step 1: Provision Your Hyperstack VM
Log in to the Hyperstack console and click Deploy New Virtual Machine.
Recommended configuration for 3D generative AI training:
- GPU: NVIDIA A100-80GB for training; NVIDIA A6000 for inference and prototyping
- OS image: Ubuntu 22.04 LTS with CUDA 12.2 and Docker
- Storage: At least 200 GB on the ephemeral disk for datasets and checkpoints
- Networking: Assign a public IP and open ports 22 (SSH) and 7007 (Nerfstudio viewer)
Once deployed, connect via SSH:
ssh -i /path/to/your/key ubuntu@YOUR_VM_PUBLIC_IP
Step 2: Set Up the Environment
Update the system and verify the GPU is visible:
sudo apt-get update && sudo apt-get upgrade -y
nvidia-smi
You should see your GPU listed with the driver and CUDA version. Now, create a Python virtual environment and install PyTorch:
python3 -m venv ~/3d-train-env
source ~/3d-train-env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
python -c "import torch; print(torch.cuda.get_device_name(0))"
Install system dependencies required by Nerfstudio and COLMAP:
sudo apt-get install -y git cmake build-essential libboost-all-dev \
libfreeimage-dev libgflags-dev libgoogle-glog-dev \
libsuitesparse-dev colmap ffmpeg
Step 3: Install Nerfstudio
Nerfstudio is the recommended framework for NeRF and Gaussian Splatting training. It provides a unified interface for multiple 3D representation methods and includes a real-time web viewer.
pip install nerfstudio
pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
ns-train --help
If you see the Nerfstudio CLI help output, the installation is successful.
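You can also confirm that the tiny-cuda-nn bindings built correctly against your CUDA toolkit with a quick import check (the package installs as the tinycudann module):
python -c "import tinycudann as tcnn; print('tiny-cuda-nn bindings OK')"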
Step 4: Prepare Your Dataset
Nerfstudio expects a set of images with known camera poses. If you have a video or a collection of photos of an object or scene, the ns-process-data command handles COLMAP preprocessing automatically.
Option A: From a video file:
ns-process-data video \
--data /path/to/your/video.mp4 \
--output-dir /ephemeral/dataset/my-scene
Option B: From a folder of images:
ns-process-data images \
--data /path/to/images/ \
--output-dir /ephemeral/dataset/my-scene
This runs COLMAP structure-from-motion under the hood, estimating camera intrinsics and extrinsics for every frame. Expect this to take 5-20 minutes, depending on image count and resolution.
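When preprocessing finishes, the output directory should contain the processed images alongside a transforms.json file holding the recovered camera poses. A minimal sketch to confirm how many frames were registered (path per the commands above):
python <<'EOF'
import json
# ns-process-data writes camera poses for each registered frame to transforms.json
with open('/ephemeral/dataset/my-scene/transforms.json') as f:
    transforms = json.load(f)
print(len(transforms['frames']), 'frames registered with camera poses')
EOF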
Option C: Use an existing dataset (ShapeNet / Objaverse):
pip install objaverse
python -c "
import objaverse
uids = objaverse.load_uids()
objects = objaverse.load_objects(uids[:100])
print('Downloaded', len(objects), 'objects')
Step 5: Train a 3D Gaussian Splatting Model
3D Gaussian Splatting is currently the fastest method for producing photorealistic scene reconstructions. It trains in under an hour on a single A100 and produces real-time renderable outputs.
ns-train splatfacto \
--data /ephemeral/dataset/my-scene \
--output-dir /ephemeral/outputs/my-scene-gaussian \
--max-num-iterations 30000 \
--pipeline.model.num-downscales 0 \
--vis tensorboard
To monitor training metrics in real time, point TensorBoard at the output directory (event files are written there when training runs with --vis tensorboard), then tunnel port 6006 over SSH to view it locally:
tensorboard --logdir /ephemeral/outputs/my-scene-gaussian --port 6006
Checkpoints are saved automatically every 2,000 iterations. When training completes, render output frames. Note that ns-render camera-path expects a camera path JSON exported from the web viewer; if you trained headless, use ns-render interpolate, which builds a trajectory from the training views instead:
ns-render interpolate \
--load-config /ephemeral/outputs/my-scene-gaussian/splatfacto/*/config.yml \
--output-path /ephemeral/outputs/render.mp4
Step 6: Train a NeRF Model (Nerfacto)
If you need an implicit volumetric representation rather than splats, for example, for downstream editing or novel view synthesis with fine detail, use Nerfacto, Nerfstudio's default high-quality NeRF method:
ns-train nerfacto \
--data /ephemeral/dataset/my-scene \
--output-dir /ephemeral/outputs/my-scene-nerf \
--max-num-iterations 50000 \
--pipeline.model.disable-scene-contraction True \
--vis tensorboard
Nerfacto trains in roughly 20-40 minutes on an A100 for a typical indoor or object-centric scene.
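If a run is interrupted, for example by hibernating the VM, you can resume from the latest checkpoint rather than retraining from scratch. A sketch, assuming the default layout Nerfstudio writes under the experiment's timestamped directory (substitute your actual run timestamp for TIMESTAMP):
ns-train nerfacto \
--data /ephemeral/dataset/my-scene \
--output-dir /ephemeral/outputs/my-scene-nerf \
--load-dir /ephemeral/outputs/my-scene-nerf/nerfacto/TIMESTAMP/nerfstudio_models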
Step 7: Text-to-3D with Shap-E
If your use case is generating 3D assets from text prompts rather than reconstructing from images, use OpenAI's Shap-E diffusion model:
pip install git+https://github.com/openai/shap-e.git
python <<'EOF'
import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import decode_latent_mesh
import trimesh

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 'transmitter' decodes latents into 3D representations;
# 'text300M' is the text-conditioned diffusion model
xm = load_model('transmitter', device=device)
model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))

prompt = 'a wooden chair with four legs'

# Sample a latent 3D representation conditioned on the text prompt
latents = sample_latents(
    batch_size=1, model=model, diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=[prompt]),
    progress=True, clip_denoised=True, use_fp16=True,
    use_karras=True, karras_steps=64,
    sigma_min=1e-3, sigma_max=160, s_churn=0,
)

# Decode each latent into a triangle mesh and export as .obj
for i, latent in enumerate(latents):
    mesh = decode_latent_mesh(xm, latent).tri_mesh()
    t = trimesh.Trimesh(vertices=mesh.verts, faces=mesh.faces)
    t.export(f'/ephemeral/outputs/output_{i}.obj')
    print(f'Saved output_{i}.obj')
EOF
The generated .obj files can be downloaded from the VM and opened directly in Blender, Maya or any standard 3D software.
Step 8: Evaluate Your Model
Evaluate reconstruction quality using standard metrics:
ns-eval \
--load-config /ephemeral/outputs/my-scene-nerf/nerfacto/*/config.yml \
--output-path /ephemeral/outputs/eval_results.json
cat /ephemeral/outputs/eval_results.json
Key metrics to target:
- PSNR > 28 dB: acceptable reconstruction quality for most applications
- PSNR > 32 dB: high-quality reconstruction
- SSIM > 0.85: strong structural similarity to ground truth
- LPIPS < 0.15: perceptually close to reference images
If metrics are below target, the most common fixes are: more training iterations, more input images with better coverage, or switching from Nerfacto to Gaussian Splatting for scenes with strong view-dependent effects.
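For intuition, PSNR is computed directly from the mean squared error between the rendered and ground-truth images. A minimal sketch for images normalised to [0, 1], using synthetic arrays to illustrate the scale of the thresholds above:
python <<'EOF'
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher is better
    mse = np.mean((rendered - reference) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

# Example: a render whose pixels deviate by ~0.02 on average scores around 32 dB
ref = np.random.rand(256, 256, 3)
render = np.clip(ref + np.random.normal(0, 0.025, ref.shape), 0, 1)
print(f'PSNR: {psnr(render, ref):.1f} dB')
EOF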
Step 9: Export and Download Your Model
Export the trained model to a standard format for use in downstream tools:
ns-export gaussian-splat \
--load-config /ephemeral/outputs/my-scene-gaussian/splatfacto/*/config.yml \
--output-dir /ephemeral/exports/splat
ns-export marching-cubes \
--load-config /ephemeral/outputs/my-scene-nerf/nerfacto/*/config.yml \
--output-dir /ephemeral/exports/mesh \
--resolution 1024
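Before downloading, you can sanity-check the exported mesh on the VM. The exact filenames Nerfstudio writes vary by version, so this sketch globs for common mesh formats and prints basic statistics:
python <<'EOF'
import glob
import trimesh

# Exported filenames vary by Nerfstudio version; search for common mesh formats
for path in glob.glob('/ephemeral/exports/mesh/*.obj') + glob.glob('/ephemeral/exports/mesh/*.ply'):
    m = trimesh.load(path)
    if isinstance(m, trimesh.Trimesh):
        print(f'{path}: {len(m.vertices)} vertices, {len(m.faces)} faces')
    else:
        print(f'{path}: loaded as {type(m).__name__}')
EOF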
Download the exports to your local machine:
scp -i /path/to/your/key -r ubuntu@YOUR_VM_PUBLIC_IP:/ephemeral/exports/ ./local-exports/
Step 10: Hibernate Your VM When Not Training
3D training jobs can be long, but you should not leave the VM running between sessions. In the Hyperstack dashboard, use Hibernate to pause compute billing while keeping your disk state intact. When you resume, your environment, datasets and checkpoints will all be exactly as you left them.
Current Limitations to Be Aware Of
- Reconstruction vs. generation: NeRF and Gaussian Splatting reconstruct scenes from images -- they do not generate novel objects from scratch. For pure generation from text or image prompts, use diffusion-based methods like Shap-E or Zero123++.
- Geometric precision: Diffusion-based 3D generation still produces meshes with artefacts, thin surfaces, and topology errors that require manual cleanup before use in engineering or manufacturing workflows.
- Data coverage: NeRF and Gaussian Splatting quality degrades sharply with sparse image coverage. Aim for 100+ images with at least 60% overlap between adjacent views for reliable reconstruction.
- Generalisation: Models trained on a single scene do not generalise. For a generalisable model across object categories, you need category-level training on large datasets like Objaverse, which requires significantly more compute.
Recommended GPU Selection on Hyperstack
- NVIDIA A6000 (48 GB): Ideal for prototyping, single-scene NeRF/Gaussian Splatting training, and inference. Best value for iterative work.
- NVIDIA A100-80GB: Recommended for Shap-E and diffusion-based 3D training, multi-scene batched training, and large-dataset fine-tuning runs.
- NVIDIA H100 SXM: Required for training large-scale generalised 3D diffusion models from scratch on datasets like Objaverse-XL.
Start training your 3D models on Hyperstack today. Provision an A100 or A6000 VM in minutes and follow the steps above to go from raw images or text prompts to a renderable 3D scene. Sign up and get started now.
FAQs
What is the best GPU for 3D generative AI training on Hyperstack?
The NVIDIA A6000 is the best starting point for single-scene NeRF and Gaussian Splatting training. For diffusion-based 3D generation or large-dataset training, use the A100-80GB. Multi-GPU H100 configurations are appropriate for training generalised models from scratch.
What is the difference between NeRF and 3D Gaussian Splatting?
NeRF represents a scene as a continuous implicit function that maps 3D coordinates to colour and density. Gaussian Splatting represents a scene as millions of 3D Gaussians with position, colour, opacity, and covariance. Gaussian Splatting trains faster (20-45 minutes vs. hours), renders in real time, and often produces sharper results. NeRF is more flexible for downstream editing and relighting tasks.
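Conceptually, the NeRF mapping is just a small neural network. A toy sketch of the core function in PyTorch (real implementations add positional encoding and condition colour on the viewing direction):
python <<'EOF'
import torch
import torch.nn as nn

# Toy NeRF field: maps a 3D coordinate to RGB colour and volume density.
class ToyNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (r, g, b, density)
        )

    def forward(self, xyz):
        out = self.mlp(xyz)
        rgb = torch.sigmoid(out[..., :3])    # colour in [0, 1]
        density = torch.relu(out[..., 3:])   # non-negative volume density
        return rgb, density

rgb, density = ToyNeRF()(torch.rand(1024, 3))
print(rgb.shape, density.shape)
EOF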
Can I use this pipeline for text-to-3D generation without input images?
Yes. Use Shap-E (Step 7 above) or Zero123++ for text-to-3D or single-image-to-3D generation. These are diffusion-based models that do not require multi-view image input. The trade-off is lower geometric precision and more manual cleanup required compared to reconstruction-based methods.
How long does 3D model training take on Hyperstack?
3D Gaussian Splatting trains in 20-45 minutes per scene on an A100. Nerfacto trains in 20-40 minutes. Diffusion model training takes hours to days depending on dataset size and number of GPU nodes. Full training of a generalised 3D model from scratch on Objaverse-scale data takes several days on a multi-GPU cluster.
What 3D file formats can I export from Nerfstudio?
Nerfstudio supports export to .ply (Gaussian splats and meshes), .obj (mesh via marching cubes), and .glb/.gltf for web and game engine use. All formats are compatible with Blender, Unity, Unreal Engine, and standard 3D pipelines.
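If you need to convert between formats outside Nerfstudio, trimesh (already installed for Step 7) handles the common cases. A sketch converting the Step 7 output to glTF binary:
python <<'EOF'
import trimesh
# Convert the Step 7 .obj output to .glb for web and game-engine use
mesh = trimesh.load('/ephemeral/outputs/output_0.obj')
mesh.export('/ephemeral/outputs/output_0.glb')
print('Exported output_0.glb')
EOF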
Do I need a Hugging Face account?
Only if you are downloading gated pretrained checkpoints. Nerfstudio, 3D Gaussian Splatting, and Shap-E can all be used without one.