When you train machine learning models on a GPU, raw power alone doesn’t guarantee fast results. Poor kernel execution, inefficient memory usage, or unnecessary CPU-GPU synchronization can significantly slow down the process, sometimes by minutes or even hours per epoch. This is where GPU profiling comes in.
By profiling and debugging GPU performance, you can pinpoint bottlenecks, optimize memory usage, and understand exactly how your training code interacts with the hardware.
In this guide, you’ll learn how to set up a profiling environment, run targeted profiling sessions, monitor GPU usage in real time, and interpret reports. We’ll work through a small CNN training example so you can replicate the process on your own server and apply it to larger models.
Prerequisites
- An Ubuntu 24.04 server equipped with an NVIDIA GPU that has at least 8 GB of memory.
- A non-root user or a user with sudo privileges.
- NVIDIA drivers installed on your server.
Step 1 – Installing Required GPU Profiling Tools
Before we can analyze GPU performance, we need to install a few essential tools. These include NVIDIA’s Nsight Systems for timeline analysis, Nsight Compute for kernel-level profiling, and nvtop for real-time GPU monitoring.
1. Install Python and required dependencies.
sudo apt update
sudo apt install -y python3 python3-venv python3-pip git
2. Install NVIDIA profiling tools.
sudo apt install -y nsight-compute nsight-systems nvtop
3. Reload your shell environment to ensure the profiling tools are in your PATH.
source ~/.bashrc
4. Verify the ncu version.
ncu --version
Output.
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2025.1.0.0 (build 35237751) (public-release)
5. Verify the nsys version.
nsys --version
Output.
NVIDIA Nsight Systems version 2024.6.2.225-246235244400v0
Step 2 – Setting Up the Python Environment
To keep things organized and avoid package conflicts, we’ll create a dedicated Python virtual environment for our GPU profiling work.
1. Create a dedicated working directory for profiling experiments.
mkdir -p ~/gpu-profiling && cd ~/gpu-profiling
2. Create and activate the virtual environment.
python3 -m venv venv
source venv/bin/activate
3. Upgrade pip inside the virtual environment.
pip install --upgrade pip
4. Install PyTorch, torchvision, and torchaudio built with CUDA 12.1 support.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
5. Once the installation completes, verify that PyTorch detects your GPU.
python - <<'PY'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
print("Device:", torch.cuda.get_device_name(0))
print("CUDA capability:", torch.cuda.get_device_capability(0))
PY
Output.
CUDA available: True
Device: NVIDIA A40-16Q
CUDA capability: (8, 6)
6. Update your environment variables to include the CUDA and Nsight tools in your PATH.
export PATH=/usr/local/cuda-12.8/bin:/opt/nvidia/nsight-compute:$PATH
Step 3 – Preparing a Sample Training Script for Profiling
To demonstrate GPU profiling, we’ll use a small CNN model with a synthetic dataset. This setup ensures that GPU activity is consistent and easy to interpret in profiling tools, without being slowed down by disk I/O or complex data processing.
1. Create a training script.
nano train.py
Add the following code.
import os
import time
import argparse
from contextlib import nullcontext
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.cuda import nvtx

# ---- Synthetic dataset to drive GPU compute ----
class SyntheticImages(Dataset):
    def __init__(self, n=50000, c=3, h=224, w=224):
        self.n, self.c, self.h, self.w = n, c, h, w

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.randn(self.c, self.h, self.w)   # random image
        y = torch.randint(0, 1000, (1,)).item()   # 1000-class label
        return x, y

# ---- Small-ish CNN to keep kernels visible but fast ----
class SmallCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.net(x)
        x = torch.flatten(x, 1)
        return self.fc(x)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--amp", action="store_true", help="Use mixed precision")
    parser.add_argument("--profile", action="store_true", help="Write PyTorch profiler trace")
    parser.add_argument("--profile_dir", type=str, default="./tb_logs")
    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.cudnn.benchmark = True

    # Data
    train_set = SyntheticImages()
    train_loader = DataLoader(
        train_set,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.workers,
        pin_memory=True
    )

    # Model/opt
    model = SmallCNN().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=3e-4)

    # AMP context
    amp_ctx = torch.autocast(device_type="cuda", dtype=torch.float16) if (args.amp and device == "cuda") else nullcontext()
    scaler = torch.cuda.amp.GradScaler(enabled=(args.amp and device == "cuda"))

    # Optional PyTorch profiler
    prof_ctx = nullcontext()
    if args.profile:
        from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler
        os.makedirs(args.profile_dir, exist_ok=True)
        prof_ctx = profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
            on_trace_ready=tensorboard_trace_handler(args.profile_dir),
            record_shapes=True,
            with_stack=True,
            with_flops=True,
            profile_memory=True
        )

    start = time.time()
    if args.profile:
        prof_ctx.__enter__()

    for epoch in range(args.epochs):
        nvtx.range_push(f"epoch_{epoch}")
        model.train()
        running_loss = 0.0
        for i, (x, y) in enumerate(train_loader):
            nvtx.range_push("batch")
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)
            optimizer.zero_grad(set_to_none=True)
            with amp_ctx:
                out = model(x)
                loss = criterion(out, y)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            running_loss += loss.item()
            if args.profile:
                prof_ctx.step()
            nvtx.range_pop()  # batch
            # Keep runtime contained for profiling demos
            if i >= 100:  # ~101 batches per epoch
                break
        print(f"Epoch {epoch+1} | Loss: {running_loss/(i+1):.4f}")
        nvtx.range_pop()  # epoch

    if args.profile:
        prof_ctx.__exit__(None, None, None)

    dur = time.time() - start
    print(f"Total time: {dur:.2f}s")
    print(torch.cuda.memory_summary(device=device))

if __name__ == "__main__":
    main()
2. Run the training script.
python3 train.py --epochs 2 --batch-size 64 --workers 4 --amp
Output.
Epoch 1 | Loss: 6.9113
Epoch 2 | Loss: 6.9093
Total time: 6.24s
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 57875 KiB | 379086 KiB | 146761 MiB | 146705 MiB |
| from large pool | 54272 KiB | 377258 KiB | 145822 MiB | 145769 MiB |
| from small pool | 3603 KiB | 4472 KiB | 939 MiB | 935 MiB |
|---------------------------------------------------------------------------|
| Active memory | 57875 KiB | 379086 KiB | 146761 MiB | 146705 MiB |
| from large pool | 54272 KiB | 377258 KiB | 145822 MiB | 145769 MiB |
| from small pool | 3603 KiB | 4472 KiB | 939 MiB | 935 MiB |
|---------------------------------------------------------------------------|
| Requested memory | 57870 KiB | 377033 KiB | 146492 MiB | 146435 MiB |
| from large pool | 54272 KiB | 375210 KiB | 145555 MiB | 145502 MiB |
| from small pool | 3598 KiB | 4466 KiB | 936 MiB | 933 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 419840 KiB | 514048 KiB | 1440 MiB | 1030 MiB |
| from large pool | 413696 KiB | 509952 KiB | 1434 MiB | 1030 MiB |
| from small pool | 6144 KiB | 6144 KiB | 6 MiB | 0 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 62957 KiB | 140490 KiB | 64668 MiB | 64607 MiB |
| from large pool | 60416 KiB | 137911 KiB | 63726 MiB | 63667 MiB |
| from small pool | 2541 KiB | 3410 KiB | 942 MiB | 939 MiB |
|---------------------------------------------------------------------------|
| Allocations | 40 | 49 | 15388 | 15348 |
| from large pool | 3 | 9 | 3646 | 3643 |
| from small pool | 37 | 46 | 11742 | 11705 |
|---------------------------------------------------------------------------|
| Active allocs | 40 | 49 | 15388 | 15348 |
| from large pool | 3 | 9 | 3646 | 3643 |
| from small pool | 37 | 46 | 11742 | 11705 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 10 | 11 | 24 | 14 |
| from large pool | 7 | 9 | 21 | 14 |
| from small pool | 3 | 3 | 3 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 16 | 19 | 8288 | 8272 |
| from large pool | 3 | 8 | 2629 | 2626 |
| from small pool | 13 | 16 | 5659 | 5646 |
|---------------------------------------------------------------------------|
| Oversize allocations | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Oversize GPU segments | 0 | 0 | 0 | 0 |
|===========================================================================|
This script is now ready for profiling; it contains NVTX markers for Nsight, an optional PyTorch profiler, and memory summary output for debugging allocations.
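If you prefer to inspect PyTorch profiler results in the terminal rather than TensorBoard, the profiler object also exposes key_averages(). The following standalone sketch is independent of train.py and uses a hypothetical matrix-multiply workload; it profiles a few operations and prints the operators that consumed the most GPU time.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

# Profile a few matrix multiplications on both CPU and GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()

# Print the operators that consumed the most GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))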
Step 4 – Monitoring Real-Time GPU Usage
Before running deep profiling sessions, it’s useful to watch GPU usage in real time. This helps confirm that the GPU is being utilized as expected and can quickly reveal issues like low utilization, excessive memory use, or thermal throttling.
1. Using nvidia-smi
The simplest tool is NVIDIA’s nvidia-smi, which comes with the GPU driver. Run.
watch -n 1 nvidia-smi
This command updates every second and shows:
- GPU utilization percentage
- Memory usage
- Temperature
- Running processes
2. Using nvtop for Interactive Monitoring
nvtop is like htop but for GPUs. It provides a colorful, real-time display.
nvtop
With nvtop, you can:
- Monitor per-process GPU usage
- Track VRAM consumption over time
- See compute vs. memory utilization
3. Using nvidia-smi dmon for Detailed Stats
For low-level telemetry such as power consumption and PCIe throughput, use.
nvidia-smi dmon -s pucvmt
This is helpful for detecting bottlenecks caused by thermal limits or power constraints.
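If you want to collect the same statistics from Python, for example to log them alongside training metrics, you can poll the NVML bindings directly. Below is a minimal sketch that assumes the nvidia-ml-py package is installed (pip install nvidia-ml-py) and samples the first GPU once per second.
import time
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    for _ in range(10):  # sample once per second for ten seconds
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {util.gpu:3d}% | VRAM {mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB | {temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()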
Step 5 – Profiling with Nsight Systems (nsys)
Nsight Systems is NVIDIA’s system-wide performance analysis tool. It helps visualize how your training code interacts with the CPU, GPU, and operating system, making it easier to spot bottlenecks in kernel launches, data loading, or synchronization.
1. Create a directory to save the report.
mkdir -p reports
2. Run an nsys profile for one epoch of the CNN training script to keep the report small.
nsys profile --trace=cuda,nvtx,osrt \
-o reports/nsys_smoke \
python3 train.py --epochs 1 --batch-size 64 --workers 4 --amp
3. After profiling, generate a summary.
nsys stats reports/nsys_smoke.nsys-rep \
| tee reports/nsys_smoke_stats.txt
Output.
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- --------------- --------------- ------------- ------------- ------------ ------- --------
70.2 3,687,984,904 1 3,687,984,904.0 3,687,984,904.0 3,687,984,904 3,687,984,904 0.0 PushPop :epoch_0
29.8 1,568,095,057 101 15,525,693.6 6,956,237.0 6,786,630 871,783,914 86,053,398.2 PushPop :batch
Processing [reports/nsys_smoke.sqlite] with [/opt/nvidia/nsight-systems/2024.6.2/host-linux-x64/reports/osrt_sum.py]...
** OS Runtime Summary (osrt_sum):
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------ ---------- ------------- ------------- ----------------------
35.8 16,379,720,411 207 79,129,084.1 3,912,872.0 26,329 1,892,649,216 342,138,838.6 pthread_cond_wait
31.4 14,374,694,895 125 114,997,559.2 98,891,000.0 28,279 1,153,881,169 182,208,818.2 sem_wait
16.3 7,478,573,597 174 42,980,308.0 13,189,978.0 1,542 763,891,502 72,151,566.3 poll
10.0 4,557,471,910 759 6,004,574.3 27,657.0 1,302 502,654,188 54,197,437.6 pthread_cond_timedwait
4.2 1,915,698,569 91 21,051,632.6 13,376,724.0 1,202,730 147,786,074 18,470,602.4 sem_clockwait
0.9 434,767,656 1,023 424,992.8 10,137.0 1,024 152,203,095 5,252,250.5 ioctl
0.8 369,512,061 2,073 178,249.9 2,477.0 2,057 5,530,737 774,095.6 munmap
0.1 54,252,032 4 13,563,008.0 12,635,347.0 11,484,002 17,497,336 2,696,547.7 fork
0.1 40,884,509 3,325 12,296.1 2,926.0 1,000 2,499,701 66,540.4 read
0.1 36,920,409 921 40,087.3 5,218.0 1,077 3,357,386 288,751.3 pthread_cond_signal
0.1 26,573,233 2 13,286,616.5 13,286,616.5 1,785,896 24,787,337 16,264,474.9 pthread_rwlock_wrlock
0.0 19,158,656 9,863 1,942.5 1,682.0 1,000 23,503 1,120.4 stat64
0.0 15,520,576 108 143,709.0 15,325.0 1,013 11,797,602 1,138,827.1 pthread_mutex_lock
0.0 15,417,551 13,104 1,176.6 1,156.0 1,000 20,377 371.3 lstat64
0.0 10,395,002 937 11,093.9 3,281.0 1,003 436,038 27,260.8 write
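For longer training runs, you may not want nsys to trace the entire process. A common pattern is to mark the region of interest from Python with the CUDA profiler API and launch nsys with --capture-range=cudaProfilerApi so only that region is recorded. Below is a minimal sketch, using a hypothetical train_steps() helper in place of the real training loop.
import torch

def train_steps(n):
    # Hypothetical stand-in for a few training iterations.
    x = torch.randn(1024, 1024, device="cuda")
    for _ in range(n):
        x @ x
    torch.cuda.synchronize()

train_steps(10)  # warm-up iterations, not captured

# Only the work between start() and stop() is traced when nsys is
# launched with --capture-range=cudaProfilerApi.
torch.cuda.profiler.start()
train_steps(20)
torch.cuda.profiler.stop()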
Step 6 – Profiling with Nsight Compute (ncu)
While Nsight Systems gives you a broad view of CPU-GPU interaction, Nsight Compute focuses on the low-level performance of individual CUDA kernels. It provides metrics such as warp execution efficiency and memory throughput, along with guided bottleneck analysis.
1. Run a kernel-level profile using the speed-of-light preset to measure theoretical vs. achieved performance.
ncu --set speed-of-light \
--target-processes application-only \
--launch-skip 50 --launch-count 1 \
--force-overwrite \
--export reports/ncu_step \
python3 -u train.py --epochs 1 --batch-size 64 --workers 0 --amp
Once the run completes, you’ll have a .ncu-rep file in the reports directory.
==PROF== Report: /root/gpu-profiling/reports/ncu_step.ncu-rep
Explanation:
- --set speed-of-light → collect the high-level Speed of Light metrics, which compare achieved throughput against the hardware's theoretical peak.
- --target-processes application-only → profile only the application process itself, not child or system processes.
- --launch-skip 50 → skip the first 50 kernel launches so profiling focuses on steady-state execution.
- --launch-count 1 → profile only one kernel launch to keep the report small.
- --export reports/ncu_step → save the results for later viewing in the Nsight Compute GUI.
2. You can also open the report and extract details directly in the terminal.
ncu --import reports/ncu_step.ncu-rep --page summary
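As a quick cross-check of what ncu reports, you can also time individual GPU operations from PyTorch with CUDA events. The sketch below measures a single half-precision matrix multiplication; it gives a rough timing of the launched kernels, not a replacement for ncu's hardware counters.
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# CUDA events measure elapsed time on the GPU timeline.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Warm up so one-time setup costs do not skew the measurement.
for _ in range(3):
    a @ b
torch.cuda.synchronize()

start.record()
a @ b
end.record()
torch.cuda.synchronize()

print(f"matmul time: {start.elapsed_time(end):.3f} ms")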
Conclusion
Profiling and debugging GPU performance involves more than just running commands; it’s about understanding where your training workflow loses time and efficiency. On an Ubuntu 24.04 GPU server, Nsight Systems gives you a high-level view of how your code interacts with the CPU, GPU, and operating system. At the same time, Nsight Compute lets you drill down to individual CUDA kernels to see how effectively they are using the hardware.