When you train machine learning models on a GPU, raw power alone doesn’t guarantee fast results. Poor kernel execution, inefficient memory usage, or unnecessary CPU-GPU synchronization can significantly slow down the process, sometimes by minutes or even hours per epoch. This is where GPU profiling comes in.

By profiling and debugging GPU performance, you can pinpoint bottlenecks, optimize memory usage, and understand exactly how your training code interacts with the hardware.

In this guide, you’ll learn how to set up a profiling environment, run targeted profiling sessions, monitor GPU usage in real time, and interpret reports. We’ll work through a small CNN training example so you can replicate the process on your own server and apply it to larger models.

Prerequisites

  • An Ubuntu 24.04 server equipped with an NVIDIA GPU that has at least 8 GB of memory.
  • A non-root user or a user with sudo privileges.
  • NVIDIA drivers installed on your server.

Step 1 – Installing Required GPU Profiling Tools

Before we can analyze GPU performance, we need to install a few essential tools. These include NVIDIA’s Nsight Systems for timeline analysis, Nsight Compute for kernel-level profiling, and nvtop for real-time GPU monitoring.

1. Install Python and required dependencies.

sudo apt install -y python3 python3-venv python3-pip git

2. Install NVIDIA profiling tools.

sudo apt install -y nsight-compute nsight-systems nvtop

3. Reload your shell environment to ensure the profiling tools are in your PATH.

source ~/.bashrc

4. Verify the ncu version.

ncu --version

Output.

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2025.1.0.0 (build 35237751) (public-release)

5. Verify the nsys version.

nsys --version

Output.

NVIDIA Nsight Systems version 2024.6.2.225-246235244400v0

Step 2 – Setting Up the Python Environment

To keep things organized and avoid package conflicts, we’ll create a dedicated Python virtual environment for our GPU profiling work.

1. Create a dedicated working directory for profiling experiments.

mkdir -p ~/gpu-profiling && cd ~/gpu-profiling

2. Create and activate the virtual environment.

python3 -m venv venv
source venv/bin/activate

3. Upgrade pip inside the virtual environment.

pip install --upgrade pip

4. Install PyTorch, torchvision, and torchaudio with CUDA 12.1 support.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

5. Once the installation completes, verify that PyTorch detects your GPU.

python - <<'PY'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA capability:", torch.cuda.get_device_capability(0))
PY

Output.

CUDA available: True
Device: NVIDIA A40-16Q
CUDA capability: (8, 6)

6. Update your PATH environment variable to include the CUDA and Nsight tools.

export PATH=/usr/local/cuda-12.8/bin:/opt/nvidia/nsight-compute:$PATH
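
Optionally, you can also confirm which CUDA and cuDNN versions your PyTorch build was compiled against; these are handy reference points when reading profiler output later. A quick optional check, using the same inline-Python pattern as in the previous step:

python - <<'PY'
import torch
# Versions this PyTorch wheel was built against (useful context when reading profiler reports)
print("PyTorch:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
PY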

Step 3 – Preparing a Sample Training Script for Profiling

To demonstrate GPU profiling, we’ll use a small CNN model with a synthetic dataset. This setup ensures that GPU activity is consistent and easy to interpret in profiling tools, without being slowed down by disk I/O or complex data processing.

1. Create a training script.

nano train.py

Add the following code.

import os
import time
import argparse
from contextlib import nullcontext

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.cuda import nvtx

# ---- Synthetic dataset to drive GPU compute ----
class SyntheticImages(Dataset):
    def __init__(self, n=50000, c=3, h=224, w=224):
        self.n, self.c, self.h, self.w = n, c, h, w
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        x = torch.randn(self.c, self.h, self.w)  # random image
        y = torch.randint(0, 1000, (1,)).item()  # 1000-class label
        return x, y

# ---- Small-ish CNN to keep kernels visible but fast ----
class SmallCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.net(x)
        x = torch.flatten(x, 1)
        return self.fc(x)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--amp", action="store_true", help="Use mixed precision")
    parser.add_argument("--profile", action="store_true", help="Write PyTorch profiler trace")
    parser.add_argument("--profile_dir", type=str, default="./tb_logs")
    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.cudnn.benchmark = True

    # Data
    train_set = SyntheticImages()
    train_loader = DataLoader(
        train_set,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.workers,
        pin_memory=True
    )

    # Model/opt
    model = SmallCNN().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=3e-4)

    # AMP context
    amp_ctx = torch.autocast(device_type="cuda", dtype=torch.float16) if (args.amp and device == "cuda") else nullcontext()
    scaler = torch.cuda.amp.GradScaler(enabled=(args.amp and device == "cuda"))

    # Optional PyTorch profiler
    prof_ctx = nullcontext()
    if args.profile:
        from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler
        os.makedirs(args.profile_dir, exist_ok=True)
        prof_ctx = profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
            on_trace_ready=tensorboard_trace_handler(args.profile_dir),
            record_shapes=True,
            with_stack=True,
            with_flops=True,
            profile_memory=True
        )

    start = time.time()
    if args.profile:
        prof_ctx.__enter__()

    for epoch in range(args.epochs):
        nvtx.range_push(f"epoch_{epoch}")
        model.train()
        running_loss = 0.0
        for i, (x, y) in enumerate(train_loader):
            nvtx.range_push("batch")
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)
            with amp_ctx:
                out = model(x)
                loss = criterion(out, y)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            running_loss += loss.item()
            if args.profile:
                prof_ctx.step()
            nvtx.range_pop()  # batch

            # Keep runtime contained for profiling demos
            if i >= 100:   # ~101 batches per epoch
                break

        print(f"Epoch {epoch+1} | Loss: {running_loss/(i+1):.4f}")
        nvtx.range_pop()  # epoch

    if args.profile:
        prof_ctx.__exit__(None, None, None)

    dur = time.time() - start
    print(f"Total time: {dur:.2f}s")
    print(torch.cuda.memory_summary(device=device))

if __name__ == "__main__":
    main()

2. Run the training script.

python3 train.py --epochs 2 --batch-size 64 --workers 4 --amp

Output.

Epoch 1 | Loss: 6.9113
Epoch 2 | Loss: 6.9093
Total time: 6.24s
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  57875 KiB | 379086 KiB | 146761 MiB | 146705 MiB |
|       from large pool |  54272 KiB | 377258 KiB | 145822 MiB | 145769 MiB |
|       from small pool |   3603 KiB |   4472 KiB |    939 MiB |    935 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  57875 KiB | 379086 KiB | 146761 MiB | 146705 MiB |
|       from large pool |  54272 KiB | 377258 KiB | 145822 MiB | 145769 MiB |
|       from small pool |   3603 KiB |   4472 KiB |    939 MiB |    935 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  57870 KiB | 377033 KiB | 146492 MiB | 146435 MiB |
|       from large pool |  54272 KiB | 375210 KiB | 145555 MiB | 145502 MiB |
|       from small pool |   3598 KiB |   4466 KiB |    936 MiB |    933 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   | 419840 KiB | 514048 KiB |   1440 MiB |   1030 MiB |
|       from large pool | 413696 KiB | 509952 KiB |   1434 MiB |   1030 MiB |
|       from small pool |   6144 KiB |   6144 KiB |      6 MiB |      0 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  62957 KiB | 140490 KiB |  64668 MiB |  64607 MiB |
|       from large pool |  60416 KiB | 137911 KiB |  63726 MiB |  63667 MiB |
|       from small pool |   2541 KiB |   3410 KiB |    942 MiB |    939 MiB |
|---------------------------------------------------------------------------|
| Allocations           |      40    |      49    |   15388    |   15348    |
|       from large pool |       3    |       9    |    3646    |    3643    |
|       from small pool |      37    |      46    |   11742    |   11705    |
|---------------------------------------------------------------------------|
| Active allocs         |      40    |      49    |   15388    |   15348    |
|       from large pool |       3    |       9    |    3646    |    3643    |
|       from small pool |      37    |      46    |   11742    |   11705    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      10    |      11    |      24    |      14    |
|       from large pool |       7    |       9    |      21    |      14    |
|       from small pool |       3    |       3    |       3    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      16    |      19    |    8288    |    8272    |
|       from large pool |       3    |       8    |    2629    |    2626    |
|       from small pool |      13    |      16    |    5659    |    5646    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

This script is now ready for profiling; it contains NVTX markers for Nsight, an optional PyTorch profiler, and memory summary output for debugging allocations.
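
If you run it with --profile, the PyTorch profiler writes a TensorBoard trace into ./tb_logs. When all you need is a quick operator-level summary in the terminal, the same torch.profiler API can print an aggregated table directly. The following is a minimal, self-contained sketch that profiles a single convolution rather than the full training script:

python - <<'PY'
import torch
from torch.profiler import profile, ProfilerActivity

# A tiny GPU workload so the summary table stays short and readable
conv = torch.nn.Conv2d(3, 32, 3, stride=2, padding=1).cuda()
x = torch.randn(64, 3, 224, 224, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = conv(x)
    torch.cuda.synchronize()

# Per-operator totals, sorted by GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
PY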

Step 4 – Monitoring Real-Time GPU Usage

Before running deep profiling sessions, it’s useful to watch GPU usage in real time. This helps confirm that the GPU is being utilized as expected and can quickly reveal issues like low utilization, excessive memory use, or thermal throttling.

1. Using nvidia-smi
The simplest tool is NVIDIA’s nvidia-smi, which comes with the GPU driver. Run.

watch -n 1 nvidia-smi

This command updates every second and shows:

  • GPU utilization percentage
  • Memory usage
  • Temperature
  • Running processes

2. Using nvtop for Interactive Monitoring
nvtop is like htop but for GPUs. It provides a colorful, real-time display.

nvtop

With nvtop, you can:

  • Monitor per-process GPU usage
  • Track VRAM consumption over time
  • See compute vs. memory utilization

3. Using nvidia-smi dmon for Detailed Stats
For low-level telemetry such as power consumption and PCIe throughput, use.

nvidia-smi dmon -s pucvmt

The -s pucvmt selector samples power draw and temperature (p), utilization (u), clock speeds (c), power/thermal violations (v), memory usage (m), and PCIe throughput (t). This is helpful for detecting bottlenecks caused by thermal limits or power constraints.
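
4. Logging GPU Metrics from a Script
If you also want to record these metrics programmatically, for example alongside a longer training run, you can poll nvidia-smi's query interface from Python. The sketch below is one way to do it; the filename monitor_gpu.py and the chosen fields are just examples.

nano monitor_gpu.py

Add the following code.

import subprocess
import time

# GPU fields to sample once per second via nvidia-smi's query interface
QUERY = "utilization.gpu,memory.used,memory.total,temperature.gpu"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    # One CSV line per GPU; report the first GPU's values
    util, mem_used, mem_total, temp = [v.strip() for v in result.stdout.splitlines()[0].split(",")]
    print(f"GPU util {util}% | memory {mem_used}/{mem_total} MiB | temp {temp}C")
    time.sleep(1)

Run it in a second terminal with python3 monitor_gpu.py while train.py is executing, and stop it with Ctrl+C.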

Step 5 – Profiling with Nsight Systems (nsys)

Nsight Systems is NVIDIA’s system-wide performance analysis tool. It helps visualize how your training code interacts with the CPU, GPU, and operating system, making it easier to spot bottlenecks in kernel launches, data loading, or synchronization.

1. Create a directory to save the report.

mkdir -p reports

2. Run an nsys profile on one epoch of the CNN training script to keep the report small.

nsys profile --trace=cuda,nvtx,osrt \
  -o reports/nsys_smoke \
  python3 train.py --epochs 1 --batch-size 64 --workers 4 --amp
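
For longer real-world runs you may not want to trace the whole process. nsys can restrict recording to a window you mark from inside the program (--capture-range=cudaProfilerApi; see nsys profile --help for the related capture-range options), and PyTorch exposes the matching calls in torch.cuda.profiler. The following standalone sketch illustrates the idea; the filename capture_window.py is just an example.

nano capture_window.py

Add the following code.

import torch
import torch.nn as nn

# Sketch: bracket the steps you want recorded with cudaProfilerStart/Stop.
# Run it under, for example:
#   nsys profile --trace=cuda,nvtx --capture-range=cudaProfilerApi -o reports/nsys_window python3 capture_window.py
model = nn.Conv2d(3, 32, 3, stride=2, padding=1).cuda()
x = torch.randn(64, 3, 224, 224, device="cuda")

for step in range(80):
    if step == 50:
        torch.cuda.profiler.start()   # recording begins here, once execution is in steady state
    loss = model(x).sum()
    loss.backward()
    if step == 60:
        torch.cuda.profiler.stop()    # recording ends here

torch.cuda.synchronize()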

3. After profiling, generate a summary.

nsys stats reports/nsys_smoke.nsys-rep \
  | tee reports/nsys_smoke_stats.txt

Output.

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)     StdDev (ns)    Style    Range  
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  ------------  -------  --------
     70.2    3,687,984,904          1  3,687,984,904.0  3,687,984,904.0  3,687,984,904  3,687,984,904           0.0  PushPop  :epoch_0
     29.8    1,568,095,057        101     15,525,693.6      6,956,237.0      6,786,630    871,783,914  86,053,398.2  PushPop  :batch  

Processing [reports/nsys_smoke.sqlite] with [/opt/nvidia/nsight-systems/2024.6.2/host-linux-x64/reports/osrt_sum.py]... 

 ** OS Runtime Summary (osrt_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)     Min (ns)     Max (ns)      StdDev (ns)            Name         
 --------  ---------------  ---------  -------------  ------------  ----------  -------------  -------------  ----------------------
     35.8   16,379,720,411        207   79,129,084.1   3,912,872.0      26,329  1,892,649,216  342,138,838.6  pthread_cond_wait     
     31.4   14,374,694,895        125  114,997,559.2  98,891,000.0      28,279  1,153,881,169  182,208,818.2  sem_wait              
     16.3    7,478,573,597        174   42,980,308.0  13,189,978.0       1,542    763,891,502   72,151,566.3  poll                  
     10.0    4,557,471,910        759    6,004,574.3      27,657.0       1,302    502,654,188   54,197,437.6  pthread_cond_timedwait
      4.2    1,915,698,569         91   21,051,632.6  13,376,724.0   1,202,730    147,786,074   18,470,602.4  sem_clockwait         
      0.9      434,767,656      1,023      424,992.8      10,137.0       1,024    152,203,095    5,252,250.5  ioctl                 
      0.8      369,512,061      2,073      178,249.9       2,477.0       2,057      5,530,737      774,095.6  munmap                
      0.1       54,252,032          4   13,563,008.0  12,635,347.0  11,484,002     17,497,336    2,696,547.7  fork                  
      0.1       40,884,509      3,325       12,296.1       2,926.0       1,000      2,499,701       66,540.4  read                  
      0.1       36,920,409        921       40,087.3       5,218.0       1,077      3,357,386      288,751.3  pthread_cond_signal   
      0.1       26,573,233          2   13,286,616.5  13,286,616.5   1,785,896     24,787,337   16,264,474.9  pthread_rwlock_wrlock 
      0.0       19,158,656      9,863        1,942.5       1,682.0       1,000         23,503        1,120.4  stat64                
      0.0       15,520,576        108      143,709.0      15,325.0       1,013     11,797,602    1,138,827.1  pthread_mutex_lock    
      0.0       15,417,551     13,104        1,176.6       1,156.0       1,000         20,377          371.3  lstat64               
      0.0       10,395,002        937       11,093.9       3,281.0       1,003        436,038       27,260.8  write   

Step 6 – Profiling with Nsight Compute (ncu)

While Nsight Systems gives you a broad view of CPU-GPU interaction, Nsight Compute focuses on the low-level performance of individual CUDA kernels. It provides metrics such as warp execution efficiency, memory throughput, and bottleneck analysis.

1. Run a kernel-level profile using the speed-of-light preset to measure theoretical vs. achieved performance.

ncu --set speed-of-light \
    --target-processes application-only \
    --launch-skip 50 --launch-count 1 \
    --force-overwrite \
    --export reports/ncu_step \
    python3 -u train.py --epochs 1 --batch-size 64 --workers 0 --amp

Once the run completes, you’ll have a .ncu-rep file in the reports directory.

==PROF== Report: /root/gpu-profiling/reports/ncu_step.ncu-rep

Explanation:

  • --set speed-of-light → capture a wide range of performance metrics.
  • --target-processes application-only → avoid profiling system processes.
  • --launch-skip 50 → skip the first 50 kernel launches to focus on steady-state execution.
  • --launch-count 1 → profile only one kernel launch to keep the report small.
  • --export reports/ncu_step → save results for later viewing in the Nsight Compute GUI.

2. You can also open the report and extract details directly in the terminal.

ncu --import reports/ncu_step.ncu-rep --page summary
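
If you want a cleaner, more repeatable target for kernel-level profiling than the full training script, a tiny standalone workload also works well. The sketch below launches a handful of identical half-precision matrix multiplies; the filename matmul_probe.py is just an example.

nano matmul_probe.py

Add the following code.

import torch

# A small, repeatable workload: 20 identical half-precision matrix multiplies
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for _ in range(20):
    c = a @ b                    # each iteration launches GEMM kernel(s)

torch.cuda.synchronize()         # wait for all kernels to finish before exiting
print("mean:", c.float().mean().item())

You can point the same ncu command at this script, keeping --launch-skip and --launch-count to select a steady-state launch, and compare its speed-of-light metrics against the kernels from your model.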

Conclusion

Profiling and debugging GPU performance involves more than just running commands; it’s about understanding where your training workflow loses time and efficiency. On an Ubuntu 24.04 GPU server, Nsight Systems gives you a high-level view of how your code interacts with the CPU, GPU, and operating system. At the same time, Nsight Compute lets you drill down to individual CUDA kernels to see how effectively they are using the hardware.