How to Use TensorBoard to Monitor GPU Training Metrics in Real Time on Ubuntu 24.04 Server

Table of Contents

Prerequisites
Step 1 - Install Python and TensorBoard
Step 2 - Create a Training Script with TensorBoard Logging
Step 3 - Launch TensorBoard to View GPU Metrics
Conclusion

When you’re training deep learning models on a GPU server, knowing what’s going on under the hood is just as important as writing the code. This is where TensorBoard comes in. It’s a powerful visualization tool that lets you monitor your training process in real time, right from your browser.

With TensorBoard, you can track training loss, GPU memory usage, and other performance metrics as your model learns. Instead of waiting until training is over to discover problems, you can spot issues like slow convergence, overfitting, or memory bottlenecks early on and make adjustments immediately.

In this guide, you’ll learn how to install TensorBoard on Ubuntu 24.04, integrate it with a PyTorch training script, and monitor your GPU metrics live.

Prerequisites

An Ubuntu 24.04 server with an NVIDIA GPU.
A root user or a non-root user with sudo privileges.
NVIDIA drivers installed on the server.

Step 1 – Install Python and TensorBoard

The first step is to set up Python, a virtual environment, and the required packages for training and monitoring. This keeps your dependencies isolated and avoids version conflicts with system packages.

1. Install Python and additional dependencies.

sudo apt update
sudo apt install -y python3 python3-venv python3-pip git

2. Create a virtual environment.

python3 -m venv tb-env
source tb-env/bin/activate

3. Upgrade pip to the latest version.

pip install --upgrade pip

4. Install PyTorch, torchvision, torchaudio, and TensorBoard.

pip install torch torchvision torchaudio tensorboard

5. To confirm TensorBoard is installed correctly, check its version.

tensorboard --version

Output.

2.20.0
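If you would rather confirm the whole environment from Python, the following is a minimal sketch that uses only the standard library. It reports whether the interpreter is running inside a virtual environment and which versions of the packages from this step are installed. The helper names in_virtualenv and installed_versions are illustrative and not part of any library.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Inside virtual environment:", in_virtualenv())
    wanted = ["torch", "torchvision", "torchaudio", "tensorboard"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the tb-env environment activated. Any package reported as "not installed" points to a failed pip install in the previous step.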

Step 2 – Create a Training Script with TensorBoard Logging

Now that TensorBoard is installed, the next step is to integrate it into a PyTorch training loop so you can log metrics in real time.

We’ll create a file named train_with_tb.py and write a simple MNIST digit classification model. Along with tracking the training loss per batch, we’ll also log GPU memory usage in MB, which is especially useful when working on a GPU server.

nano train_with_tb.py

Add the following code.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torch.utils.tensorboard import SummaryWriter

# Initialize TensorBoard writer
writer = SummaryWriter(log_dir="./tb_logs")

# Dataset & loader
transform = transforms.ToTensor()
train_data = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Model, loss, optimizer
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
).cuda()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
global_step = 0
for epoch in range(3):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()

        # Log loss per batch
        writer.add_scalar("Loss/train", loss.item(), global_step)

        # Log GPU memory currently allocated to tensors by PyTorch, in MB, per batch
        gpu_mem = torch.cuda.memory_allocated() / (1024 ** 2)
        writer.add_scalar("GPU/Memory_MB", gpu_mem, global_step)

        global_step += 1
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}, Average Loss: {avg_loss:.4f}")

writer.close()
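Note that torch.cuda.memory_allocated() only counts memory occupied by tensors that PyTorch manages. If you also want device-wide figures such as GPU utilization, one option is to query nvidia-smi, which ships with the NVIDIA driver. The sketch below is an optional extension and not part of the script above. The helper names parse_gpu_line and query_gpu and the tag GPU/Utilization_percent are hypothetical, and the code assumes nvidia-smi is on the PATH.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, returned as one CSV line per GPU.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line such as '37, 1024, 24576' into a dict of floats."""
    util, mem_used, mem_total = (float(part.strip()) for part in line.split(","))
    return {"util_percent": util, "mem_used_mb": mem_used, "mem_total_mb": mem_total}


def query_gpu(index=0):
    """Return stats for one GPU, or None when nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    command = [
        "nvidia-smi",
        f"--id={index}",
        f"--query-gpu={QUERY_FIELDS}",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=5
        )
        lines = result.stdout.strip().splitlines()
        return parse_gpu_line(lines[0]) if lines else None
    except (subprocess.SubprocessError, OSError, ValueError):
        # ValueError covers fields the GPU reports as unsupported (non-numeric).
        return None


# Possible use inside the training loop (writer and global_step come from
# train_with_tb.py). Spawning a process is slow, so sample every 50 steps:
#
#     if global_step % 50 == 0:
#         stats = query_gpu()
#         if stats is not None:
#             writer.add_scalar("GPU/Utilization_percent",
#                               stats["util_percent"], global_step)
```

Because each call starts a separate process, sampling every few dozen steps keeps the overhead small compared with querying on every batch.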

Now, run the training script.

python3 train_with_tb.py

Output.

100.0%
100.0%
100.0%
100.0%
Epoch 1, Average Loss: 0.3434
Epoch 2, Average Loss: 0.1530
Epoch 3, Average Loss: 0.1066
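The script logs one point per batch, so the x-axis in TensorBoard counts batches rather than epochs. As a quick worked example, the sketch below computes how many steps to expect, assuming the 60,000-image MNIST training set, the batch size of 64, and the 3 epochs used above. The function names are illustrative.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Batches a DataLoader yields per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def epoch_of_step(global_step, per_epoch):
    """Zero-based epoch that a zero-based global step belongs to."""
    return global_step // per_epoch


per_epoch = steps_per_epoch(60_000, 64)  # MNIST training set, batch_size=64
total_steps = per_epoch * 3              # three epochs, as in the script
print(per_epoch, total_steps)            # prints: 938 2814
```

With these values, step 938 on the chart marks the start of the second epoch and step 1876 the start of the third.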

Step 3 – Launch TensorBoard to View GPU Metrics

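1. Once training is complete (or even while it’s still running), you can launch TensorBoard to visualize the logged data. If training is still running, open a second terminal and activate the tb-env virtual environment there first.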

tensorboard --logdir ./tb_logs --port 6006 --host 0.0.0.0

Note:

--logdir ./tb_logs: Points TensorBoard to the directory where you saved your logs.
--port 6006: Specifies the network port to serve the dashboard on (6006 is the default).
--host 0.0.0.0: Makes TensorBoard listen on all network interfaces, allowing you to connect to it from your local machine using the server’s public IP address.

Output.

TensorBoard 2.20.0 at http://0.0.0.0:6006/ (Press CTRL+C to quit)

2. Now, you can access the TensorBoard web interface using the URL http://your-server-ip:6006 from your browser to visualize your training metrics.

Note: Before you can access the dashboard, you may need to open port 6006 in your server’s firewall. For example, on Ubuntu with UFW (Uncomplicated Firewall), run:

sudo ufw allow 6006/tcp
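If the dashboard still does not load, it helps to check whether the port is reachable at all before debugging TensorBoard itself. The following is a minimal sketch using Python's standard socket module, intended to be run from your local machine. The function name is_port_open is illustrative, and your-server-ip is the same placeholder used in the URL above.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, replacing the placeholder with your server's address:
#     print(is_port_open("your-server-ip", 6006))
```

A result of False usually means that TensorBoard is not running, that it is bound to a different interface, or that a firewall between you and the server is blocking the port.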

3. Click on the SCALARS tab to view the logged charts.

The Loss/train chart shows the model’s loss decreasing from around 0.4 to 0.1 over the training steps, with the expected batch-to-batch fluctuations. This indicates steady learning and convergence. A flat or rising curve would instead signal a training problem that calls for changes to the hyperparameters or the architecture.

The GPU/Memory_MB chart displays the GPU memory usage in megabytes (MB) throughout training.
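The batch-to-batch fluctuations in the loss curve can hide the underlying trend. TensorBoard offers a smoothing slider for scalar charts. The sketch below shows the general idea with a bias-corrected exponential moving average in plain Python. It is an illustration of the technique under that assumption, not a reproduction of TensorBoard's exact implementation, and the function name smooth is hypothetical.

```python
def smooth(values, weight=0.6):
    """Bias-corrected exponential moving average of a sequence of numbers.

    weight is the share of the previous average that is kept at each step:
    0.0 returns the input unchanged, values close to 1.0 smooth heavily.
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in the range [0.0, 1.0)")
    smoothed = []
    running = 0.0
    for count, value in enumerate(values, start=1):
        running = weight * running + (1.0 - weight) * value
        # Divide by (1 - weight^count) so early points are not biased toward 0.
        smoothed.append(running / (1.0 - weight ** count))
    return smoothed
```

You could apply this to the per-batch losses before logging a second scalar, although in practice the slider in the TensorBoard interface is usually enough.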

Conclusion

Using TensorBoard on Ubuntu 24.04 gives you a clear, real-time view of your model’s training progress and GPU usage without guesswork. By logging metrics like training loss and GPU memory consumption, you can quickly identify whether your model is converging, detect inefficiencies, and make informed adjustments during training.
