Imagine typing “a cat on a windowsill” and instantly finding the exact picture you had in mind, no manual tagging, no tedious searching. That’s the power of CLIP (Contrastive Language–Image Pretraining). Developed by OpenAI, CLIP learns to understand the relationship between images and text by mapping them into the same vector space.

In this tutorial, you’ll learn how to train a CLIP model on an Ubuntu 24.04 GPU server for image–text matching. We’ll fine-tune the model with your own dataset, use GPU acceleration for faster training, and test it with real examples.

Prerequisites

  • An Ubuntu 24.04 server with an NVIDIA GPU.
  • A non-root user or a user with sudo privileges.
  • NVIDIA drivers installed on your server.

Step 1 – Install and Set Up CLIP

Before training CLIP, you need to set up the environment on your Ubuntu 24.04 GPU server. This section gives you a fast, copy-paste setup so you can start right away.

1. Install system dependencies.

sudo apt update
sudo apt install -y python3 python3-venv python3-pip git wget unzip

2. Create and activate a virtual environment.

python3 -m venv clip-env
source clip-env/bin/activate

3. Upgrade pip to the latest version.

pip install --upgrade pip

4. Install PyTorch with CUDA support.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

5. Install CLIP dependencies.

pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

6. Verify GPU access.

python3 -c "import torch; print(torch.cuda.is_available())"

If everything is set up correctly, you’ll see the following output.

True
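
If you want a slightly more detailed check, you can also print the detected GPU and confirm that the clip package imports correctly. The optional snippet below is a small sketch, not part of the required setup; save it as gpu_check.py (a filename chosen here) and run it with python3 gpu_check.py.

# gpu_check.py (optional) - print GPU details and list the available CLIP models
import torch
import clip

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA version (PyTorch build):", torch.version.cuda)
else:
    print("No CUDA device detected; training will fall back to the CPU.")

print("Available CLIP models:", clip.available_models())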

Your GPU server is now ready for training CLIP. Next, we’ll prepare a dataset for image–text matching so the model has something to learn from.

Step 2 – Prepare the Dataset for CLIP Image-Text Matching

CLIP learns by pairing images with their matching text descriptions. You’ll need to organize your dataset so it’s easy to load during training.

1. Create dataset folders.

mkdir -p dataset/images

2. Create a captions file.

nano dataset/captions.txt

Add your image–caption pairs, one per line, in the following format.

img1.png|A cat sitting on a windowsill.
img2.png|A man riding a bicycle on the street.

3. Download some images to the images directory.

wget https://i.ibb.co/TM1bCfq8/img1.png -O dataset/images/img1.png
wget https://i.ibb.co/43YsGnq/img2.png -O dataset/images/img2.png
wget https://i.ibb.co/prv23zKw/test.jpg -O dataset/images/test.jpg
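
If you want to confirm the downloads and the caption format before moving on, the optional sketch below (not part of the original steps; check_downloads.py is a filename chosen here) opens each listed image with Pillow and splits each caption line on the first |.

# check_downloads.py (optional) - confirm each captioned image downloaded correctly
from pathlib import Path
from PIL import Image

image_dir = Path("dataset/images")

with open("dataset/captions.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or "|" not in line:
            continue
        filename, caption = line.split("|", 1)   # split on the first "|" only
        path = image_dir / filename
        Image.open(path).verify()                # raises if the file is missing or corrupt
        print(f"OK: {path} -> {caption}")

If the script prints an OK line for every pair, the images and captions are in place.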

Step 3 – Create the CLIP Training Script in Python

With your dataset ready, it’s time to write the Python script that will train CLIP on your images and captions.

1. Create the training script file.

nano train_clip.py

Add the following code.

# train_clip.py
import os
import math
from pathlib import Path
from contextlib import nullcontext

import torch
import torch.nn as nn
import torch.optim as optim
from PIL import Image
from torch.utils.data import Dataset, DataLoader
import clip

# -----------------------
# Config
# -----------------------
IMAGE_DIR = "dataset/images"
CAP_PATH = "dataset/captions.txt"
BATCH_SIZE = 16
EPOCHS = 5
LR = 5e-6
NUM_WORKERS = 2                     # DataLoader worker processes; adjust to your CPU core count
CKPT_DIR = Path("checkpoints")
CKPT_DIR.mkdir(exist_ok=True)
FINAL_WEIGHTS = "clip_finetuned.pt"

# -----------------------
# Dataset
# -----------------------
class ImageTextDataset(Dataset):
    def __init__(self, image_folder, captions_file, preprocess):
        self.image_folder = image_folder
        self.preprocess = preprocess
        self.data = []
        with open(captions_file, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or "|" not in line:
                    continue
                img, caption = line.split("|", 1)
                self.data.append((img, caption))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_name, caption = self.data[idx]
        img_path = os.path.join(self.image_folder, img_name)
        image = self.preprocess(Image.open(img_path).convert("RGB"))
        return image, caption

# -----------------------
# Model / Device
# -----------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Use FP32 master params for optimizer stability; AMP will handle compute casting
model = model.float()
model.train()

# -----------------------
# Data
# -----------------------
dataset = ImageTextDataset(IMAGE_DIR, CAP_PATH, preprocess)
loader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    pin_memory=(device == "cuda"),
    persistent_workers=(NUM_WORKERS > 0)
)

# -----------------------
# Optimizer / Loss
# -----------------------
optimizer = optim.AdamW(model.parameters(), lr=LR)
loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()

# -----------------------
# AMP (new API)
# -----------------------
if device == "cuda":
    autocast = lambda: torch.amp.autocast("cuda", dtype=torch.float16)
    scaler = torch.amp.GradScaler("cuda")
else:
    autocast = nullcontext
    scaler = None

# -----------------------
# Train Loop
# -----------------------
def run_epoch(epoch_idx: int):
    running = 0.0
    steps_per_epoch = math.ceil(len(dataset) / BATCH_SIZE)

    for step, (images, captions) in enumerate(loader, start=1):
        images = images.to(device, non_blocking=True)
        tokens = clip.tokenize(captions, truncate=True).to(device, non_blocking=True)

        with autocast():
            img_features = model.encode_image(images)
            txt_features = model.encode_text(tokens)

            # Normalize to unit length
            img_features = img_features / img_features.norm(dim=1, keepdim=True)
            txt_features = txt_features / txt_features.norm(dim=1, keepdim=True)

            # Similarity logits
            logits_per_image = img_features @ txt_features.t()
            logits_per_text  = txt_features @ img_features.t()

            # Contrastive targets
            ground_truth = torch.arange(len(images), device=device)
            total_loss = (loss_img(logits_per_image, ground_truth) +
                          loss_txt(logits_per_text,  ground_truth)) / 2

        optimizer.zero_grad(set_to_none=True)

        if scaler:
            scaler.scale(total_loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            total_loss.backward()
            optimizer.step()

        running += total_loss.item()
        if step % 10 == 0 or step == steps_per_epoch:
            avg = running / step
            print(f"Epoch {epoch_idx+1} Step {step}/{steps_per_epoch} - Loss: {avg:.4f}")

    # Save per-epoch checkpoint (state_dict only)
    ckpt_path = CKPT_DIR / f"epoch_{epoch_idx+1}.pt"
    torch.save(model.state_dict(), ckpt_path)
    print(f"Saved checkpoint: {ckpt_path}")

# -----------------------
# Execute Training
# -----------------------
for epoch in range(EPOCHS):
    run_epoch(epoch)

# Save final fine-tuned weights (state_dict only)
torch.save(model.state_dict(), FINAL_WEIGHTS)
print(f"Saved final weights to {FINAL_WEIGHTS}")

This script fine-tunes the CLIP ViT-B/32 model on a custom image–caption dataset:

  • Configuration – Sets dataset paths, hyperparameters (batch size, epochs, learning rate), checkpoint directory, and output model filename.
  • Dataset Class – Loads images and matching captions from a text file, applies CLIP preprocessing, and returns (image, caption) pairs.
  • Model Setup – Loads CLIP on GPU (if available), switches to training mode, and prepares FP32 weights for stability.
  • DataLoader – Creates a shuffled, multi-worker loader with pinned memory for faster GPU transfer.
  • Optimizer & Loss – Uses AdamW optimizer and cross-entropy loss for both image-to-text and text-to-image similarity.
  • Mixed Precision Training (AMP) – Speeds up training and reduces memory usage with half-precision on CUDA.
  • Training Loop – Encodes images and captions, normalizes embeddings, computes similarity logits, calculates the contrastive loss (the i-th image in a batch matches the i-th caption, so the targets are simply the indices 0 to N-1 along the diagonal of the similarity matrix), and updates weights using AMP if available.
  • Checkpointing – Saves model weights after each epoch and stores the final fine-tuned weights.

2. Run the training script.

python3 train_clip.py

Output.

Epoch 1 Step 1/1 - Loss: 0.6165
Saved checkpoint: checkpoints/epoch_1.pt
Epoch 2 Step 1/1 - Loss: 0.5577
Saved checkpoint: checkpoints/epoch_2.pt
Epoch 3 Step 1/1 - Loss: 0.4955
Saved checkpoint: checkpoints/epoch_3.pt
Epoch 4 Step 1/1 - Loss: 0.4474
Saved checkpoint: checkpoints/epoch_4.pt
Epoch 5 Step 1/1 - Loss: 0.4056
Saved checkpoint: checkpoints/epoch_5.pt
Saved final weights to clip_finetuned.pt
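
Because a state_dict is saved after every epoch, you can also resume an interrupted run instead of starting from scratch. The snippet below is a minimal sketch of that pattern (it is not part of train_clip.py, and checkpoints/epoch_3.pt is just an example path): reload the checkpoint into a freshly loaded CLIP model, then rebuild the DataLoader and optimizer and continue the loop.

# resume_clip.py (optional sketch) - reload a per-epoch checkpoint before continuing training
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()

# Example path; pick whichever epoch you want to resume from.
state = torch.load("checkpoints/epoch_3.pt", map_location=device)
model.load_state_dict(state)
model.train()
# ...recreate the dataset, DataLoader, and optimizer from train_clip.py, then run more epochs.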

Now you have a fine-tuned CLIP model plus a saved checkpoint for each epoch. Next, we’ll set up a quick dataset integrity check you can run before any future training run to catch problems early.

Step 4 – Verify Dataset Integrity Before Training

Before you kick off a training run, it’s worth checking whether your dataset is complete and correctly formatted. This quick check prevents wasted GPU time caused by missing images or broken captions.

1. Create the dataset check script.

nano check_dataset.py

Add the following code.

# check_dataset.py
from pathlib import Path

# Count caption lines that contain an image|caption pair.
with open("dataset/captions.txt", "r", encoding="utf-8") as f:
    pairs = sum(1 for line in f if "|" in line.strip())

print("Pairs:", pairs)
print("Images folder exists:", Path("dataset/images").exists())

2. Run the check.

python3 check_dataset.py

Output.

Pairs: 2
Images folder exists: True

If the number of pairs matches your expectation and the image folder exists, your dataset is ready for CLIP training.
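
You can also extend the check to catch captions that are too long for CLIP’s text encoder. The optional sketch below (not part of check_dataset.py; the filename is chosen here) flags any caption over the 77-token context limit, which clip.tokenize would otherwise silently truncate during training because train_clip.py passes truncate=True.

# check_caption_length.py (optional) - flag captions over CLIP's 77-token context limit
import clip

with open("dataset/captions.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or "|" not in line:
            continue
        name, caption = line.split("|", 1)
        try:
            clip.tokenize([caption])  # raises RuntimeError when a caption exceeds the limit
        except RuntimeError:
            print(f"Caption for {name} is too long and will be truncated during training.")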

Step 5 – Test the Fine-Tuned CLIP Model for Image-Text Matching

After training, you can load the fine-tuned weights and test how well CLIP matches images to text. This helps confirm that your model learned from the dataset.

1. Create the inference script.

nano test_clip_inference.py

Add the following code.

import torch, clip
from PIL import Image
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

weights_path = Path("clip_finetuned.pt")
if weights_path.exists():
    try:
        # Safer: only tensors, no arbitrary objects
        state = torch.load(weights_path, map_location=device, weights_only=True)
        missing, unexpected = model.load_state_dict(state, strict=False)
        print(f"Loaded fine-tuned weights from {weights_path}")
        if missing:
            print(f"Missing keys: {missing}")
        if unexpected:
            print(f"Unexpected keys: {unexpected}")
    except Exception as e:
        print(f"Could not load fine-tuned weights safely: {e}\nUsing base CLIP.")
else:
    print("No fine-tuned weights found; using base CLIP.")

image = preprocess(Image.open("dataset/images/test.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["A beautiful beach sunset", "A crowded city street"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    probs = (image_features @ text_features.T).softmax(dim=-1)

print("Probabilities:", probs.squeeze().tolist())

2. Run the script.

python3 test_clip_inference.py

Output.

Loaded fine-tuned weights from clip_finetuned.pt
Probabilities: [0.9970703125, 0.0031719207763671875]

In this example, the model is almost certain that the image matches the first caption, which confirms that your fine-tuned CLIP model works for image–text matching.
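
Beyond a two-caption test, the same model can rank a whole list of candidate captions for an image, which is the retrieval use case from the introduction. The sketch below is one optional way to do that (it is not part of test_clip_inference.py): it loads the fine-tuned weights, normalizes the embeddings, and sorts the captions by cosine similarity.

# rank_captions.py (optional sketch) - rank candidate captions for one image
import torch, clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.load_state_dict(torch.load("clip_finetuned.pt", map_location=device))
model.eval()

captions = [
    "A cat sitting on a windowsill.",
    "A man riding a bicycle on the street.",
    "A beautiful beach sunset",
]

image = preprocess(Image.open("dataset/images/test.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokens)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(0)

# Highest cosine similarity first
for score, caption in sorted(zip(similarities.tolist(), captions), reverse=True):
    print(f"{score:.3f}  {caption}")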

Conclusion

Training a CLIP model on an Ubuntu 24.04 GPU server gives you the ability to connect images and text in powerful ways. In this guide, you set up a GPU-ready environment, prepared a dataset of image–caption pairs, built a Python training script, enabled mixed precision for faster performance, saved checkpoints, and tested a fine-tuned model with real captions.