In Natural Language Processing (NLP), raw text data must be transformed into a structured format before it can be fed into machine learning models. The first step of that transformation is tokenization: breaking text down into smaller pieces, such as words, subwords, or characters, so that models like BERT or GPT can process it.

Tokenization is often the first and most frequent step in any NLP workflow. Whether you’re generating embeddings, training a classifier, or performing sentiment analysis, every input sentence goes through this stage. That’s why optimizing tokenization can significantly improve the overall performance of your pipeline, especially when you’re processing thousands or even millions of texts.
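
As a quick illustration in plain Python, word-level and character-level tokenization are just two granularities of the same idea; subword tokenizers, which you'll work with later in this guide, sit in between.

text = "Tokenization turns raw text into tokens."

# Word-level tokens: split on whitespace
print(text.split())

# Character-level tokens: every character becomes its own token
print(list(text))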

In this tutorial, you’ll learn how to set up your Ubuntu 24.04 GPU server for NLP tokenization.

Prerequisites

  • An Ubuntu 24.04 server with an NVIDIA GPU.
  • A non-root user with sudo privileges.
  • NVIDIA drivers installed on your server.

Step 1: Set up Python Environment

In this section, you’ll set up a clean Python environment and install the necessary libraries to run tokenization and benchmarking tasks using GPU-accelerated PyTorch and Hugging Face Transformers.

1. Update the package list and install dependencies.

sudo apt update -y
sudo apt install -y python3 python3-pip python3-venv git

2. Create and activate a virtual environment.

python3 -m venv venv
source venv/bin/activate

3. Upgrade pip to the latest version, then install the CUDA 12.1 build of PyTorch along with Hugging Face Transformers.

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers

4. Before running any GPU-based operations, confirm that PyTorch detects the GPU.

python3 -c "import torch; print(torch.cuda.is_available())"

Output.

True
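
If you want more detail than True, you can also print the name of the detected GPU; torch.cuda.get_device_name is part of PyTorch's standard CUDA API, and the exact name in the output depends on your hardware.

python3 -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU detected')"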

Step 2: Benchmark Tokenizer Performance on CPU vs GPU

Now that your environment is ready, let’s measure how long it takes to tokenize a large batch of text using Hugging Face’s AutoTokenizer. While tokenizers don’t perform computation directly on the GPU, their output tensors (like input_ids and attention_mask) can be transferred to GPU memory for downstream model inference. This helps reduce the bottleneck in end-to-end NLP pipelines.

1. Create a new script file to benchmark tokenization.

nano gpu_tokenizer.py

Add the following code.

# gpu_tokenizer.py

import time
from transformers import AutoTokenizer
import torch

# Load fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Create dummy batch
texts = ["Tokenize this sentence for GPU!"] * 10000

# --- CPU Tokenization ---
start_cpu = time.time()
cpu_encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
end_cpu = time.time()
print("✅ CPU tokenization time:", round(end_cpu - start_cpu, 2), "seconds")

# --- GPU Tokenization (moving outputs to CUDA) ---
start_gpu = time.time()
gpu_encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
gpu_encoded = {k: v.to("cuda") for k, v in gpu_encoded.items()}
end_gpu = time.time()
print("✅ GPU tokenization + transfer time:", round(end_gpu - start_gpu, 2), "seconds")

# Check device
print("📍input_ids tensor device:", gpu_encoded['input_ids'].device)

2. Run the script.

python3 gpu_tokenizer.py

Output.

✅ CPU tokenization time: 0.23 seconds
✅ GPU tokenization + transfer time: 0.45 seconds
📍input_ids tensor device: cuda:0

In both runs the tokenization itself happens on the CPU; the second timing simply adds the cost of copying the resulting tensors into GPU memory. That transfer lets you feed the tokenizer output straight into a model running on the GPU, which is where the real savings in an end-to-end pipeline come from.
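
For example, the tensors you just moved to CUDA can be passed straight into a BERT model for inference. The following sketch (the file name gpu_inference.py and the batch size are just suggestions) reuses the same bert-base-uncased checkpoint on a small batch to keep memory use modest.

# gpu_inference.py

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
model = AutoModel.from_pretrained("bert-base-uncased").to("cuda").eval()

texts = ["Tokenize this sentence for GPU!"] * 32  # small batch for illustration

# Tokenize on the CPU, then move the tensors to GPU memory
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
encoded = {k: v.to("cuda") for k, v in encoded.items()}

# Run inference on the GPU without tracking gradients
with torch.no_grad():
    outputs = model(**encoded)

print("Hidden states shape:", outputs.last_hidden_state.shape)
print("Hidden states device:", outputs.last_hidden_state.device)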

Step 3: Tokenization with SentenceTransformers and GPU

If you’re working with sentence-level representations, SentenceTransformers is a powerful library that simplifies embedding generation. It handles tokenization and model inference under the hood—and yes, it runs smoothly on the GPU.

This section will show you how to use SentenceTransformers on your Ubuntu 24.04 GPU server to generate high-quality sentence embeddings quickly.

1. Install SentenceTransformers.

pip install sentence-transformers

2. Create the embedding script.

nano sentence_embed.py

Add the following code.

from sentence_transformers import SentenceTransformer

# Load the model directly onto the GPU
model = SentenceTransformer('all-MiniLM-L6-v2', device="cuda")

# Encode sentences
sentences = ["Tokenize this using GPU", "Sentence transformers are powerful"]
embeddings = model.encode(sentences, device="cuda")

# Output embedding shape
print("Embedding shape:", embeddings.shape)

3. Run the script.

python3 sentence_embed.py

Output.

Embedding shape: (2, 384)

You now have 384-dimensional embeddings for the two sentences, with the model inference running on the GPU. The same approach scales to thousands of sentences for downstream tasks like clustering, semantic search, or classification, as sketched below.
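
As a quick illustration of one such task, the sketch below (the file name semantic_search.py and the example sentences are just placeholders) reuses the same model for a tiny semantic search: it ranks a small corpus against a query by cosine similarity using util.cos_sim, which ships with sentence-transformers.

# semantic_search.py

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2', device="cuda")

corpus = [
    "GPUs accelerate deep learning workloads.",
    "Tokenization splits text into subword units.",
    "Ubuntu 24.04 is a long-term support release.",
]
query = "How is text split into tokens?"

# Encode the corpus and the query on the GPU, keeping the results as tensors
corpus_emb = model.encode(corpus, device="cuda", convert_to_tensor=True)
query_emb = model.encode(query, device="cuda", convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query
scores = util.cos_sim(query_emb, corpus_emb)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {sentence}")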

Step 4: Implementing Byte-Pair Encoding (BPE) from Scratch

Byte-Pair Encoding (BPE) is a widely used tokenization algorithm in modern NLP models like GPT and RoBERTa. Instead of relying on a pre-built library, you'll implement a basic version of BPE in Python. This will help you understand how text is broken down into subword units.

1. Create the BPE script.

nano bpe_trainer.py

Add the following code.

from collections import defaultdict

# Toy corpus: word -> frequency
corpus = {"cat": 5, "cap": 3, "can": 2, "bat": 4, "bats": 2}

def get_tokenized_corpus(corpus):
    # Start with every word split into individual characters
    return {tuple(word): freq for word, freq in corpus.items()}

def get_pair_freqs(tokenized):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = defaultdict(int)
    for word, freq in tokenized.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i+1])] += freq
    return pairs

def merge_pair(pair, corpus):
    # Replace each occurrence of the chosen pair with a single merged symbol
    new_corpus = {}
    for word, freq in corpus.items():
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word)-1 and (word[i], word[i+1]) == pair:
                new_word.append(''.join(pair))
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus[tuple(new_word)] = freq
    return new_corpus

# Perform five merge steps, always merging the most frequent pair
tokenized = get_tokenized_corpus(corpus)
for _ in range(5):
    pair_freqs = get_pair_freqs(tokenized)
    best_pair = max(pair_freqs, key=pair_freqs.get)
    tokenized = merge_pair(best_pair, tokenized)
    print(f"Merged {best_pair} → {''.join(best_pair)}")
    print("Updated:", tokenized)

2. Run the script.

python3 bpe_trainer.py

Output.

Merged ('a', 't') → at
Updated: {('c', 'at'): 5, ('c', 'a', 'p'): 3, ('c', 'a', 'n'): 2, ('b', 'at'): 4, ('b', 'at', 's'): 2}
Merged ('b', 'at') → bat
Updated: {('c', 'at'): 5, ('c', 'a', 'p'): 3, ('c', 'a', 'n'): 2, ('bat',): 4, ('bat', 's'): 2}
Merged ('c', 'at') → cat
Updated: {('cat',): 5, ('c', 'a', 'p'): 3, ('c', 'a', 'n'): 2, ('bat',): 4, ('bat', 's'): 2}
Merged ('c', 'a') → ca
Updated: {('cat',): 5, ('ca', 'p'): 3, ('ca', 'n'): 2, ('bat',): 4, ('bat', 's'): 2}
Merged ('ca', 'p') → cap
Updated: {('cat',): 5, ('cap',): 3, ('ca', 'n'): 2, ('bat',): 4, ('bat', 's'): 2}

You’ve now built a mini BPE tokenizer. While this is simplified and doesn’t include vocabulary size or merges file output, it teaches the core concept of subword merging based on frequency.
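
For comparison, the Hugging Face tokenizers library (already present in your environment as a dependency of transformers) implements the same merge-based idea with a configurable vocabulary size and the ability to save the learned merges. Here is a minimal sketch; the file name hf_bpe.py, the toy training text, and the vocabulary size are just placeholders.

# hf_bpe.py

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty BPE tokenizer and train it on a tiny in-memory corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["cat cap can bat bats"] * 10, trainer=trainer)

# Inspect the learned subword segmentation and save the vocabulary and merges to disk
print(tokenizer.encode("bats cap").tokens)
tokenizer.save("bpe_tokenizer.json")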

Conclusion

Tokenization plays a critical role in every NLP pipeline, especially when processing large volumes of text. In this guide, you set up an Ubuntu 24.04 GPU server, installed the necessary libraries, and benchmarked tokenization performance using Hugging Face Transformers on both CPU and GPU. You also generated sentence embeddings with SentenceTransformers and explored how Byte-Pair Encoding works by implementing it from scratch.