Toxic Comment Detection with TensorFlow on an Ubuntu GPU Server

Table of Contents

Prerequisites
Step 1: Setting Up the Environment
Step 2: Data Preparation
Step 3: Preprocess Data
Step 4: Model Training
Step 5: Deployment with FastAPI
Conclusion

Online platforms face increasing challenges in moderating user-generated content. Toxic comments—including insults, threats, hate speech, and obscene language—can create hostile environments and drive users away. Manual moderation doesn’t scale effectively for large platforms, making automated detection systems essential.

In this article, we’ll build a toxic comment classification system using TensorFlow on an Ubuntu server with GPU acceleration.

Prerequisites

An Ubuntu 24.04 server with an NVIDIA GPU.
A non-root user with sudo privileges.
NVIDIA drivers installed.

Step 1: Setting Up the Environment

Before we begin processing data or training models, we need to set up our Ubuntu server environment with the necessary dependencies.

1. Install Python, pip, the venv module, and unzip.

sudo apt install -y python3 python3-pip python3-venv unzip

2. Create a virtual environment for your project.

python3 -m venv toxic-env

3. Activate the virtual environment.

source toxic-env/bin/activate

4. Upgrade pip to the latest version.

pip install --upgrade pip

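5. Install TensorFlow and the other packages. The and-cuda extra pulls in the CUDA libraries that TensorFlow needs to run on the GPU.

pip install 'tensorflow[and-cuda]' pandas numpy scikit-learn matplotlib nltk kaggle uvicorn fastapi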
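
Before moving on, it can help to confirm that TensorFlow actually sees the GPU. The following is a minimal sketch rather than part of the original setup: the helper name list_gpus is hypothetical, and it returns None when TensorFlow cannot be found so that the check fails softly.

```python
import importlib.util


def list_gpus():
    """Return the GPU device names TensorFlow can see, or None if TensorFlow is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU detected; training would fall back to the CPU.")
```

An empty list usually points to a driver or CUDA library problem, which is worth resolving before starting a long training run.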

Step 2: Data Preparation

We’ll use the Jigsaw Toxic Comment Classification Challenge dataset from Kaggle, which contains Wikipedia comments labeled for various types of toxicity.

1. Create a directory structure for your project.

mkdir -p toxicity-detector/{data,models,src}

2. Navigate to the toxicity-detector directory.

cd toxicity-detector

3. Log in to the Kaggle site, download the Jigsaw dataset file called archive.zip, and place it in the /root directory.

4. Unzip archive.zip into the data/raw directory.

unzip /root/archive.zip -d data/raw/
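
The later scripts read the comment_text column and the six label columns from data/raw/train.csv, so a quick header check can catch a wrong or incomplete download early. This is a small sketch using only the standard library; the helper name missing_columns is hypothetical, and it assumes the archive unpacks train.csv directly into data/raw.

```python
import csv
import os

# Columns that preprocess.py and train.py read from train.csv
EXPECTED_COLUMNS = [
    "comment_text", "toxic", "severe_toxic", "obscene",
    "threat", "insult", "identity_hate",
]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [name for name in expected if name not in header]


if __name__ == "__main__":
    path = "data/raw/train.csv"
    if not os.path.exists(path):
        print(f"{path} not found; check where the archive was extracted.")
    else:
        missing = missing_columns(path)
        print("All expected columns present." if not missing else f"Missing: {missing}")
```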

Step 3: Preprocess Data

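Raw comments contain mixed case, punctuation, numbers, and inflected word forms. The preprocessing script lowercases each comment, tokenizes it, drops every token that is not purely alphabetic, and lemmatizes the remaining words.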

1. Create a preprocess.py script.

nano src/preprocess.py

Add the following code.

import multiprocessing as mp
import os

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Keep NLTK resources in the user's home directory
nltk_dir = os.path.expanduser('~/nltk_data')
os.makedirs(nltk_dir, exist_ok=True)
nltk.data.path.append(nltk_dir)

# Recent NLTK releases load the tokenizer from 'punkt_tab'; older ones use 'punkt'
for resource in ['punkt', 'punkt_tab', 'wordnet', 'omw-1.4']:
    nltk.download(resource, download_dir=nltk_dir, quiet=True)

# Create the lemmatizer once instead of once per comment
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Lowercase, tokenize, keep alphabetic tokens, and lemmatize them."""
    tokens = word_tokenize(str(text).lower())
    return ' '.join(lemmatizer.lemmatize(w) for w in tokens if w.isalpha())

if __name__ == "__main__":
    df = pd.read_csv('data/raw/train.csv')
    # Clean the comments in parallel, one worker per CPU core
    with mp.Pool(mp.cpu_count()) as pool:
        df['clean_text'] = pool.map(clean_text, df['comment_text'], chunksize=1000)
    os.makedirs('data/processed', exist_ok=True)
    df.to_csv('data/processed/train_clean.csv', index=False)
    print("Preprocessing completed successfully!")

2. Run the preprocessing script.

python3 src/preprocess.py
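
Because clean_text keeps only alphabetic tokens, a comment made up entirely of digits or punctuation ends up as an empty string. pandas reads empty CSV fields back as NaN, and the astype(str) call in the training script then turns them into the literal string 'nan'. The sketch below counts such rows so you can decide whether they matter; the helper name count_empty_clean_rows is hypothetical, and it assumes the output file written by preprocess.py.

```python
import csv
import os


def count_empty_clean_rows(csv_path, column="clean_text"):
    """Return (empty_rows, total_rows) for the given text column."""
    total = empty = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            # A missing or whitespace-only value counts as empty
            if not (row.get(column) or "").strip():
                empty += 1
    return empty, total


if __name__ == "__main__":
    path = "data/processed/train_clean.csv"
    if os.path.exists(path):
        empty, total = count_empty_clean_rows(path)
        print(f"{empty} of {total} comments are empty after cleaning.")
    else:
        print(f"{path} not found; run src/preprocess.py first.")
```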

Step 4: Model Training

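Our model architecture uses a bidirectional LSTM network. It reads each comment both left to right and right to left, so the representation of every token reflects the words before and after it, which suits sequence classification tasks such as detecting toxic language.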
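
To get a feel for the size of this model, the trainable parameters can be counted by hand from the layer sizes used in train.py: a 50,000-token vocabulary with 128-dimensional embeddings, bidirectional LSTMs with 64 and 32 units, and dense layers with 64 and 6 units. The sketch below applies the standard parameter formulas for Keras LSTM and Dense layers; the function names are illustrative only.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


VOCAB_SIZE, EMBED_DIM = 50000, 128

layers = {
    "embedding": VOCAB_SIZE * EMBED_DIM,
    # A bidirectional wrapper holds two LSTMs, so the count doubles
    "bilstm_64": 2 * lstm_params(EMBED_DIM, 64),
    # The second BiLSTM receives the concatenated 2 * 64 outputs of the first
    "bilstm_32": 2 * lstm_params(2 * 64, 32),
    "dense_64": dense_params(2 * 32, 64),
    "dense_6": dense_params(64, 6),
}
total = sum(layers.values())

if __name__ == "__main__":
    for name, count in layers.items():
        print(f"{name:>10}: {count:>9,}")
    print(f"{'total':>10}: {total:>9,}")
```

The embedding table accounts for 6,400,000 of the 6,544,582 parameters, so the vocabulary size and embedding width are the main levers if the model needs to shrink.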

1. Create a train.py script.

nano src/train.py

Add the following code.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Embedding, Bidirectional, LSTM, Dense, Input
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
import pickle

train_df = pd.read_csv('data/processed/train_clean.csv')
texts = train_df['clean_text'].astype(str).tolist()
labels = train_df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].values

X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create vectorizer separately
vectorizer = TextVectorization(max_tokens=50000, output_sequence_length=200, output_mode='int')
vectorizer.adapt(X_train)

# Vectorize data before training
X_train_vec = vectorizer(np.array([[s] for s in X_train])).numpy()
X_val_vec = vectorizer(np.array([[s] for s in X_val])).numpy()

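# Save vectorizer separately. The vocabulary is stored explicitly because,
# depending on the Keras version, get_weights() may not include it.
with open('models/vectorizer.pkl', 'wb') as f:
    pickle.dump({
        'config': vectorizer.get_config(),
        'weights': vectorizer.get_weights(),
        'vocabulary': vectorizer.get_vocabulary(),
    }, f)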

# Build model without vectorizer
model = Sequential([
    Input(shape=(200,)),
    Embedding(50000, 128, mask_zero=True),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(64, activation='relu'),
    Dense(6, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model on vectorized data
model.fit(X_train_vec, y_train, validation_data=(X_val_vec, y_val), epochs=10)

# Save model
model.save('models/toxicity.h5')

2. Run the script to train the model.

python3 src/train.py
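
The vectorization step in train.py turns each comment into a fixed-length row of integer token IDs. The following pure-Python sketch illustrates the idea under simplifying assumptions: it splits on whitespace only, skips the lowercasing and punctuation stripping that TextVectorization applies by default, and may break frequency ties differently. The names build_vocab and encode are hypothetical.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # 0 pads short sequences, 1 marks out-of-vocabulary tokens


def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to IDs starting at 2."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = counts.most_common(max_tokens - 2)  # IDs 0 and 1 are reserved
    return {token: idx for idx, (token, _) in enumerate(most_common, start=2)}


def encode(text, vocab, length):
    """Convert text to exactly `length` token IDs by truncating or padding."""
    ids = [vocab.get(token, OOV_ID) for token in text.split()][:length]
    return ids + [PAD_ID] * (length - len(ids))


if __name__ == "__main__":
    vocab = build_vocab(["you are rude", "you are kind"], max_tokens=4)
    print(vocab)
    print(encode("you are awful", vocab, length=5))
```

With max_tokens=4 only the two most frequent words receive their own IDs, so "awful" maps to the out-of-vocabulary ID and the row is padded with zeros, which is what mask_zero=True in the Embedding layer then ignores.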

Step 5: Deployment with FastAPI

For production deployment, we create a REST API using FastAPI that exposes our model’s prediction capabilities.

1. Create an api.py script.

nano src/api.py

Add the following code:

import pickle

import tensorflow as tf
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.models import load_model

app = FastAPI()

# Label order matches the columns used in train.py
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Load the trained model
model = load_model('models/toxicity.h5')

# Load the saved vectorizer state
with open('models/vectorizer.pkl', 'rb') as f:
    vec_data = pickle.load(f)

vectorizer = TextVectorization.from_config(vec_data['config'])
if 'vocabulary' in vec_data:
    # Restore the exact training vocabulary when the pickle provides it
    vectorizer.set_vocabulary(vec_data['vocabulary'])
else:
    # Otherwise adapt on dummy data to initialize the internal tables,
    # then restore the saved weights
    vectorizer.adapt(tf.data.Dataset.from_tensor_slices(["dummy"]).batch(1))
    vectorizer.set_weights(vec_data['weights'])

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
async def predict(request: TextRequest):
    try:
        # Vectorize the input text into a batch containing one sequence
        input_vector = vectorizer(tf.constant([request.text]))

        # Make the prediction and take the single row of six scores
        prediction = model.predict(input_vector, verbose=0)[0]
    except Exception as e:
        # Report failures as HTTP 500 rather than a 200 response with an error body
        raise HTTPException(status_code=500, detail=str(e))

    return {label: float(score) for label, score in zip(LABELS, prediction)}

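2. Start the server in the background from the toxicity-detector directory, since the script loads the model files by relative path. The --reload flag restarts the server whenever the code changes, which is convenient during development; omit it in production.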

uvicorn src.api:app --reload &

3. Test the API with a sample request.

curl -X POST 'http://localhost:8000/predict' \
-H 'Content-Type: application/json' \
-d '{"text":"You are an idiot!"}'

Sample output. The exact scores depend on your trained model.

{"toxic":0.5728890299797058,"severe_toxic":0.04583415016531944,"obscene":0.4520402252674103,"threat":0.007368859834969044,"insult":0.19956137239933014,"identity_hate":0.023788010701537132}

The output of the API request represents the model’s prediction scores for six different types of toxicity in the input text. Each score is a probability value between 0 and 1, where higher values indicate greater confidence that the text contains that particular type of toxicity.
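
To act on these scores, a caller typically compares each one against a cutoff. The sketch below shows one way to do that; the function name flagged_labels and the 0.5 default threshold are assumptions for illustration, and a real deployment would tune the threshold per label on validation data.

```python
def flagged_labels(scores, threshold=0.5):
    """Return the labels whose score meets the threshold, highest score first."""
    hits = [(label, score) for label, score in scores.items() if score >= threshold]
    return [label for label, _ in sorted(hits, key=lambda item: item[1], reverse=True)]


if __name__ == "__main__":
    # Scores copied from the sample response above, rounded for readability
    sample = {
        "toxic": 0.5729, "severe_toxic": 0.0458, "obscene": 0.4520,
        "threat": 0.0074, "insult": 0.1996, "identity_hate": 0.0238,
    }
    print(flagged_labels(sample))        # only "toxic" reaches 0.5
    print(flagged_labels(sample, 0.4))   # "obscene" also passes a 0.4 cutoff
```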

Conclusion

In this tutorial, we built a toxic comment classification system with TensorFlow on an Ubuntu GPU server: an NLTK preprocessing script, a bidirectional LSTM trained on the Jigsaw dataset, and a FastAPI endpoint that serves its predictions. GPU acceleration shortens training, and keeping preprocessing, training, and serving in separate scripts lets you change or rerun each stage independently.

The system can be integrated into content moderation pipelines, forum platforms, or social media applications to automatically flag potentially harmful content for human review or automatic filtering.
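
As a starting point for such an integration, application code can call the /predict endpoint directly. The following is a minimal standard-library sketch, assuming the API from Step 5 is running on localhost port 8000; the helper names build_request and score_comment are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # endpoint started in Step 5


def build_request(text, url=API_URL):
    """Build the JSON POST request that the /predict endpoint expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_comment(text, url=API_URL, timeout=10):
    """Send one comment to the API and return the decoded score dictionary."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A moderation pipeline could call score_comment for each new comment and route anything above its chosen thresholds to a human review queue.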
