StarRocks is a blazing-fast analytical database optimized for modern analytics workloads. It supports semi-structured formats like JSON and nested data, making it an excellent fit for use cases involving raw event data, user activity logs, or application telemetry.

In this hands-on tutorial, you’ll learn how to deploy StarRocks on Ubuntu 24.04 with Docker, preprocess JSON using GPU acceleration, and run fast SQL queries using the StarRocks engine.

Prerequisites

  • An Ubuntu 24.04 server with an NVIDIA GPU.
  • A non-root user or a user with sudo privileges.
  • NVIDIA drivers installed on your server (you can verify this with the command below).
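
You can check that the GPU and driver are visible to the system with the nvidia-smi utility, which ships with the NVIDIA driver.

nvidia-smi

If the command prints a table showing your GPU model and driver version, the driver is working.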

Step 1: Install Required Packages

Install the essential tools needed to run StarRocks and preprocess JSON on the GPU: Python, Docker with the Compose plugin, and Java. (The MySQL client is installed later, in Step 3.)

First, update the package index.

apt update -y

Next, install the required packages.

apt install python3 python3-pip python3-venv docker.io docker-compose-v2 default-jdk -y
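
If you want to confirm that the Docker engine and the Compose plugin are available before moving on, check their versions.

docker --version
docker compose version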

Step 2: Deploy StarRocks with Docker Compose

We will use Docker Compose to deploy a single-node StarRocks cluster with Frontend (FE) and Backend (BE) services.

First, create a directory for your project and navigate inside it.

mkdir ~/starrocks && cd ~/starrocks

Next, create a docker-compose.yml file.

nano docker-compose.yml

Add the following configuration.

version: '3.8'
services:
  fe:
    image: starrocks/fe-ubuntu:3.2-latest
    container_name: starrocks-fe
    ports:
      - "8030:8030"    # Web UI
      - "9030:9030"    # MySQL protocol
    environment:
      - FE_SERVERS=starrocks-fe:9010
    command:
      - /opt/starrocks/fe/bin/start_fe.sh
    volumes:
      - ./fe-meta:/opt/starrocks/fe/meta
    restart: always

  be:
    image: starrocks/be-ubuntu:3.2-latest
    container_name: starrocks-be
    depends_on:
      - fe
    ports:
      - "8040:8040"    # BE port
    command:
      - /opt/starrocks/be/bin/start_be.sh
    environment:
      - FE_SERVERS=starrocks-fe:9010
    volumes:
      - ./be-storage:/opt/starrocks/be/storage
    restart: always

The above file defines a StarRocks deployment with two services:

  • Frontend (FE) – Acts as the query coordinator, listening on ports 8030 (web UI) and 9030 (MySQL protocol). It stores metadata in the ./fe-meta volume and connects to the backend.
  • Backend (BE) – Handles data storage and query execution, exposed on port 8040. It depends on the FE and stores data in ./be-storage. Both services restart automatically on failure.

Start the containers using the command below.

docker compose up -d

Verify that containers are running.

docker ps

Output.

CONTAINER ID   IMAGE                            COMMAND                  CREATED          STATUS          PORTS                                                                                      NAMES
97465ff8fb67   starrocks/be-ubuntu:3.2-latest   "/opt/starrocks/be/b…"   31 minutes ago   Up 31 minutes   0.0.0.0:8040->8040/tcp, [::]:8040->8040/tcp                                                starrocks-be
b184563cf3b3   starrocks/fe-ubuntu:3.2-latest   "/opt/starrocks/fe/b…"   31 minutes ago   Up 31 minutes   0.0.0.0:8030->8030/tcp, [::]:8030->8030/tcp, 0.0.0.0:9030->9030/tcp, [::]:9030->9030/tcp   starrocks-fe
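
If either container is missing or keeps restarting, inspect its logs to find the cause.

docker compose logs fe
docker compose logs be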

Step 3: Connect to StarRocks Frontend

Use the MySQL protocol to connect to the StarRocks FE service and issue SQL commands for managing databases and tables.

First, install the MySQL client package.

apt install mysql-client -y

Next, verify the StarRocks FE connectivity.

mysql -h 127.0.0.1 -P 9030 -uroot

The FE must register at least one BE node before it can store data. Add the backend node using the command below.

ALTER SYSTEM ADD BACKEND "starrocks-be:9050";

Verify that the backend is registered and active.

SHOW BACKENDS\G

Output.

*************************** 1. row ***************************
            BackendId: 10006
                   IP: 172.18.0.3
        HeartbeatPort: 9050
               BePort: 9060
             HttpPort: 8040
             BrpcPort: 8060
        LastStartTime: 2025-06-28 11:06:13
        LastHeartbeat: 2025-06-28 11:07:28
                Alive: true
 SystemDecommissioned: false
ClusterDecommissioned: false
            TabletNum: 59
     DataUsedCapacity: 0.000 B
        AvailCapacity: 269.413 GB
        TotalCapacity: 337.143 GB
              UsedPct: 20.09 %
       MaxDiskUsedPct: 20.09 %
               ErrMsg: 
              Version: 3.2.16-8dea52d
               Status: {"lastSuccessReportTabletsTime":"2025-06-28 11:07:14"}
    DataTotalCapacity: 269.413 GB
          DataUsedPct: 0.00 %
             CpuCores: 4
    NumRunningQueries: 0
           MemUsedPct: 0.73 %
           CpuUsedPct: 0.2 %
             Location: 
1 row in set (0.00 sec)

Step 4: Create a Database and Table

In this section, we will create a StarRocks database and define a table schema to receive flattened JSON data.

First, create a StarRocks database.

CREATE DATABASE gpu_json;

Switch to the gpu_json database.

USE gpu_json;

Define a table schema.

CREATE TABLE IF NOT EXISTS users (
    id INT,
    name STRING,
    email STRING,
    city STRING,
    zip STRING
)
ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES("replication_num" = "1");

Press CTRL+D to exit from the MySQL shell.

Step 5: Set Up GPU JSON Preprocessing with cuDF

We’ll now use GPU acceleration to flatten nested JSON data using NVIDIA’s cuDF, a high-performance GPU dataframe library.

Create a Python virtual environment and activate it.

python3 -m venv venv
source venv/bin/activate

Update pip to the latest version.

pip install --upgrade pip

Install cuDF to preprocess nested JSON.

pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
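
Because cuDF needs a working CUDA setup, it's worth running a quick smoke test before writing the preprocessing script. The one-liner below simply imports cuDF and prints its version; if it fails, re-check your NVIDIA driver and CUDA installation.

python3 -c "import cudf; print(cudf.__version__)"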

Create a sample JSON data file.

nano raw_users.json

Add the following content.

[
  {
    "id": 1,
    "user": {
      "name": "Alice",
      "email": "[email protected]",
      "location": {"city": "New York", "zip": "10001"}
    }
  },
  {
    "id": 2,
    "user": {
      "name": "Bob",
      "email": "[email protected]",
      "location": {"city": "Chicago", "zip": "60601"}
    }
  }
]

This JSON structure contains nested fields that we’ll flatten before loading into StarRocks.

Create a preprocess_json_gpu.py script that loads the JSON, moves it into a GPU DataFrame, extracts the nested values, and saves the flat result to a CSV file.

nano preprocess_json_gpu.py

Add the following code.

import cudf
import json

# Load raw JSON file
with open("raw_users.json", "r") as f:
    data = json.load(f)

# Convert list of dicts to cuDF DataFrame
df = cudf.DataFrame(data)

# Convert 'user' column to pandas for safe iteration
user_data = df['user'].to_pandas()

# Extract nested fields using pandas-like iteration
df['name'] = [user['name'] for user in user_data]
df['email'] = [user['email'] for user in user_data]
df['city'] = [user['location']['city'] for user in user_data]
df['zip'] = [user['location']['zip'] for user in user_data]

# Drop the original nested column
df = df.drop(columns=['user'])

# Save to CSV
df.to_csv("flattened_users.csv", index=False)

print("✅ Flattened data saved to 'flattened_users.csv'")

Run the script.

python3 preprocess_json_gpu.py
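
The script writes flattened_users.csv to the current directory. Based on the sample input above, it should contain a header row followed by one flat row per user, roughly like this:

id,name,email,city,zip
1,Alice,[email protected],New York,10001
2,Bob,[email protected],Chicago,60601

You can confirm the contents with cat flattened_users.csv.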

Step 6: Load the CSV into StarRocks Using Stream Load

We’ll use StarRocks’ HTTP API to load the CSV file directly into the users table.

First, remove the CSV header row, since Stream Load would otherwise treat it as a data row.

tail -n +2 flattened_users.csv > flattened_users_noheader.csv

Load the data to StarRocks using curl.

curl -v --location-trusted -u root: \
  -H "label: csv_load_02" \
  -H "format: csv" \
  -H "column_separator: ," \
  -H "Expect: 100-continue" \
  -T ./flattened_users_noheader.csv \
  http://localhost:8030/api/gpu_json/users/_stream_load

Look for a "Status": "Success" message in the output.

{
    "TxnId": 4,
    "Label": "csv_load_02",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 3,
    "NumberLoadedRows": 3,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 100,
    "LoadTimeMs": 29,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 2,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 6,
    "CommitAndPublishTimeMs": 20
* Connection #1 to host 172.18.0.3 left intact

Step 7: Verify the Data

In this section, we will query the table to confirm that the flattened JSON has been loaded correctly.

First, connect to StarRocks.

mysql -h 127.0.0.1 -P 9030 -uroot

Switch to the gpu_json database.

USE gpu_json;

Verify the data.

SELECT * FROM users;

Output.

+----+-------+-------------------+----------+--------+
| id | name  | email             | city     | zip    |
+----+-------+-------------------+----------+--------+
|  1 | Alice | [email protected] | New York | 10001  |
|  2 | Bob   | [email protected]   | Chicago  | 60601  |
+----+-------+-------------------+----------+--------+
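
As a final sanity check, you can run a simple aggregation to confirm the table behaves like any other StarRocks table, for example counting users per city.

SELECT city, COUNT(*) AS user_count FROM users GROUP BY city;

With the sample data, this returns one row for New York and one for Chicago, each with a count of 1.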

Conclusion

In this tutorial, you built a powerful data pipeline using StarRocks and GPU acceleration on Ubuntu 24.04. You learned how to deploy a StarRocks single-node cluster using Docker Compose, preprocess nested JSON data with NVIDIA’s cuDF library, and ingest the flattened data using the high-performance Stream Load API. Finally, you verified that the data was correctly loaded and queried it using standard SQL.