StarRocks is a blazing-fast analytical database optimized for modern analytics workloads. It supports semi-structured formats like JSON and nested data, making it an excellent fit for use cases involving raw event data, user activity logs, or application telemetry.
In this hands-on tutorial, you’ll learn how to deploy StarRocks on Ubuntu 24.04 with Docker, preprocess JSON using GPU acceleration, and run fast SQL queries using the StarRocks engine.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user or a user with sudo privileges.
- NVIDIA drivers installed on your server (see the check below).
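You can confirm that the NVIDIA driver is loaded by running nvidia-smi.
nvidia-smi
It should print a table listing your GPU model, driver version, and supported CUDA version. If the command fails, install the drivers before continuing.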
Step 1: Install Required Packages
Install the essential tools needed to run StarRocks and preprocess JSON with a GPU: Python, Docker, and Java. The MySQL client is installed later, in Step 3.
First, update the package index.
apt update -y
Next, install the required packages.
apt install python3 python3-pip python3-venv docker.io docker-compose-v2 default-jdk -y
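Verify that the Docker engine and the Compose plugin (the docker compose subcommand used below) are available.
docker --version
docker compose version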
Step 2: Deploy StarRocks with Docker Compose
We will use Docker Compose to deploy a single-node StarRocks cluster with Frontend (FE) and Backend (BE) services.
First, create a directory for your project and navigate inside it.
mkdir ~/starrocks && cd ~/starrocks
Next, create a docker-compose.yml file.
nano docker-compose.yml
Add the following configuration.
version: '3.8'

services:
  fe:
    image: starrocks/fe-ubuntu:3.2-latest
    container_name: starrocks-fe
    ports:
      - "8030:8030"   # Web UI
      - "9030:9030"   # MySQL protocol
    environment:
      - FE_SERVERS=starrocks-fe:9010
    command:
      - /opt/starrocks/fe/bin/start_fe.sh
    volumes:
      - ./fe-meta:/opt/starrocks/fe/meta
    restart: always

  be:
    image: starrocks/be-ubuntu:3.2-latest
    container_name: starrocks-be
    depends_on:
      - fe
    ports:
      - "8040:8040"   # BE port
    command:
      - /opt/starrocks/be/bin/start_be.sh
    environment:
      - FE_SERVERS=starrocks-fe:9010
    volumes:
      - ./be-storage:/opt/starrocks/be/storage
    restart: always
The above file defines a StarRocks deployment with two services:
- Frontend (FE) – Acts as the query coordinator, exposed on ports 8030 (web UI) and 9030 (MySQL protocol). It stores its metadata in the ./fe-meta volume.
- Backend (BE) – Handles data storage and query execution, exposed on port 8040. It registers with the FE and stores data in ./be-storage. Both services restart automatically on failure.
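Before starting the services, you can optionally validate the file; docker compose config prints the fully resolved configuration and reports any YAML mistakes.
docker compose config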
Start the containers using the command below.
docker compose up -d
Verify that containers are running.
docker ps
Output.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
97465ff8fb67 starrocks/be-ubuntu:3.2-latest "/opt/starrocks/be/b…" 31 minutes ago Up 31 minutes 0.0.0.0:8040->8040/tcp, [::]:8040->8040/tcp starrocks-be
b184563cf3b3 starrocks/fe-ubuntu:3.2-latest "/opt/starrocks/fe/b…" 31 minutes ago Up 31 minutes 0.0.0.0:8030->8030/tcp, [::]:8030->8030/tcp, 0.0.0.0:9030->9030/tcp, [::]:9030->9030/tcp starrocks-fe
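The FE can take a short while to finish initializing after the containers start. If a connection attempt in the next step is refused, check the FE logs from the project directory and wait for start-up to complete.
docker compose logs fe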
Step 3: Connect to StarRocks Frontend
Use the MySQL protocol to connect to the StarRocks FE service and issue SQL commands for managing databases and tables.
First, install the MySQL client package.
apt install mysql-client -y
Next, connect to the StarRocks FE to verify connectivity.
mysql -h 127.0.0.1 -P 9030 -uroot
The FE must register at least one BE node before it can store data. Add the backend node using the command below.
ALTER SYSTEM ADD BACKEND "starrocks-be:9050";
Verify that the backend is registered and active.
SHOW BACKENDS\G
Output.
*************************** 1. row ***************************
BackendId: 10006
IP: 172.18.0.3
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2025-06-28 11:06:13
LastHeartbeat: 2025-06-28 11:07:28
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 59
DataUsedCapacity: 0.000 B
AvailCapacity: 269.413 GB
TotalCapacity: 337.143 GB
UsedPct: 20.09 %
MaxDiskUsedPct: 20.09 %
ErrMsg:
Version: 3.2.16-8dea52d
Status: {"lastSuccessReportTabletsTime":"2025-06-28 11:07:14"}
DataTotalCapacity: 269.413 GB
DataUsedPct: 0.00 %
CpuCores: 4
NumRunningQueries: 0
MemUsedPct: 0.73 %
CpuUsedPct: 0.2 %
Location:
1 row in set (0.00 sec)
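The key field here is Alive: true, which confirms that the BE has registered and is sending heartbeats to the FE. If you want to re-check this later without opening an interactive session, you can run the same statement directly from your shell, for example:
mysql -h 127.0.0.1 -P 9030 -uroot -e 'SHOW BACKENDS\G' | grep Alive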
Step 4: Create a Database and Table
In this section, we will create a StarRocks database and define a table schema to receive flattened JSON data.
First, create a StarRocks database.
CREATE DATABASE gpu_json;
Change the database to gpu_json.
USE gpu_json;
Define a table schema.
CREATE TABLE IF NOT EXISTS users (
    id INT,
    name STRING,
    email STRING,
    city STRING,
    zip STRING
)
ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES("replication_num" = "1");
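Optionally, confirm the schema before continuing; DESC should list the five columns defined above.
DESC users;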
Press CTRL+D to exit from the MySQL shell.
Step 5: Set Up GPU JSON Preprocessing with cuDF
We’ll now use GPU acceleration to flatten nested JSON data using NVIDIA’s cuDF, a high-performance GPU dataframe library.
Create a Python virtual environment and activate it.
python3 -m venv venv
source venv/bin/activate
Update pip to the latest version.
pip install --upgrade pip
Install cuDF to preprocess nested JSON.
pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
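Before writing the preprocessing script, you can run a quick smoke test to confirm that cuDF can allocate data on the GPU; it should print 6.
python3 -c "import cudf; print(cudf.Series([1, 2, 3]).sum())"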
Create a sample JSON data file.
nano raw_users.json
Add the following content.
[
  {
    "id": 1,
    "user": {
      "name": "Alice",
      "email": "alice@example.com",
      "location": {"city": "New York", "zip": "10001"}
    }
  },
  {
    "id": 2,
    "user": {
      "name": "Bob",
      "email": "bob@example.com",
      "location": {"city": "Chicago", "zip": "60601"}
    }
  }
]
This JSON structure contains nested fields that we’ll flatten before loading into StarRocks.
Create a preprocess_json_gpu.py script to load the JSON into GPU memory, extract nested values, and save the flat result to a CSV file.
nano preprocess_json_gpu.py
Add the following code.
import cudf
import json
# Load raw JSON file
with open("raw_users.json", "r") as f:
    data = json.load(f)
# Convert list of dicts to cuDF DataFrame
df = cudf.DataFrame(data)
# Convert 'user' column to pandas for safe iteration
user_data = df['user'].to_pandas()
# Extract nested fields using pandas-like iteration
df['name'] = [user['name'] for user in user_data]
df['email'] = [user['email'] for user in user_data]
df['city'] = [user['location']['city'] for user in user_data]
df['zip'] = [user['location']['zip'] for user in user_data]
# Drop the original nested column
df = df.drop(columns=['user'])
# Save to CSV
df.to_csv("flattened_users.csv", index=False)
print("✅ Flattened data saved to 'flattened_users.csv'")
Run the script.
python3 preprocess_json_gpu.py
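Inspect the generated file to make sure the nested fields were flattened as expected. With the sample data above, the CSV should look similar to this:
cat flattened_users.csv
id,name,email,city,zip
1,Alice,alice@example.com,New York,10001
2,Bob,bob@example.com,Chicago,60601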
Step 6: Load the CSV into StarRocks Using Stream Load
We’ll use StarRocks’ HTTP API to load the CSV file directly into the users table.
First, remove the CSV header row.
tail -n +2 flattened_users.csv > flattened_users_noheader.csv
Load the data to StarRocks using curl.
curl -v --location-trusted -u root: \
-H "label: csv_load_02" \
-H "format: csv" \
-H "column_separator: ," \
-H "Expect: 100-continue" \
-T ./flattened_users_noheader.csv \
http://localhost:8030/api/gpu_json/users/_stream_load
Look for a "Status": "Success" message in the output.
{
    "TxnId": 4,
    "Label": "csv_load_02",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 3,
    "NumberLoadedRows": 3,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 100,
    "LoadTimeMs": 29,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 2,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 6,
    "CommitAndPublishTimeMs": 20
}
* Connection #1 to host 172.18.0.3 left intact
Step 7: Verify the Data
In this section, we will query the table to confirm that the flattened JSON has been loaded correctly.
First, connect to StarRocks.
mysql -h 127.0.0.1 -P 9030 -uroot
Change the database to gpu_json.
USE gpu_json;
Verify the data.
SELECT * FROM users;
Output.
+----+-------+-------------------+----------+--------+
| id | name  | email             | city     | zip    |
+----+-------+-------------------+----------+--------+
|  1 | Alice | alice@example.com | New York | 10001  |
|  2 | Bob   | bob@example.com   | Chicago  | 60601  |
+----+-------+-------------------+----------+--------+
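With the data loaded, you can run ordinary analytical SQL against the table. For example, a simple aggregation by city:
SELECT city, COUNT(*) AS user_count FROM users GROUP BY city;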
Conclusion
In this tutorial, you built a powerful data pipeline using StarRocks and GPU acceleration on Ubuntu 24.04. You learned how to deploy a StarRocks single-node cluster using Docker Compose, preprocess nested JSON data with NVIDIA’s cuDF library, and ingest the flattened data using the high-performance Stream Load API. Finally, you verified that the data was correctly loaded and queried it using standard SQL.