Table of Contents
- Prerequisites
- Step 1 - Set Up the Python Environment
- Step 2 - Generate a Large Parquet File
- Step 3 - CPU vs GPU Read Benchmark
- Step 4 - Write Many Smaller Parquet Shards (Parallelism-Friendly)
- Step 5 - Stream Shards Directly to GPU & Aggregate
- Step 6 - Convert Arrow (Host) → cuDF (Device)
- Step 7 - Convert cuDF (Device) → Arrow (Host)
- Conclusion
Moving data to the GPU is often the bottleneck, not your kernels. Apache Arrow addresses this by providing a fast, columnar in-memory format that GPU libraries such as RAPIDS cuDF can ingest with minimal conversion. In this guide, you’ll set up an Ubuntu 24.04 server, generate tens of millions of rows in Parquet, and load them straight into GPU memory with RAPIDS cuDF. You’ll also learn when to keep working on the CPU and when handing off to the GPU pays off.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU that has at least 8 GB of GPU memory.
- A non-root user with sudo privileges.
- NVIDIA drivers installed on the server.
Step 1 – Set Up the Python Environment
We’ll create a dedicated Python virtual environment so GPU libraries and dependencies stay clean and isolated from your system packages.
1. Install required build tools and Python venv.
sudo apt install -y python3-venv build-essential
2. Create and activate a virtual environment.
python3 -m venv arrow-gpu
source arrow-gpu/bin/activate
3. Upgrade pip and wheel inside the venv.
python3 -m pip install --upgrade pip wheel
4. Install PyArrow for working with Arrow and Parquet formats.
pip install pyarrow
5. Install the GPU-accelerated RAPIDS libraries.
pip install cudf-cu12 rmm-cu12 --extra-index-url https://pypi.nvidia.com
6. Finally, add CuPy for low-level GPU array operations.
pip install cupy-cuda12x
7. To confirm everything is installed correctly and that Python can see your GPU, run.
python - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq
import cudf, cupy, rmm
print("PyArrow:", pa.__version__)
print("cuDF:", cudf.__version__)
print("CuPy:", cupy.__version__)
PY
Output.
PyArrow: 19.0.1
cuDF: 25.08.00
CuPy: 13.5.1
If you see the expected versions without errors, you’re ready to start creating and loading data.
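The version printout confirms that the imports work, but it does not show how much GPU memory is actually available. A minimal sketch of a fuller check follows. The helper names `check_gpu_stack` and `gpu_memory_gib` are illustrative and not part of any library; the memory query assumes CuPy's `cupy.cuda.runtime.memGetInfo()`, which reports free and total bytes for the current device.

```python
import importlib
import importlib.util


def check_gpu_stack(packages=("pyarrow", "cudf", "cupy", "rmm")):
    """Map each package name to its version string, or None if not installed."""
    report = {}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not installed in this environment
            continue
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except Exception as exc:  # installed, but failed to load (e.g. driver issue)
            report[name] = f"import failed: {exc}"
    return report


def gpu_memory_gib():
    """Return (free, total) memory of the current GPU in GiB, or None if unavailable."""
    try:
        import cupy
        free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
    except Exception:
        return None  # CuPy missing, or no usable GPU/driver
    return free_bytes / 2**30, total_bytes / 2**30


if __name__ == "__main__":
    for name, version in check_gpu_stack().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
    mem = gpu_memory_gib()
    print("GPU memory (free, total) GiB:", mem if mem else "no GPU detected")
```

Knowing the free memory up front helps when choosing row counts and shard sizes in the later steps.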
Step 2 – Generate a Large Parquet File
To see real performance differences between CPU and GPU reads, we’ll first create a large dataset stored in Parquet format. This will give us enough rows to make the GPU useful and highlight the benefits of columnar storage.
1. Create a file named make_parquet.py.
nano make_parquet.py
Add the following code.
# make_parquet.py
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

N = int(os.environ.get("N_ROWS", "20_000_000"))  # 20M rows by default
OUT = "demo.parquet"

def build_table(n):
    # Build n rows of synthetic data with compact dtypes
    arrays = {
        "id": pa.array(np.arange(n, dtype=np.int64)),
        "x": pa.array(np.random.randn(n).astype("float32")),
        "y": pa.array(np.random.randn(n).astype("float32")),
        "label": pa.array(np.random.randint(0, 10, size=n, dtype=np.int8)),
    }
    return pa.table(arrays)

if not os.path.exists(OUT):
    tbl = build_table(N)
    pq.write_table(
        tbl, OUT,
        compression="zstd",        # good speed/ratio balance
        write_statistics=True,     # min/max stats let readers skip row groups
        data_page_size=1 << 20,    # target ~1 MB data pages
        row_group_size=5_000_000,  # PyArrow counts row_group_size in rows, not bytes
    )
    print("Wrote", OUT)
else:
    print(OUT, "already exists")
2. Run the script.
python3 make_parquet.py
Output.
Wrote demo.parquet
Step 3 – CPU vs GPU Read Benchmark
With our 20M-row Parquet file ready, we can measure how long it takes to load into memory using:
- pandas with PyArrow (CPU only)
- cuDF (directly into GPU memory)
This test will help you see where the GPU can save time and where it may not offer much benefit for simple reads.
Create a file named bench_parquet.py.
nano bench_parquet.py
Add the following code.
# bench_parquet.py
import time

import cudf
import pandas as pd

FILE = "demo.parquet"

def timeit(fn, label):
    # Time a single call of fn and print the elapsed wall-clock seconds
    t0 = time.perf_counter()
    _ = fn()
    dt = time.perf_counter() - t0
    print(f"{label}: {dt:.2f}s")

def cpu_pandas():
    return pd.read_parquet(FILE, engine="pyarrow")

def gpu_cudf():
    return cudf.read_parquet(FILE)  # directly to GPU

if __name__ == "__main__":
    timeit(cpu_pandas, "CPU pandas read_parquet")
    timeit(gpu_cudf, "GPU cuDF read_parquet")
Run the script.
python3 bench_parquet.py
Output.
CPU pandas read_parquet: 0.65s
GPU cuDF read_parquet: 0.93s
In this case, the GPU read is slightly slower. That’s because:
- We only read a single file.
- The dataset is large, but the GPU still needs to transfer data from CPU to device memory.
- There’s little computation after the load, so I/O and PCIe transfer time dominate.
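A single timed call can also include one-time startup work, so it helps to run each reader once untimed and then repeat the measurement. The helper below is a sketch of that idea and not part of the original benchmark; `bench` and its `warmup` and `repeats` parameters are hypothetical names. It is demonstrated on a plain Python function so it runs anywhere.

```python
import statistics
import time


def bench(fn, label, warmup=1, repeats=3):
    """Call fn `warmup` times untimed, then time it `repeats` times."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as lazy initialization and cold caches
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    print(f"{label}: best {min(samples):.4f}s, median {statistics.median(samples):.4f}s")
    return samples


# Stand-in workload; in bench_parquet.py you would pass cpu_pandas or gpu_cudf instead
samples = bench(lambda: sum(range(100_000)), "sum of 100k ints")
```

Comparing the best or median of several runs gives a fairer picture than one cold call.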
Next, we’ll write the data as multiple shards to prepare for faster streaming and parallel processing on the GPU.
Step 4 – Write Many Smaller Parquet Shards (Parallelism-Friendly)
Single big files are easy, but shards are easier to move and process in parallel. We’ll generate a fresh synthetic dataset as 10 Parquet files of 2M rows each, with id, val, and bucket columns. This helps with multi-process loaders, overlapping I/O with compute, and keeping each load within GPU memory.
1. Create make_shards.py.
nano make_shards.py
Add the following code.
# make_shards.py
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("parquet_shards", exist_ok=True)
rows_per = 2_000_000
shards = 10

for i in range(shards):
    start = i * rows_per  # keep ids unique across shards
    arrays = {
        "id": pa.array(np.arange(start, start + rows_per, dtype=np.int64)),
        "val": pa.array(np.random.randn(rows_per).astype("float32")),
        "bucket": pa.array(np.random.randint(0, 32, size=rows_per, dtype=np.int16)),
    }
    tbl = pa.table(arrays)
    pq.write_table(
        tbl,
        f"parquet_shards/part-{i:02d}.parquet",
        compression="zstd",
        write_statistics=True,
    )
print("Shards ready")
2. Run the script.
python3 make_shards.py
Output.
Shards ready
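As a rough sizing check, the decoded footprint of a shard can be estimated from its column widths. The sketch below assumes fixed-width columns with no nulls and ignores Parquet compression, so it describes in-memory size rather than file size. `bytes_per_row` and `rows_for_budget` are illustrative helpers, and the 256 MiB budget is an arbitrary example.

```python
# Bytes per value for the columns written by make_shards.py
COLUMN_WIDTHS = {"id": 8, "val": 4, "bucket": 2}  # int64, float32, int16


def bytes_per_row(widths):
    """Sum of the fixed column widths for one row."""
    return sum(widths.values())


def rows_for_budget(budget_bytes, widths):
    """Largest row count whose decoded columns fit in budget_bytes."""
    return budget_bytes // bytes_per_row(widths)


row_bytes = bytes_per_row(COLUMN_WIDTHS)      # 8 + 4 + 2 bytes
shard_bytes = 2_000_000 * row_bytes           # decoded size of one 2M-row shard
rows_256mib = rows_for_budget(256 * 2**20, COLUMN_WIDTHS)

print(f"{row_bytes} bytes/row, {shard_bytes / 2**20:.1f} MiB per 2M-row shard")
print(f"rows that fit in 256 MiB: {rows_256mib:,}")
```

Intermediate results such as groupby outputs need memory too, so treat the estimate as a lower bound on what a shard costs on the device.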
Why this helps:
- Parallel reads: Multiple workers (or Dask-cuDF) can pull different shards at once.
- Better caching: Smaller files improve locality and reduce re-reads.
- Controlled memory: You can size shards/row groups to fit GPU memory (e.g., 128–512 MB row groups).
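The parallel-reads point can be sketched with a thread pool that loads several shards at once. This is a minimal illustration rather than the guide's own method: `load_shards` is a hypothetical helper, the reader is passed in as a callable, and whether threads actually speed things up depends on the reader and on your storage.

```python
import glob
from concurrent.futures import ThreadPoolExecutor


def load_shards(paths, reader, max_workers=4):
    """Apply reader(path) to every path concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(reader, paths))


if __name__ == "__main__":
    import pyarrow.parquet as pq

    files = sorted(glob.glob("parquet_shards/*.parquet"))
    tables = load_shards(files, pq.read_table)  # host-side Arrow tables
    print("Loaded", sum(t.num_rows for t in tables), "rows from", len(files), "shards")
```

Because the reader is just a callable, the same helper can be tried with other loaders and compared against a plain loop.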
Next, we’ll stream these shards directly into the GPU and run a quick aggregation to measure end-to-end throughput.
Step 5 – Stream Shards Directly to GPU & Aggregate
Now let’s read the shard files straight into GPU memory and perform a small aggregation to simulate “real work” after I/O. We’ll compute the mean of val grouped by bucket per shard and time the whole pipeline.
1. Create stream_to_gpu.py.
nano stream_to_gpu.py
Add the following code.
# stream_to_gpu.py
import glob
import time

import cudf

files = sorted(glob.glob("parquet_shards/*.parquet"))
t0 = time.perf_counter()
total = 0

for path in files:
    gdf = cudf.read_parquet(path)  # GPU DataFrame
    # Example GPU work: quick groupby
    agg = gdf.groupby("bucket").val.mean()
    total += len(gdf)

dt = time.perf_counter() - t0
print(f"Processed {total} rows from {len(files)} files in {dt:.2f}s")
2. Run the script.
python3 stream_to_gpu.py
Output.
Processed 20000000 rows from 10 files in 1.49s
What this shows:
- I/O + compute pipeline: We’re not just timing reads; we’re timing a GPU aggregation too, which is where the GPU starts to shine.
- Shard wins: Multiple smaller files reduce latency and make it easier to scale with more workers or Dask-cuDF.
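Note that stream_to_gpu.py computes a per-shard mean and then discards it. To get a mean across all shards, averaging the per-shard means is only correct when a bucket has the same row count in every shard; carrying sums and counts is the safer pattern. The sketch below uses pandas so it runs without a GPU. It assumes the same `groupby` and `agg` calls work on cuDF DataFrames, which mirror the pandas API, and `partial_stats` and `combine` are hypothetical helper names.

```python
import pandas as pd


def partial_stats(df):
    """Per-shard sum and count of val for each bucket."""
    return df.groupby("bucket")["val"].agg(["sum", "count"])


def combine(partials):
    """Merge per-shard sums and counts, then divide to get the global mean."""
    total = pd.concat(partials).groupby(level=0).sum()
    return total["sum"] / total["count"]


# Two tiny stand-in shards with unequal bucket counts
shard_a = pd.DataFrame({"bucket": [0, 0, 1], "val": [1.0, 3.0, 10.0]})
shard_b = pd.DataFrame({"bucket": [0, 1, 1, 1], "val": [5.0, 20.0, 30.0, 40.0]})

global_mean = combine([partial_stats(shard_a), partial_stats(shard_b)])
print(global_mean)
```

The partial results are tiny compared with the shards, so only one shard needs to be resident in GPU memory at a time.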
Next, we’ll convert Arrow (host) → cuDF (device) to move CPU-built Arrow tables directly into GPU memory.
Step 6 – Convert Arrow (Host) → cuDF (Device)
Sometimes you prepare data on the CPU (ETL, joins with CPU-only sources) and then hand it off to the GPU for fast analytics. With Arrow, that handoff is simple: build an Arrow Table in host memory and create a cuDF DataFrame on the device in one step.
1. Create arrow_to_cudf.py.
nano arrow_to_cudf.py
Add the following code.
# arrow_to_cudf.py
import numpy as np
import pyarrow as pa
import cudf

# Build a CPU Arrow table (host memory)
tbl = pa.table({
    "a": np.arange(1_000_000, dtype=np.int32),
    "b": np.random.randn(1_000_000).astype("float32"),
})

# Copy host→device and create a GPU DataFrame
gdf = cudf.DataFrame.from_arrow(tbl)

print("Rows on GPU:", len(gdf))
print("dtypes:", dict(zip(gdf.columns, map(str, gdf.dtypes))))
print("head():")
print(gdf.head())
2. Run the script.
python3 arrow_to_cudf.py
Output.
Rows on GPU: 1000000
dtypes: {'a': 'int32', 'b': 'float32'}
head():
   a         b
0  0 -2.701743
1  1  0.692678
2  2 -0.638901
3  3  0.102307
4  4  0.323030
Why this matters:
- Zero fuss interop: Arrow’s columnar layout maps cleanly to cuDF.
- Controlled dtypes: Set NumPy dtypes up front (e.g., int32, float32) to match GPU-friendly types and avoid casts.
- Clear boundary: Do CPU-bound prep where it’s convenient, then switch to GPU when it pays off.
Next, we’ll do the reverse: convert cuDF (device) → Arrow (host) for when you need to export data or pass it to CPU-only libraries.
Step 7 – Convert cuDF (Device) → Arrow (Host)
There are times when you’ve done heavy GPU processing but need to move the results back to the CPU, maybe for exporting to disk, handing off to a CPU-based library, or sending data over a network. With cuDF, you can easily convert a GPU DataFrame into an Apache Arrow Table in host memory.
1. Create cudf_to_arrow.py.
nano cudf_to_arrow.py
Add the following code.
# cudf_to_arrow.py
import cudf

# Build a GPU DataFrame
gdf = cudf.DataFrame({
    "x": cudf.Series(range(100_000)),      # 100k rows
    "y": cudf.Series(range(100_000)) * 2,
})

# Copy device→host into an Arrow table
tbl = gdf.to_arrow()

print("Rows on CPU as Arrow:", tbl.num_rows)
print("Schema:", tbl.schema)
print("first 5 as pandas (CPU):")
print(tbl.to_pandas().head())
2. Run the script.
python3 cudf_to_arrow.py
Output.
Rows on CPU as Arrow: 100000
Schema: x: int64
y: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 462
first 5 as pandas (CPU):
   x  y
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8
Why this is useful:
- Exporting data: Easily write Arrow tables to Parquet, Feather, or IPC for storage or sharing.
- CPU-only workflows: Hand off data to pandas, scikit-learn, or other CPU-based systems.
- Interoperability: Arrow format is supported across many languages (C++, Rust, R, Java).
Conclusion
By combining Apache Arrow with GPU acceleration via cuDF, you can move large datasets between disk, host memory, and device memory with little friction. In this guide, you set up a clean Python environment on Ubuntu 24.04, generated a 20M-row Parquet dataset, and compared CPU and GPU read speeds, seeing that a plain read on its own does not guarantee a GPU win. You also learned how to write data as shards for parallel processing, stream those shards directly into GPU memory, and convert between Arrow tables on the CPU and cuDF DataFrames on the GPU.