Table of Contents
- Prerequisites
- Step 1: Install and Configure K3s
- Step 2: Install NVIDIA Container Toolkit
- Step 3: Configure Persistent Storage for MLFlow
- Step 4: Deploy MLFlow Using Helm
- Step 5: Expose MLFlow Service
- Step 6: Run the MLFlow Experiment
- Step 7: Copy Kubeconfig to the Local Machine
- Step 8: Access MLFlow UI
- Conclusion
MLFlow is a powerful, open-source platform designed to manage the entire lifecycle of machine learning (ML) development. It provides tools for tracking experiments, packaging code, and deploying models. By deploying MLFlow on a Kubernetes cluster, you can leverage scalability, reliability, and GPU support for ML workloads.
This guide provides a detailed, step-by-step walkthrough for setting up MLFlow on a Kubernetes cluster.
Prerequisites
Before starting, ensure you have the following:
- An Ubuntu 22.04 Cloud GPU Server.
- The CUDA Toolkit, cuDNN, and Helm installed.
- Root or sudo privileges.
Step 1: Install and Configure K3s
K3s is a lightweight Kubernetes distribution that is ideal for quick setups. Install it using the following command:
curl -sfL https://get.k3s.io | sh -
After installation, copy the K3s configuration file into your kubeconfig directory so kubectl can interact with the cluster:
mkdir -p ~/.kube
cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
Confirm the Kubernetes cluster is up and running by checking the node status:
kubectl get nodes
Expected output:
NAME STATUS ROLES AGE VERSION
ubuntu Ready control-plane,master 9s v1.31.3+k3s1
Step 2: Install NVIDIA Container Toolkit
To enable GPU support in your Kubernetes cluster, install the NVIDIA container toolkit:
1. Add the NVIDIA repository and import its GPG key:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
2. Update and install the toolkit:
apt-get update
apt-get install -y nvidia-container-toolkit
3. Verify the installation:
nvidia-container-cli --version
Output.
cli-version: 1.17.3
lib-version: 1.17.3
build date: 2024-12-04T09:47+00:00
4. Deploy the NVIDIA plugin to manage GPUs in your Kubernetes cluster:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
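With the device plugin running, you can sanity-check GPU scheduling with a minimal pod that requests one GPU (a sketch; the pod name and CUDA image tag are examples, not part of this setup):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # served by the NVIDIA device plugin
```

If kubectl logs gpu-test prints the familiar nvidia-smi table, the cluster can schedule GPU workloads.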
Step 3: Configure Persistent Storage for MLFlow
MLFlow requires persistent storage to save experiment data and models. Create and configure a Persistent Volume (PV) and a Persistent Volume Claim (PVC); a hostPath volume is sufficient for a single-node K3s cluster like this one:
1. Create a YAML file for the PV and PVC configuration:
nano mlflow-pv-pvc.yaml
Add the following content:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mlflow-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-pvc
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
2. Apply the configuration:
kubectl apply -f mlflow-pv-pvc.yaml
3. Verify the PV and PVC:
kubectl get pv
Output.
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
mlflow-pv 10Gi RWO Retain Available manual 6s
kubectl get pvc
Output.
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
mlflow-pvc Bound mlflow-pv 10Gi RWO manual 15s
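For reference, a pod consumes a claim like this by mounting it as a volume (an illustrative fragment only; the MLFlow Helm chart in the next step wires up its own storage, so you do not need to apply this):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: mlflow-storage
          mountPath: /mlflow        # data written here lands on the PV's hostPath
  volumes:
    - name: mlflow-storage
      persistentVolumeClaim:
        claimName: mlflow-pvc
```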
Step 4: Deploy MLFlow Using Helm
1. Add the Helm chart repository for MLFlow:
helm repo add community-charts https://community-charts.github.io/helm-charts
2. Update the Helm repository.
helm repo update
3. Install MLFlow using Helm:
helm install atlantic community-charts/mlflow
Output.
NAME: atlantic
LAST DEPLOYED: Wed Dec 18 09:43:10 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=mlflow,app.kubernetes.io/instance=atlantic" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace default $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "Visit http://127.0.0.1:8080 to use your application"
kubectl --namespace default port-forward $POD_NAME 8080:$CONTAINER_PORT
4. Verify that MLFlow has been successfully deployed:
kubectl get deployments
Output.
NAME READY UP-TO-DATE AVAILABLE AGE
atlantic-mlflow 1/1 1 1 70s
Step 5: Expose MLFlow Service
1. To make MLFlow accessible, create a service:
nano mlflow-service.yaml
Add the following configuration:
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
spec:
  selector:
    app: mlflow
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000
      nodePort: 30001   # Optional; omit to let Kubernetes assign a random NodePort
  type: NodePort
2. Deploy the service.
kubectl apply -f mlflow-service.yaml
3. Check the services running in your cluster.
kubectl get services
Output.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
atlantic-mlflow ClusterIP 10.43.183.234 5000/TCP 28m
kubernetes ClusterIP 10.43.0.1 443/TCP 35m
mlflow-service NodePort 10.43.158.142 80:30001/TCP 10m
Note: Record the CLUSTER-IP of mlflow-service (10.43.158.142 here); it is used as the tracking URI in the next step. Because the service type is NodePort, MLFlow is also reachable from outside the cluster at http://your-server-ip:30001.
Step 6: Run the MLFlow Experiment
1. Create a directory for your models.
mkdir Models
cd Models
2. Install the required Python packages.
pip install mlflow scikit-learn shap matplotlib
3. Set environment variables.
export MLFLOW_EXPERIMENT_NAME='my-sample-experiment'
export MLFLOW_TRACKING_URI='http://10.43.158.142'
Note: Replace 10.43.158.142 with the CLUSTER-IP shown in the previous step.
4. Create a Python script (main.py) and add your ML experiment code.
nano main.py
Add the following code:
# Import Libraries
import os
import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
import mlflow
from mlflow.artifacts import download_artifacts
from mlflow.tracking import MlflowClient
# Prepare the Training Data
X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X.iloc[:50, :4]
y = y.iloc[:50]
# Train a model
model = LinearRegression()
model.fit(X, y)
# Log a SHAP explanation for the model's predictions
with mlflow.start_run() as run:
    mlflow.shap.log_explanation(model.predict, X)
# List Artifacts
client = MlflowClient()
artifact_path = "model_explanations_shap"
artifacts = [x.path for x in client.list_artifacts(run.info.run_id, artifact_path)]
print("# artifacts:")
print(artifacts)
# Load the logged explanation
dst_path = download_artifacts(run_id=run.info.run_id, artifact_path=artifact_path)
base_values = np.load(os.path.join(dst_path, "base_values.npy"))
shap_values = np.load(os.path.join(dst_path, "shap_values.npy"))
# Show a Force Plot
shap.force_plot(float(base_values), shap_values[0, :], X.iloc[0, :], matplotlib=True)
5. Run the script.
python3 main.py
Output.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 249.00it/s]
🏃 View run persistent-frog-660 at: http://10.43.158.142/#/experiments/1/runs/e19dbaac59fe452bab13217fcc9aac49
🧪 View experiment at: http://10.43.158.142/#/experiments/1
# artifacts:
['model_explanations_shap/base_values.npy', 'model_explanations_shap/shap_values.npy', 'model_explanations_shap/summary_bar_plot.png']
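For a linear model, the logged base_values.npy and shap_values.npy have a simple closed form: each SHAP value is the coefficient times the feature's deviation from its background mean, and per row the SHAP values sum to the prediction minus the base value. A NumPy-only sketch of that property (hypothetical coefficients and random data standing in for the diabetes features; no shap dependency needed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # background data (stand-in for the 50x4 feature matrix)
coef = np.array([1.5, -2.0, 0.5, 3.0])  # hypothetical linear-model coefficients
intercept = 10.0

def predict(data):
    return data @ coef + intercept

# Base value: the model's average prediction over the background data
base_value = predict(X).mean()

# Exact SHAP value of feature j on instance x: coef[j] * (x[j] - mean(X[:, j]))
shap_values = coef * (X - X.mean(axis=0))

# Per instance, the SHAP values sum to (prediction - base value)
print(np.allclose(shap_values.sum(axis=1), predict(X) - base_value))  # → True
```

This is why the force plot in the script above balances: the arrows for each feature push the base value exactly to the model's prediction.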
Step 7: Copy Kubeconfig to the Local Machine
To manage your Kubernetes cluster remotely, copy the kubeconfig file from the server to your local machine.
1. Create a .kube directory on your local machine.
mkdir .kube
2. Copy the kubeconfig file from your server.
scp root@server-ip:/root/.kube/config .kube/
3. Edit the kubeconfig file.
nano .kube/config
Find the following line:
server: https://127.0.0.1:6443
And replace it with the following:
server: https://your-server-ip:6443
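If you prefer to script the edit instead of using an editor, the endpoint can be rewritten with sed. The snippet below demonstrates the substitution on a stand-in file; 203.0.113.10 is a placeholder for your server's IP (point sed at .kube/config on your machine):

```shell
# Create a stand-in kubeconfig line, then rewrite the loopback endpoint in place
printf 'server: https://127.0.0.1:6443\n' > /tmp/kubeconfig-demo
sed -i 's#https://127.0.0.1:6443#https://203.0.113.10:6443#' /tmp/kubeconfig-demo
cat /tmp/kubeconfig-demo   # server: https://203.0.113.10:6443
```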
4. Verify connectivity.
kubectl get services
Output.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
atlantic-mlflow ClusterIP 10.43.183.234 5000/TCP 28m
kubernetes ClusterIP 10.43.0.1 443/TCP 35m
mlflow-service NodePort 10.43.158.142 80:30001/TCP 10m
Step 8: Access MLFlow UI
To access the MLFlow UI from your local machine, forward the MLFlow service to a local port:
kubectl port-forward svc/mlflow-service 8880:80
Output.
Forwarding from 127.0.0.1:8880 -> 5000
Forwarding from [::1]:8880 -> 5000
Now, open your web browser and access the MLFlow UI at http://127.0.0.1:8880/#/experiments/1

Conclusion
You’ve successfully deployed MLFlow on a Kubernetes cluster, configured it for GPU support, and run a sample experiment. This setup can be extended to manage and track ML experiments for production-scale applications. Try it today on GPU hosting from Atlantic.Net!
* This post is for informational purposes only and does not constitute professional or technical advice. Every situation is unique and may require specialized guidance.
Readers should perform their own due diligence before making any decisions.