Semantic search moves beyond keyword matching to interpret the intent and contextual meaning behind a query. Combined with GPU acceleration, a semantic search system can embed and index large document collections quickly, which suits applications such as e-commerce product discovery, enterprise document retrieval, and personalized recommendations.
In this guide, we’ll walk through deploying a semantic search application using Milvus (the open-source vector database developed by Zilliz) on an Ubuntu 24.04 GPU server.
Prerequisites
Before beginning, ensure you have:
- An Ubuntu 24.04 server with an NVIDIA GPU (tested on A100, V100, and RTX 3090/4090).
- NVIDIA drivers and the NVIDIA Container Toolkit installed, so that Docker can pass the GPU through to containers.
- A root user or a user with sudo privileges.
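Before continuing, it can help to confirm the prerequisites from a script. The following is a minimal sketch using only the Python standard library; the helper names check_tool and gpu_summary are hypothetical, and the check only confirms that the tools are on PATH, not that the driver version suits your GPU.

```python
import shutil
import subprocess


def check_tool(name: str) -> bool:
    """Return True if an executable called `name` is on PATH."""
    return shutil.which(name) is not None


def gpu_summary() -> str:
    """Return the GPU list reported by nvidia-smi, or a short message if it is missing."""
    if not check_tool("nvidia-smi"):
        return "nvidia-smi not found (is the NVIDIA driver installed?)"
    result = subprocess.run(
        ["nvidia-smi", "-L"],  # -L lists the GPUs the driver can see
        capture_output=True,
        text=True,
        check=False,  # do not raise if nvidia-smi exits with an error
    )
    return result.stdout.strip() or result.stderr.strip()


# Report which of the prerequisite tools are present.
report = {tool: check_tool(tool) for tool in ("nvidia-smi", "sudo")}
for tool, found in report.items():
    print(f"{tool:12s} {'found' if found else 'MISSING'}")
print(gpu_summary())
```

If nvidia-smi is reported as missing, install the NVIDIA driver before moving on, since the GPU container will not start without it.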
Step 1: Install System Dependencies
1. Update the system packages.
apt update -y
2. Install required dependencies.
apt install -y python3-pip python3-venv build-essential libssl-dev docker-compose-v2
Step 2: Install Milvus with GPU Support
Zilliz offers several deployment options for Milvus. We’ll use the open-source Milvus standalone image with GPU support:
1. Pull the GPU-enabled Milvus image.
docker pull milvusdb/milvus:latest-gpu
2. Create a docker-compose.yml file.
nano docker-compose.yml
Add the following content.
version: '3.5'
services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ./etcd_data:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ./minio_data:/minio_data
    command: minio server /minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:latest-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ./milvus_data:/var/lib/milvus
    ports:
      - "19530:19530"
    depends_on:
      - "etcd"
      - "minio"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
networks:
  default:
    name: milvus
3. Start the containers.
docker compose up -d
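The containers can take a little while to start accepting connections after the command returns. As a rough sketch, the hypothetical helper below polls the Milvus port (19530, as mapped in the compose file) until a TCP connection succeeds. An open port only shows that something is listening; it does not prove that Milvus has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("localhost", 19530) returns True once the port accepts connections and False if it is still closed after 60 seconds.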
Step 3: Verify the Installation
1. Check that all services are running.
docker ps
Output.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7546b3a7999a milvusdb/milvus:latest-gpu "/tini -- milvus run…" 16 seconds ago Up 16 seconds 0.0.0.0:19530->19530/tcp, [::]:19530->19530/tcp milvus-standalone
d700cd1e1af3 minio/minio:RELEASE.2023-03-20T20-16-18Z "/usr/bin/docker-ent…" 16 seconds ago Up 16 seconds (health: starting) 9000/tcp milvus-minio
fbe9fd450a28 quay.io/coreos/etcd:v3.5.5 "etcd -advertise-cli…" 16 seconds ago Up 16 seconds 2379-2380/tcp milvus-etcd
2. Create a Python virtual environment.
python3 -m venv semantic_env
source semantic_env/bin/activate
3. Install the Milvus client.
pip install pymilvus
4. Test the Milvus connection.
python3 -c "from pymilvus import connections, utility; connections.connect('default', host='localhost', port='19530'); print(utility.get_server_version())"
Output.
v2.4.15-gpu
Step 4: Set Up the Semantic Search Application
1. Install required packages.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install sentence-transformers pymilvus fastapi uvicorn python-multipart
2. Create a file named semantic_search.py.
nano semantic_search.py
Add the following code.
import torch
from sentence_transformers import SentenceTransformer
from pymilvus import (
    connections,
    Collection,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType
)
import numpy as np
from fastapi import FastAPI, Query, HTTPException
from fastapi.responses import JSONResponse
from typing import Optional
import uuid
import logging
from pydantic import BaseModel

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI
app = FastAPI(
    title="Semantic Search API",
    description="GPU-accelerated semantic search using Zilliz/Milvus",
    version="1.0"
)

# Configuration
COLLECTION_NAME = "semantic_search"
EMBEDDING_DIM = 384  # Dimension for 'all-MiniLM-L6-v2' model
MODEL_NAME = "all-MiniLM-L6-v2"
MILVUS_HOST = "localhost"
MILVUS_PORT = 19530

# Initialize the model (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
logger.info(f"Using device: {device}")
model = SentenceTransformer(MODEL_NAME, device=device)
logger.info(f"Initialized model on device: {device}")


# Pydantic models for API documentation
class Document(BaseModel):
    text: str
    doc_id: Optional[str] = None


class SearchResult(BaseModel):
    id: str
    text: str
    score: float


class SearchResponse(BaseModel):
    results: list[SearchResult]


class ErrorResponse(BaseModel):
    error: str
    detail: Optional[str] = None


def initialize_milvus():
    """Initialize connection and create collection if needed"""
    try:
        # Connect to Milvus
        connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT)

        # Create collection if it doesn't exist
        if not utility.has_collection(COLLECTION_NAME):
            fields = [
                FieldSchema(name="id", dtype=DataType.VARCHAR,
                            is_primary=True, auto_id=False, max_length=100),
                FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000),
                FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM)
            ]
            schema = CollectionSchema(fields, "Semantic search collection")
            collection = Collection(COLLECTION_NAME, schema, consistency_level="Strong")

            # Create index
            index_params = {
                "index_type": "IVF_FLAT",
                "metric_type": "L2",
                "params": {"nlist": 128}
            }
            collection.create_index("embedding", index_params)
            logger.info("Created new collection with index")
        else:
            collection = Collection(COLLECTION_NAME)
            logger.info("Connected to existing collection")

        # Explicit load for version compatibility
        try:
            collection.load()
        except Exception as e:
            logger.warning(f"Load warning (may be normal in some versions): {str(e)}")

        return collection
    except Exception as e:
        logger.error(f"Milvus initialization failed: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Database connection failed: {str(e)}"
        )


# Initialize Milvus connection on startup
try:
    collection = initialize_milvus()
except Exception as e:
    logger.error(f"Failed to initialize Milvus: {str(e)}")
    collection = None


@app.on_event("shutdown")
def shutdown_event():
    """Clean up on shutdown"""
    if connections.has_connection("default"):
        connections.disconnect("default")
        logger.info("Disconnected from Milvus")


@app.post("/documents/",
          response_model=dict,
          responses={500: {"model": ErrorResponse}})
async def add_document(document: Document):
    """Add a document to the search index"""
    try:
        if collection is None:
            raise HTTPException(status_code=503, detail="Service unavailable")

        # Generate embedding
        embedding = model.encode(document.text)

        # Prepare data (one list per field, in schema order)
        doc_id = document.doc_id if document.doc_id else str(uuid.uuid4())
        data = [
            [doc_id],
            [document.text],
            [embedding.tolist()]
        ]

        # Insert into Milvus and flush so the document is persisted
        collection.insert(data)
        collection.flush()
        return {"status": "success", "id": doc_id}
    except HTTPException:
        # Pass deliberate HTTP errors (such as the 503 above) through unchanged
        raise
    except Exception as e:
        logger.error(f"Document insertion failed: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Document processing failed: {str(e)}"
        )


@app.get("/search/",
         response_model=SearchResponse,
         responses={
             400: {"model": ErrorResponse},
             500: {"model": ErrorResponse},
             503: {"model": ErrorResponse}
         })
async def search(
    query: str = Query(..., min_length=1, max_length=1000),
    top_k: int = Query(5, ge=1, le=100)
):
    """Search for similar documents"""
    try:
        if collection is None:
            raise HTTPException(status_code=503, detail="Service unavailable")

        # Version-compatible load check
        try:
            if hasattr(collection, 'is_loaded') and not collection.is_loaded:
                collection.load()
            elif not hasattr(collection, 'is_loaded'):
                collection.load()  # Force load for older versions
        except Exception as e:
            logger.warning(f"Load warning: {str(e)}")

        # Generate embedding
        query_embedding = model.encode(query)

        # Search parameters
        search_params = {
            "metric_type": "L2",
            "params": {"nprobe": 10}
        }

        # Execute search
        results = collection.search(
            data=[query_embedding.tolist()],
            anns_field="embedding",
            param=search_params,
            limit=top_k,
            output_fields=["text"]
        )

        # Format results
        ret = []
        for hits in results:
            for hit in hits:
                ret.append({
                    "id": hit.id,
                    "text": hit.entity.get("text"),
                    # Heuristic score: higher means closer; it can go negative for large L2 distances
                    "score": 1 - hit.distance
                })
        return {"results": ret}
    except HTTPException:
        # Pass deliberate HTTP errors (such as the 503 above) through unchanged
        raise
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Search failed: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Search processing failed: {str(e)}"
        )


@app.get("/status")
async def service_status():
    """Check service health"""
    try:
        milvus_ok = connections.has_connection("default")
        gpu_available = torch.cuda.is_available()

        # Version-compatible collection check
        collection_status = False
        if collection is not None:
            try:
                if hasattr(collection, 'is_loaded'):
                    collection_status = collection.is_loaded
                else:
                    # For older versions, assume loaded if we have a collection object
                    collection_status = True
            except Exception:
                collection_status = False

        return {
            "milvus_connected": milvus_ok,
            "gpu_available": gpu_available,
            "collection_loaded": collection_status,
            "model": MODEL_NAME,
            "status": "healthy" if all([milvus_ok, collection_status]) else "degraded"
        }
    except Exception as e:
        return {
            "status": "unhealthy",
            "error": str(e)
        }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
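A note on where the GPU is used: the index created in initialize_milvus() is IVF_FLAT. As we understand the Milvus documentation, that index type is built and searched on the CPU even in the GPU image, so in the code above the GPU mainly accelerates the embedding model. Milvus also documents GPU index types such as GPU_IVF_FLAT. The sketch below shows one way the parameter dictionaries might be switched; the helper names are hypothetical, and whether a GPU index is actually faster for your data is something to measure rather than assume.

```python
def build_index_params(use_gpu: bool, nlist: int = 128) -> dict:
    """Build the dict passed to collection.create_index() for the embedding field."""
    return {
        # Assumption: GPU_IVF_FLAT is available in the GPU-enabled Milvus image
        "index_type": "GPU_IVF_FLAT" if use_gpu else "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": nlist},
    }


def build_search_params(nprobe: int = 10) -> dict:
    """Build the dict passed as `param` to collection.search()."""
    return {"metric_type": "L2", "params": {"nprobe": nprobe}}


print(build_index_params(use_gpu=True))
# {'index_type': 'GPU_IVF_FLAT', 'metric_type': 'L2', 'params': {'nlist': 128}}
```

An index on an existing collection is not changed by editing this dictionary alone; the collection generally has to be released and its index rebuilt with the new parameters.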
Step 5: Launch the Application
1. Start the FastAPI service.
python3 semantic_search.py
Output.
INFO: Started server process [59814]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2. Your semantic search service is now running on port 8000.
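Once the service is listening, it can be called from Python as well as from the command line. The following is a minimal client sketch using only the standard library. It assumes the service is reachable at http://localhost:8000, and the helper names (build_add_request, build_search_url, send) are hypothetical; the endpoint paths and field names are taken from semantic_search.py.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumption: service runs on the same host


def build_add_request(text: str, doc_id: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for the /documents/ endpoint."""
    payload = {"text": text}
    if doc_id is not None:
        payload["doc_id"] = doc_id
    return urllib.request.Request(
        f"{BASE_URL}/documents/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_search_url(query: str, top_k: int = 5) -> str:
    """Build the GET URL for the /search/ endpoint with URL-encoded parameters."""
    params = urllib.parse.urlencode({"query": query, "top_k": top_k})
    return f"{BASE_URL}/search/?{params}"


def send(request_or_url) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request_or_url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, send(build_add_request("some text")) adds a document and send(build_search_url("brown fox", top_k=2)) runs a search. Nothing is sent over the network until send() is called.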
Step 6: Test the Application
1. Add a document containing the text “The quick brown fox jumps over the lazy dog”.
curl -X POST "http://localhost:8000/documents/" -H "Content-Type: application/json" -d '{"text": "The quick brown fox jumps over the lazy dog"}'
Output.
{"status":"success","id":"c9dbb77f-a41f-4056-bd4b-69eab9334ae3"}
2. Search the indexed documents for “brown fox”.
curl -G "http://localhost:8000/search/" --data-urlencode "query=brown fox" --data "top_k=2"
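Output. Two hits with identical text appear here, which indicates the same sentence had been added twice before this search ran, each time under a new ID.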
{"results":[{"id":"c9dbb77f-a41f-4056-bd4b-69eab9334ae3","text":"The quick brown fox jumps over the lazy dog","score":0.24801254272460938},{"id":"f27c880e-9b18-4a6f-b625-dc57e393117d","text":"The quick brown fox jumps over the lazy dog","score":0.24801254272460938}]}
3. Check the health status of your application.
curl "http://localhost:8000/status"
Output.
{"milvus_connected":true,"gpu_available":true,"collection_loaded":true,"model":"all-MiniLM-L6-v2","status":"healthy"}
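A note on reading the score field in the search output: the application computes it as 1 minus the L2 distance, which is a convenient ranking value but not a cosine similarity. Under two assumptions that are worth verifying for your setup (that Milvus reports the squared Euclidean distance for the L2 metric, and that the embedding model returns unit-length vectors), the cosine similarity can be recovered as 1 - distance / 2. The sketch below checks that identity on toy vectors in plain Python; the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """For unit vectors, |a - b|^2 = 2 - 2 * cos(a, b), so cos = 1 - d / 2."""
    return 1.0 - distance / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
d = squared_l2(a, b)
# The two printed values agree up to floating-point rounding.
print(round(cosine(a, b), 6), round(cosine_from_squared_l2(d), 6))
```

Applied to the sample output, a score of about 0.248 corresponds to a distance of about 0.752, which under these assumptions would be a cosine similarity of roughly 0.624.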
Conclusion
You’ve now deployed a semantic search application backed by Milvus on an Ubuntu 24.04 GPU server. The architecture can be extended with custom embedding models, hybrid search (combining vectors and keywords), and real-time indexing for dynamic content.