Semantic search moves beyond keyword matching by comparing the meaning of a query with the meaning of stored content, typically through vector embeddings. With GPU acceleration, both embedding generation and vector search run faster, which benefits applications such as e-commerce product discovery, enterprise document retrieval, and personalized recommendation systems.
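
To make the contrast with keyword matching concrete, here is a minimal sketch that compares word overlap with cosine similarity between embedding vectors. The three-dimensional vectors are hand-written for illustration only (an assumption, not output from a real model); the model used later in this guide produces 384 dimensions.

```python
import math

def keyword_overlap(query: str, doc: str) -> int:
    """Count the words that the query and the document share."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written "embeddings" used only for illustration.
TOY_EMBEDDINGS = {
    "cheap laptop": [0.9, 0.1, 0.0],
    "affordable notebook computer": [0.8, 0.2, 0.1],
    "banana bread recipe": [0.0, 0.1, 0.9],
}

query = "cheap laptop"
for doc in ("affordable notebook computer", "banana bread recipe"):
    overlap = keyword_overlap(query, doc)
    similarity = cosine_similarity(TOY_EMBEDDINGS[query], TOY_EMBEDDINGS[doc])
    print(f"{doc!r}: shared words={overlap}, cosine={similarity:.2f}")
```

Neither document shares a word with the query, so keyword matching cannot rank them, while the vector comparison places the notebook document far closer to the query than the recipe.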

In this guide, we’ll walk through deploying a high-performance semantic search application using Milvus, the open-source vector database developed by Zilliz, on an Ubuntu 24.04 GPU server.

Prerequisites

Before beginning, ensure you have:

  • An Ubuntu 24.04 server with an NVIDIA GPU (tested on A100, V100, and RTX 3090/4090).
  • NVIDIA drivers and the NVIDIA Container Toolkit installed, so that Docker containers can access the GPU.
  • A root user or a user with sudo privileges.
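
The checklist above can be partly verified with a short script. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes that a working driver installation exposes the `nvidia-smi` command on the PATH.

```python
import os
import shutil
import subprocess

def has_command(name: str) -> bool:
    """Return True if an executable called `name` is on the PATH."""
    return shutil.which(name) is not None

def gpu_names() -> list[str]:
    """List GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if not has_command("nvidia-smi"):
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

def is_root() -> bool:
    """Return True when running as root (always False where geteuid is missing)."""
    return hasattr(os, "geteuid") and os.geteuid() == 0

def check_prerequisites() -> dict[str, bool]:
    """Summarize which prerequisites appear to be satisfied."""
    return {
        "nvidia_driver": bool(gpu_names()),
        "root_or_sudo": is_root() or has_command("sudo"),
    }

print(check_prerequisites())
```

A `False` value for `nvidia_driver` means either that no driver is installed or that `nvidia-smi` could not talk to the GPU.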

Step 1: Set Up the Python Environment

1. Update the system packages.

apt update -y

2. Install required dependencies.

apt install -y python3-pip python3-venv build-essential libssl-dev docker-compose-v2

Step 2: Install Milvus with GPU Support

Zilliz offers several deployment options. We’ll use the open-source Milvus version with GPU acceleration:

1. Pull the GPU-enabled Milvus image.

docker pull milvusdb/milvus:latest-gpu

2. Create a docker-compose.yml file.

nano docker-compose.yml

Add the following content.

version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ./etcd_data:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ./minio_data:/minio_data
    command: minio server /minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:latest-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ./milvus_data:/var/lib/milvus
    ports:
      - "19530:19530"
    depends_on:
      - "etcd"
      - "minio"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

networks:
  default:
    name: milvus

3. Start the containers.

docker compose up -d

Step 3: Verify the Installation

1. Check that all services are running.

docker ps

Output.

CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS                             PORTS                                             NAMES
7546b3a7999a   milvusdb/milvus:latest-gpu                 "/tini -- milvus run…"   16 seconds ago   Up 16 seconds                      0.0.0.0:19530->19530/tcp, [::]:19530->19530/tcp   milvus-standalone
d700cd1e1af3   minio/minio:RELEASE.2023-03-20T20-16-18Z   "/usr/bin/docker-ent…"   16 seconds ago   Up 16 seconds (health: starting)   9000/tcp                                          milvus-minio
fbe9fd450a28   quay.io/coreos/etcd:v3.5.5                 "etcd -advertise-cli…"   16 seconds ago   Up 16 seconds                      2379-2380/tcp                                     milvus-etcd
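
Right after startup the containers may not accept connections yet (note the "health: starting" state above). As a minimal sketch, a script can poll the Milvus port until it opens before running any client code. The helper name `wait_for_port` is hypothetical, and an open port only shows that something is listening, not that Milvus has finished initializing.

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the setup in this guide, calling `wait_for_port("localhost", 19530)` returns `True` once the port published by the milvus-standalone container is reachable.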

2. Create a Python virtual environment.

python3 -m venv semantic_env
source semantic_env/bin/activate

3. Install the Milvus client.

pip install pymilvus

4. Test the Milvus connection.

python3 -c "from pymilvus import connections, utility; connections.connect('default', host='localhost', port='19530'); print(utility.get_server_version())"

Output.

v2.4.15-gpu

Step 4: Set Up the Semantic Search Application

1. Install required packages.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install sentence-transformers pymilvus fastapi uvicorn python-multipart

2. Create a file named semantic_search.py.

nano semantic_search.py

Add the following code.

import torch
from sentence_transformers import SentenceTransformer
from pymilvus import (
    connections,
    Collection,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType
)
from fastapi import FastAPI, Query, HTTPException
from typing import Optional
import uuid
import logging
from pydantic import BaseModel

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI
app = FastAPI(
    title="Semantic Search API",
    description="GPU-accelerated semantic search using Zilliz/Milvus",
    version="1.0"
)

# Configuration
COLLECTION_NAME = "semantic_search"
EMBEDDING_DIM = 384  # Dimension for 'all-MiniLM-L6-v2' model
MODEL_NAME = "all-MiniLM-L6-v2"
MILVUS_HOST = "localhost"
MILVUS_PORT = 19530

# Initialize the model (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
logger.info(f"Using device: {device}")
model = SentenceTransformer(MODEL_NAME, device=device)
logger.info(f"Initialized model on device: {device}")

# Pydantic models for API documentation
class Document(BaseModel):
    text: str
    doc_id: Optional[str] = None

class SearchResult(BaseModel):
    id: str
    text: str
    score: float

class SearchResponse(BaseModel):
    results: list[SearchResult]

class ErrorResponse(BaseModel):
    error: str
    detail: Optional[str] = None

def initialize_milvus():
    """Initialize connection and create collection if needed"""
    try:
        # Connect to Milvus
        connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT)
        
        # Create collection if it doesn't exist
        if not utility.has_collection(COLLECTION_NAME):
            fields = [
                FieldSchema(name="id", dtype=DataType.VARCHAR, 
                         is_primary=True, auto_id=False, max_length=100),
                FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000),
                FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM)
            ]
            
            schema = CollectionSchema(fields, "Semantic search collection")
            collection = Collection(COLLECTION_NAME, schema, consistency_level="Strong")
            
            # Create index
            index_params = {
                "index_type": "IVF_FLAT",
                "metric_type": "L2",
                "params": {"nlist": 128}
            }
            
            collection.create_index("embedding", index_params)
            logger.info("Created new collection with index")
        else:
            collection = Collection(COLLECTION_NAME)
            logger.info("Connected to existing collection")
            
        # Explicit load for version compatibility
        try:
            collection.load()
        except Exception as e:
            logger.warning(f"Load warning (may be normal in some versions): {str(e)}")
            
        return collection
        
    except Exception as e:
        logger.error(f"Milvus initialization failed: {str(e)}")
        # This runs at import time, outside any request, so raise a plain error
        raise RuntimeError(f"Database connection failed: {str(e)}") from e

# Initialize Milvus connection on startup
try:
    collection = initialize_milvus()
except Exception as e:
    logger.error(f"Failed to initialize Milvus: {str(e)}")
    collection = None

@app.on_event("shutdown")
def shutdown_event():
    """Clean up on shutdown"""
    if connections.has_connection("default"):
        connections.disconnect("default")
    logger.info("Disconnected from Milvus")

@app.post("/documents/",
         response_model=dict,
         responses={500: {"model": ErrorResponse}})
async def add_document(document: Document):
    """Add a document to the search index"""
    try:
        if collection is None:
            raise HTTPException(status_code=503, detail="Service unavailable")
            
        # Generate embedding
        embedding = model.encode(document.text)
        
        # Prepare data
        doc_id = document.doc_id if document.doc_id else str(uuid.uuid4())
        data = [
            [doc_id],
            [document.text],
            [embedding.tolist()]
        ]
        
        # Insert into Milvus
        collection.insert(data)
        collection.flush()
        
        return {"status": "success", "id": doc_id}
        
    except HTTPException:
        # Re-raise deliberate HTTP errors (such as the 503 above) unchanged
        raise
    except Exception as e:
        logger.error(f"Document insertion failed: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Document processing failed: {str(e)}"
        )

@app.get("/search/",
        response_model=SearchResponse,
        responses={
            400: {"model": ErrorResponse},
            500: {"model": ErrorResponse},
            503: {"model": ErrorResponse}
        })
async def search(
    query: str = Query(..., min_length=1, max_length=1000),
    top_k: int = Query(5, ge=1, le=100)
):
    """Search for similar documents"""
    try:
        if collection is None:
            raise HTTPException(status_code=503, detail="Service unavailable")
            
        # Version-compatible load check
        try:
            if hasattr(collection, 'is_loaded') and not collection.is_loaded:
                collection.load()
            elif not hasattr(collection, 'is_loaded'):
                collection.load()  # Force load for older versions
        except Exception as e:
            logger.warning(f"Load warning: {str(e)}")
        
        # Generate embedding
        query_embedding = model.encode(query)
        
        # Search parameters
        search_params = {
            "metric_type": "L2",
            "params": {"nprobe": 10}
        }
        
        # Execute search
        results = collection.search(
            data=[query_embedding.tolist()],
            anns_field="embedding",
            param=search_params,
            limit=top_k,
            output_fields=["text"]
        )
        
        # Format results
        ret = []
        for hits in results:
            for hit in hits:
                ret.append({
                    "id": hit.id,
                    "text": hit.entity.get("text"),
                    "score": 1 - hit.distance  # Heuristic: higher is closer; negative when the L2 distance exceeds 1
                })
        
        return {"results": ret}
        
    except HTTPException:
        # Re-raise deliberate HTTP errors (such as the 503 above) unchanged
        raise
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Search failed: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Search processing failed: {str(e)}"
        )

@app.get("/status")
async def service_status():
    """Check service health"""
    try:
        milvus_ok = connections.has_connection("default")
        gpu_available = torch.cuda.is_available()
        
        # Version-compatible collection check
        collection_status = False
        if collection is not None:
            try:
                if hasattr(collection, 'is_loaded'):
                    collection_status = collection.is_loaded
                else:
                    # For older versions, assume loaded if we have a collection object
                    collection_status = True
            except Exception:
                collection_status = False
        
        return {
            "milvus_connected": milvus_ok,
            "gpu_available": gpu_available,
            "collection_loaded": collection_status,
            "model": MODEL_NAME,
            "status": "healthy" if all([milvus_ok, collection_status]) else "degraded"
        }
    except Exception as e:
        return {
            "status": "unhealthy",
            "error": str(e)
        }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
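
The `nlist` and `nprobe` values in the code above belong to an inverted-file (IVF) index: vectors are grouped into `nlist` clusters at build time, and each query scans only the `nprobe` clusters closest to it. The pure-Python sketch below illustrates that trade-off. It is a toy under stated assumptions, not how Milvus implements IVF_FLAT: the class `ToyIVF` is hypothetical, and centroids are sampled from the data instead of being trained with k-means.

```python
import random

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVF:
    """Toy inverted-file index; illustrative only."""

    def __init__(self, vectors, nlist, seed=0):
        rng = random.Random(seed)
        # Simplification: sample centroids from the data instead of running k-means.
        self.centroids = rng.sample(vectors, nlist)
        self.vectors = vectors
        self.buckets = [[] for _ in range(nlist)]
        for idx, vec in enumerate(vectors):
            nearest = self._nearest_centroids(vec, 1)[0]
            self.buckets[nearest].append(idx)

    def _nearest_centroids(self, vec, count):
        """Return the indices of the `count` centroids closest to `vec`."""
        order = sorted(range(len(self.centroids)),
                       key=lambda c: squared_l2(vec, self.centroids[c]))
        return order[:count]

    def search(self, query, top_k, nprobe):
        """Scan only the `nprobe` closest buckets and return the best `top_k` IDs."""
        candidates = [idx
                      for c in self._nearest_centroids(query, nprobe)
                      for idx in self.buckets[c]]
        candidates.sort(key=lambda idx: squared_l2(query, self.vectors[idx]))
        return candidates[:top_k]

rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
index = ToyIVF(data, nlist=16)
query = [rng.random() for _ in range(8)]

exact = index.search(query, top_k=5, nprobe=16)   # scans every bucket
approx = index.search(query, top_k=5, nprobe=2)   # scans 2 of 16 buckets
print("exact: ", exact)
print("approx:", approx)
print("overlap:", len(set(exact) & set(approx)), "of 5")
```

Setting `nprobe` equal to `nlist` scans every bucket and so matches a brute-force search; lower values scan fewer vectors at the risk of missing some true neighbors.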

Step 5: Launch the Application

1. Start the FastAPI service.

python3 semantic_search.py

Output.

INFO:     Started server process [59814]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

2. Your semantic search service is now running on port 8000.

Step 6: Test the Application

1. Index the sentence “The quick brown fox jumps over the lazy dog” as a document.

curl -X POST "http://localhost:8000/documents/" -H "Content-Type: application/json" -d '{"text": "The quick brown fox jumps over the lazy dog"}'

Output.

{"status":"success","id":"c9dbb77f-a41f-4056-bd4b-69eab9334ae3"}
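
The same request can be issued from Python, which is convenient when loading many documents. This is a minimal standard-library sketch; the helper names are hypothetical, and `BASE_URL` assumes the service from Step 5 is listening on localhost port 8000.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: service from Step 5 runs here

def build_add_request(text, doc_id=None):
    """Build the POST request for /documents/ without sending it."""
    payload = {"text": text}
    if doc_id is not None:
        payload["doc_id"] = doc_id
    return urllib.request.Request(
        f"{BASE_URL}/documents/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def add_document(text, doc_id=None):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_add_request(text, doc_id), timeout=30) as resp:
        return json.load(resp)

# Inspect a request without contacting the server.
request = build_add_request("The quick brown fox jumps over the lazy dog")
print(request.get_method(), request.full_url)
```

With the service running, `add_document("some text")` returns the same JSON structure as the curl call above.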

2. Search the indexed documents for “brown fox” using curl.

curl -G "http://localhost:8000/search/" --data-urlencode "query=brown fox" --data "top_k=2"

Output.

{"results":[{"id":"c9dbb77f-a41f-4056-bd4b-69eab9334ae3","text":"The quick brown fox jumps over the lazy dog","score":0.24801254272460938},{"id":"f27c880e-9b18-4a6f-b625-dc57e393117d","text":"The quick brown fox jumps over the lazy dog","score":0.24801254272460938}]} 
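
The `score` field is computed as `1 - distance` with the L2 metric, so it is not a cosine similarity and can become negative for distant documents. A sketch of a conversion follows, under two assumptions that should be checked for your setup: the embeddings are unit-length, and the reported distance is the squared Euclidean distance. The function name is hypothetical.

```python
import math

def cosine_from_squared_l2(squared_distance: float) -> float:
    """Cosine similarity of two unit vectors given their squared L2 distance.

    For unit vectors, |a - b|^2 = 2 - 2 * cos(a, b), so cos = 1 - d / 2.
    """
    return 1.0 - squared_distance / 2.0

# Check the identity on two hand-made unit vectors 60 degrees apart.
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
squared = sum((x - y) ** 2 for x, y in zip(a, b))
print(round(cosine_from_squared_l2(squared), 6))  # cos(60 degrees) is 0.5
```

Under those assumptions, the score of about 0.248 shown above corresponds to a distance of about 0.752 and a cosine similarity of about 0.624.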

3. Check the health status of your application.

curl "http://localhost:8000/status"

Output.

{"milvus_connected":true,"gpu_available":true,"collection_loaded":true,"model":"all-MiniLM-L6-v2","status":"healthy"}
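
For monitoring, the status payload can be turned into a list of concrete problems. The sketch below uses the field names returned by the /status endpoint in this guide; the function name `find_problems` is hypothetical.

```python
import json

def find_problems(status: dict) -> list[str]:
    """Return a human-readable list of problems found in a /status payload."""
    problems = []
    if "error" in status:
        problems.append(f"service error: {status['error']}")
    if not status.get("milvus_connected", False):
        problems.append("Milvus connection is down")
    if not status.get("collection_loaded", False):
        problems.append("collection is not loaded")
    if not status.get("gpu_available", False):
        problems.append("no GPU visible to PyTorch (embeddings run on CPU)")
    return problems

sample = json.loads(
    '{"milvus_connected":true,"gpu_available":true,'
    '"collection_loaded":true,"model":"all-MiniLM-L6-v2","status":"healthy"}'
)
print(find_problems(sample) or "no problems found")
```

The GPU check is worth keeping separate: the endpoint reports "healthy" based only on the Milvus connection and the collection state, so a missing GPU does not change the `status` field.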

Conclusion

You’ve now deployed a semantic search application built on Milvus, the vector database developed by Zilliz, on an Ubuntu 24.04 GPU server. The architecture can be extended with custom embedding models, hybrid search (combining vectors and keywords), and real-time indexing for dynamic content.
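
As one possible starting point for hybrid search, the ranked results of a vector search and a keyword search can be merged with reciprocal rank fusion (RRF). RRF is a technique chosen here for illustration and is not part of the application above; the document IDs are hypothetical and `k=60` is a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; IDs ranked high in any list score more."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["doc-a", "doc-b", "doc-c"]   # hypothetical vector-search ranking
keyword_hits = ["doc-b", "doc-d", "doc-a"]  # hypothetical keyword-search ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that appear in both lists ("doc-a" and "doc-b") rise to the top because their reciprocal-rank contributions add up.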