Exercise 6: Deployment & Serving on OpenShift
In this exercise, you will deploy your fine-tuned LLM to OpenShift/Kubernetes using vLLM for optimized serving. You'll learn how to containerize your model, configure resource limits, and test the deployed service.
Learning Objectives
By the end of this exercise, you will be able to:
- Containerize a fine-tuned LLM with vLLM for serving
- Configure Kubernetes deployment manifests for LLM serving
- Set appropriate resource limits and requests for LLM workloads
- Test and validate the deployed LLM endpoint
- Understand LLM serving lifecycle management concepts
Prerequisites
Before starting this exercise, ensure you have:
1. Completed Exercise 5: Model Versioning & Packaging
2. A merged model available in the MLflow model registry or local models directory
3. Access to an OpenShift/Kubernetes cluster
4. The oc or kubectl CLI tools configured
Step 1: Review the Dockerfile
Let's first examine the Dockerfile that will be used to containerize our LLM serving application:
# Dockerfile for LLMOps Instruction Tuning Workshop
# Optimized for vLLM serving with fine-tuned models
FROM quay.io/modh/runtime-images:runtime-datascience-ubi9-python-3.11-20250703
# Set working directory
WORKDIR /app
# Install system dependencies
RUN dnf install -y git && \
    dnf clean all
# Copy requirements
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY scripts/ /app/scripts/
COPY models/ /app/models/
# Expose port for vLLM server
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
# Default command - vLLM server
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "/app/models/merged_model", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--dtype", "auto", \
     "--max-model-len", "2048"]
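Key points about this Dockerfile:
- Uses the Red Hat runtime datascience image as base
- Installs dependencies from requirements.txt
- Copies the scripts and models directories into the image, so the model weights ship inside the container
- Exposes port 8000 for the vLLM server
- Defines a container health check that polls the server's /health endpoint (Kubernetes ignores HEALTHCHECK and relies on the probes defined in the deployment manifest)
- Runs vLLM's OpenAI-compatible API server as the default command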
Step 2: Build and Push the Container Image
If you haven't already built and pushed your container image from Exercise 5, you can do so now:
# Build the Docker image
docker build -t llm-instruction-tuning:latest .
# Tag for your registry (replace with your registry path)
docker tag llm-instruction-tuning:latest <your-registry>/llm-instruction-tuning:latest
# Push to registry
docker push <your-registry>/llm-instruction-tuning:latest
In the OpenShift AI environment, you might use the integrated registry:
# Login to OpenShift registry
oc login <openshift-cluster>
docker login -u $(oc whoami) -p $(oc whoami -t) <image-registry-path>
# Build and push
docker build -t <image-registry-path>/llm-instruction-tuning:latest .
docker push <image-registry-path>/llm-instruction-tuning:latest
Step 3: Examine the Kubernetes Manifests
Let's look at the Kubernetes deployment and service manifests:
Deployment Manifest (k8s/deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-instruction-tuning-deployment
  labels:
    app: llm-instruction-tuning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-instruction-tuning
  template:
    metadata:
      labels:
        app: llm-instruction-tuning
    spec:
      containers:
      - name: llm-server
        image: <your-registry>/llm-instruction-tuning:latest
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_NAME
          value: "TinyLlama-1.1B-Chat-v1.0"
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
          limits:
            memory: "16Gi"
            cpu: "4"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 20
Service Manifest (k8s/service.yaml)
apiVersion: v1
kind: Service
metadata:
  name: llm-instruction-tuning-service
  labels:
    app: llm-instruction-tuning
spec:
  selector:
    app: llm-instruction-tuning
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: ClusterIP
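Key considerations in these manifests:
- Resource requests tell the scheduler how much memory and CPU to reserve for the pod; limits cap what the container may use, and a container that exceeds its memory limit is OOM-killed
- The readiness probe keeps traffic away from the pod until /health responds; the liveness probe restarts the container if /health stops responding
- The service exposes the deployment internally in the cluster
- For external access, you would typically create a Route (OpenShift) or Ingress (Kubernetes)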
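The probe settings in the deployment manifest imply a rough startup budget: if the model takes longer to load than the liveness probe tolerates, the container is restarted before it ever serves a request. The sketch below estimates that budget. It assumes the Kubernetes default failureThreshold of 3, since the manifest does not set the field, and the names Probe and seconds_until_threshold are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Timing fields of a Kubernetes HTTP probe."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int = 3  # assumed: Kubernetes default when the field is omitted


def seconds_until_threshold(probe: Probe) -> int:
    """Approximate time at which the last tolerated probe attempt runs.

    The first attempt runs after initial_delay_seconds and each further
    attempt follows one period later. Real kubelet timing adds jitter, so
    treat the result as a rough budget, not an exact deadline.
    """
    return probe.initial_delay_seconds + (probe.failure_threshold - 1) * probe.period_seconds


# Values copied from k8s/deployment.yaml
readiness = Probe(initial_delay_seconds=60, period_seconds=10)
liveness = Probe(initial_delay_seconds=120, period_seconds=20)

# Rough time the server has to start answering /health before a restart.
print(seconds_until_threshold(liveness))
```

With the values above the estimate is 160 seconds. If model loading on your cluster takes longer than that, raising initialDelaySeconds on the liveness probe (or adding a startupProbe) is one way to avoid a restart loop.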
Step 4: Deploy to OpenShift/Kubernetes
Now let's deploy our LLM serving application:
# Create a namespace for our workshop (optional)
oc new-project llm-workshop --display-name="LLM Instruction Tuning Workshop"
# Apply the Kubernetes manifests
oc apply -f k8s/deployment.yaml
oc apply -f k8s/service.yaml
# Check the status of our deployment
oc get deployments
oc get pods
oc get services
Step 5: Access the LLM Endpoint
Once the deployment is running, you can access the LLM endpoint:
Option 1: Port Forwarding (for testing)
# Port forward to access the service locally
oc port-forward service/llm-instruction-tuning-service 8000:80
# Now you can access the API at http://localhost:8000/v1/chat/completions
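Because the model can take a while to load, the first requests through the port forward may fail. A minimal sketch of a wait loop, assuming the /health path used by the probes in the deployment manifest; the function names http_ok and wait_until_healthy are hypothetical helpers, not part of the workshop scripts.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as "not ready".
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 30,
    interval: float = 10.0,
    check: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until check succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


# Usage (with the port forward above running):
#   wait_until_healthy("http://localhost:8000/health")
```

The check and sleep parameters are injectable so the loop can be exercised without a running server.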
Option 2: Create a Route (OpenShift) or Ingress (Kubernetes)
For OpenShift:
oc expose svc/llm-instruction-tuning-service --hostname=llm-workshop.<your-domain>
For Kubernetes:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-instruction-tuning-ingress
spec:
  rules:
  - host: llm-workshop.<your-domain>
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llm-instruction-tuning-service
            port:
              number: 80
Step 6: Test the Deployed LLM
Now let's test our deployed LLM using the test client we created earlier:
# If using port forwarding
python scripts/test_client.py --endpoint http://localhost:8000/v1/chat/completions
# If using a route/ingress (oc expose creates a plain HTTP route; use https:// only if TLS is configured)
python scripts/test_client.py --endpoint http://llm-workshop.<your-domain>/v1/chat/completions
You can also test directly with curl:
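# The "model" field must match the name vLLM serves. Without --served-model-name,
# that is the value passed to --model in the Dockerfile.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/app/models/merged_model",
    "messages": [{"role": "user", "content": "What is MLOps?"}],
    "max_tokens": 100
  }'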
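The same request can be sent from Python with only the standard library. This is a minimal sketch, not the workshop's scripts/test_client.py; the function names are illustrative, and the response layout assumed in extract_reply is the standard OpenAI chat completion shape that vLLM's server follows.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from the first choice of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(endpoint: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST a single-turn chat request and return the generated text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Usage: model_name should be an id listed by GET /v1/models on the server.
#   chat("http://localhost:8000/v1/chat/completions", model_name, "What is MLOps?")
```

Keeping payload construction and response parsing in separate functions makes both easy to check without a live endpoint.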
Step 7: Monitor and Manage the Deployment
Checking Logs
oc logs -f deployment/llm-instruction-tuning-deployment
Scaling the Deployment
# Scale to 2 replicas for higher availability
oc scale deployment/llm-instruction-tuning-deployment --replicas=2
Updating the Model
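When you have a new version of your model:
1. Build a new container image with the updated model
2. Update the image in the deployment manifest, preferably to a new version tag (re-applying an unchanged manifest that still points at :latest does not start a rollout)
3. Apply the updated manifest: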
oc apply -f k8s/deployment.yaml
Resource Monitoring
Monitor resource usage to ensure your requests and limits are appropriate:
oc top pods
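The values reported by oc top pods use Kubernetes quantity notation (for example 250m for CPU and 7012Mi for memory), the same notation as the requests and limits in the manifest. The sketch below converts such strings so usage can be compared against a limit. The helper names are hypothetical, and only the common suffixes are handled.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '16Gi' or '7012Mi' to bytes."""
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _MEMORY_UNITS[suffix])
    return int(quantity)  # a plain number means bytes


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def usage_ratio(used: str, limit: str) -> float:
    """Fraction of the memory limit currently in use."""
    return parse_memory(used) / parse_memory(limit)
```

For example, usage_ratio("12Gi", "16Gi") returns 0.75. A ratio that stays close to 1.0 suggests the memory limit is too tight for the workload.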
Key Concepts: Kubernetes-native LLM Serving
Why Resource Management is Critical for LLMs
LLM serving workloads have unique resource characteristics:
- High memory usage for model weights and KV cache
- Variable CPU usage depending on batch size and sequence length
- GPU acceleration benefits significantly (when available)
- Cold start times can be long due to model loading
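The memory point can be made concrete with a back-of-the-envelope estimate. The sketch below assumes 16-bit weights and cache values, and a TinyLlama-1.1B-style architecture (22 layers, 4 key/value heads, head dimension 64); these shape values are assumptions to verify against your model's config.json, and the function names are illustrative.

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights; 2 bytes per parameter corresponds to fp16/bf16."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Memory for one token's keys and values across all layers."""
    # Factor 2 accounts for storing both a key and a value vector per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed TinyLlama-1.1B-style shape; verify against the model's config.json.
NUM_PARAMS = 1.1e9
NUM_LAYERS = 22
NUM_KV_HEADS = 4
HEAD_DIM = 64
MAX_MODEL_LEN = 2048  # matches --max-model-len in the Dockerfile

weights_gib = weight_bytes(NUM_PARAMS) / 1024**3
per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
per_sequence_mib = per_token * MAX_MODEL_LEN / 1024**2

print(f"weights: {weights_gib:.2f} GiB")
print(f"KV cache per token: {per_token} bytes")
print(f"KV cache per full-length sequence: {per_sequence_mib:.0f} MiB")
```

Under these assumptions the weights need roughly 2 GiB and each full-length sequence adds about 44 MiB of KV cache. The serving process needs more than this in practice, because the runtime reserves cache space for many concurrent sequences and holds activations and framework overhead, so profile before settling on requests and limits.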
LLM Serving Lifecycle Management
- Deployment: Initial rollout of the LLM serving application
- Scaling: Adjusting replica count based on demand
- Updating: Rolling updates with new model versions
- Monitoring: Tracking latency, throughput, and resource utilization
- Rollback: Reverting to previous versions if issues arise
- Decommissioning: Removing old model versions from serving
Best Practices for LLM Serving on Kubernetes
- Set appropriate resource requests/limits based on profiling
- Use readiness/liveness probes to ensure healthy instances
- Consider GPU node selection when GPUs are available
- Implement proper logging and monitoring
- Use blue/green or canary deployments for zero-downtime updates
- Consider model sharding for very large models that don't fit on a single GPU
Summary
In this exercise, you learned how to:
1. Containerize a fine-tuned LLM with vLLM for serving
2. Create Kubernetes deployment and service manifests
3. Deploy to OpenShift/Kubernetes with appropriate resource management
4. Access and test the deployed LLM endpoint
5. Monitor and manage the LLM serving lifecycle
These skills are essential for production LLM deployments, where reliability, scalability, and efficient resource utilization are paramount.