# LLMOps Guide

## Model Deployment

### Ray Serve Configuration

The KubeRay `RayService` below deploys the model behind Ray Serve with two replicas, 16 CPUs and one GPU per replica, and passes the model name and batch size in as environment variables:
```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: llm-inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveDeployments:
    - name: llm-deployment
      numReplicas: 2
      rayStartParams:
        num-cpus: "16"
        num-gpus: "1"
      containerConfig:
        image: llm-server:latest
        env:
          - name: MODEL_NAME
            value: "llama2-7b"
          - name: BATCH_SIZE
            value: "4"
```
## Model Monitoring

### Prometheus Rules

The `PrometheusRule` below raises a warning when average inference latency exceeds 1 second and a critical alert when the error rate exceeds 1%, each sustained over a 5-minute window:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-monitoring
spec:
  groups:
    - name: LLMMetrics
      rules:
        - alert: HighLatency
          expr: |
            rate(llm_inference_duration_seconds_sum[5m])
              /
            rate(llm_inference_duration_seconds_count[5m])
              > 1.0
          for: 5m
          labels:
            severity: warning
        - alert: HighErrorRate
          expr: |
            rate(llm_inference_errors_total[5m])
              /
            rate(llm_inference_requests_total[5m])
              > 0.01
          for: 5m
          labels:
            severity: critical
```
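
These alerts only fire if Prometheus actually scrapes the `llm_inference_*` metrics. A minimal `ServiceMonitor` sketch, assuming the inference pods sit behind a Service labeled `app: llm-inference` that exposes a port named `metrics` (both names are assumptions, not taken from the manifests above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference
spec:
  selector:
    matchLabels:
      app: llm-inference   # assumed label on the inference Service
  endpoints:
    - port: metrics        # assumed name of the metrics port on the Service
      interval: 15s
```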
## Performance Optimization

### Triton Inference Server

The `InferenceService` below serves the model from S3 through Triton, caps in-flight requests at 4 per container, and passes cache settings through an environment variable:
```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: llm-triton
spec:
  predictor:
    # predictor-level setting: maximum concurrent requests per container
    containerConcurrency: 4
    triton:
      storageUri: s3://models/llm
      resources:
        limits:
          cpu: "8"
          memory: "16Gi"
          nvidia.com/gpu: "1"
      env:
        - name: TRITON_CACHE_CONFIG
          value: |
            {
              "cache_size": "8GB",
              "cache_policy": "LRU"
            }
```
## Best Practices

### Model Management

- Version control
- A/B testing
- Canary deployment (see the sketch below)
- Model registry
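
A canary rollout can be expressed directly on the `InferenceService` from the previous section: updating the spec with `canaryTrafficPercent` keeps the previous revision serving the remaining traffic. A sketch, assuming the candidate model is staged at `s3://models/llm-v2` (an illustrative path, not from the manifests above):

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: llm-triton
spec:
  predictor:
    canaryTrafficPercent: 10          # route 10% of traffic to the new revision
    triton:
      storageUri: s3://models/llm-v2  # illustrative path for the candidate model
```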
### Observability

- Performance metrics
- Token usage (see the sketch below)
- Response quality
- Cost tracking
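
A sketch of Prometheus recording rules for token usage and cost, assuming the serving layer exports a counter named `llm_generated_tokens_total` and using an illustrative per-token price (neither comes from the manifests above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-usage
spec:
  groups:
    - name: LLMUsage
      rules:
        - record: llm:tokens_per_second
          expr: rate(llm_generated_tokens_total[5m])   # assumed metric name
        - record: llm:estimated_cost_per_hour
          # illustrative price of $0.0001 per token; substitute real pricing
          expr: rate(llm_generated_tokens_total[5m]) * 3600 * 0.0001
```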
### Optimization

- Quantization
- Batching
- Caching
- Load balancing (see the sketch below)
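
For load balancing, least-request routing helps when response times vary widely with prompt and output length. A sketch using an Istio `DestinationRule`, assuming the cluster runs Istio and the service is reachable at the host shown (both assumptions):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-inference
spec:
  host: llm-inference.default.svc.cluster.local   # assumed in-cluster service host
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST   # prefer replicas with the fewest in-flight requests
```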
### Security

- Input validation
- Output filtering
- Rate limiting (see the sketch below)
- Access control
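
For rate limiting, a sketch using ingress-nginx annotations, assuming the service is exposed through an NGINX Ingress; the hostname, Service name, and port are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-inference
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"              # requests per second per client IP
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "2"  # allow short bursts up to 2x
spec:
  ingressClassName: nginx
  rules:
    - host: llm.example.com              # assumed external hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-inference      # assumed Service name for the inference deployment
                port:
                  number: 8000           # assumed serving port
```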