LLMOps Guide

Model Deployment

Ray Serve Configuration

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: llm-inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveDeployments:
    - name: llm-deployment
      numReplicas: 2
      rayStartParams:
        num-cpus: "16"
        num-gpus: "1"
      containerConfig:
        image: llm-server:latest
        env:
          - name: MODEL_NAME
            value: "llama2-7b"
          - name: BATCH_SIZE
            value: "4"

Model Monitoring

Prometheus Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-monitoring
spec:
  groups:
  - name: LLMMetrics
    rules:
    - alert: HighLatency
      expr: |
        rate(llm_inference_duration_seconds_sum[5m])
        /
        rate(llm_inference_duration_seconds_count[5m])
        > 1.0
      for: 5m
      labels:
        severity: warning
    - alert: HighErrorRate
      expr: |
        rate(llm_inference_errors_total[5m])
        /
        rate(llm_inference_requests_total[5m])
        > 0.01
      for: 5m
      labels:
        severity: critical

Performance Optimization

Triton Inference Server

apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: llm-triton
spec:
  predictor:
    triton:
      storageUri: s3://models/llm
      resources:
        limits:
          cpu: "8"
          memory: "16Gi"
          nvidia.com/gpu: "1"
      containerConcurrency: 4
      env:
        - name: TRITON_CACHE_CONFIG
          value: |
            {
              "cache_size": "8GB",
              "cache_policy": "LRU"
            }

Best Practices

  1. Model Management

    • Version control

    • A/B testing

    • Canary deployment

    • Model registry

  2. Observability

    • Performance metrics

    • Token usage

    • Response quality

    • Cost tracking

  3. Optimization

    • Quantization

    • Batching

    • Caching

    • Load balancing

  4. Security

    • Input validation

    • Output filtering

    • Rate limiting

    • Access control

Last updated