Google Kubernetes Engine (GKE)

Deploying and managing Google Kubernetes Engine (GKE) clusters

Google Kubernetes Engine (GKE) is Google Cloud's managed Kubernetes service that provides a secure, production-ready environment for deploying containerized applications. This guide focuses on practical deployment scenarios using Terraform and the gcloud CLI.

Key Features

  • Autopilot: Fully managed Kubernetes experience with hands-off operations

  • Standard: More control over cluster configuration and node management

  • GKE Enterprise: Advanced multi-cluster management and governance features

  • Auto-scaling: Automatic scaling of node pools based on workload demand

  • Auto-upgrade: Automated Kubernetes version upgrades

  • Multi-zone/region: Deploy across zones/regions for high availability

  • VPC-native networking: Uses alias IP ranges for pod networking

  • Container-Optimized OS: Hardened, secure-by-default operating system for GKE nodes

  • Workload Identity: Secure access to Google Cloud services from pods

Deploying GKE with Terraform

Standard Cluster Deployment

resource "google_container_cluster" "primary" {
  name               = "my-gke-cluster"
  location           = "us-central1-a"
  remove_default_node_pool = true
  initial_node_count = 1
  
  # Enable Workload Identity
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
  
  # Network configuration
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
  
  # IP allocation policy for VPC-native networking; reference the named
  # secondary ranges defined on the subnetwork below
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
  
  # Private cluster configuration
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }
  
  # Release channel for auto-upgrades
  release_channel {
    channel = "REGULAR"
  }
  
  # Maintenance window: weekends only. The start/end timestamps set the
  # window's time of day and 24-hour duration; recurrence picks the days.
  maintenance_policy {
    recurring_window {
      start_time = "2022-01-01T00:00:00Z"
      end_time   = "2022-01-02T00:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "primary-node-pool"
  location   = "us-central1-a"
  cluster    = google_container_cluster.primary.name
  # Initial size only; with autoscaling enabled below, the autoscaler
  # manages the node count after creation
  initial_node_count = 3
  
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  
  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-standard"
    
    # Google recommends custom service accounts with minimal permissions
    service_account = google_service_account.gke_sa.email
    oauth_scopes    = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
    
    # Enable workload identity on node pool
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
    
    labels = {
      env = "production"
    }
    
    tags = ["gke-node", "production"]
  }
}

resource "google_service_account" "gke_sa" {
  account_id   = "gke-service-account"
  display_name = "GKE Service Account"
}

resource "google_project_iam_member" "gke_sa_roles" {
  for_each = toset([
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
    "roles/artifactregistry.reader"
  ])
  
  role    = each.key
  member  = "serviceAccount:${google_service_account.gke_sa.email}"
  project = var.project_id
}

resource "google_compute_network" "vpc" {
  name                    = "gke-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet" {
  name          = "gke-subnet"
  ip_cidr_range = "10.10.0.0/16"
  region        = "us-central1"
  network       = google_compute_network.vpc.id
  
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16"
  }
  
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.30.0.0/16"
  }
}
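
To provision this configuration and point kubectl at the new cluster, a typical workflow looks like the following (PROJECT_ID is a placeholder for your project):

terraform init
terraform apply -var="project_id=PROJECT_ID"

# Configure kubectl credentials for the new cluster
gcloud container clusters get-credentials my-gke-cluster \
  --zone us-central1-a \
  --project PROJECT_ID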

Autopilot Cluster Deployment
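
Autopilot needs far less configuration because Google manages nodes, scaling, and upgrades. A minimal sketch, reusing the VPC resources from the Standard example above (the resource and cluster names are illustrative; Autopilot clusters must be regional):

resource "google_container_cluster" "autopilot" {
  name             = "my-autopilot-cluster"
  location         = "us-central1" # Autopilot clusters are regional
  enable_autopilot = true

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name

  release_channel {
    channel = "REGULAR"
  }
}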

Deploying GKE with the gcloud CLI

Creating a Standard Cluster
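
A roughly equivalent zonal Standard cluster from the command line; the flags mirror the Terraform configuration above (PROJECT_ID is a placeholder):

gcloud container clusters create my-gke-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type e2-standard-4 \
  --disk-size 100 \
  --enable-ip-alias \
  --release-channel regular \
  --workload-pool=PROJECT_ID.svc.id.goog \
  --enable-autoscaling --min-nodes 1 --max-nodes 10 \
  --enable-autorepair --enable-autoupgrade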

Creating an Autopilot Cluster
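
Autopilot clusters take almost no flags, since node provisioning and management are Google's responsibility:

gcloud container clusters create-auto my-autopilot-cluster \
  --region us-central1 \
  --project PROJECT_ID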

Real-World Example: Deploying a Microservice Application

This example demonstrates deploying a complete microservices application to GKE:

Step 1: Create GKE infrastructure with Terraform
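
The Terraform configuration from the Standard Cluster Deployment section above provides the cluster, network, and node service account. The pipeline in Step 3 also needs somewhere to push images; one option is an Artifact Registry repository added to the same configuration (the repository name is illustrative):

resource "google_artifact_registry_repository" "microservices" {
  location      = "us-central1"
  repository_id = "microservices"
  format        = "DOCKER"
}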

Step 2: Create Kubernetes manifests for the application
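
A minimal sketch for one service of the application. The app name, image path, port, and /healthz endpoint are assumptions; the resource requests/limits and probes follow the best practices listed later:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        # Hypothetical image pushed to the Step 1 repository
        image: us-central1-docker.pkg.dev/PROJECT_ID/microservices/frontend:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend
  ports:
  - port: 80
    targetPort: 8080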

Step 3: Create Deployment Pipeline (Cloud Build)
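
A cloudbuild.yaml sketch that builds, pushes, and rolls out the image. Paths and names carry over the assumptions from Steps 1 and 2; the kubectl builder reads the target cluster from environment variables:

steps:
# Build and tag the image with the commit SHA
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'us-central1-docker.pkg.dev/$PROJECT_ID/microservices/frontend:$SHORT_SHA', '.']
# Push to Artifact Registry
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'us-central1-docker.pkg.dev/$PROJECT_ID/microservices/frontend:$SHORT_SHA']
# Roll the new image out to the Deployment from Step 2
- name: 'gcr.io/cloud-builders/kubectl'
  args: ['set', 'image', 'deployment/frontend', 'frontend=us-central1-docker.pkg.dev/$PROJECT_ID/microservices/frontend:$SHORT_SHA']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=us-central1-a'
  - 'CLOUDSDK_CONTAINER_CLUSTER=my-gke-cluster'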

Best Practices

  1. Security

    • Use private clusters; disable the public control-plane endpoint where access patterns allow

    • Implement Workload Identity for pod-level access to Google Cloud resources

    • Apply the principle of least privilege for service accounts

    • Enable Binary Authorization for secure supply chain

    • Keep nodes and the control plane up to date by enrolling in a release channel

  2. Reliability

    • Deploy across multiple zones/regions for high availability

    • Use Pod Disruption Budgets to ensure availability during maintenance (see the manifest sketch after this list)

    • Implement proper health checks and readiness/liveness probes

    • Set appropriate resource requests and limits

    • Use node auto-provisioning to handle fluctuating workloads

  3. Cost Optimization

    • Use Autopilot for hands-off management and optimized costs

    • Leverage Spot VMs for batch or fault-tolerant workloads

    • Set up cluster autoscaler to scale nodes based on demand

    • Use horizontal pod autoscaling (HPA) based on CPU/memory/custom metrics (also illustrated after this list)

    • Use node selectors, taints, and tolerations so pods land on appropriately sized nodes

  4. Monitoring and Logging

    • Enable Cloud Monitoring and Logging during cluster creation

    • Set up custom dashboards for cluster and application metrics

    • Create log-based alerts for critical issues

    • Use Cloud Trace and Profiler for application performance monitoring

    • Implement distributed tracing using OpenTelemetry
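
To make the Pod Disruption Budget and HPA recommendations concrete, a sketch for the hypothetical frontend Deployment from the example above (the thresholds are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  # Keep at least 2 replicas up during voluntary disruptions
  minAvailable: 2
  selector:
    matchLabels:
      app: frontend
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70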

Common Issues and Troubleshooting

Networking Issues

  • Ensure pod CIDR ranges don't overlap with VPC subnets

  • Check firewall rules for master-to-node and node-to-node communication

  • Verify kube-proxy is running correctly for service networking

  • Use Network Policy to control pod-to-pod traffic (a minimal example follows this list)
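
A minimal NetworkPolicy sketch for the hypothetical frontend pods; the gateway selector is an assumption, and enforcement requires network policy to be enabled on the cluster (for example via GKE Dataplane V2):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-allow-ingress
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Ingress
  ingress:
  # Only pods labeled app=gateway may reach frontend on 8080
  - from:
    - podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8080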

Performance Problems

  • Review pod resource settings (requests/limits)

  • Check for node resource exhaustion (CPU, memory)

  • Look for noisy neighbor issues on shared nodes

  • Monitor network throughput and latency

Deployment Failures

  • Verify service account permissions

  • Check image pull errors (registry access, image existence)

  • Examine pod events with kubectl describe pod

  • Review logs with kubectl logs or Cloud Logging
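
A typical triage sequence (POD_NAME and NAMESPACE are placeholders):

kubectl get pods -n NAMESPACE
kubectl describe pod POD_NAME -n NAMESPACE
kubectl logs POD_NAME -n NAMESPACE --previous
kubectl get events -n NAMESPACE --sort-by=.lastTimestamp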

Scaling Issues

  • Ensure cluster autoscaler is properly configured

  • Check if pods have appropriate resource requests

  • Verify node resource availability

  • Look for pod affinity/anti-affinity conflicts
