BigQuery

Deploying and managing Google BigQuery for enterprise data analytics

Google BigQuery is a fully managed, serverless data warehouse that runs fast SQL queries on Google's infrastructure, letting you analyze very large datasets with high performance and cost efficiency.

Key Features

  • Serverless: No infrastructure to manage or provision

  • Petabyte Scale: Query petabytes of data with ease

  • Separation of Storage and Compute: Pay only for the storage you use and the queries you run

  • Real-time Analytics: Stream data for real-time analysis and ML

  • Geospatial Analysis: Built-in support for geographic data types

  • ML Integration: Train and run ML models directly using SQL with BigQuery ML (see the sketch after this list)

  • BI Engine: In-memory analysis service for sub-second query response

  • Federated Queries: Query data from external sources without copying

  • Data Governance: Column-level security and dynamic data masking

  • Cost Controls: Custom quotas and intelligent pricing models
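
To make the ML Integration point concrete, here is a minimal BigQuery ML sketch. The training_data and new_customers tables and the churned label column are hypothetical placeholders; substitute your own project and schema.

-- Train a simple churn classifier directly in SQL (illustrative only)
CREATE OR REPLACE MODEL `your-project-id.example_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT
  region,
  amount,
  churned
FROM
  `your-project-id.example_dataset.training_data`;

-- Score new rows with the trained model
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs
FROM
  ML.PREDICT(
    MODEL `your-project-id.example_dataset.churn_model`,
    TABLE `your-project-id.example_dataset.new_customers`
  );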

Deploying BigQuery with Terraform

Basic Dataset and Table Creation

resource "google_bigquery_dataset" "default" {
  dataset_id                  = "example_dataset"
  friendly_name               = "test"
  description                 = "This is a test dataset"
  location                    = "US"
  default_table_expiration_ms = 3600000  # 1 hour
  
  labels = {
    env = "default"
  }
  
  access {
    role          = "OWNER"
    user_by_email = google_service_account.bq_admin.email
  }
  
  access {
    role          = "READER"
    group_by_email = "analytics-team@example.com"
  }
}

resource "google_service_account" "bq_admin" {
  account_id   = "bq-admin"
  display_name = "BigQuery Administrator"
}

resource "google_bigquery_table" "default" {
  dataset_id = google_bigquery_dataset.default.dataset_id
  table_id   = "example_table"
  
  time_partitioning {
    type = "DAY"
    field = "timestamp"
  }
  
  clustering = ["customer_id", "region"]
  
  schema = <<EOF
[
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Customer identifier"
  },
  {
    "name": "timestamp",
    "type": "TIMESTAMP",
    "mode": "REQUIRED",
    "description": "Event timestamp"
  },
  {
    "name": "region",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Customer region"
  },
  {
    "name": "amount",
    "type": "FLOAT",
    "mode": "NULLABLE",
    "description": "Transaction amount"
  }
]
EOF

  deletion_protection = false
}
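
Because the table above is partitioned on timestamp and clustered on customer_id and region, queries that filter on those columns scan far less data and therefore cost less. A quick sketch against the example schema, with a placeholder project ID:

-- Partition pruning (timestamp filter) plus cluster pruning (customer_id filter)
SELECT
  customer_id,
  SUM(amount) AS total_amount
FROM
  `your-project-id.example_dataset.example_table`
WHERE
  timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND customer_id = 'C-1001'
GROUP BY
  customer_id;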

Advanced Configuration with Authorized Views and Row-Level Security

resource "google_bigquery_dataset" "sensitive_data" {
  dataset_id                  = "sensitive_data"
  friendly_name               = "Sensitive Data"
  description                 = "Contains PII and other sensitive information"
  location                    = "US"
  
  # Enable CMEK (Customer Managed Encryption Keys)
  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.bq_key.id
  }
  
  # Set access controls
  access {
    role          = "OWNER"
    user_by_email = google_service_account.bq_admin.email
  }
  
  # No direct access for analysts
}

# Create an authorized view dataset
resource "google_bigquery_dataset" "authorized_views" {
  dataset_id                  = "authorized_views"
  friendly_name               = "Authorized Views"
  description                 = "Contains secured views for analysts"
  location                    = "US"
  
  access {
    role          = "OWNER"
    user_by_email = google_service_account.bq_admin.email
  }
  
  access {
    role          = "READER"
    group_by_email = "analysts@example.com"
  }
}

# Create a KMS key for encryption
resource "google_kms_key_ring" "bigquery" {
  name     = "bigquery-keyring"
  location = "global"
}

resource "google_kms_crypto_key" "bq_key" {
  name     = "bigquery-key"
  key_ring = google_kms_key_ring.bigquery.id
}

# Grant BigQuery service account access to the KMS key
data "google_project" "project" {}

resource "google_kms_crypto_key_iam_member" "crypto_key" {
  crypto_key_id = google_kms_crypto_key.bq_key.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:bq-${data.google_project.project.number}@bigquery-encryption.iam.gserviceaccount.com"
}

# Create sensitive data table
resource "google_bigquery_table" "customer_data" {
  dataset_id = google_bigquery_dataset.sensitive_data.dataset_id
  table_id   = "customer_data"
  
  schema = <<EOF
[
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "full_name",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "email",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "address",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "date_of_birth",
    "type": "DATE",
    "mode": "NULLABLE"
  },
  {
    "name": "region",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "customer_type",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
EOF

  # Apply row access policy (defined later)
  deletion_protection = true
}

# Create authorized view
resource "google_bigquery_table" "customer_view" {
  dataset_id = google_bigquery_dataset.authorized_views.dataset_id
  table_id   = "customer_view"
  view {
    query = <<EOF
      SELECT 
        customer_id,
        CONCAT(SUBSTR(full_name, 1, 1), '***') AS masked_name,
        REGEXP_REPLACE(email, r'(.+)@(.+)', r'****@\2') AS masked_email,
        region,
        customer_type
      FROM 
        ${google_bigquery_dataset.sensitive_data.dataset_id}.${google_bigquery_table.customer_data.table_id}
    EOF
    use_legacy_sql = false
  }
  
  deletion_protection = true
}

# Grant view access to the underlying table
resource "google_bigquery_dataset_access" "authorized_view_access" {
  dataset_id    = google_bigquery_dataset.sensitive_data.dataset_id
  view {
    project_id = data.google_project.project.project_id
    dataset_id = google_bigquery_dataset.authorized_views.dataset_id
    table_id   = google_bigquery_table.customer_view.table_id
  }
}

# Row-level security: BigQuery row access policies are created with SQL DDL,
# not with IAM conditions, so they cannot be expressed as a table IAM policy.
# Once the table exists, create one policy per analyst group, for example:
#
#   CREATE OR REPLACE ROW ACCESS POLICY us_region_access
#   ON sensitive_data.customer_data
#   GRANT TO ("group:us-analysts@example.com")
#   FILTER USING (region = "US");
#
#   CREATE OR REPLACE ROW ACCESS POLICY eu_region_access
#   ON sensitive_data.customer_data
#   GRANT TO ("group:eu-analysts@example.com")
#   FILTER USING (region = "EU");
#
# The DDL can be run from the bq CLI or the console, or wrapped in a
# google_bigquery_job resource if Terraform should manage it.
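
Once the views and policies are in place, analysts query the authorized view rather than the base table. A quick check of what they see, assuming the setup above: name and email come back masked, and any row access policies on the base table further restrict which rows are returned.

SELECT
  customer_id,
  masked_name,
  masked_email,
  region
FROM
  `your-project-id.authorized_views.customer_view`
LIMIT 10;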

Managing BigQuery with the bq CLI

BigQuery is managed from the command line with the bq tool (installed with the Google Cloud SDK) rather than a gcloud command group.

Creating Datasets and Tables

# Create a dataset (--location is a global bq flag, so it comes before mk)
bq --location=US mk \
  --dataset \
  --description="Sales analytics dataset" \
  your-project-id:sales_analytics

# Create a table with schema
bq mk \
  --table \
  --schema="transaction_id:STRING,customer_id:STRING,product_id:STRING,quantity:INTEGER,price:FLOAT,timestamp:TIMESTAMP" \
  your-project-id:sales_analytics.transactions

# Load data into a table from Cloud Storage
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  your-project-id:sales_analytics.transactions \
  gs://your-bucket/sales-data/transactions.csv \
  transaction_id:STRING,customer_id:STRING,product_id:STRING,quantity:INTEGER,price:FLOAT,timestamp:TIMESTAMP

Running Queries

# Run a simple query
bq query \
  --use_legacy_sql=false \
  'SELECT COUNT(*) as transaction_count FROM `your-project-id.sales_analytics.transactions`'

# Run a query and save results to a table
bq query \
  --use_legacy_sql=false \
  --destination_table=sales_analytics.daily_summary \
  --append_table \
  'SELECT
     DATE(timestamp) as date,
     COUNT(*) as transactions,
     SUM(price * quantity) as revenue
   FROM
     `your-project-id.sales_analytics.transactions`
   GROUP BY
     date
   ORDER BY
     date DESC'

# Export a table to Cloud Storage
bq extract \
  your-project-id:sales_analytics.daily_summary \
  gs://your-bucket/exports/daily_summary.csv

Managing Permissions

# Grant a user read access to a table (bq manages IAM at the table/view level)
bq add-iam-policy-binding \
  --member='user:user@example.com' \
  --role='roles/bigquery.dataViewer' \
  your-project-id:sales_analytics.transactions

# Grant an ETL service account write access to a table
bq add-iam-policy-binding \
  --member='serviceAccount:etl-service@your-project-id.iam.gserviceaccount.com' \
  --role='roles/bigquery.dataEditor' \
  your-project-id:sales_analytics.transactions

# View the current IAM policy for a table
bq get-iam-policy your-project-id:sales_analytics.transactions

# Dataset-level access is managed through the dataset's access list:
# dump it, edit the "access" entries, then update the dataset
bq show --format=prettyjson your-project-id:sales_analytics > dataset.json
bq update --source dataset.json your-project-id:sales_analytics

Real-World Example: E-commerce Analytics Platform

This example demonstrates a complete data analytics platform for an e-commerce company:

Step 1: Infrastructure Setup with Terraform

# Create datasets for different layers of the data platform
resource "google_bigquery_dataset" "raw_data" {
  dataset_id                  = "raw_data"
  friendly_name               = "Raw Data"
  description                 = "Landing zone for raw data"
  location                    = "US"
  default_table_expiration_ms = 7776000000  # 90 days
  
  labels = {
    environment = "production"
    data_type   = "raw"
  }
}

resource "google_bigquery_dataset" "staging" {
  dataset_id                  = "staging"
  friendly_name               = "Staging"
  description                 = "Intermediate processing zone"
  location                    = "US"
  default_table_expiration_ms = 2592000000  # 30 days
  
  labels = {
    environment = "production"
    data_type   = "staging"
  }
}

resource "google_bigquery_dataset" "data_warehouse" {
  dataset_id                  = "data_warehouse"
  friendly_name               = "Data Warehouse"
  description                 = "Enterprise data warehouse"
  location                    = "US"
  
  labels = {
    environment = "production"
    data_type   = "curated"
  }
}

resource "google_bigquery_dataset" "data_marts" {
  dataset_id                  = "data_marts"
  friendly_name               = "Data Marts"
  description                 = "Business-specific data marts"
  location                    = "US"
  
  labels = {
    environment = "production"
    data_type   = "marts"
  }
}

# Create raw tables
resource "google_bigquery_table" "raw_transactions" {
  dataset_id = google_bigquery_dataset.raw_data.dataset_id
  table_id   = "transactions"
  
  time_partitioning {
    type  = "DAY"
    field = "transaction_date"
  }
  
  schema = <<EOF
[
  {
    "name": "transaction_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "transaction_date",
    "type": "TIMESTAMP",
    "mode": "REQUIRED"
  },
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "product_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "quantity",
    "type": "INTEGER",
    "mode": "REQUIRED"
  },
  {
    "name": "unit_price",
    "type": "FLOAT",
    "mode": "REQUIRED"
  },
  {
    "name": "total_amount",
    "type": "FLOAT",
    "mode": "REQUIRED"
  },
  {
    "name": "payment_method",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "store_id",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "json_payload",
    "type": "JSON",
    "mode": "NULLABLE"
  }
]
EOF
}

resource "google_bigquery_table" "raw_customers" {
  dataset_id = google_bigquery_dataset.raw_data.dataset_id
  table_id   = "customers"
  
  schema = <<EOF
[
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "first_name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "last_name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "email",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "phone",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "address",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "city",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "state",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "country",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "zip_code",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "registration_date",
    "type": "TIMESTAMP",
    "mode": "NULLABLE"
  }
]
EOF
}

resource "google_bigquery_table" "raw_products" {
  dataset_id = google_bigquery_dataset.raw_data.dataset_id
  table_id   = "products"
  
  schema = <<EOF
[
  {
    "name": "product_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "product_name",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "category",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "subcategory",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "brand",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "supplier_id",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "base_price",
    "type": "FLOAT",
    "mode": "NULLABLE"
  },
  {
    "name": "tax_rate",
    "type": "FLOAT",
    "mode": "NULLABLE"
  }
]
EOF
}

# Create warehouse fact and dimension tables
resource "google_bigquery_table" "fact_sales" {
  dataset_id = google_bigquery_dataset.data_warehouse.dataset_id
  table_id   = "fact_sales"
  
  time_partitioning {
    type  = "DAY"
    field = "transaction_date"
  }
  
  clustering = ["customer_id", "product_id", "store_id"]
  
  schema = <<EOF
[
  {
    "name": "transaction_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "transaction_date",
    "type": "DATE",
    "mode": "REQUIRED"
  },
  {
    "name": "transaction_time",
    "type": "TIME",
    "mode": "REQUIRED"
  },
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "product_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "store_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "quantity",
    "type": "INTEGER",
    "mode": "REQUIRED"
  },
  {
    "name": "unit_price",
    "type": "FLOAT",
    "mode": "REQUIRED"
  },
  {
    "name": "net_amount",
    "type": "FLOAT",
    "mode": "REQUIRED"
  },
  {
    "name": "tax_amount",
    "type": "FLOAT",
    "mode": "REQUIRED"
  },
  {
    "name": "total_amount",
    "type": "FLOAT",
    "mode": "REQUIRED"
  },
  {
    "name": "payment_method",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "is_online",
    "type": "BOOLEAN",
    "mode": "REQUIRED"
  }
]
EOF
}

resource "google_bigquery_table" "dim_customer" {
  dataset_id = google_bigquery_dataset.data_warehouse.dataset_id
  table_id   = "dim_customer"
  
  schema = <<EOF
[
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "first_name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "last_name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "full_name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "email",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "city",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "state",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "country",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "zip_code",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "registration_date",
    "type": "DATE",
    "mode": "NULLABLE"
  },
  {
    "name": "customer_segment",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "valid_from",
    "type": "TIMESTAMP",
    "mode": "REQUIRED"
  },
  {
    "name": "valid_to",
    "type": "TIMESTAMP",
    "mode": "NULLABLE"
  },
  {
    "name": "is_current",
    "type": "BOOLEAN",
    "mode": "REQUIRED"
  }
]
EOF
}

# Create data mart views
resource "google_bigquery_table" "sales_mart_monthly" {
  dataset_id = google_bigquery_dataset.data_marts.dataset_id
  table_id   = "sales_by_month"
  
  view {
    query = <<EOF
      SELECT 
        EXTRACT(YEAR FROM transaction_date) AS year,
        EXTRACT(MONTH FROM transaction_date) AS month,
        SUM(quantity) AS total_units_sold,
        SUM(total_amount) AS total_revenue,
        COUNT(DISTINCT transaction_id) AS transaction_count,
        COUNT(DISTINCT customer_id) AS unique_customers
      FROM 
        `${google_bigquery_dataset.data_warehouse.dataset_id}.${google_bigquery_table.fact_sales.table_id}`
      GROUP BY 
        year, month
      ORDER BY 
        year DESC, month DESC
    EOF
    use_legacy_sql = false
  }
}

resource "google_bigquery_table" "customer_lifetime_value" {
  dataset_id = google_bigquery_dataset.data_marts.dataset_id
  table_id   = "customer_lifetime_value"
  
  view {
    query = <<EOF
      WITH customer_purchases AS (
        SELECT 
          fs.customer_id,
          dc.full_name,
          dc.registration_date,
          SUM(fs.total_amount) AS lifetime_value,
          MIN(fs.transaction_date) AS first_purchase_date,
          MAX(fs.transaction_date) AS last_purchase_date,
          COUNT(DISTINCT fs.transaction_id) AS total_purchases,
          DATE_DIFF(CURRENT_DATE(), MIN(fs.transaction_date), DAY) AS days_since_first_purchase,
          DATE_DIFF(CURRENT_DATE(), MAX(fs.transaction_date), DAY) AS days_since_last_purchase
        FROM 
          `${google_bigquery_dataset.data_warehouse.dataset_id}.${google_bigquery_table.fact_sales.table_id}` fs
        JOIN
          `${google_bigquery_dataset.data_warehouse.dataset_id}.${google_bigquery_table.dim_customer.table_id}` dc
        ON 
          fs.customer_id = dc.customer_id AND dc.is_current = TRUE
        GROUP BY 
          fs.customer_id, dc.full_name, dc.registration_date
      )
      
      SELECT 
        customer_id,
        full_name,
        registration_date,
        lifetime_value,
        total_purchases,
        lifetime_value / total_purchases AS avg_order_value,
        first_purchase_date,
        last_purchase_date,
        days_since_first_purchase,
        days_since_last_purchase,
        CASE 
          WHEN days_since_last_purchase <= 30 THEN 'Active'
          WHEN days_since_last_purchase BETWEEN 31 AND 90 THEN 'At Risk'
          WHEN days_since_last_purchase BETWEEN 91 AND 365 THEN 'Lapsed'
          ELSE 'Inactive'
        END AS customer_status
      FROM 
        customer_purchases
      ORDER BY 
        lifetime_value DESC
    EOF
    use_legacy_sql = false
  }
}

# Set up scheduled query for daily ETL
resource "google_bigquery_data_transfer_config" "daily_etl" {
  display_name           = "Daily ETL pipeline"
  location               = "US"
  data_source_id         = "scheduled_query"
  schedule               = "every 24 hours"
  # No destination dataset is set: the query below is a DML script that writes into fact_sales directly
  
  params = {
    query = <<EOF
      -- Refresh fact_sales table with new data
      DELETE FROM `${google_bigquery_dataset.data_warehouse.dataset_id}.${google_bigquery_table.fact_sales.table_id}` 
      WHERE transaction_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
      
      INSERT INTO `${google_bigquery_dataset.data_warehouse.dataset_id}.${google_bigquery_table.fact_sales.table_id}`
      SELECT 
        transaction_id,
        DATE(transaction_date) AS transaction_date,
        TIME(transaction_date) AS transaction_time,
        customer_id,
        t.product_id,
        store_id,
        quantity,
        unit_price,
        (quantity * unit_price) AS net_amount,
        (quantity * unit_price * p.tax_rate) AS tax_amount,
        total_amount,
        payment_method,
        IFNULL(store_id = 'ONLINE', FALSE) AS is_online
      FROM 
        `${google_bigquery_dataset.raw_data.dataset_id}.${google_bigquery_table.raw_transactions.table_id}` t
      JOIN
        `${google_bigquery_dataset.raw_data.dataset_id}.${google_bigquery_table.raw_products.table_id}` p
      ON
        t.product_id = p.product_id
      WHERE
        DATE(transaction_date) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
    EOF
  }

  service_account_name = google_service_account.bigquery_etl.email
}

# Service account for ETL operations
resource "google_service_account" "bigquery_etl" {
  account_id   = "bigquery-etl"
  display_name = "BigQuery ETL Service Account"
}

# Grant necessary permissions
resource "google_project_iam_member" "etl_permissions" {
  for_each = toset([
    "roles/bigquery.dataEditor",
    "roles/bigquery.jobUser"
  ])
  
  role    = each.key
  member  = "serviceAccount:${google_service_account.bigquery_etl.email}"
  project = data.google_project.project.project_id
}

# Get project information
data "google_project" "project" {}

Step 2: Set up Data Loading Pipelines

Create a Cloud Function to load data from various sources into BigQuery:

# main.py
import json
import base64
from google.cloud import bigquery

# Initialize the BigQuery client
bigquery_client = bigquery.Client()

def process_transaction_data(event, context):
    """Cloud Function triggered by a Pub/Sub event"""
    # Decode the Pub/Sub message
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    message_data = json.loads(pubsub_message)
    
    # Get source file details
    bucket_name = message_data['bucket']
    file_name = message_data['file']
    
    # Set up the job configuration
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("transaction_id", "STRING"),
            bigquery.SchemaField("transaction_date", "TIMESTAMP"),
            bigquery.SchemaField("customer_id", "STRING"),
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "FLOAT"),
            bigquery.SchemaField("total_amount", "FLOAT"),
            bigquery.SchemaField("payment_method", "STRING"),
            bigquery.SchemaField("store_id", "STRING"),
            bigquery.SchemaField("json_payload", "JSON"),
        ],
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )
    
    # Get the table reference
    table_id = f"{message_data['project_id']}.raw_data.transactions"
    
    # Create a job to load data from GCS to BigQuery
    uri = f"gs://{bucket_name}/{file_name}"
    load_job = bigquery_client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )
    
    # Wait for the job to complete
    load_job.result()
    
    # Report load statistics
    destination_table = bigquery_client.get_table(table_id)
    print(f"Loaded {load_job.output_rows} rows into {table_id} "
          f"({destination_table.num_rows} rows total)")
    
    return f"Loaded {load_job.output_rows} rows into {table_id}"
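
Once raw data lands, a typical next step is a batch transform from the raw layer into the warehouse dimensions. The sketch below is a simplified upsert (not a full SCD Type 2 load) using the tables defined in Step 1; the 'new' segment value is a placeholder:

-- Upsert customers from raw_data into the warehouse dimension
MERGE `your-project-id.data_warehouse.dim_customer` AS dim
USING (
  SELECT
    customer_id,
    first_name,
    last_name,
    CONCAT(first_name, ' ', last_name) AS full_name,
    email,
    city,
    state,
    country,
    zip_code,
    DATE(registration_date) AS registration_date
  FROM
    `your-project-id.raw_data.customers`
) AS src
ON dim.customer_id = src.customer_id AND dim.is_current = TRUE
WHEN MATCHED THEN
  UPDATE SET email = src.email, city = src.city, state = src.state
WHEN NOT MATCHED THEN
  INSERT (customer_id, first_name, last_name, full_name, email, city, state,
          country, zip_code, registration_date, customer_segment,
          valid_from, valid_to, is_current)
  VALUES (src.customer_id, src.first_name, src.last_name, src.full_name,
          src.email, src.city, src.state, src.country, src.zip_code,
          src.registration_date, 'new', CURRENT_TIMESTAMP(), NULL, TRUE);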

Step 3: Create Business Intelligence Dashboard using Looker Studio

  1. Set up a connection to BigQuery data marts

  2. Create a dashboard that pulls from the data marts:

-- Example query for the sales dashboard: monthly trend over the last 12 months
SELECT
  s.year,
  s.month,
  FORMAT_DATE('%b %Y', DATE(s.year, s.month, 1)) AS month_year,
  s.total_revenue AS revenue,
  s.total_units_sold AS units_sold,
  s.transaction_count AS transactions,
  s.unique_customers AS customers
FROM
  `your-project-id.data_marts.sales_by_month` s
WHERE
  DATE(s.year, s.month, 1) >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH), MONTH)
ORDER BY
  s.year, s.month

-- For a per-category breakdown, add product and category keys to the mart
-- (or a dedicated mart view) and join it to the product dimension tables.

Best Practices

  1. Data Organization

    • Structure data in datasets by function (raw, staging, production)

    • Use consistent naming conventions for datasets and tables

    • Consider data lifecycle management with table expiration

    • Implement a multi-layer architecture (bronze, silver, gold)

  2. Performance Optimization

    • Partition tables by date fields for time-series data

    • Apply clustering for frequently filtered fields

    • Use materialized views for common query patterns (see the sketch at the end of this section)

    • Consider BigQuery BI Engine for dashboarding workloads

    • Write efficient SQL (avoid SELECT * and unnecessary JOINs)

  3. Cost Management

    • Set up billing alerts and quotas

    • Use flat-rate pricing for predictable workloads

    • Apply table partitioning to reduce scan sizes

    • Cache query results when possible

    • Consider reservations for consistent usage

  4. Security and Governance

    • Implement column-level security for sensitive data

    • Use authorized views to control access to underlying data

    • Apply row-level security for multi-tenant data

    • Enable column-level encryption for PII

    • Use VPC Service Controls for additional network security

  5. Operational Excellence

    • Monitor query performance with INFORMATION_SCHEMA views

    • Schedule routine maintenance jobs

    • Set up proper logging and monitoring

    • Implement data quality checks

    • Document data lineage
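
As referenced under Performance Optimization, a materialized view can precompute a common aggregation so dashboards stop rescanning the fact table. A sketch against the fact_sales table from the e-commerce example (the view name is illustrative); materialized views only allow a restricted set of aggregates, which is why APPROX_COUNT_DISTINCT stands in for COUNT(DISTINCT):

CREATE MATERIALIZED VIEW `your-project-id.data_marts.daily_store_revenue_mv` AS
SELECT
  transaction_date,
  store_id,
  SUM(total_amount) AS revenue,
  COUNT(*) AS transactions,
  APPROX_COUNT_DISTINCT(customer_id) AS approx_customers
FROM
  `your-project-id.data_warehouse.fact_sales`
GROUP BY
  transaction_date, store_id;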

Common Issues and Troubleshooting

Performance Problems

  • Check for missing partitioning or clustering

  • Optimize JOIN operations (consider denormalizing)

  • Review query plans for bottlenecks

  • Use approximate aggregation functions for large datasets (see the sketch after this list)

  • Consider materialized views for frequent queries
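
An example of the approximate-aggregation trade-off mentioned above, using the fact_sales table from the e-commerce example: APPROX_COUNT_DISTINCT typically stays within a few percent of the exact count while avoiding the expensive shuffle that COUNT(DISTINCT) requires.

SELECT
  DATE_TRUNC(transaction_date, MONTH) AS month,
  APPROX_COUNT_DISTINCT(customer_id) AS approx_unique_customers
FROM
  `your-project-id.data_warehouse.fact_sales`
GROUP BY
  month
ORDER BY
  month DESC;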

Cost Issues

  • Monitor with BigQuery audit logs

  • Review largest queries by bytes processed (see the query after this list)

  • Control who can run queries (IAM roles)

  • Set user quotas to limit spending

  • Optimize storage with appropriate compression
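
A starting point for that review: the INFORMATION_SCHEMA jobs views list recent queries with the bytes they billed. Adjust the region qualifier to where your datasets live, and treat the per-TiB rate as a placeholder for your current on-demand price:

-- Top 20 queries by bytes billed over the last 7 days
SELECT
  user_email,
  job_id,
  total_bytes_billed,
  ROUND(total_bytes_billed / POW(1024, 4) * 6.25, 2) AS approx_cost_usd,  -- assumes ~$6.25/TiB on-demand
  query
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY
  total_bytes_billed DESC
LIMIT 20;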

Data Quality Problems

  • Implement data validation queries (see the sketch after this list)

  • Set up streaming inserts error handling

  • Create data quality dashboards

  • Use Dataplex data quality scans for automated checks
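
Two validation queries of the kind mentioned above, written against the raw layer from the e-commerce example; schedule them after each load and alert whenever they return rows:

-- Duplicate or missing business keys in raw customers
SELECT
  customer_id,
  COUNT(*) AS occurrences
FROM
  `your-project-id.raw_data.customers`
GROUP BY
  customer_id
HAVING
  customer_id IS NULL OR COUNT(*) > 1;

-- Transactions with non-positive amounts that should never reach the warehouse
SELECT
  transaction_id,
  total_amount
FROM
  `your-project-id.raw_data.transactions`
WHERE
  total_amount <= 0;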

Access Control Issues

  • Audit existing permissions with IAM Policy Analyzer

  • Use groups instead of individual permissions

  • Implement principle of least privilege

  • Document access policies

  • Regularly review and prune permissions

Further Reading

  • BigQuery Documentation

  • BigQuery Best Practices

  • BigQuery ML Documentation

  • Terraform BigQuery Resources

  • Data Warehouse Architecture Patterns