Setting Up Monitoring Infrastructure with Terraform

Setting up monitoring infrastructure is one of the most time-consuming and headache-inducing tasks a sysadmin faces. You install Prometheus, add Grafana, configure Alertmanager, roll Node Exporter out to every server... and then you do it all over again for each new environment. Defining the process as code with Terraform makes it repeatable and consistent, which saves both time and nerves.

Why Terraform for Monitoring?

The biggest problem with manual setups is inconsistency. Prometheus instances installed one way in staging and another way in production eventually turn into chaos where nobody knows who configured what. Defining your infrastructure as code with Terraform gives you the following advantages:

Reproducibility: You apply the same configuration identically to dev, staging, and production
Version control: Infrastructure changes are visible in the Git history
Teamwork: You can review infrastructure changes through pull requests
Fast disaster recovery: If something blows up, you bring it back from scratch with terraform apply

In this post we will use Terraform to build a complete monitoring stack on AWS consisting of Prometheus, Grafana, and Alertmanager. The setup is close to real production scenarios, not just a "hello world".

Project Structure

File organization is critical in Terraform projects. Stuffing everything into a single main.tf makes the project unmaintainable as it grows. We will use this layout:

monitoring-infra/
├── main.tf
├── variables.tf
├── outputs.tf
├── providers.tf
├── modules/
│   ├── prometheus/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── grafana/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── alertmanager/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── templates/
│   ├── prometheus.yml.tpl
│   ├── alertmanager.yml.tpl
│   └── grafana.ini.tpl
└── environments/
    ├── dev.tfvars
    ├── staging.tfvars
    └── prod.tfvars

With this layout you can manage each component independently and reuse the modules in other projects.

Provider and Backend Configuration

Let's start with providers.tf. Keeping the state file in S3 is a must for teamwork; otherwise every team member works from a local copy of the state and terraform apply runs end up conflicting:

# providers.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "sirket-terraform-state"
    key            = "monitoring/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "monitoring"
      ManagedBy   = "terraform"
      Environment = var.environment
    }
  }
}

You need to create the DynamoDB table for state locking ahead of time. If two people run terraform apply at the same moment the state file can be corrupted; this table prevents that.
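
A minimal sketch of that bootstrap step follows. It assumes the bucket and table names from the backend block above and would live in a separate, one-off configuration with local state, because a backend cannot create its own storage. The file location and resource labels are illustrative. The S3 backend expects the lock table to have a string hash key named LockID.

```hcl
# bootstrap/main.tf (hypothetical location, applied once with local state)

resource "aws_s3_bucket" "terraform_state" {
  bucket = "sirket-terraform-state"
}

# Versioning lets you roll back to an earlier state file if one is damaged
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table referenced by dynamodb_table in the backend block
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

On-demand billing is chosen here because a lock table sees only a handful of reads and writes per run.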

Variable Definitions

Let's define the environment-specific variables in variables.tf:

# variables.tf
variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "eu-west-1"
}

variable "environment" {
  description = "Environment name (dev/staging/prod)"
  type        = string
}

variable "vpc_id" {
  description = "Existing VPC ID"
  type        = string
}

variable "private_subnet_ids" {
  description = "List of private subnet IDs"
  type        = list(string)
}

variable "public_subnet_ids" {
  description = "List of public subnet IDs"
  type        = list(string)
}

variable "prometheus_instance_type" {
  description = "Prometheus EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "grafana_instance_type" {
  description = "Grafana EC2 instance type"
  type        = string
  default     = "t3.small"
}

variable "retention_days" {
  description = "Prometheus data retention period (days)"
  type        = number
  default     = 30
}

variable "grafana_admin_password" {
  description = "Grafana admin password"
  type        = string
  sensitive   = true
}

variable "alert_email" {
  description = "Alert email address"
  type        = string
}

Variables marked with sensitive = true are redacted in Terraform's plan and apply output. They are still written to the state file in plain text, so keep the state bucket encrypted and access-controlled. Never set a password as a default value; pass it in through Terraform Cloud, AWS Secrets Manager, or environment variables.
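
As one possible way to take the Secrets Manager route, the sketch below reads the Grafana password with a data source instead of passing it as a variable. The secret name is hypothetical and the secret is assumed to exist already, created outside this configuration.

```hcl
# Hypothetical secret name; the secret must already exist in Secrets Manager
data "aws_secretsmanager_secret_version" "grafana_admin" {
  secret_id = "monitoring/${var.environment}/grafana-admin-password"
}

locals {
  # Still ends up in the state file, but never in .tfvars or shell history
  grafana_admin_password = data.aws_secretsmanager_secret_version.grafana_admin.secret_string
}
```

For the environment variable route, Terraform reads any variable named TF_VAR_<name>, so exporting TF_VAR_grafana_admin_password before running plan or apply has the same effect as a -var flag.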

Prometheus Module

Let's create a separate module for Prometheus. This module will manage the EC2 instance, the Security Group, and the required IAM roles:

# modules/prometheus/main.tf

# IAM role - lets Prometheus read EC2 metadata and
# discover other AWS services
resource "aws_iam_role" "prometheus" {
  name = "${var.environment}-prometheus-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "prometheus_discovery" {
  name = "prometheus-ec2-discovery"
  role = aws_iam_role.prometheus.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:DescribeInstances",
          "ec2:DescribeInstanceStatus",
          "cloudwatch:GetMetricStatistics",
          "cloudwatch:ListMetrics"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_iam_instance_profile" "prometheus" {
  name = "${var.environment}-prometheus-profile"
  role = aws_iam_role.prometheus.name
}

# Security Group
resource "aws_security_group" "prometheus" {
  name        = "${var.environment}-prometheus-sg"
  description = "Prometheus monitoring server security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 9090
    to_port         = 9090
    protocol        = "tcp"
    security_groups = [var.grafana_sg_id]
    description     = "Prometheus access from Grafana"
  }

  ingress {
    from_port   = 9100
    to_port     = 9100
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
    description = "Node exporter metrics"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-prometheus"
  }
}

# Render the Prometheus config from the template. The built-in templatefile()
# function replaces the deprecated template_file data source.
locals {
  prometheus_config = templatefile("${path.module}/../../templates/prometheus.yml.tpl", {
    environment    = var.environment
    retention_days = var.retention_days
    aws_region     = var.aws_region
  })
}

# EC2 Instance
resource "aws_instance" "prometheus" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  subnet_id              = var.private_subnet_ids[0]
  vpc_security_group_ids = [aws_security_group.prometheus.id]
  iam_instance_profile   = aws_iam_instance_profile.prometheus.name

  root_block_device {
    volume_type = "gp3"
    volume_size = 100
    encrypted   = true
  }

  # Prometheus data volume
  ebs_block_device {
    device_name = "/dev/xvdf"
    volume_type = "gp3"
    volume_size = var.data_volume_size
    encrypted   = true

    tags = {
      Name = "${var.environment}-prometheus-data"
    }
  }

  user_data = base64encode(templatefile("${path.module}/userdata.sh.tpl", {
    prometheus_config = local.prometheus_config
    retention_days    = var.retention_days
    environment       = var.environment
  }))

  tags = {
    Name = "${var.environment}-prometheus"
    Role = "monitoring"
  }

  lifecycle {
    ignore_changes = [ami]
  }
}

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    # Canonical's 22.04 image names include the release codename "jammy"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}
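
The module code above refers to several input variables, and the root outputs.tf later in this post reads private_ip and security_group_id from the module. The module's variables.tf and outputs.tf are not shown, so what follows is a sketch inferred from those references. The default for data_volume_size is an assumption.

```hcl
# modules/prometheus/variables.tf (sketch inferred from main.tf)
variable "environment" { type = string }
variable "aws_region" { type = string }
variable "vpc_id" { type = string }
variable "vpc_cidr" { type = string }
variable "grafana_sg_id" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "instance_type" { type = string }
variable "retention_days" { type = number }

variable "data_volume_size" {
  description = "Size of the Prometheus data volume in GiB"
  type        = number
  default     = 200 # assumed value, size it for your retention period
}

# modules/prometheus/outputs.tf (sketch)
output "private_ip" {
  description = "Private IP of the Prometheus instance"
  value       = aws_instance.prometheus.private_ip
}

output "security_group_id" {
  description = "ID of the Prometheus security group"
  value       = aws_security_group.prometheus.id
}
```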

Prometheus Configuration Template

Template files are one of Terraform's most powerful features. They let you generate configuration dynamically per environment:

# templates/prometheus.yml.tpl
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: '${environment}'
    region: '${aws_region}'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # EC2 service discovery - automatically finds all tagged instances
  - job_name: 'node-exporter'
    ec2_sd_configs:
      - region: ${aws_region}
        port: 9100
        filters:
          - name: tag:Environment
            values:
              - ${environment}
          - name: instance-state-name
            values:
              - running
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
      - source_labels: [__meta_ec2_tag_Role]
        target_label: role

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.sirket.com
          - https://api.sirket.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

EC2 service discovery is very useful here. You don't have to update the Prometheus configuration every time you add a server. Once a server carries the Environment tag, Prometheus discovers it automatically.
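
To make the discovery rule concrete, here is a sketch of an application server that the node-exporter job above would pick up. The variables and the resource name are hypothetical, and Node Exporter is assumed to be installed on the instance and listening on port 9100.

```hcl
# Hypothetical inputs for an application server outside the monitoring stack
variable "environment" { type = string }
variable "app_ami_id" { type = string }
variable "app_subnet_id" { type = string }

resource "aws_instance" "app" {
  ami           = var.app_ami_id
  instance_type = "t3.small"
  subnet_id     = var.app_subnet_id

  tags = {
    Name        = "${var.environment}-app-01" # becomes the "instance" label
    Role        = "app"                       # becomes the "role" label
    Environment = var.environment             # matched by the tag:Environment filter
  }
}
```

If the provider's default_tags block already sets Environment, as in providers.tf above, the explicit tag is redundant; it is spelled out here only to show what the filter matches on.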

Grafana Module

Let's give Grafana its own module and configure it to run behind an ALB:

# modules/grafana/main.tf

resource "aws_security_group" "grafana" {
  name        = "${var.environment}-grafana-sg"
  description = "Grafana security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.grafana_alb.id]
    description     = "Grafana access from the ALB"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "grafana_alb" {
  name        = "${var.environment}-grafana-alb-sg"
  description = "Grafana ALB security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.allowed_cidr_blocks
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "grafana" {
  name               = "${var.environment}-grafana-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.grafana_alb.id]
  subnets            = var.public_subnet_ids

  enable_deletion_protection = var.environment == "prod" ? true : false
}

resource "aws_lb_listener" "grafana_https" {
  load_balancer_arn = aws_lb.grafana.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.grafana.arn
  }
}

resource "aws_lb_target_group" "grafana" {
  name     = "${var.environment}-grafana-tg"
  port     = 3000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/api/health"
    matcher             = "200"
  }
}

resource "aws_lb_target_group_attachment" "grafana" {
  target_group_arn = aws_lb_target_group.grafana.arn
  target_id        = aws_instance.grafana.id
  port             = 3000
}

resource "aws_instance" "grafana" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  subnet_id              = var.private_subnet_ids[0]
  vpc_security_group_ids = [aws_security_group.grafana.id]

  root_block_device {
    volume_type = "gp3"
    volume_size = 30
    encrypted   = true
  }

  user_data = base64encode(templatefile("${path.module}/userdata.sh.tpl", {
    admin_password    = var.admin_password
    prometheus_url    = var.prometheus_url
    grafana_ini       = templatefile("${path.module}/../../templates/grafana.ini.tpl", {
      domain      = var.grafana_domain
      environment = var.environment
    })
  }))

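  tags = {
    Name = "${var.environment}-grafana"
    Role = "monitoring"
  }
}

# Data sources are module-scoped, so this module needs its own AMI lookup
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}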

Note the line enable_deletion_protection = var.environment == "prod" ? true : false. We enable this setting conditionally so that the ALB can't be deleted by accident in production. The comparison already yields a boolean, so the ? true : false part is optional.
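
The two modules depend on each other's values: Prometheus needs the Grafana security group ID, and Grafana needs the Prometheus URL. The root main.tf is not shown in this post, so the following is only a sketch of how the wiring might look. The root variables vpc_cidr, allowed_cidr_blocks, certificate_arn and grafana_domain, the security_group_id output on the Grafana module, and the data_volume_size value are all assumptions.

```hcl
# main.tf (sketch)

module "grafana" {
  source = "./modules/grafana"

  environment        = var.environment
  vpc_id             = var.vpc_id
  private_subnet_ids = var.private_subnet_ids
  public_subnet_ids  = var.public_subnet_ids
  instance_type      = var.grafana_instance_type
  admin_password     = var.grafana_admin_password
  prometheus_url     = "http://${module.prometheus.private_ip}:9090"

  # Assumed root variables, not defined in variables.tf above
  allowed_cidr_blocks = var.allowed_cidr_blocks
  certificate_arn     = var.certificate_arn
  grafana_domain      = var.grafana_domain
}

module "prometheus" {
  source = "./modules/prometheus"

  environment        = var.environment
  aws_region         = var.aws_region
  vpc_id             = var.vpc_id
  private_subnet_ids = var.private_subnet_ids
  instance_type      = var.prometheus_instance_type
  retention_days     = var.retention_days
  data_volume_size   = 200 # assumed size in GiB

  # Assumed: a root variable and a security_group_id output on the Grafana module
  vpc_cidr      = var.vpc_cidr
  grafana_sg_id = module.grafana.security_group_id
}
```

Terraform orders individual resources rather than whole modules, so this two-way reference should not form a cycle: the Grafana security group depends on nothing in the Prometheus module, and only the Grafana instance waits for the Prometheus IP.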

Environment-Specific Variables

Let's define different sizes for different environments:

# environments/prod.tfvars
environment              = "prod"
aws_region               = "eu-west-1"
vpc_id                   = "vpc-0abc123def456"
private_subnet_ids       = ["subnet-0123", "subnet-0456"]
public_subnet_ids        = ["subnet-0789", "subnet-0abc"]
prometheus_instance_type = "t3.xlarge"
grafana_instance_type    = "t3.medium"
retention_days           = 90
alert_email              = "[email protected]"

# environments/dev.tfvars
environment              = "dev"
aws_region               = "eu-west-1"
vpc_id                   = "vpc-0def456abc123"
private_subnet_ids       = ["subnet-0def", "subnet-0ghi"]
public_subnet_ids        = ["subnet-0jkl", "subnet-0mno"]
prometheus_instance_type = "t3.small"
grafana_instance_type    = "t3.micro"
retention_days           = 7
alert_email              = "[email protected]"
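
Because the environment value feeds resource names, tags, and the EC2 discovery filter, a typo in a .tfvars file can quietly create a parallel set of resources. One way to guard against that, sketched here as a possible extension of the definition in variables.tf, is a validation block:

```hcl
variable "environment" {
  description = "Environment name (dev/staging/prod)"
  type        = string

  validation {
    # Reject anything other than the three environments used in this post
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment value must be one of: dev, staging, prod."
  }
}
```

With this in place, terraform plan fails early with the error message instead of planning resources for an unexpected environment name.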

CI/CD Pipeline Integration

Let's automate Terraform with GitHub Actions. This workflow runs plan on every PR and runs apply once changes are merged into the main branch:

# .github/workflows/terraform-monitoring.yml
name: Terraform Monitoring Infrastructure

on:
  push:
    branches:
      - main
    paths:
      - 'monitoring-infra/**'
  pull_request:
    paths:
      - 'monitoring-infra/**'

env:
  TF_VERSION: "1.6.0"
  AWS_REGION: "eu-west-1"

jobs:
  terraform-plan:
    name: Terraform Plan
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: monitoring-infra

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan (Staging)
        run: |
          terraform plan \
            -var-file=environments/staging.tfvars \
            -var="grafana_admin_password=${{ secrets.GRAFANA_ADMIN_PASSWORD }}" \
            -out=tfplan

      # upload-artifact and download-artifact must use the same major version
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: monitoring-infra/tfplan

  terraform-apply:
    name: Terraform Apply
    runs-on: ubuntu-latest
    needs: terraform-plan
    if: github.ref == 'refs/heads/main'
    environment: staging
    defaults:
      run:
        working-directory: monitoring-infra

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: monitoring-infra

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan

Alertmanager Configuration

Let's prepare a template for Alertmanager as well, covering Slack, email, and PagerDuty notifications:

# templates/alertmanager.yml.tpl
global:
  smtp_smarthost: 'smtp.sirket.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '${smtp_username}'
  smtp_auth_password: '${smtp_password}'
  slack_api_url: '${slack_webhook_url}'

route:
  group_by: ['alertname', 'environment']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        environment: prod
      receiver: 'email-ops'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts-${environment}'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${pagerduty_key}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true

  - name: 'email-ops'
    email_configs:
      - to: '${alert_email}'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
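
The template above expects six values. The Alertmanager module itself is not shown in this post, so the sketch below only illustrates how the template could be rendered with templatefile(); the variable names simply mirror the placeholders, and the file location is an assumption.

```hcl
# modules/alertmanager/main.tf (sketch)

variable "environment" { type = string }
variable "alert_email" { type = string }
variable "smtp_username" { type = string }

variable "smtp_password" {
  type      = string
  sensitive = true
}

variable "slack_webhook_url" {
  type      = string
  sensitive = true
}

variable "pagerduty_key" {
  type      = string
  sensitive = true
}

locals {
  # Every ${...} placeholder in the template needs a matching key here
  alertmanager_config = templatefile("${path.module}/../../templates/alertmanager.yml.tpl", {
    environment       = var.environment
    alert_email       = var.alert_email
    smtp_username     = var.smtp_username
    smtp_password     = var.smtp_password
    slack_webhook_url = var.slack_webhook_url
    pagerduty_key     = var.pagerduty_key
  })
}
```

Terraform only interprets the ${ and %{ sequences in a template, so Alertmanager's own {{ ... }} expressions in the Slack title and text pass through unchanged.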

Common Problems and Solutions

When you build this infrastructure in real life, you will run into the following problems:

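State locking problem: If two pipelines run at the same time, you get Error locking state. When a crashed run leaves a stale lock behind, try terraform force-unlock with the lock ID from the error message first. As a last resort you may have to delete the lock manually from the DynamoDB table: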

# Manually removing the state lock (use with care)
aws dynamodb delete-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "sirket-terraform-state/monitoring/terraform.tfstate"}}' \
  --region eu-west-1

Drift detection: If someone makes a manual change in the console, the Terraform state and the real infrastructure diverge. Run terraform plan regularly to catch this:

# Drift check - there should be no changes
terraform plan \
  -var-file=environments/prod.tfvars \
  -var="grafana_admin_password=${GRAFANA_PASS}" \
  -detailed-exitcode

# Exit code 0 = no changes, 1 = error, 2 = changes present
echo "Exit code: $?"

Module versioning: Don't use modules in production without pinning a version. Reference a Git tag:

# Calling the module in main.tf - pinned to a version
module "prometheus" {
  source = "git::https://github.com/sirket/terraform-modules.git//prometheus?ref=v2.1.0"
  
  environment   = var.environment
  instance_type = var.prometheus_instance_type
  vpc_id        = var.vpc_id
}

Outputs and Dependency Management

The outputs.tf file is critical both for debugging and for resolving dependencies between modules:

# outputs.tf
output "prometheus_private_ip" {
  description = "Prometheus server private IP"
  value       = module.prometheus.private_ip
}

output "grafana_url" {
  description = "Grafana access URL"
  value       = "https://${module.grafana.alb_dns_name}"
}

output "alertmanager_endpoint" {
  description = "Alertmanager endpoint"
  value       = module.alertmanager.endpoint
  sensitive   = false
}

output "prometheus_security_group_id" {
  description = "Prometheus SG ID - Node Exporter access for other servers"
  value       = module.prometheus.security_group_id
}

The prometheus_security_group_id output is especially important. When you add Node Exporter to your application servers, you use this SG ID to make sure that only Prometheus can reach port 9100.
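
A sketch of such a rule, assuming the application servers have their own security group and that the Prometheus SG ID is passed in from the output above (both variable names are hypothetical):

```hcl
# Hypothetical inputs: the app servers' own security group and the
# prometheus_security_group_id output of the monitoring configuration
variable "app_security_group_id" { type = string }
variable "prometheus_security_group_id" { type = string }

# Allow port 9100 only from instances that carry the Prometheus security group
resource "aws_security_group_rule" "node_exporter_from_prometheus" {
  type                     = "ingress"
  from_port                = 9100
  to_port                  = 9100
  protocol                 = "tcp"
  security_group_id        = var.app_security_group_id
  source_security_group_id = var.prometheus_security_group_id
  description              = "Node Exporter scrape from Prometheus"
}
```

Referencing the source security group instead of a CIDR block keeps the rule valid even if the Prometheus instance is replaced and gets a new private IP.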

Conclusion

Building monitoring infrastructure with Terraform may look laborious at first, but the long-term payoff is large. With a well-built modular setup, standing up monitoring for a new environment comes down to running terraform apply -var-file=environments/prod.tfvars.

Let me summarize the points I especially want you to watch:

State management: Teamwork isn't possible without remote state; sort this out in the first step
Modular structure: Put each component in its own module; it keeps the project healthy as it grows
Sensitive variables: Passwords and API keys must never be committed in .tfvars files
Drift detection: Add a regular terraform plan to your CI/CD pipeline; infrastructure doesn't like surprises
Environment separation: Test with small instances in dev, and validate in staging before moving to production

Once this infrastructure is up, the first real test you face is usually a disk-full alert. I recommend sizing the Prometheus data volume generously from the start, because as far as the physical disk is concerned, lowering retention_days is not always as simple as running terraform apply.