Production Optimization & Scaling
Scale your CI/CD platform for high availability, performance, and reliability
Running Concourse in production requires attention to scaling, monitoring, and maintenance. This final guide covers everything you need to operate Concourse reliably at scale.
High Availability Architecture
Multi-Node Web Cluster
For high availability, run multiple ATC (web) nodes behind a load balancer:
┌─────────────────┐
│ Load Balancer │
│ (Nginx/HAProxy)│
└────────┬────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ ATC-1 │ │ ATC-2 │ │ ATC-3 │
│ (Web/API) │ │ (Web/API) │ │ (Web/API) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────────────┼────────────────┘
│
┌────────┴────────┐
│ PostgreSQL │
│ (Primary) │
└────────┬────────┘
│
┌────────┴────────┐
│ PostgreSQL │
│ (Replica) │
└─────────────────┘
Docker Compose for Multi-Web Setup
version: '3.8'
services:
concourse-db:
image: postgres:15
environment:
POSTGRES_DB: concourse
POSTGRES_USER: concourse_user
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
volumes:
- pgdata:/var/lib/postgresql/data
deploy:
resources:
limits:
memory: 2G
concourse-web-1:
image: concourse/concourse:7.11
command: web
depends_on:
- concourse-db
environment: &web-env
CONCOURSE_POSTGRES_HOST: concourse-db
CONCOURSE_POSTGRES_USER: concourse_user
CONCOURSE_POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
CONCOURSE_POSTGRES_DATABASE: concourse
CONCOURSE_EXTERNAL_URL: https://concourse.example.com
CONCOURSE_SESSION_SIGNING_KEY: /keys/session_signing_key
CONCOURSE_TSA_HOST_KEY: /keys/tsa_host_key
CONCOURSE_TSA_AUTHORIZED_KEYS: /keys/authorized_worker_keys
CONCOURSE_CLUSTER_NAME: production
CONCOURSE_ENABLE_GLOBAL_RESOURCES: "true"
volumes:
- ./keys:/keys:ro
networks:
- concourse-net
concourse-web-2:
image: concourse/concourse:7.11
command: web
depends_on:
- concourse-db
environment: *web-env
volumes:
- ./keys:/keys:ro
networks:
- concourse-net
concourse-web-3:
image: concourse/concourse:7.11
command: web
depends_on:
- concourse-db
environment: *web-env
volumes:
- ./keys:/keys:ro
networks:
- concourse-net
nginx:
image: nginx:alpine
ports:
- "80:80"
- "443:443"
- "2222:2222" # TSA
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- ./certs:/etc/nginx/certs:ro
depends_on:
- concourse-web-1
- concourse-web-2
- concourse-web-3
networks:
- concourse-net
networks:
concourse-net:
volumes:
pgdata:
Load Balancer Configuration
events {
worker_connections 1024;
}
http {
upstream concourse_web {
least_conn;
server concourse-web-1:8080;
server concourse-web-2:8080;
server concourse-web-3:8080;
}
server {
listen 80;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
ssl_certificate /etc/nginx/certs/server.crt;
ssl_certificate_key /etc/nginx/certs/server.key;
location / {
proxy_pass http://concourse_web;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
proxy_read_timeout 900s;
proxy_buffering off;
}
}
}
stream {
upstream concourse_tsa {
server concourse-web-1:2222;
server concourse-web-2:2222;
server concourse-web-3:2222;
}
server {
listen 2222;
proxy_pass concourse_tsa;
}
}
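Before rolling this out, it's worth validating the config without a restart. A minimal sketch, assuming the nginx service name from the compose file above:
# Validate the mounted config inside the running container
docker compose exec nginx nginx -t
# Apply changes with a zero-downtime reload
docker compose exec nginx nginx -s reload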
Worker Scaling
Horizontal Worker Scaling
Add workers to handle more concurrent builds:
version: '3.8'
services:
worker-1:
image: concourse/concourse:7.11
command: worker
privileged: true
environment: &worker-env
CONCOURSE_TSA_HOST: concourse.example.com:2222
CONCOURSE_TSA_PUBLIC_KEY: /keys/tsa_host_key.pub
CONCOURSE_TSA_WORKER_PRIVATE_KEY: /keys/worker_key
CONCOURSE_RUNTIME: containerd
CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
CONCOURSE_WORK_DIR: /worker-state
volumes:
- ./keys:/keys:ro
- worker1-state:/worker-state
worker-2:
image: concourse/concourse:7.11
command: worker
privileged: true
environment: *worker-env
volumes:
- ./keys:/keys:ro
- worker2-state:/worker-state
worker-3:
image: concourse/concourse:7.11
command: worker
privileged: true
environment: *worker-env
volumes:
- ./keys:/keys:ro
- worker3-state:/worker-state
volumes:
worker1-state:
worker2-state:
worker3-state:
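When scaling back down, drain a worker before removing it so in-flight builds finish cleanly. A sketch using fly; worker names default to the container hostname, so check the listing first:
# List registered workers and note the name to drain
fly -t main workers
# Land the worker: it finishes current builds but takes no new ones
fly -t main land-worker -w <worker-name>
# Once it shows as landed, stop and remove the container
docker compose rm -sf worker-3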
Worker Tags for Specialization
Route specific workloads to appropriate workers:
# GPU worker
concourse worker \
--tag=gpu \
--tag=high-memory
# ARM worker
concourse worker \
--tag=arm64
# Windows worker (for Windows builds)
concourse worker \
--tag=windows
Use tags in your pipeline:
jobs:
- name: train-ml-model
plan:
- task: train
tags: [gpu]
config:
platform: linux
# ...
- name: build-arm-image
plan:
- task: build
tags: [arm64]
config:
platform: linux
# ...
Worker Resource Limits
environment:
# Limit concurrent containers (workers on the containerd runtime, as
# configured above, use CONCOURSE_CONTAINERD_MAX_CONTAINERS instead)
CONCOURSE_GARDEN_MAX_CONTAINERS: 250
# Memory limits
CONCOURSE_GARDEN_DEFAULT_CONTAINER_MEMORY_LIMIT: 4g
# CPU limits
CONCOURSE_GARDEN_DEFAULT_CONTAINER_CPU_SHARES: 1024
# Where baggageclaim stores volumes
CONCOURSE_BAGGAGECLAIM_VOLUMES_DIR: /worker-state/volumes
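To see how close workers are running to these limits, the workers listing shows live container counts per node. A small sketch:
# Container counts per worker, refreshed every 10 seconds
watch -n 10 fly -t main workers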
Performance Tuning
Database Optimization
PostgreSQL tuning for Concourse (example for 8GB RAM):
# Memory settings
shared_buffers = 2GB # 25% of RAM
effective_cache_size = 6GB # 75% of RAM
maintenance_work_mem = 512MB
work_mem = 256MB # per sort/hash operation, not per connection; lower if many queries run concurrently
# SSD storage optimizations
random_page_cost = 1.1
effective_io_concurrency = 200
# Connection limit (use PgBouncer below for actual pooling)
max_connections = 200
# Write performance
wal_buffers = 64MB
checkpoint_completion_target = 0.9
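If you prefer not to hand-edit postgresql.conf, most of these can be applied with ALTER SYSTEM. A sketch, assuming a superuser connection; note that shared_buffers only takes effect after a restart:
psql -h localhost -U postgres -d concourse <<'SQL'
ALTER SYSTEM SET shared_buffers = '2GB';
ALTER SYSTEM SET effective_cache_size = '6GB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf(); -- reloadable settings apply now; shared_buffers needs a restart
SQL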
Connection Pooling with PgBouncer
For high-traffic installations:
[databases]
concourse = host=postgres port=5432 dbname=concourse
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Concourse relies on Postgres LISTEN/NOTIFY, which transaction pooling breaks
pool_mode = session
max_client_conn = 1000
default_pool_size = 50
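Then point the web nodes at PgBouncer rather than Postgres directly. The variables are standard Concourse flags; the host and port values assume the PgBouncer config above:
# In each web node's environment (compose file or systemd unit)
export CONCOURSE_POSTGRES_HOST=pgbouncer
export CONCOURSE_POSTGRES_PORT=6432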
Web Node Tuning
environment:
# Build scheduling
CONCOURSE_BUILD_TRACKER_INTERVAL: 5s
CONCOURSE_RESOURCE_CHECKING_INTERVAL: 30s
# Limit expensive API actions (formatted as ACTION:LIMIT)
CONCOURSE_CONCURRENT_REQUEST_LIMIT: ListAllJobs:50
# Garbage collection
CONCOURSE_GC_INTERVAL: 30s
CONCOURSE_GC_ONE_OFF_GRACE_PERIOD: 5m
# Instanced pipelines and caching of streamed volumes
CONCOURSE_ENABLE_PIPELINE_INSTANCES: "true"
CONCOURSE_ENABLE_CACHE_STREAMED_VOLUMES: "true"
Global Resources
Reduce redundant resource checks across pipelines:
environment:
CONCOURSE_ENABLE_GLOBAL_RESOURCES: "true"
This deduplicates resource checking when multiple pipelines use the same resource definition; definitions must be identical (same type and source) to share a check.
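One way to confirm it is working: count active check containers before and after enabling the flag. With definitions shared verbatim across pipelines, the count should drop noticeably. A rough sketch:
# Active resource-check containers across the cluster
fly -t main containers | grep -c check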
Monitoring
Prometheus Metrics
Concourse exposes Prometheus metrics:
environment:
CONCOURSE_PROMETHEUS_BIND_IP: 0.0.0.0
CONCOURSE_PROMETHEUS_BIND_PORT: 9391
Prometheus scrape config:
scrape_configs:
- job_name: 'concourse'
static_configs:
- targets:
- 'concourse-web-1:9391'
- 'concourse-web-2:9391'
- 'concourse-web-3:9391'
metrics_path: /metrics
Key Metrics to Monitor
# Build queue depth
concourse_builds_running
concourse_builds_pending
# Worker health
concourse_workers_registered
concourse_workers_containers
# Resource check performance
concourse_resource_checks_total
concourse_resource_check_duration_seconds
# Database connections
concourse_db_connections_total
# API latency
concourse_http_responses_duration_seconds_bucket
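Any of these can be spot-checked against the Prometheus HTTP API before wiring up dashboards. A sketch, assuming a reachable Prometheus and jq installed:
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=sum(concourse_builds_pending)' | jq '.data.result'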
Alerting Rules
groups:
- name: concourse
rules:
- alert: ConcourseWorkerDown
expr: concourse_workers_registered < 1
for: 5m
labels:
severity: critical
annotations:
summary: "No workers registered"
- alert: ConcourseHighBuildQueue
expr: concourse_builds_pending > 50
for: 10m
labels:
severity: warning
annotations:
summary: "High build queue ({{ $value }} pending)"
- alert: ConcourseResourceCheckSlow
expr: histogram_quantile(0.95, sum by (le) (rate(concourse_resource_check_duration_seconds_bucket[5m]))) > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Resource checks taking too long"Backup & Recovery
Backup & Recovery
Database Backups
#!/bin/bash
BACKUP_DIR="/backups/concourse"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/concourse_${TIMESTAMP}.sql.gz"
# Create backup
pg_dump -h localhost -U concourse_user -d concourse | gzip > "$BACKUP_FILE"
# Keep only last 7 days
find "$BACKUP_DIR" -name "concourse_*.sql.gz" -mtime +7 -delete
# Upload to S3
aws s3 cp "$BACKUP_FILE" s3://my-backups/concourse/
Schedule with cron:
0 2 * * * /opt/scripts/backup-concourse.sh
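A backup is only as good as its last test restore. A sketch that restores into a throwaway database and sanity-checks a core table; the scratch database name is arbitrary:
# Verify the archive is intact
gunzip -t "$BACKUP_FILE"
# Restore into a scratch database and check a known table
createdb -h localhost -U postgres concourse_restore_test
gunzip -c "$BACKUP_FILE" | psql -h localhost -U postgres -d concourse_restore_test
psql -h localhost -U postgres -d concourse_restore_test -c 'SELECT count(*) FROM builds;'
dropdb -h localhost -U postgres concourse_restore_test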
Key Backups
#!/bin/bash
tar -czf /backups/concourse-keys-$(date +%Y%m%d).tar.gz \
/opt/concourse/keys/
# Encrypt before storing
gpg --symmetric --cipher-algo AES256 \
/backups/concourse-keys-$(date +%Y%m%d).tar.gz
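gpg --symmetric writes a .gpg file alongside the tarball; remember to decrypt it during a restore (YYYYMMDD stands in for the backup date, and gpg will prompt for the passphrase):
gpg --decrypt /backups/concourse-keys-YYYYMMDD.tar.gz.gpg > /tmp/concourse-keys.tar.gz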
Disaster Recovery
Restore procedure:
# 1. Stop Concourse
docker compose down
# 2. Restore database
gunzip -c backup.sql.gz | psql -h localhost -U concourse_user -d concourse
# 3. Restore keys
tar -xzf concourse-keys-backup.tar.gz -C /
# 4. Start Concourse
docker compose up -d
# 5. Verify workers reconnect
fly -t main workers
Maintenance Tasks
Garbage Collection
Concourse automatically cleans up, but you can tune it:
environment:
# How often to run GC
CONCOURSE_GC_INTERVAL: 30s
# How long to keep one-off build containers
CONCOURSE_GC_ONE_OFF_GRACE_PERIOD: 5m
# How long to keep missing workers before pruning
CONCOURSE_GC_MISSING_GRACE_PERIOD: 10m
# Failed containers grace period
CONCOURSE_GC_FAILED_GRACE_PERIOD: 120h
Manual Cleanup
# Prune stalled workers
fly -t main prune-worker -w stale-worker-name
# Archive old pipelines
fly -t main archive-pipeline -p old-pipeline
# Force a re-check of a resource from a specific version
fly -t main check-resource -r my-pipeline/my-resource --from version:1.2.3
Upgrading Concourse
# 1. Review release notes for breaking changes
# https://github.com/concourse/concourse/releases
# 2. Backup everything
./backup-concourse.sh
./backup-keys.sh
# 3. Update image version
# docker-compose.yml: image: concourse/concourse:7.12
# 4. Rolling restart (for HA setups)
docker compose up -d --no-deps concourse-web-1
# Wait for healthy
docker compose up -d --no-deps concourse-web-2
docker compose up -d --no-deps concourse-web-3
# 5. Update workers
docker compose -f docker-compose.workers.yml up -d
# 6. Verify
fly -t main workers
fly -t main status
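The "wait for healthy" step can be scripted against the info endpoint, which reports the running version. A sketch; adjust the host and port to your setup:
# Poll until the restarted node serves its API again
until curl -fsS http://concourse-web-1:8080/api/v1/info; do
  sleep 5
done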
Cost Optimization
Right-Size Workers
Monitor actual resource usage and adjust:
# Check container resource usage
fly -t main containers
# Worker resource utilization
curl -s http://localhost:9391/metrics | grep concourse_workers
Scheduled Workers
For non-critical workloads, scale down during off-hours:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: concourse-worker
spec:
scaleTargetRef:
name: concourse-worker
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
# serverAddress is required by KEDA's prometheus scaler; adjust to your setup
serverAddress: http://prometheus:9090
query: concourse_builds_pending
threshold: "5"Pipeline Efficiency
Pipeline Efficiency
Reduce build times and resource usage:
# Use caching aggressively
caches:
- path: .npm
- path: node_modules
- path: .cache/pip
# Avoid redundant work
- get: source
params:
depth: 1 # Shallow clone
# Parallelize where possible
- in_parallel:
limit: 4 # Control parallelism
steps:
- task: test-1
- task: test-2
Troubleshooting Guide
Workers Not Connecting
# Check TSA logs
docker compose logs concourse-web | grep tsa
# Verify key permissions
ls -la keys/
# Should be: -rw-r--r-- (644) for public keys
# -rw------- (600) for private keys
# Test TSA connectivity from worker
nc -zv concourse.example.com 2222
Builds Stuck Pending
# Check worker capacity
fly -t main workers
# Look for resource issues
fly -t main containers
# Check for job locks
fly -t main builds -j pipeline/job
Resource Checks Failing
# Force resource check
fly -t main check-resource -r pipeline/resource
# View resource check errors
fly -t main resource-versions -r pipeline/resource
# Debug with intercept
fly -t main intercept -j pipeline/job -s get-resource
High Memory Usage
# Check container counts per worker
curl -s http://localhost:9391/metrics | grep containers
# Reduce concurrent builds
fly -t main pause-job -j pipeline/expensive-job
# Manually trigger GC (restart web to force)
docker compose restart concourse-web-1
Production Checklist
Initial Deployment
- Multi-node web cluster for HA
- Multiple workers across availability zones
- PostgreSQL with replication
- TLS/HTTPS configured
- Authentication backend configured
- Credential manager integrated
- Monitoring and alerting set up
- Backup procedures tested
Ongoing Operations
- Daily database backups verified
- Log aggregation configured
- Alerts responding to pages
- Regular Concourse updates planned
- Capacity planning reviewed quarterly
- Security patches applied promptly
- Runbooks documented
Series Complete!
Congratulations! You've completed the Concourse CI Mastery Series. You now have the knowledge to deploy, configure, and operate Concourse CI at any scale—from a single VPS to enterprise high-availability clusters.
Quick Reference
| Topic | Guide |
|---|---|
| Installation | Part 1 |
| Core Concepts | Part 2 |
| Pipeline Development | Part 3 |
| Advanced Patterns | Part 4 |
| Security | Part 5 |
| Production Operations | Part 6 (this guide) |
