Your apps are deployed and your CI/CD pipeline works. But how do you know when something breaks? This guide covers the observability stack: metrics with Prometheus, visualization with Grafana, logs with Loki, uptime monitoring, and alerting.
## The Observability Stack
We're building a complete monitoring solution:
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Grafana | Dashboards and visualization | 3001 |
| Loki | Log aggregation | 3100 |
| Promtail | Log shipping to Loki | — |
| Uptime Kuma | Uptime monitoring and alerting | 3002 |
| cAdvisor | Container metrics | 8080 |
| Node Exporter | Host system metrics | 9100 |
All of these run as containers alongside your applications in Dokploy.
## Deploy Prometheus
Prometheus scrapes metrics from your applications and stores them as time series data.
### Create the Service

1. In Dokploy, **Create Service → Docker Compose**
2. Name it `monitoring`
3. Paste this compose file:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

volumes:
  prometheus_data:
```

### Prometheus Configuration
Create `prometheus.yml` in the same directory:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files: []

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (host metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Your applications (add as needed)
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:3000']
    metrics_path: '/metrics'
```

### Add Node Exporter and cAdvisor
Expand your compose file to collect host and container metrics:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
```

### Verify Prometheus
Visit `http://your-server-ip:9090` and check:

- **Status → Targets** — all targets should be "UP"
- **Graph** — try querying `up` to see all monitored services
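You can also check target health programmatically through Prometheus' HTTP API (`GET /api/v1/targets`). A minimal Node sketch — the response shape is Prometheus' own, but the trimmed `sample` object below is illustrative:

```javascript
// List the jobs of any scrape targets that are not healthy,
// given the parsed JSON body of GET /api/v1/targets.
function downTargets(apiResponse) {
  return apiResponse.data.activeTargets
    .filter((t) => t.health !== 'up')
    .map((t) => t.labels.job);
}

// Trimmed example of the response shape:
const sample = {
  status: 'success',
  data: {
    activeTargets: [
      { labels: { job: 'prometheus' }, health: 'up' },
      { labels: { job: 'node' }, health: 'down' },
    ],
  },
};

console.log(downTargets(sample)); // → [ 'node' ]
```

Pointing this at `http://your-server-ip:9090/api/v1/targets` from a cron job gives you a cheap "is my monitoring itself healthy" check.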
## Deploy Grafana
Grafana turns your metrics into dashboards.
### Add to Compose
```yaml
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3001:3000"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```

### Initial Setup
1. Visit `http://your-server-ip:3001`
2. Log in with **admin** / your-secure-password
3. Go to **Connections → Data sources → Add data source**
4. Select **Prometheus**
5. Set URL: `http://prometheus:9090`
6. Click **Save & test**
### Import Dashboards
Don't build from scratch. Import community dashboards.
1. Go to **Dashboards → Import**
2. Enter the dashboard ID and click **Load**
3. Select your Prometheus data source
4. Click **Import**
| Dashboard | ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Host system metrics |
| Docker Container | 893 | Container overview |
| cAdvisor | 14282 | Detailed container metrics |
| Traefik | 4475 | Reverse proxy metrics |
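The `myapp` scrape job in `prometheus.yml` assumes your app serves a `/metrics` endpoint. In a real Node app you would typically use the `prom-client` library, but the exposition format Prometheus scrapes is plain text; here is a hand-rolled sketch of the `http_requests_total` counter used in the dashboard queries (the label scheme is illustrative):

```javascript
// Minimal sketch of a Prometheus counter and its text exposition.
// In production, use a client library (e.g. prom-client for Node);
// this only shows what Prometheus actually scrapes from /metrics.
const requests = {}; // flattened label set -> count, e.g. { 'GET|200': 42 }

function recordRequest(method, status) {
  const key = `${method}|${status}`;
  requests[key] = (requests[key] || 0) + 1;
}

function renderMetrics() {
  const lines = [
    '# HELP http_requests_total Total HTTP requests',
    '# TYPE http_requests_total counter',
  ];
  for (const [key, count] of Object.entries(requests)) {
    const [method, status] = key.split('|');
    lines.push(`http_requests_total{method="${method}",status="${status}"} ${count}`);
  }
  return lines.join('\n') + '\n';
}

recordRequest('GET', 200);
recordRequest('GET', 200);
recordRequest('GET', 500);
console.log(renderMetrics());
// # HELP http_requests_total Total HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",status="200"} 2
// http_requests_total{method="GET",status="500"} 1
```

Serve `renderMetrics()` from a `GET /metrics` route with a `text/plain` content type and the `myapp` scrape job will pick it up.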
### Custom Application Dashboard
Useful PromQL queries for your own apps:

```promql
# Request rate
rate(http_requests_total{job="myapp"}[5m])

# Error rate (5xx responses)
rate(http_requests_total{job="myapp",status=~"5.."}[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="myapp"}[5m]))

# Memory usage
container_memory_usage_bytes{name="myapp"}

# CPU usage
rate(container_cpu_usage_seconds_total{name="myapp"}[5m])
```

## Deploy Loki for Logs
Metrics tell you something's wrong. Logs tell you why.
### Add Loki and Promtail
```yaml
  loki:
    image: grafana/loki:latest
    container_name: loki
    volumes:
      - loki_data:/loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
```

### Promtail Configuration

Create `promtail.yml` in the same directory:
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Docker container logs
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
          __path__: /var/lib/docker/containers/*/*log
    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            attrs:
      - json:
          expressions:
            tag:
          source: attrs
      - regex:
          expression: (?P<container_name>(?:[a-zA-Z0-9][a-zA-Z0-9_.-]+))
          source: tag
      - labels:
          stream:
          container_name:
      - output:
          source: output

  # System logs
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
```

### Connect Loki to Grafana
1. In Grafana, go to **Connections → Data sources → Add data source**
2. Select **Loki**
3. Set URL: `http://loki:3100`
4. Click **Save & test**
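Loki's `json` pipeline stage (and the `| json` LogQL filter) works best when your app writes one JSON object per line to stdout. A sketch of a structured log helper — the field names here are illustrative, not required by Loki:

```javascript
// Emit one JSON object per line so Promtail/Loki can extract
// fields like `level` with the `json` stage or `| json` in LogQL.
function logLine(level, message, extra = {}) {
  return JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    ...extra,
  });
}

console.log(logLine('info', 'server started', { port: 3000 }));
console.log(logLine('error', 'db connection failed'));
```

With logs in this shape, a query like `| json | level="error"` filters on the parsed field instead of grepping raw text.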
### Query Logs
In Grafana, go to **Explore** and select Loki as the data source:

```logql
# All logs from a container
{container_name="myapp"}

# Filter for a substring
{container_name="myapp"} |= "error"

# Parse JSON logs and filter by field
{container_name="myapp"} | json | level="error"

# Count errors over time
count_over_time({container_name="myapp"} |= "error" [5m])
```

## Deploy Uptime Kuma
External uptime monitoring catches issues that internal monitoring misses.
### Add to Compose
```yaml
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime_kuma_data:/app/data
    ports:
      - "3002:3001"
    restart: unless-stopped

volumes:
  uptime_kuma_data:
```

### Initial Setup
1. Visit `http://your-server-ip:3002`
2. Create your admin account
3. Click **Add New Monitor**
### Monitor Types
**HTTP(s)** — web endpoints

- URL: `https://app.yourdomain.com/health`
- Interval: 60 seconds
- Retries: 3

**TCP** — database connectivity

- Hostname: `main-db`
- Port: 5432

**Docker Container**

- Container name: `myapp`
- Checks whether the container is running

**DNS**

- Hostname: `app.yourdomain.com`
- Verifies DNS resolution
### Status Page
Uptime Kuma can generate a public status page:
1. Go to **Status Pages**
2. Click **New Status Page**
3. Add your monitors
4. Set the slug (e.g., `status`)
5. Access it at `http://your-server-ip:3002/status/status`
Point a subdomain like status.yourdomain.com at it for a professional look.
## Health Checks
Health checks tell Dokploy (and your monitoring) whether your app is actually working.
### Node.js / Express Health Endpoint
A minimal endpoint:

```javascript
app.get('/health', (req, res) => {
  res.json({
    status: 'ok',
    timestamp: new Date().toISOString()
  });
});
```

A deeper check that verifies dependencies:

```javascript
app.get('/health', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');

    // Check Redis
    await redis.ping();

    res.json({
      status: 'ok',
      database: 'connected',
      cache: 'connected'
    });
  } catch (error) {
    res.status(503).json({
      status: 'error',
      message: error.message
    });
  }
});
```

### Dockerfile Health Check
```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

Parameters:

- `--interval=30s` — check every 30 seconds
- `--timeout=3s` — fail if no response within 3 seconds
- `--start-period=10s` — grace period for app startup
- `--retries=3` — mark unhealthy after 3 consecutive failures
### Dokploy Health Check
In your application settings:

1. Go to the **Advanced** tab
2. Set **Health Check Path**: `/health`
3. Set **Health Check Interval**: `30`
Dokploy uses this to:
- Determine when new deployments are ready
- Restart containers that become unhealthy
- Route traffic only to healthy instances
## Alerting
Dashboards are useless at 3 AM. Set up alerts.
### Grafana Alerts
Go to **Alerting → Alert rules → New alert rule**. Useful alert conditions:

```promql
# Error rate above 5%
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05

# Container memory above 90% of its limit
container_memory_usage_bytes{name="myapp"} / container_spec_memory_limit_bytes{name="myapp"} > 0.9

# Application down
up{job="myapp"} == 0
```

### Notification Channels
Configure where alerts go in **Alerting → Contact points**:
**Slack**

Add new contact point → Slack → enter your Slack webhook URL.

**Discord**

Add new contact point → Discord → enter your Discord webhook URL.

**Email**

Configure SMTP in Grafana settings first, then add contact point → Email.

**PagerDuty / Opsgenie**

For on-call rotations, integrate with your PagerDuty or Opsgenie API key.
### Uptime Kuma Alerts
Uptime Kuma has built-in notifications:
1. Go to **Settings → Notifications**
2. Click **Setup Notification**
3. Choose a provider (Slack, Discord, Telegram, Email, etc.)
4. Test the notification
5. Assign it to your monitors
## Resource Limits
Prevent runaway containers from taking down your server.
### Docker Resource Limits
```yaml
services:
  myapp:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 128M
```

### Recommended Limits
| App Type | Memory | CPU |
|---|---|---|
| Static site (Nginx) | 64-128M | 0.25 |
| Node.js API | 256-512M | 0.5-1.0 |
| Laravel/Django | 256-512M | 0.5-1.0 |
| Next.js | 512M-1G | 0.5-1.0 |
| Background workers | 256-512M | 0.5 |
Start conservative and increase based on actual usage in Grafana.
### Auto-Restart on OOM
Docker automatically restarts containers killed for exceeding memory limits if you have:

```yaml
restart: unless-stopped
```

Or in Dokploy, enable **Auto Restart** in the application settings.
## Complete Monitoring Stack
Here's the full `docker-compose.yml`:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3001:3000"
    restart: unless-stopped

  loki:
    image: grafana/loki:latest
    container_name: loki
    volumes:
      - loki_data:/loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml:ro
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime_kuma_data:/app/data
    ports:
      - "3002:3001"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  uptime_kuma_data:
```

## Quick Reference
### Service URLs
| Service | URL | Default Credentials |
|---|---|---|
| Prometheus | :9090 | None |
| Grafana | :3001 | admin / (your password) |
| Uptime Kuma | :3002 | (set on first visit) |
| cAdvisor | :8080 | None |
| Node Exporter | :9100/metrics | None |
| Loki | :3100 | None (API only) |
### Useful PromQL Queries
```promql
# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by container
container_memory_usage_bytes

# Disk usage
node_filesystem_avail_bytes / node_filesystem_size_bytes

# Network traffic
rate(node_network_receive_bytes_total[5m])

# HTTP request rate
rate(http_requests_total[5m])

# Container restarts
increase(container_restart_count[1h])
```

## What's Next
You can see what's happening. Part 6 locks it down:
**Production Hardening**
SSO with Authentik, Cloudflare Tunnel for zero-trust access, secrets management with Infisical, and a disaster recovery playbook.
