Troubleshooting Runbooks
This section collects practical runbooks for infrastructure operations: production troubleshooting, incident response, deployment failures, server issues, monitoring alerts, and recovery procedures.
The goal is to demonstrate structured operational thinking, not just command knowledge.
Runbooks
1. Production Incident Response Runbook
Focus areas:
- Confirm impact
- Identify affected users or services
- Check recent changes
- Review logs and metrics
- Isolate the failing layer
- Apply the safest recovery option
- Validate service restoration
- Document root cause and prevention
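The "check recent changes" step can be sketched as a small filter over a change log. This is a minimal illustration, not a prescribed tool: the `Change` record shape, the `changes_before_incident` name, and the two-hour default window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """One deployment or config change (hypothetical record shape)."""
    description: str
    applied_at: datetime


def changes_before_incident(changes, incident_start, window=timedelta(hours=2)):
    """Return changes applied within `window` before the incident, newest first.

    The two-hour default is an assumption; pick a window that matches
    how often the affected service is changed.
    """
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)
```

Listing the newest change first reflects the common case that the most recent change is the first one worth ruling out.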
2. Linux Server High CPU Troubleshooting
Focus areas:
- Load average
- Top CPU processes
- Application logs
- Cron jobs
- Background tasks
- Recent deployments
- Resource saturation
- Recovery options
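As a rough sketch of the load average check, the 1-minute load can be normalized by the number of CPU cores so that the same thresholds apply to hosts of different sizes. The function names and the 0.7 / 1.0 thresholds below are assumptions for illustration, not established limits.

```python
import os


def load_per_core(load_1min, cpu_count):
    """Normalize the 1-minute load average by the number of cores."""
    return load_1min / max(cpu_count, 1)


def classify_load(ratio, warn=0.7, critical=1.0):
    """Map a per-core load ratio to a coarse status.

    The warn/critical thresholds are assumed values; tune them per workload.
    """
    if ratio >= critical:
        return "saturated"
    if ratio >= warn:
        return "elevated"
    return "normal"


def current_load_status():
    """Classify the load of the local host (Unix only: uses os.getloadavg)."""
    load_1min, _load_5min, _load_15min = os.getloadavg()
    cores = os.cpu_count() or 1
    return classify_load(load_per_core(load_1min, cores))
```

A high ratio only says the host is busy or blocked; the top CPU processes, cron jobs, and recent deployments listed above are still needed to find out why.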
3. Disk Full Troubleshooting
Focus areas:
- Filesystem usage
- Inode usage
- Large files
- Log growth
- Docker volume usage
- Cleanup strategy
- Prevention with log rotation
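The first three checks (filesystem usage, inode usage, large files) can be approximated with the Python standard library. This is a sketch under the assumption of a Unix host, since `os.statvfs` is not available on Windows; the function names are made up for the example.

```python
import os
import shutil


def usage_percent(path="/"):
    """Percentage of bytes used on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def inode_usage_percent(path="/"):
    """Percentage of inodes used; 0.0 if the filesystem reports no inode count."""
    st = os.statvfs(path)
    if st.f_files == 0:
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) tuples under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                # File vanished or is unreadable (e.g. a broken symlink); skip it.
                continue
    return sorted(sizes, reverse=True)[:limit]
```

Checking inodes separately matters because a filesystem can refuse new files while still showing free bytes, for example when a directory fills up with many tiny files.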
4. Docker Compose Application Down
Focus areas:
- Container status
- Container logs
- Port conflicts
- Environment variables
- Docker networks
- Volumes
- Restart policies
- Application health checks
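For the port conflict check, one simple approach is to try binding the port the application is supposed to publish. The sketch below assumes the check runs on the Docker host itself and that the service publishes on the given host address; `port_is_free` is a hypothetical helper name.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if `port` can be bound on `host`, False if it is already taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use", or a permission error on low ports.
            return False
        return True
```

A `False` result while the stack is stopped suggests that another process holds the port, which would prevent the container from starting with that port mapping.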
5. PostgreSQL Backup and Restore Runbook
Focus areas:
- Manual backup
- Automated backup
- Restore validation
- Connection troubleshooting
- Disk space validation
- Recovery testing
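A manual backup with basic validation might be assembled as below. The sketch only builds the command lines and checks preconditions; it does not run anything. The host, user, and helper names are placeholders, and it assumes `pg_dump` and `pg_restore` are on the PATH with authentication handled separately.

```python
import os
import shutil


def build_pg_dump_command(dbname, outfile, host="localhost", user="postgres"):
    """Build a pg_dump command that writes a custom-format archive to `outfile`."""
    return [
        "pg_dump",
        f"--host={host}",
        f"--username={user}",
        f"--dbname={dbname}",
        "--format=custom",
        f"--file={outfile}",
    ]


def build_restore_listing_command(dump_file):
    """Build a pg_restore command that lists the archive contents without restoring."""
    return ["pg_restore", "--list", dump_file]


def has_free_space(directory, required_bytes):
    """Check that the backup target has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


def backup_looks_valid(dump_file):
    """Cheap sanity check: the dump exists and is not empty."""
    return os.path.isfile(dump_file) and os.path.getsize(dump_file) > 0
```

The returned lists can be passed to `subprocess.run(cmd, check=True)`. A non-empty file and a readable archive listing are only a first check; restoring into a scratch database remains the real test of a backup.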
6. Redis Troubleshooting Runbook
Focus areas:
- Redis availability
- Memory usage
- Persistence
- Connection errors
- Eviction policy
- Restart behavior
- Backup considerations
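Memory usage and eviction policy can both be read from the text returned by the Redis `INFO memory` command, which uses `key:value` lines with `#` section headers. The parser below is a minimal sketch; how the text is obtained (for example via `redis-cli`) is left out, and the helper names are assumptions.

```python
def parse_redis_info(text):
    """Parse INFO-style output into a dict of string keys and string values."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def memory_pressure(info):
    """Return used_memory / maxmemory, or None when maxmemory is 0 (no limit set)."""
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))
    if limit == 0:
        return None
    return used / limit
```

Reading `maxmemory_policy` alongside the ratio shows what happens at the limit: with `noeviction`, writes start failing instead of old keys being evicted.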
7. Deployment Failed but CI/CD Passed
Focus areas:
- Health checks
- Application logs
- Environment variables
- Database migrations
- Reverse proxy routing
- DNS
- SSL certificates
- Rollback decision
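Two of these checks are easy to script: probing a health endpoint and computing how many days remain on a certificate. The sketch below uses only the standard library. The health URL is whatever the application exposes, and the `not_after` string is assumed to come from the `notAfter` field of `ssl.SSLSocket.getpeercert()`.

```python
import ssl
import time
import urllib.request


def check_health(url, timeout=5.0):
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses,
        # since urllib's URLError and HTTPError derive from OSError.
        return False


def days_until_expiry(not_after, now=None):
    """Days until a certificate expires, given its notAfter string.

    `not_after` uses the format found in getpeercert(), for example
    "Jan  5 09:34:43 2018 GMT". `now` is a Unix timestamp; it defaults
    to the current time and can be overridden for testing.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

A pipeline can pass while these checks fail because CI usually validates the build and tests, not the routing, DNS, and certificates that sit in front of the running service.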
8. Rollback vs Hotfix Decision Runbook
Focus areas:
- Customer impact
- Risk assessment
- Recent changes
- Time to recovery
- Data impact
- Communication
- Validation after recovery
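The trade-off between these factors can be made explicit with a small decision helper. The rules below are a hypothetical heuristic written for illustration, not an established policy: the field names, the rule order, and the `recommend` function are all assumptions, and a real decision would also weigh customer impact and communication.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOptions:
    """Inputs to the decision (hypothetical fields chosen for this sketch)."""
    rollback_minutes: int      # estimated time to roll back
    hotfix_minutes: int        # estimated time to ship a fix forward
    rollback_loses_data: bool  # e.g. an irreversible migration has already run
    cause_understood: bool     # is the root cause known well enough to fix?


def recommend(options):
    """Return "rollback" or "hotfix" using a simple, assumed rule order."""
    if options.rollback_loses_data:
        return "hotfix"    # rolling back would damage data, so fix forward
    if not options.cause_understood:
        return "rollback"  # cannot fix what is not yet understood
    if options.hotfix_minutes < options.rollback_minutes:
        return "hotfix"
    return "rollback"
```

Putting data impact first reflects the idea that downtime is recoverable while lost data often is not. Whichever path is chosen, validation after recovery still applies.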
Operational Principles
A strong infrastructure engineer should be able to:
- Confirm the issue.
- Measure impact.
- Check recent changes.
- Review logs, metrics, and alerts.
- Isolate the failing layer.
- Apply the safest fix.
- Validate recovery.
- Document root cause.
- Add prevention steps.
- Improve automation and monitoring.