Steven Rosales

Troubleshooting Runbooks

This section collects practical runbooks for infrastructure operations: production troubleshooting, incident response, deployment failures, server issues, monitoring alerts, and recovery procedures.

The goal is to demonstrate structured operational thinking, not just command knowledge.


Runbooks

1. Production Incident Response Runbook

Focus areas:


2. Linux Server High CPU Troubleshooting

Focus areas:


3. Disk Full Troubleshooting

Focus areas:


4. Docker Compose Application Down

Focus areas:


5. PostgreSQL Backup and Restore Runbook

Focus areas:


6. Redis Troubleshooting Runbook

Focus areas:


7. Deployment Failed but CI/CD Passed

Focus areas:


8. Rollback vs Hotfix Decision Runbook

Focus areas:


Operational Principles

During an incident, a strong infrastructure engineer should be able to work through the following steps in order:

  1. Confirm the issue.
  2. Measure impact.
  3. Check recent changes.
  4. Review logs, metrics, and alerts.
  5. Isolate the failing layer.
  6. Apply the safest fix.
  7. Validate recovery.
  8. Document root cause.
  9. Add prevention steps.
  10. Improve automation and monitoring.
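
The ordering of the steps above can be made explicit in a small checklist helper. The sketch below is illustrative only and is not part of any runbook in this collection: the names `STEPS`, `IncidentChecklist`, `next_step`, `complete`, and `summary` are hypothetical, and rejecting out-of-order steps is one possible design choice, not a requirement stated here.

```python
from dataclasses import dataclass, field

# The ten steps from the list above, in the same order.
STEPS = [
    "Confirm the issue",
    "Measure impact",
    "Check recent changes",
    "Review logs, metrics, and alerts",
    "Isolate the failing layer",
    "Apply the safest fix",
    "Validate recovery",
    "Document root cause",
    "Add prevention steps",
    "Improve automation and monitoring",
]


@dataclass
class IncidentChecklist:
    """Hypothetical helper: stores one note per step and enforces the order."""

    title: str
    notes: list = field(default_factory=list)  # one note per completed step

    def next_step(self):
        """Return the next pending step, or None when every step is done."""
        done = len(self.notes)
        return STEPS[done] if done < len(STEPS) else None

    def complete(self, step, note):
        """Record a note for `step`; reject it if an earlier step is pending."""
        expected = self.next_step()
        if expected is None:
            raise ValueError("all steps are already complete")
        if step != expected:
            raise ValueError(f"expected {expected!r} before {step!r}")
        self.notes.append(note)

    def summary(self):
        """Render the checklist as plain text, marking unfinished steps."""
        lines = [f"Incident: {self.title}"]
        for i, step in enumerate(STEPS):
            status = self.notes[i] if i < len(self.notes) else "(pending)"
            lines.append(f"{i + 1}. {step}: {status}")
        return "\n".join(lines)


# Example usage with made-up incident details.
demo = IncidentChecklist("API returning 502s")
demo.complete("Confirm the issue", "reproduced from two regions")
demo.complete("Measure impact", "roughly a third of requests failing")
print(demo.summary())
```

Refusing to record "Apply the safest fix" before the earlier steps are done reflects the idea that a fix should follow confirmation, impact assessment, and isolation. A team that prefers a more flexible order could drop that check and keep only the notes.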