Runbook-as-code, two years in — IT Maintenance Hub

We started writing our runbooks as version-controlled markdown two years ago. The pattern stuck, we made some mistakes, and we'd like to spare you a few of them. This is not a sales post — these notes are for any IT team thinking about doing the same.

What we mean by runbook-as-code

Every runbook lives in a Git repo. Each runbook is a markdown file with a small YAML preamble — the alert it responds to, the team that owns it, the severity level, the linked dashboards. The body is plain prose with executable code blocks. Pull requests get reviewed before merge. The CI pipeline lints the YAML, checks the links, and rejects runbooks without a "rollback" section.
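
For illustration, a CI check along these lines might look like the sketch below. It is not the actual pipeline: it assumes the preamble is fenced by `---` lines, that fields are top-level `key: value` pairs, and that sections use markdown `#` headings. The field names and the `lint_runbook` function are hypothetical, and link checking is left out.

```python
import re

# Hypothetical required fields; the real schema is not shown in this post.
REQUIRED_FIELDS = {"alert", "owner", "severity", "dashboards"}


def lint_runbook(text: str) -> list[str]:
    """Return a list of problems found in one runbook file (empty means OK)."""
    problems = []

    # Assumed layout: preamble between two '---' lines at the top of the file.
    match = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
    if match is None:
        return ["missing YAML preamble"]
    preamble, body = match.groups()

    # Naive top-level 'key: value' scan; a real linter would use a YAML parser.
    keys = {
        line.split(":", 1)[0].strip()
        for line in preamble.splitlines()
        if ":" in line and not line.startswith((" ", "-", "#"))
    }
    for field in sorted(REQUIRED_FIELDS - keys):
        problems.append(f"missing preamble field: {field}")

    # Reject runbooks without a rollback section (assumed markdown heading).
    if not re.search(r"^#{1,6}\s+rollback\b", body, re.IGNORECASE | re.MULTILINE):
        problems.append("missing rollback section")

    return problems
```

A CI job could run this over every changed runbook and fail the build if any file returns a non-empty list.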

That's it. There's no clever tooling. The discipline is the whole product.

What we got right

Reviews force quality. The first month after we required PRs on every runbook change, the wiki-era cargo cult of pasted-together notes died quietly. People couldn't get away with "I think this command did the thing once" because someone else read it.
Searching is easy. Grep works. The whole repo is plain text. Compare that to the wiki, where search returned 47 partial matches across deleted pages and never surfaced the canonical one.
The runbook follows the alert. Our monitoring config references the runbook by path. When you click an alert, you land on the right markdown file, on the right line. The same path is in the post-mortem template.
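
Because alerts point at runbooks by path, a renamed or deleted file silently breaks the link. One way to guard against that, sketched here under the assumption that the alert-to-path mapping can be loaded as a plain dictionary (the `missing_runbooks` name and the example paths are hypothetical):

```python
from pathlib import Path


def missing_runbooks(alerts: dict[str, str], repo_root: Path) -> list[str]:
    """Return names of alerts whose runbook path does not exist in the repo.

    `alerts` maps a hypothetical alert name to a repo-relative runbook path.
    """
    return sorted(
        name
        for name, rel_path in alerts.items()
        if not (repo_root / rel_path).is_file()
    )
```

For example, `missing_runbooks({"DiskFull": "runbooks/disk-full.md"}, Path("."))` returns an empty list when the file is present in the working tree.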

What we got wrong

We let the YAML grow. Six months in, the preamble had thirty-something fields. Most of them weren't read by anything. We cut it back to seven required fields and made everything else optional. Lesson: schemas grow until you discipline them.
We required runbooks for things that didn't need them. "Every alert needs a runbook" is well-intentioned and wrong. Some alerts are obvious enough that a runbook insults the on-call's intelligence. We added a kind: notify-only preamble value for those cases and stopped pretending.
We treated runbooks as documentation. Runbooks aren't docs. They're scripts an exhausted human runs at 3am. Every time we caught ourselves writing a paragraph of background, we cut it. The "why" goes in a separate doc; the runbook tells you what to do, in order.
We trusted runbooks to age gracefully. They don't. We added a quarterly "is this still right?" review with a small Slack reminder bot. About 30% of runbooks fail the review. Almost all the failures are stale links or a tool that's been replaced.
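
The quarterly review depends on knowing which runbooks are overdue. A minimal sketch of that selection, assuming each runbook's last review date has already been read from its preamble and that "quarterly" means roughly 90 days (both the interval and the `due_for_review` name are assumptions, and the Slack reminder itself is not shown):

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed stand-in for "quarterly"


def due_for_review(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return runbook paths whose last review is older than the interval."""
    cutoff = today - REVIEW_INTERVAL
    return sorted(
        path for path, reviewed in last_reviewed.items() if reviewed < cutoff
    )
```

A reminder bot could post the returned list to the owning team's channel on a schedule.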

The format we landed on

After a year of iteration, our runbook template is roughly:

Preamble (YAML): alert name, severity, owning team, related dashboards, last review date.
Symptoms: one paragraph. What does the on-call see? What does the user see?
Quick check: three to five commands the on-call runs first. Each in a code block, copy-pasteable.
Likely causes: short list, ordered by frequency.
Mitigation: numbered steps, with explicit "if X, then Y" branches.
Rollback: required section. How do you undo whatever you just did?
Escalate when: explicit list of conditions that mean "wake up the next layer."
Post-incident: pointer to the post-mortem template; what to capture.
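
To keep new runbooks consistent with the template above, a small scaffolding helper can generate the empty skeleton. This is a sketch only: the preamble key names, the `## ` heading style, and the `new_runbook` function are assumptions, not the format used in the actual repo.

```python
# Section titles taken from the template above, in order.
SECTIONS = [
    "Symptoms",
    "Quick check",
    "Likely causes",
    "Mitigation",
    "Rollback",
    "Escalate when",
    "Post-incident",
]


def new_runbook(alert: str, severity: str, owner: str) -> str:
    """Render an empty runbook with the preamble and every template section."""
    lines = [
        "---",
        f"alert: {alert}",
        f"severity: {severity}",
        f"owner: {owner}",          # hypothetical key for the owning team
        "dashboards: []",
        "last_reviewed: null",      # filled in at the first review
        "---",
        "",
    ]
    for title in SECTIONS:
        lines += [f"## {title}", "", "TODO", ""]
    return "\n".join(lines)
```

Starting every runbook from the same skeleton means the Rollback section is always present, even if it still needs filling in before the PR can merge.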

The 3am test

The single most useful internal practice is what we call the 3am test. When a runbook is reviewed, the reviewer is asked: "Could you run this at 3am, half-asleep, on a phone, against a customer estate you've never seen?" If the answer is no, the runbook is too long, too clever, or assumes too much context. Most rewrites get triggered by that question.

What runbook-as-code is not

It's not a substitute for production knowledge. We've watched teams adopt the pattern and assume that the existence of runbooks would teach the team how the system works. It doesn't. Runbooks help you respond to known failure modes. The harder problem — building the kind of intuition that recognises a new failure mode — still has to come from people working on the system. The runbook lets that intuition scale; it doesn't replace it.

If you'd like to see how we handle the runbook reference inside the imhub product itself — alert routing, runbook attachment, the post-mortem flow — we're happy to walk through it.

Further reading: Google's SRE books, Brendan Gregg on observability, the original PagerDuty blog on incident response.