Maintenance

Keep the lights on. What matters, how to check it, how to fix it, when to check it, what changed.

What this is

You run things. A website, a database, a cron job, a server, some API keys, a few subscriptions, maybe a rental property. Each one can break. Most of the time they don't. But when they do, you want a runbook — not a troubleshooting session.

This is not the flywheel. The flywheel discovers friction you didn't know about. This guide handles friction you already know about — known systems, known failure modes, known fixes. The flywheel is R&D. This is operations.

The maintenance guide is a ranked inventory of everything you depend on, with a health check for each one, a runbook for when it breaks, and a schedule that makes sure you actually look. The output is a status dashboard: green, yellow, red. If everything's green, you never see it. If something's yellow, it shows up in the daily briefing. If something's red, it wakes you up.


The five layers

1. What matters

A ranked inventory of the systems you depend on. Not everything — just the ones where downtime costs you. Each system gets a priority tier: critical (breakage costs you money or sleep), important (breakage hurts but can wait a day), or nice-to-have (worth tracking, not worth an alarm).

2. How to check it

For each system, a health check. A command, a URL, a query — something an agent can run without interpretation.

The check is the runbook's smallest unit. If you can't write a check, you don't understand the system well enough to maintain it.
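The pass/fail contract can be sketched as a tiny wrapper. The check names and commands below are placeholders, not part of the guide's spec:

```shell
#!/bin/sh
# Sketch of the pass/fail contract: wrap any command so it reduces to one
# word an agent can parse without interpretation.
check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf '%s: PASS\n' "$name"
  else
    printf '%s: FAIL\n' "$name"
    return 1
  fi
}

# A real check might be: check "website" curl -fsS --max-time 10 https://...
check "root-fs"  test -d /        # → root-fs: PASS
check "tmp-dir"  test -d /tmp     # → tmp-dir: PASS
```

The nonzero exit code matters as much as the text: it lets a scheduler or agent branch on the result without parsing anything.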

3. How to fix it

For each known failure mode, a procedure. Not a troubleshooting guide — a script. "If the site is down: check nginx, check DNS, check the deploy. Here are the commands." The fix is a sequence of steps an agent can follow. If the agent can't fix it, it escalates — tells you what it found and what it tried.

Runbooks are per-system, per-failure-mode. One file each. The file describes the symptom, the diagnosis steps, and the fix. The agent reads the file and follows it. You don't need to remember how to restart postgres at 2am — the runbook remembers.
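A one-file runbook might be scaffolded like this. The filename, commands, and domain are illustrative, not prescribed by the guide:

```shell
#!/bin/sh
# Sketch of one runbook file: one system, one failure mode.
# Filename, commands, and domain below are illustrative placeholders.
mkdir -p ops/runbooks
cat > ops/runbooks/website-down.md <<'EOF'
# Runbook: website down

## Symptom
Health check returns non-200 or times out.

## Diagnosis
1. systemctl status nginx
2. dig +short example.com
3. Read ops/changelog.md: what changed since it was working?

## Fix
1. sudo systemctl restart nginx
2. If DNS is wrong, correct the record and wait out the TTL.

## Escalation
If neither step works, stop. Report what you found and what you tried.
EOF
```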

4. When to check it

A schedule. The cadence matches the priority: critical systems are checked daily, important systems weekly, nice-to-have systems monthly.

The maintenance agent runs on a cron. It produces a status report. Green rows are silent. Yellow rows show up in the daily briefing. Red rows notify you immediately. The schedule is self-managed — if a system keeps going yellow, the agent proposes upgrading its check frequency.
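The cadence can be encoded directly in cron. These are hypothetical crontab entries; the script path and tier arguments are assumptions about how you split checks by priority:

```shell
# Hypothetical crontab entries. The script path and the tier argument
# are assumptions -- adjust to your own layout.
0 6 * * *   /path/to/ops/check.sh critical       # critical: daily
0 6 * * 1   /path/to/ops/check.sh important      # important: weekly
0 6 1 * *   /path/to/ops/check.sh nice-to-have   # nice-to-have: monthly
```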

5. What changed

A changelog. Every time a system changes — new deploy, config update, dependency upgrade, cert renewal — the maintenance guide logs it. When something breaks, the changelog is the first place you look. "What changed since it was working?"

The changelog is also how you catch drift. If nothing has changed in three months, either the system is perfectly stable or nobody's watching. The maintenance agent flags stale systems for review.
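Appending an entry can be a one-liner. The line format here is an assumption; any dated, greppable format works:

```shell
#!/bin/sh
# Sketch of an append-only changelog entry. The line format is an
# assumption; any dated, greppable format works.
mkdir -p ops
log_change() {
  printf '%s\n' "- $(date -u +%Y-%m-%d) $1" >> ops/changelog.md
}

# Hypothetical entries:
log_change "nginx: upgraded 1.24 -> 1.26"
log_change "certbot: renewal hook added"
```

Because the log is append-only, "what changed since it was working?" is a single `tail` or `grep` away.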


The dashboard

The maintenance agent produces a status file after each run. Here's what it looks like:

Example status output:

  [green]   Website · 200 OK · checked 6:01am
  [green]   Database · responsive · 142ms · checked 6:01am
  [yellow]  SSL certificate · expires in 12 days · auto-renew pending
  [green]   Backups · latest: 2h ago · 3.2GB
  [red]     Cron: nightly sync · last ran 9 days ago · expected: daily
  [green]   DNS · resolves correctly · TTL 3600

The dashboard is a file, not an app. Static HTML, regenerated on each check run. Archived by date. The daily briefing reads it and surfaces anything non-green.
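Regenerating that static file can be sketched in a few lines. The row contents and file names here are illustrative; in practice each row comes from a live health check:

```shell
#!/bin/sh
# Sketch of regenerating the static dashboard. Row contents are
# illustrative placeholders, not output of real checks.
mkdir -p ops
{
  echo '<html><body><table>'
  row() { printf '<tr class="%s"><td>%s</td><td>%s</td></tr>\n' "$1" "$2" "$3"; }
  row green  "Website"  "200 OK"
  row yellow "SSL cert" "expires in 12 days"
  echo '</table></body></html>'
} > ops/status.html
cp ops/status.html "ops/status-$(date +%F).html"   # archive by date
```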


The output structure

Everything lives in one folder. The agent creates it and maintains it across runs.

ops/                The maintenance folder. Everything lives here.
  inventory.md      Ranked list of systems with priority tiers, health check commands, and owner notes.
  runbooks/         One file per system per failure mode. Symptom, diagnosis, fix steps.
  status.html       Latest dashboard. Green/yellow/red for each system. Regenerated on each run.
  changelog.md      What changed, when, and who/what changed it. Append-only log.
  incidents/        One file per incident. What broke, when, how it was fixed, time to resolution.
  proposals/        One file per proposed maintenance change. Each proposal includes the fix and the lesson it teaches.
  schedule.md       Check cadence for each system. Managed by the maintenance agent itself.
  expirations.md    Upcoming deadlines: certs, domains, API keys, subscriptions, contracts. Sorted by date.

Talking to other systems

Maintenance doesn't exist in isolation. It talks to the other guides through the folder convention:

  • Flywheel: red checks that can't be auto-fixed write observations to flywheel/observations/.
  • Daily briefing: reads status.html and surfaces anything non-green in the morning.

Each system has a self-reflection skill — a periodic check that runs with full local context. Maintenance knows its own check history, failure rates, and false positives. It logs friction about its own operation to the flywheel. The systems improve each other.


Common systems to track

  • Web server (critical): nginx, Apache, Caddy. Is it running? Is it responding? Is the config valid?
  • Database (critical): Postgres, SQLite, MySQL. Can you connect? Can you query? Are backups running?
  • DNS (critical): Records pointing correctly. Registrar renewed. DNSSEC valid. TTL reasonable.
  • SSL certificates (important): Expiry dates. Auto-renewal status. Test that renewal actually works.
  • Backups (important): Running on schedule. Restorable (test it). Size trending normally.
  • Cron jobs (important): Still running. Not silently failing. Output logged. Exit codes checked.
  • API keys (important): Expiry dates. Who has access. Rotation schedule. Stored securely.
  • Domains (important): Renewal dates. Auto-renew enabled. Registrar access verified.
  • Subscriptions (nice-to-have): What you're paying for. Is any of it unused? Renewal dates. Cost changes.
  • Dependencies (nice-to-have): Major version updates. Security advisories. End-of-life schedules.
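A backup check from that list can be sketched as a freshness test: pass only if something in the backup directory changed recently. The directory and threshold are placeholders:

```shell
#!/bin/sh
# Sketch of a backup-freshness check: pass only if something in the backup
# directory was modified within the last N minutes. Paths are placeholders.
backup_fresh() {
  dir="$1"; max_min="$2"
  [ -n "$(find "$dir" -type f -mmin "-$max_min" 2>/dev/null | head -n 1)" ]
}

# Hypothetical usage, with a 25-hour window for a daily backup:
# backup_fresh /var/backups 1500 || echo "backups: FAIL"
```

Note this only proves the backup ran, not that it's restorable; restore tests still need their own check.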

The handoff

Instructions for you, the AI agent

Build a maintenance system with the user, one layer at a time. Start with inventory, end with automation. Every step has an approval gate.

  1. Interview. Ask the user: What do you run? What breaks? What worries you? What wakes you up at 2am? What would ruin your day if it went down? Don't assume — some people run servers, some run spreadsheets. Start with whatever they have.
  2. Build the inventory. Create ops/inventory.md. List each system with: name, what it does, priority tier (critical/important/nice-to-have), how to access it, and who owns it. Rank by priority. Review together.
  3. Write health checks. For each system in the inventory, write a check command. Something you can run right now that returns pass/fail. Test each check live — run it and show the result. If a check fails, that's your first finding. Write checks to ops/inventory.md alongside each system entry. Verify each check works.
  4. Run the first status check. Execute every health check. Generate ops/status.html with green/yellow/red results. This is your baseline. If anything is already yellow or red, flag it. Show the dashboard to the user.
  5. Write runbooks. For each critical and important system, ask: "What goes wrong with this?" Write one runbook per failure mode to ops/runbooks/. Each runbook has: symptom (how you'd notice), diagnosis (commands to run), fix (steps to resolve), escalation (what to do if the fix doesn't work). Review each runbook.
  6. Build the expirations list. Scan for anything with a deadline: SSL certs, domain renewals, API key expirations, subscription renewals, contracts. Write to ops/expirations.md sorted by date. Highlight anything within 30 days.
  7. Set up the schedule. Create ops/schedule.md. Critical systems: daily. Important: weekly. Nice-to-have: monthly. Suggest automation — a cron job that runs health checks and generates the status dashboard. If the user has a daily briefing, connect the status output to it.
  8. Start the changelog. Create ops/changelog.md. Log today as the first entry: "Maintenance system initialized." Going forward, any change to any tracked system gets an entry.
  9. Connect to other systems. If the user has a flywheel, set up the observation bridge: red checks that can't be auto-fixed write to flywheel/observations/. If they have a daily briefing, feed the status dashboard into it. If they have neither, the maintenance system works standalone.
  10. Self-assess. Review the maintenance system's own performance. Are health checks producing false positives? Are runbooks accurate? Are any systems missing from the inventory? Write self-improvement proposals to ops/proposals/ and log observations to the flywheel if one exists. Every proposal should include a lesson: what this failure taught you that future systems should inherit.
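The deadline scan in step 6 can be sketched as simple date arithmetic. GNU date's -d flag is assumed (Linux); on macOS/BSD you'd use date -j -f instead, and the names and entries below are hypothetical:

```shell
#!/bin/sh
# Sketch of the step-6 deadline scan: days until a date, flag anything
# within 30 days. Assumes GNU date (-d); function names are hypothetical.
days_until() {
  echo $(( ($(date -d "$1" +%s) - $(date +%s)) / 86400 ))
}

flag_if_soon() {
  d=$(days_until "$2")
  [ "$d" -le 30 ] && printf 'SOON  %-20s %s days\n' "$1" "$d"
}

# Hypothetical entry; real dates come from ops/expirations.md:
flag_if_soon "example.com TLS" "$(date -d '+12 days' +%Y-%m-%d)"
```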

Key rules:
  • Health checks must be testable right now. Run each one and show the result.
  • Runbooks are scripts, not essays. An agent should be able to follow them mechanically.
  • Don't touch production systems without explicit approval. If a check requires SSH or elevated access, confirm access before attempting.
  • The maintenance system starts simple (inventory + checks) and grows (runbooks + schedule + automation). Don't overwhelm the user with ten runbooks on day one; start with the critical systems and expand.
  • Every maintenance proposal should have a lesson in it: not just what to change, but what this incident teaches about how to build or operate better next time.


Patterns at work
  • Fix Your Papercuts — maintenance prevents known papercuts from recurring. The runbook is a papercut fix, written down permanently.
  • Solved Problems Stay Solved — each runbook is a solved problem. You debug it once, write it down, never debug it again.
  • Don't Ask Me to Track It — automated checks instead of manual monitoring. The system watches itself.
  • The Steering File — the inventory is a steering file for operations. It tells the agent what to care about.
  • Memory Is Files — the changelog is memory. Incident reports are memory. The maintenance system remembers what broke and how it was fixed.
  • The Folder Is the Interface — systems communicate by writing files to each other's folders. The protocol is markdown in the right place.
Related guides
  • Flywheel — discovers new friction; graduates findings into maintenance when they become permanent ops
  • Daily Briefing — surfaces maintenance alerts in the morning briefing
  • Wall of Data — the infrastructure that collects data from the systems you maintain
  • Zero to Dev — start here if you haven't set up yet
External references
  • Google SRE Book — the industry standard on reliability engineering (this guide is the personal-scale version)
  • Uptime Kuma — self-hosted monitoring tool; pairs well with the health check pattern

How to start

Navigate to your project root, create the ops folder, and hand it to an agent.

cd /path/to/your/project-root && mkdir -p ops/runbooks ops/incidents ops/proposals
Mac/Linux example. Replace the path with your actual project root.
cd "$env:USERPROFILE\<project-root>"; mkdir ops -Force; mkdir ops\runbooks, ops\incidents, ops\proposals -Force
PowerShell variant. Replace <project-root> with the folder name you actually use.
claude
Or use codex, gemini, or whichever agent you prefer.
Follow the instructions on this page. If anything looks unsafe or beyond what I'd reasonably want, tell me before doing it.