> bits_and_friends _

$ cat /blog/2026-05-27-ki-im-it-betrieb-monitoring-runbooks-dokumentation.en.md

AI in IT operations — monitoring, runbooks and documentation, rethought


The old debate “will AI replace the admin?” misses the point. The productive question is: where does an experienced administrator spend time today on work below their skill level, and which of those activities can an AI take over with adequate quality?

Three areas are currently especially fruitful.

Classifying monitoring alerts

Monitoring systems produce alerts. Lots of alerts. A large share of them are not real incidents but consequences of known patterns — short spikes, scheduled jobs, network hiccups, or the notorious cascading follow-on alerts after a single root cause.

AI helps on three levels:

  • Deduplication: related alerts are grouped into a single incident. A database outage produces not ten separate tickets but one summary ticket with the ten follow-on symptoms as detail.
  • Classification: is this alert a known pattern (with an associated standard remedy) or something new? Known patterns are enriched with a pointer to the right runbook; new ones go to human review.
  • Anomaly detection: instead of fixed thresholds (CPU >90 per cent triggers an alert), deviation from typical patterns is judged — what is normal on Wednesday evening may be unusual on Sunday morning.
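
The anomaly-detection idea can be sketched in a few lines. This is a minimal illustration, not a description of any particular monitoring product: it assumes historical samples are available as (timestamp, value) pairs, uses a simple z-score per weekday and hour in place of a learned model, and the names `build_baseline`, `is_anomalous` and `z_threshold` are made up for this example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def build_baseline(samples):
    """Group historical (timestamp, value) samples by (weekday, hour)."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    # Keep only buckets with enough history to estimate a spread.
    return {key: (mean(vals), stdev(vals))
            for key, vals in buckets.items() if len(vals) >= 2}


def is_anomalous(baseline, ts, value, z_threshold=3.0):
    """True when value deviates strongly from what is typical for this slot."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # no history for this slot: leave it to other rules
    mu, sigma = baseline[key]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Assumed history: busy Wednesday evenings, quiet Sunday mornings (CPU in per cent).
history = [
    (datetime(2026, 5, 6, 20), 84), (datetime(2026, 5, 13, 20), 86),
    (datetime(2026, 5, 20, 20), 85),
    (datetime(2026, 5, 3, 8), 9), (datetime(2026, 5, 10, 8), 11),
    (datetime(2026, 5, 17, 8), 10),
]
baseline = build_baseline(history)
wednesday_evening = is_anomalous(baseline, datetime(2026, 5, 27, 20), 86)
sunday_morning = is_anomalous(baseline, datetime(2026, 5, 24, 8), 86)
```

With this assumed history, the same CPU reading is unremarkable on Wednesday evening and flagged on Sunday morning, which a fixed 90 per cent threshold would not distinguish.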

The result: from 200 alerts a day, 20 cases emerge, of which 15 come with a proposed fix and 5 go to human review. That is not less information — it is less noise.
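
How many alerts collapse into few cases can be shown with a small sketch. It assumes each alert already carries a grouping key (here the hypothetical field `root_cause_hint`, which in practice would come from topology data or correlation), and the `KNOWN_PATTERNS` table and runbook path are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    host: str
    check: str
    root_cause_hint: str  # hypothetical: the upstream component suspected to have failed


@dataclass
class Incident:
    key: str
    alerts: list = field(default_factory=list)
    runbook: Optional[str] = None


# Hypothetical mapping of known patterns to their runbooks.
KNOWN_PATTERNS = {"db-primary": "runbooks/db-primary-outage.md"}


def triage(alerts):
    """Group alerts into incidents, then split known patterns from new ones."""
    incidents = {}
    for alert in alerts:
        key = alert.root_cause_hint
        incidents.setdefault(key, Incident(key=key)).alerts.append(alert)
    for incident in incidents.values():
        incident.runbook = KNOWN_PATTERNS.get(incident.key)
    with_fix = [i for i in incidents.values() if i.runbook]
    for_review = [i for i in incidents.values() if not i.runbook]
    return with_fix, for_review


# Ten follow-on symptoms of one database outage, plus one unknown alert.
alerts = [Alert(f"app-{n}", "db-connection", "db-primary") for n in range(10)]
alerts.append(Alert("sw-03", "port-flap", "unknown-switch"))
with_fix, for_review = triage(alerts)
```

Here eleven alerts become two cases: one incident with ten symptoms and a runbook pointer, and one incident that goes to human review.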

Suggesting runbooks instead of searching for them

A runbook is the written instruction for how a specific incident is resolved. Good runbooks exist in every mature operation. The problem is not their quality, but their findability. When an incident strikes at three in the morning, the on-call engineer often spends twenty minutes finding the right runbook.

AI support shortens this step dramatically:

  • Based on alert texts, affected systems and historical incidents, candidate runbooks are suggested.
  • With a clear match, a single runbook opens directly.
  • With multiple candidates, the two or three most likely are shown with reasoning.
  • Where useful, commands are prepared from the runbook’s steps, so that the human only has to release them.
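
The selection logic in this list can be sketched as follows. The example deliberately swaps the AI model for plain token overlap (Jaccard similarity) so that it stays self-contained; the function name `suggest_runbooks`, the `clear_margin` parameter and the sample runbooks are assumptions for illustration only. The shared terms stand in for the "reasoning" shown to the engineer.

```python
import re


def tokens(text):
    """Lower-cased word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def suggest_runbooks(alert_text, runbooks, clear_margin=0.3, max_candidates=3):
    """Return [(title, score, shared_terms)], one entry on a clear match."""
    query = tokens(alert_text)
    scored = []
    for title, description in runbooks.items():
        doc = tokens(title + " " + description)
        shared = query & doc
        if shared:
            score = len(shared) / len(query | doc)  # Jaccard similarity
            scored.append((title, score, sorted(shared)))
    scored.sort(key=lambda item: item[1], reverse=True)
    if not scored:
        return []
    # Clear match: the best candidate leads the runner-up by a wide margin.
    if len(scored) == 1 or scored[0][1] - scored[1][1] >= clear_margin:
        return scored[:1]
    return scored[:max_candidates]


# Hypothetical runbook index: title -> short description.
RUNBOOKS = {
    "Postgres replication lag": "replica lag postgres standby wal",
    "Disk full on database host": "disk full cleanup postgres logs",
    "TLS certificate expiry": "certificate renew tls",
}
clear = suggest_runbooks("postgres replica lag high on standby", RUNBOOKS)
ambiguous = suggest_runbooks("postgres disk lag", RUNBOOKS)
```

In the first call one runbook clearly dominates and would open directly; in the second, two candidates are returned with the terms that matched.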

Important: the AI does not replace the runbook and does not execute it autonomously. It searches, suggests, prepares the execution. The human pulls the trigger.
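
One way to enforce this separation in code is an explicit release step between preparing and running a command. The following is a sketch under assumptions: the names `PreparedCommand`, `prepare`, `release` and `execute` are hypothetical, and the runner is injected so that the example itself never touches a real system.

```python
import shlex
from dataclasses import dataclass


@dataclass
class PreparedCommand:
    argv: list             # command split into arguments, never a shell string
    source_step: str       # the runbook step this command was generated from
    approved_by: str = ""  # stays empty until a human releases the command


def prepare(step_template, **params):
    """Fill a runbook step template. Nothing is executed here."""
    quoted = {name: shlex.quote(str(value)) for name, value in params.items()}
    argv = shlex.split(step_template.format(**quoted))
    return PreparedCommand(argv=argv, source_step=step_template)


def release(prepared, approver, answer):
    """Record a human approval; anything but an explicit 'yes' is a refusal."""
    if answer.strip().lower() != "yes":
        raise PermissionError("command was not released")
    prepared.approved_by = approver
    return prepared


def execute(prepared, runner):
    """Hand the command to a runner (e.g. subprocess.run) only after release."""
    if not prepared.approved_by:
        raise PermissionError("no human approval recorded")
    return runner(prepared.argv)


command = prepare("systemctl restart {service}", service="postgresql")
```

Calling `execute(command, runner)` before `release(command, "on-call engineer", "yes")` raises `PermissionError`, so a prepared command cannot run by accident.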

Keeping documentation current

If almost every IT department is honest with itself, its weak spot is keeping documentation up to date. Configs change. Systems migrate. Responsibilities shift. Documentation regularly trails three to six months behind, and some pages have been wrong for years.

Here AI helps on two paths:

  • Spotting drift. An AI agent regularly compares the documented target state (e.g. “Server X runs Ubuntu 22.04 with Postgres 15”) with the actual state from monitoring. Deviations are reported — as a proposal to update the doc or correct the actual state.
  • Generating documentation from cases. When an incident is resolved and documented, the AI can turn the resolution into a draft for a new runbook or supplementary wiki page. The human reviewer trims, corrects, augments — the tedious first draft is already there.
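
The drift check from the first point comes down to a comparison of two data sets. A minimal sketch, assuming both the documented target state and the observed state are available as dictionaries keyed by host; the function name `find_drift`, the attribute names and the host `server-y` are illustrative.

```python
def find_drift(documented, actual):
    """Return (host, attribute, documented_value, observed_value) per deviation."""
    findings = []
    for host, doc_attrs in documented.items():
        seen = actual.get(host)
        if seen is None:
            findings.append((host, "host", "documented", "not observed"))
            continue
        for attr, expected in doc_attrs.items():
            observed = seen.get(attr, "unknown")
            if observed != expected:
                findings.append((host, attr, expected, observed))
    # Hosts that monitoring sees but the documentation does not mention.
    for host in sorted(actual.keys() - documented.keys()):
        findings.append((host, "host", "undocumented", "observed"))
    return findings


documented = {"server-x": {"os": "Ubuntu 22.04", "postgres": "15"}}
actual = {
    "server-x": {"os": "Ubuntu 22.04", "postgres": "16"},
    "server-y": {"os": "Ubuntu 24.04"},
}
drift = find_drift(documented, actual)
```

Each finding is only a proposal: whether the documentation or the system is wrong remains a decision for the reviewer.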

The result is not perfect documentation, but documentation that no longer goes stale and instead moves with operations.

Where the human must stay

There are areas in IT operations where AI must not act autonomously today — and should not for the foreseeable future:

  • Actions on production systems (restart, failover, configuration change): the human decides. The AI can prepare the action and carry it out once it has been released, but it does not decide.
  • Security-relevant assessments (is this an attack? is this a false positive?): classification can be supported; assessment stays human.
  • Decisions about people (permissions, access denial, escalation approvals): for both legal and operational reasons, a human stays in the loop.

This line is not drawn out of mistrust, but out of experience. It will move — but it will only move safely if the steps before it work reliably.

What remains in the end

An IT operation with AI support does not have fewer administrators, but different ones. The activity shifts from reacting to symptoms to designing patterns: how do we categorise alerts? Which runbooks do we need? How do we maintain our documentation? This work is more demanding than what it replaces, and it scales with the team’s experience.

AI takes the mechanics. What remains is the engineering judgement.