Server healing agent that monitors, diagnoses, and self-repairs without human input.
Server incidents at 3am require on-call engineers who are often slow to respond. Most alerts are noise. The ones that matter require context that takes time to gather.
Implemented MAPE-K (Monitor, Analyze, Plan, Execute, Knowledge) — the same feedback loop used in autonomic computing research — as a local, LLM-driven healing agent.
Monitor agent continuously tails logs and system metrics. Analyze phase uses local LLM (llama-cpp-python with CUDA) to classify incident severity and root cause. Plan phase generates a remediation script. Execute phase runs it with a human-in-the-loop confirmation gate for destructive actions. All knowledge persisted locally.
Reduced mean time to diagnosis from manual 15-minute investigation to under 90 seconds. Zero external API calls — entirely local inference.
