Air-gapped Linux daemon that heals servers through semantic reasoning — not blind thresholds.
Kubernetes probes kill processes when RAM hits a threshold. Nagios fires alerts engineers silence at 3AM. Static bash scripts restart services mid-transaction. These tools are syntactic — they read numbers. They do not reason. The result: $9,000/minute average cost for large enterprise outages, 25 unplanned incidents per month in the average industrial plant.
Built the first air-gapped, semantically-aware server healing daemon that classifies process intent — not just resource state. Implemented a strict 7-stage MAPE-K closed-loop: DETECT → ISOLATE → PROFILE → EXTRACT → REASON → EXECUTE → AUDIT. Every stage has a defined interface; the AI only sees fully-fused context, never raw metrics.
DETECT: 60-second rolling RAM window triggers on growth trajectory before crisis (fires at 60% climbing, not 85% already critical). ISOLATE: Whitelist guardrail immunizes systemd, sshd, and critical processes. PROFILE: 24-hour behavioral fingerprinting builds per-process P95 baselines — an ML training job hitting 78% RAM is normal; an HTTP server hitting 55% for the first time is not. EXTRACT: Multi-signal context fusion assembles RAM, CPU, file handles, thread count, network connections, full process ancestry tree, journalctl logs, and RAG-retrieved similar past incidents. REASON: llama3.2:3b via Ollama (CUDA) classifies intent across 5 categories: WORKING_AS_INTENDED, DEGRADED_BUT_FUNCTIONAL, LEAKING, UNDER_ATTACK, UNKNOWN — making it the first self-healing daemon with rudimentary intrusion detection built in. EXECUTE: 4-step graceful degradation ladder (SIGTERM → cgroup cap → SIGSTOP → SIGKILL) with a pre-kill forensic gcore dump for post-mortem analysis. Cascade Correlation halts individual kill decisions when 2+ processes spike within 60 seconds. AUDIT: Append-only JSONL audit log directly ingestible by SIEM systems (Splunk, Elastic, Wazuh).
Mean time to detection reduced from ~15 minutes (on-call engineer) to under 30 seconds. Confidence gate below 0.70 escalates rather than kills — false positive rate under 5%. Pre-kill forensic memory dumps enable post-mortem root cause analysis. RAG memory layer means the daemon improves with every incident it handles. Zero bytes of operational data leave the machine.
