
AI Agents for Infrastructure Operations

How to integrate LLM-powered agents into monitoring, incident response, and automated ticket management workflows.

2024-03-22


Large Language Models have moved beyond chatbots. In infrastructure operations, AI agents can dramatically reduce mean time to resolution (MTTR), automate routine tasks, and augment on-call engineers. Here's how I've been integrating LLM-powered agents into operational workflows.

The Agent Architecture

An effective ops agent needs three capabilities: the ability to observe (read metrics, logs, alerts), reason (analyze patterns, correlate events), and act (execute runbooks, create tickets, send notifications).

from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "query_metrics",
            "description": "Query Prometheus metrics for a given service",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "PromQL query"},
                    "duration": {"type": "string", "description": "Time range, e.g. '1h'"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_logs",
            "description": "Search application logs in Loki",
            "parameters": {
                "type": "object",
                "properties": {
                    "service": {"type": "string"},
                    "query": {"type": "string"},
                    "severity": {"type": "string", "enum": ["error", "warn", "info"]}
                },
                "required": ["service", "query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "execute_runbook",
            "description": "Execute a predefined runbook action",
            "parameters": {
                "type": "object",
                "properties": {
                    "runbook_id": {"type": "string"},
                    "parameters": {"type": "object"}
                },
                "required": ["runbook_id"]
            }
        }
    }
]

def handle_alert(alert: dict) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are an SRE agent. Analyze alerts, investigate root causes, and execute runbooks when appropriate."
        },
        {
            "role": "user",
            "content": f"Alert fired: {json.dumps(alert)}"
        }
    ]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )

    # Agent processes tool calls iteratively
    return process_agent_loop(response, messages)
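The snippet above leaves `process_agent_loop` undefined. One possible shape for it, sketched here with an explicit `handlers` map (an assumption, not part of the original code), is to keep feeding tool results back to the model until it stops requesting tools:

```python
import json

def dispatch_tool_call(name: str, arguments: str, handlers: dict) -> str:
    """Route a model-requested tool call to its Python handler and
    return the result as a JSON string for the follow-up message."""
    handler = handlers.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(handler(**json.loads(arguments)))

def process_agent_loop(response, messages, handlers, max_iterations=10):
    """Iterate until the model returns a final answer or we hit the cap."""
    for _ in range(max_iterations):
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # final analysis, no more tools needed
        messages.append(message)
        for call in message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": dispatch_tool_call(
                    call.function.name, call.function.arguments, handlers
                ),
            })
        # client and tools come from the snippet above
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
    return "Max iterations reached; escalating to a human."
```

The iteration cap matters: without it, a confused model can loop on tool calls indefinitely.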

Incident Response Automation

When an alert fires, the agent follows a structured investigation:

  1. Gather context: Query relevant metrics and logs around the alert timestamp
  2. Correlate events: Check if related services are also affected
  3. Identify root cause: Match patterns against known issues
  4. Execute remediation: Run the appropriate runbook if confidence is high
  5. Document findings: Create an incident ticket with full context

The key is the confidence threshold. For well-understood issues (disk full, pod crash loop, certificate expiry), the agent can act autonomously. For novel issues, it escalates to a human with a detailed analysis.
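The gate can be as simple as an allowlist plus a threshold. In this sketch, the runbook names and the 0.8 cutoff are illustrative assumptions, not values from production:

```python
# Runbooks the agent may run without a human (illustrative names)
AUTONOMOUS_RUNBOOKS = {"disk-cleanup", "restart-crashloop-pod", "renew-certificate"}
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff, tune per environment

def decide_action(runbook_id: str, confidence: float) -> str:
    """Execute autonomously only for allowlisted runbooks at high confidence;
    everything else escalates with the agent's analysis attached."""
    if runbook_id in AUTONOMOUS_RUNBOOKS and confidence >= CONFIDENCE_THRESHOLD:
        return "execute"
    return "escalate"
```

Novel runbook IDs escalate regardless of confidence, which keeps the autonomous surface area explicit and reviewable.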

Intelligent Alert Routing

Not all alerts are equal. An AI agent can triage alerts based on severity, blast radius, and service ownership:

def triage_alert(alert: dict) -> dict:
    """Analyze alert and determine routing."""
    analysis = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Triage this alert. Respond with JSON:
            {
                "severity": "critical|high|medium|low",
                "team": "platform|backend|frontend|data",
                "suggested_runbook": "runbook-id or null",
                "summary": "one-line summary"
            }"""
        }, {
            "role": "user",
            "content": json.dumps(alert)
        }],
        response_format={"type": "json_object"}
    )

    return json.loads(analysis.choices[0].message.content)

This reduces alert fatigue by routing each alert to the right person with the right context.
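Turning the triage result into a routing decision is then a small, deterministic step. The channel names below are hypothetical; the team and severity values match the triage schema above:

```python
# Hypothetical team-to-channel map; keys match the triage JSON schema
TEAM_CHANNELS = {
    "platform": "#oncall-platform",
    "backend": "#oncall-backend",
    "frontend": "#oncall-frontend",
    "data": "#oncall-data",
}

def route_alert(triage: dict) -> dict:
    """Decide where an alert goes and whether it pages a human."""
    return {
        "channel": TEAM_CHANNELS.get(triage["team"], "#oncall-general"),
        "page": triage["severity"] in ("critical", "high"),
        "summary": triage["summary"],
    }
```

Keeping the routing logic outside the LLM call means a malformed or surprising triage response can only pick from channels you defined, never invent a destination.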

Automated Ticket Management

One of the highest-ROI applications is automated ticket creation and enrichment. When the agent detects an issue, it creates a ticket with:

  • Alert details and timeline
  • Relevant metrics graphs (captured as links)
  • Log excerpts showing the error pattern
  • Similar past incidents and their resolutions
  • Suggested remediation steps

This turns a cryptic alert into an actionable ticket, saving 10-15 minutes per incident.
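The enrichment sections above can be assembled into a ticket body mechanically. This is a minimal sketch; the field names (`metrics_link`, `log_excerpt`, and so on) are assumptions about what the agent's investigation returns:

```python
def build_ticket_body(alert: dict, findings: dict) -> str:
    """Assemble a markdown ticket description from investigation findings."""
    sections = [
        f"## Alert\n{alert['name']} fired at {alert['timestamp']}",
        f"## Metrics\n{findings.get('metrics_link', 'n/a')}",
        f"## Log excerpt\n{findings.get('log_excerpt', 'n/a')}",
        f"## Similar incidents\n{findings.get('similar_incidents', 'none found')}",
        f"## Suggested remediation\n{findings.get('remediation', 'escalate to on-call')}",
    ]
    return "\n\n".join(sections)
```

Defaulting missing sections to "n/a" rather than omitting them keeps the ticket layout consistent, so responders always know where to look.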

Safety Guardrails

AI agents in production need strict guardrails:

  • Read-only by default: Agents should only have write access for specific, approved runbooks
  • Human-in-the-loop: Critical actions (restart production service, scale up infrastructure) require approval
  • Audit trail: Every action the agent takes is logged and reviewable
  • Kill switch: Ability to instantly disable the agent if it misbehaves
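The four guardrails above compose naturally into a single wrapper around any write action. A minimal sketch, with assumed runbook names and an in-memory audit log standing in for a real store:

```python
import datetime

# Actions that always require human approval (illustrative names)
APPROVAL_REQUIRED = {"restart-production-service", "scale-up-infra"}
AUDIT_LOG: list = []                 # stand-in for a durable audit store
KILL_SWITCH = {"enabled": False}     # flip to instantly disable all actions

def guarded_execute(runbook_id: str, params: dict, approved: bool = False) -> str:
    """Apply kill switch, approval, and audit checks before any write."""
    entry = {
        "runbook": runbook_id,
        "params": params,
        "ts": datetime.datetime.utcnow().isoformat(),
    }
    if KILL_SWITCH["enabled"]:
        entry["outcome"] = "blocked-kill-switch"
    elif runbook_id in APPROVAL_REQUIRED and not approved:
        entry["outcome"] = "awaiting-approval"
    else:
        entry["outcome"] = "executed"
    AUDIT_LOG.append(entry)  # every decision is logged, including refusals
    return entry["outcome"]
```

Note that blocked and pending actions are audited too: the trail should record what the agent *tried* to do, not just what it did.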

Results

In my deployments, AI-assisted operations have delivered:

  • 40% reduction in MTTR for known issue patterns
  • 60% fewer pages escalated to humans during off-hours
  • 80% of routine tickets auto-enriched with investigation context

The goal isn't to replace on-call engineers — it's to give them a highly capable assistant that handles the routine so they can focus on the complex.
