
Building AI Agents for Infrastructure

Learn the patterns for building specialized infrastructure agents that know YOUR systems

By Joel Zamboni

Who this is for: DevOps engineers, platform teams, CTOs evaluating AI

ChatGPT knows about DevOps. It can explain Kubernetes concepts, suggest Terraform patterns, and help debug error messages.

But it can’t tell you why YOUR pods are crashing at 3am.

This guide teaches you the patterns for building AI agents that go beyond generic advice—agents that understand your specific infrastructure and can take meaningful action.

Why Vertical Agents Beat Generic AI

The Problem with Generic AI

Ask ChatGPT “How do I fix a Kubernetes pod crash loop?” and you’ll get a comprehensive answer covering all possible causes. That’s helpful for learning, but not for solving YOUR problem at 2am.

Generic AI gives you:

  • Broad explanations that apply to everyone
  • Suggestions that may not fit your architecture
  • No context about your specific systems
  • Potential for hallucination on specifics

The Vertical Agent Approach

Vertical agents flip this model:

Aspect        | Generic AI     | Vertical Agent
Scope         | Everything     | One domain
Context       | None           | Your infrastructure
Output        | Generic advice | Actionable steps
Hallucination | Higher risk    | Lower (constrained domain)

A monitoring agent that knows YOUR infrastructure can tell you:

  • “Pod X is crash looping because the database connection pool is exhausted”
  • “This happened twice last month after the traffic spike on Tuesday”
  • “Runbook 7.3 addresses this—should I apply it?”

Webera’s Agent Philosophy

We don’t build one AI that does everything. We build 8 specialists that each do one thing excellently:

  • Sentinel — Monitoring and observability
  • Guardian — Security and compliance
  • Optimizer — Cost and performance
  • Conductor — CI/CD and deployment
  • Dispatcher — Alerting and routing
  • Navigator — Discovery and documentation
  • Keeper — Secrets and access management
  • Warden — Audit and governance

Each agent is a deep expert in its domain, with context about YOUR systems.


Part 1: Anatomy of an Infrastructure Agent

Every effective agent needs three things: Why, How, and What.

The Why: Identity and Mission

Before writing any code, define:

  • Who is this agent? Give it a name and personality
  • What problem does it solve? One specific problem, not many
  • What does success look like? Measurable outcomes

Example: Sentinel (Monitoring Agent)

identity:
  name: Sentinel
  tagline: "Watching while you sleep"
  mission: Ensure no issue goes undetected
  success_metrics:
    - Zero surprise outages
    - Mean time to detection < 5 minutes
    - False positive rate < 10%

The How: Operating Principles

Define how the agent makes decisions:

  • What can it do autonomously? Read-only operations, non-production changes
  • What requires approval? Production changes, deletions
  • Who does it collaborate with? Other agents, humans

The What: Specific Responsibilities

List the concrete things this agent does:

  • Domain-specific knowledge it needs
  • Workflows it executes
  • Outputs it produces
  • Reference materials it uses

Part 2: Decision Authority—The Critical Pattern

Without clear authority boundaries, agents either ask permission for everything (useless) or act autonomously on everything (dangerous).

The Authority Matrix

Action Type               | Authority Level
Read-only discovery       | Autonomous
Assessment and analysis   | Autonomous
Non-production changes    | Autonomous
Production proposals      | Autonomous to propose
Production execution      | Requires approval
Delete or remove anything | Requires approval

Implementing the Matrix

Here’s how a well-designed agent handles a request:

Request: “Set up monitoring for production”

  1. AUTONOMOUS: Discover current infrastructure
  2. AUTONOMOUS: Assess what’s missing
  3. AUTONOMOUS: Propose monitoring stack
  4. APPROVAL REQUIRED: Execute changes to production
  5. AUTONOMOUS: Verify and document

The agent does the thinking, humans approve the action.
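
Here is a minimal sketch of that gate in Python, assuming an action-name lookup against the matrix above; the action names and the `request_approval` callback are illustrative, not a specific Webera API.

# Sketch: classify a proposed action against the authority matrix before
# the agent runs it. Action names and callbacks are illustrative.
from enum import Enum, auto

class Authority(Enum):
    AUTONOMOUS = auto()
    REQUIRES_APPROVAL = auto()

# Mirrors the matrix above: discovery, analysis, and proposals are autonomous;
# production execution and deletions need a human sign-off.
AUTHORITY_MATRIX = {
    "discover_infrastructure": Authority.AUTONOMOUS,
    "assess_gaps": Authority.AUTONOMOUS,
    "propose_monitoring_stack": Authority.AUTONOMOUS,
    "execute_production_change": Authority.REQUIRES_APPROVAL,
    "delete_resource": Authority.REQUIRES_APPROVAL,
}

def run_action(action: str, execute, request_approval) -> dict:
    """Run `execute` directly if the action is autonomous, otherwise gate it."""
    level = AUTHORITY_MATRIX.get(action, Authority.REQUIRES_APPROVAL)  # unknown => safe
    if level is Authority.REQUIRES_APPROVAL and not request_approval(action):
        return {"status": "blocked", "action": action}
    return {"status": "done", "action": action, "result": execute()}

The important detail is the default: anything not explicitly listed falls back to requiring approval.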

Why This Matters

Consider two scenarios:

Scenario A: No authority matrix. The agent receives an alert about high CPU. Does it scale up? Does it investigate? Does it wake someone up? Without clear boundaries, it either does nothing useful or does something dangerous.

Scenario B: Clear authority matrix. The agent receives the same alert. It AUTONOMOUSLY investigates and correlates the spike with recent deployments. It AUTONOMOUSLY proposes a rollback with supporting evidence. It REQUIRES APPROVAL before executing the rollback.

The second agent is useful AND safe.


Part 3: Context Injection—Your Infrastructure, Not Generic Advice

The difference between “ChatGPT knows DevOps” and “Our agents know YOUR infrastructure” is context.

The Context File Pattern

Agents need structured knowledge about YOUR systems:

# .webera/context.yaml
infrastructure:
  cloud: aws
  region: us-east-1
  account_id: "123456789012"

  kubernetes:
    version: "1.28"
    cluster: "production-eks"
    namespaces:
      - name: api
        criticality: high
      - name: workers
        criticality: medium

  databases:
    - type: postgresql
      version: "15"
      name: "primary-db"
      rds_instance: "db.r6g.xlarge"

services:
  - name: api
    repository: "company/api"
    criticality: high
    dependencies: [primary-db, redis]
    sla_target: "99.9%"

  - name: worker
    repository: "company/worker"
    criticality: medium
    dependencies: [primary-db, rabbitmq]

Why Context Beats Prompting

Without context file:

User: "Why is my API slow?"
Agent: "There could be many reasons. Check your database queries,
       network latency, CPU usage..."

With context file:

User: "Why is my API slow?"
Agent: "Your API service depends on primary-db (PostgreSQL 15 on
       db.r6g.xlarge). Checking CloudWatch metrics... Connection
       pool is at 95% capacity. This matches the pattern from last
       Tuesday's incident. Recommend increasing pool size per
       runbook 4.2."

The difference is actionable specificity.
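
As a rough sketch, context injection can be as simple as loading the YAML above and folding it into the agent's system prompt; the prompt wording and helper names below are illustrative.

# Sketch: load .webera/context.yaml and build an infrastructure-aware prompt.
# Assumes PyYAML is available; the prompt text itself is illustrative.
import yaml

def load_context(path: str = ".webera/context.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def build_system_prompt(context: dict) -> str:
    infra = context["infrastructure"]
    lines = [
        "You are Sentinel, a monitoring agent for this specific infrastructure.",
        f"Cloud: {infra['cloud']} ({infra['region']}), cluster {infra['kubernetes']['cluster']}.",
        "Services and dependencies:",
    ]
    for svc in context.get("services", []):
        lines.append(
            f"- {svc['name']} (criticality {svc['criticality']}, "
            f"depends on {', '.join(svc['dependencies'])}, SLA {svc.get('sla_target', 'n/a')})"
        )
    lines.append("Ground every answer in this inventory; do not guess about services not listed.")
    return "\n".join(lines)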

Keeping Context Updated

Context files should be:

  • Versioned — In your git repository
  • Auto-discovered — Agents can update them (with approval)
  • Validated — Schema-checked to prevent errors
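
For the validation step, a schema check can run in CI before any agent consumes the file. The sketch below uses the jsonschema library with a deliberately trimmed schema; extend it to match your full context layout.

# Sketch: schema-check context.yaml so a typo never reaches the agents.
# Assumes PyYAML and jsonschema are installed; the schema is intentionally partial.
import yaml
from jsonschema import ValidationError, validate

CONTEXT_SCHEMA = {
    "type": "object",
    "required": ["infrastructure", "services"],
    "properties": {
        "services": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "criticality", "dependencies"],
                "properties": {"criticality": {"enum": ["high", "medium", "low"]}},
            },
        },
    },
}

def check_context(path: str = ".webera/context.yaml") -> bool:
    with open(path) as f:
        data = yaml.safe_load(f)
    try:
        validate(instance=data, schema=CONTEXT_SCHEMA)
        return True
    except ValidationError as err:
        print(f"context.yaml invalid: {err.message}")
        return False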

Part 4: Inter-Agent Collaboration

Single agents are limited. Agent systems are powerful.

The Handoff Pattern

Agents work together through defined handoffs:

Sentinel (monitoring) ──detects issue──► Dispatcher (routing)
Dispatcher ──routes to──► On-call engineer

Guardian (security) ──secures──► Conductor (pipelines)
Optimizer (cost) ◄──metrics from── Sentinel (monitoring)

Designing Handoffs

Each handoff needs:

  1. Clear trigger — When does the handoff happen?
  2. Context passing — What information transfers?
  3. Acknowledgment — How does the receiving agent confirm?

Example handoff:

handoff:
  from: sentinel
  to: dispatcher
  trigger: alert_threshold_exceeded
  context:
    - alert_type
    - affected_service
    - metrics_snapshot
    - suggested_runbook
  acknowledgment: dispatcher_received
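
In code, that handoff can be as small as a typed message plus an acknowledgment; the class and method names below are illustrative stand-ins, not a specific Webera interface.

# Sketch: Sentinel packages the context fields from the definition above and
# Dispatcher acknowledges receipt. An in-process call stands in for the bus.
from dataclasses import dataclass

@dataclass
class Handoff:
    source: str
    target: str
    trigger: str
    context: dict
    acknowledged: bool = False

class Dispatcher:
    def receive(self, handoff: Handoff) -> Handoff:
        # Acknowledge so Sentinel knows the alert was picked up.
        handoff.acknowledged = True
        return handoff

alert = Handoff(
    source="sentinel",
    target="dispatcher",
    trigger="alert_threshold_exceeded",
    context={
        "alert_type": "high_error_rate",
        "affected_service": "api",
        "metrics_snapshot": {"5xx_rate": 0.013},
        "suggested_runbook": "api-5xx-runbook",
    },
)
ack = Dispatcher().receive(alert)
assert ack.acknowledged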

Real-World Example

1. Sentinel detects: High error rate on API service (5xx > 1%)
2. Sentinel outputs:
   - Alert with context (service, metrics, timeframe)
   - Correlation with recent events
   - Suggested runbook
3. Sentinel suggests: "Engage Dispatcher to route this alert"
4. Dispatcher receives: Alert + context
5. Dispatcher checks: Runbook exists for this scenario
6. Dispatcher decides: Route to API team based on on-call schedule
7. Dispatcher notifies: Slack + PagerDuty with full context

No human intervention until step 7. But humans stay in control.
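
Dispatcher's side of steps 5 through 7 might look like the sketch below; the runbook table, on-call lookup, and `notify` callback are all hypothetical stand-ins for your real integrations (Slack, PagerDuty, and so on).

# Sketch of steps 5-7: check for a runbook, pick the on-call engineer,
# and notify with full context. Every lookup here is a hypothetical stand-in.
RUNBOOKS = {"high_error_rate": "api-5xx-runbook"}
ON_CALL = {"api": "api-team-oncall@example.com"}

def route_alert(alert: dict, notify) -> dict:
    runbook = RUNBOOKS.get(alert["alert_type"])        # step 5: does a runbook exist?
    engineer = ON_CALL.get(alert["affected_service"])  # step 6: on-call schedule lookup
    message = {
        "to": engineer,
        "service": alert["affected_service"],
        "metrics": alert["metrics_snapshot"],
        "runbook": runbook,
    }
    notify(message)                                    # step 7: Slack + PagerDuty
    return message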


Part 5: Client Customization

The same agent should behave differently for different contexts.

Why Customization Matters

  • Client A: SOC 2 focused, strict change control, requires approval for everything
  • Client B: Move fast, break things (but fix fast), autonomous for non-production

Same agent, different behavior.

The Customization Pattern

# Client-specific settings
agent_customization:
  sentinel:
    focus_areas:
      - "API latency"
      - "Database connections"
    ignore_namespaces:
      - "kube-system"
      - "monitoring"
    alert_threshold_multiplier: 1.5  # More lenient
    notes: "Previous P1 was API latency related - prioritize"

  guardian:
    compliance_focus:
      - "SOC2"
      - "HIPAA"
    backup_priority: "critical"
    approval_required_for: "all_changes"
    notes: "Healthcare client, strict compliance required"

Implementation Tips

  1. Check customization first — Before any action, load client settings
  2. Apply notes as context — Historical notes inform current decisions
  3. Default to safe — If no customization, use conservative defaults
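
A sketch of those three tips together, assuming the customization YAML above is stored alongside the context file; the path and default values are illustrative.

# Sketch: load client-specific settings before acting, fall back to
# conservative defaults when nothing is configured. Path and defaults illustrative.
import yaml

SAFE_DEFAULTS = {
    "alert_threshold_multiplier": 1.0,
    "approval_required_for": "all_changes",  # tip 3: default to safe
    "notes": "",
}

def load_customization(agent: str, path: str = ".webera/customization.yaml") -> dict:
    try:
        with open(path) as f:
            settings = yaml.safe_load(f) or {}
    except FileNotFoundError:
        settings = {}
    agent_settings = settings.get("agent_customization", {}).get(agent, {})
    # Tip 1: customization first; tip 2: notes ride along as extra context.
    return {**SAFE_DEFAULTS, **agent_settings}

sentinel_settings = load_customization("sentinel")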

Part 6: Building Your First Agent

Ready to build? Here’s the step-by-step process.

Step 1: Choose a Narrow Domain

NOT this: “Infrastructure agent”
DO this: “Monitoring and alerting agent”

Narrow scope = deep expertise = better results.

Step 2: Define Identity (Why)

identity:
  name: [Agent name]
  tagline: [One-line mission]
  problem_solved: [Specific problem]
  success_criteria:
    - [Measurable outcome 1]
    - [Measurable outcome 2]

Step 3: Define Authority (How)

authority:
  autonomous:
    - Read infrastructure state
    - Analyze metrics and logs
    - Generate reports
    - Propose changes
  requires_approval:
    - Execute production changes
    - Modify security settings
    - Delete resources
  handoffs_to:
    - [Other agent for related work]

Step 4: Define Knowledge (What)

knowledge:
  domain_expertise:
    - [Technical area 1]
    - [Technical area 2]
  reference_materials:
    - [Documentation source]
    - [Runbook location]
  output_formats:
    - [Report type]
    - [Alert format]

Step 5: Create Context Injection

Define what the agent needs to know about each infrastructure:

  • Service inventory
  • Dependencies
  • SLAs and criticality
  • Historical incidents

Step 6: Test Incrementally

  1. Read-only first — Can it correctly understand the infrastructure?
  2. Analysis second — Are its assessments accurate?
  3. Proposals third — Are suggested actions appropriate?
  4. Execution last — Does it execute safely with approval?
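
One simple way to enforce that order is a rollout mode the agent checks before every capability; the mode names below mirror the four steps, and everything else is illustrative.

# Sketch: promote an agent from read-only to execution one mode at a time.
MODES = ["read_only", "analyze", "propose", "execute"]

def allowed(current_mode: str, capability: str) -> bool:
    """A capability is allowed only once the rollout has reached its mode."""
    return MODES.index(capability) <= MODES.index(current_mode)

# Week 1: read-only. Promote to "analyze", "propose", then "execute"
# only after the previous stage has proven itself.
mode = "read_only"
assert allowed(mode, "read_only")
assert not allowed(mode, "execute")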

Part 7: Common Pitfalls

Pitfall             | Problem                            | Solution
Too broad           | Agent doesn’t know when to engage  | Narrow the domain
No authority matrix | Asks permission for everything     | Define autonomous actions
No context          | Generic advice, not specific       | Inject infrastructure context
No handoffs         | Agent works in isolation           | Define relationships
No customization    | Same behavior for all clients      | Add client-specific settings
No testing          | Dangerous in production            | Test incrementally

Why We Built 8 Agents, Not 1

The Temptation

“Build one AI that handles all DevOps.”

It sounds efficient. One system to rule them all.

The Reality

Vertical beats horizontal for specialized domains:

  • Monitoring requires different expertise than security
  • Cost optimization requires different context than deployment
  • 8 specialists > 1 generalist

Each of our agents is a deep expert in one thing. They collaborate when needed, but they don’t try to do everything.

Our Agents Know YOUR Infrastructure Because:

  • They read your context files
  • They apply your customizations
  • They follow your runbooks
  • They integrate with your tools
  • They learn from your incidents

ChatGPT knows about DevOps. Our agents know YOUR infrastructure.


Next Steps

Option 1: Build Your Own

Use this guide to create agents for your specific needs. Start narrow, test thoroughly, expand gradually.

Option 2: Use Ours

Our 8 agents are already built, tested, and battle-hardened across dozens of client infrastructures.

Book a discovery call to see how our AI team can work with yours.

Option 3: Hybrid

Some clients use our agents while building their own for specific domains. We’re happy to share patterns and collaborate.



Ready to accelerate your infrastructure?

Our team of senior engineers + AI agents can implement these practices in days, not months.

Book a Discovery Call