
Building AI Agents for Infrastructure

Learn the patterns for building specialized infrastructure agents that know YOUR systems

By Joel Zamboni

Who this is for: DevOps engineers, platform teams, CTOs evaluating AI

ChatGPT knows about DevOps. It can explain Kubernetes concepts, suggest Terraform patterns, and help debug error messages.

But it can’t tell you why YOUR pods are crashing at 3am.

This guide teaches you the patterns for building AI agents that go beyond generic advice—agents that understand your specific infrastructure and can take meaningful action.

Why Vertical Agents Beat Generic AI

The Problem with Generic AI

Ask ChatGPT “How do I fix a Kubernetes pod crash loop?” and you’ll get a comprehensive answer covering all possible causes. That’s helpful for learning, but not for solving YOUR problem at 2am.

Generic AI gives you:

  • Broad explanations that apply to everyone
  • Suggestions that may not fit your architecture
  • No context about your specific systems
  • Potential for hallucination on specifics

The Vertical Agent Approach

Vertical agents flip this model:

Aspect        | Generic AI     | Vertical Agent
Scope         | Everything     | One domain
Context       | None           | Your infrastructure
Output        | Generic advice | Actionable steps
Hallucination | Higher risk    | Lower (constrained domain)

A monitoring agent that knows YOUR infrastructure can tell you:

  • “Pod X is crash looping because the database connection pool is exhausted”
  • “This happened twice last month after the traffic spike on Tuesday”
  • “Runbook 7.3 addresses this—should I apply it?”

Webera’s Agent Philosophy

We don’t build one AI that does everything. We build 8 specialists that each do one thing excellently:

  • Sentinel — Monitoring and observability
  • Guardian — Security and compliance
  • Optimizer — Cost and performance
  • Conductor — CI/CD and deployment
  • Dispatcher — Alerting and routing
  • Navigator — Discovery and documentation
  • Keeper — Secrets and access management
  • Warden — Audit and governance

Each agent is a deep expert in its domain, with context about YOUR systems.


Part 1: Anatomy of an Infrastructure Agent

Every effective agent needs three things: Why, How, and What.

The Why: Identity and Mission

Before writing any code, define:

  • Who is this agent? Give it a name and personality
  • What problem does it solve? One specific problem, not many
  • What does success look like? Measurable outcomes

Example: Sentinel (Monitoring Agent)

identity:
  name: Sentinel
  tagline: "Watching while you sleep"
  mission: Ensure no issue goes undetected
  success_metrics:
    - Zero surprise outages
    - Mean time to detection < 5 minutes
    - False positive rate < 10%

The How: Operating Principles

Define how the agent makes decisions:

  • What can it do autonomously? Read-only operations, non-production changes
  • What requires approval? Production changes, deletions
  • Who does it collaborate with? Other agents, humans

The What: Specific Responsibilities

List the concrete things this agent does:

  • Domain-specific knowledge it needs
  • Workflows it executes
  • Outputs it produces
  • Reference materials it uses

Part 2: Decision Authority—The Critical Pattern

Without clear authority boundaries, agents either ask permission for everything (useless) or act autonomously on everything (dangerous).

The Authority Matrix

Action Type               | Authority Level
Read-only discovery       | Autonomous
Assessment and analysis   | Autonomous
Non-production changes    | Autonomous
Production proposals      | Autonomous to propose
Production execution      | Requires approval
Delete or remove anything | Requires approval

Implementing the Matrix

Here’s how a well-designed agent handles a request:

Request: “Set up monitoring for production”

  1. AUTONOMOUS: Discover current infrastructure
  2. AUTONOMOUS: Assess what’s missing
  3. AUTONOMOUS: Propose monitoring stack
  4. APPROVAL REQUIRED: Execute changes to production
  5. AUTONOMOUS: Verify and document

The agent does the thinking, humans approve the action.
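
Here is a minimal sketch of that gate in Python, assuming an action-name lookup against the matrix above; the action names and the `request_approval` callback are illustrative, not a specific Webera API.

# Sketch: classify a proposed action against the authority matrix before
# the agent runs it. Action names and callbacks are illustrative.
from enum import Enum, auto

class Authority(Enum):
    AUTONOMOUS = auto()
    REQUIRES_APPROVAL = auto()

# Mirrors the matrix above: discovery, analysis, and proposals are autonomous;
# production execution and deletions need a human sign-off.
AUTHORITY_MATRIX = {
    "discover_infrastructure": Authority.AUTONOMOUS,
    "assess_gaps": Authority.AUTONOMOUS,
    "propose_monitoring_stack": Authority.AUTONOMOUS,
    "execute_production_change": Authority.REQUIRES_APPROVAL,
    "delete_resource": Authority.REQUIRES_APPROVAL,
}

def run_action(action: str, execute, request_approval) -> dict:
    """Run `execute` directly if the action is autonomous, otherwise gate it."""
    level = AUTHORITY_MATRIX.get(action, Authority.REQUIRES_APPROVAL)  # unknown => safe
    if level is Authority.REQUIRES_APPROVAL and not request_approval(action):
        return {"status": "blocked", "action": action}
    return {"status": "done", "action": action, "result": execute()}

The important detail is the default: anything not explicitly listed falls back to requiring approval.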

Why This Matters

Consider two scenarios:

Scenario A: No authority matrix. The agent receives an alert about high CPU. Does it scale up? Does it investigate? Does it wake someone up? Without clear boundaries, it either does nothing useful or does something dangerous.

Scenario B: Clear authority matrix. The agent receives the same alert. It AUTONOMOUSLY investigates and correlates the spike with recent deployments. It AUTONOMOUSLY proposes a rollback with supporting evidence. It REQUIRES APPROVAL before executing the rollback.

The second agent is useful AND safe.


Part 3: Context Injection—Your Infrastructure, Not Generic Advice

The difference between “ChatGPT knows DevOps” and “Our agents know YOUR infrastructure” is context.

The Context File Pattern

Agents need structured knowledge about YOUR systems:

# .webera/context.yaml
infrastructure:
  cloud: aws
  region: us-east-1
  account_id: "123456789012"

  kubernetes:
    version: "1.28"
    cluster: "production-eks"
    namespaces:
      - name: api
        criticality: high
      - name: workers
        criticality: medium

  databases:
    - type: postgresql
      version: "15"
      name: "primary-db"
      rds_instance: "db.r6g.xlarge"

services:
  - name: api
    repository: "company/api"
    criticality: high
    dependencies: [primary-db, redis]
    sla_target: "99.9%"

  - name: worker
    repository: "company/worker"
    criticality: medium
    dependencies: [primary-db, rabbitmq]

Why Context Beats Prompting

Without context file:

User: "Why is my API slow?"
Agent: "There could be many reasons. Check your database queries,
       network latency, CPU usage..."

With context file:

User: "Why is my API slow?"
Agent: "Your API service depends on primary-db (PostgreSQL 15 on
       db.r6g.xlarge). Checking CloudWatch metrics... Connection
       pool is at 95% capacity. This matches the pattern from last
       Tuesday's incident. Recommend increasing pool size per
       runbook 4.2."

The difference is actionable specificity.
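
As a rough sketch, context injection can be as simple as loading the YAML above and folding it into the agent's system prompt; the prompt wording and helper names below are illustrative.

# Sketch: load .webera/context.yaml and build an infrastructure-aware prompt.
# Assumes PyYAML is available; the prompt text itself is illustrative.
import yaml

def load_context(path: str = ".webera/context.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def build_system_prompt(context: dict) -> str:
    infra = context["infrastructure"]
    lines = [
        "You are Sentinel, a monitoring agent for this specific infrastructure.",
        f"Cloud: {infra['cloud']} ({infra['region']}), cluster {infra['kubernetes']['cluster']}.",
        "Services and dependencies:",
    ]
    for svc in context.get("services", []):
        lines.append(
            f"- {svc['name']} (criticality {svc['criticality']}, "
            f"depends on {', '.join(svc['dependencies'])}, SLA {svc.get('sla_target', 'n/a')})"
        )
    lines.append("Ground every answer in this inventory; do not guess about services not listed.")
    return "\n".join(lines)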

Keeping Context Updated

Context files should be:

  • Versioned — In your git repository
  • Auto-discovered — Agents can update them (with approval)
  • Validated — Schema-checked to prevent errors
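
For the validation step, a schema check can run in CI before any agent consumes the file. The sketch below uses the jsonschema library with a deliberately trimmed schema; extend it to match your full context layout.

# Sketch: schema-check context.yaml so a typo never reaches the agents.
# Assumes PyYAML and jsonschema are installed; the schema is intentionally partial.
import yaml
from jsonschema import ValidationError, validate

CONTEXT_SCHEMA = {
    "type": "object",
    "required": ["infrastructure", "services"],
    "properties": {
        "services": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "criticality", "dependencies"],
                "properties": {"criticality": {"enum": ["high", "medium", "low"]}},
            },
        },
    },
}

def check_context(path: str = ".webera/context.yaml") -> bool:
    with open(path) as f:
        data = yaml.safe_load(f)
    try:
        validate(instance=data, schema=CONTEXT_SCHEMA)
        return True
    except ValidationError as err:
        print(f"context.yaml invalid: {err.message}")
        return False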

Part 4: Inter-Agent Collaboration

Single agents are limited. Agent systems are powerful.

The Handoff Pattern

Agents work together through defined handoffs:

Sentinel (monitoring) ──detects issue──► Dispatcher (routing)
Dispatcher ──routes to──► On-call engineer

Guardian (security) ──secures──► Conductor (pipelines)
Optimizer (cost) ◄──metrics from── Sentinel (monitoring)

Designing Handoffs

Each handoff needs:

  1. Clear trigger — When does the handoff happen?
  2. Context passing — What information transfers?
  3. Acknowledgment — How does the receiving agent confirm?

Example handoff:

handoff:
  from: sentinel
  to: dispatcher
  trigger: alert_threshold_exceeded
  context:
    - alert_type
    - affected_service
    - metrics_snapshot
    - suggested_runbook
  acknowledgment: dispatcher_received
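
In code, that handoff can be as small as a typed message plus an acknowledgment; the class and method names below are illustrative stand-ins, not a specific Webera interface.

# Sketch: Sentinel packages the context fields from the definition above and
# Dispatcher acknowledges receipt. An in-process call stands in for the bus.
from dataclasses import dataclass

@dataclass
class Handoff:
    source: str
    target: str
    trigger: str
    context: dict
    acknowledged: bool = False

class Dispatcher:
    def receive(self, handoff: Handoff) -> Handoff:
        # Acknowledge so Sentinel knows the alert was picked up.
        handoff.acknowledged = True
        return handoff

alert = Handoff(
    source="sentinel",
    target="dispatcher",
    trigger="alert_threshold_exceeded",
    context={
        "alert_type": "high_error_rate",
        "affected_service": "api",
        "metrics_snapshot": {"5xx_rate": 0.013},
        "suggested_runbook": "api-5xx-runbook",
    },
)
ack = Dispatcher().receive(alert)
assert ack.acknowledged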

Real-World Example

1. Sentinel detects: High error rate on API service (5xx > 1%)
2. Sentinel outputs:
   - Alert with context (service, metrics, timeframe)
   - Correlation with recent events
   - Suggested runbook
3. Sentinel suggests: "Engage Dispatcher to route this alert"
4. Dispatcher receives: Alert + context
5. Dispatcher checks: Runbook exists for this scenario
6. Dispatcher decides: Route to API team based on on-call schedule
7. Dispatcher notifies: Slack + PagerDuty with full context

No human intervention until step 7. But humans stay in control.
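
Dispatcher's side of steps 5 through 7 might look like the sketch below; the runbook table, on-call lookup, and `notify` callback are all hypothetical stand-ins for your real integrations (Slack, PagerDuty, and so on).

# Sketch of steps 5-7: check for a runbook, pick the on-call engineer,
# and notify with full context. Every lookup here is a hypothetical stand-in.
RUNBOOKS = {"high_error_rate": "api-5xx-runbook"}
ON_CALL = {"api": "api-team-oncall@example.com"}

def route_alert(alert: dict, notify) -> dict:
    runbook = RUNBOOKS.get(alert["alert_type"])        # step 5: does a runbook exist?
    engineer = ON_CALL.get(alert["affected_service"])  # step 6: on-call schedule lookup
    message = {
        "to": engineer,
        "service": alert["affected_service"],
        "metrics": alert["metrics_snapshot"],
        "runbook": runbook,
    }
    notify(message)                                    # step 7: Slack + PagerDuty
    return message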


Part 5: Client Customization

The same agent should behave differently for different contexts.

Why Customization Matters

  • Client A: SOC 2 focused, strict change control, requires approval for everything
  • Client B: Move fast, break things (but fix fast), autonomous for non-production

Same agent, different behavior.

The Customization Pattern

# Client-specific settings
agent_customization:
  sentinel:
    focus_areas:
      - "API latency"
      - "Database connections"
    ignore_namespaces:
      - "kube-system"
      - "monitoring"
    alert_threshold_multiplier: 1.5  # More lenient
    notes: "Previous P1 was API latency related - prioritize"

  guardian:
    compliance_focus:
      - "SOC2"
      - "HIPAA"
    backup_priority: "critical"
    approval_required_for: "all_changes"
    notes: "Healthcare client, strict compliance required"

Implementation Tips

  1. Check customization first — Before any action, load client settings
  2. Apply notes as context — Historical notes inform current decisions
  3. Default to safe — If no customization, use conservative defaults
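
A sketch of those three tips together, assuming the customization YAML above is stored alongside the context file; the path and default values are illustrative.

# Sketch: load client-specific settings before acting, fall back to
# conservative defaults when nothing is configured. Path and defaults illustrative.
import yaml

SAFE_DEFAULTS = {
    "alert_threshold_multiplier": 1.0,
    "approval_required_for": "all_changes",  # tip 3: default to safe
    "notes": "",
}

def load_customization(agent: str, path: str = ".webera/customization.yaml") -> dict:
    try:
        with open(path) as f:
            settings = yaml.safe_load(f) or {}
    except FileNotFoundError:
        settings = {}
    agent_settings = settings.get("agent_customization", {}).get(agent, {})
    # Tip 1: customization first; tip 2: notes ride along as extra context.
    return {**SAFE_DEFAULTS, **agent_settings}

sentinel_settings = load_customization("sentinel")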

Part 6: Building Your First Agent

Ready to build? Here’s the step-by-step process.

Step 1: Choose a Narrow Domain

NOT this: “Infrastructure agent”
DO this: “Monitoring and alerting agent”

Narrow scope = deep expertise = better results.

Step 2: Define Identity (Why)

identity:
  name: [Agent name]
  tagline: [One-line mission]
  problem_solved: [Specific problem]
  success_criteria:
    - [Measurable outcome 1]
    - [Measurable outcome 2]

Step 3: Define Authority (How)

authority:
  autonomous:
    - Read infrastructure state
    - Analyze metrics and logs
    - Generate reports
    - Propose changes
  requires_approval:
    - Execute production changes
    - Modify security settings
    - Delete resources
  handoffs_to:
    - [Other agent for related work]

Step 4: Define Knowledge (What)

knowledge:
  domain_expertise:
    - [Technical area 1]
    - [Technical area 2]
  reference_materials:
    - [Documentation source]
    - [Runbook location]
  output_formats:
    - [Report type]
    - [Alert format]

Step 5: Create Context Injection

Define what the agent needs to know about each infrastructure:

  • Service inventory
  • Dependencies
  • SLAs and criticality
  • Historical incidents

Step 6: Test Incrementally

  1. Read-only first — Can it correctly understand the infrastructure?
  2. Analysis second — Are its assessments accurate?
  3. Proposals third — Are suggested actions appropriate?
  4. Execution last — Does it execute safely with approval?
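
One simple way to enforce that order is a rollout mode the agent checks before every capability; the mode names below mirror the four steps, and everything else is illustrative.

# Sketch: promote an agent from read-only to execution one mode at a time.
MODES = ["read_only", "analyze", "propose", "execute"]

def allowed(current_mode: str, capability: str) -> bool:
    """A capability is allowed only once the rollout has reached its mode."""
    return MODES.index(capability) <= MODES.index(current_mode)

# Week 1: read-only. Promote to "analyze", "propose", then "execute"
# only after the previous stage has proven itself.
mode = "read_only"
assert allowed(mode, "read_only")
assert not allowed(mode, "execute")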

Part 7: Common Pitfalls

Pitfall             | Problem                            | Solution
Too broad           | Agent doesn’t know when to engage  | Narrow the domain
No authority matrix | Asks permission for everything     | Define autonomous actions
No context          | Generic advice, not specific       | Inject infrastructure context
No handoffs         | Agent works in isolation           | Define relationships
No customization    | Same behavior for all clients      | Add client-specific settings
No testing          | Dangerous in production            | Test incrementally

Why We Built 8 Agents, Not 1

The Temptation

“Build one AI that handles all DevOps.”

It sounds efficient. One system to rule them all.

The Reality

Vertical beats horizontal for specialized domains:

  • Monitoring requires different expertise than security
  • Cost optimization requires different context than deployment
  • 8 specialists > 1 generalist

Each of our agents is a deep expert in one thing. They collaborate when needed, but they don’t try to do everything.

Our Agents Know YOUR Infrastructure Because:

  • They read your context files
  • They apply your customizations
  • They follow your runbooks
  • They integrate with your tools
  • They learn from your incidents

ChatGPT knows about DevOps. Our agents know YOUR infrastructure.


Next Steps

Option 1: Build Your Own

Use this guide to create agents for your specific needs. Start narrow, test thoroughly, expand gradually.

Option 2: Use Ours

Our 8 agents are already built, tested, and battle-hardened across dozens of client infrastructures.

Book a discovery call to see how our AI team can work with yours.

Option 3: Hybrid

Some clients use our agents while building their own for specific domains. We’re happy to share patterns and collaborate.



Ready to accelerate your infrastructure?

Our team of senior engineers + AI agents can implement these practices in days, not months.

Book a Discovery Call