Cymantis

From SOAR Playbooks to Agentic Orchestration: Modernizing Incident Response

Why rigid SOAR playbooks fail at scale and how multi-agent orchestration delivers dramatically better incident response — with a technical migration guide from traditional SOAR to agentic IR.

SOAR · Incident Response · Agentic AI · Orchestration · Automation · Cymantis


By Cymantis Labs

SOAR was supposed to be the answer. Security Orchestration, Automation, and Response platforms promised to eliminate the manual drudgery of incident response — to give overwhelmed SOC teams the ability to codify their best analysts' workflows into reusable playbooks and execute them at machine speed.

Instead, most organizations are drowning in a different kind of complexity. They manage 200+ playbooks, each one a brittle chain of if/then/else logic that works perfectly for the exact scenario it was designed for — and breaks the moment a threat deviates from the script. Playbook maintenance has become a full-time job. The "automation" that was supposed to free analysts has created a new category of toil: playbook engineering.

Here's the data that should make every IR leader pause: one study of multi-agent AI systems for incident response reported that agentic orchestration delivered 100% actionable recommendations versus just 1.7% for single-agent approaches, with 80x better action specificity. Multi-agent systems don't just classify alerts — they investigate, contextualize, and recommend precise response actions with full justification chains.

The gap between what SOAR promised and what agentic orchestration delivers isn't incremental. It's generational. This post lays out the technical architecture, the migration path, and the governance model for making the shift — with working code, real configurations, and the hard-won lessons from teams that have already made the transition.


Why SOAR Playbooks Fail at Scale

Before building the replacement, we need to be honest about why SOAR fails. Not in theory — in practice, in production, at scale. The failure modes are structural, not incidental.

The Brittleness Problem

A SOAR playbook is a static decision tree authored at a specific point in time by an analyst who understood a specific threat scenario. It encodes assumptions about alert format, data availability, API responses, and threat behavior. When any of those assumptions change — and they always change — the playbook either fails silently or produces incorrect results.

Consider a phishing playbook designed in 2024. It expects an email alert from your secure email gateway (SEG), extracts URLs and attachments, detonates them in a sandbox, checks the sender domain against threat intel, and makes a disposition. Now a threat actor uses QR codes instead of URLs. The extraction step finds nothing. The playbook marks the alert as clean. The phishing email lands. Nobody knows until the credential harvest succeeds.

The playbook didn't fail technically — it executed perfectly. It just couldn't handle a scenario its author didn't anticipate. And this is the fundamental problem: static logic cannot adapt to novel threats.
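To make the failure mode concrete, here is a minimal, hypothetical sketch of the static branch described above — the function names and regex are illustrative, not taken from any real SOAR product:

```python
import re

def extract_urls(email_body: str) -> list[str]:
    """Static playbook step: extract URLs with a fixed regex."""
    return re.findall(r"https?://[^\s\"'<>]+", email_body)

def phishing_playbook_disposition(email_body: str) -> str:
    """Simplified static branch: no URLs extracted -> 'clean'."""
    urls = extract_urls(email_body)
    if not urls:
        return "clean"  # QR-code phish: URL lives in an image, so nothing is found
    return "detonate_and_check"

# The QR-code phishing email sails straight through the static logic.
qr_phish = "Scan the attached QR code to review your payroll update."
assert phishing_playbook_disposition(qr_phish) == "clean"
```

The playbook's logic is internally consistent; it simply has no branch for an input shape its author never imagined.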

Playbook Sprawl

Organizations that take SOAR seriously end up managing hundreds of playbooks:

  • 20+ playbooks for phishing variants (URL, attachment, BEC, QR, callback)
  • 15+ playbooks for endpoint alerts (malware, suspicious process, lateral movement, persistence)
  • 10+ playbooks for identity events (impossible travel, credential stuffing, privilege escalation)
  • 30+ playbooks for cloud security (IAM changes, S3 exposure, security group modifications)
  • 50+ enrichment sub-playbooks (IP reputation, domain analysis, file hash lookup, user context)

Each playbook needs its own error handling, its own input validation, its own API integrations, and its own test cases. When a vendor changes an API endpoint, every playbook that calls it breaks. When the SIEM schema changes, every playbook that parses its output breaks. The combinatorial complexity is staggering.

The Maintenance Burden

A 2025 survey by the Ponemon Institute found that organizations spend an average of 12 FTE hours per week maintaining SOAR playbooks — updating integrations, fixing broken API calls, adding new branches for threat variants, and testing changes. That's nearly a third of a full-time security engineer's week dedicated entirely to keeping the automation running — not improving it, not building new capabilities, just preventing decay.

Worse, most organizations don't have adequate testing for playbooks. Changes are tested manually (if at all), deployed to production, and validated only when a real incident triggers them. The blast radius of a broken playbook is an unhandled incident.

The False Sense of Automation

Perhaps the most insidious failure: SOAR creates an illusion of coverage. Leadership sees 200 playbooks and assumes 200 scenarios are handled. In reality, many of those playbooks haven't been tested against current threat behavior, half have broken integrations that nobody has noticed, and the remaining ones handle the easy cases that analysts could resolve in minutes anyway.

The hard cases — the novel threats, the multi-stage attacks, the adversaries who deliberately evade your documented procedures — still land on an analyst's desk with no automation support. SOAR handles the 80% of incidents that don't really need automation and fails on the 20% that desperately do.

Pro Tip: Audit your SOAR playbooks quarterly. Track execution success rates, mean execution time, and — critically — how often a playbook completes but the incident still requires manual analyst intervention. That last metric reveals your true automation gap.
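That last metric is easy to compute from execution logs. A minimal sketch — the `PlaybookRun` record and field names are hypothetical stand-ins for whatever your SOAR platform exports:

```python
from dataclasses import dataclass

@dataclass
class PlaybookRun:
    playbook: str
    succeeded: bool
    duration_s: float
    needed_manual_followup: bool  # completed, but an analyst still had to step in

def automation_gap(runs: list[PlaybookRun]) -> dict[str, float]:
    """Per-playbook share of *successful* runs that still required manual work."""
    gap: dict[str, float] = {}
    for name in {r.playbook for r in runs}:
        ok = [r for r in runs if r.playbook == name and r.succeeded]
        if ok:
            gap[name] = sum(r.needed_manual_followup for r in ok) / len(ok)
    return gap

runs = [
    PlaybookRun("phishing_url", True, 42.0, False),
    PlaybookRun("phishing_url", True, 40.0, True),
    PlaybookRun("phishing_url", True, 39.0, True),
]
gap = automation_gap(runs)  # two of three "successful" runs still needed a human
```

A playbook with a 100% success rate and a 66% manual-followup rate is not automation — it's a notification system.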


SOAR vs. Agentic Orchestration — A Side-by-Side Comparison

The difference between SOAR and agentic orchestration isn't just technical — it's philosophical. SOAR encodes procedures. Agentic orchestration encodes reasoning.

Comparison Matrix

Dimension            | Traditional SOAR                                      | Agentic Orchestration
Decision Logic       | Static decision trees, if/then/else branches          | Dynamic reasoning with LLM-powered analysis
Adaptability         | Requires manual playbook updates for new scenarios    | Adapts to novel threats using learned patterns and reasoning
Context Retention    | Stateless — each execution starts from zero           | Maintains investigation context across steps and incidents
Multi-Step Reasoning | Pre-defined enrichment sequence, fixed order          | Dynamic investigation — each step informs the next
Maintenance Overhead | 12+ FTE hours/week for playbook maintenance           | Self-improving; agents learn from outcomes
Novel Threat Handling| Fails silently or escalates without context           | Reasons about unfamiliar patterns using threat knowledge
Error Recovery       | Hard-coded error handlers per integration             | Graceful degradation with alternative investigation paths
Action Specificity   | Generic response actions (block IP, isolate host)     | Precise, justified response recommendations with blast radius analysis
Scalability          | Linear — each new scenario needs a new playbook       | Emergent — agents compose capabilities dynamically
Audit Trail          | Execution logs of playbook steps                      | Full reasoning chains with evidence and confidence scores

Architecture Comparison

Traditional SOAR Architecture:

graph LR
    siemAlert["SIEM Alert"] --> playbookEngine["Playbook Engine<br/>(Static Logic)"]
    playbookEngine --> responseActions["Response Actions"]
    playbookEngine --> enrichment["Enrichment<br/>(Fixed)"]

Agentic Orchestration Architecture:

graph TD
    siemAlert["SIEM Alert"] --> triageAgent
    
    subgraph orchestrationLayer["Agent Orchestration Layer"]
        direction LR
        triageAgent["Triage Agent"] --> investigationAgent["Investigation Agent"]
        investigationAgent --> responseOrchestrator["Response Orchestrator"]
        
        triageAgent --> sharedState["Shared State & Memory"]
        investigationAgent --> sharedState
        responseOrchestrator --> sharedState
        
        sharedState --> learningAgent["Learning Agent"]
        sharedState --> documentationAgent["Documentation Agent"]
        sharedState --> governanceLayer["Governance Layer"]
    end

The critical difference is the shared state layer. In SOAR, each playbook execution is an island. In an agentic system, every agent contributes to and draws from a shared investigation context. The triage agent's severity classification informs the investigation agent's depth. The investigation agent's findings shape the response orchestrator's actions. The learning agent's historical analysis improves every future triage decision.

Pro Tip: When evaluating agentic IR platforms, ask specifically about state management. If the system can't show you how Agent A's output influences Agent B's reasoning in real time, it's SOAR with a chatbot bolted on — not true agentic orchestration.


The Agentic Incident Response Architecture

This is the core technical section. We'll walk through each agent in the agentic IR system, its responsibilities, its interfaces, and its implementation.

Triage Agent

The triage agent is the front door. Every alert hits this agent first. Its job is not to investigate — it's to classify, prioritize, and route.

Responsibilities:

  • Receive raw alerts from SIEM, EDR, cloud, and identity sources
  • Normalize alert data into a common schema
  • Classify severity using contextual signals (asset criticality, user role, threat intel)
  • Determine the investigation path (automated, assisted, or manual escalation)
  • Deduplicate against active investigations
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFORMATIONAL = "informational"

class InvestigationPath(Enum):
    AUTONOMOUS = "autonomous"
    ASSISTED = "assisted"
    MANUAL = "manual"

@dataclass
class TriageResult:
    alert_id: str
    severity: Severity
    investigation_path: InvestigationPath
    confidence: float
    reasoning: str
    enrichment_hints: list = field(default_factory=list)
    deduplicated: bool = False
    parent_investigation_id: str | None = None

class TriageAgent:
    """Front-door agent: classifies, prioritizes, and routes alerts
    using weighted contextual severity scoring."""

    def __init__(self, llm_client, asset_db, threat_intel, active_investigations):
        self.llm, self.asset_db = llm_client, asset_db
        self.threat_intel, self.active_investigations = threat_intel, active_investigations
        self.severity_weights = {
            "asset_criticality": 0.30, "threat_intel_match": 0.25,
            "behavioral_anomaly": 0.20, "alert_fidelity": 0.15,
            "temporal_proximity": 0.10}

    async def triage(self, raw_alert: dict) -> TriageResult:
        """Main triage pipeline: normalize, deduplicate, classify, route."""
        normalized = self._normalize_alert(raw_alert)
        dedup_result = await self._check_deduplication(normalized)
        if dedup_result:
            return dedup_result
        context = await self._gather_triage_context(normalized)
        severity, confidence = self._calculate_severity(context)
        path = self._determine_path(severity, confidence, context)
        reasoning = await self._generate_reasoning(normalized, context, severity, path)
        return TriageResult(
            alert_id=normalized["alert_id"], severity=severity,
            investigation_path=path, confidence=confidence,
            reasoning=reasoning, enrichment_hints=self._suggest_enrichments(context))

    # Additional methods:
    # - _gather_triage_context(): Pulls asset criticality, threat intel, alert history
    # - _calculate_severity(): Weighted scoring across 5 contextual signals
    # - _determine_path(): Routes to autonomous/assisted/manual by severity + confidence
    # - _check_deduplication(): Fingerprints alerts against active investigations
    # - _normalize_alert(): Converts raw alerts into common schema
    # - _suggest_enrichments(): Recommends deep TI, vuln scans, IOC sweeps
    # - _generate_reasoning(): LLM-generated human-readable triage justification
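The elided `_calculate_severity()` can be sketched as a plain weighted sum over the five contextual signals, each normalized to 0.0–1.0 by the enrichment lookups. The score-to-severity thresholds below are illustrative assumptions, not a canonical Cymantis calibration:

```python
SEVERITY_WEIGHTS = {
    "asset_criticality": 0.30, "threat_intel_match": 0.25,
    "behavioral_anomaly": 0.20, "alert_fidelity": 0.15,
    "temporal_proximity": 0.10,
}

def calculate_severity(signals: dict[str, float]) -> tuple[str, float]:
    """Weighted severity score; missing signals contribute zero."""
    score = sum(SEVERITY_WEIGHTS[k] * signals.get(k, 0.0) for k in SEVERITY_WEIGHTS)
    if score >= 0.8:
        label = "critical"
    elif score >= 0.6:
        label = "high"
    elif score >= 0.4:
        label = "medium"
    elif score >= 0.2:
        label = "low"
    else:
        label = "informational"
    return label, score

label, score = calculate_severity({
    "asset_criticality": 1.0,   # e.g. a domain controller
    "threat_intel_match": 0.8,  # known C2 infrastructure
    "behavioral_anomaly": 0.5,
    "alert_fidelity": 0.9,
    "temporal_proximity": 0.2,
})
# 0.30 + 0.20 + 0.10 + 0.135 + 0.02 = 0.755 -> "high"
```

The point of the weighted form is that severity is contextual: the same alert on a domain controller and on a kiosk machine should land in different tiers without any playbook branching.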

Pro Tip: The triage agent should never make response decisions — only classification and routing decisions. Keep triage fast (sub-second) by limiting enrichment to lightweight lookups. Deep enrichment belongs in the investigation agent.

Investigation Agent

The investigation agent is where agentic orchestration truly separates itself from SOAR. Instead of a fixed enrichment sequence, the investigation agent conducts a dynamic, multi-step investigation where each step informs the next.

Responsibilities:

  • Multi-source enrichment (threat intel, CMDB, vulnerability data, historical incidents)
  • Dynamic investigation planning — choose next steps based on findings
  • Evidence correlation across data sources
  • Confidence scoring for findings
  • Blast radius assessment
from dataclasses import dataclass, field

@dataclass
class InvestigationStep:
    tool: str
    query: str
    result: dict
    duration_ms: int
    confidence: float
    findings: list = field(default_factory=list)

@dataclass
class InvestigationReport:
    investigation_id: str
    alert_id: str
    steps: list
    findings: list
    confidence: float
    blast_radius: dict
    recommended_actions: list
    evidence_chain: list
    timeline: list

class InvestigationAgent:
    """Conducts dynamic, multi-step investigations where each enrichment
    step informs the next — mirroring how a senior analyst works."""

    def __init__(self, llm_client, tool_registry, state_store):
        self.llm, self.tools, self.state = llm_client, tool_registry, state_store
        self.max_steps, self.confidence_threshold = 15, 0.85

    async def investigate(self, triage_result: dict) -> InvestigationReport:
        """Run a dynamic, LLM-guided investigation."""
        investigation_id = f"inv-{triage_result['alert_id']}"
        steps, findings = [], []
        plan = await self._create_investigation_plan(triage_result)

        for step_num in range(self.max_steps):
            next_action = await self._decide_next_step(
                triage_result, steps, findings, plan)
            if next_action["action"] == "conclude":
                break
            step = await self._execute_step(next_action)
            steps.append(step)
            findings.extend(await self._extract_findings(step, triage_result))
            await self.state.update(investigation_id, {
                "current_step": step_num + 1, "findings_count": len(findings)})
            if self._assess_confidence(findings) >= self.confidence_threshold:
                break

        blast_radius = await self._assess_blast_radius(triage_result, findings)
        return InvestigationReport(
            investigation_id=investigation_id, alert_id=triage_result["alert_id"],
            steps=steps, findings=findings,
            confidence=self._assess_confidence(findings), blast_radius=blast_radius,
            recommended_actions=await self._generate_recommendations(
                triage_result, findings, blast_radius),
            evidence_chain=[s.tool for s in steps],
            timeline=self._build_timeline(steps, findings))

    # Additional methods:
    # - _create_investigation_plan(): LLM generates hypotheses and data source order
    # - _decide_next_step(): LLM picks next action from SIEM/EDR/TI/identity/cloud/CMDB
    # - _execute_step(): Runs a tool query with timing and error handling
    # - _assess_blast_radius(): Maps affected assets, users, and services
    # - _assess_confidence(): Weighted average biased toward high-confidence findings
    # - _extract_findings(): Structures raw tool results into typed findings
    # - _build_timeline(): Chronological reconstruction from steps and findings
    # - _generate_recommendations(): LLM-powered response action suggestions

Pro Tip: Cap investigation steps at 15 with an escape hatch. Unbounded investigation loops are the agentic equivalent of an infinite while loop — the agent will keep finding "interesting" threads to pull without converging on a conclusion. Set a step budget and force a determination.

Response Orchestrator

The response orchestrator translates investigation findings into concrete response actions — and critically, enforces approval gates for high-risk actions.

Responsibilities:

  • Generate response action plans based on investigation findings
  • Classify actions by risk level (informational, reversible, destructive)
  • Enforce approval gates: auto-execute low-risk, require approval for high-risk
  • Coordinate actions across endpoint, network, identity, and cloud systems
  • Track action execution and rollback capabilities
from dataclasses import dataclass
from enum import Enum

class ActionRisk(Enum):
    LOW = "low"             # Informational: create ticket, send notification
    MEDIUM = "medium"       # Reversible: block IP, disable user
    HIGH = "high"           # Significant: isolate endpoint, revoke sessions
    CRITICAL = "critical"   # Destructive: wipe endpoint, terminate instances

class ApprovalStatus(Enum):
    AUTO_APPROVED = "auto_approved"
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"
    TIMED_OUT = "timed_out"

@dataclass
class ResponseAction:
    action_id: str
    action_type: str
    target: str
    parameters: dict
    risk_level: ActionRisk
    justification: str
    rollback_available: bool
    approval_status: ApprovalStatus = ApprovalStatus.PENDING

@dataclass
class ResponsePlan:
    investigation_id: str
    actions: list
    priority_order: list
    estimated_containment_time: int
    requires_human_approval: bool

class ResponseOrchestrator:
    """Translates investigation findings into coordinated response
    actions with risk-based approval gates and rollback capabilities."""

    def __init__(self, llm_client, action_registry, approval_service, state_store):
        self.llm, self.actions = llm_client, action_registry
        self.approvals, self.state = approval_service, state_store
        self.auto_approve_threshold = ActionRisk.MEDIUM

    async def create_response_plan(self, investigation_report: dict) -> ResponsePlan:
        """Generate a response plan from investigation findings."""
        recommended = await self._recommend_actions(investigation_report)
        actions = []
        for rec in recommended:
            risk = self._classify_risk(rec)
            action = ResponseAction(
                action_id=f"act-{investigation_report['investigation_id']}-{len(actions)}",
                action_type=rec["type"], target=rec["target"],
                parameters=rec.get("parameters", {}), risk_level=risk,
                justification=rec["justification"],
                rollback_available=rec.get("reversible", False))
            # ActionRisk values are strings, so compare by declaration order —
            # comparing .value lexicographically would auto-approve HIGH ("high" < "medium").
            if list(ActionRisk).index(risk) <= list(ActionRisk).index(
                    self.auto_approve_threshold):
                action.approval_status = ApprovalStatus.AUTO_APPROVED
            else:
                action.approval_status = ApprovalStatus.PENDING
            actions.append(action)

        return ResponsePlan(
            investigation_id=investigation_report["investigation_id"],
            actions=actions, priority_order=self._prioritize_actions(actions),
            estimated_containment_time=self._estimate_containment_time(actions),
            requires_human_approval=any(
                a.approval_status == ApprovalStatus.PENDING for a in actions))

    # Additional methods:
    # - execute_plan(): Runs approved actions in priority order with rollback on failure
    # - _request_approval(): Routes high-risk actions to humans with timeout escalation
    # - _execute_action(): Dispatches a single action to the target system handler
    # - _classify_risk(): Maps action types to LOW/MEDIUM/HIGH/CRITICAL risk levels
    # - _prioritize_actions(): Orders containment first, then eradication, then recovery
    # - _estimate_containment_time(): Sums expected durations for containment actions
    # - _recommend_actions(): LLM recommends response actions from investigation findings

Pro Tip: Always classify response actions by reversibility, not just severity. An analyst is far more likely to approve a "disable user account" (easily reversible) than a "wipe endpoint" (irreversible) — even when both are appropriate. Surface rollback procedures in the approval request to speed up human decision-making.
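A hypothetical sketch of the elided `_classify_risk()` along these lines — the action type sets a base risk, and irreversibility bumps it one level. The mappings are illustrative, not a canonical taxonomy:

```python
RISK_LEVELS = ["low", "medium", "high", "critical"]

BASE_RISK = {
    "create_ticket": "low", "send_notification": "low",
    "block_ip": "medium", "disable_user_account": "medium",
    "isolate_endpoint": "high", "revoke_sessions": "high",
    "wipe_endpoint": "critical", "terminate_cloud_instance": "critical",
}

def classify_risk(action_type: str, reversible: bool) -> str:
    """Base risk from action type; irreversible actions move up one level."""
    base = BASE_RISK.get(action_type, "high")  # unknown actions default to high
    idx = RISK_LEVELS.index(base)
    if not reversible:
        idx = min(idx + 1, len(RISK_LEVELS) - 1)
    return RISK_LEVELS[idx]

assert classify_risk("disable_user_account", reversible=True) == "medium"
assert classify_risk("block_ip", reversible=False) == "high"
```

Encoding reversibility directly in the risk level means the approval gate and the human reviewer are reasoning about the same thing: how hard this action is to undo.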

Documentation Agent

Incident documentation is the task every analyst hates and every compliance auditor demands. The documentation agent eliminates this burden by automatically generating structured incident reports from investigation and response data.

Responsibilities:

  • Real-time incident timeline generation
  • Structured report creation following organizational templates
  • Compliance artifact generation (evidence preservation, chain of custody)
  • Executive summary creation for non-technical stakeholders
  • MITRE ATT&CK mapping for technique coverage tracking
from dataclasses import dataclass
from datetime import datetime

# IncidentReport fields: incident_id, title, executive_summary, timeline,
# technical_analysis, affected_assets, response_actions, mitre_mappings,
# recommendations, compliance_artifacts, generated_at

class DocumentationAgent:
    """Automatically generates structured incident documentation
    from investigation and response data — eliminating the
    documentation burden while ensuring compliance."""

    def __init__(self, llm_client, template_store, state_store):
        self.llm, self.templates, self.state = llm_client, template_store, state_store

    async def generate_report(
        self, investigation: dict, response_results: dict, triage: dict,
    ) -> IncidentReport:
        """Generate a complete incident report from investigation data."""
        incident_id = investigation["investigation_id"]
        timeline = self._build_timeline(triage, investigation, response_results)
        exec_summary = await self._generate_executive_summary(
            triage, investigation, response_results)
        tech_analysis = await self._generate_technical_analysis(
            investigation, response_results)
        mitre = await self._map_to_mitre(investigation)
        compliance = self._generate_compliance_artifacts(
            incident_id, timeline, investigation, response_results)
        recommendations = await self._generate_recommendations(
            investigation, response_results)
        return IncidentReport(
            incident_id=incident_id,
            title=self._generate_title(triage, investigation),
            executive_summary=exec_summary, timeline=timeline,
            technical_analysis=tech_analysis,
            affected_assets=investigation.get("blast_radius", {}).get("affected_assets", []),
            response_actions=response_results.get("executed", []),
            mitre_mappings=mitre, recommendations=recommendations,
            compliance_artifacts=compliance,
            generated_at=datetime.utcnow().isoformat())

    # Additional methods:
    # - _build_timeline(): Chronological merge of triage, investigation, and response events
    # - _generate_executive_summary(): LLM creates CISO-audience summary (no jargon)
    # - _generate_technical_analysis(): LLM reconstructs attack chain with IOCs
    # - _map_to_mitre(): Maps findings to MITRE ATT&CK techniques with confidence
    # - _generate_compliance_artifacts(): NIST 800-61 structure, evidence hashes, custody chain
    # - _generate_recommendations(): Forward-looking security improvement suggestions

Pro Tip: Generate incident reports incrementally — don't wait until the investigation completes. Start the timeline the moment triage begins, update it as investigation progresses, and finalize after response. This gives stakeholders real-time visibility and ensures nothing is lost if the process is interrupted.
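The incremental pattern can be sketched as an append-only timeline that starts recording at triage. The `LiveTimeline` class below is a hypothetical illustration of the idea, not part of the documentation agent's interface shown above:

```python
from datetime import datetime, timezone

class LiveTimeline:
    """Append-only incident timeline, started the moment triage begins."""

    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.events: list[dict] = []

    def record(self, phase: str, description: str) -> None:
        self.events.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "phase": phase,
            "description": description,
        })

    def snapshot(self) -> dict:
        """Safe to render at any point — nothing is lost if the run aborts."""
        return {"incident_id": self.incident_id, "events": list(self.events)}

tl = LiveTimeline("inv-4242")
tl.record("triage", "Alert classified HIGH, routed to assisted investigation")
tl.record("investigation", "EDR query confirmed suspicious parent process")
partial_report = tl.snapshot()  # already useful mid-investigation
```

Because `snapshot()` is cheap and always consistent, stakeholders can poll it for live status instead of waiting for the final report.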

Learning Agent

The learning agent closes the feedback loop. After each incident, it analyzes what happened, what worked, what didn't, and feeds those lessons back into the system to improve future responses.

Responsibilities:

  • Post-incident analysis of triage accuracy, investigation efficiency, and response effectiveness
  • Pattern recognition across historical incidents
  • Triage model calibration based on outcomes
  • Detection gap identification
  • Runbook improvement recommendations
from dataclasses import dataclass
from typing import Optional

# LearningInsight(insight_type, description, evidence, recommendation, priority, applicable_to)
# FeedbackReport(incident_id, insights, triage_accuracy_score,
# investigation_efficiency_score, response_effectiveness_score, model_updates, detection_gaps)

class LearningAgent:
    """Post-incident analysis agent that closes the feedback loop.
    Improves triage accuracy, investigation efficiency, and
    response effectiveness over time."""

    def __init__(self, llm_client, metrics_store, model_registry, state_store):
        self.llm, self.metrics = llm_client, metrics_store
        self.models, self.state = model_registry, state_store

    async def analyze_incident(
        self, triage: dict, investigation: dict,
        response: dict, analyst_feedback: Optional[dict] = None,
    ) -> FeedbackReport:
        """Analyze a completed incident for improvement opportunities."""
        insights = []
        triage_score, ti = await self._evaluate_triage(triage, investigation, analyst_feedback)
        insights.extend(ti)
        inv_score, ii = await self._evaluate_investigation(investigation, analyst_feedback)
        insights.extend(ii)
        resp_score, ri = await self._evaluate_response(response, analyst_feedback)
        insights.extend(ri)

        detection_gaps = await self._identify_detection_gaps(triage, investigation)
        model_updates = await self._generate_model_updates(insights, triage, investigation)
        await self.metrics.record({
            "incident_id": investigation.get("investigation_id"),
            "triage_accuracy": triage_score, "investigation_efficiency": inv_score,
            "response_effectiveness": resp_score})

        return FeedbackReport(
            incident_id=investigation.get("investigation_id", ""),
            insights=insights, triage_accuracy_score=triage_score,
            investigation_efficiency_score=inv_score,
            response_effectiveness_score=resp_score,
            model_updates=model_updates, detection_gaps=detection_gaps)

    # Additional methods:
    # - _evaluate_triage(): Compares triage severity vs actual; penalizes misclassification
    # - _evaluate_investigation(): Flags excessive steps, redundant queries, low confidence
    # - _evaluate_response(): Scores execution success rate and approval bottlenecks
    # - _identify_detection_gaps(): Finds investigation-only findings with no detection rule
    # - _generate_model_updates(): Proposes severity weight and threshold adjustments

Pro Tip: Don't trust the learning agent to self-improve without human oversight. All model updates and threshold changes proposed by the learning agent should go through a review queue. Unsupervised self-modification is how you get triage drift — a gradual miscalibration that looks fine on a daily basis but is catastrophic over months.


Multi-Agent Coordination Patterns

Individual agents are only as effective as their ability to communicate, share state, and coordinate. This section covers the infrastructure patterns that make multi-agent orchestration work in production.

Message Bus Architecture

Agents communicate through a structured message bus that provides ordered delivery, persistence, and observability. Each message has a type, a source agent, a target agent (or broadcast), and a payload.

import asyncio
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Callable, Optional
from collections import defaultdict

class MessageType(Enum):
    TRIAGE_COMPLETE = "triage_complete"
    INVESTIGATION_COMPLETE = "investigation_complete"
    RESPONSE_PLAN_READY = "response_plan_ready"
    APPROVAL_REQUIRED = "approval_required"
    # Also: ACTION_EXECUTED, DOCUMENTATION_READY, LEARNING_INSIGHT, CIRCUIT_BREAKER

@dataclass
class AgentMessage:
    message_id: str
    message_type: MessageType
    source_agent: str
    target_agent: Optional[str]  # None = broadcast
    payload: dict
    correlation_id: str
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())

class AgentMessageBus:
    """Central message bus for inter-agent communication with ordered
    delivery, persistence, dead-letter handling, and observability."""

    def __init__(self, persistence_backend, metrics_collector):
        self.persistence, self.metrics = persistence_backend, metrics_collector
        self.subscribers: dict[str, list[Callable]] = defaultdict(list)
        self.dead_letter_queue: list = []

    async def publish(self, message: AgentMessage):
        """Publish a message to all subscribers with persistence and dead-letter handling."""
        await self.persistence.store(message)
        self.metrics.increment("agent_messages_published", tags={
            "type": message.message_type.value, "source": message.source_agent})
        handlers = self.subscribers.get(message.message_type.value, [])
        if not handlers:
            self.dead_letter_queue.append(message)
            return
        results = await asyncio.gather(
            *(self._deliver(h, message) for h in handlers), return_exceptions=True)
        if any(isinstance(r, Exception) for r in results):
            # Dead-letter the message once, even if multiple handlers failed
            self.dead_letter_queue.append(message)

    # Additional methods:
    # - subscribe(): Registers an agent handler for a message type
    # - _deliver(): Delivers with 30s timeout and error handling

    # SharedStateStore: Shared cross-agent investigation context with atomic
    # updates, versioning, and TTL-based expiry.
    # - update(): Atomically update state with optimistic locking
    # - get(): Retrieves current investigation state
    # - subscribe_changes(): Watches for state updates via callback
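The optimistic locking mentioned above can be sketched with a minimal in-memory store — an update lands only if the caller's version matches the stored version, otherwise the caller re-reads and retries. This is an illustration of the pattern, not the production `SharedStateStore`:

```python
class VersionConflict(Exception):
    pass

class InMemoryStateStore:
    """Toy optimistic-locking store: key -> (version, state)."""

    def __init__(self):
        self._data: dict[str, tuple[int, dict]] = {}

    def get(self, key: str) -> tuple[int, dict]:
        return self._data.get(key, (0, {}))

    def update(self, key: str, expected_version: int, patch: dict) -> int:
        version, state = self._data.get(key, (0, {}))
        if version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, at v{version}")
        self._data[key] = (version + 1, {**state, **patch})
        return version + 1

store = InMemoryStateStore()
v = store.update("inv-1", 0, {"current_step": 1})  # succeeds: v == 1
try:
    store.update("inv-1", 0, {"current_step": 2})  # stale version -> conflict
except VersionConflict:
    pass  # the losing agent re-reads state and retries its update
```

Version checks matter here because multiple agents write to the same investigation context concurrently; last-writer-wins would silently drop findings.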

Agent Lifecycle Orchestration

The orchestration layer manages the full lifecycle: agent spawning, health monitoring, and graceful degradation.

import asyncio

class AgentOrchestrator:
    """Top-level orchestrator: manages agent lifecycle and the
    end-to-end incident response pipeline."""

    def __init__(self, message_bus, state_store, config):
        self.bus, self.state, self.config = message_bus, state_store, config
        self.agents = {}

    async def handle_alert(self, raw_alert: dict):
        """Entry point: process an alert through the full agentic pipeline."""
        # Phase 1: Triage
        triage_result = await self.agents["triage"].triage(raw_alert)
        if triage_result.deduplicated:
            return  # Folded into an existing investigation — nothing to publish
        await self.bus.publish(AgentMessage(
            message_id=f"msg-triage-{triage_result.alert_id}",
            message_type=MessageType.TRIAGE_COMPLETE,
            source_agent="triage", target_agent="investigation",
            payload=triage_result.__dict__,
            correlation_id=triage_result.alert_id))

        # Phase 2: Investigation
        investigation = await self.agents["investigation"].investigate(
            triage_result.__dict__)
        await self.bus.publish(AgentMessage(
            message_id=f"msg-inv-{investigation.investigation_id}",
            message_type=MessageType.INVESTIGATION_COMPLETE,
            source_agent="investigation", target_agent="response",
            payload=investigation.__dict__,
            correlation_id=triage_result.alert_id))

        # Phase 3: Response
        plan = await self.agents["response"].create_response_plan(investigation.__dict__)
        results = await self.agents["response"].execute_plan(plan)

        # Phase 4: Documentation + Learning (parallel)
        report, feedback = await asyncio.gather(
            self.agents["documentation"].generate_report(
                investigation.__dict__, results, triage_result.__dict__),
            self.agents["learning"].analyze_incident(
                triage_result.__dict__, investigation.__dict__, results))

    # Additional methods:
    # - _health_check_loop(): Monitors agent health every 30s, triggers circuit breakers
    # - _handle_unhealthy_agent(): Publishes circuit breaker msg, falls back to manual

Pro Tip: Always run documentation and learning agents in parallel — they have no dependency on each other. This shaves meaningful time off your end-to-end incident processing. Pipeline parallelism is the easiest performance win in multi-agent systems.


Governance & Guardrails — The Cymantis View

Autonomous agents without governance are a liability. The Cymantis approach to agentic IR governance is built on four pillars: escalation thresholds, human-in-the-loop approval, comprehensive audit trails, and circuit breakers.

Escalation Policy Configuration

Governance policies should be defined declaratively — not embedded in agent code. This allows security leadership to adjust thresholds without code changes.

# agentic_ir_governance.yaml
# Cymantis Agentic IR Governance Policy v2.1

escalation_policy:
  severity_thresholds:
    critical:
      investigation_path: assisted  # Always human-in-the-loop
      max_autonomous_actions: 0     # No auto-execution for critical
      escalation_timeout_minutes: 5
      escalation_targets:
        - ir_lead
        - soc_manager
        - ciso  # After 15 minutes without response
    high:
      investigation_path: assisted
      max_autonomous_actions: 3     # Auto-execute up to 3 low-risk actions
      escalation_timeout_minutes: 15
      escalation_targets:
        - ir_lead
        - soc_manager
    medium:
      investigation_path: autonomous
      max_autonomous_actions: 10
      escalation_timeout_minutes: 60
      escalation_targets:
        - soc_analyst
    low:
      investigation_path: autonomous
      max_autonomous_actions: 20
      escalation_timeout_minutes: 240
      escalation_targets:
        - soc_analyst

approval_gates:
  action_risk_classification:
    auto_approve:
      - create_ticket
      - send_notification
      - enrich_ioc
      - add_watchlist
      - tag_asset
    require_approval:
      - block_ip
      - block_domain
      - disable_user_account
      - quarantine_email
      - isolate_endpoint
      - revoke_sessions
    require_senior_approval:
      - wipe_endpoint
      - terminate_cloud_instance
      - disable_service_account
      - modify_firewall_rule
      - revoke_api_keys

  approval_timeouts:
    standard: 300        # 5 minutes
    urgent: 60           # 1 minute for critical severity
    after_hours: 600     # 10 minutes outside business hours

  timeout_behavior:
    critical_severity: escalate_to_next_tier
    high_severity: escalate_to_next_tier
    medium_severity: hold_and_notify
    low_severity: auto_approve_reversible

circuit_breakers:
  global:
    max_actions_per_hour: 100
    max_critical_actions_per_hour: 10
    max_endpoints_isolated_per_hour: 5
    max_users_disabled_per_hour: 10

  per_investigation:
    max_steps: 20
    max_duration_minutes: 60
    max_api_calls: 200
    max_response_actions: 15

  kill_switch:
    enabled: true
    trigger_conditions:
      - consecutive_failures > 5
      - false_positive_rate > 0.3
      - response_actions_per_minute > 10
    action: pause_all_autonomous_actions
    notification:
      - soc_manager
      - ir_lead
      - ciso

audit_trail:
  retention_days: 2555  # 7 years for compliance
  fields_required:
    - timestamp
    - agent_name
    - action_type
    - target
    - justification
    - approval_status
    - approver_identity
    - outcome
    - evidence_hash
  tamper_protection: sha256_chain
  export_formats:
    - json
    - csv
    - syslog_cef

compliance_mappings:
  nist_800_61:
    detection: triage_agent
    analysis: investigation_agent
    containment: response_orchestrator
    eradication: response_orchestrator
    recovery: response_orchestrator
    post_incident: learning_agent
  nist_800_53:
    IR-4: "Incident Handling - agentic pipeline"
    IR-5: "Incident Monitoring - continuous agent health"
    IR-6: "Incident Reporting - documentation agent"
    AU-6: "Audit Review - audit trail + learning agent"
    SI-4: "System Monitoring - triage agent integration"
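
To make the approval_gates section concrete, here is a minimal Python sketch of a gate evaluator. The class and function names are ours, and in production the sets would be loaded from the YAML above rather than hard-coded — but note the key design choice: unknown action types fail closed.

```python
# Minimal gate evaluator for the approval_gates policy above.
# ApprovalLevel / classify_action are illustrative names.
from enum import Enum

class ApprovalLevel(Enum):
    AUTO = "auto_approve"
    ANALYST = "require_approval"
    SENIOR = "require_senior_approval"
    DENY = "deny_unknown"  # fail closed for actions not in the policy

# Mirrors the YAML; load from agentic_ir_governance.yaml in production.
APPROVAL_GATES = {
    ApprovalLevel.AUTO: {"create_ticket", "send_notification", "enrich_ioc",
                         "add_watchlist", "tag_asset"},
    ApprovalLevel.ANALYST: {"block_ip", "block_domain", "disable_user_account",
                            "quarantine_email", "isolate_endpoint",
                            "revoke_sessions"},
    ApprovalLevel.SENIOR: {"wipe_endpoint", "terminate_cloud_instance",
                           "disable_service_account", "modify_firewall_rule",
                           "revoke_api_keys"},
}

def classify_action(action_type: str) -> ApprovalLevel:
    """Return the approval tier for an action; unlisted actions fail closed."""
    for level, actions in APPROVAL_GATES.items():
        if action_type in actions:
            return level
    return ApprovalLevel.DENY
```

Failing closed matters: a new action type added to an agent without a matching policy entry should route to a human, not slip through as auto-approved.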

Cymantis Recommendations for Governance

  1. Start with assisted mode everywhere. Deploy all agents in assisted (human-in-the-loop) mode for the first 30 days. Use this period to calibrate confidence thresholds and build trust with the SOC team.

  2. Graduate autonomy by action type, not by severity. Even in autonomous mode, destructive actions should always require approval. A medium-severity incident can still justify endpoint isolation if the investigation supports it — but a human should approve it.

  3. Audit everything, even autonomous actions. Every agent decision, every enrichment query, every response action must have a complete audit trail with evidence hashes. This isn't just compliance — it's debugging infrastructure.

  4. Implement kill switches at every level. Global kill switch for all autonomous operations. Per-agent kill switches. Per-investigation kill switches. When an agent goes off the rails, you need to stop it in seconds, not minutes.

  5. Review circuit breaker triggers monthly. As your environment changes — new endpoints, new SaaS integrations, new threat patterns — the boundaries of "normal" agent behavior shift. Circuit breaker thresholds need to track those changes.
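
The max_actions_per_hour limits in the circuit_breakers block reduce to a sliding-window counter. A minimal sketch (ActionCircuitBreaker is a hypothetical name):

```python
# Sliding-window circuit breaker: allow() records an action attempt and
# returns False once the window is full — i.e., the breaker is open.
import time
from collections import deque
from typing import Optional

class ActionCircuitBreaker:
    def __init__(self, max_per_window: int, window_seconds: float = 3600.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._events: deque = deque()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict events that have aged out of the window
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_per_window:
            return False
        self._events.append(now)
        return True

breaker = ActionCircuitBreaker(max_per_window=5)
allowed = [breaker.allow(now=float(i)) for i in range(7)]
print(allowed)  # five actions pass, then the breaker opens
```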

Pro Tip: Build a governance dashboard that shows every autonomous action taken in the last 24 hours, with the full reasoning chain. Make it the first thing the SOC manager reviews every morning. Transparency is what separates trustworthy autonomy from a compliance nightmare.
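
The sha256_chain setting in the audit_trail config refers to hash chaining: each record's hash covers both its content and the previous record's hash, so editing any entry invalidates every later link. A sketch, assuming JSON-serializable records (function names are ours):

```python
# Hash-chained audit trail: tampering with any record breaks verification.
import hashlib
import json

def chain_append(log: list, record: dict) -> dict:
    """Append a record, hashing its content together with the previous hash."""
    prev = log[-1]["evidence_hash"] if log else "0" * 64
    body = json.dumps(record, sort_keys=True)
    entry = dict(record, prev_hash=prev,
                 evidence_hash=hashlib.sha256((prev + body).encode()).hexdigest())
    log.append(entry)
    return entry

def chain_verify(log: list) -> bool:
    """Recompute every hash; any edit anywhere returns False."""
    prev = "0" * 64
    for e in log:
        body = json.dumps({k: v for k, v in e.items()
                           if k not in ("prev_hash", "evidence_hash")},
                          sort_keys=True)
        if e["prev_hash"] != prev:
            return False
        if hashlib.sha256((prev + body).encode()).hexdigest() != e["evidence_hash"]:
            return False
        prev = e["evidence_hash"]
    return True
```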


Migration Guide: SOAR to Agentic Orchestration

Migrating from traditional SOAR to agentic orchestration isn't a rip-and-replace. It's a graduated transition that preserves your existing investment while layering on agentic capabilities.

Step 1: Audit Existing Playbooks

Before building anything new, audit what you have. Map every playbook to its trigger, its data sources, its actions, and its maintenance history.

#!/bin/bash
# playbook_audit.sh — Inventory existing SOAR playbooks

echo "=== SOAR Playbook Audit ==="
echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

# Export playbook inventory from SOAR platform
# Adapt for your platform: Splunk SOAR, Palo Alto XSOAR, etc.
soar-cli playbooks list --format json | python3 -c "
import json, sys

playbooks = json.load(sys.stdin)
print(f'Total playbooks: {len(playbooks)}')
print(f'Active: {sum(1 for p in playbooks if p[\"status\"] == \"active\")}')
print(f'Disabled: {sum(1 for p in playbooks if p[\"status\"] == \"disabled\")}')
print()

# Identify candidates for agentic migration
for pb in sorted(playbooks, key=lambda x: x.get('execution_count', 0), reverse=True):
    print(f'  [{pb[\"status\"]:8s}] {pb[\"name\"]:50s} '
          f'executions={pb.get(\"execution_count\", 0):6d} '
          f'success_rate={pb.get(\"success_rate\", 0):.1%} '
          f'last_modified={pb.get(\"last_modified\", \"unknown\")}')
"

Step 2: Classify Playbooks by Migration Priority

Not every playbook needs to become an agent. Classify them:

  • Retire: Playbooks with <5 executions/month or <50% success rate. These are broken or irrelevant — remove them.
  • Keep as automation: Simple, deterministic workflows (password resets, ticket creation, notification routing). These don't need AI — keep them as lightweight automations.
  • Convert to agent capability: Complex investigation and response playbooks that require multi-step reasoning, cross-source enrichment, or adaptive logic. These are your migration targets.
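
These triage rules are easy to mechanize against the inventory from Step 1. A hedged sketch — the executions_per_month and complexity fields are illustrative, so derive them from whatever metadata your SOAR platform actually exports:

```python
# Classify playbooks per the rules above: retire (<5 runs/month or <50%
# success), keep simple deterministic workflows as plain automation,
# convert the complex ones to agent capabilities.
def classify_playbook(pb: dict) -> str:
    monthly = pb.get("executions_per_month", 0)
    success = pb.get("success_rate", 0.0)
    if monthly < 5 or success < 0.5:
        return "retire"
    if pb.get("complexity", "simple") == "simple":  # hypothetical field
        return "keep_as_automation"
    return "convert_to_agent_capability"

inventory = [
    {"name": "notify-oncall", "executions_per_month": 300,
     "success_rate": 0.99, "complexity": "simple"},
    {"name": "phishing-triage", "executions_per_month": 120,
     "success_rate": 0.82, "complexity": "complex"},
    {"name": "legacy-vpn-block", "executions_per_month": 1,
     "success_rate": 0.40},
]
for pb in inventory:
    print(pb["name"], "->", classify_playbook(pb))
```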

Step 3: Build the Agent Foundation

Deploy the shared infrastructure first: message bus, state store, governance policy engine, and audit trail.

# docker-compose.agentic-ir.yaml
version: "3.8"

services:
  message-bus:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes

  state-store:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: agentic_ir
      POSTGRES_USER: ir_system
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - pg_data:/var/lib/postgresql/data
    secrets:
      - db_password

  vector-store:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

  governance-engine:
    build: ./services/governance
    environment:
      POLICY_PATH: /config/governance.yaml
      STATE_STORE_URL: postgresql://ir_system@state-store/agentic_ir
      MESSAGE_BUS_URL: redis://message-bus:6379
    volumes:
      - ./config:/config:ro
    depends_on:
      - state-store
      - message-bus

  audit-logger:
    build: ./services/audit
    environment:
      STATE_STORE_URL: postgresql://ir_system@state-store/agentic_ir
      LOG_RETENTION_DAYS: 2555
      TAMPER_PROTECTION: sha256_chain
    depends_on:
      - state-store

volumes:
  redis_data:
  pg_data:
  qdrant_data:

secrets:
  db_password:
    file: ./secrets/db_password.txt

Step 4: Convert Playbooks to Agent Capabilities

For each playbook targeted for conversion, extract the intent (what it's trying to accomplish) and the tools it uses — then map those to agent capabilities.

Example: Phishing Playbook Conversion

Before (SOAR playbook — pseudocode):

# Traditional SOAR phishing playbook — rigid, linear
def phishing_playbook(alert):
    # Step 1: Extract observables (fixed extraction)
    urls = extract_urls(alert.email_body)
    attachments = extract_attachments(alert.email)
    sender = alert.sender_address

    # Step 2: Enrich (fixed sequence)
    url_results = virustotal_lookup(urls)
    attachment_results = sandbox_detonate(attachments)
    sender_rep = check_sender_reputation(sender)

    # Step 3: Decision (static logic)
    if any(r.malicious for r in url_results + attachment_results):
        quarantine_email(alert.message_id)
        block_sender(sender)
        create_ticket("malicious_phishing", alert)
    elif sender_rep.score < 0.3:
        quarantine_email(alert.message_id)
        create_ticket("suspicious_phishing", alert)
    else:
        close_alert(alert, "benign")

After (agent capability — dynamic, adaptive):

# Agentic phishing investigation — dynamic, context-aware
class PhishingInvestigationCapability:
    """Replaces the static phishing playbook with dynamic investigation.
    The agent adapts based on what it finds — just like a human analyst."""

    async def investigate(self, alert: dict, agent_context: dict) -> dict:
        findings = []
        observables = await self.extract_observables_adaptive(alert)

        for obs in observables:
            result = await self.investigate_observable(obs, agent_context)
            findings.append(result)
            if result.get("suspicious"):  # Adaptive pivot: dig deeper
                findings.extend(await self.adaptive_pivot(obs, result, agent_context))

        # Check for coordinated campaign (not possible in static playbooks)
        campaign = await self.check_campaign_indicators(
            observables, findings, agent_context)
        if campaign.get("match"):
            findings.append({"type": "campaign_match",
                "campaign_id": campaign["campaign_id"],
                "confidence": campaign["confidence"], "severity": "high",
                "summary": f"Part of active campaign: {campaign['name']}"})

        findings.extend(await self.check_sender_history(alert, agent_context))
        return {
            "findings": findings,
            "confidence": self.calculate_confidence(findings),
            "recommended_actions": self.recommend_actions(findings)}

    # Additional methods:
    # - investigate_observable(): Multi-source lookup (VT, URLScan, WHOIS, sandbox)
    # - adaptive_pivot(): Searches for related emails from suspicious domains
    # - extract_observables_adaptive(): Handles URLs, QR codes, attachments, callbacks
    # - check_campaign_indicators(): Correlates against known active campaigns
    # - check_sender_history(): Historical context for sender/domain patterns

Step 5: Run Shadow Mode

Deploy agents in shadow mode — they process every alert alongside your existing SOAR playbooks, but take no response actions. Compare results over 30 days:

  • Did the agent classify severity accurately?
  • Did the agent identify the same findings as the playbook?
  • Did the agent find anything the playbook missed?
  • Did the agent produce actionable response recommendations?
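
The shadow-mode comparison can be scored mechanically. An illustrative sketch that measures severity agreement and net-new findings against the legacy playbook's verdicts on the same alerts (the Verdict shape and field names are ours — adapt to what your pipelines actually emit):

```python
# Score agent verdicts against legacy playbook verdicts in shadow mode.
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    alert_id: str
    severity: str
    findings: frozenset

def shadow_compare(agent: list, playbook: list) -> dict:
    by_id = {v.alert_id: v for v in playbook}
    matched = agreed = net_new = 0
    for av in agent:
        pv = by_id.get(av.alert_id)
        if pv is None:
            continue  # alert not processed by the playbook — skip
        matched += 1
        agreed += av.severity == pv.severity
        net_new += len(av.findings - pv.findings)  # agent-only findings
    return {
        "severity_agreement": agreed / matched if matched else 0.0,
        "net_new_findings": net_new,
    }
```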

Step 6: Graduate to Production

Based on shadow mode metrics, graduate agents to production by capability and by severity level:

  1. Week 1–4: Agents handle low-severity investigations autonomously. Medium and above remain in assisted mode.
  2. Week 5–8: Agents handle medium-severity investigations autonomously (low-risk actions only). High and critical remain assisted.
  3. Week 9–12: Agents handle all severities, with approval gates for high-risk response actions at every severity level.
  4. Ongoing: Continuous tuning via the learning agent feedback loop.
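
The graduation schedule above reduces to a simple gate. An illustrative sketch — remember that the per-action approval gates still apply even when the mode is autonomous:

```python
# Map (severity, rollout week) to investigation mode, per the schedule above.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def autonomy_mode(severity: str, rollout_week: int) -> str:
    if rollout_week <= 4:
        ceiling = SEVERITY_ORDER["low"]       # weeks 1–4: low only
    elif rollout_week <= 8:
        ceiling = SEVERITY_ORDER["medium"]    # weeks 5–8: low + medium
    else:
        ceiling = SEVERITY_ORDER["critical"]  # week 9+: all severities
    return "autonomous" if SEVERITY_ORDER[severity] <= ceiling else "assisted"
```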

Step 7: Decommission Legacy Playbooks

As agents prove themselves, retire the corresponding SOAR playbooks. But don't delete them — archive them as reference material for the learning agent and as fallback procedures.

Pro Tip: Keep a "break glass" SOAR playbook for every critical response capability. If the agentic system goes down, you need a manual fallback. Test these quarterly, just like disaster recovery runbooks.


Measuring Agentic IR Performance

You can't improve what you don't measure. Agentic IR introduces new dimensions to incident response metrics — not just speed, but accuracy, efficiency, and analyst satisfaction.

Key Metrics

Metric                          | Definition                                 | Traditional SOC Baseline | Agentic IR Target
--------------------------------|--------------------------------------------|--------------------------|--------------------------
MTTD (Mean Time to Detect)      | Time from attack start to alert generation | 197 days (IBM 2024)      | <24 hours
MTTI (Mean Time to Investigate) | Time from alert to investigation complete  | 4–8 hours                | <5 minutes
MTTR (Mean Time to Respond)     | Time from alert to containment complete    | 8–24 hours               | <15 minutes
Containment Time                | Time from response decision to containment | 30–120 minutes           | <2 minutes
Triage Accuracy                 | Correct severity classification rate       | 60–75%                   | >92%
False Positive Rate             | Alerts closed as false positive            | 80%+                     | <30% of escalated alerts
Investigation Depth             | Data sources consulted per investigation   | 2–3                      | 5–8
Analyst Satisfaction            | Survey-based satisfaction with tools       | 3.2/5                    | 4.5/5
Autonomous Resolution Rate      | Incidents resolved without human touch     | 0–10%                    | 60–80% for low/medium

Dashboard Queries

Track these metrics in real time with SIEM dashboard panels.

Mean Time to Investigate (MTTI) — Splunk SPL:

index=agentic_ir sourcetype=investigation_metrics
| eval investigation_duration_min = (investigation_end - investigation_start) / 60
| stats
    avg(investigation_duration_min) as avg_mtti_min
    median(investigation_duration_min) as median_mtti_min
    perc95(investigation_duration_min) as p95_mtti_min
    count as total_investigations
    by severity
| eval avg_mtti_min = round(avg_mtti_min, 2)
| eval median_mtti_min = round(median_mtti_min, 2)
| sort - severity

Triage Accuracy Over Time — Splunk SPL:

index=agentic_ir sourcetype=learning_metrics
| eval accurate = if(triage_severity == actual_severity, 1, 0)
| bin _time span=1d
| stats
    avg(accurate) as daily_accuracy
    count as daily_volume
    avg(triage_accuracy_score) as avg_triage_score
    by _time
| eval daily_accuracy = round(daily_accuracy * 100, 1)
| eval avg_triage_score = round(avg_triage_score * 100, 1)

Response Action Effectiveness — Splunk SPL:

index=agentic_ir sourcetype=response_metrics
| stats
    count(eval(action_status="executed")) as actions_executed
    count(eval(action_status="failed")) as actions_failed
    count(eval(action_status="pending_approval")) as actions_pending
    count(eval(action_status="auto_approved")) as actions_auto_approved
    avg(containment_time_seconds) as avg_containment_sec
    by action_type
| eval success_rate = round(actions_executed / (actions_executed + actions_failed) * 100, 1)
| eval avg_containment_sec = round(avg_containment_sec, 1)
| sort - actions_executed

Agent Health Dashboard — Splunk SPL:

index=agentic_ir sourcetype=agent_heartbeat
| stats
    latest(status) as current_status
    latest(last_action_time) as last_active
    avg(processing_time_ms) as avg_processing_ms
    count(eval(status="error")) as error_count
    count as total_heartbeats
    by agent_name
| eval health = case(
    current_status == "healthy" AND error_count < 5, "GREEN",
    current_status == "healthy" AND error_count >= 5, "YELLOW",
    current_status == "degraded", "YELLOW",
    1==1, "RED"
  )
| table agent_name health current_status last_active avg_processing_ms error_count

Autonomous vs. Assisted Resolution — Executive KPI:

index=agentic_ir sourcetype=incident_metrics
| eval resolution_type = case(
    investigation_path == "autonomous" AND human_intervention == 0, "Fully Autonomous",
    investigation_path == "assisted" AND human_intervention == 1, "Human Assisted",
    investigation_path == "manual", "Manual",
    1==1, "Other"
  )
| stats count by resolution_type
| eventstats sum(count) as total
| eval percentage = round(count / total * 100, 1)
| sort - count

Pro Tip: Track analyst satisfaction monthly with a simple 5-question survey. The best metric for agentic IR success isn't MTTR — it's whether your analysts feel like the system helps them or creates new problems. If satisfaction drops, your agents are getting in the way, not getting out of it.


The Cymantis View: Where This Is All Heading

The shift from SOAR playbooks to agentic orchestration is not optional — it's inevitable. The same forces that made SOAR necessary (alert volume, staffing shortages, adversary speed) have now made SOAR insufficient. Rigid playbooks can't keep pace with polymorphic threats, multi-vector attacks, and the sheer volume of telemetry that modern environments generate.

But the transition requires engineering discipline. Multi-agent systems introduce new failure modes: agent hallucination, reasoning loops, cascade failures, state corruption, and governance drift. Organizations that deploy agents without the governance infrastructure to constrain them will discover that an autonomous agent making bad decisions at machine speed is worse than no automation at all.

The Cymantis position is clear:

  1. Agentic IR is the future of incident response. Organizations that invest now will have a 12–18 month advantage over those that wait.
  2. Governance is not optional. Every autonomous capability must have an equally robust governance mechanism. No exceptions.
  3. Start small, prove value, then scale. Deploy a single triage agent in shadow mode. Prove it classifies correctly. Then add investigation. Then response. Each step earns trust.
  4. Measure relentlessly. If you can't show that your agentic system outperforms your current playbooks on MTTI, MTTR, accuracy, and analyst satisfaction, you haven't earned the right to scale it.
  5. Keep humans in the loop for high-stakes decisions. Full autonomy is not the goal. Effective human-AI collaboration is. The best SOCs will be small teams of senior analysts governing fleets of agents — not watching dashboards.

The alert fatigue era is ending. The playbook maintenance era is ending. What comes next is a SOC where analysts spend their time on judgment, strategy, and adversary engagement — and agents handle everything else.


Final Thoughts

SOAR was a necessary step in the evolution of incident response. It proved that automation has a place in security operations. But static playbooks were always a stopgap — a way to encode yesterday's response procedures for yesterday's threats.

The multi-agent approach isn't just faster; it's fundamentally different. Agents don't follow scripts — they reason about evidence, adapt to context, and coordinate dynamically. They investigate like your best analyst and document like your most diligent one. They learn from every incident and get better over time.

The organizations that move first will compound their advantage. Every incident their agents process makes them smarter, faster, more accurate. The learning loop creates a flywheel: better triage leads to more focused investigations, which leads to more precise responses, which generates better training data for the next cycle.

For security leaders evaluating this transition: don't wait for the perfect platform. The building blocks exist today — LLM APIs, agent orchestration frameworks, vector databases for memory, and the governance patterns outlined in this post. The gap between "theoretically possible" and "production-ready" is engineering effort, not research breakthrough.

Start with shadow mode. Measure everything. Graduate autonomy as trust is earned. Keep humans where they matter most — at the decision boundary between "safe to automate" and "too consequential to delegate."

The future SOC isn't bigger. It's smarter. Build it.

Cymantis Labs helps security teams design, deploy, and govern agentic incident response architectures — from SOAR migration assessments to full multi-agent orchestration in production. We bring the engineering rigor and operational experience to make autonomous IR production-safe.



For more insights or to schedule a Cymantis Agentic IR Assessment, contact our research and automation team at cymantis.com.