From SOAR Playbooks to Agentic Orchestration: Modernizing Incident Response
Why rigid SOAR playbooks fail at scale and how multi-agent orchestration delivers dramatically better incident response — with a technical migration guide from traditional SOAR to agentic IR.
By Cymantis Labs
SOAR was supposed to be the answer. Security Orchestration, Automation, and Response platforms promised to eliminate the manual drudgery of incident response — to give overwhelmed SOC teams the ability to codify their best analysts' workflows into reusable playbooks and execute them at machine speed.
Instead, most organizations are drowning in a different kind of complexity. They manage 200+ playbooks, each one a brittle chain of if/then/else logic that works perfectly for the exact scenario it was designed for — and breaks the moment a threat deviates from the script. Playbook maintenance has become a full-time job. The "automation" that was supposed to free analysts has created a new category of toil: playbook engineering.
Here's the data that should make every IR leader pause: research on multi-agent AI systems for incident response shows that agentic orchestration delivers 100% actionable recommendations compared to just 1.7% for single-agent approaches, with 80x better action specificity. Multi-agent systems don't just classify alerts — they investigate, contextualize, and recommend precise response actions with full justification chains.
The gap between what SOAR promised and what agentic orchestration delivers isn't incremental. It's generational. This post lays out the technical architecture, the migration path, and the governance model for making the shift — with working code, real configurations, and the hard-won lessons from teams that have already made the transition.
Why SOAR Playbooks Fail at Scale
Before building the replacement, we need to be honest about why SOAR fails. Not in theory — in practice, in production, at scale. The failure modes are structural, not incidental.
The Brittleness Problem
A SOAR playbook is a static decision tree authored at a specific point in time by an analyst who understood a specific threat scenario. It encodes assumptions about alert format, data availability, API responses, and threat behavior. When any of those assumptions change — and they always change — the playbook either fails silently or produces incorrect results.
Consider a phishing playbook designed in 2024. It expects an email alert from your SEG, extracts URLs and attachments, detonates them in a sandbox, checks the sender domain against threat intel, and makes a disposition. Now a threat actor uses QR codes instead of URLs. The extraction step finds nothing. The playbook marks the alert as clean. The phishing email lands. Nobody knows until the credential harvest succeeds.
The playbook didn't fail technically — it executed perfectly. It just couldn't handle a scenario its author didn't anticipate. And this is the fundamental problem: static logic cannot adapt to novel threats.
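The failure mode is easy to reproduce in miniature. Here is an illustrative sketch of the kind of static extraction branch such a playbook encodes; the logic and field names are invented for illustration, not taken from any specific SOAR product:

```python
import re

# Illustrative only: a static SOAR-style extraction branch. The playbook's
# disposition logic assumes every phish carries a URL or an attachment.
def static_phishing_branch(email: dict) -> str:
    urls = re.findall(r"https?://\S+", email.get("body", ""))
    attachments = email.get("attachments", [])
    if not urls and not attachments:
        # No extractable artifacts, so the 2024-era logic concludes "clean"
        return "clean"
    return "detonate_and_check"

# A QR-code phish: the lure is an embedded image, so nothing extracts.
qr_phish = {"body": "Scan the code below to update your payroll details.",
            "attachments": []}
print(static_phishing_branch(qr_phish))  # the playbook waves it through
```

The branch executes without error on every input it sees; the gap is invisible until the scenario its author never imagined arrives.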
Playbook Sprawl
Organizations that take SOAR seriously end up managing hundreds of playbooks:
- 20+ playbooks for phishing variants (URL, attachment, BEC, QR, callback)
- 15+ playbooks for endpoint alerts (malware, suspicious process, lateral movement, persistence)
- 10+ playbooks for identity events (impossible travel, credential stuffing, privilege escalation)
- 30+ playbooks for cloud security (IAM changes, S3 exposure, security group modifications)
- 50+ enrichment sub-playbooks (IP reputation, domain analysis, file hash lookup, user context)
Each playbook needs its own error handling, its own input validation, its own API integrations, and its own test cases. When a vendor changes an API endpoint, every playbook that calls it breaks. When the SIEM schema changes, every playbook that parses its output breaks. The combinatorial complexity is staggering.
The Maintenance Burden
A 2025 survey by the Ponemon Institute found that organizations spend an average of 12 FTE hours per week maintaining SOAR playbooks — updating integrations, fixing broken API calls, adding new branches for threat variants, and testing changes. That's nearly a third of a full-time security engineer dedicated entirely to keeping the automation running — not improving it, not building new capabilities, just preventing decay.
Worse, most organizations don't have adequate testing for playbooks. Changes are tested manually (if at all), deployed to production, and validated only when a real incident triggers them. The blast radius of a broken playbook is an unhandled incident.
The False Sense of Automation
Perhaps the most insidious failure: SOAR creates an illusion of coverage. Leadership sees 200 playbooks and assumes 200 scenarios are handled. In reality, many of those playbooks haven't been tested against current threat behavior, half have broken integrations that nobody has noticed, and the remaining ones handle the easy cases that analysts could resolve in minutes anyway.
The hard cases — the novel threats, the multi-stage attacks, the adversaries who deliberately evade your documented procedures — still land on an analyst's desk with no automation support. SOAR handles the 80% of incidents that don't really need automation and fails on the 20% that desperately do.
Pro Tip: Audit your SOAR playbooks quarterly. Track execution success rates, mean execution time, and — critically — how often a playbook completes but the incident still requires manual analyst intervention. That last metric reveals your true automation gap.
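That last metric can be computed directly from execution records. A minimal sketch, assuming hypothetical status and manual_followup fields on each record:

```python
# Sketch of the quarterly audit metric from the tip above. The record
# fields (playbook, status, manual_followup) are hypothetical.
def automation_gap(executions: list[dict]) -> float:
    """Fraction of successful playbook runs that still needed an analyst."""
    completed = [e for e in executions if e["status"] == "success"]
    if not completed:
        return 0.0
    followups = sum(1 for e in completed if e["manual_followup"])
    return followups / len(completed)

runs = [
    {"playbook": "phishing_url", "status": "success", "manual_followup": False},
    {"playbook": "phishing_url", "status": "success", "manual_followup": True},
    {"playbook": "impossible_travel", "status": "failed", "manual_followup": True},
    {"playbook": "endpoint_malware", "status": "success", "manual_followup": True},
]
print(f"True automation gap: {automation_gap(runs):.0%}")  # 2 of 3 completed runs
```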
SOAR vs. Agentic Orchestration — A Side-by-Side Comparison
The difference between SOAR and agentic orchestration isn't just technical — it's philosophical. SOAR encodes procedures. Agentic orchestration encodes reasoning.
Comparison Matrix
| Dimension | Traditional SOAR | Agentic Orchestration |
|---|---|---|
| Decision Logic | Static decision trees, if/then/else branches | Dynamic reasoning with LLM-powered analysis |
| Adaptability | Requires manual playbook updates for new scenarios | Adapts to novel threats using learned patterns and reasoning |
| Context Retention | Stateless — each execution starts from zero | Maintains investigation context across steps and incidents |
| Multi-Step Reasoning | Pre-defined enrichment sequence, fixed order | Dynamic investigation — each step informs the next |
| Maintenance Overhead | 12+ FTE hours/week for playbook maintenance | Self-improving; agents learn from outcomes |
| Novel Threat Handling | Fails silently or escalates without context | Reasons about unfamiliar patterns using threat knowledge |
| Error Recovery | Hard-coded error handlers per integration | Graceful degradation with alternative investigation paths |
| Action Specificity | Generic response actions (block IP, isolate host) | Precise, justified response recommendations with blast radius analysis |
| Scalability | Linear — each new scenario needs a new playbook | Emergent — agents compose capabilities dynamically |
| Audit Trail | Execution logs of playbook steps | Full reasoning chains with evidence and confidence scores |
Architecture Comparison
Traditional SOAR Architecture:
graph LR
siemAlert["SIEM Alert"] --> playbookEngine["Playbook Engine<br/>(Static Logic)"]
playbookEngine --> responseActions["Response Actions"]
playbookEngine --> enrichment["Enrichment<br/>(Fixed)"]
Agentic Orchestration Architecture:
graph TD
siemAlert["SIEM Alert"] --> triageAgent
subgraph orchestrationLayer["Agent Orchestration Layer"]
direction LR
triageAgent["Triage Agent"] --> investigationAgent["Investigation Agent"]
investigationAgent --> responseOrchestrator["Response Orchestrator"]
triageAgent --> sharedState["Shared State & Memory"]
investigationAgent --> sharedState
responseOrchestrator --> sharedState
sharedState --> learningAgent["Learning Agent"]
sharedState --> documentationAgent["Documentation Agent"]
sharedState --> governanceLayer["Governance Layer"]
end
The critical difference is the shared state layer. In SOAR, each playbook execution is an island. In an agentic system, every agent contributes to and draws from a shared investigation context. The triage agent's severity classification informs the investigation agent's depth. The investigation agent's findings shape the response orchestrator's actions. The learning agent's historical analysis improves every future triage decision.
Pro Tip: When evaluating agentic IR platforms, ask specifically about state management. If the system can't show you how Agent A's output influences Agent B's reasoning in real time, it's SOAR with a chatbot bolted on — not true agentic orchestration.
The Agentic Incident Response Architecture
This is the core technical section. We'll walk through each agent in the agentic IR system, its responsibilities, its interfaces, and its implementation.
Triage Agent
The triage agent is the front door. Every alert hits this agent first. Its job is not to investigate — it's to classify, prioritize, and route.
Responsibilities:
- Receive raw alerts from SIEM, EDR, cloud, and identity sources
- Normalize alert data into a common schema
- Classify severity using contextual signals (asset criticality, user role, threat intel)
- Determine the investigation path (automated, assisted, or manual escalation)
- Deduplicate against active investigations
from dataclasses import dataclass, field
from enum import Enum
class Severity(Enum):
CRITICAL = "critical"
HIGH = "high"
MEDIUM = "medium"
LOW = "low"
INFORMATIONAL = "informational"
class InvestigationPath(Enum):
AUTONOMOUS = "autonomous"
ASSISTED = "assisted"
MANUAL = "manual"
# TriageResult(alert_id, severity, investigation_path, confidence,
# reasoning, enrichment_hints, deduplicated, parent_investigation_id)
class TriageAgent:
"""Front-door agent: classifies, prioritizes, and routes alerts
using weighted contextual severity scoring."""
def __init__(self, llm_client, asset_db, threat_intel, active_investigations):
self.llm, self.asset_db = llm_client, asset_db
self.threat_intel, self.active_investigations = threat_intel, active_investigations
self.severity_weights = {
"asset_criticality": 0.30, "threat_intel_match": 0.25,
"behavioral_anomaly": 0.20, "alert_fidelity": 0.15,
"temporal_proximity": 0.10}
async def triage(self, raw_alert: dict) -> TriageResult:
"""Main triage pipeline: normalize, deduplicate, classify, route."""
normalized = self._normalize_alert(raw_alert)
dedup_result = await self._check_deduplication(normalized)
if dedup_result:
return dedup_result
context = await self._gather_triage_context(normalized)
severity, confidence = self._calculate_severity(context)
path = self._determine_path(severity, confidence, context)
reasoning = await self._generate_reasoning(normalized, context, severity, path)
return TriageResult(
alert_id=normalized["alert_id"], severity=severity,
investigation_path=path, confidence=confidence,
reasoning=reasoning, enrichment_hints=self._suggest_enrichments(context))
# Additional methods:
# - _gather_triage_context(): Pulls asset criticality, threat intel, alert history
# - _calculate_severity(): Weighted scoring across 5 contextual signals
# - _determine_path(): Routes to autonomous/assisted/manual by severity + confidence
# - _check_deduplication(): Fingerprints alerts against active investigations
# - _normalize_alert(): Converts raw alerts into common schema
# - _suggest_enrichments(): Recommends deep TI, vuln scans, IOC sweeps
# - _generate_reasoning(): LLM-generated human-readable triage justification
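As a concrete illustration of the weighted scoring described in the comments above, here is a hedged sketch of what _calculate_severity might look like. The weights mirror the agent's configuration; the severity band cutoffs are assumptions, not from the original:

```python
# Hedged sketch of contextual severity scoring. Signal scores are assumed
# to be normalized to [0, 1]; the band cutoffs are illustrative.
SEVERITY_WEIGHTS = {
    "asset_criticality": 0.30, "threat_intel_match": 0.25,
    "behavioral_anomaly": 0.20, "alert_fidelity": 0.15,
    "temporal_proximity": 0.10,
}

def calculate_severity(signals: dict[str, float]) -> tuple[str, float]:
    score = sum(SEVERITY_WEIGHTS[name] * signals.get(name, 0.0)
                for name in SEVERITY_WEIGHTS)
    bands = [(0.8, "critical"), (0.6, "high"), (0.4, "medium"), (0.2, "low")]
    for cutoff, label in bands:
        if score >= cutoff:
            return label, score
    return "informational", score

severity, score = calculate_severity({
    "asset_criticality": 1.0,   # e.g. a domain controller
    "threat_intel_match": 0.9,  # e.g. known C2 infrastructure
    "behavioral_anomaly": 0.7,
    "alert_fidelity": 0.8,
    "temporal_proximity": 0.3,
})
print(severity, round(score, 3))
```

Because asset criticality and threat intel carry more than half the weight, the same raw alert lands in different severity bands depending on what it touched and what it matched.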
Pro Tip: The triage agent should never make response decisions — only classification and routing decisions. Keep triage fast (sub-second) by limiting enrichment to lightweight lookups. Deep enrichment belongs in the investigation agent.
Investigation Agent
The investigation agent is where agentic orchestration truly separates itself from SOAR. Instead of a fixed enrichment sequence, the investigation agent conducts a dynamic, multi-step investigation where each step informs the next.
Responsibilities:
- Multi-source enrichment (threat intel, CMDB, vulnerability data, historical incidents)
- Dynamic investigation planning — choose next steps based on findings
- Evidence correlation across data sources
- Confidence scoring for findings
- Blast radius assessment
from dataclasses import dataclass, field
# InvestigationStep(tool, query, result, duration_ms, confidence, findings)
# InvestigationReport(investigation_id, alert_id, steps, findings, confidence,
# blast_radius, recommended_actions, evidence_chain, timeline)
class InvestigationAgent:
"""Conducts dynamic, multi-step investigations where each enrichment
step informs the next — mirroring how a senior analyst works."""
def __init__(self, llm_client, tool_registry, state_store):
self.llm, self.tools, self.state = llm_client, tool_registry, state_store
self.max_steps, self.confidence_threshold = 15, 0.85
async def investigate(self, triage_result: dict) -> InvestigationReport:
"""Run a dynamic, LLM-guided investigation."""
investigation_id = f"inv-{triage_result['alert_id']}"
steps, findings = [], []
plan = await self._create_investigation_plan(triage_result)
for step_num in range(self.max_steps):
next_action = await self._decide_next_step(
triage_result, steps, findings, plan)
if next_action["action"] == "conclude":
break
step = await self._execute_step(next_action)
steps.append(step)
findings.extend(await self._extract_findings(step, triage_result))
await self.state.update(investigation_id, {
"current_step": step_num + 1, "findings_count": len(findings)})
if self._assess_confidence(findings) >= self.confidence_threshold:
break
blast_radius = await self._assess_blast_radius(triage_result, findings)
return InvestigationReport(
investigation_id=investigation_id, alert_id=triage_result["alert_id"],
steps=steps, findings=findings,
confidence=self._assess_confidence(findings), blast_radius=blast_radius,
recommended_actions=await self._generate_recommendations(
triage_result, findings, blast_radius),
evidence_chain=[s.tool for s in steps],
timeline=self._build_timeline(steps, findings))
# Additional methods:
# - _create_investigation_plan(): LLM generates hypotheses and data source order
# - _decide_next_step(): LLM picks next action from SIEM/EDR/TI/identity/cloud/CMDB
# - _execute_step(): Runs a tool query with timing and error handling
# - _assess_blast_radius(): Maps affected assets, users, and services
# - _assess_confidence(): Weighted average biased toward high-confidence findings
# - _extract_findings(): Structures raw tool results into typed findings
# - _build_timeline(): Chronological reconstruction from steps and findings
# - _generate_recommendations(): LLM-powered response action suggestions
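The early-exit check above hinges on _assess_confidence. One plausible way to implement a weighted average biased toward high-confidence findings is to square the weights so strong evidence dominates weak noise; this is an illustrative sketch, not the original implementation:

```python
# Illustrative confidence aggregator. Squared weighting is one way to
# bias toward high-confidence findings; the real _assess_confidence
# may weight differently.
def assess_confidence(findings: list[dict]) -> float:
    if not findings:
        return 0.0
    confidences = [f["confidence"] for f in findings]
    weights = [c * c for c in confidences]  # strong evidence dominates
    return sum(w * c for w, c in zip(weights, confidences)) / sum(weights)

# Two strong findings plus one weak one: the aggregate stays high
# instead of being dragged down by the noise.
mixed = [{"confidence": 0.95}, {"confidence": 0.9}, {"confidence": 0.2}]
print(round(assess_confidence(mixed), 3))
```

A plain mean of the same three findings would sit near 0.68, below the 0.85 threshold, forcing extra steps even when the investigation has already converged.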
Pro Tip: Cap investigation steps at 15 with an escape hatch. Unbounded investigation loops are the agentic equivalent of an infinite while loop — the agent will keep finding "interesting" threads to pull without converging on a conclusion. Set a step budget and force a determination.
Response Orchestrator
The response orchestrator translates investigation findings into concrete response actions — and critically, enforces approval gates for high-risk actions.
Responsibilities:
- Generate response action plans based on investigation findings
- Classify actions by risk level (informational, reversible, destructive)
- Enforce approval gates: auto-execute low-risk, require approval for high-risk
- Coordinate actions across endpoint, network, identity, and cloud systems
- Track action execution and rollback capabilities
from dataclasses import dataclass
from enum import IntEnum

class ActionRisk(IntEnum):  # ordered members so risk levels compare numerically
    LOW = 1       # Informational: create ticket, send notification
    MEDIUM = 2    # Reversible: block IP, disable user
    HIGH = 3      # Significant: isolate endpoint, revoke sessions
    CRITICAL = 4  # Destructive: wipe endpoint, terminate instances
# ApprovalStatus: AUTO_APPROVED | PENDING | APPROVED | DENIED | TIMED_OUT
# ResponseAction(action_id, action_type, target, parameters, risk_level,
#                justification, rollback_available, approval_status)
# ResponsePlan(investigation_id, actions, priority_order,
#              estimated_containment_time, requires_human_approval)
class ResponseOrchestrator:
"""Translates investigation findings into coordinated response
actions with risk-based approval gates and rollback capabilities."""
def __init__(self, llm_client, action_registry, approval_service, state_store):
self.llm, self.actions = llm_client, action_registry
self.approvals, self.state = approval_service, state_store
self.auto_approve_threshold = ActionRisk.MEDIUM
async def create_response_plan(self, investigation_report: dict) -> ResponsePlan:
"""Generate a response plan from investigation findings."""
recommended = await self._recommend_actions(investigation_report)
actions = []
for rec in recommended:
risk = self._classify_risk(rec)
action = ResponseAction(
action_id=f"act-{investigation_report['investigation_id']}-{len(actions)}",
action_type=rec["type"], target=rec["target"],
parameters=rec.get("parameters", {}), risk_level=risk,
justification=rec["justification"],
rollback_available=rec.get("reversible", False))
if risk.value <= self.auto_approve_threshold.value:
action.approval_status = ApprovalStatus.AUTO_APPROVED
actions.append(action)
return ResponsePlan(
investigation_id=investigation_report["investigation_id"],
actions=actions, priority_order=self._prioritize_actions(actions),
estimated_containment_time=self._estimate_containment_time(actions),
requires_human_approval=any(
a.approval_status == ApprovalStatus.PENDING for a in actions))
# Additional methods:
# - execute_plan(): Runs approved actions in priority order with rollback on failure
# - _request_approval(): Routes high-risk actions to humans with timeout escalation
# - _execute_action(): Dispatches a single action to the target system handler
# - _classify_risk(): Maps action types to LOW/MEDIUM/HIGH/CRITICAL risk levels
# - _prioritize_actions(): Orders containment first, then eradication, then recovery
# - _estimate_containment_time(): Sums expected durations for containment actions
# - _recommend_actions(): LLM recommends response actions from investigation findings
Pro Tip: Always classify response actions by reversibility, not just severity. An analyst is far more likely to approve a "disable user account" (easily reversible) than a "wipe endpoint" (irreversible) — even when both are appropriate. Surface rollback procedures in the approval request to speed up human decision-making.
Documentation Agent
Incident documentation is the task every analyst hates and every compliance auditor demands. The documentation agent eliminates this burden by automatically generating structured incident reports from investigation and response data.
Responsibilities:
- Real-time incident timeline generation
- Structured report creation following organizational templates
- Compliance artifact generation (evidence preservation, chain of custody)
- Executive summary creation for non-technical stakeholders
- MITRE ATT&CK mapping for technique coverage tracking
from dataclasses import dataclass
from datetime import datetime
# IncidentReport fields: incident_id, title, executive_summary, timeline,
# technical_analysis, affected_assets, response_actions, mitre_mappings,
# recommendations, compliance_artifacts, generated_at
class DocumentationAgent:
"""Automatically generates structured incident documentation
from investigation and response data — eliminating the
documentation burden while ensuring compliance."""
def __init__(self, llm_client, template_store, state_store):
self.llm, self.templates, self.state = llm_client, template_store, state_store
async def generate_report(
self, investigation: dict, response_results: dict, triage: dict,
) -> IncidentReport:
"""Generate a complete incident report from investigation data."""
incident_id = investigation["investigation_id"]
timeline = self._build_timeline(triage, investigation, response_results)
exec_summary = await self._generate_executive_summary(
triage, investigation, response_results)
tech_analysis = await self._generate_technical_analysis(
investigation, response_results)
mitre = await self._map_to_mitre(investigation)
compliance = self._generate_compliance_artifacts(
incident_id, timeline, investigation, response_results)
recommendations = await self._generate_recommendations(
investigation, response_results)
return IncidentReport(
incident_id=incident_id,
title=self._generate_title(triage, investigation),
executive_summary=exec_summary, timeline=timeline,
technical_analysis=tech_analysis,
affected_assets=investigation.get("blast_radius", {}).get("affected_assets", []),
response_actions=response_results.get("executed", []),
mitre_mappings=mitre, recommendations=recommendations,
compliance_artifacts=compliance,
generated_at=datetime.utcnow().isoformat())
# Additional methods:
# - _build_timeline(): Chronological merge of triage, investigation, and response events
# - _generate_executive_summary(): LLM creates CISO-audience summary (no jargon)
# - _generate_technical_analysis(): LLM reconstructs attack chain with IOCs
# - _map_to_mitre(): Maps findings to MITRE ATT&CK techniques with confidence
# - _generate_compliance_artifacts(): NIST 800-61 structure, evidence hashes, custody chain
# - _generate_recommendations(): Forward-looking security improvement suggestions
Pro Tip: Generate incident reports incrementally — don't wait until the investigation completes. Start the timeline the moment triage begins, update it as investigation progresses, and finalize after response. This gives stakeholders real-time visibility and ensures nothing is lost if the process is interrupted.
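A minimal sketch of that incremental pattern; the incident ID, host name, and event wording are illustrative:

```python
from datetime import datetime, timezone

# Sketch of incremental report building: the timeline is a living
# structure appended to at each phase, not assembled at the end.
class IncrementalTimeline:
    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.events: list[dict] = []

    def record(self, phase: str, description: str):
        self.events.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "phase": phase,
            "description": description,
        })

    def snapshot(self) -> dict:
        """Partial report: safe to render for stakeholders at any time."""
        return {"incident_id": self.incident_id, "events": list(self.events)}

timeline = IncrementalTimeline("inv-alert-4821")
timeline.record("triage", "Classified HIGH, routed to assisted investigation")
timeline.record("investigation", "Confirmed C2 beaconing from host fin-ws-112")
print(len(timeline.snapshot()["events"]))  # report already viewable mid-incident
```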
Learning Agent
The learning agent closes the feedback loop. After each incident, it analyzes what happened, what worked, what didn't, and feeds those lessons back into the system to improve future responses.
Responsibilities:
- Post-incident analysis of triage accuracy, investigation efficiency, and response effectiveness
- Pattern recognition across historical incidents
- Triage model calibration based on outcomes
- Detection gap identification
- Runbook improvement recommendations
from dataclasses import dataclass
from typing import Optional
# LearningInsight(insight_type, description, evidence, recommendation, priority, applicable_to)
# FeedbackReport(incident_id, insights, triage_accuracy_score,
# investigation_efficiency_score, response_effectiveness_score, model_updates, detection_gaps)
class LearningAgent:
"""Post-incident analysis agent that closes the feedback loop.
Improves triage accuracy, investigation efficiency, and
response effectiveness over time."""
def __init__(self, llm_client, metrics_store, model_registry, state_store):
self.llm, self.metrics = llm_client, metrics_store
self.models, self.state = model_registry, state_store
async def analyze_incident(
self, triage: dict, investigation: dict,
response: dict, analyst_feedback: Optional[dict] = None,
) -> FeedbackReport:
"""Analyze a completed incident for improvement opportunities."""
insights = []
triage_score, ti = await self._evaluate_triage(triage, investigation, analyst_feedback)
insights.extend(ti)
inv_score, ii = await self._evaluate_investigation(investigation, analyst_feedback)
insights.extend(ii)
resp_score, ri = await self._evaluate_response(response, analyst_feedback)
insights.extend(ri)
detection_gaps = await self._identify_detection_gaps(triage, investigation)
model_updates = await self._generate_model_updates(insights, triage, investigation)
await self.metrics.record({
"incident_id": investigation.get("investigation_id"),
"triage_accuracy": triage_score, "investigation_efficiency": inv_score,
"response_effectiveness": resp_score})
return FeedbackReport(
incident_id=investigation.get("investigation_id", ""),
insights=insights, triage_accuracy_score=triage_score,
investigation_efficiency_score=inv_score,
response_effectiveness_score=resp_score,
model_updates=model_updates, detection_gaps=detection_gaps)
# Additional methods:
# - _evaluate_triage(): Compares triage severity vs actual; penalizes misclassification
# - _evaluate_investigation(): Flags excessive steps, redundant queries, low confidence
# - _evaluate_response(): Scores execution success rate and approval bottlenecks
# - _identify_detection_gaps(): Finds investigation-only findings with no detection rule
# - _generate_model_updates(): Proposes severity weight and threshold adjustments
Pro Tip: Don't trust the learning agent to self-improve without human oversight. All model updates and threshold changes proposed by the learning agent should go through a review queue. Unsupervised self-modification is how you get triage drift — a gradual miscalibration that looks fine on a daily basis but is catastrophic over months.
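A minimal sketch of such a review queue, with illustrative names: proposals are staged, and only human-approved updates ever reach the live models.

```python
# Sketch of the human-review gate for learning-agent proposals.
# Names and fields are illustrative, not from the original system.
class ModelUpdateReviewQueue:
    def __init__(self):
        self.pending: dict[str, dict] = {}
        self.applied: list[dict] = []

    def propose(self, update_id: str, update: dict):
        self.pending[update_id] = update  # staged, never auto-applied

    def approve(self, update_id: str, reviewer: str) -> dict:
        update = self.pending.pop(update_id)
        update["approved_by"] = reviewer
        self.applied.append(update)  # only now does the change go live
        return update

    def reject(self, update_id: str):
        self.pending.pop(update_id, None)

queue = ModelUpdateReviewQueue()
queue.propose("upd-1", {"parameter": "asset_criticality_weight",
                        "current": 0.30, "proposed": 0.35})
queue.approve("upd-1", reviewer="soc_manager")
print(len(queue.applied), len(queue.pending))
```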
Multi-Agent Coordination Patterns
Individual agents are only as effective as their ability to communicate, share state, and coordinate. This section covers the infrastructure patterns that make multi-agent orchestration work in production.
Message Bus Architecture
Agents communicate through a structured message bus that provides ordered delivery, persistence, and observability. Each message has a type, a source agent, a target agent (or broadcast), and a payload.
import asyncio
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Callable, Optional
from collections import defaultdict
class MessageType(Enum):
TRIAGE_COMPLETE = "triage_complete"
INVESTIGATION_COMPLETE = "investigation_complete"
RESPONSE_PLAN_READY = "response_plan_ready"
APPROVAL_REQUIRED = "approval_required"
# Also: ACTION_EXECUTED, DOCUMENTATION_READY, LEARNING_INSIGHT, CIRCUIT_BREAKER
# AgentMessage(message_id, message_type, source_agent, target_agent,
# payload, correlation_id, timestamp)
class AgentMessageBus:
"""Central message bus for inter-agent communication with ordered
delivery, persistence, dead-letter handling, and observability."""
def __init__(self, persistence_backend, metrics_collector):
self.persistence, self.metrics = persistence_backend, metrics_collector
self.subscribers: dict[str, list[Callable]] = defaultdict(list)
self.dead_letter_queue: list = []
async def publish(self, message: AgentMessage):
"""Publish a message to all subscribers with persistence and dead-letter handling."""
await self.persistence.store(message)
self.metrics.increment("agent_messages_published", tags={
"type": message.message_type.value, "source": message.source_agent})
handlers = self.subscribers.get(message.message_type.value, [])
if not handlers:
self.dead_letter_queue.append(message)
return
results = await asyncio.gather(
*(self._deliver(h, message) for h in handlers), return_exceptions=True)
for r in results:
if isinstance(r, Exception):
self.dead_letter_queue.append(message)
# Additional methods:
# - subscribe(): Registers an agent handler for a message type
# - _deliver(): Delivers with 30s timeout and error handling
# SharedStateStore: Shared cross-agent investigation context with atomic
# updates, versioning, and TTL-based expiry.
# - update(): Atomically update state with optimistic locking
# - get(): Retrieves current investigation state
# - subscribe_changes(): Watches for state updates via callback
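A minimal, synchronous sketch of the optimistic locking described above; a production store would be async and backed by a database:

```python
# Sketch of SharedStateStore-style optimistic locking: an update only
# lands if the caller read the current version.
class SharedStateStore:
    def __init__(self):
        self._state: dict[str, dict] = {}
        self._versions: dict[str, int] = {}

    def get(self, investigation_id: str) -> tuple[dict, int]:
        return (dict(self._state.get(investigation_id, {})),
                self._versions.get(investigation_id, 0))

    def update(self, investigation_id: str, changes: dict,
               expected_version: int) -> bool:
        current = self._versions.get(investigation_id, 0)
        if current != expected_version:
            return False  # another agent wrote first; caller must re-read
        self._state.setdefault(investigation_id, {}).update(changes)
        self._versions[investigation_id] = current + 1
        return True

store = SharedStateStore()
_, version = store.get("inv-42")
first = store.update("inv-42", {"severity": "high"}, expected_version=version)
stale = store.update("inv-42", {"severity": "low"}, expected_version=version)
print(first, stale)  # the second writer lost the race and must re-read
```

The losing writer re-reads, merges, and retries, which is what prevents two agents from silently clobbering each other's investigation context.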
Agent Lifecycle Orchestration
The orchestration layer manages the full lifecycle: agent spawning, health monitoring, and graceful degradation.
import asyncio
class AgentOrchestrator:
"""Top-level orchestrator: manages agent lifecycle and the
end-to-end incident response pipeline."""
def __init__(self, message_bus, state_store, config):
self.bus, self.state, self.config = message_bus, state_store, config
self.agents = {}
async def handle_alert(self, raw_alert: dict):
"""Entry point: process an alert through the full agentic pipeline."""
        # Phase 1: Triage
        triage_result = await self.agents["triage"].triage(raw_alert)
        if triage_result.deduplicated:
            return  # Folded into an existing investigation; no new pipeline run
        await self.bus.publish(AgentMessage(
            message_id=f"msg-triage-{triage_result.alert_id}",
            message_type=MessageType.TRIAGE_COMPLETE,
            source_agent="triage", target_agent="investigation",
            payload=triage_result.__dict__,
            correlation_id=triage_result.alert_id))
# Phase 2: Investigation
investigation = await self.agents["investigation"].investigate(
triage_result.__dict__)
await self.bus.publish(AgentMessage(
message_id=f"msg-inv-{investigation.investigation_id}",
message_type=MessageType.INVESTIGATION_COMPLETE,
source_agent="investigation", target_agent="response",
payload=investigation.__dict__,
correlation_id=triage_result.alert_id))
# Phase 3: Response
plan = await self.agents["response"].create_response_plan(investigation.__dict__)
results = await self.agents["response"].execute_plan(plan)
# Phase 4: Documentation + Learning (parallel)
report, feedback = await asyncio.gather(
self.agents["documentation"].generate_report(
investigation.__dict__, results, triage_result.__dict__),
self.agents["learning"].analyze_incident(
triage_result.__dict__, investigation.__dict__, results))
# Additional methods:
# - _health_check_loop(): Monitors agent health every 30s, triggers circuit breakers
# - _handle_unhealthy_agent(): Publishes circuit breaker msg, falls back to manual
Pro Tip: Always run documentation and learning agents in parallel — they have no dependency on each other. This shaves meaningful time off your end-to-end incident processing. Pipeline parallelism is the easiest performance win in multi-agent systems.
Governance & Guardrails — The Cymantis View
Autonomous agents without governance are a liability. The Cymantis approach to agentic IR governance is built on four pillars: escalation thresholds, human-in-the-loop approval, comprehensive audit trails, and circuit breakers.
Escalation Policy Configuration
Governance policies should be defined declaratively — not embedded in agent code. This allows security leadership to adjust thresholds without code changes.
# agentic_ir_governance.yaml
```yaml
# Cymantis Agentic IR Governance Policy v2.1
escalation_policy:
  severity_thresholds:
    critical:
      investigation_path: assisted    # Always human-in-the-loop
      max_autonomous_actions: 0       # No auto-execution for critical
      escalation_timeout_minutes: 5
      escalation_targets:
        - ir_lead
        - soc_manager
        - ciso                        # After 15 minutes without response
    high:
      investigation_path: assisted
      max_autonomous_actions: 3       # Auto-execute up to 3 low-risk actions
      escalation_timeout_minutes: 15
      escalation_targets:
        - ir_lead
        - soc_manager
    medium:
      investigation_path: autonomous
      max_autonomous_actions: 10
      escalation_timeout_minutes: 60
      escalation_targets:
        - soc_analyst
    low:
      investigation_path: autonomous
      max_autonomous_actions: 20
      escalation_timeout_minutes: 240
      escalation_targets:
        - soc_analyst

approval_gates:
  action_risk_classification:
    auto_approve:
      - create_ticket
      - send_notification
      - enrich_ioc
      - add_watchlist
      - tag_asset
    require_approval:
      - block_ip
      - block_domain
      - disable_user_account
      - quarantine_email
      - isolate_endpoint
      - revoke_sessions
    require_senior_approval:
      - wipe_endpoint
      - terminate_cloud_instance
      - disable_service_account
      - modify_firewall_rule
      - revoke_api_keys
  approval_timeouts:
    standard: 300       # 5 minutes
    urgent: 60          # 1 minute for critical severity
    after_hours: 600    # 10 minutes outside business hours
  timeout_behavior:
    critical_severity: escalate_to_next_tier
    high_severity: escalate_to_next_tier
    medium_severity: hold_and_notify
    low_severity: auto_approve_reversible

circuit_breakers:
  global:
    max_actions_per_hour: 100
    max_critical_actions_per_hour: 10
    max_endpoints_isolated_per_hour: 5
    max_users_disabled_per_hour: 10
  per_investigation:
    max_steps: 20
    max_duration_minutes: 60
    max_api_calls: 200
    max_response_actions: 15
  kill_switch:
    enabled: true
    trigger_conditions:
      - consecutive_failures > 5
      - false_positive_rate > 0.3
      - response_actions_per_minute > 10
    action: pause_all_autonomous_actions
    notification:
      - soc_manager
      - ir_lead
      - ciso

audit_trail:
  retention_days: 2555    # 7 years for compliance
  fields_required:
    - timestamp
    - agent_name
    - action_type
    - target
    - justification
    - approval_status
    - approver_identity
    - outcome
    - evidence_hash
  tamper_protection: sha256_chain
  export_formats:
    - json
    - csv
    - syslog_cef

compliance_mappings:
  nist_800_61:
    detection: triage_agent
    analysis: investigation_agent
    containment: response_orchestrator
    eradication: response_orchestrator
    recovery: response_orchestrator
    post_incident: learning_agent
  nist_800_53:
    IR-4: "Incident Handling - agentic pipeline"
    IR-5: "Incident Monitoring - continuous agent health"
    IR-6: "Incident Reporting - documentation agent"
    AU-6: "Audit Review - audit trail + learning agent"
    SI-4: "System Monitoring - triage agent integration"
```
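To make the approval gates concrete, here is a minimal sketch of how a governance engine could classify a proposed action against the policy above. The `RISK_CLASSIFICATION` dict mirrors the YAML; the fail-closed default for unrecognized actions is our assumption, not part of the policy file:

```python
# Sketch: evaluating the approval_gates section of the governance policy.
# The tiers mirror the YAML above; the fail-closed default is an assumption.
RISK_CLASSIFICATION = {
    "auto_approve": {"create_ticket", "send_notification", "enrich_ioc",
                     "add_watchlist", "tag_asset"},
    "require_approval": {"block_ip", "block_domain", "disable_user_account",
                         "quarantine_email", "isolate_endpoint", "revoke_sessions"},
    "require_senior_approval": {"wipe_endpoint", "terminate_cloud_instance",
                                "disable_service_account", "modify_firewall_rule",
                                "revoke_api_keys"},
}

def classify_action(action_type: str) -> str:
    """Return the approval tier for a proposed response action.

    Any action not listed in the policy falls through to the most
    restrictive tier, so new or misspelled action types fail closed.
    """
    for tier, actions in RISK_CLASSIFICATION.items():
        if action_type in actions:
            return tier
    return "require_senior_approval"  # fail closed on anything unrecognized
```

In a production engine the classification would be loaded from the policy file at startup rather than hard-coded, so policy changes never require a code deploy.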
Cymantis Recommendations for Governance
- Start with assisted mode everywhere. Deploy all agents in assisted (human-in-the-loop) mode for the first 30 days. Use this period to calibrate confidence thresholds and build trust with the SOC team.
- Graduate autonomy by action type, not by severity. Even in autonomous mode, destructive actions should always require approval. A medium-severity incident can still justify endpoint isolation if the investigation supports it — but a human should approve it.
- Audit everything, even autonomous actions. Every agent decision, every enrichment query, every response action must have a complete audit trail with evidence hashes. This isn't just compliance — it's debugging infrastructure.
- Implement kill switches at every level. Global kill switch for all autonomous operations. Per-agent kill switches. Per-investigation kill switches. When an agent goes off the rails, you need to stop it in seconds, not minutes.
- Review circuit breaker triggers monthly. As your environment changes — new endpoints, new SaaS integrations, new threat patterns — the boundaries of "normal" agent behavior shift. Circuit breaker thresholds need to track those changes.
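The `sha256_chain` tamper protection named in the policy can be implemented as a simple hash chain: each audit record is hashed together with its predecessor's hash, so any retroactive edit invalidates every later link. A minimal sketch (function names and record shape are illustrative, not a production schema):

```python
import hashlib
import json

def entry_hash(entry: dict, prev_hash: str) -> str:
    """Hash an audit record together with its predecessor's hash."""
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append_entry(chain: list, entry: dict) -> None:
    """Append an audit record, linking it to the previous record's hash."""
    prev = chain[-1]["hash"] if chain else "GENESIS"
    chain.append({"entry": entry, "hash": entry_hash(entry, prev)})

def verify_chain(chain: list) -> bool:
    """Recompute every link; False means the trail was altered after the fact."""
    prev = "GENESIS"
    for record in chain:
        if record["hash"] != entry_hash(record["entry"], prev):
            return False
        prev = record["hash"]
    return True
```

Anchoring the latest chain hash somewhere the agents cannot write (a separate log sink, or a periodic external timestamp) is what makes the chain genuinely tamper-evident rather than merely tamper-detecting.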
Pro Tip: Build a governance dashboard that shows every autonomous action taken in the last 24 hours, with the full reasoning chain. Make it the first thing the SOC manager reviews every morning. Transparency is what separates trustworthy autonomy from a compliance nightmare.
Migration Guide: SOAR to Agentic Orchestration
Migrating from traditional SOAR to agentic orchestration isn't a rip-and-replace. It's a graduated transition that preserves your existing investment while layering on agentic capabilities.
Step 1: Audit Existing Playbooks
Before building anything new, audit what you have. Map every playbook to its trigger, its data sources, its actions, and its maintenance history.
```bash
#!/bin/bash
# playbook_audit.sh — Inventory existing SOAR playbooks

echo "=== SOAR Playbook Audit ==="
echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

# Export playbook inventory from SOAR platform
# Adapt for your platform: Splunk SOAR, Palo Alto XSOAR, etc.
soar-cli playbooks list --format json | python3 -c "
import json, sys

playbooks = json.load(sys.stdin)
print(f'Total playbooks: {len(playbooks)}')
print(f'Active: {sum(1 for p in playbooks if p[\"status\"] == \"active\")}')
print(f'Disabled: {sum(1 for p in playbooks if p[\"status\"] == \"disabled\")}')
print()

# Identify candidates for agentic migration
for pb in sorted(playbooks, key=lambda x: x.get('execution_count', 0), reverse=True):
    print(f'  [{pb[\"status\"]:8s}] {pb[\"name\"]:50s} '
          f'executions={pb.get(\"execution_count\", 0):6d} '
          f'success_rate={pb.get(\"success_rate\", 0):.1%} '
          f'last_modified={pb.get(\"last_modified\", \"unknown\")}')
"
```
Step 2: Classify Playbooks by Migration Priority
Not every playbook needs to become an agent. Classify them:
- Retire: Playbooks with <5 executions/month or <50% success rate. These are broken or irrelevant — remove them.
- Keep as automation: Simple, deterministic workflows (password resets, ticket creation, notification routing). These don't need AI — keep them as lightweight automations.
- Convert to agent capability: Complex investigation and response playbooks that require multi-step reasoning, cross-source enrichment, or adaptive logic. These are your migration targets.
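These classification rules translate directly into code you can run against the Step 1 audit output. The thresholds are the ones stated in the bullets above; the `is_deterministic` flag is an assumed input you would derive from the playbook's structure (e.g., no branching on enrichment results):

```python
def classify_playbook(executions_per_month: int, success_rate: float,
                      is_deterministic: bool) -> str:
    """Map a playbook to a migration bucket using the Step 2 criteria.

    Retirement is checked first: a high-volume playbook that fails more
    than half the time is broken, not valuable.
    """
    if executions_per_month < 5 or success_rate < 0.5:
        return "retire"                   # broken or irrelevant
    if is_deterministic:
        return "keep_as_automation"       # simple workflows don't need AI
    return "convert_to_agent_capability"  # migration target
```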
Step 3: Build the Agent Foundation
Deploy the shared infrastructure first: message bus, state store, governance policy engine, and audit trail.
```yaml
# docker-compose.agentic-ir.yaml
version: "3.8"

services:
  message-bus:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes

  state-store:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: agentic_ir
      POSTGRES_USER: ir_system
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - pg_data:/var/lib/postgresql/data
    secrets:
      - db_password

  vector-store:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

  governance-engine:
    build: ./services/governance
    environment:
      POLICY_PATH: /config/governance.yaml
      STATE_STORE_URL: postgresql://ir_system@state-store/agentic_ir
      MESSAGE_BUS_URL: redis://message-bus:6379
    volumes:
      - ./config:/config:ro
    depends_on:
      - state-store
      - message-bus

  audit-logger:
    build: ./services/audit
    environment:
      STATE_STORE_URL: postgresql://ir_system@state-store/agentic_ir
      LOG_RETENTION_DAYS: 2555
      TAMPER_PROTECTION: sha256_chain
    depends_on:
      - state-store

volumes:
  redis_data:
  pg_data:
  qdrant_data:

secrets:
  db_password:
    file: ./secrets/db_password.txt
```
Step 4: Convert Playbooks to Agent Capabilities
For each playbook targeted for conversion, extract the intent (what it's trying to accomplish) and the tools it uses — then map those to agent capabilities.
Example: Phishing Playbook Conversion
Before (SOAR playbook — pseudocode):
```python
# Traditional SOAR phishing playbook — rigid, linear
def phishing_playbook(alert):
    # Step 1: Extract observables (fixed extraction)
    urls = extract_urls(alert.email_body)
    attachments = extract_attachments(alert.email)
    sender = alert.sender_address

    # Step 2: Enrich (fixed sequence)
    url_results = virustotal_lookup(urls)
    attachment_results = sandbox_detonate(attachments)
    sender_rep = check_sender_reputation(sender)

    # Step 3: Decision (static logic)
    if any(r.malicious for r in url_results + attachment_results):
        quarantine_email(alert.message_id)
        block_sender(sender)
        create_ticket("malicious_phishing", alert)
    elif sender_rep.score < 0.3:
        quarantine_email(alert.message_id)
        create_ticket("suspicious_phishing", alert)
    else:
        close_alert(alert, "benign")
```
After (agent capability — dynamic, adaptive):
```python
# Agentic phishing investigation — dynamic, context-aware
class PhishingInvestigationCapability:
    """Replaces the static phishing playbook with dynamic investigation.
    The agent adapts based on what it finds — just like a human analyst."""

    async def investigate(self, alert: dict, agent_context: dict) -> dict:
        findings = []
        observables = await self.extract_observables_adaptive(alert)

        for obs in observables:
            result = await self.investigate_observable(obs, agent_context)
            findings.append(result)
            if result.get("suspicious"):  # Adaptive pivot: dig deeper
                findings.extend(await self.adaptive_pivot(obs, result, agent_context))

        # Check for coordinated campaign (not possible in static playbooks)
        campaign = await self.check_campaign_indicators(
            observables, findings, agent_context)
        if campaign.get("match"):
            findings.append({
                "type": "campaign_match",
                "campaign_id": campaign["campaign_id"],
                "confidence": campaign["confidence"],
                "severity": "high",
                "summary": f"Part of active campaign: {campaign['name']}",
            })

        findings.extend(await self.check_sender_history(alert, agent_context))

        return {
            "findings": findings,
            "confidence": self.calculate_confidence(findings),
            "recommended_actions": self.recommend_actions(findings),
        }

    # Additional methods:
    # - investigate_observable(): Multi-source lookup (VT, URLScan, WHOIS, sandbox)
    # - adaptive_pivot(): Searches for related emails from suspicious domains
    # - extract_observables_adaptive(): Handles URLs, QR codes, attachments, callbacks
    # - check_campaign_indicators(): Correlates against known active campaigns
    # - check_sender_history(): Historical context for sender/domain patterns
```
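The `calculate_confidence` helper is only referenced above; one plausible shape is a severity-weighted aggregate over findings. The weights, the multi-finding bonus, and the cap are illustrative assumptions, not the capability's actual scoring function:

```python
# Illustrative confidence aggregation — weights and bonus are assumptions.
SEVERITY_WEIGHT = {"low": 0.2, "medium": 0.5, "high": 0.8, "critical": 1.0}

def calculate_confidence(findings: list) -> float:
    """Aggregate per-finding confidence, weighted by severity.

    Returns 0.0 for an empty finding set; multiple corroborating
    findings add a small bonus, capped at 1.0 so several medium
    findings can't outrank one critical match by much.
    """
    scored = [
        f.get("confidence", 0.5) * SEVERITY_WEIGHT.get(f.get("severity", "low"), 0.2)
        for f in findings
    ]
    if not scored:
        return 0.0
    return min(1.0, sum(scored) / len(scored) + 0.1 * (len(scored) - 1))
```

Whatever scoring function you choose, keep it deterministic and auditable — the confidence number feeds the governance thresholds, so it must be reproducible from the evidence in the audit trail.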
Step 5: Run Shadow Mode
Deploy agents in shadow mode — they process every alert alongside your existing SOAR playbooks, but take no response actions. Compare results over 30 days:
- Did the agent classify severity accurately?
- Did the agent identify the same findings as the playbook?
- Did the agent find anything the playbook missed?
- Did the agent produce actionable response recommendations?
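Answering these questions reduces to joining the two verdict streams on alert ID and measuring agreement. A sketch, assuming each system emits one severity verdict per alert (field names illustrative):

```python
def shadow_agreement(agent_verdicts: dict, playbook_verdicts: dict) -> dict:
    """Compare agent vs. playbook severity calls on the same alerts.

    Both inputs map alert ID -> severity string. Only alerts seen by
    both systems are compared; disagreements are returned for manual
    review, since either side may be the one that's wrong.
    """
    shared = agent_verdicts.keys() & playbook_verdicts.keys()
    if not shared:
        return {"compared": 0, "agreement_rate": None, "disagreements": []}
    disagreements = [
        (alert_id, playbook_verdicts[alert_id], agent_verdicts[alert_id])
        for alert_id in sorted(shared)
        if agent_verdicts[alert_id] != playbook_verdicts[alert_id]
    ]
    return {
        "compared": len(shared),
        "agreement_rate": 1 - len(disagreements) / len(shared),
        "disagreements": disagreements,  # hand these to a senior analyst
    }
```

Raw agreement is not the goal — an agent that merely mimics the playbook adds nothing. The disagreement list is where the value lives: each one is either a playbook gap the agent caught or an agent error you need to fix before graduation.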
Step 6: Graduate to Production
Based on shadow mode metrics, graduate agents to production by capability and by severity level:
- Week 1–4: Agents handle low-severity investigations autonomously. Medium and above remain in assisted mode.
- Week 5–8: Agents handle medium-severity investigations autonomously (low-risk actions only). High and critical remain assisted.
- Week 9–12: Agents handle all severities, with approval gates for high-risk response actions at every severity level.
- Ongoing: Continuous tuning via the learning agent feedback loop.
Step 7: Decommission Legacy Playbooks
As agents prove themselves, retire the corresponding SOAR playbooks. But don't delete them — archive them as reference material for the learning agent and as fallback procedures.
Pro Tip: Keep a "break glass" SOAR playbook for every critical response capability. If the agentic system goes down, you need a manual fallback. Test these quarterly, just like disaster recovery runbooks.
Measuring Agentic IR Performance
You can't improve what you don't measure. Agentic IR introduces new dimensions to incident response metrics — not just speed, but accuracy, efficiency, and analyst satisfaction.
Key Metrics
| Metric | Definition | Traditional SOC Baseline | Agentic IR Target |
|---|---|---|---|
| MTTD (Mean Time to Detect) | Time from attack start to alert generation | 197 days (IBM 2024) | <24 hours |
| MTTI (Mean Time to Investigate) | Time from alert to investigation complete | 4–8 hours | <5 minutes |
| MTTR (Mean Time to Respond) | Time from alert to containment complete | 8–24 hours | <15 minutes |
| Containment Time | Time from response decision to containment | 30–120 minutes | <2 minutes |
| Triage Accuracy | Correct severity classification rate | 60–75% | >92% |
| False Positive Rate | Alerts closed as false positive | 80%+ | <30% (of escalated alerts) |
| Investigation Depth | Data sources consulted per investigation | 2–3 | 5–8 |
| Analyst Satisfaction | Survey-based satisfaction with tools | 3.2/5 | 4.5/5 |
| Autonomous Resolution Rate | Incidents resolved without human touch | 0–10% | 60–80% for low/medium |
Dashboard Queries
Track these metrics in real time with SIEM dashboard panels.
Mean Time to Investigate (MTTI) — Splunk SPL:
```
index=agentic_ir sourcetype=investigation_metrics
| eval investigation_duration_min = (investigation_end - investigation_start) / 60
| stats
    avg(investigation_duration_min) as avg_mtti_min
    median(investigation_duration_min) as median_mtti_min
    perc95(investigation_duration_min) as p95_mtti_min
    count as total_investigations
  by severity
| eval avg_mtti_min = round(avg_mtti_min, 2)
| eval median_mtti_min = round(median_mtti_min, 2)
| sort - severity
```
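It helps to sanity-check the dashboard math offline against an export of the same metrics before trusting a new panel. This Python sketch mirrors the SPL aggregation; the 95th percentile is approximated with a nearest-rank calculation, which may differ slightly from Splunk's `perc95` interpolation:

```python
import statistics

def mtti_summary(durations_min: list) -> dict:
    """Replicate the SPL stats: avg, median, p95 over investigation durations.

    Input is a non-empty list of investigation durations in minutes,
    e.g. exported from the investigation_metrics sourcetype.
    """
    ordered = sorted(durations_min)
    # Nearest-rank 95th percentile — close to, not identical to, perc95()
    p95_idx = max(0, round(0.95 * len(ordered)) - 1)
    return {
        "avg_mtti_min": round(statistics.mean(ordered), 2),
        "median_mtti_min": round(statistics.median(ordered), 2),
        "p95_mtti_min": round(ordered[p95_idx], 2),
        "total_investigations": len(ordered),
    }
```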
Triage Accuracy Over Time — Splunk SPL:
```
index=agentic_ir sourcetype=learning_metrics
| eval accurate = if(triage_severity == actual_severity, 1, 0)
| bin _time span=1d
| stats
    avg(accurate) as daily_accuracy
    count as daily_volume
    avg(triage_accuracy_score) as avg_triage_score
  by _time
| eval daily_accuracy = round(daily_accuracy * 100, 1)
| eval avg_triage_score = round(avg_triage_score * 100, 1)
```
Response Action Effectiveness — Splunk SPL:
```
index=agentic_ir sourcetype=response_metrics
| stats
    count(eval(action_status="executed")) as actions_executed
    count(eval(action_status="failed")) as actions_failed
    count(eval(action_status="pending_approval")) as actions_pending
    count(eval(action_status="auto_approved")) as actions_auto_approved
    avg(containment_time_seconds) as avg_containment_sec
  by action_type
| eval success_rate = round(actions_executed / (actions_executed + actions_failed) * 100, 1)
| eval avg_containment_sec = round(avg_containment_sec, 1)
| sort - actions_executed
```
Agent Health Dashboard — Splunk SPL:
```
index=agentic_ir sourcetype=agent_heartbeat
| stats
    latest(status) as current_status
    latest(last_action_time) as last_active
    avg(processing_time_ms) as avg_processing_ms
    count(eval(status="error")) as error_count
    count as total_heartbeats
  by agent_name
| eval health = case(
    current_status == "healthy" AND error_count < 5, "GREEN",
    current_status == "healthy" AND error_count >= 5, "YELLOW",
    current_status == "degraded", "YELLOW",
    1==1, "RED"
  )
| table agent_name health current_status last_active avg_processing_ms error_count
```
Autonomous vs. Assisted Resolution — Executive KPI:
```
index=agentic_ir sourcetype=incident_metrics
| eval resolution_type = case(
    investigation_path == "autonomous" AND human_intervention == 0, "Fully Autonomous",
    investigation_path == "assisted" AND human_intervention == 1, "Human Assisted",
    investigation_path == "manual", "Manual",
    1==1, "Other"
  )
| stats count by resolution_type
| eventstats sum(count) as total
| eval percentage = round(count / total * 100, 1)
| sort - count
```
Pro Tip: Track analyst satisfaction monthly with a simple 5-question survey. The best metric for agentic IR success isn't MTTR — it's whether your analysts feel like the system helps them or creates new problems. If satisfaction drops, your agents are getting in the way, not getting out of it.
The Cymantis View: Where This Is All Heading
The shift from SOAR playbooks to agentic orchestration is not optional — it's inevitable. The same forces that made SOAR necessary (alert volume, staffing shortages, adversary speed) have now made SOAR insufficient. Rigid playbooks can't keep pace with polymorphic threats, multi-vector attacks, and the sheer volume of telemetry that modern environments generate.
But the transition requires engineering discipline. Multi-agent systems introduce new failure modes: agent hallucination, reasoning loops, cascade failures, state corruption, and governance drift. Organizations that deploy agents without the governance infrastructure to constrain them will discover that an autonomous agent making bad decisions at machine speed is worse than no automation at all.
The Cymantis position is clear:
- Agentic IR is the future of incident response. Organizations that invest now will have a 12–18 month advantage over those that wait.
- Governance is not optional. Every autonomous capability must have an equally robust governance mechanism. No exceptions.
- Start small, prove value, then scale. Deploy a single triage agent in shadow mode. Prove it classifies correctly. Then add investigation. Then response. Each step earns trust.
- Measure relentlessly. If you can't show that your agentic system outperforms your current playbooks on MTTI, MTTR, accuracy, and analyst satisfaction, you haven't earned the right to scale it.
- Keep humans in the loop for high-stakes decisions. Full autonomy is not the goal. Effective human-AI collaboration is. The best SOCs will be small teams of senior analysts governing fleets of agents — not watching dashboards.
The alert fatigue era is ending. The playbook maintenance era is ending. What comes next is a SOC where analysts spend their time on judgment, strategy, and adversary engagement — and agents handle everything else.
Final Thoughts
SOAR was a necessary step in the evolution of incident response. It proved that automation has a place in security operations. But static playbooks were always a stopgap — a way to encode yesterday's response procedures for yesterday's threats.
The multi-agent approach isn't just faster; it's fundamentally different. Agents don't follow scripts — they reason about evidence, adapt to context, and coordinate dynamically. They investigate like your best analyst and document like your most diligent one. They learn from every incident and get better over time.
The organizations that move first will compound their advantage. Every incident their agents process makes them smarter, faster, more accurate. The learning loop creates a flywheel: better triage leads to more focused investigations, which leads to more precise responses, which generates better training data for the next cycle.
For security leaders evaluating this transition: don't wait for the perfect platform. The building blocks exist today — LLM APIs, agent orchestration frameworks, vector databases for memory, and the governance patterns outlined in this post. The gap between "theoretically possible" and "production-ready" is engineering effort, not research breakthrough.
Start with shadow mode. Measure everything. Graduate autonomy as trust is earned. Keep humans where they matter most — at the decision boundary between "safe to automate" and "too consequential to delegate."
The future SOC isn't bigger. It's smarter. Build it.
Cymantis Labs helps security teams design, deploy, and govern agentic incident response architectures — from SOAR migration assessments to full multi-agent orchestration in production. We bring the engineering rigor and operational experience to make autonomous IR production-safe.
Resources & References
SOAR & Incident Response Foundations
- NIST SP 800-61 Rev. 2 — Computer Security Incident Handling Guide: https://csrc.nist.gov/publications/detail/sp/800-61/rev-2/final — The foundational framework for incident response processes
- Gartner — Market Guide for SOAR Solutions: https://www.gartner.com/en/documents/ — Analysis of SOAR market evolution, convergence with AI, and vendor landscape
- SANS — Incident Handler's Handbook: https://www.sans.org/white-papers/33901/ — Practical guide to incident response procedures and best practices
- Splunk SOAR Documentation: https://docs.splunk.com/Documentation/SOAR/ — Splunk SOAR platform documentation and playbook development
- Palo Alto XSOAR Documentation: https://docs-cortex.paloaltonetworks.com/r/Cortex-XSOAR — Cortex XSOAR platform documentation and marketplace
Multi-Agent AI Research
- AutoGen — Multi-Agent Conversation Framework: https://github.com/microsoft/autogen — Microsoft's open-source framework for building multi-agent AI systems
- CrewAI — AI Agent Orchestration: https://github.com/crewAIInc/crewAI — Framework for orchestrating role-playing AI agents
- LangGraph — Stateful Multi-Agent Workflows: https://python.langchain.com/docs/langgraph — LangChain's framework for building stateful, multi-step agent workflows
- OpenAI Function Calling Documentation: https://platform.openai.com/docs/guides/function-calling — Building tool-calling agents with structured outputs
Agentic Security Operations
- CrowdStrike — The Rise of the Agentic SOC: https://www.crowdstrike.com/en-us/blog/agentic-ai-soc-guide/ — CrowdStrike's perspective on Charlotte AI and autonomous security operations
- Microsoft Copilot for Security: https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-copilot-security — Enterprise agentic security platform
- Prophet Security — AI SOC Analyst: https://prophetsecurity.ai — Autonomous investigation and detection engineering platform
- Anomali — Agentic SIEM Architecture: https://www.anomali.com/resources/what-is-siem — Research on SIEM evolution toward agentic platforms
MITRE Frameworks
- MITRE ATT&CK Framework: https://attack.mitre.org/ — Adversary tactics, techniques, and procedures knowledge base
- MITRE D3FEND: https://d3fend.mitre.org/ — Defensive technique knowledge graph for mapping response capabilities
- MITRE ATLAS (Adversarial Threat Landscape for AI): https://atlas.mitre.org/ — Threats specific to AI/ML systems
Governance & Compliance
- NIST AI Risk Management Framework: https://www.nist.gov/artificial-intelligence/ai-risk-management-framework — Federal guidance on AI governance and risk management
- NIST SP 800-53 Rev. 5 — Security and Privacy Controls: https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final — IR-4, IR-5, IR-6, AU-6, SI-4 control families relevant to agentic IR
- SOC 2 Trust Services Criteria: https://www.aicpa.org/resources/landing/system-and-organization-controls-soc-suite-of-services — SOC 2 audit considerations for automated response systems
Industry Research & Benchmarks
- IBM Cost of a Data Breach Report 2025: https://www.ibm.com/reports/data-breach — Annual breach cost analysis including detection time metrics (MTTD, MTTR)
- Ponemon Institute — State of SOAR: https://www.ponemon.org/ — Research on SOAR adoption, playbook maintenance burden, and effectiveness metrics
- SANS SOC Survey: https://www.sans.org/white-papers/ — Annual survey of SOC operations, staffing, tool adoption, and alert fatigue metrics
For more insights or to schedule a Cymantis Agentic IR Assessment, contact our research and automation team at cymantis.com.
