Building an Agentic SOC: From Alert Fatigue to Autonomous Detection and Response
A technical guide to transforming traditional SOC operations with agentic AI — from autonomous alert triage and investigation to coordinated response, while maintaining human-in-the-loop governance.
By Cymantis Labs
The modern Security Operations Center is drowning. SOC teams field an average of 11,000 alerts per day, and over 80% of those are false positives. Mean Time to Respond (MTTR) is measured in hours — sometimes days — for incidents that demand minutes. Tier 1 analysts burn out within 18 months, hiring pipelines can't keep pace with the volume, and the adversaries aren't slowing down.
We've spent the last decade layering automation on top of broken processes: SOAR playbooks that handle the happy path but choke on edge cases, static correlation rules that fire on everything or nothing, and enrichment workflows that add data without adding context.
Agentic AI changes the equation. Not by replacing analysts, but by giving them something they've never had: autonomous reasoning at machine speed with human judgment at decision time. This is the architecture of the Agentic SOC — a system where AI agents triage, investigate, and recommend response actions, while humans govern the boundaries and approve high-risk actions.
This guide walks through the technical blueprint: from data foundations to multi-agent orchestration, with working code, real configurations, and the governance guardrails that make it production-safe.
The Three Generations of SIEM
Before building the future, we need to understand how we got here. The evolution of SIEM platforms maps directly to the evolution of SOC operating models — and each generation addressed (or failed to address) the alert fatigue problem differently.
Generation 1: System of Record (2005–2015)
The first generation of SIEMs — ArcSight, QRadar, early Splunk — were log aggregation engines. Their value proposition was simple: collect everything, search anything. Correlation rules were hand-crafted by senior engineers, and the operating model was reactive. Something happens, you search for it.
The alert fatigue problem: Rules were binary — they fired or they didn't. No risk scoring, no context, no adaptive thresholds. A failed SSH login from an admin's home IP generated the same alert as one from a known C2 server. SOC teams responded by tuning rules so aggressively that coverage gaps became the norm.
Generation 2: System of Intelligence (2015–2023)
The second generation introduced analytics: UEBA, risk-based alerting, machine learning anomaly detection, and threat intelligence enrichment. Splunk ES, Microsoft Sentinel, and Google Chronicle moved from raw correlation to behavioral baselines and entity risk aggregation.
The alert fatigue improvement: Risk-based alerting (RBA) was a genuine breakthrough. Instead of alerting on individual events, you aggregate risk by entity and fire when thresholds are breached. This reduced alert volume by 80–90% in well-tuned deployments. But the investigation and response workflow remained manual. An analyst still had to pick up the notable, pull context from five tools, make a judgment call, and execute response actions by hand.
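The RBA mechanics can be sketched in a few lines — a minimal illustration, assuming a flat list of risk events and an illustrative threshold of 100 (both the event shape and threshold are assumptions, not any vendor's schema):

```python
from collections import defaultdict

RISK_THRESHOLD = 100  # assumption: fire a notable once an entity's aggregated risk exceeds this

def aggregate_risk(risk_events):
    """Sum risk contributions per entity; return only entities that breach the threshold."""
    totals = defaultdict(int)
    for event in risk_events:
        totals[event["entity"]] += event["risk_score"]
    return {entity: score for entity, score in totals.items() if score >= RISK_THRESHOLD}

events = [
    {"entity": "jdoe", "risk_score": 40},   # e.g. anomalous login
    {"entity": "jdoe", "risk_score": 70},   # e.g. encoded PowerShell
    {"entity": "svc01", "risk_score": 20},  # low-grade noise, never fires
]
print(aggregate_risk(events))  # → {'jdoe': 110}
```

Two modest events on the same entity cross the threshold together, while scattered low-risk noise never does — that is the entire volume-reduction trick.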
Generation 3: System of Action (2024–Present)
The third generation is the agentic SIEM — platforms that don't just detect and enrich, but reason and act. CrowdStrike's Charlotte AI, Microsoft's Copilot for Security, Anomali's agentic SIEM architecture, and open-source frameworks like LangChain-based security agents are pushing SIEM from intelligence into autonomous operations.
The alert fatigue solution: Agents don't just score alerts — they investigate them. They correlate across data sources, query threat intel APIs, check historical baselines, assess blast radius, and either resolve the alert autonomously or escalate with a complete investigation package. The analyst reviews a finished investigation, not a raw alert.
Pro Tip: The generational shift isn't about replacing your SIEM — it's about layering agentic capabilities on top of your existing investment. Every Generation 1 and Generation 2 investment (data normalization, RBA tuning, enrichment pipelines) becomes the foundation for Generation 3.
What Makes a SOC "Agentic"?
The term "agentic" gets thrown around loosely. Let's define it precisely in the context of security operations.
An agentic SOC is an operating model where AI agents — autonomous software entities with defined goals, tools, and decision boundaries — perform the cognitive work of alert triage, investigation, and response recommendation. Unlike traditional SOAR playbooks, which follow predetermined decision trees, agentic systems exhibit three critical differentiators:
1. Context Retention
Traditional playbooks are stateless. Each execution starts from zero. An agentic system maintains context across investigations: it remembers that this user had a similar alert last week that was a false positive, that this endpoint was recently reimaged, that this IP appeared in a threat intel feed three days ago.
2. Adaptive Reasoning
Playbooks follow if/then/else logic trees designed at authoring time. Agents reason dynamically: "The alert says suspicious PowerShell execution. Let me check the command line arguments. This looks like an encoded payload. Let me decode it. The decoded content is a download cradle pointing to a domain. Let me check that domain against threat intel. It's clean, but it was registered 48 hours ago. Let me check passive DNS and WHOIS. The registrant matches a known bulletproof hosting pattern. Escalate as high-confidence."
3. Multi-Step Investigation
A SOAR playbook enriches an alert with a fixed set of lookups. An agent conducts a dynamic investigation: each step informs the next. If the first enrichment reveals something interesting, the agent follows that thread. If it's a dead end, the agent pivots. This mirrors how a skilled analyst actually works — except the agent does it in seconds.
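The difference can be reduced to a toy sketch (every function name here is hypothetical, purely for illustration): a playbook executes a step list fixed at authoring time, while an agent loop chooses each next step based on what the previous step found:

```python
def run_playbook(alert, steps):
    """Static SOAR-style playbook: a fixed enrichment sequence, order decided at authoring time."""
    return [step(alert) for step in steps]

def run_agent(alert, choose_next_step, max_steps=10):
    """Agentic loop: each finding informs the next step; stops on a dead end or verdict."""
    findings = []
    for _ in range(max_steps):
        step = choose_next_step(alert, findings)  # e.g. an LLM tool-selection call
        if step is None:                          # dead end or verdict reached
            break
        findings.append(step(alert))
    return findings

# Toy policy standing in for the LLM's reasoning:
def decode_payload(alert):
    return "decoded"

def check_domain(alert):
    return "newly_registered"

def toy_policy(alert, findings):
    if not findings:
        return decode_payload          # first: look at the payload
    if findings[-1] == "decoded":
        return check_domain            # the decode surfaced a domain — follow the thread
    return None                        # nothing left to chase

print(run_agent({"type": "powershell"}, toy_policy))  # follows the thread, then stops
```

The playbook always runs every step; the agent spends effort only where the evidence leads.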
Architecture Overview
The agentic SOC architecture has four layers:
graph TD
subgraph governanceLayer["GOVERNANCE LAYER"]
policyEngine["Policy Engine"]
approvalGates["Approval Gates"]
auditTrail["Audit Trail"]
end
subgraph orchestrationLayer["ORCHESTRATION LAYER"]
agentRouter["Agent Router"]
taskQueue["Task Queue"]
stateManager["State Manager"]
end
subgraph agentLayer["AGENT LAYER"]
triageAgent["Triage Agent"]
investigationAgent["Investigation Agent"]
responseAgent["Response Agent"]
end
subgraph dataFoundationLayer["DATA FOUNDATION LAYER"]
siem["SIEM"]
threatIntel["Threat Intel"]
cmdb["CMDB"]
edr["EDR"]
iam["IAM"]
end
Each layer has distinct responsibilities, and the governance layer wraps everything — no agent acts without policy enforcement.
The Four-Phase Agentic SOC Journey
Deploying an agentic SOC isn't a weekend project. It's a phased transformation that builds on existing investments. Each phase delivers standalone value while laying foundations for the next.
Phase 1: AI-Ready Foundation
Goal: Ensure your data is clean, normalized, and enriched enough for AI agents to reason over it. Garbage in, garbage out — this is doubly true for LLM-based agents that will make decisions based on your data quality.
The single most common failure mode for agentic SOC deployments is poor data quality. An AI agent that queries your SIEM and gets inconsistent field names, missing timestamps, or unnormalized asset identifiers will produce unreliable results.
CIM Compliance Validation
Splunk's Common Information Model (CIM) provides the normalized field vocabulary that agents depend on. Before deploying any AI capability, validate your CIM compliance:
| tstats count where index=* by index, sourcetype
| eval cim_model=case(
    like(sourcetype, "%sysmon%"), "Endpoint",
    like(sourcetype, "%wineventlog%"), "Endpoint",
    like(sourcetype, "%cloudtrail%"), "Change",
    like(sourcetype, "%linux_secure%"), "Authentication",
    like(sourcetype, "%firewall%"), "Network_Traffic",
    like(sourcetype, "%proxy%"), "Web",
    like(sourcetype, "%dns%"), "Network_Resolution",
    true(), "UNMAPPED"
)
| stats count by cim_model, index, sourcetype
| sort - count
Data Quality Scoring
Create a data quality index that agents can reference before making decisions:
| tstats count as total_events where index=* by sourcetype
| join sourcetype [
| tstats count as has_dest where index=* dest=* by sourcetype
]
| join sourcetype [
| tstats count as has_user where index=* user=* by sourcetype
]
| join sourcetype [
| tstats count as has_src where index=* src=* by sourcetype
]
| eval dest_coverage=round(has_dest/total_events*100, 1)
| eval user_coverage=round(has_user/total_events*100, 1)
| eval src_coverage=round(has_src/total_events*100, 1)
| eval quality_score=round((dest_coverage + user_coverage + src_coverage) / 3, 1)
| sort - quality_score
| table sourcetype total_events dest_coverage user_coverage src_coverage quality_score
Asset Identity Normalization
Agents need a single, authoritative identifier for every entity. Build an asset normalization lookup:
| inputlookup asset_lookup_by_str
| eval normalized_host=lower(dns)
| eval normalized_host=if(isnull(normalized_host), lower(nt_host), normalized_host)
| dedup normalized_host
| eval asset_id=md5(normalized_host)
| table asset_id normalized_host ip mac nt_host dns owner priority bunit category
| outputlookup asset_identity_normalized.csv
Log Enrichment Pipeline
Pre-enrich events at ingest time so agents don't have to make expensive lookups during investigation:
# Scheduled search: Enrich endpoint events with asset context
index=endpoint sourcetype=sysmon
| lookup asset_identity_normalized.csv ip as src_ip OUTPUT asset_id, owner, priority, bunit
| lookup threat_intel_ip_lookup.csv ip as dest_ip OUTPUT threat_category, threat_score, threat_source
| lookup geo_ip_lookup.csv ip as dest_ip OUTPUT country, city, asn, asn_org
| eval enrichment_time=now()
| collect index=enriched_endpoint
Pro Tip: Measure your "enrichment coverage" — the percentage of events that have all critical fields populated. Agents should not operate on data sources with less than 85% enrichment coverage. Below that threshold, route alerts to human analysts instead.
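That routing rule can be sketched directly — a minimal illustration, assuming events arrive as dicts and that `dest`, `user`, and `src` are your critical fields (swap in whatever your agents actually depend on):

```python
CRITICAL_FIELDS = ("dest", "user", "src")  # assumption: the fields your agents reason over
COVERAGE_FLOOR = 0.85                       # per the 85% threshold above

def enrichment_coverage(events, fields=CRITICAL_FIELDS):
    """Fraction of events in which every critical field is populated."""
    if not events:
        return 0.0
    complete = sum(1 for event in events if all(event.get(f) for f in fields))
    return complete / len(events)

def route(events):
    """Send well-enriched sources to agents; everything else goes to humans."""
    return "agent" if enrichment_coverage(events) >= COVERAGE_FLOOR else "human"

sample = [{"dest": "web01", "user": "jdoe", "src": "10.1.2.3"}] * 9 \
       + [{"dest": "web01", "user": None, "src": "10.1.2.3"}]
print(route(sample))  # 90% coverage → "agent"
```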
Phase 2: Autonomous Triage
Goal: Deploy AI agents that can autonomously classify, score, and route incoming alerts — suppressing confirmed false positives and prioritizing high-confidence threats for investigation.
The triage agent is the first agent most organizations deploy, and it delivers the highest immediate ROI. A well-tuned triage agent can autonomously close 60–70% of alerts, reducing analyst workload to a manageable volume.
Alert Scoring Configuration
Define scoring thresholds in a structured configuration that the triage agent references:
# triage_agent_config.yaml
agent:
  name: "soc-triage-agent"
  version: "1.2.0"
  model: "gpt-4o"
  temperature: 0.1          # Low temperature for consistent classification
  max_reasoning_steps: 10

scoring:
  thresholds:
    auto_close: 15          # Score <= 15: auto-close as benign
    low_priority: 40        # Score 16-40: queue for batch review
    medium_priority: 70     # Score 41-70: standard investigation
    high_priority: 90       # Score 71-90: priority investigation
    critical_escalate: 100  # Score 91+: immediate escalation
  factors:
    asset_criticality:
      crown_jewel: 25
      high: 15
      medium: 10
      low: 5
    user_risk:
      privileged_admin: 20
      service_account: 15
      standard_user: 5
      contractor: 10
    threat_intel_match:
      known_apt: 30
      known_malware: 25
      suspicious_ioc: 15
      clean: 0
    historical_context:
      first_seen_behavior: 15
      recurring_false_positive: -20
      similar_confirmed_incident: 25
      recently_investigated_benign: -15
    time_context:
      outside_business_hours: 10
      holiday_weekend: 15
      during_change_window: -10

false_positive_patterns:
  - name: "vulnerability_scanner"
    conditions:
      src_ip_in: "scanner_allowlist"
      dest_port_in: [80, 443, 8080, 8443]
    action: "auto_close"
    confidence: 0.95
  - name: "admin_scheduled_task"
    conditions:
      user_in: "admin_group"
      process_name_in: ["schtasks.exe", "at.exe"]
      time_window: "change_window"
    action: "auto_close"
    confidence: 0.90
  - name: "edr_update_noise"
    conditions:
      source: "endpoint_protection"
      signature_match: "PUA.*adware|PUP.*toolbar"
    action: "auto_close"
    confidence: 0.95

escalation:
  critical_path:
    - notify: "soc_lead"
      method: ["slack", "pagerduty"]
      timeout_minutes: 5
    - notify: "incident_commander"
      method: ["pagerduty", "phone"]
      timeout_minutes: 15
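To make the scoring concrete, here is a minimal sketch of how a triage agent might apply additive factors and map the total onto verdict thresholds. The dicts below are a trimmed, hand-copied mirror of the configuration values above, not a YAML loader:

```python
# Trimmed mirror of triage_agent_config.yaml (illustrative subset, not a parser)
THRESHOLDS = {"auto_close": 15, "low_priority": 40, "medium_priority": 70, "high_priority": 90}
FACTORS = {
    "asset_criticality": {"crown_jewel": 25, "high": 15, "medium": 10, "low": 5},
    "user_risk": {"privileged_admin": 20, "standard_user": 5},
    "threat_intel_match": {"known_apt": 30, "clean": 0},
    "historical_context": {"recurring_false_positive": -20},
}

def score_alert(attributes):
    """Sum the configured weight for each factor value present on the alert."""
    return sum(FACTORS[factor][value]
               for factor, value in attributes.items()
               if value in FACTORS.get(factor, {}))

def verdict(score):
    """Map a total score onto the verdict bands defined by the thresholds."""
    if score <= THRESHOLDS["auto_close"]:
        return "auto_close"
    if score <= THRESHOLDS["low_priority"]:
        return "low_priority"
    if score <= THRESHOLDS["medium_priority"]:
        return "investigate"
    if score <= THRESHOLDS["high_priority"]:
        return "escalate"
    return "critical"

alert = {"asset_criticality": "high", "user_risk": "privileged_admin",
         "threat_intel_match": "known_apt"}
total = score_alert(alert)  # 15 + 20 + 30 = 65
print(verdict(total))       # → investigate
```

Note how a negative `historical_context` factor (a recurring false positive) can pull an otherwise suspicious alert back under the auto-close line — that is the point of additive scoring.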
Triage Agent Implementation
Here's a production-grade triage agent using a tool-calling LLM pattern:
"""soc_triage_agent.py — Autonomous alert triage with LLM reasoning."""
import json, hashlib, logging
from datetime import datetime, timezone
import yaml
from openai import OpenAI
logger = logging.getLogger("soc_triage_agent")
# AlertVerdict(Enum): AUTO_CLOSE | LOW_PRIORITY | INVESTIGATE | ESCALATE | CRITICAL
# TriageResult(dataclass): alert_id, verdict, score, reasoning,
# enrichments, recommended_actions, confidence, processing_time_ms
class TriageAgent:
"""Scores and classifies SIEM alerts via LLM tool-calling loop."""
# SYSTEM_PROMPT — Tier 1 SOC analyst persona + rules (fail-open, check TI)
# TOOL_DEFINITIONS — Function-calling schemas: query_siem,
# check_threat_intel, get_asset_context, get_user_context,
# check_alert_history, submit_verdict
SYSTEM_PROMPT = "..." # Full prompt omitted for brevity
TOOL_DEFINITIONS = [...] # Six tool schemas omitted for brevity
def __init__(self, config_path: str):
with open(config_path) as f:
self.config = yaml.safe_load(f)
self.client = OpenAI()
self.model = self.config["agent"]["model"]
self.tool_handlers = {
"query_siem": self._handle_siem_query,
"check_threat_intel": self._handle_threat_intel,
"get_asset_context": self._handle_asset_context,
"get_user_context": self._handle_user_context,
"check_alert_history": self._handle_alert_history,
"submit_verdict": self._handle_verdict,
}
def triage_alert(self, alert: dict) -> TriageResult:
"""Core agentic loop — LLM calls tools autonomously until verdict."""
start = datetime.now(timezone.utc)
alert_id = alert.get("id", hashlib.md5(
json.dumps(alert, sort_keys=True).encode()).hexdigest()[:12])
self._enrichments, self._verdict = {}, None
messages = [
{"role": "system", "content": self.SYSTEM_PROMPT},
{"role": "user", "content":
f"Triage this alert:\n```json\n{json.dumps(alert, indent=2)}\n```"}
]
for _ in range(self.config["agent"].get("max_reasoning_steps", 10)):
resp = self.client.chat.completions.create(
model=self.model, messages=messages,
tools=self.TOOL_DEFINITIONS, tool_choice="auto",
temperature=self.config["agent"].get("temperature", 0.1))
msg = resp.choices[0].message
messages.append(msg.model_dump())
if not msg.tool_calls:
break
for tc in msg.tool_calls:
handler = self.tool_handlers.get(tc.function.name)
args = json.loads(tc.function.arguments)
result = handler(**args) if handler else {"error": "unknown tool"}
messages.append({"role": "tool", "tool_call_id": tc.id,
"content": json.dumps(result)})
if self._verdict:
break
if not self._verdict:
self._verdict = {"verdict": "escalate", "score": 75,
"reasoning": "No convergence — escalating.", "confidence": 0.3}
elapsed_ms = int((datetime.now(timezone.utc) - start).total_seconds() * 1000)
return TriageResult(alert_id=alert_id,
verdict=AlertVerdict(self._verdict["verdict"]),
score=self._verdict["score"], reasoning=self._verdict["reasoning"],
enrichments=self._enrichments, confidence=self._verdict["confidence"],
processing_time_ms=elapsed_ms)
# ── Tool Handlers (integrate with your infrastructure) ──────────
# _handle_siem_query(query, time_range) — SPL via Splunk REST
# _handle_threat_intel(indicator, type) — IOC lookup against TI feeds
# _handle_asset_context(asset_id) — CMDB metadata retrieval
# _handle_user_context(username) — IAM user risk profile
# _handle_alert_history(detection, entity) — Prior outcomes & FP rates
# _handle_verdict(**kwargs) — Capture final verdict
Splunk Integration: Feeding the Triage Agent
Create a modular input or scripted alert action that feeds new notables to the triage agent:
"""splunk_triage_bridge.py — Polls Splunk for new notables,
dispatches to triage agent, writes results back to a triage index."""
import time, json, logging
import splunklib.client as client
import splunklib.results as results
from soc_triage_agent import TriageAgent
logger = logging.getLogger("splunk_triage_bridge")
SPLUNK_CONFIG = {
"host": "localhost", "port": 8089,
"username": "svc_triage_agent",
"password": "VAULT_REFERENCE", # Use a secrets manager in production
}
NOTABLE_QUERY = """
search index=notable earliest=-5m latest=now
NOT [| inputlookup triaged_alerts.csv | fields rule_id]
| fields _time, rule_name, rule_id, src, dest, user, severity,
urgency, security_domain, risk_score
| head 50
"""
def poll_and_triage():
"""Main loop: fetch new notables, triage via agent, write results."""
agent = TriageAgent("triage_agent_config.yaml")
service = client.connect(**SPLUNK_CONFIG)
while True:
try:
job = service.jobs.create(NOTABLE_QUERY)
while not job.is_done():
time.sleep(1)
for result in results.JSONResultsReader(
job.results(output_mode="json")):
if isinstance(result, dict):
triage_result = agent.triage_alert(result)
logger.info(f"Alert {triage_result.alert_id}: "
f"{triage_result.verdict.value} "
f"(score={triage_result.score})")
write_triage_result(service, triage_result)
except Exception as e:
logger.error(f"Triage polling error: {e}")
time.sleep(300) # Poll every 5 minutes
# write_triage_result(service, result) — Serializes TriageResult as JSON
# event → soc_triage index (sourcetype=soc:triage:result) for audit trail
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
poll_and_triage()
Pro Tip: Start with the noisiest detection rules first. Identify your top 10 alert generators (they typically account for 60–80% of alert volume), and configure the triage agent to handle those. This delivers visible ROI within the first week while you tune and expand coverage.
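Picking those top generators is a simple cumulative-volume cut — a sketch, assuming you can export per-rule alert counts from your SIEM as a dict:

```python
def top_alert_generators(rule_counts, coverage=0.8):
    """Return the smallest set of rules (loudest first) covering `coverage` of total volume."""
    total = sum(rule_counts.values())
    selected, running = [], 0
    for rule, count in sorted(rule_counts.items(), key=lambda kv: -kv[1]):
        selected.append(rule)
        running += count
        if running / total >= coverage:
            break
    return selected

# Hypothetical 30-day counts per detection rule:
counts = {"Brute Force Detected": 700, "Suspicious PowerShell": 200,
          "Rare Process": 50, "DNS Tunneling": 50}
print(top_alert_generators(counts))  # two rules cover 90% of volume
```

Pointing the triage agent at that short list first is how you get the "visible ROI in week one" the tip describes.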
Phase 3: Intelligent Investigation
Goal: Deploy investigation agents that autonomously gather evidence, correlate across data sources, and build investigation narratives — delivering a complete investigation package to analysts instead of a raw alert.
Once triage separates signal from noise, investigation agents take the medium-to-high priority alerts and conduct full investigations. This is where the agentic approach truly differentiates from SOAR: the investigation is dynamic, not scripted.
Investigation Agent Workflow
The investigation agent follows a structured but adaptive workflow:
- Alert Intake — Receive triaged alert with initial enrichments from triage agent
- Scope Assessment — Determine what data sources and tools are relevant
- Evidence Collection — Query SIEM, EDR, identity systems, network tools
- Threat Intel Correlation — Check all indicators against multiple TI sources
- Behavioral Analysis — Compare current activity against historical baselines
- Timeline Construction — Build a chronological event timeline
- Impact Assessment — Estimate blast radius and business impact
- Recommendation — Provide verdict with confidence score and response options
Multi-Source Enrichment Pipeline
"""investigation_agent.py — Multi-source enrichment and investigation.
Correlates across SIEM, EDR, threat intel, and identity systems."""
import asyncio, logging
from dataclasses import dataclass
logger = logging.getLogger("investigation_agent")
@dataclass
class InvestigationContext:
"""Accumulated evidence across all enrichment sources."""
alert_id: str
timeline: list
indicators: dict
affected_assets: list
affected_users: list
threat_intel_hits: list
behavioral_anomalies: list
blast_radius: dict
confidence: float
narrative: str = ""
class InvestigationAgent:
"""Orchestrates multi-source investigations with dynamic enrichment."""
def __init__(self, config: dict):
self.config = config
self.enrichment_sources = { # Pluggable enrichment backends
"siem": SIEMEnrichment(config["siem"]),
"edr": EDREnrichment(config["edr"]),
"threat_intel": ThreatIntelEnrichment(config["threat_intel"]),
"identity": IdentityEnrichment(config["identity"]),
"network": NetworkEnrichment(config["network"]),
}
async def investigate(self, triaged_alert: dict) -> InvestigationContext:
"""Five-phase adaptive investigation on a triaged alert."""
ctx = InvestigationContext(
alert_id=triaged_alert["alert_id"], timeline=[],
indicators={}, affected_assets=[], affected_users=[],
threat_intel_hits=[], behavioral_anomalies=[],
blast_radius={}, confidence=0.0)
# Phase 1: Parallel context gathering (asset, user, network)
await asyncio.gather(
self._enrich_asset_context(triaged_alert, ctx),
self._enrich_user_context(triaged_alert, ctx),
self._enrich_network_context(triaged_alert, ctx))
# Phase 2: Check all extracted IOCs against threat intel
indicators = self._extract_indicators(triaged_alert, ctx)
await asyncio.gather(
*[self._check_indicator(ioc, ctx) for ioc in indicators])
# Phase 3–5: Behavioral analysis → timeline → narrative
await self._behavioral_analysis(triaged_alert, ctx)
await self._build_timeline(triaged_alert, ctx)
await self._assess_blast_radius(ctx)
ctx.narrative = await self._generate_narrative(ctx)
return ctx
# ── Enrichment Methods ──────────────────────────────────────────
# _enrich_asset_context(alert, ctx) — CMDB lookup + recent changes (7d)
# _enrich_user_context(alert, ctx) — Identity, risk, auth patterns (7d)
# _enrich_network_context(alert, ctx) — DNS lookups + traffic patterns (24h)
# _extract_indicators(alert, ctx) — Extract IPs, domains, hashes
# _check_indicator(ioc, ctx) — TI lookup; append malicious hits
# _behavioral_analysis(alert, ctx) — Current vs 30-day process baseline
# _build_timeline(alert, ctx) — Chronological notable/risk events (48h)
# _assess_blast_radius(ctx) — Scope: isolated / department / enterprise
# _generate_narrative(ctx) — LLM-generated investigation summary
# _is_rfc1918(ip) — RFC 1918 private range check
Pro Tip: Investigation agents should be measured on "enrichment completeness" — the percentage of available context they actually gather before making a recommendation. Target 90%+ enrichment completeness for high-priority alerts. An agent that skips threat intel checks or doesn't pull user context is making decisions with an incomplete picture.
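One way to compute that metric — a sketch, assuming a fixed set of expected enrichment sources and a record of which sources the agent actually consulted (both the source names and the set are illustrative):

```python
EXPECTED_SOURCES = {"asset", "user", "network", "threat_intel", "behavioral", "timeline"}

def enrichment_completeness(gathered_sources):
    """Fraction of expected enrichment sources the agent actually consulted."""
    gathered = EXPECTED_SOURCES & set(gathered_sources)
    return len(gathered) / len(EXPECTED_SOURCES)

# Agent pulled five of the six expected sources; it skipped the timeline build:
consulted = ["asset", "user", "network", "threat_intel", "behavioral"]
print(f"{enrichment_completeness(consulted):.0%}")  # 83% — below the 90% target
```

Logging this per investigation makes it easy to alert on agents that habitually skip a source.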
Phase 4: Coordinated Response
Goal: Deploy response agents that can execute containment and remediation actions — with governance guardrails that enforce human approval for high-impact actions.
This is where the agentic SOC delivers its most visible value — and where governance becomes non-negotiable. Response agents must never operate without policy-driven guardrails.
Multi-Agent Response Orchestration
Response actions are coordinated by a central orchestrator that evaluates the investigation context and dispatches actions to specialized response agents:
"""response_orchestrator.py — Multi-agent response with approval gates."""
import logging
from dataclasses import dataclass
from enum import Enum
logger = logging.getLogger("response_orchestrator")
class RiskLevel(Enum):
LOW = "low" # Auto-execute
MEDIUM = "medium" # Execute + notify
HIGH = "high" # SOC lead approval
CRITICAL = "critical" # CISO/IC approval
class ActionStatus(Enum):
PENDING_APPROVAL = "pending_approval"
APPROVED = "approved"
EXECUTED = "executed"
DENIED = "denied"
FAILED = "failed"
@dataclass
class ResponseAction:
action_type: str
target: str
parameters: dict
risk_level: RiskLevel
justification: str
status: ActionStatus = ActionStatus.PENDING_APPROVAL
class ResponseOrchestrator:
"""Coordinates response actions with policy-driven approval gates."""
ACTION_RISK_MAP = {
"quarantine_file": RiskLevel.LOW,
"block_ip_firewall": RiskLevel.MEDIUM,
"kill_process": RiskLevel.MEDIUM,
"block_domain_dns": RiskLevel.MEDIUM,
"isolate_endpoint": RiskLevel.HIGH,
"disable_user_account": RiskLevel.HIGH,
"revoke_session_tokens": RiskLevel.HIGH,
"network_segment_isolate": RiskLevel.CRITICAL,
}
def __init__(self, config: dict):
self.approval_engine = ApprovalEngine(config["governance"])
self.audit_logger = AuditLogger(config["audit"])
self.response_agents = {
"endpoint": EndpointResponseAgent(config["edr"]),
"network": NetworkResponseAgent(config["firewall"]),
"identity": IdentityResponseAgent(config["iam"]),
}
async def coordinate_response(self, investigation_context,
recommended_actions: list) -> list:
"""Route actions through governance gates; execute if approved."""
results = []
for spec in recommended_actions:
action = ResponseAction(
action_type=spec["type"], target=spec["target"],
parameters=spec.get("parameters", {}),
risk_level=self.ACTION_RISK_MAP.get(
spec["type"], RiskLevel.HIGH),
justification=spec.get("justification", ""))
self.audit_logger.log_proposed_action(action, investigation_context)
approval = await self.approval_engine.check_approval(action)
if approval.auto_approved:
action.status = ActionStatus.APPROVED
result = await self._execute_action(action)
action.status = (ActionStatus.EXECUTED if result["success"]
else ActionStatus.FAILED)
elif approval.requires_human:
action.status = ActionStatus.PENDING_APPROVAL
await self._request_approval(action, investigation_context)
else:
action.status = ActionStatus.DENIED
self.audit_logger.log_action_result(action)
results.append(action)
return results
# ── Supporting Methods ──────────────────────────────────────────
# _execute_action(action) — Route to endpoint/network/identity agent
# _request_approval(action, context) — Notify via Slack/PagerDuty/Teams
Governance Guardrail Configuration
The governance layer is the most critical component of the response system. Define clear policies for what agents can do autonomously versus what requires human approval:
# governance_policy.yaml
governance:
  version: "2.0"
  effective_date: "2025-12-01"
  last_review: "2025-11-15"
  next_review: "2026-03-01"

  # Global constraints — override all other policies
  global_constraints:
    max_auto_actions_per_hour: 50
    max_auto_actions_per_incident: 5
    require_investigation_context: true
    minimum_confidence_for_action: 0.80
    business_hours_only_for_critical: false
    dry_run_mode: false  # Set true during initial deployment

  # Risk-based approval matrix
  approval_matrix:
    low:
      auto_approve: true
      notification: "soc_channel"
      timeout_minutes: null
      examples:
        - "quarantine_file"
        - "add_watchlist_entry"
        - "create_ticket"
    medium:
      auto_approve: true
      notification: ["soc_channel", "soc_lead"]
      timeout_minutes: null
      require_rollback_plan: true
      examples:
        - "block_ip_firewall"
        - "block_domain_dns"
        - "kill_process"
    high:
      auto_approve: false
      required_approvers: ["soc_lead"]
      notification: ["soc_channel", "incident_commander"]
      timeout_minutes: 30
      escalation_on_timeout: "incident_commander"
      require_rollback_plan: true
      examples:
        - "isolate_endpoint"
        - "disable_user_account"
        - "revoke_session_tokens"
    critical:
      auto_approve: false
      required_approvers: ["incident_commander", "ciso"]
      notification: ["soc_channel", "exec_channel", "pagerduty"]
      timeout_minutes: 15
      escalation_on_timeout: "ciso"
      require_rollback_plan: true
      require_business_impact_assessment: true
      examples:
        - "network_segment_isolate"
        - "full_incident_response"
        - "enterprise_password_reset"

  # Asset-specific overrides
  asset_overrides:
    crown_jewels:
      pattern: "category=crown_jewel OR priority=critical"
      minimum_risk_level: "high"  # No auto-actions on critical assets
      required_approvers: ["soc_lead", "asset_owner"]
    production_databases:
      pattern: "category=database AND environment=production"
      minimum_risk_level: "critical"  # Require IC + CISO for any action
      required_approvers: ["incident_commander", "dba_lead", "ciso"]
    domain_controllers:
      pattern: "category=domain_controller"
      minimum_risk_level: "high"
      required_approvers: ["soc_lead", "ad_admin"]

  # Time-based restrictions
  time_restrictions:
    change_freeze_periods:
      - start: "2025-12-20T00:00:00Z"
        end: "2026-01-02T23:59:59Z"
        policy: "no_auto_actions"
        reason: "Holiday change freeze"
    maintenance_windows:
      - cron: "0 2 * * 0"  # Sundays at 2 AM
        duration_hours: 4
        policy: "elevated_auto_approve"
        reason: "Scheduled maintenance window"

  # Audit and compliance
  audit:
    log_all_decisions: true
    log_enrichment_data: true
    retention_days: 365
    export_format: "json"
    compliance_frameworks: ["SOC2", "NIST-800-53", "FedRAMP"]
Pro Tip: Deploy Phase 4 in "dry run" mode for the first 30 days. The response orchestrator evaluates all actions and logs what it would do, but doesn't execute. Review the decision log weekly with your SOC team. This builds trust and catches policy gaps before they become incidents.
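A dry-run gate can be as small as one wrapper around the executor — an illustrative sketch, not the orchestrator's actual interface (the `action` dict shape and `executor` callable are assumptions):

```python
import json
import logging

logger = logging.getLogger("response_orchestrator")

def execute_or_log(action, executor, dry_run=True):
    """In dry-run mode, log the would-be action instead of executing it."""
    record = {"action": action["type"], "target": action["target"], "dry_run": dry_run}
    if dry_run:
        # Nothing touches production — the decision log is the only output
        logger.info("DRY RUN — would execute: %s", json.dumps(record))
        return {"success": True, "executed": False}
    result = executor(action)  # real EDR/firewall/IAM call in production
    return {"success": result.get("success", False), "executed": True}
```

Flipping `dry_run` to `False` (mirroring `dry_run_mode` in the governance policy) is then a one-line, auditable change after the 30-day review period.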
Measuring Agentic SOC Performance
You can't improve what you don't measure. An agentic SOC demands a new set of KPIs that go beyond traditional metrics. Here are the metrics that matter, with Splunk queries to track them.
Core KPIs
1. Mean Time to Triage (MTTT)
index=soc_triage sourcetype="soc:triage:result"
| eval triage_time_sec=processing_time_ms/1000
| stats
avg(triage_time_sec) as avg_mttt,
median(triage_time_sec) as median_mttt,
perc95(triage_time_sec) as p95_mttt,
count as total_triaged
by verdict
| eval avg_mttt=round(avg_mttt, 1)
| eval median_mttt=round(median_mttt, 1)
| eval p95_mttt=round(p95_mttt, 1)
| sort verdict
2. Autonomous Resolution Rate
index=soc_triage sourcetype="soc:triage:result"
| stats
count(eval(verdict="auto_close")) as auto_closed,
count(eval(verdict="low_priority")) as low_priority,
count(eval(verdict="investigate" OR verdict="escalate" OR verdict="critical")) as human_required,
count as total
| eval auto_resolution_pct=round(auto_closed/total*100, 1)
| eval human_intervention_pct=round(human_required/total*100, 1)
3. False Positive Suppression Accuracy
index=soc_triage sourcetype="soc:triage:result" verdict="auto_close"
| join alert_id [
search index=soc_feedback sourcetype="soc:analyst:feedback"
| fields alert_id, analyst_verdict, feedback_correct
]
| stats
count as total_auto_closed,
count(eval(feedback_correct=="true")) as correct,
count(eval(feedback_correct=="false")) as incorrect
| eval accuracy_pct=round(correct/total_auto_closed*100, 2)
| eval false_negative_rate=round(incorrect/total_auto_closed*100, 2)
4. MTTR Comparison: Before and After
index=notable
| eval era=if(_time < relative_time(now(), "-90d"), "before_agentic", "after_agentic")
| eval resolution_time=if(isnotnull(status_end), status_end - _time, null())
| stats
avg(resolution_time) as avg_mttr,
median(resolution_time) as median_mttr,
perc95(resolution_time) as p95_mttr
by era
| eval avg_mttr_hours=round(avg_mttr/3600, 1)
| eval median_mttr_hours=round(median_mttr/3600, 1)
| eval p95_mttr_hours=round(p95_mttr/3600, 1)
| fields era, avg_mttr_hours, median_mttr_hours, p95_mttr_hours
5. Analyst Time Savings
index=soc_triage sourcetype="soc:triage:result"
| eval estimated_manual_minutes=case(
    verdict=="auto_close", 5,
    verdict=="low_priority", 10,
    verdict=="investigate", 30,
    verdict=="escalate", 45,
    verdict=="critical", 60
)
| eval agent_minutes=processing_time_ms/60000
| eval time_saved_minutes=estimated_manual_minutes - agent_minutes
| stats
sum(time_saved_minutes) as total_minutes_saved,
count as total_alerts
| eval hours_saved=round(total_minutes_saved/60, 1)
| eval fte_equivalent=round(hours_saved/(8*22), 1)
| eval cost_savings=fte_equivalent * 120000
Executive Dashboard Query
Build a single-pane executive view:
index=soc_triage sourcetype="soc:triage:result" earliest=-30d
| stats
count as total_alerts,
count(eval(verdict="auto_close")) as auto_resolved,
avg(processing_time_ms) as avg_triage_ms,
avg(confidence) as avg_confidence,
avg(score) as avg_score
| eval auto_resolution_rate=round(auto_resolved/total_alerts*100, 1)
| eval avg_triage_seconds=round(avg_triage_ms/1000, 1)
| eval avg_confidence=round(avg_confidence*100, 1)
| eval alerts_per_day=round(total_alerts/30, 0)
| eval analyst_hours_saved=round(auto_resolved*5/60, 0)
| table alerts_per_day, auto_resolution_rate, avg_triage_seconds, avg_confidence, analyst_hours_saved
Pro Tip: Create a weekly "Agent Accuracy Review" where a senior analyst randomly samples 20 auto-closed alerts and 10 escalated alerts. Track the agreement rate over time. Your target is 95%+ agreement on auto-closes and less than 2% false negatives (real threats that were auto-closed).
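Scoring the weekly review sample is straightforward — a sketch, assuming each sample is an `(agent_verdict, analyst_verdict)` pair recorded during the review:

```python
def agreement_metrics(samples):
    """samples: list of (agent_verdict, analyst_verdict) pairs from the weekly review."""
    agree = sum(1 for agent, analyst in samples if agent == analyst)
    # A false negative is a real threat the agent auto-closed
    false_negatives = sum(1 for agent, analyst in samples
                          if agent == "auto_close" and analyst != "auto_close")
    auto_closed = sum(1 for agent, _ in samples if agent == "auto_close")
    return {
        "agreement_rate": agree / len(samples),
        "false_negative_rate": false_negatives / auto_closed if auto_closed else 0.0,
    }

# Hypothetical review: 19 of 20 auto-closes confirmed, 1 disputed
week = [("auto_close", "auto_close")] * 19 + [("auto_close", "escalate")]
print(agreement_metrics(week))  # → {'agreement_rate': 0.95, 'false_negative_rate': 0.05}
```

Trend these two numbers week over week against the 95% / 2% targets above; a falling agreement rate is your earliest signal of model or config drift.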
Governance & Guardrails — The Cymantis View
At Cymantis, we believe the agentic SOC only succeeds with governance as a first-class architectural concern — not an afterthought bolted on after deployment. Here is how we recommend implementing governance across the four phases.
Principle 1: Graduated Autonomy
Never deploy agents at full autonomy from day one. Follow the graduated autonomy model:
graph LR
subgraph phase1["Phase 1: Shadow"]
p1Autonomy["Autonomy: 0%"]
p1AI["AI: Observe only"]
p1Human["Human: Full control"]
p1Duration["Duration: 2–4 weeks"]
end
subgraph phase2["Phase 2: Advisor"]
p2Autonomy["Autonomy: 25%"]
p2AI["AI: Recommends"]
p2Human["Human: Decides/acts"]
p2Duration["Duration: 4–8 weeks"]
end
subgraph phase3["Phase 3: Co-pilot"]
p3Autonomy["Autonomy: 50%"]
p3AI["AI: Low-risk auto"]
p3Human["Human: Escalations"]
p3Duration["Duration: 8–12 weeks"]
end
subgraph phase4["Phase 4: Autonomous"]
p4Autonomy["Autonomy: 75%"]
p4AI["AI: Policy-governed"]
p4Human["Human: Oversight & exceptions"]
p4Duration["Duration: Ongoing"]
end
subgraph phase5["Phase 5: Full Trust"]
p5Autonomy["Autonomy: 90%+"]
p5AI["AI: Full autonomy"]
p5Human["Human: Exceptions only"]
p5Duration["Duration: Mature"]
end
phase1 --> phase2
phase2 --> phase3
phase3 --> phase4
phase4 --> phase5
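The phase boundaries above are only real if they are enforced in code rather than in a slide deck. One minimal sketch of such a gate, mapping the current phase to the actions an agent may execute without human approval (the phase names come from the diagram; the action categories are illustrative assumptions):

```python
# Actions each phase permits the agent to execute without human approval.
PHASE_PERMISSIONS = {
    "shadow":     set(),                                    # observe only
    "advisor":    set(),                                    # recommend only
    "co_pilot":   {"auto_close", "enrich"},                 # low-risk actions
    "autonomous": {"auto_close", "enrich", "quarantine_file"},
    "full_trust": {"auto_close", "enrich", "quarantine_file", "isolate_endpoint"},
}

def gate(phase: str, action: str) -> str:
    """Return 'execute' if the action is allowed autonomously in this phase,
    otherwise 'require_approval' (the agent may still recommend the action)."""
    allowed = PHASE_PERMISSIONS.get(phase, set())
    return "execute" if action in allowed else "require_approval"

print(gate("shadow", "auto_close"))          # require_approval
print(gate("co_pilot", "auto_close"))        # execute
print(gate("co_pilot", "isolate_endpoint"))  # require_approval
```

Note that an unknown phase defaults to the empty permission set — failing closed is the right default for an autonomy gate.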
Principle 2: Explainability Is Non-Negotiable
Every agent decision must be traceable. The audit trail should answer:
- What — What action was taken (or recommended)?
- Why — What evidence led to this decision? What was the reasoning chain?
- Who — Which agent made the decision? What model version?
- When — Exact timestamps for every step.
- What if — What would have happened with a different threshold?
Audit Trail Schema
{
"event_id": "triage-2025-12-01-a3f8c2",
"timestamp": "2025-12-01T14:32:18.445Z",
"agent": {
"name": "soc-triage-agent",
"version": "1.2.0",
"model": "gpt-4o-2025-08-06",
"config_hash": "sha256:9f3a..."
},
"alert": {
"id": "notable-28847",
"title": "Suspicious PowerShell Encoded Command",
"source": "ESCU - Malicious PowerShell - Rule",
"severity": "high"
},
"reasoning_chain": [
{
"step": 1,
"action": "get_asset_context",
"input": {"asset_identifier": "WKS-FIN-042"},
"output": {"criticality": "high", "owner": "jsmith", "bunit": "Finance"},
"duration_ms": 145
},
{
"step": 2,
"action": "check_threat_intel",
"input": {"indicator": "invoke-mimikatz.ps1", "indicator_type": "filename"},
"output": {"malicious": true, "sources": ["MITRE", "AlienVault"]},
"duration_ms": 890
},
{
"step": 3,
"action": "check_alert_history",
"input": {"detection_name": "Malicious PowerShell", "entity": "WKS-FIN-042"},
"output": {"total_occurrences": 0, "false_positive_rate": 0.0},
"duration_ms": 230
}
],
"verdict": {
"decision": "escalate",
"score": 88,
"confidence": 0.92,
"reasoning": "High-criticality Finance asset executing encoded PowerShell matching known Mimikatz signature. First occurrence on this asset. No false positive history. Escalating for immediate investigation.",
"recommended_actions": ["isolate_endpoint", "investigate_lateral_movement"]
},
"policy_applied": "governance_policy_v2.0",
"total_processing_ms": 3420
}
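Whatever schema you settle on, validate records at write time so a malformed trail is caught immediately rather than during an audit. A hypothetical validator for records shaped like the example above (field names match that example; this is not a standard library):

```python
REQUIRED_TOP_LEVEL = {"event_id", "timestamp", "agent", "alert",
                      "reasoning_chain", "verdict", "policy_applied",
                      "total_processing_ms"}

def validate_audit_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in record]
    # Steps must be contiguous and each must carry the full action context.
    for i, step in enumerate(record.get("reasoning_chain", []), start=1):
        if step.get("step") != i:
            problems.append(f"reasoning_chain out of order at position {i}")
        for key in ("action", "input", "output", "duration_ms"):
            if key not in step:
                problems.append(f"step {i} missing {key}")
    verdict = record.get("verdict", {})
    if not 0.0 <= verdict.get("confidence", -1.0) <= 1.0:
        problems.append("verdict.confidence must be in [0, 1]")
    return problems

record = {
    "event_id": "triage-2025-12-01-a3f8c2", "timestamp": "2025-12-01T14:32:18Z",
    "agent": {"name": "soc-triage-agent"}, "alert": {"id": "notable-28847"},
    "reasoning_chain": [{"step": 1, "action": "get_asset_context",
                         "input": {}, "output": {}, "duration_ms": 145}],
    "verdict": {"decision": "escalate", "confidence": 0.92},
    "policy_applied": "governance_policy_v2.0", "total_processing_ms": 3420,
}
print(validate_audit_record(record))  # []
```

Rejecting a decision that cannot be logged is the stricter (and safer) policy: no audit record, no action.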
Principle 3: Feedback Loops Are Mandatory
Agents improve through feedback. Implement structured feedback channels:
# feedback_policy.yaml
feedback:
# Analyst feedback on auto-closed alerts
auto_close_review:
sample_rate: 0.10 # Review 10% of auto-closes
review_sla_hours: 48
feedback_fields:
- correct_verdict: boolean
- should_have_been: enum[auto_close, investigate, escalate]
- notes: string
# Mandatory review for high-confidence escalations
escalation_review:
sample_rate: 1.0 # Review 100% of escalations
review_sla_hours: 24
feedback_fields:
- correct_verdict: boolean
- actual_severity: enum[false_positive, low, medium, high, critical]
- investigation_quality: enum[incomplete, adequate, thorough]
- notes: string
# Weekly model performance review
weekly_review:
metrics_tracked:
- auto_close_accuracy
- escalation_precision
- false_negative_rate
- average_confidence_calibration
threshold_alerts:
auto_close_accuracy_below: 0.93
false_negative_rate_above: 0.03
confidence_calibration_drift: 0.10
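The sample_rate values above need a sampler behind them, and it is worth making that sampler deterministic: hash the event ID instead of calling a random generator, so the same alert always yields the same review decision and the sample is reproducible after the fact. A sketch under those assumptions (the policy values mirror the YAML; the function itself is illustrative, not part of any product):

```python
import hashlib

# Per-verdict review rates, mirroring feedback_policy.yaml.
POLICY = {
    "auto_close": 0.10,   # review 10% of auto-closes
    "escalate":   1.00,   # review 100% of escalations
}

def needs_review(event_id: str, verdict: str) -> bool:
    """Deterministic sampling: the same event_id always gets the same decision."""
    rate = POLICY.get(verdict, 0.0)
    digest = hashlib.sha256(event_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # uniform in [0, 1)
    return bucket < rate

print(needs_review("triage-2025-12-01-a3f8c2", "escalate"))  # True (rate is 1.0)
```

Determinism also means two pipelines evaluating the same policy agree on which alerts to queue, without sharing state.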
Principle 4: Kill Switches and Circuit Breakers
Every agentic system needs an emergency stop:
# circuit_breakers.yaml
circuit_breakers:
# Anomaly detection on agent behavior
auto_close_rate_spike:
metric: "auto_close_rate_1h"
baseline: 0.65
threshold: 0.85 # If auto-close rate exceeds 85%, halt
action: "pause_triage_agent"
notification: ["soc_lead", "engineering"]
resume: "manual"
# Response action rate limiting
response_action_flood:
metric: "response_actions_1h"
threshold: 50
action: "pause_response_agents"
notification: ["incident_commander", "ciso"]
resume: "manual"
# Model degradation detection
confidence_drift:
metric: "avg_confidence_24h"
baseline: 0.82
threshold_low: 0.60 # If avg confidence drops, something is wrong
action: "switch_to_advisor_mode"
notification: ["soc_lead", "ml_engineering"]
resume: "after_review"
# Global emergency stop
emergency_stop:
trigger: "manual"
action: "halt_all_agents"
notification: ["all_soc", "ciso", "cto"]
resume: "ciso_approval"
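Config like the above is only useful if something evaluates it on a schedule. A minimal evaluation pass might look like this sketch — breaker definitions mirror the YAML, while metric collection is stubbed out as an assumption:

```python
def evaluate_breakers(breakers: dict, metrics: dict) -> list:
    """Compare current metric values against each breaker's threshold and return
    the (breaker_name, action) pairs that should fire. Supports both upper
    thresholds ('threshold') and lower bounds ('threshold_low'), as in the config."""
    actions = []
    for name, spec in breakers.items():
        value = metrics.get(spec["metric"])
        if value is None:
            continue  # metric not yet collected; skip rather than fire blindly
        tripped = (("threshold" in spec and value > spec["threshold"]) or
                   ("threshold_low" in spec and value < spec["threshold_low"]))
        if tripped:
            actions.append((name, spec["action"]))
    return actions

breakers = {
    "auto_close_rate_spike": {"metric": "auto_close_rate_1h",
                              "threshold": 0.85, "action": "pause_triage_agent"},
    "confidence_drift": {"metric": "avg_confidence_24h",
                         "threshold_low": 0.60, "action": "switch_to_advisor_mode"},
}
current = {"auto_close_rate_1h": 0.91, "avg_confidence_24h": 0.81}
print(evaluate_breakers(breakers, current))  # only the auto-close spike fires
```

Whether a missing metric should skip or trip a breaker is a design decision; skipping (as here) avoids false halts during collection gaps, but a prolonged gap should raise its own alert.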
Pro Tip: Test your kill switches quarterly. Run a tabletop exercise where the agentic system is "behaving anomalously" and measure how long it takes your team to detect the anomaly, trigger the circuit breaker, and switch to manual operations. Target: under 15 minutes from anomaly detection to full manual takeover.
Migration Checklist: Traditional SOC to Agentic SOC
Use this checklist to assess readiness and track progress. Each item has specific, verifiable criteria.
Data Foundation (Phase 1 Prerequisites)
- CIM compliance validated for all critical data sources (>85% field coverage)
- Asset identity normalization complete — single asset_id across all indices
- User identity normalization complete — mapping across AD, IAM, cloud directories
- Threat intelligence feeds operational — at least 3 sources (commercial + open source + ISAC)
- Log enrichment pipeline deployed — geo-IP, asset context, user context at ingest time
- Data quality dashboard operational — measuring field coverage, latency, volume anomalies
- Historical baseline established — minimum 30 days of normalized, enriched data
Platform Readiness (Phase 2 Prerequisites)
- API access configured for SIEM (Splunk REST API, Sentinel API, or equivalent)
- EDR API access configured (CrowdStrike, Defender, SentinelOne, or equivalent)
- Identity provider API access configured (Entra ID, Okta, or equivalent)
- Threat intel API access configured (VirusTotal, Recorded Future, MISP, or equivalent)
- Secure secrets management deployed for API credentials (Vault, AWS Secrets Manager)
- Agent execution environment provisioned (container runtime, GPU for local models if needed)
- Network segmentation allows agent-to-tool communication with least privilege
Governance Framework (Phase 3–4 Prerequisites)
- Governance policy document authored and approved by CISO
- Approval matrix defined — who approves what, at what risk level
- Audit trail schema defined and logging infrastructure deployed
- Circuit breaker thresholds defined and tested
- Feedback loop process documented and assigned to analyst rotation
- Graduated autonomy schedule approved — shadow, advisor, co-pilot, autonomous milestones
- Rollback procedures tested — can you revert to fully manual operations in under 15 minutes?
- Legal and compliance review complete — especially for automated response actions
Operational Readiness
- SOC team briefed on agentic workflows — they understand what the agents do and don't do
- Runbooks updated to include agent-assisted investigation procedures
- On-call rotation includes "agent oversight" responsibility
- Incident response plan updated to include agent failure scenarios
- Metrics dashboard deployed — MTTT (mean time to triage), auto-resolution rate, accuracy, confidence
- Weekly review cadence established — accuracy review, feedback review, policy review
Key Questions for Your Next SOC Review
Use these questions to assess where your organization stands on the agentic SOC maturity curve:
- Data Readiness: Can you query any entity (host, user, IP) and get a complete picture across all data sources in under 30 seconds?
- Alert Volume: What percentage of your alerts are autonomously resolvable today? If you don't know, start measuring.
- Investigation Consistency: If three different analysts investigate the same alert, do they follow the same steps and reach the same conclusion? If not, you have a process problem that agents can help standardize.
- Response Speed: What's your current MTTR for high-severity incidents? What would a 10x improvement mean for your risk posture?
- Governance Maturity: Do you have a documented policy for automated response actions? Can you explain to an auditor exactly what your systems are authorized to do autonomously?
- Feedback Culture: When an analyst disagrees with an automated triage decision, is there a structured way to capture that feedback and improve the system?
- Failure Preparedness: If your agentic system stops working at 2 AM on a Saturday, how long until you detect the failure and revert to manual operations?
- Skill Evolution: Are your analysts being retrained for the agentic SOC — moving from alert-processing to agent-supervision, threat hunting, and detection engineering?
- Vendor Independence: Are your agentic capabilities locked to a single vendor platform, or do you have the architectural flexibility to swap models, tools, and integrations?
- Measurable Impact: Can you demonstrate, with data, that your agentic capabilities are reducing MTTR, improving accuracy, and freeing analyst capacity for higher-value work?
Cymantis Recommendations
Based on our work with security teams across federal, enterprise, and critical infrastructure environments, here are our top recommendations for organizations beginning the agentic SOC journey:
Start Small, Prove Value, Expand
Don't boil the ocean. Pick the three noisiest detection rules in your environment and build a triage agent that handles them. Measure auto-close accuracy for 30 days. Show leadership the numbers. Then expand.
Invest in Data Before AI
The fastest way to fail at agentic SOC is to deploy agents on top of messy data. Spend 60% of your Phase 1 budget on data normalization, CIM compliance, and enrichment pipelines. This investment pays dividends across every subsequent phase.
Governance Is Architecture, Not Policy
Don't treat governance as a PDF that lives in SharePoint. Build it into the system: approval gates in code, audit trails in your SIEM, circuit breakers in your orchestration layer, feedback loops in your analyst workflow. Policy-as-code, not policy-as-document.
Plan for the Analyst Evolution
The agentic SOC doesn't eliminate analyst roles — it transforms them. Tier 1 analysts become agent supervisors and tuners. Tier 2 analysts become detection engineers and threat hunters. Tier 3 analysts become agent architects and governance designers. Plan the career path evolution alongside the technology deployment.
Measure Everything, Trust Nothing
Every agent decision should be auditable, every metric should be tracked, and every threshold should be justified with data. The moment you stop measuring agent accuracy is the moment you lose control.
Final Thoughts
The agentic SOC isn't a product you buy — it's an operating model you build. It's the convergence of a decade of SIEM evolution, the maturation of large language models, and the operational reality that human-only SOCs can't scale to meet the threat landscape.
The organizations that will thrive are the ones that approach this transformation with engineering discipline: clean data foundations, graduated autonomy, policy-driven governance, and relentless measurement.
Smaller teams. Bigger impact. Faster response. Better sleep.
The alert fatigue era is ending. The agentic era is here. The only question is whether your SOC will lead the transition or be forced into it by the next breach that your drowning analysts didn't catch in time.
Cymantis Labs helps security teams design, deploy, and govern agentic SOC architectures — from data foundation assessments to full multi-agent orchestration. We bring the engineering rigor and operational experience to make autonomous security operations production-safe.
Resources & References
Agentic SOC Architecture
- CrowdStrike — The Rise of the Agentic SOC: https://www.crowdstrike.com/en-us/blog/agentic-ai-soc-guide/ — CrowdStrike's perspective on Charlotte AI and autonomous security operations
- Anomali — The Evolution of SIEM: https://www.anomali.com/resources/what-is-siem — Research on SIEM generations and the shift to agentic platforms
- Prophet Security — AI SOC Analyst: https://prophetsecurity.ai — Detection engineering and autonomous investigation platform
AI & LLM Frameworks for Security
- OpenAI Function Calling Documentation: https://platform.openai.com/docs/guides/function-calling — Building tool-calling agents
- LangChain Agent Framework: https://python.langchain.com/docs/modules/agents/ — Open-source agent orchestration
- Microsoft Copilot for Security: https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-copilot-security — Enterprise agentic security platform
SIEM & Detection Engineering
- Splunk Enterprise Security: https://docs.splunk.com/Documentation/ES/latest — Splunk ES documentation
- Splunk Risk-Based Alerting: https://docs.splunk.com/Documentation/ES/latest/User/RiskBasedAlerting — RBA implementation guide
- Splunk Common Information Model: https://docs.splunk.com/Documentation/CIM/latest — CIM reference for data normalization
MITRE ATT&CK & Threat Intelligence
- MITRE ATT&CK Framework: https://attack.mitre.org/ — Adversary tactics, techniques, and procedures
- MITRE D3FEND: https://d3fend.mitre.org/ — Defensive technique knowledge graph
- MITRE ATLAS (Adversarial Threat Landscape for AI): https://atlas.mitre.org/ — Threats to AI/ML systems
Governance & Compliance
- NIST AI Risk Management Framework: https://www.nist.gov/artificial-intelligence/ai-risk-management-framework — Federal guidance on AI governance
- NIST SP 800-53 Rev. 5: https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final — Security and privacy controls
- SOC 2 Trust Services Criteria: https://www.aicpa.org/resources/landing/system-and-organization-controls-soc-suite-of-services — SOC 2 compliance framework
Industry Research
- IBM Cost of a Data Breach Report: https://www.ibm.com/reports/data-breach — Annual breach cost analysis including detection time metrics
- SANS SOC Survey: https://www.sans.org/white-papers/ — Annual survey of SOC operations, staffing, and tool adoption
- Gartner — Market Guide for SOAR: https://www.gartner.com/en/documents/ — SOAR market evolution and convergence with AI
For more insights or to schedule a Cymantis Agentic SOC Assessment, contact our research and automation team at cymantis.com.
