Building an Agentic SOC: From Alert Fatigue to Autonomous Detection and Response
A technical guide to transforming traditional SOC operations with agentic AI — from autonomous alert triage and investigation to coordinated response, while maintaining human-in-the-loop governance.
By Cymantis Labs
The modern Security Operations Center is drowning. SOC teams field an average of 11,000 alerts per day, and over 80% of those are false positives. Mean Time to Respond (MTTR) is measured in hours — sometimes days — for incidents that demand minutes. Tier 1 analysts burn out within 18 months, hiring pipelines can't keep pace with the volume, and the adversaries aren't slowing down.
We've spent the last decade layering automation on top of broken processes: SOAR playbooks that handle the happy path but choke on edge cases, static correlation rules that fire on everything or nothing, and enrichment workflows that add data without adding context.
Agentic AI changes the equation. Not by replacing analysts, but by giving them something they've never had: autonomous reasoning at machine speed with human judgment at decision time. This is the architecture of the Agentic SOC — a system where AI agents triage, investigate, and recommend response actions, while humans govern the boundaries and approve high-risk actions.
This guide walks through the technical blueprint: from data foundations to multi-agent orchestration, with working code, real configurations, and the governance guardrails that make it production-safe.
The Three Generations of SIEM
Before building the future, we need to understand how we got here. The evolution of SIEM platforms maps directly to the evolution of SOC operating models — and each generation addressed (or failed to address) the alert fatigue problem differently.
Generation 1: System of Record (2005–2015)
The first generation of SIEMs — ArcSight, QRadar, early Splunk — were log aggregation engines. Their value proposition was simple: collect everything, search anything. Correlation rules were hand-crafted by senior engineers, and the operating model was reactive. Something happens, you search for it.
The alert fatigue problem: Rules were binary — they fired or they didn't. No risk scoring, no context, no adaptive thresholds. A failed SSH login from an admin's home IP generated the same alert as one from a known C2 server. SOC teams responded by tuning rules so aggressively that coverage gaps became the norm.
Generation 2: System of Intelligence (2015–2023)
The second generation introduced analytics: UEBA, risk-based alerting, machine learning anomaly detection, and threat intelligence enrichment. Splunk ES, Microsoft Sentinel, and Google Chronicle moved from raw correlation to behavioral baselines and entity risk aggregation.
The alert fatigue improvement: Risk-based alerting (RBA) was a genuine breakthrough. Instead of alerting on individual events, you aggregate risk by entity and fire when thresholds are breached. This reduced alert volume by 80–90% in well-tuned deployments. But the investigation and response workflow remained manual. An analyst still had to pick up the notable, pull context from five tools, make a judgment call, and execute response actions by hand.
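The RBA mechanics can be sketched in a few lines — a minimal illustration, assuming a flat list of risk events and an illustrative threshold of 100 (both the event shape and threshold are assumptions, not any vendor's schema):

```python
from collections import defaultdict

RISK_THRESHOLD = 100  # assumption: fire a notable once an entity's aggregated risk exceeds this

def aggregate_risk(risk_events):
    """Sum risk contributions per entity; return only entities that breach the threshold."""
    totals = defaultdict(int)
    for event in risk_events:
        totals[event["entity"]] += event["risk_score"]
    return {entity: score for entity, score in totals.items() if score >= RISK_THRESHOLD}

events = [
    {"entity": "jdoe", "risk_score": 40},   # e.g. anomalous login
    {"entity": "jdoe", "risk_score": 70},   # e.g. encoded PowerShell
    {"entity": "svc01", "risk_score": 20},  # low-grade noise, never fires
]
print(aggregate_risk(events))  # → {'jdoe': 110}
```

Two modest events on the same entity cross the threshold together, while scattered low-risk noise never does — that is the entire volume-reduction trick.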
Generation 3: System of Action (2024–Present)
The third generation is the agentic SIEM — platforms that don't just detect and enrich, but reason and act. CrowdStrike's Charlotte AI, Microsoft's Copilot for Security, Anomali's agentic SIEM architecture, and open-source frameworks like LangChain-based security agents are pushing SIEM from intelligence into autonomous operations.
The alert fatigue solution: Agents don't just score alerts — they investigate them. They correlate across data sources, query threat intel APIs, check historical baselines, assess blast radius, and either resolve the alert autonomously or escalate with a complete investigation package. The analyst reviews a finished investigation, not a raw alert.
Pro Tip: The generational shift isn't about replacing your SIEM — it's about layering agentic capabilities on top of your existing investment. Every Generation 1 and Generation 2 investment (data normalization, RBA tuning, enrichment pipelines) becomes the foundation for Generation 3.
What Makes a SOC "Agentic"?
The term "agentic" gets thrown around loosely. Let's define it precisely in the context of security operations.
An agentic SOC is an operating model where AI agents — autonomous software entities with defined goals, tools, and decision boundaries — perform the cognitive work of alert triage, investigation, and response recommendation. Unlike traditional SOAR playbooks, which follow predetermined decision trees, agentic systems exhibit three critical differentiators:
1. Context Retention
Traditional playbooks are stateless. Each execution starts from zero. An agentic system maintains context across investigations: it remembers that this user had a similar alert last week that was a false positive, that this endpoint was recently reimaged, that this IP appeared in a threat intel feed three days ago.
2. Adaptive Reasoning
Playbooks follow if/then/else logic trees designed at authoring time. Agents reason dynamically: "The alert says suspicious PowerShell execution. Let me check the command line arguments. This looks like an encoded payload. Let me decode it. The decoded content is a download cradle pointing to a domain. Let me check that domain against threat intel. It's clean, but it was registered 48 hours ago. Let me check passive DNS and WHOIS. The registrant matches a known bulletproof hosting pattern. Escalate as high-confidence."
3. Multi-Step Investigation
A SOAR playbook enriches an alert with a fixed set of lookups. An agent conducts a dynamic investigation: each step informs the next. If the first enrichment reveals something interesting, the agent follows that thread. If it's a dead end, the agent pivots. This mirrors how a skilled analyst actually works — except the agent does it in seconds.
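The difference can be reduced to a toy sketch (every function name here is hypothetical, purely for illustration): a playbook executes a step list fixed at authoring time, while an agent loop chooses each next step based on what the previous step found:

```python
def run_playbook(alert, steps):
    """Static SOAR-style playbook: a fixed enrichment sequence, order decided at authoring time."""
    return [step(alert) for step in steps]

def run_agent(alert, choose_next_step, max_steps=10):
    """Agentic loop: each finding informs the next step; stops on a dead end or verdict."""
    findings = []
    for _ in range(max_steps):
        step = choose_next_step(alert, findings)  # e.g. an LLM tool-selection call
        if step is None:                          # dead end or verdict reached
            break
        findings.append(step(alert))
    return findings

# Toy policy standing in for the LLM's reasoning:
def decode_payload(alert):
    return "decoded"

def check_domain(alert):
    return "newly_registered"

def toy_policy(alert, findings):
    if not findings:
        return decode_payload          # first: look at the payload
    if findings[-1] == "decoded":
        return check_domain            # the decode surfaced a domain — follow the thread
    return None                        # nothing left to chase

print(run_agent({"type": "powershell"}, toy_policy))  # follows the thread, then stops
```

The playbook always runs every step; the agent spends effort only where the evidence leads.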
Architecture Overview
The agentic SOC architecture has four layers:
graph TD
subgraph governanceLayer["GOVERNANCE LAYER"]
policyEngine["Policy Engine"]
approvalGates["Approval Gates"]
auditTrail["Audit Trail"]
end
subgraph orchestrationLayer["ORCHESTRATION LAYER"]
agentRouter["Agent Router"]
taskQueue["Task Queue"]
stateManager["State Manager"]
end
subgraph agentLayer["AGENT LAYER"]
triageAgent["Triage Agent"]
investigationAgent["Investigation Agent"]
responseAgent["Response Agent"]
end
subgraph dataFoundationLayer["DATA FOUNDATION LAYER"]
siem["SIEM"]
threatIntel["Threat Intel"]
cmdb["CMDB"]
edr["EDR"]
iam["IAM"]
end
Each layer has distinct responsibilities, and the governance layer wraps everything — no agent acts without policy enforcement.
The Four-Phase Agentic SOC Journey
Deploying an agentic SOC isn't a weekend project. It's a phased transformation that builds on existing investments. Each phase delivers standalone value while laying foundations for the next.
Phase 1: AI-Ready Foundation
Goal: Ensure your data is clean, normalized, and enriched enough for AI agents to reason over it. Garbage in, garbage out — this is doubly true for LLM-based agents that will make decisions based on your data quality.
The single most common failure mode for agentic SOC deployments is poor data quality. An AI agent that queries your SIEM and gets inconsistent field names, missing timestamps, or unnormalized asset identifiers will produce unreliable results.
CIM Compliance Validation
Splunk's Common Information Model (CIM) provides the normalized field vocabulary that agents depend on. Before deploying any AI capability, validate your CIM compliance:
| tstats count where index=* by index, sourcetype
| eval cim_model=case(
    like(sourcetype, "%sysmon%"), "Endpoint",
    like(sourcetype, "%wineventlog%"), "Endpoint",
    like(sourcetype, "%cloudtrail%"), "Change",
    like(sourcetype, "%linux_secure%"), "Authentication",
    like(sourcetype, "%firewall%"), "Network_Traffic",
    like(sourcetype, "%proxy%"), "Web",
    like(sourcetype, "%dns%"), "Network_Resolution",
    true(), "UNMAPPED"
)
| stats count by cim_model, index, sourcetype
| sort - count
Data Quality Scoring
Create a data quality index that agents can reference before making decisions:
| tstats count as total_events where index=* by sourcetype
| join sourcetype [
| tstats count as has_dest where index=* dest=* by sourcetype
]
| join sourcetype [
| tstats count as has_user where index=* user=* by sourcetype
]
| join sourcetype [
| tstats count as has_src where index=* src=* by sourcetype
]
| eval dest_coverage=round(has_dest/total_events*100, 1)
| eval user_coverage=round(has_user/total_events*100, 1)
| eval src_coverage=round(has_src/total_events*100, 1)
| eval quality_score=round((dest_coverage + user_coverage + src_coverage) / 3, 1)
| sort - quality_score
| table sourcetype total_events dest_coverage user_coverage src_coverage quality_score
Asset Identity Normalization
Agents need a single, authoritative identifier for every entity. Build an asset normalization lookup:
| inputlookup asset_lookup_by_str
| eval normalized_host=lower(dns)
| eval normalized_host=if(isnull(normalized_host), lower(nt_host), normalized_host)
| dedup normalized_host
| eval asset_id=md5(normalized_host)
| table asset_id normalized_host ip mac nt_host dns owner priority bunit category
| outputlookup asset_identity_normalized.csv
Log Enrichment Pipeline
Pre-enrich events at ingest time so agents don't have to make expensive lookups during investigation:
# Scheduled search: Enrich endpoint events with asset context
index=endpoint sourcetype=sysmon
| lookup asset_identity_normalized.csv ip as src_ip OUTPUT asset_id, owner, priority, bunit
| lookup threat_intel_ip_lookup.csv ip as dest_ip OUTPUT threat_category, threat_score, threat_source
| lookup geo_ip_lookup.csv ip as dest_ip OUTPUT country, city, asn, asn_org
| eval enrichment_time=now()
| collect index=enriched_endpoint
Pro Tip: Measure your "enrichment coverage" — the percentage of events that have all critical fields populated. Agents should not operate on data sources with less than 85% enrichment coverage. Below that threshold, route alerts to human analysts instead.
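That routing rule can be sketched directly — a minimal illustration, assuming events arrive as dicts and that `dest`, `user`, and `src` are your critical fields (swap in whatever your agents actually depend on):

```python
CRITICAL_FIELDS = ("dest", "user", "src")  # assumption: the fields your agents reason over
COVERAGE_FLOOR = 0.85                       # per the 85% threshold above

def enrichment_coverage(events, fields=CRITICAL_FIELDS):
    """Fraction of events in which every critical field is populated."""
    if not events:
        return 0.0
    complete = sum(1 for event in events if all(event.get(f) for f in fields))
    return complete / len(events)

def route(events):
    """Send well-enriched sources to agents; everything else goes to humans."""
    return "agent" if enrichment_coverage(events) >= COVERAGE_FLOOR else "human"

sample = [{"dest": "web01", "user": "jdoe", "src": "10.1.2.3"}] * 9 \
       + [{"dest": "web01", "user": None, "src": "10.1.2.3"}]
print(route(sample))  # 90% coverage → "agent"
```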
Phase 2: Autonomous Triage
Goal: Deploy AI agents that can autonomously classify, score, and route incoming alerts — suppressing confirmed false positives and prioritizing high-confidence threats for investigation.
The triage agent is the first agent most organizations deploy, and it delivers the highest immediate ROI. A well-tuned triage agent can autonomously close 60–70% of alerts, reducing analyst workload to a manageable volume.
Alert Scoring Configuration
Define scoring thresholds in a structured configuration that the triage agent references:
# triage_agent_config.yaml
agent:
  name: "soc-triage-agent"
  version: "1.2.0"
  model: "gpt-4o"
  temperature: 0.1          # Low temperature for consistent classification
  max_reasoning_steps: 10

scoring:
  thresholds:
    auto_close: 15          # Score <= 15: auto-close as benign
    low_priority: 40        # Score 16-40: queue for batch review
    medium_priority: 70     # Score 41-70: standard investigation
    high_priority: 90       # Score 71-90: priority investigation
    critical_escalate: 100  # Score 91+: immediate escalation
  factors:
    asset_criticality:
      crown_jewel: 25
      high: 15
      medium: 10
      low: 5
    user_risk:
      privileged_admin: 20
      service_account: 15
      standard_user: 5
      contractor: 10
    threat_intel_match:
      known_apt: 30
      known_malware: 25
      suspicious_ioc: 15
      clean: 0
    historical_context:
      first_seen_behavior: 15
      recurring_false_positive: -20
      similar_confirmed_incident: 25
      recently_investigated_benign: -15
    time_context:
      outside_business_hours: 10
      holiday_weekend: 15
      during_change_window: -10

false_positive_patterns:
  - name: "vulnerability_scanner"
    conditions:
      src_ip_in: "scanner_allowlist"
      dest_port_in: [80, 443, 8080, 8443]
    action: "auto_close"
    confidence: 0.95
  - name: "admin_scheduled_task"
    conditions:
      user_in: "admin_group"
      process_name_in: ["schtasks.exe", "at.exe"]
      time_window: "change_window"
    action: "auto_close"
    confidence: 0.90
  - name: "edr_update_noise"
    conditions:
      source: "endpoint_protection"
      signature_match: "PUA.*adware|PUP.*toolbar"
    action: "auto_close"
    confidence: 0.95

escalation:
  critical_path:
    - notify: "soc_lead"
      method: ["slack", "pagerduty"]
      timeout_minutes: 5
    - notify: "incident_commander"
      method: ["pagerduty", "phone"]
      timeout_minutes: 15
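To make the scoring concrete, here is a minimal sketch of how a triage agent might apply additive factors and map the total onto verdict thresholds. The dicts below are a trimmed, hand-copied mirror of the configuration values above, not a YAML loader:

```python
# Trimmed mirror of triage_agent_config.yaml (illustrative subset, not a parser)
THRESHOLDS = {"auto_close": 15, "low_priority": 40, "medium_priority": 70, "high_priority": 90}
FACTORS = {
    "asset_criticality": {"crown_jewel": 25, "high": 15, "medium": 10, "low": 5},
    "user_risk": {"privileged_admin": 20, "standard_user": 5},
    "threat_intel_match": {"known_apt": 30, "clean": 0},
    "historical_context": {"recurring_false_positive": -20},
}

def score_alert(attributes):
    """Sum the configured weight for each factor value present on the alert."""
    return sum(FACTORS[factor][value]
               for factor, value in attributes.items()
               if value in FACTORS.get(factor, {}))

def verdict(score):
    """Map a total score onto the verdict bands defined by the thresholds."""
    if score <= THRESHOLDS["auto_close"]:
        return "auto_close"
    if score <= THRESHOLDS["low_priority"]:
        return "low_priority"
    if score <= THRESHOLDS["medium_priority"]:
        return "investigate"
    if score <= THRESHOLDS["high_priority"]:
        return "escalate"
    return "critical"

alert = {"asset_criticality": "high", "user_risk": "privileged_admin",
         "threat_intel_match": "known_apt"}
total = score_alert(alert)  # 15 + 20 + 30 = 65
print(verdict(total))       # → investigate
```

Note how a negative `historical_context` factor (a recurring false positive) can pull an otherwise suspicious alert back under the auto-close line — that is the point of additive scoring.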
Triage Agent Implementation
Here's a production-grade triage agent using a tool-calling LLM pattern:
"""soc_triage_agent.py — Autonomous alert triage with LLM reasoning."""
import json, hashlib, logging
from datetime import datetime, timezone
import yaml
from openai import OpenAI
logger = logging.getLogger("soc_triage_agent")
# AlertVerdict(Enum): AUTO_CLOSE | LOW_PRIORITY | INVESTIGATE | ESCALATE | CRITICAL
# TriageResult(dataclass): alert_id, verdict, score, reasoning,
# enrichments, recommended_actions, confidence, processing_time_ms
class TriageAgent:
"""Scores and classifies SIEM alerts via LLM tool-calling loop."""
# SYSTEM_PROMPT — Tier 1 SOC analyst persona + rules (fail-open, check TI)
# TOOL_DEFINITIONS — Function-calling schemas: query_siem,
# check_threat_intel, get_asset_context, get_user_context,
# check_alert_history, submit_verdict
SYSTEM_PROMPT = "..." # Full prompt omitted for brevity
TOOL_DEFINITIONS = [...] # Six tool schemas omitted for brevity
def __init__(self, config_path: str):
with open(config_path) as f:
self.config = yaml.safe_load(f)
self.client = OpenAI()
self.model = self.config["agent"]["model"]
self.tool_handlers = {
"query_siem": self._handle_siem_query,
"check_threat_intel": self._handle_threat_intel,
"get_asset_context": self._handle_asset_context,
"get_user_context": self._handle_user_context,
"check_alert_history": self._handle_alert_history,
"submit_verdict": self._handle_verdict,
}
def triage_alert(self, alert: dict) -> TriageResult:
"""Core agentic loop — LLM calls tools autonomously until verdict."""
start = datetime.now(timezone.utc)
alert_id = alert.get("id", hashlib.md5(
json.dumps(alert, sort_keys=True).encode()).hexdigest()[:12])
self._enrichments, self._verdict = {}, None
messages = [
{"role": "system", "content": self.SYSTEM_PROMPT},
{"role": "user", "content":
f"Triage this alert:\n```json\n{json.dumps(alert, indent=2)}\n```"}
]
for _ in range(self.config["agent"].get("max_reasoning_steps", 10)):
resp = self.client.chat.completions.create(
model=self.model, messages=messages,
tools=self.TOOL_DEFINITIONS, tool_choice="auto",
temperature=self.config["agent"].get("temperature", 0.1))
msg = resp.choices[0].message
messages.append(msg.model_dump())
if not msg.tool_calls:
break
for tc in msg.tool_calls:
handler = self.tool_handlers.get(tc.function.name)
args = json.loads(tc.function.arguments)
result = handler(**args) if handler else {"error": "unknown tool"}
messages.append({"role": "tool", "tool_call_id": tc.id,
"content": json.dumps(result)})
if self._verdict:
break
if not self._verdict:
self._verdict = {"verdict": "escalate", "score": 75,
"reasoning": "No convergence — escalating.", "confidence": 0.3}
elapsed_ms = int((datetime.now(timezone.utc) - start).total_seconds() * 1000)
return TriageResult(alert_id=alert_id,
verdict=AlertVerdict(self._verdict["verdict"]),
score=self._verdict["score"], reasoning=self._verdict["reasoning"],
enrichments=self._enrichments, confidence=self._verdict["confidence"],
processing_time_ms=elapsed_ms)
# ── Tool Handlers (integrate with your infrastructure) ──────────
# _handle_siem_query(query, time_range) — SPL via Splunk REST
# _handle_threat_intel(indicator, type) — IOC lookup against TI feeds
# _handle_asset_context(asset_id) — CMDB metadata retrieval
# _handle_user_context(username) — IAM user risk profile
# _handle_alert_history(detection, entity) — Prior outcomes & FP rates
# _handle_verdict(**kwargs) — Capture final verdict
Splunk Integration: Feeding the Triage Agent
Create a modular input or scripted alert action that feeds new notables to the triage agent:
"""splunk_triage_bridge.py — Polls Splunk for new notables,
dispatches to triage agent, writes results back to a triage index."""
import time, json, logging
import splunklib.client as client
import splunklib.results as results
from soc_triage_agent import TriageAgent
logger = logging.getLogger("splunk_triage_bridge")
SPLUNK_CONFIG = {
"host": "localhost", "port": 8089,
"username": "svc_triage_agent",
"password": "VAULT_REFERENCE", # Use a secrets manager in production
}
NOTABLE_QUERY = """
search index=notable earliest=-5m latest=now
NOT [| inputlookup triaged_alerts.csv | fields rule_id]
| fields _time, rule_name, rule_id, src, dest, user, severity,
urgency, security_domain, risk_score
| head 50
"""
def poll_and_triage():
"""Main loop: fetch new notables, triage via agent, write results."""
agent = TriageAgent("triage_agent_config.yaml")
service = client.connect(**SPLUNK_CONFIG)
while True:
try:
job = service.jobs.create(NOTABLE_QUERY)
while not job.is_done():
time.sleep(1)
for result in results.JSONResultsReader(
job.results(output_mode="json")):
if isinstance(result, dict):
triage_result = agent.triage_alert(result)
logger.info(f"Alert {triage_result.alert_id}: "
f"{triage_result.verdict.value} "
f"(score={triage_result.score})")
write_triage_result(service, triage_result)
except Exception as e:
logger.error(f"Triage polling error: {e}")
time.sleep(300) # Poll every 5 minutes
# write_triage_result(service, result) — Serializes TriageResult as JSON
# event → soc_triage index (sourcetype=soc:triage:result) for audit trail
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
poll_and_triage()
Pro Tip: Start with the noisiest detection rules first. Identify your top 10 alert generators (they typically account for 60–80% of alert volume), and configure the triage agent to handle those. This delivers visible ROI within the first week while you tune and expand coverage.
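Picking those top generators is a simple cumulative-volume cut — a sketch, assuming you can export per-rule alert counts from your SIEM as a dict:

```python
def top_alert_generators(rule_counts, coverage=0.8):
    """Return the smallest set of rules (loudest first) covering `coverage` of total volume."""
    total = sum(rule_counts.values())
    selected, running = [], 0
    for rule, count in sorted(rule_counts.items(), key=lambda kv: -kv[1]):
        selected.append(rule)
        running += count
        if running / total >= coverage:
            break
    return selected

# Hypothetical 30-day counts per detection rule:
counts = {"Brute Force Detected": 700, "Suspicious PowerShell": 200,
          "Rare Process": 50, "DNS Tunneling": 50}
print(top_alert_generators(counts))  # two rules cover 90% of volume
```

Pointing the triage agent at that short list first is how you get the "visible ROI in week one" the tip describes.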
Phase 3: Intelligent Investigation
Goal: Deploy investigation agents that autonomously gather evidence, correlate across data sources, and build investigation narratives — delivering a complete investigation package to analysts instead of a raw alert.
Once triage separates signal from noise, investigation agents take the medium-to-high priority alerts and conduct full investigations. This is where the agentic approach truly differentiates from SOAR: the investigation is dynamic, not scripted.
Investigation Agent Workflow
The investigation agent follows a structured but adaptive workflow:
- Alert Intake — Receive triaged alert with initial enrichments from triage agent
- Scope Assessment — Determine what data sources and tools are relevant
- Evidence Collection — Query SIEM, EDR, identity systems, network tools
- Threat Intel Correlation — Check all indicators against multiple TI sources
- Behavioral Analysis — Compare current activity against historical baselines
- Timeline Construction — Build a chronological event timeline
- Impact Assessment — Estimate blast radius and business impact
- Recommendation — Provide verdict with confidence score and response options
Multi-Source Enrichment Pipeline
"""investigation_agent.py — Multi-source enrichment and investigation.
Correlates across SIEM, EDR, threat intel, and identity systems."""
import asyncio, logging
from dataclasses import dataclass
logger = logging.getLogger("investigation_agent")
@dataclass
class InvestigationContext:
"""Accumulated evidence across all enrichment sources."""
alert_id: str
timeline: list
indicators: dict
affected_assets: list
affected_users: list
threat_intel_hits: list
behavioral_anomalies: list
blast_radius: dict
confidence: float
narrative: str = ""
class InvestigationAgent:
"""Orchestrates multi-source investigations with dynamic enrichment."""
def __init__(self, config: dict):
self.config = config
self.enrichment_sources = { # Pluggable enrichment backends
"siem": SIEMEnrichment(config["siem"]),
"edr": EDREnrichment(config["edr"]),
"threat_intel": ThreatIntelEnrichment(config["threat_intel"]),
"identity": IdentityEnrichment(config["identity"]),
"network": NetworkEnrichment(config["network"]),
}
async def investigate(self, triaged_alert: dict) -> InvestigationContext:
"""Five-phase adaptive investigation on a triaged alert."""
ctx = InvestigationContext(
alert_id=triaged_alert["alert_id"], timeline=[],
indicators={}, affected_assets=[], affected_users=[],
threat_intel_hits=[], behavioral_anomalies=[],
blast_radius={}, confidence=0.0)
# Phase 1: Parallel context gathering (asset, user, network)
await asyncio.gather(
self._enrich_asset_context(triaged_alert, ctx),
self._enrich_user_context(triaged_alert, ctx),
self._enrich_network_context(triaged_alert, ctx))
# Phase 2: Check all extracted IOCs against threat intel
indicators = self._extract_indicators(triaged_alert, ctx)
await asyncio.gather(
*[self._check_indicator(ioc, ctx) for ioc in indicators])
# Phase 3–5: Behavioral analysis → timeline → narrative
await self._behavioral_analysis(triaged_alert, ctx)
await self._build_timeline(triaged_alert, ctx)
await self._assess_blast_radius(ctx)
ctx.narrative = await self._generate_narrative(ctx)
return ctx
# ── Enrichment Methods ──────────────────────────────────────────
# _enrich_asset_context(alert, ctx) — CMDB lookup + recent changes (7d)
# _enrich_user_context(alert, ctx) — Identity, risk, auth patterns (7d)
# _enrich_network_context(alert, ctx) — DNS lookups + traffic patterns (24h)
# _extract_indicators(alert, ctx) — Extract IPs, domains, hashes
# _check_indicator(ioc, ctx) — TI lookup; append malicious hits
# _behavioral_analysis(alert, ctx) — Current vs 30-day process baseline
# _build_timeline(alert, ctx) — Chronological notable/risk events (48h)
# _assess_blast_radius(ctx) — Scope: isolated / department / enterprise
# _generate_narrative(ctx) — LLM-generated investigation summary
# _is_rfc1918(ip) — RFC 1918 private range check
Pro Tip: Investigation agents should be measured on "enrichment completeness" — the percentage of available context they actually gather before making a recommendation. Target 90%+ enrichment completeness for high-priority alerts. An agent that skips threat intel checks or doesn't pull user context is making decisions with an incomplete picture.
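One way to compute that metric — a sketch, assuming a fixed set of expected enrichment sources and a record of which sources the agent actually consulted (both the source names and the set are illustrative):

```python
EXPECTED_SOURCES = {"asset", "user", "network", "threat_intel", "behavioral", "timeline"}

def enrichment_completeness(gathered_sources):
    """Fraction of expected enrichment sources the agent actually consulted."""
    gathered = EXPECTED_SOURCES & set(gathered_sources)
    return len(gathered) / len(EXPECTED_SOURCES)

# Agent pulled five of the six expected sources; it skipped the timeline build:
consulted = ["asset", "user", "network", "threat_intel", "behavioral"]
print(f"{enrichment_completeness(consulted):.0%}")  # 83% — below the 90% target
```

Logging this per investigation makes it easy to alert on agents that habitually skip a source.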
Phase 4: Coordinated Response
Goal: Deploy response agents that can execute containment and remediation actions — with governance guardrails that enforce human approval for high-impact actions.
This is where the agentic SOC delivers its most visible value — and where governance becomes non-negotiable. Response agents must never operate without policy-driven guardrails.
Multi-Agent Response Orchestration
Response actions are coordinated by a central orchestrator that evaluates the investigation context and dispatches actions to specialized response agents:
"""response_orchestrator.py — Multi-agent response with approval gates."""
import logging
from dataclasses import dataclass
from enum import Enum
logger = logging.getLogger("response_orchestrator")
class RiskLevel(Enum):
LOW = "low" # Auto-execute
MEDIUM = "medium" # Execute + notify
HIGH = "high" # SOC lead approval
CRITICAL = "critical" # CISO/IC approval
class ActionStatus(Enum):
PENDING_APPROVAL = "pending_approval"
APPROVED = "approved"
EXECUTED = "executed"
DENIED = "denied"
FAILED = "failed"
@dataclass
class ResponseAction:
action_type: str
target: str
parameters: dict
risk_level: RiskLevel
justification: str
status: ActionStatus = ActionStatus.PENDING_APPROVAL
class ResponseOrchestrator:
"""Coordinates response actions with policy-driven approval gates."""
ACTION_RISK_MAP = {
"quarantine_file": RiskLevel.LOW,
"block_ip_firewall": RiskLevel.MEDIUM,
"kill_process": RiskLevel.MEDIUM,
"block_domain_dns": RiskLevel.MEDIUM,
"isolate_endpoint": RiskLevel.HIGH,
"disable_user_account": RiskLevel.HIGH,
"revoke_session_tokens": RiskLevel.HIGH,
"network_segment_isolate": RiskLevel.CRITICAL,
}
def __init__(self, config: dict):
self.approval_engine = ApprovalEngine(config["governance"])
self.audit_logger = AuditLogger(config["audit"])
self.response_agents = {
"endpoint": EndpointResponseAgent(config["edr"]),
"network": NetworkResponseAgent(config["firewall"]),
"identity": IdentityResponseAgent(config["iam"]),
}
async def coordinate_response(self, investigation_context,
recommended_actions: list) -> list:
"""Route actions through governance gates; execute if approved."""
results = []
for spec in recommended_actions:
action = ResponseAction(
action_type=spec["type"], target=spec["target"],
parameters=spec.get("parameters", {}),
risk_level=self.ACTION_RISK_MAP.get(
spec["type"], RiskLevel.HIGH),
justification=spec.get("justification", ""))
self.audit_logger.log_proposed_action(action, investigation_context)
approval = await self.approval_engine.check_approval(action)
if approval.auto_approved:
action.status = ActionStatus.APPROVED
result = await self._execute_action(action)
action.status = (ActionStatus.EXECUTED if result["success"]
else ActionStatus.FAILED)
elif approval.requires_human:
action.status = ActionStatus.PENDING_APPROVAL
await self._request_approval(action, investigation_context)
else:
action.status = ActionStatus.DENIED
self.audit_logger.log_action_result(action)
results.append(action)
return results
# ── Supporting Methods ──────────────────────────────────────────
# _execute_action(action) — Route to endpoint/network/identity agent
# _request_approval(action, context) — Notify via Slack/PagerDuty/Teams
Governance Guardrail Configuration
The governance layer is the most critical component of the response system. Define clear policies for what agents can do autonomously versus what requires human approval:
# governance_policy.yaml
governance:
  version: "2.0"
  effective_date: "2025-12-01"
  last_review: "2025-11-15"
  next_review: "2026-03-01"

  # Global constraints — override all other policies
  global_constraints:
    max_auto_actions_per_hour: 50
    max_auto_actions_per_incident: 5
    require_investigation_context: true
    minimum_confidence_for_action: 0.80
    business_hours_only_for_critical: false
    dry_run_mode: false  # Set true during initial deployment

  # Risk-based approval matrix
  approval_matrix:
    low:
      auto_approve: true
      notification: "soc_channel"
      timeout_minutes: null
      examples:
        - "quarantine_file"
        - "add_watchlist_entry"
        - "create_ticket"
    medium:
      auto_approve: true
      notification: ["soc_channel", "soc_lead"]
      timeout_minutes: null
      require_rollback_plan: true
      examples:
        - "block_ip_firewall"
        - "block_domain_dns"
        - "kill_process"
    high:
      auto_approve: false
      required_approvers: ["soc_lead"]
      notification: ["soc_channel", "incident_commander"]
      timeout_minutes: 30
      escalation_on_timeout: "incident_commander"
      require_rollback_plan: true
      examples:
        - "isolate_endpoint"
        - "disable_user_account"
        - "revoke_session_tokens"
    critical:
      auto_approve: false
      required_approvers: ["incident_commander", "ciso"]
      notification: ["soc_channel", "exec_channel", "pagerduty"]
      timeout_minutes: 15
      escalation_on_timeout: "ciso"
      require_rollback_plan: true
      require_business_impact_assessment: true
      examples:
        - "network_segment_isolate"
        - "full_incident_response"
        - "enterprise_password_reset"

  # Asset-specific overrides
  asset_overrides:
    crown_jewels:
      pattern: "category=crown_jewel OR priority=critical"
      minimum_risk_level: "high"  # No auto-actions on critical assets
      required_approvers: ["soc_lead", "asset_owner"]
    production_databases:
      pattern: "category=database AND environment=production"
      minimum_risk_level: "critical"  # Require IC + CISO for any action
      required_approvers: ["incident_commander", "dba_lead", "ciso"]
    domain_controllers:
      pattern: "category=domain_controller"
      minimum_risk_level: "high"
      required_approvers: ["soc_lead", "ad_admin"]

  # Time-based restrictions
  time_restrictions:
    change_freeze_periods:
      - start: "2025-12-20T00:00:00Z"
        end: "2026-01-02T23:59:59Z"
        policy: "no_auto_actions"
        reason: "Holiday change freeze"
    maintenance_windows:
      - cron: "0 2 * * 0"  # Sundays at 2 AM
        duration_hours: 4
        policy: "elevated_auto_approve"
        reason: "Scheduled maintenance window"

  # Audit and compliance
  audit:
    log_all_decisions: true
    log_enrichment_data: true
    retention_days: 365
    export_format: "json"
    compliance_frameworks: ["SOC2", "NIST-800-53", "FedRAMP"]
Pro Tip: Deploy Phase 4 in "dry run" mode for the first 30 days. The response orchestrator evaluates all actions and logs what it would do, but doesn't execute. Review the decision log weekly with your SOC team. This builds trust and catches policy gaps before they become incidents.
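A dry-run gate can be as small as one wrapper around the executor — an illustrative sketch, not the orchestrator's actual interface (the `action` dict shape and `executor` callable are assumptions):

```python
import json
import logging

logger = logging.getLogger("response_orchestrator")

def execute_or_log(action, executor, dry_run=True):
    """In dry-run mode, log the would-be action instead of executing it."""
    record = {"action": action["type"], "target": action["target"], "dry_run": dry_run}
    if dry_run:
        # Nothing touches production — the decision log is the only output
        logger.info("DRY RUN — would execute: %s", json.dumps(record))
        return {"success": True, "executed": False}
    result = executor(action)  # real EDR/firewall/IAM call in production
    return {"success": result.get("success", False), "executed": True}
```

Flipping `dry_run` to `False` (mirroring `dry_run_mode` in the governance policy) is then a one-line, auditable change after the 30-day review period.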
Measuring Agentic SOC Performance
You can't improve what you don't measure. An agentic SOC demands a new set of KPIs that go beyond traditional metrics. Here are the metrics that matter, with Splunk queries to track them.
Core KPIs
1. Mean Time to Triage (MTTT)
index=soc_triage sourcetype="soc:triage:result"
| eval triage_time_sec=processing_time_ms/1000
| stats
avg(triage_time_sec) as avg_mttt,
median(triage_time_sec) as median_mttt,
perc95(triage_time_sec) as p95_mttt,
count as total_triaged
by verdict
| eval avg_mttt=round(avg_mttt, 1)
| eval median_mttt=round(median_mttt, 1)
| eval p95_mttt=round(p95_mttt, 1)
| sort verdict
2. Autonomous Resolution Rate
index=soc_triage sourcetype="soc:triage:result"
| stats
count(eval(verdict="auto_close")) as auto_closed,
count(eval(verdict="low_priority")) as low_priority,
count(eval(verdict="investigate" OR verdict="escalate" OR verdict="critical")) as human_required,
count as total
| eval auto_resolution_pct=round(auto_closed/total*100, 1)
| eval human_intervention_pct=round(human_required/total*100, 1)
3. False Positive Suppression Accuracy
index=soc_triage sourcetype="soc:triage:result" verdict="auto_close"
| join alert_id [
search index=soc_feedback sourcetype="soc:analyst:feedback"
| fields alert_id, analyst_verdict, feedback_correct
]
| stats
count as total_auto_closed,
count(eval(feedback_correct=="true")) as correct,
count(eval(feedback_correct=="false")) as incorrect
| eval accuracy_pct=round(correct/total_auto_closed*100, 2)
| eval false_negative_rate=round(incorrect/total_auto_closed*100, 2)
4. MTTR Comparison: Before and After
index=notable
| eval era=if(_time < relative_time(now(), "-90d"), "before_agentic", "after_agentic")
| eval resolution_time=if(isnotnull(status_end), status_end - _time, null())
| stats
avg(resolution_time) as avg_mttr,
median(resolution_time) as median_mttr,
perc95(resolution_time) as p95_mttr
by era
| eval avg_mttr_hours=round(avg_mttr/3600, 1)
| eval median_mttr_hours=round(median_mttr/3600, 1)
| eval p95_mttr_hours=round(p95_mttr/3600, 1)
| fields era, avg_mttr_hours, median_mttr_hours, p95_mttr_hours
5. Analyst Time Savings
index=soc_triage sourcetype="soc:triage:result"
| eval estimated_manual_minutes=case(
    verdict=="auto_close", 5,
    verdict=="low_priority", 10,
    verdict=="investigate", 30,
    verdict=="escalate", 45,
    verdict=="critical", 60
)
| eval agent_minutes=processing_time_ms/60000
| eval time_saved_minutes=estimated_manual_minutes - agent_minutes
| stats
sum(time_saved_minutes) as total_minutes_saved,
count as total_alerts
| eval hours_saved=round(total_minutes_saved/60, 1)
| eval fte_equivalent=round(hours_saved/(8*22), 1)
| eval cost_savings=fte_equivalent * 120000
Executive Dashboard Query
Build a single-pane executive view:
index=soc_triage sourcetype="soc:triage:result" earliest=-30d
| stats
count as total_alerts,
count(eval(verdict="auto_close")) as auto_resolved,
avg(processing_time_ms) as avg_triage_ms,
avg(confidence) as avg_confidence,
avg(score) as avg_score
| eval auto_resolution_rate=round(auto_resolved/total_alerts*100, 1)
| eval avg_triage_seconds=round(avg_triage_ms/1000, 1)
| eval avg_confidence=round(avg_confidence*100, 1)
| eval alerts_per_day=round(total_alerts/30, 0)
| eval analyst_hours_saved=round(auto_resolved*5/60, 0)
| table alerts_per_day, auto_resolution_rate, avg_triage_seconds, avg_confidence, analyst_hours_saved
Pro Tip: Create a weekly "Agent Accuracy Review" where a senior analyst randomly samples 20 auto-closed alerts and 10 escalated alerts. Track the agreement rate over time. Your target is 95%+ agreement on auto-closes and less than 2% false negatives (real threats that were auto-closed).
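Scoring the weekly review sample is straightforward — a sketch, assuming each sample is an `(agent_verdict, analyst_verdict)` pair recorded during the review:

```python
def agreement_metrics(samples):
    """samples: list of (agent_verdict, analyst_verdict) pairs from the weekly review."""
    agree = sum(1 for agent, analyst in samples if agent == analyst)
    # A false negative is a real threat the agent auto-closed
    false_negatives = sum(1 for agent, analyst in samples
                          if agent == "auto_close" and analyst != "auto_close")
    auto_closed = sum(1 for agent, _ in samples if agent == "auto_close")
    return {
        "agreement_rate": agree / len(samples),
        "false_negative_rate": false_negatives / auto_closed if auto_closed else 0.0,
    }

# Hypothetical review: 19 of 20 auto-closes confirmed, 1 disputed
week = [("auto_close", "auto_close")] * 19 + [("auto_close", "escalate")]
print(agreement_metrics(week))  # → {'agreement_rate': 0.95, 'false_negative_rate': 0.05}
```

Trend these two numbers week over week against the 95% / 2% targets above; a falling agreement rate is your earliest signal of model or config drift.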
Governance & Guardrails — The Cymantis View
At Cymantis, we believe the agentic SOC only succeeds with governance as a first-class architectural concern — not an afterthought bolted on after deployment. Here is how we recommend implementing governance across the four phases.
Principle 1: Graduated Autonomy
Never deploy agents at full autonomy from day one. Follow the graduated autonomy model:
graph LR
subgraph phase1["Phase 1: Shadow"]
p1Autonomy["Autonomy: 0%"]
p1AI["AI: Observe only"]
p1Human["Human: Full control"]
p1Duration["Duration: 2–4 weeks"]
end
subgraph phase2["Phase 2: Advisor"]
p2Autonomy["Autonomy: 25%"]
p2AI["AI: Recommends"]
p2Human["Human: Decides/acts"]
p2Duration["Duration: 4–8 weeks"]
end
subgraph phase3["Phase 3: Co-pilot"]
p3Autonomy["Autonomy: 50%"]
p3AI["AI: Low-risk auto"]
p3Human["Human: Escalations"]
p3Duration["Duration: 8–12 weeks"]
end
subgraph phase4["Phase 4: Autonomous"]
p4Autonomy["Autonomy: 75%"]
p4AI["AI: Policy-governed"]
p4Human["Human: Oversight & exceptions"]
p4Duration["Duration: Ongoing"]
end
subgraph phase5["Phase 5: Full Trust"]
p5Autonomy["Autonomy: 90%+"]
p5AI["AI: Full autonomy"]
p5Human["Human: Exceptions only"]
p5Duration["Duration: Mature"]
end
phase1 --> phase2
phase2 --> phase3
phase3 --> phase4
phase4 --> phase5
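The phase boundaries above are only real if they are enforced in code rather than in a slide deck. One minimal sketch of such a gate, mapping the current phase to the actions an agent may execute without human approval (the phase names come from the diagram; the action categories are illustrative assumptions):

```python
# Actions each phase permits the agent to execute without human approval.
PHASE_PERMISSIONS = {
    "shadow":     set(),                                    # observe only
    "advisor":    set(),                                    # recommend only
    "co_pilot":   {"auto_close", "enrich"},                 # low-risk actions
    "autonomous": {"auto_close", "enrich", "quarantine_file"},
    "full_trust": {"auto_close", "enrich", "quarantine_file", "isolate_endpoint"},
}

def gate(phase: str, action: str) -> str:
    """Return 'execute' if the action is allowed autonomously in this phase,
    otherwise 'require_approval' (the agent may still recommend the action)."""
    allowed = PHASE_PERMISSIONS.get(phase, set())
    return "execute" if action in allowed else "require_approval"

print(gate("shadow", "auto_close"))          # require_approval
print(gate("co_pilot", "auto_close"))        # execute
print(gate("co_pilot", "isolate_endpoint"))  # require_approval
```

Note that an unknown phase defaults to the empty permission set — failing closed is the right default for an autonomy gate.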
Principle 2: Explainability Is Non-Negotiable
Every agent decision must be traceable. The audit trail should answer:
- What — What action was taken (or recommended)?
- Why — What evidence led to this decision? What was the reasoning chain?
- Who — Which agent made the decision? What model version?
- When — Exact timestamps for every step.
- What if — What would have happened with a different threshold?
Audit Trail Schema
{
"event_id": "triage-2025-12-01-a3f8c2",
"timestamp": "2025-12-01T14:32:18.445Z",
"agent": {
"name": "soc-triage-agent",
"version": "1.2.0",
"model": "gpt-4o-2025-08-06",
"config_hash": "sha256:9f3a..."
},
"alert": {
"id": "notable-28847",
"title": "Suspicious PowerShell Encoded Command",
"source": "ESCU - Malicious PowerShell - Rule",
"severity": "high"
},
"reasoning_chain": [
{
"step": 1,
"action": "get_asset_context",
"input": {"asset_identifier": "WKS-FIN-042"},
"output": {"criticality": "high", "owner": "jsmith", "bunit": "Finance"},
"duration_ms": 145
},
{
"step": 2,
"action": "check_threat_intel",
"input": {"indicator": "invoke-mimikatz.ps1", "indicator_type": "filename"},
"output": {"malicious": true, "sources": ["MITRE", "AlienVault"]},
"duration_ms": 890
},
{
"step": 3,
"action": "check_alert_history",
"input": {"detection_name": "Malicious PowerShell", "entity": "WKS-FIN-042"},
"output": {"total_occurrences": 0, "false_positive_rate": 0.0},
"duration_ms": 230
}
],
"verdict": {
"decision": "escalate",
"score": 88,
"confidence": 0.92,
"reasoning": "High-criticality Finance asset executing encoded PowerShell matching known Mimikatz signature. First occurrence on this asset. No false positive history. Escalating for immediate investigation.",
"recommended_actions": ["isolate_endpoint", "investigate_lateral_movement"]
},
"policy_applied": "governance_policy_v2.0",
"total_processing_ms": 3420
}
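Whatever schema you settle on, validate records at write time so a malformed trail is caught immediately rather than during an audit. A hypothetical validator for records shaped like the example above (field names match that example; this is not a standard library):

```python
REQUIRED_TOP_LEVEL = {"event_id", "timestamp", "agent", "alert",
                      "reasoning_chain", "verdict", "policy_applied",
                      "total_processing_ms"}

def validate_audit_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in record]
    # Steps must be contiguous and each must carry the full action context.
    for i, step in enumerate(record.get("reasoning_chain", []), start=1):
        if step.get("step") != i:
            problems.append(f"reasoning_chain out of order at position {i}")
        for key in ("action", "input", "output", "duration_ms"):
            if key not in step:
                problems.append(f"step {i} missing {key}")
    verdict = record.get("verdict", {})
    if not 0.0 <= verdict.get("confidence", -1.0) <= 1.0:
        problems.append("verdict.confidence must be in [0, 1]")
    return problems

record = {
    "event_id": "triage-2025-12-01-a3f8c2", "timestamp": "2025-12-01T14:32:18Z",
    "agent": {"name": "soc-triage-agent"}, "alert": {"id": "notable-28847"},
    "reasoning_chain": [{"step": 1, "action": "get_asset_context",
                         "input": {}, "output": {}, "duration_ms": 145}],
    "verdict": {"decision": "escalate", "confidence": 0.92},
    "policy_applied": "governance_policy_v2.0", "total_processing_ms": 3420,
}
print(validate_audit_record(record))  # []
```

Rejecting a decision that cannot be logged is the stricter (and safer) policy: no audit record, no action.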
Principle 3: Feedback Loops Are Mandatory
Agents improve through feedback. Implement structured feedback channels:
# feedback_policy.yaml
feedback:
# Analyst feedback on auto-closed alerts
auto_close_review:
sample_rate: 0.10 # Review 10% of auto-closes
review_sla_hours: 48
feedback_fields:
- correct_verdict: boolean
- should_have_been: enum[auto_close, investigate, escalate]
- notes: string
# Mandatory review for high-confidence escalations
escalation_review:
sample_rate: 1.0 # Review 100% of escalations
review_sla_hours: 24
feedback_fields:
- correct_verdict: boolean
- actual_severity: enum[false_positive, low, medium, high, critical]
- investigation_quality: enum[incomplete, adequate, thorough]
- notes: string
# Weekly model performance review
weekly_review:
metrics_tracked:
- auto_close_accuracy
- escalation_precision
- false_negative_rate
- average_confidence_calibration
threshold_alerts:
auto_close_accuracy_below: 0.93
false_negative_rate_above: 0.03
confidence_calibration_drift: 0.10
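The sample_rate values above need a sampler behind them, and it is worth making that sampler deterministic: hash the event ID instead of calling a random generator, so the same alert always yields the same review decision and the sample is reproducible after the fact. A sketch under those assumptions (the policy values mirror the YAML; the function itself is illustrative, not part of any product):

```python
import hashlib

# Per-verdict review rates, mirroring feedback_policy.yaml.
POLICY = {
    "auto_close": 0.10,   # review 10% of auto-closes
    "escalate":   1.00,   # review 100% of escalations
}

def needs_review(event_id: str, verdict: str) -> bool:
    """Deterministic sampling: the same event_id always gets the same decision."""
    rate = POLICY.get(verdict, 0.0)
    digest = hashlib.sha256(event_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # uniform in [0, 1)
    return bucket < rate

print(needs_review("triage-2025-12-01-a3f8c2", "escalate"))  # True (rate is 1.0)
```

Determinism also means two pipelines evaluating the same policy agree on which alerts to queue, without sharing state.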
Principle 4: Kill Switches and Circuit Breakers
Every agentic system needs an emergency stop:
# circuit_breakers.yaml
circuit_breakers:
# Anomaly detection on agent behavior
auto_close_rate_spike:
metric: "auto_close_rate_1h"
baseline: 0.65
threshold: 0.85 # If auto-close rate exceeds 85%, halt
action: "pause_triage_agent"
notification: ["soc_lead", "engineering"]
resume: "manual"
# Response action rate limiting
response_action_flood:
metric: "response_actions_1h"
threshold: 50
action: "pause_response_agents"
notification: ["incident_commander", "ciso"]
resume: "manual"
# Model degradation detection
confidence_drift:
metric: "avg_confidence_24h"
baseline: 0.82
threshold_low: 0.60 # If avg confidence drops, something is wrong
action: "switch_to_advisor_mode"
notification: ["soc_lead", "ml_engineering"]
resume: "after_review"
# Global emergency stop
emergency_stop:
trigger: "manual"
action: "halt_all_agents"
notification: ["all_soc", "ciso", "cto"]
resume: "ciso_approval"
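Config like the above is only useful if something evaluates it on a schedule. A minimal evaluation pass might look like this sketch — breaker definitions mirror the YAML, while metric collection is stubbed out as an assumption:

```python
def evaluate_breakers(breakers: dict, metrics: dict) -> list:
    """Compare current metric values against each breaker's threshold and return
    the (breaker_name, action) pairs that should fire. Supports both upper
    thresholds ('threshold') and lower bounds ('threshold_low'), as in the config."""
    actions = []
    for name, spec in breakers.items():
        value = metrics.get(spec["metric"])
        if value is None:
            continue  # metric not yet collected; skip rather than fire blindly
        tripped = (("threshold" in spec and value > spec["threshold"]) or
                   ("threshold_low" in spec and value < spec["threshold_low"]))
        if tripped:
            actions.append((name, spec["action"]))
    return actions

breakers = {
    "auto_close_rate_spike": {"metric": "auto_close_rate_1h",
                              "threshold": 0.85, "action": "pause_triage_agent"},
    "confidence_drift": {"metric": "avg_confidence_24h",
                         "threshold_low": 0.60, "action": "switch_to_advisor_mode"},
}
current = {"auto_close_rate_1h": 0.91, "avg_confidence_24h": 0.81}
print(evaluate_breakers(breakers, current))  # only the auto-close spike fires
```

Whether a missing metric should skip or trip a breaker is a design decision; skipping (as here) avoids false halts during collection gaps, but a prolonged gap should raise its own alert.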
Pro Tip: Test your kill switches quarterly. Run a tabletop exercise where the agentic system is "behaving anomalously" and measure how long it takes your team to detect the anomaly, trigger the circuit breaker, and switch to manual operations. Target: under 15 minutes from anomaly detection to full manual takeover.
Migration Checklist: Traditional SOC to Agentic SOC
Use this checklist to assess readiness and track progress. Each item has specific, verifiable criteria.
Data Foundation (Phase 1 Prerequisites)
- CIM compliance validated for all critical data sources (>85% field coverage)
- Asset identity normalization complete — single asset_id across all indices
- User identity normalization complete — mapping across AD, IAM, cloud directories
- Threat intelligence feeds operational — at least 3 sources (commercial + open source + ISAC)
- Log enrichment pipeline deployed — geo-IP, asset context, user context at ingest time
- Data quality dashboard operational — measuring field coverage, latency, volume anomalies
- Historical baseline established — minimum 30 days of normalized, enriched data
Platform Readiness (Phase 2 Prerequisites)
- API access configured for SIEM (Splunk REST API, Sentinel API, or equivalent)
- EDR API access configured (CrowdStrike, Defender, SentinelOne, or equivalent)
- Identity provider API access configured (Entra ID, Okta, or equivalent)
- Threat intel API access configured (VirusTotal, Recorded Future, MISP, or equivalent)
- Secure secrets management deployed for API credentials (Vault, AWS Secrets Manager)
- Agent execution environment provisioned (container runtime, GPU for local models if needed)
- Network segmentation allows agent-to-tool communication with least privilege
Governance Framework (Phase 3–4 Prerequisites)
- Governance policy document authored and approved by CISO
- Approval matrix defined — who approves what, at what risk level
- Audit trail schema defined and logging infrastructure deployed
- Circuit breaker thresholds defined and tested
- Feedback loop process documented and assigned to analyst rotation
- Graduated autonomy schedule approved — shadow, advisor, co-pilot, autonomous milestones
- Rollback procedures tested — can you revert to fully manual operations in under 15 minutes?
- Legal and compliance review complete — especially for automated response actions
Operational Readiness
- SOC team briefed on agentic workflows — they understand what the agents do and don't do
- Runbooks updated to include agent-assisted investigation procedures
- On-call rotation includes "agent oversight" responsibility
- Incident response plan updated to include agent failure scenarios
- Metrics dashboard deployed — MTTT (mean time to triage), auto-resolution rate, accuracy, confidence
- Weekly review cadence established — accuracy review, feedback review, policy review
Key Questions for Your Next SOC Review
Use these questions to assess where your organization stands on the agentic SOC maturity curve:
- Data Readiness: Can you query any entity (host, user, IP) and get a complete picture across all data sources in under 30 seconds?
- Alert Volume: What percentage of your alerts are autonomously resolvable today? If you don't know, start measuring.
- Investigation Consistency: If three different analysts investigate the same alert, do they follow the same steps and reach the same conclusion? If not, you have a process problem that agents can help standardize.
- Response Speed: What's your current MTTR for high-severity incidents? What would a 10x improvement mean for your risk posture?
- Governance Maturity: Do you have a documented policy for automated response actions? Can you explain to an auditor exactly what your systems are authorized to do autonomously?
- Feedback Culture: When an analyst disagrees with an automated triage decision, is there a structured way to capture that feedback and improve the system?
- Failure Preparedness: If your agentic system stops working at 2 AM on a Saturday, how long until you detect the failure and revert to manual operations?
- Skill Evolution: Are your analysts being retrained for the agentic SOC — moving from alert-processing to agent-supervision, threat hunting, and detection engineering?
- Vendor Independence: Are your agentic capabilities locked to a single vendor platform, or do you have the architectural flexibility to swap models, tools, and integrations?
- Measurable Impact: Can you demonstrate, with data, that your agentic capabilities are reducing MTTR, improving accuracy, and freeing analyst capacity for higher-value work?
Cymantis Recommendations
Based on our work with security teams across federal, enterprise, and critical infrastructure environments, here are our top recommendations for organizations beginning the agentic SOC journey:
Start Small, Prove Value, Expand
Don't boil the ocean. Pick the three noisiest detection rules in your environment and build a triage agent that handles them. Measure auto-close accuracy for 30 days. Show leadership the numbers. Then expand.
Invest in Data Before AI
The fastest way to fail at agentic SOC is to deploy agents on top of messy data. Spend 60% of your Phase 1 budget on data normalization, CIM compliance, and enrichment pipelines. This investment pays dividends across every subsequent phase.
Governance Is Architecture, Not Policy
Don't treat governance as a PDF that lives in SharePoint. Build it into the system: approval gates in code, audit trails in your SIEM, circuit breakers in your orchestration layer, feedback loops in your analyst workflow. Policy-as-code, not policy-as-document.
Plan for the Analyst Evolution
The agentic SOC doesn't eliminate analyst roles — it transforms them. Tier 1 analysts become agent supervisors and tuners. Tier 2 analysts become detection engineers and threat hunters. Tier 3 analysts become agent architects and governance designers. Plan the career path evolution alongside the technology deployment.
Measure Everything, Trust Nothing
Every agent decision should be auditable, every metric should be tracked, and every threshold should be justified with data. The moment you stop measuring agent accuracy is the moment you lose control.
Final Thoughts
The agentic SOC isn't a product you buy — it's an operating model you build. It's the convergence of a decade of SIEM evolution, the maturation of large language models, and the operational reality that human-only SOCs can't scale to meet the threat landscape.
The organizations that will thrive are the ones that approach this transformation with engineering discipline: clean data foundations, graduated autonomy, policy-driven governance, and relentless measurement.
Smaller teams. Bigger impact. Faster response. Better sleep.
The alert fatigue era is ending. The agentic era is here. The only question is whether your SOC will lead the transition or be forced into it by the next breach that your drowning analysts didn't catch in time.
Cymantis Labs helps security teams design, deploy, and govern agentic SOC architectures — from data foundation assessments to full multi-agent orchestration. We bring the engineering rigor and operational experience to make autonomous security operations production-safe.
Resources & References
Agentic SOC Architecture
- CrowdStrike — The Rise of the Agentic SOC: https://www.crowdstrike.com/en-us/blog/agentic-ai-soc-guide/ — CrowdStrike's perspective on Charlotte AI and autonomous security operations
- Anomali — The Evolution of SIEM: https://www.anomali.com/resources/what-is-siem — Research on SIEM generations and the shift to agentic platforms
- Prophet Security — AI SOC Analyst: https://prophetsecurity.ai — Detection engineering and autonomous investigation platform
AI & LLM Frameworks for Security
- OpenAI Function Calling Documentation: https://platform.openai.com/docs/guides/function-calling — Building tool-calling agents
- LangChain Agent Framework: https://python.langchain.com/docs/modules/agents/ — Open-source agent orchestration
- Microsoft Copilot for Security: https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-copilot-security — Enterprise agentic security platform
SIEM & Detection Engineering
- Splunk Enterprise Security: https://docs.splunk.com/Documentation/ES/latest — Splunk ES documentation
- Splunk Risk-Based Alerting: https://docs.splunk.com/Documentation/ES/latest/User/RiskBasedAlerting — RBA implementation guide
- Splunk Common Information Model: https://docs.splunk.com/Documentation/CIM/latest — CIM reference for data normalization
MITRE ATT&CK & Threat Intelligence
- MITRE ATT&CK Framework: https://attack.mitre.org/ — Adversary tactics, techniques, and procedures
- MITRE D3FEND: https://d3fend.mitre.org/ — Defensive technique knowledge graph
- MITRE ATLAS (Adversarial Threat Landscape for AI): https://atlas.mitre.org/ — Threats to AI/ML systems
Governance & Compliance
- NIST AI Risk Management Framework: https://www.nist.gov/artificial-intelligence/ai-risk-management-framework — Federal guidance on AI governance
- NIST SP 800-53 Rev. 5: https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final — Security and privacy controls
- SOC 2 Trust Services Criteria: https://www.aicpa.org/resources/landing/system-and-organization-controls-soc-suite-of-services — SOC 2 compliance framework
Industry Research
- IBM Cost of a Data Breach Report: https://www.ibm.com/reports/data-breach — Annual breach cost analysis including detection time metrics
- SANS SOC Survey: https://www.sans.org/white-papers/ — Annual survey of SOC operations, staffing, and tool adoption
- Gartner — Market Guide for SOAR: https://www.gartner.com/en/documents/ — SOAR market evolution and convergence with AI
For more insights or to schedule a Cymantis Agentic SOC Assessment, contact our research and automation team at cymantis.com.
