Purple Teaming with AI: Autonomous Adversary Simulation Using MITRE ATT&CK
How to build AI-powered purple team exercises using autonomous adversary simulation, MITRE ATT&CK-aligned testing, and agentic red/blue team coordination for continuous security validation.
By Cymantis Labs
Your red team runs an engagement once a quarter — maybe twice if budget allows. It takes two weeks to scope, three weeks to execute, and another two weeks to write the report. By the time the findings land on the CISO's desk, the environment has changed. New cloud workloads spun up. A developer pushed a misconfigured Kubernetes manifest. Three new SaaS integrations went live. The attack surface your red team tested no longer exists.
Meanwhile, your adversaries aren't operating on quarterly cycles. APT groups are deploying generative AI to automate reconnaissance, craft phishing campaigns, and generate polymorphic payloads at machine speed. Ransomware operators iterate weekly. Initial access brokers refresh their inventory daily. The offense-defense asymmetry has never been wider.
Traditional purple teaming — where red and blue teams collaborate in periodic, structured exercises — was a genuine improvement over siloed operations. But it still operates on human timelines. Scoping meetings. Scheduling conflicts. Manual attack execution. Manual detection validation. Manual report generation. The cycle time from "identify a gap" to "validate the fix" is measured in weeks or months.
Agentic purple teaming compresses that cycle into hours. An autonomous red agent executes MITRE ATT&CK-aligned attack chains against your production environment. An autonomous blue agent validates whether your detections fire, your alerts route correctly, and your response playbooks activate. An orchestration layer coordinates the exercise, enforces safety boundaries, and produces a real-time gap analysis that tells you exactly where your defenses fail — and exactly what to fix.
This isn't theoretical. The building blocks exist today: MITRE Caldera for adversary emulation, LLM-driven decision engines for adaptive attack planning, Splunk and SIEM APIs for detection validation, and ATT&CK Navigator for coverage visualization. This post shows you how to assemble them into a production-grade autonomous purple team system.
Red, Blue, Purple — And Why AI Changes Everything
Before diving into architecture, let's ground the terminology and establish why AI fundamentally changes the purple team operating model.
The Traditional Model
Red Team — Offensive operators who simulate real-world adversaries. They attempt to achieve objectives (data exfiltration, domain compromise, ransomware deployment) using the same tactics, techniques, and procedures (TTPs) as threat actors. Red teams test whether an organization can be compromised.
Blue Team — Defensive operators who detect, respond to, and mitigate threats. They operate the SOC, maintain detection rules, run incident response, and harden the environment. Blue teams test whether an organization can detect and stop an attack.
Purple Team — A collaborative function that bridges red and blue. Purple team exercises are structured engagements where offensive actions are executed in coordination with defensive validation. The goal isn't to "win" — it's to identify detection gaps, validate security controls, and improve defensive coverage systematically.
What's Broken
The traditional purple team model has three structural problems:
- Periodicity. Exercises happen quarterly or annually. The gap between exercises is a gap in validation. Controls degrade, configurations drift, new attack techniques emerge, and detection coverage erodes silently between cycles.
- Scale. A human red team can realistically execute 20–40 ATT&CK techniques per engagement. MITRE ATT&CK Enterprise contains over 200 techniques and 400+ sub-techniques. No human team has the bandwidth to test comprehensive coverage in a single exercise.
- Speed. From technique execution to detection validation to gap analysis to remediation to re-test, the feedback loop takes weeks. In an environment where new code deploys daily and infrastructure changes hourly, that latency is unacceptable.
The AI-Augmented Model
Agentic AI addresses all three problems:
- Continuous execution. AI agents don't need scheduling. They can execute adversary simulation continuously — daily, hourly, or on-demand triggered by infrastructure changes.
- Comprehensive coverage. An autonomous red agent can systematically walk the ATT&CK matrix, testing hundreds of techniques across your environment in a single exercise.
- Rapid feedback. Detection validation happens in real time. The moment an attack technique executes, the blue agent queries your SIEM to confirm whether the detection fired. Gap analysis is available in minutes, not weeks.
The key insight: AI doesn't replace human red and blue teamers. It amplifies them. Human operators design the adversary profiles, define the safety boundaries, interpret the results, and make strategic decisions about remediation priorities. AI agents handle the repetitive, high-volume execution and validation work that humans can't scale.
Pro Tip: Start by identifying the 80% of your purple team workload that is repetitive execution and validation — running the same Atomic Red Team tests, checking the same Splunk searches, updating the same coverage spreadsheets. That's your automation target. Reserve human operators for creative adversary simulation, novel attack chain development, and strategic analysis.
The Agentic Purple Team Architecture
A production agentic purple team system has four core components: the Red Agent, the Blue Agent, the Orchestration Layer, and the Reporting Engine. Each component is autonomous within its domain but coordinated through the orchestration layer.
graph TD
OrchestrationLayer["ORCHESTRATION LAYER<br/>Exercise State | Safety Controls | ATT&CK Mapping Engine"]
RedAgent["RED AGENT<br/>LLM Core<br/>ATT&CK Planner<br/>Executor"]
BlueAgent["BLUE AGENT<br/>Detection Validator<br/>SIEM Query<br/>Response Checker"]
ReportingEngine["REPORTING ENGINE<br/>Gap Analysis<br/>ATT&CK Navigator"]
TargetEnvironment["Target Environment"]
SIEMPlatform["SIEM/SOAR Platform"]
ResultsDatabase["Results Database"]
OrchestrationLayer --> RedAgent
OrchestrationLayer --> BlueAgent
OrchestrationLayer --> ReportingEngine
RedAgent <--> BlueAgent
RedAgent --> TargetEnvironment
BlueAgent --> SIEMPlatform
ReportingEngine --> ResultsDatabase
Red Agent: Autonomous Adversary Simulation
The Red Agent is an LLM-driven decision engine that plans and executes multi-phase attack chains. Unlike static adversary emulation tools that replay predefined scripts, the Red Agent makes dynamic decisions based on the target environment's responses — adapting its approach just as a human attacker would.
Core Architecture
The Red Agent operates in a continuous loop: observe → plan → execute → evaluate → adapt. The LLM serves as the planning brain, interpreting environmental feedback and selecting the next technique based on the current attack state, available tools, and target profile.
"""
red_agent.py — Autonomous adversary simulation agent for purple team exercises.
Plans and executes multi-phase ATT&CK-aligned attack chains using LLM-driven
decision-making with tool access to adversary emulation frameworks.
"""
import json
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from openai import OpenAI
logger = logging.getLogger("red_agent")
# AttackPhase(Enum): RECONNAISSANCE → INITIAL_ACCESS → EXECUTION →
# PERSISTENCE → PRIVILEGE_ESCALATION → DEFENSE_EVASION → CREDENTIAL_ACCESS →
# DISCOVERY → LATERAL_MOVEMENT → COLLECTION → EXFILTRATION → IMPACT
# TechniqueResult: technique_id, name, phase, success, output, timestamp,
# duration_ms, target_host, artifacts, detection_expected
# AttackState: exercise_id, current_phase, start_time, compromised_hosts,
# harvested_credentials, discovered_services, executed/failed_techniques
class RedAgent:
"""
Autonomous red team agent that plans and executes ATT&CK-aligned
attack chains using LLM reasoning and adversary emulation tools.
"""
# SYSTEM_PROMPT — Rules of engagement + ATT&CK kill-chain planning strategy:
# Only execute within approved scope; stop on kill switch; prefer stealth;
# adapt based on discovered defenses; provide ATT&CK IDs + reasoning
#
# TOOL_DEFINITIONS — Four LLM function-calling tools:
# execute_technique — run ATT&CK technique via emulation framework
# query_environment — recon: network scan, service/AD/share enum
# check_detection_status — verify if technique triggered SIEM alerts
# report_finding — log detection gaps and control failures
def __init__(self, config: dict):
self.client = OpenAI()
self.model = config.get("model", "gpt-4o")
self.exercise_config = config
self.state = AttackState(
exercise_id=config["exercise_id"],
current_phase=AttackPhase.RECONNAISSANCE
)
self.conversation_history = [
{"role": "system", "content": self.SYSTEM_PROMPT}
]
self.max_steps = config.get("max_steps", 50)
self.safety_controller = SafetyController(config.get("safety", {}))
def run_exercise(self) -> list[TechniqueResult]:
"""Execute a full autonomous attack chain exercise."""
results, step = [], 0
logger.info("Starting exercise: %s", self.state.exercise_id)
while step < self.max_steps:
if not self.safety_controller.is_safe_to_continue(self.state):
logger.warning("Safety halt at step %d", step)
break
try:
action = self.plan_next_action()
if action.get("exercise_complete"):
break
if action.get("technique_result"):
result = action["technique_result"]
results.append(result)
self._update_state(result)
time.sleep(
self.exercise_config.get("detection_delay_seconds", 30)
)
except Exception as e:
logger.error("Step %d error: %s", step, e)
if not self.exercise_config.get("continue_on_error", True):
break
step += 1
logger.info("Complete: %d techniques in %d steps", len(results), step)
return results
# ── Remaining methods ──────────────────────────────────────────────
# plan_next_action() — build state summary, query LLM with tool defs
# _build_state_summary() — serialize attack state for LLM context window
# _update_state(result) — update compromised hosts, creds, services
# _process_response(msg) — extract tool calls, safety-check, dispatch
# _dispatch_tool(name, args)— route to caldera/env/detection handlers
# _execute_via_caldera() — trigger ATT&CK technique via Caldera REST API
# _query_environment() — recon queries (network scan, AD/share enum)
# _check_detections() — query SIEM for alerts triggered by technique
# _report_finding() — record detection gap in results database
# _map_technique_to_phase() — ATT&CK technique ID prefix → AttackPhase
class SafetyController:
"""Enforces safety boundaries during autonomous red team operations."""
def __init__(self, config: dict):
self.max_compromised_hosts = config.get("max_compromised_hosts", 10)
self.blocked_techniques = config.get("blocked_techniques", [])
self.max_duration_minutes = config.get("max_duration_minutes", 120)
self.kill_switch_active = False
def is_safe_to_continue(self, state: AttackState) -> bool:
if self.kill_switch_active:
return False
if len(state.compromised_hosts) >= self.max_compromised_hosts:
return False
return (time.time() - state.start_time) / 60 < self.max_duration_minutes
def validate_technique(self, args: dict, scope: dict) -> bool:
if args.get("technique_id", "") in self.blocked_techniques:
return False
target = args.get("target_host", "")
return not scope.get("allowed_targets") or target in scope["allowed_targets"]
Pro Tip: The detection_delay_seconds parameter is critical. Your SIEM needs time to ingest, parse, and correlate events before the Blue Agent can validate detections. Set this to match your environment's actual detection pipeline latency — typically 15–60 seconds for well-tuned Splunk deployments, longer for cloud-native SIEMs with batched ingestion.
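Rather than a single fixed sleep, you can poll the SIEM with a timeout so fast pipelines validate in seconds while slow ones still get their full window. A minimal sketch (the `wait_for_detection` helper and its parameters are illustrative, not part of the agent code above):

```python
import time

def wait_for_detection(check_fn, timeout_s=120, poll_interval_s=5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll a SIEM check until it returns results or the timeout expires.

    check_fn returns matching alerts (truthy) or None/empty while nothing
    has fired yet. Returns (result, elapsed_seconds); result is falsy on
    timeout. clock and sleep are injectable so the loop is testable.
    """
    start = clock()
    while True:
        result = check_fn()
        elapsed = clock() - start
        if result or elapsed >= timeout_s:
            return result, elapsed
        sleep(min(poll_interval_s, timeout_s - elapsed))
```

With the Blue Agent's `detection_timeout` as `timeout_s`, a detection that fires in ten seconds is validated in ten seconds instead of waiting out the full fixed delay.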
Blue Agent: Detection and Response Validation
The Blue Agent is the defensive counterpart. Its job is to validate whether each technique executed by the Red Agent was detected by your security stack. It queries your SIEM, checks alert routing, validates response playbook activation, and records the results.
"""
blue_agent.py — Autonomous detection validation agent for purple team exercises.
Validates SIEM detections, alert routing, and response activation against
red team technique execution results.
"""
import json
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import requests
logger = logging.getLogger("blue_agent")
class DetectionStatus(Enum):
DETECTED = "detected"
PARTIAL = "partial"
MISSED = "missed"
DELAYED = "delayed"
FALSE_POSITIVE = "false_positive"
class ResponseStatus(Enum):
ACTIVATED = "activated"
PARTIAL = "partial"
NOT_TRIGGERED = "not_triggered"
FAILED = "failed"
@dataclass
class DetectionResult:
technique_id: str
technique_name: str
detection_status: DetectionStatus
response_status: ResponseStatus
detection_time_ms: Optional[int] = None
detection_rule_name: Optional[str] = None
alert_severity: Optional[str] = None
siem_search_id: Optional[str] = None
response_playbook: Optional[str] = None
evidence: dict = field(default_factory=dict)
gap_details: Optional[str] = None
class BlueAgent:
"""
Autonomous blue team agent that validates detections and response
actions against red team technique execution results.
"""
def __init__(self, config: dict):
self.splunk_url = config["splunk_url"]
self.splunk_token = config["splunk_token"]
self.soar_url = config.get("soar_url")
self.soar_token = config.get("soar_token")
self.detection_timeout_seconds = config.get("detection_timeout", 120)
self.technique_detection_map = self._load_detection_map(
config.get("detection_map_path", "detection_map.yaml")
)
def validate_technique(self, technique_result) -> DetectionResult:
"""
Validate whether a red team technique was detected and
whether the appropriate response was triggered.
"""
technique_id = technique_result.technique_id
target_host = technique_result.target_host
execution_time = technique_result.timestamp
logger.info(
"Validating detection for %s on %s",
technique_id, target_host
)
# Step 1: Check SIEM for detection
detection = self._check_siem_detection(
technique_id, target_host, execution_time
)
# Step 2: Check response activation
response = self._check_response_activation(
technique_id, target_host, execution_time
)
# Step 3: Build detection result
result = DetectionResult(
technique_id=technique_id,
technique_name=technique_result.technique_name,
detection_status=detection["status"],
response_status=response["status"],
detection_time_ms=detection.get("time_ms"),
detection_rule_name=detection.get("rule_name"),
alert_severity=detection.get("severity"),
siem_search_id=detection.get("search_id"),
response_playbook=response.get("playbook_name"),
evidence={
"siem_results": detection.get("raw_results", []),
"response_results": response.get("raw_results", [])
}
)
# Step 4: Identify gaps
if result.detection_status == DetectionStatus.MISSED:
result.gap_details = self._analyze_detection_gap(
technique_id, target_host, execution_time
)
return result
# ── Remaining methods ──────────────────────────────────────────────
# _check_siem_detection() — query Splunk for detections matching technique
# _build_validation_query() — build Splunk SPL query for detection validation
# _execute_splunk_search() — execute search against Splunk REST API
# _check_response_activation() — check whether SOAR playbooks activated
# _analyze_detection_gap() — analyze why detection was missed, provide guidance
# _load_detection_map() — load technique-to-detection mapping from YAML
The Blue Agent uses a technique-to-detection mapping file that links ATT&CK technique IDs to specific SIEM detection rules. Here's the mapping format:
# detection_map.yaml — Maps ATT&CK techniques to Splunk detection rules
# Used by Blue Agent to validate detection coverage during purple team exercises
T1059.001: # PowerShell Command and Scripting Interpreter
- rule_name: "Suspicious PowerShell Execution"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Microsoft-Windows-PowerShell/Operational"
- "WinEventLog:Security"
- rule_name: "Encoded PowerShell Command Detected"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Microsoft-Windows-PowerShell/Operational"
T1003.001: # LSASS Memory Credential Dumping
- rule_name: "LSASS Memory Access Detected"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
T1021.002: # SMB/Windows Admin Shares
- rule_name: "Lateral Movement via Admin Shares"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Security"
T1053.005: # Scheduled Task
- rule_name: "Suspicious Scheduled Task Creation"
index: notable
expected_severity: medium
data_sources:
- "WinEventLog:Security"
- "WinEventLog:Microsoft-Windows-TaskScheduler/Operational"
T1070.001: # Indicator Removal — Clear Windows Event Logs
- rule_name: "Windows Event Log Cleared"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Security"
T1548.002: # Abuse Elevation Control — UAC Bypass
- rule_name: "UAC Bypass Attempt Detected"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
T1566.001: # Spearphishing Attachment
- rule_name: "Suspicious Email Attachment Execution"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
- "stream:smtp"
T1486: # Data Encrypted for Impact (Ransomware)
- rule_name: "Ransomware File Encryption Behavior"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
Pro Tip: The most common reason for detection misses isn't a missing rule — it's a missing data source. Before blaming detection engineering, the Blue Agent first checks whether the raw telemetry exists in the SIEM. A "telemetry gap" finding is often more actionable than a "detection gap" finding because it points to a log forwarding or ingestion issue that affects multiple detections.
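That triage logic is easy to encode. A minimal sketch (the function name and categories are illustrative): given the count of raw telemetry events and the count of alerts in a technique's execution window, classify the finding:

```python
def classify_gap(raw_event_count: int, alert_count: int) -> str:
    """Distinguish telemetry gaps from detection gaps.

    If the raw telemetry for a technique never reached the SIEM, the fix
    is log forwarding/ingestion; if telemetry exists but no alert fired,
    the fix is detection engineering.
    """
    if alert_count > 0:
        return "detected"
    if raw_event_count == 0:
        return "telemetry_gap"   # fix log forwarding / ingestion
    return "detection_gap"       # fix detection rule coverage
```

A telemetry-gap finding routes to the infrastructure team; a detection-gap finding routes to detection engineering.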
Orchestration Layer
The Orchestration Layer is the command-and-control plane for the entire purple team exercise. It manages exercise state, coordinates red and blue agents, enforces safety controls, and feeds results to the reporting engine.
"""
orchestrator.py — Purple team exercise orchestration engine.
Coordinates red and blue agents, manages exercise lifecycle,
enforces safety boundaries, and aggregates results.
"""
import json
import logging
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

from red_agent import RedAgent
from blue_agent import BlueAgent
logger = logging.getLogger("orchestrator")
@dataclass
class ExerciseConfig:
exercise_id: str
name: str
threat_profile: str
scope: dict
techniques: list
safety: dict
red_agent_config: dict
blue_agent_config: dict
max_duration_minutes: int = 120
detection_delay_seconds: int = 30
continue_on_error: bool = True
@dataclass
class ExerciseResult:
exercise_id: str
name: str
threat_profile: str
start_time: float
end_time: Optional[float] = None
techniques_executed: int = 0
techniques_detected: int = 0
techniques_missed: int = 0
techniques_partial: int = 0
detection_coverage_pct: float = 0.0
mean_detection_time_ms: float = 0.0
findings: list = field(default_factory=list)
technique_results: list = field(default_factory=list)
gap_analysis: dict = field(default_factory=dict)
class PurpleTeamOrchestrator:
"""
Orchestrates autonomous purple team exercises by coordinating
red and blue agents through structured attack-detect-validate cycles.
"""
def __init__(self, config: ExerciseConfig):
self.config = config
self.red_agent = RedAgent(self._build_red_config())
self.blue_agent = BlueAgent(config.blue_agent_config)
self.result = ExerciseResult(
exercise_id=config.exercise_id,
name=config.name,
threat_profile=config.threat_profile,
start_time=time.time()
)
def run_exercise(self) -> ExerciseResult:
"""Execute a full purple team exercise."""
logger.info(
"=== Starting Purple Team Exercise: %s ===", self.config.name
)
logger.info("Threat profile: %s", self.config.threat_profile)
logger.info(
"Scope: %d target hosts, %d techniques",
len(self.config.scope.get("allowed_targets", [])),
len(self.config.techniques)
)
# Phase 1: Red agent executes attack chain
red_results = self.red_agent.run_exercise()
# Phase 2: Blue agent validates each technique
for red_result in red_results:
blue_result = self.blue_agent.validate_technique(red_result)
self.result.technique_results.append({
"red": red_result,
"blue": blue_result
})
self._update_counters(blue_result)
# Phase 3: Generate gap analysis
self.result.gap_analysis = self._generate_gap_analysis()
self.result.end_time = time.time()
# Calculate coverage metrics
total = self.result.techniques_executed
if total > 0:
self.result.detection_coverage_pct = (
self.result.techniques_detected / total
) * 100
logger.info("=== Exercise Complete ===")
logger.info(
"Coverage: %.1f%% (%d/%d detected)",
self.result.detection_coverage_pct,
self.result.techniques_detected,
total
)
return self.result
# ── Remaining methods ──────────────────────────────────────────────
# _update_counters() — update exercise counters based on detection results
# _generate_gap_analysis() — generate comprehensive gap analysis from results
# _generate_recommendations() — generate prioritized remediation recommendations
# _build_red_config() — build red agent configuration from exercise config
# _get_tactic_for_technique() — look up ATT&CK tactic for a technique ID
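As a sketch of what the elided `_generate_gap_analysis` might do (the input field names here are illustrative, not the orchestrator's actual schema), grouping missed techniques by ATT&CK tactic makes remediation assignable per detection-engineering owner:

```python
from collections import defaultdict

def summarize_gaps(technique_results: list[dict]) -> dict[str, list[str]]:
    """Group missed techniques by ATT&CK tactic for the gap analysis."""
    gaps: dict[str, list[str]] = defaultdict(list)
    for tr in technique_results:
        if tr["status"] == "missed":
            gaps[tr["tactic"]].append(tr["technique_id"])
    return dict(gaps)
```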
Reporting Engine
The Reporting Engine transforms raw exercise results into actionable intelligence: ATT&CK Navigator layers for coverage visualization, executive summaries for leadership, and detailed technical findings for engineering teams.
"""
reporter.py — Purple team exercise reporting engine.
Generates ATT&CK Navigator layers, gap analysis reports,
and remediation recommendations from exercise results.
"""
import json
import logging
from datetime import datetime

from blue_agent import DetectionStatus
logger = logging.getLogger("reporter")
class PurpleTeamReporter:
"""Generates reports and ATT&CK Navigator layers from exercise results."""
NAVIGATOR_TEMPLATE = {
"name": "",
"versions": {"attack": "14", "navigator": "4.9.1", "layer": "4.5"},
"domain": "enterprise-attack",
"description": "",
"filters": {"platforms": ["Windows", "Linux", "macOS"]},
"sorting": 0,
"layout": {
"layout": "side",
"aggregateFunction": "average",
"showID": True,
"showName": True
},
"hideDisabled": False,
"techniques": [],
"gradient": {
"colors": ["#ff6666", "#ffff66", "#66ff66"],
"minValue": 0,
"maxValue": 100
},
"legendItems": [
{"label": "Detected", "color": "#66ff66"},
{"label": "Partial Detection", "color": "#ffff66"},
{"label": "Missed / No Coverage", "color": "#ff6666"},
{"label": "Not Tested", "color": "#d3d3d3"}
]
}
def generate_navigator_layer(
self, exercise_result, output_path: str
) -> str:
"""
Generate an ATT&CK Navigator JSON layer showing
detection coverage from exercise results.
"""
layer = self.NAVIGATOR_TEMPLATE.copy()
layer["name"] = f"Purple Team: {exercise_result.name}"
layer["description"] = (
f"Detection coverage results from purple team exercise "
f"'{exercise_result.name}' executed on "
f"{datetime.fromtimestamp(exercise_result.start_time).isoformat()}"
)
techniques = []
for tr in exercise_result.technique_results:
blue = tr["blue"]
technique_entry = {
"techniqueID": blue.technique_id,
"tactic": "",
"enabled": True,
"showSubtechniques": True
}
if blue.detection_status == DetectionStatus.DETECTED:
technique_entry["color"] = "#66ff66"
technique_entry["score"] = 100
technique_entry["comment"] = (
f"Detected by: {blue.detection_rule_name} "
f"(latency: {blue.detection_time_ms}ms)"
)
elif blue.detection_status == DetectionStatus.PARTIAL:
technique_entry["color"] = "#ffff66"
technique_entry["score"] = 50
technique_entry["comment"] = "Partial detection — review needed"
elif blue.detection_status == DetectionStatus.MISSED:
technique_entry["color"] = "#ff6666"
technique_entry["score"] = 0
technique_entry["comment"] = (
f"MISSED: {blue.gap_details or 'No detection rule'}"
)
techniques.append(technique_entry)
layer["techniques"] = techniques
with open(output_path, "w") as f:
json.dump(layer, f, indent=2)
logger.info("Navigator layer written to %s", output_path)
return output_path
# ── Remaining methods ──────────────────────────────────────────────
# generate_executive_summary() — generate markdown executive summary with metrics
Pro Tip: Export your ATT&CK Navigator layers to your organization's internal wiki or security dashboard after every exercise. Over time, you build a historical record of coverage improvement that is invaluable for demonstrating security program maturity to auditors and leadership.
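A tiny helper for that historical export (the input shape is an assumption: a list of per-exercise summary dicts, not the `ExerciseResult` dataclass itself):

```python
def coverage_trend(exercises: list[dict]) -> list[dict]:
    """Summarize detection coverage per exercise for dashboard/wiki export."""
    return [
        {
            "exercise": e["name"],
            "coverage_pct": round(100 * e["detected"] / e["executed"], 1),
        }
        for e in exercises
        if e["executed"]  # skip aborted exercises with zero techniques run
    ]
```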
MITRE ATT&CK Integration
The MITRE ATT&CK framework is the backbone of autonomous purple teaming. Every technique the Red Agent executes, every detection the Blue Agent validates, and every gap the Reporting Engine identifies is anchored to a specific ATT&CK technique ID. This section covers how to deeply integrate ATT&CK into your purple team automation.
Mapping Adversary Simulation to ATT&CK Techniques
The first step is building a threat-profile-to-technique mapping. Different adversary groups use different subsets of ATT&CK. Your purple team exercises should test the techniques most relevant to your threat landscape.
| Threat Profile | Primary Tactics | Key Techniques | Common Tools |
|---|---|---|---|
| APT29 (Cozy Bear) | Initial Access, Execution, Defense Evasion | T1566.001, T1059.001, T1027, T1071.001 | Cobalt Strike, PowerShell |
| FIN7 (Financial) | Initial Access, Persistence, Collection | T1566.001, T1053.005, T1560.001, T1041 | Carbanak, PowerShell |
| Ransomware Operator | Initial Access, Lateral Movement, Impact | T1566.002, T1021.002, T1486, T1490 | RDP, PsExec, Mimikatz |
| Insider Threat | Collection, Exfiltration, Defense Evasion | T1560.001, T1041, T1070.004, T1567.002 | Native OS tools, cloud storage |
| Supply Chain | Initial Access, Execution, Persistence | T1195.002, T1059.001, T1053.005, T1543.003 | Malicious packages, scripts |
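In code, the table above reduces to a simple lookup (technique lists trimmed to the subset shown; pull the full sets from the MITRE ATT&CK Groups pages):

```python
# Illustrative subset of the profile-to-technique table above.
THREAT_PROFILES: dict[str, list[str]] = {
    "apt29_cozy_bear": ["T1566.001", "T1059.001", "T1027", "T1071.001"],
    "fin7": ["T1566.001", "T1053.005", "T1560.001", "T1041"],
    "ransomware_operator": ["T1566.002", "T1021.002", "T1486", "T1490"],
}

def techniques_for_profiles(profiles: list[str]) -> list[str]:
    """Union of techniques across profiles, de-duplicated, order preserved."""
    seen: set[str] = set()
    selected: list[str] = []
    for profile in profiles:
        for technique in THREAT_PROFILES.get(profile, []):
            if technique not in seen:
                seen.add(technique)
                selected.append(technique)
    return selected
```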
Technique Selection Strategy
Not all ATT&CK techniques are equally relevant to every organization. Use this prioritization framework to select techniques for each exercise:
- Threat Intelligence Driven. Start with techniques used by the threat groups most likely to target your industry and region. Pull technique lists from MITRE ATT&CK Groups pages and map them to your threat profile.
- Coverage Gap Driven. After your first exercise, prioritize techniques that were missed. Re-test after remediation to validate the fix.
- Crown Jewel Driven. Identify the attack paths to your most critical assets and test every technique along those paths. If your crown jewel is a database server, test every lateral movement and credential access technique that could reach it.
- Compliance Driven. Map regulatory requirements (PCI DSS, HIPAA, CMMC) to ATT&CK techniques and ensure those techniques are covered in your exercises.
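The four drivers combine naturally into an additive score for ranking which techniques make the next exercise. The weights below are illustrative, not a standard; tune them to your threat model:

```python
def priority_score(in_threat_intel: bool, previously_missed: bool,
                   on_crown_jewel_path: bool, compliance_required: bool) -> int:
    """Rank a technique by the four selection drivers; higher runs first."""
    return (
        (40 if in_threat_intel else 0)        # threat intelligence driven
        + (30 if previously_missed else 0)    # coverage gap driven
        + (20 if on_crown_jewel_path else 0)  # crown jewel driven
        + (10 if compliance_required else 0)  # compliance driven
    )
```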
Exercise Configuration with ATT&CK Profiles
Here's a complete exercise configuration that defines a threat profile, selects techniques, and sets safety boundaries:
# exercise_config.yaml — APT29 Simulation Exercise
# Autonomous purple team exercise targeting APT29 TTPs
exercise:
id: "PT-2025-Q4-APT29"
name: "APT29 Cozy Bear Simulation — Q4 2025"
description: >
Autonomous adversary simulation targeting APT29 TTPs
against corporate Windows domain environment.
threat_profile: "apt29_cozy_bear"
classification: "INTERNAL — AUTHORIZED EXERCISE"
scope:
allowed_targets:
- "10.10.50.0/24" # Purple team lab subnet
- "WS-PT-001" # Target workstation
- "WS-PT-002" # Target workstation
- "DC-PT-001" # Target domain controller
- "SRV-PT-FILE" # Target file server
excluded_targets:
- "10.10.50.1" # Gateway — do not touch
- "DC-PROD-*" # Production DCs — absolutely not
allowed_techniques_max_impact: "high" # No "critical" destructive techniques
network_boundaries:
- "No egress to internet"
- "No lateral to production VLANs"
techniques:
# Reconnaissance
- id: "T1595.002"
name: "Active Scanning — Vulnerability Scanning"
priority: medium
stealth: quiet
# Initial Access
- id: "T1566.001"
name: "Phishing — Spearphishing Attachment"
priority: high
stealth: moderate
parameters:
payload_type: "macro_document"
delivery: "simulated_email"
# Execution
- id: "T1059.001"
name: "Command and Scripting Interpreter — PowerShell"
priority: critical
stealth: moderate
parameters:
encoding: "base64"
bypass_method: "amsi_bypass"
- id: "T1059.003"
name: "Command and Scripting Interpreter — Windows Command Shell"
priority: high
stealth: quiet
# Persistence
- id: "T1053.005"
name: "Scheduled Task/Job — Scheduled Task"
priority: high
stealth: quiet
parameters:
task_name: "SystemHealthCheck"
trigger: "daily"
- id: "T1547.001"
name: "Boot or Logon Autostart Execution — Registry Run Keys"
priority: high
stealth: moderate
# Privilege Escalation
- id: "T1548.002"
name: "Abuse Elevation Control Mechanism — UAC Bypass"
priority: high
stealth: stealthy
# Defense Evasion
- id: "T1027"
name: "Obfuscated Files or Information"
priority: medium
stealth: stealthy
- id: "T1070.001"
name: "Indicator Removal — Clear Windows Event Logs"
priority: critical
stealth: loud
# Credential Access
- id: "T1003.001"
name: "OS Credential Dumping — LSASS Memory"
priority: critical
stealth: moderate
parameters:
method: "comsvcs_minidump"
- id: "T1558.003"
name: "Steal or Forge Kerberos Tickets — Kerberoasting"
priority: high
stealth: stealthy
# Discovery
- id: "T1082"
name: "System Information Discovery"
priority: low
stealth: quiet
- id: "T1087.002"
name: "Account Discovery — Domain Account"
priority: medium
stealth: quiet
# Lateral Movement
- id: "T1021.002"
name: "Remote Services — SMB/Windows Admin Shares"
priority: critical
stealth: moderate
- id: "T1021.001"
name: "Remote Services — Remote Desktop Protocol"
priority: high
stealth: loud
# Collection
- id: "T1560.001"
name: "Archive Collected Data — Archive via Utility"
priority: medium
stealth: quiet
# Command and Control
- id: "T1071.001"
name: "Application Layer Protocol — Web Protocols"
priority: high
stealth: stealthy
parameters:
protocol: "HTTPS"
domain: "cdn-update.purpleteam.local"
# Exfiltration
- id: "T1041"
name: "Exfiltration Over C2 Channel"
priority: critical
stealth: stealthy
safety:
max_compromised_hosts: 4
max_duration_minutes: 90
blocked_techniques:
- "T1485" # Data Destruction
- "T1486" # Data Encrypted for Impact
- "T1489" # Service Stop
- "T1490" # Inhibit System Recovery
- "T1529" # System Shutdown/Reboot
- "T1561" # Disk Wipe
kill_switch:
enabled: true
trigger_conditions:
- "production_network_access_detected"
- "unscoped_host_compromised"
- "manual_operator_override"
notification_channels:
- "slack:#purple-team-ops"
- "pagerduty:purple-team-lead"
allowed_subnets:
- "10.10.50.0/24"
require_human_approval:
- "T1003.001" # Credential dumping requires approval
- "T1021.002" # Lateral movement requires approval
red_agent:
model: "gpt-4o"
max_steps: 40
detection_delay_seconds: 30
continue_on_error: true
stealth_preference: "moderate"
blue_agent:
splunk_url: "https://splunk-es.internal:8089"
splunk_token: "${SPLUNK_API_TOKEN}"
soar_url: "https://soar.internal/api"
soar_token: "${SOAR_API_TOKEN}"
detection_timeout: 120
detection_map_path: "configs/detection_map.yaml"
reporting:
output_directory: "reports/PT-2025-Q4-APT29"
navigator_layer: true
executive_summary: true
detailed_findings: true
coverage_trend: true
Pro Tip: The require_human_approval list is one of the most important safety features. For high-impact techniques like credential dumping and lateral movement, the orchestrator pauses and requests explicit human approval before the Red Agent proceeds. This maintains human oversight for techniques that carry real risk even in a lab environment.
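A minimal sketch of that gate (class and callback names are illustrative; in production the `approve_fn` callback would post to Slack or PagerDuty and block on the operator's response rather than return immediately):

```python
class ApprovalGate:
    """Pause execution for techniques on the human-approval list."""

    def __init__(self, require_approval: list[str], approve_fn):
        self.require_approval = set(require_approval)
        self.approve_fn = approve_fn  # callable(technique_id) -> bool

    def check(self, technique_id: str) -> bool:
        """Return True if the technique may proceed."""
        if technique_id not in self.require_approval:
            return True  # no gate for this technique
        return bool(self.approve_fn(technique_id))
```

The orchestrator calls `gate.check(technique_id)` before dispatching each Red Agent tool call; a False return skips the technique and logs the denial.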
Building Your First AI Purple Team Exercise
This section provides a step-by-step walkthrough for setting up and running your first autonomous purple team exercise. We'll use MITRE Caldera as the adversary emulation backend, an LLM for adaptive planning, and Splunk for detection validation.
Step 1: Set Up MITRE Caldera with LLM Plugins
MITRE Caldera is an open-source adversary emulation platform that provides the execution engine for our Red Agent. Install it and configure the API for external integration.
# Clone and install Caldera
git clone https://github.com/mitre/caldera.git --recursive
cd caldera
# Create a Python virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Start Caldera server (default port 8888)
python server.py --insecure --fresh
# Verify the API is accessible
curl -s http://localhost:8888/api/v2/health \
-H "KEY:ADMIN123" | python3 -m json.tool
Configure Caldera's API access for external agent integration:
# caldera/conf/local.yml — Caldera server configuration
users:
red:
red: admin
blue:
blue: admin
api_key_red: CALDERA_RED_API_KEY_CHANGE_ME
api_key_blue: CALDERA_BLUE_API_KEY_CHANGE_ME
port: 8888
host: 0.0.0.0
plugins:
- sandcat # Default agent (Go-based implant)
- stockpile # ATT&CK technique library
- compass # ATT&CK Navigator integration
- access # Initial access techniques
- manx # Reverse shell agent
- response # Blue team response actions
exfil_dir: /tmp/caldera_exfil
reports_dir: /tmp/caldera_reports
logging:
level: DEBUG
file: logs/caldera.log
Step 2: Deploy Caldera Agents on Target Systems
Deploy Caldera's Sandcat agent on your purple team lab hosts:
# On target Windows workstation (PowerShell)
# Downloads and executes the Sandcat agent
$server="http://caldera-server:8888"
$url="$server/file/download"
$wc=New-Object System.Net.WebClient
$wc.Headers.add("platform","windows")
$wc.Headers.add("file","sandcat.go")
$wc.Headers.add("server","$server")
$output="C:\Users\Public\sandcat.exe"
$wc.DownloadFile($url,$output)
Start-Process -FilePath $output -ArgumentList "-server $server -group purple_team" -WindowStyle Hidden
# On target Linux server
curl -s -X POST http://caldera-server:8888/file/download \
-H "platform:linux" -H "file:sandcat.go" \
-H "server:http://caldera-server:8888" \
-o /tmp/sandcat
chmod +x /tmp/sandcat
nohup /tmp/sandcat -server http://caldera-server:8888 -group purple_team &
Verify agent connectivity:
# List active agents via Caldera API
curl -s http://localhost:8888/api/v2/agents \
-H "KEY:CALDERA_RED_API_KEY_CHANGE_ME" | python3 -m json.tool
# Expected output shows registered agents with their properties
# {
# "paw": "abc123",
# "host": "WS-PT-001",
# "platform": "windows",
# "group": "purple_team",
# "trusted": true,
# "last_seen": "2025-12-22T10:30:00Z"
# }
Step 3: Configure Adversary Profiles
Create a custom adversary profile in Caldera that aligns with your threat model:
# Create APT29 adversary profile via Caldera API
curl -s -X POST http://localhost:8888/api/v2/adversaries \
-H "KEY:CALDERA_RED_API_KEY_CHANGE_ME" \
-H "Content-Type: application/json" \
-d '{
"name": "APT29 Purple Team Profile",
"description": "APT29 Cozy Bear TTP simulation for purple team exercise",
"atomic_ordering": [
"90c2efaa-8205-480d-8bb6-61d90dbaf81b",
"d69e9e0c-59cd-4012-a7b7-4c2b4c79afb5",
"3b5db901-2a6a-4006-a04e-5321f7e0755d",
"6469befa-748a-4b9c-a96d-f191fde47d89",
"a398986f-813f-4086-bbcd-3f5406d12bc0"
],
"objective": "495a9828-cab1-44dd-a0ca-66e58177d8cc"
}'
Step 4: Build the LLM Integration Layer
Connect the LLM planning engine to Caldera's execution backend:
"""
caldera_integration.py — Bridges the LLM-based Red Agent with
MITRE Caldera's adversary emulation execution engine.
"""
import json
import logging
import time
from typing import Optional
import requests
logger = logging.getLogger("caldera_integration")
class CalderaExecutor:
"""
Provides the execution backend for the Red Agent by translating
ATT&CK technique requests into Caldera API operations.
"""
def __init__(self, caldera_url: str, api_key: str):
self.base_url = caldera_url.rstrip("/")
self.headers = {
"KEY": api_key,
"Content-Type": "application/json"
}
self.ability_cache = {}
self._load_abilities()
def execute_technique(
self,
technique_id: str,
target_paw: str,
parameters: Optional[dict] = None
) -> dict:
"""
Execute a specific ATT&CK technique against a target agent
via Caldera's operation API.
"""
abilities = self.ability_cache.get(technique_id, [])
if not abilities:
return {
"executed": False,
"error": f"No Caldera ability found for {technique_id}"
}
# Select the first matching ability
ability = abilities[0]
# Create a one-off operation for this technique
operation_payload = {
"name": f"PT-{technique_id}-{int(time.time())}",
"adversary": {"adversary_id": "", "name": ""},
"group": "purple_team",
"auto_close": True,
"jitter": "2/5",
"source": {
"id": "ed32b9c3-9593-4c33-b0db-e2007315096b"
},
"planner": {
"id": "aaa7c857-37a0-4c4a-85f7-4e9f7f30e31a"
},
"manual_command": {
"paw": target_paw,
"ability_id": ability["ability_id"]
}
}
if parameters:
operation_payload["facts"] = [
{"trait": k, "value": v}
for k, v in parameters.items()
]
# Start the operation
response = requests.post(
f"{self.base_url}/api/v2/operations",
headers=self.headers,
json=operation_payload,
timeout=30
)
if response.status_code != 200:
return {
"executed": False,
"error": f"Caldera API error: {response.status_code}"
}
operation = response.json()
operation_id = operation.get("id")
# Wait for operation to complete
result = self._wait_for_operation(operation_id)
return {
"executed": True,
"success": result.get("success", False),
"output": result.get("output", ""),
"duration_ms": result.get("duration_ms", 0),
"technique_name": ability["name"],
"operation_id": operation_id,
"artifacts": result.get("artifacts", {})
}
# ── Remaining methods ──────────────────────────────────────────────
# _load_abilities() — cache Caldera abilities indexed by ATT&CK technique ID
# _wait_for_operation() — poll Caldera until operation completes or times out
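The two stubbed helpers can be sketched as standalone functions that `CalderaExecutor` would call with its `base_url` and `headers`. The `/api/v2/abilities` and `/api/v2/operations/{id}` endpoints and the `technique_id` and `state` field names reflect Caldera's v2 API, but verify them against the Caldera version you deploy:

```python
"""
caldera_helpers.py - Sketches of the stubbed CalderaExecutor helpers.
Endpoint paths and field names should be checked against your Caldera
version before use.
"""
import time

import requests


def index_abilities(abilities: list) -> dict:
    """Index a Caldera ability catalog by ATT&CK technique ID."""
    cache: dict = {}
    for ability in abilities:
        technique = ability.get("technique_id")
        if technique:
            cache.setdefault(technique, []).append(ability)
    return cache


def load_abilities(base_url: str, headers: dict) -> dict:
    """Fetch the full ability catalog and build the technique index."""
    response = requests.get(
        f"{base_url}/api/v2/abilities", headers=headers, timeout=30
    )
    response.raise_for_status()
    return index_abilities(response.json())


def wait_for_operation(
    base_url: str,
    headers: dict,
    operation_id: str,
    poll_seconds: int = 5,
    timeout_seconds: int = 300,
) -> dict:
    """Poll Caldera until the operation reports 'finished' or time runs out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = requests.get(
            f"{base_url}/api/v2/operations/{operation_id}",
            headers=headers,
            timeout=30,
        )
        if response.status_code == 200:
            operation = response.json()
            if operation.get("state") == "finished":
                return {"success": True, "output": operation}
        time.sleep(poll_seconds)
    return {"success": False, "output": "", "error": "operation timed out"}
```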
Step 5: Run the Autonomous Exercise
With all components in place, launch the exercise:
"""
run_exercise.py — Launch an autonomous purple team exercise.
"""
import yaml
import logging
from orchestrator import PurpleTeamOrchestrator, ExerciseConfig
from reporter import PurpleTeamReporter
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(name)s] %(levelname)s: %(message)s"
)
# Load exercise configuration
with open("configs/exercise_config.yaml", "r") as f:
raw_config = yaml.safe_load(f)
config = ExerciseConfig(
exercise_id=raw_config["exercise"]["id"],
name=raw_config["exercise"]["name"],
threat_profile=raw_config["exercise"]["threat_profile"],
scope=raw_config["scope"],
techniques=raw_config["techniques"],
safety=raw_config["safety"],
red_agent_config=raw_config["red_agent"],
blue_agent_config=raw_config["blue_agent"],
max_duration_minutes=raw_config["safety"]["max_duration_minutes"],
detection_delay_seconds=raw_config["red_agent"]["detection_delay_seconds"]
)
# Run the exercise
orchestrator = PurpleTeamOrchestrator(config)
result = orchestrator.run_exercise()
# Generate reports
reporter = PurpleTeamReporter()
reporter.generate_navigator_layer(
result,
f"{raw_config['reporting']['output_directory']}/navigator_layer.json"
)
summary = reporter.generate_executive_summary(result)
print(summary)
print(f"\nDetection Coverage: {result.detection_coverage_pct:.1f}%")
print(f"Mean Detection Time: {result.mean_detection_time_ms:.0f}ms")
print(f"Total Gaps Found: {len(result.findings)}")
Step 6: Validate Detections in Splunk
After the Red Agent executes techniques, validate detection coverage directly in Splunk. Here are the key validation queries:
| `notable`
| where _time >= relative_time(now(), "-2h")
| eval exercise_tag = if(match(src, "10\.10\.50\."), "purple_team", "production")
| search exercise_tag="purple_team"
| stats count by rule_name, severity, mitre_technique_id, dest
| sort - count
Check for technique-specific detection coverage:
| `notable`
| where _time >= relative_time(now(), "-4h")
| search (dest="WS-PT-*" OR dest="DC-PT-*" OR dest="SRV-PT-*")
| eval technique = mvindex(split(mitre_technique_id, ","), 0)
| stats
count AS total_detections
dc(rule_name) AS unique_rules
values(rule_name) AS rules_fired
min(_time) AS first_detection
max(_time) AS last_detection
by technique
| eval detection_span_seconds = last_detection - first_detection
| sort technique
Identify telemetry gaps — techniques that generated raw events but no detections:
| tstats count WHERE index=* (host="WS-PT-*" OR host="DC-PT-*")
earliest=-4h latest=now()
by index, sourcetype, host
| join type=left host
[| `notable`
| where _time >= relative_time(now(), "-4h")
| search (dest="WS-PT-*" OR dest="DC-PT-*")
| stats count AS detection_count by dest
| rename dest AS host]
| eval detection_count = if(isnull(detection_count), 0, detection_count)
| where detection_count = 0
| stats values(sourcetype) AS available_telemetry by host
| eval gap_note = "Raw telemetry exists but no detections fired — detection rule gap"
Step 7: Generate the Gap Analysis Report
The final step produces the actionable output — a gap analysis that maps missed detections to specific remediation actions:
# Generate all reports from exercise results
python3 run_exercise.py 2>&1 | tee exercise_output.log
# Open the ATT&CK Navigator layer in your browser
# Upload the generated JSON to https://mitre-attack.github.io/attack-navigator/
open reports/PT-2025-Q4-APT29/navigator_layer.json
# Review the executive summary
cat reports/PT-2025-Q4-APT29/executive_summary.md
Pro Tip: Integrate the exercise runner into your CI/CD pipeline. Every time a new detection rule is deployed, automatically re-run the purple team exercise for the relevant techniques. This creates a continuous validation loop: new rule deploys → automated test confirms it catches the attack → coverage dashboard updates in real time.
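The mapping step of that validation loop can be sketched in a few lines. The rule-name-to-technique schema below is an assumption about how `configs/detection_map.yaml` is structured; adapt it to your own format:

```python
"""
retest_on_deploy.py - Sketch of a CI/CD hook that re-runs the purple team
exercise for only the techniques a newly deployed detection rule covers.
The detection-map schema (rule name -> technique_ids) is assumed.
"""


def techniques_for_rule(detection_map: dict, rule_name: str) -> list:
    """Look up which ATT&CK techniques a detection rule claims to cover."""
    return detection_map.get(rule_name, {}).get("technique_ids", [])


def build_targeted_scope(detection_map: dict, deployed_rules: list) -> list:
    """Collect a de-duplicated technique list for a targeted re-test."""
    scope: list = []
    for rule in deployed_rules:
        for tid in techniques_for_rule(detection_map, rule):
            if tid not in scope:
                scope.append(tid)
    return scope


# In the pipeline: load detection_map.yaml, diff it against the previous
# deploy to find changed rules, then hand build_targeted_scope(...) to the
# orchestrator as the exercise's technique list.
```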
Detection Coverage Gap Analysis — The Cymantis View
Detection coverage measurement is where most purple team programs fail. They test techniques, record pass/fail, and generate a spreadsheet. That's not gap analysis — it's a checklist. True gap analysis requires understanding why a detection failed, what would fix it, and how to prioritize remediation across dozens of gaps.
At Cymantis, we structure detection gap analysis across three dimensions: data availability, detection logic, and response integration. A gap in any dimension means the defense is incomplete.
Dimension 1: Data Availability
Before you can detect an attack, the relevant telemetry must exist in your SIEM. The most common reason for missed detections isn't a bad rule — it's a missing data source.
| `purple_team_techniques`
| mvexpand required_data_sources
| rename required_data_sources AS sourcetype
| join type=left sourcetype
[| tstats count WHERE index=* earliest=-24h latest=now()
by sourcetype
| eval data_present = "YES"]
| eval data_gap = if(isnull(data_present), "YES", "NO")
| where data_gap="YES"
| stats values(sourcetype) AS missing_data_sources by technique_id, technique_name
| sort technique_id
This query identifies ATT&CK techniques where the required data sources are not present in your SIEM. Every technique flagged here is a guaranteed detection blind spot.
Dimension 2: Detection Logic
For techniques where telemetry exists but detections didn't fire, the issue is in the detection rule itself — incorrect logic, overly narrow filters, or tuning that suppressed the alert.
| `notable`
| where _time >= relative_time(now(), "-30d")
| eval technique = mvindex(split(mitre_technique_id, ","), 0)
| stats
count AS total_fires
dc(dest) AS unique_targets
avg(urgency) AS avg_urgency
by technique, rule_name
| append
[| inputlookup attack_techniques.csv
| rename technique_id AS technique
| eval total_fires = 0, unique_targets = 0, avg_urgency = 0,
rule_name = "NO RULE MAPPED"]
| dedup technique
| where total_fires = 0 AND rule_name = "NO RULE MAPPED"
| table technique, rule_name, total_fires
| eval gap_type = "detection_logic_gap"
| sort technique
Dimension 3: Response Integration
A detection that fires but doesn't trigger the appropriate response action is a gap. The alert exists, but the SOAR playbook didn't activate, the notification didn't route, or the containment action didn't execute.
| `notable`
| where _time >= relative_time(now(), "-7d")
| search status!="closed"
| join type=left rule_name
[| rest /services/configs/conf-workflow_actions
| stats count AS playbook_count by title
| rename title AS rule_name]
| eval playbook_count = if(isnull(playbook_count), 0, playbook_count)
| where playbook_count = 0
| stats count AS unlinked_alerts by rule_name, severity
| sort - severity, - count
| eval gap_type = "response_integration_gap"
Coverage Dashboard
Combine all three dimensions into a unified coverage score:
| inputlookup purple_team_results.csv
| eval coverage_score = case(
detection_status="detected" AND response_status="activated", 100,
detection_status="detected" AND response_status!="activated", 66,
detection_status="partial", 33,
detection_status="missed", 0,
1=1, 0)
| stats
avg(coverage_score) AS overall_coverage
count(eval(coverage_score=100)) AS fully_covered
count(eval(coverage_score=66)) AS detect_only
count(eval(coverage_score=33)) AS partial
count(eval(coverage_score=0)) AS blind_spots
count AS total_tested
| eval overall_pct = round(overall_coverage, 1)
| eval summary = "Overall Coverage: " . overall_pct . "% | Fully Covered: " .
fully_covered . " | Detect Only: " . detect_only . " | Partial: " .
partial . " | Blind Spots: " . blind_spots
Cymantis Recommendations
Based on our work with purple team programs across enterprise, federal, and critical infrastructure environments, here are the patterns that distinguish mature programs from checkbox exercises:
1. Measure Coverage Continuously, Not Periodically
Detection coverage degrades over time. Data sources change, detection rules drift, SIEM configurations get modified, and new techniques emerge. If you only measure coverage during quarterly exercises, you're flying blind between measurements.
Action: Schedule automated purple team exercises weekly. Even a reduced-scope exercise targeting your top 20 critical techniques provides more value than a comprehensive exercise once a quarter.
2. Track Coverage Trends, Not Point-in-Time Scores
A coverage score of 72% is meaningless without context. Is it improving or declining? Which tactics are getting better and which are getting worse? Trend data tells the story.
Action: Store every exercise result in a time-series index. Build a Splunk dashboard that shows coverage percentage over time, broken down by ATT&CK tactic. Present this to leadership monthly.
3. Prioritize Gaps by Business Impact, Not Technique Count
Not all detection gaps are equal. A gap in detecting Kerberoasting (T1558.003) against your domain controllers is categorically more important than a gap in detecting System Information Discovery (T1082) on a development workstation.
Action: Weight gaps by the criticality of the assets they affect. A missed detection on a crown jewel system should be prioritized above five missed detections on low-value endpoints.
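One way to implement that weighting, with illustrative asset tiers and weight values that you would tune to your own crown-jewel inventory:

```python
"""
gap_priority.py - Sketch of business-impact weighting for detection gaps.
Tier names and weights are illustrative.
"""
ASSET_WEIGHTS = {
    "crown_jewel": 10.0,   # domain controllers, core financial systems
    "production": 5.0,
    "corporate": 2.0,
    "development": 1.0,
}


def gap_priority(technique_severity: int, asset_tier: str) -> float:
    """Weight a gap's severity (1-10) by the affected asset's criticality."""
    return technique_severity * ASSET_WEIGHTS.get(asset_tier, 1.0)


def rank_gaps(gaps: list) -> list:
    """Sort gaps so the highest business-impact items come first."""
    return sorted(
        gaps,
        key=lambda g: gap_priority(g["severity"], g["asset_tier"]),
        reverse=True,
    )
```

Under this scheme a severity-7 Kerberoasting gap on a domain controller (score 70) outranks a severity-8 discovery gap on a dev box (score 8), which matches the prioritization argument above.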
4. Close the Loop — Re-Test After Remediation
The gap analysis is only valuable if it drives remediation. And remediation is only confirmed if you re-test. Too many organizations identify gaps, write new rules, and never validate that the rules actually work.
Action: For every detection gap, create a remediation ticket with a re-test requirement. The ticket isn't closed until the automated purple team exercise confirms the new detection fires for the technique that was previously missed.
5. Build a Detection Engineering Feedback Loop
Every purple team exercise should feed directly into your detection engineering pipeline. Missed techniques become detection development backlog items. Slow detections become performance optimization tasks. False positives become tuning requirements.
Action: Integrate the gap analysis output with your ticketing system (Jira, ServiceNow). Auto-create tickets for each finding with the ATT&CK technique, the raw telemetry analysis, and a suggested detection approach.
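A minimal sketch of the Jira half of that integration, built around Jira's standard create-issue endpoint (POST /rest/api/2/issue). The DETENG project key, Task issue type, and field layout are assumptions; adjust them for your Jira instance and any required custom fields:

```python
"""
gap_tickets.py - Sketch of turning gap findings into Jira issue payloads.
Project key, issue type, and field layout are assumptions.
"""


def build_gap_ticket(finding: dict, project_key: str = "DETENG") -> dict:
    """Translate one gap finding into a Jira create-issue payload."""
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Task"},
            "summary": f"[Detection Gap] {finding['technique_id']}: {finding['technique_name']}",
            "description": (
                f"ATT&CK technique: {finding['technique_id']}\n"
                f"Gap type: {finding['gap_type']}\n"
                f"Suggested approach: {finding.get('suggested_detection', 'TBD')}\n\n"
                "Re-test required: this ticket closes only after the automated "
                "purple team exercise confirms the new detection fires."
            ),
            "labels": ["purple-team", "detection-gap", finding["technique_id"]],
        }
    }


# To file the ticket, POST the payload with your HTTP client, e.g.:
#   requests.post(f"{jira_url}/rest/api/2/issue",
#                 json=build_gap_ticket(finding),
#                 auth=(user, api_token), timeout=30)
```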
Pro Tip: The single most impactful metric for a purple team program isn't detection coverage percentage — it's mean time to close a detection gap. If you can identify a gap on Monday and validate the fix by Wednesday, you have a mature program. If gap remediation takes months, coverage percentage is an illusion.
Governance for Autonomous Red Teaming
Running autonomous adversary simulation against your environment requires robust governance. Without it, you risk service disruption, scope creep, legal liability, and the erosion of trust between security teams and the business. This section covers the governance framework that makes autonomous purple teaming production-safe.
The Governance Stack
Governance for autonomous red teaming operates at four layers:
- Legal and Authorization — Written authorization from executive leadership, scope agreements, and compliance with applicable laws and regulations.
- Operational Boundaries — Technical controls that constrain what the Red Agent can do: scope restrictions, blocked techniques, time windows, and kill switches.
- Human Oversight — Approval gates for high-impact actions, real-time monitoring by authorized operators, and manual override capabilities.
- Audit and Accountability — Comprehensive logging of every action, decision chain records, and post-exercise review processes.
Governance Policy Configuration
Define your governance policies as code — machine-readable, version-controlled, and enforced by the orchestration layer:
# governance_policy.yaml — Purple team exercise governance controls
authorization:
document_ref: "SEC-AUTH-2025-047"
approved_by: "CISO — Jane Smith"
approval_date: "2025-11-15"
valid_until: "2026-03-15"
scope_summary: >
Authorized autonomous adversary simulation against designated
purple team lab environment (10.10.50.0/24) using MITRE ATT&CK
techniques up to "high" impact level. Destructive techniques
explicitly excluded.
legal_review: "Completed — Legal ref: LGL-2025-1209"
insurance_notification: "Cyber insurance carrier notified — ref: INS-2025-889"
operational_boundaries:
# Time restrictions
allowed_hours:
timezone: "America/New_York"
weekday_start: "08:00"
weekday_end: "18:00"
weekend_allowed: false
holiday_blackout: true
# Scope restrictions
network_scope:
allowed_subnets:
- "10.10.50.0/24"
blocked_subnets:
- "10.10.0.0/16" # Production network
- "10.20.0.0/16" # Corporate network
- "172.16.0.0/12" # Management network
dns_restrictions:
allowed_domains:
- "*.purpleteam.local"
blocked_domains:
- "*.prod.internal"
- "*.corp.internal"
# Technique restrictions
technique_policy:
max_impact_level: "high"
blocked_categories:
- "impact" # No destructive techniques
- "resource-hijacking"
blocked_techniques:
- "T1485" # Data Destruction
- "T1486" # Data Encrypted for Impact
- "T1489" # Service Stop
- "T1490" # Inhibit System Recovery
- "T1496" # Resource Hijacking
- "T1529" # System Shutdown/Reboot
- "T1531" # Account Access Removal
- "T1561" # Disk Wipe
require_approval:
- technique_id: "T1003.*"
reason: "Credential dumping — high risk of credential exposure"
approver: "purple_team_lead"
- technique_id: "T1021.*"
reason: "Lateral movement — risk of scope expansion"
approver: "purple_team_lead"
- technique_id: "T1078.*"
reason: "Valid accounts — risk of legitimate credential misuse"
approver: "soc_manager"
# Resource limits
resource_limits:
max_concurrent_operations: 3
max_compromised_hosts: 5
max_harvested_credentials: 20
max_exfiltrated_data_mb: 100
max_exercise_duration_minutes: 120
human_oversight:
# Real-time monitoring requirements
monitoring:
required_observers: 1
observer_roles:
- "purple_team_lead"
- "soc_analyst_senior"
monitoring_dashboard: "https://splunk-es.internal/app/purple_team/live"
notification_channels:
- type: "slack"
channel: "#purple-team-ops"
events: ["exercise_start", "exercise_end", "approval_request", "safety_trigger"]
- type: "pagerduty"
service: "purple-team-lead"
events: ["safety_trigger", "kill_switch", "scope_violation"]
# Approval workflows
approval_gates:
- name: "exercise_start"
approver: "purple_team_lead"
timeout_minutes: 30
auto_deny_on_timeout: true
- name: "high_impact_technique"
approver: "purple_team_lead"
timeout_minutes: 10
auto_deny_on_timeout: true
- name: "scope_extension"
approver: "soc_manager"
timeout_minutes: 60
auto_deny_on_timeout: true
# Kill switch configuration
kill_switch:
enabled: true
authorized_operators:
- "purple_team_lead"
- "soc_manager"
- "ciso"
auto_triggers:
- condition: "technique_outside_scope"
action: "halt_exercise"
notification: "immediate"
- condition: "production_network_traffic"
action: "halt_exercise"
notification: "immediate"
- condition: "max_duration_exceeded"
action: "graceful_shutdown"
notification: "standard"
- condition: "unscoped_host_compromised"
action: "halt_exercise"
notification: "immediate"
manual_trigger:
method: "slack_command"
command: "/purple-team kill"
confirmation_required: true
audit_and_accountability:
# Logging requirements
logging:
log_destination: "index=purple_team_audit"
log_level: "debug"
fields_required:
- "timestamp"
- "exercise_id"
- "agent_type" # red or blue
- "action"
- "technique_id"
- "target_host"
- "result"
- "operator_id"
- "approval_status"
retention_days: 365
tamper_protection: true
# Post-exercise review
post_exercise:
review_required: true
review_deadline_hours: 48
review_participants:
- "purple_team_lead"
- "soc_manager"
- "detection_engineering_lead"
deliverables:
- "executive_summary"
- "detailed_findings"
- "navigator_layer"
- "remediation_plan"
- "lessons_learned"
# Incident handling (if something goes wrong)
incident_response:
escalation_path:
- level: 1
contact: "purple_team_lead"
method: "slack"
- level: 2
contact: "soc_manager"
method: "pagerduty"
- level: 3
contact: "ciso"
method: "phone"
rollback_procedures:
- "Terminate all Caldera operations via API"
- "Kill all Sandcat agent processes on target hosts"
- "Revoke all harvested credentials"
- "Remove all persistence mechanisms"
- "Restore target hosts from snapshot"
documentation_requirements:
- "Timeline of events"
- "Root cause analysis"
- "Impact assessment"
- "Corrective actions"
Key Governance Principles
1. Authorization Before Automation. Never run autonomous adversary simulation without explicit written authorization from an executive with the authority to approve offensive security testing. This isn't optional — it's a legal requirement in most jurisdictions.
2. Scope as Code. Define scope in machine-readable configuration that the orchestration layer enforces automatically. Human-readable scope documents are necessary for authorization, but they don't prevent scope violations. Code does.
3. Kill Switch Always On. The kill switch is the most important safety control. It must be always enabled, instantly accessible, and tested regularly. A kill switch that doesn't work when you need it is worse than no kill switch at all.
4. Audit Everything. Every technique execution, every approval decision, every safety trigger, and every state change must be logged to an immutable audit trail. This protects the purple team, satisfies compliance requirements, and provides the evidence base for post-exercise analysis.
5. Blast Radius Containment. Design your lab environment with network segmentation that physically prevents the Red Agent from reaching production systems. Software controls fail; network controls are harder to bypass. Use dedicated VLANs, firewall rules, and — if possible — air-gapped environments for high-impact technique testing.
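Principle 2 translates directly into code. A minimal scope-enforcement sketch using the subnets from the governance policy above, with default-deny for anything unlisted; the orchestrator would run this check on every target before any technique executes:

```python
"""
scope_enforcer.py - Minimal sketch of scope-as-code enforcement.
Subnets mirror governance_policy.yaml; the allow list takes precedence
over the broader blocked ranges that contain it.
"""
import ipaddress

ALLOWED_SUBNETS = [ipaddress.ip_network("10.10.50.0/24")]
BLOCKED_SUBNETS = [
    ipaddress.ip_network("10.10.0.0/16"),   # production
    ipaddress.ip_network("10.20.0.0/16"),   # corporate
    ipaddress.ip_network("172.16.0.0/12"),  # management
]


def target_in_scope(ip: str) -> bool:
    """Allow a target only if it sits inside an allowed subnet."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in ALLOWED_SUBNETS):
        return True
    if any(addr in net for net in BLOCKED_SUBNETS):
        return False
    return False  # default-deny: anything unlisted is out of scope
```

Note the ordering: the lab subnet 10.10.50.0/24 sits inside the blocked 10.10.0.0/16 production range, so the allow check must run first.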
Pro Tip: Run a governance tabletop exercise before your first autonomous purple team engagement. Walk through the scenario: "The Red Agent executes a lateral movement technique that accidentally reaches a production-adjacent system. The kill switch fires. Now what?" If your team can't answer that question confidently, you're not ready for autonomous operations.
Putting It All Together: The Continuous Purple Team Operating Model
The architecture, tooling, and governance described in this post come together in a continuous operating model that fundamentally changes how organizations validate their defenses.
The Weekly Cycle
graph TD
Monday["Monday<br/>Automated exercise: Top 20 critical techniques<br/>[Fully autonomous]"]
Tuesday["Tuesday<br/>Blue agent validates detections, generates gap report<br/>[Fully autonomous]"]
Wednesday["Wednesday<br/>Detection engineering reviews gaps, prioritizes fixes<br/>[Human-led]"]
Thursday["Thursday<br/>New/updated detection rules deployed<br/>[Human-led]"]
Friday["Friday<br/>Automated re-test of remediated techniques<br/>[Fully autonomous]"]
Monday --> Tuesday
Tuesday --> Wednesday
Wednesday --> Thursday
Thursday --> Friday
The Monthly Cycle
graph TD
Week1["Week 1<br/>Full-scope exercise against primary threat profile"]
Week2["Week 2<br/>Gap remediation sprint"]
Week3["Week 3<br/>Full-scope exercise against secondary threat profile"]
Week4["Week 4<br/>Trend analysis, executive reporting, coverage review"]
Week1 --> Week2
Week2 --> Week3
Week3 --> Week4
Key Performance Indicators
Track these KPIs to measure purple team program maturity:
- Detection Coverage Percentage — Techniques detected / techniques tested. Target: >80%.
- Mean Detection Time (MDT) — Average time from technique execution to alert. Target: <60 seconds.
- Mean Time to Close Gap (MTTCG) — Average time from gap identification to validated fix. Target: <5 business days.
- Coverage Trend — Month-over-month change in detection coverage percentage. Target: Positive trend.
- Technique Test Frequency — Average number of times each critical technique is tested per month. Target: ≥4.
- False Positive Rate — Percentage of detection fires that are false positives during exercises. Target: <5%.
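These KPIs are straightforward to compute from stored exercise results. A sketch, assuming result records carry `detected`, `detection_time_s`, and `days_to_close` fields (hypothetical names; map them to your own schema):

```python
"""
kpi_report.py - Sketch of computing the program KPIs from exercise
results. Record field names are assumptions.
"""


def detection_coverage_pct(results: list) -> float:
    """Techniques detected / techniques tested, as a percentage."""
    if not results:
        return 0.0
    detected = sum(1 for r in results if r["detected"])
    return round(100.0 * detected / len(results), 1)


def mean_detection_time_s(results: list) -> float:
    """Average seconds from execution to alert, over detected techniques."""
    times = [r["detection_time_s"] for r in results if r["detected"]]
    return round(sum(times) / len(times), 1) if times else 0.0


def mean_time_to_close_gap_days(gaps: list) -> float:
    """Average business days from gap identification to validated fix."""
    closed = [g["days_to_close"] for g in gaps if g.get("days_to_close") is not None]
    return round(sum(closed) / len(closed), 1) if closed else 0.0
```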
Final Thoughts
The purple team model was always the right idea — offense and defense working together to find gaps before adversaries do. The limitation was never conceptual; it was operational. Human teams can't execute at the speed, scale, and frequency that continuous security validation demands.
Agentic AI removes that bottleneck. An autonomous Red Agent that plans and executes MITRE ATT&CK-aligned attack chains. A Blue Agent that validates every detection in your SIEM pipeline. An Orchestration Layer that enforces safety boundaries and coordinates the exercise lifecycle. A Reporting Engine that produces real-time gap analysis with actionable remediation guidance.
The result is a continuous offense-defense feedback loop that operates at machine speed with human judgment at the decision points that matter: defining threat profiles, setting governance boundaries, prioritizing remediation, and making strategic investment decisions.
Organizations that adopt agentic adversary simulation gain something that periodic testing can never provide: real-time visibility into their defensive posture. Not a snapshot from last quarter's engagement. Not a spreadsheet that's outdated before the ink dries. A live, continuously validated map of what you can detect, what you can't, and exactly what to fix next.
The adversaries are already using AI. Your purple team should be too.
Cymantis Labs helps security teams design, deploy, and govern autonomous purple team programs — from initial architecture through continuous operations. We bring the adversary simulation expertise, detection engineering depth, and governance rigor to make autonomous security validation production-safe and operationally sustainable.
Resources & References
MITRE Frameworks & Tools
- MITRE ATT&CK Framework: https://attack.mitre.org/ — Adversary tactics, techniques, and procedures knowledge base
- MITRE ATT&CK Navigator: https://mitre-attack.github.io/attack-navigator/ — Web-based tool for visualizing ATT&CK coverage and gaps
- MITRE Caldera: https://caldera.mitre.org/ — Open-source adversary emulation platform
- MITRE D3FEND: https://d3fend.mitre.org/ — Defensive technique knowledge graph
- MITRE ATLAS: https://atlas.mitre.org/ — Adversarial threat landscape for AI systems
Adversary Emulation & Red Team
- Atomic Red Team: https://github.com/redcanaryco/atomic-red-team — Library of atomic tests mapped to ATT&CK techniques
- MITRE Caldera GitHub: https://github.com/mitre/caldera — Source code and documentation for Caldera
- Infection Monkey (Akamai): https://github.com/guardicore/monkey — Open-source breach and attack simulation tool
- Prelude Operator: https://www.prelude.org/ — Autonomous red teaming platform
Detection Engineering & SIEM
- Splunk Enterprise Security: https://docs.splunk.com/Documentation/ES/latest — Splunk ES documentation
- Splunk ESCU (Security Content Updates): https://research.splunk.com — Community detection rules mapped to ATT&CK
- Sigma Rules: https://github.com/SigmaHQ/sigma — Generic signature format for SIEM rules
- Elastic Detection Rules: https://github.com/elastic/detection-rules — Open detection rules for Elastic Security
AI & LLM Frameworks
- OpenAI Function Calling: https://platform.openai.com/docs/guides/function-calling — Building tool-calling AI agents
- LangChain Agent Framework: https://python.langchain.com/docs/modules/agents/ — Open-source agent orchestration
- Microsoft Copilot for Security: https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-copilot-security — Enterprise AI security platform
Governance & Compliance
- NIST AI Risk Management Framework: https://www.nist.gov/artificial-intelligence/ai-risk-management-framework — Federal guidance on AI governance
- NIST SP 800-53 Rev. 5: https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final — Security and privacy controls
- PTES (Penetration Testing Execution Standard): http://www.pentest-standard.org/ — Penetration testing methodology and standards
- CREST — Penetration Testing Guide: https://www.crest-approved.org/ — Professional standards for penetration testing
Research & Industry Reports
- Microsoft AI Red Team: https://www.microsoft.com/en-us/security/blog/ai-red-team/ — Lessons learned from red teaming AI systems at scale
- SANS Purple Team Survey: https://www.sans.org/white-papers/ — Annual survey of purple team operations and tooling
- IBM Cost of a Data Breach Report: https://www.ibm.com/reports/data-breach — Detection time metrics and breach cost analysis
- Mandiant M-Trends Report: https://www.mandiant.com/m-trends — Annual threat landscape and detection gap analysis
For more insights or to schedule a Cymantis Purple Team Architecture Assessment, contact our research and adversary simulation team at cymantis.com.
