Purple Teaming with AI: Autonomous Adversary Simulation Using MITRE ATT&CK
How to build AI-powered purple team exercises using autonomous adversary simulation, MITRE ATT&CK-aligned testing, and agentic red/blue team coordination for continuous security validation.
By Cymantis Labs
Your red team runs an engagement once a quarter — maybe twice if budget allows. It takes two weeks to scope, three weeks to execute, and another two weeks to write the report. By the time the findings land on the CISO's desk, the environment has changed. New cloud workloads spun up. A developer pushed a misconfigured Kubernetes manifest. Three new SaaS integrations went live. The attack surface your red team tested no longer exists.
Meanwhile, your adversaries aren't operating on quarterly cycles. APT groups are deploying generative AI to automate reconnaissance, craft phishing campaigns, and generate polymorphic payloads at machine speed. Ransomware operators iterate weekly. Initial access brokers refresh their inventory daily. The offense-defense asymmetry has never been wider.
Traditional purple teaming — where red and blue teams collaborate in periodic, structured exercises — was a genuine improvement over siloed operations. But it still operates on human timelines. Scoping meetings. Scheduling conflicts. Manual attack execution. Manual detection validation. Manual report generation. The cycle time from "identify a gap" to "validate the fix" is measured in weeks or months.
Agentic purple teaming compresses that cycle into hours. An autonomous red agent executes MITRE ATT&CK-aligned attack chains against your production environment. An autonomous blue agent validates whether your detections fire, your alerts route correctly, and your response playbooks activate. An orchestration layer coordinates the exercise, enforces safety boundaries, and produces a real-time gap analysis that tells you exactly where your defenses fail — and exactly what to fix.
This isn't theoretical. The building blocks exist today: MITRE Caldera for adversary emulation, LLM-driven decision engines for adaptive attack planning, Splunk and SIEM APIs for detection validation, and ATT&CK Navigator for coverage visualization. This post shows you how to assemble them into a production-grade autonomous purple team system.
Red, Blue, Purple — And Why AI Changes Everything
Before diving into architecture, let's ground the terminology and establish why AI fundamentally changes the purple team operating model.
The Traditional Model
Red Team — Offensive operators who simulate real-world adversaries. They attempt to achieve objectives (data exfiltration, domain compromise, ransomware deployment) using the same tactics, techniques, and procedures (TTPs) as threat actors. Red teams test whether an organization can be compromised.
Blue Team — Defensive operators who detect, respond to, and mitigate threats. They operate the SOC, maintain detection rules, run incident response, and harden the environment. Blue teams test whether an organization can detect and stop an attack.
Purple Team — A collaborative function that bridges red and blue. Purple team exercises are structured engagements where offensive actions are executed in coordination with defensive validation. The goal isn't to "win" — it's to identify detection gaps, validate security controls, and improve defensive coverage systematically.
What's Broken
The traditional purple team model has three structural problems:
- Periodicity. Exercises happen quarterly or annually. The gap between exercises is a gap in validation. Controls degrade, configurations drift, new attack techniques emerge, and detection coverage erodes silently between cycles.
- Scale. A human red team can realistically execute 20–40 ATT&CK techniques per engagement. MITRE ATT&CK Enterprise contains over 200 techniques and 400+ sub-techniques. No human team has the bandwidth to test comprehensive coverage in a single exercise.
- Speed. From technique execution to detection validation to gap analysis to remediation to re-test, the feedback loop takes weeks. In an environment where new code deploys daily and infrastructure changes hourly, that latency is unacceptable.
The AI-Augmented Model
Agentic AI addresses all three problems:
- Continuous execution. AI agents don't need scheduling. They can execute adversary simulation continuously — daily, hourly, or on-demand triggered by infrastructure changes.
- Comprehensive coverage. An autonomous red agent can systematically walk the ATT&CK matrix, testing hundreds of techniques across your environment in a single exercise.
- Rapid feedback. Detection validation happens in real time. The moment an attack technique executes, the blue agent queries your SIEM to confirm whether the detection fired. Gap analysis is available in minutes, not weeks.
The key insight: AI doesn't replace human red and blue teamers. It amplifies them. Human operators design the adversary profiles, define the safety boundaries, interpret the results, and make strategic decisions about remediation priorities. AI agents handle the repetitive, high-volume execution and validation work that humans can't scale.
Pro Tip: Start by identifying the 80% of your purple team workload that is repetitive execution and validation — running the same Atomic Red Team tests, checking the same Splunk searches, updating the same coverage spreadsheets. That's your automation target. Reserve human operators for creative adversary simulation, novel attack chain development, and strategic analysis.
The Agentic Purple Team Architecture
A production agentic purple team system has four core components: the Red Agent, the Blue Agent, the Orchestration Layer, and the Reporting Engine. Each component is autonomous within its domain but coordinated through the orchestration layer.
graph TD
OrchestrationLayer["ORCHESTRATION LAYER<br/>Exercise State | Safety Controls | ATT&CK Mapping Engine"]
RedAgent["RED AGENT<br/>LLM Core<br/>ATT&CK Planner<br/>Executor"]
BlueAgent["BLUE AGENT<br/>Detection Validator<br/>SIEM Query<br/>Response Checker"]
ReportingEngine["REPORTING ENGINE<br/>Gap Analysis<br/>ATT&CK Navigator"]
TargetEnvironment["Target Environment"]
SIEMPlatform["SIEM/SOAR Platform"]
ResultsDatabase["Results Database"]
OrchestrationLayer --> RedAgent
OrchestrationLayer --> BlueAgent
OrchestrationLayer --> ReportingEngine
RedAgent <--> BlueAgent
RedAgent --> TargetEnvironment
BlueAgent --> SIEMPlatform
ReportingEngine --> ResultsDatabase
Red Agent: Autonomous Adversary Simulation
The Red Agent is an LLM-driven decision engine that plans and executes multi-phase attack chains. Unlike static adversary emulation tools that replay predefined scripts, the Red Agent makes dynamic decisions based on the target environment's responses — adapting its approach just as a human attacker would.
Core Architecture
The Red Agent operates in a continuous loop: observe → plan → execute → evaluate → adapt. The LLM serves as the planning brain, interpreting environmental feedback and selecting the next technique based on the current attack state, available tools, and target profile.
"""
red_agent.py — Autonomous adversary simulation agent for purple team exercises.
Plans and executes multi-phase ATT&CK-aligned attack chains using LLM-driven
decision-making with tool access to adversary emulation frameworks.
"""
import json
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from openai import OpenAI
logger = logging.getLogger("red_agent")
# AttackPhase(Enum): RECONNAISSANCE → INITIAL_ACCESS → EXECUTION →
# PERSISTENCE → PRIVILEGE_ESCALATION → DEFENSE_EVASION → CREDENTIAL_ACCESS →
# DISCOVERY → LATERAL_MOVEMENT → COLLECTION → EXFILTRATION → IMPACT
# TechniqueResult: technique_id, name, phase, success, output, timestamp,
# duration_ms, target_host, artifacts, detection_expected
# AttackState: exercise_id, current_phase, start_time, compromised_hosts,
# harvested_credentials, discovered_services, executed/failed_techniques
class RedAgent:
"""
Autonomous red team agent that plans and executes ATT&CK-aligned
attack chains using LLM reasoning and adversary emulation tools.
"""
# SYSTEM_PROMPT — Rules of engagement + ATT&CK kill-chain planning strategy:
# Only execute within approved scope; stop on kill switch; prefer stealth;
# adapt based on discovered defenses; provide ATT&CK IDs + reasoning
#
# TOOL_DEFINITIONS — Four LLM function-calling tools:
# execute_technique — run ATT&CK technique via emulation framework
# query_environment — recon: network scan, service/AD/share enum
# check_detection_status — verify if technique triggered SIEM alerts
# report_finding — log detection gaps and control failures
def __init__(self, config: dict):
self.client = OpenAI()
self.model = config.get("model", "gpt-4o")
self.exercise_config = config
self.state = AttackState(
exercise_id=config["exercise_id"],
current_phase=AttackPhase.RECONNAISSANCE
)
self.conversation_history = [
{"role": "system", "content": self.SYSTEM_PROMPT}
]
self.max_steps = config.get("max_steps", 50)
self.safety_controller = SafetyController(config.get("safety", {}))
def run_exercise(self) -> list[TechniqueResult]:
"""Execute a full autonomous attack chain exercise."""
results, step = [], 0
logger.info("Starting exercise: %s", self.state.exercise_id)
while step < self.max_steps:
if not self.safety_controller.is_safe_to_continue(self.state):
logger.warning("Safety halt at step %d", step)
break
try:
action = self.plan_next_action()
if action.get("exercise_complete"):
break
if action.get("technique_result"):
result = action["technique_result"]
results.append(result)
self._update_state(result)
time.sleep(
self.exercise_config.get("detection_delay_seconds", 30)
)
except Exception as e:
logger.error("Step %d error: %s", step, e)
if not self.exercise_config.get("continue_on_error", True):
break
step += 1
logger.info("Complete: %d techniques in %d steps", len(results), step)
return results
# ── Remaining methods ──────────────────────────────────────────────
# plan_next_action() — build state summary, query LLM with tool defs
# _build_state_summary() — serialize attack state for LLM context window
# _update_state(result) — update compromised hosts, creds, services
# _process_response(msg) — extract tool calls, safety-check, dispatch
# _dispatch_tool(name, args)— route to caldera/env/detection handlers
# _execute_via_caldera() — trigger ATT&CK technique via Caldera REST API
# _query_environment() — recon queries (network scan, AD/share enum)
# _check_detections() — query SIEM for alerts triggered by technique
# _report_finding() — record detection gap in results database
# _map_technique_to_phase() — ATT&CK technique ID prefix → AttackPhase
class SafetyController:
"""Enforces safety boundaries during autonomous red team operations."""
def __init__(self, config: dict):
self.max_compromised_hosts = config.get("max_compromised_hosts", 10)
self.blocked_techniques = config.get("blocked_techniques", [])
self.max_duration_minutes = config.get("max_duration_minutes", 120)
self.kill_switch_active = False
def is_safe_to_continue(self, state: AttackState) -> bool:
if self.kill_switch_active:
return False
if len(state.compromised_hosts) >= self.max_compromised_hosts:
return False
return (time.time() - state.start_time) / 60 < self.max_duration_minutes
def validate_technique(self, args: dict, scope: dict) -> bool:
if args.get("technique_id", "") in self.blocked_techniques:
return False
target = args.get("target_host", "")
return not scope.get("allowed_targets") or target in scope["allowed_targets"]
Pro Tip: The detection_delay_seconds parameter is critical. Your SIEM needs time to ingest, parse, and correlate events before the Blue Agent can validate detections. Set this to match your environment's actual detection pipeline latency — typically 15–60 seconds for well-tuned Splunk deployments, longer for cloud-native SIEMs with batched ingestion.
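Rather than a single fixed sleep, you can poll the SIEM with a timeout so fast pipelines validate in seconds while slow ones still get their full window. A minimal sketch (the `wait_for_detection` helper and its parameters are illustrative, not part of the agent code above):

```python
import time

def wait_for_detection(check_fn, timeout_s=120, poll_interval_s=5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll a SIEM check until it returns results or the timeout expires.

    check_fn returns matching alerts (truthy) or None/empty while nothing
    has fired yet. Returns (result, elapsed_seconds); result is falsy on
    timeout. clock and sleep are injectable so the loop is testable.
    """
    start = clock()
    while True:
        result = check_fn()
        elapsed = clock() - start
        if result or elapsed >= timeout_s:
            return result, elapsed
        sleep(min(poll_interval_s, timeout_s - elapsed))
```

With the Blue Agent's `detection_timeout` as `timeout_s`, a detection that fires in ten seconds is validated in ten seconds instead of waiting out the full fixed delay.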
Blue Agent: Detection and Response Validation
The Blue Agent is the defensive counterpart. Its job is to validate whether each technique executed by the Red Agent was detected by your security stack. It queries your SIEM, checks alert routing, validates response playbook activation, and records the results.
"""
blue_agent.py — Autonomous detection validation agent for purple team exercises.
Validates SIEM detections, alert routing, and response activation against
red team technique execution results.
"""
import json
import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import requests
logger = logging.getLogger("blue_agent")
class DetectionStatus(Enum):
DETECTED = "detected"
PARTIAL = "partial"
MISSED = "missed"
DELAYED = "delayed"
FALSE_POSITIVE = "false_positive"
class ResponseStatus(Enum):
ACTIVATED = "activated"
PARTIAL = "partial"
NOT_TRIGGERED = "not_triggered"
FAILED = "failed"
@dataclass
class DetectionResult:
technique_id: str
technique_name: str
detection_status: DetectionStatus
response_status: ResponseStatus
detection_time_ms: Optional[int] = None
detection_rule_name: Optional[str] = None
alert_severity: Optional[str] = None
siem_search_id: Optional[str] = None
response_playbook: Optional[str] = None
evidence: dict = field(default_factory=dict)
gap_details: Optional[str] = None
class BlueAgent:
"""
Autonomous blue team agent that validates detections and response
actions against red team technique execution results.
"""
def __init__(self, config: dict):
self.splunk_url = config["splunk_url"]
self.splunk_token = config["splunk_token"]
self.soar_url = config.get("soar_url")
self.soar_token = config.get("soar_token")
self.detection_timeout_seconds = config.get("detection_timeout", 120)
self.technique_detection_map = self._load_detection_map(
config.get("detection_map_path", "detection_map.yaml")
)
def validate_technique(self, technique_result) -> DetectionResult:
"""
Validate whether a red team technique was detected and
whether the appropriate response was triggered.
"""
technique_id = technique_result.technique_id
target_host = technique_result.target_host
execution_time = technique_result.timestamp
logger.info(
"Validating detection for %s on %s",
technique_id, target_host
)
# Step 1: Check SIEM for detection
detection = self._check_siem_detection(
technique_id, target_host, execution_time
)
# Step 2: Check response activation
response = self._check_response_activation(
technique_id, target_host, execution_time
)
# Step 3: Build detection result
result = DetectionResult(
technique_id=technique_id,
technique_name=technique_result.technique_name,
detection_status=detection["status"],
response_status=response["status"],
detection_time_ms=detection.get("time_ms"),
detection_rule_name=detection.get("rule_name"),
alert_severity=detection.get("severity"),
siem_search_id=detection.get("search_id"),
response_playbook=response.get("playbook_name"),
evidence={
"siem_results": detection.get("raw_results", []),
"response_results": response.get("raw_results", [])
}
)
# Step 4: Identify gaps
if result.detection_status == DetectionStatus.MISSED:
result.gap_details = self._analyze_detection_gap(
technique_id, target_host, execution_time
)
return result
# ── Remaining methods ──────────────────────────────────────────────
# _check_siem_detection() — query Splunk for detections matching technique
# _build_validation_query() — build Splunk SPL query for detection validation
# _execute_splunk_search() — execute search against Splunk REST API
# _check_response_activation() — check whether SOAR playbooks activated
# _analyze_detection_gap() — analyze why detection was missed, provide guidance
# _load_detection_map() — load technique-to-detection mapping from YAML
The Blue Agent uses a technique-to-detection mapping file that links ATT&CK technique IDs to specific SIEM detection rules. Here's the mapping format:
# detection_map.yaml — Maps ATT&CK techniques to Splunk detection rules
# Used by Blue Agent to validate detection coverage during purple team exercises
T1059.001: # PowerShell Command and Scripting Interpreter
- rule_name: "Suspicious PowerShell Execution"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Microsoft-Windows-PowerShell/Operational"
- "WinEventLog:Security"
- rule_name: "Encoded PowerShell Command Detected"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Microsoft-Windows-PowerShell/Operational"
T1003.001: # LSASS Memory Credential Dumping
- rule_name: "LSASS Memory Access Detected"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
T1021.002: # SMB/Windows Admin Shares
- rule_name: "Lateral Movement via Admin Shares"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Security"
T1053.005: # Scheduled Task
- rule_name: "Suspicious Scheduled Task Creation"
index: notable
expected_severity: medium
data_sources:
- "WinEventLog:Security"
- "WinEventLog:Microsoft-Windows-TaskScheduler/Operational"
T1070.001: # Indicator Removal — Clear Windows Event Logs
- rule_name: "Windows Event Log Cleared"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Security"
T1548.002: # Abuse Elevation Control — UAC Bypass
- rule_name: "UAC Bypass Attempt Detected"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
T1566.001: # Spearphishing Attachment
- rule_name: "Suspicious Email Attachment Execution"
index: notable
expected_severity: high
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
- "stream:smtp"
T1486: # Data Encrypted for Impact (Ransomware)
- rule_name: "Ransomware File Encryption Behavior"
index: notable
expected_severity: critical
data_sources:
- "WinEventLog:Microsoft-Windows-Sysmon/Operational"
Pro Tip: The most common reason for detection misses isn't a missing rule — it's a missing data source. Before blaming detection engineering, the Blue Agent first checks whether the raw telemetry exists in the SIEM. A "telemetry gap" finding is often more actionable than a "detection gap" finding because it points to a log forwarding or ingestion issue that affects multiple detections.
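That triage logic is easy to encode. A minimal sketch (the function name and categories are illustrative): given the count of raw telemetry events and the count of alerts in a technique's execution window, classify the finding:

```python
def classify_gap(raw_event_count: int, alert_count: int) -> str:
    """Distinguish telemetry gaps from detection gaps.

    If the raw telemetry for a technique never reached the SIEM, the fix
    is log forwarding/ingestion; if telemetry exists but no alert fired,
    the fix is detection engineering.
    """
    if alert_count > 0:
        return "detected"
    if raw_event_count == 0:
        return "telemetry_gap"   # fix log forwarding / ingestion
    return "detection_gap"       # fix detection rule coverage
```

A telemetry-gap finding routes to the infrastructure team; a detection-gap finding routes to detection engineering.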
Orchestration Layer
The Orchestration Layer is the command-and-control plane for the entire purple team exercise. It manages exercise state, coordinates red and blue agents, enforces safety controls, and feeds results to the reporting engine.
"""
orchestrator.py — Purple team exercise orchestration engine.
Coordinates red and blue agents, manages exercise lifecycle,
enforces safety boundaries, and aggregates results.
"""
import json
import logging
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

from red_agent import RedAgent
from blue_agent import BlueAgent
logger = logging.getLogger("orchestrator")
@dataclass
class ExerciseConfig:
exercise_id: str
name: str
threat_profile: str
scope: dict
techniques: list
safety: dict
red_agent_config: dict
blue_agent_config: dict
max_duration_minutes: int = 120
detection_delay_seconds: int = 30
continue_on_error: bool = True
@dataclass
class ExerciseResult:
exercise_id: str
name: str
threat_profile: str
start_time: float
end_time: Optional[float] = None
techniques_executed: int = 0
techniques_detected: int = 0
techniques_missed: int = 0
techniques_partial: int = 0
detection_coverage_pct: float = 0.0
mean_detection_time_ms: float = 0.0
findings: list = field(default_factory=list)
technique_results: list = field(default_factory=list)
gap_analysis: dict = field(default_factory=dict)
class PurpleTeamOrchestrator:
"""
Orchestrates autonomous purple team exercises by coordinating
red and blue agents through structured attack-detect-validate cycles.
"""
def __init__(self, config: ExerciseConfig):
self.config = config
self.red_agent = RedAgent(self._build_red_config())
self.blue_agent = BlueAgent(config.blue_agent_config)
self.result = ExerciseResult(
exercise_id=config.exercise_id,
name=config.name,
threat_profile=config.threat_profile,
start_time=time.time()
)
def run_exercise(self) -> ExerciseResult:
"""Execute a full purple team exercise."""
logger.info(
"=== Starting Purple Team Exercise: %s ===", self.config.name
)
logger.info("Threat profile: %s", self.config.threat_profile)
logger.info(
"Scope: %d target hosts, %d techniques",
len(self.config.scope.get("allowed_targets", [])),
len(self.config.techniques)
)
# Phase 1: Red agent executes attack chain
red_results = self.red_agent.run_exercise()
# Phase 2: Blue agent validates each technique
for red_result in red_results:
blue_result = self.blue_agent.validate_technique(red_result)
self.result.technique_results.append({
"red": red_result,
"blue": blue_result
})
self._update_counters(blue_result)
# Phase 3: Generate gap analysis
self.result.gap_analysis = self._generate_gap_analysis()
self.result.end_time = time.time()
# Calculate coverage metrics
total = self.result.techniques_executed
if total > 0:
self.result.detection_coverage_pct = (
self.result.techniques_detected / total
) * 100
logger.info("=== Exercise Complete ===")
logger.info(
"Coverage: %.1f%% (%d/%d detected)",
self.result.detection_coverage_pct,
self.result.techniques_detected,
total
)
return self.result
# ── Remaining methods ──────────────────────────────────────────────
# _update_counters() — update exercise counters based on detection results
# _generate_gap_analysis() — generate comprehensive gap analysis from results
# _generate_recommendations() — generate prioritized remediation recommendations
# _build_red_config() — build red agent configuration from exercise config
# _get_tactic_for_technique() — look up ATT&CK tactic for a technique ID
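As a sketch of what the elided `_generate_gap_analysis` might do (the input field names here are illustrative, not the orchestrator's actual schema), grouping missed techniques by ATT&CK tactic makes remediation assignable per detection-engineering owner:

```python
from collections import defaultdict

def summarize_gaps(technique_results: list[dict]) -> dict[str, list[str]]:
    """Group missed techniques by ATT&CK tactic for the gap analysis."""
    gaps: dict[str, list[str]] = defaultdict(list)
    for tr in technique_results:
        if tr["status"] == "missed":
            gaps[tr["tactic"]].append(tr["technique_id"])
    return dict(gaps)
```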
Reporting Engine
The Reporting Engine transforms raw exercise results into actionable intelligence: ATT&CK Navigator layers for coverage visualization, executive summaries for leadership, and detailed technical findings for engineering teams.
"""
reporter.py — Purple team exercise reporting engine.
Generates ATT&CK Navigator layers, gap analysis reports,
and remediation recommendations from exercise results.
"""
import json
import logging
from datetime import datetime

from blue_agent import DetectionStatus
logger = logging.getLogger("reporter")
class PurpleTeamReporter:
"""Generates reports and ATT&CK Navigator layers from exercise results."""
NAVIGATOR_TEMPLATE = {
"name": "",
"versions": {"attack": "14", "navigator": "4.9.1", "layer": "4.5"},
"domain": "enterprise-attack",
"description": "",
"filters": {"platforms": ["Windows", "Linux", "macOS"]},
"sorting": 0,
"layout": {
"layout": "side",
"aggregateFunction": "average",
"showID": True,
"showName": True
},
"hideDisabled": False,
"techniques": [],
"gradient": {
"colors": ["#ff6666", "#ffff66", "#66ff66"],
"minValue": 0,
"maxValue": 100
},
"legendItems": [
{"label": "Detected", "color": "#66ff66"},
{"label": "Partial Detection", "color": "#ffff66"},
{"label": "Missed / No Coverage", "color": "#ff6666"},
{"label": "Not Tested", "color": "#d3d3d3"}
]
}
def generate_navigator_layer(
self, exercise_result, output_path: str
) -> str:
"""
Generate an ATT&CK Navigator JSON layer showing
detection coverage from exercise results.
"""
layer = self.NAVIGATOR_TEMPLATE.copy()
layer["name"] = f"Purple Team: {exercise_result.name}"
layer["description"] = (
f"Detection coverage results from purple team exercise "
f"'{exercise_result.name}' executed on "
f"{datetime.fromtimestamp(exercise_result.start_time).isoformat()}"
)
techniques = []
for tr in exercise_result.technique_results:
blue = tr["blue"]
technique_entry = {
"techniqueID": blue.technique_id,
"tactic": "",
"enabled": True,
"showSubtechniques": True
}
if blue.detection_status == DetectionStatus.DETECTED:
technique_entry["color"] = "#66ff66"
technique_entry["score"] = 100
technique_entry["comment"] = (
f"Detected by: {blue.detection_rule_name} "
f"(latency: {blue.detection_time_ms}ms)"
)
elif blue.detection_status == DetectionStatus.PARTIAL:
technique_entry["color"] = "#ffff66"
technique_entry["score"] = 50
technique_entry["comment"] = "Partial detection — review needed"
elif blue.detection_status == DetectionStatus.MISSED:
technique_entry["color"] = "#ff6666"
technique_entry["score"] = 0
technique_entry["comment"] = (
f"MISSED: {blue.gap_details or 'No detection rule'}"
)
techniques.append(technique_entry)
layer["techniques"] = techniques
with open(output_path, "w") as f:
json.dump(layer, f, indent=2)
logger.info("Navigator layer written to %s", output_path)
return output_path
# ── Remaining methods ──────────────────────────────────────────────
# generate_executive_summary() — generate markdown executive summary with metrics
Pro Tip: Export your ATT&CK Navigator layers to your organization's internal wiki or security dashboard after every exercise. Over time, you build a historical record of coverage improvement that is invaluable for demonstrating security program maturity to auditors and leadership.
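A tiny helper for that historical export (the input shape is an assumption: a list of per-exercise summary dicts, not the `ExerciseResult` dataclass itself):

```python
def coverage_trend(exercises: list[dict]) -> list[dict]:
    """Summarize detection coverage per exercise for dashboard/wiki export."""
    return [
        {
            "exercise": e["name"],
            "coverage_pct": round(100 * e["detected"] / e["executed"], 1),
        }
        for e in exercises
        if e["executed"]  # skip aborted exercises with zero techniques run
    ]
```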
MITRE ATT&CK Integration
The MITRE ATT&CK framework is the backbone of autonomous purple teaming. Every technique the Red Agent executes, every detection the Blue Agent validates, and every gap the Reporting Engine identifies is anchored to a specific ATT&CK technique ID. This section covers how to deeply integrate ATT&CK into your purple team automation.
Mapping Adversary Simulation to ATT&CK Techniques
The first step is building a threat-profile-to-technique mapping. Different adversary groups use different subsets of ATT&CK. Your purple team exercises should test the techniques most relevant to your threat landscape.
| Threat Profile | Primary Tactics | Key Techniques | Common Tools |
|---|---|---|---|
| APT29 (Cozy Bear) | Initial Access, Execution, Defense Evasion | T1566.001, T1059.001, T1027, T1071.001 | Cobalt Strike, PowerShell |
| FIN7 (Financial) | Initial Access, Persistence, Collection | T1566.001, T1053.005, T1560.001, T1041 | Carbanak, PowerShell |
| Ransomware Operator | Initial Access, Lateral Movement, Impact | T1566.002, T1021.002, T1486, T1490 | RDP, PsExec, Mimikatz |
| Insider Threat | Collection, Exfiltration, Defense Evasion | T1560.001, T1041, T1070.004, T1567.002 | Native OS tools, cloud storage |
| Supply Chain | Initial Access, Execution, Persistence | T1195.002, T1059.001, T1053.005, T1543.003 | Malicious packages, scripts |
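In code, the table above reduces to a simple lookup (technique lists trimmed to the subset shown; pull the full sets from the MITRE ATT&CK Groups pages):

```python
# Illustrative subset of the profile-to-technique table above.
THREAT_PROFILES: dict[str, list[str]] = {
    "apt29_cozy_bear": ["T1566.001", "T1059.001", "T1027", "T1071.001"],
    "fin7": ["T1566.001", "T1053.005", "T1560.001", "T1041"],
    "ransomware_operator": ["T1566.002", "T1021.002", "T1486", "T1490"],
}

def techniques_for_profiles(profiles: list[str]) -> list[str]:
    """Union of techniques across profiles, de-duplicated, order preserved."""
    seen: set[str] = set()
    selected: list[str] = []
    for profile in profiles:
        for technique in THREAT_PROFILES.get(profile, []):
            if technique not in seen:
                seen.add(technique)
                selected.append(technique)
    return selected
```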
Technique Selection Strategy
Not all ATT&CK techniques are equally relevant to every organization. Use this prioritization framework to select techniques for each exercise:
- Threat Intelligence Driven. Start with techniques used by the threat groups most likely to target your industry and region. Pull technique lists from MITRE ATT&CK Groups pages and map them to your threat profile.
- Coverage Gap Driven. After your first exercise, prioritize techniques that were missed. Re-test after remediation to validate the fix.
- Crown Jewel Driven. Identify the attack paths to your most critical assets and test every technique along those paths. If your crown jewel is a database server, test every lateral movement and credential access technique that could reach it.
- Compliance Driven. Map regulatory requirements (PCI DSS, HIPAA, CMMC) to ATT&CK techniques and ensure those techniques are covered in your exercises.
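The four drivers combine naturally into an additive score for ranking which techniques make the next exercise. The weights below are illustrative, not a standard; tune them to your threat model:

```python
def priority_score(in_threat_intel: bool, previously_missed: bool,
                   on_crown_jewel_path: bool, compliance_required: bool) -> int:
    """Rank a technique by the four selection drivers; higher runs first."""
    return (
        (40 if in_threat_intel else 0)        # threat intelligence driven
        + (30 if previously_missed else 0)    # coverage gap driven
        + (20 if on_crown_jewel_path else 0)  # crown jewel driven
        + (10 if compliance_required else 0)  # compliance driven
    )
```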
Exercise Configuration with ATT&CK Profiles
Here's a complete exercise configuration that defines a threat profile, selects techniques, and sets safety boundaries:
# exercise_config.yaml — APT29 Simulation Exercise
# Autonomous purple team exercise targeting APT29 TTPs
exercise:
id: "PT-2025-Q4-APT29"
name: "APT29 Cozy Bear Simulation — Q4 2025"
description: >
Autonomous adversary simulation targeting APT29 TTPs
against corporate Windows domain environment.
threat_profile: "apt29_cozy_bear"
classification: "INTERNAL — AUTHORIZED EXERCISE"
scope:
allowed_targets:
- "10.10.50.0/24" # Purple team lab subnet
- "WS-PT-001" # Target workstation
- "WS-PT-002" # Target workstation
- "DC-PT-001" # Target domain controller
- "SRV-PT-FILE" # Target file server
excluded_targets:
- "10.10.50.1" # Gateway — do not touch
- "DC-PROD-*" # Production DCs — absolutely not
allowed_techniques_max_impact: "high" # No "critical" destructive techniques
network_boundaries:
- "No egress to internet"
- "No lateral to production VLANs"
techniques:
# Reconnaissance
- id: "T1595.002"
name: "Active Scanning — Vulnerability Scanning"
priority: medium
stealth: quiet
# Initial Access
- id: "T1566.001"
name: "Phishing — Spearphishing Attachment"
priority: high
stealth: moderate
parameters:
payload_type: "macro_document"
delivery: "simulated_email"
# Execution
- id: "T1059.001"
name: "Command and Scripting Interpreter — PowerShell"
priority: critical
stealth: moderate
parameters:
encoding: "base64"
bypass_method: "amsi_bypass"
- id: "T1059.003"
name: "Command and Scripting Interpreter — Windows Command Shell"
priority: high
stealth: quiet
# Persistence
- id: "T1053.005"
name: "Scheduled Task/Job — Scheduled Task"
priority: high
stealth: quiet
parameters:
task_name: "SystemHealthCheck"
trigger: "daily"
- id: "T1547.001"
name: "Boot or Logon Autostart Execution — Registry Run Keys"
priority: high
stealth: moderate
# Privilege Escalation
- id: "T1548.002"
name: "Abuse Elevation Control Mechanism — UAC Bypass"
priority: high
stealth: stealthy
# Defense Evasion
- id: "T1027"
name: "Obfuscated Files or Information"
priority: medium
stealth: stealthy
- id: "T1070.001"
name: "Indicator Removal — Clear Windows Event Logs"
priority: critical
stealth: loud
# Credential Access
- id: "T1003.001"
name: "OS Credential Dumping — LSASS Memory"
priority: critical
stealth: moderate
parameters:
method: "comsvcs_minidump"
- id: "T1558.003"
name: "Steal or Forge Kerberos Tickets — Kerberoasting"
priority: high
stealth: stealthy
# Discovery
- id: "T1082"
name: "System Information Discovery"
priority: low
stealth: quiet
- id: "T1087.002"
name: "Account Discovery — Domain Account"
priority: medium
stealth: quiet
# Lateral Movement
- id: "T1021.002"
name: "Remote Services — SMB/Windows Admin Shares"
priority: critical
stealth: moderate
- id: "T1021.001"
name: "Remote Services — Remote Desktop Protocol"
priority: high
stealth: loud
# Collection
- id: "T1560.001"
name: "Archive Collected Data — Archive via Utility"
priority: medium
stealth: quiet
# Command and Control
- id: "T1071.001"
name: "Application Layer Protocol — Web Protocols"
priority: high
stealth: stealthy
parameters:
protocol: "HTTPS"
domain: "cdn-update.purpleteam.local"
# Exfiltration
- id: "T1041"
name: "Exfiltration Over C2 Channel"
priority: critical
stealth: stealthy
safety:
max_compromised_hosts: 4
max_duration_minutes: 90
blocked_techniques:
- "T1485" # Data Destruction
- "T1486" # Data Encrypted for Impact
- "T1489" # Service Stop
- "T1490" # Inhibit System Recovery
- "T1529" # System Shutdown/Reboot
- "T1561" # Disk Wipe
kill_switch:
enabled: true
trigger_conditions:
- "production_network_access_detected"
- "unscoped_host_compromised"
- "manual_operator_override"
notification_channels:
- "slack:#purple-team-ops"
- "pagerduty:purple-team-lead"
allowed_subnets:
- "10.10.50.0/24"
require_human_approval:
- "T1003.001" # Credential dumping requires approval
- "T1021.002" # Lateral movement requires approval
red_agent:
model: "gpt-4o"
max_steps: 40
detection_delay_seconds: 30
continue_on_error: true
stealth_preference: "moderate"
blue_agent:
splunk_url: "https://splunk-es.internal:8089"
splunk_token: "${SPLUNK_API_TOKEN}"
soar_url: "https://soar.internal/api"
soar_token: "${SOAR_API_TOKEN}"
detection_timeout: 120
detection_map_path: "configs/detection_map.yaml"
reporting:
output_directory: "reports/PT-2025-Q4-APT29"
navigator_layer: true
executive_summary: true
detailed_findings: true
coverage_trend: true
Pro Tip: The require_human_approval list is one of the most important safety features. For high-impact techniques like credential dumping and lateral movement, the orchestrator pauses and requests explicit human approval before the Red Agent proceeds. This maintains human oversight for techniques that carry real risk even in a lab environment.
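A minimal sketch of that gate (class and callback names are illustrative; in production the `approve_fn` callback would post to Slack or PagerDuty and block on the operator's response rather than return immediately):

```python
class ApprovalGate:
    """Pause execution for techniques on the human-approval list."""

    def __init__(self, require_approval: list[str], approve_fn):
        self.require_approval = set(require_approval)
        self.approve_fn = approve_fn  # callable(technique_id) -> bool

    def check(self, technique_id: str) -> bool:
        """Return True if the technique may proceed."""
        if technique_id not in self.require_approval:
            return True  # no gate for this technique
        return bool(self.approve_fn(technique_id))
```

The orchestrator calls `gate.check(technique_id)` before dispatching each Red Agent tool call; a False return skips the technique and logs the denial.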
Building Your First AI Purple Team Exercise
This section provides a step-by-step walkthrough for setting up and running your first autonomous purple team exercise. We'll use MITRE Caldera as the adversary emulation backend, an LLM for adaptive planning, and Splunk for detection validation.
Step 1: Set Up MITRE Caldera with LLM Plugins
MITRE Caldera is an open-source adversary emulation platform that provides the execution engine for our Red Agent. Install it and configure the API for external integration.
# Clone and install Caldera
git clone https://github.com/mitre/caldera.git --recursive
cd caldera
# Create a Python virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Start Caldera server (default port 8888)
python server.py --insecure --fresh
# Verify the API is accessible
curl -s http://localhost:8888/api/v2/health \
-H "KEY:ADMIN123" | python3 -m json.tool
Configure Caldera's API access for external agent integration:
# caldera/conf/local.yml — Caldera server configuration
users:
red:
red: admin
blue:
blue: admin
api_key_red: CALDERA_RED_API_KEY_CHANGE_ME
api_key_blue: CALDERA_BLUE_API_KEY_CHANGE_ME
port: 8888
host: 0.0.0.0
plugins:
- sandcat # Default agent (Go-based implant)
- stockpile # ATT&CK technique library
- compass # ATT&CK Navigator integration
- access # Initial access techniques
- manx # Reverse shell agent
- response # Blue team response actions
exfil_dir: /tmp/caldera_exfil
reports_dir: /tmp/caldera_reports
logging:
level: DEBUG
file: logs/caldera.log
Step 2: Deploy Caldera Agents on Target Systems
Deploy Caldera's Sandcat agent on your purple team lab hosts:
# On target Windows workstation (PowerShell)
# Downloads and executes the Sandcat agent
$server="http://caldera-server:8888"
$url="$server/file/download"
$wc=New-Object System.Net.WebClient
$wc.Headers.add("platform","windows")
$wc.Headers.add("file","sandcat.go")
$wc.Headers.add("server","$server")
$output="C:\Users\Public\sandcat.exe"
$wc.DownloadFile($url,$output)
Start-Process -FilePath $output -ArgumentList "-server $server -group purple_team" -WindowStyle Hidden
# On target Linux server
curl -s -X POST http://caldera-server:8888/file/download \
-H "platform:linux" -H "file:sandcat.go" \
-H "server:http://caldera-server:8888" \
-o /tmp/sandcat
chmod +x /tmp/sandcat
nohup /tmp/sandcat -server http://caldera-server:8888 -group purple_team &
Verify agent connectivity:
# List active agents via Caldera API
curl -s http://localhost:8888/api/v2/agents \
-H "KEY:CALDERA_RED_API_KEY_CHANGE_ME" | python3 -m json.tool
# Expected output shows registered agents with their properties
# {
# "paw": "abc123",
# "host": "WS-PT-001",
# "platform": "windows",
# "group": "purple_team",
# "trusted": true,
# "last_seen": "2025-12-22T10:30:00Z"
# }
Step 3: Configure Adversary Profiles
Create a custom adversary profile in Caldera that aligns with your threat model:
# Create APT29 adversary profile via Caldera API
curl -s -X POST http://localhost:8888/api/v2/adversaries \
-H "KEY:CALDERA_RED_API_KEY_CHANGE_ME" \
-H "Content-Type: application/json" \
-d '{
"name": "APT29 Purple Team Profile",
"description": "APT29 Cozy Bear TTP simulation for purple team exercise",
"atomic_ordering": [
"90c2efaa-8205-480d-8bb6-61d90dbaf81b",
"d69e9e0c-59cd-4012-a7b7-4c2b4c79afb5",
"3b5db901-2a6a-4006-a04e-5321f7e0755d",
"6469befa-748a-4b9c-a96d-f191fde47d89",
"a398986f-813f-4086-bbcd-3f5406d12bc0"
],
"objective": "495a9828-cab1-44dd-a0ca-66e58177d8cc"
}'
Step 4: Build the LLM Integration Layer
Connect the LLM planning engine to Caldera's execution backend:
"""
caldera_integration.py — Bridges the LLM-based Red Agent with
MITRE Caldera's adversary emulation execution engine.
"""
import json
import logging
import time
from typing import Optional
import requests
logger = logging.getLogger("caldera_integration")
class CalderaExecutor:
"""
Provides the execution backend for the Red Agent by translating
ATT&CK technique requests into Caldera API operations.
"""
def __init__(self, caldera_url: str, api_key: str):
self.base_url = caldera_url.rstrip("/")
self.headers = {
"KEY": api_key,
"Content-Type": "application/json"
}
self.ability_cache = {}
self._load_abilities()
def execute_technique(
self,
technique_id: str,
target_paw: str,
parameters: Optional[dict] = None
) -> dict:
"""
Execute a specific ATT&CK technique against a target agent
via Caldera's operation API.
"""
abilities = self.ability_cache.get(technique_id, [])
if not abilities:
return {
"executed": False,
"error": f"No Caldera ability found for {technique_id}"
}
# Select the first matching ability
ability = abilities[0]
# Create a one-off operation for this technique
operation_payload = {
"name": f"PT-{technique_id}-{int(time.time())}",
"adversary": {"adversary_id": "", "name": ""},
"group": "purple_team",
"auto_close": True,
"jitter": "2/5",
"source": {
"id": "ed32b9c3-9593-4c33-b0db-e2007315096b"
},
"planner": {
"id": "aaa7c857-37a0-4c4a-85f7-4e9f7f30e31a"
},
"manual_command": {
"paw": target_paw,
"ability_id": ability["ability_id"]
}
}
if parameters:
operation_payload["facts"] = [
{"trait": k, "value": v}
for k, v in parameters.items()
]
# Start the operation
response = requests.post(
f"{self.base_url}/api/v2/operations",
headers=self.headers,
json=operation_payload,
timeout=30
)
if response.status_code != 200:
return {
"executed": False,
"error": f"Caldera API error: {response.status_code}"
}
operation = response.json()
operation_id = operation.get("id")
# Wait for operation to complete
result = self._wait_for_operation(operation_id)
return {
"executed": True,
"success": result.get("success", False),
"output": result.get("output", ""),
"duration_ms": result.get("duration_ms", 0),
"technique_name": ability["name"],
"operation_id": operation_id,
"artifacts": result.get("artifacts", {})
}
# ── Remaining methods ──────────────────────────────────────────────
# _load_abilities() — cache Caldera abilities indexed by ATT&CK technique ID
# _wait_for_operation() — poll Caldera until operation completes or times out
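The two stubbed helpers can be sketched as standalone functions that `CalderaExecutor` would call with its `base_url` and `headers`. The `/api/v2/abilities` and `/api/v2/operations/{id}` endpoints and the `technique_id` and `state` field names reflect Caldera's v2 API, but verify them against the Caldera version you deploy:

```python
"""
caldera_helpers.py - Sketches of the stubbed CalderaExecutor helpers.
Endpoint paths and field names should be checked against your Caldera
version before use.
"""
import time

import requests


def index_abilities(abilities: list) -> dict:
    """Index a Caldera ability catalog by ATT&CK technique ID."""
    cache: dict = {}
    for ability in abilities:
        technique = ability.get("technique_id")
        if technique:
            cache.setdefault(technique, []).append(ability)
    return cache


def load_abilities(base_url: str, headers: dict) -> dict:
    """Fetch the full ability catalog and build the technique index."""
    response = requests.get(
        f"{base_url}/api/v2/abilities", headers=headers, timeout=30
    )
    response.raise_for_status()
    return index_abilities(response.json())


def wait_for_operation(
    base_url: str,
    headers: dict,
    operation_id: str,
    poll_seconds: int = 5,
    timeout_seconds: int = 300,
) -> dict:
    """Poll Caldera until the operation reports 'finished' or time runs out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = requests.get(
            f"{base_url}/api/v2/operations/{operation_id}",
            headers=headers,
            timeout=30,
        )
        if response.status_code == 200:
            operation = response.json()
            if operation.get("state") == "finished":
                return {"success": True, "output": operation}
        time.sleep(poll_seconds)
    return {"success": False, "output": "", "error": "operation timed out"}
```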
Step 5: Run the Autonomous Exercise
With all components in place, launch the exercise:
"""
run_exercise.py — Launch an autonomous purple team exercise.
"""
import yaml
import logging
from orchestrator import PurpleTeamOrchestrator, ExerciseConfig
from reporter import PurpleTeamReporter
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(name)s] %(levelname)s: %(message)s"
)
# Load exercise configuration
with open("configs/exercise_config.yaml", "r") as f:
raw_config = yaml.safe_load(f)
config = ExerciseConfig(
exercise_id=raw_config["exercise"]["id"],
name=raw_config["exercise"]["name"],
threat_profile=raw_config["exercise"]["threat_profile"],
scope=raw_config["scope"],
techniques=raw_config["techniques"],
safety=raw_config["safety"],
red_agent_config=raw_config["red_agent"],
blue_agent_config=raw_config["blue_agent"],
max_duration_minutes=raw_config["safety"]["max_duration_minutes"],
detection_delay_seconds=raw_config["red_agent"]["detection_delay_seconds"]
)
# Run the exercise
orchestrator = PurpleTeamOrchestrator(config)
result = orchestrator.run_exercise()
# Generate reports
reporter = PurpleTeamReporter()
reporter.generate_navigator_layer(
result,
f"{raw_config['reporting']['output_directory']}/navigator_layer.json"
)
summary = reporter.generate_executive_summary(result)
print(summary)
print(f"\nDetection Coverage: {result.detection_coverage_pct:.1f}%")
print(f"Mean Detection Time: {result.mean_detection_time_ms:.0f}ms")
print(f"Total Gaps Found: {len(result.findings)}")
Step 6: Validate Detections in Splunk
After the Red Agent executes techniques, validate detection coverage directly in Splunk. Here are the key validation queries:
| `notable`
| where _time >= relative_time(now(), "-2h")
| eval exercise_tag = if(match(src, "10\.10\.50\."), "purple_team", "production")
| search exercise_tag="purple_team"
| stats count by rule_name, severity, mitre_technique_id, dest
| sort - count
Check for technique-specific detection coverage:
| `notable`
| where _time >= relative_time(now(), "-4h")
| search (dest="WS-PT-*" OR dest="DC-PT-*" OR dest="SRV-PT-*")
| eval technique = mvindex(split(mitre_technique_id, ","), 0)
| stats
count AS total_detections
dc(rule_name) AS unique_rules
values(rule_name) AS rules_fired
min(_time) AS first_detection
max(_time) AS last_detection
by technique
| eval detection_span_seconds = last_detection - first_detection
| sort technique
Identify telemetry gaps — techniques that generated raw events but no detections:
| tstats count WHERE index=* (host="WS-PT-*" OR host="DC-PT-*")
earliest=-4h latest=now()
by index, sourcetype, host
| join type=left host
[| `notable`
| where _time >= relative_time(now(), "-4h")
| search (dest="WS-PT-*" OR dest="DC-PT-*")
| stats count AS detection_count by dest
| rename dest AS host]
| eval detection_count = if(isnull(detection_count), 0, detection_count)
| where detection_count = 0
| stats values(sourcetype) AS available_telemetry by host
| eval gap_note = "Raw telemetry exists but no detections fired — detection rule gap"
Step 7: Generate the Gap Analysis Report
The final step produces the actionable output — a gap analysis that maps missed detections to specific remediation actions:
# Generate all reports from exercise results
python3 run_exercise.py 2>&1 | tee exercise_output.log
# Open the ATT&CK Navigator layer in your browser
# Upload the generated JSON to https://mitre-attack.github.io/attack-navigator/
open reports/PT-2025-Q4-APT29/navigator_layer.json
# Review the executive summary
cat reports/PT-2025-Q4-APT29/executive_summary.md
Pro Tip: Integrate the exercise runner into your CI/CD pipeline. Every time a new detection rule is deployed, automatically re-run the purple team exercise for the relevant techniques. This creates a continuous validation loop: new rule deploys → automated test confirms it catches the attack → coverage dashboard updates in real time.
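The mapping step of that validation loop can be sketched in a few lines. The rule-name-to-technique schema below is an assumption about how `configs/detection_map.yaml` is structured; adapt it to your own format:

```python
"""
retest_on_deploy.py - Sketch of a CI/CD hook that re-runs the purple team
exercise for only the techniques a newly deployed detection rule covers.
The detection-map schema (rule name -> technique_ids) is assumed.
"""


def techniques_for_rule(detection_map: dict, rule_name: str) -> list:
    """Look up which ATT&CK techniques a detection rule claims to cover."""
    return detection_map.get(rule_name, {}).get("technique_ids", [])


def build_targeted_scope(detection_map: dict, deployed_rules: list) -> list:
    """Collect a de-duplicated technique list for a targeted re-test."""
    scope: list = []
    for rule in deployed_rules:
        for tid in techniques_for_rule(detection_map, rule):
            if tid not in scope:
                scope.append(tid)
    return scope


# In the pipeline: load detection_map.yaml, diff it against the previous
# deploy to find changed rules, then hand build_targeted_scope(...) to the
# orchestrator as the exercise's technique list.
```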
Detection Coverage Gap Analysis — The Cymantis View
Detection coverage measurement is where most purple team programs fail. They test techniques, record pass/fail, and generate a spreadsheet. That's not gap analysis — it's a checklist. True gap analysis requires understanding why a detection failed, what would fix it, and how to prioritize remediation across dozens of gaps.
At Cymantis, we structure detection gap analysis across three dimensions: data availability, detection logic, and response integration. A gap in any dimension means the defense is incomplete.
Dimension 1: Data Availability
Before you can detect an attack, the relevant telemetry must exist in your SIEM. The most common reason for missed detections isn't a bad rule — it's a missing data source.
| `purple_team_techniques`
| mvexpand required_data_sources
| rename required_data_sources AS sourcetype
| join type=left sourcetype
[| tstats count WHERE index=* earliest=-24h latest=now()
by sourcetype
| eval data_present = "YES"]
| eval data_gap = if(isnull(data_present), "YES", "NO")
| where data_gap="YES"
| stats values(sourcetype) AS missing_data_sources by technique_id, technique_name
| sort technique_id
This query identifies ATT&CK techniques where the required data sources are not present in your SIEM. Every technique flagged here is a guaranteed detection blind spot.
Dimension 2: Detection Logic
For techniques where telemetry exists but detections didn't fire, the issue is in the detection rule itself — incorrect logic, overly narrow filters, or tuning that suppressed the alert.
| `notable`
| where _time >= relative_time(now(), "-30d")
| eval technique = mvindex(split(mitre_technique_id, ","), 0)
| stats
count AS total_fires
dc(dest) AS unique_targets
avg(urgency) AS avg_urgency
by technique, rule_name
| append
[| inputlookup attack_techniques.csv
| rename technique_id AS technique
| eval total_fires = 0, unique_targets = 0, avg_urgency = 0,
rule_name = "NO RULE MAPPED"]
| dedup technique
| where total_fires = 0 AND rule_name = "NO RULE MAPPED"
| table technique, rule_name, total_fires
| eval gap_type = "detection_logic_gap"
| sort technique
Dimension 3: Response Integration
A detection that fires but doesn't trigger the appropriate response action is a gap. The alert exists, but the SOAR playbook didn't activate, the notification didn't route, or the containment action didn't execute.
| `notable`
| where _time >= relative_time(now(), "-7d")
| search status!="closed"
| join type=left rule_name
[| rest /services/configs/conf-workflow_actions
| stats count AS playbook_count by title
| rename title AS rule_name]
| eval playbook_count = if(isnull(playbook_count), 0, playbook_count)
| where playbook_count = 0
| stats count AS unlinked_alerts by rule_name, severity
| sort - severity, - count
| eval gap_type = "response_integration_gap"
Coverage Dashboard
Combine all three dimensions into a unified coverage score:
| inputlookup purple_team_results.csv
| eval coverage_score = case(
detection_status="detected" AND response_status="activated", 100,
detection_status="detected" AND response_status!="activated", 66,
detection_status="partial", 33,
detection_status="missed", 0,
1=1, 0)
| stats
avg(coverage_score) AS overall_coverage
count(eval(coverage_score=100)) AS fully_covered
count(eval(coverage_score=66)) AS detect_only
count(eval(coverage_score=33)) AS partial
count(eval(coverage_score=0)) AS blind_spots
count AS total_tested
| eval overall_pct = round(overall_coverage, 1)
| eval summary = "Overall Coverage: " . overall_pct . "% | Fully Covered: " .
fully_covered . " | Detect Only: " . detect_only . " | Partial: " .
partial . " | Blind Spots: " . blind_spots
Cymantis Recommendations
Based on our work with purple team programs across enterprise, federal, and critical infrastructure environments, here are the patterns that distinguish mature programs from checkbox exercises:
1. Measure Coverage Continuously, Not Periodically
Detection coverage degrades over time. Data sources change, detection rules drift, SIEM configurations get modified, and new techniques emerge. If you only measure coverage during quarterly exercises, you're flying blind between measurements.
Action: Schedule automated purple team exercises weekly. Even a reduced-scope exercise targeting your top 20 critical techniques provides more value than a comprehensive exercise once a quarter.
2. Track Coverage Trends, Not Point-in-Time Scores
A coverage score of 72% is meaningless without context. Is it improving or declining? Which tactics are getting better and which are getting worse? Trend data tells the story.
Action: Store every exercise result in a time-series index. Build a Splunk dashboard that shows coverage percentage over time, broken down by ATT&CK tactic. Present this to leadership monthly.
3. Prioritize Gaps by Business Impact, Not Technique Count
Not all detection gaps are equal. A gap in detecting Kerberoasting (T1558.003) against your domain controllers is categorically more important than a gap in detecting System Information Discovery (T1082) on a development workstation.
Action: Weight gaps by the criticality of the assets they affect. A missed detection on a crown jewel system should be prioritized above five missed detections on low-value endpoints.
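One way to implement that weighting, with illustrative asset tiers and weight values that you would tune to your own crown-jewel inventory:

```python
"""
gap_priority.py - Sketch of business-impact weighting for detection gaps.
Tier names and weights are illustrative.
"""
ASSET_WEIGHTS = {
    "crown_jewel": 10.0,   # domain controllers, core financial systems
    "production": 5.0,
    "corporate": 2.0,
    "development": 1.0,
}


def gap_priority(technique_severity: int, asset_tier: str) -> float:
    """Weight a gap's severity (1-10) by the affected asset's criticality."""
    return technique_severity * ASSET_WEIGHTS.get(asset_tier, 1.0)


def rank_gaps(gaps: list) -> list:
    """Sort gaps so the highest business-impact items come first."""
    return sorted(
        gaps,
        key=lambda g: gap_priority(g["severity"], g["asset_tier"]),
        reverse=True,
    )
```

Under this scheme a severity-7 Kerberoasting gap on a domain controller (score 70) outranks a severity-8 discovery gap on a dev box (score 8), which matches the prioritization argument above.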
4. Close the Loop — Re-Test After Remediation
The gap analysis is only valuable if it drives remediation. And remediation is only confirmed if you re-test. Too many organizations identify gaps, write new rules, and never validate that the rules actually work.
Action: For every detection gap, create a remediation ticket with a re-test requirement. The ticket isn't closed until the automated purple team exercise confirms the new detection fires for the technique that was previously missed.
5. Build a Detection Engineering Feedback Loop
Every purple team exercise should feed directly into your detection engineering pipeline. Missed techniques become detection development backlog items. Slow detections become performance optimization tasks. False positives become tuning requirements.
Action: Integrate the gap analysis output with your ticketing system (Jira, ServiceNow). Auto-create tickets for each finding with the ATT&CK technique, the raw telemetry analysis, and a suggested detection approach.
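A minimal sketch of the Jira half of that integration, built around Jira's standard create-issue endpoint (POST /rest/api/2/issue). The DETENG project key, Task issue type, and field layout are assumptions; adjust them for your Jira instance and any required custom fields:

```python
"""
gap_tickets.py - Sketch of turning gap findings into Jira issue payloads.
Project key, issue type, and field layout are assumptions.
"""


def build_gap_ticket(finding: dict, project_key: str = "DETENG") -> dict:
    """Translate one gap finding into a Jira create-issue payload."""
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Task"},
            "summary": f"[Detection Gap] {finding['technique_id']}: {finding['technique_name']}",
            "description": (
                f"ATT&CK technique: {finding['technique_id']}\n"
                f"Gap type: {finding['gap_type']}\n"
                f"Suggested approach: {finding.get('suggested_detection', 'TBD')}\n\n"
                "Re-test required: this ticket closes only after the automated "
                "purple team exercise confirms the new detection fires."
            ),
            "labels": ["purple-team", "detection-gap", finding["technique_id"]],
        }
    }


# To file the ticket, POST the payload with your HTTP client, e.g.:
#   requests.post(f"{jira_url}/rest/api/2/issue",
#                 json=build_gap_ticket(finding),
#                 auth=(user, api_token), timeout=30)
```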
Pro Tip: The single most impactful metric for a purple team program isn't detection coverage percentage — it's mean time to close a detection gap. If you can identify a gap on Monday and validate the fix by Wednesday, you have a mature program. If gap remediation takes months, coverage percentage is an illusion.
Governance for Autonomous Red Teaming
Running autonomous adversary simulation against your environment requires robust governance. Without it, you risk service disruption, scope creep, legal liability, and the erosion of trust between security teams and the business. This section covers the governance framework that makes autonomous purple teaming production-safe.
The Governance Stack
Governance for autonomous red teaming operates at four layers:
- Legal and Authorization — Written authorization from executive leadership, scope agreements, and compliance with applicable laws and regulations.
- Operational Boundaries — Technical controls that constrain what the Red Agent can do: scope restrictions, blocked techniques, time windows, and kill switches.
- Human Oversight — Approval gates for high-impact actions, real-time monitoring by authorized operators, and manual override capabilities.
- Audit and Accountability — Comprehensive logging of every action, decision chain records, and post-exercise review processes.
Governance Policy Configuration
Define your governance policies as code — machine-readable, version-controlled, and enforced by the orchestration layer:
# governance_policy.yaml — Purple team exercise governance controls
authorization:
document_ref: "SEC-AUTH-2025-047"
approved_by: "CISO — Jane Smith"
approval_date: "2025-11-15"
valid_until: "2026-03-15"
scope_summary: >
Authorized autonomous adversary simulation against designated
purple team lab environment (10.10.50.0/24) using MITRE ATT&CK
techniques up to "high" impact level. Destructive techniques
explicitly excluded.
legal_review: "Completed — Legal ref: LGL-2025-1209"
insurance_notification: "Cyber insurance carrier notified — ref: INS-2025-889"
operational_boundaries:
# Time restrictions
allowed_hours:
timezone: "America/New_York"
weekday_start: "08:00"
weekday_end: "18:00"
weekend_allowed: false
holiday_blackout: true
# Scope restrictions
network_scope:
allowed_subnets:
- "10.10.50.0/24"
blocked_subnets:
- "10.10.0.0/16" # Production network
- "10.20.0.0/16" # Corporate network
- "172.16.0.0/12" # Management network
dns_restrictions:
allowed_domains:
- "*.purpleteam.local"
blocked_domains:
- "*.prod.internal"
- "*.corp.internal"
# Technique restrictions
technique_policy:
max_impact_level: "high"
blocked_categories:
- "impact" # No destructive techniques
- "resource-hijacking"
blocked_techniques:
- "T1485" # Data Destruction
- "T1486" # Data Encrypted for Impact
- "T1489" # Service Stop
- "T1490" # Inhibit System Recovery
- "T1496" # Resource Hijacking
- "T1529" # System Shutdown/Reboot
- "T1531" # Account Access Removal
- "T1561" # Disk Wipe
require_approval:
- technique_id: "T1003.*"
reason: "Credential dumping — high risk of credential exposure"
approver: "purple_team_lead"
- technique_id: "T1021.*"
reason: "Lateral movement — risk of scope expansion"
approver: "purple_team_lead"
- technique_id: "T1078.*"
reason: "Valid accounts — risk of legitimate credential misuse"
approver: "soc_manager"
# Resource limits
resource_limits:
max_concurrent_operations: 3
max_compromised_hosts: 5
max_harvested_credentials: 20
max_exfiltrated_data_mb: 100
max_exercise_duration_minutes: 120
human_oversight:
# Real-time monitoring requirements
monitoring:
required_observers: 1
observer_roles:
- "purple_team_lead"
- "soc_analyst_senior"
monitoring_dashboard: "https://splunk-es.internal/app/purple_team/live"
notification_channels:
- type: "slack"
channel: "#purple-team-ops"
events: ["exercise_start", "exercise_end", "approval_request", "safety_trigger"]
- type: "pagerduty"
service: "purple-team-lead"
events: ["safety_trigger", "kill_switch", "scope_violation"]
# Approval workflows
approval_gates:
- name: "exercise_start"
approver: "purple_team_lead"
timeout_minutes: 30
auto_deny_on_timeout: true
- name: "high_impact_technique"
approver: "purple_team_lead"
timeout_minutes: 10
auto_deny_on_timeout: true
- name: "scope_extension"
approver: "soc_manager"
timeout_minutes: 60
auto_deny_on_timeout: true
# Kill switch configuration
kill_switch:
enabled: true
authorized_operators:
- "purple_team_lead"
- "soc_manager"
- "ciso"
auto_triggers:
- condition: "technique_outside_scope"
action: "halt_exercise"
notification: "immediate"
- condition: "production_network_traffic"
action: "halt_exercise"
notification: "immediate"
- condition: "max_duration_exceeded"
action: "graceful_shutdown"
notification: "standard"
- condition: "unscoped_host_compromised"
action: "halt_exercise"
notification: "immediate"
manual_trigger:
method: "slack_command"
command: "/purple-team kill"
confirmation_required: true
audit_and_accountability:
# Logging requirements
logging:
log_destination: "index=purple_team_audit"
log_level: "debug"
fields_required:
- "timestamp"
- "exercise_id"
- "agent_type" # red or blue
- "action"
- "technique_id"
- "target_host"
- "result"
- "operator_id"
- "approval_status"
retention_days: 365
tamper_protection: true
# Post-exercise review
post_exercise:
review_required: true
review_deadline_hours: 48
review_participants:
- "purple_team_lead"
- "soc_manager"
- "detection_engineering_lead"
deliverables:
- "executive_summary"
- "detailed_findings"
- "navigator_layer"
- "remediation_plan"
- "lessons_learned"
# Incident handling (if something goes wrong)
incident_response:
escalation_path:
- level: 1
contact: "purple_team_lead"
method: "slack"
- level: 2
contact: "soc_manager"
method: "pagerduty"
- level: 3
contact: "ciso"
method: "phone"
rollback_procedures:
- "Terminate all Caldera operations via API"
- "Kill all Sandcat agent processes on target hosts"
- "Revoke all harvested credentials"
- "Remove all persistence mechanisms"
- "Restore target hosts from snapshot"
documentation_requirements:
- "Timeline of events"
- "Root cause analysis"
- "Impact assessment"
- "Corrective actions"
Key Governance Principles
1. Authorization Before Automation. Never run autonomous adversary simulation without explicit written authorization from an executive with the authority to approve offensive security testing. This isn't optional — it's a legal requirement in most jurisdictions.
2. Scope as Code. Define scope in machine-readable configuration that the orchestration layer enforces automatically. Human-readable scope documents are necessary for authorization, but they don't prevent scope violations. Code does.
3. Kill Switch Always On. The kill switch is the most important safety control. It must be always enabled, instantly accessible, and tested regularly. A kill switch that doesn't work when you need it is worse than no kill switch at all.
4. Audit Everything. Every technique execution, every approval decision, every safety trigger, and every state change must be logged to an immutable audit trail. This protects the purple team, satisfies compliance requirements, and provides the evidence base for post-exercise analysis.
5. Blast Radius Containment. Design your lab environment with network segmentation that physically prevents the Red Agent from reaching production systems. Software controls fail; network controls are harder to bypass. Use dedicated VLANs, firewall rules, and — if possible — air-gapped environments for high-impact technique testing.
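Principle 2 translates directly into code. A minimal scope-enforcement sketch using the subnets from the governance policy above, with default-deny for anything unlisted; the orchestrator would run this check on every target before any technique executes:

```python
"""
scope_enforcer.py - Minimal sketch of scope-as-code enforcement.
Subnets mirror governance_policy.yaml; the allow list takes precedence
over the broader blocked ranges that contain it.
"""
import ipaddress

ALLOWED_SUBNETS = [ipaddress.ip_network("10.10.50.0/24")]
BLOCKED_SUBNETS = [
    ipaddress.ip_network("10.10.0.0/16"),   # production
    ipaddress.ip_network("10.20.0.0/16"),   # corporate
    ipaddress.ip_network("172.16.0.0/12"),  # management
]


def target_in_scope(ip: str) -> bool:
    """Allow a target only if it sits inside an allowed subnet."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in ALLOWED_SUBNETS):
        return True
    if any(addr in net for net in BLOCKED_SUBNETS):
        return False
    return False  # default-deny: anything unlisted is out of scope
```

Note the ordering: the lab subnet 10.10.50.0/24 sits inside the blocked 10.10.0.0/16 production range, so the allow check must run first.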
Pro Tip: Run a governance tabletop exercise before your first autonomous purple team engagement. Walk through the scenario: "The Red Agent executes a lateral movement technique that accidentally reaches a production-adjacent system. The kill switch fires. Now what?" If your team can't answer that question confidently, you're not ready for autonomous operations.
Putting It All Together: The Continuous Purple Team Operating Model
The architecture, tooling, and governance described in this post come together in a continuous operating model that fundamentally changes how organizations validate their defenses.
The Weekly Cycle
graph TD
Monday["Monday<br/>Automated exercise: Top 20 critical techniques<br/>[Fully autonomous]"]
Tuesday["Tuesday<br/>Blue agent validates detections, generates gap report<br/>[Fully autonomous]"]
Wednesday["Wednesday<br/>Detection engineering reviews gaps, prioritizes fixes<br/>[Human-led]"]
Thursday["Thursday<br/>New/updated detection rules deployed<br/>[Human-led]"]
Friday["Friday<br/>Automated re-test of remediated techniques<br/>[Fully autonomous]"]
Monday --> Tuesday
Tuesday --> Wednesday
Wednesday --> Thursday
Thursday --> Friday
The Monthly Cycle
graph TD
Week1["Week 1<br/>Full-scope exercise against primary threat profile"]
Week2["Week 2<br/>Gap remediation sprint"]
Week3["Week 3<br/>Full-scope exercise against secondary threat profile"]
Week4["Week 4<br/>Trend analysis, executive reporting, coverage review"]
Week1 --> Week2
Week2 --> Week3
Week3 --> Week4
Key Performance Indicators
Track these KPIs to measure purple team program maturity:
- Detection Coverage Percentage — Techniques detected / techniques tested. Target: >80%.
- Mean Detection Time (MDT) — Average time from technique execution to alert. Target: <60 seconds.
- Mean Time to Close Gap (MTTCG) — Average time from gap identification to validated fix. Target: <5 business days.
- Coverage Trend — Month-over-month change in detection coverage percentage. Target: Positive trend.
- Technique Test Frequency — Average number of times each critical technique is tested per month. Target: ≥4.
- False Positive Rate — Percentage of detection fires that are false positives during exercises. Target: <5%.
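These KPIs are straightforward to compute from stored exercise results. A sketch, assuming result records carry `detected`, `detection_time_s`, and `days_to_close` fields (hypothetical names; map them to your own schema):

```python
"""
kpi_report.py - Sketch of computing the program KPIs from exercise
results. Record field names are assumptions.
"""


def detection_coverage_pct(results: list) -> float:
    """Techniques detected / techniques tested, as a percentage."""
    if not results:
        return 0.0
    detected = sum(1 for r in results if r["detected"])
    return round(100.0 * detected / len(results), 1)


def mean_detection_time_s(results: list) -> float:
    """Average seconds from execution to alert, over detected techniques."""
    times = [r["detection_time_s"] for r in results if r["detected"]]
    return round(sum(times) / len(times), 1) if times else 0.0


def mean_time_to_close_gap_days(gaps: list) -> float:
    """Average business days from gap identification to validated fix."""
    closed = [g["days_to_close"] for g in gaps if g.get("days_to_close") is not None]
    return round(sum(closed) / len(closed), 1) if closed else 0.0
```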
Final Thoughts
The purple team model was always the right idea — offense and defense working together to find gaps before adversaries do. The limitation was never conceptual; it was operational. Human teams can't execute at the speed, scale, and frequency that continuous security validation demands.
Agentic AI removes that bottleneck. An autonomous Red Agent that plans and executes MITRE ATT&CK-aligned attack chains. A Blue Agent that validates every detection in your SIEM pipeline. An Orchestration Layer that enforces safety boundaries and coordinates the exercise lifecycle. A Reporting Engine that produces real-time gap analysis with actionable remediation guidance.
The result is a continuous offense-defense feedback loop that operates at machine speed with human judgment at the decision points that matter: defining threat profiles, setting governance boundaries, prioritizing remediation, and making strategic investment decisions.
Organizations that adopt agentic adversary simulation gain something that periodic testing can never provide: real-time visibility into their defensive posture. Not a snapshot from last quarter's engagement. Not a spreadsheet that's outdated before the ink dries. A live, continuously validated map of what you can detect, what you can't, and exactly what to fix next.
The adversaries are already using AI. Your purple team should be too.
Cymantis Labs helps security teams design, deploy, and govern autonomous purple team programs — from initial architecture through continuous operations. We bring the adversary simulation expertise, detection engineering depth, and governance rigor to make autonomous security validation production-safe and operationally sustainable.
Resources & References
MITRE Frameworks & Tools
- MITRE ATT&CK Framework: https://attack.mitre.org/ — Adversary tactics, techniques, and procedures knowledge base
- MITRE ATT&CK Navigator: https://mitre-attack.github.io/attack-navigator/ — Web-based tool for visualizing ATT&CK coverage and gaps
- MITRE Caldera: https://caldera.mitre.org/ — Open-source adversary emulation platform
- MITRE D3FEND: https://d3fend.mitre.org/ — Defensive technique knowledge graph
- MITRE ATLAS: https://atlas.mitre.org/ — Adversarial threat landscape for AI systems
Adversary Emulation & Red Team
- Atomic Red Team: https://github.com/redcanaryco/atomic-red-team — Library of atomic tests mapped to ATT&CK techniques
- MITRE Caldera GitHub: https://github.com/mitre/caldera — Source code and documentation for Caldera
- Infection Monkey (Akamai): https://github.com/guardicore/monkey — Open-source breach and attack simulation tool
- Prelude Operator: https://www.prelude.org/ — Autonomous red teaming platform
Detection Engineering & SIEM
- Splunk Enterprise Security: https://docs.splunk.com/Documentation/ES/latest — Splunk ES documentation
- Splunk ESCU (Security Content Updates): https://research.splunk.com — Community detection rules mapped to ATT&CK
- Sigma Rules: https://github.com/SigmaHQ/sigma — Generic signature format for SIEM rules
- Elastic Detection Rules: https://github.com/elastic/detection-rules — Open detection rules for Elastic Security
AI & LLM Frameworks
- OpenAI Function Calling: https://platform.openai.com/docs/guides/function-calling — Building tool-calling AI agents
- LangChain Agent Framework: https://python.langchain.com/docs/modules/agents/ — Open-source agent orchestration
- Microsoft Copilot for Security: https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-copilot-security — Enterprise AI security platform
Governance & Compliance
- NIST AI Risk Management Framework: https://www.nist.gov/artificial-intelligence/ai-risk-management-framework — Federal guidance on AI governance
- NIST SP 800-53 Rev. 5: https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final — Security and privacy controls
- PTES (Penetration Testing Execution Standard): http://www.pentest-standard.org/ — Penetration testing methodology and standards
- CREST — Penetration Testing Guide: https://www.crest-approved.org/ — Professional standards for penetration testing
Research & Industry Reports
- Microsoft AI Red Team: https://www.microsoft.com/en-us/security/blog/ai-red-team/ — Lessons learned from red teaming AI systems at scale
- SANS Purple Team Survey: https://www.sans.org/white-papers/ — Annual survey of purple team operations and tooling
- IBM Cost of a Data Breach Report: https://www.ibm.com/reports/data-breach — Detection time metrics and breach cost analysis
- Mandiant M-Trends Report: https://www.mandiant.com/m-trends — Annual threat landscape and detection gap analysis
For more insights or to schedule a Cymantis Purple Team Architecture Assessment, contact our research and adversary simulation team at cymantis.com.
