ML & Cyber Analytics.

Vendor-neutral landscape map: model families, training pipelines, deployment patterns — plus which statistical/ML models fit which security-analytics problems and where they reliably fail.

Model families

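  • Linear (linear/logistic regression, Lasso, Ridge). Cheap, interpretable, baseline for any tabular task. Coefficients show how strongly, and in which direction, each feature drives the score, provided features are on comparable scales.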
  • Tree-based (Random Forest, XGBoost, LightGBM, CatBoost). Dominant for tabular security data (alerts, log events). Handles missing values, mixed types, non-linear interactions. SHAP for per-prediction explanation.
  • Deep neural (MLP, CNN, RNN/LSTM). Good for sequences (network flows, process trees), images (icon-similarity for malware family), and raw bytes (deep-learning malware classifiers).
  • Transformer. State of the art for log understanding, code analysis, text-heavy security tasks. Expensive; often distilled or used for offline batch enrichment rather than real-time scoring.
  • Graph neural networks. Authentication graphs, lateral-movement graphs, malware-similarity graphs. Niche but growing.
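To make the interpretability point about linear models concrete, here is a minimal sketch of per-feature contributions to a logistic score. The feature names, coefficients, and intercept are hypothetical values chosen for illustration, not taken from any real model.

```python
import math

# Hypothetical coefficients for a logistic-regression alert scorer (illustrative only).
COEFFICIENTS = {
    "failed_logins_last_hour": 0.8,
    "new_device": 1.5,
    "off_hours": 0.4,
    "known_corporate_ip": -2.0,
}
INTERCEPT = -3.0


def score(features):
    """Return (probability, per-feature contribution to the logit)."""
    contributions = {
        name: COEFFICIENTS[name] * features.get(name, 0.0) for name in COEFFICIENTS
    }
    logit = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


event = {"failed_logins_last_hour": 3, "new_device": 1, "off_hours": 1, "known_corporate_ip": 0}
probability, contributions = score(event)

# The feature with the largest absolute contribution is what "drives" this score.
top_feature = max(contributions, key=lambda name: abs(contributions[name]))
print(f"p(malicious)={probability:.3f}, driven mostly by {top_feature}")
```

Each contribution is coefficient times feature value, so an analyst can read the explanation straight off the model. Tree ensembles need SHAP or a similar method to produce an equivalent breakdown.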

Training pipelines

  • Offline batch. Daily/weekly retrain on accumulated labeled data. Lowest operational complexity. Standard for most security ML.
  • Online learning. Update on each new labeled example. Useful where labels arrive faster than batch cycle (real-time fraud). Beware label-quality drift poisoning the model.
  • Federated. Train across customer tenants without centralizing data. Compelling story for security vendors; significant engineering cost. Honest claim only if model improves measurably from federation versus per-tenant baseline.
  • Active learning. Model queries analyst for labels on uncertain examples. Maximizes labeling ROI when SOC time is the bottleneck.
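The active-learning loop above can be sketched with uncertainty sampling, one common query strategy: send the analyst the alerts the model is least sure about. The alert records, the `p_malicious` field, and the `budget` parameter are assumptions made for this example.

```python
def select_for_labeling(scored_alerts, budget):
    """Return the `budget` alerts whose predicted probability is closest to 0.5."""
    by_uncertainty = sorted(scored_alerts, key=lambda a: abs(a["p_malicious"] - 0.5))
    return by_uncertainty[:budget]


# Hypothetical model outputs for five unlabeled alerts.
alerts = [
    {"id": "a1", "p_malicious": 0.02},
    {"id": "a2", "p_malicious": 0.48},
    {"id": "a3", "p_malicious": 0.97},
    {"id": "a4", "p_malicious": 0.61},
    {"id": "a5", "p_malicious": 0.35},
]

# With analyst time for two labels, ask about the two most ambiguous alerts.
labeling_queue = select_for_labeling(alerts, budget=2)
print([a["id"] for a in labeling_queue])
```

Confidently scored alerts (a1, a3) add little information when labeled, so analyst time goes to the ambiguous ones. Other query strategies exist; this is only the simplest.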

Deployment patterns

  • In-product real-time. Score every event inline. Latency budget tight (<10 ms typical). Model must be small, with features pre-computed and cached.
  • Sidecar / async. Event published to queue, scored async. Higher latency budget (seconds). Larger models possible.
  • Batch scoring. Periodic enrichment job over historical data. Largest models. Used for hunt and ranking, not for blocking.
  • Edge / on-device. Endpoint-side ML. Constrained model size. Examples: on-device URL classifier, on-device process-behavior classifier.
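As a rough illustration of the sidecar / async pattern, the sketch below uses an in-process queue and one worker thread as a stand-in for a real message broker and scoring service. The `heavy_score` rule and the event fields are hypothetical.

```python
import queue
import threading


def heavy_score(event):
    # Stand-in for a larger model: a hypothetical rule that scores long command lines higher.
    return min(1.0, len(event["cmdline"]) / 100.0)


def scoring_worker(events, results):
    # Consume events until the None sentinel arrives.
    while True:
        event = events.get()
        try:
            if event is None:
                return
            results[event["id"]] = heavy_score(event)
        finally:
            events.task_done()


events = queue.Queue()
results = {}
worker = threading.Thread(target=scoring_worker, args=(events, results))
worker.start()

# Producer side: the product publishes the event and moves on; nothing blocks on the model.
events.put({"id": "e1", "cmdline": "x" * 50})
events.put({"id": "e2", "cmdline": "y" * 200})
events.put(None)  # sentinel: no more events

worker.join()
print(results)
```

Because the producer never waits on the scorer, the model can take seconds per event without slowing the product. The cost is that verdicts arrive after the event has already been allowed through.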

Analytics fit — what works and what fails

  • Anomaly detection. Works on stable baselines (user login patterns, network egress volume by host). Fails on adversarial drift: attacker observes baseline, stays inside it. Also fails on concept drift: baseline itself moves due to legitimate change (new application, holiday traffic) producing false positives.
  • Supervised classification. Works when labeled data is large + threat distribution stable + features extractable. Examples: domain-classification (DGA vs legit), URL phishing detection, malware family identification. Fails when novelty rate exceeds retraining cadence.
  • Clustering. Useful for triage (group similar alerts, surface representative example) and for exploration (cluster newly seen samples to find emerging family). Weak as primary decision surface: no ground truth = no measurable accuracy.
  • Ranking. Strong fit for SOC triage. Train on analyst dispositions (true positive / false positive) → rank new alerts by predicted disposition. Measurable in alert-to-resolution time reduction.
  • UEBA-style risk scoring. Combine multiple weak signals into composite risk. Useful as a hunt input. Often over-claimed as a detection.
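The anomaly-detection trade-off above can be shown with a z-score check against a fixed baseline. This is a minimal sketch: the egress volumes are invented, and the threshold of 3 standard deviations is a common default rather than a recommendation.

```python
from statistics import mean, stdev


def zscore_flags(history, observations, threshold=3.0):
    """Flag observations more than `threshold` sample standard deviations from the baseline mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    return [abs(x - mu) / sigma > threshold for x in observations]


# Hypothetical daily egress volume (MB) for one host over ten quiet days.
baseline_mb = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96]

observations = [
    500,  # bulk exfiltration: far outside the baseline
    105,  # low-and-slow exfiltration: attacker stays inside the baseline
    130,  # legitimate new application: the baseline itself has moved
]
flags = zscore_flags(baseline_mb, observations)
print(flags)
```

The bulk transfer is caught, the patient attacker is missed (adversarial drift), and the legitimate change is flagged (concept drift). One detector produces all three outcomes, which is why anomaly scores work better as hunt input than as a blocking decision.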

Failure modes per technique

  • Data-quality dependency. Garbage labels → garbage model. Vendor "AI" trained on a label set that does not resemble your environment will likely underperform on your data.
  • FP/FN bias cost. Threshold choice is a business question, not a model question. Authentication: false-positive cost = user friction; false-negative cost = breach. Threshold drives behavior.
  • Model maintenance cost. 3-year deployment requires retraining cadence, feature-pipeline maintenance, drift monitoring, label-quality auditing. Most vendor demos ignore this.
  • Adversarial drift. Attackers test against deployed models. Detection-as-code rulesets (Sigma) and ML models both decay; neither is "set and forget."
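The point that threshold choice is a business question can be made concrete: the same model and the same scores yield different thresholds once error costs change. In this sketch the scores, labels, candidate thresholds, and cost figures are all illustrative assumptions.

```python
def expected_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Total cost of errors when everything scoring >= threshold is treated as malicious."""
    false_positives = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_negatives = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_positives * fp_cost + false_negatives * fn_cost


def best_threshold(scores, labels, fp_cost, fn_cost, candidates):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda t: expected_cost(scores, labels, t, fp_cost, fn_cost))


# Hypothetical validation set: model scores and ground-truth labels (1 = malicious).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
candidates = [0.25, 0.5, 0.75]

# User friction dominates (e.g. a low-risk internal app): tolerate misses, avoid prompts.
friction_heavy = best_threshold(scores, labels, fp_cost=10, fn_cost=1, candidates=candidates)

# Breach cost dominates (e.g. privileged access): tolerate prompts, avoid misses.
breach_heavy = best_threshold(scores, labels, fp_cost=1, fn_cost=100, candidates=candidates)

print(friction_heavy, breach_heavy)
```

The model did not change between the two calls, only the cost assumptions did. That is the sense in which the threshold belongs to the business owner rather than the data scientist.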

Evaluating vendor ML claims

  1. "What features?" A vendor unwilling to disclose them often has a shallow model.
  2. "What's the FP rate on your data, and on mine?" Ask for customer-side measurement against a known, representative window.
  3. "How often retrained, on what data?" Federated claims = verify or discount.
  4. "What's the explainability surface for an analyst?" SHAP, top-k feature contributions, or pure black box?
  5. "What's the operational cost of a wrong decision and how is the loop closed?" Labeling feedback path or fire-and-forget?

Rule of thumb. ML in security earns its keep when it ranks and prioritizes analyst attention. It struggles when promoted to autonomous decision-maker on novel threats. The honest deployment shape is "model surfaces candidates, analyst decides, decisions feed back into training."
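For question 2 in the list above, a customer-side FP measurement is more useful with an uncertainty range attached. The sketch below computes a Wilson score interval for the observed false-positive rate. The counts and the vendor-claimed rate are invented for illustration.

```python
import math


def wilson_interval(false_positives, benign_events, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if benign_events <= 0:
        raise ValueError("need at least one benign event to estimate an FP rate")
    p = false_positives / benign_events
    denom = 1 + z * z / benign_events
    center = (p + z * z / (2 * benign_events)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / benign_events + z * z / (4 * benign_events ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical evaluation window: 10,000 events known to be benign, 12 alerted on.
low, high = wilson_interval(false_positives=12, benign_events=10_000)

# Hypothetical vendor claim: 0.05% false-positive rate.
claimed_rate = 0.0005
claim_consistent = low <= claimed_rate <= high

print(f"observed FP rate 95% interval: [{low:.4%}, {high:.4%}]")
print(f"vendor claim consistent with this window: {claim_consistent}")
```

If the claimed rate falls outside the interval, the claim does not hold on this window. The next question is whether the window is representative or the vendor's evaluation data was not.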
