ML & Cyber Analytics.
Vendor-neutral landscape map: model families, training pipelines, deployment patterns — plus which statistical/ML models fit which security-analytics problems and where they reliably fail.
Model families
- Linear (linear/logistic regression, Lasso, Ridge). Cheap, interpretable, baseline for any tabular task. Coefficients show which features drive the score, provided the features are on comparable scales.
- Tree-based (Random Forest, XGBoost, LightGBM, CatBoost). Dominant for tabular security data (alerts, log events). Handles missing values, mixed types, non-linear interactions. SHAP for per-prediction explanation.
- Deep neural (MLP, CNN, RNN/LSTM). Good for sequences (network flows, process trees), images (icon-similarity for malware family), and raw bytes (deep-learning malware classifiers).
- Transformer. State of the art for log understanding, code analysis, text-heavy security tasks. Expensive; often distilled or used for offline batch enrichment rather than real-time scoring.
- Graph neural networks. Authentication graphs, lateral-movement graphs, malware-similarity graphs. Niche but growing.
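As a rough illustration of reading a linear model's coefficients, the sketch below scores one event with a logistic model and lists the features contributing most to the log-odds. The feature names, weights, and bias are hypothetical values chosen for illustration, not taken from any trained model.

```python
import math

# Hypothetical coefficients for a logistic alert scorer (illustrative only).
WEIGHTS = {"failed_logins_1h": 0.8, "new_device": 1.5, "off_hours": 0.4}
BIAS = -3.0

def score(features):
    """Return (probability, per-feature contribution to the log-odds)."""
    contributions = {name: WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS}
    logit = BIAS + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-logit)), contributions

def top_k(contributions, k=2):
    """Features with the largest absolute contribution, largest first."""
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

event = {"failed_logins_1h": 4, "new_device": 1, "off_hours": 0}
prob, contrib = score(event)
drivers = top_k(contrib)
print(round(prob, 3), drivers)
```

For tree ensembles the same per-prediction view is usually obtained from SHAP values rather than from raw coefficients.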
Training pipelines
- Offline batch. Daily/weekly retrain on accumulated labeled data. Lowest operational complexity. Standard for most security ML.
- Online learning. Update on each new labeled example. Useful where labels arrive faster than batch cycle (real-time fraud). Beware label-quality drift poisoning the model.
- Federated. Train across customer tenants without centralizing data. Compelling story for security vendors; significant engineering cost. Honest claim only if model improves measurably from federation versus per-tenant baseline.
- Active learning. Model queries analyst for labels on uncertain examples. Maximizes labeling ROI when SOC time is the bottleneck.
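One common way to realize the active-learning item above is uncertainty sampling: send the analyst the examples whose score sits closest to the decision boundary. A minimal sketch, assuming a binary scorer that outputs a probability and a fixed labeling budget; the alert IDs and scores are made up.

```python
def select_for_labeling(scored, budget):
    """Pick the `budget` items whose score is closest to 0.5 (most uncertain)."""
    return sorted(scored, key=lambda item: abs(item[1] - 0.5))[:budget]

# (alert_id, model probability of "malicious"), hypothetical values
scored_alerts = [("a1", 0.97), ("a2", 0.52), ("a3", 0.08), ("a4", 0.41), ("a5", 0.66)]

to_label = select_for_labeling(scored_alerts, budget=2)
print([alert_id for alert_id, _ in to_label])
```

Confident predictions (a1, a3) are skipped, so analyst time goes to the labels most likely to change the model.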
Deployment patterns
- In-product real-time. Score every event inline. Latency budget tight (<10 ms typical). Model must be small, with features pre-computed and cached.
- Sidecar / async. Event published to queue, scored async. Higher latency budget (seconds). Larger models possible.
- Batch scoring. Periodic enrichment job over historical data. Largest models. Used for hunt and ranking, not for blocking.
- Edge / on-device. Endpoint-side ML. Constrained model size. Examples: on-device URL classifier, on-device process-behavior classifier.
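The sidecar / async pattern can be sketched with an in-process queue standing in for a real message broker. Everything here is an assumption for illustration: the `score` function is a placeholder rule rather than a model, and the event fields are invented.

```python
import queue
import threading

def score(event):
    # Placeholder for a larger model: hypothetical rule on bytes sent out.
    return min(1.0, event["bytes_out"] / 1_000_000)

def scoring_worker(events, results):
    """Consume events off the hot path and score them asynchronously."""
    while True:
        event = events.get()
        if event is None:  # sentinel: shut the worker down
            break
        results.append((event["id"], score(event)))

events = queue.Queue()
results = []
worker = threading.Thread(target=scoring_worker, args=(events, results))
worker.start()

# The producer only enqueues; it never waits on the model.
for i, size in enumerate([10_000, 500_000, 2_000_000]):
    events.put({"id": i, "bytes_out": size})
events.put(None)
worker.join()
print(results)
```

The producer's latency is the cost of an enqueue, which is what leaves room for a larger model on the consumer side.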
Analytics fit — what works and what fails
- Anomaly detection. Works on stable baselines (user login patterns, network egress volume by host). Fails on adversarial drift: attacker observes baseline, stays inside it. Also fails on concept drift: baseline itself moves due to legitimate change (new application, holiday traffic) producing false positives.
- Supervised classification. Works when labeled data is large + threat distribution stable + features extractable. Examples: domain-classification (DGA vs legit), URL phishing detection, malware family identification. Fails when novelty rate exceeds retraining cadence.
- Clustering. Useful for triage (group similar alerts, surface representative example) and for exploration (cluster newly seen samples to find emerging family). Weak as primary decision surface: no ground truth = no measurable accuracy.
- Ranking. Strong fit for SOC triage. Train on analyst dispositions (true positive / false positive) → rank new alerts by predicted disposition. Measurable in alert-to-resolution time reduction.
- UEBA-style risk scoring. Combine multiple weak signals into composite risk. Useful as a hunt input. Often over-claimed as a detection.
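The two anomaly-detection failure modes above can be shown with a simple z-score baseline. This is a minimal sketch, assuming a per-host daily egress baseline and a 3-sigma threshold; the numbers are invented and real detectors are usually more elaborate.

```python
import statistics

THRESHOLD = 3.0
baseline_mb = [100, 110, 95, 105, 90, 100, 108, 92]  # hypothetical daily egress (MB)

def zscore(value, baseline):
    """Distance from the baseline mean in sample standard deviations."""
    return (value - statistics.mean(baseline)) / statistics.stdev(baseline)

def is_anomalous(value, baseline, threshold=THRESHOLD):
    return abs(zscore(value, baseline)) > threshold

spike = is_anomalous(400, baseline_mb)         # crude exfiltration: flagged
low_and_slow = is_anomalous(112, baseline_mb)  # attacker stays inside baseline: missed
legit_shift = is_anomalous(130, baseline_mb)   # new application raises volume: false positive
print(spike, low_and_slow, legit_shift)
```

The detector catches only the attacker who ignores the baseline, and it fires on the legitimate change until the baseline is recomputed.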
Failure modes per technique
- Data-quality dependency. Garbage labels → garbage model. Vendor "AI" trained on biased label set fails on your data.
- FP/FN cost asymmetry. Threshold choice is a business question, not a model question. Authentication: false-positive cost = user friction; false-negative cost = breach. The chosen threshold determines which of the two errors the system makes more often.
- Model maintenance cost. 3-year deployment requires retraining cadence, feature-pipeline maintenance, drift monitoring, label-quality auditing. Most vendor demos ignore this.
- Adversarial drift. Attackers test against deployed models. Detection-as-code rulesets (Sigma) and ML models both decay; neither is "set and forget."
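To make the threshold point concrete, the sketch below picks the threshold that minimizes total error cost on a labeled validation set. The scores, labels, candidate thresholds, and cost figures are all hypothetical; in practice the costs come from the business, not from the model.

```python
def total_cost(threshold, scored, fp_cost, fn_cost):
    """Sum the cost of every false positive and false negative at this threshold."""
    cost = 0.0
    for score, is_malicious in scored:
        flagged = score >= threshold
        if flagged and not is_malicious:
            cost += fp_cost
        elif not flagged and is_malicious:
            cost += fn_cost
    return cost

def best_threshold(scored, fp_cost, fn_cost, candidates):
    return min(candidates, key=lambda t: total_cost(t, scored, fp_cost, fn_cost))

# (model score, ground-truth malicious?), hypothetical validation data
validation = [(0.95, True), (0.80, False), (0.70, True),
              (0.40, False), (0.30, True), (0.10, False)]
candidates = [0.2, 0.5, 0.9]

friction_heavy = best_threshold(validation, fp_cost=5, fn_cost=1, candidates=candidates)
breach_heavy = best_threshold(validation, fp_cost=1, fn_cost=50, candidates=candidates)
print(friction_heavy, breach_heavy)
```

The same model and data yield opposite thresholds once the relative cost of the two errors changes.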
Evaluating vendor ML claims
- "What features?" A vendor unwilling to disclose them often has a shallow model.
- "What's the FP rate on your data, on my data?" Customer-side measurement against a known representative window.
- "How often retrained, on what data?" Federated claims = verify or discount.
- "What's the explainability surface for an analyst?" SHAP, top-k feature contributions, or pure black box?
- "What's the operational cost of a wrong decision and how is the loop closed?" Labeling feedback path or fire-and-forget?
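The customer-side measurement asked for above amounts to comparing vendor verdicts with your own labels over a known window. A minimal sketch, assuming binary verdicts (1 = flagged) and analyst-confirmed labels (1 = malicious); the ten-event window is made up.

```python
def confusion(predictions, labels):
    """Return (tp, fp, tn, fn) for binary predictions against binary labels."""
    tp = fp = tn = fn = 0
    for pred, label in zip(predictions, labels):
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Hypothetical evaluation window: your labels versus the vendor's verdicts.
labels      = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

tp, fp, tn, fn = confusion(predictions, labels)
fp_rate = fp / (fp + tn)    # share of benign events that were flagged
precision = tp / (tp + fp)  # share of flagged events that were malicious
print(fp_rate, round(precision, 2))
```

Reporting precision alongside FP rate matters because a low FP rate on a high-volume stream can still mean most alerts are false.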
Rule of thumb
ML in security earns its keep when it ranks and prioritizes analyst attention. It struggles when promoted to autonomous decision-maker on novel threats. The honest deployment shape is "model surfaces candidates, analyst decides, decisions feed back into training."
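The deployment shape described above can be sketched as a small loop. The class name, alert fields, and dispositions below are hypothetical; the model score is assumed to be pre-computed and attached to each alert.

```python
from dataclasses import dataclass, field

@dataclass
class TriageLoop:
    """Model surfaces candidates, analyst decides, decisions become labels."""
    training_set: list = field(default_factory=list)

    def surface(self, alerts, top_n):
        # Rank by model score, highest first; the model only proposes.
        return sorted(alerts, key=lambda a: a["score"], reverse=True)[:top_n]

    def record(self, alert, disposition):
        # The analyst's decision is stored as a label for the next retrain.
        self.training_set.append((alert["id"], disposition))

alerts = [
    {"id": "a1", "score": 0.35},
    {"id": "a2", "score": 0.91},
    {"id": "a3", "score": 0.78},
]
analyst_decisions = {"a2": "true_positive", "a3": "false_positive"}  # hypothetical

loop = TriageLoop()
surfaced = loop.surface(alerts, top_n=2)
for alert in surfaced:
    loop.record(alert, analyst_decisions[alert["id"]])
print(loop.training_set)
```

Nothing in the loop blocks or remediates on its own; the model's only output is the order in which the analyst sees alerts.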