NLP in Security — Reference.

Natural-language processing applied to security work: log clustering, phishing detection, report summarization, and where modern LLM-driven techniques fit (and don't).

Log clustering

Tokenization first. Replace numbers, IPs, UUIDs, and paths with placeholders before vectorizing. Otherwise almost every line is unique and nothing clusters.
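A minimal sketch of the masking step, assuming regex-based placeholders are good enough for the source at hand. The placeholder names (`<UUID>`, `<IP>`, `<PATH>`, `<NUM>`) and the patterns are illustrative choices, not a standard; IPv6 addresses and Windows paths are not covered.

```python
import re

# Order matters: mask the most specific patterns first so that the digits
# inside a UUID or an IP are not consumed by the generic number rule.
MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),       # IPv4 only
    (re.compile(r"(?<!\w)/[\w.\-]+(?:/[\w.\-]+)*"), "<PATH>"),  # Unix-style paths
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def mask_line(line: str) -> str:
    """Replace variable fields with placeholders so similar lines become identical."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line
```

After masking, two lines that differ only in user ID, address, or file path collapse to the same string, which is what makes the later vectorizing or template steps effective.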
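TF-IDF + cosine. Cheap baseline. Works for medium-cardinality log corpora. Tune min_df to drop very rare tokens (TF-IDF upweights them, so they otherwise dominate similarity) and max_df to drop tokens that appear in almost every line.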
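To make the baseline concrete, here is a standard-library sketch of TF-IDF weighting with cosine similarity. The `min_df` and `max_df` parameters mirror the names used by scikit-learn's `TfidfVectorizer`, but this is a simplified stand-in (whitespace tokenization, no normalization step), not that library's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=1, max_df=1.0):
    """Return one sparse TF-IDF dict per document.

    min_df: drop tokens seen in fewer than this many documents (absolute count).
    max_df: drop tokens seen in more than this fraction of documents.
    """
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = {t for t, c in df.items() if c >= min_df and c / n <= max_df}
    vectors = []
    for toks in tokenized:
        tf = Counter(t for t in toks if t in vocab)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0
```

Lowering `max_df` removes tokens such as a placeholder that appears in every line, so unrelated templates stop looking similar just because they share it.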
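Drain / Spell algorithms. Template-extraction algorithms designed for logs. Drain routes each line through a fixed-depth parse tree; Spell matches on longest common subsequence. Either way the output is a set of templates with placeholders, and each new log line is assigned to a template. Best ROI for unstructured operational logs.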
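The idea of assigning each line to a template can be illustrated with a toy miner. This is a heavily simplified, Drain-inspired sketch (bucket by token count and first token, then match by token overlap); it is not the Drain algorithm itself, and the class name, `<*>` wildcard, and default threshold are assumptions made for illustration.

```python
class TemplateMiner:
    """Toy template miner: bucket by (token count, first token), match by overlap."""

    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        self.buckets = {}  # (length, first token) -> list of templates (token lists)

    @staticmethod
    def _similarity(template, tokens):
        # Fraction of positions where the template and the line agree exactly.
        same = sum(1 for t, tok in zip(template, tokens) if t == tok)
        return same / len(tokens)

    def add(self, line):
        """Assign a line to a template (creating one if needed) and return it."""
        tokens = line.split()
        if not tokens:
            return ""
        bucket = self.buckets.setdefault((len(tokens), tokens[0]), [])
        best = max(bucket, key=lambda t: self._similarity(t, tokens), default=None)
        if best is not None and self._similarity(best, tokens) >= self.sim_threshold:
            # Generalize in place: any position that disagrees becomes a wildcard.
            for i, tok in enumerate(tokens):
                if best[i] != tok:
                    best[i] = "<*>"
            return " ".join(best)
        bucket.append(list(tokens))
        return " ".join(tokens)
```

Feeding it `Accepted password for alice from host-a` and then `Accepted password for bob from host-b` yields the template `Accepted password for <*> from <*>`. For production use, an established Drain implementation is the safer choice.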
Embedding-based (sentence-transformers, BGE). Higher semantic quality, more compute. Worth it for English-language security text (alerts, ticket bodies) where tokens vary while meaning repeats.
Granularity knob. Tighter clusters mean more clusters and less load reduction per cluster. Looser clusters mean fewer, larger, but more heterogeneous clusters. Calibrate per consumer: an analyst typically wants ~50–200 clusters per day per source.

Phishing detection

Stable linguistic features.
- Urgency lexicon: "immediately", "within 24 hours", "your account will be suspended".
- Brand-impersonation cues: brand name + generic greeting + non-brand reply-to.
- URL constructions: subdomain padding (microsoft.com.malicious.io), homoglyphs (Cyrillic а for Latin a), URL shorteners obscuring destination.
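The URL cues in the list above can be turned into simple boolean features. The sketch below uses only the standard library; the brand and shortener sets are tiny illustrative samples, the registered-domain logic is naive (it ignores multi-part suffixes such as co.uk), and the script check only catches hosts that mix scripts, not hosts written entirely in a lookalike script.

```python
import unicodedata
from urllib.parse import urlsplit

# Illustrative lists only; a real deployment would maintain much larger ones.
BRAND_DOMAINS = {"microsoft.com", "paypal.com"}
SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}

def url_features(url: str) -> dict:
    """Extract three coarse phishing cues from a URL's hostname."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    registered = ".".join(labels[-2:])  # naive: last two labels
    # Subdomain padding: a brand domain appears inside the host,
    # but the registered domain is something else.
    padded = any(
        host.startswith(brand + ".") or ("." + brand + ".") in host
        for brand in BRAND_DOMAINS
    ) and registered not in BRAND_DOMAINS
    # Mixed scripts: letters from more than one script (e.g. Latin plus Cyrillic).
    scripts = {unicodedata.name(ch, "").split(" ")[0] for ch in host if ch.isalpha()}
    return {
        "subdomain_padding": padded,
        "mixed_script": len(scripts) > 1,
        "shortener": host in SHORTENERS,
    }
```

These features are meant as inputs to a classifier alongside the lexical cues, not as verdicts on their own.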
Image-embedded text problem. Many modern campaigns ship the body as a single image, leaving no text features for the classifier. Counter: run an OCR pipeline before NLP. Tesseract or a hosted vision API → text → classifier.
Multimodal classifier. Train on a (rendered screenshot, headers, body text) tuple. Visual similarity to a known brand login page is often a stronger feature than any text feature alone.
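Evaluation gotcha. The holdout must be temporally separated from training: train on earlier messages, evaluate on later ones. The phishing distribution drifts weekly, so a random split over the same time window leaks future campaigns into training and overstates production performance.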
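A temporal holdout is a small amount of code. The sketch below assumes each sample is a tuple whose first element is a `datetime`; the tuple layout and function name are hypothetical.

```python
from datetime import datetime

def temporal_split(samples, cutoff):
    """Split samples into train (before cutoff) and holdout (at or after cutoff).

    Each sample is assumed to be a tuple whose first element is a datetime.
    """
    train = [s for s in samples if s[0] < cutoff]
    holdout = [s for s in samples if s[0] >= cutoff]
    return train, holdout
```

Every holdout message is then strictly newer than every training message, which is closer to how the classifier will meet new campaigns in production than a random split is.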

Report summarization with LLMs

What LLMs do well on security text.
- Factual extraction: pull IOCs, CVEs, dates from prose.
- Cross-format normalization: convert vendor-specific advisory wording to internal schema.
- Draft generation: first-pass summary, ticket body, advisory paragraph — analyst edits.
What LLMs are unreliable at.
- Confidence calibration: stated confidence does not track correctness; the model sounds equally sure of correct and incorrect claims.
- Novelty detection: trained on past data, so it only weakly distinguishes "new" from "looks like past".
- Attribution: tends to name a threat actor when asked for one, even when the evidence does not support it.
Honest deployment. LLM as drafter, analyst as approver. Logged feedback (accepted, edited, rejected) feeds prompt and model refinement.

Prompt patterns for security LLMs

Cite-or-decline. "Cite source IDs from the provided documents for every claim. If no source supports the claim, say 'no source'." Reduces hallucination materially.
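The pattern is most useful when the citations are checked mechanically. Below is a minimal sketch of such a check, assuming the model is told to cite as `[S1]`, `[S2]`, and so on, and to write "no source" when it declines; both conventions are hypothetical and would need to match the actual prompt.

```python
import re

CITATION = re.compile(r"\[(S\d+)\]")  # hypothetical citation format: [S1], [S2], ...

def check_citations(answer_lines, source_ids):
    """Return (line, reason) pairs for claims that are uncited or cite unknown sources."""
    known = set(source_ids)
    problems = []
    for line in answer_lines:
        text = line.strip()
        # Skip blank lines and lines where the model explicitly declined.
        if not text or "no source" in text.lower():
            continue
        cited = set(CITATION.findall(text))
        if not cited:
            problems.append((line, "uncited"))
        elif not cited <= known:
            problems.append((line, "unknown source"))
    return problems
```

This only verifies that a cited ID exists among the provided documents, not that the document supports the claim; that second question still needs an analyst or a further check.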
Structured output. JSON schema with named fields. Easier to validate downstream than free-form prose.
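As an illustration of downstream validation, the sketch below parses a model reply and rejects anything that does not match a small schema. The field names (`cves`, `iocs`, `summary`) are a hypothetical internal schema, and the check is deliberately shallow (top-level keys and types only).

```python
import json

# Hypothetical internal schema: field name -> expected Python type after json.loads.
SCHEMA = {"cves": list, "iocs": list, "summary": str}

def parse_extraction(raw: str) -> dict:
    """Parse model output and raise ValueError if it does not match SCHEMA."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    missing = SCHEMA.keys() - data.keys()
    extra = data.keys() - SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in SCHEMA.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return data
```

A rejected reply can be retried or routed to an analyst, which is much harder to arrange when the output is free-form prose.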
Two-pass. First pass extracts; second pass evaluates whether the extraction is correct. Catches obvious errors.
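A sketch of the two-pass flow, under the assumption that the LLM client can be wrapped as a function taking a prompt string and returning text. The `complete` parameter and the prompt wording are hypothetical; the verbatim-substring guard is an extra deterministic check added here, not part of the pattern as stated.

```python
from typing import Callable, List

def two_pass_extract(document: str, complete: Callable[[str], str]) -> List[str]:
    """Extract candidate IOCs, then keep only those the second pass confirms.

    `complete` is a hypothetical stand-in for whatever LLM client is in use:
    it takes a prompt string and returns the model's text reply.
    """
    # Pass 1: extraction.
    first = complete(f"List every IOC in the text, one per line.\n\n{document}")
    candidates = [line.strip() for line in first.splitlines() if line.strip()]
    confirmed = []
    for item in candidates:
        # Cheap deterministic guard before spending a second model call:
        # an extracted value that never appears in the text is discarded.
        if item not in document:
            continue
        # Pass 2: evaluation of a single extracted item.
        verdict = complete(
            f"Does the text below contain the IOC '{item}'? Answer yes or no.\n\n{document}"
        )
        if verdict.strip().lower().startswith("yes"):
            confirmed.append(item)
    return confirmed
```

Keeping the second pass scoped to one item at a time makes each verdict easy to log and audit alongside the analyst feedback.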

Rule of thumb. For pre-LLM NLP, the right tool is usually Drain (logs) or sentence-transformers (security text). For LLM-driven work, give the model documents to ground on and require citation: most of the value of an LLM in a security pipeline is structured extraction with a verifiable trail, not freeform analysis.
