NLP in Security — Reference.
Natural-language processing applied to security work: log clustering, phishing detection, report summarization, and where modern LLM-driven techniques fit (and don't).
Log clustering
- Tokenization first. Replace numbers, IPs, UUIDs, paths with placeholders before vectorizing. Otherwise every line is unique.
- TF-IDF + cosine. Cheap baseline. Works for medium-cardinality log corpora. Tune min_df to keep rare tokens from dominating and max_df to drop near-universal ones.
- Drain / Spell algorithms. Template-extraction algorithms designed for logs. Output is a tree of templates with placeholders; each new log line is assigned to a template. Best ROI for unstructured operational logs.
- Embedding-based (sentence-transformers, BGE). Higher semantic quality, more compute. Worth it for English-language security text (alerts, ticket bodies) where tokens vary while meaning repeats.
- Granularity knob. Tighter clusters = more clusters = less load reduction per cluster. Looser clusters = fewer clusters = larger but more heterogeneous. Calibrate per consumer: analyst wants ~50–200 clusters per day per source.
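
The placeholder step from the first bullet can be sketched with the standard library alone. This is a minimal illustration, not Drain or Spell: the regexes, the placeholder names (<UUID>, <IP>, <PATH>, <NUM>) and the helper names mask and group_by_template are assumptions made for the example, and grouping by the exact masked string is the crudest possible form of template assignment.

```python
import re
from collections import defaultdict

# Order matters: mask the most specific patterns first so the digits inside
# an IP address or UUID are not consumed by the generic number rule.
# These patterns are illustrative assumptions, not an exhaustive set.
MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"(?<!\S)/\S+"), "<PATH>"),  # whitespace-delimited token starting with "/"
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def mask(line: str) -> str:
    """Replace variable fields with placeholders so similar lines collapse."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line


def group_by_template(lines):
    """Group raw log lines by their masked form (a very crude 'template')."""
    groups = defaultdict(list)
    for line in lines:
        groups[mask(line)].append(line)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "Failed login for uid 1001 from 10.0.0.5 port 51234",
        "Failed login for uid 1002 from 192.168.1.9 port 40022",
        "Opened /var/log/auth.log for session 3f2504e0-4f89-11d3-9a0c-0305e82c3301",
    ]
    for template, members in group_by_template(sample).items():
        print(len(members), template)
```

The masked strings are also a reasonable input to a TF-IDF vectorizer or an embedding model, since the placeholders keep every line from looking unique.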
Phishing detection
- Stable linguistic features.
  - Urgency lexicon: "immediately", "within 24 hours", "your account will be suspended".
  - Brand-impersonation cues: brand name + generic greeting + non-brand reply-to.
  - URL constructions: subdomain padding (microsoft.com.malicious.io), homoglyphs (Cyrillic а for Latin a), URL shorteners obscuring destination.
- Image-embedded text problem. Modern campaigns ship body as a single image (no text features for classifier). Counter: OCR pipeline before NLP. tesseract or hosted Vision API → text → classifier.
- Multimodal classifier. Train on (rendered-screenshot, headers, body-text) tuple. Visual similarity to known brand login page is a stronger feature than any text feature alone.
- Evaluation gotcha. Holdout must be temporally separated from training. Phishing distribution drifts weekly; in-time evaluation overstates production performance.
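
The URL constructions listed above can be turned into classifier features with a few string checks. The sketch below is one possible shape under stated assumptions: the BRAND_DOMAINS and SHORTENERS sets are tiny hypothetical stand-ins, url_cues is an invented helper name, and the "last two labels" rule for the registrable domain is a known simplification that a Public Suffix List lookup would replace in practice.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny lists for illustration only.
BRAND_DOMAINS = {"microsoft.com", "paypal.com"}
SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}


def url_cues(url: str) -> dict:
    """Return boolean lexical cues for a single URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Naive registrable domain: the last two labels. This is wrong for
    # suffixes such as co.uk and is only an assumption for the sketch.
    registrable = ".".join(labels[-2:])

    # Subdomain padding: a brand domain appears as leading labels of a host
    # whose registrable domain is something else entirely.
    dotted = "." + host
    subdomain_padding = any(
        ("." + brand + ".") in dotted and registrable != brand
        for brand in BRAND_DOMAINS
    )

    # Homoglyph hint: any non-ASCII character in the host, or a punycode label.
    # This flags legitimate internationalized domains too, so treat it as a
    # feature for the classifier rather than a verdict.
    non_ascii_host = (not host.isascii()) or any(
        label.startswith("xn--") for label in labels
    )

    return {
        "subdomain_padding": subdomain_padding,
        "non_ascii_host": non_ascii_host,
        "shortener": host in SHORTENERS,
    }


if __name__ == "__main__":
    for u in (
        "https://microsoft.com.malicious.io/login",
        "https://p\u0430ypal.com/",  # \u0430 is Cyrillic а
        "https://bit.ly/abc",
    ):
        print(u, url_cues(u))
```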
Report summarization with LLMs
- What LLMs do well on security text.
  - Factual extraction: pull IOCs, CVEs, dates from prose.
  - Cross-format normalization: convert vendor-specific advisory wording to internal schema.
  - Draft generation: first-pass summary, ticket body, advisory paragraph, which the analyst then edits.
- What LLMs are unreliable at.
  - Confidence calibration: the model sounds equally confident about correct and incorrect claims.
  - Novelty detection: trained on past data, so it only weakly distinguishes "new" from "looks like past".
  - Attribution: hallucinates actor attribution if asked for a specific actor.
- Honest deployment. LLM as drafter, analyst as approver. Logged feedback (accepted, edited, rejected) feeds prompt and model refinement.
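
One way to make the logged feedback usable is to record each analyst decision as a small structured record keyed by prompt version. The sketch below is an assumption about what such a log could look like; the DraftFeedback fields, the to_jsonl and verdict_rates helpers, and the version labels are all hypothetical.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

VERDICTS = {"accepted", "edited", "rejected"}


@dataclass
class DraftFeedback:
    """One analyst decision on one LLM draft (hypothetical record shape)."""
    draft_id: str
    prompt_version: str
    verdict: str
    final_text: str = ""  # what the analyst actually shipped, if anything

    def __post_init__(self):
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def verdict_rates(records) -> dict:
    """Share of each verdict per prompt version."""
    counts = {}
    for r in records:
        counts.setdefault(r.prompt_version, Counter())[r.verdict] += 1
    return {
        version: {v: c[v] / sum(c.values()) for v in sorted(VERDICTS)}
        for version, c in counts.items()
    }


if __name__ == "__main__":
    log = [
        DraftFeedback("d1", "v1", "accepted"),
        DraftFeedback("d2", "v1", "rejected"),
        DraftFeedback("d3", "v2", "edited", final_text="Revised summary."),
    ]
    print(to_jsonl(log))
    print(verdict_rates(log))
```

Comparing verdict shares across prompt versions gives a rough signal of whether a prompt change helped, provided the drafts being compared are of similar difficulty.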
Prompt patterns for security LLMs
- Cite-or-decline. "Cite source IDs from the provided documents for every claim. If no source supports the claim, say 'no source'." Reduces hallucination materially.
- Structured output. JSON schema with named fields. Easier to validate downstream than free-form prose.
- Two-pass. First pass extracts; second pass evaluates whether the extraction is correct. Catches obvious errors.
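
The cite-or-decline and structured-output patterns combine naturally into a downstream check: parse the JSON, then confirm every claim either cites a document that was actually provided or declines with "no source". The sketch below assumes a hypothetical output shape (a top-level "claims" list whose items carry "claim" and "source_ids"); the function name validate_extraction and the document IDs are invented for illustration.

```python
import json

REQUIRED_FIELDS = {"claim", "source_ids"}


def validate_extraction(raw: str, known_source_ids: set) -> list:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict) or not isinstance(data.get("claims"), list):
        return ["top level must be an object with a 'claims' list"]

    problems = []
    for i, item in enumerate(data["claims"]):
        if not isinstance(item, dict) or not REQUIRED_FIELDS.issubset(item):
            problems.append(f"claim {i}: missing required fields")
            continue
        ids = item["source_ids"]
        if ids == "no source":
            continue  # the model declined explicitly, which is allowed
        if not isinstance(ids, list) or not ids:
            problems.append(f"claim {i}: source_ids must be a non-empty list or 'no source'")
            continue
        # A cited ID that was never provided is treated as a hallucinated citation.
        unknown = [s for s in ids if not isinstance(s, str) or s not in known_source_ids]
        if unknown:
            problems.append(f"claim {i}: unknown source IDs {unknown}")
    return problems


if __name__ == "__main__":
    # Made-up model output and document IDs, for illustration only.
    output = json.dumps({
        "claims": [
            {"claim": "Patch is available", "source_ids": ["doc-1"]},
            {"claim": "Actor attribution", "source_ids": "no source"},
            {"claim": "Exploited in the wild", "source_ids": ["doc-9"]},
        ]
    })
    print(validate_extraction(output, {"doc-1", "doc-2"}))
```

This only verifies that citations point at provided documents, not that the documents support the claim; that second question is what the extraction-then-evaluation pass is for.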
Rule of thumb
For pre-LLM NLP, the right tool is usually Drain (logs) or sentence-transformers (security text). For LLM-driven work, give the model documents to ground on and require citation: most of the value of an LLM in a security pipeline is structured extraction with a verifiable trail, not freeform analysis.