
NLP in Security — Reference.

Natural-language processing applied to security work: log clustering, phishing detection, report summarization, and where modern LLM-driven techniques fit (and don't).

Log clustering

  • Tokenization first. Replace numbers, IPs, UUIDs, paths with placeholders before vectorizing. Otherwise every line is unique.
  • TF-IDF + cosine. Cheap baseline. Works for medium-cardinality log corpora. Tune min_df to drop rare tokens that would otherwise dominate the weighting, and max_df to drop tokens that appear in almost every line.
  • Drain / Spell algorithms. Template-extraction algorithms designed for logs. Output is a tree of templates with placeholders; each new log line is assigned to a template. Best ROI for unstructured operational logs.
  • Embedding-based (sentence-transformers, BGE). Higher semantic quality, more compute. Worth it for English-language security text (alerts, ticket bodies) where tokens vary while meaning repeats.
  • Granularity knob. Tighter clusters mean more clusters, so the analyst's load drops less. Looser clusters mean fewer clusters, each larger but more heterogeneous. Calibrate per consumer: an analyst wants ~50–200 clusters per day per source.
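
The masking step from the first bullet can be sketched with a few regular expressions. This is a minimal illustration rather than a substitute for Drain or Spell: the patterns, the placeholder names (<UUID>, <IP>, <PATH>, <NUM>), and the helper names mask_line and template_counts are all assumptions made for this example, and real log sources will need their own patterns.

```python
import re
from collections import Counter

# Order matters: mask the most specific patterns first so that the digits
# inside a UUID or an IP address are not consumed by the generic number rule.
MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    # Absolute Unix-style paths that start a whitespace-delimited token.
    (re.compile(r"(?<!\S)/(?:[\w.-]+/)*[\w.-]+"), "<PATH>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def mask_line(line: str) -> str:
    """Replace variable fields with placeholders so similar lines collide."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line


def template_counts(lines):
    """Count how many raw lines collapse onto each masked template."""
    return Counter(mask_line(line) for line in lines)


if __name__ == "__main__":
    sample = [
        "conn from 10.0.0.1 port 22",
        "conn from 10.0.0.2 port 2222",
        "session 123e4567-e89b-12d3-a456-426614174000 closed",
    ]
    for template, count in template_counts(sample).most_common():
        print(count, template)
```

Exact-match grouping on the masked line is the crudest possible clustering. It mainly shows why masking has to come before any vectorizer: without it, the two connection lines above would never share a template.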

Phishing detection

  • Stable linguistic features.
    • Urgency lexicon: "immediately", "within 24 hours", "your account will be suspended".
    • Brand-impersonation cues: brand name + generic greeting + non-brand reply-to.
    • URL constructions: subdomain padding (microsoft.com.malicious.io), homoglyphs (Cyrillic а for Latin a), URL shorteners obscuring destination.
  • Image-embedded text problem. Modern campaigns ship the body as a single image, which leaves the classifier no text features. Counter: run OCR before NLP. Tesseract or a hosted vision API → text → classifier.
  • Multimodal classifier. Train on (rendered-screenshot, headers, body-text) tuple. Visual similarity to known brand login page is a stronger feature than any text feature alone.
  • Evaluation gotcha. Holdout must be temporally separated from training: train on earlier messages, evaluate on later ones. Phishing distribution drifts weekly, so a holdout drawn from the same period as the training data overstates production performance.
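
Two of the points above lend themselves to a short sketch: the URL constructions and the temporally separated holdout. The code below is one possible shape, under stated assumptions: the function names url_flags and temporal_split, the flag names, the three-entry SHORTENERS set, and the "received" field on each sample are hypothetical choices for illustration, not part of any particular product or dataset.

```python
from datetime import datetime
from urllib.parse import urlsplit

# Illustrative list only (assumption); a real deployment would maintain a fuller one.
SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}


def url_flags(url, brand_domains):
    """Return the set of suspicious URL constructions found in `url`."""
    host = (urlsplit(url).hostname or "").lower()
    flags = set()
    for brand in brand_domains:
        is_brand_host = host == brand or host.endswith("." + brand)
        # Subdomain padding: the brand domain appears inside the host,
        # but the host does not actually belong to the brand.
        if brand in host and not is_brand_host:
            flags.add("subdomain_padding")
    # Homoglyph hint: non-ASCII characters or punycode labels in the host.
    labels = host.split(".")
    if not host.isascii() or any(label.startswith("xn--") for label in labels):
        flags.add("non_ascii_host")
    if host in SHORTENERS:
        flags.add("shortener")
    return flags


def temporal_split(samples, cutoff):
    """Train on messages received before `cutoff`, hold out the rest."""
    train = [s for s in samples if s["received"] < cutoff]
    holdout = [s for s in samples if s["received"] >= cutoff]
    return train, holdout


if __name__ == "__main__":
    print(url_flags("http://microsoft.com.malicious.io/login", ["microsoft.com"]))
    mails = [
        {"id": 1, "received": datetime(2024, 1, 5)},
        {"id": 2, "received": datetime(2024, 2, 20)},
    ]
    train, holdout = temporal_split(mails, datetime(2024, 2, 1))
    print(len(train), len(holdout))
```

The non-ASCII check is only a hint that a homoglyph may be present; it will also fire on legitimate internationalized domains, so it is better treated as a feature for the classifier than as a verdict.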

Report summarization with LLMs

  • What LLMs do well on security text.
    • Factual extraction: pull IOCs, CVEs, dates from prose.
    • Cross-format normalization: convert vendor-specific advisory wording to internal schema.
    • Draft generation: first-pass summary, ticket body, advisory paragraph — analyst edits.
  • What LLMs are unreliable at.
    • Confidence calibration: the model sounds equally confident about correct and incorrect claims.
    • Novelty detection: trained on past data, it only weakly distinguishes "new" from "looks like past".
    • Attribution: asked to name a specific actor, it will hallucinate one whether or not the evidence supports it.
  • Honest deployment. LLM as drafter, analyst as approver. Logged feedback (accepted, edited, rejected) feeds prompt and model refinement.
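
The drafter/approver loop depends on the feedback actually being recorded in a form that can be compared across prompt versions. A minimal sketch of such a log follows; the record fields (draft_id, prompt_version), the Verdict enum, and the verdict_rates helper are hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPTED = "accepted"
    EDITED = "edited"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Feedback:
    draft_id: str
    prompt_version: str  # which prompt produced the draft (assumed field)
    verdict: Verdict


def verdict_rates(log, prompt_version):
    """Share of each verdict among drafts produced by one prompt version."""
    counts = Counter(f.verdict for f in log if f.prompt_version == prompt_version)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {verdict: counts[verdict] / total for verdict in Verdict}


if __name__ == "__main__":
    log = [
        Feedback("d1", "p1", Verdict.ACCEPTED),
        Feedback("d2", "p1", Verdict.EDITED),
        Feedback("d3", "p1", Verdict.REJECTED),
        Feedback("d4", "p1", Verdict.ACCEPTED),
    ]
    for verdict, rate in verdict_rates(log, "p1").items():
        print(verdict.value, rate)
```

Keeping "edited" separate from "accepted" matters here: a draft that the analyst had to rewrite is a weaker signal than one that went out unchanged.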

Prompt patterns for security LLMs

  • Cite-or-decline. "Cite source IDs from the provided documents for every claim. If no source supports the claim, say 'no source'." Reduces hallucination materially.
  • Structured output. JSON schema with named fields. Easier to validate downstream than free-form prose.
  • Two-pass. First pass extracts; second pass evaluates whether the extraction is correct. Catches obvious errors.
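
Cite-or-decline and structured output combine naturally: if the model returns JSON, the citations can be checked mechanically before an analyst sees the draft. The sketch below assumes a hypothetical output schema, {"claims": [{"text": ..., "sources": [...]}]}, in which "sources" is either a list of document IDs or the literal string "no source". The schema and the validate_claims helper are illustrative assumptions, not a standard.

```python
import json

NO_SOURCE = "no source"


def validate_claims(raw_output, known_source_ids):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    claims = data.get("claims") if isinstance(data, dict) else None
    if not isinstance(claims, list):
        return ["top-level 'claims' list is missing"]
    problems = []
    for i, claim in enumerate(claims):
        if not isinstance(claim, dict) or not isinstance(claim.get("text"), str):
            problems.append(f"claim {i}: missing 'text'")
            continue
        sources = claim.get("sources")
        if sources == NO_SOURCE:
            continue  # the model declined, which is an acceptable answer
        if not isinstance(sources, list) or not sources:
            problems.append(f"claim {i}: no citation and no explicit '{NO_SOURCE}'")
            continue
        # A citation only counts if it points at a document we actually supplied.
        unknown = [s for s in sources
                   if not isinstance(s, str) or s not in known_source_ids]
        if unknown:
            problems.append(f"claim {i}: cites unknown sources {unknown}")
    return problems


if __name__ == "__main__":
    raw = json.dumps({"claims": [
        {"text": "Exploited in the wild.", "sources": ["doc-1"]},
        {"text": "Patch available.", "sources": "no source"},
        {"text": "Affects version 2.", "sources": ["doc-9"]},
    ]})
    for problem in validate_claims(raw, {"doc-1", "doc-2"}):
        print(problem)
```

A check like this only confirms that a cited ID exists, not that the cited document supports the claim. That second question is what the evaluation pass, or the analyst, still has to answer.
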
Rule of thumb. For pre-LLM NLP, the right tool is usually Drain (logs) or sentence-transformers (security text). For LLM-driven work, give the model documents to ground on and require citation: most of the value of an LLM in a security pipeline is structured extraction with a verifiable trail, not freeform analysis.
