Text & Code

Text and code attribution for LLMs and coding assistants.

From book-length memorization to GitHub-license contamination. Detection, attribution, and evidence for text-based generative models.

Methods

What we run.

  • nv-recall
    Stanford Algorithm 1 — verbatim memorization detection
  • Min-K%++
    Likelihood-based MIA (Zhang et al., ICLR 2025)
  • DE-COP
    Black-box multiple-choice probing (Duarte et al.)
  • Cooper PDE
    Probabilistic discoverable extraction (Cooper et al., 2024)
  • CodeMIA
    Code-specific membership inference
  • LLM clone detector
    Model lineage analysis
  • License contamination
    GPL / MIT / Apache violation detection
  • RAG two-stage analysis
    For retrieval-augmented systems
  • Concept memorization
    Distributional memorization via MMD / KS
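
To make the likelihood-based entry concrete, here is a minimal sketch of a Min-K%++-style score. It assumes white-box access to the model's per-position logits. The function name `min_k_pp_score` and the default `k=0.2` are illustrative assumptions, not the pipeline's actual code. Each token's log-probability is normalized by the mean and standard deviation of log-probabilities under the model's own next-token distribution, and the lowest k% of the normalized scores are averaged. A higher score suggests the text is more likely to have been in training data.

```python
import numpy as np

def min_k_pp_score(logits, token_ids, k=0.2):
    """Min-K%++-style membership score for one sequence (illustrative sketch).

    logits:    array of shape (T, V), next-token logits at each position
    token_ids: array of shape (T,), the tokens that actually occurred
    k:         fraction of lowest-scoring tokens to average (assumed default)
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)

    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)

    # Mean and std of log p(z | prefix) with z drawn from the model itself.
    mu = (probs * log_probs).sum(axis=-1)
    var = (probs * log_probs ** 2).sum(axis=-1) - mu ** 2
    sigma = np.sqrt(np.clip(var, 1e-12, None))

    # Normalized score of the token that actually occurred at each position.
    observed = log_probs[np.arange(len(token_ids)), token_ids]
    token_scores = (observed - mu) / sigma

    # Average the lowest k% of token scores.
    n = max(1, int(len(token_scores) * k))
    return float(np.sort(token_scores)[:n].mean())

# Usage with synthetic logits (no real model involved).
rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(50, 100))
demo_tokens = demo_logits.argmax(axis=1)
print(min_k_pp_score(demo_logits, demo_tokens))
```

In practice the score is compared against a threshold calibrated on texts with known membership status; the threshold is not part of this sketch.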

Use cases

Who this is for.

  • Author estate monitoring a frontier model for book memorization.
  • Publisher pursuing a Kadrey-style matter.
  • Software vendor investigating coding-assistant license contamination.
  • News org checking for article-level extraction.

Pricing anchor. $2K–$5K per work × model pair. For comparison, the Bartz settlement came to about $3K per work.