Text & Code
Text and code attribution for LLMs and coding assistants.
From book-length memorization to GitHub license contamination. Detection, attribution, and evidence for text-based generative models.
Methods
What we run.
- nv-recall: Stanford Algorithm 1, verbatim memorization detection
- Min-K%++: Likelihood-based MIA (Zhang et al., ICLR 2025)
- DE-COP: Black-box multiple-choice probing (Duarte et al.)
- Cooper PDE: Probabilistic discoverable extraction (Cooper et al., 2024)
- CodeMIA: Code-specific membership inference
- LLM clone detector: Model lineage analysis
- License contamination: GPL / MIT / Apache violation detection
- RAG two-stage analysis: For retrieval-augmented systems
- Concept memorization: Distributional memorization via MMD / KS
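To make the probabilistic discoverable extraction entry more concrete, here is a minimal sketch of the arithmetic that idea usually rests on: if one sampled generation reproduces a target passage with probability p_z, then at least one of n independent samples reproduces it with probability 1 - (1 - p_z)^n. The function names and the list-of-token-log-probabilities input are assumptions made for illustration, not the production pipeline, and the sketch assumes the log-probabilities already reflect the sampling scheme in use.

```python
import math


def sequence_probability(token_logprobs):
    """Probability that one sampled generation emits the whole target suffix.

    token_logprobs: natural-log probabilities of each target token given the
    prompt and the preceding target tokens (hypothetical input format).
    """
    return math.exp(sum(token_logprobs))


def extraction_probability(p_z, n):
    """Chance that at least one of n independent samples reproduces the suffix."""
    return 1.0 - (1.0 - p_z) ** n


def queries_needed(p_z, p):
    """Smallest n for which extraction_probability(p_z, n) reaches p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if p_z <= 0.0:
        return math.inf  # the suffix can never be sampled
    if p_z >= 1.0:
        return 1  # the suffix is emitted on every sample
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - p_z))


if __name__ == "__main__":
    # Toy numbers, not measurements from any real model.
    logprobs = [math.log(0.5), math.log(0.5)]
    p_z = sequence_probability(logprobs)
    print(p_z, queries_needed(p_z, 0.9))
```

A passage that looks rare on a single sample can still be reliably extractable over many queries, which is why a per-sample probability is more informative than a single yes/no greedy-decoding check.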
Use cases
Who this is for.
- Author estate monitoring a frontier model for book memorization.
- Publisher pursuing a Kadrey-style matter.
- Software vendor investigating coding-assistant license contamination.
- News org checking for article-level extraction.
Pricing anchor. $2K–$5K per (work, model) pair. For comparison, the Bartz v. Anthropic settlement came to roughly $3K per work.