
# citeformer

Verifiably-cited LLM text via constrained decoding

v0.3.0 (stable) · Released 2026-04-25 · Python · `pip install citeformer`

## What It Is
citeformer makes citation fabrication structurally impossible. Before a language model picks its next token, citeformer compiles a tiny grammar that only admits citation markers pointing at sources you actually supplied, and hands that grammar to the decoder via XGrammar / llguidance / GBNF (locally) or strict structured outputs (across modern API providers). Out-of-scope [N] tokens get masked to zero probability before sampling — the sampler never sees them. Bibliographies are rendered deterministically by the library in six academic styles (APA-7, MLA, Chicago, IEEE, Nature, Vancouver), and every emitted claim can be NLI-verified against its cited source after the fact.
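The per-call grammar compilation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not citeformer's actual code; the function name `compile_cite_grammar` is hypothetical:

```python
def compile_cite_grammar(num_sources: int) -> str:
    """Build a GBNF fragment whose cite-id terminal admits only the
    citation markers [1]..[num_sources]. Any other [N] is simply not
    derivable from the grammar, so the decoder can never emit it."""
    if num_sources < 1:
        raise ValueError("need at least one source")
    # Alternation over the exact decimal strings "1" | "2" | ... | "N"
    ids = " | ".join(f'"{i}"' for i in range(1, num_sources + 1))
    return f'cite-id ::= "[" ({ids}) "]"'

print(compile_cite_grammar(3))
# cite-id ::= "[" ("1" | "2" | "3") "]"
```

With three sources supplied, the compiled terminal admits exactly `[1]`, `[2]`, and `[3]`; a marker like `[4]` is not in the language at all.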
## Why It Matters
LLM-generated citations are wrong 14–95% of the time depending on the benchmark; RAG systems still fabricate 3–13% of cited URLs; NeurIPS 2025 accepted ~50 papers with AI-generated fake references. Prompting doesn't fix it; post-hoc verification doesn't fix it. The only real fix is structural — make the invalid output token-impossible before the model reaches the decision point. citeformer delivers that contract across ten backends (HF, vLLM, llama.cpp, OpenAI, Anthropic, Gemini, Mistral, Fireworks, OpenRouter, Together), proven across a 40-run multi-prompt sweep at 0.0 ± 0.0 fabrication — the std is identically zero because the guarantee is a contract, not a mean.
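Why pre-sampling masking is airtight: setting a disallowed token's logit to negative infinity gives it exactly zero probability after softmax, so no temperature, top-p, or retry scheme can ever surface it. A toy illustration of the idea (not citeformer's implementation):

```python
import math

def mask_out_of_scope(logits: list[float], allowed: set[int]) -> list[float]:
    """Set logits of disallowed token ids to -inf so softmax assigns
    them exactly zero probability: the sampler never sees them."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def softmax(xs: list[float]) -> list[float]:
    m = max(x for x in xs if x != -math.inf)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocab of four "tokens"; only ids 0 and 2 are grammar-legal here.
probs = softmax(mask_out_of_scope([1.0, 3.0, 0.5, 2.0], allowed={0, 2}))
print(probs[1], probs[3])  # both exactly 0.0
```

Because the disallowed probabilities are identically zero, a sweep over many runs reports zero fabrication with zero variance: the guarantee is a contract, not a statistic.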
## What Ships in 0.3
- **Logit-masked GBNF** — the `cite-id` terminal is compiled per call to `"[" ("1" | "2" | ... | "N") "]"` and handed to XGrammar (default) or llguidance; out-of-scope tokens are masked before sampling
- **Ten backends, two enforcement loci, one `GenerationResult`** — local backends (HF, vLLM, llama.cpp) enforce in-process; API backends (OpenAI, Anthropic, Gemini, Mistral, OpenRouter, Together) enforce inside the provider runtime via strict structured outputs; Fireworks accepts citeformer's GBNF natively, unchanged
- **Six hand-written CSL formatters** — APA-7, MLA, Chicago, IEEE, Nature, Vancouver — ~1 kLOC, no `citeproc-py` dependency; 300 locked snapshots pin formatter outputs
- **`result.verify()`** — DeBERTa-v3-large-MNLI entailment per `(source, cited sentence)` pair, returning a typed `VerificationReport` with a coverage check for uncited-but-entailed sentences
- **Source adapters** — `Source.from_doi(...)`, `Source.from_arxiv(...)`, and raw-content `Source(metadata=..., content=...)`; httpx + pypdf + GROBID + readability for fetch and parse
- **Streaming** — token-level streaming preserved across all backends; the structural guarantee holds mid-stream
- **HF Space demo** — under `hf-space/`, runs the adversarial "100% → 0% fabrication" swing on CPU in a browser
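For a flavor of what a hand-written deterministic formatter does (same metadata in, byte-identical string out, every time), here is a heavily simplified APA-7-style sketch. The helper `format_apa7` is hypothetical and omits what a real formatter must handle: author initials, 21+-author truncation, DOIs, and italics:

```python
def format_apa7(authors: list[str], year: int, title: str, journal: str) -> str:
    """Minimal APA-7-flavored journal-article entry. Deterministic:
    no model in the loop, so the output can be pinned by snapshot tests."""
    names = ", ".join(authors[:-1]) + (", & " if len(authors) > 1 else "") + authors[-1]
    return f"{names} ({year}). {title}. {journal}."

print(format_apa7(["Smith, J.", "Lee, K."], 2024,
                  "Constrained decoding for citations",
                  "Journal of Reliable NLP"))
# Smith, J., & Lee, K. (2024). Constrained decoding for citations. Journal of Reliable NLP.
```

Locked snapshots then pin exactly these rendered strings, which is why a style regression shows up as a test diff rather than a silent change.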
## Ecosystem Fit
citeformer slots into any RAG pipeline that emits cited prose: drop it in front of the LLM call, hand it your `Source` list, and `result.text` is guaranteed not to contain `[N]` for `N > len(sources)`. Apache-2.0, Python 3.11+ (tested through 3.14). The literature-review notebook (`examples/08_literature_review.ipynb`) walks end-to-end from arXiv fetch → grammar-constrained generation → NLI verification → APA-7 bibliography on a laptop-friendly 500 MB model.
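The contract is also cheap to audit after the fact. A minimal, library-independent checker for the "no `[N]` beyond `len(sources)`" guarantee; the regex and the helper name are assumptions for illustration, not citeformer API:

```python
import re

CITE = re.compile(r"\[(\d+)\]")

def fabricated_citations(text: str, num_sources: int) -> list[int]:
    """Return every cited index that points outside the supplied
    source list; an empty list means the contract held."""
    return [n for n in (int(m) for m in CITE.findall(text))
            if not 1 <= n <= num_sources]

print(fabricated_citations("Water boils at 100 C [1]; see also [4].", 3))
# [4]
```

Under grammar-constrained decoding this function returns `[]` by construction; in a regex-and-retry pipeline it is the thing you loop on.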
## When to Reach For It
- Any RAG pipeline where fabricated citations are a correctness failure, not a UX nit
- Academic / research workflows that need deterministic bibliography rendering across styles
- Audit-grade applications where every `[N]` must be traceable and entailment-verifiable
- Replacing the "regex + retry" citation-validation loop with a pre-sampling structural guarantee