NexusRAG

Ask questions across your research papers and get answers with citations you can check. NexusRAG runs entirely on your machine and ships with a reproducible benchmark, so every retrieval claim is measured against ground truth rather than asserted.

Most local RAG tools bundle a hybrid retriever, a reranker, and a "verifier," but never report numbers — so it is impossible to know which parts actually help. NexusRAG is built around the measurement: a strictly-additive ablation on two BEIR benchmarks with bootstrap confidence intervals and paired randomization tests, plus a faithfulness verifier evaluated as a real evidence detector.

What the benchmark shows

Retrieval quality on SciFact (300 claims, 5,183 abstracts) and NFCorpus (323 queries, 3,633 documents), CPU-only, exact search.

System	SciFact nDCG@10	NFCorpus nDCG@10
BM25	0.666	0.312
Dense (MiniLM, the usual default)	0.648	0.319
Dense (BGE-small)	0.708	0.342
Hybrid (RRF)	0.704	0.352
+ Corrective PRF	0.703	0.346
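
For readers unfamiliar with the metric in the table, here is a minimal sketch of nDCG@10 for a single query. It assumes linear gain (the relevance grade itself) and a log2 rank discount, which is how trec_eval's `ndcg` measure is commonly described. The function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the NexusRAG code base.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in retrieved order, best first.
    qrels: dict mapping doc id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists the judged documents by decreasing relevance grade.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

The per-dataset figure is then the mean of `ndcg_at_k` over all queries.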

The single biggest lever is the embedding model. On SciFact, swapping the common `all-MiniLM-L6-v2` for `bge-small-en-v1.5` moves dense retrieval from below BM25 (0.648 vs 0.666) to clearly above it (0.708), a gain of +0.060 nDCG@10 (paired randomization p < 0.001). Reciprocal-rank fusion then beats BM25 by +0.037 [+0.014, +0.061] on SciFact and +0.040 [+0.025, +0.055] on NFCorpus. The 95% bootstrap CI of the paired per-query difference excludes zero in both cases, so the win is real but modest.

The confidence-gated corrective loop runs a single re-retrieval pass only on low-confidence queries and is roughly neutral on nDCG here. A cross-encoder reranker was also evaluated and does not help on these abstract-level corpora: it lowers nDCG@10 (0.702 vs 0.734) and Recall@20 (0.886 vs 0.900) at ~45× the latency. That result is reported as-is.
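
The two statistics quoted above, a bootstrap CI of the paired per-query difference and a paired randomization test, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's evaluation code: the function names, the percentile method, the sign-flip formulation, and the default of 2,000 resamples are all choices made for this example. Only the seed of 0 comes from the text.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b - scores_a), resampling queries."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Each resample draws n per-query differences with replacement.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


def paired_randomization_p(scores_a, scores_b, n_permutations=2000, seed=0):
    """Two-sided paired randomization test: randomly swap the system labels per query."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    observed = abs(sum(diffs)) / n
    hits = 0
    for _ in range(n_permutations):
        # Swapping the two systems on one query flips the sign of its difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / n >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

Here `scores_a` and `scores_b` would be per-query nDCG@10 values for two systems on the same queries, in the same order. A CI that excludes zero and a small p-value point the same way, but the CI also conveys how large the gain plausibly is.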

Conventions: nDCG@10 uses graded relevance (the BEIR/pytrec_eval convention), RRF uses k = 60, and dense retrieval is exact. The corrective threshold is selected on a held-out split, all bootstrap and randomization tests use seed 0, and BEIR dataset revisions are pinned. Every number is generated from committed results in `benchmarks/results/`. Full per-metric tables with CIs and p-values are in `paper/main.pdf`.
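
As a concrete reference for the fusion step, here is a minimal sketch of reciprocal rank fusion with k = 60. Each document scores the sum of 1 / (k + rank) over the rankings in which it appears. The function name `rrf_fuse`, the `top_n` parameter, and the alphabetical tie-break are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several ranked lists of doc ids (best first) with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document missing from a ranking simply contributes nothing for it.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]
```

For example, fusing a BM25 ranking `["d1", "d2", "d3"]` with a dense ranking `["d3", "d1", "d4"]` returns `["d1", "d3", "d2", "d4"]`: documents found by both retrievers rise above those found by only one. Because only ranks are used, the BM25 and cosine scores never need to be put on a common scale.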

How it works

flowchart LR
    D[Documents] --> C[Chunk] --> E[BGE embeddings + BM25]
    Q[Question] --> R[RRF fusion]
    E --> R
    R --> G{Confident?}
    G -- yes --> S[Answer with citations]
    G -- no --> P[Expand + re-retrieve] --> S
    S --> V[Grounding check]

Documents are parsed, chunked, embedded into LanceDB, and indexed for BM25. A query fuses dense and lexical results with reciprocal rank fusion; if the top dense score is weak, a pseudo-relevance-feedback pass expands the query and re-retrieves. A local model answers using only the retrieved passages, with inline citations, and an NLI model checks that each answer sentence is entailed by its sources.
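
The confidence gate described above can be sketched as follows. This is a simplified illustration, not the NexusRAG implementation: the `retrieve` callable, the threshold of 0.5, and the term-frequency query expansion are all hypothetical stand-ins. Per the benchmark notes, the real threshold is selected on a held-out split.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical retriever contract: query -> (fused passages, top dense similarity).
Retriever = Callable[[str], Tuple[List[str], float]]


def expand_query(query: str, passages: List[str], n_terms: int = 5) -> str:
    """Append the most frequent new terms from the top passages (simple PRF)."""
    seen = set(query.lower().split())
    counts = Counter(
        token
        for text in passages
        for token in text.lower().split()
        if token not in seen and len(token) > 3  # crude stand-in for stopword removal
    )
    extra = [token for token, _ in counts.most_common(n_terms)]
    return " ".join([query, *extra])


def corrective_retrieve(
    query: str,
    retrieve: Retriever,
    threshold: float = 0.5,  # assumed value for illustration
    feedback_docs: int = 3,
) -> List[str]:
    """Retrieve once; if the top dense score is weak, expand and retrieve once more."""
    passages, top_dense_score = retrieve(query)
    if not passages or top_dense_score >= threshold:
        return passages
    expanded = expand_query(query, passages[:feedback_docs])
    corrected, _ = retrieve(expanded)  # single corrective pass, never a loop
    return corrected
```

Capping the correction at one extra pass bounds the added latency at one more retrieval, and confident queries pay nothing.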

Getting started

python -m venv .venv && source .venv/bin/activate
pip install -e ".[eval]"
make run            # web UI at http://localhost:8000 (needs a local Ollama for generation)

Reproduce the benchmark on CPU. Each command downloads its datasets and small models from Hugging Face on first run, then caches them:

make eval           # SciFact + NFCorpus ablation (downloads BGE-small)
make faithfulness   # evidence detection (downloads the NLI + reranker models)
make paper          # regenerate tables, figures, and the PDF (needs `tectonic`)

`make eval-sample` runs a small vendored subset with no dataset download. Building the PDF needs the `tectonic` LaTeX engine; `python -m nexusrag.eval.report` regenerates the tables and figures without it.

Reproducibility and limitations

The full ablation is CPU-only and takes roughly 15–25 min per dataset on a modern laptop (embedding 3.6k–5.2k abstracts with BGE-small, exact search). Models are cached locally on first run: BGE-small ~130 MB, cross-encoder ~90 MB, DeBERTa-NLI ~280 MB, plus `llama3.2:3b` (~2 GB via Ollama) for generation. Running the full stack needs about 8 GB of RAM. Design and component-level limitations are documented in `docs/ARCHITECTURE.md`.

Scope is deliberately narrow: two abstract-level BEIR datasets (SciFact's 300 test queries are all that BEIR provides for it). Broader datasets (FiQA, SciDocs), domain encoders (SPECTER2, SciNCL), additional neural baselines (SPLADE, ColBERTv2, monoT5), full-paper chunking ablations, and end-to-end answer-quality scoring (RAGAs / LLM-as-judge) are future work, not claimed here. The `frontend/` directory is an optional static UI served by FastAPI for local use; it is not needed for the benchmark or API.

Tech stack

Area	Tools
Language	Python 3.11–3.12, typed, mypy strict
Retrieval	sentence-transformers (BGE-small), rank-bm25, RRF (k=60), cross-encoder reranker, DeBERTa NLI, LanceDB (cosine, exact)
Serving	FastAPI, Uvicorn, Ollama (`llama3.2:3b`, pinned)
Evaluation	SciFact, NFCorpus (BEIR, revisions pinned), bootstrap CIs, paired randomization + delta CIs, Holm correction
Quality	pytest (259 tests, 63% coverage), ruff, mypy (strict), GitHub Actions CI, gitleaks, pip-audit, Docker

License

MIT
