measure how much your llm judges actually agree.
when you use llm-as-a-judge for evals, the quiet failure mode is that your judges disagree with each other and nobody notices. the headline score still moves run to run, but the signal underneath is noise. this library runs N judge models on the same items, computes inter-judge agreement, and points at the items where the agreement breaks down.
llm-as-judge is the default pattern for grading open-ended output: rag answers, code review, reasoning traces, anything without a clean string-match. the problem is that a single judge gives you a number with no error bar. you cannot tell a real quality change from the judge being moody, and you cannot tell whether two judges that produce the same average are actually making the same decisions item by item.
two failures show up constantly:
- one bad judge drags the score. average four judges where one is systematically harsh on a class of answers, and the headline metric is skewed in a direction you cannot see.
- judges agree on the average but not per item. the mean looks fine while individual verdicts are close to random. that average is useless for ranking two systems, which is the thing evals exist to do.
agreement statistics are how you catch both. they are standard in any field that has leaned on human raters for decades (medicine, content moderation, linguistics), and judge models are just raters with an api. this library brings that toolkit to the eval loop. it does not try to crown a correct judge. it surfaces the disagreement and leaves the call to you.
given a benchmark of (prompt, candidate_response) pairs and a set of judge models, the library:
- runs each judge on each item, producing a score per (item, judge) pair.
- computes agreement across judges, then reports per-judge bias (mean offset) and per-judge variance.
- ranks the items by disagreement so you can audit the worst cases by hand.
output is one json report per run plus a printable summary.
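the bias and disagreement-ranking steps above can be sketched in plain python. this is a minimal illustration, not the library's implementation: the score layout, the use of max-minus-min spread as the disagreement measure, and population variance are all assumptions, and the numbers are toy values invented for the example.

```python
from statistics import mean, pvariance

# toy scores, invented for illustration: scores[judge][i] is that judge's
# score for item i on a 1..5 rubric
scores = {
    "judge_a": [4, 5, 2, 3],
    "judge_b": [4, 4, 2, 5],
    "judge_c": [3, 4, 1, 1],
}
item_ids = ["q1", "q2", "q3", "q4"]

# per-judge bias as the offset of the judge's mean from the panel mean
panel_mean = mean(s for col in scores.values() for s in col)
bias = {judge: mean(col) - panel_mean for judge, col in scores.items()}

# per-judge variance (population variance is an assumption here)
variance = {judge: pvariance(col) for judge, col in scores.items()}

# per-item disagreement as the spread (max - min) of scores across judges
spread = {}
for i, item in enumerate(item_ids):
    item_scores = [col[i] for col in scores.values()]
    spread[item] = max(item_scores) - min(item_scores)

# items to audit by hand, most contested first
worst_first = sorted(item_ids, key=lambda item: spread[item], reverse=True)

for judge in scores:
    print(f"{judge}: bias {bias[judge]:+.2f}, variance {variance[judge]:.2f}")
print("audit order:", worst_first)
```

in this toy panel judge_c sits well below the panel mean and q4 is the item to look at first, since the three judges gave it 3, 5 and 1.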
five agreement numbers, each answering a different question:
- cohen's kappa (pairwise mean). chance-corrected agreement between two judges, averaged over every judge pair. catches "they agree, but only as often as random guessing would predict."
- krippendorff's alpha (ordinal). single agreement coefficient across all judges at once, ordinal by default so a 4-vs-5 disagreement counts less than a 1-vs-5. handles a missing score here and there without falling over.
- fleiss' kappa. the multi-rater chance-corrected kappa (strictly a generalisation of scott's pi rather than of cohen's kappa), treating categories as nominal. when your judges return scores from a small fixed set (1..5, pass/fail, A/B/C) and you want a single chance-corrected number across all of them, fleiss is the literature-standard answer.
- intraclass correlation, ICC(2,k). reliability of the averaged judge score, treating your judges as a sample of all possible judges. this is the number that tells you whether the panel average is trustworthy even when individual judges are shaky.
- exact-match agreement. plain fraction of items where every judge gave the identical score. no correction, no weighting, just a blunt floor.
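to make three of these definitions concrete, here is a small pure-python sketch of unweighted cohen's kappa (averaged over judge pairs), fleiss' kappa, and exact-match agreement. it follows the textbook formulas and is not taken from the library, which may weight or handle missing scores differently. the function names and the toy scores are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """unweighted cohen's kappa for two equal-length label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # chance agreement from each judge's own marginal label frequencies
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)  # undefined when p_exp == 1


def pairwise_mean_kappa(columns):
    """mean of cohen's kappa over every pair of judges."""
    pairs = list(combinations(columns, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def fleiss_kappa(columns):
    """fleiss' kappa; assumes every judge scored every item."""
    k = len(columns)            # number of judges
    rows = list(zip(*columns))  # one tuple of k scores per item
    n = len(rows)
    totals = Counter()
    p_items = []
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        # fraction of judge pairs that agree on this item
        p_items.append((sum(c * c for c in counts.values()) - k) / (k * (k - 1)))
    p_bar = sum(p_items) / n
    p_exp = sum((t / (n * k)) ** 2 for t in totals.values())
    return (p_bar - p_exp) / (1 - p_exp)


def exact_match(columns):
    """fraction of items where all judges gave the identical score."""
    rows = list(zip(*columns))
    return sum(len(set(row)) == 1 for row in rows) / len(rows)


# toy scores, invented for illustration: one list per judge, one entry per item
judge_a = [1, 1, 2, 2]
judge_b = [1, 1, 2, 2]
judge_c = [1, 2, 1, 2]
panel = [judge_a, judge_b, judge_c]

print(cohen_kappa(judge_a, judge_b))          # 1.0, identical verdicts
print(cohen_kappa(judge_a, judge_c))          # 0.0, agreement equals chance
print(f"{pairwise_mean_kappa(panel):.3f}")    # 0.333
print(f"{fleiss_kappa(panel):.3f}")           # 0.333
print(exact_match(panel))                     # 0.5
```

judge_a and judge_c agree on half the items, which is exactly what their label frequencies predict by chance, so kappa is zero even though raw agreement is 50%.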
the spread between the five numbers is the point. high icc with near-zero kappa means the average is stable but no single judge is reliable, which is a very different situation from every judge on the panel being individually solid.
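a toy case shows how far the numbers can drift apart. the sketch below computes ICC(2,k) from the standard two-way anova mean squares (shrout and fleiss) in pure python. it is an illustration under the assumption of a complete score matrix, not the library's code, and the scores are invented.

```python
def icc2k(rows):
    """ICC(2,k): two-way random effects, absolute agreement, mean of k judges.

    rows[i][j] is judge j's score for item i; no missing values allowed.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(map(sum, rows)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between judges
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


# toy scores, invented for illustration: two judges who track the same
# ordering but never give the identical score to any item
rows = [[1, 2], [2, 1], [4, 5], [5, 4]]

exact = sum(a == b for a, b in rows) / len(rows)
print(f"ICC(2,k): {icc2k(rows):.3f}")   # 0.914
print(f"exact-match: {exact:.2f}")       # 0.00
```

here the averaged score is highly reliable while exact-match is zero, because every disagreement is a one-point near miss. a nominal statistic such as unweighted kappa counts each of those as a full disagreement, which is one reason to read the numbers side by side rather than trusting any single one.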
pip install llm-judge-calibration
from source:
git clone https://github.com/WatchTree-19/llm-judge-calibration
cd llm-judge-calibration
pip install -e .
five lines from benchmark to report:
from llm_judge_calibration import Benchmark, JudgeConfig, JudgeRunner, CalibrationReport
bench = Benchmark.from_jsonl("examples/data/mini_eval.jsonl")
judges = [JudgeConfig(model="gpt-4o-mini"), JudgeConfig(model="gpt-4o"), JudgeConfig(model="claude-3-5-sonnet-20240620")]
scores = JudgeRunner(bench=bench, judges=judges, rubric="rate factual accuracy 1 to 5").run()
CalibrationReport.from_scores(scores).print_summary()
or from the cli:
llm-judge-calibration run \
--bench examples/data/mini_eval.jsonl \
--judge gpt-4o-mini --judge gpt-4o --judge claude-3-5-sonnet-20240620 \
--rubric "rate the response from 1 to 5 on factual accuracy" \
--out report.json
set the provider env vars (OPENAI_API_KEY, ANTHROPIC_API_KEY, and so on) before running. judge calls go through litellm, so anything litellm supports works: openai, anthropic, cohere, mistral, together, groq, local vllm endpoints.
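a quick pre-flight check can save a failed run. the sketch below is not part of the library: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the mapping only covers the two env vars named above, so extend it for whichever providers you use.

```python
import os

# hypothetical mapping from provider to the env var holding its api key;
# only the two names mentioned in the text are listed here
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_keys(providers, env=os.environ):
    """return the env var names that are unset or empty for the given providers."""
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


missing = missing_keys(["openai", "anthropic"])
if missing:
    print("set these before running:", ", ".join(missing))
```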
running the live llm judges needs api keys, so to show the agreement layer on real data with numbers you can reproduce exactly, here it is on irr::anxiety, a standard public inter-rater dataset (3 raters scoring 20 subjects on a 1 to 6 ordinal scale). the raters map onto judges and the subjects onto items with no change to the code:
======================================================================
calibration report
======================================================================
n items : 20
n judges : 3
rubric : irr::anxiety (rater1, rater2, rater3)
inter-judge agreement
cohen's kappa (pairwise mean) : -0.02 (worse than chance)
krippendorff's alpha : 0.17
fleiss' kappa : -0.04
intraclass correlation (ICC2k) : 0.43
exact-match agreement : 0.00
per-judge bias (mean score)
rater1 : 3.15
rater2 : 3.15
rater3 : 2.30
per-judge variance
rater1 : 1.42
rater2 : 1.24
rater3 : 1.31
======================================================================
this is exactly the case the tool exists to catch. exact-match is zero (the three raters never once gave the same score) and pairwise kappa sits at chance, yet icc reads a moderate 0.43 and rater1 and rater2 share an identical 3.15 mean. anyone eyeballing the averages would call this panel roughly aligned. it is not. the per-item verdicts are noise, rater3 runs almost a full point low, and any ranking built on this panel would be unreliable. on a real eval that gap between a healthy-looking average and chance-level per-item agreement is the difference between a number you can ship and a number that is quietly lying to you.
reproduce it: load irr::anxiety (shipped with the R irr package), feed the three rater columns to CalibrationReport.from_scores, print the summary.
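one way to see why a 0.43 icc is less reassuring than it looks: ICC(2,k) is the reliability of the k-judge average, and the spearman-brown relation converts it back to the reliability of a single judge. the sketch below does that arithmetic on the rounded values from the report, so treat the result as approximate. `single_rater_icc` is a hypothetical helper name, not a library function.

```python
def single_rater_icc(icc_k, k):
    """invert spearman-brown: single-judge reliability implied by ICC(2,k)."""
    return icc_k / (k - (k - 1) * icc_k)


# values copied from the report (already rounded to two decimals)
implied = single_rater_icc(0.43, 3)
gap = 3.15 - 2.30

print(f"implied single-judge icc: {implied:.2f}")        # 0.20
print(f"rater3 gap to rater1/rater2 mean: {gap:.2f}")    # 0.85
```

averaging three raters lifts a single-rater reliability of roughly 0.20 to 0.43, which is why the panel number looks moderate while the pairwise and exact-match numbers sit at or below chance.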
a Benchmark is a jsonl file, one object per line:
{"id": "001", "prompt": "what is the capital of france?", "candidate_response": "paris is the capital of france."}
id and candidate_response are required. prompt is optional, since plenty of rubrics work on the response alone ("is this output hostile?"). anything else on the line is kept as metadata.
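if you generate benchmark files yourself, it can help to validate them against these rules before a run. the sketch below is an assumption about how such a check might look, not the loader behind `Benchmark.from_jsonl`. `parse_item` and the returned dict layout are hypothetical.

```python
import json

REQUIRED = ("id", "candidate_response")
KNOWN = {"id", "prompt", "candidate_response"}


def parse_item(line):
    """parse one jsonl line into id, prompt, candidate_response and metadata."""
    obj = json.loads(line)
    missing = [key for key in REQUIRED if key not in obj]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    return {
        "id": obj["id"],
        "prompt": obj.get("prompt"),  # optional, None when absent
        "candidate_response": obj["candidate_response"],
        # every other key on the line is carried along untouched
        "metadata": {k: v for k, v in obj.items() if k not in KNOWN},
    }


item = parse_item('{"id": "001", "candidate_response": "paris", "source": "geo"}')
print(item["prompt"], item["metadata"])  # None {'source': 'geo'}
```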
- not a stand-in for human ground truth. if you have gold labels, use them. this is for the common case where you do not.
- not a "pick the best judge" button. it shows you the disagreement, you decide.
- not a benchmark. bring your own items.
MIT. see LICENSE.