Glossary
LLM-as-Judge
Also known as: LLM as a judge, LLM-as-a-Judge, LLM judge
Definition
LLM-as-Judge is an evaluation technique in which an independent large language model scores another LLM's output against a rubric: accuracy, relevance, tone, consistency, completeness, or custom criteria. Because scoring is automated, quality can be measured continuously on every production trace rather than on a manually sampled subset.
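The rubric scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a specific product's implementation: the 1-5 scale, the prompt wording, and the `call_model` callable (a stand-in for whatever client sends a prompt to the judge model) are all assumptions made for the example.

```python
import json
from typing import Callable, Dict, List

# Rubric dimensions taken from the definition above; the 1-5 scale is an assumption.
RUBRIC: List[str] = ["accuracy", "relevance", "tone", "consistency", "completeness"]


def build_judge_prompt(question: str, answer: str, rubric: List[str]) -> str:
    """Ask the judge for one integer score per rubric dimension, as JSON."""
    criteria = ", ".join(rubric)
    return (
        "You are grading another model's answer.\n"
        f"Score each of these criteria from 1 (poor) to 5 (excellent): {criteria}.\n"
        "Reply with a JSON object mapping each criterion to an integer.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
    )


def judge(
    question: str,
    answer: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw judge reply out
    rubric: List[str] = RUBRIC,
) -> Dict[str, int]:
    """Score one answer and reject replies that do not follow the rubric."""
    raw = call_model(build_judge_prompt(question, answer, rubric))
    scores = json.loads(raw)
    if not isinstance(scores, dict):
        raise ValueError("judge reply is not a JSON object")
    result: Dict[str, int] = {}
    for name in rubric:
        value = scores.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        valid = isinstance(value, int) and not isinstance(value, bool)
        if not valid or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {name!r}")
        result[name] = value
    return result


def fake_judge_model(prompt: str) -> str:
    """Stand-in for a real judge model so the sketch runs offline."""
    return json.dumps({name: 4 for name in RUBRIC})


scores = judge("What is the capital of France?", "Paris.", fake_judge_model)
```

Validating the reply before accepting it matters because a judge model can return malformed or partial output, and silently recording such a reply would corrupt the quality signal.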
Why it matters
Manual evaluation does not scale. A reviewer sampling 1% of traces leaves the other 99% unreviewed, so a regression that affects only part of the traffic can go unnoticed. Hard-coded heuristics catch simple errors but miss subtle quality degradation, tone problems, and factual inconsistency. LLM-as-Judge sits in between: cheap enough to score every trace, sophisticated enough to catch problems heuristics miss.
Done well, it provides the continuous quality signal that NIST AI RMF MEASURE-2.6 expects and that SR 11-7 ongoing monitoring increasingly demands. Done poorly, it inherits the judge model's biases and its scores become noise. Best practice is to use a different model family for the judge than for the system being evaluated, anchor scores against human-labeled ground truth periodically, and report calibration drift as part of the eval.
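Anchoring against human labels and reporting calibration drift can be made concrete with a small calculation. The sketch below assumes judge and human scores share the same integer scale and that each pair refers to the same trace; the function names and the choice of metrics (mean absolute error, signed bias, exact agreement rate) are illustrative assumptions, not a prescribed method.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Each pair is (judge_score, human_score) for the same trace.
ScorePairs = List[Tuple[int, int]]


def calibration(pairs: ScorePairs) -> Dict[str, float]:
    """Summarize how closely judge scores track human-labeled ground truth."""
    if not pairs:
        raise ValueError("need at least one labeled pair")
    return {
        # Average size of the disagreement, regardless of direction.
        "mae": float(mean(abs(j - h) for j, h in pairs)),
        # Positive bias means the judge scores higher than humans on average.
        "bias": float(mean(j - h for j, h in pairs)),
        # Fraction of traces where judge and human gave the same score.
        "agreement": float(mean(1 if j == h else 0 for j, h in pairs)),
    }


def calibration_drift(baseline: Dict[str, float], current: Dict[str, float]) -> float:
    """Change in mean absolute error since the baseline; positive means worse."""
    return current["mae"] - baseline["mae"]


# Made-up scores for illustration only.
baseline = calibration([(4, 4), (3, 3), (5, 4), (2, 2)])
current = calibration([(5, 4), (4, 3), (5, 4), (2, 2)])
drift = calibration_drift(baseline, current)
```

Tracking the signed bias alongside the error helps distinguish a judge that has become noisier from one that has become systematically more lenient or more strict.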
In practice
Prism Evaluations use LLM-as-Judge to score every trace across five fixed dimensions within seconds of completion. Judges run independently of the production model, their decisions are themselves logged, and quality scores feed both real-time alerts and weekly regression reports.