What Is LLM-as-a-Judge and Why Should You Use It?
In the last article we covered statistical metrics like Perplexity, BLEU, ROUGE and more, along with the statistical concepts that underpin them, their strengths (accuracy, reliability) and weaknesses (no subjective focus, reliance on reference texts). Between human evaluation (manual testing) and statistical measures we get a mix of high-value qualitative assessment on a small part of the test surface, and a rigorous but limited view across a wider area. That still leaves a lot of middle ground uncovered!

That's why there has been a push over the last few years to cover the space in between - something with a level of subjectivity and nuance that also scales up. This is where LLM-as-a-Judge comes in. In our manual testing for LLMs article I compared this to a kind of ouroboros where AI validates AI - and that isn't necessarily a bad thing. LLMs can do some things better than humans, and LLM-as-a-Judge plays to those strengths - but it does not replace the need for human oversight and statistical assessment. There are also metrics that combine LLM-as-a-Judge with statistical metrics - but we'll talk more about that later.
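To make the idea concrete, here is a minimal sketch of an LLM-as-a-Judge call in Python. The judge prompt, the 1-5 helpfulness rubric, the model name, and the use of the OpenAI chat completions client are all illustrative assumptions on my part, not a prescribed setup - any capable model and any rubric you care about can stand in.

```python
# Minimal LLM-as-a-Judge sketch. Assumptions (all illustrative):
# the OpenAI Python SDK, a 1-5 helpfulness rubric, and gpt-4o-mini as judge.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for helpfulness on a scale of 1-5,
where 1 is unhelpful and 5 is excellent.
Reply with the number only.

QUESTION: {question}
RESPONSE: {response}"""


def judge_helpfulness(question: str, response: str) -> int:
    """Ask a judge model to score a response; returns a score from 1 to 5."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
    )
    return int(completion.choices[0].message.content.strip())


if __name__ == "__main__":
    score = judge_helpfulness(
        "How do I reverse a list in Python?",
        "Use my_list[::-1] or my_list.reverse().",
    )
    print(f"Judge score: {score}")
```

Pinning the temperature to 0 and asking for a bare number keeps the output easy to parse, but it is a simplification: production setups often ask the judge for a rationale alongside the score and calibrate scores against human labels, which ties back to the point above about not dropping human oversight.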