Large Language Model System Evals in the Wild
Chances are, you've been building your LLM app with "vibes-based engineering": you look at the LLM outputs in dev and eyeball them with a "Looks good to me". That's fine to start with, but how do you know if you have regressions? What if the model gets switched out? How do you know improving one set of user queries won't break another?
The longer your app has been in production, the less reliable vibes-based evals become. You'll have no idea whether the change that improved this set of user queries regressed another set you fixed last week.
That's where system evals come in. We teach you how to align a human or LLM judge on a qualitative metric using statistical measures. With an aligned judge, you can evaluate your LLM's outputs and get numbers you can compare over time as new data, queries, or models come along!
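To make that concrete, here is a minimal sketch of what "aligning a judge" can look like: compare LLM-as-a-judge grades against human grades on the same eval items and check their agreement before trusting the LLM judge at scale. The grades, the pass/fail scale, the 0.8 threshold, and the use of Cohen's kappa are illustrative assumptions, not the exact recipe from the issue.

```python
# Sketch: check how well an LLM judge agrees with human grades on the same
# eval items before using it to score outputs at scale. Assumes scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades on a shared set of eval items (binary pass/fail scale).
human_grades = ["pass", "fail", "pass", "pass", "fail", "pass"]
llm_judge_grades = ["pass", "fail", "pass", "fail", "fail", "pass"]

# Cohen's kappa measures agreement beyond chance (1.0 = perfect, 0.0 = chance).
kappa = cohen_kappa_score(human_grades, llm_judge_grades)
print(f"Judge-human agreement (kappa): {kappa:.2f}")

if kappa >= 0.8:  # assumed threshold; tune to your tolerance for judge error
    print("LLM judge is aligned enough to run the eval at scale.")
else:
    print("Iterate on the grading prompt and re-measure alignment.")
```

Once the judge clears your agreement bar, its scores become the comparable numbers you track across new data, queries, or model swaps.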
Preview the Issue
What you'll learn
The value of evals
- What are evals?
- Why should you design your own evals?
- Convincing your team and manager
- Evaluating your system as a whole
Designing your first eval
- Understanding the overall eval structure
- Balancing the different goals of an eval
- Ensuring reproducibility in eval design
Using robust testing
- Using property-based LLM unit tests
- Doing vibes-based engineering right
- Conducting loss analysis
Selecting quality measures
- Choosing measures of quality
- Picking a grading scale
- Using statistical metric functions
Running effective task evals
- Choosing a human vs an LLM grader
- Writing great prompts for LLM-as-a-judge
- Setting up the develop-label-analyze loop
Table of Contents
- Eval Shows the Way
- Different Levels of Eval
- Why Design Your Own Eval?
- A Typical LLM App Architecture
- Eval Feedback Loop Overview
- Types of Evals
- System Eval vs Model Eval
- Human Judge vs LLM-as-a-judge
- Eval of LLM-driven components
- Interlude: Trouble at the Forest Cafe
- Before Your First Eval
- Doing Vibes Right
- Property-based Testing
- Setting Up Property-based Tests
- Interlude: Taming Shoggoth’s Wildest Recipes
- Designing the Eval Plan
- Eval Plan Design Process Overview
- Anatomy of a Quality Dimension
- Choosing Your Quality Dimensions
- Common Quality Dimensions
- Quality Dimensions at Odds
- Turning Quality Dimensions into Questions
- Picking a Grading Scale
- Reference-based Questions
- Grading Culinary Creations
- Aligning and Running the Eval
- Design and Align Overview
- Generating Eval Items
- Choosing Your Judge
- Running the Human Judging Process
- Designing the Grading Prompt
- Measuring Alignment
- Iterating on the Grading Prompt
- Running LLM-as-a-judge
- Analyzing Eval Results
- Eval Results Analysis Overview
- Downloading Eval Results
- Computing Question-level Scores
- Computing Scores by Slice
- Improving Your System
- Eval Your RAG Component
- Interlude: Getting the Cafe in Order
- The Takeaways
- What You Measure Matters
- Iterate, Iterate, Iterate
- Look at Your Data. Look Again
- Increase Your Input