Large Language Model System Evals in the Wild
Chances are, you've been building your LLM app with "vibes-based engineering": you look at the LLM outputs in dev and eyeball them with a "Looks good to me". That's fine to start with, but how do you know if you have regressions? What if the model gets switched out? How do you know improving one set of user queries won't break another?
The longer your app has been in production, the less reliable vibes-based evals become. You'll have no idea whether the change that improved this set of user queries regressed another set you fixed last week.
That's where system evals come in. We teach you how to align a human or LLM judge on a qualitative metric using statistical measures. With an aligned judge, you can evaluate your LLM's outputs and get numbers you can compare over time as new data, queries, or models come along!
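To make that concrete, here is a minimal sketch of what "aligning a judge" can look like: compare LLM-as-a-judge grades against human grades on the same eval items and check their agreement before trusting the LLM judge at scale. The grades, the pass/fail scale, the 0.8 threshold, and the use of Cohen's kappa are illustrative assumptions, not the exact recipe from the issue.

```python
# Sketch: check how well an LLM judge agrees with human grades on the same
# eval items before using it to score outputs at scale. Assumes scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades on a shared set of eval items (binary pass/fail scale).
human_grades = ["pass", "fail", "pass", "pass", "fail", "pass"]
llm_judge_grades = ["pass", "fail", "pass", "fail", "fail", "pass"]

# Cohen's kappa measures agreement beyond chance (1.0 = perfect, 0.0 = chance).
kappa = cohen_kappa_score(human_grades, llm_judge_grades)
print(f"Judge-human agreement (kappa): {kappa:.2f}")

if kappa >= 0.8:  # assumed threshold; tune to your tolerance for judge error
    print("LLM judge is aligned enough to run the eval at scale.")
else:
    print("Iterate on the grading prompt and re-measure alignment.")
```

Once the judge clears your agreement bar, its scores become the comparable numbers you track across new data, queries, or model swaps.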
Preview the Issue
What you'll learn
The value of evals
- What are evals?
- Why should you design your own evals?
- Convincing your team and manager
- Evaluating your system as a whole
Designing your first eval
- Understanding the overall eval structure
- Balancing the different goals of an eval
- Ensuring reproducibility in eval design
Using robust testing
- Using property-based LLM unit tests
- Doing vibes-based engineering right
- Conducting loss analysis
Selecting quality measures
- Choosing measures of quality
- Picking a grading scale
- Using statistical metric functions
Running effective task evals
- Choosing a human vs an LLM grader
- Writing great prompts for LLM-as-a-judge
- Setting up the develop-label-analyze loop
Table of Contents
- Eval Shows the Way
- Different Levels of Eval
- Why Design Your Own Eval?
- A Typical LLM App Architecture
- Eval Feedback Loop Overview
- Types of Evals
- System Eval vs Model Eval
- Human Judge vs LLM-as-a-judge
- Eval of LLM-driven components
- Interlude: Trouble at the Forest Cafe
- Before Your First Eval
- Doing Vibes Right
- Property-based Testing
- Setting Up Property-based Tests
- Interlude: Taming Shoggoth’s Wildest Recipes
- Designing the Eval Plan
- Eval Plan Design Process Overview
- Anatomy of a Quality Dimension
- Choosing Your Quality Dimensions
- Common Quality Dimensions
- Quality Dimensions at Odds
- Turning Quality Dimensions into Questions
- Picking a Grading Scale
- Reference-based Questions
- Grading Culinary Creations
- Aligning and Running the Eval
- Design and Align Overview
- Generating Eval Items
- Choosing Your Judge
- Running the Human Judging Process
- Designing the Grading Prompt
- Measuring Alignment
- Iterating on the Grading Prompt
- Running LLM-as-a-judge
- Analyzing Eval Results
- Eval Results Analysis Overview
- Downloading Eval Results
- Computing Question-level Scores
- Computing Scores by Slice
- Improving Your System
- Eval Your RAG Component
- Interlude: Getting the Cafe in Order
- The Takeaways
- What You Measure Matters
- Iterate, Iterate, Iterate
- Look at Your Data. Look Again
- Increase Your Input