Evaluation Overview
Evals give you a repeatable check of your LLM application's behavior. You replace guesswork with data.
LLM evaluation is part of the AI engineering loop:
- Trace -> Capture real behavior with Observability.
- Monitor -> Track production quality with online evaluators and Score Analytics.
- Build datasets -> Turn traces into reusable evaluation assets with annotations and datasets.
- Experiment -> Validate changes with experiments and CI/CD checks before they ship.
- Evaluate -> Decide what is good enough to ship.
Trace
Capture what users asked, what your system did, and where the output came from with traces and observations. Trace data enables the rest of the loop: monitoring, datasets, and regression checks. For the broader pattern, see tracing.
Monitor
Use production signals to find the traces worth reviewing. Monitoring helps you spot data drift, flag quality issues automatically, and discover examples for your evaluation set.
- Track trends, data drift, and regressions in Score Analytics and custom dashboards.
- Score live behavior with LLM-as-a-Judge, custom scores via API/SDK, or user feedback.
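As a rough illustration of spotting a regression in a score trend, the sketch below compares the mean of the most recent window of scores with the window before it. It is library-agnostic and does not use any Langfuse API. The function name `detect_score_drop`, the window size, the threshold, and the sample values are all assumptions made for this example.

```python
from statistics import mean


def detect_score_drop(scores, window=5, max_drop=0.1):
    """Return True if the mean of the latest `window` scores fell more than
    `max_drop` below the mean of the preceding `window` scores.

    Hypothetical helper for illustration only.
    """
    if len(scores) < 2 * window:
        return False  # not enough history to compare two full windows
    previous = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    return (previous - recent) > max_drop


# Made-up daily averages of an evaluator score in [0, 1].
daily_scores = [0.91, 0.90, 0.92, 0.89, 0.90, 0.78, 0.75, 0.77, 0.74, 0.76]

if detect_score_drop(daily_scores):
    print("Score dropped: review recent traces")
```

A fixed threshold like this is only a starting point. In practice you would tune the window and threshold to the noise level of each score.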
Build datasets
Turn raw traces into reusable evaluation assets. Begin with human review, name the failure modes, then convert useful examples into datasets and score definitions. For details, see datasets and error analysis.
- Use Annotation Queues, Scores via UI, and `TEXT` scores for open coding.
- Group notes into failure modes (axial coding), then turn stable categories into structured labels and evaluation criteria.
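The step from open coding to axial coding can be sketched in a few lines: free-text review notes are mapped to named failure modes and then counted. This is a plain-Python illustration, not Langfuse SDK code. The notes, the `FAILURE_MODES` mapping, and the `assign_failure_mode` helper are hypothetical. In a real workflow a reviewer groups the notes by hand rather than by phrase matching.

```python
from collections import Counter

# Hypothetical open-coding notes: (trace_id, free-text reviewer note).
notes = [
    ("t1", "Cited a doc that does not exist"),
    ("t2", "Answer ignores the retrieved context"),
    ("t3", "Made up a citation"),
    ("t4", "Too verbose, buried the answer"),
]

# Hypothetical mapping from note phrases to failure modes (the axial codes).
FAILURE_MODES = {
    "hallucinated_citation": ["cited a doc", "made up a citation"],
    "ignored_context": ["ignores the retrieved context"],
}


def assign_failure_mode(note):
    """Return the first failure mode whose phrase appears in the note."""
    text = note.lower()
    for mode, phrases in FAILURE_MODES.items():
        if any(phrase in text for phrase in phrases):
            return mode
    return "uncategorized"  # notes that still need a category


labels = {trace_id: assign_failure_mode(note) for trace_id, note in notes}
counts = Counter(labels.values())

for mode, count in counts.most_common():
    print(mode, count)
```

Categories that keep appearing across review rounds are good candidates for structured labels. A large `uncategorized` bucket suggests the set of failure modes is not yet stable.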
Experiment
Use experiments to confirm that a prompt, model, retrieval setup, agent implementation, or evaluator variant improves quality without regressions.
- Compare candidates on the same dataset and scoring criteria with experiments via UI for prompt and model changes, or experiments via SDK for application logic.
- Run regression checks manually or in CI/CD before you ship to production.
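A regression check of this kind can be sketched as follows: run a baseline and a candidate on the same dataset with the same scorer, then block the change if the candidate's mean score falls below the baseline by more than a tolerance. Everything here is a hypothetical stand-in (the `exact_match` scorer, the two toy apps, the dataset items, and the `tolerance` value). It does not call the Langfuse experiments API.

```python
from statistics import mean


def exact_match(output, expected):
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(app, dataset, scorer=exact_match):
    """Run `app` on every dataset item and return the mean score."""
    return mean(scorer(app(item["input"]), item["expected"]) for item in dataset)


def gate(baseline_score, candidate_score, tolerance=0.02):
    """Pass only if the candidate is no worse than baseline minus tolerance."""
    return candidate_score >= baseline_score - tolerance


# Hypothetical dataset and applications for illustration.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def baseline_app(question):
    return {"2+2": "4", "capital of France": "paris"}.get(question, "")


def candidate_app(question):
    return {"2+2": "4"}.get(question, "")  # regressed on the second item


baseline_score = run_experiment(baseline_app, dataset)
candidate_score = run_experiment(candidate_app, dataset)
passed = gate(baseline_score, candidate_score)

# A CI job could exit with a non-zero status when `passed` is False.
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f} passed={passed}")
```

Holding the dataset and scorer fixed across both runs is what makes the comparison meaningful. Changing either one between runs would confound the result.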
Evaluate
Judge experiment outputs before shipping. Start with manual review to understand quality and failure modes, then automate dedicated evaluators where they add repeatable signal. The tradeoffs are covered in evaluation methods.
- Use manual evaluation to build intuition and calibrate automated evaluators.
- Use scores via API/SDK for custom evaluation pipelines, guardrails, runtime checks, user feedback, and internal review workflows.
- Use LLM-as-a-Judge for qualities that require language understanding, such as relevance, tone, completeness, or factuality.
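One way to calibrate an automated evaluator against manual review is to measure how often the two agree on the same items, corrected for chance agreement. The sketch below computes Cohen's kappa in plain Python. The label lists are made-up examples, and nothing here is specific to Langfuse.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same eight outputs.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

kappa = cohen_kappa(human, judge)
print(f"kappa={kappa:.3f}")
```

A low kappa suggests the automated evaluator's criteria or prompt need revision before its scores are trusted as a substitute for manual review.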
Which Langfuse feature should I use?
| If you want to... | Use this Langfuse feature |
|---|---|
| Capture application behavior | Observability, traces and observations |
| Segment traces for later review | Tags, metadata, users, sessions, environments, releases |
| Review examples manually | Annotation Queues, Scores via UI |
| Open Coding: capture open-ended notes | TEXT scores, Annotation Queues |
| Axial Coding: derive failure modes | Stable error categories, evaluation criteria |
| Create reusable test cases | Datasets |
| Compare changes before shipping | Experiments via UI, Experiments via SDK |
| Gate pull requests or deploys | CI/CD experiments |
| Monitor production quality | LLM-as-a-Judge, Scores via API/SDK |
| Analyze evaluator results | Score Analytics, custom dashboards |