Evaluation Overview

Evals give you a repeatable check of your LLM application's behavior, so you can replace guesswork with data.

LLM evaluation is part of the AI engineering loop:

Trace -> Capture real behavior with Observability.
Monitor -> Track production quality with online evaluators and Score Analytics.
Build datasets -> Turn traces into reusable evaluation assets with annotations and datasets.
Experiment -> Validate changes with experiments and CI/CD checks before they ship.
Evaluate -> Decide what is good enough to ship.

[Diagram: the evaluation loop. Online stages: Trace (traces · sessions · agents · prompts) and Monitor (dashboards · LLM-as-judge · feedback). Offline stages: Build datasets (datasets · features-as-tests), Experiment (prompts · models · code variants), and Evaluate (judges · custom evals · annotation). Deploy is also shown as a step in the loop.]

Watch this walkthrough of Langfuse Evaluation and how to use it to improve your LLM application.

Trace

Capture what users asked, what your system did, and where the output came from with traces and observations. Trace data enables the rest of the loop: monitoring, datasets, and regression checks. For the broader pattern, see tracing.
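As a rough illustration of what a trace carries, the sketch below models one request as a trace with nested observations. The class and field names (`Trace`, `Observation`, `kind`, `sources`) are hypothetical and chosen for this example only; they are not the Langfuse data model or SDK.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    """One step inside a trace, e.g. a retrieval call or a model generation."""
    name: str
    kind: str  # hypothetical labels: "retrieval", "generation", "tool"
    input: Any
    output: Any


@dataclass
class Trace:
    """One end-to-end request: user input, final output, and the steps between."""
    trace_id: str
    user_input: str
    final_output: str
    observations: list[Observation] = field(default_factory=list)

    def sources(self) -> list[str]:
        """Names of the steps that contributed to the output, in order."""
        return [obs.name for obs in self.observations]


# Made-up example data: what the user asked, what the system did, and the answer.
trace = Trace(
    trace_id="t-001",
    user_input="What is the refund window?",
    final_output="Refunds are accepted within 30 days.",
    observations=[
        Observation("search-policy-docs", "retrieval",
                    "refund window", ["policy.md#refunds"]),
        Observation("answer", "generation",
                    "prompt + retrieved context",
                    "Refunds are accepted within 30 days."),
    ],
)

print(trace.sources())
```

Keeping the intermediate steps next to the input and output is what lets later stages ask where a bad answer came from, not only whether it was bad.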

Monitor

Use production signals to find the traces worth reviewing. Monitoring helps you spot data drift, flag quality issues automatically, and discover examples for your evaluation set.

Track trends, data drift, and regressions in Score Analytics and custom dashboards.
Score live behavior with LLM-as-a-Judge, custom scores via API/SDK, or user feedback.
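One simple way to flag a regression from a stream of scores is to compare the mean of the most recent window against the window before it. The sketch below assumes numeric scores between 0 and 1 in chronological order; the function name `detect_regression`, the window size, and the drop threshold are illustrative assumptions, not Langfuse settings.

```python
from statistics import mean


def detect_regression(scores, window=5, max_drop=0.1):
    """Return True if the latest window's mean fell more than max_drop
    below the mean of the window immediately before it."""
    if len(scores) < 2 * window:
        return False  # not enough data to compare two full windows
    previous = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    return (previous - recent) > max_drop


# Made-up scores: quality drops halfway through.
dropping = [0.90, 0.92, 0.88, 0.91, 0.90, 0.70, 0.72, 0.68, 0.75, 0.70]
stable = [0.90] * 10

print(detect_regression(dropping))
print(detect_regression(stable))
```

Traces from a flagged window are natural candidates for manual review and for the evaluation set.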

Build datasets

Turn raw traces into reusable evaluation assets. Begin with human review, name the failure modes, then convert useful examples into datasets and score definitions. To get started, see datasets and error analysis.

Use Annotation Queues, Scores via UI, and TEXT scores for open coding.
Group notes into failure modes (axial coding), then turn stable categories into structured labels and evaluation criteria.
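The move from open coding to axial coding can be sketched as a small grouping step. The example below assumes a reviewer has already attached a failure-mode label to each free-text note; the notes, labels, and the `min_count` threshold are made up for illustration.

```python
from collections import Counter

# Open coding: free-text review notes, each later assigned a failure mode
# by a reviewer (axial coding). All data here is hypothetical.
coded_notes = [
    ("t-001", "cites a policy page that does not exist", "hallucinated_source"),
    ("t-002", "answer ignores the retrieved document", "ignored_context"),
    ("t-003", "invents a URL for the docs", "hallucinated_source"),
    ("t-004", "reply is in the wrong language", "wrong_language"),
    ("t-005", "retrieved context contradicts the answer", "ignored_context"),
]

# Count how often each failure mode appears across reviewed traces.
counts = Counter(mode for _, _, mode in coded_notes)


def stable_categories(counts, min_count=2):
    """Failure modes seen at least min_count times, sorted by name."""
    return sorted(mode for mode, n in counts.items() if n >= min_count)


labels = stable_categories(counts)
print(labels)
```

Categories that recur are candidates for structured labels and evaluation criteria; one-off notes can stay as free text until they show up again.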

Experiment

Use experiments to confirm that a prompt, model, retrieval setup, agent implementation, or evaluator variant improves quality without regressions.

Compare candidates on the same dataset and scoring criteria with experiments via UI for prompt and model changes, or experiments via SDK for application logic.
Run regression checks manually or in CI/CD before you ship to production.
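The comparison described above can be sketched without any SDK: run a baseline and a candidate on the same dataset with the same scorer, then gate on the result. The functions `exact_match`, `run_experiment`, and `passes_gate`, the two stand-in application variants, and the dataset are all hypothetical; a real setup would call your application and use the experiment tooling instead.

```python
def exact_match(output, expected):
    """Score 1.0 when output equals expected, ignoring case and whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(task, dataset, scorer):
    """Run task on every dataset item and return the mean score."""
    scores = [scorer(task(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


def passes_gate(baseline_score, candidate_score, tolerance=0.0):
    """True when the candidate is no worse than the baseline minus tolerance."""
    return candidate_score >= baseline_score - tolerance


# Made-up dataset shared by both variants.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]


# Stand-ins for two versions of the application under test.
def baseline(question):
    return {"2+2": "4", "capital of France": "paris"}.get(question, "unknown")


def candidate(question):
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(question, "unknown")


baseline_score = run_experiment(baseline, dataset, exact_match)
candidate_score = run_experiment(candidate, dataset, exact_match)
print(passes_gate(baseline_score, candidate_score))
```

In a CI job, a failing gate would typically exit with a non-zero status so the pull request or deploy is blocked.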

Evaluate

Judge experiment outputs before shipping. Start with manual review to understand quality and failure modes, then automate dedicated evaluators where they add repeatable signal. The tradeoffs are covered in evaluation methods.

Use manual evaluation to build intuition and calibrate automated evaluators.
Use scores via API/SDK for custom evaluation pipelines, guardrails, runtime checks, user feedback, and internal review workflows.
Use LLM-as-a-Judge for qualities that require language understanding, such as relevance, tone, completeness, or factuality.
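One way to calibrate an automated evaluator against manual review is to measure how often the two agree on the same items, corrected for chance. The sketch below computes Cohen's kappa over paired labels; the label values and sample data are made up, and kappa is one possible agreement metric rather than the one this page prescribes.

```python
def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label sequences of equal length."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if both raters labelled at random with their own rates.
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same eight outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

A low value suggests the judge prompt or criteria need revising before its scores are trusted for automated checks.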

Which Langfuse feature should I use?

If you want to...	Use this Langfuse feature
Capture application behavior	Observability, traces and observations
Segment traces for later review	Tags, metadata, users, sessions, environments, releases
Review examples manually	Annotation Queues, Scores via UI
Open Coding: capture open-ended notes	`TEXT` scores, Annotation Queues
Axial Coding: derive failure modes	Stable error categories, evaluation criteria
Create reusable test cases	Datasets
Compare changes before shipping	Experiments via UI, Experiments via SDK
Gate pull requests or deploys	CI/CD experiments
Monitor production quality	LLM-as-a-Judge, Scores via API/SDK
Analyze evaluator results	Score Analytics, custom dashboards
