Offline Evaluation
Test before you ship
Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.
Online Evaluation
Monitor in production
Evaluate real user interactions in real time to detect issues and measure quality on live traffic.
Evaluation workflow
Diagrams: offline evaluation flow and online evaluation flow.
1. Create a dataset
Create a dataset of examples from manually curated test cases, historical production traces, or synthetic data generation.
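A minimal sketch using the LangSmith Python SDK; the dataset name, questions, and reference answers here are placeholders:

```python
from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# Create an empty dataset, then attach curated input/output examples to it.
dataset = client.create_dataset(
    dataset_name="qa-regression-suite",
    description="Hand-curated questions with reference answers",
)

client.create_examples(
    inputs=[
        {"question": "What does offline evaluation mean in LangSmith?"},
        {"question": "Can evaluations run on live traffic?"},
    ],
    outputs=[
        {"answer": "Running evaluators over a curated dataset during development."},
        {"answer": "Yes, online evaluation scores real user interactions in production."},
    ],
    dataset_id=dataset.id,
)
```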
2. Define evaluators
Create evaluators to score performance; see the sketch after this list:
- Human review
- Heuristic rules
- LLM-as-judge
- Pairwise comparison
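A sketch of two custom heuristic evaluators in the run/example style accepted by the SDK; the `answer` field names match the placeholder dataset above. LLM-as-judge and pairwise evaluators follow the same function shape but call a model inside:

```python
from langsmith.schemas import Example, Run

def exact_match(run: Run, example: Example) -> dict:
    """Heuristic evaluator: 1 if the predicted answer matches the reference exactly."""
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted.strip() == reference.strip())}

def is_concise(run: Run, example: Example) -> dict:
    """Heuristic evaluator: 1 if the answer stays under a rough length budget."""
    predicted = (run.outputs or {}).get("answer", "")
    return {"key": "concise", "score": int(len(predicted) <= 300)}
```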
3. Run an experiment
Execute your application on the dataset to create an experiment. Configure repetitions, concurrency, and caching to optimize runs.
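A sketch of running an experiment with the SDK's `evaluate` helper, reusing the dataset and evaluators from the previous steps; `my_app` is a stand-in for your real chain or agent:

```python
from langsmith import evaluate

def my_app(inputs: dict) -> dict:
    """Target under test: call your real chain or agent here."""
    # Placeholder response; replace with your application's output.
    return {"answer": "Running evaluators over a curated dataset during development."}

results = evaluate(
    my_app,
    data="qa-regression-suite",            # dataset created in step 1
    evaluators=[exact_match, is_concise],  # evaluators defined in step 2
    experiment_prefix="baseline",
    num_repetitions=2,     # repeat each example to average out nondeterminism
    max_concurrency=4,     # run dataset examples in parallel
)
```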
4. Analyze results
Compare experiments for benchmarking, unit testing, regression testing, or backtesting.
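Side-by-side comparison and filtering are typically done in the LangSmith UI, but a rough sketch of aggregating scores from the `results` object returned above, assuming it is iterable and exposes per-row evaluator feedback as in recent SDK versions:

```python
from collections import defaultdict

# Aggregate evaluator scores from the experiment run above.
scores = defaultdict(list)
for row in results:                                   # one row per example run
    for feedback in row["evaluation_results"]["results"]:
        if feedback.score is not None:
            scores[feedback.key].append(feedback.score)

for key, values in scores.items():
    print(f"{key}: mean={sum(values) / len(values):.2f} across {len(values)} runs")
```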
Get started
Evaluation quickstart
Get started with offline evaluation.
Manage datasets
Create and manage datasets for evaluation through the UI or SDK.
Run offline evaluations
Explore evaluation types, techniques, and frameworks for comprehensive testing.
Analyze results
View and analyze evaluation results, compare experiments, filter data, and export findings.
Run online evaluations
Monitor production quality in real time from the Observability tab.
Follow tutorials
Learn by following step-by-step tutorials, from simple chatbots to complex agent evaluations.
To set up a LangSmith instance, visit the Platform setup section and choose a cloud, hybrid, or self-hosted deployment. All options include observability, evaluation, prompt engineering, and deployment.