LLM-as-a-Judge
- Hobby: Full
- Pro: Full
- Team: Full
- Self Hosted: Pro & Enterprise
LLM-as-a-judge is a technique for evaluating the quality of LLM applications by using an LLM as the judge. The LLM is given a trace or a dataset entry and asked to score the output and explain its reasoning. The resulting score and reasoning are stored as scores in Langfuse.
What are common evaluation tasks?
LLM-as-a-judge evaluation tasks can be very use-case-specific. Common tasks for which Langfuse provides prebuilt prompts are:
- Hallucination
- Helpfulness
- Relevance
- Toxicity
- Correctness
- Context relevance
- Context correctness
- Conciseness
LLM-as-a-judge evaluators in Langfuse help to evaluate:
Alternatively, you can run any custom evaluation functions or packages on Langfuse data via the API/SDKs.
Custom end-to-end example: External evaluation pipeline.
Get Started
Configure LLM provider
Langfuse supports a variety of LLM providers including OpenAI, Anthropic, Azure OpenAI, and AWS Bedrock.
To use LLM-as-a-judge, you have to configure your LLM provider in the Langfuse project settings.
Note: tool/function calling needs to be supported by the model for LLM-as-a-judge to work.
Create an LLM-as-a-judge template
LLM-as-a-judge uses a prompt template and model configuration to evaluate traces. In Langfuse, this configuration is stored in an Evaluator Template so that it can be reused across multiple evaluators.
To help get you started, Langfuse includes a set of predefined prompts for common evaluation tasks, but you can also write your own or customize the Langfuse-provided prompts.
Prompt templates contain {{variables}} that are substituted with actual data when an evaluator is run. You can create any number of custom variables and reference them later when creating the evaluator. Common variables are input, output, context, ground_truth, etc.
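The substitution step can be pictured with a small sketch. This is not Langfuse's implementation; the function names template_variables and render_template are hypothetical, and the only detail taken from the text above is the {{variable}} placeholder syntax.

```python
import re

# Matches {{name}}, allowing optional whitespace inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> list[str]:
    """Return the distinct variable names in order of first appearance."""
    seen: list[str] = []
    for name in _VARIABLE.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def render_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} with values[name]; fail loudly if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value mapped for template variable '{name}'")
        return str(values[name])

    return _VARIABLE.sub(substitute, template)


# Illustrative template and values (made up for this example).
prompt = render_template(
    "Question: {{input}}\nAnswer: {{output}}\nIs the answer relevant?",
    {"input": "What is the capital of France?", "output": "Paris"},
)
```

Raising on a missing value, rather than leaving the placeholder in the prompt, is a design choice of this sketch: an unfilled variable would otherwise reach the judge model unnoticed.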
Langfuse uses function/tool calling to extract the evaluation output. At the bottom of the form, you can configure the score and reasoning variables, which are used to instruct the LLM on how to score and reason about the evaluation.
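To illustrate how tool calling can yield a structured score and reasoning, here is a minimal sketch. The schema Langfuse actually sends is not shown in this document, so everything below is an assumption: the tool name record_evaluation is hypothetical, and the dictionary follows the OpenAI-style function-tool layout only as one plausible shape.

```python
import json


def build_score_tool(score_description: str, reasoning_description: str) -> dict:
    """Build a hypothetical tool definition asking for a score and its reasoning."""
    return {
        "type": "function",
        "function": {
            "name": "record_evaluation",  # hypothetical name, not from Langfuse
            "description": "Record the evaluation result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reasoning": {
                        "type": "string",
                        "description": reasoning_description,
                    },
                    "score": {
                        "type": "number",
                        "description": score_description,
                    },
                },
                "required": ["reasoning", "score"],
            },
        },
    }


def parse_evaluation(arguments_json: str) -> tuple[float, str]:
    """Validate the JSON arguments of the model's tool call."""
    args = json.loads(arguments_json)
    score = args["score"]
    reasoning = args["reasoning"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be numeric")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")
    return float(score), reasoning


tool = build_score_tool(
    score_description="Score between 0 and 1, where 1 is fully relevant.",
    reasoning_description="One short paragraph explaining the score.",
)
score, reasoning = parse_evaluation('{"reasoning": "Answers the question.", "score": 0.9}')
```

The descriptions passed to build_score_tool play the role of the score and reasoning settings in the form: they are the only place where the judge model is told what the number means.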
Currently, LLM-as-a-judge templates only support numeric scores. Support for categorical and boolean scores is on our roadmap (GitHub Issue).
Set up an evaluator
Now that you have created an evaluator template, you can configure which data Langfuse should apply it to.
Here we need to configure the following aspects:
- Which Evaluator Template to use
- Trigger: On what incoming data should the evaluator be executed?
- Name of the scores which will be created as a result of the evaluation.
- Specify how Langfuse should fill the {{variables}} in the template.
  - Langfuse traces can be deeply nested (see conceptual overview). You can query from the trace directly, or from any nested observation via its name.
  - Select whether to use the Input, Output, or metadata value.
- Optional: Add sampling to reduce costs when running evaluations on a large volume of production data.
- Optional: Configure a custom delay. This ensures that all data has arrived at the Langfuse servers before the evaluation is executed. The delay starts when the trace is first added to Langfuse, at which point the trace might still be in progress. This is especially important for long-running agent executions.
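The variable mapping described in the list above can be sketched as follows. The trace layout used here (plain dictionaries with input, output, metadata, and a nested observations list) is a simplified assumption for illustration and not the Langfuse data model; resolve_variable and resolve_mapping are hypothetical helpers.

```python
from typing import Any, Iterator, Optional

FIELDS = ("input", "output", "metadata")


def iter_observations(node: dict) -> Iterator[dict]:
    """Yield every observation nested under a trace or observation, depth-first."""
    for child in node.get("observations", []):
        yield child
        yield from iter_observations(child)


def resolve_variable(trace: dict, observation_name: Optional[str], field: str) -> Any:
    """Read one field from the trace itself (name is None) or from a named observation."""
    if field not in FIELDS:
        raise ValueError(f"field must be one of {FIELDS}")
    if observation_name is None:
        return trace.get(field)
    for observation in iter_observations(trace):
        if observation.get("name") == observation_name:
            return observation.get(field)
    raise LookupError(f"no observation named '{observation_name}'")


def resolve_mapping(
    trace: dict, mapping: dict[str, tuple[Optional[str], str]]
) -> dict[str, Any]:
    """Resolve every template variable to a concrete value."""
    return {
        variable: resolve_variable(trace, name, field)
        for variable, (name, field) in mapping.items()
    }


# Made-up trace with one observation nested inside another.
trace = {
    "input": "What is Langfuse?",
    "output": "An LLM engineering platform.",
    "observations": [
        {
            "name": "retrieval",
            "output": "raw documents",
            "observations": [{"name": "rerank", "output": "top document"}],
        }
    ],
}
values = resolve_mapping(
    trace,
    {
        "input": (None, "input"),
        "output": (None, "output"),
        "context": ("rerank", "output"),
    },
)
```

Here the context variable is filled from the output of the nested rerank observation, while input and output come from the trace directly.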
✨ Done! You have created an evaluator which will now automatically be executed on all data that matches the selected trigger.
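If sampling or a delay is configured, not every matching trace is evaluated, and none is evaluated immediately. A minimal sketch of that behaviour, assuming a sampling rate between 0 and 1 and a delay in seconds (how Langfuse implements this internally is not described here, and both function names are hypothetical):

```python
import random
from datetime import datetime, timedelta, timezone


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate` of calls (0.0 = never, 1.0 = always)."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    # rng.random() is in [0.0, 1.0), so a rate of 1.0 always passes.
    return rng.random() < sampling_rate


def earliest_run_time(trace_created_at: datetime, delay_seconds: float) -> datetime:
    """The delay is counted from when the trace was first added."""
    return trace_created_at + timedelta(seconds=delay_seconds)


rng = random.Random(0)  # seeded only to make the example repeatable
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
if should_evaluate(0.25, rng):
    run_at = earliest_run_time(created, delay_seconds=30)
```

With a rate of 0.25, about one in four matching traces is evaluated, which reduces evaluation cost by roughly the same factor.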
Monitoring of Evaluators
Each evaluator has its own log page where you can view its progress and logs to debug any issues.