Add faithfulness check#2556
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new LLM-based check, Faithfulness, along with its structured output model FaithfulnessCheckResult, a Jinja2 prompt template, and corresponding unit tests. The check evaluates whether an AI agent's answer faithfully represents the provided source material. The feedback recommends validating that both the answer and source are successfully resolved before running the check to prevent sending unresolved inputs (like None or NoMatch) to the LLM, and suggests adding a unit test to verify this error handling.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| async def get_inputs(self, trace: Trace[InputType, OutputType]) -> dict[str, Any]: | ||
| answer = provided_or_resolve( | ||
| trace, | ||
| key=self.answer_key, | ||
| value=provide_not_none(self.answer), | ||
| ) | ||
| source = self._resolve_source(trace) | ||
|
|
||
| return { | ||
| "answer": answer, | ||
| "source": self._format_source(source), | ||
| "threshold": self.threshold, | ||
| } |
There was a problem hiding this comment.
If answer or source cannot be resolved (resulting in None or NoMatch), the check currently proceeds and passes the stringified representation (e.g., "None" or "NoMatch(...)") to the LLM. This leads to incorrect evaluations and wasted API calls.
We should validate that both answer and source are successfully resolved and are not None before proceeding.
@override
async def get_inputs(self, trace: Trace[InputType, OutputType]) -> dict[str, Any]:
answer = provided_or_resolve(
trace,
key=self.answer_key,
value=provide_not_none(self.answer),
)
if answer is None or isinstance(answer, NoMatch):
raise ValueError(
f"Could not resolve answer from trace using key '{self.answer_key}' "
"and no direct answer was provided."
)
source = self._resolve_source(trace)
if source is None or isinstance(source, NoMatch):
raise ValueError(
f"Could not resolve source from trace using key '{self.source_key}' "
"and no direct source was provided."
)
return {
"answer": answer,
"source": self._format_source(source),
"threshold": self.threshold,
}There was a problem hiding this comment.
Closing to consolidate efforts. The feedback here is solid and actionable — will revisit in the next iteration if we return to faithfulness evals. For now, focusing on #2555 which has a strategic duplicate scenario to resolve first. Thanks for the thorough review.—
| assert data["kind"] == "faithfulness" | ||
| assert isinstance(restored, Faithfulness) | ||
| assert restored.answer_key == "trace.last.outputs.answer" | ||
| assert restored.source_key == "trace.last.metadata.source" | ||
| assert restored.threshold == 0.9 |
There was a problem hiding this comment.
It would be beneficial to add a unit test to verify that a ValueError is raised when the answer or source cannot be resolved from the trace.
| assert data["kind"] == "faithfulness" | |
| assert isinstance(restored, Faithfulness) | |
| assert restored.answer_key == "trace.last.outputs.answer" | |
| assert restored.source_key == "trace.last.metadata.source" | |
| assert restored.threshold == 0.9 | |
| assert data["kind"] == "faithfulness" | |
| assert isinstance(restored, Faithfulness) | |
| assert restored.answer_key == "trace.last.outputs.answer" | |
| assert restored.source_key == "trace.last.metadata.source" | |
| assert restored.threshold == 0.9 | |
| async def test_unresolved_inputs_raise_value_error() -> None: | |
| check = Faithfulness( | |
| answer_key="trace.last.outputs.non_existent", | |
| source_key="trace.last.metadata.non_existent", | |
| ) | |
| with pytest.raises(ValueError, match="Could not resolve answer"): | |
| await check.run(Trace()) |
Adds a Faithfulness check for evaluating whether an answer is grounded in the provided source material.
The check follows the existing LLM judge pattern, uses a structured judge response with a 0-1 faithfulness score, and supports a configurable passing threshold. Source material can be passed directly or resolved from the trace with
source_key; answers can also be passed directly or resolved withanswer_key.Includes unit tests for passing and failing cases, threshold handling, trace-based inputs, prompt contents, serialization, and exports.
Validation:
make test-unit PACKAGE=giskard-checksmake checkCloses #2368