Skip to content

Add faithfulness check#2556

Open
josediegorobles wants to merge 2 commits into
Giskard-AI:mainfrom
josediegorobles:add-faithfulness-check
Open

Add faithfulness check#2556
josediegorobles wants to merge 2 commits into
Giskard-AI:mainfrom
josediegorobles:add-faithfulness-check

Conversation

@josediegorobles

Copy link
Copy Markdown

Adds a Faithfulness check for evaluating whether an answer is grounded in the provided source material.

The check follows the existing LLM judge pattern, uses a structured judge response with a 0-1 faithfulness score, and supports a configurable passing threshold. Source material can be passed directly or resolved from the trace with source_key; answers can also be passed directly or resolved with answer_key.

Includes unit tests for passing and failing cases, threshold handling, trace-based inputs, prompt contents, serialization, and exports.

Validation:

  • make test-unit PACKAGE=giskard-checks
  • make check

Closes #2368

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new LLM-based check, Faithfulness, along with its structured output model FaithfulnessCheckResult, a Jinja2 prompt template, and corresponding unit tests. The check evaluates whether an AI agent's answer faithfully represents the provided source material. The feedback recommends validating that both the answer and source are successfully resolved before running the check to prevent sending unresolved inputs (like None or NoMatch) to the LLM, and suggests adding a unit test to verify this error handling.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +65 to +77
async def get_inputs(self, trace: Trace[InputType, OutputType]) -> dict[str, Any]:
answer = provided_or_resolve(
trace,
key=self.answer_key,
value=provide_not_none(self.answer),
)
source = self._resolve_source(trace)

return {
"answer": answer,
"source": self._format_source(source),
"threshold": self.threshold,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If answer or source cannot be resolved (resulting in None or NoMatch), the check currently proceeds and passes the stringified representation (e.g., "None" or "NoMatch(...)") to the LLM. This leads to incorrect evaluations and wasted API calls.

We should validate that both answer and source are successfully resolved and are not None before proceeding.

    @override
    async def get_inputs(self, trace: Trace[InputType, OutputType]) -> dict[str, Any]:
        answer = provided_or_resolve(
            trace,
            key=self.answer_key,
            value=provide_not_none(self.answer),
        )
        if answer is None or isinstance(answer, NoMatch):
            raise ValueError(
                f"Could not resolve answer from trace using key '{self.answer_key}' "
                "and no direct answer was provided."
            )

        source = self._resolve_source(trace)
        if source is None or isinstance(source, NoMatch):
            raise ValueError(
                f"Could not resolve source from trace using key '{self.source_key}' "
                "and no direct source was provided."
            )

        return {
            "answer": answer,
            "source": self._format_source(source),
            "threshold": self.threshold,
        }

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing to consolidate efforts. The feedback here is solid and actionable — will revisit in the next iteration if we return to faithfulness evals. For now, focusing on #2555 which has a strategic duplicate scenario to resolve first. Thanks for the thorough review.—

Comment on lines +236 to +240
assert data["kind"] == "faithfulness"
assert isinstance(restored, Faithfulness)
assert restored.answer_key == "trace.last.outputs.answer"
assert restored.source_key == "trace.last.metadata.source"
assert restored.threshold == 0.9

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It would be beneficial to add a unit test to verify that a ValueError is raised when the answer or source cannot be resolved from the trace.

Suggested change
assert data["kind"] == "faithfulness"
assert isinstance(restored, Faithfulness)
assert restored.answer_key == "trace.last.outputs.answer"
assert restored.source_key == "trace.last.metadata.source"
assert restored.threshold == 0.9
assert data["kind"] == "faithfulness"
assert isinstance(restored, Faithfulness)
assert restored.answer_key == "trace.last.outputs.answer"
assert restored.source_key == "trace.last.metadata.source"
assert restored.threshold == 0.9
async def test_unresolved_inputs_raise_value_error() -> None:
check = Faithfulness(
answer_key="trace.last.outputs.non_existent",
source_key="trace.last.metadata.non_existent",
)
with pytest.raises(ValueError, match="Could not resolve answer"):
await check.run(Trace())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Add faithfulness check

1 participant