feat(checks): add Bias LLM judge check #2440
Hi @kevinmessiaen, I've opened this draft PR for issue #2366. Could you please add the "safe for build" label so CI can run? Thank you!
Code Review
This pull request introduces a new Bias check to the Giskard checks library, designed to detect stereotyping, discrimination, and unfair representation in AI agent responses. The implementation includes the Bias class, a Jinja2 prompt template, and comprehensive unit tests. Feedback was provided regarding the get_inputs method in bias.py, specifically pointing out that the current logic does not correctly handle NoMatch or None values during data resolution, which could result in passing incorrect string representations to the LLM judge.
context: str | None = None
if self.context_key is not None:
    resolved = provided_or_resolve(
        trace, key=self.context_key, value=provide_not_none(None)
    )
    if not isinstance(resolved, NoMatch):
        context = str(resolved)

return {
    "trace": trace,
    "output": str(
        provided_or_resolve(
            trace,
            key=self.key,
            value=provide_not_none(self.output),
        )
    ),
    "protected_attributes": attributes,
    "context": context,
}
The current implementation of get_inputs has a few issues with data resolution:
- NoMatch handling for output: If provided_or_resolve fails to find the output (e.g., the JSONPath in key is invalid for the trace), it returns a NoMatch object. Calling str() on it (lines 151-157) will pass the string representation of the NoMatch object to the LLM judge, which is incorrect. The check should fail if the required output cannot be resolved.
- None handling for context: If the resolved context is None (e.g., a null value in the trace), str(resolved) results in the string "None". In the Jinja template, {% if context %} will evaluate to true for the string "None", causing the LLM to see <CONTEXT>None</CONTEXT>, which is likely unintended.
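Both failure modes can be reproduced in plain Python. The sketch below uses a hypothetical stand-in NoMatch class (the real sentinel may render differently under str()) and approximates the template's {% if context %} test with Python truthiness, which is what Jinja applies to ordinary string values.

```python
from __future__ import annotations


class NoMatch:
    """Hypothetical stand-in for the sentinel returned when a JSONPath finds nothing."""

    def __repr__(self) -> str:
        return "NoMatch()"


def naive_output(resolved: object) -> str:
    # Mirrors the original get_inputs: str() is applied unconditionally.
    return str(resolved)


def naive_context(resolved: object) -> str | None:
    # Mirrors the original get_inputs: NoMatch is filtered out, None is not.
    if not isinstance(resolved, NoMatch):
        return str(resolved)
    return None


# Issue 1: an unresolved output turns into a non-empty string for the judge.
leaked_output = naive_output(NoMatch())
print(repr(leaked_output))  # 'NoMatch()'

# Issue 2: a null context turns into the truthy string "None",
# so an {% if context %} block in the template would still render.
leaked_context = naive_context(None)
print(repr(leaked_context), bool(leaked_context))  # 'None' True
```

Adding "and resolved is not None" to the context guard, and raising when the output is NoMatch or None, closes both gaps.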
I suggest refactoring the resolution logic to handle these cases explicitly.
# Resolve context if context_key is provided
context: str | None = None
if self.context_key is not None:
    resolved = provided_or_resolve(
        trace, key=self.context_key, value=provide_not_none(None)
    )
    if not isinstance(resolved, NoMatch) and resolved is not None:
        context = str(resolved)

# Resolve output
resolved_output = provided_or_resolve(
    trace,
    key=self.key,
    value=provide_not_none(self.output),
)
if isinstance(resolved_output, NoMatch) or resolved_output is None:
    raise ValueError(f"Could not resolve output for bias check using key '{self.key}'")

return {
    "trace": trace,
    "output": str(resolved_output),
    "protected_attributes": attributes,
    "context": context,
}

DEFAULT_PROTECTED_ATTRIBUTES: list[str] = [
    "gender",
    "race",
    "age",
    "religion",
    "nationality",
    "sexual_orientation",
    "socioeconomic_status",
    "disability",
]
What did you base this on? Should we add more categories or descriptions to make it more explicit?
These categories are based on commonly recognised protected attributes in AI fairness literature, specifically aligned with the EU AI Act's list of prohibited discrimination grounds and DeepEval's BiasMetric categories. Happy to add more explicit descriptions per attribute if that would help (e.g. what counts as gender bias vs. race bias). Would a Literal type with a docstring per value work, or do you prefer keeping plain strings?
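A sketch of what the Literal option might look like. ProtectedAttribute and is_protected_attribute are hypothetical names, not part of the PR. One caveat: individual Literal values cannot carry their own docstrings, so per-value descriptions would have to live in comments, the alias's documentation, or a separate mapping.

```python
from typing import Literal, get_args

# Hypothetical alias; the PR itself uses a plain list[str].
ProtectedAttribute = Literal[
    "gender",
    "race",
    "age",
    "religion",
    "nationality",
    "sexual_orientation",
    "socioeconomic_status",
    "disability",
]

# Derive the runtime default from the type so the two cannot drift apart.
DEFAULT_PROTECTED_ATTRIBUTES: list[str] = list(get_args(ProtectedAttribute))


def is_protected_attribute(value: str) -> bool:
    """Return True if value is one of the built-in attribute names."""
    return value in get_args(ProtectedAttribute)


print(DEFAULT_PROTECTED_ATTRIBUTES[:3])  # ['gender', 'race', 'age']
print(is_protected_attribute("politics"))  # False
```

The trade-off: if the pydantic field were typed list[ProtectedAttribute], validation would reject custom attribute names, which plain strings allow.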
Hi, I think it would be nice to add the specific references and documents we derive this from.
Thanks! Here are the specific references:
- EU AI Act, Article 5 and Annex III: lists protected characteristics including sex, race, ethnicity, religion, disability, age, and sexual orientation as prohibited discrimination grounds
- DeepEval BiasMetric: https://docs.confident-ai.com/docs/metrics-bias (uses gender, religion, race, and politics as core categories)
- ISO/IEC 24368:2022: AI fairness standard referencing demographic attributes
I can add these as inline comments above DEFAULT_PROTECTED_ATTRIBUTES in the code if that works.
    default="trace.last.outputs",
    description="JSONPath expression to extract the output to evaluate from the trace.",
)
protected_attributes: list[str] | None = Field(
Do you feel there is a way to add more nuance to this?
Good point — one way to add nuance would be to support a severity_threshold (e.g. ignore minor imprecision, only flag clear stereotyping) or allow per-attribute custom descriptions so users can tailor what "gender bias" means in their context. Would either direction align with what you had in mind?
How do you see a severity_threshold working reliably in an LLM setting?
The per-attribute descriptions could work too, but how would you integrate them?
Good pushback — severity_threshold is tricky in an LLM setting because the model's confidence isn't reliably calibrated, so a numeric threshold would be arbitrary.
Per-attribute descriptions are more practical. I'd integrate them as an optional attribute_descriptions: dict[str, str] | None field; if provided, each value overrides the generic description for that attribute in the Jinja template. For example:
attribute_descriptions={"gender": "Look for assumptions about professional roles based on gender"}
Would you like me to implement this instead?
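A minimal sketch of the proposed override logic, assuming each attribute has a generic fallback description. The helper name build_attribute_guidance and the fallback wording are hypothetical; in the PR the merged descriptions would be passed to bias.j2 rather than joined in Python.

```python
from __future__ import annotations


def build_attribute_guidance(
    attributes: list[str],
    attribute_descriptions: dict[str, str] | None = None,
) -> list[str]:
    """Return one guidance line per attribute, preferring user-supplied descriptions."""
    overrides = attribute_descriptions or {}
    lines = []
    for attribute in attributes:
        # Hypothetical generic fallback, used when no override is provided.
        default = f"Look for stereotyping or unfair treatment related to {attribute.replace('_', ' ')}"
        lines.append(f"- {attribute}: {overrides.get(attribute, default)}")
    return lines


guidance = build_attribute_guidance(
    ["gender", "age"],
    attribute_descriptions={"gender": "Look for assumptions about professional roles based on gender"},
)
print("\n".join(guidance))
# - gender: Look for assumptions about professional roles based on gender
# - age: Look for stereotyping or unfair treatment related to age
```

In this sketch, descriptions for attributes outside the active list are silently ignored; warning on unknown keys would be a reasonable alternative.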
What did you base this prompt on, and do you have any references? It would be great to understand how it was composed and how it might capture bias.
The prompt structure was inspired by DeepEval's BiasMetric evaluation criteria and the Giskard red-teaming bias/fairness documentation. The five bias types (stereotyping, unfair generalisation, exclusionary language, differential treatment, contextual endorsement) are drawn from academic fairness literature. Happy to add a comment block at the top of the template citing these references if that would be useful.
Can you specifically mention the URLs and reasoning?
The prompt was composed based on:
- DeepEval BiasMetric: https://docs.confident-ai.com/docs/metrics-bias
- Giskard bias/fairness red-teaming docs: https://docs.giskard.ai/en/stable/knowledge/key_vulnerabilities/ethics/index.html
- Blodgett et al. (2020), "Language (Technology) is Power": https://aclanthology.org/2020.acl-main.485 (academic taxonomy of bias types in NLP)
I can add these as a comment block at the top of bias.j2 for traceability.
Hi @kevinmessiaen @davidberenstein1957, just following up. Could you add the "safe for build" label so CI can run? Happy to address any further feedback once the checks are green. Thanks!
Supports protected_attributes and context_key per issue spec. Fixes Giskard-AI#2366 git add libs/giskard-checks/src/giskard/checks/__init__.py#
- Raise ValueError if output cannot be resolved
- Guard against str(None) being passed as context

Addresses review feedback from gemini-code-assist
f33d22e to 8841753
davidberenstein1957 left a comment
Hi, I added some nuance and follow-ups.
What does this PR do?
Adds a Bias built-in LLM check that detects stereotyping, discrimination, and unfair representation across configurable demographic dimensions.
Why?
Closes #2366. Bias detection is a core Giskard mission, and no built-in check existed.
How?
Follows the exact pattern of the existing Toxicity check:
- BaseLLMCheck, registered as "bias"
- prompts/judges/bias.j2
- protected_attributes: list[str] | None for filtering
- context_key for evaluating relative bias against input
Testing
tests/builtin/test_bias.py
Fixes #2366
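How the protected_attributes: list[str] | None filter could behave, sketched under two assumptions that the PR does not state: None means "evaluate every default attribute", and custom attribute names are allowed. resolve_protected_attributes is a hypothetical helper, and rejecting an empty list is an illustrative design choice, not the PR's behaviour.

```python
from __future__ import annotations

DEFAULT_PROTECTED_ATTRIBUTES: list[str] = [
    "gender",
    "race",
    "age",
    "religion",
    "nationality",
    "sexual_orientation",
    "socioeconomic_status",
    "disability",
]


def resolve_protected_attributes(protected_attributes: list[str] | None) -> list[str]:
    """Return the attributes the judge should focus on (hypothetical helper)."""
    if protected_attributes is None:
        # No filter: evaluate every default attribute (returns a copy).
        return list(DEFAULT_PROTECTED_ATTRIBUTES)
    if not protected_attributes:
        # Assumption: an empty filter is more likely a mistake than a request to check nothing.
        raise ValueError("protected_attributes must be None or a non-empty list")
    # Drop duplicates while keeping the caller's order.
    return list(dict.fromkeys(protected_attributes))


print(resolve_protected_attributes(["age", "gender", "age"]))  # ['age', 'gender']
```

Returning a copy keeps callers from mutating the module-level default by accident.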