feat(checks): add Bias LLM judge check#2440

Open
Kushagra651 wants to merge 2 commits into
Giskard-AI:mainfrom
Kushagra651:feat/bias-check

@Kushagra651


What does this PR do?

Adds a Bias built-in LLM check that detects stereotyping, discrimination,
and unfair representation across configurable demographic dimensions.

Why?

Closes #2366 — bias detection is a core Giskard mission and no built-in check existed.

How?

Follows the exact pattern of the existing Toxicity check:

  • Subclasses BaseLLMCheck, registered as "bias"
  • Jinja2 prompt at prompts/judges/bias.j2
  • Supports protected_attributes: list[str] | None for filtering
  • Supports context_key for evaluating relative bias against input
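
The option surface listed above can be sketched without Giskard installed. This is a minimal illustration, not the PR's code: `BiasOptions` is a hypothetical stand-in for the real `Bias` class, and the rule that `None` falls back to the full default list is an assumption about how `protected_attributes` is meant to behave.

```python
from dataclasses import dataclass
from typing import Optional

# Default list as quoted in this PR's diff.
DEFAULT_PROTECTED_ATTRIBUTES: list[str] = [
    "gender", "race", "age", "religion", "nationality",
    "sexual_orientation", "socioeconomic_status", "disability",
]


@dataclass
class BiasOptions:
    """Hypothetical stand-in for the check's fields; not the real Bias class."""

    key: str = "trace.last.outputs"
    protected_attributes: Optional[list[str]] = None
    context_key: Optional[str] = None

    def attributes(self) -> list[str]:
        # Assumption: None means "evaluate every default attribute".
        if self.protected_attributes is None:
            return list(DEFAULT_PROTECTED_ATTRIBUTES)
        # An explicit list (even an empty one) is returned unchanged.
        return list(self.protected_attributes)
```

With this shape, `BiasOptions(protected_attributes=["age"]).attributes()` narrows the evaluation to a single dimension, while the default constructor covers all eight.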

Testing

  • 14 unit tests in tests/builtin/test_bias.py
  • All 4 acceptance criteria from the issue covered

Fixes #2366

@Kushagra651

Author

Hi @kevinmessiaen — I've opened this draft PR for issue #2366. Could you please add the safe for build label so CI can run? Thank you!

@Kushagra651 Kushagra651 marked this pull request as ready for review May 9, 2026 13:19

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new Bias check to the Giskard checks library, designed to detect stereotyping, discrimination, and unfair representation in AI agent responses. The implementation includes the Bias class, a Jinja2 prompt template, and comprehensive unit tests. Feedback was provided regarding the get_inputs method in bias.py, specifically pointing out that the current logic does not correctly handle NoMatch or None values during data resolution, which could result in passing incorrect string representations to the LLM judge.

Comment on lines +141 to +160
        context: str | None = None
        if self.context_key is not None:
            resolved = provided_or_resolve(
                trace, key=self.context_key, value=provide_not_none(None)
            )
            if not isinstance(resolved, NoMatch):
                context = str(resolved)

        return {
            "trace": trace,
            "output": str(
                provided_or_resolve(
                    trace,
                    key=self.key,
                    value=provide_not_none(self.output),
                )
            ),
            "protected_attributes": attributes,
            "context": context,
        }

Contributor

medium

The current implementation of get_inputs has a few issues with data resolution:

  1. NoMatch handling for output: If provided_or_resolve fails to find the output (e.g., the JSONPath in key is invalid for the trace), it returns a NoMatch object. Calling str() on it (lines 151-157) will pass the string representation of the NoMatch object to the LLM judge, which is incorrect. The check should fail if the required output cannot be resolved.
  2. None handling for context: If the resolved context is None (e.g., a null value in the trace), str(resolved) results in the string "None". In the Jinja template, {% if context %} will evaluate to true for the string "None", causing the LLM to see <CONTEXT>None</CONTEXT>, which is likely unintended.

I suggest refactoring the resolution logic to handle these cases explicitly.

        # Resolve context if context_key is provided
        context: str | None = None
        if self.context_key is not None:
            resolved = provided_or_resolve(
                trace, key=self.context_key, value=provide_not_none(None)
            )
            if not isinstance(resolved, NoMatch) and resolved is not None:
                context = str(resolved)

        # Resolve output
        resolved_output = provided_or_resolve(
            trace,
            key=self.key,
            value=provide_not_none(self.output),
        )
        if isinstance(resolved_output, NoMatch) or resolved_output is None:
            raise ValueError(f"Could not resolve output for bias check using key '{self.key}'")

        return {
            "trace": trace,
            "output": str(resolved_output),
            "protected_attributes": attributes,
            "context": context,
        }
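
Both failure modes described in this review can be reproduced in isolation. The sketch below is a hedged illustration: `NoMatch` here is a hypothetical stand-in for the resolver's sentinel, and `context_or_none` and `required_output` are illustrative helper names that do not appear in the PR.

```python
from typing import Any, Optional


class NoMatch:
    """Hypothetical stand-in for the resolver's 'nothing found' sentinel."""


def context_or_none(resolved: Any) -> Optional[str]:
    # Treat both "not found" and an explicit null as "no context".
    if isinstance(resolved, NoMatch) or resolved is None:
        return None
    return str(resolved)


def required_output(resolved: Any, key: str) -> str:
    # A missing output is an error, not something to stringify.
    if isinstance(resolved, NoMatch) or resolved is None:
        raise ValueError(f"Could not resolve output using key '{key}'")
    return str(resolved)


# The pitfall: str(None) is the non-empty string "None", which is truthy,
# so a template guard such as `if context` would not filter it out.
assert str(None) == "None" and bool(str(None))
```

Keeping the two helpers separate reflects the asymmetry in the review: context is optional and may silently be absent, whereas the output is required and its absence should stop the check.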

Comment on lines +19 to +28
DEFAULT_PROTECTED_ATTRIBUTES: list[str] = [
    "gender",
    "race",
    "age",
    "religion",
    "nationality",
    "sexual_orientation",
    "socioeconomic_status",
    "disability",
]

Member

What did you base this on? Should we add more categories or descriptions to make it more explicit?

Author

These categories are based on commonly recognised protected attributes in AI fairness literature — specifically aligned with the EU AI Act's list of prohibited discrimination grounds and DeepEval's BiasMetric categories. Happy to add more explicit descriptions per attribute if that would help (e.g. what counts as gender bias vs race bias). Would a Literal type with docstring per value work, or do you prefer keeping it as plain strings?

Member

Hi, it would be good to add the specific references and files this list is derived from.

Author

Thanks! Here are the specific references:

  • EU AI Act, Article 5 & Annex III: lists protected characteristics including sex, race, ethnicity, religion, disability, age, and sexual orientation as prohibited discrimination grounds
  • DeepEval BiasMetric (https://docs.confident-ai.com/docs/metrics-bias): uses gender, religion, race, politics as core categories
  • ISO/IEC 24368:2022: AI fairness standard referencing demographic attributes

I can add these as inline comments above DEFAULT_PROTECTED_ATTRIBUTES in the code if that works.
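
One way the proposed inline comments might look is sketched below. The source list simply repeats what the author cites in this thread and has not been independently verified here; the comment wording is illustrative, not taken from the PR.

```python
# Sources cited by the PR author for this list (not independently verified):
#   - EU AI Act, Article 5 and Annex III
#   - DeepEval BiasMetric: https://docs.confident-ai.com/docs/metrics-bias
#   - ISO/IEC 24368:2022
DEFAULT_PROTECTED_ATTRIBUTES: list[str] = [
    "gender",
    "race",
    "age",
    "religion",
    "nationality",
    "sexual_orientation",
    "socioeconomic_status",
    "disability",
]
```

Keeping the citations next to the constant means a later change to the list has to be made beside its stated justification.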

default="trace.last.outputs",
description="JSONPath expression to extract the output to evaluate from the trace.",
)
protected_attributes: list[str] | None = Field(

Member

Do you feel there is a way to add more nuance to this?

Author

Good point — one way to add nuance would be to support a severity_threshold (e.g. ignore minor imprecision, only flag clear stereotyping) or allow per-attribute custom descriptions so users can tailor what "gender bias" means in their context. Would either direction align with what you had in mind?

Member

How do you see this severity_threshold working reliably in an LLM setting?
The per-attribute descriptions could work too, but how would you integrate them?

Author

Good pushback — severity_threshold is tricky in an LLM setting because the model's confidence isn't reliably calibrated, so a numeric threshold would be arbitrary.
Per-attribute descriptions are more practical. I'd integrate them as an optional attribute_descriptions: dict[str, str] | None field — if provided, the value overrides the generic description for that attribute in the Jinja template. For example:
    attribute_descriptions={"gender": "Look for assumptions about professional roles based on gender"}
Would you like me to implement this instead?
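
The override behaviour proposed here can be sketched in plain Python. This is only an illustration under stated assumptions: `describe_attributes` and `GENERIC_DESCRIPTION` are hypothetical names, the fallback wording is invented for the example, and the list comprehension stands in for whatever loop the Jinja template would actually use.

```python
from typing import Optional

# Hypothetical generic fallback text; the real template wording is not shown in this thread.
GENERIC_DESCRIPTION = "Look for stereotyping, discrimination, or unfair representation."


def describe_attributes(
    attributes: list[str],
    attribute_descriptions: Optional[dict[str, str]] = None,
) -> list[str]:
    """Return one prompt line per attribute, preferring user-supplied text."""
    overrides = attribute_descriptions or {}
    return [
        f"- {name}: {overrides.get(name, GENERIC_DESCRIPTION)}"
        for name in attributes
    ]


lines = describe_attributes(
    ["gender", "age"],
    {"gender": "Look for assumptions about professional roles based on gender"},
)
print("\n".join(lines))
```

In this sketch a description supplied for an attribute that is not being evaluated is silently ignored; whether that case should raise a validation error instead is a design choice the thread leaves open.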

Member

What did you base this prompt on, and do you have any references? It would be great to understand how it was composed and how it might capture bias.

Author

The prompt structure was inspired by DeepEval's BiasMetric evaluation criteria and the Giskard red-teaming bias/fairness documentation. The five bias types (stereotyping, unfair generalisation, exclusionary language, differential treatment, contextual endorsement) are drawn from academic fairness literature. Happy to add a comment block at the top of the template citing these references if that would be useful.

Member

Can you specifically mention the URLs and reasoning?

Author

The prompt was composed based on:

  • DeepEval BiasMetric: https://docs.confident-ai.com/docs/metrics-bias
  • Giskard bias/fairness red-teaming docs: https://docs.giskard.ai/en/stable/knowledge/key_vulnerabilities/ethics/index.html
  • Blodgett et al. (2020), "Language (Technology) is Power" (https://aclanthology.org/2020.acl-main.485): academic taxonomy of bias types in NLP

I can add these as a comment block at the top of bias.j2 for traceability.

@Kushagra651

Author

Hi @kevinmessiaen @davidberenstein1957 — just following up. Could you add the safe for build label so CI can run? Happy to address any further feedback once the checks are green. Thanks!

Supports protected_attributes and context_key per issue spec.

Fixes Giskard-AI#2366

git add libs/giskard-checks/src/giskard/checks/__init__.py#
- Raise ValueError if output cannot be resolved
- Guard against str(None) being passed as context

Addresses review feedback from gemini-code-assist

@davidberenstein1957 davidberenstein1957 left a comment

Member

Hi, I added some nuance and follow-ups.



Development

Successfully merging this pull request may close these issues.

Add bias check

2 participants