Skip to content

feat(datasource): implement get_dataframe_by_tag for drift metric reference data#183

Open
SudipSinha wants to merge 2 commits into
mainfrom
fix/missing-get-dataframe-by-tag
Open

feat(datasource): implement get_dataframe_by_tag for drift metric reference data#183
SudipSinha wants to merge 2 commits into
mainfrom
fix/missing-get-dataframe-by-tag

Conversation

@SudipSinha

@SudipSinha SudipSinha commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

All drift metric endpoints call data_source.get_dataframe_by_tag(model_id, reference_tag) to retrieve reference data, but the method doesn't exist on DataSource. Every drift computation fails with AttributeError and HTTP 500.

Root cause

The drift endpoints were written to call get_dataframe_by_tag (matching the Java service's filterRowsByTagEquals pattern), but the method was never implemented on the Python DataSource class.

How it works

The method reads all stored data for a model, then filters rows by tag:

  1. Read input data and metadata via ModelData.data()
  2. Find the tags column in metadata (column index 3, contains list[str] per row)
  3. Build a boolean mask: tag in row_tags for each row
  4. Apply the mask to the input data array
  5. Return a pandas DataFrame with input columns only

Returns input columns only because drift metrics iterate over request.fit_columns (input feature names) and access reference_df[feature_name].

Reads all rows (no batch size) because reference data tagged e.g. "TRAINING" could be anywhere in the dataset, not just the last N rows.

Handles numpy arrays from column_names()ModelData.column_names() returns numpy arrays, not Python lists. The implementation converts to list() before calling .index("tags") to avoid AttributeError. Test mocks use np.array() to match the real return type and catch this class of bug.

Files changed

File Change
src/service/data/datasources/data_source.py Add get_dataframe_by_tag method
tests/service/data/datasources/test_datasource.py 3 tests: matching rows, non-existent tag, missing data (mocks use numpy arrays)

Test plan

  • All 24 datasource tests pass (3 new)
  • Test mocks use np.array() for column_names() return values, matching real ModelData behavior
  • ruff check clean
  • ruff format clean
  • pyrefly check clean
  • Rebuild container image and verify drift metric computation succeeds with tagged reference data

🤖 Generated with Claude Code

@SudipSinha SudipSinha self-assigned this Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@SudipSinha, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 14 minutes and 57 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df0f7ded-78ac-4b65-8ec4-fa6441353484

📥 Commits

Reviewing files that changed from the base of the PR and between eeca7e6 and bd48cc2.

📒 Files selected for processing (2)
  • src/service/data/datasources/data_source.py
  • tests/service/data/datasources/test_datasource.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/missing-get-dataframe-by-tag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@SudipSinha SudipSinha added bug Something isn't working python Pull requests that update python code ok-to-test Proceed with CI testing tests high-priority Critical issues that should be addressed soon labels Jun 16, 2026
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown

PR image build and manifest generation completed successfully!

📦 PR image: quay.io/trustyai/trustyai-service-python-ci:bd48cc2ceed2620c6cc32cec7de2576227246d04

🗂️ CI manifests

devFlags:
  manifests:
    - contextDir: config
      sourcePath: ''
      uri: https://api.github.com/repos/trustyai-explainability/trustyai-service-operator-ci/tarball/service-python-bd48cc2ceed2620c6cc32cec7de2576227246d04

@SudipSinha SudipSinha force-pushed the fix/missing-get-dataframe-by-tag branch from a062c5d to 69ce598 Compare June 16, 2026 10:04
SudipSinha added a commit that referenced this pull request Jun 16, 2026
Amended from PR #183 — column_names() returns numpy arrays, not lists,
so .index() needs list() wrapping. Test mocks updated to use np.array()
to match real return types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"""
try:
model_data = ModelData(model_id)
input_data, _, metadata = await model_data.data()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no need to get output_data here?

@SudipSinha SudipSinha marked this pull request as ready for review June 18, 2026 14:08
SudipSinha and others added 2 commits June 18, 2026 16:36
…erence data

All drift metric endpoints call data_source.get_dataframe_by_tag() to
retrieve reference data filtered by tag (e.g. "TRAINING"), but the
method did not exist on DataSource, causing AttributeError and HTTP 500
for every drift computation.

Also converts metadata_names to list before calling .index() since
column_names() returns numpy arrays which lack .index(). Test mocks
use numpy arrays to match the real return type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sudip Sinha <Sudip.Sinha@RedHat.com>
…e_by_tag

MariaDB round-trips tags through json/gzip/LONGBLOB, producing numpy
arrays instead of Python lists. The isinstance(cell, list) check
returned False, causing all rows to be filtered out. Extract tags via
_extract_tags helper that handles both list and np.ndarray.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sudip Sinha <Sudip.Sinha@RedHat.com>
@SudipSinha SudipSinha force-pushed the fix/missing-get-dataframe-by-tag branch from 28feb59 to bd48cc2 Compare June 18, 2026 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working high-priority Critical issues that should be addressed soon ok-to-test Proceed with CI testing python Pull requests that update python code tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants