feat(datasource): implement get_dataframe_by_tag for drift metric reference data #183
SudipSinha wants to merge 2 commits into
a062c5d to 69ce598
Amended from PR #183 — column_names() returns numpy arrays, not lists, so .index() needs list() wrapping. Test mocks updated to use np.array() to match real return types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
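To illustrate the failure mode described in this commit message, here is a minimal sketch. The column names are made up for the example and are not taken from the repository; only the fact that a numpy array lacks .index() is being shown.

```python
import numpy as np

# Hypothetical stand-in for what column_names() is described as returning.
metadata_names = np.array(["id", "timestamp", "synthetic", "tags"])

# numpy arrays have no .index() method, so this lookup raises AttributeError.
try:
    metadata_names.index("tags")
    lookup_failed = False
except AttributeError:
    lookup_failed = True

# Wrapping the array in list() makes .index() available.
tags_idx = list(metadata_names).index("tags")
```

A test mock that returns a plain Python list would never hit the first branch, which is presumably why the mocks were switched to np.array().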
    """
    try:
        model_data = ModelData(model_id)
        input_data, _, metadata = await model_data.data()
Is there no need to get output_data here?
…erence data

All drift metric endpoints call data_source.get_dataframe_by_tag() to retrieve reference data filtered by tag (e.g. "TRAINING"), but the method did not exist on DataSource, causing AttributeError and HTTP 500 for every drift computation.

Also converts metadata_names to a list before calling .index(), since column_names() returns numpy arrays, which lack .index(). Test mocks use numpy arrays to match the real return type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sudip Sinha <Sudip.Sinha@RedHat.com>
…e_by_tag

MariaDB round-trips tags through json/gzip/LONGBLOB, producing numpy arrays instead of Python lists. The isinstance(cell, list) check returned False, causing all rows to be filtered out. Extract tags via an _extract_tags helper that handles both list and np.ndarray.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sudip Sinha <Sudip.Sinha@RedHat.com>
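The body of _extract_tags is not shown in this conversation. The sketch below is only a guess at what such a helper might look like: the name extract_tags_sketch and the fallback of returning an empty list for unexpected cell types are assumptions, not the PR's code.

```python
import numpy as np


def extract_tags_sketch(cell):
    """Return the tags stored in one metadata cell as a list of strings.

    Hypothetical helper: accepts both Python lists and numpy arrays.
    """
    if isinstance(cell, np.ndarray):
        # ravel() also copes with 0-d arrays, which tolist() alone would
        # turn into a bare scalar instead of a list.
        return [str(t) for t in cell.ravel().tolist()]
    if isinstance(cell, (list, tuple)):
        return [str(t) for t in cell]
    # Assumption: any other cell type is treated as "no tags".
    return []


# A numpy array is not a list, which is why an isinstance(cell, list)
# check alone filters out every row after such a round-trip.
array_is_list = isinstance(np.array(["TRAINING"]), list)
```

With a helper of this shape, the membership test "TRAINING" in extract_tags_sketch(cell) behaves the same whether the storage backend hands back a list or an array.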
28feb59 to bd48cc2
Summary
All drift metric endpoints call data_source.get_dataframe_by_tag(model_id, reference_tag) to retrieve reference data, but the method doesn't exist on DataSource. Every drift computation fails with AttributeError and HTTP 500.

Root cause
The drift endpoints were written to call get_dataframe_by_tag (matching the Java service's filterRowsByTagEquals pattern), but the method was never implemented on the Python DataSource class.

How it works
The method reads all stored data for a model, then filters rows by tag:

1. Reads all data via ModelData.data()
2. Locates the tags column in metadata (column index 3, contains list[str] per row)
3. Checks tag in row_tags for each row

Returns input columns only, because drift metrics iterate over request.fit_columns (input feature names) and access reference_df[feature_name].

Reads all rows (no batch size), because reference data tagged e.g. "TRAINING" could be anywhere in the dataset, not just the last N rows.
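The steps above can be sketched roughly as follows. This is not the PR's actual code: filter_inputs_by_tag, its parameters, and the sample column names are hypothetical, and a real implementation would obtain the arrays from ModelData rather than take them as arguments.

```python
import numpy as np
import pandas as pd


def filter_inputs_by_tag(input_data, input_names, metadata, metadata_names, tag):
    """Return a DataFrame of input columns for rows whose tags contain `tag`."""
    # column names may arrive as a numpy array, so convert before .index()
    tags_idx = list(metadata_names).index("tags")
    # list(...) lets the membership test work for list and ndarray cells alike
    keep = [tag in list(row[tags_idx]) for row in metadata]
    mask = np.array(keep, dtype=bool)
    columns = [str(name) for name in input_names]
    return pd.DataFrame(np.asarray(input_data)[mask], columns=columns)


# Illustrative data only: three rows, two input features.
input_data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
input_names = np.array(["x1", "x2"])
metadata_names = np.array(["id", "timestamp", "synthetic", "tags"])
metadata = [
    ["a", "t0", False, ["TRAINING"]],
    ["b", "t1", False, np.array(["PRODUCTION"])],
    ["c", "t2", False, np.array(["TRAINING", "EXTRA"])],
]

reference_df = filter_inputs_by_tag(
    input_data, input_names, metadata, metadata_names, "TRAINING"
)
```

Because the mask is built over every row, tagged rows are picked up wherever they sit in the dataset, and the resulting frame can be indexed by feature name as the drift metrics expect.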
Handles numpy arrays from column_names(): ModelData.column_names() returns numpy arrays, not Python lists. The implementation converts to list() before calling .index("tags") to avoid AttributeError. Test mocks use np.array() to match the real return type and catch this class of bug.

Files changed
src/service/data/datasources/data_source.py: get_dataframe_by_tag method
tests/service/data/datasources/test_datasource.py

Test plan
- Test mocks use np.array() for column_names() return values, matching real ModelData behavior
- ruff check clean
- ruff format clean
- pyrefly check clean

🤖 Generated with Claude Code