ENH: Add scripts to analyze BubbleSAM and OpenCV detection results#4
ENH: Add scripts to analyze BubbleSAM and OpenCV detection results#4nidhinthomas-ai wants to merge 35 commits into
Conversation
|
Initial TODO items for reviewing this branch:
|
* remove unneccary checks * modify handling of std calculations * update docs * fix mypy type hints * simplify logic of vectorized operations * update expected test values
|
Reviews (6): Last reviewed commit: "MAINT: modify handling of results csv fi..." | Re-trigger Greptile |
|
@tylerjereddy I have addressed all your initial re-review comments and responded to each individually. I have also performed another round of self-review and fixed/responded/resolved those comments as well. I have also addressed all of the |
|
Reviews (7): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile |
The CI is failing, as a start, are you sure you want me to start my review from there? gh-22 has been open for a few months re: intermittent failures. |
|
Reviews (8): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile |
|
Reviews (9): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile |
* fix logic for handling missing user provided save paths * add test to handle edge case
|
Reviews (10): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile |
|
Reviews (11): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile |
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
|
Reviews (12): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile |
tylerjereddy
left a comment
There was a problem hiding this comment.
I did a full review of the entire diff again, and added several more comments. This is a bit better than before, though still neglects some of the usual checklist items.
I'm a bit perplexed by the decision to ask me for a full review before updating the PR to reflect data structure changes happening elsewhere, and making sure that all supporting precursor PRs are fully reviewed and merged--I won't be able to commit to another detailed review of this size for quite some time now, and the schedule has already slipped too much. In fact, summer conference season is about to start, along with the next batch of postdocs, so you may find it pretty hard to convince me to review this again anytime soon, let alone the other large PR(s) that may still need a few rounds of review.
Try to organize your PRs, their sizes/scope, and workflow wisely, and respect reviewer time/bandwidth, or your scheduling may slip further.
| import logging | ||
| from tqdm.auto import tqdm | ||
| from pyarrow.lib import ArrowInvalid # type: ignore[import-untyped] | ||
| import ast |
There was a problem hiding this comment.
Honestly gh-33 seems to have a bit of a scoping issue where there's a contour-related bug fix and removal of some parquet-related dependency mushed together, which is probably going to slow down the review there. Those aren't clearly communicated as both being necessary for removal of ast here--I thought you indicated it was mostly the parquet-related change that was needed to move forward here. Either way, mushing those two things together is going to drain reviewer bandwidth and delay review both there and here.
It would probably be better to wait for your dependent PRs to be fully refined and merged before pinging me for another round of review here, and your CI is still failing here, but I'll give you another full round of review. Keep in mind that I'm unlikely to keep reviewing this PR frequently, so if you ping me to review it in a state where it isn't ready to go, you may face weeks of delay before the next review cycle.
This thread needs to remain open--the original point is valid--should be no need to use ast in this work.
|
|
||
|
|
||
| def _std_agg(x): | ||
| return x.std(ddof=0) |
There was a problem hiding this comment.
Why is this function needed? np.std() already exists and has the same default for ddof.
If I simply remove this function and use np.std like the rest of the world in the diff below, the full testsuite passes locally. Adding spurious abstractions without an explanation in the code drains reviewer time. Maybe you have a good reason for this, but I've found no testable evidence for it, and so it is draining reviewer time.
Details
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -19,10 +19,6 @@ __all__ = [
log = logging.getLogger(__name__)
-def _std_agg(x):
- return x.std(ddof=0)
-
-
def _merge_composition_data(
summary_df: pd.DataFrame,
composition_df: pd.DataFrame,
@@ -618,8 +614,7 @@ def _calculate_summary_statistics(
carry = list(set(carry_over_cols).intersection(df.columns))
# initialize dictionary keys for aggregation
- _std_agg.__name__ = "std"
- agg_spec = {c: ["min", "max", "median", _std_agg] for c in df_num.columns}
+ agg_spec = {c: ["min", "max", "median", np.std] for c in df_num.columns}
agg_spec.update({c: ["first"] for c in df[carry].columns})
grouped = df.groupby(There was a problem hiding this comment.
Using this diff, the test fails for me with the following warning using pandas==2.3.3:
: FutureWarning: The provided callable <function std at 0x112e0aac0> is currently using SeriesGroupBy.std. In
a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string
"std" instead.
).agg(agg_spec) # type: ignore[arg-type]but passes without any warnings when using pandas>=3.0.0
There was a problem hiding this comment.
You still haven't clearly explained to me why you wrote _std_agg() and presented it for review as such when np.std already exists? What was the basis for this decision? If it was an oversight then taking ownership of that and correcting it is pretty simple. If it is based on an implementation/approach somewhere else, then linking to that for clarity saves the reviewer time.
Your current response implies that there's a flaw with my suggestion that justifies your original decision--even if that were true and you were carefully shimming around different behaviors in different pandas versions, you should have still clearly documented it with a comment in the new function (# XXX supporting pandas 2.x and 3.x behavior) rather than asking me to research it as the reviewer by not including that information inline. Your previous analysis of std ddof behavior at #4 (comment) only mentions that the pandas default for ddof is 1 and you want to change it to 0, but there's no mention of std function override behavior differences in different pandas versions.
On the topic of version support ranges, I haven't really seen much discussion on your plans since the May 5 comment about establishing a policy on version support ranges (#28 (comment)), another delay which makes it hard (almost impossible?) for me to review your code--what version ranges are you going to support? It is pretty hard for the reviewer to provide sane feedback without clear version support ranges.
In principle you could also use lambda x: np.std(x) here if you change your expected column names a bit. It looks like you may be trying to support pandas 2.x and 3.x aggregation function override behaviors and preserve column names at the same time. But asking the reviewer to divine that is a lot.
Here's a sample diff that contains clear explanations, doesn't write a new function, and passes tests on both versions of pandas that have been mentioned.
Details
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -19,10 +19,6 @@ __all__ = [
log = logging.getLogger(__name__)
-def _std_agg(x):
- return x.std(ddof=0)
-
-
def _merge_composition_data(
summary_df: pd.DataFrame,
composition_df: pd.DataFrame,
@@ -618,8 +614,11 @@ def _calculate_summary_statistics(
carry = list(set(carry_over_cols).intersection(df.columns))
# initialize dictionary keys for aggregation
- _std_agg.__name__ = "std"
- agg_spec = {c: ["min", "max", "median", _std_agg] for c in df_num.columns}
+ # we want std with ddof=0 vs. the pandas default for std
+ # we can't use `np.std` directly because pandas 2.x ignores
+ # it and uses its own std with different ddof behavior
+ # TODO: change to np.std when we require pandas >=3.x
+ agg_spec = {c: ["min", "max", "median", lambda x: np.std(x)] for c in df_num.columns}
agg_spec.update({c: ["first"] for c in df[carry].columns})
grouped = df.groupby(
@@ -628,6 +627,8 @@ def _calculate_summary_statistics(
grouped.columns = ['_'.join(filter(None, map(str, col))) for col in grouped.columns]
grouped.rename(columns={f"{c}_first": c for c in df[carry].columns}, inplace=True)
+ # preserve the "std" column name after using the lambda function
+ grouped.columns = grouped.columns.str.replace("<lambda_0>", "std")
return grouped
There was a problem hiding this comment.
Thank you for the suggestion. I have pushed a commit to reflect these changes with detail about reasoning behind the logic used for std calculation here. I obtained the previous logic based on this discussion: https://stackoverflow.com/a/50307104 because I had tried the solution using lambda, but didn't think about replacing the appended string with the appropriate std suffix for those column names. Going forward, I will make sure to provide more information in inline comments for changes requiring further explantion.
| raise ValueError(f"Columns {missing_cols} not found in composition_df.") | ||
|
|
||
| return summary_df.merge( | ||
| composition_df[[merge_key, *cols_to_add]], on=merge_key, how="left" |
There was a problem hiding this comment.
Why does it matter if the merge is "left?" The reason should be justified by regression testing.
If I change to a right merge per diff below, the full testsuite passes locally.
Details
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -60,7 +60,7 @@ def _merge_composition_data(
raise ValueError(f"Columns {missing_cols} not found in composition_df.")
return summary_df.merge(
- composition_df[[merge_key, *cols_to_add]], on=merge_key, how="left"
+ composition_df[[merge_key, *cols_to_add]], on=merge_key, how="right"
)| Parameters | ||
| ---------- | ||
| fname : str | ||
| The filename, e.g., 'offset -1_center_A1_Bf_Raw_uuid_bubble_data.parquet.gzip'. |
There was a problem hiding this comment.
Ok, that doesn't prohibit you from cleaning it up. But "ok" seems like this can stay as is..
| A dictionary with 'UniqueID', 'Class', 'Offset', 'Position', and 'Label', | ||
| or an empty dictionary if the pattern does not match. | ||
| """ | ||
| if method.lower() == "bubblesam": |
There was a problem hiding this comment.
I believe this will allow the Literal above to accept quite a few more options
There was a problem hiding this comment.
I have modified this to only accept the literal string options.
| method, | ||
| n_value_errors, | ||
| len_df, | ||
| ): |
There was a problem hiding this comment.
partial typing, weird decision
| input_dir = tmp_path / "input" | ||
| input_dir.mkdir() | ||
|
|
||
| # a file that contains all the necessary information for |
There was a problem hiding this comment.
a file or filename? currently the logic here is quite separated from file creation
| good = ("offset -1_center_A1_O_Ph_Raw_11111111-" | ||
| f"1111-1111-1111-111111111111_{file_suff}.parquet.gzip") | ||
| if method == "BubbleSAM": | ||
| bbox = "[0.0, 0.0, 10.0, 10.0]" |
There was a problem hiding this comment.
string handling of this numeric information should disappear per other PRs
| }).to_parquet( | ||
| input_dir / good | ||
| ) | ||
| # a parquet file that is not in the correct format and |
There was a problem hiding this comment.
file or filename? here and elsewhere the comments are a bit painful to parse; sometimes the file contents are the problem and sometimes the filename proper is the problem, and the failure of the comments to capture this is draining reviewer time
| assert exp_msg in all_msgs | ||
|
|
||
| assert isinstance(df, pd.DataFrame) | ||
| assert len(df) == len_df |
There was a problem hiding this comment.
checking shape may be slightly less crude
This merge request introduces a new library module,
neat_ml/analysis, which analyzes the bubble detection results either from BubbleSAM or OpenCV. It also adds an accompanying test suite.What’s inside
The
data_analysis.pyscript is the core of this feature. It processes raw bubble data (centroids, areas, etc.) stored in parquet files and produces comprehensive quantitative metrics. The workflow can be summarized as:*_bubble_data.parquet.gzip) or BubbleSAM (*_masks_filtered.parquet.gzip).NND): Mean and median distance between the closest bubbles.per_image_metrics.csv: A detailed report with all calculated metrics for every single image.aggregated_metrics.csv: A summary report where metrics are aggregated (min, max, median, std) across different experimental groups.radius: Connects all bubbles within a given search radius.knn: Connects each bubble to its k-nearest neighbors.delaunay: Creates a graph based on Delaunay triangulation, connecting "natural" neighbors. It then calculates network metrics like average node degree, clustering coefficient, and properties of the largest connected component (LCC).OpenCVandBubbleSAM.The analysis typically uses the radius graph method with the parameter 30. Users can also change the parameter for their custom dataset. This parameter defines the maximum distance (in pixels) for two bubbles to be considered connected in the spatial graph.