ENH: Add scripts to analyze BubbleSAM and OpenCV detection results by nidhinthomas-ai · Pull Request #4 · lanl/ldrd_neat_ml

nidhinthomas-ai · 2025-09-25T06:22:46Z

This merge request introduces a new library module, neat_ml/analysis, which analyzes the bubble detection results either from BubbleSAM or OpenCV. It also adds an accompanying test suite.

What’s inside

Analysis Module (neat_ml/analysis/): A new module dedicated to post-processing and analyzing bubble detection data.
- data_analysis.py: A script that extracts bubble features, computes a wide range of spatial metrics, and generates summary reports.
- tests/test_analysis.py: Unit tests for the analysis module to ensure correctness and stability.
Workflow Integration (workflow/lib_workflow.py):
- Modified to include a new wrapper function, stage_analyze_features(), which integrates the data_analysis.py script into the main processing workflow.
Detailed Description of data_analysis.py

The data_analysis.py script is the core of this feature. It processes raw bubble data (centroids, areas, etc.) stored in parquet files and produces comprehensive quantitative metrics. The workflow can be summarized as:

Scan and Load: Recursively scans a directory for result files from either OpenCV (*_bubble_data.parquet.gzip) or BubbleSAM (*_masks_filtered.parquet.gzip).
Parse Metadata: Extracts experimental metadata (e.g., offset, position, label) directly from the filenames.
Per-Image Analysis: For each image, it calculates a suite of metrics:
- Basic Blob Stats: Number of bubbles, mean/median/std of area and radius.
- Coverage: Percentage of the image area covered by bubbles.
- Nearest-Neighbor Distance (NND): Mean and median distance between the closest bubbles.
- Voronoi Tessellation: Statistics on the area of Voronoi cells, which describe the local space around each bubble.
- Graph-Based Metrics: Models the spatial relationship between bubbles as a graph and computes network properties (e.g., connectivity, clustering).
Output Generation: Saves the results into two CSV files:
- per_image_metrics.csv: A detailed report with all calculated metrics for every single image.
- aggregated_metrics.csv: A summary report where metrics are aggregated (min, max, median, std) across different experimental groups.

Key Methods
- full_analysis(): The main entry point that orchestrates the entire pipeline—from processing the input directory to saving the final per-image and aggregated CSV reports.
- process_parquet_files(): Manages file discovery, loading, and iteration. It identifies the correct parsing function based on the specified mode ('OpenCV' or 'BubbleSAM') and compiles all per-image results into a single DataFrame.
- calculate_all_spatial_metrics(): A wrapper function that computes the full set of spatial metrics for a single image's data. It calls the specialized functions below.
- calculate_nnd_stats(): Computes the mean and median nearest-neighbor distances between bubble centroids using a KD-Tree for efficient querying. This helps quantify how clustered the bubbles are.
- calculate_voronoi_stats(): Performs Voronoi tessellation on the bubble centroids and calculates statistics (mean, median, std) on the resulting cell areas. This provides insight into the spatial organization and density of the bubbles.
- calculate_graph_metrics(): Constructs a spatial graph where bubbles are nodes and edges connect nearby bubbles. It supports three construction methods:
  - radius: Connects all bubbles within a given search radius.
  - knn: Connects each bubble to its k-nearest neighbors.
  - delaunay: Creates a graph based on Delaunay triangulation, connecting "natural" neighbors. It then calculates network metrics like average node degree, clustering coefficient, and properties of the largest connected component (LCC).
- calculate_summary_statistics(): Aggregates the per-image metrics DataFrame. It groups the data by specified columns (e.g., 'Label', 'Time') and computes summary statistics (min, max, median, std) for all numerical metrics.
- merge_composition_data(): An optional utility to merge the metrics DataFrame with an external CSV file (e.g., containing sample composition data) using a shared key like a UniqueID.
- parse_filename() and load_df(): Helper functions tailored to parse metadata from the specific filename conventions and load the data formats produced by OpenCV and BubbleSAM.
Runtime Tip

The analysis typically uses the radius graph method with the parameter 30. Users can also change the parameter for their custom dataset. This parameter defines the maximum distance (in pixels) for two bubbles to be considered connected in the spatial graph.

adamwitmer · 2026-02-10T18:42:11Z

* remove unneccary checks * modify handling of std calculations * update docs * fix mypy type hints * simplify logic of vectorized operations * update expected test values

greptile-apps · 2026-06-09T17:28:22Z

_{Reviews (6): Last reviewed commit: "MAINT: modify handling of results csv fi..." | Re-trigger Greptile}

* enforce desired behavior with tests * remove leftover unused function * fix minor formatting issues/docstring

adamwitmer · 2026-06-09T18:42:01Z

@tylerjereddy I have addressed all your initial re-review comments and responded to each individually. I have also performed another round of self-review and fixed/responded/resolved those comments as well. I have also addressed all of the greptile review comments (unless more show up after my most recent commit). I opened PRs #33 and #34 to address issues surrounding data structure handling that should simplify the logic here once merged. Hopefully this PR should be in better shape for you to review now, thanks.

greptile-apps · 2026-06-09T18:46:09Z

_{Reviews (7): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile}

tylerjereddy · 2026-06-09T18:52:55Z

Hopefully this PR should be in better shape for you to review now, thanks.

The CI is failing, as a start, are you sure you want me to start my review from there?

gh-22 has been open for a few months re: intermittent failures.

greptile-apps · 2026-06-09T19:12:32Z

_{Reviews (8): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile}

greptile-apps · 2026-06-09T19:23:43Z

_{Reviews (9): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile}

* fix logic for handling missing user provided save paths * add test to handle edge case

greptile-apps · 2026-06-10T00:28:54Z

_{Reviews (10): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile}

greptile-apps · 2026-06-10T02:17:40Z

_{Reviews (11): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile}

greptile-apps · 2026-06-10T02:17:43Z

Want your agent to iterate on Greptile's feedback? Try greploops.

greptile-apps · 2026-06-10T14:09:18Z

_{Reviews (12): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile}

tylerjereddy

I did a full review of the entire diff again, and added several more comments. This is a bit better than before, though still neglects some of the usual checklist items.

I'm a bit perplexed by the decision to ask me for a full review before updating the PR to reflect data structure changes happening elsewhere, and making sure that all supporting precursor PRs are fully reviewed and merged--I won't be able to commit to another detailed review of this size for quite some time now, and the schedule has already slipped too much. In fact, summer conference season is about to start, along with the next batch of postdocs, so you may find it pretty hard to convince me to review this again anytime soon, let alone the other large PR(s) that may still need a few rounds of review.

Try to organize your PRs, their sizes/scope, and workflow wisely, and respect reviewer time/bandwidth, or your scheduling may slip further.

tylerjereddy · 2026-06-12T15:10:05Z

+import logging
+from tqdm.auto import tqdm
+from pyarrow.lib import ArrowInvalid  # type: ignore[import-untyped]
+import ast


Honestly gh-33 seems to have a bit of a scoping issue where there's a contour-related bug fix and removal of some parquet-related dependency mushed together, which is probably going to slow down the review there. Those aren't clearly communicated as both being necessary for removal of ast here--I thought you indicated it was mostly the parquet-related change that was needed to move forward here. Either way, mushing those two things together is going to drain reviewer bandwidth and delay review both there and here.

It would probably be better to wait for your dependent PRs to be fully refined and merged before pinging me for another round of review here, and your CI is still failing here, but I'll give you another full round of review. Keep in mind that I'm unlikely to keep reviewing this PR frequently, so if you ping me to review it in a state where it isn't ready to go, you may face weeks of delay before the next review cycle.

This thread needs to remain open--the original point is valid--should be no need to use ast in this work.

tylerjereddy · 2026-06-12T15:25:04Z

+
+
+def _std_agg(x):
+    return x.std(ddof=0)


Why is this function needed? np.std() already exists and has the same default for ddof.

If I simply remove this function and use np.std like the rest of the world in the diff below, the full testsuite passes locally. Adding spurious abstractions without an explanation in the code drains reviewer time. Maybe you have a good reason for this, but I've found no testable evidence for it, and so it is draining reviewer time.

Details

--- a/neat_ml/analysis/data_analysis.py +++ b/neat_ml/analysis/data_analysis.py @@ -19,10 +19,6 @@ __all__ = [ log = logging.getLogger(__name__) -def _std_agg(x): - return x.std(ddof=0) - - def _merge_composition_data( summary_df: pd.DataFrame, composition_df: pd.DataFrame, @@ -618,8 +614,7 @@ def _calculate_summary_statistics( carry = list(set(carry_over_cols).intersection(df.columns)) # initialize dictionary keys for aggregation - _std_agg.__name__ = "std" - agg_spec = {c: ["min", "max", "median", _std_agg] for c in df_num.columns} + agg_spec = {c: ["min", "max", "median", np.std] for c in df_num.columns} agg_spec.update({c: ["first"] for c in df[carry].columns}) grouped = df.groupby(

Using this diff, the test fails for me with the following warning using pandas==2.3.3:

: FutureWarning: The provided callable <function std at 0x112e0aac0> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead. ).agg(agg_spec) # type: ignore[arg-type]

but passes without any warnings when using pandas>=3.0.0

You still haven't clearly explained to me why you wrote _std_agg() and presented it for review as such when np.std already exists? What was the basis for this decision? If it was an oversight then taking ownership of that and correcting it is pretty simple. If it is based on an implementation/approach somewhere else, then linking to that for clarity saves the reviewer time.

Your current response implies that there's a flaw with my suggestion that justifies your original decision--even if that were true and you were carefully shimming around different behaviors in different pandas versions, you should have still clearly documented it with a comment in the new function (# XXX supporting pandas 2.x and 3.x behavior) rather than asking me to research it as the reviewer by not including that information inline. Your previous analysis of std ddof behavior at #4 (comment) only mentions that the pandas default for ddof is 1 and you want to change it to 0, but there's no mention of std function override behavior differences in different pandas versions.

On the topic of version support ranges, I haven't really seen much discussion on your plans since the May 5 comment about establishing a policy on version support ranges (#28 (comment)), another delay which makes it hard (almost impossible?) for me to review your code--what version ranges are you going to support? It is pretty hard for the reviewer to provide sane feedback without clear version support ranges.

In principle you could also use lambda x: np.std(x) here if you change your expected column names a bit. It looks like you may be trying to support pandas 2.x and 3.x aggregation function override behaviors and preserve column names at the same time. But asking the reviewer to divine that is a lot.

Here's a sample diff that contains clear explanations, doesn't write a new function, and passes tests on both versions of pandas that have been mentioned.

Details

--- a/neat_ml/analysis/data_analysis.py +++ b/neat_ml/analysis/data_analysis.py @@ -19,10 +19,6 @@ __all__ = [ log = logging.getLogger(__name__) -def _std_agg(x): - return x.std(ddof=0) - - def _merge_composition_data( summary_df: pd.DataFrame, composition_df: pd.DataFrame, @@ -618,8 +614,11 @@ def _calculate_summary_statistics( carry = list(set(carry_over_cols).intersection(df.columns)) # initialize dictionary keys for aggregation - _std_agg.__name__ = "std" - agg_spec = {c: ["min", "max", "median", _std_agg] for c in df_num.columns} + # we want std with ddof=0 vs. the pandas default for std + # we can't use `np.std` directly because pandas 2.x ignores + # it and uses its own std with different ddof behavior + # TODO: change to np.std when we require pandas >=3.x + agg_spec = {c: ["min", "max", "median", lambda x: np.std(x)] for c in df_num.columns} agg_spec.update({c: ["first"] for c in df[carry].columns}) grouped = df.groupby( @@ -628,6 +627,8 @@ def _calculate_summary_statistics( grouped.columns = ['_'.join(filter(None, map(str, col))) for col in grouped.columns] grouped.rename(columns={f"{c}_first": c for c in df[carry].columns}, inplace=True) + # preserve the "std" column name after using the lambda function + grouped.columns = grouped.columns.str.replace("<lambda_0>", "std") return grouped

Thank you for the suggestion. I have pushed a commit to reflect these changes with detail about reasoning behind the logic used for std calculation here. I obtained the previous logic based on this discussion: https://stackoverflow.com/a/50307104 because I had tried the solution using lambda, but didn't think about replacing the appended string with the appropriate std suffix for those column names. Going forward, I will make sure to provide more information in inline comments for changes requiring further explantion.

tylerjereddy · 2026-06-12T15:30:43Z

+        raise ValueError(f"Columns {missing_cols} not found in composition_df.")
+
+    return summary_df.merge(
+        composition_df[[merge_key, *cols_to_add]], on=merge_key, how="left"


Why does it matter if the merge is "left?" The reason should be justified by regression testing.

If I change to a right merge per diff below, the full testsuite passes locally.

Details

--- a/neat_ml/analysis/data_analysis.py +++ b/neat_ml/analysis/data_analysis.py @@ -60,7 +60,7 @@ def _merge_composition_data( raise ValueError(f"Columns {missing_cols} not found in composition_df.") return summary_df.merge( - composition_df[[merge_key, *cols_to_add]], on=merge_key, how="left" + composition_df[[merge_key, *cols_to_add]], on=merge_key, how="right" )

tylerjereddy · 2026-06-12T15:34:48Z

+    Parameters
+    ----------
+    fname : str
+        The filename, e.g., 'offset -1_center_A1_Bf_Raw_uuid_bubble_data.parquet.gzip'.


Ok, that doesn't prohibit you from cleaning it up. But "ok" seems like this can stay as is..

tylerjereddy · 2026-06-12T15:36:16Z

+        A dictionary with 'UniqueID', 'Class', 'Offset', 'Position', and 'Label',
+        or an empty dictionary if the pattern does not match.
+    """
+    if method.lower() == "bubblesam":


I believe this will allow the Literal above to accept quite a few more options

I have modified this to only accept the literal string options.

tylerjereddy · 2026-06-12T20:00:09Z

+    method,
+    n_value_errors,
+    len_df,
+):


partial typing, weird decision

tylerjereddy · 2026-06-12T20:01:42Z

+    input_dir = tmp_path / "input"
+    input_dir.mkdir()
+
+    # a file that contains all the necessary information for


a file or filename? currently the logic here is quite separated from file creation

tylerjereddy · 2026-06-12T20:02:09Z

+    good = ("offset -1_center_A1_O_Ph_Raw_11111111-"
+        f"1111-1111-1111-111111111111_{file_suff}.parquet.gzip")
+    if method == "BubbleSAM":
+        bbox = "[0.0, 0.0, 10.0, 10.0]"


string handling of this numeric information should disappear per other PRs

tylerjereddy · 2026-06-12T20:04:21Z

+        }).to_parquet(
+        input_dir / good
+    )
+    # a parquet file that is not in the correct format and


file or filename? here and elsewhere the comments are a bit painful to parse; sometimes the file contents are the problem and sometimes the filename proper is the problem, and the failure of the comments to capture this is draining reviewer time

tylerjereddy · 2026-06-12T20:16:23Z

+        assert exp_msg in all_msgs
+
+    assert isinstance(df, pd.DataFrame)
+    assert len(df) == len_df


checking shape may be slightly less crude

adamwitmer mentioned this pull request Oct 10, 2025

MAINT: TODO for open PRs #6

Open

5 tasks

tylerjereddy added the enhancement New feature or request label Oct 15, 2025

adamwitmer reviewed Dec 11, 2025

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Dec 11, 2025

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Dec 11, 2025

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Dec 16, 2025

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Dec 16, 2025

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread .github/workflows/ci.yml

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/bubblesam/bubblesam.py

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Feb 10, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Jun 3, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Jun 3, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Jun 3, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Jun 3, 2026

View reviewed changes

Comment thread neat_ml/workflow/lib_workflow.py Outdated

Adam Witmer added 2 commits June 8, 2026 22:45

MAINT: address reviewer comments

2d09911

* remove unneccary checks * modify handling of std calculations * update docs * fix mypy type hints * simplify logic of vectorized operations * update expected test values

ENH: vectorize bubblesam centroid calculation

05912a2

adamwitmer mentioned this pull request Jun 9, 2026

ENH: modify centroid and bbox object handling for opencv bubble detection #34

Merged

greptile-apps Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread neat_ml/workflow/lib_workflow.py

MAINT: modify handling of results csv filepaths

d214811

adamwitmer reviewed Jun 9, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py

adamwitmer reviewed Jun 9, 2026

View reviewed changes

Comment thread neat_ml/tests/test_analysis.py Outdated

adamwitmer reviewed Jun 9, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

adamwitmer reviewed Jun 9, 2026

View reviewed changes

Comment thread neat_ml/analysis/data_analysis.py Outdated

MAINT: PR lanl#4 revisions

3d0d74e

* enforce desired behavior with tests * remove leftover unused function * fix minor formatting issues/docstring

greptile-apps Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread neat_ml/workflow/lib_workflow.py

MAINT: address AI reviewer comment

6b9a1b6

* fix logic for handling missing user provided save paths * add test to handle edge case

tylerjereddy requested changes Jun 12, 2026

View reviewed changes

MAINT: modify pandas agg operation to use lambda function

f57eaa5

adamwitmer mentioned this pull request Jun 22, 2026

MAINT: remove fastparquet compression for bubblesam parquet #39

Open

MAINT: PR lanl#4 revisions

62b38c9

Conversation

nidhinthomas-ai commented Sep 25, 2025 • edited by adamwitmer Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

adamwitmer commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 9, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

adamwitmer commented Jun 9, 2026

Uh oh!

greptile-apps Bot commented Jun 9, 2026

Uh oh!

tylerjereddy commented Jun 9, 2026

Uh oh!

greptile-apps Bot commented Jun 9, 2026

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 9, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026

Uh oh!

tylerjereddy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamwitmer Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

nidhinthomas-ai commented Sep 25, 2025 •

edited by adamwitmer

Loading

adamwitmer commented Feb 10, 2026 •

edited

Loading

adamwitmer Jun 20, 2026 •

edited

Loading