Skip to content

ENH: Add scripts to analyze BubbleSAM and OpenCV detection results#4

Open
nidhinthomas-ai wants to merge 35 commits into
lanl:mainfrom
nidhinthomas-ai:nidhin_data_analysis
Open

ENH: Add scripts to analyze BubbleSAM and OpenCV detection results#4
nidhinthomas-ai wants to merge 35 commits into
lanl:mainfrom
nidhinthomas-ai:nidhin_data_analysis

Conversation

@nidhinthomas-ai

@nidhinthomas-ai nidhinthomas-ai commented Sep 25, 2025

Copy link
Copy Markdown

This merge request introduces a new library module, neat_ml/analysis, which analyzes the bubble detection results either from BubbleSAM or OpenCV. It also adds an accompanying test suite.

What’s inside

  • Analysis Module (neat_ml/analysis/): A new module dedicated to post-processing and analyzing bubble detection data.
    • data_analysis.py: A script that extracts bubble features, computes a wide range of spatial metrics, and generates summary reports.
    • tests/test_analysis.py: Unit tests for the analysis module to ensure correctness and stability.
  • Workflow Integration (workflow/lib_workflow.py):
    • Modified to include a new wrapper function, stage_analyze_features(), which integrates the data_analysis.py script into the main processing workflow.
  • Detailed Description of data_analysis.py

The data_analysis.py script is the core of this feature. It processes raw bubble data (centroids, areas, etc.) stored in parquet files and produces comprehensive quantitative metrics. The workflow can be summarized as:

  1. Scan and Load: Recursively scans a directory for result files from either OpenCV (*_bubble_data.parquet.gzip) or BubbleSAM (*_masks_filtered.parquet.gzip).
  2. Parse Metadata: Extracts experimental metadata (e.g., offset, position, label) directly from the filenames.
  3. Per-Image Analysis: For each image, it calculates a suite of metrics:
    • Basic Blob Stats: Number of bubbles, mean/median/std of area and radius.
    • Coverage: Percentage of the image area covered by bubbles.
    • Nearest-Neighbor Distance (NND): Mean and median distance between the closest bubbles.
    • Voronoi Tessellation: Statistics on the area of Voronoi cells, which describe the local space around each bubble.
    • Graph-Based Metrics: Models the spatial relationship between bubbles as a graph and computes network properties (e.g., connectivity, clustering).
  4. Output Generation: Saves the results into two CSV files:
    • per_image_metrics.csv: A detailed report with all calculated metrics for every single image.
    • aggregated_metrics.csv: A summary report where metrics are aggregated (min, max, median, std) across different experimental groups.
  • Key Methods
    • full_analysis(): The main entry point that orchestrates the entire pipeline—from processing the input directory to saving the final per-image and aggregated CSV reports.
    • process_parquet_files(): Manages file discovery, loading, and iteration. It identifies the correct parsing function based on the specified mode ('OpenCV' or 'BubbleSAM') and compiles all per-image results into a single DataFrame.
    • calculate_all_spatial_metrics(): A wrapper function that computes the full set of spatial metrics for a single image's data. It calls the specialized functions below.
    • calculate_nnd_stats(): Computes the mean and median nearest-neighbor distances between bubble centroids using a KD-Tree for efficient querying. This helps quantify how clustered the bubbles are.
    • calculate_voronoi_stats(): Performs Voronoi tessellation on the bubble centroids and calculates statistics (mean, median, std) on the resulting cell areas. This provides insight into the spatial organization and density of the bubbles.
    • calculate_graph_metrics(): Constructs a spatial graph where bubbles are nodes and edges connect nearby bubbles. It supports three construction methods:
      • radius: Connects all bubbles within a given search radius.
      • knn: Connects each bubble to its k-nearest neighbors.
      • delaunay: Creates a graph based on Delaunay triangulation, connecting "natural" neighbors.
It then calculates network metrics like average node degree, clustering coefficient, and properties of the largest connected component (LCC).
    • calculate_summary_statistics(): Aggregates the per-image metrics DataFrame. It groups the data by specified columns (e.g., 'Label', 'Time') and computes summary statistics (min, max, median, std) for all numerical metrics.
    • merge_composition_data(): An optional utility to merge the metrics DataFrame with an external CSV file (e.g., containing sample composition data) using a shared key like a UniqueID.
    • parse_filename() and load_df(): Helper functions tailored to parse metadata from the specific filename conventions and load the data formats produced by OpenCV and BubbleSAM.
  • Runtime Tip

The analysis typically uses the radius graph method with the parameter 30. Users can also change the parameter for their custom dataset. This parameter defines the maximum distance (in pixels) for two bubbles to be considered connected in the spatial graph.

@adamwitmer adamwitmer mentioned this pull request Oct 10, 2025
5 tasks
@tylerjereddy tylerjereddy added the enhancement New feature or request label Oct 15, 2025
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
@adamwitmer

adamwitmer commented Feb 10, 2026

Copy link
Copy Markdown
Collaborator

Initial TODO items for reviewing this branch:

  • read/understand diff/PR (read line-by-line, probing for weaknesses)
  • check for unnecessary complexity; areas for improvement/simplification
  • rebase branch against main (and push backup branch)
  • fix issues with git lfs (i.e. image storage with zenodo https://github.com/lanl/ldrd_neat_ml_images)
  • fix github CI
  • run test-suite and check test coverage
  • run branch according to README.md instructions
  • copy remaining review comments from gitlab
  • perform detailed code review
  • address all review comments
  • triple check diff
    • 1st check
    • 2nd check
    • 3rd check

Comment thread .github/workflows/ci.yml
Comment thread neat_ml/bubblesam/bubblesam.py
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/workflow/lib_workflow.py Outdated
Adam Witmer added 2 commits June 8, 2026 22:45
* remove unneccary checks
* modify handling of std calculations
* update docs
* fix mypy type hints
* simplify logic of vectorized operations
* update expected test values
Comment thread neat_ml/workflow/lib_workflow.py
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Reviews (6): Last reviewed commit: "MAINT: modify handling of results csv fi..." | Re-trigger Greptile

Comment thread neat_ml/analysis/data_analysis.py
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
* enforce desired behavior with tests
* remove leftover unused function
* fix minor formatting issues/docstring
@adamwitmer

Copy link
Copy Markdown
Collaborator

@tylerjereddy I have addressed all your initial re-review comments and responded to each individually. I have also performed another round of self-review and fixed/responded/resolved those comments as well. I have also addressed all of the greptile review comments (unless more show up after my most recent commit). I opened PRs #33 and #34 to address issues surrounding data structure handling that should simplify the logic here once merged. Hopefully this PR should be in better shape for you to review now, thanks.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Reviews (7): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile

@tylerjereddy

Copy link
Copy Markdown
Collaborator

Hopefully this PR should be in better shape for you to review now, thanks.

The CI is failing, as a start, are you sure you want me to start my review from there?

gh-22 has been open for a few months re: intermittent failures.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Reviews (8): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile

Comment thread neat_ml/workflow/lib_workflow.py
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Reviews (9): Last reviewed commit: "MAINT: PR #4 revisions" | Re-trigger Greptile

* fix logic for handling missing user provided save paths
* add test to handle edge case
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Reviews (10): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Reviews (11): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Want your agent to iterate on Greptile's feedback? Try greploops.

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Reviews (12): Last reviewed commit: "MAINT: address AI reviewer comment" | Re-trigger Greptile

@tylerjereddy tylerjereddy left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a full review of the entire diff again, and added several more comments. This is a bit better than before, though still neglects some of the usual checklist items.

I'm a bit perplexed by the decision to ask me for a full review before updating the PR to reflect data structure changes happening elsewhere, and making sure that all supporting precursor PRs are fully reviewed and merged--I won't be able to commit to another detailed review of this size for quite some time now, and the schedule has already slipped too much. In fact, summer conference season is about to start, along with the next batch of postdocs, so you may find it pretty hard to convince me to review this again anytime soon, let alone the other large PR(s) that may still need a few rounds of review.

Try to organize your PRs, their sizes/scope, and workflow wisely, and respect reviewer time/bandwidth, or your scheduling may slip further.

import logging
from tqdm.auto import tqdm
from pyarrow.lib import ArrowInvalid # type: ignore[import-untyped]
import ast

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly gh-33 seems to have a bit of a scoping issue where there's a contour-related bug fix and removal of some parquet-related dependency mushed together, which is probably going to slow down the review there. Those aren't clearly communicated as both being necessary for removal of ast here--I thought you indicated it was mostly the parquet-related change that was needed to move forward here. Either way, mushing those two things together is going to drain reviewer bandwidth and delay review both there and here.

It would probably be better to wait for your dependent PRs to be fully refined and merged before pinging me for another round of review here, and your CI is still failing here, but I'll give you another full round of review. Keep in mind that I'm unlikely to keep reviewing this PR frequently, so if you ping me to review it in a state where it isn't ready to go, you may face weeks of delay before the next review cycle.

This thread needs to remain open--the original point is valid--should be no need to use ast in this work.

Comment thread neat_ml/analysis/data_analysis.py Outdated


def _std_agg(x):
return x.std(ddof=0)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function needed? np.std() already exists and has the same default for ddof.

If I simply remove this function and use np.std like the rest of the world in the diff below, the full testsuite passes locally. Adding spurious abstractions without an explanation in the code drains reviewer time. Maybe you have a good reason for this, but I've found no testable evidence for it, and so it is draining reviewer time.

Details
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -19,10 +19,6 @@ __all__ = [
 log = logging.getLogger(__name__)
 
 
-def _std_agg(x):
-    return x.std(ddof=0)
-
-
 def _merge_composition_data(
     summary_df: pd.DataFrame,
     composition_df: pd.DataFrame,
@@ -618,8 +614,7 @@ def _calculate_summary_statistics(
     carry = list(set(carry_over_cols).intersection(df.columns))
 
     # initialize dictionary keys for aggregation
-    _std_agg.__name__ = "std"
-    agg_spec = {c: ["min", "max", "median", _std_agg] for c in df_num.columns}
+    agg_spec = {c: ["min", "max", "median", np.std] for c in df_num.columns}
     agg_spec.update({c: ["first"] for c in df[carry].columns})
 
     grouped = df.groupby(

@adamwitmer adamwitmer Jun 20, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this diff, the test fails for me with the following warning using pandas==2.3.3:

: FutureWarning: The provided callable <function std at 0x112e0aac0> is currently using SeriesGroupBy.std. In 
a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string 
"std" instead.
).agg(agg_spec)  # type: ignore[arg-type]

but passes without any warnings when using pandas>=3.0.0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still haven't clearly explained to me why you wrote _std_agg() and presented it for review as such when np.std already exists? What was the basis for this decision? If it was an oversight then taking ownership of that and correcting it is pretty simple. If it is based on an implementation/approach somewhere else, then linking to that for clarity saves the reviewer time.

Your current response implies that there's a flaw with my suggestion that justifies your original decision--even if that were true and you were carefully shimming around different behaviors in different pandas versions, you should have still clearly documented it with a comment in the new function (# XXX supporting pandas 2.x and 3.x behavior) rather than asking me to research it as the reviewer by not including that information inline. Your previous analysis of std ddof behavior at #4 (comment) only mentions that the pandas default for ddof is 1 and you want to change it to 0, but there's no mention of std function override behavior differences in different pandas versions.

On the topic of version support ranges, I haven't really seen much discussion on your plans since the May 5 comment about establishing a policy on version support ranges (#28 (comment)), another delay which makes it hard (almost impossible?) for me to review your code--what version ranges are you going to support? It is pretty hard for the reviewer to provide sane feedback without clear version support ranges.

In principle you could also use lambda x: np.std(x) here if you change your expected column names a bit. It looks like you may be trying to support pandas 2.x and 3.x aggregation function override behaviors and preserve column names at the same time. But asking the reviewer to divine that is a lot.

Here's a sample diff that contains clear explanations, doesn't write a new function, and passes tests on both versions of pandas that have been mentioned.

Details
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -19,10 +19,6 @@ __all__ = [
 log = logging.getLogger(__name__)
 
 
-def _std_agg(x):
-    return x.std(ddof=0)
-
-
 def _merge_composition_data(
     summary_df: pd.DataFrame,
     composition_df: pd.DataFrame,
@@ -618,8 +614,11 @@ def _calculate_summary_statistics(
     carry = list(set(carry_over_cols).intersection(df.columns))
 
     # initialize dictionary keys for aggregation
-    _std_agg.__name__ = "std"
-    agg_spec = {c: ["min", "max", "median", _std_agg] for c in df_num.columns}
+    # we want std with ddof=0 vs. the pandas default for std
+    # we can't use `np.std` directly because pandas 2.x ignores
+    # it and uses its own std with different ddof behavior
+    # TODO: change to np.std when we require pandas >=3.x
+    agg_spec = {c: ["min", "max", "median", lambda x: np.std(x)] for c in df_num.columns}
     agg_spec.update({c: ["first"] for c in df[carry].columns})
 
     grouped = df.groupby(
@@ -628,6 +627,8 @@ def _calculate_summary_statistics(
 
     grouped.columns = ['_'.join(filter(None, map(str, col))) for col in grouped.columns]
     grouped.rename(columns={f"{c}_first": c for c in df[carry].columns}, inplace=True)
+    # preserve the "std" column name after using the lambda function
+    grouped.columns = grouped.columns.str.replace("<lambda_0>", "std")
 
     return grouped
 

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. I have pushed a commit to reflect these changes with detail about reasoning behind the logic used for std calculation here. I obtained the previous logic based on this discussion: https://stackoverflow.com/a/50307104 because I had tried the solution using lambda, but didn't think about replacing the appended string with the appropriate std suffix for those column names. Going forward, I will make sure to provide more information in inline comments for changes requiring further explantion.

raise ValueError(f"Columns {missing_cols} not found in composition_df.")

return summary_df.merge(
composition_df[[merge_key, *cols_to_add]], on=merge_key, how="left"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it matter if the merge is "left?" The reason should be justified by regression testing.

If I change to a right merge per diff below, the full testsuite passes locally.

Details
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -60,7 +60,7 @@ def _merge_composition_data(
         raise ValueError(f"Columns {missing_cols} not found in composition_df.")
 
     return summary_df.merge(
-        composition_df[[merge_key, *cols_to_add]], on=merge_key, how="left"
+        composition_df[[merge_key, *cols_to_add]], on=merge_key, how="right"
     )

Parameters
----------
fname : str
The filename, e.g., 'offset -1_center_A1_Bf_Raw_uuid_bubble_data.parquet.gzip'.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that doesn't prohibit you from cleaning it up. But "ok" seems like this can stay as is..

Comment thread neat_ml/analysis/data_analysis.py Outdated
A dictionary with 'UniqueID', 'Class', 'Offset', 'Position', and 'Label',
or an empty dictionary if the pattern does not match.
"""
if method.lower() == "bubblesam":

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will allow the Literal above to accept quite a few more options

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have modified this to only accept the literal string options.

method,
n_value_errors,
len_df,
):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial typing, weird decision

input_dir = tmp_path / "input"
input_dir.mkdir()

# a file that contains all the necessary information for

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a file or filename? currently the logic here is quite separated from file creation

good = ("offset -1_center_A1_O_Ph_Raw_11111111-"
f"1111-1111-1111-111111111111_{file_suff}.parquet.gzip")
if method == "BubbleSAM":
bbox = "[0.0, 0.0, 10.0, 10.0]"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string handling of this numeric information should disappear per other PRs

}).to_parquet(
input_dir / good
)
# a parquet file that is not in the correct format and

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file or filename? here and elsewhere the comments are a bit painful to parse; sometimes the file contents are the problem and sometimes the filename proper is the problem, and the failure of the comments to capture this is draining reviewer time

assert exp_msg in all_msgs

assert isinstance(df, pd.DataFrame)
assert len(df) == len_df

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking shape may be slightly less crude

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants