Skip to content

Add first_of_class_in_chain convenience flag on ProcessedSample, OrganismSample, and DataObject (#3050)#3080

Draft
turbomam wants to merge 5 commits into
mainfrom
issue-3050-first-in-chain
Draft

Add first_of_class_in_chain convenience flag on ProcessedSample, OrganismSample, and DataObject (#3050)#3080
turbomam wants to merge 5 commits into
mainfrom
issue-3050-first-in-chain

Conversation

@turbomam

@turbomam turbomam commented May 18, 2026

Copy link
Copy Markdown
Member

Closes #3050.

What this adds

A boolean slot first_of_class_in_chain, added to ProcessedSample, OrganismSample, and DataObject. When true, marks an instance as the earliest of its class in a processing chain.

ProcessedSample and OrganismSample chain via material_processing_set; DataObject chains via workflow_execution_set.

Example: src/data/valid/Database-first_of_class_in_chain.yaml shows an OrganismSample chain linked by Culturing and a ProcessedSample chain linked by LibraryPreparation, with only the first instance of each class asserting the slot.

Design rationale, alternatives considered, and validated production queries: NMDC first_of_class_in_chain — design conversation (2026-05-19)

Convention (in the slot's comments)

  • Assert only when true. Never assert false. Absence means downstream.
  • If a new earlier instance is later added, the new earliest instance MUST set the slot and the previous head MUST have the slot removed.

What the schema cannot enforce

LinkML can validate the slot's type and range. It cannot enforce that at most one instance per chain per class asserts the slot, or that the asserting instance actually is the upstream-terminal instance of its class; those are cross-document invariants. The range: boolean will also accept false despite the convention. A refscan-style nightly job comparing asserted markings against chain topology would catch drift.

Generated files

Per CLAUDE.md, regenerated nmdc_schema/ artifacts are not committed in development PRs; they will be updated immediately before merge or release.

🤖 Generated with Claude Code

…mple

Resolves #3050 as a concrete proposal anchoring the design discussion.

Adds a boolean slot first_in_chain (in basic_slots.yaml) and wires it into
ProcessedSample and OrganismSample. The slot is a convenience flag for
identifying the head of a processing or culturing chain without recursive
has_input / has_output traversal in MongoDB.

Convention: assert only when true. Absence is the negation; do not assert
false. The slot's comments record this and acknowledge the DRY tradeoff
that Chris Mungall raised on 2026-05-18: the same answer can be derived
from chain topology, so the assertion is bookkeeping, not new information.
A note also points at an alternative design (Chain as a first-class entity
identified by initial Sample subclass instance plus terminal data objects)
that would supersede this slot if adopted.

Adds one valid example (Database-first_in_chain.yaml) with two short chains:
- OrganismSample head (first_in_chain: true) linked via Culturing to a
  subculture OrganismSample (no slot assertion).
- ProcessedSample head (first_in_chain: true) linked via LibraryPreparation
  to a derived ProcessedSample (no slot assertion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 18:20
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-3080/

Built to branch gh-pages at 2026-05-18 19:27 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a LinkML schema convenience flag for identifying the first sample in documented processing/culturing chains, supporting simpler downstream querying.

Changes:

  • Defines the first_in_chain boolean slot with usage comments.
  • Adds the slot to ProcessedSample and OrganismSample.
  • Adds a valid database example demonstrating organism and processed sample chains.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/schema/core.yaml Wires first_in_chain onto ProcessedSample and OrganismSample.
src/schema/basic_slots.yaml Defines the new first_in_chain slot and documentation.
src/data/valid/Database-first_in_chain.yaml Adds an example database instance showing asserted chain heads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/schema/basic_slots.yaml Outdated
Comment thread src/schema/basic_slots.yaml Outdated
Comment on lines +237 to +240
Open design question (cmungall, 2026-05-18 DM): a chain might be
better modeled as a first-class entity identified by its initial
Sample subclass instance plus one or more terminal data objects.
That direction would supersede this slot.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in commit 19dbdf8, which was pushed before this review surfaced. The DM-referenced "Chain as first-class entity" speculation was removed from the slot's comments and is no longer in the generated schema docs.

The PR body still cites the date of @cmungall's DRY pushback for design provenance, but that's a PR description, not generated schema doc, and is a normal venue for recording where an objection came from.

Comment thread src/data/valid/Database-first_in_chain.yaml Outdated
turbomam and others added 2 commits May 18, 2026 14:28
…-entity speculation

Per Mark's review on PR #3080:
- The bookkeeping discipline only works if it is followed consistently, so
  the two SHOULDs in the slot's comments become MUSTs. If you cannot commit
  to the maintenance, do not assert the slot at all.
- Removed the speculative note that framed a future Chain class as a
  supersession path for this slot. Whether Chain-as-entity would actually
  eliminate the asserted-bookkeeping problem has not been worked out, so
  it does not belong as a comment alongside the slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The slot's semantics are "first instance of this class in the chain", not
"absolute first node in the chain". The previous name implied the latter,
which led to confusion when reviewing.

Concretely: in every prod chain that includes a ProcessedSample, the chain's
absolute head is a Biosample, not a ProcessedSample. ProcessedSample only
ever marks the first of its class in the chain, not the chain root.

Verified against prod 2026-05-18: 5,549 ProcessedSamples are first-of-class
(producing MP has no procsm in has_input); 14,023 continue an existing
procsm subchain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@turbomam turbomam changed the title Add first_in_chain convenience flag on ProcessedSample and OrganismSample (#3050) Add first_of_class_in_chain convenience flag on ProcessedSample and OrganismSample (#3050) May 18, 2026
@turbomam

Copy link
Copy Markdown
Member Author

Prod-data evidence (verified 2026-05-18 via SSH tunnel to NMDC prod Mongo)

What population this slot would address

Probe Count
processed_sample_set total 20,776
ProcessedSamples produced by an upstream MaterialProcessing 20,776 (all of them)
ProcessedSamples that are first of class (producing MP has no procsm in has_input) 5,549
ProcessedSamples that continue an existing procsm subchain 14,023
data_object_set total 266,399
DataObjects with a producing WorkflowExecution 246,891
DataObjects that are absolute chain heads (no producing WE) 19,508

Two observations:

  1. Every ProcessedSample in prod has an upstream MaterialProcessing, which is why the slot's semantics had to be "first of class," not "first node." The chain's absolute head is always a Biosample.
  2. The OrganismSample class exists in the schema (PR Add Organism, OrganismSample, Isolation, Culturing classes for JGI isolate workflows #2977) but no osm- instances participate in chains in prod yet. The slot is forward-looking for that class.

Working $graphLookup queries (validated against prod)

Q1. Walk upstream from any ProcessedSample to find the chain's absolute head
db.processed_sample_set.aggregate([
  { $match: { id: "<some_procsm_id>" }},
  { $graphLookup: {
      from: "material_processing_set",
      startWith: "$id",
      connectFromField: "has_input",
      connectToField:   "has_output",
      as: "upstream_chain",
      depthField: "depth",
      maxDepth: 20
  }}
])

Real result against prod, seeded with the output of a LibraryPreparation:

Seed: nmdc:procsm-11-s71h1s64
  depth=0 nmdc:LibraryPreparation  in=[procsm-11-e3m9am88] out=[procsm-11-s71h1s64]
  depth=1 nmdc:Extraction          in=[procsm-11-m6cgda89] out=[procsm-11-e3m9am88]
  depth=2 nmdc:Pooling              in=[bsm-11-zw0jr671, bsm-11-ftr88019, bsm-11-06qrej20]
                                    out=[procsm-11-m6cgda89]
HEAD: nmdc:bsm-11-zw0jr671 (Biosample, not in processed_sample_set)
Q2. Walk downstream from a Biosample to enumerate its derived chain
db.biosample_set.aggregate([
  { $match: { id: "<bsm_id>" }},
  { $graphLookup: {
      from: "material_processing_set",
      startWith: "$id",
      connectFromField: "has_output",    // swapped vs Q1
      connectToField:   "has_input",     // swapped vs Q1
      as: "downstream_chain",
      depthField: "depth",
      maxDepth: 20
  }}
])

Sample result:

Seed: nmdc:bsm-11-5rkz1n03
  depth=0 nmdc:Extraction           in=[bsm-11-5rkz1n03] out=[procsm-11-bsqeg644]
  depth=1 nmdc:LibraryPreparation   in=[procsm-11-bsqeg644] out=[procsm-11-8f15qp47]
Q3. Same pattern for DataObject through workflow_execution_set
db.data_object_set.aggregate([
  { $match: { id: "<dobj_id>" }},
  { $graphLookup: {
      from: "workflow_execution_set",
      startWith: "$id",
      connectFromField: "has_input",
      connectToField:   "has_output",
      as: "upstream_chain",
      depthField: "depth",
      maxDepth: 20
  }}
])

Real result, seeded with a MagsAnalysis output:

Seed: nmdc:dobj-11-nc0exk40
  depth=0 nmdc:MagsAnalysis              n_in=13 n_out=9
  depth=1 nmdc:MetagenomeAnnotation      n_in=1  n_out=23
  depth=1 nmdc:MetagenomeAssembly        n_in=1  n_out=6
  depth=2 nmdc:ReadQcAnalysis            n_in=2  n_out=3
Q4. Population-level: count first-of-class ProcessedSamples from the MP side (cheap)
db.material_processing_set.aggregate([
  { $match: { "has_output.0": { $regex: ":procsm-" }}},
  { $match: { "has_input": { $not: { $elemMatch: { $regex: ":procsm-" }}}}},
  { $count: "first_of_class_procsm" }
])
// Result: 5,549
Q5. Validate asserted `first_of_class_in_chain: true` against the topology (the refscan check that doesn't exist yet)
// Any ProcessedSample asserting first_of_class_in_chain: true
// that has a producing MP with another procsm in its inputs?
// (= the assertion is stale or wrong)
db.processed_sample_set.aggregate([
  { $match: { first_of_class_in_chain: true }},
  { $lookup: {
      from: "material_processing_set",
      localField: "id",
      foreignField: "has_output",
      as: "producing_mp"
  }},
  { $match: { "producing_mp.has_input": { $elemMatch: { $regex: ":procsm-" }}}},
  { $project: { id: 1, n_violating_mps: { $size: "$producing_mp" }}}
])

This is the consistency check that path A in the PR description would put into refscan.

Scale evidence (why we want the slot rather than computing on every read)

Running a single-pass $graphLookup over all 20,776 ProcessedSamples to label heads vs downstream timed out at the server's 120-second maxTimeMS ceiling. Per-sample lookups (Q1) return in <100 ms. The cliff between per-sample and population-level is exactly the cost the slot avoids amortizing per query.

This is also implicit in the runtime team's /nmdcschema/linked_instances design: it materializes traversal results into temp collections (drop_stale_temp_linked_instances_collections exists in the codebase), which means the team has independently concluded that live $graphLookup doesn't scale per request.

🤖 Generated with Claude Code

DataObject chains via workflow_execution_set the same way ProcessedSample and
OrganismSample chain via material_processing_set. The prod-data evidence in
PR #3080 shows 19,508 absolute chain-head DataObjects (no producing
WorkflowExecution) and 246,891 downstream — the pattern this slot addresses
is most strongly attested for this class.

Schema-only addition; the example file's introductory comment notes that a
DataObject demonstration was omitted because a minimal valid WorkflowExecution
requires several extra required slots (was_informed_by, etc.) and would not
clarify behavior beyond the two sample-chain demos already present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@turbomam turbomam changed the title Add first_of_class_in_chain convenience flag on ProcessedSample and OrganismSample (#3050) Add first_of_class_in_chain convenience flag on ProcessedSample, OrganismSample, and DataObject (#3050) May 18, 2026
@turbomam turbomam requested review from aclum, cmungall and eecavanna May 18, 2026 19:18
@turbomam turbomam changed the title Add first_of_class_in_chain convenience flag on ProcessedSample, OrganismSample, and DataObject (#3050) Add first_of_class_in_chain convenience flag on ProcessedSample, OrganismSample, and DataObject (#3050) May 18, 2026
1. Acknowledge the boolean-accepts-false enforcement gap in the slot's
   comments. The "never assert false" convention is one the schema cannot
   express through its range alone, parallel to the cross-document
   discipline gap already documented in the PR description. A LinkML rule
   with equals_expression: "True" could enforce this, but generator
   support is uneven and the slot itself is under active design review.

2. Update the chain-diagram comment in the example file to use the full
   IDs (nmdc:osm-99-aaaaaa1, etc.) that match the actual instances below.
   The previous shortened form invited confusion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aclum

aclum commented May 19, 2026

Copy link
Copy Markdown
Contributor

To discuss at the metadata meeting:
how will this be enforced? with runtime?

related question: are two DataObjects that are has_output from a DataGeneration record allowed to both be labeled with first_in_class_chain? This can happen with NucleotideSequencing records if the data is not interleaved data_object_type of Metagenome Raw Read 1 and Metagenome Raw Read 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create slot on Sample to indicate a record is the head or first sample record in a chain

3 participants