Add first_of_class_in_chain convenience flag on ProcessedSample, OrganismSample, and DataObject (#3050) #3080
turbomam wants to merge 5 commits
…mple

Resolves #3050 as a concrete proposal anchoring the design discussion.

Adds a boolean slot first_in_chain (in basic_slots.yaml) and wires it into ProcessedSample and OrganismSample. The slot is a convenience flag for identifying the head of a processing or culturing chain without recursive has_input / has_output traversal in MongoDB.

Convention: assert only when true. Absence is the negation; do not assert false. The slot's comments record this and acknowledge the DRY tradeoff that Chris Mungall raised on 2026-05-18: the same answer can be derived from chain topology, so the assertion is bookkeeping, not new information. A note also points at an alternative design (Chain as a first-class entity identified by initial Sample subclass instance plus terminal data objects) that would supersede this slot if adopted.

Adds one valid example (Database-first_in_chain.yaml) with two short chains:
- OrganismSample head (first_in_chain: true) linked via Culturing to a subculture OrganismSample (no slot assertion).
- ProcessedSample head (first_in_chain: true) linked via LibraryPreparation to a derived ProcessedSample (no slot assertion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
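For context, the recursive traversal the flag is meant to avoid can be sketched in Python over in-memory records. This is only an illustration under stated assumptions: processes are plain dicts with `has_input` and `has_output` lists, each chain is linear (one input per process), and `find_chain_head` plus the ids ending in `aaaaaa2` and `aaaaaa3` are hypothetical, not part of nmdc-schema or any query run against MongoDB.

```python
def find_chain_head(sample_id, processes):
    """Walk upstream until no process lists the current id as an output."""
    # Index each output id by the process that produced it.
    produced_by = {out: proc for proc in processes for out in proc["has_output"]}
    seen = set()  # guards against cycles in malformed data
    current = sample_id
    while current in produced_by and current not in seen:
        seen.add(current)
        # Assumption: linear chain, so only the first input is followed.
        current = produced_by[current]["has_input"][0]
    return current


# Hypothetical culturing chain: aaaaaa1 -> aaaaaa2 -> aaaaaa3
processes = [
    {"has_input": ["nmdc:osm-99-aaaaaa1"], "has_output": ["nmdc:osm-99-aaaaaa2"]},
    {"has_input": ["nmdc:osm-99-aaaaaa2"], "has_output": ["nmdc:osm-99-aaaaaa3"]},
]
samples = [
    {"id": "nmdc:osm-99-aaaaaa1", "first_in_chain": True},
    {"id": "nmdc:osm-99-aaaaaa2"},
    {"id": "nmdc:osm-99-aaaaaa3"},
]

# Without the flag: one upstream hop per link in the chain.
head_by_traversal = find_chain_head("nmdc:osm-99-aaaaaa3", processes)

# With the flag: a single filter over the sample records.
heads_by_flag = [s["id"] for s in samples if s.get("first_in_chain") is True]
```

Both approaches give the same head here, which is the DRY tradeoff noted above: the flag restates what the topology already implies, in exchange for a cheaper lookup.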
Pull request overview
This PR adds a LinkML schema convenience flag for identifying the first sample in documented processing/culturing chains, supporting simpler downstream querying.
Changes:
- Defines the `first_in_chain` boolean slot with usage comments.
- Adds the slot to `ProcessedSample` and `OrganismSample`.
- Adds a valid database example demonstrating organism and processed sample chains.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/schema/core.yaml` | Wires `first_in_chain` onto `ProcessedSample` and `OrganismSample`. |
| `src/schema/basic_slots.yaml` | Defines the new `first_in_chain` slot and documentation. |
| `src/data/valid/Database-first_in_chain.yaml` | Adds an example database instance showing asserted chain heads. |
Open design question (cmungall, 2026-05-18 DM): a chain might be
better modeled as a first-class entity identified by its initial
Sample subclass instance plus one or more terminal data objects.
That direction would supersede this slot.
Already addressed in commit 19dbdf8, which was pushed before this review surfaced. The DM-referenced "Chain as first-class entity" speculation was removed from the slot's comments and is no longer in the generated schema docs.
The PR body still cites the date of @cmungall's DRY pushback for design provenance, but that's a PR description, not generated schema doc, and is a normal venue for recording where an objection came from.
…-entity speculation

Per Mark's review on PR #3080:
- The bookkeeping discipline only works if it is followed consistently, so the two SHOULDs in the slot's comments become MUSTs. If you cannot commit to the maintenance, do not assert the slot at all.
- Removed the speculative note that framed a future Chain class as a supersession path for this slot. Whether Chain-as-entity would actually eliminate the asserted-bookkeeping problem has not been worked out, so it does not belong as a comment alongside the slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The slot's semantics are "first instance of this class in the chain", not "absolute first node in the chain". The previous name implied the latter, which led to confusion when reviewing.

Concretely: in every prod chain that includes a ProcessedSample, the chain's absolute head is a Biosample, not a ProcessedSample. ProcessedSample only ever marks the first of its class in the chain, not the chain root.

Verified against prod 2026-05-18: 5,549 ProcessedSamples are first-of-class (the producing MaterialProcessing record has no procsm id in has_input); 14,023 continue an existing ProcessedSample subchain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
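The first-of-class test described in this commit (the producing process has no procsm id in has_input) can be sketched in Python. This is a minimal illustration rather than the query that was run against prod: it assumes ProcessedSample ids start with `nmdc:procsm-` and that records are plain dicts, and the function name, record ids, and the `nmdc:bsm-` Biosample prefix are hypothetical.

```python
PROCSM_PREFIX = "nmdc:procsm-"  # assumed id prefix for ProcessedSample


def classify_processed_samples(material_processings):
    """Split ProcessedSample outputs into first-of-class and continuing ids."""
    first_of_class, continuing = set(), set()
    for mp in material_processings:
        # Does this process consume at least one ProcessedSample?
        has_procsm_input = any(
            i.startswith(PROCSM_PREFIX) for i in mp.get("has_input", [])
        )
        for out in mp.get("has_output", []):
            if not out.startswith(PROCSM_PREFIX):
                continue  # ignore outputs of other classes
            if has_procsm_input:
                continuing.add(out)
            else:
                first_of_class.add(out)
    return first_of_class, continuing


# Hypothetical chain: Biosample -> ProcessedSample -> ProcessedSample
records = [
    {"has_input": ["nmdc:bsm-99-000001"], "has_output": ["nmdc:procsm-99-000001"]},
    {"has_input": ["nmdc:procsm-99-000001"], "has_output": ["nmdc:procsm-99-000002"]},
]
first, rest = classify_processed_samples(records)
print(sorted(first))  # ['nmdc:procsm-99-000001']
print(sorted(rest))   # ['nmdc:procsm-99-000002']
```

The first ProcessedSample is first-of-class even though the chain's absolute head is the Biosample, which is the distinction the rename is meant to capture.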
Prod-data evidence (verified 2026-05-18 via SSH tunnel to NMDC prod Mongo)
What population this slot would address
Two observations:
Working
DataObject chains via workflow_execution_set the same way ProcessedSample and OrganismSample chain via material_processing_set. The prod-data evidence in PR #3080 shows 19,508 absolute chain-head DataObjects (no producing WorkflowExecution) and 246,891 downstream; the pattern this slot addresses is most strongly attested for this class.

Schema-only addition; the example file's introductory comment notes that a DataObject demonstration was omitted because a minimal valid WorkflowExecution requires several extra required slots (was_informed_by, etc.) and would not clarify behavior beyond the two sample-chain demos already present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Acknowledge the boolean-accepts-false enforcement gap in the slot's comments. The "never assert false" convention is one the schema cannot express through its range alone, parallel to the cross-document discipline gap already documented in the PR description. A LinkML rule with equals_expression: "True" could enforce this, but generator support is uneven and the slot itself is under active design review.
2. Update the chain-diagram comment in the example file to use the full IDs (nmdc:osm-99-aaaaaa1, etc.) that match the actual instances below. The previous shortened form invited confusion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
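Until the convention can be enforced in the schema, a check outside LinkML could flag violations. The following is a minimal sketch, assuming instances have already been loaded as plain dicts (for example from a Database YAML or JSON file); `find_non_true_assertions` and the id ending in `aaaaaa2` are hypothetical and not part of LinkML or nmdc-schema.

```python
SLOT = "first_of_class_in_chain"


def find_non_true_assertions(documents):
    """Return ids of documents that assert the slot with any value other than True."""
    return [
        doc["id"]
        for doc in documents
        if SLOT in doc and doc[SLOT] is not True
    ]


documents = [
    {"id": "nmdc:osm-99-aaaaaa1", SLOT: True},   # follows the convention
    {"id": "nmdc:osm-99-aaaaaa2", SLOT: False},  # violates "never assert false"
    {"id": "nmdc:procsm-99-000001"},             # absence is fine
]
violations = find_non_true_assertions(documents)
print(violations)  # ['nmdc:osm-99-aaaaaa2']
```

The check treats an absent slot as valid, matching the convention that absence is the negation.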
To discuss at the metadata meeting, a related question: are two DataObjects that are both has_output of a DataGeneration record allowed to both be labeled with first_of_class_in_chain? This can happen with NucleotideSequencing records if the data is not interleaved.
Closes #3050.
What this adds
A boolean slot `first_of_class_in_chain`, added to `ProcessedSample`, `OrganismSample`, and `DataObject`. When `true`, marks an instance as the earliest of its class in a processing chain.

ProcessedSample and OrganismSample chain via `material_processing_set`; DataObject chains via `workflow_execution_set`.

Example: `src/data/valid/Database-first_of_class_in_chain.yaml` shows an OrganismSample chain linked by Culturing and a ProcessedSample chain linked by LibraryPreparation, with only the first instance of each class asserting the slot.

Design rationale, alternatives considered, and validated production queries: NMDC first_of_class_in_chain — design conversation (2026-05-19)

Convention (in the slot's `comments`)
Assert only when `true`. Never assert `false`. Absence means downstream.

What the schema cannot enforce
LinkML can validate the slot's type and range. It cannot enforce that at most one instance per chain per class asserts the slot, or that the asserting instance actually is the upstream-terminal instance of its class; those are cross-document invariants. The `range: boolean` will also accept `false` despite the convention. A refscan-style nightly job comparing asserted markings against chain topology would catch drift.

Generated files
Per `CLAUDE.md`, regenerated `nmdc_schema/` artifacts are not committed in development PRs; they will be updated immediately before merge or release.

🤖 Generated with Claude Code
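The drift check mentioned under "What the schema cannot enforce" could look roughly like the Python sketch below. It is an illustration only, under these assumptions: instances and processes are plain dicts already fetched from the database, class membership is inferred from an id prefix such as `nmdc:osm-`, and an instance counts as first-of-class when it is not the output of any process that has a same-class input. The function names and the ids ending in `aaaaaa2` and `aaaaaa3` are hypothetical.

```python
SLOT = "first_of_class_in_chain"


def derive_first_of_class(instance_ids, processes, prefix):
    """Ids of the class that are not produced from another instance of the class."""
    continuing = set()
    for proc in processes:
        if any(i.startswith(prefix) for i in proc.get("has_input", [])):
            continuing.update(
                o for o in proc.get("has_output", []) if o.startswith(prefix)
            )
    return {i for i in instance_ids if i.startswith(prefix)} - continuing


def find_drift(instances, processes, prefix):
    """Compare asserted flags with topology; return (extra, missing) id lists."""
    ids = {doc["id"] for doc in instances}
    asserted = {
        doc["id"]
        for doc in instances
        if doc["id"].startswith(prefix) and doc.get(SLOT) is True
    }
    derived = derive_first_of_class(ids, processes, prefix)
    return sorted(asserted - derived), sorted(derived - asserted)


instances = [
    {"id": "nmdc:osm-99-aaaaaa1", SLOT: True},  # correct head
    {"id": "nmdc:osm-99-aaaaaa2", SLOT: True},  # subculture, wrongly asserted
    {"id": "nmdc:osm-99-aaaaaa3"},              # unlinked head, flag missing
]
processes = [
    {"has_input": ["nmdc:osm-99-aaaaaa1"], "has_output": ["nmdc:osm-99-aaaaaa2"]},
]
extra, missing = find_drift(instances, processes, "nmdc:osm-")
print(extra)    # ['nmdc:osm-99-aaaaaa2']
print(missing)  # ['nmdc:osm-99-aaaaaa3']
```

Whether a missing flag should count as drift depends on whether asserting the slot is mandatory for every first-of-class instance; the sketch reports both directions so that policy can be decided separately.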