
feat(research-ops): add enterprise Research Operations domain (v2.9.0) #752

Merged
alirezarezvani merged 3 commits into dev from claude/research-skills-category-qWMjm
May 27, 2026


@alirezarezvani (Owner) commented May 27, 2026

Summary

New top-level research-ops/ domain — the managed, cross-functional counterpart to the academic research/ domain. Single domain plugin (the commercial/ v2.8.0 pattern): orchestrator (context: fork) + 4 managed sub-skills, each now wired with per-skill onboarding, a customization config the tools actually consume, and an isolated opt-in autoresearch bridge.

| Skill | Purpose |
| --- | --- |
| research-ops-skills | Orchestrator: two-signal routing + Matt Pocock grill discipline |
| clinical-research | Endpoint selection, sample-size/power (means/proportions/survival), phase-gate feasibility |
| research-finance | R&D program budgeting (F&A split), burn/runway, capitalize-vs-expense routing |
| market-research | TAM/SAM/SOM (both methods), survey sampling (FPC + per-segment), Kotler segmentation |
| product-research | Goal-matched study design, method-based saturation, insight synthesis |

Per-skill integration (onboarding · customization · autoresearch)

Each sub-skill ships three integration scripts beside its 3 analysis tools:

  • onboard.py — its own onboarding questionnaire (distinct question set per skill). Interactive, or --defaults / --set key=value / --reset / --scope {global,project}. Writes ~/.config/research-ops/<skill>.json or ./.research-ops/<skill>.json.
  • config_loader.py — loads that config (precedence project > global > defaults; RESEARCH_OPS_NO_CONFIG=1 bypass). Every scoring tool reads it, so saved answers change behavior: default profile, thresholds (alpha/power/dropout, F&A rate, runway, confidence, MoE, insight source-threshold), and the named owners printed on clinical/finance outputs. CLI flags always override.
  • ar_evaluator.py — an isolated, opt-in ground-truth evaluator bridging to engineering/autoresearch-agent. Invoked only on explicit user request; the loop edits the skill's input file, the evaluator is never edited. Metrics: clinical feasibility_composite↑, finance runway_months↑, market tam_divergence↓, product validated_insights↑. No cross-skill coupling.
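
The precedence described for config_loader.py (project > global > defaults, with the RESEARCH_OPS_NO_CONFIG=1 bypass) can be sketched roughly as below. This is a minimal illustration under assumptions, not the shipped loader: the DEFAULTS keys, the function names, and the `home`/`cwd`/`env` parameters are hypothetical and exist only to make the sketch testable. The two file locations are the ones named in the bullets above.

```python
import copy
import json
import os
from pathlib import Path

# Hypothetical defaults for illustration; the real keys live in each skill's loader.
DEFAULTS = {"profile": "default", "thresholds": {"alpha": 0.05, "power": 0.80}}


def deep_merge(base, override):
    """Return a new dict where nested dicts in `override` update `base` key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def read_json(path):
    """Return the parsed dict at `path`, or None if it is missing or unusable."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None
    return data if isinstance(data, dict) else None


def load_config(skill, home=None, cwd=None, env=None):
    """Resolve a skill's config with precedence project > global > defaults."""
    env = os.environ if env is None else env
    if env.get("RESEARCH_OPS_NO_CONFIG") == "1":
        return copy.deepcopy(DEFAULTS)  # bypass: ignore every saved file
    home = Path.home() if home is None else Path(home)
    cwd = Path.cwd() if cwd is None else Path(cwd)
    layers = [
        home / ".config" / "research-ops" / f"{skill}.json",  # global
        cwd / ".research-ops" / f"{skill}.json",  # project, applied last so it wins
    ]
    config = copy.deepcopy(DEFAULTS)
    for path in layers:
        layer = read_json(path)
        if layer:
            config = deep_merge(config, layer)
    return config
```

CLI flags would then be applied on top of the returned dict, which is what keeps "CLI flags always override" true regardless of what onboarding saved.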

Hard rules (unchanged, enforced via persona + anti-patterns)

clinical = estimate + named clinical owner (never fact) · finance surfaces assumptions, routes treatment to a named finance owner (never auto-decides) · market shows method (both ways) + assumptions (never a single number) · product requires recurrence across independent participants.

Plugin builds

research-ops/ is a Claude Code plugin (plugin.json + marketplace entry, 62 plugins). Cross-platform artifacts materialized: Codex symlinks, Gemini SKILL.md mirrors, Vibe symlinks for all 5 skills; indices regenerated. Root CLAUDE.md bumped to v2.9.0.

Test plan

  • All 24 scripts (12 analysis + 12 onboarding/customization/autoresearch) pass --help, --sample/--show, and compileall
  • 12 scoring tools emit valid JSON; every scoring tool accepts --profile
  • Customization verified in use end-to-end: onboarding answers flow into profile, power, named owners, F&A rate, confidence/MoE, and insight threshold (e.g. threshold=2 promotes a second cluster to INSIGHT)
  • No network/LLM imports; RESEARCH_OPS_NO_CONFIG=1 bypass works
  • scripts/check_plugin_json.py --all passes; marketplace + all 3 sync indices parse
  • Cross-platform syncs discover all 5 research-ops skills; unrelated codex run/status/review symlink churn reverted
  • Optional: /plugin-audit 8-phase pipeline on the orchestrator + one sub-skill

https://claude.ai/code/session_01PUNmQVE4WYvcrzpq2anC3D

alirezarezvani and others added 2 commits May 26, 2026 16:07
New top-level research-ops/ domain — the managed, cross-functional
counterpart to the academic research/ domain. Single domain plugin
(commercial/ pattern): orchestrator (context: fork) + 4 sub-skills:
clinical-research (study design), research-finance (R&D program finance),
market-research (sizing/survey/segmentation methodology), product-research
(user-research method + insight repository).

12 stdlib-only deterministic Python tools (all pass --help/--sample/--output
json), 12 reference docs (5-7 sources each), 4 asset templates, a
cs-research-ops-orchestrator agent, and 6 commands (router + grill + 4
per-skill). Hard rules enforced via persona + anti-patterns: clinical
outputs are estimates with a named clinical owner, finance routes
capitalize-vs-expense to a named finance owner, market sizes show method
(both ways) + assumptions, product insights require cross-participant
recurrence. Explicit distinct_from framing keeps clear of ra-qm-team,
finance, research/grants, product-team, and marketing-skill.

Registered in marketplace.json (61 -> 62 plugins) and the codex/gemini/
hermes/vibe sync scripts + generate-docs; root CLAUDE.md bumped to v2.9.0.

https://claude.ai/code/session_01PUNmQVE4WYvcrzpq2anC3D
@alirezarezvani alirezarezvani marked this pull request as ready for review May 27, 2026 00:33

claude Bot commented May 27, 2026

Review: feat(research-ops) — enterprise Research Operations domain (v2.9.0)

Overall this is a well-designed PR following the commercial/ + business-operations/ v2.8.0 pattern. Domain differentiation is clear, hard-rules are thoughtfully enforced, and the corpus is internally consistent. A handful of code-quality issues should be addressed before merge.


Blocking Issues (fix before merge to dev)

1. File handle leak in all 12 scripts

Every script uses:

data = SAMPLE if (args.sample or not args.input) else json.load(open(args.input))

open() without a with block leaks the file handle. Replace with:

if args.sample or not args.input:
    data = SAMPLE
else:
    with open(args.input) as f:
        data = json.load(f)

Affects all 12 scripts: endpoint_selector.py, market_sizer.py, capex_vs_opex_router.py, insight_synthesizer.py, and the remaining 8.

2. Dead variable in endpoint_selector.py:classify()

top = ordered[0]["composite"] is assigned but never referenced. Any linter flags this. Remove it or use it.

3. import math inside function body — saturation_planner.py:usability_plan()

import math is nested inside the function. Move it to module level with the other imports.

4. No JSON parse error handling on --input files

Malformed JSON produces a raw JSONDecodeError traceback. The main() already has try/except ValueError for calculation errors — extend it to also catch json.JSONDecodeError and emit a friendly error: invalid JSON in <path> to stderr.
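
Items 1 and 4 can be handled in one place. The sketch below is a hedged illustration rather than the scripts' actual main(): SAMPLE, load_input, and run are placeholder names. One detail worth knowing is that json.JSONDecodeError is a subclass of ValueError, so the friendlier clause has to come before the generic ValueError handler.

```python
import json
import sys

SAMPLE = {"example": True}  # placeholder; each tool ships its own sample payload


def load_input(path, use_sample=False):
    """Return SAMPLE when requested or when no path is given, else parse the file."""
    if use_sample or not path:
        return SAMPLE
    with open(path, encoding="utf-8") as f:  # the context manager closes the handle
        return json.load(f)


def run(path, use_sample=False):
    """Return a process exit code: 0 on success, 1 on bad input."""
    try:
        data = load_input(path, use_sample)
    except json.JSONDecodeError as exc:
        # JSONDecodeError subclasses ValueError, so this clause must come first.
        print(f"error: invalid JSON in {path}: {exc.msg} (line {exc.lineno})",
              file=sys.stderr)
        return 1
    except ValueError as exc:  # calculation errors, as in the existing handler
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(json.dumps(data))
    return 0
```

A missing or unreadable file still raises OSError here; whether that also deserves a friendly message is a separate decision.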


Non-blocking Issues (can follow in v2.9.1)

5. --sample silently overrides --standard in capex_vs_opex_router.py

python3 capex_vs_opex_router.py --sample --standard usgaap silently uses IFRS from the hardcoded sample. If the intent is that sample data wins, add a comment. If not, let the explicit CLI flag take precedence.
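
If the explicit flag should win, one way to get there is to default --standard to None so "flag omitted" can be told apart from "flag given". This is a sketch under assumptions: the lowercase `ifrs` choice, the `standard` key in the sample, and the resolve_standard helper are hypothetical, and only `usgaap` is taken from the command above.

```python
import argparse

SAMPLE = {"standard": "ifrs", "items": []}  # placeholder; the real sample is richer


def resolve_standard(argv):
    """Pick the accounting standard: explicit CLI flag first, then the data's value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--sample", action="store_true")
    parser.add_argument("--input")
    # default=None distinguishes an omitted flag from an explicit one.
    parser.add_argument("--standard", choices=["ifrs", "usgaap"], default=None)
    args = parser.parse_args(argv)
    # Loading --input is omitted in this sketch; only the sample path is modelled.
    data = dict(SAMPLE) if (args.sample or not args.input) else {}
    return args.standard or data.get("standard", "ifrs")
```

With this shape, `--sample --standard usgaap` resolves to `usgaap`, while `--sample` alone keeps the sample's own value.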

6. Very long line in market_sizer.py:bottoms_up()

The implied_customers_at_SOM expression is ~180 chars. Split it into a named intermediate.


Test Plan Gap

7. Plugin-audit unchecked — recommended before dev→main

commercial/ and business-operations/ both ran the 8-phase /plugin-audit and posted scores in their PRs. For a new top-level domain, consider running it on research-ops-skills (orchestrator) and clinical-research before the dev→main merge.


Minor

8. Authorship inconsistency in orchestrator SKILL.md frontmatter

research-ops-skills/SKILL.md has author: claude-code-skills; plugin.json, the agent, and all commands use Alireza Rezvani. Align them.


Positive Findings

  • context: fork correctly set in orchestrator SKILL.md — consistent with sibling orchestrators
  • plugin.json skills array uses "./skills/..." prefix per canonical CC 2.1.144+ form
  • marketplace.json correctly bumped 61 → 62; all 5 sync scripts updated consistently in one PR
  • CLAUDE.md counters (335 skills, 15 domains, ~463 tools, ~602 refs) are arithmetically correct
  • ESTIMATE banners in clinical tools with named owner requirements are exactly right for biostatistics tooling
  • Named owner chain in phase_gate_scorer.py (PI, Medical Monitor, Biostatistician, Regulatory Owner) and never-auto-decides in capex_vs_opex_router.py match the domain's hard rules
  • Forcing-question library in orchestrator SKILL.md implements one-per-turn Matt Pocock discipline with recommended answers and canon citations
  • distinct_from coverage (ra-qm-team / finance / research/grants / product-team / marketing-skill) is thorough and defensible
  • z-table lookup in sample_size_estimator.py avoids scipy entirely — clean stdlib-only design
  • Division guard in market_sizer.py:bottoms_up() correctly protects against price=0
  • source extension field in plugin.json uses the approved format with spec + build_pattern + distinct_from
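
On the stdlib-only point: the review credits sample_size_estimator.py with a z-table lookup. For comparison, statistics.NormalDist (in the standard library since Python 3.8) offers another scipy-free route to the same quantiles. The sketch below is not the tool's code; it uses the textbook normal-approximation formula for comparing two means, n per group = 2((z_{1-α/2} + z_{power})σ/δ)², and the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison of means."""
    if delta == 0 or sigma <= 0:
        raise ValueError("delta must be non-zero and sigma positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)
```

For a standardized effect of 0.5 (delta=0.5, sigma=1.0) at the default alpha and power, this gives 63 per group. As with the clinical tools themselves, the result is an estimate for a biostatistician to confirm.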

Verdict: Approve with items 1–4 addressed before merge. Items 5–8 can be a follow-up. Item 7 strongly recommended before dev→main. The domain design and hard-rule implementation are solid.

…utoresearch bridge

Connects each research-ops sub-skill to engineering/autoresearch-agent and adds
per-skill onboarding with a customization approach that the tools actually consume.

Each sub-skill (clinical-research, research-finance, market-research,
product-research) now ships three integration scripts:
- onboard.py        — its own onboarding questionnaire; interactive or
                      --defaults / --set key=value / --reset / --scope; writes
                      ~/.config/research-ops/<skill>.json (global) or
                      ./.research-ops/<skill>.json (project).
- config_loader.py  — loads that config (project > global > defaults;
                      RESEARCH_OPS_NO_CONFIG=1 bypass). Every scoring tool now
                      reads it so saved answers change behavior: default profile,
                      thresholds (alpha/power/dropout, F&A rate, runway, confidence,
                      MoE, insight source-threshold), and named owners printed on
                      clinical/finance outputs. CLI flags always override.
- ar_evaluator.py   — an isolated, OPT-IN ground-truth evaluator bridging to
                      autoresearch. The loop edits the skill's input file and the
                      evaluator (never edited) scores it: clinical
                      feasibility_composite (higher), finance runway_months (higher),
                      market tam_divergence (lower), product validated_insights
                      (higher). No cross-skill coupling; invoked only on explicit
                      user request.

SKILL.md (each), the orchestrator, the agent, per-skill commands, and the domain
CLAUDE.md document the onboarding + opt-in autoresearch handoff. plugin.json /
marketplace describe 24 tools (12 analysis + 12 integration).

Builds the cross-platform plugins: codex symlinks + gemini SKILL.md mirrors +
vibe symlinks materialized for all 5 research-ops skills; indices regenerated.
Root CLAUDE.md scope/highlights updated.

https://claude.ai/code/session_01PUNmQVE4WYvcrzpq2anC3D

claude Bot commented May 27, 2026

Code Review — PR #752: feat(research-ops): enterprise Research Operations domain (v2.9.0)

Overall this is a well-structured, ambitious PR that follows the repo's established patterns cleanly. The domain design, plugin packaging, and SKILL.md quality are all strong. A few issues need attention before merge — one regression and one class of code bug are the main blockers.


✅ What's done well

  • Plugin.json — all five skills use the required ./ prefix convention (./skills/research-ops-skills, etc.); all ClawHub required fields present; source.distinct_from scope boundaries are clear and accurate.
  • SKILL.md quality — all four sub-skills have YAML frontmatter with Use when trigger clauses, a Forcing-question library (5 questions each, with recommended answers + canon citations), and 3 references citing 5–7 authoritative sources. Fully compliant with the Matt Pocock grill pattern.
  • Hard rules enforced — clinical outputs carry the "ESTIMATE + named clinical owner" banner, finance surfaces assumption blocks and routes to a named finance owner, market outputs always show method + assumptions, product insights require recurrence across independent participants. The persona constraints are correctly wired.
  • ar_evaluator design — isolated, opt-in, locked-ground-truth. Loop edits the skill's input file; the evaluator is never edited. Metrics are machine-parseable (feasibility_composite: <float>). No cross-skill coupling detected.
  • config_loader.py design — project > global > defaults precedence is correct; RESEARCH_OPS_NO_CONFIG=1 bypass works as documented; deep-merge handles nested dicts properly.
  • Stdlib-only — confirmed across all 24 scripts; no network or LLM calls.

🔴 Critical — File handle leaks (6 scripts)

Multiple analysis scripts open JSON files without a context manager:

# ❌ current pattern (leaks handle until GC)
data = json.load(open(args.input))

# ✅ correct
with open(args.input) as f:
    data = json.load(f)

Affected scripts:

  • clinical-research/scripts/endpoint_selector.py
  • clinical-research/scripts/phase_gate_scorer.py
  • research-finance/scripts/program_budget_planner.py
  • research-finance/scripts/capex_vs_opex_router.py
  • market-research/scripts/market_sizer.py
  • product-research/scripts/insight_synthesizer.py

This is a straightforward fix across all six files.


🔴 Possible regression — .codex/skills-index.json source swaps

Three unrelated skills have their source paths swapped in the index:

| Skill | Before | After |
| --- | --- | --- |
| review | engineering-team/playwright-pro/skills/review | engineering-team/self-improving-agent/skills/review |
| run | engineering/agenthub/skills/run | engineering/autoresearch-agent/skills/run |
| status | engineering/agenthub/skills/status | engineering/autoresearch-agent/skills/status |

These are distinct skills with distinct descriptions — the swap doesn't look intentional. If the sync script regenerated these from a non-deterministic dict or glob, the churn should be reverted. Please confirm whether this was intended refactoring or a sync artifact, and revert if the latter.
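
How the sync script actually builds the index is not shown in this PR, so the following is only a guess at the failure mode: if several skill directories share a basename and the index is keyed by name, whichever path is visited last wins. A sketch of a deterministic alternative, with a hypothetical build_index helper, sorts the paths first and reports collisions instead of silently overwriting:

```python
def build_index(skill_paths):
    """Map skill name -> source path deterministically and report name collisions."""
    index = {}
    collisions = {}
    for path in sorted(skill_paths):  # sorted input gives the same winner every run
        name = path.rsplit("/", 1)[-1]
        if name in index:
            # Keep the first path in sorted order and record every competitor.
            collisions.setdefault(name, [index[name]]).append(path)
            continue
        index[name] = path
    return index, collisions
```

Surfacing the collisions list in the sync output would make churn like the three swaps above visible at generation time rather than in review.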


🟡 High — Input validation gaps

capex_vs_opex_router.py — items with an empty or missing criteria dict are routed without flagging the input as incomplete. At minimum, surface a warning rather than silently proceeding with default behavior.

insight_synthesizer.py — observations missing the participant key silently default to "UNKNOWN" rather than raising an error or warning. A malformed observations list could produce misleading output (e.g., a single participant contributing to multiple "recurrences").
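
For the insight_synthesizer.py case, a hedged sketch of the stricter behaviour: skip and report observations that lack a participant instead of pooling them under "UNKNOWN", and count recurrence over distinct participants. The `cluster` key and the count_recurrence name are assumptions made for illustration; only the `participant` key comes from the finding above.

```python
def count_recurrence(observations):
    """Return ({cluster: number of distinct participants}, list of warnings)."""
    participants_by_cluster = {}
    warnings = []
    for i, obs in enumerate(observations):
        participant = obs.get("participant")
        if not participant:
            # Do not pool these under a shared placeholder: report and skip.
            warnings.append(f"observation {i} has no participant; skipped")
            continue
        cluster = obs.get("cluster", "unclustered")
        participants_by_cluster.setdefault(cluster, set()).add(participant)
    counts = {c: len(p) for c, p in participants_by_cluster.items()}
    return counts, warnings
```

Using a set per cluster also covers the other half of the concern: one participant repeating the same point several times still counts once.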


🟡 Medium — Silent config failure in config_loader.py

_read_json() catches both OSError and JSONDecodeError and returns None silently. If a saved config is unreadable (e.g., permission denied), the tool falls back to defaults with no user-visible warning. This makes it hard to diagnose misconfiguration. A stderr warning on fallback would help.
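
A minimal sketch of the suggested warning, assuming a helper shaped like the `_read_json()` described above (the real signature is not shown in this thread, and the `stream` parameter is added here only to keep the example testable). A missing file stays silent because that is the normal first-run state; an unreadable or malformed one is reported on stderr before falling back.

```python
import json
import sys


def read_json(path, stream=None):
    """Return the parsed config, or None (after a warning) if it cannot be used."""
    stream = sys.stderr if stream is None else stream
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None  # no saved config is normal, so stay quiet
    except (OSError, json.JSONDecodeError) as exc:
        print(f"warning: ignoring config {path}: {exc}; using defaults", file=stream)
        return None
```

FileNotFoundError is a subclass of OSError, which is why it is caught first.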


🟡 Medium — Hardcoded constants without parametrization

  • phase_gate_scorer.py — recruitment nominal capacity (25.0 enroll/site/yr) is hardcoded with no flag to override.
  • endpoint_selector.py — surrogate endpoint penalty weight (0.7) is hardcoded. Both values are domain-specific and could reasonably vary by trial type; exposing them as --profile-settable config keys would be consistent with the customization pattern already in place.

🟡 Medium — Edge cases in numerical tools

  • sample_size_estimator.py — no guard for dropout=1.0 (produces division by zero) or alpha/power combinations that yield inf sample size.
  • market_sizer.py: the tam_divergence calculation has a zero-denominator risk if top_down_tam=0; the bottoms-up path silently returns implied_customers_at_SOM: None when price=0 rather than raising ValueError.
  • program_budget_planner.py — silent 0.0 fallback when periods count mismatches amounts array length; a warning would surface data entry errors.
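
The first two guards could look roughly like this. Both functions are illustrative stand-ins: the tools' real formulas are not quoted in this review, so the dropout inflation (divide by 1 - dropout) and the divergence definition (relative gap against the top-down figure) are assumptions.

```python
import math


def inflate_for_dropout(n_required, dropout):
    """Enrollment target so that n_required remain after the expected dropout."""
    if not 0.0 <= dropout < 1.0:
        # dropout == 1.0 would divide by zero below
        raise ValueError(f"dropout must be in [0, 1), got {dropout}")
    return math.ceil(n_required / (1.0 - dropout))


def tam_divergence(top_down_tam, bottom_up_tam):
    """Relative gap between the two estimates, measured against the top-down one."""
    if top_down_tam <= 0:
        raise ValueError("top_down_tam must be positive to compute divergence")
    return abs(top_down_tam - bottom_up_tam) / top_down_tam
```

Raising ValueError fits the existing try/except in main(), so bad inputs would surface as a one-line error rather than a ZeroDivisionError traceback.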

🔵 Minor — ar_evaluator exit-code ambiguity

On JSON decode error, ar_evaluator.py prints both {METRIC}: N/A and an error line before returning exit code 1. The autoresearch loop treats non-zero exit as DISCARD, which is correct, but the N/A metric line on stdout before the error may confuse parsers that read only the first output line. Consider writing the error to stderr only.
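
One way to keep stdout unambiguous is to print the metric line only on success and send every failure to stderr. In this sketch the evaluator simply reads a precomputed value from the input file, which is a stand-in: the real ar_evaluator.py presumably derives the metric from the skill's scoring tool. The evaluate name and the `out`/`err` parameters are hypothetical.

```python
import json
import sys


def evaluate(path, metric="runway_months", out=None, err=None):
    """Print '<metric>: <float>' on stdout and return 0, or explain on stderr and return 1."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        value = float(data[metric])  # stand-in for running the real scoring tool
    except (OSError, json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        print(f"error: cannot score {path}: {exc}", file=err)  # nothing on stdout
        return 1
    print(f"{metric}: {value}", file=out)
    return 0
```

A loop that reads only the first stdout line then sees either a parseable metric or nothing at all, and the non-zero exit code still triggers DISCARD.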


Summary

| Severity | Count | Action |
| --- | --- | --- |
| 🔴 Critical | 2 | File handle leaks (fix in 6 scripts) + codex index regression (confirm/revert) |
| 🟡 High/Medium | 4 | Input validation (High), silent config failure, hardcoded constants, numerical edge cases |
| 🔵 Minor | 1 | ar_evaluator stderr/stdout split |

The domain design, plugin structure, SKILL.md quality, and hard-rule enforcement are all excellent. Once the file-handle leaks are fixed and the codex index regression is confirmed/reverted, this is ready to merge.

@alirezarezvani alirezarezvani merged commit e7d038c into dev May 27, 2026
5 checks passed
@alirezarezvani alirezarezvani deleted the claude/research-skills-category-qWMjm branch May 27, 2026 05:05