Thank you for your interest in contributing to the CrossUserDataset repository. This guide describes how to propose changes, what kinds of contributions are welcome, and what review standards apply.
- Documentation improvements — clarifications, typo fixes, additional examples, broken-link repairs.
- Benchmark additions — new baseline / DG / DA methods, new setting variants, new metrics. These must follow the loader / metric contract in `benchmark/utils/README.md` and must not require re-releasing the dataset itself.
- Analysis notebooks — additional EDA, domain-specific analyses, reproducibility sanity checks. Add under `EDA/` with a short README entry.
- Bug fixes — in preprocessing documentation, benchmark code, or EDA utilities.
- Metadata updates — Croissant schema tweaks, CITATION.cff updates, checksum regeneration.
- Dataset-level changes — new labels, new waves, new participant recruitment, label corrections, new anonymisation steps. These require IRB review and a new Dataverse revision.
- Licensing changes — code and data licenses are TBD until public release. Any license-related change requires author discussion.
- Schema-breaking changes — renaming columns, changing pkl tuple structure, dropping labels.
Email [TBD contact email on acceptance] before submitting a PR for a dataset-level, licensing, or schema-breaking change.
- Open an issue first describing the change, especially for anything non-trivial. This avoids duplicate work and lets maintainers weigh in on direction.
- Fork the repository and create a topic branch from `main`.
- Make the change in small, reviewable commits. Follow the style conventions below.
- Open a pull request referencing the issue. Fill in the PR template with:
- Summary of the change.
- Motivation / related issue.
- Test plan (what you ran, what you verified).
- Impact (does it affect schema? benchmark results? reproducibility?).
- Respond to review. We aim to respond within 10 business days; see `MAINTENANCE.md`.
- Markdown for all prose docs.
- File paths as relative links wherever possible (e.g., [utils.py](./EDA/utils.py)).
- Wave IDs: `D1 / D2 / D3` on disk and in public docs; `D-1 / D-2 / D-3` is acceptable as a dict-key convention inside EDA code.
- Label filenames: lowercase (`valence.pkl`, `stress.pkl`, `valenceChange.pkl`, `mental.pkl`). See `data/README.md`.
- Python 3.10 recommended; code must remain compatible with 3.9–3.11.
- Follow PEP 8; run `ruff check` (if added) before committing.
- Type hints for new function signatures.
- No breaking changes to the `EDA.utils` public API (`load_esm`, `load_userinfo`, `load_df_X_combined`, `load_and_attach`, etc.) without a matching version bump.
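The two wave-ID spellings mentioned in the conventions above (`D1` on disk and in public docs, `D-1` as a dict key inside EDA code) can be bridged with a small helper. The following is a minimal sketch only: the function names `to_disk_wave_id` and `to_dict_key_wave_id` are hypothetical and are not part of `EDA.utils`, and the sketch assumes the only valid waves are 1 to 3.

```python
import re

# Hypothetical helpers (not part of EDA.utils). Assumes waves 1-3 only.
_WAVE_RE = re.compile(r"D-?([1-3])")


def to_disk_wave_id(wave_id: str) -> str:
    """Return the on-disk / public-docs spelling, e.g. 'D-2' -> 'D2'."""
    match = _WAVE_RE.fullmatch(wave_id)
    if match is None:
        raise ValueError(f"unrecognised wave ID: {wave_id!r}")
    return f"D{match.group(1)}"


def to_dict_key_wave_id(wave_id: str) -> str:
    """Return the dict-key spelling used inside EDA code, e.g. 'D3' -> 'D-3'."""
    return f"D-{to_disk_wave_id(wave_id)[1:]}"
```

Normalising at the boundary (for example, when building file paths or writing docs) keeps the hyphenated form confined to EDA internals.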
- Use clear, short commit subjects (≤ 72 characters). The body should explain the why, not the what.
- One logical change per commit. Squash trivial fixups before merging.
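The subject-length rule above is easy to check mechanically before pushing. The sketch below is illustrative only: `check_commit_subject` is a hypothetical helper, not an existing hook in this repository, and it measures length in Unicode code points.

```python
MAX_SUBJECT_LENGTH = 72  # limit stated in this guide


def check_commit_subject(message: str) -> list[str]:
    """Return a list of problems found in the commit subject (empty if none)."""
    problems = []
    lines = message.splitlines()
    # The subject is the first line of the commit message.
    subject = lines[0].strip() if lines else ""
    if not subject:
        problems.append("subject line is empty")
    elif len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(
            f"subject is {len(subject)} characters; limit is {MAX_SUBJECT_LENGTH}"
        )
    return problems
```

For example, `check_commit_subject("Fix typo in data README")` returns an empty list, while a 73-character subject yields one problem string.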
To be merged, a new benchmark method must:
- Conform to the loader contract (5-tuple pickle input; see `benchmark/utils/README.md`).
- Use the unified training loop (≤ 50 epochs, patience-based early stopping, fixed seed, 30 Optuna trials on validation AUROC).
- Emit output rows in the canonical schema: `tier, wave_or_source, target, task, model, family, n_train, n_val, n_test, acc, macro_f1, precision, recall, auroc, auroc_std, n_features_after_alignment`.
- Include upstream attribution — original paper citation, upstream implementation link, license compatibility note.
- Reproduce on D1 / D2 / D3 for at least one task in one setting. Summary AUROC should be committed to the matching CSV in `benchmark/results/`.
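Before committing a results CSV, it can help to confirm that its header matches the canonical schema listed above. The following is a minimal sketch under stated assumptions: `check_results_header` is a hypothetical helper (not part of the benchmark code), it assumes the CSV's first row is the header, and it does not check column order because this guide does not say whether order matters.

```python
import csv
import io

# Column names copied from the canonical schema in this guide.
CANONICAL_COLUMNS = [
    "tier", "wave_or_source", "target", "task", "model", "family",
    "n_train", "n_val", "n_test", "acc", "macro_f1", "precision",
    "recall", "auroc", "auroc_std", "n_features_after_alignment",
]


def check_results_header(csv_text: str) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) column names for the CSV's header row."""
    # An empty file yields an empty header, so every column counts as missing.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    header = [name.strip() for name in header]
    missing = [c for c in CANONICAL_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in CANONICAL_COLUMNS]
    return missing, unexpected
```

A header that matches the schema returns two empty lists; anything else names the offending columns so the row writer can be fixed before review.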
Do not open a public issue for privacy or security concerns. Email [TBD contact email on acceptance] directly. See RESPONSIBILITY.md for the incident-response process.
Contributors are expected to maintain a respectful, inclusive environment. Personal attacks, harassment, and discriminatory language are not tolerated. Project maintainers reserve the right to moderate or close contributions that violate this standard.