All notable changes to this repository and its accompanying Harvard Dataverse release are documented here.
The format follows Keep a Changelog, and this project adheres to Semantic Versioning for its schema and metadata artefacts (see MAINTENANCE.md § 4).
- Repository scaffolding aligned with NeurIPS Datasets & Benchmarks Track artefact requirements:
LICENSE,LICENSE-DATA.md,CITATION.cff,AUTHORS.md,RESPONSIBILITY.md,MAINTENANCE.md,CHANGELOG.md,CONTRIBUTING.md. - Three-setting benchmark documentation under
benchmark/, including per-setting READMEs (setting_a,setting_b,setting_c), a shared loader/metric contract (utils/), and placeholder result CSVs underbenchmark/results/. - Cross-wave feature-alignment report (
docs/feature_alignment.md, mirrored aspreprocessing/feature_alignment.md). - Per-wave QC thresholds and alias map in
preprocessing/pipeline_decisions.md. - Long-form docs under
docs/:dataset_overview.md,ethics.md,consent_form_en.md,neurips_db_checklist.md. - Top-level
requirements.txtand optionalenvironment.ymlfor reproducibility. - Orphan-code pointers: READMEs in
basemodel-benchmarking/anddomain_adaptation/now cross-reference thebenchmark/protocol docs.
README.mdrewritten to remove[DATASET_NAME]/[TODO]placeholders and to reflect the three-setting benchmark, full model inventory, reproducibility policy, and repository layout.DATASHEET.mdexpanded with per-wave QC numbers, ABC Logger reference, 8-stage pipeline summary, and explicit[TBD on acceptance]markers for fields held for peer review.data/README.mdanddata/schema.mdreconciled: pkl files are now consistently documented as 5-element tuples (features, y, groups, t, datetimes), label filenames are lowercase (valence.pkl,mental.pkl,valenceChange.pkl, …), and feature-column counts are corrected (8,037 / 10,122 / 10,581total;7,961 / 10,102 / 10,550afterPIF#removal).EDA/utils.pyandEDA/README.mdupdated to reflect the releasedD1 / D2 / D3folder naming on disk while keeping internal dict keys asD-1 / D-2 / D-3for notebook compatibility..gitignorepatched to whitelist documentation assets (images/*.png,metadata/checksums.md5,benchmark/results/*.csv|*.json).- License, data-access, and citation metadata now keep code license, data license, DUA terms, authors, and final dataset/paper names explicitly marked as TBD until public release.
- README visuals expanded with preserved study-design, coverage, temporal-label, and embedding figures under
images/. - EDA notebook directory casing normalised from
eda/toEDA/to match documentation links.
- The placeholder
[DATASET_NAME]strings across older drafts are no longer used; the working name isCrossUserDatasetpending final naming on acceptance.
- Stale
.gitkeepplaceholders underbenchmark/setting_a|b|c|utils/, now replaced by real README files.
- Contradictions between
data/README.md(5-tuple) anddata/schema.md(DataFrame). The tuple description matchesEDA.utils.load_and_attachin the code and is the canonical format. - Inconsistent label filenames across README and data folder.
- Feature-column counts in the schema (previously "~10,000 for D1; ~20,000 for D2/D3"; now corrected to
8,037 / 10,122 / 10,581total).
- No data-level changes. Privacy transformations (MD5-hashed contacts, UUID-replaced MAC addresses, GPS longitude displacement) remain as previously documented.
Initial public release of the dataset on Harvard Dataverse, accompanying the NeurIPS Datasets and Benchmarks Track paper.
- D1 archive (Feb–Apr 2020, 92 participants post-QC, 10,259 ESM responses).
- D2 archive (Dec 2020–Jan 2021, 99 participants post-QC, 21,042 ESM responses).
- D3 archive (Nov 2021–Jan 2022, 106 participants post-QC, 21,838 ESM responses).
- Per-wave pre-extracted feature matrices (
{label}.pkl),EsmResponse.csv,UserInfo.csv. - Three-setting benchmark definitions (Setting A / B / C) with fixed splits, eligibility filters, and result templates.
- Full Datasheet for Datasets and Croissant metadata.