A complete machine-learning pipeline for indoor air quality analysis and edge inference, built on the Dalton multi-site indoor AQI dataset.
The project trains a lightweight binary classifier that predicts whether PM2.5 will exceed 60 µg/m³ in the next 10 minutes, then exports the model as a float32 TFLite file ready to run on an ESP32-S3 microcontroller.
- Background
- Dataset
- Repository Structure
- ML Pipeline Overview
- Model Architecture
- Scripts & Notebook Reference
- Key Configuration Parameters
- Output Artifacts
- Setup
- Running the Project
- Edge Deployment (ESP32-S3)
- Contributing
- License
Indoor air quality (IAQ) is strongly linked to occupant health. PM2.5 — fine particulate matter ≤ 2.5 µm — is one of the most harmful pollutants in indoor environments.
This project:
- Loads continuous multi-sensor readings from dozens of real indoor sites.
- Detects sudden PM2.5 spikes and sustained degradation events.
- Trains a tiny neural network that predicts impending high-PM2.5 episodes.
- Exports the trained model to TFLite for low-power edge inference.
The project uses the Dalton indoor AQI dataset, organized by site type:
| Site prefix | Description |
|---|---|
| `A*` | Academic/study desks |
| `C*` | Classroom teacher desks |
| `F*` | Food-prep kitchens |
| `H*` | Residential homes (multiple rooms each) |
| `R*` | General room deployments |
Each CSV file contains ~1 Hz sensor readings with these raw columns:
| Raw column | Friendly name | Unit |
|---|---|---|
| `ts` | `timestamp` | ISO datetime |
| `T` | `temperature` | °C |
| `H` | `humidity` | % RH |
| `PMS2_5` | `pm25` | µg/m³ |
| `PMS10` | `pm10` | µg/m³ |
| `CO2` | `co2` | ppm |
| `VoC` | `voc` | ppb |
The dataset folder (`dalton-dataset-files/`) is excluded from version control via `.gitignore`.
Place it locally at `d:/Projects/AQI/dalton-dataset-files/Data/` or update `DATA_ROOT` in each script/notebook.
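The raw-to-friendly column mapping in the table above can be applied while reading a file. The following is a minimal stdlib-only sketch, not necessarily the loader the notebook uses; `read_site_csv` and the two-row sample are hypothetical and exist only for illustration.

```python
import csv
import io

# Raw-to-friendly column mapping, taken from the table above.
COLUMN_MAP = {
    "ts": "timestamp",
    "T": "temperature",
    "H": "humidity",
    "PMS2_5": "pm25",
    "PMS10": "pm10",
    "CO2": "co2",
    "VoC": "voc",
}
NUMERIC_COLS = [c for c in COLUMN_MAP.values() if c != "timestamp"]


def read_site_csv(handle):
    """Yield one dict per row with friendly column names and float values."""
    for raw in csv.DictReader(handle):
        row = {COLUMN_MAP[k]: v for k, v in raw.items() if k in COLUMN_MAP}
        for col in NUMERIC_COLS:
            value = row.get(col, "")
            # Missing or empty cells become None instead of raising.
            row[col] = float(value) if value not in ("", None) else None
        yield row


# Hypothetical two-row sample; real files live under DATA_ROOT.
sample = io.StringIO(
    "ts,T,H,PMS2_5,PMS10,CO2,VoC\n"
    "2023-01-01T00:00:00,24.5,41.0,12.0,18.0,520,35\n"
    "2023-01-01T00:00:01,24.5,41.1,13.5,19.0,521,\n"
)
rows = list(read_site_csv(sample))
print(rows[0]["pm25"], rows[1]["voc"])
```

Returning `None` for an empty cell keeps optional columns such as `voc` from aborting the load when a site has no VOC sensor.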
.
├── aqi_pm25_predictor.ipynb # End-to-end training + TFLite export (main notebook)
├── analyze_pm_spikes.py # Spike & degradation-event detection + 4 plot types
├── visualize_trends.py # PM2.5 / temperature / humidity trend visualizations
├── compare_models.py # Numerical parity check: Keras vs TFLite outputs
├── tflite_model_test.py # Latency + I/O sanity test for the TFLite model
├── requirements.txt # Python dependencies
├── LICENSE
└── .github/
    └── workflows/
        └── python-ci.yml # CI: syntax check on every push / PR
Generated at runtime (excluded from Git):
figures/ # PNG plots produced by visualization scripts
model_output/ # Trained model files, scaler_params.json
CSV files (all sites)
│
▼
Load & merge → downsample to ≤ 3M rows → parse timestamps → float32 cast
│
▼
Feature selection: [temperature, humidity, pm25] + optional [voc]
│
▼
Label creation: future_pm25 (t + 600 rows) > 60 µg/m³ → binary label
│
▼
Sliding window (size = 20) → flatten → input vector (dim = 20 × F)
│
▼
StandardScaler (fit on train split only) → z-score normalization
│
▼
Dense neural network → binary_crossentropy + EarlyStopping
│
▼
TFLite float32 export → scaler_params.json (for firmware)
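The label-creation and sliding-window stages of the pipeline above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's code: `make_labels` and `make_windows_sketch` are hypothetical names, each window is assumed to take its label from its last row, and file boundaries between sites are ignored. The toy run uses a shortened horizon and window so the result is easy to check by hand.

```python
def make_labels(pm25, forecast_steps=600, threshold=60.0):
    """Label row t with 1 if pm25[t + forecast_steps] > threshold, else 0.

    The last `forecast_steps` rows have no future value and get None.
    """
    n = len(pm25)
    return [
        int(pm25[t + forecast_steps] > threshold) if t + forecast_steps < n else None
        for t in range(n)
    ]


def make_windows_sketch(features, labels, window_size=20):
    """Flatten each run of `window_size` consecutive rows into one vector.

    The label is taken from the last row of the window (an assumption).
    """
    X, y = [], []
    for end in range(window_size - 1, len(features)):
        label = labels[end]
        if label is None:
            continue  # no future reading available for this window
        window = features[end - window_size + 1 : end + 1]
        X.append([value for row in window for value in row])
        y.append(label)
    return X, y


# Toy example: horizon of 2 rows and window of 2 rows instead of 600 and 20.
pm25 = [10.0, 20.0, 70.0, 80.0, 30.0, 65.0]
features = [[25.0, 40.0, p] for p in pm25]  # temperature, humidity, pm25
labels = make_labels(pm25, forecast_steps=2)
X, y = make_windows_sketch(features, labels, window_size=2)
print(labels)        # [1, 1, 0, 1, None, None]
print(len(X[0]), y)  # 6 [1, 0, 1]
```

With the defaults (window of 20, three features) each input vector has 20 × 3 = 60 entries, matching the input dimension in the diagram.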
A minimal Dense-only network designed for microcontroller deployment:
Input (20 × F) — flattened sliding window
│
Dense(32, ReLU)
│
Dense(16, ReLU)
│
Dense(1, Sigmoid) — P(PM2.5 will exceed threshold in 10 min)
- `F` = number of active features (3 if VOC is missing, 4 if present)
- Input dim = 60 or 80, depending on VOC availability
- Total parameters: 2,497 for F = 3 or 3,137 for F = 4 (float32 TFLite ≈ 12–15 KB)
- Training uses class-weighted binary cross-entropy and Adam (lr = 1e-3)
- EarlyStopping on `val_loss` with patience = 5, `restore_best_weights = True`
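The parameter count follows directly from the layer sizes: each Dense layer holds one weight per input-output pair plus one bias per output unit. A short sketch of that arithmetic, where `dense_param_count` is a hypothetical helper:

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


WINDOW_SIZE = 20
counts = {}
for n_features in (3, 4):
    sizes = [WINDOW_SIZE * n_features, 32, 16, 1]
    counts[n_features] = dense_param_count(sizes)
    # 4 bytes per float32 parameter, before any TFLite file overhead.
    print(n_features, counts[n_features], counts[n_features] * 4)
# 3 2497 9988
# 4 3137 12548
```

The byte figures cover raw weight storage only; the exported file also carries graph metadata, so it is somewhat larger.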
`aqi_pm25_predictor.ipynb` is the primary end-to-end workflow. Its cells, in order:
| # | Section | What it does |
|---|---|---|
| 1 | Install | pip-installs all dependencies |
| 2 | Imports | Libraries, versions, GPU check |
| 3 | GPU config | Enables memory growth, sets DEVICE |
| 4 | Load data | Reads all CSVs, renames columns, caps rows, casts to float32 |
| 5 | Feature selection | Drops low-coverage optional columns |
| 6 | Label creation | Creates binary label via 600-step forward shift of pm25 |
| 7 | Sliding window | Builds flat input vectors with make_windows() |
| 8 | Normalize | Fits StandardScaler on train split, saves scaler_params.json |
| 9 | Build model | Defines Keras Sequential model |
| 10 | Train | Fits with class weights, EarlyStopping, learning-curve plots |
| 11 | Evaluate | Accuracy, F1, confusion matrix, probability distribution |
| 12 | TFLite export | Converts model to float32 .tflite |
| 13 | Deployment summary | File size, sanity-check Keras vs TFLite delta |
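The normalization step in the table above fits the scaler on the training split only, so validation data never influences the statistics. The sketch below mimics that with plain Python instead of scikit-learn; `fit_scaler`, `transform`, and the 75/25 split are assumptions made for illustration. It uses the population standard deviation and replaces a zero deviation with 1.0, which is how `StandardScaler` behaves.

```python
import json
import math


def fit_scaler(train_rows):
    """Per-column mean and population standard deviation (ddof = 0)."""
    n = len(train_rows)
    columns = list(zip(*train_rows))
    means = [sum(col) / n for col in columns]
    scales = []
    for mean, col in zip(means, columns):
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        scales.append(std if std > 0 else 1.0)  # avoid division by zero
    return means, scales


def transform(rows, means, scales):
    """Apply z-score normalization column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, scales)] for row in rows]


# Toy input vectors; in the notebook these would be flattened windows.
X = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0], [7.0, 10.0]]
split = int(len(X) * 0.75)  # hypothetical 75/25 split
X_train, X_val = X[:split], X[split:]

means, scales = fit_scaler(X_train)  # statistics come from train rows only
X_val_scaled = transform(X_val, means, scales)

# The same two lists are what a firmware-side JSON file would need.
params_json = json.dumps({"mean": means, "scale": scales})
print(means, X_val_scaled)
```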
`analyze_pm_spikes.py` detects and visualizes two event types across the full dataset:
| Method | Description |
|---|---|
| Spike | PM2.5 rose ≥ 15 µg/m³ over the preceding 10 samples (rate-of-change) |
| Degradation event | Sustained period ≥ 60 s above 60 µg/m³; events within 120 s are merged |
Produces 4 figures in `figures/spikes/`:
- `pm25_spikes_overview.png`: full time series with spike markers and event shading
- `pm25_roc_signal.png`: rate-of-change signal below the PM2.5 trace
- `event_statistics.png`: histograms of event duration and peak PM2.5
- `events_per_site.png`: bar chart of event count per site
Prints a per-site summary table to stdout.
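The two detection rules above can be sketched as follows. This is an illustrative reading of the table, not the script's actual code: `find_spikes` and `find_events` are hypothetical names, timestamps are assumed to be in seconds, and merging is assumed to happen before the minimum-duration filter (the script may apply them in the other order).

```python
def find_spikes(pm25, delta=15.0, window=10):
    """Indices where PM2.5 rose by at least `delta` over `window` samples."""
    return [
        i for i in range(window, len(pm25))
        if pm25[i] - pm25[i - window] >= delta
    ]


def find_events(times, pm25, threshold=60.0, min_duration=60.0, merge_gap=120.0):
    """Return (start, end) times of sustained periods above `threshold`."""
    # 1. Collect raw runs of consecutive samples above the threshold.
    runs, start, prev = [], None, None
    for t, value in zip(times, pm25):
        if value > threshold:
            if start is None:
                start = t
            prev = t
        elif start is not None:
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))

    # 2. Merge runs separated by less than `merge_gap` seconds.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)

    # 3. Keep only events lasting at least `min_duration` seconds.
    return [(s, e) for s, e in merged if e - s >= min_duration]


# Toy 1 Hz trace: two bursts above 60 µg/m³ separated by a 51 s gap.
times = list(range(400))
pm25 = [20.0] * 100 + [80.0] * 50 + [20.0] * 50 + [90.0] * 100 + [20.0] * 100
spikes = find_spikes(pm25)
events = find_events(times, pm25)
print(spikes[0], len(spikes))  # 100 20
print(events)                  # [(100, 299)]
```

Because the gap between the two bursts is shorter than 120 s, they are reported as a single degradation event.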
`visualize_trends.py` generates three individual trend figures plus one combined overview:
| Figure | Content |
|---|---|
| `pm25_trend.png` | Raw PM2.5 + 1-min and 10-min moving averages + 60 µg/m³ threshold |
| `temperature_trend.png` | Temperature with short and long moving averages |
| `humidity_trend.png` | Relative humidity with moving averages |
| `combined_trends.png` | All three metrics in vertically stacked subplots |
All figures are saved to figures/ at 150 DPI.
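The moving averages in these plots smooth the raw signal over a fixed number of samples; at roughly 1 Hz, a 1-minute window is about 60 samples and a 10-minute window about 600. A minimal sketch of a trailing moving average follows. Whether the script uses a trailing or a centered window is not stated, so the trailing form here is an assumption, and `moving_average` is a hypothetical name.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early samples average over what is available."""
    buf, total, out = deque(), 0.0, []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the sample that left the window
        out.append(total / len(buf))
    return out


pm25 = [10.0, 20.0, 30.0, 40.0, 50.0]
smoothed = moving_average(pm25, window=2)
print(smoothed)  # [10.0, 15.0, 25.0, 35.0, 45.0]
```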
`compare_models.py` loads `best_model.h5` and `model_float32.tflite` from `model_output/`, runs the same random input through both, and prints the numerical difference to verify conversion parity.
`tflite_model_test.py` loads `model_float32.tflite`, runs a dummy input, and reports:
- Input/output tensor shapes and dtypes
- Model prediction
- Inference latency in milliseconds
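A latency figure like the one reported above is usually an average over many timed calls after a few untimed warm-up calls. The sketch below shows that pattern with a stand-in workload so it runs without TensorFlow; `measure_latency_ms` and `fake_inference` are hypothetical, and in a real test the timed call would be the TFLite interpreter's `invoke()`.

```python
import time


def measure_latency_ms(run_once, warmup=5, runs=50):
    """Average wall-clock time of `run_once()` in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / runs


def fake_inference():
    """Stand-in workload; replace with the real interpreter call."""
    return sum(i * 0.5 for i in range(1000))


latency = measure_latency_ms(fake_inference)
print(f"mean latency: {latency:.4f} ms")
```

Numbers measured on a desktop CPU this way say little about microcontroller timing, so on-device measurement is still needed.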
| Parameter | Location | Default | Meaning |
|---|---|---|---|
| `PM25_THRESHOLD` | notebook / scripts | 60 µg/m³ | WHO "Unhealthy for Sensitive Groups" boundary |
| `FORECAST_STEPS` | notebook | 600 rows | Prediction horizon (~10 min at 1 Hz) |
| `WINDOW_SIZE` | notebook | 20 rows | Sliding-window history (~20 seconds) |
| `MAX_ROWS_GLOBAL` | notebook (cell 8) | 3_000_000 | Cap on combined rows before training |
| `MAX_ROWS_PER_FILE` | notebook (cell 8) | 100_000 | Per-CSV row cap to protect peak RAM |
| `MAX_ROWS` | notebook (cell 16) | 1_000_000 | Stratified subsample before windowing |
| `SPIKE_DELTA` | analyze_pm_spikes | 15 µg/m³ | Minimum PM2.5 rise to count as a spike |
| `SPIKE_WINDOW` | analyze_pm_spikes | 10 rows | Look-back window for rate-of-change |
| `MIN_EVENT_DURATION` | analyze_pm_spikes | 60 s | Minimum duration for a degradation event |
| `MERGE_GAP` | analyze_pm_spikes | 120 s | Gap below which two events are merged |
After a full notebook run, model_output/ contains:
| File | Description |
|---|---|
| `model_float32.tflite` | TFLite model, float32 weights, ready for ESP32-S3 |
| `scaler_params.json` | z-score normalization params (mean + scale per feature) for firmware |
scaler_params.json schema:
{
"feature_cols": ["temperature", "humidity", "pm25"],
"window_size": 20,
"input_dim": 60,
"mean": [...],
"scale": [...],
"pm25_threshold": 60,
"forecast_steps": 600
}

Requirements: Python 3.9–3.11
# 1. Clone the repo
git clone https://github.com/<your-username>/aqi.git
cd aqi
# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt

GPU (optional): Install `tensorflow-directml-plugin` on Windows or `tensorflow[and-cuda]` on Linux for GPU-accelerated training.
Visualize sensor trends:
python visualize_trends.py

Analyze PM2.5 spikes and degradation events:
python analyze_pm_spikes.py

Train the model and export TFLite (open and run all cells in the notebook):
aqi_pm25_predictor.ipynb

Verify TFLite model after training:
python tflite_model_test.py

Compare Keras vs TFLite outputs:
python compare_models.py

- Copy `model_output/model_float32.tflite` to your firmware project.
- Load `scaler_params.json` → apply z-score normalization to each incoming sensor reading before inference: `normalized = (raw_value - mean[i]) / scale[i]`
- Assemble a rolling window of 20 normalized readings per feature into a flat float32 array of length `input_dim`.
- Run the TFLite interpreter; output sigmoid probability > 0.5 → predict high PM2.5 in 10 min.
The model is ~12–15 KB and runs a single inference in < 1 ms on the ESP32-S3 CPU.
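The preprocessing steps above can be prototyped in Python before porting them to firmware. The sketch below is a reference under assumptions, not the project's firmware logic: `WindowPreprocessor` is a hypothetical class, readings are assumed to be flattened oldest first in `feature_cols` order, and `mean`/`scale` are accepted with either one entry per feature or one per flattened input element, since the schema does not state which.

```python
from collections import deque


class WindowPreprocessor:
    """Keeps the last `window_size` readings and emits a normalized flat vector."""

    def __init__(self, params):
        self.n_features = len(params["feature_cols"])
        self.window_size = params["window_size"]
        self.input_dim = params["input_dim"]
        self.mean = self._expand(params["mean"])
        self.scale = self._expand(params["scale"])
        self.buffer = deque(maxlen=self.window_size)

    def _expand(self, values):
        # Accept either one value per feature or one per flattened input element.
        if len(values) == self.n_features:
            return list(values) * self.window_size
        if len(values) == self.input_dim:
            return list(values)
        raise ValueError("unexpected length for mean/scale")

    def push(self, reading):
        """Add one reading (feature order = feature_cols); return vector or None."""
        self.buffer.append(list(reading))
        if len(self.buffer) < self.window_size:
            return None  # not enough history yet
        flat = [v for row in self.buffer for v in row]  # oldest reading first
        return [(v - m) / s for v, m, s in zip(flat, self.mean, self.scale)]


# Toy parameters with a 2-row window; the real file uses window_size = 20.
params = {
    "feature_cols": ["temperature", "humidity", "pm25"],
    "window_size": 2,
    "input_dim": 6,
    "mean": [20.0, 50.0, 10.0],
    "scale": [2.0, 10.0, 5.0],
}
pre = WindowPreprocessor(params)
first = pre.push([22.0, 60.0, 15.0])
second = pre.push([24.0, 40.0, 20.0])
print(first)   # None
print(second)  # [1.0, 1.0, 1.0, 2.0, -1.0, 2.0]
```

The returned vector is what would be copied into the interpreter's input tensor; a probability above 0.5 on the output then signals a predicted high-PM2.5 episode.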
- Fork the repository and create a feature branch.
- Follow existing code style (PEP 8, module-level docstrings, type hints where practical).
- Test your changes against at least one site's CSV before submitting a PR.
- Open a pull request with a clear description of changes and motivation.
This project is licensed under the MIT License.