Tharun007-TK/airguard-edgeML

Indoor AQI Monitoring & PM2.5 Prediction

A complete machine-learning pipeline for indoor air quality analysis and edge inference, built on the Dalton multi-site indoor AQI dataset.
The project trains a lightweight binary classifier that predicts whether PM2.5 will exceed 60 µg/m³ in the next 10 minutes — and exports the model as a float32 TFLite file ready to run on an ESP32-S3 microcontroller.


Table of Contents

  1. Background
  2. Dataset
  3. Repository Structure
  4. ML Pipeline Overview
  5. Model Architecture
  6. Scripts & Notebook Reference
  7. Key Configuration Parameters
  8. Output Artifacts
  9. Setup
  10. Running the Project
  11. Edge Deployment (ESP32-S3)
  12. Contributing
  13. License

Background

Indoor air quality (IAQ) is strongly linked to occupant health. PM2.5 — fine particulate matter ≤ 2.5 µm — is one of the most harmful pollutants in indoor environments.
This project:

  • Loads continuous multi-sensor readings from dozens of real indoor sites.
  • Detects sudden PM2.5 spikes and sustained degradation events.
  • Trains a tiny neural network that predicts impending high-PM2.5 episodes.
  • Exports the trained model to TFLite for low-power edge inference.

Dataset

The project uses the Dalton indoor AQI dataset, organized by site type:

| Site prefix | Description |
| --- | --- |
| A* | Academic/study desks |
| C* | Classroom teacher desks |
| F* | Food-prep kitchens |
| H* | Residential homes (multiple rooms each) |
| R* | General room deployments |

Each CSV file contains ~1 Hz sensor readings with these raw columns:

| Raw column | Friendly name | Unit |
| --- | --- | --- |
| ts | timestamp | ISO datetime |
| T | temperature | °C |
| H | humidity | % RH |
| PMS2_5 | pm25 | µg/m³ |
| PMS10 | pm10 | µg/m³ |
| CO2 | co2 | ppm |
| VoC | voc | ppb |

The dataset folder (dalton-dataset-files/) is excluded from version control via .gitignore.
Place it locally at d:/Projects/AQI/dalton-dataset-files/Data/ or update DATA_ROOT in each script/notebook.
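
The raw-to-friendly column mapping in the table above can be sketched without third-party dependencies. This is a minimal illustration, not the notebook's loader: `COLUMN_MAP` mirrors the table, while `load_rows`, `NUMERIC`, and the inline sample CSV are hypothetical names and data introduced here.

```python
import csv
import io

# Raw column -> friendly name, copied from the table above.
COLUMN_MAP = {
    "ts": "timestamp",
    "T": "temperature",
    "H": "humidity",
    "PMS2_5": "pm25",
    "PMS10": "pm10",
    "CO2": "co2",
    "VoC": "voc",
}
NUMERIC = {"temperature", "humidity", "pm25", "pm10", "co2", "voc"}

def load_rows(handle):
    """Read CSV rows, rename known columns, and cast sensor values to float."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for key, value in raw.items():
            name = COLUMN_MAP.get(key, key)
            if name in NUMERIC:
                # Empty or missing cells become None instead of raising.
                row[name] = float(value) if value not in ("", None) else None
            else:
                row[name] = value
        rows.append(row)
    return rows

# Hypothetical two-row sample; real files live under DATA_ROOT.
SAMPLE = (
    "ts,T,H,PMS2_5\n"
    "2023-01-01T00:00:00,24.5,51.0,18.0\n"
    "2023-01-01T00:00:01,24.5,,19.0\n"
)
rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["pm25"], rows[1]["humidity"])
```

The same function accepts an open file handle, so pointing it at one CSV under DATA_ROOT is enough to check that the expected columns are present.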


Repository Structure

.
├── aqi_pm25_predictor.ipynb   # End-to-end training + TFLite export (main notebook)
├── analyze_pm_spikes.py       # Spike & degradation-event detection + 4 plot types
├── visualize_trends.py        # PM2.5 / temperature / humidity trend visualizations
├── compare_models.py          # Numerical parity check: Keras vs TFLite outputs
├── tflite_model_test.py       # Latency + I/O sanity test for the TFLite model
├── requirements.txt           # Python dependencies
├── LICENSE
└── .github/
    └── workflows/
        └── python-ci.yml      # CI: syntax check on every push / PR

Generated at runtime (excluded from Git):

figures/           # PNG plots produced by visualization scripts
model_output/      # Trained model files, scaler_params.json

ML Pipeline Overview

CSV files (all sites)
        │
        ▼
 Load & merge → downsample to ≤ 3M rows → parse timestamps → float32 cast
        │
        ▼
 Feature selection: [temperature, humidity, pm25] + optional [voc]
        │
        ▼
 Label creation: future_pm25 (t + 600 rows) > 60 µg/m³ → binary label
        │
        ▼
 Sliding window (size = 20) → flatten → input vector (dim = 20 × F)
        │
        ▼
 StandardScaler (fit on train split only) → z-score normalization
        │
        ▼
 Dense neural network → binary_crossentropy + EarlyStopping
        │
        ▼
 TFLite float32 export → scaler_params.json (for firmware)
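
The label-creation and sliding-window steps in the diagram can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's `make_windows()`: the function name `build_samples` is hypothetical, the label is assumed to be aligned with the last row of each window, and the window is assumed to be flattened in time order (all features of row 1, then row 2, and so on).

```python
PM25_THRESHOLD = 60.0   # µg/m³, from the configuration table
FORECAST_STEPS = 600    # rows ahead (~10 min at 1 Hz)
WINDOW_SIZE = 20        # rows of history per sample

def build_samples(features, pm25, window=WINDOW_SIZE,
                  horizon=FORECAST_STEPS, threshold=PM25_THRESHOLD):
    """Return (X, y): flattened windows and binary labels.

    features: list of per-row feature lists, e.g. [temperature, humidity, pm25]
    pm25:     list of PM2.5 readings aligned row-for-row with `features`
    """
    X, y = [], []
    # The window ends at row t; the label looks `horizon` rows past t.
    for t in range(window - 1, len(features) - horizon):
        rows = features[t - window + 1 : t + 1]
        X.append([value for row in rows for value in row])  # length = window * F
        y.append(1 if pm25[t + horizon] > threshold else 0)
    return X, y

# Tiny hypothetical series with a short window and horizon for readability.
pm25 = [10.0, 20.0, 30.0, 70.0, 80.0, 40.0, 90.0]
features = [[20.0, 50.0, p] for p in pm25]
X, y = build_samples(features, pm25, window=3, horizon=2)
print(len(X), len(X[0]), y)
```

Rows closer than `horizon` to the end of the series have no future value and are dropped, which is why the loop stops early.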

Model Architecture

A minimal Dense-only network designed for microcontroller deployment:

Input  (20 × F)    — flattened sliding window
  │
Dense(32, ReLU)
  │
Dense(16, ReLU)
  │
Dense(1,  Sigmoid) — P(PM2.5 will exceed threshold in 10 min)
  • F = number of active features (3 if VOC missing, 4 if present)
  • Input dim = 60 or 80 depending on VOC availability
  • Total parameters: 2,497 for F = 3 (input dim 60) or 3,137 for F = 4 (input dim 80); the float32 TFLite file is ≈ 12–15 KB
  • Training uses class-weighted binary cross-entropy and Adam (lr = 1e-3)
  • EarlyStopping on val_loss with patience = 5, restore_best_weights = True
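
The layer sizes above determine the parameter count directly: each Dense layer contributes `inputs × units` weights plus `units` biases. The helper below is a small arithmetic sketch; `dense_param_count` is a name introduced here, not a function from the repository.

```python
def dense_param_count(input_dim, hidden=(32, 16), output=1):
    """Count weights and biases in a stack of fully connected layers."""
    total, fan_in = 0, input_dim
    for units in (*hidden, output):
        total += fan_in * units + units  # weight matrix + bias vector
        fan_in = units
    return total

for n_features in (3, 4):
    input_dim = 20 * n_features
    params = dense_param_count(input_dim)
    kib = params * 4 / 1024  # 4 bytes per float32 parameter
    print(f"F={n_features}: input_dim={input_dim}, params={params}, weights ~ {kib:.2f} KiB")
```

Under these assumptions the raw float32 weights take roughly 10 to 12 KiB; the remainder of the quoted file size would be model-format overhead.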

Scripts & Notebook Reference

aqi_pm25_predictor.ipynb

The primary end-to-end workflow. Cells in order:

| # | Section | What it does |
| --- | --- | --- |
| 1 | Install | pip-installs all dependencies |
| 2 | Imports | Libraries, versions, GPU check |
| 3 | GPU config | Enables memory growth, sets DEVICE |
| 4 | Load data | Reads all CSVs, renames columns, caps rows, casts to float32 |
| 5 | Feature selection | Drops low-coverage optional columns |
| 6 | Label creation | Creates binary label via 600-step forward shift of pm25 |
| 7 | Sliding window | Builds flat input vectors with make_windows() |
| 8 | Normalize | Fits StandardScaler on train split, saves scaler_params.json |
| 9 | Build model | Defines Keras Sequential model |
| 10 | Train | Fits with class weights, EarlyStopping, learning-curve plots |
| 11 | Evaluate | Accuracy, F1, confusion matrix, probability distribution |
| 12 | TFLite export | Converts model to float32 .tflite |
| 13 | Deployment summary | File size, sanity-check Keras vs TFLite delta |
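
Step 10 trains with class weights, which matters because high-PM2.5 windows are likely the minority class. The README does not say how the weights are derived; the sketch below assumes the common inverse-frequency ("balanced") scheme, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label vector with 10% positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape can be passed as the `class_weight` argument of Keras `Model.fit`, so that errors on the rare positive class count more in the loss.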

analyze_pm_spikes.py

Detects and visualizes two event types across the full dataset:

| Method | Description |
| --- | --- |
| Spike | PM2.5 rose ≥ 15 µg/m³ over the preceding 10 samples (rate-of-change) |
| Degradation event | Sustained period ≥ 60 s above 60 µg/m³; events within 120 s are merged |

Produces 4 figures in figures/spikes/:

  • pm25_spikes_overview.png — full timeseries with spike markers & event shading
  • pm25_roc_signal.png — rate-of-change signal below the PM2.5 trace
  • event_statistics.png — histograms of event duration and peak PM2.5
  • events_per_site.png — bar chart of event count per site

Prints a per-site summary table to stdout.
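
The two detection rules can be sketched in plain Python as follows. This is an illustration rather than the script's code: the function names are hypothetical, one row is assumed to equal one second (~1 Hz), and nearby runs are assumed to be merged before the minimum-duration filter is applied.

```python
SPIKE_DELTA = 15.0        # µg/m³
SPIKE_WINDOW = 10         # rows
PM25_THRESHOLD = 60.0     # µg/m³
MIN_EVENT_DURATION = 60   # seconds (rows at 1 Hz)
MERGE_GAP = 120           # seconds (rows at 1 Hz)

def find_spikes(pm25):
    """Indices where PM2.5 rose by at least SPIKE_DELTA over SPIKE_WINDOW rows."""
    return [i for i in range(SPIKE_WINDOW, len(pm25))
            if pm25[i] - pm25[i - SPIKE_WINDOW] >= SPIKE_DELTA]

def find_events(pm25):
    """Return (start, end) index pairs, end exclusive, of sustained high periods."""
    # 1. Collect raw runs above the threshold.
    runs, start = [], None
    for i, value in enumerate(pm25):
        if value > PM25_THRESHOLD and start is None:
            start = i
        elif value <= PM25_THRESHOLD and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(pm25)))
    # 2. Merge runs separated by a gap shorter than MERGE_GAP.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < MERGE_GAP:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    # 3. Keep only events lasting at least MIN_EVENT_DURATION.
    return [(s, e) for s, e in merged if e - s >= MIN_EVENT_DURATION]

# Synthetic series: two high plateaus separated by a 50-row dip.
series = [20.0] * 100 + [90.0] * 80 + [20.0] * 50 + [90.0] * 70 + [20.0] * 300
spikes = find_spikes(series)
events = find_events(series)
print(len(spikes), events)
```

In the synthetic series the 50-row dip is shorter than the merge gap, so the two plateaus are reported as a single event.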


visualize_trends.py

Generates three individual trend figures + one combined overview:

| Figure | Content |
| --- | --- |
| pm25_trend.png | Raw PM2.5 + 1-min and 10-min moving averages + 60 µg/m³ threshold |
| temperature_trend.png | Temperature with short and long moving averages |
| humidity_trend.png | Relative humidity with moving averages |
| combined_trends.png | All three metrics in vertically stacked subplots |

All figures are saved to figures/ at 150 DPI.
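
The moving averages in these plots smooth the ~1 Hz signal over a fixed number of rows. The sketch below assumes a trailing (backward-looking) window, which may differ from what the script uses, and `moving_average` is a name introduced here.

```python
from collections import deque

def moving_average(values, window):
    """Trailing mean over the last `window` samples (shorter at the start)."""
    buf, running, out = deque(), 0.0, []
    for v in values:
        buf.append(v)
        running += v
        if len(buf) > window:
            running -= buf.popleft()
        out.append(running / len(buf))
    return out

# At ~1 Hz, 60 rows is about 1 minute and 600 rows about 10 minutes.
pm25 = [10.0] * 60 + [70.0] * 60
short = moving_average(pm25, 60)
long_ = moving_average(pm25, 600)
print(short[-1], long_[-1])
```

The short window tracks the step change within a minute, while the long window lags behind it, which is what makes the pair useful for separating brief spikes from sustained trends.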


compare_models.py

Loads best_model.h5 and model_float32.tflite from model_output/, runs the same random input through both, and prints the numerical difference to verify conversion parity.


tflite_model_test.py

Loads model_float32.tflite, runs a dummy input, and reports:

  • Input/output tensor shapes and dtypes
  • Model prediction
  • Inference latency in milliseconds

Key Configuration Parameters

| Parameter | Location | Default | Meaning |
| --- | --- | --- | --- |
| PM25_THRESHOLD | notebook / scripts | 60 µg/m³ | High-PM2.5 label threshold (matches India's 24-hour NAAQS limit for PM2.5; it is not a WHO guideline value) |
| FORECAST_STEPS | notebook | 600 rows | Prediction horizon (~10 min at 1 Hz) |
| WINDOW_SIZE | notebook | 20 rows | Sliding-window history (~20 seconds at 1 Hz) |
| MAX_ROWS_GLOBAL | notebook (cell 8) | 3_000_000 | Cap on combined rows before training |
| MAX_ROWS_PER_FILE | notebook (cell 8) | 100_000 | Per-CSV row cap to protect peak RAM |
| MAX_ROWS | notebook (cell 16) | 1_000_000 | Stratified subsample before windowing |
| SPIKE_DELTA | analyze_pm_spikes | 15 µg/m³ | Minimum PM2.5 rise to count as a spike |
| SPIKE_WINDOW | analyze_pm_spikes | 10 rows | Look-back window for rate-of-change |
| MIN_EVENT_DURATION | analyze_pm_spikes | 60 s | Minimum duration for a degradation event |
| MERGE_GAP | analyze_pm_spikes | 120 s | Gap below which two events are merged |

Output Artifacts

After a full notebook run, model_output/ contains:

| File | Description |
| --- | --- |
| model_float32.tflite | TFLite model, float32 weights, ready for ESP32-S3 |
| scaler_params.json | z-score normalization params (mean + scale per feature) for firmware |

scaler_params.json schema:

{
  "feature_cols":   ["temperature", "humidity", "pm25"],
  "window_size":    20,
  "input_dim":      60,
  "mean":           [...],
  "scale":          [...],
  "pm25_threshold": 60,
  "forecast_steps": 600
}
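
Before wiring these values into firmware, it can help to check that the file is internally consistent. The validator below is a sketch: `load_scaler_params` and `REQUIRED_KEYS` are hypothetical names, and the numbers in `SAMPLE` are made up for illustration, not values produced by the notebook.

```python
import json

REQUIRED_KEYS = {"feature_cols", "window_size", "input_dim", "mean",
                 "scale", "pm25_threshold", "forecast_steps"}

def load_scaler_params(text):
    """Parse scaler_params.json content and check basic consistency."""
    params = json.loads(text)
    missing = REQUIRED_KEYS - params.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    expected_dim = params["window_size"] * len(params["feature_cols"])
    if params["input_dim"] != expected_dim:
        raise ValueError(f"input_dim {params['input_dim']} != {expected_dim}")
    if len(params["mean"]) != len(params["scale"]):
        raise ValueError("mean and scale must have the same length")
    if any(s == 0 for s in params["scale"]):
        raise ValueError("scale contains zero; normalization would divide by zero")
    return params

# Made-up example values; a real file is written by the notebook.
SAMPLE = json.dumps({
    "feature_cols": ["temperature", "humidity", "pm25"],
    "window_size": 20,
    "input_dim": 60,
    "mean": [25.0, 50.0, 30.0],
    "scale": [2.0, 10.0, 20.0],
    "pm25_threshold": 60,
    "forecast_steps": 600,
})
params = load_scaler_params(SAMPLE)
print(params["input_dim"], len(params["mean"]))
```

The check deliberately does not fix the length of `mean` and `scale`, since the schema above elides them; only their mutual consistency is verified.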

Setup

Requirements: Python 3.9–3.11

# 1. Clone the repo
git clone https://github.com/<your-username>/airguard-edgeML.git
cd airguard-edgeML

# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

GPU (optional): Install tensorflow-directml-plugin on Windows or tensorflow[and-cuda] on Linux for GPU-accelerated training.


Running the Project

Visualize sensor trends:

python visualize_trends.py

Analyze PM2.5 spikes and degradation events:

python analyze_pm_spikes.py

Train the model and export TFLite — open and run all cells in:

aqi_pm25_predictor.ipynb

Verify TFLite model after training:

python tflite_model_test.py

Compare Keras vs TFLite outputs:

python compare_models.py

Edge Deployment (ESP32-S3)

  1. Copy model_output/model_float32.tflite to your firmware project.
  2. Load scaler_params.json → apply z-score normalization to each incoming sensor reading before inference:
    normalized = (raw_value − mean[i]) / scale[i]
    
  3. Assemble a rolling window of 20 normalized readings per feature into a flat float32 array of length input_dim.
  4. Run the TFLite interpreter; output sigmoid probability > 0.5 → predict high PM2.5 in 10 min.

The model is ~12–15 KB and runs a single inference in < 1 ms on the ESP32-S3 CPU.
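
Steps 2 to 4 can be prototyped in Python before porting them to firmware. The class below is a reference sketch, not the project's firmware code: `WindowPreprocessor` and `is_alert` are hypothetical names, the window is assumed to be flattened in time order, and the scaler arrays are accepted either per feature (length F) or per input element (length input_dim), since the README does not pin that down.

```python
from collections import deque

class WindowPreprocessor:
    """Rolling window + z-score normalization for one sensor stream."""

    def __init__(self, mean, scale, window_size, n_features):
        self.window_size = window_size
        self.n_features = n_features
        input_dim = window_size * n_features
        mean, scale = list(mean), list(scale)
        # Per-feature params are tiled so they line up with the flat window.
        if len(mean) == n_features:
            mean, scale = mean * window_size, scale * window_size
        if len(mean) != input_dim or len(scale) != input_dim:
            raise ValueError("mean/scale length must be F or window_size * F")
        self.mean, self.scale = mean, scale
        self.rows = deque(maxlen=window_size)  # oldest reading drops out automatically

    def push(self, reading):
        """Add one reading; return the normalized flat vector once the window is full."""
        if len(reading) != self.n_features:
            raise ValueError("unexpected number of features")
        self.rows.append(list(reading))
        if len(self.rows) < self.window_size:
            return None
        flat = [v for row in self.rows for v in row]
        return [(v - m) / s for v, m, s in zip(flat, self.mean, self.scale)]

def is_alert(probability, cutoff=0.5):
    """Step 4: treat a sigmoid output above the cutoff as a high-PM2.5 warning."""
    return probability > cutoff

# Made-up scaler values and a 2-row window to keep the example short.
pre = WindowPreprocessor(mean=[25.0, 50.0, 30.0], scale=[5.0, 10.0, 20.0],
                         window_size=2, n_features=3)
first = pre.push([25.0, 50.0, 30.0])    # window not full yet
vector = pre.push([30.0, 60.0, 70.0])   # flat, normalized input for the interpreter
print(first, vector, is_alert(0.73))
```

Comparing the vector produced here with the one built on the device for the same readings is a quick way to catch ordering or scaling mismatches before they show up as bad predictions.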


Contributing

  1. Fork the repository and create a feature branch.
  2. Follow existing code style (PEP 8, module-level docstrings, type hints where practical).
  3. Test your changes against at least one site's CSV before submitting a PR.
  4. Open a pull request with a clear description of changes and motivation.

License

This project is licensed under the MIT License.

About

AirGuard-EdgeML is a lightweight edge-based machine learning system that predicts indoor air quality degradation and enables proactive ventilation control on ESP32 devices using TensorFlow Lite Micro.
