IoT Environmental Predictor

// overview

What this project does

Cheap IoT sensors like the DHT11 drift, saturate, and give noisy readings. This project asks: can we cross-validate what one sensor reports against what the others suggest it should be?

Three regression models learn how room temperature depends on humidity, light, CO2, and time of day. They are trained on one week of real office sensor data, then tested on two separate time windows to see how well the learned patterns hold up.

Load the UCI splits

Three pre-defined files: training, test1 (same week), test2 (following week). Keeps results comparable with the original paper.
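A minimal loading sketch, assuming the three files sit next to the script and follow the UCI layout (a quoted header row with a `date` column; if each data row carries one extra leading row-number field, pandas uses it as the index automatically). The helper names `load_split` and `load_all_splits` are hypothetical and not taken from predict.py.

```python
import pandas as pd

# File names as listed in the dataset download step.
SPLIT_FILES = {
    "train": "datatraining.txt",
    "test1": "datatest.txt",
    "test2": "datatest2.txt",
}

def load_split(source):
    """Read one split (a path or file-like object) and parse its timestamps."""
    df = pd.read_csv(source)
    df["date"] = pd.to_datetime(df["date"])
    # Sort chronologically so later rolling-window features behave predictably.
    return df.sort_values("date").reset_index(drop=True)

def load_all_splits(files=SPLIT_FILES):
    """Return {'train': df, 'test1': df, 'test2': df}; the files must exist locally."""
    return {name: load_split(path) for name, path in files.items()}
```

Keeping the splits in a dictionary makes it easy to loop over both test windows later without duplicating evaluation code.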

Clean for physical limits

Screen out readings outside the DHT11 operating range (5-45°C, 0-100% RH) and the CO2 sensor's limits. The real dataset had zero bad rows, which is useful to know.
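The screening step could look roughly like the sketch below. The temperature and humidity bounds are the ones quoted above; the CO2 bounds are an assumed placeholder because the writeup does not state them, and `screen_physical_limits` is a hypothetical helper name.

```python
import pandas as pd

# (low, high) bounds per column. Temperature and humidity follow the range
# quoted in the text; the CO2 bounds are an ASSUMPTION, not from the project.
LIMITS = {
    "Temperature": (5.0, 45.0),
    "Humidity": (0.0, 100.0),
    "CO2": (0.0, 5000.0),  # placeholder: replace with your CO2 sensor's datasheet range
}

def screen_physical_limits(df, limits=LIMITS):
    """Drop rows with any reading outside its bounds; return (clean_df, n_dropped)."""
    keep = pd.Series(True, index=df.index)
    for column, (low, high) in limits.items():
        # between() is inclusive on both ends and treats NaN as out of range.
        keep &= df[column].between(low, high)
    return df[keep].copy(), int((~keep).sum())
```

Returning the dropped-row count alongside the cleaned frame makes a "zero bad rows" result easy to confirm on each split.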

Engineer time features

Hour encoded as sine and cosine so the model understands that 23:59 and 00:01 are adjacent. Light (LDR) ratio computed against a 60-minute rolling maximum to capture relative brightness transitions.
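One way to build these features with pandas is sketched here. The column names `hour_sin`, `hour_cos`, `day_of_week`, and `light_ratio` match the feature list reported later; using a fractional hour and mapping a zero rolling maximum to a ratio of 0 are assumptions of this sketch, as is the helper name `add_time_features`.

```python
import numpy as np
import pandas as pd

def add_time_features(df):
    """Add cyclical hour features, day of week, and a 60-minute relative light ratio."""
    out = df.sort_values("date").reset_index(drop=True)

    # Map the time of day onto a circle so midnight wraps around smoothly.
    hour = out["date"].dt.hour + out["date"].dt.minute / 60.0  # assumption: fractional hour
    angle = 2 * np.pi * hour / 24.0
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday = 0

    # Time-based rolling window: needs a sorted datetime index.
    rolling_max = out.set_index("date")["Light"].rolling("60min").max().to_numpy()

    # Avoid dividing by zero when the room has been dark for the whole window.
    safe_max = np.where(rolling_max > 0, rolling_max, np.nan)
    ratio = out["Light"].to_numpy(dtype=float) / safe_max
    out["light_ratio"] = np.nan_to_num(ratio, nan=0.0)
    return out
```

The sine and cosine pair is needed together: either one alone maps two different times of day to the same value.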

Train three models

Linear Regression (scaled), Random Forest (100 trees, depth 8), Gradient Boosting (150 estimators, lr=0.08). Each evaluated on both test windows separately.
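A sketch of this setup in scikit-learn, using the hyperparameters listed above. Everything not stated there is left at library defaults; the `seed` argument and the helper names `build_models` and `fit_and_score` are assumptions added for reproducibility and illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_models(seed=0):
    """Three regressors with the settings quoted in the text; seed is an assumption."""
    return {
        # Scaling lives inside the pipeline so it is fit on training data only.
        "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
        "Random Forest": RandomForestRegressor(
            n_estimators=100, max_depth=8, random_state=seed
        ),
        "Gradient Boosting": GradientBoostingRegressor(
            n_estimators=150, learning_rate=0.08, random_state=seed
        ),
    }

def fit_and_score(X_train, y_train, test_sets, seed=0):
    """test_sets maps a window name to (X, y); returns {model: {window: (r2, rmse)}}."""
    results = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        results[name] = {}
        for window, (X_test, y_test) in test_sets.items():
            pred = model.predict(X_test)
            rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
            results[name][window] = (float(r2_score(y_test, pred)), rmse)
    return results
```

Scoring each window separately, rather than pooling the test data, is what exposes a model that looks fine on one window and fails on the other.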

Compare and explain

Metrics, residuals, and feature importances. The results tell a clear story about which model actually generalizes — and why.

// exploratory analysis

What the sensors tell each other

Temperature, light, and CO2 all rise together when the room is occupied. Humidity moves inversely with temperature — the physical relationship is real and consistent across the dataset.

EDA plots: temperature timeline, correlation matrix, sensor distributions

Top: temperature across all three dataset splits (different weeks). Middle: correlation matrix and Light vs Temperature scatter by occupancy. Bottom: sensor value distributions for the training set.
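The relationships described here can be checked numerically with a few lines of pandas. This is a sketch, assuming the sensor column names from the UCI files; `correlation_with_temperature` is a hypothetical helper, not part of the project script.

```python
import pandas as pd

# Column names as they appear in the UCI files.
SENSOR_COLUMNS = ["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]

def correlation_with_temperature(df, columns=SENSOR_COLUMNS):
    """Pearson correlation of each sensor column with Temperature, strongest first."""
    corr = df[columns].corr()["Temperature"].drop("Temperature")
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

A negative value for Humidity and positive values for Light and CO2 would match the pattern described above.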

// model results

Linear regression wins, and the gap is not close

Each model is evaluated on two separate time windows. Test1 comes from the same week as training, on dates before the training window starts. Test2 is the following week, a completely fresh thermal environment.

Model	Test1 R²	Test1 RMSE (°C)	Test2 R²	Test2 RMSE (°C)
Linear Regression (best)	0.9716	0.1733	0.8981	0.3258
Gradient Boosting	0.3936	0.8004	0.1215	0.9566
Random Forest	0.1155	0.9666	-0.7905	1.3657

Random Forest achieves R² = -0.79 on Test2. R² compares a model's squared error against a constant baseline that always predicts the mean of the test targets, so a negative value means the model did worse than that constant guess. The forest memorized the thermal fingerprint of one specific week and could not generalize when the heating pattern shifted slightly the following week. This is distribution shift, one of the most common failure modes in real IoT deployments.
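To make the negative R² concrete, here is a small worked example with made-up temperatures (not project data). It computes R² as 1 - SS_res / SS_tot by hand, where the baseline is the mean of the targets being scored, which is also how scikit-learn's `r2_score` defines it. The name `r2_by_hand` is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # constant-mean baseline's error
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
actual = [20.0, 21.0, 22.0, 23.0]
stuck_high = [23.0, 23.0, 23.0, 23.0]   # a model stuck on last week's warm level
mean_guess = [21.5, 21.5, 21.5, 21.5]   # the constant baseline itself

score_stuck = r2_by_hand(actual, stuck_high)  # SS_res = 14, SS_tot = 5, so about -1.8
score_mean = r2_by_hand(actual, mean_guess)   # identical to the baseline, so 0.0
library_score = r2_score(actual, stuck_high)  # agrees with the hand calculation
```

A model that is confidently offset from the true level accumulates more squared error than the flat baseline, which is how the score drops below zero.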

Actual vs predicted scatter plots for all three models

Actual vs predicted temperature on Test1. Linear Regression predictions cluster tightly along the diagonal. Tree models show systematic bias and spread.

// feature importance

The LDR knows more than you think

Light intensity is the dominant predictor of temperature in the Random Forest's feature importances, ahead of humidity and CO2. That sounds backwards until you realize what Light is actually encoding: occupancy. Lights on means people in the room, and people generate heat.

The LDR reading is doing double duty as an indirect body-heat sensor.

Feature	Importance
Light (LDR)	46.3%
CO2	16.2%
Humidity	14.9%
HumidityRatio	9.8%
day_of_week	6.1%
hour_sin	5.8%
hour_cos	0.9%
light_ratio	0.1%

Feature importance bar chart and model comparison across test splits

Left: Random Forest feature importances. Right: R² and RMSE across both test splits. The degradation from Test1 to Test2 is the distribution shift problem made visible.
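A ranking like the one charted here can be read off a fitted scikit-learn forest through its `feature_importances_` attribute. The sketch below assumes a fitted `RandomForestRegressor` and a list of feature names in training-column order; `rank_importances` is a hypothetical helper name.

```python
import pandas as pd

def rank_importances(model, feature_names):
    """Return impurity-based importances as percentages, largest first."""
    # feature_importances_ sums to 1.0 across all features of a fitted forest.
    shares = pd.Series(model.feature_importances_, index=feature_names)
    return (100 * shares).sort_values(ascending=False).round(1)

# Usage, assuming `rf` was fitted on columns in this order:
# ranked = rank_importances(rf, ["Humidity", "Light", "CO2", "HumidityRatio"])
```

Impurity-based importances describe how the forest split its training data, so they are worth cross-checking with `sklearn.inspection.permutation_importance` on a held-out window before drawing strong conclusions.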

// run it yourself

Four steps to reproduce

Clone the repo, install dependencies, download the UCI dataset files, and run the script. The code is intentionally short: under 90 lines including comments.

# 1. Clone the repo
git clone https://github.com/sobanmujtaba/IoT-Environmental-Predictor
cd IoT-Environmental-Predictor

# 2. Install dependencies
pip install scikit-learn pandas numpy matplotlib

# 3. Download the dataset (place alongside predict.py)
#    https://archive.ics.uci.edu/dataset/357/occupancy+detection
#    Files: datatraining.txt  datatest.txt  datatest2.txt

# 4. Run
python predict.py