IoT + Machine Learning

Predicting temperature from sensor noise

Three regression models trained on real DHT11, LDR, and CO2 data. One clear winner. One surprising result.

Live readings panel (simulated from UCI dataset samples): Temperature, Humidity, Light (LDR), CO2 (ppm), Predicted Temp.

What this project does

Cheap IoT sensors like the DHT11 drift, saturate, and give noisy readings. This project asks: can we cross-check what one sensor reports against what the others suggest it should be?

Three regression models learn how room temperature relates to humidity, light, CO2, and time of day. They are trained on one week of real office sensor data, then tested on two separate time windows to see how well the learned patterns hold up.

01
Load the UCI splits
Three pre-defined files: training, test1 (same week), test2 (following week). Keeps results comparable with the original paper.
02
Clean for physical limits
Screen out readings outside the limits assumed for the DHT11 (5-45°C, 0-100% RH) and for the CO2 sensor. The real dataset had zero bad rows, which is useful to know.
03
Engineer time features
Hour is encoded as sine and cosine so the model sees 23:00 and 00:00 as adjacent. The LDR ratio is computed against a 60-minute rolling maximum to capture relative brightness transitions.
04
Train three models
Linear Regression (scaled), Random Forest (100 trees, depth 8), Gradient Boosting (150 estimators, lr=0.08). Each evaluated on both test windows separately.
05
Compare and explain
Metrics, residuals, and feature importances. The results tell a clear story about which model actually generalizes — and why.
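Steps 02 and 03 above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the CO2 bounds (0-5000 ppm), the use of a fractional hour, and the assumption that the UCI files load with a `date` column are all hypothetical choices made for the example. Column names follow the UCI occupancy dataset.

```python
import numpy as np
import pandas as pd


def load_split(path: str) -> pd.DataFrame:
    """Load one UCI split, indexed by timestamp.

    Assumes the file has a 'date' column; adjust if the layout differs.
    """
    df = pd.read_csv(path, parse_dates=["date"])
    return df.set_index("date").sort_index()


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Screen out-of-range rows, then add time and light features.

    Expects a frame with a sorted DatetimeIndex.
    """
    # Step 02: keep rows inside the stated limits (5-45 °C, 0-100 % RH).
    # The CO2 bounds are an assumption; the write-up gives no numbers.
    ok = (
        df["Temperature"].between(5, 45)
        & df["Humidity"].between(0, 100)
        & df["CO2"].between(0, 5000)
    )
    out = df.loc[ok].copy()

    # Step 03a: cyclical encoding so late evening and early morning
    # map to nearby points on the unit circle.
    # Using the fractional hour (hour + minute/60) is an assumption.
    hour = out.index.hour.to_numpy() + out.index.minute.to_numpy() / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24.0)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24.0)
    out["day_of_week"] = out.index.dayofweek

    # Step 03b: light relative to its 60-minute rolling maximum.
    # Where the rolling maximum is zero (dark room), the ratio is set to 0.
    rolling_max = out["Light"].rolling("60min").max()
    ratio = out["Light"] / rolling_max.where(rolling_max > 0)
    out["light_ratio"] = ratio.fillna(0.0)
    return out
```

Under these assumptions, `clean_and_engineer(load_split("datatraining.txt"))` would return the training frame with `hour_sin`, `hour_cos`, `day_of_week`, and `light_ratio` appended. The rolling window includes the current row, so `light_ratio` stays between 0 and 1 for non-negative light readings.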

What the sensors tell each other

Temperature, light, and CO2 all rise together when the room is occupied. Humidity moves inversely with temperature — the physical relationship is real and consistent across the dataset.

EDA plots: temperature timeline, correlation matrix, sensor distributions
Top: temperature across all three dataset splits (different weeks). Middle: correlation matrix and Light vs Temperature scatter by occupancy. Bottom: sensor value distributions for the training set.
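The co-movement described above can be checked with a plain Pearson correlation against temperature. The helper below is a small sketch; the function name is hypothetical and the column names are those of the UCI occupancy dataset.

```python
import pandas as pd


def temperature_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each sensor channel with Temperature."""
    cols = ["Temperature", "Humidity", "Light", "CO2"]
    corr = df[cols].corr(method="pearson")
    # Drop the trivial self-correlation and sort from most positive
    # to most negative.
    return corr["Temperature"].drop("Temperature").sort_values(ascending=False)
```

Applied to the training split, positive entries would correspond to channels that rise with temperature and negative entries to channels that move against it.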

Linear regression wins, and the gap is not close

Each model is evaluated on two separate time windows. Test1 comes from the same week as training, covering the dates just before the training window. Test2 is the following week: a completely fresh thermal environment.

Model                      Test1 R²   Test1 RMSE   Test2 R²   Test2 RMSE
Linear Regression (best)   0.9716     0.1733 °C    0.8981     0.3258 °C
Gradient Boosting          0.3936     0.8004 °C    0.1215     0.9566 °C
Random Forest              0.1155     0.9666 °C    -0.7905    1.3657 °C
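A comparison like the one above could be produced by a loop of roughly this shape. It is a sketch under assumptions, not the project's script: the hyperparameters are the ones listed in step 04, while `build_models`, `evaluate`, and the `seed` argument are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_models(seed=0):
    """Three regressors with the settings named in step 04.

    The fixed random seed is an assumption added for repeatability.
    """
    return {
        # Scaling lives inside the pipeline so it is fit on training data only.
        "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
        "Random Forest": RandomForestRegressor(
            n_estimators=100, max_depth=8, random_state=seed
        ),
        "Gradient Boosting": GradientBoostingRegressor(
            n_estimators=150, learning_rate=0.08, random_state=seed
        ),
    }


def evaluate(models, X_train, y_train, test_sets):
    """Fit each model once, then score it on every named test window."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        for split, (X_test, y_test) in test_sets.items():
            pred = model.predict(X_test)
            rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
            rows.append(
                {
                    "model": name,
                    "split": split,
                    "r2": float(r2_score(y_test, pred)),
                    "rmse": rmse,
                }
            )
    return rows
```

Calling `evaluate(build_models(), X_train, y_train, {"test1": (X1, y1), "test2": (X2, y2)})` returns one row per model and window, which can be passed to `pandas.DataFrame` for display. Scoring each window separately is what makes the Test1 to Test2 degradation visible.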

Random Forest achieves R² = -0.79 on Test2. A negative R² means the model does worse than a constant prediction equal to the mean of the test targets. It memorized the thermal fingerprint of one specific week and could not generalize when the heating pattern shifted slightly the following week. This is distribution shift, one of the most common failure modes in real IoT deployments.
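To see how R² can go negative, the toy example below trains a forest on one input range and scores it on a shifted range. The numbers are invented for illustration and are not taken from the dataset; `r2_manual` is a hypothetical helper that spells out the definition R² = 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy data (made up): the same linear law holds in both windows,
# but the second window covers an input range never seen in training.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(300, 1))
y_train = 20.0 + 2.0 * x_train.ravel()
x_shift = rng.uniform(1.0, 2.0, size=(300, 1))
y_shift = 20.0 + 2.0 * x_shift.ravel()

forest = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
forest.fit(x_train, y_train)
pred = forest.predict(x_shift)

# Tree ensembles average training targets, so predictions cannot exceed
# the largest target seen in training; the shifted window lies above it.
score = r2_manual(y_shift, pred)
print(f"R² on shifted window: {score:.2f}")
```

Because the forest's predictions are capped near the top of the training range, its squared error on the shifted window exceeds the spread of the targets around their own mean, and R² drops below zero. A linear model fit to the same toy data would extrapolate the line and score close to 1.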

Actual vs predicted scatter plots for all three models
Actual vs predicted temperature on Test1. Linear Regression predictions cluster tightly along the diagonal. Tree models show systematic bias and spread.

The LDR knows more than you think

Light intensity is the dominant predictor of temperature — more than humidity or CO2. That sounds backwards until you realize what Light is actually encoding: occupancy. Lights on means people in the room, and people generate heat.

The LDR reading is doing double duty as an indirect body-heat sensor.

Feature         Importance
Light (LDR)     46.3%
CO2             16.2%
Humidity        14.9%
HumidityRatio   9.8%
day_of_week     6.1%
hour_sin        5.8%
hour_cos        0.9%
light_ratio     0.1%
Feature importance bar chart and model comparison across test splits
Left: Random Forest feature importances. Right: R² and RMSE across both test splits. The degradation from Test1 to Test2 is the distribution shift problem made visible.
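A ranking like the one above can be read off a fitted forest through scikit-learn's `feature_importances_` attribute. The helper below is a sketch; `ranked_importances` is a hypothetical name, and the percentages it returns are impurity-based importances.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def ranked_importances(forest: RandomForestRegressor, feature_names) -> pd.Series:
    """Impurity-based importances as percentages, largest first."""
    importances = pd.Series(forest.feature_importances_, index=list(feature_names))
    return (100.0 * importances).sort_values(ascending=False).round(1)
```

Given a fitted `RandomForestRegressor` and the list of training column names, this returns a series that sums to roughly 100 after rounding. Impurity-based importances describe how the forest split the training data, so they are best read as a description of this model rather than a causal statement about the room.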

Four steps to reproduce

Clone the repo, install dependencies, download the UCI dataset files, and run the script. The code is intentionally short: under 90 lines including comments.

# 1. Clone the repo
git clone https://github.com/sobanmujtaba/IoT-Environmental-Predictor
cd IoT-Environmental-Predictor

# 2. Install dependencies
pip install scikit-learn pandas numpy matplotlib

# 3. Download the dataset (place alongside predict.py)
#    https://archive.ics.uci.edu/dataset/357/occupancy+detection
#    Files: datatraining.txt  datatest.txt  datatest2.txt

# 4. Run
python predict.py