Crop yield prediction using IoT sensor data and machine learning enables Indian farmers and agricultural planners to forecast harvest volumes weeks in advance, optimise inputs, and improve supply chain logistics. This guide covers building an end-to-end crop yield prediction system with ESP32 sensor nodes, a Python ML pipeline, and practical deployment in Indian conditions.
Table of Contents
- Importance of Yield Prediction in India
- Key Sensor Data for Yield Models
- Hardware Setup
- Data Collection Pipeline
- Machine Learning Model
- Python ML Code
- Field Deployment
- Frequently Asked Questions
Importance of Yield Prediction in India
India produces 330+ million tonnes of food grains annually. Even a 5% improvement in yield prediction accuracy translates to better government procurement planning, reduced post-harvest losses (currently 15-30%), and improved farmer income. Key use cases:
- Government: State Agricultural Departments use yield forecasts for MSP procurement planning and food security buffers
- Banks: Kisan Credit Card sanctioning can use predicted yield as an input to collateral assessment
- Commodity traders: Mandi price forecasting based on supply predictions
- Agri-input companies: Fertilizer and pesticide demand planning
- Farmers: Sell-forward decisions, input optimisation
Key Sensor Data for Yield Models
Yield depends on multiple interacting factors. The most predictive sensor variables are:
| Parameter | Sensor | Yield Impact |
|---|---|---|
| Soil moisture (root zone) | Capacitive soil sensor | High (water stress = 20-40% yield loss) |
| Air temperature (min/max) | BME280 | High (heat stress at flowering is critical) |
| Relative humidity | BME280/SHT10 | Medium (disease risk, pollination) |
| Solar radiation (light intensity as proxy) | LDR or BH1750 | High (photosynthesis, biomass) |
| Rainfall | Tipping bucket gauge | High (water balance) |
Hardware Setup
Recommended Sensors from Zbotic
A field node consists of:
- ESP32 (data collection and WiFi transmission)
- BME280 (temperature, humidity, atmospheric pressure)
- 2x Capacitive soil moisture sensors (at 15cm and 30cm depth)
- BH1750 light intensity sensor (for solar radiation proxy)
- DS3231 RTC for accurate timestamps
- Solar power (5W panel + 10Ah LiPo)
Data Collection Pipeline
The ESP32 sends sensor readings every 30 minutes to a central server:
- ESP32 reads all sensors and timestamps with RTC
- Data sent via WiFi (or LoRa gateway) to MQTT broker
- InfluxDB stores time-series data on Raspberry Pi or cloud VM
- Python ML pipeline queries InfluxDB weekly for model training and prediction
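As a rough illustration of the storage step, the sketch below formats one reading as an InfluxDB line protocol record. The measurement name, tag names and field names (`field_sensors`, `node`, `air_temp_c` and so on) are assumptions made for this example, not names defined elsewhere in this guide, and the helper assumes tag values contain no spaces or commas.

```python
from datetime import datetime, timezone


def to_line_protocol(measurement, tags, fields, ts):
    """Format one sensor reading as an InfluxDB line protocol record."""
    # Tags identify the source (node, field); sorted for a stable output.
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Float fields are written bare; this sketch assumes every field is a float.
    field_str = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    # Line protocol timestamps default to nanoseconds since the Unix epoch.
    ts_ns = int(ts.timestamp()) * 1_000_000_000
    return f"{measurement},{tag_str} {field_str} {ts_ns}"


# Hypothetical reading from one node, stamped with the RTC time in UTC.
reading = to_line_protocol(
    "field_sensors",
    {"node": "esp32_01", "field_id": "F1"},
    {"air_temp_c": 31.4, "humidity_pct": 62.0, "soil_moisture_15cm": 0.27},
    datetime(2024, 7, 1, 6, 30, tzinfo=timezone.utc),
)
print(reading)
```

Keeping `field_id` as a tag rather than a field makes it straightforward to query one field's history when the weekly training job runs.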
Minimum training data: 2 complete crop seasons of the same crop on the same field (6-8 months of in-season data for most Indian crops). With existing IMD weather station data, you can bootstrap a model immediately and refine it as field sensor data accumulates.
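The ML code later in this guide works on daily columns such as `max_temp`, `min_temp`, `avg_humidity` and `total_rainfall`, while the nodes report every 30 minutes. One possible way to bridge the two is sketched below with pandas. The raw column names (`air_temp_c`, `rain_mm` and so on) and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical 30-minute readings for one field over two days.
idx = pd.date_range("2024-07-01", periods=96, freq="30min")
raw = pd.DataFrame(
    {
        "air_temp_c": [25 + (i % 48) * 0.25 for i in range(96)],
        "humidity_pct": [70.0] * 96,
        "soil_moisture": [0.30] * 96,
        "light_lux": [1000.0] * 96,
        "rain_mm": [0.5 if i % 48 == 0 else 0.0 for i in range(96)],
    },
    index=idx,
)

# Collapse each calendar day into one row of model-ready features.
daily = raw.groupby(pd.Grouper(freq="D")).agg(
    max_temp=("air_temp_c", "max"),
    min_temp=("air_temp_c", "min"),
    avg_temp=("air_temp_c", "mean"),
    avg_humidity=("humidity_pct", "mean"),
    avg_soil_moisture=("soil_moisture", "mean"),
    avg_light_lux=("light_lux", "mean"),
    total_rainfall=("rain_mm", "sum"),
)
print(daily)
```

With several fields in one table, the same aggregation can be grouped by `field_id` as well as by day.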
Machine Learning Model
For crop yield prediction, a Random Forest Regressor offers a good balance of accuracy and interpretability, and it is the model used in the code in this guide. The main options compare as follows:
- Linear regression: Simple baseline, works well with 3-5 features and historical yield data
- Random Forest: Handles non-linear interactions, is robust to noisy readings, and outputs feature importances (rows with missing values are dropped before training in the code in this guide)
- XGBoost: Best accuracy with large datasets (5+ years, multiple farms)
- LSTM: Best for sequential time-series patterns (monsoon progression, crop phenology stages)
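A quick way to choose between these options is to score a simple baseline and a candidate model with the same cross-validation. The sketch below does this for linear regression and a random forest on synthetic data. The data, the water-stress threshold at 0.2 and the coefficients are invented purely to show the comparison, so the scores say nothing about real fields.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
soil_moisture = rng.uniform(0.1, 0.5, n)
gdd = rng.uniform(800, 1800, n)
# Synthetic yield with a threshold effect (water stress below 0.2) plus noise.
yield_kg = 1500 + 0.5 * gdd - 600 * (soil_moisture < 0.2) + rng.normal(0, 50, n)
X = np.column_stack([soil_moisture, gdd])

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
# Mean R2 over 5 folds for each candidate, using identical folds.
scores = {
    name: cross_val_score(model, X, yield_kg, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean R2 = {score:.3f}")
```

Because the synthetic yield contains a step change, the forest would be expected to score higher here. On a small real dataset with few features, the linear baseline may be just as good.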
Python ML Code
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score
import joblib
# Load sensor data from CSV (exported from InfluxDB or ThingSpeak)
df = pd.read_csv('farm_data_2years.csv', parse_dates=['date'])
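# Feature engineering (sort first so cumulative and rolling features follow time order)
df = df.sort_values(['field_id', 'date']).reset_index(drop=True)
# Growing degree days with a 10 °C base temperature
df['growing_degree_days'] = ((df['max_temp'] + df['min_temp']) / 2 - 10).clip(lower=0)
df['gdd_cumulative'] = df.groupby(['season', 'field_id'])['growing_degree_days'].cumsum()
# 7-day rainfall per field; transform keeps each result aligned with its original row
df['rainfall_7d_sum'] = df.groupby('field_id')['total_rainfall'].transform(lambda s: s.rolling(7).sum())
# Rough vapour pressure deficit proxy from mean temperature and humidity
df['vpd'] = df['avg_temp'] * (1 - df['avg_humidity']/100) * 0.066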
features = ['gdd_cumulative', 'avg_soil_moisture', 'rainfall_7d_sum',
            'avg_humidity', 'avg_light_lux', 'vpd', 'days_since_sowing']
target = 'actual_yield_kg_per_acre'
model_df = df[features + [target]].dropna()
X, y = model_df[features], model_df[target]
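# Random row-level split. If each field-season contributes many rows that share one
# yield value, hold out whole seasons or fields instead for an honest error estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)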
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
rf_model = RandomForestRegressor(n_estimators=200, max_depth=10,
                                 min_samples_leaf=5, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)
y_pred = rf_model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.0f} kg/acre ({mae/y_test.mean()*100:.1f}% error)")
print(f"R2 score: {r2:.3f}")
importances = pd.Series(rf_model.feature_importances_, index=features).sort_values(ascending=False)
print("Feature Importances:")
print(importances)
joblib.dump(rf_model, 'yield_model.pkl')
joblib.dump(scaler, 'yield_scaler.pkl')
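The `vpd` feature in the code above is a linear shortcut. A more conventional estimate derives vapour pressure deficit from saturation vapour pressure, for example with the Tetens approximation. The sketch below is one such variant offered as an alternative, not the method used in this guide, and the function names are chosen only for this example.

```python
import math


def saturation_vapour_pressure_kpa(temp_c):
    """Tetens approximation of saturation vapour pressure over water, in kPa."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_kpa(temp_c, rel_humidity_pct):
    """Vapour pressure deficit: saturation pressure minus actual vapour pressure."""
    es = saturation_vapour_pressure_kpa(temp_c)
    return es * (1.0 - rel_humidity_pct / 100.0)


# Example: a warm afternoon at 30 °C and 60% relative humidity.
print(round(vpd_kpa(30.0, 60.0), 2))
```

At 30 °C and 60% humidity this gives about 1.7 kPa. The same expression can be applied to the `avg_temp` and `avg_humidity` columns with `np.exp` to build the feature for a whole DataFrame.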
Field Deployment
Integration steps for a complete system:
- ESP32 nodes: Deploy 2-3 per 10 acres for representative sampling
- Gateway: Raspberry Pi 4 at farmhouse edge with MQTT broker, InfluxDB, and Grafana
- Weekly model run: Cron job updates predictions every Monday morning
- Farmer interface: WhatsApp bot sends weekly yield forecast in local language (Hindi, Marathi, Telugu)
- Extension integration: Share prediction data with local Krishi Vigyan Kendra (KVK)
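The weekly model run in the steps above could look roughly like the sketch below: take the newest feature row per field, apply the saved scaler and model, and turn each prediction into a short text line. The function names and the message wording are illustrative assumptions. Translation into local languages and WhatsApp delivery are left out.

```python
import pandas as pd

FEATURES = ["gdd_cumulative", "avg_soil_moisture", "rainfall_7d_sum",
            "avg_humidity", "avg_light_lux", "vpd", "days_since_sowing"]


def weekly_forecast(model, scaler, latest, features=FEATURES):
    """Return {field_id: predicted yield in kg/acre} from the newest row per field."""
    # Keep only the most recent observation for each field.
    newest = latest.sort_values("date").groupby("field_id").tail(1)
    X = scaler.transform(newest[features])
    preds = model.predict(X)
    return dict(zip(newest["field_id"], (float(p) for p in preds)))


def format_message(field_id, predicted_kg_per_acre):
    """Plain-text forecast line; translation and delivery are out of scope here."""
    return f"Field {field_id}: forecast yield about {predicted_kg_per_acre:.0f} kg/acre"
```

In a cron job, `model` and `scaler` would be the objects loaded with `joblib.load` from `yield_model.pkl` and `yield_scaler.pkl`, and `latest` a DataFrame of recent daily features that includes `field_id` and `date` columns.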
Typical accuracy with 2 years of training data, expressed as MAE relative to mean yield: Wheat (Punjab) about 8%, Paddy (AP/Karnataka) about 12%, Tomato polyhouse about 6%.
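Percentage MAE figures like these are most trustworthy when every test row comes from a field-season the model never saw during training. A minimal sketch of that kind of check uses scikit-learn's `GroupKFold` with one group per field-season. The synthetic data, group counts and feature construction below are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n_groups, rows_per_group = 8, 30  # hypothetical: 8 field-seasons, 30 daily rows each
groups = np.repeat(np.arange(n_groups), rows_per_group)
season_yield = rng.uniform(1500, 2500, n_groups)  # one harvest value per field-season
y = season_yield[groups]
X = np.column_stack([
    y / 1000 + rng.normal(0, 0.3, y.size),  # feature loosely related to yield
    rng.normal(0, 1, y.size),               # pure noise feature
])

pct_errors = []
# Each fold holds out whole field-seasons, so no group appears on both sides.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    pct_errors.append(100 * mae / y[test_idx].mean())

print(f"MAE as % of mean yield per fold: {np.round(pct_errors, 1)}")
```

With real data, the group label would typically combine `season` and `field_id`, so that the reported error reflects performance on unseen seasons or fields.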
Related Sensing Products
- GY-BME280 5V variant for 5V microcontroller systems
- Capacitive Soil Moisture Sensor for root zone monitoring
Frequently Asked Questions
How much historical data do I need to train a reliable yield model?
Minimum 2 complete crop seasons (same crop, same field). With 3-5 seasons, accuracy improves significantly. You can augment with IMD weather data (available free from data.gov.in) and published agronomic yield tables for your region.
Can I use this system for multiple crops?
Train separate models for each crop. Crop-specific features (flowering date, critical irrigation stages) differ significantly. Using a single model across crops degrades accuracy by 15-25%.
Is the ML model retraining automatic?
Not by default. Add an automated retraining pipeline: after each harvest, append the actual yield data to the dataset and retrain. Validate the new model against the held-out last season, and deploy it automatically only if R2 improves.
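The deployment rule described here can be reduced to a small comparison on the held-out season. The sketch below computes R2 by hand, using the same definition as `sklearn.metrics.r2_score`, and returns whether the retrained model should replace the current one. The function names, the `min_gain` margin and the sample numbers are hypothetical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def should_deploy(y_holdout, current_pred, candidate_pred, min_gain=0.0):
    """Deploy the retrained model only if its held-out R2 beats the current one."""
    return r2(y_holdout, candidate_pred) > r2(y_holdout, current_pred) + min_gain


# Hypothetical held-out season: actual yields and two sets of predictions (kg/acre).
y_holdout = [1800, 2000, 2200, 2400]
current_pred = [1900, 1900, 2300, 2300]
candidate_pred = [1820, 1990, 2190, 2380]
print(should_deploy(y_holdout, current_pred, candidate_pred))
```

Setting `min_gain` above zero avoids swapping models for improvements that are within noise.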
What government resources support IoT-based precision farming?
ICAR provides free agronomic data. NABARD funds precision farming pilots under the Agriculture Infrastructure Fund. The Digital Agriculture Mission 2021-25 actively promotes IoT and ML-based advisory systems.