Solar Power Forecasting with ML: Python and Weather API

This guide builds a solar power forecasting system in Python using machine learning, a practical application of AI to renewable energy management that is directly relevant to India’s growing solar infrastructure. Grid operators, solar plant owners, and battery storage managers all need accurate solar generation forecasts to optimise their operations. The guide covers the complete pipeline, from weather API data collection to ML model training and deployment.

Why Solar Power Forecasting Matters
Data Sources and Weather APIs for India
Feature Engineering for Solar Forecasting
ML Model Selection and Training
Complete Python Implementation
Model Evaluation Metrics
Deployment for Indian Solar Systems
Frequently Asked Questions

Why Solar Power Forecasting Matters

Accurate solar power forecasting using machine learning is critical for:

Grid balancing: NLDC/SLDC operators need day-ahead solar forecasts to schedule backup generation. Forecasting errors cost Indian grid operators Rs 500-2,000 crore annually.
Battery management: Predictive charging/discharging of battery storage using tomorrow’s solar forecast extends battery life by 10-20%.
Trading/scheduling: RE generators must submit day-ahead generation schedules to the relevant load despatch centre (SLDC or RLDC) under CERC IEGC regulations. Accurate forecasts reduce deviation settlement charges.
Maintenance planning: Schedule panel cleaning and maintenance on forecast low-generation days to minimise opportunity cost.

Data Sources and Weather APIs for India

Key data sources for Indian solar forecasting:

OpenWeatherMap API (free tier): 5-day forecast in 3-hour steps, including cloud cover, humidity, temperature, and wind. Available globally including India. Free for 1000 calls/day.
India Meteorological Department (IMD): Government agency providing meteorological gridded data. Registration required. Free for research.
Solargis (commercial): Best-in-class solar irradiance data for India at 15-minute resolution. Paid API, Rs 15,000-50,000/year for commercial use.
NASA POWER API (free): Historical and near-real-time solar radiation data at any location globally. Excellent for training datasets.
PVGIS (European Commission, free): Historical solar radiation data with hourly resolution for India. Best free historical dataset for panel-level calculations.
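As a rough illustration of how historical irradiance for a training set might be pulled from NASA POWER, the sketch below builds the query parameters and parses the JSON response. The endpoint URL, the parameter name `ALLSKY_SFC_SW_DWN`, the `YYYYMMDDHH` key format, and the `-999` missing-value marker are assumptions based on the public hourly point API and should be checked against the current NASA POWER documentation. The function names are illustrative.

```python
import pandas as pd

# Assumed endpoint; verify against the NASA POWER API documentation
POWER_URL = "https://power.larc.nasa.gov/api/temporal/hourly/point"

def power_request_params(lat, lon, start, end):
    """Build query parameters for hourly all-sky surface irradiance.

    start and end are date strings in YYYYMMDD form.
    """
    return {
        "parameters": "ALLSKY_SFC_SW_DWN",
        "community": "RE",
        "latitude": lat,
        "longitude": lon,
        "start": start,
        "end": end,
        "format": "JSON",
    }

def parse_power_hourly(payload, parameter="ALLSKY_SFC_SW_DWN"):
    """Convert the JSON payload into a Series indexed by timestamp.

    Keys are assumed to look like 'YYYYMMDDHH'; -999 is treated as missing.
    """
    raw = payload["properties"]["parameter"][parameter]
    series = pd.Series(raw, dtype="float64")
    series.index = pd.to_datetime(series.index, format="%Y%m%d%H")
    return series.where(series != -999).rename("allsky_ghi")

# Typical use (needs network access):
#   import requests
#   params = power_request_params(18.52, 73.86, "20230101", "20230131")
#   payload = requests.get(POWER_URL, params=params, timeout=30).json()
#   ghi = parse_power_hourly(payload)
```

Before joining this series with measured plant output, confirm which time standard (local solar time or UTC) the returned timestamps use, since a one-hour offset is enough to distort the training data.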

# Fetch weather data using OpenWeatherMap API
import requests
import pandas as pd

API_KEY = 'your_openweathermap_api_key'
LAT, LON = 18.5204, 73.8567  # Pune, Maharashtra

def get_weather_forecast(lat, lon):
    """Return the 5-day / 3-hour forecast as a DataFrame with IST timestamps"""
    url = "https://api.openweathermap.org/data/2.5/forecast"
    params = {
        'lat': lat, 'lon': lon,
        'appid': API_KEY,
        'units': 'metric'
    }
    r = requests.get(url, params=params, timeout=10)
    r.raise_for_status()  # Fail early on an invalid key or rate limiting
    data = r.json()
    
    records = []
    for item in data['list']:
        records.append({
            # 'dt' is a Unix timestamp in UTC; convert it explicitly to IST
            'datetime': pd.Timestamp(item['dt'], unit='s', tz='Asia/Kolkata'),
            'temp_c': item['main']['temp'],
            'clouds_pct': item['clouds']['all'],
            'humidity': item['main']['humidity'],
            'wind_speed': item['wind']['speed'],
            'description': item['weather'][0]['description']
        })
    return pd.DataFrame(records)

df_forecast = get_weather_forecast(LAT, LON)
print(df_forecast.head())
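The free 5-day forecast endpoint returns one record every 3 hours, while plant output is usually logged hourly or faster. One possible way to line the two up is to interpolate the numeric weather columns onto an hourly grid, as sketched below. The helper name `to_hourly` is illustrative, and linear interpolation is an assumption that smooths over short-lived cloud events.

```python
import pandas as pd

def to_hourly(df_3h):
    """Upsample a 3-hourly weather frame (with a 'datetime' column) to hourly steps."""
    # Keep numeric columns only; text columns such as 'description' cannot be interpolated
    numeric = df_3h.set_index("datetime").select_dtypes("number")
    # Linear interpolation weighted by elapsed time between forecast points
    return numeric.resample("1h").interpolate(method="time")

# Small synthetic example
sample = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01 06:00", "2024-01-01 09:00", "2024-01-01 12:00"]),
    "clouds_pct": [0, 30, 90],
    "description": ["clear sky", "scattered clouds", "overcast clouds"],
})
print(to_hourly(sample))
```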

Feature Engineering for Solar Forecasting

Raw weather data needs to be transformed into features that capture solar physics:

import numpy as np
import pandas as pd
from pvlib import location

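def engineer_solar_features(df, lat=18.52, lon=73.86, altitude=560):
    """Add solar position and derived features to a weather dataframe.

    Expects a DatetimeIndex and the columns 'clouds_pct' and 'temp_c'.
    """
    df = df.copy()  # Do not mutate the caller's dataframe
    # pvlib treats naive timestamps as UTC, so localise them (assumed to be IST)
    if df.index.tz is None:
        df.index = df.index.tz_localize('Asia/Kolkata')
    
    site = location.Location(lat, lon, altitude=altitude, tz='Asia/Kolkata')
    
    # Solar position features
    solar_pos = site.get_solarposition(df.index)
    df['solar_elevation'] = solar_pos['elevation']
    df['solar_azimuth'] = solar_pos['azimuth']
    df['cos_zenith'] = np.cos(np.radians(solar_pos['zenith']))
    
    # Clear-sky irradiance (maximum possible)
    clearsky = site.get_clearsky(df.index)
    df['ghi_clearsky'] = clearsky['ghi']
    df['dni_clearsky'] = clearsky['dni']
    
    # Cloud-sky modifier: a simple linear heuristic that keeps 30% of
    # clear-sky GHI (diffuse light) under full overcast
    df['cloud_modifier'] = (1 - df['clouds_pct']/100) * 0.7 + 0.3
    df['estimated_ghi'] = df['ghi_clearsky'] * df['cloud_modifier']
    
    # Temperature correction factor for panel efficiency
    # Output drops about 0.4% per degree C above 25 C. Ambient temperature is
    # used here as a proxy for cell temperature, which runs hotter in full sun
    df['temp_correction'] = 1 - 0.004 * (df['temp_c'] - 25).clip(lower=0)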
    
    # Time-based cyclical features
    df['hour_sin'] = np.sin(2 * np.pi * df.index.hour / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df.index.hour / 24)
    df['doy_sin'] = np.sin(2 * np.pi * df.index.dayofyear / 365)
    df['doy_cos'] = np.cos(2 * np.pi * df.index.dayofyear / 365)
    
    return df
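The sine/cosine pairs in the function above exist because raw hour values treat 23:00 and 00:00 as far apart, even though they are adjacent on the clock. A small self-contained check of that property, with illustrative helper names, might look like this:

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day (0-23) onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def circular_distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

# Adjacent hours across midnight end up as close as any other adjacent pair
print(circular_distance(23, 0), circular_distance(0, 1))
# Hours on opposite sides of the day are the farthest apart
print(circular_distance(0, 12))
```

The same reasoning applies to the day-of-year features, where 31 December and 1 January should sit next to each other.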

ML Model Selection and Training

Several ML architectures work well for solar forecasting:

Gradient Boosting (XGBoost/LightGBM): Best overall performance for day-ahead forecasting. Fast training, handles non-linear relationships well. RMSE typically 8-12% of rated capacity.
Random Forest: Good baseline, robust to outliers (important for monsoon anomalies in India). RMSE typically 10-15%.
LSTM (Long Short-Term Memory): Best for capturing temporal patterns (multi-day cloud patterns). Requires more data (1+ years). RMSE typically 7-10% with sufficient data.
Linear Regression with solar physics features: Simple, interpretable baseline. RMSE 15-20% but useful for understanding relationships.

For Indian conditions, a hybrid physics + ML approach (use pvlib for clear-sky baseline, then train ML to predict the cloud correction factor) often outperforms pure ML.
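One way to set up the hybrid approach is to train the model on a correction factor (measured power divided by a clear-sky reference power) instead of on raw kW, then multiply the prediction back by the reference. The sketch below shows only the target transformation. The reference formula (rated capacity scaled by clear-sky GHI relative to 1000 W/m2), the function names, and the `min_ref_kw` and clipping thresholds are simplifying assumptions, not a calibrated plant model.

```python
import numpy as np

def clearsky_power_kw(ghi_clearsky, capacity_kw):
    """Rough clear-sky output: capacity scaled by clear-sky GHI relative to 1000 W/m2."""
    return capacity_kw * np.atleast_1d(np.asarray(ghi_clearsky, dtype=float)) / 1000.0

def to_correction_factor(power_kw, ghi_clearsky, capacity_kw, min_ref_kw=0.05):
    """Training target: measured power divided by clear-sky reference power."""
    ref = clearsky_power_kw(ghi_clearsky, capacity_kw)
    power = np.atleast_1d(np.asarray(power_kw, dtype=float))
    factor = np.full_like(ref, np.nan)
    # Skip night and very low sun, where the ratio is numerically unstable
    valid = ref >= min_ref_kw
    factor[valid] = power[valid] / ref[valid]
    return np.clip(factor, 0.0, 1.5)

def from_correction_factor(factor_pred, ghi_clearsky, capacity_kw):
    """Convert a predicted correction factor back to kW."""
    ref = clearsky_power_kw(ghi_clearsky, capacity_kw)
    return np.clip(np.asarray(factor_pred, dtype=float), 0.0, None) * ref

ghi = [0.0, 500.0, 1000.0]          # clear-sky GHI in W/m2
measured = [0.0, 2.0, 4.0]          # measured output of a 5 kW system
target = to_correction_factor(measured, ghi, capacity_kw=5.0)
print(target)
print(from_correction_factor([0.8, 0.8, 0.8], ghi, capacity_kw=5.0))
```

Rows where the factor is NaN would be dropped before training. Any regressor can then be fitted on the factor, and the physics term carries the daily and seasonal shape that the model would otherwise have to learn from data.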

Complete Python Implementation

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import joblib

# Feature columns
FEATURES = [
    'cos_zenith', 'solar_elevation', 'estimated_ghi',
    'temp_c', 'temp_correction', 'clouds_pct', 'humidity',
    'wind_speed', 'hour_sin', 'hour_cos', 'doy_sin', 'doy_cos',
    'ghi_clearsky'
]
TARGET = 'power_kw'  # Actual measured solar output

def train_forecasting_model(df):
    # Filter daytime only (elevation > 5 degrees)
    df_day = df[df['solar_elevation'] > 5].copy()
    df_day = df_day.dropna(subset=FEATURES + [TARGET])
    
    X = df_day[FEATURES]
    y = df_day[TARGET]
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False  # Time series: no shuffle!
    )
    
    model = GradientBoostingRegressor(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.05,
        subsample=0.8,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    y_pred = np.maximum(y_pred, 0)  # Solar output can't be negative
    
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"MAE: {mae:.3f} kW, RMSE: {rmse:.3f} kW")
    
    # Save model
    joblib.dump(model, 'solar_forecast_model.pkl')
    return model, X_test, y_test, y_pred

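def forecast_tomorrow(model, lat, lon):
    """Generate tomorrow's solar forecast at the weather API's time steps"""
    df_weather = get_weather_forecast(lat, lon)
    df_weather = df_weather.set_index('datetime')
    df_features = engineer_solar_features(df_weather, lat, lon)
    
    # Filter to tomorrow's date in IST (the server clock may be in another zone)
    tomorrow = (pd.Timestamp.now(tz='Asia/Kolkata') + pd.Timedelta(days=1)).date()
    df_tomorrow = df_features[df_features.index.date == tomorrow]
    
    forecast_kw = model.predict(df_tomorrow[FEATURES])
    forecast_kw = np.maximum(forecast_kw, 0)
    # The model was trained on daytime rows only, so force night-time output to zero
    is_day = (df_tomorrow['solar_elevation'] > 5).to_numpy()
    forecast_kw = np.where(is_day, forecast_kw, 0.0)
    
    return pd.Series(forecast_kw, index=df_tomorrow.index, name='forecast_kw')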
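The training function above evaluates on a single chronological hold-out, which can be misleading when the last 20% of the data falls entirely in one season. A walk-forward (expanding window) scheme gives several test periods while still never training on the future. The generator below is a minimal sketch with illustrative names and defaults; scikit-learn's `TimeSeriesSplit` offers a similar ready-made splitter.

```python
def walk_forward_splits(n_samples, n_folds=4, min_train_frac=0.5):
    """Yield (train_idx, test_idx) ranges where each test block follows its training data."""
    first_test = int(n_samples * min_train_frac)
    fold_size = (n_samples - first_test) // n_folds
    if fold_size < 1:
        raise ValueError("Not enough samples for the requested number of folds")
    for k in range(n_folds):
        start = first_test + k * fold_size
        # The last fold absorbs any remainder so no sample is left unused
        stop = n_samples if k == n_folds - 1 else start + fold_size
        yield range(0, start), range(start, stop)

for train_idx, test_idx in walk_forward_splits(100):
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Each pair of ranges can be passed to `X.iloc[...]` and `y.iloc[...]` to fit and score the model once per fold; averaging the per-fold errors gives a steadier estimate than a single split.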

Model Evaluation Metrics

Standard metrics for solar forecasting in India:

nRMSE (normalised RMSE): RMSE as % of installed capacity. Target: below 10% for day-ahead, below 5% for hour-ahead
MAE (Mean Absolute Error): Average absolute error in kW. More interpretable than RMSE for operational use
Skill Score: Improvement over naive persistence forecast (use yesterday’s generation as forecast)
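CERC Metric: India’s CERC IEGC allows 15% deviation for RE generators, assessed per scheduling time block. A model with nRMSE below 10% is likely to keep most blocks inside that band, but because nRMSE is an average it does not guarantee that every block complies.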
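The first three metrics can be computed in a few lines of NumPy. The sketch below assumes the skill score is defined as one minus the ratio of model RMSE to persistence RMSE, which is a common convention; the function names and sample values are illustrative.

```python
import numpy as np

def rmse(actual, forecast):
    err = np.asarray(forecast, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def nrmse_pct(actual_kw, forecast_kw, capacity_kw):
    """RMSE as a percentage of installed capacity."""
    return 100.0 * rmse(actual_kw, forecast_kw) / capacity_kw

def mae_kw(actual_kw, forecast_kw):
    """Mean absolute error in kW."""
    diff = np.asarray(forecast_kw, dtype=float) - np.asarray(actual_kw, dtype=float)
    return float(np.mean(np.abs(diff)))

def skill_score(actual_kw, forecast_kw, persistence_kw):
    """1 - RMSE_model / RMSE_persistence; positive means better than persistence."""
    return 1.0 - rmse(actual_kw, forecast_kw) / rmse(actual_kw, persistence_kw)

actual = [0.0, 2.0, 4.0, 2.0]        # today's measured output (kW)
forecast = [0.0, 2.5, 3.5, 2.0]      # model forecast for today
yesterday = [0.0, 1.0, 3.0, 3.0]     # persistence: yesterday's output at the same hours

print(f"nRMSE: {nrmse_pct(actual, forecast, capacity_kw=5.0):.2f}%")
print(f"MAE:   {mae_kw(actual, forecast):.2f} kW")
print(f"Skill: {skill_score(actual, forecast, yesterday):.2f}")
```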

Deployment for Indian Solar Systems

# Simple Flask API for solar forecast deployment
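from flask import Flask, jsonify
import joblib

# Assumes the functions from the implementation section are saved as
# solar_forecast.py (hypothetical module name) next to this file
from solar_forecast import forecast_tomorrow

app = Flask(__name__)
model = joblib.load('solar_forecast_model.pkl')

@app.route('/forecast/<lat>/<lon>')
def get_forecast(lat, lon):
    # Coordinates arrive as strings; reject anything that is not a number
    try:
        lat, lon = float(lat), float(lon)
    except ValueError:
        return jsonify({'error': 'lat and lon must be numbers'}), 400
    forecast = forecast_tomorrow(model, lat, lon)
    return jsonify({
        'location': {'lat': lat, 'lon': lon},
        'forecast': [
            {'time': str(t), 'power_kw': float(p)}
            for t, p in forecast.items()
        ]
    })

if __name__ == '__main__':
    # Flask's built-in server is for development; use a WSGI server in production
    app.run(host='0.0.0.0', port=5000)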

Frequently Asked Questions

How much training data is needed for a solar ML model?

Minimum 6 months, ideally 1-2 years of hourly data. For Indian systems, ensure your training data covers at least one complete monsoon season (June-September) as monsoon cloud patterns are drastically different from clear-sky winter/summer months. More data always helps, especially for capturing rare weather events.

Can I use LSTM for a small 5 kW residential solar system?

Yes, but gradient boosting (XGBoost) typically performs as well as or better than LSTM for short-horizon (1-24 hour) forecasting, with less computational complexity. LSTMs show advantages for multi-day (2-7 day) forecasts where sequential temporal patterns matter more.

What is the best free weather API for solar forecasting in India?

OpenWeatherMap free tier (1000 calls/day) combined with NASA POWER historical data is the best free combination for Indian solar forecasting. For serious applications, Solargis or Tomorrow.io provide significantly more accurate irradiance forecasts at a cost.

Does the model need retraining after installation?

Yes. Solar panels degrade 0.4-0.7% per year, dust accumulation patterns change seasonally, and panel orientation may shift slightly. Retrain your model quarterly using the latest 3-6 months of data. Implement automated retraining with performance monitoring triggers.
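A performance monitoring trigger can be as simple as comparing the nRMSE over a recent window with the nRMSE measured when the model was last validated. The sketch below is one possible form; the function name, the 2 percentage point tolerance, and the sample numbers are illustrative assumptions that would need tuning per site.

```python
import numpy as np

def needs_retraining(actual_kw, forecast_kw, capacity_kw,
                     baseline_nrmse_pct, tolerance_pct=2.0):
    """Return (flag, recent_nrmse_pct) for a recent window of forecasts.

    The flag is True when recent nRMSE exceeds the validation-time
    baseline by more than tolerance_pct percentage points.
    """
    err = np.asarray(forecast_kw, dtype=float) - np.asarray(actual_kw, dtype=float)
    recent = 100.0 * float(np.sqrt(np.mean(err ** 2))) / capacity_kw
    return bool(recent > baseline_nrmse_pct + tolerance_pct), recent

# Example: last few daytime hours of a 5 kW system, baseline nRMSE of 8%
actual = [1.0, 2.0, 3.0, 2.5]
drifted = [2.0, 3.0, 4.0, 3.5]   # forecasts now run about 1 kW high
flag, recent = needs_retraining(actual, drifted, capacity_kw=5.0, baseline_nrmse_pct=8.0)
print(flag, round(recent, 1))
```

In practice the window would cover at least a week or two of daytime data, so that a single cloudy day does not trigger a retrain.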
