Regression Based Prediction for Precipitation using Global Climate Data

Heroku App Link: https://precipitation-ml.herokuapp.com/

Note for Heroku app: the Heroku free tier is being used. If you open the link after a period of inactivity, allow some time for the Heroku dyno to wake from sleep.

GitHub Repository

The code used to build the DataFrame for this research is at: https://github.com/singparvi/Global-Precipitation/blob/master/Data_and_Code/Get_Precipiration_Data_NASA.ipynb

The code for the predictive modelling done in this project is at: https://github.com/singparvi/Global-Precipitation/blob/master/Data_and_Code/Predict_Precipitation.ipynb

Abstract

Climate has become increasingly unpredictable over recent decades, and better climate prediction has never been more critical. Severe climate events carry substantial human and capital costs. Current meteorological models can predict the climate in any area of the world with great accuracy. This research aims to predict Precipitation from historical information using Multiple Linear Regression (MLR) machine learning models in Python. The prediction need not be for the following week or month; it can be many years into the future. Multiple models were run using weather data from NASA and, at best, a validation R^2 of 48.5% was obtained. The R^2 value may seem low, but the Mean Absolute Error (MAE) showed a 67% improvement over the baseline. The low R^2 was attributed to the unavailability of crucial climate information, which could be added later to refine the prediction. Temperature was identified as the key feature in predicting Precipitation. The best model’s predictions may help decision-makers such as governments or insurance companies (with financial interests) plan policies or assess the risks they undertake.

Finding the Data and the JSON hurdle

Finding the data was itself a big hurdle in this project. The data had to offer a learning opportunity while also being suitable for Machine Learning practices. It needed enough observations to be worth the time: at least 100,000, so that there were enough to train, validate and then test. The data also had to relate to something that could tie into a business case.

After much investigation, predicting Rainfall was chosen as the topic. Precipitation was the available variable closest to that topic of interest.

NASA maintains an app called POWER Single Point Data Access ¹ that provides data in JavaScript Object Notation (JSON) format through an Application Programming Interface (API) to users based on the geographical location of interest. NASA’s POWER app requires latitude and longitude to provide the weather information. The latitude and longitude of various countries were gathered using an existing CSV file from GitHub user albertyw ².

A program was written in a Python notebook that takes the latitude and longitude information from the country list and passes it to NASA’s app to fetch the following information:-

Latitude (degrees in decimal)
Longitude (degrees in decimal)
Elevation (m)
PRECTOT - Precipitation (mm/day)
QV2M - Specific Humidity at 2 Meters (g/kg)
PS - Surface Pressure (kPa)
TS - Earth Skin Temperature (C)
T2MDEW - Dew/Frost Point at 2 Meters (C)
T2M - Temperature at 2 Meters (C)
WS50M - Wind Speed at 50 Meters (m/s)
WS10M - Wind Speed at 10 Meters (m/s)
T2MWET - Wet Bulb Temperature at 2 Meters (C)
T2M_RANGE - Temperature Range at 2 Meters (C)
RH2M - Relative Humidity at 2 Meters (%)
KT - Insolation Clearness Index (dimensionless)
CLRSKY_SFC_SW_DWN - Clear Sky Insolation Incident on a Horizontal Surface (kW-hr/m^2/day)
ALLSKY_SFC_SW_DWN - All Sky Insolation Incident on a Horizontal Surface (kW-hr/m^2/day)
ALLSKY_SFC_LW_DWN - Downward Thermal Infrared (Longwave) Radiative Flux (kW-hr/m^2/day)

The Python code also merges the country code, latitude and longitude data to make a single data frame. The data included daily weather data for 240 countries from 1980 to 2020. The resulting dataset had 3,506,400 rows × 22 columns.

Figure: NASA POWER DataFrame

The data frame built was used in the rest of the research. Learning to use publicly available JSON data (sending requests, receiving and interpreting the responses, and converting them to a pandas DataFrame) was a small achievement of this project.
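The JSON-to-table step can be sketched as follows. This is a minimal sketch and not the notebook's code: the payload layout (a `properties.parameter` mapping from parameter name to `{date: value}`, and a `geometry.coordinates` list of longitude, latitude and elevation) is an assumption about the POWER response, `sample_payload` is made-up data, and `power_json_to_rows` is a hypothetical helper.

```python
import json

# Made-up payload in the ASSUMED shape of a POWER daily point response.
sample_payload = json.loads("""
{
  "geometry": {"coordinates": [77.0, 20.0, 325.5]},
  "properties": {
    "parameter": {
      "PRECTOT": {"20200101": 0.0, "20200102": 1.2},
      "T2M": {"20200101": 21.4, "20200102": 20.9}
    }
  }
}
""")


def power_json_to_rows(payload, country_code):
    """Flatten one location's payload into one dict per day (hypothetical helper)."""
    lon, lat, elevation = payload["geometry"]["coordinates"]
    parameters = payload["properties"]["parameter"]
    # Collect every date that appears under any parameter.
    dates = sorted({d for series in parameters.values() for d in series})
    rows = []
    for date in dates:
        row = {"country_code": country_code, "date": date,
               "lat": lat, "long": lon, "elevation": elevation}
        for name, series in parameters.items():
            row[name] = series.get(date)  # None if a value is missing
        rows.append(row)
    return rows


rows = power_json_to_rows(sample_payload, "IN")
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)`, and the per-country lists can be concatenated before building the frame.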

EDA and Machine Learning Model

Data from NASA’s application was cleaned, and unnecessary or repetitive columns were dropped. Before feature and target selection, one more feature that should affect the Precipitation of any area of interest was added. Based on the hypothesis that Precipitation is higher in countries with more forest area, Forest Area data from the World Bank ³ was added to the data frame as a feature through the pandas merge function.

Because of resource limitations, and to keep model training turnaround short, only the last twenty years of data were considered.

The data was ready for some Exploratory Data Analysis (EDA). Pandas Profiling was used to generate a report to see the type of data, missing values and data distribution.

The features were selected to be the following:-

country_code
lat
long
elevation
surface_pressure
skin_temperature
dew_frost
temperature2m
windspeed10m
windspeed50m
wet_bulb_temp
temp_range
clearness_index
clear_sky_insolation
all_sky_insolation
radiative_flux
Forest_Cover(sq km)

The definition of all the features mentioned above was provided in the text above. Precipitation was chosen as the target.

The target distribution was skewed to the right, driven by some 300 observations in the tail.

Baseline

The mean Precipitation was chosen as the baseline against which to compare model performance. The mean, calculated over the entire data, was 2.787 mm. The Mean Absolute Error (MAE) of always predicting this mean was 3.379 mm. This baseline MAE is used to see how each model fares in predicting precipitation.
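The baseline calculation can be illustrated with a few lines of Python. The values in `y` are made up and are not the project's data; the point is only to show what "MAE of the mean" means.

```python
from statistics import mean

# Made-up daily precipitation values (mm); NOT the project's data.
y = [0.0, 0.0, 0.5, 1.0, 2.5, 12.0]

baseline_prediction = mean(y)  # the same value is predicted for every day
baseline_mae = mean(abs(value - baseline_prediction) for value in y)

print(f"baseline prediction: {baseline_prediction:.3f} mm")
print(f"baseline MAE: {baseline_mae:.3f} mm")
```

A right-skewed target, as here, can give a baseline MAE larger than the mean itself, which matches the pattern of the reported 2.787 mm mean and 3.379 mm MAE.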

Models

For the model pipelines that take more time and resources to fit, the data frame from 2001 - 2020 was split as follows:-

Train - Data from 2008 - 2012
Validate - Data from 2013 only
Test - Data from 2014
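A year-based split like this one can be sketched in plain Python. The rows are made up and `split_by_year` is a hypothetical helper; with pandas, the same thing would typically be a boolean mask on a year column.

```python
# Made-up rows; only the "year" key matters for the split.
rows = [{"year": year, "precipitation": 0.1 * (year - 2000)}
        for year in range(2001, 2021)]


def split_by_year(rows, first_year, last_year):
    """Keep rows whose year falls in [first_year, last_year] (hypothetical helper)."""
    return [row for row in rows if first_year <= row["year"] <= last_year]


train = split_by_year(rows, 2008, 2012)
validate = split_by_year(rows, 2013, 2013)
test = split_by_year(rows, 2014, 2014)

print(len(train), len(validate), len(test))
```

Splitting by year rather than at random keeps later years out of the training set, so the validation and test scores mimic predicting years the model has not seen.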

The various models run and their findings are discussed below:-

1. Ordinal Encoder and RandomForestRegressor pipeline

The code to instantiate and fit the pipeline was as simple as:-

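import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Encode categorical columns as integers, then fit a 100-tree random forest.
pipeline_randomforest_OE = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestRegressor(n_estimators=100, random_state=42, verbose=1, n_jobs=-1)
)
pipeline_randomforest_OE.fit(X_train, y_train)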

Since the data was clean with no missing values, neither imputation nor scaling was used.

Parameters to benchmark the model:-

Parameter	Value
Time to fit the model	22 sec
Training Score (R²)	93.70 %
Validation Score (R²)	44.02 %
Test Score (R²)	44.22 %
Baseline MAE	3.379 mm
Model MAE	2.198 mm
Improvement over Baseline MAE	53.73 %

2. OneHotEncoder and RandomForestRegressor pipeline

The code to instantiate and fit the pipeline was:-

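import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# One 0/1 column per category (named after the category), then a random forest.
pipeline_randomforest_OHE = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    RandomForestRegressor(n_estimators=100, random_state=42, verbose=1, n_jobs=-1)
)
pipeline_randomforest_OHE.fit(X_train, y_train)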

Parameters to benchmark the model:-

Parameter	Value
Time to fit the model	222 sec
Training Score (R²)	93.70 %
Validation Score (R²)	47.76 %
Test Score (R²)	47.30 %
Baseline MAE	3.379 mm
Model MAE	1.965 mm
Improvement over Baseline MAE	71.89 %
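The difference between the two encoders used in pipelines 1 and 2 can be illustrated in plain Python. This sketch does not call category_encoders; the sample codes are made up, and the numbering from 1 in order of first appearance and the `country_code_<value>` column names are assumptions made for illustration.

```python
country_codes = ["IN", "US", "BR", "IN"]  # made-up sample

# Ordinal encoding: one integer per category, in order of first appearance.
ordinal_map = {}
for code in country_codes:
    ordinal_map.setdefault(code, len(ordinal_map) + 1)
ordinal_encoded = [ordinal_map[code] for code in country_codes]

# One-hot encoding: one 0/1 column per category.
categories = list(ordinal_map)
one_hot_encoded = [{f"country_code_{c}": int(code == c) for c in categories}
                   for code in country_codes]

print(ordinal_encoded)
for row in one_hot_encoded:
    print(row)
```

Ordinal encoding keeps `country_code` as a single column, while one-hot encoding adds one column per country. With 240 countries, that plausibly accounts for much of the longer fit time of the second pipeline (222 sec against 22 sec).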

3. OrdinalEncoder and XGBoost pipeline

Before the XGBoost pipeline was instantiated and fit, the train, validation and test datasets were updated as follows:-

Train - Data from 2001 - 2012
Validate - Data from 2013 - 2016
Test - Data from 2017 - 2020

This was done as XGBoost is able to fit the model much faster as compared with the RandomForestRegressor.

The rest was similar to the earlier pipelines. The code to instantiate and fit the pipeline was:-

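import category_encoders as ce
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

# Ordinal-encode categoricals, then fit gradient-boosted trees.
# XGBRegressor takes `verbosity`, not `verbose`, as its logging parameter.
pipeline_xgboost = make_pipeline(
    ce.OrdinalEncoder(),
    XGBRegressor(n_estimators=100, random_state=42, verbosity=1, n_jobs=-1)
)
pipeline_xgboost.fit(X_train, y_train)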

Parameters to benchmark the model:-

Parameter	Value
Time to fit the model	11.47 sec
Training Score (R²)	59.00 %
Validation Score (R²)	48.45 %
Test Score (R²)	35.98 %
Baseline MAE	3.379 mm
Model MAE	2.022 mm
Improvement over Baseline MAE	67.09 %
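The improvement figures in the three tables can be cross-checked from the rounded MAE values. The sketch below is a check and not the notebook's code. It assumes two possible definitions, with hypothetical function names: the ratio form `baseline / model - 1`, which comes close to the reported percentages, and the relative reduction `(baseline - model) / baseline`, which is another common reading of "improvement" and gives smaller numbers.

```python
BASELINE_MAE = 3.379  # mm, from the tables

# Model MAE values (mm) as reported in the three tables.
model_mae = {
    "RandomForest + OrdinalEncoder": 2.198,
    "RandomForest + OneHotEncoder": 1.965,
    "XGBoost + OrdinalEncoder": 2.022,
}


def ratio_improvement(baseline, model):
    """Baseline MAE relative to model MAE, minus one, in percent."""
    return (baseline / model - 1) * 100


def relative_reduction(baseline, model):
    """Share of the baseline MAE removed by the model, in percent."""
    return (baseline - model) / baseline * 100


for name, mae in model_mae.items():
    print(f"{name}: ratio form {ratio_improvement(BASELINE_MAE, mae):.2f} %, "
          f"relative reduction {relative_reduction(BASELINE_MAE, mae):.2f} %")
```

With the rounded inputs, the ratio form gives about 53.7 %, 72.0 % and 67.1 %, against the reported 53.73 %, 71.89 % and 67.09 %; the small gaps are plausibly due to rounding of the MAE values. Read as a relative reduction in MAE, the same results are about 35 %, 42 % and 40 %.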

Analysis

Some correction in the Model MAE was expected when moving to XGBoost, since RandomForestRegressor grows trees without a depth limit by default and tends to overfit. The training score of the RandomForestRegressor (93.70 %) reflects this, while its validation score was much lower. Another thing to note is that the XGBoost model used data covering a longer period than the earlier models. This may be another reason why its improvement over the baseline was lower than that of the previous model.

Features Importances from XGBRegressor

XGBRegressor was used to extract the top 15 features that contribute to the prediction of Precipitation.

Figure: Top 15 features as determined after the XGBRegressor run

Permutation Importance from XGBRegressor

Permutation importance ranks the features by shuffling the values of one feature at a time and measuring how much the model’s performance degrades. Wet bulb temperature still ranks at the top in predicting the Precipitation; however, the ranking has changed for the other features, as shown in the output below.

Figure: Permutation importance of the features, in order of priority
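The idea behind permutation importance can be sketched without any library. Everything here is illustrative: `toy_model` stands in for a fitted model, the data is randomly generated, and `permutation_importance` is a hand-written helper. scikit-learn provides `sklearn.inspection.permutation_importance` for fitted estimators; which tool the notebook used is not stated here.

```python
import random
from statistics import mean


def toy_model(row):
    """Stand-in for a fitted model: depends strongly on wet_bulb_temp only."""
    return 0.3 * row["wet_bulb_temp"] + 0.01 * row["windspeed10m"]


def mae(model, rows, targets):
    return mean(abs(model(row) - target) for row, target in zip(rows, targets))


def permutation_importance(model, rows, targets, feature, n_repeats=10, seed=42):
    """Average increase in MAE after shuffling one feature's column."""
    rng = random.Random(seed)
    base_error = mae(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted_rows = [{**row, feature: value}
                         for row, value in zip(rows, shuffled)]
        increases.append(mae(model, permuted_rows, targets) - base_error)
    return mean(increases)


# Made-up data in which the target follows the toy model exactly.
data_rng = random.Random(0)
rows = [{"wet_bulb_temp": data_rng.uniform(-10, 30),
         "windspeed10m": data_rng.uniform(0, 15)} for _ in range(200)]
targets = [toy_model(row) for row in rows]

importances = {feature: permutation_importance(toy_model, rows, targets, feature)
               for feature in ("wet_bulb_temp", "windspeed10m")}
for feature, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {score:.3f}")
```

Shuffling a feature the model relies on breaks its link to the target and raises the error sharply; shuffling a feature the model barely uses changes little.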

Partial Dependence Plot (PDP)

Partial Dependence Plots were built to see the effect of one feature, and of two features jointly, on the predicted Precipitation. From the PDPs, it can be inferred that the relationship between the features and the target is monotonic.

Figure: Partial Dependence Plot showing the variation of Precipitation with Wet Bulb Temperature

Figure: Partial Dependence Plot showing the variation of Precipitation with Wet Bulb Temperature and Radiative Flux
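How a one-feature partial dependence curve is computed can be sketched as follows. This is an illustration under assumptions: `toy_model` is a made-up stand-in for the fitted model, the four rows are invented, and `partial_dependence` is a hand-written helper and not a library call.

```python
from statistics import mean


def toy_model(row):
    """Stand-in for a fitted model (made up for illustration)."""
    return max(0.0, 0.2 * row["wet_bulb_temp"]) + 0.05 * row["radiative_flux"]


def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        forced_rows = [{**row, feature: value} for row in rows]
        curve.append(mean(model(row) for row in forced_rows))
    return curve


# Made-up observations; only used to average over the other feature.
rows = [{"wet_bulb_temp": t, "radiative_flux": f}
        for t, f in [(-5.0, 6.0), (5.0, 7.5), (15.0, 9.0), (25.0, 10.5)]]

grid = [-10.0, 0.0, 10.0, 20.0, 30.0]
curve = partial_dependence(toy_model, rows, "wet_bulb_temp", grid)
for value, avg_prediction in zip(grid, curve):
    print(f"wet_bulb_temp={value:5.1f} -> average prediction {avg_prediction:.3f}")

# A curve is monotonic (non-decreasing) if it never goes down along the grid.
is_monotonic = all(a <= b for a, b in zip(curve, curve[1:]))
print("non-decreasing:", is_monotonic)
```

The two-feature plot follows the same recipe with a grid over pairs of values, forcing both features at once before averaging.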

Shap Values

Shap values quantify how much each feature contributes to an individual prediction, that is, how far each feature moves the predicted value away from a reference prediction.

Figure: Shap values showing how Precipitation changes with change in features
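The definition behind these values can be made concrete with an exact Shapley computation on a tiny example. This is a sketch under simplifying assumptions: `toy_model` is made up, and a single `baseline` row is used as the reference point, whereas the shap library works from the average prediction over background data and uses faster algorithms for tree models.

```python
from itertools import permutations
from math import factorial


def toy_model(features):
    """Stand-in for a fitted model, with an interaction term (made up)."""
    return (0.2 * features["wet_bulb_temp"]
            + 0.1 * features["radiative_flux"]
            + 0.01 * features["wet_bulb_temp"] * features["radiative_flux"])


def exact_shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings.

    Features not yet "switched on" keep their baseline value.
    """
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous_output = model(current)
        for name in order:
            current[name] = instance[name]
            output = model(current)
            contributions[name] += output - previous_output
            previous_output = output
    n_orderings = factorial(len(names))
    return {name: total / n_orderings for name, total in contributions.items()}


baseline = {"wet_bulb_temp": 10.0, "radiative_flux": 8.0}   # made-up reference point
instance = {"wet_bulb_temp": 25.0, "radiative_flux": 10.0}  # made-up observation

shap_values = exact_shapley_values(toy_model, instance, baseline)
for name, value in shap_values.items():
    print(f"{name}: {value:+.3f}")

# The contributions add up to the gap between the two predictions.
gap = toy_model(instance) - toy_model(baseline)
print(f"sum of contributions: {sum(shap_values.values()):.3f}, gap: {gap:.3f}")
```

The contributions always sum to the difference between the prediction for the observation and the reference prediction, which is what allows a single prediction to be broken down feature by feature.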

Conclusion

Based on all the features used in this research, Temperature was the key feature in predicting the Precipitation of any region. With temperatures rising globally due to global warming, the research in this project suggests that precipitation levels are also likely to increase. If you are interested in testing the XGBoost model to make predictions, use the Heroku app linked at the top of this page.

Sources

¹NASA POWER app

²Latitude Longitude of Countries from albertyw

³Forest Cover Data from World Bank
