Rainfall Prediction Using Python: Data Science Project

Rainfall prediction is a rewarding area to explore in a data science project. By combining programming with data analysis, we can extract patterns from weather records and use them to forecast precipitation. In this article, we walk through the techniques and methodologies involved in rainfall prediction using Python, and then work through a complete code example.

Rainfall Prediction Using Python: Significance

Rainfall plays a pivotal role in various domains, such as agriculture, hydrology, and disaster management. Accurate prediction of rainfall patterns can help farmers plan their crop cycles effectively, assist in water resource management, and enhance overall disaster preparedness. Employing data science projects to predict rainfall empowers us to make informed decisions, mitigate risks, and optimize resource allocation.

Data Collection and Preprocessing

Before we can predict rainfall, we need reliable and relevant data. Meteorological agencies and weather stations provide historical rainfall data, a foundation for our analysis. Additionally, factors such as temperature, humidity, wind speed, and atmospheric pressure can contribute to accurate predictions. By collecting and preprocessing this data, we can ensure its suitability for modeling and analysis.

Exploratory Data Analysis and Feature Engineering

Once we have the data, exploratory data analysis (EDA) allows us to gain insights and understand the underlying patterns. Visualizing the data through graphs and statistical summaries helps identify trends, seasonality, and correlations with other variables. Feature engineering, a crucial step in the prediction process, involves transforming raw data into meaningful predictors. Features like temporal patterns, climatic indices, and geographical factors can significantly enhance the accuracy of our predictions.
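
As a rough illustration of the feature engineering described above, the sketch below derives a few temporal features with pandas. The data is synthetic and the column names (`month_sin`, `rain_lag1`, `rain_roll7`) are made up for this example; they are not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily rainfall (mm); a real project would load station records instead.
rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", periods=60, freq="D")
weather = pd.DataFrame({"Date": dates, "Rainfall": rng.gamma(0.5, 4.0, size=len(dates))})

# Temporal pattern: encode the month on a circle so December and January stay close.
month = weather["Date"].dt.month
weather["month_sin"] = np.sin(2 * np.pi * month / 12)
weather["month_cos"] = np.cos(2 * np.pi * month / 12)

# Lag and rolling features use only earlier days, so no future information leaks in.
weather["rain_lag1"] = weather["Rainfall"].shift(1)
weather["rain_roll7"] = weather["Rainfall"].shift(1).rolling(7).mean()

# The first rows have no complete history, so they are dropped.
features = weather.dropna().reset_index(drop=True)
print(features.head())
```

Shifting by one day before taking the rolling mean is a deliberate choice here: each row then describes the week before it, which is what would be available at prediction time.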

Model Building and Evaluation

With preprocessed data and well-engineered features, we can build predictive models. Python offers an extensive range of libraries, such as scikit-learn and TensorFlow, which provide machine-learning algorithms suitable for rainfall prediction. Regression models, time series analysis, and neural networks are among the commonly employed techniques. When the target is a rainfall amount, models are evaluated with metrics like mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R-squared); when the target is a yes/no outcome, as in the source code later in this article, accuracy, precision, and recall are used instead.
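
To make the regression metrics concrete, here is a small worked example with NumPy. The observed and predicted values are invented for illustration only.

```python
import numpy as np

y_true = np.array([2.0, 0.0, 5.0, 1.0])   # observed rainfall (mm), made-up values
y_pred = np.array([1.5, 0.5, 4.0, 1.0])   # model output for the same days

errors = y_true - y_pred
mse = np.mean(errors ** 2)                 # mean squared error
rmse = np.sqrt(mse)                        # same units as the target (mm)
ss_res = np.sum(errors ** 2)               # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                   # coefficient of determination

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
# MSE=0.375  RMSE=0.612  R2=0.893
```

RMSE is often easier to communicate than MSE because it is expressed in millimetres rather than squared millimetres.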

Incorporating Advanced Techniques

To further enhance rainfall prediction accuracy, advanced techniques can be employed. For example, ensemble methods like Random Forest and Gradient Boosting combine multiple models to produce more robust predictions. Deep learning models, such as recurrent neural networks (RNNs) or long short-term memory (LSTM), are adept at capturing temporal dependencies and are particularly effective for time series data like rainfall. By incorporating these techniques, we can continually refine and improve our predictions.

Rainfall Prediction Using Python: Feature Selection and Dimensionality Reduction

When building rainfall prediction models, feature selection and dimensionality reduction techniques are crucial in improving model efficiency and accuracy. Let’s explore three essential methods in this domain:

Rainfall Prediction Using Python: Correlation Analysis

Correlation analysis helps us identify the relationships between different variables and their impact on rainfall prediction. By calculating correlation coefficients, such as Pearson’s correlation coefficient, we can assess the strength and direction of these relationships. This analysis enables us to select the most influential features for our models.
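
A minimal sketch of this idea, assuming a small synthetic table in which rainfall is driven mostly by humidity (an assumption made purely so the example has something to find):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
obs = pd.DataFrame({
    "Humidity": rng.uniform(20, 100, n),
    "Temperature": rng.normal(20, 5, n),
    "WindSpeed": rng.normal(15, 5, n),
})
# Assumed relationship for illustration only: rainfall depends mainly on humidity.
obs["Rainfall"] = 0.1 * obs["Humidity"] - 0.05 * obs["Temperature"] + rng.normal(0, 1, n)

# Pearson correlation of each predictor with the target, strongest first.
corr_with_rain = obs.corr(method="pearson")["Rainfall"].drop("Rainfall")
order = corr_with_rain.abs().sort_values(ascending=False).index
ranked = corr_with_rain.reindex(order)
print(ranked)
```

Ranking by absolute value keeps strongly negative relationships near the top, since they can be as useful to a model as positive ones. Pearson's coefficient only measures linear association, so a low value does not rule out a nonlinear relationship.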

Principal Component Analysis (PCA)

PCA is a widely used dimensionality reduction technique that transforms a high-dimensional dataset into a lower-dimensional space. By identifying the principal components that capture the most variance in the data, PCA allows us to reduce the complexity of the feature space. In rainfall prediction models, applying PCA can improve computational efficiency and replace groups of highly correlated features (for example, morning and afternoon temperature readings) with a smaller set of uncorrelated components.
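
The following sketch shows one way this could look with scikit-learn. The six columns are synthetic, built from two hidden factors so that there is real redundancy for PCA to remove; the 95% variance threshold is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
heat = rng.normal(size=n)       # hidden "temperature" factor (synthetic)
moisture = rng.normal(size=n)   # hidden "moisture" factor (synthetic)

def noisy(signal):
    """Return the signal plus a little independent measurement noise."""
    return signal + rng.normal(scale=0.2, size=n)

# Six correlated columns derived from only two underlying factors.
X = np.column_stack([
    noisy(heat), noisy(heat), noisy(heat),
    noisy(moisture), noisy(moisture), noisy(moisture),
])

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)    # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

Passing a fraction to `n_components` lets PCA choose the number of components, which avoids hard-coding a value before looking at the data.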

Feature Importance Ranking

Feature importance ranking techniques help us prioritize the most significant predictors in rainfall prediction. By using methods like recursive feature elimination (RFE) or feature importance scores from ensemble models, such as random forests, we can determine the relative importance of each feature. This ranking helps us focus on the most relevant predictors, leading to better-performing models.
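
As a hedged example of importance scores from an ensemble model, the sketch below fits a random forest on synthetic data. The rule that links humidity to rain is an assumption made for the demonstration, and `Noise_a` and `Noise_b` are deliberately uninformative columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "Humidity3pm": rng.uniform(0, 100, n),
    "Noise_a": rng.normal(size=n),
    "Noise_b": rng.normal(size=n),
})
# Assumed rule for illustration: rain is likelier when afternoon humidity is high.
y = (X["Humidity3pm"] + rng.normal(0, 10, n) > 60).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances are quick to obtain but can favor features with many distinct values, so they are best treated as a first look rather than a final ranking.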

Model Evaluation and Fine-tuning

Evaluating and fine-tuning our rainfall prediction models is crucial to ensure their accuracy and reliability. Let’s delve into three key aspects of model evaluation and fine-tuning:

Rainfall Prediction Using Python: Cross-Validation

Cross-validation techniques, such as k-fold cross-validation, provide a robust way to evaluate the performance of our models. We can obtain reliable performance estimates by dividing the dataset into multiple subsets and iteratively training and testing the model. Cross-validation helps us detect potential overfitting and assess the generalizability of our rainfall prediction models.
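
A minimal k-fold sketch with scikit-learn follows. It uses a generated classification dataset and a logistic regression model as stand-ins; neither is the model or data used in the project code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data: 300 samples, 6 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# Five stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.round(3), "mean:", scores.mean().round(3))
```

A large gap between fold scores, or between training accuracy and these scores, is a hint of overfitting. For data ordered in time, scikit-learn also provides `TimeSeriesSplit`, which keeps each validation fold after its training fold.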

Rainfall Prediction Using Python: Hyperparameter Tuning

Hyperparameter tuning involves optimizing the model’s hyperparameters to enhance its performance. Hyperparameters, like learning rate, regularization strength, and model complexity, are set before training the model. Techniques such as grid search and random search allow us to systematically explore different hyperparameter combinations and select the ones that yield the best performance on validation data.
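
The sketch below shows a small grid search with scikit-learn. The decision tree, the grid values, and the generated dataset are all illustrative choices, picked to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data for the demonstration.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# 3 x 2 = 6 combinations, each scored with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Grid search cost grows multiplicatively with each added parameter, which is why random search is often preferred once the grid gets large.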

Rainfall Prediction Using Python: Model Interpretability

Interpreting the trained rainfall prediction model is essential for gaining insights and building trust. Techniques like feature importance plots, partial dependence plots, and SHAP values help us understand the model’s behavior and attribute the contributions of each feature. Model interpretability enhances our understanding of the factors influencing rainfall prediction and provides valuable insights for decision-making.
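
SHAP requires an additional package, so the sketch below uses permutation importance from scikit-learn as a stand-in. It is a different technique from the ones named above, but it answers a similar question: how much does held-out accuracy drop when one feature is shuffled? The data and the humidity rule are synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 400
X = pd.DataFrame({
    "Humidity3pm": rng.uniform(0, 100, n),
    "Pressure3pm": rng.normal(1015, 7, n),   # unrelated to the target in this toy data
    "Noise": rng.normal(size=n),
})
# Assumed rule for illustration: rain depends only on humidity plus noise.
y = (X["Humidity3pm"] + rng.normal(0, 10, n) > 60).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set 10 times and record the mean score drop.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm)
```

Because the importance is measured on held-out data, a feature the model memorized but that does not generalize will score low here.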

Rainfall Prediction Using Python: Data Visualization Techniques for Rainfall Analysis

To gain a comprehensive understanding of rainfall patterns and trends, data visualization techniques are invaluable. Let’s explore three powerful methods for visualizing rainfall data and extracting meaningful insights:

Time Series Plots

Time series plots allow us to observe the temporal variation in rainfall data over a specific period. By plotting the rainfall values against time, we can identify patterns, seasonality, and long-term trends. These plots provide a visual representation of how rainfall changes over different time intervals, aiding in the detection of cyclic and recurring patterns.
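
Here is one possible sketch of such a plot, using a synthetic two-year daily series with a made-up seasonal cycle. The monthly totals and the 30-day moving average are two common ways to make the seasonality easier to see.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
days = pd.date_range("2020-01-01", "2021-12-31", freq="D")

# Made-up seasonal signal between 1 and 5, used as the scale of a gamma draw.
seasonal = 3 + 2 * np.sin(2 * np.pi * (days.dayofyear.to_numpy() - 80) / 365)
daily = pd.Series(rng.gamma(shape=0.8, scale=seasonal), index=days, name="Rainfall")

monthly = daily.resample("MS").sum()      # total rainfall per calendar month
smooth = daily.rolling(window=30).mean()  # 30-day moving average

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(daily.index, daily, alpha=0.3, label="daily")
ax.plot(smooth.index, smooth, label="30-day mean")
ax.set_xlabel("date")
ax.set_ylabel("rainfall (mm)")
ax.legend()

print(monthly.head())
```

In a notebook the figure is displayed automatically; in a script, call `plt.show()` after building it.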

Heatmaps and Geographic Visualizations

Heatmaps and geographic visualizations provide a spatial perspective on rainfall patterns. By plotting rainfall values on a map, we can observe geographical variations and identify regions with higher or lower rainfall intensity. These visualizations help in identifying areas prone to heavy rainfall or drought, assisting in better resource allocation and decision-making for various sectors.
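
A heatmap is usually drawn from a location-by-period table. The sketch below builds that table with pandas from synthetic records; the three station names are examples and the rainfall values are random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
days = pd.date_range("2021-01-01", "2021-12-31", freq="D")
locations = ["Sydney", "Perth", "Darwin"]   # example station names

# One synthetic rainfall value per location per day.
records = pd.DataFrame({
    "Location": np.repeat(locations, len(days)),
    "Date": np.tile(days, len(locations)),
    "Rainfall": rng.gamma(0.6, 5.0, size=len(days) * len(locations)),
})
records["Month"] = records["Date"].dt.month

# Rows = locations, columns = months, cells = mean daily rainfall.
grid = records.pivot_table(
    index="Location", columns="Month", values="Rainfall", aggfunc="mean"
)
print(grid.round(1))
```

The resulting table can be passed directly to seaborn, for example `sns.heatmap(grid)`, to color each cell by its value.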

Scatter Plots with Additional Variables

Scatter plots with additional variables allow us to explore the relationship between rainfall and related factors. By plotting rainfall against variables like temperature, humidity, or wind speed, we can identify correlations or patterns. These scatter plots help in understanding the influence of these variables on rainfall and can be useful for predicting rainfall based on the values of related factors.

Advanced Machine Learning Techniques for Rainfall Prediction

To further enhance the accuracy and predictive capabilities of rainfall models, advanced machine learning techniques can be employed. Let’s delve into three powerful techniques that have shown promise in rainfall prediction:

Long Short-Term Memory (LSTM) Networks

LSTM networks are a type of recurrent neural network (RNN) known for their ability to capture long-term dependencies in sequential data. In the context of rainfall prediction, LSTM networks can effectively model temporal patterns and capture the complex relationships between past rainfall observations. Their memory cells allow them to retain important historical information, enabling accurate predictions.
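
Before an LSTM can be trained, the series has to be cut into fixed-length windows of past observations. The helper below is a hypothetical sketch of that preprocessing step in NumPy (the name `make_windows` is invented here); it produces inputs shaped (samples, timesteps, features), which is the layout recurrent layers in Keras expect.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype="float32")
    # Each sample holds `window` consecutive values; the target is the value that follows.
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y

rain = np.arange(10, dtype="float32")   # stand-in for 10 days of rainfall
X, y = make_windows(rain, window=3)
print(X.shape, y.shape)   # (7, 3, 1) (7,)
```

With a window of 3, the first sample contains days 0 to 2 and its target is day 3, so the network always predicts from strictly earlier observations.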

XGBoost: Extreme Gradient Boosting

XGBoost is a popular gradient-boosting algorithm that excels in handling structured data. It combines multiple weak prediction models to create a powerful ensemble model. In rainfall prediction, XGBoost can leverage the strengths of decision trees and gradient boosting to capture nonlinear relationships between predictors and rainfall. It is particularly effective in handling large feature sets and producing highly accurate predictions.

Convolutional Neural Networks (CNNs)

While commonly used in computer vision tasks, CNNs can also be adapted for rainfall prediction. By treating rainfall data as a two-dimensional grid, CNNs can learn spatial patterns and capture local relationships. They can identify features in rainfall images and make predictions based on these learned patterns. CNNs offer a unique approach to rainfall prediction, especially when data has a spatial component.

Rainfall Prediction Using Python: Source Code

All the code and datasets are available on GitHub. You can visit and download the project.

1: Import Data and Required Packages

import warnings
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import ConfusionMatrixDisplay

2: Reading Data

df = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')

3: Data Understanding

df.head()
df.shape

4: Data Checks to perform

  • Check Missing values
  • Check Duplicates
  • Check data type
  • Check the number of unique values in each column
  • Check the statistics of the data set
  • Check the categories present in each categorical column

4.1: Check Missing values

round(df.isna().sum()/df.shape[0]*100 , 4)

4.2: Check Duplicates

df.duplicated().sum()

4.3: Check data types

df.info()

4.4: Checking the number of unique values of each column

df.nunique()

4.5: Check the statistics of the data set

df.describe().T

5: Exploratory Data Analysis (EDA)

5.1: Categorical variables

cat_df = df.select_dtypes('object')
cat_df.head()
cat_df.nunique()
cat_df.isna().mean()*100

5.2: RainTomorrow

df.dropna(subset=['RainTomorrow'], inplace=True)
d = df['RainTomorrow'].value_counts()
labels = list(d.index)
d
plt.pie(d, labels=labels, autopct='%1.3f%%')
plt.show()

5.3: RainToday

df.dropna(subset=['RainToday'], inplace=True)
d = df['RainToday'].value_counts()
labels = list(d.index)
d
plt.pie(d, labels=labels, autopct='%1.3f%%')
plt.show()

You can see that the distribution of RainTomorrow and RainToday is almost the same in this dataset.

5.4: Date

df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df.drop('Date', axis=1, inplace=True)
df.head()

5.5: Location

df['Location'].unique()

6: Numerical Variables

num_df = df.select_dtypes(exclude='object')
num_df.head()
def plot_box(columns, data, row=3, col=3):
    """Draw one box plot per column on a row x col grid."""
    for i, column in enumerate(columns):
        plt.subplot(row, col, i + 1)
        # Use the DataFrame passed in rather than the global df
        data.boxplot(column)
        plt.title(column, fontsize=20)
    plt.show()
column_to_category = ['Rainfall', 'Evaporation', 'WindSpeed3pm', 'WindSpeed9am']
plt.figure(figsize=(25, 15))
plot_box(column_to_category, df, row=1, col=4)

7: Impute and encoding

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, PowerTransformer

7.1: Impute null values

cat_df = df.select_dtypes('object')
num_df = df.select_dtypes(exclude='object')

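7.2: Fix categorical null values

cat_df.head()
# Fill missing values in categorical columns with the most frequent value (mode).
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions deprecate.
for i in cat_df.columns:
    df[i] = df[i].fillna(df[i].mode()[0])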

7.3: Fix numerical null values

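num_df.head()
# Fill missing values in numerical columns with the column median.
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions deprecate.
for i in num_df.columns:
    df[i] = df[i].fillna(df[i].median())
df.isna().sum()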

7.4: Encoding

df['RainToday'] = df['RainToday'].map({'Yes': 1, 'No': 0})
df['RainTomorrow'] = df['RainTomorrow'].map({'Yes': 1, 'No': 0})
cat_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']
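# dtype=int keeps the dummy columns numeric (pandas 2.x defaults to bool)
dummy_features = pd.get_dummies(df[cat_columns], drop_first=True, dtype=int)

# Append the dummy variables, then drop the original categorical columns
df = pd.concat([df, dummy_features], axis=1)
df = df.drop(cat_columns, axis=1)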

df.head()

8: Outlier

a1 = ['MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine', 'WindGustSpeed',
      'WindSpeed9am', 'WindSpeed3pm']
a2 = ['Pressure9am', 'Pressure3pm', 'Rainfall']
a3 = ['Humidity9am', 'Humidity3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']

# Box plots for each group of columns; all three groups show many outliers
for group in (a1, a2, a3):
    plt.figure(figsize=(20, 10))
    sns.boxplot(data=df[group])
    plt.xticks(rotation=90)
    plt.show()
def plot_hist(column_to_category, data, row=3, col=3):
    """Plot a histogram with a KDE curve for each column on a row x col grid."""
    for i in range(0, len(column_to_category)):
        plt.subplot(row, col, i+1)
        sns.histplot(data=data, x=column_to_category[i], kde=True)
        plt.title(f"{column_to_category[i]}", fontsize=20)
    plt.show()
sk1 = ['MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine','WindGustSpeed',
      'WindSpeed9am', 'WindSpeed3pm', 'Rainfall']
sk2 = ['Pressure9am', 'Pressure3pm', 'Humidity9am','Humidity3pm', 'Cloud9am', 'Cloud3pm','Temp9am', 'Temp3pm']
plt.figure(figsize=(25,10))
plot_hist(sk1, df, row=2, col=4)
plt.figure(figsize=(25,10))
plot_hist(sk2, df, row=2, col=4)
df_clean = df.copy()
df_clean.shape
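def remove_out(df_clean, num_cols, lbv=0.25, hbv=0.75):
    """Drop rows where any column in num_cols lies outside the IQR-based bounds.

    lbv and hbv are the lower and upper quantiles used in place of Q1 and Q3.
    """
    Q1 = df_clean[num_cols].quantile(lbv)
    Q3 = df_clean[num_cols].quantile(hbv)
    IQR = Q3-Q1
    lb = Q1-1.5*IQR
    hb = Q3+1.5*IQR
    # Bounds are computed once on the full data, then applied column by column
    for i in num_cols:
        df_clean = df_clean[(df_clean[i]>=lb[i]) & (df_clean[i]<=hb[i])]
    return df_clean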
cols = ['MinTemp', 'MaxTemp', 'Rainfall',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day', 'Evaporation', 'Sunshine']
df_clean = remove_out(df_clean, cols, lbv=0.10, hbv=0.90)
df_clean.shape

9: Split Data

d = df_clean['RainTomorrow'].value_counts()
labels = list(d.index)
d
plt.pie(d, labels=labels, autopct='%1.3f%%')
plt.show()
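X = df_clean.drop(['RainTomorrow'], axis=1)
y = df_clean['RainTomorrow']

# Split first, so the test set contains only real observations
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=123
)

# Oversample the minority class on the training split only;
# resampling before the split would leak synthetic samples into the test set
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=123)
X_train, y_train = sm.fit_resample(X_train, y_train)
y_train.value_counts()
X_train.shape, X_test.shape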

10: Scaling

cols = list(num_df.columns)
pt = PowerTransformer()
X_train[cols] = pt.fit_transform(X_train[cols])
X_test[cols] = pt.transform(X_test[cols])

11: Rainfall Prediction Using Python: Model

xgb_cl = XGBClassifier(
    learning_rate=0.1,
    n_estimators=120,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.9
)
xgb_cl.fit(X_train,y_train)
y_pred_test = xgb_cl.predict(X_test)

accuracy_score(y_test, y_pred_test)

12: Predict

cm = confusion_matrix(y_test, y_pred_test)
cm
print(classification_report(y_test, y_pred_test))
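
To help read the confusion matrix and the classification report, the toy example below derives precision, recall, and accuracy by hand. The labels are invented (1 = rain) and the variable names are chosen so they do not clash with the project code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])   # made-up actual labels
toy_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])   # made-up predictions

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)            # of the predicted rainy days, how many were rainy
recall = tp / (tp + fn)               # of the rainy days, how many were caught
accuracy = (tp + tn) / len(toy_true)

print(tn, fp, fn, tp)                 # 4 1 1 2
print(round(precision, 3), round(recall, 3), accuracy)   # 0.667 0.667 0.75
```

On imbalanced data such as rain/no-rain, precision and recall for the rainy class are usually more informative than accuracy alone.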

13: Rainfall Prediction Using Python: Deep Learning Model

from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import callbacks
from tensorflow.keras.metrics import Precision, Recall
early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, 
    patience=10,
)
model=Sequential()
model.add(Dense(256,input_dim=X_train.shape[1],activation='relu'))
model.add(Dropout(0.3))

model.add(Dense(128,activation='relu'))
model.add(Dropout(0.4))

model.add(Dense(64,activation='relu'))
model.add(Dropout(0.3))

model.add(Dense(1,activation='sigmoid'))
model.compile(
    loss='binary_crossentropy',
    optimizer='Adam',
    metrics=['accuracy', Precision(), Recall()]
)
history = model.fit(
    X_train, y_train,
    epochs=30,
    validation_split=0.2,
    batch_size=64,
    callbacks=[early_stopping],
    verbose=1
)
def plot_training_hist(history):
    '''Function to plot history for accuracy and loss'''
    
    fig, ax = plt.subplots(1, 2, figsize=(10,4))
    # first plot
    ax[0].plot(history.history['accuracy'])
    ax[0].plot(history.history['val_accuracy'])
    ax[0].set_title('Model Accuracy')
    ax[0].set_xlabel('epoch')
    ax[0].set_ylabel('accuracy')
    ax[0].legend(['train', 'validation'], loc='best')
    # second plot
    ax[1].plot(history.history['loss'])
    ax[1].plot(history.history['val_loss'])
    ax[1].set_title('Model Loss')
    ax[1].set_xlabel('epoch')
    ax[1].set_ylabel('loss')
    ax[1].legend(['train', 'validation'], loc='best')
plot_training_hist(history)
y_pred = model.predict(X_test)
# Convert predicted probabilities to class labels using a 0.5 threshold
y_pred = np.squeeze(np.where(y_pred > 0.5, 1, 0))
y_pred
print(classification_report(y_test.values, y_pred))

Conclusion

In conclusion, the fusion of data science and Python empowers us to harness the vast potential of rainfall prediction. Through this article, we have explored the fundamental steps involved in the process, from data collection and preprocessing to model building and evaluation. By utilizing precipitation data and implementing predictive models, we can make accurate predictions and gain valuable insights into rainfall patterns. This knowledge enables us to optimize resource allocation, enhance decision-making, and contribute to the sustainable development of various sectors reliant on rainfall information.

Data science projects provide a unique opportunity to uncover hidden trends and relationships within rainfall data. By leveraging programming, data analysis, and machine learning techniques, we can extract meaningful information from vast datasets. These projects allow us to delve into the intricate details of precipitation patterns, taking into account factors such as temperature, humidity, wind speed, and atmospheric pressure. By exploring correlations between these related variables, we can identify valuable predictors and refine our models for improved accuracy in rainfall prediction.
