Published in Hydroinformatics

Towards urban flood susceptibility mapping using machine and deep learning models (part 3): Random forest model

In the last article, we prepared a dataset to map urban flood susceptibility using point-based models such as random forest (RF), support vector machine (SVM) and artificial neural network (ANN). This article shows how to develop the models and use the trained model to map urban flood susceptibility. This series of articles summarizes and explains (with Python code) the paper “Towards urban flood susceptibility mapping using data-driven models in Berlin, Germany”, published in Geomatics, Natural Hazards and Risk.

Deep learning is a subset of machine learning that uses neural networks to mimic the learning process of the human brain. The literature is rich with papers showing deep learning models outperforming traditional machine learning models. Deep learning models have shown their superiority in fields where large datasets are available. However, machine learning models are preferable when the data size is small.

It is challenging to collect a reliable flood inventory to map flood susceptibility (e.g., Termeh et al., 2018 collected 53 flooded locations in an area of 5737 km²; Choubin et al., 2019: 51 locations in 126 km²; Zhao et al., 2020: 216 locations in 131 km²). Recent studies showed that machine learning models outperform deep learning models on such small, tabular datasets (Grinsztajn et al., 2022; Shwartz-Ziv and Armon, 2022). Therefore, it is logical that machine learning models are more suitable for flood susceptibility mapping than deep learning models.

Random forest

The random forest model consists of several individual decision trees. It draws several sub-samples from the input dataset and trains a decision tree on each sub-sample. The final result is the majority vote of all the decision trees (see figure below).

Random forest model
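To make the idea above concrete, here is a minimal sketch (for illustration only, not the model used later): each tree is trained on a bootstrap sub-sample of the rows, and the forest takes the majority vote of all trees. Note that scikit-learn’s RandomForestClassifier additionally samples features at each split, which this sketch omits.

# Illustration of bagging + majority voting (a simplified random forest)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_forest_predict(X_train, y_train, X_new, n_trees=10, seed=42):
    rng = np.random.default_rng(seed)
    X_arr, y_arr = np.asarray(X_train), np.asarray(y_train)
    votes = []
    for _ in range(n_trees):
        # draw a bootstrap sub-sample (rows sampled with replacement)
        idx = rng.integers(0, len(X_arr), len(X_arr))
        tree = DecisionTreeClassifier(random_state=seed).fit(X_arr[idx], y_arr[idx])
        votes.append(tree.predict(X_new))
    # majority vote over all trees (0 = not flooded, 1 = flooded)
    return (np.mean(votes, axis=0) >= 0.5).astype(int)

In practice we simply use scikit-learn’s RandomForestClassifier, starting by loading the dataset prepared in the last article.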
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import geopandas as gpd

# Read the shapefile or pickle which we created in the last article
df=gpd.read_file("points_data.shp")
# df=pd.read_pickle("points_data.pkl") # in case of pickle
df.head()

# check that there are no NoData values in the dataset
print(df.isnull().sum())
# df = df.dropna() # use this to remove rows with NoData values

# Understand the data
# Here we can see that we have a balanced dataset (equal number of flooded and non-flooded locations)
sns.countplot(x="Label", data=df) # 0 = Not flooded, 1 = Flooded


# show the correlation matrix for the dataset (numeric predictive features only)
corrMatrix = df.drop(columns="geometry").corr()
fig, ax = plt.subplots(figsize=(10,10)) # Sample figsize in inches
#sns.heatmap(df.iloc[:, 1:6:], annot=True, linewidths=.5, ax=ax)
sns.heatmap(corrMatrix, annot=True, linewidths=.5, ax=ax)

We have now read the dataset, checked for NoData values (removing them if necessary) and had a look at the correlation between the predictive features. Your data frame should look like this.

The prepared dataset from the last article. The values in the table are normalized (between 0 and 1) because I used the same dataset with an artificial neural network. However, it is not necessary to normalize the data for random forests.
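For completeness, a min-max scaling like the one used when preparing the data could be reproduced with scikit-learn as below. This is an illustration only: the dataset from the last article is already scaled, and random forests work equally well on unscaled values.

# Illustration: min-max scaling of the predictive features to [0, 1]
from sklearn.preprocessing import MinMaxScaler

feature_cols = [c for c in df.columns if c not in ("Label", "geometry")]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])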

The dataset needs to be split into dependent and independent variables. The dependent variable is the variable that needs to be predicted (column name = Label), while the independent variables are the predictive features. The values in the Label column are 1 for flooded locations and 0 for non-flooded locations. The geometry column holds the longitude and latitude of each point and is automatically included because the dataset was originally a point shapefile. We don’t need it.

#Define the dependent variable that needs to be predicted (labels)
Y = df["Label"].values

#Define the independent variables. Let's also drop geometry and label
X = df.drop(labels = ["Label", "geometry"], axis=1)
features_list = list(X.columns) #List features so we can rank their importance later

#Split data into training (60 %), validation (20 %) and testing (20 %) sets to verify accuracy after fitting the model.
# training data is used to train the model
# validation data is used for hyperparameter tuning
# testing data is used to test the model

from sklearn.model_selection import train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, Y, test_size=0.2,shuffle=True, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25,shuffle=True, random_state=42) # 0.25 of the remaining 80 % = 20 % of the full dataset
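A quick sanity check that the proportions come out as intended (roughly 60/20/20):

# Check the size of each split
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(f"{name}: {len(part)} samples ({len(part) / len(X):.0%})")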

Now we can train the random forest model. The model can be used for both classification and regression problems. Flood susceptibility mapping is a classification problem, as mentioned before.

#RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state = 42) # I am using the default values of the parameters.

# Train the model on training data
model.fit(X_train, y_train)

# make predictions for the test dataset
prediction = model.predict(X_test)

# The prediction values are either 1 (Flooded) or 0 (Non-Flooded)
prediction
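Before looking at the ROC curve, the test predictions can be summarised with the overall accuracy and the confusion matrix:

# Overall accuracy and confusion matrix on the test set
from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction)) # rows = true class, columns = predicted class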

# The AUC is considered one of the best performance indices
# We can plot the ROC curve and calculate the AUC
# Note: plot_roc_curve was removed in recent scikit-learn versions; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
model_disp = RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax, alpha=0.8)
plt.show()
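The AUC value itself can also be computed directly from the predicted probabilities of the flooded class:

# Calculate the AUC from the predicted probabilities of class 1 (flooded)
from sklearn.metrics import roc_auc_score

print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))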

The random forest model has a built-in feature importance measure, which is implemented in the scikit-learn Python module. Hence, we can estimate which predictive features influence the model prediction the most.

# Estimate the feature importance
feature_imp = pd.Series(model.feature_importances_, index=features_list).sort_values(ascending=False)
print(feature_imp)

# Plot the feature importance
feature_imp.plot.bar()
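The validation split set aside earlier was meant for hyperparameter tuning. A minimal sketch of how it could be used is shown below; the parameter grid is only a hypothetical example, not the one used in the paper.

# Simple grid search over a few (hypothetical) hyperparameter values,
# picking the combination that scores best on the validation set
best_score, best_params = 0, None
for n_estimators in (100, 300, 500):
    for max_depth in (None, 5, 10):
        candidate = RandomForestClassifier(n_estimators=n_estimators,
                                           max_depth=max_depth,
                                           random_state=42)
        candidate.fit(X_train, y_train)
        score = candidate.score(X_val, y_val) # accuracy on the validation set
        if score > best_score:
            best_score, best_params = score, {"n_estimators": n_estimators, "max_depth": max_depth}
print(best_params, best_score)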

Once you are satisfied with the model performance, you can use the model to map flood susceptibility for your whole study area. We used the trained model to map flood susceptibility in Berlin. First, we need a point shapefile for the whole study area, prepared in the same way as the training, validation and testing dataset in the last article.

# Read shapefile for the whole study area
df_SA=gpd.read_file("Study_area.shp")
df_SA.head() # make sure that the dataset has the same column arrangement as the training dataset

X_SA = df_SA.drop(labels = ["geometry"], axis=1) # we need to remove all the columns except the predictive features
X_SA.head()

prediction_SA = model.predict(X_SA) # predict if the location is flooded (1) or not flooded (0)

# In order to map the flood susceptibility we need to calculate the probability of being flooded
prediction_prob=model.predict_proba(X_SA) # This function returns an array with one row per point
# each row has two values [probability of being not flooded, probability of being flooded]

# We need only the probability of being flooded
# We add the value corresponding to each point to the dataframe

df_SA['FSM']= prediction_prob[:,1]

Now we have a point shapefile which holds the flood susceptibility of each location. We need to convert it to a raster. There are many options: we can do this step in ArcMap or QGIS, or we can continue in Python.

# Save the dataframe to a shapefile in case of converting the points to raster using QGIS or ArcMap
df_SA.to_file("FSM.shp")
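If you prefer to stay in Python, one possible way is to rasterise the points with rasterio. This sketch assumes the points lie on a regular grid; the 10 m cell size is just a placeholder for whatever spacing your point grid actually has.

# Sketch: burn the FSM value of each point into a GeoTIFF, assuming a regular point grid
import rasterio
from rasterio.transform import from_origin
from rasterio.features import rasterize

cell_size = 10  # assumed grid spacing in metres; adjust to your dataset
xmin, ymin, xmax, ymax = df_SA.total_bounds
width = int(round((xmax - xmin) / cell_size)) + 1
height = int(round((ymax - ymin) / cell_size)) + 1
transform = from_origin(xmin - cell_size / 2, ymax + cell_size / 2, cell_size, cell_size)

# Each point is burned into the pixel that contains it
shapes = zip(df_SA.geometry, df_SA["FSM"])
raster = rasterize(shapes, out_shape=(height, width), transform=transform,
                   fill=-9999, dtype="float32")

with rasterio.open("FSM.tif", "w", driver="GTiff", height=height, width=width,
                   count=1, dtype="float32", crs=df_SA.crs, transform=transform,
                   nodata=-9999) as dst:
    dst.write(raster, 1)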

Your final product would be a map like this:

References

Shwartz-Ziv, R., & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81, 84–90.

Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data?. arXiv preprint arXiv:2207.08815.
