Hands-on Big Data Analysis on GCP Using AI Platform Notebooks

Introduction

In the previous article, we learned a little about big data, but only in theory. How do we actually put it into practice? This article should help. When you code in Python, there are many IDEs to choose from. Jupyter Notebook is one of the most popular: it is an open-source web application that lets you create and share documents containing live code, equations, visualizations and narrative text. You could, of course, run it locally. However, not everyone has a powerful computer or a GPU, and although Jupyter offers hosted notebooks, you cannot control the resources allocated to you. Instead, I will use AI Platform Notebooks on GCP in this article. From my personal experience, a GCP notebook gives me two big advantages.

  1. You could edit your instance size to fit your requirement.
  2. You could do data analysis anywhere.

Environment Setup

Actually, it is really easy to set up a Notebook on GCP.

Go to the GCP Console, then search for "AI Platform" and click it.

On the left panel, choose “Notebooks”.

Click “New Instance”.

Choose Without GPUs under TensorFlow Enterprise 1.15.

In fact, GCP Notebooks already provide some built-in libraries, i.e. we do not have to install TensorFlow in this tutorial. Of course, you could create your own instance by clicking "Customize instance" at the top if you want, but the preset configuration is good enough for this tutorial.

Keep everything at its default and click "Create".

After a few minutes, the instance is created. Click "Open JupyterLab".

If you see this page, it means our setup is successful.
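Once JupyterLab is open, you can verify the pre-installed libraries mentioned earlier with a quick cell like the one below. This is only a minimal sanity check; the exact versions depend on the image you chose.

import tensorflow as tf
import pandas as pd
import sklearn

# the TensorFlow Enterprise 1.15 image should report a 1.15.x TensorFlow build
print(tf.__version__, pd.__version__, sklearn.__version__)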

We can clone our GitHub repository directly, as shown in the figure above. I have prepared all the code; just enter:

https://github.com/manbobo2002/gcp-bigdata.git
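Alternatively, you can clone the repository from a notebook cell or the JupyterLab terminal. A minimal sketch (the ! prefix is Jupyter's shell escape):

# clone the tutorial repository into the notebook's working directory
!git clone https://github.com/manbobo2002/gcp-bigdata.git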

Now our code is ready as well. Here I want to remind you that big data analysis has three main parts: data preprocessing, model training, and testing and evaluation.

Today we will do a rain prediction analysis. What we want is to predict whether it will rain tomorrow or not; the answer is simply "Yes" or "No".

Data Preprocessing

Double-click the gcp-bigdata folder and open the ipynb file; the code will be displayed on the right-hand side.

Here you have two ways to run the code. The first is to select a code cell and press "Shift + Enter".

The second is to click "Run" and then choose "Run All Cells". I will use this method in this tutorial, but I will still walk through what each step does.

If you remember my previous article, you will recall that there are 10 principles of data preprocessing. It is not necessary to apply all of them.

If we were using Jupyter Notebook locally, we would not only need to import the libraries below but also install them first. Since we are using the pre-built libraries provided by GCP, we can skip the installation step.
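For reference, if you do run this locally, the installation step might look like the cell below. This is a hedged sketch: the package names are the usual PyPI ones and versions are not pinned.

# install the libraries used in this tutorial (only needed outside GCP Notebooks)
!pip install pandas numpy seaborn matplotlib scikit-learn tensorflow keras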

1. Import libraries and read CSV file

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('weatherAUS.csv')

2. Drop the unnecessary feature

data.drop(['RISK_MM'], axis=1, inplace=True)

We have to exclude the RISK_MM feature when training a binary classification model; it records the amount of rain for the following day, so keeping it would leak the answer to our model and artificially inflate its performance.
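As a quick illustration, you could run a check like the one below before the drop above. This is a hedged sketch: it assumes RainTomorrow is derived from RISK_MM using a roughly 1 mm threshold, which is how this dataset is usually documented.

# cross-tabulate the would-be leak against the target (run before dropping RISK_MM)
raw = pd.read_csv('weatherAUS.csv')
print(pd.crosstab(raw['RISK_MM'] > 1.0, raw['RainTomorrow']))
# if the off-diagonal counts are (near) zero, RISK_MM effectively encodes the label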

3. Extract the year, month and day, then drop the original date

data['Date'] = pd.to_datetime(data['Date'])  # parse the Date column so the .dt accessor works
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day
data.drop('Date', axis=1, inplace=True)

4. Prepare the training and testing datasets

Y = data['RainTomorrow']
X = data.drop(['RainTomorrow'], axis=1)

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

5. Handle missing values (data cleaning)

X_train['WindGustDir'].fillna(X_train['WindGustDir'].mode()[0], inplace=True)
X_train['WindDir9am'].fillna(X_train['WindDir9am'].mode()[0], inplace=True)
X_train['WindDir3pm'].fillna(X_train['WindDir3pm'].mode()[0], inplace=True)
X_train['RainToday'].fillna(X_train['RainToday'].mode()[0], inplace=True)

We use the most common value to fill in all the missing categorical values on training data.

numerical_features = X_train.select_dtypes(exclude=["object"]).columns
for col in numerical_features:
    col_median = X_train[col].median()
    X_train[col].fillna(col_median, inplace=True)

Then we use the median to fill in all the missing numerical values on training data.

X_test['WindGustDir'].fillna(X_test['WindGustDir'].mode()[0], inplace=True)
X_test['WindDir9am'].fillna(X_test['WindDir9am'].mode()[0], inplace=True)
X_test['WindDir3pm'].fillna(X_test['WindDir3pm'].mode()[0], inplace=True)
X_test['RainToday'].fillna(X_test['RainToday'].mode()[0], inplace=True)

numerical_features = X_test.select_dtypes(exclude=["object"]).columns
for col in numerical_features:
    col_median = X_test[col].median()
    X_test[col].fillna(col_median, inplace=True)

We do the same thing on the testing data. The reason we do not handle missing values before splitting is that imputing on the full dataset would let information from the test set leak into the training set: the most common value and the median can differ between the training and testing data. For example, take the numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; the overall median is (5+6)/2 = 5.5. But if we split them so that the training data are 1, 2, 3, 7, 8, 9, 10 and the testing data are 4, 5, 6, then the median of the training data is 7 and the median of the testing data is 5.
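A tiny sketch of that example in code (the numbers are the illustrative ones above, not our weather data):

import numpy as np

full = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
train = np.array([1, 2, 3, 7, 8, 9, 10])
test = np.array([4, 5, 6])

print(np.median(full))   # 5.5 -- what we would impute with if we filled before splitting
print(np.median(train))  # 7.0 -- what the model is actually allowed to "know"
print(np.median(test))   # 5.0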

6. Normalization

from scipy.stats import skew

numeric_feats = X_train.dtypes[X_train.dtypes != "object"].index
skewed_feats = X_train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75]
print(skewed_feats)

X_train['Rainfall'] = np.log1p(X_train['Rainfall'])
X_train['Evaporation'] = np.log1p(X_train['Evaporation'])

numeric_feats = X_test.dtypes[X_test.dtypes != "object"].index
skewed_feats = X_test[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 1]
print(skewed_feats)

X_test['Rainfall'] = np.log1p(X_test['Rainfall'])
X_test['Evaporation'] = np.log1p(X_test['Evaporation'])
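To confirm the log transform actually reduced the skewness, you could re-check the two transformed columns. A minimal sketch, not part of the original notebook:

from scipy.stats import skew

# recompute skewness for the two log-transformed training features
for col in ['Rainfall', 'Evaporation']:
    print(col, skew(X_train[col].dropna()))
# the values should now be much closer to 0 than the ones printed above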

7. Convert categorical features into dummy variables

X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
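Why align at all? get_dummies can produce different columns on the two splits if a category only appears in one of them. The toy sketch below (made-up fruit data, not our weather columns) shows what align does; filling the resulting NaN columns with 0 is my own addition here, not part of the original notebook.

import pandas as pd

# toy frames: 'cherry' appears only in the training split
train_toy = pd.get_dummies(pd.DataFrame({'fruit': ['apple', 'banana', 'cherry']}))
test_toy = pd.get_dummies(pd.DataFrame({'fruit': ['apple', 'banana']}))

# left join keeps the training columns and adds the missing ones to the test frame as NaN
train_toy, test_toy = train_toy.align(test_toy, join='left', axis=1)
test_toy = test_toy.fillna(0)  # assumption: treat an unseen category as "not present"
print(test_toy.columns.tolist())  # same column order as the training frame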

Up to this point, we have done all the necessary data preprocessing. What remains is to train and test our model.

Model Training and Testing with Evaluation

In this part, we will use two algorithms, Logistic Regression and an ANN (artificial neural network), to build our models. We start with Logistic Regression.

from sklearn.linear_model import LogisticRegression

# instantiate the model
LR = LogisticRegression()  # e.g. LogisticRegression(solver='liblinear', random_state=0)

# fit the model
LR.fit(X_train, Y_train)

LR_predictions = LR.predict(X_test)
LR_predictions

from sklearn.metrics import accuracy_score

print('Accuracy score: {0:0.4f}'.format(accuracy_score(Y_test, LR_predictions)))

The output may look like:

Accuracy score: 0.8397

Then let’s try ANN.

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

NN = Sequential()
NN.add(Dense(input_dim=X_train.shape[1], activation='relu', kernel_initializer='uniform', output_dim=55))
NN.add(Dense(activation='relu', kernel_initializer='uniform', output_dim=55))
NN.add(Dense(activation='sigmoid', kernel_initializer='uniform', output_dim=1))
NN.summary()

y_train = (Y_train == 'Yes')

NN.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
NN.fit(X_train, y_train)

y_test = (Y_test == 'Yes')
scores = NN.evaluate(X_test, y_test, verbose=0)
print(NN.metrics_names[1], scores[1])

The output may look like:

accuracy 0.8294575214385986

Well, both models score over 80% accuracy, but can we do better? Rescaling and standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they may behave badly if the individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance). Let's try rescaling.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

col_train = list(X_train.columns)

X_train_temp = scaler.transform(X_train)
X_train_minmax = pd.DataFrame(X_train_temp, columns=col_train)

X_test_temp = scaler.transform(X_test)
X_test_minmax = pd.DataFrame(X_test_temp, columns=col_train)
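To see whether rescaling alone helps, you could refit the logistic regression on the min-max scaled frames. A hedged sketch (LR_scaled is my own variable name, and the resulting score is not part of the original notebook's output):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# retrain and evaluate on the rescaled features
LR_scaled = LogisticRegression(solver='liblinear', random_state=0)
LR_scaled.fit(X_train_minmax, Y_train)
print('Accuracy score: {0:0.4f}'.format(
    accuracy_score(Y_test, LR_scaled.predict(X_test_minmax))))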

Also, we could make use of K-fold cross-validation. K-fold cross-validation is often used to guard against overfitting: a model that simply memorized the labels of the samples it has already seen would get a perfect score on those samples but would fail to predict anything useful on yet-unseen data. It works by splitting the training dataset into k subsets. Models are then trained in turn on all subsets except one, which is held out, and performance is evaluated on that held-out validation set. The process is repeated until every subset has had a chance to be the held-out validation set, and the performance measure is averaged across all the models. Cross-validation is often skipped for deep learning models because of the computational expense. A tiny sketch of how the folds are formed follows below; after that, let's see the updated result of Logistic Regression.
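This illustration uses ten dummy samples, not our weather data:

from sklearn.model_selection import KFold
import numpy as np

# ten dummy samples split into 5 folds; every sample is held out exactly once
dummy = np.arange(10)
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(dummy)):
    print("fold", fold, "train on", train_idx, "validate on", val_idx)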

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

LR = LogisticRegression(solver='liblinear', random_state=0)
scores = cross_val_score(LR, X_train, Y_train, cv=5)

print("Avg accuracy =", scores.mean())

Then the output will look like this:

Avg accuracy = 0.8455417692269052

Then let’s see the ANN model.

from sklearn.model_selection import KFold
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
from sklearn.preprocessing import MinMaxScaler

kf = KFold(n_splits=5)
acc_scores = 0

for train_index, test_index in kf.split(X_train):
    X_train_new, X_test_new = X_train.iloc[train_index], X_train.iloc[test_index]
    Y_train_new, Y_test_new = Y_train.iloc[train_index], Y_train.iloc[test_index]

    NN = Sequential()
    NN.add(Dense(input_dim=X_train_new.shape[1], activation='relu', kernel_initializer='uniform', output_dim=55))
    NN.add(Dense(activation='relu', kernel_initializer='uniform', output_dim=55))
    NN.add(Dense(activation='sigmoid', kernel_initializer='uniform', output_dim=1))

    y_train_new = (Y_train_new == 'Yes')
    y_test_new = (Y_test_new == 'Yes')

    NN.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    NN.fit(X_train_new, y_train_new)

    scores = NN.evaluate(X_test_new, y_test_new, verbose=0)
    acc_scores = acc_scores + scores[1]

print("Avg accuracy =", str(acc_scores/5))

Then the output will look like this:

Avg accuracy = 0.8297985553741455

Cleanup

After you finish this lab, simply select the Notebook instance and click "DELETE". Nothing difficult.

Conclusion

This article walks through the flow of a big data analysis. Of course, there are slight differences when we handle different datasets. In addition, we could fine-tune our parameters to get better results. There are also many algorithms besides Logistic Regression and ANNs, and they all have their strengths and weaknesses.

Enjoy Big Data!
