My Heart Will Go On…

Kaustubh Pethkar
6 min read · Oct 23, 2020

This blog post covers the introductory Kaggle challenge: predicting survival on the Titanic while getting familiar with ML basics.

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, Kaggle asks you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Let’s Start

First, let’s import pandas as pd and load the train and test CSV files.

It’s always good to take a glance at the data using train_data.head(), which shows the first 5 rows of train_data.
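A minimal sketch of these first steps (the file names train.csv and test.csv are assumed):

```python
import pandas as pd

# Load the Kaggle Titanic CSVs (file names assumed to be in the working directory).
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Peek at the first 5 rows of the training set.
print(train_data.head())
```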

Here, we have 12 different columns, also known as feature columns. Among these 12 columns, some are useful and others are not. For example, a passenger’s name or the cabin they occupied doesn’t contribute much to predicting their chances of survival, whereas their age and sex may carry valuable information. Drop the unnecessary columns using…
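The post elides the exact call, but a sketch might look like this (the specific set of dropped columns is an assumption):

```python
# Columns assumed to be uninformative for survival; adjust to taste.
drop_cols = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]
train_data = train_data.drop(columns=drop_cols)
```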

Great! We now have a set of reliable feature columns. Although they contain useful information, there may be some NaN values. Check whether NaNs are present in any column.

Hmm… this doesn’t look good: there are 177 NaN values in Age. We can fix them using fillna() with the mean age; inplace is set to True so that train_data itself gets updated.

Now, if we check again for NaN values, we should get none.
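A sketch of the check-fill-recheck sequence, assuming Age is the column with the missing values:

```python
# Count missing values per column; Age is the one with 177 NaNs.
print(train_data.isna().sum())

# Fill missing ages with the mean age; inplace=True updates train_data directly.
train_data.fillna({"Age": train_data["Age"].mean()}, inplace=True)

# Re-check: every count should now be 0.
print(train_data.isna().sum())
```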

Well done! Our data now looks ready for training. But wait… what about categorical columns? So far we have only handled the numerical ones. Let’s convert the categorical Sex column into numerical form using pd.get_dummies(). This produces dummy columns for female and male; with drop_first=True the first (female) column is dropped, leaving a single column where 0 means female and 1 means male.

Once we get the numerical column from our categorical column, we don’t need the categorical column anymore. Get rid of it and concatenate the new column to train_data.
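A sketch of the encoding step (column names assumed from the standard Titanic dataset):

```python
# One-hot encode Sex; drop_first=True keeps a single 'male' column (0 = female, 1 = male).
sex_dummies = pd.get_dummies(train_data["Sex"], drop_first=True)

# Drop the original categorical column and attach the numeric one.
train_data = pd.concat([train_data.drop(columns=["Sex"]), sex_dummies], axis=1)
```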

After converting categorical data we get…

From our train_data, we can see that most features take small values and have variances of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features as expected. Here, Age and Fare have much larger magnitudes than the rest, so we have to scale them down. For that, let’s import StandardScaler() from sklearn.
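A sketch of the scaling step, assuming Age and Fare are standardized in place:

```python
from sklearn.preprocessing import StandardScaler

# Scale Age and Fare so their magnitudes match the other features.
scaler = StandardScaler()
train_data[["Age", "Fare"]] = scaler.fit_transform(train_data[["Age", "Fare"]])
```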

we get..

Separate the Survived column from train_data and store it in a variable y_survived as the output; the rest of the feature columns will serve as the input.
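For example:

```python
# Target: the Survived column; inputs: everything else.
y_survived = train_data["Survived"]
x_train = train_data.drop(columns=["Survived"])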

So far we have preprocessed the data successfully and split it into the feature matrix x_train and the target y_survived. We could proceed with an SVM, Decision Tree, or KNN model, but which one should we pick for the best fit? To answer that, we’ll use GridSearchCV to compare them.

To find the best-fitting model, we’ll create a dictionary model_param in which each of these models is listed along with the parameter values we want to try.

We iterate through model_param; each model is passed to GridSearchCV to find which parameters give the best fit and the highest cross-validation score, and the respective best scores are appended to the list scores.

To see each model’s score, create a new dataframe model_score.
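A sketch of the whole model-selection loop; the specific parameter grids below are illustrative assumptions, not the post’s exact values:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Candidate models with small, illustrative parameter grids.
model_param = {
    "svm": {
        "model": SVC(gamma="auto"),
        "params": {"C": [1, 10, 20], "kernel": ["rbf", "linear"]},
    },
    "decision_tree": {
        "model": DecisionTreeClassifier(),
        "params": {"criterion": ["gini", "entropy"]},
    },
    "knn": {
        "model": KNeighborsClassifier(),
        "params": {"n_neighbors": [3, 5, 7]},
    },
}

scores = []
for name, mp in model_param.items():
    # Cross-validated grid search over this model's parameter grid.
    clf = GridSearchCV(mp["model"], mp["params"], cv=5, return_train_score=False)
    clf.fit(x_train, y_survived)
    scores.append({"model": name,
                   "best_score": clf.best_score_,
                   "best_params": clf.best_params_})

# Tabulate the best score and parameters per model.
model_score = pd.DataFrame(scores, columns=["model", "best_score", "best_params"])
print(model_score)
```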

Clearly, SVC is the winner here, so we’ll go with the SVC model.

Let’s train this model.
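A sketch; the hyperparameters shown are placeholders, and in practice you would plug in the best_params_ reported by GridSearchCV:

```python
from sklearn.svm import SVC

# Train the winning model (C and kernel here are assumed values).
svc_model = SVC(C=1, kernel="rbf", gamma="auto")
svc_model.fit(x_train, y_survived)
```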

Great! We successfully trained our model on train_data. It’s time to preprocess test_data before feeding it to the model. As with train_data, we’ll remove the unwanted feature columns from test_data.
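For example (keeping PassengerId aside first, since Kaggle’s submission file needs it; the helper name passenger_ids is mine):

```python
# Keep PassengerId for the submission file, then drop the same columns as for train_data.
passenger_ids = test_data["PassengerId"]
test_data = test_data.drop(columns=["PassengerId", "Name", "Ticket", "Cabin", "Embarked"])
```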

Check for NaN values in the feature columns: two columns contain NaN values, so let’s fill them with the fillna() method.
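A sketch, assuming the two columns with missing values are Age and Fare:

```python
# Check which columns have missing values, then fill each with its column mean.
print(test_data.isna().sum())
test_data.fillna({"Age": test_data["Age"].mean(),
                  "Fare": test_data["Fare"].mean()}, inplace=True)
```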

Like train_data, test_data also contains the categorical Sex column. We have to convert it to numerical form using pd.get_dummies(); with drop_first=True the first (female) dummy column is dropped, leaving a single column where 0 means female and 1 means male.

Once we get the numerical column from the categorical one, we don’t need the categorical column anymore. Get rid of it and concatenate the new column to test_data.
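The same encoding as before, now applied to test_data:

```python
# Single numeric column for Sex, matching the encoding used on train_data.
sex_dummies = pd.get_dummies(test_data["Sex"], drop_first=True)
test_data = pd.concat([test_data.drop(columns=["Sex"]), sex_dummies], axis=1)
```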

Here is test_data so far..

As in train_data, Age and Fare have a much larger magnitude, which can prevent the estimator from learning from the other features as expected, so we’ll scale them down to match the other features.
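One reasonable way to do this is to reuse the scaler fitted on train_data, so both sets share the same scale (whether the original post refit a new scaler is not shown):

```python
# Transform test_data with the scaler fitted on train_data.
test_data[["Age", "Fare"]] = scaler.transform(test_data[["Age", "Fare"]])
```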

we get..

Now that we have processed test_data, we can feed it to the model for prediction.

We get all the predictions for test_data, i.e. the survival predictions, in y_pred. Store this y_pred in a new dataframe ‘submission’ and convert it to a CSV file to submit to this Kaggle challenge.
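A sketch of the final step; the output file name and the passenger_ids helper kept earlier are assumptions:

```python
# Predict survival for every passenger in the test set.
y_pred = svc_model.predict(test_data)

# Build the submission file; the Kaggle Titanic competition expects PassengerId and Survived.
submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": y_pred})
submission.to_csv("submission.csv", index=False)
```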

Conclusion:

We used the SVC model for this classification task of predicting passenger survival on the Titanic and got an accuracy score of 78%.
