Predicting Survival on the Titanic using Azure Machine Learning Studio

Abdullah Akinde
Nov 28, 2021

It has been three years since I first encountered the Titanic Kaggle Challenge, and it remains one of the best ways for beginners to get started with machine learning competitions. If you are new to it, I highly recommend working through Alexis Cook's tutorial and then coming back here to learn how to tackle the same problem with Azure Machine Learning Studio. Without further ado, let's jump right in.

Background

When the Titanic sank on April 15, 1912, during her maiden voyage, it made headlines around the world. Extensive investigations were carried out to understand the circumstances leading to the disaster, many theories were proposed, and, unsurprisingly, plenty of conspiracy theories arose.

As data scientists, it is our job to separate fact from fiction. Our mission today is to estimate the probability of survival using the passenger data. It is known that 1,502 of the 2,224 passengers and crew did not survive. While surviving required some element of luck, we know that some groups were more likely to survive than others.

About Azure Machine Learning

Azure Machine Learning is a cloud-based service from Microsoft that lets you manage the full life cycle of your machine learning efforts. You can carry out data exploration, model experimentation, and service deployment all in one place.

It is a subscription-based service, and you get $200 of free credit when you sign up here. You can quickly create all the resources you will need from this link.

Meet the Dataset

There are three files in the data: train.csv, test.csv, and gender_submission.csv. Download all three here.

The train.csv file contains details of 891 passengers across 12 columns; the second column, Survived, records whether each passenger survived:

  • “1”, the passenger survived.
  • “0”, the passenger died.

The test.csv file contains details of 418 passengers, and we have to predict whether each of them survived. The gender_submission.csv file is provided as an example of how you should structure your predictions. It contains 2 columns:

  • “PassengerId” contains the IDs of each passenger from test.csv.
  • “Survived” is what our model predicted for the passenger.
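Before heading into the Studio, it can help to sanity-check the files locally. Here is a minimal sketch using pandas, assuming the three CSVs sit in your working directory (the file paths and variable names are mine, not part of the Kaggle instructions):

```python
import pandas as pd

# Assumed local paths; point these at wherever you downloaded the Kaggle files.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample = pd.read_csv("gender_submission.csv")

print(train.shape)                        # expected (891, 12)
print(test.shape)                         # expected (418, 11), no Survived column
print(sample.columns.tolist())            # ['PassengerId', 'Survived']
print(train["Survived"].value_counts())   # 0 = died, 1 = survived
```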

Load the Dataset

Steps:

  1. In Azure Machine Learning Studio, open the Datasets page
  2. Create a dataset from local files, and enter the name and description as seen below
  3. Upload the file, then specify the settings and preview the data
  4. Select all the columns other than Path and review the automatically detected data types
  5. Confirm the details and click Create.
(Screenshots: Steps 1–4.)
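If you prefer scripting to clicking, the same dataset registration can be done with the Python SDK. This is a minimal sketch, assuming the azureml-core (v1) SDK, a config.json for your workspace in the working directory, and the dataset name titanic-traindata that we reuse later; the upload paths are placeholders:

```python
from azureml.core import Workspace, Dataset

# Assumes config.json for your workspace is available locally.
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload the local train.csv to the workspace's default datastore.
datastore.upload_files(files=["train.csv"], target_path="titanic/", overwrite=True)

# Build a tabular dataset from the uploaded file and register it by name.
titanic_ds = Dataset.Tabular.from_delimited_files(path=(datastore, "titanic/train.csv"))
titanic_ds.register(workspace=ws,
                    name="titanic-traindata",
                    description="Titanic training data (891 passengers)")
```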

Explore the Dataset

Steps:

  1. Open the Designer page and select + to create a new pipeline.
  2. At the top left of the screen, click the default pipeline name (Pipeline-Created-on-date) and change it to Titanic Training.
  3. Select Compute target and choose the compute cluster you created previously in Azure Resources.
  4. Drag the titanic-traindata dataset you created in the previous exercise onto the canvas. You can easily search for it by name in the box on the left.
  5. Click the dataset on the canvas, and on the Outputs menu, select Dataset output by clicking the Preview data graph icon.
  6. Review the schema of the data, noting that you can see the distributions of the various columns as histograms.
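If you also want the same look at the data from a notebook, here is a small, purely illustrative sketch that pulls the registered dataset back down with the v1 SDK and reproduces the gist of the Designer preview:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
df = Dataset.get_by_name(ws, name="titanic-traindata").to_pandas_dataframe()

print(df.dtypes)                     # column types, as in the schema view
print(df.isnull().sum())             # missing values per column
print(df.describe(include="all").T)  # rough equivalent of the profile/histograms
```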

Data Cleaning

The usual suspects in any data cleaning adventure are missing data and unwanted columns. Since it is easy to yank out unwanted columns, I will start there.

The Name, Ticket, and Cabin columns are not useful for my model in this analysis because of the kinds of values they store, so I will remove them to speed up processing of the data frame. I also removed the Survived column since it is our dependent variable.

Search for the “Select Columns in Dataset” module and drop it onto the canvas. Connect it to the dataset as seen below, then right-click it and select the required columns, excluding the 4 mentioned above. Then click Submit; this should run for about 5 minutes.

Our data is taking shape: we are down to 8 columns from 12, and these will form the predictors for our model. Now let us take care of the missing data.
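For reference, the “Select Columns in Dataset” step has a one-line pandas analogue; this is only an illustration, continuing from the train DataFrame loaded in the earlier sketch:

```python
# Keep everything except the four columns removed in the Designer step.
predictors = train.drop(columns=["Name", "Ticket", "Cabin", "Survived"])

print(predictors.columns.tolist())  # the 8 remaining predictor columns
```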

The first candidate is Age, with about 177 missing values, which I will replace with the mean age from the rest of the data. Embarked has just 2 missing values, so we can simply remove those rows.

Steps:

  1. Search for “Clean Missing Data” and drop it onto the canvas twice
  2. For the first one, select the Age column and set the cleaning mode to “Replace with mean”
  3. For the second, select the Embarked column and set the cleaning mode to “Remove entire row”
  4. Click Submit and let it run.
(Screenshots: Step 2, Step 3, and the cleaned data.)
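The two “Clean Missing Data” modules also have a simple pandas analogue, again continuing from the predictors frame in the previous sketch and shown only for illustration:

```python
# Mirrors "Replace with mean" on the Age column.
predictors["Age"] = predictors["Age"].fillna(predictors["Age"].mean())

# Mirrors "Remove entire row" for the two rows missing Embarked.
predictors = predictors.dropna(subset=["Embarked"])

print(predictors.isnull().sum())  # Age and Embarked should now show zero missing values
```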

Data Transformation

First, we convert both the Sex and Embarked columns into numerical values:

  • “0” if Sex is male and “1” if Sex is female
  • “0” if Embarked is Q, “1” if Embarked is C, and “2” if Embarked is S

I will also use this opportunity to convert the other categorical variables in the dataset, so Pclass, SibSp, and Parch will be included (see the pandas sketch after the steps below).

  1. To convert the columns from string to categorical features, use Edit Metadata: select the Age and Embarked columns, set Data type to Integer, set Categorical to Make Categorical, and set Fields to Features.
  2. Add Convert to Indicator Values to your experiment and connect it to the dataset containing the columns you want to convert.
  3. Use the column selector to choose one or more categorical columns.
  4. Select the Overwrite categorical columns option, since we only want to output the new Boolean columns.
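Outside the Designer, the whole transformation can be sketched in pandas: the explicit mappings for Sex and Embarked from the bullets above, plus get_dummies standing in for Convert to Indicator Values on the other categorical columns. This continues from the predictors frame used earlier and is illustrative rather than part of the pipeline:

```python
import pandas as pd

# Numeric encodings from the bullets above (train.csv stores lowercase values).
predictors["Sex"] = predictors["Sex"].map({"male": 0, "female": 1})
predictors["Embarked"] = predictors["Embarked"].map({"Q": 0, "C": 1, "S": 2})

# Indicator (one-hot) columns for the remaining categorical variables,
# similar in spirit to the Convert to Indicator Values module.
predictors = pd.get_dummies(predictors, columns=["Pclass", "SibSp", "Parch"])

print(predictors.head())
```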


Model Training

Finally, time for some well-deserved training.

Model Evaluation


Conclusion



