Would you have survived the sinking of the Titanic?

About the project

The Titanic sank on April 15, 1912, and more than 2/3rds of the people on board died in the icy waters due to a lack of lifeboats. Surviving this ordeal was not entirely a matter of luck. Some groups of people were much more likely to survive the sinking than others. My goal is to take information gathered about the passengers on board (e.g. name, age, gender, socio-economic class) to determine which variables contributed to survival and to build a machine learning model that predicts the survival of passengers in a holdout set.

About the data

The training data set for this project contained information about 891 passengers. For each passenger we were provided with the following information: passenger ID, ticket class, name, gender, age, number of siblings and spouses on board, number of parents or children on board, ticket number, cabin number, fare paid, port where they embarked, and whether or not they survived. The training data was mostly complete, but I was missing cabin information for most passengers and the ages of about 1/5th of them.

Exploratory analysis and baseline accuracy

First, we noticed that the survival rate for the set was 38.4%. Because this is a binary classification problem, a zero-rate classifier (one that predicts no passengers survive) would be correct 61.6% of the time, and this will serve as our baseline accuracy. Looking at the distributions of the feature variables, we can see that most passengers purchased 3rd class tickets, about 2/3rds of the passengers were men, most passengers were traveling alone, and about 3/4ths of the passengers embarked at Southampton.
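
The baseline calculation can be sketched in a few lines. This is a minimal illustration on made-up labels, not the real training data, and the variable names are my own:

```python
import pandas as pd

# Toy stand-in for the training labels (1 = survived, 0 = died); not the real data.
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

survival_rate = survived.mean()            # fraction of passengers who survived
majority_class = int(survived.mode()[0])   # most common label (0 for this toy series)

# A zero-rate classifier always predicts the majority class,
# so its accuracy is simply the share of that class.
baseline_accuracy = (survived == majority_class).mean()

print(f"survival rate: {survival_rate:.1%}")
print(f"zero-rate baseline accuracy: {baseline_accuracy:.1%}")
```

With a 38.4% survival rate, the same arithmetic gives 1 - 0.384 = 0.616.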

I constructed the heatmap below to determine which variables were correlated, especially to survival.

correlation of feature variables
  • Sex appears to be the most important predictor of survival, and it is correlated with ticket class and the number of family members on board.
  • Ticket class is the second most important factor and seems to be correlated with age.
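
A correlation matrix like the one behind the heatmap can be computed roughly as follows. This is a sketch on toy rows; the column names follow the Kaggle file (an assumption on my part), and the 0/1 encoding of sex is just one reasonable choice:

```python
import pandas as pd

# Toy rows using the Kaggle column names (assumed); values are illustrative only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 2, 3, 3, 1, 2, 1],
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 2.0, 54.0, 27.0, 40.0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 21.08, 51.86, 13.0, 30.5],
})

# Encode sex numerically so it can take part in the correlation matrix.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Rank the features by the strength of their correlation with survival.
ranking = corr["Survived"].drop("Survived").abs().sort_values(ascending=False)
print(ranking)
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would draw the matrix as a heatmap.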

Sex and survival:

Clearly, women were much more likely to survive than men, making this a good predictive feature.

Ticket class and survival:

It is clear from the graph that survival rate drops off quickly with ticket class, making this a good predictive feature. Sex and ticket class were not directly correlated, and we can see that both features are significant and independent by looking at the survival rate by sex when we group by class.

  • The survival rate of men in 2nd and 3rd class was around 15%, but 1st class men had a survival rate of just under 37%, making them more than twice as likely to survive.
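
The survival rate by sex within each class is a two-key group-by. Here is a minimal sketch on invented rows (the Kaggle column names are assumed, and the numbers will not match the real rates quoted above):

```python
import pandas as pd

# Toy data: six men and six women spread evenly over the three classes.
df = pd.DataFrame({
    "Sex":      ["male"] * 6 + ["female"] * 6,
    "Pclass":   [1, 1, 2, 2, 3, 3] * 2,
    "Survived": [1, 0, 0, 0, 0, 0,   # men
                 1, 1, 1, 1, 1, 0],  # women
})

# The mean of a 0/1 column is the survival rate; unstack puts classes in columns.
rates = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rates)
```

Reading across a row shows the effect of class for one sex, and reading down a column shows the effect of sex within one class.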

Fare and survival:

To determine whether Fare will contribute to our model in addition to ticket class, we need to look at the survival rate as a function of fare paid.

It would appear that fare is even more dramatically related to survival rate than class is, with the lowest ticket prices resulting in almost certain death and the highest ticket prices having incredibly high survival rates. To determine whether Fare provides more granular precision than ticket class, I broke each class roughly into thirds to see if the highest paying group had a better survival rate than the lowest paying group in each ticket class.

We can see that even within each ticket class, passengers who paid more were more likely to survive. This effect was most pronounced among 1st and 2nd class passengers.
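
Splitting each class into fare thirds can be sketched as below. The data is invented, the `FareBand` column and `fare_tercile` helper are hypothetical names, and ranking before `qcut` is my own workaround for tied fares rather than something taken from the original analysis:

```python
import pandas as pd

# Toy data: six passengers in 1st class and six in 3rd class.
df = pd.DataFrame({
    "Pclass":   [1] * 6 + [3] * 6,
    "Fare":     [30, 50, 80, 120, 200, 500, 7, 7.5, 8, 9, 15, 20],
    "Survived": [0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1],
})

def fare_tercile(fares: pd.Series) -> pd.Series:
    """Return 0, 1 or 2 for the lowest, middle and highest third of fares."""
    # Rank first so tied fares cannot produce duplicate bin edges in qcut.
    return pd.qcut(fares.rank(method="first"), 3, labels=False)

# Compute the terciles separately inside each ticket class.
codes = df.groupby("Pclass")["Fare"].transform(fare_tercile)
df["FareBand"] = codes.map({0: "low", 1: "mid", 2: "high"})

# Survival rate for each fare band within each class.
rates = (
    df.groupby(["Pclass", "FareBand"])["Survived"]
    .mean()
    .unstack("FareBand")[["low", "mid", "high"]]
)
print(rates)
```

If the "high" column is consistently above the "low" column within a class, fare carries information beyond ticket class.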

Age and survival

To determine the importance of age, we need to see some of the other variables broken down by age group:

  • The age distributions of men and women are very similar, with men being very slightly older on average.
  • Survival rate was very different for different age groups. Children under 6 had a relatively high survival rate, whereas people ages 16–25 had the worst survival rate.

The survival rates of different age bands look very different. Age band will likely be a better predictor than raw age, which may contain too much noise to be useful.
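
Turning a continuous age into bands might look like the sketch below. The cut points are assumptions chosen to match the "under 6" and "16-25" groups mentioned above; the exact bands used in the analysis are not listed here, and `AgeBand` is a hypothetical column name:

```python
import pandas as pd

# Toy passengers; not the real data.
df = pd.DataFrame({
    "Age":      [2, 5, 14, 19, 24, 31, 47, 63],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Assumed band edges. With right=False each band includes its lower edge
# and excludes its upper edge, e.g. [16, 26) covers ages 16 through 25.
bins = [0, 6, 16, 26, 41, 61, 120]
labels = ["under 6", "6-15", "16-25", "26-40", "41-60", "61+"]
df["AgeBand"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# Survival rate per band (observed=True skips bands with no passengers).
rates = df.groupby("AgeBand", observed=True)["Survived"].mean()
print(rates)
```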

Family members on board and survival

Next, we take a look at the survival rates of people with siblings/spouses or parents/children on board, along with which ticket classes they purchased:

It appears that 1st class passengers were more likely to be traveling with a single sibling or spouse but not with a parent or child.

It also appears that having a family member or two on board increases your chances of survival. I created a new feature to track the total number of family members on board by combining siblings, spouses, parents, and children to see if that was a more reliable predictor.

Odds of survival were better than 50/50 if you had between 1 and 3 family members on board with you. Although there appears to be a contribution from 1st class passengers surviving, this does not fully explain the relationship, so this will be a useful feature to include in our model.
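
The combined family feature is a simple column sum. A minimal sketch on toy rows, using the SibSp, Parch, and Fmembers column names:

```python
import pandas as pd

# Toy rows; SibSp = siblings/spouses on board, Parch = parents/children on board.
df = pd.DataFrame({
    "SibSp":    [0, 1, 1, 4, 0],
    "Parch":    [0, 0, 2, 2, 0],
    "Survived": [0, 1, 1, 0, 0],
})

# Total number of family members travelling with each passenger.
df["Fmembers"] = df["SibSp"] + df["Parch"]

# Survival rate for each family size.
rates = df.groupby("Fmembers")["Survived"].mean()
print(rates)
```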

Port of embarkation and survival

All the passengers boarded at one of three ports: Cherbourg (C), Queenstown (Q), or Southampton (S). Below are graphs of survival rates and ticket class by port.

The survival rate for passengers from C is quite high, while that of S is quite low. We notice that the survival rate tracks the proportion of higher class ticket holders embarking from each port. We also notice that at port Q only 3rd class passengers boarded, but their survival rate was higher than that of 3rd class passengers overall, making ‘embarked’ a valuable feature to include in our models.

Final data wrangling and cleaning

There were only a few passengers with missing embarkation data. I didn’t want to lose these rows, so I filled the missing values with the most common port (Southampton). A large number of ages were missing. I noticed that age depends on ticket class, so I filled these missing values with the mean age for each passenger’s ticket class. Only 1 passenger fare was missing, so I filled this value with the mean fare for that passenger’s ticket class.
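
These three fills can be sketched with pandas as follows. The rows are invented and the Kaggle column names are assumed:

```python
import numpy as np
import pandas as pd

# Toy rows with one missing age per class, one missing fare and one missing port.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Age":      [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
    "Fare":     [80.0, 100.0, 90.0, 8.0, 10.0, np.nan],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

# Missing port -> most common port in the data.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Missing age or fare -> mean of that column within the passenger's ticket class.
for col in ["Age", "Fare"]:
    class_mean = df.groupby("Pclass")[col].transform("mean")
    df[col] = df[col].fillna(class_mean)

print(df)
```

Using `transform("mean")` returns a value for every row, aligned to the original index, so it can be passed straight to `fillna`.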

Predictive models

Seeing as this was my first machine learning project, I decided to use three different models (logistic regression, random forest, and k nearest neighbors) and then to combine them into an ensemble model. This let me see whether the ensemble would improve performance and compare the performance of each model against the combination.

We noticed that passenger ID and ticket number had no bearing on survival, so these will not be used in the models. We were missing Cabin data for the vast majority of passengers, so this was also not included. We dropped the number of siblings and spouses (SibSp) and number of parents and children (Parch) in favor of the number of family members (Fmembers). We used age bands instead of age to reduce noise in our model.
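
The feature selection described above could be sketched like this. The toy rows, the `AgeBand` column name, and the one-hot encoding step are my assumptions for illustration; the encoding actually used is not described here:

```python
import pandas as pd

# Toy rows with the Kaggle column names (assumed) plus the engineered features.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived":    [0, 1, 1],
    "Pclass":      [3, 1, 2],
    "Sex":         ["male", "female", "female"],
    "Age":         [22.0, 38.0, 4.0],
    "SibSp":       [1, 1, 0],
    "Parch":       [0, 0, 2],
    "Ticket":      ["T1", "T2", "T3"],
    "Fare":        [7.25, 71.28, 13.0],
    "Cabin":       [None, "C1", None],
    "Embarked":    ["S", "C", "S"],
    "Fmembers":    [1, 1, 2],
    "AgeBand":     ["16-25", "26-40", "under 6"],
})

# Columns left out of the models, as described in the text.
dropped = ["PassengerId", "Ticket", "Cabin", "SibSp", "Parch", "Age"]

y = df["Survived"]
features = df.drop(columns=dropped + ["Survived"])

# Assumed step: one-hot encode the categorical columns so the models get numbers.
X = pd.get_dummies(features, columns=["Sex", "Embarked", "AgeBand"])
print(X.columns.tolist())
```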

After fitting each model and doing some hyperparameter tuning, we found that the accuracy of all three classifiers was very similar at just over 80%, as you can see in the figure below.

After fitting each model and checking its accuracy, I created an ensemble model using the voting classifier, but found that this ensemble was still not quite as accurate as the random forest classifier, one of the models I had used to build it.
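
For readers who want to try the same comparison, here is a rough sketch using scikit-learn. It runs on synthetic data from `make_classification` rather than the Titanic features, the hyperparameters are untuned placeholders, and hard (majority) voting is an assumption, since the voting mode is not stated above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# The three base models; hyperparameters here are placeholders, not tuned values.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Hard voting: each model casts one vote and the majority class wins.
ensemble = VotingClassifier(estimators=base_models, voting="hard")

# Mean 5-fold cross-validated accuracy for each model and for the ensemble.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in base_models + [("ensemble", ensemble)]
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

An ensemble only helps when its members make different mistakes, so it is not unusual for a voting classifier to land between its best and worst members.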

After submitting my results to the Kaggle competition, I found that the accuracy of my model on the holdout set was 78.9%. Although I was really hoping to break the 80% mark, this is still a marked improvement over the baseline score of 61.6%.

You can find the Python code and additional references on my GitHub.
