Predicting the outcome of passenger survival on the Titanic tragedy using neural networks, logistic regression, and random forest. Generates an interactive dashboard using Tableau.

ericyang91/Machine_Learning_Titanic_Survival

Machine Learning: Titanic Survival

Overview

The Titanic tragedy was a maritime disaster that occurred on April 15, 1912, when the RMS Titanic, a luxurious British passenger liner, struck an iceberg in the North Atlantic Ocean and sank during its maiden voyage from England to the USA. The Titanic was considered unsinkable, but the collision caused significant damage, and the ship's watertight compartments were breached, leading to flooding and ultimately the ship's sinking. Of the 2,224 passengers and crew on board, more than 1,500 lost their lives, making it one of the deadliest peacetime maritime disasters in history. The tragedy led to the strengthening of maritime regulations.

The aim of this project is to show how well passenger survival can be predicted with three machine learning models: a neural network, logistic regression, and a random forest. Each model outputs a binary label that classifies a passenger as having survived or died. The dataset was downloaded from Kaggle, pre-processed using Python, SQLite, and Scikit-Learn, and visualized through a dashboard created using Tableau.

Data Preparation

Below is a snippet of the dataset downloaded from Kaggle:

[Image: rawdata]

  • PassengerId: Passenger ID
  • Survived: Survival (0 = No; 1 = Yes)
  • Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • Name: Name
  • sex: Sex
  • Age: Age
  • SibSp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • Ticket: Ticket Number
  • Fare: Passenger Fare in British pounds
  • Cabin: Cabin
  • Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: Boat identification number
  • body: Body identification number for passengers who did not survive

Missing values in the Age and Fare columns were filled with the mean of each column. The Pclass, Embarked, and Sex columns were converted into indicator columns by one-hot encoding. The original Pclass, Embarked, and Sex columns, as well as the Boat, Body, Name, Ticket, Cabin, and Index columns, were then dropped because they were deemed unnecessary for the predictive analysis.

[Image: cleandata]
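
The cleaning steps described above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual notebook code: the sample rows are made up, the helper name `clean` is hypothetical, and the column spellings (`Sex`, `Parch`, `boat`, `body`) are assumptions about the downloaded file.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics the raw columns; NOT the real Kaggle data.
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 30.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, np.nan, 13.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "Q", "S"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, one-hot encode, and drop unused columns."""
    df = df.copy()
    # Fill missing Age and Fare values with the mean of each column.
    for col in ["Age", "Fare"]:
        df[col] = df[col].fillna(df[col].mean())
    # One-hot encode the categorical columns; get_dummies removes the originals.
    df = pd.get_dummies(df, columns=["Pclass", "Embarked", "Sex"], dtype=int)
    # Drop columns not used as features (columns absent from the frame are ignored).
    unused = ["Name", "Ticket", "Cabin", "boat", "body"]
    return df.drop(columns=unused, errors="ignore")


cleaned = clean(raw)
print(cleaned.columns.tolist())
```

Computing the means on the full dataset, as here, is the simplest reading of the description; computing them on the training split only would avoid leaking test information into the imputation.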

The data was then split into the target (the Survived column) and the input features (all other columns). The input features were scaled before being used to train the machine learning models.
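
A minimal sketch of this step with Scikit-Learn is shown below. The README does not state which scaler or split ratio was used, so `StandardScaler`, the 75/25 split, and the stratified sampling are assumptions, and the data frame is randomly generated stand-in data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the cleaned data (assumption, not real data).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Survived": rng.integers(0, 2, size=100),
    "Age": rng.normal(30, 12, size=100),
    "Fare": rng.exponential(30, size=100),
    "Sex_female": rng.integers(0, 2, size=100),
})

# Target is the Survived column; every other column is an input feature.
y = data["Survived"]
X = data.drop(columns="Survived")

# Hold out 25% of the rows for testing, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training features only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps the test set unseen until evaluation.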

Neural Network

The neural network had two hidden layers with 24 and 12 neurons, respectively. The output layer consisted of a single neuron to reflect the binary nature of the prediction. The activation function for the two hidden layers was ReLU, and that of the output layer was sigmoid. With 100 epochs, I was able to achieve roughly 83% accuracy on the training data and 80% on the testing data. Below is the classification report.

[Image: classification]
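
To make the 24-12-1 architecture concrete, here is a NumPy sketch of its forward pass and parameter count. It is not the project's TensorFlow code: the weights are random and untrained, the helper names are hypothetical, and the feature count of 10 is an assumption chosen only for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_features, hidden=(24, 12), seed=0):
    """Create random (weights, bias) pairs for layers n_features -> 24 -> 12 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden, 1]
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(X, params):
    """ReLU on both hidden layers, sigmoid on the single output neuron."""
    a = X
    for W, b in params[:-1]:
        a = relu(a @ W + b)
    W, b = params[-1]
    return sigmoid(a @ W + b).ravel()


def count_params(params):
    return sum(W.size + b.size for W, b in params)


n_features = 10  # assumed feature count, for illustration only
params = init_params(n_features)
X = np.random.default_rng(1).normal(size=(5, n_features))

probs = forward(X, params)           # survival probabilities in (0, 1)
preds = (probs >= 0.5).astype(int)   # threshold at 0.5 for a binary label
print(count_params(params))          # (10*24+24) + (24*12+12) + (12*1+1) = 577
```

With so few trainable parameters the network is small, which fits a tabular dataset of this size.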

Several attempts were made to optimize this model. The number of neurons, the number of hidden layers, the type of activation functions, and the number of epochs were adjusted, but the model's performance did not improve. The age feature was then binned into multiple categorical variables to make it easier for the model to learn. However, none of these attempts led to improvement.

Logistic Regression

Logistic regression was the second of the three machine learning models used. After the data was split and scaled, the fitted model reached an accuracy score of 82%, the best overall performance of the three models.

[Image: logisticregression]
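
A minimal sketch of such a scale-then-fit workflow with Scikit-Learn follows. It runs on synthetic data from `make_classification`, so its accuracy says nothing about the Titanic results, and the use of a pipeline and `max_iter=1000` are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the cleaned features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Chain scaling and the classifier so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.2f}")
```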

Random Forest

The last model was a random forest. The data was again split and scaled before training, and the model reached an accuracy score of around 80%.

[Image: randomforest]
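
The sketch below fits a random forest with Scikit-Learn and ranks its feature importances, which is one way to see what such a model relies on. The data is synthetic, the feature names are placeholders, and `n_estimators=200` is an assumption. Scaling is left out here because tree-based models split on thresholds and are not affected by rescaling a feature; including it, as the project did, is harmless.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder feature names (not the Titanic columns).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

# Importances are non-negative and sum to 1 across all features.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(f"test accuracy: {accuracy:.2f}")
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```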

Interactive Dashboard

An interactive dashboard that visualizes the survival rate against the input features was created using Tableau.

[Image: dash]

Review and Further Research

Although logistic regression performed best of the three models implemented, the differences in performance were small. All three models performed well on accuracy and precision, but less well on recall for survivors. A likely reason is that the dataset is imbalanced, with significantly fewer survivors than passengers who died. Given that none of the optimization attempts improved the neural network model, a larger and more balanced dataset might help improve performance. It would also be interesting to build a predictive dashboard using Streamlit, in which a user enters a combination of input features and gets a prediction of whether the passenger would survive.
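
How accuracy and precision can look acceptable while survivor recall lags can be illustrated with a few hand-made labels. The labels below are invented for the example and are not taken from the project's classification reports.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels: 6 passengers who died (0) and 4 who survived (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# A classifier that leans toward the majority class misses two survivors.
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)

print(cm.tolist())                       # [[5, 1], [2, 2]]
print(f"accuracy:  {accuracy:.2f}")      # accuracy:  0.70
print(f"precision: {precision:.2f}")     # precision: 0.67
print(f"recall:    {recall:.2f}")        # recall:    0.50
```

Recall for survivors counts only the true survivors (2 found out of 4), so it drops as soon as the model favors the majority class, even when most predictions overall are correct.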

Languages and Libraries

Python, pandas, matplotlib, Google Colab, Logistic Regression, Neural Networks, TensorFlow, SQLite3, Random Forest, Scikit-Learn, Tableau