Machine Learning: Titanic Survival

Overview

The sinking of the Titanic was a maritime disaster in which the RMS Titanic, a luxurious British passenger liner, struck an iceberg in the North Atlantic Ocean late on April 14, 1912, and sank in the early hours of April 15 during its maiden voyage from Southampton, England, to New York City. The Titanic was widely considered unsinkable, but the collision breached several of the ship's watertight compartments, and the resulting flooding sank the ship. Of the estimated 2,224 passengers and crew on board, more than 1,500 lost their lives, making it one of the deadliest peacetime maritime disasters in history. The tragedy led to the strengthening of maritime safety regulations.

The aim of this project is to demonstrate how well passenger survival can be predicted using three machine learning models: a neural network, logistic regression, and a random forest. Each model produces a binary output that classifies a passenger as a survivor or a non-survivor. The dataset was downloaded from Kaggle, pre-processed using Python, SQLite, and Scikit-Learn, and visualized in a dashboard created with Tableau.

Data Preparation

Below is a snippet of the dataset downloaded from Kaggle:

PassengerId: Passenger ID
Survived: Survival (0 = No; 1 = Yes)
Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name: Name
sex: Sex
Age: Age
SibSp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
Ticket: Ticket Number
Fare: Passenger Fare in British pounds
Cabin: Cabin
Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Boat identification number
body: Body identification number for passengers who did not survive

Missing values in the Age and Fare columns were filled with the mean of each column. The Pclass, Embarked, and Sex columns were converted into indicator variables using one-hot encoding. The original Pclass, Embarked, and Sex columns, as well as the Boat, Body, Name, Ticket, Cabin, and Index columns, were dropped because they were deemed unnecessary for the predictive analysis.
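The imputation and encoding steps described above can be sketched with pandas. This is a minimal illustration, not the project's actual notebook code: the small DataFrame is made-up sample data, and the exact column names and casing are assumptions based on the column list.

```python
import pandas as pd

# Made-up sample rows that mimic a few of the columns (not real passenger data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, None, 13.0],
    "Embarked": ["S", "C", "Q", "S"],
    "Name": ["A", "B", "C", "D"],
})

# Fill missing numeric values with the mean of each column.
for col in ["Age", "Fare"]:
    df[col] = df[col].fillna(df[col].mean())

# One-hot encode the categorical columns; get_dummies replaces the
# original columns with one indicator column per category.
df = pd.get_dummies(df, columns=["Pclass", "Embarked", "Sex"])

# Drop columns that are not used as features (only Name exists in this sample).
df = df.drop(columns=["Name"])

print(df.columns.tolist())
```

Because `pd.get_dummies` removes the columns it encodes, the original Pclass, Embarked, and Sex columns do not need a separate drop in this sketch.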

The data was then split into the target (the Survived column) and the input features (all remaining columns). The input features were scaled before the machine learning models were trained on them.
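A split-and-scale step of this kind might look like the following sketch with scikit-learn. The random feature matrix stands in for the prepared Titanic features, and the 75/25 split ratio, the `random_state` values, and the use of `StandardScaler` are assumptions, since the text does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 100 rows, 5 numeric features, alternating binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.arange(100) % 2

# Hold out a test set; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```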

Neural Network

The neural network had two hidden layers with 24 and 12 neurons, respectively. The output layer consisted of a single neuron to reflect the binary nature of the target. The hidden layers used the ReLU activation function, and the output layer used the sigmoid function. With 100 epochs, I achieved roughly 83% accuracy on the training data and 80% on the testing data. Below is the classification report.
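The architecture described above can be approximated as follows. The text does not name the framework used, so this sketch substitutes scikit-learn's `MLPClassifier` as a stand-in; the synthetic data from `make_classification` and the `random_state` values are likewise assumptions. For the default `adam` solver, `max_iter` sets the number of epochs, and for a binary target the classifier uses a single logistic (sigmoid) output unit.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared Titanic features and target.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Two hidden layers (24 and 12 neurons) with ReLU, trained for up to 100 epochs.
mlp = MLPClassifier(
    hidden_layer_sizes=(24, 12),
    activation="relu",
    max_iter=100,
    random_state=0,
)

# 100 epochs may not reach the convergence tolerance; silence that warning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    mlp.fit(X_train_scaled, y_train)

train_acc = mlp.score(X_train_scaled, y_train)
test_acc = mlp.score(X_test_scaled, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The accuracies printed here come from synthetic data and are not comparable to the figures reported for the Titanic dataset.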

Several attempts were made to optimize this model. The number of neurons, the number of hidden layers, the activation functions, and the number of epochs were adjusted, but performance did not improve. The age feature was then binned into multiple categorical variables to simplify learning. However, none of these attempts led to improvement.
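Binning a numeric age column into categorical indicator variables could be done along these lines. The bin edges and group labels below are hypothetical, chosen only for illustration; the text does not say which ranges were used.

```python
import pandas as pd

ages = pd.Series([4, 15, 29, 45, 71], name="Age")

# Hypothetical bin edges and labels; intervals are closed on the right,
# e.g. (0, 12] maps to "child".
bins = [0, 12, 18, 40, 60, 120]
labels = ["child", "teen", "adult", "middle_aged", "senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)

# One indicator column per age group.
age_dummies = pd.get_dummies(age_group, prefix="Age")
print(age_dummies.columns.tolist())
```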

Logistic Regression

Logistic regression was the second of the three machine learning models used. After the data was split and scaled and the model was fit, it produced an accuracy score of 82%. Its overall performance was slightly better than that of the other two models.
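A split, scale, and fit workflow for logistic regression might be sketched as below with scikit-learn. The synthetic data, the pipeline wrapper, and the default hyperparameters are assumptions, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared Titanic features and target.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The pipeline fits the scaler on the training data and reuses it at predict time.
log_reg = make_pipeline(StandardScaler(), LogisticRegression())
log_reg.fit(X_train, y_train)

log_reg_pred = log_reg.predict(X_test)
log_reg_acc = accuracy_score(y_test, log_reg_pred)
print(f"test accuracy: {log_reg_acc:.2f}")
```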

Random Forest

The last model was a random forest. Again, the data was split and scaled, and the model was trained. The accuracy score was around 80%.
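The same workflow with a random forest could look like this sketch. The synthetic data, the number of trees, and the `random_state` values are assumptions. Scaling is kept only to mirror the description; tree-based models generally do not depend on feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared Titanic features and target.
X, y = make_classification(n_samples=400, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

scaler = StandardScaler().fit(X_train)

# n_estimators=100 is the library default, written out here for clarity.
forest = RandomForestClassifier(n_estimators=100, random_state=2)
forest.fit(scaler.transform(X_train), y_train)

forest_acc = forest.score(scaler.transform(X_test), y_test)
importances = forest.feature_importances_
print(f"test accuracy: {forest_acc:.2f}")
```

The fitted model also exposes `feature_importances_`, which can help show which input features drive the predictions.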

Interactive Dashboard

An interactive dashboard that visualizes the survival rate against the input features was created using Tableau.

Review and Further Research

Although logistic regression was the best-performing of the three models, the differences in performance were small. All three models performed well on accuracy and precision, but less well on recall for survivors. A likely reason is that the dataset was imbalanced, with significantly fewer survivors than passengers who died. Given that none of the optimization attempts improved the neural network, a larger and more balanced dataset might help improve model performance. It would also be interesting to build a predictive dashboard using Streamlit, through which a user could predict whether a passenger would survive by entering different combinations of input features.
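To make the gap between precision and recall for survivors concrete, the sketch below computes both metrics from confusion-matrix counts. The counts are invented for illustration and are not the project's results; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for the positive class.

    tp: survivors correctly predicted as survivors
    fp: non-survivors wrongly predicted as survivors
    fn: survivors wrongly predicted as non-survivors
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only: the model is usually right when it predicts
# "survived" (high precision) but misses many actual survivors (lower recall).
precision, recall = precision_recall(tp=60, fp=15, fn=40)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints "precision=0.80, recall=0.60"
```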

ericyang91/Machine_Learning_Titanic_Survival
