Henry Bol edited this page May 22, 2019 · 5 revisions

Kaggle Team

GroningenML

Dataset

https://www.kaggle.com/c/titanic/data

  • training set (train.csv): PassengerId 1-891
  • test set (test.csv): PassengerId 892-1309 (418 in total)
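The split above can be written down as two ID ranges, which gives a quick check of the row counts. This is only a sketch: the names `TRAIN_IDS`, `TEST_IDS` and `split_of` are made up for illustration and are not part of the Kaggle files.

```python
# PassengerId ranges as stated above (range() excludes its upper bound).
TRAIN_IDS = range(1, 892)    # PassengerId 1-891
TEST_IDS = range(892, 1310)  # PassengerId 892-1309


def split_of(passenger_id: int) -> str:
    """Return which file a PassengerId belongs to (hypothetical helper)."""
    if passenger_id in TRAIN_IDS:
        return "train"
    if passenger_id in TEST_IDS:
        return "test"
    raise ValueError(f"PassengerId {passenger_id} is outside 1-1309")


train_size = len(TRAIN_IDS)  # 891 rows
test_size = len(TEST_IDS)    # 1309 - 892 + 1 = 418 rows
```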

Target

Predict passenger survival as accurately as possible. This is a binary classification problem evaluated on accuracy; the target variable is Survival.
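Accuracy here means the fraction of passengers whose predicted label equals the true label. A minimal sketch of that metric follows; the function name `accuracy` and the toy labels are assumptions for illustration, not real passenger data or the competition's scoring code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels for five hypothetical passengers (0 = No, 1 = Yes).
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
toy_score = accuracy(toy_true, toy_pred)  # 4 of the 5 labels match
```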

Variables

  • PassengerId: unique passenger identifier; 891 passengers in the training set
  • Survival (column Survived in train.csv): binary; 0 = No, 1 = Yes
  • Pclass: Ticket class
  • Name
  • Sex: binary; male / female
  • Age: Age in years; fractional if less than 1. If the age is estimated, it is in the form xx.5. 177 values are NaN (training set)
  • SibSp: number of siblings / spouses aboard the Titanic; categorical: 0,1,2,3,4,5,8 (training set and test set)
  • Parch: number of parents / children aboard the Titanic; categorical: 0,1,2,3,4,5,6 (training set) and 0,1,2,3,4,5,6,9 (test set)
  • Ticket: Ticket number; either digits only or a letter prefix followed by digits (check combinations and relevance). The training set has 681 unique ticket numbers across 891 passengers; passengers sharing a ticket number were probably travelling together (family or group)
  • Fare: Passenger fare
  • Cabin: Cabin number; categorical; 687 values are NaN (training set); 147 unique cabin values
  • Embarked: Port of embarkation; categorical (C = Cherbourg, Q = Queenstown, S = Southampton); 2 passengers have no recorded port (NaN, training set); both survived
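The counts in the list above (missing values, distinct categories, shared ticket numbers) can be reproduced with a few lines of standard-library Python. The sketch below runs on a tiny made-up sample in the train.csv column layout, not on real passenger rows; `SAMPLE_CSV` and `summarize_column` are hypothetical names introduced for illustration.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample using the train.csv header (not real passenger rows).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/5 0001,7.25,,S
2,1,1,"Doe, Mrs. Jane",female,38,1,0,PC 0002,71.28,C85,C
3,1,3,"Roe, Miss. Ann",female,,0,0,0003,7.92,,
4,0,2,"Poe, Mr. Sam",male,0.5,0,2,PC 0002,13.0,,Q
"""


def summarize_column(rows, column):
    """Return (number of empty cells, sorted distinct non-empty values)."""
    values = [row[column] for row in rows]
    missing = sum(1 for v in values if v == "")
    distinct = sorted({v for v in values if v != ""})
    return missing, distinct


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Empty cells in the CSV are what show up as NaN after loading.
age_missing, _ = summarize_column(rows, "Age")
cabin_missing, cabins = summarize_column(rows, "Cabin")
embarked_missing, ports = summarize_column(rows, "Embarked")

# Ticket numbers that occur more than once hint at passengers travelling together.
ticket_counts = Counter(row["Ticket"] for row in rows)
shared_tickets = {t: n for t, n in ticket_counts.items() if n > 1}
```

To run the same checks on the real data, replace `io.StringIO(SAMPLE_CSV)` with an opened train.csv file, for example `open("train.csv", newline="")`, assuming the file sits in the working directory.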