Posted on October 20, 2017 by Harish Krishnamurthy.
At the Open Data Science Conference in Boston on May 3rd, 2017, we presented an introductory workshop on Data Science with Python. It covered Python, the libraries needed for data science, and logistic regression on the Titanic dataset. The workshop was delivered with the help of our online self-learning platform: http://refactored.ai. You can still sign up, do the workshop and provide feedback on our platform. Here we present the example problem with visualizations in seaborn.
All the code and exercises are available on GitHub.
Titanic Survivors Problem
Kaggle hosts a famous dataset about the Titanic, describing who survived among the people aboard. Here is the description from Kaggle:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. The source of the competition is on Kaggle.
Why are we Revisiting Titanic?
The reason is to showcase how data preparation and feature engineering must be performed with visualization and statistical methods prior to modeling.
Loading the Data
- The Kaggle data has been split into train_data and test_data, and can be loaded with the pandas read_csv command:
    # Import libraries
    # %matplotlib inline is a Jupyter magic; drop it when running as a plain script.
    %matplotlib inline
    from sklearn.metrics import roc_curve, roc_auc_score
    # train_test_split lives in sklearn.model_selection in current scikit-learn
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import seaborn as sns
    import statsmodels.api as sm  # provides sm.Logit, used for modeling later

    sns.set()
    sns.set_style("whitegrid")

    train_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv")
    test_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv")
    train_data.head()
Titanic Survivors – Data Selection & Preparation
Prior to fitting a logistic regression model to classify who was likely to survive, we have to examine the dataset using EDA as well as other statistical methods. Logistic regression is a supervised learning technique, so it learns from rows whose outcome (Survived) is already known.
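As a rough illustration of what such a model computes, logistic regression passes a weighted sum of the features through the sigmoid function to obtain a probability. The sketch below uses made-up coefficients (`weights`, `intercept`) and a hypothetical helper `predict_proba`; none of the numbers are fitted values from the Titanic data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class for one row of features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for (Pclass, female); not fitted values.
weights = [-1.0, 2.5]
intercept = 1.0

p_first_class_woman = predict_proba([1, 1], weights, intercept)
p_third_class_man = predict_proba([3, 0], weights, intercept)
print(p_first_class_woman, p_third_class_man)
```

Fitting the model amounts to choosing the weights and intercept that make these probabilities agree as closely as possible with the observed outcomes.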
The training and test datasets cannot be used directly because of issues including, but not limited to:
- Sparse column entries in certain columns such as Cabin.
- NaN entries in columns.
- Categorical variables with string entries.
- Selection of the right columns.
Let us examine sparse columns by computing the ratio of NaNs to all values. The describe() function on a DataFrame reports the count of non-NaN values along with the mean, the quartiles (including the median) and other statistics, but only for float/integer columns.
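A minimal sketch of both checks, run on a tiny made-up frame (`df`) that stands in for train_data; the column names mirror the Titanic data but the values are invented:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train_data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# isnull() gives a boolean frame; its column means are the NaN ratios.
nan_ratio = df.isnull().mean()
print(nan_ratio)

# describe() reports count (non-null values), mean, quartiles, etc.
# for numeric columns only, so the string column Cabin is left out.
summary = df.describe()
print(summary.loc["count"])
```

The count row of describe() is the quickest way to spot a sparse numeric column, while the NaN ratio also covers string columns such as Cabin.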
Number of Survivors
Let us look at how many people survived the Titanic:
From the distribution, it can be observed that roughly 38% of the passengers in the training set survived, while about 62% (close to two thirds) did not.
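The class balance can also be read off numerically rather than from a plot. The sketch below uses a short made-up Survived column (`survived`) in place of train_data['Survived'], so the printed shares are illustrative only:

```python
import pandas as pd

# Made-up Survived column (1 = survived, 0 = died) standing in for train_data.
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Survived")

# Absolute counts per class, then the same as fractions of all rows.
counts = survived.value_counts()
shares = survived.value_counts(normalize=True)
print(counts)
print(shares)
```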
We can see that the Age column has 714 entries, leaving 891 - 714 = 177 missing values. That is 177/891 ≈ 0.20, or approximately 20% missing. If this percentage were small, we could choose to ignore those rows when fitting a logistic regression model. There are various methods to fill the missing values. But before we discuss ways to fix the sparsity of the Age column, let us examine the other columns as well.
PassengerId, Name, Ticket
We can see that PassengerId and Name are unique to each person, and Ticket is nearly so, hence they will not serve as columns for modeling. Logistic regression, like any supervised or unsupervised learning method, needs to find patterns in the dataset. This is a necessary condition: algorithms make sense of the available data by recording these patterns mathematically, and a value that appears only once carries no pattern. Hence, IDs and names are usually not useful for modeling directly. They are needed for identifying the person after recommendation, prediction or classification. They can also be useful later for deriving or imputing other columns, thereby improving the overall dataset.
The Cabin column is really sparse. Note that len(train_data.Cabin.unique()) returns 148, but that is the number of distinct values (including NaN), not the number of filled rows. train_data.Cabin.notnull().sum() gives 204, so only about 204/891 ≈ 0.23, or 23%, of the data is available. When data is very sparse, we can ignore it for modeling in the first iteration. Later, to improve the fit, this column can be investigated more deeply to extract more information.
The Embarked column shows the port where each passenger embarked. It has very little sparsity: len(train_data[train_data.Embarked.notnull()]) = 889, which is nearly all the data (891). Hence, it can be useful for modeling.
Person is a column we created ourselves by splitting age into the bands Child, Adult, Senior and Unknown. We have to determine how many Unknown people there are so that we can build better models. Since this variable depends directly on age, fixing the sparsity of age fixes it as well.
Sparsity in the dataset.
Let us analyze the sparsity in the columns of the dataset. This will give us an insight into what feature engineering needs to be done to resolve the sparsity.
    raw_features = train_data.columns
    for feature in raw_features:
        # Fraction of non-null entries in this column
        content_prop = train_data[feature].notnull().sum() / len(train_data.index)
        print(feature, content_prop)
Understanding Age of People Aboard
About 20% of the Age values are missing, which means we need to find ways to impute them.
We do not have any prior information about how a child's age was defined in those years. The age distribution is helpful, as we can categorize people as children, adults and seniors by looking at the peaks.
Let us define a function to categorize age into these three categories, with ages 16 and 48 as the separators read from the graph.
    def person_type(x):
        if x <= 16:
            return 'C'   # child
        elif x <= 48:
            return 'A'   # adult
        elif x <= 90:
            return 'S'   # senior
        else:
            return 'U'   # unknown: NaN fails every comparison above
The above function converts the continuous variable 'Age' into a categorical variable. We can use the apply function to transform each entry in the Age column and assign the result to Person.
    train_data['Person'] = train_data['Age'].apply(person_type)
    test_data['Person'] = test_data['Age'].apply(person_type)
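The same banding can be sketched without a row-by-row apply, using pandas' binning function pd.cut. This is an alternative to the approach above, not the workshop's method; the `ages` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, including an out-of-range value and a missing one.
ages = pd.Series([4, 16, 30, 48, 62, 95, np.nan])

# Right-closed bins: (-inf, 16] -> C, (16, 48] -> A, (48, 90] -> S.
bands = pd.cut(ages, bins=[-np.inf, 16, 48, 90], labels=["C", "A", "S"])

# Ages above 90 and missing ages fall outside every bin; label them U.
person = bands.astype(object).where(bands.notna(), "U")
print(person.tolist())
```

The bin edges are closed on the right, which matches the `<=` comparisons in person_type.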
We can now look at who was likely to survive depending on the type of Person with a factor plot (called catplot in seaborn 0.9 and later). A factor plot can take another factor, such as Sex, into account along with the age band.
    # factorplot was renamed catplot (and size became height) in seaborn 0.9
    g = sns.catplot(x="Person", y="Survived", hue="Sex", data=train_data,
                    height=5, kind="bar", palette="muted")
From the plot you can see that senior women had the highest probability of survival.
Now we can ignore the Sex type and visualize who was likely to survive.
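The bar heights in such a plot are group means of the 0/1 Survived column, so the same numbers can be obtained with a groupby. A minimal sketch on made-up rows (`df` is a stand-in for train_data):

```python
import pandas as pd

# Made-up rows standing in for train_data[['Person', 'Survived']].
df = pd.DataFrame({
    "Person": ["C", "C", "A", "A", "A", "S", "U", "U"],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
survival_rate = df.groupby("Person")["Survived"].mean()
print(survival_rate)
```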
Chances of Survival
Let us now look at the class from which people had better chances of survival:
    # factorplot was renamed catplot (and size became height) in seaborn 0.9
    g = sns.catplot(x="Pclass", y="Survived", data=train_data,
                    height=5, kind="bar", palette="muted")
We can see that 1st class passengers had much better chances of survival. We can also visualize the distributions for males and females using a violin plot. This is useful for comparing the mean and variance of survival in each class, grouped by sex.
The distribution graphs are shown on either side of the line for each class.
    sns.set(style="whitegrid", palette="pastel", color_codes=True)
    sns.violinplot(x="Pclass", y="Survived", hue="Sex", data=train_data,
                   split=True, inner="quart", palette="muted")
You can observe that the area under the curve is largest for Class 1 females near Survived = 1 and for Class 3 males near Survived = 0: Class 1 females were most likely to survive and Class 3 males were most likely to die.
Distribution of Fare
The Sex column has entries male or female, a categorical variable.
Imputation refers to methods of substituting estimates for missing values in the data. This is an important step, as the model can be trained better when more data becomes available after imputation. There are many known imputation methods. Sometimes, through analysis and EDA, we can design custom imputation methods that provide the best statistical estimates for the missing values. This reduces sparsity in the dataset. In the Titanic dataset, let us start investigating various methods to impute sparse columns.
To impute the Age column, we can use the name information: group the names by the titles Mr., Mrs., Miss and Master, and use the mean age of each group where ages are missing. Here are the estimates for each category:
    # regex=False matches the title literally (as a regex, '.' matches any character)
    miss_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False)].Age.mean()
    master_est = train_data[train_data['Name'].str.contains('Master. ', regex=False)].Age.mean()
    mrs_est = train_data[train_data['Name'].str.contains('Mrs. ', regex=False)].Age.mean()
    mr_est = train_data[train_data['Name'].str.contains('Mr. ', regex=False)].Age.mean()
The above estimates can be improved further by considering the Parch column (the number of parents/children aboard), as the names containing Master and Miss include a subset of children (Master and Miss refer to unmarried people). Here are estimates that take these rules into consideration:
    # Boolean masks for each title; regex=False matches the text literally
    is_miss = train_data['Name'].str.contains('Miss. ', regex=False)
    is_master = train_data['Name'].str.contains('Master. ', regex=False)
    is_mrs = train_data['Name'].str.contains('Mrs. ', regex=False)
    is_mr = train_data['Name'].str.contains('Mr. ', regex=False)

    girl_child_est = train_data[is_miss & (train_data['Parch'] == 1)].Age.mean()
    boy_child_est = train_data[is_master & (train_data['Parch'] == 1)].Age.mean()
    woman_adult_est = train_data[is_miss & (train_data['Parch'] == 0)].Age.mean()
    # Parch == 0 mirrors woman_adult_est; this estimate is not used by the imputation below
    man_adult_est = train_data[is_master & (train_data['Parch'] == 0)].Age.mean()
    woman_married_est = train_data[is_mrs].Age.mean()
    man_married_est = train_data[is_mr].Age.mean()
We shall use the above estimates with an imputation function that we will build based on the same rules as above.
    import math

    def impute_age(row):
        # row is one passenger (a Series); only fill in Age when it is missing
        if math.isnan(row['Age']):
            if 'Miss. ' in row['Name'] and row['Parch'] == 1:
                return girl_child_est
            elif 'Master. ' in row['Name'] and row['Parch'] == 1:
                return boy_child_est
            elif 'Miss. ' in row['Name'] and row['Parch'] == 0:
                return woman_adult_est
            elif 'Mrs. ' in row['Name']:
                return woman_married_est
            else:
                return man_married_est
        else:
            return row['Age']

    train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
    test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)
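A simpler variant of the same idea, sketched here as an alternative rather than as the workshop's method, extracts the title with a regular expression and fills each missing age with the mean age of that title. The frame `df` and the `Title` column are assumptions for illustration; two of the rows ("Smith" and "Doe") are invented.

```python
import numpy as np
import pandas as pd

# Small stand-in for train_data; the second and fourth rows are made up.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Smith, Mr. John",
        "Heikkinen, Miss. Laina",
        "Doe, Miss. Jane",
        "Palsson, Master. Gosta Leonard",
    ],
    "Age": [22.0, np.nan, 26.0, np.nan, 2.0],
})

# The title sits between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Mean age per title, aligned back to every row of the frame.
title_mean = df.groupby("Title")["Age"].transform("mean")

# Keep known ages; use the title mean only where Age is missing.
df["Imputed_Age"] = df["Age"].fillna(title_mean)
print(df[["Title", "Age", "Imputed_Age"]])
```

This version ignores Parch, so it corresponds to the first set of estimates (one mean per title) rather than the refined child/adult split.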
Let us start preparing the features for prediction.
    # Join the one-hot columns only once; re-running this cell would fail
    # because the columns already exist.
    try:
        # dtype=int keeps the indicator columns numeric (0/1) for modeling
        train_embarked = pd.get_dummies(train_data['Embarked'], dtype=int)
        train_sex = pd.get_dummies(train_data['Sex'], dtype=int)
        train_data = train_data.join([train_embarked, train_sex])
        test_embarked = pd.get_dummies(test_data['Embarked'], dtype=int)
        test_sex = pd.get_dummies(test_data['Sex'], dtype=int)
        test_data = test_data.join([test_embarked, test_sex])
    except ValueError:
        print("The columns already have the appropriate features.")
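To see what get_dummies produces, here is a small sketch on a made-up Embarked column (`embarked`). Each category becomes its own 0/1 column named after the category, which is where feature names such as C, Q and female come from.

```python
import pandas as pd

# Made-up Embarked values, including one missing entry.
embarked = pd.Series(["S", "C", None, "Q"], name="Embarked")

# One indicator column per category, in sorted order: C, Q, S.
dummies = pd.get_dummies(embarked, dtype=int)
print(dummies)
```

A missing value yields a row of zeros. When the indicators are used as features, one level per variable (for example S and male) is commonly left out to serve as the baseline.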
Logistic Regression using Statsmodels
    import matplotlib.pyplot as plt

    features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']
    log_model = sm.Logit(train_data['Survived'], train_data[features]).fit()
    y_pred = log_model.predict(train_data[features])
    # roc_curve returns (false positive rate, true positive rate, thresholds)
    roc_survival = roc_curve(train_data['Survived'], y_pred)

    sns.set_style("whitegrid")
    # ROC curve: false positive rate on x, true positive rate on y
    plt.plot(roc_survival[0], roc_survival[1])
    plt.show()
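The ROC curve is often summarized by the area under it (AUC), which equals the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The sketch below computes that quantity directly from pairs; `auc_by_pairs` is a hypothetical helper written for illustration, and the labels and scores are made up. In practice roc_auc_score, imported at the top, computes the same number.

```python
def auc_by_pairs(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.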
Logistic Regression using Scikit-Learn
Here we shall learn how to perform modeling using scikit-learn:
    from sklearn import metrics
    from sklearn.linear_model import LogisticRegression

    # Modify the code below to include all possible features.
    features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']
    log_sci_model = LogisticRegression()
    log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])
    # score() returns the mean accuracy, here measured on the training data itself
    log_sci_model.score(train_data[features], train_data['Survived'])
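Accuracy measured on the rows used for fitting tends to be optimistic. One common remedy, sketched here on synthetic data rather than the Titanic columns, is to hold out part of the training set with train_test_split and score on that part. The arrays `X` and `y` are assumptions standing in for train_data[features] and train_data['Survived'].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_data[features] and train_data['Survived'].
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for validation; random_state fixes the split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Compare accuracy on seen and unseen rows.
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(train_acc, val_acc)
```

A large gap between the two accuracies would suggest the model is overfitting the training rows.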