Posted on October 20, 2017 by Harish Krishnamurthy .

At the Open Data Science Conference in Boston, held on May 3rd, 2017, we presented an introductory workshop on Data Science with Python. It covered Python, the libraries needed for data science, and logistic regression on the Titanic dataset. The workshop was run with the help of our online self-learning platform. You can still sign up, do the workshop and provide feedback on our platform. Here we present the example problem with visualization in seaborn.

All the code and exercises are available on GitHub.

Titanic Survivors Problem

Kaggle posted a famous dataset from the Titanic about who survived among the people aboard. Here is the description from Kaggle:

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. The source of the competition is on kaggle.

Why are we Revisiting Titanic?

The reason is to showcase how data preparation and feature engineering must be performed with visualization and statistical methods prior to modeling.

Loading the Data

  • The data has been split into train_data, test_data and can be loaded from kaggle with read_csv command:
# Import libraries
%matplotlib inline

from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

import pandas as pd
import seaborn as sns
import statsmodels.formula.api as sm

train_data = pd.read_csv("")
test_data = pd.read_csv("")

Titanic Survivors – Data Selection & Preparation

Prior to fitting a logistic regression model for classifying who would likely survive, we have to examine the dataset using EDA as well as other statistical methods. Logistic regression is a supervised learning technique.

The dataset from training and testing has data that cannot be directly used due to many issues including but not limited to:

  • Sparse column entries in certain columns such as Cabin.
  • NaN entries in columns.
  • Categorical variables with string entries.
  • Selection of right columns.


Let us examine sparse columns by counting the ratio of NaNs to all the values. The describe() function on a dataframe reports the number of non-NaN values, the mean, the median (shown as 50%) and other summary statistics, by default only for float/integer columns.

Number of Survivors

Let us look at how many people survived the Titanic:


From the distribution, it can be observed that those who survived are approximately 38% of the training set (342 of 891), a little over one third.



We can see that the Age column has 714 non-null entries, which leaves 891 – 714 = 177 missing values. This is 177/891 ≈ 0.20, or approximately 20% missing values. If this percentage were small, we could choose to ignore those rows when fitting a logistic regression model. There are various methods to fill the missing values, but before we discuss how to fix the sparsity of the Age column, let us examine the other columns as well.

PassengerId, Name, Ticket

We can see that PassengerId, Name and Ticket are all unique to each person and hence will not serve as columns for modeling. Logistic regression, like any supervised or unsupervised learning method, needs to find patterns in the dataset. This is a necessary condition, so that algorithms can make sense of the available data by recording these patterns mathematically. Hence, Ids and Names are usually not useful for modeling directly. They are needed for identifying the person after recommendation, prediction or classification. They can also be useful later for deriving or imputing other columns, thereby improving the overall dataset.


The Cabin column is really sparse: only 204 of the 891 rows (about 23%) have a value. Note that len(train_data.Cabin.unique()) returns the number of distinct cabin values (148, counting NaN as one value), not the number of filled rows; use train_data.Cabin.notnull().sum() for that. When data is very sparse, we can ignore it for modeling in the first iteration. Later, to improve the fit, this column can be investigated more deeply to extract more information.


The Embarked column shows the port where each passenger embarked. It has very little sparsity: len(train_data[train_data.Embarked.notnull()]) = 889, which is nearly all of the data (891). Hence, it can be useful for modeling.


Person is a column we created ourselves by splitting age into the bands Child, Adult, Senior and Unknown. We have to determine how many Unknown people there are so that we can build better models. Since this variable depends directly on age, fixing the sparsity of age will fix this as well.

Sparsity in the dataset.

Let us analyze the sparsity in the columns of the dataset. This will give us an insight into what feature engineering needs to be done to resolve the sparsity.

raw_features = train_data.columns

for feature in raw_features:
    content_prop = train_data[feature].notnull().mean()  # fraction of non-null entries
    print(feature, content_prop)

Understanding Age of People Aboard

About 20% of age is missing which means we need to find ways to impute it.


We do not have any prior information about how the age of a child was defined in those years. The age distribution is still helpful, as we can categorize people as children, adults and seniors by looking at the peaks.

Let us define a function to categorize the age into these three categories, with the ages 16 and 48 as the separators read from the graph.

def person_type(x):
  # Child, Adult, Senior; anything else (including NaN) is Unknown.
  if x <= 16:
    return 'C'
  elif x <= 48:
    return 'A'
  elif x <= 90:
    return 'S'
  else:
    return 'U'

The above function categorizes the continuous variable ‘Age’ into a categorized variable. We can use apply function to transform each entry in the Age column and assign it to Person.

train_data['Person'] = train_data['Age'].apply(person_type)
test_data['Person'] = test_data['Age'].apply(person_type)
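
To see how the banding behaves, including for missing ages, here is a small self-contained sketch. The sample ages are made up for illustration and are not taken from the Titanic data; the function is a version of the banding rule with an explicit fallback branch.

```python
import numpy as np
import pandas as pd

def person_type(age):
    # Comparisons with NaN are always False, so missing ages fall through to 'U'.
    if age <= 16:
        return 'C'
    elif age <= 48:
        return 'A'
    elif age <= 90:
        return 'S'
    else:
        return 'U'

# Hypothetical ages, not taken from the Titanic data.
sample = pd.DataFrame({'Age': [4, 16, 30, 48, 62, np.nan]})
sample['Person'] = sample['Age'].apply(person_type)

# Count how many rows landed in each band, including Unknown.
counts = sample['Person'].value_counts()
print(sample)
print(counts)
```

The count of 'U' entries is the number of people whose band depends on imputing Age first.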

We can now look at who is likely to survive depending on the type of Person with a factor plot. A factor plot can consider another factor such as Sex along with the Age.

# Note: in seaborn 0.9 and later, factorplot is named catplot and size is named height.
g = sns.factorplot(x="Person", y="Survived", hue="Sex", data=train_data,
                   size=5, kind="bar", palette="muted")

From the plot you can see that senior women had the highest probability of survival.

Now we can ignore the Sex type and visualize who was likely to survive.

Chances of Survival

Let us now look at the class from which people had better chances of survival:

g = sns.factorplot(x="Pclass", y="Survived", data=train_data, size=5,
                   kind="bar", palette="muted")

We can see that 1st class passengers had much better chances of survival. We can also visualize the distribution for males and females using a violin plot. This is useful for comparing the mean and spread of the survival outcome in each class, grouped by sex.

The distribution graphs are shown on either side of the line for each class.

sns.set(style="whitegrid", palette="pastel", color_codes=True)
sns.violinplot(x="Pclass", y="Survived", hue="Sex", data=train_data, split=True,
               inner="quart", palette="muted")

You can observe that the density for Class 1 females is concentrated near Survived = 1 and the density for Class 3 males is concentrated near Survived = 0: Class 1 females are most likely to survive and Class 3 males are most likely to die.

Distribution of Fare


Statistical Imputations

Sex Column

The Sex column has entries male or female, a categorical variable.

Imputation refers to methods of substituting estimates for missing values in the data. This is an important step that can train the model better, as more data becomes available after imputation. There are many known methods of imputation. Sometimes, through analysis and EDA, we can design custom imputation methods that provide the best statistical estimates for the missing values. This reduces sparsity in the dataset. In the Titanic dataset, let us start investigating various methods to impute sparse columns.

Age Column

To impute the Age column, we can use the information in the Name column. We find the names containing Mr., Mrs., Miss and Master, and use the mean age of each group where the ages are missing. Here are the estimates for each category:

# regex=False matches the title literally; otherwise '.' would match any character.
miss_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False)].Age.mean()
master_est = train_data[train_data['Name'].str.contains('Master. ', regex=False)].Age.mean()
mrs_est = train_data[train_data['Name'].str.contains('Mrs. ', regex=False)].Age.mean()
mr_est = train_data[train_data['Name'].str.contains('Mr. ', regex=False)].Age.mean()

The above estimates can be improved further by considering the Parch (parents/children) column, since the names containing Master and Miss (both unmarried) include a subset of children. Here are the estimates that take these rules into consideration:

girl_child_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False) & (train_data['Parch'] == 1)].Age.mean()
boy_child_est = train_data[train_data['Name'].str.contains('Master. ', regex=False) & (train_data['Parch'] == 1)].Age.mean()
woman_adult_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False) & (train_data['Parch'] == 0)].Age.mean()
# Parch == 0 mirrors woman_adult_est; the original line repeated the boy_child_est filter.
man_adult_est = train_data[train_data['Name'].str.contains('Master. ', regex=False) & (train_data['Parch'] == 0)].Age.mean()
woman_married_est = train_data[train_data['Name'].str.contains('Mrs. ', regex=False)].Age.mean()
man_married_est = train_data[train_data['Name'].str.contains('Mr. ', regex=False)].Age.mean()

We shall use the above estimates with an imputation function that we will build based on the same rules as above.

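import math
def impute_age(row):
    # Use column labels rather than positions: test_data has no Survived column,
    # so positional indices would point at different columns there.
    if math.isnan(row['Age']):
        if ('Miss. ' in row['Name']) and (row['Parch'] == 1):
            return girl_child_est
        elif ('Master. ' in row['Name']) and (row['Parch'] == 1):
            return boy_child_est
        elif ('Miss. ' in row['Name']) and (row['Parch'] == 0):
            return woman_adult_est
        elif 'Mrs. ' in row['Name']:
            return woman_married_est
        else:
            return man_married_est
    else:
        # Age is known: keep it unchanged.
        return row['Age']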

train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)
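
The same idea can also be expressed with a title column and a group-wise mean. This is an alternative sketch, not the approach used above: the Title column, the regular expression and the toy rows are all assumptions made for illustration, and it ignores the Parch rule.

```python
import numpy as np
import pandas as pd

# Toy rows in the "Surname, Title. Given names" format (made up).
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Mrs. Jane', 'Doe, Miss. Ann',
             'Doe, Master. Tom', 'Roe, Mr. Paul', 'Roe, Miss. Eve'],
    'Age': [40.0, 38.0, 20.0, 6.0, np.nan, np.nan],
})

# Extract the text between the comma and the first period as the title.
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

# Mean age per title, computed from the rows where Age is known.
title_mean = toy.groupby('Title')['Age'].transform('mean')

# Fill only the missing ages; known ages are left unchanged.
toy['Imputed_Age'] = toy['Age'].fillna(title_mean)
print(toy[['Name', 'Title', 'Age', 'Imputed_Age']])
```

Grouping by an extracted title avoids one hand-written estimate per category, at the cost of losing the finer child/adult split.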

Feature Mapping

Let us start preparing the features for prediction.

# You don't want this to join twice if you are attempting this lesson multiple times,
# so only join when the dummy columns are not there yet.
if 'female' not in train_data.columns:
    train_embarked = pd.get_dummies(train_data['Embarked'])
    train_sex = pd.get_dummies(train_data['Sex'])
    train_data = train_data.join([train_embarked, train_sex])

    test_embarked = pd.get_dummies(test_data['Embarked'])
    test_sex = pd.get_dummies(test_data['Sex'])
    test_data = test_data.join([test_embarked, test_sex])
else:
    print("The columns already have the appropriate features.")

Logistic Regression using Statsmodels

features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']
log_model = sm.Logit(train_data['Survived'], train_data[features]).fit()
y_pred = log_model.predict(train_data[features])

import matplotlib.pyplot as plt

# roc_curve returns (false positive rates, true positive rates, thresholds).
fpr, tpr, thresholds = roc_curve(train_data['Survived'], y_pred)
plt.plot(fpr, tpr)
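
The area under this ROC curve can be read as the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The sketch below computes that quantity directly by counting pairs, using made-up labels and scores. The helper name auc_by_pairs is hypothetical; in practice roc_auc_score from sklearn.metrics, already imported above, is the usual call.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = auc_by_pairs(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.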

Logistic Regression using Scikit-Learn

Here we shall learn how to perform modeling using scikit-learn:

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Modify the code below to include all possible features.

features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']
log_sci_model = LogisticRegression()
log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])

# Mean accuracy on the training data.
log_sci_model.score(train_data[features], train_data['Survived'])

Posted on October 20, 2017 by Harish Krishnamurthy .

Jensen’s Inequality states that given g, a strictly convex function, and X, a random variable, then

$$E[g(X)] \geq g(E[X])$$

Here we consider a strictly convex function applied to a uniform random variable and visualize the inequality. Consider a parabolic function with an offset, which is strictly convex as shown in the cover figure, where f(x) is the pdf of the uniform random variable. The EM algorithm relies on the same result: the logarithm in the log-likelihood is a strictly concave function, and hence the opposite of the inequality holds true. This means that the expected value taken over the distribution of the latent variables is a lower bound, so any maximum it attains is guaranteed to lie below the log-likelihood function. Hence, we can keep maximizing the expected values over many iterations, leading to convergence. The EM derivation follows from the fact that if g is strictly convex, then E[g(X)] = g(E[X]) holds true if and only if X = E[X] with probability 1, which implies that X is a constant. We constantly work on such problems that help illustrate concepts in Machine Learning through visual mathematical examples.

%matplotlib inline
from scipy import stats, integrate

import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Plot the Strictly Convex Region

c = -12

x = np.arange(-10, 11, 1)
y = np.square(x) + c

plt.plot(x, y)

Expected value of a continuous function of a random variable

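Let us look at a uniform random variable $X \sim U(a, b)$ where the function on the variable is a convex function. The expected value of a function can be derived with preliminary calculus. For $g(x) = x^2 + c$ and the uniform density $f(x) = 1/(b - a)$ on $[a, b]$, it results in the following equations:

$$E[g(X)] = \int_a^b (x^2 + c)\,\frac{1}{b - a}\,dx = \frac{1}{b - a}\left(\frac{b^3 - a^3}{3} + c\,(b - a)\right)$$

$$g(E[X]) = \left(\frac{a + b}{2}\right)^2 + c$$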

Uniform Random Variable

Let us create a set of uniform random variables with (a, b)s in a list and plot them:

# (a, b)s in a udf
urv_list = [(-1, 2), (-1, 3), (-1, 4), (-1, 5), (-1, 6),
            (-9, 1), (-9, 2), (-9, 3), (-9, 4), (-9, 5), 
            (-9, 6), (-9, 7)]

fig, ax = plt.subplots(figsize=(15, 10))

for (a, b) in urv_list:
    spread = (b - a)
    linestyles = ['-']
    mu = 1/2*(a + b)
    x_uniform = np.linspace(a-1, b+1, 1000)
    left = mu - 0.5 * spread
    dist = stats.uniform(left, spread)
    plt.plot(x_uniform, dist.pdf(x_uniform),c='green',
    label=r'$a=%i, b=%i$' % (a, b))

plt.xlim(-15, 15)

Compare E[g(X)] and g(E[X])

x = np.arange(-10, 11, 1)
y = np.square(x) + c

def g_ex(a, b):
    """
    Computes g(E[X]).
    Args:
        a (float): Initial Point of Uniform Random Variable.
        b (float): End Point of Uniform Random Variable.
    Returns:
        g(E[X]) : Function of the Expected Value.
    """
    mean = (a + b)/2
    return mean**2 + c

def e_gx(a, b):
    """
    Computes E[g(X)].
    Args:
        a (float): Initial Point of Uniform Random Variable.
        b (float): End Point of Uniform Random Variable.
    Returns:
        E[g(X)] : Expected value of function.
    """
    p_x = 1/(b - a)
    return ((b**3 - a**3)/3.0 + c*(b - a))*p_x

fig, ax = plt.subplots()

for (a, b) in urv_list:
    mean = (a + b)/2
    plt.plot(x, y)
    plt.plot(mean, g_ex(a, b), 'o', color='green')  # g(E[X]) in green
    plt.plot(mean, e_gx(a, b), 'ro')                # E[g(X)] in red
    print(e_gx(a, b), g_ex(a, b))

fig.set_size_inches(12, 8)
plt.xlim(-5, 5)

You can see from the above that for all distributions, E[g(X)] >= g(E[X]). The green dots show the value of the function at the expected value of X, and the red dots show the corresponding expected values of the function, plotted at the same x position (the mean).
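
The inequality can also be checked numerically by sampling, without the closed-form integrals. The sketch below is a Monte Carlo check under stated assumptions: the sample size, the seed and the subset of (a, b) pairs are arbitrary choices. For the parabola g(x) = x² + c, the gap E[g(X)] − g(E[X]) equals Var(X), which is (b − a)²/12 for a uniform variable.

```python
import numpy as np

c = -12  # same offset as the parabola above

def g(x):
    return np.square(x) + c

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
results = []
for (a, b) in [(-1, 2), (-1, 6), (-9, 1), (-9, 7)]:
    samples = rng.uniform(a, b, size=200_000)
    e_gx_mc = g(samples).mean()    # Monte Carlo estimate of E[g(X)]
    g_ex_mc = g(samples.mean())    # g evaluated at the sample mean
    gap_exact = (b - a) ** 2 / 12  # Var(X) for U(a, b)
    results.append((a, b, e_gx_mc, g_ex_mc, gap_exact))
    print(a, b, round(e_gx_mc, 3), round(g_ex_mc, 3), round(gap_exact, 3))
```

Wider intervals have larger variance, so the gap between the two values grows with b − a.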

Posted on October 19, 2017 by Harish Krishnamurthy .

At the StampedeCon AI Summit in St. Louis, Colaberry Consulting presented how to take a complex idea in the AI domain, apply ML algorithms to it and deploy it in production using the platform. More details, including code examples, are available on the platform. This, as we see it, is a common area of interest for many organizations, large and small. Achieving it requires an efficient pipeline in which Data Scientists, Machine Learning Engineers and Data Engineers work in collaboration.

Often Machine Learning Engineers build models that are used by Data Scientists, who apply statistical techniques to tweak the models and improve the accuracy of the generated outputs. This is by no means the only arrangement that exists across companies. We do see all areas of AI treated as the same, which affects the organization negatively, as people with misaligned skills are expected to work on problems that are not their fit. This is largely due to a lack of understanding of the history of the domain by the people with decision-making power. It is also easier to teach programming to people with math backgrounds (such as EE, Math, Stats and Biostats, who are a better fit for Data Analytics/AI than CS grads, who generally lack skills in Linear Algebra and Probability) than to teach math to programmers. However, the discussion here deals mostly with coding guidelines. We shall consider a problem in the AI domain and see what steps we can take to bring it to production.

Building a Credit Classifier

Consider a credit rating system where the objective is to classify the datasets into good credit and bad credit.

German Credit Data 

German Credit Data is a dataset in the UCI Repository containing credit information for various customers. Our task is to segregate customers into Good Credit customers and Bad Credit customers. The data is extensive and consists of 20 attributes, mainly categorical. The dataset was provided by Prof. Hofmann and contains categorical/symbolic attributes.

Spark for Raw Data Ingestion

The raw data needs to be sampled so that we can start out to extract features from it for modeling.

# Load and parse the data
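# `spark` is assumed to be the active SparkSession (predefined in Databricks notebooks).
credit_data = spark.read.format("csv").load("/german_credit.txt", sep=" ")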

training_sample = credit_data.sample(False, 0.1, 20171001)

Data Cleaning

  • We shall start out by adding a new column called ‘status’ and assigning it as either good/bad from the ‘good/bad’ column.
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score, roc_curve, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import scale
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

german_data = pd.read_csv('../data/german.txt', sep=" ")
columns = ['checkin_acc', 'duration', 'credit_history', 'purpose', 'amount',
           'saving_acc', 'present_emp_since', 'inst_rate', 'personal_status',
           'other_debtors', 'residing_since', 'property', 'age', 'inst_plans', 
           'housing', 'num_credits', 'job', 'dependents', 'telephone', 'foreign_worker', 'good/bad']
german_data.columns = columns
german_data["status"] = np.where(german_data['good/bad'] == 1, "Good", "Bad")

Exploratory Data Analysis

  • Countplot of status, Histogram of credit amount, Amount comparison by credit status, Age of Borrowers, Regression plot of duration vs amount.
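# Drop the helper column once the plots are done; the guard makes the cell safe to re-run.
if "status" in german_data.columns:
    german_data = german_data.drop("status", axis=1)
    print("Status is dropped")
# Recode the target from {1, 2} to {0, 1}.
german_data['good/bad'] = german_data['good/bad']-1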

features = ['checkin_acc', 'credit_history', 'purpose','saving_acc', 
           'present_emp_since', 'personal_status', 'other_debtors',
           'property','inst_plans', 'housing','job', 'telephone', 'foreign_worker']
credit_features = pd.get_dummies(german_data, prefix=features, columns=features)

Prepare Dataset for Modeling

Split for modeling.


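# Reconstructed step: X, Y, num and cat are not defined in the original listing,
# so the four definitions below are assumptions based on the column names above.
Y = credit_features['good/bad']
X = credit_features.drop('good/bad', axis=1)

#Standardizing the dataset
names = list(X.columns.values)
num = ['duration', 'amount', 'inst_rate', 'residing_since', 'age', 'num_credits', 'dependents']
cat = [name for name in names if name not in num]

#Performing the Scaling function
X_scale = pd.DataFrame(scale(X[num]), index=X.index)
X_scale.columns = num
X = pd.concat((X_scale,X[cat]), axis=1)
#Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)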

Model Selection

Let us apply various classifiers to the same dataset and discover the best performing model.

  • Append the CV scores to the list, cv_scores.
  • Make predictions on test set with the best model, best_model (var name) and assign the predictions to variable, y_hat.

Machine Learning Solutions

  • We see that the models have not used any causal features. Looking at causal features would involve looking up research journals in the area and going for a custom implementation.
  • During the problem discovery phase it is good to look up ML solutions.
models = []

models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
seed = 7
scoring = 'accuracy'
cv_scores = []
names = []
accuracy_scores = list()

for name, model in models:
    # shuffle=True is required when random_state is set in recent scikit-learn versions.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    cv_scores.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    #Printing the accuracy achieved by each model
    print(msg)

#Plotting the model comparison as box plot
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(cv_scores)
ax.set_xticklabels(names)
plt.show()
# Make predictions on Test dataset using SVM.
best_model = SVC().fit(X_train, y_train)
y_hat = best_model.predict(X_test)

svm_score = accuracy_score(y_test, y_hat)
print(confusion_matrix(y_test, y_hat))
print(classification_report(y_test, y_hat))
[[197  17]
 [ 58  28]]
             precision    recall  f1-score   support

          0       0.77      0.92      0.84       214
          1       0.62      0.33      0.43        86

avg / total       0.73      0.75      0.72       300

Scaling to Big Data

Since we have already done the cleaning in the experiments phase, let us borrow some portions of the data cleaning to build out the ingestion portion. We shall use a Python notebook to experiment and set up the ingestion pipeline.

  • Copy the dataframe map functions and other related data cleaning regular expressions to Spark.
  • Pandas dataframe & Spark dataframes have similar functions.
  • Save ingested data, sample.
  • Save feature vectors, feature sample.

Data Ingestion Spark Module

This module can be packaged as an ingestion module and added to the automation tasks by IT.

  • The module uses regular expressions and map and reduce functions in Spark.

# display(dbutils.fs.ls("dbfs:/FileStore/tables/ohvubrzw1507843246878/"))
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import SQLContext
from copy import deepcopy

sqlContext = SQLContext(sc)
credit_df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "True").load("dbfs:/FileStore/tables/ohvubrzw1507843246878/german_credit.csv")

# Take a 1% sample of the ingested data (without replacement, fixed seed).
credit_df_sample = credit_df.sample(False, 1e-2, 20171010)

credit_data = credit_df.rdd
feature_cols = credit_df.columns
target = 'Creditability'

def amount_class(amount):
    """
    Classify the amount into different classes.
    Args:
        amount (float): Total amount
    Returns:
        class (int): Type of class.
    """
    if amount < 5000:
        return 1
    elif amount < 10000:
        return 2
    elif amount < 15000:
        return 3
    else:
        return 4

credit_data_new = credit_data.map(lambda row: (row, amount_class(row['Credit Amount'])))

Spark MLLib

Use the Spark Machine Learning Library to train the SVM on a sample of the dataset. This module can involve custom ML implementations depending on the type of problems.

  • Increase sample size as the training succeeds.
  • Ingest all data
def features(row):
  """
  Gathers features from the dataset.
  Args:
    row (rdd): Each row in the dataset.
  Returns:
    (LabeledPoint): Labeled Point of each row.
  """
  feature_map = row.asDict()
  label = float(feature_map[target])
  # Exclude the target column so the label does not leak into the features.
  feature_list = [float(feature_map[feature]) for feature in feature_cols if feature != target]
  return LabeledPoint(label, feature_list)

credit_features = credit_data.map(features)
credit_features_df = credit_features.toDF()

# Sample the features and save it.
credit_sample_df = credit_features.sample(False, 1e-2, 20171010)

# Read the credit data features
credit_features = sqlContext.parquetFile('dbfs:/FileStore/tables/credit.parquet')
model = SVMWithSGD.train(credit_features, iterations=100)

# Evaluating the model on training data
labels_preds = credit_features.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labels_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(labels_preds.count())
print("Training Error = " + str(trainErr))

# Print the coefficients and intercept for the linear SVM
print("Coefficients: " + str(model.weights))
print("Intercept: " + str(model.intercept))

# Save and load model
model.save(sc, "credit/SVMWithSGDModel")
same_model = SVMModel.load(sc, "credit/SVMWithSGDModel")


The above example depicts how to build a credit classifier prototype using small data, roll it out to production by applying it to big data, and improve it incrementally. This is by no means the only arrangement that exists across companies.
To discuss a pipeline that is relevant to your organization, you can reach out to us. If you are looking to get started with data science or to advance your DS, ML and AI skills, check out our learn-data-science-by-doing platform.

Posted on October 9, 2017 by Harish Krishnamurthy .

Maximum Likelihood

Maximum Likelihood Estimation (MLE) suffers from overfitting when the number of samples is small. Suppose a coin is tossed 5 times and you have to estimate the probability of heads. A Maximum Likelihood estimate dictates that the probability is (#Heads/#Total Coin Toss events). This would be the estimate if we assumed that the samples were generated according to the binomial distribution, as shown in Figure 2. Consider a case where, out of 5 coin tosses, we ended up with all 5 heads, or with 4 heads and 1 tails. Then the MLE value would be either 1.0 or 0.8, which we know is not accurate: a fair coin has only two equally likely outcomes, heads or tails, and hence the probability for an unbiased coin should be 0.5. We do know that as the number of coin tosses increases, we end up with a more realistic value of the probability. To incorporate this belief, a conjugate prior is introduced. Here we illustrate with an experiment how we can arrive at the true value of the probability by increasing the number of samples and harnessing the conjugate prior known as the Beta distribution.

Conjugate Prior

A conjugate prior is a distribution that describes the distribution of the latent variable with a mathematical formulation similar to that of the likelihood. In the above scenario of a coin toss, if we assume that the probability of heads, $\theta$, is itself a random variable that conforms to a probability distribution, then we can say that it was picked according to:

$$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$$

where $\alpha$ and $\beta$ are the parameters that shape the beta distribution. The beta distribution has the same form as the binomial likelihood function for $h$ heads and $t$ tails:

$$p(h, t \mid \theta) = \binom{h + t}{h}\,\theta^{h}(1 - \theta)^{t}$$

Hence, we can derive the probability of the coin toss as a posterior formulation:

$$p(\theta \mid h, t) \propto \theta^{h + \alpha - 1}(1 - \theta)^{t + \beta - 1}$$

which essentially implies that the probability of a coin toss depends on the number of heads and the number of tails. The mean and variance of the beta distribution indicate the probability estimate and the uncertainty involved. We can see that the uncertainty, or variance, reduces as the number of samples increases.
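
As a minimal sketch of this update, the posterior for a Beta prior with parameters alpha and beta after observing some heads and tails is again a Beta distribution, with parameters alpha + heads and beta + tails. The function name, the Beta(2, 2) prior and the toss counts below are assumptions chosen for illustration, not the Berkeley data.

```python
def beta_posterior(heads, tails, alpha=2.0, beta=2.0):
    """Return (mean, variance) of the posterior Beta(alpha + heads, beta + tails)."""
    a = alpha + heads
    b = beta + tails
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

# Hypothetical toss counts; the prior Beta(2, 2) is an assumption.
for heads, tails in [(5, 0), (6, 4), (52, 48), (5030, 4970)]:
    mle = heads / (heads + tails)            # maximum likelihood estimate
    mean, variance = beta_posterior(heads, tails)
    print(heads + tails, round(mle, 3), round(mean, 3), variance)
```

With 5 heads out of 5 tosses the MLE is 1.0, while the posterior mean is pulled towards 0.5 by the prior; as the counts grow, the two estimates agree and the variance shrinks.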

Python Experiment

Let us now simulate these from a sample of the data collected at the University of California, Berkeley, where 40K coin tosses were performed.

Define the parameters of the distribution and the experiment. We consider coin tosses at different sample sizes to see how the mean and variance behave as the number of samples increases. The dataframe with the toss results is shown on the right. We shall now define some parameters, such as the mean and the variance of the beta distribution.

Let us now plot various values of alpha and beta by increasing the number of samples.

We can now plot these values iteratively: the estimates from smaller samples are shown below in dotted lines, and the more accurate estimates from larger samples are plotted in dark lines.

You can see that for the estimates closer to 0.5 the variance is much lower and the mean is centered around it. This is an example of how using conjugate priors and formulating a Bayesian posterior helps us arrive at the true estimate.

Posted on September 28, 2017 by Harish Krishnamurthy .


Every organization or individual Data Scientist performs a set of tasks in order to run predictions on input datasets. At Colaberry we have extensive experience working with data science and predictive analytics pipelines. As part of our effort to share our experience and expertise with our clients and the data science community, we created a learn-by-doing data science platform. On the platform, there are various paths that one can follow to learn how to analyze various types of datasets and create predictive analytics pipelines for their organizations. Python is used for all learning, in a Jupyter notebook environment. If you are new to Python, you can start by learning the Python that is relevant to data science on the platform.

In this blog, we discuss the data science predictive analytics pipeline and give pointers to where you can learn more on the platform.

A Data Science workflow or a pipeline refers to the standard activities that a Data Scientist performs from acquiring data to delivering final results using powerful visualizations.

Here are the important steps in the pipeline:

  1. Data Ingestion
  2. Identify Nature of Dataset
  3. EDA
    1. Data Visualization
    2. Clustering
    3. Statistical Analysis
    4. Anomaly Detection
    5. Cleaning
  4. Mapping Algorithm to the Dataset
    1. Problem Identification
    2. Modeling
    3. Model Validation and Fine Tuning
  5. Model Building Using Machine Learning Algorithms
  6. Scaling and Big Data

The pipeline can be explained with the help of the diagram as shown below:


1. Data Ingestion

Acquiring data is the first step in the pipeline. This involves working with Data Engineers and Infrastructure Engineers to acquire data in a structured format such as JSON, CSV, or text. Data Engineers are expected to provide the data to the Data Scientists in a known format. This involves parsing the data and pushing it to a SQL database or a format that is easy to work with. It can involve applying a schema that is already known or that can be inferred from the original data. When the original data is in an unstructured format, it needs to be cleaned and the relevant data extracted from it. This involves using a regular expression parser, or several methods of parsing such as Perl and Unix scripts or a language of your choice, to clean the data.
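
As a minimal sketch of the regular-expression route, the snippet below parses unstructured lines into records with a known schema and sets aside the lines that do not match. The log format, field names and sample lines are invented for illustration.

```python
import re

# Hypothetical unstructured log lines; the format is invented for illustration.
raw_lines = [
    "2017-09-28 10:15:02 user=alice amount=120.50 status=OK",
    "garbage line without fields",
    "2017-09-28 10:17:45 user=bob amount=75 status=FAILED",
]

pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\w+) amount=(?P<amount>\d+(?:\.\d+)?) status=(?P<status>\w+)$"
)

records = []
rejected = []
for line in raw_lines:
    match = pattern.match(line)
    if match is None:
        rejected.append(line)  # keep unparseable lines for later inspection
        continue
    row = match.groupdict()
    row["amount"] = float(row["amount"])  # apply the known schema: amount is numeric
    records.append(row)

print(records)
print(rejected)
```

Keeping the rejected lines, rather than dropping them silently, makes it possible to refine the pattern as new formats show up.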

An example of acquiring data is shown in the “Women in STEM” dataset tutorial. This dataset provides information about the various college majors that women are graduating from.

To understand more about data ingestion, you can follow our Junior Data Scientist track.

2. Identify Nature of dataset

Identifying the nature of the dataset is the second step in the pipeline. At a high level, datasets can be classified as linearly separable, linearly inseparable, convex and non-convex. Linear separability refers to datasets where a linear hyperplane or decision boundary classifies the data with good accuracy. Convexity refers to datasets where every line that joins two points in the dataset lies within the dataset. This identification is typically the foundation for deciding what type of data modeling can be done and which machine learning algorithms can potentially be applied to analyze the data and do predictive analytics.
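
One simple way to probe linear separability is to run a perceptron, which is known to converge when the classes can be separated by a hyperplane. The sketch below is an illustration on toy data, not a method taken from the workshop: the function name and the epoch budget are assumptions, and failing to converge within the budget is evidence of inseparability, not proof.

```python
def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if a perceptron finds a separating line within max_epochs.

    labels must be +1 or -1. False only means no separator was found within
    the epoch budget.
    """
    w = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Toy data: AND is linearly separable, XOR is not.
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]
print(perceptron_separates(corners, and_labels))
print(perceptron_separates(corners, xor_labels))
```

A dataset that fails this check is a candidate for non-linear models or feature transformations.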

Here are a few visual examples of what kinds of data we may encounter:

It is helpful to analyze the type of dataset and its features such as linear separability, convexity and sparsity. Such characteristics help identify the nature of the dataset so that we can apply relevant algorithms later in the pipeline. The process of identifying the nature of data has been described in the data intelligence conference workshop at
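One rough way to probe linear separability is to fit a simple linear classifier and look at its training accuracy. The sketch below is a heuristic, not a formal test, and the two synthetic datasets and the helper name linear_fit_accuracy are assumptions introduced for illustration.

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.linear_model import LogisticRegression

# Two synthetic datasets (invented for illustration only).
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)
X_circ, y_circ = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)


def linear_fit_accuracy(X, y):
    """Training accuracy of a linear classifier, used as a rough separability signal."""
    model = LogisticRegression().fit(X, y)
    return model.score(X, y)


acc_blobs = linear_fit_accuracy(X_blobs, y_blobs)
acc_circles = linear_fit_accuracy(X_circ, y_circ)

print("well separated blobs:", acc_blobs)
print("concentric circles:  ", acc_circles)
```

A training accuracy close to 1.0 suggests a linear boundary is adequate, while a value near chance level suggests the classes are not linearly separable in the original feature space.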

3. Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the third step, which involves looking at various statistics and visualizations generated from various dimensions of the dataset. Core EDA activities include anomaly detection, statistical analysis, data visualization, clustering and cleaning.

Anomaly detection may involve simple statistical anomalies or complex anomalies. For example, suppose a dataset has a column of social security numbers in which a few entries are all zeros (000-00-0000). Such incorrect data could cause problems when applying machine learning techniques. We can identify such spurious numbers by looking at statistics such as frequency counts. A histogram will show these numbers occurring far more often than any genuine value, signaling that spurious numbers are present. Looking at the mean, median, variance and other statistical measures also conveys information about the characteristics of the data. Similarly, anomalies in a time-series graph of credit card activity for a user could signal fraudulent activity. There are other ways of looking at graphs to identify anomalies and spurious data. The anomalies detected can then be put through a cleaning process to get the data ready for further processing. You can learn a lot of EDA methods by following our Data Science Track at:
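The frequency-count check for spurious social security numbers can be sketched as follows. The values are made up for illustration, and the rule "a genuine identifier appears only once" is an assumption that suits this toy column.

```python
import pandas as pd

# Hypothetical column of social security numbers; the values are invented.
ssn = pd.Series([
    "123-45-6789", "000-00-0000", "987-65-4321", "000-00-0000",
    "555-12-3456", "000-00-0000", "222-33-4444", "111-22-3333",
])

# Frequency counts: a genuine identifier should appear only once.
counts = ssn.value_counts()
suspicious = counts[counts > 1]
print(suspicious)

# Flag the spurious entries so they can go through the cleaning step.
is_spurious = ssn.isin(suspicious.index)
clean_ssn = ssn[~is_spurious]
print(clean_ssn)
```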

Let’s explore EDA through an example using the Titanic dataset:

Titanic Survivors example: Kaggle posted a famous dataset from the Titanic. The challenge is about identifying the survivors amongst the people aboard the Titanic.

The train_data and test_data provided in the challenge can be loaded from Kaggle GitHub with the read_csv command:

import pandas as pd

train_data = pd.read_csv("")
test_data = pd.read_csv("")

The source data provided is missing some values in the age column, which is referred to as sparsity. Instead of throwing away such sparse rows entirely, we estimate the missing ages by statistical analysis of passengers who share the same title, as shown below:

# Mean age per title; regex=False matches the title text literally
miss_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False)].Age.mean()
master_est = train_data[train_data['Name'].str.contains('Master. ', regex=False)].Age.mean()
mrs_est = train_data[train_data['Name'].str.contains('Mrs. ', regex=False)].Age.mean()
mr_est = train_data[train_data['Name'].str.contains('Mr. ', regex=False)].Age.mean()
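Once the per-title means are available, one way to use them is to fill each missing age with the mean for that passenger's title. The sketch below assumes this fill-in step (the original only shows the estimates) and runs on a tiny invented stand-in for train_data called df.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_data; the rows are invented for illustration.
df = pd.DataFrame({
    "Name": [
        "Smith, Mr. John", "Smith, Mrs. Jane", "Brown, Miss. Ann",
        "Brown, Master. Tom", "Jones, Mr. Paul", "Jones, Miss. Eva",
    ],
    "Age": [40.0, 38.0, 20.0, 4.0, np.nan, np.nan],
})

# For each title, fill missing ages with the mean age of passengers sharing it.
for title in ["Miss. ", "Master. ", "Mrs. ", "Mr. "]:
    has_title = df["Name"].str.contains(title, regex=False)
    estimate = df.loc[has_title, "Age"].mean()  # mean() skips missing values
    df.loc[has_title & df["Age"].isna(), "Age"] = estimate

print(df)
```

Filling by title keeps children (Master) and adults (Mr, Mrs) apart, which a single overall mean would not do.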

In this example, we can visualize the results of the analysis using violin plots. Violin plots are 2-D plots that show the distribution of a variable, and a split violin compares that distribution for two categories within the dataset. In the plot below, each violin is split in half and is asymmetric about the vertical axis. In the leftmost violin, most females survived: the peak is high around 1.0 (survival probability) and the variance is low (less uncertainty of death). Hence, we can conclude that most women in class 1 survived compared to men.
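A split violin plot of this kind could be produced roughly as follows. The column names Pclass, Sex and Survived follow the Kaggle Titanic dataset, but the counts below are invented so that the sketch runs without downloading the data; they are not the real Titanic figures.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Invented (survived, died) counts per class and sex; not the real data.
counts = {
    (1, "female"): (18, 2), (1, "male"): (7, 13),
    (2, "female"): (16, 4), (2, "male"): (4, 16),
    (3, "female"): (10, 10), (3, "male"): (3, 17),
}
rows = []
for (pclass, sex), (n_survived, n_died) in counts.items():
    rows += [{"Pclass": pclass, "Sex": sex, "Survived": 1}] * n_survived
    rows += [{"Pclass": pclass, "Sex": sex, "Survived": 0}] * n_died
toy = pd.DataFrame(rows)

# One violin per passenger class, split down the middle by sex.
ax = sns.violinplot(
    x="Pclass", y="Survived", hue="Sex", data=toy, split=True, orient="v"
)

# The same comparison as numbers: survival rate per class and sex.
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rates)
```

In a notebook with %matplotlib inline the figure is displayed directly, and the off-screen backend line can be dropped.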

More on data visualization and EDA can be found at:

4. Mapping Algorithm to the Dataset

This stage typically involves problem identification, modeling, and model validation and fine tuning. At this step of the pipeline, we select one or more relevant algorithms based on the nature of the dataset, apply them and measure their performance.

Problem identification involves identifying what type of problem we are dealing with, such as causal or non-causal (involving time as a feature or not), classification, prediction or anomaly detection. For example, a dataset containing time as a column is referred to as a time-series dataset, and by identifying it as such, we can pick from the class of time-series algorithms.

Modeling refers to applying Machine Learning models to the dataset. The models we have built need to be validated for performance and then tuned; this tuning stage is called hyperparameter tuning. The first run of mapping algorithms to the dataset will produce parameters that fit the model but are not necessarily optimal. To determine the optimal parameters, we need to apply tuning techniques. Commonly used techniques in linear regression include cross-validation and regularization. You can try an example of applying linear regression to the Boston housing dataset at:
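As a minimal sketch of combining cross-validation with regularization, the example below tunes the regularization strength of a ridge regression with a grid search. The synthetic data, the choice of ridge (L2) regularization and the alpha grid are all assumptions for illustration; they are not taken from the Boston housing tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for a real dataset (an assumption).
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, noise=15.0, random_state=0
)

# Candidate regularization strengths; the grid is illustrative.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation scores each alpha on held-out folds.
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_reg, y_reg)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("mean cross-validated R^2:", search.best_score_)
```

Because every candidate is scored on data it was not fitted to, the selected alpha reflects generalization rather than fit to the training set.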

5. Model Building using Machine Learning Algorithms

It is necessary to build models from scratch when the existing algorithms in standard packages fail. This is when we refer to the literature to build custom machine learning models. More on Machine Learning is on our ML Track:


Often we encounter datasets that are linearly inseparable for classification. For linearly separable classes, an SVM can easily find a hyperplane that classifies the data. However, for linearly inseparable data, a transformation to a higher dimension helps map the data to a linearly separable space. One way to perform such a mapping is by using functions called kernels.

Here we shall look at an example in 2 dimensions which, when mapped to 3 dimensions, can be classified with high accuracy.

First, let us create a 2-D circles dataset.

from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import numpy as np

X, y = make_circles(n_samples=500, random_state=20092017, noise=0.2, factor=0.2)


plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)

When we fit a support vector machine (SVM) with a linear kernel and calculate the accuracy, we arrive at an accuracy of 0.644. However, if we apply a Radial Basis Function (RBF) kernel instead of the linear kernel, the accuracy goes up to 0.986. We can also apply transformation functions to convert 2-D points into 3-D points by warping the space, as seen in the illustration below.

# Lift each 2-D point to 3-D by adding its squared distance from the origin
Z = X[:, 0]**2 + X[:, 1]**2
trans_X = np.c_[X, Z]
svm = SVC(C=0.5, kernel='linear')
svm.fit(trans_X, y)

SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

y_hat = svm.predict(trans_X)
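The comparison between the linear kernel, the RBF kernel and the explicit 3-D mapping can be put side by side as in the sketch below. It measures accuracy on the training data, as the steps above do. The helper training_accuracy is introduced only for this example, and using C=0.5 for the RBF model is an assumption, so the exact numbers may differ from those quoted earlier.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, random_state=20092017, noise=0.2, factor=0.2)


def training_accuracy(model, features):
    """Fit the model and score it on the same data it was trained on."""
    model.fit(features, y)
    return accuracy_score(y, model.predict(features))


# Linear kernel on the original 2-D points.
acc_linear = training_accuracy(SVC(C=0.5, kernel="linear"), X)

# RBF kernel on the original 2-D points (C=0.5 is an assumed setting).
acc_rbf = training_accuracy(SVC(C=0.5, kernel="rbf"), X)

# Linear kernel after lifting the points to 3-D with the squared radius.
Z = X[:, 0] ** 2 + X[:, 1] ** 2
trans_X = np.c_[X, Z]
acc_lifted = training_accuracy(SVC(C=0.5, kernel="linear"), trans_X)

print("linear kernel, 2-D:", acc_linear)
print("RBF kernel, 2-D:   ", acc_rbf)
print("linear kernel, 3-D:", acc_lifted)
```

The squared radius separates the inner circle from the outer one along the third axis, which is why a flat plane in 3-D can do what a straight line in 2-D cannot.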

Understanding and applying Machine Learning needs a thorough understanding of linear algebra and probability. You can learn them through visualizations at:

6. Scaling and Big Data

Models that perform well on small datasets might not do so on large datasets due to the variance present in the data. Hence, working with big data and scaling up the algorithms is a challenge. Models are initially validated with small datasets before working with big data. Hadoop and Spark are the popular technology stack for working with large datasets. For prediction on smaller datasets, the pandas, scikit-learn and NumPy libraries are used, and for large datasets, Spark MLlib is used.

About Colaberry:

Colaberry is a data science consulting and training company. Its learn-by-doing data science platform was created by data scientists, data engineers and machine learning specialists at Colaberry. Contact Colaberry at

About Authors:
Ram Katamaraja is the founder and CEO of Colaberry and architect of the platform. He can be reached at

Harish Krishnamurthy is the chief data scientist at Colaberry and the primary content author of the platform. He can be reached at