Data Science Predictive Analytics Pipeline

In this blog, we discuss the data science predictive analytics pipeline and point to where you can learn more on the refactored.ai platform.

Every organization or individual Data Scientist performs a set of tasks in order to run predictions on input datasets. At Colaberry, we have extensive experience working with data science and predictive analytics pipelines. As part of our effort to share our experience and expertise with our clients and the data science community, we created a learn-by-doing data science platform, refactored.ai. The platform offers various paths that you can follow to learn how to analyze different types of datasets and create predictive analytics pipelines for your organization. All learning uses Python in a Jupyter notebook environment. If you are new to Python, you can start learning the Python that is relevant to data science at: https://refactored.ai/path/python/.

A Data Science workflow or a pipeline refers to the standard activities that a Data Scientist performs from acquiring data to delivering final results using powerful visualizations.

Here are the important steps in the pipeline:

1. Data Ingestion
2. Identify the Nature of the Dataset
3. Exploratory Data Analysis (EDA)
   1. Data Visualization
   2. Clustering
   3. Statistical Analysis
   4. Anomaly Detection
   5. Cleaning
4. Mapping Algorithm to the Dataset
   1. Problem Identification
   2. Modeling
   3. Model Validation and Tuning
5. Model Building Using Machine Learning Algorithms
6. Scaling and Big Data

The pipeline can be explained with the help of the diagram as shown below:

1. Data Ingestion

Acquiring data is the first step in the pipeline. This involves working with Data Engineers and Infrastructure Engineers to acquire data in a structured format such as JSON, CSV, or text. Data Engineers are expected to provide the data to the Data Scientists in a known format. This involves parsing the data and pushing it to a SQL database or another format that is easy to work with, which may mean applying a schema that is either already known or can be inferred from the original data. When the original data is in an unstructured format, it needs to be cleaned and the relevant data extracted from it. This can be done with a regular expression parser or other parsing methods, such as Perl and Unix scripts, or the language of your choice.
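As a rough illustration of the regular-expression approach described above, here is a minimal Python sketch. The log-line layout, the field names, and the helper `parse_lines` are all assumptions made up for this example; real unstructured data will need its own pattern.

```python
import re

import pandas as pd

# Hypothetical raw lines; the layout is an assumption made for illustration.
raw_lines = [
    "2017-09-20 10:15:02 user=alice action=login status=ok",
    "2017-09-20 10:16:45 user=bob action=purchase status=failed",
    "corrupted line without the expected fields",
]

# One named group per field we want to keep.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\w+) action=(?P<action>\w+) status=(?P<status>\w+)$"
)

COLUMNS = ["date", "time", "user", "action", "status"]


def parse_lines(lines):
    """Return a DataFrame of parsed records, skipping lines that do not match."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            records.append(match.groupdict())
    return pd.DataFrame(records, columns=COLUMNS)


events = parse_lines(raw_lines)
print(events)
```

Once the data is in a DataFrame with a known set of columns, it can be written to CSV or pushed to a SQL table for the later steps.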

An example of acquiring data is shown for the “Women in STEM” dataset tutorial at: https://refactored.ai/path/data-acquisition/. This dataset provides information about various college majors that women are graduating from.

To understand more about data ingestion, you can follow our Junior Data Scientist track at: https://refactored.ai/path/data-analyst/

2. Identify the Nature of the Dataset

Identifying the nature of the dataset is the second step in the pipeline. At a high level, datasets can be classified as linearly separable, linearly inseparable, convex, or non-convex. Linear separability refers to datasets whose classes can be separated with good accuracy by a linear hyperplane (decision boundary). Convexity refers to datasets where every line segment joining two points in the dataset lies entirely within the dataset. This identification typically determines what type of data modeling can be done and which machine learning algorithms can potentially be applied for analysis and predictive analytics.

Here are a few visual examples of what kinds of data we may encounter:

It is helpful to analyze the type of dataset and its features such as linear separability, convexity, and sparsity. Such characteristics help identify the nature of the dataset so that we can apply relevant algorithms later in the pipeline. The process of identifying the nature of data is described in the data intelligence conference workshop at https://refactored.ai/path/data-intelligence/
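One simple way to probe linear separability is to fit a linear classifier and look at its training accuracy. The sketch below is only a heuristic, not a formal test. It uses two synthetic datasets from scikit-learn, and the helper name `linear_fit_accuracy` is our own.

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC


def linear_fit_accuracy(X, y):
    """Training accuracy of a linear SVM; values near 1.0 suggest linear separability."""
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)
    return clf.score(X, y)


# Two well-separated clusters: a straight line can split them.
X_blobs, y_blobs = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Concentric circles: no straight line separates the inner ring from the outer one.
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

blob_acc = linear_fit_accuracy(X_blobs, y_blobs)
circ_acc = linear_fit_accuracy(X_circ, y_circ)

print("blobs:", blob_acc)
print("circles:", circ_acc)
```

A high score on the first dataset and a much lower score on the second is a hint that the circles need a non-linear model or a feature transformation.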

3. Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the third step, which involves looking at various statistics and visualizations generated from various dimensions of the dataset. Core EDA activities include anomaly detection, statistical analysis, data visualization, clustering, and cleaning.

Anomaly detection may involve simple statistical anomalies or complex anomalies. For example, suppose a dataset has a column of social security numbers in which a few entries are all zeros (000-00-0000). Such incorrect data could cause problems when applying machine learning techniques. We can identify spurious values by looking at statistics such as frequency counts. A histogram will show such values occurring with unusually high frequency, signaling that spurious entries are present. Looking at the mean, median, variance, and other statistical measures also conveys information about the characteristics of the data. As another example, anomalies in a time-series plot of a user's credit card activity could signal fraudulent activity. There are other ways of reading plots to identify anomalies and spurious data. The detected anomalies can then be put through the cleaning process to get the data ready for further processing. You can learn many EDA methods by following our Data Science Track at: https://refactored.ai/path/data-science/.
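The frequency-count idea can be sketched in a few lines of pandas. The records and column names below are made up for illustration, and the sketch assumes each person appears only once, so any repeated identifier is treated as suspicious.

```python
import pandas as pd

# Made-up records for illustration only; the column names are assumptions.
records = pd.DataFrame({
    "ssn": ["123-45-6789", "000-00-0000", "987-65-4321",
            "000-00-0000", "000-00-0000", "555-12-3456"],
    "amount": [120.0, 80.0, 95.5, 60.0, 70.0, 110.0],
})

# Assuming one row per person, an identifier that repeats is suspicious.
counts = records["ssn"].value_counts()
suspicious = counts[counts > 1].index.tolist()

# Flag the suspicious rows and set them aside for the cleaning step.
is_anomaly = records["ssn"].isin(suspicious)
clean = records[~is_anomaly]

print("suspicious values:", suspicious)
print(clean)
```

If a dataset legitimately contains several rows per person, the threshold on the counts would need to be chosen differently.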

Let’s explore EDA through an example of using the Titanic Dataset:

Titanic Survivors example:

Kaggle (www.kaggle.com) hosts a famous dataset from the Titanic at https://www.kaggle.com/c/titanic. The challenge is about predicting which of the people aboard the Titanic survived.

The train_data and test_data provided in the challenge can be loaded from GitHub with the read_csv command:

import pandas as pd

train_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv")

The source data is missing some values in the Age column, which is referred to as sparsity. Instead of throwing away such sparse rows entirely, we estimate the missing ages through statistical analysis, here using the mean age for each title that appears in the passenger names:

# Mean age per title; regex=False makes the '.' match a literal period
miss_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False)].Age.mean()
master_est = train_data[train_data['Name'].str.contains('Master. ', regex=False)].Age.mean()
mrs_est = train_data[train_data['Name'].str.contains('Mrs. ', regex=False)].Age.mean()
mr_est = train_data[train_data['Name'].str.contains('Mr. ', regex=False)].Age.mean()
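The snippet above computes the per-title estimates but does not show them being written back into the Age column. One possible way to do that is sketched below on a tiny made-up sample shaped like the Titanic data; the `Title` column and the extraction pattern are our own assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample in the shape of the Titanic training data.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina",
             "Allen, Mr. William", "Palsson, Master. Gosta", "Moran, Mr. James",
             "Sandstrom, Miss. Marguerite"],
    "Age": [22.0, 38.0, 26.0, 35.0, 2.0, np.nan, np.nan],
})

# Pull out the title that sits between the comma and the first period.
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean age per title, aligned to every row; missing ages are ignored by the mean.
title_mean_age = sample.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages with the estimate for that passenger's title.
sample["Age"] = sample["Age"].fillna(title_mean_age)

print(sample)
```

Grouping by an extracted title avoids repeating one line per title and handles any title present in the data, as long as at least one passenger with that title has a known age.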

In this example, we can visualize the results of the analysis using violin plots. A violin plot shows the distribution of a variable for each category, and a split violin compares two groups within each category. In the plot below, each violin is split in half and is asymmetric about the vertical axis. In the leftmost violin, you can see that most females survived, as the peak is high around 1.0 (survival) with low variance (less uncertainty of death). Hence, we can conclude that most women in class 1 survived, at a higher rate than the men in that class.
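The numbers behind such a plot can be checked directly with a group-by. The sketch below uses a handful of made-up rows, not the real Titanic data, with column names that follow the Kaggle file (Pclass, Sex, Survived).

```python
import pandas as pd

# Made-up rows for illustration; these are not real passenger records.
passengers = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of the 0/1 Survived flag is the survival rate for each class and sex.
survival_rate = (
    passengers.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
)

print(survival_rate)
```

On the real training data, the same group-by gives the per-class, per-sex survival rates. A split violin of this kind could be drawn with seaborn's violinplot, passing the class as x, survival as y, sex as hue, and split=True.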

More on data visualization and EDA can be found at: https://refactored.ai/user/rf/notebooks/1/dv.ipynb

4. Mapping Algorithm to the Dataset

This stage typically involves problem identification, modeling, model validation, and fine-tuning. At this step of the pipeline, we associate one or more relevant algorithms with the dataset based on its nature, apply them, and measure their performance.

Problem identification involves identifying what type of problem we are dealing with, such as causal or non-causal (involving time as a feature or not), classification, prediction, or anomaly detection. For example, data containing time as a column is referred to as a time-series dataset, and by identifying it as such, we can pick from the class of time-series algorithms.

Modeling refers to applying machine learning models to the dataset. The models we have built need to be validated for performance and then tuned, a stage often called hyperparameter tuning. The first run of mapping algorithms to the dataset will produce parameters that fit the model but are not necessarily optimal. To determine better parameters, we need to apply tuning techniques. Techniques commonly used with linear regression include cross-validation and regularization. You can try an example of applying linear regression to the Boston housing dataset at: https://refactored.ai/user/rf/notebooks/10/reg-journey.ipynb
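To make cross-validation and regularization concrete, here is a minimal sketch that tunes the regularization strength of a ridge regression with 5-fold cross-validation. The data is synthetic and the grid of alpha values is an arbitrary choice for illustration; it is not taken from the notebook linked above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: 5 features, two of which have no effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Try several regularization strengths, scoring each with 5-fold cross-validation.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_  # the scorer returns negated MSE

print("best alpha:", best_alpha)
print("cross-validated MSE:", best_mse)
```

The same pattern works for other estimators: swap in a different model and a grid over its hyperparameters.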

5. Model Building using Machine Learning Algorithms

It becomes necessary to build models from scratch when the existing algorithms in standard packages fail. This is when we refer to the literature to build custom machine learning models. More on machine learning can be found on our ML Track: https://refactored.ai/path/machine-learning/

Often we encounter datasets that are linearly inseparable for classification. For linearly separable classes, an SVM can easily find a hyperplane that classifies the data. However, for linearly inseparable data, a transformation to a higher dimension can map the data to a space where it is linearly separable. One elegant way to perform such a mapping is by using functions called kernels.

Here we shall look at an example in which 2-dimensional data, when mapped to 3 dimensions, can be classified with high accuracy.

First, let us create a 2-D circles dataset.

from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import numpy as np

# Two noisy concentric circles: class 1 is the inner circle, class 0 the outer one
X, y = make_circles(n_samples=500, random_state=20092017, noise=0.2, factor=0.2)

plt.figure(figsize=(8,6))

# Plot each class in its own color
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)

plt.show()

When we fit a support vector machine (SVM) with a linear kernel and calculate the accuracy, we arrive at 0.644. However, if we apply a Radial Basis Function (RBF) kernel instead, the accuracy goes up to 0.986. We can also apply a transformation function to convert the 2-D points into 3-D points by warping the space, as seen in the illustration below.

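# Add a third feature: the squared distance of each point from the origin
Z = X[:, 0]**2 + X[:, 1]**2
trans_X = np.c_[X, Z]

# A linear kernel can now separate the classes along the new axis
svm = SVC(C=0.5, kernel='linear')
svm.fit(trans_X, y)

SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

# Predict on the transformed points and measure the accuracy
y_hat = svm.predict(trans_X)
print(accuracy_score(y, y_hat))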
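The linear-versus-RBF comparison mentioned earlier is not shown in code, so here is one way it could be reproduced. This is a sketch under assumptions: C=0.5 is borrowed from the snippet above, and the accuracy is measured on the training points themselves. The exact numbers may differ from those quoted, depending on the scikit-learn version and settings.

```python
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Same dataset parameters as the circles example in this post.
X, y = make_circles(n_samples=500, random_state=20092017, noise=0.2, factor=0.2)

# Fit one SVM per kernel and record the training accuracy of each.
accuracies = {}
for kernel in ("linear", "rbf"):
    model = SVC(C=0.5, kernel=kernel)  # C=0.5 is an assumption for this comparison
    model.fit(X, y)
    accuracies[kernel] = accuracy_score(y, model.predict(X))

print(accuracies)
```

For a fairer estimate of how each kernel generalizes, the accuracy would be measured on held-out data, for example with a train/test split.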

Understanding and applying Machine Learning needs a thorough understanding of linear algebra and probability. You can learn them through visualizations at:

https://refactored.ai/user/rf/notebooks/10/linear-algebra.ipynb

https://refactored.ai/user/rf/notebooks/10/probability-theory.ipynb

6. Scaling and Big Data

Models that perform well on small datasets might not do so on large datasets due to the variance present in the data. Hence, working with big data and scaling up the algorithms is a challenge. Models are typically validated on small datasets first before working with big data. A popular technology stack for working with large datasets is Hadoop and Spark. For prediction on smaller datasets, the pandas, scikit-learn, and NumPy libraries are used; for large datasets, Spark MLlib is used.

About Colaberry:

Colaberry is a data science consulting and training company. Refactored.ai is a data science learn-by-doing platform created by data scientists, data engineers, and machine learning specialists at Colaberry. Contact Colaberry at [email protected]

About Authors:
Ram Katamaraja is the founder and CEO of Colaberry and the architect of the Refactored.ai platform. He can be reached at [email protected]

Harish Krishnamurthy is the chief data scientist at Colaberry and the primary content author of the Refactored.ai platform. He can be reached at [email protected]