# Category: Data Science

## Record Linkage in a Data Lake

Enterprises typically hold many large data sets that reside in enterprise systems or legacy systems, or that have been dumped into big data lakes. With data generated exponentially from numerous sources and stored continuously in inexpensive, unstructured big data environments, “Record Linkage (RL)” is a huge challenge that all enterprises face when …

## Road to datascience hackathon @ UTD

The Colaberry learning team, in partnership with the data science club of the University of Texas at Dallas, organized a data science hackathon on October 28th, 2017. It was a whirlwind tour that put our data science learning platform, https://refactored.ai, which we are cooking in our labs, to an interesting test in an uncontrolled environment. Here is a run …

## Titanic Revisited – ODSC 2017, Boston MA

At the Open Data Science Conference in Boston, held on May 3rd, 2017, we presented an introductory workshop on Data Science with Python. It covered Python and the libraries needed for data science, followed by logistic regression on the Titanic data set. The workshop used our online self-learning platform: http://refactored.ai You can still sign …

## Jensen’s Inequality that Guarantees Convergence of the EM Algorithm

Jensen’s Inequality states that, given a strictly convex function g and a random variable X, $E[g(X)] \geq g(E[X])$. Here we shall consider a scenario of a strictly convex function that maps a uniform random variable and visualize the inequality theorem. Consider a parabolic function with an offset which is strictly convex as shown in the above diagram …
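
The scenario in this excerpt can be checked numerically. The following is a minimal sketch, not the post's own code: the parabola `g(x) = x**2 + offset`, the offset of 1.0, the uniform range [0, 1], and the helper name `jensen_sides` are all assumptions chosen for illustration. It estimates both sides of the inequality from samples of a uniform random variable.

```python
import random


def g(x, offset=1.0):
    """Parabola with an offset; strictly convex. The offset value is an assumption."""
    return x * x + offset


def jensen_sides(n_samples=100_000, low=0.0, high=1.0, seed=0):
    """Return sample estimates of (E[g(X)], g(E[X])) for X ~ Uniform(low, high)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    xs = [rng.uniform(low, high) for _ in range(n_samples)]
    mean_x = sum(xs) / n_samples                 # estimate of E[X]
    mean_g = sum(g(x) for x in xs) / n_samples   # estimate of E[g(X)]
    return mean_g, g(mean_x)


lhs, rhs = jensen_sides()
print(f"E[g(X)] ~ {lhs:.4f}   g(E[X]) ~ {rhs:.4f}   gap ~ {lhs - rhs:.4f}")
```

For this particular choice of g, the gap between the two sides equals the variance of X, which is why the left side stays above the right side whenever X is not constant.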

## From Prototype to Production: Building an Efficient Pipeline-AI Summit 2017, St. Louis

At the StampedeCon AI Summit in St. Louis, Colaberry Consulting presented how to take a complex idea in the AI domain, apply ML algorithms to it, and deploy it in production using the refactored.ai platform. More details, including code examples, are available on the platform. This, as we see, is a common area of interest to …

## Why Are Bayesian Formulations Better than Maximum Likelihood Estimates?

Maximum Likelihood Estimation (MLE) suffers from overfitting when the number of samples is small. Suppose a coin is tossed 5 times and you have to estimate the probability of heads; the Maximum Likelihood estimate is then (#Heads / #Total coin tosses). This would be estimate …
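
The coin example can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the post's own formulation: the outcome of 5 heads in 5 tosses is hypothetical, and the Bayesian side assumes a Beta(alpha, beta) prior with alpha = beta = 1 (a uniform prior), for which the posterior mean is (heads + alpha) / (tosses + alpha + beta). The function names are also hypothetical.

```python
def mle_estimate(heads, tosses):
    """Maximum likelihood estimate of P(heads): #Heads / #Total coin tosses."""
    return heads / tosses


def posterior_mean(heads, tosses, alpha=1.0, beta=1.0):
    """Posterior mean of P(heads) under an assumed Beta(alpha, beta) prior."""
    return (heads + alpha) / (tosses + alpha + beta)


# Hypothetical outcome: every one of the 5 tosses lands heads.
heads, tosses = 5, 5
mle = mle_estimate(heads, tosses)
bayes = posterior_mean(heads, tosses)
print(f"MLE: {mle:.3f}   posterior mean with uniform prior: {bayes:.3f}")
```

With so few tosses, the MLE commits fully to the observed frequency and concludes the coin can never land tails, while the prior pulls the Bayesian estimate back toward 0.5. As the number of tosses grows, the two estimates move closer together.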

## Data Science Explained

Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, and systems for extracting knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. Data science is a “concept to unify statistics, data analysis and their related methods” in order to …