Posted on November 15, 2017 by Ram Dhan Yadav Katamaraja.

Enterprises typically have large data sets that live in various enterprise systems or legacy systems, and/or are dumped into big data lakes. With the exponential generation of data from numerous sources and the continuous storage of this data in inexpensive, unstructured big data environments, “Record Linkage (RL)” is a huge challenge that all enterprises face when they try to get value out of this distributed data.


Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record linkage is called data linkage in many jurisdictions, but is the same process.

Goal: ‘Build → Deploy’ a queryable and linked data source for Record Linkage (RL)

In an enterprise, it is common to find the same records represented differently in different systems. The same field might be named or structured differently from one data set to the next.

Below is an example of data represented in different data sources:

Data Set A:
Master ID, First Name, Middle Name, Last Name, DOB, Twitter, … ID

Data Set B:
MasterID, First Name, Last Name, Facebook ID,…

Data Set C:
Full Name, Street Number + zip code.

As you can see in the example, the data sets store their data with no identifier common to all of them. How do we reconcile these data sets and link them together so that they form a single queryable analytics resource that can serve as a clean data source for data scientists and analysts?

Solution: Modern Data Lake + Machine Learning = Record Linkage (RL)

The solution that we propose to achieve highly functional Record Linkage includes two aspects:

  1. Create a modern Data Lake.
  2. Apply Machine Learning for RL

Aspect 1: Modern Data Lake

Our recommendation is to first create a modern data lake that can store and process petabytes of data. An example modern data lake architecture diagram is below.

The core features of the data lake are:

  • Capability to ingest (un)structured data in real time from multiple data sources
  • Store data in tiered storage layers for cost and processing effectiveness
  • Provide catalog services for unified data access across systems and tools
  • Support streaming and batch data processing
  • Provide self-service analytics, reporting and data processing for various users
  • Provide data ops and admin services
  • Sits on AWS, Azure, Google or other private cloud infrastructure

Aspect 2: Machine Learning for RL

To create “Record Linkage”, our recommendation is to apply probabilistic matching to the ingested data, build machine learning models, train them on that data, and run the models to create a linked data lake.
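One common way to do probabilistic matching is the Fellegi-Sunter approach, in which each compared field adds a log-likelihood weight depending on whether it agrees or disagrees. The sketch below is only an illustration of that idea: the field names and the m/u probabilities are made-up assumptions, not values from our pipeline.

```python
import math

# Hypothetical per-field probabilities (assumptions for illustration only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "first_name": (0.95, 0.05),
    "last_name": (0.95, 0.02),
    "dob": (0.98, 0.01),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum Fellegi-Sunter style log2 weights over the compared fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value gives no evidence either way
        if a.strip().lower() == b.strip().lower():
            total += math.log2(m / u)  # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


rec_a = {"first_name": "Maria", "last_name": "Lopez", "dob": "1980-02-11"}
rec_b = {"first_name": "maria", "last_name": "Lopez", "dob": "1980-02-11"}
rec_c = {"first_name": "Mario", "last_name": "Lopes", "dob": "1975-07-30"}

score_match = match_weight(rec_a, rec_b)     # all three fields agree
score_nonmatch = match_weight(rec_a, rec_c)  # all three fields disagree
```

Pairs whose total weight is above an upper threshold would be treated as links, pairs below a lower threshold as non-links, and anything in between would go to review. The thresholds themselves have to be tuned on real data.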

Example: Building → Deploying a Queryable and Linked Data Lake
Below is an example of the stages through which a queryable and linked data lake can be built, starting with prototyping and finally deploying to production.

Data Ingestion: Data ingestion involves loading data from various data sources and data sets into a big data ecosystem (Hadoop HDFS, Cassandra, etc.), pre-processing the data to clean up any spurious records, and saving the clean ground truth data in compressed formats such as Parquet files.
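As a rough idea of what the pre-processing step can look like, here is a minimal sketch that normalizes records of different shapes into one comparable form and drops unusable ones. The field names (first_name, full_name, source, and so on) are hypothetical, and the sketch stops before the storage step: in a real pipeline the cleaned records would then be written out in a columnar format such as Parquet.

```python
import re


def clean_record(raw):
    """Normalize one raw record into a common, comparable shape.

    The field names used here are hypothetical, chosen for illustration.
    """
    def norm(value):
        # Lower-case, trim, and collapse internal whitespace.
        return re.sub(r"\s+", " ", (value or "").strip().lower())

    if raw.get("full_name"):
        full_name = norm(raw["full_name"])
    else:
        parts = [raw.get("first_name"), raw.get("middle_name"), raw.get("last_name")]
        full_name = " ".join(norm(p) for p in parts if p and p.strip())

    cleaned = {"source": raw.get("source"), "full_name": full_name}
    return cleaned if full_name else None  # drop records with no usable name


raw_records = [
    {"source": "A", "first_name": " John ", "middle_name": "Q", "last_name": "Public"},
    {"source": "C", "full_name": "JOHN   PUBLIC"},
    {"source": "B", "first_name": "", "last_name": "  "},  # spurious: no name at all
]

clean_records = [r for r in map(clean_record, raw_records) if r is not None]
```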

Ground truth refers to pairs of records drawn from the Cartesian product of the data sets, each labeled as either a matching (true) pair or a non-matching (false) pair. These are necessary for the supervised ML classifier to learn parameters that help classify record pairs as true or false. Finally, split the data into training and validation sets.
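The ground truth construction and the split can be sketched in a few lines. This is an illustrative example with made-up records; the entity_id field is an assumed known identity that is used only to label the pairs.

```python
import itertools
import random

# Hypothetical cleaned records from two sources. entity_id is the known
# ground-truth identity, used only for labeling (an assumption of this sketch).
source_a = [
    {"entity_id": 1, "full_name": "john q public"},
    {"entity_id": 2, "full_name": "maria lopez"},
    {"entity_id": 3, "full_name": "wei chen"},
]
source_b = [
    {"entity_id": 1, "full_name": "john public"},
    {"entity_id": 2, "full_name": "m. lopez"},
    {"entity_id": 4, "full_name": "aisha khan"},
]

# Cartesian product of the two sources, labeled True for matching pairs
# and False for non-matching pairs.
labeled_pairs = [
    (a, b, a["entity_id"] == b["entity_id"])
    for a, b in itertools.product(source_a, source_b)
]


def split_pairs(pairs, validation_fraction=0.25, seed=42):
    """Shuffle labeled pairs and split them into training and validation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


train_pairs, validation_pairs = split_pairs(labeled_pairs)
```

Note that true pairs are heavily outnumbered by false pairs even in this tiny example (2 out of 9), which is typical of record linkage data and worth keeping in mind when sampling.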

Here is our data ingestion pipeline* that shows how we approach data ingestion for record linkage.

Model Training:
Model training involves application of Supervised ML models during the training phase to learn the linkages. In our setup, Machine Learning Engineers build models that are used by Data Scientists. Data Scientists apply statistical techniques to tweak the models for improving the accuracy of generated outputs.
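To make the idea of a supervised model that learns linkages concrete, here is a deliberately tiny sketch: a one-feature logistic regression trained on a string similarity score. It uses only the Python standard library so that it is self-contained. The training pairs are made up, and this is not the model we run in production.

```python
import math
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def predict_proba(w, b, x):
    """Probability that a pair with similarity x is a match."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit a one-feature logistic regression with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(w, b, x) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Tiny hand-made training set of (name_a, name_b, is_match); purely illustrative.
training_pairs = [
    ("john public", "john q public", 1),
    ("maria lopez", "maria lopes", 1),
    ("wei chen", "wei chenn", 1),
    ("john public", "aisha khan", 0),
    ("maria lopez", "wei chen", 0),
    ("aisha khan", "wei chen", 0),
]

X = [similarity(a, b) for a, b, _ in training_pairs]
y = [label for _, _, label in training_pairs]
w, b = train_logistic(X, y)

# Score two unseen pairs: a likely typo variant and two unrelated names.
p_match = predict_proba(w, b, similarity("aisha khan", "aisha kahn"))
p_nonmatch = predict_proba(w, b, similarity("wei chen", "john public"))
```

A real model would use several features per pair (for example, separate similarities for name, date of birth and address) and would be evaluated on the held-out validation pairs.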

Here is an example of our model building and training pipeline* that shows how we sample data and work with standard tools in Python. We select the best models, which are then implemented using Spark MLlib.

This is by no means the only arrangement that exists across companies. In other organizations, the Data Science and Machine Learning roles are treated similarly and the boundary between them is fuzzy.

*You can learn more about the data ingestion and model building pipelines by reading our blog entry on the presentation given at the Artificial Intelligence summit in October 2017. You can also play around with the data ingestion pipeline notebook.

Scaling to Big Data:

Once the appropriate model(s) are identified, they can be applied to large datasets. Since we have already done the cleaning in the model training and experimentation phase, we can port some portions of that data cleaning code to Spark to build out the ingestion portion. The code that is developed for training data can be ported to run on big data lake infrastructure. Some of the technologies that can be used for this are Python 3 and Spark.

Once deployed, the environment is constantly monitored and iteratively fine-tuned by conducting experiments to improve existing models and create new models that achieve more accurate and consistent record linkages. Here is an illustration of a sample workflow.


The above example depicts how to build and deploy “Record Linkage” in an enterprise using a Data Lake and Machine Learning. To discuss creating a “Record Linkage” data lake that is relevant to your organization, you can reach out to us. If you are looking either to get started with data science or to advance your DS, ML and AI skills, check out our learn-data-science-by-doing platform.

Ram Dhan Yadav Katamaraja is the Founder and CEO of Colaberry Inc. Ram is a Harvard University alum with a Master of Liberal Arts in Management. Ram believes that there is a need to create innovative skill development platforms from the ground up that have “inclusivity” and “equal opportunity” at their core. Ram has been working on platforms which have directly impacted the careers of thousands of people from many countries.

Posted on November 4, 2017 by Ram Dhan Yadav Katamaraja.

The Colaberry learning team, in partnership with the data science club of the University of Texas at Dallas, organized a data science hackathon on October 28th, 2017. It was a whirlwind tour that put the data science learning platform we are cooking up in our labs to an interesting test in an uncontrolled environment. Here is a rundown of what happened.


It all started with Colaberry bringing in its first-ever data science interns to work on the platform in summer 2017. One of those interns is Kshitij Yadav, who is pursuing a Master's in Business Analytics at the Naveen Jindal School of Management, UTD. He is a very unassuming guy with great hustle, and he did hustle his way into the internship team. He stumbled a lot through the summer and made a lot of mistakes, but continuously improved through sheer hard work and dedication. In the end he emerged as one of the best interns we hired. Before he completed his internship, he asked: ‘I learnt a lot. How can I continue to contribute to this platform once I am back at school?’ I suggested that he go back, start or participate in a data science club, continue to learn on the platform, and contribute to it as a volunteer. Kshitij took that advice very seriously: he went back to school, collaborated with the newly formed data science club, convinced us to partner with the club to run a hackathon at the university for the benefit of enthusiastic students, and actually brought Colaberry to his school to run it.

Opening architecture:

Organizing a data science hackathon was a very interesting challenge for me and our refactored team. I am always thinking about the reusability and scalability of the work that we do. So, I started with the question: ‘How can we organize this hackathon in such a way that it gives us the capability to repeat any future hackathons with less effort and potentially scale to run more hackathons?’

That question forced us to think critically about the data science content creation and consumption that happens on the refactored platform. It forced us to move from a closed architecture to an open architecture. Instead of forcing users to create content using tools within our platform, we opened the architecture so that content contributors can check their content into GitHub and pull it into the platform. This enabled us to think about how to leverage GitHub infrastructure to empower any data scientist in the world to contribute content to the refactored platform in the form of Jupyter notebooks. Giving contributors the flexibility of Jupyter notebooks comes with its own challenges. One such challenge is how to translate the contributed content into a lesson or lab on the refactored platform. As a first step we created a temporary solution in the form of a custom refactored Jupyter notebook. This notebook allows users to upload their Jupyter notebook and quickly convert it into a format that allows the refactored platform to import the content through a CI process. I hope that we can figure out how to eliminate this manual intervention and automagically import Jupyter notebooks into the refactored platform. This is probably a good artificial intelligence and machine learning challenge to solve.

Setting the stage:

With this open content contribution architecture and content format conversion tool, we had enough ammunition to organize a small-scale data science hackathon. For organizing hackathons we created a public git repository which users can fork to work on the challenges. To organize the UTD data science hackathon, we created a subfolder, utd-fall-2017, and posted the hackathon instructions there. We decided to test two aspects as part of the hackathon. The first was the ability of participants to provide their data-driven analysis of the data sets. The second was their ability to deliver the analysis as a product that can be imported into the refactored platform.

The UTD data science club’s president, Achintya Sen, and his team did a remarkable job in collaborating with us to create a plan for organizing the hackathon. The hackathon was divided into three stages:

  1. Refactored platform orientation – Oct 19th, 2017
  2. Hackathon day – Oct 28th, 2017
  3. Award ceremony day – Nov 2nd, 2017

Hackathon execution:

In the first stage, Harish Krishnamurthy and Murali Mallina provided an orientation on Python for data science, Jupyter notebooks, and using the platform to quickly learn data science by doing.

In the second stage, a 6-hour hackathon was organized in a large room that accommodated 57 students, who formed 18 teams. The hackathon was filled with energy, ideas, questions, focus, creativity, and lots of fun. The data science club team and volunteers did an incredible job behind the scenes organizing the logistics, food, AV, etc.

I was able to attend the actual hackathon to observe whether our open architecture was working as expected, how the participants were responding, and whether they were able to easily follow the instructions, and to gather any other feedback. For the most part, they were able to work very easily with Jupyter notebooks and the refactored platform. What was most exciting to me and our team was that the participants were constantly referring to the lessons and concepts in the refactored platform while hacking away. One area where participants struggled was working with ‘assertion’ blocks as part of converting the content they created into the lesson format that can be imported into the refactored platform. I envision a day where we can automate this lesson conversion process by applying artificial intelligence and machine learning.

Feedback from participants:

The hackathon also provided an opportunity to get feedback from participants on various aspects of the refactored platform and the tools we are developing. Overall feedback was overwhelmingly positive. This encourages us to continue the work that we are doing to democratize data science. Sharing below some graphs from the feedback that we received.



After the hackathon was complete, the refactored team evaluated the 18 pull requests submitted to the public git repository. The evaluation team included Sathwik Mohan and Brinda Krishnaswamy, under the guidance of Harish Krishnamurthy. We came up with various criteria for scoring and brainstormed a lot to pick winning submissions. Even after 2 days of deliberation, evaluation, and scoring, the team could not pick a single winner because the aggregate scores of the submissions kept creating ties. Finally, they decided to pick 3 winning entries and 2 runner-up entries.

Award ceremony: 

As the final step, the data science club organized the award ceremony and a networking event on November 2nd. Tommy Haugh and David Freni from the Colaberry recruiting team attended the event. The data science club advisor and director of the MS business analytics program, Khasif Saeed, was also present at the event.

Winners of the UTD 2017 hackathon along with analysis / feedback:

• Pythonians: The highlight of their notebook1 and notebook2 was the content development that guided their workflow throughout the analysis. They adhered to the Refactored course format, which helps us to understand their thought process in arriving at the model. From the beginning, their approach to the dataset was logical and well presented. Exploratory Data Analysis (EDA) is an important part of understanding the data and exploring its weaknesses. We would encourage you to add more on that front. This will enable you to understand what the data is trying to tell you. Conclusions are an important part of model building. Write inferences from your model and explain your coefficients.

• DataFreaks: The major contribution in their notebook was the EDA. Box plots clearly conveyed the number of employed graduates in each category. A very nice heat map clearly shows multicollinearity. Adding a table of variance inflation factors (VIF) on top of that was a good way to weed out unwanted variables. The side-by-side distribution plot of median salary for employed and unemployed graduates was very illustrative. We can clearly see there is a definite relationship. Show model statistics, coefficients and model accuracy using MSE. This alone helps us to understand how well the model can perform.

• MeanSquare: This team doubled their efforts and worked on two datasets. With the graduate students dataset, they started with some very good plots as part of their EDA. The bar plot of total graduates grouped by major category and the heat map to check collinearity both hold a lot of information. A pair plot among predictors is a very easy way to check their distributions and assert model assumptions. The model predictions were good. Model diagnostics through a colored scatter plot of predictions was also well done. In addition to providing mean squared errors for model accuracy, these kinds of diagnostic plots convey the same information in a way that is easy to understand. For the women in STEM dataset, they had a pie chart, bar plots and a heat map, in addition to a scatter plot of predicted vs actual values that captures the model accuracy. Presentation is an equally important part of model building. It is important to convey the thoughts that went into building your model. We encourage you to add content that summarizes each step and draws conclusions from it.

The runners-up did a very good job but just missed out on a few key points that were expected. But kudos to them for making the top.

• DataArtists: The Happiness Index dataset was a tough nut to crack, but the DataArtists team really owned it. Their approach to this complex dataset was inspiring. With numerous variables and lots of missing values, they handled the data cleaning very well. Their application of multinomial logistic regression and their attempt at a decision tree are commendable. Having some output would have helped us understand the strength of the model. Following the Refactored format for creating a notebook really helps present your findings and model building logic in a way that is clear to the reader. Focus on adding content and more exploratory data analysis around your variables. Work on perfecting your model and increasing its accuracy.

• Data_Diggers: They followed the refactored course format all through the notebook, with content added for every step. Each exercise was followed by very strong assertion blocks. A scatter plot of median salary and proportion of women in STEM was used to clearly show the relationship between the two variables. Conclusions were provided for the plots, which is what we expect from EDA. A bar plot showing the number of men and women in each category is illustrative. Their team was one of the very few that clearly showed the model statistics along with model evaluation metrics. They used a pair plot to assess model accuracy, which was a unique approach and very useful. As part of EDA and data cleaning, you were expected to check for multicollinearity and missing values and provide summary statistics. This forms an important first step. Always ensure your data is clean before you venture into building a model around it. Collinear predictors can severely affect the performance of your model.
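For anyone who wants to run the multicollinearity check mentioned in the feedback above, here is a minimal sketch of computing variance inflation factors with NumPy. The data is synthetic and generated only for this illustration; it is not from any of the hackathon datasets.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on
    the remaining columns. This equals the j-th diagonal entry of the inverse
    of the predictors' correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic predictors, made up for this example.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1 (collinear)
x3 = rng.normal(size=200)                  # independent predictor

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
```

A VIF close to 1 means a predictor is not explained by the others. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, which is what x1 and x2 show here.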

In general, for everyone, we encourage you to not just fit a good model to the data but also to understand the hypothesis behind building the model. Derive inferences from your model. Every data set tells a story, and the objective of a model is to extract maximum information from that story. Build conclusions and explain limitations in the model. Understand how coefficients relate to the predictors and what their standard errors mean. Assess the model accuracy and build diagnostic plots. Explain why you chose the predictors and why your model is the best fit. Presenting output goes a long way in communicating what your model is about. Explain why you chose a plot, what you learned from the plot, why you chose a model diagnostic, and what it implies about your model.
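To show what coefficients, standard errors and MSE look like in practice, here is a small sketch that fits an ordinary least squares model with NumPy and reports all three. The data is synthetic, generated with known coefficients (1.0, 2.0 and -0.5) so that the output can be sanity-checked. It is an illustration, not a template that the submissions were scored against.

```python
import numpy as np


def ols_summary(X, y):
    """Fit y = b0 + X @ b by least squares.

    Returns the coefficients (intercept first), their standard errors,
    and the mean squared error of the fit.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    dof = n - design.shape[1]                  # degrees of freedom
    sigma2 = residuals @ residuals / dof       # unbiased residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_err = np.sqrt(np.diag(cov))
    mse = float(np.mean(residuals ** 2))
    return coef, std_err, mse


# Synthetic data with known true coefficients, made up for this example.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

coef, std_err, mse = ols_summary(X, y)
```

Each coefficient is the expected change in the response for a one-unit change in that predictor with the others held fixed, and a coefficient that is small relative to its standard error is weak evidence that the predictor matters.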


Tom and David made some wonderful connections at the event and are looking forward to working with the students and the university.

What’s next?

Traveling on this hackathon journey has been so exciting that it planted seeds for new possibilities. Some of the possibilities are organizing an inter-college hackathon or maybe a worldwide hackathon to test the boundaries of the platform. We started receiving correspondence from students who are looking for career opportunities. So, I am thinking about launching a thousand virtual internships initiative for students to contribute to the continuous development of the refactored platform and content. That could open up a world of opportunities for students and for us. I am definitely excited to see where this takes us.
