Posted on November 15, 2017 by Ram Dhan Yadav Katamaraja.


Enterprises typically have large data sets spread across enterprise systems, legacy systems, and big data lakes. With data generated at an exponential rate from numerous sources and stored continuously in inexpensive, unstructured big data environments, “Record Linkage (RL)” is a major challenge that every enterprise faces when trying to get value out of this distributed data.


Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record linkage is called data linkage in many jurisdictions, but is the same process.

Goal: ‘Build → Deploy’ a queryable and linked data source for Record Linkage (RL)

In an enterprise, it is common to find the same entity recorded differently in different systems. Even the same field may be named or formatted differently from one data set to the next.

Below is an example of data represented in different data sources:

Data Set A:
Master ID, First Name, Middle Name, Last Name, DOB, Twitter, … ID

Data Set B:
MasterID, First Name, Last Name, Facebook ID,…

Data Set C:
Full Name, Street Number + zip code.

As you can see in the example, the data sets store their data with no common identifier. How do we reconcile these data sets and link them together, so that they form a single queryable analytics resource that can serve as a clean data source for data scientists and analysts?
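A common first step, sketched here under assumptions, is to map each data set onto one shared record shape before any matching is attempted. The `Person` schema, the `from_a`/`from_b`/`from_c` helpers, the `"Zip Code"` key for Data Set C, and the sample rows are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical common schema; the choice of fields is an assumption."""
    source: str
    full_name: str
    master_id: Optional[str] = None
    dob: Optional[str] = None
    zip_code: Optional[str] = None


def normalize_name(*parts: Optional[str]) -> str:
    """Join the name parts that are present, lowercase, collapse whitespace."""
    joined = " ".join(p for p in parts if p)
    return " ".join(joined.lower().split())


def from_a(row: dict) -> Person:
    name = normalize_name(
        row.get("First Name"), row.get("Middle Name"), row.get("Last Name")
    )
    return Person("A", name, master_id=row.get("Master ID"), dob=row.get("DOB"))


def from_b(row: dict) -> Person:
    name = normalize_name(row.get("First Name"), row.get("Last Name"))
    return Person("B", name, master_id=row.get("MasterID"))


def from_c(row: dict) -> Person:
    # Treating the zip code as its own key is an assumption about how
    # Data Set C would be parsed.
    return Person("C", normalize_name(row.get("Full Name")), zip_code=row.get("Zip Code"))


# Made-up sample rows, for illustration only. The IDs in A and B are not
# assumed to come from the same ID space.
a = from_a({"Master ID": "M-1", "First Name": "Jane", "Middle Name": "Q",
            "Last Name": "Doe", "DOB": "1980-01-02"})
b = from_b({"MasterID": "17", "First Name": " Jane", "Last Name": "DOE "})
c = from_c({"Full Name": "Jane  Doe", "Zip Code": "60601"})

print(a.full_name, "|", b.full_name, "|", c.full_name)
```

After this step the three sources can be compared field by field, even though none of them shares a reliable key with the others.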

Solution: Modern Data Lake + Machine Learning = Record Linkage (RL)

The solution that we propose to achieve highly functional Record Linkage includes two aspects:

  1. Create a modern Data Lake.
  2. Apply Machine Learning for RL

Aspect 1: Modern Data Lake

Our recommendation is to first create a modern data lake that can store and process petabytes of data. An example modern data lake architecture diagram is below.

The core features of the data lake are:

  • Ingest structured and unstructured data in real time from multiple data sources
  • Store data in tiered storage layers for cost and processing effectiveness
  • Provide catalog services for unified data access across systems and tools
  • Support streaming and batch data processing
  • Provide self-service analytics, reporting and data processing for various users
  • Provide data ops and admin services
  • Run on AWS, Azure, Google or other private cloud infrastructure

Aspect 2: Machine Learning for RL

To create “Record Linkage”, our recommendation is to apply probabilistic matching to the ingested data, build machine learning models, train them, and run the trained models to create a linked data lake.
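As a rough illustration of probabilistic matching, the sketch below scores a record pair in the style of the classical Fellegi-Sunter model: each field adds log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees. The field names and the m/u values are assumptions chosen for illustration, not parameters estimated from real data, and this is not necessarily the matching method used in our pipeline.

```python
import math

# Illustrative parameters (assumptions):
#   m = P(field agrees | the two records refer to the same entity)
#   u = P(field agrees | the two records refer to different entities)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "dob": (0.98, 0.003),
}


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum per-field agreement and disagreement weights for one record pair."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Made-up records, for illustration only.
x = {"last_name": "doe", "first_name": "jane", "dob": "1980-01-02"}
y = {"last_name": "doe", "first_name": "jan", "dob": "1980-01-02"}
z = {"last_name": "smith", "first_name": "john", "dob": "1975-05-05"}

for label, other in (("x vs x", x), ("x vs y", y), ("x vs z", z)):
    print(label, round(match_weight(x, other), 2))
```

Pairs whose total weight falls above an upper threshold would be treated as links, pairs below a lower threshold as non-links, and the band in between sent for review. The thresholds themselves would have to be tuned on real data.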

Example: Building → Deploying a Queryable and Linked Data Lake
Below is an example of the stages involved in building a queryable and linked data lake, starting with prototyping and ending with deployment to production.

Data Ingestion: Data ingestion involves loading data from various data sources and data sets into a big data ecosystem (Hadoop HDFS, Cassandra, etc.), pre-processing the data to clean up any spurious values, and saving the clean ground truth data in a compressed format such as Parquet files.

Ground truth refers to Cartesian pairs of matching records as well as false (non-matching) pairs. These are necessary for the supervised ML classifier to learn the parameters that help it classify record pairs as true or false matches. Finally, split the data into training and validation sets.
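The pairing and splitting described above can be sketched in a few lines. The `build_pairs` and `split_pairs` helpers, the `id` field, the 80/20 split and the sample records are hypothetical; a real pipeline would read the known links from curated data.

```python
import random
from itertools import product


def build_pairs(left, right, true_links):
    """Label every (left, right) combination: 1 for a known match, else 0."""
    pairs = []
    for l, r in product(left, right):
        key = (l["id"], r["id"])
        pairs.append((key, 1 if key in true_links else 0))
    return pairs


def split_pairs(pairs, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the labeled pairs and cut it into train/validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up records and links, for illustration only.
left = [{"id": "a1"}, {"id": "a2"}, {"id": "a3"}]
right = [{"id": "b1"}, {"id": "b2"}, {"id": "b3"}, {"id": "b4"}]
true_links = {("a1", "b2"), ("a3", "b4")}

pairs = build_pairs(left, right, true_links)
train, valid = split_pairs(pairs)
print(len(pairs), "pairs,", sum(label for _, label in pairs), "true matches")
print(len(train), "training pairs,", len(valid), "validation pairs")
```

Note that the full Cartesian product grows with the product of the two data set sizes, and true matches are a small minority of the pairs, so the labels are heavily imbalanced.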

Here is our data ingestion pipeline* that shows how we approach data ingestion for record linkage.

Model Training:
Model training involves application of Supervised ML models during the training phase to learn the linkages. In our setup, Machine Learning Engineers build models that are used by Data Scientists. Data Scientists apply statistical techniques to tweak the models for improving the accuracy of generated outputs.

Here is an example of our model building and training pipeline* that shows how we sample data and work with standard tools in Python. We select the best models, which are then implemented using Spark MLlib.

The split between Machine Learning Engineers and Data Scientists described above is by no means the only arrangement across companies. In some organizations, Data Science and Machine Learning are treated as a single area, and the boundary between the roles is fuzzy.
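To make the supervised training step more concrete, here is a minimal sketch that turns labeled record pairs into similarity features and fits a small logistic regression by stochastic gradient descent. It uses only the Python standard library as a stand-in for the standard Python tools mentioned above; the field names, features, hyperparameters and toy data are assumptions, and this is not the model from our pipeline.

```python
import math
from difflib import SequenceMatcher


def features(a: dict, b: dict) -> list:
    """Similarity features for one record pair (field names are hypothetical)."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    dob_same = 1.0 if a["dob"] == b["dob"] else 0.0
    return [name_sim, dob_same]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x) -> float:
    """Estimated probability that the pair with features x is a match."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Toy labeled pairs (made up): 1 = same entity, 0 = different entities.
labeled = [
    ({"name": "jane doe", "dob": "1980-01-02"},
     {"name": "jane q doe", "dob": "1980-01-02"}, 1),
    ({"name": "john smith", "dob": "1975-05-05"},
     {"name": "jon smith", "dob": "1975-05-05"}, 1),
    ({"name": "jane doe", "dob": "1980-01-02"},
     {"name": "john smith", "dob": "1975-05-05"}, 0),
    ({"name": "mary major", "dob": "1990-09-09"},
     {"name": "jane doe", "dob": "1980-01-02"}, 0),
]

X = [features(a, b) for a, b, _ in labeled]
y = [label for _, _, label in labeled]
w, b = train(X, y)

for xi, yi in zip(X, y):
    print(yi, round(predict(w, b, xi), 3))
```

In practice the candidate models would be compared on the validation set, and the chosen one re-implemented for the production environment.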

*You can learn more about the data ingestion and model building pipelines by reading our blog entry on the presentation given at the Artificial Intelligence summit in October 2017. You can also play around with the data ingestion pipeline notebook on our :

Scaling to Big Data:

Once the appropriate model(s) are identified, they can be applied to large data sets. Since the data cleaning was already worked out during the model training and experimentation phase, we can borrow portions of that cleaning code to build out the ingestion portion in Spark. The code developed for the training data can be ported to run on the big data lake infrastructure. Some of the technologies that can be used for this are Python 3 and Spark.

Once deployed, the environment is monitored continuously and fine-tuned iteratively by running experiments that improve existing models and create new ones, with the aim of more accurate and consistent record linkages. Here is an illustration of a sample workflow.


The above example depicts how to build and deploy “Record Linkage” in an enterprise using a data lake and machine learning. To discuss more about creating a “Record Linkage” data lake that is relevant to your organization, you can reach us at If you are looking either to get started with data science or to advance your DS, ML and AI skills, check out our, a learn-data-science-by-doing platform.


Ram Dhan Yadav Katamaraja is the Founder and CEO of Colabery Inc. Ram is a Harvard University alum with a Master of Liberal Arts in Management. Ram believes that there is a need to create innovative skill development platforms from the ground up that have “inclusivity” and “equal opportunity” at their core. Ram has been working on and platforms which have directly impacted the careers of thousands of people from many countries.