Overview of Entity Resolution and Related Processes

John R. Talburt, PhD
University of Arkansas at Little Rock

Entity resolution (ER) is a commonly used data quality process to decrease data redundancy. ER is primarily focused on the data quality problem of “multiple sources of the same information.” The fundamental problem is that the records in an information system are only references to real-world objects. Despite the terminology used in database schemas, the rows in a database table are not “entities”; they are references to the entities. A patient record is not a patient, only a description of a patient, and the same patient can have many records in the hospital information system. These could be records of admissions, procedures, medications, diagnoses, and many other encounters. ER is the process of determining (resolving) whether two references to real-world entities (e.g., patients, customers, products) refer to the same entity or to different entities.

Several other terms are often confused with ER. These include record linking, record or data matching, master data, and identity resolution. While these all have some similarities, they mean different things. The term “record linking” is one of the oldest. It describes the mechanics of ER. Before the ER process starts, each record is given a new attribute called the “link attribute.” If an ER process decides two records are referencing the same entity, then this decision is expressed by giving the link attributes of both records the same link value. On the other hand, if the ER process decides the records are referring to different entities, it will give their link attributes different values. So, we say two records are “linked” if they both have the same link value. For example, if a hospital ER system decides “R1, Mary Smith, 123 Oak St” and “R2, Marie Smith, 123 Oak” are the same patient, then it would add (append) the same link value to both, resulting in “R1, Mary Smith, 123 Oak St, AX34” and “R2, Marie Smith, 123 Oak, AX34”, where “AX34” is the link value. The actual value of the link is not important, only whether the link values of two records are the same or different.
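The linking mechanics can be sketched in a few lines of Python. The records, the linked pair, and the "AX" link-value scheme below are illustrative assumptions, not part of any particular ER system:

```python
# Hypothetical patient references keyed by record id.
records = {
    "R1": {"name": "Mary Smith", "address": "123 Oak St"},
    "R2": {"name": "Marie Smith", "address": "123 Oak"},
    "R6": {"name": "John Doe", "address": "789 Maple Ave"},
}

def assign_links(records, linked_pairs):
    """Give every record its own link value, then overwrite so that
    both members of each linked pair share the same value.
    (A sketch only: chained pairs like (A,B),(B,C) would need a
    proper union-find to propagate links transitively.)"""
    for n, rec in enumerate(records.values()):
        rec["link"] = f"AX{n:02d}"   # assumed link-value scheme
    for a, b in linked_pairs:
        records[b]["link"] = records[a]["link"]
    return records

# Suppose the ER logic decided R1 and R2 reference the same patient.
assign_links(records, [("R1", "R2")])
```

After this runs, R1 and R2 carry the same link value while R6 keeps a distinct one, which is all the "linked" relationship means.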

Of course, ER systems can make linking mistakes. To help clarify the language, we can define another term, “equivalent references.” Two references (records) are equivalent if, and only if, they reference the same entity. In the previous example, the ER system linked records R1 and R2 because its logic decided they were equivalent. But in truth, these might refer to different patients (non-equivalent references). When this happens, we say the ER system has made a “false positive” error because it made a positive decision to link the two references, but the decision was wrong (false) because the references were not equivalent. Similarly, if the ER system had made the negative decision not to link, but R1 and R2 are referring to the same patient (i.e., are equivalent), then the ER system made a “false negative” error. In other words, the goal of an ER system is to link two references if, and only if, the references are equivalent, i.e., to make only true positive and true negative decisions.
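The four decision outcomes can be captured directly from the two facts involved, the system's decision and the ground truth. A minimal sketch:

```python
def classify_decision(linked: bool, equivalent: bool) -> str:
    """Classify one ER linking decision against the ground truth of
    whether the two references are actually equivalent."""
    if linked and equivalent:
        return "true positive"
    if linked and not equivalent:
        return "false positive"   # linked, but not the same entity
    if not linked and equivalent:
        return "false negative"   # same entity, but left unlinked
    return "true negative"
```

For example, linking R1 and R2 when they are really two different patients classifies as a false positive.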

That brings us to the term “record matching.” Record matching is an algorithmic process that takes two references as input and gives a numeric value as output. The numeric value represents the degree of similarity between the two references. Typically, this is a value between 0.0 and 1.0, where 1.0 means the two references are the same (identical) and 0.0 means they are entirely dissimilar (nothing in common). There is no universal similarity algorithm. Different kinds of attributes call for different kinds of similarity measures. Methods for measuring the similarity between names are different from those for measuring the similarity between dates. Often these methods are designed to overcome data quality problems or language semantics, for example, matching “William” with its English nickname “Bill.”
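One way such a measure can be built is sketched below: a normalized edit-distance score combined with a nickname lookup so “Bill” scores 1.0 against “William.” The tiny nickname table is a stand-in; production matchers use much richer dictionaries and algorithms such as Jaro-Winkler:

```python
NICKNAMES = {"bill": "william", "liz": "elizabeth"}  # illustrative table

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Return a score in [0.0, 1.0]; 1.0 means identical (after
    nickname normalization), 0.0 means entirely dissimilar."""
    a, b = a.lower(), b.lower()
    a, b = NICKNAMES.get(a, a), NICKNAMES.get(b, b)
    if a == b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

With this sketch, `name_similarity("William", "Bill")` returns 1.0, while “Mary” versus “Marie” scores 0.6.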

ER is related to record matching through the assumption that “the higher the similarity between two references, the more likely the references are equivalent, and the lower the similarity between two references, the less likely the references are equivalent.” Similarity, however, only gives the likelihood of equivalence or non-equivalence; it is not a certainty. References such as “R3, William Smith, 123 Pine St” and “R4, Bill Smith, 123 Pine St” are very similar, but may in fact describe a father and son. Conversely, “R1, Mary Smith, 123 Oak St” and “R5, Mary Jones, 456 Elm St” are not very similar, but could be the same patient who married and moved to a different address. In practice, the ER system is governed by a similarity threshold. When the similarity between two references meets or exceeds the threshold, the system links them; otherwise, it does not link them.
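A threshold-governed decision might be sketched as follows. The per-attribute weights, the 0.8 threshold, and the toy exact-match scorer are all assumptions; real systems tune these per application:

```python
def record_similarity(rec_a, rec_b, sims, weights):
    """Weighted average of per-attribute similarity scores."""
    total = sum(weights.values())
    score = sum(weights[f] * sims[f](rec_a[f], rec_b[f]) for f in weights)
    return score / total

def should_link(rec_a, rec_b, sims, weights, threshold=0.8):
    """Link only when the combined similarity meets the threshold."""
    return record_similarity(rec_a, rec_b, sims, weights) >= threshold

# Toy exact-match similarity used for illustration only.
exact = lambda x, y: 1.0 if x == y else 0.0
sims = {"name": exact, "address": exact}
weights = {"name": 0.6, "address": 0.4}

r3 = {"name": "William Smith", "address": "123 Pine St"}
r4 = {"name": "William Smith", "address": "123 Pine St"}
r5 = {"name": "Mary Jones", "address": "123 Pine St"}
```

Here `should_link(r3, r4, ...)` is true (score 1.0), while `should_link(r3, r5, ...)` is false (score 0.4), even though the two records share an address.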

After ER, the next step is establishing “identity.” When the entities are people, many of the demographic attributes that distinguish persons, such as name, address, phone, and date of birth, may be missing, mis-keyed, or change over time. The same is true for other entities like products and materials. Different companies may describe equivalent (replacement) parts in different ways.

Suppose that we run an ER process and link together three patient records “R1, Mary Smith, 123 Oak St, AX34”, “R2, Marie Smith, 123 Oak, AX34”, and “R5, Mary Jones, 456 Elm St, AX34”. If we store the information from this cluster of records in one data structure, we can establish an identity and call this identity “AX34”. This identity structure then becomes our “master record” for patient “AX34”, and because the information is saved, we can make “AX34” a “persistent identifier.” This means that if the ER system later determines that a new patient encounter, say “R7, Mary Smith, 456 Elm”, is similar enough to the information stored for patient “AX34”, we can then append “AX34” to this new reference. In this way, “AX34” becomes the “master identifier” for this patient even though some of the identity attribute values may change over time. The quality of a master data management system is measured by the accuracy of the identity information it aggregates, i.e., how well it avoids false positive and false negative linking errors.
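Aggregating the cluster into one identity structure might look like the sketch below. The cluster reuses the records from the running example; the shape of the master-record structure itself is an assumption:

```python
# The cluster of linked records for identity "AX34" from the example.
cluster = [
    {"id": "R1", "name": "Mary Smith",  "address": "123 Oak St", "link": "AX34"},
    {"id": "R2", "name": "Marie Smith", "address": "123 Oak",    "link": "AX34"},
    {"id": "R5", "name": "Mary Jones",  "address": "456 Elm St", "link": "AX34"},
]

def build_master_record(cluster):
    """Collect every distinct value seen for each identity attribute,
    keyed by the cluster's shared link value (the master identifier)."""
    master = {"link": cluster[0]["link"], "names": set(),
              "addresses": set(), "source_records": []}
    for rec in cluster:
        master["names"].add(rec["name"])
        master["addresses"].add(rec["address"])
        master["source_records"].append(rec["id"])
    return master

master = build_master_record(cluster)
```

Because the master record retains all the names and addresses ever seen for “AX34”, a later encounter such as “R7, Mary Smith, 456 Elm” can still be matched even though no single source record contains both of those values.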

The act of assigning encounter record R7 the master identifier “AX34” describes “identity resolution.” Identity resolution is the process of determining whether a new reference entering the system is equivalent to an identity you are already managing, or whether it represents a new identity. To illustrate the difference between entity resolution and identity resolution, I like to use the example of a crime team investigating a burglary. The technicians go to the scene and collect fingerprints. Upon examination, they find there are two sets of fingerprints belonging to two different suspects. In this way, they are performing ER by taking each pair of prints, determining whether they are from the same suspect or two different suspects, and grouping them into clusters of fingerprints belonging to the same suspect. Now suppose that they send these prints to the FBI, where they are matched against the FBI's criminal database. The FBI will now perform identity resolution by determining whether either set of fingerprints belongs to a criminal (an identity) already established in that database.
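The decision rule for identity resolution can be sketched as follows: compare an incoming reference against the identities already managed, link to the best match above a threshold, and otherwise establish a new identity. The similarity function, threshold, and “ID####” identifier scheme below are all hypothetical:

```python
# Existing master records, keyed by master identifier ("AX34" is from
# the running example; its stored names are an assumed aggregate).
masters = {"AX34": {"names": {"Mary Smith", "Marie Smith", "Mary Jones"}}}

def name_overlap(ref, master):
    """Toy similarity: 1.0 if the incoming name was seen before."""
    return 1.0 if ref["name"] in master["names"] else 0.0

def resolve_identity(ref, masters, sim, threshold=0.5):
    """Return the master identifier for an incoming reference,
    creating a new identity when no stored one matches well enough."""
    best = max(masters, key=lambda mid: sim(ref, masters[mid]), default=None)
    if best is not None and sim(ref, masters[best]) >= threshold:
        return best                      # an identity we already manage
    new_id = f"ID{len(masters):04d}"     # hypothetical identifier scheme
    masters[new_id] = {"names": {ref["name"]}}
    return new_id

known = resolve_identity({"name": "Mary Smith"}, masters, name_overlap)
new = resolve_identity({"name": "Bob Jones"}, masters, name_overlap)
```

The first call returns the existing identifier “AX34”; the second finds no match and establishes a new identity, just as the FBI would either match the prints to a known criminal or open a new file.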

Two of the biggest challenges facing ER and MDM are scalability and the human-in-the-loop. Because ER is defined as a pairwise process of comparing two records, the number of comparisons grows with the square of the number of records. Even a modest-sized file of, say, 100,000 records generates 4,999,950,000 pairs of records. Therefore, even with the fastest computers, it is impractical, if not impossible, to compare every pair of records in a large file. Large files must be split into “blocks” where the system only compares the pairs within each block, but how to create these blocks without separating equivalent references is a challenge.
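Blocking can be sketched as below, assuming the first letter of the surname as the blocking key (a deliberately naive choice; real systems use more robust keys such as phonetic codes, precisely because a key this crude can separate equivalent references like “Smith” and “Zmith”):

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key_fn):
    """Group records into blocks by a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks

def pairs_within_blocks(blocks):
    """Enumerate candidate pairs only within each block."""
    return [pair for blk in blocks.values() for pair in combinations(blk, 2)]

records = [
    {"id": "R1", "surname": "Smith"},
    {"id": "R2", "surname": "Smyth"},
    {"id": "R5", "surname": "Jones"},
    {"id": "R6", "surname": "Johnson"},
]

blocks = block_by(records, lambda r: r["surname"][0].upper())
pairs = pairs_within_blocks(blocks)   # 2 pairs instead of 6 for 4 records

# The full pairwise count for n records is n*(n-1)/2:
full_pairs = 100_000 * 99_999 // 2    # 4,999,950,000 for 100,000 records
```

Even in this toy case the candidate pairs drop from 6 to 2; on a 100,000-record file a good blocking scheme cuts the billions of pairs down to a tractable number.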

Another problem is the amount of effort required to prepare the records for the ER process. Traditional processes are designed to always compare apples to apples, i.e., the same attribute values, such as first names to first names, last names to last names, and so on. This can be a challenge when the records are coming from different source systems with different layouts. In this case, someone needs to design an extract, transform, and load (ETL) process to bring all the sources into a standard layout for the ER process to work properly. This takes time and effort and is subject to error.
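The transform step of such an ETL process might be sketched as below, mapping two hypothetical source layouts onto one standard layout; all field names here are illustrative, not taken from any particular system:

```python
def from_source_a(row):
    """Source A already splits the name into separate fields."""
    return {"first": row["fname"], "last": row["lname"],
            "street": row["addr1"]}

def from_source_b(row):
    """Source B stores one full-name field that must be split
    (naively on the first space; real name parsing is much harder)."""
    first, last = row["full_name"].split(" ", 1)
    return {"first": first, "last": last,
            "street": row["street_address"]}

std_a = from_source_a({"fname": "Mary", "lname": "Smith",
                       "addr1": "123 Oak St"})
std_b = from_source_b({"full_name": "Marie Smith",
                       "street_address": "123 Oak"})
```

Once both sources emit the same layout, the ER process can compare first names to first names and last names to last names regardless of where a record originated.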

Entity resolution and the related processes described here are in constant evolution. While most master data management (MDM) systems rely on traditional data matching techniques, there is ongoing research into using machine learning and graph technology to improve the accuracy and efficiency of MDM. Some of the newest research is directed at building a “data washing machine.” In the same way that you put your dirty jeans in an automatic washing machine with detergent and the proper settings, you could put your raw files into the data washing machine and automatically produce a clean and standardized output file. While the data washing machine is still in the future, work in this area is ongoing.



Information Quality Graduate Program
University of Arkansas at Little Rock

In 2006, the University of Arkansas at Little Rock established the world’s first Information Quality Graduate Program. Working in collaboration with the Massachusetts Institute of Technology Chief Data Officer and Information Quality (MIT CDOIQ) program, the UA Little Rock program was designed to prepare students to pursue a variety of IQ careers such as Chief Data Officer, Information Quality Manager, Director of Data Governance, Data Steward, Information Quality Analyst, and Data Scientist, or to pursue doctoral-level graduate studies in preparation for information quality research and instructional roles.

Housed in the Information Science Department of the Donaghey College of Science, Technology, Engineering, and Mathematics, the program has grown from a single, on-campus Master of Science in Information Quality to now include both a PhD and a Graduate Certificate in Information Quality. In addition, all three programs are offered both on-campus and online. Each online course is a webcast of a scheduled on-campus course, giving online students the opportunity to engage in live interactions with the instructor and the on-campus students.

All three programs are fully articulated upward. The 12 hours taken for the Graduate Certificate program will count toward the 33 hours needed for the MS program, and 27 hours of the MS program will count toward the 75-hour PhD program. In addition, up to 15 hours of graduate courses from other universities can be transferred into the PhD program. For more information about these programs, visit