Solved – Deep Learning approaches for Record Linkage

deep learning, machine learning, record-linkage

Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference.
(Source: Wikipedia)

There are several approaches to solving a record linkage problem: deterministic, probabilistic, machine learning, and so on.

I am looking for deep learning approaches for solving record linkage use cases.

I found the following work, "Entity Resolution Using Convolutional Neural Network":
https://www.sciencedirect.com/science/article/pii/S1877050916324796

Please share thoughts on how to solve a record linkage problem using deep learning.

Best Answer

One classic method for linking text documents uses cosine similarity on TF-IDF features. A simple extension is to replace TF-IDF with Doc2Vec or similar document embeddings, since cosine similarity between word or document embeddings captures semantic similarity. (Some people might point out that word embeddings aren't technically deep learning, but I think the asker might find these methods useful.)
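As a rough illustration of the TF-IDF baseline, here is a minimal sketch in plain Python. The helper names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the smoothed IDF formula, and the toy records are all assumptions made for this example, not something taken from a particular library or from the linked paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a record and split it on whitespace."""
    return text.lower().split()


def tfidf_vectors(records):
    """Return one sparse TF-IDF vector (token -> weight dict) per record."""
    tokenized = [tokenize(r) for r in records]
    n_records = len(tokenized)
    # Document frequency: in how many records does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant), so common tokens keep a small positive weight.
    idf = {tok: math.log((1 + n_records) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({tok: count * idf[tok] for tok, count in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy records, invented for illustration only.
records = [
    "john smith 12 main street",
    "john smith 12 main st",
    "mary jones 7 oak avenue",
]
vectors = tfidf_vectors(records)
sim_likely_match = cosine_similarity(vectors[0], vectors[1])
sim_non_match = cosine_similarity(vectors[0], vectors[2])
```

To move to the embedding variant, one would presumably swap `tfidf_vectors` for a function that returns dense document embeddings and keep the cosine comparison unchanged. Linking then reduces to thresholding the similarity, with the threshold tuned on labelled pairs.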


A second approach is to learn a distance function that corresponds to item dissimilarity. This is analogous to the TF-IDF record linkage method, with the learned distance function playing the role that cosine similarity plays there.

Siamese networks can be used to learn such distance functions. They are essentially networks that, given two examples, return their similarity or dissimilarity. The name "Siamese" comes from the use of shared weights for the hidden layers, so both inputs are encoded in the same way.
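To make the shared-weights idea concrete, here is a minimal forward-pass sketch in plain Python. Everything in it is an assumption for illustration: the names `char_bigram_vector`, `SiameseEncoder`, and `contrastive_loss` are hypothetical, the hashed character-bigram input is just one possible featurization, and the contrastive loss is one common training objective for such networks rather than something this answer prescribes. The weights are random and untrained, so the distances are not meaningful yet; in practice you would train the encoder with a deep learning framework by minimizing the loss over labelled pairs.

```python
import math
import random


def char_bigram_vector(text, dim=64):
    """Hash the character bigrams of a record into a fixed-size count vector."""
    vec = [0.0] * dim
    padded = f" {text.lower().strip()} "
    for i in range(len(padded) - 1):
        first, second = padded[i], padded[i + 1]
        # Simple deterministic hash (built-in hash() is salted per process).
        index = (ord(first) * 31 + ord(second)) % dim
        vec[index] += 1.0
    return vec


class SiameseEncoder:
    """One hidden layer whose weights are shared by both inputs of a pair."""

    def __init__(self, in_dim=64, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        self.weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                        for _ in range(hidden_dim)]
        self.bias = [0.0] * hidden_dim

    def encode(self, x):
        """Map an input vector to its embedding: tanh(Wx + b)."""
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.weights, self.bias)]

    def distance(self, a, b):
        """Euclidean distance between the embeddings of a and b."""
        # Both inputs go through the same encode(), i.e. the same weights.
        emb_a, emb_b = self.encode(a), self.encode(b)
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(emb_a, emb_b)))


def contrastive_loss(distance, is_match, margin=1.0):
    """Pull matching pairs together; push non-matches at least `margin` apart."""
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Toy record pair, invented for illustration only.
encoder = SiameseEncoder()
record_a = char_bigram_vector("John Smith, 12 Main St")
record_b = char_bigram_vector("Jon Smith, 12 Main Street")
pair_distance = encoder.distance(record_a, record_b)
pair_loss = contrastive_loss(pair_distance, is_match=True)
```

Because both records pass through the same `encode` method, the learned distance is symmetric and identical records always end up at distance zero, which is the property that makes this architecture a natural fit for pairwise matching.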

Here you can see an example talk on using Siamese networks for a similar task.

If you want to read further on Siamese networks, I encourage you to look up one-shot learning, which is somewhat similar to record linkage.
