The Battleship Approach to the Low Resource Entity Matching Problem

Author: Roee Shraga

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 224, Pages 1 - 25

https://doi.org/10.1145/3626711

Published: 12 December 2023

Abstract

Entity matching, a core data integration problem, is the task of deciding whether two data tuples refer to the same real-world entity. Recent deep learning methods that use pre-trained language models have been proposed for entity matching. Although they demonstrate unprecedented results, these solutions suffer from a major drawback: they require large amounts of labeled data for training and are therefore ill-suited to low resource entity matching problems. To overcome the challenge of obtaining sufficient labeled data, we offer a new active learning approach, focusing on a selection mechanism that exploits unique properties of entity matching. We argue that a distributed representation of a tuple pair indicates its informativeness when considered among other pairs. Our approach builds on this observation by iteratively applying space-aware considerations. Bringing it all together, we treat the low resource entity matching problem as a Battleship game, hunting indicative samples, focusing on positive ones, through awareness of the latent space along with careful planning of next sampling iterations. An extensive experimental analysis shows that the proposed algorithm outperforms state-of-the-art active learning solutions to low resource entity matching and, although using fewer samples, can be as successful as state-of-the-art fully trained algorithms.
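
The abstract describes the selection mechanism only at a high level, so the following is a minimal illustrative sketch of one active learning selection round, not the algorithm proposed in the paper. It assumes each candidate tuple pair already has an embedding vector and a predicted match probability from the current model. The scoring rule (binary entropy weighted by mean cosine similarity to the other pairs) and every name in it (`select_batch`, `centrality`, `positive_share`) are hypothetical choices made for illustration.

```python
import math


def entropy(p):
    """Binary entropy of a match probability; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centrality(i, vectors):
    """Mean cosine similarity of pair i to all other pairs.

    Assumption: a simple stand-in for "informativeness among other pairs".
    """
    sims = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
    return sum(sims) / len(sims) if sims else 0.0


def select_batch(vectors, probs, budget, positive_share=0.5):
    """Pick `budget` pairs to label, reserving a share for likely matches."""
    scores = [
        entropy(p) * max(centrality(i, vectors), 0.0)
        for i, p in enumerate(probs)
    ]
    ranked = sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)
    positives = [i for i in ranked if probs[i] >= 0.5]
    negatives = [i for i in ranked if probs[i] < 0.5]

    n_pos = min(len(positives), int(budget * positive_share))
    chosen = positives[:n_pos] + negatives[:budget - n_pos]

    # Top up from the remaining ranked pairs if one side ran short.
    if len(chosen) < budget:
        rest = [i for i in ranked if i not in chosen]
        chosen += rest[:budget - len(chosen)]
    return chosen


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    probs = [0.55, 0.95, 0.45, 0.05]
    print(select_batch(vectors, probs, budget=2))
```

In a full loop, the selected pairs would be sent to a labeler, added to the training set, and the model retrained before the next round. Reserving part of the budget for predicted positives reflects the abstract's emphasis on positive samples, which are typically scarce in entity matching.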

Supplemental Material

Presentation video (MP4 file, 93.10 MB)

Cited By

Brinkmann A, Shraga R, Bizer C (2024). SC-Block: Supervised Contrastive Blocking Within Entity Resolution Pipelines. The Semantic Web, pp. 121-142. DOI: 10.1007/978-3-031-60626-7_7. Online publication date: 26 May 2024.
https://dl.acm.org/doi/10.1007/978-3-031-60626-7_7

Index Terms

The Battleship Approach to the Low Resource Entity Matching Problem
1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Active learning settings
2. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution

Published In

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 4

December 2023

1317 pages

EISSN: 2836-6573

DOI: 10.1145/3637468

Editor: Divyakant Agrawal, UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Benjamin and Florence Free Chair
National Science Foundation (NSF)
