research-article

The Battleship Approach to the Low Resource Entity Matching Problem

Published: 12 December 2023

Abstract

Entity matching, a core data integration problem, is the task of deciding whether two data tuples refer to the same real-world entity. Recent deep learning methods that build on pre-trained language models have been proposed for entity matching. Although they demonstrate unprecedented results, these solutions suffer from a major drawback: they require large amounts of labeled data for training and are therefore ill-suited to low resource entity matching problems. To overcome the challenge of obtaining sufficient labeled data, we offer a new active learning approach, focusing on a selection mechanism that exploits unique properties of entity matching. We argue that a distributed representation of a tuple pair indicates its informativeness when considered among other pairs. Our approach uses this observation by iteratively applying space-aware considerations when selecting samples. Bringing it all together, we treat the low resource entity matching problem as a Battleship game, hunting indicative samples, with a focus on positive ones, through awareness of the latent space along with careful planning of the next sampling iterations. An extensive experimental analysis shows that the proposed algorithm outperforms state-of-the-art active learning solutions to low resource entity matching and, although it uses fewer samples, can be as successful as state-of-the-art fully trained algorithms.
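
To make the setting concrete, the following is a minimal sketch of a generic pool-based active learning loop for entity matching. It is not the Battleship selection mechanism described in the abstract: it substitutes plain uncertainty sampling, a tiny logistic regression in place of a pre-trained language model, and synthetic two-dimensional similarity features in place of learned tuple-pair representations. All names (`active_learning`, `oracle`, `pool`, `truth`) and all parameter values are assumptions made for illustration only.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability that the pair with feature vector x is a match."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=300, lr=1.0):
    """Fit a small logistic regression with batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        errs = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b


def active_learning(pool, oracle, seed_idx, budget, batch=3):
    """Pool-based loop: train, query the most uncertain pairs, repeat."""
    labeled = {i: oracle(i) for i in seed_idx}
    while len(labeled) < budget:
        idx = list(labeled)
        w, b = train_logistic([pool[i] for i in idx],
                              [labeled[i] for i in idx])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Uncertainty sampling: pairs with probability nearest 0.5 come first.
        candidates.sort(key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))
        for i in candidates[:min(batch, budget - len(labeled))]:
            labeled[i] = oracle(i)
    # Final model trained on everything that was labeled.
    idx = list(labeled)
    w, b = train_logistic([pool[i] for i in idx], [labeled[i] for i in idx])
    return w, b, labeled


# Synthetic pool (assumption): 10% matches, two similarity scores per pair.
random.seed(0)
truth = [1 if i % 10 == 0 else 0 for i in range(200)]
pool = [[random.uniform(0.6, 1.0) if t else random.uniform(0.0, 0.5)
         for _ in range(2)] for t in truth]

# The oracle stands in for a human labeler; seeds are one match, one non-match.
w, b, labeled = active_learning(pool, lambda i: truth[i],
                                seed_idx=[0, 1], budget=20)
preds = [int(predict_proba(w, b, x) >= 0.5) for x in pool]
accuracy = sum(p == t for p, t in zip(preds, truth)) / len(truth)
print(f"labeled {len(labeled)} of {len(pool)} pairs, accuracy {accuracy:.2f}")
```

The loop spends its labeling budget only on pairs the current model is least sure about. The abstract's approach differs in that it selects samples using their position in the latent space and deliberately favors positive pairs, which are rare in entity matching; neither idea is reproduced in this sketch.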

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 4
PACMMOD
December 2023
1317 pages
EISSN: 2836-6573
DOI: 10.1145/3637468
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. active learning
  2. entity matching
  3. entity resolution

Qualifiers

  • Research-article

Funding Sources

  • Benjamin and Florence Free Chair
  • National Science Foundation (NSF)
