SSN filtering method with pre-trained models for entity matching in data washing machine

  • Bushra Sajid, Department of Computer Science, The University of Arkansas at Little Rock, AR 72204, USA
  • Ahmed Abu-Halimeh, Department of Information Science, The University of Arkansas at Little Rock, AR 72204, USA
  • John R. Talburt, Department of Information Science, The University of Arkansas at Little Rock, AR 72204, USA
Keywords: data quality; machine learning; entity resolution; filtering method
Article ID: 1929

Abstract

Entity resolution (ER) is a vital process in data integration and quality improvement that identifies and links records referring to the same real-world entity. As data volumes and diversity grow, traditional ER methods struggle with scalability, poor data quality, and sparse or inconsistent records. To address these limitations, this research introduces the proof-of-concept Data Washing Machine (DWM), developed under the National Science Foundation Data Analytics that are Robust and Trusted (NSF DART) Data Life Cycle and Curation research theme, which automates the detection and correction of data quality errors through unsupervised entity resolution. The study advances ER by replacing traditional rule-based approaches with machine learning (ML) and deep learning techniques, particularly in the linking process. Deep learning models such as Bidirectional Encoder Representations from Transformers (BERT) and its variants are used to improve similarity scoring within Cluster ER methods. Integrated into the DWM framework, these models use attention mechanisms to generate reference embeddings and compute similarity score vectors. The research also optimizes candidate pair reduction during the ER blocking process to improve efficiency, and it proposes a novel method for handling sensitive data, such as Social Security Numbers (SSNs), to streamline pair reduction in the linking stage. A comparative analysis of the Linking_with_ML and SSN_Filtering_with_ML methods across diverse file types shows that SSN_Filtering_with_ML achieves higher precision while maintaining a balanced trade-off between precision and recall. These findings indicate that the method is robust and accurate in entity matching, and that it strengthens the DWM’s capacity for accurate record linkage while reducing unnecessary comparisons. This research contributes to advancing data quality practices and supports better decision-making across organizations by providing scalable and efficient solutions for complex entity resolution challenges.
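
The abstract does not spell out the SSN filtering rule, so the following Python sketch shows only one plausible reading of it: a candidate pair produced by blocking is discarded before any expensive embedding-based scoring when both references contain SSN-like tokens that disagree. The function names, the regular expression, and the sample references are all hypothetical and are not taken from the DWM code.

```python
import re
from itertools import combinations

# Assumption: an SSN appears as 9 digits, optionally written as ddd-dd-dddd.
SSN_PATTERN = re.compile(r"\b(\d{3})-?(\d{2})-?(\d{4})\b")


def extract_ssn(reference):
    """Return the first SSN-like token as a 9-digit string, or None."""
    match = SSN_PATTERN.search(reference)
    return "".join(match.groups()) if match else None


def ssn_filter(pairs, references):
    """Drop a candidate pair only when both references have SSNs that differ.

    Pairs with a missing SSN on either side are kept, so that a later
    similarity model can still decide on them (hypothetical design choice).
    """
    kept = []
    for a, b in pairs:
        ssn_a = extract_ssn(references[a])
        ssn_b = extract_ssn(references[b])
        if ssn_a and ssn_b and ssn_a != ssn_b:
            continue  # conflicting SSNs: skip the costly comparison
        kept.append((a, b))
    return kept


# Made-up unstandardized references, for illustration only.
references = {
    "r1": "JOHN SMITH 123-45-6789 LITTLE ROCK",
    "r2": "JON SMITH 123456789 LITTLE ROCK AR",
    "r3": "JOHN SMITH 987-65-4321 CONWAY",
    "r4": "J SMITH LITTLE ROCK",
}

candidate_pairs = list(combinations(sorted(references), 2))
kept_pairs = ssn_filter(candidate_pairs, references)
print(len(candidate_pairs), "candidate pairs,", len(kept_pairs), "kept")
```

In this toy input, the pairs that survive the filter would then go to the similarity scoring step. Keeping pairs with a missing SSN favors recall, while dropping conflicting pairs is what would be expected to raise precision and reduce comparisons.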

Published
2025-03-25
Section
Article