Performance Analysis of Duplicate Record Detection Techniques

Authors

  • S. H. Adil Department of Computer Science, Iqra University, Karachi, Pakistan http://orcid.org/0000-0003-1280-6645
  • M. Ebrahim Department of Computer Science, Iqra University, Karachi, Pakistan
  • S. S. A. Ali Department of Electrical and Electronic Engineering, Universiti Teknologi PETRONAS, Malaysia
  • K. Raza Department of Computer Science, Iqra University, Karachi, Pakistan
Volume: 9 | Issue: 5 | Pages: 4755-4758 | October 2019 | https://doi.org/10.48084/etasr.3036

Abstract

This paper presents a performance analysis of duplicate data detection techniques for relational databases. The research focuses on traditional SQL-based techniques and modern Bloom filter techniques for finding and eliminating records that already exist in the database during bulk insertion operations (i.e., the bulk insertion involved in the loading phase of the Extract, Transform, and Load (ETL) process and in multisite database synchronization). The analysis was performed on several data sizes using SQL, a Bloom filter, and a parallel Bloom filter. The results show that the parallel Bloom filter is highly suitable for duplicate detection in the database.
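As an illustration of the general idea only, and not the authors' implementation, the following minimal Python sketch shows how a Bloom filter can screen a bulk-insert batch for records whose keys already exist. All names here (`BloomFilter`, `filter_new_records`, `key_of`, `confirm_exists`) are hypothetical, the filter is sequential rather than parallel, and the double-hashing scheme over SHA-256 is an assumption chosen for simplicity. Because a Bloom filter can report false positives but never false negatives, a key that misses the filter is certainly new, while a hit is confirmed against the authoritative store (in practice, presumably a SQL lookup).

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter; sizing follows the standard formulas."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        n = max(1, expected_items)
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
        self.num_bits = max(
            8, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing (an assumption): derive k positions from two
        # 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filter_new_records(existing_keys, incoming, key_of, confirm_exists):
    """Return the incoming records whose keys are not already stored.

    existing_keys  : keys currently in the table (hypothetical input)
    key_of         : function extracting the key string from a record
    confirm_exists : authoritative check, called only on filter hits
    """
    bloom = BloomFilter(expected_items=len(existing_keys))
    for key in existing_keys:
        bloom.add(key)

    accepted = set()  # keys already accepted from this batch
    new_records = []
    for record in incoming:
        key = key_of(record)
        if key in accepted:
            continue  # duplicate within the batch itself
        if key in bloom and confirm_exists(key):
            continue  # filter hit confirmed: record already exists
        accepted.add(key)
        new_records.append(record)
    return new_records
```

In this sketch the expensive existence check runs only for keys that hit the filter, which is the source of the potential speed-up over checking every record with a query. How the paper partitions the work for its parallel variant is not stated in the abstract and is not modeled here.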

Keywords:

duplicate detection, bloom filter, SQL, database

How to Cite

S. H. Adil, M. Ebrahim, S. S. A. Ali, and K. Raza, “Performance Analysis of Duplicate Record Detection Techniques”, Eng. Technol. Appl. Sci. Res., vol. 9, no. 5, pp. 4755–4758, Oct. 2019.
