A Survey of Text Matching Techniques
Received: 28 November 2020 | Revised: 8 December 2020 | Accepted: 14 December 2020 | Online: 6 February 2021
Corresponding author: A. Baz
Text matching is the process of identifying and locating particular text matches in raw data. It is a vital component of practical applications and an essential process in several fields. Several dynamic techniques have been introduced in this context to ease the generation of patterns from words. The process involves the matching of text files, text mining, text clustering, association rule extraction, word clouds, natural language processing, and text similarity measures (knowledge-based, corpus-based, string-based, and hybrid similarities). The string-based approach is the most prominent form of text mining and is applied in many different cases. This survey covers a new research premise that uses text matching to solve problems, and it summarizes the different approaches used in this domain.
Keywords: text mining, similarity measure, matching, clustering, natural language processing, word cloud
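As a rough illustration of the string-based family of similarity measures named in the abstract, the sketch below computes the Levenshtein edit distance between two strings and normalizes it into a score between 0 and 1. The choice of Levenshtein distance, the function names `levenshtein` and `string_similarity`, and the normalization by the longer string's length are assumptions made for illustration only; the survey itself does not prescribe this particular measure.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 means identical strings,
    0.0 means every position must be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(round(string_similarity("kitten", "sitting"), 3))
```

A measure of this kind compares character sequences only; it carries no notion of meaning, which is the gap that the knowledge-based and corpus-based measures listed in the abstract are intended to address.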
Copyright (c) 2020 Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.