A Comparative Approach of Dimensionality Reduction Techniques in Text Classification
Abstract
This work deals with document classification, a supervised learning task that requires a labeled document set for training and a separate set of documents for testing. The document categorization procedure comprises a sequence of steps: text preprocessing, feature extraction, and classification. In this work, a custom-built dataset was used to train the classifiers in every experiment. The study compares accuracy, average precision, precision, and recall with and without combinations of several feature selection techniques and two classifiers (KNN and Naive Bayes). The results show that the Naive Bayes classifier performed better in many situations.
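As a rough illustration of the classification step described above, the following is a minimal multinomial Naive Bayes text classifier with Laplace (add-one) smoothing. The toy documents, the spam/ham labels, and the whitespace tokenizer are invented for this sketch and are not from the paper's dataset:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Deliberately naive tokenizer: lowercase whitespace split."""
    return text.lower().split()

class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.total_docs = len(labels)
        self.word_counts = defaultdict(Counter)      # class -> term -> count
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for term in tokenize(doc):
                self.word_counts[label][term] += 1
                self.vocab.add(term)
        return self

    def predict(self, doc):
        best_class, best_logp = None, float("-inf")
        vocab_size = len(self.vocab)
        for cls in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            logp = math.log(self.class_counts[cls] / self.total_docs)
            total_terms = sum(self.word_counts[cls].values())
            for term in tokenize(doc):
                count = self.word_counts[cls][term]
                logp += math.log((count + 1) / (total_terms + vocab_size))
            if logp > best_logp:
                best_class, best_logp = cls, logp
        return best_class

# Invented toy training set, purely for illustration.
train_docs = ["win cash prize now", "cheap cash offer",
              "meeting schedule today", "project meeting notes"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = MultinomialNB().fit(train_docs, train_labels)
print(clf.predict("cash prize offer"))    # -> spam
print(clf.predict("team meeting today"))  # -> ham
```

Working in log space avoids floating-point underflow when many small term probabilities are multiplied, and the add-one smoothing keeps unseen terms from driving a class probability to zero.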
Keywords:
stop word removal, stemming, feature weighting and selection, KNN, Naive Bayes
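As a quick illustration of the first two keywords, the sketch below removes stop words and applies a crude suffix-stripping rule as a stand-in for a real stemmer such as Porter's. The stop word list and suffix rules here are invented and far smaller than anything used in practice:

```python
# Toy text preprocessing: stop word removal followed by naive suffix stripping.
STOP_WORDS = {"a", "an", "are", "the", "is", "of", "and", "in", "to", "for"}

def stem(word):
    """Strip one common suffix, keeping at least a 3-letter stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The classifiers are trained using labeled documents"))
# -> ['classifier', 'train', 'using', 'label', 'document']
```

The resulting term list is what a feature weighting scheme (e.g. term frequency or TF-IDF) would consume before feature selection and classification.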
License
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.