A Comparative Approach of Dimensionality Reduction Techniques in Text Classification

S. Rahamat Basha, J. K. Rani

Abstract


This work addresses document classification, a supervised learning task that requires a labeled document set for training and a test set of documents to be classified. Document categorization proceeds through a sequence of steps: text preprocessing, feature extraction, and classification. A self-made data set was used to train the classifiers in every experiment. The work compares accuracy, average precision, precision, and recall for two classifiers (KNN and Naive Bayes), with and without combinations of several feature selection techniques. The results indicate that the Naive Bayes classifier performed better in many situations.


Keywords


stop word removal; stemming; feature weighting and selection; KNN; Naive Bayes
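
The pipeline named in the abstract (preprocessing, feature extraction, classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop word list, the toy training corpus, and the function names `preprocess`, `train_naive_bayes`, and `predict` are all assumptions made for illustration. The sketch uses a multinomial Naive Bayes model with add-one smoothing, and it omits stemming, feature weighting and selection, and the KNN classifier.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "was"}


def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def train_naive_bayes(docs, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = preprocess(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    tokens = [t for t in preprocess(text) if t in vocab]  # ignore unseen words
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy labeled training set (assumed data, not the data set used in the paper).
train_docs = [
    "The team won the football match",
    "A late goal decided the match",
    "The bank raised interest rates",
    "Stock markets and interest rates fell",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "the match was decided by a goal"))
```

A feature selection step would sit between `preprocess` and training, shrinking the vocabulary before the counts are collected. Comparing runs with and without that step is the kind of experiment the abstract describes.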






eISSN: 1792-8036     pISSN: 2241-4487