A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy

S. Rahamat Basha, J. Keziya Rani, J. J. C. Prasad Yadav

Abstract


Automatic summarization is the process of shortening a single document (single-document summarization) or a set of documents (multi-document summarization). This paper proposes a new feature selection method for the nearest neighbor classifier that summarizes the original training documents according to a sentence importance measure. Our approach to single-document summarization scores each sentence with two measures: the frequency of the terms in the sentence and the similarity of that sentence to the other sentences. All sentences are ranked by these measures, and the top-ranked sentences (subject to a threshold constraint) are selected for the summary. The summary of every document in the corpus is written to a new document, which is then used in the summarization evaluation process.
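The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two measures are computed or combined, so everything specific here is an assumption made for illustration: the regex sentence splitter and tokenizer, cosine similarity over term counts, the equal weighting `alpha`, and the `keep_ratio` threshold are hypothetical choices, not the authors' method.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not from the paper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercased alphanumeric tokens (an assumption, not from the paper)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scale(values):
    """Divide by the maximum so both measures lie in [0, 1]."""
    top = max(values, default=0.0)
    return [v / top if top else 0.0 for v in values]


def summarize(text, keep_ratio=0.5, alpha=0.5):
    """Keep the top-ranked sentences, returned in their original order.

    keep_ratio and alpha are hypothetical parameters: the fraction of
    sentences to keep and the weight of the term-frequency measure.
    """
    sentences = split_sentences(text)
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)

    # Document-level term frequencies.
    doc_tf = Counter()
    for bag in bags:
        doc_tf.update(bag)

    # Measure 1: average document frequency of the terms in the sentence.
    freq = [
        sum(doc_tf[t] * c for t, c in bag.items()) / max(1, sum(bag.values()))
        for bag in bags
    ]
    # Measure 2: mean cosine similarity to every other sentence.
    sim = [
        sum(cosine(bags[i], bags[j]) for j in range(n) if j != i) / max(1, n - 1)
        for i in range(n)
    ]

    # Assumed combination: weighted sum of the two scaled measures.
    score = [alpha * f + (1 - alpha) * s for f, s in zip(scale(freq), scale(sim))]

    k = max(1, math.ceil(keep_ratio * n)) if n else 0
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```

For example, `summarize("Cats chase mice. Cats chase birds. Stocks fell sharply today.")` keeps the two mutually similar sentences about cats and drops the unrelated one. In the setting the abstract describes, the summaries produced this way would replace the full training documents before features are extracted for the nearest neighbor classifier.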


Keywords


summarization; dimension reduction; feature selection; feature extraction; feature clustering; text classification





eISSN: 1792-8036     pISSN: 2241-4487