A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy
Abstract
Automatic summarization is the process of shortening one document (single-document summarization) or several documents (multi-document summarization). In this paper, a new feature selection method for the nearest neighbor classifier is proposed, in which the original training documents are summarized according to a sentence importance measure. The proposed single-document summarization approach scores each sentence with two measures: the frequency of the terms it contains and its similarity to the other sentences. All sentences are ranked accordingly, and the top-ranked sentences (subject to a threshold constraint) are selected to form the summary. The summary of each document in the corpus becomes a new, reduced document, which is then used in the summarization evaluation process.
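The sentence-ranking idea described above can be illustrated with a minimal sketch. The combination weights, the selection ratio used as the threshold, and the function name below are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of summarization-based feature reduction: score sentences by
# term frequency and by similarity to other sentences, keep the top-ranked ones.
# Equal weighting of the two measures and the 'ratio' threshold are assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(document_sentences, ratio=0.3):
    """Return a summary built from the top-ranked fraction of sentences."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(document_sentences)      # sentence-term count matrix

    # Measure 1: total frequency of the terms appearing in each sentence.
    tf_score = np.asarray(X.sum(axis=1)).ravel()

    # Measure 2: average cosine similarity of each sentence to the others.
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)
    sim_score = sim.mean(axis=1)

    # Combine the two measures (equal weights assumed) and rank the sentences.
    score = tf_score / tf_score.max() + sim_score
    k = max(1, int(len(document_sentences) * ratio))
    top = sorted(np.argsort(score)[::-1][:k])       # keep original sentence order

    return " ".join(document_sentences[i] for i in top)
```

Applied to every training document, such per-document summaries would form the reduced corpus on which the nearest neighbor classifier is then trained and evaluated.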
Keywords:
summarization, dimension reduction, feature selection, feature extraction, feature clustering, text classification