Design and Analysis of News Category Predictor
Abstract
Recent technological advancements have significantly changed the way news is produced, consumed, and disseminated. Frequent, on-the-spot news reporting is now possible, and readers can access it on smartphones anywhere and at any time. Categorizing news can significantly help its proper and timely dissemination. This study evaluates and compares the performance of news category predictors based on four supervised machine learning models. We use a standard dataset of British Broadcasting Corporation (BBC) news consisting of five categories: business, sports, technology, politics, and entertainment. Four multi-class news category predictors were developed and trained on the same dataset: Naïve Bayes, Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). Each predictor's performance was evaluated by analyzing its confusion matrix and quantifying precision, recall, and overall accuracy on the test dataset. Finally, the performance of all category predictors was compared. The results show that all category predictors achieved satisfactory accuracy. The SVM model performed best of the four, categorizing news articles with 98.3% accuracy, while the KNN model obtained the lowest accuracy. However, the KNN model's performance could be improved by tuning the number of neighbors (K).
Keywords:
category predictor, naive bayes, random forest, KNN, SVM, accuracy
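The evaluation described in the abstract (confusion matrix, per-category precision and recall, overall accuracy) can be illustrated with a minimal sketch. The function name `evaluate` and the confusion matrix counts below are hypothetical and chosen only for illustration; they are not the study's code or results. The sketch assumes rows hold the true category and columns hold the predicted category.

```python
from typing import Dict, List, Tuple


def evaluate(confusion: List[List[int]], labels: List[str]) -> Dict[str, object]:
    """Compute overall accuracy and per-class (precision, recall).

    Assumes confusion[i][j] counts articles whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    per_class: Dict[str, Tuple[float, float]] = {}
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        predicted_as_i = sum(confusion[r][i] for r in range(n))  # column sum
        actually_i = sum(confusion[i])                            # row sum
        precision = true_positive / predicted_as_i if predicted_as_i else 0.0
        recall = true_positive / actually_i if actually_i else 0.0
        per_class[label] = (precision, recall)

    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "per_class": per_class}


# Hypothetical counts for the five BBC categories (illustration only).
labels = ["business", "sports", "technology", "politics", "entertainment"]
confusion = [
    [18, 0, 1, 1, 0],
    [0, 20, 0, 0, 0],
    [1, 0, 18, 0, 1],
    [2, 0, 0, 18, 0],
    [0, 0, 1, 0, 19],
]

result = evaluate(confusion, labels)
print(f"accuracy = {result['accuracy']:.3f}")
for label, (precision, recall) in result["per_class"].items():
    print(f"{label:14s} precision = {precision:.3f}  recall = {recall:.3f}")
```

Running the same function on the confusion matrix of each of the four predictors would give directly comparable figures, which is the kind of comparison the abstract reports.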
License
Copyright (c) 2020 Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.