Design and Analysis of News Category Predictor

Abstract-Recent technological advancements have significantly changed the way news is produced, consumed, and disseminated. News is now reported frequently and on the spot, and can be accessed on smartphones anywhere and anytime. News categorization, or classification, can significantly help in its proper and timely dissemination. This study evaluates and compares the performance of news category predictors based on four supervised machine learning models. We use a standard British Broadcasting Corporation (BBC) news dataset consisting of five categories: business, sports, technology, politics, and entertainment. Four multi-class news category predictors were developed and trained on the same dataset: Naïve Bayes, Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). Each category predictor's performance was evaluated by analyzing the confusion matrix and quantifying precision, recall, and overall accuracy on the test dataset. Finally, the performance of all category predictors was compared. The results show that all category predictors achieved satisfactory accuracy. The SVM model performed best among the four supervised learning models, categorizing news articles with 98.3% accuracy, whereas the KNN model obtained the lowest accuracy. The KNN model's performance may, however, be improved by tuning the number of neighbors (K).
Keywords-category predictor; Naïve Bayes; random forest; KNN; SVM; accuracy


I. INTRODUCTION
Technology has a significant impact on society and has changed the way people access information. News is a well-known and standard service, and recent technological advancements have considerably changed the way it is produced, consumed, and disseminated. News is now reported more frequently and on the spot, and can be accessed on smartphones anywhere and anytime. As a result, people expect to receive news of interest to them in real time. News sources are already flooded with an enormous amount of information, so it is essential to classify news automatically into specific categories based on its content to allow timely and efficient dissemination. Automatic Text Classification (ATC) can be used to efficiently manage text-based information such as news [1][2][3]. It allows timely and efficient information retrieval in the search phase. ATC assigns a news item a relevant category from a predefined set of reference categories, based on features extracted from the text that capture the meaning and context of its words. The time required to categorize news correctly grows with the quantity of text. In a newspaper's archive, articles range from business to technology, so it is inconceivable that humans could manage this abundance of information in a reasonable time frame. Manual document classification is cumbersome and resource-intensive.
The news category predictor aims to recognize and categorize articles based on their content. Automatic news classification plays a vital role in processing a massive number of articles. It classifies and labels news articles by analyzing their content (i.e. extracting feature values) to quickly determine their topic, allowing efficient and speedy news dissemination. Additionally, news websites can increase their visibility by developing a recommendation system that suggests relevant news to attract more attention. Several studies have examined the modeling and performance evaluation of news category predictors using machine learning (ML) algorithms over different datasets, which differ in language and range of categories [4][5][6][7][8][9][10]. In these studies, well-known ML algorithms, such as Naïve Bayes (NB), SVM, and Random Forest (RF), are used to model news category predictors. The findings show that a category predictor's performance can vary with the ML algorithm deployed and the dataset used to train the model. ML is also envisioned to solve problems in various related domains [11,12]. For a given ML algorithm, prediction performance can vary significantly depending on the dataset.
For example, the NB algorithm's precision in categorizing news articles is reported to be 0.92 in [3] and 0.88 in [5]. In these cases, different datasets were used to train the same ML model, and the prediction performance differs. In the past few years, much research has been carried out using different ML algorithms in natural language processing, including text and news classification [13][14][15][16][17]. The current paper, however, focuses on evaluating and comparing the performance of category predictors based on well-known ML algorithms. A standard BBC dataset was chosen, containing news from five categories: business, sports, technology, politics, and entertainment. This dataset is balanced, unlike many traditional datasets, which usually contain biases.
The main contribution of this research is that, for the first time, multi-class news category predictors were developed by training four well-known machine learning algorithms (i.e. NB, RF, KNN, and SVM) on the same dataset. Each category predictor's performance was evaluated by analyzing the confusion matrix and quantifying precision, recall, and overall accuracy on the test dataset. Finally, the performance of all category predictors was compared.

II. METHODOLOGY AND DATASET
The ultimate aim of this study is to classify news into specific categories and analyze the performance of the category predictors. Initially, data are collected and preprocessed. The content of each text document ($D_j$) is then converted into useful features ($w_{1j}, \dots, w_{kj}$) by feature extraction algorithms such as unigrams. The extracted features are transformed into numeric data that act as inputs for machine learning algorithms or classifiers (e.g. NB and RF). Finally, the ML models are trained on these transformed features, and their performance is evaluated on the test dataset. The research methodology/workflow is given in Figure 1.

A. Dataset
In this study, a BBC news dataset obtained from Kaggle was used. It consists of 1,490 documents from the BBC news website corresponding to stories in five topical areas: business, sports, technology, politics, and entertainment. The dataset is almost balanced: it contains an approximately equal portion of each class (Figure 2). The sports category has the largest share of samples (23.2%), whereas the tech category has the smallest. The distribution of classes plays an important role in classification, and balanced datasets generally result in better learning models. In this study, the dataset is divided into 1,192 records (80%) for training and 298 records (20%) for testing.

Fig. 1. Workflow.

Fig. 2. Class distribution in the dataset.
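
The 80/20 partition described in this subsection can be illustrated with a few lines of Python. This is only a sketch: the helper name `split_indices` and the random seed are assumptions, and the paper does not state how the split was drawn or whether it was stratified by class.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle (seed is arbitrary)
    n_test = round(n_samples * test_fraction)   # 20% of 1,490 documents -> 298
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1490)
print(len(train_idx), len(test_idx))  # 1192 298
```

A stratified split, which preserves the class proportions in both parts, would be a reasonable alternative for a dataset that is only approximately balanced.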

B. Text Cleaning/Preprocessing
Text preprocessing, or cleaning, is a preliminary and crucial step of news classification, which reduces the required space and makes the classification more efficient [18]. Most of the time, the dataset is unstructured and mixes useful and useless data. Unnecessary information such as stop words, punctuation, special characters, irrelevant sentences, quotations, and dates does not add any predictive power to the classifier. It only consumes space and can distort the ML model. Therefore, before extracting any feature from the raw dataset, a cleaning process should be performed to minimize the distortions introduced to the model. In this paper, the following steps were used to preprocess the news text.

1) Converting Text to Lower Case
All text is transformed to the same letter case (i.e. lower case) to eliminate words that differ only in their case. For instance, the words "Fruit" and "fruit" are the same word and should not be considered separately for prediction.

2) Removing Punctuations and Special Characters
Characters such as ?, !, ; and . are removed, which simplifies computations in the next steps. Special characters and unnecessary whitespace are also removed because they do not contribute to predictive power.

3) Filtering Stop Words
This technique removes unnecessary words or words with no specific meaning, such as "the", "an", "a", and "what", so that the classifier cannot correlate stop words with important class features. Furthermore, the most frequent and the most rarely used words do not contribute to the predictive power of the model and should therefore be removed from the training set. In this study, we downloaded a list of English stop words from the nltk library and removed them from the dataset.
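
The cleaning steps (lowercasing, removing punctuation and special characters, and filtering stop words) can be sketched as a single function. The stop-word set below is a tiny hand-written stand-in, not the nltk list used in the study, and the function name `clean_text` and the ASCII-only regular expression are assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word list; the study uses the full English list from nltk.
STOP_WORDS = {"the", "an", "a", "what", "is", "of", "and", "to", "in"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation/special characters, and drop stop words."""
    text = text.lower()                                 # same letter case
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # punctuation and special characters
    tokens = text.split()                               # also collapses extra whitespace
    return [t for t in tokens if t not in stop_words]   # stop-word filtering

print(clean_text("The Fruit market is UP 5%!  What a day..."))
# ['fruit', 'market', 'up', '5', 'day']
```

Because the text is lowercased first, "Fruit" and "fruit" map to the same token, which is the behavior the lowercasing step is meant to guarantee.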

III. FEATURE ENGINEERING/TEXT REPRESENTATION
ML decision models (classifiers or regression algorithms) can only process and learn from numerical feature vectors. Therefore, the text features must be converted to a numeric representation [19,20]. Feature engineering is the process of transforming data (in this case, text) into numeric features that can act as inputs for ML algorithms. It involves two steps: feature extraction, to extract the unique features/patterns, and feature representation, to represent each feature numerically. Bag of Words (BOW) and N-grams are commonly used techniques to create text features [21,22]. BOW is the simplest feature extraction technique. It breaks a document into individual words and uses each word's count/occurrence as a feature, without considering word order. In contrast, an N-gram is a sequence of N tokens (words) in the text. It captures not only the frequency of a word in the document but also the order of, and relationship between, words. N-grams can be Unigrams (N=1), Bigrams (N=2), or Trigrams (N=3), depending on the number of tokens taken into consideration. The extracted feature values can be represented by different techniques such as Binary Representation (BR), Term Frequency (TF), Term Frequency Inverse Document Frequency (TF-IDF), and Normalized TF-IDF [23,24]. In this study, the TF-IDF feature representation technique is used. TF-IDF down-weights the most common words and emphasizes the most important feature words in the text [25]. The TF-IDF algorithm works on the principle that if a word ($w_k$) is frequent in one document (j) but appears infrequently in the rest of the corpus, then it has a stronger ability to distinguish the category of texts and should be given more weight.
The TF-IDF can be estimated by:

$$w_{kj} = tf_{kj} \times \log\left(\frac{N}{df_k}\right)$$

where $w_{kj}$ is the weight of the word k in the document j, N represents the total number of documents, $tf_{kj}$ is the frequency of the word k in the document j, and $df_k$ represents the number of documents containing the word k.
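
To make the feature extraction and weighting concrete, the sketch below builds N-grams and computes TF-IDF weights on a toy corpus. It assumes the common form of the weight, term frequency multiplied by log(N / df), with raw counts and a natural logarithm; the function names and the three toy documents are hypothetical, and library implementations often add smoothing or normalization that is not shown here.

```python
import math
from collections import Counter

def ngrams(tokens, n=1):
    """Return the list of n-grams (as tuples) in token order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Weight each term k in document j as tf_kj * log(N / df_k)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)                                    # raw term frequency
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["market", "shares", "rise"], ["match", "goal", "rise"], ["market", "bank"]]
w = tfidf(docs)
print(ngrams(docs[0], 2))   # [('market', 'shares'), ('shares', 'rise')]
print(round(w[0]["shares"], 3), round(w[0]["market"], 3))  # 1.099 0.405
```

In the toy corpus, "shares" occurs in one document out of three and receives a higher weight than "market", which occurs in two, matching the principle that rarer words are more discriminative.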

IV. CLASS REPRESENTATION/ENCODING
News category prediction is a multi-class classification problem. For instance, the dataset used in this study corresponds to five classes: business, sports, technology, politics, and entertainment. Class labels are usually given as words to make them more understandable. For ML models, label encoding is used to transform these labels into numeric values. It can be done by a Label Encoder, which converts class labels into values between 0 and n-1, where n is the number of unique class labels. The actual and encoded labels of the dataset used in this study are given in Table I.
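
A label encoder of this kind can be sketched in a few lines. The alphabetical ordering used below is an assumption (it is a common convention), so the resulting codes are illustrative and may not match the mapping reported in Table I; the function name `fit_label_encoder` is hypothetical.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer in 0..n-1 (alphabetical order assumed)."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

labels = ["business", "sports", "technology", "politics", "entertainment", "sports"]
encoding = fit_label_encoder(labels)
encoded = [encoding[label] for label in labels]
print(encoding)
# {'business': 0, 'entertainment': 1, 'politics': 2, 'sports': 3, 'technology': 4}
print(encoded)  # [0, 3, 4, 2, 1, 3]
```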

V. MACHINE LEARNING ALGORITHMS/CLASSIFIERS
A classifier is an ML model that maps input data to a category. In this study, the NB, RF, KNN, and SVM algorithms are used to train models that classify news articles into categories.

A. Naïve Bayes
NB is a probabilistic classification algorithm based on Bayes' theorem. It is simple, yet quite effective, especially in text classification. The probability of a specific event is estimated from its frequency in past observations [26]. The fundamental NB assumption is that each feature is conditionally independent of every other feature given the class. Bayes' theorem is:

$$p(C \mid f) = \frac{p(f \mid C)\, p(C)}{p(f)}$$

where $p(C \mid f)$ is the probability of occurrence of C given that event f has already occurred. The event f is termed the evidence, $p(C)$ is the prior probability of the class, $p(f \mid C)$ is termed the likelihood, and $p(C \mid f)$ is the posterior probability. In text classification, the features can be numerous, $f = (f_1, f_2, f_3, f_4, \dots, f_n)$, so by substituting f and expanding using the chain rule together with the independence assumption we get:

$$p(C \mid f_1, \dots, f_n) \propto p(C) \prod_{i=1}^{n} p(f_i \mid C)$$

Thus, the category is found by selecting the class with the maximum posterior probability.
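
The decision rule above, choosing the class that maximizes the prior times the product of per-feature likelihoods, can be sketched as a small multinomial Naïve Bayes over word counts. Two details are assumptions not taken from the paper: add-one (Laplace) smoothing, used here to avoid zero likelihoods, and working with log-probabilities to avoid numerical underflow. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Return the class maximizing log p(C) + sum_i log p(f_i | C)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)               # log prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue                                   # ignore words never seen in training
            # Add-one smoothing (an assumption here) avoids zero likelihoods
            p = (word_counts[label][t] + 1) / (total_words + len(vocab))
            score += math.log(p)                           # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["shares", "market", "profit"], ["goal", "match", "win"],
        ["market", "bank", "profit"], ["match", "team", "goal"]]
labels = ["business", "sports", "business", "sports"]
model = train_nb(docs, labels)
print(predict_nb(["bank", "profit", "market"], model))  # business
```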

B. Random Forest
Random Forest (RF) is a machine learning algorithm based on a set of tree classifiers [27]. RF is an ensemble method for classification that constructs several decision trees at training time and makes its final decision by majority voting. It uses bootstrap sampling, in which the data samples for each tree are drawn independently and with the same distribution for all trees in the forest [26].
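
The two ingredients named above, bootstrap sampling and majority voting, can be illustrated without building full decision trees. The sketch below is not a complete Random Forest: the tree-growing step is omitted, the per-tree votes are hypothetical, and the helper names are assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (one bootstrap replicate)."""
    return [rng.choice(samples) for _ in samples]

def majority_vote(predictions):
    """Return the most common class among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))                                         # stand-in for 10 training rows
replicates = [bootstrap_sample(data, rng) for _ in range(3)]   # one replicate per tree
print([len(r) for r in replicates])                            # [10, 10, 10]

# Hypothetical votes from five trees for a single article
print(majority_vote(["sports", "business", "sports", "tech", "sports"]))  # sports
```

Each replicate has the same size as the original data but typically repeats some rows and omits others, which is what makes the individual trees differ from one another.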

C. K-Nearest Neighbors
KNN is an intuitive supervised learning algorithm that is easy to implement. It classifies objects based on their nearest examples in the training feature space: an object is assigned to the most common class among its k nearest neighbors, where k is a positive integer. In this study, the algorithm is implemented using the Euclidean distance metric to find the nearest neighbors [29][30][31]. The main challenge in KNN is to determine the optimal value of k. A very small k makes the predictor sensitive to noise and increases the risk of over-learning, whereas a very large k blurs the boundaries between classes, so a suitable value of k must be selected. The Euclidean distance d(x, y) between two points is computed as:

$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$

where N is the number of features, $x = \{x_1, x_2, x_3, \dots, x_N\}$ and $y = \{y_1, y_2, y_3, \dots, y_N\}$. The number of neighbors k used to classify a new vector was varied from 1 to 10.
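
A minimal sketch of the KNN procedure follows: compute the Euclidean distance from the query vector to every training vector, keep the k closest, and take a majority vote. The two-dimensional toy vectors, their labels, and the function names are hypothetical; in the study the vectors would be TF-IDF features.

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) over the N features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, train_x, train_y, k=5):
    """Label the query by majority vote among its k nearest training vectors."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: euclidean(query, p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.0], [0.9, 0.7]]
train_y = ["business", "business", "business", "sports", "sports", "sports"]
print(euclidean([0.0, 0.0], [3.0, 4.0]))                  # 5.0
print(knn_predict([0.9, 0.9], train_x, train_y, k=3))     # sports
```

Searching for a good k then amounts to repeating the prediction for each candidate value (for example 1 to 10) on held-out data and keeping the value with the best accuracy.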

D. Support Vector Machine
The SVM is a kernel-based ML algorithm that can categorize input data into specific classes or categories. It constructs a decision boundary for every class and defines a hyperplane that separates the classes linearly or nonlinearly. Categorization accuracy can be improved by maximizing the margin of the hyperplane, which also enlarges the distance between classes; hence, the hyperplane with the largest margin provides more immunity against noise. As a kernel-based classifier, SVM maps the training data into a space in which the data become linearly separable. The kernel function performs this mapping by increasing the dimensionality of the data; commonly used kernels include linear, RBF, and quadratic [32,33].
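
The three kernels mentioned above can be written out directly as similarity functions between two feature vectors. This sketch shows only the kernels, not SVM training; the constant `c` and the width `gamma` are illustrative hyperparameter values, since the paper does not state which kernel or settings were used.

```python
import math

def linear_kernel(x, y):
    """k(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def quadratic_kernel(x, y, c=1.0):
    """k(x, y) = (x . y + c)^2, a degree-2 polynomial kernel."""
    return (linear_kernel(x, y) + c) ** 2

def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))      # 11.0
print(quadratic_kernel(x, y))   # 144.0
print(rbf_kernel(x, x))         # 1.0
```

The RBF kernel equals 1 for identical vectors and decays toward 0 as they move apart, which is how it encodes similarity without computing the high-dimensional mapping explicitly.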

VI. PERFORMANCE EVALUATION METRICS
Several performance evaluation techniques and metrics can be used to gauge the performance of a category predictor, such as the Confusion Matrix, Accuracy, Precision, Recall (or Sensitivity), and F1-Score. In this study, the Confusion Matrix is evaluated first, and Accuracy, Precision, and Recall are then analyzed to gain a true insight into the prediction performance. The Confusion Matrix is a table that is often used to quantify the performance of a category predictor or classification model on a set of test data for which the true/actual values are known. It summarizes the prediction performance by counting the correct and incorrect predictions (misclassifications) for each class, and gives a detailed insight into the ways in which the category predictor is confused while classifying the input data. Therefore, a confusion matrix is a good option for evaluating the performance of a multi-class category predictor.
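
The metrics can be derived from the confusion matrix as in the sketch below. It assumes the convention that rows are true classes and columns are predicted classes; the three-class toy labels and the function names are hypothetical and unrelated to the study's results.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_metrics(matrix):
    """Return overall accuracy plus (precision, recall) for each class index."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    scores = []
    for i in range(n):
        predicted = sum(matrix[r][i] for r in range(n))   # column sum: TP + FP
        actual = sum(matrix[i])                           # row sum: TP + FN
        precision = matrix[i][i] / predicted if predicted else 0.0
        recall = matrix[i][i] / actual if actual else 0.0
        scores.append((precision, recall))
    return accuracy, scores

labels = ["business", "sports", "tech"]
y_true = ["business", "business", "sports", "sports", "tech", "tech"]
y_pred = ["business", "tech", "sports", "sports", "tech", "tech"]
cm = confusion_matrix(y_true, y_pred, labels)
accuracy, scores = per_class_metrics(cm)
print(cm)  # [[1, 0, 1], [0, 2, 0], [0, 0, 2]]
```

In this toy case one business article is predicted as tech, so business has perfect precision but a recall of 0.5, while tech has perfect recall but a precision of 2/3.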

VII. RESULTS AND DISCUSSION
The performance of multi-class category predictors based on different supervised learning models is evaluated and compared in this section. The learning models considered are NB, RF, KNN, and SVM. Each category predictor was evaluated by analyzing its Confusion Matrix and quantifying Precision, Recall, and overall Accuracy on a test dataset of 298 news samples. The Confusion Matrices are given in Figure 3 for NB (a), RF (b), KNN (c), and SVM (d). In general, every category predictor achieved good accuracy. The SVM-based category predictor achieved the highest accuracy for the given dataset, 98.3%, which can be verified from Figure 3(d): it shows only five wrong predictions out of 298 samples. The NB model performed nearly as well as SVM, with an accuracy of 97.3%. In the NB model, the most misclassified category was technology, with four incorrect predictions, while the most accurately classified category was sports, with no wrong predictions. The detailed category/class-wise analysis of Precision and Recall for SVM and NB is given in Table II and Table III, respectively.

The confusion matrix of the KNN model is shown in Figure 3(c). The detailed category/class-wise analysis of Precision and Recall for the KNN model is given in Table V. Although KNN achieved the lowest accuracy compared to SVM, NB, and RF, its performance is still considered satisfactory.

VIII. CONCLUSION AND FUTURE WORK
This paper presents a comparative analysis of the prediction performance of multi-class category predictors. News category predictors were developed by training well-known machine learning algorithms (Naïve Bayes, Random Forest, K-Nearest Neighbors, and Support Vector Machine) on a BBC news dataset with five categories (business, sports, technology, politics, and entertainment). Their performance was then evaluated by analyzing the Confusion Matrix and quantifying Precision, Recall, and overall Accuracy on the test dataset. The SVM model proved to be the best among the four supervised learning models, correctly categorizing news articles with 98.3% accuracy. The lowest accuracy was obtained by the KNN model with K=5. However, the KNN model's performance may be improved by searching for the optimal number of neighbors K. As future work, deep learning schemes will be introduced to further improve classifier performance.