A Comparative Approach of Dimensionality Reduction Techniques in Text Classification

This work deals with document classification, a supervised learning task: a labeled document set is needed for training, and a separate set of documents is classified at test time. Document categorization proceeds through a sequence of steps: text preprocessing, feature extraction, and classification. In this work, a self-made data set was used to train the classifiers in every experiment. We compare accuracy, average precision, precision, and recall with and without combinations of several feature selection techniques and two classifiers (KNN and Naive Bayes). The results show that the Naive Bayes classifier performed better in many situations.

Keywords- stop word removal; stemming; feature weighting and selection; KNN; Naive Bayes

I. INTRODUCTION

In text classification, the dimensionality of the feature vector is usually huge because the input documents contain vast amounts of data and many terms [1,2]. The major approaches to feature reduction are feature selection [2][3][4][5][6][7][8] and feature extraction [9,10]. Feature extraction approaches are computationally more expensive but more effective than feature selection methods [9,10]. Feature clustering is one effective feature reduction technique, in which similar features are grouped into clusters and each cluster is treated as a single feature [11,12]. To reduce dimensionality during preprocessing, words which do not support the classification task (e.g. articles, prepositions, auxiliary verbs) are removed. In text categorization by supervised learning, labels from predefined categories (e.g. business, health, movies) are assigned to documents. The number of digital documents on the web keeps increasing; the number of terms (i.e. features) in those documents is quite large, but only a few are informative. This is a severe problem which degrades the efficiency of Information Retrieval (IR) procedures.
The current work includes a stemming process [13], which reduces the dimensionality of the feature space and the stochastic dependence between terms. A better feature selection procedure improves both classification effectiveness and computational efficiency. In this paper, feature weighting is presented along with the implementation of different feature selection methods and the KNN and Naive Bayes classifiers [14,15,[26][27][28][29][30]. Experiments are conducted and the results are analyzed.
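As a concrete illustration of the preprocessing step, the following is a minimal sketch of stop-word removal followed by suffix stripping. The tiny stop list and the naive suffix stripper are toy stand-ins for a real stop-word list and a proper stemmer such as Porter's, which the paper's pipeline would actually use:

```python
# Minimal preprocessing sketch: lowercase, drop stop words, strip a suffix.
# The stop list and suffix rules are toy placeholders, not a real stemmer.

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and"}

def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix (a crude stand-in for Porter stemming)."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Tokenize, remove stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, preprocess("The players are playing the matches") reduces six tokens to the three stems "player", "play", "match", shrinking the feature space exactly as described above.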
II. RELATED WORK

In the feature selection approach, redundant and irrelevant features are removed from the corpus, e.g. by selecting a subset of features from the training set and using that subset as the feature set for text classification. Several supervised feature selection approaches (IG, MI, OR, CHI, NGL, GSS, etc.) [16] were used in our task. To reduce data noise with respect to term frequency (TF) [16], document frequency (DF) [16] is applied with a user-supplied threshold: the most probable features are selected (by a threshold value k) and the results are analyzed with respect to dimensionality size. Rule-based classification is accurate if the rules are written by experts, and the rules are easy to maintain while their number is small; but as the number grows or the rules conflict with each other, maintenance becomes difficult, and if the target domain changes the rules must be reconstructed. The machine-learning-based approach is domain independent and gives high predictive performance, but training data are required [17].

III. FEATURE WEIGHTING
In this process, each feature (a single word, term, or token) is assigned a score by a score-computing function, and the higher-scored (weighted) terms are selected. Score-computing functions include mathematical definitions and probabilistic approaches estimated from statistical information about the documents across different categories. An example notation based on probabilities is:

• P(t): the probability that a document x contains the term t

A. Document Frequency (DF)

The number of documents in which a word occurs is its DF:

DF(t) = |{A_i : t occurs in A_i, i = 1, ..., m}|

where A_i is document i in which the word is present, m is the number of documents, and i is an integer ranging from 1 to m.
The DF was computed for every unique term in the training corpus, and the features with a DF below the predefined threshold were removed.
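A minimal sketch of this DF-based pruning, assuming documents are already tokenized (the threshold value in the example is illustrative):

```python
from collections import Counter

def df_filter(docs, threshold):
    """Keep only terms whose document frequency meets the threshold.

    docs: list of token lists; DF counts each term once per document.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): count each term at most once per document
    return {term for term, count in df.items() if count >= threshold}
```

With docs = [["price", "stock"], ["stock", "trade"], ["goal", "stock"]] and threshold 2, only "stock" (DF = 3) survives; all other terms have DF = 1 and are removed.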

B. Mutual Information (MI)
MI and IG give similar results for binary problems. In the multi-class procedure implemented here, however, the two techniques give different results.
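For reference, pointwise MI between a term t and a class c can be estimated from document counts. The sketch below assumes the usual contingency counts (A = documents of class c containing t, B = other documents containing t, C = documents of c without t, D = the rest); the paper's exact estimator is not shown, so this is the standard formulation:

```python
import math

def mutual_information(A, B, C, D):
    """MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from doc counts.

    A: docs of class c containing t, B: other docs containing t,
    C: docs of c without t, D: remaining docs. Assumes A > 0.
    """
    N = A + B + C + D
    return math.log((A * N) / ((A + B) * (A + C)))
```

Under independence the score is zero; positive values indicate the term is over-represented in the class.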

C. Chi Square
It is a statistical measure of the independence of a feature and a class. In this context, the null hypothesis is that the particular word and category are completely independent, i.e. that the word is useless for classifying documents.
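Using the same A/B/C/D contingency counts (A = class-c documents containing t, B = other documents containing t, C = class-c documents without t, D = the rest), the standard Chi Square statistic can be sketched as:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for term/class independence from doc counts A-D."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
```

The score is 0 when term and class are independent and grows with the strength of association; for a term occurring in every class-c document and nowhere else (A = D = 10, B = C = 0), the statistic equals N = 20.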

D. GSS Coefficient
It is a simplified Chi Square function.
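One common form of the GSS coefficient, written with the same A/B/C/D document counts (an assumption, since the paper's own formula is not reproduced), is GSS(t, c) = P(t, c) P(not-t, not-c) - P(t, not-c) P(not-t, c); compared with Chi Square it simply drops the normalizing denominator:

```python
def gss(A, B, C, D):
    """GSS(t, c) = P(t,c)*P(not t, not c) - P(t, not c)*P(not t, c)."""
    N = A + B + C + D
    return (A * D - B * C) / (N * N)
```

Like Chi Square it is zero under independence, but its sign also indicates the direction of the association.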

E. Odds Ratio (OR)
This measure compares the odds of a word occurring in one class with the odds of it occurring in another. OR is positive if the feature occurs more often in one class than in the other, negative for the reverse, and zero if the feature's presence is equal in both:

OR(t, c) = log( [ P(t|c) (1 - P(t|not c)) ] / [ (1 - P(t|c)) P(t|not c) ] )

F. NGL Coefficient
It is a variant of the Chi Square metric, also called the correlation coefficient.
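With the same A/B/C/D counts, the NGL coefficient is the signed square root of the Chi Square statistic, so unlike Chi Square it distinguishes positive from negative term/class correlation:

```python
import math

def ngl(A, B, C, D):
    """NGL (correlation) coefficient: signed square root of chi-square."""
    N = A + B + C + D
    return math.sqrt(N) * (A * D - C * B) / math.sqrt(
        (A + C) * (B + D) * (A + B) * (C + D)
    )
```

Squaring the NGL score recovers the Chi Square value, and a term anti-correlated with the class (present only outside it) gets a negative score.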

G. Information Gain (IG)
This method scores a term by how much information about class membership is gained from knowing the term's presence or absence. The IG of a term t is given as:

IG(t) = - SUM_i P(c_i) log P(c_i) + P(t) SUM_i P(c_i|t) log P(c_i|t) + P(not t) SUM_i P(c_i|not t) log P(c_i|not t)
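The formula above can be estimated from document counts; a minimal sketch, where n_c holds documents per class, n_tc documents per class containing t, and N the corpus size:

```python
import math

def information_gain(n_tc, n_c, N):
    """IG of term t from document counts.

    n_c:  {class: docs in class};  n_tc: {class: docs in class containing t};
    N: total docs. Zero-probability terms are skipped (their limit is 0).
    """
    n_t = sum(n_tc.values())
    # Class entropy: - sum P(c) log P(c)
    ig = -sum((nc / N) * math.log(nc / N) for nc in n_c.values())
    for c in n_c:
        if n_t and n_tc[c]:                       # P(t) * P(c|t) log P(c|t)
            p = n_tc[c] / n_t
            ig += (n_t / N) * p * math.log(p)
        absent = n_c[c] - n_tc[c]
        if N - n_t and absent:                    # P(not t) * P(c|not t) log P(c|not t)
            p = absent / (N - n_t)
            ig += ((N - n_t) / N) * p * math.log(p)
    return ig
```

A term that perfectly predicts one of two equiprobable classes gains the full class entropy, log 2, while a term spread evenly across both classes gains nothing.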

J. KNN and Naive Bayes Classifiers
The advantage of KNN in this model is that, by choosing different constraints at every level of the classification task, we may compare the results with respect to the N (N variable) most closely matched values. This classifier is implemented by computing the Euclidean distance. The following is an illustration of the Naive Bayes classifier. Let D be a document set with 6 documents d1, d2, ..., d6. The documents d1, d2, ..., d5 are used to train the classifier and we predict the label of document d6. The classifier is trained under the bag-of-words representation, as shown in Table I. The product of all the individual probabilities determines the label of d6.

IV. EXPERIMENTS

A. Data Set 1: Self-Made Data Set

To train the classifiers, we used a small (around 1.5MB) self-made corpus, which keeps the running time needed for training as short as possible. The documents of the self-made corpus were online articles collected from CNN, Washington Post, and New York Times. We collected 150 documents under the following categories: Business (23), Education (24), Health (30), Movies (10), Science (27), Sports (30), Travel (6), with an average of 702 words per document.
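The bag-of-words Naive Bayes procedure illustrated above (train on d1-d5, predict d6) can be sketched as follows. The five training documents and their labels are invented stand-ins, since Table I is not reproduced here, and Laplace smoothing is added to avoid zero probabilities:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (text, label). Returns class priors and word counts."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return priors, word_counts, vocab, len(docs)

def predict_nb(text, priors, word_counts, vocab, n_docs):
    """Pick the class maximizing log P(c) + sum log P(w|c), Laplace-smoothed."""
    best, best_lp = None, float("-inf")
    V = len(vocab)
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        lp = math.log(prior / n_docs)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + V))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Five hypothetical training documents (d1-d5) and the test document d6.
train = [("win match goal", "sports"), ("team win game", "sports"),
         ("match team play", "sports"), ("stock market price", "business"),
         ("market trade price", "business")]
model = train_nb(train)
```

Calling predict_nb("win goal match", *model) multiplies the per-word probabilities (summed in log space) and labels d6 as "sports", mirroring the product-of-probabilities rule stated above.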

C. Classifiers Performance and Results
For most results we may conclude that the Naive Bayes classifier performed better, even though both classifiers are efficient. Figure 1 shows that the KNN classifier gives different results for each feature selection technique, and Figure 2 shows that the Naive Bayes classifier gives almost identical results for all feature selection techniques except IG and MI. We observe that the KNN classifier gave similar results for MSF and Chi Square at feature space sizes 250 and 750, and for IG and DF at feature space sizes 500 and 750.

Precision and recall alone do not describe the effectiveness of a classifier; it is necessary to compute further evaluation metrics. The F-measure, i.e. the harmonic mean of recall and precision, was therefore computed. Micro-average F1 is calculated regardless of topics, whereas macro-average F1 averages the scores over all topics [17,24]. Average precision [25] is the average of the precisions at eleven evenly spaced recall levels.

Tables IX and XIV report the precision and recall values (micro and macro) given by the KNN and Naive Bayes classifiers respectively while using different feature reduction techniques. The performance of KNN at k=30, threshold=1, threshold step size=0.1, method of summation=sum, DF <= 1, TF <= 1, and of the Naive Bayes classifier at threshold=0.006666666666666667, threshold step size=0.0001, method of summation=sum, DF <= 1, TF <= 1, for the test document Reut.003.xml at different dimension sizes and with different feature reduction techniques, is given in Tables X-XIII and XV-XVIII respectively.

Figures 1-2 clearly show that the KNN classifier classified well with MI and DF, but when the other feature reduction techniques were used its average precision was low. The Naive Bayes classifier performed well with all feature reduction techniques except MI. The highest average precision reported by KNN was obtained with DF.
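The micro- and macro-averaged F1 scores discussed above can be computed from per-class true positives, false positives, and false negatives; the counts in the example are invented for illustration:

```python
def micro_macro_f1(stats):
    """stats: {class: (tp, fp, fn)}. Returns (micro_f1, macro_f1).

    Micro-F1 pools the counts over all classes before computing F1;
    macro-F1 computes F1 per class and averages the scores.
    """
    def f1(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return 2 * p * r / (p + r)

    tp = sum(s[0] for s in stats.values())
    fp = sum(s[1] for s in stats.values())
    fn = sum(s[2] for s in stats.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*s) for s in stats.values()) / len(stats)
    return micro, macro
```

For example, with two classes scoring (tp=8, fp=2, fn=2) and (tp=6, fp=4, fn=4), the per-class F1 values are 0.8 and 0.6, so macro-F1 is 0.7; pooling the counts gives precision = recall = 14/20, so micro-F1 is also 0.7. The two averages diverge when class sizes are imbalanced, which is why both are reported.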
The results reported in Tables X, XV, XI, and XVI reveal that the Naive Bayes classifier worked well with respect to 11-point precision and breakeven point. The results reported in Tables XII, XVII, XIII, and XVIII reveal that the KNN classifier worked better than Naive Bayes with respect to micro and macro precision and recall. Figure 3 shows that the Naive Bayes classifier works better than KNN on the Micro F1 and Macro F1 measures.
The number of categories, the size of each class and of the corpus, the feature selection techniques used, etc. were the key factors behind the experimental results. The time complexity of the experiments was not considered in this study. Generally, the KNN classifier is simple to use and takes less time than Naive Bayes, but our results show that Naive Bayes can work better in many cases.