A Comparative Analysis of Classification Algorithms on Diverse Datasets

Data mining is the computational process of discovering patterns in large data sets. Classification, one of its main domains, generalizes known structure from training data in order to predict the class of new data. Numerous classification algorithms are in use, based on different approaches such as probability, decision trees, neural networks, nearest neighbors, Boolean and fuzzy logic, and kernels. In this paper, we apply three classification algorithms of diverse nature to ten datasets. The datasets have been selected based on their size and/or the number and nature of their attributes. Results are discussed using performance evaluation measures such as precision, accuracy, F-measure, Kappa statistic, mean absolute error, relative absolute error, and ROC area. Comparative analysis has been carried out using the measures of accuracy, precision, and F-measure. We identify the features and limitations of the classification algorithms for datasets of diverse nature.

Keywords-data mining; classification algorithms; diverse; dataset


I. INTRODUCTION
Due to the evolution of computer science and the rapid development and vast usage of the World Wide Web and other electronic data, information extraction is a popular research field. Data mining [1, 2] is a significant method for extracting information from data. It comprises various domains such as classification, clustering, anomaly detection, association rule mining, regression, pattern mining, and summarization, and has further been involved in many other studies such as text mining, social network mining, influence mining, and sentiment mining. Data mining utilizes the information in existing data to examine the outcome of a specific problem, analyzing data that may have been extracted or gathered from any business. Decision makers opt for data mining when deciding on marketing strategies for their products, since it turns data into real-life insight and can be applied to enhance sales, promote new products, or retire old ones. Classification [3, 4] is one of the main domains of data mining and has been used extensively for purposes such as decision making, weather forecasting, prediction of customers' attitudes, social risk analysis, official tasks, and prediction of influential bloggers [5-10]. A classification process can be separated into two main steps. In the first step, a part of the data known as the training data is used; each row of the training dataset comprises a set of features, and determining the classes is the main target of classification. This phase generates the classification model, known as the classifier, which depicts the relationship between features and classes. In the second phase, the classification accuracy of the model generated in the first phase is analyzed. There are various classification algorithms, grouped into different categories: probability-based Bayesian algorithms, decision tree based algorithms, neural network based algorithms, and kernel-based algorithms. Most classifiers use probability calculations to assign class labels; however, the accuracy measure has not always been the target. Naive Bayes and the C4.5 learning algorithm are alike in predictive accuracy [11-13].

II. RELATED WORK
Classification is a very vast domain in data mining and has received a great deal of exploration over the last few decades. A comparison of four classification algorithms (logistic regression, Naïve Bayes, C4.5 decision tree, and nearest neighbor) has been carried out in standard and boosted forms to predict class membership for an online community in [14]. The comparison was evaluated using two performance measures, the area under the curve and the accuracy, in the standard and boosted forms, and the analysis depicts very significant differences among the base classification algorithms. Several classification techniques have been empirically compared in the analysis of unbalanced credit scoring data sets: traditional techniques such as neural networks, logistic regression, and decision trees were used to assess the suitability of support vector machines, gradient boosting, and random forests for loan default prediction [15]. The use of data mining is also very common in bio-informatics. The authors in [27] emphasized the importance of rule-based decision trees as a classification method. There are two types of nondeterministic rules in decision tables, known as inhibitory rules and bounded rules: in the former we have one decision on the right side, while in the latter we can have fewer decisions on the right side. Classification has also been applied in areas such as text classification [23], music emotion classification [24], feature-based mining of digital images [25], and annual crop classification [26].
The discussion above reveals that the existing comparative works on classification algorithms have been carried out on algorithms of the same category. This study is novel in the sense that the algorithms have been selected to be of diverse nature and are applied to diverse data sets, in contrast to comparisons established on deterministic decision rules.

A. Selected Classification Algorithms
There are numerous classification algorithms, but we have focused on algorithms of diverse nature; therefore, three different algorithms have been chosen. C4.5 is a famous algorithm based on decision trees, Naïve Bayes is a probabilistic algorithm, and the Support Vector Machine (SVM) is a kernel-based algorithm. The diversity of the chosen algorithms can lead decision-making to confusion; therefore, a brief description of each is given in the following.

C4.5
A decision tree is a statistical classifier used in classification, and C4.5 produces such a tree. After the tree is built, the C4.5 rule induction program is used to produce a set of rules; trees are denoted by C4T and rules by C4R. At each node of the tree, one attribute efficiently splits the example set into subsets, using information gain as the splitting criterion: the attribute with the highest normalized information gain is selected to make the decision, and the algorithm then recurses on the smaller sub-lists. The decision tree divides the features of the documents into partitions, and splitting the data reduces the chance of error at every stage. The root node, which depicts the input, is used to examine the branches of the tree to predict a label for new data [28]; because of the graphical representation, the data can be examined quickly from root to child nodes and trained in less time. To build a decision tree we need to calculate two types of entropy:
• Entropy using the frequency table of one attribute.
• Entropy using the frequency table of two attributes.
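The two calculations above can be sketched in a few lines of Python; this is an illustrative, hand-rolled version (toy data included), not the Weka implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy from the frequency table of one attribute (the class)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain from splitting on an attribute (frequency table of two attributes):
    class entropy minus the weighted entropy of each value's subset."""
    n = len(labels)
    subsets = {}
    for v, l in zip(values, labels):
        subsets.setdefault(v, []).append(l)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy example: an attribute that perfectly separates the classes
labels = ["yes", "yes", "no", "no"]
attr = ["a", "a", "b", "b"]
print(round(entropy(labels), 3))                 # 1.0
print(round(information_gain(attr, labels), 3))  # 1.0
```

C4.5 computes this gain (normalized by the split information) for every candidate attribute and splits on the highest-scoring one.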

Naïve Bayes (NB)
Naïve Bayes is a probabilistic algorithm that assumes the features are conditionally independent of one another given the class. It is used in supervised learning, parameter estimation for Naive Bayes models, etc., and in practical environments this classifier has performed better than many others; different attempts have been made to improve Naïve Bayes for classification [29]. Given a new document A with features fᵢ, the classifier uses Bayes' rule to select the class t* that maximizes the posterior probability:

t* = argmax_t P(t | A) = argmax_t P(t) ∏ᵢ P(fᵢ | t)

The training procedure estimates P(t) and P(fᵢ | t) by relative frequency.
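As an illustration of relative-frequency training, a minimal hand-rolled Naïve Bayes sketch in Python; the toy documents and the Laplace smoothing for unseen words are our own assumptions, not part of the paper's setup:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate P(t) and the per-class word counts behind P(f_i | t)."""
    prior = Counter(labels)                  # class frequencies -> P(t)
    word_counts = defaultdict(Counter)       # per-class feature frequencies
    for doc, t in zip(docs, labels):
        word_counts[t].update(doc.split())
    vocab = {w for c in word_counts.values() for w in c}
    return prior, word_counts, vocab

def classify(doc, prior, word_counts, vocab):
    """Pick t* = argmax_t P(t) * prod_i P(f_i | t), in log space."""
    n = sum(prior.values())
    best, best_lp = None, float("-inf")
    for t in prior:
        lp = math.log(prior[t] / n)
        total = sum(word_counts[t].values())
        for w in doc.split():
            # Laplace (add-one) smoothing so unseen words get nonzero mass
            lp += math.log((word_counts[t][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = t, lp
    return best

docs = ["cheap pills now", "meeting agenda today",
        "cheap offer now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(classify("cheap pills offer", *model))  # spam
```

The log-space sum avoids underflow when the product of many small probabilities is taken.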

Support Vector Machine (SVM)
SVM is a kernel-based learning algorithm used for pattern recognition, regression analysis, and classification, and it performs well in text classification. It predicts classes using training data and carries out non-linear classification efficiently [30]. The training points are separated into two categories by a decision surface determined by the support vectors. The SVM optimization problem is

min_{w,b} ½‖w‖²  subject to  yᵢ(w·xᵢ + b) ≥ 1 for all training points (xᵢ, yᵢ)
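To make the optimization concrete, here is a minimal linear (non-kernel) SVM trained by batch sub-gradient descent on the soft-margin form of the objective above; the toy data, step size, and epoch count are illustrative assumptions, not the paper's Weka configuration:

```python
def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=1000):
    """Batch sub-gradient descent on the soft-margin SVM objective:
    minimize lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b))).
    Labels y_i must be +1 or -1."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]   # gradient of the regularizer
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # hinge-loss subgradient for margin violators
                for j in range(dim):
                    gw[j] -= yi * xi[j] / len(X)
                gb -= yi / len(X)
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: two symmetric clusters
X = [[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
     [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, 1, -1, -1, -1]
```

A kernelized SVM replaces the dot products with a kernel function, which is what allows the efficient non-linear classification mentioned above.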

B. Selected Datasets
Ten datasets have been considered. The details of each dataset are shown in alphabetical order in Table I.

Contact lenses
The examples in the dataset are complete, but the attributes are unable to capture all the factors affecting the decision.

CPU
It is used for numeric prediction on the basis of instance-based learning with encoding length selection.

Credit
This is a credit-related data set, the largest considered; it consists of 15 attributes, and the result is whether the credit is positive or negative.

Iris-discretized
A small dataset about iris classification. It is unique in that its values are given as ranges expressed with special characters in a unique style.

Labor
This dataset is the most distinctive one, as its attributes are of a unique style: a few contain special characters, a few have enumerated values, and there are Boolean attributes as well. The attributes mainly relate to wages, pension, allowances, assistance, plan, duration period, etc.

Spambase
This dataset concerns spam and has a very large set of attributes in real-number format. The attributes can be described as derived values, usually based on the frequency of words, characters, or case-based sentences in various categories. It is unique in that its values are in real format and its attributes are derived.

Titanic
This dataset is related to the famous Titanic sinking event; it predicts a person's survival based on their class, age, and gender.

VO
This dataset concerns Congressional voting. It is very interesting in that it has a number of attributes regarding the factors influencing voters to vote either for the Democrats or the Republicans.

C. Software Used
The Weka workbench provides facilities for visualizing attributes and for running algorithms for predictive analysis. It was originally built in the C language, but only Java-based versions are available now. It supports many data mining tasks such as clustering, data preprocessing, classification, feature selection, visualization, and regression. Weka reads ARFF files, which consist of two parts: the header and the data section. As the minimum number of attributes in the datasets is 6, this value has been taken as the number of folds for all the algorithms. The following measures are used to evaluate the given classifiers:

Precision
Precision is defined as the fraction of the retrieved documents that are relevant to the search.

F-Measure
The F-measure combines both precision and recall: it is the harmonic mean of the two and may be considered a weighted average of both values.

Recall
Recall is the fraction of the relevant instances that have been retrieved over the total number of relevant instances.

Recall = TP / (TP + FN) (11)
MCC
MCC is used for measuring the quality of binary classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure.

|MCC| = √(χ² / n) (12)
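As a concrete check of the measures above, a minimal sketch in Python computed directly from a binary confusion matrix; it uses the standard product-form MCC definition, whose absolute value equals the √(χ²/n) form for a 2×2 table:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure, and MCC from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f_measure, mcc

# Illustrative confusion matrix: 40 TP, 10 FP, 10 FN, 40 TN
p, r, f, m = binary_metrics(tp=40, fp=10, fn=10, tn=40)
print(p, r, round(f, 3), round(m, 3))  # 0.8 0.8 0.8 0.6
```
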
Kappa Statistic
Kappa is one of the most robust methods of measuring inter-rater agreement for qualitative items, as it takes into account the possibility of the agreement occurring by chance.

IV. RESULTS
All datasets have been taken in ARFF file format, which is used in Weka [31, 32] for data mining. The three diverse-nature algorithms have been compared, and the datasets have been analyzed extensively as shown in Tables II-VII below.
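To illustrate the ARFF structure used throughout (a header of @relation/@attribute lines followed by a @data section of comma-separated rows), here is a minimal example file and a hand-rolled sketch of a parser; both are illustrative and are not Weka's own loader:

```python
# A minimal ARFF document: header section, then data section.
arff = """@relation weather
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,30,yes
rainy,15,no
"""

def parse_arff(text):
    """Split an ARFF document into attribute names and data rows."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])
        elif line.lower() == "@data":
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

attrs, rows = parse_arff(arff)
print(attrs)    # ['outlook', 'temperature', 'play']
print(rows[0])  # ['sunny', '30', 'yes']
```
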


TABLE I. DATASETS

TABLE II. COMPARATIVE RESULT ANALYSIS FOR NAÏVE BAYES ALGORITHM

TABLE III. WEIGHTED AVERAGE OF DETAILED ACCURACY RESULTS FOR NAÏVE BAYES ALGORITHM