A Comparative Analysis of Classification Algorithms on Diverse Datasets

M. Alghobiri

Abstract


Data mining is the computational process of finding patterns in large data sets. Classification, one of the main domains of data mining, generalizes known structure so that it can be applied to new data to predict class labels. Many classification algorithms are in use; they are based on different methods such as probability, decision trees, neural networks, nearest neighbors, Boolean and fuzzy logic, and kernels. In this paper, we apply three diverse classification algorithms to ten datasets. The datasets were selected based on their size and/or the number and nature of their attributes. Results are discussed using performance evaluation measures including precision, accuracy, F-measure, Kappa statistic, mean absolute error, relative absolute error, and ROC area. The comparative analysis is carried out using accuracy, precision, and F-measure. We identify the features and limitations of the classification algorithms across datasets of diverse nature.
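As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes accuracy, precision, F-measure, and the Kappa statistic for a binary problem. It is not the paper's implementation: the function name `binary_metrics` and the toy label lists are hypothetical, and the standard textbook definitions of each measure are assumed.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, F-measure and Cohen's kappa
    for two equal-length lists of binary class labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)

    # Confusion-matrix counts with respect to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = n - tp - fp - fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )

    # Kappa: observed agreement (accuracy) corrected for the agreement
    # expected by chance from the marginal class frequencies.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for name, value in binary_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

Accuracy alone can look favorable on an imbalanced dataset, which is one reason to report precision, F-measure, and Kappa alongside it when comparing classifiers across datasets of different nature.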


Keywords


data mining; classification algorithms; diverse; dataset





eISSN: 1792-8036     pISSN: 2241-4487