Detection of Spam Email by Combining Harmony Search Algorithm and Decision Tree

Spam email is probably the main problem faced by most email users. Spam email detection involves many features, and some of them contribute little to detection and skew the detection and classification of spam email. Feature Selection (FS) is therefore one of the key topics in spam email detection systems. Choosing the important and effective features can optimize classification performance. A feature selector has the task of finding a subset of features that improves prediction accuracy. In this paper, a hybrid of the Harmony Search Algorithm (HSA) and a decision tree is used for selecting the best features and for classification. The results obtained on the Spam-base dataset show that the recognition accuracy of the proposed model is 95.25%, which is high in comparison with models such as SVM, NB, J48 and MLP. The accuracy of the proposed model on the Ling-spam and PU1 datasets is also high in comparison with models such as NB, SVM and LR.


I. INTRODUCTION
Spam detection is an important task for all web activities, and especially for email clients [1]. Spam emails consume a considerable share of network traffic and may also carry viruses [2]. The security of computer systems rests on the three principles of prevention, diagnosis, and response; if all security risks were identified and prevented, there would be no need for a response. Hence, identification is vital for protecting email users, and it can serve as the front line of defense for any computer system. Spam senders continually use more sophisticated tools and methods to get through spam filters. Protecting email against attacks on email servers, and against attempts to obtain valid usernames or email addresses through access to mail servers, is therefore a critical requirement [3]. One way to prevent unauthorized access to users' email accounts is two-step identity verification, a security mechanism that uses a second keyword or phrase in addition to the password [4].
Spam detection is mainly based on characteristics and features found in the subject field of emails. Most spam emails use similar subjects, so the subject is a distinctive feature for identification. The present work uses the Spam-base dataset [5], which contains the two classes spam and non-spam, for spam identification based on a combination of HSA [6] and ID3 [7]. In addition, the Ling-spam [8] and PU1 [8] datasets were used for assessment and comparison. HSA was used to select features, to increase the accuracy of the final solution, and to determine the features that bring the fitness function to its optimum. ID3 was used for class recognition and final classification.

II. RELATED WORKS
Various techniques have been introduced for identifying spam, including statistical techniques, expert systems, Bayesian networks, neural networks, fuzzy logic, and collective intelligence algorithms. In [9], a Radial Basis Function (RBF) model is put forward together with the Support Vector Machine (SVM) technique for identifying spam emails. RBF, an artificial neural network model, is used for training and testing on the data, and SVM, a classification technique, is used for mapping the features. Assessment is done on Double Bounce Email, a linear dataset. In addition, preprocessing and the computation of word frequencies according to (1) are carried out using TF-IDF. Results show an identification accuracy of approximately 84%.
In [10], a Bayesian Additive Regression Trees (BART) model was applied to Ling-spam, PU1, and Spam-base. The BART model is a binary decision tree (DT) with linear regression relations at every leaf node, which allows it to predict numerical values. The main criterion for assessing the DT is the error rate of the tree. The total error rate of the tree is calculated as the weighted sum of the error rates of its leaves. To prevent low-quality rules from being created, some branches are pruned. Although pruning raises the error rate, it stops inefficient rules from being created. Results show that the CBART, Random Forests (RF), BART, and Classification and Regression Trees (CART) models reach an accuracy of 100% on Ling-spam and PU1. On Spam-base, the highest accuracy, obtained by the RF model, is 98.61%.

www.etasr.com Gashti et al.: Detection of Spam Email by Combining Harmony Search Algorithm and Decision Tree
The CART model reduces the error rate by 2.2% in comparison with BART.
In [11], an Enhanced Genetic Algorithm (EGA) model, which enhances the Genetic Algorithm (GA) by combining it with Simulated Annealing (SA), is proposed for detecting spam emails. Assessment is carried out on Spam Corpus with 54 features. In EGA, 15 chromosomes with 54 features and 1000 generations are used. Results suggest an accuracy of 99.73% and 99.86% for GA and EGA respectively. SA has therefore been very effective in enhancing GA: it has improved its operators and increased its accuracy. A Particle Swarm Optimization (PSO) model, a population-based algorithm, has been proposed for identifying spam emails in [12]. GPU technology is used for running the PSO model, because a GPU processes tasks in parallel and handles many tasks better and faster than a CPU. Assessment is made on TREC 2015 with 48360 spam emails and 36450 non-spam emails. The probability of identifying spam emails is calculated with (2). Results suggest an accuracy of 99% in detecting non-spam emails and an accuracy of 66% to 99% in detecting spam emails.
In [13], Bayesian classification was applied to three datasets with 1000, 1500, and 2100 emails. Bayesian classification identifies and predicts data according to probabilities. The Bayesian model includes the steps of preprocessing, training, testing, and classification. Results show that the accuracy on Dataset1, Dataset2, and Dataset3 is 93.98%, 94.85%, and 96.46% respectively. In addition, the processing time for Dataset1 is lower than for the other two datasets. In [14], spam emails were identified on RFC2822 with 9189 emails according to Vertex Dependency. In this model, the relationships between neighboring vertices are used for prediction. The data similarity distance between the vertices is calculated with (3). Results point to a maximum accuracy of 93.78%. In addition, the identification accuracy is 80.72% for spam emails and 98.01% for non-spam emails.
In [15], a combined PSO+K-Means model was applied to Spam-base with 57 features for spam email detection. In the PSO+K-Means model, PSO is used for feature selection and K-Means is used for data clustering. In the K-Means algorithm, k members are selected randomly out of n members as cluster centers. Afterwards, the remaining n-k members are allocated to the closest clusters. Once all members have been added to clusters, the cluster centers are recalculated and members are reallocated to clusters according to the new cluster centers. This process continues until the cluster centers no longer change. The distance measure in Equation (4) is used for clustering. Results suggest that the maximum accuracy of the model is 94.62%.
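The K-Means loop described above can be sketched as follows. This is a minimal illustration and not the implementation used in [15]: squared Euclidean distance stands in for the distance of Equation (4), which is not reproduced here, and the function name `kmeans`, its parameters, and the toy points are assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster tuples of floats into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    # Pick k of the n members at random as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Allocate every member to its closest center
        # (squared Euclidean distance is an assumption of this sketch).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Recalculate each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Stop once the centers are fixed.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_centers, toy_clusters = kmeans(toy_points, k=2)
```

On real data the result depends on the random initial centers, so several seeds are usually tried.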
In [16], a combined PSO and Multi-Layer Perceptron Neural Network (PSO-MLPNN) model was proposed for spam email identification. The PSO algorithm is used for feature selection and the MLPNN model is used for training and testing on the data. In the PSO-MLPNN model, the perceptron network uses a sigmoid activation function for the hidden layer; 80% of the data is used for training and 20% for testing. The number of hidden layers in the MLP model is taken to be between 3 and 15, and the PSO algorithm runs for 200 iterations for feature selection. Assessment was done with 481 spam emails and 2171 non-spam emails on Ling-Spam and with 6000 email samples on Spam-Assassin. The assessment suggests that the accuracy of the PSO-MLPNN model on Spam-Assassin and Ling-Spam is 99.98% and 99.79% respectively. Comparison shows that the PSO-MLPNN model is more accurate in identification than SVM and BPNN. Table I presents a comparison of the models proposed for spam email identification.

III. PROPOSED MODEL
In the proposed model, the data in the Spam-base dataset is first preprocessed. At the preprocessing level the data are checked, because the dataset might not be clean enough, and inapplicable, repeated, or erroneous values may result in invalid output. The presence of inapplicable data usually distorts the conclusions drawn from the data. In the next step, the initial vectors are formed in HSA. Each vector comprises 57 features. In each vector, a number of features are selected randomly based on the HSA memory and passed to ID3. In the ID3 tree, classification rules are built from these features. Features that are influential for identification accuracy are saved in the HM memory and used in later FS steps. The assessment function in HSA searches the space of feature subsets until it finds the best combination of features. The most important step in the ID3 tree is setting the rules. Rules are set and features are compared starting from the root of the tree. The most important criterion for the root node is the information gain (called the data rate in this paper). The feature with the highest information gain is chosen as the root node and the ID3 tree is expanded from it. Afterwards, training and testing are carried out. Testing is done in order to assess the data and its validity. Figure 1 shows the flowchart of the proposed model.
Fig. 1. Flowchart of the Proposed Model
In HSA [6], Harmony Memory (HM) is used to maintain the best of the previous solutions. HM is a matrix, as given in Equation (5), with one solution in every row. The number of columns in this matrix therefore reflects the dimension of the solution. The last column of the matrix stores the value of the fitness function for every row. The number of solution vectors in HM is given by the Harmony Memory Size (HMS). The output value of the fitness function for each vector is denoted f(x).

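$$
\mathrm{HM} =
\begin{bmatrix}
x_1^{1} & x_2^{1} & \cdots & x_n^{1} & f(x^{1}) \\
x_1^{2} & x_2^{2} & \cdots & x_n^{2} & f(x^{2}) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_1^{\mathrm{HMS}} & x_2^{\mathrm{HMS}} & \cdots & x_n^{\mathrm{HMS}} & f(x^{\mathrm{HMS}})
\end{bmatrix}
\tag{5}
$$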
In HSA, all HM rows are first created, and the value of the fitness function for each row is calculated and saved in the last column of the HM matrix. Then, until the required number of iterations or the time limit is reached, the entire HM is scanned and, for each row, a suitable value for each entry is set according to the HSA parameters. Afterwards, if the fitness value of the new solution is better than that of the worst solution present in HM, the new solution replaces it. Eventually, the solution that produces the optimal value of the fitness function is chosen as the best available solution. In HSA, any variable can also take a randomly selected value. Randomization is used to increase the variety of solutions.
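The HSA loop above can be sketched for feature selection as follows. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation. Solutions are encoded as 0/1 feature masks. `hmcr` (harmony memory considering rate) and `par` (pitch adjusting rate) are the usual HSA parameter names, but their values here are illustrative and do not come from the paper. `toy_fitness` is a hypothetical stand-in for the accuracy of an ID3 tree trained on the selected features.

```python
import random

def harmony_search_fs(fitness, n_features, hms=10, hmcr=0.9, par=0.3,
                      iterations=150, seed=0):
    """Binary harmony search; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    # Harmony memory: each row holds a feature mask and its fitness value.
    memory = []
    for _ in range(hms):
        mask = [rng.randint(0, 1) for _ in range(n_features)]
        memory.append((mask, fitness(mask)))
    for _ in range(iterations):
        new_mask = []
        for j in range(n_features):
            if rng.random() < hmcr:
                # Memory consideration: copy entry j from a random stored row.
                bit = rng.choice(memory)[0][j]
                if rng.random() < par:
                    bit = 1 - bit  # pitch adjustment, modeled as a bit flip
            else:
                bit = rng.randint(0, 1)  # random selection keeps variety
            new_mask.append(bit)
        score = fitness(new_mask)
        # Replace the worst stored solution if the new one is better.
        worst = min(range(hms), key=lambda i: memory[i][1])
        if score > memory[worst][1]:
            memory[worst] = (new_mask, score)
    return max(memory, key=lambda row: row[1])

# Hypothetical fitness: reward three "useful" features, penalize mask size.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)

best_mask, best_score = harmony_search_fs(toy_fitness, n_features=8)
```

In the proposed model the fitness call would instead train and evaluate an ID3 tree on the features selected by the mask, which is far more expensive than this toy function. The default of 150 iterations follows the maximum number of HSA iterations reported in the evaluation section.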
The ID3 tree [7] is a method for presenting a set of rules that lead to a class or a value. ID3 uses a statistical quantity, the information gain (called the data rate in this paper), to express how effective a feature is in the final identification. First, the amount of disorder for every feature is calculated using entropy, and the information gain is then calculated from the entropy values. Entropy expresses randomness as a number. If the set S contains positive and negative samples, the entropy of S with respect to this Boolean classification is defined as
$$\mathrm{Entropy}(S) = -P_{\oplus}\log_2 P_{\oplus} - P_{\ominus}\log_2 P_{\ominus} \tag{6}$$
In (6), P⊕ is the ratio of positive samples to all the samples, and P⊖ is the ratio of negative samples to all the samples. The decision of which feature is placed at the root of the tree depends on the information gain of each feature. Equation (7) is used to calculate the information gain [7].
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{7}$$
In (7), Values(A) denotes the set of all possible values of feature A, and S_v is the subset of S for which A has the value v. The criteria used for the assessment of the proposed model are Precision, Recall, F-Measure, and Accuracy; accuracy is the most important of these criteria [17,18].
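The entropy in (6) and the gain in (7) can be computed with a few lines of Python. The sketch below is only an illustration: the function names and the toy data (a hypothetical binary feature `has_word` that separates the classes perfectly, and a feature `noise` that carries no information) are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels, as in (6)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) as in (7): entropy of S minus the weighted entropy of each S_v."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)  # S_v: samples where A equals v
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy data: four emails, two features.
labels = ["spam", "spam", "non-spam", "non-spam"]
has_word = [1, 1, 0, 0]   # separates the classes perfectly
noise = [1, 0, 1, 0]      # unrelated to the class

gain_has_word = information_gain(has_word, labels)
gain_noise = information_gain(noise, labels)
```

ID3 would place `has_word` at the root here, because it has the larger gain of the two features.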

TN represents the number of records that actually belong to the negative class and were correctly identified as negative by the algorithm. TP represents the number of records that actually belong to the positive class and were correctly identified as positive by the algorithm. FP represents the number of records that actually belong to the negative class and were wrongly identified as positive by the algorithm. FN represents the number of records that actually belong to the positive class and were wrongly identified as negative by the algorithm.
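The four assessment criteria follow from these counts using their standard definitions. The sketch below assumes those definitions, with F-Measure taken as the harmonic mean of precision and recall; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F-measure and accuracy from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Hypothetical counts for 100 emails (not results from the paper).
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

The guards return 0.0 when a denominator would be zero, for example when the classifier never predicts the positive class.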

IV. EVALUATION AND RESULTS
The assessment and results of the proposed model were obtained in MATLAB 2015 on Spam-base. The Spam-base dataset includes 4601 samples with 57 features. In addition, to illustrate its efficiency and accuracy, the proposed model was run on the Ling-Spam and PU1 datasets and compared with other models. The Ling-Spam dataset includes 481 spam and 2412 non-spam emails. The PU1 dataset includes 481 spam and 618 non-spam emails. The maximum number of iterations in HSA is 150, and unfit vectors were discarded in order to maintain variety and retain optimal solutions.
Table II shows the assessment of the proposed model on the Spam-base dataset with different numbers of features. As can be seen in Table II, the number of features strongly influences identification accuracy; the type of the chosen features also influences the accuracy of the results. Figure 2 plots the identification accuracy on the Spam-base dataset against the number of features. Table III compares the proposed model with other models. As can be seen in Table III, the accuracy of the proposed model is 95.25%. The proposed model has a higher accuracy than models such as Naive Bayes (NB), SVM, J48, and most other models, but a lower accuracy than the Random Forest and Random Tree models.
Table IV compares the proposed model with other models on the Ling-Spam dataset. The accuracy of the proposed model on Ling-Spam is 99.80%, which is higher than that of the NB, SVM, NNET, and LR models. Table V compares the proposed model with other models on the PU1 dataset. The accuracy of the proposed model on PU1 is 97.12%, which is higher than that of the NB and NNET models. Table VI compares the proposed model with other models on the Spam-base dataset. In the proposed model the accuracy on Spam-base is 93.25%, which is higher than that of the NB models.

V. CONCLUSION AND FUTURE WORKS
Spam emails consume a large share of email storage and compromise users' security. Therefore, measures such as identification filters should be adopted. Content-based filters, which use the content of emails, are the main and most common type of spam filter. Most content-based methods use machine learning and data mining. In addition, many identification filters scrutinize the content and subject of emails for key words or phrases that are frequently used in spam emails. Identification is therefore the primary method for avoiding spam. In the present paper, a model for spam identification was proposed based on the combination of HSA and ID3. Assessment was carried out on the Spam-base, Ling-spam, and PU1 datasets. Results suggest that the proposed model has higher identification accuracy than the SVM and NB models and, compared with most models, increases identification accuracy by up to 15%. One of the main problems in spam email identification is selecting the type of features. To eliminate this problem, algorithms should be used that are capable of FS and can enhance identification accuracy.

Fig. 2. Comparison Graph of the Influence of the Number of Characteristics on Detection Accuracy on Spam-base

TABLE I. COMPARISON OF PROPOSED MODELS FOR SPAM

TABLE III. COMPARISON OF THE PROPOSED MODEL WITH OTHER MODELS ON SPAM-BASE

TABLE VI. COMPARISON OF THE PROPOSED MODEL WITH OTHER MODELS ON SPAM-BASE