Constrained K-Means Classification

Classification-via-clustering (CvC) is a widely used method that uses a clustering procedure to perform classification tasks. In this paper, a novel K-Means-based CvC algorithm is presented, analysed and evaluated. Two additional techniques are employed to reduce the effects of the limitations of K-Means: a hypercube of constraints is defined for each centroid, and weights are acquired for each attribute of each class, so that a weighted Euclidean distance can be used as the similarity criterion in the clustering procedure. Experiments are conducted with 42 well-known classification datasets. The experimental results demonstrate that the proposed algorithm outperforms CvC with simple K-Means.

Keywords-classification-via-clustering; k-means; supervised learning


I. INTRODUCTION
Data science, and especially data mining [1], is a rapidly evolving field in which extracting valuable knowledge from large volumes of accumulated information is a major challenge. Recent technological progress has led to the creation of large datasets in all scientific areas. Thus, developing methodologies that discover knowledge from raw data is now a common concern of many researchers. Supervised learning methods are an important part of machine learning and data mining; they address the task of learning from labeled training data, with classification being the most common such task. Many sophisticated classification algorithms have been proposed in the literature, each exploiting the data in a different way, such as Artificial Neural Networks [2], Bayesian Classifiers [3], Rule Based Classifiers [4,5], Decision Trees [6,7] and Ensemble Classifiers [8].
A widely used method is Classification-via-clustering (CvC) [9][10][11][12][13][14]. It employs a clustering technique to group the data into clusters, which are then mapped to the classes. Instances are assigned to the clusters, and hence to the classes, according to a criterion that depends on the clustering algorithm used. The best-known clustering algorithm is K-Means [15,16], which is computationally light. Its low complexity makes it suitable for clustering and classification in big data analysis. K-Means is a clustering procedure in which the pattern representing each cluster is a centroid. A distance measure (most commonly the squared Euclidean distance) is used as the similarity criterion. According to that criterion, each data point is assigned to the nearest cluster, and the centroids are then recalculated from the updated clusters. This algorithm is widely used in CvC, where the clusters are matched to classes.
COP-KMeans [17] is a variation of K-Means that uses background knowledge in the form of data-level constraints. There are two kinds of constraints, must-link and cannot-link, which define whether two data points must or must not be placed in the same cluster; neither may be violated when a data point is assigned to a cluster. So, if a cluster includes a data point that must be in the same cluster as another, the latter must also be assigned to that cluster. Likewise, if two data points are part of a cannot-link constraint, they cannot be in the same cluster. In the same manner, another technique proposes an extended version of this set of constraints [18]. Must-link and cannot-link constraints remain, and two more are examined. These are based on the maximum distance that two data points in the same cluster may have and the minimum distance that two data points in different clusters must have.
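The must-link and cannot-link check described above can be sketched as follows. This is a minimal illustration, not the reference implementation of [17]; the function name `violates_constraints`, the use of integer point ids, and the dictionary of already-assigned points are assumptions made for the example.

```python
def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if placing `point` in `cluster` would break a constraint.

    assignment  : dict mapping already-assigned point ids to cluster ids
                  (hypothetical bookkeeping structure for this sketch)
    must_link   : iterable of (id, id) pairs that must share a cluster
    cannot_link : iterable of (id, id) pairs that must not share a cluster
    """
    for a, b in must_link:
        if point in (a, b):
            other = b if point == a else a
            # A must-link partner already placed in another cluster
            # rules out this cluster.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if point in (a, b):
            other = b if point == a else a
            # A cannot-link partner already placed in this cluster
            # rules out this cluster.
            if other in assignment and assignment[other] == cluster:
                return True
    return False
```

In a constrained assignment step, a point would be given to the nearest centroid whose cluster passes this check.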
Another approach is Constrained K-Means [19], which uses cluster-level constraints: each cluster has a constraint in the form of a threshold on the minimum number of data points it must contain. So, the final centroids are determined once all clusters contain the required number of data points. A different method of finding the k centroids is Global K-Means [20], where the problem is solved incrementally. An iterative procedure first solves the problem for one centroid, and each iteration adds one centroid to the solution until all k centroids are found. "Fast Global K-Means" [20] tries to accelerate this procedure. Instead of solving the k-centroid problem given the (k-1)-centroid solution, it calculates the error reduction obtained by inserting the new centroid. The calculated error is used as a threshold in the next execution of K-Means. A different approach is the "Fuzzy C-Means" algorithm [21], in which each datum belongs fractionally to all clusters and contributes to the update of all centroids.
In this work, a K-Means-based CvC algorithm is presented, which introduces constraints on the movement of the centroids and a weighted Euclidean distance as the similarity criterion. The proposed algorithm is evaluated on 42 well-known datasets from the UCI Machine Learning Repository [22]. 10-fold cross-validation [23] is employed, and classification accuracy, average sensitivity and average precision [24] are used as the metrics of interest. Experimental results demonstrate that the proposed algorithm considerably outperforms CvC with simple K-Means.

II. CVC WITH SIMPLE K-MEANS
Clustering is a method of finding subgroups within observations. The best-known clustering algorithm used for CvC is K-Means, which divides the instances of a dataset into $K$ clusters. Since the number of clusters to be extracted is predefined, K-Means can easily be used for classification by setting $K$ equal to or greater than the number of classes and mapping each cluster to a specific class. In K-Means-based CvC, $K$ is usually set equal to the number of classes. K-Means forms the clusters using the training set. Each cluster is mapped to a class using the labels of the training data: the majority class of the instances of each cluster is assigned to it. In the classification process, the unlabeled instances are assigned to the closest centroid of the clusters formed in the previous step. The formation of the clusters is summarized in the following simple K-Means algorithm:
1: t = 0;
2: initialize the means m_0 randomly;
3: do
4:    assign each training instance to the cluster of its nearest mean in m_t;
5:    compute m_{t+1} as the centroids of the resulting clusters;
6:    t++;
7: while m_t != m_{t-1}
8: return m_t;
K-Means consists of three main steps. First, the means are initialized randomly (line 2). Next, the algorithm assigns the training instances to the cluster of the nearest mean (line 4) and updates the means to the centroids of the resulting clusters (line 5). To assign each instance $x_i$, $i = 1, \ldots, N$, to a cluster, the squared Euclidean distance $d_{i,k}$, $k = 1, \ldots, K$, from each mean $m_k$ is calculated, as shown in (1):
$$d_{i,k} = \sum_{a=1}^{A} (x_{i,a} - m_{k,a})^2 \qquad (1)$$
where $A$ is the number of attributes. After the calculation of the squared Euclidean distance, the instance $x_i$ is assigned to the cluster $S_k$ for which $d_{i,k} \le d_{i,l}$ for every $l \in \{1, \ldots, K\}$. The third step of K-Means is the recalculation of each mean as the centroid of all the instances currently in the cluster:
$$m_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i \qquad (2)$$
where $|S_k|$ is the size of the cluster, in other words the number of instances assigned to it. In K-Means, if $K$ (the number of clusters) and $A$ (the number of attributes) are fixed, the problem can be solved exactly in time $O(N^{AK+1})$, where $N$ is the number of instances to be clustered [25].
The K-Means algorithm has some limitations and weaknesses [26-28] that negatively influence its classification results. First of all, the random initialization of the centroids significantly affects the formation of the cluster-classes. Furthermore, because the task is clustering, small classes may be dominated by bigger classes, so the grouping may split the instances of a bigger class into two separate clusters and absorb the instances of the small class into one of the bigger clusters. Moreover, the presence of noise or outliers in the dataset can affect a cluster by pulling its centroid to a 'bad' position. Another factor that should be taken into account is the distance criterion used to assign each instance to a centroid. In CvC, each class may have a different standard deviation in each attribute than the other classes, so each attribute may contribute differently to the classification.
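The CvC procedure of this section (cluster the training set, map each cluster to its majority class, classify by nearest centroid) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, random initialization is done by sampling K training instances, an empty cluster keeps its previous mean, and the optional `init` argument exists only to make the example reproducible.

```python
import random
from collections import Counter

def sq_dist(x, m):
    # Squared Euclidean distance between an instance and a mean.
    return sum((xa - ma) ** 2 for xa, ma in zip(x, m))

def nearest(x, means):
    # Index of the mean closest to x.
    return min(range(len(means)), key=lambda k: sq_dist(x, means[k]))

def kmeans(X, K, init=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Assumption: random initialization samples K training instances.
    start = init if init is not None else rng.sample(X, K)
    means = [list(m) for m in start]
    for _ in range(max_iter):
        clusters = [[] for _ in range(K)]
        for x in X:                      # assignment step
            clusters[nearest(x, means)].append(x)
        new_means = [                    # update step
            [sum(col) / len(c) for col in zip(*c)] if c else means[k]
            for k, c in enumerate(clusters)
        ]
        if new_means == means:           # means unchanged: converged
            break
        means = new_means
    return means

def map_clusters_to_classes(X, y, means):
    # Each cluster receives the majority class of its training instances.
    votes = [Counter() for _ in means]
    for x, label in zip(X, y):
        votes[nearest(x, means)][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def classify(x, means, cluster_class):
    # An unlabeled instance takes the class of its closest centroid.
    return cluster_class[nearest(x, means)]
```

For example, with `means = kmeans(X, 2)` and `cluster_class = map_clusters_to_classes(X, y, means)`, a new instance is labeled by `classify(x, means, cluster_class)`.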

III. CONSTRAINED K-MEANS CLASSIFICATION ALGORITHM
Constrained K-Means Classification (C-K-Means), a novel CvC algorithm, is proposed in this work. The algorithm is designed to address the limitations and weaknesses mentioned above. As classification is a supervised learning technique, background knowledge is exploited to build the model. Two main alterations to K-Means are proposed to make use of background knowledge: (i) the application of constraints to the initialization and update of the centroids, and (ii) the use of a weighted Euclidean distance function. The training data are used to acquire the constraints for each centroid and the weights for each attribute and class. Since each hypercube of constraints is generated from the data of a single class, each centroid inherits the class of the respective hypercube. The clustering procedure and the formation of the clusters take place on the test data rather than on the training set. This helps to better classify the test instances, not only by using the distance of each observation from the centroids, but also by updating the centroids according to the clusters formed in the test set.

A. Description of C-K-Means
The model produced during the training procedure includes a hypercube for each class and the weights for each attribute and each class. The algorithm takes as input a training dataset with $A$ attributes and $C$ classes. A hypercube is defined for each class by calculating, for each attribute, a minimum bound $b^{\min}_{c,a}$ (3) and a maximum bound $b^{\max}_{c,a}$ (4) from $\mu_{c,a}$, the average value of attribute $a$ ($a \in [1, A]$) over the instances belonging to class $c$ ($c \in [1, C]$), $\sigma_{c,a}$, the respective standard deviation, and $r$, a parameter defining the relaxation of these bounds. The bounds are used to limit the movement of the centroids and thus the cluster formation.
The weights enable a weighted Euclidean distance to be used as the similarity criterion during the clustering procedure. The concept is to estimate the standard deviation of each feature within each class in order to define an importance factor for that feature in the similarity criterion. The weight $w_{c,a}$ of attribute $a$ for class $c$ is calculated from $\sigma_{c,a}$, the standard deviation of attribute $a$ ($a \in [1, A]$) over the observations belonging to class $c$ ($c \in [1, C]$), and the respective maximum and minimum values. The weights are then normalized.
After the acquisition of the hypercubes of constraints and the weights, the algorithm is ready for the CvC of the unlabeled instances. The K-Means algorithm is applied as follows:
Initialization step: The initial $K$ centroids are generated randomly within the hypercubes, with one centroid in each hypercube ($K = C$). Thus:
$$b^{\min}_{c,a} \le m_{c,a} \le b^{\max}_{c,a}$$
where $m_{c,a}$ is the centroid value of class $c$ in attribute $a$.
Assignment step: In this step, the weighted Euclidean distance is employed as the similarity criterion, with each observation assigned to the closest centroid. In this distance, the contribution of attribute $a$ to the distance from the centroid of class $c$ is scaled by the weight $w_{c,a}$.
Update step: If, during the update of the centroids, a centroid is to be positioned outside the hypercube of constraints of its class (in which it was initially created), then it is forced to the bound values, i.e. if the value is less than $b^{\min}_{c,a}$ of its class, it is set to $b^{\min}_{c,a}$, and the same applies if it is greater than the respective $b^{\max}_{c,a}$.
After the convergence of the clustering procedure, each instance is classified to the class of the centroid to whose cluster it was assigned in the last iteration.
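The three steps above (initialization inside the hypercubes, weighted assignment, clamped update) can be sketched as follows. This is a hedged illustration, not the authors' C++ implementation. The per-class bounds `lo`, `hi` and weights `w` are taken as precomputed inputs. The function names, the squared form of the weighted distance, and the stopping rule (labels unchanged between iterations) are assumptions made for the example.

```python
import random

def weighted_sq_dist(x, m, w):
    # Assumed form: per-attribute weights scale the squared differences.
    return sum(wa * (xa - ma) ** 2 for xa, ma, wa in zip(x, m, w))

def clamp(v, low, high):
    # Force a value back inside [low, high].
    return max(low, min(v, high))

def c_kmeans(X_test, lo, hi, w, seed=0, max_iter=100):
    """Cluster unlabeled instances with one constrained centroid per class.

    lo[c][a], hi[c][a] : hypercube bounds of class c in attribute a
    w[c][a]            : weight of attribute a for class c
    Returns (labels, centroids); labels[i] is the class index of X_test[i].
    """
    rng = random.Random(seed)
    C = len(lo)
    # Initialization: one random centroid inside each class hypercube.
    cents = [[rng.uniform(l, h) for l, h in zip(lo[c], hi[c])]
             for c in range(C)]
    labels = None
    for _ in range(max_iter):
        # Assignment: closest centroid under the class-specific weights.
        new_labels = [
            min(range(C), key=lambda c: weighted_sq_dist(x, cents[c], w[c]))
            for x in X_test
        ]
        if new_labels == labels:   # assumed convergence test
            break
        labels = new_labels
        # Update: move each centroid to its cluster mean, then clamp it
        # to the bounds of its own hypercube.
        for c in range(C):
            members = [x for x, lab in zip(X_test, labels) if lab == c]
            if not members:
                continue           # empty cluster keeps its centroid
            mean = [sum(col) / len(members) for col in zip(*members)]
            cents[c] = [clamp(v, l, h)
                        for v, l, h in zip(mean, lo[c], hi[c])]
    return labels, cents
```

Because each centroid is tied to one class hypercube, the returned labels are already class indices and no cluster-to-class mapping step is needed.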

IV. EXPERIMENTAL RESULTS AND DISCUSSION
To set up the experimental procedure, the proposed algorithm was implemented in C++. The relaxation parameter in (3) and (4) was experimentally set to 0.1. Also, all attributes of the datasets were normalized. In order to evaluate the proposed algorithm, a comparative study against CvC with simple K-Means was carried out, and the classification accuracy, average sensitivity and average precision performance metrics were extracted. For that purpose, the WEKA software [29] was employed to obtain the respective results for CvC with simple K-Means. The number of clusters was set equal to the number of classes of each dataset. Both algorithms were tested on 42 datasets with various characteristics from the UCI Machine Learning Repository. The datasets, with their numbers of attributes and instances, are presented in Table I (the number of attributes does not include the class). Classification accuracy, average sensitivity and average precision were obtained for each dataset; all results are shown in Table II. The differences per dataset are illustrated in Figures 1, 2 and 3 for the classification accuracy, average sensitivity and average precision, respectively. The diagrams were created as follows: for each dataset, the result obtained with simple K-Means was subtracted from the respective C-K-Means result, and the differences were then sorted in descending order.
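The three metrics used in this comparison can be computed as in the following sketch. It assumes that "average" sensitivity and precision mean the unweighted (macro) average over classes, which the text does not state explicitly. The function name `evaluate` is hypothetical, and a class that is never predicted is given precision 0 by convention.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, average sensitivity, average precision)."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    sens, prec = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        # tp + fn > 0 because c occurs in y_true.
        sens.append(tp / (tp + fn))
        # Assumed convention: precision is 0 if the class is never predicted.
        prec.append(tp / (tp + fp) if tp + fp else 0.0)
    # Assumption: unweighted (macro) average over the classes.
    return accuracy, sum(sens) / len(classes), sum(prec) / len(classes)
```

Applying `evaluate` to the predictions of both algorithms on each dataset and subtracting the simple K-Means value from the C-K-Means value gives per-dataset differences of the kind plotted in the figures.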
The experimental study was based on a large selection of classification datasets (42), including all classification datasets reported as "Most Popular" on the UCI repository website, and several other well-known datasets. The main exclusion criterion was that no missing values should exist, since dealing with missing values is outside the scope of this study. The number of attributes of the datasets ranges from 4 to 255 and the number of instances from 83 to 45211, with 8 datasets having fewer than 200 instances, 20 having 200 to 1000 instances, 10 having 1000 to 5000 instances and 4 having more than 5000 instances. For all experiments, the classification accuracy results per fold were obtained and a paired 10-fold cross-validated t-test [30] was performed at the 95% confidence level, in order to examine the statistical significance of the differences. In Table II, all datasets with a statistically significant difference are marked with a bullet on the right. The results presented in Table II indicate that the proposed approach is superior in terms of classification accuracy: the proposed algorithm outperforms the traditional K-Means-based CvC on all 42 datasets, with the difference ranging from 1.14% to 43.22%. In more detail, the difference is up to 5% in 11 datasets, ranges from 5% to 15% in 13, lies between 15% and 25% in another 10, and exceeds 25% in the remaining 8 datasets. Accuracy differences for all datasets are illustrated in Figure 1.

V. CONCLUSIONS
In this study, a novel CvC algorithm based on K-Means clustering is proposed. The algorithm includes two major modifications compared to K-Means: (i) the use of a hypercube of constraints for each centroid, extracted from the information in the training data, and (ii) the use of weights for each attribute and each class, along with the weighted Euclidean distance as the similarity criterion for the clustering procedure. An initial effort in this direction can be found in [31]. Since classification is a supervised learning method, the performance can be assessed by the classification rate, i.e. the number of successes and failures of the model. Reducing the number of misclassifications is very important; regardless of which clustering algorithm or which variation of K-Means is used in CvC, efficient use of the background knowledge is required to succeed. The introduction of constraints on the initialization and update of the centroids based on background knowledge, which is the main idea of the proposed algorithm, contributes in this direction. In C-K-Means, both training (mainly associated with classification) and clustering are used: training yields the class hypercubes and attribute weights, and clustering moves the centroids inside the hypercubes. Thus, the principal idea is to use the background knowledge to limit the solutions to a predefined space (i.e. the hypercubes) and then follow an unsupervised technique to fine-tune these solutions (by allowing the centroids to be updated but forcing them to remain inside the hypercubes).
Experimental results were obtained using 42 datasets from the UCI Machine Learning Repository, with various numbers of attributes and instances. The results clearly demonstrate that the proposed approach outperforms CvC with simple K-Means. Differences of up to 43.22% in accuracy, 32.16% in average sensitivity and 38.82% in average precision are reported, while statistical analysis with the paired 10-fold cross-validated t-test revealed statistically significant differences in 30 out of 42 datasets.
The proposed algorithm is generic; in this realization, K-Means clustering and hard hypercube bounds have been used. Alternative approaches, such as fuzzy c-means or soft/ellipsoid bounds, can be integrated. Moreover, other types of weights or other distance metrics can be used as the similarity criterion. All of the above will result in alternative realizations of the proposed algorithm; potential variations of the C-K-Means classification algorithm will be addressed in future communications.