Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms

Clustering of customers is a vital task in marketing and customer relationship management. In traditional marketing, the market is segmented based on general characteristics such as customers' statistical information and lifestyle features. However, this method seems unable to cope with today's challenges. In this paper, we present a method for clustering customers based on variables such as purchased items and financial information related to the customers' interactions. A similarity measure is defined for clustering, and a clustering quality function is further defined. Genetic algorithms have been used to ensure the accuracy of clustering.
Keywords-classification; customers; shopping; genetic algorithm


I. INTRODUCTION
Traditional methods of mass marketing are no longer responsive to the diversity of customers' needs. Such diversity must be managed by clustering methods that put customers with the same needs and similar shopping behavior in the same clusters [1]. With proper clustering, companies can deliver goods, services and specific resources to customers through a close relationship. Customer clustering is one of the components of modern, successful marketing and leads to improved customer relationship management (CRM) [2]. An important issue in clustering is selecting the appropriate variables. Clustering variables fall into two groups: general variables and product-based variables [3]. General variables include statistical information (age, gender, etc.) and lifestyle, while product-based variables are built on the buying patterns of customers. Different works have addressed customer clustering based on these variables [4,5]. In a study described in [6], the SOM (Self-Organizing Map) algorithm and K-Means were used to cluster customers based on general information. In this method, SOM first helped to determine the number of clusters, and clustering was then performed using K-Means.
Clustering based on general variables is more understandable, but the assumption that customers with the same statistical information (age, etc.) and a similar lifestyle also have the same shopping habits is to some extent doubtful. Today, customers can purchase from different parts of the planet, which makes clustering more difficult. On the other hand, some information about the general variables may not be provided by the customer. Even when the information is available, it may vary over time. For example, changes in income, marital status, occupation and similar attributes make clustering through general variables more doubtful [7]. In this paper we propose a method for clustering customers based on information about the goods they purchase. Genetic algorithms, which are used to determine the cluster centers, are also employed to enhance the quality of the clusters.

II. CLUSTERING METHOD BASED ON CUSTOMER PURCHASES
In this section, a method for clustering customers based on their purchase behavior is presented. The method comprises data pre-processing, a purchase-based similarity criterion, a clustering algorithm and a cluster quality function.

A. Data pre-processing
The purpose of this step is to transform and integrate data from one or more datasets into a single dataset. Let $I$ be the entire collection of items and $T_0$ a transaction. A transaction $T_0$ includes fields such as customer ID, time of transaction, items purchased and financial information. If $id_i$ is the ID number of customer $c_i$, $itemset_i = \{i_{ia} \mid i_{ia} \in I\}$ is the set of goods purchased by $c_i$, and $moneyset_i = \{m_{ia} \mid a = 1, \dots, \|itemset_i\|\}$ is the associated financial information, then the record $t_{c_i} = (id_i, itemset_i, moneyset_i)$ describes the product-based behavior of customer $c_i$. The set of all such records is denoted $T_c$.
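A minimal sketch of this pre-processing step is given below. It assumes that raw transactions arrive as dictionaries with the hypothetical field names `customer_id`, `items` and `amounts`; the paper does not specify a storage format. It also assumes that the money spent on an item is summed over all of a customer's transactions, so that each item in the record has exactly one amount.

```python
from collections import defaultdict

def build_customer_records(transactions):
    """Merge raw transactions into one (id, itemset, moneyset) record per customer.

    Field names are illustrative assumptions, not taken from the paper.
    """
    # spent[customer][item] accumulates the money spent on that item (assumed rule).
    spent = defaultdict(lambda: defaultdict(float))
    for t in transactions:
        for item, amount in zip(t["items"], t["amounts"]):
            spent[t["customer_id"]][item] += amount

    records = {}
    for cid, per_item in spent.items():
        itemset = sorted(per_item)                  # distinct items bought by the customer
        moneyset = [per_item[i] for i in itemset]   # one amount per item, same order
        records[cid] = (cid, itemset, moneyset)
    return records

example_transactions = [
    {"customer_id": 1, "items": ["milk", "bread"], "amounts": [2.0, 1.5]},
    {"customer_id": 1, "items": ["milk"], "amounts": [2.0]},
    {"customer_id": 2, "items": ["tea"], "amounts": [3.0]},
]
records = build_customer_records(example_transactions)
```

Each value in `records` plays the role of one record $t_{c_i}$, with `moneyset` aligned position by position with `itemset`.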

B. Criteria of similarity based on purchase
After preparing the data, it is necessary to specify a criterion for the similarity of customers to each other based on their goods. Two well-known criteria are the matching coefficient and Jaccard's coefficient [8,9]. However, these two criteria are not the most appropriate here because of:
• The small number of goods in a customer's item set compared to the total number of goods
• Inequality of goods (goods are considered equal in the above two methods)
• Imbalance in the importance of customers for the organization
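The first shortcoming can be illustrated with a short sketch of the two classical coefficients. The catalog size and item names below are made up for illustration. With a large catalog and small baskets, the simple matching coefficient is dominated by items that neither customer bought, while Jaccard's coefficient ignores which goods are involved and how much was spent.

```python
def jaccard(a, b):
    """Jaccard's coefficient: shared items divided by items bought by either customer."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def simple_matching(a, b, all_items):
    """Matching coefficient: share of catalog items on which both customers agree
    (both bought the item, or both did not)."""
    a, b, all_items = set(a), set(b), set(all_items)
    agree = sum((item in a) == (item in b) for item in all_items)
    return agree / len(all_items)

# Hypothetical catalog of 1000 items and two customers with disjoint baskets.
catalog = [f"item{n}" for n in range(1000)]
c1 = ["item1", "item2"]
c2 = ["item3", "item4"]

print(jaccard(c1, c2))                   # no shared items
print(simple_matching(c1, c2, catalog))  # close to 1 despite no shared items
```

Here the two customers share nothing, yet the matching coefficient is 0.996 because they agree on the 996 items that neither of them bought.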

www.etasr.com Bafghi: Clustering of Customers Based on Shopping Behavior and Employing Genetic Algorithms
We therefore introduce a new similarity measure that overcomes some of these shortcomings. The similarity measure presented here considers goods purchased together and the profitability of customers. The criterion for goods purchased together was inspired by the concept of Support provided in [10]. Let $Supp(\{i_i, i_j\})$ be the ratio of transactions involving $\{i_i, i_j\}$ to the total number of transactions:

$$Supp(\{i_i, i_j\}) = \frac{\text{number of transactions containing } \{i_i, i_j\}}{\text{total number of transactions}} \qquad (1)$$

In the above equation, $i_i, i_j \in I$. It should be noted that if the items are rarely purchased together, the support value is low [11]. To solve this problem we use the intimacy criterion $Int$ (2), whose numerical value lies between 0 and 1; this solves the problem of low support. Finally, the similarity of two customers is measured by the proposed formula (3).
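The pairwise support defined above can be computed as in the following sketch. It assumes that each transaction is simply a list of item names; the intimacy criterion and the final similarity formula are not sketched here because their exact forms are not reproduced in this text.

```python
from itertools import combinations
from collections import Counter

def pair_support(transactions):
    """Return Supp({a, b}) for every item pair that occurs in at least one transaction."""
    counts = Counter()
    for items in transactions:
        # Count each unordered pair once per transaction; sorting gives a canonical key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}

baskets = [["milk", "bread"], ["milk", "bread", "tea"], ["tea"], ["milk"]]
supp = pair_support(baskets)
```

Pairs that never occur together are absent from the dictionary, which corresponds to a support of zero.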

C. Clustering algorithm based on purchase
The proposed clustering algorithm is based on the similarity measure in (3). The algorithm first classifies customers into K clusters. The value of K can be determined with the help of marketers or by the method proposed in this paper (see section 4.2). After determining K, the initial cluster centers are chosen from the records in $T_c$. These centers may be selected randomly or by a heuristic method (GA). Let $G = \{c_n \mid n = 1, \dots, K\}$ be the set of cluster centers, where $c_n \in T_c$ is the center of the $n$-th cluster. The set $(T_c - G) = \{c_i \mid i = 1, \dots, \|T_c - G\|\}$ then contains the customers not selected as cluster centers. For every customer that is not a cluster center, the similarity to all cluster centers is measured, and the customer is added to the cluster whose center is most similar to it. After each customer has been placed in a cluster, the cluster centers are recalculated. Let $c_i$ and $c_j$ be two customers in cluster $G_n$, whose center is $c_n$. The priority of customer $c_i$ is then calculated from two sums: the sum of the similarities between $c_i$ and the other customers $c_j$ in cluster $G_n$, and the sum of the similarities between $c_i$ and the centers $c_m$ of the other clusters $G_m$, $m \neq n$. Among all customers in cluster $G_n$, the customer with the highest priority value is selected as the new center of the cluster. After the new centers have been specified, the algorithm repeats this process until no cluster center changes. The overall process of the algorithm is shown in Figure 1.
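The iteration described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the similarity function is passed in as a parameter (Jaccard's coefficient is used as a stand-in in the demo), and the priority of a customer is assumed to be the within-cluster sum minus the sum over the other centers, since the exact way the two sums are combined is not reproduced in this text.

```python
def cluster_customers(customers, similarity, centers, max_iter=100):
    """Assign customers to the most similar center, then re-pick centers by priority.

    customers: list of hashable ids; similarity(a, b) -> float; centers: K initial ids.
    """
    centers = list(centers)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: every non-center customer joins its most similar center.
        clusters = {c: [c] for c in centers}
        for cust in customers:
            if cust in clusters:
                continue
            best = max(centers, key=lambda c: similarity(cust, c))
            clusters[best].append(cust)

        # Update step: the member with the highest priority becomes the new center.
        new_centers = []
        for center, members in clusters.items():
            others = [c for c in centers if c != center]

            def priority(ci):
                within = sum(similarity(ci, cj) for cj in members if cj != ci)
                across = sum(similarity(ci, cm) for cm in others)
                return within - across  # ASSUMED combination of the two sums

            new_centers.append(max(members, key=priority))

        if set(new_centers) == set(centers):
            break  # no cluster center changed
        centers = new_centers
    return clusters

# Demo with a stand-in similarity (Jaccard's coefficient on made-up item sets).
item_sets = {
    "a": {1, 2}, "b": {1, 2, 3}, "c": {1, 2, 4},
    "x": {7, 8}, "y": {7, 8, 9}, "z": {7, 8, 10},
}

def sim(u, v):
    s, t = item_sets[u], item_sets[v]
    return len(s & t) / len(s | t)

result = cluster_customers(list(item_sets), sim, centers=["b", "y"])
```

In the demo, the initial centers `b` and `y` are replaced by `a` and `x` after the first pass, and the second pass leaves the centers unchanged, so the loop stops.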
Equation (8) defines $\eta_n$ as the average similarity between the center $c_n$ and the customers in cluster $G_n$. Equation (9) defines $\eta_m$ as the average similarity between the center $c_m$ and the customers in cluster $G_m$. Equation (10) determines the similarity between $c_n$ and $c_m$. Using the clustering quality defined in (7), the appropriate value for K can be defined by a formula in which K is determined between the lower bound and upper bound of t

III. DETERMINING THE CENTER OF CLUSTERS USING GENETIC ALGORITHM
Genetic algorithms [12] can be used to select the initial centers of clusters. A genetic algorithm is an abstract computational model of biological evolution used in optimization problems; it relies on mutation, crossover and the production of new populations.

A. Encoding the chromosome
Each chromosome in the genetic algorithm represents a solution to the problem under investigation. In this case, each chromosome represents a set of K initial cluster centers. If $f_i$ is a chromosome, then $f_i = [y_1, \dots, y_j, \dots, y_K]$, where $y_j$ is the $j$-th gene and K is the total number of genes (Figure 2).
Fig. 2. Coding of chromosomes

B. Setting values of the initial population
Let $P_e$ represent the population in the $e$-th iteration, with $0 \le e \le E$, where E is the maximum number of iterations of the genetic algorithm. The number of chromosomes is constant in all iterations, so $P_e = \{f_i \mid i = 1, \dots, L\}$, where $f_i$ is the $i$-th chromosome and L is the total number of chromosomes in the population. L is an even number specified by the user.
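A possible way to build such a population is sketched below. The function name `init_population` and its parameters are hypothetical. The sketch assumes that each gene is a customer id, that the genes of a chromosome are distinct, and that the initial chromosomes are drawn at random; the text does not state how the initial values are chosen.

```python
import random

def init_population(customer_ids, k, size, seed=None):
    """Build `size` chromosomes, each a list of k distinct customer ids (candidate centers)."""
    if size % 2:
        raise ValueError("population size L must be even")
    rng = random.Random(seed)
    # random.sample draws without replacement, so genes within a chromosome are distinct.
    return [rng.sample(customer_ids, k) for _ in range(size)]

# Hypothetical example: 100 customers, K = 5 centers, L = 10 chromosomes.
population = init_population(list(range(100)), k=5, size=10, seed=0)
```

The even-size check mirrors the requirement that L be even, which later allows the population to be split into two halves of equal size.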

C. The fitness value of a chromosome
The fitness value of a chromosome determines its suitability for survival. The fitness is determined based on the clustering quality in equation (7), so the fitness of each chromosome $f_i$ is calculated from this quality value. In (12), the chromosomes in the population $P_e$ are divided into two categories: good ($P_e^{good}$) and bad ($P_e^{bad}$). Good chromosomes are those with high fitness values and bad chromosomes are those with low fitness values; the two sets are of equal size, $\|P_e^{good}\| = \|P_e^{bad}\| = L/2$.
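The split into good and bad halves can be sketched as follows. The fitness function is passed in as a parameter; in the method it would be the clustering quality obtained with the chromosome's centers, while the demo uses a toy stand-in (`sum`) because the quality formula is not reproduced in this text.

```python
def split_population(population, fitness):
    """Rank chromosomes by fitness and return (good_half, bad_half) of equal size."""
    ranked = sorted(population, key=fitness, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

# Toy stand-in fitness: the sum of the genes (NOT the paper's quality function).
toy_population = [[1, 2], [5, 6], [3, 4], [0, 1]]
good, bad = split_population(toy_population, fitness=sum)
```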

D. Production of the new population
The purpose of generating a new population is to remove the chromosomes with low fitness values and copy the chromosomes with high fitness values into the new population. Thus $P_e^{bad}$ is removed and chromosomes from $P_e^{good}$ take its place. The higher the fitness value of a chromosome, the higher its probability of selection; the selection probability of each chromosome is calculated from its fitness value. This production process ensures that new generations are created from parents with high fitness values. The process is shown in Figure 3.
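One common way to make the selection probability grow with fitness is fitness-proportional (roulette-wheel) selection, sketched below. This is an assumption: the paper's own probability formula is not reproduced in this text and may differ. The sketch also assumes non-negative fitness values.

```python
import random

def selection_probabilities(fitnesses):
    """Fitness-proportional probabilities (assumed form; fitness values must be >= 0)."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(fitnesses)] * len(fitnesses)
    return [f / total for f in fitnesses]

def next_population(good, fitness, size, seed=None):
    """Fill a new population of `size` chromosomes by sampling from the good half."""
    rng = random.Random(seed)
    probs = selection_probabilities([fitness(f) for f in good])
    picks = rng.choices(good, weights=probs, k=size)
    return [list(f) for f in picks]  # copy so later operators do not alias parents

# Toy stand-in fitness (sum of genes), as before.
good_half = [[5, 6], [3, 4]]
new_pop = next_population(good_half, fitness=sum, size=4, seed=1)
```

Sampling with replacement keeps the population size constant while letting fitter parents appear more often.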

E. Mutate and crossover
After generating a new population, mutation and crossover operations are applied to it. First, a chromosome $f_i$ from $P_e^{good}$ and a chromosome $f_j$ from $P_e^{bad}$ are selected. If the genes of $f_i$ and $f_j$ are not equal, the crossover operation is applied to them, as shown in Figure 4. If the genes are equal, the mutation operation is performed instead, and the newly created chromosome replaces one of the two chromosomes. This operation is shown in Figure 5.

IV. RESULTS

A. Assessment of the proposed method
To evaluate the proposed method, a data set from a retail store was used, containing 9729 transactions relating to 4223 customers and 1560 items. The purpose of the proposed algorithm is for customers in each cluster to have the highest similarity with each other, so the items purchased by the customers in a cluster should be similar. Although these items are distinct, customers' buying patterns can be defined with the help of the frequency of each item. This concept is known as Support, as shown in Equation 1.
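The item frequencies used in this assessment can be computed per cluster as in the sketch below. It assumes that the support of an item within a cluster is the share of the cluster's customers who bought it, and it returns the highest values so that their mean can be compared across clustering methods. The cluster contents are made up for illustration.

```python
from collections import Counter
from statistics import mean

def top_item_supports(cluster_itemsets, top_n=5):
    """Return the top_n highest item supports within one cluster.

    Support of an item = share of customers in the cluster who bought it (assumed).
    """
    counts = Counter(item for items in cluster_itemsets for item in set(items))
    n = len(cluster_itemsets)
    supports = sorted((c / n for c in counts.values()), reverse=True)
    return supports[:top_n]

# Hypothetical cluster of four customers.
cluster = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]
top = top_item_supports(cluster, top_n=2)
mean_top = mean(top)
```

A cluster whose customers buy similar goods yields a high mean over its most supported items, which is the quantity compared between the two methods in the evaluation.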
To assess this concept, we compared the K-means clustering method [13] with the proposed method. The value of K was set to 30 and, for each cluster, the mean of the five highest item supports was calculated, together with the standard deviation over all items. The results of the K-means algorithm and the proposed method are shown in Tables I and II, respectively. The ANOVA test was used to assess the differences between the obtained results; the test results are shown in Table III. The mean support of the five most supported items was significantly higher for the proposed method than for K-means. Therefore, it can be argued that with the proposed approach, customers in each cluster have more similar buying patterns than with the K-means algorithm. The standard deviation of the proposed method is also higher than that of K-means, which further supports this conclusion. The proposed method also has a good convergence rate. As the following figure shows, as the number of iterations increased, the total similarity between customers and cluster centers increased and the number of changes in cluster centers decreased. After 9 iterations the cluster centers no longer changed and the algorithm terminated.
[Figure: Convergence in proposed method]

B. Evaluating the performance of Genetic Algorithm
As mentioned, the initial cluster centers can be determined randomly or using a genetic algorithm. Experiments were carried out to show the usefulness of genetic algorithms, in which the number of clusters was varied from 30 to 80. Figure 7 compares the effectiveness of using genetic algorithms with that of selecting the initial cluster centers randomly.

V. CONCLUSION
Mass marketing no longer meets the diverse needs of customers. Customer clustering techniques have to be employed so that customers with the same buying patterns are put in the same cluster. In this paper we presented a method of clustering customers employing genetic algorithms. Results were presented and discussed. As shown, improved performance is achieved through the discussed technique.

Fig. 3. The process of creating a new population

Fig. 7. Comparison of genetic algorithm and random method

TABLE I. PRODUCT PURCHASE PATTERN WITH THE HELP OF K-MEANS METHOD