A Hybrid Data Mining Method for Customer Churn Prediction

Because of increasing competition and market saturation, attracting new customers costs much more than retaining existing ones. Customer retention is therefore one of the leading factors in companies’ marketing. Customer retention requires churn management, and effective churn management requires an accurate model for churn prediction. A variety of techniques have been used for churn prediction, such as logistic regression, neural networks, genetic algorithms, and decision trees. In this article, a hybrid method is presented that predicts customer churn more accurately, using data fusion and feature extraction techniques. After data preparation and feature selection, two algorithms, LOLIMOT and C5.0, were trained on feature sets of different sizes and applied to test data. The outputs of the individual classifiers were then combined with weighted voting. Applying this method to real data from a telecommunication company demonstrated its effectiveness.
Keywords: customer churn; data mining; hybrid method; LOLIMOT; C5.0; weighted voting


I. INTRODUCTION
The way companies communicate with their customers has become a key point of competition in marketing. Concepts such as customer acquisition, retention, and satisfaction have become internalized in companies. According to studies, attracting a new customer costs 5-10 times more than retaining an existing one [1]. On the other hand, customer retention has its own costs, and a company cannot bear these costs for all customers, because not all customers are worth retaining. Thus, churn management systems look for the customers who intend to leave the company. The importance of churn management is underlined by studies showing that a 5% increase in customer retention yields a 25% to 95% increase in profit for companies [2]. Many studies have been carried out on churn management, and different methods have been used to identify the reasons for customer loss and to predict and prevent churn. Among these, data analysis methods have been used widely.
Customer churn is the situation in which customers decide to leave the company. The meaning of churn differs between fields [3][4][5]. Most definitions relate churn to product-related behavior and a threshold defined by business rules: churn occurs when a customer's transactions fall below the threshold [6]. The authors of [7] define customer churn in banking as customers who close their accounts. The authors of [6] define churners as customers who hold less than 2500 euros (in savings and all other kinds of assets) at the bank. Technically, according to (1), churn is the loss of customers over a defined time period.

$\text{Monthly churn} = \dfrac{C_0 + A_1 - C_1}{C_0}$    (1)

Here $C_0$ is the number of customers at the beginning of the period, $C_1$ is the number of customers at the end of the period, and $A_1$ is the number of new customers acquired during the period.
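As a small illustration of (1), the following Python sketch computes the monthly churn rate. The function name and the example figures are hypothetical and are not taken from the study.

```python
def monthly_churn(c0: int, c1: int, a1: int) -> float:
    """Churn rate per Eq. (1): customers lost in the period divided by C0.

    c0: customers at the start of the period
    c1: customers at the end of the period
    a1: new customers acquired during the period
    """
    if c0 <= 0:
        raise ValueError("c0 must be positive")
    # c0 + a1 is everyone who was a customer at some point in the period;
    # subtracting c1 leaves the number who left.
    return (c0 + a1 - c1) / c0


# Illustrative numbers only: 1000 customers at the start, 50 acquired, 1030 at the end.
rate = monthly_churn(c0=1000, c1=1030, a1=50)
print(f"monthly churn: {rate:.1%}")
```

In this example, 20 of the initial 1000 customers were lost, which gives a churn rate of 2%.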
Generally, customer churn prediction is a binary classification problem whose outcome indicates the probability of customer churn [10,11]. However, the special nature of the churn prediction problem imposes some limitations on analysis algorithms. For example, the data in these problems are imbalanced: lost customers make up only a small part of the data. In addition, extended learning applications have to cope with various kinds of noise. In any case, churn prediction needs to classify customers according to their probability of churn [12]. Different approaches have been applied to this problem, and each has its own shortcomings. For example, although algorithms based on decision trees are used for classification, some leaves may have the same class probability, and the method is sensitive to noise. Neural networks may settle on suboptimal solutions, and overfitting occurs as the number of model parameters increases. Although genetic algorithms produce accurate prediction models, they do not make the probability of occurrence clear. Finally, methods such as support vector machines usually do not lead to the best results. Table I summarizes the above-mentioned methods and their characteristics.
In this study, a hybrid method using data fusion and feature extraction techniques is presented for more accurate prediction of customer churn. After data preparation and feature selection, two algorithms, LOLIMOT and C5.0, are trained on feature sets of different sizes, and the outputs of the individual classifiers are combined with weighted voting.

II. PROPOSED MODEL
Because combining individual algorithms usually leads to better and more accurate predictions, this study proposes a hybrid method for customer churn prediction. Generally, three factors should be kept in mind when combining algorithms: (1) the creation of the training sets, (2) the selection of the individual algorithms to be combined, and (3) the method or rules by which the results are combined. The proposed model for churn prediction includes the following steps:

A. Data Preparation
The information needed for data mining models includes demographic data such as place of residence, age, gender, number of children, and salary, other financial data and bills, and data related to customer usage [14]. Customers' personal information such as name, email address, mailbox, or father's name is not needed and can be deleted from the database at this stage. Missing data can be omitted or replaced with substitute values such as the average or a value obtained from a prediction technique.
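The preparation step can be sketched as follows. This is only a minimal illustration under stated assumptions: the field names in `PERSONAL_FIELDS` and the record layout (a list of dictionaries with `None` for missing values) are hypothetical, and mean imputation is just one of the replacement options mentioned above.

```python
from statistics import mean

# Hypothetical identifier columns that carry no predictive information.
PERSONAL_FIELDS = {"name", "email"}


def prepare(records):
    """Drop personal fields and replace missing numeric values with the column mean."""
    cleaned = [
        {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}
        for record in records
    ]
    columns = {key for record in cleaned for key in record}
    for column in columns:
        observed = [r[column] for r in cleaned if r.get(column) is not None]
        # Only impute columns that have at least one value and are fully numeric.
        if not observed or not all(isinstance(v, (int, float)) for v in observed):
            continue
        fill = mean(observed)
        for record in cleaned:
            if record.get(column) is None:
                record[column] = fill
    return cleaned


raw = [
    {"name": "a", "age": 30, "bill": None},
    {"name": "b", "age": None, "bill": 40.0},
    {"name": "c", "age": 50, "bill": 60.0},
]
prepared = prepare(raw)
```

Non-numeric columns are left untouched here; a real pipeline would need a separate rule for them.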

B. Feature Selection or Extraction and Ordering
This step is important because it keeps the informative features and omits redundant, noisy, and uninformative ones, which helps data cleansing and dimension reduction [8]. In this study, principal component analysis (PCA) is used for this purpose. The goal of this step is to obtain, by means of PCA, the set of features that is most effective for prediction. $F_k$ is the $k$-th ordered feature such that:

C. Constructing Subsets of Data According to the Extracted Features and Training Model(s)
Given the high performance of boosted C5.0 and LOLIMOT reported in the literature [26], these models are used as the individual classifiers, and their accuracy is improved by weighted voting.

1) C5.0 Algorithm
The C5.0 decision tree is a classification tree. The C5.0 algorithm is an improved form of the ID3 algorithm, which constructs the decision tree based on information theory. The training data are a set of classified samples $S = \{s_1, s_2, \ldots\}$. Each sample is a vector of features, $s_1 = (X_1, X_2, \ldots)$. The training data also include the vector $C = (C_1, C_2, \ldots)$ that gives the class to which each sample belongs. At each tree node, the feature that gives the best classification is selected to split the sample set into the related classes. This selection is made by computing the entropy (information gain): the feature with the highest information gain is selected as the decision maker.
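The split criterion described above can be sketched in Python as follows. This only illustrates choosing a feature by information gain on categorical data; it is not the C5.0 implementation, and the feature names in the example are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(samples, labels):
    """Return the feature name with the highest information gain.

    samples: list of dicts mapping feature name -> categorical value.
    """
    return max(
        samples[0],
        key=lambda f: information_gain([s[f] for s in samples], labels),
    )


# Toy data: "plan" separates churners (1) from loyal customers (0); "region" does not.
samples = [
    {"plan": "a", "region": "x"},
    {"plan": "a", "region": "y"},
    {"plan": "b", "region": "x"},
    {"plan": "b", "region": "y"},
]
labels = [1, 1, 0, 0]
chosen = best_feature(samples, labels)
```

Here splitting on `plan` removes all uncertainty about the class, while splitting on `region` removes none, so `plan` is chosen for the node.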

2) Boosting Algorithm
Boosting is an ensemble method that trains a set of classification models [27], where each training set is created based on the accuracy of the previous classification models. New classifiers are constructed to predict more accurately the samples that earlier classifiers classified poorly. This is done by adaptive resampling: samples that were classified incorrectly have a higher chance of being selected in the next step. Each sample has a weight, and these weights are updated at the end of each classification round. Finally, the results of the different classifiers are combined by voting. Boosting can increase the accuracy of the C5.0 algorithm, but it needs more training time.
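The weight update can be illustrated with one round of AdaBoost-style reweighting. The source does not state which boosting variant is used with C5.0, so the update rule below is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math


def update_weights(weights, correct):
    """One AdaBoost-style round (assumed variant, not confirmed by the source).

    weights: current sample weights, assumed to sum to 1
    correct: booleans, True where the current classifier was right
    Returns the new normalized weights and the classifier's vote strength.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        # Degenerate round: keep the weights unchanged in this sketch.
        return list(weights), 0.0
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples are scaled up, correctly classified ones down.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw], alpha


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, alpha = update_weights(weights, correct)
```

After this round, the single misclassified sample carries half of the total weight, so it is much more likely to be drawn when the next training set is resampled.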

3) LOLIMOT Algorithm
The local linear model tree (LOLIMOT) is based on the divide-and-conquer strategy: a complicated problem is solved by dividing it into several smaller problems [9]. To achieve a better outcome (one with a smaller error), the LOLIMOT algorithm divides the problem space into several local linear models (LLMs). After finding the worst LLM, the algorithm continues by dividing it into two LLMs. The algorithm is used as a fast-learning tree algorithm in many pattern recognition and prediction problems, with some remarkable results. LOLIMOT was first introduced for the locally linear neuro-fuzzy model [28]. The basic strategy in these networks is to divide the input space into small linear subspaces by using fuzzy activation functions. In the locally linear neuro-fuzzy model, the neural network has one hidden layer, but the computation in its neurons is much more complex than that of a regular neuron. The network has $(p+1) \times M$ weights, where $M$ is the number of neurons and $p$ is the number of inputs. The matrices of weights and inputs are given in (3) and (4). Each neuron has an LLM and a validity function. The validity function indicates the region in which the LLM is valid. It is also known as the activation function, because it controls the activity of the locally linear models.
The first output, given by (5), is the output of the LLM of the $i$-th neuron.
The activation function can be computed by using (6) and (7), where $\mu_i(u)$ is a Gaussian function with two parameters. In LOLIMOT, these two parameters, the center and the standard deviation, are considered fixed, and the weights are calculated by least squares. Assuming $M$ neurons and $N$ training vectors, the regression matrix and the training output vector can be defined as in (7), (8) and (9). The output vector for the training data is then defined and, finally, the network output is calculated by (12).
Now, considering the target function (13), we try to minimize the error by changing the weights. The LOLIMOT algorithm starts the training process from one neuron or a specified number of neurons, and the following three stages are repeated:
• The worst neuron, the one with the highest error, is identified by (14).
• All cases in which this neuron can be divided along one of its dimensions are considered, and the case with the least error is selected.
• The worst neuron is recomputed.
After suitable data mining algorithms have been selected, subsets of the training data are constructed based on the feature sets, and the proposed models are trained and tested on each of these subsets. The models are then evaluated and filtered according to their accuracy.
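As a sketch of the locally linear neuro-fuzzy model described in this subsection, the following Python code evaluates the network output for a given input. It assumes the form commonly used in the LOLIMOT literature (axis-aligned Gaussian membership functions, normalized to obtain the validity functions); the dictionary layout and the parameter values are hypothetical. Only the forward pass is shown, not the least-squares weight estimation or the splitting loop.

```python
import math


def gaussian(u, center, sigma):
    """Unnormalized Gaussian membership of input u for one neuron."""
    return math.exp(
        -0.5 * sum(((x - c) / s) ** 2 for x, c, s in zip(u, center, sigma))
    )


def llm_output(u, w):
    """Local linear model: w[0] + w[1]*u[0] + ... + w[p]*u[p-1] (p + 1 weights)."""
    return w[0] + sum(wj * x for wj, x in zip(w[1:], u))


def network_output(u, neurons):
    """Sum of the LLM outputs weighted by the normalized validity functions."""
    mus = [gaussian(u, n["center"], n["sigma"]) for n in neurons]
    total = sum(mus)
    return sum(
        mu / total * llm_output(u, n["weights"]) for mu, n in zip(mus, neurons)
    )


# Two neurons on a one-dimensional input: y = u near 0 and y = 1 near 1.
neurons = [
    {"center": [0.0], "sigma": [0.5], "weights": [0.0, 1.0]},
    {"center": [1.0], "sigma": [0.5], "weights": [1.0, 0.0]},
]
y = network_output([0.5], neurons)
```

At the midpoint between the two centers both neurons are equally valid, so the output is the average of the two local models.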

4) Combining outputs of classifiers
In this phase, the outputs of the classifiers remaining from the previous step are combined with one of the voting algorithms explained below. The simplest method for combining classifiers is majority voting, which gathers the outputs of all individual classifiers $C_1, C_2, \ldots, C_n$ and selects the output with the largest number of votes as the final decision,
where $V(C_i)$ is the vote of the $i$-th classifier.
The other voting method is weighted voting, an improved but more complicated version of majority voting. In this strategy, each classifier is assigned a weight according to its classification performance. The weights can be obtained by the formula given in [29],
where $W_i$ is the weight of the $i$-th classifier when the classifiers are ordered by accuracy, and $\alpha$ is a regulating parameter with $0 < \alpha < 1$. The voting output is then obtained by weighting each classifier's vote accordingly, so that different weights are given to the outputs according to their accuracy. If the total weight of the classifiers that predict the customer to be a churner is greater than the total weight of the other classifiers, the customer is predicted to be a churner; otherwise, the customer is predicted to be loyal.
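The two voting rules can be sketched as follows. The weights in the example are arbitrary placeholders; they are not computed with the weight formula of [29], which is not reproduced here.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, weights):
    """Predict churn (1) if the churn votes outweigh the loyal (0) votes."""
    churn = sum(w for v, w in zip(votes, weights) if v == 1)
    loyal = sum(w for v, w in zip(votes, weights) if v == 0)
    return 1 if churn > loyal else 0


# Three classifiers ordered by accuracy; only the most accurate one predicts churn.
votes = [1, 0, 0]
weights = [0.6, 0.25, 0.15]  # placeholder weights, assumed for illustration

simple = majority_vote(votes)
weighted = weighted_vote(votes, weights)
```

With these placeholder weights the two rules disagree: majority voting predicts a loyal customer, while weighted voting follows the most accurate classifier and predicts churn.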

III. EVALUATION CRITERIA
The lift criterion is a performance measure defined as the ratio between the outcomes obtained with and without the prediction model. The higher the lift, the more accurate the model and, intuitively, the more profitable a targeted proactive churn management program will be [5]. Calculating the lift over the entire churn management database is neither logical nor practical. It is also not economically feasible for an organization to apply churn management to all customers, so it concentrates on those with a higher churn probability. For this reason, lift is usually calculated on a targeted fraction of customers, which is about 10-20% in marketing [26,30].
As the top decile lift is the main criterion in the churn management area [5], it has been used to evaluate the performance of the applied algorithms. The top decile lift focuses on the customers predicted most likely to churn. The first 10% are the most critical customers (i.e., those with high churn probability), and this percentage is an ideal portion for targeting the retention marketing campaign [9]. To compute the top decile lift, customers are first sorted from most likely to least likely to churn according to the prediction. Then the proportion of actual churners in the first 10% is calculated. Finally, this proportion is divided by the overall churn rate. If all customers in the first 10% are actual churners and the ratio between churners and other customers is 50-50, the top decile lift equals 100/50. In addition to lift, the proposed algorithm was compared with some other algorithms from the literature using two other criteria, accuracy and area under the ROC curve (AUC).
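The top decile lift computation can be sketched as follows. The function name, the handling of small samples (at least one customer in the top group), and the toy data are assumptions made for illustration.

```python
def top_decile_lift(scores, labels, fraction=0.1):
    """Churn rate among the top `fraction` of scored customers over the overall rate.

    scores: predicted churn probabilities (higher means more likely to churn)
    labels: actual outcomes, 1 for churn and 0 otherwise
    """
    overall = sum(labels) / len(labels)
    if overall == 0:
        raise ValueError("lift is undefined when there are no churners")
    # Sort customers from most likely to least likely to churn.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / overall


# 20 customers, half of them churners, scored so that churners rank first.
labels = [1] * 10 + [0] * 10
scores = [1 - i / 20 for i in range(20)]
lift = top_decile_lift(scores, labels)
```

This reproduces the case described above: every customer in the top 10% is a churner and the overall churn rate is 50%, so the lift is 100/50 = 2.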

IV. DATASET
The dataset is provided by the Teradata Center at Duke University. The database contains datasets of mature subscribers (i.e., customers who had been with the company for at least six months) of a major U.S. telecommunication company. There are three different datasets in this database: calibration data, current score data, and future score data. A total of 172 variables are included in the datasets: one churn indicator and 171 predictor variables. The predictor variables are of three types: behavioral data such as minutes of use, revenue, and handset equipment; company interaction data such as customer calls to the customer service center; and customer household demographics. The churn response is coded as a dummy variable, with churn=1 if the customer churns and churn=0 otherwise. There are 100,000 records in the calibration dataset, 51,306 records in the current score dataset, and 100,462 records in the future score dataset. The actual average monthly churn rate is reported to be around

Fig. 2. [Figure and surrounding text truncated in the source; recoverable headings: Feature Subsets, Model Training.]

TABLE III.