EDAMS: Efficient Data Anonymization Model Selector for Privacy-Preserving Data Publishing

Abstract—The evolution of the internet into the Internet of Things (IoT) has caused an exponential rise in data collection. This drastic increase in the collection of individuals' private information poses a serious threat to their privacy. Privacy Preserving Data Publishing (PPDP) is an area that provides ways of sharing data in an anonymized version, i.e. keeping the identity of each person undisclosed. Various anonymization models are available in the area of PPDP that guard privacy against numerous attacks. However, selecting the optimum model that balances utility and privacy is a challenging process. This study proposes an Efficient Data Anonymization Model Selector (EDAMS) for PPDP, which generates an anonymized dataset optimized in terms of privacy and utility. EDAMS takes as input the dataset along with the required parameters and produces its anonymized version by incorporating PPDP techniques while balancing utility and privacy. EDAMS currently incorporates three PPDP techniques, namely k-anonymity, l-diversity, and t-closeness. It was tested against different variations of three datasets, and the results were validated by explicitly testing each variation with the stated techniques. The results show the effectiveness of EDAMS in selecting the optimum model with minimal effort.


I. INTRODUCTION
The advent of IoT, high-processing-speed hardware, and cloud storage with high-bandwidth communication produces vast amounts of data that would have been unthinkable a couple of decades ago. Due to these advancements, around 2.5 quintillion bytes of data are created each day [1]. Such huge production of information not only advances users' quality of life, but also enhances various vital services. The data collection process is not governed by a single entity [2]. The applications used to perform daily routine activities efficiently are constantly saving, collecting, and tracking user data. Moreover, companies are encouraged to release their micro-data in order to facilitate data analysis that eventually supports new business opportunities [3,4]. However, the release of micro-data enables tracking of the public and private lives of the individuals concerned, thus putting their privacy at risk [3,5,6].

A typical data collecting and publishing scenario is depicted in Figure 1. In the data collection phase, data holders gather data from individuals, i.e. record owners (e.g. Ahmed, Haris, Laraib, Sana). In the publishing phase, the data are provided to data recipients, who can be data miners or other third parties that make use of the data for their own purposes. The published records may contain sensitive information [7][8][9][10][11]. To secure data owners' privacy and to avoid data exploitation, removing identifying attributes such as name, address, telephone number, and social security number is a common practice prior to data release. However, this simplistic technique is not sufficient to guarantee the protection of record owners. Publishing data in a way that discloses no sensitive information and keeps the privacy of record owners intact is termed PPDP [7]. Typically, PPDP deals with publishing data in an anonymized form, i.e. the data still contain sensitive information, but that information cannot be linked to its owner, while the data remain useful for interested parties. Various methods have been proposed [12][13][14][15] for transforming data into their anonymized version. These methods differ in their ability to prevent the linking of data to their owners, which can eventually harm their privacy.

There is no standard method for selecting a particular anonymization technique; the selection is highly dependent on the type of dataset and its sensitive attributes. The publisher has to anonymize the data with multiple techniques in order to select the most suitable one. This is not only expensive in terms of time and resources, but also requires sufficient knowledge to choose the appropriate method for converting the actual data into their anonymized version. Selection of an inappropriate method may cause data loss; it is therefore necessary to select a method that provides results at the optimum level of utility with the least possible loss of data. Keeping in view the aforementioned problems, this study proposes a model that can identify the most suitable technique for anonymizing a given dataset with minimum information loss. The main contributions of the current study are:
• The development of a model that helps a data holder who has no particular knowledge of data anonymization techniques to release data anonymously.
• The selection of the most appropriate method according to the nature of the respective dataset.
• The generation of an anonymized dataset with the least information loss and maximized utility.
II. LITERATURE REVIEW

Various real-world attacks indicate the significance of preserving individual privacy when distributing personal information. Data released by companies for research purposes have often ended up harming individual privacy. The re-identification of individuals by linking their records with other available external information is termed a linking attack [12]. Some reported incidents in which released data were linked with external information are summarized in Table I.

TABLE I. REPORTED RE-IDENTIFICATION INCIDENTS
Ref. | Released dataset | Reported outcome
[16] | Health dataset from Washington State | 43% identification by linking the dataset with newspaper stories containing the word "hospitalized".
[17] | Prescription data of South Korean residents | 100% of individuals in the dataset were re-identified, although the data were encrypted prior to release.
[12] | Medical records of state employees of Massachusetts | The Governor of Massachusetts was identified when the dataset was linked with the publicly available voter enrollment list.
[18] | Three-month credit card records | 90% identification by analyzing buying patterns.
[9] | AOL dataset | One of the users was identified and interviewed by the New York Times within three days of the data release.
[19] | Netflix dataset | 99% of records were identified with 8 movie ratings.

The authors in [16] collected a health dataset from Washington State which did not contain patient names; nevertheless, 43% of the individuals were successfully identified by linking the dataset with newspaper stories containing the word "hospitalized". The authors in [17] conducted experiments on the encrypted prescription data of 23,163 South Korean Resident Registration Numbers (RRNs); they claimed that they were able to re-identify 100% of the data and concluded that encrypted data are also vulnerable. The author in [12] described the re-identification of the dataset released by the Group Insurance Commission (GIC), which included the medical records of the state employees of Massachusetts and was intended to facilitate medical research. The dataset contained demographic data such as birth date, gender, and zip code, and it was shown how easily William Weld (the then governor of Massachusetts) was identified by linking the Massachusetts voter enrollment list with the information given by the GIC. The authors in [18] studied a three-month credit card report covering 1.1 million individuals and uniquely identified 90% of them by analyzing only four spatiotemporal points, reporting that credit card buying patterns make an individual's privacy vulnerable. A similar incident was reported in 2006, when AOL released 20 million search queries of its users and, within three days of the release, one of its users was identified and interviewed by the New York Times [9]. A few months later, Netflix also faced the re-identification of its users in the dataset it released for the development of an accurate movie recommendation algorithm; the data were attacked by the authors in [19], who showed that external information can be linked to the data to identify the respective individuals.
PPDP is a way of releasing anonymized data while preserving individual privacy [6]. In PPDP, the data are generally represented as a table of Explicit Identifiers, Quasi Identifiers, Sensitive Attributes, and Non-Sensitive Attributes, where Explicit Identifiers are attributes that explicitly identify an individual and Quasi Identifiers are those that could potentially identify an individual. Sensitive person-specific information such as salary, real-time location, and disability status is considered a Sensitive Attribute, while Non-Sensitive Attributes comprise all attributes that do not fall into the previous three categories. Numerous techniques and models have been proposed in PPDP for producing anonymized data, such as k-anonymity [12], l-diversity [13], and t-closeness [20], which have become the foundation of many other models [22][23][24][25][26] and are therefore used in EDAMS.

A. Preliminaries
Let T be an original data table of the form T(DI1, …, DIi, QI1, …, QIj, SA1, …, SAk), where DIs are Direct Identifiers, i.e. attributes that should be removed prior to data publishing; QIs are Quasi Identifiers, i.e. non-sensitive attributes which, when linked with external data, can reveal the identity of a record owner; and SAs are Sensitive Attributes, i.e. the private information related to a record owner.
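As a concrete illustration of these roles, the following minimal Java sketch (hypothetical; the class and column names are illustrative and not part of EDAMS) tags the columns of a small table with the three attribute types defined above.

// Minimal illustration (hypothetical, not part of EDAMS): tagging the columns
// of a table T with the three attribute roles defined in the preliminaries.
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaExample {

    enum Role { DIRECT_IDENTIFIER, QUASI_IDENTIFIER, SENSITIVE }

    public static void main(String[] args) {
        // Example resembling the Employee's Salary dataset used in Section IV.
        Map<String, Role> schema = new LinkedHashMap<>();
        schema.put("name",             Role.DIRECT_IDENTIFIER);
        schema.put("telephone_number", Role.DIRECT_IDENTIFIER);
        schema.put("gender",           Role.QUASI_IDENTIFIER);
        schema.put("zip_code",         Role.QUASI_IDENTIFIER);
        schema.put("salary",           Role.SENSITIVE);

        long sensitive = schema.values().stream().filter(r -> r == Role.SENSITIVE).count();
        System.out.println("Sensitive attributes: " + sensitive + " of " + schema.size());
    }
}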

B. Methodology
The proposed data anonymization model initially makes use of k-anonymity, l-diversity, and t-closeness as privacy models, and of generalization and suppression as PPDP operations. The utility measure used to keep information loss at its optimum is the Information Loss (ILoss) metric [27], which quantifies the uncertainty introduced when a value is generalized, i.e. how many other values can no longer be distinguished from it. The overall anonymization process is depicted in Figure 2 and comprises five steps. In the first step, the original data are taken as input with the DIs, QIs, and SAs clearly marked. Once the nature of each attribute is established, the sensitivity of the overall dataset is calculated and the generalization hierarchy of the QIs is generated; on the basis of the dataset's sensitivity, the optimum privacy model is then selected for its anonymized version.

The sensitivity of the dataset is calculated as the proportion of sensitive attributes among all attributes of the dataset. If the sensitivity is 0, no sensitive attribute is present in the dataset and the k-anonymity privacy model is used. Applying k-anonymity requires the value of k to be chosen carefully, because it determines the utility ratio of the dataset [28]. EDAMS makes use of two PPDP operations, i.e. generalization and suppression: a generalization lattice is created for each QI, while the DIs and the attributes that cannot be generalized are suppressed in the resulting anonymized table. When these steps are completed, the optimum model is chosen on the basis of sensitivity, the information loss is calculated via the ILoss metric [7], and the data holder obtains the anonymized version of the data with the least cost. Figure 3 depicts the applied algorithm.

IV. EXPERIMENTS
EDAMS was developed in Java and runs on a 2.4 GHz Intel Core i5 processor with 6 GB of RAM. Three datasets, along with different customized variations of each, were examined for the assessment of the model, namely the UCI Adult dataset [29], the Employee's Salary dataset [30], and the Crime Incident dataset [31]. Each dataset was evaluated twice: first with EDAMS, and then with each method applied to it separately in order to obtain the optimal result, a process termed Hit and Trial. The results are shown in Table II.
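Before turning to the individual cases, the following Java sketch illustrates the selection step described in the methodology above. It is an illustration only, not the authors' implementation: it assumes that sensitivity is the proportion of sensitive attributes among all attributes (consistent with the 13% reported for the Crime Incident dataset below), and the cut-off between l-diversity and t-closeness is a placeholder, since the exact threshold is not stated in the text.

// Hypothetical sketch of the sensitivity-based selection step (not the
// authors' code). Assumption: sensitivity = #SA / #(all attributes); the
// 0.25 cut-off between l-diversity and t-closeness is illustrative only.
public class ModelSelectorSketch {

    enum PrivacyModel { K_ANONYMITY, L_DIVERSITY, T_CLOSENESS }

    // Proportion of sensitive attributes among all attributes of the dataset.
    static double sensitivity(int numDI, int numQI, int numSA) {
        int total = numDI + numQI + numSA;
        return total == 0 ? 0.0 : (double) numSA / total;
    }

    static PrivacyModel select(double sensitivity) {
        if (sensitivity == 0.0) {
            return PrivacyModel.K_ANONYMITY;  // no SA: only linking must be prevented
        } else if (sensitivity <= 0.25) {     // illustrative threshold, not from the paper
            return PrivacyModel.L_DIVERSITY;
        } else {
            return PrivacyModel.T_CLOSENESS;
        }
    }

    public static void main(String[] args) {
        // Second variant of the Crime Incident dataset: 3 DIs, 4 QIs, 1 SA.
        double s = sensitivity(3, 4, 1);
        System.out.printf("sensitivity = %.0f%% -> %s%n", 100 * s, select(s));
    }
}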

A. Case 1: Adult Dataset
The dataset in [29] contains 30,162 records and consists of 9 attributes in total: sex, age, race, marital-status, education, native-country, work class, occupation, and salary-class. Three variants of this dataset were considered, none of which contains a DI. The first variation took all attributes as QIs; the second considered occupation as an SA and the rest as QIs; and the third took six of them as QIs and two of them, i.e. marital-status and occupation, as sensitive.

1) Selection via EDAMS
When the dataset is given as input to EDAMS, its sensitivity is calculated, i.e. the proportion of sensitive attributes in the dataset. Considering the first variation, where there is no SA, the sensitivity becomes 0%, which means that although the dataset carries no directly sensitive information, it can still serve as a tool for linking attacks. In this case, EDAMS suggested k-anonymity for the respective dataset, with a maximum information loss of 60%. Table II presents the chosen models and the maximum information loss when the same procedure was applied to all of its variants.
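For reference, the information-loss percentages reported by EDAMS are based on the ILoss metric cited in the methodology. A common formulation of this metric, stated here for convenience rather than quoted verbatim from the paper, is:

\[
\mathrm{ILoss}(v_g) = \frac{|v_g| - 1}{|D_A|}, \qquad
\mathrm{ILoss}(r) = \sum_{i} w_i \cdot \mathrm{ILoss}(v_{g,i}), \qquad
\mathrm{ILoss}(T^{*}) = \sum_{r \in T^{*}} \mathrm{ILoss}(r),
\]

where \(v_g\) is a generalized value of attribute \(A\), \(|v_g|\) is the number of domain values that \(v_g\) stands for, \(|D_A|\) is the size of the domain of \(A\), \(w_i\) is an attribute weight, \(r\) is a record, and \(T^{*}\) is the anonymized table.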

2) Selection via Hit and Trial Model
To verify the results obtained by EDAMS, each variant of the dataset was tested with each privacy model in order to find the most suitable model for the respective dataset. The threshold values were swept from the lowest possible values up to the point where further changes no longer affect the result. For the same variant, the data holder has to try every possible combination of the different methods, which demands a substantial amount of time. Table III depicts the results obtained from the different combinations of methods employed through the hit-and-trial model. It can be seen that the minimum information loss occurs when the k-anonymity model is applied. However, identifying this least-information-loss method became possible only after trying each model and their combinations with different threshold values, whereas the same model was recommended by EDAMS without any extra effort.
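The exhaustive procedure just described can be pictured as a simple parameter sweep. The Java sketch below is hypothetical: anonymizeAndMeasureILoss is a stub standing in for a full anonymization run, and the threshold grid is illustrative only.

// Hypothetical sketch of the hit-and-trial baseline (not the authors' code):
// try each privacy model over a grid of threshold values and keep the
// configuration with the smallest information loss.
import java.util.List;

public class HitAndTrialSketch {

    // Stub standing in for a full anonymization run: in a real experiment this
    // would anonymize the dataset with the given model and threshold and
    // return the resulting ILoss in [0, 1].
    static double anonymizeAndMeasureILoss(String model, double threshold) {
        return Math.random();  // placeholder result
    }

    public static void main(String[] args) {
        List<String> models = List.of("k-anonymity", "l-diversity", "t-closeness");
        double bestLoss = Double.MAX_VALUE;
        String bestConfig = null;

        for (String model : models) {
            // Illustrative threshold grid; in practice the sweep stops once a
            // further increase no longer changes the result.
            for (double t = 2; t <= 10; t++) {
                double loss = anonymizeAndMeasureILoss(model, t);
                if (loss < bestLoss) {
                    bestLoss = loss;
                    bestConfig = model + " (threshold = " + t + ")";
                }
            }
        }
        System.out.println("Best configuration: " + bestConfig + ", ILoss = " + bestLoss);
    }
}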

B. Case 2: Employee's Salary Dataset
This dataset [30] contains 1,999 records and comprises five attributes (name, gender, telephone number, zip code, salary). Two variations were created, in both of which two attributes (name and telephone number) were considered as DIs. The first variant considers the remaining three attributes as QIs, while the second considers salary as an SA and the remaining two, i.e. gender and zip code, as QIs.

1) Selection via EDAMS
The process of selecting a privacy model through EDAMS remains the same for every dataset. Considering its second variation, there are two DIs and one SA. The DIs were removed outright from the anonymized version, while l-diversity was selected as the privacy model. Table IV shows the results.

2) Selection via Hit and Trial Model
Analyzing the same dataset yields the results shown in Table V. Two models provide the same result; however, one of them (l-diversity) had already been suggested by EDAMS.

C. Case 3: Crime Incident Dataset
This dataset [31] contains 1,058 records with a total of eight attributes, namely last name, first name, block, gender, race, date of birth, case number, and crime_code. Four versions of this dataset were formed. Last name, first name, and date of birth served as DIs in the first two versions; the remaining five attributes were taken as QIs in the first variant, while in the second, crime_code was taken as the sensitive attribute and the rest as QIs. The third and fourth variants took only first name and date of birth as DIs, with the rest of the structure remaining the same.

1) Selection via EDAMS
Analyzing its second variant, there is one SA along with four QIs and three DIs. The sensitivity is calculated as 13% and l-diversity is suggested by EDAMS.

V. DISCUSSION
The cost of producing anonymized data via the hit-and-trial model is high, as there is no standard method for anonymizing data. The data holder has to keep checking different models over different thresholds to achieve data anonymity with greater utility. Moreover, the absence of knowledge regarding privacy models makes it even more difficult for the data holder to transform the data into their de-identified version. EDAMS, in contrast, is capable of selecting the appropriate model for the respective dataset with only some initial effort, thereby minimizing the overall cost while maintaining good efficiency.

VI. CONCLUSION AND FUTURE WORK
The vast amount of available data can provide immense benefits when analyzed carefully. Many companies share their data for research or other purposes. However, as everything becomes automated, the data are becoming highly personalized, so companies need to make the necessary arrangements to protect their clients' privacy. PPDP is a promising approach that can be used to publish data while preserving individual privacy to a great extent. Many techniques are available in this domain for the generation of anonymized data, but choosing one is a challenging decision. This study presented EDAMS, a data anonymization model selection tool that is capable of generating anonymized data with minimal effort. EDAMS requires the dataset and the nature of its attributes in order to select the optimal method among k-anonymity, l-diversity, and t-closeness. The results were validated by applying the techniques separately, one by one, on the same datasets, and the conclusion was that EDAMS efficiently selects the most appropriate method.
PPDP is still in its development stage, as researchers keep coming up with more efficient algorithms. EDAMS currently provides a limited selection of anonymization algorithms; however, it has the capability to work as a classifier when trained rigorously, which would make it capable of anonymizing any type of data by selecting the most efficient algorithm. EDAMS currently deals with linking attacks, using generalization and suppression as PPDP operations and k-anonymity, l-diversity, and t-closeness as anonymization algorithms. In the future, it is planned to accommodate more anonymization techniques in order to protect individual privacy against probabilistic attacks as well.