A Genetic-Firefly Hybrid Algorithm to Find the Best Data Location in a Data Cube

Abstract: Decision-support programs issue large-scale, complex database queries. Because short response times are required, query optimization is critical. Users usually view data as a multi-dimensional data cube. Each data cube cell holds an aggregation, and the values of many cells depend on the values of other cells in the cube. A powerful query optimization method is therefore to materialize some of the cells in advance instead of calculating every result from raw data. Business systems use different approaches for selecting and positioning data in the data cube. In the present study, the data is used to train a neural network, and a genetic-firefly hybrid algorithm is proposed for finding the best position for the data in the cube.


I. INTRODUCTION
Business Intelligence (BI) is a broad category of methods, applications, and technologies for gathering data, accessing information, and analyzing large amounts of data so that an organization can gain knowledge, make effective business decisions, and strengthen its Information Technology (IT) measurements. Typical BI technologies include business rule modeling, ETL (Extraction, Transformation and Load) tools, data profiling, data warehousing and online analytical processing, and data mining. The focus of BI is to fully utilize massive data to help organizations gain competitive advantages. Decision support systems are increasingly used in business. They provide access to data that has been locked in a database and convert it into useful information. Many organizations and companies are developing integrated decision support systems known as data warehouses, where users can carry out analysis. Whereas operational databases hold the current state of their information, data warehouses typically store the history of that information. As a result, data warehouses tend to grow over time.
Users of decision support systems are generally more interested in identifying trends than in searching for an individual record. Queries in a decision support system therefore make heavy use of aggregation. The size of the data warehouse and the complexity of the queries can produce queries that take a long time to complete. Such delays are not acceptable in most decision support system environments because they reduce the efficiency of the system. Data cube analysis is a powerful tool for analyzing multi-dimensional data. Cube analysis lets users explore data easily by computing aggregate measures for all possible groupings defined by the dimensions. A number of techniques have been used to design effective methods to calculate the cube. Technically, a cube is a redundant multi-dimensional projection of a relation that computes all possible SQL GROUP BY operators and aggregates their results in n-dimensional space to answer OLAP queries. These aggregations are stored in derived summary tables or multi-dimensional arrays, and mathematical tools are required to estimate the aggregation sizes. Because an aggregation is generally quite large, indexing, which adds redundancy, is deemed necessary to speed up queries [42,43,44].
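The idea that a cube contains every possible grouping can be illustrated with a minimal sketch. It assumes a tiny in-memory table and SUM as the aggregate; the function name `data_cube`, the column names, and the sample rows are hypothetical and are not taken from the dataset used in this study.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Compute SUM(measure) for every subset of dims (2^n groupings)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                # The grouping key is the tuple of values of the grouped dimensions.
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube


# Hypothetical sample data, for illustration only.
rows = [
    {"region": "north", "year": 2015, "sales": 10},
    {"region": "north", "year": 2016, "sales": 20},
    {"region": "south", "year": 2015, "sales": 5},
]
cube = data_cube(rows, ("region", "year"), "sales")
print(len(cube))            # number of groupings for two dimensions
print(cube[("region",)])    # totals per region
```

With n dimensions the loop produces 2^n groupings, which is why the full cube quickly becomes expensive to compute and store.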

II. OLAP
Online analytical processing (OLAP) is rooted in systems such as IRI Express, Comshare, and Essbase. Unlike statistical databases that hold census data and economic information, OLAP deals with everyday commercial transaction data, such as sales or health records. The main objective of OLAP is to enable analysts to build a mental image of the underlying data by exploring it from different perspectives, at various levels of generalization, and interactively [39]. As a component of decision support systems, OLAP interacts with other components such as data warehouses and data mining to assist business analysts in decision-making. Data warehouses collect data from multiple data sources, such as transactional databases, across an organization. The data is cleaned and converted to a consistent format before being stored in the data warehouse. Subsets of the data in the warehouse are extracted to meet the specific needs of the organization. Unlike transactional databases, where data is continuously updated, the data in a warehouse is refreshed only periodically from the data sources.
OLAP and data mining both allow analysts to discover knowledge about the data in a warehouse. Data mining algorithms automatically generate such knowledge in a pre-defined form, such as classifications or association rules [2,3]. OLAP does not produce knowledge directly, but relies on human analysts to interpret the results of queries. OLAP is more flexible than data mining because analysts can explore different patterns and trends rather than a fixed set of knowledge. OLAP and data mining can be combined to enable analysts to obtain data mining results from different portions of the data and at different levels of generalization [9,39]. In a typical OLAP session, analysts pose aggregate queries about the underlying data. OLAP can answer such queries in a few seconds, even if a query aggregates a very large number of records. The results allow analysts to examine large amounts of data to determine general patterns and trends. After observing an exception to a pattern, analysts can examine smaller portions of the data in greater detail to explain it. This process can be repeated on different parts of the data using data segmentation until a satisfactory mental image is formed. The requirements of an OLAP system are set out in sources such as the FASMI test and Codd's rules. Some requirements are unique to OLAP. To make OLAP an interactive process, the system must first respond to queries efficiently. OLAP often relies on large-scale pre-computation, specialized indexing, and storage to improve performance.
Second, the system must allow analysts to explore the data from different perspectives and at different levels of generalization by organizing the data into multiple dimensions and hierarchies. The data cube model described below is a popular abstract model for this purpose [9]. The data to be analyzed by OLAP is stored in the data warehouse using the relational model and is organized in a star schema of measures, dimensions, and other attributes. Each dimension has a table that represents the members of the dimension hierarchically. A dimension table can contain redundancies, which can be eliminated by dividing each dimension into further tables. The result is called a snowflake schema. ROLAP and MOLAP are popular OLAP architectures. ROLAP provides a front-end tool that translates multi-dimensional queries into the corresponding SQL queries for processing by the relational back end. ROLAP is lightweight and scalable to large datasets, but its performance is limited because relational optimization methods are not designed for multi-dimensional queries. MOLAP does not rely on the relational model, but works directly on a multi-dimensional view. MOLAP can achieve better performance by materializing and optimizing the multi-dimensional view, although materialization requires significant storage and is not always scalable [2,4,23].

III. DATA CUBE
A data cube is an SQL operator that supports OLAP functions such as histograms and sub-totals. Histograms are aggregations over computed categories. Although such tasks are possible using standard SQL queries, the queries become too complicated. The number of groupings needed is exponential in the number of dimensions. Such a complex query may require many scans of the base table, which results in poor performance. Because sub-totals are common in OLAP queries, a new operator, the data cube, is defined to collect all sub-totals. Users of the system want the data cube to summarize the detailed data from different aspects.
A data cube generalizes the cross-tabulation table in several ways. First, a data cube is n-dimensional. A three-dimensional cube, for example, is divided into eight parts, each of which is called a cuboid. The first cuboid is the three-dimensional one, known as the core. The next three cuboids each contain one ALL value and are 2D planes. The next three cuboids each contain two ALL values and are one-dimensional. The last cuboid contains only ALL values and is a zero-dimensional point [1,11,16]. The second generalization is the aggregation function. Aggregation is discussed here using SUM, but in general any aggregation function, including a user-defined one, can be used to build data cubes. Aggregation functions can be divided into distributive, algebraic, and holistic categories. Let I be a set of values. A function is distributive when its result over I can be obtained by applying a function G( ) to its results on the partitions of I; under this definition, the distributive functions SUM, COUNT, MIN, and MAX are easy to examine. When G( ) is generalized to return a vector of m values instead of one, the aggregation is algebraic and shows characteristics similar to the distributive case. The third generalization is the dimension hierarchy. In a data cube, each dimension has a trivial two-level hierarchy. Each dimension can have many attributes, and the attributes of a dimension can form a lattice (a partially ordered hierarchy). The attributes of the base table form the lower bound of the lattice and ALL forms the upper bound. The lattice structure plays an important role in most data cube approaches [25,35].
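The difference between distributive and algebraic aggregation can be made concrete with a short sketch. It assumes the usual reading of these categories: a distributive function such as SUM is recombined from one partial result per partition, while an algebraic function such as the average is recombined from a small fixed-size tuple per partition. The function names and the sample partitions are illustrative assumptions.

```python
def total_sum(partitions):
    # Distributive: SUM over all data equals the SUM of the per-partition SUMs.
    return sum(sum(part) for part in partitions)


def average(partitions):
    # Algebraic: each partition is summarized by a fixed-size tuple (sum, count),
    # and the final result is derived from the combined tuples.
    partials = [(sum(part), len(part)) for part in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical partitions of one measure column.
partitions = [[1, 2, 3], [4, 5], [6]]
flat = [value for part in partitions for value in part]

print(total_sum(partitions) == sum(flat))
print(average(partitions) == sum(flat) / len(flat))
```

A holistic function such as the median has no such bounded summary per partition, which is why it is the hardest category to pre-compute in a cube.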
The core cuboid must be materialized because it cannot be computed from any other cuboid. If a cuboid containing ALL values has a size comparable to that of the core cuboid, it need not be materialized, because answering a query from it incurs about the same cost as answering from the core cuboid. Many greedy algorithms have been proposed to position data in a data cube [16,25]. Even if all cuboids must be calculated, the lattice structure can be used to improve the efficiency of the calculations. For example, if calculation is based on sorting records, then the core cuboid of a three-dimensional cube can be sorted in three ways. Each choice simplifies the calculation of one cuboid with an ALL value, because that cuboid can be calculated without additional sorting, whereas calculating the other two cuboids requires the core cuboid to be re-sorted. Based on estimated costs, the available algorithms make the optimal choice of sort order for each cuboid, and these choices lead to calculation pipelines that further reduce the size of the cuboids to be sorted [11,12,25]. Data warehouse users work in a graphical environment in which data is displayed as a data cube in multi-dimensional form. Cubes are usually displayed in 2D or 3D, but higher dimensions can also be generated to increase information exploration. Each cell of a data cube contains a measure value [4,23,30,35].
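A greedy selection of which parts of the cube to materialize can be sketched as follows. This is a minimal illustration in the style of benefit-based greedy view selection, not the algorithm of the cited works or of this study. It assumes that each part of the cube is identified by its set of dimensions, that a part can answer any query over a subset of its dimensions, and that the cost of a query equals the size of the smallest materialized part that can answer it. The sizes in the example are invented.

```python
def greedy_select(sizes, k):
    """Pick up to k parts to materialize in addition to the core (all dimensions)."""
    views = list(sizes)
    core = max(views, key=len)  # the part that contains every dimension
    chosen = {core}

    def cost(w):
        # Size of the cheapest already-materialized part that can answer w.
        return min(sizes[v] for v in chosen if w <= v)

    for _ in range(k):
        best, best_benefit = None, 0
        for v in views:
            if v in chosen:
                continue
            # Total saving over every part w that v would be able to answer.
            benefit = sum(max(0, cost(w) - sizes[v]) for w in views if w <= v)
            if benefit > best_benefit:
                best, best_benefit = v, benefit
        if best is None:
            break
        chosen.add(best)
    return chosen


fs = frozenset
# Hypothetical row counts for the eight parts of a three-dimensional cube.
sizes = {
    fs("abc"): 100, fs("ab"): 50, fs("ac"): 90, fs("bc"): 30,
    fs("a"): 20, fs("b"): 10, fs("c"): 5, fs(): 1,
}
selected = greedy_select(sizes, 2)
print(sorted("".join(sorted(v)) for v in selected))
```

In this example the part over dimensions a and c is never chosen: at size 90 it is almost as large as the core, so materializing it saves very little.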

IV. RELATED WORK
Efficient calculation of multi-dimensional aggregations in a data cube has been studied by many researchers. The data cube has been introduced as a relational aggregation operator that generalizes sub-totals, with aggregate functions classified into distributive, algebraic, and holistic categories [10]. A greedy algorithm has been developed to select a partial cube to materialize for data cube calculations [35,36]. A calculation method has been developed based on chunk processing to organize large multi-dimensional arrays effectively [32]. Several guidelines have been suggested for effective calculation of multi-dimensional aggregations on ROLAP servers [31]. A model has been developed based on chunk-based multi-way array aggregation for data cube calculations in MOLAP [40]. A method has also been developed to calculate sparse data cubes [18]. After iceberg queries were developed [25,26], the BUC method, a scalable method that calculates iceberg cubes bottom-up, was introduced [16,17]. The H-cubing method was also developed for calculating iceberg cubes with complex measures using an H-tree structure [12]. The Star-Cubing method was introduced for calculation of an iceberg cube with a dynamic star-tree structure [7]. MM-Cubing is an effective method of calculating the iceberg cube by factorizing the lattice space [41]. A shell-fragment method based on data cubes is very effective for OLAP processing of high-dimensional data [23].
Apart from these studies on data cube models, the smart dwarf method has been suggested for calculating a reduced data cube and is known as the condensed model [38]. Calculation of compressed data cubes (dwarf cubes) uses a smart cube structure that summarizes the semantics of a data cube [21]. The same method was also developed in a QC-tree structure [22]. An aggregation-based approach known as the closed cube, or C-Cubing, has been developed [8] to calculate closed cubes using a new algebraic measure. Studies have also been conducted on approximate calculation using compressed data cubes, including the quasi-cube [6] and wavelet cube [13] methods. Compressed cubes for estimation over continuous dimensions [14] have been obtained by applying log-linear models [5] to a compressed data cube.

V. NEURAL NETWORK AND EVOLUTIONARY ALGORITHMS
Every activity takes place, directly or indirectly, within an environment and involves a collection of decisions. Before a decision is made, it is important to have sufficient knowledge of the issue under investigation. Evolutionary algorithms have been used to explore such environments. In the present paper, two algorithms are used to build this knowledge of the problem [37].

A. Neural Networks
The perceptron is a well-known neural network, and its multilayer form is widely used. Perceptron networks probably had the greatest effect on early neural networks. The perceptron learning rule is more powerful than rules such as the Hebbian rule. Under reasonable assumptions, its iterative learning procedure converges to the correct weights. Convergence means that training leads to weights that allow the network to produce the correct output value for each training input pattern and for similar patterns. Moreover, the performance of a neural network depends on the choice of the neuron count, the network architecture, and the learning algorithm [28,44].
The perceptron learning rule is a form of supervised learning in which pairs such as stimulus and response, input and desired output, or pattern and pattern class are available. A learning error occurs during training; hence, in practice, the learning error must be used at each step to set the network parameters such that the error becomes smaller when the same input is applied again. The perceptron learning rule is defined for one-layer neural networks consisting of neurons with a two-valued hard-limit threshold transfer function [20,32,37]. The original perceptrons had three layers: sensory neurons, connecting units, and the response unit, forming an approximate model of the retina. In these networks only the weights between the second and third layers are trained, and the activation function for each connecting unit is a binary step function with an arbitrary, but fixed, threshold value. The signal sent from the connecting units to the output unit is a binary signal, and $y = f(y_{in})$ is the perceptron output with the activation function:
$$f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$
In addition to the usual outputs of +1 (belonging to the category or group) and -1 (not belonging to the category or group), this function includes the band between $-\theta$ and $\theta$ in which no decision is made [34,37]. The weights from the connecting units to the response or output unit are set using the perceptron learning rule. In the perceptron training rule, the network computes the response of the output unit for each training input vector and then determines whether or not an error has occurred for that pattern. Here, an error occurs when the output calculated by the network does not equal the target value. Learning with this rule is similar to learning with the Hebbian rule, except that the weights only change when the network response to the input contains an error.
The perceptron network does not distinguish between an error in which the calculated network output is 0 (the undecided band) and the target value is -1 and an error in which the calculated output is +1 and the target value is -1. In both cases, the learning rule changes the network weights in the direction that brings the response closer to the target value. Only the weights on the connections from units that send non-zero signals to the output unit are changed, because only these signals contributed to the error. In the perceptron training rule, if an error occurs for a training input pattern, the weights change as:
$$w_i(\text{new}) = w_i(\text{old}) + \alpha\, t\, x_i$$
where $x_i$ is the signal on connection $i$, $t$ represents the target value (+1 or -1), and $\alpha$ is the learning rate, which determines how fast the weights change. In this network, if no error occurs, the weights do not change, and training continues until no further error occurs.
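The training rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: inputs and targets are bipolar (+1 or -1), a bias is trained like a weight on a constant input of 1, and the threshold, learning rate, and the logical AND training set are hypothetical choices that are not taken from the experiments of this study.

```python
def activation(y_in, theta):
    # Three-valued output: +1, -1, or 0 inside the undecided band.
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0


def train_perceptron(samples, alpha=1.0, theta=0.2, max_epochs=100):
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(max_epochs):
        changed = False
        for x, t in samples:
            y_in = b + sum(wi * xi for wi, xi in zip(w, x))
            if activation(y_in, theta) != t:
                # Update only when the response is wrong: w_i += alpha * t * x_i.
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:
            break  # a full pass without errors: training has converged
    return w, b


# Bipolar logical AND, used here only as a toy training set.
samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = train_perceptron(samples)
print(w, b)
```

Because the AND patterns are linearly separable, training stops after a pass with no errors, which is the behavior the convergence theorem guarantees for such problems.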
As in [24,44], this study proposes an optimization procedure that supplies the best training dataset for the artificial neural network and updates its weights and bias; the output of the trained network is then used as the input dataset for the genetic algorithm. The perceptron learning rule convergence theorem states that if there are weights that allow the network to produce the correct answer for all training patterns, then the perceptron learning method will converge to such weights. This means that the perceptron network will be able to solve the problem or learn the desired categorization. In addition, the network will find these weights in a finite number of training steps [23,44].

B. Genetic Algorithm
In genetic algorithms, every solution is represented as a binary string together with the value of its fitness function. A genetic algorithm solves a problem by first randomly generating a population of chromosomes. Second, the fitness of each chromosome in the population is assessed and, third, new chromosomes are created by random mating and mutation of some bits. Fourth, less fit members are excluded from the current population and, fifth, new chromosomes are evaluated and included in the population. Finally, the third to fifth steps are repeated until the population converges, which is also the termination condition. The overall fitness of the population increases as the generations advance; however, the number of individuals in the population is the same at time t and t-1. Evolution acts on the population as a whole rather than on an individual or its components. A set of chromosomes makes up the population. The effect of genetic operators on the population is to form a new population with the same number of chromosomes. The fitness function provides an indicator of how well each individual performs in the problem domain. Windowing and linear normalization are techniques applied within the fitness function [20,27,29,33,37].
Reproduction takes place in two stages: first, bits of the parental chromosomes are copied into the children's chromosomes; second, operators alter the chromosomes of each child. The major operators are crossover, mutation, and inversion. Crossover is the strongest operator of genetic algorithms and is usually applied with high probability. Mutation is less important than crossover and is applied with lower probability. As the parameters change and the simulation advances, mutation becomes more important because the diversity of the population decreases [14]. Crossover is the driving force behind a genetic algorithm that uses reproduction. A genetic algorithm uses a random number generator to select a random point in each parent at which crossover occurs [14,15,27]. Mutation causes the search to reach untouched regions of the space, and its most important task is to avoid convergence to a local optimum. When a member of the new population is created, each gene may mutate with a given probability. Inversion is used in the genetic algorithm because it is biologically plausible. In inversion, one chromosome is selected, two points on it are randomly selected, and the order of the genes between them is reversed. There are different strategies for selecting individuals from a population for reproduction. The selection operator selects a number of chromosomes from the population for reproduction, and the more fit chromosomes are more likely to be selected [15,33,37].
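The steps above (random initial population, fitness evaluation, selection, crossover, mutation, replacement) can be sketched as a small program. This is an illustrative sketch and not the implementation used in this study: the fitness function is a hypothetical stand-in that simply counts 1-bits, selection is fitness-proportional, and the population size, mutation rate, and the practice of carrying the best chromosome forward are assumptions.

```python
import random


def fitness(bits):
    # Hypothetical stand-in fitness: the number of 1-bits in the chromosome.
    return sum(bits)


def roulette_select(population, scores, rng):
    # Fitness-proportional selection; the small offset avoids all-zero weights.
    return rng.choices(population, weights=[s + 1e-9 for s in scores], k=1)[0]


def crossover(a, b, rng):
    # Single-point crossover at a random cut point.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def run_ga(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        children = [max(population, key=fitness)]  # keep the best chromosome
        while len(children) < pop_size:
            p1 = roulette_select(population, scores, rng)
            p2 = roulette_select(population, scores, rng)
            children.append(mutate(crossover(p1, p2, rng), mutation_rate, rng))
        population = children  # the new population has the same size
    return max(population, key=fitness)


best = run_ga()
print(fitness(best), best)
```

Replacing `fitness` with a measure of placement quality is what would connect such a loop to the data cube problem; that measure is not specified here.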

C. Firefly Algorithm
The firefly algorithm is inspired by observation of actual fireflies, in particular the flashing light they use to attract one another. In the algorithm, each firefly moves toward brighter fireflies, and the brightness of a firefly reflects the quality of its position. The firefly algorithm is a swarm intelligence method; the difference between this algorithm and fireflies in nature is memory. The most important features of this algorithm are its high convergence speed, flexibility, high error tolerance, and insensitivity to initial values [15]. The similarity between data placement in a data cube and the firefly algorithm, whose parameters can be used to select an appropriate position for the data and suggest the best location, prompted the use of this algorithm in the present study [19].
Two very important factors in the firefly algorithm are the variation in light intensity and the formulation of attractiveness. For simplicity, it is always assumed that the attractiveness of a firefly is determined by its brightness, which is associated with the encoded objective function. For maximization problems, the brightness $I$ of a firefly at position $x$ can be chosen as $I(x) \propto f(x)$. The attractiveness $\beta$ is relative: it is judged by the other fireflies and therefore varies with the distance $r_{ij}$ between fireflies $i$ and $j$.
Note that the light intensity decreases as the distance from its source increases. In the simplest case, the light intensity $I(r)$ varies according to the inverse square law and is defined as [4,19]:

$$I(r) = \frac{I_s}{r^2}$$
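In an implementation, the attractiveness function $\beta(r)$ can be any monotonically decreasing function such as:
$$\beta(r) = \beta_0 e^{-\gamma r^m}, \quad m \ge 1$$
where $\beta_0$ is the attractiveness at $r = 0$. The distance between two fireflies $i$ and $j$ at positions $x_i$ and $x_j$ is the Cartesian distance:
$$r_{ij} = \lVert x_i - x_j \rVert = \sqrt{\sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2}$$
where $x_{i,k}$ is the $k$th component of the location coordinate $x_i$ of the $i$th firefly and $d$ is the number of dimensions. In the 2D case, it can be expressed as:
$$r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
The movement of firefly $i$ when it is attracted to a brighter firefly $j$ is expressed by:
$$x_i = x_i + \beta_0 e^{-\gamma r_{ij}^2}(x_j - x_i) + \alpha\, \epsilon_i \tag{11}$$
The second term of relation (11) represents the attraction; in the third term, $\alpha$ is a randomization parameter and $\epsilon_i$ is a vector of random numbers drawn from a Gaussian or uniform distribution [15].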
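A single movement step of this kind can be sketched as follows. The sketch assumes the exponential attractiveness form with exponent 2 and a uniform random perturbation centered on zero; the function names and default parameter values are illustrative assumptions rather than the settings used in this study.

```python
import math
import random


def distance(xi, xj):
    # Cartesian distance between two positions of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))


def attractiveness(r, beta0=1.0, gamma=1.0):
    # Attractiveness decays with the squared distance.
    return beta0 * math.exp(-gamma * r ** 2)


def move_towards(xi, xj, alpha=0.0, beta0=1.0, gamma=1.0, rng=random):
    """Move firefly at xi toward the brighter firefly at xj."""
    beta = attractiveness(distance(xi, xj), beta0, gamma)
    # Attraction term plus a random perturbation scaled by alpha.
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5) for a, b in zip(xi, xj)]


start = [0.0, 0.0]
brighter = [1.0, 0.0]
moved = move_towards(start, brighter)  # alpha=0 gives a deterministic step
print(distance(start, brighter), distance(moved, brighter))
```

With `gamma` set to 0 the attractiveness no longer depends on distance and the firefly jumps directly onto the brighter one; larger values of `gamma` make each firefly react only to nearby neighbors.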
To use this algorithm, an n-member population of fireflies is first randomly generated at different locations. All fireflies initially have the same amount of light, and each iteration consists of a light update phase and a location update phase. The location of a firefly is a randomly chosen candidate location, and the light serves to identify the best placement of the data. The amount of light for each firefly in each iteration is determined by the fitness value of its location. The fitness value is the value added to the light in each iteration. The light update for each firefly can be expressed as:
$$\ell_i(t) = (1 - \rho)\, \ell_i(t-1) + \gamma\, J_i(t)$$
where $\ell_i(t)$ is the new value of the light of firefly $i$, $J_i(t)$ is the fitness of firefly $i$ in iteration $t$ of the algorithm, and $\rho$ and $\gamma$ are constants that model the gradual decay of the light and the effect of the fitness on the light, respectively (here $\gamma$ is not the light absorption coefficient).
Relation (13) is used to detect the position of other pieces of data in the neighborhood; it defines the set of fireflies in the neighborhood of firefly $i$ at time $t$. The distance between fireflies $i$ and $j$ at time $t$ is the Euclidean distance. The discrete-time movement of a firefly toward a neighbor is chosen with probability $p$ and scaled by a constant parameter, and $n_t$ controls the number of neighbors.
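The light update, neighborhood detection, and probabilistic movement can be sketched together. Since the corresponding relations are not fully legible in this text, the sketch assumes the commonly used glowworm-swarm form of these updates: a neighbor is a brighter firefly within a given radius, and the probability of moving toward a neighbor is proportional to the difference in light. All function names, parameter values, and sample positions are hypothetical.

```python
import math


def update_light(light, fitness, rho=0.4, gamma=0.6):
    # Decay the old light by rho and add the fitness scaled by gamma.
    return (1 - rho) * light + gamma * fitness


def neighbors(i, positions, lights, radius):
    # Brighter fireflies that lie within `radius` of firefly i.
    return [
        j for j in range(len(positions))
        if j != i
        and lights[j] > lights[i]
        and math.dist(positions[i], positions[j]) < radius
    ]


def move_probabilities(i, nbrs, lights):
    # Probability of moving toward each neighbor, proportional to the light gap.
    total = sum(lights[j] - lights[i] for j in nbrs)
    return {j: (lights[j] - lights[i]) / total for j in nbrs}


def step_towards(xi, xj, step=0.03):
    # Move a fixed step length along the direction from xi to xj.
    d = math.dist(xi, xj)
    return [a + step * (b - a) / d for a, b in zip(xi, xj)]


# Hypothetical positions and light values for three fireflies.
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
lights = [1.0, 2.0, 4.0]
nbrs = neighbors(0, positions, lights, radius=10.0)
print(nbrs, move_probabilities(0, nbrs, lights))
```

A parameter such as $n_t$ would enter in an additional rule that widens or narrows `radius` depending on how many neighbors were found; that rule is omitted from this sketch.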

VI. PROPOSED METHOD
In this project, a dataset containing information from a printing company was used. This data warehouse contained information such as user name, company ID, company name, company address, company phone number, company-enabled state, and company ads. User name, company ID, and company name are used as location dimension features. Company address, company phone number, company-enabled state, and company ads are used as time dimension features.
Other pieces of data contain information about orders by these companies, including ID, user name, user ID advertised, company ID, order name, color of ordered item, requested size (X and Y), material used for ordered item, order date, enabled or disabled state, confirmed or unconfirmed order state, send order to user, and change of order by user. ID, user name, user ID advertised, and company ID are used as location dimension features. Order name, color of ordered item, requested size (X and Y), material used for ordered item, order date, enabled or disabled state, confirmed or unconfirmed order state, send order to user, and change of order by user are used as time dimension features.
As in [28], genetic algorithm techniques and the firefly algorithm are used to optimize the search for the best location for data in the data cube. The data was used as the input of the genetic algorithm, whose initial population totaled 1000. The initial population of the firefly algorithm was set at 700. These values were set experimentally, by trial and error; the distribution of the population in the environment was very important because the area was very large. Reduction of data size was done using the genetic algorithm and placement improvement was done using the firefly algorithm. The proposed algorithm was written as pseudo-code. Algorithm 1 gives the output set (S) obtained by testing the MLP, which serves as input for optimized placement, and Algorithm 2 receives this input and runs the hybrid genetic-firefly algorithm to find the appropriate place for data in the data cube. Two dimensions are defined for this simulation: "location" and "time". Simulation was done in a MATLAB environment using the observational method.
In the first step, weights and bias are defined for the multi-layer perceptron (MLP) and the MLP is trained. If errors occur during learning, the weights and bias are updated. At the end of this phase, the test dataset is tested. The data was first inserted into the data cube environment sporadically, as shown in Figure 1. The output of this phase is the input of the optimization algorithm. The parameters of both the genetic and firefly algorithms are initialized. The main population for the genetic algorithm is generated from the dataset. The fitness function is applied at this stage and a new population is created and evaluated. Parents are then selected from the new population using the "Roulette Wheel" method. Crossover and mutation are applied in the genetic algorithm and the offspring are placed into the new population. The best data position was as shown in Figure 2. The firefly algorithm is then applied to the new population and the light values are generated. The distances between the fireflies in the population are computed. The components are moved according to attractiveness and light and placed in the data cube. Figure 3 shows the reduction of dimensions over time, determined visually for the operation.

VIII. CONCLUSION
In this study, placement of data in the scalable space of a data cube was done using a hybrid genetic-firefly algorithm. The data was first inserted into the data cube environment sporadically; then the best data position was shown and the reduction of dimensions over time was determined visually for the operation. Dataset clustering was carried out with the help of a multi-layer perceptron neural network, and the hybrid genetic-firefly algorithm was then used to find the best position for the data in the data cube.
www.etasr.com Faridi Masouleh et al.: Genetic-Firefly Hybrid Algorithm to Find the Best Data Location in a Data Cube

In the inverse square law for light intensity, $I_s$ is the intensity at the source. For a given medium with a constant light absorption coefficient $\gamma$, the light intensity $I$ varies with the distance $r$. This relation is defined as:
$$I = I_0 e^{-\gamma r}$$
where $I_0$ is the original light intensity. Firefly attractiveness is proportional to the light intensity that can be observed by other fireflies, so the attractiveness $\beta$ of a firefly can be defined as:
$$\beta = \beta_0 e^{-\gamma r^2}$$
where $\beta_0$ is the attractiveness at $r = 0$. Because $1/(1 + r^2)$ is faster to calculate than the exponential function, the attractiveness is generally approximated as:
$$\beta = \frac{\beta_0}{1 + \gamma r^2}$$