A Modified Approach of OPTICS Algorithm for Data Streams

Data are continuously generated by a wide variety of applications, in large volume and size. Such data are fast changing and temporally ordered, which has made data stream mining a field of major interest. A mining technique such as clustering is used to process data streams and to group similar objects together. Outliers produced in this process are noisy data points that show abnormal behavior compared to normal data points. To obtain clusters of high purity, outliers should be efficiently discovered and discarded. In this paper, pruning is applied to the Stream OPTICS algorithm along with the identification of real outliers, which reduces memory consumption and increases the speed of identifying potential clusters.


I. INTRODUCTION
Traditional data mining methods are not very successful on huge data streams, because off-line mining is not applicable. Clustering algorithms for streams must meet several requirements. Data points must be processed quickly, and the identification of outliers must be clear and precise [1]. Data uncertainty is an added issue. Different data types must be treated differently, which is a further issue. Arbitrary cluster shapes make it hard to identify the accurate shape of a cluster [2]. Many applications produce stream data, such as network traffic analysis, sensor networks, and internet traffic [20]. Random sampling, sliding windows, histograms, multi-resolution methods, sketches, and randomized algorithms are some basic data sampling techniques for mining data streams [13]. Stream data can be classified with algorithms such as the Hoeffding tree, the Very Fast Decision Tree (VFDT), the Concept-adapting Very Fast Decision Tree (CVFDT), and the classifier ensemble approach.
Accuracy, efficiency, compactness, separateness, purity, space limitation, and cluster validity are important issues for cluster quality. Different types of clustering methods, such as partitioning, hierarchical, model-based, density-based, grid-based, constraint-based, and evolutionary methods, are used for clustering stream data, and various algorithms have been developed for this purpose. Micro-clustering algorithms for data streams include the DenStream, Stream OPTICS, and HDDStream algorithms [19]. Density grid-based algorithms for data streams include the D-Stream, MR-Stream, and DENGRIS algorithms [19]. Table I maps several existing algorithms to their approach, advantages, and disadvantages.
Clustering is a key task in data mining. Data streams pose additional challenges for clustering, such as one-pass processing, limited time, and limited memory. In addition, finding clusters of arbitrary shape is necessary in data stream applications. Density-based clustering is of significant importance for data streams, as it can discover arbitrarily shaped clusters as well as outliers. In density-based clustering, a cluster is defined as an area of higher density than the remainder of the data set. Clustering algorithms require tedious calculations to detect outliers. Handling noisy data, limited time and memory, evolving data, and high-dimensional data must also be considered. An outlier is defined as a data point that shows abnormal behavior with respect to the system, and its definition is application dependent. In stream data mining the data points are huge in number, so during clustering a few data points may not belong to any cluster, or may not take part in clustering because of their distance from the clusters; these points are termed outliers and need to be removed. As data generation is continuous and fast, a structure should be established to handle the mining process. Clustering provides a solution to such issues in stream data mining.

II. PROPOSED ARCHITECTURE
In this paper, the Stream OPTICS algorithm is modified by applying a pruning method and by dynamically setting threshold cut-off points for the data. OPTICS (Ordering Points To Identify the Clustering Structure) is an extension of the most basic density-based algorithm, DBSCAN. Its underlying idea is to keep growing a given cluster as long as the density in the neighborhood exceeds some threshold value. That is, for each data point in a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
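The density condition stated above can be illustrated with a minimal sketch. The function names `region_query` and `is_core_point`, the parameter names `eps` and `min_pts`, the toy points, and the choice to count a point as a member of its own neighborhood are assumptions made for illustration and are not taken from the paper.

```python
from math import dist

def region_query(points, center, eps):
    """Return the points lying within distance eps of center (center included)."""
    return [p for p in points if dist(p, center) <= eps]

def is_core_point(points, center, eps, min_pts):
    """True when the eps-neighborhood of center holds at least min_pts points."""
    return len(region_query(points, center, eps)) >= min_pts

# Toy data (assumed): three nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core_point(points, points[0], eps=0.5, min_pts=3))  # True
print(is_core_point(points, points[3], eps=0.5, min_pts=3))  # False
```

A cluster keeps growing only through points that pass this test, which is why an isolated point such as the last one is left out as a candidate outlier.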
One of the important advantages of this method is that it can find clusters of arbitrary shape and can be used to filter out noise. It regards clusters as dense regions of objects in the data space that are separated by low-density regions. For interactive and automatic cluster analysis, the algorithm computes an augmented cluster ordering. This ordering contains the basic clustering information and exposes the intrinsic clustering structure. The core distance of an object is the smallest radius for which its neighborhood contains the minimum number of points, i.e., for which the object becomes a core object. The reachability distance of an object with respect to a second object is the greater of the core distance of the second object and the Euclidean distance between the two objects. The algorithm thus creates an ordering of the objects in a database and stores the core distance and a suitable reachability distance for each object. The major drawback of the OPTICS method is that, if there is no core object, the reachability distance between two objects is undefined.
The same basic scheme is applied in the Stream OPTICS algorithm, modified here by adding an iterative update of the threshold value and the concept of pruning in order to optimize memory and time complexity. The data stream arrives in small portions called data chunks. A windowing concept is used because stream data are huge: data are fitted into a window frame and then passed on. Parameters such as the window size, the threshold value, and the radius are set by the user. After that, a one-way online process is used in which data chunks are fitted into the window and the clustering process is applied. In the online phase, micro-clustering is performed using the basic DBSCAN algorithm. Micro-clustering selects a centroid, i.e., the mean value of the objects in the cluster, together with its nearest data points. The output consists of k clusters over n objects. The clustering is based on distance. These micro-clusters are then passed to the offline phase together with parameters such as the core distance, epsilon, a fixed value of the minimum number of points, and the generating distance. The proposed scheme is depicted in Figure 1.
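The two distances used by OPTICS can be written down as a short sketch. It follows the standard OPTICS definitions rather than any code from the paper; the function names, the use of `inf` to stand for "undefined", the toy points, and the convention that an object counts toward its own neighborhood are assumptions.

```python
from math import dist, inf

def core_distance(points, p, eps, min_pts):
    """Smallest radius at which p's neighborhood holds min_pts points (p included).

    Returns inf, standing for "undefined", when p is not a core object within eps.
    """
    dists = sorted(dist(p, q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return inf
    return dists[min_pts - 1]

def reachability_distance(points, o, p, eps, min_pts):
    """Reachability of o from p: max(core distance of p, distance from p to o)."""
    cd = core_distance(points, p, eps, min_pts)
    return inf if cd == inf else max(cd, dist(p, o))

# Toy data (assumed): three points on a line and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(core_distance(points, points[0], eps=3.0, min_pts=3))                     # 2.0
print(reachability_distance(points, points[1], points[0], eps=3.0, min_pts=3))  # 2.0
print(reachability_distance(points, points[2], points[3], eps=3.0, min_pts=3))  # inf
```

The last call shows the drawback mentioned above: the distant point is not a core object, so no reachability distance can be defined from it.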
In the offline phase, the Stream OPTICS algorithm is used for clustering. Macro-clustering takes place and the data points form clusters of good quality. Two-phase processing is required because of the nature of data streams. Points that have not yet become part of any cluster are distinguished by assigning them a weight. At every new iteration, this weight is incremented, to confirm that the point is a real outlier. By applying a threshold value to the weight, the points termed real outliers are detected. The nodes or data points that are outliers are then pruned off, which reduces the memory consumption and the time taken to generate the potential clusters. This improves the purity of the potential clusters. For each threshold value set, the whole data set is checked, with the primary aim of maintaining the quality of the clusters. The algorithm is broken down into steps in Figure 2.
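The weight-based pruning described above might be sketched as follows. This is an illustrative reading, not the authors' implementation: the function `update_outlier_weights`, the dictionary of weights, the `>=` comparison against the threshold, and the rule that a point joining a cluster is dropped from the candidate list are all assumptions.

```python
def update_outlier_weights(weights, unclustered_ids, threshold):
    """Advance the (hypothetical) outlier bookkeeping by one iteration.

    weights: dict mapping point id -> iterations spent outside every cluster.
    unclustered_ids: ids of the points that joined no cluster in this iteration.
    Returns (new_weights, pruned_ids).
    """
    new_weights = {}
    pruned = []
    for pid in unclustered_ids:
        w = weights.get(pid, 0) + 1
        if w >= threshold:
            pruned.append(pid)      # treated as a real outlier and discarded
        else:
            new_weights[pid] = w    # still only a candidate outlier
    # Points absent from unclustered_ids joined a cluster; their weights are dropped.
    return new_weights, pruned

weights = {}
for unclustered in (["a", "b"], ["a"], ["a", "c"]):
    weights, pruned = update_outlier_weights(weights, unclustered, threshold=3)
    print(weights, pruned)
# {'a': 1, 'b': 1} []
# {'a': 2} []
# {'c': 1} ['a']
```

In this reading, only points that stay outside every cluster for `threshold` iterations are removed, so a point that is merely late in joining a cluster is not discarded prematurely.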

III. RESULTS & DISCUSSIONS
Various parameters, namely cluster purity, number of clusters, SSQ, threshold, memory consumption, and time consumption, are evaluated. These parameters are then used for comparison and performance evaluation. NetBeans is used for the simulation studies in our research work. The Forest Cover Type data set and a sensor data set are used for evaluation, and the algorithms are run on 50,000 and 100,000 data records, respectively. Different threshold values were tested and an optimum value was chosen in each case (3 for the Forest Cover Type data set and 14 for the sensor data set). Tables II to V summarize the results.
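Two of the quality measures listed above can be sketched as follows. The paper does not give their formulas, so the sketch assumes definitions commonly used in stream clustering: purity as the average share of the dominant class label per cluster, and SSQ as the sum of squared Euclidean distances from each point to its cluster centroid. The function names and toy inputs are illustrative.

```python
from collections import Counter

def cluster_purity(clusters):
    """Average, over clusters, of the share of points with the dominant class label.

    clusters: list of non-empty lists of class labels, one inner list per cluster.
    """
    shares = [Counter(labels).most_common(1)[0][1] / len(labels) for labels in clusters]
    return sum(shares) / len(shares)

def ssq(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    clusters: list of non-empty lists of equal-length coordinate tuples.
    """
    total = 0.0
    for pts in clusters:
        centroid = [sum(c) / len(pts) for c in zip(*pts)]
        total += sum((x - m) ** 2 for p in pts for x, m in zip(p, centroid))
    return total

print(cluster_purity([["a", "a", "b"], ["b", "b"]]))   # (2/3 + 1) / 2
print(ssq([[(0.0, 0.0), (2.0, 0.0)], [(1.0, 1.0)]]))   # 2.0
```

Higher purity and lower SSQ both indicate tighter, more homogeneous clusters, which is the effect that pruning real outliers is expected to have.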

IV. CONCLUSION
Handling data streams involves increased complexity due to their constant, huge, and potentially infinite nature. Working with data streams places demands on memory, space, and time, and requires coping with changes in the data as well as with high speed and multiple sources of data generation. Thus, the algorithms used for offline data mining and management may prove insufficient in such applications, and variations may be required. Such a variation of the OPTICS algorithm is proposed in this paper. Simulations are performed, the results are discussed, and an overall improvement is documented.

TABLE I.
APPROACH, ADVANTAGES, AND DISADVANTAGES OF EXISTING ALGORITHMS

TABLE II.
OVERALL RESULT SUMMARY OF FOREST COVER DATASET WITH DIFFERENT PARAMETERS AT VARIOUS THRESHOLD VALUES BEFORE PRUNING

TABLE III.
OVERALL RESULT SUMMARY OF FOREST COVER DATASET WITH DIFFERENT PARAMETERS AT VARIOUS THRESHOLD VALUES AFTER PRUNING

TABLE IV.
OVERALL RESULT SUMMARY OF SENSOR DATASET WITH DIFFERENT PARAMETERS AT VARIOUS THRESHOLD VALUES BEFORE PRUNING

TABLE V.
OVERALL RESULT SUMMARY OF SENSOR DATASET WITH DIFFERENT PARAMETERS AT VARIOUS THRESHOLD VALUES AFTER PRUNING