Enhanced-PCA based Dimensionality Reduction and Feature Selection for Real-Time Network Threat Detection


Abstract-With the growing amount of data collected and exchanged over networks, the threat of cyber-attacks has also increased significantly. Timely and accurate detection of intrusion activity in networks has become a crucial task. While manual moderation and programmed logic have been used for this purpose, machine learning algorithms are desirable for their superior pattern mapping. The system logs in a network tend to include many parameters, and not all of them indicate an impending network threat. Selecting the right features is thus important for achieving better results. High-dimensional features need to be mapped accurately to low-dimensional intermediate representations while retaining crucial information. In this paper, an approach for feature reduction and selection in network threat detection is proposed. The approach modifies the traditional Principal Component Analysis (PCA) algorithm to address its shortcomings and to minimize false detection rates. Specifically, it builds on the calculation of symmetric uncertainty and the subsequent sorting of features. The performance of the proposed approach is evaluated on four standard-sized datasets collected using the Microsoft SYSMON real-time log collection tool. The proposed method is found to perform better than the standard PCA and FAST methods for data reduction. The results make a strong case for the proposed approach as a dimensionality reduction and feature selection technique that reduces time complexity and minimizes false detection rates when operating on real-time data.
Keywords-principal component analysis; fast clustering; dimensionality reduction; machine learning; network security

I. INTRODUCTION AND RELATED WORK
Network security is of utmost importance, especially for companies or foundations. Any security breach is characterized by a pattern in the network logs preceding the attack, and these patterns, if detected accurately, can help avert major mishaps [1]. Previous approaches have relied on Security Information and Event Management (SIEM) systems coupled with manual moderation for scanning network threats. SIEMs are associated with a Security Operations Center (SOC), to which they report these threats [2][3]. They conform to the laws on risk and regulation criteria and use predefined rules for catching network breaches and supporting Incident Response (IR). However, similar to manual moderation, SIEMs have certain limitations. They operate on a static set of vulnerability rules and hence cannot detect the trends of a novel attack. Further, they carry an operational overhead and hence come up short when working with real-time data. The performance of SOCs has worsened over the years, and they are no longer sufficient for large-scale networks. Machine learning algorithms help in deriving rules and making accurate predictions even on previously unseen data. The behavioral features of the system logs can be used for attack detection. However, the system logs consist of many parameters, and not all of them contribute to the detection of network threats. Previously, methods like Principal Component Analysis (PCA) and FAST clustering have been used for eliminating redundant features from high-dimensional data [3,4]. While PCA focuses on the derivation of a set of orthogonal eigenvectors, FAST clustering focuses on an efficient but less effective grouping of features. In this paper, an approach for dimensionality reduction and feature selection is proposed which ameliorates the shortcomings of the PCA algorithm and improves the performance of the system.
Specifically, the FAST and PCA approaches were combined in a novel way, and machine learning algorithms were subsequently used for detecting anomalies that deviate from normal configurations, hence predicting possible threats to the system. The main contributions of this paper are:
• The formulation of a new approach for dimensionality reduction and feature selection that improves on the conventional approaches.
• The combination of FAST and PCA algorithms in a unique manner for efficient, yet effective dimension mapping.
• A network threat detection approach that has better performance than the previous approaches when working on a real-time data set.
There has been ample work in the past on anomaly detection on unseen data [5][6][7][8][9]. Most of this work can be divided into three types based on its nature: statistical, distance-based, and density-based techniques. Each of these has some shortcomings. Statistical methods often assume a univariate distribution, which most often does not hold for real-world data. Distance-based methods struggle with overlapping data, while density-based methods may be useful but are computationally very expensive. The authors in [10] proposed a primitive regression method to detect the presence of unauthorized users masquerading as registered users by comparing their activity to the previous actions from those accounts. Various unsupervised learning algorithms have been applied; however, most of them have huge memory requirements [2,11]. Clustering has been used [12][13][14], but incorrect grouping leads to a higher risk of false negatives. As a result, dimensionality reduction becomes a crucial part of the process, and the majority of approaches have focused on the use of PCA for this task [3,15]. PCA aims to derive new variables, called Principal Components (PCs), as linear combinations of the input variables, so that a few of the new variables reflect the overall variation among the input variables. The authors in [19,20] provide a network intrusion prevention approach that tackles the problem of controlling high computer network traffic and the time pressure of handling security threats. They use multivariate analysis techniques, including clustering and PCA, to identify classes in the observed data.
The use of PCA is appealing due to its statistical consistency, faster inference, and effective computation [11]. Yet its use for network intrusion detection has drawn valid criticism for several reasons, such as its struggle with rare class labels and its exponential complexity [4,7]. Examples of this are [18,19], where a combination of recorded variation and the scree plot procedure was used for selecting the key components, which may be risky as an anomaly among the lesser-known components may be ignored. The existence of unusual data from regular operations is seen in [20], where the criteria for choosing components for PCA remain unanswered. A large number of variables could be reported to explain network traffic behavior [21]. For example, the authors in [22] assume a certain expectation for the collection methods of the variables. Typically, every input variable has a non-zero loading on all PCs. However, in practice, it is common that the majority of the component loadings are close to zero [23] and are of less practical significance. Recently, some hybrid approaches have been proposed [24][25][26], but they have higher training and inference costs and increased complexity. Wormhole attacks have been tackled using separate routing [27], but the method does not work on structured log data. While synthetic oversampling and undersampling approaches may work well on text data [28], they are often detrimental to structured network log data. Hence, there is a need to address the shortcomings of the previous methods and to come up with a more accurate and efficient solution.
II. PROPOSED METHODOLOGY
In this paper, we develop an enhanced feature selection and reduction technique that ameliorates the shortcomings of the previous approaches. The diagrammatic representation of the proposed algorithm is shown in Figure 1. The phases of the algorithm are defined in the following subsections.

A. Log Files and Feature Extraction
The Microsoft SYSMON log process collection tool was utilized for extracting the logs from the given system. These logs are collected in real-time and have a certain number of input attributes associated with them. These attributes are described in detail below. The data features are stored together in a matrix format for subsequent processing.

B. Calculation of Symmetric Uncertainty
Symmetric Uncertainty (SU) is used to select features by measuring the fitness between each feature and the target class. Symmetric uncertainty is defined as:

$$SU(X, Y) = \frac{2 \cdot Gain(X|Y)}{H(X) + H(Y)} \tag{1}$$

where Gain(X|Y) = H(X) - H(X|Y) is the information gain, i.e., the amount by which the entropy of X decreases once Y is known (the gain is symmetric, so the same holds with the roles of X and Y exchanged), and H(X) is the entropy of a discrete random variable X. If the prior probability of each element x of X is p(x), then H(X) can be calculated as:

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) \tag{2}$$

Dividing by the sum of the entropies compensates for the bias toward features with many values and normalizes SU to the range [0, 1]. An SU value of 1 indicates that the information value of one variable is fully represented by the other, whereas an SU(X, Y) value of 0 indicates that X and Y are independent variables. Such a mathematical representation also helps in normalizing continuous features to a discrete form. These SU values are used for the calculation of the eigenvalues and the eigenvectors.
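As an illustration of the SU calculation, the following is a minimal Python sketch that assumes features have already been discretized. The function names and the toy data are hypothetical and are not taken from the proposed implementation. The two boundary cases described above (SU of 1 for a fully determined variable, SU of 0 for independent variables) are reproduced on an eight-sample example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y): entropy of x that remains once y is known."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * Gain(X|Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # assumption: two constant variables are treated as uninformative
    gain = hx - conditional_entropy(x, y)
    return 2.0 * gain / (hx + hy)


# Hypothetical toy data: a binary threat label and two discrete attributes.
label = [0, 0, 1, 1, 0, 1, 0, 1]
copy_of_label = list(label)            # fully determined by the label
unrelated = [0, 1, 0, 1, 1, 0, 0, 1]   # empirically independent of the label

su_copy = symmetric_uncertainty(copy_of_label, label)
su_unrelated = symmetric_uncertainty(unrelated, label)
print(su_copy, su_unrelated)
```

Features can then be ranked by their SU with respect to the class label; how that ranking feeds into the later steps is specific to the proposed method and is not reproduced here.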

C. Eigenvalues and Eigenvectors
Eigenvalues are used to calculate the dependency of features and their correlation with the class values. The eigenvector with the highest uniqueness is considered the most characteristic feature, implying the highest variance [4]. Similarly, the second most unique vector is considered the second characteristic principal variable with information retention. In this way, the top N eigenvectors are calculated and represented using a co-occurrence matrix, where N is the number of eigenvectors with the highest contribution rates. If M is an n×n matrix, then a non-zero vector v is an eigenvector of M if:

$$M v = \lambda v \tag{3}$$

where λ is the eigenvalue associated with v. For a given eigenvector v of M and a non-zero scalar a, the scaled vector av is also an eigenvector with the same eigenvalue:

$$M (a v) = a (M v) = \lambda (a v) \tag{4}$$

The eigenvectors can therefore always be scaled so that each has unit length:

$$\lVert v_i \rVert = \sqrt{v_i^{\top} v_i} = 1 \tag{5}$$

For a symmetric matrix such as a covariance matrix, the n eigenvectors can be chosen orthogonal to each other. Thus, they can be used as the basis of a new n-dimensional vector space. It is crucial to analyze the security of information from different unknown sources or attacks [10]. A fixed policy can never detect newly created threats. This examination can be done on different network trees. Once the FAST calculations of SU are done, the data are represented using the PCA method. A multidimensional hyperspace is hard to visualize.
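The eigenvector properties used in this subsection can be checked numerically. The sketch below is an illustration only: it uses power iteration, a generic textbook technique that is not claimed to be part of the proposed method, on a hypothetical 2×2 symmetric matrix. It verifies that the result satisfies Mv = λv, that a scaled copy is still an eigenvector, that the vector has unit length, and that the perpendicular direction is the second eigenvector.

```python
import math


def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]


def dominant_eigenpair(M, iterations=200):
    """Power iteration: eigenvalue of largest magnitude and its unit eigenvector."""
    v = normalize([1.0] + [0.5] * (len(M) - 1))  # arbitrary non-zero start vector
    for _ in range(iterations):
        v = normalize(mat_vec(M, v))
    lam = dot(v, mat_vec(M, v))  # Rayleigh quotient of a unit vector
    return lam, v


# Hypothetical symmetric matrix with eigenvalues 3 and 1.
M = [[2.0, 1.0], [1.0, 2.0]]
lam, v = dominant_eigenpair(M)

# M v = lam v, up to floating-point error.
residual = max(abs(a - lam * b) for a, b in zip(mat_vec(M, v), v))

# A scaled eigenvector a*v has the same eigenvalue.
scaled = [3.0 * c for c in v]
scaled_residual = max(abs(a - lam * b) for a, b in zip(mat_vec(M, scaled), scaled))

# In two dimensions the second eigenvector of a symmetric matrix is perpendicular to v.
w = [v[1], -v[0]]
lam_w = dot(w, mat_vec(M, w))
print(lam, lam_w, dot(v, w))
```

The orthogonality checked at the end relies on M being symmetric, which is the case for covariance matrices; it does not hold for arbitrary square matrices.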
The key objectives of unsupervised learning approaches are to minimize dimensionality, to rank all identifications based on a composite index, and to cluster related identifications together based on multivariate attributes. Since it is difficult to imagine a multidimensional space, PCA is used to scale down the dimensionality of multivariate parameters into three dimensions. In this paper, PCA is used for multivariate analysis. The data set can be represented as a matrix X with n rows and m columns, where the rows are the samples and the columns are the attributes. Some multivariate methods presume a structure, while others separate the cases into classes in an attempt to figure out the structure of the data. The former is a case of supervised learning, while the latter indicates unsupervised learning.
In PCA, the interrelated multivariate parameters are mapped to a set of uncorrelated components, each expressing a distinct linear combination of the original variables. The uncorrelated components extracted are the PCs and are obtained from the covariance matrix of the original variables. PCA aims at parsimony and minimum dimensionality by extracting the smallest number of PCs that account for most of the variation in the original multivariate information, summarizing the data with minimum information loss. The variance of an attribute is defined as:

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{6}$$

where x_i is the value of the attribute in sample i and x̄ is its mean over the n samples. In this method, the last principal component score needs to be zero. All the given variables are scattered on a hyperplane. If the interrelations change, the information may extend outside the hyperplane. This is reflected in changes to the principal component scores that were previously zero. Zero scores are the most sensitive to changes in the interrelations. The principal component scores are therefore powerful for observing any changes in the information. PCA represents the normal data distribution by selecting a suitable orthogonal coordinate system. The range of the principal component scores is more suitable for expressing the normal data distribution than the ranges along the original orthogonal coordinates of the sensed and actuated variables.
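The role of the zero-valued last principal component score can be illustrated with a small two-attribute example. The sketch below is a simplified illustration under stated assumptions: the data are hypothetical, the variance uses an n - 1 denominator, and the eigenpairs of the 2×2 covariance matrix are obtained in closed form, which only works in two dimensions. Training points lie exactly on a line (a one-dimensional hyperplane), so their last PC score is zero; a new point that breaks the relationship between the attributes yields a clearly non-zero score.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pca_2d(xs, ys):
    """Eigenpairs (eigenvalue, unit eigenvector) of the 2x2 covariance matrix,
    largest eigenvalue first. Assumes the two attributes are correlated."""
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    if b == 0.0:
        raise ValueError("attributes are uncorrelated; the axes are already the PCs")
    mid = (a + c) / 2.0
    half = math.hypot((a - c) / 2.0, b)
    pairs = []
    for lam in (mid + half, mid - half):
        vx, vy = b, lam - a  # solves (M - lam * I) v = 0 when b != 0
        length = math.hypot(vx, vy)
        pairs.append((lam, (vx / length, vy / length)))
    return pairs


def score(point, center, component):
    """Principal component score: projection of the centered point."""
    return sum((p - m) * w for p, m, w in zip(point, center, component))


# Hypothetical "normal" data in which the second attribute is twice the first.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
center = (mean(xs), mean(ys))
(first_var, first_pc), (last_var, last_pc) = pca_2d(xs, ys)

normal_scores = [score(p, center, last_pc) for p in zip(xs, ys)]
changed_score = score((3.0, 9.0), center, last_pc)  # relationship between attributes broken
print(last_var, normal_scores, changed_score)
```

For real log data with many attributes, a general eigen-decomposition routine would replace the closed-form 2×2 solution used here.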

D. Threshold Calculations
After the eigenvectors are calculated, they are sorted in descending order and the feature class mean value is calculated. This value is used for deciding the threshold value for obtaining the features. The following algorithm indicates the entire process followed by the enhanced PCA algorithm.
Input: attribute matrix X ∈ {x_1, x_2, ..., x_n}, output Y, threshold t
Output: reduced representation S
1. For each

III. DATASET DESCRIPTION
For evaluation on a real-time dataset, data were collected from logged data using the Microsoft Sysmon data collector tool [30]. It provides information on certain parameters present in the network, which are considered the input variables for the network intrusion detection task. For a more generalized evaluation and the elimination of bias, we collected four different sets of data of varying sizes. The individual sizes of the datasets are given in Table I.

IV. RESULTS AND DISCUSSION
The obtained results are presented on two levels: the output of the dimensionality reduction and the performance on task-specific metrics. While the output variables produced by the dimensionality reduction algorithm can vary, we judge their effectiveness by comparing the results with those obtained by models that use other dimensionality reduction methods. Table II lists the output variables that the proposed approach deemed most important and relevant to the task at hand. The threshold was set to 66%, and thus the total number of variables was reduced to 15. The performance of the proposed approach is evaluated with two metrics: accuracy and inference time. These are defined in (7) and (8):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\text{Inference time} = T_{Output} - T_{Input} \tag{8}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Inference or prediction time is defined as the amount of time required for the system to output the prediction for a given input set. It is found by subtracting the time at which the input was given from the time at which the output was predicted. The results of the three methods (FAST, PCA, and enhanced PCA) are compared on each of the four datasets. The accuracy obtained for the three approaches is shown in Table III.
It can be seen that the proposed approach outperforms the standard approaches and shows a clear improvement over the normal PCA approach. Table IV shows the evaluation of the tested methods in terms of inference time. While the proposed approach is not the best regarding inference time, it still performs better than the PCA algorithm. Although the FAST algorithm has a faster inference time, the proposed approach is better in terms of accuracy, which is more important in the case of network intrusion detection systems. The graphical comparison of the three approaches in terms of accuracy and inference time is shown in Figures 2-3. Thus, the proposed approach strikes a balance between effectiveness and efficiency.
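The two evaluation metrics can be computed as in the following minimal Python sketch. It is an illustration only: `toy_predict` is a hypothetical stand-in for a trained classifier, the data are made up, and the timing uses the standard library's `time.perf_counter`, which is an assumption about how the time stamps might be taken rather than a description of the setup used in the experiments.

```python
import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth, (TP + TN) / total."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def timed_predict(predict, inputs):
    """Return (predictions, T_output - T_input), with the time in seconds."""
    t_input = time.perf_counter()
    predictions = predict(inputs)
    t_output = time.perf_counter()
    return predictions, t_output - t_input


def toy_predict(rows):
    """Hypothetical classifier: flags a row as a threat if its first value exceeds 0.5."""
    return [1 if row[0] > 0.5 else 0 for row in rows]


rows = [(0.9,), (0.1,), (0.7,), (0.2,)]
truth = [1, 0, 0, 0]

predictions, elapsed = timed_predict(toy_predict, rows)
acc = accuracy(truth, predictions)
print(acc, elapsed)
```

In practice the inference time of a single call is noisy, so averaging over repeated runs gives a more stable comparison between methods.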

V. CONCLUSION
In this paper, a new approach was presented for dimensionality reduction and the subsequent prediction of intrusion threats to a network system. We were able to reduce time complexity and false detection rates, especially on a real-time data set. The proposed approach derived inspiration from the traditional PCA algorithm but ameliorated its shortcomings and further improved its performance. Specifically, we made use of symmetric uncertainty, entropy, and associated factors to boost the feature selection of the PCA algorithm. The Sysmon tool was used for the collection of data. The performance of the proposed approach was evaluated on four different datasets and was compared with the standard PCA and FAST approaches. The proposed approach was found to be more accurate while retaining a satisfactory inference time. An increase of over 2% in accuracy was observed compared to the standard approach. The reduction time was faster than that of the regular PCA approach by more than 15% in all cases. These results suggest that the proposed approach is preferable to the previous approaches. Future work in this domain includes coupling other machine learning algorithms with the dimensionality reduction method for enhanced results. Other preprocessing methods can also be used so that the initial variables are made more model-friendly. Genetic algorithms can be considered as an alternative for an enhanced optimization strategy. The proposed approach can be considered a small contribution to the creation of timely and accurate network intrusion detection systems.