A Deep Learning Approach for Malware and Software Piracy Threat Detection

Internet of Things (IoT)-based systems need to keep pace with cybersecurity threats. The security of IoT networks is challenged by software piracy and malware attacks, through which important information can be stolen and used for cybercrime. This paper attempts to improve IoT cybersecurity by proposing a combined deep learning model to detect malware and software piracy across the IoT network. The malware detection model is based on Deep Convolutional Neural Networks (DCNNs). In addition, TensorFlow Deep Neural Networks (TFDNNs) are introduced to detect software piracy threats through source code plagiarism. The investigation is conducted on the Google Code Jam (GCJ) dataset. The experiments show that the classification accuracy reaches about 98%.

INTRODUCTION
Artificial intelligence (AI) approaches are growing rapidly through machine learning and deep learning technologies. Using AI in applications improves accuracy and efficiency, and the AI approach supports innovations in various fields [1][2][3][4][5][6][7][8][9][10]. The Internet of Things (IoT), as an interconnection of device-based sensors through the Internet, requires a safety mechanism based on AI methods to prevent attacks and intrusion. IoT devices are identified by unique Radio Frequency Identifier (RFID) tags and are connected via nodes. The IoT interconnection mechanism enables remote monitoring and control [9]. IoT connectivity is a universal mechanism to support cloud computing, service industries, and innovative applications. With the number of devices connected via the Internet exceeding 50 billion by 2021 [10], data security becomes a significant challenge. IoT technology faces a massive amount of data due to the growth of communication networks. Attackers exploit the IoT architecture to launch attacks through IoT devices. Pirated software and malware infection have been used to compromise the security of the industrial IoT cloud [11]. These methods attempt to reuse source code illegally and to use the system as a regular user. The attacker writes malware code by reverse engineering the logic of the original code [12]. This kind of attack is a severe threat because it allows unlimited downloads of pirated software. This issue is addressed by intelligent software plagiarism detection techniques that find the stolen source code in the illegal software. Intelligent software plagiarism detection is based mainly on test-based analysis and structure. The proposed techniques use many methods: similarity identification, clone detection, software birthmark investigation, and software bug analysis [12].
Software plagiarism detection based on the structure technique focuses on the basic structure of the source code, graph behavior, function call graphs, and syntax trees. These methods do not catch the attack if a different programming language is used to preserve the same behavior as the original software.
Providing secure IoT networks is the purpose of many malware detection and intelligent software plagiarism detection techniques. Malware attacks aim to compromise the privacy of IoT nodes, smartphones, and computer systems. Malware can be detected in two ways: static identification analysis and dynamic identification analysis. Dynamic analysis learns malware patterns while the code is executed in real time. The malware is detected by exploring function parameters, function calls, visual investigation of code, dataflow, and instruction traces. Some detection tools based on the dynamic behavior of malicious code, such as Anubis, TT analyzer, and CW Sandbox [13], are available online. Because they monitor every dynamic behavior, these tools are time-consuming. Static methods attempt to capture layout information without real-time execution. As a static method, the signature identification technique detects Windows-based malware via signatures such as opcode frequency, string signatures, and control flow graphs. Static methods rely on disassembling tools to extract hidden patterns from binary executables [14]. The byte sequence technique is also a static method; it extracts n-byte sequences as patterns.
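To make the byte sequence technique more concrete, the following is a minimal sketch of counting contiguous n-byte sequences in a binary blob. The function name `byte_ngrams` and the six-byte sample are hypothetical and chosen only for illustration; the cited works may extract and select their sequences differently.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous n-byte sequence in a binary blob."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the data, one byte at a time.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Hypothetical 6-byte sample standing in for the contents of an executable.
sample = bytes([0x4D, 0x5A, 0x90, 0x4D, 0x5A, 0x90])
counts = byte_ngrams(sample, n=2)

for sequence, frequency in counts.most_common():
    print(sequence.hex(), frequency)
```

The resulting frequency table is the kind of feature vector a static classifier could consume without ever executing the file.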

A. Software Piracy Detection
In most software plagiarism cases, the software is written in a single programming language and crackers change the control flow using a similar programming language. Authors in [15] proposed a method-based software benchmark to detect threats in Java source code. The authors retrieved structural features by extracting the control flow of source codes. The similarity is then computed between the benchmarks of two source codes. Authors in [16] attempted to measure the similarity between codes through a hybrid approach that considered compiler-level features. The authors applied an unsupervised learning approach to detect plagiarism. The proposed method computes the similar functionalities of different source codes. Authors in [17] introduced an approach to obtain the similarity features between C++ and C source codes. The proposed method was based on a source forager search engine to extract the features of every code. The control flow of the source code is identified based on the shape functionalities of the code. Authors in [18] computed the difference between two source codes using a logic-based approach. The semantic differences were captured using symbolic execution and precondition techniques. Authors in [19] tried to detect plagiarism in students' assignments using the Latent Semantic Analysis (LSA) method. The authors aimed to compare source codes with regard to syntactic structures. This objective was achieved by combining LSA with PlaGate to identify the similarity. Then, the syntax tree captured the syntactic view and the abstraction of the source code. Authors in [20] attempted to detect similar source code fragments using the parse tree kernel. The authors focused on Java files and showed that the achieved results were inaccurate. Therefore, a fingerprinting method was proposed instead of the parse tree kernel.
Authors in [21] introduced a behavioral approach called BPlag to detect source code plagiarism. The behavior was extracted using symbolic execution. Then, the code was converted to a novel graph-based format, and plagiarism was computed from these graphs. The authors showed that their approach was more accurate and more robust to plagiarism hiding than 5 source code plagiarism tools. Authors in [22] presented a combined approach, named EsaGst, using Greedy String Tiling and Explicit Semantic Analysis. The proposed method supported the detection of source code plagiarism independently of the programming language. The evaluation was conducted using different languages, including Java, C++, Python, PHP, and JavaScript. The results showed the good performance of the EsaGst approach.

B. Malware Detection
Malware detection is an open scientific topic. Authors in [23] used a machine learning approach to distinguish worms from the binaries of benign files based on sequences of variable-length instructions. The authors built a dataset including 1330 benign files and 1444 worms. The experiments achieved a classification accuracy of around 96%. Authors in [24] combined SVM-based machine learning and N-opcode sequences to detect malware. The detection process included the critical instruction sequence and cosine similarity. The findings demonstrated that similar malware shares common core signatures. The proposed malware detection method achieved an accuracy of around 98%. Authors in [25] introduced a method based on the dynamic analysis approach to highlight the limits of the static analysis approach. The authors proposed an obfuscation model to execute binary samples and identify significant behavioral features within a virtual machine. The evaluation showed that the static analysis approach is insufficient when the malware is obfuscated. Authors in [26] attempted to automate malware detection by identifying abnormal behavior within the program. The proposed idea provides little information about malicious behavior. Authors in [27] applied a classification approach based on the clustering method. The aim was to classify malware samples based on behavioral features. The proposed method was added to the Anubis system to track malware samples. The tracking report detailed the in-depth activities of the malware samples. Authors in [28] combined static and dynamic analysis to detect malware accurately. The static analysis handled the operation codes based on their frequency of occurrence. The dynamic analysis used execution traces of system calls and executable files. By taking advantage of each approach, the authors achieved a better result than with static or dynamic analysis alone.
Authors in [29] utilized Convolutional Neural Networks for malware detection. The stated purpose was to reduce time, size, and resource overhead. The proposed method classified image-based representations of malware with 98.5% accuracy. Authors in [30] tried to detect malware using the image similarity technique. The authors employed benign and vision research lab datasets to evaluate their method. Samples from executable files were converted to binary code. In the testing phase, the accuracy reached 98%. Authors in [31] established an enhanced method to detect malware variants based on the deep learning approach. The authors intended to obtain high accuracy at a low time cost. The suggested process transformed the malicious code into a grayscale image, then classified samples using the CNN based on significant features. The authors overcame the imbalance of data by applying the bat algorithm. According to the experimental results based on the research lab dataset, the computed accuracy and speed were sufficient. Authors in [32] applied a machine learning method to processor core events, using a hardware event counter to support malware detection. The purpose was to detect SPECTRE in time using on-chip hardware. The proposed hardware architecture was based on software agents. To predict malicious activity, the authors used several machine learning classifiers. The predictive results achieved an accurate detection.
The current paper's contributions can be summarized as follows:
• A Deep Learning approach based on the TensorFlow Deep Neural Network is proposed to detect software piracy through source code plagiarism.
• A Deep Convolutional Neural Network (DCNN) is proposed to detect malware through binary visualization.
• The two models are combined into the same architecture for the IoT case.
• The proposed architecture is evaluated according to adequate datasets.

A. Software Piracy Threat Detection Model
The proposed software piracy attack detection is based on deep learning. Detection relies on a plagiarism methodology applied to different types of source codes, since the pirated version uses the same logic as the original software. Once the traffic data are classified as software piracy, the source codes are tokenized to decrease the dimensions of the data. In that step, significant features are extracted using the TensorFlow framework. The Keras API for deep learning is applied to capture source code plagiarism. The first database (D1) includes the network traffic data collected from the Google Code Jam (GCJ) database. The D1 data are built from 100 programmers and contain about 400 source code documents. Software piracy threat detection is preceded by a preprocessing step. The purpose of this step is to divide the source code into small pieces. Then, the semi-code is converted into useful information, and noise is removed. Meaningful tokens are obtained during the tokenization step. Finally, the contribution of each token is amplified through a weighting mechanism, see (1), based on the Logarithm of Term Frequency algorithm [33] and the Term Frequency and Inverse Document Frequency (TFIDF).
$W(t, Doc, DS) = TF(t, Doc) \times IDF(t, DS)$ (1)
where t defines the token, Doc defines a document, DS represents all the documents used in the dataset, TF defines the Term Frequency function, and IDF defines the Inverse Document Frequency function.
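The weighting in (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the log-scaled term frequency 1 + log(count), the IDF form log(N / df), the helper names, and the three-document toy corpus are all choices made here for illustration.

```python
import math
from collections import Counter


def log_tf(token, doc_tokens):
    """Log-scaled term frequency (assumed form: 1 + log(count))."""
    count = Counter(doc_tokens)[token]
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(token, corpus):
    """Inverse document frequency (assumed form: log(N / df))."""
    df = sum(1 for doc in corpus if token in doc)
    return math.log(len(corpus) / df) if df > 0 else 0.0


def weight(token, doc_tokens, corpus):
    """W(t, Doc, DS) = TF(t, Doc) * IDF(t, DS), as in (1)."""
    return log_tf(token, doc_tokens) * idf(token, corpus)


# Hypothetical tokenized source files standing in for the dataset DS.
corpus = [
    ["for", "i", "in", "range", "n"],
    ["while", "i", "n"],
    ["for", "x", "in", "items"],
]
w_for = weight("for", corpus[0], corpus)      # appears in 2 of 3 documents
w_range = weight("range", corpus[0], corpus)  # appears in 1 of 3 documents
print(w_for, w_range)
```

Tokens that occur in few documents, such as `range` here, receive a larger weight than tokens shared by most documents, which is the effect the weighting step relies on.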
Deep learning is conducted with TensorFlow, which is used for high-level computations. The pirated software is identified through the extracted similar codes. A fully connected network with dense layers is utilized for input and output data. The first layer, which contains 100 neurons, receives the data. The second layer is composed of 50 neurons. The third layer consists of 30 neurons. The fourth dense layer produces the output variable that identifies the target of the plagiarized code. A dropout layer is used to mitigate the overfitting problem. The pattern is computed based on the rectifier (ReLU) activation method [34]:
$f(x) = \max(0, x)$ (2)
where x defines the input of the equivalent neurons.
The multi-class problem is handled using the sigmoid method defined by:
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$ (3)
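As an illustration of the fully connected architecture described above (dense layers of 100, 50, and 30 neurons followed by an output layer), here is a minimal forward-pass sketch in plain Python. It is not the TensorFlow/Keras model itself: the input size, the number of output classes, and the random weights are hypothetical, and dropout is omitted because it only acts during training.

```python
import math
import random


def relu(x):
    """Rectifier activation: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . x + b) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    """ReLU on hidden layers, sigmoid on the output layer."""
    out = features
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        out = dense(out, weights, biases, sigmoid if is_last else relu)
    return out


rng = random.Random(0)
n_features, n_classes = 20, 4  # hypothetical sizes, not taken from the paper
sizes = [n_features, 100, 50, 30, n_classes]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
scores = forward([rng.random() for _ in range(n_features)], layers)
print(scores)
```

Each output score lies strictly between 0 and 1; in a trained model the largest score would indicate the predicted class.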

B. Malware Threat Detection Model
The proposed malware threat detection model consists of two steps: preprocessing and a deep convolutional neural network. Raw binary files are used to generate color images, which turns the problem into an image classification problem. The adopted color system is grayscale, and features are extracted from the color image. A feature reduction method is used to enhance the classification performance by reducing the feature set. The generation of the color image from a binary file proceeds as follows: (1) generate the hexadecimal strings, (2) divide the hexadecimal strings into chunks of 8-bit vectors, (3) convert each 8-bit vector to a two-dimensional matrix, and (4) plot the two-dimensional space. Then, the Deep Convolutional Neural Network (DCNN) is utilized to identify the malware. The DCNN receives training images. The purpose of the convolution layer is to reduce noise and enhance signal features. It also reduces the over-fitting problem. The convolutional layer performs the computations using (4), where f defines the activation function, M is the set of given maps, b represents the bias constant, and k denotes the convolution kernel.
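The image generation steps listed above can be sketched as follows. This is a minimal illustration under assumptions: the function name `bytes_to_grayscale`, the fixed row width, and the zero padding of the last row are hypothetical choices, and the final plotting step is left out because it would require an imaging library.

```python
def bytes_to_grayscale(data: bytes, width: int) -> list:
    """Map a binary blob to a 2-D matrix of 8-bit grayscale intensities."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Step 1: generate the hexadecimal string of the raw binary.
    hex_string = data.hex()
    # Step 2: split it into 8-bit chunks (two hex digits each, values 0..255).
    pixels = [int(hex_string[i:i + 2], 16) for i in range(0, len(hex_string), 2)]
    # Assumption: pad the last row with zeros so every row has `width` pixels.
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    # Step 3: reshape the flat pixel vector into a two-dimensional matrix.
    return [pixels[r:r + width] for r in range(0, len(pixels), width)]


# Hypothetical 10-byte sample standing in for a raw binary file.
image = bytes_to_grayscale(bytes(range(10)), width=4)
for row in image:
    print(row)
```

Each matrix entry is one pixel intensity, so the matrix can be resized to the network's input dimensions and fed to an image classifier.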
The accuracy of the proposed DCNN is improved through the convolutional kernel width. The pooling layer reduces the data overhead and selects useful information. It minimizes the effect of image distortion using (5), where Pool() performs the pooling task.
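The pooling task can be illustrated with a short sketch. The text does not name the pooling operator, so max pooling over non-overlapping 2×2 windows is assumed here; the function name and the toy feature map are hypothetical.

```python
def max_pool2d(image, size=2):
    """Max pooling over non-overlapping size x size windows.

    Rows and columns that do not fill a complete window are dropped.
    """
    rows = len(image) - len(image) % size
    cols = len(image[0]) - len(image[0]) % size
    return [
        [
            max(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]


# Hypothetical 4x4 feature map produced by a convolution layer.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```

The 4×4 map shrinks to 2×2 while the strongest response in each window is kept, which is how pooling reduces data overhead and tolerates small distortions.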
The output of the pooling layer is classified at the fully connected layer, which aims to enhance the model by reducing the over-fitting issue. The noise is removed using filters. Then, the proposed DCNN is trained using the Softmax-Cross-Entropy loss [35], where r denotes the rank of class k. The parameters are learned by minimizing this loss with the Adam optimizer.
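For reference, the standard softmax cross-entropy computation can be sketched as below. This is the textbook definition written over raw class scores (logits), not necessarily the exact formulation of [35]; the function names and sample logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[true_class]


# Hypothetical scores for three classes; class 0 is the true label.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = softmax_cross_entropy(logits, true_class=0)
print(probs, loss)
```

The loss is small when the true class receives the highest score and grows as probability mass shifts to other classes; an optimizer such as Adam adjusts the network parameters to reduce it.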

A. Software Piracy Detection Performance Evaluation
The evaluation is based on the code similarity between the pirated software and the source software using the GCJ dataset [36]. The similarity is checked using the Codeleaks plagiarism tool [12]. The dataset first goes through the preprocessing step to provide the valuable tokens of each source code, such as root word, stemming, token length, and token frequency. Then, the TFIDF and LogTF algorithms are applied to conduct token weighting. The accuracy of the classification improves with the number of neurons. The evaluation is shown in Figure 2 based on validation accuracy, validation loss, and loss. The loss curves (Figure 2) start from 0.75 and follow the same trend down to 0.3. A fluctuation can be seen in the loss curve, but both curves decrease in a similar way. From the accuracy curves, we can see that the proposed software piracy threat detection model achieved an accuracy of about 98%.

B. Malware Detection Performance Evaluation
The evaluation measures the effect of the malware image ratio on the proposed model. The image size is taken as 180×180 and 196×196. We used the Leopard Mobile dataset [37], which is composed of 2486 benign and 14733 malware samples, for evaluation. The training phase uses 15219 samples, and the testing phase employs 2000 samples. According to the experiments, the 196×196 dimension reached better accuracy than the 180×180 dimension (see Table I). Therefore, the 196×196 ratio is more suitable for the proposed model. The comparison in Table II shows that the proposed malware threat detection model is more accurate than previous studies that are based only on the machine learning approach. The proposed method requires only 18s of computation time.
IV. CONCLUSION
Recent industrial systems are migrating to industrial IoT to support new network services. Many security issues are related to IoT networks, especially malware threats and software piracy. Accurate cybersecurity defenses for IoT big data are needed. In this paper, a new security architecture based on the deep learning approach is proposed to detect malware attacks and pirated software. The proposed approach is a combined methodology to detect threats. A Deep Learning approach based on the TensorFlow Deep Neural Network is introduced to detect software piracy through source code plagiarism. Then, a DCNN is utilized to detect malware through binary visualization. The findings show that the proposed combined approach of DCNN and TFDNN achieves a classification accuracy of about 98%. The results of the proposed approach are better than the results obtained by related works. Speeding up the computation time to support real-time systems can be the purpose of future work. This target could be reached by proposing a high-security level hardware accelerator.