A Neural Network-Based Multi-Label Classifier for Protein Function Prediction

Knowledge of protein functions is vital for gaining deep insight into many biological studies. However, wet-lab determination of protein function is prohibitively laborious, time-consuming, and costly. These challenges have created opportunities for automated prediction of protein functions, and many computational techniques have been explored. These techniques, however, entail excessive computational resources and turnaround times. The current study compares the performance of various neural networks in predicting protein function. These networks were trained and tested on a large dataset of reviewed protein entries from nine bacterial phyla, obtained from the Universal Protein Resource Knowledgebase (UniProtKB). Each protein instance was associated with multiple Gene Ontology (GO) terms of the molecular function category, making the problem one of multi-label classification. On this dataset, single-hidden-layer neural networks with a modest number of neurons showed superior performance. Moreover, a useful set of features that can be deployed for efficient protein function prediction was identified.
Keywords-gene ontology; molecular function term; multi-label classification; neural network; protein function prediction

I. INTRODUCTION
Understanding protein functions plays a vital role in gaining insight into the molecular mechanisms operating in both physiological and disease conditions. As a result, this understanding supports drug discovery for different diseases [1]. However, predicting protein functions is an arduous task. This is evident from the very large number of unannotated protein entries hosted by the most comprehensive protein database, the Universal Protein Resource Knowledgebase (UniProtKB) [2]. This is mainly due to the reliance on traditional experimental annotation techniques carried out by molecular biologists. The gap between reviewed and unreviewed protein sequences is widening due to the data deluge from high-throughput, state-of-the-art sequencing techniques [1, 3-5]. The pressing demand for computational methods for the functional annotation of proteins has paved the way for significant contributions by computer science researchers. Many computational techniques employing machine learning for the functional annotation of proteins have been reported in the literature. The principal difference between the various approaches lies in the set of features pursued by different investigators. This section presents a brief summary of some of the most prominent efforts in this area.
An ensemble of Deep Neural Networks (DNNs) was proposed in [1], where each DNN worked on a different set of features from the dataset. The predictions of the different DNNs were then combined by voting to arrive at the final protein function prediction. A DNN for the hierarchical multi-label classification of protein functions, designed to perform well even with a limited number of training samples, was presented in [3]. In [4], a DNN was introduced to learn features from word embeddings of protein sequences, based on concepts from Natural Language Processing (NLP), using sequence similarity profiles as additional features to locate proteins. The authors of [5] established the efficacy of exploiting interrelationships among different functional terms. For instance, different functional classes were found to coexist in some proteins, suggesting a mutual relationship. Furthermore, a model quantifying these relations was proposed, using a functional similarity measure and a framework to capitalize on it for the eventual prediction of protein functions. A classification technique based on a neural network coupled with a Support Vector Machine (SVM) was demonstrated in [6], utilizing a bidirectional Long Short-Term Memory (LSTM) network to generate fixed-length protein vectors from variable-length sequences and thereby deal with the challenges posed by the variable length of protein sequences. In [7], protein sequence motifs were used to build a deep convolutional network for predicting protein function, and the authors claimed to have built the best-performing model for the cellular component classes. The significance of Protein-Protein Interaction (PPI) and time-course gene expression data as powerful predictors of protein function was shown in [8]. A method called Dynamic Weighted Interactome Network (DWIN) was proposed that, in addition to PPI and gene expression data, also took into account information related to protein domains and complexes to improve the prediction performance.
In [9], clustering was applied to a PPI network for the prediction of protein function. A protein graph model constructed from protein structure, with each node representing a cluster of amino acid residues, was shown in [10]. However, the use of an accuracy metric for evaluation is generally misleading. In [11], an active learning approach was explored for the prediction of protein function using a PPI network. This method operated in two phases: spectral clustering was used to cluster the PPI network, followed by the application of the betweenness centrality measure for labeling within each cluster; the labeled protein data were then used by a classification algorithm. Associations between functions in a PPI network were used in [12], which stated that the multiple function labels assigned to proteins are not independent and that their coexistence can be used effectively to predict protein function. A deep semantic text representation was presented in [13], together with various pieces of information extracted from protein sequences such as homology, motifs, and domains. Protein function prediction was carried out using a consensus between text-based and sequence-based methods. In [14], a classifier using cumulative iterations was proposed, based on semantic similarity with Gene Ontology (GO) terms. Each prediction was followed by updating and optimizing the scores of characteristic terms in the set of GO annotations, which, in turn, led to improved future predictions. The dissimilarity of protein functions, rather than conventional similarity measures, was used in [15] to segregate rare and frequently occurring classes of functions. This technique worked well for imbalanced datasets.
The contributions cited above are just a handful of the numerous efforts towards the prediction of protein function. These endeavors differ in the protein information utilized by the corresponding systems and in the computational or time complexity of the classification models. The current paper presents a neural network-based multi-label classifier for the prediction of protein function, obtained by training and testing several neural networks on a large dataset [16]. The results indicate that a neural network with a single hidden layer achieved remarkable prediction performance with nominal computational complexity. This makes its implementation viable on systems with modest hardware capabilities. Consequently, the time required for the classification task is on the order of seconds.

A. Dataset
The dataset adopted from [16] includes 121,378 protein instances. These labeled protein examples were extracted from UniProtKB [2], a comprehensive worldwide repository of protein information. The protein entries pertain to 9 bacterial phyla, namely Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes, and Tenericutes. Each instance in the dataset had 9,890 features. These features included the sequence of amino acids making up the corresponding protein; the compositions of amino acids, dipeptides, and tripeptides; the compositions of five groups of amino acids (aliphatic, aromatic, positively charged, negatively charged, and uncharged); and various structural and physicochemical properties derived from the amino acid sequence. In addition, some features quantify conjoint triads. A conjoint triad is a unit of three successive amino acids in which each amino acid belongs to one of the seven groups formed on the basis of the dipole and volume scale [17]. These characteristic values indicate the strength of interaction between the amino acids of these 7 groups. The feature set also contains pseudo amino acid compositions for the corresponding protein. As suggested in [18], these values compensate for the loss of the sequence-order effect caused by considering plain amino acid compositions alone. Moreover, 541 motifs are included in the features. These are small segments in proteins' tertiary structure that are frequently found in different proteins, and such recurring patterns are associated with the structural or functional roles of proteins.
There are 1,739 binary labels associated with each protein instance. These labels correspond to GO terms belonging to the Molecular Function (MF) category. The GO categorizes biological functions into three broad classes of GO terms: Molecular Function (MF), Cellular Component (CC), and Biological Process (BP) [19]. An MF term specifies a biochemical activity performed by a gene product, without taking into account the time and space dimensions of this activity. Enzyme is an example of an MF term. A CC term refers to the location in the cell of the biochemical activity of a gene product. Ribosome and nuclear membrane are two such examples. A BP term, the most all-encompassing of the three, defines a biological objective to which the activities of various gene products contribute. Cell growth and maintenance serve as examples of the BP term.

B. Data Preprocessing
The Comma Separated Values (CSV) files for the 9 bacterial phyla were combined into a single data frame using the Pandas data analysis library in Python [20]. Duplicate rows were removed from the data frame, which was then converted to an array using the scientific computing library NumPy in Python [21]. The feature values were then scaled using the standard scaler available in the scikit-learn library in Python [22]. Normalization and the robust scaler were also investigated, but both proved inferior to standard scaling.
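The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the toy frames, the column names, and the assumption that feature columns precede label columns are all hypothetical, and in practice each frame would come from `pd.read_csv`. Fitting the scaler on the whole array is also a simplification, since the text does not say whether it was fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def load_and_scale(frames, n_features):
    """Combine per-phylum frames, drop duplicate rows, and standard-scale the features."""
    df = pd.concat(frames, ignore_index=True).drop_duplicates()
    data = df.to_numpy(dtype=float)
    # Assumed layout: the first n_features columns are features, the rest are labels.
    X, Y = data[:, :n_features], data[:, n_features:].astype(int)
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
    return X_scaled, Y


# Toy stand-ins for two of the nine per-phylum CSV files (hypothetical values).
phylum_a = pd.DataFrame({"f1": [0.1, 0.4, 0.1], "f2": [2.0, 3.0, 2.0], "go_1": [1, 0, 1]})
phylum_b = pd.DataFrame({"f1": [0.7, 0.2], "f2": [5.0, 1.0], "go_1": [0, 1]})

X, Y = load_and_scale([phylum_a, phylum_b], n_features=2)
print(X.shape, Y.shape)
```

The first and third rows of `phylum_a` are identical, so one of them is dropped before scaling.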

C. Features Partitioning
The neural networks were trained on 3 sets of features. The objective of partitioning the features into subsets was to test the hypothesis that the compositions of amino acids, dipeptides, and tripeptides are sufficient to predict protein functions. Let $F = \{F_1, F_2, F_3\}$ denote the feature sets used to train the different models, where $F_1$ was the entire set of 9,890 features and $F_2$ was the set of 8,420 features that contained only the compositions of amino acids, dipeptides, and tripeptides. The set $F_3 = F_1 - F_2$ contained 1,470 features consisting of various properties and characteristics derived from proteins as described in subsection A.
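The sizes of the three feature sets can be checked with a short calculation. The sketch below notes that 8,420 coincides with 20 + 20² + 20³, the number of possible single residues, dipeptides, and tripeptides over the 20 standard amino acids. The column layout used to build the index arrays (composition columns first) is a hypothetical assumption for illustration only.

```python
import numpy as np

N_AMINO_ACIDS = 20  # standard amino acid alphabet

# Number of possible amino acid, dipeptide, and tripeptide compositions.
n_composition = sum(N_AMINO_ACIDS ** k for k in (1, 2, 3))  # 20 + 400 + 8000

n_f1 = 9_890          # all features (from the text)
n_f2 = n_composition  # composition-only subset
n_f3 = n_f1 - n_f2    # remaining derived properties

# Hypothetical layout: composition columns first, derived properties after them.
columns = np.arange(n_f1)
is_composition = columns < n_f2
f2_idx = columns[is_composition]
f3_idx = columns[~is_composition]

print(n_f2, n_f3)  # 8420 1470
```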

D. Neural Networks
A variety of neural networks, differing in the number of hidden layers and the number of neurons in each layer, was selected to train the protein function classification system on the datasets corresponding to each of the feature sets $F_1$, $F_2$, and $F_3$. The experimental results are given in Section III. The simplest neural network, containing a single hidden layer, demonstrated better performance on this dataset than neural networks with more hidden layers. For the best-performing networks on feature sets $F_1$ and $F_2$, the optimal number of neurons in this single hidden layer was experimentally determined to be just 5% of the combined number of input and output neurons. For the $F_3$ feature set, however, the optimal number of neurons in the single hidden layer of the best-performing network turned out to be 50% of the combined number of input and output neurons.
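To give a sense of scale, the hidden-layer widths implied by these percentages can be computed from the feature and label counts given earlier. This is a rough sketch: the helper name is hypothetical, and rounding to the nearest integer is an assumption, since the text does not state how fractional values were handled.

```python
def hidden_units(n_inputs: int, n_outputs: int, fraction: float) -> int:
    """Hidden-layer width as a fraction of the combined input and output neurons."""
    return round(fraction * (n_inputs + n_outputs))


N_LABELS = 1_739  # output neurons, one per GO molecular function label

sizes = {
    "F1 (5%)": hidden_units(9_890, N_LABELS, 0.05),
    "F2 (5%)": hidden_units(8_420, N_LABELS, 0.05),
    "F3 (50%)": hidden_units(1_470, N_LABELS, 0.50),
}
for name, width in sizes.items():
    print(name, width)
```

Under this rounding assumption, the hidden layers for the first two feature sets have a few hundred neurons each, while the hidden layer for the third has roughly 1,600.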
Once the optimal number of neurons in a single hidden layer was determined, a second hidden layer was added to observe any potential boost in performance. In every two-hidden-layer model, the number of neurons in the second hidden layer was set to 50% of that in the first hidden layer, so that the network would capture the features most important for prediction. All time and memory measurements presented were taken on a reference computer with a 6-core Intel Core i7-8700 processor running at 3.2 GHz.
Table I summarizes the single-hidden-layer neural networks trained and tested on the $F_1$ feature set, i.e. the entire set of features in the dataset. Table II presents the M8 neural network with two hidden layers trained on $F_1$. This model was constructed by adding a second hidden layer to M2, the best-performing single-hidden-layer network, to explore any performance gain.
Table III summarizes the single-hidden-layer neural networks trained and tested on the $F_2$ feature set, i.e. the compositions of amino acids, dipeptides, and tripeptides in the protein sequence. Table IV presents the M14 neural network with two hidden layers trained on $F_2$. This model was developed by adding a second hidden layer to M9, the best-performing single-hidden-layer network, to investigate any improvement in classifier performance.
Table VI presents the M21 neural network with two hidden layers trained on $F_3$. This model was generated by adding a second hidden layer to M19, the best-performing single-hidden-layer network, to discover any potential performance enhancement.
For each network, we employed the relu activation for the hidden layers, the sigmoid activation for the output layer, the he_uniform kernel initializer for the hidden layers, and the Adaptive moment estimation (Adam) optimizer with a learning rate of 0.00001.
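The architecture described above can be illustrated with a small NumPy forward pass. This is only a sketch under stated assumptions: the layer sizes are toy values, the function names are hypothetical, the He-uniform limit sqrt(6 / fan_in) follows the usual definition of that initializer, and the output layer reuses the same initializer for brevity even though the text specifies it only for the hidden layers. Training with Adam is not shown.

```python
import numpy as np


def he_uniform(rng, fan_in, fan_out):
    """He-uniform initialization: U(-limit, limit) with limit = sqrt(6 / fan_in)."""
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_network(rng, layer_sizes):
    """Create one (weights, bias) pair per pair of consecutive layer sizes."""
    return [
        (he_uniform(rng, fan_in, fan_out), np.zeros(fan_out))
        for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(params, X):
    """ReLU on hidden layers, sigmoid on the output layer (one probability per label)."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W_out, b_out = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W_out + b_out)))


rng = np.random.default_rng(0)
n_inputs, n_hidden, n_labels = 30, 8, 5  # toy sizes, not those of the paper

one_layer = init_network(rng, [n_inputs, n_hidden, n_labels])
# Second hidden layer with 50% of the neurons of the first one.
two_layer = init_network(rng, [n_inputs, n_hidden, n_hidden // 2, n_labels])

X = rng.normal(size=(4, n_inputs))
print(forward(one_layer, X).shape, forward(two_layer, X).shape)
```

Because the output activation is a sigmoid applied independently to each label, every sample receives one probability per label, which is what makes the network suitable for multi-label prediction.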

E. Performance Evaluation
Since a protein example in this dataset can be mapped to more than one binary label, the prediction of protein function is a multi-label classification problem. The dataset is also highly imbalanced due to the overwhelming number of negative examples for each label. The evaluation of such a classification model cannot rely simply on the accuracy of prediction [23,24]: a classifier that predicts the majority (negative) class for every label would attain high accuracy without identifying a single positive example.

1) Precision
Precision is defined as the fraction of positively classified instances that are actually positive. This gives a clear picture of a classifier's strength in predicting the positive class. Letting TP and FP respectively denote the counts of true and false positives, Precision is calculated as:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$
A classifier that predicts the majority class for all instances has TP = 0, so its precision is taken to be 0, which duly penalizes its failure to predict the positive minority class. However, any classifier that makes just one positive prediction and ensures its correctness would have 100% precision despite its failure to predict the other positive examples. This calls for another classification metric, called Recall, also known as sensitivity.

2) Recall
Recall is defined as the fraction of positive examples in the dataset that are classified as positive. Letting FN denote the number of false negatives, Recall is given by:
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
This measure penalizes a classifier that attempts to achieve high precision simply by making a few correct positive predictions.
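The edge cases discussed above can be reproduced with a few lines of Python. The counts are hypothetical, the function name is illustrative, and returning 0 for an empty denominator is a convention assumed here. The harmonic mean of the two measures (the F1 score introduced in the next subsection) is included to show how it follows the lower of the two values.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and their harmonic mean from raw counts.

    Empty denominators yield 0.0 (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A label with 10 positive examples (hypothetical counts).
always_negative = precision_recall_f1(tp=0, fp=0, fn=10)  # never predicts positive
one_sure_hit = precision_recall_f1(tp=1, fp=0, fn=9)      # one cautious, correct positive
balanced = precision_recall_f1(tp=8, fp=2, fn=2)          # reasonable classifier

for name, scores in [("always negative", always_negative),
                     ("one sure hit", one_sure_hit),
                     ("balanced", balanced)]:
    print(name, [round(s, 3) for s in scores])
```

The second classifier reaches a precision of 1.0 with a recall of only 0.1, which is exactly the situation that recall is meant to expose.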

3) F1 Score
Precision and recall are combined in a single performance measure called the F1 score, which is their harmonic mean:
$$\text{F1 score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$
As the harmonic mean is biased towards lower values, the F1 score can be high only when both precision and recall are high. In multi-label classification, there are several ways to average the aforementioned performance metrics over all labels [25,26]. These are the micro average, macro average, weighted average, and samples average, as defined below. In each case, the F1 score is the harmonic mean of the corresponding precision and recall.

4) Micro Average
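This is calculated by counting the number of True Positives (TPs) across the entire set of target labels. If there are N samples in the dataset and each sample has L binary target labels, then the micro averages of Precision and Recall are calculated as:
$$P_{micro} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{L} \left(Y_{pred}[i,j] \wedge Y_{true}[i,j]\right)}{\sum_{i=1}^{N}\sum_{j=1}^{L} Y_{pred}[i,j]}, \qquad R_{micro} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{L} \left(Y_{pred}[i,j] \wedge Y_{true}[i,j]\right)}{\sum_{i=1}^{N}\sum_{j=1}^{L} Y_{true}[i,j]}$$
where $Y_{pred}$ and $Y_{true}$ are the predicted and actual target labels, respectively, and $[i,j]$ selects the j-th label of the i-th sample. The conjunction operator ∧ ensures the inclusion of only those label instances that are positive in both $Y_{pred}$ and $Y_{true}$, i.e. TPs.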

5) Macro Average
This averages the Precision and Recall scores of the individual target labels, giving equal weights to all of them.

6) Weighted Average
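This averages the Precision and Recall scores of the individual target labels, using the number of positive instances of each label in the set $Y_{true}$ as its weight:
$$P_{weighted} = \frac{\sum_{j=1}^{L} w_j P_j}{\sum_{j=1}^{L} w_j}, \qquad R_{weighted} = \frac{\sum_{j=1}^{L} w_j R_j}{\sum_{j=1}^{L} w_j}$$
where $P_j$ and $R_j$ are the Precision and Recall of the j-th label and $w_j$ denotes the weight, also known as the support, of the j-th label.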
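The three label-wise averaging schemes described so far can be compared on a toy example. The sketch below is illustrative: the function name and data are hypothetical, labels with no predicted or no actual positives are given a score of 0 by assumption, and the micro average assumes that at least one positive prediction and one positive label exist.

```python
import numpy as np


def label_averages(y_true, y_pred):
    """Micro, macro, and weighted averages of precision and recall over labels."""
    tp = (y_true & y_pred).sum(axis=0).astype(float)  # true positives per label
    n_pred = y_pred.sum(axis=0)                       # predicted positives per label
    support = y_true.sum(axis=0)                      # actual positives per label (w_j)

    # Per-label scores; empty denominators give 0 (assumed convention).
    p_j = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    r_j = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)

    return {
        "micro": (tp.sum() / n_pred.sum(), tp.sum() / support.sum()),
        "macro": (p_j.mean(), r_j.mean()),
        "weighted": (np.average(p_j, weights=support), np.average(r_j, weights=support)),
    }


# Five samples, three labels (hypothetical data).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0]])

averages = label_averages(y_true, y_pred)
for name, (p, r) in averages.items():
    print(f"{name:8s} precision={p:.3f} recall={r:.3f}")
```

One property worth noting is that the weighted-average recall always equals the micro-average recall, because weighting each label's recall by its support recovers the total true positives divided by the total positives.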

7) Samples Average
It averages the Precision and Recall scores across the samples.
This is the most faithful as well as the most conservative performance indicator of the multi-label classifier as it reflects, on the average, how well the classifier performed on each sample. Therefore, the sample averages were used to gauge the performance of the models.
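A minimal sketch of the samples average follows. The function name and data are hypothetical, samples with an empty denominator contribute 0 by assumption, and the F1 score is computed per sample and then averaged, which is one common convention; taking the harmonic mean of the averaged precision and recall instead would give a slightly different number.

```python
import numpy as np


def samples_averaged_scores(y_true, y_pred):
    """Per-sample precision, recall, and F1, each averaged over the N samples."""
    tp = (y_true & y_pred).sum(axis=1).astype(float)  # true positives per sample
    n_pred = y_pred.sum(axis=1)                       # predicted positives per sample
    n_true = y_true.sum(axis=1)                       # actual positives per sample

    # Empty denominators contribute 0 (an assumption of this sketch).
    p = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    r = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)
    f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(tp), where=(p + r) > 0)
    return float(p.mean()), float(r.mean()), float(f1.mean())


# Three samples, three labels (hypothetical data); the last sample gets no prediction.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])

p_avg, r_avg, f1_avg = samples_averaged_scores(y_true, y_pred)
print(round(p_avg, 3), round(r_avg, 3), round(f1_avg, 3))
```

The sample with no positive prediction drags all three averages down, which illustrates why this averaging reflects how well the classifier serves each individual protein.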

8) Zero-One Loss
For a multi-label classification problem, this measure credits a prediction as correctly classified only when all labels are correctly classified. The loss is zero for a correct prediction. However, if the classifier fails to make a correct prediction even for just one target label, the corresponding loss is 1. It follows that the zero-one loss is truly a conservative and highly penalizing performance measure.
The combination of the product operator Π and the exclusive-OR operator ⊕ ensures that any mismatch between L predicted and target labels generates a loss of 1 for any given sample. Otherwise, the loss is zero for a complete match between all predicted and target labels for a given sample.
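One way to write the measure described above, offered as an assumption-laden sketch rather than the authors' exact notation, uses the same $Y_{pred}$ and $Y_{true}$ matrices as the micro average, with N samples and L labels:

```latex
\mathrm{ZOL} \;=\; \frac{1}{N}\sum_{i=1}^{N}
\left[\, 1 - \prod_{j=1}^{L}\Bigl(1 - \bigl(Y_{pred}[i,j] \oplus Y_{true}[i,j]\bigr)\Bigr) \right]
```

Each factor of the product equals 1 when the j-th label matches and 0 when it does not, so the product is 1 only for a complete match and the bracketed loss for a sample is either 0 or 1.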

9) Hamming Loss
This gives the fraction of all incorrectly predicted labels by quantifying the number of incorrect predictions over all labels rather than penalizing individual examples. Hence, if a multi-label classifier incorrectly predicts 1 out of 10 labels for a given instance, the Hamming loss for that example is just 1/10, compared to 1 in the case of zero-one loss. It follows that the Hamming loss is lenient compared to the stringent zero-one loss.
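The contrast between the two losses can be checked directly on the 1-out-of-10 example from the text. The function names below are illustrative and the label vector is hypothetical.

```python
import numpy as np


def zero_one_loss(y_true, y_pred):
    """Fraction of samples with at least one wrongly predicted label."""
    return float(np.any(y_true != y_pred, axis=1).mean())


def hamming_loss(y_true, y_pred):
    """Fraction of all N x L label slots that are wrongly predicted."""
    return float((y_true != y_pred).mean())


# One sample with 10 labels, exactly one of which is predicted wrongly.
y_true = np.array([[1, 0, 0, 1, 0, 0, 0, 1, 0, 0]])
y_pred = y_true.copy()
y_pred[0, 3] = 0

print(zero_one_loss(y_true, y_pred), hamming_loss(y_true, y_pred))  # 1.0 0.1
```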

10) Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is a binary version of Pearson's correlation coefficient [27]. However, multiclass classification problems can also benefit from its extended version [28]. MCC compares the ground truth and predicted vectors, considering all possible prediction outcomes, i.e. True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Therefore, it gives a balanced evaluation of the performance of the classifier. The correlation coefficient lies in the range [-1, +1], with -1 indicating a completely wrong prediction, 0 a random prediction, and +1 a completely correct prediction.
The MCC was calculated for every example and its average was used to assess the performance of the classifier on the entire dataset.
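The per-example averaging described above can be sketched with the standard binary MCC formula, (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names are hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here rather than something stated in the text.

```python
import numpy as np


def mcc(y_true_row, y_pred_row):
    """MCC between two binary vectors; 0.0 when the denominator is zero (assumed)."""
    tp = int(np.sum((y_true_row == 1) & (y_pred_row == 1)))
    tn = int(np.sum((y_true_row == 0) & (y_pred_row == 0)))
    fp = int(np.sum((y_true_row == 0) & (y_pred_row == 1)))
    fn = int(np.sum((y_true_row == 1) & (y_pred_row == 0)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return float((tp * tn - fp * fn) / denom) if denom else 0.0


def mean_sample_mcc(y_true, y_pred):
    """Average of the per-sample MCC values over the dataset."""
    return float(np.mean([mcc(t, p) for t, p in zip(y_true, y_pred)]))


# Three samples sharing the same ground truth (hypothetical data).
y_true = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 1, 0, 0],   # perfect agreement
                   [0, 0, 1, 1],   # every label inverted
                   [1, 0, 1, 0]])  # half right, half wrong

per_sample = [mcc(t, p) for t, p in zip(y_true, y_pred)]
print(per_sample, mean_sample_mcc(y_true, y_pred))
```

The three samples cover the extremes of the range: perfect agreement, complete disagreement, and a prediction uncorrelated with the truth.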

11) Consolidated Performance Metric
For the sake of an all-encompassing and more realistic comparison of performance, the aforementioned metrics were combined in a single Consolidated Performance Metric (CPM) as follows: The CPM was constructed so that a higher value indicates better performance. Table VII shows the performance details of the neural networks (M1 to M21) as the samples averages of Precision (P), Recall (R), and F1 score (F1), together with the Zero-One Loss (ZOL), Hamming Loss (HL), MCC, and CPM.

III. RESULTS
A wide variety of neural networks was trained and tested on a large dataset of proteins. For feature sets $F_1$ and $F_2$, single-hidden-layer networks whose hidden layer contained 5% of the combined number of input and output neurons exhibited better prediction performance than the other single-layer models. These models were designated as M2 and M9, respectively. For the feature set $F_3$, however, the optimal count of neurons in the hidden layer emerged as 50% of the combined number of input and output neurons. This model was designated as M19. The bar graphs in Figures 1-3 compare the CPM of the various neural networks that work on a specific feature set. In each case, the best-performing single-layer network was extended by adding a second hidden layer to assess any performance edge. The second hidden layer had 50% of the neurons of the first hidden layer to ensure that the most relevant features for prediction play their due role. The blue bars in Figures 1-3 represent the performance of the best-performing single-layer networks, while the yellow bars show the performance of the 2-layer neural networks.

F. Optimal Feature Set
The experiments also focused on exploring an optimal set of features for the prediction of protein function. The $F_3$ feature set proved to be the best predictor for this multi-label classification. Figure 4 shows a comparison of the best-performing models for each feature set, where M19 on $F_3$ achieved the best performance.
Fig. 4. Performance comparison of the best-performing single-layer models on different feature sets. $F_3$ proves to be the best predictor set.

G. Classification Threshold
The impact of the classification threshold on the performance of a classifier was examined. The models predict the probability of each target label associated with every instance. These probability values quantify the chance that a given instance belongs to a particular class, and they must be translated into the binary labels 0 and 1 before the final evaluation of the model. This conversion requires a threshold, or probability cutoff, below which all values are classified as class 0, while equal or greater values are classified as class 1. Classifier performance metrics are profoundly influenced by the choice of this threshold, and its impact is more pronounced for imbalanced datasets. As the examined dataset is skewed towards negative examples of each label, the performance of the models was evaluated for various threshold values. Figures 5 and 6 show two example plots of the samples averages of P, R, and F1 score for models M2 and M19, respectively, against a range of classification thresholds.
Fig. 6. Performance curves of model M19 for the $F_3$ feature set.
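The thresholding step and its effect on the samples-averaged F1 score can be illustrated with a small sweep. The probabilities, thresholds, and function names below are hypothetical; the per-sample F1 is computed as 2·TP / (actual positives + predicted positives), and samples with neither are scored 0 by assumption.

```python
import numpy as np


def binarize(probs, threshold):
    """Class 1 where the probability is >= threshold, class 0 otherwise."""
    return (probs >= threshold).astype(int)


def samples_f1(y_true, y_pred):
    """Per-sample F1 = 2*TP / (actual positives + predicted positives), averaged."""
    tp = (y_true & y_pred).sum(axis=1).astype(float)
    denom = y_true.sum(axis=1) + y_pred.sum(axis=1)
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())


# Hypothetical sigmoid outputs for three samples and four labels.
y_true = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
probs = np.array([[0.90, 0.30, 0.10, 0.20],
                  [0.30, 0.60, 0.20, 0.10],
                  [0.10, 0.20, 0.35, 0.45]])

thresholds = [0.2, 0.4, 0.5, 0.8]
scores = {t: samples_f1(y_true, binarize(probs, t)) for t in thresholds}
best = max(scores, key=scores.get)

for t, s in scores.items():
    print(f"threshold={t:.1f} samples-F1={s:.3f}")
print("best threshold:", best)
```

In this toy case a low threshold admits too many false positives, a high threshold misses true positives, and the customary cutoff of 0.5 is not the best choice, which mirrors the motivation for examining a range of thresholds.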

H. Confusion Matrix
The confusion matrix is a visualization of a classifier's performance, as it gives the counts of TP, FP, TN, and FN class predictions.
Fig. 7. Confusion matrices of the best-performing model M19 for $F_3$. Only labels with a support of more than 1,000 are included.

www.etasr.com
Tahzeeb & Hasan: A Neural Network-Based Multi-Label Classifier for Protein Function Prediction

Figure 7 presents the confusion matrices of selected labels in the test dataset for the model M19, which performed best on the $F_3$ feature set. To highlight the strength of the classifier, only labels whose support exceeded 1,000 in the dataset were chosen.
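Per-label confusion matrices with a support filter can be sketched as follows. The function name, the toy data, and the toy support cutoff are hypothetical; each matrix is laid out as [[TN, FP], [FN, TP]], which is the layout scikit-learn uses for binary problems.

```python
import numpy as np


def per_label_confusion(y_true, y_pred, min_support=0):
    """Return {label index: [[TN, FP], [FN, TP]]} for labels with support > min_support."""
    matrices = {}
    for j in range(y_true.shape[1]):
        t, p = y_true[:, j], y_pred[:, j]
        if t.sum() <= min_support:  # skip labels with too few positive examples
            continue
        tp = int(np.sum((t == 1) & (p == 1)))
        tn = int(np.sum((t == 0) & (p == 0)))
        fp = int(np.sum((t == 0) & (p == 1)))
        fn = int(np.sum((t == 1) & (p == 0)))
        matrices[j] = np.array([[tn, fp], [fn, tp]])
    return matrices


# Five samples, three labels (hypothetical data).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0]])

all_labels = per_label_confusion(y_true, y_pred)
selected = per_label_confusion(y_true, y_pred, min_support=2)  # toy analogue of 1,000

for j, m in selected.items():
    print(f"label {j}:\n{m}")
```

With the toy cutoff of 2, only the first label, which has three positive examples, is retained.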

IV. DISCUSSION
The performance comparison of different neural networks on a large protein dataset showed that, on this specific dataset, neural networks with a single hidden layer and a modest number of neurons outperformed relatively more complex networks. The number of neurons in the single hidden layer was empirically determined. Rigorous experimentation revealed that only 5% of the combined number of input and output neurons was adequate in the single-hidden-layer models operating on the $F_1$ and $F_2$ feature sets. However, this count was 50% for the single-layer model on $F_3$. This disparity arises because the models for the $F_1$ and $F_2$ feature sets had 9,890 and 8,420 neurons, respectively, in their input layers. Therefore, even 5% of the combined number of input and output neurons was adequate to effectively train the model for 1,739 labels. For $F_3$, however, the number of input neurons was only 1,470, and consequently a larger share of neurons was needed for better prediction performance. This justifies a 50% count of neurons in the single hidden layer for this network. In any case, the training time, prediction time, and model size of these models were much better than those of the other competing models. These models showed much better performance (F1 score: 0.96) compared to the deep learning ensemble [1] (F1 score: 0.79) on the same dataset. This was also achieved with a much lower computational complexity.

A. Best Predictors
The findings highlight the important role of the physicochemical properties of proteins, motifs in proteins, pseudo amino acid compositions, and other properties derived from the protein sequences in predicting protein functions. The proposed model for this feature set was extremely efficient, as it combined better performance with lower computational complexity.

B. Sufficiency of Amino Acid, Dipeptide, and Tripeptide Compositions
The results suggest that amino acid, dipeptide, and tripeptide compositions are sufficient for predicting protein functions. Although the performance metrics for this particular feature set were lower than those for the other feature sets, it can be used to obtain a tolerable approximation. This could save the time spent engineering further features from a dataset consisting of the bare compositions of amino acids, dipeptides, and tripeptides.

V. CONCLUSIONS
This study culminated in two significant findings regarding the examined protein dataset. The first pertains to the exceptional performance of single-layer neural networks on this dataset, although the number of neurons in the single hidden layer must be empirically determined as a percentage of the combined number of input and output neurons in the network. The simple design of this single-layer model requires minimal computing resources. This model showed a performance improvement of more than 16% over two-layer neural networks operating on the $F_1$ feature set. The corresponding performance improvements for the $F_2$ and $F_3$ feature sets were 20% and 13%, respectively. The second finding is the considerable predictive power of some physicochemical properties of proteins, their pseudo amino acid compositions, motifs in proteins, and some other significant characteristics, which could play a substantial role in the prediction of protein function. The bare compositions of amino acids, dipeptides, and tripeptides provide a reasonably good approximation of protein functions. This could be useful in cases where researchers want an approximate idea of protein functions from the amino acid sequence alone rather than extracting and relying on many other properties of proteins.