A Review on the Use of Machine Learning Against the Covid-19 Pandemic

Coronavirus disease 2019 (Covid-19) is a contagious respiratory disease that emerged in late 2019 and was recognized by the World Health Organization (WHO) as a global pandemic in early 2020. Since then, researchers have been exploring various strategies and techniques to fight this outbreak. The pandemic appeared at a time when Machine Learning (ML) and Deep Learning (DL) algorithms were competing with traditional technologies and producing significant findings in diverse domains. Consequently, many researchers employed ML/DL to speed up Covid-19 detection, prevention, and treatment. This paper reviews the state-of-the-art ML/DL tools used, evaluating these techniques and their impact on the battle against Covid-19. It aims to give researchers valuable insight for assessing the use of ML against the Covid-19 pandemic.

I. INTRODUCTION
Deep Learning techniques can be employed on many technical problems instead of conventional ones in order to save time and resources. This paper investigates and evaluates the use of Machine Learning (ML) and Deep Learning (DL) techniques in the fight against the Covid-19 pandemic. We focus mainly on four aspects. First, DL and ML models require a lot of data in their training process to provide accurate predictions; therefore, we begin this study by reporting some Covid-19-related datasets used to train ML/DL tools. Second, we investigate the way these tools are employed to prevent the disease, either by tracking the spread of the virus or by helping in the production of a new vaccine. Third, the paper studies the use of ML techniques in the detection phase, presenting the models and algorithms adapted to diagnose and detect the virus. Finally, we examine the same approaches, this time applied to drug discovery and treatment of the disease. This paper aims to present a review of the use of ML against Covid-19, to evaluate these techniques, and to analyze their effectiveness in countering the pandemic.
Before implementing ML or DL techniques in any given field, especially when the models in question rely on supervised learning, the fundamental problem to deal with is data. The various types of Covid-19 datasets available to practitioners and researchers are examined below.
A broad range of data was collected and made openly available during this pandemic. It ranges from classic Computerized Tomography (CT) scans to X-Ray Chest (XRC) radiograph images, but one can also find datasets based on text, video, or sound. The authors in [1] collected a dataset of 123 CT scans. This dataset contains not only Covid-19 images but also MERS, SARS, and ARDS (other respiratory conditions) images. The authors in [2] provide a more extensive dataset of 275 CT scans of Covid-19 patients. A familiar and productive technique used to augment data in ML is segmentation. This technique was used in several studies [3] to generate more samples. In addition to segmentation, other studies collected Covid-19-negative CT images and combined them with positive cases for binary classification diagnosis models [4][5][6]. XRC images are a different type of data. One of the first efforts to collect an XRC Covid-19 dataset was [7]. The authors used a collection of more than 13000 images to train a neural network classification model that achieved 93.11% accuracy. Moreover, further early XRC datasets were developed in [1]. This resource formed the basis of many extensions [8][9][10][11], whether by appending more images of Covid-19 patients, by using data augmentation techniques such as segmentation, or by adding non-Covid-19 images. In addition, an Italian hospital provided more than 4700 X-ray images of Covid-19 patients [12]. Another popular dataset type deals with tracking and visualizing new Covid-19 cases and deaths in real time. The best-known dataset of this kind is [13], while [14] was used to predict the pandemic's spread.
Besides these platforms that offer worldwide statistics, many country-focused datasets accumulate information about specific countries [15][16][17]. Although these datasets cover limited regions, they generally offer richer resources since they consider demographic aspects, travel restrictions, and sometimes even personal data. In [18], the authors presented a collection of videos of Covid-19 patients, pneumonia patients, and healthy people. A related dataset [19] consolidated data from several hospitals and web resources to present a collection of images and videos of lung ultrasounds. Additionally, the ongoing Coswara project [20] consists of collecting breathing, cough, and speech sounds from Covid-19 patients to build a diagnostic tool. Likewise, another sound-based South African dataset was built [21] from cough audio. Table I summarizes the cited datasets, the nature of the data they contain, and, more importantly, the number of samples in each one. Notably, the main problem in using these datasets to train DL models is that they do not contain enough data to predict real-world situations accurately.

Table I

Ref.   Nature of the data     Number of samples
[1]    Frontal view X-rays    123
[2]    CT scans               275
[3]    CT scans               20 cases
[4]    CT scans               5900 images of 1200 patients
[5]    CT scans               888 CT scans of 888 subjects
[6]    CT scans               366,558 CT scan images
[7]    XRC                    13000 images
[12]   X-ray images           4700
       Videos                 247

The following sections provide an extensive literature review of the papers that used ML techniques to help with the prevention of the spread, the detection, and the treatment of Covid-19.

II. ML METHODS FOR COVID-19 PREVENTION
This section examines the use of ML techniques in preventing the spread of Covid-19. To this end, we present two solutions that have been used in response to the outbreak: vaccines (to reduce the spread of the infection) and spread estimation (to adopt the appropriate restrictions at the right time and place).

A. Vaccine Discovery
The invention of a vaccine is arguably the most prolific area of Covid-19 research. In fact, given its significance in putting an end to the limitations placed on people (lockdowns and restrictions on festivities and gatherings), the vaccine is the ultimate solution to the Covid-19 crisis. Unfortunately, the firms that use ML techniques in their vaccine development pipeline publish minimal data, if any [22]. A pioneering effort to use ML in developing an effective Covid-19 vaccine was conducted in [23]. The authors trained an XGBoost model (a decision-tree-based ensemble ML algorithm) to predict the best candidate protein for the vaccine. Their work resulted in 6 potential candidates. The authors of [24] adopted an optimized ML-based technique to pick suitable amino acid fibers to improve the composition of already discovered vaccines. Separate studies [25][26][27] relied on Support Vector Machine (SVM) and other ML techniques in the same direction. SVM is a supervised ML algorithm generally used to classify data by finding the hyperplane that separates the classes with the lowest error rate.
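To make the hyperplane idea behind SVM concrete, the following is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss. It is not the method of the cited studies: the function names, the hyperparameters, and the two-dimensional toy features are assumptions chosen for illustration, whereas real vaccine-candidate features would be protein descriptors.

```python
import random


def train_linear_svm(samples, labels, learning_rate=0.01, reg=0.01, epochs=200, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the hinge loss.

    samples: list of equal-length feature lists; labels: +1 or -1.
    Returns the weights w and bias b of the hyperplane w.x + b = 0.
    (Hypothetical helper written for this illustration.)
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # L2 regularization shrinks the weights at every step.
            w = [wj * (1.0 - learning_rate * reg) for wj in w]
            if margin < 1.0:
                # The sample is misclassified or inside the margin:
                # move the hyperplane toward classifying it correctly.
                w = [wj + learning_rate * y * xj for wj, xj in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy, linearly separable data (assumed for illustration only).
    features = [[1, 1], [2, 1], [1, 2], [2, 2], [-1, -1], [-2, -1], [-1, -2], [-2, -2]]
    targets = [1, 1, 1, 1, -1, -1, -1, -1]
    weights, bias = train_linear_svm(features, targets)
    print([predict(weights, bias, x) for x in features])
```

In practice, a library implementation with kernel support and tuned regularization would normally be preferred; the sketch only shows how the separating hyperplane is adjusted whenever a sample violates the margin.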
The most documented DL-based approach to explore the world of proteins is the AlphaFold project powered by Google DeepMind [28,29]. It focuses on predicting the 3D structure of proteins based on their genetic sequence. During the first months of the pandemic, DeepMind published a prediction for many proteins related to SARS-CoV-2, the virus responsible for Covid-19. This work aimed to contribute to the understanding of the functioning of the virus.

B. Spread Estimation
Compartmental models are the most widely used models to forecast the spread of the pandemic [30]. They rely on stochastic frameworks to predict the right actions, decisions, and strategies to adopt under any circumstances. In [31], the authors used this method to evaluate the impact of social distancing on the pandemic's spread and to predict the expected number of infections after the lockdown. A similar approach was used in [32] to predict the outbreak's growth in Wuhan, the Chinese city where the first Covid-19 cases were detected. The main conclusion of this work was that the lockdown and travel restrictions in China would help control the spread of the virus in the rest of the world. Compartmental models were further used in various studies [33][34][35][36] to predict the transmission of Covid-19 in Japan, Egypt, and other countries. Researchers also used DL [37] and logistic growth [38] models to estimate the spread in China and South Korea. A more complex approach was used in [39], where the authors combined Markov Chain Monte Carlo with the number of confirmed deaths and cases to predict the reproduction rate. DL models were also used to predict a condition that refers to rapid breathing associated with Covid-19 infections. The authors in [40] relied on a bidirectional Gated Recurrent Unit (GRU) model and data coming from smartphones. Impressive results were reported by the authors in [41], who trained a Deep Neural Network (DNN) on a sizeable real-time dataset collected from smartphones. A Long Short-Term Memory (LSTM) model was used in [42] to estimate the pandemic's spread in Canada. The authors in [43] developed a technique that combined LSTM and GRU to design and train a model to measure the Covid-19 release and death cases.
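As an illustration of what a compartmental model computes, the sketch below simulates the classic SIR (Susceptible, Infected, Recovered) model with a simple Euler integration. It is a deterministic simplification and not the formulation of any cited study; the function name, the parameter values (`beta` as transmission rate, `gamma` as recovery rate), and the step size are assumptions made for the example.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Simulate a deterministic SIR model with Euler steps.

    beta: transmission rate per day; gamma: recovery rate per day.
    Returns one (susceptible, infected, recovered) tuple per day,
    starting with day 0. (Hypothetical helper for illustration.)
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # New infections are proportional to contacts between S and I.
            new_infections = beta * s * i / population * dt
            # A fixed fraction of the infected recovers per unit of time.
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Assumed parameters: beta / gamma = 3 plays the role of a basic reproduction number.
    trajectory = simulate_sir(1_000_000, 10, beta=0.3, gamma=0.1, days=200)
    peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
    print("Peak of infections on day", peak_day)
```

Interventions such as social distancing can be approximated in this kind of model by lowering `beta` from a given day onward, which is one way the effect of a lockdown on the expected number of infections may be explored.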

III. ML METHODS FOR COVID-19 DETECTION
A prominent area where Artificial Intelligence (AI) methods shone during the Coronavirus pandemic is disease diagnosis and detection. Several published papers highlight the use of DL models for Covid-19 identification. It is also worth noting that many of the studies we found rely on CT chest scans [44]. The authors in [45] used a DL model to perform visual feature extraction from CT chest images. This project aimed to implement a classifier discriminating between Covid-19 and other pneumonia diseases. The researchers in [46] achieved promising results (86.30% accuracy) when developing a classification model to tackle the same problem. Authors in [47] [51], where the Inception pre-trained model was used to create a diagnosis model. The authors of [52] pursued the same goal. However, their approach was slightly different since they trained their model on X-ray images. Similarly, the authors in [9] proposed a Transfer Learning (TL) model based on pre-trained models such as VGG19, VGG16, ResNet, DenseNet, and InceptionV3. It should be mentioned that these DNN models are well known for reaching state-of-the-art results in many applications, especially in computer vision classification. The authors in [53] proposed a new classification model based on the Random Forest algorithm and X-ray scans from people affected by Covid-19 and other lung diseases. An aggregated model was developed in [54], where pre-trained models such as ResNet, VGG, and AlexNet were used along with SVM. The pre-trained models were used for feature extraction, while the SVM model handled the classification. In another direction, some studies explored a completely different path to achieve the same goal of detecting Covid-19, namely sound-based and video-based techniques. For example, the authors of [18] used videos of healthy people and patients affected by Covid-19 and pneumonia. These data were then fed to a pre-trained model (VGG-16) used in the prediction phase.
Moreover, the authors in [55,56] showed that respiratory sounds (mainly cough and breathing sounds) could be employed to detect people affected by Covid-19. The authors of [55] built binary classifiers for distinguishing Covid-19-positive patients from patients with other respiratory diseases. Figure 1 illustrates the intersections between Covid-19 and other lung diseases and why it is difficult to distinguish one from the others.
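Since the detection studies above are mostly reported through accuracy figures, the following minimal sketch shows how accuracy, sensitivity, and specificity can be computed for a binary Covid-19 classifier. The function name and the label encoding (1 for Covid-19 positive, 0 for negative) are assumptions for illustration and are not taken from the cited works.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for 0/1 labels.

    y_true: ground-truth labels; y_pred: predicted labels.
    (Hypothetical helper; 1 is assumed to mean Covid-19 positive.)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


if __name__ == "__main__":
    # A classifier that always predicts "negative" on an imbalanced toy set.
    truth = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    always_negative = [0] * 10
    print(binary_metrics(truth, always_negative))
```

The toy case in the usage example is a reminder that, on imbalanced datasets, a high accuracy alone may hide the fact that no positive case is detected, which is why sensitivity and specificity are worth reporting alongside it.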

IV. ML METHODS FOR COVID-19 TREATMENT
So far, this article has explored the use of ML methods in the detection and prevention of Covid-19. However, the most straightforward and essential action to fight the outbreak is to help the affected people recover. This can ultimately be achieved by finding drugs that have a practical impact in fighting the new virus. This section explores how ML models contributed to alleviating this problem and how efficient they were. Drug discovery is one of the trickiest and most challenging areas in medicine. Furthermore, it is known for taking many years to give valuable results. Many researchers have turned their attention towards DL techniques instead of traditional technologies to speed up this process. For instance, the authors in [57,58] used LSTM, CNN, and other DL models in their quest to find active antivirals among the existing known drugs. Several other studies focused on Biomedical Knowledge Graphs (BKGs), a technique used to detect similarities and relationships between drugs, viral proteins, and genes. The authors in [59] utilized BKGs to identify a promising treatment for Covid-19. The authors in [60] used the same technique to describe the relations between gene-disease pairs. The ML model showed relations between the viral protein and some previously known drugs, which led to the prediction of many potential drug candidates that could be effective against Covid-19 [61]. Similar work was done in [62], in which the researchers used a DL model to find candidate drugs. A multitask ML model and more than 2000 human proteins were used in [63] to find relations between Covid-19 and Known Drug Targets (KDTs). The drug representation was extracted in [64] using a deep graph network and known biological interactions.
ML and DL techniques were also applied to narrow the possibilities in the molecular docking process, one of the most computationally expensive yet frequently used techniques in drug design. The authors of [65] used a neural network to narrow 3 million candidates down to 1000. A Random Forest model to identify 187 molecules in the coronavirus S-protein was proposed in [66]. Moreover, an ML framework to predict viral protein activities was developed in [67]. The ensemble model used in this framework studied 19 drugs and ranked them based on their capacity to inhibit the coronavirus proteases. The authors also used docking to assess critical aptitude.
AI was also used to find new chemical compounds. The authors in [65] used a protein homology model along with 28 DL models and Reinforcement Learning (RL) techniques to evaluate drugs based on predefined metrics. Furthermore, the authors in [68] used RL in their quest to discover new SARS drugs. More precisely, they used Deep Q-Learning and evaluated the discovered molecules according to 3 predetermined factors.

V. EVALUATION AND DISCUSSION
Many researchers have used AI as a forefront tool in the fight against Covid-19. The choice of this technology over traditional approaches was driven by the impact that DL techniques have gained during the last decade. This section addresses these contributions, evaluates their impact, and presents their limitations. ML and DL approaches have contributed to findings and improvements in the fight against the pandemic. We presented many examples in this paper to back up this claim. However, our study also shows that AI did not play a game-changing role in this fight. We dedicate this section to presenting our observations and explaining why this is the case.
The companies that used ML tools in their research and development process generally publish little relevant information about their findings, particularly in drug discovery and treatment of the disease, even though the biggest challenge facing the development of effective DL solutions is the lack of data. DL is known to be more effective when an abundant amount of (useful) data is available. However, as we have seen above, Covid-19 datasets are generally composed of few samples. For instance, we have seen models built upon data collected from only 28 cases. The number of samples is not the only issue with Covid-19 datasets. The quality of these samples raises another challenge. Many datasets are collected from web portals, which affects the quality of the images and the credibility of the information. Another issue with Covid-19 data is their non-representativeness. There is a substantial disparity in the number of tests between countries, classes, and cities in the same country, which affects the scaling of the trained models and results in inaccurate predictions. The difficulty of extracting these samples could explain the lack of quality and quantity of data related to Covid-19. Besides, one should not forget that we are dealing with personal data, so privacy constraints come into play. One solution that has been used in many studies to handle this problem is data augmentation. Alongside data augmentation, TL could present a great opportunity by reusing the models and data already developed for similar medical scenarios.
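As a minimal illustration of data augmentation, the sketch below derives extra training samples from a single grayscale image stored as a list of pixel rows. It uses simple geometric transforms (a horizontal flip and a 90-degree rotation), which is an assumption made for the example: the studies cited in this review mention segmentation-based augmentation, and the function names here are hypothetical.

```python
def flip_horizontal(image):
    """Mirror the image left to right (each row is reversed)."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and transposing yields a clockwise rotation.
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]


if __name__ == "__main__":
    # A tiny 2x3 "image" with assumed pixel intensities.
    scan = [[0, 50, 100],
            [150, 200, 250]]
    for variant in augment(scan):
        print(variant)
```

Whether such transforms preserve the diagnostic label depends on the imaging modality, so any augmentation of medical images should be validated with domain experts before it is used for training.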
It is essential to note that the dilemma with data issues when using DL models is that a well-trained model can perform well on the limited data yet fail to generalize when new situations arise. Such a solution is therefore unlikely to be applied in the real world. Our study shows that most of the contributions are about predicting the spread of the pandemic or detecting the virus. There are relatively few studies in the drug and vaccine discovery area. This is in line with the previous remark about real-world implementations, since these areas present a more significant risk and therefore leave no margin of error for non-scalable and imprecise models. Our study also revealed that the primary source of Covid-19 data is in the form of images. Therefore, CNNs constitute the basis for many proposals. There are also the issues of privacy protection and of the interpretability of DL models, that is, how much we can trust their predictions in critical situations.

VI. CONCLUSION
According to the official public service announcement on Coronavirus from the WHO, Covid-19 caused almost 5 million deaths, with more than 238 million confirmed cases. This enormously heavy toll has spurred the research and development communities to commit to alleviating this crisis. ML and DL tools have been widely used in many applications during the last decade, including healthcare applications [69,70]. To offer valuable guidance to researchers in evaluating the use of ML/DL models against Covid-19, this paper studied the available Covid-19 datasets, the indispensable fuel for any DL-based approach. Then, the article reviewed how ML/DL tools were used in preventing, diagnosing, and treating Covid-19. We presented the models developed to detect the virus and then to find drugs to treat the disease.
Finally, the last section of the paper is devoted to the evaluation of these techniques, along with a discussion of their effectiveness in fighting Covid-19. The results of this study reveal that ML/DL techniques are not mature enough to be used in critical situations when people's lives are at stake, mainly due to the scarcity of the medical data needed to train the models. In addition, the quality of the available data samples often raises concerns. We proposed several potential solutions to remedy these circumstances, such as encouraging open-source initiatives, data augmentation, TL, and differential privacy.