Benchmarking Deep Learning Models and Automated Model Design for COVID-19 Detection with Chest CT Scans

The COVID-19 pandemic has spread all over the world for months. As its high transmissibility and pathogenicity seriously threaten people's lives, accurate and fast detection of COVID-19 infection is crucial. Although many recent studies have shown that deep learning based solutions can help detect COVID-19 based on chest CT scans, a consistent and systematic comparison and evaluation of these techniques is lacking. In this paper, we first build a clean and segmented CT dataset called Clean-CC-CCII by fixing the errors and removing some noise in a large CT scan dataset, CC-CCII, with three classes: novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal controls (Normal). After cleaning, our dataset consists of a total of 340,190 slices of 3,993 scans from 2,698 patients. Then we benchmark and compare the performance of a series of state-of-the-art (SOTA) 3D and 2D convolutional neural networks (CNNs). The results show that 3D CNNs outperform 2D CNNs in general. With extensive hyperparameter tuning, we find that the 3D CNN model DenseNet3D121 achieves the highest accuracy of 88.63% (F1-score is 88.14% and AUC is 0.940), and another 3D CNN model, ResNet3D34, achieves the best AUC of 0.959 (accuracy is 87.83% and F1-score is 86.04%). We further demonstrate that the mixup data augmentation technique can largely improve the model performance. Finally, we design an automated deep learning methodology to generate a lightweight deep learning model, MNas3DNet41, that achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best models made by AI experts. The automated deep learning design is a promising methodology that can help health-care professionals develop effective deep learning models using their private data sets. Our Clean-CC-CCII dataset and source code are available at: https://github.com/arthursdays/HKBU_HPML_COVID-19.


I. INTRODUCTION
The COVID-19 (Corona Virus Disease 2019) pandemic is an ongoing pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. The SARS-CoV-2 virus can be easily spread among people via small droplets produced by coughing, sneezing, and talking [2]. Even worse, SARS-CoV-2 can be highly stable in a favourable environment, remaining on object surfaces for up to several days [3], which raises the risk of people becoming infected by touching contaminated surfaces and then touching their own faces. COVID-19 is not only highly contagious, but also a serious threat to human lives. COVID-19 patients usually present with pneumonia-like symptoms (fever, dry cough, dyspnea, etc.) and gastrointestinal symptoms such as diarrhea, followed by a severe acute respiratory infection. In some cases, acute respiratory distress accompanied by severe respiratory complications may even lead to death. According to the COVID-19 situation report [4] provided by the World Health Organization (WHO), as of the end of May 2020, there were 5,934,936 COVID-19 infections and 367,166 deaths globally. The usual incubation period of COVID-19 ranges from one to 14 days. Many COVID-19 patients show no symptoms and do not even know that they have been infected, which can easily delay treatment and lead to a sudden exacerbation of the condition. Therefore, a fast and accurate method of diagnosing COVID-19 infection is crucial.
§ Corresponding author at Hong Kong Baptist University, Tel.: +852-3411-5998; Email: chxw@comp.hkbu.edu.hk
¶ Corresponding author at Hangzhou Dianzi University; Email: jzhang@hdu.edu.cn
Currently, there are two commonly used methods for COVID-19 diagnosis. One is viral testing, which uses real-time reverse transcription-polymerase chain reaction (rRT-PCR) to detect viral RNA fragments. The other is diagnosis based on characteristic imaging features on chest X-rays or computed tomography (CT) scan images. The authors of [5] compared the effectiveness of the two diagnosis methods and concluded that chest CT detects the transition from initial negative to positive faster than rRT-PCR. However, manually analyzing CT images and making a diagnosis relies heavily on professional knowledge and is time-consuming. Therefore, many recent studies have tried to use deep learning (DL) methods to assist COVID-19 diagnosis with chest X-rays or CT scan images.
However, the reported accuracy of the existing DL-based COVID-19 detection solutions spans a broad spectrum because they were evaluated on different datasets, making it difficult to achieve a fair comparison. In this paper, we aim to conduct a reproducible comparative study of DL methods for COVID-19 detection using chest CT scans. To this end, we first build a clean and segmented CT scan dataset based on a large-scale open-source dataset from CC-CCII (China Consortium of Chest CT Image Investigation) [6]. Our dataset, named Clean-CC-CCII, consists of three classes: novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal controls (Normal). In total, there are 340,190 slices of 3,993 scans from 2,698 patients in our dataset, where the number of slices of NCP, CP, and Normal is 131,517, 135,038, and 73,635, respectively. We split the dataset into training and test sets according to the patient's ID with a ratio of 4:1, the details of which are shown in Table II. Notice that our test set is the largest one (e.g., it is twice the size of that in [6]), making our evaluation results more conservative than existing ones. Our benchmark dataset is open to the public and can facilitate the fair comparison of new DL models for COVID-19 detection.
In this paper, we use our dataset to benchmark two types of state-of-the-art (SOTA) DL models: 1) 3D convolutional neural networks (CNNs), including DenseNet3D121 [17], R2Plus1D [18], MC3_18 [18], ResNeXt3D101 [17], PreAct ResNet3D [17], and the ResNet3D series [17]; 2) 2D CNNs, including DenseNet121 [19], DenseNet201 [19], ResNet50 [20], ResNet101 [20] and ResNeXt101 [21]. We explore three key factors that may affect the detection performance: model depth, the method of reading slice images, and model architecture. First, regarding the model depth, we compare 3D variants of the ResNet architecture [20] with depths from 10 to 200 layers, i.e., ResNet3D10, ResNet3D18, ResNet3D34, ResNet3D50, ResNet3D101, ResNet3D152, and ResNet3D200. Second, in terms of how to read the slice images, we consider two popular approaches: one is to read a slice as an RGB image with three channels; the other is to convert the slice to a greyscale image with only one channel. The scan images used to train the model therefore differ depending on how the slices are read. We use seven 3D models to analyze the effect of these two types of scan data. Third, we exploit multiple DNN architectures, including hand-crafted models and models generated automatically with AutoML techniques [22], [23]. Besides, we discuss the influence of the number of slices in a CT scan on the model performance. We also evaluate the effectiveness of the mixup data augmentation method by comparing model accuracy before and after applying it. Our major contributions are summarized as follows:

1) We build an open benchmark dataset Clean-CC-CCII for COVID-19 detection using chest CT scans, and benchmark 9 different CNN architectures with more than 20 variants.
2) We find that both 3D and 2D CNNs are promising solutions for detecting COVID-19 infections. However, the overall performance of 3D CNNs is better than 2D CNNs. Besides, the results of the ResNet3D series show that the model performance does not scale very well with the model depth.
3) We find that the models can achieve higher AUC when the slices are converted to greyscale images.
4) To the best of our knowledge, this is the first paper to explore the relationship between model performance and the number of slices in a CT scan. Our result shows that there is no significant correlation between them. In other words, increasing the number of slices does not necessarily improve the model performance. Instead, the model trained on scan data with a small number of slices can also achieve comparable or even better results.
5) We demonstrate that the mixup data augmentation method [24] can effectively improve model accuracy in our study.
6) We develop an automated deep learning methodology to generate a lightweight deep learning model MNas3DNet41. On our dataset, it achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best results of the highly fine-tuned models made by AI experts.

The rest of the paper is organized as follows. Section II describes the related work. In Section III, we describe the strategies used to build our dataset, the comparison study of SOTA CNN models, and the automated model design methodology. Section IV presents and discusses the experimental results. We conclude the paper and introduce the future research directions in Section V.

II. RELATED WORK
In recent years, DL techniques have proved effective in diagnosing diseases from X-ray and CT images [25]. To enable machine learning techniques to help detect COVID-19, an increasing number of publicly available COVID-19 datasets have been released in the past few months, as shown in Table I. These datasets fall into two classes: X-ray and CT scan images. Machine and deep learning techniques rely heavily on both the quality and quantity of the dataset.

A. Publicly-available datasets of COVID-19
IEEE8023 Covid-chestxray-dataset [26] is an open dataset of COVID-19 cases with chest X-ray and CT images, which allows users to submit other COVID-19 data to this dataset. However, this dataset mainly focuses on X-ray images and contains only a very small number of CT scans. Based on this dataset, several DL-based techniques have been proposed [7]-[9] to detect COVID-19.
Covid-ct-dataset [27] is a CT dataset of COVID-19, which is mainly composed of CT images extracted from PDF files of COVID-19 papers on medRxiv and bioRxiv. Thus, it has two main drawbacks. First, many CT images contain marks created by the CT machine or doctors, which may strongly affect DL techniques. Second, each patient has only one to several CT images instead of a complete 3D scan volume, which makes it difficult to use 3D CNNs to exploit the depth information of the lung.
CC-CCII is another publicly available CT volume dataset proposed by [6]. It is currently one of the largest CT datasets for COVID-19, containing 617,775 slices of CT images from 6,752 scans of 4,154 patients. It has three classes: novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal controls (Normal). CP includes bacterial pneumonia and viral pneumonia. However, this dataset (version 1.0 released on 23 April 2020) contains some errors (e.g., disordered CT images in some scans, and some scans that contain CT images of the head but not the lung).
COVID-19-CT-Seg-Dataset [28] is a publicly available CT dataset of COVID-19. It contains 20 well-labeled scans with annotation of left lung, right lung and lesions. Three experienced radiologists are involved for each annotation: two radiologists do the annotation and one does the verification.

B. DL-based methods for COVID-19 detection
Most research is conducted on CT images, but many studies, such as [10], [13], [14], do not exploit the 3D information of CT images and only propose DL models with 2D CNNs for COVID-19 detection. The work most closely related to ours is [11], but it only benchmarks ten 2D CNNs and compares their performance in classifying 2D CT images on a private dataset with 102 testing images.
On the other hand, studies utilizing 3D CT images are relatively rare, mainly due to the lack of 3D CT scan datasets of COVID-19 in the earlier days. However, there are still some studies proposing 3D CNNs with private 3D CT datasets (e.g., [16], [15]). Recently, [6] publish a large-scale publicly available 3D CT dataset, based on which they propose 3D CNN methods to segment lesions and detect COVID-19. However, in [6], only two DL models are exploited to evaluate the model performance, using 10% of the dataset as the test set. It is of practical importance to evaluate which types of models are suitable for 3D CT images in detecting COVID-19.
There are also some other studies conducted on X-ray images. For example, [9] propose three 2D CNNs for COVID-19 detection. [7] introduce a deep anomaly detection model for fast and reliable screening. [8] investigate the estimation of uncertainty and interpretability with a dropweights-based Bayesian CNN on X-ray images. [12] use both X-ray images and CT images to do segmentation and detection.

C. Automated model design for medical image analysis
In recent years, Automated Machine Learning (AutoML) has created many SOTA results by automatically searching model architectures and hyper-parameters for specific tasks [22], [23], [29]. For example, [30] introduce AutoML into the medical image processing task. They used five public datasets, MESSIDOR, OCT images, HAM 10000, Paediatric images and CXR images, to train models by Google Cloud AutoML. Their experimental results demonstrate that AutoML can generate competitive classifiers compared to manually designed DL models.

III. MATERIALS AND METHODS
A. Dataset
[6] provide an open-source chest CT image dataset for COVID-19 diagnosis, namely China Consortium of Chest CT Image Investigation (CC-CCII), which contains a total of 617,775 CT slices of 6,752 CT scans from 4,154 patients. CC-CCII has three classes: novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal controls (Normal). CP includes bacterial pneumonia and viral pneumonia. To the best of our knowledge, CC-CCII is currently the largest publicly available COVID-19 CT dataset. It would be helpful for accelerating the research on machine learning based methods in COVID-19 diagnosis. However, CC-CCII has five main issues (i.e., damaged data, non-unified data type, repeated and noisy slices, disordered slices, and non-segmented slices) that would have a strongly negative impact on the model performance.
In this section, we first describe our methods to address the problems in CC-CCII to generate a better dataset for DL techniques. Then we introduce the strategies of scan images construction.
3 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 9, 2020. . https://doi.org/10.1101/2020.06.08.20125963 doi: medRxiv preprint
2 https://github.com/booz-allen-hamilton/DSB3Tutorial
After addressing the above problems, we construct a clean CC-CCII dataset named Clean-CC-CCII, which is more suitable for DL-based methods in COVID-19 diagnosis. The statistics of our dataset are presented in Table II. Our Clean-CC-CCII dataset consists of 340,190 slices of 3,993 scans from 2,698 patients. The dataset is divided into the training set and the test set according to patients, so that the CT scan images from the same patient appear either only in the training set or only in the test set. The ratio of the number of scans in the training set to that in the test set is 4:1.
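The patient-level split can be illustrated with a short sketch. This is not the released code: the function name `split_by_patient`, its parameters, and the greedy way the test set is filled are assumptions made for illustration, and class balance between the two sets is not handled here.

```python
import random
from collections import defaultdict

def split_by_patient(scans, test_fraction=0.2, seed=0):
    """Split (patient_id, scan_id) pairs so that no patient spans both sets.

    All names here are hypothetical; test_fraction=0.2 approximates a 4:1
    train/test ratio in the number of scans.
    """
    by_patient = defaultdict(list)
    for patient_id, scan_id in scans:
        by_patient[patient_id].append(scan_id)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    target_test = round(test_fraction * len(scans))
    train, test = [], []
    for patient_id in patients:
        # Whole patients are assigned at once; the test set is filled first.
        bucket = test if len(test) < target_test else train
        bucket.extend((patient_id, s) for s in by_patient[patient_id])
    return train, test
```

Because whole patients are assigned at once, the realised ratio is only approximately 4:1 when patients have different numbers of scans.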
2) Scan images construction: After data pre-processing, we need to construct CT scan images as inputs of DL models for training. As shown in Fig. 3, there are two steps before feeding data into DL models: slice sampling and slice processing.
Slice sampling: In our dataset, each CT scan contains a different number of slices as shown in Fig. 2. The minimum and maximum number of slices are 9 and 457, respectively. However, DL models generally require inputs of the same dimensions. To obtain them, we propose two slice sampling strategies: random sampling and symmetrical sampling. Specifically, the random sampling strategy is applied to the training set, where it can be regarded as data augmentation, while the symmetrical sampling strategy is applied to the test set to avoid introducing randomness into the testing results. Besides, because the number of slices can be manually set to different values, both sampling strategies automatically select upsampling or downsampling based on the original and target number of slices. We will also study the performance impact of the number of slices in Section IV-C. Notably, the relative order between slices remains the same before and after sampling. The details of our sampling strategies are given in Algorithms 1 and 2 of Appendix A.
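To make the two strategies concrete, the following is a minimal sketch of order-preserving slice sampling. It is not a reproduction of Algorithms 1 and 2, which are not shown here: the function names are hypothetical, "symmetrical" is read as deterministic, evenly spaced sampling, and upsampling is assumed to work by repeating slices.

```python
import random

def symmetrical_indices(n_slices, target):
    """Deterministic, evenly spaced slice indices (one reading of 'symmetrical').

    Handles downsampling (target < n_slices) and upsampling
    (target > n_slices, where some indices repeat).
    """
    return [min(n_slices - 1, int((k + 0.5) * n_slices / target))
            for k in range(target)]

def random_indices(n_slices, target, rng=random):
    """Random slice indices for training-time augmentation."""
    if target <= n_slices:
        picked = rng.sample(range(n_slices), target)   # downsample
    else:
        extra = [rng.randrange(n_slices) for _ in range(target - n_slices)]
        picked = list(range(n_slices)) + extra         # upsample by repetition
    return sorted(picked)  # sorting keeps the original slice order
```

For example, `symmetrical_indices(9, 4)` returns `[1, 3, 5, 7]`, and both functions return indices in non-decreasing order, so the relative order of slices is unchanged.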
Slice processing: After slice sampling, each scan is composed of the same number of slices. We then resize all slices to 160×160 and center-crop them to 128×128. In this way, the final input data sizes for the 3D and 2D models are c × d × 128 × 128 and d × 128 × 128, respectively, where c ∈ {1, 3} is the number of channels of the slice image, and d indicates the configured number of slices. For all scan data in the training set, we apply a 3D random horizontal flip transformation. The scan data in both the training and test sets is normalized by subtracting the mean and dividing by the variance.
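The processing steps can be sketched in plain Python as follows. Several details are assumptions rather than statements about our implementation: nearest-neighbour interpolation for resizing, a flip probability of 0.5, and per-scan statistics for normalization. The text above speaks of dividing by the variance; this sketch divides by the standard deviation, the more common convention, and that substitution is also an assumption.

```python
import random
import statistics

def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2D list to size x size (assumed interpolation)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def center_crop(img, size):
    """Keep the central size x size window of a 2D list."""
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def process_scan(scan, resize_to=160, crop_to=128, train=False, rng=random):
    """scan: list of d greyscale slices (2D lists) -> d x crop_to x crop_to."""
    out = [center_crop(resize_nearest(s, resize_to), crop_to) for s in scan]
    if train and rng.random() < 0.5:
        # Flip every slice of the scan in the same way (3D horizontal flip).
        out = [[row[::-1] for row in s] for s in out]
    values = [v for s in out for row in s for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # assumption: std, not variance
    return [[[(v - mean) / std for v in row] for row in s] for s in out]
```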

B. A comparative study of COVID-19 detection methods
In this study, we aim to investigate the performance of different types of DL models on detecting COVID-19 infection with chest CT scans. Therefore, we implement various experiments to evaluate the potential effective methods for COVID-19 diagnosis. Specifically, we compare the performance between SOTA DL models, including 3D and 2D models, and explore the relationship between model performance and (a) model depth and (b) how to read slice images. We also evaluate the effectiveness of the mixup data augmentation method in improving model classification accuracy.
Our pipeline of using DL models to classify CP, NCP, and Normal CT scans is shown in Fig. 3. The first step is to construct CT scan images to feed into the DL models by slice sampling and processing. The sizes of all slices are fixed to 128×128 for the model inputs. The models are trained with the training set and evaluated on the test set.
1) Exp 1: Comparing different CNN models: In this study, we evaluate 17 CNN classification models shown in Table III, including 3D models and 2D models. For the 3D models, we use DenseNet3D121 [17], R2Plus1D [18], MC3_18 [18], ResNeXt3D101 [17], PreAct ResNet3D [17], and the ResNet3D series [17] (ResNet3D10, ResNet3D18, ResNet3D34, ResNet3D50, ResNet3D101, ResNet3D152, and ResNet3D200). For the 2D models, we use DenseNet121 [19], DenseNet201 [19], ResNet50 [20], ResNet101 [20] and ResNeXt101 [21]. For the 2D models, the input scan data is composed of greyscale slice images. For the 3D models, we evaluate two types of scan data: RGB slice images with three input channels and greyscale slice images with one. Besides, for both 2D and 3D models, the size of slices is fixed to 128 × 128. Therefore, the input sizes for 3D and 2D models are c × d × 128 × 128 and d × 128 × 128, respectively, where c is the number of channels of the slice image, which depends on how the slice images are read, and d is the number of slices in a scan image. c = 3 and c = 1 indicate that each slice is read as an RGB and a greyscale image, respectively. The number of input channels in the first convolutional layer of all models is modified accordingly to handle inputs of different sizes.
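The input layouts described above can be illustrated with nested lists. This is only a sketch of the shapes: the helper names are hypothetical, and building the RGB variant by repeating the grey channel three times is an assumption about how the three-channel input is obtained.

```python
def shape(x):
    """Return the nested-list shape, e.g. (c, d, h, w)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def to_3d_input(scan, channels):
    """scan: d x H x W greyscale volume -> c x d x H x W input for a 3D model."""
    if channels not in (1, 3):
        raise ValueError("c must be 1 (greyscale) or 3 (RGB)")
    # For c = 3 the single grey channel is repeated (the copies share storage).
    return [scan for _ in range(channels)]

d, h, w = 64, 128, 128
scan = [[[0.0] * w for _ in range(h)] for _ in range(d)]
# A 2D model takes `scan` itself (d x H x W), with the slices acting as channels.
```

Under this reading, the three channels of the RGB input carry identical values, so c = 3 adds no information beyond c = 1.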
2) Exp 2: Comparing the different number of slices: In [6], the scan input is fixed to 64 slices. However, in our dataset, the number of slices contained in different CT scans ranges from 9 to 457, with a mean of 85, as shown in Fig. 2. Intuitively, the more slices a scan contains, the more information the models can extract, which could result in higher performance. We empirically study the performance impact of the number of slices by setting d to different values. We choose four representative 3D models (MC3_18, DenseNet3D121, ResNet3D101, and ResNeXt3D101) to evaluate the relationship between the model performance and the number of slices. For MC3_18, DenseNet3D121, and ResNet3D101, we evaluate five types of scan images containing 16, 32, 64, 128, and 256 slices, respectively. ResNeXt3D101 is too large to fit into the GPU memory when d > 64, so it is only evaluated with d ≤ 64.
Mixup [24] is a generic and straightforward data augmentation strategy, which has been proven to be effective in improving the model performance on 2D image classification tasks. Therefore, we explore the effectiveness of the mixup method in our 3D CT scan classification task.
In essence, mixup trains a DL model on linear combinations of pairs of examples and their labels. The formula is given as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where (x_i, y_i) and (x_j, y_j) are two feature-target vectors drawn at random from the training set, and the variable λ ∈ [0, 1] follows a Beta distribution, i.e., λ ∼ Beta(α, α) for α ∈ (0, ∞). In this way, a new feature-target vector is generated by mixing two existing feature-target vectors, which encourages the model to behave linearly in-between training examples.
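The mixing step can be sketched for a single pair of examples as follows. The function name, the flat-list representation of a scan, the one-hot label encoding, and the default α = 1.0 are assumptions for illustration; the value of α used in the experiments is not stated here.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Mix two (features, one-hot label) examples with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    # The same coefficient is applied to the features and to the labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam
```

Mixing one-hot labels yields a soft label whose entries still sum to one, so the usual cross-entropy loss can be applied to it directly.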

C. Automated model design for COVID-19 detection
The results of all baseline experiments (to be discussed in Section IV) show that DL is a powerful tool to assist the detection of COVID-19 infection based on CT images, where 3D models generally outperform 2D models. However, as shown in Table III, 3D models have a very large model size and are slow to train. Based on the results of Table IV and Table V, we can see that a larger or deeper model does not necessarily result in better performance. For example, ResNeXt3D101 is the largest of our evaluated models, but its performance is not the best. Therefore, in this section, we aim to design a lightweight 3D model, which is expected to achieve comparable or even better results than the baseline 3D models and is easier to deploy for faster detection.
However, manually designing a deep neural network is a time-consuming process that highly relies on experience and expertise. Luckily, a recent technique, namely neural architecture search (NAS), would be a promising solution for us. NAS can be seen as a sub-field of AutoML [22], [23], [29], which draws much attention from academia and industry as it can design various neural networks automatically. In the following content, we first introduce our search space and search strategy, and then describe the implementation details and experimental results.
1) Search space: The first step of NAS is to build the search space, which defines the design principles of neural architectures. MobileNet [31] and MobileNetV2 [32] are a class of efficient models manually designed for mobile and embedded devices for efficient inference. Many NAS studies [33], [34] use the MobileNetV2 structure to design the factorized hierarchical search space, but they mainly focus on 2D image recognition tasks. In this work, we also exploit MobileNetV2 as the backbone to design the 3D search space.
An overview of the final model is shown in Fig. 4, which consists of n different cells. The number of blocks in a cell can be different, represented by [B_1, ..., B_i, ..., B_n]. The stride is set to 2 in the first block if the resolutions of input and output are different, and the stride is 1 in all other blocks. The blocks within the same cell have the same number of input/output channels. Besides, the structure of each block is selected from a series of 3D mobile inverted bottleneck convolution operations [32], represented by K×K MBConvE, where K is the filter kernel size and E is the expansion ratio of linear layers. In our method, the search space consists of the following operations:
2) Search strategy: After building the search space, we can see that the key idea of the search task is to select the best sub-model (in terms of validation accuracy) from the super-model. As summarized in [22], [29], there are various search strategies, such as reinforcement learning, evolutionary algorithms, gradient descent-based methods, and random search. In recent studies [35]-[37], the authors demonstrate that random search is competitive with many other methods. Therefore, we also apply the random search strategy.
3) Implementation details: The pipeline of our NAS methodology is shown in Fig. 5, which contains two stages for searching 3D models on our Clean-CC-CCII dataset: the search stage and the evaluation stage.
Search stage. In the search stage, we search for 100 epochs. Each epoch consists of a number of steps. We sample a new neural architecture every five steps and make sure that every sampled architecture is trained. Note that only the training set is used for training and evaluating the sampled models in the search stage. At the end of the search stage, there are 100 neural architectures and their corresponding training accuracy.
Evaluation stage. After the search stage, we need to select several top-ranked models (in terms of validation accuracy) for the next stage. Specifically, according to the training records, we choose those models that achieve better validation accuracy than the previously sampled models. The selected models are first trained on the training set from scratch for 200 epochs, and then evaluated on the test set.
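The sampling and selection logic of the two stages can be sketched as follows. This is a simplified illustration, not our NAS implementation: the candidate operation list is hypothetical, `score_fn` is a stand-in for the accuracy recorded during the search stage, and "better than the previously sampled models" is read as beating every earlier sample.

```python
import random

# Hypothetical candidates in the K x K MBConvE naming scheme.
CANDIDATE_OPS = [f"{k}x{k}_MBConv{e}" for k in (3, 5, 7) for e in (3, 6)]

def sample_architecture(blocks_per_cell, rng):
    """Pick one operation per block, cell by cell."""
    return [[rng.choice(CANDIDATE_OPS) for _ in range(b)] for b in blocks_per_cell]

def random_search(blocks_per_cell, n_samples, score_fn, seed=0):
    """Sample architectures uniformly; keep each one that beats all earlier ones."""
    rng = random.Random(seed)
    best, improved = float("-inf"), []
    for _ in range(n_samples):
        arch = sample_architecture(blocks_per_cell, rng)
        score = score_fn(arch)
        if score > best:
            best = score
            improved.append((score, arch))
    return improved  # candidates to retrain from scratch in the evaluation stage
```

The returned list is ordered by strictly increasing score, so its last entry is the best architecture seen during the search.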
Implementation details. For both search and evaluation stages, we use the Adam optimizer [38] [24,40,80,96,192,320]. Each experiment is conducted on four Nvidia Tesla V100 GPUs. Furthermore, to improve the search efficiency, we fix the height and width of the input scan to 60×60 during the search stage, and restore the size to 128×128 in the evaluation stage.

IV. RESULTS AND DISCUSSION
In this section, we present and analyze the results of the different experiments mentioned above. All models are trained using the Adam [38] optimizer with an initial learning rate of 0.001. The cosine annealing scheduler [39] is applied to adjust the learning rate.
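For reference, the learning rate schedule can be sketched as below. The initial learning rate of 0.001 comes from the text; the single decay cycle without restarts, the minimum learning rate of zero, and the per-epoch update are assumptions.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate at `step` out of `total_steps` (no restarts)."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term
```

Under these assumptions the learning rate starts at 0.001, reaches 0.0005 halfway through training, and decays to zero at the final step.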

A. Evaluation metrics
To compare the performance of CNN models, we use several commonly used evaluation metrics as follows:
3 https://github.com/microsoft/nni
Fig. 6. The ROC curves of 3D and 2D models: (a) 3D models trained with greyscale slices; (b) 3D models trained with RGB slices; (c) 2D models trained with greyscale slices. The overall performance of 3D models is better than 2D models. Besides, the variance between the performance of the models that are trained with greyscale slices is smaller.
Besides, the area under the receiver operating characteristic (ROC) curve (AUC) is also used to evaluate the performance of COVID-19 diagnosis. In this study, NCP scans are treated as positive cases and non-NCP scans (i.e., CP and Normal) as negative cases. Specifically, N_TP and N_TN indicate the number of correctly classified NCP and non-NCP scans, respectively. N_FP indicates the number of non-NCP scans wrongly classified as NCP, and N_FN the number of NCP scans wrongly classified as non-NCP. The accuracy is the micro-averaged value over all test data, which is used to evaluate the overall performance.
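The metrics reported in this section can be computed from these counts as sketched below. The formulas are the standard definitions of precision, sensitivity, specificity, and F1-score, given here as a reference sketch under the assumption that they match the definitions used in our tables; the function names are hypothetical, and the counts are assumed to be large enough that no denominator is zero.

```python
def binary_metrics(n_tp, n_tn, n_fp, n_fn):
    """NCP-vs-rest metrics from the four confusion counts."""
    precision = n_tp / (n_tp + n_fp)
    sensitivity = n_tp / (n_tp + n_fn)   # also called recall
    specificity = n_tn / (n_tn + n_fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

def micro_accuracy(y_true, y_pred):
    """Fraction of test scans whose predicted class (NCP, CP, Normal) is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```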

B. Results of Exp 1: Comparing different CNN models
The performance comparison between different CNN models, both 3D and 2D, is shown in Table IV, in which the number of slices in the scan data is fixed to 64 and there are two types of inputs that differ in the way the slice images are read.
The results in Table IV show that both 2D and 3D models can achieve relatively good results on our Clean-CC-CCII dataset, which indicates that the computer-aided COVID-19 diagnosis with state-of-the-art DL techniques would be a promising solution. It shows that DenseNet3D121 is one of the best models among all evaluated models as it achieves the best accuracy, precision, sensitivity, specificity, and F1-score, and MC3_18 obtains the highest AUC score when the slices are read as the greyscale images.
In terms of the accuracy of 3D models, the number of input channels affects different network architectures differently. However, regarding the AUC metric, almost all 3D models perform better with greyscale slice images than with RGB images. One can see that the ROC curves in Fig. 6 (a) are higher and more tightly clustered than those in Fig. 6 (b), which indicates that the models trained with greyscale slices are more robust. The main reason is that the original CT slices are greyscale images, and duplicating the greyscale images to RGB images introduces much repetitive and redundant information, which makes model training harder. Regarding the comparison of 2D models and 3D models, we can see that the overall performance of the 3D models is better than that of the 2D models, which is as expected because the convolutional filters in 3D models can better extract the three-dimensional spatial relationship between the slices of the scan data.
We also explore the impact of model depth on model performance as shown in Table V, from which one can see that no model has an absolute advantage on all metrics. Although no significant correlation can be found between model performance and model depth, the results suggest that a smaller model can obtain similar or even better results than a larger one.
Fig. 7 (a) plots the relationship between model accuracy and the number of slices. One can see that only the accuracy of ResNet3D101 increases with the number of slices, while that of the other models does not. However, because the class distribution of our dataset is imbalanced, higher accuracy does not necessarily mean better performance. As Fig. 7 (b) shows, when the number of slices is 64, the AUC of ResNet3D101 is smaller than in the other cases. Besides, Fig. 7 (b) also shows that increasing the number of slices does not always improve the performance. Instead, the models trained on a smaller number of slices can also achieve comparable or even better results.
A possible explanation for this result might be that the original training data can be regarded as a set of scattered points distributed in a high-dimensional space, and the mixup method creates a large number of new data points between the original ones. In this way, the original dataset is expanded to some extent, and the data distribution becomes smoother, which regularizes the model training and improves the model performance.

E. Results of automated model design for COVID-19 detection
We implement two types of NAS experiments. One searches for 21-layer networks and takes 3.7 hours, while the other searches for 41-layer networks and takes 5 hours. Table VII presents the performance comparison between the baseline 3D models and the 3D models found by NAS, namely MNas3DNet. For a fair comparison, the input scan images of all models are composed of 64 greyscale slice images. Compared to the baseline 3D models, our searched models are much smaller: MNas3DNet21 and MNas3DNet41 are 12.34 and 22.91 MB, respectively. At the same time, both models achieve performance competitive with the SOTA baselines. Specifically, MNas3DNet41 achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best models designed by AI experts. These empirical results support the effectiveness of the random search strategy, and demonstrate that NAS is a promising research direction for designing neural networks for detecting COVID-19.

V. CONCLUSION AND FUTURE WORK
In this paper, we benchmark DL models and use AutoML techniques to design DL models for COVID-19 detection using chest CT scans. Our experimental results show that DL models are promising solutions, and that 3D models outperform 2D models. We find that the model performance does not necessarily improve as the model depth or the number of slices increases. In other words, a smaller model trained on fewer slices can also achieve comparable or even better results. Besides, we demonstrate that mixup data augmentation can effectively improve model performance. Last but not least, we design an automated deep learning methodology to generate a lightweight deep learning model, which achieves results comparable to the models designed by AI experts.
We have several directions for future work on the agenda as follows. First, most of the data in our dataset are from China, thus we plan to collect more data from other countries to further improve the accuracy of COVID-19 detection. Second, we will try to apply semantic segmentation technology to our dataset, so as to help doctors diagnose more effectively. Last, we will try other SOTA NAS methods to explore more types of deep learning models.