Analysis of the Number of Tests, the Positivity Rate, and Their Dependency Structure during COVID-19 Pandemic

Background Applying recent advances in medical instruments, information technology, and unprecedented data sharing into COVID-19 research revolutionized medical sciences, and causes some unprecedented analyses, discussions, and models. Methods Modeling of this dependency is done using four classes of copulas: Clayton, Frank, Gumbel, and FGM. The estimation of the parameters of the copulas is obtained using the maximum likelihood method. To evaluate the goodness of fit of the copulas, we calculate AIC. All computations are conducted on Matlab R2015b, R 4.0.3, Maple 2018a, and EasyFit 5.6, and the plots are created on software Matlab R2015b and R 4.0.3. Results As time passes, the number of tests increases, and the positivity rate becomes lower. The epidemic peaks are occasions that violate the stated general rule, due to the early growth of the number of tests. If we divide data of each country into peaks and otherwise, about both of them, the rising number of tests is accompanied by decreasing the positivity rate. Conclusion The positivity rate can be considered representative of the level of the spreading. Approaching zero positivity rate is a good criterion to scale the success of a health care system in fighting against an epidemic. We expect that if the number of tests is great enough, the positivity rate does not depend on the number of tests. Accordingly, the number and accuracy of tests can play a vital role in the quality level of epidemic data.


Introduction
Dr. Li Wenliang, a 34-year-old ophthalmologist, warned his colleagues and set the alarm to the society about a new infection caused by a type of coronavirus in December 2019 in Wuhan, China [1]. Shortly after his warning, all over the world encountered this epidemic. WHO declared this fast speeding infection  in March 2020. As of January 27, 2021, over 100 million cases, and around 2200 K deaths involving COVID-19 have been reported around the world.
The epidemic COVID-19 is the most informative pandemic throughout history. These unprecedented recorded data give rise to some unprecedented concepts, relationships, analyses, discussions, and models [2]. Modeling the dependence between the number of tests and the proportion of positivity (positivity rate) is one of these new issues.
The proportion of positivity is a critical measure because it gives us an indication of how widespread infection is in the area of interest. The proportion of positivity helps public health officials answer questions such as: -What is the current level of SARS-CoV-2 (coronavirus) transmission in the community?
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint 4 -Are we doing enough testing for the people who are getting infected? [3] According to the ratio nature, the high proportion of positivity is due to the high number of positive tests or the low number of total tests. Based on the first possibility, a higher positivity rate suggests higher transmission and that there are likely more people with coronavirus in the community who have not been tested yet. On the other hand, according to the second possibility, a high percentage of positivity means that more testing should probably be done. Accordingly, for policymakers, the high value for this parameter suggests either it is not a good time to relax restrictions aimed at reducing transmission, or it is a good time to add restrictions to slow the spread of disease [3]. In this regard, an analytic report segregated by regions in the UK was presented by the Office for National Statistics [4].
This study aims to investigate the time series of positivity rates individually and together with the time series of the number of tests. This investigation is conducted in two analytic methods: regional and temporal. The individual analysis is mainly undertaken based on the peaks of the spreading of the pandemic (Table 3). For the regional aspect, among the 221 countries, we selected twelve countries: the USA, India, the UK, Italy, Iran, the UAE, Bolivia, Guatemala, Nigeria, Australia, South Korea, and South Africa. The reasons for selecting these twelve countries are -They are the top countries in the influential indices (Table 1).
-Some of them are widely different from the others in some indices (Table 1).
-The numbers and time of peaks are different about them (Table 3).
-Their quality of health care systems are of different levels.
-Their data, especially about the number of tests are relatively well recorded.
-They are selected from all continents: the USA and Guatemala from North America, Bolivia from South America, India, Iran, the UAE, and South Korea from Asia, the UK and Italy from Europe, Nigeria and South Africa from Africa, and Australia from Australia. Finally, to illustrate the dependency of the number of tests and the positivity rates, we apply copulas. Sklar introduced the concept of copulas in 1959 [5]. A copula -mainly parametric, partially semi-parametric, and rarely non-parametric-is a function that completely describes the dependency structure. It contains all the information to link the marginal distributions to their joint distribution. Accordingly, to obtain a valid multivariate distribution function, it suffices to combine several marginal distribution functions with any candidate for the copula function. Thus, for the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint purposes of statistical modeling, it is desirable to have a large collection of copulas at one's disposal. Copula is widely applied in diverse fields, including environmental studies [6 -7] finance [8 -9], hydrology [10], and medical studies [11 -16].

Data
The main data sources of the paper are the website Worldometers [17] and Our World in Data [18]. We summarize and illustrate all the relevant information about the twelve countries in three (twelve-row) tables and three (twelve-partitioned) figures created on Matlab R2015b. Table 1 includes the key general indicators up to January 25, 2021. It is worth saying that the total indicators or even per-million indicators do not determine the quality of health care systems because there are observable underreported statistics about the countries Bolivia, Guatemala, Nigeria, Iran, and even India. Despite the mentioned reality, we consider the indicator of the number of tests per one million (the 7 th column of Table 1) as a criterion representing the level of facilities, therefore the quality of health care systems. Based on the information about this criterion, we define the lags (the distance between the test and diagnosis) for the different health care systems. Table 2 represents the underlying properties of any country. As mentioned before, lag is the difference between the time of testing and the time of receiving the results of the tests, positive or negative, in days. The more facilities a health care system has, the more tests that system can do -therefore the lower positivity rate it has. Also, the more facilities a health care system has, the lower distance is between the tests and results. Based on the concept of lag, we pair the number of tests on the th day with the number of results on the ( + ) th day to obtain the dependency structure by using the copulas. The last column is calculated based on the start date and end date of the period of recording data (the fourth and fifth columns) and the lag (the sixth column), and it displays the number of pairs that we use to obtain the dependency structure for each country.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The numbers in parentheses indicate the rank of countries among 221 countries in the world. For example, Iran has the 18 th population among all countries.
The bold numbers display the ranks of the second half of the global ranking. The ranks 111 to 221 are considered the second half. For example, regarding cases per million, India is a country of the second half.
The light-highlighted cells show that the country is among the highest quarter of the countries based on the relevant parameter. The darker highlighted cells indicate the country is among the top 5% worldwide. For example, the first row illustrates that except for the criterion of the number of tests per million -where it is among the highest quartile-, the USA is of the top 5% in all indicators. Generally, during an epidemic wave, the number of new infected individuals increases rapidly to an epidemic peak and then falls more gradually until the epidemic wave is over, and the number of new cases be stabilized. Roughly speaking, the epidemic peaks are the -neighborhood of-time points that corresponds the local maximum of the number of newly infected cases.

Change point detection
We define the epidemic peak as the time neighborhood -or the time point-that :the number of new confirmed cases on the th day, exceeds the mean plus three times standard deviation of the last three weeks for at least a week, that is, These epidemic peaks are local maximums. In addition, it is noticeable that the distance between two successive epidemic peaks must be at least one month. This definition is derived from the definition of outlier in regression analysis. According to this definition, the peaks of Table 3 are obtained for the countries of interest. It is remarkable that except for the peaks of Bolivia -which are almost the same-, the later waves are more acute than previous ones. We must add this point that the more acute peak means the more number of new confirmed cases, therefore the more intense spreading. Finally, it is possible that because of the lack of information at the beginning, this definition misses the epidemic peaks in the initial days. Mathematically and logically, the number of positive tests (confirmed cases) is affected by the number of tests and positivity rate. The number of cases equals the number of tests multiplied by the positivity rate. Therefore, the increment of the number of cases (as a multiplication) equals the sum of these two: - The number of tests multiplied by the increment of positivity rate, and - The positivity rate multiplied by the increment of the number of tests. Consequently, the intense changes in the count of cases are due to at least a remarkable change in one of these multiplications. About the countries with a regular increase in the number of tests like the USA, the increment of the proportion of positivity plays the principal role in the peaks. Table 3 shows that the proportion of positivity is significantly better than the frequency of tests to indicate the peaks of the pandemic. The positivity rate is more associated with the number of cases than the number of tests (90% versus 45%). After moving average, these proportions reach 100% and 50%, respectively. Countries of the southern and northern hemispheres faced a peak around July and November, respectively, possibly due to falling temperatures.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

S Korea
February to March*** Second half of August** December*** S Africa July*** December and January 2021*** All the dates belong to 2020. Otherwise, the year is mentioned.  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint the period of study for the studied countries (the fourth and fifth columns of Table  2). The vertical axis of Figures 1 and 3 display the number of new tests -conducted on that day-and the proportion of positive tests -reported that day-, respectively. Figure 5 is the plot of the joint distribution of the number of tests on a day and the proportion of positivity on the days later. Figure 1 shows that the peaks of the number of tests coincide with the epidemic peaks of COVID-19 in different countries. For example, in the USA, there are two peaks of the number of tests simultaneous with the second and the third epidemic peaks -mentioned in Table 3-. Also, it is obvious that Bolivia has experienced two peaks for the number of tests around 150 th and 300 th days -from March 19, 2020which coincide with the epidemic peaks in Table 3. The USA, the UK, and the UAE experienced some regularly rising time series. Except for some overruns in epidemic peaks, the patterns of Italy and South Africa are increasing too. The number of tests in Guatemala is increasing, accompanied by an increasing fluctuation. Owing to the restriction by the limited capacity of tests, Iran and Nigeria followed a stepwise trend. Apart from the peaks, one for each of them, the plots of Australia and South Korea are stationary. In the case of Bolivia, the time series is proportional to the peaks. India is the only country whose time series is initially increasing, then stable, and after that decreasing. Generally, the counties have an increasing trend. Figure 2 gives us a clustering about the countries from the viewpoint of the number of tests: 1. The USA, 2. India, 3. The UK, 4. Italy, Australia, and the UAE, 5. South Korea, South Africa, and Iran, and 6. Nigeria, Guatemala, and Bolivia. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint peaks than their analogous in Figure 1. For example, it is clear that the USA has encountered three peaks. It is worthwhile that the graph of Iran has three peaks while the first of them is missed in Table 3 because of the lack of information at the beginning. A similar situation (being missed by investigation of either the number of tests or the number of confirmed cases while discovered by the analysis of the positivity rate) happens to the epidemic peak in India in late March, the first and the second peaks of the UK, and the epidemic peaks in middle May and the November for the UAE. Figure 4 illustrates a clustering of the countries based on the positivity rate: 1. Nigeria, Guatemala, and Bolivia, 2. South Africa, and Iran, 3. The USA, India, the UK, and Italy, and 4. Australia, the UAE, and South Korea. The horizontal and vertical axes of Figure 5 display the number of new tests and the proportion of positivity of them, respectively. Generally, as the number of new tests increases, the positivity rate falls. Since the epidemic peaks are opposing this general rule, it is not very clear to see the opposite direction of the changes. Guatemala, due to lack of epidemic peak, is a good example of this diversely proportional relationship.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  If the reason for an increase be the rising number of tests, we expect not to return the previous channel in short term. In addition, the positivity rate does not undertake a remarkable change. On the other hand, it is normal to assume that entering a peak is accompanied by increasing the number of negative tests as well. Consequently, the lack of the growth of negative test results (rising the positivity rate while continuing the previous trend for the frequency of tests) is only reasonable if at least one of the factors of tests accuracy, testing policy, or the viewpoint of the population were changed around that time. Otherwise, there are a remarkable number of un-reported cases belonging the peak. It is noticeable that this company of risings causes the observed acceleration in growth regarding epidemic peaks. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) where  is introduced as the dependence parameter [5]. Accordingly, Copula is mostly defined as a function Eventually, for twice differentiable function C , 2-increasing property (P2) can be replaced by the condition , otherwise C is asymmetric. The most well-known, powerful, and applicable copulas are: -FGM copula [19][20] ( t s , and -Gumbel copula [23]; The parameters of the marginal and copula distributions are estimated using the maximum likelihood method. The computations and illustrations regarding copula theory are conducted in software Maple, R, and R 4.0.3, Maple 2018a, and EasyFit 5.6.

Copula vs Correlation Coefficient
Measures of dependence are common instruments to summarize a complicated dependence structure in the bivariate case. Pearson's, Spearman's rho, and Kendall's . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 23, 2021.  Table 4. Kendall's tau of copula function

Copula
Parameter space The parameter copula is estimated and the relationship between parameter copula and k  is given in the last column of Table 1. The parameter copula in each case measures the degree of dependence and controls the association between two variables. When the parameter approaches 0 there is no dependence, and if the parameter tends to infinity there is a perfect dependence. Schweizer and Wolff [29] showed that the dependence parameter copula, which characterizes each family of copulas can be related to Kendall's k  . Therefore, copulas allow modeling both linear . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Copula vs Regression
Regression analysis is a statistical method for investigating the relationships between some dependent variables and some independent variables. The basic form of the regression analysis, ordinary least squares is not suitable for some applications because the relationships are often nonlinear and the probability distribution of the response variable may be non-Gaussian.
The major advantage of copula regression is that there are no restrictions on the probability distributions that can be used. The copula regression is the most appropriate method in non-Gaussian (no need for normality assumption) regression model fitting. Copula functions, connecting the marginal distributions to their joint distributions, are useful in simulating the linear or nonlinear relationships among multivariate data. Copula is a multivariate distribution function with marginally uniform random variables on [0, 1] (the PDF of the CDF). Copula functions have some appealing properties such as they allow scale-free measures of dependence and are useful in constructing families of joint distributions.

Results
The presumptions to apply copula theory for a couple of variables are the existence of continuous marginal distributions accompanied with their correlation. Table 5 investigates whether the pair of the frequency of the tests and positivity rate meets the presumptions. The marginal distributions were obtained in EasyFit. It is observable that the generalized Pareto and Weibull distributions had good performance to fit the positivity rates. It is observable that the correlation in countries with the highest number of tests is negative and it is commonly between -0.2 and -0.3. In countries lacking enough tests, the correlation coefficient is significantly greater -possibly due to the low quality of data and under-reporting. It is noticeable that calculation over the data of Bolivia, Iran, and South Africa, lead even to positive correlations.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The highlighted rows indicate that the correlation are not significant for those countries.
Based on Table 5, we are allowed to look for the suitable copula functions to connect the marginal distributions to find the desired joint distributions for nine of the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint countries. Notice that the countries without meaningful correlation (Italy, South Korea, and the UAE) were of the countries with the least proportion of positivity of the tests. These countries have involved with tracing the infected cases. Table 6 represents the results of comparing the best candidates from the FGM, Clayton, Frank, and Gumbel families.
According to Table 6, Clayton copulas are suitable candidates for the countries with low tests per million. In addition, Frank copulas can describe a wide variety of countries. Finally, the Gumbel family seems not to be a good option to couple the variables of the frequency of tests and the positivity rate. We now discuss the simulation of data from the obtained copula models and perform comparisons between correlations in the simulated data and in the observed data based on 1000 simulations. We follow the simulation method proposed by Johnson (1987, Ch.3) and later Nelson (2006, page 41).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  Figure 6 illustrates the scatter plots of the transformed observed data versus simulated samples of the CDFs of the frequency of tests and positivity proportion variables taken from the fitted copula models in Table 6. It can be seen that the simulated data and the original data have similar dependence patterns. To settle this concern, Table 6 shows the rank correlations between the frequency of tests and CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint positivity proportion variables calculated from the original data and the simulated data of size 1000 taken from the fitted copula models. By comparing these correlations, we can conclude that the results show strong consistency of the estimated and real correlations. Finally, we want to investigate the structure of dependency between the number of tests and positivity rate totally. By collecting the data of the twelve countries, 3877 pairs are obtained whose Kendall's correlation is -0.1434 (P-value: 2.8464*10^ −19). In addition, we split the data into two parts: peaks and otherwise. This split restricted us to applying marginal distributions -then copulas_ because it causes the gap in the number of tests. Table 7 represents the Kendall's correlations for the countries of interest. It is worth saying that the correlation coefficient for the variables (the number of tests and positivity rate) is negative in both peaks and otherwise. Light or dark bolded figures indicate that the coefficient correlation is significantly positive or negative, respectively.

Discussion
Generally, at the beginning of an epidemic, the number of tests is low and the proportion of positivity is high. As time passes, the number of tests rises. Also, as the number of new tests increases, the positivity rate falls. The correlation in countries with high number of tests, higher quality of data, is negative and it is commonly between -0.2 and -0.3. By considering all the data as a set, the Kendall's coefficients are -0.1434, -0.2132, and -0.1381 for total, peaks, and total after . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) removing peaks, respectively. The positivity rate of the tests is significantly better than the frequency of tests to indicate the peaks of the pandemic. The positivity rate is more associated with the number of cases than the number of tests (90% versus 45%).
The proportion of positivity is more proportional than the number of tests to the number of infected cases. Approaching zero positivity rate is a good criterion to scale the success of a health care system in fighting against an epidemic. The number and accuracy of tests can play a vital role in the quality level of epidemic data. The policymakers can consider the factors affecting the positivity rate such as the testing policy, restricted facilities, peaks, fluctuations, and so on, and make decisions to prevent misleading because of them.
The first limitation is the low quality of data for some countries because of the restricted facilities, the low number of tests, and non-organized data collection program. Also, some interpolation and moving average methods were applied to find some missing data regarding the countries of interest and calculating the correlation for the countries with poor data. Out of the twelve countries, Iran, South Africa, Nigeria, Bolivia, and Guatemala have been restricted by the number of tests. The data of Italy, the UAE, and South Korea showed no significant correlation. The lack of dependency is a good criterion to show that there is no shortage of facilities. The highest quality and most significant correlations belong to the USA, India, the UK, and Australia.
The present approach using copulas is promising since it allows to take into account a wide range of correlation, frequently observed in medical. In fact, the classical multivariate models cannot reproduce all type of correlations. Moreover, the standard models are limited, especially because the choice of the marginal distributions is restricted. The crucial step in the modeling process is the choice of the copula function, which best fits the data. Further work is needed to choose the best copulas able to reproduce the dependence structure of bivariate medical variables. In clinical trials or medical studies, sample size is often an important consideration and is relatively small. The copula-based methodology overcomes this limitation as well, because the algorithm can be used to replicate data for any number of patients. The suggested copula-based methodology presented in this paper is simple and easy to implement.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint

Ethics declarations -Declaration of interest statement
The authors declare that they have no conflict of interest.

-Ethical statement
The methodology for this study was approved by the Human Research Ethics committee of the Kermanshah University of Medical Sciences.
-Informed consent It is not applicable. This study did not deal with human and animal subjects.
-Consent on publication This is not applicable. The manuscript includes no case study.

Funding
There was no specific funding for this study.

Availability of data Sample statement
The number of confirmed cases The data are available in [17][18], at https://www.worldometers.info/coronavirus/countries https://ourworldindata.org/coronavirus-testing The number of tests The data are available in [18], at https://ourworldindata.org/coronavirus-testing . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted April 23, 2021. ; https://doi.org/10.1101/2021.04.20.21255796 doi: medRxiv preprint