Abstract
We investigate group based testing strategy targeted to identify infected patients by making use of a medical test which equally well applies to single and pooled samples. We demonstrate that, under assumed setting, quick sort grounded testing algorithm allows to reduce average costs, and the reduction is very significant when the infection percentage is low. Although the basic idea of test sampling is known, our major novelty is the rigorous treatment of the model. Another interesting insight following rigorous analysis is that an average number of tests per one individual scales like entropy of the prevalence of infection. One more reason for the paper is the context: taking into account the current situation with the coronavirus, dissemination of renowned ideas and the optimisation of algorithms can be of a great importance and of economical benefit.
1 Introduction
The group testing is devoted to the detection of unknown subset of defective items in a set of objects using the tests in the most efficient way. It turns out that quite often individual testing can be efficiently replaced by testing the groups. In many settings (and, particularly, in a medical one), adoption of such strategy results in the significant time and costs savings and is, therefore, a desirable step.
The concept of group testing was originally introduced by [1] in relation with efficient mass blood testing and later found important applications in molecular biology, quality control, computer science and other fields. In particular, the situation with the current COVID-19 epidemic raises the need of the quick and cheep PCR testing for SARS-COV-2 virus in order to find a rapid exit from the lockdown strategy in many countries. In this paper, we investigate group based testing strategy aimed to identify infected patients by making use of a medical test which equally well applies to single and pooled samples. We show that quick sort grounded (halving) testing algorithm allows to reduce average costs, and the reduction is very significant when the infection percentage is low.
The main objective of the paper is to recall the main schemes of group testing and compare them to each other on the grounds of the rigorous mathematical base. We hope that, taking into account the current situation with the coronavirus, dissemination of renowned ideas and the optimisation of algorithms can be of a great importance and of economical benefit.
The remaining part of the paper is organized as follows. Section 2 provides the description of two classical schemes of testing together with preliminaries. In Section 3, we consider the main algorithm of the paper (Scheme C) together with comparison to the other two schemes. In Section 4, we give an accompanying discussion of the assumptions of our setup and the related literature. Appendices A and B contain mathematical derivations and tables.
2 Preliminaries
Consider the following setup. Assume that the prevalence of some disease (the fraction of the infected individuals) is equal to p ∈ (0, 1) in an infinite (or large enough) population. A cohort, spanning N independent individuals, has to be tested. For this, samples are collected from each individual. The test to be applied performs equally well for individual and for pooled samples: a situation might occur, e.g., when the test indicates the presence of the infection in the blood sample, and there is no difference whether the latter is obtained from a single individual or from a pooled cohort of samples.
For the situation described, physicians can choose different testing strategies. Let us assume that the following are two possible choices.
Scheme A. Test each patient’s sample.
Scheme B. Conduct testing of the pooled sample. Test each member of the cohort only in case of detected infection in the pooled sample.
Although it is not obvious at the first glance, the second choice is more efficient. To give a rigorous justification, let us formally define the underlying model corresponding to Scheme B.
Consider the sample of N individuals. Put Xi = 1, provided the test of i-th individual is positive, and Xi = 0, otherwise. Let S = SN = X1 + … + XN be the total number of infected individuals in the sample and let T = TN be the total number of tests applied to the cohort: the test is applied once if the result is negative, and it is further applied to each of N individuals otherwise, i.e.
Assume that X1, …, XN are independent identically distributed (i.i.d.) random variables, each having Bernoulli distribution Be(p). Then S has Binomial distribution Bin(N, p). Therefore, an average number of tests per cohort is
where q := 1 − p. An average number of tests per individual, say t = t(N), is
Consider a function t: (0, ∞) → (0, ∞) given in (2.2). By equating its derivative to 0, we see that stationary points solve equation
which is a fixed point equation for
, N ∈ (0, ∞), and, hence, can be easily solved iteratively. It is further not difficult to prove that, for p in the region enclosing (0, 0.2), there exists a unique solution Np > 0 of (2.3) which is a minimizer of t(N) (see Proposition A.1 in Appendix B). Then, turning back to economic/biomedical interpretation, we conclude that, having a cohort of ⌊ Np ⌋ (here and in the sequel, ⌊ y ⌋ stands for an integer part of y ∈ ℝ) individuals, Scheme B results in a lowest average number of tests per person which is possible when applying scheme of this type for a population having prevalence p. Scheme A, in contrast, always has a constant number of tests 1 per person. Therefore, the average (absolute) gain attained by applying Scheme B versus Scheme A is given by the difference
Right panel in the Figure 1 shows the graph of p ↦ 100Gp, p ∈ (0, 0.2), which is an average gain measured by the number of tests saved per 100 individuals. The corresponding values are provided in Table 1 (see Appendix B). An accompanying graph of p ↦ Np (see the left panel of the Figure 1) demonstrates dependence of optimal sample size on p. To obtain a fast numerical evidence, assume that N is bounded away from zero and pN ↦ 0. Then from (2.3) it follows that optimal sample size satisfies
Performance of Scheme B.
Performance of Scheme B
Hence, assuming that p is small enough for pN ≈ 0 to hold, the above implies that
For example, if p = 0.01, then we have Gp ≈ 0.8, i.e. an approximate average gain is 80% or so.
3 Main algorithm
In what follows, we retain all the notions and assumptions introduced previously. For the sake of convenience, we additionally assume that the size of the cohort (to be tested) is of the form N = 2n, n ∈ ℕ. Keeping in mind all the said, below is the recommended algorithm for a group based testing when p is low.
Scheme C
Step 1. Test pooled sample of the whole cohort. Proceed to Step 2.
Step 2. If the test is positive, proceed to Step 3, otherwise, finish testing cohort.
Step 3. Divide the cohort into two parts consisting of the first and second halves r espectively. Apply the whole algorithm to the two obtained parts recursively.
The main features of the above testing algorithm are summarized in Proposition 3.1 below.
Proposition 3.1. Assume the testing Scheme C. Then
an average number of tests per person is given by
an average number of tests per person in the case of an infinitely large cohort is
for a fixed p ∈ (0, 1), function t: ℕ 1→ (0, ∞) admits at most two minimizers Np: the value N = Np corresponding to optimal sample size is either
or
.
To shed some light on the described testing scheme, we provide several comments on the performance of Scheme C vs. Scheme A and Scheme B.
Consider first the limit in (ii) of Proposition 3.1. Obviously,
Hence, for q ≈ 1 (or alternatively p ≈ 0), t(∞) < 1. The latter means that, when the prevalence is low, the proposed scheme always outperforms common sequential Scheme A. To gain a quick quantitative insight, assume that p is small enough for pN ≈ 0 to hold. Then turning to (iii) and taking a ‘continuous’ (undiscretized) version of Np equal to yields relationships (see Remark A.1, eq. (A.3))
Therefore, an approximation to an average gain Gp = 1 − t(Np) is 1 + 2p log2 p. Taking, e.g., p = 0.01 results in G0.01 ≈0.867. Considering analogous example given in Section 2 for Scheme B, we see that the gain has an increase close to 7%. In fact, this does not surprise (for a visual comparison of Schemes B and C on the linear and the log-log scale, see Figure 2, and, for the numerical one, Table 1 and Table 2 in Appendix B), since for Scheme B we had and
as p → 0. Equality (3.2), however, exhibits some magic flavor. To see this, note that, for p ≈ 0, entropy Ip of X ∼ Be(p) is asymptotically equivalent to p log2 p since
Performance of Scheme C
Scheme C (red) vs. Scheme B (black) on the linear and the log-log scale.
Consequently, (3.2) means that an average number of tests per one individual scales like entropy of the prevalence of the infection. Keeping in a view the above relationship, it is not surprising that the significant number of works ([2], [3], [4]) have approached the testing problem from the Information Theory perspective.
Another connection with the Information Theory is due to the Quick Sort (QS) algorithm utilized in the Scheme C. It is well known that QS yields the best (up to the constant multiple) possible average performance among comparison based algorithms: to sort an array having N non-constant (i.e., random) items, the smallest average number of comparisons is of the order N ln N [5], and, in fact, all “randomize and conquer” type algorithms (with QS being one among the rest) have expected time asymptotically equivalent to QS, which randomly splits sorted set into two equal subsets [6]. Our formula (3.1) is just a confirmation of this (well known fact). To see this, note that, in the context of sorting task, (3.1) presents an average number of comparisons per item. Though the order is correct, we are inclined to think that the multiplier can be improved by making use of QS modification (or another comparison-based algorithm) designed to sort items with a small number of possible values (in our case, there are just two: “sick” and “healthy”). On the other hand, we think that the order is optimal since though there are algorithms which can beat the order of QS when sorting integers, e.g., [7], [8], they operate under different, i.e., noncomparisons-based, mode. In our case, however, comparison is predefined by the setting of the problem at hand: we assume that biomedical tests can only be carried by making use of comparison.
4 Discussion
Assumptions
At a first glance, independence assumption seems too restrictive. That is, more natural would be to assume that individuals forming the tested cohort are related. However, an application of the proposed procedure is most effective when the prevalence is low (p ≈ 0). In such case, under classical assumption pN → λ > 0, the number of infected individuals SN can be well approximated by the Poisson distribution Pois(pN), and the approximation remains quite accurate irrespectively of the nature of the dependence exhibited by summands (see [9], [10], [11] and references therein for results of this kind with possible extensions of the classical setting).
Another important assumption is that of the constant prevalence p. When the testing is applied in the vibrant pandemic environment, it would be more natural to assume the dependence on time: p = p(t). This, however, requires switching to a more complex process theory model, and it is difficult to say what is more reasonable: to assume validness of our simple setting for a short period of time, or to look for a more complicated yet more realistic model requiring longer and elaborated analysis and suitable data.
Related works
The ideas underlying the testing problem covered in our paper are far from being new. For example, the Scheme B dates back at least to 1943 (see [1]) and then reappears with modifications and further extensions not only in biomedical context but also in the engineering sciences as well: [12], [13], [14]. The Scheme C is currently afresh discussed by Gollier and Gossner [15], Mentus et al. [16] and Shani-Narkiss et al. [17]. For an older reference, discussing the case of nonhomogeneous population (i.e., the one in which the probability of being infected p may vary across individuals) and containing quite a large body of applied literature on halving algorithm corresponding to the Scheme C, we refer to [18]. Our major input is that, in contrast to the mentioned references discussing the procedure and providing instructions suitable for the practical application of Scheme C with a brief theoretical background, we estimate its properties in a rigorous fashion. To our best knowledge, the reference [19] is the only work close to ours both in nature of investigations and results. However, in that paper, the authors focus on the treatment of an asymptotic regime of Scheme C when p → 0.
To finish our short discussion, it is also worth to mention that there is a large body of papers of applied nature, prevalent in medical, engineering and other literature and devoted to pooling in various settings (see, e.g., [20], [21] among others). Writing the paper, we aimed to recall the ideas of group testing, highlight directions for improvements, and promote dissemination which seems to be of most importance in the present context forcing a burst of papers devoted to similar problems (see, e.g., [22], [23], [24], [25], [26], [27], [28]). Finally, note that some countries have already successfully applied pooling methodology for testing of the SARS-CoV-2 virus.1
Data Availability
Data sharing is not applicable to this article as no new data were created or analysed in this study.
Appendix
A Appendix. Technical details
Proof of Proposition 3.1. (i) For 1 ≤ i ≤ j ≤ N = 2n, let Mij = {Xi, Xi+1,…, Xj} and let be the number of infected individuals in the cohort Mij. Denote T (i, j) = the total number of tests applied to the cohort Mij after the initial pooled test. By the description of the testing Scheme C,
Taking expectations yields
hence, the first equality in (i). For the second one, take the last sum above and continue as follow
(ii) By (A.1),
(iii) Since N = N(n) = 2n, by the second equality in (A.2),
Clearly, q2N ↓0 as N → ∞. Therefore, there exist no more than two Np ∈ ℕ such that ∀ N ≤ Np Δn ≤ 0, ∀ N ≥ Np Δn ≥0, and t(Np) attains its minimal value at Np. To find Np, we first solve 1 − 2q2N = 0 with respect to N, and then choose from the two nearest integers (i.e., ⌊N ⌋, ⌊N + 1) the one which minimizes tN.□
Remark A.1. Note that, if N ⩾ 1 and pN → 0, then for t(N) in (3.1) it holds
To see this, it suffices to use the following bounds
□
The next proposition justifies the discussion on Scheme B presented in the introductory Section 2.
Proposition A.1. Let be fixed. Consider function
, N > 0. It admits a unique fixed point Np which minimizes t(N) given by (2.2).
Proof. Step 1 (fixed points). Define
Then . Note that f ′ (N) → −∞ as N → ∞. Moreover, f ′ (0) > 0 since
which is satisfied for any p ∈ A. Therefore, f attains maximal value at
Since ln . On the other hand,
and
. Consequently, f has two zeroes: Np ∈ (0, Nmax) and
. The latter means that g has two fixed points.
Step 2 (minimizer). In this step, we show that Np from Step 1 is the minimizer for t(N) given in (2.2). By (2.3),
Hence,
From Step 1 (see (A.5)) it follows that , i.e. v/2 ϵ 2 (0, Nmax). Therefore, (A.6) holds if and only if f (v/2) > 0. The latter reads as
and is equivalent to
showing that Np (being the critical point of t) is indeed the announced minimizer. Finally, note that the above analysis also implies that
from Step 1 is the maximizer of t(N) which affirms the uniqueness of the minimizer.□
Remark A.2. One can also show that p ↦ Np is strictly decreasing and continuous on A. However, the latter properties seem to be of less importance and we omit the details.□
B Appendix. Tables
In the tables below, the following information is provided:
Column ‘Np’ shows an optimal sample size corresponding to p ranging in an interval given in the column ‘Range of p’.
Column ‘Range of 100Gp’ shows an average gain (as defined in the main body of the paper) per 100 individuals corresponding to values of p and Np given in the two leading columns. The highest gain corresponds to the lowest p in the corresponding interval. For example, in Table 1, the first line should be interpreted as follows: for p ∈ [0.1865, 0.2000], optimal sample size Np is equal to 2; if p = 0.2000, then average gain per 100 individuals 100Gp is equal to 16.1782; if p = 0.1865, then 100Gp = 14.000; for intermediate values of p, the value of 100Gp lies in [14.0000, 16.1782].
Footnotes
↵* Author is supported by grant S-COV-20-4 from the Research Council of Lithuania
↵1 According to Wikipedia, “In Israel, researchers at Technion and Rambam Hospital developed and tested a method for testing samples from 64 patients simultaneously, by pooling the samples and only testing further if the combined sample is found to be positive. Pool testing was then adopted in Israel, Germany, South Korea, and Nebraska, and the Indian states Uttar Pradesh, West Bengal, Punjab, Chhattisgarh, and Maharashtra.” [29]