## Abstract

Large scale screening is a critical tool in the life sciences, but is often limited by reagents, samples, or cost. An important challenge in screening has recently manifested in the ongoing effort to achieve widespread testing for individuals with SARS-CoV-2 infection in the face of substantial resource constraints. Group testing methods utilize constrained testing resources more efficiently by pooling specimens together, potentially allowing larger populations to be screened with fewer tests. A key challenge in group testing is to design an effective pooling strategy. The global nature of the ongoing pandemic calls for something simple (to aid implementation) and flexible (to tailor for settings with differing needs) that remains efficient. Here we propose HYPER, a new group testing method based on hypergraph factorizations. We provide characterizations under a general theoretical model, and exhaustively evaluate HYPER and proposed alternatives for SARS-CoV-2 screening under realistic simulations of epidemic spread and within-host viral kinetics. We demonstrate that HYPER performs at least as well as other methods in scenarios that are well-suited to each method, while outperforming those methods across a broad range of resource-constrained environments. It is also simpler and more flexible in design, and requires no special expertise to implement. An online tool to implement these designs in the lab is available at `http://hyper.covid19-analysis.org`.

## Introduction

Biological screens that identify members of a large population with a disease have become invaluable tools for disease diagnosis and surveillance. When these screens are difficult to conduct or resources are limited, finding an efficient way to conduct the screen becomes critical. As such, widespread, scalable and frequent testing is a defining challenge in combatting COVID-19 in the face of local, national and global resource constraints. Pooled testing has recently arisen as a promising efficient scientific solution to the world-wide challenge of increasing SARS-CoV-2 testing capacity^{1–14}, encouraged in part by the finding that a single positive sample can be reliably detected by RT-qPCR in large pools ^{15}.

One approach to achieve efficiency is to design the screen in such a way that leverages structure in the tested population ^{16}. This idea dates back at least to the seminal work of Dorfman^{17}, which proposed testing *pools* of samples when there is prior knowledge that the vast majority of samples will test negative. Dorfman testing is a two-stage approach with each individual assigned to exactly one pool. A negative test result for a pool at the first stage can be applied to all its members, eliminating the need to test them individually and potentially greatly increasing efficiency, depending on the pool size and prevalence of positive members of the population. A great strength of this approach is its simplicity and robustness in laboratory implementation. Pools are easy to form and putative positives are simply those individuals contained in a positive pool. Several early methods ^{3–5;18} for COVID-19 pooled testing focus on Dorfman testing. However, it is well-known that Dorfman testing can have sub-optimal efficiency^{7;8;13;19}; alternative designs use tests more efficiently and can thus screen more individuals, especially in the face of major resource constraints.

There has been tremendous study and progress on pooled testing (also called group testing or specimen pooling) in general. Numerous works provide statistical ^{20–24}, combinatorial^{25–29}, and information theoretic^{2;30–38} perspectives, as well as software ^{39;40} to aid implementation, to name just a few. Additionally, there has been a lot of work on analyzing and optimizing group testing for various constraints and evaluation criteria ^{41–51}, often in the low prevalence regime. Broadly speaking, the approaches fall into three categories: i) single-stage (or nonadaptive) approaches that identify positive individuals after only one round of pooled tests by using pools with carefully designed overlaps; ii) two-stage ^{50–53} approaches that declare putative positives after one round of pooled tests. The putative positives are then individually tested in the second round. Finally, iii) multi-stage ^{52;54} (or adaptive/hierarchical) approaches that carry out multiple rounds of pooled tests with pools at each round chosen based on previous rounds.

Many recent methods ^{6–13;55} for COVID-19 pooled testing (more than can be reviewed here) seek greater efficiency by splitting samples into multiple pools and using a one-, two- or multi-stage approach. One such method is P-BEST^{6}, which is a single-stage approach designed for a prevalence around 1% that splits each of 384 individuals into six of 48 partially overlapping pools. The pool assignments are based on a Reed-Solomon error correcting code that enables identification from the single round of tests and provides robustness against, e.g., independent PCR failures. Positive individuals are identified by running a specialized decoding algorithm based on sparse regression. A popular two-stage method is array pooling ^{7}, which arranges individuals into either an 8 × 12 or 16 × 24 grid (corresponding to plate sizes common in laboratory environments), then takes each column and each row to be a pool, resulting in 20 pools for 96 individuals or 40 pools for 384 individuals, respectively. Each individual is split into two pools and is a putative positive if both those pools test positive. This approach is potentially more efficient than Dorfman testing while retaining some simplicity. Moreover, it is well-suited for clinics that already store samples in these grids, especially if they have multi-channel pipettes that make row/column pooling convenient. Another method extends this idea to samples arranged in a *q*-dimensional hypercube ^{8} with sides of length three. The approach is multi-stage, with each of 3^{q} individuals placed into *q* “slice” pools (chosen from 3*q* total pools) in the first round, followed by adaptively formed smaller hypercubes. For increasing *q*, this design covers numerous individuals using only a few tests in the first round, and is highly efficient when prevalence is low. Together, the proposed pooling strategies offer several alternatives that address the urgent, global need for efficient screening.

However, given the global nature of the pandemic, there is a wide variety of settings with differing needs and constraints, in which the proposed combinatorial designs may have limited utility^{19}. Designs that split samples into more than two or three pools, e.g., P-BEST and hypercubes, can be time-consuming and error-prone to execute by hand, making them best-suited for well-resourced labs with robotic-pipetting platforms. The decoding algorithm used by P-BEST and the multi-stage logistics of hypercubes can also make them nontrivial to implement without prior experience, expertise, and substantial lab infrastructure. Moreover, many of the proposed designs are somewhat rigid and are not trivially adapted to environments that can have widely varying numbers of individuals to test, available test kits, or prevalence of positive results. Array pooling with 150 individuals, for example, might be done using a partially-filled 16 × 24 grid or perhaps two 8 × 12 grids with one or both partially-filled, but it is unclear which to choose and whether either remains efficient under variable prevalence. For SARS-CoV-2 screening in resource-limited settings, adapting to these various conditions is critical to achieving the greatest effectiveness ^{19}. Therefore, for this application along with broader applications of pooled screening, there remains an outstanding need for a robust, simple, flexible strategy with performance that matches (or exceeds) the effectiveness of specialized strategies, and can be applied in diverse environments without special equipment or expertise.

We propose HYPER, a two-stage pooling strategy based on the combinatorics of hypergraph factorizations. In this approach, individuals are assigned to pools by cycling through a carefully ordered sequence, given by a combinatorial construction. While the underlying mathematics is sophisticated, the resulting pools are simple to implement by hand (individuals are split at most three ways), and putative positives can be easily identified with only pencil and paper. We also provide an online tool available at http://hyper.covid19-analysis.org to facilitate implementation. The design accommodates any number of individuals while maintaining robustness, balance and efficiency. One simply cycles through the sequence until everyone is assigned. We characterize its performance through theoretical analysis, as well as through realistic simulations that model both viral kinetics and epidemic spread. Our findings demonstrate that HYPER is not only simple to implement and decode, but is also very efficient. Across a broad range of resource-constrained environments, it outperforms both array and P-BEST designs even in the scenarios where those methods excel in efficiency. Furthermore, in some scenarios we consider, it screens over three times as many effective individuals as both existing methods and 14 times as many as individual testing. Its flexibility, in particular, makes it easy to adapt to each scenario. For SARS-CoV-2 and beyond, HYPER represents a valuable addition to the growing toolbox for performing large-scale, pooled screens.

## Results

### Limitations of existing designs and the need for balanced designs that are simple and flexible

A pooling design is an assignment of each of *n* individuals (or more generally, samples) to one or several of *m* pools. We seek a simple and flexible pooling design that is balanced in the following natural ways:

1. All individuals are assigned to the same number *q* of pools; we focus on *q* = 1, 2, 3 for practical implementation.
2. The *m* pools are assigned as evenly as possible, meaning that the sizes of the pools are as close as possible to equal.
3. The possible pool combinations are assigned as evenly as possible.

Similar conditions have been widely studied in group testing, see the references above. However, as explained below, methods that satisfy our constraints (*q* ≤ 3, near-optimal efficiency and noise tolerance, at most two stages) are not available. The above balance conditions encourage the design to maximally utilize the overall pooling resources available, while allowing for robust implementation and quality control. For instance, the volume used for each sample, or produced for each pool and combination of pools, should be balanced. This also ensures equitable treatment of each individual sample. Ensuring uniform splits *q* and balanced pools is especially relevant for uniform treatment of the individuals since both affect the impact of dilution on each individual. Maximally balancing the pool combinations can be helpful for efficiency. As a simple example, suppose one individual is positive. After pooled testing, that individual will be identified as a putative positive and needs to be retested, *along with any others assigned to the same pool combination*. With imbalanced pool combinations, unless we are lucky, the positive individual may be assigned to one of the more frequently assigned combinations. This in turn would lead to an excess number of putative positives and additional tests, i.e., worse efficiency.

For *q* = 1, corresponding to classical Dorfman testing, each individual needs to be assigned to a single pool. Balance can be easily achieved here by cycling through the pools until all individuals are assigned. For example, to assign *n* = 8 individuals to *m* = 6 pools (A-F), assign individual 1 to pool A, individual 2 to B, and so on, yielding the following assignments:

| Individual | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Pool | A | B | C | D | E | F | A | B |

All the *n* = 8 individuals are assigned to *q* = 1 pool, the *m* = 6 pools are assigned as evenly as possible (A and B are assigned twice and the rest are assigned once), and pool “combinations” are just the pools themselves when *q* = 1.
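The cycling rule for *q* = 1 can be sketched in a few lines of Python. The function name `assign_single_pool` and the use of letters as pool labels are illustrative assumptions, not part of the method's published tooling.

```python
from collections import Counter
from string import ascii_uppercase

def assign_single_pool(n, m):
    """Assign individuals 1..n to one pool each by cycling through m pools.

    Pools are labeled A, B, C, ... (an assumption that limits m to 26 here).
    """
    pools = ascii_uppercase[:m]
    return {i + 1: pools[i % m] for i in range(n)}

assignment = assign_single_pool(n=8, m=6)
pool_sizes = Counter(assignment.values())

print(assignment)        # individual -> pool
print(dict(pool_sizes))  # pool -> number of individuals
```

With *n* = 8 and *m* = 6, pools A and B receive two individuals each and the remaining pools receive one, so pool sizes differ by at most one.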

Creating maximally balanced designs for *q* > 1, and especially *q* > 2, can be much harder. A straightforward approach of listing the pools and assigning individuals to consecutive pairs (e.g., AB, CD, EF, AB, …) underutilizes the combinatorial space (for instance, it does not use AC). At the same time, cycling over all pairs sorted in lexicographic order is likely to be unbalanced in multiple ways. For instance, using all pairs in lexicographic order for *n* = 8 individuals assigned to *q* = 2 of *m* = 6 pools yields

| Individual | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Pools | AB | AC | AD | AE | AF | BC | BD | BE |

which assigns A five times and F only once. As a result, this design is *not* uniformly sensitive across individuals. For example, individual 1 undergoes a five-fold dilution in pool A and a four-fold dilution in pool B, while individual 5 is diluted in pool A but not at all in pool F. Uneven dilution of viral loads can lead to differential false negative test results, leaving individual 1 with a higher risk of false negatives than individual 5. Clearly, alternative orderings could be more balanced, and the challenge is how to systematically identify these. One approach is to generate a random design ^{19;56;57}, e.g., by assigning each individual to *q* of *m* pools independently and uniformly at random. Doing so treats both pools and pool combinations uniformly “on average”, but any given random draw may assign them very unevenly in practice, especially when *m, n, q* are not very large. For the example above, this approach draws one of the possible designs uniformly at random, so pool sizes (subjects per pool) can range all the way from zero to *n*. Likewise for pool combinations. In fact, 98.7% of the random draws in this case will not be maximally balanced, i.e., the pools or pool combinations are not assigned as evenly as possible, as can be verified by exhaustive search. In other words, it is rare for randomly generated designs to both treat individuals uniformly, and maximally use all available pool combinations. With high probability, selecting a design at random does not achieve our goal. With increasing *m, n, q*, the probability decreases exponentially, making random search practically very expensive for sufficiently large designs.
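The imbalance of the lexicographic ordering is easy to check numerically. The sketch below counts how often each pool and each pool combination is used; "as evenly as possible" is taken to mean that the largest and smallest counts differ by at most one. The helper names `assign_in_order` and `spread` are hypothetical.

```python
from collections import Counter
from itertools import combinations, cycle, islice

def assign_in_order(n, sequence):
    """Assign individuals 1..n by cycling through a sequence of pool combinations."""
    return dict(enumerate(islice(cycle(sequence), n), start=1))

def spread(counts, universe):
    """Largest minus smallest count over all items (0 or 1 means as even as possible)."""
    values = [counts.get(item, 0) for item in universe]
    return max(values) - min(values)

pools = "ABCDEF"
all_pairs = list(combinations(pools, 2))  # lexicographic: AB, AC, AD, ...
design = assign_in_order(8, all_pairs)

pool_counts = Counter(p for combo in design.values() for p in combo)
combo_counts = Counter(design.values())

print(dict(pool_counts))
print("pool spread:", spread(pool_counts, pools))
print("combination spread:", spread(combo_counts, all_pairs))
```

Here the combinations themselves are used evenly (each at most once), but the pool sizes range from one (F) to five (A), so the design fails the second balance condition.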

We can also consider searching among many random draws for a suitable design. In the small example above, it would take 180 draws to have a 90% chance of drawing at least one maximally balanced design. One might try selecting then modifying the most balanced among many draws, but such an approach is ad-hoc, involves manual tweaking, and may still not be maximally balanced.

An exhaustive search, on the other hand, is systematic but practical only for very small cases: the number of possible designs to search through is, in general, intractably large. Moreover, one must repeat the process for any change in the number of individuals *n*, number of splits *q*, or number of pools *m*; the design is not easily adapted from one environment to another. P-BEST^{6} attains balanced pools and pool combinations with a Reed-Solomon error-correcting design that requires technical expertise to adapt, and may only remain balanced in limited settings. Array designs ^{7} can also be adapted to have balanced pools (by requiring the array to be a square), but they do not use all available pool combinations and can have sub-optimal efficiency in some of our simulation settings.

As a result, though it is quite natural to desire maximally balanced designs, the problem of easily generating them is highly nontrivial. In fact, it is not initially clear that designs maximally balanced in all three ways even exist in general, let alone that there is an efficient way to find them.

Here we show how certain deep results from combinatorics, such as Baranyai’s theorem ^{58}, imply that such designs do in fact exist, under the minimal required conditions. Moreover, we demonstrate how to generate them under mild conditions using the combinatorics of hypergraph factorizations. For *q* = 3, we leverage a non-trivial number-theoretic construction due to Beth ^{59;60}. These tools enable us to develop a simple, flexible and efficient pooled testing strategy with maximally balanced pool designs.

#### HYPER pooling method

We propose HYPER, a two-stage pooling strategy that uses maximally balanced pools built on hypergraph factorizations. Stage one consists of *pooled testing* to identify putative positives, who are tested individually in stage two. For convenience, we use H_{n,m,q} to denote the HYPER design with *n* individuals per batch, *m* pools, and *q* splits. We illustrate the details via a small example with *n* = 12 individuals each split into *q* = 2 out of *m* = 6 pools (Fig. 1a).

First, to create pools for stage one, we assign each of the *n* individuals to *q* of the *m* pools by cycling through an ordering of the possible pool combinations generated via hypergraph factorization (as described below). If *q* = 1, this sequence of possible pool assignments would consist of the *m* pools in an arbitrary order. In our more involved example (Fig. 1a), the sequence reads AB, CD, etc., so individual 1 is assigned to pools A and B, individual 2 to C and D, and so on. In this case, we did not use all pairs in the sequence since there were only *n* = 12 individuals. However, if there were *n* > 15 individuals, we would cycle through the sequence again until all individuals were assigned. Hence, once we have the sequence, the pools are simple and flexible to form for any number of individuals *n*.

Next, we test each pool and *decode* the test results to identify putative positives. An individual found only in positive pools is a putative positive. Put another way, putative positives are the individuals left over after removing all individuals in negative pools. This is easily figured out with only pencil and paper, or a basic spreadsheet for larger designs (and one can also use our web implementation). In our example (Fig. 1a), pools A, E, F test negative so only individuals 2, 4 and 7 are putative positive. These samples are then individually tested in stage two.

Note that this way of decoding pooled testing results does not correct for any pooled tests with false negative results, e.g., as is done in P-BEST^{6}. Error correction may be incorporated by introducing a tolerance so that individuals in just one or two negative pools are still considered putative positives ^{19}. Doing so can improve the overall sensitivity of HYPER. However, an important source of false negatives is the dilution of viral loads below the limit of detection (which is a major concern). An individual with small viral load is hence likely to yield false negatives in most of its pools, in which case error-correction may not significantly improve sensitivity. Meanwhile, it can dramatically reduce efficiency. In our example (Fig. 1a), allowing a tolerance of one negative pool would turn individuals 1, 5, and 9-12 into putative positives. Labs must consider how to properly weigh these tradeoffs according to their specific needs. Here we will focus primarily on an error tolerance of zero, i.e., no error-correction of false negatives.
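The decoding rule, with and without a tolerance for negative pools, can be sketched as follows. The ordering of pool pairs used here is a hypothetical one chosen to agree with the worked example in the text; the exact sequence shown in Fig. 1a is not reproduced here, and the function name `putative_positives` is an assumption.

```python
def putative_positives(design, negative_pools, tolerance=0):
    """Return the individuals that sit in at most `tolerance` negative pools."""
    negative = set(negative_pools)
    return sorted(
        person
        for person, combo in design.items()
        if len(negative.intersection(combo)) <= tolerance
    )

# Hypothetical ordering of pool pairs for n = 12, m = 6, q = 2 (assumption).
sequence = ["AB", "CD", "EF", "BC", "DF", "AE",
            "BD", "AF", "CE", "BE", "CF", "AD"]
design = {i: combo for i, combo in enumerate(sequence, start=1)}

strict = putative_positives(design, negative_pools="AEF")
tolerant = putative_positives(design, negative_pools="AEF", tolerance=1)

print(strict)    # no error correction
print(tolerant)  # individuals in at most one negative pool
```

With pools A, E and F negative, the strict rule leaves three putative positives, while a tolerance of one triples the number of stage-two tests, which illustrates the efficiency cost of error correction discussed above.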

The only missing piece now is how we use hypergraph factorization to create the sequence of pool combinations for *q* = 2 and *q* = 3. Here we provide a brief overview, with further details of the construction provided in Supplementary Methods. The idea is to order the combinations so that there is no overlap among the combinations in each consecutive block of *m/q* combinations (which is in fact the formal definition of a “hypergraph factorization”, as we discuss below). In our example (Fig. 1a), the first *m/q* = 3 pairs (AB, CD, EF) do not overlap, as with the next three (BC, DF, AE), and so on. This property keeps pool sizes maximally balanced, and since each combination appears once, the combinations are balanced as well. Note that no pair is repeated and all pools contain 4 individuals. If we instead had only 11 individuals, the pools would no longer be perfectly even, since pools A and D would have only 3 individuals, but they would still be as even as possible. The same holds for any number of individuals.

Our task, therefore, is to divide the combinations into subsets of combinations, where each pool appears exactly once in each subset. As illustrated in Fig. 1a, we can think of each pool as one of *m* vertices in a *hypergraph*, where each *hyperedge* is a set of *q* of the *m* vertices (pools). Each hyperedge will correspond to a potential set of pools into which any individual sample is split. Our example has $\binom{6}{2} = 15$ hyperedges (blue edges in Fig. 1a). Putting together all $\binom{m}{q}$ hyperedges forms the so-called complete hypergraph of order *q* on *m* vertices, $K_m^q$ ($K_6^2$ in Fig. 1a). Any subset of the hyperedges that uses all vertices exactly once is called a 1-*factor*. Restated in these terms, our task is to divide the hyperedges of $K_m^q$ into disjoint 1-factors. This is known in combinatorics as *hypergraph factorization*. In our example (Fig. 1a), we show a factorization of $K_6^2$ into five disjoint 1-factors, the first of which consists of hyperedges AB, CD, and EF. Hypergraph factorization has been the subject of intense study ^{27;28}. In particular, a deep result from combinatorics known as Baranyai’s theorem^{58} shows that factorizations *exist* under minimal conditions. Moreover, there are also efficient algorithms to construct them up to *q* = 3 under mild conditions, see in particular ^{59–61}. More generally, and beyond what we need, the efficient construction of hypergraph factorizations is a challenging combinatorial problem.
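For *q* = 2, one classical way to obtain a factorization into 1-factors is the round-robin (circle) method used for tournament scheduling. The sketch below uses that method purely as an illustration; it is not claimed to reproduce the ordering in Fig. 1a or the construction in Supplementary Methods. Pools are labeled 0 to *m* − 1 and the function names are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice

def one_factorization(m):
    """Split all pairs of m pools (m even) into m - 1 rounds of m / 2 disjoint pairs."""
    if m % 2:
        raise ValueError("m must be even for q = 2")
    fixed, ring = m - 1, m - 1  # pool m-1 stays fixed; pools 0..m-2 rotate
    factors = []
    for r in range(ring):
        factor = [tuple(sorted((fixed, r)))]
        for i in range(1, m // 2):
            factor.append(tuple(sorted(((r + i) % ring, (r - i) % ring))))
        factors.append(factor)
    return factors

def cyclic_design(n, m):
    """Assign n individuals by cycling through the 1-factors in order."""
    sequence = [pair for factor in one_factorization(m) for pair in factor]
    return list(islice(cycle(sequence), n))

for n in (12, 11):
    sizes = Counter(p for pair in cyclic_design(n, 6) for p in pair)
    print(n, sorted(sizes.values()))
```

Because every block of *m*/2 consecutive pairs covers each pool exactly once, pool sizes never differ by more than one, whatever the value of *n*.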

The mathematical construction of hypergraph factorizations is somewhat sophisticated. However, the output is easy to use. We provide an online tool available at http://hyper.covid19-analysis.org that implements constructions for *q* = 2 (as long as *m* is even) and for *q* = 3 (as long as *m* is a multiple of 6 and *m* − 1 is a prime number; which covers a wide range of settings). We describe the construction details in Supplementary Methods. Using this, forming the HYPER pools and identifying putative positives becomes both simple and flexible.
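The conditions on *m* quoted above are simple to check before choosing a design. The following sketch encodes them as stated in the text; the function name `construction_available` is an assumption and does not refer to the online tool's actual interface.

```python
def is_prime(k):
    """Trial-division primality test, adequate for small pool counts."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def construction_available(m, q):
    """Check the conditions on m quoted in the text for q = 1, 2, 3."""
    if q == 1:
        return m >= 1                      # any ordering of the pools works
    if q == 2:
        return m >= 2 and m % 2 == 0       # m must be even
    if q == 3:
        return m % 6 == 0 and is_prime(m - 1)
    return False

valid_m_for_q3 = [m for m in range(6, 61) if construction_available(m, 3)]
print(valid_m_for_q3)
```

For example, *m* = 24 qualifies for *q* = 3 because 24 is a multiple of 6 and 23 is prime, whereas *m* = 36 does not because 35 is composite.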

#### Theoretical performance characterization

To analyze the performance of HYPER, we consider a standard statistical model for group testing ^{17;20–24}. We assume the disease prevalence is *p*, i.e., each individual is positive independently at random with probability *p*. In addition, each test has specificity 1 − *α* and sensitivity *β*. Here we present only the key results; see Supplementary Methods for more details. This setting is different from the COVID-19 model we use in our simulations in the next section that accounts for the features of COVID-19. We emphasize that our method also extends beyond COVID-19, and thus it is valuable to have performance guarantees that capture more general settings.

For this analysis we suppose *n* is a multiple of *m/q*, and let *k* = *nq/m* be the number of subjects per pool. This is not a significant restriction as *m/q* is typically a small integer in the examples we are interested in (e.g., in some simulations below we have *m* = 24 with *q* = 3). Denote *r* = 1 − *p* and Then, we can show that the expected number 𝔼 (*T*) of tests *T* used by HYPER (including the tests from the second stage) is upper bounded as

Where

This inequality becomes an equality when *q* = 1, 2, so the above formula characterizes the expected number of tests used by HYPER (Supplementary Methods). In fact, we derive a more general and sharper version of this upper bound valid for all *q*, using the Dawson-Sankoff inequality^{62;63}, which is a non-trivial generalization of the Bonferroni union-intersection inequalities from probability theory.

Using this, when *q* = 2 and we are in the noiseless large *n* case, i.e., sensitivity and specificity are close to 1 and sample size is large (*α* = 0, *β* = 1, and *n* → ∞), and the prevalence is small, i.e., *p* → 0, we show that the optimal number of pools *m* is approximately

$$m \approx \left(2p^{2/3} - p\right) n.$$

The corresponding approximate efficiency is 𝔼 (*T*)*/n* ≈ 3*p*^{2/3} (including stage two tests; Supplementary Figs. 1a and 1b). This shows that HYPER has a better efficiency than Dorfman designs, which have efficiency 2*p*^{1/2} when the prevalence is small as *p* → 0^{20}. In fact, its efficiency is asymptotically the same as three-stage Dorfman testing^{20;56}. While there are more efficient designs available in the literature, they rely either on multiple rounds ^{8;20} or on taking *q* much larger^{41–49}; each of which is outside of the constraints that we work with. See Supplementary Methods for more discussion.

For HYPER with *q* ≤ 2 and , the overall average specificity is 1 − *αγ* and the overall average sensitivity is *β*^{q+1}, where *γ* = [*β* + (*α* − *β*) *· r*^{k−1}]^{q} (Supplementary Fig. 1). The false negative probability (the probability that a sample is positive given that they were declared negative) of HYPER is

The true positive probability (the probability that a sample is positive given that they were declared positive) of HYPER is

Here *o* = (1 − *p*)*/p* is the odds ratio of prevalence. In this noisy group testing model, sensitivity and specificity calculations have been reported in Bilder’s work for array designs, see e.g., ^{21}; but it is beyond our scope to do a full comparison. Note also that the sensitivity of HYPER under this theoretical model can be improved via error-correction of false negatives in stage one, e.g., by including individuals in only one negative pool among the putative positives ^{19}. However, doing so can come at the cost of efficiency, and analyzing this tradeoff is beyond our present scope.
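The quantities in this subsection can be evaluated directly. The sketch below computes *γ*, the overall sensitivity *β*^{q+1} and the overall specificity 1 − *αγ* as stated in the text, and then obtains the two posterior probabilities by applying Bayes' rule to them. The Bayes step is an assumption about how these probabilities are computed, not a transcription of the paper's displayed formulas, and the function name is hypothetical.

```python
def hyper_accuracy(p, k, q, alpha, beta):
    """Sensitivity, specificity and posterior probabilities under the noisy model.

    p: prevalence, k: individuals per pool, q: splits,
    alpha: per-test false positive rate, beta: per-test sensitivity.
    """
    r = 1.0 - p
    gamma = (beta + (alpha - beta) * r ** (k - 1)) ** q
    sensitivity = beta ** (q + 1)
    specificity = 1.0 - alpha * gamma

    # Bayes' rule (assumption): P(positive | declared negative / positive).
    missed = p * (1.0 - sensitivity)
    cleared = r * specificity
    found = p * sensitivity
    false_alarm = r * (1.0 - specificity)
    return {
        "gamma": gamma,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_given_declared_negative": missed / (missed + cleared),
        "positive_given_declared_positive": found / (found + false_alarm),
    }

result = hyper_accuracy(p=0.01, k=12, q=2, alpha=0.01, beta=0.9)
for name, value in result.items():
    print(name, round(value, 6))
```

Written in terms of the odds *o* = (1 − *p*)/*p*, the same two posteriors are 1/(1 + *o*·specificity/(1 − sensitivity)) and 1/(1 + *o*·(1 − specificity)/sensitivity), which shows how strongly low prevalence (large *o*) affects the interpretation of a positive result.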

To summarize, these results characterize the expected efficiency, sensitivity and specificity under the standard theoretical model where tests have independent errors. This model captures important features of many applications beyond COVID-19 testing, and provides a useful setting to evaluate methods. Here we found that the efficiency of HYPER at low prevalence is competitive with other existing methods of comparable simplicity. Based on the specific application, the analysis can also be used to help guide the selection of HYPER designs that strike the appropriate balance of efficiency, sensitivity and specificity. For example, one might use a small number of splits *q* in settings where sensitivity loss due to independent test errors is of primary concern. On the other hand, if test errors are negligible, the best efficiency (for *q* = 2 splits) at low prevalence may be obtained by using *m/n* ≈ 2*p*^{2/3} − *p* pools per individual. Finally, the formulas for false negative and true positive probabilities help guide the interpretation of negative and positive results, respectively, declared by HYPER.

#### Performance under a COVID-19 model

We study the performance of HYPER under the viral load based COVID-19 model of Cleary and Hay et al. ^{19}. We simulate: a) SARS-CoV-2 viral load kinetics in infected individuals; b) the dilution of viral loads during pooling that can lead to false negatives; and c) the evolution of infection prevalence in a large population over time during epidemic growth and decline. We focus here on a window during which the prevalence increases exponentially from 0.03% to 2.46% (days 40–90 in our simulation; we consider other windows below) and individual testing has a sensitivity (i.e., fraction of positive individuals that are identified) of roughly 85% (Fig. 2). We compare HYPER with two existing state-of-the-art methods (Fig. 2a): array designs ^{7} and P-BEST^{6}. These methods use batches of *n* = 96 individuals (8 × 12 array; left panels) or *n* = 384 individuals (16 × 24 array and P-BEST; right panels). We consider both efficiency gain with respect to individual testing (number of individuals screened divided by average number of tests used, including stage-two tests) and average sensitivity (fraction of positive individuals identified after completion of all stages) of the methods.

For *n* = 96 (Fig. 2a, left panels), we consider the 8 × 12 array and a HYPER design, H_{96,16,2}, chosen to dilute samples a similar amount and thus potentially achieve a similar sensitivity. Our simulation shows that both methods indeed have similar sensitivity, which is about 10 percentage points lower than the sensitivity of individual testing (Fig. 2a, bottom-left panel). This is due to dilution of viral loads below the limit of detection during stage-one tests. While the array design is roughly 4.8 times more efficient than individual testing for much of the 50-day window (Fig. 2a, top-left panel), consistent with earlier studies ^{7}, the H_{96,16,2} design is roughly 6 times more efficient than individual testing. This corresponds to a 25% improvement in efficiency over the comparable array design, while achieving essentially the same sensitivity.

For *n* = 384 (Fig. 2a, right panels), we consider the H_{384,32,2} design, again chosen to potentially match the sensitivity of the array design (16 × 24) appropriate for this number of samples. The HYPER design is again roughly 25% more efficient than the array design (Fig. 2a, top-right panel) while having essentially equal sensitivity (Fig. 2a, bottom-right panel). In contrast to these two-stage methods, the one-stage approach of P-BEST has a constant efficiency gain of 8 times individual testing (Fig. 2a, top-right panel), but it significantly loses sensitivity around day 80 as prevalence grows (Fig. 2a, bottom-right panel). This is because the design and decoding algorithm are optimized for a prevalence around 1% and performance degrades beyond this operating point. In contrast, the sensitivity of HYPER increases around day 80 as prevalence increases. Notably, error-correcting does not appear to effectively handle the false negatives that arise here due to diluted viral loads falling below the limit of detection. P-BEST is generally the least sensitive among the three pooling strategies.

Finally, we compare various choices for the number of pools (Fig. 2b, *m* = 32, 16, 12) and the number of splits (Fig. 2c, *q* = 1, 2, 3) in HYPER. The behavior is generally consistent with the findings of Cleary and Hay et al. ^{19}, which studied the performance of random designs *on average*. However, note that any given draw of the random designs may perform better or worse based only on chance. Here we demonstrate the same effects and overall performance with fixed *deterministic* designs. Specifically, the efficiency gain of HYPER is roughly *n/m* early in the epidemic, during which low prevalence leads to pools frequently testing negative. Decreasing the number of pools *m* can significantly increase efficiency, but comes with a slight reduction in sensitivity. As the epidemic progresses, rising prevalence leads to increasingly many stage-two tests, though designs with larger *q* are more robust to this loss of efficiency. At the same time, larger *q* designs tend to be less sensitive. Overall, more efficient designs tend to be less sensitive, creating a trade-off that depends significantly on prevalence. One size does not fit all, underscoring the benefit of flexible designs like HYPER that can easily be adapted from one environment to another.
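The low-prevalence rule of thumb (efficiency gain of roughly *n/m*) and the pool size *k* = *nq/m* make it easy to compare designs before running any simulation. The short sketch below applies them to the 8 × 12 array and to H_{96,16,2}; the function names are illustrative, and the figures are first-round approximations that ignore stage-two tests.

```python
def low_prevalence_gain(n, m):
    """Approximate efficiency gain when almost all pools test negative."""
    return n / m

def pool_size(n, m, q):
    """Number of individuals per pool, k = n * q / m."""
    return n * q / m

array_gain = low_prevalence_gain(96, 8 + 12)   # 8 x 12 array: 20 row/column pools
hyper_gain = low_prevalence_gain(96, 16)       # H_{96,16,2}
improvement = hyper_gain / array_gain - 1

print(array_gain, hyper_gain, improvement)
print(pool_size(96, 16, 2))                    # comparable to array pools of 8 or 12
```

These values agree with the simulated gains of roughly 4.8 and 6 reported for the early part of the window, and with the 25% improvement of HYPER over the array design.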

#### Choosing a pooling method given resource constraints

In practice, decision makers must often choose a pooling method given limited capacity (on average) for daily testing and sample collection. One approach is to maximize the *number of individuals screened per day*, i.e., the number of individuals *n* per batch times the number of batches *b* that can be run per day, given the limited testing and collection capacities. This metric accounts for the impact of resource constraints. However, it does not represent the *actual* number of infected individuals that the population screen can identify. A very efficient method can screen numerous individuals but may miss all the infected ones if it lacks sufficient sensitivity. Hence, a more natural goal is to choose a design that maximizes the number of individuals screened times the average sensitivity, which we call the *effective number of individuals screened per day* or simply the “effective screening capacity” for short. Note that, scaling this by prevalence measures the actual number of infected individuals that are identified by the screen. We evaluated the overall effective screening capacity for the epidemic window considered above (Fig. 2). We compared HYPER (across a sweep of *n, m* and *q*; Supplementary Table 1) with individual testing, array designs ^{7}, and P-BEST^{6}, all across a range of resource constraints.

We will first consider a few specific scenarios to understand the tradeoffs with each design, then consider a larger grid to get a picture of the overall trends. We begin with a testing-scarce setting (Fig. 3a) with an average testing capacity of 12 tests per day being far outstripped by an average sample collection capacity of 3072 samples per day. In this case, individual testing only screens 12 individuals and achieves an effective screening capacity of 10.2 individuals (the average sensitivity of individual testing is 84.8%, due to the presence of positive individuals with viral load below the limit of detection). The best HYPER design (maximizing the effective screening capacity, given the constraints), obtains an effective screening capacity of 122.2 individuals, roughly 12 times more than individual testing. It does so by pooling *n* = 192 individuals per batch into *q* = 2 of *m* = 6 pools with an average of *b* ≈ 0.9 batches run per day (recall that some of the testing capacity is needed for stage-two tests). Both array designs and P-BEST use more than 12 tests in a single run so do not appear here.
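The arithmetic behind these numbers can be sketched in a few lines. This is a minimal illustration, not the simulation code: the function name is ours, and the 84.8% figure is the average sensitivity of individual testing quoted above.

```python
def effective_screening_capacity(n_per_batch, batches_per_day, avg_sensitivity):
    """Individuals screened per day, discounted by the average sensitivity."""
    return n_per_batch * batches_per_day * avg_sensitivity

# Individual testing: each "batch" is one individual and one test,
# so a budget of 12 tests per day screens 12 individuals.
individual = effective_screening_capacity(1, 12, 0.848)
print(round(individual, 1))  # 10.2
```

The same function reproduces the individual-testing baselines quoted for larger budgets (24, 48 and 96 tests per day).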

As testing capacity grows to 24 then 48 tests (Figs. 3b and 3c) with sample capacity unchanged, larger effective screening capacities become possible by using larger designs, including the 8 × 12 array followed by the 16 × 24 array and P-BEST. HYPER adapts to these settings as well, and the larger designs here are accompanied by a larger number of splits *q*. HYPER remains the most effective overall. For a testing capacity of 24 daily tests, HYPER achieves 289.1 effective individuals screened, which is over 14 times as many as individual testing (20.4 effective individuals screened) and over 3.4 times as many as the array designs (83.0 effective individuals screened). P-BEST cannot be run since it uses more tests than available, so it achieves zero effective individuals screened. For a testing capacity of 48 daily tests, HYPER achieves 663.3 effective individuals screened, which is over 16 times that of individual testing (40.7 effective individuals screened), over 2.2 times that of the array designs (293.6 effective individuals screened), and over 2.5 times that of P-BEST (256.1 effective individuals screened).

When the capacity grows to 768 tests per day (Fig. 3d), i.e., a quarter the sample collection capacity, the pooled designs remain more effective than individual testing, but now by less than 4 times. In this increasingly testing-rich regime, P-BEST and array designs are sample-constrained and under-utilize testing resources. P-BEST uses only *mb* = 384 of the 768 available tests, since all 3072 available samples are tested after *b* = 8 batches of *n* = 384 samples. The same is true for the 16 × 24 array design, although additional tests will also be used in stage two. The most effective HYPER design H_{6,1,1} corresponds to simple Dorfman testing, uses roughly 508 tests in the first stage, and achieves 2375.1 effective individuals screened.

Finally, we considered two settings well-tailored for the array designs and P-BEST: 96 samples with 24 tests (Fig. 3e) for which the 8 × 12 array is well-suited (recall that extra testing capacity is needed for stage 2), and 384 samples with 48 tests (Fig. 3f) for which the 16 × 24 array and P-BEST are well-suited. In particular, these are settings where the number of available individuals and pools closely match (a multiple of) the number of individuals and pools used by these designs in a single batch. This can help them maximally utilize both the testing and sample collection capacities, i.e., neither resource is under-utilized. The array designs and P-BEST performed similarly to HYPER in these favorable cases, but notably HYPER remained slightly more effective: 74.1 vs. 71.1 effective individuals screened for the first scenario, and 265.5 vs. 262.2 effective individuals screened for the second.

Expanding this analysis to a grid of sampling and testing capacities gives a broad view of overall trends. We consider a sweep with each ranging from 12 to 6144 (Fig. 3g). Note first that for all sample collection capacities, the effective screening capacity grows as testing capacity scales up, until testing capacity matches or outpaces sample collection capacity. Individually testing all samples collected is optimal from that point on. Likewise, for any given testing capacity, the effective screening capacity rises as sample collection capacity grows, eventually reaching an upper limit at which testing capacity becomes the limiting factor. Pooled testing increases this upper limit, enabling an effective number of individuals screened far beyond the actual number of available tests. For example, for a testing capacity of 96 tests per day, pooled testing achieves an effective screening capacity of up to 1500.9 individuals per day, which is over 18 times the effective screening capacity of 81.4 individuals per day achievable by individual testing.

Across the testing-constrained regime, i.e., testing capacity below sample collection capacity, HYPER out-performed both P-BEST and the array designs (Fig. 3 and Supplementary Fig. 2). As observed before (Figs. 3a to 3f), the flexibility of HYPER generally makes it more readily tailored and optimized for each setting. Dorfman testing (i.e., HYPER with *q* = 1) is most effective when testing capacity is within a quarter of the sample capacity. However, as sample capacity begins to further outstrip testing capacity, combinatorial designs that involve more individuals *n* per batch and that use more splits *q* become most effective, consistent with earlier studies of analogous random designs ^{19}.

The above results capture the average effectiveness of each method when deployed over a 50-day window during epidemic spread. We next investigated effectiveness on individual days, each corresponding to a different fixed prevalence. At low to moderate prevalence of 0.1% (Supplementary Fig. 3), 1.06% (Supplementary Fig. 4), or 1.36% (Supplementary Fig. 5), HYPER is consistently the most effective strategy across all settings. In an intermediate range with prevalence of 1.48% (Supplementary Fig. 6) and 2.46% (Supplementary Fig. 7) there is a subset of scenarios in which P-BEST outperforms HYPER, although the performance of each method is nearly equivalent in these settings. Outside these settings P-BEST is either not viable or substantially under-performs HYPER. At a higher prevalence of 3.15% (Supplementary Fig. 8) HYPER again performs best across all scenarios. We did not observe any scenarios in which array designs were most effective.

## Discussion

Our results demonstrate the effectiveness of a new family of pooling designs that are adaptable to any number of samples, with only mild conditions on the number of pools, while remaining maximally balanced in three senses (number of assignments per individual, pool, and combination of pools). This flexibility is critical to selecting appropriate designs under the widely varying global demands and capabilities for SARS-CoV-2 testing. In addition, the balanced nature of the designs ensures equitable treatment of samples and facilitates robust and simple implementation. Despite the simplicity of implementing HYPER, the existence and construction of the designs relies on deep mathematical results, such as Baranyai’s theorem from combinatorics^{58} and Beth’s construction from number theory ^{59;60}.

Our evaluation of HYPER in both a general theoretical framework and a SARS-CoV-2 specific simulation can be used to guide the choice of design, depending on the setting and purpose of testing. For our theoretical setup, where each test has specificity 1 − *α* and sensitivity *β* independent of all other tests, we showed that using roughly *m/n* ≈ 2*p*^{2/3} − *p* pools per individual maximizes the efficiency of HYPER designs, and we derived that HYPER has sensitivity *β*^{q}. When selecting a design for SARS-CoV-2 testing, one must also account for the impact of dilution in pooled tests (leading to false negatives), among other effects like the evolution of viral loads over time. These considerations are important for selecting designs that remain effective during epidemic spread, which is critical in the ongoing pandemic and in preparations for the future. Here, insight can be gained from our results on a realistic COVID-19 simulation ^{19}. For example, the sensitivity of HYPER now depends on both the number of splits *q* and the number of pools *m*. Designs with fewer pools have larger pools, leading to more dilution and lower sensitivity. Since we also do not correct for false negatives, the sensitivity is lower than individual testing. In theory, adding error correction could increase sensitivity, but doing so is most effective when false negative PCR results are independent across tests. When most false negative results come from diluting positive samples below the LOD (as in our simulations), then error correction is not as effective (as we observed with P-BEST) because the failures are not independent. The sensitivity of HYPER is generally best when using designs with fewer splits *q* or more pools *m*. However, doing so generally reduces efficiency at low prevalence. The results illustrate a general trade-off between sensitivity and efficiency that must be balanced depending on the setting.
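To make the two theoretical quantities above concrete, the following sketch evaluates them for a few illustrative inputs. The function names and the example values of *p*, *β* and *q* are our own choices for illustration; only the two formulas come from the text.

```python
def optimal_pools_per_individual(p):
    """Approximate efficiency-maximizing ratio m/n at prevalence p."""
    return 2 * p ** (2 / 3) - p

def hyper_sensitivity(beta, q):
    """Sensitivity beta**q under the independent-test model with q splits."""
    return beta ** q

# Illustrative prevalences (assumed values, not taken from the simulations).
for p in (0.001, 0.01, 0.05):
    print(p, round(optimal_pools_per_individual(p), 4))

# Illustrative per-test sensitivity of 0.95 with q = 1, 2, 3 splits.
for q in (1, 2, 3):
    print(q, round(hyper_sensitivity(0.95, q), 4))
```

Under this model, each additional split multiplies the sensitivity by another factor of *β*, which is one way to see why larger *q* designs tend to be less sensitive.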

While pooled testing can substantially increase effectiveness depending on laboratory capacity and prevalence, it is important to consider the added logistical challenges. Notably, the gains in testing effectiveness that we demonstrate above do not account for the additional pipetting steps during pooling, or the logistical cost of temporarily storing and retrieving samples for stage two testing. However, simple (Dorfman) pooling designs are receiving increasing interest^{3–5;14;18;64} for real-world testing, demonstrating that these logistical challenges can be overcome in practice in a variety of settings. In comparison to Dorfman designs, more complex designs (with *q* > 1) will require up to *q* times as many pipetting steps during stage one pooling. Depending on the relative timing and cost of each step in the protocol, this may shift the relative favorability of the strategies considered above. In particular, P-BEST, with *q* = 6 or more, may become relatively unfavorable if pooling steps are expensive, while array designs, which utilize multichannel pipettes, may become more favorable. We note that we have previously validated a HYPER design (H_{96,6,2}) ^{19} in the laboratory, and found that with an unoptimized workflow, stage one pooling of 192 samples could be completed manually in under 90 minutes. It is therefore especially likely that gains in effectiveness may outweigh additional logistical costs in settings that are constrained by testing throughput (e.g., due to limited reagents or equipment) but have an excess of laboratory technician capacity.

Although the alternative pooling designs we evaluated here were originally published as fixed designs, potentially optimized for a fixed prevalence (P-BEST), it is possible they could be adapted to different settings. Array designs could potentially be adapted to any grid that fits on standard (96-well or 384-well) laboratory plates. However, our results suggest that HYPER designs could in general match the sensitivity of array designs while having greater efficiency. Indeed, array designs had sub-optimal effectiveness in every setting we considered, including those tailored to best fit each array. P-BEST can in principle be adapted to different settings, although it is unclear how flexible it can be while retaining balance, and properly doing so requires some expertise and experience. To use it, one might need to not only generate another Reed-Solomon code but also properly retune the decoder algorithm. Here we used the design and decoder provided online (https://github.com/NoamShental/PBEST) by the authors. We have so far been unable to evaluate Hypercube ^{8} in simulation since the code for it is currently unavailable.

So far we have limited HYPER to *q* ≤ 3. This has the advantage of reducing the additional logistical burden (and potential for error) that comes with splitting samples into more pools. Moreover, the efficient construction of hypergraph factorizations is highly nontrivial for *q* > 3. However, higher *q* can have several advantages. For example, individuals in the same hyperedge (i.e., assigned to the same combination of pools) are identified as putative positives together as a block even if only one of them is actually positive. Using a higher *q* can significantly increase the number of hyperedges, reducing the number of individuals sharing a single hyperedge. Results for HYPER here also indicated that high *q* designs can be highly effective when the sample capacity significantly outstrips testing capacity, consistent with earlier studies of analogous random designs ^{19}. Likewise, greater efficiency can be obtained by using a multi-stage approach with more than two stages, which is also more logistically challenging. In practice, one must weigh these opportunities for greater effectiveness against the increased complexity. Such designs may be especially promising for labs with access to robotic pipetting platforms.
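The effect of *q* on hyperedge sharing can be illustrated with a short count. This sketch assumes that individuals are spread as evenly as possible over all combinations of *q* pools; the function name and the choice of *n* = 384, *m* = 12 are ours, picked to match designs discussed in this paper.

```python
from math import ceil, comb

def max_individuals_per_hyperedge(n, m, q):
    """Largest number of individuals sharing one combination of q pools
    when n individuals are spread evenly over all comb(m, q) hyperedges."""
    return ceil(n / comb(m, q))

for q in (1, 2, 3):
    print(q, comb(12, q), max_individuals_per_hyperedge(384, 12, q))
```

For these values the number of hyperedges grows from 12 to 66 to 220, so the largest block of individuals flagged together shrinks from 32 to 6 to 2.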

To conclude, we present a simple, efficient and flexible pooled testing strategy that can be easily tailored and implemented without specialized expertise or equipment. To further facilitate implementation, we provide an online tool available at http://hyper.covid19-analysis.org that makes it easy to generate and carry out designs for a broad range of settings.

## Data Availability

Simulated data can be regenerated using the accompanying code.

## Author contributions

All authors contributed to the design of the study and to discussions of all aspects. D.H., R.D., and E.D. performed the theory development and analysis. D.H., X.L., and B.C. performed and analyzed the simulations under the COVID-19 model. R.D. and X.L. developed the interactive online tool. D.H., B.C., and E.D. wrote the first draft of the manuscript. All authors reviewed and edited the manuscript.

## Competing interests

None.

## Supplementary Methods

### Simulation under a COVID-19 model

We performed simulation studies using the COVID-19 model of Cleary and Hay et al. ^{19}. The model first simulates viral loads for a large population of 12,500,000 individuals across 357 days during which the epidemic grows then declines. It captures the evolution of both: a) viral loads within each individual, i.e., within-host viral kinetics, and b) infection prevalence in the overall population. See ^{19} for a detailed description.

Next, the model simulates pooled testing to determine the average efficiency gain (with respect to individual testing) and average sensitivity for each day. For the reader’s benefit, we detail the process here. For HYPER designs, i.e., H_{n,m,q}, the simulation proceeds for each trial *r* of day *d* as follows:

1. Draw *n* individuals uniformly at random from the population. Let *z*_{1}, …, *z*_{n} be their viral loads that day.
2. Generate the *sampled* viral load *v*_{j} for each of the *m* pools ℐ_{1}, …, ℐ_{m} ⊆ {1, …, *n*} as follows:

   $$v_j \sim \mathrm{Poisson}\left(\frac{1}{|\mathcal{I}_j|} \sum_{i \in \mathcal{I}_j} z_i\right),$$

   where |ℐ_{j}| is the size of pool *j*, i.e., the number of individuals assigned to it.
3. Compute stage-one pooled testing results:
   - if *v*_{j} > LOD then pool *j* tests positive, where the LOD (limit of detection) we use is 100;
   - otherwise, pool *j* tests negative with probability 0.99 (i.e., the false positive rate of PCR results is 1%).
4. Select putative positives as those individuals that are not in any negative pools.
5. Compute stage-two individual testing results for the putative positives: putative positive individual *j* tests positive if *z*_{j} > LOD and tests negative otherwise.
6. Declare individuals identified by HYPER as those that tested positive in stage-two.
7. Record the following for the current trial *r* and day *d*:
   - the number of true positive individuals identified by HYPER,
   - the number of tests expended: *T*^{(r)}(*d*) = *m* + number of tests used in stage-two,
   - the number of true positive individuals seen, i.e., the number of individuals with viral load > 0.

For each day, we repeat this for 500 initial trials, then continue until either at least 2,500 true positive individuals have been seen or a total of 200,000 trials have elapsed (including the initial 500). This is to reduce experimental noise. Denoting by *R* the total number of trials run, we then average each of the three recorded quantities across the *R* trials, and finally compute the average efficiency gain and average sensitivity for day *d* from these averages.

Note that step 2 in the simulation above captures dilution due to pooling, since each individual’s viral load gets divided by the pool size. The Poisson random variable models a small volume being sampled from each swab. Note also that step 5 models the individual testing of stage two as having no false positives. Doing so simplifies the simulation without meaningfully affecting our conclusions (e.g., the optimal pooling designs, which do not depend substantially on stage two specificity). We do include false positives in stage one, since the overall efficiency depends on the specificity there.
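The procedure above can be sketched as follows. This is a minimal illustration rather than the code used for the paper: all function names are ours, the Poisson sampler uses a normal approximation for large means to stay within the standard library, and the final efficiency gain (*n* divided by the average number of tests) and sensitivity (identified divided by seen true positives) are our assumed reading of the averages described above.

```python
import math
import random

LOD = 100              # limit of detection, as in the text
FALSE_POSITIVE = 0.01  # stage-one false positive rate, as in the text

def poisson(rng, lam):
    """Poisson draw: Knuth's method for small means; a rounded normal
    approximation (an assumption made for brevity) for large means."""
    if lam <= 0:
        return 0
    if lam > 30:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def run_trial(z, pools, rng, fp=FALSE_POSITIVE):
    """One two-stage trial. z: viral loads; pools: lists of indices into z.
    Returns (true positives identified, tests used, true positives seen)."""
    # Step 2: sampled pool viral load, diluted by the pool size.
    v = [poisson(rng, sum(z[i] for i in pool) / len(pool)) for pool in pools]
    # Step 3: stage-one results, with false positives below the LOD.
    pool_positive = [vj > LOD or rng.random() < fp for vj in v]
    # Step 4: putative positives are in no negative pool.
    in_negative_pool = set()
    for pool, positive in zip(pools, pool_positive):
        if not positive:
            in_negative_pool.update(pool)
    putative = [i for i in range(len(z)) if i not in in_negative_pool]
    # Steps 5-6: stage-two individual tests, modeled without false positives.
    identified = sum(1 for i in putative if z[i] > LOD)
    # Step 7: record the three quantities.
    return identified, len(pools) + len(putative), sum(1 for zi in z if zi > 0)

def run_day(population, pools, n, rng,
            min_trials=500, min_seen=2500, max_trials=200_000):
    """Repeat trials with the stopping rule from the text, then summarize."""
    identified = tests = seen = trials = 0
    while trials < min_trials or (seen < min_seen and trials < max_trials):
        i, t, s = run_trial(rng.sample(population, n), pools, rng)
        identified, tests, seen, trials = identified + i, tests + t, seen + s, trials + 1
    efficiency_gain = n * trials / tests                        # assumed definition
    sensitivity = identified / seen if seen else float("nan")   # assumed definition
    return efficiency_gain, sensitivity

# Toy design (hypothetical): 4 individuals, each in q = 2 of m = 4 pools.
toy_pools = [[0, 1], [2, 3], [0, 2], [1, 3]]
print(run_trial([0, 0, 1e6, 0], toy_pools, random.Random(0), fp=0.0))  # (1, 5, 1)
```

In the toy call, only the two pools containing individual 2 test positive, so one stage-two test is run in addition to the four pooled tests.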

For the 8×12 and 16×24 array designs ^{7}, the simulation proceeds in the same way except for step 2, where the corresponding array pools are used instead. Recall that the array method is a two-stage method like HYPER. For P-BEST^{6}, which is a one-stage method, steps 1-3 are the same (except that step 2 now uses the P-BEST pools). Steps 4-6 are replaced by running the P-BEST decoder to identify individuals. For this, we followed the example (including its tuning parameters) provided online by the authors at https://github.com/NoamShental/PBEST/blob/f7ffebe6c7021ee40167239210806c5a1319f81e/mFiles/example_PBEST.m. Finally, since P-BEST has no second stage of validation tests, the number of tests expended is always *T* ^{(r)}(*d*) = *m* = 48.

Fig. 2 plots the average efficiency gains and average sensitivities of the various methods for each day in a 90-day window of epidemic growth. Here we included individual testing, which has a constant average efficiency gain of 1 (unity) since it is the baseline. Its average sensitivity on day *d* is equal to the average number of individuals with viral load > LOD divided by the average number of individuals with viral load > 0, since individual testing identifies those individuals with viral load > LOD, and true positive individuals are those with viral load > 0 (as before). The average sensitivities of the various methods generally appeared to have significant experimental noise. So, Fig. 2 plots the raw average sensitivities as dots along with a degree-8 polynomial curve fitted to the average sensitivity vs. log_{10} *p*(*d*) across the plotting window of days *d* = 20, …, 110, where *p*(*d*) is the prevalence on day *d*.

In Fig. 2a, we compared HYPER designs H_{96,16,2} and H_{384,32,2} with their counterpart array designs and P-BEST. For the HYPER designs, the numbers *n* of individuals per batch were chosen to match the array designs and P-BEST. The numbers *m* of pools were chosen so that the corresponding pool sizes *nq/m* match the maximum pool sizes of the array designs (12 for the 8 × 12 array and 24 for the 16 × 24 array). Fig. 2b compares HYPER designs H_{384,32,2}, H_{384,16,2}, and H_{384,12,2} that have varying numbers of pools. Fig. 2c compares HYPER designs H_{384,12,1}, H_{384,12,2}, and H_{384,12,3} that have varying numbers of splits.
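The pool sizes *nq/m* for the designs compared in Fig. 2 can be checked directly. The helper name below is ours.

```python
def pool_size(n, m, q):
    """Individuals per pool in a balanced design H_{n,m,q}:
    n*q sample-to-pool assignments spread evenly over m pools."""
    return n * q / m

# Designs from Figs. 2a-2c as (n, m, q).
for design in [(96, 16, 2), (384, 32, 2), (384, 16, 2), (384, 12, 2),
               (384, 12, 1), (384, 12, 3)]:
    print(design, pool_size(*design))
```

The first two designs give pool sizes of 12 and 24, matching the maximum pool sizes of the 8 × 12 and 16 × 24 arrays.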

### Comparison of pooling methods under resource constraints

We used the simulations above to evaluate the various methods (individual testing, HYPER, array designs, P-BEST) under resource constraints and over time. We considered two forms of resource constraints: i) a limited daily sample collection budget, and ii) a limited daily testing budget. We let both range from 12 to 6144, forming the grid of resource-constrained scenarios shown in Fig. 3g and Supplementary Fig. 2, with a few selected scenarios highlighted in Figs. 3a to 3f. These figures evaluate average performance of the various methods when deployed across days 40-90 of the simulation. Supplementary Figs. 3 to 8 repeat the analysis (using the same set of scenarios) for individual days, namely days 53, 80, 83, 84, 90, and 93. Hence, we will focus on describing Fig. 3 and Supplementary Fig. 2; Supplementary Figs. 3 to 8 are similar.

In each scenario, we evaluated each method by its *effective screening capacity*. As discussed in the main text, this performance metric measures how many individuals the method can screen under the resource constraints, with a correction applied to account for the associated sensitivity. To measure performance over time, we also consider averaging across a chosen set of days *𝒟*. Fig. 3 and Supplementary Fig. 2 consider days 40-90, so *𝒟* = {40, …, 90} there. Supplementary Figs. 3 to 8 examine individual days, which corresponds, e.g., to *𝒟* = {53} in Supplementary Fig. 3. To compute average effective screening capacity, we first determine the number of batches *b*(*d*) that can be run on each day *d*, and its average over the days in *𝒟*.

If fewer than 0.9 batches can be run per day on average, then the method is considered infeasible within the resource constraints, and we set its average effective screening capacity to zero. Setting the threshold at 0.9 captures an assumed flexibility to use fewer or more tests across days. Otherwise, we compute the effective screening capacity *C*(*d*) for each day *d*, i.e., the number of individuals screened *n* · *b*(*d*) times the average sensitivity on day *d*, and then average *C*(*d*) over the days in *𝒟*.

Figs. 3a to 3f show the average effective capacities for the considered methods as bars, with the corresponding average number of batches noted at the bottom of each bar. Multiple configurations are available for both the array method (the 8 × 12 and 16 × 24 array designs) and HYPER (various choices of *n, m* and *q*). For these methods, we select the most effective among all configurations, i.e., the configuration with the highest average effective screening capacity. For HYPER, in particular, we optimized over the configurations listed in Supplementary Table 1. The chosen configuration is noted at the top of each bar in Figs. 3a to 3f.
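A minimal sketch of this computation is given below. The function names are ours, and the batch formula (batches limited by whichever of the sample or test budgets runs out first, with fractional batches allowed) is our assumption about how *b*(*d*) is determined; it is consistent with the P-BEST example in the main text (8 batches using 384 of 768 tests).

```python
def batches_per_day(sample_budget, test_budget, n, avg_tests_per_batch):
    """Batches per day, limited by samples or by tests (assumed form)."""
    return min(sample_budget / n, test_budget / avg_tests_per_batch)

def average_effective_capacity(days, sample_budget, test_budget, n,
                               min_batches=0.9):
    """days: one (avg_tests_per_batch, avg_sensitivity) pair per day.
    Returns the average of n * b(d) * sensitivity(d), or 0 if infeasible."""
    b = [batches_per_day(sample_budget, test_budget, n, t) for t, _ in days]
    if sum(b) / len(b) < min_batches:
        return 0.0  # fewer than 0.9 batches per day on average
    capacity = [n * bd * s for bd, (_, s) in zip(b, days)]
    return sum(capacity) / len(capacity)

# P-BEST uses n = 384 samples and 48 tests per batch (one stage only).
print(batches_per_day(3072, 768, 384, 48))  # 8.0
# With only 24 tests per day it is infeasible; 0.7 is a placeholder sensitivity.
print(average_effective_capacity([(48, 0.7)], 3072, 24, 384))  # 0.0
```

Selecting the best configuration for a method then amounts to taking the maximum of `average_effective_capacity` over its candidate designs.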

Supplementary Fig. 2 shows the bar graphs for the full range of resource-constrained scenarios considered. Fig. 3g summarizes these findings by showing only which method was best (where we distinguish different choices of *q* in HYPER), the corresponding average effective screening capacity, and the corresponding configuration.

### HYPER pool designs from hypergraph factorization

As we illustrated in Fig. 1a, HYPER assigns individuals to pools by cycling through a sequence of pool assignments given by hypergraph factorization. For example, for *q* = 2 splits and *m* = 6 pools labelled A-F, we cycled through the 15 possible pairs of pools in the order: AB, CD, EF, BC, DF, AE, BD, AF, CE, BE, CF, AD, BF, DE, AC. That is, individual 1 was assigned to pools A and B, individual 2 to C and D, and so on; after individual 15, we return to the beginning of the sequence and cycle through again. For *n* = 18 individuals, this would result in the following pool assignments:

| Individual | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pools | AB | CD | EF | BC | DF | AE | BD | AF | CE | BE | CF | AD | BF | DE | AC | AB | CD | EF |

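For use in a lab protocol, it can sometimes be helpful to re-order these assignments so that individuals assigned to the same pair of pools appear consecutively:

| Individual | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pools | AB | AB | CD | CD | EF | EF | BC | DF | AE | BD | AF | CE | BE | CF | AD | BF | DE | AC |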

This is useful if we plan to first combine the samples from individuals 1 and 2, then split that combined sample into pools A and B. Likewise for individuals 3 and 4, as well as 5 and 6, in this case. To avoid confusion, we emphasize that we have simply re-ordered the pool assignments to show the repeated ones one after the other. Thus, the table does not show AB, CD and EF as the first three pairs, but rather AB is repeated twice, then CD is repeated twice, and so on.
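Both orderings are easy to generate programmatically. The sketch below uses the 15-pair sequence given above; the function names are ours.

```python
from collections import Counter

# Sequence of pool pairs for q = 2, m = 6, as listed in the text.
SEQUENCE = ["AB", "CD", "EF", "BC", "DF", "AE", "BD", "AF", "CE",
            "BE", "CF", "AD", "BF", "DE", "AC"]

def assign_pools(n, sequence=SEQUENCE):
    """Pool assignments for individuals 1..n, cycling through the sequence."""
    return [sequence[i % len(sequence)] for i in range(n)]

def group_repeats(assignments, sequence=SEQUENCE):
    """Re-order so that repeated assignments appear consecutively,
    keeping the order in which pairs first occur in the sequence."""
    return sorted(assignments, key=sequence.index)

def pool_loads(assignments):
    """Number of individuals placed in each pool."""
    return Counter("".join(assignments))

assignments = assign_pools(18)
print(assignments)
print(group_repeats(assignments))
print(pool_loads(assignments))
```

For *n* = 18, every pool receives exactly six individuals, which illustrates the balance of the design.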

### Hypergraph factorization

The sequence of pool assignments used in HYPER comes from hypergraph factorization. Here we describe the key ideas (and algorithms) for factorizations and the underlying theoretical results, in parallel with the application to group testing.

Suppose we are given a number *m* of vertices, which correspond to pools in group testing. We consider the *complete hypergraph* of order *q* ≤ *m* on these *m* vertices, which is simply the collection of all *q*-*hyperedges*, i.e., subsets of size *q* of the *m* vertices. For *q* = 2, this is the complete graph on *m* vertices, which can be drawn as all edges connecting *m* vertices.

Drawing the corresponding complete hypergraph for *q* = 3 is harder, but one can quickly visualize the hyperedges as triangles connecting *q* = 3 vertices:

In general, there are $\binom{m}{q}$ hyperedges in the complete hypergraph, given by all subsets of size *q* of the *m* vertices. In group testing, the hyperedges correspond to samples: each sample is placed into the *q* pools that make up its hyperedge.
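For small cases the hyperedges can simply be enumerated. The following sketch (function name ours) lists them for *m* = 6 pools labelled A-F.

```python
from itertools import combinations
from math import comb

def hyperedges(pools, q):
    """All q-subsets of the pools, i.e., the complete hypergraph of order q."""
    return ["".join(edge) for edge in combinations(pools, q)]

pairs = hyperedges("ABCDEF", 2)
triples = hyperedges("ABCDEF", 3)
print(len(pairs), pairs)      # 15 pairs, from AB to EF
print(len(triples), triples)  # 20 triples, from ABC to DEF
```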

### Connection to group testing

For HYPER group testing, we are interested in assigning samples to pools in a simple and balanced manner. The notion of balance corresponds to using each pool an equal number of times, or as close to this as possible. This can be achieved quite directly in simpler cases, but requires more work in more complex cases. Consider now the simplest case, where *q*, the number of pools into which each sample is placed, divides the total number of pools. Under this number-theoretic condition, for any *m* pools, we can clearly split them into *m/q* non-overlapping subsets/hyperedges of size *q*, and thus for a set of *m/q* samples, we can use each of the *m* pools exactly once, achieving perfect balance. The above partitioning hyperedges are called a 1-*factor* of the hypergraph. For example, with *m* = 6 pools labelled A-F, we could have the 1-factor {AB, CD, EF} for *q* = 2 and the 1-factor {ABC, DEF} for *q* = 3.

### Baranyai’s theorem

While this method to achieve balance is clear for *n* = *m/q* samples, it is less clear if and how something similar can be achieved for more samples. In fact, a celebrated result in combinatorics, *Baranyai’s theorem* ^{58}, *states that the complete hypergraph can be factorized*, in the sense that the collection of hyperedges can be split/partitioned into non-overlapping and different 1-factors (each containing *m/q* hyperedges), such that each *q*-hyperedge appears in exactly one of the partitions. For group testing, this means that perfect balance can be achieved whenever the number of splits *q* divides the total number of pools *m* (e.g., for a two-pool split, we need an even number of pools). In theory, this solves exactly the problem we need.
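As a quick count that follows from the statement above (written out here for convenience), a factorization splits all $\binom{m}{q}$ hyperedges into 1-factors of $m/q$ hyperedges each, so the number of 1-factors is

$$\binom{m}{q} \Big/ \frac{m}{q} = \frac{q}{m}\binom{m}{q} = \binom{m-1}{q-1}.$$

For *m* = 6 this gives 15/3 = 5 one-factors when *q* = 2, and 20/2 = 10 one-factors when *q* = 3.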

Baranyai’s theorem guarantees the existence of the desired designs, but does not provide efficient algorithms for constructing them. In fact, at the current time, general constructions seem to be known only for *q* = 2 and *q* = 3. For relatively small *q* and *m*, one can certainly attempt to use brute-force enumeration to find appropriate designs. However, in this work we will focus on efficient and general constructions.

### Hypergraph factorization for *q* = 2

For *m* even, we use the following efficient method for constructing a factorization of the complete graph on *m* vertices. Here we follow the description in Section VII-5.5 of ^{29} and illustrate it using *m* = 6 as an example; see also page 595 of ^{27}. The construction begins with the following starter:

Namely, the vertices are *V* = {−*u*, …, +*u*, ∞} where *u* = *m*/2 − 1, i.e., Z_{m−1} ∪ {∞}, with the starter 1-factor formed by edges {(0, ∞), (−1, +1), …, (−*u*, +*u*)}. The remaining 1-factors are then generated by “rotating the diagram”, i.e., by adding *g* = 1, …, *m* − 2 (modulo *m* − 1) to every vertex of the starter other than ∞, which stays fixed.

This yields *m* − 1 1-factors in total, which taken together form the desired factorization of the complete graph. To connect this back to the pools, simply relabel the vertices *V* using the pool names. For example, using
for the above yields the following 1-factors:
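The rotation construction can be sketched in a few lines. The function names are ours, and the relabelling `LABELS` is a hypothetical choice (not necessarily the one used by the authors) under which the construction reproduces, up to the order of pairs within each 1-factor, the sequence AB, CD, EF, BC, DF, AE, … used for the *q* = 2, *m* = 6 example earlier.

```python
INF = "inf"  # stands for the extra vertex ∞

def one_factorization(m):
    """1-factors of the complete graph on Z_{m-1} ∪ {∞}, for even m."""
    if m % 2:
        raise ValueError("m must be even")
    r, u = m - 1, m // 2 - 1
    factors = []
    for g in range(r):  # g = 0 is the starter; g > 0 rotates it
        factor = [(g, INF)]
        factor += [((g - k) % r, (g + k) % r) for k in range(1, u + 1)]
        factors.append(factor)
    return factors

# Hypothetical relabelling of the vertices by pool names for m = 6.
LABELS = {0: "C", 1: "F", 2: "B", 3: "A", 4: "E", INF: "D"}

def relabel(factors, labels):
    """Replace vertices by pool names, writing each pair alphabetically."""
    return [["".join(sorted(labels[a] + labels[b])) for a, b in factor]
            for factor in factors]

for factor in relabel(one_factorization(6), LABELS):
    print(factor)
```

Each printed row is a 1-factor (every pool appears exactly once), and the five rows together contain each of the 15 pairs exactly once.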

### Hypergraph factorization for *q* = 3 via Beth’s construction

We will leverage the non-trivial number-theoretic construction for *q* = 3 due to Thomas Beth ^{59;60}, which is guaranteed to work when *m* = 6*k* for some integer *k* (so the total number of pools is divisible by six), and *r* = 6*k* − 1 is a prime number (that is, a number that is not divisible by any number strictly between 1 and *r*). In what can be viewed as a lucky coincidence, it so happens that most of the designs that we are interested in enjoy these properties: e.g., for *m* = 6, 12, 24, 48 pools, each *m* is divisible by six, and the corresponding *r* = 5, 11, 23, 47 are all prime numbers.

#### Algebraic background

We follow the description in^{59} (which appears to be difficult to access online; the construction is also presented in the thesis ^{60}, Section 3.1, and also referenced in ^{61}). The construction works as follows. Consider the finite field (or Galois field) of the prime order *r, GF* (*r*) = Z*/r*Z = *F*_{r}. This is defined as the set of numbers *{*0, 1, …, *r* − 1*}*, with addition, multiplication, and division by nonzero elements all defined modulo *r* (i.e., the result is always the residue after dividing by *r*). To this field, we append the symbol ∞, as the result of division by zero, so that 1/∞ = 0. We also define *c* + ∞ = *c ·* ∞ = ∞ for all *c* ≠ 0 *∈ F*_{r}. This constitutes the so-called “projective line” *PG*(1, *r*), with the “point” ∞ at infinity.

#### Beth’s construction

Now, Beth considers the fractional linear map *π* : *PG*(1, *r*) → *PG*(1, *r*) given by *π*(*x*) = −(1 + *x*)*/x*. Here, 1 denotes the multiplicative unit of the field, while addition and division are taken modulo *r*. A key observation is that *π* is a fixed-point-free map of order three; that is, it maps *x* → *π*(*x*) → *π ° π*(*x*) → *π ° π ° π*(*x*) = *x*, such that all intermediate values are distinct. Thus, these *orbits* of *π* are sets of size three that partition *PG*(1, *r*). Let *O* = *{A*_{i}, *i* = 1, …, (*r* + 1)/3*}* be the partition of *PG*(1, *r*) into orbits (and note that the size of *PG*(1, *r*) is *r* + 1, hence there are (*r* + 1)/3 orbits).

Let also *ω* be a primitive element of *F*_{r}, that is, an element such that *ω*^{j} ≠ 1 for any *j* = 1, 2, …, *r* − 2. Then, Beth’s result ^{59;60} states that the partitions induced by multiplying and translating *O* by specific values *λ, g* as *λ · O* + *g* form a 1-factorization of the complete 3-hypergraph on *PG*(1, *r*). Here *λ · O* + *g* means that we take each of the hyperedges *A*_{i} *∈ O*, and transform their elements affinely into *λ · A*_{i} + *g*, thus obtaining another hyperedge. Specifically, *λ* needs to take the values of the powers of *ω* given by *λ* = *ω*^{j}, *j* = 1, …, (*r* − 1)/2, and *g* can take any value in *F*_{r}.

The key for us is that this construction can be evaluated very efficiently, by simply iterating over the orbits of *π* and the values *λ, g*.

Example: *r* = 5

For *m* = 6 pools, *r* = *m* − 1 = 5 is a prime number and Beth’s construction applies. We go through the construction in this setting as an illustrative example.

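1. For *r* = 5, we have *F*_{r} = {0, 1, 2, 3, 4} and *PG*(1, *r*) = {0, 1, 2, 3, 4, ∞}.
2. We compute orbits of *π* by repeatedly applying *π* as follows:

   0 → ∞ → 4 → 0 and 1 → 3 → 2 → 1.

   So we have the two orbits *O* = {{0, ∞, 4}, {1, 3, 2}} that partition *PG*(1, *r*).
3. We find a primitive element of *F*_{r} by looking at powers (modulo 5):

   | *x* | *x*^{1} | *x*^{2} | *x*^{3} | *x*^{4} |
   |---|---|---|---|---|
   | 1 | 1 | 1 | 1 | 1 |
   | 2 | 2 | 4 | 3 | 1 |
   | 3 | 3 | 4 | 2 | 1 |
   | 4 | 4 | 1 | 4 | 1 |

   So we can choose *ω* ∈ {2, 3}. We will (arbitrarily) choose *ω* = 2.
4. Finally, we loop through *λ* and *g*. For *ω* = 2, *λ* loops through {2, 4}:

   | | *g* = 0 | *g* = 1 | *g* = 2 | *g* = 3 | *g* = 4 |
   |---|---|---|---|---|---|
   | *λ* = 2 | {0, 3, ∞}, {1, 2, 4} | {1, 4, ∞}, {0, 2, 3} | {0, 2, ∞}, {1, 3, 4} | {1, 3, ∞}, {0, 2, 4} | {2, 4, ∞}, {0, 1, 3} |
   | *λ* = 4 | {0, 1, ∞}, {2, 3, 4} | {1, 2, ∞}, {0, 3, 4} | {2, 3, ∞}, {0, 1, 4} | {3, 4, ∞}, {0, 1, 2} | {0, 4, ∞}, {1, 2, 3} |

   Note that each cell in the resulting output table forms a partition of {0, 1, 2, 3, 4, ∞} as desired. Looping over the cells and using the partitions to form pools yields the desired design.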
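The construction described above can be sketched as follows. This is a minimal illustration following the description in this section, not the authors' implementation: the function name is ours, the input is assumed to be a prime *r* with *r* + 1 divisible by six (primality is not checked), and the modular inverse via `pow(x, -1, r)` requires Python 3.8 or later.

```python
INF = "inf"  # stands for the point ∞ of PG(1, r)

def beth_factorization(r):
    """1-factors of the complete 3-hypergraph on PG(1, r) = F_r ∪ {∞}."""
    if (r + 1) % 6:
        raise ValueError("need m = r + 1 divisible by six")

    def pi(x):
        if x == INF:
            return r - 1   # -(1 + ∞)/∞ = -1
        if x == 0:
            return INF     # division by zero
        return (-(1 + x) * pow(x, -1, r)) % r

    # Orbits of pi partition PG(1, r) into sets of size three.
    seen, orbits = set(), []
    for x in list(range(r)) + [INF]:
        if x not in seen:
            orbit = [x, pi(x), pi(pi(x))]
            seen.update(orbit)
            orbits.append(orbit)

    # Smallest primitive element: omega**j != 1 for j = 1, ..., r - 2.
    omega = next(w for w in range(2, r)
                 if all(pow(w, j, r) != 1 for j in range(1, r - 1)))

    def affine(x, lam, g):
        return INF if x == INF else (lam * x + g) % r

    factors = []
    for j in range(1, (r - 1) // 2 + 1):   # lambda = omega**j
        lam = pow(omega, j, r)
        for g in range(r):                 # g ranges over F_r
            factors.append([frozenset(affine(x, lam, g) for x in orbit)
                            for orbit in orbits])
    return factors

factors = beth_factorization(5)
print(len(factors))  # 10
print(factors[0])    # the 1-factor for lambda = 2, g = 0
```

For *r* = 5 this yields 10 one-factors that together contain each of the 20 triples of {0, 1, 2, 3, 4, ∞} exactly once; relabelling the six points by pool names then gives a *q* = 3, *m* = 6 design.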

### General background on design theory

To understand our designs, and how they fit into the broader context, it is valuable to introduce some basic concepts from design theory. See e.g., ^{27;28} for excellent introductions, and we will follow notation and terminology from those references.

For us, a design *I* is a collection of points *V* and blocks *B*, and an assignment of some points to some blocks. In group testing, the points correspond to pools, and the blocks correspond to samples, in line with the vertices and hyperedges above. The terminology of points and blocks is meant to be evocative of geometry, and indeed designs are closely connected to finite geometries such as affine and projective planes. Intuitively, points can sometimes be viewed as geometric points, while blocks can be viewed as lines. Points will be denoted with lowercase letters such as *p*, while blocks will be denoted with upper case letters such as *B*. The fact that point *p* is associated with (or incident on) block *B* is denoted as *pIB*.

Designs are called *q*-hypergraphs if the set of points incident on each block *B*, i.e., *{p ∈ V* : *pIB}*, has size *q*. For group testing, this means that each sample is assigned to *q* pools. We are thus interested in *q*-hypergraphs for small values of *q*, such as 2 or 3.

A partition of a design is a division of its blocks into disjoint parts. A parallel class of a design is a collection of blocks such that each point is incident on exactly one block. This is analogous to the geometric idea that parallel lines do not intersect, and so if we partition the space into parallel lines, then each point belongs to exactly one such line. A design is called resolvable if it has a partition into parallel classes (a.k.a. 1-factorization, parallelism or resolution).

For group testing, a resolution means that in each part of the partition of the blocks, we use each pool exactly once. Since our goal is to use pools in a balanced way, this is precisely what we want. Thus, from a design theory perspective, we are interested exactly in resolutions of *q*-hypergraphs. There is a great amount of work on existence and constructions of such resolutions, see e.g., ^{27} Ch VIII and references therein.

One classical strategy is the permutation group action approach. Here, we start with a collection *{B*_{j}*}, j* = 1, …, *J* of base blocks (say of size *q*), viewed as subsets of the set of pools [*m*] = *{*1, …, *m}* where *m* is the number of pools, and a subgroup *G* of the permutation group *S*_{m} on *m* elements. If the action of *G* on *B*_{j} *⊂* [*m*], *j* = 1, …, *J*, leads to *J* non-overlapping orbits, then there are classical conditions under which the collection of these orbits forms a *t*-design (i.e., a hypergraph such that all *t*-subsets of points are incident on the same number of blocks), see Theorem III.8.2 on p. 207 of ^{27}. A more specific technique is the difference cycle method, which is an application of the permutation group action method when the cyclic group *G* = Z_{l} acts on the vertices *V* = Z_{l} by translation. This approach leads to resolutions of the complete hypergraph for both *q* = 2 and *q* = 3, see Section VIII.8 of ^{27}. These are precisely the algorithms that we use.
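For intuition about cyclic constructions in the case *q* = 2, the sketch below builds a 1-factorization of the complete graph on *m* pools (*m* even) with the classical round-robin (circle) method: one pool is held fixed while the other *m* − 1 pools are translated by the cyclic group Z_{m−1}. This is a standard textbook construction shown under the assumption that it is representative of the approach; the function name `round_robin_factorization` is hypothetical, and the code is not claimed to be the exact implementation used for HYPER.

```python
from itertools import combinations

def round_robin_factorization(m):
    """1-factorization of the complete graph K_m (m even) via the circle method.

    Returns m - 1 parallel classes; each is a list of m / 2 disjoint pairs
    (a pair is one sample assigned to q = 2 pools) covering every pool once.
    """
    if m % 2 != 0 or m < 2:
        raise ValueError("m must be a positive even integer")
    l = m - 1        # the cyclic group Z_l acts on pools 0..l-1 by translation
    fixed = m - 1    # the one pool left fixed by the action
    classes = []
    for g in range(l):
        pairs = [(g, fixed)]
        for d in range(1, m // 2):
            pairs.append(((g + d) % l, (g - d) % l))
        classes.append(pairs)
    return classes

classes = round_robin_factorization(6)
for cls in classes:
    print(cls)

# Every pair of pools should appear in exactly one parallel class.
all_pairs = {frozenset(pair) for cls in classes for pair in cls}
print(all_pairs == {frozenset(c) for c in combinations(range(6), 2)})
```

Assigning samples to the pairs class by class uses every pool exactly once per class, which is the balance property described above.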

### Performance characterization

To gain further insight into the performance of the proposed hypergraph design, here we study its operating characteristics.

### Bound on expected number of tests

Our first result gives a sharp upper bound on the expected number of tests for hypergraph factorization designs. To get this result, we leverage the Dawson-Sankoff inequality^{62}, a nontrivial refinement of the Bonferroni union-intersection inequalities, which we use in the form given by^{63}. We suppose here that tests have perfect sensitivity and specificity, and we will relax this later.

Theorem 1. *Consider hypergraph factorization designs with any number of samples n, number of pools m, number of pools per sample q, such that n is a multiple of m/q, and let k* = *nq/m be the pool size. Let p be the prevalence level, and suppose that each sample is positive independently with probability p. Let T be the number of tests required, which is a random variable. For any positive integer l ≥* 2, *with* $u = \binom{m-2}{q-2}\left\lceil n/\binom{m}{q}\right\rceil$ *(and u* = 0 *when q* = 1*), the expected number of tests is upper bounded by*

$$\mathbb{E}\,T \le m + n\left[1 - \frac{2}{l}\,q\,(1-p)^{k} + \frac{2}{l(l-1)}\binom{q}{2}(1-p)^{2k-u}\right].$$

*The bound becomes an equality when (A) q* = 1 *or (B)* $n \le \binom{m}{q}$ *and q* = 2; *and with taking l* = 2 *in both cases. For general q, the optimal choice for the parameter l is bounded by*

$$l \le \left\lfloor (q-1)(1-p)^{k-u} \right\rfloor + 2.$$

*Proof of Theorem 1*. Let *R*_{i}, *i* = 1, …, *n* be the indicator of the event that we need to re-test individual/sample *i*. Using the standard approach of calculating 𝔼 *T*, we find that the number of tests required is equal to *m* (one for each pool), plus any retests required. Hence,

$$\mathbb{E}\,T = m + \sum_{i=1}^{n} P(R_i).$$

Now, for our decoder, *R*_{i} happens precisely when all groups containing *i* are positive. Let *T*_{i} be the indicator that the *i*th sample is positive, which in this noiseless setting is the same as an individual test of sample *i* being positive. Let *G*_{j}, *j* = 1, …, *q*, be the groups containing sample *i*.

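We use a refined version of the Bonferroni inequality known as the Dawson-Sankoff inequality^{62} to bound *P* (*R*_{i}). First let us recall the familiar Bonferroni union-intersection inequalities. Consider events *A*_{1}, …, *A*_{n}, and for all *j ∈ {*1, …, *n}*, define

$$S_j = \sum_{1 \le i_1 < \dots < i_j \le n} P\left(A_{i_1} \cap \dots \cap A_{i_j}\right).$$

Then, the well-known Bonferroni inequalities state that for even *h ∈ {*1, …, *n}*,

$$P\left(\bigcup_{j=1}^{n} A_j\right) \ge \sum_{j=1}^{h} (-1)^{j-1} S_j.$$

We set *A*_{j} to be the event that the *j*-th group containing sample *i* is negative, for *j* = 1, …, *q*, so that *R*_{i} is the complement of the union of the *A*_{j}. Then for any even *h*

$$P(R_i) = 1 - P\left(\bigcup_{j=1}^{q} A_j\right) \le 1 - \sum_{j=1}^{h} (-1)^{j-1} S_j.$$

Taking *h* = 2, we thus find

$$P(R_i) \le 1 - S_1 + S_2.$$

We can get sharper results with the Dawson-Sankoff inequality^{62}. In the form given by^{63}, this states that for any integer *l ≥* 2,

$$P\left(\bigcup_{j=1}^{q} A_j\right) \ge \frac{2}{l}\,S_1 - \frac{2}{l(l-1)}\,S_2.$$

Now we can write $P(A_j) = (1-p)^{|G_j|}$ and for *j* ≠ *j*′, we have $P(A_j \cap A_{j'}) = (1-p)^{|G_j \cup G_{j'}|}$.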

It remains to bound |*G*_{j}|. Here hypergraph factorization designs are useful, because they try to balance |*G*_{j}|. In each consecutive block of *m/q* samples, they use each of the *m* pools exactly once. Thus, each such block adds one new sample to each *G*_{j}. Recall that *k* is the number of consecutive blocks of samples of size *m/q*, and we assumed that *k* = *nq/m* is an integer. Based on the above, we have |*G*_{j}| = *k*.

Moreover, |*G*_{j} *∪ G*_{j′}| = |*G*_{j}| + |*G*_{j′}| − |*G*_{j} *∩ G*_{j′}|, and the intersection *G*_{j} *∩ G*_{j′} has size at most $u = \binom{m-2}{q-2}\left\lceil n/\binom{m}{q}\right\rceil$. The reason is that, since hypergraph designs are maximally balanced, two pools intersect at most $\binom{m-2}{q-2}$ times in every consecutive block of $\binom{m}{q}$ samples. These intersections correspond to the number of ways of choosing the remaining *q* − 2 pools out of the remaining *m* − 2. Hence, *P* (*A*_{j}) = (1 − *p*)^{k} and for *j* ≠ *j*′, *P* (*A*_{j} *∩ A*_{j′}) ≤ (1 − *p*)^{2k−u}. Thus, *S*_{1} *≥ q*(1 − *p*)^{k} and $S_2 \le \binom{q}{2}(1-p)^{2k-u}$. In addition, we have equality for *S*_{2} when either (A) *q* = 1, in which case the intersection is empty and *u* = 0, or (B) $n \le \binom{m}{q}$ and *q* = 2 (in which case we know that the groups intersect in exactly *u* = 1 sample, which is the sample defining them) or (C) *n* is a multiple of $\binom{m}{q}$ (in which case we know that the number of intersections is exactly given by *u* for each pair of groups). This leads to the desired result.

As is known from ^{62;63}, the optimal choice for *l* is *l* = ⌊2*S*_{2}*/S*_{1}⌋ + 2. We can approximate the optimal choice using the calculations from the proof as

$$l \approx \left\lfloor (q-1)(1-p)^{k-u} \right\rfloor + 2.$$

This finishes the proof.

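Let us denote *r* = 1 − *p*. When $n \le \binom{m}{q}$, an optimal hypergraph design is obtained by minimizing the number of per-person tests *E*(*m, q*) = 𝔼*T/n*, i.e.,

$$\min_{m,\,q}\; E(m,q), \qquad E(m,q) = \frac{m}{n} + 1 - r + r\left(1 - r^{nq/m - 1}\right)^{q}.$$

This formula is exact for *q* = 1 and *q* = 2. One can check that we recover the bound from above with equality for these cases when $n \le \binom{m}{q}$ (so *u* = 0 for *q* = 1 and *u* = 1 for *q* = 2), and by taking *l* = 2. In more detail, we can write (replacing above *k* = *nq/m*, and using *r* = 1 − *p*)

$$E(m,1) = \frac{m}{n} + 1 - r^{k}, \qquad E(m,2) = \frac{m}{n} + 1 - 2r^{k} + r^{2k-1}.$$

Moreover, by substituting into Theorem 1 *u* = 0 for *q* = 1 and *u* = 1 for *q* = 2, and by taking *l* = 2, the bounds given there for 𝔼*T/n* become

$$\frac{m}{n} + 1 - r^{k} \quad (q = 1), \qquad \frac{m}{n} + 1 - 2r^{k} + r^{2k-1} \quad (q = 2).$$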

This shows that the upper bounds are sharp for *q* = 1, 2.
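The sharpness claim can be checked numerically. The sketch below assumes that the bound of Theorem 1 reads 𝔼*T*/*n* ≤ *m*/*n* + 1 − (2/*l*)·*q*·*r*^{k} + (2/(*l*(*l* − 1)))·C(*q*, 2)·*r*^{2k−u}, with *u* = C(*m* − 2, *q* − 2)·⌈*n*/C(*m*, *q*)⌉, and that the locally tree-like formula is *m*/*n* + *p* + *r*(1 − *r*^{k−1})^{q}. Both forms are reconstructions from the surrounding derivation, and the function names are hypothetical.

```python
from math import comb, ceil

def per_person_bound(n, m, q, p, l=2):
    """Dawson-Sankoff-type upper bound on E[T] / n (assumed form, see lead-in)."""
    r, k = 1.0 - p, n * q // m
    # u bounds the overlap of two pools; it is 0 for q = 1 by convention.
    u = comb(m - 2, q - 2) * ceil(n / comb(m, q)) if q >= 2 else 0
    s1 = q * r ** k
    s2 = comb(q, 2) * r ** (2 * k - u)
    return m / n + 1.0 - (2.0 / l) * s1 + (2.0 / (l * (l - 1))) * s2

def per_person_tree_like(n, m, q, p):
    """Locally tree-like formula for E[T] / n; exact for q = 1 and q = 2."""
    r, k = 1.0 - p, n * q // m
    return m / n + p + r * (1.0 - r ** (k - 1)) ** q

# q = 1 (Dorfman) and q = 2 with n <= C(m, 2); pool size k = 6 in both cases.
for q, m, n in ((1, 4, 24), (2, 8, 24)):
    for p in (0.01, 0.05):
        print(q, p, per_person_bound(n, m, q, p), per_person_tree_like(n, m, q, p))
```

For *q* = 1 and *q* = 2 the two columns should agree up to floating-point error, which is the equality case with *l* = 2.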

For larger *q*, however, this formula is no longer exact. It corresponds to the so-called *locally tree-like* approximation in the graphical model associated with the observation model, applied when computing the probability of *R*_{i}.

### Optimal efficiency for *q* = 2

Our next result characterizes the optimal efficiency of hypergraph designs with *q* = 2 in the limit *p* → 0.

Proposition 2 (Optimal efficiency for *q* = 2). *In the noiseless large n case (i.e., α* = 0, *β* = 1, *and n* → ∞*), the optimal efficiency of HYPER with q* = 2 *is approximately E*^{*} ≈ 3*p*^{2/3} *in the limit p* → 0 *with m/n* ≈ 2*p*^{2/3} − *p*.

These results are valid in the regime of *p* where $n \le \binom{m}{q}$, which restricts *p* to be larger than a certain threshold. When *p* → 0, we expect the optimal *m* to decrease in this limit, eventually leading to $n > \binom{m}{q}$. In practice, this means that the formula is valid for a larger range of *p* when *n* is larger.

In comparison, the optimal efficiency for Dorfman testing (*q* = 1) is approximately 2*p*^{1/2} for small *p*, see e.g., ^{20}. Thus hypergraph designs improve over Dorfman designs. The same asymptotic efficiency *E*^{*} ≈ 3*p*^{2/3} is also achieved by three-stage testing^{20} as well as double pooling tests ^{56}. However, our proposal is a two-stage approach. In comparison, both the hierarchical testing approaches proposed in ^{20} and ^{8} attain an asymptotic efficiency *E*^{*} ≈ *ep* ln(1*/p*) as *p* → 0, which is asymptotically more efficient, but they require multi-stage testing, which we avoid.

More generally, we note that there is a large body of work on the optimality of group testing in various settings, e.g., two-stage and multi-stage algorithms, adaptive and non-adaptive algorithms, worst case or average case, etc. ^{41–46}. As *p* → 0, these works and others discuss a universal lower bound of order Θ(*p* ln(1*/p*)) on the efficiency. However, the specific rate at which *p* → 0 with *n* can lead to better rates and better algorithms. In particular, under the same probabilistic model as in our paper, ^{44} constructs certain tests where each sample is placed into *q* = ln(1*/p*)/ ln 2 pools, and shows that these attain asymptotic efficiency *p* ln(1*/p*). The work ^{47} constructs a 2-stage algorithm with asymptotically optimal efficiency, requiring *q* = *m* ln(2)/(*np*). The work ^{48} discusses similar proposals for the noisy case. In our work, the constraints we work with do not allow *q* to grow as *p* → 0. The work ^{49} proposes a 4-stage algorithm with asymptotically optimal efficiency.

*Proof of Proposition 2*. Taking the limit as *n* → ∞ with *m/n* = *y* fixed, where *α* = 0 and *β* = 1, we obtain

$$E(y) = y + 1 - 2(1-p)^{2/y} + (1-p)^{4/y - 1}.$$

To optimize, we differentiate *E* with respect to *y*, obtaining

$$\frac{\partial E}{\partial y} = 1 + \frac{4\ln(1-p)}{y^{2}}\left[(1-p)^{2/y} - (1-p)^{4/y-1}\right].$$

The optimal *y* can be obtained by solving 0 = *∂E/∂y* for *y* in terms of *p*. We approximate this solution in the limit *p* → 0 by taking the leading two terms of the Puiseux expansion (around *p* = 0) of the degree four Taylor approximation of *∂E/∂y* (around *p* = 0). This has one branch corresponding to a real solution yielding *y*^{*} ≈ 2*p*^{2/3} − *p*.

Substituting *y* = 2*p*^{2/3} − *p* into *E* and computing a Taylor approximation yields

$$E^{*} \approx 3p^{2/3},$$

completing the derivation.
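As a numerical sanity check of this asymptotic statement, the sketch below minimizes the per-person number of tests over *y* = *m*/*n* by brute force and compares the result with 2*p*^{2/3} − *p* and 3*p*^{2/3}. It assumes the noiseless *q* = 2 objective *E*(*y*) = *y* + 1 − 2(1 − *p*)^{2/y} + (1 − *p*)^{4/y−1}, obtained from the exact *q* = 2 formula with *k* = 2/*y*; the function names and grid settings are hypothetical choices.

```python
def efficiency_q2(y, p):
    """Tests per person for q = 2 in the noiseless large-n limit, with y = m / n."""
    r = 1.0 - p
    return y + 1.0 - 2.0 * r ** (2.0 / y) + r ** (4.0 / y - 1.0)

def minimize_on_log_grid(p, lo=1e-4, hi=1.0, num=20001):
    """Brute-force minimizer of efficiency_q2 over a log-spaced grid of y values."""
    best_y, best_e = None, float("inf")
    for i in range(num):
        y = lo * (hi / lo) ** (i / (num - 1))
        e = efficiency_q2(y, p)
        if e < best_e:
            best_y, best_e = y, e
    return best_y, best_e

for p in (1e-3, 1e-4):
    y_star, e_star = minimize_on_log_grid(p)
    # Columns: p, numerical y*, approximation 2 p^(2/3) - p, numerical E*, 3 p^(2/3).
    print(p, y_star, 2 * p ** (2 / 3) - p, e_star, 3 * p ** (2 / 3))
```

A grid search is used instead of a root finder so that the check does not depend on the derivative formula; agreement should improve as *p* decreases.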

### Noisy case

We next study the noisy case, where each of the pooled tests can have false positives and false negatives. Our first result gives an exact formula for the expected number of tests for *q* = 1, 2, and moreover, also gives formulas for the false positive and false negative rates for each individual test.

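(Performance of hypergraph factorization: noisy case). *Consider hypergraph factorization designs in a noisy observation model. Suppose n is a multiple of m/q. Suppose q* = 1 *or q* = 2, *and* $n \le \binom{m}{q}$. *Let k* = *nq/m and r* = 1 − *p. The expected number of tests has the following exact form:*

$$\mathbb{E}\,T = m + n\left[p\,\beta^{q} + (1-p)\,\gamma\right].$$

*Here γ* = [*β* + (*α* − *β*) *· r*^{k−1}]^{q}. *Denoting the odds ratio as o* = (1 − *p*)*/p, the true negative and true positive probabilities relating each individual sample’s declared status* $\hat{T}_i$ *and its true status T*_{i} *are, respectively*,

$$P(T_i = 0 \mid \hat{T}_i = 0) = \frac{1}{1 + \dfrac{1-\beta^{q+1}}{o\,(1-\alpha\gamma)}}, \qquad P(T_i = 1 \mid \hat{T}_i = 1) = \frac{1}{1 + \dfrac{o\,\alpha\gamma}{\beta^{q+1}}}.$$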
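The exact formula for the expected number of tests can be checked by brute force on a tiny design. The sketch below assumes the formula takes the form 𝔼*T* = *m* + *n*[*pβ*^{q} + (1 − *p*)*γ*], which is what the conditioning argument in the proof gives, and assumes that *α* is the probability that a truly negative pool tests positive while *β* is the probability that a truly positive pool tests positive. The design used (all 6 pairs of *m* = 4 pools, so *q* = 2 and *k* = 3) and the function names are illustrative choices.

```python
from itertools import combinations, product

def exact_expected_tests(m, samples, p, alpha, beta):
    """E[T] by brute-force enumeration over true statuses and pool outcomes.

    samples: list of tuples of pool indices (one tuple per sample).
    alpha = P(pool tests positive | pool truly negative)  (assumed convention),
    beta  = P(pool tests positive | pool truly positive)  (assumed convention).
    """
    n = len(samples)
    expected = 0.0
    for status in product((0, 1), repeat=n):
        p_status = 1.0
        for t in status:
            p_status *= p if t else 1.0 - p
        # True pool status: positive iff it contains at least one positive sample.
        truth = [max(status[i] for i in range(n) if j in samples[i]) for j in range(m)]
        for outcome in product((0, 1), repeat=m):
            p_out = 1.0
            for h, o in zip(truth, outcome):
                pos = beta if h else alpha
                p_out *= pos if o else 1.0 - pos
            # A sample is re-tested iff every pool containing it tests positive.
            retests = sum(all(outcome[j] for j in s) for s in samples)
            expected += p_status * p_out * (m + retests)
    return expected

def formula_expected_tests(n, m, q, p, alpha, beta):
    """Closed form m + n [p beta^q + (1 - p) gamma], for q in {1, 2}."""
    r, k = 1.0 - p, n * q // m
    gamma = (beta + (alpha - beta) * r ** (k - 1)) ** q
    return m + n * (p * beta ** q + r * gamma)

m, q, p, alpha, beta = 4, 2, 0.1, 0.02, 0.9
samples = list(combinations(range(m), q))  # all 6 pairs of the 4 pools
print(exact_expected_tests(m, samples, p, alpha, beta))
print(formula_expected_tests(len(samples), m, q, p, alpha, beta))
```

The enumeration counts only first-stage pools and the number of re-tests, so it does not depend on how accurate the re-tests themselves are.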

Our second result gives a more generally applicable upper bound on the expected number of tests, valid for all *q*.

(Performance bound for hypergraph factorization: noisy case, upper bound). *Consider hypergraph factorization designs in a noisy observation model with any number of samples n, number of pools m, number of pools per sample q, such that n is a multiple of m/q, and let k* = *nq/m. Let* $u = \binom{m-2}{q-2}\left\lceil n/\binom{m}{q}\right\rceil$. *Let p be the prevalence level, and suppose that each sample is positive independently with probability p. Suppose the pools and the re-tests have specificity* 1 − *α (where α is the level of each test), and sensitivity (or power) β. For any positive integer l ≥* 2, *the expected number of tests is upper bounded by*

$$\mathbb{E}\,T \le m + n\left[1 - \frac{2q}{l}\,p_1 + \frac{q(q-1)}{l(l-1)}\,p_2\right],$$

*where*

$$p_1 = 1 - \beta + (\beta-\alpha)\,r^{k}, \qquad p_2 = (1-\beta)^{2} + 2(1-\beta)(\beta-\alpha)\,r^{k} + (\beta-\alpha)^{2}\,r^{2k-u}.$$

*Here r* = 1 − *p. The optimal choice of l, minimizing the upper bound, is achieved by l* = ⌊(*q* − 1)*p*_{2}*/p*_{1}⌋ + 2.

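*Proof of Theorems 3 and 4*. We will follow, to some extent, the notation and assumptions from Bilder’s works, see e.g., ^{21}. Let $\hat{H}_j$ be the binary result of testing group *j*, and *H*_{j} be the true status of the *j*-th group. By definition, 1 − *α* is the specificity of each grouped test, i.e., $P(\hat{H}_j = 1 \mid H_j = 0) = \alpha$ (we use this notation in keeping with the convention of *α* for the level of a test in hypothesis testing); and *β* is the sensitivity (or power) of each grouped test, i.e., $P(\hat{H}_j = 1 \mid H_j = 1) = \beta$. Moreover, we assumed that each test outcome is independent.

We have

$$\mathbb{E}\,T = m + \sum_{i=1}^{n} P(R_i) = m + \sum_{i=1}^{n} \sum_{t \in \{0,1\}} P(R_i \mid T_i = t)\,P(T_i = t).$$

The key is to determine the probabilities *P* (*R*_{i}|*T*_{i} = *t*). Now *R*_{i} happens if and only if each of the groups containing *i* tests positive, i.e., for all *j* such that *i ∈ G*_{j}, we have $\hat{H}_j = 1$:

$$P(R_i \mid T_i = t) = P\Big(\bigcap_{j:\, i \in G_j} \{\hat{H}_j = 1\} \,\Big|\, T_i = t\Big).$$

Since *q* = 1, 2, and $n \le \binom{m}{q}$, these groups are non-overlapping outside of *i*, and thus their test results are independent conditional on *T*_{i}. We can thus write

$$P(R_i \mid T_i = t) = \prod_{j:\, i \in G_j} P(\hat{H}_j = 1 \mid T_i = t).$$

Next, we can condition on *H*_{j} for each term, to write

$$P(\hat{H}_j = 1 \mid T_i = t) = \beta\,P(H_j = 1 \mid T_i = t) + \alpha\,P(H_j = 0 \mid T_i = t).$$

Moreover, since $H_j = \max_{i' \in G_j} T_{i'}$, we have

$$P(H_j = 0 \mid T_i = 1) = 0, \qquad P(H_j = 0 \mid T_i = 0) = r^{|G_j| - 1}.$$

Working our way back up, we can substitute these above to find

$$P(\hat{H}_j = 1 \mid T_i = 1) = \beta, \qquad P(\hat{H}_j = 1 \mid T_i = 0) = \beta + (\alpha - \beta)\,r^{|G_j| - 1}.$$

Letting *n*_{i} = |*{j* : *i ∈ G*_{j}*}*| be the number of groups that *i* belongs to, we find

$$P(R_i \mid T_i = 1) = \beta^{n_i}, \qquad P(R_i \mid T_i = 0) = \prod_{j:\, i \in G_j} \left[\beta + (\alpha-\beta)\,r^{|G_j|-1}\right].$$

Finally,

$$P(R_i) = p\,\beta^{n_i} + (1-p) \prod_{j:\, i \in G_j} \left[\beta + (\alpha-\beta)\,r^{|G_j|-1}\right].$$

Hence

$$\mathbb{E}\,T = m + \sum_{i=1}^{n} \Big\{ p\,\beta^{n_i} + (1-p) \prod_{j:\, i \in G_j} \left[\beta + (\alpha-\beta)\,r^{|G_j|-1}\right] \Big\}.$$

Recall that for the hypergraph designs, |*G*_{j}| ≤ *n/*(*m/q*), and if *n* is an integer multiple of *m/q* as assumed here, then |*G*_{j}| = *nq/m*. Moreover, by construction, *n*_{i} = *q*. Hence we find

$$\mathbb{E}\,T = m + n\left[p\,\beta^{q} + (1-p)\,\gamma\right], \qquad \gamma = \left[\beta + (\alpha-\beta)\,r^{nq/m - 1}\right]^{q}.$$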

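This gives the desired formula for the expected number of tests. Next we derive the per-test false negative and positive rates. Let $\hat{T}_i$ be the indicator that the *i*-th sample is declared positive. This happens if all groups containing *i* are positive in the first round, and then the result of a second independent test is also positive. Recall that *T*_{i} is the indicator that the *i*th sample is positive, and *P* (*T*_{i} = 1) = *p*.

We are interested in the probabilities $P(T_i = t \mid \hat{T}_i = t)$, for $t \in \{0, 1\}$. For *t* = 1 this denotes the true positive probability, while for *t* = 0 this denotes the true negative probability. Using Bayes’ rule, we can write

$$P(T_i = t \mid \hat{T}_i = t) = \frac{P(\hat{T}_i = t \mid T_i = t)\,P(T_i = t)}{\sum_{s \in \{0,1\}} P(\hat{T}_i = t \mid T_i = s)\,P(T_i = s)}.$$

Thus the probabilities reduce to determining $P(\hat{T}_i = 1 \mid T_i = t)$. We can write, assuming the result of the re-test is independent of the original tests, and assuming the re-test has the same operating characteristics as any grouped test, that

$$P(\hat{T}_i = 1 \mid T_i = t) = P(R_i \mid T_i = t)\cdot P(\text{re-test of } i \text{ is positive} \mid T_i = t).$$

We have

$$P(\text{re-test of } i \text{ is positive} \mid T_i = 1) = \beta, \qquad P(\text{re-test of } i \text{ is positive} \mid T_i = 0) = \alpha.$$

Hence, using our previous results *P* (*R*_{i}|*T*_{i} = 1) = *β*^{n_i} and *P* (*R*_{i}|*T*_{i} = 0) = *γ*, denoting $\gamma = \prod_{j:\, i \in G_j} \left[\beta + (\alpha-\beta)\,r^{|G_j|-1}\right]$,

$$P(\hat{T}_i = 1 \mid T_i = 1) = \beta^{n_i + 1}, \qquad P(\hat{T}_i = 1 \mid T_i = 0) = \alpha\,\gamma.$$

Working our way back,

$$P(T_i = 1 \mid \hat{T}_i = 1) = \frac{p\,\beta^{n_i+1}}{p\,\beta^{n_i+1} + (1-p)\,\alpha\gamma}, \qquad P(T_i = 0 \mid \hat{T}_i = 0) = \frac{(1-p)(1-\alpha\gamma)}{(1-p)(1-\alpha\gamma) + p\,(1-\beta^{n_i+1})}.$$

Recall that for the hypergraph designs, if *n* is an integer multiple of *m/q*, then *γ* = [*β* +(*α* − *β*) *·* (1 − *p*)^{nq/m−1}]^{q} and *n*_{i} = *q*. Hence, denoting the odds ratio as *o* = (1 − *p*)*/p* we find that the true negative and true positive probabilities are, respectively,

$$P(T_i = 0 \mid \hat{T}_i = 0) = \frac{1}{1 + \dfrac{1-\beta^{q+1}}{o\,(1-\alpha\gamma)}}, \qquad P(T_i = 1 \mid \hat{T}_i = 1) = \frac{1}{1 + \dfrac{o\,\alpha\gamma}{\beta^{q+1}}}.$$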

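This finishes the proof of Theorem 4. Now, we proceed to Theorem 3. This follows in a similar way to the previous Theorem 1, but with more involved calculations. As before, the Dawson-Sankoff inequality ^{62}, in the form given by^{63} states that for any integer *l* ≥ 2,

$$P\left(\bigcup_{j=1}^{q} A_j\right) \ge \frac{2}{l}\,S_1 - \frac{2}{l(l-1)}\,S_2.$$

Taking *A*_{j} to be the event that the *j*-th group containing sample *i* tests negative, we find that the left-hand side above equals 1 − *P* (*R*_{i}). Now $S_1 = \sum_{j} P(A_j)$ and $S_2 = \sum_{j < g} P(A_j \cap A_g)$. Thus it is enough to give a lower bound for *P* (*A*_{j}) and an upper bound for *P* (*A*_{j} *∩ A*_{g}) for all *j ≠ g*.

We can condition on *H*_{j} to write

$$\begin{aligned} P(A_j) &= (1-\alpha)\,P(H_j = 0) + (1-\beta)\,P(H_j = 1) \\ &= (1-\alpha)\,r^{|G_j|} + (1-\beta)\left(1 - r^{|G_j|}\right) \\ &= 1 - \beta + (\beta-\alpha)\,r^{k}. \end{aligned}$$

In the last line, we have used that *n* is a multiple of *m/q*.

Similarly, we can calculate for *j ≠ g*, noting that |*G*_{j}| = |*G*_{g}| = *k*, and denoting |*G*_{j} *∩ G*_{g}| = *v*, so that |*G*_{j} *∪ G*_{g}| = 2*k* − *v*,

$$\begin{aligned} P(A_j \cap A_g) &= (1-\alpha)^{2}\,r^{2k-v} + 2(1-\alpha)(1-\beta)\left(r^{k} - r^{2k-v}\right) + (1-\beta)^{2}\left(1 - 2r^{k} + r^{2k-v}\right) \\ &= (1-\beta)^{2} + 2(1-\beta)(\beta-\alpha)\,r^{k} + (\beta-\alpha)^{2}\,r^{2k-v}. \end{aligned}$$

As discussed before, due to the construction of hypergraph factorization designs, we have *v* ≤ *u*. Since the above expression is monotonically increasing in *v*, we can upper bound it by replacing *v* with *u*. This finishes the proof.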

## Acknowledgements

D.H. was supported by the Dean’s Fund for Postdoctoral Research of the Wharton School and NSF BIGDATA grant IIS 1837992. R.D. and X.L. were supported by a grant from the Partners in Health. E.D. was supported in part by NSF BIGDATA grant IIS 1837992.
