## Summary

Contagion happens through heterogeneous interpersonal relations (homophily), which induce contamination clusters. Group testing is increasingly recognized as necessary to fight the asymptomatic transmission of COVID-19. Still, it is plagued by false negatives. Homophily can be taken into account to design test pools that encompass potential contamination clusters. I show that this makes it possible to overcome the usual information-theoretic limits of group testing, which are based on an implicit homogeneity assumption. Even more interestingly, a multiple-step testing strategy that combines this approach with advanced complementary exams for all individuals in pools identified as positive identifies asymptomatic carriers who would be missed even by costly exhaustive individual tests. Recent advances in group testing have brought large gains in efficiency, but within the bounds of the above-cited information-theoretic limits, and without tackling the false-negative issue, which is crucial for COVID-19. Homophily has already been considered in the contagion literature, but not in order to improve group testing.

## I. INTRODUCTION

Massive and timely identification of asymptomatic disease carriers is crucial if human-to-human asymptomatic transmission happens, which is documented for COVID19^{1,2,3,4,5}. Li, Pei *et al*. (2020)^{6} find that although the transmission rate of undocumented carriers is only 55% that of documented carriers, the former are responsible for 80% of contaminations.

Massive and repeated RT-PCR testing is possible only through group testing (testing a pool of swabs from many individuals). Group testing is used to fight COVID19 in China, India, Germany, the United States^{7} and Rwanda^{8}. For the literature on group testing and COVID19, see ^{7,8,9,10,11,12,13,14}. With a 0.1 percent prevalence, the two-step adaptive design proposed by Dorfman (1943)^{15} decreases 17-fold the number of tests required to identify asymptomatic COVID-19 carriers (0.06 test per person), while the strategy suggested by Mutesa *et al*. (2020)^{8} decreases it 55-fold (0.018 test per person). Still, these testing strategies are plagued by false negatives (see Section IV).
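The order of magnitude quoted for Dorfman's design can be checked with a short calculation. The sketch below is only illustrative: it assumes noiseless tests and independent infections with a common prevalence (the homogeneity assumption discussed in this paper), and the function names are hypothetical.

```python
def dorfman_tests_per_person(prevalence: float, pool_size: int) -> float:
    """Expected tests per person for a two-step Dorfman design.

    One pooled test is shared by `pool_size` people; if the pool is positive,
    which happens with probability 1 - (1 - prevalence) ** pool_size under
    independence, every member is retested individually.
    """
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + p_pool_positive


def best_dorfman_pool(prevalence: float, max_size: int = 200):
    """Return (pool size, tests per person) minimizing the expected cost."""
    costs = {s: dorfman_tests_per_person(prevalence, s)
             for s in range(2, max_size + 1)}
    best = min(costs, key=costs.get)
    return best, costs[best]


size, cost = best_dorfman_pool(0.001)  # 0.1 percent prevalence
print(size, round(cost, 3), round(1 / cost, 1))
```

Under these assumptions the minimum is reached for pools of about 32 swabs, at roughly 0.06 test per person, in line with the figure quoted above.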

Contagion happens through heterogeneous interpersonal relations (homophily) which can be identified *ex ante* (Section II) to design test pools that encompass potential contamination clusters. Thus, it is possible to overcome information-theoretic limits, which rely on an implicit homogeneity assumption, and to make tests more efficient (Section III). Combining this approach with individual complementary exams for the positive groups identifies carriers who would be missed even by costly exhaustive individual tests on nasopharyngeal swabs (Section IV).

## II. HOMOPHILY IS PREVALENT AND CAN BE IDENTIFIED EX ANTE

The social sciences and epidemiology literatures show that heterogeneous interpersonal interactions, which induce small potential contamination clusters, are recurrent and can be identified *ex ante*. Thus, it is possible to design test pools encompassing potential clusters. Homophily, defined in 1954 by Lazarsfeld and Merton^{16}, “refers to the fact that people are more prone to maintain relationships with people who are similar to themselves”^{17}. Homophily is prevalent in many social networks^{18} and affects contagion^{19}. Moulton (1986, 1990)^{20,21} introduced a related econometric concept, clustering, which refers to the nondeterministic correlation of outcomes between individuals that are somewhat related: failing to take it into account induces significant errors when estimating standard errors. Clustering has been popularized by Bertrand *et al*. (2004)^{22} and extended to multiple non-nested dimensions^{23}. Correcting for potential clustering is a condition *sine qua non* for publication in applied economics.

The epidemiologic literature confirms the importance of clusters. Han and Yang (2020)^{24} cite an article written in Chinese asserting that “In some cities, cases involving cluster transmission accounted for 50% to 80% of all confirmed cases of COVID-19.” According to a meta-analysis of 20 studies, households display a high secondary attack rate (SAR, 15.4% on average)^{25}, and out of 36 children infected in a Chinese city, 32 (89%) had transmission by close contact with family members^{26}. High SARs also occur in a chalet (73.3%), in a choir (53.3%), at a religious event, and when travelling or eating with an index case. Other cases of clusters with very high absolute attack rate include “a nursing home in Kings County, Washington (64%) […] a church in Arkansas (38%), a homeless shelter in Boston (36%), a fitness dance class (26.3%) and the Diamond Princess cruise ship in Japan (18.8%)”^{25}. Park *et al*. (2020)^{27} analyze an outbreak in a building: 94 of 97 cases worked on the same floor, and 79 in the same open space (attack rate of 52%). Many clusters have been observed in slaughterhouses.

General contamination patterns in small clusters can be identified *ex ante*. Longer and more intense exposure increases the risk of infection^{25,28,29,30}; so do indoor environments with sustained close contact and conversations^{31}. In a call center, cases were concentrated in large open spaces, with only one case in small offices^{27}. This is consistent with Harpedanne (2020)^{32}, who shows a convex theoretical contamination effect of the number of users. Using these general patterns and theoretical results, it is possible to identify potential clusters *ex ante*, and to design pools that encompass these clusters. Sections III and IV quantify the gains from this strategy.

## III. INFORMATION THEORETIC LIMITS CAN BE OVERCOME WHEN DESIGNING TEST POOLS

Here, I show that homophily contains information that can, in theory, make group testing more efficient. Specifically, strong homophily used to design the testing pools makes it possible to overcome the most recent and tightest information-theoretic lower bounds on the efficiency of group testing. These bounds were identified by Chan *et al*. (2011)^{33}, who claim to be the first in the literature to define limits in terms of actual numbers and not only rate or capacity, and by Baldassini *et al*. (2013)^{34}, who follow the same path and provide a new and tighter lower bound.

I focus here on noiseless tests: a negative test outcome is guaranteed when all items in the testing pool are nondefective, and a positive outcome when at least one item in the pool is defective^{35}. Otherwise, the test is noisy.

Noisy tests are often examined under the assumptions of constant^{33} or worst-case^{36} noise. However, the literature points instead to a risk of false negatives increasing with dilution^{14} (analyzed in a previous draft), and to patient-specific idiosyncratic noise^{37,38,39}. General-form noise models (the symmetric error model^{33} or the additive model^{40}) are irrelevant for analyzing idiosyncratic noise. Also, focusing here on noiseless tests shows that the benefits of taking homophily into account in group testing are not limited to noise-related issues, and makes the comparison with the information-theoretic limit of Baldassini *et al*. (2013)^{34} easier. Baldassini *et al*. (2013, Section III)^{34} analyze a noiseless test in a population of size *N*, with *K* defectives (*K* is known for simplicity). They show that if the number of tests is limited to *T*, the probability of correct identification of the set of defectives satisfies:

$$P(\text{suc}) \le \frac{2^{T}}{\binom{N}{K}} \qquad (1)$$

where $\binom{N}{K} = \frac{N!}{K!\,(N-K)!}$ is the number of possible sets of *K* defectives among *N* individuals.

Let us build a simple counterexample. *N* is 64 and *K* is 8. Using 6 tests only (*T*=6), one can cut the population into 8 groups of 8 people each and determine which group contains carriers *if only one group contains carriers* (think of the population of 64 as a 4×4×4 cube and cut it in half along each dimension, that is, implement 6 tests over 32 people each). According to (1):

$$P(\text{suc}) \le \frac{2^{6}}{\binom{64}{8}} = \frac{64}{4\,426\,165\,368} \approx 1.4 \times 10^{-8} \qquad (2)$$

With homophily, there exists high potential for within-group contamination and low potential for between-group contamination. With probability (1-ε_{1}), only one individual has imported the disease into the population of 64 (a decent assumption with a low general prevalence); ε_{2} is the probability that intergroup contamination has happened. Then with probability (1 - ε_{1})(1 - ε_{2}), all 8 carriers are in the same group. Let us fix ε_{1}=0.2 and ε_{2}=0.5. Then:

$$P(\text{suc}) \ge (1-\varepsilon_{1})(1-\varepsilon_{2}) = 0.8 \times 0.5 = 0.4$$

which contradicts (2): homophily provides information that makes it possible to overcome information-theoretic limits based on an implicit homogeneity assumption.
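The orders of magnitude in this counterexample can be reproduced numerically. The sketch below assumes that the homogeneous limit takes the counting form 2^T divided by the number of possible sets of K defectives among N individuals; the helper `locate_group` is a hypothetical illustration of the six half-cube tests, not code from the cited papers.

```python
from math import comb

N, K, T = 64, 8, 6        # population, defectives, tests (values from the text)
eps1, eps2 = 0.2, 0.5     # probabilities fixed in the text

# Assumed counting form of the homogeneous limit: P(success) <= 2^T / C(N, K).
counting_bound = 2 ** T / comb(N, K)

# With homophily, all K carriers share one group with this probability,
# in which case the six tests below pinpoint that group.
p_homophily = (1 - eps1) * (1 - eps2)


def locate_group(carriers):
    """Run the 6 half-cube tests on a 4x4x4 cube of individuals.

    `carriers` is a list of (x, y, z) coordinates in 0..3. Returns the
    2x2x2 block (one bit per axis) containing all carriers, or None when
    the outcomes do not single out one block.
    """
    block = []
    for axis in range(3):
        low = any(c[axis] < 2 for c in carriers)    # pool of 32: coordinates 0-1
        high = any(c[axis] >= 2 for c in carriers)  # pool of 32: coordinates 2-3
        if low == high:  # both halves positive, or both negative
            return None
        block.append(1 if high else 0)
    return tuple(block)


# Eight carriers filling a single 2x2x2 block.
cluster = [(x, y, z) for x in (2, 3) for y in (0, 1) for z in (2, 3)]
print(counting_bound, p_homophily, locate_group(cluster))
```

The gap between the two probabilities, several orders of magnitude, is what the counterexample relies on; when carriers are spread over several blocks, `locate_group` returns `None` and the six tests are not sufficient.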

## IV. FALSE NEGATIVES CAN BE IDENTIFIED USING A TWO-STEP STRATEGY

The swabs of many disease carriers fail to contain any viral load, inducing patient-specific idiosyncratic false negatives^{37,38,39}. A strategy based on group testing with homophily can solve this issue and identify more asymptomatic carriers than exhaustive individual testing.

Methods to identify SARS-CoV-2 carriers include clinical diagnosis (not for asymptomatic carriers), chest radiograph and CT scan (not widely available), fibrobronchoscope brush biopsy and RT-PCR on bronchoalveolar lavage fluid (requires specific equipment and skilled operators), sputum (produced in only 28% of the COVID cases^{41}), feces swabs, and nasal and throat swabs. Only nasal and throat swabs can be used easily for large-scale asymptomatic testing. The former induce a limited rate of false negatives if implemented less than one week after the onset of the disease for symptomatic patients (Table 1). Thus, a strategy based on nasal swabs should be repeated weekly to minimize the risk of false negatives.

Still, even nasal swabs, widely used for identification of SARS-CoV-2, display a significant rate of false negatives. Group testing with homophily is very beneficial here. If a carrier is “false negative” with probability *α*, then a pool with two carriers will be negative with probability *α*^{2} only (*α*<1), and with probability *α*^{n} if there are *n* carriers: concentrating carriers in a pool decreases the risk of wrongly classifying this pool as negative.

In a second step, implementing individual RT-PCR on nasopharyngeal swabs would reintroduce a false-negative problem. Rather, advanced complementary exams are implemented on all individuals belonging to positive pools: chest radiograph or CT scan, fibrobronchoscope brush biopsy, and RT-PCR on additional types of swabs. With independence, the probability that all tests provide a false negative would be the product of the proportions of false negatives for each type of test. Empirical evidence on this issue is scarce: the correlation between nasal and throat swabs is low (Kappa=0.308), and computed tomography (CT) scans always detected ground-glass opacities for cases without viral shedding in the swabs examined (3 cases)^{39}. Similarly, PiO2/FiO2 and the Murray score, implemented on two cases without viral shedding, always pointed to lung injury. Whether these results apply to asymptomatic carriers is an open question. More advanced work is needed on the correlation between these tests and exams for asymptomatic carriers, but a series of tests including nasal, throat and feces swabs, sputum swabs when available, lower respiratory swabs, PiO2/FiO2 tests and Murray score may be able to detect most individual carriers if implemented frequently; CT scans may prove useful to identify more severe cases.

Graph 1 analyzes the quantitative gains from homophily in this two-step strategy, for 2 to 5 defectives. This covers a large range of situations: two defectives out of two thousand people correspond to a rate of 0.1%, while five out of 50 correspond to 10%. The size (and number) of pools do not affect the graphs, which therefore cover a large range of pool sizes.

From left to right, each graph displays increasing concentration (denoted by <_{C}) of the defectives in testing pools. <_{C} is transitive but is not a total order, and using ≈ for non-ordered configurations, we get, for 5 defectives:

Bars describe the expected number of missed carriers due to false negatives. When carriers are concentrated in a few pools or a single pool, the expected number of missed carriers decreases. The gains from homophily can be summarized by comparing the expected number of missed carriers when all carriers are in different pools and when they are in the same pool. For *α*=1/3 (which is the central case^{37,39}), the reduction reaches 67% for two carriers (0.67 expected missed carriers if the carriers are in two different pools against 0.222 if both are in the same pool), 89% for three carriers, 96% for four carriers and 98.8% for five carriers. Since a pool with *n* carriers is negative with probability *α*^{n}, the relative reduction equals 1-*α*^{n-1}: it is higher for *α*=0.25 (respectively 75%, 94%, 98.4% and 99.6% for two, three, four and five carriers) and lower for *α*=0.5 (respectively 50%, 75%, 87.5% and 93.75%), whereas the absolute gains are highest for *α*=0.5, where more carriers are missed in the first place. Even less successful concentration can still bring huge gains. For instance, with five carriers and *α*=1/3, getting three carriers in a pool and two in another brings a reduction of 80%.
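These reductions can be recomputed from first principles. The sketch below is a simplified reading of the two-step strategy, with hypothetical function names: it assumes independent false negatives with probability α per carrier, so that a pool with n carriers is negative with probability α^n, and it assumes that every carrier in a positive pool is then found by the complementary exams.

```python
def expected_missed(config, alpha):
    """Expected number of carriers sitting in pools that test negative.

    `config` lists the number of carriers in each contaminated pool,
    e.g. [3, 2]. A pool with n carriers is negative with probability
    alpha ** n, in which case its n carriers are all missed.
    """
    return sum(n * alpha ** n for n in config)


def reduction(config, alpha):
    """Relative reduction versus one carrier per pool (no concentration)."""
    k = sum(config)
    dispersed = expected_missed([1] * k, alpha)
    return 1 - expected_missed(config, alpha) / dispersed


alpha = 1 / 3
for k in range(2, 6):
    # All k carriers concentrated in a single pool.
    print(k, round(100 * reduction([k], alpha), 1))
# Partial concentration: three carriers in one pool, two in another.
print(round(100 * reduction([3, 2], alpha), 1))
```

With full concentration the ratio simplifies to 1 - α^{k-1}, which makes the dependence on α and on the number of carriers explicit.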

Note that even without homophily, that is, if carriers are *i*.*i*.*d*. across the different pools, two carriers may be in the same pool by chance, which reduces the expected number of “missed” carriers. Group testing even without homophily can help identify and isolate carriers who would be missed by exhaustive individual testing. To the best of my knowledge, this simple and striking fact has been overlooked in the scientific and policy debates on group testing to fight epidemics.

Advanced complementary exams are costly, and homophily is crucial here. When the concentration of carriers increases, the expected number of contaminated groups, and therefore the expected number of positive groups (groups that are *identified* as contaminated by the first step), are sharply reduced. The gains from homophily are lower than those observed for the expected number of missed carriers, because homophily has two opposite effects here: concentration decreases the number of potential positive groups, but improves the identification of these groups. Still, these gains are significant. For *α*=1/3, the reduction in the number of groups on which to implement complementary exams is 33.3% for two carriers, 51.8% for three carriers, 63% for four carriers and 70% for five carriers. The absolute gains are especially high for low values of *α*, since for low *α*, most contaminated groups are identified correctly and homophily mainly reduces the number of really contaminated groups.
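The reduction in the number of groups sent to complementary exams can be illustrated in the same way. This is a minimal sketch with hypothetical function names, assuming again that a pool with n carriers is identified as positive with probability 1 - α^n, and comparing one carrier per pool with all carriers in a single pool.

```python
def expected_positive_pools(config, alpha):
    """Expected number of pools identified as positive in the first step.

    `config` lists the number of carriers in each contaminated pool; a pool
    with n carriers tests positive with probability 1 - alpha ** n.
    """
    return sum(1 - alpha ** n for n in config)


def pool_reduction(k, alpha):
    """Relative reduction in positive pools when k carriers share one pool."""
    dispersed = expected_positive_pools([1] * k, alpha)
    concentrated = expected_positive_pools([k], alpha)
    return 1 - concentrated / dispersed


alpha = 1 / 3
for k in range(2, 6):
    print(k, round(100 * pool_reduction(k, alpha), 1))
```

The two opposite effects appear directly in the ratio: concentration divides the number of contaminated pools by k, but raises the detection probability of the remaining pool from 1 - α to 1 - α^k.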

## V. CONCLUSION

Homophily is prevalent in interpersonal interactions. Taking it into account makes group testing more efficient and helps reduce false-negative issues. In case of a lockdown, testing households together, which are all potential clusters with a high attack rate, is efficient; in more normal times, a multi-dimensional testing strategy with non-nested pools (households on the one hand, other potential clusters such as firms on the other hand) is likely to be beneficial. These results open a new avenue for research to fine-tune the present analysis and combine it with other research in order to better fight the COVID19 epidemic.

The author declares no competing interest.

## Data Availability

No data: mathematical analysis

## Acknowledgements

I thank Marc Fleurbaey, Matthew Jackson, Xavier d’Haultfeuille and seminar participants at Paris School of Economics for useful comments. All remaining errors are mine.