Probabilistic approaches for classifying highly variable anti-SARS-CoV-2 antibody responses

Antibody responses vary widely between individuals [1], complicating the correct classification of low-titer measurements using conventional assay cut-offs. We found all participants in a clinically diverse cohort of SARS-CoV-2 PCR+ individuals (n=105), together with PCR+ hospital staff (n=33), to have detectable IgG specific for pre-fusion-stabilized spike (S) glycoprotein trimers, while 98% of persons had IgG specific for the receptor-binding domain (RBD). However, anti-viral IgG levels differed by several orders of magnitude between individuals and were associated with disease severity, with critically ill patients displaying the highest anti-viral antibody titers and strongest in vitro neutralizing responses. Parallel analysis of random healthy blood donors and pregnant women (n=1,000) of unknown serostatus further demonstrated highly variable IgG titers amongst seroconverters, although these were generally lower than in hospitalized patients and included several measurements that scored between the classical 3 and 6 SD assay cut-offs. Since the correct classification of seropositivity is critical for individual- and population-level metrics, we compared different probabilistic algorithms for their ability to assign likelihood of past infection. To do this, we used tandem anti-S and -RBD IgG responses from our PCR+ individuals (n=138) and a large cohort of historical negative controls (n=595) as training data, and generated an equal-weighted learner from the output of support vector machines and linear discriminant analysis. Applied to test samples, this approach provided a more quantitative way to interpret anti-viral titers over a large continuum, scrutinizing measurements overlapping the negative control background more closely and offering a probability-based diagnosis with potential clinical utility.
Especially as most SARS-CoV-2 infections result in asymptomatic or mild disease, these platform-independent approaches improve individual and epidemiological estimates of seropositivity, critical for effective management of the pandemic and monitoring the response to vaccination.

To validate our assays, we repeatedly analyzed a large set of serum samples from historical blood donors as negative controls (n=595), critical for determining the assay background.
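
As a rough illustration of why the negative-control background matters, the sketch below derives the conventional 3 SD and 6 SD cut-offs from a set of negative-control readings and applies them to one low-titer sample. The OD values and the helper names (`sd_cutoff`, `call_positive`) are hypothetical and are not taken from the study; working on log-transformed values is an assumption made here for illustration.

```python
import math
import statistics

def sd_cutoff(negative_ods, n_sd):
    """Return mean + n_sd * SD of the log10-transformed negative-control ODs."""
    logs = [math.log10(od) for od in negative_ods]
    return statistics.mean(logs) + n_sd * statistics.stdev(logs)

def call_positive(sample_od, cutoff):
    """Dichotomize one sample: True if its log10 OD exceeds the cut-off."""
    return math.log10(sample_od) > cutoff

# Hypothetical negative-control ODs (illustrative values only).
negatives = [0.08, 0.10, 0.09, 0.12, 0.11, 0.07, 0.10, 0.13, 0.09, 0.11]
cutoff_3sd = sd_cutoff(negatives, 3)
cutoff_6sd = sd_cutoff(negatives, 6)

# A low-titer sample can fall between the two cut-offs, so its label
# depends entirely on which number of SD was chosen.
low_titer_od = 0.25
print(call_positive(low_titer_od, cutoff_3sd),
      call_positive(low_titer_od, cutoff_6sd))
```

Because both cut-offs are computed from the same controls, a small or unrepresentative control set shifts both of them at once.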

As we show, and as has been reported by others [4,7,11], the magnitude of response varied greatly between seropositive individuals and was associated with disease severity. Those with the most pronounced symptoms had the highest anti-viral antibody titers, while those with asymptomatic or mild disease (including otherwise healthy blood donors and pregnant women) exhibited a range of antibody levels, with many measurements in close proximity to the negative control background, complicating their correct classification. To improve upon the dichotomization of a continuous variable, which is common to many clinical tests but results in a loss of information [12,13], we used tandem anti-S and RBD IgG data from confirmed infections and negative controls to train different probabilistic algorithms to assign likelihood of past infection. Compared to strictly thresholding the assay at 3 or 6 standard deviations (SD) from the mean of negative control measurements, these more quantitative approaches modelled the probability that a sample was positive, improving the identification of low-titer values and paving the way for greater utility of antibody test results.

This version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We developed ELISA protocols to profile IgM, IgG and IgA specific for a pre-fusion-stabilized spike (S) glycoprotein trimer [14], the RBD, and the nucleocapsid (N). Trimer conformation was confirmed in each batch by cryo-EM [15] and a representative subset of study samples was used for assay development (Fig. S1A). In contrast to other studies reporting significant cross-reactivity to S in the UK population [16], we did not observe reproducible IgG reactivity to S or RBD across all 595 historical controls in the study, although two individuals who were PCR-positive for endemic coronaviruses (ECV+) in the last six months displayed reproducible IgM reactivity to both SARS-CoV-2 N and S, and two 2019 blood donors (from n=72 tested) had low anti-S IgM reactivity (Fig. S1B). Thus, further investigation is required to establish the contribution of potential cross-reactive memory SARS-CoV-2 responses [17].

Responses to S and the RBD were highly correlated and our assay revealed a greater than 1,000-fold difference in anti-viral IgG titers between Ab-positive individuals when examining serially diluted sera (Fig. S1C [...] responses were generally weaker and more variable and also spread over a large range (Fig. 1B).

To examine this further, PCR+ individuals were grouped according to their clinical status: non-hospitalized (Cat. 1), hospitalized (Cat. 2) or admitted to the intensive care unit (Cat. 3). To validate our clinical classification, we measured serum IL-6 levels in a random subset of PCR+ individuals (n=64). IL-6 feeds Ab production [19-22], and as has been reported [23], was increased in samples from individuals with severe disease (Fig. 1C). Furthermore, multivariate analyses (accounting for the effects of age, sex and days from symptom onset/PCR test) revealed increased anti-viral IgM, IgG and IgA to be associated with disease severity, as has been reported [7] (Fig. 1C and S1D-E, Table S1). Severe disease was most strongly associated with virus-specific IgA, suggestive of mucosal pathology. We did not observe an association between ICU or IL-6 status and IgM levels, supporting the view that levels of the cytokine and IgA mark a more severe clinical course of COVID-19 (Fig. S1D). Anti-RBD IgA responses were slightly lower in non-hospitalized and hospitalized females compared to males, and trended similarly for S (Fig. S1D and Table S1), consistent with females developing less severe disease [4].

Across all PCR+ individuals (sampled up to two months from PCR test), anti-viral IgG levels were maintained, while IgM and IgA decreased, in agreement with their circulating half-lives (t1/2) and viral clearance (Fig. S1D and Table S1). In longitudinal patient samples (sequential sampling of PCR+ individuals in the study) where we observed seroconversion, IgM, IgG and IgA peaked with similar kinetics when all three isotypes developed, although IgA was not always generated in non-hospitalized or hospitalized individuals (Fig. 1E), supporting a more diverse antibody response in severe disease.

To extend these observations, we characterized the in vitro virus neutralizing antibody response in PCR+ patients. Using an established pseudotype virus neutralization assay [24], we detected neutralizing antibodies in the serum of all SARS-CoV-2 PCR+ individuals screened (n=48) (Fig. 1F). Neutralizing responses were not seen in samples before seroconversion or in negative controls (Fig. 1E and F). A large range of neutralizing ID50 titers was apparent, with binding and neutralization being highly correlated (Fig. S1D). In agreement with the binding data, the strongest neutralizing responses were observed in samples from patients in intensive care (geometric mean ID50=5,058; 95% CI [2,422 - 10,564]) (Fig. 1E).
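
For readers unfamiliar with how a geometric mean titer and its confidence interval are typically summarized, a minimal sketch follows. It assumes the interval is computed on the log scale with a normal approximation (z = 1.96); the source does not state which interval method was used, and the titers and the function name `geometric_mean_ci` are hypothetical.

```python
import math
import statistics

def geometric_mean_ci(titers, z=1.96):
    """Geometric mean and approximate 95% CI, computed on the log scale.

    Assumption: a normal approximation is used for the interval; with few
    samples a t-quantile would give a somewhat wider interval.
    """
    logs = [math.log(t) for t in titers]
    mean_log = statistics.mean(logs)
    sem_log = statistics.stdev(logs) / math.sqrt(len(logs))
    return (
        math.exp(mean_log),
        math.exp(mean_log - z * sem_log),
        math.exp(mean_log + z * sem_log),
    )

# Hypothetical ID50 titers spanning a wide range (illustrative only).
id50_titers = [850, 2300, 5100, 12000, 30000]
gmean, lower, upper = geometric_mean_ci(id50_titers)
print(round(gmean), round(lower), round(upper))
```

Because the interval is symmetric on the log scale, it is asymmetric around the geometric mean on the original scale, as in the interval reported above.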

In healthy blood donors and pregnant women (n=1,000, collected between weeks 17-21 of 2020, the same time as the patient cohort), who did not have signs or symptoms of COVID-19 for two weeks prior to sampling and had not been hospitalized for COVID-19, IgG titers varied greatly but were generally lower than in hospitalized COVID-19 patients, and were comparable to titers in PCR+ hospital staff (n=33), who also had never been hospitalized following infection (Fig. 1G).

[...] alongside test samples throughout the study. We considered the spread of negative values critical, since the use of a small and unrepresentative set of controls can lead to an incorrectly set threshold, which can considerably skew the seropositivity estimate. This is illustrated by the random sub-sampling of non-overlapping groups of negative controls, resulting in a 40% difference in the positivity estimate (Fig. 2A). Worryingly, many clinically approved tests use a ratio between a known positive and negative serum calibrator to classify seropositivity [25], although we show here that these are highly variable within the population.

[...] specificity >99.6% (Fig. 2D). On these metrics, LDA gave the highest specificity. Logistic regression (LOG) had similarly high specificity on some folds of the training data, but with higher sensitivity. However, we deliberately considered balanced and unbalanced folds (where case:control ratios varied between folds) and found LOG to show the least consistency across strategies, which reflects that the proportion of cases in a sample directly informs a logistic model's estimated parameters. SVM methods had lower specificity than LDA in the training data, but higher sensitivity.
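
To make concrete how a linear discriminant turns two measurements into a probability rather than a yes/no call, the sketch below fits a two-class LDA by hand on paired (anti-S, anti-RBD) values and returns the posterior probability of being positive. It is a minimal illustration, not the implementation used in the study: the data points, the function names (`fit_lda`, `prob_positive`) and the choice to take class priors from the training proportions are all assumptions.

```python
import math

def fit_lda(negatives, positives):
    """Fit two-class LDA on 2-D points; return (weights, intercept)."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    mu_neg, mu_pos = mean(negatives), mean(positives)

    # Pooled within-class covariance (2x2), shared by both classes.
    sxx = sxy = syy = 0.0
    for points, mu in ((negatives, mu_neg), (positives, mu_pos)):
        for x, y in points:
            dx, dy = x - mu[0], y - mu[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(negatives) + len(positives) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof

    # w = inverse(covariance) @ (mu_pos - mu_neg), using the 2x2 inverse.
    det = sxx * syy - sxy * sxy
    d0, d1 = mu_pos[0] - mu_neg[0], mu_pos[1] - mu_neg[1]
    w = [(syy * d0 - sxy * d1) / det, (sxx * d1 - sxy * d0) / det]

    # Intercept: midpoint term plus log prior odds (assumed here to be the
    # training case:control proportions).
    mid = [(mu_pos[0] + mu_neg[0]) / 2, (mu_pos[1] + mu_neg[1]) / 2]
    b = -(w[0] * mid[0] + w[1] * mid[1]) + math.log(len(positives) / len(negatives))
    return w, b

def prob_positive(sample, w, b):
    """Posterior probability that a sample is positive (stable sigmoid)."""
    score = w[0] * sample[0] + w[1] * sample[1] + b
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    return math.exp(score) / (1.0 + math.exp(score))

# Hypothetical log-transformed (anti-S, anti-RBD) values, illustrative only.
negatives = [(-1.1, -1.0), (-0.9, -1.2), (-1.0, -0.9),
             (-1.2, -1.1), (-0.8, -1.0), (-1.0, -1.3)]
positives = [(0.1, -0.2), (-0.3, -0.1), (0.3, 0.2),
             (-0.1, 0.1), (0.2, -0.3), (0.0, 0.0)]
w, b = fit_lda(negatives, positives)

# A sample lying between the two clusters gets an intermediate probability.
borderline = prob_positive((-0.5, -0.55), w, b)
print(round(borderline, 2))
```

A sample near either cluster receives a probability close to 0 or 1, while one near the decision boundary is reported as uncertain instead of being forced into a class.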

The standard methods, calling positives by a fixed number of SD above the mean of negative controls, displayed two extreme behaviors: 3-SD had the highest sensitivity (100%), while 6-SD had the highest specificity and the lowest sensitivity (Fig. S2A), emphasizing that the number of SD above the mean is a key parameter, but one which is not learnt in any formal data-driven manner. Both SVM and LDA offer linear classification boundaries, but we can see that the probability transition from negative to positive cases is much sharper for LDA [...] Table S2). This is in contrast to the SD thresholding, which identified 12% and 10% positivity for S and RBD, respectively, at 3 SD, and 8% and 7.5%, respectively, at 6 SD (Table S2). Therefore, apart from providing more accurate population-level estimates, critical to seroprevalence studies, where we have applied these and related tools in a large cohort [26], these methods have the potential to provide more nuanced information about titers to an individual after an antibody test. For example, test samples with a 30-60% chance of being antibody positive (Fig. S2B) [...]

[...] as additional negative controls. The use of study samples was approved by the Swedish Ethical Review Authority (registration no. 2020-01807). Stockholm County death and Swedish mortality data were sourced from the ECDC and the Swedish Public Health Agency, respectively. Study samples are defined in Table 1.

[...] Chromogen (Invitrogen). The reaction was stopped using 1M sulphuric acid and optical density (OD) values were measured at 450 nm using an Asys Expert 96 ELISA reader (Biochrom Ltd.). Secondary antibodies (all from Southern Biotech) and dilutions used: goat anti-human IgG (2014-05) at 1:10,000; goat anti-human IgM (2020-05) at 1:1,000; goat anti-human IgA (2050-05) at 1:6,000. All assays of the same antigen and isotype were developed for their fixed time and samples were randomized and run together on the same day when comparing binding between PCR+ individuals. Negative control samples were run alongside test samples in all assays and raw data were log transformed for statistical analyses.

[...] We considered three strategies for cross-validation: i) random: individuals were sampled into folds at random; ii) stratified: individuals were sampled into folds at random, subject to ensuring the balance of cases:controls remained fixed; and iii) unbalanced: individuals were sampled into folds such that each fold was deliberately skewed to under- or over-represent cases compared to the total sample. We sought a method with performance that was consistently good across all cross-validation sampling schemes, because the true proportion of cases in the test data is unknown, and we want a method that is not overly sensitive to the proportion of cases in the training data. We chose to assess performance using sensitivity and specificity, as well as consistency.
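
The three fold-assignment strategies and the two performance metrics can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, and the linear weighting used to skew the unbalanced folds is one arbitrary choice, since the text does not say how the skew was produced. The class sizes (138 cases, 595 controls) are those of the training data.

```python
import random

def random_folds(labels, k, rng):
    """i) Random: assign individuals to k folds at random."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def _shuffled_by_class(labels, rng):
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(cases)
    rng.shuffle(controls)
    return cases, controls

def stratified_folds(labels, k, rng):
    """ii) Stratified: random folds with a fixed case:control balance."""
    cases, controls = _shuffled_by_class(labels, rng)
    return [cases[i::k] + controls[i::k] for i in range(k)]

def _split_by_weights(items, weights):
    """Cut a list into consecutive parts with sizes proportional to weights."""
    total, acc, start, parts = sum(weights), 0, 0, []
    for w in weights:
        acc += w
        end = round(len(items) * acc / total)
        parts.append(items[start:end])
        start = end
    return parts

def unbalanced_folds(labels, k, rng):
    """iii) Unbalanced: early folds under-represent cases, late folds
    over-represent them (assumed linear weighting)."""
    cases, controls = _shuffled_by_class(labels, rng)
    weights = list(range(1, k + 1))
    case_parts = _split_by_weights(cases, weights)
    control_parts = _split_by_weights(controls, weights[::-1])
    return [c + d for c, d in zip(case_parts, control_parts)]

def sensitivity_specificity(truth, predicted):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

labels = [1] * 138 + [0] * 595  # 1 = PCR+ case, 0 = historical control
rng = random.Random(0)          # fixed seed so the folds are reproducible
folds_random = random_folds(labels, 5, rng)
folds_stratified = stratified_folds(labels, 5, rng)
folds_unbalanced = unbalanced_folds(labels, 5, rng)

def case_fraction(fold):
    return sum(labels[i] for i in fold) / len(fold)

print([round(case_fraction(f), 2) for f in folds_unbalanced])
```

Evaluating each classifier on all three sets of folds, and comparing how much its sensitivity and specificity move between them, gives one way to quantify the consistency criterion described above.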
medRxiv preprint doi: https://doi.org/10.1101/2020.07.17.20155937