Abstract
In this work will apply mixture models based on distributions from the SMSN family to antibody data against four SARS-CoV-2 virus antigens. Furthermore, since the true infection status of individuals is known a priori, performance measures will be calculated for the methods proposed for cutoff point estimation such as sensitivity, specificity and accuracy. The results of a simulation study will also be presented.
1 Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection that causes the devastating and often lethal COVID-19 disease was first detected in China, province of Wuhan in December 2019 ([26]). Rapidly, SARS-CoV-2 infection spread over the entire world and the COVID-19 disease was declared as a pandemic by the World Health Organization.
The detection of the virus is so far done by the so-called reverse transcription quantitative PCR (RT-qPCR) on samples from nasopharyngeal or throat swabs ([26]). In general, only symptomatic individuals or people who were in close contact with detected cases are tested, which might lead to an underestimation of the proportion of individuals infected with SARS-CoV-2 ([31]). Alternatively, serological testing allows to detect asymptomatic individuals exposed to the infection. In addition, serological testing is able to quantify the degree of exposure to the infection in the population. In this context, it is important to estimate seroprevalence at the population level, i.e., the proportion of seropositive individuals that show antibodies against any SARS-CoV-2 antigen ([14]).
The presence of antibodies in a serum sample can be regarded as an indicator of immunity against a given infectious agent or as an indicator of past infection in the absence of vaccination ([10]). The detection of antibodies in the serum samples is classically done via enzyme linked immunosorbent assays (ELISA), where the resulting data are light intensities, also called optical density, which reflects the underlying antibody concentration in the samples ([9]). For statistical convenience, the analysis of serological data proceeds by dichotomizing the amount of antibodies present in the serum of an individual using an arbitrary cutoff point in the antibody distribution to achieve a certain sensitivity and specificity. This allows the classification of individuals into seronegative (with antibody levels below the cutoff point) and seropositive (with antibody levels above the cutoff point) ([26]).
Given the possible impact of the cutoff chosen, different criteria for seropositivity determination have a direct impact on the sensitivity and specificity of the respective serological classification ([22]). In addition, it might also impact the estimation of the seroprevalence ([13]) and the following (epidemiological) decision that can be taken when facing a given estimate of this epidemiological parameter. This means that when determining the cutoff point for a serological test, one should take into account the benefit of the test, the economic and social consequences of serological misclassification and the prevalence of the disease in the population. It turns out that these aspects are often ignored in practice ([25]).
One of the traditional methods to establish the cutoff point in serological assays is to consider the logarithmic transformation of the antibody concentration of a known seronegative population and proceed to calculate the mean plus 2 or 3 standard deviations ([4,18,25,32]). This method is more adequate when the antibody distribution of the seronegative population is normally distributed ([4]). However, our previous studies of different serological data ([9,30]) showed evidence against a normality assumption for the antibody levels associated with a putative seronegative population. In the case where the true infection (or disease) status is known, ROC curve-based methods are most commonly used to determine the cutoff point for defining seropositivity. These methods are widely discussed in the literature ([5,11,12,21,23,27,33]).
Alternatively, finite mixture models can be used to determine the seropositivite cutoff directly from the data ([4,9,13,21,29]). In our previous work, three methods for determining seropositivity cutoff were explored using the so-called scale mixtures of Skew-normal distributions in the case where the true infection status is unknown ([9]). In this paper we applied the same methods and models in order to evaluate their performance in freely available serological data concerning SARS-CoV-2 virus ([26]). We also used simulation to understand the performance of the cutoff estimators associated with different criteria for seropositivity determination.
2 Serological data concerning SARS-CoV-2 virus
In this study we analyzed IgG antibody responses against four SARS-CoV-2 spike or nucleoprotein antigens: RBD – glycoprotein receptor-binding domain; S^{tri} — S trimeric spike protein; S1 — spike glycoprotein S1 domain; S2 – SARS-CoV-2 spike glycoprotein S2 domain. Antibodies were measured in serum samples collected up to 39 days after symptom onset from 215 adults in four French hospitals (53 patients and 162 health-care workers) with quantitative RT-PCR-confirmed SARS-CoV-2 infection. A total of 335 negative control serum samples were collected from France, Thailand, and Peru before the start of the COVID-19 pandemic ([26]). A detailed description of lab procedures can be found in the original study ([26]). The data is freely available at https://github.com/MWhite-InstitutPasteur/SARSCoV2SeroDXphase2.
3 Statistical methods
Serological data can be viewed as arising from two or more latent populations; each population is assumed to represent different levels of exposure to a given antigen. For simplicity, individuals that were never exposed or exposed a long time ago to an infectious agent are considered as seronegative. In contrast, individuals exposed to the same infectious agent are considered seropositive. In this scenario, the antibody distribution can be described by a mixture of two or more probability distributions ([8]). However, the true serological state of the individuals is unknown and therefore it needs to be estimated.
In the particular case of the SARS-CoV-2 data, we know which individuals were exposed to the virus and, therefore, we can assume to know which individuals are true seronegative and seropositives.
In many serological studies, it is common to assume a normal distribution for the basis of the mixture models. However, the behaviour of antibody distribution is not constant over time and their concentration decreases after infection ([26]). This fact makes the distribution of the seropositive population skewed to the left ([10]). In order to accommodate the possible skewness in the seropositive population we use the scale mixture of Skew-Normal (SMSN) class of distributions that include the Skew-Normal and the Skew-t distributions, which will be the focus of our study. A brief description of these alternative distributions can be found below.
3.1 Skew-Normal and Skew-t distributions
Let W ⌢ SN(µ, σ^{2}, α) a random variable with a Skew-Normal distribution. In this distribution, the parameters µ, σ^{2}, and α can be seen as the location, scale, and shape parameters, respectively. Then the probability density function (pdf) is given by where ϕ(·) and Φ(·) is the pdf and the cumulative distribution function of the standard Normal distribution, respectively ([1,3,9]). The Skew-Normal distribution is part of a family of distributions called the Scale Mixtures of Skew-Normal distributions (SMSN), of which the Skew-t distribution is also a particular case ([9]).
A random variable W is said to have a Skew-t distribution, W ⌢ ST (µ, σ^{2}, α, v), if the pdf is given by where f_{T} (.; µ, σ^{2}, v + 1) and F_{T} (.; µ, σ^{2}, v + 1) represents the pdf and the cumulative distribution function of the generalized Student’s t distribution with v + 1 degress of freedom, and ([1,3,9]).
3.2 Finite mixture models
Let G_{1} and G_{2} be the seronegative and seropositive subpopulations from a population G, respectively. Let π_{1} and π_{2} the probabilities of sampling a seronegative and a seropositive individual, respectively (with the usual restriction of and 0 ≤ π_{k} ≤ 1) and considering Z the random variable that represents the antibody level. The probability density function (pdf) of Z is given by where f_{k}(z; θ_{k}) is the mixing probability density function of Z associated with the k − th latent population and parameterized by the vector θ_{k}. Θ is the vector of all unknown parameters of the mixture model, i.e., Θ = (π_{1}, π_{2}, θ_{1}, θ_{2}). In our application, f_{k}(z; θ_{k}), is given by the Skew-normal or the Skew-t distributions.
In general, the estimation of a finite mixture model can be done by the classical EM algorithm ([15]). The EM algorithm is an iterative method widely used in incomplete data problems where the maximum likelihood estimators (MLE) have no closed expression ([7]). Considering (z_{1}, z_{2}, …, z_{n}) the observed sample of size n and Y_{i} ≡ Y_{ik}, (i = 1,.., n; k = 1, 2), the binary vector representing the component from which the data comes from. Thus, Y_{i} ⌢ Bernoulli(π_{2}) and the pdf of Y_{i} is given by We have that the complete data is the pair (z_{n}, y_{n}) and the joint pdf is given by Then, the log-likelihood function is given by The step E of the EM algorithm consists in obtaining where
The step M consists in maximizing Q(Θ, Θ^{(p)}) as function of the unknown parameters. However, if the model has many parameters that need to be estimated, then step M may incur in computational problems such as excessive time consuming or estimate instability. In this sense, it is possible to break the step M into several sub-steps (S > 1) that allow to get around these computational constraints by performing some restrictions on the parameters. This method is called expectation-conditional-maximization (ECM) algorithm ([19,17,20]). Considering that Θ^{(p+s)} represents the value of Θ in the s^{th} CM step of the iteration p + 1 in order to maximize Q(Θ, Θ^{(p)}) and the constraint function g_{s}(Θ) = g_{s}(Θ^{(p+(s−1)}), the ECM algorithm is performed as follow ([17]):
calculate the expected complete-data log-likelihood given the current estimates of the parameters, Θ^{(p)}. The calculations are the same as for the EM algorithm;
fix Θ^{(p)} and calculate Θ^{(p+s)} to maximise the expected complete-data loglikelihood;
fix Θ^{(p+s)} and calculate Θ^{(p+(s+1))} to maximise the expected complete-data log-likelihood on the s + 1 sub-step iteration and continuing until you have gone through all the S sub-steps.
In this way, it can be seen that Q(Θ, Θ^{(p+s)} ≥ Q(Θ, Θ^{(p)}) for all Θ ∈ Ω_{s}(Θ^{(p+s)}), where Ω_{s}(Θ^{(p+s)}) = {Θ ∈ Ω : g_{s}(Θ) = g_{s}(Θ^{(p+(s−1)})} ([19,17,20]).
Considering the SMSN family of distributions, namely the Skew-Normal and the Skew-t distributions, the application of the ECM algorithm in the context of mixtures can be found in ([3,16]).
In order to decide which model is the best one among all the models fitted to the same data, we used the Bayesian Information Criterion (BIC) ([9]).
3.3 Definition of seropositivity
Seroprevalence is an epidemiological measure defined by the proportion of seropositive individuals in the sample. For its estimation, it is then necessary to define the serological status of the i-th individual by dychotomization the variable, Z_{i}, which represents the antibody concentration of the individual. This dychotomization is done by determining a value c such that for antibody values equal to or greater than c, the individual is classified as seropositive and seronegative, otherwise. Thus, let Y be the random variable representing the number of seropositive individuals in a sample of size n, we have to where π_{2} represents the seroprevalence, i.e, π_{2} = P[Z_{i} ≥ c] and I_{{·}} is the indicator variable. Considering that the random variable representing the antibody levels Z_{i} is modelled by a finite mixture of distributions, the way to estimate the cutoff c from the observed data is not standard. To facilitate the determination of this cutoff value, we below present three estimation methods or criteria.
- Method 1 (M1): It is based on the 99.9%-quantile associated with the estimated seronegative population. This method is the most popular in seroepidemiology ([28,29]). It is often called as the 3σ rule, because the 99.9%-quantile is given by the mean plus 3 times the standard deviation of a normally distributed seronegative population;
- Method 2 (M2): It relies on the minimum of the density mixture functions. In the case of two latent populations, the cutoff corresponds to the absolute minimum, and in the case of three or more latent populations the cutoff corresponds to the lowest relative minimum. This point can be calculated using the Dekker’s algorithm ([6]). It should be noted that the minimum of the mixing function is not expected to coincide with the point of intersection of the probability densities of each individual subpopulation;
- Method 3 (M3): It imposes a threshold in the the so-called conditional classification curves ([29]). Under the assumption that all components but the first one refer to seropositive individuals, the conditional classification curve for the i-th individual given the antibody level Z_{i} = x is defined as
In turn, the classification curve of seronegative individuals is simply given by After calculating these curves, one can impose a minimum value for the classification of each individual. In this case, two cutoff values arise in the antibody distribution, one for the seronegative individuals and another for seropositive individuals. Mathematically, the classification rule is given as follows where c_{−} and c_{+} are the cutoff values in the antibody distribution that ensure a minimum classification probability, say 90%. To calculate these cutoff values in practice, one can use the bisection method providing an initial interval where they might be located ([29]).
3.4 Performance of the proposed methods for cutoff point estimation
In order to evaluate the performance of each of the cutoff points, we estimated the respective sensitivity and specificity. Let D and D^{∗} be the true and estimated serological classification (or infection status), respectively. Sensitivity is defined as the conditional probability In turn, the specificity i s defined as The overall performance of each method is given by the accuracy (ACC) of the proposed method which corresponds to the proportion of correct results, that is,
3.5 Simulation study
We performed a small simulation study to assess the performance of cutoff points proposed by each method. With this purpose, we assume two simulation scenarios regarding the mixture model assumed for the data: (i) a mixture model based on the Skew-Normal distributions and (ii) a mixture model based on the Skew-t distribution.
For each scenario, we simulated 1000 samples with dimensions 100, 500 and 1000. In addition, for each simulation cycle, the weight of the mixture model was varied to check the ability of the model to identify the seropositive component even when the weight assigned to that component is very low. The implications of varying the weight of the seronegative and seropositive population are as follows: in the case where the proportion of seronegative individuals is very high relative to seropositive individuals, more effective decisions can be made to control the number of infections in the population. The opposite scenario is important in the case of effectiveness of vaccination in the population, particularly for individuals who may have lost immunity.
To this end, it was considered that the proportion of seronegative individuals could take the values 90%, 60% and 30%, being the respective proportion of seropositive individuals 10%, 40% and 70%, respectively. For each simulated sample, the parameters of the mixture model were estimated by maximum likelihood (via the EM algorithm) according to the distributional scenarios described above, as well as the respective cutoff points according to the methods M1, M2 and M3. Considering θ^{∗} the estimated parameter, θ the true value of the parameter, than we calculate the relative error that is and the mean squared error (MSE), i.e.
3.6 R packages
We used the package mixsmsn to fit different mixture models based on SMSN ([24]). In particular, we used the function smsn.mix to estimate the model parameter via the EM algorithm For fitting the Student’s t-distribution, we considered the R package extraDistr ([34]), namely, the function dlst to calculate their density, the function plst to define the cumulative distribution function and the function rlst to generate random samples in the simulation study. The fitting of the Skew-Normal distributions was performed with the package sn ([2]). The functions dsn, psn and rsn were used to calculate the probability density function, the cumulative distribution function and generate random samples of the Skew-Normal distribution, respectively. In the case of the Skew-t distribution, the functions dst, pst and rst were used to calculate the probability density function, the cumulative distribution function and generate random samples, respectively.
4 Results
4.1 Patients characteristic’s
For this study, data relating to 549 individuals was analysed. Serum samples were collected from individuals with confirmed SARS-CoV-2 infection by PCR test in four hospital units from Paris, namely: 4 (0.7%) from the Hôpital Bichat, 49 (9.0%) from the Hôpital Cochin and 161 (29.3%) from the Nouvel Hôpital (Strasbourg). Regarding the negative controls, 68 (12.4%) are from the Thai Red Cross (TRC), 90 (16.4%) from the Peruvian donors (NHP) and 177 (32.2%) from the France blood donors (Établissement Français du Sang). For each antigen under analysis, the logarithmic transformation of base 10 was considered for the concentration of antibodies against that antigen.
Regarding the analysis of antibodies by the individuals who performed PCR test, there were statistically significant differences between individuals who tested negative and positive for SARS-CoV-2 by Mann-Whitney test (RBD: 1.64 vs. 3.48, p < 0.001; S1: 1.72 vs. 2.59, p < 0.001; S2: 1.79 vs. 2.99, p < 0.001; Stri: 1.59 vs. 3.43, p < 0.001) (Figure 1). Such differences were expected given the general knowledge about the infection status, i.e., individuals who have already been exposed to the virus will have a higher concentration of antibodies than those who are still susceptible.
4.2 Mixture Model approach
We performed the fitting of the different mixture models considering two subpopulations, i.e., a seronegative population and a seropositive population. According to the BIC values, the model based on the Skew-Normal distribution was considered for the following antigens: RBD (BIC=852.25), S1 (BIC=561.63), S2 (BIC=775.29). For the case of the Stri antigen, the best model was found to be the Skew-t distribution (BIC=915.82) (Figure 2 and table 2). As has been observed in previous studies, there is a marked skew to the right of the data for the seronegative population and a skewed for the left in the seropositive population, although not very marked for the S1 (α_{S1} = 1.062) and S2 (α_{S2} = 0.450) antigens (Table 1).
4.3 Seropositivity estimation
After defining the model that best fits the data, we proceeded to categorize the amount of antibodies for each antigen by estimating the cutoff point. For this, we used the methods M1, M2 and M3 already described and whose results are shown in Figure 1 and table 2.
Estimation of the cutoff point based on the minimum densities of the mixture model (M2) proved to be the method with the highest sensitivity for classifying seropositive individuals, as well as the one that produces the highest proportion of correct results (accuracy) for the RBD antigen (cutoff = 2.49, sens = 86.45%, ACC = 92.89%), S1 (cutoff = 2.27, sens = 71.03%, ACC = 86.89%) and S2 (cutoff = 2.39, sens = 83.64%, ACC = 90.89%). In the case of the S^{tri} antigen, it was not possible to calculate the sensitivity and accuracy of the method based on the 99.9%-quantile (M1), given the high values that the quantile assumes leading to the seropositive population being fully absorbed by it. Thus, for comparison purposes, the application of each methods to the Skew-Normal distribution was considered, again verifying that the method based on the minimum densities of the mixture model produces the highest sensitivity (cutoff = 2.46, sens = 90.19%). However, for this antigen, the method with the highest accuracy is based on the conditional probability (set at 90%) of classifying an individual as being seropositive (ACC=93.44%) (Figure 3 and table 2).
In order to evaluate the quality of methods M1, M2 and M3, the optimal cutoff point was estimated using the ROC curve. This is possible since the true infection status of the individuals is known. It is interesting to see that in terms of specificity and accuracy the results are similar to the method that is traditionally used (ROC curve). However, it is possible to observe a poor performance of the M1 method with regard to its sensitivity. (Figure 3, table 2 and table 3).
4.4 Simulation results
To conduct the simulation study, two scenarios were considered: the first consists of the scenario where the model that best fits the data is a Skew-Normal distribution, and the second where the model that best fits the data is a Skew-t distribution. For this purpose, the results for the RBD antigen (Skew-Normal distribution) and the S^{tri} antigen (Skew-t distribution) were selected. For each scenario the sample size was varied, as well as the proportion of seronegative individuals in the population. The results are shown in table 4 and table 5.
In general it is found that as the sample size increases, both the relative error and the root mean square error tend to decrease. It is also found that for small samples and extreme π_{1} values (π_{1} = 0.3 or π_{1} = 0.9), the models tend to have some difficulty in identifying a seronegative and seropositive population. This is a result that alerts to the existence of possible false positives and false negatives in the case of small samples.
In situations where there is an ongoing vaccination plan and therefore the majority of the population is seropositive (e.g. π_{1} = 0.9) it is important to know if it is possible to identify seronegative individuals in this population given the time of immunization. If the timing of immunization is short, it is important to identify these individuals early in order to take action and prevent a further increase in infections.
5 Conclusions
The purpose of this study was to use a flexible class of mixture models to antibody data against the SARS-CoV-2 virus. In particular, we used a class of models that allows capturing the skewness present in this type of data, namely the Skew-Normal and Skew-t distributions.
It has become clear that diagnostic tests play a key role in the early identification of infected individuals, allowing us to act to control a pandemic by isolating and tracing the contacts of an infected person. Diagnostic tests can classify an individual as seronegative or seropositive by defining a cutoff point that can take on different values depending on the technique used by the manufacturer to develop the test. Most of the time, this cutoff point is relaxed and is calculated using the 3σ-rule, which assumes that the underlying distribution of the data is Normal. However, as we have seen in our application, this assumption cannot always be made, making this method unfeasible.
Note that this study has the advantage that the true cases and controls of the infection are known, allowing us to compare different methods for obtaining the cutoff point that allows classifying an individual as seropositive.
In [9], three methods for obtaining the cutoff point had been presented that could not yet be validated because the true infection status of the individuals was not known. In this sense, we proceeded to use these methods in this study, and it was verified that the three methods under analysis present high accuracy, compared to methods used in literature, namely through the empirical ROC curve. However, the proposed methods proved to be more specific than sensitive. Note that the performance of the method based on the 99.9% probability quantile may be overestimated, especially when the fitted distribution corresponds to a heavy-tailed distribution (such as the Skew-t distribution). This is because the calculation of this quantile involves only and exclusively the population of seropositive individuals, so that if the distribution is too skewed to the right, then the seropositive population is totally absorbed by this quantile.
When a new virus is present in the population, there is a natural tendency for the proportion of susceptible individuals to be much higher than the seropositive individuals. This is the phase in which early identification of the infected people is essential for pandemic control, although total control of the spread of the virus only occurs when there is vaccination or eradication of the virus. In this sense, with the simulation study developed in this work, we intend to analyze the pandemic evolution scenarios and understand the behavior of different methods for determining the cutoff point. It was found that as the sample size increases, there is a tendency for the relative error and the mean square error of the cutoff point estimates in skewed distributions to decrease, while this tendency is not linear in the case of the usual symmetric distributions (Normal and Student t). This fact may be due to the fact that symmetrical distributions are not the most appropriate for these types of data, or even that the proposed methods should not be used when considering the usual distributions.
As we expected, for small sample sizes and for large imbalances in the serological populations, the proposed models were found to have problems in identifying two components. Note that in the case of skewed distributions, it will be natural that if the weight of the seronegative population is very high, then observations relating to the seropositive population are considered false negatives and false positives otherwise.
A limitation of this study is the fact that the adjustment of the different mixture models was performed using the same distribution for the two components (through the package mixsmsn). If the components of the mixture model were distinct, this would have a direct implication on the estimated cutoff points. However, the package that would allow this analysis is now discontinued.
In conclusion, we recommend the use of mixture models based on distributions of the SMSN family for the analysis of serological data given the flexibility of these models, as well as the use of the proposed methods for determining cutoff points as an alternative to the method based on the 3σ rule.
Data Availability
All data produced are available online at https://github.com/MWhite-InstitutPasteur/SARSCoV2SeroDXphase2
https://github.com/MWhite-InstitutPasteur/SARSCoV2SeroDXphase2
Footnotes
tmdomingues{at}fc.ul.pt
mhnunes{at}fc.ul.pt
nunosep{at}gmail.com