BinomiRare: A carriers-only test for association of rare genetic variants with a binary outcome for mixed models and any case-control proportion

Whole genome and exome sequencing studies have become increasingly available and are being used to identify rare genetic variants associated with health and disease outcomes. Investigators routinely use mixed models to account for genetic relatedness or other clustering variables (e.g. family or household) when testing genetic associations. However, no existing tests of the association of a rare variant association with a binary outcome in the presence of correlated data controls the Type 1 error where there are (1) few carriers of the rare allele, (2) a small proportion of cases relative to controls, and (3) covariates to adjust for. Here, we address all three issues in developing the carriers-only test framework for testing rare variant association with a binary trait. In this framework, we estimate outcome probabilities under the null hypothesis, and then use them, within the carriers, to test variant associations. We extend the BinomiRare test, which was previously proposed for independent observations, and develop the Conway-Maxwell-Poisson (CMP) test, and study their properties in simulations. We show that the BinomiRare test always controls the type 1 error, while the CMP test sometimes does not. We then use the BinomiRare test to test the association of rare genetic variants in target genes with small vessel disease stroke, short sleep, and venous thromboembolism, in whole-genome sequence data from the Trans-Omics for Precision Medicine program.


Introduction
Whole-genome and exome-sequencing studies are becoming increasingly available to public health researchers, for example, from NHLBI's Trans-Omics for Precision Medicine (TOPMed) program (1), NHGRI's Centers for Common Disease Genetics (CCDG), and the UK Biobank (2). As most variant in sequencing datasets are rare, researchers may be interested in using such datasets for detecting rare-variant associations, genome-wide or in a genomic region of interest.
They may also seek to confirm suggested associations from other studies or populations, or to assess pathogenicity in large population-based studies of rare variant alleles reported from small family-based studies. For example, Amininejad  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) (7) studied cognitive outcomes in carriers of rare copy number variants. These studies demonstrate that there is an interest in testing single rare genetic variant associations with a wide range of health outcomes, including binary outcomes such as disease or affection status.
Testing rare variant associations with binary traits is challenging. It was previously shown that likelihood-based tests such as the Wald, Score, and likelihood ratio tests poorly control Type 1 error when testing for rare variant associations with a binary trait (8; 9). The Score test performance depends on the case-control ratio, and for rare variants, even a small imbalance causes "inflation" (i.e. too many false positive results). A few approaches have been used previously to study rare variant associations in a set of unrelated individuals. Amininejad et al.
used a permutation approach to test for association of rare genetic variants with Crohn's Disease. Wright et al. used Fisher's exact test. While it is possible to adjust for covariates in the permutation approach and when using Fisher's exact test to some extent through stratification (10), they do not have the full flexibility of covariate adjustment of a generalized linear model; i.e. they still require the identification of distinct groups in which no additional adjustment is required. Further, permutation tests may also be computationally intensive if low p-values are desired, because the number of required permutations may be large, although there are ways to reduce this computational burden (10). Alternatively, Tuijnenburg et al. used a method called BeviMed (11), implementing a Bayesian model to estimate posterior disease probabilities. The BinomiRare test has also been proposed as a powerful method to test for rare variant associations that can account for covariates (9). The BinomiRare test uses standard methods to compute the disease probabilities in the entire dataset, under the null hypothesis of no association between a specific genetic variant and the binary outcome. Then, for each specific genetic variant, it uses . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint the estimated probabilities in the variant carriers to test the hypothesis that the disease probabilities under the null are the true outcome probabilities in the carriers. The null hypothesis is rejected if the number of carriers with the outcome is inconsistent with their outcome probabilities. However, the previously published version of this method assumed the sample contains only unrelated individuals. Currently, there is no single-variant test that is generally appropriate for testing rare variants when individuals are correlated, e.g., due to known or cryptic genetic relatedness. Notably, the saddlepoint approximation to compute p-values (henceforth SPA; (12)) was first developed to improve the calibration of the Score test when there is casecontrol imbalance, and then extended in the SAIGE framework for the settings where related individuals are used (13). However, it does not reliably control the Type I error rate when the number of carriers of the rare variants is very small (i.e. tens of individuals; (14)). Therefore, there is a need for a statistical test that is well-calibrated when the number of carriers is low, individuals are potentially related, and there is case-control imbalance.
The previously published version of BinomiRare test (14) is useful in the presence of casecontrol imbalance, allows for covariate adjustment, controls the Type I error rate for any number of carriers, and can also be used when combining heterogeneous studies and here, we expand its framework for testing rare variant associations when study individuals are correlated. We developed two tests: first, we extended the BinomiRare test to the mixed models setting by applying it on conditional probabilities computed with a mixed model, rather than on marginal probabilities. Second, we developed the Conway-Maxwell-Poisson (CMP) test, which follows the carriers-only framework by using estimated (conditional) disease probabilities like the BinomiRare. For a given rare variant it uses the estimated disease probabilities in the variant . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint carriers to fit the parameters of the CMP distribution, under the null. It then tests whether the observed number of carriers with the outcome is consistent with this distribution. We study these tests using synthetic simulations with varying outcome probabilities, variant allele frequencies, and strengths of correlation between individuals due to genetic relatedness, and also in realistic simulation studies, using real phenotypes and WGS-based variant call set from the TOPMed program. We finally apply the BinomiRare test to test rare variant associations in known disease causing genes for specific disorders: the NOTCH3 gene and small vessel disease (SVD)ischemic stroke; the DEC2 (also known as BHLHE41) gene and "short sleep", and the F5 gene and venous thromboembolism (VTE).

Statistical approach
Let ! be an indicator of the disease, or another binary outcome, of participant , with value 1 if the person is affected and 0 otherwise, where = 1, … , and the individuals may be correlated. Let ! be a × 1 vector of covariate values for the th participant, and ! be their count of minor alleles for a specified genetic variant. Under the logistic disease model for correlated data: 1, … , and % modelling the correlation structure corresponding to a particular source of correlation. While the methods proposed here can be applied for an arbitrary ≥ 1, we simplify . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 9, 2021. Step 2: Testing the association between a genetic variant and disease status.
Suppose that we obtained disease probability estimates ̂!, = 1, … , , under the null as described above. Denote * as the number of carriers of the rare variant ("carriers" henceforth), i.e. those with > 0, so that ∑ 1( ! $ !(# > 0) = * . Without loss of generality, assume that participants = 1, … , * are the carriers. Let + be the number of diseased carriers where $ ! is the true, unknown, vector of outcome probabilities among the carriers.
The p-value for testing the null hypothesis of no variant-disease association is given by: This is a two-tailed p-value, because + can appear to be lower or higher than expected. When only a single person carries the rare variant, i.e. * = 1, the calculation is trivial; Equation (1) reduces to the single carrier's fitted probability ̂! if they are a case, and 1 −̂! if they are a . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; control. When * > 1, there are two special cases that are already developed. If ̂! for all carriers are equal, and outcomes for all carriers are independent, then + ∼ ( ,!), and the pvalue is the tail area (possibly two tails) of the standard Binomial distribution, i.e. a Binomial exact test. If the ! for the carriers differ but independence still holds, the distribution is the Poisson-Binomial distribution, and the test is the previously-proposed BinomiRare test for independent data (9). In the general case, an arbitrary sum of binomial variables, possibly correlated, has the Conway-Maxwell-Poisson (CMP)-Binomial distribution, which can be approximated by the CMP distribution (18; 19) when the number of carriers is "large enough" (see appendix).
In addition to the p-value above, we also study the mid-p-value, which was previously shown to improve properties of discrete tests (20) and to be less conservative. The mid-p-value is always smaller than the p-value, because when summing the tail areas probabilities, it accounts for only half of the probability of the observed event + , whereas the p-value uses it as it is, without dividing in half.

BinomiRare and CMP tests using conditional probabilities
In the appendix, we show that the distribution of + in the general case can be approximated by the CMP distribution and develop the CMP test. However, because approximations may not work well in practice for low carrier count * , we also attempt a different approach. Note that for two individuals and , we have that ! andwere independent if the true conditional disease probabilities were known. In other words, given conditional disease probabilities, knowing the disease status of individual does not inform of the disease status of individual . Therefore, we . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; consider using the BinomiRare test which was developed for independent data -with the conditional probabilities. We note that this independence may not hold when probabilities are estimated, and therefore it is not trivially true that the BinomiRare is appropriate in this setting.
Both the CMP and the BinomiRare tests for correlated data are available in the GENESIS R package for genetic association analysis (21).

Simulation study: testing rare variant associations using BinomiRare and CMP in a sample of trios
We carried out a simulation study to evaluate the performance of BinomiRare and CMP tests in samples of correlated individuals. In each simulation, we generated 3,000 individuals as 1,000 trios (two parents and one offspring), as follows. For 1,000 pairs of parents, and each of two chromosomal copies, we generated 20 independent "non-causal" genetic variants by first sampling minor allele frequencies (MAF) from a uniform [0.05, 0.5] distribution and setting MAF ∈ {0.05, 0.02, 0.01, 0.001} for one "causal" variant, followed by sampling of genetic variants using a Binary distribution based on these MAF. For each parent, allele count was the sum of the two sampled alleles. For each variant independently, an offspring inherited one allele from each of the parents. The parental allele was sampled at random with equal probabilities from the two alleles. We used the 21 (1 causal and 20 non-causal) simulated genotypes to generate a variable mimicking a principal component (PC), as a weighted sum of all allele counts, with weights sampled from a standard Normal distribution (0,1). Next, we simulated probability of disease using a mixed logistic model: . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint Here, exp ( , ) ∈ {0.01, 0.05, 0.5} is the probability of disease in non-carriers ( ! = 0) with genetic PC and ! equal to zero; .* models the association of the PC with disease probability,  (2) , log (3) , log(4)} when evaluating power. We then sampled disease status for each individual from a Binary distribution with the computed disease probability. Finally, we applied the BinomiRare and CMP tests and computed p-values and mid-p-values. We performed 1 10 / replicates to estimate Type 1 error rate and 1 10 0 replicates to estimate power. We estimated Type 1 error rate and power for pvalue threshold for declaring significance {1 10 1& , 1 10 12 , 1 10 13 }. For tests that did not, empirically, control the Type I error rate for a given p-value threshold (i.e. the proportion of simulations passing the threshold was higher than the threshold), we computed a "calibrated threshold", defined as a value for which the proportion of simulations with p-value less than this value was the desired threshold. We then used this calibrated threshold to estimate power, specifically power at an "honest alpha". Our main results are those focused in simulations in which the variance component had non-zero estimate, but we analyze all simulations.

The TOPMed whole genome sequencing Study
WGS was performed via TOPMed and the NHGRI's Centers for Common Disease Genetics (CCDG) programs. WGS was performed using DNA from blood at multiple sequencing centers . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint using Illumina X10 technology at an average sequencing depth of >30X. Studies and samples were sequenced in multiple phases. Periodically, the TOPMed Informatics Research Center (IRC) performed variant calling on the combined TOPMed and CCDG samples, resulting in multiple releases of data "freezes". Details regarding sequencing methods and quality control are provided elsewhere (22) and in the TOPMed website (https://www.nhlbiwgs.org/data-sets).
We used three TOPMed multi-ethnic data sets: a data set of small vessel disease stroke (SVD stroke) in the Women Health Initiative (WHI), a study of short-sleep, and a study of venous thromboembolism (VTE), with the latter two comprised of individuals from multiple TOPMed cohorts. We performed data analysis to demonstrate the BinomiRare test. The approaches for data analysis were similar. GRMs were constructed based on the analytic datasets of each of the analyses, using all genetic variants with minor allele frequency≥0.001. Logistic mixed models under the null were fit and adjusted for age, sex, and self-reported race/ethnic group, and for short-sleep, also for parent study/cohort. SVD stroke and short sleep analyses used TOPMed freeze 5b release, while the VTE analysis used TOPMed freeze 8 genotype release. All participants provided written informed consent at their recruitment centers.

The TOPMed WHI stroke dataset
The Women's Health Initiative (WHI) is a long-term health study following postmenopausal women aged 50-79 years who were recruited from 1993 through 1998 from 40 clinical centers throughout the U.S. (23). In the present analysis, we focus on a subset of 5,358 WHI participants who were sequenced through TOPMed with data available via freeze 5b, and had SVD stroke case-control classification, according to the following methodology: stroke diagnosis requiring . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint and/or occurring during hospitalization was based on the rapid onset of a neurological deficit attributable to an obstruction or rupture of an arterial vessel system. Hospitalized incident stroke events were identified by semiannual questionnaires and adjudicated following medical record review, which occurred both locally (at individual study sites) and centrally. Ischemic strokes were further classified by the central neurologist adjudicators into cardio-embolic stroke, larger artery stroke, and SVD stroke according to the Trial of Org 10172 Acute Stroke Trial (TOAST) criteria (24). The TOAST classification focuses on the presumed underlying stroke mechanism and requires detailed investigations (such as brain computed tomography, magnetic resonance imaging, angiography, carotid ultrasound, and echocardiography). Baseline stroke cases were excluded from the analysis and VTE cases were excluded from the control samples. Further, participants who had non-SVD stroke were excluded.

The TOPMed short sleep dataset
We used sleep duration data from multiple TOPMed cohorts, as described in the Supplementary Information detailing phenotype harmonization for short sleep analysis. Short sleep was defined as self-reported sleep duration during weekday, or usual sleep (if sleep duration during the weekdays was not available) being 5 hours or less. Otherwise, if self-reported sleep duration was 6 hours or longer and less than 9 hours, sleep was "normal". Individuals with self-reported sleep duration longer than 5 hours and shorter than 6 hours were excluded to minimize risk of misclassification. Because of a well-known "U-shaped" relationship between sleep duration and cardiovascular disease (25), suggesting that potential non-linearity in genetic associations may exist as well, we also excluded "long sleepers" reporting 9 or more hours of usual sleep, . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The TOPMed VTE dataset
The TOPMed VTE dataset includes TOPMed participants from six studies, combining prospective cohort and case-only studies. Individuals were matched across groups defined to be homogeneous with respect to race/ethnicity and sex, and strata defined by age at event (determined according to cases). The matching strategy resulted in a sample set mimicking a case-control study, with 11,627 individuals of which 3,793 are cases, and 7,834 are controls.

Association testing of rare coding variants within known disease causing genes
For each of the SVD stroke, short sleep, and VTE datasets, we considered a known gene associated with the disorder. For stroke, we focused on the NOTCH3 gene, in which mutations may cause Cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), which causes ischemic stroke (26). For short sleep, we focused on the gene DEC2 (also known as BHLHE41), a transcription inhibitor of orexin, a neuropeptide that regulates wakefulness (27; 28). For VTE, we focused on the coagulation factor V gene, F5 (29; 30), which has a known common variant highly associated with VTE, factor V Leiden (rs6025). We performed single variant analysis within the candidate genes, as follows.
We selected a subset of rare variants within the genes based on functional annotations, with the goal of increasing power by focusing on variants that are more likely to be functional compared to others. In detail, the filter based on functional annotation included the selection of variants that is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint Ensemble Variant Effect Predictor and have FATHMM-XF coding score > 0.5. The annotation based variant filtering was performed using the Annotation explorer application on the NHLBI's BioData Catalyst (36). We further filtered variants to those that passed the TOPMed QC filter (22), had at least 3 carriers of the rare allele, and had no more than 300 carriers. This upper threshold was defined because we were interested specifically in rare variants and because it was previously shown that properties of statistical tests of rare variant associations depend on carrier count, rather than on allele frequency (14). Finally, we further restricted the set of variants to those that had reasonable statistical power according to a power analysis performed as follows.
We arbitrarily assumed an odds ratio (OR) of 2 for a causal variant, and for each variant we

Simulations studies
We studied the performance of the tests in simulations of 1,000 trios. In the setting where  Table 1 provides estimated . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; Type 1 error rates in the simulations, restricted to those simulations in which the estimated variance component was ~) & > 0. For BinomiRare, we only provide results for the mid-p-value, because in our simulations it always controlled the Type 1 error rate, while the usual p-value controlled it as well while being more conservative. For CMP, we only provide results for the usual p-value, because it sometimes did not control the type 1 error and the lack of control was worse with the mid-p-value. The CMP test usually did not control the type 1 error when the variant was very rare (MAF =0.001), and when the case proportion was low (exp( , ) = 0.05).
Its performance improved as the MAF increased. In the Supplementary Information, Table S3 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; Data analysis: TOPMed data sets . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint For each of the three TOPMed datasets that we considered, Table 2 provides the sample sizes, gene of interest, and number of variants according to sequential filtering: the number of available (non-monomorphic) variants in the sample that passed the functional filters described in the Methods section, number of variants after applying quality filters, and after restricting to those with at least 3 carriers of the rare alleles and less than 300 carriers, and the number of variants with at least 50% power to reject the null hypothesis at the 0.05 level under the assumption of odds ratio =2. There were 3 such variants in the NOTCH3-SVD stroke analysis, 1 variant in the DEC2-short sleep analysis, and 4 variants in the F5-VTE analysis.  Table 3 provides the results from testing each of the variants passing this estimated power filter.
Of the three tested NOTCH3 variants, rs115582213 had p-value=0.03. For short sleep, only a single DC2 variant was tested, it had BinomiRare mid-p-value=0.03, suggesting association with short sleep. For the F5 gene and VTE, none of the four tested variants showed evidence of association. Table 3: Results from association analysis of rare genetic variants within monogenic disease genes of interest. Genetic variants presented are those that passed functional annotation and statistical power filters. For each variant we provide its BinomiRare p-value and mid-p-value, the number of carriers of the rare allele ! , the number of carriers with the outcome " , the estimated power computed while assuming effect size OR=2 and p-value threshold=0.05, pathogenicity interpretation from ClinVar, CADD score, and FATHMM-XF coding score.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021.

Discussion
We extended the BinomiRare test and proposed a CMP test for testing the association of a rare genetic variant with a binary outcome in the mixed model framework. These tests were specifically developed to handle variants with very low minor allele counts (tens of carriers), because it was previously shown that other tests that allow for covariates adjustment such as the naïve Score test and the SPA test do not always control the type 1 error in the very low count settings (14). Both carriers-only tests first estimate the outcome probabilities for each person in a . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint data set, while accounting for covariates and for genetic relatedness (and possibly other covariance matrices) via a mixed model, and then use the estimated conditional disease probabilities. For a single variant, the carriers of the rare alleles are identified, and based on their disease probabilities and the observed number of "cases", a p-value is computed, as the probability of observing the given number of cases or more extreme given the estimated outcome probabilities. The BinomiRare test with conditional probabilities performed well, while, surprisingly, the CMP test did not control the Type 1 error rate for settings with low carrier counts. This was likely because the approximations on which it relies are asymptotic in its noncentrality parameter , which is related to the number of carriers.
We demonstrated the application of the BinomiRare test using three TOPMed studies: of SVD stroke, short sleep, and VTE. Due to the low power for testing low-count variants, we filtered variants according to functional annotation, and according to computed statistical power. The limitation of this approach is that (1) The deleteriousness predicting annotations used and the filters applied to them may not have captured the true functional variant set; (2) the power analysis was based on an arbitrarily selected OR parameters. In this study, we chose OR=2 and only considered the handful of variants that had estimated power > 0.5 for testing while requiring p-value ( level)<0.05. We recognize that many rare variants have larger effect sizes. However, if we specified a larger OR parameter, and thus included more variants in our analysis, a more stringent level would be needed. Thus, the resulting list of variants to test may have been similar. More work is needed developing strategies for identifying single rare variant associations.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint For each of the phenotypes, SVD stroke, VTE, and short sleep, we searched for rare variants within genes with known trait associations. For SVD stroke, we considered NOTCH3, because some NOTCH3 variants have been in CADASIL patients, which poses a risk for stroke. Most For VTE, we considered the F5 gene. The F5 gene harbors the strongest known, relatively common, genetic risk factor for VTE, the rs6025 variant (38; 39). This motivated the search for rare variants in this gene. We did not identify any variant associated with VTE at the p-value<0.05 level. We did not consider rs6025 as part of our testing strategy because it was common with MAF=0.04, and had 839 carriers of the rare allele, a setting in which other tests such as the SPA should be able to control the type 1 error well and also be more powerful. Still, as a positive control we tested its association with VTE using BinomiRare, and the p-value was 1.5x10 -14 .
Short sleep has been consistently associated with cardiovascular and cardiometabolic disease (40; 41). Genetic determinant of short sleep may help elucidate this connection (42). We considered the DEC2/BHLHE4 gene, which a mutation with a know familial aggregation associated with short sleep. Our filtering strategy resulted in a single variant considered for . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint testing: rs121912617, the known short sleep mutation (27). In our data, it was associated with short sleep with BinomiRare mid-p-value=0.03. Rs121912617 is substantially more common (yet is still rare) in African Americans compared to European Americans (0.01 MAF in African Americans from the TOPMed short sleep datasets, compared to MAF < 0.001 in European Americans from the same dataset), allowing for observing this association in a population-based, rather than a family-based, study.
Here, we demonstrated the BinomiRare test for testing single-variant associations in data with known or cryptic relatedness. It can also be used to test sets of rare variants, by defining a carrier as an individual with at least one rare allele in the variant set. It is a topic of future research to extend this framework to use the counts of the rare variant allele and increase power. According to this proposition, an arbitrary sum of binary variables is distributed as a sum of exchangeable binary variables, where the exchangeable variables are such that there is a unique combination of probability parameter = Pr( # = 1) = ⋯ = Pr( 4 = 1) and a parameter modeling the dependency between each pair ! , -, ≠ . Therefore, two parameters suffice to characterize the distribution of an arbitrary sum of binary variables. Specifically, for a given set of carriers of a rare genetic variant, the sum of their disease statuses is distributed like a unique sum of exchangeable binary variables. Based on the estimated disease probabilities, we estimate the two parameters (different than the probability and dependency parameters and above) of the CMP distribution to obtain an estimated probability function in a variation of a method-of-moment approach that is based on estimated probabilities, rather than on the observed data. Daly and Gaunt (43) provided an approximation to the CMP distribution: as → ∞.
Assuming that 1#/5 is small (which, as we shall see, is true when is very large, because tends to be well bounded), we get that, approximately: . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 9, 2021.  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint we estimate probabilities for each potential number of disease carriers, in the following process: 1. Obtain individual disease probability estimates # U, … , $ ! U via standard approaches (e.g. logistic mixed model).

Compute estimates [ ] '
and [ ] ' in the analytic approach, or compute [ ] ' and ' in the sampling approach.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint Figure 1: Step 1 of testing genetic association using the carriers-only tests framework. A "null model" of association between the binary outcomes and covariates of interest is fitted, accounting for genetic relationship. Then, estimated conditional outcome probabilities are extracted to be used in the testing step.

Figures
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint

Figure 2:
Step 2 of testing genetic associations using the carriers-only tests framework. Based on estimated outcome probabilities, variants are inspected one at a time. For a given variant, carriers of the rare allele are identified, and a test of the null hypothesis # : $ ! = $ ! ' is performed testing whether " is consistent with the outcome probabilities within the carriers, based on the null model.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249450 doi: medRxiv preprint