Calculating variant penetrance using family history of disease and population data

Genetic penetrance is the probability of a phenotype manifesting given that one harbours a specific variant. For most Mendelian genes, penetrance is high, but not complete, and may be age-dependent. Accurate estimates of penetrance are important in many biomedical fields, including genetic counselling, disease research, and gene therapy. The main methods for its estimation are limited in situations where large family pedigrees are not available or where the disease is rare, late onset, or complex. With the advance of high-throughput technologies, population-scale genetic data are available for an increasing range of genetic diseases. Here we present a novel method for penetrance estimation in autosomal dominant phenotypes. It uses population-scale data regarding the distribution of a variant among unrelated people affected and unaffected by an associated phenotype and can be restricted to samples of affected people by considering family disease history. The approach avoids kinship-specific penetrance estimates and the ascertainment biases that can arise when sampling rare variants among control populations. We test the method on candidate variants and diseases, demonstrating that our estimates align with those derived using established methods. We have implemented the method in a public web server (https://adpenetrance.rosalind.kcl.ac.uk) and made it available as an open-source R library (https://github.com/ThomasPSpargo/adpenetrance).


Introduction
Penetrance is the probability of developing a specific trait given that a person harbours a certain genetic variant or set of variants. Some pathogenic variants are fully penetrant, and people harbouring them always develop the associated phenotype. For instance, a trinucleotide CAG repeat expansion within the HTT gene is fully penetrant for Huntington's Disease by 80 years of age among people harbouring an expansion variant larger than 41 repeated CAG units (1). For many variants, however, penetrance is incomplete, and those with risk variants can remain unaffected throughout their life. For example, the p.Gly2019Ser variant of the LRRK2 gene exhibits incomplete penetrance for Parkinson's Disease (PD), meaning that it elevates risk for PD but does not necessarily result in its manifestation (2).
In medical genetics, estimating the penetrance of a given variant or set of variants is important for the correct interpretation of genetic test results, something that will be increasingly valuable as genome sequencing becomes routine, both within and outside clinical practice. With the advance of precision medicine and gene therapy, being able to accurately estimate the penetrance of a large spectrum of human genetic variants is crucial (3)(4)(5)(6).
There are several existing methods for penetrance estimation. The first and most widely used is based on the statistical examination of how the variant segregates with the phenotype within pedigrees (7). However, the generalisability of estimates derived from specific families may be limited. Other approaches involve examination of the incidence of disease in a sample of unrelated people who harbour a variant (8,9). Without systematic sampling, these estimates can be affected by ascertainment bias. Where large pedigrees are not available, or if disease is rare or late onset, these techniques may not be possible (10).
Estimating penetrance for a variant of unknown significance identified, for example, as a result of genome sequencing-based screening can be particularly challenging. The problem is exemplified by the large number of reported SOD1 gene variants in amyotrophic lateral sclerosis (ALS): although SOD1 variants are cumulatively one of the most common causes of ALS, over 180 ALS-associated variants in the gene are reported to date (11,12). Family pedigrees suitable for establishing penetrance are available for only a minority of these.
We have developed a new method to calculate penetrance for variants with an autosomal dominant inheritance pattern using population level data from unrelated people who are and are not affected by the associated phenotype (case and control populations). It can be operated using variant information drawn only from affected populations, stratified according to family history between 'familial' and 'sporadic' disease presentations. This approach is based on our previously published model of disease which explains how variant penetrance and sibship size determine the presentation or absence of a disease for families in which the variant occurs (13).
The method is complementary to, and fills an important gap left by, existing techniques. Using population-scale data, it takes full advantage of the rapidly growing quantity of genetic data being generated for a wide range of human diseases and is therefore well placed to be a valuable tool in the precision medicine era. Moreover, the capacity to assess penetrance based on the distribution of a variant between samples of unrelated people drawn only from the affected population allows estimates unbiased by kinship-specific effects or ascertainment of unaffected population members.
We have tested the approach in four variant-disease case examples, drawing upon the most common and widely studied autosomal dominant risk variants for each disease: the p.Gly2019Ser variant of the LRRK2 (OMIM: 609007) gene for PD (2); variants in the BMPR2 gene (OMIM: 600799) for heritable pulmonary arterial hypertension (PAH) (14); and variants in the SOD1 (OMIM: 147450) and C9orf72 (OMIM: 614260) genes, for ALS (15, 16).
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Model
The disease model our method builds upon (13) makes the following assumptions: a rare dominant variant is necessary but not sufficient for disease to occur, therefore penetrance is not complete and people within a family who do not harbour the variant are not affected; all individuals harbouring the variant are ascertained; all variants are inherited from exactly one parent, thus there are no homozygous carriers or de novo variants.
The model calculates three probabilities for a nuclear family where one parent harbours a given variant: that no family members are affected, P(unaffected); that exactly one member is affected, P(sporadic); and that more than one member is affected, P(familial). These probabilities are determined by penetrance, f, and sibship size, N. In a family with N siblings:

\[ P(\text{unaffected}) = (1 - f)\left(1 - \tfrac{f}{2}\right)^{N} \tag{1} \]

where the parent is unaffected, and none of the sibs are affected (each being transmitted the high-risk variant with probability ½);

\[ P(\text{sporadic}) = f\left(1 - \tfrac{f}{2}\right)^{N} + (1 - f)\,N\,\tfrac{f}{2}\left(1 - \tfrac{f}{2}\right)^{N-1} \tag{2} \]

where either the parent is affected and no siblings are affected, or the parent is unaffected and exactly one of the sibs is affected; and

\[ P(\text{familial}) = 1 - P(\text{unaffected}) - P(\text{sporadic}) \tag{3} \]

which is the remaining probability, since the three states are mutually exclusive and exhaustive.

Application to penetrance calculation
Conversely, given the observed rates of the (unaffected, sporadic, familial) disease states in families where the variant occurs and the average sibship size for these families, we can estimate penetrance. We can also estimate penetrance based on the observed rates of families presenting as unaffected versus 'affected', a fourth disease state whereby P(affected) = P(familial) + P(sporadic). Observed disease state rates can be derived as a weighted proportion of estimates of heterozygous variant frequency given for people across a valid subset of the four defined states (see Table 1); the appropriate weighting factors will vary based on the disease states for which variant frequencies are given. Sibship size can be estimated for the sample either directly, based on the average sibship size among the described families, or indirectly, by designating an estimate representative of the sampled population (e.g. available within global databases).
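The three family-level probabilities can be illustrated with a minimal Python sketch. It assumes the forms that follow from the stated model: a variant-harbouring parent is affected with probability f, and each of N sibs is affected with probability f/2. The function name `family_state_probabilities` is hypothetical and is not part of the adpenetrance R library.

```python
def family_state_probabilities(f, n):
    """Probabilities that a nuclear family with one variant-harbouring parent
    presents as unaffected, sporadic, familial, or affected.

    f: penetrance in [0, 1]; n: sibship size (may be a non-integer average).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("penetrance f must lie in [0, 1]")
    if n < 0:
        raise ValueError("sibship size n must be non-negative")

    # Each sib inherits the variant with probability 1/2 and, if so,
    # is affected with probability f.
    sib_risk = f / 2.0

    # Parent unaffected and no sib affected.
    unaffected = (1.0 - f) * (1.0 - sib_risk) ** n

    # Parent affected with no sib affected, or parent unaffected with
    # exactly one sib affected.
    sporadic = (f * (1.0 - sib_risk) ** n
                + (1.0 - f) * n * sib_risk * (1.0 - sib_risk) ** (n - 1.0))

    # More than one family member affected: the remaining probability.
    familial = 1.0 - unaffected - sporadic

    return {
        "unaffected": unaffected,
        "sporadic": sporadic,
        "familial": familial,
        "affected": sporadic + familial,
    }


probs = family_state_probabilities(f=0.5, n=2)
print(probs)
```

For f = 0.5 and N = 2 the three states evaluate to 0.28125, 0.46875 and 0.25, which can be checked by enumerating the parent and sib outcomes by hand.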
Table 1. Valid disease state combinations and weighting factors used to estimate disease state rates associated with a given variant, as described in Figure 1.

Our method involves three operations and an optional further step for deriving error in the estimate. These processes are summarised as a flowchart in Figure 1 and outlined in detail in the supplementary methods. In this approach, we assume that: in variant frequency estimates, disease state classifications are assigned according to the status of the sampled person and first-degree relatives only; individual families are represented only once in variant frequency estimates; weighting factors and average sibship size represent absolute values; the value specified for sibship size is representative of sibship size across disease state groups; in families where the variant occurs, the associated trait can only manifest owing to that variant.

[Figure 1 flowchart. Legible step labels: construct lookup table for expected rates of X at f and N; optional step: estimate error in f by finding the nearest expected P(X) and estimating f at the error bounds.]

Figure 1. Flowchart summarising the key operations within this penetrance estimation approach. Operation 1: variant frequencies (M) and weighting factors (W) are defined for a valid subset of the familial (F), sporadic (S), unaffected (U), and affected (A) states to calculate rate of one of these states, arbitrarily labelled state X, among families in which the variant occurs drawn from those states for which data were provided. This is the observed rate of state X, P(X)_obs, and Table 1
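The lookup-table step named in the flowchart can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes that only the familial and sporadic states are supplied, so state X is "familial" and its expected rate is P(familial) / (P(familial) + P(sporadic)). It also assumes the model probabilities take the form implied by the Model section. The function names and the grid step of 0.0001 are hypothetical choices.

```python
def expected_familial_rate(f, n):
    """Expected share of familial presentations among affected families
    for penetrance f (0 < f <= 1) and sibship size n."""
    no_sib_affected = (1.0 - f / 2.0) ** n
    one_sib_affected = n * (f / 2.0) * (1.0 - f / 2.0) ** (n - 1.0)
    unaffected = (1.0 - f) * no_sib_affected
    sporadic = f * no_sib_affected + (1.0 - f) * one_sib_affected
    familial = 1.0 - unaffected - sporadic
    return familial / (familial + sporadic)


def build_lookup_table(n, step=0.0001):
    """Tabulate (f, expected rate of state X) over a grid of penetrance values.

    The grid starts at `step` rather than 0 because the rate is undefined
    when no family can be affected.
    """
    n_steps = int(round(1.0 / step))
    return [(i * step, expected_familial_rate(i * step, n))
            for i in range(1, n_steps + 1)]


def estimate_penetrance(observed_rate, n, step=0.0001):
    """Return the grid value of f whose expected rate is nearest the observed rate."""
    table = build_lookup_table(n, step)
    nearest_f, _ = min(table, key=lambda row: abs(row[1] - observed_rate))
    return nearest_f


# Illustrative inputs: an observed familial rate of 0.2188 at sibship size 1.543.
print(estimate_penetrance(observed_rate=0.2188, n=1.543))
```

A grid search of this kind avoids solving the model equations for f analytically. The same table can be reused to map lower and upper bounds on the observed rate to bounds on f.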

Tool access
We have made this method available as an R function (R version 3.6.1) and, leveraging the R Shiny package (version 1.4.0.2), also developed a publicly available web resource (https://adpenetrance.rosalind.kcl.ac.uk) that facilitates easy use of the method. The source code of the R library is available on GitHub (https://github.com/ThomasPSpargo/adpenetrance). The web tool is further described in the supplementary methods and Figure 2 presents an example of its usage.

Case examples
Input parameters for the presented case studies have been estimated using publicly available data. We estimated variant frequency in the familial, M_F, and sporadic, M_S, states in all cases and, in case 1, additionally ascertained this for the unaffected state, M_U, from control samples. In all cases, we derived the standard error of these values to allow for assessment of error in the penetrance estimate. Variant frequency estimates were weighted to calculate the observed rate of arbitrary state X, P(X)_obs, among variant harbouring families using the factors presented in Table 1. Accordingly, the frequencies of familial, P(F|A), and sporadic, P(S|A), disease among the affected population, A, were defined in all cases, based on the first-degree familiality rates of that trait; note that P(S|A) = 1 − P(F|A). The probability of a population member being affected, P(A), was defined for case 1 only, using estimates of lifetime risk for that trait.
Sibship size, N, was estimated in each case based on the Total Fertility Rates reported in the World Bank database (19) for the world region(s) which best represent the population from which variant frequency estimates were drawn.
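The weighting step can be illustrated for the combination in which only familial and sporadic variant frequencies are given. The sketch below assumes that the observed familial rate is the weighted proportion P(F|A)·M_F / (P(F|A)·M_F + P(S|A)·M_S). This reading is consistent with the text, but it is not transcribed from Table 1. The standard error uses a first-order (delta-method) approximation that treats the weights as fixed and the two frequencies as independent, which is likewise an assumption rather than the documented procedure. Function names are hypothetical.

```python
import math


def observed_familial_rate(m_f, m_s, p_f_given_a):
    """Observed rate of the familial state among variant-harbouring affected families."""
    weighted_f = m_f * p_f_given_a          # weight W_F = P(F|A)
    weighted_s = m_s * (1.0 - p_f_given_a)  # weight W_S = P(S|A) = 1 - P(F|A)
    return weighted_f / (weighted_f + weighted_s)


def observed_familial_rate_se(m_f, se_f, m_s, se_s, p_f_given_a):
    """Delta-method standard error of the observed familial rate.

    Assumes fixed weights and independent errors in M_F and M_S.
    """
    w_f = p_f_given_a
    w_s = 1.0 - p_f_given_a
    a = m_f * w_f
    b = m_s * w_s
    total = a + b
    d_rate_d_a = b / total ** 2    # partial derivative of a / (a + b) w.r.t. a
    d_rate_d_b = -a / total ** 2   # partial derivative of a / (a + b) w.r.t. b
    return math.sqrt((d_rate_d_a * w_f * se_f) ** 2 + (d_rate_d_b * w_s * se_s) ** 2)


# Inputs of the kind used in the case studies: M_F = 0.818 (SE 0.025),
# M_S = 0.170 (SE 0.011), first-degree familiality rate P(F|A) = 0.055.
rate = observed_familial_rate(0.818, 0.170, 0.055)
se = observed_familial_rate_se(0.818, 0.025, 0.170, 0.011, 0.055)
print(rate, rate - 1.96 * se, rate + 1.96 * se)
```

The resulting rate and its approximate 95% bounds are the quantities that would then be compared against expected rates from the disease model to obtain penetrance and its error bounds.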
An R script detailing the calculations made for each case study can be found within our GitHub repository.

Case 1: LRRK2 penetrance for PD
We estimated the penetrance of the p.Gly2019Ser variant of the LRRK2 gene for PD. This case was used to illustrate the flexibility of this method for application using data drawn from several combinations of the defined disease states. This intercontinental cohort largely describes people from European, North American and Asian countries but there is no single predominant region. We accordingly estimated that N = 1.646 by aggregating Total Fertility Rate estimates available in the World Bank database (19) across each of the reported population samples. For each population, this was weighted by its proportional contribution to the total sample; see Table S1 for further details.
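The aggregation of Total Fertility Rates described above amounts to a sample-size-weighted mean. A minimal sketch follows. The sample sizes and fertility rates shown are hypothetical placeholders, not the values in Table S1, and the function name is illustrative.

```python
def weighted_sibship_size(samples):
    """Sample-size-weighted mean of Total Fertility Rates.

    samples: iterable of (sample_size, total_fertility_rate) pairs,
    one per contributing population.
    """
    samples = list(samples)
    total_size = sum(size for size, _ in samples)
    if total_size <= 0:
        raise ValueError("total sample size must be positive")
    # Each population contributes in proportion to its share of the total sample.
    return sum(size * tfr for size, tfr in samples) / total_size


# Hypothetical populations: (number of people sampled, Total Fertility Rate).
example = [(600, 1.5), (300, 1.8), (100, 2.1)]
print(weighted_sibship_size(example))
```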

Case 2: BMPR2 penetrance for heritable PAH
We estimated the penetrance of variants in the BMPR2 gene for heritable PAH, a gene for which the low penetrance of pathogenic variants is well established (24). Input parameters were defined based only on people with idiopathic (sporadic) or heritable PAH diagnoses (14). This captures people with and without family disease history and excludes those with PAH manifestations associated with comorbidities or drug exposure.
We estimated P(F|A) and P(S|A) using the first-degree familiality rate of heritable PAH, about 0.055 of people affected by either idiopathic or familial PAH (24).
In this case, to minimise any study-specific bias, we applied data from two reports to build independent estimates for each of M_F and M_S.
The first dataset (14) presents a moderately large sample of people with familial and sporadic PAH. Of 247 people with familial PAH, 202 harboured BMPR2 variants (M_F = 0.818, standard error = 0.025), compared to 200 of 1174 in the sporadic state (M_S = 0.170, standard error = 0.011). It is possible that these data violate two assumptions of our approach. First, information on familial clustering was reportedly unavailable and so some families may be represented more than once in the familial state. Second, it is not specified whether disease familiality is defined only by the disease status of first-degree relatives.
The second dataset (25) overcomes a limitation of the first as each family is represented only once in variant counts. However, the sample is smaller in size. It is reported that 40 of 58 people with familial PAH (M_F = 0.690, standard error = 0.061) harboured BMPR2 variants, compared to 26 of 126 in the sporadic state (M_S = 0.206, standard error = 0.036). Variant counts are additionally reported separately for small genetic variations (point mutations and indels) and large genetic rearrangements in BMPR2, which allowed penetrance estimation stratified by variant type. It is not reported whether disease states are defined according to the status of first-degree relatives only.
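The variant frequencies and standard errors quoted for the two BMPR2 datasets can be reproduced from the reported counts. The sketch below assumes the standard error is the usual binomial standard error of a proportion, sqrt(p(1 − p)/n). The text does not state the formula used, although this one matches the quoted values to three decimal places. The function name is illustrative.

```python
import math


def variant_frequency(carriers, sample_size):
    """Return (proportion harbouring the variant, binomial standard error)."""
    if sample_size <= 0 or not 0 <= carriers <= sample_size:
        raise ValueError("need 0 <= carriers <= sample_size and sample_size > 0")
    p = carriers / sample_size
    se = math.sqrt(p * (1.0 - p) / sample_size)
    return p, se


# Counts reported for the two datasets: (carriers, people sampled).
counts = {
    "dataset 1 familial": (202, 247),
    "dataset 1 sporadic": (200, 1174),
    "dataset 2 familial": (40, 58),
    "dataset 2 sporadic": (26, 126),
}
for label, (carriers, n) in counts.items():
    p, se = variant_frequency(carriers, n)
    print(f"{label}: M = {p:.3f}, SE = {se:.3f}")
```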
The first cohort samples people from Asian, European, and North American populations; French, German and Italian cohorts comprise about 60% of the sample (14). The second cohort samples people exclusively from Western Europe (25). We therefore estimated that N = 1.543 in both instances, the Total Fertility Rate of the European Union in 2018 (Ref.

Cases 3 and 4: SOD1 and C9orf72 penetrance for ALS
We estimated the penetrance of variants in the SOD1 and C9orf72 genes for ALS. In SOD1, we examined the aggregated penetrance of various SOD1 variants harboured by people with ALS. For C9orf72, we examined the penetrance of a single pathogenic variant, a hexanucleotide GGGGCC repeat expansion. These penetrances have historically been difficult to establish without incurring kinship-specific biases, making them ideal candidates for our method.
The first-degree familiality rate of ALS, about 0.050, was applied to define P(F|A) and P(S|A) in these cases (26,27).
We drew upon the results of two recent meta-analyses to estimate M_F and M_S for SOD1 and C9orf72 (18,28). As variant frequencies differed between Asian and European ancestries, we modelled these, and therefore penetrance, separately for each group. We derived the standard errors of M_F and M_S using z-score conversion from the 95% confidence intervals (95% CIs) reported for each state.
Accordingly, we identified that, in Asian ALS populations:
In these datasets, the Asian ancestry cohorts were predominantly individuals from East Asia, with a small proportion from South Asia. The European ancestry cohorts primarily comprise people from European countries, with some from North America and Australasia. Accordingly, N was estimated for the Asian population samples as 1.823
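The z-score conversion from a reported 95% confidence interval to a standard error can be sketched as below. It assumes the common approach of dividing the interval width by twice the 97.5th percentile of the standard normal distribution. The exact expression used in this analysis is not recoverable from the text, so this form is an assumption. The interval bounds in the example are hypothetical.

```python
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Approximate a standard error from the bounds of a symmetric confidence interval."""
    if upper < lower:
        raise ValueError("upper bound must not be below lower bound")
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (upper - lower) / (2.0 * z)


# Hypothetical variant frequency with a reported 95% CI of 0.10 to 0.20.
print(se_from_ci(0.10, 0.20))
```

Meta-analytic intervals for proportions are often computed on a transformed scale and may be asymmetric around the point estimate, in which case a width-based conversion of this kind is only an approximation.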

Results
Here we summarise the input data and results of the case studies modelled (see Table 2).
In case 1, we estimated the penetrance of the p.G2019S variant of the LRRK2 gene for PD, taking estimates of M_F, M_S and M_U to allow estimation via four of the five possible disease state combinations presented in Table 1. The output estimates (see Table 2) were consistent across the modelled disease state combinations, with some discordance between estimates derived with and without the inclusion of the unaffected disease state. We expect this to reflect that the variant is rare in the unaffected (control) population and the variant frequency estimate may be affected by an ascertainment bias.
In case 2, we estimated the penetrance of variants in BMPR2 for PAH, drawing estimates of M_F and M_S from two distinct reports (see Table 2). For the first sample (14), we found penetrance of 0.395 (95% CI: 0.356, 0.433), compared to 0.303 (95% CI: 0.211, 0.390) for the second sample set (25), in which penetrance was comparable between the defined BMPR2 variant subtypes. The marginally higher penetrance estimate observed for the first dataset reflects differences observed in M_F and M_S between the cohorts and may be affected by unspecified family clustering within this sample set. It is not known for either dataset whether family history classifications were restricted to first-degree relatives only and so the estimates obtained may be slightly inflated. With the available data these possibilities cannot be explored further.
In cases 3 and 4, we estimated the penetrance of variants in SOD1 and C9orf72 for ALS, drawing estimates of M_F and M_S for each gene in Asian and European populations separately based on the findings of recent meta-analyses (see Table 2). We found the penetrance of SOD1 variants to be 0.749 (95% CI: 0.629, 0.864) in Asian and 0.660 (95% CI: 0.494, 0.812) in European populations, and the penetrance of the pathogenic C9orf72 hexanucleotide repeat expansion to be 0.282 (95% CI: 0.023, 0.514) in Asian and 0.449 (95% CI: 0.377, 0.518) in European populations. These estimates demonstrate consistency within genes across populations and indicate that the penetrance for ALS is greater in people harbouring SOD1 variants than in those harbouring the C9orf72 expansion. Table S2 presents additional penetrance estimates made for widely-described SOD1 variants: penetrance was estimated to be 0.917 for p.Ala5Val, 0.617 for p.Ile114Thr, and 0.0009 for p.Asp91Ala.

Table 2. Penetrance estimation for the present case studies. § Lifetime disease risk is only required as a weighting factor where the unaffected (control) population are represented within the data given (see Table 1); * Proportion sporadic is defined as 1 − proportion familial (P(S|A) = 1 − P(F|A)); † Estimated using Total Fertility Rates reported for the: populations sampled to calculate variant frequencies (see Table S1

Discussion
We have developed a novel approach to estimate the penetrance of genetic variants which confer risk for autosomal dominant traits. The method was tested via application to several variant-disease case studies.
Our penetrance estimates of the LRRK2 p.G2019S variant for PD closely matched those previously obtained when analysing data that is not liable to inflation owing to selection of familial cases (2). Such studies make lifetime penetrance estimates for this variant between 0.24 (95% CI: 0.135, 0.437) and 0.45 (CI not reported).
Our estimates for the penetrance of BMPR2 variants for PAH also aligned with existing estimates. Longitudinal analysis of disease trends among 53 families harbouring BMPR2 variants finds penetrance as 0.27 overall, 0.42 for women and 0.14 for men (29). Our slightly higher estimate could reflect that a broader definition of familiality was used in the assessed samples; however, this cannot be tested with the data available.
The estimates we generated in the SOD1 and C9orf72 case studies aligned with current understanding of the penetrance of variants in these genes for ALS.
For SOD1 variants, penetrance for ALS in a normal lifespan is reportedly incomplete and differs between individual variants (10, 30). The widely-described p.Ala5Val (formerly p.Ala4Val) variant has been recorded to have penetrance of 0.91 by age 70 (31). Among other variants, penetrance is less apparent and can be expected to be lower than this (10, 30). Of the best characterised variants, p.Ile114Thr approaches complete penetrance in some, but not all, pedigrees and p.Asp91Ala reaches polymorphic frequency in some populations, with ALS typically arising with an autosomal recessive pattern (10, 11, 31). We drew estimates which align with these observations when modelling penetrance of the heterozygous forms of these three variants individually (see Table S2), estimating it to be 0.917 for p.Ala5Val, 0.617 for p.Ile114Thr, and 0.0009 for p.Asp91Ala. These findings highlight the spectrum of penetrance across variants in SOD1. Our estimate for the p.Asp91Ala variant in particular is compatible with and supports the hypothesis that it is associated with ALS via a recessive or oligogenic inheritance pattern (32). The absence of p.Asp91Ala within the familial ALS database sampled further corroborates this finding. Accordingly, our penetrance estimates in Asian and European populations can be taken to suitably represent an aggregated penetrance of risk variants in SOD1 for ALS; some variation between populations can be expected, reflecting differences in the mixture of variants between them.
For C9orf72, we modelled the penetrance of its pathogenic hexanucleotide repeat expansion for ALS. Pleiotropy is a well-established characteristic of this variant, additionally conferring risk for frontotemporal dementia and, to a lesser degree, other neuropsychiatric conditions (33). In people who harbour the variant, age-dependent penetrance for ALS and frontotemporal dementia is about equal and has been reported as almost complete at around age 80 (34). This estimate is however liable to inflation from biased ascertainment of affected people, and unaffected people are observed to harbour this variant more often than would be expected if it were accurate (16,33,34). Adjusted for possible ascertainment bias, the penetrance for either ALS or frontotemporal dementia is tentatively reported as 0.90 by age 83. Accounting for lifetime risk of each phenotype and their respective [...] Table S3; 28, 35-38. It is therefore reasonable to predict that penetrance of this variant for ALS would be around 0.45, a value comparable to our findings.
The method we present has high validity. Criterion and face validity are shown across the penetrance estimates outlined in the present paper, aligning with those made using other techniques and current understanding of the assessed cases. Construct validity is also demonstrated. In the ALS case studies, we found disease risk to be greater for those harbouring a pathogenic SOD1 variant than for those with the C9orf72 repeat expansion. This aligns with the multi-step model of ALS, where harbouring SOD1 variants is associated with a 2-step disease process, converse to the 3-step process associated with the C9orf72 repeat expansion (39).
The data necessary to operate the present approach is distinct from that of prior penetrance estimation techniques which examine patterns of disease among affected people, allowing it to be assessed in unrelated populations rather than families. The estimates are therefore unaffected by kinship-specific modifiers and are instead applicable to the region from which data are drawn.
Where the analysis is confined to people affected by disease, across the familial and sporadic states, we circumvent the ascertainment biases affecting designs which examine the distribution of a variant between affected and unaffected populations (9). In instances where analysis includes data for unaffected samples (i.e. controls harbouring the variant) these would not be avoided; ascertainment of controls compared to cases has equivalent challenges irrespective of the penetrance estimation approach. However, as our method does not require this information if data of familial and sporadic cases are available, it does not majorly limit the approach.
Furthermore, limitations of ascertainment will diminish as large datasets of genetic and phenotypic information become increasingly available within public databases. Therefore, the usefulness of penetrance estimates generated through population data will grow as the size and scope of genetic data held in such datasets expands, facilitating accurate estimation of how disease manifestations are distributed within the population in relation to harboured genetic variation (9).
A limitation of this approach is the definition of familiality, which we have defined as the occurrence of the studied trait in a first-degree relative. In practice, familial disease may be defined using various criteria, for example considering the disease status of second- or third-degree relatives, or including related diseases that may share a genetic basis (27,40). For example, ovarian and breast cancer, or ALS and frontotemporal dementia each share a genetic basis, and it is reasonable to consider a family history of frontotemporal dementia when assessing familiality in a person with ALS. If the extended kinship is incorporated within familial disease state definitions, then the familial rate will trend upwards and inflate penetrance estimates. However, the use of a wider definition of being affected is acceptable, although it will yield penetrance estimates for the joint condition.
This method is suitable for calculating the point, rather than age-dependent, penetrance of a variant. It can be applied to derive penetrance for an individual variant or for an aggregated set of variants, with the latter indicating an averaged burden of variants meeting the given criteria. It can be applied to any form of germline genetic variation that is associated with a given trait via an autosomal dominant inheritance pattern.
In a scenario where penetrance can be estimated via multiple approaches, we recommend that researchers use each method available to them, given the complementary nature of these techniques. If the results of multiple approaches conflict, we suggest inspecting the suitability of the input data given for each method and prioritising the result obtained from the method that the data fit best.
In conclusion, our novel method for penetrance estimation fills an important gap in medical genetics because, making use of the available amounts of population-scale data, it enables the unbiased and valid calculation of penetrance in genetic disease instances that would be otherwise difficult or impossible using existing methods. It serves to expand the range of genetic diseases and variants for which high-quality penetrance estimates can be obtained, as we illustrate in the ALS case examples. Estimates drawn via this approach have clear clinical utility and will be useful for guiding the interpretation of genetic test results that reveal an individual to harbour a characterised risk variant. They have wider relevance to the population than those obtained by studying particular kinships and will be more interpretable for clinical professionals.
The tool code is available on GitHub (https://github.com/ThomasPSpargo/adpenetrance) and the method is available and free to use via a public webserver (https://adpenetrance.rosalind.kcl.ac.uk).
medRxiv preprint, this version posted March 24, 2021. doi: https://doi.org/10.1101/2021.03.16.21253691