Risk factors for long COVID: analyses of 10 longitudinal studies and electronic health records in the UK

The impact of long COVID is increasingly recognised, but risk factors are poorly characterised. We analysed questionnaire data on symptom duration from 10 longitudinal study (LS) samples and electronic healthcare records (EHR) to investigate sociodemographic and health risk factors associated with long COVID, as part of the UK National Core Study for Longitudinal Health and Wellbeing.
Methods: Analysis was conducted on 6,899 adults self-reporting COVID-19 from 45,096 participants of the UK LS, and on 3,327 cases assigned a long COVID code in primary care EHR out of 1,199,812 adults diagnosed with acute COVID-19. In LS, we derived two outcomes: symptoms lasting 4+ weeks and symptoms lasting 12+ weeks. Associations of potential risk factors (age, sex, ethnicity, socioeconomic factors, smoking, general and mental health, overweight/obesity, diabetes, hypertension, hypercholesterolaemia, and asthma) with these two outcomes were assessed using logistic regression, with meta-analyses of findings presented alongside equivalent results from EHR analyses.
Results: Functionally limiting long COVID for 12+ weeks affected between 1.2% (age 20) and 4.8% (age 63) of people reporting COVID-19 in LS. The proportion reporting symptoms overall for 12+ weeks ranged from 7.8% (mean age 28) to 17% (mean age 58) and for 4+ weeks from 4.2% (age 20) to 33.1% (age 56). Age was associated with a linear increase in long COVID between ages 20 and 70. Being female (LS: OR=1.49; 95%CI: 1.24-1.79; EHR: OR=1.51 [1.41-1.61]), poor pre-pandemic mental health (LS: OR=1.46 [1.17-1.83]; EHR: OR=1.57 [1.47-1.68]) and poor general health (LS: OR=1.62 [1.25-2.09]; EHR: OR=1.26 [1.18-1.35]) were associated with higher risk of long COVID. Individuals with asthma also had higher risk (LS: OR=1.32 [1.07-1.62]; EHR: OR=1.56 [1.46-1.67]), as did those categorised as overweight or obese (LS: OR=1.25 [1.01-1.55]; EHR: OR=1.31 [1.21-1.42]), though associations for symptoms lasting 12+ weeks were less pronounced. Non-white ethnic minority groups had lower 4+ week symptom risk (LS: OR=0.32 [0.22-0.47]), a finding consistent in EHR. Associations were not observed for other risk factors. Few participants in the studies had been admitted to hospital (0.8-5.2%).
Conclusions: Long COVID is clearly distributed differentially according to several sociodemographic and pre-existing health factors. Establishing which of these risk factors are causal and predisposing is necessary to further inform strategies for preventing and treating long COVID.


Introduction
SARS-CoV-2 infection can lead to sustained or recurrent multi-organ symptoms in some individuals. [1][2][3] Extended COVID-19 symptomatology over weeks to months has been defined by individuals as 'long COVID'. 4 More formally, the UK's National Institute for Health and Care Excellence defined acute COVID-19 (AC; lasting <4 weeks), ongoing symptomatic COVID-19 (OSC; 4-12 weeks), and post-COVID-19 syndrome (PCS; >12 weeks), with the latter two categories combined as 'long COVID'. 1 Estimates of long COVID prevalence range from 13.3% in highly selected, community-based survey respondents with test-confirmed COVID-19, to at least 71% among those hospitalised by the infection. [5][6][7] Given the scale of the pandemic, even a low proportion of individuals with long COVID will generate a major burden of lingering illness. 8 In order to target appropriate support and focus research on possible causal mechanisms, we first need to understand risk factors for the disease. Current understanding of frequency of, and risk factors for, long COVID remains poor, impeding mechanistic investigation for intervention development and constraining service planning. Obtaining accurate estimates of association and risk requires large generalisable samples with comprehensive measures of pre-morbid health. UK national primary care records, which cover >95% of the population, afford one such data source, but are limited to those who consult with symptoms and depend on diagnosis and recording of long COVID. Furthermore, risk factor and co-morbidity data are limited to those who consult and are tested. Population-based longitudinal studies (LS), established decades before the pandemic, overcome these limitations as they collect data from all participants agnostically, regardless of healthcare attendance, with detailed and, where possible, objective measures of pre-pandemic health. The limitation of LS is that they are relatively small and separately may yield imprecise estimates.
Combined analysis of primary care records and LS provides a powerful tool to compensate for their different limitations and biases.
This work aimed to satisfy the clinical and policy need to better understand factors reliably associated with OSC and PCS (long COVID). To do this, we identified individuals with these specifically predefined COVID-19 outcomes in: 1) a consortium of population-based LS which captured coordinated repeat questionnaire data on COVID-19 using harmonised measures from the Wellcome Trust's Covid-19 Questionnaire, and 2) the OpenSAFELY dataset of primary care records (https://www.opensafely.org/). Within these data sets, we examined the frequency of long COVID among individuals with suspected and test-confirmed COVID-19 and examined associations of sociodemographic and pre-pandemic health risk factors.
Questionnaires specifically reflected the categories used by the National Institute for Health and Care Excellence guidelines, asking respondents to self-report duration of symptoms with categories that could be designated as either AC, OSC or PCS.
Based on these categories, we defined two primary outcomes: i) durations lasting 4+ weeks (combining OSC and PCS), with individuals reporting symptoms lasting 0-4 weeks as reference, and ii) 12+ weeks of symptoms (PCS specifically), with reference being individuals with symptoms lasting 0-12 weeks. Full details of the questions and coding are available in Supplementary file 1. In addition, two studies derived an alternative estimate of long COVID based on whether symptoms were present for more than 4 or 12 weeks in total over at least six months (BiB, TwinsUK). BiB study members who self-reported COVID-19 were asked to report whether any particular symptoms (27 in total) were present during March-September 2020. Three of these symptoms (runny nose, sneezing and blocked nose) were considered non-specific for COVID-19 and were removed. Data were used to derive the symptom length categories above by summing included symptoms present for 0-4 weeks, 4-12 weeks or 12+ weeks. In TwinsUK, all study members were asked to report whether they had experienced particular symptoms (33 in total) between February and November 2020. Five symptoms (runny nose, sneezing, blocked nose, shaking or difficulty while walking, and phlegm production/chesty cough) were considered non-specific for COVID-19 and were removed. Similar to BiB, data were used to derive the symptom length categories above by summing whether any of the included symptoms were present for 0-4 weeks, 4-12 weeks or 12+ weeks, at any point in time over the specified period. This was performed for people who had had COVID-19 and those who had not (confirmed by negative antibody testing).
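The coding of the two primary outcomes can be illustrated with a minimal sketch. It assumes each respondent's reported duration has already been assigned to one of the three categories AC, OSC or PCS; the function name and the dictionary keys are hypothetical and are not taken from the study code.

```python
from typing import Dict

# Duration categories described in the text:
# AC = under 4 weeks, OSC = 4-12 weeks, PCS = over 12 weeks.
CATEGORIES = ("AC", "OSC", "PCS")


def derive_outcomes(category: str) -> Dict[str, int]:
    """Map a self-reported duration category to the two binary outcomes."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown duration category: {category!r}")
    return {
        # Outcome i: 4+ weeks (OSC or PCS) versus 0-4 weeks (AC).
        "symptoms_4plus_weeks": int(category in ("OSC", "PCS")),
        # Outcome ii: 12+ weeks (PCS) versus 0-12 weeks (AC or OSC).
        "symptoms_12plus_weeks": int(category == "PCS"),
    }
```

Note that a respondent in the OSC category counts as a case for the first outcome but falls in the reference group for the second.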

Exposures
Pre-pandemic risk factors were restricted to measures more than 6 months but less than 5 years prior to 23rd March 2020 wherever possible. Full details of the questions and coding are available in Supplementary file 1.

Sociodemographic factors.
These included: age, sex (female/male), ethnicity (white, non-white ethnic minority; in studies where possible), and socioeconomic position measured by highest education level (degree, no degree), Index of Multiple Deprivation (IMD, a widely used geographically based measure of relative deprivation based on factors such as income, employment and education), and occupational class of own current/recent job (or parental occupational class for younger cohorts; four categories: managerial/professional; intermediate; routine; or not working/not available).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 25, 2021. ;

Mental health.
Mental health was captured in the most recent pre-pandemic survey using validated continuous scales of psychological distress that assessed symptoms of common mental health difficulties such as anxiety and depression (e.g., Hospital Anxiety and Depression scale, TwinsUK; Short Mood and Feelings Questionnaire, ALSPAC-G1; Edinburgh Postnatal Depression Scale, ALSPAC-G0; General Health Questionnaire-12, USOC). For analyses, each scale was transformed into standard deviation units (z-scores) within each study, and a dichotomous variable was derived using established cut-offs for each measure (see Supplementary file 1 for details).
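As a rough illustration of the two derived mental health variables, the sketch below standardises raw scale scores within one study and flags scores at or above a cut-off. The scores, the cut-off of 6, the use of the population standard deviation and the "at or above" rule are all assumptions made for the example; the actual cut-offs are scale-specific and are given in Supplementary file 1.

```python
from statistics import mean, pstdev
from typing import List


def to_z_scores(scores: List[float]) -> List[float]:
    """Standardise raw scale scores within one study (mean 0, SD 1)."""
    mu = mean(scores)
    sd = pstdev(scores)  # population SD; a sample SD would be an equally valid choice
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sd for s in scores]


def high_distress(scores: List[float], cutoff: float) -> List[int]:
    """Flag raw scores at or above a scale-specific cut-off (1 = high distress)."""
    return [int(s >= cutoff) for s in scores]


# Hypothetical raw scores for five participants in a single study.
raw = [2, 4, 4, 6, 9]
z = to_z_scores(raw)
flags = high_distress(raw, cutoff=6)  # illustrative cut-off only
```

Standardising within each study puts different instruments on a common scale before estimates are pooled, while the dichotomous variable keeps each instrument's own threshold.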

Self-rated health.
All study participants had been asked about their general health prior to the pandemic using 5 categories (1 = Excellent; 2 = Very good; 3 = Good; 4 = Fair; 5 = Poor). A binary variable was derived for self-reported health, grouping those with excellent-good health (categories 1-3) and those with fair-poor health (categories 4-5).

Health conditions.
Body mass index (BMI = weight [kg] / (height [m])²) of study participants was obtained prior to the pandemic. For analyses, a binary weight variable was derived, grouping those who had a BMI of 0-24.9 (underweight/normal weight) and those who had a BMI of 25 or more (overweight/obese). Pre-pandemic asthma, diabetes, hypertension, and high cholesterol status was captured through self-report.
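The BMI definition and the binary weight variable can be written out as a short sketch. The function names are hypothetical, and treating every value below 25 as the reference group is an assumption about how values between 24.9 and 25 were handled.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def overweight_or_obese(bmi_value: float) -> int:
    """Binary weight variable: 1 if BMI is 25 or more, else 0."""
    return int(bmi_value >= 25)
```

For example, a participant weighing 70 kg with a height of 1.75 m has a BMI of about 22.9 and is coded 0.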

Statistical analysis
Main analyses were conducted in studies with a direct self-reported measure of COVID-19 symptom length. Using this measure, two separate binary variables were created. The first grouped those who had symptoms from 0-4 weeks (reference group) and those who had symptoms for 4+ weeks (long COVID). The second grouped those who had had symptoms from 0-12 weeks (reference group) and those who had symptoms for 12+ weeks (post-COVID-19 syndrome). The association between each sociodemographic or pre-pandemic health risk factor and each long COVID outcome was assessed in separate multivariable logistic regression models within each study. We adjusted for a minimal set of confounders across all studies, where relevant: age (as a continuous variable), sex, and ethnicity. Odds ratios (ORs) and 95% confidence intervals (CIs) were the main measure of association.
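To make the main measure of association concrete, the sketch below computes an unadjusted odds ratio and a 95% Wald confidence interval from a 2x2 table. This is a simplification: it matches a logistic regression with a single binary predictor, whereas the models described above also adjusted for age, sex and ethnicity. The counts are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Odds ratio and 95% Wald CI from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    z = 1.959964  # standard normal quantile for a two-sided 95% interval
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts: 60 of 200 exposed and 45 of 200 unexposed report 4+ weeks of symptoms.
or_, lo, hi = odds_ratio_wald(60, 140, 45, 155)
```

The interval is symmetric on the log scale, which is why reported confidence limits for odds ratios are not symmetric around the point estimate.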
We modelled the relationship of age with long COVID risk in two ways, given that there were diverse age structures between studies. First, in age-heterogeneous samples, we analysed long COVID within age categories relative to pre-defined baseline groups, given an a priori rationale that association between age and long COVID may not be linear. Categories within each study are shown in Supplementary figures 1 and 2. Second, in a subset of LS birth cohorts with participants of near-identical ages and who were issued fully harmonised long COVID questionnaires (MCS, NS, BCS70 and NCDS), we analysed the trend in absolute risk of long COVID with increasing age between studies using meta-regression.

Data synthesis
To synthesise effect sizes across studies, fixed-effect meta-analyses with restricted maximum likelihood were carried out and repeated with random-effects modelling for comparison. We report heterogeneity using the I² statistic to examine the percentage of variability in effect estimates accounted for by heterogeneity rather than sampling error (0% indicates no variation between estimates across studies; values closer to 100% indicate greater heterogeneity).
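The pooling step and the I² statistic can be sketched as follows. This is a generic inverse-variance fixed-effect pooling of log odds ratios with I² computed from Cochran's Q; it is an illustration under those assumptions and not the software routine used in the study. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def se_from_ci(lower: float, upper: float) -> float:
    """Recover the SE of a log odds ratio from its 95% confidence limits."""
    return (math.log(upper) - math.log(lower)) / (2 * 1.959964)


def pool_fixed_effect(log_ors: List[float], ses: List[float]) -> Tuple[float, float, float]:
    """Return (pooled log OR, SE of the pooled log OR, I-squared in percent)."""
    weights = [1 / se ** 2 for se in ses]  # inverse-variance weights
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / total
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, math.sqrt(1 / total), i2
```

A pooled odds ratio is obtained by exponentiating the first returned value. I² is truncated at 0% when Q is smaller than its degrees of freedom, which is the usual convention.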

Attrition and design weights
Selective attrition in LS (compounded by conventional problems of self-selection) can affect the representativeness of retained samples and introduce bias in estimates of association. To address this, where possible, most studies were weighted to be representative of their target population. This attempted to account for survey design and differential non-response to the COVID-19 surveys.
Weights were not available for GS, BiB or TwinsUK.

Sensitivity analysis
To mitigate index response bias, 22 inverse probability weights (IPW) were derived for COVID-19 status. These were derived in each LS separately but following a common approach used previously. 14 Self-reported COVID-19 status was regressed on each exposure to assess whether COVID-19 was associated with each socio-demographic or pre-pandemic health risk factor. To determine what variables to include across LS, observed associations were meta-analysed to identify consistent predictors of COVID-19 self-report status (see Supplementary Information 2 for list of covariates used to derive IPWs). To avoid missingness on IPWs, covariates included in each model were imputed using multiple imputation by chained equations (MICE) and IPWs were derived across multiple imputed data sets. Derived weights were then applied in all analysis models as a sensitivity check.
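The idea behind the inverse probability weights can be shown with a deliberately simplified sketch. Instead of a regression model on imputed covariates, it estimates the probability of self-reporting COVID-19 within strata of a single categorical covariate and takes the reciprocal. The function name and the records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def stratum_ipw(records: List[Tuple[Hashable, int]]) -> Dict[Hashable, float]:
    """Inverse probability of reporting COVID-19 within each covariate stratum.

    Each record is (stratum, reported_covid), with reported_covid in {0, 1}.
    Strata with no reported cases are omitted because no weight is defined.
    """
    n = defaultdict(int)
    cases = defaultdict(int)
    for stratum, reported in records:
        n[stratum] += 1
        cases[stratum] += reported
    return {s: n[s] / cases[s] for s in n if cases[s] > 0}


# Hypothetical sample: 20% of women and 10% of men self-report COVID-19.
records = (
    [("female", 1)] * 20 + [("female", 0)] * 80
    + [("male", 1)] * 10 + [("male", 0)] * 90
)
weights = stratum_ipw(records)
weighted_cases = sum(weights[s] for s, reported in records if reported)
```

In this example the weighted cases sum to the size of the full sample, so analyses restricted to cases are rebalanced towards the covariate distribution of everyone surveyed.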
For studies in which we were able to verify SARS-CoV-2 infection through collected serology data in summer/autumn 2020 (TwinsUK and ALSPAC-G0 and -G1), analyses were replicated on a subsample of those who had positive polymerase chain reaction (PCR) obtained through linkage to testing data and/or lateral flow antibody testing (ALSPAC) and enzyme-linked immunosorbent assay (ELISA) (TwinsUK) 23 confirming exposure to COVID-19. All statistical analyses on the LS were performed in Stata version 16 or R (release 3.6.0 or later).

Data sources: EHR
Working on behalf of NHS England, we conducted a population-based cohort study to measure long COVID recording in electronic health record (EHR) data from primary care practices using TPP SystmOne software, linked to Secondary Uses Service (SUS) data (containing hospital records) through OpenSAFELY. OpenSAFELY is a data analysis platform developed during the COVID-19 pandemic, on behalf of NHS England, to allow near real-time analysis of pseudonymised primary care records at scale, operating within the EHR vendor's highly secure data environment. Details on Information Governance for the OpenSAFELY platform can be found in Supplementary Information 1.

Sample: EHR
From a population of all people alive and registered with a general practice on 1 December 2020, we selected all patients who had evidence of a COVID related code, either: testing positive for SARS-CoV-2, being hospitalised with an associated COVID diagnostic code, or having a recorded diagnostic code for COVID in primary care.

Outcome: EHR
The outcome was any record of long COVID in the primary care record, as a binary variable. This was defined using a list of 15 UK SNOMED codes, which are categorised as diagnostic (2 codes), referral (3) and assessment (10) codes. SNOMED is an international structured clinical coding system for use in electronic health records. The outcome was measured between the study start date (2020-02-01) and the end date (2021-05-09).

Sociodemographic variables
Demographic variables included age (in categories), sex, geographic region, IMD (divided into quintiles), and ethnicity. Details on coding have been reported elsewhere (see: Mathur et al. 2021). 24

BMI.
People were categorised as not obese or obese using their most recent BMI measurement, with those in the obese category further categorised into Obese I (BMI 30-34.9), Obese II (BMI 35-39.9), or Obese III (BMI 40+). Those with a missing BMI were assumed to be not obese.
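The obesity categories, including the rule for missing values, can be expressed as a small function. The function name is hypothetical, and assigning values that fall between the stated bands (for example 34.95) to the lower band is an assumption.

```python
from typing import Optional


def obesity_category(bmi: Optional[float]) -> str:
    """Classify the most recent BMI measurement; missing BMI is treated as not obese."""
    if bmi is None or bmi < 30:
        return "Not obese"
    if bmi < 35:
        return "Obese I"
    if bmi < 40:
        return "Obese II"
    return "Obese III"
```

Treating a missing BMI as not obese avoids dropping records, at the cost of possibly misclassifying some people with obesity into the reference group.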

Health conditions.
Comorbidities were identified from a code recorded six months to five years before March 2020 for one or more of: diabetes; cancer; haematological cancer; asthma; chronic respiratory disease; chronic cardiac disease; chronic liver disease; stroke or dementia; other neurological condition; organ transplant; dysplasia; rheumatoid arthritis, systemic lupus erythematosus or psoriasis; or other immunosuppressive conditions. Those with no relevant code for a comorbidity were assumed not to have that condition. Number of comorbidities was categorised into "0", "1", and "2 or more".
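The comorbidity count can be sketched as below, assuming each person's record has been reduced to a mapping from condition name to a flag indicating whether a relevant code was found. The function name and the condition keys are hypothetical; a missing flag is treated as absence of the condition, as described above.

```python
from typing import Dict, Optional


def comorbidity_category(flags: Dict[str, Optional[bool]]) -> str:
    """Categorise the number of coded comorbidities as "0", "1" or "2 or more"."""
    # None (no relevant code found) is counted as not having the condition.
    count = sum(1 for present in flags.values() if present)
    if count == 0:
        return "0"
    if count == 1:
        return "1"
    return "2 or more"
```

For instance, a record with codes for asthma and diabetes and nothing else falls in the "2 or more" category.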

Mental health.
Evidence of a pre-existing mental health condition was defined using a prior code for one of: psychosis; schizophrenia; bipolar disorder; or depression.

Statistical methods: EHR
The number of people with or without a long COVID code was recorded amongst the selected sample and stratified by each of the measures. The proportion of people with long COVID codes was calculated overall and within each code category. The percentage of long COVID events across the different measure categories was also reported.
We conducted multivariable logistic regression to assess whether GP-recorded long COVID was associated with each sociodemographic or pre-pandemic health characteristic. We adjusted for the same set of confounders as used in the LS analyses: age (as a categorical variable), sex, and ethnicity. Odds ratios (ORs) and 95% confidence intervals (CIs) were again the main measure of association.
In further analyses of age as a risk factor for long COVID in the EHR data, we assigned individuals within 10-year categories an age at the midpoint of each group, then assessed the trend in long COVID frequency with age using linear and non-linear meta-regression.
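The midpoint assignment and the linear trend can be illustrated with a weighted least squares fit. This is a sketch only: the age bands, proportions and weights are hypothetical, the band midpoint is taken as the mean of the two band limits, and a full meta-regression would typically weight each group by the inverse of its variance.

```python
from typing import List, Sequence, Tuple


def band_midpoint(lower: int, upper: int) -> float:
    """Age assigned to everyone in a band, e.g. 24.5 for the band 20-29."""
    return (lower + upper) / 2


def weighted_linear_trend(
    x: Sequence[float], y: Sequence[float], w: Sequence[float]
) -> Tuple[float, float]:
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical 10-year bands, long COVID frequencies (%) and group weights.
bands = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
ages: List[float] = [band_midpoint(lo_age, hi_age) for lo_age, hi_age in bands]
proportions = [0.1, 0.2, 0.3, 0.4, 0.5]
group_weights = [100, 200, 300, 200, 100]
intercept, slope = weighted_linear_trend(ages, proportions, group_weights)
```

Multiplying the fitted slope by 10 gives the change in frequency per decade of age; a non-linear trend could be examined by adding a squared age term.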
All code for the OpenSAFELY platform for data management, analysis and secure code execution is shared for review and re-use under open licenses at https://github.com/opensafely. Codelists describing the definition of all the above conditions can be found at https://github.com/opensafely/long-covid-historical-health/tree/main/codelists. All code for data management and analysis for this paper is shared for scientific review and re-use under open licenses on GitHub at https://github.com/opensafely/long-covid-historical-health.

Within cases, the percentage of females ranged from 55% (NCDS) to 96% (BiB) and the mean age across studies ranged from 19.9 years (MCS) to 63.0 years (NCDS) (Table 1). Ethnicity differed across LS, with members identifying as 'White' ranging from 43.8% in BiB to 98.4% for ALSPAC G0. The percentage of those with a degree ranged from 7.5% (BiB) to 49.5% (TwinsUK). Within each LS, most participants managed their illness at home and were not admitted to hospital (range 0.8%-5.2%, see Table 1). Descriptives for the cases and base sample populations are provided in Supplementary Table 1.
In the age-heterogeneous LS, increasing trends in risk of symptoms lasting both 4+ weeks and 12+ weeks with higher age were observed across participants ranging from young adulthood to approximately 70 years (Supplemental Figures 1 and 2). In meta-regression analyses to assess absolute differences in long COVID frequency with age, a clear linear trend in reporting of symptoms for 4+ weeks was present across the four national cohorts (MCS, NS, BCS70 and NCDS), corresponding to a 3.02% (95% CI: 1.86, 4.17) higher proportion of individuals with COVID-19 reporting OSC or PCS per decade between 20 and 63 years (Figure 1, left panel). A more modest linear trend in the reporting of symptoms for 12+ weeks was observed, and with less precision due to a lower number of cases reporting PCS alone (0.68% per decade; 95% CI: -0.15, 1.51).
Pooled associations between other sociodemographic and health traits and each binary long COVID outcome (4+ vs 0-4 weeks [OSC and PCS combined] and 12+ vs 0-12 weeks [PCS specifically]) are presented as part of Figure 2, and in full detail in Supplementary Figures 3 to 6. This synthesised analysis included the 10 LS samples with a total of 6754 participants. Across LS, no strong evidence was found for associations of IMD with either outcome.
Having not attained a degree from higher education was associated with lower risk of PCS specifically (OR: 0.73; 95% CI: 0.57-0.94), but not with OSC and PCS in combination (OR: 0.95; 95% CI: 0.80-1.14).
When synthesising associations for health characteristics across LS, those with poor or fair pre-pandemic self-reported general health were found to have greater odds of having symptoms for both long COVID outcomes (4+ weeks: OR=1.62; 95%CI: 1.25-2.09; 12+ weeks: OR=1.66; 95%CI: 1.14-2.40). Greater pre-pandemic psychological distress was also associated with higher risk of both long COVID outcomes (4+ weeks: OR=1.45; 95%CI: 1.16-1.82; PCS: OR=1.58; 95%CI: 1.15-2.17). No strong evidence was observed for a linear association of BMI with either outcome. In models to examine the potential importance of a BMI threshold in relation to long COVID, overweight/obesity was associated with increased odds of symptoms lasting for 4+ weeks (OR=1.24; 95%CI: 1.01-1.53) but not with PCS specifically (OR=0.95; 95%CI: 0.70-1.28). Associations were not found for diabetes, hypertension, or high cholesterol with either outcome, although modest point estimates were on the side of higher long COVID risk in several instances (Supplementary figures 4 and 6). Asthma was the only specific medical condition associated with increased odds of having symptoms for 4+ weeks (OR=1.31; 95%CI: 1.06-1.62), although the association with PCS specifically was closer to the null (OR=1.13; 95%CI: 0.80-1.58).

Sensitivity analyses
When including IPWs for risk of COVID-19 status, all identified associations persisted and, in some instances, associations increased slightly in magnitude (Supplementary figures 7 to 10). Notably hypercholesterolaemia was associated with both long COVID outcomes in the LS meta-analyses weighted for probability of reporting COVID-19.

Electronic Health Records
Within 1,199,812 individuals with any acute COVID-19 code, 3327 individuals also had a recorded long COVID code, constituting 0.27% of COVID-19 cases.
An inverted U-shaped association of recording of long COVID with age was observed (Supplemental  (Table 3 and Figure 2). Individuals living in areas with the least deprivation had higher odds of having a long COVID code compared to those in the most deprived IMD quintile (Figure 2).
Again, as with the population-based studies, an increased risk was observed in individuals with a pre-pandemic diagnosis of asthma (OR=1.56; 95%CI: 1.46-1.67) and in those with overweight or obesity (OR=1.31; 95%CI: 1.21-1.42). No increase in risk was observed for diabetes.

Discussion
This research aimed to provide information useful to both clinical practice and policy given the lack of characterisation of factors reliably associated with OSC and PCS (together termed long COVID).
Using data from a consortium of population-based LS which captured coordinated repeated questionnaire data on COVID-19 and the OpenSAFELY resource, we examined the frequency of long COVID and associations with sociodemographic and pre-pandemic health risk factors.

Main findings
The frequency of those with apparent OSC specifically ranged from 3% to 18% and the frequency of those with PCS specifically ranged from 1% to 17% across studies in young adult LS and late midlife LS respectively. Using a stricter definition of symptoms affecting day-to-day function that was recorded by a subset of our LS questionnaires, proportions with both conditions were lower (OSC: 3.0% to 13.7%; PCS: 1.2% to 4.8% in young adults and those in late midlife respectively). In individuals both presenting to and diagnosed with COVID-19 by primary care practitioners, the proportion recorded with long COVID of any duration was substantially lower at 0.27%.
In both LS and EHR, long COVID reporting by any definition increased with age. Unlike risk of severe COVID-19, this appeared to be a linear (and not exponential) relationship across most adult age groups. Women were approximately 50% more likely to report long COVID than men, while those of non-white ethnicity were approximately a third to a quarter less likely than those of white ethnicity to report long COVID in EHR, and PCS specifically in LS. Greater socioeconomic advantage (measured from area of residence) was associated with a greater risk of long COVID in primary care data, but not in LS.
In both LS and EHR, pre-existing adverse mental health was associated with an approximate 50% increase in the odds of reporting long COVID, while estimates of the association with poorer general health ranged between 1.26 for EHR to 1.62 for LS. Asthma was the only specific prior health condition associated with greater odds of persistent symptoms; in LS by a third, and in primary care by a half.
Reports on the proportions of infected individuals going on to experience long COVID have varied. Current estimates from the Office for National Statistics (ONS) indicate that, by May 2021, 1.0 million individuals living in the UK (1.6%) self-reported long COVID (defined by symptoms persisting for more than four weeks after the first suspected COVID-19 infection that were not explained by something else). 25 In this report ONS ascertained long COVID estimates using a self-reported question similar to those used in our LS.
Many currently available studies, which report on selected, hospitalised or outpatient populations, find higher proportions with long COVID. We found only two population-based samples in the literature to date, which assessed the presence of
Despite these differences and the markedly lower risk of long COVID diagnosis in primary care versus LS, several risk factor associations were consistent between various LS and in EHR. Findings that long COVID was more common with each decade of age from age 20 to age 70, and was 50% higher in women than men, are consistent with reports from most 4,29-33 but not all previous studies. 34,35 There was an approximate linear increase in risk with age between 18 and 70 years.
Over the age of 70, we observed a sharp decline in risk in most LS and EHR. This decline in risk for older adults, which has been observed in other studies, 4,29,32 may be explained by selective competing risk of mortality, non-response bias, individuals misattributing long COVID to other illnesses, or a combination of these factors.
We observed a counterintuitive reduction in odds for long COVID for demographic factors which are commonly associated with increased morbidity, such as lower education (associated with lower risk of PCS only). This contrasts with a population-based study in the US 26 which found no strong evidence for associations with ethnicity or socioeconomic status and a Swedish study which found no socioeconomic gradient in long COVID. 34 While we found no strong evidence for a relationship with area-level socioeconomic status in LS, in primary care EHR there was an apparent gradient of higher risk in individuals from the least deprived areas. This likely reflects unmet need in those who live in socioeconomically deprived areas, given that both pre-existing adverse mental and physical health are associated with greater risks of long COVID, and that these conditions are likely more prevalent in those who are less advantaged. We found an apparent reduction in odds in minority ethnic groups for PCS specifically in LS and long COVID code reporting in EHR. Further research is needed to understand the reasons for this.
Both LS and EHR have rich pre-pandemic data on the health and disease of their participants, which most published studies of long COVID lack. 25,35 Therefore, we were able to disaggregate mental and physical health characteristics caused by the pandemic from those that were pre-existing. Our finding of a greater risk of long COVID related to adverse prior mental health has been reported elsewhere, 26 but pre-pandemic general health has not previously been highlighted. 29,31,36 These findings were robust in all analyses including inverse probability weighting for risk of COVID-19. The finding of an excess risk of long COVID in association with asthma across cohorts and primary care records resolves previous conflicting and limited findings, 26,29,35 and provides considerable support for focusing on asthma as a high-risk condition, for example by investigating whether immune processes seen in asthma play a role in the development of long COVID. In our analysis weighted for risk of COVID-19 onset, high cholesterol measures were associated with a greater risk of long COVID, which has not been reported previously, although only two LS contributed data for meta-analyses of this factor and further studies will be required to confirm or refute this finding. We found no association between diabetes or hypertension and long COVID. 26,31,35,37 Findings for overweight/obesity were suggestive of an increased risk, again resolving previous uncertainty. 29,35,36
The markedly lower reporting of long COVID in primary care compared to LS suggests only a minority of people with long COVID seek care and subsequently receive a code. Diagnostic codes for long COVID have only recently been instituted and uptake by primary care practitioners has not been uniform. 38 Additionally, the analyses here are based on practices that use TPP SystmOne software, noting that these practices had a 2- to 3-fold lower rate of long COVID recording than those that use EMIS software. 38

Strengths and limitations
This analysis brings together data from 10 longitudinal study samples and EHR, with rich information on pre-pandemic risk factors and COVID-19 symptom length. Although several recent surveys are available, their lack of pre-pandemic measures makes it difficult to assess the direction of effects of risk factors on outcomes. This study is strengthened by the coordinated investigation in multiple LS that are each susceptible to different sources of bias, with differing study designs, target populations, and selection and attrition processes. Moreover, the use of multiple studies increased statistical power to look at subpopulations, such as ethnic minority groups, and allowed for greater examination of the influence of age on long COVID. Our novel approach to harnessing multiple datasets allowed us to address research questions that could not otherwise have been addressed. Differences between studies in a range of factors (including measurement of risk factors, timing of surveys, design, response rates, and differential selection into the COVID-19 sweeps) are potentially responsible for heterogeneity in estimates. However, despite this heterogeneity, the key findings were consistent across most datasets. Unmeasured or residual confounding cannot be ruled out in either LS or EHR, and our analysis was not able to assess causation. We attempted to assess any index event bias using a systematic, structured approach across LS, which produced consistent results, but there remains the possibility that we have not fully accounted for this, due to the presence of unobserved factors and imperfect measurement of observed factors. Further, analysis of case series alone may yield bias as a result of generating an artificial sampling frame within which observed associations do not reflect whole population truths. 22 This may equally be the case for other published manuscripts that confine their samples to hospitalised/Emergency Department patients. 31,36,39 Our samples were population-based, and only a small number of individuals were admitted to hospital. We did not adjust for severity of initial disease, which others have reported as relevant. 4,26 However, the persistence of associations across studies of scale and with heterogeneous characteristics lends confidence to our findings. Lastly, it should be noted that the associations presented here are specific to case status as defined in our collections.

Implications
It has been possible to identify risk factors at the level of the population. Although causal inferences cannot be made at this stage, this provides evidence to support investigation, in particular of the role of sex differences, biological ageing, and immunity in the development of long COVID. Targeting services to those most in need may be warranted. Our data suggest that improved diagnosis within primary care is needed, both to facilitate research and to allow future interventions to be rolled out when effective support becomes available. Further research on prevention and treatment of long COVID is urgent and critical given the scale of the pandemic and the functional consequences of the condition. Individuals of older working age may particularly require support and, given the high levels of comorbidity in this group, will require holistic approaches that take account of potential multimorbidity. Trials should therefore ensure the inclusion of older people and of people with prior mental and physical health diagnoses. In this work we have demonstrated the benefits of cross-cohort collaborations and harmonised analyses, which have accelerated the return of robust, reproducible findings to the scientific community and the public. Future efforts to link EHRs to LS could yield even greater insights.
medRxiv preprint doi: https://doi.org/10.1101/2021.06.24.21259277; this version posted June 25, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

in survivors managed in Lagos State, Nigeria. BMC Infect Dis. 2021;21(1):1-7. doi:10.1186/s12879-020-05716-x

provide core support for ALSPAC. A comprehensive list of grants funding is available on the ALSPAC website (http://www.bristol.ac.uk/alspac/external/documents/grant-acknowledgements.pdf). We are extremely grateful to all the families who took part in this study, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool: http://www.bristol.ac.uk/alspac/researchers/our-data/. Ethical approval for the study was

Figures

Table 1. Descriptives of the ten LS analytic samples

Footnotes: Studies are ordered from youngest to oldest mean age within categories of method of long COVID ascertainment.
* Questionnaires in these four cohorts asked respondents to report the duration for which COVID-19 symptoms impeded normal function, rather than simply the duration of any symptoms (however mild) as in other studies. Hence the proportions reporting long COVID in them are expected to be lower than in other cohorts with similar characteristics.
** Based on a symptom-counting approach over months, rather than self-reported duration of symptoms as in all other cohorts, which yields higher proportions of individuals being assigned to long COVID categories.

Left: in four longitudinal studies where participants are of near-identical ages (the cohorts MCS, NS, BCS70 and NCDS), proportions reporting symptom length of four or more weeks in COVID-19 cases were ascertained from questionnaire responses. Right: in OpenSAFELY, proportions represent individuals within 10-year age categories (with estimates grouped at the midpoint of each category) who have long COVID codes in GP records, hence the proportions are substantially lower than in the corresponding cohort data. Trend lines and 95% confidence interval shading represent absolute differences in long COVID frequencies with increasing age, estimated by linear meta-regression of data from the four cohorts and from 18 to 70 year olds in OpenSAFELY (data from older individuals were not modelled; refer to results text for further explanation).
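The idea of a linear meta-regression of long COVID frequency on age can be illustrated with a short sketch. The Python below is a minimal illustration and not the authors' code: it assumes a fixed-effect, inverse-variance weighted least-squares fit of the proportion with long COVID on mean age, and the function name `weighted_linear_trend` and all input values are hypothetical.

```python
import math


def weighted_linear_trend(ages, proportions, standard_errors):
    """Fit proportion = intercept + slope * age by inverse-variance weighting.

    Returns (slope, intercept, slope_se). The slope is the absolute change
    in the proportion per additional year of age. The standard error assumes
    the within-study variances are known (fixed-effect meta-regression).
    """
    if not (len(ages) == len(proportions) == len(standard_errors)):
        raise ValueError("inputs must have the same length")
    if len(ages) < 2:
        raise ValueError("at least two studies are needed to fit a trend")

    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)

    mean_age = sum(w * a for w, a in zip(weights, ages)) / total_weight
    mean_prop = sum(w * p for w, p in zip(weights, proportions)) / total_weight

    sxx = sum(w * (a - mean_age) ** 2 for w, a in zip(weights, ages))
    if sxx == 0:
        raise ValueError("ages must not all be identical")
    sxy = sum(
        w * (a - mean_age) * (p - mean_prop)
        for w, a, p in zip(weights, ages, proportions)
    )

    slope = sxy / sxx
    intercept = mean_prop - slope * mean_age
    slope_se = math.sqrt(1.0 / sxx)
    return slope, intercept, slope_se


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (not data from this study).
    example_ages = [20, 30, 50, 62]
    example_props = [0.05, 0.09, 0.16, 0.20]
    example_ses = [0.010, 0.012, 0.015, 0.014]
    slope, intercept, slope_se = weighted_linear_trend(
        example_ages, example_props, example_ses
    )
    lower, upper = slope - 1.96 * slope_se, slope + 1.96 * slope_se
    print(f"change per year of age: {slope:.4f} (95% CI {lower:.4f} to {upper:.4f})")
```

A random-effects meta-regression would add a between-study variance term to each weight; that extension is omitted here for brevity.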

All associations were adjusted for age and sex, except where redundant. In all instances where it was possible to derive results from both meta-analyses of longitudinal studies and analysis of EHRs, the corresponding results are plotted side-by-side for comparison. The outcome used for the longitudinal study fixed-effect meta-analysis estimates presented here was symptoms lasting for 4+ weeks, and the outcome in EHRs was any reporting of a long COVID read code in GP records (regardless of duration of symptoms). Full study-level results, heterogeneity statistics and random-effect estimates for the longitudinal study meta-analyses are presented in supplemental figures 3 and 4. The equivalent meta-analyses of longitudinal study data where symptom duration of 12+ weeks was instead used as the outcome are depicted in supplemental figures 5 and 6. 'Poor overall health' represents the self-rated health exposure in the LS meta-analysis, and comorbidities in OpenSAFELY. The category 'Overweight and obesity' represents combined BMI categories over 25 in the LS, and solely individuals with BMI 30-34.9 in OpenSAFELY.
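For readers unfamiliar with fixed-effect meta-analysis of odds ratios, the pooling step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' analysis code: it assumes standard inverse-variance weighting on the log odds ratio scale, with standard errors recovered from reported 95% confidence intervals. The function names `log_or_and_se` and `fixed_effect_pool` and the example inputs are hypothetical.

```python
import math


def log_or_and_se(odds_ratio, ci_lower, ci_upper, z=1.96):
    """Convert an odds ratio and its 95% CI to (log OR, standard error)."""
    log_or = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return log_or, se


def fixed_effect_pool(estimates, z=1.96):
    """Inverse-variance fixed-effect pooling of (log OR, SE) pairs.

    Returns a dict with the pooled OR, its 95% CI, Cochran's Q and I^2 (%).
    """
    if not estimates:
        raise ValueError("at least one study estimate is required")

    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q measures between-study heterogeneity; I^2 rescales it to 0-100%.
    q = sum(w * (b - pooled) ** 2 for w, (b, _) in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "or": math.exp(pooled),
        "ci_lower": math.exp(pooled - z * pooled_se),
        "ci_upper": math.exp(pooled + z * pooled_se),
        "q": q,
        "i_squared": i_squared,
    }


if __name__ == "__main__":
    # Hypothetical study-level odds ratios (OR, lower, upper), for illustration only.
    studies = [(1.40, 1.05, 1.87), (1.62, 1.10, 2.39), (1.35, 0.98, 1.86)]
    result = fixed_effect_pool([log_or_and_se(*s) for s in studies])
    print(
        f"pooled OR {result['or']:.2f} "
        f"({result['ci_lower']:.2f} to {result['ci_upper']:.2f}), "
        f"I^2 {result['i_squared']:.0f}%"
    )
```

Pooling on the log scale keeps the confidence interval symmetric around the estimate before back-transformation. A random-effect estimate would additionally inflate each study's variance by an estimated between-study variance.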