Psychological distress across adulthood: test-equating in three British birth cohorts

Background: Life-course and cross-cohort investigations of psychological distress are limited by differences in the measures used across time, within and between cohorts.
Aims: We aimed to examine the adulthood distribution of symptoms and cross-cohort trends by test-equating mental health measures administered in the 1946, 1958 and 1970 British birth cohorts.
Methods: We used data from the above three birth cohorts (N=32,242) and an independently recruited calibration sample (n=5,800) in which all measures of psychological distress that were used in at least one sweep of the cohorts were administered. We used two approaches to test-equating (equipercentile linking and multiple imputation) and two index measures (General Health Questionnaire [GHQ]-12 and Malaise-9). We presented and compared means and prevalence of mental distress across adulthood in each cohort.
Results: While broad patterns in the shape of mental distress across adulthood were similar (inverse-U shape) for all methods used, both test-equating method and index measure resulted in slightly different estimates, most notably for cross-cohort comparisons. Cross-cohort comparisons using GHQ-12 suggested that psychological distress is higher in younger cohorts, whereas using Malaise-9 there were inconsistent differences between cohorts. Sensitivity analysis (using instances where both measures were simultaneously available in the cohorts) indicated that multiple imputation led to more accurate estimates compared to equipercentile linking.
Conclusion: When estimating life course trajectories of psychological distress we observe an inverse-U shaped trajectory across adulthood. Differences in point estimates between measures and methods do not allow for clear conclusions about consistent trends between cohorts.


Introduction
Despite the fact that common mental disorders are a leading cause of disease burden 1 , with one in six adults in England meeting the threshold for a clinical diagnosis in 2014 2 , our ability to assess psychological distress reliably across time, person and place is limited. Limitations to comparability result from the lack of a 'gold standard' and the plethora of instruments used, as well as from differences in mode of administration and response options. Recently, studies have used Item Response Theory-based approaches 3,4 , equipercentile linking 5,6 , and multiple imputation 7 to investigate comparability of different mental health measures. Despite this recent interest in applying such test-equating methods to mental health measures, little is known about the effects of using different approaches on resulting estimates of distribution and above-threshold prevalence.

A life-course and cross-cohort perspective
The adulthood distribution of psychological distress is expected to follow an inverse-U shape with symptoms increasing from early adulthood to mid-life and then decreasing from mid-life to old age 8,9 , although there is limited evidence for other life-course distributions as well, including an increase in distress in the over-75s 10 . The limitations of relying on cross-sectional data across ages to determine the life-course shape of psychological distress are well recognised (conflating of age and cohort effects 8,11,12 ). One of the limitations to drawing stronger conclusions about the adulthood trajectory of symptoms in longitudinal data has been the use of different measures across time in population-based cohort studies (for instance, in the 1946 birth cohort the Present State Examination (PSE) is used at age 36, the Psychiatric Symptom Frequency (PSF) scale at age 43, and the GHQ-28 at ages 53, 63 and 69).
Similarly, cross-cohort comparisons are also limited by different cohorts using different measures.
Where identical measures have been used to compare cohorts there is some evidence that mid-life psychological distress is higher in more recent mid-20th century cohorts 13 , although other studies suggest recent cohorts have better mental health 11 or U-shaped cohort effects with psychological distress highest in the oldest and most recently born cohorts 14 .

The present study
The availability of three successive national cohort studies with mental health measures through adulthood (the 1946, 1958, 1970 [...]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted June 25, 2020. https://doi.org/10.1101/2020.06.24.20138958

The analysis sample for the three birth cohorts was everyone for whom mental health data was available for at least one survey sweep in adulthood (18+ years).

Main outcomes and measures
Our main outcome was psychological distress. We collected all measures of psychological distress that were ever used in at least one of the birth cohorts (GHQ-12 and -28 18 [...]
We used the GHQ-12 and Malaise-9 as our index measures for both test-equating methods. Using two index measures has the advantage of being able to assess robustness of the modelling procedure (each serves as the other's sensitivity analysis). We collected and harmonised data on age in years, sex and highest level of education (none, GCSE or equivalent, A-levels or equivalent, degree or higher) to inform our multiple imputation models.

Statistical analysis
We first computed descriptive statistics for all mental health measures collected in the calibration sample, and estimated correlations between these measures. For our equipercentile linking approach to test-equating, we then cross-tabulated percentile rankings on the GHQ-12 or Malaise-9 with percentile rankings on the remaining measures to determine equivalent scores 5,23 . Based on this, we first identified a threshold score on the remaining measures most closely corresponding to that of the Malaise-9 (≥4) and the GHQ-12 (≥12). We applied this calibrated score back to the existing measures in the birth cohorts at each sweep to estimate the prevalence of mental distress. Secondly, using our equipercentile ranking, we converted scores on other measures in all sweeps of the birth cohorts to GHQ-12 or Malaise-9 and estimated means, variance and above-threshold prevalence of mental distress. Details on how we meet the various assumptions associated with equipercentile linking 23 are found in the Supplemental Methods.
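As an illustration of the equipercentile linking step described above, the following Python sketch maps each score on one measure to the index-measure score with the same percentile rank in a calibration sample, and then reads off a calibrated cut-off. It is not the authors' Stata code: the function names, the use of mid-percentile ranks, linear interpolation between ranks and rounding to whole scores are all assumptions made for illustration.

```python
import numpy as np


def percentile_ranks(scores, max_score):
    """Mid-percentile rank (0-100) of every integer score 0..max_score."""
    scores = np.asarray(scores, dtype=int)
    counts = np.bincount(scores, minlength=max_score + 1).astype(float)
    below = np.cumsum(counts) - counts  # respondents scoring strictly lower
    return 100.0 * (below + 0.5 * counts) / counts.sum()


def equipercentile_table(other_scores, index_scores, other_max, index_max):
    """Equated index-measure score for each possible score on the other measure.

    Both score vectors come from the same calibration sample. Interpolating
    between percentile ranks and rounding are illustrative choices only.
    """
    pr_other = percentile_ranks(other_scores, other_max)
    pr_index = percentile_ranks(index_scores, index_max)
    equated = np.interp(pr_other, pr_index, np.arange(index_max + 1))
    return np.rint(equated).astype(int)


def calibrated_cutoff(table, index_threshold):
    """Lowest score on the other measure whose equated score meets the index threshold."""
    hits = np.flatnonzero(table >= index_threshold)
    return int(hits[0]) if hits.size else None
```

Under these assumptions, `equipercentile_table` would be computed once per measure in the calibration sample, and either the resulting table (for converted scores) or `calibrated_cutoff` (for the calibrated threshold) would then be applied to the cohort data.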
Separately, we used a multiple imputation approach to test-equating. Multiple imputation is a more robust extension of linear transformation-based test-equating approaches 7,24 , able to take account of stochastic error and uncertainty around single-imputation transformed estimates. We coded data on covariates and all psychological distress measures identically across all four datasets, with psychological distress measures coded at the scale level per age group. We appended the calibration sample to each of the cohort samples, separately for GHQ-12 and Malaise-9. Each dataset thus consisted of two samples, and (at least) two measures per age group. In at least one of these samples (the calibration sample) both measures were complete and data from this were used to impute values into the cohort sample (Supplemental Table 4). We used multiple imputation by fully-conditional specification using chained equations 25 . Analyses were conducted post-imputation, combining estimates across 50 imputed data sets using Rubin's rules 26 . We estimated means, standard deviations and above-threshold prevalence of mental distress across adulthood in the cohorts.
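The pooling step can be sketched as follows. This is a minimal Python illustration of Rubin's rules for a single scalar estimate (for example a mean or a prevalence), not the authors' Stata workflow; the function name `pool_rubin` and its interface are hypothetical, and degrees-of-freedom corrections are omitted.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Combine one scalar estimate across m imputed data sets (Rubin's rules).

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns the pooled estimate and its pooled standard error.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2:
        raise ValueError("at least two imputed data sets are needed")
    pooled = q.mean()               # pooled point estimate
    within = u.mean()               # average within-imputation variance
    between = q.var(ddof=1)         # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, float(np.sqrt(total))
```

With 50 imputed data sets, `estimates` and `variances` would each hold 50 values per cohort sweep; the between-imputation term is what carries the extra uncertainty introduced by imputing the index measure.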

Sensitivity analysis
In both the 1958 (at age 42) and 1970 (at age 30) cohorts the GHQ-12 and Malaise-9 were jointly administered. We used this opportunity to compare the prevalence estimates resulting from the three methods described above with the original estimates, as an additional sensitivity analysis. For example, for the GHQ-12 calibration at age 30 in the 1970 cohort, we compared the prevalence resulting from the equipercentile linking, calibrated cut-off and multiple imputation approaches to the prevalence derived from the original GHQ-12 measure.
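The comparison described above amounts to checking how far each method's estimate falls from the prevalence observed on the original measure. A minimal Python sketch follows; the function names and the use of absolute difference in percentage points as the accuracy criterion are assumptions for illustration, not the authors' stated metric.

```python
def prevalence(scores, threshold):
    """Percentage of respondents scoring at or above the threshold."""
    scores = list(scores)
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)


def rank_methods(observed, estimated):
    """Order methods by absolute distance from the observed prevalence.

    observed:  prevalence (%) from the originally administered measure
    estimated: dict mapping a method name to its estimated prevalence (%)
    Returns (method, absolute difference) pairs, closest method first.
    """
    diffs = [(name, abs(value - observed)) for name, value in estimated.items()]
    return sorted(diffs, key=lambda pair: pair[1])
```

Here `observed` would come from applying `prevalence` to the original GHQ-12 or Malaise-9 scores at the sweep where both were administered, and `estimated` from the calibrated cut-off, equipercentile linking and multiple imputation approaches.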
Throughout this paper we present results for the GHQ-12 in the main manuscript, with results for the Malaise-9 in the Supplemental Materials.
Our analysis plan was pre-registered on the Open Science Framework and can be accessed here: https://osf.io/7uc4j. We used Stata 16 for all our analyses 27 .

Ethics and consent
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. All procedures involving human subjects were approved by the UCL Institute of Education Research Ethics Committee (REC1210). Informed consent was obtained from all participants.

Results
We recruited 5,800 participants into the calibration sample, distributed across five age groups (Supplemental Table 5). The analysis sample for the 1946 cohort consisted of 3,689 participants (68.7% of full cohort); for the 1958 cohort this was 14,814 (85.0% of full cohort); and for the 1970 cohort this was 13,739 (79.9% of full cohort). Missing data in the calibration sample was low (highest: 5.1% on GHQ-28, Supplemental Table 6). Means and standard deviations per questionnaire can be found in Supplemental Table 5. Correlations between measures varied between 0.68 (between Malaise-9 and GHQ-12) and 0.91 (between GHQ-12 and GHQ-28, Supplemental Table 7).

Equipercentile linking
Scores calibrated to the GHQ-12 are detailed in Table 2, and calibrated means and standard deviations across the life-course in each of the three birth cohorts are detailed in Table 3 and Figure 1A.
Distribution of above-threshold prevalence is detailed in Figure 2. For the calibrated cut-off scores, prevalence followed an inverse U-shape in the 1946 birth cohort, peaking at 35.9% at age 63, before declining to 27.3% at age 69. In the 1958 birth cohort, the shape was similar to the 1946 cohort and the peak was observed at age 42 (39.5%). Prevalence of psychological distress was relatively stable across the 1970 birth cohort, peaking at age 26 at 46.5%. Using the equipercentile linking method (where total scores were calibrated before the cut-off was applied), prevalence of psychological distress peaked in the 1946 birth cohort at age 63 at 35.9%, before declining to 27.3% at age 69. In the 1958 birth cohort, prevalence peaked at age 42 at 23.1%, and in the 1970 birth cohort the peak at age 42 was 28.7%.

Multiple Imputation
Means and standard deviations of psychological distress across the life-course based on multiple imputation are detailed in Table 2 and Figure 1A. In the 1946 cohort, mean scores peaked at age 43 [...]
Corresponding above-threshold prevalence estimates of psychological distress are detailed in Figure 2.
The peak in the 1946 birth cohort occurred at age 43 at 44.6%, before declining to 36.8% at age 69. In the 1958 birth cohort, prevalence declined from a peak at age 23 (45.1%) until age 42 (38.5%) before increasing again at age 50 (44.0%). In the 1970 birth cohort, prevalence was highest at age 26 (52.9%) and lowest at age 30 (34.5%).

Life-course mental health and cross-cohort comparisons
Although broad patterns in distribution and above-threshold prevalence over adulthood appear similar across both index measures and the two test-equating methods used in this paper, point estimates differ substantially. For instance, when examining the mean scores and standard deviations at age 36 in the 1946 birth cohort, a more than three-fold difference in means is observed for the GHQ calibration (2.91 [SD: 4.25] for the equipercentile linking method, and 9.98 [SD: 0.38] for the imputation approach). Prevalence estimates for this sweep varied from 2.7% in the equipercentile linking and calibrated threshold approaches using the GHQ-12 to 35.2% using a multiple imputation approach.
Regarding cross-cohort comparisons of the prevalence of psychological distress calibrated against the GHQ-12, all approaches suggest that prevalence is highest in the 1970 birth cohort, though there is considerable variation in point estimates. Calibration against the Malaise-9 using a multiple imputation approach suggests that prevalence in the 1946 birth cohort is higher than in the other two cohorts, whereas both the calibrated cut-off and equipercentile linking approaches suggest that prevalence is comparable across the three cohorts.

Comparison of the two index measures
Full results for the various methods of calibration against the Malaise-9 can be found in the Supplemental Results. Figure 1 details the mean scores across the adulthood sweeps in all three cohorts for both the equipercentile linking and multiple imputation methods, for both GHQ-12 (Figure 1A) and Malaise-9 (Figure 1B). There appear to be larger differences in means between both methods for the GHQ-12 compared with the Malaise-9, though this was not formally tested. Calibration against the GHQ-12 yielded higher prevalences than calibration against the Malaise-9 (Figure 2, Supplemental Figure 6).
Though broad patterns of psychological distress across the life-course were similar for GHQ-12 and Malaise-9, with prevalence highest in mid-life across the three cohorts, the curve was much flatter for the Malaise-9 across all methods (Supplemental Figure 6). Whereas using the GHQ-12 as an index measure seems to suggest slightly higher psychological distress in the younger cohort, using Malaise-9 suggests that psychological distress is lowest in the 1958 cohort.

Sensitivity analysis
Additional sensitivity analyses, comparing both calibration methods with the original prevalence estimates at age 42 in the 1958 cohort and age 30 in the 1970 cohort, suggested that the multiple imputation method gave more accurate estimates for both measures (Supplemental Table 10).

Discussion
Whilst broad patterns of psychological distress remained similar across adulthood, the equipercentile linking test-equating method yielded lower means and standard deviations across the life-course compared with the multiple imputation approach. Whilst this held true across both index measures, differences appeared to be larger for the GHQ-12 compared with the Malaise-9. However, cross-cohort comparisons were more susceptible to methodological effects. In general, using the GHQ-12 as an index measure yielded higher prevalence estimates than using Malaise-9. Sensitivity analyses using study sweeps with both index measures suggested that the multiple imputation approach led to more accurate estimates than equipercentile linking.

Comparison with existing literature
The previously reported 8,9 inverse U-shape across adulthood was observed in the 1946 birth cohort for calibration against the GHQ-12 and Malaise-9 for the calibrated cut-off and equipercentile linking methods, though for both index measures the multiple imputation method showed a gradual decline in prevalence of psychological distress across the life-course. This pattern was less clearly observed in the 1958 and 1970 cohorts across both index measures and both calibration methods, likely because these cohorts are still in middle age and prevalence of psychological distress has only started to decline marginally.
As per previous research comparing the age 42 sweep across the 1958 and 1970 cohorts 13 , we find that based on most methods and measures the 1970 cohort has higher prevalence of psychological distress at most ages compared to the 1958 cohort. However, this is the first paper comparing these cohorts to the earlier 1946-born cohort and we can draw no clear conclusions about any trends that also include this cohort. With the GHQ-12 as index measure we see lower prevalence, and with the Malaise-9 higher prevalence, of psychological distress in the 1946 cohort. Previous studies in North America have found that some older and more recent cohorts have higher distress, demonstrating a U-shape in cohort effects 14 , and other UK-based studies have observed lower prevalence in more recent cohorts 11 . It is important to note that the birth cohorts included in the present study were born across just 24 years in mid-20th century Britain, and hence we cannot extrapolate findings to recent cohorts where higher distress is increasingly reported. Our sensitivity analysis using both sweeps where both GHQ-12 and Malaise-9 were administered could not provide insight into why clear conclusions about cross-cohort trends could not be drawn, although it seems to suggest that multiple imputation might be more reliable than equipercentile linking. However, even just focussing on the multiple imputation findings we see prevalence of psychological distress increasing in more recent cohorts using the GHQ-12, yet lowest distress in the 1958 cohort using the Malaise-9.

Strengths and limitations
These results should be interpreted in the light of the strengths and limitations inherent to this study.
We used three nationally representative birth cohorts, and our calibration sample had sufficient coverage across the whole distribution of mental health and was broadly representative of the current general population of the United Kingdom in terms of age, sex, level of education and country of residence. Our study was methodologically robust and was designed to allow for assessing reliability across methods: we used two different index measures and test-equating methods, enabling us to describe differences on the basis of these. This is in contrast to previous literature applying these methods, which typically uses only one method and measure 3,4,6,7 , resulting in limited evaluation of the reliability of any findings based on these approaches.
However, there are also some limitations inherent to our study design. As we only used data from the United Kingdom, we are uncertain about the generalisability of our results to an international context.
Mode of questionnaire administration differed between our calibration sample (all self-reported online) and the cohort samples (either self-report via a paper questionnaire or interviewer-administered), and this might have led to higher reporting of mental health symptoms in the calibration sample 28 . Finally, we are utilising responses today to equate responses given at a previous point in time, as far back as 1982. However, there appears to be no evidence that within-individual and cross-cohort interpretation of the Malaise-9 changes over time 29 , and for the measures in the calibration sample, measurement invariance analyses (Supplemental Table 3) indicate that younger and older respondents today answer these measures similarly, hence increasing confidence in the longitudinal comparisons made.

Interpretation
Whilst life-course patterns of psychological distress were similar across both index measures and test-equating methods, point estimates were not. When comparing methodologies, the imputation method yielded higher means and standard deviations than the equipercentile linking method, and sensitivity analyses indicate the former might be the less biased approach in this scenario. Whilst means are not directly comparable across index measures due to different scale ranges, prevalence estimates were higher using the GHQ-12.
These differences have little bearing on the longitudinal symptom profile (we confirmed an inverted U-shape over the life-course). There are two hypothetical explanations for this inverted U-shape: it might be artefactual because the instruments we use are poor at capturing important aspects of mental health in later life, or it might be a reflection of genuinely better mental health in later life (either through a reduced perception of pressure through socioemotional selectivity or through eudaemonic processes) after a period of greater stress in mid-life (potentially reflecting the multiple stressors faced by many, such as childcare, career pressures and caring for elderly parents 30 ). Which of these explanations applies needs to be further understood.
However, these differences do have substantial implications for cross-cohort comparisons. For instance, when examining means derived through the equipercentile method calibrated against the Malaise-9 (Figure 1), means are higher in the 1958 and 1970 cohorts, and any mid-life peak is earlier in these cohorts. However, when using the same index measure and applying our multiple imputation-based approach, the mean is highest in the 1946 birth cohort, and there is no discernible mid-life peak in the other two cohorts. This method- and measure-dependency leaves us unable to draw strong conclusions about whether more recent generations experience poorer mental health.
It is important to note that the Malaise-9 has less variance given its range of 0-9, compared to the GHQ-12 (range 0-36) and most of the other measures that we calibrated. We speculate that some of the discrepancies in the results we observe between these different measures might be due to differences in their scales (for instance, a score of between 8 and 10 on the GHQ-12 corresponds to a score of 1 on the Malaise-9, Supplemental Table 7). If this is indeed an important consideration, then test-equating between measures with similar ranges and variance is more likely to yield reliable estimates compared with measures with vastly different variances. This might also explain why the multiple imputation approach appears to be more reliable than the other two approaches, as it does not try to superimpose substantially larger or smaller variance. An important implication of our findings is the [...]

Conclusion
We used two different test-equating methods and two different index measures to calibrate psychological distress measures used in three British birth cohorts against an independently recruited calibration sample. Our subsequently derived mean scores and above cut-off prevalences showed heterogeneity across both measure and method. Whilst this had some implications for estimations of psychological distress across the life-course, we mainly observed an inverse-U shaped trajectory across adulthood. However, the method and measure had the most severe consequences for cross-cohort comparisons, where although there are indications that distress is higher in the 1970 compared to the 1958 birth cohort, consistent trends across all three cohorts are not observed. We therefore should be cautious in interpreting studies that have relied on one method and one index measure only and we recommend that future studies using test-equating approaches to compare mental health across timepoints or datasets use more than one measure to increase reliability of any findings.