Clinical judgement of General Practitioners for the diagnosis of dementia

Background: The accuracy of General Practitioners' (GPs') clinical judgement for diagnosing dementia is uncertain. Aim: Investigate the accuracy of GPs' clinical judgement for the diagnosis of dementia. Design and Setting: Diagnostic test accuracy study, recruiting from 21 practices around Bristol. Method: The clinical judgement of the treating GP (index test) was based on the information immediately available at their initial consultation with a person aged over 70 years who had symptoms of possible dementia. The reference standard was an assessment by a specialist clinician, based on a standardised clinical examination and made according to ICD-10 criteria for dementia. Results: 240 people were recruited, with a median age of 80 years (IQR 75 to 84 years), of whom 126 (53%) were men and 132 (55%) had dementia. The median duration of symptoms was 24 months (IQR 12 to 36 months) and the median ACE-III score was 75 (IQR 65 to 87). GP clinical judgement had sensitivity 56% (95% CI 47% to 65%) and specificity 89% (95% CI 81% to 94%). Positive likelihood ratio was higher in people aged 70-79 years (6.5, 95% CI 2.9 to 15) compared to people aged ≥80 years (3.6, 95% CI 1.7 to 7.6), and in women (10.4, 95% CI 3.4 to 31.7) compared to men (3.2, 95% CI 1.7 to 6.2), whereas the negative likelihood ratio was similar in all groups. Conclusion: GP judgement is more likely to under-identify than to over-identify dementia.


Introduction
The James Lind Alliance has identified the role of general practice in supporting a more effective route to diagnosis of dementia as a priority for health research (1). People with symptoms of dementia have historically faced long delays before getting an assessment and an explanation for their symptoms (2). Approaches to address waiting lists have included psychiatrists supporting primary care memory clinics (3), integrated one-stop clinics (4), and training GPs to make a diagnosis in uncomplicated cases (5, 6), which is supported by NICE (7). A GP could use a range of brief cognitive assessments (8) to evaluate a person with symptoms of dementia. National guidelines differ on which test to use, possibly because there is little evidence in a symptomatic primary care population (9, 10). Formally evaluating cognition takes time and familiarity with the test. GPs report using non-standardised processes (11), such as their clinical judgement (12), to decide whether a person has dementia. Previous studies investigating the accuracy of GP clinical judgement have typically suffered from one of two significant limitations (13). The first is a definition of clinical judgement which is of unclear relevance to practice, such as retrospective judgement, or indeed documentation of recorded diagnoses in the medical record, which are systematically incomplete (14). The second is sampling unselected people attending general practice regardless of symptoms, which is more akin to screening. To address these limitations, we investigated the prospective accuracy of GP clinical judgement for the diagnosis of dementia syndrome in people over 70 years who were attending their GP surgery and who had cognitive symptoms for at least six months but had not already been diagnosed with dementia (15).

Methods
Population. We recruited participants from 21 participating GP surgeries in the Bristol, North Somerset, and South Gloucestershire (BNSSG) area, which is a diverse geographic area within 15 miles of the City of Bristol, covering a total population of around 900,000 people across 82 GP practices. Research clinics were in four participating GP surgeries, strategically located for accessibility. We calculated that a minimum sample size of 200 was needed for the lower bound of the 95% confidence interval for specificity to be 80%, based on a specificity of 95% in prior studies, and a 75% prevalence of dementia in local memory clinic data (16).
Creavin et al. | medRxiv | November 20, 2020 | 1-13. medRxiv preprint doi: https://doi.org/10.1101/2020.11.20.20234062; this version posted November 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Inclusion and exclusion criteria. Participants were people with symptoms of dementia, who were aged at least 70 years and had been referred by their GP to this research study. Symptoms of dementia were not specified but generally include disturbance in memory, language, executive function, behaviour, and visuospatial skills (17). Symptoms were required to be present for at least six months, and could be reported by the person themselves, a family member, a professional, or another person; there was no severity threshold. An accompanying informant was mandatory. All participants were offered free accessible transport and translation services. People were excluded if they had a known neurological disorder (i.e. Parkinsonism, Multiple Sclerosis, learning disability, Huntington's disease), were registered blind, had profound deafness (i.e. unable to use a telephone), had a psychiatric disorder requiring current secondary care input, or if cognitive symptoms were either rapidly progressive or co-incident with neurological disturbance. People with very severe dementia, operationalised as inability to consent, were excluded as they were judged by a lay advisory group to find the research process overly burdensome. GPs were encouraged to refer a consecutive series of all attending eligible patients to the study, regardless of their clinical judgement or any test results. An electronic prompt reminded clinicians about the study when consulting. GPs obtained verbal consent to share contact details with the study team, who re-confirmed eligibility and took consent. The research team contacted potentially eligible people on at least three occasions at two different parts of the day.
Index test of clinical judgement. The referring GP recorded their clinical judgement using an electronic referral form during a consultation with their patient about cognitive symptoms. Clinical judgement was operationalised as normal, cognitive impairment not dementia (CIND), or dementia. GPs were asked to "Please write a few words about what you think led you to form your gut feeling", but were not required to arrange any test and could also refer people simultaneously or subsequently to NHS services. The study team contacted the practice at least three times to obtain any missing referral data.
Reference standard. At the research clinic, a single specialist physician with more than 20 years' experience in the field of dementia conducted a standardised assessment lasting approximately 60 minutes comprising clinical history, the Addenbrooke's Cognitive Examination III (ACE-III) (18), Brief Assessment Schedule Depression Cards (BAS-DEC) (19) and the Bristol Activities of Daily Living (BADL) Questionnaire (20). The specialist was not aware of other test results such as GP judgement or any investigations. The reference standard was based on the evaluation of the specialist physician for dementia according to ICD-10 criteria (21). Medical records were reviewed for all participants six months after the research clinic to identify any information that had come to light that would contradict this judgement. A second specialist adjudicated cases where there was diagnostic uncertainty at the research clinic using the initial specialist assessment and the medical record review, but without access to the GP judgement. Study data were electronically entered and managed using REDCap (Research Electronic Data Capture) hosted at the University of Bristol (22).
Statistical methods. Characteristics of participants including age, sex and ACE-III score were tabulated by dementia status according to the reference standard. Separate logistic regression analyses were used with non-participation as the dependent variable and GP judgement, age (in years) and female sex as the independent variables to test the hypothesis of no association with these variables. Time from referral to appointment was described using median and interquartile range and logistic regression was used to test the hypothesis of no association between time to appointment (in days) and dementia (as the dependent variable). Measures of diagnostic test accuracy (sensitivity, specificity, likelihood ratios, predictive values) were calculated together with 95% confidence intervals. Decision curve analysis (23) was used to show the net benefit (which incorporates both discrimination and calibration) of GP judgement at varying threshold probabilities. Decision curve analysis quantifies the net benefit of a test, in units of true positives, across a range of preferences (24), where a net benefit of 0.05 means "five true positives for every 100 patients in the target population" (25). Sensitivity analyses were done to explore whether accuracy varied by age (<80 years versus ≥80 years), since prediction models perform differently in these age groups (26), and by sex. Cochran's Q test was used to test the hypothesis of no difference in likelihood ratios between groups (27).

Results
Figure 1 shows a flowchart for inclusion in the study. The theoretically "eligible" figure of 1,735 people was derived from the age-specific incidence of dementia (28) and the demographics of the population in the participating practices (34,956 aged over 70 years) (29). One person who consented withdrew before any data collection was done because they were admitted to hospital with acute illness. Of the 240 with available data, there were 20 borderline cases that were adjudicated by a second specialist.
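The Cochran's Q comparison of likelihood ratios named in the statistical methods can be sketched as follows. This is a minimal illustration and not the authors' analysis code: it assumes the comparison is made on the log scale with inverse-variance weights, and it recovers each standard error from the rounded 95% confidence intervals quoted in the abstract, so results will only approximate the reported p values. All function and variable names are hypothetical.

```python
import math

def log_lr_and_se(lr, lower, upper, z=1.96):
    """Return log(LR) and a standard error recovered from its 95% CI (assumed symmetric on the log scale)."""
    return math.log(lr), (math.log(upper) - math.log(lower)) / (2 * z)

def cochran_q(estimates):
    """Cochran's Q for a list of (log estimate, SE) pairs using inverse-variance weights."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    return sum(w * (est - pooled) ** 2 for w, (est, _) in zip(weights, estimates))

def chi2_sf_1df(q):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom (two groups)."""
    return math.erfc(math.sqrt(q / 2))

# Positive likelihood ratios and 95% CIs as quoted in the abstract (rounded values).
women = log_lr_and_se(10.4, 3.4, 31.7)
men = log_lr_and_se(3.2, 1.7, 6.2)
under_80 = log_lr_and_se(6.5, 2.9, 15)
over_80 = log_lr_and_se(3.6, 1.7, 7.6)

p_sex = chi2_sf_1df(cochran_q([women, men]))
p_age = chi2_sf_1df(cochran_q([under_80, over_80]))
print(f"LRP by sex: p = {p_sex:.3f}; LRP by age group: p = {p_age:.3f}")
```

Under these assumptions the two p values come out at roughly 0.07 for sex and 0.30 for age group, which is in line with the Q test p values the paper reports for the positive likelihood ratio.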
The 240 people were classified as either Normal (47), Dementia (132, of whom 1 met DSM-5 but not ICD-10 criteria, solely because of no memory impairment), or CIND (61, of whom 59 met criteria for MCI, 1 had an affective disorder, and 1 had a brain injury). There was little evidence of an association between non-participation and clinical judgement of CIND (odds ratio 1.2; 95% CI 0.55 to 2.41) or dementia (odds ratio 1.9; 95% CI 0.90 to 3.93). There was some evidence for an association between non-participation and age (odds ratio per year 1.08; 95% CI 1.04 to 1.12) and female sex (odds ratio 1.88; 95% CI 1.21 to 2.92). The median time between referral (clinical judgement) and the clinic appointment (reference standard) was 47 days (IQR 30 to 72 days); the longest interval was 177 days, which was due to difficulties with attending earlier appointments. There was no association between time from referral to appointment and dementia (odds ratio per day 1.0; 95% CI 0.99 to 1.01). Table 1 shows the demographics of participants.
Table 1 goes here
Two people could not complete the ACE-III because English was not their first language and they had both declined an interpreter. In both cases sufficient information was available from other parts of the assessment for a categorisation about cognition to be made (one had normal cognition, one had dementia). For the 238 people who had an ACE-III score, the median was 75 (interquartile range 65 to 87). Referring GPs judged that 34 people were normal, 86 had dementia, and 120 had CIND; the one person who withdrew from the study due to acute illness was judged by the referring GP to have CIND. People that GPs judged as having dementia had a total ACE-III score IQR of 60 to 74, with a 90th centile of 81/100 and a highest score of 95/100. Similarly, people that GPs judged as having CIND had an ACE-III score IQR of 71 to 89.
Table 2 goes here
Table 2 shows the diagnostic accuracy of GP judgement for dementia.
The sensitivity and specificity of GP judgement were respectively 56% (95% CI 47% to 65%) and 89% (95% CI 81% to 94%). Clinical judgement was more useful for ruling in dementia than for ruling it out, with higher specificity and PPV than sensitivity and NPV. In people aged 80 or more years, clinical judgement had similar sensitivity and specificity to those aged under 80 years (Q test p value 0.296 for the positive likelihood ratio, LRP, and 0.798 for the negative likelihood ratio, LRN). There was weak evidence that clinical judgement in women had a higher LRP (Q test p value 0.074) and a lower LRN (Q test p value 0.064) than clinical judgement in men.
Figure 2 goes here
Figure 2 shows that clinical judgement has greater net benefit than a treat-all approach at threshold probabilities of above 50%, and a treat-none approach at threshold probabilities below 85%. At a threshold probability of 80%, indicating a preference for avoiding over-diagnosis, clinical judgement has a net benefit of 0.11 over the treat-none approach, indicating an additional 11 true positives for every 100 people, compared to the approach of diagnosing none. If the doctor prefers to not miss dementia, with a threshold probability of up to 50%, perhaps in a younger patient with a strong family history, then clinical judgement will have lower net benefit than treating everyone as if they have dementia and probably arranging further tests or a referral.
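The net benefit figure quoted above can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the authors' code: it uses the standard decision curve definition of net benefit (true positives per person minus false positives per person weighted by the threshold odds), and the 2x2 counts are not printed in the text but are reconstructed here from the reported totals (240 participants, 132 with dementia, 86 judged by GPs to have dementia, sensitivity 56%), giving 74 true positives and 12 false positives. The function names are hypothetical.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a test: TP/n minus FP/n weighted by the threshold odds."""
    odds = threshold / (1 - threshold)
    return tp / n - (fp / n) * odds

def net_benefit_treat_all(n_disease, n, threshold):
    """Net benefit of treating everyone as if they have the condition."""
    return net_benefit(n_disease, n - n_disease, n, threshold)

# Counts reconstructed from the reported totals (an assumption, see lead-in).
TP, FP, N, N_DEMENTIA = 74, 12, 240, 132

for pt in (0.6, 0.8):
    judgement = net_benefit(TP, FP, N, pt)
    treat_all = net_benefit_treat_all(N_DEMENTIA, N, pt)
    # Treat-none has a net benefit of 0 at every threshold by definition.
    print(f"threshold {pt:.0%}: judgement {judgement:.2f}, treat-all {treat_all:.2f}, treat-none 0.00")
```

At a threshold probability of 80% the reconstructed counts give a net benefit of about 0.11 for clinical judgement, which agrees with the value described for Figure 2.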

Discussion
Summary. From 21 participating GP surgeries, 465 people were referred and 240 were evaluated. Of these, 132 (55%; 95% CI 48% to 61%) had dementia. Clinical judgement as a single test had an LRP of 5 (95% CI 3 to 9) and an LRN of 0.5 (95% CI 0.4 to 0.6) for the target condition dementia. People that GPs judged as having dementia had a total ACE-III score IQR of 60 to 74, and those that they judged as having CIND had a total ACE-III IQR of 71 to 89. This compares to published ACE-III thresholds of <82 for dementia (30) and <88 for MCI (30) and suggests that in this study GPs are not being overly restrictive in their judgement for dementia, or liberal in their judgement for CIND.
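The summary likelihood ratios can be reproduced approximately from a 2x2 table. The sketch below is illustrative only: the cell counts are not printed in the text and are reconstructed from the reported totals (132 with dementia, 108 without, 86 judged by GPs to have dementia, sensitivity 56%), and the confidence intervals use the usual log-scale approximation for a ratio of proportions, which may not be the method the authors used. All names are hypothetical.

```python
import math

def likelihood_ratio_ci(num_events, num_total, den_events, den_total, z=1.96):
    """Ratio of two proportions with a 95% CI computed on the log scale."""
    ratio = (num_events / num_total) / (den_events / den_total)
    se = math.sqrt(1 / num_events - 1 / num_total + 1 / den_events - 1 / den_total)
    return ratio, ratio * math.exp(-z * se), ratio * math.exp(z * se)

# Reconstructed counts (an assumption): GP judgement of dementia versus reference standard.
TP, FN = 74, 58   # reference standard: dementia (n = 132)
FP, TN = 12, 96   # reference standard: no dementia (n = 108)

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)

# LRP = sensitivity / (1 - specificity); LRN = (1 - sensitivity) / specificity.
lrp = likelihood_ratio_ci(TP, TP + FN, FP, FP + TN)
lrn = likelihood_ratio_ci(FN, TP + FN, TN, FP + TN)

print(f"sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
print("LRP {:.1f} (95% CI {:.1f} to {:.1f})".format(*lrp))
print("LRN {:.2f} (95% CI {:.2f} to {:.2f})".format(*lrn))
```

With these reconstructed counts the point estimates round to a sensitivity of 56%, a specificity of 89%, an LRP of 5 and an LRN of 0.5, in line with the values reported in the summary.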
Strengths and limitations. The patient selection in the current study closely reflects real world clinical practice in the United Kingdom, with efforts to avoid people being excluded based on language, transport, or appointment availability. Participants were included with a range of GP opinions about the presence of cognitive impairment in people who had presented with symptoms, which means that cognitive problems were one of the problems discussed in the initial GP consultation; typically 2.5 problems are discussed per appointment (31). The index test clinical judgement in this study reflects an average measure of diagnostic accuracy for an estimated 142 whole time equivalent GPs working in different settings (32). Responses indicated that clinical judgement was typically informed by "face to face presentation". GPs were told they need not use any formal test to inform their judgement and based on previous studies this is likely to be based on rules of thumb (11) and not formal tests (12). The interval between clinical judgement and the reference standard was relatively short, and unlikely to be associated with a significant progression in cognitive impairment (10). Clinical judgement was fully verified against the reference standard for all consenting people who were referred and there was no evidence of selective participation by cognitive status. Follow-up data after six months were obtained, and uncertain cases were adjudicated. An important limitation is that despite providing translation services the population were largely white, native English speakers. In addition, the confidence intervals for our sub-groups are still wide.
Comparison with existing literature. Table 3 summarises the features of this study compared to the existing literature (33, 34). A major strength of this study for applicability to practice is that it evaluated symptomatic people. This study was also one of only two studies with complete verification by the reference standard. This study has lower sensitivity and higher specificity than the French study (35), but this could be because the other study verified only 26% of people who underwent the index test.
Table 3 goes here
Implications for Research and/or practice. Diagnosis can be conceptualised as a pragmatic method of classification that is fit for purpose in the clinical setting (36), and GP judgement may often use heuristics (rules of thumb) and system one (non-analytical (37)) cognitive processes. Clinical judgement may be systematically different to formal definitions, just as different formal definitions (generally formulated for research needs) select different groups of people (38). The GP heuristic of dementia in the older adult may be an individual with forgetfulness who also has sensory impairment, limited mobility, multi-morbidity, and needs additional assistance performing activities of daily life (39). It remains to be seen which definition is most useful in practice. Concerns about resources and lack of specialist expertise for GPs to diagnose and manage dementia well have been reported for many years (40), and GPs have been reported to frame dementia care as a specialist activity (41). However, the priority of patients and their kin is to get a prompt diagnosis, in an emotionally safe and personalised way (2). Approaches to diagnosis (3, 6, 42) and follow-up (43) have been reported in primary care but regrettably, a well-designed intervention to improve practice was not effective in improving documentation or increasing case identification (44). Instead of training GPs, patients may benefit more from additional practice-based dementia case workers (44), which in England could be provided through Primary Care Networks (45).

ACKNOWLEDGEMENTS
The authors thank the participants and the staff at participating practices, without whom this work would not have been possible.
The staff at the West of England Clinical Research Network arranged for redaction, collection and transport of medical records from general practices. Written in LaTeX using zHenriquesLab-StyleBioRxiv.cls available at https://www.overleaf.com/latex/templates/henriqueslab-biorxiv-template/nyprsybwffws

12. Methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g., 95% confidence intervals). Page: 2
13. Methods for calculating test reproducibility, if done. Page: NA
14. When study was performed, including beginning and end dates of recruitment. Page: 3
15. Clinical and demographic characteristics of the study population (at least information on age, sex, spectrum of presenting symptoms). See also item 18. Page: 6
16. The number of participants satisfying the criteria for inclusion who did or did not undergo the index tests and/or the reference standard; describe why participants failed to undergo either test (a flow diagram is strongly recommended). See also items 3-5. Page: 7
17. Time interval between the index tests and the reference standard, and any treatment administered in between. Page: 3
18. Distribution of severity of disease (define criteria) in those with the target condition; other diagnoses in participants without the target condition. Page: 6
19. A cross-tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, the distribution of the test results by the results of the reference standard. Page: 6
20. Any adverse events from performing the index tests or the reference standard. Page: 3
21. Estimates of diagnostic accuracy and measures of statistical uncertainty (e.g., 95% confidence intervals). See also item 12. Page: 8
22. How indeterminate results, missing data, and outliers of the index tests were handled. Page: 3
23. Estimates of variability of diagnostic accuracy between subgroups of participants, readers, or centers, if done. Page: 8
24. Estimates of test reproducibility, if done. See also item 13. Page: NA
25. Discuss the clinical applicability of the study findings. Page: 3-4