Parkinson’s Progression Markers Initiative: A Milestone-Based Strategy to Monitor PD Progression

Background: Identifying a meaningful progression metric for Parkinson’s disease (PD) that reflects heterogeneity remains a challenge. Objective: To assess the frequency and baseline predictors of progression to clinically relevant motor and non-motor PD milestones. Methods: Using data from the Parkinson’s Progression Markers Initiative (PPMI) de novo PD cohort, we monitored 25 milestones across six domains (“walking and balance”; “motor complications”; “cognition”; “autonomic dysfunction”; “functional dependence”; “activities of daily living”). Milestones were intended to be severe enough to reflect meaningful disability. We assessed the proportion of participants reaching any milestone; evaluated which occurred most frequently; and conducted a time-to-first-event analysis exploring whether baseline characteristics were associated with progression. Results: Half of participants reached at least one milestone within five years. Milestones within the cognitive, functional dependence, and autonomic dysfunction domains were reached most often. Among participants who reached a milestone at an annual follow-up visit and remained active in the study, 82% continued to meet criteria for any milestone at one or more subsequent annual visits and 55% did so at the next annual visit. In multivariable analysis, baseline features predicting faster time to reaching a milestone included age (p<0.0001), greater MDS-UPDRS total scores (p<0.0001), higher GDS-15 depression scores (p=0.0341), lower dopamine transporter binding (p=0.0043), and lower CSF total α-synuclein levels (p=0.0033). Symptomatic treatment was not significantly associated with reaching a milestone (p=0.1639). Conclusions: Clinically relevant milestones occur frequently, even in early PD. Milestones were significantly associated with baseline clinical and biological markers, but not with symptomatic treatment. Further studies are necessary to validate these results, further assess the stability of milestones, and explore translating them into an outcome measure suitable for observational and therapeutic studies.


INTRODUCTION
The progressive course and diverse motor and non-motor features of Parkinson's disease (PD) have been recognized since the earliest descriptions of the disorder [1]. Although PD is classically defined based on cardinal motor features, cognitive decline and a spectrum of other non-motor features may emerge and progress along the disease course and result in substantial disability [2][3][4][5][6]. Identifying a clinically meaningful progression metric for testing novel therapeutics that reflects this heterogeneity has proven to be a challenge. Several definitions of progression, based on motor, cognitive, or biomarker measures, have been used as trial outcomes [1,7,8]. However, none have been entirely satisfactory for either confirming or rejecting putative disease-modifying effects because they fail to capture the protean features that progressive PD can produce. Defining progression has also proven difficult for observational and biomarker verification studies utilizing Parkinson's Progression Markers Initiative (PPMI) data and specimens, with challenges including differences in ON vs. OFF state data completeness patterns among sporadic vs. genetic PD cohorts [9] and evidence that PD participants who dropped out early had lower cognitive performance at their last completed visit [10]. Thus, a challenge for future PD research is to develop reliable and valid endpoints that can account for progression across the spectrum of clinical features and are versatile in the context of incomplete data.
Changes in the Unified Parkinson's Disease Rating Scale (UPDRS) [11] and the Movement Disorder Society UPDRS (MDS-UPDRS) [12] have been the most common metrics for quantifying disease progression [13,14]. While the MDS-UPDRS has been useful for testing symptomatic drugs, several limitations have been recognized. First, only Part II measures functional outcomes and is thus intrinsically clinically meaningful. Second, the MDS-UPDRS, especially the motor examination (Part III), is highly sensitive to the impact of symptomatic treatment [15]. As a result, disease-modifying therapies have typically been tested during the brief period between diagnosis and the initiation of symptomatic treatment, and only progression of motor disability may be assessed. In this paradigm, only a small fraction of PD patients is eligible to participate in disease-modifying trials, and participants must often trade off the need for symptomatic treatment against trial participation.
An alternative approach is to record the emergence of clinically relevant outcomes. This approach is accepted in other fields of medicine. For example, in therapies for vascular disease, composite outcomes combining mortality with nonfatal events (including myocardial infarction, stroke, and revascularization) are widely used and considered to be a measure of clinically meaningful impacts of the disease [16]. We sought to identify a similar approach to measuring progression in PD patients as they move from diagnosis into the middle stages of disease when disability becomes more apparent. We utilized PPMI data to define and measure a composite endpoint composed of 25 "progression milestones" spanning six domains. These components were selected based on expert consensus to reflect meaningful PD disability such that meeting a milestone would represent unequivocal disease progression. Primary analyses assessed the frequency of reaching any milestone within a five-year follow-up period after enrollment and explored whether baseline factors (including demographic characteristics, clinical features of PD, and cerebrospinal fluid (CSF) and imaging biomarkers) were associated with time to progression. In addition, sample size estimates were calculated to evaluate proof-of-concept and provide a benchmark for future efforts to refine this framework for possible use in therapeutic trials.

Study sample
PPMI is a multicenter, international, prospective cohort study. Study aims and methodology have been published elsewhere [17]. Study protocol and manuals are available at […] committee on human experimentation before study initiation and obtained written informed consent for research from all participants in the study. PD participants included in this analysis were recently diagnosed (mean [SD] duration from diagnosis: 6.6 [6.5] months) and untreated with PD medications at the time of enrollment. Participants were required to be aged 30 years or older (at diagnosis); have a Hoehn and Yahr score of <3; and either have two symptoms out of resting tremor, bradykinesia, or rigidity (including either resting tremor or bradykinesia), or asymmetric resting tremor or asymmetric bradykinesia. In addition, all participants underwent a screening dopamine transporter (DAT) or vesicular monoamine transporter (VMAT) scan and were required to have evidence of dopaminergic deficit consistent with PD.

Baseline Measures
All participants underwent a comprehensive baseline evaluation (including clinical testing, imaging assessments, and biospecimen collection) as detailed elsewhere [18]. From these data, a pre-specified set of candidate predictor variables was considered for this analysis. This encompassed demographics, including age, sex, and clinical site (US vs. non-US); body mass index (kg/m²); orthostatic (supine to standing) change in systolic blood pressure; and duration of disease (months from diagnosis). Clinical assessments of motor and non-motor PD characteristics comprised the MDS-UPDRS, including Hoehn and Yahr stage and derived tremor and postural instability/gait difficulty (PIGD) scores [12,19] […] (α-syn) was measured using a sandwich-type immunoassay kit (BioLegend; formerly Covance) [20] and CSF amyloid beta (Aβ1−42), total tau (t-tau), and phosphorylated tau 181 (p-tau) were measured using Elecsys electrochemiluminescence immunoassays (Roche Diagnostics) [21].

Longitudinal Measures
Time to initiation of PD medication was determined based on the initiation date of symptomatic treatment for motor features of PD, as previously described [22]. Standard clinical metrics were assessed at least annually after baseline (see Table 1 [24]; and (2) presence of significant functional impairment due to cognitive deficits.

LMC, BM, DG, KLP, CMT, DW, KK, KM) based on knowledge of the existing literature [2][3][4][5][6][25] and clinical experience. This process included several sequential steps. First, the working group convened for a series of meetings and agreed on an overarching strategy of defining progression using a multidimensional composite endpoint. Second, the same panel reviewed the rating scales and other outcome assessments included in the PPMI protocol and identified items that measured dysfunction within the dimensions of interest (e.g., motor, cognitive, autonomic); in doing so, a concerted effort was made to omit items that are particularly sensitive to the effects of symptomatic therapy (e.g., MDS-UPDRS Part III items measuring tremor).
Third, in cases where scale items had multiple levels, the panel agreed upon levels that represented unequivocal and at least moderately severe forms of the type of disability they were intended to capture and reflected a degree of dysfunction that is recognized as clinically meaningful within the expert community. Lastly, to facilitate interpretability by grouping milestones into categories that were consistent with clinical practice, components of the composite endpoint were classified across six clinical domains.
Per protocol, most milestones were assessed quarterly for one year and semiannually thereafter; however, three autonomic dysfunction milestones were only assessed at six months and then annually, and three cognitive milestones were only assessed annually. As previously described [23,26], the site investigator's determination of cognitive impairment (from which two dementia-related milestones were derived) was introduced after some participants had already completed their baseline and 12-month visits; consequently, most PD participants (74.9%) missed this assessment at baseline and roughly a third missed it at 12 months. Otherwise, missing data were rare. In all instances of missing data, a conservative approach was applied by which it was assumed that the corresponding milestone criteria were not met.
A composite binary endpoint, defined as time to first occurrence of any one of the milestones, served as the primary outcome variable. Participants who met milestone criteria at baseline and/or never completed any follow-up visits were excluded.
medRxiv preprint (not certified by peer review), this version posted May 24, 2023; https://doi.org/10.1101/2023.05.23.23290344. Made available under a CC-BY 4.0 International license.

Data Sources
specimens/download-data and reflecting data captured in the PPMI database as of June 30, 2020; RRID:SCR_006431), two analysis data sets were derived. The first data set computed the primary endpoint based on data collected at the first five annual follow-up visits only (i.e., the visits at which all milestones were evaluated per protocol). A second data set derived the primary endpoint from the first five annual follow-up visits and seven additional "interim" visits (scheduled at 3, 6, 9, 18, 30, 42, and 54 months). Interim visits evaluated most, but not all, progression milestones. To gauge the possible implications of the frequency of endpoint assessments on future study design, most analyses evaluated both data sets. However, for ease of interpretation and to ensure equal weighting across milestones, models examining baseline predictors of time-to-progression were fitted using annual data only.

Statistical Analysis
Figures were created using RStudio (Posit Software, PBC, Boston, MA; posit.co; RRID:SCR_000432) [27]. All other analyses were performed using SAS v9.4 (SAS Institute Inc., Cary, NC; sas.com; RRID:SCR_008567). To identify baseline predictors of progression, a time-to-event analysis was conducted using multivariable Cox proportional hazard models with a backward selection approach. Time was calculated from the date of enrollment until the date of the first annual visit at which criteria for at least one milestone were met. Participants who never met milestone criteria were censored at the time of their last completed annual visit.
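The event-time and censoring rules above, combined with the conservative handling of missing assessments described earlier, can be illustrated with a minimal sketch. It is not the authors' SAS code: the function name `derive_endpoint`, the month-based time scale (the analysis itself used visit dates), and the milestone labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One annual visit: (months since enrollment, {milestone label: True / False / None}).
# None marks a missing assessment. All labels here are hypothetical.
Visit = Tuple[int, Dict[str, Optional[bool]]]


def derive_endpoint(
    baseline: Dict[str, Optional[bool]], visits: List[Visit]
) -> Optional[Tuple[int, int]]:
    """Return (time, event) for one participant, or None if the participant is excluded.

    event = 1: first annual visit at which any milestone criterion was met.
    event = 0: censored at the last completed annual visit.
    """
    # Exclusions: milestone already met at baseline, or no follow-up visit completed.
    if any(met is True for met in baseline.values()) or not visits:
        return None
    ordered = sorted(visits, key=lambda visit: visit[0])
    for month, status in ordered:
        # Conservative rule: a missing value (None) counts as "criteria not met".
        if any(met is True for met in status.values()):
            return month, 1
    return ordered[-1][0], 0
```

For example, a participant whose only positive assessment occurs at month 24 yields `(24, 1)`, whereas one with no positive assessment through month 24 yields `(24, 0)`.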
Participants who met criteria for any milestone at baseline and/or did not complete at least one annual follow-up were excluded from all models. Ties were handled using Efron's approximation. For model fitting, a covariate was included if it was associated with time to progression at a significance level of 0.10 or less. PD medication use (i.e., a binary indicator […] For CSF total α-syn, all values were included regardless of hemoglobin level; however, sensitivity analyses were conducted that excluded samples with hemoglobin levels exceeding 200 ng/mL [28]. To address multicollinearity during model selection, the MDS-UPDRS total score was considered for the multivariable model instead of Hoehn & Yahr stage and PIGD score. Similarly, among biomarker variables, mean striatum SBR was prioritized over mean putamen SBR and the ratio of CSF t-tau/Aβ1−42 was favored over CSF Aβ1−42 alone. This screening process revealed a set of potential predictor variables, which made up an initial "full model." Subsequently, a backward selection process removed variables one at a time until all variables remaining in the model were significant at the 0.05 level. For all steps in the backward selection process, sex and PD medication use were forced into the model. Due to the exploratory nature of these analyses, no adjustments were made for multiple comparisons.
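Because events in this design are recorded at scheduled visits, many participants share the same event time, which is why the tie-handling method matters. As a hedged illustration of Efron's approximation (not a reproduction of the SAS models used here), the sketch below evaluates the Cox partial log-likelihood for a given coefficient vector. The function name and argument layout are assumptions; fitting the model would additionally require maximizing this quantity over `beta`.

```python
import math
from collections import defaultdict
from typing import List, Sequence


def efron_partial_loglik(
    times: Sequence[float],
    events: Sequence[int],
    covariates: Sequence[Sequence[float]],
    beta: Sequence[float],
) -> float:
    """Cox partial log-likelihood with Efron's approximation for tied event times."""
    # Linear predictor x_i' beta for every participant.
    eta: List[float] = [sum(b * x for b, x in zip(beta, row)) for row in covariates]

    # Group participants who had an event by their (possibly shared) event time.
    tied = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        if e:
            tied[t].append(i)

    loglik = 0.0
    for t, failed in tied.items():
        d = len(failed)
        # Sum of exp(eta) over everyone still at risk at time t.
        risk_sum = sum(math.exp(eta[i]) for i in range(len(times)) if times[i] >= t)
        # Sum of exp(eta) over the participants with an event at time t.
        tied_sum = sum(math.exp(eta[i]) for i in failed)
        loglik += sum(eta[i] for i in failed)
        # Efron: remove a growing fraction l/d of the tied contribution from the risk set.
        for l in range(d):
            loglik -= math.log(risk_sum - (l / d) * tied_sum)
    return loglik
```

When no two events share a time, the inner loop runs once per event and the expression reduces to the usual untied partial likelihood.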
As secondary analyses, we performed sample size calculations for a hypothetical trial targeting 80% power for a two-sided log-rank test (α = 0.05) comparing the survival curves of two treatment groups using a balanced design. Variable assumptions included study length (two vs. three years) and the hazard ratio of the experimental group relative to the comparison group (0.50 vs. 0.75). The comparison group's survival curve was approximated using a piecewise linear curve based on survival function estimates derived from two separate data sources (the "annual visits" vs. "all visits" data sets defined above). Survival function estimates were computed using Kaplan-Meier estimators, with time rounded to the nearest 3 months (i.e., per-protocol time).
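The Kaplan-Meier step with per-protocol time can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAS procedure used in the study: the helper names are hypothetical, times are taken to be in months, and participants censored at a given time are counted as still at risk at that time (the usual convention).

```python
import math
from typing import List, Sequence, Tuple


def round_to_quarter(months: float) -> int:
    """Round an observed time to the nearest 3 months (per-protocol time)."""
    return 3 * math.floor(months / 3 + 0.5)


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival) at each distinct event time."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if u == t and e)
        # Multiply by the conditional probability of remaining event-free at t.
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve
```

A typical use would be `kaplan_meier([round_to_quarter(t) for t in raw_months], events)`, where `raw_months` and `events` are hypothetical per-participant inputs.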

RESULTS
Supplementary Figure 1 presents a flow chart summarizing how many participants were assessed at each study time point. Out of 423 PD participants enrolled, 32 (7.6%) met criteria at baseline for at least one progression milestone. This included two participants who met baseline criteria within two domains (in one case, autonomic dysfunction and walking and balance; in the other, autonomic dysfunction and activities of daily living) and 30 who did so for one domain only (13 autonomic dysfunction, five walking and balance, five cognition, four activities of daily living, three functional dependence). These participants were excluded from all analyses. The remaining 391 participants had a median duration of follow-up of seven years and a 5-year dropout rate of 18%. Finally, 6/391 participants never completed any follow-up visits and were excluded from additional analyses. Supplementary Table 1 presents baseline demographic and disease characteristics for the remaining participants. […] (3, 6, 9, 18, 30, 42, and 54 months). Out of 385 participants who were milestone-free at baseline and returned for at least one follow-up assessment (annual or interim), 193 (50.1%) met progression milestone criteria during at least one of these 12 assessments, with corresponding 12-, 24-, and 36-month progression rates of 16.6%, 27.7%, and 37.4%, respectively. Table 2 summarizes the contribution of each individual domain and milestone to the composite endpoint, i.e., how frequently they coincided with the initial event for a participant.
Milestones within the cognitive domain (met by 14.1% of participants based on annual data only vs. 14.3% based on all available data) and functional dependence domain (12.0% vs. 14.5%) were reached first most frequently in this cohort. Collectively, milestones within the autonomic […]
Notably, at the time of the first milestone, most participants met milestone criteria within a single domain only. Among participants who progressed at an annual follow-up, only 26/166 (15.7%) did so across multiple domains concurrently, with 20 reaching milestones within two domains, four within three domains, and two within four domains (data not shown). Among these multi-domain progressors, it was most common for one of the domains to be functional dependence (18/26; 69%), followed by walking and balance (12/26; 46%). By contrast, milestones within the cognition and autonomic dysfunction domains were comparatively likely to occur in isolation, with 43/53 (81%) cognitive and 33/41 (80%) autonomic progressors experiencing an event within a single domain (data not shown).
Descriptive analyses also evaluated the frequency of each milestone in isolation, i.e., if they ever occurred regardless of whether a different one occurred first (Supplementary Table 2).
Based on data collected at all study visits, 89 participants (23.1%) ever met the functional dependence milestone, 82 (21.3%) ever reached at least one cognitive milestone, and an appreciable number ever reached one or more components of the autonomic dysfunction (16.9%), walking and balance (14.3%), activities of daily living (13.5%), and motor complications (12.5%) domains. Relative to other components of the composite endpoint, those in the activities of daily living domain (choking, speech, dressing, eating, hygiene) were least likely to coincide with the initial event; of the 52 participants who ever reached one of these milestones, only 19 did so at their first event (see Table 1).
Table 3 summarizes the analysis of baseline predictors, which modeled time-to-progression based on annual milestone assessments only. After adjustment for sex and PD medication use, the final multivariable model included three predictors with positive associations (age, MDS-UPDRS total score, GDS-15 score) and two predictors with negative associations […]
Additional analyses evaluated the stability of the milestone-based approach at subsequent annual visits (Supplementary Table 3 […] Rates of permanent reversion (i.e., not meeting criteria within any domain at any subsequent visit) were elevated among individuals whose initial event included milestones within the autonomic dysfunction (22.5%) or motor complications (29.4%) domains.
Table 4 presents sample size calculations, based on the survival function estimates depicted in Figure 1, for a two-arm trial targeting 80% power. Estimates vary depending on the source of survival estimates (annual vs. all visits); proposed study length; and, particularly, the assumed treatment effect.
For instance, based on the rate of clinically meaningful outcomes we observed in our data, a three-year study assuming a 50% reduction in the hazard (hazard ratio 0.50) would be powered at 80% with approximately 125-150 participants per arm. Alternatively, a three-year study assuming a more modest 25% reduction in the hazard (hazard ratio 0.75) would likely require at least 600 participants per arm to achieve 80% power.
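A rough cross-check of these magnitudes is possible with Schoenfeld's approximation for the log-rank test, which is a simpler technique than the piecewise-linear survival curve approach described in the Statistical Analysis section and is substituted here only for illustration. The sketch assumes proportional hazards, 1:1 allocation, no dropout or staggered accrual, and takes the 36-month progression rate of 37.4% reported above as the comparison-group event probability; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Total number of events needed for a two-sided log-rank test with 1:1 allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2


def per_arm_sample_size(
    hazard_ratio: float, control_event_prob: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Participants per arm, ignoring dropout and accrual (illustrative only)."""
    # Under proportional hazards, S_treated(t) = S_control(t) ** hazard_ratio.
    treated_event_prob = 1 - (1 - control_event_prob) ** hazard_ratio
    mean_event_prob = (control_event_prob + treated_event_prob) / 2
    total = schoenfeld_events(hazard_ratio, alpha, power) / mean_event_prob
    return math.ceil(total / 2)


# Three-year scenarios, using the 36-month progression rate (37.4%) as the control event probability.
n_hr_050 = per_arm_sample_size(0.50, 0.374)
n_hr_075 = per_arm_sample_size(0.75, 0.374)
```

Under these simplifying assumptions the results are of the same order as the figures quoted above, though somewhat smaller, which is expected because the approximation does not account for dropout or for the shape of the comparison-group survival curve.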

DISCUSSION
The results of this study show that a set of clinically meaningful milestones derived from widely used assessment scales may have utility as a progression outcome in an early PD cohort. Participants in the PPMI de novo PD cohort were recently diagnosed and untreated at the time of entry, and then followed quarterly for one year and semiannually for four years thereafter. Half of this cohort reached at least one milestone during this clinical follow-up period, with over a quarter doing so within two years. The most frequently reached milestones included loss of functional independence, indicators of cognitive impairment (in particular, a MoCA score below 21), measures of dysautonomia (urinary incontinence and syncope), and postural instability. The milestone definitions chosen for this study were intended to reflect more severe forms of a given problem in order to mitigate uncertainty regarding their functional relevance.
Importantly, the emergence of milestones was largely independent of whether symptomatic therapy had been initiated. Moreover, the composite endpoint appeared to be relatively stable; among participants who reached at least one milestone and remained active in the study, less than 20% permanently reverted to being "milestone-free" at all subsequent visits. These features support the applicability of a milestone-based outcome measure to assess disease progression in early and middle stage PD.
Multivariable analysis indicated that baseline predictors of faster time to reaching a milestone included older age, greater MDS-UPDRS total scores, lower DAT-SPECT striatal binding, lower CSF total α-syn, and higher GDS-15 depression scores. Several of these baseline characteristics (including age, lower DAT binding, and greater motor impairment) have been associated with poor prognosis in prior studies [29][30][31], which provides collateral support for our approach. The apparent utility of CSF total α-syn to predict reaching a clinically relevant milestone is especially interesting considering current literature demonstrating that PD is associated with a small but significant decrease in CSF total α-syn concentrations relative to healthy controls [20][32][33][34]. These predictors of risk for reaching a milestone suggest enrichment strategies to make clinical trials more efficient by building risk factors into trial entry criteria.
A milestone-based outcome measure offers a degree of adaptability that more conventional methods may lack. For instance, if a participant dropped out early but reached a clinical milestone prior to study withdrawal, this metric of progression would be fully captured in a milestone-based time-to-event model. Because the milestones derived from MDS-UPDRS Part III items (measuring gait, freezing of gait, postural stability, and speech) are defined using ON or OFF assessment scores, these components can still be evaluated if either the OFF or ON assessment could not be completed. Because of these properties, our results suggest an approach to testing disease-modifying therapies that may not be affected by symptomatic treatment and could be implemented in more naturalistic settings. Specifically, a milestone-based composite endpoint could be considered for trials evaluating novel therapeutics for PD as an add-on to, rather than instead of, standard symptomatic therapy.
Our sample size estimates, which are meant to illustrate the conceptual feasibility of this framework, indicate that the number of participants required for a trial using a milestone-based approach over two to three years would be comparable to a trial of untreated patients using […]
Outcome measures composed of clinically relevant events have been applied in other areas of medicine (including cancer [36], cardiology [37], nephrology [38], and stroke [39]) and have been acceptable to regulators [40]. Milestone-based or composite outcomes have also been employed before in PD therapeutics. The Deprenyl and tocopherol antioxidative therapy of parkinsonism (DATATOP) trial [1] defined its primary outcome by a clinically relevant milestone, i.e., the need for dopaminergic therapy. This is similar to our approach but used a single rather than composite outcome. Although a landmark trial, the DATATOP study has been criticized because the outcome was sensitive to the symptomatic effect of selegiline [41]. In this analysis, we sought to select outcomes that would not be substantially influenced by treatment. In addition, we included initiation of symptomatic treatment as a time-dependent covariate in our analyses to control for its effect. The NET-PD study of creatine (LS-1) [42] provides another relevant precedent for our analysis. The LS-1 study used a global statistical test (GST) composed of the modified S&E, Symbol Digit […] [44]. Like our measure, this outcome is composed of clinically meaningful components. Unlike our simple composite measure, scores on the GST did not lend themselves to intuitive clinical interpretation. Thus, our framework for a composite of clinically meaningful outcomes may represent a potential advance over existing metrics in terms of robustness in the setting of symptomatic treatment and clinical interpretability. Other observational cohort studies have included milestone-based or composite outcomes in their analyses. These include the CamPaIGN study, which examined the "irreversible" milestones of postural instability (Hoehn & Yahr stage 3), dementia, and death [45]; and the Norwegian ParkWest study, which evaluated the "advanced PD" milestones of visual hallucinations, recurrent falls, dementia, and nursing home placement [46]. Other milestones reported in the literature include severe dysphagia, autonomic dysfunction (e.g., orthostatic hypotension), and unintelligible speech [47]. Our study extends the results of those analyses by including additional clinical milestones and more intensive biomarker assessments, which potentially make our results more relevant to implementation in therapeutic research.
Our results must be considered in light of several limitations. First, a multidimensional composite may not be appropriate for interventions that are intended to impact only certain contributors to PD disability. Per FDA guidance, composite endpoints should be chosen with an expectation that a given intervention will "have a favorable effect on all the components" [40]. It is possible that the pathophysiological mechanisms underlying the various clinical domains described herein (e.g., motor vs. cognitive vs. autonomic) are too different to expect that a single intervention could favorably affect all of them. However, given that the natural history of PD progression is multifaceted, a clinical endpoint that encompasses both motor and non-motor milestones may be the most appropriate approach to assessing interventions intended to slow overall disease progression [25].
[…] [37]. The FDA recommends choosing composite endpoints with components of "reasonably similar" (and not "substantially different") clinical importance, a standard that is harder to establish with more components and one to which our composite may not sufficiently adhere [40]. For instance, we report cognitive milestones defined by apathy and hallucinations.
Although both symptoms are reported to predict cognitive impairment in PD [29,48], they are proxy measures and may not clear the bar of being "reasonably similar" to other milestones (e.g., a site investigator diagnosis of dementia). Our data are meant to illustrate the usefulness of the concept of a milestone-based outcome for PD trials. Future directions could include efforts, such as factor analysis, to test our domain grouping system and simplify the composite by removing redundancy and components that contribute minimally to the overall endpoint.
Third, the criteria for our composite endpoint are satisfied by the occurrence of a single rater-dependent event recorded at a single time point, an approach that prioritizes sensitivity over specificity and raises important questions about reliability. We considered an alternative strategy requiring that milestones be evident at consecutive visits. However, this made the endpoint less efficient, particularly if participants meeting criteria at baseline were excluded (in which case the endpoint could not be met until the second follow-up visit); and was insensitive to participants who met criteria at a single visit and withdrew before their next visit (due to worsening parkinsonism). Ultimately, we chose a first occurrence strategy, concluding that experiencing something sufficiently severe for the first time represents an important clinical event even if it is not reported at the next visit. Moreover, since cutoffs were made at severe manifestations of each clinical feature, we could envisage medication changes and other therapeutic maneuvers that could temporarily reduce the severity of such problems, which then recur after a hiatus. That said, nearly 20% of participants in our sample who ever reached a milestone did not meet milestone criteria again at any subsequent visit. We acknowledge that this is not an insignificant number and that efforts to mitigate such occurrences are warranted. The domain-level analyses reported herein (Supplementary Table 3) suggest that milestones within certain domains (e.g., autonomic dysfunction, motor complications) may be less stable than others, and future analyses evaluating the stability of each individual milestone are being planned.
Another important limitation to our study is the lack of Patient and Public Involvement and Engagement (PPIE). Milestones were carefully chosen by a panel of clinical experts and anchored largely to MDS-UPDRS items, which were developed with extensive input from patient focus groups [12]. However, for a milestone-based composite measure to be considered as the primary outcome in a therapeutic trial, greater PPIE would be essential. One possibility would be to survey PD patients and care partners on the relative "clinical importance" of the milestones reported herein and elsewhere in the literature [45-47]. Our composite also lacked a global quality of life measure, such as the PDQ-39, and other patient-reported outcomes (PROs). Additional PROs as well as objective digital measures have been added to the PPMI battery and could be areas of future research.
Other key limitations of our study that warrant further investigation include its exploratory nature (e.g., no adjustment for multiple comparisons) and absence of external validation.
Importantly, efforts are underway to validate this milestone-based endpoint in other early PD cohorts, including the STEADY-PD III [49] and SURE-PD3 [50] trial cohorts and their extension in AT-HOME PD [51]. Like PPMI, these studies included participants with early-stage PD who were not on levodopa or dopamine agonists at enrollment. Furthermore, they are comparable in mean age (PPMI = 61.5; STEADY-PD III = 62; SURE-PD3 = 63) and, in the case of SURE-PD3, were similarly enriched for evidence of dopaminergic deficit at screening. Notably, however, these cohorts are considerably younger and far less treated than other PD populations, such as incident PD cases enrolled in the population-based CamPaIGN (mean age: 70.6) and PINE (mean age: 72.5) cohorts [52,53]. As such, some important considerations will be whether these findings are generalizable to future studies that enroll older and more treated cohorts and which segment of PD patients would be appropriate for a clinical trial that implemented a milestone-based outcome measure.
Also, since milestones were only evaluated at pre-scheduled visits, it is only known that criteria became evident at some point during the interval between one visit and the next. However, our analysis used the approach, commonly applied in practice, of assuming that event times were either observed exactly at the end of said interval (imputed from the visit date on which the milestone was first recorded) or right censored, and then applying standard time-to-event methods (i.e., Kaplan-Meier and Cox regression estimates). Planned validation efforts will apply methods tailored specifically to interval-censored data [54,55]. More generally, the use of a time-to-first-event analysis approach may be inefficient because it ignores additional information (e.g., events at subsequent visits, total number of milestones/domains reached).
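To make the right-endpoint imputation described above concrete, the following Python sketch converts visit-level data into the interval in which the first event must have occurred, imputes the event time at the end of that interval, and computes a Kaplan-Meier estimate on the imputed times. It is a simplified illustration under stated assumptions, not the analysis code used in this study: the data layout, function names, and four-participant cohort are hypothetical, and everyone is assumed to be milestone-free at month 0.

```python
from typing import Dict, List, Optional, Tuple

Visits = Dict[int, bool]               # {visit month: milestone criteria met?}
Interval = Tuple[int, Optional[int]]   # event in (left, right]; right=None if censored


def to_interval(visits: Visits) -> Interval:
    """Interval known to contain the first event, given scheduled visits."""
    last_negative = 0  # assumption: milestone-free at baseline (month 0)
    for month in sorted(visits):
        if visits[month]:
            return (last_negative, month)
        last_negative = month
    return (last_negative, None)  # right censored at last milestone-free visit


def impute_right_endpoint(interval: Interval) -> Tuple[int, bool]:
    """Treat the event as observed exactly at the visit where it was first recorded."""
    left, right = interval
    return (right, True) if right is not None else (left, False)


def kaplan_meier(samples: List[Tuple[int, bool]]) -> List[Tuple[int, float]]:
    """Product-limit estimate: (time, survival) at each distinct event time."""
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in samples if event}):
        at_risk = sum(1 for time, _ in samples if time >= t)
        events = sum(1 for time, event in samples if event and time == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up cohort, for illustration only.
cohort = [
    {12: False, 24: True},               # event somewhere in (12, 24]
    {12: True},                          # event somewhere in (0, 12]
    {12: False, 24: False, 36: False},   # censored at month 36
    {12: False},                         # censored at month 12
]
intervals = [to_interval(v) for v in cohort]
imputed = [impute_right_endpoint(i) for i in intervals]
curve = kaplan_meier(imputed)  # [(12, 0.75), (24, 0.375)]
```

The `intervals` list keeps the information that interval-censored methods would use directly, whereas `imputed` collapses each interval to its right endpoint, which is the simplification that standard Kaplan-Meier and Cox procedures require.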
Alternative approaches that may increase study power include recurrent event models [56,57] or repeated-measures analysis of an ordinal or continuous "score" reflecting the sum of multiple milestones/domains.
These methodological limitations are balanced by important strengths. We conducted our study in the context of the PPMI study, which employs rigorous, standardized data collection of motor, non-motor and biomarker assessments in the context of an international, multicenter cohort with long-term follow-up [18,23,58].
In summary, the results of this study show that clinically meaningful milestones occur frequently within five years of follow-up of patients recruited with early, untreated PD, and are significantly associated with baseline demographic characteristics, clinical features, and objective biomarkers. These findings support the viability of using a milestone-based outcome measure in observational and biomarker verification studies.
Our results also have several important implications for clinical trial design. First, stratification based on baseline markers may reduce variability in progression in clinical trial cohorts, thus making trials more efficient. Second and importantly, a composite measure based on the milestones we evaluated could become a primary outcome in PD disease modification trials.
Additional follow-up and analysis of PPMI data will address limitations in our study, produce further validation and refine a framework for efficient trials of potentially disease-modifying therapeutics.
For participants who ever reached any milestone, data only considers the initial event (i.e., first visit at which criteria for at least one milestone were met). Columns include participants who were milestone-free at baseline and subsequently completed at least one of the specified follow-up visits. *Derived from follow-up data collected at 12, 24, 36, 48, and 60 months. **Derived from follow-up data collected at 3, 6, 9, 12, 18, 24, 30, 36, 42, 48, 54, and 60 months.

FIGURE LEGENDS
Fig. 1. Kaplan-Meier curves of progression-free survival, as defined by reaching any progression milestone, based on data collected at 12, 24, 36, 48, and 60 months (blue) versus 3, 6, 9, 12, 18, 24, 30, 36, 42, 48, 54, and 60 months (red). Each curve reflects de novo PD participants who were milestone-free at baseline and completed at least one of the corresponding follow-up visits.
Supplementary Fig. 1. Flow chart summarizing which participants from the PPMI de novo PD cohort were included in the analysis and how many participants were assessed at each study time point.