## 1 Abstract

We aim to help inform the choice of estimand (i.e., target of inference) and analysis method to be used in future COVID-19 treatment trials. To this end, we describe estimands for outcome types of particular interest in these trials (ordinal and time-to-event). When the outcome is ordinal, the estimands that we consider are the difference between study arms in the mean outcome, the Mann-Whitney (rank–based) estimand, and the average of the cumulative log odds ratios over the levels of the outcome. For time-to-event outcomes, we consider the difference between arms in the restricted mean survival time, the difference between arms in the cumulative incidence, and the relative risk. Advantageously, the interpretability of these estimands does not rely on a proportional odds or proportional hazards assumptions. For each estimand, we evaluate the potential value added by using estimators that leverage information in baseline variables to improve precision and reduce the required sample size to achieve a desired power. These are called covariate adjusted estimators. To evaluate the performance of the covariate adjusted and unadjusted estimators that we present, we simulate two-arm, randomized trials comparing a hypothetical COVID-19 treatment versus standard of care, where the primary outcome is ordinal or time-to-event. Our simulated distributions are derived from two sources: longitudinal data on over 500 patients hospitalized at Weill Cornell Medicine New York Presbyterian Hospital prior to March 28, 2020, and a CDC preliminary description of 2449 cases reported to the CDC from February 12 to March 16, 2020. We focus on hospitalized, COVID-19 positive patients and consider the following outcomes: intubation, ventilator use, and death. We conduct simulations using all three estimands when the outcome is ordinal, but only evaluate the restricted mean survival time when the outcome is time to event. Our simulations showed that, in trials with at least 200 participants, the precision gains due to covariate adjustment are equivalent to requiring 10-20% fewer participants to achieve the same power as a trial that uses the unadjusted estimator; this was the case for each outcome and estimand that we considered.

**Paper Organization** Section 2 provides background on the use of covariate adjustment in randomized trials. Section 3 describes estimands and estimation strategies that we find compelling when the outcome is ordinal or a time to event. Section 4 describes the methods behind the simulation study that we conducted to evaluate these estimation strategies. Section 5 presents the results from this simulation. Section 6 presents our recommendations for the estimands and primary efficacy analysis methods that we recommend for future COVID-19 treatment trials.

## 2 Background on Covariate Adjustment in Randomized Trials, for Ordinal and Time to Event Outcomes

The ICH E9 Guidance on Statistical Methods for Analyzing Clinical Trials (1998) states that “Pretrial deliberations should identify those covariates and factors expected to have an important influence on the primary variable(s), and should consider how to account for these in the analysis to improve precision and to compensate for any lack of balance between treatment groups” [10]. Adjusting for prognostic baseline variables is called covariate adjustment.

Though there appears to be general agreement among regulators [8, 9] on how to adjust for baseline variables when the outcome is continuous valued (i.e., using analysis of covariance, abbreviated as ANCOVA), there is less specific guidance for several outcome types of interest in COVID-19 treatment trials, namely binary, ordinal, or time to event outcomes [1]. We evaluate the performance of covariate adjusted estimators for ordinal and time to event outcomes. Binary outcomes are a special case where the ordinal outcome has two levels, and the standardization approach of [13, 17] can be applied.

## 3 Estimands and Analysis Methods

Throughout we assume that treatment is randomly assigned independently of baseline covariates. The described methods can easily be adapted to handle the case that treatment assignment depends on (a subset of) the measured baseline covariates, as is the case when stratified randomization is used.

### 3.1 Ordinal Outcomes

We consider three estimands when the outcome is ordinal. We comment on these estimands briefly here, and refer the reader to Appendix A.1 for analytic expressions for these quantities.

#### Estimand 1: Difference in means of a prespecified transformation of the ordinal outcome (DIM)

In most settings, this transformation will be monotone so that larger values of the ordinal outcome (assumed to be preferable) will result in larger, and therefore preferable, transformed outcomes.

#### Estimand 2: Mann-Whitney (MW) estimand

This estimand reports the probability that a random individual assigned to treatment will have a better outcome than a random individual assigned to control, with ties broken at random.

#### Estimand 3: Log-odds ratio (LOR)

We consider a nonparametric extension of the log-odds ratio (LOR) [6] defined as the average of the cumulative log odds ratios over levels 1 to *K −* 1 of the outcome. In the case that the distribution of the outcome given treatment arm is accurately described by a proportional odds model [12], this estimand corresponds to the coefficient in front of treatment in this model. Otherwise, it represents a projection onto this model.

It follows from Appendix A.1 that all of the above estimands can be written as smooth summaries of the arm-specific cumulative distribution functions (CDFs) of an ordinal outcome. To estimate these quantities, it suffices to estimate these two CDFs and then to evaluate these summaries. This results in so-called plug-in estimators.

We consider two CDF estimators. The first, which we term the unadjusted estimator, corresponds to using the arm-specific empirical CDFs, that is, corresponds to stratifying the sample based on treatment arm and then using the observed empirical distributions within each arm. The resulting estimator for the DIM corresponds precisely to taking the empirical differences of transformed mean outcomes between the two arms. The resulting estimator *M* for the MW estimand is closely related to the usual Mann-Whitney U-statistic *U*: in particular, *U* = *n*_{0}*n*_{1}*M*, where *n*_{0} and *n*_{1} are the total sample sizes in the two treatment groups.

The second CDF estimator, which we term the covariate adjusted estimator, leverages prognostic information that is available at baseline. It uses working models, i.e., models that are fit in the process of computing the estimator but which we do not assume to be correctly specified (and for which consistency and asymptotic normality still hold under arbitrary model misspecification). Specifically, this estimator makes use of a treatment-stratified proportional odds working model that assumes that, within each treatment arm, the CDF of the ordinal outcome satisfies the proportional odds assumption based on a linear combination of the baseline covariates. This yields two working covariate-conditional CDFs (one per treatment arm). Our final treatment-specific marginal CDFs are obtained by averaging these conditional CDFs across the pooled covariates from the two treatment arms. Importantly, the validity of this estimator (e.g., asymptotic unbiasedness, asymptotic normality) does not rely on correct specification of the working treatment-stratified proportional odds model that this estimator uses. The fact that our CDF estimator is robust to misspecification of the outcome working model is reminiscent of the fact that ANCOVA estimator can be used for covariate adjustment when the outcome is continuous [24, 22]. We also refer the reader to the following references for further discussion of the benefits of covariate adjustment: for the DIM, see [22]; for the MW estimand, see [21]; for the LOR, see [6]; in the case that the ordinal outcome takes only two values, see [14], whose general discussion applies to all three estimands.

Because the covariate adjusted approach that we have described is based on a working parametric model, there are many possible approaches to developing valid confidence intervals for the estimand of interest. These approaches include the nonparametric BCa bootstrap [7] or Wald-type confidence intervals. When developing BCa intervals based on trial data, we recommend using a large number of bootstrap replicates to ensure that the final interval does not meaningfully depend on the random number generator.

We recommend handling missing covariates by imputing them based only on data from those covariates that were observed. Importantly, to ensure that treatment assignment can be treated as randomized even conditionally on the imputed covariates, we do not recommend using treatment or outcome information in this imputation. We recommend handling missing ordinal outcomes using doubly robust methods whose validity relies on the outcomes being missing at random conditionally on the covariates and treatment assignment. In terms of implementation, this approach involves simply applying the methods described in Appendix A.2, but with study arm assignment recoded as 0 to indicate that a patient was both randomized to study arm 0 (control) and had their outcome measured, 1 to indicate that a patient was both randomized to study arm 1 (treatment) and had their outcome measured, and −1 to indicate that the outcome is missing.

### 3.2 Time-to-Event Outcomes

We consider three treatment effect estimands in the time-to-event setting, all of which are interpretable under violations of a proportional hazards assumption. We refer the reader to Appendix B for formal analytic expressions for these quantities.

#### Estimand 1: Difference in restricted mean survival times (RMSTs)

Recall that the RMST corresponds to the expected value of a survival time that is truncated at a specified time [4, 15]. Alternatively, the RMST can be expressed as the area under the survival curve up to the truncation time.

#### Estimand 2: Cumulative incidence difference (CID)

Additive difference in arm-specific incidences of the event by a specified time.

#### Estimand 3: Relative risk (RR)

Ratio of the arm-specific incidences of the event by a specified time

A covariate adjusted targeted minimum loss-based estimator of the RMST is described in Díaz et al. [5]. Analogous to the ordinal outcome case, the estimator of the RMST relies on an estimate of the survival function at each time point. Estimators obtained through the Kaplan-Meier curve are referred to as unadjusted. Adjusted estimators are obtained by computing time-specific hazard probabilities which are transformed using the product-limit formula and marginalized over covariates. The adjusted approach has two key benefits relative to unadjusted alternatives, for example, those based on an unadjusted Kaplan-Meier estimator [11]. First, under regularity conditions, the adjusted estimator is at least as precise as the unadjusted estimator, asymptotically. Second, the adjusted estimator’s consistency depends on an assumption of censoring being independent of the outcome given study arm and baseline covariates, rather than an assumption of censoring in each arm being independent of the outcome marginally; see Eqs. 3 and 4 in Appendix B. The former may be a more plausible assumption. Similar covariate adjusted estimators for the CID and RR are also available [2].

Both [5] and [2] provide approaches that can be used to develop Wald-type confidence intervals and corresponding tests of the null of no effect.

## 4 Simulation Methods

### 4.1 Data-Generating Distributions

#### 4.1.1 Ordinal outcomes

We generated data based on [3], which reported outcomes for hospitalized and non-hospitalized individuals with COVID-19. We focus on the hospitalized data and place additional results pertaining to the non-hospitalized population in the Appendix. There were three levels of the ordinal outcome, with level 1 representing death, level 2 representing that ventilation was required, and level 3 representing that ventilation was not required. We adjusted for the following patient age categories: 0–19, 20–44, 45–54, 55–64, 65–74, 75–84, and *≥* 85. Lower and upper estimates were given for probabilities of outcomes in each age group in [3]; we used the mean of these within each age group.

For the hospitalized COVID-19 positive population, the resulting probabilities of being in each age group and of the outcomes: ventilator use and death are listed in Table 1.

We evaluated performance of the methods for both ineffective and effective treatments. We considered trials with 1:1 randomization probabilities and total enrollment of *n* = 100, 200, 500, and 1000. To simulate treatment effects, we assumed that a treatment would have no effect on the probability of death but would decrease the odds of ventilation by the same relative amount in each age category. This relative reduction was selected for each sample size, such that a two-sided t-test that the average outcome between treatment arms were equal rejected the null hypothesis with approximately 80% power (*n*=200,500,1000) and 50% power (*n*=100). The discrepancy between *n* = 100 and the other sample sizes occurred because there did not exist a relative reduction that achieved 80% power at a sample size of 100. For each setting, 1000 Monte Carlo replicates were performed.

The adjusted estimators that we implemented in our simulation were based on a proportional odds regression model that included age category as a linear term (0-19, 20-44, 45-54, 55-64, 65-74, 75-84, *≥* 85). We estimated the treatment propensity using the marginal empirical probability of treatment. We used 1000 replicates when developing BCa bootstrap confidence intervals.

#### 4.1.2 Time-to-event outcomes

In order to mimic key predictive features of clinical outcomes in COVID patients, our simulation data generating mechanism is based on the over 500 patients hospitalized at Weill Cornell Medicine New York Presbyterian Hospital prior to March 28, 2020. This ensures we have at least 14 days of follow-up for all individuals. The event of interest in this simulation is time from hospitalization to intubation, and the predictive variables used are sex, age, whether the patient required supplemental oxygen at ED presentation, dyspnea, hypertension, and the presence of bilateral infiltrates in the chest x-ray. We focus on the RMST 14 days after hospitalization. Patient data was re-sampled with replacement to generate 1000 datasets of each of the sizes *n* = 100, 200, 500, and 1000. For each dataset, a hypothetical treatment variable was drawn from a Bernoulli distribution with probability 0.5 independently of all other variables. A positive treatment effect of 1.06 was simulated by adding an independent random draw from a *χ*^{2} distribution with 4 degrees of freedom to the treated group. Five percent of the patients were selected at random to be censored, and their censoring time was drawn from a uniform distribution.

We compare the performance of the implementation of the covariate adjusted estimator described in Section 6 of [5] to the unadjusted Kaplan-Meier-based estimator described in [5]. Wald-type confidence intervals and corresponding tests of the null of no effect are reported.

### 4.2 Performance Criteria

We compare the type I error and power of tests of the null hypothesis of no treatment effect based on unadjusted and adjusted estimators, both within and across estimands. For each estimand, we also compare the estimator bias, variance, and mean squared error of the unadjusted and the adjusted approaches. We report the relative efficiency of the unadjusted relative to the adjusted estimator (ratio of variance of adjusted estimator to variance of unadjusted estimator)— since both of these estimators are asymptotically unbiased, one minus the relative efficiency approximately has the interpretation as the proportion reduction in sample size needed for a covariate adjusted estimator to achieve the same statistical power as the unadjusted estimator, at local alternatives.

## 5 Simulation Results

### 5.1 Ordinal outcomes

We present results based on bootstrap-based confidence intervals and hypothesis tests here, while results based on closed-form inference methods are presented in Appendix C. Tables 2, 3, and 4 display results for the difference in means, Mann-Whitney estimand, and log odds ratio, respectively. Type 1 error control of the covariate adjusted methods was comparable to or better than that of the unadjusted methods across most settings. The covariate adjusted methods achieved higher power across all settings. Absolute gains in power varied from 4% to 7% for the DIM, 1% to 7% for the MW estimand, and 3% to 5% for the LOR. The relative efficiency of covariate adjusted methods relative to unadjusted methods varied from 0.83 to 0.89 for the DIM, 0.82 to 0.91 for the MW estimand, and 0.83 to 0.88 for the LOR.

### 5.2 Time-to-event outcomes

We present two sets of results. Table 5 displays the results for estimators that adjust for age and sex along with the four other variables described in Section 4.1.2. Type 1 error control of the covariate adjusted methods was comparable to or better than that of the unadjusted methods across most settings. The covariate adjusted methods achieved higher power across all settings, with absolute gains varying from 3% to 8%. The relative efficiency of covariate adjusted methods relative to unadjusted methods improved monotonically as a function of sample size, from approximately 0.95 when *n* = 100 to approximately 0.80 when *n* = 1000. Table 6 displays the results of the analysis that only adjusts for age and sex. In this case, the gains of the covariate adjusted methods relative to the unadjusted methods are more modest, with absolute gains in power of approximately 1% and relative efficiency ranging from 0.97 to 1.00. These results suggest that there can be a meaningful benefit from adjusting for prognostic covariates beyond just age and sex.

### 5.3 Main Findings

We found that adjusting for variables beyond just age and sex can add substantial value. Specifically, for a time-to-event intubation outcome among hospitalized patients at Weill Cornell Medicine, we saw substantial gains from additionally adjusting for hypertension, dyspnea, whether the patient required supplemental oxygen at ED presentation, and the presence of bilateral infiltrates in the chest x-ray. For ordinal outcomes, we saw that the log odds ratio estimator has undesirable performance when the sample size is small (*n* = 100). Because the log odds ratio estimand is also hard to interpret clinically in the likely scenario that a proportional odds assumption fails to hold, we recommend the difference of means or Mann Whitney estimands when the outcome is ordinal.

## 6 Recommendations for Treatment Effect Definition and Primary Efficacy Analysis

### 1. Randomization scheme

Use stratified block randomization, with strata including site and calendar time (which may be strong predictors of the outcome and for which there may be too many to adjust for at the end of the trial). Prognostic baseline variables such as age, disease severity, and comorbidities will be adjusted for in the primary analysis (see point 3 below), rather than being stratified on in the randomization procedure.

### 2. Estimand when the outcome is ordinal

In selecting a treatment effect definition, if a utility function can be agreed upon to transform the outcome to a score with a clinically meaningful scale, then we recommend using the difference between the transformed means in the treatment and control arms. Otherwise, we recommend using the unweighted difference between means or the Mann-Whitney estimand. We recommend against estimating log odds ratios, since these are difficult to interpret.

### 3. Covariate adjustment

Adjust for prognostic baseline variables to improve precision and power. We expect improvements to be substantial since there are already several known prognostic baseline variables, e.g., age and co-morbidities. The baseline variables should be specified before the trial is started (or should be selected using only blinded data from the trial using a prespecified algorithm) and the number of variables adjusted for should be no more than (roughly) the total sample size divided by 20.

### 4. Confidence intervals and hypothesis testing

we recommend that the nonparametric bootstrap (BCa method) be used with 10000 replicates for constructing a confidence interval. The entire estimation procedure, including any model fitting, should be repeated in each replicate data set. Hypothesis tests can be conducted either by inverting the confidence interval or by permutation methods — the latter can be especially useful in smaller sample size trials in order to achieve the desired Type I error rate.

### 5. Early stopping for efficacy or futility

Rules for early stopping such as O’Brien-Fleming boundaries can be directly applied, except where z-statistics are constructed using the covariate adjusted estimator described above and the covariance between statistics at different analysis times is estimated using nonparametric bootstrap as described above. The timing of analyses can be based on the information accrued, which has the advantage of reducing the average sample size even in trials with no treatment effect, which results from earlier futility stopping.

### 6. Plotting the CDF when the outcome is ordinal

Regardless of which treatment effect definition is used in the primary efficacy analysis, we recommend that the estimated CDF of the primary outcome be plotted for each study arm when the outcome is ordinal. This is analogous to plotting Kaplan-Meier curves for time-to-event outcomes, which can help in interpreting the trial results. For example, if the observed probability of death is similar in the two arms but the probability of ICU admissions is smaller in the treatment arm, then this will be evident in the plots. We also recommend to plot simultaneous confidence bands to represent uncertainty about the estimated curves. To improve precision, covariate adjustment can be used when constructing the estimated CDFs and the corresponding confidence band.

### 7. Missing covariates

We recommend handling missing covariates by imputing these covariates based only on data from those covariates that were observed. Importantly, to ensure that treatment assignment is randomized conditionally on the imputed covariates, no treatment or outcome information should be used in this imputation.

### 8. Missing ordinal outcomes

We recommend handling missing ordinal outcomes using doubly robust methods whose validity relies on the outcomes are missing at random conditionally on the covariates and treatment assignment. One such doubly robust approach can be evaluated by applying the methods described in Appendix A.2, but with study arm recoded as 0 to indicate that a patient was both randomized to study arm 0 (control) and had their outcome measured, 1 to indicate that a patient was both randomized to study arm 1 (treatment) and had their outcome measured, and −1 to indicate that the outcome is missing. When study arm is recoded in this way and the outcome is not missing completely at random but is missing at random conditionally on covariates, it is important that the model used for described in Appendix A.2 actually depend on the baseline covariates, since this recoded treatment is not fully randomized.

### 9. Loss to follow up with time-to-event outcomes

We recommend accounting for loss-to-follow-up using doubly robust methods such as those described in [2, 5]. These methods rely on a potentially more plausible condition on the censoring distribution than do unadjusted methods. The covariate adjusted estimator for the restricted mean survival time in the time-to-event setting is double robust under censoring being independent of the outcome given baseline variables and arm assignment.

## Data Availability

The patient level data used to inform our simulations is not available to be shared.

## Appendix

### A Estimands and estimators when the outcome is ordinal

#### A.1 Estimands

Let *Y*_{0} denote a draw from the distribution of ordinal outcomes in treatment arm 0 and *Y*_{1} denote an independent draw from the distribution of ordinal outcomes in treatment arm 1. The three estimands write as

Above, *u* is a prespecified real-valued transformation of an outcome.

#### A.2 Covariate adjusted estimator

Consider a setting in which we observe *n* independent draws (*X*_{1}, *A*_{1}, *Y*_{1}), …, (*X*_{n}, *A*_{n}, *Y*_{n}), where each *X*_{i} represents baseline covariates, each *A*_{i} represents a treatment, and each *Y*_{i} represents an outcome. We suppose that *A⊥ Y*|*X*. We now derive an estimator for the CDF that is closely related to an estimator presented in Scharfstein et al. (1999) [16] and to targeted minimum loss-based estimators [20, 19].

For and *β ∈* ℝ^{d}, define the following ℝ^{K−1} × ℝ^{d} function as follows:

We will consider the treatment-stratified proportional odds working model for *P*{*Y≤ j*| *A* = *a, X* = *x*} in which there exist (*α*_{0}, *β*_{0}) and (*α*_{1}, *β*_{1}) such that for all *j, x, a*. Importantly, we do not rely on this model being correct.

Let be an estimate of *P* (*A* = *a*|*X* = *x*). In the clinical trial setting that we consider, this quantity can be estimated with the empirical marginal treatment probability. At the end of this subsection, we describe alternative approaches for estimating this quantity. Suppose that, for and are chosen to minimize the following weighted empirical risk in (*α, β*):

Each of these *a*-specific optimizations can be solved by running a weighted logistic regression on a repeated measures dataset of size *n* × (*K −*1). Alternatively, they can be fitted using software for a proportional odds model. For both levels of the treatment *a*, it can be shown that , and so, for any covariate value is a monotone nondecreasing function. Moreover, if our treatment-stratified proportional odds working model is correct, then and are asymptotically linear estimators of the true underlying parameters.

Our covariate adjusted estimate of the CDF *ψ*_{a}(*j*) := *P* (*Y ≤ j*|*A* = *a*) is given by

Because is monotone nondecreasing for all is also monotone nondecreasing. The above estimator also satisfies the known constraint that .

It can also be shown that is (i) doubly robust and (ii) efficient if both the treatment mechanism and the stratified proportional odds working model are correctly specified. To show this, we now establish that is in fact an augmented inverse probability weighted estimator. First, note that minimizing (1) to find and implies that the following first-order condition is satisfied for all *j ∈* {1, …, *K −* 1}:

Next, note that adding this to the right-hand side of (2) shows that

The above shows that is an augmented inverse probability weighted estimator for , with the estimate of the outcome regression *x ↦ P* {*Y ≤ j*|*A* = *a, X* = *x*} given by .

We conclude by discussing our estimation of the treatment probability *P* (*A* = *a* |*X* = *x*). Though this quantity can always be estimated by the empirical treatment probability in our randomized trial setting, there are generally advantages to estimating this quantity within a richer model. For example, a main-terms logistic regression of treatment on covariates could be used — in a randomized trial setting, this model is correctly specified provided that it includes an intercept term. The advantage of estimating known treatment probabilities via correctly specified parametric models has been discussed elsewhere – see, for example, [23] or, for a general treatment, Section 2.3.7 in [18].

### B Estimands and censoring assumptions when the outcome is a time-to-event endpoint

Let *T* be a time-to-event outcome, *C* be a right-censoring time, *A* be a treatment indicator, and *X* be a collection of baseline covariates. Let *τ* be an investigator-specified truncation time that will be used to define the RMST, and let *t*^{*} be an investigator-specified time at which a comparison between the arm-specific cumulative incidences of the event is of interest.

Let *T*_{1} be a random variable with the distributon of *T* |*A* = 1 and let *T*_{0} be a random variable with the distribution of *T* |*A* = 0. The three estimands write as

We conclude by describing the assumptions made by the adjusted and the unadjusted methods on the right censoring time *C*. Unadjusted methods assume that

The adjusted methods discussed in the main text assume that which is often more plausible than (3).

### C Additional results for ordinal outcomes

#### C.1 Results for Wald-style inference for hospitalized patients

#### C.2 Results for non-hospitalized COVID-19 patients

For the non-hospitalized COVID-19 positive population, the probabilities of being in each age group and of hospitalization and death are listed in Table 10. For this simulation, adjustment via the proportional odds model resulted in numerical instabilities in many simulated data sets. Therefore, we considered adjusting for age using a stratified estimator. That is, the conditional probability of each outcome in each age group was estimated using the empirical proportion of the sample with each outcome in each age group. As with the hospitalized simulation, we assumed that a treatment would have no effect on the probability of death but would decrease the odds of ventilation by the same relative amount in each age category. The relative reduction was selected for each sample size, such that a two-sided t-test that the average outcome between treatment arms were equal rejected the null hypothesis with approximately 80% power (*n*=200,500,1000) and 50% power (*n*=100).

## Footnotes

↵* Co-first authors