## Abstract

**Objective** To improve the estimation of healthcare expenditures by introducing a novel estimation method that is well-suited to situations where data exhibit strong skewness and zero-inflation.

**Data Sources** Simulations, and two sources of real-world data: the 2016-2017 Medical Expenditure Panel Survey (MEPS) and the Back Pain Outcomes using Longitudinal Data (BOLD) datasets.

**Study Design** Super learner is an ensemble machine learning approach that can combine several algorithms in order to improve estimation. We propose a two-stage super learner that is well suited for use with healthcare expenditure data by separately estimating the probability of any healthcare expenditure and the mean amount of healthcare expenditure conditional on having healthcare expenditures. These estimates can be combined to yield a single estimate of expenditures for each observation. The method can flexibly incorporate a range of individual estimation approaches for each stage of estimation, including both regression-based approaches and machine learning algorithms such as random forests. We compare the performance of the proposed two-stage super learner with a one-stage super learner, and with multiple individual algorithms for estimation of healthcare cost under a broad range of data settings in simulated and real data. The predictive performance of alternative strategies was compared using Mean Squared Error and R^{2}.

**Principal Findings** Our results indicate that the two-stage super learner has better performance compared with a one-stage super learner and individual algorithms, for healthcare cost estimation under a wide variety of settings in both simulations and empirical analyses. The improvement of the two-stage super learner over the one-stage super learner was particularly evident in settings when zero-inflation is high.

**Conclusions** The two-stage super learner provides researchers an effective approach for healthcare cost analyses in environments where they cannot know the best single algorithm a priori.

## Introduction

The study of healthcare expenditures is a key component of health services research with common applications including estimating future expenditures, expenditures related to particular health conditions, and effects of interventions on expenditures [1]. However, statistical modeling of healthcare expenditures is often challenging for two reasons: individuals without health care utilization can lead to zero inflation, and very high expenditures in a small number of individuals can lead to large positive skewness [2–3]. In the United States, Berk & Monheit [4] report that in a typical year, 10% or more of individuals incur no healthcare expenditures, while 5% of the population accounts for the majority of health expenditures.

Many methods have been proposed to deal with these challenges [3]. A common approach involves regression of log-transformed cost using ordinary least squares. Duan’s smearing estimator [5] can be used to map estimates from the log-scale back to the original scale. Other approaches include non-Gaussian generalized linear models (GLM) [2, 5, 6], accelerated failure time (AFT) models [7], hazard-based models [8–10], and quantile-based models [11]. To explicitly account for zero-inflation, researchers often use two-part models [12]. In this approach, separate models are specified for the probability of any cost, and for the mean of costs amongst individuals with positive costs. Two-part GLMs with a logit link for the binary component and a Gamma distribution with log-link for the continuous component are common in practice [13–15].

All of these methods can work well when the outcome distribution meets the corresponding model assumptions. However, practitioners are often unaware of whether this is the case for a given data set and unfortunately, there is no “one-size-fits-all” approach [16]. Moreover, many common methods rely on parametric regression models, which are commonly mis-specified in practice, and can introduce bias in the results. This has led to increased interest in machine learning techniques, which use more flexible approaches to learn about relationships in data, thereby providing, in many cases, more accurate cost predictions [17]. Specifically, penalization and ensemble approaches have been applied for modeling healthcare expenditures [18–21].

In this work, we consider super learning, an ensemble method that combines a range of candidate models also known as model stacking, for application to healthcare expenditures [22–25]. Super learning entails using cross-validation to learn the optimal combination of a collection of candidate methods. This collection could include parametric and machine learning-based algorithms. Super learning has shown benefits over a single method in several healthcare studies [26–28] including prediction of expenditures in the context of plan payment risk adjustment [19].

However, no past research on super learning has focused specifically on zero-inflated data.

In this work, we propose a two-stage super learner. We define a set of candidate methods for predicting the presence of any costs, as well as for the positive portion of the cost distribution. The full super learner library consists of all pairwise combinations of the methods for each stage. We provide an efficient implement of the two-stage super learner with tailoring for challenges associated with healthcare cost data. We compare the two-stage super learner with the regular, one-stage super learner, along with individual methods commonly used in studies of healthcare expenditures using Monte Carlo simulations under various data generating processes. In addition, we analyze data from the 2016-2017 Medical Expenditure Panel Survey (MEPS) [29, 30] and Back pain Outcomes using Longitudinal Data (BOLD) project [31]. In both cases, the two-stage super learner improved performance over existing approaches.

## Methods

### Two-stage super learner

#### Two-part model

The two-part model consists of a binary model for the probability of outcome being positive and a regression model applied to the positive subsample. Let Y denote the healthcare cost, and X denote a potentially high-dimensional vector of predictors.

Motivation for a two-part model is that the mean, *E*[*Y*|*X*], can be written as:
The two pieces of this equation can be estimated separately. The first piece, *Pr*(*Y* > 0 | *X*) is commonly modeled using logistic regression, although in principle any regression suitable for a binary outcome could be used. The second component *E*[*Y*|*Y* > 0, *X*] is commonly modeled using Gamma-family GLMs with log link functions to account for high skewness, though again any suitable regression could be used. Given the dizzying array of possible regression approaches for each component of this model, it could be quite useful for practitioners to have both a formal framework for selecting amongst the many choices and a general recommendation on an optimal strategy when the goal is to obtain high prediction accuracy.

#### Super Learning

Super learning provides a formal means of selecting a combination of algorithms that best fits the true regression function. Here, the notion of “best” refers to a cross-validated risk criterion. Suppose we have a dataset of independent observations on n individuals, *O*_{i} = (*Y*_{i}, *X*_{i}), *i* = 1, …, *n*, where *Y* is the outcome of interest and *X* is a set of covariates. Suppose we have access to a candidate prediction function , e.g., that provides a prediction for a new observation *x* based on a fitted two-stage model. We introduce the notion of the *risk of* , which provides a global summary of how well predicts outcomes *Y* based on covariates *X*. Often [32] we rely on risk measures that can be expressed as the *average discrepancy* between *Y*and the prediction made by . In other words, given a regression , we can define a *loss function* , that takes as input a particular data point (*x, y*) and returns a real number measuring the discrepancy between and *y*. A larger value of the loss indicates a further gap between prediction and truth. Common loss functions for cost data include mean squared error (MSE) for mean cost estimation, mean absolute error (MAE) for median cost estimation and “check function” [33] for quantile cost estimation (e.g. 90^{th} percentile of cost). In this study, we explicitly consider the squared-error loss function and consider mean squared error (MSE), , as our risk criterion. Given a risk criterion, we can define the optimal prediction function, say *Q*_{0} as the function that minimizes risk over all possible prediction functions.

The performance optimization problem can be equivalently defined as a statistical estimation problem. For example, it is straightforward to show that for squared error loss, *Q*_{0}(*X*) = *E*(*Y*|*X*). Thus, the task of learning the optimal (with respect to MSE) prediction function is equivalent to the task of estimating the conditional mean of cost given covariates. Super learner provides one objective and flexible strategy for accomplishing this task.

In a given problem, there are often many different individual approaches to developing a prediction function. In super learning, we call each approach an *algorithm* and refer to a pre-specified collection of algorithms as a *library*. Here, “algorithm” is used in a general sense as any means of mapping a given data set into a prediction function. We use *training* to refer to the process of applying an algorithm to data. Examples of algorithms include (i) fitting ordinary least squares regression and returning the linear predictor; (ii) performing variable screening based on a univariate significance threshold, *then* applying ordinary least squares regression; (iii) training a random forest, where tuning parameters are selected via cross-validation. The super learner library should be, to the greatest extent possible, informed by subject-area expertise, but can utilize data-driven, machine learning approaches as well [34].

Suppose we have *K* potential algorithms. The implementation of super learner involves using cross-validation to determine an ensemble of these algorithms. The process can be implemented as follows.

Train each algorithm in the library on the entire dataset yielding prediction functions .

Randomly split the data into

*V*mutually exclusive and exhaustive blocks of approximately equal size. For*v*= 1, …,*V*, define the*v*-th block as the*v*-th validation sample, and the remaining*V*− 1 blocks the*v*-th training sample.For

*v*= 1, …,*V*, train each of the*K*algorithms using the*v*-th training sample yielding prediction functions . For each observation*X*_{i}in the*v*-th validation sample obtain the prediction .Propose an ensemble of algorithms in the library indexed by

*K*-length-weight vector*α*, e.g.,Find the weights that minimize the cross-validated mean squared error of the ensemble over all possible values of

*α*, e.g., where the minimum is taken over all .The super learner is .

In step 4, many choices of ensembles could be considered. One particular choice is to build the ensemble from a set of weight vectors that each assigns a weight of 1 to a particular algorithm and 0 weight to all others. The choice of assigning weight of 1 to a particular algorithm is known as the cross-validation selector or discrete super learner since it represents the individual algorithm with the lowest cross-validated risk.

Theoretical guarantees [25, 35, 36] state that in large samples the super learner should predict costs essentially as well as or better than the best-performing single algorithm amongst the *K* considered. Because the super learner is only capable of predicting as well as the best amongst the *K* algorithms used in its construction, this theory suggests that it is advantageous to consider a diverse mixture of algorithms when constructing a super learner. In the next section, we propose a strategy for obtaining a large and varied collection of modeling approaches in the context of zero-inflated data.

#### Two-stage Super Learner

We propose a two-stage Super Learner model, wherein we propose a library for *Pr*(*Y* > 0 | *X*) *and for E*[*Y* | *Y* > 0, *X*]. The overall two-stage super learner library consists of all pairwise combinations of these two models. Assuming the stage-1 library includes *K*_{1} algorithms and the stage-2 library includes *K*_{2} algorithms, then the two-stage super learner’s “whole library” would contain *K*_{1} × *K*_{2} candidate algorithms with each one representing a specific combination of algorithms from the stage-1 and stage-2.

The implementation of the two-stage super learner is similar to that of the super learner described above. The key difference is that each of our proposed candidate algorithms consists of two constituent algorithms: one for *Pr*(*Y* > 0 | *X*) and one for *E*[*Y* | *Y* > 0, *X*]. A prediction of *Y* is obtained as the product of the predictions made by each trained algorithm. Beyond this modification in how predictions are obtained, our approach is implemented as described above and the theoretical results governing the behavior of the original super learner immediately apply to our approach.

Our approach is implemented in a freely available R package. In addition to the two-stage learning approach, our software handles additional challenges associated with healthcare expenditure data. In particular, we implement a novel cross-validation approach that ensures an approximately equal split of zeros and outliers over the folds. The package also provides a modified quadratic programming approach to calculate super learner weights in the presence of very large residual values (Appendix-A).

#### One-stage Super Learner

The standard one-stage super learner is similar to the two-stage super learner, except that it does not split the estimation problem into two parts. When zero inflation is high, this may lead to decreased performance. However, if zero inflation is not prominent in a particular dataset, it may not be necessary to estimate the probability of zero expenditures, and thus one-stage super learner may perform well. The one-stage super learner can be computationally faster than the two-stage super learner.

Since the one-stage super learner has not been widely used for healthcare cost estimation, an additional objective of this study was to examine the performance of one-stage super learner in various situations a healthcare cost analyst may encounter.

#### Monte Carlo Simulations

We evaluated the relative performance of the two-stage super learner compared to common approaches across a wide range of settings using Monte Carlo simulations. In particular, we studied how the two-stage super learner behaved in comparison with one-stage super learner and individual algorithms when varying the sample size, zero-inflation amount, non-zero distribution and complexity in the cost data.

#### Data generating process

For simulation study, categorical and continuous predictors were generated to include as many scenarios as possible and approximate covariates seen in real data.

Specifically, predictors were simulated as follows:
Variables *X*_{1}, …, *X*_{5} impacted the distribution of costs, while the others were noise. To assess how the sample size, zero percentage, non-zero distribution, and data complexity affects the estimations, we varied four aspects of the data generating process: sample size (small (500) vs. large (2000)), zero inflation percentage (low (5%) vs. high (70%)), distribution of non-zero costs (Log-normal, Gamma, Tweedie, Mixture), and data complexity (two-way interactions among covariates, Yes and No). The combination of the four settings above results in 32 scenarios. For each scenario, we analyzed 1000 simulated data sets.

Costs were simulated using a two-stage procedure to allow for point mass at zero. In the first stage, we drew a random variable *Z* from a Bernoulli distribution with the probability of zero determined by a logistic model (Appendix-B). If *Z* = 0, then we set costs to zero. If *Z* = 1, we drew the observed costs from one of the four different non-zero distributions listed above (Appendix-C & Appendix-Figure 1).

#### Prediction algorithms

A previous study [37] indicated that the choice of the stage-2 model may alter results more dramatically than stage-1. Accordingly, we focused our two-stage super learner towards diversity in algorithms for stage-2 by including 10 algorithms in the stage-2 model, while only including 3 algorithms in the stage-1 model. Additionally, we were interested in how the performance of a two-stage super learner compared to that of a standard super learner, thus we also fit a standard super learner with a library of 8 single-stage algorithms. Hence, we refer to this algorithm as the *one-stage* super learner. The algorithms considered in the simulation study include a mixture of machine learning algorithms such as random forests and parametric regressions such as GLM with Gamma distribution and log link (Appendix-Table 3). We compared the performance of each of 38 algorithms (30 combinations of two-stage algorithms + 8 individual algorithms), the one-stage super learner, the two-stage super learner built from 38 algorithms, and the *discrete super learner*, or the single algorithm in the two-stage super learner library with lowest cross-validated risk as comparisons to the two-stage super learner. For each two-stage super learner, we used *V* = 10 folds for cross-validation.

#### Evaluation metrics

Performances of prediction algorithms (individual algorithms, one-stage super learner, discrete super learner, and two-stage super learner) were evaluated on an independent test set of size *n*_{test}. Let *y*_{i} be the true outcome values on the test set, ŷ_{i} be the predicted outcome values from certain algorithms on the test set and be the mean of *y*_{i}, we use the following metrics for evaluation:

Mean Squared Error (MSE):
Relative MSE:
Coefficient of determination (R^{2}):
Relative Efficiency (RE):

### Real-world data

#### MEPS Data

We analyzed the Medical Expenditure Panel Survey (MEPS) data from 2016 to 2017 [29, 30]. MEPS is a national survey on the financing and use of medical care of families and individuals, their medical providers, and employers across the United States. Participating households were drawn from a nationally representative subsample and provided demographic information, health status, self-reported medical conditions, medical expenditure and utilization, health insurance coverage, and access to care for medical events. For some individuals, self-reported medical expenditures are supplemented with information from medical providers and insurers.

We used the longitudinal data of MEPS 2016-2017, with 2016 MEPS used as a training sample and the same individuals in the 2017 MEPS as a testing sample. We developed a prediction model for total annual healthcare expenditures based on 14 covariates that included demographics, medical conditions, and insurance characteristics (Appendix-Table 7). Total annual healthcare expenditures include out-of-pocket payments and third-party payments from all sources but exclude insurance premiums. The sample included only adults and excluded observations with missing data. The final sample contained 10,925 observations for training and 10,815 observations for testing. We used the same library of algorithms as the simulations and ten-fold cross-validation to build the two-stage super learner (Appendix-Table 3).

#### BOLD Data

The Back Pain Outcomes using Longitudinal Data (BOLD) project is a large, community-based registry of patients aged 65 years and older who presented with primary care visits for a new episode of back pain from March 2011 to March 2013 [31]. The BOLD data are derived from self-reported questionnaires and electronic health records (EHR). Expenditures were calculated as total relative value units (RVUs), a measure of value used in the US Medicare reimbursement formula for physician services [38].

In this study, we considered 4 spine-related total annual RVUs as outcomes, including 1) total spine-related RVUs; 2) spine-related physical therapy RVUs; 3) spine-related injection RVUs; 4) spine-related imaging RVUs. We developed separate prediction models of each outcome based on 24 covariates from baseline patient questionnaires and EHR, see Appendix-D & Appendix-Table 9 for details. The final sample contained 4,397 observations. Some algorithms used in our simulation study failed to converge on the BOLD data, necessitating the use of a slightly different library of algorithms for this part of the study (Appendix-Table 10). To evaluate the models, the data were randomly partitioned into ten distinct folds. Every single algorithm, one-stage super learner, discrete super learner and the two-stage super learner were trained using nine of the folds and predictions were obtained using the remaining fold. The performance was estimated as the MSE of the predictions on the validation fold, averaged over the ten validation folds.

## Results

### Monte Carlo Simulation

On average, across all settings, the two-stage super learner had the smallest MSE (Table 1), alongside discrete super learner and one-stage super learner. See Appendix-Table 4 & 5 for results under other evaluation metrics. Overall, the discrete super learner (the approach that used cross-validation to select the single best algorithm from the two-stage super learner library for each situation, as opposed to using a weighted average of all algorithms) was the second-best-performing algorithm, while the one-stage super learner was the third. The overall MSE improvements of two-stage super learner over discrete learner, one-stage super learner and the best single algorithm were about 1.9%, 4.0% and 9.1%, respectively. Other algorithms with favorable performances included combinations of the Lasso model at stage-1 with GLM (log-link & gamma distribution) at stage-2 and Zero-inflated Negative-Binomial (ZINB) model. The overall Interquartile Range (IQR) of MSE for the two-stage super learner is smaller compared to other algorithms, indicating a robust prediction performance under different data settings. The improvement of the two-stage super learner over the one-stage super learner was more evident under high zero-inflation (70% zero) with a MSE improvement of 9.7%. This was in line with expectations as using a two-stage model is likely to be more beneficial the greater the point mass at zero. When zero inflation was small (5% zero), the one-stage super learner was superior and had the smallest MSE, but the two-stage super learner was the second best-performing algorithm and outperformed the best single algorithm (ZINB) with a MSE improvement of 4.1%.

Appendix-Figure 2 presents more extensive results. Overall, we found that no single algorithm was best in all cases, and there are instances where a single algorithm provides the best performance. Practically, however, the analyst will not know which algorithm fits the data best a priori, which supports the idea of using a super learner approach. Across a large number of cases, the two-stage super learner was more robust than the individual algorithms and close to the best algorithm under all different data settings, while the performances of individual algorithms changed dramatically across different data settings.

### Empirical analysis

#### MEPS

Distributions of total expenditures in 2016 & 2017 were both zero-inflated, with almost 20% of observations having zero expenditures, as well as highly skewed, with 2% of observations expenditures over $50,000 (Appendix-Table 6, 7 & Appendix-Figure 3).

The two-stage super learner performed better than all the individual algorithms and the one-stage super learner with the lowest test-set MSE (table 2). Specifically, for MSE, the improvements of the two-stage super learner over the best single algorithm and one-stage super learner were about 0.7% and 2.2%; while for R^{2}, such improvements increased to 4.2% and 14.6%. The one-stage super learner was among the top 10 algorithms with MSE lower than any single-stage individual algorithm (table 2). Relative MSE for individual algorithms compared to the two-stage super learner ranged from 1.007 (Random Forest at both stages) to 1.192 (model with intercept only). Adding stage-1 for zero-inflation appears to improve the prediction as Random Forest and Lasso used at stage-2 of two-stage models were better than them being used solely in one-stage models. Choices of stage-2 models seem to matter more than the choices of stage-1 models as algorithms with the same stage-2 algorithms share analogous behavior albeit different stage-1 algorithms. Random Forests provided nontrivial improvements to parametric regressions with a lower MSE.

#### BOLD

Distributions of four spine-related RVUs were all highly skewed with heavy upper tails but varied in both scales and levels of zero-inflation (Appendix-Table 8, Appendix-Figure 3). Spine-related RVUs had the lowest zero-inflation (5%) while spine-related injection RVUs had the greatest zero-inflation (91%).

In these data, we again found that two-stage super learner was the best performing algorithm under all different zero-inflation levels, although improvements over the discrete learner (average MSE improvement: 0.6%), one-stage super learner (average MSE improvement: 1.0%) and the best single algorithm (average MSE improvement: 0.8%) were more modest than in the MEPS data (Table 3). Performances of single algorithms changed dramatically under different zero-inflations levels (Appendix-E). For example, the random forest had strong performances for modeling spine-related RVUs (5% zero) and spine-related imaging RVUs (55% zero), but poor performances for modeling spine-related physical therapy RVUs (85% zero) and spine-related injection RVUs (91% zero). One-stage super learner and single-stage algorithms showed good performances when zero-inflation was low, as illustrated by the fact that the one-stage super learner was the second best-performing algorithm, and single-stage Random Forest is the best single algorithm in modeling spine-related RVUs.

However, these models did not perform as well for modeling spine-related injection RVUs, in which zero inflation is greater.

## Discussion

The novel two-stage super learner algorithm provided robust performance across a wide range of situations using both simulated and real data. We also found that the one-stage super learner performed well, and in some cases when zero inflation was small, even better than the two-stage super learner. We found that some individual algorithms had a good performance in certain scenarios, but poor performance in others. Since an analyst is unlikely to know, a priori, what the best performing single algorithm will be, we believe that super learning provides a useful general approach for estimating healthcare expenditures. In other words, super learning eliminates the need for a-priori knowledge as to which single algorithm might perform best in a given application--we found that it performed well in each scenario that we studied.

Our results are consistent with and extend prior research regarding the estimation of healthcare expenditures. First, we found that many commonly used methods do perform well in particular situations. However, we did not find a single algorithm that performed well in all situations. A major contribution of this study is that the novel two-stage super learning framework worked well across a wide variety of simulated and real-world datasets. Next, we found that the one-stage super learner also generally performed well. Though one-stage super learner is commonly used for other healthcare outcomes [27, 39, 40], its use is uncommon when working with healthcare expenditure data. We found that it often performs well and may deserve wider use.

Finally, the findings of this study help distinguish situations when one may want to use one-stage versus two-stage super learner approaches. When zero-inflation is low, the one-stage super learner may be a reasonable strategy. However, in the presence of higher zero-inflation, using a two-stage super learner is likely to lead to better performance.

This study suggests several directions for future work. First, we have focused on optimizing a super learner with respect to mean squared error. We believe this is an appropriate scientific goal in many situations and in particular, in situations where super learning is used as an intermediate step in drawing inference about the average causal effect. However, for some pure prediction tasks, mean squared error may not be the most appealing measure to gauge predictions since it is very sensitive to outliers. In these settings, it may be interesting to consider super learners constructed using alternative loss functions that are more robust to large costs. Second, our approach is focused on the prediction of the overall cost. In some circumstances, the scientific question may motivate us to simultaneously consider the prediction of having any cost (i.e., the stage-1 model itself is interesting) *as well as* the cost itself if some costs are present. For such problems, an alternative approach to super learning may be more useful, wherein one develops a super learner for each component of the model. In such an approach, each of the two super learners would be optimized towards a particular goal, as opposed to here where a single super learner is fit that is optimized towards the overall goal of predicting costs.

The findings of this study should be considered in light of several limitations. First, as with all simulation studies, we could not consider all possible data generation scenarios. While we picked a robust set of scenarios that likely reflect many real-life situations, we do not know if the results we observed generalize to other data generation processes that we did not examine. Similarly, as we analyzed only two empirical data sets, and the results may not generalize to other settings. In future work, we hope to examine more diverse scenarios to further inform practitioners in the application of our methods. An additional limitation is that we focused on super learning for the sake of prediction. A natural extension would be to study the potential benefits of two-stage super learning in the context of estimating the causal effect of an intervention on a cost outcome.

## Conclusions

In this study, we introduce a novel two-stage super learner approach that is well suited to many of the common problems encountered when analyzing healthcare expenditure data. Further, we provide a more in-depth study of the application of one-stage super learner to the healthcare expenditure context. We find that the two-stage super learner provides robust performance across a wide variety of data generating processes and that one-stage super learner also performs well, especially when there is little zero inflation. Given the notorious challenges that healthcare expenditures data present to the analyst, we think that super learner approaches will be a useful tool to advance health services research.

## Repository

The data and code used to train the two-stage super learner, one-stage super learner and individual algorithms can be found at https://github.com/wuziyueemory/Two-stage-SuperLearner. Introductions to simulations and real-data analysis are provided on the above GitHub page.

## Data Availability

The data used in simulation studies of this work are from the data generating process with code openly available at https://github.com/wuziyueemory/Two-stage-SuperLearner/blob/master/Simulation/createData.R.

The data used in MEPS empirical analysis is 3rd party data from the 2016-2017 Medical Expenditure Panel Survey (MEPS) database. This 3rd party data is free online (see https://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-202).

The data used in the BOLD empirical analysis is 3rd party data from The back pain outcomes using longitudinal data (BOLD) registry. The data could be accessed through https://github.com/wuziyueemory/Two-stage-SuperLearner/blob/master/BOLD%20data%20analysis/bold%20data.csv.

https://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-202

https://github.com/wuziyueemory/Two-stage-SuperLearner/blob/master/Simulation/createData.R

## Funding

This work was funded from the National Heart Lung and Blood Institute of the United States National Institute of Health.

## Tables

## Acknowledgments

Research reported in this publication was supported by the University of Washington Clinical Learning, Evidence And Research (CLEAR) Center for Musculoskeletal Research. The CLEAR Center is supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) of the National Institutes of Health under Award Number P30AR072572. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We thank Jerry Jarvik, Sandra Johnston, and Laurie Gold for helpful comments and providing data on the BOLD analysis in this study.

## Appendix

### Technical appendix

#### A. Additional modifications of two-stage super learner

We consider two modifications on (i) the weights calculation and (ii) the cross-validation scheme to improve the two-stage super learner on predicting healthcare costs. For continuous outcome, a quadratic programming algorithm designed by Goldfarb and Idnani was generally applied to calculate the best convex combination of weights that minimize the squared error loss. However, the heavy upper tails in healthcare expenditure would result in excessive huge numbers in the matrix and vector of a quadratic function to be minimized. This consequently induces overflow errors and causes the quadratic programming constraints inconsistent. As a modification, we proposed a scaling scheme that divides the quadratic function by a large constant to shrink the huge matrix and vector in the quadratic function. The scaling would not affect the results given the raw quadratic function is just a multiple of the scaled quadratic function. Alternatively, we consider a modification to the cross-validation scheme in the standard super learning procedure. The standard V-fold cross-validation allocates subjects randomly to each block. However, this random allocation could result in all observed subjects with zero costs or large costs in the same block of data. We might expect a better finite sample evaluation of how well methods fit zero as well as large costs by evenly splitting the zero and large costs amongst the blocks. Considering the minimum of our data is zero, we propose a snake-like assignment of subjects to each block. That is, the V lowest ordered costs are assigned to blocks 1 through V, then the next V lowest ordered costs are assigned reversely to blocks V through 1, respectively. This process is repeated until all subjects have been assigned a block. By splitting up the zero and largest costs, we ensure that every time we fit the candidate methods to V-1 of the blocks, there will be an adequate number of subjects with zero and large costs in the held-out block.

#### B. Regression formula for zero-inflation & non-zero cost distribution

Costs were simulated using a two-stage procedure to allow for different zero-inflation levels and non-zero cost distributions. In stage-1, we drew a random variable *Z* from a Bernoulli distribution with the probability of zero determined by a logistic model:=
The regression formula *μ*_{β}(*X*) was mainly controlled by zero-inflation percentage and data complexity (Table 1). If *Z* = 0, then we set costs equal to zero. If *Z* = 1, we drew the observed cost from one of the four different non-zero cost distributions (Log-normal, Gamma, Tweedie, Mixture) with the corresponding parameter mean (Log-normal, Tweedie) and shape (Gamma) generated by another regression formula *μ*_{λ}(*X*), *X* = (*X*_{1}, …, *X*_{5}). The regression formula *μ*_{λ}(*X*) was mainly controlled by data complexity (Table 2).

#### C. Data generating process - distribution of non-zero costs

We generate non-zero costs from one of the four distributions (Log-normal, Gamma, Tweedie, Mixture). The mean parameter *μ* for the natural logarithm and Tweedie distributions and shape parameter *β* for the Gamma distribution was controlled by the regression formula *μ*_{λ}(*X*), *X* = (*X*_{1}, …, *X*_{5}) described in table 2 above.

#### Lognormal Distribution

The log-normal model for positive costs can be written as *In* (*y*) = *μ*_{λ}(*X*) + *ε*, where *X* = (*X*_{1}, …, *X*_{5}), *ε ∼ N*(0, *σ*^{2}), *E*(*X*^{T}*ε*) = 0. The conditional expectation of *y* given *y* > 0 can thus be written as *E*(*y* | *x*) = *exp* (*μ*_{λ}(*X*) + 0.5*σ*^{2}). We set *σ*^{2} = 0.3.

#### Gamma Distribution

The Gamma distribution has PDF where *β* is the shape parameter and *α* is the scale parameter. The mean is *αβ*. The shape parameter was set to *β* = *exp* (*μ*_{λ}(*X*)) and the scale parameter was set to *α* = 1.3.

#### Tweedie Distribution

Tweedie distributions are defined as a subfamily of exponential dispersion models (ED) with a special mean-variance relationship. A random variable *Y* is Tweedie distributed *Tw*_{p}(*μ, σ*^{2}), if its mean *μ* = *E*(*Y*) and *var*(*Y*) = *σ*^{2}*μ*^{p}, where *σ*^{2} is the dispersion parameter and *p ∈ R* is the power parameter. The PDF for the Tweedie family is complex and cannot be expressed in closed form, but the Tweedie family includes common distributions like Normal (*p* = 0), Poisson (*p* = 1), and Gamma (*p* = 2). In the simulation, the mean was set to *μ* = *μ*_{λ}(*X*), while the scale and power were set to *σ*^{2} = 1.8, and *p* = 1.5, respectively.

#### Mixture distribution

The mixture distribution is a mixture of log-normal and gamma distribution. Specifically, it was generated by first drawing a binary random variable with a pre-specified probability, *G ∼ BernouIIi*(*p*). We subsequently generated a gamma distribution if *G* ≤ *p*, or a log-normal distribution if *G* > *p*. In the simulation, the probability was set to *p* = 0.5. We wanted to evaluate the two-stage super learner in situations where a simple parametric model does not capture the true distribution, as we expect to be the case in practice.

#### D. Details of covariates used in the BOLD study

In this study, patient self-reported questionnaire responses were used to predict future expenditures (measured by RVUs) in several categories. Specifically, the covariates include the following measures from patient self-reported questionnaires collected at baseline: (1) Socio-demographics (age, sex, race, ethnicity, education, employment status, etc.); (2) Pain-related characteristics (back/leg pain duration, back/leg pain intensity, modified Roland-Morris Disability Questionnaire, Brief Pain Inventory Activity Interference Scale); (3) PHQ-4 measure of anxiety and depressive symptoms; (4) European Quality of Life 5 Dimension (EQ5D) index and Visual Analog Scale; (5) Number of falls; and (6) Recovery expectation. Besides, we also include the Quan comorbidity score, baseline diagnosis, and total RVUs at one year before index visit from EHR as covariates.

#### E. Extended results of BOLD analyses

Individual algorithms that perform well in one setting will not necessarily perform well in other settings. The best individual algorithms in modeling 4 spine-related RVUs changed dramatically. One-stage super learner worked better in modeling outcomes with low to medium zero-inflation, with performance ranked 2^{th} in modeling spine-related RVUs (5% zero) and 6^{th} in modeling spine-related imaging RVUs (55% zero). One-stage individual algorithms had good behavior likewise when zero-mass were not serious, especially in low zero-inflation situations where the one-stage Random Forest beat all two-stage individual algorithms in modeling spine-related RVUs (5% zero). Baseline socio-demographic and patient-reported outcomes may have signals for predicting various spine-related RVUs in a year after index, with a cross-validated *R*^{2} ranging from 6.2% to 27.8% for the two-stage super learner. The performance was especially great for modeling spine-related physical therapy RVUs. MSE varied significantly in modeling different spine-related RVUs, partly due to the difference in terms of the scale of RVUs. The results also suggested that there may exist redundancy in the covariates as Lasso was always among the best-performed algorithms in modeling spine-related RVUs with different zero-inflations.