Abstract
Objective To improve the estimation of healthcare expenditures by introducing a novel estimation method that is well-suited to situations where data exhibit strong skewness and zero-inflation.
Data Sources Simulations, and two sources of real-world data: the 2016-2017 Medical Expenditure Panel Survey (MEPS) and the Back Pain Outcomes using Longitudinal Data (BOLD) datasets.
Study Design Super learner is an ensemble machine learning approach that combines several algorithms to improve estimation. We propose a two-stage super learner that is well suited to healthcare expenditure data: it separately estimates the probability of any healthcare expenditure and the mean amount of expenditure conditional on having any expenditure. The two estimates are then combined to yield a single estimate of expenditures for each observation. The method can flexibly incorporate a range of individual estimation approaches at each stage, including both regression-based approaches and machine learning algorithms such as random forests. We compared the performance of the proposed two-stage super learner with that of a one-stage super learner and of multiple individual algorithms for estimating healthcare cost under a broad range of data settings in simulated and real data. Predictive performance was compared using mean squared error (MSE) and R².
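The two-stage combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single categorical covariate and uses simple stratum means as the only learner at each stage, whereas the proposed method combines several candidate learners per stage. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_two_stage(groups, costs):
    """Fit a toy two-stage estimator with one categorical covariate.

    Stage 1 estimates P(cost > 0 | group) as the share of positive costs.
    Stage 2 estimates E[cost | cost > 0, group] as the mean of positive costs.
    """
    n = defaultdict(int)
    n_pos = defaultdict(int)
    sum_pos = defaultdict(float)
    for g, y in zip(groups, costs):
        n[g] += 1
        if y > 0:
            n_pos[g] += 1
            sum_pos[g] += y
    model = {}
    for g in n:
        p_pos = n_pos[g] / n[g]
        # If a group has no positive costs, its conditional mean is set to 0.
        mean_pos = sum_pos[g] / n_pos[g] if n_pos[g] else 0.0
        model[g] = (p_pos, mean_pos)
    return model


def predict_two_stage(model, groups):
    """Combine the stages: P(cost > 0 | x) * E[cost | cost > 0, x]."""
    return [model[g][0] * model[g][1] for g in groups]


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up, zero-inflated and right-skewed costs (illustration only).
groups = ["young", "young", "young", "young", "old", "old", "old", "old"]
costs = [0.0, 0.0, 0.0, 400.0, 0.0, 1000.0, 3000.0, 8000.0]

model = fit_two_stage(groups, costs)
preds = predict_two_stage(model, groups)
print(model)
print(preds, mse(costs, preds), r_squared(costs, preds))
```

In a fuller version, each stage would be replaced by its own ensemble of candidate learners, and the two performance metrics would be computed on held-out data rather than on the training sample used here.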
Principal Findings Our results indicate that the two-stage super learner performs better than a one-stage super learner and individual algorithms for healthcare cost estimation under a wide variety of settings in both simulations and empirical analyses. The improvement of the two-stage super learner over the one-stage super learner was particularly evident in settings where zero-inflation was high.
Conclusions The two-stage super learner provides researchers with an effective approach for healthcare cost analyses in settings where the best single algorithm cannot be known a priori.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work is funded under award 1R01HL137808 from the National Heart, Lung, and Blood Institute of the United States National Institutes of Health.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
no IRB
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
The data used in the simulation studies were generated by a data-generating process whose code is openly available at https://github.com/wuziyueemory/Two-stage-SuperLearner/blob/master/Simulation/createData.R.
The data used in the MEPS empirical analysis are third-party data from the 2016-2017 Medical Expenditure Panel Survey (MEPS) database. These data are freely available online (see https://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-202).
The data used in the BOLD empirical analysis are third-party data from the Back Pain Outcomes using Longitudinal Data (BOLD) registry. The data can be accessed at https://github.com/wuziyueemory/Two-stage-SuperLearner/blob/master/BOLD%20data%20analysis/bold%20data.csv.