Use of machine learning for comparing disease risk scores and propensity scores under complex confounding and large sample size scenarios: a simulation study

Background: The surge of treatments for COVID-19 in the ongoing pandemic presents an exemplar scenario with low prevalence of a given treatment and high outcome risk. Motivated by that, we conducted a simulation study for treatment effect estimation in such scenarios. We compared the performance of two methods for addressing confounding during the process of estimating treatment effects, namely disease risk scores (DRS) and propensity scores (PS) using different machine learning algorithms. Methods: Monte Carlo simulated data with 25 different scenarios of treatment prevalence, outcome risk, data complexity, and sample size were created. PS and DRS matching with 1: 1 ratio were applied with logistic regression with least absolute shrinkage and selection operator (LASSO) regularization, multilayer perceptron (MLP), and eXtreme Gradient Boosting (XgBoost). Estimation performance was evaluated using relative bias and corresponding confidence intervals. Results: Bias in treatment effect estimation increased with decreasing treatment prevalence regardless of matching method. DRS resulted in lower bias compared to PS when treatment prevalence was less than 10%, under strong confounding and nonlinear nonadditive data setting. However, DRS did not outperform PS under linear data setting and small sample size, even when the treatment prevalence was less than 10%. PS had a comparable or lower bias to DRS when treatment prevalence was common or high (10% - 50%). All three machine learning methods had similar performance, with LASSO and XgBoost yielding the lowest bias in some scenarios. Decreasing sample size or adding nonlinearity and non-additivity in data improved the performance of both PS and DRS. Conclusions: Under strong confounding with large sample size DRS reduced bias compared to PS in scenarios with low treatment prevalence (less than 10%), whilst PS was preferable for the study of treatments with prevalence greater than 10%, regardless of the outcome prevalence.


INTRODUCTION
Given the increasing abundance of routinely collected data from clinical practice, the use of observational data for understanding associations between exposures (hereafter treatments) and outcomes is on the rise [1]. However, one of the key challenges associated with observational data is the presence of known and unknown sources of confounding [2]. Propensity score (PS) analysis is one of the most popular approaches for mitigating confounding effects, where the PS for a given individual represents the probability of receiving a treatment of interest conditional on known confounders. Several PS methods have been tested to reduce confounding effects such as matching, stratification, and weighting [2] [3].
Disease risk score (DRS) analysis presents an alternative approach, where the DRS for a given individual is the predicted probability of the outcome conditional on the confounders. Let denote treatment, denote outcome, and denotes confounders, PS can be expressed as ( = 1| ) while DRS is ( = 1| = 0, ). PS is a function treatment whereas DRS is independent of treatment, this property motivated us to test its performance against PS under different treatment prevalences. While PS is well known and widely applied, DRS is relatively less widely applied in epidemiological literature. As such DRS could be potentially advantageous in scenarios with low treatment prevalence, where PS has been shown to underperform [4].
Although PS and DRS have been compared in specific scenarios with multiple confounders [5], little is known overall and the comparative performance of the two approaches -particular in cases of strong confounding and rare treatment prevalence -remains unclear [5] [6] [7] [8].
Further, while recent research has investigated machine learning methods for estimation of PS [9] [10] [11] [12] [13] [14] [15], the calculation of DRS has exclusively been performed using logistic regression [5]. According to some studies [9] [10] [14] [15], machine learning may be better suited to scenarios that involve large sample sizes and high data complexity. However, to our knowledge, a comparative performance of machine learning methods under such scenarios for PS and DRS has not been done.
Therefore we aimed to assess the performance of different machine learning methods for PS and DRS estimation under a range of common real-world scenarios, namely different treatment prevalence and different outcome risk scenarios in the presence of strong confounding.
The main contributions of this paper are to demonstrate: • the comparative performance of PS and DRS in case of different treatment prevalence especially low treatment prevalence and under strong confounding, • the effects of data complexity and sample size on the accuracy of PS and DRS in estimating treatment effects, and • the comparative performance of different machine learning methods in PS and DRS estimation.

METHODS
A Monte Carlo simulation study was conducted. The simulation settings of nonlinearity and nonadditivity in [15] were followed and adapted for the various scenarios under investigation.

Data Simulation
We simulated datasets using two different strategies: 1. independent covariates without nonlinearity and nonadditivity, shown in the figure on left of Figure 1. 2. Covariates with nonlinearity and nonadditivity, shown in the figure on right of Figure 1. Specifically, for the second simulation strategy, we included 15 (30% of the total amount of confounders) two-way interactions for the nonadditive terms and 5 cubic terms for the nonlinear terms. For both linear and nonlinear data settings, the dataset included 50 confounders that affected both treatment and outcome, 5 instrumental variables that affected only treatment, and 1 risk factor that only affected the outcome. Details of simulation settings are shown in Table 1 and in the Appendix.
In total, 25 different scenarios were tested, as shown in Table 1. Different treatment prevalence ratios, outcome risk percentages, sample size, data complexities were examined. These scenarios were investigated using different settings (Table 1) including "strong confounding" and "large sample size" settings, defined as follows. Previous work typically included 10 confounders, so here we included 50 linear and non-linearly distributed confounders to simulate a "strong confounding" setting. Further, when testing large sample sizes, previous work [16] [17] Use of machine learning for comparing disease risk scores and propensity scores under complex confounding and large sample size scenarios: a simulation study • 3

Propensity Score Settings
PS represents the probability of receiving treatment of interest conditional on confounders, and many PS methods have been tested to reduce confounding effects such as matching, stratification, and weighting [2] [3]. Traditionally, PS is calculated using logistic regression with pre-selected confounders. In this study, we used machine learning methods LASSO, XgBoost, and MLP to estimate PS. Target variable was set to treatment and all 50 confounders, 5 instrumental variables and 1 risk factor were input into machine learning model. After obtaing PS, a matched cohort was obtained with 1:1 matching ratio with greedy nearest neighbour matching and without replacement.

Disease Risk Score Settings
DRS was first proposed by Miettinen in 1976 [18] and it is the probability of getting potential outcome conditional on confounders if untreated. It can be computed in two different ways [5]: from the unexposed population or the full cohort. Full cohort DRS is obtained by regressing outcome to covariates and treatments using the entire study population ∼ , , and then fitted values ( = 1| = 0, ) of full population are computed by setting treatment status to unexposed = 0. Unexposed DRS is computed by firstly regressing outcome to covariates ∼ only for unexposed population, which was done using machine learning methods as described above. Then fit the model to the entire population, fitted values ( = 1| ) are the unexposed DRS. Same as PS, 1:1 matching was used with greedy nearest neighbour matching and without replacement.
Compared to unexposed DRS, full cohort DRS has shown better performance in terms of bias when estimating treatment effect [5], therefore it was used in this work.

Machine Learning Methods
To estimate PS and DRS, we applied three different machine learning methods namely least absolute shrinkage and selection operator (LASSO) regularization, multilayer perceptron (MLP), and eXtreme Gradient Boosting (XgBoost). We input all variables in the simulated dataset into our machine learning models. The target variable for estimating PS was set as treatment while the target variable for estimating DRS was set as the outcome. All machine learning methods were hyperparameter tuned using 10-fold cross validation (see Appendix for details). Matching was used to obtain the estimated treatment effect.

Estimands and Metrics
In either PS or DRS, a matched cohort was obtained with 1:1 matching ratio with greedy nearest neighbour matching and without replacement. The estimated treatment effect was calculated in the matched cohort via a logistic regression with only treatment as the predictor variable. The true treatment effect was calculated via a logistic regression with treatment, 50 confounders and one risk factor as the predictor variables. To evaluate the treatment effect estimation, we used relative bias, which was defined as: Estimated treatment effect − True treatment effect True treatment effect We also used the corresponding 95% confidence interval to evaluate the results.

Effect of Treatment Prevalence
Comparison between PS and DRS regarding treatment prevalence are shown in Figure 2. The relative bias for PS increased with decreasing treatment prevalence, regardless of the machine learning methods used. Secondly, DRS increasingly outperformed PS with the decreasing treatment prevalence. We observed that the treatment effect estimated by DRS had a much lower bias compared to PS when having nonlinear data and treatment prevalence . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted February 4, 2022. ; https://doi.org/10.1101/2022.02.03.22270151 doi: medRxiv preprint was lower than 10%, for all three machine learning methods. For example, DRS estimated by LASSO yielded 42.43% bias whereas PS estimated by LASSO resulted in 120.71% bias, under scenarios 6-10, when treatment prevalence was 5%. This was not true in the case of linear data, leftmost plot in Figure 2. Here PS consistently outperformed DRS, although the performance of DRS was similar to PS with the decrease of treatment prevalence. We also found that when outcome risk was rare (2%) or high(50%) and treatment prevalence was between common and high (10% -50%), there was at least one machine learning method estimated PS can achieve lower or comparable bias compared to DRS. For instance, referring to Figure 2, scenarios 11-15 when treatment prevalence was between 10% and 50%, all PS results (dashed lines) were lower than DRS in solid line. Scenarios 16-20 when treatment prevalence was between 10% and 50%, PS estimated by method XgBoost and MLP were achieving lower bias in the dashed line below the DRS (solid lines).
In general, results show that DRS was resulted in less bias in estimation of treatment effect when both treatment prevalence and outcome risk were relatively rare, whereas PS performed better (i.e. resulted in less bias) when treatment prevalence was common, under our setting of strong confounding and nonlinear nonadditive data.

Effect of Data Complexity
Regarding the effect of nonlinearity and nonadditivity on estimation results, Figure 4 compares how PS and DRS were affected by the nonlinearity and nonadditivity. For all methods, the bias generated by nonlinear data . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted February 4, 2022. ; https://doi.org/10.1101/2022.02.03.22270151 doi: medRxiv preprint using PS was higher than the linear one. The difference was even more significant when the treatment prevalence was lower than 10%. For example, when treatment prevalence was 1%, the relative bias for nonlinear PS was 326.98% while the linear data generated 168.98%. Similarly, for DRS, nonlinear data introduced more bias into DRS results especially when treatment prevalence was lower than 10%.

Effect of Sample Size
Considering how sample size affected treatment effect estimation bias, Figure 3, when decreasing sample size from 10000 to 500, treatment effect estimation bias increased regardless of the machine learning method. However, the effect on DRS was more significant than PS, when treatment prevalence was 1%: using XgBoost, LASSO, and MLP, reducing sample size from 10000 to 500 increased bias of DRS respectively from 65.35% to 184.03%, 86.36% to 184.26%, and 103.38% to 160.59%. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted February 4, 2022. ; https://doi.org/10.1101/2022.02.03.22270151 doi: medRxiv preprint Use of machine learning for comparing disease risk scores and propensity scores under complex confounding and large sample size scenarios: a simulation study • 7

Comparison of Machine Learning Methods
Three machine learning methods were compared too, bias generated by machine learning estimated PS and DRS are shown in Figure 5. For all the scenarios tested, three machine learning methods had similar performances, especially considering the confidence interval overlapped. However, our data suggested that there were scenarios that either XgBoost or MLP can generate the lowest bias. We observed XgBoost was giving the lowest bias when decreasing treatment prevalence lower than 10% compared to other machine learning methods, under nonlinear data with the large sample size 10000 and 50% outcome risk: At 1% treatment prevalence, PS and DRS estimated by XgBoost were giving 139.23% and 72.98% bias respectively, whereas PS and DRS estimated by LASSO were giving 158.98% and 106.36% or estimated by MLP were giving 235.23% and 175.31%; At 5% treatment prevalence, PS and DRS estimated by XgBoost were giving 97.05% and 80.31% bias respectively whereas PS and DRS estimated by LASSO were giving 118.71% and 125.43% or estimated by MLP were giving 105.96% and 102.76%. We also observed that LASSO generated the lowest bias, under 10%-50% treatment prevalence, linear data with the small sample size 500, 50% outcome risk, for both PS and DRS. For instance, referring to scenarios 1-5, at 10% treatment prevalence, PS and DRS estimated by LASSO were generating 58.35% and 67.95% relative bias respectively, whereas PS and DRS generated by XgBoost were 90.23% and 60.94% and generated by MLP were 64.31% and 72.30%. For other case scenarios, we did not observe a significant advantage of XgBoost over LASSO or vice versa with overlapping confidence intervals.

DISCUSSION
We conducted a simulation study to 1) evaluate the performance of DRS and PS for measured confounders adjustment; 2) to assess the performance of three different machine learning algorithms in the generation of PS or DRS.
For nonlinear data under strong confounding settings, we observed that when outcome risk was high (50%) or relatively rare (2%), DRS outperformed PS when treatment prevalence was relatively rare (lower than 10%). This finding was as could be expected since models used for PS estimation may fail to capture nonlinearity and nonadditivity with the target variable (treatment prevalence) being imbalanced. We also found that when outcome risk was relatively rare (2%) or high (50%) and treatment prevalence was between common and high (10% -50%), PS resulted in less bias compared to DRS. Previous research by Xu et al. [6] compared PS and DRS using . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted February 4, 2022. ; https://doi.org/10.1101/2022.02.03.22270151 doi: medRxiv preprint Type 1 error and empirical power, which are other metrics that could be used in simulation studies. Type I error represents the probability of incorrectly rejecting the true null hypothesis of no treatment effect when it's true. Empirical power is the probability of not making a Type II error, which is not rejecting the null hypothesis when it's false. Our result was consistent with the results demonstrated by Xu et al., although that study compared PS and DRS when the treatment was common and the outcome was moderately rare or common. They found that for 1:1 matching PS outperformed DRS, as PS had lower type I error. However, their study did not have scenarios where DRS can be advantageous over PS e.g. when both treatment prevalence (lower than 10%) and outcome risk (2%) were relatively rare, as shown by our results. This could be caused by the difference in our simulation settings as well as the evaluation metric used.
Sample size also affected our estimation bias. With only 500 observations, 50 confounders, and nonlinear data setting, our data suggested that PS resulted larger bias when outcome risk was 2% and treatment prevalence was lower than 10%. This can be explained by the number of events per variable being too low, as we had 50 confounders, and when treatment prevalence was lower than 10%, we had less than one event per confounder. As some previous research demonstrated, 10 events per variable may not even be enough [19]. A previous research comparing DRS and PS [5] also demonstrated that for logistic regression, when reducing the number of events per covariate, bias increased for both PS and DRS. We also observed that sample size affected DRS more than PS. This could be expected as modelling the DRS introduced additional complexity as it requires accurately modeling the relation between treatment and outcome. Therefore, decreasing sample size could make it more difficult for DRS than PS.
Among the three machine learning methods we tested, LASSO, XgBoost, and MLP, all of them had similar performance when we had complicated nonlinear data settings. Previous review [10] found that boosting method was the most promising machine learning method for PS analysis, which was consistent with our observation that XgBoost as a boosting method outperformed the other two machine learning methods under many scenarios. We observed XgBoost as the best model when having nonlinear data, large sample size, and low treatment prevalence. We also found that LASSO was the best model when having linear data, small sample size, and common treatment prevalence. In all other cases, our data supported LASSO and XgBoost or ensemble these two methods in super . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted February 4, 2022. ; https://doi.org/10.1101/2022.02.03.22270151 doi: medRxiv preprint learner [20]. High dimensional propensity score [13] is also a popular method to estimate PS. It includes a process of ranking covariates by their prevalence and univariate association with the outcome and/or the treatment, and it requires a specific pre-selected number of covariates. However, our approach was to allow machine learning to self-select among potentially large numbers of confounders. Additional machine learning methods such as outcome adaptive LASSO [21] and highly adaptive LASSO [22] have also been illustrated to have competitive performance compared with other popular machine learning techniques like gradient boosted machine [22], however, they were mostly compared on small scale data.

CONCLUSIONS
We compared the performance of less-known DRS with widely-used PS for estimating treatment effect in the presence of strong confounding and large sample sizes. We computed both PS and DRS using three well-established machine learning methods. These data suggest that DRS could be a preferable method compared to PS when treatment prevalence was lower than 10%, under strong confounding and nonlinear data scenarios. PS had competitive performance compared to DRS when treatment prevalence was between common and high (10% -50%). Sample size was important when having strong confounding in data, where small sample size may lead to inaccurate estimation for both DRS and PS, and it affected DRS more than PS. Nonlinearity and nonadditivity in the data introduced more bias in both PS and DRS.

Hyperparamters.
Hyperparameters tuned for LASSO was shrinkage coefficient . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted February 4, 2022. ; https://doi.org/10.1101/2022.02.03.22270151 doi: medRxiv preprint Hyperparamters tuned for XgBoost were: learning rate, minimum split loss, maximum depth of a tree, minimum sum of instance weight (hessian) needed in a child, subsample ratio of the training instances Hyperparamters tuned for MLP were: number of layers, number of units per layer, activation function, optimizer.
Simulation details.