## Abstract

Epidemiological studies often analyze data as static, essentially averaging observed associations across time. Overlooking time trends is especially problematic in settings subject to rapid changes. A prominent example for such a setting is antibiotic resistance, which has reached concerning levels, and poses a global healthcare challenge. Bacteria constantly evolve and hence antibiotic resistance is characterized by time-varying relationships with clinical and demographic covariates. In this paper, we speculate that covariates with a causal effect are expected to have stable relationships with resistance over calendar time. To this end, we applied time-varying coefficient models to a large clinical dataset from an Israeli hospital, and have shown their advantages in describing covariate-resistance relationships. We found both time-stable and time-varying covariate-resistance relationships. These results serve as initial evidence towards causal interpretation of these relationships, as one may expect time-stable rather than time-varying relationships to correspond with causal effects. We further conducted data-driven simulations, that have illustrated how results from time-varying coefficient models must be carefully interpreted with respect to causal claims. Potentially, identification of causal covariate-resistance relationships can lead to new medical interventions and healthcare policies, and improve the generalization of existing predictive models for antibiotic resistance.

## Introduction

Antibiotic resistance is a major global healthcare concern. According to a 2019 report by the Centers for Disease Control and Prevention, in the US alone at least 2.8 million people suffer from antibiotic resistant bacterial infections each year, of which approximately 35,000 result in death (1). Antibiotic use and over-prescription have led to high global resistance levels, prompting the World Health Organization to declare antibiotic resistance a global health crisis, which also threatens the global economy (1,2).

Antibiotic resistant infections emerge through complex biological processes, shaped by evolutionary forces and bacterial population dynamics (3,4). The interplay between these processes and epidemiological-level covariates is still poorly understood. Moreover, estimates of epidemiological-level effects are likely to vary with time, due to the rapid dynamics of bacterial populations in response to environmental changes. For example, resistance frequencies are constantly changing as a result of selective pressure of antibiotic use in medicine (5–8) or agriculture (8–11), demographic changes such as urbanization (12,13) and healthcare-related policies (14–16).

The causes of antibiotic resistance have been the focal point of much research. Randomized controlled trials, which can help unravel these causes, are in many cases difficult and impractical to perform in the context of antibiotic resistance. Hence, many of the studies that attempted to identify the causes of antibiotic resistance were observational. However, observational studies are subject to confounding, which limits their ability to determine if a risk factor causes antibiotic resistance or is only associated with it. Nonetheless, there have been some attempts, including by our group, to overcome confounding from observational studies to estimate the effect of antibiotic use on resistance (e.g. (17–20)).

In this work, we showcased that estimates of time-stable relationships (over calendar time) between clinical and demographic covariates and resistance, can provide evidence towards causality. If a covariate has a causal effect (or no effect at all) on an outcome variable, one may expect the relationship between them to be time-stable rather than time-varying. The estimated relationship estimate is not necessarily an estimate of the causal effect of the covariate on resistance, but rather evidence towards the existence of a causal relationship. We emphasize that this paper is concerned with calendar time-varying mechanisms resulting from the dynamic nature of infectious diseases. It is not concerned with time-varying treatments and confounders within an individual.

Demonstrating these claims empirically, we estimated temporal relationships between hospitalized patients’ bacterial antibiotic susceptibility test results and their corresponding electronic medical records during 2016-19. We modelled these temporal relationships by logistic time-varying coefficient models. Such models are a private case of Generalized Additive Models (GAM) (21,22), that allows flexible modelling of non-linear covariate-outcome relationships in time. Previous research used GAMs in the context of antibiotic resistance. Two such papers dealt with the relationship between population-level antibiotic use and the emergence of resistance (23,24), while another paper utilized GAMs, but not time-varying coefficient models, to model non-linear relationships between hospitalization duration and resistance (25).

We compared our results to standard time-fixed logistic regression models, and explored the differences between them. Our results presented significant time trends in the relationships between risk factors and resistance otherwise overlooked by standard models. Temporal relationships between risk factors and resistance that were estimated as stable, coincided with prior knowledge deeming them causal. On the other hand, temporal relationships estimated as time-varying corresponded to prior knowledge suggesting they were not causal. Finally, we conducted data-driven simulations that demonstrated the need for careful interpretation of obtained time-varying coefficients, as various plausible causal scenarios may give rise to observed time trends.

## Methods

### The motivating data

The dataset used in this paper includes electronic records of all patients who had a positive bacterial culture between 2016-19 in Meir hospital, Kfar Saba, Israel. The data originated from two types of records: Bacterial antibiotic susceptibility test results, and their corresponding patients’ demographic and clinical data. Table 1 presents summary statistics of key patient covariates, stratified by resistance test result (susceptible/resistant), and Tables S1 and S2 describe in detail and present summary statistics of all the covariates used in our analysis.

Our analysis focused on resistance to gentamicin. Susceptibility testing to gentamicin was commonly performed, yielding a resistance rate of 15.6%, thus providing us a large sample with satisfactory event rate. Seven bacterial species commonly isolated in hospitalized patients *(Eescherichia coli, Pseudomonas aeruginosa, Klebsiella pneumoniae, Proteus mirabilis, Enterobacter cloacae, Morganella morganii and Citrobacter koseri)* were chosen for the analysis, since each had at least 200 resistance results to gentamicin. To account for the heterogeneity of the species, a categorical covariate indicating the bacteria’s genus was incorporated in all our models. In total, 11,592 gentamicin resistance results from the above-mentioned bacterial species were sampled between 2016-19 and included in our analysis.

### Statistical methods

Temporal relationships between patients’ clinical and demographic covariates, and resistance, were modelled by logistic time-varying coefficient models, which are an extension of standard logistic regression models (21,22). Unlike standard logistic regression models, time-varying coefficient models allow for variation of covariate-outcome relationships across time by replacing some of the time-fixed coefficients with non-parametric functions of time. These models take the form
where *Y*_{it} is the gentamicin resistance, and *X*_{it} = (*X*_{1it}, …, *X*_{pit}) is the vector of *p* covariates of the *i*^{th} observation, both measured at time *t*, which is the time when the bacterial culture was drawn from the patient (21,22). We henceforth omit the index *t* from *X*_{i} and *Y*_{i} for simplicity of presentation. The parameters of the model are the intercept *β*_{0}, the time-fixed coefficients *β*_{1}, …, *β*_{l}, and the time-varying coefficients *β*_{l+1}(*t*), …, *β*_{p}(*t*). The model’s interpretation is that is the adjusted odds ratio (OR) between the resistance and a unit change in the *j* ^{th} variable at time *t*.

We modelled *β*_{l+1}(*t*), …, *β*_{p}(*t*) by penalized cubic splines with 40 equi-spaced knots, allowing them to be smooth functions of time. Smoothing parameters were estimated via Restricted Maximum Likelihood (REML), and subsequently *β*_{0} and ** β** = (

*β*

_{1}, …,

*β*

_{l},

*β*

_{l+1}(

*t*), …,

*β*

_{p}(

*t*))

^{T}were estimated by minimizing the penalized model deviance (26).

Penalized splines can be expressed from a Bayesian perspective, which allows deriving credible intervals for them (26), while having frequentist coverage probabilities (27–29). Models including different sets of time-varying and time-fixed covariates were compared by a modified version of the Akaike Information Criterion (AIC) that accounts for the uncertainty in the smoothing parameter estimation (30,31). The analysis was performed in R version 3.6.1 using the packages *mgcv* version 1.8-28 (28), *tableone* version 0.10.0 and *mgcViz* version 0.1.6.

The potential dependence between resistance results of bacteria from the same patient was addressed using random intercepts. The incorporation of random intercepts did not change the results substantially, and therefore the results of random-intercept models are presented in the supporting information.

### Data-driven simulations

Estimated time-varying covariate-outcome relationships might misrepresent the true underlying causal effect. To further study this, two data-driven simulation studies were conducted. The simulations were designed with a confounding mechanism that affects coefficient estimates while following a hypothetical yet relevant clinical scenario.

In each simulation study, a Data Generating Mechanism (DGM) was designed to set up a scenario illustrating the challenges of interpreting estimated time-varying coefficients. The DGM determined hypothetical relationships between the 20 covariates previously included in our models and the outcome variable of gentamicin resistance. The simulations were data-driven in the sense that the values of the covariates in the DGM were taken from our data. Hence the simulated data structure, including the sample size, were comparable to our data. In each simulation scenario, all but one of the coefficients used to simulate the covariate-outcome relationships were the estimates from the models fitted to the original data. In addition, a hypothetical binary covariate denoted *Community use* was created and included in the DGM of both scenarios, and its values were sampled in each iteration of the simulation. We did not have access to data about antibiotic use outside the hospital. Therefore, we designated *Community use* to hypothetically indicate whether a patient was prescribed antibiotics outside the hospital during the year prior to their susceptibility test. The conditional probability of *Community use* was determined using Bayes’ theorem as detailed in the Results section.

Each simulation study consisted of 500 iterations. In each iteration, 11,592 gentamicin resistance results were drawn from the DGM. Then, a time-varying coefficient model with one time-varying coefficient (keeping the rest of the coefficients time-fixed(was fitted using the actual covariates in the data, the simulated values of Community use and resistance results. Finally, the average of the 500 time-varying coefficient estimates alongside their empirical 2.5% and 97.5% quantiles were calculated, as well as the average estimate and empirical 2.5% and 97.5% quantiles from an analogous standard (time-fixed) logistic regression model fitted to the same simulated data.

## Results and discussion

### Time-varying coefficient models applied to clinical data

Using our clinical data, we modelled the probability of resistance to gentamicin as a function of 20 covariates that were selected based on prior knowledge (32), and are detailed in Tables S1 and S2. We fitted one standard logistic regression model, and four logistic time-varying coefficient models, each with a different single time-varying coefficient.

The four coefficients allowed to vary in time correspond to different patient-level clinical and demographic information: whether the patient’s culture was drawn at the Emergency Room (ER) (*Culture drawn at ER*); the patient’s sex (*Male*); hospital use of relevant antibiotics during the prior 6 months (*Used antibiotics*), and whether the patient had a Methicillin-resistant *Staphylococcus aureus* (MRSA) infection during the prior year (*Had MRSA)*. Relevant antibiotics were considered those previously shown to have direct cross-resistance links with gentamicin (19). We correspondingly refer to these four models as the *Culture drawn at ER* model, *Male* model, *Used antibiotics* model and *Had MRSA* model. For example, in the *Culture drawn at ER* model, the coefficient of *Culture drawn at ER* was allowed to vary in time, as expressed by having *β*_{1}(*t*) (and not *β*_{1}) in the following model equation,
where *i* stands for the *i*^{th} observation, *Y*_{i} is the resistance, *X*_{i} is the vector of all covariates, *β*_{0} is the intercept, *β*_{1}(*t*), *β*_{2}, *β*_{3}, *β*_{4} are the coefficients of the listed covariates, and and are the vectors of the rest of the 16 coefficients and covariates, respectively.

Table 2 and Figure 1 present estimated coefficients of selected covariates from the four time-varying models and from a standard logistic model using the same 20 covariates. The AIC values (Table 2) suggest that allowing the coefficients of *Culture drawn at ER* and *Male* to vary with time is preferable to keeping them constant. The estimates of these coefficients vary substantially with time (Figure 1), suggesting that perhaps both should be allowed to simultaneously vary with time. Indeed, a model that allowed the coefficients of both *Culture drawn at ER* and *Male* to vary with time had an AIC score of 8970.2, a score lower than the AIC scores of the other five models we fitted (Table 2). The estimates from this model were similar to the estimates from the *Culture drawn at ER* and *Male* models (data not shown). The estimated coefficients which were kept time-fixed in all models were consistent with prior evidence (32) (Tables S3-S7).

Considerable differences were observed between some of the estimated coefficients in the time-varying models and the standard logistic model (Figure 1A-B). While the standard logistic model yielded a positive coefficient estimate for *Culture drawn at ER*, the *Culture drawn at ER* model estimated it as positive only during 2016-18, with zero outside the credible interval only during 2016 to mid-2017, and with a decreasing linear trend towards the null/negative values with time (Figure 1A). Note that the obtained linear shape in Figure 1A was not a-priori imposed but resulted from the REML fitting procedure. The estimated *Male* coefficient also fluctuated with time in the *Male* model, but unlike the coefficient of *Culture drawn at ER*, no clear trend was observed (Figure 1B). When a random intercept was included in the *Male* model, a clearer mildly increasing trend was observed (Figure S1B). Of note is that both the time-varying and the standard logistic model estimated the coefficient of *Male* as positive during 2016-19.

On the other hand, the estimates of the time-varying coefficients in the *Used antibiotics* and *Had MRSA* models were similar to their corresponding estimates in the standard time-fixed logistic model, both in terms of point estimates and uncertainty bounds (confidence intervals for the time-fixed logistic model and credible intervals for the logistic time-varying coefficient models). With that being said, the estimated time-varying coefficient of *Had MRSA* slightly increased during 2016-2017 and then slightly decreased until the end of the study period.

The above-described results align with the arguments we pose in this work. Both *Used antibiotics* and *Had MRSA* are biologically plausible causes for antibiotic resistance. Prior antibiotic use causes resistance via evolutionary selective pressure (33), while a previous MRSA infection can imply remnants of resistant bacteria (34,35). Thus, the estimated coefficients of *Used antibiotics* and *Had MRSA* demonstrate how time-stable relationships may align with causal interpretation. Conversely, it is biologically unlikely that *Culture drawn at ER* and *Male* are direct causes of resistant infections. Therefore, the estimated time-varying relationships of these covariates with resistance are more plausibly explained by confounding.

We now turn to describe and illustrate such confounding mechanisms via data-driven simulations.

### Exploring drivers of time-varying coefficient estimates via simulations

#### First simulation study: Recreating a biased time-varying coefficient estimate

In the first simulation study, we recreated the time-varying coefficient estimate of *Culture drawn at ER* from the *Culture drawn at ER* model (Figure 1A), even though data were simulated such that *Culture drawn at ER* had no causal effect on resistance. That is, the underlying relationship was null. In this hypothetical scenario, the relationship between *Culture drawn at ER* and resistance was confounded by *Community use*, in a time-varying manner. In our scenario, we can hypothesize that increased awareness of antibiotic resistance led doctors treating outside the hospital to prescribe a decreasing number of antibiotics in the community during 2016-19, as was observed in various settings (36,37). As a result, the association between *Culture drawn at ER* and *Community use* over time changed, so that *Culture drawn at ER* was a good surrogate for *Community use* at the beginning of 2016, but deteriorated over time.

The DGM for this scenario employed a standard time-fixed logistic model for the probability of resistance. The DGM included the same 20 covariates as the five models described in the previous subsection, apart from *Culture drawn at ER* which was replaced by *Community use*. For all covariates other than the artificially created *Community use*, the coefficients in the DGM were the estimates from the *Culture drawn at ER* model. We set the coefficient of *Community use* to two, as antibiotic use is a known cause for antibiotic resistance (33). In each simulation iteration, we simulated the resistance results according to a standard time-fixed logistic model

Using Bayes’ theorem, we set the probability of *Community use* as time-varying,
for *k* = 0,1. For simplicity, we chose a linear decrease of antibiotic prescription in the community over time, , where the slope and intercept were chosen such that from t = 1279 (3.5 years) we obtained *p*_{t}(*Community use*_{i} = 1) = 0.1. The resulting approximate density of *p*_{t}(*Community use*_{i} = 1) in the data is shown in Figure S2. In our dataset, approximately 10% of the patients’ cultures were sampled at the ER. In this hypothetical scenario, patients who were prescribed antibiotics in the community were on average “less healthy”, and thus more likely to have arrived at the ER and had their bacterial culture drawn there, than patients who did not receive antibiotics in the community. Hence, we set *p*(*Culture drawn at ER*_{i} = 1|*Community use*_{i} = 1) = 0.2. Finally, we estimated *p*_{t}(*Culture drawn at ER*_{i} = 1) from the original data using a time-varying intercept-only model (Figure S3).

In each simulation iteration, we estimated a logistic time-varying coefficient model with *Culture drawn at ER* being the only time-varying coefficient (Methods). The average estimate of the time-varying coefficient of *Culture drawn at ER* across simulations (Figure 2) emulated the estimated coefficient obtained from the *Culture drawn at ER* model fitted to the original data (Figure 1A). To explain this, note that at the beginning of 2016 *Culture drawn at ER* was a good surrogate for *Community use*, and the average estimated time-varying coefficient was close to the actual coefficient of *Community use* in Equation (2) (Figure 2). Then, as the association between the two covariates diminished over time due to the trend of decrease in *Community use*, the average time-varying coefficient estimate of *Culture drawn at ER* diverged from the coefficient of *Community use* towards zero, its true value under the DGM (Equation (2)). Thus, this simulation study demonstrated how a confounding mechanism that varies with time can result in a misleading estimate.

For comparison, the average coefficient estimate of *Culture drawn at ER* from the standard time-fixed logistic model (Figure 2) was also biased. Its average was 0.35 (2.5%, 97.5% quantiles: - 0.01, 0.72), while in practice *Culture drawn at ER* had no effect on the resistance. Hence time-fixed models are also subject to bias in such a scenario.

#### Second simulation study: Recreating a biased time-stable coefficient estimate

In the second data-driven simulation, we considered a causal relationship that varied considerably with time but was approximately estimated as time-stable, even though the estimated coefficient was allowed to vary in time. To obtain this approximate time-stable estimate, the confounding mechanism had to vary with time in a way that near-perfectly negated the underlying time-varying relationship. Hence, it demanded parameter values that are unlikely to be realistic in our context.

In this simulation scenario, the relationship between *Had MRSA* and resistance was confounded by *Community use* in a time-varying manner. The time-varying effect of *Had MRSA* on resistance was set as positive, and decreased over time. The decrease over time can be motivated through changes in the bacterial population of *Staphylococcus aureus*, as it can affect the resistance of other bacteria via horizontal gene transfer (HGT) (34). As in the first simulation, we mimicked increased awareness of antibiotic resistance that led community doctors to decrease antibiotic prescription over time. As a result, the association between *Had MRSA* and *Community use* changed over time, so that *Had MRSA* started as a good surrogate for *Community use* but deteriorated over time. We designed the change in the association between *Had MRSA* and *Community use* to specifically induce bias into the estimate of the time-varying coefficient of *Had MRSA*, such that a constant estimate was obtained.

We set the coefficient of *Community use* to 1.25, and the time-varying coefficient of *Had MRSA* to . Then, in each iteration we simulated the resistance results according to the following DGM,

Again, we set the probability of *Community use* using Bayes’ Theorem,
for *k* = 0,1. As in the previous simulation study, the values of each component in Equation (4) followed a clinical scenario in which antibiotic community use linearly decreased during 2016-19. Therefore, we set . In our dataset, approximately 6% of the patients have previously had an MRSA infection, hence we set *p*(*Had MRSA*_{i} = 1) = 0.06. Finally, we assumed that patients who were prescribed antibiotics in the community are on average more likely to have had an MRSA infection than those who were not, and set *p*(*Had MRSA*_{i} = *1*|*Community use*_{i} = *1*) = *0*.*12*.

In each iteration, we estimated a logistic time-varying coefficient model with the 20 available covariates in which the coefficient of *Had MRSA* was the only time-varying coefficient (Methods section). The obtained average coefficient estimate (Figure 3) was time-stable, and resembled the coefficient estimates from the *Used antibiotics* and *Had MRSA* models in the original data (Figure 1C-D).

As in the first simulation study, in each iteration we also fitted a standard time-fixed logistic model. The average estimate of the time-fixed coefficient of *Had MRSA* was also biased (Figure 3).

## Conclusions

Many epidemiological settings, and especially those involving infectious diseases, are characterized by time-varying relationships. However, epidemiological studies often analyze data as static, essentially averaging observed associations across time and neglecting to account for the dynamical nature of the studied system. Using a combination of an antibiotic resistance related clinical dataset and data-driven simulations, we have demonstrated that time-varying coefficient models can reveal phenomena otherwise obscured by time-fixed models.

We showcased how the use of time-varying coefficient models may help unravel causal covariate-outcome relationships. As we have shown, estimates that are time-stable can coincide with prior knowledge suggesting that the underlying relationships are causal. Conversely, time-varying estimates may direct researchers to inquire the underlying causal structure leading to the obtained time-varying relationships.

Time-varying coefficient estimates can result from a mixture of underlying covariate-outcome relationships and confounding mechanisms that are time-fixed and/or time-varying. It is possible the underlying covariate-outcome relationships are time-fixed (or non-existent), while time-varying confounding leads to a time-varying estimate; we demonstrated such a scenario in our first data-driven simulation study. On the other hand, time-stable coefficient estimates are more likely the consequence of underlying time-fixed (or non-existent) covariate-outcome relationships, that are possibly confounded in a time-fixed manner. While it is technically possible that time-varying confounding will lead to a time-stable estimate of an underlying time-varying covariate-outcome relationship, this scenario is unlikely. Such a scenario requires the variations in the underlying relationship and in the confounding mechanism to approximately negate each other, as we illustrated in the second data-driven simulation study. This is implausible and hence can often be ruled out, though it is advised to do so based on subject-matter expertise.

From a clinical standpoint, knowledge about causal relationships can help researchers understand the biological mechanisms behind them, and possibly result in new interventions. This knowledge can also be exploited to improve existing predictive models for antibiotic resistance (e.g. (38–41)). By emphasizing covariates with causal effects on resistance, predictions from such models might prove more stable and robust to generalizations over different time periods or locations.

In the context of antibiotic resistance, future research should extend efforts of further identifying causal relationships between risk factors and antibiotic resistance using time-varying models. Development of a theoretical framework for assessing causal effects in the field of antibiotic resistance (i.e., based on potential outcomes and/or directed acyclic graphs), while accounting for variations with time, would be clinically and methodologically valuable.

## Data Availability

Data produced in the present study are proprietary but can be available upon reasonable request to the authors.

## Funding

This work was supported by a grant from the Tel Aviv University Center for AI and Data Science (TAD).