Abstract
Background Recent molecular studies of B-cell precursor acute lymphoblastic leukemia (BCP-ALL) have started to delineate the nature and timing of genetic variants and those responsible for subsequent progression to overt leukemia. However, the etiology behind both initiation and progression remains largely unknown. Nonetheless, theories, but also epidemiological evidence, of how exposure to common infections and other microbes in our environment modulates the risk of developing childhood BCP-ALL, have emerged. In light of these theories and the well-known phenomena of seasonality in infectious disease spread, childhood ALL has been analyzed for signs of seasonal variation, with differing results.
Methods In this study we applied the Bayesian Generalized Auto Regressive Integrated Moving Average with external variables (GARIMAX) model, adapted for count data via a negative binomial distribution, to study seasonal variation of incidence in a Swedish population-based cohort of 1601 BCP-ALL cases. The studied cases were aged 0-18 years at diagnosis and identified from the Swedish Childhood Cancer Registry (SCCR). Also, two subgroups of BCP-ALL represented by the most abundant genetic subtypes, ETV6/RUNX1 and HeH respectively, were analyzed accordingly. All analyses were performed in two stages. The first stage identified the presence of the repeatable pattern using harmonic functions, and the second stage consisted of the identification of the peak months in the series.
Results An informative seasonal variation in BCP-ALL incidence numbers, displaying a peak in August and September, was detected in the total cohort of 1601 individuals. No seasonality was detected analyzing the subtype groups HeH and ETV6/RUNX1 positive BCP-ALL, respectively.
Conclusion The manifested seasonality in BCP-ALL with a peak in August-September may suggest that the prolonged period of minimal viral spread during Swedish summer vacation causes a temporary halt in the last step of progression to overt disease and consequently an accumulated number of cases presenting in August-September.
Introduction
Acute lymphoblastic leukemia (ALL) is the most frequent cancer (25%) in children below the age of 15 years [1]. ALL is most commonly (85%) of B-cell precursor origin, termed BCP ALL [2]. To date, 11 distinct molecular subtypes based on recurrent genetic aberrations have been defined in BCP ALL, displaying highly variable prognosis and used to guide modern treatment strategies [3]. The most abundant subtypes, ETV6/RUNX1 fusion and high hyperdiploidy (HeH), also have the most favorable prognosis [4].
The annual Swedish ALL incidence rate of 4.2 per 100 000 in children 0-15 years of age is comparable to numbers in other affluent societies [5]. A significative peak in incidence is seen at 2-5 years of age [6, 7] mainly comprised by ETV6/RUNX1 positive and HeH cases [6, 8, 9]. However, markedly lower incidence has been reported especially from sub-Sharan countries [10] where the early incidence peak is less pronounced or even non-existing [11].
The cause of childhood BCP-ALL is yet largely unknown and likely multifactorial, comprised of both environmental and genetic factors. Mapping the temporal aspects of disease initiation and progression has been central to our understanding of its etiology. There is today compelling evidence that BCP-ALL is initiated prenatally. Pre-leukemic clones have been detected in children who later develop BCP-ALL but also in healthy controls [12–18]. As many as 1-5% of healthy neonates have been found to harbor a pre-leukemic clone of ETV6/RUNX1 positive BCP-ALL [19, 20]. However, secondary genetic aberrations are required for the progression of pre-leukemic clones [21, 22], explaining why only a fraction of pre-leukemic cell carriers develop overt BCP-ALL.
Although the nature of both initiating and secondary genetic variants in BCP-ALL have been delineated to some extent, including leukemia predisposing germline variants in an approximated 4,4% [23], little is known of the environmental drivers behind these genetic aberrations. To date, ionizing radiation is the only environmental factor convincingly proven to increase the risk of ALL [24, 25]. However, the role of common pathogens such as viruses and bacteria has been given great interest in both epidemiological and molecular studies in recent years.
The initiation of pre-leukemic clones by infectious agents has to date not been molecularly proven. However, recent studies have indeed confirmed that progression to overt leukemia from a pre-leukemic state may be promoted by a specific pathogen [26], but also that it requires exposure to common infections [27, 28]. On the more general level, models for infectious exposures selective pressure on leukemia progression have been suggested. Originally, in 1988 Kinlen formulated the hypothesis of “population mixing” after observing an increased incidence of childhood BCP-ALL in immunologically naïve previously isolated populations after exposure to a common mild infectious agents transmitted by individuals from urbanized areas of residence [29–31]. Simultaneously, Greaves suggested a model where delayed exposure to common infections during early childhood causes strong adverse reactions of the immune system once infections are encountered, in turn promoting progression of pre-leukemic cells [32]. As reviewed by Hauer et al [33], more recent data building on these original models suggests early training of the innate immune system has a protective effect against BCP-ALL progression. This is based on observations of reduced BCP-ALL risk following early animal contact, early daycare attendance, vaginal delivery, breastfeeding, having older siblings, early BCG vaccination etc. In summary, both what infections we encounter and when during fetal life and early childhood we encounter them appear to influence the destiny of pre-leukemic cells initiated as a consequence of in some cases genetic predisposition and other yet unknown environmental factors.
One epidemiological aspect studied based on the above hypothesis is seasonal variation in incidence of BCP-ALL and likewise in BCP-ALL patients’ season of birth. As the spread of common infections (such as influenza, adeno, corona, rhino and rs-viruses among others) vary strongly with season in temperate areas of the world, the hypothesis has been that if such infections impact disease progression a seasonal pattern in incidence of BCP-ALL, and possible also patients season of birth, may also be seen. The first to report seasonal variation of acute leukemia incidence were Lampin and Gerard [34] in 1934. In a literature search we identified 41 papers published from 1961 an onward, investigating seasonal variation in ALL (predominantly in childhood) by a variety of methods and with inconsistent results (Table 1). 20 publications reported significant seasonal variation of ALL incidence [35, 36, 38, 39, 41, 44, 45, 47, 49, 50, 52, 53, 55, 56, 58, 63, 69–72], while 21 papers reported no evidence of seasonality [37, 40, 42, 43, 46, 48, 49, 51, 54, 56, 57, 59–62, 64–68, 73].
In the current study we applied Generalized Autoregressive Integrated Moving Average model with External seasonal covariates (GARIMAX) to a Swedish population-based cohort of 1601 childhood BCP-ALL cases to investigate presence of seasonal variation in incidence of diagnosis. Also, two subgroups of BCP-ALL represented by the most abundant genetic subtypes, ETV6/RUNX1 and HeH respectively, were analyzed for seasonal variation of incidence.
Materials and methods
Data sources
Sweden has a renowned system of records for citizens in which demographic and healthcare data are collected continuously. All permanent residents are given personal identity numbers that enable linkage between the registers.
The Swedish Childhood Cancer Registry (SCCR) is a National Quality Registry containing information about children diagnosed with tumors and hematological malignancies between 0 and 18 years of age stretching back to the 1970s for ALL and 1980s for all other malignancies. The registry has an overall coverage of 89% for all diagnoses, however, coverage for ALL specifically is estimated to be as good as 100% at present. It includes information about clinical characteristics, treatment, outcome, immunophenotype, genetic subtype, and other clinically important genetic aberrations. The most abundant BCP-ALL genetic subtypes, HeH and ETV6/RUNX1-fusion have been registered since 1992 and 2000 respectively, when robust cytogenetic methods to detect these aberrations were introduced in clinical diagnostic routines. The Total Population Registry (RTB) holds information on the date of birth, death, and emigration for all Swedish citizens.
The main data sources of this paper the Swedish Childhood Cancer Registry (SCCR), and The Total Population Register (RTB).
The study population
From the Swedish Childhood Cancer Registry (SCCR) we identified a cohort of 1601 children and adolescents diagnosed with BCP-ALL at age of <18 years between January 1, 1995 and December 31, 2017. Only individuals born and diagnosed in Sweden were included in the study. For the purpose of this study, anonymized data of date of diagnosis, date of birth, and genetic subtype were collected. Patient characteristics are summarized in Table 2. The overall mean age at diagnosis was 6.8 years, with ETV6/RUNX1 carriers having a lower mean age than HeH and the non-HeH/ ETV6/RUNX1 BCP-ALL group) (5.0 vs. 5.3 and 8.1 years). In the cohort, 448 cases were of HeH and 272 cases of ETV6/RUNX1-fusion BCP-ALL subtype. The mean age at diagnosis was 6.8 years for all cases, 5.0 for ETV6/RUNX1-positive cases, 5.3 for HeH cases, and 8.1 for other and undefined genetic subtypes.
To study seasonality by date of diagnosis, we included all 1601 BCP-ALL cases described above. Individual cases of BCP-ALL were summed up in a total of 92 quarters to facilitate time series of quarterly counts. Three types of quarters within a year were defined as follows: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec (first quarter type), Feb-Apr, May-Jul, Aug-Oct, Nov-Jan (second quarter type), and Mar-May, Jun-Aug, Sep-Nov, Dec-Feb (third quarter type).
Statistical methods
We implemented the Bayesian Generalized Auto Regressive Integrated Moving Average model with exogenous variables (GARIMAX) [74, 75] for identification of seasonal variation in BCP-ALL. The key elements of the model are (i) generalization of the ARIMAX process to a count distribution via negative binomial distribution, and (ii) a Bayesian formulation for model setup. The generalization allowed us to search for seasonality in the sparse (counts of low values) data [76], whereas the Bayesian formulation is beneficial in the applications of the complex models in small sample settings [77]. We used BIC (Bayesian information criterion) for the model selection. Chi-square tests were performed for exploratory purposes.
ARIMAX model
ARIMAX stands for autoregressive integrated moving average with external variables and was proposed by Box and Jenkins [78] in 1970. The assumption in the AR (autoregressive) process is that the mean at time t (expected value) depends on the previous realization of the process and the MA (moving average) part is that the additive error is correlated across time. The I (integration) part refers to the fact that the process is an integration of a process. The X part (external variable) introduces independent variables to the process.
The ARIMAX is denoted by parameters (p, d, q), where p is the order of autoregression (indicates how many previous observations to use in the AR), d is the differencing order (indicates how many times to differentiate the response variable), and q is the order of moving average (indicates how many previous errors to use in the MA).
The ARIMA model is widely used for modeling seasonal patterns and trends in economics in the analysis of Gross Domestic Product, inflation, demand [79–81], and finance in predicting stock prices [82]. It has also been widely used in medical research. For example, the ARIMA model was implemented to forecast COVID-19 cases using Johns Hopkins data [83], and to analyze malaria cases in Sri Lanka [75]. ARIMAX is a powerful statistical method in the analysis and forecasting of time series data.
The ARIMAX(p, d, q) for yt in the general form can be written: where yt is an observation of the series at time t, p is the order of autoregression, q is the order of moving average. Moreover, Φp(L) = (1 −ϕ1L− ϕ2L2 … −ϕpLp), and Θq(L) = (1 + θ1L + θ2L2 + … + θqLq), where ϕ1, …, ϕp are the coefficients for the autoregressive part of the process, θ1, …, θq are the coefficients of the moving average part of the model. L is a backshift operator with Liyt = yt−i. The lag operator can be multiplied such that LiLjyt = yt−i−j. ut is a white noise, or uncorrelated, error process at time t, Δd is a differencing operator of order d. The differentiation method Δ is chosen to be log differentiation, so Δdyt denotes log differentiation of yt taken d times. For example Δyt = log(yt) − log(yt−1). β is a vector of coefficients for covariates .
Generalization of ARIMAX
While the classical ARIMAX model assumes normally distributed data, BCP-ALL case data can only take integer values from minimum value 0 to maximum value 16, which breaks the assumption of the classical model. Generalization of ARIMAX allows modeling the series of non-normal distributions [76].
The go-to generalization to the discrete data is performed via Poisson distribution.
We can see that several papers in our literature search used Poisson regression as a method for obtaining seasonality in leukemia case data [37, 40, 45]. One of the assumptions of Poisson distribution is that the mean and variance of the process are the same. The observed excess variability compared with the Poisson count model motivates use of a negative binomial formulation within the ARIMAX framework. Conditional variance of the negative binomial distribution exceeds the conditional mean. The source of the overdispersion is the unobserved heterogeneity caused by hidden variables as the harmonic and seasonal covariates are just proxies of infectious agents. NB distribution compensates for the lack of fit by introducing an extra parameter [84, 85]. In the literature survey generalization via NB distribution was used by Goujon-Bellec et al [45]. The parameters of NB distribution are pr and r, where pr denotes the probability of success, and r is the number of successes before trials stop. Poisson distribution is the limiting form of NB distribution when r → ∞.
The parameter of interest, pr, is assumed to depend on the season.
The score of the ARIMAX is generalized to the parameter of NB distribution prt by link function prt = g(λt) [76], where g(.) is a link function, and λt is the score of ARIMAX process (predicted yt in equation 1), the score λt becomes: The link function used in the paper is . So, given prt and estimated r yt ∼ NB(prt, r). The generalized ARIMAX model called GARIMAX model [75].
Following Zeger and Qaqish [86] we implement “ZQ1” transformation of the ARIMAX by adding a constant c. Addition of the constant allows avoiding the problem of computing the logarithm of observations with the zero value in the log difference integration transformation of the model. “ZQ1” transformation suggests , where 0 < c ≤ 1.
GARIMAX in the Bayesian setup
The formulation of the model in the Bayesian setup gives several advantages over the classical maximum likelihood estimation. The first one is that the analysis is no longer performed on a single estimate, but rather on the distributions of the underlying parameters. The Bayesian models with the correctly specified priors and estimation using MCMC allow for the parameters’ interval estimates to be appropriate in small samples [77]. This advantage of the Bayesian framework allows for unbiased inference even in small samples.
We assume a stationary model for the GARIMAX process, so AR and MA coefficients ϕ1, …ϕp and θ1, …, θq should be constrained such that the resulting process is invertible and stationary. We followed Jones [87] in assigning prior distributions for AR and MA coefficients. The algorithm for generation of the sample of ϕ1, …, ϕp can be summarized as:
Algorithm:
Generate value k1, …, kp following the , where p = 1, …, P, where P is the number of lags of autoregression, square brackets denote the integer part of the value in them (round to the closest integer);
perform transformation rp = 2kp − 1 for all p;
assign for all p;
and then for j = 2, …, p and for i = 1 : (j − 1) iteratively compute ;
yi is the sample for ϕi, where i ∈ 1, .., p
The same procedure is performed for the θ coefficients, but instead of the lags for the autoregression (p) the lags for the moving average (q) are used.
For the seasonal coefficients of the harmonic functions and the coefficients of the seasonal matrix normal distribution was selected, β1, …, βk∼ Norm(0, 0.1). Distribution is centered around the 0 value, the prior states that there is no evidence of seasonality of BCP-ALL no its subtypes before the data is introduced.
The last parameter to be estimated in the model is the parameters of the Negative Binomial distribution r, which represent the number of failures until the trials are stopped. Prior distribution for r is gamma [88], r∼ Gamma(0.01, 0.01).
The model is estimated using MCMC (Markov chain Monte Carlo) simulation method. JAGS [89] software is used to implement the simulation. Python is the main programming language used for data preparation, visualization and calls for JAGS software.
Model choice
To identify a number of lags for autoregression and moving average parameters, 12 models were estimated. The maximum 3rd order of the lags was chosen to identify the best GARIMAX model for each analysed time series. We assume a weakly stationary series after taking log difference. Bayesian information criteria (BIC) was chosen as a score for model selection. The model with the lowest BIC was then selected as a basis for the seasonality checks. The BIC allows finding a balance between the complexity of the model and its performance. The BIC for the model m is defined as: where LLm is the median log-likelihood computed by the model m, N is the number of observations of the analyzed series, and k is the number of the estimated parameters of the model.
General procedure
We implemented a two-step procedure for identification of seasonal wave for every analysed time series. The preliminary step aims to identify the number of lags for AR and MA coefficients using the BIC score without any seasonal covariates.
The first step uses harmonic functions as a covariate for the GARIMAX model. The harmonic covariate is .
During the second step, we run the same specification of the GARIMAX model with a quarterly seasonal matrix as a covariate. The third step identifies the particular quarter when the wave has its peak. The covariates in this case become a seasonal matrix, in which each column value corresponds to a specific quarter of the year, except for the so-called “base quarter”, which is taken as the quarter with the lowest number of cases.
The main underlying process that models the dynamics of the observations at time t is ARIMAX. The score is then generalized via Negative Binomial distribution, using of the discrete NB distribution. The generalization addresses the assumption of the count nature of date of diagnosis quarterly case time series data.
Exploratory analysis
As a supplementary and exploratory analysis, the most popular test in the literature survey (Chi-square test) was performed on date of diagnosis and date of birth data of BCP-ALL. Chi-square test is a statistical test for categorical data, and it is used to determine whether the data is significantly different from the expected value. The test statistics can be formally written as: , where χ2 is the test statistic, O is observed frequency, E is the expected frequency. P-values were adjusted using Benjamini-Hochberg procedure [90] and all tests were performed using the Python programming language.
Also as an exploratory measure, we generated descriptive data on distribution of age at diagnosis and genetic subtypes, for the whole cohort as well as for cases diagnosed in each respective month of the year.
Results
The best models for all three types of quarters are presented in the table 3.
All BIC scores for the models are presented in S1 Appendix
After identification of the order of autoregression and moving average, we estimated the model with harmonic functions as covariates to identify presence of seasonal wave in the series.
Tables 4 and 5 report summary statistics for posterior distributions of the coefficients of seasonal harmonic functions. The first column of the table is the name of the analyzed series, the second column is the specification of the GARIMAX model, the column “Waves” specifies the type of the harmonic function, the forth column reports the median value of the distribution, the last two column report 95% credibility interval. If the credibility interval fully consists of positive or negative values, it means that 95% of the posterior does not contain 0 and it is unlikely that that the covariate has no effect on the response variable. The covatiate in this case is said to be informative in the Bayesian setup, the corresponding term in the classical statistics is significant. If the credibility interval contains the 0 value, it means that the big mass of the posterior is centered around 0 value and the covariate is uninformative.
Supplementary Figures S1 Fig, S2 Fig, S3 Fig show graphical distributions of posterior densities of the coefficients of seasonal harmonic functions (the visual representation of the table 4 and 5). The red line on the figure represents the zero value of the coefficients.
The credibility intervals of posterior densities of the harmonic functions do not include 0 values for sine harmonic waves of BCP-ALL of the 1st and 2nd specified quarter type, and cosine harmonic wave of BCP-ALL of the 3rd quarter type.
The credibility intervals of posterior distribution of the coefficients of seasonal harmonic functions do include the value 0, which means that harmonic functions do not provide much information in explaining the dynamics of the case counts of BCP-ALL subtypes HeH and ETV6/RUNX1. The informative seasonal waves are: sine wave of BCP-ALL series of the 1st type quarter, sine wave of BCP-ALL series of the 2nd type quarter, the cosine wave of BCP-ALL series of the 3rd type quarter. No informative seasonal waves for the BCP-ALL subtypes were found.
Tables 6 and 7 report the summary for the posterior distributions of the coefficients of the seasonal matrix. The first column is the name of the series. Base quarter for each series was chosen such that all other quarters show positive median values. The last three columns of the table are median and 95% credibility intervals.
Figures S4 Fig, S5 Fig, S6 Fig show the visual representation of the posterior distributions of the coefficients of the seasonal matrix. The base quarter presented as a blank picture.
Jul-Sep and Aug-Oct are the informative peak quarters for BCP-ALL series for the first and the second quarter types accordingly. Peak seasons contain August and September, the months with the highest number of diagnosed BCP-ALL cases in our studied Swedish cohort. It is important to observe the significance on both stages of the analysis, as the seasonal matrix does not consider repeatability of seasonal variation, while harmonic seasonal waves do not provide information concerning the peak months.
The high hyperdiploid genetic subtype of BCP-ALL does not show any informative result for any quarter types not in the first nor the second steps of analysis. Diagnosis of ETV6/RUNX1 positive BCP-ALL shows an increase in Jan-March and Aug-Oct quarters, which might suggest a two-peaked seasonal wave, which however did not yet form the strong periodicity tested by strictly periodic harmonic functions. On the other hand, the second step of analysis does not test for seasonality, it tests for the increase of the estimated quarters from the base quarter. Hence, the only result that consistently holds is seasonality in the whole group of BCP-ALL cases. In this group, the informative harmonic function was found in all quarter types and in the first and second specified quarters with an increase containing August and September.
Tables 8 and 9 report the results of exploratory Chi-square tests of BCP-ALL (including subtypes) seasonal variation. The lowest p-value was obtained for seasonality of BCP-ALL diagnosis using the second type of quarters, which showed a peak in August - October. No significant p-values were found for seasonality in diagnosis of BCP-ALL subtypes (HeH and ETV6/RUNX1). Neither did we find any significant p-values for seasonality of date of birth in BCP-ALL cases, the lowest p-value being 0.07 (BCP-ALL third quarter type with peak in March-May).
As all the tests were performed on the same BCP-ALL data, HeH and ETV6/RUNX1 being the subtypes of BCP-ALL, the p-values of Chi-square test were adjusted using the Benjamini-Hochberg procedure. After this adjustment of p-values, there are no significant results of Chi-square test to be reported.
Descriptive data displayed a similar distribution of age at diagnosis comparing the entire cohort and cases diagnosed each respective month Table 10, including the peak incidence months August and September. Also the distribution of different subtypes of BCP-ALL was similar when comparing the peak months August and September to all other 10 months respectively and to the whole year (entire cohort). Thus, cases diagnosed during peak months August and September did not stand out in any apparent way neither regarding age of diagnosis nor subtype.
Discussion
Exposure to infectious disease is today a well-established suspect in the search for environmental involved in both initiation and progression of BCP-ALL. We know that pre-leukemic clones of many BCP-ALL subtypes are initiated during fetal life in a substantial portion of cases [91–94], and that the chromosomal rearrangement representing each subtype is considered the initiating genetic event or “first hit”. Epidemiological data has pointed to some specific viral infections increasing the risk of BCP-ALL following maternal infection during fetal life, although not all studies support these findings [95–97]. Thus, the cause of initiating genetic events stands unresolved. We performed an exploratory analysis applying Chi-square test to data of quarterly aggregated at time of birth of BCP-ALL cases, but did not obtain any significant difference in incidences between months. Four previous studies have indeed reported seasonality in time of birth for ALL cases (B- and T-cell ALL analyzed together), two of which specifically in the 1-6 year age group, with peaks ranging from jan-april [47, 48, 52]. Yet, other reports have not detected date of birth seasonality [99]. Further studies aiming at associating temporal waves in time of birth of BCP-ALL cases to preceding dittos of infectious disease hold the potential to guide molecular studies of leukemia initiating infections.
For preleukemic clones to progress into overt leukemia additional “hits” are required, as first suggested in the “two-hit hypothesis” by Greaves [32]. That a fraction of healthy neonates harbor small populations of pre-leukemic clones at birth without ever developing BCP-ALL emphasizes the importance of such subsequent events [19], but also offers hope for development of preventive measures [33]. Epidemiological studies have identified environmental factors such as early (<1 years of age) day care attendance [100–102], birth mode [103–106], and early contact with livestock and pets [107], all proxies for exposure to a variety of microbes, to be modulating the risk of BCP-ALL progression. These findings are in agreement with the theory that delayed exposure to infections results in an aberrant immunological response fueling progression of preleukemic cells if present. This delayed exposure to infections and minimized contact with microbes is a clear consequence of lifestyle in affluent societies, which indeed also experience higher incidence rates of BCP-ALL [10].
While early (within first year of life) exposure to microbes appears to protect against BCP-ALL development, infectious disease has also been suggested to promote BCP-ALL progression possibly through inflicting the second hit or hits leading to overt diesease. In 1917, Ward was the first to acknowledge the peak of BCP-ALL incidence in 1-6 year olds, suggesting infections could be a trigger of disease given that common infections are more abundant in this age group than others [108]. This was followed by Kinlen’s idea of BCP-ALL being a rare response to a common infectious agent which becomes apparent by an increase in incidence upon in-mixing of migrants to an isolated immunologically naïve population [29–31]. Biological examples suggested to support this notion is the space-time clustering of increased BCP-ALL incidence with infectious epidemies [109–113]. In addition, a study reporting rapidly decreased incidence of childhood BCP-ALL soon after the implementation of infection control measures in response to the 2003 SARS outbreak in Hong Kong is also thought to lend supports of viral etiology. A two month long closing of schools was followed by rigorous hygiene routines after re-opening and other social distancing measure lasting for a total of 6 months. Substantial decline in communicable common infectious disease following these measures was proposed to protect at risk individuals, i.e. carriers of pre-leukemic clones, from acquiring 2nd hit/-s needed for progression [114]. This is further supported by two studies showing that exposure to infectious agents is required for developing BCP-ALL in mice housing pre-leukemic ETV6/RUNX1 positive pre-leukemic clones or susceptibility conveyed by Pax5 heterozygozity [27, 28].
Further, effects of severe acute respiratory syndrome coronavirus 2 (SARSCoV-2) pandemic on BCP-ALL incidence have been debated, speculating that both increasing incidence in response to a novel widespread pathogen (population-mixing hypothesis) and declining incidence as a consequence of decreased exposure to infectious disease (second-hit hypothesis) to be possible short term (weeks to months) outcomes [115, 116]. To date, in line with findings in Hong Kong, a report from Norway found distinctly decreased incidence numbers during the first months of the SARSCoV-2 pandemic [117] and similar indications have been reported from a region in Italy [118]. In contrast, recent data based on the German Childhood Cancer Registry indicate a significant increase in age standardized incidence rates (ASIR) of childhood (0-14y) lymphoid leukemia was seen during the Covid-19 pandemic in 2020 and 2021 [119, 120]. Like in Hong Kong [114], extensive infection control measures were instated in Germany during this time. Although the reason for this increased incidence is unknown it may lend further support to infectious exposures importance to BCP-ALL progression. It is however crucial to point out that increased ASIR was reported also for other childhood cancer types who’s etiology is not considered susceptible to infectious exposure, which implies there may be other explanations to the reported ASIR of childhood lymphoid leukemia. Meanwhile, two reports from USA and Canada did not detect any change in incidence of pediatric leukemia and lymphoma as a group [121, 122]. Due to limitations in sample size and observation time, the currently available reports on this topic should be interpreted with caution. To draw any conclusions about the effects of SARSCoV-2 infection and the restrictions to prevent it’s spread, BCP-ALL incidence will need to be closely monitored, and will undoubtedly be so henceforth. Not least will observing long-term effects (years) of decreased infectious exposure on future incidence rates allow for scrutinizing of the delayed infectious exposure hypothesis [123, 124].
In light of the above, seasonal variation of ALL (most studies do not discriminate between B- and T-cell origin) as a proxy for infectious exposure has been extensively studied over the last decades, applying an array of different methods as discussed in more detail below. We can conclude from our review of previous studies (Table 1) that there is no consistent proof of a seasonal wave in time of ALL diagnosis. And, in cases where seasonality is detected, time of peak incidence is scattered throughout the year. However, when studying seasonal variation, it is important to consider that patterns of seasonality may differ from country to country depending on factors such as climate zones, affluence, way of life et.c. affecting infectious panorama and seasonality of communicable infectious disease.
In the current study we report an informative seasonality at date of diagnosis specifically for BCP-ALL, with peak incidence in August-September. Descriptive data on distribution of age at diagnosis and genetic subtype did not differ for cases diagnosed in August-September compared to other months nor the entire cohort. In Sweden almost all children attend pre-school from age 1 and a long summer holiday is customary, usually beginning in late June or early July and extending into early August. One possible explanation for the timing of our observed incidence peak is that decreased spreading of infectious disease during summer, a consequence of the prolonged summer holiday in Sweden, causes a temporary halt in the final steps of disease progression for some individuals that would otherwise have presented clinically. This would be in accordance with observations in Hong Kong during SARS-epidemic [114]. However, in contrast to the Hong Kong example cases are then instead accumulated in August-September when at risk-individuals are again exposed to infections in schools and pre-schools. This rationale builds on the assumption that an infection quite rapidly pushes disease progression the last step to giving non-ignorable clinical symptoms of BCP-ALL. It is known that symptoms of ALL do indeed evolve in only days to weeks but the length of latency from second genetic hit to clinical presentation remains unknown and may very well be variable. Thus, if such a “last” infection before diagnosis causes the second genetic hit or just puts selective pressure on already “ready-to-go” leukemic cells remains to be understood.
An alternative explanation for our observed August-September BCP-ALL incidence peak would be that some specific common pathogens have a slightly sharper ability than others to cause genetic second hits, and thus progression to overt disease, in pre-leukemic cells. Entero and varicella viruses for example have seasonal peaks during summer [125]. Again, the challenge of associating a peak in BCP-ALL incidence to a peak in spread of certain pathogens is the undetermined and likely variable latency from second genetic hit to clinical presentation.
Analysis of seasonal variation in series of aggregated counts may face several challenges. As for epidemiological studies in general, small sample size is a common problem resulting in low statistical power, which increases probability of reporting false negative results as statistical tests performed on small samples are only capable of detecting large effects [126, 127]. In our literature review, merely 9 out of 42 previous publications on seasonality in ALL (as summarized in Table 1) had more than 1000 ALL cases in their studied cohort.
The largest cohort studied to date is that of 15 835 cases of childhood leukmeia (73 % lymphatic) born and diagnosed between 1953-1995 in UK, published by Higgins, C. D. et al in 2001. [51] For the 1282 cases born and diagnosed before 1962, a suggestive but after not statistically significant incidence peak in August-September was identified. The authors however call for caution when interpreting this indication; since cases form this time period were extracted from death-records, retrieving date of diagnosis retrospectively, a ”complication to death” could have introduced an apparent seasonality. Further, no significant seasonality in childhood leukemia incidence was found when analysing the entire cohort. A possible explanation was suggested to be the fact that seasonality was examined by date of diagnosis rather than clinical onset, between which there may be a discrepancy in time masking seasonality. Based on this possible discrepancy, one could argue for increasing the aggregation period from month to quarter as was done in the present study. Quarterly transformation allows the date of onset and date of diagnosis to be in the same period of time series analysis. The main drawback of quarterly transformation is that it requires performing the analysis on three different subsets of quarters, which increases the probability of finding false positive results using classical statistical methods. We believe that our cohort of 1601 cases diagnosed during a time-span of 22 years (1995-2017) as well as formulation of the GARIMAX model in the Bayesian setup allows to provide the unbiased results in this paper.
Our literature review revealed three previously applied types of data transformation, an obligate step before performing analysis with any method. The first transformation aggregates tabular data from registries to counts by seasons or months, summing up the data for all observed years in the sample, resulting in four (seasons) or 12 (months) bin histograms. The second transformation counts number of cases per month or season in the sample, without summation between different years, resulting in time series of monthly or seasonal counts. The third transformation uses individual case-by-case statistical methods. To our knowledge, this third type of transformation was only previously used by Gao et al [49] and was also applied in the present study. Type of data transformation does to some extent depend on the chosen method for analysis, but different combinations may be applied. Therefore, choice of data transformation type is a variable that may affect output.
The most common method for detection of the seasonal variation is the Chi-square test, which was used in 13 previous studies included in our review of previous publications, out of which seven reported seasonal variation in ALL/childhood leukmeia [35, 39, 42, 44, 46, 48, 50, 63–65, 68, 69, 71]. The test is implemented to histogram data transformation and answers the question whether there is likely higher relative frequency in one group than the other, the other group usually being the mean of all data.
Exploratory analysis without statistical tests was used in 11 publications (stated as ”single-factor analysis” in Table1) 1) [38, 47, 55, 57, 58, 60, 62, 66, 67, 70, 72] with seven papers reporting positive results for seasonal variation in ALL/childhood leukmeia. The most recent paper with descriptive statistics was published in 2011, marking a shift towards use of more strict hypothesis-driven methods.
Edward’s test [129] was implemented in four reviewed publications [43, 51, 56, 61]. The model detects sinusoidal curve within a 12-month period histogram. J A Ross et al [53] implement Rodger’s test [130], a modification of Edward’s test, which evaluates the significance for cyclic trends based on the efficient score vector calculated for each seasonal peak of aggregated cases. Five studies that applied these similar tests reported significant seasonal variation of ALL/childhood leukemia diagnosis in East Anglia (UK) [56] and USA [53], but neither in Mexico [43], UK (all regions) [51], NW England [56] nor the Netherlands [61].
Cosinor Analysis [52, 54, 59] is performed by fitting the sinusoidal curves (harmonic functions) to 12-month histograms. Poisson regression with harmonic functions, an extension of cosinor analysis assuming not-normal distribution of errors, was applied in four previous studies [37, 40, 45, 48]. Moreover, Gao et al [49] investigated the seasonal variation of ALL in USA, Singapore, and a western region of Sweden, using von Mises distribution in combination with Mardia test statistics [131], reporting a seasonal peak in January in Sweden but only with 63 observations in the sample.
Joint point regression rendered the report of an increased number of ALL-diagnoses in spring and summer in Iran, studying cases diagnosed between 2006 and 2014. This method was applied to monthly histograms and allowed identification of changes in trends of studied data. [36].
Finally, Kyu Seok Shim et al [41] implemented ARIMA (Autoregressive Integrated moving average model) to analyse seasonal variation in the time series of ALL-diagnosis in South Korea, reporting positive results for the presence of seasonality with a peak in winter. The authors also showed that the seasonal wave of ALL-diagnosis correlated with that of HPIV (Human para-influenza viruses) with an assumed 1-2 month period of latency. To our knowledge, this is the only study so far correlating seasonal wave of childhood ALL to a specific viral agent, thus addressing the question of potential specific viral agents promoting progression to overt leukemia. However, the correlation to HPIV was not only made for ALL but also several other pediatric cancers which are not suspected to have a viral etiology.
All described methods above exhibit positive and negative sides. The methods that employ harmonic functions (combination of sine and cosine functions) assume perfect repeatability of the infectious agents from year to year. Since seasonal variation of infections might not be regular, this assumption may pose a problem when looking for seasonal variation in childhood ALL-diagnosis as as a proxy for infectious exposure contributing to disease progression. It is also important to keep in mind that the errors of the regressions with harmonic functions might not follow the normal distribution in count data scenarios. On the other hand, methods that are performed on aggregated histograms of months or seasons are harshly affected by outliers. Thus, a random increase in number of cases in a monthly bin without true periodicity may nevertheless show significant seasonal variation, rendering false conclusions.
There are several advantages of implementing the GARIMAX model to search for seasonal variation in BCP-ALL case count data. ARIMA is a classical model that is widely used in different fields of studies to detect seasonal variation. Generalization of the model via Negative Binomial distribution adapts the model to low case count data allowing accounting for excess variability as compared to the generalization via Poisson distribution in the limited information setting. Formulation in the bayesian setup makes the model less data hungry, and allows for inference in small sample settings. The main disadvantage of the GARIMAX model is that it requires transformation to case count data, thus blocking the opportunity to infer individual cases’ contribution to seasonal variation. Also, the model does not allow for inference at date of birth as it requires stationary or semi-stationary time series. Aggregation by date of birth is impossible as the series exhibit informational delay, i.e. many children who were already born are not yet diagnosed in the data. One possible solution would be to cut time series by 18 years (childhood ALL is not diagnosed beyond the age of 18 years), but this approach would shift the analyzed period such that the sample becomes too small for the model even in the Bayesian setup. Chi-square test has previously been widely applied for analysis of seasonal variation in date of birth of childhood ALL patients, but the test requires p-value adjustment which is impacted by outliers as discussed above.
Conclusion
In the present study we identified seasonal variation of childhood BCP-ALL incidence numbers with a peak in August-September. Results were obtained applying modern statistical methods, Generalized Autoregressive Integrated Moving Average model with External seasonal covariates (GARIMAX), to a Swedish population-based cohort of 1601 childhood BCP-ALL cases from the Swedish childhood cancer registry. The cause of our findings is yet to be determined. Nonetheless, based on the current knowledge and hypotheses regarding childhood BCP-ALL etiology, we suggest two possible explanatory models. If a final unspecific viral infection pushes leukemic clones to progress and expand in the bone marrow and cause non-ignorable clinical symtoms, then the August-September incidence peak could be explained by a rebound effect following a halt in such progression during Swedish summer (attributable to deceased spread of viral infections due to closure of schools and pre-schools).Alternatively, peak incidence in August-September could be attributable to a viral agent with seasonal spread during summer potent in causing a 2nd-hit in pre-leukmeic clones or. Which model is more likely depends on how long latency from 2nd hit to overt leukmeia is, a factor yet unknown in vivo in humans. Further studies to understand these etiological factors are required. Also, further studies of larger cohorts using modern statistical tools are needed to examine the possible temporal pattern of BCP-ALL at both time of diagnosis and birth.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Ethics Declaration
The study was approved by the Regional Ethical Review Board in Stockholm, Sweden (ethics permit numbers 2018/1849-32) in accordance with the Declaration of Helsinki.
Authorship
Contribution: A.N., B.B., C.S., G.B., N.E. and R.J. designed the study. G.B., performed analysis; G.B., R.J. and N.E. performed the GARIMAX analysis.; N.E., G.B., G.T. and A.N.S. prepared the dataset; A.N., M.H., J.A., E.P. and B.B. contributed with medical knowledge and generation of hypothesis. G.B. prepared the figures; A.N., B.B. and G.B. wrote the manuscript; All authors contributed to data interpretation; and all authors revised the manuscript and approved the final version.
Conflict-of-interest disclosure
The authors declare no competing financial interests.
Supporting information
S1 Appendix. BIC scores of the tested models, GARIMAX
Acknowledgments
The authors wish to thank Henrik Passmark, Case Officer at The Swedish National Board of Health and Welfare and Jonas Janegren at Statistics Sweden for their great support to collect the dataset. This study was supported by grants from the Swedish Childhood Cancer Fund, the Swedish Cancer Society, the Cancer Society of Stockholm, the Swedish Research Council, the Stockholm Regional Council, Mary Béves Stiftelse för Barncancerforskning, the Swedish Rare Diseases Research foundation (Sällsyntafonden), Berth von Kantzow’s foundation, Karolinska Institutet Research Foundation, Hållsten Research foundation, Stiftelsen Frimurare Barnhuset in Stockholm and the Stockholm County Council. One of the authors of this publication is a member of the European Reference Network on Rare Congenital Malformations and Rare Intellectual Disability ERN-ITHACA [EU Framework Partnership Agreement ID: 3HP-HP-FPA ERN-01-2016/739516]. Funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Footnotes
↵* ann.nordgren{at}ki.se
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.
- 93.
- 94.↵
- 95.↵
- 96.
- 97.↵
- 98.
- 99.↵
- 100.↵
- 101.
- 102.↵
- 103.↵
- 104.
- 105.
- 106.↵
- 107.↵
- 108.↵
- 109.↵
- 110.
- 111.
- 112.
- 113.↵
- 114.↵
- 115.↵
- 116.↵
- 117.↵
- 118.↵
- 119.↵
- 120.↵
- 121.↵
- 122.↵
- 123.↵
- 124.↵
- 125.↵
- 126.↵
- 127.↵
- 128.
- 129.↵
- 130.↵
- 131.↵