Introduction

In December 2019, a novel coronavirus of unknown source was identified in a cluster of patients in the city of Wuhan, Hubei, China1. The outbreak first came to international attention after the World Health Organization (WHO) reported a cluster of pneumonia cases on Twitter on January 4th2 and released an official report on January 5th3. China reported its first COVID-19-related death on January 11th, and the first case outside China was identified on January 13th4. On January 14th, the WHO tweeted that preliminary investigations by the Chinese authorities had identified no human-to-human transmission5. However, the virus quickly spread to other Chinese regions and neighboring countries, and on January 23rd, 2020, the authorities cut off Wuhan, identified as the epicenter of the outbreak6. On January 30th, the WHO declared the epidemic to be a public health emergency1, and the disease caused by the virus received its official name, COVID-19, on February 11th7.

The first serious COVID-19 outbreak in Europe was identified in northern Italy during February, with the country recording its first death on February 21st8. The novel coronavirus was transmitted to all parts of Europe within the next few weeks, and as a result, the WHO declared COVID-19 to be a pandemic on March 11th, 2020. As of 16:48 GMT on April 18th, 2020 (Worldometer9), there were 2,287,369 confirmed cases worldwide, with 157,468 confirmed deaths and 585,838 recovered patients. The most affected countries, with more than 100 k cases each (in absolute numbers, not divided by population), were the US, with 715,105 confirmed cases and 37,889 deaths; Spain, with 191,726 confirmed cases and 20,043 deaths; Italy, with 175,925 confirmed cases and 23,227 deaths; France, with 147,969 confirmed cases and 18,681 deaths; Germany, with 142,614 confirmed cases and 4,405 deaths; and the UK, with 114,217 confirmed cases and 15,464 deaths. The worldwide geographical distribution of COVID-19 cases and deaths by country is depicted in Fig. 1.

Figure 1

Geographical distribution of worldwide COVID-19 cases and deaths as of April 18th (Chartsbin43).

As shown, Europe has been severely affected by COVID-19. However, the spread of the disease now indicates that the center of the epidemic has moved to the US, with the state of New York counting more than 240 k cases and 17 k deaths. Figure 2 shows the distribution of COVID-19 cases and deaths in the United States by state as of April 18th, 2020 (The COVID Tracking Project10).

Figure 2

Geographical distribution of COVID-19 cases and deaths in the US as of April 18th (Pixelmap42).

To find new methods and approaches for disease surveillance, it is crucial to take advantage of real-time internet data. Infodemiology, i.e., information epidemiology, is a concept that was introduced by Gunther Eysenbach11,12. In the field of infodemiology, internet sources and data are employed to inform public health and policy13,14. These approaches have been suggested to be valuable for the monitoring and forecasting of outbreaks and epidemics15, such as Ebola16, Zika17, MERS18, influenza19, and measles20,21.

During the COVID-19 pandemic, several research studies using web-based data have been published. Google Trends, the most popular infodemiology source along with Twitter, has been widely used in health and medicine for the analysis and forecasting of diseases and epidemics22. As of April 20, 2020, seven (7) papers on the topic of monitoring, tracking, and forecasting COVID-19 using Google Trends data had already appeared online in PubMed (advanced search: covid AND google trends)23 for several regions: Taiwan24, China25,26, Europe27,28, the US28,29, and Iran28,30. For Twitter, eight (8) papers related to the COVID-19 pandemic31,32,33,34,35,36,37,38, published from March 13 to April 20, 2020, are available online (PubMed advanced search: covid AND twitter23). Table 1 systematically reports the COVID-19 Google Trends studies in order of the reported publication date.

Table 1 Systematic reporting of publications on COVID-19 using Google Trends as of April 20th, 2020.

In this paper, Google Trends data on the topic of “Coronavirus (virus)” in the United States are employed at both the national and state levels to explore the relationship between COVID-19 cases and deaths and online interest in the virus. First, a correlation analysis between Google Trends and COVID-19 data is performed; then, the role of Google Trends data in the predictability of COVID-19 is explored. To the best of our knowledge, this paper is the first attempt of this kind performed for the United States.

The rest of the paper is structured as follows. The Methods section details the data collection procedure and the statistical analysis tools and methods. The Results section consists of the correlation analysis and of the forecasting models at both national and state levels. The Discussion section presents the main findings of this work, along with the limitations of this paper and future research suggestions.

Methods

Data from the Google Trends platform are retrieved in .csv format39 and are normalized over the selected period. Google Trends reports the adjustment procedure as follows: “Search results are normalized to the time and location of a query by the following process: Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics. Different regions that show the same search interest for a term don't always have the same total search volumes”40. The data collection methodology is designed based on the Google Trends Methodology Framework in Infodemiology and Infoveillance41. Note that the data may vary slightly based on the time of retrieval.
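The normalization quoted above can be illustrated with a minimal Python sketch. It is not Google's implementation: the daily counts are hypothetical, the function name `normalize_to_100` is introduced here for illustration, and the sketch assumes that the largest relative share in the retrieved window is mapped to 100.

```python
def normalize_to_100(shares):
    """Scale relative search shares so that the peak of the window equals 100.

    Assumption: the maximum share in the selected period maps to 100.
    """
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Hypothetical daily counts: searches on the topic and all searches.
topic_searches = [20, 50, 80, 40]
total_searches = [1000, 1000, 2000, 1000]

# Step 1: divide each data point by the total searches it represents.
shares = [t / n for t, n in zip(topic_searches, total_searches)]

# Step 2: rescale the relative shares to the 0-100 range.
index = normalize_to_100(shares)
print(index)
```

Because the scaling depends on the peak within the window, the same day can receive a different value when the retrieval period changes, which is why the retrieval window has to match the period covered by the COVID-19 data.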

For keyword selection, the online interest in all commonly used variations is examined, and the variations are compared, i.e., “coronavirus (virus)”; “COVID-19 (search term)”; “SARS-COV-2 (search term)”; “2019-nCoV (search term)”; and “coronavirus (search term)”. Only “coronavirus (virus)” and “coronavirus (search term)” yield, as expected, considerably high online interest. Between the two, i.e., the topic (virus) and the search term, “coronavirus (virus)” is selected for further analysis.

Data on the worldwide distribution of COVID-19 cases and deaths are retrieved from Worldometer9. Data for the United States analysis of COVID-19 are retrieved from “The COVID Tracking Project”, which provides detailed structured data on COVID-19 cases and deaths nationally and at state level10. Maps of COVID-19 cases and deaths and online interest are created by the authors using the free online tools Pixelmap42 and Chartsbin43, with data from the respective sources9,10, while graphs, spider web charts, and maps of the correlation coefficients are created by the authors using Microsoft Excel (version 16.39).

As Google Trends data are normalized, the timeframe for which search traffic data are retrieved should exactly match the period for which COVID-19 data are available. Therefore, the timeframes for which analysis is performed are different among states, starting either on March 4th (for most cases) or on the date on which the first confirmed case was identified in each state, as shown in Table 2.

Table 2 Timeframes for which Google Trends data are retrieved by state.

Each variable used in this study is divided by its full-sample standard deviation, computed with the standard formula for the standard deviation of a variable. By doing so, the inherent variability of each variable is removed, and thus, all variables have a standard deviation equal to 1. This equivalence makes it possible to compare the strength of the impact of the explanatory variables on the dependent variable. The nonparametric44 unit root test is also applied to reveal whether or not the variables are stationary. The results suggest that both variables can be used directly in the present analysis without further transformation.
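As a minimal illustration of this scaling step, the following Python sketch divides a series by its standard deviation. The data are hypothetical, and the use of the sample standard deviation (denominator n - 1) is an assumption, since the text does not state which variant is used.

```python
import statistics


def scale_by_sd(values):
    """Divide every observation by the full-sample standard deviation.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    The mean is not subtracted; only the scale changes.
    """
    sd = statistics.stdev(values)
    return [v / sd for v in values]


# Hypothetical series.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = scale_by_sd(x)

# After scaling, the standard deviation is 1 (up to rounding error).
print(statistics.stdev(scaled))
```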

The first step in exploring the role of Google Trends in the predictability of COVID-19 is to examine the relationship between Google Trends and the incidence of COVID-19. As Pearson correlation analysis is the benchmark analysis in this kind of approach, the Pearson correlation coefficients (r) between the ratio (COVID-19 deaths)/(COVID-19 cases) and Google Trends data are calculated. In particular, a minimum variance bias-corrected Pearson correlation coefficient45,46 via a bootstrap simulation is applied to deal with the limited number of observations and, therefore, small sample estimation bias (also see45,47). The bias-corrected bootstrap coefficient \({\stackrel{\sim }{\rho }}^{b}\) for the Pearson correlation is given as follows:

$${\stackrel{\sim }{\rho }}^{b}={B}^{-1}\sum_{j=1}^{B}{\stackrel{\sim }{\rho }}_{j}^{b}\left(\rho \right)$$

where \(B\) is the number of bootstrap samples; in this case, following48, it is set equal to 999. Note that the terms “COVID-19 deaths” and “COVID-19 cases” refer to the cumulative (total) COVID-19 deaths and cases in the United States and that this terminology is used hereafter unless otherwise stated.
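The averaging over \(B\) bootstrap replicates can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it averages plain bootstrap replicates of the Pearson coefficient obtained by resampling pairs, and it does not reproduce the minimum variance bias correction of refs. 45 and 46. The function names and the data are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_pearson(x, y, n_boot=999, seed=0):
    """Return the sample coefficient and the mean of n_boot bootstrap replicates."""
    rng = random.Random(seed)
    n = len(x)
    replicates = []
    while len(replicates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # the correlation is undefined for a constant resample
        replicates.append(pearson(xs, ys))
    return pearson(x, y), statistics.fmean(replicates)


# Hypothetical data: a search index and a steadily declining ratio.
x = list(range(1, 13))
y = [10, 9.5, 9.7, 8, 8.2, 7, 6.5, 6.8, 5, 4.2, 4.5, 3]
r_hat, boot_mean = bootstrap_pearson(x, y)
print(r_hat, boot_mean)
```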

Next, a secondary correlation analysis is performed using the Kendall rank correlation, which is a nonparametric test that measures the strength of dependence between two variables. The Kendall rank correlation is distribution free and is considered robust for ratio data. For two samples of size \(n\) each, the total number of pairings is \(\frac{1}{2}n(n-1)\). The following formula is used to calculate the value of the bias-corrected Kendall rank correlation:

$${\stackrel{\sim }{\tau }}^{b}={B}^{-1}\sum_{j=1}^{B}{\stackrel{\sim }{\tau }}_{j}^{b}\left(\tau \right)$$

where \(\tau\) is given by \(\tau =\frac{{n}_{c}- {n}_{d}}{\frac{1}{2}n(n-1)}\), \({n}_{c}\) is the number of concordant pairs, and \({n}_{d}\) is the number of discordant pairs.
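The formula for \(\tau\) can be checked with a short Python sketch that counts concordant and discordant pairs directly. The function name and the data are hypothetical, and tied pairs are counted as neither concordant nor discordant, which is an assumption (the text does not describe tie handling).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation: (n_c - n_d) / (n (n - 1) / 2).

    Assumption: tied pairs count as neither concordant nor discordant.
    """
    n_c = n_d = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            n_c += 1  # both variables move in the same direction
        elif sign < 0:
            n_d += 1  # the variables move in opposite directions
    n = len(x)
    return (n_c - n_d) / (0.5 * n * (n - 1))


# Hypothetical data: 5 observations give 10 pairings,
# of which 1 is concordant and 9 are discordant.
tau = kendall_tau([1, 2, 3, 4, 5], [5, 3, 4, 2, 1])
print(tau)  # (1 - 9) / 10 = -0.8
```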

Next, a COVID-19 predictability analysis based on Google Trends time series is performed for the United States and all US states (plus DC). The predictability model is a quantile regression, which is considered to be a robust regression analysis against the presence of outliers in the sample; it was introduced by49. Building on the study conducted by46, a quantile regression that is bias corrected via balanced bootstrapping is employed. Such a model is an appropriate statistical approach for mitigating small sample estimation bias and the presence of outliers in the dataset, as it combines the advantages of bootstrap standard errors and the merits of quantile regression. Additional background on quantile regression can be found in the studies conducted by50 and51, while recent applications of quantile regression can be found in52,53. More recently, unconditional quantile regression was introduced in54, while the study by55 provides further insights into robust estimates of regressions.

Let \({Y}_{t},\) with \(t\in T\), be a time series that represents the dependent variable, supposing a bivariate specification. Quantile regression estimates the impact of the explanatory variable \({X}_{t}\), with \(t\in T\), on the variable \({Y}_{t}\) at different points of the conditional \(q\)-quantile, with \(q\in \left(0,1\right)\), of the conditional distribution. A value of the \(q\)-quantile close to zero and a value of the \(q\)-quantile close to one represent the left (lower) and right (upper) tails of the conditional distribution, respectively. The conditional quantile function is defined as follows:

$$Q_{Y|X} \left( q \right) = X^{\prime } \beta_{q}$$

Given the distribution of \({Y}_{t}\), the coefficient vector \({\beta }_{q}\) of the conditional quantile function can be obtained by solving the following minimization problem:

$${\beta }_{q}=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}E\left({\rho }_{q}\left(Y-{X}^{\prime }\beta \right)\right)$$

where \({\rho }_{q}\left(y\right)=y\left(q-{1}_{\left\{y<0\right\}}\right)\) represents the loss function.
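A minimal Python sketch of this loss function shows how it weights positive and negative residuals. The function name `check_loss` is introduced here for illustration.

```python
def check_loss(u, q):
    """Quantile (check) loss: rho_q(u) = u * (q - 1{u < 0})."""
    indicator = 1.0 if u < 0 else 0.0
    return u * (q - indicator)


# For the median (q = 0.5), residuals of either sign are weighted equally,
# so the loss is half the absolute residual.
print(check_loss(2.0, 0.5), check_loss(-2.0, 0.5))

# For q = 0.9, positive residuals (under-prediction) are penalized
# nine times as heavily as negative residuals of the same size.
print(check_loss(2.0, 0.9), check_loss(-2.0, 0.9))
```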

Replacing the expectation with its sample analog over the observations \(\left\{{y}_{1},\dots ,{y}_{n}\right\}\), the estimator of \({\beta }_{q}\) for the \({q}^{th}\) quantile takes the following form:

$$\beta_{q} = \arg\min_{\beta \in \mathbb{R}^{k}} \sum_{t=1}^{n} \rho_{q}\left( Y_{t} - X_{t}^{\prime}\beta \right) = \arg\min_{\beta \in \mathbb{R}^{k}} \left[ q \sum_{Y_{t} \ge X_{t}^{\prime}\beta} \left| Y_{t} - X_{t}^{\prime}\beta \right| + \left( 1 - q \right) \sum_{Y_{t} < X_{t}^{\prime}\beta} \left| Y_{t} - X_{t}^{\prime}\beta \right| \right]$$

where \({X}_{t}^{\prime }\beta\) is an approximation of the conditional \(q\)-quantile of the variable \({Y}_{t}\).
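For a bivariate model with an intercept and a single slope, this minimization can be illustrated with a brute-force Python sketch. It is not how the "quantreg" package solves the problem; it relies on the linear programming property that, for two coefficients, some optimal line passes through at least two observations, so it is enough to evaluate the objective on every line through a pair of points. The function names and the data are hypothetical.

```python
from itertools import combinations


def check_loss(u, q):
    """Quantile (check) loss: rho_q(u) = u * (q - 1{u < 0})."""
    return u * (q - (1.0 if u < 0 else 0.0))


def quantile_line(x, y, q=0.5):
    """Brute-force quantile regression of y on (1, x).

    Returns (total loss, intercept, slope) of the best line found among
    all lines passing through two observations with distinct x values.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(check_loss(yi - (intercept + slope * xi), q)
                   for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best


# Hypothetical data: four points on y = 1 + 2x and one extreme outlier.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 100]
loss, intercept, slope = quantile_line(x, y, q=0.5)
print(intercept, slope, loss)
```

In this example, the median regression line stays on the four regular points and leaves the outlier with a large residual, whereas a least squares fit would be pulled toward the outlier.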

In our analysis, \({Y}_{t}\) stands for the ratio (COVID-19 deaths)/(COVID-19 cases), \({X}_{t-1}\) is the respective Google Trends value lagged by one period, and \(t=1,\dots ,T\), with \(T\) being the respective number of observations. A linear trend is also included in the model.
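The alignment of the dependent variable with the lagged regressor can be sketched in Python as follows. The function name, the dictionary keys, and the values are hypothetical; the sketch assumes a lag of one period and a trend that simply counts periods.

```python
def build_lagged_design(ratio, trends):
    """Pair Y_t with X_{t-1} and a linear trend term.

    The first observation is dropped because it has no lagged value.
    """
    if len(ratio) != len(trends):
        raise ValueError("both series must cover the same period")
    rows = []
    for t in range(1, len(ratio)):
        rows.append({"y": ratio[t], "x_lag": trends[t - 1], "trend": t})
    return rows


# Hypothetical deaths/cases ratios and Google Trends values.
ratio = [0.01, 0.02, 0.03]
trends = [80, 90, 100]
rows = build_lagged_design(ratio, trends)
for row in rows:
    print(row)
```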

Finally, the bias-corrected parameter is estimated as follows:

$${\stackrel{\sim }{\beta }}^{b}\left(q\right)=\widehat{\beta }\left(q\right)-\widehat{bias}\left(\widehat{\beta }\left(q\right)\right)$$

where \(\widehat{bias}\left(\widehat{\beta }\left(q\right)\right)\) is given by \({B}^{-1}{\sum }_{j=1}^{B}{\widehat{\beta }}_{j}^{*}\left(q\right)-\widehat{\beta }\left(q\right)\) and \(q\in (0, 1)\) denotes the quantile considered, which in this case is set equal to 0.5 (median). Median regression is considered more robust to outliers than, for example, least squares regression. It also avoids assumptions about the parametric distribution of the errors56.
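The bias correction can be sketched in Python with a balanced bootstrap, in which every observation appears exactly \(B\) times across all resamples. This is an illustration under stated assumptions rather than the authors' R code: the estimator is the sample median (a median regression with an intercept only), the function names and data are hypothetical, and a small number of replicates is used to keep the example short.

```python
import random
import statistics
from collections import Counter


def balanced_bootstrap_indices(n, n_boot, seed=0):
    """Balanced resampling: shuffle n_boot copies of 0..n-1, then cut into blocks."""
    rng = random.Random(seed)
    pool = list(range(n)) * n_boot
    rng.shuffle(pool)
    return [pool[j * n:(j + 1) * n] for j in range(n_boot)]


def bias_corrected(estimator, data, n_boot=999, seed=0):
    """Return (corrected estimate, estimated bias).

    bias = mean of bootstrap estimates - original estimate
    corrected = original estimate - bias
    """
    theta_hat = estimator(data)
    replicates = [estimator([data[i] for i in idx])
                  for idx in balanced_bootstrap_indices(len(data), n_boot, seed)]
    bias = statistics.fmean(replicates) - theta_hat
    return theta_hat - bias, bias


# Hypothetical sample.
data = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9]

# Every observation is used exactly n_boot times across the resamples.
resamples = balanced_bootstrap_indices(len(data), n_boot=99)
counts = Counter(i for idx in resamples for i in idx)

corrected, bias = bias_corrected(statistics.median, data, n_boot=99)
print(corrected, bias)
```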

All estimation results reported in this paper were computed in the R programming environment57. In particular, we employed the R packages "quantreg" and "boot" to compute the quantile regression estimates and to perform the bootstrapping, respectively. The code is available in a “Supplementary Online Material file”.

Results

Figure 3 depicts the worldwide and US online interest in terms of Google queries in the “coronavirus (virus)” topic from January 22nd to April 15th, 2020. It shows that this topic is very popular, especially in Europe and North America. Specifically, interest in the United States is considerably high (above 70) for all US states.

Figure 3

Heat maps of the worldwide and US online interest in “Coronavirus (Virus)” (Chartsbin43).

To perform a first assessment of the relationship between Google Trends and COVID-19 data, the Pearson and Kendall rank correlations between the two variables are calculated, and the results are further compared. Tables 3 and 4 present the results of the Pearson and Kendall correlation analysis by state, respectively.

Table 3 Pearson correlation analysis by state.
Table 4 Kendall rank correlation analysis by state.

As reported in Table 3, statistically significant correlations are observed for the United States and for the states of Alabama, Arkansas, California, Colorado, Florida, Georgia, Illinois, Kentucky, Massachusetts, Minnesota, Nebraska, Nevada, New Hampshire, New York, North Carolina, Oregon, Pennsylvania, South Dakota, Tennessee, Vermont, Virginia, Washington, Wisconsin, and Wyoming as well as DC. The states of Iowa, Louisiana, Maine, Mississippi, Missouri, North Dakota, South Carolina, and Utah marginally fail to reach the p < 0.1 threshold of statistical significance, i.e., \(p\in (0.1, 0.2)\).

Based on the Kendall correlation analysis, statistically significant correlations are observed for the United States and for the states of Alaska, Arizona, Arkansas, California, Connecticut, Florida, Georgia, Hawaii, Iowa, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Tennessee, Utah, Vermont, Virginia, Washington, and Wisconsin as well as DC. Figure 4 depicts the heat map of the (a) Pearson and (b) Kendall correlation coefficients in the United States by state over the period examined.

Figure 4

Heat map of the (a) Pearson and (b) Kendall correlation coefficients by state (Microsoft Excel).

As depicted in the heat maps and in the spider web charts for the respective correlation analyses in Fig. 5, visual comparison of the two approaches indicates that the results are consistent in both analyses.

Figure 5

Radar chart of the (a) Pearson and (b) Kendall correlation coefficients by state (Microsoft Excel).

However, the main purpose of this study is to explore the predictability of COVID-19 using Google Trends data in the United States. Proceeding with the results of the predictability analysis, Fig. 6 depicts the heat map for \({\beta }_{1}\) by state, while Table 5 presents the quantile regression estimated predictability models for the US and for each US state (plus DC). As shown, the estimated Google Trends models exhibit strong COVID-19 predictability.

Figure 6

Heat map of \({\beta }_{1}\) of the predictability analysis models by state (Microsoft Excel).

Table 5 Predictability analysis by state.

Note that due to the low number of observations, the states of Maine, Montana, North Dakota, West Virginia, and Wyoming are not included in the predictability analysis results, but they are given the value “zero (0)” to be included in the heat map for purposes of uniformity.

Discussion

As of July 29th, 2020, there were 16,920,857 recorded COVID-19 cases worldwide, with the reported death toll at 664,141 and the number of recovered patients at 10,485,316 (Worldometer9). In light of the COVID-19 pandemic and to find new ways of forecasting the spread of the disease, infodemiology approaches have provided valuable input in monitoring and forecasting the development of the COVID-19 pandemic over time and in measuring and analyzing the public’s awareness and response. Google Trends and Twitter have been identified as the most popular infodemiology sources, while other social media, such as Facebook and Instagram, exhibit promising results in analyzing users’ online behavioral patterns13.

Social media platforms can provide us with more qualitative data that can shift the focus to other directions. Such approaches include sentiment analysis, educational purposes, and efforts to measure and raise public awareness. Recent approaches to analyzing aspects of the COVID-19 pandemic using social media data include monitoring the Twitter usage of G7 leaders58, monitoring self-reported symptoms on Twitter59, and analyzing the public perception of the disease through Facebook60. Moreover, infodemiology sources have provided valuable input in recruiting online survey participants through Facebook to measure individuals’ COVID-19 confidence levels61 and in assessing the behavioral variations in COVID-19-related online search traffic in more than one search engine62. Finally, commentaries that make recommendations on the integration of other social media platforms, such as Facebook, Reddit, and TikTok, for disseminating medical information to inform public health and policy have been published63.

Google Trends offers a solid foundation for quantitative analysis with respect to the monitoring and predictability of COVID-19, as in the analysis presented in this study, where Google Trends data on the “coronavirus (virus)” topic were used to explore the predictability of COVID-19 in the United States at both national and state level. First, for a preliminary assessment of the relationship between Google Trends and COVID-19 data, Pearson correlation and Kendall rank correlation analyses were performed. Statistically significant correlations were observed for the United States and for several US states, which is in line with previous studies that argue that there is a relationship between Google Trends and COVID-19 data.

The COVID-19 predictability analysis, which used a quantile regression approach, exhibits very promising results and indicates the most important contribution of this study to the international literature: detecting and predicting the early spread of COVID-19 at the regional level. This contribution can be a substantial supplement in further assisting local authorities in taking the appropriate measures to handle the spread of the disease.

Figure 7 shows the COVID-19 deaths/cases ratio, daily COVID-19 deaths, daily COVID-19 cases, and the respective Google Trends normalized data in the United States from March 4th to April 15th, 2020. For purposes of consistency in the graph, the COVID-19-related time series are normalized on a 0–100 scale. As depicted in the graph and confirmed by the predictability analysis, the two variables are not linearly dependent. Instead, they exhibit an inverse relationship, meaning that as COVID-19 progresses, the online interest in the virus decreases.

Figure 7

COVID-19 and Google Trends data from March 4th to April 15th in the US (Microsoft Excel).

From a behavioral point of view, this result can be explained as follows. First, online interest starts to increase and reaches a peak as the number of confirmed cases becomes high and as the death rates start to show that the pandemic does indeed have severe consequences. However, after a certain period, the interest reverses course, which could also indicate that the public is overwhelmed by information overload and decreases its information “intake”. The spike in Google queries and the decline in the ratio of COVID-19 deaths/cases could be attributed to the spread of the virus over these days and the “delay” in deaths. That is, cases increase while the total number of deaths has not yet started to increase considerably.

The latter point is in line with previous work on the topic27 suggesting that although significant correlations between COVID-19 and Google data are observed, the relationship tends to decrease in both strength and significance in regions that have been affected by COVID-19 as we move forward in time because the interest in the virus decreases. This decrease is counterintuitive and occurs before the case and death curves start to exhibit a downward trend, i.e., when a region is being heavily affected, independent of whether or not it has reached its peak. However, it would be interesting for future investigators to explore the relationship from this point onwards since, as shown in Fig. 7, the lines converge, with this convergence being indicative of a future change in the relationship dynamics when deaths peak at a later point and when they start their downward course as well.

The above can partly explain the differences in signs among states in both the Pearson and Kendall rank correlation coefficients. A more in-depth explanation from a statistical perspective is that the Pearson correlation coefficient is computed from the products of the deviations of the observations from their sample means. Observations in the tails of the distribution carry the same weight as all other observations, and therefore, outliers can affect the estimates, especially in small samples. With this in mind, this study employs a bootstrap bias-corrected approach, but the main conclusions are based on quantile regressions. Unlike linear measures of dependency, quantile regression is considered superior in sampling situations and more resistant to outliers than linear regression, the Pearson correlation, or the Kendall rank correlation64. Taking into account that the current pandemic is a dynamic process that constantly evolves and has a serious social impact, it is very probable that several data anomalies (e.g., due to non-pharmaceutical interventions) exist now or could develop at a later stage; therefore, formal statistical tools such as the Pearson and Kendall rank correlations should be carefully interpreted.

This study has limitations. First, data from only one search engine are considered. Although Google is the most popular search engine, data on the coronavirus topic from other search engines were not included in this analysis. Second, the data at this point are very limited, and the results are based on few observations. Third, the 50 (+ 1) states exhibit diversity in terms of confirmed cases and deaths. Therefore, any conclusions drawn from this analysis refer to each case individually. Despite the known limitations of online search traffic data, the use of infodemiology metrics for informing public health and policy in general and for monitoring outbreaks and epidemics in particular has received wide attention.

To dynamically find the determinants of COVID-19, the predictability analysis in this study provides insights into how online search traffic data can play a considerable role in forming public health policies, especially in times of epidemics and outbreaks, when real-time data are essential. With the COVID-19 pandemic, the world is in uncharted territory, both socially and economically. This situation calls for immediate action and open research and data, and the term “multidisciplinary” has never before been more important. To that end, the role of big data in providing “opportunities for performing modeling studies of viral activity and for guiding individual country healthcare policymakers to enhance preparation for the outbreak” has been acknowledged65, and current research on the subject should focus on both exploring the role of other infodemiology variables in the predictability of COVID-19 and combining infodemiology sources with traditional sources to explore the full potential of what online real-time data have to offer for disease surveillance.