COVID-19 predictability in the United States using Google Trends time series

Mavragani, Amaryllis; Gkillas, Konstantinos

doi:10.1038/s41598-020-77275-9

Article
Open access
Published: 26 November 2020

Scientific Reports volume 10, Article number: 20693 (2020)


Abstract

During the unprecedented situation that all countries around the globe are facing due to the Coronavirus disease 2019 (COVID-19) pandemic, which has also had severe socioeconomic consequences, it is imperative to explore novel approaches to monitoring and forecasting regional outbreaks as they happen or even before they do so. To that end, in this paper, the role of Google query data in the predictability of COVID-19 in the United States at both national and state level is presented. As a preliminary investigation, Pearson and Kendall rank correlations are examined to explore the relationship between Google Trends data and COVID-19 data on cases and deaths. Next, a COVID-19 predictability analysis is performed, with the employed model being a quantile regression that is bias corrected via bootstrap simulation, i.e., a robust regression approach that guards against the presence of outliers in the sample while also mitigating small-sample estimation bias. The results indicate that there are statistically significant correlations between Google Trends and COVID-19 data, while the estimated models exhibit strong COVID-19 predictability. In line with previous work that has suggested that online real-time data are valuable in the monitoring and forecasting of epidemics and outbreaks, it is evident that such infodemiology approaches can assist public health policy makers in addressing the most crucial issues: flattening the curve, allocating health resources, and increasing the effectiveness and preparedness of their respective health care systems.

Introduction

In December 2019, a novel coronavirus of unknown source was identified in a cluster of patients in the city of Wuhan, Hubei, China¹. The outbreak first came to international attention after the World Health Organization (WHO) reported on Twitter on January 4th that there was a cluster of pneumonia cases², followed by the release of an official report on January 5th³. China reported its first COVID-19-related death on January 11th, while on January 13th, the first case outside China was identified⁴. On January 14th, the WHO tweeted that Chinese preliminary investigations reported that no human-to-human transmission had been identified⁵. However, the virus quickly spread to other Chinese regions and neighboring countries, while Wuhan, identified as the epicenter of the outbreak, was cut off by authorities on January 23rd, 2020⁶. On January 30th, the WHO declared the epidemic to be a public health emergency¹, and the disease caused by the virus received its official name, that is, COVID-19, on February 11th⁷.

The first serious COVID-19 outbreak in Europe was identified in northern Italy during February, with the country recording its first death on February 21st⁸. The novel coronavirus was transmitted to all parts of Europe within the next few weeks, and as a result, the WHO declared COVID-19 to be a pandemic on March 11th, 2020. As of 16:48 GMT on April 18th, 2020⁹, there were 2,287,369 confirmed cases worldwide, with 157,468 confirmed deaths and 585,838 recovered patients. The most affected countries with more than 100 k cases (in absolute numbers, not divided by population) were the US, with 715,105 confirmed cases and 37,889 deaths; Spain, with 191,726 confirmed cases and 20,043 deaths; Italy, with 175,925 confirmed cases and 23,227 deaths; France, with 147,969 confirmed cases and 18,681 deaths; Germany, with 142,614 confirmed cases and 4,405 deaths; and the UK, with 114,217 confirmed cases and 15,464 deaths. The worldwide geographical distribution of COVID-19 cases and deaths by country is depicted in Fig. 1.

As shown, Europe has been severely affected by COVID-19. However, the spread of the disease now indicates that the center of the epidemic has moved to the US, with the state of New York counting more than 240 k cases and 17 k deaths. Figure 2 shows the distribution of COVID-19 cases and deaths in the United States by state as of April 18th, 2020¹⁰.

To find new methods and approaches for disease surveillance, it is crucial to take advantage of real-time internet data. Infodemiology, i.e., information epidemiology, is a concept that was introduced by Gunther Eysenbach¹¹,¹². In the field of infodemiology, internet sources and data are employed to inform public health and policy¹³,¹⁴. These approaches have been suggested to be valuable for the monitoring and forecasting of outbreaks and epidemics¹⁵, such as Ebola¹⁶, Zika¹⁷, MERS¹⁸, influenza¹⁹, and measles²⁰,²¹.

During the COVID-19 pandemic, several research studies using web-based data have been published. Google Trends, the most popular infodemiology source along with Twitter, has been widely used in health and medicine for the analysis and forecasting of diseases and epidemics²². As of April 20, 2020, seven (7) papers on the topic of monitoring, tracking, and forecasting COVID-19 using Google Trends data had already appeared online in PubMed (advanced search: covid AND google trends)²³ for several regions: Taiwan²⁴, China²⁵,²⁶, Europe²⁷,²⁸, the US²⁸,²⁹, and Iran²⁸,³⁰. Note that for Twitter publications related to the COVID-19 pandemic, eight (8) papers published from March 13, 2020 to April 20, 2020³¹,³²,³³,³⁴,³⁵,³⁶,³⁷,³⁸ are available online (PubMed advanced search: covid AND twitter²³). Table 1 systematically reports these COVID-19 Google Trends studies, in order of the reported publication date.

Table 1 Systematic reporting of publications on COVID-19 using Google Trends as of April 20th, 2020.

In this paper, Google Trends data on the topic of “Coronavirus (virus)” in the United States are employed at both the national and state levels to explore the relationship between COVID-19 cases and deaths and online interest in the virus. First, a correlation analysis between Google Trends and COVID-19 data is performed; then, the role of Google Trends data in the predictability of COVID-19 is explored. To the best of our knowledge, this paper is the first attempt of this kind performed for the United States.

The rest of the paper is structured as follows. The Methods section details the data collection procedure and the statistical analysis tools and methods. The Results section consists of the correlation analysis and of the forecasting models at both national and state levels. The Discussion section presents the main findings of this work, along with the limitations of this paper and future research suggestions.

Methods

Data from the Google Trends platform are retrieved in .csv format³⁹ and are normalized over the selected period. Google Trends reports the adjustment procedure as follows: “Search results are normalized to the time and location of a query by the following process: Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics. Different regions that show the same search interest for a term don't always have the same total search volumes”⁴⁰. The data collection methodology is designed based on the Google Trends Methodology Framework in Infodemiology and Infoveillance⁴¹. Note that the data may vary slightly depending on the time of retrieval.
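The two-step adjustment quoted above can be illustrated with a minimal sketch. Google does not publish raw query counts, so the counts below and the function name `normalize_search_interest` are hypothetical, and mapping the highest share in the window to 100 is an assumption about how the 0 to 100 scaling is applied.

```python
def normalize_search_interest(term_counts, total_counts):
    """Turn raw counts into a 0-100 relative-interest index.

    term_counts[i]  : hypothetical number of searches for the topic in period i
    total_counts[i] : hypothetical number of all searches in period i
    """
    if len(term_counts) != len(total_counts):
        raise ValueError("both series must have the same length")
    # Step 1: relative popularity = topic searches / all searches in that period.
    shares = [t / n if n > 0 else 0.0 for t, n in zip(term_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0.0 for _ in shares]
    # Step 2 (assumed): rescale so the period with the highest share equals 100.
    return [100.0 * s / peak for s in shares]


# Illustrative numbers only, not real Google data.
index = normalize_search_interest([50, 200, 120], [1000, 2000, 3000])
```

Under this assumption the index is relative to the peak inside the selected window, so retrieving a different timeframe can change every value in the series.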

For keyword selection, the online interest in all commonly used variations is examined, and the variations are compared, i.e., “coronavirus (virus)”; “COVID-19 (search term)”; “SARS-COV-2 (search term)”; “2019-nCoV (search term)”; and “coronavirus (search term)”. Only “coronavirus (virus)” and “coronavirus (search term)” yield, as expected, considerably high online interest. Between the two, i.e., the topic (virus) and the search term, “coronavirus (virus)” is selected for further analysis.

Data on the worldwide distribution of COVID-19 cases and deaths are retrieved from Worldometer⁹. Data for the United States analysis of COVID-19 are retrieved from “The COVID Tracking Project”, which provides detailed structured data on COVID-19 cases and deaths nationally and at state level¹⁰. Maps of COVID-19 cases and deaths and online interest are created by the authors using the free online tools Pixelmap⁴² and Chartsbin⁴³, with data from the respective sources⁹,¹⁰, while graphs, spider web charts, and maps of the correlation coefficients are created by the authors using Microsoft Excel (version 16.39).

As Google Trends data are normalized, the timeframe for which search traffic data are retrieved should exactly match the period for which COVID-19 data are available. Therefore, the timeframes for which analysis is performed are different among states, starting either on March 4th (for most cases) or on the date on which the first confirmed case was identified in each state, as shown in Table 2.

Table 2 Timeframes for which Google Trends data are retrieved by state.

Each variable used in this study is divided by its full-sample standard deviation, computed with the standard formula. By doing so, the inherent variability of each variable is removed, and thus, all variables have a standard deviation equal to 1. This equivalence makes it possible to compare the strength of the impact of the explanatory variables used on the dependent variable. The nonparametric⁴⁴ unit root test is also applied to reveal whether or not the variables are stationary. The results suggest that both variables can be used directly in the present analysis without further transformation.
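The scaling step can be sketched in a few lines. This is a minimal illustration, assuming the sample standard deviation (denominator n − 1); the text does not state which denominator was used, and the function name `scale_by_std` and the data are hypothetical.

```python
import statistics


def scale_by_std(values):
    """Divide a series by its full-sample standard deviation."""
    sd = statistics.stdev(values)  # assumption: sample SD with n - 1 denominator
    if sd == 0:
        raise ValueError("a constant series cannot be scaled to unit variance")
    return [v / sd for v in values]


# Illustrative values only.
trends = [100.0, 80.0, 65.0, 50.0, 40.0]
scaled = scale_by_std(trends)
unit_sd = statistics.stdev(scaled)  # equals 1 up to floating-point error
```

Note that the series is only rescaled, not centered, so its mean is divided by the same constant.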

The first step in exploring the role of Google Trends in the predictability of COVID-19 is to examine the relationship between Google Trends and the incidence of COVID-19. As Pearson correlation analysis is the benchmark analysis in this kind of approach, the Pearson correlation coefficients (r) between the ratio (COVID-19 deaths)/(COVID-19 cases) and Google Trends data are calculated. In particular, a minimum variance bias-corrected Pearson correlation coefficient⁴⁵,⁴⁶ via a bootstrap simulation is applied to deal with the limited number of observations and, therefore, small sample estimation bias (also see⁴⁵,⁴⁷). The bias-corrected bootstrap coefficient ${\stackrel{\sim }{\rho }}^{b}$ for the Pearson correlation is given as follows:

$${\stackrel{\sim }{\rho }}^{b}={B}^{-1}\sum_{j=1}^{B}{\stackrel{\sim }{\rho }}_{j}^{b}\left(\rho \right)$$

where $B$ corresponds to the number of bootstrap samples; in this case, it is set equal to 999⁴⁸. Note that the terms “COVID-19 deaths” and “COVID-19 cases” refer to the cumulative (total) COVID-19 deaths and cases in the United States and that this terminology is used hereafter unless otherwise stated.
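The averaging in the formula above can be sketched as follows. This is a minimal illustration, not the procedure of the cited references: it resamples (x, y) pairs with replacement and averages the Pearson coefficient over B replicates, which ignores any serial dependence in the series. The function names and the data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns None if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_pearson(x, y, n_boot=999, seed=0):
    """Average of the Pearson coefficient over n_boot paired resamples."""
    if len(x) != len(y) or pearson(x, y) is None:
        raise ValueError("need two non-constant series of equal length")
    rng = random.Random(seed)
    n = len(x)
    reps = []
    while len(reps) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs jointly
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            reps.append(r)
    return sum(reps) / n_boot


# Illustrative data only: search interest falls while the ratio rises.
interest = [100, 92, 85, 80, 71, 66, 60, 52, 47, 41]
ratio = [0.010, 0.012, 0.015, 0.019, 0.022, 0.027, 0.031, 0.034, 0.039, 0.043]
rho_b = bootstrap_pearson(interest, ratio, n_boot=999)
```

Fixing the seed makes the estimate reproducible; a different seed would give a slightly different value.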

Next, a secondary correlation analysis is performed using the Kendall rank correlation, which is a nonparametric measure of the strength of dependence between two variables. The Kendall rank correlation is distribution free and is considered robust for ratio data. Considering two samples, each of size $n$, the total number of pairings is $\frac{1}{2}n(n-1)$. The following formula is used to calculate the value of the bias-corrected Kendall rank correlation:

$${\stackrel{\sim }{\tau }}^{b}={B}^{-1}\sum_{j=1}^{B}{\stackrel{\sim }{\tau }}_{j}^{b}\left(\tau \right)$$

where $\tau$ is given by $\tau =\frac{{n}_{c}- {n}_{d}}{\frac{1}{2}n(n-1)}$, ${n}_{c}$ is the number of concordant pairs, and ${n}_{d}$ is the number of discordant pairs.
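The definition of $\tau$ can be checked by counting pairs directly. The sketch below implements exactly the formula above (often called tau-a), in which tied pairs count as neither concordant nor discordant; the function name is hypothetical.

```python
def kendall_tau(x, y):
    """Kendall rank correlation: (n_c - n_d) / (n (n - 1) / 2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    concordant = discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1   # both series move in the same direction
            elif s < 0:
                discordant += 1   # the series move in opposite directions
            # s == 0: tie in x or y, counted in neither group
    return (concordant - discordant) / (0.5 * n * (n - 1))


# Three observations: two concordant pairs, one discordant pair.
tau = kendall_tau([1, 2, 3], [1, 3, 2])  # (2 - 1) / 3
```

The bootstrap average in the preceding equation would wrap this function in the same resampling loop used for the Pearson coefficient.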

Next, a COVID-19 predictability analysis based on Google Trends time series is performed for the United States and all US states (plus DC). The predictability model is a quantile regression, which is considered to be a robust regression analysis against the presence of outliers in the sample; it was introduced in⁴⁹. Building on the study in⁴⁶, a quantile regression that is bias corrected via balanced bootstrapping is employed. Such a model is an appropriate statistical approach for mitigating small-sample estimation bias and the influence of outliers in the dataset, as it combines the advantages of bootstrap standard errors and the merits of quantile regression. Additional background on quantile regression can be found in⁵⁰ and⁵¹, while recent applications of quantile regression can be found in⁵²,⁵³. More recently, unconditional quantile regression was introduced in⁵⁴, while the study in⁵⁵ provides further insights into robust estimates of regressions.

Let ${Y}_{t}$, with $t\in T$, be a time series that represents the dependent variable in a bivariate specification. Quantile regression estimates the impact of the explanatory variable ${X}_{t}$, with $t\in T$, on the variable ${Y}_{t}$ at different quantiles $q$, with $q\in \left(0,1\right)$, of the conditional distribution of ${Y}_{t}$ given ${X}_{t}$. Values of $q$ close to zero and close to one represent the left (lower) and right (upper) tails of the conditional distribution, respectively. The conditional quantile function is defined as follows:

$$Q_{Y|X} \left( q \right) = {\text{X}}^{\prime } \beta_{q}$$

Given the distribution of ${Y}_{t}$, the coefficient vector ${\beta }_{q}$ of the conditional quantile function can be obtained by solving the following minimization problem:

$${\beta }_{q}=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}E\left({\rho }_{q}\left(Y-X\beta \right)\right)$$

where ${\rho }_{q}\left(y\right)=y\left(q-{1}_{\left\{y<0\right\}}\right)$ represents the loss function.

Replacing the expectation with its sample analog over the observations $\left\{{y}_{1},\dots ,{y}_{n}\right\}$, the estimator of ${\beta }_{q}$ for the ${q}^{th}$ quantile takes the following form:

$$\beta_{q}=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}\sum_{t=1}^{n}\rho_{q}\left(Y_{t}-X_{t}^{\prime}\beta \right)=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}\left[q\sum_{Y_{t}\ge \beta X_{t}}\left|Y_{t}-\beta X_{t}\right|+\left(1-q\right)\sum_{Y_{t}<\beta X_{t}}\left|Y_{t}-\beta X_{t}\right|\right]$$

where $\beta {X}_{t}$ is an approximation of the conditional $q$-quantile of the variable ${Y}_{t}$.
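The minimization above can be made concrete for the bivariate case with an intercept. The sketch below is not the "quantreg" implementation; it uses a brute-force search that relies on the linear-programming property that some minimizer of the check loss passes through two data points, which is practical only for small samples. Function names and data are hypothetical.

```python
from itertools import combinations


def check_loss(u, q):
    """rho_q(u) = u * (q - 1{u < 0})."""
    return u * (q - (1.0 if u < 0 else 0.0))


def quantile_fit(x, y, q=0.5):
    """Return (intercept, slope) minimizing sum_t rho_q(y_t - a - b * x_t)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    best = None
    # Try every line through two observations with distinct x values.
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = sum(check_loss(yt - a - b * xt, q) for xt, yt in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Four points on the line y = 1 + 2x plus one gross outlier.
x_demo = [0, 1, 2, 3, 4]
y_demo = [1, 3, 5, 7, 100]
intercept, slope = quantile_fit(x_demo, y_demo, q=0.5)
```

In this example the median fit recovers the line through the four regular points, whereas a least squares fit would be pulled toward the outlier.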

In our analysis, ${Y}_{t}$ stands for the ratio (COVID-19 deaths)/(COVID-19 cases), ${X}_{t-1}$ is the respective Google Trends value lagged by one period, and $t=1,\dots ,T$, with $T$ being the respective number of observations. A linear trend is included as well.
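Aligning the dependent variable with the lagged regressor is a common source of off-by-one errors, so a small sketch may help. It assumes daily cumulative series of equal length and a lag of one observation; the function name `build_lagged_pairs` and the numbers are hypothetical.

```python
def build_lagged_pairs(deaths, cases, trends, lag=1):
    """Pair Y_t = deaths_t / cases_t with X_{t-lag} = trends_{t-lag}."""
    n = len(deaths)
    if not (n == len(cases) == len(trends)):
        raise ValueError("all series must have the same length")
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < number of observations")
    if any(c <= 0 for c in cases):
        raise ValueError("cumulative cases must be positive")
    ratio = [d / c for d, c in zip(deaths, cases)]
    y = ratio[lag:]          # Y_t for t = lag, ..., n - 1
    x = trends[:n - lag]     # X_{t - lag} for the same t
    trend = list(range(1, len(y) + 1))  # linear time trend 1, 2, ...
    return x, y, trend


# Illustrative cumulative counts and search-interest values only.
x_lag, y_ratio, t_index = build_lagged_pairs(
    deaths=[1, 2, 4], cases=[10, 20, 20], trends=[90, 80, 70]
)
```

The first `lag` observations of the ratio are dropped because no earlier Google Trends value is available for them.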

Finally, the bias-corrected parameter is estimated as follows:

$${\stackrel{\sim }{\beta }}^{b}\left(q\right)=\widehat{\beta }\left(q\right)-\widehat{bias}\left(\widehat{\beta }\left(q\right)\right)$$

where the estimated bias $\widehat{bias}\left(\widehat{\beta }\left(q\right)\right)$ is given by ${B}^{-1}{\sum }_{j=1}^{B}{\widehat{\beta }}_{j}^{*}\left(q\right)-\widehat{\beta }\left(q\right)$, with ${\widehat{\beta }}_{j}^{*}\left(q\right)$ being the estimate from the $j$-th bootstrap sample, and $q\in (0, 1)$ denotes the quantile considered, which in this case is set equal to 0.5 (median). Median regression is considered more robust to outliers than, for example, least squares regression. Finally, it also avoids assumptions about the parametric distribution of the errors⁵⁶.
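Substituting the bias expression into the previous equation gives ${\stackrel{\sim }{\beta }}^{b}\left(q\right)=2\widehat{\beta }\left(q\right)-{B}^{-1}{\sum }_{j=1}^{B}{\widehat{\beta }}_{j}^{*}\left(q\right)$, which the sketch below computes for a median-regression slope. It is a simplified illustration under stated assumptions: it uses ordinary resampling of pairs rather than the balanced bootstrap employed in the paper, omits the linear trend, and fits the median line by brute force. All names and data are hypothetical.

```python
import random
from itertools import combinations


def median_slope(pairs):
    """Slope of the least-absolute-deviations line (q = 0.5), by brute force.

    For q = 0.5 the check loss is half the absolute residual, so minimizing
    absolute residuals gives the same line. Returns None if all x are equal.
    """
    best_loss, best_slope = None, None
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        if x1 == x2:
            continue
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = sum(abs(y - a - b * x) for x, y in pairs)
        if best_loss is None or loss < best_loss:
            best_loss, best_slope = loss, b
    return best_slope


def bias_corrected(estimator, pairs, n_boot=999, seed=0):
    """Return theta_hat - (mean of bootstrap estimates - theta_hat)."""
    theta_hat = estimator(pairs)
    if theta_hat is None:
        raise ValueError("estimator is undefined on the original sample")
    rng = random.Random(seed)
    reps = []
    while len(reps) < n_boot:
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        est = estimator(sample)
        if est is not None:  # skip degenerate resamples
            reps.append(est)
    bias = sum(reps) / n_boot - theta_hat
    return theta_hat - bias


# Illustrative (lagged interest, ratio) pairs only.
demo_pairs = [(100, 0.011), (92, 0.013), (85, 0.014), (80, 0.019), (71, 0.021),
              (66, 0.028), (60, 0.030), (52, 0.033), (47, 0.040), (41, 0.042)]
beta_tilde = bias_corrected(median_slope, demo_pairs, n_boot=200)
```

The brute-force fit costs on the order of n³ operations per resample, so a dedicated solver is preferable for anything beyond short series.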

All estimation results reported in this paper were computed in the R programming environment⁵⁷. In particular, we employed the R packages "quantreg" and "boot" to compute the quantile regression estimates and to perform the bootstrapping, respectively. The code is available in a “Supplementary Online Material file”.

Results

Figure 3 depicts the worldwide and US online interest in terms of Google queries in the “coronavirus (virus)” topic from January 22nd to April 15th, 2020. It shows that this topic is very popular, especially in Europe and North America. Specifically, interest in the United States is considerably high (above 70) for all US states.

To perform a first assessment of the relationship between Google Trends and COVID-19 data, the Pearson and Kendall rank correlations between the two variables are calculated, and the results are further compared. Tables 3 and 4 present the results of the Pearson and Kendall correlation analysis by state, respectively.

Table 3 Pearson correlation analysis by state.

Table 4 Kendall rank correlation analysis by state.

As reported in Table 3, statistically significant correlations are observed for the United States and for the states of Alabama, Arkansas, California, Colorado, Florida, Georgia, Illinois, Kentucky, Massachusetts, Minnesota, Nebraska, Nevada, New Hampshire, New York, North Carolina, Oregon, Pennsylvania, South Dakota, Tennessee, Vermont, Virginia, Washington, Wisconsin, and Wyoming as well as DC. The states of Iowa, Louisiana, Maine, Mississippi, Missouri, North Dakota, South Carolina, and Utah marginally miss the p < 0.1 threshold of statistical significance, i.e., $p\in (0.1, 0.2)$.

Based on the Kendall correlation analysis, statistically significant correlations are observed for the United States and for the states of Alaska, Arizona, Arkansas, California, Connecticut, Florida, Georgia, Hawaii, Iowa, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Tennessee, Utah, Vermont, Virginia, Washington, and Wisconsin as well as DC. Figure 4 depicts the heat map of the (a) Pearson and (b) Kendall correlation coefficients in the United States by state over the period examined.

As depicted in the heat maps and in the spider web charts for the respective correlation analyses in Fig. 5, visual comparison of the two approaches indicates that the results are consistent across the two analyses.

However, the main purpose of this study is to explore the predictability of COVID-19 using Google Trends data in the United States. Proceeding with the results of the predictability analysis, Fig. 6 depicts the heat map for ${\boldsymbol{\beta }}_{1}$ by state, while Table 5 presents the quantile regression estimated predictability models for the US and for each US state (plus DC). As shown, the estimated Google Trends models exhibit strong COVID-19 predictability.

Table 5 Predictability analysis by state.

Note that due to the low number of observations, the states of Maine, Montana, North Dakota, West Virginia, and Wyoming are not included in the predictability analysis results, but they are given the value “zero (0)” to be included in the heat map for purposes of uniformity.

Discussion

As of July 29th, 2020, there were 16,920,857 COVID-19 recorded cases worldwide, with the reported death toll at 664,141 and the number of recovered patients at 10,485,316⁹. In light of the COVID-19 pandemic and to find new ways of forecasting the spread of the disease, infodemiology approaches have provided valuable input in monitoring and forecasting the development of the COVID-19 pandemic over time and in measuring and analyzing the public’s awareness and response. Google Trends and Twitter have been identified as the most popular infodemiology sources, while other social media, such as Facebook and Instagram, exhibit promising results in analyzing users’ online behavioral patterns¹³.

Social media platforms can provide us with more qualitative data that can shift the focus to other directions. Such approaches include sentiment analysis, educational purposes, and efforts to measure and raise public awareness. Recent approaches to analyzing aspects of the COVID-19 pandemic using social media data include monitoring the Twitter usage of G7 leaders⁵⁸, monitoring self-reported symptoms on Twitter⁵⁹, and analyzing the public perception of the disease through Facebook⁶⁰. Moreover, infodemiology sources have provided valuable input in recruiting online survey participants through Facebook to measure individuals’ COVID-19 confidence levels⁶¹ and in assessing the behavioral variations in COVID-19-related online search traffic in more than one search engine⁶². Finally, commentaries that make recommendations on the integration of other social media platforms, such as Facebook, Reddit, and TikTok, for disseminating medical information to inform public health and policy have been published⁶³.

Google Trends offers a solid foundation for quantitative analysis with respect to the monitoring and predictability of COVID-19, as in the analysis presented in this study, where Google Trends data on the “coronavirus (virus)” topic were used to explore the predictability of COVID-19 in the United States at both national and state level. First, for a preliminary assessment of the relationship between Google Trends and COVID-19 data, Pearson correlation and Kendall rank correlation analyses were performed. Statistically significant correlations were observed for the United States and for several US states, which is in line with previous studies that argue that there is a relationship between Google Trends and COVID-19 data.

The COVID-19 predictability analysis, which used a quantile regression approach, exhibits very promising results and indicates the most important contribution of this study to the international literature: detecting and predicting the early spread of COVID-19 at the regional level. This contribution can be a substantial supplement in further assisting local authorities in taking the appropriate measures to handle the spread of the disease.

Figure 7 illustrates a graph of the COVID-19 deaths/cases ratio, daily COVID-19 deaths, daily COVID-19 cases, and the respective Google Trends normalized data in the United States from March 4th to April 15th, 2020. For purposes of consistency in the graph, the COVID-19-related time series are normalized on a 0–100 scale. As depicted in the graph and confirmed by the predictability analysis, the two variables are not linearly dependent. Instead, they exhibit an inversely proportional relationship, meaning that as COVID-19 progresses, the online interest in the virus decreases.

From a behavioral point of view, this result can be explained as follows. First, online interest starts to increase and reaches a peak as the number of confirmed cases becomes high and as the death rates start to show that the pandemic does indeed have severe consequences. However, after a certain period, the interest follows an inverse course, which could also indicate that the public is overwhelmed by information overload and decreases its information “intake”. The spike in Google queries and the decline in the ratio of COVID-19 deaths/cases could be attributed to the spread of the virus over these days and the “delay” in deaths; that is, cases increase while the total number of deaths has not yet started to increase considerably.

The latter point is in line with previous work on the topic²⁷ suggesting that although significant correlations between COVID-19 and Google data are observed, the relationship tends to decrease in both strength and significance in regions that have been affected by COVID-19 as we move forward in time because the interest in the virus decreases. This decrease is counterintuitive and occurs before the case and death curves start to exhibit a downward trend, i.e., when a region is being heavily affected, independent of whether or not it has reached its peak. However, it would be interesting for future investigators to explore the relationship from this point onwards since, as shown in Fig. 7, the lines converge, with this convergence being indicative of a future change in the relationship dynamics when deaths peak at a later point and when they start their downward course as well.

The above can partly explain the differences in signs among states in both the Pearson and Kendall rank correlation coefficients, but a more in-depth explanation from a statistical perspective is that the Pearson correlation coefficient is computed from the averaged products of the deviations of observations from their sample means. Observations in the tails of the distribution receive the same weight as all other observations, and therefore, outliers can affect the estimates, especially in small samples. In consideration of ties, this study employs a bootstrap bias-corrected approach, but the main conclusions are based on quantile regressions. Unlike linear measures of dependency, quantile regression is considered superior in a sampling situation and more resistant to outliers than linear regressions, the Pearson correlation, or the Kendall rank correlation⁶⁴. Taking into account that the current pandemic is a dynamic process that constantly evolves and has a serious social impact, it is very probable that several data anomalies now exist or could develop at a later stage (e.g., due to non-pharmaceutical interventions); therefore, formal statistical tools such as the Pearson and Kendall rank correlations should be carefully interpreted.

This study has limitations. First, data from only one search engine are considered. Although Google is the most popular search engine, data on the coronavirus topic from other search engines were not included in this analysis. Second, the data at this point are very limited, and the results are based on few observations. Third, the 50 (+ 1) states exhibit diversity in terms of confirmed cases and deaths. Therefore, any conclusions drawn from this analysis refer to each case individually. Despite the known limitations of online search traffic data, the use of infodemiology metrics for informing public health and policy in general and for monitoring outbreaks and epidemics in particular has received wide attention.

To dynamically find the determinants of COVID-19, the predictability analysis in this study provides insights into how online search traffic data can play a considerable role in forming public health policies, especially in times of epidemics and outbreaks, when real-time data are essential. With the COVID-19 pandemic, the world is in uncharted territory socially and economically. This situation calls for immediate action and open research and data, and the term “multidisciplinary” has never before been more important. To that end, the role of big data in providing “opportunities for performing modeling studies of viral activity and for guiding individual country healthcare policymakers to enhance preparation for the outbreak” has been acknowledged⁶⁵, and current research on the subject should focus on both exploring the role of other infodemiology variables in the predictability of COVID-19 and combining infodemiology sources with traditional sources to explore the full potential of what online real-time data have to offer for disease surveillance.

Data availability

The COVID-19 and query datasets analyzed during the current study are available on the COVID Tracking Project website¹⁰ and on the “Google Trends” explore page³⁹, respectively.

References

WHO Timeline—COVID-19. World Health Organization. https://www.who.int/news-room/detail/08-04-2020-who-timeline---covid-19 (2020).
Twitter account. World Health Organization. https://twitter.com/WHO/status/1213523866703814656?s=20 (2020).
Pneumonia of unknown cause. World Health Organization. https://www.who.int/csr/don/05-january-2020-pneumonia-of-unkown-cause-china/en/ (2020).
Secon, H., Woodward, A. & Mosher, D. A comprehensive timeline of the new coronavirus pandemic, from China's first COVID-19 case to the present. Business Insider. https://www.businessinsider.com/coronavirus-pandemic-timeline-history-major-events-2020-3 (2020).
Twitter account. World Health Organization. https://twitter.com/who/status/1217043229427761152?lang=en (2020).
Qin, A. & Wang, V. Wuhan, Center of Coronavirus Outbreak, Is Being Cut Off by Chinese Authorities. New York Times. https://www.nytimes.com/2020/01/22/world/asia/china-coronavirus-travel.html (2020).
Coronavirus disease named COVID-19. BBC News. https://www.bbc.com/news/world-asia-china-51466362 (2020).
COVID coronavirus Outbreak: Italy. Worldometer. https://www.worldometers.info/coronavirus/country/italy/ (2020).
COVID coronavirus Outbreak. Worldometer. https://www.worldometers.info/coronavirus/ (2020).
The COVID Tracking Project. The Atlantic. https://covidtracking.com (2020).
Eysenbach, G. Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J. Med. Internet Res. 11(1), e11 (2009).
Eysenbach, G. Infodemiology and infoveillance tracking online health information and cyberbehavior for public health. Am. J. Prev. Med. 40(5 Suppl 2), S154–S158 (2011).
Mavragani, A. Infodemiology and infoveillance: A scoping review. J. Med. Internet Res. 22(4), e16206 (2020).
Bernardo, T. M. et al. Scoping review on search queries and social media for disease surveillance: A chronology of innovation. J. Med. Internet Res. 15(7), e147 (2013).
Eysenbach, G. SARS and population health technology. J. Med. Internet Res. 5(2), e14 (2003).
van Lent, L. G., Sungur, H., Kunneman, F. A., van de Velde, B. & Das, E. Too far to care? Measuring public attention and fear for Ebola using twitter. J. Med. Internet Res. 19(6), e193 (2017).
Farhadloo, M., Winneg, K., Chan, M. S., Hall, J. K. & Albarracin, D. Associations of topics of discussion on twitter with survey measures of attitudes, knowledge, and behaviors related to Zika: Probabilistic Study in the United States. JMIR Public Health Surveill. 4(1), e16 (2018).
Poletto, C., Boëlle, P. & Colizza, V. Risk of MERS importation and onward transmission: A systematic review and analysis of cases reported to WHO. BMC Infect. Dis. 16(1), 448 (2016).
Samaras, L., García-Barriocanal, E. & Sicilia, M. A. Comparing Social media and Google to detect and predict severe epidemics. Sci. Rep. 10, 4747 (2020).
Mavragani, A. & Ochoa, G. The internet and the anti-vaccine movement: Tracking the 2017 EU measles outbreak. Big Data Cog. Comp. 2(1), 1 (2018).
Du, J. et al. Public perception analysis of tweets during the 2015 measles outbreak: Comparative study using convolutional neural network models. J. Med. Internet Res. 20(7), e236 (2018).
Mavragani, A., Ochoa, G. & Tsagarakis, K. P. Assessing the methods, tools, and statistical approaches in Google Trends research: Systematic review. J. Med. Internet Res. 20(11), e270 (2018).
Google Trends & COVID Advanced Search. PubMed. https://www.ncbi.nlm.nih.gov/pubmed/ (2020).
Husnayain, A., Fuad, A. & Su, E. C. Applications of Google search trends for risk communication in infectious disease management: A case study of COVID-19 outbreak in Taiwan. Int. J. Infect. Dis. 95, 221–223 (2020).
Li, C. et al. Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data, China, 2020. Euro Surveill. 25(10), 2000199 (2020).
Effenberger, M. et al. Association of the COVID-19 pandemic with internet search volumes: A Google Trends(TM) analysis. Int. J. Infect. Dis. 95, 192–197 (2020).
Mavragani, A. Tracking COVID-19 in Europe: Infodemiology approach. JMIR Public Health Surveill. 6(2), e18941 (2020).
Walker, A., Hopkins, C. & Surda, P. The use of Google Trends to investigate the loss of smell related searches during COVID-19 outbreak. Int. Forum Allergy Rhinol. 10(7), 839–847 (2020).
Hong, Y. R., Lawrence, J., Williams, D. Jr. & Mainous, A. Population-level interest and telehealth capacity of US hospitals in response to COVID-19: Cross-sectional analysis of Google search and national hospital survey data. JMIR Public Health Surveill. 6(2), e18961 (2020).
Ayyoubzadeh, S. M., Zahedi, H., Ahmadi, M. R. & Kalhori, S. N. Predicting COVID-19 incidence through analysis of Google Trends data in Iran: Data mining and deep learning pilot study. JMIR Public Health Surveill. 6(2), e18828 (2020).
Rufai, S. R. & Bunce, C. World leaders' usage of Twitter in response to the COVID-19 pandemic: A content analysis. J. Public Health (Oxf.) fdaa049 (2020).
Kouzy, R. et al. Coronavirus goes viral: Quantifying the COVID-19 misinformation epidemic on Twitter. Cureus. 12(3), e7255 (2020).
Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M. & Shah, Z. Top concerns of tweeters during the COVID-19 pandemic: A surveillance study. J. Med. Internet Res. 22(4), e19016 (2020).
Dost, B. et al. Attitudes of anesthesiology specialists and residents toward patients infected with the novel coronavirus (COVID-19): A national survey study. Surg. Infect. (Larchmt). 21(4), 350–356 (2020).
Simcock, R. et al. COVID-19: Global radiation oncology’s targeted response for pandemic preparedness. Clin. Transl. Radiat. Oncol. 22, 55–68 (2020).
Kim, B. Effects of social grooming on incivility in COVID-19. Cyberpsychol. Behav. Soc. Netw. 23(8), 519–525 (2020).
Rosenberg, H., Syed, S. & Rezaie, S. The Twitter pandemic: The critical role of Twitter in the dissemination of medical information and misinformation during the COVID-19 pandemic. CJEM. 6, 1–4 (2020).
Chan, A. K. M., Nickson, C. P., Rudolph, J. W., Lee, A. & Joynt, G. M. Social media for rapid knowledge dissemination: Early experience from the COVID-19 pandemic. Anaesthesia (2020).
Google Trends Explore. https://trends.google.com/trends/explore (April 18, 2020).
Trends Help. Google Support. https://support.google.com/trends/answer/4365533?hl=en (2020).
Mavragani, A. & Ochoa, G. Google Trends in infodemiology and infoveillance: Methodology framework. JMIR Public Health Surveill. 5(2), e13439 (2019).
PixelMap. AMCHARTS. https://pixelmap.amcharts.com (2020).
ChartsBin. https://chartsbin.com (2020).
Phillips, P. C. B. & Perron, P. Testing for a unit root in time series regression. Biometrika. 75(2), 335–346 (1988).
Efron, B. & Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1(1), 54–75 (1986).
Karlsson, A. Bootstrap methods for bias correction and confidence interval estimation for nonlinear quantile regression of longitudinal data. J. Stat. Comput. Sim. 79(10), 1205–1218 (2009).
Guan, W. From the help desk: Bootstrapped standard errors. Stata J. 3(1), 71–80 (2003).
Davidson, R. & MacKinnon, J. G. Bootstrap tests: How many bootstraps? Econ. Rev. 19(1), 55–68 (2000).
Koenker, R. & Bassett, G. Regression quantiles. Econometrica. 46(1), 33–50 (1978).
Koenker, R. & Hallock, K. F. Quantile regression. J. Econ. Perspect. 15(4), 143–156 (2001).
Yu, K., Lu, Z. & Stander, J. Quantile regression: Applications and current research areas. J. R Stat. Soc. Series D Stat. 52(3), 331–350 (2003).
Nikitina, L., Paidi, R. & Furuoka, F. Using bootstrapped quantile regression analysis for small sample research in applied linguistics: Some methodological considerations. PLoS ONE 14(1), e0210668 (2019).
Chen, F. & Chalhoub-Deville, M. Principles of quantile regression and an application. Lang. Test. 31(1), 63–87 (2014).
Firpo, S., Fortin, N. M. & Lemieux, T. Unconditional quantile regressions. Econometrica. 77(3), 953–973 (2009).
Salibian-Barrera, M. & Zamar, R. H. Bootstrapping robust estimates of regression. Ann. Stat. 30(2), 556–582 (2002).
Chernozhukov, V., Hansen, C. & Jansson, M. Finite sample inference for quantile regression models. J. Econom. 152, 93–103 (2009).
R Core Team, 2017. R: A language and environment for statistical computing, Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/. R version 3.3.3.
Rufai, S. R. & Bunce, C. World leaders’ usage of Twitter in response to the COVID-19 pandemic: A content analysis. J. Public Health. 42(3), 510–516 (2020).
Sarker, A. et al. Self-reported COVID-19 symptoms on Twitter: An analysis and a research resource. J. Am. Med. Inform. Assoc. 27(8), 1310–1315 (2020).
Shorey, S., Ang, E., Yamina, A. & Tam, C. Perceptions of public on the COVID-19 outbreak in Singapore: A qualitative content analysis. J. Public Health (Oxf.) fdaa105 (2020).
Wang, P. W. et al. COVID-19-related information sources and the relationship with confidence in people coping with COVID-19: Facebook survey study in Taiwan. J. Med. Internet Res. 22(6), e20021 (2020).
Hou, Z. et al. Cross-country comparison of public awareness, rumours, and behavioural responses to the COVID-19 epidemic: An internet surveillance study. J. Med. Internet Res. 22(8), e21143 (2020).
Eghtesadi, M. & Florea, A. Facebook, Instagram, Reddit and TikTok: A proposal for health authorities to integrate popular social media platforms in contingency planning amid a global pandemic outbreak. Can. J. Public Health. 111, 389–391 (2020).
Gideon, R. A. & Hollister, R. A. A rank correlation coefficient resistant to outliers. J. Am. Stat. Assoc. 82(398), 656–666 (1987).
Ting, D. S. W., Carin, L., Dzau, V. & Wong, T. Y. Digital technology and COVID-19. Nat. Med. 26, 459–461 (2020).

Author information

Authors and Affiliations

Department of Computing Science and Mathematics, Faculty of Natural Sciences, University of Stirling, Stirling, FK9 4LA, Scotland, UK
Amaryllis Mavragani
Department of Management Science and Technology, University of Patras, Patras, Greece
Konstantinos Gkillas

Contributions

A.M. conceived the idea, designed the methodology, performed the data collection, performed the data analysis and interpretation, and wrote the paper; K.G. designed the statistical methodology, performed the statistical analysis and interpretation, and performed the computational analysis. Both authors reviewed and approved the manuscript.

Corresponding author

Correspondence to Amaryllis Mavragani.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mavragani, A., Gkillas, K. COVID-19 predictability in the United States using Google Trends time series. Sci Rep 10, 20693 (2020). https://doi.org/10.1038/s41598-020-77275-9

Received: 27 April 2020
Accepted: 06 November 2020
Published: 26 November 2020
DOI: https://doi.org/10.1038/s41598-020-77275-9
