Abstract
Our society is built on a complex web of interdependencies whose effects become manifest during extraordinary events such as the COVID-19 pandemic, with shocks in one system propagating to the others to an exceptional extent. We analyzed more than 100 million Twitter messages posted worldwide in 64 languages during the epidemic emergency due to SARS-CoV-2 and classified the reliability of the news they diffused. We found that waves of unreliable and low-quality information anticipate the epidemic ones, exposing entire countries to irrational social behavior and serious threats to public health. When the epidemics hit the same area, reliable information is quickly inoculated, like antibodies, and the system shifts focus towards certified informational sources. Contrary to mainstream beliefs, we show that human response to falsehood exhibits early-warning signals that might be mitigated with adequate communication strategies.
Human societies build on social, economic, environmental and technological systems whose dynamics are inherently complex and often highly unpredictable in the short term. The effects of this deep-layered structural interdependency 1,2 become manifest during extraordinary events, such as natural catastrophes or pandemics, where shocks propagate across systems. Although the level of complexity of past human societies has been often underestimated, it can be claimed that, in the past few decades, the acceleration of globalization processes has brought about an unprecedented level of large-scale interdependencies, from trade of goods to communications, that dramatically changed the temporal scales of shock propagation. However, how to map and understand the potential diffusion pathways which might lead to major systemic crises or even collapse is still unknown.
The high levels of specialization 3 and adaptive flexibility 4 of human societies rely upon complex, multifaceted forms of cooperation, to the point of characterizing humans as a super-cooperator species 5. One would therefore expect that the human propensity to cooperate would be further magnified when facing major threats that put collective wellbeing at risk. In large, complex societies, an important mediator of large-scale cooperation is communication 6, which may be crucial to coordinate individual perceptions and behaviors in the pursuit of the common interest 7. The recent explosion of publicly shared, decentralized information production that characterizes digital societies 8 and in particular social media activity 9 provides an exceptional laboratory for the observation and the study of these complex social dynamics 10, and potentially functions as a very powerful resource to enact effective, pro-social cooperation and coordination in large-scale crises 11. Global pandemics are certainly an instance of such crises, and the current outbreak of COVID-19 may therefore be thought of as a natural experiment to observe social responses to a major threat that may potentially escalate to catastrophic levels, and has already managed to seriously affect levels of economic activity, and radically alter human social behaviors across the globe.
In this study, we show that information dynamics tailored to alter individuals’ perceptions and, consequently, their behavioral response, is able to drive collective attention 12 towards false 13,14 or inflammatory 15 content, a phenomenon named infodemics 16–19, sharing similarities with more traditional epidemics and spreading phenomena 20–22. Contrary to what could be expected in principle, this natural experiment reveals that, on the verge of a threatening global pandemic emergency due to SARS-CoV-2 23–25, human communication activity is to a significant extent characterized by the intentional production of informational noise and even of misleading or false information 26. This generates waves of unreliable and low-quality information with potentially very dangerous impacts on the social capacity to respond adaptively at all scales by rapidly adopting those norms and behaviors that may effectively contain the propagation of the epidemics. Spreading false information or even conspiracy theories that support implausible explanations of the causal forces at work behind the crisis may create serious confusion and even discourage people from taking the crisis seriously or responsibly, all the more so the more such signals receive social validation and spread across social groups and communities 27. Therefore, if on the one hand we face the risks of a global epidemics threat, requiring outstanding efforts for modeling and anticipating the time course of the spreading 25, on the other hand we can speak of an infodemics threat 28, where low-quality content provides an alternative for news consumption to unclear official communications. The infodemics can be thought of, similarly to epidemics, as an outbreak of false rumors and fake news with unexpected effects on social dynamics (see Fig. 1). In fact, the danger posed by infodemics can be comparable to, and add to, that of the epidemics itself 29.
As shown in Fig. 1, an infodemics is the result of the simultaneous action of multiple human and non-human sources of fake or unreliable news. As users are repeatedly hit by a given message from different sources, this works as an indirect validation of its reliability and relevance, leading the user to spread it in turn, and to become an informationally infectious agent.
The COVID-19 crisis allows us to provide a rigorous, evidence-based assessment of such risks, and of the real-time interaction of the infodemic and epidemic layers 21.
We focus our attention on the analysis of messages posted on a popular microblogging platform 30, an online social network characterized by heterogeneous connectivity 31 and topological shortcuts typical of small-world systems 32. Information spreading on this type of network is well understood in terms of global cascades in a population of individuals who have to choose between complementary alternatives, while accounting for the behavior and the relative size of their social neighborhood 33, and accounting for factors which characterize the popularity of specific content, like the memory time of users and the underlying connectivity structure 34. However, the exact mechanisms responsible for the spread of false information and inflammatory content, e.g. during political events 15,35–37, remain fundamentally unknown. Recently, it has been suggested that this challenging phenomenon might be due to the fact that, at population level, the dynamics of multiple interacting contagions are indistinguishable from social reinforcement 38.
This peculiar feature suggests that infodemics of news consumption should be analyzed through the lens of epidemiology to gain insights about the role of human and non-human activities in spreading reliable as well as unreliable news. To this aim, we monitored social media and collected more than 112 million messages in 64 languages from around the world about COVID-19, between 21 January and 10 March 2020 (see Methods for details).
By using state-of-the-art machine learning techniques to analyze the online behavior of users (see Methods for details), we have discovered an extraordinary activity of automated agents, referred to as social bots 14,15,35,39. Specifically, we estimate that 40.4% of online messages during this period were due to such automated agents, doubling the activity with respect to estimates of only four years ago 35.
Where available, we have extracted URLs from messages, collecting about 20.7 million links (3.3 million unique) pointing to websites external to the platform. Each URL is then used for fact-checking, inheriting the reliability of its source (see Methods). About 50% of URLs have been fact-checked by screening almost 4,000 expert-curated web domains, whereas the remaining corpus pointed to web pages that have since disappeared, to content not classifiable automatically (e.g., videos on YouTube), or to unpopular sources. This method allowed us to overcome the limitations due to text mining of different languages for the analysis of narratives.
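The source-inheritance step described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the table `DOMAIN_RELIABILITY`, its two entries, and the helper names are hypothetical, and stripping a leading `www.` is an assumed normalization.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the expert-curated domain list (1 = reliable, 0 = unreliable).
DOMAIN_RELIABILITY = {
    "example-science.org": 1,
    "example-hoax.com": 0,
}


def source_domain(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def url_reliability(url, table=DOMAIN_RELIABILITY):
    """A URL inherits the label of its source domain; None means 'not classified'."""
    return table.get(source_domain(url))
```

Links whose domain is absent from the table (for instance video platforms) return `None` and would be left out of the reliability statistics.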
To better understand the diffusion of these contents across countries, we have filtered messages with geographic information. About 0.84% of collected posts were geo-tagged by the user, providing highly accurate information about their geographic location. However, by geocoding the information available in users’ profiles, we were able to extend the corpus of geolocated messages to about 56% of the total observed volume (see Methods). A total of more than 60 million geolocated messages, containing more than 9 million news items, have been analyzed. For each message, we have used an accurate machine learning approach to classify the author as human or non-human (i.e., bot), while keeping the distinction between verified and unverified users. Usually, verification is performed by the social platform to clearly identify accounts of public interest and certify they are authentic. The number of followers Ku of a single user u defines the exposure, in terms of potential visualizations at first-order approximation, of a single message m posted by user u at time t. Let Mu(t, t + Δt) indicate the set of messages posted by user u in a time window of length Δt. Since there are four different classes of users, namely verified bots (VB), unverified bots (UB), verified humans (VH) and unverified humans (UH), we define the exposure due to a single class Ci (i = VB, UB, VH, UH) as
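A form consistent with these definitions (a reconstruction from the surrounding text, with $E_i$ as assumed notation) adds the follower count $K_u$ of the author for every message posted by users of class $C_i$ in the window:

```latex
E_i(t, t+\Delta t) \;=\; \sum_{u \in C_i} \; \sum_{m \in M_u(t,\, t+\Delta t)} K_u
```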
Note that different users of the same class might have overlapping social neighborhoods: those neighbors might be reached multiple times by the messages coming from distinct users of the same class, therefore our measure of exposure accounts for this effect. Note that our measure provides a lower bound to the number of exposed users, because we do not track higher-order transmission pathways: a user might adopt a content by reading it, while not resharing it. In this case there is no way to account for such users.
Finally, for each message, we identify the presence of links pointing to external websites: for each link we verify whether it comes from a trustworthy source or not (see Methods). The reliability rm of a single message m is either 0 or 1, because we discard all web links that cannot be easily assessed, such as the ones shortened by third-party services that expired or point to unreachable destinations, and the ones pointing to external platforms, such as YouTube, where it is not possible to automatically classify the reliability of the content. The news reliability of messages produced by a specific class of users is therefore defined as
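A form consistent with this description, given as a hedged reconstruction, is the fraction of messages from class $C_i$ that link to a reliable source. The symbol $R_i$ and the normalization by the number of messages are assumptions, and $M_u$ is taken here to contain only messages whose links could be assessed:

```latex
R_i(t, t+\Delta t) \;=\;
\frac{\sum_{u \in C_i} \sum_{m \in M_u(t,\, t+\Delta t)} r_m}
     {\sum_{u \in C_i} \left| M_u(t,\, t+\Delta t) \right|}
```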
Unreliability can be defined similarly, replacing rm with 1-rm. Exposure and reliability are useful descriptors that, however, do not on their own capture the risk of infodemics. For this reason we have developed an Infodemic Risk Index (IRI) which quantifies the rate at which a generic user is exposed to unreliable news produced by a specific class of users (partial IRI) or by any class of users (IRI):
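One formulation consistent with this description is sketched here as a reconstruction rather than as the verbatim definition. It assumes that each unreliable message is weighted by the follower count $K_u$ of its author, that the result is normalized by the total exposure over all classes, and that the overall index is the sum of the partial ones:

```latex
\mathrm{IRI}_i(t, t+\Delta t) \;=\;
\frac{\sum_{u \in C_i} K_u \sum_{m \in M_u(t,\, t+\Delta t)} (1 - r_m)}
     {\sum_{j} \sum_{u \in C_j} K_u \left| M_u(t,\, t+\Delta t) \right|},
\qquad
\mathrm{IRI}(t, t+\Delta t) \;=\; \sum_{i} \mathrm{IRI}_i(t, t+\Delta t)
```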
Both indices are well defined and range from 0 (no infodemic risk) to 1 (maximum infodemic risk). Note that we can calculate all the infodemics descriptors introduced above at a desired level of spatial and temporal resolution. Because the overall IRI aggregates over all classes of users, it is robust to user classification, which makes it an indicator that is not sensitive to the performance of bot detection algorithms.
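The descriptors can be illustrated on toy data. The sketch below assumes that exposure sums follower counts over messages, that reliability is the fraction of reliable messages, and that the partial IRI is the exposure to unreliable messages from one class divided by the total exposure from all classes. The records in `MESSAGES` and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (user class, follower count K_u of the author, reliability r_m).
MESSAGES = [
    ("UH", 100, 1),
    ("UH", 100, 0),
    ("VB", 1000, 1),
    ("UB", 50, 0),
    ("VH", 2000, 1),
]


def infodemic_descriptors(messages):
    """Return (exposure, reliability, partial IRI) per class and the overall IRI."""
    exposure = defaultdict(int)
    unreliable_exposure = defaultdict(int)
    reliable_count = defaultdict(int)
    message_count = defaultdict(int)
    for user_class, followers, r_m in messages:
        exposure[user_class] += followers
        unreliable_exposure[user_class] += followers * (1 - r_m)
        reliable_count[user_class] += r_m
        message_count[user_class] += 1

    total_exposure = sum(exposure.values())
    reliability = {c: reliable_count[c] / message_count[c] for c in message_count}
    partial_iri = {
        c: (unreliable_exposure[c] / total_exposure if total_exposure else 0.0)
        for c in exposure
    }
    # Summing over classes makes the overall index independent of class labels.
    iri = sum(partial_iri.values())
    return dict(exposure), reliability, partial_iri, iri


exposure, reliability, partial_iri, iri = infodemic_descriptors(MESSAGES)
```

Relabeling a user from one class to another moves its contribution between partial indices but leaves the overall sum unchanged, which is the sense in which the overall index does not depend on the human/bot classification under these assumptions.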
Figure 2 shows how countries characterized by different levels of infodemic risk present very different profiles of news sources. In a low-risk country such as South Korea, the level of infodemic risk remains small throughout, apart from an isolated spike in the early phase. As the contagion spreads to significant levels, the infodemic risk further decreases, signalling an increasing focus of public opinion on reliable news sources. Canada presents a slightly higher level of infodemic risk and, unlike South Korea, its risk level increases as the epidemics spread, but stays at low levels. At the opposite extreme, in a high-risk country such as Venezuela, the infodemics is in full swing throughout the period of observation, and in addition to the expected activity from unverified sources one notices that even verified ones contribute to a large extent to the infodemics. The relationship with biological contagion patterns cannot be checked here due to lack of reliable data. Finally, in a relatively high-risk country such as Russia we notice that infodemic risk is erratic, with sudden, very pronounced spikes, and again verified sources also play a major role. Here too, information about the epidemics is fragmented and mostly unreliable. Overall, the global level of infodemic risk tends to decrease as the epidemics spread globally, suggesting that evidence of the expansion of the contagion leads people to look for relatively more reliable sources, and that verified influencers with many followers started inoculating the system with more reliable news (see Supplementary Figures 3 and 4), playing a role that presents interesting analogies to that of antibodies in the treatment of an infectious disease. This overall pattern is confirmed in terms of measures of Infodemic Risk aggregated daily and at country level (Fig. 3 and Supplementary Fig. 5).
The effect is particularly pronounced with the escalation of the epidemics, suggesting that this effect could be mediated by levels of perceived social alarm. It is also interesting to observe though that countries with high infodemic risk might also be more unreliable in terms of reporting of epidemic data, thus altering the perceptions of people and indirectly misleading them in their search for reliable information.
However, the dynamic profiles of infodemic risk may be very different even in countries with similar risk levels. Fig. 4 compares Italy with the United States. In the case of Italy the risk is mostly due to the activity of unverified sources, but we notice that with the outbreak of the epidemics, the production of misinformation collapses and there is a sudden shift to reliable sources. For the USA, misinformation is mainly driven by verified sources, and it remains basically constant even after the epidemics outbreak. Notice also how infodemic risk varies substantially across US states. As the USA lags significantly behind Italy in terms of the epidemics progression, it remains to be checked whether a similar readjustment is going to be observed for the USA later on. Fig. 4 shows, however, that the relationship between reduction of infodemic risk and expansion of the epidemics seems to be a rather general trend, as the relationship between number of confirmed cases and infodemic risk is (nonlinearly) negative, confirming the result shown in Fig. 3. Fig. 4 also shows how the evolution of infodemic risk among countries with both high message volume and significant epidemic contagion tends to be very asymmetric, with major roles played not only by countries such as Iran, but also by the United States, Germany, the Netherlands, Austria and Norway, which maintain their relative levels, while other countries like Italy, South Korea and Japan significantly reduce theirs with the progression of the epidemics.
Our findings demonstrate that, in a highly digitalized society, the epidemic and the infodemic dimensions of a pandemic must be seen as two sides of the same coin. The infodemics is typically driven by the combined action of both human and non-human actors (bots), which pursue largely undisclosed goals. Perceived and actual biological and social risks feed upon one another, and may co-evolve in complex ways. Especially in situations where effective therapies to counter the spread of the pandemic are not readily available, coordination of behaviors and diffusion of pro-social orientations driven by reliable information at all scales are the key resources for the mitigation of adverse effects. In this perspective, we can therefore think of an integrated public health approach where the biological and informational dimensions are equally recognized, taken into account, and managed through careful policy design. This could potentially include the birth of new, highly specialized professional figures such as that of the “infodemiologist”.
Here, we have shown that in the context of the COVID-19 crisis, complex infodemic effects are indeed at work, with significant variations across countries, where the level of socio-economic development is not the key discriminant separating countries with high vs. low infodemic risk. In fact, we find that there are G9 countries with remarkable infodemic risk and developing countries with far lower risk levels. This means that, especially in countries where infodemic risk is high, the eventual speed and effectiveness of the containment of COVID-19 could depend on a prompt regime switch in communication strategies and on effectively countering the most active sources of the most dangerous categories of fake news. The escalation of the epidemics leads people to progressively pay attention to more reliable sources, thus potentially limiting the impact of the infodemics, but the actual speed of adjustment may make a major difference in determining the social outcome, and in particular between a controlled epidemics and a global pandemics. This casts new light on the social mechanics of the infodemics-epidemics interaction, and may help policy makers design a more integrated strategic approach, by suitably embedding communication and information management into a comprehensive, extended public health perspective.
Data Availability
The dataset analyzed in this paper can be interactively visualized and accessed at https://covid19obs.fbk.eu/
Methods
Data collection
We have followed a consolidated strategy for collecting social media data. We focused on Twitter, which is well known for providing access to publicly available messages upon specific requests through its application programming interface (API). We have identified a set of hashtags and keywords gaining special collective attention, namely: coronavirus, ncov, #Wuhan, covid19, covid-19, sarscov2, covid. This set includes the official names of the virus and the disease, including the preliminary ones, as well as the name of the city of the first epidemic outbreak. We have used the Filter API to collect the data in real time from 24 Jan 2020 to 10 Mar 2020, and the Search API to collect the data between 21 Jan 2020 and 24 Jan 2020. Our choice allowed us to monitor, without interruptions and regardless of the language, all the tweets posted about COVID19 since China reported more than 6,000 cases (20 Jan 2020), calling for the attention of the international community. The Filter API has the advantage of providing all the messages satisfying our selection criteria and posted to the platform in the period of observation, provided that their volume is not larger than 1% of the overall (unfiltered) volume of posted messages. Above 1% of the overall flow of information, the Filter API provides a sample of filtered tweets and communicates an estimate of the number of lost messages. Note that this choice is the safest to date: in fact, it has been recently shown that biases affecting the Sample API (which samples data based on rate limits), for instance, are not found in the REST and Filter APIs 40. We estimate that until 24 Feb 2020 we lost about 60,000 tweets out of millions, capturing more than 99.5% of all messages posted (see Supplementary Fig. 1). The global attention towards COVID19 increased the volume of messages after 25 Feb 2020: however, Twitter restrictions allowed us to get no more than 4.5 million messages per day, on average.
We have estimated a total of 161.2 million tweets posted until 10 Mar 2020: we have successfully collected 112.6 million of them, providing an unprecedented opportunity for infodemics analysis.
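The overall coverage implied by these two figures is a one-line calculation. The helper name `coverage_fraction` is hypothetical; the inputs are the totals quoted above.

```python
def coverage_fraction(collected, estimated_total):
    """Share of the estimated message volume that was actually collected."""
    if estimated_total <= 0:
        raise ValueError("estimated_total must be positive")
    return collected / estimated_total


# Totals quoted in the text, in millions of tweets.
coverage = coverage_fraction(112.6, 161.2)
print(f"{coverage:.1%}")  # prints 69.9%
```

About 70% of the estimated volume was therefore captured over the whole period, compared with more than 99.5% before the rate limit became binding.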
Human vs non-human classification
The classification of users into humans and non-humans (i.e., bots) relies on a well-established deep learning algorithm 37 with state-of-the-art accuracy 15,41. In more detail, our method has the highest accuracy (>90%) and precision in identifying bots (>95%) when compared with state-of-the-art methods. Our deep neural network model has the advantage of being more stable in the classification of certain users playing the role of broadcasters. Note that in this study we make an explicit distinction between verified and unverified human/non-human users. In fact, verified users should be considered more authentic than unverified ones, because Twitter applies strict criteria for verification. Therefore, verified bot accounts might be broadcasters (whose behavior is manifestly different from the average behavior of a single human) or, in some cases, even celebrities, or any account that is very likely managed automatically and whose behavior is unlike that of a typical human.
Fact Checking
We have collected manually-checked web domains from multiple publicly available databases, including scientific and journalistic ones. Specifically, we have considered data shared by:
M. Zimdars for the Washington Post (2016). https://www.washingtonpost.com/posteverything/wp/2016/11/18/my-fake-news-list-went-viral-but-made-up-stories-are-only-part-of-the-problem/
C. Silverman for BuzzFeed News (2017). https://www.buzzfeednews.com/article/craigsilverman/inside-the-partisan-fight-for-your-news-feed
Fake News Watch (2015). https://web.archive.org/web/20180213181029/http://www.fakenewswatch.com/
PolitiFact (2017). https://www.politifact.com/article/2017/apr/20/politifacts-guide-fake-news-websites-and-what-they/
Bufale.net (2018). https://www.bufale.net/the-black-list-la-lista-nera-del-web/
Starbird et al, ICWSM (2018)
Fletcher et al, Factsheets, Reuters Institute and U. of Oxford (2018). https://reutersinstitute.politics.ox.ac.uk/our-research/measuring-reach-fake-news-and-online-disinformation-europe
Grinberg et al, Science 363, 374 (2019)
MediaBiasFactCheck (2020). https://mediabiasfactcheck.com/
However, databases adopted different labeling schemes to classify web domains, therefore we first had to develop a unifying classification scheme, reported in the table below, and map all existing categories to a unique set of categories. Note that we have also mapped those categories to a coarse-grain classification scheme, distinguishing just between reliable and unreliable.
We have found a total of 4,988 domains, reduced to 4,417 after removing hard duplicates across databases. Note that a domain is considered a hard duplicate if its name and its classification coincide across databases.
A second level of filtering is applied to soft duplicates, that is, domains which are classified differently across databases (e.g., xyz.com might be classified as FAKE/HOAX in one database and as SATIRE in another). To deal with these cases, we have adopted our own expert classification, by assigning to each category a Harm Score between 1 and 9. When two or more entries are soft duplicates, we keep the classification with the highest Harm Score, as a conservative choice. This phase of processing reduced the overall database to 3,920 unique domains.
The Harm Score classifies sources in terms of their potential contribution to the manipulative and mis-informative character of an infodemic. As a general principle, the more systematic and intentionally harmful the knowledge manipulation and data fabrication, the higher the Harm Score (HS). Scientific content has the lowest level of HS due to the rigorous process of validation carried out through scientific methods. Mainstream media content has the second lowest level of HS due to its constant scrutiny in terms of fact checking and media accountability. Satire is an unreliable source of news but due to its explicit goal of distorting or mis-representing information according to the specific cultural codes of humor and social critique, is generally identified with ease as an unreliable source. Clickbait is a more dangerous source (and thus ranking higher in HS) due to its intent to pass fabricated or mis-represented information and facts for true, with the main purpose of attracting attention and online traffic, that is, for mostly commercial purposes, but without a clear ideological intent. Other is a general purpose category that contains diverse forms of (possibly) misleading or fabricated content, not easily classifiable but likely including bits of ideologically structured content pursuing systematic goals of social manipulation, and thus ranking higher in HS. Shadow is a similar category to the previous one, where in addition links are anonymized and often temporary, thereby adding an extra element of unaccountability and manipulation that translates into a higher level of HS. 
Political is a category where we find an ample spectrum of content with varying levels of distortion and manipulation of information, also including mere selective reporting and omission, whose goal is that of building consensus for a political position against others, and therefore directly aiming at polluting the public discourse and opinion making, with a comparatively higher level of HS with respect to the previous categories. Fake/hoax contains entirely manipulated or fabricated inflammatory content which is intended to be perceived as realistic and reliable and whose goal may also be political, but which fails to meet the basic rules of plausibility and accountability, thus reaching an even higher level of HS. Finally, the highest level of HS is associated with conspiracy/junk science, that is, with strongly ideological, inflammatory content that aims at building conceptual paradigms that are entirely alternative and oppositional to tested and accountable knowledge and information, with the intent of building self-referential bubbles where loyal audiences simply refuse a priori any kind of knowledge or information that is not legitimized by the alternative source itself or by recognized affiliates, as is typical in sects of a religious or other nature.
A third level of filtering concerned poorly defined domains, e.g., the ones explicitly missing top-level domain names, such as .com .org etc, as well as the domains not classifiable with our proposed scheme. This action reduced the database to the final number of 3,892 entries, whose statistics are reported in the tables below (see also Supplementary Fig. 2).
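The three filtering levels can be sketched in a single pass over the pooled entries. This is an illustration under stated assumptions: the category ordering follows the text, but the numeric Harm Scores 1 to 9, the category spellings, the rule that only the two lowest-scoring categories count as reliable, and the function names are all hypothetical.

```python
# Assumed numeric Harm Scores; only the ordering is taken from the text.
HARM_SCORE = {
    "SCIENCE": 1,
    "MAINSTREAM_MEDIA": 2,
    "SATIRE": 3,
    "CLICKBAIT": 4,
    "OTHER": 5,
    "SHADOW": 6,
    "POLITICAL": 7,
    "FAKE/HOAX": 8,
    "CONSPIRACY/JUNK_SCIENCE": 9,
}


def curate(entries):
    """Merge (domain, category) pairs pooled from several databases."""
    best = {}
    for domain, category in entries:
        domain = domain.strip().lower()
        # Third level: drop unclassifiable categories and names without a top-level domain.
        if category not in HARM_SCORE or "." not in domain.strip("."):
            continue
        current = best.get(domain)
        # Hard duplicates collapse onto one key; soft duplicates keep the highest Harm Score.
        if current is None or HARM_SCORE[category] > HARM_SCORE[current]:
            best[domain] = category
    return best


def is_reliable(category):
    """Assumed coarse-grained mapping: scientific and mainstream media sources only."""
    return HARM_SCORE[category] <= 2
```

Applying the filters in one loop gives the same final table as applying them in sequence, because each rule depends only on the domain name and its set of labels.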
Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request. Aggregated information, compliant with all privacy regulations on this matter, is publicly available online at the Infodemics Observatory (http://covid19obs.fbk.eu/) and on a permanent repository (Zenodo address/DOI will be provided with the publication of this manuscript).
Acknowledgements
We acknowledge the support of the FBK’s Digital Society Department and the FBK’s Flagship Project CHUB (Computational Human Behavior). We thank all FBK’s Research Units for granting privileged access to extraordinary high-performance computing resources for the analysis of massive infodemics data. We thank Jason Baumgartner for kindly sharing data between 21 Jan and 24 Jan 2020.
Authors’ contributions
M.D. conceived the study. M.D. and F.V. collected the data. R.G. and N.C. performed all experiments and analysed the data. M.D. and P.S. wrote the manuscript.
Competing interests
The authors declare no competing interests.