Introduction

We are in the midst of a global crisis caused by the outbreak and spread of the COVID-19 virus, and the damaging influence of this viral infection has led the World Health Organization (WHO) to declare the ongoing situation a pandemic. As per the official statement of WHO, “COVID-19 is the infectious disease caused by the most recently discovered coronavirus. This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. COVID-19 is now a pandemic affecting many countries globally” [1]. As a precautionary response to this declared pandemic, countries all over the world have introduced restrictions on mobility and transportation, referred to as lockdowns. Consequently, citizens are being asked to stay indoors as a measure of safety from the infection. In this age of numerous news channels and popular social media platforms, a large share of the time spent indoors is invested in engaging with such media. This is corroborated by a recent study [2], which revealed an increase of about 57% in news consumption via television or smartphones, owing to constant indoor presence.

A primary preoccupation of people during this pandemic is the changing statistics of affected or deceased people world-wide, and such articles form the crux of the news that the different media channels publish. This virus outbreak has also raised a plethora of other controversial issues, leading to continuing debates and discussions with consequences at both local and global levels. As a whole, it is apparent that only a limited number of news items are delivered on a positive note. The impact of negativity in the news is a long-standing concern, and has been addressed from time to time [3, 4], but the prevailing situation is predicted to leave a long-lasting and damaging impact on mental health and human psychology as a whole [5]. Meanwhile, the day-to-day statistics of deaths or counts of affected patients due to the pandemic are expected to influence the news sentiment too. The authors have taken up the challenge of determining the news sentiment during a fixed period of study, as well as analyzing the influence of world-wide and country-wide statistics on the news sentiment during the selected duration.

The organization of the paper is as follows: "Literature review" section gives a brief description of the studied related works and motivations drawn for the current work; the details of each data corpus used in the work are provided in "Data description" section; "Data processing" section lists the techniques used for processing the comprehensive data corpora; the experiments and observations are discussed in "Experiment 1: sentiment analysis" section, "Experiment 2: statistical analysis" section, "Experiment 3: n-gram analysis" section, "Experiment 4: case studies" section; finally, the concluding remarks are offered in "Conclusion" section.

Literature review

The challenge of opinion mining as an application field of data mining is well addressed, and there have been multiple works in this domain with a variety of solutions based on the increasing availability of growing datasets. A vast majority of these works are dedicated to the challenge of sentiment analysis in text collections of different types. Similar to challenges in other domains, the task of sentiment analysis can be approached either as a supervised classification problem or as an unsupervised sentiment identification problem [6].

The number of works that have addressed sentiment analysis with a supervised approach exceeds the number that have used unsupervised, exploratory techniques. For a supervised sentiment classification problem, the primary requirement is that the text corpus be labeled, i.e., each text string in the data set needs to be annotated as belonging to a particular class: positive, negative, or neutral in this case. A study of the state-of-the-art works reveals that for previously annotated texts, mostly based on Twitter data, blog posts, web logs, movie reviews, etc., researchers have used some common machine learning techniques, namely Support Vector Machine and Naive Bayes [6,7,8,9,10,11], or even Deep Convolutional Neural Networks [12, 13]. It is a general observation that such techniques are more effective in sentiment analysis tasks than unsupervised approaches. However, the overall performance of supervised algorithms in opinion mining is generally lower than that in other domains [10].

On the other hand, the task of analyzing sentiments is more challenging with the use of unsupervised learning techniques. Also, such techniques are often more suited for mining the sentiment from bulky sources of data. Identification of semantic orientation [14], comparative study and low performance of the SentiWordNet lexicon in sentiment analysis [9], development of novel emoji and linguistic content-based lexicons using unsupervised approach [15, 16], sentiment polarity detection system using unsupervised approach on Turkish movie reviews [17], etc. are all different interesting research works that use unsupervised approach. The application of standard lexicons such as SentiWordNet [18], AFINN [19], etc. in unsupervised sentiment classification is widely studied and evaluated in different works [20,21,22]. These lexicon based techniques are employed in solving interesting problems, such as analyzing the sentiment of the characters in Shakespeare’s plays [23], opinion mining from clinical discharge summaries [24], development of bias-aware systems [25], etc. Other popular methods for sentiment identification include k-means [11, 26, 27], Latent Dirichlet Allocation (LDA) [28, 29], etc. In all such cases, it is seen that the inherent simplicity, lack of training, and lower computation requirement involved in unsupervised approaches make it easier to use on and learn from data corpus of substantially large size [30].

During a survey of state-of-the-art research using unsupervised lexicon based approach on text data, it is seen that most of these works are based on exploratory sentiment analysis and evaluation of classification techniques, used on different types of data. However, there is a relatively small amount of research that has worked with news data, and almost all such works are based on financial news and stock price prediction [31,32,33,34,35], etc. Similarly, there are only a few works regarding the statistical effect of real-world events on the overall sentiment of global news, mostly related to the financial sector [36,37,38], etc.

In this technologically developed era, people are engrossed in the news media, and agenda setting [39] has a crucial role to play in times of a crisis. Researchers have often examined the role played by mass media in setting the agenda in response to a particular incident or event, which is rapidly propagated among the audience [40]. This entails a number of problems as well as lucrative opportunities for the media agencies, as explored in [41]. In a related context, the work by Kirk et al. [42] analyzes the agenda setting and media policies in response to a disaster. While the proposed work does not focus on these issues, the authors wish to highlight the underlying role of media in maintaining global public sentiment and mental health given the ongoing COVID-19-related crisis. The news media need to be responsible as well as alert to ensure the proper propagation of awareness and shaping of public sentiment, particularly involving second-level agenda setting [43, 44].

Given these observations and the ongoing pandemic, the authors were motivated to make the following research contributions:

  • The current work determines the general sentiment of news articles during the ongoing pandemic with unsupervised and transfer learning-based approaches,

  • This is the only work, to the authors’ knowledge, that determines the implications of temporal statistics in a pandemic situation on news sentiment throughout the world during a fixed period of study. The current work statistically determines how, and after what delay, the number of affected patients and the number of deaths due to COVID-19 impact the news sentiment in regional and world-wide news,

  • The authors also analyze other relevant factors that contribute to rise or fall of global news sentiment related to particular countries.

Data description

The proposed work uses data regarding the daily news articles published online globally, as well as the statistical details of day-to-day cases and deaths due to COVID-19 throughout the world. Accordingly, two comprehensive data sets have been used in this work, as described below:

  • COVID-19 data: This set consists of daily statistical data about the numbers of confirmed cases and deaths, gathered for all the COVID-19 affected countries in the world, provided in different file formats. Found in the portal Our World In Data [45], each day’s data corpus consists of 25 attributes, such as country_ISO, location, date, total_cases, new_cases, total_deaths, new_deaths, etc. The repository contains data from the beginning of the year 2020 till date, and is being regularly updated.

  • News data: This data corpus is provided by The GDELT Project [46], where daily news articles from all over the world are aggregated together in CSV files. The news articles are fetched based on their mention of COVID-19, and are grouped together based on certain keywords such as masks, tests, cases, panic, quarantine, etc. in separate files every day. Each data file contains the news article text, its URL, page title, and date. This repository contains the news-related data from 26th March, 2020 onwards only.

Thus, the aforementioned data corpora are used to extract data for the duration—26th of March to 31st of May, 2020—i.e., a total of 66 days, spanning more than 2 months. Out of these 66 days, the regular data about number of COVID-19-affected patients and deaths are considered only for the first 60 days, whereas the news sentiment-based experiments have made use of the other days to experiment with sliding window for determining maximum correlation. The effective period of study is thus 60 days.

Data processing

The unlabeled news data described in the previous section have been processed in this part of the work. All of the steps discussed below are performed for each day’s data, to generate usable corpora for the experiments.

  • Data merging: There are 11 files containing news snippets from each day, and these are initially merged to generate a single data repository per day. Thereafter, some steps are followed for processing, as described below.

  • Removing numbers: Initially, the news text contained in the merged corpus for each day of the study is processed using regular expression-based operations. The articles contain different statistics or other details expressed as digits which are removed to generate an intermediate form of cleaned text.

  • Removing special symbols: The news articles consist of different special symbols such as -, ?, &, % etc. which are removed from the output of the previous step, to build the next intermediate form of cleaned news text.

  • Removing URLs: The hyperlinks or web addresses or URLs are also removed from the intermediate forms of the clean text, as these are not useful in determining the sentiment of a particular piece of text.

  • Removing stop words: A common approach is followed to remove the words that are not useful in sentiment analysis process, but which make up a significant part of any text. Examples of such words are: and, for, is, the, to, at, in, etc.

  • Stemming: As a last step of processing the news articles, stemming is applied to derive the root form of the inflected or derived words in each cleaned string. Such derived words are used to propagate different grammatical concepts such as mood, tense, voice, etc. As a simple example, the words working, works, and worked all have the same stemmed form work.

Once all the above steps have been performed, the processed texts for the total duration of the current study in 60 processed files are merged as a single file containing over 6.34 million distinct news articles.
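
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the authors' actual pipeline: the stop-word list is a tiny hypothetical sample, naive_stem is a crude stand-in for a proper stemmer (such as Porter's), and URLs are stripped before symbols so that the links are still intact when they are matched.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"and", "for", "is", "the", "to", "at", "in", "a", "of", "on", "was", "it"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # matches digits and special symbols


def naive_stem(word):
    """Very rough suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Apply the cleaning steps to one news snippet and return the cleaned string."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs while they are still intact
    text = NON_LETTER_RE.sub(" ", text)  # remove numbers and special symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


print(clean_text("She worked at https://example.org in 2020 & works 100% remotely!"))
# prints: she work work remotely
```

In this sketch the same function would be applied to every snippet of each day's merged file before the per-day files are combined.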

Experiment 1: sentiment analysis

The merged news data corpus consisting of comprehensive, cleaned strings from the previous step is unlabeled in nature, i.e., the news articles are not originally assigned any particular sentiment label. For this reason, training a supervised, classification-based sentiment model directly on this data set is not possible.

Sentiment scoring

For sentiment prediction, the cleaned text articles for each day are now scored using two different approaches, namely the AFINN lexicon [19] in an unsupervised learning approach, and by the Naive Bayes [47]-based transfer learning approach which has been trained on a popular movie reviews dataset [48].

A lexicon is a comprehensive collection of words, and AFINN is one such widely used lexicon consisting of over 3300 words, each associated with a sentiment score. This polarity score lies between −5 and +5, and every string in our cleaned news text is analyzed by applying the AFINN lexicon to generate a corresponding sentiment score. As an example, the string It was a good memory is scored word by word using AFINN, where the scores are 0, 0, 0, 3, and 0, respectively, giving a total score of +3. Evidently, the stop words play no role in such analysis, which is why they were removed during text processing in the previous section. The scores determined using the AFINN lexicon are then converted to sentiment categories: all texts with a score less than 0 are labeled negative, those with a score equal to 0 are neutral, and all remaining texts are annotated as positive. A notable limitation is that such approaches consider only single-word constructs, or unigrams, for sentiment scoring. They therefore fail to capture the meaning of multi-word constructs in English (a negated phrase such as not good, for instance, is still scored by the word good alone) and fail to recognize emotions and complexities of the language.
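
The word-by-word scoring and the score-to-category rule described above can be illustrated with a short sketch. The lexicon below is a hypothetical four-word stand-in, not the AFINN word list: only the value for good is taken from the example in the text, and the remaining entries are assumed purely for illustration.

```python
# Toy lexicon for illustration only; "good" = 3 follows the example in the text,
# the other entries are assumed values and are not copied from AFINN.
TOY_LEXICON = {"good": 3, "bad": -3, "death": -2, "recovered": 2}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the scores of the individual words; unknown words contribute 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())


def to_category(score):
    """Map a numeric score to negative (< 0), neutral (== 0), or positive (> 0)."""
    if score < 0:
        return "negative"
    if score == 0:
        return "neutral"
    return "positive"


score = lexicon_score("It was a good memory")
print(score, to_category(score))  # prints: 3 positive

# Unigram scoring ignores negation: "not good" is still scored as positive.
print(to_category(lexicon_score("not good")))  # prints: positive
```

The second call shows the unigram weakness in miniature: the negation carries no score, so the phrase inherits the polarity of good.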

In contrast, the trained Naive Bayes classifier uses its knowledge about sentiment polarity from the aforementioned movie reviews corpus, and correspondingly applies it to assign a sentiment category to each news article per day. Unlike AFINN, this supervised classification approach considers the complete text at a time and is more sensitive to emotions, inherent figures of speech and multi-word constructs in the language used. Also, this approach gives a different view of the studied corpus of news texts, and returns the sentiment category for each news article.
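
As a rough illustration of this transfer step, the sketch below trains a small multinomial Naive Bayes model on labeled review-like sentences and then applies it to an unlabeled news string. Everything here is an assumption made for illustration: the four training sentences are hypothetical stand-ins for the movie reviews corpus, only two classes are used, and add-one smoothing is one common choice that the text does not specify.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_texts):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(doc_counts):
        logp = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical stand-in for the labeled movie-review training corpus.
reviews = [
    ("a wonderful uplifting film", "positive"),
    ("great acting and a hopeful ending", "positive"),
    ("a terrible tragic mess", "negative"),
    ("awful plot and a grim ending", "negative"),
]
model = train_nb(reviews)
print(predict_nb(model, "grim and tragic news from the hospital"))  # prints: negative
```

The point of the sketch is only that the labels come from a different domain (reviews) and are reused on news text, whose own words contribute only where they overlap with the training vocabulary.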

In this manner, for every piece of cleaned news text, we now have an overall sentiment score (from AFINN) and a sentiment category (from both AFINN and the Naive Bayes classifier), which is either positive, neutral, or negative for that string.

Sentiment index

The news data corpora for different days do not consist of the same number of text articles, and each news article has a sentiment category predicted by AFINN and by the trained Naive Bayes classifier. Therefore, there is a need for normalization before any comparative study of news sentiment on different days is conducted. For this purpose, a negativity index for each day is calculated and is used as an indicator of the overall negative sentiment in news on that day. The index for the ith day is calculated as:

$$\begin{aligned} neg_{i}=\frac{\text{Number of articles of negative category}}{\text{Total number of news articles}} \end{aligned}$$
(1)

Similarly, indices for positive sentiment and neutral type of news articles are determined using equations:

$$\begin{aligned} pos_{i}=\frac{\text{Number of articles of positive category}}{\text{Total number of news articles}} \end{aligned}$$
(2)
$$\begin{aligned} neu_{i}=\frac{\text{Number of articles of neutral category}}{\text{Total number of news articles}} \end{aligned}$$
(3)
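
A minimal sketch of Eqs. (1) to (3) for a single day is given below. The function name and the eight-article example day are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def sentiment_indices(labels):
    """Return (neg_i, pos_i, neu_i) for one day's list of article categories."""
    total = len(labels)
    if total == 0:
        raise ValueError("no articles for this day")
    counts = Counter(labels)
    return (counts["negative"] / total,
            counts["positive"] / total,
            counts["neutral"] / total)


# Hypothetical day with 8 articles: 4 negative, 3 positive, 1 neutral.
day = ["negative"] * 4 + ["positive"] * 3 + ["neutral"]
neg, pos, neu = sentiment_indices(day)
print(neg, pos, neu)  # prints: 0.5 0.375 0.125
```

Because each index is a share of that day's articles, the three values sum to 1 and days with different article counts become directly comparable.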

These index values are calculated for the comprehensive data on news articles for the duration of study. The overall spread of these sentiment indices, as determined by the analysis using unigram-based AFINN, is shown in Fig. 1, while Fig. 2 illustrates the same as analyzed by the Naive Bayes-based classifier. Notably, the latter detects substantially higher negativity (about 75%) and lower positivity (about 21%), whereas the neutrality decreases by more than 50% and is deemed almost irrelevant to the study at hand. In both cases, any fall in negativity is accompanied by an increase in positive sentiment, and vice versa; news of neutral sentiment therefore plays a negligible role. A statistical study of the sentiment indices determined in both approaches further reveals that negative sentiment has the highest mean, followed by positive sentiment. These two sentiments also show almost similar deviation during the studied duration under both scoring techniques. Finally, it is evident from the two pairs of figures (Figs. 1 and 3 for AFINN, Figs. 2 and 4 for Naive Bayes) that the overall variation in sentiment patterns is more pronounced in the detection by the AFINN lexicon, in spite of its poorer sentiment detection performance; the AFINN-based indices are therefore selected for the experiments in the next section.

Fig. 1

Illustration of the significance of the three sentiments in global news during the period of study, determined using AFINN lexicon. News with neutral sentiment has minimum presence, and positive news sentiment seems to be slowly catching up with the negativity

Fig. 2

Illustration of the significance of the three sentiments on global news during the period of study, determined using Naive Bayes. News with neutral sentiment has minimum presence, and there is a substantial gap between the positivity and negativity in news sentiment

Fig. 3

Statistical distribution of three sentiment polarities during the 60 days of study (using AFINN—unsupervised approach)

Fig. 4

Statistical distribution of three sentiment polarities during the 60 days of study (using Naive Bayes—transfer learning approach)

The most commonly occurring words in the news articles with negative sentiment, for the complete duration of study, are illustrated in Fig. 5.

Fig. 5

This word cloud highlights the specific words that are present in each day’s most negative news articles. The relatively large size of words such as death, fatality, case, coronavirus, died, infection, and hospitalized is representative of their frequencies of occurrence during the 60-day period of study

Experiment 2: statistical analysis

This is the next set of experiments where two separate sets of data are utilized, namely:

  • the world-wide news-based negativity index values from the previous experiment determined using AFINN lexicon based approach, as the variation of sentiment polarity is found to be more in that case, and,

  • the number of new cases and number of deaths per million of the population.

These corpora are analyzed to determine the underlying relation between the variation of news sentiment and ground reality of cases and deaths due to COVID-19 pandemic.

Distribution of data

To statistically determine the link between the news negativity and number of cases or number of deaths due to the pandemic, it is essential to determine the distribution of each of these variables. Figure 6 shows the respective distributions.

Fig. 6

Distribution of data in the three variables used for the study. a shows the distribution of data on negative indices in global news, b illustrates the characteristics of data on the number of deaths, and c gives the data distribution for the number of cases during the 60-day period of study. d illustrates a sample normal distribution

From the figures, it is noticed that all the three variables used in this work follow a near-normal or near-Gaussian [49] distribution. Therefore, it is feasible to directly determine the statistical relation between these variables.

Trends of news sentiment vs. number of cases

Initially, an attempt has been made to visually determine the relation between the distributions of features from the two data corpora. In Fig. 7, the number of confirmed COVID-19 cases during the span of the study is represented as a bar plot, and the negativity index values in global news are plotted for the same duration as a line plot. It is seen that peaks in news negativity are quite often related to a rise in the number of cases, as seen in the variations of both variables for different sets of days. Also, the decreasing step pattern in the number of cases during days 14–19 and 21–26 is distinctly reflected in the news negativity plot.

Fig. 7

Illustration of the number of cases vs the characteristics of negative news sentiment for 60 days. Both the variables are kept to scale in the illustration

Trends of news sentiment vs. number of deaths

Similar to the previous case, Fig. 8 gives the number of daily deaths in bar stacks, while the line plot is the same as the previous figure. It is seen that there is not much similarity in trends between the two data during the first 20 days. In contrast, some similarity in the data patterns is evident in the duration of days 22–32, after which there is no visible similarity.

However, in both the above cases, it is observed that similar patterns in news occur at a delay of a few days. This can be attributed to the fact that day-to-day statistics do not get reported on the same day, and generally take at least a day or two to appear and make an impact on the global news sentiment. This observation leads to the need for determining the optimal time window at which the trends in the corpora are most similar.

Fig. 8

Illustration of the number of deaths due to COVID-19 vs the characteristics of negative news sentiment for 60 days. Both the variables are kept to scale in the illustration

Determining correlation

From the previous section, it is observed that the trends in news negativity are more or less affected by the variations in the number of cases and the number of deaths. Also, the impact of the trends in number of cases or deaths is visible at a delay of a few days. Therefore, it is necessary to statistically determine the exact delay at which the news sentiment reflects the reality of the situation.

The statistical measure of similarity in data for two variables can be determined by calculating their correlation coefficient. In this part of the experiment, the authors have experimentally determined the correlation coefficient \({r_{n}}\) between the news sentiment and the number of cases or number of deaths, using a set of sliding windows on the news sentiment index values, where each window is shifted n days ahead of the actual duration of the study, for n = 0, 1, 2, 3, 4. That is, to re-create the most visibly aligned variations, the same set of values for the number of cases or deaths is compared with news negativity index values taken from temporally shifted sets of 60 days each. In all cases, the correlation is calculated using the Pearson correlation coefficient [50] between two variables x and y, given by the formula:

$$\begin{aligned} r_{xy}=\frac{\sum _{i=1}^{n} (x_{i}-x')(y_{i}-y')}{\sqrt{\sum _{i=1}^{n} (x_{i}-x')^{2}}\sqrt{\sum _{i=1}^{n} (y_{i}-y')^{2}}} \end{aligned}$$
(4)

Here, x' and y' denote the sample means of x and y, and the sums run over the days in the window; the upper limit n of the summation is the number of observations and is distinct from the shift n used above. This coefficient always lies between −1 and +1, where a positive value close to 1 indicates that both variables change simultaneously in the same direction, a negative correlation stands for two variables changing in opposite directions, and zero correlation denotes no linear relationship between the variables. In practice, any correlation value above 0.5 is treated as a moderately strong positive correlation. Using these concepts, along with the previous observations about the delay in the impact of actual changes in parameters on the news sentiment, the maximum positive correlation value is determined to derive the actual delay. A similar use of correlation is seen in the works by Fu et al. and Zhang et al. [36, 38].
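
The sliding-window search for the delay can be sketched as below. This is a minimal illustration and not the authors' code: the helper names and the toy series are hypothetical, and the negativity series is simply constructed so that it repeats the case pattern two days later.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Eq. 4)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_shift(stats, negativity, shifts=range(5)):
    """Correlate stats[0:T] with negativity[n:n+T] for each shift n.

    `negativity` must hold at least len(stats) + max(shifts) values.
    Returns (best_n, {n: r_n}).
    """
    window = len(stats)
    r = {n: pearson(stats, negativity[n:n + window]) for n in shifts}
    return max(r, key=r.get), r


# Hypothetical toy series: negativity echoes the case counts two days later.
cases = [1, 3, 2, 5, 4, 6]
negativity = [0.5, 0.4, 0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.3, 0.2]
n, r = best_shift(cases, negativity)
print("best shift:", n)  # prints: best shift: 2
```

With a 60-day window and shifts of 0 to 4 days, the negativity series needs 64 values, which is why more days of news than of COVID-19 statistics are required.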

Table 1 Distribution of Pearson correlation coefficient values for global news sentiment polarity and COVID-19-related variables

From Table 1, it is obvious that in general, there exists more correlation between the daily negative sentiment in news and number of COVID-19-related deaths, considering data world-wide, and that the positive correlation is maximum between these variables when the news negativity indices are considered using a 2-day shifted sliding window, i.e., it takes 2 days for the trends in the number of deaths, to have impact on the global news sentiment. Similarly, this shift is confirmed for the global number of cases at a delay of 3 days. This experiment validates the observation about a delay in the impact of number of confirmed patients and number of deaths, on the news sentiment, and also determines the delay in said impact on a global scale.

Aligning the curves

In the final part of this experiment, the correlation values and optimal time-windows determined in the previous section are used for plotting time-shifted news sentiment curves along with the daily number of cases and number of deaths. Accordingly, the news sentiment about daily number of cases is considered at a shift of 3 days, while that concerned with daily death count is plotted at a shift of 2 days to get the ideally aligned plots. These are shown in Figs. 9 and 10, respectively.

Fig. 9

Illustration of the number of cases of COVID-19 vs the characteristics of negative news sentiment values shifted by a window of 3 days. Both the variables are kept to scale in the illustration

Fig. 10

Illustration of the number of deaths due to COVID-19 vs the characteristics of negative news sentiment values shifted by a window of 2 days. Both the variables are kept to scale in the illustration

It can be seen from Fig. 9 that there are almost perfect matches in pattern in the duration of days 1, 12–19, 20–27, and 31 onwards, though due to differences in scale, the variations are not equally spaced. The visible resemblance in variations is also noted in Fig. 10, especially in days 14–19, 22–26, the abrupt spikes in 30–31, 36–37. However, it is a general observation that the negativity in news prevails even when the global statistics in both cases and deaths are declining which can be attributed to other factors as determined in succeeding experiments. Therefore, it can be said that the negativity index, considering global news, is quite indicative of the changes in the number of new cases and deaths during the ongoing pandemic, while the declining statistics do not seem to have much effect on the overall negativity.

Experiment 3: n-gram analysis

An n-gram can be defined as a contiguous sequence of n words from a given sentence or text. In this part of the experiments, the authors have determined the 60 most common tri-grams that occur in the news during the period of study. This analysis highlights the events, topics, or persons that have been most widely publicized by the online global news in relation to the pandemic scenario. The tri-grams are listed along with their corresponding weighted frequency (calculated using the tri-gram frequency and the total occurrence of the 60 most common tri-grams), as shown in Table 2.
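
A small sketch of the tri-gram counting is given below. It assumes, as one reading of the description above, that the weighted frequency of a tri-gram is its count divided by the total count of the k most common tri-grams; the function name and the three example snippets are hypothetical.

```python
from collections import Counter


def top_trigrams(texts, k=60):
    """Count word tri-grams over all texts and return the k most common
    as (trigram, count, weighted_frequency) tuples."""
    counts = Counter()
    for text in texts:
        words = text.split()
        # zip over three offset views yields every contiguous word triple
        counts.update(zip(words, words[1:], words[2:]))
    top = counts.most_common(k)
    total = sum(c for _, c in top)
    return [(" ".join(tg), c, c / total) for tg, c in top]


# Hypothetical cleaned snippets.
docs = [
    "two people tested positive coronavirus",
    "minister tested positive coronavirus today",
    "practice social distancing guideline",
]
print(top_trigrams(docs, k=3)[0])  # prints: ('tested positive coronavirus', 2, 0.5)
```

Tri-grams are counted within each article only, so no tri-gram spans the boundary between two articles.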

It is obvious from the table that most of the tri-grams concern the pandemic, with massive usage of phrases such as tested positive coronavirus, tested positive COVID, confirmed case COVID, etc. in the global news. The news agenda during the studied period revolves around this central theme, and involves daily COVID-19-related updates and awareness programs being broadcast, as deduced from the usage of phrases like personal protective equipment, confirmed case COVID, people tested positive, number COVID case/number coronavirus case, social distancing guideline, practice social distancing, etc. The crucial and commendable role played by the World Health Organization, the Centers for Disease Control and Prevention (CDC), Johns Hopkins University, and health care workers all over the globe in addressing the different challenges and aspects of this pandemic is also prominently noted from the table. A remarkable observation is that only three state leaders have made it to this list, namely the President of the United States of America (whose name is incidentally in the third most common tri-gram), and the Prime Ministers of the United Kingdom and India, which emphasizes the prominence they enjoy as world leaders in global news, even in these times of distress.

Table 2 A set of 60 most common tri-grams

Experiment 4: case studies

In this last part of the experiments, the observations about the delayed impact of globally changing count of affected patients and deaths on the news sentiment as seen in the previous section have been used to identify similar trends for some specific countries using the respective correlation values. The study is conducted for four countries ordered chronologically, based on when the first virus outbreak occurred in that area, and all articles mentioning country X have been extracted from online global news to perform the corresponding case study on country X. For this purpose, the authors have extracted all news articles corresponding to the countries in question, from the comprehensive global news corpus, for the whole period of the study. Also, in this experiment, z-score [51] technique has been used on both the variables, to normalize the values prior to visualization. The z-score is used to bring values of different variables on the same scale, and is calculated as:

$$\begin{aligned} \text{z-score} = \frac{x_{i}-\mu }{\sigma } \end{aligned}$$
(5)

where \(x_{i}\) denotes the current data element, \(\mu \) denotes the mean of the variable, and \(\sigma \) is its standard deviation. Using this method, the data for each variable are converted to have a mean of 0 and a standard deviation of 1, so in the following graphical representations, values below 0 lie below that variable's mean over the study period, and values above 0 lie above it.
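
The normalization in Eq. (5) can be sketched as follows. The text does not state whether the population or the sample standard deviation is used; this sketch assumes the population form (division by the number of values), and the input list is a hypothetical example.

```python
import math


def z_scores(values):
    """Standardize values using the population mean and standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical series with mean 5 and population standard deviation 2.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
# prints: [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Applying the same transformation to the case counts and to the negativity index puts both on a common, unit-free scale, which is what allows them to share one plot.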

A visual analysis of these images reveals how the observations are generally applicable throughout the data from different countries; that is, whether the global news sentiment about a country is actually affected by the daily trends in number of new cases or deaths. This is determined by the individual correlation of country-wise statistics with appropriately time-shifted global online news about that country.

The scatter plots are generated for the four countries in question. In every set of two plots for each country, perfect or partial overlaps signify only discrete, temporal alignment of the variables, and cannot be treated as a measure of continued similarity in trend, which can be better determined from a set of parallelly distributed data values.

China

The current virus outbreak is believed to have originated in China much earlier, in December 2019, and so the current duration of study has witnessed a sharply flattening curve in the number of cases and a complete prevention of deaths. Among the 6.34 million news texts, only those that mention ‘China’ have been extracted, along with the corresponding sentiment index values per day. The correlation coefficients determined by the sliding window approach are quite low and insignificant from a statistical point of view, as calculated and shown in Table 3. However, in the current context, such values are indicative of a loosely positive similarity in trends. Remarkably, there seems to be an immediate impact of the number of daily deaths per million in China on the global news, whereas the impact of the number of cases per million takes quite some time to appear. Along with this, the highly minimized and flattened death and infection rates are evident from Figs. 11 and 12.

Table 3 Distribution of Pearson correlation coefficient values for sliding window-based global news sentiment regarding China and COVID-19-related statistics in that country

Fig. 11 Visualization of the distribution of normalized number of cases in China with the normalized global negative news sentiment about China, for the duration of maximum correlation

Fig. 12 Visualization of the distribution of normalized number of deaths in China with the normalized global negative news sentiment about China, for the duration of maximum correlation

It is also seen that, in spite of the flattened curves for cases and deaths, the negativity index values are distinctly high and show a decreasing trend only after the 40th day of our study. In line with the corresponding correlation coefficient values, the points in Fig. 11 (number of cases) run more nearly parallel, while the points in Fig. 12 are more dispersed around the flattened death curve, in spite of multiple overlaps, shown in deep red.

Observations: The observed negativity, though generally aligned with the statistics, could be due to various other issues, as is evident from the global news related to China (shown in Table 4). For instance, the rise in negativity during days 14–15 of the study relates to news articles 1 and 3, while articles 3, 7, and 8 attest to the decline in negativity that follows. Similarly, the high negativity around days 34–35 of the study can be attributed to articles 4 to 6, while the succeeding positivity is reinforced by articles such as 9–11. Therefore, it is evident that the global news agenda related to China is mostly oriented towards driving an overall negative image of the country and its actions during the ongoing pandemic.

Table 4 A set of online news articles that may have contributed to global news sentiment regarding China during the period of study

United States of America

The outbreak spread to the USA in late January, and a substantial part of the pandemic’s effect on news is observable in this case. As in the previous case, the news articles and sentiment index values regarding ‘USA’ are extracted and used for the experiment. Table 5 shows that the number of confirmed cases has a greater impact on negative sentiment in the news about the USA, at a delay of 2 days, while the number of deaths has a lower impact, at an overall delay of 4 days. The overall correlation is weakly positive for both pairs of variables.

Table 5 Distribution of Pearson correlation coefficient values for sliding window-based global news sentiment regarding the United States of America and COVID-19-related statistics in that country

Fig. 13 Visualization of the distribution of normalized number of cases in USA with the normalized global negative news sentiment about USA, for the duration of maximum correlation

Fig. 14 Visualization of the distribution of normalized number of deaths in USA with the normalized global negative news sentiment about USA, for the duration of maximum correlation

The distribution of both the number of cases and the number of deaths in the USA resembles a bell curve for the current duration of study, with gradually increasing values up to day 30 and the opposite trend thereafter. Figures 13 and 14 show that, towards the latter half of the studied duration, the overall number of confirmed cases and deaths follows a decreasing trend (more data points below the mean), whereas negative sentiment persists and even increases.

Observations: Apart from the effect of COVID-19-related statistics, various media reports citing the anti-China sentiment of the President of the country, as well as governmental decisions, appear to have influenced the news sentiment. A set of such news articles is provided in Table 6, while the prominence of the US President in global news is already established in Table 2. The high level of negativity during the initial 10 days of the study may be an effect of articles 1–5, while the decreasing negativity from day 10 onwards may be due to the events that articles 6 and 7 correspond to. Similarly, the positive sentiment at about day 50 is aligned with article 8, whereas the succeeding rapid rise in negativity (in spite of a drop in COVID-19 cases and deaths) could be attributed to events highlighted by articles 8–13. As in the observations regarding China, the agenda of global online news appears to be driven more by the socio-political activities concerning the country.

Table 6 A set of online news articles that may have contributed to global news sentiment regarding USA during the period of study

Italy

Italy is one of the countries worst affected by the COVID-19 outbreak. During our period of study, both the death count and the number of confirmed cases are seen to be gradually declining. The global news articles which feature ‘Italy’ have been extracted, along with the corresponding sentiment category of each article, for this experiment. As in the previous experiments, a correlation study has been undertaken to assess the impact of death and infection statistics on news sentiment. This helps to determine the extent to which the news sentiment reflects the ground reality, by considering shifts of one day at a time, up to 5 days. The results of the study for Italy are shown in Table 7.

Table 7 Distribution of Pearson correlation coefficient values for sliding window-based global news sentiment regarding Italy and COVID-19-related statistics in that country

Fig. 15 Visualization of the distribution of normalized number of cases in Italy with the normalized global negative news sentiment about Italy, for the duration of maximum correlation

Fig. 16 Visualization of the distribution of normalized number of deaths in Italy with the normalized global negative news sentiment about Italy, for the duration of maximum correlation

The impact of the COVID-19 situation in Italy on global news is seen to be at its maximum at a shift of 5 days, although the correlation remains high throughout. Accordingly, the aligned scatter plots are generated using the z-score normalized values, as shown in Figs. 15 and 16. As is evident from the table and figures, the correlation between deaths in Italy and the negativity index in global news is higher than that for the number of infected cases, although both variables show a comparatively strong correlation with the negative news sentiment. This can also be observed from the higher number of complete and partial overlaps, as well as from the gradually decreasing dispersion of the negativity values, in proportion to the values of confirmed cases or deaths, in Figs. 15 and 16.

Observations: Given the strong correlation observed, it can be inferred that COVID-19 statistics have their strongest effect on global news sentiment in the case of Italy. Nevertheless, a small set of relevant news articles is listed in Table 8.

Table 8 A representative set of online news articles that may have contributed to global news sentiment regarding Italy during the period of study

India

Though the first confirmed COVID-19 case in India was noted at almost the same time as in Italy, the rising effect of the outbreak is quite clear in our studied time period. The study reveals interesting results: both the number of affected cases and the number of deaths are steadily increasing during the time period considered. The correlation coefficients determined by shifted negativity index windows are shown in Table 9. Surprisingly, the correlations are all negative, indicating that the rising deaths and the spread of COVID-19 in India have only a very weak effect on global news sentiment about India.

Table 9 Distribution of Pearson correlation coefficient values for sliding window-based global news sentiment regarding India and COVID-19-related statistics in that country

Given that the study intends to determine the similarity in trends of news sentiment and death or infection statistics, the least negative correlation coefficient values are selected for visualizing the trends; these are noted at a delay of 4 days in each case. Notably, even this least negative value indicates, statistically, almost no correlation. The same is depicted in the scatter plots in Figs. 17 and 18, where the negativity index values are highly dispersed and even show a decreasing trend in the latter half of the study, in spite of the steep climb in the actual statistics. As noted in the "Experiment 1: sentiment analysis" section, neutral news has a minimal role in the global scenario, and its role should be smaller still at a country-wide level. A possible inference may be that the negative sentiment in global news about ‘India’ is minimized so as to prevent panic among the huge population, or that the global news is not really representative of the COVID-19 statistics alone in the Indian context.

Fig. 17 Visualization of the distribution of normalized number of cases in India with the normalized global negative news sentiment about India, for the duration of maximum correlation

Fig. 18 Visualization of the distribution of normalized number of deaths in India with the normalized global negative news sentiment about India, for the duration of maximum correlation

Observations: The lack of a clear correlation suggests that the news agenda is influenced by many factors other than COVID-19 during our period of study. Table 10 highlights some of the problems that were initially a cause of the massive negativity in news sentiment, in spite of the low rate of COVID-19 infection. These cover several socio-economic aspects of Indian life during this crisis, and the analysis and discussion of such observations could in itself form a full-fledged study of the agenda-setting policies of online news media.

Table 10 A representative set of online news articles that may have contributed initially to global news sentiment regarding India

Conclusion

The proposed work addresses the challenge of identifying the general sentiment in globally published news articles as an effect of the ongoing pandemic, using both unsupervised and transfer learning-based approaches, on comprehensive data gathered for a fixed period of time. A statistical study is also undertaken to determine the impact of variations in the number of affected patients and deaths due to the COVID-19 virus on the news sentiment at a global scale. The same study is repeated for selected countries and the sentiment of global news pertaining to the effect of COVID-19 in those countries, by considering normalized values of all variables. The observations are substantiated by n-gram analysis that highlights the most prominent tri-grams, or three-word phrases, used in online news globally. The strongest correlation between news sentiment and COVID-19 statistics exists for Italy, which is similar to the observation for news and statistics on a global scale. The authors have also utilized a set of relevant news articles to substantiate the observations during the case studies. The authors have determined that negativity is the predominant sentiment in global news, and that COVID-19-related real-world statistics, agenda setting by news agencies, and different social factors (such as job loss and migrant worker problems) and political factors (such as the continued tussle between the Presidents of the USA and China) drive the negativity in online news quite strongly, which could lead to long-standing effects on the mental health of the news audience. The results lead to relevant questions and, consequently, a plethora of computational and social study-based research challenges. Such studies will be useful in determining the long-standing psychological effects of news sentiment on mental health in a pandemic situation, the representation of regional challenges in online global news, news media agenda setting, etc.
In the future, the authors wish to extend this work by utilizing country-specific news data in the respective official national languages, which will aid further fine-grained analysis.