Health Communication Through News Media During the Early Stage of the COVID-19 Outbreak in China: A Digital Topic Modeling Approach

Background: In December 2019, the first COVID-19 cases were reported, and the disease soon broke out. As this dreadful disease spreads rapidly, the mass media has been active in community education on COVID-19 by delivering health information about this novel coronavirus. | Methods: We used the Huike database to extract news articles about coronavirus published by major press media between January 1st, 2020 and February 20th, 2020. The data were sorted and analyzed using Python and the Python package Jieba. We sought a suitable number of topics using the coherence score. We ran Latent Dirichlet Allocation (LDA) topic modeling with that number of topics and generated the corresponding keywords and topic names. We divided these topics into different themes by plotting them onto a two-dimensional plane via multidimensional scaling. | Findings: After removing duplicates, 7791 relevant news reports were identified. We listed the number of articles published per day. According to the coherence value, we chose 20 as our number of topics and obtained their names and keywords. These topics were categorized into nine primary themes based on the topic visualization figure. The three most popular themes were prevention and control procedures, medical treatment and research, and global/local social/economic influences, accounting for 32.6%, 16.6%, and 11.8% of the collected reports, respectively. | Interpretation: The Chinese mass media news reports lag behind the COVID-19 outbreak development. The major themes accounted for around half the content and tended to focus on the larger society rather than on individuals. The COVID-19 crisis has become a global issue, and society has also become concerned about donation and support as well as mental health. We recommend that future work address the mass media's actual impact on readers during the COVID-19 crisis through sentiment analysis of news data.


Processing
A total of 11,220 articles dated between January 1st, 2020 and February 20th, 2020 were found with the keyword search "coronavirus". After cleaning the data, 7791 articles remained.
Before applying LDA modeling, we used Python to perform data cleaning and used the Python package Jieba for data processing 18,19 . The detailed data processing workflow is illustrated in Figure 1.
We next removed common Chinese stop words, such as "ten", "a", "of", and "it". We also built a document-term matrix (DTM) and used TF-IDF to weight the data.
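The stop-word removal, document-term matrix, and TF-IDF steps can be sketched as follows. This is a minimal illustration and not the authors' code: the stop-word set, the sample documents, and the function names are hypothetical, the tokens are English stand-ins for the output of Jieba segmentation, and the weighting shown (term frequency times log(N / document frequency)) is one common TF-IDF variant that may differ from the one used in the study.

```python
import math
from collections import Counter

# Hypothetical stop-word list (English stand-ins for Chinese stop words).
STOP_WORDS = {"ten", "a", "of", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def build_dtm(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[i][index[term]] = count
    return vocab, matrix


def tfidf(matrix):
    """Weight each count as (count / document length) * log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1
        weighted.append([
            (c / total) * math.log(n_docs / df[j]) if c else 0.0
            for j, c in enumerate(row)
        ])
    return weighted


# Hypothetical, already-segmented documents.
raw_docs = [
    ["outbreak", "of", "wuhan", "prevention"],
    ["a", "prevention", "outbreak", "measure"],
    ["it", "treatment", "research"],
]
docs = [remove_stop_words(d) for d in raw_docs]
vocab, counts = build_dtm(docs)
weights = tfidf(counts)
```

Terms that occur in fewer documents receive a larger weight, so a term found in one document outranks a term shared by two documents of the same length.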
To find a suitable number of LDA topics and to investigate the relationship between the COVID-19 crisis and news reports, we conducted several analyses. To select the number of topics, we used a coherence score 20 . Topic coherence measures the consistency of a single topic by measuring the semantic similarity between the high-scoring words in that topic, which helps to improve the semantic understanding of the topic. That is, words are represented as vectors through their co-occurrence relations, and semantic similarity is the cosine similarity between word vectors. The coherence is the arithmetic mean of these similarities 21 . We used CoherenceModel from Gensim, a Python package for natural language processing, to calculate the coherence value 22 . According to Figure 2, the coherence score increased and stabilized as the number of topics increased to 20, then declined after the number of topics reached 25. However, we found that the result could be uninterpretable for humans if only statistical measures were applied 23 . As a result, we combined statistical measures with manual interpretation and chose 20 topics to analyze with the help of Python 3.6.1 and the LDAvis tool 16 . We set λ = 1 and obtained the 20 topics and their keywords. Topic names were generated from the corresponding keywords to describe the topics.
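The coherence idea described above (co-occurrence vectors, cosine similarity, arithmetic mean) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the computation performed by Gensim: co-occurrence is counted at the document level, each word's vector holds its co-occurrence counts with the topic's top words, and the sample documents and function names are hypothetical.

```python
import math
from itertools import combinations


def cooccurrence_vector(word, top_words, docs):
    """Count, for each top word, the documents that contain it together with `word`."""
    doc_sets = [set(d) for d in docs]
    return [
        sum(1 for s in doc_sets if word in s and other in s)
        for other in top_words
    ]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_coherence(top_words, docs):
    """Arithmetic mean of pairwise cosine similarities between the top words."""
    vectors = {w: cooccurrence_vector(w, top_words, docs) for w in top_words}
    sims = [cosine(vectors[a], vectors[b]) for a, b in combinations(top_words, 2)]
    return sum(sims) / len(sims) if sims else 0.0


def mean_coherence(topics, docs):
    """Average coherence over all topics of one model."""
    return sum(topic_coherence(t, docs) for t in topics) / len(topics)


# Hypothetical tokenized documents.
docs = [
    ["mask", "wash", "hands"],
    ["mask", "hands", "prevention"],
    ["vaccine", "research", "trial"],
    ["vaccine", "trial", "hospital"],
]
coherent = topic_coherence(["mask", "hands"], docs)   # words that share documents
mixed = topic_coherence(["mask", "vaccine"], docs)    # words that never co-occur
```

Computing `mean_coherence` for models fitted with different numbers of topics gives a curve of the kind shown in Figure 2. The score only narrows the candidates, and the final choice still relies on manual interpretation.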
We also divided these topics into different themes to study them better. In the visualization on the two-dimensional plane (Figures 3 and 4), the 20 topics were represented as circles. These circles overlapped, and their centers were determined by the computed distance between topics 16 . Using this approach, the 20 topics were classified into nine primary themes, which are shown in Table 1.
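The distance between topics that positions the circles can be illustrated with a short sketch. It assumes the Jensen-Shannon divergence between topic-word distributions, which is the default distance in LDAvis. The text does not state which distance was used, and the three toy distributions below are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def distance_matrix(topics):
    """Pairwise Jensen-Shannon divergence between topic-word distributions."""
    n = len(topics)
    return [[js_divergence(topics[i], topics[j]) for j in range(n)] for i in range(n)]


# Hypothetical topic-word distributions over a four-word vocabulary.
topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]
dist = distance_matrix(topics)
```

Multidimensional scaling then projects such a matrix onto two dimensions, so topics with similar word distributions (here the first two) land close together and can be grouped into one theme.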
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted March 31, 2020. https://doi.org/10.1101/2020
Remarks: The total percentage is 100·1% due to automatic rounding while exporting the results.

The areas of the circles indicate the overall prevalence, and the centers of the circles are determined by computing the distance between topics. Intertopic distances are shown on a two-dimensional plane 24 via multidimensional scaling. PC1 represents the transverse axis, and PC2 represents the longitudinal axis.
reported on 9 January 2020. Between 20 and 23 Jan 2020, we observed a sharp increase in relevant news. As the daily new cases decreased since 4 Jan 2020, the number of daily news reports began to drop. The increase in the number of cases on 12 and 13 Feb was due to the
public transportation) accounted for 8·9%, 6·7%, 5·6%, 4·4%, and 4·4% of articles, respectively.

The data on daily confirmed cases and deaths between January 1st, 2020 and January 16th were extracted from the figure in a transmission dynamics study published on March 26th, 2020 27 .

As demonstrated in Figure 6, China News Service was the most productive media source,

[Figure axes: number of news articles; mentioned organizations and companies]
There is an emphasis on Wuhan's story, where news stories focus on an individual's life instead of the whole city level. We also observed that Theme 6 (Mental health) accounts for 4·4% of all
staff, and the public in the news reports. These two themes indicate the mass media adopts a people-oriented principle when reporting the COVID-19 crisis.
Huike mass media database which only covers text news articles. However, the mass media

messaging application) to deliver health information through images, snapshots, and short videos. Therefore, we might omit the news content and the impact of mass media on these
valuable if we could measure the public's reaction to our study.
Qian Liu, Quan Liu, Qiuyi Chen collected and cleaned data
Qian Liu and Wai-kit Ming did the data analysis, data interpretation, and wrote the first version
contributed to the administration of the project and data analysis, data interpretation.

Babatunde Akinwunmi and Wai-kit Ming reviewed the manuscript. All authors contributed to the interpretation of the results and final manuscript.

All authors discussed and agreed on the implications of the study findings and approved the final version to be published. We declare no competing interests.
