Multi-label Infectious Disease News Event Corpus

Jakub Piskorski1,∗, Nicolas Stefanovitch2, Brian Doherty2, Jens P. Linge2, Sopho Kharazi3, Jas Mantero4, Guillaume Jacquet2, Alessio Spadaro2 and Giulia Teodori2

1 Polish Academy of Sciences, Warsaw, Poland
2 European Commission Joint Research Centre, Ispra, Italy
3 Piksel SRL
4 Ending Pandemics

Abstract
This paper describes a new corpus consisting of circa 4.5K news snippets (multi-)labelled with fine-grained infectious disease-related event types. The paper presents the underlying event taxonomy, consisting of 25 fine-grained event types grouped into 8 main categories, the process of creating the corpus and related statistics, and reports on the performance of SVM- and RoBERTa transformer-based baseline models for multi-label event classification. The former model obtains macro F1 scores of 0.56 and 0.68 for fine- and coarse-grained classification, respectively, whereas the corresponding macro F1 scores for the latter model are 0.65 and 0.76.

Keywords
multi-label event classification, infectious diseases, machine learning, linguistic resources

1. Introduction

Surveillance of and quick response to situations emerging from outbreaks of infectious diseases, e.g. Covid-19, rely on comprehension of all related events, which, among others, are reported in large amounts in news articles published every day. Automated solutions that facilitate extraction and classification of such events are crucial in order to leverage such sources of information, especially for early-warning systems. In this paper, we describe a new corpus consisting of news snippets multi-labelled with the fine-grained infectious disease-related event types reported therein.
The main drive behind this endeavour was to create material for training and building ML-based models for event detection/classification in epidemics-related online news gathered by a large-scale news aggregation and analysis engine, and to share such a resource with the scientific community, since, to the best of our knowledge, no similar publicly accessible event-centred corpus exists for this specific domain. Event detection and classification constitute a key enabling technique to build higher-level applications, e.g. event extraction, news summarization, and narrative understanding.

In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story'23 Workshop, Dublin (Republic of Ireland), 2-April-2023
∗ Corresponding author.
jpiskorski@gmail.com (J. Piskorski); nicolas.stefanovitch@ec.europa.eu (N. Stefanovitch); brian.doherty@ec.europa.eu (B. Doherty); jens.linge@ec.europa.eu (J. P. Linge); sopho.kharazi@ext.ec.europa.eu (S. Kharazi); jasmantero@gmail.com (J. Mantero); guillaume.jacquet@ec.europa.eu (G. Jacquet); alessio.spadaro@ec.europa.eu (A. Spadaro); giulia.teodori@ec.europa.eu (G. Teodori)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Since the beginning of the Covid-19 pandemic, a vast amount of work on studying Covid-19-related online media and automated analysis thereof has been reported, which mainly focused on exploiting topic detection [1], fake news/misinformation narrative analysis [2], entity and demographic-based analysis [3], and sentiment detection [4, 1], whereas relatively little work on automated event detection and extraction has been published in this context.
A corpus of 10K tweets containing public reports of Covid-19 events centered around reporting cases, deaths, prevention measures, and cures was presented in [5]. A large hand-coded dataset of over 13K policy measures related to Covid-19 introduced worldwide, gathered among others from online news, is presented in [6]. Other online media resources related to Covid-19 are listed on the CLARIN Covid-19 response web page.¹ [7] presented a BERT-based system that extracts and classifies Covid-19 related events and relations between them, using a semi-automatically created event taxonomy consisting of 76 event types. The event taxonomy in the aforementioned work is, to some degree, similar to the one presented in this paper; however, no event-labelled corpora have been released by its authors. Furthermore, our event taxonomy was not created automatically, but emerged from a business requirement analysis by public health experts and was designed upfront to cover any infectious disease, going beyond the Covid-19 pandemic. Finally, the news snippets in our corpus are multi-label annotated.

Related to our work, some short news text classification datasets have been published, e.g. [8] introduces a corpus of ca. 200K news headlines labelled with 40 general news categories. There is also work on ML-based models (accompanied by datasets) for the detection and classification of natural disasters [9], financial [10] and socio-political events [11] reported in the news; these domains, however, have little in common with pandemics and infectious diseases.

2. News Snippet Event Corpus

This section describes the event taxonomy and the creation of the corpus of news snippets with labels corresponding to the events referred to in these snippets, and provides some corpus statistics.
We consider an event² to be a situation (or a group thereof) that has occurred, is currently taking place, or is planned or considered to happen in the future, in some place and either at a certain point in time (punctual events) or over a time period with a start date and a potential end date. Furthermore, references to the state (of play) of a situation (an ongoing event) that has not yet ended, as well as statements and opinions made about it, are also considered events.

¹ https://www.clarin.eu/content/clarin-responses-covid-19
² Our notion of events is based on the TimeML standard specifications [12]

2.1. Infectious Disease-related Event Taxonomy

The events are grouped into 8 main categories that revolve around reporting on the disease outbreak development, impact, measures, violations, research, support and communication, complemented by a Miscellaneous category. These are all further subdivided into 25 fine-grained event types that refer to specific aspects of the main categories, e.g. Reporting is subdivided into Reporting cases and Reporting situation. A brief description of the main event categories is provided in Figure 1, whereas the one for the fine-grained types is provided in Figure 6 in Annex A.

REPORTING: reporting single/multiple infection cases and deaths that occurred within a short period of time, and provision of a general situation overview (in terms of people affected) spanning a longer time period.
IMPACT: all events that are impacted by the outbreak of the infectious disease/pandemic, e.g. cancellation of events.
MEASURE: introduction of and changes to legislation, restrictions and recommendations of a preventive nature necessary to combat the disease, i.e. to reduce the number of infected/affected people and the spread of the disease, and roll-out of related vaccines, medicines and equipment.
VIOLATION: any illegal activity, fraud, fake product discovery, unrest related to the introduced measures, and spread of misinformation.
RESEARCH & DEVELOPMENT: reporting on the phenomena observed during the spread of the disease, progress on the development of vaccines, medicine and relevant equipment, and support to research and development related to diagnosing or treating the disease.
COMMUNICATION: high-level meetings to discuss the situation, impacts and/or introduce measures, and launch of new information sharing/collection instruments concerning the disease and related phenomena.
SUPPORT: provision of financial and other types of support to the affected entities, community, economy, etc., and mentions of the need for or lack of such support.
MISCELLANEOUS: any other events related (not covered above) or unrelated to infectious diseases, and non-events, i.e. texts referring neither to an actual event nor to a state of an event, e.g. descriptions of processes.

Figure 1: Coarse-grained Infectious Disease-related Event Categories.

The event definitions are to a large extent 'inclusive', e.g., the Support: goods category covers not only the factual provision of goods to the affected people, but also plans and intents to do so, and expressions of the need to receive such support. The Miscellaneous category is envisaged to capture everything that does not fit anywhere else, and is subdivided into: (a) other events that are related to the domain, but do not fall under any other type, (b) events that are unrelated, and (c) non events, e.g., descriptions of certain generic processes and phenomena that are neither anchored in time nor refer to any specific event instances, although they are relevant for the domain.
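The two-level taxonomy can be written down as a small lookup structure, which also makes the mapping from fine-grained to coarse-grained labels explicit. The following is only an illustrative sketch: the subtype names are taken from Table 1, while the `Category: subtype` string format and the names `TAXONOMY`, `fine_labels` and `to_coarse` are assumptions made for this example and are not part of the released corpus.

```python
from typing import Dict, Iterable, List, Set

# Coarse category -> fine-grained subtypes, as listed in Table 1 of the paper.
# The "Category: subtype" string format used below is an assumption.
TAXONOMY: Dict[str, List[str]] = {
    "Reporting": ["cases", "situation"],
    "Impact": ["displacement of people", "health system", "economy",
               "events", "other"],
    "Measure": ["authority regulation", "facilities", "travel",
                "vaccine/medicine roll-out", "other"],
    "Violation": ["restrictions and unrest", "fake product or fraud",
                  "misinformation"],
    "R&D": ["medicine progress", "phenomena", "funding"],
    "Communication": ["meeting", "launch instrument"],
    "Support": ["financial", "goods"],
    "Miscellaneous": ["other", "unrelated", "non events"],
}


def fine_labels() -> List[str]:
    """Return all fine-grained labels as 'Category: subtype' strings."""
    return [f"{coarse}: {fine}"
            for coarse, fines in TAXONOMY.items()
            for fine in fines]


def to_coarse(labels: Iterable[str]) -> Set[str]:
    """Map a set of fine-grained labels to the set of their coarse categories."""
    coarse: Set[str] = set()
    for label in labels:
        category, _, subtype = label.partition(": ")
        if subtype not in TAXONOMY.get(category, []):
            raise ValueError(f"unknown fine-grained label: {label!r}")
        coarse.add(category)
    return coarse
```

For instance, a snippet annotated with `Support: goods` and `Communication: meeting` maps to the coarse label set `{"Support", "Communication"}`, whereas two Reporting subtypes collapse into the single coarse label `Reporting`. This collapsing is why the average number of coarse-grained labels per snippet is lower than the fine-grained one.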
It is important to emphasize at this stage that, in a practical set-up, a different merging and subdivision of Miscellaneous might be more beneficial for ML modelling purposes; however, the main drive behind this subdivision was to explore how well the 3 different fine-grained classes can be distinguished. Furthermore, the Miscellaneous: Other category was deemed relevant from the end-user perspective, i.e. as a source of 'interesting' information.

2.2. Data Sampling

The input data for annotation was randomly sampled from news articles gathered by MEDISYS³, a large-scale health-related news aggregation engine [13], over a period spanning 2016-2021. Apart from conventional media sources, MEDISYS also monitors news on hundreds of official public health websites, such as ministry of health and public health agency websites.

³ https://medisys.newsbrief.eu/

10n(OR(economy, economic, economies, financial, unemployment, bankrupt, bankruptcy, unemployed), OR(pandemic, lockdown, disease, diseases, infection, infections, infectious, virus, viruses))

Figure 2: An example of a Solr query used to target articles of a specific category, in this case Impact: Economy. The query specifies that the word economy or one of its synonyms should be at most 10 tokens away from the word pandemic or a related term.

News articles were sourced using keywords, and snippets were then extracted from them by selecting up to the first 4 sentences contained within the first 500 characters of the article⁴. The rationale behind considering the initial part of news articles was the assumption of the inverted-pyramid style [14] of writing news articles, i.e. the most relevant events are placed at the beginning and the least important ones are left toward the end. First, news articles were randomly sourced using a list of circa 800 infectious disease names⁵, e.g. Covid-19, ebola, zika, malaria, etc., and relevant name variants and acronyms.
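The snippet extraction rule described above (at most four leading sentences, limited to the first 500 characters, without cutting a sentence in half) can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not say which sentence splitter was used, so a naive regular expression stands in for it, and the reading that a sentence starting inside the 500-character budget is kept whole is an interpretation of the remark that some snippets exceed 500 characters. The function names are hypothetical.

```python
import re
from typing import List

MAX_SENTENCES = 4   # limit stated in the paper
MAX_CHARS = 500     # limit stated in the paper

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# This is an assumption; it will mis-split abbreviations such as "e.g. ".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences with the naive boundary rule above."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def extract_snippet(article: str) -> str:
    """Return up to MAX_SENTENCES leading sentences of the article.

    A sentence is added only if the snippet built so far is still shorter
    than MAX_CHARS; once added it is kept whole, so the result may exceed
    MAX_CHARS in order to respect sentence boundaries.
    """
    snippet = ""
    for sentence in split_sentences(article)[:MAX_SENTENCES]:
        if len(snippet) >= MAX_CHARS:
            break
        snippet = f"{snippet} {sentence}".strip()
    return snippet
```

In practice a trained sentence splitter would be preferable to the regular expression, since news text is full of abbreviations, datelines and quoted speech that the naive rule handles poorly.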
Given that a large fraction of the text snippets acquired through this disease-name-based sampling fell under Miscellaneous, an additional document sampling was carried out for each category in order to populate the other classes of the taxonomy proportionally. It used a more 'focused' combination of keywords (including synonyms), which were required to occur within a specific text window anywhere in the body of a news article. An example of such a keyword query is provided in Figure 2. This improved the precision: ca. 50% of the fetched articles reported on events that fall into the non-Miscellaneous categories of the taxonomy. The potential bias that might have been introduced by the use of specific keywords is mitigated by extracting text snippets only from the beginning of articles; these snippets do not necessarily contain any of the keywords of the query and may instead use different wording to report on an event. In addition, circa 10% of the text snippets were manually selected from the news articles to make the corpus even more balanced.

2.3. Data Annotation

From the sampled text snippets described above, circa 4.5K were randomly selected for annotation. 7 annotators were involved in this process, all of whom had prior experience of annotating news texts. 2 of the annotators have a background in NLP and computational linguistics, whereas the 5 others were news analysts. Initially, circa 400 randomly selected snippets were annotated by 5 annotators, who subsequently jointly resolved the conflicts. The main motivation behind this part of the annotation process was to revise the event codebook comprising the event definitions, which turned out to be overlapping or incomplete to some degree. Next, the remainder of the snippets was annotated, each by at least 2 annotators.
Given that annotations are sets of labels, we computed strict and loose Cohen's κ: for the former, an agreement is counted only for identical label sets, whereas for the latter, any non-empty overlap of the label sets is counted as an agreement. The average strict and loose κ for a pair of annotators are 0.59 and 0.63, respectively. The conflict resolution in the annotations was jointly carried out by 2 to 4 annotators. An example of a news snippet annotated with two event labels is provided in Figure 3. Further examples are provided in Figure 7 in Annex A.

Amid rising vaccination rates across the European Union, the 27 EU leaders on Tuesday committed to collectively donate at least 100 million doses of Covid-19 vaccine to countries in need by the end of 2021. The bloc, which described itself in a joint statement signed off at summit.

Figure 3: An example of a text snippet annotated with two labels, namely Support: goods and Communication: meeting, with the corresponding phrases referring to the respective events underlined.

⁴ Some snippets are longer than 500 characters in order to respect sentence boundaries.
⁵ This list contains the diseases considered to be the most common public health threats; it was created for the MEDISYS platform for the purpose of retrieving relevant news articles.

2.4. Data Statistics

The corpus consists of 4441 text snippets, whose average length is 412 characters. The average number of fine- and coarse-grained labels per snippet is 1.26 and 1.19, respectively. Concerning fine-grained labels, circa 77.1% of the snippets have only one such label assigned to them, whereas the percentages of snippets with 2, 3 and 4 labels are 19.65%, 2.9% and 0.34%, respectively. The corpus is relatively well-balanced. The statistics for the coarse- and fine-grained labels are provided in Table 1. The columns labelled 'Co-occurrence' provide the percentage of instances of the given class that are labelled with at least one other label.
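The label statistics described above can be reproduced from any list of per-snippet label sets. The sketch below is illustrative only: it assumes the annotations are available as Python sets of label strings, and the function names are hypothetical rather than taken from the corpus tooling.

```python
from collections import Counter
from typing import Dict, List, Set


def mean_labels_per_snippet(annotations: List[Set[str]]) -> float:
    """Average number of labels assigned to a snippet."""
    return sum(len(labels) for labels in annotations) / len(annotations)


def label_count_distribution(annotations: List[Set[str]]) -> Dict[int, float]:
    """Percentage of snippets that carry exactly k labels, per observed k."""
    counts = Counter(len(labels) for labels in annotations)
    total = len(annotations)
    return {k: 100.0 * n / total for k, n in sorted(counts.items())}


def co_occurrence_rate(annotations: List[Set[str]]) -> Dict[str, float]:
    """Per label: percentage of its instances that have at least one other label.

    This corresponds to the 'Co-occurrence' column as described in the text.
    """
    occurrences: Counter = Counter()
    with_other: Counter = Counter()
    for labels in annotations:
        for label in labels:
            occurrences[label] += 1
            if len(labels) > 1:
                with_other[label] += 1
    return {label: 100.0 * with_other[label] / n
            for label, n in occurrences.items()}
```

As a consistency check on the reported figures, the fine-grained label distribution gives 0.771·1 + 0.1965·2 + 0.029·3 + 0.0034·4 ≈ 1.26 labels per snippet, which matches the stated average.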
While the co-occurrence percentage is at most 6.15% (Communication class) for the coarse-grained types, it reaches up to 20.56% (Impact: displacement of people class) for the fine-grained types. By definition, the snippets labelled with Miscellaneous do not co-occur with other labels. Figure 4 presents the text snippet length histogram. Table 2 provides a list of the most frequently co-occurring pairs of fine-grained event types, while the complete event co-occurrence matrix is shown in Figure 8 in Annex A. Interestingly, the two Reporting classes are the ones that co-occur most with other classes, and they co-occur most frequently with each other; Measure: Authority Regulation and Impact: Health System frequently co-occur with them as well. The other Measure classes tend to co-occur with Measure: Authority Regulation. Given that Covid-19 has triggered a vast amount of news articles over the last 3 years, a large part (more than 70%) of the snippets in the corpus are related to the Covid-19 pandemic.

3. Benchmark Models

We have evaluated two benchmark models, namely: (a) an L2-regularized linear SVM⁶ using the One-vs-the-Rest strategy, with log TF-IDF-weighted 3-5 character n-grams as features, using vector normalization and c = 0.2 resulting from parameter optimization, and (b) RoBERTa base [15], a transformer-based model, using a batch size of 32, a learning rate of 2−5 and 100 warm-up steps, with 5 training epochs.

⁶ We used the Liblinear implementation provided in the Scikit-learn library: https://scikit-learn.org/

Table 1
Corpus statistics: fine- and coarse-grained event labels. The co-occurrence statistics for the coarse-grained types refer to the co-occurrence with other coarse-grained types.
Event Type                              Number   Fraction   Co-occurrence
Reporting                                 1089     24.5%       3.31%
  Reporting cases                          614    13.83%      10.75%
  Reporting situation                      641    14.43%      11.23%
Impact                                     853     19.2%       4.34%
  Impact: displacement of people           107     2.41%      20.56%
  Impact: health system                    117     2.63%      13.68%
  Impact: economy                          346     7.79%       7.23%
  Impact: events                           157     3.54%       6.37%
  Impact: other                            178     4.01%       8.99%
Measure                                    987     22.2%       3.24%
  Measure: authority regulation            322     7.25%      16.77%
  Measure: facilities                      116     2.61%      11.21%
  Measure: travel                          137     3.08%      16.79%
  Measure: vaccine/medicine roll-out       387     8.71%       6.46%
  Measure: other                           100     2.25%       7.00%
Violation                                  378     8.51%       3.87%
  Violation: restrictions and unrest       127     2.86%      11.81%
  Violation: fake product or fraud         121     2.72%       7.44%
  Violation: misinformation                149     3.36%       5.37%
R&D                                        532     12.0%        1.5%
  R&D: medicine progress                   187     4.21%       2.67%
  R&D: phenomena                           272     6.12%       2.21%
  R&D: funding                              97     2.18%       2.06%
Communication                              358     8.06%       6.15%
  Communication: meeting                   258     5.81%       9.30%
  Communication: launch instrument         101     2.27%       4.95%
Support                                    293      6.6%       5.80%
  Support: financial                       189     4.26%       8.47%
  Support: goods                           113     2.54%       7.08%
Miscellaneous                              779     17.5%        0.0%
  Miscellaneous: other                     158     3.56%        0.0%
  Miscellaneous: unrelated                 508    11.44%        0.0%
  Miscellaneous: non events                115     2.59%        0.0%

Figure 4: Text snippet length histogram.

Table 2
Top co-occurring pairs of fine-grained event labels: (a) Count is the absolute number of co-occurrences of Type 1 with Type 2; (b) Fraction 1 is the count as a percentage of the total number of occurrences of event Type 1; (c) Fraction 2 is the count as a percentage of the total number of occurrences of event Type 2.

Event Type 1                          Event Type 2          Count   Fraction 1 (%)   Fraction 2 (%)
Reporting cases                       Reporting situation     166        27.0             25.9
Measure: Authority Regulation         Reporting situation      67        20.8             10.5
Measure: Authority Regulation         Reporting cases          48        14.9              7.8
Measure: Vaccine/Medicine Roll-out    Reporting situation      32         8.3              5.0
Communication: Meeting                Reporting situation      31        12.0              4.8
Impact: Economy                       Support: Financial       28         8.1             14.8
Impact: Health system                 Reporting situation      24        20.5              3.7
Measure: Authority Regulation         Measure: Travel          21         6.5             15.3
Impact: Economy                       Impact: Other            20         5.8             11.2
Measure: Authority Regulation         Measure: Facilities      20         6.2             17.2

For the purpose of evaluating these models, we use micro, macro, weighted and samples F1 scores, where the latter is computed as the average of the F1 scores computed for each pair of ground-truth and system-response label sets, one pair for each instance in the training data. 5-fold cross-validation was used. The overall results for both fine- and coarse-grained classification are provided in Table 3, whereas the per-class performance of the benchmark models for the fine- and coarse-grained scenarios is provided in Tables 4 and 5, respectively. For both models, the overall performance shows little variation between the F1 measures. RoBERTa performs better than the SVM in both the coarse- and the fine-grained classification scenario, with improvements of up to 9 and 13 points in the F1 measures, respectively.

Table 3
F1 scores for benchmark models for fine- and coarse-grained event classification.

            Fine-grained Event Types               Coarse-grained Event Types
Approach    Micro   Macro   Weighted   Samples     Micro   Macro   Weighted   Samples
SVM         0.60    0.56    0.59       0.55        0.69    0.68    0.69       0.67
RoBERTa     0.69    0.65    0.68       0.68        0.76    0.76    0.76       0.76

Table 4
F1 scores for benchmark models per class for the fine-grained event types.
Event Type                            SVM    RoBERTa
Reporting cases                       0.74   0.85
Reporting situation                   0.64   0.75
Impact: displacement of people        0.75   0.81
Impact: health system                 0.40   0.55
Impact: economy                       0.63   0.71
Impact: events                        0.64   0.83
Impact: other                         0.24   0.42
Measure: authority regulation         0.43   0.45
Measure: facilities                   0.55   0.70
Measure: travel                       0.60   0.79
Measure: vaccine/medicine roll-out    0.67   0.64
Measure: other                        0.21   0.24
Violation: restrictions and unrest    0.54   0.71
Violation: fake product or fraud      0.75   0.80
Violation: misinformation             0.71   0.64
R&D: medicine progress                0.55   0.58
R&D: phenomena                        0.59   0.72
R&D: funding                          0.64   0.81
Communication: meeting                0.68   0.76
Communication: launch instrument      0.59   0.70
Support: financial                    0.55   0.76
Support: goods                        0.49   0.63
Miscellaneous: other                  0.11   0.34
Miscellaneous: unrelated              0.70   0.50
Miscellaneous: non events             0.48   0.78

Table 5
F1 scores for benchmark models for coarse-grained event types.

Event Type       SVM    RoBERTa
Reporting        0.80   0.85
Impact           0.65   0.73
Measure          0.66   0.68
Violation        0.72   0.79
R&D              0.69   0.71
Communication    0.63   0.78
Support          0.58   0.70
Miscellaneous    0.68   0.73

As regards the models' performance on individual classes, one can observe that, for both SVM and RoBERTa, the three worst performing classes, namely Impact: Other, Measure: Other and Miscellaneous: Other, almost all have an F1 < 0.45 and reduce the global F1 scores. This behaviour might be linked to the more open and less focused nature of the definition of the Other classes. Studying the most common confusions between labels, in the cases where both the classifier output and the ground truth have only one label, shows (see Figure 5) that Miscellaneous has the most false positives and that the classes Impact, Measure and Research have more false positives than all the other classes.

4.
Conclusions

This paper briefly described the creation of a new corpus consisting of circa 4.5K news snippets (multi-)labelled with fine-grained infectious disease-related event types and reported on the performance of SVM- and transformer-based baseline models trained using the corpus. We intend to enlarge the corpus in the future, in particular using snippets that cover a wider range of diseases. The news event corpus, accompanied by the full-fledged codebook and annotation guidelines, is publicly available to the scientific community for research purposes at https://github.com/jpiskorski/infectious-diseases-events. All future extensions and updates of the corpus will be made available under the same link.

Figure 5: Confusion matrix for coarse-grained event types, considering only the snippets that have a single label both in the prediction and in the ground truth.

References

[1] P. Ghasiya, K. Okamura, Investigating covid-19 news across four nations: A topic modeling and sentiment analysis approach, IEEE Access 9 (2021) 36645–36656. doi:10.1109/ACCESS.2021.3062875.
[2] T. Marcoux, N. Agarwal, Narrative Trends of COVID-19 Misinformation, in: Proceedings of the 4th Workshop on Narrative Extraction From Texts (Text2Story 2021), held in conjunction with the 43rd European Conference on Information Retrieval (ECIR 2021), Association for Computational Linguistics, 2021, pp. 77–80.
[3] A. E. Varol, V. Kocaman, H. U. Haq, D. Talby, Understanding covid-19 news coverage using medical nlp, 2022.
[4] R. Chandra, A. Krishna, Covid-19 sentiment analysis via deep learning during the rise of novel cases, PLOS ONE 16 (2021) 1–26.
[5] S. Zong, A. Baheti, W. Xu, A. Ritter, Extracting a knowledge base of covid-19 events from social media, 2020. URL: https://arxiv.org/abs/2006.02567. doi:10.48550/ARXIV.2006.02567.
[6] C. Cheng, J. Barceló, A. Hartnett, R. Kubinec, L.
Messerschmidt, COVID-19 Government Response Event Dataset (CoronaNet v.1.0), Nat Hum Behav. 4 (2020) 756–768.
[7] B. Min, B. Rozonoyer, H. Qiu, A. Zamanian, N. Xue, J. MacBride, ExcavatorCovid: Extracting events and relations from text corpora for temporal and causal analysis for COVID-19, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 63–71.
[8] R. Misra, News category dataset, 2018. URL: https://www.kaggle.com/datasets/rmisra/news-category-dataset. doi:10.13140/RG.2.2.20331.18729.
[9] T. Nugent, F. Petroni, N. Raman, L. Carstens, J. L. Leidner, A comparison of classification models for natural disaster and critical event detection from news, in: 2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 3750–3759.
[10] E. Lefever, V. Hoste, A classification-based approach to economic event detection in Dutch news text, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 330–335. URL: https://aclanthology.org/L16-1051.
[11] J. Haneczok, G. Jacquet, J. Piskorski, N. Stefanovitch, Fine-grained event classification in news-like text snippets - shared task 2, CASE 2021, in: Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), Association for Computational Linguistics, Online, 2021, pp. 179–192.
[12] R. Saurí, J. Littman, B. Knippen, R. Gaizauskas, A. Setzer, J. Pustejovsky, TimeML annotation guidelines, https://www.researchgate.net/publication/248737128_TimeML_Annotation_Guidelines_Version_121, 2006.
[13] J. Linge, R. Steinberger, T. Weber, Internet surveillance systems for early alerting of health threats, Euro Surveill 14 (2009) 1–2.
[14] J.
Canavilhas, Web journalism: from the inverted pyramid to the tumbled pyramid, https://www.bocc.ubi.pt/pag/canavilhas-joao-inverted-pyramid.pdf, 2007.
[15] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).

A. Supplementary corpus information

The definition (in a simplified form) of the fine-grained event types related to infectious diseases is provided in Figure 6. Some examples of news snippets annotated with these labels are provided in Figure 7. Figure 8 presents the event type co-occurrence matrix.

Reporting cases: reporting on cases of infections, hospitalizations, deaths and recoveries of single persons and groups, and provision of updates thereon, covering a short time span and a specific location.
Reporting situation: provision of updates on the overall situation of the outbreak, current total figures, observed trends and forecasts, spanning a longer period of time; also covers cross-regional and cross-country comparisons.
Impact: Displacement of people: reporting on the movement of persons/groups that were forced or obliged to flee or leave, or that voluntarily fled or left, their homes or places of habitual residence as a consequence of the spread of the infectious disease and/or the introduction of measures to combat the disease. Bringing displaced people back to their place of origin falls under this category as well.
Impact: Health system: covers events related to the impact the disease has on the health-care system, e.g. deployment of additional staff, shortage of medical equipment, high bed occupancy rate, establishment of new facilities, etc.
Impact: Economy: covers events related to the impact on the economy, e.g., decline/growth of certain sectors, reducing/increasing production, gains/losses, unveiling studies on the analysis and prognosis of the economic situation.
Impact: Events: reporting on cancellation, postponement, and changing of modi operandi in the context of political, sport, cultural and other mass events, etc.
Impact: Other: reporting on other impacts of the disease, e.g., societal phenomena, political situation, future predictions, etc.
Measure: Authority Regulation/Recommendation: covers events related to the introduction of measures such as laws, formal regulations, restrictions, and recommendations by competent government authorities and international bodies which are specifically put in place to decrease the number of infected/affected people and thwart further spread of the disease.
Measure: Facilities: covers closures of facilities (e.g. schools, universities, museums, parks) resulting from regulations and/or situations, re-openings, changing related modi operandi, e.g. the introduction of teleworking, etc.
Measure: Travel: introduction of travel restrictions, recommendations, closure of borders, cancellation of flights, closure of airports, provision of specific transportation means to facilitate travel, etc.
Measure: Vaccine/Medicine Roll-out: covers events revolving around the roll-out of vaccines, medicines, equipment to combat the disease or mitigate the consequences, and includes also events related to sharing experience, measure hesitancy, anti-vax movements, etc.
Measure: Other: covers any other events related to measures, resulting from non-governmental organization decisions, private sector, e.g. linked to introduced laws and regulations.
Violation: Restrictions and Unrest: covers violations against introduced laws, regulations, measures and potential lockdowns, and protests against the introduced laws and measures.
Violation: Fake product or Fraud: covers events related to unveiling or warning on fake medicine or any counterfeits, falsified or substandard disease-related material/equipment being sold and/or distributed, and infectious disease-related fraud.
Violation: Misinformation: covers events related to revealing misinformation incidents and attempts, and issuing warnings about disease-related misinformation.
Research & Development: Medicine Progress: dissemination of information and updates on the progress of research and development of medicines, vaccines and equipment to combat and/or protect against infectious diseases.
Research & Development: Phenomena: reporting on research on specific phenomena observed in the context of infectious diseases and on findings which might potentially contribute to the development of medicines, vaccines, etc.
Research & Development: Funding: raising funding, launching programmes and resources for R&D of technologies and materials related to fighting infectious diseases.
Communication: Meeting: covers official meetings, conferences and press conferences of authorities, states, international organizations, task forces, experts, etc., to discuss topics related to (the outbreak of) infectious diseases.
Communication: Launch Instrument: reporting on new communication, information sharing and gathering instruments and methods related to infectious diseases, e.g. online platforms, databases, smartphone apps, etc.
Support: Financial: launching, proposing and elaborating financial instruments to support affected people, organizations, economy, etc., e.g. the introduction of changes in tax regulations to relieve the most vulnerable groups.
Support: Goods: providing affected people with goods, materials, and services to help and alleviate the problems resulting from the outbreak of the disease.
Miscellaneous: Other: a placeholder to capture other events related to infectious diseases which do not fall under any of the above categories, e.g. recruitment of new experts by a company that develops infectious disease-related vaccines.
Miscellaneous: Unrelated: covers events that are not related to infectious diseases in any way.
Miscellaneous: Non Events: covers texts that do not refer to any event that could be tied to a particular point in time, e.g. general descriptions of processes, etc.

Figure 6: Infectious Disease-related Event Taxonomy.

DUBAI, United Arab Emirates Dubai's Expo 2020 world's fair will be postponed to Oct. 1, 2021, over the new coronavirus pandemic, a Paris-based body behind the events said Monday. The announcement by the Bureau International des Expositions came just hours after police in Kuwait dispersed what they described as a riot by stranded Egyptians unable to return home amid the coronavirus pandemic. The riot was the first reported sign of unrest from the region's vast population of foreign workers who have lost their jobs over the crisis
EVENTS: Impact: Events, Impact: Displacement of people, Violation: restrictions and unrest

The World Health Organization (WHO) has confirmed the first three cases of Zika virus disease in India. Health Ministry officials said Sunday that the three patients in western Gujarat state had recovered. "There is no need to panic," Dr. Soumya Swaminathan, a top health ministry official, told reporters. The World Health Organization said in a statement released Friday that the three cases that India reported to the WHO on May 15 were detected through routine blood surveillance in a hospital in Ahmadabad, Gujarat's capital"
EVENTS: Reporting: cases

The Gates Foundation will give Rotary $255 million, with Rotary pledging to raise $100 million, and the UK and Germany contributing $150 million and $130 million respectively to the global initiative. It is the second such grant from the foundation to Rotary International - in 2007, it gave Rotary a $100 million grant for a polio eradication programme, which Rotary matched dollar for dollar. The new money will go to vaccination programmes, better disease surveillance and research on new vaccines.
EVENTS: Research: funding, Support: financial

Warsaw (dpa) - Czech Prime Minister Andrej Babis said on Sunday that he would like residents over the age of 60 to be able to register for a Covid-19 vaccination from March. The move would see the offer of vaccinations extended beyond the current priority groups of health care workers, nursing home residents and staff and all citizens aged over 80.
EVENTS: Measure: vaccine/medicine roll-out

Little air cleansers are digital gadgets that are utilized to tidy up the air by decreasing or removing interior toxins such as germs, odours, smoke and chemicals that could be hazardous to the wellness. These small air purifier cleansers have different types such as the HEPA air cleanser, ozone air cleanser, or the ionic air cleanser.
EVENTS: Miscellaneous: non event

RAI News 24 reports that, as of January 7, Italy will go back to the colour-coded system sub-dividing Regions on the basis of Covid-19 restrictions. The government will decide on the colour zone for each Region on the basis of 21 Covid-19-related criteria. However, the Regions are calling on the government to revise these criteria. Meanwhile, up until January 6, all of Italy will be in a red zone, meaning that bars and restaurants will stay closed
EVENTS: Measure: authority regulation, Measure: facilities

Figure 7: Examples of text snippets annotated with event labels. The text fragments triggering the respective events are underlined in blue.

Figure 8: Event type co-occurrence matrix.
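A pairwise co-occurrence matrix such as the one in Figure 8, together with the normalised fractions reported in Table 2, can be derived directly from the per-snippet label sets. The sketch below is only an illustration under the assumption that annotations are available as Python sets of label strings; the function names are hypothetical. Following the values in Table 2, the fractions are computed as percentages of the total number of occurrences of each label (e.g. 166 / 614 ≈ 27.0% and 166 / 641 ≈ 25.9% for the pair Reporting cases and Reporting situation).

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple


def co_occurrence_counts(annotations: List[Set[str]]) -> Tuple[Counter, Counter]:
    """Count label pairs appearing in the same snippet and single-label totals.

    Pairs are stored with their two labels in alphabetical order.
    """
    pair_counts: Counter = Counter()
    label_counts: Counter = Counter()
    for labels in annotations:
        label_counts.update(labels)
        for a, b in combinations(sorted(labels), 2):
            pair_counts[(a, b)] += 1
    return pair_counts, label_counts


def top_pairs(annotations: List[Set[str]],
              k: int = 10) -> List[Tuple[str, str, int, float, float]]:
    """Return the k most frequent pairs as (type1, type2, count, frac1, frac2).

    frac1 and frac2 are the count as a percentage of all occurrences of
    type1 and type2, respectively.
    """
    pair_counts, label_counts = co_occurrence_counts(annotations)
    rows = []
    for (a, b), n in pair_counts.most_common(k):
        rows.append((a, b, n,
                     100.0 * n / label_counts[a],
                     100.0 * n / label_counts[b]))
    return rows
```

Storing each pair in alphabetical order means the matrix is treated as symmetric; the order of Type 1 and Type 2 used in Table 2 would need to be imposed separately.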