=Paper= {{Paper |id=Vol-3823/4_sousa_recommending_126 |storemode=property |title=Recommending News Articles for Public Health Intelligence |pdfUrl=https://ceur-ws.org/Vol-3823/4_sousa_recommending_126.pdf |volume=Vol-3823 |authors=Diana F. Sousa,Nicolas Stefanovitch,Luigi Spagnolo |dblpUrl=https://dblp.org/rec/conf/healthrecsys/SousaSS24 }} ==Recommending News Articles for Public Health Intelligence== https://ceur-ws.org/Vol-3823/4_sousa_recommending_126.pdf
                         Recommending News Articles for Public Health
                         Intelligence
                         Diana F. Sousa1,* , Nicolas Stefanovitch1 and Luigi Spagnolo1
                         1
                             European Commission Joint Research Centre, Ispra, Italy


                                        Abstract
                                        Public Health Intelligence (PHI) is the process of extracting useful information from vast amounts of data to help
                                        quickly identify and respond to health threats. Systems that perform PHI are used daily by different national
                                        and international organizations. One of the most prominent platforms is the Epidemic Intelligence from Open
                                        Sources Initiative (EIOS) platform, which continuously gathers health-related news items. However, the EIOS
                                        platform requires users to sift through information unrelated to their domain or work needs, even when using
                                        different filtering options. This inefficiency in assessing the relevance of each article creates the need to develop
                                        a recommender system that effectively positions each incoming article according to its significance. In this work,
                                        we present the first iteration of this system, making use of previous user interactions with the articles already
                                        available in the platform and the articles’ content and metadata. We investigated various configurations to address
                                        the problem of data sparsity by conducting cluster-based harmonization. Our best-performing model reports an
                                        NDCG@K of 0.4108 and an F-measure@K of 0.7287 for 𝐾 = 100 articles.

                                        Keywords
                                        Public Health Intelligence, Recommender Systems, Clustering, User Data, Health News Articles




                         1. Introduction
                         Every day, expert analysts sift through tens of thousands of health news articles to identify incoming
                         health threats, such as an outbreak of a disease and other types of relevant health information regarding
                         humans, animals, and plants. To do their work, the analysts use platforms that primarily aim to gather
                         all news articles and reports on health topics. The Epidemic Intelligence from Open Sources (EIOS)
                         platform is the most well-known Public Health Intelligence (PHI) resource. EIOS is an international
                         initiative led by the World Health Organization (WHO) with a unified all-hazards One Health approach
                         to early detection, verification, assessment and communication of public health threats using publicly
                         available information1 .
                            The analysts working on identifying relevant health information for each of their purposes and
                         domains have to carry out their day-to-day work and often prepare for large mass gatherings, e.g.
                         sports championships or the Olympic Games, which present an increased risk of disease outbreaks.
                         Thus, analysts face the daily challenge of processing a high volume of information. EIOS collects 50,000
                         articles a day; as such, the possibility to organise information by relevance using a recommender system,
                         a feature currently missing in EIOS, would improve analysts’ experience by significantly alleviating the
                         time spent identifying which articles are relevant for their purpose.
                            Health recommender systems are broad and encompass epidemic forecasting tools such as HealthMap
                         [1] and EPIWATCH2 , which track disease spread by collecting information from various channels,
                         including news and social media [2]. In crises, these recommender systems are pivotal for effectively
                         allocating medical resources and guiding interventions. Moreover, they extend to environmental health
                         monitoring, offering air and water quality advice, and are integrated into Personal Health Records

                          HealthRecSys’24: The 6th Workshop on Health Recommender Systems co-located with ACM RecSys 2024
                         *
                           Corresponding author.
                          $ diana.francisco-de-sousa@ec.europa.eu (D. F. Sousa); nicolas.stefanovitch@ec.europa.eu (N. Stefanovitch);
                          luigi.spagnolo@ec.europa.eu (L. Spagnolo)
                           0000-0003-0597-9273 (D. F. Sousa); 0009-0000-2061-3216 (N. Stefanovitch); 0009-0008-0179-7468 (L. Spagnolo)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         1
                             https://www.who.int/initiatives/eios
                         2
                             https://www.epiwatch.org/

(PHRs) to suggest health actions [3, 4], such as vaccine recommendation features [5]. Lastly, health
applications employ these systems to promote personalized health-related behaviour [6, 7]. Despite
their potential, ensuring data privacy, system validation, multilingual adaptability, and ethical use is
paramount for maintaining public trust and successfully deploying recommender systems in public
health.
   To address the need for more efficient identification of relevant articles coming to the EIOS platform,
we created a content-based recommender system that is based on three data streams: (1) The content of
the article, specifically the first 1000 characters, taking into account complete sentences; (2) The event
type labels resulting from the application of a pandemics event classifier; (3) The user interactions with
each article (i.e., relevance score), obtained using a scoring function that considers the type and number
of interactions, augmented with a clustering procedure to tackle data sparsity. We tested XGBoost [8]
with seven different data augmentation procedures.
   The article’s main contributions are:

        • Use of event classifier labels to enrich the recommendation algorithm;
        • Introduction of a clustering-based approach for user activity harmonization to address data sparsity
          challenges;
        • Development of a content-based system for recommending articles in real-world PHI scenarios;
        • Error analysis conducted on example use cases to assess whether the recommender can flag
          relevant information missed by the users.

   The data described and used in this paper was sourced from a live system. As a result, Intellectual
Property and Privacy regulations apply, preventing dataset sharing. Nevertheless, the experiments
detailed in this article are significant for health recommender systems. They offer valuable insights into
implementing AI-based solutions using actual user data.
   Section 2 describes the data, mainly the metadata used to train the recommender system. Section
3 describes the cluster-based procedure to perform data harmonization and tackle sparsity. Section 4
presents the recommender system, including model and evaluation metrics. Section 5 presents results,
a discussion of the clustering plus recommendation pipeline, and an error analysis of the different
clustering modalities. Finally, Section 6 presents the main conclusions and future work.


2. Data
To train and test our model, we used a dataset of approximately 3.5 million articles from the EIOS
platform from 01/01/2018 to 09/06/2022 (about four years and six months). This dataset contains all
articles and information about user interactions with those articles in all the different languages captured
by the platform. For this work, which constitutes the first iteration to create a recommendation solution
for PHI systems, the features we focused on are the text of the article, the event labels generated through
an event classifier, and the user activity for each article (i.e., relevance score). Figure 1 illustrates the
high-level pipeline involving three input data streams in the recommender system.

2.1. Text
The dataset has the full text for each article. However, due to memory limitations and to keep the focus
on the core information of the article, we decided to consider only the first few sentences, up to 1000
characters.
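
The truncation step can be sketched as follows. This is a minimal illustration, not the platform's code: the function name, the punctuation-based sentence splitter, and the hard-cut fallback for a first sentence longer than the limit are all assumptions made for the example.

```python
import re


def truncate_to_sentences(text, max_chars=1000):
    """Keep leading complete sentences whose joined length fits in max_chars.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    A real multilingual pipeline would need a proper sentence splitter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    length = 0
    for sentence in sentences:
        # Account for the single space used to join sentences.
        extra = len(sentence) + (1 if kept else 0)
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra
    if not kept:
        # Assumed fallback: the first sentence alone exceeds the limit.
        return text.strip()[:max_chars]
    return " ".join(kept)
```

For instance, `truncate_to_sentences("One two. Three four! Five six?", 20)` keeps the first two sentences and drops the third, because adding it would exceed the limit.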
   To preprocess this truncated-article text, we only removed stop words from English articles. To
vectorise the articles, we used the TfidfVectorizer class from scikit-learn3 , with the maximum
document frequency set to ignore terms that have a document frequency strictly higher than 1.


3
    https://scikit-learn.org/
Figure 1: Recommender system high-level pipeline with the three data streams and expected output.


2.2. Event Labels
We assigned event labels to the articles to boost the system’s performance and better characterize
and differentiate between articles. We ran an event classifier for each article within the dataset to
classify them into one or more of 27 events following a taxonomy and pipeline created and developed
by Piskorski et al. [9]. Some of the most frequent labels are (1) Reporting Cases (i.e., reporting on cases
of infections, hospitalizations, deaths, recoveries of single persons and groups, provision of updates
thereon, which covers a short time span and specific location), (2) Reporting Situation (i.e., provision
of updates on the overall situation of the outbreak, current total figures, observed trends, forecast,
which spans longer period of time, and also covers cross-regional and cross-country comparisons),
(3) Measuring Vaccine/Medicine Roll-out (i.e., covers events revolving around the roll-out of vaccines,
medicines, equipment to combat the disease or mitigate the consequences, and includes also events
related to sharing experience, measure hesitancy, anti-vax movements, etc.). Other coarse-grain labels
are Impact, Violation, Research & Development, Communication, Support, and Miscellaneous.
   To preprocess these event labels, we applied MultiLabelBinarizer, given that each article
can have more than one label, wrapped to work with ColumnTransformer, both from the scikit-learn
library.

2.3. User Activity
The user activity for each article is pre-determined by the weighted sum of user interactions, which
we express as a relevance score. Different types of interactions yield different weights. The platform
computes the user activity using the weights presented in Table 1.

Table 1
Types of user activities and their corresponding weights.
                                 User Activity                     Weight
                                 Read Preview                       0 or 1
                                 Read Detail                        0 or 2
                                 Flag for Follow Up                      3
                                 Export to Report                        5
                                 Attach to Team Communication            5
                                 Comment                                 5
                                 Pin to Board                     Variable
   When it comes to the "Read Preview" interaction, the weight assigned will be zero if there are no
other user interactions on the article, and one otherwise (excluding "Read Detail"). For the "Read Detail"
interaction, the weight assigned will be zero if there are no other user interactions on the article, and
two otherwise (excluding "Read Preview"). As for the "Pin to Board" activity, the weight assigned is
five or ten, based on whether the board is private or public, respectively. The weights assigned to each
activity are proportional to the complexity of the activity being performed.
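
The scoring rules of Table 1 can be illustrated with a short sketch. The interaction names are hypothetical identifiers chosen for the example, and multiplying each weight by the number of occurrences is an assumption: the text states only that the score is a weighted sum that considers the type and number of interactions.

```python
from collections import Counter

# Weights from Table 1. "Pin to Board" is split by board visibility.
FIXED_WEIGHTS = {
    "flag_for_follow_up": 3,
    "export_to_report": 5,
    "attach_to_team_communication": 5,
    "comment": 5,
    "pin_to_private_board": 5,
    "pin_to_public_board": 10,
}
# Read interactions count only when another kind of interaction exists.
READ_WEIGHTS = {"read_preview": 1, "read_detail": 2}


def relevance_score(interactions):
    """Return the relevance score for a list of interaction names."""
    counts = Counter(interactions)
    unknown = set(counts) - set(FIXED_WEIGHTS) - set(READ_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown interaction types: {sorted(unknown)}")
    # Assumption: each occurrence contributes its full weight.
    score = sum(FIXED_WEIGHTS[name] * n
                for name, n in counts.items() if name in FIXED_WEIGHTS)
    # Reads weigh zero unless a non-read interaction is present.
    if any(name in FIXED_WEIGHTS for name in counts):
        score += sum(READ_WEIGHTS[name] * n
                     for name, n in counts.items() if name in READ_WEIGHTS)
    return score
```

Under these assumptions, an article that was only previewed and read in detail scores 0, while an article that was previewed and commented on scores 6.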
   One of the issues we had to address before the application of our system was the low proportion of
articles with user interactions (2.03%). The news feeds presented to users are ordered by time and user
preference settings (i.e., pre-determined keywords, languages, etc.). When a new story emerges, EIOS
users often interact with the first article reporting on the story, with the article they deemed to be from
the most reliable source, or even with the article that reports the story in their language, among other
preferences.
   This interaction pattern means that if we have a single story reported in multiple articles from
multiple sources, the user activity will vary widely among almost identical articles, with only a few
articles getting interacted with. Thus, raw user activity does not directly equate to user interests. In
the following section, we will outline how we intend to tackle this issue using clustering to make the
relevance score a reliable measure of user interest.


3. Cluster-based Harmonization
We considered that articles with no interaction are articles for which the relevance is unknown rather
than zero, transforming the problem into a semi-supervised learning one. We corrected the relevance
score of articles in clusters to deal with this and fall back on a supervised learning problem.
   The harmonization of user activity/relevance scores happens at the level of clusters of related articles,
some of which have an interaction score and others potentially none. We intended that the clusters
captured reports on the same event; as such, they were computed considering both the time and semantic
aspects. The clustered article data corresponds to the text described in the Data section. The entire
dataset was split into five-day chunks, capturing a story’s average duration, as represented in Figure
2. Inside a chunk, all the pairs of articles were compared using sentence embeddings, and the pairs
whose similarity was above a given threshold were put into a graph. The semantic similarity model
used was distiluse-base-multilingual-cased-v2, with a threshold of 0.90. Finally, the graphs
of all chunks were merged, and the set of connected components yielded the global set of clusters. This
approach is designed to be adaptable, allowing it to pick up news stories that last longer than five days
and preventing the merging of similar stories from widely different time spans.
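
A minimal sketch of this procedure is given below. It makes several assumptions that the text does not confirm: the chunks are overlapping windows that advance by a hypothetical `stride_days` parameter (which is what lets a story chain beyond five days), similarity is measured as cosine similarity, and the sentence embeddings are supplied as plain lists of floats rather than computed here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_articles(articles, window_days=5, stride_days=1, threshold=0.90):
    """Cluster (article_id, day_index, embedding) triples.

    Pairs in the same time window whose similarity exceeds the threshold
    become graph edges; clusters are the connected components of the
    merged graph, tracked here with a union-find structure.
    """
    parent = {article_id: article_id for article_id, _, _ in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    if not articles:
        return []
    first = min(day for _, day, _ in articles)
    last = max(day for _, day, _ in articles)
    for start in range(first, last + 1, stride_days):
        chunk = [a for a in articles if start <= a[1] < start + window_days]
        for (id_a, _, emb_a), (id_b, _, emb_b) in combinations(chunk, 2):
            if cosine(emb_a, emb_b) > threshold:
                root_a, root_b = find(id_a), find(id_b)
                if root_a != root_b:
                    parent[root_b] = root_a

    components = {}
    for article_id in parent:
        components.setdefault(find(article_id), []).append(article_id)
    # Singletons are not clusters.
    return [sorted(ids) for ids in components.values() if len(ids) > 1]
```

With overlapping windows, two articles six days apart can end up in the same cluster through an intermediate article, whereas a similar article published weeks later remains separate.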




Figure 2: Representation of five-day time-span local clusters in the timeframe considered.


  Once the clusters were computed, the second step of our procedure was to harmonize the score of all
the articles belonging to each cluster. To illustrate this, we will consider this example cluster of four
identical articles and their corresponding user activity scores:

    • Cluster: [Article 1, Article 2, Article 3, Article 4]
    • User Activities: [0, 5, 17, 0]
Table 2
Counts of the number of articles per modality, their percentage with user activity, and the threshold used for
predictions.
                         Modality    Number of Articles    User Activity (%)   Threshold
                         Original         3 589 739            2.03            9
                         Sum              3 589 739            2.25           10
                         High             3 589 739            2.25            9
                         AVG              3 589 739            2.22            9
                         Low              3 589 739            2.20            9
                         Random           3 589 739            2.22            9
                         Discard          3 288 085            2.22            9
                         Null             3 287 754            2.20            9


   Clusters containing articles with only zero relevance are left untouched, except for the Null configuration,
detailed below. Clusters with mixed or only positive relevance were further processed to reassign
the relevance score of every article within that cluster. We considered seven different modalities to
perform the harmonization, which are illustrated in the following example:

    • Original: Nothing changes → [0, 5, 17, 0].
    • Sum: Application of the sum of all user activities in the cluster to all the articles in the cluster →
      [22, 22, 22, 22].
    • High: Application of the highest user activity in the cluster to all the articles in the cluster →
      [17, 17, 17, 17].
    • Average: Application of the average of all user activities computed by dividing the sum of all
      user activities by the number of articles in the cluster → [5.5, 5.5, 5.5, 5.5].
    • Low: Application of the lowest non-zero user activity in the cluster to all the articles in the cluster → [5, 5,
      5, 5].
    • Random: To each cluster, application of a random configuration from the ones described above
      → [22, 22, 22, 22] or [17, 17, 17, 17] or [5.5, 5.5, 5.5, 5.5] or [5, 5, 5, 5].
    • Discard: Keep only articles in the cluster that have user activity → [5, 17].
    • Null: Remove clusters where there is no article with user activity → [0, 5, 17, 0].

   The Discard and Null modalities are filtering options: they do not modify the relevance score but
exclude articles with no score, using different approaches. For Discard, in clusters with at least one
relevant article, all articles without user activity are removed. For Null, all clusters in which
every article has a zero relevance score are removed.
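
The modalities can be summarised in a short sketch that reproduces the example cluster [0, 5, 17, 0]. The function and modality names are chosen for illustration. Taking the lowest non-zero score for Low follows the example output above, and applying Discard only to clusters with at least one positive score follows the description of the filtering options.

```python
import random

RANDOM_CHOICES = ("sum", "high", "average", "low")


def harmonize(scores, modality, rng=random):
    """Return the harmonized relevance scores of one cluster."""
    scores = list(scores)
    positives = [s for s in scores if s > 0]
    if not positives:
        # Only-zero clusters: removed under Null, untouched otherwise.
        return [] if modality == "null" else scores
    if modality in ("original", "null"):
        return scores
    if modality == "discard":
        return positives
    if modality == "random":
        modality = rng.choice(RANDOM_CHOICES)
    if modality == "sum":
        value = sum(scores)
    elif modality == "high":
        value = max(scores)
    elif modality == "average":
        value = sum(scores) / len(scores)
    elif modality == "low":
        value = min(positives)  # lowest non-zero score, as in the example
    else:
        raise ValueError(f"unknown modality: {modality}")
    return [value] * len(scores)
```

For example, `harmonize([0, 5, 17, 0], "sum")` assigns 22 to all four articles, and `harmonize([0, 5, 17, 0], "discard")` keeps only the scores 5 and 17.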
   Table 2 shows the small increase in the overall percentage of articles with user activity for each modality
compared to Original, reflecting our extremely conservative clustering procedure. The Threshold column is the user activity
value considered at the recommendation level to decide if an article should be recommended. We
obtained this value by considering the average of the positive (> 0) user activities for each modality.
Figure 3 reports the histogram of the user activity/relevance score of articles comparing the distribution
of all the original data and the clustered articles’ distribution of the sum modality, presenting similar
profiles.


4. Recommender System
The data available does not specify which users interacted with the articles; it only shows the overall
user activity for each article. Therefore, recommendations are not based on individual user behaviour
but on global preferences towards specific topics and domains, making adopting a collaborative filtering
approach unfeasible.
Figure 3: Histogram of relevance score distribution excluding non-relevant articles: original for all articles (left)
and with sum harmonization (right).


4.1. Model
In this approach, each row of our data represented an article with a relevance score corresponding to
the weighted sum of user interactions with the article. As stated in the previous sections, the features
considered for training were the article attributes: a text section at the beginning of the article, the event
labels that report on the article classification, and the relevance scores. Our goal was to recommend
articles with higher engagement that are, therefore, more relevant.
   We divided our data into training (80%) and testing (20%) with a 5-fold cross-validation. For the
training data, we used an XGBoost regression model [8]. This model learns to predict each article’s
user engagement by building a series of decision trees sequentially, using gradient descent to minimize
the loss. We did not do hyperparameter tuning, leaving the default parameters stated in the package
documentation4 , to avoid overfitting the model to our data and maintain its generalizability to new
data.

4.2. Evaluation Metrics
The evaluation metrics considered for the different settings were the following:

       • RMSE: Root mean square error (RMSE) or root mean square deviation is one of the most commonly
         used measures for evaluating the quality of predictions. It shows how far predictions fall from
         measured true values using Euclidean distance.
       • NDCG@K: Normalized Discounted Cumulative Gain (NDCG) considers both the relevance and
         the position of items in the ranked list in the top K items.
       • Precision@K: Precision at K measures the proportion of relevant items among the top K items.
       • Recall@K: Recall at K measures the coverage of relevant items in the top K items.
       • F-measure@K: The harmonic mean of precision and recall, providing a balanced metric in the top K items.

   We considered 5, 10, 15, and 100 items for K. For Precision, Recall, and F-measure, since the values
considered are binary, we present only the 𝐾 = 100 configuration to reflect better the real user needs
in our setting.
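
The ranking metrics can be sketched with their standard textbook definitions. The exact formulation behind the reported numbers is not given in the text, so the linear gain in NDCG, the binarisation rule (an article counts as relevant when its true score is at least the threshold), and the function names are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevances (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(y_true, y_pred, k):
    """NDCG@K: DCG of the predicted ranking divided by the ideal DCG."""
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    ranked = [y_true[i] for i in order]
    ideal = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0


def precision_recall_f_at_k(y_true, y_pred, k, threshold):
    """Precision, recall and F-measure over the top k predicted articles."""
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    top = order[:k]
    # Assumed binarisation: relevant means true score >= threshold.
    hits = sum(1 for i in top if y_true[i] >= threshold)
    total_relevant = sum(1 for score in y_true if score >= threshold)
    precision = hits / len(top) if top else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure
```

Here `y_true` holds the (harmonized) relevance scores and `y_pred` the scores predicted by the regression model; a perfect ranking yields an NDCG@K of 1.0.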

4
    https://xgboost.readthedocs.io/en/stable/parameter.html
Figure 4: Heatmap of cluster size versus cluster span


Table 3
Statistics over clusters characteristics with different distributions.
                         Only 0    Only 𝑝𝑜𝑠    More 0    More 𝑝𝑜𝑠    Eq. prop.
          Count          124332    829         1188      182         3798
          Max size       177       6           180       21          10
          AVG size       2.3       2.1         5.3       3.3         2.0
          Max span       133       7           244       28          12
          AVG span       1.9       1.8         4.5       3.0         1.8
          Max peak       25        6           11        4           4
          % total rel.   0.00      0.24        0.18      0.06        0.52


5. Results and Discussion
This section presents the main results regarding all modalities and discusses the model’s successes and
potential limitations given the simplified approach.

5.1. Local Clusters Distribution
The settings used for clustering were conservative, as clustering was performed on relatively long text with a
high threshold. In total, 8.7% of the articles were clustered. The data revealed a predominant pattern
of small clusters, with 81% having a size of 2 and 99% under size 7. These clusters also tend to be
short-lived, with 49% lasting a single day and 99% up to 8 days. The manual review confirms that articles
in these clusters are remarkably similar, often being near-perfect duplicates. Notably, the clusters with
the longest lifespan appear to be populated by automatically generated reporting articles.
   In Figure 4, we plotted the distribution of cluster size and the distribution of the span of the cluster
in days; some outliers fall outside the limits of the figure and are not shown. A cluster’s median size
was two articles, and the median span was two days. Table 3 reports several statistics over the clusters,
grouping them based on whether the relevance scores of their articles are only 0, only positive (𝑝𝑜𝑠),
mostly 0, mostly 𝑝𝑜𝑠, or 0 and 𝑝𝑜𝑠 in equal proportion. We report the mean and maximum cluster size and
span, the maximal peak article count, and the proportion of the total relevance. We can observe
that clusters attracting most of the relevance tend to be relatively small and short-lived.

5.2. Modality Performance
Table 4 presents the results of the comparison of different clustering modalities for user data augmentation
using the RMSE, NDCG@K, Precision@K, Recall@K and F-measure@K metrics, taking into account
5-fold cross-validation.

   Table 4
   Comparison of different clustering modalities for user data augmentation using the RMSE, NDCG@K,
   Precision@K, Recall@K and F-measure@K metrics.
   Modality   RMSE     NDCG@5   NDCG@10  NDCG@15  NDCG@100  Precision@100  Recall@100  F-measure@100
   Original   1.5739   0.1903   0.1807   0.1606   0.1749    0.3466         1.0000      0.5136
   Sum        1.7362   0.4946   0.4382   0.4137   0.4108    0.5740         1.0000      0.7287
   High       1.6591   0.1722   0.2283   0.2318   0.2537    0.4320         1.0000      0.6015
   AVG        1.5313   0.1612   0.1575   0.1559   0.1734    0.3478         0.9946      0.5137
   Low        1.6188   0.1767   0.1950   0.1903   0.2201    0.4060         1.0000      0.5762
   Random     1.6380   0.1622   0.2015   0.2023   0.2516    0.4440         1.0000      0.6139
   Discard    1.6381   0.1516   0.1518   0.1377   0.1802    0.3880         1.0000      0.5575
   Null       1.6318   0.1861   0.1884   0.1855   0.1834    0.3720         1.0000      0.5420


   Table 5
   Comparison of different clustering modalities with user data augmentation performance on the original
   test set (non-augmented) using the RMSE, NDCG@K, Precision@K, Recall@K, and F-measure@K
   metrics.
   Modality   RMSE     NDCG@5   NDCG@10  NDCG@15  NDCG@100  Precision@100  Recall@100  F-measure@100
   Sum        1.5794   0.2259   0.1878   0.1822   0.2014    0.3640         1.0000      0.5330
   High       1.5718   0.1678   0.2052   0.1828   0.1864    0.3280         1.0000      0.4929
   AVG        1.5732   0.1599   0.1542   0.1531   0.1660    0.3417         0.9944      0.5073
   Low        1.5715   0.1535   0.1678   0.1627   0.1882    0.3520         1.0000      0.5195
   Random     1.5719   0.1369   0.1627   0.1634   0.1966    0.3880         1.0000      0.5578
   Discard    1.5752   0.1516   0.1518   0.1377   0.1548    0.3240         1.0000      0.4886
   Null       1.5741   0.1861   0.1884   0.1788   0.1728    0.3460         1.0000      0.5139

   Most modalities surpass the Original configuration. However, when considering NDCG@100, only
Sum, High, Low, and Random perform distinctly better than the Original, with Sum being significantly
better. Sum performs more than twice as well as the Original, which raises the possibility that the actual
user interest in a story is best represented by the sum of the interactions across all identical articles.
   Table 5 showcases the same procedure but using the Original modality test set. In this setting, the
superior performance of the Sum modality is not as noticeable, but all modalities, except AVG, Discard,
and Null, perform better than Original. A possible justification for this behaviour could be that our
system performs better with more data regardless of how it is labelled, hindering the performance of
Null and Discard modalities. Additionally, the AVG configuration could make stronger and weaker
signals less noticeable, diluting their relative importance in a ranking setting.

5.3. Error Analysis
Table A1 (Appendix) showcases the false positives found across the five rounds of cross-validation
for the different modalities at the top five (𝐾 = 5). All modalities introduce errors compared to the
Original, with Sum and High introducing fewer wrong articles as also reflected in Table 4.
   We analysed the articles with a fail rate of at least 7/8 modalities to interpret what could
have made most modalities assign relevance. We then analysed whether it was indeed a failure by our
models or if it could have been a relevant article missed by the users and/or the clustering procedure
for data augmentation. This selection resulted in six articles represented in Table 6 and marked with an
asterisk (*) in Table A1 (Appendix). Table 7 reports on the details of these articles.

    Table 6
    Most frequent articles and their scores across all modalities represented in the Top five (𝐾 = 5) false
    positives (fail rate ⩾ 7/8).
    Article   Original  Sum    High    AVG    Low    Random    Discard    Null
    1757749   51        46     34      41     46               51         41
    210702    52        46     46      52     52     46        52         52
    1177084   36               36      36     36     36        36         39
    1642976   42               42      42     42     42        42         42
    2083725   33               45      39     46     46        46         46
    458168    68        64     64      68     68     64        68         68


    Table 7
    Most frequent articles across all modalities represented in the Top five (𝐾 = 5) false positives, their
    event labels, and the general topic they discuss (fail rate ⩾ 7/8).
    Article   Event Labels                            General Topic
    1757749   REPORTING-CASES                         First-case reporting on coronavirus in Africa
    210702    MISCELLANEOUS-UNRELATED                 Paediatric acute hepatitis reporting in the UK
    1177084   IMPACT-OTHER                            Coronavirus impact on the industry in India
    1642976   MISCELLANEOUS-OTHER                     Political landscape in Haiti
    2083725   REPORTING-CASES                         Coronavirus cases reporting in France
    458168    REPORTING-SITUATION, REPORTING-CASES    Vaccination fears and chickenpox cases in Angola


   Even though Table 7 does not report on the sources for the articles, all of these are pieces that
primarily reflect the general opinion of an isolated expert of the respective fields and not official sources
from health organisations, such as the WHO. So, even if the articles’ domain and general topic might
be relevant, analysts can avoid the article for not being factually about what is happening but more
of a reflection on what has been happening throughout a specific outbreak, such as in article 458168.
In this article, an expert demonstrates how vaccination fears are at fault for rising chickenpox cases
in Angola. If other sources are already monitoring the number of cases, this piece can be overlooked
because it is primarily about cause rather than consequence. Nevertheless, we believe this article and
similar articles can indicate the worsening of ongoing outbreaks. As such, these shouldn’t be ignored
but used as indicators to flag future similar events pre-emptively.


6. Conclusion and Future Work
This article presented the first step in developing a recommendation system for a pre-existing platform,
EIOS, developed for PHI. Therefore, the results and analysis are not yet complete. However, this
work successfully showcases a pipeline for developing a content-based system recommending articles
in real-world PHI scenarios. It introduces a clustering-based approach to tackle data sparsity and the
use of event classifier labels to enrich the recommender algorithm. While more complex metadata
and advanced models and approaches are available and will be used in the future, this first attempt
successfully demonstrated a way of dealing with data sparsity for our case study, which in turn improved
the model performance from an NDCG@K of 0.1749 to 0.4108, at 𝐾 = 100, for the Sum cluster-based
harmonization modality.
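
As an illustration of the ranking metric reported above (normalized discounted cumulative gain at a cutoff 𝐾), the following is a minimal sketch and not the evaluation code used in this work. It assumes the common formulation with linear gain and a log2 rank discount; the function names and the toy relevance list are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items of a ranked list.

    Assumption: linear gain and a log2(rank + 1) discount with 1-based ranks.
    """
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, so rank + 1 == position + 2
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical toy example: relevance of articles in the order the model ranked them
# (1 = the analyst interacted with the article, 0 = no interaction).
ranked_relevance = [0, 1, 0, 1, 1]
score = ndcg_at_k(ranked_relevance, k=5)
```

Under this formulation, a ranking that places every relevant article first scores 1.0, and a list without any relevant article scores 0.0, so higher values mean relevant articles appear nearer the top of the list.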
   Looking ahead, we plan to further develop this approach by considering multiple users, article
sources, and other types of article metadata, and by exploring the combination of clustering modalities
and filters. Additionally, we aim to involve analysts in our approach to evaluate performance with actual
end-users, thereby enhancing the robustness and applicability of our system.
References
[1] C. C. Freifeld, K. D. Mandl, B. Y. Reis, J. S. Brownstein, Healthmap: global infectious disease
    monitoring through automated classification and visualization of internet media reports, Journal of
    the American Medical Informatics Association 15 (2008) 150–157. doi:10.1197/jamia.M2544.
[2] J. S. Brownstein, C. C. Freifeld, L. C. Madoff, Digital disease detection—harnessing the web for
    public health surveillance, The New England Journal of Medicine 360 (2009) 2153.
    doi:10.1056/NEJMp0900702.
[3] J. M. Balbus, R. Barouki, L. S. Birnbaum, R. A. Etzel, P. D. Gluckman, P. Grandjean, C. Hancock,
    M. A. Hanson, J. J. Heindel, K. Hoffman, et al., Early-life prevention of non-communicable diseases,
    The Lancet 381 (2013) 3–4. doi:10.1016/S0140-6736(12)61609-2.
[4] H. Schäfer, S. Hors-Fraile, R. P. Karumur, A. Calero Valdez, A. Said, H. Torkamaan, T. Ulmer,
    C. Trattner, Towards health (aware) recommender systems, in: Proceedings of the 2017 International
    Conference on Digital Health, DH ’17, Association for Computing Machinery, New York, NY, USA,
    2017, p. 157–161. doi:10.1145/3079452.3079499.
[5] M. R. Pereira, Updated 2024 US vaccine recommendations from the advisory committee on
    immunization practices, American Journal of Transplantation 24 (2024) 514–516.
    doi:10.1016/j.ajt.2024.02.012.
[6] W. T. Riley, D. E. Rivera, A. A. Atienza, W. Nilsen, S. M. Allison, R. Mermelstein, Health behavior
    models in the age of mobile interventions: are our theories up to the task?, Translational behavioral
    medicine 1 (2011) 53–71. doi:10.1007/s13142-011-0021-7.
[7] H. Torkamaan, J. Ziegler, Recommendations as challenges: Estimating required effort and user
    ability for health behavior change recommendations, in: Proceedings of the 27th International
    Conference on Intelligent User Interfaces, IUI ’22, Association for Computing Machinery, New York,
    NY, USA, 2022, p. 106–119. doi:10.1145/3490099.3511118.
[8] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM
    SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16,
    Association for Computing Machinery, New York, NY, USA, 2016, p. 785–794.
    doi:10.1145/2939672.2939785.
[9] J. Piskorski, N. Stefanovitch, J. P. Linge, S. Kharazi, J. Mantero, G. Jacquet, A. Spadaro, G. Teodori,
    Multi-label infectious disease news event corpus, in: Proceedings of the Text2Story’23 Workshop,
    Elsevier, Dublin, Republic of Ireland, 2023, pp. 171–183.
A. Error Analysis with False Positives

  Table A1
  Top five (𝐾 = 5) false positives and their predicted scores across the five rounds of cross-validation for
  the different modalities.
               Article    Original   Sum     High    AVG     Low    Random      Discard    Null
             1757749*        51        46     34      41      46                   51       41
             2113787         42               37      50      38        37         45
             2355188         42               37      50      38        37         45
             1524234         34
              164429         55        49     55      42      42        41
             210702*         52        46     46      52      52        46         52       52
             2713654         50                       51                45
             1177084*        36               36      36      36        36         36       39
             1642976*        42               42      42      42        42         42       42
             1463715         36        48             32      48        36
             2083725*        33               45      39      46        46         46       46
              462637         38                       37      32                            38
             1388337         30                       33
             458168*         68        64     64      68      68        64         68       68
             2924180         40                       40      40        40         40       40
             1450025                   44     32
             2671862                   50     50              40        58
              455925                   36
              857479                   49
             1418055                          32                        32
             1152620                          32                        31
              185204                          43      37      36        43
             3105707                                          30
             1255976                                                    43
              471697                                                               27
              390951                                                               30
             2724292                                                               30
             2274914                                                               38       38
              727629                                                               42
              416240                                                                        66
             2398504                                                                        33
              979659                                                                        37
             2163701                                                                        38
             1133939                                                                        40