<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Recommending News Articles for Public Health Intelligence</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Diana</forename><forename type="middle">F</forename><surname>Sousa</surname></persName>
							<email>de-sousa@ec.europa.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">European Commission Joint Research Centre</orgName>
								<address>
									<settlement>Ispra</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicolas</forename><surname>Stefanovitch</surname></persName>
							<email>nicolas.stefanovitch@ec.europa.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">European Commission Joint Research Centre</orgName>
								<address>
									<settlement>Ispra</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luigi</forename><surname>Spagnolo</surname></persName>
							<email>luigi.spagnolo@ec.europa.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">European Commission Joint Research Centre</orgName>
								<address>
									<settlement>Ispra</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Recommending News Articles for Public Health Intelligence</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">33E43D829CD28A9C4FE621D1EAB6FC1E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Public Health Intelligence</term>
					<term>Recommender Systems</term>
					<term>Clustering</term>
					<term>User Data</term>
					<term>Health News Articles</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Public Health Intelligence (PHI) is the process of extracting useful information from vast amounts of data to help quickly identify and respond to health threats. Systems that perform PHI are used daily by different national and international organizations. One of the most prominent platforms is the Epidemic Intelligence from Open Sources Initiative (EIOS) platform, which continuously gathers health-related news items. However, the EIOS platform requires users to sift through information unrelated to their domain or work needs, even when using different filtering options. This inefficiency in assessing the relevance of each article creates the need for a recommender system that effectively ranks each incoming article according to its significance. In this work, we present the first iteration of this system, making use of previous user interactions with the articles already available in the platform and the articles' content and metadata. We investigated various configurations to address the problem of data sparsity by conducting cluster-based harmonization. Our best-performing model reports an NDCG@K of 0.4108 and an F-measure@K of 0.7287 for 𝐾 = 100 articles.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Every day, expert analysts sift through tens of thousands of health news articles to identify incoming health threats, such as disease outbreaks, and other types of relevant health information regarding humans, animals, and plants. To do their work, the analysts use platforms that primarily aim to gather all news articles and reports on health topics. The Epidemic Intelligence from Open Sources (EIOS) platform is the most well-known Public Health Intelligence (PHI) resource. EIOS is an international initiative led by the World Health Organization (WHO) with a unified all-hazards One Health approach to early detection, verification, assessment and communication of public health threats using publicly available information.</p><p>The analysts working on identifying relevant health information for each of their purposes and domains have to carry out their day-to-day work and often prepare for large mass gatherings, e.g. sports championships or the Olympic Games, which present an increased risk of disease outbreaks. Thus, analysts face the daily challenge of processing a high volume of information. EIOS collects 50,000 articles a day; as such, the ability to organise information by relevance using a recommender system, a feature currently missing in EIOS, would improve analysts' experience by significantly reducing the time spent identifying which articles are relevant for their purpose.</p><p>Health recommender systems are broad and encompass epidemic forecasting tools such as HealthMap <ref type="bibr" target="#b0">[1]</ref> and EPIWATCH, which track disease spread by collecting information from various channels, including news and social media <ref type="bibr" target="#b1">[2]</ref>. In crises, these recommender systems are pivotal for effectively allocating medical resources and guiding interventions. 
Moreover, they extend to environmental health monitoring, offering air and water quality advice, and are integrated into Personal Health Records (PHRs) to suggest health actions <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>, such as vaccine recommendation features <ref type="bibr" target="#b4">[5]</ref>. Lastly, health applications employ these systems to promote personalized health-related behaviour <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. Despite their potential, ensuring data privacy, system validation, multilingual adaptability, and ethical use is paramount for maintaining public trust and successfully deploying recommender systems in public health.</p><p>To address the need for more efficient identification of relevant articles coming to the EIOS platform, we created a content-based recommender system that is based on three data streams: (1) The content of the article, specifically the first 1000 characters, taking into account complete sentences; (2) The event type labels resulting from the application of a pandemics event classifier; (3) The user interactions with each article (i.e., relevance score), obtained using a scoring function that considers the type and number of interactions, augmented with a clustering procedure to tackle data sparsity. 
We tested XGBoost <ref type="bibr" target="#b7">[8]</ref> with seven different data augmentation procedures.</p><p>The article's main contributions are:</p><p>• Use of event classifier labels to enrich the recommendation algorithm;</p><p>• Introduction of a clustering-based approach for user activity harmonization to address data sparsity challenges;</p><p>• Development of a content-based system for recommending articles in real-world PHI scenarios;</p><p>• Error analysis conducted on example use cases to assess whether the recommender can flag relevant information missed by the users.</p><p>The data described and used in this paper was sourced from a live system. As a result, Intellectual Property and Privacy regulations apply, preventing dataset sharing. Nevertheless, the experiments detailed in this article are significant for health recommender systems. They offer valuable insights into implementing AI-based solutions using actual user data.</p><p>Section 2 describes the data, mainly the metadata used to train the recommender system. Section 3 describes the cluster-based procedure to perform data harmonization and tackle sparsity. Section 4 presents the recommender system, including model and evaluation metrics. Section 5 presents results, a discussion of the clustering plus recommendation pipeline, and an error analysis of the different clustering modalities. Finally, Section 6 presents the main conclusions and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data</head><p>To train and test our model, we used a dataset of approximately 3.5 million articles from the EIOS platform from 01/01/2018 to 09/06/2022 (about four years and six months). This dataset contains all articles and information about user interactions with those articles in all the different languages captured by the platform. For this work, which constitutes the first iteration to create a recommendation solution for PHI systems, the features we focused on are the text of the article, the event labels generated through an event classifier, and the user activity for each article (i.e., relevance score). Figure <ref type="figure" target="#fig_0">1</ref> illustrates the high-level pipeline involving three input data streams in the recommender system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Text</head><p>The dataset has the full text for each article. However, due to memory limitations and to keep the focus on the core information of the article, we decided to consider only the first few complete sentences, up to 1000 characters.</p><p>To preprocess this truncated article text, we only removed stop words from English articles. To vectorise the articles, we used the TfidfVectorizer class from the scikit-learn library<ref type="foot" target="#foot_0">3</ref>, with the maximum document frequency set to ignore terms that have a document frequency strictly higher than 1. </p></div>
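<div xmlns="http://www.tei-c.org/ns/1.0"><p>The truncation step described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function name truncate_to_sentences and the naive punctuation-based sentence splitter are assumptions, and the resulting strings would then be passed on to the TF-IDF vectoriser.</p>
<p>
```python
import re


def truncate_to_sentences(text, max_chars=1000):
    """Keep the leading complete sentences that fit within max_chars.

    Assumption: a sentence is any run of text ending in '.', '!' or '?'.
    This naive rule also splits on decimals and abbreviations.
    """
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]+", text)]
    kept = []
    length = 0
    for sentence in sentences:
        # One extra character for the space that joins two sentences.
        extra = len(sentence) + (1 if kept else 0)
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra
    return " ".join(kept)


example = "Cholera cases rise in the region. Officials confirm ten deaths! More to follow"
print(truncate_to_sentences(example, max_chars=40))
```
</p><p>With a 40-character budget only the first sentence is kept, and the unterminated trailing fragment is always dropped, so the output never ends mid-sentence.</p></div>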
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Event Labels</head><p>We assigned event labels to the articles to boost the system's performance and better characterize and differentiate between articles. We ran an event classifier on each article within the dataset to classify it into one or more of 27 events following a taxonomy and pipeline created and developed by Piskorski et al. <ref type="bibr" target="#b8">[9]</ref>. Some of the most frequent labels are (1) Reporting Cases (i.e., reporting on cases of infections, hospitalizations, deaths, recoveries of single persons and groups, provision of updates thereon, which covers a short time span and specific location), (2) Reporting Situation (i.e., provision of updates on the overall situation of the outbreak, current total figures, observed trends, forecast, which spans a longer period of time, and also covers cross-regional and cross-country comparisons), (3) Measuring Vaccine/Medicine Roll-out (i.e., covers events revolving around the roll-out of vaccines, medicines, equipment to combat the disease or mitigate the consequences, and also includes events related to sharing experience, measure hesitancy, anti-vax movements, etc.). Other coarse-grained labels are Impact, Violation, Research &amp; Development, Communication, Support, and Miscellaneous.</p><p>To preprocess these event labels, we applied the MultiLabelBinarizer class from the scikit-learn library, given that each article can have more than one label, wrapped so that it works with scikit-learn's ColumnTransformer.</p></div>
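<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the label encoding concrete, the sketch below reproduces in plain Python the kind of multi-hot matrix that a multi-label binarizer yields: one column per label, one row per article. The helper name multi_hot and the three-article input are hypothetical and serve only as an illustration.</p>
<p>
```python
def multi_hot(label_sets):
    """Encode a list of label collections as rows of 0/1 indicators."""
    # Columns follow the sorted order of all labels seen in the data.
    classes = sorted({label for labels in label_sets for label in labels})
    index = {label: position for position, label in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return classes, rows


classes, rows = multi_hot([
    ["Reporting Cases", "Reporting Situation"],  # article with two labels
    ["Impact"],                                  # article with one label
    [],                                          # article with no label
])
print(classes)
print(rows)
```
</p><p>An article with several labels simply has several ones in its row, which is why a single-label encoder would not be suitable here.</p></div>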
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">User Activity</head><p>The user activity for each article is pre-determined by the weighted sum of user interactions, which we express as a relevance score. Different types of interactions yield different weights. The platform computes the user activity using the weights presented in Table <ref type="table" target="#tab_0">1</ref>. When it comes to the "Read Preview" interaction, the weight assigned will be zero if there are no other user interactions on the article, and one otherwise (excluding "Read Detail"). For the "Read Detail" interaction, the weight assigned will be zero if there are no other user interactions on the article, and two otherwise (excluding "Read Preview"). As for the "Pin to Board" activity, the weight assigned is five or ten, based on whether the board is private or public, respectively. The weights assigned to each activity are proportional to the complexity of the activity being performed.</p><p>One of the issues we had to address before the application of our system was the low proportion of articles with user interactions (2.03%). The news feeds presented to users are ordered by time and user preference settings (i.e., pre-determined keywords, languages, etc.). When a new story emerges, EIOS users often interact with the first article reporting on the story, with the article they deemed to be from the most reliable source, or even with the article that reports the story in their language, among other preferences.</p><p>This interaction pattern means that if we have a single story reported in multiple articles from multiple sources, the user activity will vary widely among almost identical articles, with only a few articles getting interacted with. Thus, raw user activity does not directly equate to user interests. In the following section, we will outline how we intend to tackle this issue using clustering to make the relevance score a reliable measure of user interest.</p></div>
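<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighting rules above can be written down as a small scoring function. This is a sketch under stated assumptions, not the platform's implementation: the interaction identifiers are hypothetical names, and we assume that every occurrence of an interaction contributes its weight to the sum.</p>
<p>
```python
from collections import Counter

# Weights from Table 1; "Pin to Board" is split by board visibility.
FIXED_WEIGHTS = {
    "flag_for_follow_up": 3,
    "export_to_report": 5,
    "attach_to_team_communication": 5,
    "comment": 5,
    "pin_to_private_board": 5,
    "pin_to_public_board": 10,
}
# Reads only count when the article also has some other kind of interaction.
CONDITIONAL_WEIGHTS = {"read_preview": 1, "read_detail": 2}


def relevance_score(interactions):
    """Weighted sum of the interactions recorded for one article."""
    counts = Counter(interactions)
    unknown = set(counts) - set(FIXED_WEIGHTS) - set(CONDITIONAL_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown interaction types: {sorted(unknown)}")
    score = sum(
        FIXED_WEIGHTS[name] * n for name, n in counts.items() if name in FIXED_WEIGHTS
    )
    if score > 0:  # at least one interaction other than a read exists
        score += sum(w * counts[name] for name, w in CONDITIONAL_WEIGHTS.items())
    return score


print(relevance_score(["read_preview", "read_detail"]))  # reads alone
print(relevance_score(["read_preview", "comment"]))      # read plus comment
```
</p><p>Under these assumptions, an article that was only read scores zero, while the same read next to a comment adds one point to the comment's five.</p></div>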
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Cluster-based Harmonization</head><p>We considered that articles with no interaction are articles for which the relevance is unknown rather than zero, transforming the problem into a semi-supervised learning one. We corrected the relevance score of articles in clusters to deal with this and fall back on a supervised learning problem.</p><p>The harmonization of user activity/relevance scores happens at the level of clusters of related articles, some of which have an interaction score and others potentially none. We intended that the clusters captured reports on the same event; as such, they were computed considering both the time and semantic aspects. The clustered article data corresponds to the text described in the Data section. The entire dataset was split into five-day chunks, capturing a story's average duration, as represented in Figure <ref type="figure" target="#fig_1">2</ref>. Inside a chunk, all the pairs of articles were compared using sentence embeddings, and the pairs whose similarity was above a given threshold were put into a graph. The semantic similarity model used was distiluse-base-multilingual-cased-v2, with a threshold of 0.90. Finally, the graphs of all clusters were merged, and the set of connected components yielded the global set of clusters. This approach is designed to be adaptable, allowing it to pick up news stories that last longer than five days and preventing the merging of similar stories from widely different time spans. Once the clusters were computed, the second step of our procedure was to harmonize the score of all the articles belonging to each cluster. To illustrate this, we will consider an example cluster of four identical articles, [Article 1, Article 2, Article 3, Article 4], with the corresponding user activity scores [0, 5, 17, 0].</p></div>
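<div xmlns="http://www.tei-c.org/ns/1.0"><p>The graph-and-connected-components clustering described in this section can be sketched with a union-find structure. Several simplifications are assumptions of this sketch: the toy two-dimensional vectors stand in for sentence embeddings, cosine similarity is assumed as the similarity measure, and the five-day chunking is approximated by only comparing articles published fewer than five days apart.</p>
<p>
```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_articles(articles, threshold=0.90, window_days=5):
    """articles: list of (article_id, day_index, embedding) tuples."""
    parent = {article_id: article_id for article_id, _, _ in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Add an edge for every similar pair that is close enough in time.
    for i, (id_a, day_a, emb_a) in enumerate(articles):
        for id_b, day_b, emb_b in articles[i + 1:]:
            close_in_time = window_days > abs(day_a - day_b)
            if close_in_time and cosine(emb_a, emb_b) >= threshold:
                parent[find(id_a)] = find(id_b)

    # Connected components with at least two articles are the clusters.
    groups = {}
    for article_id, _, _ in articles:
        groups.setdefault(find(article_id), []).append(article_id)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


toy = [
    ("a", 0, [1.0, 0.0]),
    ("b", 1, [0.99, 0.05]),
    ("c", 4, [1.0, 0.01]),
    ("d", 8, [1.0, 0.0]),   # linked to "c" only, yet joins the same story
    ("e", 2, [0.0, 1.0]),   # unrelated content
]
print(cluster_articles(toy))
```
</p><p>Article "d" is eight days away from "a" but within the window of "c", so the connected component spans more than five days, while two identical articles published a month apart would stay separate.</p></div>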
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Model</head><p>In this approach, each row of our data represented an article with a relevance score corresponding to the weighted sum of user interactions with the article. As stated in the previous sections, the features considered for training were the article attributes: a text section at the beginning of the article, the event labels that report on the article classification, and the relevance scores. Our goal was to recommend articles with higher engagement that are, therefore, more relevant.</p><p>We divided our data into training (80%) and testing (20%) with a 5-fold cross-validation. For the training data, we used an XGBoost regression model <ref type="bibr" target="#b7">[8]</ref>. This model learns to predict each article's user engagement by building a series of decision trees sequentially, using gradient descent to minimize the loss. We did not perform hyperparameter tuning, keeping the default parameters stated in the package documentation<ref type="foot" target="#foot_1">4</ref>, to avoid overfitting the model to our data and maintain its generalizability to new data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluation Metrics</head><p>The evaluation metrics considered for the different settings were the following:</p><p>• RMSE: Root mean square error (RMSE) or root mean square deviation is one of the most commonly used measures for evaluating the quality of predictions. It shows how far predictions fall from measured true values using Euclidean distance.</p><p>• NDCG@K: Normalized Discounted Cumulative Gain (NDCG) considers both the relevance and the position of items in the ranked list in the top K items.</p><p>• Precision@K: Precision at K measures the proportion of relevant items among the top K items.</p><p>• Recall@K: Recall at K measures the coverage of relevant items in the top K items.</p><p>• F-measure@K: Harmonizes precision and recall to provide a balanced metric in the top K items.</p><p>We considered 5, 10, 15, and 100 items for K. For Precision, Recall, and F-measure, since the values considered are binary, we present only the 𝐾 = 100 configuration to better reflect the real user needs in our setting.</p></div>
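<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ranking metrics listed above can be sketched in a few lines. This is an illustration under assumptions, not the evaluation code used for the reported results: it uses the linear-gain form of DCG, the textbook definition of recall (hits in the top K divided by all relevant items), and treats an article as relevant when its true score reaches a given threshold.</p>
<p>
```python
import math


def dcg(relevances):
    # The item at rank 1 is divided by log2(2) = 1, rank 2 by log2(3), etc.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def top_k(predicted_scores, k):
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    return order[:k]


def ndcg_at_k(true_scores, predicted_scores, k):
    gains = [true_scores[i] for i in top_k(predicted_scores, k)]
    ideal_dcg = dcg(sorted(true_scores, reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0


def precision_recall_f_at_k(true_scores, predicted_scores, k, threshold):
    selected = top_k(predicted_scores, k)
    relevant = {i for i, s in enumerate(true_scores) if s >= threshold}
    hits = sum(1 for i in selected if i in relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


true_scores = [3, 0, 2, 0]
predicted = [0.9, 0.8, 0.1, 0.2]
print(ndcg_at_k(true_scores, predicted, k=2))
print(precision_recall_f_at_k(true_scores, predicted, k=2, threshold=2))
```
</p><p>In the toy ranking, one of the two relevant articles reaches the top two, so precision, recall and F-measure all equal 0.5, while NDCG additionally rewards the fact that the most relevant article is ranked first.</p></div>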
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>This section presents the main results regarding all modalities and discusses the model's successes and potential limitations given the simplified approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Local Clusters Distribution</head><p>The settings used for clustering were conservative, as clustering was performed on relatively long text with a high threshold. In total, 8.7% of the articles were clustered. The data revealed a predominant pattern of small clusters, with 81% having a size of 2 and 99% under size 7. These clusters also tend to be short-lived, with 49% lasting a single day and 99% up to 8 days. The manual review confirms that articles in these clusters are remarkably similar, often being near-perfect duplicates. Notably, the clusters with the longest lifespan appear to be populated by automatically generated reporting articles.</p><p>In Figure <ref type="figure" target="#fig_4">4</ref>, we plotted the distribution of cluster size and the distribution of the span of the cluster in days; some outliers fall outside the limits of the figure and are not shown. A cluster's median size was two articles, and the median span was two days. Table <ref type="table" target="#tab_1">3</ref> reports several statistics over the clusters, grouping them based on whether the relevance of related articles contains only 0, only positive (𝑝𝑜𝑠), mostly 0, mostly 𝑝𝑜𝑠, or both 0 and 𝑝𝑜𝑠 in equal proportion. We report the mean and maximum cluster size and span, the maximal peak article count, and the proportion of the total relevance. We can observe that clusters attracting most of the relevance tend to be relatively small and short.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Modality Performance</head><p>Table <ref type="table">4</ref> presents the results of the comparison of the different clustering modalities for user data augmentation using the RMSE, NDCG@K, Precision@K, Recall@K and F-measure@K metrics, taking into account 5-fold cross-validation.</p><p>Most modalities surpass the Original configuration. However, when considering NDCG@100, only Sum, High, Low, and Random perform distinctly better than the Original, with Sum being significantly better. The performance of Sum, which is about twice as good as that of the Original, raises the possibility that the actual user activity value corresponds to the sum of the interactions over all identical articles.</p><p>Table <ref type="table" target="#tab_2">5</ref> showcases the same procedure but using the Original modality test set. In this setting, the superior performance of the Sum modality is not as noticeable, but all modalities, except AVG, Discard, and Null, perform better than Original. A possible justification for this behaviour could be that our system performs better with more data regardless of how it is labelled, hindering the performance of the Null and Discard modalities. Additionally, the AVG configuration could make stronger and weaker signals less noticeable, diluting their relative importance in a ranking setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparison of different clustering modalities for user data augmentation using the RMSE, NDCG@K, Precision@K, Recall@K and F-measure@K metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Comparison of different clustering modalities with user data augmentation performance on the original test set (non-augmented) using the RMSE, NDCG@K, Precision@K, Recall@K, and F-measure@K metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Error Analysis</head><p>Table A1 (Appendix) showcases the false positives found across the five rounds of cross-validation for the different modalities at the top five (𝐾 = 5). All modalities introduce errors compared to the Original, with Sum and High introducing fewer wrong articles, as also reflected in Table <ref type="table">4</ref>.</p><p>We analysed the articles with a fail rate of at least 7 out of 8 modalities to interpret what could have made most modalities assign relevance. We then analysed whether each case was indeed a failure of our models or whether it could have been a relevant article missed by the users and/or the clustering procedure for data augmentation. This selection resulted in six articles, represented in Table <ref type="table">6</ref> and marked with an asterisk (*) in Table <ref type="table" target="#tab_0">A1</ref> (Appendix). Table <ref type="table">7</ref> reports on the details of these articles.</p><p>Even though Table <ref type="table">7</ref> does not report on the sources for the articles, all of these are pieces that</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Recommender system high-level pipeline with the three data streams and expected output.</figDesc><graphic coords="3,177.17,65.61,240.94,195.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Representation of five day time-span local clusters in the timeframe considered.</figDesc><graphic coords="4,177.17,542.64,240.95,91.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>•</head><label></label><figDesc>Cluster: [Article 1, Article 2, Article 3, Article 4] • User Activities: [0, 5, 17, 0]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Histogram of relevance score distribution excluding non-relevant articles: original for all articles (left) and with sum harmonization (right).</figDesc><graphic coords="6,75.27,65.60,221.13,221.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Heatmap of cluster size versus cluster span</figDesc><graphic coords="7,184.82,65.60,225.64,180.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Types of user activities and their corresponding weights.</figDesc><table><row><cell>User Activity</cell><cell>Weight</cell></row><row><cell>Read Preview</cell><cell>0 or 1</cell></row><row><cell>Read Detail</cell><cell>0 or 2</cell></row><row><cell>Flag for Follow Up</cell><cell>3</cell></row><row><cell>Export to Report</cell><cell>5</cell></row><row><cell>Attach to Team Communication</cell><cell>5</cell></row><row><cell>Comment</cell><cell>5</cell></row><row><cell>Pin to Board</cell><cell>Variable</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Statistics over cluster characteristics with different distributions.</figDesc><table><row><cell></cell><cell>Only 0</cell><cell>Only 𝑝𝑜𝑠</cell><cell>More 0</cell><cell>More 𝑝𝑜𝑠</cell><cell>Eq. prop.</cell></row><row><cell>Count</cell><cell>124332</cell><cell>829</cell><cell>1188</cell><cell>182</cell><cell>3798</cell></row><row><cell>Max size</cell><cell>177</cell><cell>6</cell><cell>180</cell><cell>21</cell><cell>10</cell></row><row><cell>AVG size</cell><cell>2.3</cell><cell>2.1</cell><cell>5.3</cell><cell>3.3</cell><cell>2.0</cell></row><row><cell>Max span</cell><cell>133</cell><cell>7</cell><cell>244</cell><cell>28</cell><cell>12</cell></row><row><cell>AVG span</cell><cell>1.9</cell><cell>1.8</cell><cell>4.5</cell><cell>3.0</cell><cell>1.8</cell></row><row><cell>Max peak</cell><cell>25</cell><cell>6</cell><cell>11</cell><cell>4</cell><cell>4</cell></row><row><cell>% total rel.</cell><cell>0.00</cell><cell>0.24</cell><cell>0.18</cell><cell>0.06</cell><cell>0.52</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 5</head><label>5</label><figDesc></figDesc><table><row><cell>Modality</cell><cell>RMSE</cell><cell>NDCG@5</cell><cell>NDCG@10</cell><cell>NDCG@15</cell><cell>NDCG@100</cell><cell>Precision@100</cell><cell>Recall@100</cell><cell>F-measure@100</cell></row><row><cell>Original</cell><cell>1.5739</cell><cell>0.1903</cell><cell>0.1807</cell><cell>0.1606</cell><cell>0.1749</cell><cell>0.3466</cell><cell>1.0000</cell><cell>0.5136</cell></row><row><cell>Sum</cell><cell>1.7362</cell><cell>0.4946</cell><cell>0.4382</cell><cell>0.4137</cell><cell>0.4108</cell><cell>0.5740</cell><cell>1.0000</cell><cell>0.7287</cell></row><row><cell>High</cell><cell>1.6591</cell><cell>0.1722</cell><cell>0.2283</cell><cell>0.2318</cell><cell>0.2537</cell><cell>0.4320</cell><cell>1.0000</cell><cell>0.6015</cell></row><row><cell>AVG</cell><cell>1.5313</cell><cell>0.1612</cell><cell>0.1575</cell><cell>0.1559</cell><cell>0.1734</cell><cell>0.3478</cell><cell>0.9946</cell><cell>0.5137</cell></row><row><cell>Low</cell><cell>1.6188</cell><cell>0.1767</cell><cell>0.1950</cell><cell>0.1903</cell><cell>0.2201</cell><cell>0.4060</cell><cell>1.0000</cell><cell>0.5762</cell></row><row><cell>Random</cell><cell>1.6380</cell><cell>0.1622</cell><cell>0.2015</cell><cell>0.2023</cell><cell>0.2516</cell><cell>0.4440</cell><cell>1.0000</cell><cell>0.6139</cell></row><row><cell>Discard</cell><cell>1.6381</cell><cell>0.1516</cell><cell>0.1518</cell><cell>0.1377</cell><cell>0.1802</cell><cell>0.3880</cell><cell>1.0000</cell><cell>0.5575</cell></row><row><cell>Null</cell><cell>1.6318</cell><cell>0.1861</cell><cell>0.1884</cell><cell>0.1855</cell><cell>0.1834</cell><cell>0.3720</cell><cell>1.0000</cell><cell>0.5420</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://scikit-learn.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://xgboost.readthedocs.io/en/stable/parameter.html</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>ORCID: 0000-0003-0597-9273 (D. F. Sousa); 0009-0000-2061-3216 (N. Stefanovitch); 0009-0008-0179-7468 (L. Spagnolo)</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Clusters containing articles with only zero relevance are left untouched, except for the Null configuration, detailed below. Clusters with mixed or only positive relevance were further processed to reassign the relevance score of every article within that cluster. We considered seven different modalities to perform the harmonization, which are illustrated in the following example: The Discard and Null modalities constitute filtering options, not modifying the relevance score but excluding articles with no score, using different approaches. For Discard, all non-relevant articles are removed from the cluster for the clusters with at least one relevant article. For Null, all clusters where all the articles have a zero relevance score are removed.</p><p>Table <ref type="table">2</ref> showcases the augmentation in general percentage for each modality compared to Original, reflecting our extremely conservative clustering procedure. The Threshold column is the user activity value considered at the recommendation level to decide if an article should be recommended. We obtained this value by considering the average of the positive (&gt; 0) user activities for each modality. Figure <ref type="figure">3</ref> reports the histogram of the user activity/relevance score of articles comparing the distribution of all the original data and the clustered articles' distribution of the sum modality, presenting similar profiles.</p></div>
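<div xmlns="http://www.tei-c.org/ns/1.0"><p>The harmonization step can be sketched on the example cluster with user activities [0, 5, 17, 0]. The definitions below are assumptions inferred from the modality names, since the exact formulas are not spelled out in the text: Sum assigns the cluster total, High the maximum, AVG the mean over all articles, and Low the smallest positive score. The Random modality is left out because its sampling rule is not described. Discard and Null follow the filtering behaviour described above.</p>
<p>
```python
from statistics import mean


def harmonize(scores, modality):
    """Reassign one shared score to every article in a cluster."""
    positives = [s for s in scores if s > 0]
    if not positives:
        return list(scores)  # all-zero clusters are left untouched
    if modality == "sum":
        value = sum(scores)
    elif modality == "high":
        value = max(scores)
    elif modality == "avg":
        value = mean(scores)      # assumption: zeros are part of the mean
    elif modality == "low":
        value = min(positives)    # assumption: lowest positive score
    else:
        raise ValueError(f"unsupported modality: {modality}")
    return [value] * len(scores)


def discard(scores):
    """Drop zero-score articles from clusters with a relevant article."""
    if any(s > 0 for s in scores):
        return [s for s in scores if s > 0]
    return list(scores)


def null_filter(clusters):
    """Remove the clusters in which every article has a zero score."""
    return [c for c in clusters if any(s > 0 for s in c)]


cluster = [0, 5, 17, 0]
for modality in ("sum", "high", "avg", "low"):
    print(modality, harmonize(cluster, modality))
print("discard", discard(cluster))
```
</p><p>Under these assumptions, Sum turns the two unscored near-duplicates into articles with a score of 22, which illustrates how harmonization increases the number of labelled training examples.</p></div>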
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Recommender System</head><p>The data available does not specify which users interacted with the articles; it only shows the overall user activity for each article. Therefore, recommendations are not based on individual user behaviour but on global preferences towards specific topics and domains, making adopting a collaborative filtering approach unfeasible.</p><p>In this article, an expert demonstrates how vaccination fears are at fault for rising chickenpox cases in Angola. If other sources are already monitoring the number of cases, this piece can be overlooked because it is primarily about cause rather than consequence. Nevertheless, we believe this article and similar articles can indicate the worsening of ongoing outbreaks. As such, these should not be ignored but used as indicators to flag future similar events pre-emptively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>This article presented the first step in developing a recommendation system for a pre-existing platform, EIOS, developed for PHI. Therefore, the results and analysis still need to be completed. However, this work successfully showcases a pipeline for developing a content-based system recommending articles in real-world PHI scenarios. It introduces a clustering-based approach to tackle data sparsity and the use of event classifier labels to enrich the recommender algorithm. While more complex metadata and advanced models and approaches are available and will be used in the future, this first attempt successfully demonstrated a way of dealing with data sparsity for our case study, which in turn improved the model performance from an NDCG@K of 0.1749 to 0.4108, at 𝐾 = 100, for the Sum cluster-based harmonization modality.</p><p>Looking ahead, we plan to further develop this approach by considering multiple users, article sources, and other types of article metadata, and by exploring the combination of clustering modalities and filters. Additionally, we aim to involve analysts in our approach to evaluate performance with actual end-users, thereby enhancing the robustness and applicability of our system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Error Analysis with False Positives</head></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">HealthMap: global infectious disease monitoring through automated classification and visualization of internet media reports</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Freifeld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">D</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">Y</forename><surname>Reis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Brownstein</surname></persName>
		</author>
		<idno type="DOI">10.1197/jamia.M2544</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="150" to="157" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Digital disease detection-harnessing the web for public health surveillance</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Brownstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Freifeld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Madoff</surname></persName>
		</author>
		<idno type="DOI">10.1056/NEJMp0900702</idno>
	</analytic>
	<monogr>
		<title level="j">The New England journal of medicine</title>
		<imprint>
			<biblScope unit="volume">360</biblScope>
			<biblScope unit="page">2153</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Early-life prevention of non-communicable diseases</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Balbus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Barouki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">S</forename><surname>Birnbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Etzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">D</forename><surname>Gluckman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Grandjean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hancock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hanson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Heindel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hoffman</surname></persName>
		</author>
		<idno type="DOI">10.1016/S0140-6736(12)61609-2</idno>
	</analytic>
	<monogr>
		<title level="j">The Lancet</title>
		<imprint>
			<biblScope unit="volume">381</biblScope>
			<biblScope unit="page" from="3" to="4" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Towards health (aware) recommender systems</title>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hors-Fraile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Karumur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Valdez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Said</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Torkamaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ulmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Trattner</surname></persName>
		</author>
		<idno type="DOI">10.1145/3079452.3079499</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 International Conference on Digital Health, DH &apos;17</title>
				<meeting>the 2017 International Conference on Digital Health, DH &apos;17<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="157" to="161" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Updated 2024 US vaccine recommendations from the Advisory Committee on Immunization Practices</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Pereira</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ajt.2024.02.012</idno>
	</analytic>
	<monogr>
		<title level="j">American Journal of Transplantation</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="514" to="516" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Health behavior models in the age of mobile interventions: are our theories up to the task?</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">T</forename><surname>Riley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">E</forename><surname>Rivera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Atienza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nilsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Allison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mermelstein</surname></persName>
		</author>
		<idno type="DOI">10.1007/s13142-011-0021-7</idno>
	</analytic>
	<monogr>
		<title level="j">Translational behavioral medicine</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="53" to="71" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Recommendations as challenges: Estimating required effort and user ability for health behavior change recommendations</title>
		<author>
			<persName><forename type="first">H</forename><surname>Torkamaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ziegler</surname></persName>
		</author>
		<idno type="DOI">10.1145/3490099.3511118</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th International Conference on Intelligent User Interfaces, IUI &apos;22</title>
				<meeting>the 27th International Conference on Intelligent User Interfaces, IUI &apos;22<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="106" to="119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">XGBoost: A scalable tree boosting system</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
		<idno type="DOI">10.1145/2939672.2939785</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD &apos;16</title>
				<meeting>the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD &apos;16<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="785" to="794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Multi-label infectious disease news event corpus</title>
		<author>
			<persName><forename type="first">J</forename><surname>Piskorski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Stefanovitch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Linge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kharazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mantero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jacquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Spadaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Teodori</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Text2Story&apos;23 Workshop</title>
				<meeting>the Text2Story&apos;23 Workshop<address><addrLine>Dublin, Republic of Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>Elsevier</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="171" to="183" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
