<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Natural Language Processing-based Approach for Cyber Risk Assessment in the Healthcare Ecosystems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Stefano</forename><surname>Silvestri</surname></persName>
							<email>stefano.silvestri@icar.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for High Performance Computing and Networking</orgName>
								<orgName type="institution">National Research Council of Italy (ICAR-CNR)</orgName>
								<address>
									<addrLine>via Pietro Castellino 111</addrLine>
									<postCode>80131</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Tricomi</surname></persName>
							<email>giuseppe.tricomi@icar.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for High Performance Computing and Networking</orgName>
								<orgName type="institution">National Research Council of Italy (ICAR-CNR)</orgName>
								<address>
									<addrLine>via Pietro Castellino 111</addrLine>
									<postCode>80131</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Università degli Studi di Messina</orgName>
								<address>
									<addrLine>Contrada di Dio 1</addrLine>
									<postCode>98166</postCode>
									<settlement>Messina</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">CINI-Consorzio Interuniversitario Nazionale per l&apos;Informatica</orgName>
								<address>
									<addrLine>Via Ariosto 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><forename type="middle">Felice</forename><surname>Russo</surname></persName>
							<email>giuseppefelice.russo@icar.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for High Performance Computing and Networking</orgName>
								<orgName type="institution">National Research Council of Italy (ICAR-CNR)</orgName>
								<address>
									<addrLine>via Pietro Castellino 111</addrLine>
									<postCode>80131</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mario</forename><surname>Ciampi</surname></persName>
							<email>mario.ciampi@icar.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for High Performance Computing and Networking</orgName>
								<orgName type="institution">National Research Council of Italy (ICAR-CNR)</orgName>
								<address>
									<addrLine>via Pietro Castellino 111</addrLine>
									<postCode>80131</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Natural Language Processing-based Approach for Cyber Risk Assessment in the Healthcare Ecosystems</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">CCBFC9BDA01FF3D517E9B5C8003490B5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural Language Processing, Large Language Models, Cyber Threats, Cyber Vulnerabilities, Impact Assessment, Cyber Risk Assessment</term>
					<term>0000-0002-9890-8409 (S. Silvestri)</term>
					<term>0000-0003-3837-8730 (G. Tricomi)</term>
					<term>0009-0001-2090-9647 (G. F. Russo)</term>
					<term>0000-0002-7286-6212 (M. Ciampi)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Cyber risk in the healthcare sector is constantly increasing, due to the wide adoption of digital services built on a complex interconnection of different systems and technologies, which offers a larger attack surface to attackers. Therefore, the risk assessment of the assets involved in these services is crucial to prevent and mitigate possible critical consequences, which could also affect the health of patients. A large, constantly updated source of information about threats and vulnerabilities of healthcare ecosystem assets is available as natural language text on the Internet (cyber security news, forums, social media, etc.), but it is not easy to fully exploit it in a risk assessment process, due to the complexity of natural language. This paper proposes an AI-based approach for the individual risk assessment of the assets of digital healthcare systems based on NLP and Knowledge Bases, which exploits the information extracted from natural language news on the web. The methodology has been developed within the activities of the EC-funded H2020 AI4HEALTHSEC project, where it has also been successfully tested in real-world scenarios. Moreover, the collected datasets have been made publicly available on the SoBigData research infrastructure.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The healthcare ecosystem is rapidly adopting a growing number of recent technologies, such as Internet of Things (IoT), wearable and implantable devices, Picture Archiving and Communication Systems (PACS), Electronic Health Records (EHRs), DICOM images, and others, interconnected to realise and offer innovative digital healthcare services. While their adoption improves the quality of service to patients and supports and eases the work of physicians and medical professionals, this complex and dynamic interconnection of several systems offers a larger attack surface to threat actors interested in attacking the system by exploiting existing vulnerabilities <ref type="bibr" target="#b0">[1]</ref>. The low level of cyber risk awareness among healthcare personnel <ref type="bibr" target="#b1">[2]</ref> compounds the problem, and attacks often have dramatic impacts on the healthcare ecosystem <ref type="bibr" target="#b2">[3]</ref>. For example, a cyber-attack on an insecure PACS server could lead to the web exposure of sensitive patient information, and an attack on the remote monitoring software of a medical device could damage hospital equipment or change the configuration of the device <ref type="bibr" target="#b3">[4]</ref>. This sector has recently suffered several serious cyber attacks: for example, in 2017 and 2021 there were ransomware attacks on the U.K. National Health Service (NHS) and on Ireland's Department of Health and Health Service Executive, respectively <ref type="bibr" target="#b4">[5]</ref>. Furthermore, inherent vulnerabilities have been found in some medical devices, such as Braun's infusion pump and Medtronic's insulin pump <ref type="bibr" target="#b2">[3]</ref>. Finally, approximately 90% of healthcare organisations experienced a data breach in 2018 <ref type="bibr" target="#b5">[6]</ref>. 
For these reasons, it is necessary to study the most frequent attacks in healthcare to make the services offered more secure and resilient <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b6">7]</ref>. Given the complexity of healthcare ecosystems, an effective cyber risk assessment can help to limit and prevent cyber security incidents <ref type="bibr" target="#b7">[8]</ref>. The cyber risk assessment process identifies, evaluates, and prioritises the security risks to the assets of an organisation, so that the most appropriate actions can be taken to mitigate the risks and the vulnerabilities.</p><p>The Internet is a constantly updated source of threat, incident, and vulnerability-related information for healthcare ecosystem assets, in the form of unstructured Natural Language (NL) within blogs, specialized Cyber-Security (CS) websites, social media, Knowledge Bases (KBs) and others. Although these sources contain crucial information for risk management and assessment, it is difficult to fully leverage them, due to the inherent complexity of NL (polysemy, irony, long and complex sentences, non-standardized abbreviations, acronyms). Therefore, extracting relevant information from this mass of data is a demanding task <ref type="bibr" target="#b8">[9]</ref>. Information extraction from NL text is currently addressed in the literature with AI-based Natural Language Processing (NLP) models, usually implementing Named Entity Recognition (NER) systems <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref> based on Large Language Models (LLMs) and CS KBs. However, the literature lacks a focus on analyzing and prioritizing the threats and vulnerabilities that most frequently affect healthcare. 
In this context, this paper extends the ideas previously presented in <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>, combining NLP-based threat and vulnerability assessment approaches to define an impact and risk assessment for healthcare ecosystems that exploits CS textual sources available on the Internet. It presents the final NLP cyber risk assessment methodology developed within the activities of the EC-funded H2020 AI4HEALTHSEC research project, as well as the collection of a textual CS dataset related to the "SoBigData.it" research project.</p><p>The paper is organized as follows: Section 2 outlines the most recent related studies in the literature; Section 3 describes the details of the proposed approach; Section 4 presents the implementation of the proposed solution, the datasets used, and the research project in which the approach was tested in real-world scenarios. Finally, Section 5 provides conclusions and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Several recent works in the literature deal with risk assessment and CS information extraction from NL documents. The authors of <ref type="bibr" target="#b7">[8]</ref> reviewed and compared different generic cyber risk assessment frameworks in the healthcare field, discussing their assessment methodologies and limitations. A threat and mitigation model tailored for IoT health devices is presented in <ref type="bibr" target="#b16">[17]</ref>, combining the STRIDE and DREAD models: threats are identified by applying STRIDE to the device access points and are then ranked using DREAD. This approach is suitable for both the designers and the users of health IoT devices.</p><p>The security and privacy challenges in Medical Cyber-Physical Systems (MCPS) are discussed in <ref type="bibr" target="#b17">[18]</ref>, highlighting that trust and threat models usually assign incorrect levels of trust to MCPS stakeholders, including healthcare practitioners, system administrators and non-medical staff. The issues related to the CS awareness of healthcare personnel are also underlined in <ref type="bibr" target="#b1">[2]</ref>, which reviews the existing gaps in the CS strategies and the risk assessment methodologies adopted by healthcare organizations. The authors showed that in this domain there is often a lack of adequate training for healthcare workers and a lack of specialized figures, such as a chief information officer, highlighting the need to keep security protocols updated to the latest standards.</p><p>AI-based information extraction from CS textual documents has also been recently developed and presented in the literature. 
SecureBERT, presented in <ref type="bibr" target="#b12">[13]</ref>, is a Bidirectional Encoder Representations from Transformers (BERT) model trained on large CS-domain NL corpora, which outperforms other similar models in NLP tasks in the CS domain. The authors of <ref type="bibr" target="#b9">[10]</ref> collected a large corpus of labeled sequences from the documentation of Industrial Control System devices to pre-train and fine-tune a BERT language model, named CyBERT. The authors of <ref type="bibr" target="#b11">[12]</ref> proposed another CS NER system, whose architecture combines BERT, a Long Short-Term Memory (LSTM) network, Iterated Dilated Convolutional Neural Networks (ID-CNNs), and a Conditional Random Field to improve performance.</p><p>The main innovation of the proposed approach is the use of CS information extracted from NL texts to calculate the threat, vulnerability, and impact levels, from which the risk assessment of the various assets involved in digital healthcare services is finally obtained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>The proposed risk assessment methodology is composed of the following five steps: i) Healthcare Ecosystem Assets Identification and Categorisation; ii) Threat Identification and Assessment; iii) Vulnerability Assessment; iv) Impact Assessment; and v) Risk Assessment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Healthcare Ecosystem Assets Identification and Categorisation</head><p>The preliminary step of the methodology produces a list of the assets of the considered complex digital healthcare system by identifying the services involved and their assets, with the final purpose of measuring their criticality within the healthcare system. For instance, the assets of a remote patient consultation service could include a database, a Linux server, communication software, and a web server. After their identification, the assets are categorized using the Common Platform Enumeration (CPE)<ref type="foot" target="#foot_0">1</ref> catalogue, which maps them to the corresponding area (based on their type) and category (depending on their functionalities), as shown in Table <ref type="table" target="#tab_0">1</ref>. This step allows us to understand the importance of each asset within the ecosystem and to provide a list of the assets that require risk assessment. These classifications are used to evaluate the criticality of each asset of the healthcare system by measuring the level of dependency between the asset and the other system components. We defined four dependency levels:</p><p>• Independent: the asset operates on its own and has no dependency on other assets. If the asset fails, no cascading events occur. • Incoming dependency: another asset uses its data or functionality. If such an asset fails, the operation of all the assets that use its data or functionality may be disrupted. • Outgoing dependency: the asset uses the data or functionality of another asset. Therefore, if the latter fails, the operation of the former is affected as well. • Coupling relationship: two assets have both incoming and outgoing dependencies on each other, so a failure in one of them affects the functionality of the other.</p><p>Thus, the criticality level of an asset can be determined by the number of services and relevant business flows it participates in. Specifically, the General Asset Criticality level based on running services (GAC) is calculated as the weighted sum of the asset's interdependencies, normalized by the total number of services in the examined healthcare ecosystem. The Asset Criticality for a specific service (ACS) is then equal to the GAC value divided by the number of relevant/redundant assets that co-exist in the service. Finally, based on the ACS value ranges, a criticality level is assigned to each asset, as shown in Table <ref type="table">2</ref>.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 2</head><label>2</label><figDesc>Asset Criticality Levels.</figDesc><table><row><cell>ACS Value Range</cell><cell>Asset Criticality Level</cell></row><row><cell>[0,1]</cell><cell>Low</cell></row><row><cell>(1,2]</cell><cell>Medium</cell></row><row><cell>(2,3]</cell><cell>High</cell></row></table></figure>
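<div xmlns="http://www.tei-c.org/ns/1.0"><p>The criticality computation of this step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the project: the text only specifies that GAC is a weighted sum of interdependencies normalised by the number of services, that ACS is GAC divided by the number of redundant assets, and the ranges of Table 2. The weight values, the convention of one dependency label per service in which the asset participates, and all function names are hypothetical.</p>
<p>
```python
from typing import Dict, List

# Hypothetical weight for each dependency level; the paper does not state them.
DEPENDENCY_WEIGHTS: Dict[str, float] = {
    "independent": 0.0,
    "incoming": 1.0,
    "outgoing": 1.0,
    "coupling": 2.0,
}


def general_asset_criticality(dependencies: List[str], total_services: int) -> float:
    """GAC: weighted sum of the asset's interdependencies over the number of services.

    `dependencies` holds one dependency label per service the asset takes part in
    (an assumption made for this sketch).
    """
    if not total_services > 0:
        raise ValueError("the ecosystem must contain at least one service")
    weighted_sum = sum(DEPENDENCY_WEIGHTS[label] for label in dependencies)
    return weighted_sum / total_services


def asset_criticality_for_service(gac: float, redundant_assets: int) -> float:
    """ACS: GAC divided by the number of relevant/redundant assets in the service."""
    if not redundant_assets > 0:
        raise ValueError("a service contains at least the asset itself")
    return gac / redundant_assets


def criticality_level(acs: float) -> str:
    """Map an ACS value to the levels of Table 2: [0,1] Low, (1,2] Medium, (2,3] High."""
    if acs > 3 or 0 > acs:
        raise ValueError("ACS is expected to lie in the interval [0, 3]")
    if acs > 2:
        return "High"
    if acs > 1:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    # A hypothetical web server used by two services, with no redundant copy.
    gac = general_asset_criticality(["incoming", "coupling", "outgoing"], total_services=2)
    acs = asset_criticality_for_service(gac, redundant_assets=1)
    print(gac, acs, criticality_level(acs))
```
</p><p>Dividing by the number of redundant assets lowers the criticality of an asset that has replicas in the same service, which matches the intuition that the failure of one replica is less disruptive.</p></div>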
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Threat Identification and Assessment</head><p>Once the assets have been identified, the next step assesses the threats that could affect them, following the approach previously described in <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. Firstly, a threat identification phase is performed by exploiting the Common Attack Pattern Enumeration and Classification (CAPEC)<ref type="foot" target="#foot_1">2</ref>, which also provides a detailed set of characteristics of the threats, such as Likelihood of Attack, Related Attack Patterns, Execution Flow, Prerequisites and others. In this way, we obtain the list of the threats for each asset that operates in the considered healthcare service/system (identified in the previous step). Each threat also includes the CAPEC ID, a CAPEC category that will be used to rate the threat, and the corresponding characteristics.</p><p>Then, the threats are assessed by assigning them a severity level. Our methodology exploits the NL history of reported incidents related to those threats, extracted from large CS domain collections available online, such as forums, social media, news, and others, using an AI-based NLP approach. In detail, we use a Named Entity Recognition (NER) architecture based on SecureBERT <ref type="bibr" target="#b12">[13]</ref>, a BERT model pre-trained on a very large CS domain text collection (more than 2.2 million documents), preprocessed with a CS customized tokenizer, and fine-tuned for the NER task, to extract the mentions of threat-asset pairs found in each sentence of the NL source. For this purpose, we produced a custom training set, annotated with the entity types of interest (Asset and Threat) using the semi-supervised approach described in <ref type="bibr" target="#b18">[19]</ref>. 
Then, the threat level is calculated based on the percentage of mentions of that threat within the considered dataset, following the ranges shown in Table <ref type="table">3</ref>. The assessment is finally performed by mapping the assets of the services of the healthcare system to the asset-threat pairs with the corresponding threat level.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Threat Levels and corresponding percentage of occurrence. </p></div>
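<div xmlns="http://www.tei-c.org/ns/1.0"><p>The frequency-based threat rating described in this section can be illustrated with the following Python sketch. It assumes that the NER module has already produced a list of (asset, threat) pairs, one per extracted mention. The threshold values and the three-level scale used in the example are placeholders chosen only for illustration; the actual ranges are those of Table 3, and all function names are hypothetical.</p>
<p>
```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (asset mention, threat mention)

# Placeholder thresholds as (minimum percentage, level), highest first.
# These numbers are NOT taken from the paper; replace them with the ranges of Table 3.
EXAMPLE_THRESHOLDS: List[Tuple[float, str]] = [
    (50.0, "High"),
    (10.0, "Medium"),
    (0.0, "Low"),
]


def threat_percentages(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Percentage of all extracted mentions that refer to each threat."""
    counts = Counter(threat for _asset, threat in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {threat: 100.0 * n / total for threat, n in counts.items()}


def threat_level(percentage: float, thresholds: List[Tuple[float, str]]) -> str:
    """Return the first level whose minimum percentage is reached."""
    for minimum, level in thresholds:
        if percentage >= minimum:
            return level
    raise ValueError("percentage is below every threshold")


def assess_threats(pairs: Iterable[Pair],
                   thresholds: List[Tuple[float, str]]) -> Dict[Pair, str]:
    """Attach a threat level to every distinct (asset, threat) pair."""
    pair_list = list(pairs)
    percentages = threat_percentages(pair_list)
    return {
        (asset, threat): threat_level(percentages[threat], thresholds)
        for asset, threat in set(pair_list)
    }


if __name__ == "__main__":
    # Hypothetical mentions extracted from news sentences.
    mentions = [("PACS server", "ransomware")] * 3 + [("infusion pump", "buffer overflow")]
    for pair, level in sorted(assess_threats(mentions, EXAMPLE_THRESHOLDS).items()):
        print(pair, level)
```
</p><p>The resulting dictionary can then be joined with the asset list produced in the first step, so that each asset of a service inherits the level of the threats mentioned together with it.</p></div>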
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Vulnerability Assessment</head><p>The next step builds a vulnerability exploit prediction scoring system specifically tailored for the healthcare domain. To this end, we adopted the NLP and Machine Learning (ML) approach described in <ref type="bibr" target="#b14">[15]</ref>, which leverages CS domain textual data sources to train a supervised ML classification model that predicts the vulnerability score, thereby providing the vulnerability assessment. In summary, this method uses the textual data included in the Common Vulnerabilities and Exposures (CVE) KB (the Report column of this KB) together with the corresponding exploitability and impact metrics, namely attack vector, attack complexity, privileges required, user interaction, scope, confidentiality impact, integrity impact and availability impact. The text is converted into a vector representation, labelled with these metrics, and used to train a set of XGBoost classifiers that predict the label of the Attack Vector (Network, Adjacent Network, Local, Physical) and of the other exploitability and impact metrics, summarised in Table <ref type="table">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Exploitability and impact metrics and corresponding labels.</p><p>SecureBERT is pre-trained on a very large CS domain text collection (more than 2.2 million documents), preprocessed with a CS customised tokenizer to improve its performance. This model has been fine-tuned for the NER task to extract, for the threat assessment, the mentions of threat and asset pairs found in each corpus sentence and, for the impact assessment, the mentions of vulnerabilities, the corresponding adjectives, and the assets. To this end, we created two custom training sets, annotated with the entity types of interest (Asset and Threat in the first case; Asset, Vulnerability and Adjective in the latter) using the semi-supervised approach described in <ref type="bibr" target="#b18">[19]</ref>. The implementation of this module is based on the Hugging Face Transformers Python library. The vulnerability assessment ML classifiers have been implemented using the DMLC XGBoost library, a distributed gradient boosting library designed to be highly efficient and flexible.</p><p>The proposed methodology has been developed and implemented within the activities of the EC-funded H2020 project "AI4HEALTHSEC-A Dynamic and Self-Organised Artificial Swarm Intelligence Solution for Security and Privacy Threats in Healthcare ICT Infrastructures". In this project, the proposed approach has been tested in real-world pilot scenarios provided by the Fraunhofer Institute for Biomedical Engineering (IBMT), a partner of the project. The pilots tested three different complex healthcare system scenarios, namely Implantable Medical Devices, Wearables, and Biobank. The results of the tests, reported in <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>, confirmed the effectiveness and the applicability of our method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Future Works</head><p>The paper proposes an AI-based approach for the individual risk assessment of the assets of digital healthcare systems. After classifying the criticality of the assets using CS KBs, the approach leverages NER and ML systems to extract and classify relevant information from textual CS sources, which makes it possible to calculate the threat, vulnerability and impact levels; these are finally combined to obtain the risk level of each asset. The methodology was successfully tested in real-world pilot scenarios of the EC-funded H2020 AI4HEALTHSEC project, demonstrating its applicability and effectiveness. Moreover, the datasets, which are constantly updated, are publicly available on the SoBigData research infrastructure.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Asset areas and categories.</figDesc><table><row><cell>Area</cell><cell>Name</cell></row><row><cell>1</cell><cell>User interactions with implants and sensors</cell></row><row><cell>2</cell><cell>Medical equipment and IT devices</cell></row><row><cell>3</cell><cell>Services and processes</cell></row><row><cell>4</cell><cell>Interdependent HCIIs -Ecosystem</cell></row><row><cell>Category</cell><cell>Functionalities</cell></row><row><cell>Influence</cell><cell>Found in most organizations, distinct</cell></row><row><cell>Type</cell><cell>Software, hardware, Operating System (OS), Information</cell></row><row><cell></cell><cell>Sensitivity</cell></row><row><cell>Sensitivity</cell><cell>Restricted, unrestricted</cell></row><row><cell>Criticality</cell><cell>Essential, required, deferrable</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://nvd.nist.gov/products/cpe</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://capec.mitre.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://thehackernews.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Available at https://data.d4science.org/ctlg/ResourceCatalogue/the_hackernews_dataset</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is supported by the European Union-NextGenerationEU-National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR)-Project: "SoBigData.it-Strengthening the Italian RI for Social Mining and Big Data Analytics"-Prot. IR0000013-Avviso n. 3264 del 28/12/2021.</p><p>We thank Simona Sada and Giuseppe Trerotola for the administrative and technical support provided.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Availability None, Low, High</head></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Then, an extension of the CVE Exploit Prediction Scoring System (EPSS) is adopted <ref type="bibr" target="#b19">[20]</ref>, defining a Common Vulnerability Scoring System (CVSS)-like score from the labels predicted by the trained ML models on the NL texts, following the specifications provided by <ref type="bibr" target="#b20">[21]</ref>. The vulnerability level is assigned according to the range in which the computed CVSS-like score falls, as shown in Table <ref type="table">5</ref>.</p></div>
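<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 5</head><label>5</label><figDesc>CVSS score ranges and corresponding vulnerability levels.</figDesc><table><row><cell>CVSS-like Score Range</cell><cell>Vulnerability Level</cell></row><row><cell>8.0, 10</cell><cell>Very High</cell></row><row><cell>6.0, 8.0</cell><cell>High</cell></row><row><cell>4.0, 6.0</cell><cell>Medium</cell></row><row><cell>2.0, 4.0</cell><cell>Low</cell></row><row><cell>0.0, 2.0</cell><cell>Very Low</cell></row></table></figure>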
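<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how predicted metric labels can be turned into a score and a vulnerability level, the following Python sketch applies the base-score equations and metric weights of the public CVSS v3.1 specification. Using CVSS v3.1 here is an assumption: the text only states that a CVSS-like score is computed from the predicted labels. The assignment of boundary values to the upper level (for example, 8.0 to Very High) is also an assumption, because the interval brackets of Table 5 are not available. Function names are hypothetical.</p>
<p>
```python
import math

# Metric weights of the CVSS v3.1 base metrics.
ATTACK_VECTOR = {"Network": 0.85, "Adjacent Network": 0.62, "Local": 0.55, "Physical": 0.2}
ATTACK_COMPLEXITY = {"Low": 0.77, "High": 0.44}
USER_INTERACTION = {"None": 0.85, "Required": 0.62}
PRIVILEGES_SCOPE_UNCHANGED = {"None": 0.85, "Low": 0.62, "High": 0.27}
PRIVILEGES_SCOPE_CHANGED = {"None": 0.85, "Low": 0.68, "High": 0.5}
CIA_IMPACT = {"None": 0.0, "Low": 0.22, "High": 0.56}


def roundup(value: float) -> float:
    """Round up to one decimal place, following the CVSS v3.1 Roundup definition."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def cvss_like_score(av: str, ac: str, pr: str, ui: str, scope_changed: bool,
                    c: str, i: str, a: str) -> float:
    """Base score computed from the labels predicted for each metric."""
    privileges = PRIVILEGES_SCOPE_CHANGED if scope_changed else PRIVILEGES_SCOPE_UNCHANGED
    exploitability = (8.22 * ATTACK_VECTOR[av] * ATTACK_COMPLEXITY[ac]
                      * privileges[pr] * USER_INTERACTION[ui])
    iss = 1 - (1 - CIA_IMPACT[c]) * (1 - CIA_IMPACT[i]) * (1 - CIA_IMPACT[a])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if not impact > 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total = 1.08 * total
    return roundup(min(total, 10.0))


def vulnerability_level(score: float) -> str:
    """Map a score to the levels of Table 5 (boundary handling is an assumption)."""
    if score >= 8.0:
        return "Very High"
    if score >= 6.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    if score >= 2.0:
        return "Low"
    return "Very Low"


if __name__ == "__main__":
    score = cvss_like_score("Network", "Low", "None", "None", False, "High", "High", "High")
    print(score, vulnerability_level(score))
```
</p><p>In this sketch the eight arguments correspond to the labels that the trained classifiers would predict for a vulnerability description, so the scoring function can be applied directly to the classifier outputs.</p></div>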
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Impact Assessment</head><p>The next step of the proposed methodology is the Individual Impact Assessment, where the impact level is calculated to measure the effect that can be expected from the successful exploitation of a vulnerability that resides in a critical asset. In this case, the methodology leverages the CVE KB in conjunction with the same NER module used for the Threat Assessment (see Section 3.2), fine-tuned to extract the asset and vulnerability entity types. This methodology also exploits a set of adjectives related to the vulnerabilities and belonging to a predefined dictionary. These adjectives, such as severe, serious, dangerous, etc., indicate the severity level of the vulnerability via a weight coefficient. In detail, this dictionary is the result of processed features evaluated with two different classifiers that output scores to predict relevancy and severity, following the approach described in <ref type="bibr" target="#b21">[22]</ref>. Each adjective is associated with a coefficient, calculated by taking the log-odds ratio, applying the exponential function to obtain the odds, and converting the odds to a probability using the formula: probability = odds / (1 + odds). In this way, it is possible to associate the vulnerability with a Low, Medium, or High scale, where Low corresponds to [0, 33) (meaning a 0-33% impact assessment probability), Medium corresponds to [33, 66), and High corresponds to [66, 100].</p><p>For vulnerabilities expressed in CVSS (obtained in the previous step), the three security criteria Confidentiality (C), Integrity (I), and Availability (A) are rated on a three-tier scale: None, Low, and High (see Table <ref type="table">4</ref>). 
We can define a mapping from this three-tier scale onto a five-tier scale ranging from Very Low (VL) to Very High (VH) by combining these characteristics, as shown in Table <ref type="table">6</ref>, which provides an initial impact level for a specific asset/vulnerability combination.</p><p>Then, the final impact level per asset is obtained by combining the initial impact with the asset criticality level (see Table <ref type="table">2</ref>) and with the previous scale related to the adjectives and the corresponding vulnerabilities extracted by the NER module, as shown in Table <ref type="table">7</ref>.</p></div>
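<div xmlns="http://www.tei-c.org/ns/1.0"><p>The conversion of an adjective coefficient from log-odds to the Low, Medium, High scale described in this section can be sketched in Python as follows. The function names and the example log-odds values are hypothetical; only the formula probability = odds / (1 + odds) and the three ranges come from the text.</p>
<p>
```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Exponentiate the log-odds to get the odds, then convert odds to probability."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def adjective_scale(probability: float) -> str:
    """Map a probability to the scale [0, 33) Low, [33, 66) Medium, [66, 100] High."""
    percent = 100.0 * probability
    if percent >= 66:
        return "High"
    if percent >= 33:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    # Hypothetical log-odds values for three adjectives of the dictionary.
    for adjective, log_odds in [("minor", -2.0), ("serious", 0.0), ("severe", 2.0)]:
        p = log_odds_to_probability(log_odds)
        print(adjective, round(p, 3), adjective_scale(p))
```
</p><p>A log-odds value of zero corresponds to odds of one and therefore to a probability of 0.5, which falls in the Medium range.</p></div>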
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Risk Assessment</head><p>Finally, the Risk Assessment is obtained by combining the Threat, Vulnerability, and Impact levels obtained in the previous steps, calculating the individual risk level of each asset according to Table <ref type="table">8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Implementation and Experiments</head><p>To implement the Threat and Impact assessment methods, we first needed a large and up-to-date collection of CS domain textual documents. To this end, we collected the news published by The Hacker News website<ref type="foot" target="#foot_2">3</ref>, a CS news platform that attracts over 8 million readers monthly and is updated daily with attacks, threats, vulnerabilities, and other CS news. A Python web crawler and scraper has been specifically developed for this website to retrieve, extract, collect, and normalise the text of each news item. The scraping task is performed bi-weekly, so the dataset is constantly updated and keeps growing. Moreover, this corpus is publicly available on the SoBigData research infrastructure<ref type="foot" target="#foot_3">4</ref>. The NER module is based on SecureBERT <ref type="bibr" target="#b12">[13]</ref>, a pre-trained BERT model.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Swarm intelligence model for securing healthcare ecosystem</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ribino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ciampi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papastergiou</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.procs.2022.10.131</idno>
		<ptr target="https://doi.org/10.1016/j.procs.2022.10.131" />
	</analytic>
	<monogr>
		<title level="j">Procedia Computer Science</title>
		<imprint>
			<biblScope unit="volume">210</biblScope>
			<biblScope unit="page" from="149" to="156" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Influence of human factors on cyber security within healthcare organisations: A systematic review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nifakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chandramouli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Nikolaou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Papachristou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Panaousis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bonacina</surname></persName>
		</author>
		<idno type="DOI">10.3390/s21155119</idno>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>McKee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Laulheret</surname></persName>
		</author>
		<ptr target="https://www.trellix.com/blogs/research/mcafee-enterprise-atr-uncovers-vulnerabilities-in-globally-used-b-braun-infusion-pump/" />
		<title level="m">McAfee Enterprise ATR uncovers vulnerabilities in globally used B. Braun infusion pump</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A dynamic cyber security situational awareness framework for healthcare ICT infrastructures</title>
		<author>
			<persName><forename type="first">S</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papastergiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mouratidis</surname></persName>
		</author>
		<idno type="DOI">10.1145/3503823.3503885</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th Pan-Hellenic Conference on Informatics, PCI &apos;21</title>
				<meeting>the 25th Pan-Hellenic Conference on Informatics, PCI &apos;21<address><addrLine>Volos, Greece</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="334" to="339" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Rees</surname></persName>
		</author>
		<ptr target="https://www.pinsentmasons.com/out-law/analysis/cyber-attacks-healthcare-europe" />
		<title level="m">Cyber attacks in healthcare: the position across Europe</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m">Sixth annual benchmark study on privacy &amp; security of healthcare data</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
		<respStmt>
			<orgName>Ponemon Institute</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A study of cyber attacks: In the healthcare sector</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nenova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Iliev</surname></persName>
		</author>
		<idno type="DOI">10.1109/Lighting49406.2021.9598947</idno>
	</analytic>
	<monogr>
		<title level="m">2021 Sixth Junior Conference on Lighting (Lighting)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Cyber security risk assessment methods for smart healthcare</title>
		<author>
			<persName><forename type="first">S</forename><surname>Memon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Memon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Memon</surname></persName>
		</author>
		<idno type="DOI">10.1109/KHI-HTC60760.2024.10481961</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Using BERT and augmentation in named entity recognition for cybersecurity domain</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tikhomirov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Loukachevitch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sirotina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dobrov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">25th International Conference on Applications of Natural Language Processing and Information Systems</title>
				<meeting><address><addrLine>Saarbrücken, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="16" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">CyBERT: Cybersecurity claim classification by fine-tuning the BERT language model</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ameri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hempel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sharif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lopez</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Perumalla</surname></persName>
		</author>
		<idno type="DOI">10.3390/jcp1040031</idno>
		<ptr target="https://www.mdpi.com/2624-800X/1/4/31" />
	</analytic>
	<monogr>
		<title level="j">Journal of Cybersecurity and Privacy</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="615" to="637" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Named entity recognition using BERT with whole world masking in cybersecurity domain</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICBDA51983.2021.9403180</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 6th International Conference on Big Data Analytics (ICBDA)</title>
				<meeting><address><addrLine>Xiamen, China</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="316" to="320" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Joint BERT model based cybersecurity named entity recognition</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3451471.3451508</idno>
	</analytic>
	<monogr>
		<title level="m">2021 The 4th International Conference on Software Engineering and Information Management</title>
				<meeting><address><addrLine>ICSIM, Yokohama, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="236" to="242" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">SecureBERT: A domain-specific language model for cybersecurity</title>
		<author>
			<persName><forename type="first">E</forename><surname>Aghaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Shadid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Al-Shaer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Security and Privacy in Communication Networks</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="39" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Cyber threat analysis using natural language processing for a secure healthcare system</title>
		<author>
			<persName><forename type="first">S</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papastergiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Silvestri</surname></persName>
		</author>
		<idno type="DOI">10.1109/ISCC55528.2022.9912768</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Symposium on Computers and Communications (ISCC)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A machine learning approach for the NLP-based analysis of cyber threats and vulnerabilities of the healthcare ecosystem</title>
		<author>
			<persName><forename type="first">S</forename><surname>Silvestri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papastergiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tzagkarakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ciampi</surname></persName>
		</author>
		<idno type="DOI">10.3390/s23020651</idno>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Cyber threat assessment and management for securing healthcare ecosystems using natural language processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Silvestri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amelin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weiler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papastergiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ciampi</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10207-023-00769-w</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Information Security</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="31" to="50" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Threat modeling of Internet of Things health devices</title>
		<author>
			<persName><forename type="first">A</forename><surname>Omotosho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">A</forename><surname>Haruna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">M</forename><surname>Olaniyi</surname></persName>
		</author>
		<idno type="DOI">10.1080/19361610.2019.1545278</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Security Research</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="106" to="121" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">On threat modeling and mitigation of medical cyberphysical systems</title>
		<author>
			<persName><forename type="first">H</forename><surname>Almohri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Alemzadeh</surname></persName>
		</author>
		<idno type="DOI">10.1109/CHASE.2017.69</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/ACM International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="114" to="119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Integrated use of KOS and deep learning for data set annotation in tourism domain</title>
		<author>
			<persName><forename type="first">G</forename><surname>Aracri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Folino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Silvestri</surname></persName>
		</author>
		<idno type="DOI">10.1108/JD-02-2023-0019</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Documentation</title>
		<imprint>
			<biblScope unit="volume">79</biblScope>
			<biblScope unit="page" from="1440" to="1458" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Exploit prediction scoring system (EPSS)</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jacobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Romanosky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Edwards</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Adjerid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Roytman</surname></persName>
		</author>
		<idno type="DOI">10.1145/3436242</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Threats</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Common Vulnerability Scoring System version 3.1 Specification Document</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A V V</forename></persName>
		</author>
		<ptr target="https://www.first.org/cvss/v3-1/cvss-v31-specification_r1.pdf" />
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>FIRST.Org</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
