Using Sentence Embeddings and Semantic Similarity
for Seeking Consensus when Assessing Trustworthy
AI
Dennis Vetter¹, Jesmin Jahan Tithi², Magnus Westerlund³, Roberto V. Zicari³,⁴ and
Gemma Roig¹
¹Goethe University Frankfurt, 60629 Frankfurt am Main, Germany
²Intel Labs, Santa Clara, CA 95054, United States
³Arcada University of Applied Sciences, 00550 Helsinki, Finland
⁴Seoul National University, Seoul 08826, South Korea


                                         Abstract
                                         Assessing the trustworthiness of artificial intelligence systems requires knowledge from many different
                                         disciplines. These disciplines do not necessarily share concepts between them and might use words with
                                         different meanings, or even use the same words differently. Additionally, experts from different disciplines
                                         might not be aware of specialized terms readily used in other disciplines. Therefore, a core challenge of
                                         the assessment process is to identify when experts from different disciplines talk about the same problem
                                         but use different terminologies. In other words, the problem is to group problem descriptions (a.k.a.
                                         issues) with the same semantic meaning but described using slightly different terminologies.
                                             In this work, we show how we employed recent advances in natural language processing, namely
                                         sentence embeddings and semantic textual similarity, to support this identification process and to bridge
                                         communication gaps in interdisciplinary teams of experts assessing the trustworthiness of an artificial
                                         intelligence system used in healthcare.

                                         Keywords
                                         Sentence Embedding, Semantic Similarity, Natural Language Processing, Trustworthy Artificial Intelli-
                                         gence




1. Introduction
The design, development and implementation of artificial intelligence (AI) systems require
knowledge from many different disciplines to be successful. Therefore, the teams involved in
AI projects are often interdisciplinary to provide knowledge of all the relevant areas. Each
area of expertise comes with its own specialized language, terms, definitions and jargon that
can make communication between experts from different fields challenging, as they do not
necessarily share the same concepts and may use the same words to mean something different
[1]. Additionally, people from one field are often not familiar with specialized terms
used in another field. For example, an AI engineer might know the meaning of the terms
1st International Workshop on Imagining the AI Landscape After the AI Act (In conjunction with The first International
Conference on Hybrid Human-Artificial Intelligence), June 13, 2022, Amsterdam, The Netherlands
$ vetter@em.uni-frankfurt.de (D. Vetter); roig@cs.uni-frankfurt.de (G. Roig)
 0000-0002-5977-5535 (D. Vetter); 0000-0002-6439-8076 (G. Roig)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Figure 1: Schematic overview of the AI solution with the sub-tasks segmentation, alignment and Brixia
score estimation. For the Brixia score, the lung is separated into 6 regions and each is rated with a
number from 0 (no damage) to 3 (high damage) [6]. Image modified from [5].


“precision and recall”, whereas a healthcare professional may know the term “prognosis”,
which the AI engineer might not.
   One practical example of these interdisciplinary communication challenges arises when a
team of interdisciplinary experts assesses an AI system for its trustworthiness
[2, 1, 3, 4]. The stakeholders performing the assessment need to be aware of possible differences
in the meaning of specialized terms so that they can understand each other properly. This
requires them to cooperate with each other to work on a common vocabulary [2, 1].
   In this paper, we show how recent advances in the AI domain of natural language processing
(NLP) can be used to support this process. Concretely, we apply it in the assessment of the
trustworthiness of an AI system developed to evaluate the degree of lung damage in COVID-
19 patients from their chest X-ray (CXR) images. Italian researchers developed the AI system in
early 2020 to support the radiologists of a local hospital during the drastic rise in COVID-19
cases that overwhelmed the hospital system [5]. The goal of the system was to provide
the radiologists with a qualified second opinion so they can work more confidently, faster, and
with fewer mistakes.
   The assessed AI system consists of multiple neural networks, one for each of the following
sub-tasks: (1) segmentation of the CXR image into lung and background, (2) alignment of the
image, and (3) estimation of the semi-quantitative Brixia score. For the Brixia score, the lung is
separated into six regions and each region is assigned a number between 0 (no damage) and 3
(highly damaged). This separation into different areas and scoring based on a pre-defined set of
values allows for efficient communication between radiologists [6]. A schematic view of the
tasks performed by the AI system is given in Fig. 1.
   To train the networks, the researchers collected a large dataset of CXR images and annotations
by either one radiologist (used for training) or the consensus of multiple radiologists (used for
evaluation). Their results show that the AI system performs as well as an average
human radiologist [5].
   The assessment of the above AI system [5] used the Z-Inspection® process described by Zicari
et al. [2], which is a holistic approach and includes participation of the entire community of
key stakeholders. For assessing trustworthiness, Z-Inspection® builds on the Ethics Guidelines
for Trustworthy AI by the European Commission’s High-Level Expert Group on AI with the
four ethical principles of (i) respect for human autonomy, (ii) prevention of harm, (iii) fairness,
and (iv) explicability, which are implemented through the seven key requirements of (1) human
agency and oversight, (2) technical robustness and safety, (3) privacy and data governance,
(4) transparency, (5) diversity, non-discrimination and fairness, (6) societal and environmental
well-being, and (7) accountability [3].
   Part of the assessment is to use socio-technical scenarios [7, 8] to identify different potential
issues (ethical, legal, technical, etc.) with the system, based on interviews with the whole team,
the developers, other stakeholders, and additional materials such as academic papers, source
code, and available datasets. To achieve disciplinary depth, the group of stakeholders is
split into working groups (WGs) according to the different backgrounds of the participants.
Each of these WGs then describes what potential issues/problems/tensions (conflicts between
two or more desirable goals) they see with the system. This is followed by the mapping step,
where the issues are structured and connected to the ethical principles and key requirements
that they are conflicting with [2]. The goal of this mapping step is to have a description of the
issues in “structural ethical terms” [9]. The output of this mapping is then used for consolidation
where a group-based consensus is reached regarding which issues can be combined and which
issues are redundant. This consolidation makes it possible to distill the most critical issues identified
about the system; the consolidated statement is then reported to the system’s developers and
stakeholders, along with recommendations on possible steps to mitigate the issues or lower
their impact. A schematic illustration of this process can be found in Fig. 2.




Figure 2: Schematic illustration of the mapping process. First step is to build a common knowledge
base to develop socio-technical scenarios. Then the group is separated into WGs, according to the
different backgrounds. The results of the WGs are combined in the consolidation step, based on which
recommendations to the stakeholders are made. Adapted from [2].


   For the assessment, the team consisted of a large number of participants from many different
disciplines who described the issues they identified in their own language and jargon from their
fields. This resulted in a large number of issues, sometimes talking about similar things from
slightly different perspectives using different terminology. According to the participants, the
large number made manual consolidation as described in [2, 9] both intellectually challenging
and labor-intensive. They described the main difficulty as identifying which issues could be
combined. From reports of previous assessments [9, 10, 11], it was considered highly likely that
different issues could be combined as they describe the same tension, but the final number of
tensions, as well as the number of issues per tension, were infeasible to estimate.
   To help in the consolidation process, we decided to use modern text analysis methods to
lift semantic meaning from the text based on the concept of Semantic Textual Similarity (STS)
[12]. In NLP, STS is the task of determining the overlap in meaning between texts. The goal
of STS is to provide a numerical score where high values indicate that two texts have similar
meanings and low values indicate that their meanings are different [12]. In this context, the task
of identifying issues that describe the same conflict can be seen as identifying and clustering
groups of issues that share high STS scores.
   We make the following contributions:
      • we show how NLP models can facilitate communication between experts from different
        domains in the trustworthy AI assessment process,
      • we present and evaluate two STS-based approaches to group related issues identified
        by multidisciplinary teams of experts: 1) a clustering-based approach and 2) a graph-based
        approach, both of which use deep learning-based STS computation for scoring,
      • we show that the graph-based approach works comparably well to clustering, while not
        requiring tuning of hyperparameters.


2. Method
2.1. Word Embeddings and Sentence Embeddings
Currently, the best-performing systems for STS use deep learning-based embeddings.
The basic type of embeddings are word embeddings. In word embeddings, a deep neural
network is used to map a word into a fixed-dimensional vector space. This mapping is done in a
way that captures the meaning of the word, so that words with similar meanings have similar
vector representations, and analogies in word meanings can be approximated by mathematical
operations. As an example, with the analogy “king is to queen as man is to woman”, the
encoding emb_X in the vector space should fulfill the equation
emb_king − emb_queen ≈ emb_man − emb_woman [13, 14, 15].
  Sentence embeddings are extensions of word embeddings to complete sentences. Again, deep
neural networks are used to map the sentence into a high-dimensional vector space, so that the
vector representation also captures the meaning of the sentence [16, 17, 18].

2.2. Measuring Semantic Textual Similarity - STS
After training word or sentence embeddings, the semantic textual similarity of words or sen-
tences is computed from the similarity of their vector representations. A popular metric for this
is the cosine similarity. For two words or sentences A and B, this is defined as the cosine of the
angle θ between their vector representations emb_A and emb_B:

    similarity(A, B) := cos(θ) = (emb_A^T · emb_B) / (||emb_A|| · ||emb_B||)        (1)
  The computation of STS scores from embeddings is widely used for a variety of tasks such as
checking if similar questions were already asked in a forum or the identification of different
topics in large text corpora [18].
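Eq. (1) can be sketched in a few lines of NumPy; the three-dimensional vectors below are toy stand-ins for real, high-dimensional sentence embeddings:

```python
import numpy as np

def cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, as in Eq. (1)."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a -> similarity 1
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a -> similarity 0

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

Because the score depends only on the angle between the vectors, it is insensitive to the length of the compared texts.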

2.3. Identifying groups of similar issues
For the identification of groups of similar issues, we compared two approaches: 1) clustering-
based and 2) graph-based.

Cluster-based group identification. Separating a set of objects into groups such that objects
in the same group have higher similarity and objects in different groups have a lower similarity
is the description of a classical clustering problem. Good clustering is best achieved through an
iterative process with four key steps: (1) feature selection, (2) cluster identification, (3) cluster
validation, and (4) result interpretation. Validation and interpretation are especially important,
as algorithms used for cluster identification can always find a division of the objects, but judging
whether the division is appropriate and useful, or whether a different division should be produced,
is a decision to be made by the user [19].
   In our use-case, feature extraction is performed by creating sentence embeddings that map
the English text to a high-dimensional vector. An essential strength of this approach is that it
allows us to use raw sentences and does not require any preprocessing. This makes the approach
straightforward, especially when compared to other approaches where high-quality results may
require extensive preprocessing pipelines and tuning [20, 21]. The following step is to perform
dimensionality reduction, as clustering algorithms are known to have problems when working
with high-dimensional vectors. We used UMAP [22] to map the high-dimensional embedding
vectors to lower dimensions, such that most of the relevant local and global structures in the
data are preserved [22]. Compared with other popular dimensionality reduction techniques,
UMAP preserves more of the global and local structure of the data than PCA [22], while also
producing more compact and better separated clusters than t-SNE [22, 23], which makes it well
suited to our task.
   The next step is to iteratively use a clustering algorithm and verify and interpret the resulting
clusters until a satisfactory result is found. With this approach, the different clusters correspond
to the different groups of issues with high similarity. Fig. 3 in the next section shows the output
of this approach.

Graph-based group identification. Another approach that works well with data with a
similarity measure is spectral clustering [24]. For spectral clustering, the data is arranged in a
weighted, fully connected graph, which is called the similarity graph. In the similarity graph,
each node corresponds to a data point and the weight of the edge between two nodes to the
similarity of the two associated data points. This allows to reformulate the clustering problem
into a graph partitioning problem, where the edges between partitions have low weights [24].
A popular variation of the similarity graph is the k-nearest neighbor graph. With this variation,
a node N_I is connected to another node N_J if N_J is among the k nearest neighbors of N_I [24].
   Applied to the use-case considered here, the nodes in the similarity graph correspond to
issues, and the weight of the edge between two nodes corresponds to the cosine similarity
of their embeddings. To simplify the resulting graph, we apply the 1-nearest-neighbor graph
variation, meaning that each node is only connected to the node of its most similar issue. With
this construction, we found that the similarity graph consists of multiple weakly connected
components, groups of connected nodes with no connections between nodes from different
groups. This simplified the spectral clustering task to identifying the weakly connected com-
ponents, which in turn provide the separation into groups of issues with high similarity. In
addition, we use the PageRank algorithm [25] to assign importance to each of the nodes, based
on the connected nodes and their respective importance. The idea behind this is that nodes with
many incoming edges are more important and often better represent an underlying issue
than nodes with only one incoming edge. Fig. 4 in the next section shows the output of
this approach; the outputs of the two approaches are also compared there.
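The construction above can be sketched in plain NumPy, assuming a precomputed pairwise cosine-similarity matrix; the union-find for weakly connected components and the power-iteration PageRank are simplified stand-ins for library implementations such as NetworkX:

```python
import numpy as np

def one_nn_groups(sim: np.ndarray):
    """Group items via the 1-nearest-neighbor similarity graph.

    Each node gets one outgoing edge to its most similar other node;
    groups are the weakly connected components of that directed graph.
    """
    n = sim.shape[0]
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)          # exclude self-similarity
    nearest = s.argmax(axis=1)            # outgoing 1-NN edge of each node

    # Union-find over the undirected version of the edges
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        parent[find(i)] = find(nearest[i])

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values()), nearest

def pagerank(nearest, n, damping=0.85, iters=100):
    """Simple PageRank by power iteration on the 1-NN graph."""
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1 - damping) / n)
        for i, j in enumerate(nearest):
            new[j] += damping * rank[i]   # node i passes its damped rank along its single edge
        rank = new
    return rank

# Toy similarity matrix: issues 0/1 and 2/3 form two similar pairs.
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.2],
                [0.1, 0.1, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
groups, nearest = one_nn_groups(sim)
print(sorted(sorted(g) for g in groups))  # [[0, 1], [2, 3]]
```

Within each recovered group, the node with the highest PageRank score serves as the candidate representative issue.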


3. Experiments
In this section, we present the dataset and a subjective evaluation of the results of the two
approaches. Code to reproduce our findings is available on GitHub1 .

3.1. Dataset
The dataset was made available to us by the authors of the use-case [26]. It contains the issues
as described by the different expert WGs in a tabular form. Each issue has the following
information: an ID, WG name, a title, and a description. The title is a short summary of the
issue, while the description provides additional context; the sentence embedding is computed
from a concatenation of both. An example issue is listed in Table 1.

Table 1
Example issue with ID, WG, Title, and Description.
                  ID     E2
                WG       ethics / healthcare
                Title    Not all patients may benefit equally from the tool.
         Description     The adoption of the system may lead to different care standards for different
                         patient groups.

   In total, the dataset consists of 58 issues described by 51 experts in the six working groups:
technical, social, ethics, ethics / healthcare, radiologists, and healthcare. Table 2 gives a summary
of size and issues described by each WG.

3.2. Evaluation of different sentence embeddings
Computations of sentence embeddings are central to our approaches, as this step implicitly
defines the similarities between the issues. It is therefore important to use a well-performing NLP
model for this task, for which deep neural networks are state-of-the-art [18]. The implementation

    1
        https://github.com/dennisrv/iail2022
Table 2
Size and number of issues per WG
                                       WG            Members   Issues
                                    technical          21        23
                                      social            5         9
                                      ethics            3         4
                               ethics / healthcare      4         8
                                  radiologists          3         5
                                   healthcare          15         9
                                      total            51        58

provided by Reimers et al. [18] makes it possible to use a number of different large, pre-trained
networks. For our use case, the all-mpnet-base-v2 network produced the best results. This
network uses MPNet, a transformer architecture with 12 layers, 12 attention heads and a hidden
size of 768 [27], which was then fine-tuned for general purpose textual similarity tasks using a
dataset with over one billion sentence pairs [28]. In general, we could observe a correlation
between the subjective quality of the embeddings and the average performance of the network
on several NLP tasks, consistent with the findings in [28]. With this network, each sentence
embedding is a 768-dimensional real-valued vector.

3.3. Results of the cluster-based approach
The clustering-based approach required some tuning of parameters to achieve a good result.
The best results were achieved with a two-step dimensionality reduction with UMAP, which
first reduced the 768 dimensions of the sentence embeddings to 15 and then to 2. Following this
step, we performed a clustering with the HDBSCAN algorithm [29], as this algorithm can find a
good number of clusters from the data and does not need the desired number of clusters as an
input parameter. Fig. 3 shows the results of the clustering approach.
   The result contained 12 groups of issues with most of them containing issues from different
WGs. As expected, most of the groups were rather small with 3-5 issues and one larger group
containing 9 issues. Through manual inspection, we found that most of the group assignments
were reasonable, and only a few cases of wrongly assigned issues were found. An example of this
is that the issue Transparency would seem to be enhanced if others could have access to the system
was clustered with issues that were about concerns regarding data safety and privacy.
   The strength of this approach is that the low-dimensional mapping with UMAP enables a 2D
visualization of the clusters and their relative positions. It is therefore possible to identify cases
where a manual inspection reveals that two clusters might be about the same topic, as
such clusters will be close to each other. An example of this is clusters 2 and 3 (bottom left)
in Fig. 3. These clusters are thematically related; cluster 2 contains issues about privacy in the
dataset, while cluster 3 contains issues about data safety and access to the dataset.
Figure 3: Clustering of issues after using UMAP for a mapping to 2 dimensions that preserves most of
the relevant local and global structure in the data.


3.4. Results of the graph-based approach
Our special construction of the similarity graph and the following simplification of the spectral
clustering task to the identification of weakly connected components made it possible for us
to omit a pre-specification of the number of clusters, a common input parameter for spectral
clustering algorithms [24]. Instead, the number of clusters emerged naturally as the number of
weakly connected components.
   In Fig. 4 we show the results of the graph-based approach. This approach identified 11
groups of issues (i.e., clusters), with more equally distributed sizes compared to the cluster-
based approach. Most of the groups also contained issues from at least 2 different WGs. While
the result of this approach and the clustering-based approach were not identical, a manual
inspection confirmed that it also produced a reasonable grouping of issues.
   The strength of this approach is that it does not require tuning. In addition, nodes with high
importance were generally found to capture group content well, which facilitated the manual
review. An example of this can be seen later in Table 3 where the top issue is the most important
and also captures the problem at a more general level.

3.5. Comparison of the approaches
Comparing the two approaches, we could observe that the cluster-based approach seemed to
prefer grouping the issues in smaller, more specific groups, such as concerns about stakeholder
inclusion (4 issues) and concerns on patient benefits (3 issues). The graph-based approach
was found more likely to combine issues from multiple smaller clusters into one larger group,
such as inclusion of and benefit for patients unclear (9 issues). The differences in size of produced
groups are highlighted in Fig. 5a. Additionally, we found the graph-based approach more
Figure 4: Example of the similarity graph constructed from the issues identified by the experts. The
color of a node corresponds to the WG describing the issue. The thickness of the edges is proportional to
the similarity of connected issues, the size of nodes is proportional to the importance of the associated
issues.


likely to assign issues to groups where we could see no clear connection, although such issues
had low importance and were therefore easy to identify. In contrast, the clustering-based
approach subjectively produced fewer inappropriate groupings, but the lack of an importance
measure within the cluster made the issues that do not belong more difficult to identify.
   Furthermore, we could also observe that the two approaches agreed on which issues belonged
to the same group in many cases. While often there was no complete agreement, there was still
a high overlap between the assigned groups, as shown in Fig. 5b. For this purpose, we computed
the overlap of sets of issues s_A and s_B as

    overlap(s_A, s_B) = |s_A ∩ s_B| / max(|s_A|, |s_B|)        (2)
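Eq. (2) maps directly onto Python sets; a minimal sketch with hypothetical issue IDs:

```python
def overlap(s_a: set, s_b: set) -> float:
    """Overlap of two issue groups, as in Eq. (2)."""
    if not s_a or not s_b:
        return 0.0  # guard against empty groups (Eq. (2) is undefined there)
    return len(s_a & s_b) / max(len(s_a), len(s_b))

# Two hypothetical issue groups, one from each approach.
cluster_group = {"E2", "T4", "T7"}
graph_group = {"E2", "T4", "T7", "S1"}
print(overlap(cluster_group, graph_group))  # 0.75
```

Dividing by the larger group size keeps the score in [0, 1] and penalizes cases where one approach merges several of the other approach's groups.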

   An example of a group of issues that both approaches agreed on can be seen in Table 3.
   For the current assessment [26], we only used the results of the graph-based approach for a
pre-screening of the issue groupings during the consolidation phase. However, we plan to use
a combination of both clustering approaches for future assessments, as both provide slightly
different perspectives and, therefore, are a good start for the discussion between participants.
We should also note that in some cases, it was not immediately apparent to the participants
whether issues talk about the same problem or not; this could only be solved via discussion and
group consensus.
(a) Sizes of groups identified by the different ap-   (b) Overlap with the most similar group identified
    proaches.                                             by the other approach.
Figure 5: Histogram of group sizes (a) and overlap with the most similar group identified by the other
approach (b).


Table 3
Issues that belong to the same group in both approaches. The issues are ordered according to their
importance as assigned by the graph-based approach (descending).
         WG       Description
      technical   The dataset used for training is likely not representative for the general popu-
                  lation it is currently used on
        ethics    [..] there is no way to know whether diverse demographics receive disparate
                  treatment.
      technical   The model is trained on a particular set of devices and software, undermining
                  the reliability in different scenarios and context.


3.6. Limitations
While we found the two approaches to produce sufficient results for our purpose, we could
not verify them with data from additional use-cases, as such data was not readily available.
In addition, we observed cases where the sentence embeddings put too much importance on
single words or phrases. For example, the issues “Transparency would seem to be enhanced if
others could have access to the system” and “There is a [data safety] concern if data and software
engineers have access to the system and others outside of the medical profession” were assigned
to the same group. While the issues have different meanings, the overlap in words used was
sufficient to let them appear “similar enough” to the sentence embedding network. Another
occurrence was that all issues containing the word “Score” were grouped together; some
of them were later manually reassigned to other groups.
   Our proposed solution is to have an iterative process in which discussions with the
stakeholders about the results of the grouping are conducted, and to use this approach as a
support tool only, not one that gives the definitive answer.
4. Conclusions
Sentence embeddings and semantic textual similarity can be a useful tool for a Trustworthy AI
self-assessment to help an interdisciplinary team of experts and stakeholders with identifying
possible risks related to the use of an AI system. Our approach was used in practice in a complex
use-case with over 50 experts. The approach was used to support initial expert discussions and
help build group consensus in a situation where a large number of participants in the assessment
made manual consolidation very time-consuming and cumbersome. Participants described
it as too demanding for one person to be aware of everyone else’s work, making it difficult
to find consensus. Instead, our analytical method helped by providing experts with an initial
descriptive measure to start the consolidation discussion. Since both modeling approaches
presented provided an initial result of sufficient and similar quality, we cannot say that one
approach is clearly superior to the other. However, the main advantage of both approaches is
that they provide an initial grouping of issues. This initial grouping made it much easier to
understand the different issues and helped the experts to get a broad picture of the work
done by other groups. Because the groupings of issues share a common semantic topic,
it was also easier to identify errors in the algorithmic approach and to spot groupings of
issues that might belong together.
   To summarize, in the eyes of the participants, the main strength of our method was that it
improved their ability to effectively participate in the communication and focus on contributing
to the assessment process. In future assessments, we plan to further validate this approach for
consolidation and to investigate with a panel of stakeholders which of the two approaches is
more effective for finding consensus.


Acknowledgments
DV received funding from the European Union’s Horizon 2020 research and innovation program
under grant agreement no. 101016233 (PERISCOPE), and from the European Union’s Connecting
Europe Facility program under grant agreement no. INEA/CEF/ICT/A2020/2276680 (xAIM).
The funders had no role in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.


References
 [1] J. Whittlestone, R. Nyrup, A. Alexandrova, K. Dihal, S. Cave, Ethical and societal implica-
     tions of algorithms, data, and artificial intelligence: a roadmap for research, Nuffield Foun-
     dation, London, 2019. URL: https://www.nuffieldfoundation.org/wp-content/uploads/2019/
     02/Ethical-and-Societal-Implications-of-Data-and-AI-report-Nuffield-Foundat.pdf.
 [2] R. V. Zicari, J. Brodersen, J. Brusseau, B. Düdder, T. Eichhorn, T. Ivanov, G. Kararigas,
     P. Kringen, M. McCullough, F. Möslein, N. Mushtaq, G. Roig, N. Stürtz, K. Tolle, J. J.
     Tithi, I. van Halem, M. Westerlund, Z-Inspection®: A Process to Assess Trustworthy AI,
     IEEE Transactions on Technology and Society 2 (2021) 83–97. doi:10.1109/TTS.2021.
     3066209, conference Name: IEEE Transactions on Technology and Society.
 [3] High-Level Expert Group on Artificial Intelligence, Ethics guidelines for trustworthy
     AI, Text, European Commission, 2019. URL: https://op.europa.eu/en/publication-detail/-/
     publication/d3988569-0434-11ea-8c1f-01aa75ed71a1.
 [4] High-Level Expert Group on Artificial Intelligence, Assessment List for Trustworthy
     Artificial Intelligence (ALTAI) for self-assessment, Text, European Commission, 2020. URL:
     https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=68342.
 [5] A. Signoroni, M. Savardi, S. Benini, N. Adami, R. Leonardi, P. Gibellini, F. Vaccher, M. Ra-
     vanelli, A. Borghesi, R. Maroldi, D. Farina, BS-Net: Learning COVID-19 pneumonia
     severity on a large chest X-ray dataset, Medical Image Analysis 71 (2021) 102046. URL:
     https://www.sciencedirect.com/science/article/pii/S136184152100092X. doi:10.1016/j.
     media.2021.102046.
 [6] A. Borghesi, A. Zigliani, S. Golemi, N. Carapella, P. Maculotti, D. Farina, R. Maroldi,
     Chest X-ray severity index as a predictor of in-hospital mortality in coronavirus disease
     2019: A study of 302 patients from Italy, International Journal of Infectious Diseases
     96 (2020) 291–293. URL: https://linkinghub.elsevier.com/retrieve/pii/S1201971220303283.
     doi:10.1016/j.ijid.2020.05.021.
 [7] J. Leikas, R. Koivisto, N. Gotcheva, Ethical Framework for Designing Autonomous In-
     telligent Systems, Journal of Open Innovation: Technology, Market, and Complexity 5
     (2019) 18. URL: https://www.mdpi.com/2199-8531/5/1/18. doi:10.3390/joitmc5010018.
 [8] F. Lucivero, Ethical Assessments of Emerging Technologies: Appraising the moral plausi-
     bility of technological visions, volume 15 of The International Library of Ethics, Law and
     Technology, 1st ed., Springer International Publishing, Cham, 2016. doi:10.1007/
     978-3-319-23282-9.
 [9] J. Brusseau, What a Philosopher Learned at an AI Ethics Evaluation, AI Ethics Journal
     1 (2020). URL: https://www.aiethicsjournal.org/10-47289-aiej20201214. doi:10.47289/
     AIEJ20201214.
[10] R. V. Zicari, J. Brusseau, S. N. Blomberg, H. C. Christensen, M. Coffee, M. B. Ganapini,
     S. Gerke, T. K. Gilbert, E. Hickman, E. Hildt, S. Holm, U. Kühne, V. I. Madai, W. Osika,
     A. Spezzatti, E. Schnebel, J. J. Tithi, D. Vetter, M. Westerlund, R. Wurth, J. Amann, V. An-
     tun, V. Beretta, F. Bruneault, E. Campano, B. Düdder, A. Gallucci, E. Goffi, C. B. Haase,
     T. Hagendorff, P. Kringen, F. Möslein, D. Ottenheimer, M. Ozols, L. Palazzani, M. Petrin,
     K. Tafur, J. Tørresen, H. Volland, G. Kararigas, On Assessing Trustworthy AI in Healthcare.
     Machine Learning as a Supportive Tool to Recognize Cardiac Arrest in Emergency Calls,
     Frontiers in Human Dynamics 3 (2021) 30. URL: https://www.frontiersin.org/article/10.
     3389/fhumd.2021.673104. doi:10.3389/fhumd.2021.673104.
[11] R. V. Zicari, S. Ahmed, J. Amann, S. A. Braun, J. Brodersen, F. Bruneault, J. Brusseau,
     E. Campano, M. Coffee, A. Dengel, B. Düdder, A. Gallucci, T. K. Gilbert, P. Gottfrois, E. Goffi,
     C. B. Haase, T. Hagendorff, E. Hickman, E. Hildt, S. Holm, P. Kringen, U. Kühne, A. Lucieri,
     V. I. Madai, P. A. Moreno-Sánchez, O. Medlicott, M. Ozols, E. Schnebel, A. Spezzatti, J. J. Tithi,
     S. Umbrello, D. Vetter, H. Volland, M. Westerlund, R. Wurth, Co-Design of a Trustworthy
     AI System in Healthcare: Deep Learning Based Skin Lesion Classifier, Frontiers in Human
     Dynamics 3 (2021) 40. URL: https://www.frontiersin.org/article/10.3389/fhumd.2021.688152.
     doi:10.3389/fhumd.2021.688152.
[12] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 Task 1: Semantic
     Textual Similarity Multilingual and Crosslingual Focused Evaluation, in: Proceedings of
     the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for
     Computational Linguistics, Vancouver, Canada, 2017, pp. 1–14. URL: http://aclweb.org/
     anthology/S17-2001. doi:10.18653/v1/S17-2001.
[13] J. Pennington, R. Socher, C. Manning, Glove: Global Vectors for Word Representation, in:
     Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
     (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543.
     URL: http://aclweb.org/anthology/D14-1162. doi:10.3115/v1/D14-1162.
[14] T. Mikolov, W.-t. Yih, G. Zweig, Linguistic regularities in continuous space word rep-
     resentations, in: Proceedings of the 2013 conference of the north american chapter of
     the association for computational linguistics: Human language technologies, 2013, pp.
     746–751.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed Representations
     of Words and Phrases and their Compositionality, in: Advances in Neural Information
     Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://proceedings.
     neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
[16] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal
     Sentence Representations from Natural Language Inference Data, in: Proceedings of the
     2017 Conference on Empirical Methods in Natural Language Processing, Association
     for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 670–680. URL: https:
     //aclanthology.org/D17-1070. doi:10.18653/v1/D17-1070.
[17] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-
     Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, R. Kurzweil, Universal Sentence Encoder,
     arXiv:1803.11175 [cs] (2018). URL: http://arxiv.org/abs/1803.11175.
[18] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-
     Networks, arXiv:1908.10084 [cs] (2019). URL: http://arxiv.org/abs/1908.10084.
[19] R. Xu, D. C. Wunsch, Survey of Clustering Algorithms, IEEE Transactions on Neural
     Networks 16 (2005) 645–678.
[20] V. Srividhya, R. Anitha, Evaluating preprocessing techniques in text categorization,
     International Journal of Computer Science and Application 47 (2010) 49–51. URL: http:
     //sinhgad.edu/ijcsa-2012/pdfpapers/1_11.pdf.
[21] S. Vijayarani, J. Ilamathi, Nithya, Preprocessing Techniques for Text Mining - An
     Overview, International Journal of Computer Science & Communication Networks 5
     (2015) 11.
[22] L. McInnes, J. Healy, J. Melville, UMAP: Uniform Manifold Approximation and Projection
     for Dimension Reduction, arXiv:1802.03426 [cs, stat] (2020). URL: http://arxiv.org/abs/
     1802.03426.
[23] D. Kobak, G. C. Linderman, Initialization is critical for preserving global data structure
     in both t-SNE and UMAP, Nature Biotechnology 39 (2021) 156–157. URL: https://www.
     nature.com/articles/s41587-020-00809-z. doi:10.1038/s41587-020-00809-z.
[24] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17
     (2007) 395–416. URL: http://link.springer.com/10.1007/s11222-007-9033-z. doi:10.1007/
     s11222-007-9033-z.
[25] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing
     Order to the Web., Technical Report 1999-66, Stanford InfoLab, 1999. URL: http://ilpubs.
     stanford.edu:8090/422/.
[26] H. Allahabadi, J. Amann, I. Balot, A. Beretta, C. Binkley, J. Bozenhard, F. Bruneault,
     J. Brusseau, S. Candemir, L. A. Cappellini, S. Chakraborty, N. Cherciu, C. Cociancig,
     M. Coffee, I. Ek, L. Espinosa-Leal, D. Farina, G. Fieux-Castagnet, T. Frauenfelder, A. Gallucci,
     G. Giuliani, A. Golda, I. van Halem, E. Hildt, S. Holm, G. Kararigas, S. A. Krier, U. Kühne,
     F. Lizzi, V. I. Madai, A. F. Markus, S. Masis, E. Wiinblad Mathez, F. Mureddu, E. Neri,
     W. Osika, M. Ozols, C. Panigutti, B. Parent, F. Pratesi, P. A. Moreno-Sánchez, G. Sartor,
     M. Savardi, A. Signoroni, H. Sormunen, A. Spezzatti, A. Srivastava, A. F. Stephansen, B. T.
     Lau, J. J. Tithi, J. Tuominen, S. Umbrello, F. Vaccher, D. Vetter, M. Westerlund, R. Wurth,
     R. V. Zicari, Assessing Trustworthy AI in times of COVID-19. Deep Learning for predicting
     a multi-regional score conveying the degree of lung compromise in COVID-19 patients.,
     Preliminary manuscript made available by the authors (2022).
[27] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MPNet: Masked and Permuted Pre-training for
     Language Understanding, arXiv:2004.09297 [cs] (2020). URL: http://arxiv.org/abs/2004.
     09297.
[28] N. Reimers, Pretrained Models — Sentence-Transformers documentation, n.d. URL:
     https://www.sbert.net/docs/pretrained_models.html, accessed: 2022-01-11.
[29] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering, The Journal
     of Open Source Software 2 (2017) 205. URL: http://joss.theoj.org/papers/10.21105/joss.00205.
     doi:10.21105/joss.00205.