An Application of the Disease Ontology (DO) for Clustering COVID-19 Hospitalizations in Rio de Janeiro

Lucas Maddalena 1 and Fernanda Baião 1

1 Department of Industrial Engineering, Pontifical Catholic University of Rio de Janeiro, Rua Marquês de São Vicente 225, Rio de Janeiro, 22451-900, Brazil

Abstract
In the 21st century, the exponential growth of technology has left the world facing a myriad of information coming from a multitude of sources. Finding ways of storing knowledge that is committed to explicit rules has therefore become imperative. Ontologies play an important role in connecting data to the semantics of the real world: an ontology is an artifact that formalizes shared knowledge about a domain of discourse. Without such ontological commitment, data can be interpreted as representing entities other than the ones it actually represents, leading to biased analyses and inaccurate predictions in data-driven projects. Building on works that show the benefits of bringing ontologies into Machine Learning, this study uses the Human Disease Ontology (DO) to enrich similarity metrics between data instances. Instead of calculating pairwise similarities between two diseases (terms in DO), groups of diseases are considered; the work therefore adapts a groupwise similarity metric. Data were collected from the SIVEP-Gripe dataset. We then analyze how much better Machine Learning algorithms perform when the analysis considers semantic features rather than only numerical and categorical ones.

Keywords
Disease Ontology, COVID-19, Clustering

Introduction
In December 2019, the first case of coronavirus disease (COVID-19), caused by the SARS-CoV-2 virus, was reported. It did not take long for the disease to reach enormous proportions and become a worldwide concern, and on March 11th, 2020, the World Health Organization (WHO) declared the outbreak a global pandemic [1].
COVID-19 has affected every corner of the world, and data about it comes from a vast number of different providers. As a result, data integration in the COVID-19 domain can be compromised, and semantic commitments should be considered when handling pandemic data. As an illustration, between January 15 and March 2, 2020, the Chinese government issued seven different versions of the COVID-19 case definition, and [2] estimate that the lack of a stable definition over time led China's official case counts to increase by up to 7.1 times (95% CI, 4.8–10.9) from one definition to the next.

One of the main purposes of ontologies is to make real-world data semantics explicit [3]. Consequently, this kind of artifact offers many benefits: it can serve as a communication artifact among different stakeholders, as a common data model to mediate data integration and access, or as a formal specification that enables reasoning on data. In the COVID-19 domain, several works have already proposed ontologies and applications, such as [4,5,6,7].

Proceedings of the 15th Seminar on Ontology Research in Brazil (ONTOBRAS) and 6th Doctoral and Masters Consortium on Ontologies (WTDO), December 22-25, 2022.
EMAIL: lucasgm@tecgraf.puc-rio.br (A. 1); fbaiao@puc-rio.br (A. 2)
ORCID: 0000-0001-6411-4259 (A. 1); 0000-0001-7932-7134 (A. 2)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Recently, the multiple benefits of ontologies (including foundational ontologies, conceptual models, and other semantically aware artifacts) for enhancing data analysis and knowledge extraction have been increasingly advocated.
In this context, [8] show how ontologies, and specifically foundational ontologies, can bring multiple benefits to every step of the Data Science Life Cycle, while [9] show the benefits of pairing conceptual models and ontologies with Machine Learning (ML) techniques.

The present work focuses on data about the comorbidities (i.e., diseases) of patients who were diagnosed with COVID-19 and hospitalized in the state of Rio de Janeiro. The main objective is to analyze the impact of a semantically aware approach on finding similar subsets of hospitalizations in the dataset. To this end, we applied a partition-based clustering technique and compared its results in two scenarios. The first (semantically unaware) scenario represented each hospitalization as a binary vector of comorbidities and applied the conventional cosine similarity metric. The second (semantically aware) scenario was built as follows.

The Disease Ontology (DO) [10] integrates disease and medical vocabularies through extensive cross-mapping of DO terms to other medical ontologies, such as MeSH. We matched each comorbidity found in the dataset with a corresponding concept in DO. A total of 161 distinct diseases were linked to DO concepts, and we observed 465 different combinations of diseases across all patients in the dataset. To compute similarities between individual comorbidities, we applied the measure proposed by [11], a semantic similarity metric from the bio-ontologies domain that can be applied to DO terms. However, since each hospitalized patient is characterized by a (possibly empty) set of comorbidities, comparing distinct hospitalizations requires a groupwise similarity metric, i.e., one that measures the similarity between two groups of diseases, each representing the diseases of a hospitalized COVID-19 patient.
For instance, while a pairwise metric compares two terms, such as “diabetes” and “asthma”, a groupwise similarity metric compares two sets of terms, such as “diabetes, Gilbert’s syndrome and flu” and “psoriasis and AIDS”. Therefore, we applied the metric proposed by [12], which was originally used to calculate groupwise similarities between SNOMED CT terms, to sets of DO terms. Hence, the semantically aware groupwise similarity between hospitalizations proposed in our work was computed by combining the groupwise metric of [12] with the pairwise similarity between DO terms of [11].

The impact of the proposed semantically aware approach on finding similar subsets of hospitalizations is assessed in the Data Post-Processing step using cluster quality metrics. An additional analysis was performed to show how well the resulting clusters from each scenario partitioned the subsets of diseases.

Disease Ontology (DO)
In this research, we use the Human Disease Ontology (DO), a domain ontology organized as a directed acyclic graph that represents the domain of human diseases and is mapped to numerous other application ontologies. DO makes knowledge about human diseases explicit by describing diseases through ontology properties such as is-a, has-material-basis-in or has-symptom. For instance, DO states that:

bone disease is-a connective tissue disease
congenital megabladder has-material-basis-in autosomal dominant inheritance
allergic conjunctivitis has-symptom allergic reaction

Also, as shown in Figure 1, a term in DO can be linked to other ontologies through relations such as has-symptom and has-phenotype.

Figure 1: The representation of tyrosinemia type II in Disease Ontology (DO). Source: [13]

The Human Disease Ontology, as of its update of April 28th, 2022, comprises 17,840 classes and 45 properties [15] and is widely applied for several purposes in academic and industry contexts. In addition, it has been used by more than 50 other biomedical ontologies, and there is an extensive list of software tools and other web resources that (1) support the use of DO data, (2) have integrated or were built using DO data, or (3) provide data linkages to the DO website [16].

On the Benefits of Semantics, Ontologies and Conceptual Modeling in the Data Science Lifecycle
Managing data cannot be accomplished solely by humans, with their limited cognitive capabilities [9]. Moreover, available data keeps growing and is becoming more important as a resource for decision-making. It is therefore crucial to understand the domain the data represents in order to use it more precisely. Works [8,9] show that pairing conceptual modeling and ontology artifacts with data science and machine learning techniques can not only enhance the results of Data Science projects but also support the development and evaluation of conceptual modeling approaches. This work focuses on the first kind of benefit, in which semantic commitment helps Machine Learning techniques.

In particular, [8] argue that the benefits of using foundational and domain ontologies appear in every step of the Data Science Life Cycle, including Problem Understanding, Data pre- and post-processing, and Data Mining with different techniques (Classification and Clustering, for example). These benefits are summarized in Table 1. In the Data Pre-processing step, [8] argue that ontologies help both with semantic interoperability and with making ontological commitments explicit.
These benefits concern data integration: when it is done without considering the ontological commitments of the sources providing the data, features that refer to different real-world entities may be joined, leading to misinterpretations and false results in the DS project. When clustering data, relying on foundational ontologies may lead to clusters that better reflect real-world categorization. Moreover, grounding the similarity calculation on ontological foundations can yield similarities that are far more appropriate to the domain the data belongs to.

Table 1
Multiple Benefits of Foundational Ontologies and Domain Ontologies on Data Science. Source: Adapted from [8]

| DS Lifecycle Step | Benefit |
| --- | --- |
| Problem understanding | Semantic transparency |
| | Complexity management mechanisms for complex domains |
| | Data models are more uniform |
| Data pre-processing | Semantic interoperability |
| | Ontological commitments made explicit |
| Clustering | Higher probability of clusters that reflect genuine real-world categorizations |
| | Similarity calculation grounded on ontological foundations |
| | Easier to identify similarities that are not accidental |
| | Preventing unwarranted associations evaluation |
| Data post-processing | Improved understanding of the patterns discovered |
| | Systematic guidance in the validation of the patterns discovered grounded on ontological meta-properties |

Traditional data mining methods and techniques treat data as merely “sums of attribute values”, an approach that can lead to biases and to a poor understanding of the patterns discovered [8]. Indeed, clustering techniques mostly rely on similarity calculations (a data pre-processing step) that do not consider semantic attributes and are essentially mathematical operations, such as the Euclidean distance and other metrics. However, in the past few years there have been many proposals to consider ontologies in the calculation of object similarities, such as [17,18].
Also, in the biomedical field, especially for the Gene Ontology (GO) [19,20], several similarity metrics have been proposed for different ontologies, such as Wang [11] and [21,22,23]. The metric proposed in [11] can also be extended to compare DO terms.

In this research scenario, ontologies are used as a tool in the data preparation step and may therefore enhance the analysis results. The ontology terms (diseases) and taxonomic relations (is-a) are considered when computing similarities between groups of comorbidities, since each comorbidity is linked to a disease in the Disease Ontology. Similarities should be calculated following a groupwise approach, to enable a comparison between two groups of comorbidities. Pairwise similarities can be computed directly by applying a similarity measure, either one of the four metrics just mentioned or any of those available in HESML (Half-Edge Semantic Measures Library) [24]. Semantically aware groupwise metrics, however, are not that simple. According to [24], “A groupwise semantic similarity measure is used to compute the degree of similarity between two sets of concepts defined into an ontology. This type of measure is commonly used to compare sets of GO terms in genomics, although they could also be used to compare sets of WordNet synsets evoked by two words”. Section 6 details the approach used to calculate groupwise similarities between DO terms.

Associating comorbidities to diseases in the Disease Ontology
We analyzed the dataset from SIVEP-Gripe (Sistema de Informação de Vigilância Epidemiológica da Gripe, or Influenza Epidemiological Surveillance Information System), a nationwide surveillance database used to monitor severe acute respiratory infections in Brazil. Each instance of this dataset represents a hospital admission due to COVID-19, characterized by several features, such as case evolution (Death or Recovery) and the patient's previous COVID-19 vaccine administrations, among others.
However, this dataset contains a lot of imprecise and missing data, especially regarding patient comorbidities, which this work aims to tackle. Hence, data selection followed a semantically aware methodology, described as follows.

Data were selected by filtering the first three thousand hospitalizations of 2021 in the State of Rio de Janeiro. Since this work relies mostly on analyzing each patient's set of comorbidities, the filtering also took into account instances with noisy, inaccurate and missing information regarding this feature. Also, since this study focuses on pairing ontologies with the Data Science Lifecycle, rather than on discovering new patterns, we did not prioritize analyzing larger datasets.

Each comorbidity appearing in the dataset was then associated with a DO disease. This step was performed manually, by searching for DO classes whose names were syntactically similar to the comorbidity name in the dataset. Some of these associations are shown in Table 2. For example, if a hospitalization entry in the SIVEP-Gripe dataset has the word “DPOC” (short for Doença Pulmonar Obstrutiva Crônica in Portuguese) in the MORB_DESC column, we consider that the patient has “Chronic Obstructive Pulmonary Disease”, which has the ID DOID:3083 in DO.

Table 2
Disease Matching between SIVEP-Gripe names and DO terms. Source: Authors

| Name on SIVEP-Gripe Database | DO Match |
| --- | --- |
| ALCOOLISMO | alcohol use disorder |
| ALZHEIMER | Alzheimer’s disease |
| AMILOIDOSE | amyloidosis |
| ANEMIA | deficiency anemia |

Calculating (dis)similarities between DO terms
There are several ways to calculate pairwise similarities between classes in an ontology. In this work, the metric proposed in [11] is applied to measure semantic similarity among DO terms.
To compute this metric, Wang represents a term $A$ in DO as $DAG_A = (A, T_A, E_A)$, where $T_A$ is the set of terms in $DAG_A$ ($A$ itself and all of its ancestors in the DO graph) and $E_A$ is the set of edges connecting the terms in $DAG_A$. The S-value of a DO term $t$ related to term $A$ is defined as the contribution of $t$ to the semantics of $A$. For any $t$ in $DAG_A$, its S-value related to $A$ is given by equation 1.

$$S_A(t) = \begin{cases} 1, & \text{if } t = A \\ \max\{\, w_e \times S_A(t') \mid t' \in \text{children of } t \,\}, & \text{otherwise} \end{cases} \tag{1}$$

Here, $w_e$ is the semantic contribution factor of the edge $e \in E_A$ linking term $t$ with its child $t'$; a weight $w_e$ may be predefined for every $e$. Wang's similarity measure for DO terms only considers is-a relationships, and the corresponding weight $w_e$ is preset to 0.7. For a given term $A$, the total semantic contribution $SV(A)$ is the sum of the S-values of all terms in $DAG_A$, i.e., $SV(A) = \sum_{t \in T_A} S_A(t)$, and the similarity between two terms $A$ and $B$ is given by equation 2.

$$Sim_{Wang}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)} \tag{2}$$

These metrics were computed with the R software package DOSE [25], which is part of Bioconductor, the open-source software project for bioinformatics. Figure 3 shows a heatmap of pairwise similarities among some DO terms. For instance, let $A$ be a vector of DO term IDs, as in equation 3.

$$A = (8498, 409, 2841, 850, 2914, 7148, 8857) \tag{3}$$

The seven terms in vector $A$ represent, respectively, the diseases (osteoarthritis, liver disease, asthma, lung disease, immune system disease, rheumatoid arthritis, lupus erythematosus). We define a matrix $S$ such that the value at position $S_{A_i, A_j}$ is the similarity $Sim_{Wang}(A_i, A_j)$, represented graphically in Figure 3. Figure 2 displays where in the ontology the terms of vector $A$ are placed, with respect to their relationships and hierarchies with other terms. Note that the relationship has-subclass is the inverse of is-a: if A is-a B, then B has-subclass A.

Figure 2: Graph representing path-to-root concepts of six diseases in DO. Source: Author

Figure 3: Pairwise similarities between DO terms. Source: Author

As can be seen in Figure 3, rheumatoid arthritis has a high similarity with osteoarthritis because both diseases have an is-a relationship with arthritis. Also, since both rheumatoid arthritis and lupus erythematosus are (is-a) autoimmune diseases of the musculoskeletal system, these two DO terms have a higher pairwise similarity than lupus erythematosus and osteoarthritis.

Calculating groupwise (dis)similarities
Each row in the hospitalizations dataset represents a hospital entry, which refers to a unique patient. As mentioned, each entry contains data about the diseases the patient has. Hence, each instance in the dataset is characterized by a single group of DO terms. The previous definitions only allow pairwise similarities between classes in the ontology to be computed, so calculating similarities between sets of diseases, i.e., groupwise similarities, requires another approach.

For instance, consider an ordered set $C$ containing $n$ terms from DO, and an example instantiation of $C$ in which $n = 4$, as shown in equation 4.

$$C = \{\text{lupus erythematosus}, \text{rheumatoid arthritis}, \text{liver disease}, \text{asthma}\} \tag{4}$$

Also, let $D \subseteq C$ be the subset representing the diseases a patient suffers from, with an example instantiation in equation 5.

$$D = \{\text{rheumatoid arthritis}, \text{asthma}\} \tag{5}$$

Any subset of diseases in $C$ can be represented as a document vector $v$, i.e., an $n$-dimensional binary vector in which each coordinate indicates whether the corresponding concept of $C$ is in $D$. Thus, in this case, $v_D^T = (0\ 1\ 0\ 1)$. This representation is useful and broadly used in Natural Language Processing models and in machine learning techniques that rely on similarity measures between data instances.

Cosine (dis)similarity
Considering vectors $x$ and $y$ in $n$-dimensional space, the cosine similarity between them is given by equation 6.
$$GSim_{cos}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \tag{6}$$

The operation $x \cdot y$ is the usual inner product in $\mathbb{R}^n$ and $\lVert x \rVert$ is the Euclidean norm of a vector $x \in \mathbb{R}^n$. For nonzero vectors with non-negative coordinates, such as the binary document vectors used here, this similarity metric satisfies the property shown in equation 7.

$$\forall (x, y) \in \mathbb{R}^n_{\ge 0} \times \mathbb{R}^n_{\ge 0},\ x, y \ne 0: \quad 0 \le GSim_{cos}(x, y) \le 1 \tag{7}$$

Therefore, cosine dissimilarity is defined by equation 8.

$$GDSim_{cos}(x, y) = 1 - GSim_{cos}(x, y) \tag{8}$$

Even though this metric captures, to some extent, groupwise disease similarities, it does not use ontologies as semantic enrichment artifacts. Therefore, according to [8], data mining techniques relying on such metrics may lead to a less genuine understanding of the patterns discovered, due to the lack of semantics. Hence, Section 6.2 provides an ontologically well-founded (dis)similarity metric that can be seen as an extension of the original cosine similarity and is inspired by the work of [12], which applies the metric in the domain of radiology.

Semantically aware cosine (dis)similarity
To introduce semantic similarity between document vectors, [12] first define (loosely, in their own words) the similarity between two concepts $C_1$, $C_2$ in an ontology as shown in equation 9.

$$Sim(C_1, C_2) = \frac{1}{d} \tag{9}$$

where $d$ is the number of nodes in the shortest path between the concept nodes $C_1$ and $C_2$ (inclusive of both). However, the authors clarify that other similarity measures can be used, as long as they preserve the basic property that increasing distance within the ontology is accompanied by a decrease in semantic similarity. Hence, the similarity measure defined by [11] for DO terms is used, as displayed in equation 10.

$$Sim(C_1, C_2) = Sim_{Wang}(C_1, C_2) \tag{10}$$

Following [12], each term of the domain ontology found in the dataset, together with all the other concepts on its paths to the root (a.k.a. seed concepts), would become a coordinate of the document vectors to be analyzed. However, Wang's pairwise similarity measure already accounts for the weight of seed concepts in its formula.
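To make the role of ancestor terms in Wang's measure concrete, the S-values and the resulting pairwise similarity can be sketched in a few lines of Python. This is an illustrative sketch, not the DOSE implementation used in this work: the `PARENTS` hierarchy is a hypothetical toy fragment rather than the actual DO graph, the function names are assumptions, and only is-a edges with weight 0.7 are modeled. The similarity sums the S-values of the terms shared by both ancestor sets and divides by the two total semantic contributions, as in Wang's definition.

```python
W_IS_A = 0.7  # semantic contribution factor of is-a edges, as stated in the text

# Hypothetical toy hierarchy (term -> list of is-a parents); NOT the real DO graph.
PARENTS = {
    "disease": [],
    "musculoskeletal disease": ["disease"],
    "arthritis": ["musculoskeletal disease"],
    "osteoarthritis": ["arthritis"],
    "rheumatoid arthritis": ["arthritis"],
    "respiratory disease": ["disease"],
    "asthma": ["respiratory disease"],
}


def s_values(term):
    """Return the S-value of `term` and of each of its ancestors."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in PARENTS[child]:
            candidate = W_IS_A * s[child]
            # Keep the largest contribution over all paths leading up from `term`.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def wang_similarity(a, b):
    """Wang similarity: shared S-values over the total semantic contributions."""
    sa, sb = s_values(a), s_values(b)
    shared = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


print(round(wang_similarity("osteoarthritis", "rheumatoid arthritis"), 3))  # 0.605
print(round(wang_similarity("osteoarthritis", "asthma"), 3))                # 0.176
```

In this toy hierarchy, the two arthritis terms share three ancestors and obtain a much higher similarity than osteoarthritis and asthma, which only share the root. The ancestors contribute through the S-values themselves, so they do not need to appear as separate coordinates of the document vectors.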
Since Wang's measure already accounts for ancestor terms, only the Disease Ontology terms present in the explored dataset are considered in this work, and this group of diseases is represented as a set $C$, called the context set.

With the definitions above, the groupwise similarity between DO terms, $GSim_{Wang}(A, B)$, with respect to a context can now be computed. Let $C = \{C_1, C_2, \ldots, C_n\}$ be the set of diseases representing the context set, and let $A$ and $B$ be two groups of disease terms with $A, B \subseteq C$. Each group is turned into a vector with one coordinate per context term $c \in C$, whose value is the highest Wang similarity between $c$ and any term of the group. The semantically aware groupwise similarity is then the cosine between these two vectors, as in equation 11.

$$GSim_{Wang}(A, B) = \frac{\sum_{c \in C} \max_{a \in A} Sim_{Wang}(a, c) \cdot \max_{b \in B} Sim_{Wang}(b, c)}{\sqrt{\sum_{c \in C} \left( \max_{a \in A} Sim_{Wang}(a, c) \right)^2} \cdot \sqrt{\sum_{c \in C} \left( \max_{b \in B} Sim_{Wang}(b, c) \right)^2}} \tag{11}$$

This similarity metric ranges from 0 to 1; the dissimilarity is therefore derived as in equation 12.

$$GDSim_{Wang}(A, B) = 1 - GSim_{Wang}(A, B) \tag{12}$$

As an example, consider the similarity between two groups of DO terms for the context $C$ given by the columns of Table 3. Each entry is the maximum Wang similarity between the column term and the terms of the row group.

Table 3
Values for computing DO terms groupwise similarities. Source: Authors

| Group | asthma | liver disease | lung disease | immune system disease | rheumatoid arthritis |
| --- | --- | --- | --- | --- | --- |
| A = {asthma, liver disease} | 1 | 1 | 0.65 | 0.36 | 0.13 |
| B = {rheumatoid arthritis} | 0.084 | 0.13 | 0.13 | 0.26 | 1 |

The similarity between the groups of diseases $A$ and $B$ is then calculated as in equations 13 and 14.

$$GSim_{Wang}(A, B) = \frac{1 \cdot 0.084 + 1 \cdot 0.13 + 0.65 \cdot 0.13 + 0.36 \cdot 0.26 + 0.13 \cdot 1}{\sqrt{1^2 + 1^2 + 0.65^2 + 0.36^2 + 0.13^2} \cdot \sqrt{0.084^2 + 0.13^2 + 0.13^2 + 0.26^2 + 1^2}} \tag{13}$$

$$GSim_{Wang}(A, B) = \frac{0.5221}{1.6028 \cdot 1.0528} \approx 0.3094 \tag{14}$$

Therefore, in this work, both semantically aware and semantically unaware groupwise dissimilarities were calculated. Figure 4 shows that dissimilarities vary smoothly when data is enriched with semantics, whereas the semantically unaware measure produces misleading dissimilarities between data objects, which may affect the subsequent cluster analysis.

Figure 4: Heatmaps of groupwise dissimilarities using semantically unaware (left) and semantically aware (right) metrics. Source: Author

Hospitalizations cluster analysis
In [26], clustering is defined as the process of grouping a set of data objects into multiple groups, or clusters, so that objects within a cluster are highly similar to each other but very dissimilar to objects in other clusters. Euclidean and Manhattan distances are often used as dissimilarity measures in clustering techniques. In this study, however, the clustering analysis relies on both the cosine similarity and the cosine similarity based on the previously mentioned Wang measure [11].

This work uses the K-medoids clustering technique [27], a partitioning-based clustering algorithm that is scalable and can cluster objects from precomputed dissimilarities, which is the case for the data in this study. To choose the clustering parameters (such as the number of clusters), this work relies on the Silhouette Coefficient [28] as the metric to be maximized. This metric, based on intra-cluster and inter-cluster distances, provides information about the quality of the clusters. The Scikit-learn [29] implementations of K-Medoids and Silhouette Score in the Python programming language [30] were used.

The average silhouette coefficient was then calculated for each application of K-medoids, on both the semantically aware and the semantically unaware dissimilarity data, and for numbers of clusters $k$ ranging from 2 to 15. As seen in Figure 5, the optimal number of clusters $k^*$, which maximizes the average silhouette score, was $k^*_{sem} = 4$ in the semantically aware case and $k^*_{nosem} = 5$ in the semantically unaware case, with scores of 0.277108 and 0.12143, respectively.

Figure 5: Comparing average silhouette score for different numbers of clusters. Source: Author

The results obtained regarding clustering quality in both scenarios are in fact encouraging.
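The selection of the number of clusters described in this section can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in this study, which relied on library routines: the `k_medoids` and `mean_silhouette` functions, the deterministic initialization and the toy 6 × 6 dissimilarity matrix are all assumptions made for illustration, and the alternating update shown here is only one of several K-medoids variants.

```python
def k_medoids(D, k, max_iter=100):
    """Alternating K-medoids on a precomputed, symmetric dissimilarity matrix D."""
    n = len(D)
    # Assumed deterministic initialization: start from the most central object,
    # then repeatedly add the object farthest from the medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(D[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(D[i][m] for m in medoids)))
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its closest medoid.
        labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimizes the total dissimilarity to its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the previous medoid if a cluster became empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(D[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


def mean_silhouette(D, labels):
    """Average silhouette coefficient computed from precomputed dissimilarities."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: objects in singleton clusters score 0
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(D[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Toy dissimilarity matrix (illustrative values only): two well-separated groups.
D = [
    [0.0, 0.1, 0.2, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 0.8],
    [0.2, 0.1, 0.0, 0.9, 0.8, 0.9],
    [0.9, 0.8, 0.9, 0.0, 0.1, 0.2],
    [0.8, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 0.8, 0.9, 0.2, 0.1, 0.0],
]

scores = {}
for k in range(2, 5):
    labels, _ = k_medoids(D, k)
    scores[k] = mean_silhouette(D, labels)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

For this toy matrix the loop selects k = 2, the partition that separates the two groups. The same loop applies unchanged to either the semantically unaware or the semantically aware dissimilarity matrix, since both are consumed as precomputed inputs.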
The bar plot in Figure 6 shows that not only is the average silhouette clearly higher in the semantically aware scenario, but the metric evaluated individually for each data point is also higher overall. In addition, cluster 0 of the cosine dissimilarity clustering has mainly negative silhouette scores. Moreover, the average silhouette score obtained when clustering the semantically aware data was higher than the individual scores of 83% (386 out of 465) of the observations in the semantically unaware scenario.

Figure 6: Bar plot with values of silhouette score for each data point. Source: Author

Dimensionality reduction
In this work, Multidimensional Scaling (MDS) [31], a dimensionality reduction technique, was used to transform the dissimilarities between groups of DO terms into points in the Cartesian plane, which made both charts in Figure 7 possible. Information about the clustering results and the silhouette scores obtained is represented, respectively, by the color and the radius of each point. For every cluster, the medoid is marked with a black cross, from which a box emerges displaying all DO terms of the four groups of diseases with the highest silhouette scores in that cluster.

The results shown in Figure 7 are crucial to make explicit how clustering results improve when semantics is added to the data. While the clusters overlap in the left chart (further evidence explaining the low silhouette scores obtained), the right chart shows better-separated clusters, much closer to the main objective of this technique, which is to maximize intra-cluster similarities and minimize inter-cluster similarities. Lastly, our results evidenced the benefit of “Higher probability of clustering results that reflect real-world categorizations”, exactly as mentioned by [8].
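The projection of a dissimilarity matrix onto the plane, as described in the Dimensionality reduction section above, can be sketched in the same dependency-free spirit. The block below implements classical (Torgerson) MDS with plain power iteration, which is an assumption made to keep the example self-contained; the MDS variant and library routine used to produce Figure 7 may differ. The function name and the toy matrix are hypothetical, and the sketch assumes a symmetric matrix whose dominant eigenvalues after double centering are positive.

```python
import math


def classical_mds(D, dims=2, iters=500):
    """Embed objects in `dims` dimensions from a symmetric dissimilarity matrix D."""
    n = len(D)
    # Double-center the squared dissimilarities: B = -1/2 * J * D^2 * J.
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    B = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration from an arbitrary deterministic start vector.
        v = [math.sin(i + d + 1.0) for i in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: eigenvalue associated with the converged vector.
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 1e-12:
            break  # no positive variance left to represent
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Toy dissimilarities (illustrative): corners of a 3 x 4 rectangle,
# which can be embedded in the plane without distortion.
D = [
    [0.0, 3.0, 5.0, 4.0],
    [3.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 3.0],
    [4.0, 5.0, 3.0, 0.0],
]
coords = classical_mds(D)
for point in coords:
    print([round(x, 3) for x in point])
```

Because the toy matrix is exactly Euclidean, the distances between the returned points reproduce it. Dissimilarities derived from cosine-style metrics are generally not exactly embeddable in two dimensions, so a plot obtained this way is an approximation of the original dissimilarities.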
When comparing both scenarios, the semantically unaware clusters grouped diseases which are, by common sense, dissimilar to each other; the semantically aware clusters, on the other hand, reflected real-world categorizations, i.e., diseases within the same cluster are clearly more similar to each other.

Figure 7: Graphical representation of clustering on semantically unaware (left) data and semantically aware (right) data. Source: Author

Conclusions
This work proposes a semantically aware application of the Data Science Lifecycle in the COVID-19 domain and shows the benefits of considering ontologies and other semantic structures as tools for enhancing ML techniques. Even though groupwise metrics for ontology terms exist in the literature, they are not as widespread or accessible as those measuring pairwise similarities. So, in a context where groupwise distances between sets of objects were required, the proposal of [12] for calculating groupwise similarities was adapted to use the pairwise measure of [11].

The benefits of enriching a disease dissimilarity metric with context information were evident. Firstly, when calculating groupwise similarities, cosine similarity, as shown in Figure 4, led to contextually inaccurate (dis)similarities between data objects, which were pointed out as a potential cause of poor results later, during cluster analysis. Figures 5 and 6 show that silhouette scores are, overall, considerably higher when data is enriched with semantics. Figure 7 aims to give the reader a visualization of the most important results in a nutshell. It displays the overlapping clusters that result from the semantically unaware similarity calculation; these visual results agree with the silhouette values found in the Data Post-Processing step. In the other scatter plot, where data is semantically enriched, intra-cluster distances were minimized and inter-cluster distances were maximized.
Finally, an analysis of how the groups of diseases were clustered is visually represented in Figure 7.

Future Works
The text treatment step in this work did not rely on modern Natural Language Processing (NLP) techniques, which led to manual tasks such as linking DO terms to the comorbidity data of hospitalized patients. As future work, this step can be automated so that more information can be considered. The pairwise similarity metric can also be enriched by considering not only is-a relationships but also the many other relations an ontology can provide. Also, as in the work of [8], foundational ontologies and their associated meta-properties can be applied to a project using the Data Science Lifecycle; OntoCovid and OntoTB [32,33] are well-founded ontologies that may help when applying ML techniques. An analysis of how clustering results are associated with mortality and with the use of mechanical ventilation will also be carried out. Semantic enrichment of data could thus serve as a tool for better results in data-driven decision making.

Acknowledgements
This study was conducted in the context of the project "Effectiveness of COVID-19 Vaccination in Brazil Using Mobile Data" (EFFECT-BR), which is one of ten projects selected, among 440 others worldwide, by the Grand Challenges ICODA COVID-19 Data Science, funded by the Bill & Melinda Gates Foundation. Also, the Center for Healthcare Operations and Intelligence (NOIS, http://www.nois.ind.puc-rio.br), which is part of the PUC-Rio Industrial Engineering Department, together with the Tecgraf Institute, Fundação Oswaldo Cruz (FIOCRUZ) and Instituto D’Or (IDOR), provided the support needed for this study. This work was partially funded by projects FAPERJ E-26/010.002657/2019 and CNPq 422810/2021-5.

References
[1] World Health Organization, “WHO Director-General’s opening remarks at the media briefing on COVID-19 - 11 March 2020,” World Health Organization, Mar. 11, 2020. https://www.who.int/director-general/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020 (accessed Apr. 22, 2022).
[2] T. K. Tsang, P. Wu, Y. Lin, E. H. Y. Lau, G. M. Leung, and B. J. Cowling, “Effect of changing case definitions for COVID-19 on the epidemic curve and transmission parameters in mainland China: a modelling study,” The Lancet Public Health, vol. 5, no. 5, Apr. 2020, doi: 10.1016/s2468-2667(20)30089-x.
[3] G. Guizzardi, “Ontology, Ontologies and the ‘I’ of FAIR,” Data Intelligence, vol. 2, no. 1–2, pp. 181–191, Jan. 2020, doi: 10.1162/dint_a_00040.
[4] S. Babcock, J. Beverley, L. G. Cowell, and B. Smith, “The Infectious Disease Ontology in the age of COVID-19,” Journal of Biomedical Semantics, vol. 12, no. 1, Jul. 2021, doi: 10.1186/s13326-021-00245-1.
[5] L. Wan et al., “Development of the International Classification of Diseases Ontology (ICDO) and its application for COVID-19 diagnostic data analysis,” BMC Bioinformatics, vol. 22, Oct. 2021, doi: 10.1186/s12859-021-04402-2.
[6] H. Wu, Y. Zhong, Y. Tian, S. Jiang, and L. Luo, “Automatic diagnosis of COVID-19 infection based on ontology reasoning,” BMC Medical Informatics and Decision Making, vol. 21, no. S9, Nov. 2021, doi: 10.1186/s12911-021-01629-0.
[7] A. Sargsyan et al., “The COVID-19 Ontology,” Bioinformatics, vol. 36, no. 24, pp. 5703–5705, Dec. 2020, doi: 10.1093/bioinformatics/btaa1057.
[8] G. Amaral, F. Baião, and G. Guizzardi, “Foundational ontologies, ontology-driven conceptual modeling, and their multiple benefits to data mining,” WIREs Data Mining and Knowledge Discovery, vol. 11, no. 4, Jul. 2021, doi: 10.1002/widm.1408.
[9] W. Maass and V. C. Storey, “Pairing conceptual modeling with machine learning,” Data & Knowledge Engineering, vol. 134, no. C, p. 101909, Jun. 2021, doi: 10.1016/j.datak.2021.101909.
[10] L. M.
Schriml et al., “Human Disease Ontology 2018 update: classification, content and workflow expansion,” Nucleic Acids Research, vol. 47, no. D1, pp. D955–D962, Jan. 2019, doi: 10.1093/nar/gky1032.
[11] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen, “A new method to measure the semantic similarity of GO terms,” Bioinformatics, vol. 23, no. 10, pp. 1274–1281, May 2007, doi: 10.1093/bioinformatics/btm087.
[12] T. Mabotuwana, M. C. Lee, and E. V. Cohen-Solal, “An ontology-based similarity measure for biomedical data – Application to radiology reports,” Journal of Biomedical Informatics, vol. 46, no. 5, pp. 857–868, Oct. 2013, doi: 10.1016/j.jbi.2013.06.013.
[13] J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Burlington, MA: Elsevier, 2012.
[14] H. Pan et al., “Biomedical ontologies and their development, management, and applications in and beyond China,” Journal of Bio-X Research, vol. 02, Art. no. 04, 2019, doi: 10.1097/JBR.0000000000000051.
[15] NCBO BioPortal, “Human Disease Ontology | NCBO BioPortal,” bioportal.bioontology.org. https://bioportal.bioontology.org/ontologies/DOID (accessed May 27, 2022).
[16] Disease Ontology, “Disease Ontology - Institute for Genome Sciences - Use Cases,” disease-ontology.org, 2022. https://disease-ontology.org/community/use-cases (accessed Jun. 08, 2022).
[17] K. Gibert, A. Valls, and M. Batet, “Introducing semantic variables in mixed distance measures: Impact on hierarchical clustering,” Knowledge and Information Systems, vol. 40, no. 3, pp. 559–593, Jun. 2013, doi: 10.1007/s10115-013-0663-5.
[18] W. Lee, N. Shah, K. Sundlass, and M. Musen, “Comparison of Ontology-based Semantic-Similarity Measures,” AMIA ... Annual Symposium proceedings. AMIA Symposium, vol. 2008, pp. 384–388, Nov. 2008, [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655943/
[19] M. Ashburner et al., “Gene Ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp.
25–29, May 2000, doi: 10.1038/75556.
[20] Gene Ontology Consortium, “The Gene Ontology resource: enriching a GOld mine,” Nucleic Acids Research, vol. 49, no. D1, pp. D325–D334, Jan. 2021, doi: 10.1093/nar/gkaa1113.
[21] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” in The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Aug. 1997, vol. 10, pp. 19–33. [Online]. Available: https://aclanthology.org/O97-1002
[22] P. Resnik, “Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language,” Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, 1999, doi: 10.1613/jair.514.
[23] D. Lin, “An information-theoretic definition of similarity,” in IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology (CIBCB), 1998, pp. 296–304.
[24] J. J. Lastra-Díaz, A. García-Serrano, M. Batet, M. Fernández, and F. Chirigati, “HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset,” Information Systems, vol. 66, pp. 97–118, Jun. 2017, doi: 10.1016/j.is.2017.02.002.
[25] G. Yu, L.-G. Wang, G.-R. Yan, and Q.-Y. He, “DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis,” Bioinformatics, vol. 31, no. 4, pp. 608–609, Feb. 2015, doi: 10.1093/bioinformatics/btu684.
[26] J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Burlington, MA: Elsevier, 2012.
[27] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for K-medoids clustering,” Expert Systems with Applications, vol. 36, no. 2, pp. 3336–3341, Mar. 2009, doi: 10.1016/j.eswa.2008.01.039.
[28] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, Nov. 1987, doi: 10.1016/0377-0427(87)90125-7.
[29] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, Art. no. 85, 2011, [Online]. Available: http://jmlr.org/papers/v12/pedregosa11a.html
[30] G. van Rossum and F. L. Drake, Python 3: reference manual. United States: Sohobooks, 2009.
[31] J. B. Kruskal, “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis,” Psychometrika, vol. 29, no. 1, pp. 1–27, Mar. 1964, doi: 10.1007/bf02289565.
[32] L. Maddalena and F. Baião, “OntoCovid: Applying SABiO to conceptual modeling well grounded in the COVID-19 domain,” in CEUR Workshop Proceedings, 2021, vol. 3050.
[33] T. Guarnier et al., “Um Modelo Conceitual Baseado em Ontologia para Doenças Infecciosas com Ênfase em Tuberculose,” 2020. [Online]. Available: http://ceur-ws.org/Vol-2728/short5.pdf