Ontology Enhanced FAIR Data Point Searches Xiaofeng Liao1,∗ , Coos Baakman1 , Kees Burger2 , Luiz Olavo Bonino da Silva Santos3,4 and Peter A.C. ’t Hoen1,∗ 1 Radboudumc, Geert Grooteplein Zuid 26/28, 6500 HB Nijmegen, The Netherlands 2 Health-RI, Jaarbeursplein 6, 3521 AL Utrecht, The Netherlands 3 University of Twente, PO Box 217 7500 AE Enschede, The Netherlands 4 Leiden University Medical Center, Postbus 9600, 2300 RC Leiden, The Netherlands Abstract The FAIR Data Point has an increasingly important role in efforts to meet FAIR principles. It provides machine readable access to the metadata of different types of digital objects. In this paper, we focus on metadata of datasets. Since its first reference implementation, more tailored implementations have been developed and deployed in the Health Care and Life Sciences domain. However, a problem coming with these increasing amount of FAIR Data Point instances and the datasets published is the Findability of relevant datasets from the large volume of resources. For efficient finding of relevant datasets we need to exploit the richness of their metadata and a good ranking algorithm. In this paper we report the enhancements of the search and ranking capabilities of FAIR Data Point’s reference implementation. Specifically, we improved its semantic search capability via creating association between class terms and the words frequently occur in the class description and labels. We also implemented a TF-IDF based ranking algorithm on the search results to present users the most relevant results. With these two enhancements, the FAIR Data Point can respond to a user’s search request with higher coverage and present the list with the more relevant results based on the Term Frequency - Inverse Document Frequency (TF-IDF) metric. Keywords FAIR Data Point, Ontology, Enhancement, Semantic Search, Ranking, TF-IDF 1. Introduction FAIR Data Point (FDP) is a common approach to publish semantically-rich and machine- actionable metadata according to the FAIR principles[1]. A definition of its software architecture specifying its core components and services to register, index and allow users to search for meta- data content of available was given in [2] and a reference implementation1 was also presented. More tailored implementations have been developed including: The Netherlands eScience SWAT4HCLS 2024: The 15th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences, February 26–29, 2024, Leiden, The Netherlands ∗ Corresponding author. XiaoFeng.Liao@Radboudumc.nl (X. Liao); Coos.Baakman@radboudumc.nl (C. Baakman); kees.burger@health-ri.nl (K. Burger); l.o.boninodasilvasantos@utwente.nl (L. O. Bonino da Silva Santos); Peter-Bram.tHoen@radboudumc.nl (P. A.C. ’t Hoen) 0000-0002-4706-1084 (X. Liao); 0000-0003-4317-1566 (C. Baakman); 0000-0002-5437-779X (K. Burger); 0000-0002-1164-1351 (L. O. Bonino da Silva Santos); 0000-0003-4450-3112 (P. A.C. ’t Hoen) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 https://github.com/FAIRDataTeam/FAIRDataPoint CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings Center2 , LOVD3 and The SURF Data Repository4 . There are also softwares supports the FAIR Data Point protocol, including MOLGENIS software[3] and Castor EDC5 . The work described in this publication is integrated in FAIR Data Cube [4]. More and more FAIR Data Points are being set up and running with various metadata of datasets from Health Care and Life Sciences domain are published to serve researchers. A detailed list of FAIR Data Point instances can be found in the FAIR Data Point HOME Server6 , where, at the time of this paper’s writing, there were 41 active instances hosting metadatas of datasets and other types of digital objects. An aspect that is currently under-investigated and important to increase the Findability of datasets published are improvements in the engines searching capabilities for relevant datasets. Dataset search is often complicated and inefficient when compared to a typical internet search, where algorithms use criteria of similarity between the potential keywords and the content and links included on websites. The current reference implementation of FAIR Data Point only allows for searching and ranking datasets in a primitive way. The main reason for this technological limitation is that links between datasets are still rare. This compromises the use of traditional web-based ranking algorithms. In this work, we used the FAIR Data Point reference implementation as the basis for our enhancement work. We implemented a semantic search framework for datasets that can extend existing dataset search tools in two ways: 1. improving the semantic search capability over metadata via association with frequent words occur in class labels and description in ontology. 2. ranking the search results by applying the Term Frequency - Inverse Document Frequency (TF-IDF) [5] metric. 2. Design and Implementation To imkprove the search capability of the FAIR Data Point’s reference implementation, we designed a prototype including: • A Semantic Query Enhancer (SQE) component, which enhance queries by associating the user’s search keywords with terms occur in class labels and descriptions in the pre-loaded ontology. • A ranking algorithm based on TF-IDF metric to rank the results retrieved in the previous step. This is done in the following steps as depicted in Figure 1 • index metadata from the ontologies • retrieve and store associations for the search query words • find documents for each associated words that has sufficient relevance • score and rank the documents, before returning them 2 https://github.com/fair-data/fairdatapoint 3 https://github.com/LOVDnl/fdp.lovd.nl 4 https://repository.surfsara.nl/ 5 https://www.castoredc.com/ 6 https://home.fairdatapoint.org/ Figure 1: The 4 steps to enhance search query words by association terms from an ontology. Ontology For ontology, we chose Thesaurus.owl and generate association by linking the word to the class description. An example is given in the Figure 2 where “disease” is associated with “pain”, because they occur in the same class description. The NCI Thesaurus [6] serves as a widely employed reference terminology designed to enhance translational research in cancer, encompassing both basic and clinical science. Comprising nearly 110,000 terms distributed among approximately 36,000 concepts, the Thesaurus is organized into 20 subdomains. These subdomains encompass diverse areas such as diseases, drugs, anatomy, genes, gene products, techniques, and biological processes. By doing association, we get a result of 2449259 word associations. Figure 2: Association is generated via link keywords with terms occurred at class description Rank We used a basic TF-IDF [5] algorithm to rank the dataset. The reference implementation applies no ranking algorithm on the result list but only the primitive result from a SPARQL query against the triplestore behind the FAIR Data Point server. 3. Result On the portal of our ontology enhanced implementation7 , a user can use the ontology enhanced search capability by clicking ”Switch to Ontology-based” link, as shown in Figure 3a. A simple comparison of the search capability between the reference implementation and our ontology enhanced implementation is given in Figure 3. Specifically, in Figure 3a, a search of the keyword ”disease” in the reference implementation gives 0 results. However in Figure 3b, the keyword ”disease” found 2 results. The reason of this difference attributes to the association between ”disease” and ”immunology”/”interleukin-1”, which co-occurred in the class description in the Thesaurus.owl ontology. (a) Results of searching ”Disease” in the reference implementa- tion. (b) Results of searching ”Disease” in the Ontology Enhanced implementation Figure 3: The comparison of search results between the reference implementation and the ontology enhanced implementation. 4. Discussion Due to lack of user logs on the reference implementation of FAIR Data Point server, it is hard to apply a machine learning based ranking algorithm. These user logs supposed to contain the search keywords a user entered and the target results the user clicked. We plan to log the user 7 http://145.38.186.66/ behaviors at our enhanced FAIR Data Point portal to capture the keywords a user entered and the datasets the user clicked. With these logs, it is possible to train and apply a learning to rank algorithm. With more datasets being submitted and published to our running FAIR Data Point instance, a more detailed evaluation on the search performance in terms of precision and recall would be available. In the existing setup, the singular ontology Thesaurus.owl is employed, given its status as a comprehensive ontology in the field of cancer research. However, it is worth noting that the incorporation of multiple ontologies is feasible, provided that technical challenges such as memory consumption are effectively addressed. In our upcoming implementation, a configuration option will be introduced to enable users to select ontologies based on their specific requirements. Acknowledgments This work was supported by SURF-DCC via the pilot:”Enhancing FAIR Data Point’s Search Capability as a FAIR Service v2.” References [1] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The fair guiding principles for scientific data management and stewardship, Scientific data 3 (2016) 1–9. [2] L. O. B. da Silva Santos, K. Burger, R. Kaliyaperumal, M. D. Wilkinson, FAIR Data Point: A FAIR-Oriented Approach for Metadata Publication, Data Intelligence 5 (2023) 163–183. URL: https://doi.org/10.1162/dint_a_00160. doi:10.1162/dint_a_00160 . [3] K. J. van der Velde, F. Imhann, B. Charbon, C. Pang, D. van Enckevort, M. Slofstra, R. Barbieri, R. Alberts, D. Hendriksen, F. Kelpin, et al., Molgenis research: advanced bioinformatics data software for non-bioinformaticians, Bioinformatics 35 (2019) 1076–1078. [4] X. Liao, A. Niehues, C. de Visser, J. Huang, T. H. Ederveen, C. Doornbos, P. Kulkarni, K. J. van der Velde, M. A. Swertz, M. Brandt, A. J. van Gool, P. A. ’. Hoen, Fair data cube, a fair data infrastructure for integrated multi-omics data analysis, medRxiv (2023). URL: https://doi.org/10.1101/2023.04.23.23289000. doi:10.1101/2023.04.23.23289000 . [5] A. Rajaraman, J. D. Ullman, Data Mining, Cambridge University Press, 2011, p. 1–17. doi:10. 1017/CBO9781139058452.002 . [6] G. Fragoso, S. de Coronado, M. Haber, F. Hartel, L. Wright, Overview and utilization of the nci thesaurus, Comparative and functional genomics 5 (2004) 648–654.