OntoClue, a framework to compare vector-based approaches for document relatedness using the RELISH corpus Rohitha Ravinder 1,2, Tim Fellerhoff 1,3, Vishnu Dadi 1,4, Lukas Geist 1,4, Guillermo Rocamora 1,5 , Muhammad Talha 1,4, Dietrich Rebholz-Schuhmann 1,6 and Leyla Jael Castro 1 1 ZB MED Information Centre for Life Sciences, Gleueler Str. 60, Cologne, 50931, Germany 2 Bonn-Aachen International Centre for Information Technology (B-IT), University of Bonn, Friedrich-Hirzebruch-Allee 6, Bonn, 53115, Germany 3 Heinrich-Heine University Düsseldorf, Universitätsstraße 1, Düsseldorf, 40225, Germany 4 Hochschule Bonn-Rhein-Sieg, Grantham-Allee 20, Sankt Augustin, 53757, Germany 5 Universidad de Murcia, Avda. Teniente Flomesta 5, Murcia, 30003, Spain 6 University of Cologne, Albertus-Magnus-Platz, Cologne, 50923, Germany Abstract The continuous increase of biomedical scholarly publications makes it challenging to construct document recommendation algorithms to navigate through literature, an important feature for researchers to keep up with relevant publications. Understanding semantic relatedness and similarity between two documents could improve document recommendations. The objective of this study is performing a comparative analysis of vector-based approaches to assess document similarity in the RELISH corpus. Here we present our approach to compare five different techniques to generate vectors representing the text in the documents. These techniques employ a combination of various Natural Language Processing frameworks such as Word2Vec, Doc2Vec, dictionary-based Named Entity Recognition as well as state-of-the-art models based on BERT. Keywords 1 Document similarity, Word embeddings, Named Entity Recognition 1. Introduction Recommendation systems are a successful method to cope with information overload wrt scientific publications [1]. For biomedical publications, PubMed Related Articles (PMRA) [2] is still considered the de facto standard; however, Natural Language Processing (NLP) advances, including word-embeddings, offer alternative paths to improve the state of the art and explore further similarity, relatedness and relevance. The RELISH [3] dataset corresponds to a document-to-document relevance assessment (definitely relevant, partially relevant, non-relevant) that can be used for comparing, improving and translating newly developed literature search techniques, including recommendation systems. Here we present OntoClue, a framework to compare different approaches to generate vectors for articles in the RELISH corpus. Proceedings Semantic Web Applications and Tools for Healthcare and Life Sciences, February 13–16, 2023, Basel, Switzerland EMAIL: ljgarcia@zbmed.de (A. 8) ORCID: 0000-0002-8725-1317 (A.2); 0000-0002-3082-7522 (A.3); 0000-0002-2910-7982 (A.4); 0000-0002-1018-0370 (A.7); 0000-0003-3986-0510 (A.8) ©️ 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 2. OntoClue framework OntoClue can be summarized in the following steps: (i) retrieve title and abstract for the RELISH articles in XML recording those that cannot be retrieved, (ii) trim the RELISH corpus so it includes only retrieved articles, (iii) reduce the RELISH corpus so only relevance assessment for which there is a clear consensus are kept, (iv) connect approaches to be compared by OntoClue in a workflow fashion, (v) optimize the approaches using an Area Under the Curve (AUC) approach, (vi) evaluate precision and cumulative gain for each approach using the optimal parameters, (vii) provide comparison tables for the different approaches. The hyperparameter optimization follows a multi-classification approach using Cosine Similarity intervals from 0 to 1 with increments of 0.1 and counting the number of definitely relevant, partially relevant and non-relevant RELISH pairs for each interval. The optimization is based on the best AUC score obtained from different hyperparameter combinations for each participating approach. The optimization can also be simplified to two classes by combining definitely and partially relevant into one single class “relevant”. We are testing and tuning our OntoClue framework with five approaches: (i) Doc2Vec [4], existing approach for document vectors; (ii) word2doc2vec, in-house approach to document vectors; (iii) whatizit-dictionary, using Whatizit [5], a dictionary-based named entity recognition approach; (iv) hybrid-doc2vec, combination of Doc2Vec and Whatizit; and (v) a BERT-based approach using BERT pre-trained models (all of the others are trained with the RELISH articles only). 3. Future Work We plan to use our OntoClue framework to compare the five mentioned approaches so we can select the best approach to propose a new recommendation system that should cover not only the biomedical domain but also the agricultural one as they correspond to our use case LIVIVO, the ZB MED literature portal. The recommendation system should also integrate multilingualism as LIVIVO contains publications in English, German, French, Portuguese and Spanish. In addition, we want to support coverage for non-traditional, e.g., data and software, and non-peer-reviewed journal publications, e.g., conference papers and preprints. 4. Acknowledgements This work was partially supported by the STELLA project funded by DFG (project no. 407518790), the NFDI4DataScience project funded by GWK and DFG​ (no. NFDI 34/1), and the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A) 5. References 1. Zhu J, Patra BG, Yaseen A. Recommender system of scholarly papers using public datasets. AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:672-679. PMID: 34457183; PMCID: PMC8378599. 2. Lin J, Wilbur WJ. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics. 2007 Oct 30;8:423. doi: 10.1186/1471-2105-8-423. PMID: 17971238; PMCID: PMC2212667. 3. Brown P; RELISH Consortium, Zhou Y. Large expert-curated database for benchmarking document similarity detection in biomedical literature search. Database (Oxford). 2019 Jan 1;2019:baz085. doi: 10.1093/database/baz085. PMID: 33326193; PMCID: PMC7291946. 4. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. InAdvances in neural information processing systems. 2013:3111–3119. 5. Dietrich Rebholz-Schuhmann, Miguel Arregui, Sylvain Gaudan, Harald Kirsch, Antonio Jimeno. Text processing through Web services: calling Whatizit, Bioinformatics, Volume 24, Issue 2, 15 January 2008, Pages 296–298, https://doi.org/10.1093/bioinformatics/btm557.