Multilabel-classification task for Medline abstracts

Nelson Quiñones 1,2,3, Cesar Canales 1, Javier Torres 1, Dietrich Rebholz-Schuhmann 3,4, Leyla Jael Castro 3, Andrés Aristizábal 1

1 Universidad ICESI, CL 18 122-135, Cali, 760031, Colombia
2 Leibniz University of Hannover, Welfengarten 1, Hannover, 30167, Germany
3 ZB MED Information Centre for Life Sciences, Gleueler Str 60, Cologne, 50931, Germany
4 University of Cologne, Albertus-Magnus-Platz, Cologne, 50923, Germany

Abstract
Assigning categories to scholarly articles is a common approach to help researchers navigate the continuously growing scientific literature. This task is well covered for the biomedical literature thanks to efforts to assign Medical Subject Headings to biomedical abstracts, particularly those in Medline. Here we propose a multilabel-classification approach to assign major topics to biomedical literature, with the purpose of later applying transfer learning to cover conference papers and preprints, as well as the agricultural domain. In this short paper, we present some preliminary results.

Keywords
Multilabel-classification, literature categorization, MeSH topics

1. Introduction
ZB MED Information Centre for Life Sciences hosts literature for the biomedical and agricultural domains and is currently implementing a new topic-based recommender system. To this end, we are first exploring the literature in the biomedical domain by taking advantage of the Medical Subject Headings (MeSH) [1] descriptors assigned to Medline abstracts and available in the PubMed repository. MeSH is a comprehensive controlled vocabulary for indexing literature in the biomedical domain. Here we present preliminary results from our initial experiments on assigning Unified Medical Language System (UMLS) Semantic Network Types (STY) [2] to Medline publications annotated with MeSH terms.
With the lessons learned from this approach, we will move on to transfer learning approaches to cover biomedical publications outside the Medline scope.

Proceedings Semantic Web Applications and Tools for Healthcare and Life Sciences, February 13–16, 2023, Basel, Switzerland
EMAIL: ljgarcia@zbmed.de (A. 5)
ORCID: 0000-0002-1018-0370 (A.4); 0000-0003-3986-0510 (A.8)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Materials and Methods
We worked with the title, abstract and MeSH descriptors of a subset of the PubMed Central Open Access [3] articles, retrieved with the Biopython library [4]. From the initial set of 7.4 million abstracts from 2015 to 2022, we retained only the 2.8 million for which all elements, i.e., abstract, title, and MeSH descriptors, were available in machine-processable form. The data was further cleaned and transformed to create word embeddings. We then translated the MeSH terms to UMLS STYs to (i) reduce the number of prediction classes, from 348,860 in MeSH to 127 in UMLS STY, and (ii) prioritize those types that could be more meaningful to biomedical researchers. The dataset creation process took 2 days on an AMD Ryzen 5 3400G.

Our method consists of fine-tuning transformer models. We first initialized the models with pre-trained parameters and then fine-tuned those parameters using labeled data from the downstream task. A new layer on top of the base model captures the knowledge contained in our dataset. Preliminary exploration was done on the Hugging Face platform. We then performed hyperparameter optimization with the Hyperband algorithm [5] to find the best configuration for training the final model. We kept track of several metrics: Hamming Score, Accuracy Score, macro F1, micro F1, and Hamming Loss.
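The metrics listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in this work: the function names and the toy label matrices are hypothetical, the Hamming score (which the text does not define) is assumed to be the per-document ratio of correctly predicted labels to the union of true and predicted labels, and the accuracy score is assumed to be exact-match (subset) accuracy.

```python
from typing import Dict, Sequence


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0.0 for a label that is never true nor predicted.
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Compute multilabel metrics from 0/1 rows (one row per document)."""
    n_docs = len(y_true)
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    exact_matches = 0
    wrong_cells = 0
    hamming_score_sum = 0.0

    for t_row, p_row in zip(y_true, y_pred):
        intersection = 0
        union = 0
        row_errors = 0
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[j] += 1
                intersection += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1
            if t or p:
                union += 1
            if t != p:
                row_errors += 1
        wrong_cells += row_errors
        if row_errors == 0:
            exact_matches += 1
        # Assumed convention: a document with no true and no predicted
        # labels counts as a perfect match.
        hamming_score_sum += intersection / union if union else 1.0

    return {
        # Share of documents whose full label set is predicted exactly.
        "accuracy": exact_matches / n_docs,
        # Share of individual label decisions that are wrong.
        "hamming_loss": wrong_cells / (n_docs * n_labels),
        # Mean per-document overlap between true and predicted label sets.
        "hamming_score": hamming_score_sum / n_docs,
        # Micro F1 pools counts over all labels; macro F1 averages per-label F1.
        "f1_micro": f1_from_counts(sum(tp), sum(fp), sum(fn)),
        "f1_macro": sum(f1_from_counts(a, b, c) for a, b, c in zip(tp, fp, fn))
        / n_labels,
    }


# Toy example: 3 documents, 4 hypothetical labels.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0]]

metrics = multilabel_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy data only one of the three documents is matched exactly, so the exact-match accuracy falls well below micro F1, which is one reason to track several metrics side by side.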
The main metric guiding the hyperparameter optimization was micro F1, as our corpus exhibits high class imbalance. The MOBSTER algorithm, a model-based variant of asynchronous Hyperband available in the Syne Tune library, was run on over 10% of the articles in the corpus to find the best hyperparameter configuration. We allowed the algorithm to run for four days on a virtual machine in the de.NBI Cloud platform, equipped with an RTX6000 GPU and 128 GB of RAM. The optimal configuration was then used to train our model on our dataset. In addition, we created a proof-of-concept web application (see footnote 1) that uses the model to display the predicted STYs for a given PubMed identifier. The web application is written in vanilla JavaScript, and the model was uploaded to Hugging Face (see footnote 2).

3. Results and Discussion
The hyperparameters explored in the optimization process were the following: learning rate (LR), from 5e-6 to 1e-4; dropout rate (DR), from 0 to 1; base model, one of biobert-v1.1, distilbert-base-uncased, scibert_scivocab_uncased, Bio_ClinicalBERT, and bert-base-uncased; maximum length of input tokens (L), from 100 to 512; batch size (BS), from 4 to 64; and number of threads (NT) used for processing, from 1 to 8. The best-performing configuration had an LR of 2.0e-05, a DR of 0.0, scibert_scivocab_uncased as the base model, an L of 403, a BS of 23, and an NT of 5. After training this model on the training dataset, we obtained the following results on the validation dataset: a micro F1 of 0.489, an accuracy score of 0.196, a macro F1 of 0.416, a Hamming score of 0.389, and a Hamming loss of 0.016.

Although these scores are modest, our approach can still be further developed and improved. Multi-class, multi-label classification with as many classes as we are dealing with, i.e., 127 STY labels, does not commonly reach the scores seen in binary classification. Still, pre-trained models open new possibilities for this sort of task.

4. Acknowledgements
This work was partially supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).

5. References
[1] Dhammi IK, Kumar S. Medical subject headings (MeSH) terms. Indian J Orthop. 2014 Sep;48(5):443-4. doi: 10.4103/0019-5413.139827. PMID: 25298548; PMCID: PMC4175855.
[2] National Library of Medicine (US); 2009 Sep-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9676/
[3] PMC Open Access Subset [Internet]. Bethesda (MD): National Library of Medicine. 2003 - [cited 2022 Nov 20]. Available from: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
[4] Cock PJA, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 1;25(11):1422-3. doi: 10.1093/bioinformatics/btp163. PMID: 19304878.
[5] Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. arXiv; 2018 Jun 18. doi: 10.48550/arXiv.1603.06560.

1 https://github.com/zbmed-semtec/topic-categorization-system
2 https://wandb.ai/javtor/huggingface and https://huggingface.co/datasets/Javtor/biomedical-topic-categorization