=Paper=
{{Paper
|id=Vol-2936/paper-17
|storemode=property
|title=PIDNA at BioASQ MESINESP: Hybrid Semantic Indexing for Biomedical Articles in Spanish
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-17.pdf
|volume=Vol-2936
|authors=Yi Huang,Buse Giledereli,Abdullatif Koksal,Arzucan Ozgur,Elif Ozkirimli
|dblpUrl=https://dblp.org/rec/conf/clef/HuangGKOO21
}}
==PIDNA at BioASQ MESINESP: Hybrid Semantic Indexing for Biomedical Articles in Spanish==
<pdf width="1500px">https://ceur-ws.org/Vol-2936/paper-17.pdf</pdf>
<pre>
PIDNA at BioASQ MESINESP: Hybrid Semantic
Indexing for Biomedical Articles in Spanish
Yi Huang1 , Buse Giledereli2,3 , Abdullatif Koksal3 , Arzucan Ozgur3 and
Elif Ozkirimli4
1
  Data and Analytics Chapter, Roche (China) Holding Ltd., Shanghai, China
2
  Data and Analytics Chapter, Roche Müstahzarları Sanayi Anonim Şirketi, Turkey
3
  Computer Engineering Department, Bogazici University, Turkey
4
  Data and Analytics Chapter, F. Hoffmann-La Roche AG, Switzerland


                                         Abstract
                                         Semantic indexing of biomedical articles is difficult due to the extensive use of domain-specific termi-
                                         nology. The task is even more difficult when the corpus is not in English and when there are only a
                                         limited number of training data points. In this paper, we describe a hybrid semantic indexing method
                                         for biomedical articles in Spanish with the data provided for the MESINESP task (subtrack 1) of the
                                         BioASQ challenge 2021. The method integrates transformer-based multi-label text classification and
                                         named entity recognition (NER). Our approach has outperformed the baseline methodology by a wide
                                         margin in microF1 and has ranked as the second team in the challenge.

                                         Keywords
                                         Biomedical semantic indexing, Text classification, Transformer-based framework, BioASQ challenge


1. Introduction
The growing body of scientific publications makes it very hard to keep track of recent advances.
Indexing provides valuable article annotation for information retrieval, but automatic indexing
of the articles remains a major bottleneck due to the long-tailed distribution of labels from
a large set of controlled vocabulary. Indexing becomes even harder for text in non-English
languages with limited training data.
   BioASQ is a challenge in large-scale biomedical semantic indexing and question answering
[1, 2]. The aim of the Medical Semantic Indexing in Spanish (MESINESP) task [3, 4, 5] is to
provide a rich environment for studies in indexing large-scale medical and clinical clauses
written in Spanish, which would help to keep track of different aspects of the literature. Many
stakeholders such as pharmaceutical companies and researchers in clinical medicine would
benefit from systematic labeling of this emerging number of biomedical articles. In MESINESP
task, training and evaluation data are proposed for medical semantic indexing task in Spanish
(detailed statistics in Table 1). The training data of biomedical articles in Spanish are labeled
with DeCS (Health Sciences Descriptors). DeCS is a structured vocabulary developed based on

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" yi.huang.yh4@roche.com (Y. Huang); buse.giledereli@roche.com (B. Giledereli); abdullatif.koksal@boun.edu.tr
(A. Koksal); arzucan.ozgur@boun.edu.tr (A. Ozgur); elif.ozkirimli@roche.com (E. Ozkirimli)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
Figure 1: The procedure of hybrid semantic indexing. For frequent terms, transformer-based MLTC
is applied. For rare terms or terms with high F1 in the validation step, a term is labeled if there is an
entity identified with NER matching its synonyms. The results of BERT-MLTC and Synonym matching
are pooled together as the final labels.


Medical Subject Headings (MeSH) terms and serves as a common terminology for consistent
search in Spanish, Portuguese and English. For each DeCS heading, there is a list of descriptors
and synonyms from both European and Latin Spanish DeCS 2019 data sets.
   In MESINESP 8 - 2020 [3], best performing teams made effective use of pretrained transformer
models such as ELMO [6], multilingual BERT [7], and X-BERT [8] for text classification. There
were also hybrid indexing approaches on top of pretrained transformer models. One of them
from Priberam Labs used an ensemble of a Support Vector Machine model, a search engine and
a BERT-based classifier [9] and proved the performance of multiple binary classifiers. Another
team from the University of Lisboa leveraged an additional NER model to recognize MeSH
terms in the abstracts [10, 11]. Following the same line of thought, in this work, we propose a
hybrid approach where the BERT model trained on only Spanish and based on multiple binary
classifiers is used with additional entity synonym matching in the articles to DeCS terms. Our
results show that this hybrid method outperforms a solely transformers-based model in rare
classes and prove that NER integration is helpful for the long-tailed distribution problem.
   The paper is organized as follows. Section 2 describes the hybrid design of the system. Section
3 provides experimental details for our proposed approach. Section 4 discusses the results and
the evaluation for the MESINESP dataset. Finally, Section 5 summarizes our conclusions and
perspectives for future work.


2. Hybrid Semantic Indexing
Pretrained large language models such as BERT provide a powerful and high performance
framework in many NLP tasks [7] for various languages, including Spanish [12]. They have
been applied to the multi-label text classification (MLTC) task [13] to find the corresponding
labels within a label set for semantic indexing. When the label set is large, the MLTC task
Table 1
Dataset Statistics of MESINESP subtrack 1
        Dataset                                            Statistic
         Number of articles (training)                     237574
         Number of articles (development)                  1065
         Number of articles (test)                         10179 (500 with expert annotation)
         Average number of labels per article (training)   8.37
         Average number of articles per label (training)   88.64


often uses auxiliary multiple binary classifiers. However, for some of the labels, their binary
classifiers are trained with few data points and the final multi-label model is often biased to
high-frequency labels.
   Besides MLTC, NER is another approach to extract key information from the text. Using
NER, rare or unseen entities can be effectively recognized as long as there are enough training
data of the same entity category. Therefore, in this work, we explore the possibility to integrate
NER into the prediction pipeline as a pre-processing step, especially for those rare labels, to
complement MLTC.
   For this BioASQ challenge, the Text Mining Unit of the Barcelona Supercomputing Center
has extracted entities related to medications, diseases, symptoms, and medical procedures for
training, development and test sets. With the entities identified, there can be various ways to
infer the labels. In this work, we evaluate a straight-forward approach by simply matching the
entities to synonyms (provided by DeCS).
   Figure 1 illustrates the procedure of the whole hybrid method. We take advantage of the
transformer-based MLTC (BERT-MLTC) for frequent terms, and leverage the entities provided
by the Barcelona Supercomputing Center for rare terms which occur fewer than 3 times in the
training set, as well as the terms with high term-wise F1 score in the validation dataset. The
semantic indexing result is the union of outputs of these two approaches.


3. Experiments
There are 237574 articles in the training set, 1065 in the development or validation set, and
10179 in the test set for the subtrack 1 of MESINESP (Table 1). 500 articles from the test data
set are expert-annotated from LILACS and IBECS and used in the official evaluation. There are
34040 DeCS terms with synonyms, 22434 of which occur in the training set, and 17006 occur at
least 3 times.
   For BERT-MLTC, we use the BertForSequenceClassification backbone in the transformers
library [14] with the BETO (bert-base-spanish-wwm-cased) pretrained model [12]. The BETO
model has 110 million parameters. The training data are truncated with a maximal length of
512 and grouped with a batch size of 32. Only terms that occur at least 3 times in the training
set (frequent terms) are included. We use AdamW with a weight decay of 0.01 as the optimizer,
and determine the learning rate by hyperparameter search. The transformer-based framework
is implemented in PyTorch.
Table 2
Evaluation results of hybrid semantic indexing, compared with BERT-MLTC only system as well as
BERTDeCS version 4 (the best system) and MESINESP_baseline_t1 (the baseline system). pi_dna used
hybrid semantic indexing, while pi_dna_2 used BERT-MLTC
 System        MiF      EBP      EBR      EBF      MaP      MaR      MaF      MiP      MiR      Acc.
 BERTDeCS      0.4837   0.5077   0.4736   0.4763   0.5237   0.3990   0.3926   0.5077   0.4618   0.3261
 pi_dna        0.4225   0.4919   0.3876   0.4120   0.4463   0.3149   0.3082   0.4667   0.3859   0.2722
 pi_dna_2      0.3978   0.4836   0.3630   0.3915   0.4062   0.2717   0.2700   0.4520   0.3551   0.2546
 baseline      0.2876   0.2449   0.3839   0.2841   0.3720   0.3787   0.3438   0.2335   0.3746   0.1710


   For synonym matching, we perform an exact match of each full entity (recognized and pro-
vided by the Text Mining Unit of the Barcelona Supercomputing Center) to all DeCS synonyms
in a case-insensitive manner. If the corresponding DeCS term occurs fewer than 3 times in
the training set (rare term), or its term-wise F1 is larger than a threshold (0.01 in this work by
hyperparameter search), the term will be set positive in the hybrid result.
   The result of hybrid semantic indexing is named 𝑝𝑖_𝑑𝑛𝑎 in the official evaluation. We
also keep the original result of BERT-MLTC, named 𝑝𝑖_𝑑𝑛𝑎_2, to assess the improvement by
integrating NER and synonym information.


4. Results
For each system, the MESINESP subtrack 1 evaluate performance with Micro F-Measure (MiF),
Example Based Precision (EBP), Example Based Recall (EBR), Example Based F-Measure (EBF),
Macro Precision (MaP), Macro Recall (MaR), Macro F-Measure (MaF), Micro Precision (MiP),
Micro Recall (MiR) and Accuracy (Acc.). MiF is the official evaluation metric for this task. A
summary of the results for our approaches, as well as the best and the baseline systems, are
listed in Table 2.
   Among the 26 systems of subtrack 1, our system of hybrid semantic indexing has ranked as
the 6th (second as a team) in the challenge with a miF of 0.4225, whereas the plain BERT-MLTC
achieved a miF of 0.3978. Compared to the baseline model MESINESP_baseline_t1, both models
provide a significant improvement in MiF and MiP scores. This shows how strong the BERT-
MLTC model is in frequent labels. The main improvement over the baseline is for the precision
scores.
   For macro scores, the baseline model already has high MaR (0.3787) and the performance in this
metric improves only slightly for the highest scoring system (0.399). The hybrid model (pi_dna)
provides an advantage in rare classes. It outperforms the BERT-MLTC based model (pi_dna_2)
in every metric, with the biggest improvements being in MaP (+0.0401) and MaR (+0.0432)
metrics. This improvement in macro scores shows the advantage of the NER based synonym
matching in the rare classes, which is important in datasets with long-tailed distribution.
   Noteworthy, our systems are efficient in both training and inference steps. With one GPU
(A100), the training step takes less than 1 hour for one epoch, and the inference step takes 5
minutes for the whole test set (10179 articles).
  This is our first time participating MESINESP, therefore, we also compare our results with
previous systems with hybrid indexing approaches in past editions of MESINESP, with the
baseline model system as a benchmark. Last year, the best system (on miF) from Priberam
Labs, PriberamTEnsemble, achieved a miF of 0.4093 (+0.1398) and a maF of 0.2115 (-0.0701);
the best system (on miF) from the University of Lisboa, LasigeBioTM TXMC F1, achieved a
miF of 0.2507 (-0.0188) and a maF of 0.0858 (-0.1958). This year, our system (pi_dna) achieves a
miF of 0.4225 (+0.1349) and a maF of 0.3082 (-0.0356). Our system works at a similar level as
PriberamTEnsemble (with fewer base models) and better than LasigeBioTM TXMC F1.


5. Conclusions
In this study, we introduce a hybrid semantic indexing method for Spanish biomedical articles,
and show its effectiveness and efficiency in the MESINESP task subtrack 1 of the BioASQ
challenge 2021. We propose the integration of MLTC, NER and terminology as a promising
approach for non-English biomedical text mining. The proposed hybrid approach can also be
used to index document types in other domains with domain-specific language.
   As future work, we are going to explore more options in the base models of the hybrid
approaches. For example, synonym matching can be improved by taking DeCS synonyms into
the NER step [11]. In addition, hyperparameters such as the rare term cutoff 3 can be further
fine-tuned for higher performance in MaR.


References
 [1] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weis-
     senborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos,
     N. Baskiotis, P. Gallinari, T. Artieres, A. Ngonga, N. Heino, E. Gaussier, L. Barrio-Alvers,
     M. Schroeder, I. Androutsopoulos, G. Paliouras, An overview of the bioasq large-scale
     biomedical semantic indexing and question answering competition, BMC Bioinformatics
     16 (2015) 138. URL: http://www.biomedcentral.com/content/pdf/s12859-015-0564-6.pdf.
     doi:10.1186/s12859-015-0564-6.
 [2] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. . Paliouras,
     Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Se-
     mantic Indexing and Question Answering. (2021).
 [3] C. Rodriguez-Penagos, A. Nentidis, A. Gonzalez-Agirre, A. Asensio, J. Armengol-Estapé,
     A. Krithara, M. Villegas, G. Paliouras, M. Krallinger, Overview of mesinesp8, a spanish
     medical semantic indexing task within bioasq 2020 (2020).
 [4] L. Gasco, A. Nentidis, A. Krithara, D. Estrada-Zavala, , R.-T. Murasaki, E. Primo-Peña,
     C. Bojo-Canales, G. Paliouras, M. Krallinger, Overview of BioASQ 2021-MESINESP track.
     Evaluation of advance hierarchical classification techniques for scientific literature, patents
     and clinical trials. (2021).
 [5] L. Gasco, M. Antonio, M. Krallinger, Mesinesp2 corpora: Annotated data for medical
     semantic indexing in spanish, 2021. doi:10.5281/zenodo.4707104, funded by the Plan
     de Impulso de las Tecnologías del Lenguaje (Plan TL).
 [6] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
     contextualized word representations, in: Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long Papers), 2018, pp. 2227–2237.
 [7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [8] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. Dhillon, Taming pretrained transformers for
     extreme multi-label text classification, arXiv preprint arXiv:1905.02331 (2019).
 [9] R. Cardoso, Z. Marinho, A. Mendes, S. Miranda, Priberam at mesinesp multi-label classifi-
     cation of medical texts task, 2021. arXiv:2105.05614.
[10] F. M. Couto, A. Lamurias, Mer: a shell script and annotation server for minimal named
     entity recognition and linking, Journal of Cheminformatics 10 (2018) 58. URL: https:
     //doi.org/10.1186/s13321-018-0312-9. doi:10.1186/s13321-018-0312-9.
[11] A. Neves, A. Lamurias, F. M. Couto, Extreme multi-label classification applied to the
     biomedical and multilingual panorama, 2020.
[12] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert
     model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[13] R. You, Y. Liu, H. Mamitsuka, S. Zhu, BERTMeSH: deep contextual representation learn-
     ing for large-scale high-performance MeSH indexing with full text, Bioinformatics
     37 (2020) 684–692. URL: https://doi.org/10.1093/bioinformatics/btaa837. doi:10.1093/
     bioinformatics/btaa837.
[14] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural lan-
     guage processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

</pre>