<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CoLe and LYS at BioASQ MESINESP Task: large scale multilabel text categorization with sparse and dense indices.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francisco J. Ribadas-Pena</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuyuan Cao</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elmurod Kuriyozov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff>
          <institution>Grupo COLE, Departamento de Informática, E.S. Enxeñaría Informática, Universidade de Vigo</institution>
          ,
          <addr-line>Campus As Lagoas, Ourense</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Grupo LYS, Departamento de Computación y Tecnologías de la Información, Universidade de A Coruña Facultade de Informatica</institution>
          ,
          <addr-line>Campus de Elviña, A Coruña 15071</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our participation in the second edition of the mesinesp shared task in the BioASQ biomedical semantic indexing challenge. The system employed in this participation tries to exploit different strategies for using similarity between documents to build a multi-label classifier that assigns DeCS descriptors to new documents from the descriptors previously assigned to similar documents. We have implemented and evaluated two complementary proposals: (1) the use of sparse document representations, based on the extraction of linguistically motivated index terms and their subsequent indexing using Apache Lucene, and (2) the use of indices storing dense representations of training documents obtained by means of sentence-level embeddings. The results obtained in official runs were far from the best performing systems, but we believe that our approach offers an acceptable performance taking into account the minimal processing requirements that the proposed document similarity scheme supposes.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Dense Representation</kwd>
        <kwd>Sparse Textual Representation</kwd>
        <kwd>Multi-Label Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This edition of the mesinesp shared task was composed of three sub-tracks, dealing with scientific literature (Sub-track
1, mesinesp-l), clinical trials (Sub-track 2, mesinesp-t), and biomedical patents (Sub-track 3,
mesinesp-p).</p>
      <p>
        Our team has participated in the three sub-tracks evaluating the adequacy of various
approaches based on textual similarity. The methods used in the three sub-tracks have been
essentially the same and are an extension of those used in our participation at the previous
edition of this challenge [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The starting idea of our method is to identify the training
documents most similar to a given test document. Using the set of descriptors assigned to these
similar documents we construct the list of candidate labels to be returned as a result.
      </p>
      <p>In our experiments and in the submitted runs we have evaluated different approaches for
identifying this list of similar training documents. As in previous editions of the BioASQ
challenge, we have used several natural language processing (NLP) techniques to extract
linguistically motivated representations of the training documents that are stored in an Apache
Lucene textual index. This index is later queried with the contents of each test document
to retrieve the most similar documents. In addition to using this kind of sparse document
representations we have proposed the use of dense representations based on sentence-level
embeddings. The dense vectors extracted from training documents are indexed in order to locate,
during the categorization phase, the set of vectors closest to the dense vectors extracted from the
sentence-level embeddings of the test documents to be annotated. Additionally, we have tried
to improve the performance of our sparse method based on Apache Lucene using an alternative
type of index which is based on the creation of inverse DeCS code profiles that link index terms
extracted from the documents with the DeCS tags with which they have a high co-occurrence
level.</p>
      <p>The rest of this paper is organized as follows. Section 2 describes the details of our method
based on sparse representations on Apache Lucene indices. The generation of inverse DeCS
codes profiles is also described in this section. Section 3 details the use of dense representations
extracted from sentence-level embeddings. Section 4 provides the preliminary experiments
with these methods that were used to carry out the parameterization of the official runs sent to
the challenge. Finally, in section 5 we present the details of these official runs and provide a
discussion of the results obtained by our approaches in the challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Similarity with sparse representations</title>
      <p>Methods following k nearest neighbors (k-NN) approaches have been widely used in the context
of large scale multi-label categorization. The sparse representation approach we have followed
in our BioASQ challenge participation (source code available at https://github.com/..../mesinesp2) is essentially a large multi-label k-NN classifier backed
by an Apache Lucene (https://lucene.apache.org/) index.</p>
      <p>Our annotation scheme starts by indexing the contents of the mesinesp training articles. For
each new article to be annotated, the created index is queried using its contents as query
terms. The list of similar articles returned by the indexing engine and their corresponding
similarity measures are exploited to determine the following results:</p>
      <sec id="sec-2-1">
        <p>• predicted number of descriptors to be assigned
• ranked list of predicted DeCS codes</p>
        <p>The first aspect is a regression problem, which aims to predict the number of descriptors to
be included in the final list, depending on the number of descriptors assigned to the most similar
articles identified by the indexing engine and on their respective similarity scores. The other
task is a multi-label classification problem, which aims to predict a descriptors list based on the
descriptors manually assigned to the most similar mesinesp articles. In both cases, regression
and multi-label classification, similarity scores calculated by the indexing engine are exploited.
Query terms employed to retrieve the similar articles are extracted from the original article
contents and linked using a global OR operator to form the final query sent to the indexing
engine.</p>
        <p>In our case, the scores provided by the indexing engine are actually similarity measures
computed according to the weighting scheme being employed, which do not have a uniform
and predictable upper bound and do not behave like a real distance. In order to ensure these
similarity scores have the properties of a real distance metric, we have applied a normalization
procedure, where the most similar document retrieved from the index will have a new score
close to 0.0 and the scores of the rest of similar documents are adjusted in accordance with it.</p>
        <p>With this information the number of descriptors to be assigned to the article being annotated
is predicted using a weighted average scheme, where the weight of each similar article is the
inverse of its normalized distance cubed, that is, 1/d<sup>3</sup>.</p>
        <p>To create the ranked list of descriptors a distance weighted voting scheme is employed,
associating the same weight values (the inverse of normalized distances cubed) to the respective
similar articles. Since this is actually a multi-label categorization task, there are as many voting
tasks as candidate descriptors were extracted from the articles retrieved by the indexing engine.
For each candidate label, positive votes come from similar articles annotated with it and negative
votes come from articles not including it.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Text representations</title>
          <p>Regarding article representation we have evaluated several index term extraction approaches.
Our aim was to determine whether linguistically motivated index term extraction could help to
improve annotation performance in the k-NN based method we have described. We employed
the following methods:
Stemming based representation (STEMS). This was the simplest approach, which employs
stop-word removal, using a standard stop-word list for Spanish, and the default Spanish
stemmer from the Snowball project.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Morphosyntactic based representation (LEMMAS)</title>
          <p>In order to deal with morphosyntactic
variation in Spanish we have employed a lemmatizer to identify lexical roots and
we also replaced stop-word removal with a content-word selection procedure based on
part-of-speech (PoS) tags.</p>
          <p>We have delegated the linguistic processing tasks to the tools provided by the spaCy
Natural Language Processing (NLP) toolkit (https://spacy.io/). In our case we have employed the PoS
tagging and lemmatization information provided by spaCy, using the standard Spanish
models without any specific data for biomedical related contents.</p>
          <p>Only lemmas from tokens tagged as a noun, verb, adjective, adverb or as unknown
words are taken into account to constitute the final article representation, since these PoS
are considered to carry the sentence meaning.</p>
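<p>A minimal sketch of this content-word selection, assuming lemma/UPOS pairs as produced by a spaCy-like pipeline (the function name and the use of the X tag for unknown words are our illustration):</p>

```python
# Content-word selection over (lemma, UPOS) pairs, as produced for instance
# by a spaCy pipeline; only PoS classes assumed to carry sentence meaning
# (plus X for unknown words) are kept as index terms.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "X"}

def select_index_terms(tagged_lemmas):
    return [lemma for lemma, pos in tagged_lemmas if pos in CONTENT_POS]
```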
        </sec>
        <sec id="sec-2-1-3">
          <title>Nominal phrases based representation (NPS)</title>
          <p>In order to evaluate the contribution of
more powerful NLP techniques, we have employed a surface parsing approach to identify
syntactically motivated nominal phrases from which meaningful multi-word index terms
could be extracted.</p>
          <p>Noun Phrase (NP) chunks identified by spaCy are selected and the lemmas of the
constituent tokens are joined together to create a multi-word index term.</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>Dependencies based representation (DEPS)</title>
          <p>We have also employed as index terms the dependent-head-modifier triples
extracted by the dependency parser provided by spaCy. In
our case spaCy provides a dependency parsing model for Spanish that identifies syntactic
dependency labels following the Universal Dependencies (UD) scheme (detailed at https://universaldependencies.org/u/dep/). The complex index
terms were extracted from the following UD relationships: acl, advcl, advmod, amod,
ccomp, compound, conj, csubj, dep, flat, iobj, nmod, nsubj, obj, xcomp, dobj and pobj.</p>
          <p>Named entities representation (NERS). Another type of multi-word representation taken
into account are named entities. We have employed the NER module in spaCy to extract
general named entities (location, misc, organization, person) from the articles' content.
We also added to this representation the set of named entities (disease, medication,
procedure, symptom) made available as additional resources by the mesinesp organizers.</p>
          <p>
            Keywords representation (KEYWORDS). The last kind of multi-word representation we
have included are keywords extracted with statistical methods from the articles' textual
content. We have employed the implementation of the TextRank algorithm [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] provided by
the textacy library (https://textacy.readthedocs.io).
          </p>
          <p>Exact matches of DeCS labels (MATCHES). In addition to these representations, we have
also employed a pattern matching approach to extract exact matches of DeCS labels and
of their corresponding synonyms from the abstract text. In our case we have added each
of those matches to the document representation as an index term, in order to preserve
its absolute occurrence frequency.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <sec id="sec-2-2-1">
          <title>2.2. Inverted DeCS code profiles</title>
          <p>Apache Lucene provides a general information retrieval engine that implements a vector space
model with different well-known scoring algorithms, such as TF-IDF, BM25 variants, and others.
Lucene maintains an inverted index where it links the index terms extracted by its analyzers
with the documents where they appear and maintains information about occurrence frequencies
of these index terms in order to calculate the query scores.</p>
          <p>As a complementary experiment to our proposal of sparse similarity, instead of using a
conventional retrieval system we have proposed our own simplified version of an inverted
index at descriptor level. Each possible index term is linked to a list of DeCS codes with which
it maintains a degree of co-occurrence greater than a certain threshold. The intuition behind
this approach is that the presence of certain indexing terms in a given document is a good
predictor of the convenience of labeling that document with DeCS codes strongly linked, from
a co-occurrence point of view, with those terms.</p>
          <p>To implement this idea we have used as a co-occurrence metric between index terms and
DeCS codes the Normalized Pointwise Mutual Information (NPMI), calculated on the training
set as follows, being t an index term and c a DeCS code:</p>
          <p>NPMI(t, c) = PMI(t, c) / (−log P(t, c))</p>
          <p>where PMI(t, c) is the Pointwise Mutual Information computed by</p>
          <p>PMI(t, c) = log ( P(t, c) / ( P(t) · P(c) ) )</p>
          <p>with P(t, c) = |docs. labeled with c containing term t| / |docs. in training collection|, P(t) = |docs. containing term t| / |docs. in training collection| and P(c) = |docs. labeled with c| / |docs. in training collection|.</p>
          <p>
            The measure NPMI(t, c) normalizes the values of PMI(t, c) in [−1, 1], resulting in −1 for
a term t and a DeCS code c never occurring together, 0 for independence, and +1 for complete
co-occurrence of term t and code c.
          </p>
          <p>For the construction of these inverse DeCS code profiles, we have treated separately the
single index terms, corresponding to representations of type lemmas, and the compound
index terms, which correspond to the multi-word terms extracted by ners, nps and keywords
representations. As thresholds for the NPMI co-occurrence metric we have used the values 0.25,
0.50 and 0.75, linking with each index term, both single and compound, the DeCS codes whose
co-occurrence measured according to NPMI exceeds these thresholds.</p>
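<p>The construction of these profiles from raw co-occurrence counts can be sketched as follows (a simplified illustration with hypothetical names; it computes NPMI exactly as defined above and keeps the term-code links above the threshold):</p>

```python
import math
from collections import defaultdict

def build_profiles(docs, threshold=0.25):
    """docs: list of (index_terms, decs_codes) pairs from the training set.

    Returns {term: set_of_codes} keeping only the DeCS codes whose NPMI
    with the term exceeds the given threshold.
    """
    n = len(docs)
    term_df, code_df, joint = defaultdict(int), defaultdict(int), defaultdict(int)
    for terms, codes in docs:
        for t in set(terms):
            term_df[t] += 1
            for c in codes:
                joint[t, c] += 1
        for c in codes:
            code_df[c] += 1

    profiles = defaultdict(set)
    for (t, c), df in joint.items():
        p_tc = df / n                      # P(t, c)
        pmi = math.log(p_tc / ((term_df[t] / n) * (code_df[c] / n)))
        npmi = pmi / -math.log(p_tc)       # normalize into [-1, 1]
        if npmi > threshold:
            profiles[t].add(c)
    return profiles
```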
          <p>With these inverted descriptor profiles we have implemented a simple matching scheme to
annotate an input document. Given a document to be annotated, its single and compound
terms are extracted, using the methods described in the preceding section. Using the described
term-to-code profiles, the NPMI co-occurrence scores of each possible candidate DeCS code are
accumulated in a table every time one of the terms related to a given DeCS code appears.</p>
          <p>To build the final list of DeCS code candidates to be assigned to a given test document we
use as a reference the set of codes predicted by the sparse similarity method described in the
previous section. This reference set determines the number of DeCS codes to predict, n, and
provides the additional codes needed to reach that number of output codes whenever fewer
than n DeCS codes are predicted with the inverted DeCS code profiles.</p>
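<p>The described combination of accumulated NPMI scores with the reference set can be sketched like this (function names and data shapes are our assumptions, not from the released code):</p>

```python
def profile_predictions(doc_terms, profiles, npmi, reference):
    """Rank DeCS codes for a test document by accumulated NPMI scores and
    pad the list with reference codes (the sparse method's predictions)
    until it reaches n = len(reference) entries."""
    n = len(reference)
    scores = {}
    # Accumulate the NPMI score of a code every time a linked term appears.
    for t in doc_terms:
        for c in profiles.get(t, ()):
            scores[c] = scores.get(c, 0.0) + npmi[t, c]
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    # Fill the remaining slots with reference codes not already predicted.
    for c in reference:
        if len(ranked) >= n:
            break
        if c not in ranked:
            ranked.append(c)
    return ranked
```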
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Similarity with dense representations</title>
      <p>
        In recent years we have experienced the rise of powerful language models such as BERT and
other similar approaches that have increased the performance of multiple language processing
tasks and have allowed solutions based on Transformer models to dominate the
state-of-the-art in many NLP fields today. A natural evolution of these word embeddings is to
move them towards embeddings at the sentence level with approaches such as those provided by the
SentenceTransformers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] project (https://www.sbert.net/), which allows converting natural language sentences into
dense vectors with enriched semantics.
      </p>
      <p>In this context we have evaluated the possibility of taking advantage of these dense semantic
representations of whole sentences as a basis for an approach similar to the one described in
the previous section. We replace the use of text indexers to match similar documents with the
search for similar vectors in the dense vector space where documents from the training dataset
are represented. The procedure that we follow to generate the dense vectors that will represent
a document as a whole, either from training or test collections, is the following:
• The paragraphs of the document are split into sentences and the dense vector that
represents every sentence is calculated using Sentence Transformers models.
• The dense representation of the whole document is calculated as the mean vector of the
dense vectors extracted from the sentences the abstract is composed of.</p>
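<p>The mean-pooling step above can be sketched as follows (a minimal illustration; in practice the per-sentence vectors would come from a SentenceTransformer model's encode method):</p>

```python
import numpy as np

def document_embedding(sentence_vectors):
    """Mean-pool the per-sentence dense vectors of an abstract into a single
    vector representing the whole document."""
    return np.asarray(sentence_vectors).mean(axis=0)
```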
      <p>
        Once we have the dense representations of the training documents using this procedure,
we use the FAISS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] library (https://github.com/facebookresearch/faiss) to create a searchable index on these dense vectors. This index
allows us to efficiently calculate distances between dense vectors and determine for the dense
vector associated with a given test abstract (our query vector) the list of k closest training dense
vectors using the Euclidean distance or other similarity metrics on vectors.
      </p>
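<p>The search performed by such an index is equivalent to the following brute-force computation, shown here with NumPy so the sketch stays self-contained (FAISS itself, for example through its IndexFlatL2, performs the same search far more efficiently and reports squared L2 distances):</p>

```python
import numpy as np

def nearest_dense(train_vecs, query_vec, k):
    """Return (distances, indices) of the k training vectors closest to the
    query vector under Euclidean distance."""
    d = np.linalg.norm(train_vecs - query_vec, axis=1)
    idx = np.argsort(d)[:k]
    return d[idx], idx
```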
      <p>By having this mechanism of similarity between dense vectors, the procedure used to annotate
the test documents is analogous to the one used with the sparse similarity approach with
Lucene indices. In this case we can directly use the real distances between the query vector
generated from the text to be annotated and the most similar k dense vectors provided by the FAISS
library. With these distances, the number of labels to be assigned is estimated and the output
DeCS codes are selected by means of the weighted voting scheme already described in section 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary results</title>
      <p>
        In this section we briefly present the results of a series of preliminary experiments carried out
to validate the methods described in the previous sections and to characterize the parameters to
be used in our official runs submitted to the challenge. All of these experiments have used the
data provided by the organization of the mesinesp2 challenge for sub-track-1 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with a training
dataset with 237,574 articles annotated with DeCS codes and a development dataset with 1,065
documents.
      </p>
      <p>In the case of assigning DeCS codes using similarity over sparse representations supported
by an Apache Lucene index, we have separately evaluated the performance of the different
methods of extraction of index terms introduced in section 2. We also tested different values for
the parameter k, the number of neighbors considered to predict the number of labels and to
vote the final list of output labels. Table 1 shows the results obtained in this previous evaluation.</p>
      <p>Regarding the use of the inverse profiles of DeCS codes, we have evaluated the use of simple
index terms, compound index terms and a mixture of both to build the DeCS code profiles,
using in all cases the three co-occurrence thresholds previously indicated (0.25, 0.5 and 0.75). To
determine the number of DeCS codes to predict in each test document and to provide additional
codes, the result list from the best execution of the sparse similarity scheme in Table 1 has been
used as reference. The results of these experiments with inverse profiles are shown in Table 2.</p>
      <p>Finally, in the case of assigning DeCS codes through similarity over dense representations, we
have evaluated the use of Sentence Transformers with two different pretrained language models,
one multilingual model 9 and a Spanish monolingual model 10. We also evaluated different
values for the k parameter. The obtained results are detailed in Table 3.</p>
      <p>9Using pretrained sentence-level model stsb-xlm-r-multilingual from Sentence Transformers project.
Provides dense vectors with 768 dimensions.</p>
      <p>10Using pretrained word-level model mrm8488/electricidad-base-generator from Hugging Face models
repository. Provides dense vectors with 256 dimensions.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Official runs and discussion</title>
      <p>Although our team has submitted results to all the sub-tracks of the mesinesp challenge, a
parameterization adapted to the specific characteristics of each sub-track has not been carried
out. All the configurations used in the official runs have been identical in the three sub-tracks,
with adjustments only in the number of neighbors considered, according to the results of previous
experiments with the provided development datasets. The only exception is sub-track-3, where
a substantially different configuration has been used in one of the submitted runs.</p>
      <p>
        In table 4 the official performance measures obtained by our runs in the three
mesinesp2
sub-tasks are shown. The official runs submitted during our participation were created using
the following configurations:
iria1. This run followed the sparse similarity approach described in section 2. The sparse
representation of mesinesp articles was created using all of the index term extraction
methods described in section 2.1. During indexing and querying, terms appearing in 5 or
fewer abstracts and terms used in more than 50% of training documents were discarded.
The number of neighbors used by the k-NN classifier was 20 and the predicted number
of descriptors to be returned was increased by 10% in order to ensure slightly better
values in recall related measures.
iria2. For this run in Sub-track-1 the same setup as iria1 was employed, but instead of using
the original training dataset this run applied a sort of Label Powerset approach proposed in
our previous participation at the mesinesp challenge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A new training dataset annotated
with those "metalabels" was created by joining pairs of DeCS labels with NPMI
co-occurrence scores above 0.25. This dataset was indexed and processed as described for
run iria1.
      </p>
      <p>In Sub-track-3 the iria2 setup followed the inverse DeCS codes profile approach from
section 2.2. The employed profiles were a mix of single and compound profiles with a
threshold of 0.75 for the co-occurrence scores. Instead of using the results of a sparse
method as reference, this run was created directly over the set of exact matches extracted
from the abstract text (matches representation).
iria3. This run followed the dense similarity approach introduced in section 3. We employed
the multilingual model to create dense vectors for every training document and
indexed those vectors in a FAISS index. The number of neighbors used by the k-NN
classifier was 30 and, as in the iria1 run, the predicted number of descriptors to be returned
was increased by 10%.
iria4. This run employed the inverse DeCS codes profile approach introduced in section 2.2. The
employed profiles were a mix of single and compound profiles created using a threshold
of 0.75 for the co-occurrence scores between terms and DeCS codes. The reference results
employed by this approach were those from iria1.
iria-mix. This run mixed the predictions of iria1 and the predictions of iria3, adding
the exact matches extracted from the textual content of the labeled abstract (matches
representation). Predictions from iria1 and iria3 had a weight of 1.0 and the DeCS
labels matched on the abstract text were weighted by 1.5.</p>
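<p>The metalabel construction used in the iria2 run can be sketched as follows (a speculative illustration of the Label Powerset-style joining; the joined-label format and the decision to keep unpaired labels as-is are our assumptions):</p>

```python
from itertools import combinations

def metalabels(annotations, npmi, threshold=0.25):
    """Join pairs of DeCS labels whose NPMI co-occurrence exceeds the
    threshold into a single 'metalabel'; other labels are kept as-is."""
    joined, used = set(), set()
    for a, b in combinations(sorted(annotations), 2):
        if npmi.get((a, b), -1.0) > threshold:
            joined.add(a + "+" + b)
            used.update((a, b))
    return joined | (set(annotations) - used)
```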
      <p>The results of our participation in the mesinesp task of the BioASQ biomedical semantic
indexing challenge were not very competitive, far from the performance of the winning teams. In
any case, we think that our experience confirms the suitability of methods based on similarity
as a viable alternative for large scale text categorization in rich domains such as biomedical
document collections.</p>
      <p>We have evaluated different classification methods based on similarity over sparse and dense
representations. In our experiments the best results have been obtained using sparse
representations where different index term extraction techniques were combined. This confirmed the
results of our previous BioASQ participation, with small performance improvements mainly
due to improvements in the quality of the employed NLP tools and models.</p>
      <p>Results with similarity over dense representations were generally disappointing. The
proposed method, a simple mean of sentence-based dense vectors, was extremely simple and it
remains as future work to evaluate better approaches that could improve the dense representation of
the documents as a whole. In the same way, it is expected that fine-tuning the language models
using biomedical texts will improve their performance, and this is precisely one of the lines of
future work to experiment with.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>F.J. Ribadas-Pena and S. Cao have been supported by ERDF/MICINN-AEI
(TIN2017-85160-C2-2R and PID2020-113230RB-C22), and by the Galician Regional Government (Xunta de Galicia)
under projects ED431D-2018/50 and ED431D-2017/12.</p>
      <p>E. Kuriyozov received funding from ERDF/MICINN-AEI (ANSWER-ASAP,
TIN2017-85160-C2-1R, and SCANNER-UDC, PID2020-113230RB-C21), from Xunta de Galicia (ED431C 2020/11) and
from Centro de Investigación de Galicia ”CITIC”, funded by Xunta de Galicia and the European
Union (European Regional Development Fund- Galicia 2014-2020 Program), by grant ED431G
2019/01. He is also funded for his PhD by El-Yurt-Umidi Foundation under the Cabinet of
Ministers of the Republic of Uzbekistan.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Estrada-Zavala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R-T.</given-names>
            <surname>Murasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Primo-Peña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bojo-Canales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          :
          <article-title>Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials</article-title>
          .
          <year>2021</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          , M. Antonio:
          <article-title>MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish</article-title>
          .
          <year>2021</year>
          . Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL). Zenodo, URL: https://doi.org/10.5281/zenodo.4722925.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          :
          <article-title>Billion-scale similarity search with GPUs</article-title>
          .
          <source>arXiv preprint arXiv:1702.08734</source>
          ,
          <year>2017</year>
          , URL: https://arxiv.org/abs/1702.08734.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tarau</surname>
          </string-name>
          :
          <article-title>TextRank: Bringing order into texts</article-title>
          .
          <source>Association for Computational Linguistics</source>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vandorou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          :
          <article-title>Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering</article-title>
          .
          <year>2021</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          :
          <article-title>Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation</article-title>
          .
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.J.</given-names>
            <surname>Ribadas-Pena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.M.</given-names>
            <surname>de Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.M.</given-names>
            <surname>Darriba-Bilbao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.E.</given-names>
            <surname>Romero</surname>
          </string-name>
          :
          <article-title>CoLe and UTAI at BioASQ 2015: Experiments with Similarity Based Descriptor Assignment</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>1391</volume>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.J.</given-names>
            <surname>Ribadas-Pena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kuriyozov</surname>
          </string-name>
          :
          <article-title>CoLe and LYS at BioASQ MESINESP8 Task: Similarity based Descriptor Assignment in Spanish</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2696</volume>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>