=Paper=
{{Paper
|id=Vol-2936/paper-23
|storemode=property
|title=COLE and LYS at BioASQ MESINESP Task: large scale multilabel text categorization with sparse and dense indices
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-23.pdf
|volume=Vol-2936
|authors=Francisco J. Ribadas-Pena,Shuyuan Cao,Elmurod Kuriyozov
|dblpUrl=https://dblp.org/rec/conf/clef/Ribadas-PenaCK21
}}
==COLE and LYS at BioASQ MESINESP Task: large scale multilabel text categorization with sparse and dense indices==
CoLe and LYS at BioASQ MESINESP Task: large scale multilabel text categorization with sparse and dense indices

Francisco J. Ribadas-Pena¹, Shuyuan Cao¹ and Elmurod Kuriyozov²

¹ Grupo COLE, Departamento de Informática, Universidade de Vigo, E.S. Enxeñaría Informática, Campus As Lagoas, Ourense 32004, Spain
² Grupo LYS, Departamento de Computación y Tecnologías de la Información, Universidade de A Coruña, Facultade de Informática, Campus de Elviña, A Coruña 15071, Spain

Abstract
In this paper we describe our participation in the second edition of the mesinesp shared task in the BioASQ biomedical semantic indexing challenge. The system employed in this participation exploits different strategies for using similarity between documents to build a multi-label classifier that assigns DeCS descriptors to new documents from the descriptors previously assigned to similar documents. We have implemented and evaluated two complementary proposals: (1) the use of sparse document representations, based on the extraction of linguistically motivated index terms and their subsequent indexing with Apache Lucene, and (2) the use of indices storing dense representations of training documents obtained by means of sentence-level embeddings. The results obtained in the official runs were far from the best performing systems, but we believe that our approach offers acceptable performance given the minimal processing requirements of the proposed document similarity scheme.

Keywords
Information Retrieval, Dense Representation, Sparse Textual Representation, Multi-Label Classification

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
ribadas@uvigo.es (F. J. Ribadas-Pena); shuyuan.cao@uvigo.es (S. Cao); e.kuriyozov@udc.es (E. Kuriyozov)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The mesinesp2 [1] shared task on medical semantic indexing in Spanish is part of the BioASQ [5] 2021 challenge. Content indexing using structured vocabularies is a critical task in the management of large textual collections in scientific and technical domains, and it is essential for building sophisticated search engines that help researchers access relevant information. Although Spanish is one of the most spoken languages, most previous efforts and advances in semantic indexing have been oriented exclusively to English texts; the aim of the mesinesp challenge series is to evaluate the state of the art and to promote research in semantic indexing of Spanish scientific literature.

The second edition of the mesinesp shared task asked participating teams to label test documents with codes from DeCS (Descriptores en Ciencias de la Salud), a controlled hierarchical vocabulary which is a translation and extension of the MeSH (Medical Subject Headings) thesaurus. This edition was composed of three sub-tracks, dealing with scientific literature (Sub-track 1, mesinesp-l), clinical trials (Sub-track 2, mesinesp-t), and biomedical patents (Sub-track 3, mesinesp-p). Our team participated in all three sub-tracks, evaluating the adequacy of various approaches based on textual similarity.
The methods used in the three sub-tracks were essentially the same and are an extension of those used in our participation in the previous edition of this challenge [8, 7]. The starting idea of our method is to identify the training documents most similar to a given test document. Using the set of descriptors assigned to these similar documents, we construct the list of candidate labels to be returned as a result. In our experiments and in the submitted runs we evaluated different approaches for identifying this list of similar training documents.

As in previous editions of the BioASQ challenge, we used several natural language processing (NLP) techniques to extract linguistically motivated representations of the training documents, which are stored in an Apache Lucene textual index. This index is later queried with the contents of each test document to retrieve the most similar documents. In addition to this kind of sparse document representation, we propose the use of dense representations based on sentence-level embeddings. The dense vectors extracted from training documents are indexed in order to locate, during the categorization phase, the set of vectors closest to the dense vectors extracted from the sentence-level embeddings of the test documents to be annotated. Additionally, we tried to improve the performance of our sparse method based on Apache Lucene using an alternative type of index based on the creation of inverse DeCS code profiles, which link index terms extracted from the documents to the DeCS labels with which they have a high co-occurrence level.

The rest of this paper is organized as follows. Section 2 describes the details of our method based on sparse representations on Apache Lucene indices; the generation of inverse DeCS code profiles is also described in this section. Section 3 details the use of dense representations extracted from sentence-level embeddings. Section 4 presents the preliminary experiments with these methods that were used to parameterize the official runs sent to the challenge. Finally, in Section 5 we present the details of these official runs and discuss the results obtained by our approaches in the challenge.

2. Similarity with sparse representations

Methods following k nearest neighbors (k-NN) approaches have been widely used in the context of large scale multi-label categorization. The sparse representation approach we followed in our BioASQ challenge participation¹ is essentially a large multi-label k-NN classifier backed by an Apache Lucene² index. Our annotation scheme starts by indexing the contents of the mesinesp training articles. For each new article to be annotated, the created index is queried using the article contents as query terms. The list of similar articles returned by the indexing engine and their corresponding similarity measures are exploited to determine:

• the predicted number of descriptors to be assigned
• a ranked list of predicted DeCS codes

¹ Source code available at https://github.com/..../mesinesp2.
² https://lucene.apache.org/

The first aspect is a regression problem, which aims to predict the number of descriptors to be included in the final list, depending on the number of descriptors assigned to the most similar articles identified by the indexing engine and on their respective similarity scores. The other task is a multi-label classification problem, which aims to predict a descriptor list based on the descriptors manually assigned to the most similar mesinesp articles. In both cases, regression and multi-label classification, the similarity scores calculated by the indexing engine are exploited. Query terms employed to retrieve the similar articles are extracted from the original article contents and linked with a global OR operator to form the final query sent to the indexing engine.

In our case, the scores provided by the indexing engine are actually similarity measures computed according to the weighting scheme being employed; they do not have a uniform and predictable upper bound and do not behave like a real distance. To ensure these similarity scores have the properties of a real distance metric, we applied a normalization procedure where the most similar document retrieved from the index receives a new score close to 0.0 and the scores of the remaining similar documents are adjusted accordingly. With this information, the number of descriptors to assign to the article being annotated is predicted using a weighted average scheme, where the weight of each similar article is the inverse of its normalized distance cubed, that is, 1/d³. To create the ranked list of descriptors, a distance-weighted voting scheme is employed, associating the same weight values (the inverse of the normalized distances cubed) with the respective similar articles. Since this is a multi-label categorization task, there are as many voting tasks as candidate descriptors extracted from the articles retrieved by the indexing engine. For each candidate label, positive votes come from similar articles annotated with it and negative votes come from articles not including it.
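The following sketch condenses both steps. It is a minimal illustration rather than our exact implementation: the normalization offset and the helper names (hits, labels_of) are assumptions made for the example.

```python
from collections import defaultdict

def predict_decs_codes(hits, labels_of):
    """k-NN multi-label prediction from retrieval hits.

    hits      -- list of (doc_id, score) pairs, best match first, as
                 returned by the index for one test document
    labels_of -- dict mapping a training doc_id to its set of DeCS codes
    """
    # Normalize engine scores into distance-like values: the best hit
    # gets a distance close to 0.0 (the small offset keeping weights
    # finite is our choice; the paper does not give the exact formula).
    top, eps = hits[0][1], 0.05
    weights = {doc: 1.0 / ((top - score) / top + eps) ** 3
               for doc, score in hits}

    # (1) Regression: weighted average of the neighbours' label-set sizes,
    # each neighbour weighted by the inverse of its distance cubed.
    total = sum(weights.values())
    n = max(1, round(sum(w * len(labels_of[d])
                         for d, w in weights.items()) / total))

    # (2) Voting: for every candidate code, neighbours annotated with it
    # vote in favour and the remaining neighbours against, same weights.
    votes = defaultdict(float)
    candidates = {c for d in weights for c in labels_of[d]}
    for code in candidates:
        for d, w in weights.items():
            votes[code] += w if code in labels_of[d] else -w

    return sorted(candidates, key=votes.get, reverse=True)[:n]
```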
2.1. Text representations

Regarding article representation, we evaluated several index term extraction approaches. Our aim was to determine whether linguistically motivated index term extraction could help to improve annotation performance in the k-NN based method we have described. We employed the following methods:

Stemming based representation (STEMS). This was the simplest approach, employing stop-word removal, using a standard stop-word list for Spanish, and the default Spanish stemmer from the Snowball project³.

Morphosyntactic based representation (LEMMAS). In order to deal with morphosyntactic variation in Spanish, we employed a lemmatizer to identify lexical roots, and we replaced stop-word removal with a content-word selection procedure based on part-of-speech (PoS) tags. We delegated the linguistic processing tasks to the tools provided by the spaCy Natural Language Processing (NLP) toolkit⁴. In our case we employed the PoS tagging and lemmatization information provided by spaCy, using the standard Spanish models without any data specific to biomedical contents. Only lemmas from tokens tagged as nouns, verbs, adjectives, adverbs or unknown words are taken into account to constitute the final article representation, since these PoS are considered to carry the sentence meaning.

³ http://snowball.tartarus.org
⁴ Available at https://spacy.io/
Nominal phrases based representation (NPS). In order to evaluate the contribution of more powerful NLP techniques, we employed a surface parsing approach to identify syntactically motivated nominal phrases from which meaningful multi-word index terms can be extracted. Noun Phrase (NP) chunks identified by spaCy are selected, and the lemmas of their constituent tokens are joined together to create a multi-word index term.

Dependencies based representation (DEPS). We also employed as index terms dependency-head-modifier triples extracted by the dependency parser provided by spaCy, whose Spanish model identifies syntactic dependency labels following the Universal Dependencies (UD) scheme. The complex index terms were extracted from the following UD relationships⁵: acl, advcl, advmod, amod, ccomp, compound, conj, csubj, dep, flat, iobj, nmod, nsubj, obj, xcomp, dobj and pobj.

Named entities representation (NERS). Another type of multi-word representation taken into account are named entities. We employed the NER module in spaCy to extract general named entities (location, misc, organization, person) from article contents. We also added to this representation the set of named entities (disease, medication, procedure, symptom) made available as additional resources by the mesinesp organizers.

Keywords representation (KEYWORDS). The last kind of multi-word representation we included are keywords extracted with statistical methods from the articles' textual content. We employed the implementation of the TextRank algorithm [4] provided by the textacy library⁶.

Exact matches of DeCS labels (MATCHES). In addition to these representations, we also employed a pattern matching approach to extract exact matches of DeCS labels and of their corresponding synonyms from the abstract text. We added each of those matches to the document representation as an index term, in order to maintain its absolute occurrence frequency.

⁵ Detailed list of UD relationships available at https://universaldependencies.org/u/dep/
⁶ https://textacy.readthedocs.io
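As a rough illustration of how some of these representations can be obtained, the sketch below covers simplified versions of the LEMMAS, NPS and DEPS extractors on top of spaCy. The model name, the lowercasing, the way multi-word terms are joined, and the reduced set of dependency relations are our assumptions for the example; the real pipeline may differ in these details.

```python
import spacy

nlp = spacy.load("es_core_news_md")  # standard Spanish model, no biomedical tuning
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "X"}  # "X" approximates unknown words
DEP_RELS = {"amod", "nmod", "nsubj", "obj", "advmod", "acl"}  # subset, for brevity

def lemma_terms(text):
    """LEMMAS: lemmas of content words selected by their PoS tag."""
    return [t.lemma_.lower() for t in nlp(text) if t.pos_ in CONTENT_POS]

def np_terms(text):
    """NPS: noun phrase chunks collapsed into multi-word index terms."""
    return ["_".join(t.lemma_.lower() for t in chunk)
            for chunk in nlp(text).noun_chunks]

def dep_terms(text):
    """DEPS: dependency-head-modifier triples used as complex index terms."""
    return [f"{t.dep_}({t.head.lemma_.lower()},{t.lemma_.lower()})"
            for t in nlp(text) if t.dep_ in DEP_RELS]
```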
Table 1
Performance comparison of term extraction approaches in sparse representations.

representation   k    MiF     MiP     MiR     MaF     MaP     MaR     Acc
all              5    0.3830  0.4030  0.3650  0.2625  0.3718  0.2640  0.2502
all             10    0.4011  0.4230  0.3814  0.2727  0.4275  0.2724  0.2633
all             20    0.4117  0.4349  0.3909  0.2592  0.5206  0.2571  0.2684
all             30    0.4110  0.4343  0.3901  0.2690  0.4902  0.2667  0.2690
all             40    0.4052  0.4283  0.3845  0.2424  0.5253  0.2416  0.2617
stems            5    0.3768  0.3976  0.3581  0.2579  0.3686  0.2577  0.2464
stems           10    0.3985  0.4210  0.3784  0.2683  0.4246  0.2667  0.2618
stems           20    0.4065  0.4302  0.3854  0.2619  0.4759  0.2588  0.2654
stems           30    0.4059  0.4297  0.3846  0.2467  0.5032  0.2445  0.2635
stems           40    0.4032  0.4264  0.3823  0.2391  0.5217  0.2380  0.2601
lemmas           5    0.3762  0.3963  0.3581  0.2526  0.3656  0.2538  0.2454
lemmas          10    0.3963  0.4181  0.3766  0.2656  0.4222  0.2658  0.2595
lemmas          20    0.4057  0.4280  0.3855  0.2621  0.4765  0.2599  0.2648
lemmas          30    0.4045  0.4271  0.3841  0.2483  0.5067  0.2473  0.2616
lemmas          40    0.4009  0.4235  0.3807  0.2399  0.5274  0.2388  0.2580
ners             5    0.2811  0.3080  0.2584  0.1682  0.2675  0.1690  0.1712
ners            10    0.2974  0.3270  0.2727  0.1681  0.3149  0.1670  0.1805
ners            20    0.3072  0.3386  0.2811  0.1612  0.3687  0.1602  0.1847
ners            30    0.3064  0.3381  0.2801  0.1542  0.3950  0.1526  0.1826
ners            40    0.3033  0.3348  0.2772  0.1487  0.4091  0.1484  0.1791
keywords         5    0.3346  0.3514  0.3194  0.1991  0.3027  0.2001  0.2164
keywords        10    0.3507  0.3689  0.3342  0.2010  0.3460  0.1994  0.2261
keywords        20    0.3580  0.3770  0.3408  0.1952  0.3992  0.1908  0.2300
keywords        30    0.3592  0.3789  0.3415  0.1863  0.4454  0.1805  0.2290
keywords        40    0.3579  0.3775  0.3402  0.1749  0.4743  0.1684  0.2265
nps              5    0.2111  0.2497  0.1828  0.0895  0.1735  0.0875  0.1185
nps             10    0.2257  0.2681  0.1949  0.0889  0.2145  0.0853  0.1265
nps             20    0.2305  0.2744  0.1987  0.0811  0.2571  0.0756  0.1273
nps             30    0.2289  0.2728  0.1971  0.0713  0.2697  0.0659  0.1247
nps             40    0.2282  0.2723  0.1964  0.0650  0.2811  0.0596  0.1232
deps             5    0.3483  0.3648  0.3332  0.2138  0.3144  0.2170  0.2261
deps            10    0.3630  0.3808  0.3468  0.2180  0.3633  0.2170  0.2359
deps            20    0.3702  0.3899  0.3524  0.2122  0.4167  0.2095  0.2385
deps            30    0.3654  0.3849  0.3479  0.1981  0.4444  0.1939  0.2325
deps            40    0.3628  0.3823  0.3452  0.1894  0.4681  0.1840  0.2292
matches          -    0.2574  0.2016  0.3559  0.3171  0.3815  0.3674  0.1517

2.2. Inverted DeCS code profiles

Apache Lucene provides a general information retrieval engine that implements a vector space model with different well-known scoring algorithms, such as TF-IDF and BM25 variants. Lucene maintains an inverted index that links the index terms extracted by its analyzers to the documents where they appear, together with information about the occurrence frequencies of these index terms, in order to calculate the query scores.

As a complementary experiment to our sparse similarity proposal, instead of using a conventional retrieval system we propose our own simplified version of an inverted index at descriptor level. Each possible index term is linked to a list of DeCS codes with which it maintains a degree of co-occurrence greater than a certain threshold. The intuition behind this approach is that the presence of certain index terms in a given document is a good predictor of the convenience of labeling that document with the DeCS codes strongly linked, from a co-occurrence point of view, with those terms.

To implement this idea we used as co-occurrence metric between index terms and DeCS codes the Normalized Pointwise Mutual Information (NPMI), calculated on the training set as follows, where $t$ is an index term and $c$ a DeCS code:

$$\mathit{NPMI}(t,c) = \frac{\mathit{PMI}(t,c)}{-\log P(t,c)}$$

where $\mathit{PMI}$ is the Pointwise Mutual Information, computed as

$$\mathit{PMI}(t,c) = \log\left(\frac{P(t,c)}{P(t) \cdot P(c)}\right)$$

with the probabilities estimated from document counts:

$$P(t,c) = \frac{|\text{docs labeled with } c \text{ containing } t|}{|\text{docs in training collection}|}, \quad P(t) = \frac{|\text{docs containing } t|}{|\text{docs in training collection}|}, \quad P(c) = \frac{|\text{docs labeled with } c|}{|\text{docs in training collection}|}$$

The measure $\mathit{NPMI}(t,c)$ normalizes the values of $\mathit{PMI}(t,c)$ into $[-1, 1]$, resulting in $-1$ for a term $t$ and a DeCS code $c$ never occurring together, $0$ for independence, and $+1$ for complete co-occurrence of term $t$ and code $c$.
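These formulas translate directly into code. The following helper, with argument names of our choosing, computes NPMI from raw document counts:

```python
import math

def npmi(n_tc, n_t, n_c, n_docs):
    """Normalized PMI between an index term t and a DeCS code c.

    n_tc   -- number of training docs labeled with c that contain t
    n_t    -- number of training docs containing t
    n_c    -- number of training docs labeled with c
    n_docs -- size of the training collection
    """
    if n_tc == 0:
        return -1.0  # t and c never occur together
    p_tc, p_t, p_c = n_tc / n_docs, n_t / n_docs, n_c / n_docs
    pmi = math.log(p_tc / (p_t * p_c))
    return pmi / -math.log(p_tc)  # in [-1, 1]; 0 means independence
```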
For the construction of these inverse DeCS code profiles, we treated separately the single index terms, corresponding to representations of type lemmas, and the compound index terms, which correspond to the multi-word terms extracted by the ners, nps and keywords representations. As thresholds for the NPMI co-occurrence metric we used the values 0.25, 0.50 and 0.75, linking to each index term, both single and compound, the DeCS codes whose co-occurrence measured according to NPMI exceeds these thresholds.

With these inverted descriptor profiles we implemented a simple matching scheme to annotate an input document. Given a document to be annotated, its single and compound terms are extracted using the methods described in the preceding section. Using the described term-to-code profiles, the NPMI co-occurrence score of each candidate DeCS code is accumulated in a table every time one of the terms related to that DeCS code appears. To build the final list of DeCS code candidates to assign to a given test document, we use as a reference the set of codes predicted by the sparse similarity method described in the previous section. This reference set determines the number of DeCS codes to predict, n, and provides the additional codes needed to reach that number of output codes whenever fewer than n DeCS codes obtain a high accumulated co-occurrence score from the DeCS code profiles.
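A minimal sketch of this accumulation-and-padding scheme, assuming a data layout of our own choosing: profiles maps each term to its over-threshold (code, NPMI) pairs, and reference is the ranked code list produced by the sparse run.

```python
from collections import defaultdict

def annotate_with_profiles(terms, profiles, reference):
    """Score candidate DeCS codes with inverse code profiles.

    terms     -- single and compound index terms extracted from the document
    profiles  -- dict: term -> [(decs_code, npmi_score), ...] above threshold
    reference -- ranked code list from the sparse run; it fixes the output
                 size n and supplies padding codes when evidence is scarce
    """
    scores = defaultdict(float)
    for term in terms:
        for code, s in profiles.get(term, ()):
            scores[code] += s  # accumulate co-occurrence evidence

    n = len(reference)
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    # Pad with reference codes if fewer than n codes got any evidence.
    for code in reference:
        if len(ranked) >= n:
            break
        if code not in ranked:
            ranked.append(code)
    return ranked
```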
Table 2
Performance results with inverse DeCS code profiles.

terms type   threshold   MiF     MiP     MiR     MaF     MaP     MaR     Acc
single       0.25        0.1707  0.1803  0.1620  0.1593  0.2692  0.1752  0.0973
single       0.50        0.2860  0.3022  0.2716  0.2192  0.2906  0.2594  0.1716
single       0.75        0.4147  0.4381  0.3937  0.2858  0.4694  0.2990  0.2689
compound     0.25        0.2174  0.2297  0.2064  0.1958  0.2413  0.2285  0.1276
compound     0.50        0.3052  0.3224  0.2897  0.2557  0.2713  0.2939  0.1870
compound     0.75        0.4247  0.4486  0.4032  0.2838  0.4959  0.2901  0.2768
both         0.25        0.2417  0.2553  0.2295  0.2224  0.3025  0.2492  0.1432
both         0.50        0.3183  0.3362  0.3021  0.2694  0.3157  0.3126  0.1960
both         0.75        0.4191  0.4427  0.3979  0.2950  0.4554  0.3123  0.2721

3. Similarity with dense representations

In recent years we have experienced the rise of powerful language models such as BERT and similar approaches, which have increased the performance of multiple language processing tasks and have allowed Transformer based solutions to dominate the state of the art in many NLP fields. A natural evolution of these word embeddings is to move towards embeddings at the sentence level, with approaches such as those provided by the SentenceTransformers [6] project⁷, which converts natural language sentences into dense vectors with enriched semantics.

In this context we evaluated the possibility of taking advantage of these dense semantic representations of whole sentences as the basis for an approach similar to the one described in the previous section. We replace the use of text indexers to match similar documents with the search for similar vectors in the dense vector space where the documents of the training dataset are represented. The procedure that we follow to generate the dense vector that represents a document as a whole, either from the training or the test collections, is the following:

• The paragraphs of the document are split into sentences, and the dense vector representing every sentence is calculated using Sentence Transformers models.
• The dense representation of the whole document is calculated as the mean of the dense vectors extracted from the sentences the abstract is comprised of.

Once we have the dense representations of the training documents, we use the FAISS [3] library⁸ to create a searchable index on these dense vectors. This index allows us to efficiently calculate distances between dense vectors and to determine, for the dense vector associated with a given test abstract (our query vector), the list of the k closest training dense vectors using the Euclidean distance or other vector similarity metrics.

Having this mechanism of similarity between dense vectors, the procedure used to annotate the test documents is analogous to the one used with the sparse similarity approach with Lucene indices. In this case we can directly use the real distances between the query vector generated from the text to be annotated and the k most similar dense vectors provided by the FAISS library. With these distances, the number of labels to be assigned is estimated and the output DeCS codes are selected by means of the weighted voting scheme already described in Section 2.

⁷ https://www.sbert.net/
⁸ https://github.com/facebookresearch/faiss
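The sketch below condenses this dense pipeline with SentenceTransformers and FAISS. The model is the multilingual one mentioned in Section 4; the toy documents, already split into sentences, are placeholders we made up for the example.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Multilingual sentence-level model used in our runs (768-dimensional vectors).
model = SentenceTransformer("stsb-xlm-r-multilingual")

def doc_vector(sentences):
    """Whole-document embedding: mean of its sentence-level dense vectors."""
    return model.encode(sentences).mean(axis=0)

# Toy stand-ins for the mesinesp training and test abstracts (the real
# training collection holds over 200k documents).
train_docs = [["El paciente presenta fiebre alta.", "Se inició tratamiento antibiótico."],
              ["Ensayo clínico sobre hipertensión arterial."]]
test_doc = ["Estudio de pacientes con fiebre y tratamiento antibiótico."]

# Index the training vectors for exact Euclidean (L2) nearest-neighbour search.
vectors = np.stack([doc_vector(d) for d in train_docs]).astype("float32")
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Retrieve the k nearest training documents for the test document; their
# distances feed the same weighted voting scheme as in Section 2.
query = doc_vector(test_doc).astype("float32").reshape(1, -1)
distances, neighbours = index.search(query, 2)
```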
Table 3
Performance results with dense representations.

model   k    MiF     MiP     MiR     MaF     MaP     MaR     Acc
mono    5    0.2818  0.2916  0.2726  0.1390  0.2234  0.1415  0.1790
mono   10    0.3019  0.3130  0.2915  0.1470  0.3364  0.1446  0.1917
mono   20    0.3097  0.3206  0.2995  0.1477  0.4344  0.1419  0.1965
mono   30    0.3103  0.3213  0.3000  0.1466  0.4872  0.1394  0.1969
mono   40    0.3103  0.3213  0.3000  0.1461  0.5208  0.1377  0.1971
multi   5    0.3443  0.3595  0.3302  0.1861  0.2926  0.1883  0.2221
multi  10    0.3642  0.3793  0.3504  0.1970  0.3744  0.1961  0.2373
multi  20    0.3716  0.3869  0.3574  0.1948  0.4377  0.1931  0.2420
multi  30    0.3731  0.3879  0.3593  0.1935  0.4780  0.1904  0.2429
multi  40    0.3709  0.3853  0.3574  0.1880  0.4937  0.1841  0.2408

4. Preliminary results

In this section we briefly present the results of a series of preliminary experiments carried out to validate the methods described in the previous sections and to determine the parameters to be used in our official runs submitted to the challenge. All of these experiments used the data provided by the organization of the mesinesp2 challenge for Sub-track 1 [2], with a training dataset of 237,574 articles annotated with DeCS codes and a development dataset of 1,065 documents.

In the case of assigning DeCS codes using similarity over sparse representations supported by an Apache Lucene index, we separately evaluated the performance of the different index term extraction methods introduced in Section 2. We also tested different values for the parameter k, the number of neighbors considered to predict the number of labels and to vote on the final list of output labels. Table 1 shows the results obtained in this preliminary evaluation.

Regarding the use of inverse DeCS code profiles, we evaluated the use of single index terms, compound index terms and a mixture of both to build the profiles, using in all cases the three co-occurrence thresholds previously indicated: 0.25, 0.50 and 0.75. To determine the number of DeCS codes to predict for each test document and to provide additional codes, the result list from the best execution of the sparse similarity scheme in Table 1 was used as reference. The results of these experiments with inverse profiles are shown in Table 2.

Finally, in the case of assigning DeCS codes through similarity over dense representations, we evaluated the use of Sentence Transformers with two different pretrained language models, one multilingual model⁹ and one Spanish monolingual model¹⁰. We also evaluated different values for the k parameter. The results obtained are detailed in Table 3.

⁹ Using the pretrained sentence-level model stsb-xlm-r-multilingual from the Sentence Transformers project, which provides dense vectors with 768 dimensions.
¹⁰ Using the pretrained word-level model mrm8488/electricidad-base-generator from the Hugging Face models repository, which provides dense vectors with 256 dimensions.

Table 4
Official results for the BioASQ mesinesp2 Task.

Sub-track 1
system             rank    MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc
best               1/26    0.4837  0.5077  0.4736  0.4763  0.5237  0.3990  0.3926  0.5077  0.4618  0.3261
iria-mix          15/26    0.3725  0.4245  0.3402  0.3662  0.5345  0.2326  0.2354  0.4193  0.3351  0.2341
iria-4            17/26    0.3656  0.3938  0.3481  0.3585  0.4476  0.2877  0.2760  0.3909  0.3435  0.2279
iria-1            18/26    0.3406  0.3670  0.3238  0.3339  0.4236  0.2348  0.2315  0.3641  0.3199  0.2089
iria-2            19/26    0.3389  0.3650  0.3218  0.3319  0.4214  0.2327  0.2293  0.3622  0.3185  0.2073
mesinesp baseline 20/26    0.2876  0.2449  0.3839  0.2841  0.3720  0.3787  0.3438  0.2335  0.3746  0.1710
iria-3            21/26    0.2537  0.2758  0.2337  0.2460  0.2869  0.0854  0.0817  0.2729  0.2369  0.1480

Sub-track 2
system             rank    MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc
best               1/21    0.3640  0.3666  0.3655  0.3558  0.4177  0.3391  0.3102  0.3666  0.3614  0.2242
iria-1            12/21    0.2454  0.2303  0.2625  0.2379  0.3167  0.1863  0.1534  0.2289  0.2644  0.1411
iria-4            14/21    0.2003  0.1919  0.2142  0.1958  0.1620  0.2049  0.1571  0.1868  0.2158  0.1132
iria-mix          15/21    0.2003  0.1919  0.2142  0.1958  0.1620  0.2049  0.1571  0.1868  0.2158  0.1132
iria-3            16/21    0.1562  0.1422  0.1681  0.1502  0.1617  0.0730  0.0505  0.1419  0.1736  0.0857
mesinesp baseline 17/21    0.1288  0.0971  0.3791  0.1452  0.0977  0.3619  0.2403  0.0781  0.3678  0.0802

Sub-track 3
system             rank    MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc
best               1/21    0.4514  0.4487  0.4662  0.4494  0.5041  0.4271  0.4138  0.4487  0.4541  0.3005
iria-2             7/21    0.3203  0.3509  0.2878  0.3061  0.4980  0.3166  0.3171  0.3657  0.2849  0.1910
mesinesp baseline  8/21    0.2992  0.4117  0.2298  0.2827  0.5290  0.2497  0.2518  0.4293  0.2296  0.1779
iria-mix          13/21    0.2542  0.2790  0.2414  0.2528  0.4659  0.2284  0.2213  0.2750  0.2364  0.1526
iria-4            16/21    0.2169  0.2251  0.2085  0.2119  0.3105  0.2442  0.2289  0.2232  0.2109  0.1288
iria-1            18/21    0.1871  0.1941  0.1826  0.1844  0.2589  0.1966  0.1825  0.1926  0.1820  0.1093
iria-3            19/21    0.0793  0.0824  0.0758  0.0777  0.1120  0.0598  0.0501  0.0822  0.0765  0.0437

5. Official runs and discussion

Although our team submitted results to all the sub-tracks of the mesinesp challenge, no parameterization adapted to the specific characteristics of each sub-track was carried out. The configurations used in the official runs were identical in the three sub-tracks, with only adjustments in the number of neighbors considered according to the results of the preliminary experiments with the provided development datasets. The only exception is Sub-track 3, where a substantially different configuration was used in one of the submitted runs.
In Table 4 the official performance measures obtained by our runs in the three mesinesp2 sub-tracks are shown. The official runs submitted during our participation were created using the following configurations:

iria-1. This run followed the sparse similarity approach described in Section 2. The sparse representation of mesinesp articles was created using all of the index term extraction methods described in Section 2.1. During indexing and querying, terms appearing in 5 or fewer abstracts and terms used in more than 50% of the training documents were discarded. The number of neighbors used by the k-NN classifier was 20, and the predicted number of descriptors to be returned was increased by 10% in order to ensure slightly better values in recall related measures.

iria-2. For this run in Sub-track 1 the same setup as iria-1 was employed, but instead of using the original training dataset this run applied a sort of Label Powerset approach proposed in our previous participation in the mesinesp challenge [8]. A new training dataset annotated with "metalabels", created by joining pairs of DeCS labels with NPMI co-occurrence scores above 0.25, was indexed and processed as described for run iria-1. In Sub-track 3 the iria-2 setup followed the inverse DeCS code profile approach from Section 2.2. The employed profiles were a mix of single and compound profiles with a threshold of 0.75 for the co-occurrence scores. Instead of using the results of a sparse method as reference, this run was created directly over the set of exact matches extracted from the abstract text (matches representation).

iria-3. This run followed the dense similarity approach introduced in Section 3. We employed the multilingual model to create dense vectors for every training document and indexed those vectors in a FAISS index. The number of neighbors used by the k-NN classifier was 30 and, as in run iria-1, the predicted number of descriptors to be returned was increased by 10%.

iria-4. This run employed the inverse DeCS code profile approach introduced in Section 2.2. The employed profiles were a mix of single and compound profiles created using a threshold of 0.75 for the co-occurrence scores between terms and DeCS codes. The reference results employed by this approach were those from iria-1.

iria-mix. This run mixed the predictions of iria-1 and iria-3, adding the exact matches extracted from the textual content of the labeled abstract (matches representation). Predictions from iria-1 and iria-3 had a weight of 1.0, and the DeCS labels matched in the abstract text were weighted by 1.5.
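A minimal sketch of the iria-mix combination follows; how ties are broken and how the final number n of output codes is fixed are our assumptions, not something the run description specifies.

```python
from collections import defaultdict

def mix_runs(iria1_codes, iria3_codes, matched_codes, n):
    """Combine run outputs as in iria-mix: predictions from the sparse
    (iria-1) and dense (iria-3) runs weigh 1.0 each, while DeCS labels
    matched verbatim in the abstract weigh 1.5."""
    score = defaultdict(float)
    for code in iria1_codes:
        score[code] += 1.0
    for code in iria3_codes:
        score[code] += 1.0
    for code in matched_codes:
        score[code] += 1.5
    return sorted(score, key=score.get, reverse=True)[:n]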
The results of our participation in the mesinesp task of the BioASQ biomedical semantic indexing challenge were not very competitive, far from the performance of the winning teams. In any case, we think that our experience confirms the suitability of similarity based methods as a viable alternative for large scale text categorization in rich domains such as biomedical document collections.

We have evaluated different classification methods based on similarity over sparse and dense representations. In our experiments the best results were obtained using sparse representations combining different index term extraction techniques. This confirms the results of our previous BioASQ participation, with small performance improvements mainly due to improvements in the quality of the employed NLP tools and models. Results with similarity over dense representations were generally disappointing. The proposed method, a simple mean of sentence based dense vectors, was extremely simple, and it remains as future work to evaluate better approaches that could improve the dense representation of the documents as a whole. In the same way, we expect that fine-tuning the language models on biomedical texts will improve their performance, and this is precisely one of our lines of future work.

Acknowledgments

F.J. Ribadas-Pena and S. Cao have been supported by ERDF/MICINN-AEI (TIN2017-85160-C2-2-R and PID2020-113230RB-C22) and by the Galician Regional Government (Xunta de Galicia) under projects ED431D-2018/50 and ED431D-2017/12. E. Kuriyozov received funding from ERDF/MICINN-AEI (ANSWER-ASAP, TIN2017-85160-C2-1-R, and SCANNER-UDC, PID2020-113230RB-C21), from Xunta de Galicia (ED431C 2020/11) and from Centro de Investigación de Galicia "CITIC", funded by Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014-2020 Program) under grant ED431G 2019/01. He is also funded for his PhD by the El-Yurt-Umidi Foundation under the Cabinet of Ministers of the Republic of Uzbekistan.

References

[1] L. Gasco, A. Nentidis, A. Krithara, D. Estrada-Zavala, R-T. Murasaki, E. Primo-Peña, C. Bojo-Canales, G. Paliouras, M. Krallinger: Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials. 2021.
[2] L. Gasco, M. Krallinger, M. Antonio: MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish. Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL). Zenodo, 2021. URL: https://doi.org/10.5281/zenodo.4722925.
[3] J. Johnson, M. Douze, H. Jégou: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017. URL: https://arxiv.org/abs/1702.08734.
[4] R. Mihalcea, P. Tarau: TextRank: Bringing order into texts. Association for Computational Linguistics, 2004.
[5] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. Paliouras: Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering. 2021.
[6] N. Reimers, I. Gurevych: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2020.
[7] F.J. Ribadas-Pena, L.M. de Campos, V.M. Darriba-Bilbao, A.E. Romero: CoLe and UTAI at BioASQ 2015: Experiments with Similarity Based Descriptor Assignment. CEUR Workshop Proceedings, vol. 1391, 2015.
[8] F.J. Ribadas-Pena, S. Cao, E. Kuriyozov: CoLe and LYS at BioASQ MESINESP8 Task: Similarity Based Descriptor Assignment in Spanish. CEUR Workshop Proceedings, vol. 2696, 2020.