Statistical graph matching for indexing Spanish
              biomedical documents?

                   Alicia Lara-Clares and Ana Garcia-Serrano

                               ETSI Informática
             Universidad Nacional de Educación a Distancia (UNED)
                         {alara,agarcia}@lsi.uned.es


      Abstract. In this work, we describe a statistical graph matching method
      for semantic indexing of documents from large-scale biomedical reposi-
      tories in Spanish language provided at the MESINESP 2020 task (8th
      BioASQ Workshop [15]). The results obtained show enough accurate be-
      havior, especially with respect to the rest of the results in the task. The
      execution time and computational requirements have been a priority in
      our approximation, which has proved to be efficient and robust for tackle
      further improvements.

      Keywords: Biomedical semantic indexing · Knowledge Discovery · Graph
      matching


1   Introduction

Although the number of medical data is growing at an exponential rate, literature
in the medical domain is often found as unstructured or semi-structured data.
In these cases, it is necessary to find methods to automatically extract and
categorize the data contained in them, using different techniques as, for example,
biomedical semantic indexing.
    The BioASQ [15] is an EU-funded support action [1] to set up a challenge on
biomedical semantic indexing and question answering (QA). The MESINESP
task is based on the use of resources such as a structured medical vocabu-
lary DeCS [2] used in two databases for Spanish health content: IBECS [3]
and LILACS [7]. The main objective of this task is the development of a se-
mantic indexing tool for Spanish content. Other objectives are: (a) determining
the current-state-of-the art, (b) identifying challenges, and (c) comparing the
strategies and results to those published for English data1 .

  Copyright c 2020 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 Septem-
  ber 2020, Thessaloniki, Greece.
?
  Supported by the UNED predoctoral grant started in April 2019 (BICI N7, Novem-
  ber 19th, 2018)
1
  https://temu.bsc.es/mesinesp/
    In this paper, we propose a statistical graph matching method implemented
as a module into the HESML framework [9–11]. This method obtains informa-
tion on the frequency with which DeCS codes are annotated to rank the list
of candidates that are extracted from the text following two different methods
described in Section 2.1.
    The results are encouraging enough, especially when compared to the rest
of the experiments and knowing the main difficulties. We will continue working
in this task with mixed approaches ([6]), looking forward to obtaining a robust
and efficient method capable of correctly indexing DeCS codes. An important
feature of this approach is its independence of the language.
    The rest of the paper is organized as follows. In section 2, we describe the
architecture of the system. Section 3 describes the evaluation process and the
results obtained. Finally, section 4 outlines the conclusions and future work.


2     System description
The MESINESP task, is the first task on semantic indexing of Spanish medical
texts, provides a dataset divided in training (318.658 documents), development
(750 documents) and test (23.509 documents) sets. The average of DeCS codes
per document is 8.12, and the document with the maximum number of codes
has a total of 53 different ones. At a glance, the training set give us the idea of
a scattered distribution of the codes annotated in the documents, as described
in Table 1. Our proposed method try to overcome this problem using statistical
information about the frequency of a DeCS code annotated in a document.


           Total codes that appear more than 10 of documents% 6
           Total codes that appear more than 1 of documents% 48
           Total codes that appear less than 1 of documents%     33654
           Total codes that never appear                         22523
           Total codes                                           33702
              Table 1. Frequencies of the codes in the training set2


2.1   Proposed method
The method proposed herein is a first approximation focused on the efficiency
and robustness of the system. Figure 1 represents the information flow to anno-
tate the test set.
    The process stages are the following:
 1. Creation of the frequencies graph from the training and development data.
    In this step, a directed graph is developed, where each DeCS code represent
    a node, and the edges are the number of times the codes co-occur in each
    document.
                           Fig. 1. Information resources.


 2. Parse the DeCS ontology and the list of codes and descriptors provided for
    the competition 3 .
 3. Split the sentences and identification of the chunks for each sentence using
    the Stanford CoreNLP library [12].

    Once the test dataset is processed, the next step is the alignment of the
documents with a list of DeCS code candidates. In this work, there has been
carried out using two different methods, (a) exact matching and (b) graph-
based matching. In the first one, every possible descriptor is matched with each
document. If the descriptor exists in the text, it is selected as a candidate. In the
second method, every chunk is compared with the list of possible descriptors,
aligning as much DeCS codes as possible.
    Other experiments have been planned but could not be carried out due to
time constraints. The first one is the alignment of chunks with DeCS codes
using a semantic sentence similarity measure, as for example, the Jaccard simi-
larity [13, 4]. The HESML framework provides a set of semantic similarity mea-
sures that allows the comparison between every descriptor with all the available
chunks. The problem of this approximation is that the annotation of each doc-
ument takes about 30 seconds, so the method would take more than a week to
annotate all the documents.


3     Evaluation and results

MESINESP task has been evaluated using the following measures: (a) Accuracy
(Acc.), (b) Example Based Precision (EBP), (c) Example Based Recall (EBR),
(d) Example Based F-Measure (EBF), (e) Macro Precision (MaP), (f) Macro
Recall (MaR), (g) Macro F-Measure (MaF), (h) Micro Precision (MiP), (i) Micro
3
    Downloaded from https://temu.bsc.es/mesinesp/index.php/resources/
Recall (MiR) and (j) Micro F-Measure (MiF), but we only include in this section
the Micro F-measure, since it is the official evaluation measure for this task.
   Our results are shown in Table 2 including the first position and the baseline
results of the task.


         System                   MiF EBF MaF Acc Position (MiF)
         X-BERT BioASQ F1 0,1071 0,1051 0,0008 0,0575             1
         Graph matching (ours) 0,0836 0,0846 0,001 0,0451        15
         Exact matching (ours) 0,0826 0,0829 0,001 0,0442        18
         BioASQ Baseline         0,0161 0,0217 0,0022 0,0116     22
Table 2. This table shows the results or the two methods (Graph matching and Exact
matching), as well as the best and the baseline ones


    A total of 25 methods have been submitted to the MESINESP competition.
Our work has focused on the efficiency and robustness of the method and ex-
ecutes the whole process in less than 30 minutes without requiring a training
process. We have used the DeCS ontology and our new hypothesis is that the
results will improve using another ontology-based similarity measure to the con-
cept alignment without losing efficiency of the system.
    The main difficulty in this task is derived from the use of a purely statisti-
cal method that prioritizes the most frequent terms and considers neither the
ontology hierarchy for avoiding the annotation of redundant child-parent terms
nor the less frequent codes that the experts annotate in the gold standard. For
example, the terms ”tumor de mediastino” and ”mediastino” are annotated us-
ing our approximation for the document ID ”biblio-1000005”, but the experts
only annotate the terms ”mediastino” and ”neoplasias del timo”. But, it hap-
pens that the term ”tumor de mediastino” is explicitly written in the title of
the document and, for this reason, it is considered as relevant for our algorithm.
On the other hand, the term ”pesar” is wrongly considered as relevant for our
algorithm in most of the documents, because no semantics is considered in the
selection of candidates.


4   Conclusions and Future Work

In this work, we describe a statistical graph matching method for semantically
index documents from large-scale biomedical repositories in Spanish language
provided at the MESINESP 2020 task [15]). The execution time and computa-
tional requirements have been priority factors in our approximation, giving us a
first approach that is efficient and sufficiently robust to improve the results in
the future.
    Addressing the task, we understand that the Spanish language has not been
thoroughly studied in a semantic indexing task, and there are only a few available
tools. For example, there are some Name Entity Recognizers (NER) that find
UMLS concepts in Spanish biomedical documents, such as QuickUMLS [14] or
IXAMedTagger [8]. But, as far as we know, there is not a NER tool for aligning
DeCS codes with texts. Even more, the code sets tend to follow biased, unbal-
anced, and scattered distributions, as shown in a similar task of indexing CIE-10
codes for Spanish clinical documents [5].
   In the future work, we are going to focus on the integration of the parser for
the DeCS ontology in HESML. We will try to prove that our proposal will over-
come the problems with the running time of the experiments based on sentence
similarity measures by allowing the use of different ontology-based measures.
Finally, we want also to test a new model that recognizes co-occurrence patterns
beyond the basic measure of the frequency of occurrence of terms.


References
 1. The bioasq challenge — bioasq.org. http://www.bioasq.org/, accessed: 2020-6-8
 2. DeCS - health sciences descriptors. http://decses.bvsalud.org/I/homepagei.
    htm, accessed: 2020-6-10
 3. IBECS. https://www.isciii.es/QueHacemos/Servicios/Biblioteca/Paginas/
    IBECS.aspx, accessed: 2020-6-10
 4. . Manning, C.D., Manning, C.D., Schütze, H.: Foundations of Statistical Natural
    Language Processing. MIT Press (1999)
 5. Almagro, M., Unanue, R.M., Fresno, V., Montalvo, S.: ICD-10 coding of spanish
    electronic discharge summaries: An extreme classification problem. IEEE Access
    8, 100073–100083 (2020)
 6. Benavent, J., Benavent, X., de Ves, E., Granados, R., Garcia-Serrano, A.: Expe-
    riences at imageclef 2010 using cbir and tbir mixing information approaches. In:
    CEUR Proceedings. 2010 CLEF September, Padua, Italy. vol. 1176 (2010)
 7. BIREME               (http://www.bireme.br/).                LILACS             Unity
    (http://metodologia.lilacs.bvs.br/): LILACS database) :. http://metodologia.
    lilacs.bvsalud.org/php/level.php?&component=19, accessed: 2020-6-10
 8. Gojenola, K., Oronoz, M., Pérez, A., Casillas, A., Taldea, I.X.A.: IxaMed: Applying
    freeling and a perceptron sequential tagger at the shared task on analyzing clinical
    texts. In: SemEval@ COLING. pp. 361–365. ixa.eus (2014)
 9. Lastra-Diaz, J.J., Garcia-Serrano, A., Batet, M., Fernandez, M., Chirigati, F.:
    Hesml: a scalable ontology-based semantic similarity measures library with a set
    of reproducible experiments and a replication dataset. Information Systems 66,
    97–118 (2017)
10. Lastra-Dı́az, J.J., Goikoetxea, J., Taieb, M.A.H., others: A reproducible survey on
    word embeddings and ontology-based methods for word similarity: linear combi-
    nations outperform the state of the art. Applications of Artificial . . . (2019)
11. Lastra-Dı́az, J.J., Goikoetxea, J., Hadj Taieb, M.A., Garcı́a-Serrano, A., Aouicha,
    M.B., Agirre, E.: Reproducibility dataset for a large experimental survey on word
    embeddings and ontology-based methods for word similarity. Data Brief 26, 104432
    (Oct 2019)
12. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The
    stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd
    annual meeting of the association for computational linguistics: system demon-
    strations. pp. 55–60. aclweb.org (2014)
13. P. Jaccard: Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. sci.
    nat. 44, 223–270 (1908)
14. Soldaini, L., Goharian, N.: Quickumls: a fast, unsupervised approach for medical
    concept extraction. In: MedIR workshop, sigir. pp. 1–4. ir.cs.georgetown.edu (2016)
15. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers,
    M.R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almiran-
    tis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.C.N.,
    Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I.,
    Paliouras, G.: An overview of the BIOASQ large-scale biomedical semantic in-
    dexing and question answering competition. BMC Bioinformatics 16, 138 (Apr
    2015)