Biomedical Abbreviation Recognition and
          Resolution by PROSA-MED

    Soto Montalvo1 , Maite Oronoz2 , Horacio Rodrı́guez3 , Raquel Martı́nez4
                            1
                          URJC, soto.montalvo@urjc.es
                        2
                        UPV/EHU, maite.oronoz@ehu.eus
                         3
                           UPC, horacio@cs.upc.edu
                   4
                     NLP&IR Group, UNED, raquel@lsi.uned.es


      Abstract. The amount of abbreviations used in biomedical literature
      increases constantly. Despite the existence of acronym dictionaries, it
      is not viable to keep them updated with new creations. Thus, in the
      processing of biomedical texts, discovering and disambiguating acronyms
      and their expanded forms are essential aspects and this is the objective
      proposed by BARR task at IberEval 2017 Workshop. This paper presents
      our participation in this task. We propose five systems that deal with the
      problem in different ways. Three of the systems are atomic approaches,
      while two of them are combinations of the atomic systems. One of the
      systems clearly outperforms the others, both in the detection of entities
      (F-score of 0.749 in the test set) as well as identifying relations between
      short-long forms (F-score of 0.697 in the test set).

      Keywords: abbreviation recognition, abbreviation disambiguation, pat-
      terns, dictionaries


1   Introduction

The volume of biomedical texts is greater and greater, and at the same time,
the number of biomedical abbreviations is growing rapidly, being the ambigu-
ity of biomedical abbreviations a challenge. Particularly, handling abbreviations
without nearby definitions is a critical issue [3].
    The acronyms have a high reference value, in the sense that they most of
the time act as reference anchors of textual context [6]. Because of this and
the common problem of recognition of abbreviations, acronyms and symbols,
and their disambiguation (the same short form can have several different long
forms), the Biomedical Abbreviation Recognition and Resolution (BARR) track
is proposed [2].
    Usually, existing work on acronym recognition in medical domain is proposed
for English biomedical documents, being difficult to adapt these proposals to
other languages. The BARR track has the aim to promote the development and
evaluation of biomedical abbreviation identification systems in Spanish biomed-
ical documents.
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


             In this paper we present the approaches proposed by our team, in particular
         five different proposals.
             The remainder of this paper is organized as follows. Section 2 presents the
         proposed systems. Section 3 summarizes the results and discuss about them.
         Finally, conclusions are presented in Section 4.


         2      Proposed Systems
         We propose different approaches to identify entities, both short and long forms,
         in the texts and also the relations between them. Our team is composed by
         members of different universities, in a way that we propose a method by each
         university (URJC and UNED propose one method together) and two additional
         methods which combine the other three proposal in some way. Following we
         describe all the methods.

         2.1     EHU atomic approach
         This system tries to take advantage of an already developed linguistic analyser,
         called FreelingMed [7]. This analyzer has been adapted to provide all the possible
         expansions for the abbreviations and acronyms that are already stored in its
         dictionaries. FreelingMed tokenizes the text, assigns the offsets to each token
         and identifies the medical terms appearing in SNOMED CT [11] as multiword
         terms. The output of the analyzer is usually given in XML but we have changed
         it to a format that is easier to manage (see Figure 1).


                               Fig. 1. The output of the FreelingMed analyzer.

               Introducción          introducción                                         0:12
               La                     el                                                    14:16
               coexistencia           coexistencia                                          17:29
               de                     desviación estándar # disfunción erectil           30:32
               esclerosis múltiple   esclerosis múltiple                                  33:52
               EM                     electromiograma # eritema multiforme
                                      # esclerosis múltiple # estancia media
                                      # estenosis mitral                                    54:56


             Figure 1 shows that for the acronym “EM” five possible expansions or medical
         terms are stored in the dictionaries of FreelingMed: “electromiograma”, “eritema
         multiforme”, “esclerosis múltiple”, “estancia media” and “estenosis mitral”. The
         EHU approach looks for all these expansions around the “EM” acronym, and if
         any of them is found, the relation is written as result. The whole medical term
         is searched near the short form as a unique unit but not the subelements of it
         (“esclerosis” or “múltiple”).


                                                                                                                        248
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


             For the detection of entities, three main approaches are considered: i) an
         heuristic that marks word-forms that follow certain pattern usually appearing
         in abbreviations and acronyms; ii) elements that come marked as abbreviation
         or acronym from the dictionaries of FreelingMed; and, iii) elements that come
         from Freeling (in the basis of FreelingMed) marked as abbreviations referring to
         units of weight (e.g. mg), length (e.g. cm), time (e.g. min) etc.


         2.2    UPC atomic approach

         The second atomic system is based on the combination of three acronym / expan-
         sion pair extractors covering roughly the three most frequent cases of acronym
         / expansion mentions:

          – Similarity-based. This approach tries to detect in a document (within the
            title and the abstract) mentions of single words or multiwords likely to be an
            acronym (short form) and an expansion (long form) so that the two forms are
            likely able to be mapped using a set of 13 hand crafted mapping rules. These
            mapping rules are applied in decreasing order of confidence. For recognizing
            the short forms we have used the set of regular expressions proposed in [5]
            constrained for satisfying the strict form of the word shapes proposed in [1]5 .
            For long form candidates we have collected all the ngrams up to 5 words,
            constrained for satisfying loose word shapes, and discarding the candidates
            starting or ending by a stopword. The set of allowed word shapes has been
            built from the annotations in the training set. The most frequent and most
            accurate rule can be paraphrased as following: “The length of the acronym
            in chars has to be equal to the number of expansion tokens. Each character
            of the acronym should correspond to the first letter of the corresponding
            token in the expansion”.
          – Gazetteer-based. We have used a big terminology of the medical domain
            obtained from several sources (containing 103,169 terms). The terminology,
            which covers six languages was compiled following an iterative approach in a
            way that at each iteration available resources for one language were included
            and then mapped, when possible, to other languages using dbpedia links
            (”sameAs” and ”label”). The main source of resources includes for English
            Bioportal 6 and DrugBank 7 , for Spanish CIE10 8 and CIMA9 , and for French
            pyMedTermino 10 . The terminology includes both short forms and long forms
            and we have obtained possible pairs using the Similarity-based approach
            described above. 14,360 pairs were obtained in this way. An example of such
         5
           A word shape is a simple pattern aiming to represent the character level form of a
           word (case, letter, number, punctuation mark, space), e.g. the strict form shape of
           ’DM2’ is ’AA0’ while the loose shape is ’A0’.
         6
           https://bioportal.bioontology.org/
         7
           https://www.drugbank.ca/
         8
           CIE10.org
         9
           https://www.aemps.gob.es/cima/
        10
           https://bitbucket.org/jibalamy/pymedtermino


                                                                                                                        249
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


            patterns (represented as a regular expression) is u’ (LPC) .0,15 (linfoma
            primario cerebral) ’.
          – Distance-based. We have collected a set of patterns acronym / expansion
            occurring closely and frequently in the training set. The most frequent pat-
            tern is represented by the regular expression ([A-Za-z][ˆ]+ [A-Za-z][ˆ]+ [A-
            Za-z][ˆ]+) ([A − Z]3, 3), covering, for instance, “enfermedad renal crónica
            (ERC)”. This pattern occurs 73 times in the training set.

         2.3    UNED atomic approach
         This system combines a pattern-based approach with a dictionary-based ap-
         proach, and consists on two steps: abbreviations detection and definition match-
         ing for them.
             In the first step, we detect terms in capital letters or combination of capital
         letters with lowercased letters, numbers and other characters. We use parenthet-
         ical constructions as indicator of a possible abbreviation [8]. Once the abbre-
         viation (short form) is located and validated, the second step searches for its
         definition (long form) on the left side of the open parenthesis using the algo-
         rithm proposed by Schwartz and Hearst [10]. We select each word, one by one
         and combining them in each iteration, until a combination of them match with
         the short form. We have extended the algorithm of Schwartz and Hearst in or-
         der to allow the words of the long form do not appear necessarily in the same
         order that the characters of the short form. The number of words we combine
         searching the long form do not exceed the double of the characters of the short
         form.
             In addition, some special cases for the approach based on patterns are consid-
         ered. For instance, the following text has two relation pairs for the same acronym
         and two different definitions, one per language: “amino-terminal propeptide of
         procollagen type 1 (P1NP, propéptido aminoterminal del procolágeno 1)”.
             In case of the pattern-based approach does not find a valid definition for the
         abbreviation, we use a dictionary where each entry is an abbreviation and its
         possible long forms. In the same order that long forms appear in the dictionary,
         we search each one in the same sentence where the abbreviation is, and the first
         one that matches is selected as the long form of the pair of the relation. The
         dictionary used has 7,916 entries.

         2.4    Output Combination
         We have implemented three simple combination mechanisms named as and, or,
         and vot, that are applied over the results of the atomic systems. and accepts
         an annotation only in the case all three atomic systems propose it. or accepts
         all the annotations of the atomic systems just checking that no contradictions
         (partial overlapping) occur in the mentions. vot implements a democratic vota-
         tion schema, i.e. an annotation is accepted in the case at least two of the atomic
         systems have proposed it.
             For our final submission only and and or combinations were submitted.


                                                                                                                        250
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


         3    Results and Discussion
         In this section we present the results obtained identifying entities and relations
         abbreviation-definition.
             The evaluation metric used for evaluating the participating systems of the
         BARR track has been the F-score micro measure. The organization has provided
         an adaptation of the Markyt platform for the evaluation [9]. This platform al-
         lows to visualize and compare generated predictions against the Gold Standard
         annotations.
             The organization has provided training, test and background collections [4].
         Table 1 shows the results of the six systems over the training 1 data set. The
         first column shows the system, and the columns 2-4 and 5-7 show the values of
         precision, recall and F-score respectively, for the identification of Entities and
         Relations. On the other hand, the three first rows show the results for the atomic
         systems, and the last three rows the results for the systems that combine the
         previous ones. In both cases the systems are ordered by F-score value.

                Table 1. Preliminary results of the proposed systems over training 1 set.


                                              Entities      Relations
                                  System    P R F          P R F
                                  UNED     0.86 0.64 0.74 0.79 0.58 0.67
                                  UPC      0.30 0.39 0.34 0.34 0.30 0.32
                                  EHU      0.29 0.37 0.33 0.90 0.10 0.18
                                  Comb OR 0.32 0.47 0.38 0.39 0.38 0.39
                                  Comb AND 1.00 0.06 0.12 1.00 0.06 0.12
                                  Comb VOT 0.48 0.08 0.14 1.00 0.07 0.13


             The UNED system obtains high precision values, specially identifying enti-
         ties. This confirm that an approach based on patterns is suitable for this problem.
         The recall values are a bit lower due to this system does not detect nested en-
         tities. Moreover, it is probably the patterns did not detect all special cases that
         could appear in texts. The system considers some special cases which implies
         variations in the patterns, but it is possible that exist more special cases not
         considered.
             The F-score is lower identifying relations because not for all entities detected
         the system finds a valid long form. The system searches long forms in a maximum
         number of words on the left of an acronym and it can be out of this window.
             The main objective of the EHU system has been to reuse a linguistic analyzer
         that was already developed. This approximation is limited as it only can detect
         abbreviations already gathered in the dictionaries of FreelingMed and analyzed
         as an unique element in its long form. Table 1 shows in its relation column that a
         recall of 0.1 is obtained but with a precision of 0.9. Those results, in our opinion,
         are clearly related to the type of approach.


                                                                                                                        251
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


              The results of UPC system were bad. There are several explanations for this:

          – The Gazetteer-based component had a very small contribution to the global
            system.
          – The acronym detector resulted in the training phase on many failures (about
            50 false negatives and more than 200 false positive). Specially in the case of
            one character abbreviations the results were bad.
          – Also the detection of long form candidates resulted in many false positives,
            specially single word terms and multiterms starting or ending with a non
            valid POS.
          – Finally, some of the mapping rules, specially those involving a single word
            long form, presented a low accuracy.

            Table 2 presents the results over the test set. For the test set only the and
         and or combinations were done between the UNED and UPC systems output.


                                  Table 2. Results of the runs over test set.


                                              entities      relations
                                 System    P R F          P R F
                                 UNED     0.87 0.66 0.75 0.81 0.60 0.70
                                 UPC      0.65 0.21 0.32 0.31 0.10 0.15
                                 EHU      0.25 0.10 0.15 0.61 0.02 0.03
                                 Comb OR 0.77 0.37 0.50 0.57 0.27 0.36
                                 Comb AND 0.99 0.10 0.18 0.98 0.09 0.16


             Due to the big volume of data in the test set, only the 22.5 % of the corpus
         was analyzed in time in the EHU approach and there were some computer mem-
         ory problems in the UPC approach. FreelingMed is quite slow (39 sec. to analyze
         an abstract of 178 tokens) due to the volume of the dictionaries it uses: a to-
         ken dictionary of 578,539 entries and a multiword dictionary of 474,800 entries.
         To detect words usually used with a non-medical meaning, for instance “bar”,
         with a medical meaning (e.g. “bacilo acidorresistente”), a second analysis phase
         is applied with a dictionary of around 930,000 entries. In addition a mapping
         between SNOMED-CT and the Unified Medical Language (UMLS11 )(1,007,705
         entries) is applied. Not all these resources are needed for the BARR task, but
         they are already included in FreelingMed. The time problem with FreelingMed
         and the memory problems in the UPC approach have a direct relationship with
         the results in the recall column shown in Table 2.
             As can be seen on both tables (Tables 1 and 2), the UNED approach outper-
         forms by large extent every other one for both tasks and all measures. Taking
         into account this high difference, combinations produce no improvement. It is
        11
             https://www.nlm.nih.gov/research/umls/


                                                                                                                        252
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


         worth noting, however, that and combination reach the best precision for both
         tasks, at a cost of a extremely low recall. The NESTED type has not been treated
         in the entities identification task.


         4    Conclusions

         This paper has described our participation in the BARR task at IBEREVAL
         2017 workshop, which goal is to find acronyms and acronyms-long form relations.
         We have proposed five different approaches, three atomic systems and two more
         systems, which combine on different ways the atomic proposals.
            The UNED system clearly stands out among the presented systems. Being
         so clear the difference in results with the two other atomic approaches, the
         combinations are not able of improving the UNED system results.
            Dictionary-based approaches are language dependent while the ones based
         on the use of regular expressions or patterns show to be more flexible.
            There is, obviously, room for improvements. We plan to focus on performing
         combination not only as a final process but using partial results from the other
         atomic sources.


         5    Acknowledgments

         This work has been funded by the Spanish Ministry of Science and Innovation
         (PROSA-MED Project: TIN2016-77820-C3, TADEEP Project: TIN2015-70214-
         P).


         References
         1. Dai, HJ. & Lai, PT. & Chang, YC. & Tzong-Han Tsai, R.: Enhancing of chemical
            compound and drug name recognition using representative tag scheme and fine-
            grained tokenization. Journal of Cheminformatics 2015 7(S-1) (2015)
         2. Intxaurrondo, A. & Pérez-Pérez, M. & Pérez-Rodrı́guez, G. & Lopez-Martin, J.A.
            & Santamarı́a, J. & de la Peña, S. & Villegas, M. & Akhondi, S.A. & Valencia, A. &
            Lourenço, A. & Krallinger, M.: The Biomedical Abbreviation Recognition and Res-
            olution (BARR) track: benchmarking, evaluation and importance of abbreviation
            recognition systems applied to Spanish biomedical abstracts. SEPLN 2017, (2017)
         3. Kim, S. & Yoon, J.: Link-topic model for biomedical abbreviation disambiguation.
            Journal of Biomedical Informatics,53:367-380. (2015)
         4. Krallinger, M. & Intxaurrondo, A. & Lopez-Martin, J.A. & de la Peña, S. & Pérez-
            Pérez, M. & Pérez-Rodrı́guez, G. & Santamara, J. & Villegas, M. & Akhondi, S.A.
            & Lourenço, S. & Valencia, V.: Resources for the extraction of abbreviations and
            terms in Spanish from medical abstracts: the BARR corpus, lexical resources and
            document collection. SEPLN 2017, (2017)
         5. Larkey, L.S & Ogilvie, P. & Price, M.A. & Tamilio, B.: Acrophile: an automated
            acronym extractor and server. Proceedings of the fifth ACM conference on Digital
            libraries (ACM DL), pp. 205-214. (2000)


                                                                                                                        253
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


         6. Maud, E. & della Rocca, L. & Steinberger, R. & Tanev, H.: Acronym recognition
            and processing in 22 languages. Proceedings of the 9th Conference Recent Advances
            in Natural Language Processing (RANLP), pp. 237244. (2013)
         7. Oronoz, M. & Casillas, A. & Gojenola, K. & Pérez, A.: Automatic Annotation
            of Medical Records in Spanish with Disease, Drug and Substance Names. Lecture
            Notes in Computer Science, 8259. Progress in Pattern Recognition, ImageAnaly-
            sis, ComputerVision, and Applications 18th Iberoamerican Congress, CIARP 2013
            Havana, Cuba, November 20-23. (2013)
         8. Park, Y. & Byrd, R.J.: Hybrid Text Mining for Finding Abbreviations and Their
            Definitions. Proceedings of the 2001 Conference on Empirical Methods in Natural
            Language Processing, pp. 126-133. (2001)
         9. Pérez, M., Pérez, G., Rabal, O., Vazquez, M., Oyarzabal, J., Fernández, F., Va-
            lencia, A., Krallinger, M., Lourenço, A.: The Markyt visualisation, prediction
            and benchmark platform for chemical and gene entity recognition at BioCre-
            ative/CHEMDNER challenge. Database. (2016)
         10. Schwartz, A.S & Hearst, M.A.: A simple algorithm for identifying abbreviations
            definitions in biomedical text. Pacific Symposium on Biocomputing, pp. 451-462.
            (2003)
         11. SNOMED-CT, Systematized Nomenclature of Medicine-Clinical Terms. Interna-
            tional Health Terminology Standards Development Organisation (IHTSDO). 2016.
            Accessed 2014-04-09.


                                                                                                                        254