       CNIO at BARR IberEval 2017: exploring three
       biomedical abbreviation identifiers for Spanish
                  biomedical publications

                         Ander Intxaurrondo, Martin Krallinger

           CNIO - Spanish National Cancer Research Center, 28029 Madrid, Spain
                     {aintxaurron,mkrallinger}@cnio.es



       Abstract. This paper describes the adaptation and assessment of three state-
       of-the-art, publicly available and widely used biomedical abbreviation recognition
       systems originally developed to process English scientific literature. The under-
       lying assumption behind using these tools was that abbreviations and abbreviation-
       definition pairs show similar properties in texts written in both languages. The
       three systems, ADRS, Ab3P and BADREX, were evaluated at the Biomedical
       Abbreviation Recognition and Resolution (BARR) task of IberEval 2017. These
       tools rely on heuristics that exploit aspects such as the parentheses that typically
       surround abbreviation mentions, which commonly appear in the same sentence,
       after the abbreviation description or long form. The results obtained show that
       the heuristics used by these systems also work well for medical publications in
       other languages, such as Spanish and Portuguese.


1   Introduction

This paper describes the IberEval 2017 Biomedical Abbreviation Recognition and Res-
olution (BARR) task and the CNIO benchmarking participation in this track [3]. The
BARR track essentially requires finding abbreviations and their corresponding long
forms (descriptions or definitions) in medical publications written in Spanish.
    In recent years, interest in applying natural language processing tools to the
biomedical domain has increased. A considerable number of publications describe
biomedical named entity recognition approaches for entity types such as diseases/
symptoms, proteins, genes, drugs and chemicals. Moreover, a considerable number of
domain-specific information retrieval and extraction systems specifically tailored to
biomedical and medical texts have been implemented during the last decade. There is
an extensive collection of research publications in this area for the English language;
meanwhile, there is a lack of research for other languages.
    An important challenge studied intensively by the biomedical text mining community
is the recognition and resolution of abbreviations and acronyms in biomedical documents.
It is very common to find abbreviations of concepts and entities in clinical records
without their long form or definition. Due to the lack of widely followed standards for
abbreviations and their meanings, interpreting abbreviations is a challenge for both
humans and machines. Disambiguating abbreviations can help to construct medical
abbreviation dictionaries, and thus to improve the performance of different text
processing approaches applied to the biomedical domain. It may also help health care
professionals interpret ambiguous abbreviations.
    The aim of the BARR track was to promote the recognition and resolution of abbre-
viations found in Spanish medical publications. The task consisted of two tracks:

    – Abbreviation mention (entity) evaluation.
    – Abbreviation short form - long form relation detection evaluation.

    For this track, a collection of medical article abstracts written in Spanish, the
BARR document collection, was released, while a manually annotated corpus of
abbreviations and long forms, the BARR Gold Standard corpus, served to train and test
the systems of the participating teams.
    This paper is structured as follows. In section 2 we briefly introduce the tracks of
the BARR task. In section 3 we explain the three tools we used to extract abbreviations
and their long forms. In section 4 we present the results of the submissions obtained
with these tools. Finally, in section 5, we draw some conclusions.


         2     Evaluation tracks

In this section, we briefly introduce the two evaluation tracks of the BARR task.


         2.1    Entity evaluation track

In this track, participants had to detect mentions of abbreviations, i.e. short forms,
and their corresponding long forms and nested long forms in documents.
    In the BARR corpus, among other annotations, the main entity types were LONG,
SHORT, MULTIPLE and NESTED. Abbreviations and acronyms were labelled as SHORT, while
their descriptions (co-mentioned in the same sentence) were tagged as LONG; for
instance, in "tomografía computarizada (TC)", the long form "tomografía computarizada"
and the short form "TC" appear together. Short forms that were mentioned somewhere
else in the record were labelled as MULTIPLE.
    Sometimes long forms did not correspond to a continuous string of text. In these
special cases, the long form consisted of several fragments of text and was labelled
as NESTED.


         2.2    Relation evaluation track

         For the BARR track, participants had to detect mentions of short forms together with
         their long forms (SF-LF relation pairs) or nested long forms (NESTED-SF). The sys-
         tems tested through the CNIO submissions were unable to detect nested cases, and thus
         did not return results for this relation type.
             Figure 1 shows an example of manual annotation. The figure shows a long form and
         short form pair in the same context. We can find the short form mentioned again later
         in the abstract, which is labeled as Multiple.








                                        Fig. 1. Manual annotation example.


         3     Abbreviation detection and recognition

We used three different state-of-the-art tools to detect abbreviations and acronyms.
These tools were initially developed to detect short forms and their long forms in
biomedical documents. Although they were originally developed for English biomedical
text, we wanted to assess their performance on Spanish and Portuguese documents.
    These tools are Ab3P, ADRS and BADREX. They all use the following heuristic to
find abbreviations in texts: if there are opening and closing parentheses in the same
sentence, the long or short form will likely be found inside the parentheses, with its
counterpart nearby. The tools check the characters inside the parentheses and look for
words that could match those characters.
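    As an illustration, the following minimal Java sketch implements this shared
parenthesis heuristic; the class name, the ten-character limit and the example sentence
are ours for illustration, not code taken from any of the three tools.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the shared heuristic: within one sentence, take the
// parenthesised text as a candidate short form and the preceding words as
// candidate long-form material. Names and limits here are illustrative.
public class ParenCandidateFinder {

    // Matches "... long form words (CANDIDATE) ..." within one sentence.
    private static final Pattern PAREN = Pattern.compile("\\(([^()]{1,10})\\)");

    public static void findCandidates(String sentence) {
        Matcher m = PAREN.matcher(sentence);
        while (m.find()) {
            String shortForm = m.group(1).trim();
            // The text before the parenthesis is where the long form is searched.
            String leftContext = sentence.substring(0, m.start()).trim();
            System.out.println("candidate SF: " + shortForm
                    + " | context: " + leftContext);
        }
    }

    public static void main(String[] args) {
        findCandidates("Se realizó una tomografía computarizada (TC) al paciente.");
    }
}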
    Before executing each tool, we split the sentences of each abstract using IXA pipes
[1], and looked for long and short form pairs in each sentence individually. By
splitting sentences we avoided pairing entities detected far apart, e.g. at the
beginning and at the end of the abstract, so that pairs were only detected when both
forms appeared in the same context. We considered titles as a single sentence.
    None of these tools returns the offsets of short or long forms; it is up to the
users to recover them.
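    A minimal sketch of how such offsets can be recovered, assuming we search for the
first verbatim occurrence of the returned string in its sentence; the class, method and
parameter names are ours:

// Sketch of recovering character offsets for a form string returned without
// positions: find the form verbatim in its sentence and add the sentence's
// start offset within the abstract.
public class OffsetRecovery {
    public static int[] recoverOffsets(String sentence, String form, int sentenceStart) {
        int begin = sentence.indexOf(form);
        if (begin < 0) {
            return null; // not found verbatim, e.g. due to tokenisation differences
        }
        return new int[] { sentenceStart + begin, sentenceStart + begin + form.length() };
    }
}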
    The following subsections describe each tool and how we adapted it for Spanish,
where necessary.


         3.1    ADRS

ADRS1 is our name for the algorithm developed by [4], a state-of-the-art abbreviation
identifier implemented in Java. ADRS returns the abbreviations and their definitions
found within sentences.
    ADRS's main strategy consists of detecting parentheses and considering the inner
content, with a maximum of two words and ten characters, as a potential short form,
following the pattern "long-form (short-form)". Long forms must be in the same
sentence. Every character in the short form must match a character in the long form,
following the order of the characters in the short form. The heuristic also handles
the inverse pattern "short-form (long-form)".
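    For reference, the core matching step of this algorithm can be condensed as
follows. This is our Java re-sketch of the method published in [4], not the original
ExtractAbbrev code, so details may differ from the distributed implementation.

// Condensed sketch of the matching step in [4]: walk the short form
// right-to-left and match each alphanumeric character in the candidate long
// form, also right-to-left; the first character of the short form must
// additionally start a word in the long form.
public class ShortFormMatcher {

    public static String findBestLongForm(String shortForm, String candidate) {
        int sIndex = shortForm.length() - 1;
        int lIndex = candidate.length() - 1;
        while (sIndex >= 0) {
            char c = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(c)) {
                sIndex--; // punctuation in the short form is skipped
                continue;
            }
            // Move left in the candidate until this character is found; the
            // short form's first character must start a word in the candidate.
            while (lIndex >= 0
                    && (Character.toLowerCase(candidate.charAt(lIndex)) != c
                        || (sIndex == 0 && lIndex > 0
                            && Character.isLetterOrDigit(candidate.charAt(lIndex - 1))))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null; // no long form found for this candidate
            }
            sIndex--;
            lIndex--;
        }
        // The long form starts at the word containing the last matched character.
        return candidate.substring(candidate.lastIndexOf(' ', lIndex) + 1);
    }

    public static void main(String[] args) {
        // prints "tomografía computarizada"
        System.out.println(findBestLongForm("TC", "una tomografía computarizada"));
    }
}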
    To use ADRS, we integrated the original code into our system. Before executing the
tool for each publication, we split all the sentences of each abstract using IXA pipes
[1]. We then analysed each sentence with ADRS in order to obtain all short form and
long form pairs, and considered titles as a single sentence.

1 http://biotext.berkeley.edu/code/abbrev/ExtractAbbrev.java


         3.2   Ab3P

         Ab3P2 ([5]) is a state-of-the-art tool used to detect abbreviations precisely. It is devel-
         oped in C++, and simple to use. Ab3P returns all abbreviations and their long forms
         detected in each line of the document, with their estimated precision. There is a Java
         fork available for download3 , but it is still incomplete; it has not been improved for 3
         years, and does not seem to be in the plans of the authors to finish it.
    Ab3P's heuristics are based on the ADRS algorithm. The paper describes about 10
rules and 30 strategies used by its heuristic that are absent in ADRS. After applying
all strategies, the tool estimates the accuracy of each one; the strategy with the
highest estimated accuracy is considered the most reliable, and its candidate is
selected as the long form of the short form.
    We had many issues integrating the C++ code into our Java system. To solve this,
we executed Ab3P on each document individually, and later processed the outputs with
our system. To execute the tool, the input document must contain one sentence per
line, so we used IXA pipes once again to split the sentences. In each input file
analysed by Ab3P, the first line contained the title.
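    A minimal sketch of this per-document invocation is shown below. We assume here
that Ab3P's identify_abbr binary prints one pipe-delimited "SF|LF|precision" line per
detected pair; the exact invocation and output format should be checked against the
Ab3P distribution.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of running Ab3P as an external process per document, as we did to
// avoid bridging C++ and Java directly. The assumed output format is one
// pipe-delimited "SF|LF|precision" line per detected pair.
public class Ab3PRunner {
    public static void runAb3P(String inputFile) throws Exception {
        Process p = new ProcessBuilder("./identify_abbr", inputFile).start();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            String[] fields = line.trim().split("\\|");
            if (fields.length == 3) {
                String shortForm = fields[0];
                String longForm = fields[1];
                String precision = fields[2];
                // hand the pair over to the rest of the pipeline here
                System.out.println(shortForm + " -> " + longForm + " (" + precision + ")");
            }
        }
        p.waitFor();
    }
}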


         3.3   BADREX

BADREX4, developed by [2], is a GATE5 plug-in that detects abbreviations and their
long forms in text using regular expressions.
    The BADREX heuristic applies five steps to detect long and short forms. The first
step is based on the ADRS algorithm. The second step uses a subset of conditions to
discard unlikely short forms. Step 3 applies the regular expressions. Step 4 splits
potential short and long forms on non-alphabetic characters to match adjacent
characters. Finally, step 5 detects unpaired mentions of the long and short forms in
the same abstract (MULTIPLE mentions).
    Regular expressions for long and short form pairs are specified in separate files6
and can be adapted to other languages or needs. We adapted them to handle acute
accents (á), diaereses (ü) and tildes (ñ). Enabling support for these Spanish
characters improved the tool's performance drastically. Table 1 shows BADREX's
original regular expressions, the names of the files where they are stored, and our
Spanish adaptations.
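    The essential change, visible in Table 1, is replacing \w with the Unicode letter
class \p{L}. A small self-contained demonstration of why this matters in Java regular
expressions:

import java.util.regex.Pattern;

// In Java regular expressions, \w only covers [a-zA-Z_0-9] by default, so
// Spanish letters fall outside it, while \p{L} matches any Unicode letter.
public class SpanishRegexDemo {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("\\w+", "año"));     // false: ñ is not \w
        System.out.println(Pattern.matches("\\p{L}+", "año"));  // true
        System.out.println(Pattern.matches("\\p{L}+", "día"));  // true: í matched too
    }
}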
    To make use of BADREX, we integrated the GATE API into our system and executed the
plug-in directly through the API. All sentences in the abstract were split once again.

2 https://github.com/ncbi-nlp/Ab3P
3 https://github.com/aureooms/ab3p
4 https://github.com/philgooch/BADREX-Biomedical-Abbreviation-Expander
5 Open-source text analyser. https://gate.ac.uk/
6 Directory: BADREX_DIR/resources/regex




RegEx file name    RegEx for English                              RegEx for Spanish
inner_post.txt     )([,;:]\s*\w+)?[\)\]]                          })([,;:]\s*\p{L}+)?[\)\]]
inner_post_2.txt   }\2([,;:]\s*\w+)?)[\)\]]                       }\2([,;:]\s*\p{L}+)?)[\)\]]
inner_pre.txt      })\s*[\(\[](\2[\w\-\&'\.\/\+\s\]{1,           })\s*[\(\[](\2[\p{L}\-\&'\.\/\+\s]{1,
inner_pre_2.txt    }\b(\w)(\w+[\-'\/\+\s]{1,2}))\s*[\(\[](.{1,   }\b(\p{L})(\p{L}+[\-'\/\+\s]{1,2}))\s*[\(\[](.{1,
outer_pre.txt      \b((\w)\W{0,2}(\w+[\-\&'\/\+\s]{1,2}){1,      \b((\p{L})\W{0,2}(\p{L}+[\-\&'\/\+\s]{1,2}){1,
outer_pre_2.txt    \b(.{1,                                       \b(.{1,
                Table 1. BADREX regular expressions, for English and Spanish.


BiomedicalAbbreviationExpander badrex = new BiomedicalAbbreviationExpander();
URL configUrl = new File("BADREX_DIR/resources/config.txt").toURI().toURL();
URL gazUrl = new File("BADREX_DIR/resources/lookup/abbrevs.def").toURI().toURL();
badrex.setConfigFileURL(configUrl);
badrex.setGazetteerListsURL(gazUrl);
badrex.setExpandAllShortFormInstances(Boolean.FALSE);
badrex.setLongType("Long");
badrex.setLongTypeFeature("longForm");
badrex.setMaxInner(10);
badrex.setMaxOuter(10);
badrex.setSentenceType("Sentence");
badrex.setShortType("Short");
badrex.setShortTypeFeature("shortForm");
badrex.setSwapShortest(Boolean.TRUE);
badrex.setThreshold(0.9f);
badrex.setUseBidirectionMatch(Boolean.FALSE);
badrex.setUseLookups(Boolean.FALSE);

Gate.init();
File pluginsDir = Gate.getPluginsHome();
// load the ANNIE plugin into the CREOLE register
File aPluginDir = new File(pluginsDir, "ANNIE");
Gate.getCreoleRegister().registerDirectories(aPluginDir.toURI().toURL());

badrex.init();
                                         Listing 1.1. BADREX plug-in and GATE initialization.



Listing 1.1 shows how we initialised the plug-in and GATE. Listing 1.2 shows how we
executed the plug-in to analyse sentences and extract short forms and their long forms
from each sentence.

3.4    Labelling MULTIPLE entities
After obtaining all short and long form pairs, we detected the offsets of all entities
participating in each relation. If we detected a short form twice in the same sentence,
we paired the long and short forms that were closest to each other, while the other
short form was labelled as MULTIPLE.
    We also checked for appearances of each entity in the rest of the document, and
labelled those appearances as MULTIPLE as well.
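    A minimal sketch of this step, with a helper whose name and signature are ours:

import java.util.ArrayList;
import java.util.List;

// Sketch of the MULTIPLE labelling step: once the paired mention is fixed,
// every other verbatim occurrence of the entity in the abstract is labelled
// MULTIPLE. The offsets of the paired mention are excluded.
public class MultipleLabeller {
    public static List<int[]> findMultipleMentions(String abstractText, String entity,
                                                   int pairedBegin) {
        List<int[]> multiples = new ArrayList<int[]>();
        int idx = abstractText.indexOf(entity);
        while (idx >= 0) {
            if (idx != pairedBegin) {
                multiples.add(new int[] { idx, idx + entity.length() });
            }
            idx = abstractText.indexOf(entity, idx + entity.length());
        }
        return multiples;
    }
}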

         4      Results
This section presents the final results obtained after evaluating the predictions for
each track through Markyt. To prepare our system for the background set, we initially
worked with the sample set and the available training sets.








Document d = Factory.newDocument(sentence); // sentence to analyse
Corpus corpus = Factory.newCorpus("test corpus");

LanguageAnalyser sentenceSplitter = (LanguageAnalyser)
        Factory.createResource("gate.creole.splitter.RegexSentenceSplitter");
SerialAnalyserController serialController = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
serialController.add(sentenceSplitter);

corpus.add(d);
serialController.setCorpus(corpus);
serialController.execute();
corpus.clear();

badrex.setDocument(d);

badrex.execute();

AnnotationSet abbrevAS = d.getAnnotations().get("Short");
AnnotationSet termAS = d.getAnnotations().get("Long");

// HashMap of strings to store short and long forms
Map<String, String> pairs = new HashMap<String, String>();
Iterator<Annotation> termIter = termAS.iterator();
while (termIter.hasNext()) {
    Annotation term = termIter.next();
    Annotation abbrev = abbrevAS.iterator().next();
    FeatureMap termFeats = term.getFeatures();
    FeatureMap abbrevFeats = abbrev.getFeatures();
    String shortForm = (String) termFeats.get("shortForm");
    String longForm = (String) abbrevFeats.get("longForm");

    pairs.put(shortForm, longForm);
}

Factory.deleteResource(d);
                                      Listing 1.2. BADREX plug-in execution through GATE.




         4.1      Entity evaluation results

We submitted three runs for this track: the first corresponds to the Ab3P tool
(section 3.2), the second to ADRS (section 3.1), and the third to BADREX (section 3.3).
    Table 2 shows our results on the training set. We obtained the best results with
ADRS, with Ab3P not far behind. None of the systems was able to detect a single NESTED
entity. Analysing the predictions, we found that each tool detected abbreviations well,
but often failed to find the correct long form nearby.
    BADREX is a good tool for detecting abbreviations, but its heuristics for finding
the long form do not work as well. While this tool is very useful for detecting
long-short pairs in English, it still needs to be adapted for other languages.
    Table 3 shows our final results for the entity evaluation track. The same tendency
as on the training set appears here, with ADRS being the best system, Ab3P close
behind, and BADREX last.








                 Entity evaluation
Tool      Precision   Recall   F-measure
Ab3P          86.21    56.04       67.92
ADRS          83.71    59.84       69.79
BADREX        81.27    45.39       58.25
          Table 2. Entity evaluation results. Train set.


                 Entity evaluation
Tool      Precision   Recall   F-measure
Ab3P          87.95    56.72       68.96
ADRS          84.47    62.29       71.70
BADREX        81.60    47.06       59.69
          Table 3. Entity evaluation results. Test set.




         4.2   Relation evaluation results

We also submitted three runs for this track, based on the entities detected in the
entity evaluation track, one run per tool.
    Table 4 shows our results on the training set. Once more, ADRS is the best tool
for abbreviation and long form detection, with Ab3P very close behind. Meanwhile,
BADREX is far from matching the performance of the other two systems. None of the
systems was able to detect a single NESTED relation.
    Table 5 shows our final results for the relation evaluation track. Just as in the
entity evaluation track, the results follow the same tendency.
    An interesting future project would be to extend these tools to handle NESTED
entities and to associate them with long and short forms.



         5     Conclusions

In this paper, we presented the results of our participation in the entity and relation
evaluation tracks of the Biomedical Abbreviation Recognition and Resolution (BARR) task
at the IberEval 2017 workshop. We worked with three different state-of-the-art tools
for detecting long and short forms in English biomedical texts, applying them to
Spanish. We submitted three runs in total, one per tool. Two of the tools perform quite
well for Spanish, giving good results when detecting biomedical entities and when
relating abbreviations found in the text to their long forms in the same context;
meanwhile, the third tool needs more polishing to perform better in Spanish. The tools
show that algorithms designed for abbreviation resolution in English, based on patterns
and regular expressions, can also be applied to other languages, such as Spanish and
Portuguese.
    For future work, we would like to investigate improving these tools for Spanish,
in order to increase performance, detect nested entities, and make relations between
nested forms and short and long forms possible.








                Relation evaluation
Tool      Precision   Recall   F-measure
Ab3P          83.60    51.55       63.48
ADRS          78.96    54.17       64.26
BADREX        64.57    36.75       46.84
          Table 4. Relation evaluation results. Train set.


                Relation evaluation
Tool      Precision   Recall   F-measure
Ab3P          84.23    53.29       65.28
ADRS          79.05    56.90       66.17
BADREX        64.46    39.28       48.81
          Table 5. Relation evaluation results. Test set.



         6    Acknowledgments

We acknowledge the encomienda MINETAD-CNIO/OTG Sanidad Plan TL and the OpenMinted
(654021) H2020 project for funding.


         References
1. Agerri, R., Bermudez, J., Rigau, G.: IXA pipeline: efficient and ready to use
   multilingual NLP tools. In: Proceedings of the Ninth International Conference on
   Language Resources and Evaluation (LREC'14) (2014)
2. Gooch, P.: BADREX: in situ expansion and coreference of biomedical abbreviations
   using dynamic regular expressions. CoRR (2012)
3. Intxaurrondo, A., Pérez-Pérez, M., Pérez-Rodríguez, G., López-Martín, J.,
   Santamaría, J., de la Peña, S., Villegas, M., Akhondi, S., Valencia, A., Lourenço,
   A., Krallinger, M.: The Biomedical Abbreviation Recognition and Resolution (BARR)
   track: benchmarking, evaluation and importance of abbreviation recognition systems
   applied to Spanish biomedical abstracts. SEPLN (2017)
4. Schwartz, A., Hearst, M.: A simple algorithm for identifying abbreviation
   definitions in biomedical text. In: Proceedings of the Pacific Symposium on
   Biocomputing (2003)
5. Sohn, S., Comeau, D.C., Kim, W., Wilbur, W.J.: Abbreviation definition
   identification based on automatic precision estimates. BMC Bioinformatics (2008)



