<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CNIO at BARR IberEval 2017: exploring three biomedical abbreviation identifiers for Spanish biomedical publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ander Intxaurrondo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Krallinger</string-name>
          <email>mkrallingerg@cnio.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNIO - Spanish National Cancer Research Center</institution>
          ,
          <addr-line>28029 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>278</fpage>
      <lpage>285</lpage>
      <abstract>
        <p>This paper describes the adaptation and assessment of three state-of-the-art, publicly available and widely used biomedical abbreviation recognition systems originally developed to process English scientific literature. The underlying assumption behind using these tools was that abbreviations and abbreviation-definition pairs show similar properties in texts written in both languages. The three systems, ADRS, Ab3P and BADREX, were evaluated at the Biomedical Abbreviation Recognition and Resolution (BARR) task of IberEval 2017. These three tools are based on heuristics that exploit aspects such as the presence of parentheses surrounding abbreviation mentions, which commonly appear in the same sentence right after the abbreviation description or long form. The results obtained showed that the heuristics used by these systems also work well for medical publications in other languages, such as Spanish and Portuguese.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the IberEval 2017 Biomedical Abbreviation Recognition and
Resolution (BARR) task and the CNIO benchmarking participation in this track ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). The
BARR track essentially requires finding abbreviations and their corresponding long
forms (descriptions or definitions) in medical publications written in Spanish.
      </p>
      <p>In recent years, interest in applying natural language processing tools to
the biomedical domain has increased. A considerable number of publications describe
biomedical named entity recognition approaches for entity types
such as diseases/symptoms, proteins, genes, drugs and chemicals. Moreover, a
considerable number of information retrieval and extraction systems
specifically tailored to process biomedical and medical texts have been implemented during
the last decade. There is an extensive collection of research publications for the
English language in this area, whereas research for other languages is scarce.</p>
      <p>An important challenge studied intensely by the biomedical text mining community is the
recognition and resolution of abbreviations and acronyms in biomedical documents. It is very
common to find abbreviations of concepts and entities in clinical records without their
long form or definition. Due to the lack of widely followed standards for
abbreviations and their meanings, interpreting abbreviations is a challenge for both humans
and machines. Disambiguating abbreviations can help to construct medical
abbreviation dictionaries, and thus to improve the performance of different text processing
approaches applied to the biomedical domain. It may also help health care professionals
to interpret ambiguous abbreviations.</p>
      <p>The aim of the BARR track was to promote the recognition and resolution of
abbreviations found in Spanish medical publications. The task consisted of two tracks:
– Abbreviation mention (entity) evaluation.
– Abbreviation short form - long form relation detection evaluation.</p>
      <p>For this track, a collection of medical article abstracts written in Spanish
was released, the BARR document collection, while a manually annotated corpus of
abbreviations and long forms, the BARR Gold Standard corpus, served to train and test
the systems of participating teams.</p>
      <p>This paper is structured as follows. In section 2 we briefly introduce the tracks of
the BARR task. In section 3 we explain the three tools we used to extract abbreviations
and their long forms. In section 4 we focus on the results of the submissions produced
with these tools. Finally, in section 5, we draw some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Evaluation tracks</title>
      <p>In this section, we briefly introduce the two evaluation tracks of the
BARR task.</p>
      <sec id="sec-2-1">
        <title>Entity evaluation track</title>
        <p>In this track, participants had to detect mentions of abbreviations, i.e. short forms and
their corresponding long forms and nested long forms in documents.</p>
        <p>In the BARR corpus, among other annotations, the main entity types
were LONG, SHORT, MULTIPLE and NESTED mentions. Abbreviations
and acronyms were labelled as SHORT, while their descriptions (co-mentioned in the
same sentence) were tagged as LONG. Note that short forms that were mentioned
somewhere else in the record were labelled as MULTIPLE.</p>
        <p>Sometimes long forms did not correspond to a continuous string of text. In these
special situations, long forms consisted of several fragments of text and
were labelled as NESTED.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Relation evaluation track</title>
        <p>In this track, participants had to detect mentions of short forms together with
their long forms (SF-LF relation pairs) or nested long forms (NESTED-SF). The
systems tested through the CNIO submissions were unable to detect nested cases, and thus
did not return results for this relation type.</p>
        <p>Figure 1 shows an example of manual annotation. The figure shows a long form and
short form pair in the same context. The short form is mentioned again later
in the abstract, where it is labelled as MULTIPLE.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Abbreviation detection and recognition</title>
      <p>We used three different state-of-the-art tools to detect abbreviations and acronyms.
These tools were initially developed to detect short forms and their long forms in
biomedical documents. Although they were originally developed for English
biomedical text, we wanted to test their performance on Spanish and
Portuguese documents.</p>
      <p>We refer to these tools as Ab3P, ADRS and BADREX. They all use the following
heuristic to find abbreviations in texts: if there are opening and closing parentheses in
the same sentence, we will likely find the long or the short form inside the parentheses,
and the other form nearby. The tools check the characters inside the parentheses, and look
for nearby words that could match those characters.</p>
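      <p>The shared heuristic can be sketched in a few lines of Python. This fragment is only an illustration written for this description, not code taken from Ab3P, ADRS or BADREX; the function name parenthesis_candidates and the regular expression are assumptions. For each bracketed span in a sentence it returns the text before the span and the text inside it, the two places where a short form and its long form are expected.</p>
      <preformat><![CDATA[
```python
import re

# Matches a span in round or square brackets that contains no nested brackets.
PAREN_SPAN = re.compile(r"[(\[]([^()\[\]]+)[)\]]")


def parenthesis_candidates(sentence):
    """Return (text_before, text_inside) for every bracketed span in a sentence."""
    candidates = []
    for match in PAREN_SPAN.finditer(sentence):
        before = sentence[:match.start()].strip()
        inside = match.group(1).strip()
        candidates.append((before, inside))
    return candidates
```
]]></preformat>
      <p>Deciding which side holds the short form, and which words of the other side make up the long form, is where the three tools differ.</p>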
      <p>
        Before executing each tool, we split the sentences in the abstracts using IXA pipes
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and looked for long and short form pairs in each sentence individually. By splitting
sentences, we avoided pairing short and long forms detected at
the beginning and the end of the abstract, so that pairs were only detected when they
were in the same context. We considered each title as a single sentence.
      </p>
      <p>None of these tools returns the offsets of short or long forms. It is up to the user to
compute them.</p>
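      <p>One simple way to recover offsets, sketched below under the assumption that each sentence appears verbatim in the document, is to locate the sentence first and then the form inside it. The function name locate_form and its parameters are hypothetical and do not come from any of the three tools.</p>
      <preformat><![CDATA[
```python
def locate_form(document, sentence, form, search_from=0):
    """Return (start, end) document offsets of form inside sentence, or None.

    search_from is the document position at which to start looking for the
    sentence, so that repeated sentences can be told apart.
    """
    sentence_start = document.find(sentence, search_from)
    if sentence_start < 0:
        return None
    local_start = sentence.find(form)
    if local_start < 0:
        return None
    start = sentence_start + local_start
    return start, start + len(form)
```
]]></preformat>
      <p>Only the first occurrence of the form in the sentence is returned; a sentence that repeats the form needs additional handling.</p>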
      <p>The following subsections describe each tool, and how we adapted it for
Spanish where necessary.</p>
      <sec id="sec-3-1">
        <title>ADRS</title>
        <p>
          ADRS is the name we use for the algorithm developed by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], a state-of-the-art algorithm
developed in Java. ADRS returns the abbreviations and their definitions found within
sentences.
        </p>
        <p>ADRS’s main strategy consists of detecting parentheses and considering the inner
content, with a maximum of two words and ten characters, as a potential short form,
following the pattern “long-form (short-form)”. The long form must be in the same
sentence. Every character in the short form must match a character in the long form, in
the same order as the characters appear in the short form. The heuristic also handles the inverse pattern
“short-form (long-form)”.</p>
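        <p>The matching rule can be illustrated with a simplified Python sketch. It is not the original Java code: the function names are assumptions, and the scan runs right to left with the extra requirement that the first short form character starts a word, which is our reading of the algorithm in [4] and is not stated in the description above. Only the “long-form (short-form)” pattern is covered.</p>
        <preformat><![CDATA[
```python
def is_candidate_short_form(inner_text, max_words=2, max_chars=10):
    """Apply the size limits stated in the text to the content of the parentheses."""
    words = inner_text.split()
    return 0 < len(words) <= max_words and len(inner_text) <= max_chars


def find_long_form(short_form, text_before):
    """Match short form characters right to left against the preceding text.

    Returns the shortest suffix of text_before that contains every
    alphanumeric character of short_form in order, or None if no match exists.
    """
    l_index = len(text_before) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        current = short_form[s_index].lower()
        if not current.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left until the character matches; the first character of the
        # short form must also sit at the beginning of a word.
        while (l_index >= 0 and text_before[l_index].lower() != current) or (
            s_index == 0 and l_index > 0 and text_before[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = text_before.rfind(" ", 0, l_index + 1) + 1
    return text_before[start:]
```
]]></preformat>
        <p>For the sentence fragment “la Organización Mundial de la Salud (OMS)”, the content of the parentheses passes the size check and the scan selects “Organización Mundial de la Salud” as the long form.</p>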
        <p>
          To use ADRS, we integrated the original code
(http://biotext.berkeley.edu/code/abbrev/ExtractAbbrev.java) with our system’s code. Before
executing the tool for each publication, we split all sentences of each abstract using Ixa
Pipes ([
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]), and analysed each line with ADRS, in order to get all short form and long
form pairs. We considered each title as a single sentence.
        </p>
      </sec>
      <sec>
        <title>Ab3P</title>
        <p>
          Ab3P ([
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) is a state-of-the-art tool for detecting abbreviations with high precision. It is
implemented in C++ and is simple to use. Ab3P returns all abbreviations and their long forms
detected in each line of the document, together with their estimated precision. There is a Java
fork available for download, but it is still incomplete; it has not been updated for 3
years, and its authors do not seem to plan to finish it.
        </p>
        <p>Ab3P’s heuristics build on the ADRS algorithm. The Ab3P paper describes about 10
rules and 30 strategies that are used by its heuristics and are absent from ADRS. After applying
all strategies, the tool estimates the precision of each strategy; the strategy
with the highest estimated value is considered the most reliable, and the long form it proposes is selected
for the short form.</p>
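        <p>The selection step can be pictured as taking a maximum over candidate long forms. The sketch below is hypothetical: the Candidate type, the strategy labels and the function name select_long_form are assumptions made for illustration and do not reproduce Ab3P’s rules, strategies or precision estimates.</p>
        <preformat><![CDATA[
```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    strategy: str               # hypothetical strategy label
    long_form: str              # long form proposed by that strategy
    estimated_precision: float  # precision estimate attached to the strategy


def select_long_form(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Keep the candidate whose strategy has the highest estimated precision."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.estimated_precision)
```
]]></preformat>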
        <p>We had many issues integrating the C++ code into our Java system. To work around
them, we executed Ab3P on each document individually, and later processed the outputs
with our system. Ab3P expects one sentence per line in the input
document, so we used Ixa Pipes once again to split the sentences. In each input file
analysed by Ab3P, the first line contained the title.</p>
      </sec>
      <sec id="sec-3-2">
        <title>BADREX</title>
        <p>
          BADREX, developed by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], is a GATE plug-in that detects abbreviations and their
long forms in text using regular expressions.
        </p>
        <p>BADREX applies five steps to detect long and short forms. The first step is based
on the ADRS algorithm. The second step uses subsets to discard conditions of short
forms. Step 3 applies the regular expressions. Step 4 splits the non-alphabetic characters of potential short and long
forms to match adjacent characters. Finally, Step 5 detects
unpaired mentions of the long and short form in the same abstract (MULTIPLE
mentions).</p>
        <p>Regular expressions for long and short pairs are specified in separate files. It is
possible to adapt them to other languages or needs. We adapted them to handle
acute accents (á), diaereses (ü) and tildes (ñ). Allowing them to match these
Spanish special characters improved the tool’s performance drastically. Table 1 shows
BADREX’s original regular expressions, together with the names of the files where these
expressions are stored, and their Spanish variants.</p>
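        <p>The effect of this adaptation can be illustrated in Python, with two caveats that are assumptions of this sketch rather than part of BADREX. Python’s re module does not support \p{L}, so the ASCII flag is used to imitate a \w that only matches unaccented letters, and Python’s default Unicode-aware \w stands in for \p{L} (it also accepts digits and underscores). The pattern is a simplified fragment modelled on the optional trailing group of the expressions in Table 1.</p>
        <preformat><![CDATA[
```python
import re

# Punctuation, optional whitespace, then one or more word characters.
ASCII_ONLY = re.compile(r"[,;:]\s*\w+", re.ASCII)  # \w limited to [a-zA-Z0-9_]
UNICODE_AWARE = re.compile(r"[,;:]\s*\w+")         # \w also matches á, ü, ñ


def matched_text(pattern, text):
    """Return the text matched by pattern, or None if there is no match."""
    match = pattern.search(text)
    return match.group(0) if match else None
```
]]></preformat>
        <p>On the input “; señal”, the ASCII-only pattern stops before the ñ, while the Unicode-aware pattern covers the whole word, which is the kind of truncation the Spanish variants avoid.</p>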
        <p>To make use of BADREX, we integrated the GATE API into our system, and executed
the plug-in directly from the API. All sentences in the abstract were split once again.</p>
        <p>Tool resources: Ab3P, https://github.com/ncbi-nlp/Ab3P; Ab3P Java fork, https://github.com/aureooms/ab3p; BADREX, https://github.com/philgooch/BADREX-Biomedical-Abbreviation-Expander; GATE (open-source text analyser), https://gate.ac.uk/; BADREX regular expression files, directory BADREX DIR/resources/regex.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>BADREX regular expressions, for English and Spanish.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>RegEx file name</th>
                <th>RegEx for English</th>
                <th>RegEx for Spanish</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>inner post.txt</td>
                <td>)([,;:]\s*\w+)?[\)\]]</td>
                <td>})([,;:]\s*\p{L}+)?[\)\]]</td>
              </tr>
              <tr>
                <td>inner post 2.txt</td>
                <td>}\2([,;:]\s*\w+)?)[\)\]]</td>
                <td>}\2([,;:]\s*\p{L}+)?)[\)\]]</td>
              </tr>
              <tr>
                <td>inner pre.txt</td>
                <td>})\s*[\(\[](\2[\w\-\&amp;’\.\=\+\s\]{1,</td>
                <td>})\s*[\(\[](\2[\p{L}\-\&amp;’\.\=\+\s]{1,</td>
              </tr>
              <tr>
                <td>inner pre 2.txt</td>
                <td>}\b(\w)(\w+[\-’\=\+\s]{1,2}))\s*[\(\[](.{1,</td>
                <td>}\b(\p{L})(\p{L}+[\-’\=\+\s]{1,2}))\s*[\(\[](.{1,</td>
              </tr>
              <tr>
                <td>outer pre.txt</td>
                <td>\b((\w)\W{0,2}(\w+[\-\&amp;’\=\+\s]{1,2}){1,</td>
                <td>\b((\p{L})\W{0,2}(\p{L}+[\-\&amp;’\=\+\s]{1,2}){1,</td>
              </tr>
              <tr>
                <td>outer pre.txt</td>
                <td>\b(.{1,</td>
                <td>\b(.{1,</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Code listing 1.1 shows how we initialized the plug-in and GATE. Listing 1.2 shows
how we executed the plug-in to analyse sentences and extract short forms and their
long forms from each sentence.</p>
        <p>After getting all short and long form pairs, we detected the offsets of all entities
participating in each relation. If we detected a short form twice in the same sentence, we
paired the long form with the occurrence of the short form closest to it, while the
other occurrence was labelled as MULTIPLE.</p>
        <p>We also checked the appearances of each entity in the rest of the document, and
labelled those appearances as MULTIPLE as well.</p>
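        <p>A minimal sketch of this labelling rule follows, assuming the offsets of the long form are already known. The function name label_short_mentions, the whole-word matching and the tie-breaking (the earliest occurrence wins) are assumptions of this illustration, not a description of our actual implementation.</p>
        <preformat><![CDATA[
```python
import re


def label_short_mentions(text, short_form, long_start, long_end):
    """Label the short form occurrence closest to the long form as SHORT
    and every other occurrence as MULTIPLE.

    Returns a list of (label, start, end) tuples in order of appearance.
    """
    # Whole-word matches only, so a short form inside a longer token is skipped.
    pattern = re.compile(r"(?<!\w)" + re.escape(short_form) + r"(?!\w)")
    spans = [match.span() for match in pattern.finditer(text)]
    if not spans:
        return []

    def gap(span):
        """Number of characters between a mention and the long form."""
        start, end = span
        if start >= long_end:
            return start - long_end
        if end <= long_start:
            return long_start - end
        return 0  # the mention overlaps the long form

    closest = min(spans, key=gap)
    return [
        ("SHORT" if span == closest else "MULTIPLE", span[0], span[1])
        for span in spans
    ]
```
]]></preformat>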
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>This section shows the final results we obtained after evaluating the predictions of each
track through Markyt. To prepare our system for the background set, we initially worked
using the sample set and the available train sets.</p>
      <p>Listing 1.2. BADREX plug-in execution through GATE.</p>
      <sec id="sec-4-1">
        <title>Entity evaluation results</title>
        <p>We submitted three runs for this track. The first run corresponds to Ab3P
(section 3.2), the second to ADRS (section 3.1), and the third to
BADREX (section 3.3).</p>
        <p>Table 2 shows our results on the training set. ADRS obtained the best F-measure
(69.79), with Ab3P close behind (67.92), although Ab3P reached the highest precision. None of the systems was able to detect
a single NESTED entity. When inspecting the predictions, we found that each tool worked
well at detecting abbreviations, but often was not able to find the correct long form
nearby.</p>
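        <p>The F-measure values can be cross-checked against precision and recall. The sketch below assumes the reported F-measure is the balanced harmonic mean of the two, which the text does not state explicitly; under that assumption the training set figures for the three tools are reproduced to within 0.01.</p>
        <preformat><![CDATA[
```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the training set results.
TABLE_2 = {
    "Ab3P": (86.21, 56.04, 67.92),
    "ADRS": (83.71, 59.84, 69.79),
    "BADREX": (81.27, 45.39, 58.25),
}
```
]]></preformat>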
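        <p>BADREX is a good tool for detecting abbreviations, but its heuristics for finding the long
form do not work as well. While this tool is very useful for detecting long-short pairs in
English, it still needs to be adapted for other languages.</p>
        <p>The final results of this track follow the same
tendency as the training set, with ADRS being the best system, Ab3P close behind, and
BADREX last.</p>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Entity evaluation results. Train set.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Tool</th>
                <th>Precision</th>
                <th>Recall</th>
                <th>F-measure</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Ab3P</td>
                <td>86.21</td>
                <td>56.04</td>
                <td>67.92</td>
              </tr>
              <tr>
                <td>ADRS</td>
                <td>83.71</td>
                <td>59.84</td>
                <td>69.79</td>
              </tr>
              <tr>
                <td>BADREX</td>
                <td>81.27</td>
                <td>45.39</td>
                <td>58.25</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We also submitted three runs for the relation evaluation track. These three runs were based on the
entities detected in the entity evaluation track, each run with the corresponding tool.</p>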
        <p>Table 4 shows our results on the training set. Once more, we can see that
ADRS is the best tool for abbreviation and long form detection. Ab3P is very close to
ADRS again. Meanwhile, BADREX is far from reaching the performance of the
other two systems. None of the systems was able to detect a single NESTED relation.</p>
        <p>Table 5 shows our final results for the relation evaluation track. Just as in the entity
evaluation track, the results follow the same tendency here.</p>
        <p>An interesting project for the future would be to extend these tools to work with
NESTED entities, and to associate them with long and short forms.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we presented the results of our participation in the entity and relation
evaluation tracks of the Biomedical Abbreviation Recognition and Resolution (BARR) task
at the IberEval 2017 workshop. We worked with three different state-of-the-art tools designed
to detect long forms and short forms in English biomedical texts, applying them to the
Spanish language. We submitted three runs per track, one for each tool.
Two of the tools perform quite well for Spanish, giving good results when detecting
biomedical entities and relating abbreviations found in the text to their long forms
in the same context; meanwhile, the third tool needs more polishing to perform better
in Spanish. Our results show that algorithms designed for abbreviation
resolution in English, based on patterns and regular expressions, can also be used in other
languages, such as Spanish and Portuguese.</p>
      <p>For future work, we would like to investigate how to improve these tools for
Spanish, in order to increase performance, detect nested entities, and make relations
between nested, short and long forms possible.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the encomienda MINETAD-CNIO/OTG Sanidad Plan TL and the
OpenMinted (654021) H2020 project for funding.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agerri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bermudez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigau</surname>
          </string-name>
          , G.:
          <article-title>Ixa pipeline: Efficient and ready to use multilingual NLP tools</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gooch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>BADREX: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions</article-title>
          .
          <source>CoRR</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Intxaurrondo</surname>, <given-names>A.</given-names></string-name>
          ,
          <string-name><surname>Pérez-Pérez</surname>, <given-names>M.</given-names></string-name>
          ,
          <string-name><surname>Pérez-Rodríguez</surname>, <given-names>G.</given-names></string-name>
          ,
          <string-name><surname>López-Martín</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><surname>Santamaría</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><surname>de la Peña</surname>, <given-names>S.</given-names></string-name>
          ,
          <string-name><surname>Villegas</surname>, <given-names>M.</given-names></string-name>
          ,
          <string-name><surname>Akhondi</surname>, <given-names>S.</given-names></string-name>
          ,
          <string-name><surname>Valencia</surname>, <given-names>A.</given-names></string-name>
          ,
          <string-name><surname>Lourenço</surname>, <given-names>A.</given-names></string-name>
          ,
          <string-name><surname>Krallinger</surname>, <given-names>M.</given-names></string-name>
          :
          <article-title>The Biomedical Abbreviation Recognition and Resolution (BARR) track: benchmarking, evaluation and importance of abbreviation recognition systems applied to Spanish biomedical abstracts</article-title>
          .
          <source>SEPLN</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A simple algorithm for identifying abbreviation definitions in biomedical text</article-title>
          .
          <source>In: Proceedings of the Pacific Symposium on Biocomputing</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Comeau</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilbur</surname>
          </string-name>
          , W.J.:
          <article-title>Abbreviation definition identification based on automatic precision estimates</article-title>
          .
          <source>BMC Bioinformatics</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>