<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A simple method to extract abbreviations within a document using regular expressions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Sánchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paloma Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Universidad Carlos III of Madrid</institution>
          ,
          <addr-line>Avda. Universidad 30, Leganés, 28911, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>297</fpage>
      <lpage>301</lpage>
      <abstract>
        <p>Biomedical Abbreviation Recognition and Resolution (BARR) is an evaluation track of the 2nd Human Language Technologies for Iberian languages (IberEval) workshop, a workshop series organized by the Spanish Natural Language Processing Society (SEPLN). In this first edition of BARR, we focus on the discovery of biomedical entities and abbreviations, and on relating detected abbreviations to their long forms. This paper describes the approach and the system presented in sub-track 2, which offers a method to extract abbreviations within a document using regular expressions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Many clinical documents are created on a daily basis, and most of them contain
abbreviations for common medical and clinical terms, names of diseases, symptoms, etc.
Interpreting them correctly can be confusing for patients and even for
medical professionals. This also adds workload, because finding, retrieving and
interpreting an abbreviation often involves analysing not just the term but the whole
document context.</p>
      <p>Some research proposes solutions and approaches to this problem, but most of it
focuses on text written in English. In this context, the
BARR2 track aims to promote the development and evaluation of clinical
abbreviation identification systems by providing Gold Standard training and test corpora,
manually annotated by domain experts with abbreviation-definition pairs, within
abstracts of clinical texts and clinical case studies written in Spanish.</p>
      <p>Our participation focused on sub-track 2: providing the resolution of short forms
regardless of whether their definition is mentioned within the actual document. For this
approach, and in line with our participation in the previous BARR track, we refer to an
abbreviation as a Short Form (SF) and to its definition as the Long Form (LF).</p>
      <p>This paper is organized as follows: Section 2 describes our proposed approach.
Section 3 presents evaluation and results. Finally, conclusions and future work are discussed
in Section 4.</p>
      <p>
        The main goal was to propose a solution for sub-track 2. This proposal was based on
our previous work, A proposed system to identify and extract abbreviation definitions in
Spanish biomedical texts for the Biomedical Abbreviation Recognition and Resolution
(BARR) 2017 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To accomplish this, we first assumed that an abbreviation or short
form can appear many times in the document, that its length is between 2 and 8
characters, and that it has a single definition. Secondly, we used an external source
to obtain the definition, so we chose not to search the content of the document for it. For example:
      </p>
      <p><italic>A la exploración física se observaba paraparesia con amioatrofia por desuso de EEII</italic></p>
      <p>In this example we consider EEII to be a short form or abbreviation; the actual
definition or long form is not provided within the text.</p>
      <p>Using these assumptions as guidelines, we divided the system process into
the following tasks:</p>
      <sec id="sec-1-1">
        <title>Prepare and organize the definitions</title>
        <p>
          We used the Diccionario de Siglas Médicas [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] as the main source for the definitions.
All the terms were extracted from it, stored in a database and exposed as a service
through a REST API. A total of 3386 terms from the dictionary were exported to
the database. This service was meant to be used as part of the system in
the presented approach.
        </p>
        <p>Definitions are returned as a list of key-value objects, each composed of the short form
and the long form or definition, e.g. for the abbreviation EEII:</p>
        <p>Some abbreviations have more than one definition; in that case, the service returns
all the known definitions for the given abbreviation.</p>
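        <p>The shape of the response described above can be sketched in Python as follows. This is a minimal illustration and not the original service: the field names short_form and long_form, the helper names build_index and lookup, and the sample rows are assumptions made for the example. The second EEII row is a placeholder that only shows how several definitions for one abbreviation would be returned together.</p>
        <preformat>
```python
from collections import defaultdict

# Illustrative rows only; they are not taken from the dictionary export.
ROWS = [
    ("EEII", "extremidades inferiores"),
    ("EEII", "placeholder second definition"),
    ("FC", "frecuencia cardiaca"),
]


def build_index(rows):
    """Group (short form, long form) rows by short form."""
    index = defaultdict(list)
    for short_form, long_form in rows:
        index[short_form].append(long_form)
    return index


def lookup(index, short_form):
    """Return every known definition as a list of key-value objects."""
    return [
        {"short_form": short_form, "long_form": long_form}
        for long_form in index.get(short_form, [])
    ]


INDEX = build_index(ROWS)
```
        </preformat>
        <p>With these sample rows, lookup(INDEX, "EEII") returns two objects, and an abbreviation that is not in the index yields an empty list.</p>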
      </sec>
      <sec id="sec-1-2">
        <title>Detect Short Forms</title>
        <p>For this task we used a Perl 6 script that parses the documents one by one to
obtain all the short forms found in the text. Short form identification was performed
using regular expressions<sup>1</sup>. The set of rules was based on our previous work, with some
improvements added. A total of 5 regular expressions were used in the proposed
system.</p>
        <p><sup>1</sup> https://docs.perl6.org/language/regexes</p>
        <p>One of the improved regular expressions was the following:</p>
        <p><monospace>&lt;[a..z]&gt;? \-? &lt;[A..Z]&gt; ** {1..8} &lt;[a..z\ \-\/]&gt;? &lt;[A..Z0..9]&gt;+ &lt;[a..z]&gt;*</monospace></p>
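        <p>This regular expression matches (from left to right):</p>
        <list list-type="bullet">
          <list-item><p>Zero or one lowercase letter</p></list-item>
          <list-item><p>Zero or one "-" character</p></list-item>
          <list-item><p>Between 1 and 8 uppercase letters</p></list-item>
          <list-item><p>Zero or one lowercase letter or one of the characters " " (space), "-", "/"</p></list-item>
          <list-item><p>One or more uppercase letters or digits</p></list-item>
          <list-item><p>Zero or more lowercase letters</p></list-item>
        </list>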
        <p>Also, to match abbreviations of the form gr/dl, the following regular expression was used:</p>
        <p><monospace>&lt;[a..zA..Z]&gt; ** {1..4} \/ &lt;[\w]&gt; ** {1..4}</monospace></p>
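        <p>This regular expression matches (from left to right):</p>
        <list list-type="bullet">
          <list-item><p>Between 1 and 4 lowercase or uppercase letters</p></list-item>
          <list-item><p>The character /</p></list-item>
          <list-item><p>Between 1 and 4 word characters (a-z, A-Z, 0-9 and the underscore character _)</p></list-item>
        </list>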
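        <p>The two rules described above can be approximated in Python as a hedged sketch. The original system used Perl 6 regexes; the patterns below are translations based on the component lists given in the text, and the names ACRONYM, RATIO and detect_short_forms are hypothetical. Note that Python's \w also matches accented letters by default, which may differ from the original behaviour.</p>
        <preformat>
```python
import re

# Translation of the first rule: optional lowercase letter, optional "-",
# 1 to 8 uppercase letters, optional lowercase letter / space / "-" / "/",
# one or more uppercase letters or digits, then any lowercase letters.
ACRONYM = re.compile(r"[a-z]?-?[A-Z]{1,8}[a-z \-/]?[A-Z0-9]+[a-z]*")

# Translation of the second rule, for forms such as gr/dl.
RATIO = re.compile(r"[a-zA-Z]{1,4}/\w{1,4}")


def detect_short_forms(text):
    """Return (short form, start offset) pairs, ordered by position."""
    found = []
    for pattern in (ACRONYM, RATIO):
        for match in pattern.finditer(text):
            found.append((match.group(), match.start()))
    return sorted(found, key=lambda pair: pair[1])
```
        </preformat>
        <p>Each match is stored together with its start offset, since later steps use that position to select the text that precedes the abbreviation.</p>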
        <p>The whole document is parsed and all the matches found are stored in a list,
together with the position of each matched short form in the text. Once the document
is processed, the next step is to obtain the definition of each match.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Get Long Forms</title>
        <p>Using the REST API provided for the definitions database, the script
made HTTP GET requests for each of the abbreviations found in the document.
If a definition was not found in the database, the script discarded the current
abbreviation and continued with the next match.</p>
        <p>This step was executed whenever the script needed to find a definition. Once a
definition had been associated with an abbreviation, the script marked the abbreviation as done and did
not repeat the request, even if the abbreviation matched again in the document, which
improved the performance of the system.</p>
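        <p>The lookup policy described above (one request per distinct abbreviation, and abbreviations without a definition discarded) can be sketched as follows. The function name resolve_short_forms and the fetch_definitions callable are hypothetical; the callable stands in for the HTTP GET request, whose endpoint is not given in the text. Remembering empty responses as well is an assumption of this sketch.</p>
        <preformat>
```python
def resolve_short_forms(short_forms, fetch_definitions):
    """Map each distinct short form to its definitions, querying once each.

    fetch_definitions is any callable that takes a short form and returns
    a list of long forms (an empty list when nothing is known).
    """
    cache = {}
    for short_form in short_forms:
        # Only the first occurrence triggers a request.
        if short_form not in cache:
            cache[short_form] = fetch_definitions(short_form)
    # Abbreviations for which nothing was returned are discarded.
    return {sf: defs for sf, defs in cache.items() if defs}
```
        </preformat>
        <p>Passing the fetch operation as a parameter keeps the sketch independent of any particular HTTP client and makes the caching behaviour easy to check with a stub.</p>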
      </sec>
      <sec id="sec-1-4">
        <title>Process Long Forms</title>
        <p>When the API returned a response, the script continued with the next step:
processing it to obtain the most likely definition for the abbreviation.
There were two possibilities:</p>
        <p>If the response contained just one definition, the script took it as the
definitive one for the abbreviation being evaluated and moved on to the next item
in the list.</p>
        <p>If the response contained two or more definitions, the script performed the
following steps for each result:</p>
        <list list-type="order">
          <list-item><p>Normalize the part of the text that precedes the start offset of the abbreviation. This
normalization consists of removing stop words and stemming the text using the Perl 6
modules Lingua::Stopwords<sup>2</sup> and Lingua::Stem::Es<sup>3</sup>.</p></list-item>
          <list-item><p>Normalize the text of the definition, removing stop words and stemming it with the
same tools as in the previous step.</p></list-item>
          <list-item><p>Extract from the normalized text as many words as the abbreviation has
characters, starting at the start offset of the abbreviation
and moving to the left. If the abbreviation is EEII, this step gets four
(4) stemmed words from the normalized text.</p></list-item>
          <list-item><p>Perform an intersection operation to get a list that contains only the elements
common to both the normalized text and the normalized definition, and return the
number of elements found.</p></list-item>
        </list>
        <p>If no matches were found, the steps above were repeated, but instead
of extracting a fixed number of words from the normalized text, the intersection operation
was made with all the text up to the start offset of the abbreviation. That way a wider
range of stemmed words was compared, which provided more context and more
opportunities to find similarities between both texts.</p>
        <p>Once all the long forms were processed, the script selected the one with the most
intersected elements and used it as the definition. As a final step, the script obtained the
lemmatized version of the definition using the Python library pattern<sup>4</sup>.</p>
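        <p>The selection among several candidate definitions can be illustrated with the following sketch. It is not the original Perl 6 code: the tiny stop word set and the suffix-stripping stem function are toy stand-ins for Lingua::Stopwords and Lingua::Stem::Es, the name pick_long_form is hypothetical, and returning the first candidate when nothing intersects is an assumption, since the text does not say what happens in that case.</p>
        <preformat>
```python
import re

# Toy stop word list; the real system used a full Spanish list.
STOPWORDS = {"a", "con", "de", "el", "en", "la", "las", "los", "por", "se", "y"}


def stem(token):
    """Crude stand-in for a Spanish stemmer: strip a few common endings."""
    for suffix in ("es", "as", "os", "a", "o", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Lowercase, tokenize, drop stop words and stem what remains."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def pick_long_form(text, start, short_form, candidates):
    """Choose the candidate sharing the most stems with the left context."""
    left = normalize(text[:start])
    # As many words as the short form has characters, nearest to it.
    window = left[-len(short_form):]

    def score(context):
        stems = set(context)
        return {c: len(stems.intersection(normalize(c))) for c in candidates}

    scores = score(window)
    if not any(scores.values()):
        # Fallback: compare against all the text before the short form.
        scores = score(left)
    return max(candidates, key=lambda c: scores[c])
```
        </preformat>
        <p>For a sentence in which "extremidades inferiores" appears shortly before EEII, the candidate sharing those two stems is selected over a candidate with no overlap. A production version would replace stem and STOPWORDS with a real stemmer and stop word list.</p>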
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Evaluation and Results</title>
      <p>
        For this sub-task at BARR2, the primary evaluation metrics were precision,
recall and F-score of the predictions against a manual gold standard [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A corpus
consisting of a manually labeled collection of Spanish medical abstracts, constructed using a
customized version of AnnotateIt, BRAT and the Markyt annotation system
[9], was released by the organizers to test the systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The results of our first test with the training data (a total of 4260 annotations
provided) were:</p>
      <preformat>PRECISION = 0.5130085 = 1459.5092 / 2845
RECALL = 0.3426078 = 1459.5092 / 4260
F MEASURE = 0.41084</preformat>
      <p>After some adjustments to the rules for short form detection and a cleanup of some texts
in the definitions stored in the database, the results improved:</p>
      <preformat>PRECISION = 0.722437 = 1526.5094 / 2113
RECALL = 0.35833555 = 1526.5094 / 4260
F MEASURE = 0.47905517</preformat>
      <p><sup>2</sup> http://modules.perl6.org/dist/Lingua::Stopwords:cpan:CHSANCH</p>
      <p><sup>3</sup> http://modules.perl6.org/dist/Lingua::Stem::Es:cpan:CHSANCH</p>
      <p><sup>4</sup> https://www.clips.uantwerpen.be/pages/pattern-es</p>
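        <p>The figures reported above are consistent with the usual definitions of precision, recall and F-measure, as the following short check suggests. The function name prf is hypothetical, and the numerators are taken as reported, without assuming how the evaluation script computed them.</p>
        <preformat>
```python
def prf(matched, predicted, gold):
    """Return (precision, recall, F-measure) from the three totals."""
    precision = matched / predicted
    recall = matched / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# First run on the training data, then the run after the adjustments.
FIRST_RUN = prf(1459.5092, 2845, 4260)
SECOND_RUN = prf(1526.5094, 2113, 4260)
```
        </preformat>
        <p>Between the two runs the number of predictions dropped from 2845 to 2113 while the matched score rose slightly, which is why precision improved much more than recall.</p>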
      <p>The evaluation scores obtained for our 3 submitted predictions were:</p>
      <p>The organizers noticed some issues in this sub-track: definitions
can appear in different forms, some definitions have variants, and there are some
typos. All of these could affect the results to some degree.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and future work</title>
      <p>For this sub-task we relied on just one dictionary. Although it is a good resource
for definitions in the medical field, more sources are needed to improve the results.
This proposal offers a solution for this specific field, but it could be extended to analyze
documents from other fields.</p>
      <p>Another interesting improvement would be to add Machine Learning processes
that classify texts and provide a more accurate selection of the definition of an abbreviation
in the context of the processed document.</p>
      <p>There were many missed definitions. Retrieving and storing definitions for
the missed abbreviation matches from external sources could be an important
improvement. Finally, methods could be applied to identify and extract definitions within
the processed document, which was the main goal of sub-track 1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Diccionario de Siglas Médicas. Ministerio de Sanidad y Consumo (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Martin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betanco</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santamaría</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding mentions of abbreviations and their definitions in Spanish clinical cases: the BARR2 shared task evaluation results</article-title>
          .
          <source>SEPLN</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de la Torre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betanco</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          , andJ A.
          <string-name>
            <surname>Lopez-Martin</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santamaría</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Resources, guidelines and annotations for the recognition, definition resolution and concept normalization of Spanish clinical abbreviations: the BARR2 corpus</article-title>
          .
          <source>SEPLN</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
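          4.
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :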
          <article-title>A proposed system to identify and extract abbreviation definitions in Spanish biomedical texts for the Biomedical Abbreviation Recognition and Resolution (BARR) 2017</article-title>
          . BARR IBEREVAL (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>