<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vicomtech at BARR2: Detecting Biomedical Abbreviations with ML methods and dictionary-based heuristics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Montse Cuadros</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naiara Perez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iker Montoya</string-name>
          <email>fiker92montug@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aitor García Pablos</string-name>
          <email>agarciapg@vicomtech.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>D Lanik S.A, Donostia-San Sebastian</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vicomtech</institution>
          ,
          <addr-line>Paseo Mikeletegi 57, Donostia-San Sebastian</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>322</fpage>
      <lpage>328</lpage>
      <abstract>
        <p>This paper presents the system developed by Vicomtech to participate in the Second Biomedical Abbreviation Recognition and Resolution (BARR2) track. For this purpose, we have used simple machine learning approaches on annotated electronic health records and the datasets provided in the track. The machine learning approaches have been tested individually and in combination with heuristics based on a dictionary of biomedical abbreviations adapted for the task.</p>
      </abstract>
      <kwd-group>
        <kwd>biomedical nlp</kwd>
        <kwd>abbreviations</kwd>
        <kwd>machine learning</kwd>
        <kwd>dictionary-based approaches</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>tasks is the number of abbreviations that each subtask asks for and where the
definitions should originate from.</p>
      <p>Sub-track 1 requires detecting all the abbreviations whose definitions
are given explicitly in the document. Both the short form (i.e., the
abbreviation or acronym) and the long form (i.e., the definition or description) must be
reported. For example, for the following piece of text:
"... se aplicó radiofrecuencia (RF) sobre la vía accesoria aurículo-ventricular
(AV) de conducción bidireccional. Se interrumpe la taquicardia y la preexcitación,
finalizando el procedimiento. Quedó con bloqueo de rama derecha (BRD)
..."
the answer should report the three short forms "RF", "AV", and "BRD", along with
their explicit long forms "radiofrecuencia", "aurículo-ventricular", and "bloqueo
de rama derecha", respectively.</p>
      <p>Sub-track 2 requires detecting all the abbreviations within the document
and providing a resolution regardless of whether the long form appears explicitly in the text. The
following text excerpt contains two such short forms, "RMN" and "MTT":
"Se solicitó una RMN de pie izquierdo, que reveló una fractura de estrés
en el 2º MTT con callo perióstico..."</p>
      <p>The system developed for this sub-track should be able to find these two elements
and give their long forms, "resonancia magnética nuclear" and "metatarso",
respectively:</p>
      <array>
        <tbody>
          <tr><td>S1889-836X2015000200005-2</td><td>878</td><td>881</td><td>RMN</td><td>resonancia magnética nuclear</td><td>resonancia magnético nuclear</td></tr>
          <tr><td>S1889-836X2015000200005-2</td><td>943</td><td>946</td><td>MTT</td><td>metatarso</td><td>metatarso</td></tr>
        </tbody>
      </array>
      <p>
        The organization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has provided a sample set, a training set and a
development set of the sizes shown in Table 1. The test set provided for evaluating the
approaches was about 10 times bigger than the other sets, containing 2,879
clinical texts, even though the submitted runs were eventually evaluated against a
set of the same size as the training set.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>
        This work is a continuation of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where several experiments were performed for
detecting and disambiguating abbreviations in electronic health records (EHR).
In that work, a small corpus of 149 EHRs was compiled and manually annotated
with 2,389 abbreviations and acronyms. These EHRs were provided by a local
hospital and belong to different clinical specialties. Of the short forms
annotated, two clinicians manually disambiguated two sets, one containing the 15
most ambiguous forms and the other the 30 most ambiguous forms. Finally, a
dictionary of short- and long-form pairs was crafted based on [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the
annotated corpora. The present work relies on the EHR corpora and the hand-crafted
dictionary, in addition to the datasets provided by the organization of the track.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Sizes of the datasets provided by the organization.</p>
        </caption>
        <table>
          <thead>
            <tr><th /><th>Sample set</th><th>Training set</th><th>Development set</th><th>Testing set</th></tr>
          </thead>
          <tbody>
            <tr><td>Clinical texts</td><td>15</td><td>318</td><td>146</td><td>220</td></tr>
            <tr><td>Sub-track 1</td><td>10</td><td>287</td><td>178</td><td>239</td></tr>
            <tr><td>Sub-track 2</td><td>89</td><td>4,261</td><td>1,878</td><td>3,414</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The following sections describe the approaches taken to the problems of
abbreviation recognition (in both BARR2 sub-tracks 1 and 2), and of abbreviation
resolution in sub-track 1 (i.e., finding the explicit long form) and sub-track 2.
For the purpose of the BARR2 track, most of the effort has been devoted to the
problem of recognition.</p>
      <sec id="sec-2-1">
        <title>Abbreviation recognition</title>
        <p>For each sub-track, we have trained several classifiers and devised two extra
methods, based on regular expressions and on the hand-crafted dictionary, in order
to improve the recall of the machine learning approaches.</p>
        <p>
          <bold>Machine Learning approach.</bold> Several machine learning classifiers have been
trained with Weka [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] (default settings), using the EHR dataset described above
and the BARR2 Training sets (BARR2 TS) for sub-track 1 and sub-track
2. The same inexpensive features as in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have been used for learning the models:
        </p>
        <list list-type="bullet">
          <list-item><p>Uppercase: whether the token is all uppercase</p></list-item>
          <list-item><p>Digit: whether the token contains digits</p></list-item>
          <list-item><p>Strange ending: whether the token has a strange ending, where a strange
ending is one that does not fit the usual endings of tokens that are not
abbreviations</p></list-item>
          <list-item><p>Length: token length</p></list-item>
          <list-item><p>Uppercase count: number of uppercase characters in the token</p></list-item>
          <list-item><p>Lowercase count: number of lowercase characters in the token</p></list-item>
          <list-item><p>Vowel ratio: number of vowels in the token divided by its length</p></list-item>
          <list-item><p>Punctuation ratio: number of punctuation characters in the token divided
by its length</p></list-item>
        </list>
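        <p>As an illustration only, the token-level features listed above could be computed along the following lines. This is a minimal Python sketch and not the implementation used for the submitted systems: the function name <monospace>token_features</monospace>, the vowel inventory and, in particular, the tuple <monospace>HYPOTHETICAL_STRANGE_ENDINGS</monospace> are assumptions, since the actual set of strange endings is not specified here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import string

# Assumption: Spanish vowels, including accented forms.
VOWELS = set("aeiouáéíóúü")

# Hypothetical placeholder: the real list of "strange" endings is not given in the text.
HYPOTHETICAL_STRANGE_ENDINGS = ("x", "q", "k")


def token_features(token, strange_endings=HYPOTHETICAL_STRANGE_ENDINGS):
    """Compute the eight token-level features for a single token."""
    length = len(token)

    def ratio(count):
        # Guard against empty tokens to avoid a division by zero.
        return count / length if length else 0.0

    return {
        # str.isupper() is True when every cased character is uppercase
        # and there is at least one cased character (so "P1NP" counts).
        "uppercase": token.isupper(),
        "digit": any(ch.isdigit() for ch in token),
        "strange_ending": token.lower().endswith(strange_endings),
        "length": length,
        "uppercase_count": sum(ch.isupper() for ch in token),
        "lowercase_count": sum(ch.islower() for ch in token),
        "vowel_ratio": ratio(sum(ch.lower() in VOWELS for ch in token)),
        "punctuation_ratio": ratio(sum(ch in string.punctuation for ch in token)),
    }
```
]]></preformat>
        <p>Each token of a clinical text would be mapped to one such feature dictionary and then exported to whatever format the classifier toolkit expects.</p>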
        <p>Taking these results into account, the classifiers selected for the BARR2
competition have been J48 trained with BARR2 TS only and Random Forest (RF) trained with
the combined datasets.</p>
        <p><bold>Pattern-based approach (Pat).</bold> This approach consists of a set of regular
expressions aiming to retrieve the abbreviations and acronyms that the ML
approach does not cover. Essentially, it retrieves all strings of upper- and
lowercase characters that contain at least one uppercase character and are enclosed in brackets.
Consequently, this approach makes sense mainly in sub-track 1. Some
tests have also been carried out to retrieve short forms containing digits, but they
worsened the results.</p>
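        <p>The core of the pattern-based idea can be sketched as follows. This is a hedged Python approximation, not the actual set of regular expressions used in the system: it assumes that "brackets" means round brackets, and the names <monospace>BRACKETED</monospace> and <monospace>bracketed_candidates</monospace> are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Assumption: candidates are runs of letters (Spanish accented letters included)
# enclosed in round brackets; digits are deliberately excluded.
BRACKETED = re.compile(r"\(([A-Za-zÁÉÍÓÚÜÑáéíóúüñ]+)\)")


def bracketed_candidates(text):
    """Return (short_form, start, end) for bracketed letter strings
    that contain at least one uppercase character."""
    found = []
    for match in BRACKETED.finditer(text):
        candidate = match.group(1)
        if any(ch.isupper() for ch in candidate):
            # Offsets refer to the candidate itself, without the brackets.
            found.append((candidate, match.start(1), match.end(1)))
    return found
```
]]></preformat>
        <p>Under these assumptions, a bracketed lowercase word such as "(derecha)" is skipped because it has no uppercase character, and a form such as "(P1NP)" is skipped because it contains a digit.</p>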
        <p>
          <bold>Dictionary-based approach (Regex).</bold> This approach is based on the
dictionary introduced above and a set of rules hand-crafted after studying and observing
the abbreviations in several sets of EHRs and the literature. For this work, the
dictionary developed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] has been refined taking into account the BARR2
Training and Development set examples. The final version of the dictionary contains
3,447 unique biomedical short- and long-form pairs.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Abbreviation resolution for sub-track 1</title>
        <p>Regarding sub-track 1, the system uses one or a combination of the Machine
Learning, Pattern-based and Dictionary-based approaches to
detect abbreviation candidates. Once the candidates are found, and after
checking that they are surrounded by brackets, the 8-token window preceding the
abbreviation is considered as the possible definition. This possible definition is first
checked against our dictionary and, if it is found there, it is selected. Otherwise, a set of
heuristics is applied to determine whether the preceding text is the
definition. The heuristics are based on: 1) whether the capital letters of the definition match the
letters of the abbreviation, in the same order or backwards; 2) the size of the
definition relative to the size of the abbreviation; and 3) a priority order over definition sizes
(3-grams &gt; 2-grams &gt; 4-grams &gt; 5-grams ...). Once a heuristic is triggered,
the subsequent ones are skipped. Finally, if a definition is found, both the
abbreviation and the definition are selected and their offsets in the original clinical
text are calculated.</p>
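        <p>The lookup order described above (dictionary first, heuristics second) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the system's actual rules: heuristic 1 is approximated by matching word initials as a subsequence, heuristic 2 by requiring the candidate to be longer than the short form, and the priority tuple extends the stated order up to the window size. All names (<monospace>find_long_form</monospace>, <monospace>initials_match</monospace>, <monospace>NGRAM_PRIORITY</monospace>) are hypothetical, and offset computation is omitted.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Assumed priority over candidate sizes: 3-grams, then 2-grams, then 4-grams, ...
NGRAM_PRIORITY = (3, 2, 4, 5, 6, 7, 8)


def is_subsequence(needle, haystack):
    """True if the characters of needle appear in haystack in the same order."""
    remaining = iter(haystack)
    return all(ch in remaining for ch in needle)


def initials_match(short_form, words):
    """Simplified heuristic 1: the short form, read forwards or backwards,
    is spelled by the word initials of the candidate, starting at its first word."""
    initials = "".join(word[0] for word in words).lower()
    letters = short_form.lower()
    for variant in (letters, letters[::-1]):
        if initials[0] == variant[0] and is_subsequence(variant, initials):
            return True
    return False


def find_long_form(short_form, preceding_tokens, dictionary, window=8):
    """Return the long form found right before a bracketed short form, or None."""
    context = preceding_tokens[-window:]
    # Step 1: accept any suffix of the window that the dictionary lists.
    known = {long_form.lower() for long_form in dictionary.get(short_form, ())}
    for n in range(len(context), 0, -1):
        candidate = " ".join(context[-n:])
        if candidate.lower() in known:
            return candidate
    # Step 2: heuristics, tried in priority order; the first hit wins.
    for n in NGRAM_PRIORITY:
        if n > len(context):
            continue
        words = context[-n:]
        candidate = " ".join(words)
        if len(candidate) > len(short_form) and initials_match(short_form, words):
            return candidate
    return None
```
]]></preformat>
        <p>With an empty dictionary and the tokens of "Quedó con bloqueo de rama derecha", this sketch rejects the 3-gram and 2-gram candidates and returns the 4-gram "bloqueo de rama derecha" for "BRD". A single-word long form such as "radiofrecuencia" for "RF" is only recovered here through the dictionary step.</p>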
      </sec>
      <sec id="sec-2-3">
        <title>Abbreviation resolution for sub-track 2</title>
        <p>Regarding sub-track 2, the system uses one or a combination of the Machine
Learning and Dictionary-based approaches to detect the abbreviation
candidates. For each candidate, a definition is selected from our
dictionary. Finally, the offsets at which the abbreviation is found in the clinical text are
provided.</p>
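        <p>A minimal Python sketch of this dictionary lookup is given below. It is illustrative only: the function name <monospace>resolve_with_dictionary</monospace> is hypothetical, the dictionary is assumed to map each short form to a list of long forms, and taking the first listed long form is an assumption, since the criterion for choosing among several definitions is not detailed here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re


def resolve_with_dictionary(text, candidates, dictionary):
    """Return sorted (start, end, short_form, long_form) tuples for every
    occurrence of each candidate short form that has a dictionary entry."""
    results = []
    for short_form in sorted(set(candidates)):
        long_forms = dictionary.get(short_form)
        if not long_forms:
            continue  # no definition available: the candidate is not reported
        # Match the short form only as a whole token, not inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(short_form) + r"(?!\w)")
        for match in pattern.finditer(text):
            # Assumption: the first long form in the dictionary entry is chosen.
            results.append((match.start(), match.end(), short_form, long_forms[0]))
    return sorted(results)
```
]]></preformat>
        <p>The character offsets returned by such a function are zero-based positions in the given string and need not coincide with the offsets of the official annotation files.</p>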
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Evaluation</title>
      <p>Vicomtech has submitted a total of 4 systems to sub-track 1 and 4 systems to
sub-track 2. The systems rely on either one of the approaches described above
or on their combinations. We first tested them with the Sample set and
then refined them using the BARR2 Training and Development sets. Pat and
Regex individually obtained lower recall, so we have used them
only in combination with the J48 or RF classifiers.</p>
      <p>Tables 3 and 4 show the performance of the systems submitted to sub-track
1 and sub-track 2, respectively. In both tables, Training, Development and Test
results are presented. Regarding sub-track 1, adding Pat to the classifier seems
to improve recall slightly, but precision worsens accordingly. Regex seems
to have hardly any effect. As for sub-track 2, the J48 classifier yields slightly
better precision and slightly worse recall than RF; in both cases, Regex improves
recall by 1-3 points but worsens precision by more.</p>
      <p>Overall, there are no big differences between the systems submitted, and
there is a clear drop in recall on the Test dataset for all of them. The results seem to be
competitive, but the official results of the other participants in the track had not been
published at the time of writing, so no comparison can be made.</p>
    </sec>
    <sec id="sec-4">
      <title>Concluding Remarks</title>
      <p>In this paper we present the results of applying different machine learning
approaches combined with heuristics based on pattern matching and regular expressions based
on abbreviation dictionaries. The results show that the systems perform similarly on both tasks in
terms of precision, recall and F1-measure.
However, the tasks are quite different: they are two distinct
problems that only partially share the detection of abbreviations. Sub-track 1
aims at detecting definitions expressed in the text, whereas sub-track 2 relies on
having them in a dictionary. The dictionary has to be precise, and it sometimes fails
due to changes in the language of the abbreviation or spelling mistakes.</p>
      <p>Additionally, there were some exceptions or different kinds of abbreviations that we
did not contemplate because the task description did not mention them, such as:
S1889-836X2015000100003-1 SHORT_FORM 398 402 P1NP SHORT-LONG LONG_FORM 404 452 propéptido amino-terminal del procolágeno tipo 1
related to:
"...resultado en los niveles del P1NP (propéptido amino-terminal del
procolágeno tipo 1)..."
which, according to our initial understanding, was not at all the goal of sub-track 1, where we
expected the order to be the other way round (the long form followed by the bracketed short form).</p>
      <p>Overall, we present a robust method for detecting abbreviations in two
different scenarios, with similar results in both.</p>
      <p>This work has been supported by Vicomtech and the Spanish Ministry of
Economy and Competitiveness (MINECO/FEDER, UE) under the project TUNER
(TIN2015-65308-C5-1-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Intxaurrondo</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Marimon</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Gonzalez-Agirre</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Lopez-Martin</surname>, <given-names>J.A.</given-names></string-name>,
          <string-name><surname>Rodriguez Betanco</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Santamaría</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Villegas</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Krallinger</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Finding mentions of abbreviations and their definitions in Spanish Clinical Cases: the BARR2 shared task evaluation results</article-title>
          .
          <source>In: SEPLN 2018</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Intxaurrondo</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>de la Torre</surname>, <given-names>J.C.</given-names></string-name>,
          <string-name><surname>Rodriguez Betanco</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Marimon</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Lopez-Martin</surname>, <given-names>J.A.</given-names></string-name>,
          <string-name><surname>Gonzalez-Agirre</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Santamaría</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Villegas</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Krallinger</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Resources, guidelines and annotations for the recognition, definition resolution and concept normalization of Spanish clinical abbreviations: the BARR2 corpus</article-title>
          .
          <source>In: SEPLN 2018</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Laguna</surname>, <given-names>J.Y.</given-names></string-name>:
          <article-title>Diccionario de siglas médicas y otras abreviaturas, epónimos y términos médicos relacionados con la codificación de las altas hospitalarias</article-title>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Markov</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Russell</surname>, <given-names>I.</given-names></string-name>:
          <article-title>An introduction to the WEKA data mining system</article-title>
          .
          <source>ACM SIGCSE Bulletin</source>
          <volume>38</volume>
          (
          <issue>3</issue>
          ),
          <fpage>367</fpage>
          –
          <lpage>368</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Montoya</surname>, <given-names>I.</given-names></string-name>:
          <article-title>Análisis, normalización, enriquecimiento y codificación de historia clínica electrónica (HCE)</article-title>
          . Master's thesis, Konputazio Ingeniaritza eta Sistema Adimentsuak Unibertsitate Masterra, Euskal Herriko Unibertsitatea (UPV/EHU) (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>