Vicomtech at BARR2: Detecting Biomedical Abbreviations with ML methods and dictionary-based heuristics

Montse Cuadros

Naiara Perez

Iker Montoya

fiker92montug@gmail.com

Aitor García Pablos

agarciapg@vicomtech.org

Lanik S.A., Donostia-San Sebastian, Spain

Vicomtech, Paseo Mikeletegi 57, Donostia-San Sebastian, Spain

2018

322 328

This paper presents the system developed by Vicomtech to participate in the Second Biomedical Abbreviation Recognition and Resolution (BARR2) track. For this purpose, we have used simple machine learning approaches on annotated electronic health records and the datasets provided in the track. The machine learning approaches have been tested individually and in combination with heuristics based on a dictionary of biomedical abbreviations adapted for the task.

Keywords: biomedical NLP, abbreviations, machine learning, dictionary-based approaches

Introduction

[...] tasks is the number of abbreviations that each sub-task asks for and where the definitions should originate from.

Sub-track 1 requires detecting all the abbreviations for which the definitions are given explicitly in the document. Both the short form (i.e., the abbreviation or acronym) and the long form (i.e., the definition or description) must be reported. For example, for the following piece of text: "... se aplicó radiofrecuencia (RF) sobre la vía accesoria auriculo-ventricular (AV) de conducción bidireccional. Se interrumpe la taquicardia y la preexcitación, finalizando el procedimiento. Quedó con bloqueo de rama derecha (BRD) ..." the answer should report the 3 short forms "RF", "AV", and "BRD", along with their explicit long forms "radiofrecuencia", "auriculo-ventricular", and "bloqueo de rama derecha", respectively.

Sub-track 2 requires detecting all the abbreviations within the document and providing a resolution for each of them, regardless of whether the long form appears explicitly in the text. The following text excerpt contains two such short forms, "RMN" and "MTT": Se solicitó una RMN de pie izquierdo, que reveló una fractura de estrés en el 2º MTT con callo perióstico...

The system developed for this sub-track should be able to find these two elements and give their long forms, "resonancia magnética nuclear" and "metatarso", respectively:

S1889-836X2015000200005-2   878   881   RMN   resonancia magnética nuclear   resonancia magnético nuclear
S1889-836X2015000200005-2   943   946   MTT   metatarso                      metatarso

The organizers [2] provided a sample set, a training set and a development set of the sizes shown in Table 1. The test set provided for evaluating the approaches was about 10 times bigger than the other sets, containing 2,879 clinical tests, even though the submitted runs were eventually evaluated against a set of the same size as the training set.

Table 1. Sizes of the sets provided by the organization.

                   Clinical tests   Sub-track 1   Sub-track 2
Sample set                     15            10            89
Training set                  318           287         4,261
Development set               146           178         1,878
Testing set                   220           239         3,414

3 Methodology

This work is a continuation of [5], where several experiments were performed for detecting and disambiguating abbreviations in electronic health records (EHR). In that work, a small corpus of 149 EHRs was compiled and manually annotated with 2,389 abbreviations and acronyms. These EHRs were provided by a local hospital and belong to different clinical specialties. Of the short forms annotated, 2 clinicians manually disambiguated two sets, one containing the 15 most ambiguous forms and the other the 30 most ambiguous forms. Finally, a dictionary of short- and long-form pairs was crafted based on [3] and the annotated corpora. The present work relies on the EHR corpora and the hand-crafted dictionary, in addition to the datasets provided by the organization of the track.

The following sections describe the approaches taken to the problems of abbreviation recognition (in both BARR2 sub-tracks 1 and 2), and of abbreviation resolution in sub-track 1 (i.e., finding the explicit long form) and sub-track 2. For the purpose of the BARR2 track, most of the effort has been devoted to the problem of recognition.

3.1 Abbreviation recognition

For each sub-track, we have trained several classifiers and devised two extra methods, based on regular expressions and on the hand-crafted dictionary, in order to improve the recall of the machine learning approaches.

Machine Learning approach. Several machine learning classifiers have been trained with Weka [4] (default settings), using the EHR dataset described above and the BARR2 Training sets (BARR2 TS) for sub-track 1 and sub-track 2. The same very cheap features as in [5] have been used for learning the models:

- Uppercase: whether the token is all uppercase
- Digit: whether the token contains digits
- Strange ending: whether the token has a strange ending, where a strange ending is one that does not match the usual endings of tokens that are not abbreviations
- Length: token length
- Uppercase count: number of uppercase characters in the token
- Lowercase count: number of lowercase characters in the token
- Vowel ratio: number of vowels in the token divided by its length
- Punctuation ratio: number of punctuation characters in the token divided by its length
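As an illustration, the token-level features listed above could be computed roughly as follows. This is only a sketch: the function name `token_features` is hypothetical, and the list of "normal" endings is a placeholder assumption, since the actual list used in [5] is not given here.

```python
import string

VOWELS = set("aeiouáéíóúüAEIOUÁÉÍÓÚÜ")

# Placeholder assumption: endings considered "normal" for non-abbreviation
# Spanish tokens. The real list is not specified in the paper.
NORMAL_ENDINGS = ("a", "e", "o", "s", "n", "r", "l", "d", "z")


def token_features(token: str) -> dict:
    """Compute the eight cheap features for a single token."""
    length = len(token)
    vowels = sum(ch in VOWELS for ch in token)
    punctuation = sum(ch in string.punctuation for ch in token)
    return {
        "uppercase": token.isupper(),
        "digit": any(ch.isdigit() for ch in token),
        "strange_ending": not token.lower().endswith(NORMAL_ENDINGS),
        "length": length,
        "uppercase_count": sum(ch.isupper() for ch in token),
        "lowercase_count": sum(ch.islower() for ch in token),
        "vowel_ratio": vowels / length if length else 0.0,
        "punctuation_ratio": punctuation / length if length else 0.0,
    }
```

Each token of a document would be mapped to such a feature vector and then labelled as abbreviation or non-abbreviation by the classifier.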

Taking these results into account, the classifiers selected for the BARR2 competition have been J48 trained with BARR2 TS only and RF trained with the combined datasets.

Pattern-based approach (Pat). This approach consists of a set of regular expressions aiming to retrieve the abbreviations and acronyms that the ML approach does not cover. Basically, it retrieves every string of upper- and lowercase characters that contains at least one uppercase character and is enclosed in brackets. This approach therefore makes sense mainly in sub-track 1. Some tests have also been carried out to retrieve short forms containing digits, but the results worsened.
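A pattern of the kind described above might look like the following sketch. The exact expressions used in the system are not given, so the character classes (including accented Spanish letters) and the name `find_bracketed_short_forms` are assumptions made for illustration.

```python
import re

# Assumption: accented Spanish letters count as letters for this pattern.
LOWER = "a-záéíóúüñ"
UPPER = "A-ZÁÉÍÓÚÜÑ"

# A letters-only string inside parentheses with at least one uppercase letter.
BRACKETED_SHORT_FORM = re.compile(
    rf"\(([{LOWER}{UPPER}]*[{UPPER}][{LOWER}{UPPER}]*)\)"
)


def find_bracketed_short_forms(text):
    """Return (start, end, short_form) triples; offsets exclude the brackets."""
    return [
        (m.start(1), m.end(1), m.group(1))
        for m in BRACKETED_SHORT_FORM.finditer(text)
    ]
```

Because digits are excluded from the character classes, a candidate such as "(P1NP)" is not retrieved, which mirrors the decision to leave out short forms with digits.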

Dictionary-based approach (Regex). This approach is based on the dictionary introduced above and a set of rules hand-crafted after study and observation of the abbreviations in several sets of EHRs and in the literature. For this work, the dictionary developed in [5] has been refined taking into account the BARR2 Training and Development set examples. The final version of the dictionary contains 3,447 unique biomedical short- and long-form pairs.

3.2 Abbreviation resolution for sub-track 1

Regarding sub-track 1, the system uses one or a combination of the Machine Learning, Pattern-based and Dictionary-based approaches to detect abbreviation candidates. Once the candidates are found and after checking that they are surrounded by brackets, a window of up to 8 tokens before the abbreviation is considered as the possible definition. This possible definition is first checked against our dictionary and, if it exists there, it is selected. Otherwise, a set of heuristics is applied in order to determine whether the preceding text is the definition. The heuristics are based on: 1) the capital letters of the definition matching the letters of the abbreviation, in the same order or backwards; 2) the size of the definition relative to the size of the abbreviation; 3) a priority over definition sizes (3-gram > 2-gram > 4-gram > 5-gram ...). Once a heuristic is triggered, the following ones are skipped. Finally, if a definition is found, both the abbreviation and the definition are selected and their offsets in the original clinical text are calculated.
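The search for an explicit long form could be sketched as below. This is a simplified reading of the procedure, not the actual implementation: it compares word initials instead of capital letters, folds the size check into that comparison, ignores a small set of Spanish function words (an assumption), and leaves single-word definitions to the dictionary check. The name `find_long_form` and the dictionary shape (short form mapped to a list of long forms) are hypothetical.

```python
# Assumption: function words that do not contribute a letter to the short form.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y", "en"}


def initials(tokens):
    """Uppercased first letters of the tokens, skipping function words."""
    return "".join(t[:1] for t in tokens if t.lower() not in STOPWORDS).upper()


def find_long_form(tokens_before, short_form, dictionary, max_window=8):
    """Look for the long form in the tokens preceding a bracketed short form."""
    window = tokens_before[-max_window:]

    # Step 1: dictionary check on every n-gram ending right before the bracket.
    known = {lf.lower() for lf in dictionary.get(short_form, [])}
    for n in range(1, len(window) + 1):
        candidate = " ".join(window[-n:])
        if candidate.lower() in known:
            return candidate

    # Step 2: heuristics, trying n-gram sizes in the order 3 > 2 > 4 > 5 ...
    target = short_form.upper()
    for n in [3, 2] + list(range(4, max_window + 1)):
        if n > len(window):
            continue
        candidate_tokens = window[-n:]
        letters = initials(candidate_tokens)
        if letters == target or letters == target[::-1]:
            return " ".join(candidate_tokens)
    return None
```

For instance, with an empty dictionary, `find_long_form("Quedó con bloqueo de rama derecha".split(), "BRD", {})` selects the 4-gram "bloqueo de rama derecha", since neither the 3-gram nor the 2-gram matches the letters of the short form.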

3.3 Abbreviation resolution for sub-track 2

Regarding sub-track 2, the system uses one or a combination of the Machine Learning and Dictionary-based approaches to detect the abbreviation candidates. For each candidate, a definition is selected from our dictionary. Finally, the offsets at which the abbreviation is found in the clinical text are provided.
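A minimal sketch of the dictionary lookup with offsets is given below. It assumes that candidates are detected by the dictionary alone (in the system they may also come from a classifier), that tokens are runs of word characters, and that the dictionary maps each short form to a single long form. The function name and the toy dictionary are hypothetical.

```python
import re


def resolve_with_dictionary(doc_id, text, dictionary):
    """Return (doc_id, start, end, short_form, long_form) for known short forms."""
    rows = []
    for match in re.finditer(r"\w+", text):
        token = match.group(0)
        if token in dictionary:
            rows.append(
                (doc_id, match.start(), match.end(), token, dictionary[token])
            )
    return rows


# Toy example based on the excerpt shown for sub-track 2.
toy_dictionary = {"RMN": "resonancia magnética nuclear", "MTT": "metatarso"}
excerpt = (
    "Se solicitó una RMN de pie izquierdo, que reveló una fractura "
    "de estrés en el 2º MTT con callo perióstico"
)
for row in resolve_with_dictionary("doc-1", excerpt, toy_dictionary):
    print(*row, sep="\t")
```

The offsets here are relative to the excerpt, not to the full clinical text, so they differ from those in the official annotation.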

4 Experiments and Evaluation

Vicomtech has submitted a total of 4 systems to sub-track 1 and 4 systems to sub-track 2. The systems rely on either one of the approaches described above or on their combinations. We first tested them with the Sample set, and then refined them using the BARR2 Training and Development sets. Pat and Regex individually obtained lower recall, so we have used them only in combination with the J48 or RF classifiers.

Tables 3 and 4 show the performance of the systems submitted to sub-track 1 and sub-track 2, respectively. In both tables, Training, Development and Test results are presented. Regarding sub-track 1, adding Pat to the classifier seems to improve recall a little, but precision worsens accordingly. Regex seems to have hardly any effect. As for sub-track 2, the J48 classifier yields slightly better precision and slightly worse recall than RF; in both cases, Regex improves recall by 1-3 points but worsens precision by more.
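For reference, the precision, recall and F1 figures discussed above can be computed from sets of predicted and gold annotations as in the following sketch. It assumes exact matching of (start, end, short form) triples; the official evaluation script of the track may apply different matching criteria.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Adding a high-recall component to a classifier typically enlarges the predicted set: recall rises when the extra predictions are correct, while every incorrect one lowers precision, which is the trade-off observed when Pat or Regex is added.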

Overall, there are no big differences between the systems submitted, and there is a clear drop in recall on the Test dataset for all of them. The results seem to be competitive, but the official results of other participants in the track had not been published at the time of writing, so no remarks can be made on the matter.

5 Concluding Remarks

In this paper we present the results of applying different machine learning approaches combined with heuristics based on pattern matching and on regular expressions built on abbreviation dictionaries. The results show similar precision, recall and F1-measure for both tasks. However, the tasks are quite different: they are two distinct problems that only partially share the detection of abbreviations. Sub-track 1 aims at detecting definitions expressed in the text, whereas in sub-track 2 the definitions have to be available in a dictionary. The dictionary has to be precise, and it sometimes fails due to changes in the language of the abbreviation or to spelling mistakes.

Additionally, there were some exceptions or different kinds of abbreviations that we did not contemplate because the task description did not mention them, such as:

S1889-836X2015000100003-1 SHORT_FORM 398 402 P1NP SHORT-LONG LONG_FORM 404 452 propéptido amino-terminal del procolágeno tipo 1

related to:

...resultado en los niveles del P1NP (propéptido amino-terminal del procolágeno tipo 1)...

which, to our initial understanding, was not at all the goal of sub-track 1, where the order had to be the other way round.

Overall, we present a robust method for detecting abbreviations in two different scenarios showing similar results.

This work has been supported by Vicomtech and the Spanish Ministry of Economy and Competitiveness (MINECO/FEDER, UE) under the project TUNER (TIN2015-65308-C5-1-R).

1. Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J.A., Rodriguez Betanco, H., Santamaría, J., Villegas, M., Krallinger, M.: Finding mentions of abbreviations and their definitions in Spanish Clinical Cases: the BARR2 shared task evaluation results. In: SEPLN 2018 (2018)

2. Intxaurrondo, A., de la Torre, J.C., Rodriguez Betanco, H., Marimon, M., Lopez-Martin, J.A., Gonzalez-Agirre, A., Santamaría, J., Villegas, M., Krallinger, M.: Resources, guidelines and annotations for the recognition, definition resolution and concept normalization of Spanish clinical abbreviations: the BARR2 corpus. In: SEPLN 2018 (2018)

3. Laguna, J.Y.: Diccionario de siglas médicas y otras abreviaturas, epónimos y términos médicos relacionados con la codificación de las altas hospitalarias (2003)

4. Markov, Z., Russell, I.: An introduction to the Weka data mining system. ACM SIGCSE Bulletin 38(3), 367-368 (2006)

5. Montoya, I.: Análisis, normalización, enriquecimiento y codificación de historia clínica electrónica (HCE). Master's thesis, Konputazio Ingeniaritza eta Sistema Adimentsuak Unibertsitate Masterra, Euskal Herriko Unibertsitatea (UPV/EHU) (2017)