Vicomtech at BARR2: Detecting Biomedical Abbreviations with ML methods and dictionary-based heuristics

Montse Cuadros1, Naiara Pérez1, Iker Montoya2, and Aitor García Pablos1

1 Vicomtech, Paseo Mikeletegi 57, Donostia-San Sebastian, Spain
{mcuadros,nperez,agarciap}@vicomtech.org
2 I+D Lanik S.A, Donostia-San Sebastian, Spain
{iker92montu}@gmail.com

Abstract. This paper presents the system developed by Vicomtech to participate in the Second Biomedical Abbreviation Recognition and Resolution (BARR2) track. For this purpose, we have used simple machine learning approaches on annotated electronic health records and the datasets provided in the track. The machine learning approaches have been tested individually and in combination with heuristics based on a dictionary of biomedical abbreviations adapted for the task.

Keywords: biomedical nlp · abbreviations · machine learning · dictionary-based approaches

1 Introduction

This paper describes Vicomtech's participation in the Second Biomedical Abbreviation Recognition and Resolution (BARR2) track of the third IberEval workshop (IberEval 2018), both in sub-tracks 1 and 2. Sub-track 1 consists of detecting only explicit occurrences of abbreviation-definition pairs. For sub-track 2, the resolution of short forms must be provided regardless of whether their definitions are mentioned within the actual document. Both sub-tracks focus on clinical free text in Spanish.

This paper is organized as follows: Section 2 presents the two tasks in more detail; Section 3 presents our approaches to the two problems; Section 4 shows our results; finally, Section 5 contains our concluding remarks.

2 Biomedical Abbreviation Recognition and Resolution 2nd Edition (BARR2)

The Second Biomedical Abbreviation Recognition and Resolution track [1] is organized in two sub-tasks, sub-track 1 and sub-track 2.
Both tasks require recognizing abbreviations and acronyms in Spanish clinical texts, and providing the correct definition for each recognized element. The difference between the tasks is the number of abbreviations that each sub-task asks for and where the definitions should originate from.

Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)

Sub-track 1 requires detecting all the abbreviations for which the definitions are given explicitly in the document. Both the short form (i.e., the abbreviation or acronym) and the long form (i.e., the definition or description) must be reported. For example, for the following piece of text:

"... se aplicó radiofrecuencia (RF) sobre la vía accesoria auriculo-ventricular (AV) de conducción bidireccional. Se interrumpe la taquicardia y la preexcitación, finalizando el procedimiento. Quedó con bloqueo de rama derecha (BRD) ..."

the answer should report the 3 short forms "RF", "AV", and "BRD", along with their explicit long forms "radiofrecuencia", "auriculo-ventricular", and "bloqueo de rama derecha", respectively:

S1888-75462014000200009-1 SHORT_FORM 1524 1526 RF SHORT-LONG LONG_FORM 1507 1522 radiofrecuencia
S1888-75462014000200009-1 SHORT_FORM 1573 1575 AV SHORT-LONG LONG_FORM 1551 1571 auriculo-ventricular
S1888-75462014000200009-1 SHORT_FORM 1720 1723 BRD SHORT-LONG LONG_FORM 1695 1718 bloqueo de rama derecha

Sub-track 2 requires detecting all the abbreviations within the document, and providing a resolution regardless of whether they appear explicitly in the text. The following text excerpt contains two such short forms, "RMN" and "MTT":

Se solicitó una RMN de pie izquierdo, que reveló una fractura de estrés en el 2o MTT con callo perióstico...
The system developed for this sub-track should be able to find these two elements and give their long forms, "resonancia magnética nuclear" and "metatarso", respectively:

S1889-836X2015000200005-2 878 881 RMN resonancia magnética nuclear resonancia magnético nuclear
S1889-836X2015000200005-2 943 946 MTT metatarso metatarso

The organization [2] has provided a sample set, a training set and a development set of the sizes shown in Table 1. The test set provided for evaluating the approaches was about 10 times bigger than the other sets, containing 2,879 clinical cases, even though the submitted runs were eventually evaluated against a set of the same size as the training set.

3 Methodology

This work is a continuation of [5], where several experiments were performed for detecting and disambiguating abbreviations in electronic health records (EHR).

                Sample set  Training set  Development set  Testing set
Clinical cases          15           318              146          220
Sub-track 1             10           287              178          239
Sub-track 2             89         4,261            1,878        3,414
Table 1. Number of documents in the different sets

In this work, a small corpus of 149 EHRs was compiled and manually annotated with 2,389 abbreviations and acronyms. These EHRs were provided by a local hospital and belong to different clinical specialties. Of the annotated short forms, two clinicians manually disambiguated two sets, one containing the 15 most ambiguous forms and the other the 30 most ambiguous forms. Finally, a dictionary of short- and long-form pairs was crafted based on [3] and the annotated corpora. The present work relies on the EHR corpora and the hand-crafted dictionary, in addition to the datasets provided by the organization of the track.
The following sections describe the approaches taken to the problems of abbreviation recognition (both in BARR2 sub-tracks 1 and 2), and of abbreviation resolution in sub-track 1 (i.e., finding the explicit long form) and sub-track 2. For the purpose of the BARR2 track, most of the effort has been put into the problem of recognition.

3.1 Abbreviation recognition

For each sub-track, we have trained several classifiers and devised two extra methods based on regular expressions and the hand-crafted dictionary in order to improve the recall of the machine learning approaches.

Machine Learning approach Several machine learning classifiers have been trained with Weka [4] (default settings), using the EHR dataset described above and both the BARR2 Training sets (BARR2 TS) for sub-track 1 and sub-track 2. The same very cheap features as in [5] have been used for learning the models:

– Uppercase: whether the token is all uppercase
– Digit: whether the token contains digits
– Strange ending: whether the token has a strange ending, i.e., an ending that does not occur in normal, non-abbreviation tokens
– Length: token length
– Uppercase count: number of uppercase characters in the token
– Lowercase count: number of lowercase characters in the token
– Vowel ratio: number of vowels in the token divided by its length
– Punctuation ratio: number of punctuation characters in the token divided by its length

Table 2 shows the performance of the trained classifiers on the BARR2 Development set in terms of F1-measure. The first column refers to the models trained on the EHRs only; the second column refers to the models trained on the BARR2 TS only; finally, the last column shows the results of learning the classifiers on both datasets.
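As an illustration of the feature set listed above, the extraction can be sketched in Python as follows. This is our reconstruction, not the original code (which fed Weka); the "strange ending" feature is omitted because its exact definition is not specified, and all names are ours.

```python
import string

def token_features(token: str) -> dict:
    """Compute cheap surface features for one token (a sketch of the
    feature set described in Section 3.1; 'strange ending' omitted)."""
    length = len(token)
    n_upper = sum(c.isupper() for c in token)      # uppercase character count
    n_lower = sum(c.islower() for c in token)      # lowercase character count
    n_vowels = sum(c.lower() in "aeiouáéíóú" for c in token)
    n_punct = sum(c in string.punctuation for c in token)
    return {
        "uppercase": token.isupper(),              # all-uppercase token?
        "digit": any(c.isdigit() for c in token),  # contains digits?
        "length": length,
        "uppercase_count": n_upper,
        "lowercase_count": n_lower,
        "vowel_ratio": n_vowels / length if length else 0.0,
        "punct_ratio": n_punct / length if length else 0.0,
    }
```

For instance, an acronym such as "BRD" scores high on the uppercase features and has a vowel ratio of zero, while a long word such as "radiofrecuencia" shows the opposite profile, which is what makes these features discriminative despite their simplicity.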
Overall, combining both datasets yields worse results than using the BARR2 TS alone, except with the Random Forest (RF) algorithm, which is the best classifier obtained, followed by J48. Not surprisingly, the models trained on EHRs only perform the worst.

              EHRs   BARR2 TS  combined
J48          81.71      89.63     88.52
KNN1         78.97      89.69     88.42
Naive Bayes  56.44      60.05     58.49
REPtree      82.14      89.12     87.82
SMO           0.00      62.65     54.51
RF           79.45      91.14     91.34
Table 2. F1-measure of abbreviation recognition on the Development sets of sub-tracks 1 and 2, trained on 3 different corpora

Taking these results into account, the classifiers selected for the BARR2 competition have been J48 trained with the BARR2 TS only and RF trained with the combined datasets.

Pattern-based approach (Pat) This approach consists of a set of regular expressions aiming to retrieve the abbreviations and acronyms that the ML approach does not cover. Basically, it retrieves all the strings of upper- and lowercase characters that contain an uppercase character and are inside brackets. That is, this approach makes sense mainly in sub-track 1. Additionally, some tests have been carried out to try to retrieve short forms with digits too, but this worsened the results.

Dictionary-based approach (Regex) This approach is based on the dictionary introduced above and a set of rules hand-crafted after studying and observing the abbreviations in several sets of EHRs and the literature. For this work, the dictionary developed in [5] has been refined taking into account the BARR2 Training and Development set examples. The final version of the dictionary contains 3,447 unique pairs of biomedical short and long forms.
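The core rule of the Pat approach (bracketed letter strings containing at least one uppercase character) can be sketched with a single regular expression. This is our reconstruction under the description above, not the actual rule set, which is larger; function and variable names are ours.

```python
import re

# Candidate short forms: runs of upper-/lowercase letters inside brackets
# that contain at least one uppercase character (sketch of the Pat rule).
BRACKETED = re.compile(r"\(([A-Za-z]*[A-Z][A-Za-z]*)\)")

def find_bracketed_abbreviations(text: str):
    """Return (start, end, short_form) for every bracketed candidate."""
    return [(m.start(1), m.end(1), m.group(1)) for m in BRACKETED.finditer(text)]

text = ("se aplicó radiofrecuencia (RF) sobre la vía accesoria "
        "auriculo-ventricular (AV) de conducción bidireccional")
matches = find_bracketed_abbreviations(text)  # finds RF and AV with offsets
```

Note that the character class deliberately excludes digits, mirroring the observation above that retrieving short forms with digits worsened the results.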
3.2 Abbreviation resolution for sub-track 1

Regarding sub-track 1, the system uses one or a combination of the Machine Learning, Pattern-based and Dictionary-based approaches to detect abbreviation candidates. Once the candidates are found, and after checking that they are surrounded by brackets, a window of up to eight n-grams preceding the abbreviation is considered as the possible definition. This possible definition is first checked against our dictionary and, if present, it is selected. Otherwise, a set of heuristics is applied to determine whether the preceding text is the definition. The heuristics are based on: 1) the initial letters of the definition matching the letters of the abbreviation, in the same order or in reverse; 2) the length of the definition relative to the length of the abbreviation; 3) a priority over definition lengths (3-grams > 2-grams > 4-grams > 5-grams ...). Each heuristic, when triggered, excludes the subsequent ones. Finally, if a definition is found, both the abbreviation and the definition are selected and their offsets in the original clinical text are calculated.

3.3 Abbreviation resolution for sub-track 2

Regarding sub-track 2, the system uses one or a combination of the Machine Learning and Dictionary-based approaches to detect abbreviation candidates. For each candidate, a definition is selected from our dictionary. Finally, the offsets where the abbreviation is found in the clinical text are provided.

4 Experiments and Evaluation

Vicomtech has submitted a total of 4 systems to sub-track 1 and 4 systems to sub-track 2. The systems rely on either one of the approaches described above or their combinations. We first tested them with the Sample set, and then refined them using the BARR2 Training and Development sets.
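For concreteness, the first long-form heuristic of Section 3.2 (matching the definition's initial letters against the abbreviation, forwards or backwards) might be sketched as below. The stopword skip-list and all names are our assumptions; the paper does not specify how function words are handled.

```python
import re

# Assumed skip-list of Spanish function words; not specified in the paper.
STOPWORDS = {"de", "del", "la", "las", "el", "los", "y"}

def initials_match(definition: str, abbreviation: str) -> bool:
    """Sketch of heuristic 1: do the initial letters of the candidate
    definition's content words spell the abbreviation, in order or reversed?"""
    words = [w for w in re.split(r"[\s\-]+", definition.lower())
             if w and w not in STOPWORDS]
    initials = "".join(w[0] for w in words)
    abbr = abbreviation.lower()
    return initials == abbr or initials == abbr[::-1]
```

Under these assumptions, "bloqueo de rama derecha" matches "BRD" and "auriculo-ventricular" matches "AV"; single-word definitions such as "radiofrecuencia" for "RF" would have to be handled by the remaining heuristics or the dictionary.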
Individually, Pat and Regex obtained lower recall scores, so we have used them only in combination with the J48 or RF classifiers. Tables 3 and 4 show the performance of the systems submitted to sub-track 1 and sub-track 2, respectively. In both tables, Training, Development and Test results are presented.

                    Training           Development        Test
Recognition method  P     R     F1     P     R     F1     P     R     F1
J48                 94.09 83.56 88.51  96.75 84.18 90.03  88.12 74.79 80.91
J48 + Pat           90.94 87.76 89.32  93.45 88.70 91.01  87.38 75.63 81.08
J48 + Regex         94.09 83.56 88.51  96.75 84.18 90.03  88.56 74.79 81.71
J48 + Regex + Pat   90.61 87.76 89.16  93.45 88.70 91.01  88.29 76.05 81.71
Table 3. Results of sub-track 1 on the BARR2 Training, Development and Test sets

                    Training           Development        Test
Recognition method  P     R     F1     P     R     F1     P     R     F1
J48                 91.92 82.81 86.81  90.28 78.83 84.17  87.57 70.20 77.93
RF                  90.46 84.20 87.22  89.71 80.06 84.61  86.41 70.44 77.61
J48 + Regex         85.58 85.82 85.70  84.75 82.63 83.67  81.58 73.36 77.25
RF + Regex          85.71 85.63 85.68  85.79 83.60 84.68  81.72 72.89 77.05
Table 4. Results of sub-track 2 on the BARR2 Training, Development and Test sets

Regarding sub-track 1, adding Pat to the classifier seems to improve recall a little, but precision worsens accordingly. Regex seems to have hardly any effect. As for sub-track 2, the J48 classifier yields slightly better precision and slightly worse recall than RF; in both cases, Regex improves recall by 1-3 points but worsens precision by more. Overall, there are no big differences between the systems submitted, and there is a clear drop in recall on the Test dataset for all of them. The results seem to be competitive, but official results of other participants in the track have not been published at the time of writing, so no remarks can be made on the matter.

5 Concluding Remarks

In this paper we have presented the results of applying different machine learning approaches combined with heuristics based on pattern matching and regexes based on abbreviation dictionaries. The results show that both tasks yield similar precision, recall and F1-measure from the perspective of the presented results. However, the tasks are quite different: they are two distinct problems that only partially share the detection of abbreviations. Sub-track 1 aims at detecting definitions explicitly expressed in the text, while sub-track 2 aims at retrieving them from a dictionary. The dictionary has to be precise, and it sometimes fails due to variations in the language of the abbreviation or spelling mistakes. Additionally, there were some cases we did not contemplate because the task description did not make them explicit, such as:

S1889-836X2015000100003-1 SHORT_FORM 398 402 P1NP SHORT-LONG LONG_FORM 404 452 propéptido amino-terminal del procolágeno tipo 1

related to:

...resultado en los niveles del P1NP (propéptido amino-terminal del procolágeno tipo 1)...

which, to our initial understanding, was not the goal of sub-track 1, where we expected the pattern to be the other way round (the long form preceding the bracketed short form).

Overall, we present a robust method for detecting abbreviations in two different scenarios, showing similar results.

6 Acknowledgments

This work has been supported by Vicomtech and the Spanish Ministry of Economy and Competitiveness (MINECO/FEDER, UE) under the project TUNER (TIN2015-65308-C5-1-R).

References
1. Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J.A., Rodriguez Betanco, H., Santamaría, J., Villegas, M., Krallinger, M.: Finding mentions of abbreviations and their definitions in Spanish Clinical Cases: the BARR2 shared task evaluation results. In: SEPLN 2018 (2018)
2. Intxaurrondo, A., de la Torre, J.C., Rodriguez Betanco, H., Marimon, M., Lopez-Martin, J.A., Gonzalez-Agirre, A., Santamaría, J., Villegas, M., Krallinger, M.: Resources, guidelines and annotations for the recognition, definition resolution and concept normalization of Spanish clinical abbreviations: the BARR2 corpus. In: SEPLN 2018 (2018)
3. Laguna, J.Y.: Diccionario de siglas médicas y otras abreviaturas, epónimos y términos médicos relacionados con la codificación de las altas hospitalarias (2003)
4. Markov, Z., Russell, I.: An introduction to the Weka data mining system. ACM SIGCSE Bulletin 38(3), 367–368 (2006)
5. Montoya, I.: Análisis, normalización, enriquecimiento y codificación de historia clínica electrónica (HCE). Master's thesis, Konputazio Ingeniaritza eta Sistema Adimentsuak Unibertsitate Masterra, Euskal Herriko Unibertsitatea (UPV/EHU) (2017)