A simple method to extract abbreviations within a document using regular expressions

Christian Sánchez, Paloma Martínez
Computer Science Department, Universidad Carlos III of Madrid
Avd. Universidad, 30, Leganés, 28911, Madrid, Spain

Abstract. Biomedical Abbreviation Recognition and Resolution (BARR) is an evaluation track of the third edition of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), a workshop series organized by the Spanish Natural Language Processing Society (SEPLN). In this second edition of BARR (BARR2), the focus is on the discovery of biomedical entities and abbreviations, and on relating detected abbreviations to their long forms. This paper describes the approach and the system presented for sub-track 2, which consists in a method to extract abbreviations within a document using regular expressions.

1 Introduction

Many clinical documents are created on a daily basis, and most of them contain abbreviations for common medical and clinical terms, names of diseases, symptoms, etc. The correct interpretation of these abbreviations can be confusing for patients and even for medical professionals. This also adds workload, because finding, retrieving and interpreting an abbreviation often requires analysing not just the term but the whole document context.

There is some research that proposes solutions and approaches for this problem, but most of it focuses on analysing text written in English. In this context, the BARR2 track aims to promote the development and evaluation of clinical abbreviation identification systems by providing Gold Standard training and test corpora, manually annotated by domain experts with abbreviation-definition pairs, over abstracts of clinical texts and clinical case studies written in Spanish.

Our participation focused on sub-track 2: providing the resolution of short forms regardless of whether their definition is mentioned within the actual document. For this approach, and in line with our participation in the previous BARR track, we refer to an abbreviation as a Short Form (SF) and to the definition as the Long Form (LF).

This paper is organized as follows: Section 2 describes our proposed approach, Section 3 presents the evaluation and results, and finally, conclusions and future work are discussed in Section 4.

2 Proposed Approach

The main goal was to propose a solution for sub-track 2. This proposal was based on our previous work "A proposed system to identify and extract abbreviation definitions in Spanish biomedical texts for the Biomedical Abbreviation Recognition and Resolution (BARR) 2017" [4].

To accomplish this, first, we assumed that an abbreviation or short form can appear many times in a document, that its length should be between 2 and 8 characters, and that it has one single definition. Secondly, we used an external source to obtain the definitions, so we did not rely on the content of the document to provide them, e.g.:

    A la exploración física se observaba paraparesia con amioatrofia por desuso de EEII
    (on physical examination, paraparesis with disuse amyotrophy of the EEII was observed)

In this example we consider EEII a short form or abbreviation, and the actual definition or long form is not provided within the text.

Using the mentioned assumptions as guidelines, we divided the system process into the following tasks:

2.1 Prepare and organize the definitions

We used the Diccionario de Siglas Médicas [1] as the main source for the definitions. All the terms were extracted from there, stored in a database and exposed as a service through a REST API. The total number of terms contained in the dictionary and exported to the database was 3386. This service was meant to be used as part of the system in the presented approach. Definitions are returned as a list of key-value objects composed of the short form and the long form (definition), e.g. for the abbreviation EEII:

    [
      { "long_form": "Extremidades inferiores", "short_form": "EEII" },
      { "long_form": "Extremidades izquierdas", "short_form": "EEII" }
    ]

Some abbreviations can have more than one definition; in this way it is possible to obtain all the known definitions for a given abbreviation.
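For illustration, a response like the one above can be parsed and queried as in the following minimal Perl 6 sketch. This is only a sketch, not part of the submitted system: it assumes the JSON::Fast module is available and that the service uses the key names shown in the example response.

    use JSON::Fast;

    # Example response of the definitions service for the short form EEII.
    my $response = q:to/END/;
    [
      { "long_form": "Extremidades inferiores", "short_form": "EEII" },
      { "long_form": "Extremidades izquierdas", "short_form": "EEII" }
    ]
    END

    # Parse the JSON and collect every known long form for the short form.
    my @definitions = from-json($response).list;
    my @long-forms  = @definitions.map({ $_<long_form> });
    say "EEII has { @long-forms.elems } candidate definitions: { @long-forms.join(', ') }";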
2.2 Detect Short Forms

For this task we used a Perl 6 script which parses all the documents one by one to obtain all the short forms found in the text. Short form identification was performed using regular expressions (https://docs.perl6.org/language/regexes). The set of rules was based on our previous work, but some improvements were added. A total of 5 regular expressions were used in the proposed system. One of the improved regular expressions was the following:

    <[a..z]>? \-? <[A..Z]> ** {1..8} <[a..z\ \-\/]>? <[A..Z0..9]>+ <[a..z]>*

This regular expression matches (from left to right):

– Zero or one lowercase letter
– Zero or one "-" character
– Between 1 and 8 uppercase letters
– Zero or one lowercase letter or " ", "-", "/" character
– One or more uppercase letters or digits
– Zero or more lowercase letters

Also, to match abbreviations in the form gr/dl, the following regular expression was used:

    <[a..zA..Z]> ** {1..4} \/ <[\w]> ** {1..4}

This regular expression matches (from left to right):

– Between 1 and 4 lowercase or uppercase letters
– The character "/"
– Between 1 and 4 word characters (a-z, A-Z, 0-9 and the underscore character)

The whole document is parsed and all the matches found are stored in a list. This detection also stores the position of the matched short form in the text. Once the document is processed, the next step is to obtain the definition of each of them.
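To illustrate how these rules are applied, the following minimal Perl 6 sketch runs the first regular expression over the example sentence from Section 2 and stores each match together with its start offset. The variable names are illustrative and the quantifier is written here in its range form (** 1..8); the sketch is not the submitted script.

    # Example sentence from Section 2; the expected short form is EEII.
    my $text = 'A la exploración física se observaba paraparesia con amioatrofia por desuso de EEII';

    # First short-form rule: apply it globally and keep every match.
    my @matches = $text.match(
        / <[a..z]>? \-? <[A..Z]> ** 1..8 <[a..z\ \-\/]>? <[A..Z0..9]>+ <[a..z]>* /,
        :g
    );

    # Each Match object carries the matched text and its position in the document.
    for @matches -> $m {
        say "short form '{ $m.Str }' found at offset { $m.from }";
    }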
2.3 Get Long Forms

Using the REST API provided for the definitions database, the script could make GET requests, via HTTP, for each of the abbreviations found in the document. If a definition was not found in the database, the script discarded the abbreviation currently being processed and continued with the next match.

This step was executed every time the script needed to find a definition. If a definition had already been associated with an abbreviation, the script marked it as done and did not execute this step again, even if there were more matches of that abbreviation in the document; this provided better performance for the system.

2.4 Process Long Forms

When a response was provided by the API, the script continued with the next step, which was to process it and obtain the supposedly right definition for the abbreviation. In this step there were two possibilities:

If the response contained just one definition, the script used it, marked it as the definitive one for the currently evaluated abbreviation, and started the task with the next item in the list.

If the response contained two or more definitions, the script performed another set of actions for each result:

• Normalize part of the text: get the content up to the start offset of the abbreviation. This normalization includes removing stop words and stemming the text using the Perl 6 modules Lingua::Stopwords (http://modules.perl6.org/dist/Lingua::Stopwords:cpan:CHSANCH) and Lingua::Stem::Es (http://modules.perl6.org/dist/Lingua::Stem::Es:cpan:CHSANCH).
• Normalize the text of the definition: remove stop words and stem the text using the same tools as in the previous step.
• Extract from the normalized text as many words as the number of characters in the abbreviation, beginning from the start offset of the abbreviation and moving to the left; so if the abbreviation is EEII this step should get four (4) stemmed words from the normalized text.
• Perform an intersection operation to get a list that contains only the elements common to both the normalized text and the normalized definition, and return the total number of elements found.

In the case that no common elements were found, the steps above were repeated, but instead of extracting a fixed number of words from the normalized text, the intersection operation was made with all the text up to the start offset of the abbreviation. That way a wider range of stemmed words was compared, which provided a better context and more opportunities to find similarities in both texts.

Once all the long forms were processed, the script selected the one with the most intersected elements and used it as the definition. As a final step, the script obtained the lemmatized version of the definition using the Python library pattern (https://www.clips.uantwerpen.be/pages/pattern-es).
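To make the selection of a definition more concrete, the following minimal Perl 6 sketch scores each candidate long form by intersecting its normalized words with the normalized words of the text preceding the abbreviation, and keeps the highest-scoring candidate. The normalize routine and the context sentence are only illustrative stand-ins (a tiny hard-coded stop-word list, no stemming, and an invented context) for the processing done with Lingua::Stopwords and Lingua::Stem::Es, and the sketch compares against the whole preceding text, i.e. the fallback behaviour described above.

    # Illustrative stand-in for the normalization step: lowercase, split into
    # words and drop a few Spanish stop words (the real system also stems the
    # words with Lingua::Stem::Es).
    sub normalize(Str $text) {
        my @stop-words = <a la las el los de del en se con por y o ambas al>;
        return $text.lc.words.grep({ $_ ∉ @stop-words });
    }

    # Candidate long forms returned by the service for the short form EEII.
    my @candidates = 'Extremidades inferiores', 'Extremidades izquierdas';

    # Invented example: text of the document up to the start offset of EEII.
    my $context = 'presentaba debilidad progresiva en ambas extremidades inferiores y dolor al caminar';

    # Score each candidate by the number of normalized words it shares with the
    # context; after sorting by score, the last element is the selected definition.
    my @context-words = normalize($context);
    my $best = @candidates.sort({ ( normalize($_) (&) @context-words ).elems }).tail;
    say "Selected definition: $best";    # Extremidades inferiores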
3 Evaluation and Results

For this sub-task at BARR2 the primary evaluation metrics were precision, recall and F-score of the predictions against a manual gold standard [3]. A corpus consisting of a manually labeled collection of Spanish medical abstracts, constructed using a customized version of AnnotateIt, BRAT, as well as the Markyt annotation system [9], was released by the organizers to test the systems [2].

The results of our first test with the training data (a total of 4260 annotations provided) were:

    PRECISION = 0.5130085 = 1459.5092 / 2845
    RECALL    = 0.3426078 = 1459.5092 / 4260
    F-MEASURE = 0.41084

After some adjustments in the rules for short form detection and some cleanup of the texts in the definitions stored in the database, we got an improvement in the results:

    PRECISION = 0.722437   = 1526.5094 / 2113
    RECALL    = 0.35833555 = 1526.5094 / 4260
    F-MEASURE = 0.47905517

The evaluation scores obtained for our 3 submitted predictions were:

    Precision: 74.93
    Recall:    37.69
    F1:        50.16

There were some issues that the organizers noticed in this sub-track: definitions could appear in different forms, there are variants of some of the definitions, and there are some typos; all of them could affect the results to some degree.

4 Conclusions and future work

For this sub-task we relied on just one dictionary, which is a good resource for definitions in the medical field; more sources are needed to improve the results.

This proposal offers a solution in this specific field, but it could be extended to analyse documents related to other fields. Another interesting improvement could be to add some Machine Learning processes to classify texts and provide a more accurate selection of the definition of an abbreviation in the context of the processed document.

There were many missed definitions. An attempt to obtain and store definitions for missed abbreviation matches using external sources could be an important improvement.

Finally, applying methods to identify and extract definitions within the processed document, which was the main goal of sub-track 1, would be another valuable extension.

References

1. Diccionario de Siglas Médicas. Ministerio de Sanidad y Consumo (2016)
2. Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J., Betanco, H.R., Santamaría, J., Villegas, M., Krallinger, M.: Finding mentions of abbreviations and their definitions in Spanish clinical cases: the BARR2 shared task evaluation results. SEPLN (2018)
3. Intxaurrondo, A., de la Torre, J., Betanco, H.R., Lopez-Martin, J.A., Marimon, M., Gonzalez-Agirre, A., Santamaría, J., Villegas, M., Krallinger, M.: Resources, guidelines and annotations for the recognition, definition resolution and concept normalization of Spanish clinical abbreviations: the BARR2 corpus. SEPLN (2018)
4. Sánchez, C., Martínez, P.: A proposed system to identify and extract abbreviation definitions in Spanish biomedical texts for the Biomedical Abbreviation Recognition and Resolution (BARR) 2017. BARR IberEval (2017)