=Paper= {{Paper |id=Vol-1881/BARR2017_paper_2 |storemode=property |title=IBI-UPF at BARR-2017: Learning to Identify Abbreviations in Biomedical Literature System description |pdfUrl=https://ceur-ws.org/Vol-1881/BARR2017_paper_2.pdf |volume=Vol-1881 |authors=Francesco Ronzano,Laura Inés Furlong |dblpUrl=https://dblp.org/rec/conf/sepln/RonzanoF17 }} ==IBI-UPF at BARR-2017: Learning to Identify Abbreviations in Biomedical Literature System description== https://ceur-ws.org/Vol-1881/BARR2017_paper_2.pdf
           IBI-UPF at BARR-2017: learning to identify
              abbreviations in biomedical literature
                       System description

                          Francesco Ronzano and Laura I. Furlong

                 Integrative Biomedical Informatics Group, Research Programme
                                on Biomedical Informatics (GRIB)
                       Hospital del Mar Medical Research Institute (IMIM)
                                    Universidad Pompeu Fabra
                                        Barcelona, Spain
                   {francesco.ronzano,laura.furlong}@upf.edu



         Abstract. This paper presents the participation of the IBI-UPF team in the Biomedical
         Abbreviation Recognition and Resolution (BARR) track organized in the
         context of the Evaluation of Human Language Technologies for Iberian Languages
         2017 (IBEREVAL). The purpose of the track was to automatically identify
         abbreviation-definition pairs in the abstracts of biomedical articles in Spanish. By
         releasing a sample corpus and two collections of training documents, the organizers
         provided a total of 1,150 abstracts of biomedical articles, the majority of
         them in Spanish, manually annotated with abbreviations and the corresponding
         definitions. We tackled the task by implementing
         an approach articulated in two sequential phases. In the first phase, by relying on a
         set of shallow linguistic features extracted from the textual contents of biomedical
         abstracts, we trained two token classifiers to spot sequences of one or more
         tokens that represent abbreviations and definitions, respectively. We then trained a
         third classifier to distinguish abbreviations that are candidate short forms of a
         definition expressed in the same abstract sentence from other types of abbreviations.
         In the second phase, relations between the abbreviations and definitions
         previously spotted are identified by means of a set of heuristics based on structural
         and linguistic traits of the text of each abstract. We evaluate the first phase
         of our approach on the set of manually annotated Spanish biomedical abstracts
         provided by the organizers of the BARR track.


1      Introduction

Nowadays, automated approaches to mine biomedical texts are becoming key tools to
enable researchers, as well as any other interested actor, to effectively access and
take advantage of the huge and rapidly growing number of articles available on-line [6].
PubMed1, the main search engine for life science and biomedical papers, currently
includes more than 27 million articles and is growing at a rate of about 7% of new
publications every year [18].
 1
     https://www.ncbi.nlm.nih.gov/pubmed/
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)




              Abbreviations, acronyms and symbols are extensively used in biomedical texts: their
         identification and correct interpretation are essential to automatically analyze this kind
         of document. Several approaches have been proposed during the last decades to extract
         abbreviation-definition pairs from biomedical texts [9, 17]. Some of them are based on a
         mix of pattern-matching and heuristic rules, sometimes complemented by corpus statistics
         [2, 7, 14, 20, 22, 23], while others propose hybrid systems that rely on supervised
         learning approaches trained on manually annotated corpora [3, 12, 13,
         21]. During the last decade, in the biomedical domain, clinical notes, besides scientific
         papers, have also been the focus of several efforts towards the automated extraction and
         interpretation of abbreviations [4, 11, 19].
              The Biomedical Abbreviation Recognition and Resolution (BARR) [8] track has
         been organized in the context of the Evaluation of Human Language Technologies for
         Iberian Languages (IBEREVAL 2017) in order to promote the investigation of new
         approaches to identify abbreviations together with their definitions in Spanish biomedical
         documents. In this paper we describe our participation (UPF-IBI team) in the BARR
         track. In particular, in Section 2 we provide more details on the BARR task by introducing
         some core aspects of the BARR corpus of biomedical abstracts manually annotated
         with respect to abbreviations. In Section 3 we describe the set of Natural Language
         Processing tools and resources we exploited to support the automated identification of
         abbreviation-definition pairs in biomedical abstracts. Section 4 explains our approach
         to the BARR task. In Section 5 we provide a preliminary evaluation of our
         automated abbreviation identification system on the training set of manually
         annotated abstracts provided by the BARR organizers. To conclude, in Section 6 we
         summarize the key points of our BARR participation, outlining future avenues of research to
         improve our approach.


         2    BARR track: task and dataset

         The information extraction task proposed to the participants of the BARR track consists
         of the identification of abbreviations (or Short Forms, SFs) that occur in sentences of
         Spanish biomedical abstracts and their association with the corresponding definitions,
         referred to as Long Forms (LFs). An example of a ⟨SF, LF⟩ pair is ⟨TAC, Tomografía Axial
         Computarizada⟩. Besides proposing approaches to mine the broad variety of possible
         SFs that can be exploited to refer to a specific LF, BARR participants were also required
         to deal with the detection of nested ⟨SF, LF⟩ pairs: in these pairs two or more SFs share
         portions of the corresponding LFs, or the LF associated with a SF does not consist of a
         consecutive sequence of words. The expression dolor oncológico (DO) y no oncológico
         (DNO) includes two nested ⟨SF, LF⟩ pairs: ⟨DO, dolor oncológico⟩ and ⟨DNO, dolor no
         oncológico⟩.
             In order to train automated approaches for the detection of ⟨SF, LF⟩ pairs (both
         simple and nested ones), the BARR organizers released a sample corpus and two training
         corpora, providing a total of 1,150 manually annotated abstracts of biomedical articles:
         about 90% of these documents are Spanish texts. The evaluation of the abbreviation
         extraction approaches proposed in the context of the BARR task is performed by computing
         the precision, recall and F1-score of each proposed approach with respect to a test
         corpus that includes 600 Spanish biomedical abstracts: the extraction of entities (SFs
         and LFs) and their relations are considered as two separate tasks. More details concern-
         ing the corpus of biomedical papers released in the context of the BARR track together
         with the description of how these documents have been manually annotated can be
         found in [10].


         3      Tools and resources

         To identify SFs, LFs and their associations, we exploited a mix of machine learning
         and heuristic approaches, both based on the characterization of the textual contents of
         biomedical abstracts through a set of shallow linguistic and corpus-based features. We
         computed these features by processing Spanish abstracts by means of the IXA Pipes
         NLP tools [1]: we performed sentence splitting, tokenization, Part of Speech tagging
         and constituency parsing. To process Spanish documents, IXA Pipes relies on NLP models
         trained on the Spanish texts of the AnCora Corpus2. Besides linguistic analyses, we
         determined the frequency of usage of the abstracts’ words by relying on a word-frequency
         dictionary built from a 2016 dump of the Spanish Wikipedia. We exploited the GATE
         Framework [5] to integrate the text mining tools just mentioned into a single pipeline.


         4      Method

         Our abbreviation identification approach is composed of two sequential steps: the entity
         spotting phase and the relation extraction phase. The first phase relies on machine learning
         approaches to identify and characterize both SFs and LFs. The second phase exploits a
         set of heuristics in order to refine the entities previously identified and extract relations
         between SFs and LFs. The heuristics implemented in the second phase include
         a set of rules built to automatically characterize simple cases of nested
         ⟨SF, LF⟩ pairs. In this Section we provide a detailed description of the two phases of
         our abbreviation identification approach.


         4.1     Phase 1: entity spotting

         The first phase of our approach aims at: (i) extracting abbreviations and LFs; (ii) selecting,
         among the spotted abbreviations, the ones that are SFs and thus occur in the same
         sentence as the corresponding LF.
             All these information extraction tasks have been performed by training distinct
         token-based classifiers. In these classifiers each token is characterized by means of the
         following types of features that we exploited to model both the token under considera-
         tion and the ones included in a context window of size [−2, 2]:

             – Part of Speech;
             – number of characters, including punctuation;
             – percentage of uppercase, numeric and punctuation characters;
          2
              http://clic.ub.edu/corpus/ancora




           – whether the first / last character is uppercase;
           – whether the last character is a punctuation mark;
           – number of repetitions of the token in the abstract;
           – match of the token with one of the entries of the Dictionary of Medical Abbrevia-
             tions SEDOM3 ;
           – frequency of the token in the Spanish Wikipedia.

             Each of the feature types listed above generates five feature values for
         each token: one describing the token under analysis and four characterizing, respectively,
         the two previous tokens and the two following tokens in the same sentence. We plan to
         explore in our future work the influence of different window sizes on the performance
         of our token-based classifiers, also considering windows that are asymmetric
         with respect to the token to classify. We computed token features scoped
         to each sentence, thus setting as missing the feature values of the context tokens that
         cannot be determined since they are outside sentence boundaries. We selected our set
         of features in order to describe traits of tokens and their context that we considered
         relevant to the identification and characterization of abbreviations and LFs. For instance,
         a high percentage of uppercase letters is typical of many abbreviations.
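
The windowed feature extraction described above can be sketched as follows. This is a minimal illustration and not the code of the submitted system: it covers only the surface features that need no external resource (no Part of Speech tags, SEDOM lookup or Wikipedia frequencies), and the function and feature names are hypothetical.

```python
import string

# Hypothetical names for a subset of the shallow features listed above.
FEATURE_NAMES = (
    "length", "pct_upper", "pct_digit", "pct_punct",
    "first_upper", "last_upper", "last_punct",
)


def token_features(token):
    """Surface features of a single non-empty token."""
    n = len(token)
    return {
        "length": n,
        "pct_upper": sum(c.isupper() for c in token) / n,
        "pct_digit": sum(c.isdigit() for c in token) / n,
        "pct_punct": sum(c in string.punctuation for c in token) / n,
        "first_upper": token[0].isupper(),
        "last_upper": token[-1].isupper(),
        "last_punct": token[-1] in string.punctuation,
    }


def windowed_features(sentence, index, window=2):
    """Features of the token at `index` and of its context tokens.

    Context positions that fall outside the sentence are set to None,
    mirroring the "missing value" convention described in the text.
    """
    features = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        inside = 0 <= pos < len(sentence)
        base = token_features(sentence[pos]) if inside else None
        for name in FEATURE_NAMES:
            features[f"{name}[{offset:+d}]"] = base[name] if inside else None
    return features


sentence = ["La", "TAC", "es", "útil"]
feats = windowed_features(sentence, 1)
print(feats["pct_upper[+0]"], feats["length[-2]"])
```

With a window of [−2, 2], each feature type yields five values per token, so the seven feature types in this sketch produce 35 values.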
             By relying on the previous set of features we built three Random Forest classifiers,
         respectively trained to determine:

           – Abbreviation Token Classifier: whether or not a token represents an abbreviation;
           – Long Form Token Classifier: whether a token is at the Beginning, Inside or Outside a
             LF;
           – Abbreviation Type Classifier: whether a token classified as an abbreviation by the
             Abbreviation Token Classifier is a SF or represents another kind of abbreviation (e.g.
             an abbreviation for which the Long Form is not provided in the same sentence).
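
As an illustration of the Beginning / Inside / Outside scheme used by the Long Form Token Classifier, the following sketch turns annotated LF spans into per-token labels. The span representation (token offsets with an exclusive end) and the function name are assumptions made for this example, not details taken from the BARR annotation format.

```python
def bio_labels(num_tokens, long_form_spans):
    """Return one B/I/O label per token.

    `long_form_spans` is a list of (start, end) token offsets,
    with `end` exclusive (an assumption of this sketch).
    """
    labels = ["O"] * num_tokens
    for start, end in long_form_spans:
        labels[start] = "B"          # first token of the Long Form
        for i in range(start + 1, end):
            labels[i] = "I"          # remaining tokens of the Long Form
    return labels


tokens = ["La", "tomografía", "axial", "computarizada", "(", "TAC", ")"]
print(list(zip(tokens, bio_labels(len(tokens), [(1, 4)]))))
```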

             In our approach presented to the BARR track, after selecting the best subset of
         features with respect to the task to perform, we trained each classifier over the whole
         set of tokens of the manually annotated Spanish biomedical abstracts provided by the
         BARR track organizers. Section 5 includes an initial evaluation of the performance of
         our classifiers over the BARR manually annotated Spanish abstracts.


         4.2     Phase 2: relation extraction

         Once SFs and LFs have been identified, in this phase we mainly implemented the following set
         of heuristics to determine whether the sentence in which a SF occurs also includes the
         related LF:

           – Long Form sanitizing heuristics:
             (A.1) delete all LFs in which every token is shorter than three characters
             or that do not include a noun token;
             (A.2) remove the initial token from the text span of the LFs that start with an article;
          3
              http://www.sedom.es/diccionario/




           – SF - LF relation identification heuristics:
             (B.1) collect for each SF all the candidate LFs, including the LFs identified by
             the classifier and the noun phrases that occur in the same sentence, do not overlap
             the SF, are at most three characters away from the SF and span more
             characters than the SF. If the SF is in
             parentheses, we consider only the preceding candidate LFs;
             (B.2.1) if there is only one candidate LF: if the candidate LF has been identified by
             the Long Form Classifier, create a SF - LF relation. Otherwise, if it is a noun phrase,
             apply the SF-LF scoring function (described below) and create a SF - LF relation
             if the score is greater than 0;
             (B.2.2) if there is more than one candidate LF: score each candidate LF by means
             of the SF-LF scoring function and choose the one with the highest score, provided it is
             greater than 0. If more than one candidate LF has the highest score, give
             precedence to the one that has been identified by the Long Form Classifier, if any,
             otherwise choose one of the candidate LFs randomly.
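
The Long Form sanitizing heuristics (A.1) and (A.2) listed above can be sketched as follows. This is an illustrative reading of the two rules, not the original implementation: the representation of a LF as a list of (text, POS) pairs and the tag names "NOUN" and "DET" are assumptions, and "DET" is used here as a stand-in for articles.

```python
def sanitize_long_form(tokens):
    """Apply heuristics A.1 and A.2 to a candidate Long Form.

    `tokens` is a list of (text, pos) pairs. Returns the cleaned list,
    or None if the Long Form must be discarded.
    """
    # A.1: discard LFs in which every token is shorter than three characters.
    if all(len(text) < 3 for text, _ in tokens):
        return None
    # A.1: discard LFs that do not include a noun token.
    if not any(pos == "NOUN" for _, pos in tokens):
        return None
    # A.2: drop a leading article (approximated by the assumed "DET" tag).
    if tokens[0][1] == "DET":
        tokens = tokens[1:]
    return tokens


lf = [("la", "DET"), ("tomografía", "NOUN"), ("axial", "ADJ")]
print(sanitize_long_form(lf))
```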


             As mentioned in the previous procedure, we defined a SF-LF scoring function that,
         given a pair of SF and candidate LF, returns a double value that is equal to 0 if the LF
         is not recognized as related to the SF. Otherwise the function returns a number greater
         than 0: the greater this value, the more likely we consider the candidate LF to be
         a definition of the SF. A value equal to 1 indicates a perfect match between the SF and
         the candidate LF. The return values of the SF-LF scoring function have been defined by
         relying on the precision estimates of the SF / LF matching strategies defined by [16].
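
The candidate selection rules (B.2.1) and (B.2.2) can be sketched as below. The scoring function is passed in as a parameter because its actual return values, derived from [16], are not given here; the `toy_score` function is only a placeholder that returns 1 when the word initials of the LF spell the SF and 0 otherwise. All names are hypothetical, and picking the first classifier-identified candidate among tied ones is an assumption of this sketch.

```python
import random


def select_long_form(short_form, candidates, score, rng=random):
    """Choose the LF for `short_form`, or None if no candidate qualifies.

    `candidates` is a list of (text, from_classifier) pairs, where
    `from_classifier` tells whether the Long Form Classifier found the LF.
    """
    if not candidates:
        return None
    if len(candidates) == 1:  # rule B.2.1
        text, from_classifier = candidates[0]
        if from_classifier or score(short_form, text) > 0:
            return text
        return None
    # Rule B.2.2: score every candidate and keep the best positive score.
    scored = [(score(short_form, text), text, fc) for text, fc in candidates]
    best = max(s for s, _, _ in scored)
    if best <= 0:
        return None
    top = [(text, fc) for s, text, fc in scored if s == best]
    from_clf = [text for text, fc in top if fc]
    if from_clf:
        return from_clf[0]  # precedence to classifier-identified LFs
    return rng.choice([text for text, _ in top])


def toy_score(short_form, long_form):
    """Placeholder scoring function, NOT the one defined from [16]."""
    initials = "".join(word[0] for word in long_form.split())
    return 1.0 if initials.upper() == short_form.upper() else 0.0


candidates = [("tomografía axial computarizada", False), ("estudio clínico", False)]
print(select_long_form("TAC", candidates, toy_score))
```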
             We extended the SF - LF relation extraction procedure just described with a
         set of refinement steps that deal with special cases, including:


           – groups of SFs, like fibrosis intersticial y atrofia tubular [FI y AT];
           – SFs for which no LF has been found: starting from the considered SF, we try to build
             the LF by matching word-initials backwards;
           – SFs for which no LF has been found and that match one of the abbreviations of the
             Dictionary of Medical Abbreviations SEDOM: we search for the corresponding LF
             retrieved from the same Dictionary in the set of candidate LFs previously described.
             This approach covers borderline cases like ⟨CO2, Dióxido de carbono⟩, in which it
             would otherwise have been impossible to determine the SF - LF relation.
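
The second special case above, building the LF by matching word-initials backwards, can be illustrated with the following simplified sketch. It assumes that each character of the SF corresponds to exactly one of the words immediately preceding it and ignores stop words inside the LF, which the original system may well handle differently; the function name is hypothetical.

```python
def build_long_form_backwards(preceding_words, short_form):
    """Try to build a LF from the words that precede a SF.

    Walks backwards over `preceding_words`, pairing the last word with the
    last character of the SF, and so on. Returns the LF text, or None if
    some word initial does not match.
    """
    letters = [c.lower() for c in short_form if c.isalnum()]
    if not letters or len(preceding_words) < len(letters):
        return None
    candidate = preceding_words[-len(letters):]
    for word, letter in zip(reversed(candidate), reversed(letters)):
        if word[0].lower() != letter:
            return None
    return " ".join(candidate)


words = ["mediante", "tomografía", "axial", "computarizada"]
print(build_long_form_backwards(words, "TAC"))
```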

             We also defined a basic set of heuristics to spot cases of nested relations between
         SFs and LFs. We flag the possible presence of nested relations if, after a candidate
         LF, two or more SFs are present before the end of the sentence or the occurrence of the
         following candidate LF. If this situation occurs, we exploit a set of rules based on string
         matching and POS tags to identify the NESTED entities and the SF - NESTED
         relations. In particular, for each SF marked as a nested candidate, we search backwards
         for non-consecutive words that match the initials of the same SF and include at least
         one noun token.
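
The backward search for non-consecutive words can be sketched as a greedy subsequence match, shown below on the dolor oncológico (DO) y no oncológico (DNO) example from Section 2. This is only one possible reading of the rule: the greedy strategy, the (word, POS) representation, the "NOUN" tag and the assumption that parenthesised SFs have already been removed from the preceding words are all choices made for this illustration.

```python
def find_nested_long_form(preceding, short_form):
    """Search backwards for non-consecutive words matching the SF initials.

    `preceding` is a list of (word, pos) pairs that occur before the SF.
    Returns the matched words joined as a string, or None if the initials
    cannot all be matched or no matched word is a noun.
    """
    letters = [c.lower() for c in short_form if c.isalnum()]
    matched = []
    idx = len(preceding) - 1
    for letter in reversed(letters):
        # Skip words whose initial does not match the current SF character.
        while idx >= 0 and preceding[idx][0][0].lower() != letter:
            idx -= 1
        if idx < 0:
            return None
        matched.append(preceding[idx])
        idx -= 1
    matched.reverse()
    if not any(pos == "NOUN" for _, pos in matched):
        return None
    return " ".join(word for word, _ in matched)


preceding = [("dolor", "NOUN"), ("oncológico", "ADJ"), ("y", "CONJ"),
             ("no", "ADV"), ("oncológico", "ADJ")]
print(find_nested_long_form(preceding, "DNO"))
```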




         5     Evaluations and runs
         We evaluated the performance of the three Random Forest classifiers described in
         Section 4.1 by means of a 10-fold cross-validation over the 237,603 tokens of the manually
         annotated BARR abstracts (Table 1).


                                 Classifier            Precision Recall F1-score
                                 Abbreviation Token         0.937 0.918      0.927
                                 Long Form Token            0.623 0.345      0.444
                                 Abbreviation Type          0.838 0.828      0.833
         Table 1. Evaluation of entity spotting phase classifiers over manually annotated BARR abstracts
         (micro-average): (i) Abbreviation Token Classifier: weighted F1-score of classification of tokens
         as abbreviation or not, (ii) Long Form Token Classifier: weighted F1-score of Beginning and
         Inside tokens, (iii) Abbreviation Type Classifier: weighted F1-score of classification of abbrev.
         in: DERIVED, GLOBAL, NONE, MULTIPLE, SHORT
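
As a quick consistency check, the F1-scores in Table 1 agree, to three decimals, with the harmonic mean of the listed precision and recall values. The snippet below only reproduces this arithmetic from the table; it does not re-run the cross-validation, and the function name is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from Table 1.
table_1 = {
    "Abbreviation Token": (0.937, 0.918),
    "Long Form Token": (0.623, 0.345),
    "Abbreviation Type": (0.838, 0.828),
}

for name, (precision, recall) in table_1.items():
    print(f"{name}: {f1_score(precision, recall):.3f}")
```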



             Table 1 shows that the identification and characterization of abbreviations
         obtain satisfactory performance. As far as the identification of LFs is concerned, the
         Random Forest classifier obtains a low F1-score. This drawback of the first processing
         phase of our system (Section 4.1), probably related to the need to define better
         token-level features for LF identification, is mitigated by the second phase (Section 4.2), in
         which the LFs spotted by the Long Form Token Classifier are sanitized and
         complemented by the LF candidates retrieved by considering noun phrases.
             We submitted to the BARR track three runs for the entity extraction task and three
         runs for the relation extraction task (referred to as run v1, v3 and v4 in both tasks). In
         each run we incrementally improved the coverage and complexity of the set of heuristics
         exploited with respect to the previous one:
             – run v1: initial version of our BARR abbreviation-definition extraction system,
               including our implementation of the three token-based classifiers of the entity
               spotting phase (see Section 4.1) and an initial implementation of the relation extraction
               rules (see Section 4.2);
             – run v3: with respect to run v1, we improved the set of relation extraction rules
               by including heuristics to handle the three special cases of SF - LF relation listed
               at the end of Section 4.2 (groups of SFs, matching word-initials, LF retrieval from
               the Dictionary of Medical Abbreviations SEDOM). Besides improving the performance
               of relation extraction, these modifications allowed our system to further
               refine the set of entities spotted by the three token-based classifiers of the entity
               spotting phase (see Section 4.1);
             – run v4: with respect to run v3, our final run (v4) adds the basic set of heuristics
               tailored to spot cases of nested relations between SFs and LFs, described in
               the last part of Section 4.2.
            In Table 2 and Table 3 we provide the results of the evaluation of our BARR runs, as
         computed by means of the Markyt Web tool [15]. In particular, Table 2 shows the results
         of the entity and relation extraction tasks for each one of our three runs, against the train-
         ing set of BARR abstracts. We can notice that each new run improves the abbreviation-
         definition extraction performance.
              A consistent evaluation of our abbreviation identification approach against the BARR
         test set has not been possible due to a bug that affected our system: in our text analysis
         system we exploited the version 8.4 of the GATE General Architecture for Text En-
         gineering that did not process the texts inside