Annotating biomedical ontology terms in electronic health records using crowd-sourcing

Andre Lamurias 1,2∗, Vasco Pedro 3, Luka Clarke 2 and Francisco M. Couto 2

1 BioISI: Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Portugal
2 LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
3 Unbabel, 360 3rd Street, Suite 700, San Francisco, CA 94107-1213, USA

∗ To whom correspondence should be addressed: alamurias@lasige.di.fc.ul.pt

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Electronic health records have been adopted by many institutions and constitute an important source of biomedical information. Text mining methods can be applied to this information to automatically extract useful knowledge. We propose a crowd-sourcing pipeline to improve the precision of the extraction and normalization of biomedical terms. Although crowd-sourcing has been applied in other fields, it has not yet been applied to the annotation of health records. We expect this pipeline to improve the precision of supervised machine learning classifiers by letting users suggest the boundaries of the terms, as well as the respective ontology concepts. We intend to apply this pipeline to the recognition and normalization of disorder mentions (i.e., references to a disease or other health-related condition in a text) in electronic health records, as well as drug, gene and protein mentions.

1 INTRODUCTION

Electronic health records (EHRs) are a source of information relevant to various research areas of biomedicine. These records contain details on diseases, symptoms, drugs and mutations, as well as relations between these terms. As more institutions adopt this type of system, there is an increasing need for methods that automatically extract information from textual data. This information may be matched to existing ontologies, with the objective of either validating the extracted information or expanding the ontology with new information.

Text mining methods have been proposed to automatically extract useful information from unstructured text such as EHRs. Named Entity Recognition (NER) is a text mining task which aims at identifying the segments of text that refer to an entity or term of interest. Another task is normalization, which consists of assigning an ontology concept identifier to the recognized term. Finally, the relations described between the identified terms can be extracted, which is known as Relation Extraction.

The results of these tasks should be as accurate as possible so that minimal human intervention is required to use them for other applications. To evaluate the state-of-the-art of text mining systems fairly, community challenges have been organized, where the competing systems are evaluated on the same gold standard. Task 14 of SemEval 2015 consisted of the NER of disorder mentions in EHRs, as well as their normalization to the SNOMED-CT subset of UMLS (Campbell et al., 1998). The best F-measure obtained for this task was 75.5%. The CHEMDNER task of BioCreative IV consisted of the recognition of chemical entities in the titles and abstracts of PubMed articles. For this task, the best F-measure was 87.39%. The difference between the results of the two tasks could be due to the fact that EHRs may contain more noise than scientific articles. These results show that there is a need to improve the state-of-the-art to satisfy user expectations on the automated extraction of biomedical information from unstructured text.

In this paper we propose a pipeline to improve the extraction and normalization of biomedical ontology terms in EHRs by crowd-sourcing the validation of the results obtained with machine learning algorithms. This approach has been applied to other types of tasks, with promising results. The crowd would be used to validate the boundaries of each term, as well as the associated ontology concept.

2 NORMALIZATION OF BIOMEDICAL TERMS TO ONTOLOGIES

The results produced by NER methods may be normalized to unique identifiers from ontologies. The advantage of this approach is that the structure of the reference ontology may be used to validate the information extracted from the text. We have explored semantic similarity between chemical entities matched to ChEBI concepts, which improved the precision of our system (Lamurias et al., 2015).

The normalization of entities is a challenge due to the ambiguity and variability of the terminology. The same label may refer to different concepts depending on the context, while one concept may be mentioned with different names due to spelling variants, abbreviations and capitalization. While the ontology may provide a set of synonyms for each concept, this set is usually incomplete, requiring a method more advanced than string matching to correctly normalize an entity.
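To illustrate both the approach and its limits, the following minimal Python sketch normalizes a mention against a toy ontology with synonym lists, using a string-similarity threshold. The ontology entries, identifiers, threshold value and function names are invented for this example and are not part of the proposed system.

```python
# Minimal sketch of lexical normalization against a toy ontology.
# All identifiers, labels and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

# Toy ontology: identifier -> preferred label plus known synonyms.
ONTOLOGY = {
    "CHEBI:15377": ["water", "H2O", "aqua"],
    "CHEBI:16236": ["ethanol", "ethyl alcohol", "EtOH"],
}

def normalize(mention: str, threshold: float = 0.8):
    """Return the identifier whose label or synonym is most similar to the
    mention, or None if no candidate clears the similarity threshold."""
    best_id, best_score = None, 0.0
    for concept_id, labels in ONTOLOGY.items():
        for label in labels:
            score = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
            if score > best_score:
                best_id, best_score = concept_id, score
    return best_id if best_score >= threshold else None

print(normalize("Ethyl Alcohol"))  # exact synonym match (modulo case)
print(normalize("etanol"))         # spelling variant, caught by fuzzy matching
print(normalize("benzene"))        # absent from the toy ontology
```

Spelling variants close to a known synonym are handled, but a mention whose surface form differs substantially from every synonym is missed, which is the incompleteness problem described above.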
3 CROWD-SOURCING IN ANNOTATION TASKS

Text processing tasks are suitable candidates for crowd-sourcing, since they cannot be fully solved computationally and can be broken down into smaller micro-tasks (Good and Su, 2013). For example, crowd-sourcing has been applied to machine translation (Ambati and Vogel, 2010), recognition of names in historical records (Sukharev et al., 2014), question answering (Mrozinski et al., 2008) and ontology alignment (Sarasua et al., 2012). Crowd-sourcing micro-tasks are usually characterized by the large volume of tasks to be performed, as well as the simplicity of each individual task. The participants may be motivated by monetary rewards (e.g. Amazon Mechanical Turk), games with a purpose (Von Ahn and Dabbish, 2008), or simply the satisfaction of having contributed to a larger project (Jansen et al., 2014).

Computational methods that map a term to an ontology concept, usually based on string similarity, are able to find one or more matches for each term. However, a machine is not able to identify the most correct term from a list of matches with the accuracy of a human annotator. By letting a large number of participants evaluate the ontology concepts matched to the terms recognized in a given text, a new dataset can be generated from these corrections. This dataset would be used to train a classifier able to determine, with high precision, the correct concept corresponding to a recognized biomedical entity, such as a disorder, chemical, protein or gene. This classifier can be trained with a supervised machine learning algorithm or with reinforcement learning. Likewise, a golden dataset could be generated to evaluate and tune the classifier.
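Building such a corrected dataset requires combining the judgements of many participants on the same candidate annotation. One possible aggregation step is a weighted vote, sketched below; the weighting scheme, participant names and concept identifiers are assumptions for illustration, not a description of the paper's design.

```python
# Hypothetical weighted-vote aggregation of crowd judgements on one mention.
from collections import defaultdict

def aggregate(votes, weights):
    """votes: list of (participant, chosen_concept_id) pairs.
    weights: per-participant reliability (e.g. derived from past accuracy).
    Returns the concept identifier with the highest total weight."""
    totals = defaultdict(float)
    for participant, concept in votes:
        totals[concept] += weights.get(participant, 1.0)  # default weight 1.0
    return max(totals, key=totals.get)

votes = [("ann1", "C0004238"), ("ann2", "C0155709"), ("ann3", "C0004238")]
weights = {"ann1": 1.0, "ann2": 0.5, "ann3": 0.8}
print(aggregate(votes, weights))  # majority of reliable annotators wins
```

The resulting (mention, winning concept) pairs are exactly the kind of labeled examples a supervised classifier could then be trained on.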
4 PIPELINE

The pipeline is composed of two modules: one for the NER of disorder, chemical, gene and protein mentions, and another for their normalization to SNOMED-CT, ChEBI and Gene Ontology concepts, respectively.

The NER module starts with classifiers trained on existing annotated corpora. We have trained classifiers based on the Conditional Random Fields algorithm (Lafferty et al., 2001) for both disorder and chemical entity mentions. We will train more classifiers to recognize gene and protein mentions, using existing corpora annotated with those types of entities. The results of these classifiers will be evaluated by the crowd, who will be able to accept the entity and its boundaries, adjust the boundaries, or reject the entity if it does not correspond at all to what the classifier predicted. These corrections will be used to improve the performance of the first step through reinforcement learning, with different weights assigned to the specialists according to their usage profile.
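A crowd boundary correction can be converted back into the token-level labels used to retrain a sequence classifier such as a CRF. The sketch below uses BIO encoding and the atrial fibrillation sentence discussed later in this section; the function and label names are illustrative only.

```python
# Sketch: turn an accepted or adjusted entity span into BIO training labels.
# Label names ("B-Disorder"/"I-Disorder") and the function are illustrative.
def bio_labels(tokens, entity_span):
    """entity_span: (start, end) token indices, end exclusive."""
    start, end = entity_span
    labels = []
    for i, _ in enumerate(tokens):
        if i == start:
            labels.append("B-Disorder")
        elif start < i < end:
            labels.append("I-Disorder")
        else:
            labels.append("O")
    return labels

tokens = ["The", "rhythm", "appears", "to", "be", "atrial", "fibrillation"]
# The classifier recognized only "fibrillation"; the crowd extends the span.
print(bio_labels(tokens, (6, 7)))  # prediction before correction
print(bio_labels(tokens, (5, 7)))  # labels after the boundary adjustment
```

Each corrected sentence thus becomes a fresh training example, which is how crowd feedback would flow back into the first step of the pipeline.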
The normalization module will first attempt to map the string to a concept of the respective ontology. Since multiple matches may exist for the same string, this ambiguity will be resolved with a semantic similarity measure. These mappings will be evaluated by the crowd, with the option of accepting the concept as correct or choosing another one from the same ontology. As before, these corrections will be used to train a machine learning classifier, using the semantic similarity values as features.

For example, taking the sentence "The rhythm appears to be atrial fibrillation" as input, the NER classifier may recognize only the word "fibrillation" as a disorder mention. In this case, the boundary of the term may be extended to include "atrial". In SNOMED-CT, several concepts are related to atrial fibrillation, for example "Atrial fibrillation" (C0004238) and "Atrial fibrillation and flutter" (C0155709). If the second concept is chosen by the system instead of the first one, the user may indicate this mistake. Otherwise, the user will confirm that the mapping is correct.
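One way a semantic similarity measure can rank candidate concepts is by comparing their ancestor sets in the ontology hierarchy. The sketch below uses an invented toy is-a hierarchy and a Jaccard measure as a stand-in; it reflects neither the actual SNOMED-CT structure nor the specific measure used in our previous work.

```python
# Toy is-a hierarchy (invented, not real SNOMED-CT structure).
TOY_ISA = {
    "Atrial fibrillation": "Atrial arrhythmia",
    "Atrial fibrillation and flutter": "Atrial arrhythmia",
    "Atrial arrhythmia": "Cardiac arrhythmia",
    "Myocardial infarction": "Heart disease",
    "Cardiac arrhythmia": "Heart disease",
    "Heart disease": None,  # root of the toy hierarchy
}

def ancestors(concept):
    """Set containing the concept and all of its is-a ancestors."""
    out = set()
    while concept is not None:
        out.add(concept)
        concept = TOY_ISA[concept]
    return out

def similarity(a, b):
    """Jaccard similarity over ancestor sets: a simple stand-in for the
    semantic similarity measures discussed in the text."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)

# Disambiguate candidates against a concept already confirmed in the document.
context = "Atrial fibrillation"
candidates = ["Atrial fibrillation and flutter", "Myocardial infarction"]
best = max(candidates, key=lambda c: similarity(c, context))
print(best)
```

Concepts sharing a close common ancestor with the surrounding context score higher, which is how similarity values could serve as features for the normalization classifier.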
Every document processed by our system is anonymized using standard procedures, which include removing all references to personal details. The user only evaluates individual phrases containing annotations, to prevent the re-identification of documents. We will apply a sliding-window approach to harmonize the evaluations performed by the crowd, so that each phrase evaluated by a user overlaps with other phrases. With this strategy, we can align the sequence of phrases accepted by the majority of the crowd and prevent errors committed due to the lack of context.

As an incentive to user participation, we intend to apply a reward mechanism based on a virtual currency. KnowledgeCoin (Couto, 2014) is a virtual currency that was originally proposed to reward and recognize data sharing and integration on the semantic web. This could also be applied to the proposed pipeline, by distributing KnowledgeCoins for each text validated by a user, improving the reputation of that user. Potential participants in this kind of project would be medicine students. The University of Lisbon accepts almost three hundred medicine students per year, which could provide a relatively large crowd for our pipeline. Retired physicians, nurses, physician assistants and researchers may also participate, in order to provide more specialized curation. This type of crowd has been used by CrowdMed to provide crowd-sourced diagnostics for complex medical cases, with high levels of accuracy.

5 CONCLUSION

We propose a novel pipeline for the recognition and normalization of biomedical terms to ontology concepts, using crowd-sourcing. The complete and automatic annotation of biomedical texts such as EHRs requires systems with high precision. The normalization task is particularly challenging due to the subjective nature of ontology mapping. By letting a large group of specialized participants correct the mistakes of a machine learning classifier, we expect an improvement in the performance of current biomedical text mining systems. The idea is not only to create a scalable knowledge base, but also to draw on a community of specialist curators who may be available to help create a gold standard for a new biomedical area, improve current results, or simply validate some results.

ACKNOWLEDGEMENTS

This work was supported by the Fundação para a Ciência e a Tecnologia (https://www.fct.mctes.pt/) through the PhD grant PD/BD/106083/2015, the Biosys PhD programme and the LaSIGE Unit Strategic Project, ref. PEst-OE/EEI/UI0408/2014.

REFERENCES

Ambati, V. and Vogel, S. (2010). Can crowds build parallel corpora for machine translation systems? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 62–65. Association for Computational Linguistics.
Campbell, K. E., Oliver, D. E., and Shortliffe, E. H. (1998). The Unified Medical Language System: toward a collaborative approach for solving terminologic problems. Journal of the American Medical Informatics Association, 5(1), 12–16.
Good, B. M. and Su, A. I. (2013). Crowdsourcing for bioinformatics. Bioinformatics, page btt333.
Jansen, D., Alcala, A., and Guzman, F. (2014). Amara: A sustainable, global solution for accessibility, powered by communities of volunteers. In Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice, pages 401–411. Springer.
Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289.
Lamurias, A., Ferreira, J. D., and Couto, F. M. (2015). Improving chemical entity recognition through h-index based semantic similarity. Journal of Cheminformatics, 7(Suppl 1), S13.
Mrozinski, J., Whittaker, E., and Furui, S. (2008). Collecting a why-question corpus for development and evaluation of an automatic QA system. In 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 443–451.
Sarasua, C., Simperl, E., and Noy, N. F. (2012). CrowdMap: Crowdsourcing ontology alignment with microtasks. In The Semantic Web – ISWC 2012, pages 525–541. Springer.
Sukharev, J., Zhukov, L., and Popescul, A. (2014). Learning alternative name spellings on historical records.
Von Ahn, L. and Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58–67.