Annotating biomedical ontology terms in electronic health records using crowd-sourcing

Andre Lamurias 1,2∗, Vasco Pedro 3, Luka Clarke 2 and Francisco M. Couto 2

1 BioISI: Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Portugal
2 LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
3 Unbabel, 360 3rd Street, Suite 700, San Francisco, CA 94107-1213, USA

∗ To whom correspondence should be addressed: alamurias@lasige.di.fc.ul.pt

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Electronic health records have been adopted by many institutions and constitute an important source of biomedical information. Text mining methods can be applied to this information to automatically extract useful knowledge. We propose a crowd-sourcing pipeline to improve the precision of the extraction and normalization of biomedical terms. Although crowd-sourcing has been applied in other fields, it has not yet been applied to the annotation of health records. We expect this pipeline to improve the precision of supervised machine learning classifiers by letting users suggest the boundaries of the terms, as well as the respective ontology concepts. We intend to apply this pipeline to the recognition and normalization of disorder mentions (i.e., references to a disease or other health-related condition in a text) in electronic health records, as well as drug, gene and protein mentions.

1 INTRODUCTION

Electronic health records (EHRs) are a source of information relevant to various research areas of biomedicine. These records contain details on diseases, symptoms, drugs and mutations, as well as relations between these terms. As more institutions adopt this type of system, there is an increasing need for methods that automatically extract information from textual data. This information may be matched to existing ontologies, with the objective of either validating the extracted information or expanding the ontology with new information.

Text mining methods have been proposed to automatically extract useful information from unstructured text such as EHRs. Named Entity Recognition (NER) is a text mining task which aims at identifying the segments of text that refer to an entity or term of interest. Another task is normalization, which consists of assigning an ontology concept identifier to the recognized term. Finally, the relations described between the identified terms can be extracted, which is known as Relation Extraction.

The results of these tasks should be as accurate as possible so that minimal human intervention is required to use them for other applications. To evaluate the state-of-the-art of text mining systems fairly, community challenges have been organized, where the competing systems are evaluated on the same gold standard. Task 14 of SemEval 2015 consisted of the NER of disorder mentions in EHRs, as well as their normalization to the SNOMED-CT subset of UMLS (Campbell et al., 1998). The best F-measure obtained for this task was 75.5%. The CHEMDNER task of BioCreative IV consisted of the recognition of chemical entities in the titles and abstracts of PubMed articles. For this task, the best F-measure was 87.39%. The difference between the results of the two tasks could be due to the fact that EHRs may contain more noise than scientific articles. These results show that there is a need to improve the state-of-the-art to satisfy user expectations on the automated extraction of biomedical information from unstructured text.

In this paper we propose a pipeline to improve the extraction and normalization of biomedical ontology terms in EHRs by crowd-sourcing the validation of the results obtained with machine learning algorithms. This approach has been applied to other types of tasks, with promising results. The crowd would be used to validate the boundaries of each term, as well as the associated ontology concept.

2 NORMALIZATION OF BIOMEDICAL TERMS TO ONTOLOGIES

The results produced by NER methods may be normalized to unique identifiers from ontologies. The advantage of this approach is that the structure of the reference ontology may be used to validate the information extracted from the text. We have explored semantic similarity between chemical entities matched to ChEBI concepts, which improved the precision of our system (Lamurias et al., 2015).

The normalization of entities is a challenge due to the ambiguity and variability of the terminology. The same label may refer to different concepts depending on the context, while one concept may be mentioned with different names due to spelling variants, abbreviations and capitalization. While the ontology may provide a set of synonyms for each concept, this set is usually incomplete, requiring a method more advanced than string matching to correctly normalize an entity.
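To illustrate both the approach and its limits, the following minimal Python sketch normalizes a mention against a toy ontology with synonym lists, using a string-similarity threshold. The ontology entries, identifiers, threshold value and function names are invented for this example and are not part of the proposed system.

```python
# Minimal sketch of lexical normalization against a toy ontology.
# All identifiers, labels and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

# Toy ontology: identifier -> preferred label plus known synonyms.
ONTOLOGY = {
    "CHEBI:15377": ["water", "H2O", "aqua"],
    "CHEBI:16236": ["ethanol", "ethyl alcohol", "EtOH"],
}

def normalize(mention: str, threshold: float = 0.8):
    """Return the identifier whose label or synonym is most similar to the
    mention, or None if no candidate clears the similarity threshold."""
    best_id, best_score = None, 0.0
    for concept_id, labels in ONTOLOGY.items():
        for label in labels:
            score = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
            if score > best_score:
                best_id, best_score = concept_id, score
    return best_id if best_score >= threshold else None

print(normalize("Ethyl Alcohol"))  # exact synonym match (modulo case)
print(normalize("etanol"))         # spelling variant, caught by fuzzy matching
print(normalize("benzene"))        # absent from the toy ontology
```

Spelling variants close to a known synonym are handled, but a mention whose surface form differs substantially from every synonym is missed, which is the incompleteness problem described above.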
3 CROWD-SOURCING IN ANNOTATION TASKS

Text processing tasks are suitable candidates for crowd-sourcing, since they cannot be fully solved computationally and can be broken down into smaller micro-tasks (Good and Su, 2013). For example, crowd-sourcing has been applied to machine translation (Ambati and Vogel, 2010), recognition of names in historical records (Sukharev et al., 2014), question answering (Mrozinski et al., 2008) and ontology alignment (Sarasua et al., 2012). Crowd-sourcing micro-tasks are usually characterized by the large volume of tasks to be performed, as well as the simplicity of each individual task. The participants may be motivated by monetary rewards (e.g. Amazon Mechanical Turk), games with a purpose (Von Ahn and Dabbish, 2008), or simply the satisfaction of having contributed to a larger project (Jansen et al., 2014).

Computational methods that map a term to an ontology concept, usually based on string similarity, are able to find one or more matches for each term. However, a machine is not able to identify the most correct term from a list of matches with the accuracy of a human annotator. By letting a large number of participants evaluate the ontology concepts matched to the terms recognized in a given text, a new dataset can be generated from these corrections. This dataset would be used to train a classifier able to determine, with high precision, the correct concept corresponding to a recognized biomedical entity, such as a disorder, chemical, protein or gene. This classifier can be trained with a supervised machine learning algorithm or with reinforcement learning. Likewise, a golden dataset could be generated to evaluate and tune the classifier.
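Building such a corrected dataset requires combining the judgements of many participants on the same candidate annotation. One possible aggregation step is a weighted vote, sketched below; the weighting scheme, participant names and concept identifiers are assumptions for illustration, not a description of the paper's design.

```python
# Hypothetical weighted-vote aggregation of crowd judgements on one mention.
from collections import defaultdict

def aggregate(votes, weights):
    """votes: list of (participant, chosen_concept_id) pairs.
    weights: per-participant reliability (e.g. derived from past accuracy).
    Returns the concept identifier with the highest total weight."""
    totals = defaultdict(float)
    for participant, concept in votes:
        totals[concept] += weights.get(participant, 1.0)  # default weight 1.0
    return max(totals, key=totals.get)

votes = [("ann1", "C0004238"), ("ann2", "C0155709"), ("ann3", "C0004238")]
weights = {"ann1": 1.0, "ann2": 0.5, "ann3": 0.8}
print(aggregate(votes, weights))  # majority of reliable annotators wins
```

The resulting (mention, winning concept) pairs are exactly the kind of labeled examples a supervised classifier could then be trained on.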
4 PIPELINE

The pipeline is composed of two modules: one for the NER of disorder, chemical, gene and protein mentions, and another for their normalization to SNOMED-CT, ChEBI and Gene Ontology concepts, respectively.

The NER module starts with classifiers trained on existing annotated corpora. We have trained classifiers based on the Conditional Random Fields algorithm (Lafferty et al., 2001) for both disorder and chemical entity mentions. We will train more classifiers to recognize gene and protein mentions, using existing corpora annotated with those types of entities. The results of these classifiers will be evaluated by the crowd, who will be able to accept the entity and its boundaries, adjust the boundaries, or reject the entity if it does not correspond at all to what the classifier predicted. These corrections will be used to improve the performance of the first step through reinforcement learning, with different weights assigned to the specialists according to their usage profile.
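A crowd boundary correction can be converted back into the token-level labels used to retrain a sequence classifier such as a CRF. The sketch below uses BIO encoding and the atrial fibrillation sentence discussed later in this section; the function and label names are illustrative only.

```python
# Sketch: turn an accepted or adjusted entity span into BIO training labels.
# Label names ("B-Disorder"/"I-Disorder") and the function are illustrative.
def bio_labels(tokens, entity_span):
    """entity_span: (start, end) token indices, end exclusive."""
    start, end = entity_span
    labels = []
    for i, _ in enumerate(tokens):
        if i == start:
            labels.append("B-Disorder")
        elif start < i < end:
            labels.append("I-Disorder")
        else:
            labels.append("O")
    return labels

tokens = ["The", "rhythm", "appears", "to", "be", "atrial", "fibrillation"]
# The classifier recognized only "fibrillation"; the crowd extends the span.
print(bio_labels(tokens, (6, 7)))  # prediction before correction
print(bio_labels(tokens, (5, 7)))  # labels after the boundary adjustment
```

Each corrected sentence thus becomes a fresh training example, which is how crowd feedback would flow back into the first step of the pipeline.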
The normalization module will first attempt to map the string to a concept of the respective ontology. Since multiple matches may exist for the same string, this ambiguity will be resolved with a semantic similarity measure. These mappings will be evaluated by the crowd, with the option of accepting the concept as correct or choosing another one from the same ontology. As before, these corrections will be used to train a machine learning classifier, using the semantic similarity values as features.

For example, taking the sentence "The rhythm appears to be atrial fibrillation" as input, the NER classifier may recognize only the word "fibrillation" as a disorder mention. In this case, the boundary of the term may be extended to include "atrial". In SNOMED-CT, several concepts are related to atrial fibrillation, for example "Atrial fibrillation" (C0004238) and "Atrial fibrillation and flutter" (C0155709). If the second concept is chosen by the system instead of the first one, the user may indicate this mistake. Otherwise, the user will confirm that the mapping is correct.
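One way a semantic similarity measure can rank candidate concepts is by comparing their ancestor sets in the ontology hierarchy. The sketch below uses an invented toy is-a hierarchy and a Jaccard measure as a stand-in; it reflects neither the actual SNOMED-CT structure nor the specific measure used in our previous work.

```python
# Toy is-a hierarchy (invented, not real SNOMED-CT structure).
TOY_ISA = {
    "Atrial fibrillation": "Atrial arrhythmia",
    "Atrial fibrillation and flutter": "Atrial arrhythmia",
    "Atrial arrhythmia": "Cardiac arrhythmia",
    "Myocardial infarction": "Heart disease",
    "Cardiac arrhythmia": "Heart disease",
    "Heart disease": None,  # root of the toy hierarchy
}

def ancestors(concept):
    """Set containing the concept and all of its is-a ancestors."""
    out = set()
    while concept is not None:
        out.add(concept)
        concept = TOY_ISA[concept]
    return out

def similarity(a, b):
    """Jaccard similarity over ancestor sets: a simple stand-in for the
    semantic similarity measures discussed in the text."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)

# Disambiguate candidates against a concept already confirmed in the document.
context = "Atrial fibrillation"
candidates = ["Atrial fibrillation and flutter", "Myocardial infarction"]
best = max(candidates, key=lambda c: similarity(c, context))
print(best)
```

Concepts sharing a close common ancestor with the surrounding context score higher, which is how similarity values could serve as features for the normalization classifier.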
Every document processed by our system is anonymized using standard procedures, which include removing all references to personal details. The user only evaluates individual phrases containing annotations, to prevent the re-identification of documents. We will apply a sliding-window approach to harmonize the evaluations performed by the crowd, so that each phrase evaluated by a user overlaps with other phrases. With this strategy, we can align the sequence of phrases accepted by the majority of the crowd and prevent errors committed due to the lack of context.

As an incentive to user participation, we intend to apply a reward mechanism based on a virtual currency. KnowledgeCoin (Couto, 2014) is a virtual currency that was originally proposed to reward and recognize data sharing and integration on the semantic web. This could also be applied to the proposed pipeline, by distributing KnowledgeCoins for each text validated by a user, improving the reputation of that user. Potential participants in this kind of project would be medicine students. The University of Lisbon accepts almost three hundred medicine students per year, which could provide a relatively large crowd for our pipeline. Retired physicians, nurses, physician assistants and researchers may also participate, in order to provide more specialized curation. This type of crowd has been used by CrowdMed to provide crowd-sourced diagnostics for complex medical cases, with high levels of accuracy.

5 CONCLUSION

We propose a novel pipeline for the recognition and normalization of biomedical terms to ontology concepts, using crowd-sourcing. The complete and automatic annotation of biomedical texts such as EHRs requires systems with high precision. The normalization task is particularly challenging due to the subjective nature of ontology mapping. By letting a large group of specialized participants correct the mistakes of a machine learning classifier, we expect an improvement in the performance of current biomedical text mining systems. The idea is not only to create a scalable knowledge base, but also to draw on a community of specialist curators who may be available to help create a gold standard for a new biomedical area, improve current results, or simply validate some results.

ACKNOWLEDGEMENTS

This work was supported by the Fundação para a Ciência e a Tecnologia (https://www.fct.mctes.pt/) through the PhD grant PD/BD/106083/2015, the Biosys PhD programme and the LaSIGE Unit Strategic Project, ref. PEst-OE/EEI/UI0408/2014.

REFERENCES

Ambati, V. and Vogel, S. (2010). Can crowds build parallel corpora for machine translation systems? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 62–65. Association for Computational Linguistics.
Campbell, K. E., Oliver, D. E., and Shortliffe, E. H. (1998). The Unified Medical Language System: toward a collaborative approach for solving terminologic problems. Journal of the American Medical Informatics Association, 5(1), 12–16.
Good, B. M. and Su, A. I. (2013). Crowdsourcing for bioinformatics. Bioinformatics, page btt333.
Jansen, D., Alcala, A., and Guzman, F. (2014). Amara: A sustainable, global solution for accessibility, powered by communities of volunteers. In Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice, pages 401–411. Springer.
Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289.
Lamurias, A., Ferreira, J. D., and Couto, F. M. (2015). Improving chemical entity recognition through h-index based semantic similarity. Journal of Cheminformatics, 7(Suppl 1), S13.
Mrozinski, J., Whittaker, E., and Furui, S. (2008). Collecting a why-question corpus for development and evaluation of an automatic QA system. In 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 443–451.
Sarasua, C., Simperl, E., and Noy, N. F. (2012). CrowdMap: Crowdsourcing ontology alignment with microtasks. In The Semantic Web – ISWC 2012, pages 525–541. Springer.
Sukharev, J., Zhukov, L., and Popescul, A. (2014). Learning alternative name spellings on historical records.
Von Ahn, L. and Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58–67.