Gold-Standard Ontology-Based Annotation of Concepts in Biomedical Text in the CRAFT Corpus: Updates and Extensions

Michael Bada, Lawrence Hunter
University of Colorado School of Medicine
Aurora, CO, USA
{mike.bada, larry.hunter}@ucdenver.edu

Nicole Vasilevsky, Melissa Haendel
Ontology Development Group, Library
Oregon Health & Science University
Portland, OR, USA
{vasilevs, haendel}@ohsu.edu

Abstract—Ontologies are increasingly used for semantic integration across disparate curated biomedical resources, while gold-standard annotated corpora are needed for accurate training and evaluation of text-mining tools. Bringing together the respective power of these, we created the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles that have been manually annotated both syntactically and semantically with select Open Biomedical Ontologies (OBOs), the first release of which includes ~100,000 annotations of concepts mentioned in the text of 67 articles and mapped to the classes of eight prominent OBOs. Here we present our continuing work on the corpus, including updated versions of these annotations with newer versions of the ontologies, new annotations made with two additional OBOs, annotations made with newly created extension classes defined in terms of existing classes of the ontologies, and new annotations of roots of prefixed and suffixed words.

Keywords—annotation, corpus, markup, ontology.

I. INTRODUCTION

With the ever-rising amount of biomedical literature, it is increasingly difficult for scientists to keep up with the published work in their fields of research, much less related ones. The use of natural language processing (NLP) tools can make the literature more accessible by aiding concept recognition and information extraction. As NLP-based approaches have been increasingly used for biocuration, so too have biomedical ontologies, whose use enables semantic integration across disparate curated resources, and millions of biomedical entities have been annotated with them. Particularly important are the Open Biomedical Ontologies (OBOs), a set of open, orthogonal, interoperable ontologies formally representing knowledge over a wide range of biology, medicine, and related disciplines [1].

Manually annotated document corpora have become critical gold-standard resources for the training and testing of biomedical NLP systems. This was the motivation for the creation of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access journal articles from the biomedical literature [2, 3]. Within these articles, each mention of the concepts explicitly represented in eight prominent OBOs has been annotated, resulting in gold-standard ontology-based markup of genes and gene products, chemicals and molecular entities, biomacromolecular sequence features, cells and cellular and extracellular components and locations, organisms, biological processes and molecular functionalities. With these ~100,000 concept annotations among the ~800,000 words in the 67 articles of the 1.0 release, it is one of the largest gold-standard biomedical semantically annotated corpora. In addition to this substantial conceptual markup, the corpus is fully annotated along a number of syntactic and other axes, notably by sentence segmentation, tokenization, part-of-speech tagging, syntactic parsing, text formatting, and document sectioning.

In the several years since its initial release, the CRAFT Corpus has supported efforts within our group and in collaboration with others, including the first comprehensive gold-standard evaluation of prominent concept-recognition systems [4], and it has already been used in multiple external projects to drive development of systems for biomedical curation, search, visualization, and semantic and syntactic NLP tasks (e.g., [5, 6]). Here we present our continuing work on the corpus, including updated versions of these annotations with newer versions of the ontologies, new annotations made with two additional OBOs, annotations made with newly created extension classes defined in terms of existing classes of the ontologies, and new annotations of roots of prefixed and suffixed words.

II. METHODS

All continuing work on the concept annotations of the CRAFT Corpus was performed in Knowtator, a plugin to Protégé-Frames [7] (as was done for the v1.0 concept annotations). The lead annotator (MB) updated the v1.0 concept annotations using newer versions of the ontologies that had been used to mark up the articles by removing annotations of obsoleted classes, editing previously made annotations, and creating new annotations for new classes. A list of approximately 20 prefixes and suffixes was compiled, and roots of words with these affixes were annotated as their unaffixed analogs would be. As the updating progressed with each ontology, corresponding extension classes were created to use for further annotation.

Annotation of the corpus with the Molecular Process Ontology (MOP) and Uberon was performed in one primary round (by NV) followed by a review (by MB) using the original concept annotation guidelines [8]. Roots of words with the aforementioned affixes were also annotated, and extension classes were also created and used for additional annotation. The articles were annotated with a single ontology at a time and a batch at a time (8 articles per batch for the MOP and 4 articles per batch for Uberon), and interannotator agreement (IAA) was calculated for each batch using Knowtator’s built-in IAA calculation functionality. The curators strove for IAA ≥ 90% for each annotation batch.

III. RESULTS AND DISCUSSION

So as to remain current and relevant, the v1.0 concept annotations of the corpus are being reviewed and updated by addition, editing, and deletion of annotations as appropriate, relying on newer versions of the eight OBOs previously used. Updating with four of these has been completed.

The extension of annotation of specific affixed root words is largely for consistency: In the v1.0 corpus, any whitespace or punctuation character could serve as an annotation delimiter; thus, “chromatin” of “anti-chromatin” would be annotated with the Gene Ontology class for chromatin (GO:0000785), but it could not be annotated within “antichromatin”, as there is no delimiter. The rendering of such affixes is variable in that they can be nondelimited from their root words or delimited by whitespace or punctuation, so with this updating, the markup of such affixed words is now more consistent; furthermore, additional knowledge is captured. A specific list of such affixes to consider has been compiled and will be provided with the next release.

While creating the concept annotations for the v1.0 corpus, we encountered a variety of difficulties with annotating exclusively with explicitly represented OBO classes, including class ambiguity, lack of sufficiently generic classes, lack of classes for words consisting of combinations of multiple ontology classes, representation of the same concept in multiple ontologies, and incompleteness of ontologies. To ameliorate these issues, we have been creating and using specific extension classes for concept annotations for the corpus update. All of these are formally defined in terms of explicitly represented OBO classes, and we intend to make these definitions available in OWL files in the next release. However, we also intend to release the annotations in sets both including and excluding these extension classes for users who respectively do and do not wish to make use of annotations with such classes in their work.

Finally, for the purpose of capturing additional types of biomedically relevant concepts, annotations have been created for the articles of the corpus using the classes of the MOP ontology of chemical processes [9] and the Uberon anatomical ontology [10]. Tables 1 and 2 display relevant statistics for the 67 articles of the public set, excluding and including use of extension classes, and IAA statistics are presented in Figure 1.

Table 1: Annotation counts (each cell: excluding / including extension classes).

ontology   total # annotations   average # annotations per article   median # annotations per article   max # annotations per article
MOP        293 / 331             4 / 5                               2 / 2                              34 / 34
UBERON     12,238 / 15,051       183 / 225                           130 / 169                          578 / 709

Table 2: Unique concept counts (each cell: excluding / including extension classes).

ontology   total # unique concepts   average # unique concepts per article   median # unique concepts per article   max # unique concepts per article
MOP        19 / 20                   2 / 2                                   1 / 1                                  6 / 6
UBERON     850 / 898                 31 / 37                                 24 / 30                                109 / 129

Figure 1: IAA (as F1-measure) vs. annotation batch number.

IV. CONCLUSIONS

We have presented our continuing work on the gold-standard concept annotations of the CRAFT Corpus, including updated versions of the annotations with newer versions of ontologies, new annotations made with additional OBOs, annotations made with newly created ontology extension classes, and new annotations of roots of prefixed and suffixed words. We intend to soon release these updated annotations in future versions of the corpus, and we also have longer-term plans for further development of the corpus.

ACKNOWLEDGMENT

This work was supported by grant DARPA-BAA-14-14.

REFERENCES

[1] http://www.obofoundry.org
[2] Bada M, Eckert M, Evans D, Garcia K, Shipley K, et al. (2012) Concept Annotation in the CRAFT Corpus. BMC Bioinform 13:161.
[3] Verspoor K et al. (2012) A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinform 13:207.
[4] Funk C, Baumgartner W, Garcia B, Roeder C, Bada M, et al. (2014) Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinform 15:59.
[5] Liu H et al. (2012) BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. J Biomed Semantics 3:3.
[6] Nunes T, Campos D et al. (2013) BeCAS: biomedical concept recognition services and visualization. Bioinform 29(15), 1915-1916.
[7] Ogren PV (2006) Knowtator: a Protégé plug-in for annotated corpus construction. Proc Hum Lang Tech Conf N Am Chap Assoc Comp Ling.
[8] Bada M et al. (2010) An overview of the CRAFT concept annotation guidelines. Proc 4th Ling Annot Wkshp, Assoc Comp Ling, 207-211.
[9] http://obofoundry.org/ontology/mop.html
[10] Mungall CJ, Torniai C, Gkoutos GV, Lewis SE et al. (2011) Uberon, an integrative multi-species anatomy ontology. Genome Biol 13:R5.
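
The per-batch IAA reported as an F1-measure could be computed along the following lines. This is a minimal sketch under stated assumptions, not Knowtator's implementation: it assumes that two annotations agree only when their character span and ontology class both match exactly, and the `Annotation` type, the `f1_agreement` function, the offsets, and the placeholder class identifiers are hypothetical names introduced for illustration (only GO:0000785 appears in the text).

```python
from typing import NamedTuple, Set


class Annotation(NamedTuple):
    start: int    # character offset where the annotated span begins
    end: int      # character offset where the span ends (exclusive)
    concept: str  # ontology class identifier, e.g. "GO:0000785"


def f1_agreement(a: Set[Annotation], b: Set[Annotation]) -> float:
    """F1 between two annotation sets, counting exact span-and-class matches."""
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as full agreement
    matches = len(a & b)
    # With one set taken as reference, precision = matches/len(b) and
    # recall = matches/len(a); their harmonic mean simplifies to this form.
    return 2 * matches / (len(a) + len(b))


# Hypothetical batch: offsets and placeholder identifiers are illustrative only.
annotator_1 = {
    Annotation(0, 9, "GO:0000785"),
    Annotation(20, 25, "UBERON:placeholder_1"),
    Annotation(40, 47, "MOP:placeholder_1"),
}
annotator_2 = {
    Annotation(0, 9, "GO:0000785"),
    Annotation(20, 25, "UBERON:placeholder_1"),
    Annotation(60, 66, "UBERON:placeholder_2"),
}

batch_f1 = f1_agreement(annotator_1, annotator_2)
print(f"{batch_f1:.3f}")  # 2 * 2 / (3 + 3) = 0.667
```

Under this definition, a batch would meet the ≥ 90% target only when nearly all annotations from both annotators coincide in both span and class; looser matching criteria (overlapping spans, or class subsumption) would yield higher values.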