Semantic Modeling for Accelerated Immune Epitope Database (IEDB) Biocuration

Gully A Burns
Information Sciences Institute
Marina del Rey, California
burns@isi.edu

Randi Vita
La Jolla Institute for Allergy and Immunology
La Jolla, California
rvita@lji.org

James Overton
Toronto, Ontario, Canada
james@overton.ca

Ward Fleri
La Jolla Institute for Allergy and Immunology
La Jolla, California
wfleri@lji.org

Bjoern Peters
La Jolla Institute for Allergy and Immunology
La Jolla, California
bpeters@lji.org

ABSTRACT
The Immune Epitope Database (IEDB) indexes and organizes published information pertaining to the molecular targets of adaptive immune responses to support epitope discovery efforts. The IEDB is an exemplary system with a well-designed repository, a commercial-grade user interface and a large user community. It is expressly ‘built-for-purpose’, with a specialized Entity-Relation (ER) schema designed specifically to describe experimental findings (in this case, outcomes from assays relevant to immune epitope studies). As in many biomedical databases, this use of a specialized ER model impacts the process of indexing and organizing available scientific information. Biocuration staff and end-users must be trained specifically in the details of the representation to populate and use the system. The extent of system interoperability is generally limited to the use of standard terminology. We apply a knowledge engineering modeling methodology called “Knowledge Engineering from Experimental Design” (KEfED) that uses a workflow-like construct to model studies that had been curated into the IEDB. This methodology generates a semantic model for experimental data from dependency relations between experimental variables based on an experiment’s protocol. We also apply the Karma mapping system to build a linked data representation of IEDB content across the whole database as a potential methodology for exporting IEDB content to a linked data format. This work demonstrates the feasibility of using KEfED modeling to represent previously-curated data in existing systems and then mapping that existing dataset to a linked data model. This may offer a graceful method for the evolution of existing, well-established databases.

CCS CONCEPTS
• Information systems → Network data models; Information integration;

KEYWORDS
Immune Epitopes, Knowledge Engineering, Biocuration

ACM Reference Format:
Gully A Burns, Randi Vita, James Overton, Ward Fleri, and Bjoern Peters. 2018. Semantic Modeling for Accelerated Immune Epitope Database (IEDB) Biocuration. In . ACM, New York, NY, USA, 6 pages. https://doi.org/

K-CAP2017 Workshops and Tutorials Proceedings,
© Copyright held by the owner/author(s).

1 INTRODUCTION
As a scientific discipline, biomedicine is complex, multidisciplinary, continuously evolving and increasingly data-driven. This has led to the development of literally thousands of biomedical databases across a large number of domains. In the field of molecular biology, the journal “Nucleic Acids Research” publishes an annual review of active molecular biology databases [6] with articles that describe each system and a managed online catalog of active systems (http://www.oxfordjournals.org/nar/database/a/). This list includes several large-scale, international informatics projects. In particular, the 2017 review presents a “golden set” of 110 databases that have “consistently served as authoritative, comprehensive, and convenient data resources widely used by the community”. In this paper, we describe preliminary modeling work within one of these database systems (the Immune Epitope Database, IEDB), to improve curation processes and permit a more standardized representation of experimental observations. Ultimately, we anticipate that this work will permit graceful evolution of the system’s underlying data schema and biocuration model.

This work is a principled approach to data modeling using “metadata propagation” through experimental workflows describing the physical processes in a laboratory experiment. Metadata propagation is a concept developed for e-Science workflow systems [7], but was repurposed as the driving principle of a flexible data modeling methodology for experimental data called “Knowledge Engineering from Experimental Design” (KEfED) [11].

The Ontology for Biomedical Investigations (“OBI”) provides a mechanism for describing experimental protocols within the context of a well-defined upper ontology [2]. We previously developed an approach to modeling experimental variables [3] that we are currently integrating into OBI in order to apply KEfED modeling to data in the IEDB.

In this paper, we model a specific article that had been previously curated into the IEDB to act as a proof-of-concept of using the KEfED methodology. We also describe use of the Karma data integration tool (http://karma.isi.edu) as a way of automatically populating a KEfED-driven linked data representation.

2 METHODS
KEfED modeling work was performed with the “kefed.io” toolset (https://github.com/SciKnowEngine/kefed.io). We downloaded the latest version of the IEDB (available from http://www.iedb.org/database_export_v3.php) and evaluated the IEDB schema and content in consultation with curation staff. We referenced IEDB’s use of OBI ontology terms for assays in this modeling effort, whilst developing and proposing extensions to OBI for data item and value specification classes in order to provide adequate coverage for appropriate variables and associated values within the KEfED models under development.

An intermediate target for this modeling work was to provide a KEfED-based design pattern that could be used to convert IEDB data to linked data using ISI’s Karma information integration tool [8]. We queried the B-Cell table from the IEDB for data from “protection from challenge” assays and then mapped columns from that data set onto the values of variables generated from our manually-curated KEfED model for the same class of experiments. This provided a viable procedure to migrate existing data from IEDB to linked data generated under a KEfED-based model. All modeling work was performed by hand and this effort was executed as a proof of concept for subsequent development.

3 RESULTS
Under development since 2004, the IEDB has undergone three large-scale iterations to provide coverage of >95% of the relevant experimental biomedical literature. At present (November 30th 2017), it lists records from 18,902 journal articles focused on infectious diseases, allergy, autoimmunity and transplantation. HIV-derived and cancer epitopes are considered out of scope for this system as they are managed elsewhere (https://www.hiv.lanl.gov/content/immunology/). Our goal in this work was to explore the feasibility of developing semantic models within the KEfED modeling formalism that could reconstruct the logic of the data that the IEDB currently contains.

As an advanced scientific database, the IEDB is based on complex, domain-specific knowledge. A key structural design concept that permits the capture of data from a wide range of immunological experiments is the IEDB’s use of well-defined assay types (https://help.iedb.org/hc/en-us/articles/114094147271-IEDB-Assay-Types-IEDB-3-0-). These are experimental processes that generate specific types of measurements with well-defined meanings that serve as the basic building blocks of immunological studies. The IEDB’s set of assay types is also documented as classes in OBI (http://obi-ontology.org/), providing a well-defined base vocabulary to build upon.

3.1 Richardson et al. 1998: A Worked Example
We focus on one study in particular: Richardson et al. 1998 [10]. A KEfED model that illustrates the assays used in this study is shown in Figure 1. This study uses peptide epitopes derived from envelope proteins of Feline Immunodeficiency Virus (FIV) as immunogens (i.e. to trigger an immune response). Animals that had been immunized with these epitopes were subsequently investigated with four assays that measured (A) whether the immunization process provides protection from the effects of a subsequent immune challenge; (B+C) the degree of antigen-antibody binding occurring after the immunization step, measured either by (B) immunoprecipitation or (C) an ELISA step; finally (D) whether antibodies generated from experimental subjects were themselves capable of neutralizing FIV in a test environment.

Figure 1: A KEfED model based on Richardson et al. 1998. The red line shows dependency relations between measurements of ‘response frequency’ from a ‘protection from challenge assay’ and parameters set previously in the protocol.

Structurally, when viewed at this level, the design is simple. The host animal is immunized and a blood sample is drawn and processed with biological activity and binding assays. In addition, the same immunized host is subjected to a “protection from challenge assay” to assess how well the immunization process protects the animal from FIV infection. The primary technical challenge of this work arises from the definition of variables that are relevant to the IEDB curation process.

In Figure 1, we provided variables with simple names (“host-parameters”, “administration-details”, etc.) to denote composite data structures that mirror the substructure of the corresponding IEDB data. An example of this substructure is shown in Table 1 for the parameter “epitope” denoted in Figure 1 as an input to the “in-vitro immunization administration” process. Note that this substructure exactly corresponds to the data provided by the IEDB in their assay pages (for example: http://www.iedb.org/assay/1508651). By capturing the data structure used in IEDB directly into parameters, we are able to match the KEfED modeling approach precisely to the data described in IEDB. This effort is intended to supplement existing biocuration efforts at IEDB [5] and so will be evaluated within their framework for quality control.

Table 1: Substructure of ‘epitope’ variable

Sub-Parameter      Type       Example Value
epitope-type       category   linear peptide
linear-sequence    string     RAISSWKQRNRWEWRPD
start-position     integer    387
end-position       integer    403
source-name        string     Envelope glycoprotein gp150
source-accession   URI        ncbi-protein:Q05312.1
source-organism    URI        ncbi-taxon:45409
source-org-name    string     FIV (isolate wo)

3.2 Representing KEfED Models using OBI elements
Within the scope of established ontologies describing experimental methodology in the biomedical community, OBI is likely the most mature and well-supported [2]. Despite being linked to and incorporated in several other projects within the community (see https://bioportal.bioontology.org/ontologies/OBI), there is no single recommended methodology of how to use OBI terms to describe an experimental workflow. We therefore developed a schema for OBI-like elements that could capture the crucial elements of a KEfED model. Figure 2 shows this schema formatted as a UML2.0 class diagram. The purpose of this schema is to provide a framework for developing KEfED models that could act as templates made up of OBI-compatible terminology.

Consistent with OBI’s extension of the Basic Formal Ontology (‘BFO’) [1], this schema extends the Continuant class to define Material_Entity and Data_Item classes. These elements are entities within the workflow that have continued existence over time. We also define Planned_Process elements that map directly to OBI’s Material Processing, Assay, and Data_Processing classes. These are the key KEfED elements used to describe the workflow. Less well defined is the way in which the values of each data item are specified. Here, consistent with ongoing discussions within the OBI community, we extend the Value_Specification class to support data of a variety of different types including ordinal, categorical and structured data. This corresponds to distinctions we previously defined in the ‘Ontology of Experimental Variables and Values’ (OoEVV) [3].

A key extension for KEfED is a representation of a data-driven context for each measurement made within the experiment. Within this design, this function is provided by the Metadata_Context class which simply links measurement and parameter values together via parameterizes and has_context properties.

3.3 Mapping data from ‘Protection From Challenge’ Experiments with Karma
The Karma system provides a methodology for rapidly mapping data sources to an OWL ontology acting as a schema for linked data [8]. We executed a native SQL query over several IEDB tables (article, bcell, object, and assay_type) to retrieve data pertaining to “Protection from Challenge” assays across the whole database. This query retrieved 2,000 rows of data that specified “in vivo assay measuring B cell epitope specific protection from challenge” (term URI: http://purl.obolibrary.org/obo/OBI_0001710) or its subtypes as their assay type. We extended OBI with OWL classes corresponding to missing elements shown in Figure 2 and constructed a Karma model that mapped the extracted data to this extended KEfED/OBI ontology. Figure 3 provides a screenshot of a subset of the Karma model showing a portion of the mapping. The Karma interface uses the term URI as the primary label on the model display but will also show the label of the term if the user hovers over the term’s node in the user interface. Modeling work was performed on a 2.5 GHz Intel Core i7 Macbook Pro with 16 GB RAM.

3.4 The Granularity of Processes: Expanding the “Protection from Challenge” Assay
Finally, we consider that descriptions of experimental processes have an inherent granularity based on the degree of detail that is required. We highlight this question by considering the ‘Protection from Challenge’ assay shown in Figure 1. A detailed reading of the paper reveals that the assay as described in IEDB is actually made up of a number of individual steps that included (a) an immune challenge, (b) extraction of tissue and blood from the host, (c) RNA extraction and (d) subsequent competitive PCR to establish measures of viremia. This is significant since these intermediate steps involve data sets that form the main evidence presented by the paper’s authors (measures of viremia in blood and spleen) that themselves must be evaluated to generate the final data item to be curated into the IEDB: “response frequency”.

This is illustrated in detail in Figure 4 showing how, in this paper, the assay has quite a complex substructure at this intermediate level. It is also worth noting that many of these processes would themselves have detailed substructure that may be of significance to a researcher maintaining their own laboratory-based record of experimental work with a very high level of detail. We model this structure by permitting Planned_Process class instances to have has_part relations with other Planned_Process instances. This would permit multiple levels of sub-processes to be described in modeling of experimental protocols.

[Figure 2 is a UML class diagram. Classes shown: Continuant, Material_Entity, Data_Item (variable_type: D = Dependent, I = Independent, C = Constant), Planned_Process (process_type: M = Material Processing, A = Assay, D = Data Transformation, E = Whole Experiment), Metadata_Context, Value_Specification (with subclasses Nominal_VS, Categorical_VS, Ordinal_VS, Scalar_VS, StructuredObject_VS, OntologicalTerm_VS and NaturalLanguage_VS), Experiment, Study_Design, Study_Design_Execution and Investigation. Relations shown include has_specified_input, has_specified_output, has_participant, parameterizes, has_context, has_value_specification, has_part, receives_input_from and provides_input_to.]

Figure 2: A data schema for representing KEfED models and data.
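As a rough illustration of the schema in Figure 2, the sketch below shows one way the Data_Item, Planned_Process and Metadata_Context elements might be expressed in code. It is a simplified sketch under stated assumptions, not the kefed.io implementation: the Python class and function names (`DataItem`, `PlannedProcess`, `MetadataContext`, `context_for`, `walk_parts`) are hypothetical, a plain Python value stands in for a Value_Specification, and the process types assigned to the four sub-steps are our own guesses. The epitope values are the example values from Table 1.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class DataItem:
    """A variable in the workflow (Data_Item in Figure 2)."""
    label: str
    variable_type: str  # "I" independent, "D" dependent, "C" constant
    value: Any = None   # simplified stand-in for a Value_Specification


@dataclass
class PlannedProcess:
    """A workflow step (Planned_Process in Figure 2)."""
    label: str
    process_type: str  # "M", "A", "D" or "E", as in the Figure 2 legend
    parameters: List[DataItem] = field(default_factory=list)
    receives_input_from: List["PlannedProcess"] = field(default_factory=list)
    has_part: List["PlannedProcess"] = field(default_factory=list)

    def walk_parts(self) -> Iterator["PlannedProcess"]:
        """Yield this process and all nested sub-processes, depth first."""
        yield self
        for part in self.has_part:
            yield from part.walk_parts()


@dataclass
class MetadataContext:
    """Links one measurement to the parameter values it depends on."""
    measurement: DataItem
    parameters: Dict[str, Any]


def context_for(measurement: DataItem, producer: PlannedProcess) -> MetadataContext:
    """Collect parameters set in the producing process and in every upstream process."""
    collected: Dict[str, Any] = {}
    seen = set()
    stack = [producer]
    while stack:
        process = stack.pop()
        if id(process) in seen:
            continue
        seen.add(id(process))
        for parameter in process.parameters:
            collected.setdefault(parameter.label, parameter.value)
        stack.extend(process.receives_input_from)
    return MetadataContext(measurement, collected)


# The "epitope" parameter, using the example values from Table 1.
epitope = DataItem("epitope", "I", {
    "epitope-type": "linear peptide",
    "linear-sequence": "RAISSWKQRNRWEWRPD",
    "start-position": 387,
    "end-position": 403,
    "source-name": "Envelope glycoprotein gp150",
    "source-accession": "ncbi-protein:Q05312.1",
    "source-organism": "ncbi-taxon:45409",
    "source-org-name": "FIV (isolate wo)",
})

immunization = PlannedProcess("immunization administration", "M", parameters=[epitope])

# The four intermediate steps named in Section 3.4 (process types are assumed).
challenge_assay = PlannedProcess(
    "protection from challenge assay", "A",
    receives_input_from=[immunization],
    has_part=[
        PlannedProcess("immune challenge", "M"),
        PlannedProcess("tissue and blood extraction", "M"),
        PlannedProcess("RNA extraction", "M"),
        PlannedProcess("competitive PCR", "A"),
    ],
)

response_frequency = DataItem("response frequency", "D")
context = context_for(response_frequency, challenge_assay)
part_labels = [process.label for process in challenge_assay.walk_parts()]
```

Here `context` plays the role of a Metadata_Context instance: it ties the dependent ‘response frequency’ measurement to the ‘epitope’ parameter set earlier in the protocol, which is the dependency traced by the red line in Figure 1. The `has_part` list mirrors the nesting of Planned_Process instances described in Section 3.4.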
4 RELATED WORK
The IEDB uses OBI to support query formation within its user interface [14]. There are other ontological representations of protocols that complement this work. The BioAssay Ontology (BAO) provides a representation of chemical biology screening assays [13]. The Evidence Ontology (ECO) provides a high-level ontological representation of different types of evidence used by biologists to draw conclusions, and ties closely to OBI [4]. The latest version of the Experiment Action Ontology (EXACT2) incorporates OBI constructs and focuses on representing the most granular actions (incubate, heat, etc.) [12]. SMART Protocols provides a methodology originally derived from models of provenance (https://smartprotocols.github.io/). STAR Methods is a publisher-initiated attempt to standardize terminology describing methodological resources used in biology [9]. None of these representations deals with the structure of claims at the level of the low-level variables that form the core of the KEfED representation.

5 DISCUSSION
This paper describes a simple proof-of-concept analysis of using the KEfED modeling approach as a possible methodology for improving the accuracy and speed of biocuration for an established biomedical database. Though far from definitive, this early work supports the notion that KEfED methods may provide a general method of capturing scientific knowledge from published experimental studies at a level of granularity that matches established databases such as the IEDB.

A possible area of difficulty in applying KEfED to experimental findings in the literature is that there are typically a wide variety of experiments performed in any given subdomain.
A database such as the IEDB manages to circumvent this complexity through the expertise of trained biocurators, who map research findings into the database schema. By using KEfED, we match our representation as closely as possible to the experimental design reported in the paper by using a more flexible data structure as a target of biocuration. It is this semantic flexibility that provides a target that closely mirrors the existing schema of the IEDB to enable this methodology to be used in data curation. It remains an open question how many of the experimental idiosyncrasies of each study design should be modeled. The rule of thumb we apply is to use the minimum information needed to recreate the structured conclusions of the study. Interestingly, using KEfED to model existing biomedical databases’ capabilities provides a possible evaluation methodology for future work. This would require a quantitative comparison of KEfED-based methods to existing database capabilities based on (A) schema verification / validation, (B) system performance, and (C) usability.

Figure 3: A screenshot taken from the Karma model showing the mapping between the OBI-derived KEfED ontology and data from the IEDB.

A key future aspect of this process of knowledge capture is to develop methods of machine reading capable of identifying and populating KEfED models automatically. This remains an important and difficult challenge problem.

Figure 4: An expanded KEfED model that shows the internal substructure of the “Protection from Challenge Assay” process node for this experiment.

ACKNOWLEDGMENTS
The authors would like to thank Sharayu Gandhi for her careful work on development of the kefed.io Javascript interface. The work was performed under subcontract directly funded by the La Jolla Institute For Allergy and Immunology.

REFERENCES
[1] Robert Arp, Barry Smith, and Andrew D. Spear. 2015. Building Ontologies with Basic Formal Ontology. The MIT Press.
[2] Anita Bandrowski, Ryan Brinkman, Mathias Brochhausen, Matthew H. Brush, Bill Bug, Marcus C. Chibucos, Kevin Clancy, Melanie Courtot, Dirk Derom, Michel Dumontier, Liju Fan, Jennifer Fostel, Gilberto Fragoso, Frank Gibson, Alejandra Gonzalez-Beltran, Melissa A. Haendel, Yongqun He, Mervi Heiskanen, Tina Hernandez-Boussard, Mark Jensen, Yu Lin, Allyson L. Lister, Phillip Lord, James Malone, Elisabetta Manduchi, Monnie McGee, Norman Morrison, James A. Overton, Helen Parkinson, Bjoern Peters, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H. Scheuermann, Daniel Schober, Barry Smith, Larisa N. Soldatova, Christian J. Jr Stoeckert, Chris F. Taylor, Carlo Torniai, Jessica A. Turner, Randi Vita, Patricia L. Whetzel, and Jie Zheng. 2016. The Ontology for Biomedical Investigations. PloS one 11, 4 (2016), e0154556. https://doi.org/10.1371/journal.pone.0154556
[3] Gully A P C Burns and Jessica A Turner. 2013. Modeling functional Magnetic Resonance Imaging (fMRI) experimental variables in the Ontology of Experimental Variables and Values (OoEVV). Neuroimage (May 2013). https://doi.org/10.1016/j.neuroimage.2013.05.024
[4] Marcus C. Chibucos, Christopher J. Mungall, Rama Balakrishnan, Karen R. Christie, Rachael P. Huntley, Owen White, Judith A. Blake, Suzanna E. Lewis, and Michelle Giglio. 2014. Standardized description of scientific evidence using the Evidence Ontology (ECO). Database : the journal of biological databases and curation 2014 (2014). https://doi.org/10.1093/database/bau075
[5] Ward Fleri, Kerrie Vaughan, Nima Salimi, Randi Vita, Bjoern Peters, and Alessandro Sette. 2017. The Immune Epitope Database: How Data Are Entered and Retrieved. Journal of immunology research 2017 (2017), 5974574. https://doi.org/10.1155/2017/5974574
[6] Michael Y. Galperin, Xose M. Fernandez-Suarez, and Daniel J. Rigden. 2017. The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic acids research (Jan. 2017). https://doi.org/10.1093/nar/gkx021
[7] Yolanda Gil. 2014. Intelligent Workflow Systems and Provenance-Aware Software. In Proceedings of the Seventh International Congress on Environmental Modeling and Software. San Diego, CA. http://www.isi.edu/~gil/papers/gil-iemss14.pdf
[8] Craig A. Knoblock, Pedro Szekely, Jose Luis Ambite, Aman Goel, Shubham Gupta, Kristina Lerman, Maria Muslea, Mohsen Taheriyan, and Parag Mallick. 2012. Semi-Automatically Mapping Structured Sources into the Semantic Web. In Proceedings of the Extended Semantic Web Conference. Crete, Greece.
[9] Emilie Marcus. 2016. A STAR Is Born. Cell 166, 5 (Aug. 2016), 1059–1060. https://doi.org/10.1016/j.cell.2016.08.021
[10] J. Richardson, A. Moraillon, F. Crespeau, S. Baud, P. Sonigo, and G. Pancino. 1998. Delayed infection after immunization with a peptide from the transmembrane glycoprotein of the feline immunodeficiency virus. Journal of virology 72, 3 (March 1998), 2406–2415.
[11] Thomas Russ, Cartic Ramakrishnan, Eduard Hovy, Mihail Bota, and Gully Burns. 2011. Knowledge Engineering Tools for Reasoning with Scientific Observations and Interpretations: a Neural Connectivity Use Case. BMC Bioinformatics 12, 1 (2011), 351. https://doi.org/10.1186/1471-2105-12-351
[12] Larisa N. Soldatova, Daniel Nadis, Ross D. King, Piyali S. Basu, Emma Haddi, Veronique Baumle, Nigel J. Saunders, Wolfgang Marwan, and Brian B. Rudkin. 2014. EXACT2: the semantics of biomedical protocols. BMC bioinformatics 15 Suppl 14 (2014), S5. https://doi.org/10.1186/1471-2105-15-S14-S5
[13] Ubbo Visser, Saminda Abeyruwan, Uma Vempati, Robin P. Smith, Vance Lemmon, and Stephan C. Schurer. 2011. BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 12 (2011), 257. https://doi.org/10.1186/1471-2105-12-257
[14] Randi Vita, James A. Overton, Jason A. Greenbaum, Alessandro Sette, and Bjoern Peters. 2013. Query enhancement through the practical application of ontology: the IEDB and OBI. Journal of biomedical semantics 4 Suppl 1 (April 2013), S6. https://doi.org/10.1186/2041-1480-4-S1-S6