=Paper=
{{Paper
|id=Vol-1747/IT503_ICBO2016
|storemode=property
|title=Malaria Study Data Integration and Information Retrieval Based on OBO Foundry Ontologies
|pdfUrl=https://ceur-ws.org/Vol-1747/IT503_ICBO2016.pdf
|volume=Vol-1747
|authors=Jie Zheng,Jashon Cade,Brian Brunk,David Roos,Chris Stoeckert,San James,Emmanuel Arinaitwe,Bryan Greenhouse,Grant Dorsey,Steven Sullivan,Jane Carlton,Gabriel Carrasco-Escobar,Dionicia Gamboa,Paula Maguina-Mercedes,Joseph Vinetz
|dblpUrl=https://dblp.org/rec/conf/icbo/ZhengCBRSJAGDSC16
}}
==Malaria Study Data Integration and Information Retrieval Based on OBO Foundry Ontologies ==
Jie Zheng, JaShon Cade, Brian Brunk, David S. Roos, Christian J. Stoeckert Jr.: EuPath Bioinformatics Resource Center, University of Pennsylvania, Philadelphia, PA, USA

Steven A. Sullivan, Jane M. Carlton: Center for Genomics & Systems Biology, Department of Biology, New York University, New York, NY, USA

San Emmanuel James, Emmanuel Arinaitwe: Infectious Diseases Research Collaboration, Kampala, Uganda

Gabriel Carrasco-Escobar, Dionicia Gamboa: Universidad Peruana Cayetano Heredia, Lima, Peru

Bryan Greenhouse, Grant Dorsey: Department of Medicine, University of California San Francisco, San Francisco, CA, USA

Paula Maguina-Mercedes, Joseph M. Vinetz: Division of Infectious Diseases, University of California San Diego, La Jolla, CA, USA

===Abstract===
The International Centers of Excellence in Malaria Research (ICEMR) projects involve studies to understand the epidemiology and transmission patterns of malaria in different geographic regions. Two major challenges of integrating data across these projects are: (1) standardization of highly heterogeneous epidemiologic data collected by various ICEMR projects; (2) provision of user-friendly search strategies to identify and retrieve information of interest from the very complex ICEMR data. We pursued an ontology-based strategy to address these challenges. We utilized and contributed to the Open Biological and Biomedical Ontologies to generate a consistent semantic representation of three different ICEMR data dictionaries that included ontology term mappings to data fields and allowed values. This semantic representation of ICEMR data served to guide data loading into a relational database and presentation of the data on web pages in the form of search filters that reveal relationships specified in the ontology and the structure of the underlying data. This effort resulted in the ability to use a common logic for storing and displaying data on study participants, their clinical visits, and epidemiological information on their living conditions (dwelling) and geographic location. Users of the Plasmodium Genomics Resource, PlasmoDB, accessing the ICEMR data will be able to search for participants based on environmental factors such as type of dwelling, location or mosquito biting rate; characteristics such as age at enrollment, relevant genotypes or gender; and visit data such as laboratory findings, diagnoses, malaria medications, symptoms, and other factors.

Keywords: standardizing data dictionaries, OBO Foundry, PlasmoDB, ICEMR

===I. INTRODUCTION===
The ICEMR program is a global network of 10 independent research centers created to improve understanding of the epidemiology and transmission patterns of malaria in different geographic regions [1]. Integrating data generated by these Centers into the Plasmodium Genomics Resource (PlasmoDB) [2], a component of the Eukaryotic Pathogen Bioinformatics Resource Center (EuPath BRC), provides web-enabled access to ICEMR project members, and ultimately the broader international research community.

Common data collected across all ICEMR projects are represented in Figure 1. However, data produced by the various ICEMR projects are heterogeneous with respect to origin, type of data, format, and spatio-temporal scale. The main challenges of sharing and integration of ICEMR data include standardizing the complex and heterogeneous data for consistent representation and providing a user-friendly interface for easy exploration of the data for constructing searches.

Figure 1. Common model of an ICEMR study. Red boxes indicate processes, blue boxes are material entities, and black boxes are dependent continuants (qualities, data). Bolded boxes indicate the entities that the main search categories are about.

Ontologies play a crucial role in heterogeneous data integration by supporting consistent data representation and providing a semantic framework to reveal the relationships between data, thereby facilitating information retrieval and new knowledge discovery [3]. We made use of the Open Biological and Biomedical Ontologies (OBO) Foundry [4], which promotes interoperable ontologies and provides a listing of ontologies seeking to follow Foundry principles. These ontologies were used to provide a common understanding of what the information collected according to different ICEMR data dictionaries and case record forms was about. The OBO-based mappings were useful for guiding data loading and queries but were not directly usable for providing intuitive display of the available data on search forms. These were combined in a EuPath application ontology. Using WebProtege [5], we created an ICEMR terminology to organize the classes of data, create top-level categories, and re-label terms according to user preference while still maintaining the OBO IRIs where applicable to preserve the semantic underpinnings. The result was a linked OBO-based application ontology and web display terminology to provide interoperability and intuitive access to the datasets based on different data dictionaries.

===II. METHODS===
====A. ICEMR data and data dictionaries====
Multiple ICEMR projects have provided data for inclusion in PlasmoDB. Each ICEMR project has provided data dictionaries covering all data variables and values required for interpreting the associated data. By data dictionary, we mean a list of terms with definitions and specification of data variables, data types, format of data, and allowed values (including controlled vocabulary values). Data dictionaries are used in data exchanges among ICEMR projects and sharing with different repositories. However, data dictionaries from the different ICEMR projects generally look very different from each other in terms of type and quantity of content.

====B. Consistent representation of ICEMR data====
To standardize the data dictionaries from different ICEMR projects, the variables and controlled vocabulary values were mapped to OBO ontologies. These included the Ontology for Biomedical Investigations (OBI) [6], Phenotype qualities (PATO) [7], Ontology for General Medical Science (OGMS) [8], Environmental Ontology (EnVO) [9], Disease Ontology (DO) [10], Drug Ontology (DRON) [11], Infectious Disease Ontology (IDO) [12], Human Phenotype Ontology (HP) [13], Information Artifact Ontology (IAO) [14], Ontology for Biobanking (OBIB) [15], and Symptom Ontology (SYMP) [16]. The mapping of terms specified in the data dictionaries to OBO ontologies was performed using the BioPortal annotator web services [17]. The annotator service can accurately (>95%) tag text with ontology terms. However, ontologies in the annotator might not be the latest version since these need to go through an indexing process before being added to the annotator. For terms where mappings were not found using the annotator, the BioPortal search web services [18] were used. Both annotator and search results were reviewed manually.

Consistent representation of ICEMR data was achieved once the variables and values in the different ICEMR data dictionaries were either mapped to existing ontology terms or new ontology terms were created for that purpose. New ontology terms were created using two approaches.

a) If the terms were general and in a domain that has been covered by an OBO ontology, they were submitted to the relevant ontology via its issue tracker to be added by the ontology developers. For example, disease terms were submitted to the DO tracker and terms related to the environment were submitted to the EnVO tracker.

b) If the terms were specific to the ICEMR projects, they were added to the EuPath ontology. The EuPath ontology is an application ontology developed for providing terms to annotate data in the EuPath BRC. The EuPath ontology was built based on OBI with integration of other OBO ontologies such as PATO, OGMS, DO, etc. when needed.

====C. Organization of ICEMR data dictionary variables for guiding searches of ICEMR data====
The ontological mapping of data dictionary variables provides semantic clarity of types. However, organization according to term types (e.g., processes, material entities, qualities, etc.) does not necessarily provide intuitive listing on web sites for mining the data. As illustrated in Figure 1, the five main types of interest are ‘participants’, ‘dwellings’, (clinical) ‘visits’, ‘entomological measurements’ and ‘geographic location’. Therefore, we organized the data dictionary variables into categories based on their relation to these types. Within each category, the data dictionary variables are grouped based on the mapped OBO ontology terms. For example, ‘height’, ‘weight’, and ‘temperature’ (measurement data) are grouped together in the ‘physical examination’ category (which in turn is placed in the ‘visit’ category). The outcome of categorization of the variables from the multiple ICEMR data dictionaries is the ICEMR terminology and is the basis for displaying search parameters of this data on the PlasmoDB website. The ICEMR terminology is represented in the OWL format containing only ‘is a’ relations, enabling visualization of the ICEMR data dictionary hierarchy organization using ontology editors.

WebProtege [5] is a web-based collaborative ontology development platform and provides a means for domain experts to review and post comments on terms. We uploaded the ICEMR terminology to WebProtege and used it for collaboratively reviewing both the organization of the ICEMR terminology and the labels of terms to be displayed on the PlasmoDB web site before loading the ICEMR data into the database. This approach ensured that the data was correctly displayed on PlasmoDB for each ICEMR project. For the ICEMR terminology, we specified display labels using the rdfs:label annotation property as they are the default term labels rendered on WebProtege. In addition, we used annotation properties to specify ontological names, definitions, and whether the term was an organizing category or a variable. If the term corresponded to a data dictionary variable, then annotation properties were also used for the original variable name in the data dictionary and source, the mapped ontology term, and the ontological definition.

The common display labels in the ICEMR terminology were agreed upon by the contributing ICEMR projects. Each contributing ICEMR project had variables unique to that project. Therefore, the application of the ICEMR terminology for organization of each ICEMR data dictionary resulted in different but still consistent outputs. The application of the ICEMR terminologies to the different projects can be viewed at the WebProtege site (http://webprotege.stanford.edu/) as “ICEMR Amazonia”, “ICEMR Indian”, and “ICEMR PRISM” (Uganda ICEMR project).

===III. RESULTS===
====A. ICEMR data and data dictionaries====
Longitudinal data from three ICEMR projects with studies in Uganda, India, and Amazonia were submitted for inclusion in PlasmoDB. Data and data dictionaries from the Uganda and Indian ICEMR projects were provided in English whereas data and the data dictionary from the Amazonia ICEMR project were in Spanish. The Amazonia ICEMR project also provided a translated data dictionary in English. All three ICEMR projects provided participant data, dwelling data on participants, and participant-associated clinical visit data. The Uganda ICEMR project also submitted entomological measurement data.

The Amazonia ICEMR data dictionary included 84 variables and 179 controlled values for 26 variables. The Indian ICEMR data dictionary contained 118 variables with 149 controlled values for 32 variables. The Uganda ICEMR data dictionary contained 121 different kinds of variables and 481 controlled values for 21 variables.

====B. Ontology term mapping====
Variables and values specified in the ICEMR data dictionaries were mapped to 10 different OBO Foundry ontologies (listed in the Methods). Table 1 lists the mapping results for each ICEMR project. A total of 209 new terms were added to the EuPath ontology for unmapped ICEMR variables. The EuPath ontology can be viewed on the WebProtege site (http://webprotege.stanford.edu/) as the “EuPath ontology” project.

{| class="wikitable"
|+ Table 1. Summary of mapped ontology terms
! ICEMR Project !! Variables !! OBO Ontologies !! EuPath Ontology
|-
| Amazonia || 84 || 15 || 69
|-
| India || 118 || 31 || 87
|-
| Uganda || 121 || 17 || 104
|}

Data dictionary variables from the different ICEMR projects referring to the same thing were often different. For example, “edad” in the Amazonia ICEMR data dictionary, “age_en” in the Indian ICEMR data dictionary, and “age” in the Uganda ICEMR data dictionary all refer to participant age at the time of enrollment and were mapped to the ontology term EUPATH_0000120: ‘age since birth at time of enrollment’. As another example of the encountered heterogeneity, Table 2 shows a sampling of mappings from symptom-related variables to ontology terms.

{| class="wikitable"
|+ Table 2. Ontology mapping of symptom related variables
! Data dictionary !! Ontology term ID !! Ontology term label !! ICEMR display name
|-
| abdominalpain || HP_0002027 || Abdominal pain || Abdominal pain
|-
| apainduration || EUPATH_0000154 || duration of abdominal pain || Abdominal pain duration
|-
| Anorexia || SYMP_0000523 || anorexia || Anorexia
|-
| aduration || EUPATH_0000155 || duration of anorexia || Anorexia duration
|-
| Cough || SYMP_0000614 || cough || Cough
|-
| cduration || EUPATH_0000156 || duration of cough || Cough duration
|-
| Diarrhea || DOID_13250 || diarrhea || Diarrhea
|-
| dduration || EUPATH_0000157 || duration of diarrhea || Diarrhea duration
|-
| Fatigue || SYMP_0019177 || fatigue || Fatigue
|-
| fmduration || EUPATH_0000158 || duration of fatigue || Fatigue duration
|-
| febrile || EUPATH_0000097 || febrile || Febrile
|-
| fever || EUPATH_0000100 || subjective fever || Fever (subjective)
|-
| Headache || HP_0002315 || Headache || Headache
|-
| hduration || EUPATH_0000159 || duration of headache || Headache duration
|-
| Jaundice || HP_0000952 || Jaundice || Jaundice
|-
| jduration || EUPATH_0000160 || duration of jaundice || Jaundice duration
|-
| jointpains || SYMP_0000064 || joint pain || Joint pains
|-
| djointpains || EUPATH_0000161 || duration of joint pains || Joint pains duration
|-
| muscleaches || EUPATH_0000252 || Muscle aches || Muscle aches
|-
| mduration || EUPATH_0000162 || duration of muscle aches || Muscle aches duration
|-
| rfa || OGMS_0000015 || clinical history || Other medical complaint
|-
| seizure || SYMP_0000124 || seizure || Seizures
|-
| sduration || EUPATH_0000163 || duration of seizures || Seizures duration
|-
| fduration || EUPATH_0000164 || duration of subjective fever || Subjective fever duration
|-
| Vomiting || HP_0002013 || Vomiting || Vomiting
|-
| vduration || EUPATH_0000165 || duration of vomiting || Vomiting duration
|}

Ontology term mapping was also performed on the controlled values of variables. 413 controlled values used in the Uganda ICEMR data were mapped to OBO ontology terms. The remaining 68 unmapped terms were added into the EuPath ontology. Few corresponding ontology terms were found for the controlled values in the Amazonia and Indian ICEMR data (14 for Amazonia and 5 for Indian, respectively). For those values without mapped ontology terms, we have created standardized labels and will add the terms to either OBO ontologies or the EuPath ontology as described in the Methods. After ontology term mapping and standardization of value labels across data from multiple ICEMR projects, we generated (data dictionary to standardized) term mapping files for each ICEMR. These mapping files were used in the ICEMR project data loading process and enabled consistent data representation in the PlasmoDB database.

====C. Organization of terms for search filters and exploration of data====
For each ICEMR project, around 100 different variables can be used to search and retrieve the data. As indicated in the Introduction, malaria researchers are interested in mining the data for insights about the connections between study participants, their living conditions (dwelling), their health status (clinical visit), their geographic location and exposure to mosquitos (entomological measurement data). We assigned the variables to these five categories based on their mapped ontology terms, taking into account whether they were a subclass of, or had a logical connection to, the categories. With the exception of geographic location, each category had around 20 different variables that required further grouping to provide intuitive access to the data for end users. Further grouping was made based on the ontological understanding of data. For example, height, weight, and temperature data are all generated by physical examination. Thus, a new class of data OGMS_0000083: ‘physical examination’ was added under category ‘visit’. Using this approach, around 5 different subtypes were created under each category (except ‘geographic location’). For example, in addition to ‘physical examination’, ‘medication’, ‘diagnosis’, ‘symptoms’, ‘laboratory findings’, ‘visit type’ and ‘visit details’ were added as subtypes of the category ‘visit’.

Term labels used in an ontology are typically chosen for ontological clarity and can be quite long. As a result, such labels are often not user-friendly or practical for providing searches on web sites like PlasmoDB. Alternative display names were therefore generated for ontology terms. For example, the display name ‘Age at time of enrollment’ is used for ontology term EUPATH_0000120: ‘age since birth at time of enrollment’.

Figure 2 shows the organization of variables that will be displayed on the website in the three ICEMR projects discussed here using Protégé, an OWL editor [19]. Among the different ICEMR data are found common categories but also some categories specific to individual projects. Therefore, each ICEMR project has its own representation of the ICEMR terminology used as web site search filters to explore its data. The application of this approach for the Uganda ICEMR project is shown in Figure 3. The applications for the other ICEMRs will be very similar and therefore users familiar with one ICEMR search will also find the other ICEMR searches to be familiar. Furthermore, the common display and underlying ontology mappings provide the opportunity for future cross-ICEMR searches.

Figure 2. Standardized representation of variables from the Amazonia (left), India (middle), and Uganda (right) ICEMR data dictionaries for web display. Highlighted is an example of a variable common to all three, ‘Age at time of enrollment’, which is placed under ‘Participant Study Details’ along with variables that are common to only two (e.g., ‘Clinical History’) or unique (e.g., ‘Reason for withdrawal’). Other categories and variables common to all three ICEMRs are underlined in red.

Figure 3. An example search of the Uganda ICEMR project data. At the top, participants with an age from 0.5 to 3 years old at enrollment can be selected. The selected participants can be filtered to find those that had subjective fever (lower section).

===IV. DISCUSSION/CONCLUSIONS===
Related but different semantic approaches were used to address the dual challenges of standardizing data dictionaries across projects and generating user-friendly displays to search and explore the associated data.

Our approach for standardization is to relate all variables and associated values to terms from interoperable ontologies listed at the OBO Foundry. OBO Foundry ontologies provide the benefit of wide coverage but can also be selectively imported to create an application ontology such as the EuPath ontology. When existing terms were not available for mapping, new ones were created for introduction into the source ontologies or just placed in the application ontology. The use of the Basic Formal Ontology (BFO) [20] by the EuPath ontology as its upper level greatly facilitated the task of standardization across projects. BFO models reality rather than data models and helps interpret when variables and values are about the same processes, material entities, and measurements. However, the ontologic semantic organization did not directly translate well to web site displays for exploring relationships between study participants, their living conditions, and data gathered at clinical visits to understand malaria epidemiology. Instead, categorical organization was better suited for web display.

An ICEMR terminology was created for the purpose of web display to organize the standardized variables according to ways that users are expected to browse them. The ICEMR terminology also takes into account the need for shortened names on a web form. Underlying all the terms, however, is their basis for understanding through mapping to OBO / EuPath ontology terms.

The separation of web display and variable standardization provides for flexibility in providing different emphases in data browsing while maintaining the same underlying semantics. The overall approach has allowed us to achieve the goal of providing a common system with consistent representation for the three currently participating ICEMR projects. It also provides a flexible existing system for introducing data from other ICEMR projects or other studies of the same type.

===ACKNOWLEDGMENT===
We acknowledge the developers of the Disease Ontology, the Environmental Ontology and the Drug Ontology for adding our requested terms into their respective ontologies. G.C.E. and D.G. thank Carmen Puemape and Mitchell Guzman for excellent technical assistance in data management.

Supported in part by National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services Contract No. HHSN272201400030C, U.S. Public Health Service cooperative agreements U19AI089674 (MGD) and U19AI089681 (JMV).

===REFERENCES===
[1] J. B. Gutierrez, O. S. Harb, J. Zheng, D. J. Tisch, E. D. Charlebois, C. J. Stoeckert, et al., “A framework for global collaborative data management for malaria research,” Am. J. Trop. Med. Hyg. vol. 93 no. 3 Suppl., pp. 124-32, September 2015.

[2] C. Aurrecoechea, J. Brestelli, B. P. Brunk, J. Dommer, S. Fischer, B. Gajria, et al., “PlasmoDB: a functional genomic database for malaria parasites,” Nucleic Acids Res. vol. 37, pp. D539-43, January 2009.

[3] V. G. Dugan, S. J. Emrich, G. I. Giraldo-Calderón, O. S. Harb, R. M. Newman, B. E. Pickett, et al., “Standardized metadata for human pathogen/vector genomic sequences,” PLoS One. vol. 9 no. 6, pp. e99979, June 2014.

[4] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, et al., “The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration,” Nat Biotechnol. vol. 25, pp. 1251-5, November 2007.

[5] M. Horridge, T. Tudorache, C. Nyulas, J. Vendetti, N. F. Noy, and M. A. Musen, “WebProtégé: a collaborative Web-based platform for editing biomedical ontologies,” Bioinformatics. vol. 30, pp. 2384-5, August 2014.

[6] A. Bandrowski, R. Brinkman, M. Brochhausen, M. H. Brush, B. Bug, M. C. Chibucos, et al., “The Ontology for Biomedical Investigations,” PLoS One. vol. 11 no. 4, pp. e0154556, April 2016.

[7] The Phenotype And Trait Ontology (PATO) [Online]. Available: https://github.com/pato-ontology/pato/

[8] The Ontology for General Medical Sciences (OGMS) [Online]. Available: https://github.com/OGMS/ogms/

[9] P. L. Buttigieg, N. Morrison, B. Smith, C. J. Mungall, and S. E. Lewis, “The environment ontology: contextualising biological and biomedical entities,” J. Biomed. Sem. vol. 4, pp. 43, December 2013.

[10] W. A. Kibbe, C. Arze, V. Felix, E. Mitraka, E. Bolton, G. Fu, et al., “Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data,” Nucleic Acids Res. vol. 43, pp. D1071-8, January 2015.

[11] J. Hanna, E. Joseph, M. Brochhausen, and W. R. Hogan, “Building a drug ontology based on RxNorm and other sources,” J. Biomed. Sem. vol. 4, pp. 44, December 2013.

[12] L. G. Cowell and B. Smith, “Infectious disease ontology,” in Infectious disease informatics, Springer New York, 2010, pp. 373-395.

[13] P. N. Robinson, S. Köhler, S. Bauer, D. Seelow, D. Horn, and S. Mundlos, “The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease,” Am. J. Hum. Genet. vol. 83, pp. 610-5, November 2008.

[14] The Information Artifact Ontology (IAO) [Online]. Available: https://github.com/information-artifact-ontology/IAO/

[15] M. Brochhausen, J. Zheng, D. Birtwell, H. Williams, A. M. Masci, H. J. Ellis, et al., “OBIB – a novel ontology for biobanking,” J. Biomed. Sem. vol. 7, pp. 23, May 2016.

[16] The Symptom Ontology (SYMP) [Online]. Available: http://symptomontologywiki.igs.umaryland.edu/mediawiki/index.php

[17] C. Jonquet, N. H. Shah, and M. A. Musen, “The open biomedical annotator,” Summit on Translat Bioinforma. vol. 2009, pp. 56-60, March 2009.

[18] P. L. Whetzel, N. F. Noy, N. H. Shah, P. R. Alexander, C. Nyulas, T. Tudorache, et al., “BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications,” Nucleic Acids Res. vol. 39, pp. W541-5, July 2011.

[19] Protégé [Online]. Available: http://protege.stanford.edu

[20] R. Arp, B. Smith, and A. D. Spear, “Building ontologies with Basic Formal Ontology,” The MIT Press, 2015.
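The paper does not describe the format of the (data dictionary to standardized) term mapping files from Results section B, so the following is only a minimal illustrative sketch of the idea: project-specific variable names are resolved to a shared ontology term ID, and a separate table supplies the shorter web display label. The variable names, term IDs, and display labels are taken from the text and Table 2; the dictionary layout, the names `VARIABLE_TO_TERM`, `DISPLAY_LABELS`, and `harmonize`, and the sample record are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# (project, original data dictionary variable) -> ontology term ID.
# The three age variables and their shared term come from the paper;
# the tuple-keyed layout is an assumption made for this sketch.
VARIABLE_TO_TERM: Dict[Tuple[str, str], str] = {
    ("Amazonia", "edad"): "EUPATH_0000120",
    ("India", "age_en"): "EUPATH_0000120",
    ("Uganda", "age"): "EUPATH_0000120",
}

# Ontology term ID -> display label used on the web form (from the text and Table 2).
DISPLAY_LABELS: Dict[str, str] = {
    "EUPATH_0000120": "Age at time of enrollment",
    "EUPATH_0000100": "Fever (subjective)",
    "HP_0002027": "Abdominal pain",
}


def harmonize(project: str, record: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Re-key one record by ontology term ID.

    Returns the re-keyed values plus the variables that had no mapping,
    which would be candidates for new ontology terms.
    """
    mapped: Dict[str, Any] = {}
    unmapped: List[str] = []
    for variable, value in record.items():
        term_id = VARIABLE_TO_TERM.get((project, variable))
        if term_id is None:
            unmapped.append(variable)
        else:
            mapped[term_id] = value
    return mapped, unmapped


if __name__ == "__main__":
    # Hypothetical record: "edad" is a real variable name, "foo" is a made-up unmapped one.
    values, missing = harmonize("Amazonia", {"edad": 4, "foo": 1})
    for term_id, value in values.items():
        print(DISPLAY_LABELS.get(term_id, term_id), "=", value)
    print("unmapped:", missing)
```

Keeping the display labels in a table separate from the variable-to-term mapping mirrors the separation of web display and variable standardization described in the Discussion: the labels can change without touching the underlying term IDs.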