KR-MED 2006 "Biomedical Ontology in Action"
November 8, 2006, Baltimore, Maryland, USA

An Online Ontology: WiktionaryZ

Erik M. van Mulligen, Ph.D. (1,2), Erik Möller, Peter-Jan Roes (3), Marc Weeber, Ph.D. (2), Gerard Meijssen, Christine Chichester, Ph.D. (2,4), Barend Mons, Ph.D. (1,2,4)

(1) Dept. of Medical Informatics, Erasmus Medical Center, Rotterdam, the Netherlands
(2) Knewco Inc, Rockville, United States of America
(3) Charta Software, Rotterdam, the Netherlands
(4) Human and Clinical Genetics, Leiden University Medical Center, Leiden, the Netherlands

e.vanmulligen@erasmusmc.nl

There is a great demand for online maintenance and refinement of knowledge on biomedical entities [1]. Collaborative maintenance of large biomedical ontologies combines the intellectual capacity of millions of minds for updating and correcting the annotations of biomedical concepts and their semantic relationships according to the latest scientific insights. These relationships extend the specialization and participation relationships currently exploited in most ontology projects. The ontology layer has been developed on top of the Wikidata [2] component and presents these biomedical concepts in a way similar to Wikipedia pages. Each page contains all information on a biomedical concept, with semantic relationships to other related concepts. A first version has been populated with data from the Unified Medical Language System (UMLS), Swiss-Prot, Gene Ontology, and Gemet. The various fields are editable online in a Wiki style and are maintained via a powerful versioning regimen. Next steps will include the definition of a set of formal rules for the ontology to enforce (onto)logical rigor.

INTRODUCTION

To deal with the deluge of biomedical information, many projects have been initiated that aim at semantically annotating content. Many of these projects can be characterized as an attempt to exploit advanced natural language processing and text mining technology to identify the relevant semantic topics contained in a text [3]. Once these concepts have been identified in a text, the information about a concept that is formalized in an ontology can be exploited for a number of tasks. One of these tasks is to improve information retrieval [4] (e.g., retrieval of texts on a particular concept might also include the retrieval of documents with a more specific, narrower meaning). Another task would be semantic navigation between texts (e.g., exploring the semantic relationships between an identified concept in a text and concepts in other texts [5]).

Outside the biomedical domain, the W3C has been working on defining exchange standards for ontologies. Their objective is to facilitate the development of technologies that enable cross-community data integration and collaborative efforts by adding semantics to the data. An example is the semantic web, where webpages are semantically tagged and, through these semantic tags, linked to other webpages (similar to the current hyperlinked web). RDF, OWL and DAML [6] are examples of standards to impose semantic tags on information on the web. The meaning of these tags is captured in ontologies that contain additional information on how these semantic tags interrelate. Applications can use these semantically interrelated tags, for instance, to navigate semantically between web resources.

All these tasks rely heavily on ontologies that serve as a repository of these biomedical concepts. Ontologies provide facilities to semantically relate the different biomedical topics. A first generation of ontologies (with limited scope) is available now. Good ontological principles have been a research topic, and many scientific projects aim at a next generation of ontologies [7]. The Open Biomedical Ontologies consortium provides a platform for making available ontologies for shared use in the medical and biomedical domain that have been constructed with tools that bring in a greater degree of logical and ontological rigor [8]. Various tools have been constructed that assist users with constructing these ontologies. Protégé is a freely downloadable program to construct ontologies using a strong formalism [9].

OntoBuilder is another ontology editor that has been developed to automatically derive ontologies from a corpus (web pages), with support to refine and restructure them. Its focus is in particular on ontologies supporting the semantic web [10]. The main emphasis of all these tools is to make the development of (rigorous) ontologies easier. The whole process of collaboration, discussion and interrelating ontologies has not yet been addressed in these tools.

In this paper a mechanism is presented to harvest from existing ontologies originating from different sources and to make these ontologies available for web-based refinement through a collaborative effort of the community of scientists. The hypothesis is that the online interaction, discussion and annotation of biomedical concepts will lead to wider coverage and higher quality ontologies with more semantics defined. Typically, most ontologies limit themselves to defining a hierarchy containing the specialization or participation relations. The biomedical semantic relations (a particular biomedical concept has a particular semantic relationship with another biomedical concept) require experts to interact and refine. These are important for the next generation of intelligent applications.

It is clear that an ontology has to cover a substantial part of the domain in order to be useful. In the biomedical domain, this would require that at least a substantial part of all medical concepts and of all genomic and proteomic concepts are included. Current vocabularies in these fields yield about 1,352K concepts for the medical domain (UMLS [11]) and about 200K for the genomics and proteomics domain (Swiss-Prot, EntrezGene, and Gene Ontology [12]).

Building a comprehensive ontology is an enormous endeavor. Bringing together all ontological knowledge from different biomedical disciplines in one environment seems to be quite impossible. Furthermore, a biomedical ontology is not a static, one-time effort. Such an ontology should be continuously revised and updated with the latest biomedical concepts and the latest semantic relations between the concepts [1]. Considering the rate at which genomics and proteomics data are produced, yielding new information on genes and proteins, it becomes clear that a comprehensive and up-to-date ontology is beyond the capabilities of any single scientific project.

The only way to cope with such enormous amounts of data in so many different biomedical fields is to have an open environment in which all scientists can collaboratively share their knowledge on particular biomedical topics. Therefore we are currently investigating the possibilities of using a web-based approach to build and maintain biomedical ontologies. Benefiting from the pioneering work of the Wikimedia Foundation on collaborative development of web-based encyclopedias, we are exploring the possibilities to adapt a Wikimedia product in such a way that it can be used to support collaboration on ontology work: the WiktionaryZ software.

Many of the current vocabularies do not satisfy the ontological principles that current research has defined [13]. In addition, editing and updating ontologies should follow rules that guarantee soundness and correctness of the ontology. Description logic, in combination with the specification of a separate hierarchy along the specialization and participation relation, could make it possible to automatically detect errors in the concept classification. WiktionaryZ has been developed in such a way that such an additional hierarchy can be expressed.

In addition to creating a collaborative instrument for biomedical scientists, this approach is also of interest to language engineering scientists, for whom a systematic translation of biomedical terms is a rich source.

METHODS

The architecture of WiktionaryZ (see Figure 1) has been based on the existing MediaWiki software. Wikidata itself is an extension of the MediaWiki software that allows for structured data functionality beyond editing flat documents like Wikipedia articles. All data are stored in a MySQL relational database management system. WiktionaryZ has been built using Wikidata to store multilingual ontologies. It supports the notion of concepts, terms, synonyms, translations, definitions and alternative definitions, semantic relations, attributes, ontology class membership, and source annotations. Each of these elements is stored in the database as a separate entity. These entities can be combined in various queries supporting different applications. Specific applications (e.g., WikiProtein and WikiAuthors) can be defined as an implementation of the WiktionaryZ schema definition (with possibly some application-specific extensions).

Figure 1 - Schematic overview of the architecture of WiktionaryZ. It has been developed on top of the existing MediaWiki software.

The WiktionaryZ software provides the same functionality as the MediaWiki software with respect to online editing (talk pages) and version management. An extended version management system is in place to distinguish between the ontology as provided by the authority (i.e., the organization that developed the thesaurus or vocabulary) and the version as maintained by the community. The WiktionaryZ software discriminates between two version branches: the so-called authoritative version and the community version. These two branches are more or less independent: new versions of the authoritative version can be imported without disrupting the community version. Conversely, edits made by the community are clearly (visually) distinguishable from the authoritative version, avoiding any confusion with respect to accountability. The authority can monitor and selectively include community edits to refine its own authoritative version. The community can harvest from the latest release of the version maintained by the authority after its import into the authoritative branch.

Every scientist can contribute and discuss information on a concept. The version management layer treats every edit as a new version. Versions can be rolled back if such a rollback does not cause relational inconsistencies. The LiquidThreads extension supports multiple threads per Wiki page. This means that one could have a discussion thread around the definition of a concept and a separate one for the translations of terms. The WiktionaryZ software and its database are available under a free content license as defined by the Free Content Definition (http://www.freecontentdefinition.org).

A Wikidata application is defined by a namespace and associated functionality. Each vocabulary can have its own namespace, and additional tables that require specific functionality can be attached to that namespace. For instance, in the WikiProtein namespace each protein can be described by its own specific features, such as amino acid sequence, the species of origin, the experimentally identified function, etc. For a gene concept, the DNA sequence could be given. Despite these specializations for each namespace, the concepts share a common set of data (and structure) for each concept.

Each biomedical concept is defined by a definition: a short and precise specification of the concept. A biomedical concept can have additional definitions: these might comprise real alternatives for the definition, or definitions with a slightly different perspective, aiming at a different scientific discipline or at a different community (high school students, for instance). Figure 2 shows an example of the information contained on a WiktionaryZ page. The palette of semantic relations between the biomedical concepts has initially been defined as the set of relations defined in the Semantic Network of the Unified Medical Language System [11]. This set of hierarchically organized relations can be easily extended and refined by the user.

Attached to each concept are terms (and synonyms), the language utterances used to refer to the concept. These terms are organized per language. Translations for each term can be entered, and the system has been preloaded with language codes as defined in the ISO/FDIS 639-3 standard. Attributes can be attached to each definition. Initially these attributes will specify properties of the defined meaning: for instance, the semantic type (e.g., a disease, a gene, a finding, a chemical, etc.) of the biomedical concept.

To benefit from the biomedical concepts already defined in existing vocabularies and thesauri, batch import facilities have been developed for WiktionaryZ. Import facilities are now available for the UMLS files, Swiss-Prot files, Gene Ontology files, and the Gemet files. Most information contained in these vocabularies and thesauri has been successfully imported and made available in a WiktionaryZ environment.

DISCUSSION

No other online editing environment has been developed that supports collaboration of scientists on annotation and semantic refinement of an ontology. The currently available tools allow for development of ontologies along some ontology design principles. However, many scientists need to be involved to refine the ontologies to a fine-grained conceptual level, to annotate the concepts, and to express the semantic relationships between concepts; in short, to represent and codify the continuous advances of scientific knowledge about any biomedical subject. For effective use of ontologies in biomedical applications it is crucial to go beyond the current foundational relations of ontologies and beyond the well established and consistently described concepts.

Our first experiments with building WiktionaryZ demonstrate that it is quite feasible to have large sets of concepts contained in a Wikidata database. The web-based interface is fast enough to retrieve the concepts, combine all concept-related data dispersed over different tables, and present them to the user. Pages are referenced per term. In case of a homonymous term, the page shows all the concepts for which the term is defined. The concept page can be very long. Currently WiktionaryZ does not provide any mechanism to define views on the data. A simple first approach would be to only show data for the language(s) that the user has indicated. More advanced views that depend on the nature of the user's task can also be foreseen (e.g., differentiating between annotators, scientists, students, ontology developers, translators, high school students, etc.).

WiktionaryZ does provide a powerful search facility: it searches for exact matches and allows for partial matches, both in the expressions associated with each concept and in their definitions. Tolerance for misspellings and phonetic search are not implemented yet.

It is evident that the current implementation lacks the ontological framework that allows for more sophisticated and rigorous quality control. This is essential when various users with different skill levels in ontology development are editing the ontology. Inclusion of a set of proper and well-defined relations expressed in a formal way should yield a more robust and more consistent editing of the ontology. Violations of these editing rules should trigger alerts to the user but should not be prohibited. It is at the moment unclear how much of the potential inconsistency problems can be avoided by this framework.

The alignment of different vocabularies also requires special attention. How can identical concepts defined in different vocabularies be aligned (mapped to the same concept)? It is as yet unclear how we can support automatic detection of (almost) synonymous concepts (e.g., "water" and "H2O" as being equivalent but defined in different vocabularies). This aspect has been a topic of study for quite some years already, and we will explore the possibilities that have been identified.

A comprehensive biomedical ontology that can be effectively used for a number of tasks (bioinformatics, clinical medicine) will contain at least 2 million biomedical concepts. This is a rough estimate based on combining the currently available thesauri, taking into account the overlap and the amount of non-medical concepts together with those parts that are still missing. Currently the National Library of Medicine, the Swiss Institute for BioInformatics, and the Gene Ontology Consortium have, apart from providing their sources, expressed their interest in this effort. An online maintained ontology will provide mechanisms to improve their authoritative sources as well.

To include other ontologies and thesauri as well, a method that can both read and write ontologies expressed in a standard syntax (OBO, OWL) has to be developed. This would make it possible to easily include a wide range of ontologies that are currently available in this format. Furthermore, the export allows the source authorities to download the latest edits for inclusion in their local version of the source. The current implementation of the system shows that it is technically feasible to have all these thesauri combined in one WiktionaryZ environment. What the impact of a large scientific community will be on such an online ontology, both with respect to quality and performance, remains a topic of research and will be part of future evaluation studies.

References

1. Wang K. Gene-function Wiki Would Let Biologists Pool Worldwide Resources. Nature 2006; 439-534.
2. Möller E. Wikidata: Wiki-Style Databases. Available from: http://mail.wikipedia.org/pipermail/wikitech-l/2004-September/025377.html
3. Nagao K, Shirai Y, Squire K. Semantic Annotation And Transcoding: Making Web Content More Accessible. IEEE Multimedia 2001;8(2):69-81.
4. Müller H-M, Kenny EE, Sternberg PW. Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biology 2004;2(11).
5. Buitelaar P, Eigner Th, Racioppa S. Semantic Navigation With VieWs. Proceedings of the Workshop on User Aspects of the Semantic Web at the European Semantic Web Conference, 2005.
6. Miller E. Weaving Meaning: An Overview Of The Semantic Web. Presented at the University of Michigan, Ann Arbor, Michigan USA, 2004.
7. Smith B, Rosse C. The Role Of Foundational Relations In The Alignment Of Biomedical Ontologies. Proc. Medinf 2004. Amsterdam: IOS Press, 2004;444-8.
8. Open Biomedical Ontologies. Available from: http://obo.sourceforge.net/main.html
9. Knublauch H, Fergerson RW, Noy NF, Musen MA. The Protégé OWL Plugin: An Open Development Environment For Semantic Web Applications. Third International Semantic Web Conference, Hiroshima, Japan, 2004.
10. Roitman H, Gal A. OntoBuilder: Fully Automatic Extraction And Consolidation Of Ontologies From Web Sources Using Sequence Semantics. Proceedings of the International Conference on Semantics of a Networked World (ICSNW), 2006.
11. Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med 1993;32(4):281-91.
12. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake JA, et al. A Short Study On The Success Of The GeneOntology. J Web Semantics 2004;1:235-40.
13. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, et al. Relations In Biomedical Ontologies. Genome Biology 2005;6(5).
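
As an illustration of the concept structure and search behaviour described in METHODS and DISCUSSION, the following minimal Python sketch models a concept with its definition, alternative definitions, terms per language, attributes, and semantic relations, and searches them by exact or partial match. It is not the WiktionaryZ schema: the class `Concept`, the function `search`, the field names, the identifiers `C1` to `C3`, and the sample data are all assumptions made for this example. Language keys follow the ISO 639-3 codes mentioned in the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """One biomedical concept (hypothetical structure, not the real schema)."""
    concept_id: str
    definition: str
    alternative_definitions: list[str] = field(default_factory=list)
    # ISO 639-3 language code -> terms and synonyms in that language
    terms: dict[str, list[str]] = field(default_factory=dict)
    # e.g. {"semantic type": "disease"}
    attributes: dict[str, str] = field(default_factory=dict)
    # (relation name, target concept_id) pairs
    relations: list[tuple[str, str]] = field(default_factory=list)


def search(concepts: list[Concept], query: str, partial: bool = False) -> list[Concept]:
    """Return the concepts whose terms or definitions match the query.

    Exact matching compares whole strings; partial matching looks for the
    query as a substring. Both are case-insensitive.
    """
    q = query.lower()
    hits = []
    for concept in concepts:
        texts = [concept.definition, *concept.alternative_definitions]
        for language_terms in concept.terms.values():
            texts.extend(language_terms)
        lowered = [text.lower() for text in texts]
        if partial:
            matched = any(q in text for text in lowered)
        else:
            matched = any(q == text for text in lowered)
        if matched:
            hits.append(concept)
    return hits


# Sample data (invented for illustration only).
concepts = [
    Concept(
        "C1",
        "A viral infection of the upper respiratory tract.",
        terms={"eng": ["common cold", "cold"], "nld": ["verkoudheid"]},
        attributes={"semantic type": "disease"},
    ),
    Concept(
        "C2",
        "A condition of low temperature.",
        terms={"eng": ["cold"]},
        attributes={"semantic type": "finding"},
    ),
    Concept(
        "C3",
        "A virus that is a frequent cause of the common cold.",
        terms={"eng": ["rhinovirus"]},
        relations=[("causes", "C1")],
    ),
]

# A homonymous term resolves to every concept for which it is defined.
homonyms = search(concepts, "cold")
```

In this sketch the exact search for the term "cold" returns both `C1` and `C2`, which mirrors the statement that a page for a homonymous term shows all concepts for which the term is defined. Tolerance for misspellings and phonetic matching is deliberately left out, as in the system described.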
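
The two version branches described in METHODS can be sketched in the same hedged way. The class `BranchedValue` and its methods are hypothetical names chosen for this example, and the consistency check is passed in as a plain function because the paper does not specify how relational inconsistencies are detected. Rollback is simplified here to dropping the latest community version.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BranchedValue:
    """Version history of one field, kept in two independent branches."""
    authoritative: list[str] = field(default_factory=list)
    community: list[str] = field(default_factory=list)

    def import_release(self, value: str) -> None:
        # A new release from the authority never touches the community branch.
        self.authoritative.append(value)

    def community_edit(self, value: str) -> None:
        # Every edit is stored as a new version.
        self.community.append(value)

    def harvest(self) -> None:
        # The community takes over the latest imported authoritative version.
        if self.authoritative:
            self.community.append(self.authoritative[-1])

    def adopt_community_edit(self) -> None:
        # The authority selectively includes the latest community version.
        if self.community:
            self.authoritative.append(self.community[-1])

    def rollback(self, is_consistent: Callable[[str], bool]) -> bool:
        """Undo the latest community version if the previous one is consistent."""
        if len(self.community) < 2:
            return False
        if not is_consistent(self.community[-2]):
            return False
        self.community.pop()
        return True


definition = BranchedValue()
definition.import_release("definition, release 1")
definition.harvest()
definition.community_edit("definition, release 1, refined by the community")
definition.import_release("definition, release 2")
```

After these calls the community branch still ends with the community's refined text, while the authoritative branch holds both releases, which is the independence of branches that the text describes.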