Analyzing Tools for Biomedical Text Annotation with Multiple Ontologies

Kele T. Belloze1,*, Daniel Igor S. B Monteiro2, Tulio F. Lima2, Floriano P. Silva-Jr1 and Maria Claudia Cavalcanti2

1 Laboratório de Bioquímica de Proteínas e Peptídeos, Instituto Oswaldo Cruz - FIOCRUZ, Rio de Janeiro, Brasil
2 Departamento de Ciência da Computação, Instituto Militar de Engenharia, Rio de Janeiro, Brasil

Textual databases such as PubMed, which contain references to scientific articles, are a major source for the extraction of useful information, since many scientific discoveries are deposited only in text form. However, due to the massive size of these bases, computational approaches are needed to extract information from texts. Currently, PubMed has more than 20 million citations from biomedical literature.

Ontology-based semantic annotation is an approach that aims to enrich the text with semantic descriptions and thereby facilitate the extraction of information based on the semantic content embedded there. The ontologies used are usually focused on one domain. However, scientific articles often deal with different areas. For example, texts about drug targets refer to molecular biology, pharmacology, chemical compounds and organism names. This suggests that semantic annotation has to be done with multiple ontologies, to cover all or most of the domains of the text. Our motivation for annotating semantically with multiple ontologies is to be able to identify extra annotations that would be made manually. For example, suppose that a text has the name of a gene annotated with the Gene Ontology, the organism with NCBITaxon and the pharmacogenomic relationship of interest (such as the knockout technique) with the PHARE ontology. On realizing that, according to the text, the knockout technique applied to gene G of organism O causes its death, it would then be useful to annotate manually that G is essential for O.

To support semantic annotation, there are many tools available on the web. This paper aims to identify and compare these tools, focusing on texts and ontologies in the biomedical area, using two main characteristics: 1) form of annotation and 2) flexibility in the ontology load. The form of annotation can be automatic or manual. In this study, we investigated automated tools and checked whether they also offer the option of manual annotation. Only automatic tools were selected because of the large volume of texts and also because of the difficulty and high cost of keeping specialists responsible for the task of manual annotation. The additional manual annotation feature is important because it would be used to insert extra annotations. With respect to the load flexibility of ontologies, the items observed include the size and format of the ontologies and the possibility of using an arbitrary ontology (a user choice). The use of arbitrary ontologies allows different domains to be used for the annotation. The tools with these characteristics have also been tested. Other characteristics observed involve: representation of the annotation (intrusive, in the text, or non-intrusive, as an attached file), types of documents compatible (which file extensions the tool supports as input), documentation availability and platform of development (web or desktop).

Although tools such as KIM, Ontea and RDFace generate automatic annotations, they have their own ontologies, which are not in the biomedical domain. Knowtator is a plugin for Protégé Server and only a few tasks of the annotation process are automated. The MnM, GoNTogle and RDFa Editor tools perform automatic annotation and have flexibility in loading arbitrary ontologies, but could not be used due to support problems. The NCBO Annotator is a web service that annotates full texts using ontologies from the biomedical domain available at NCBO BioPortal. However, it is not available for immediate usage, and demands the development of a client for that web service.

AutôMeta and GATE can perform automatic annotation of documents and also have flexibility in loading arbitrary ontologies. These were selected as the tools to be tested and used for the purpose of our experiment on semantic annotation. AutôMeta uses RDFa (an annotation language recommended by W3C), has a reasoner to infer new annotations and supports the load of large ontologies such as Gene Ontology and NCI Thesaurus, among others. The texts for annotation must be in 'txt' format and annotation is made using an intrusive method. GATE is a tool for natural language processing. It is very solid and mature in the task of semantic annotation using its language and processing resources. Its distinguishing feature is the ability to load documents with different extensions (txt, pdf, doc, etc.). Additionally, it performs non-intrusive annotation and stores the annotations in 'xml' files. Ontologies are loaded as processing resources, which can happen very slowly in the case of large ontologies. Both tools have good documentation and are free.

Moreover, it is noteworthy that these tools work with a set of input texts, but use only one ontology at a time. Therefore, it is possible to have texts annotated with multiple ontologies, but in separate files, generating a new volume of texts and many output files. The simultaneous annotation with multiple ontologies is still an unsolved problem.
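
To make the one-ontology-at-a-time limitation concrete, the following is a minimal sketch of how separate per-ontology runs over the same text could be combined afterwards into a single non-intrusive (stand-off) annotation list. It is an illustration under stated assumptions, not a feature of AutôMeta or GATE: the `Annotation` record, the `annotate` and `merge_runs` helpers, the term labels, and the simple string matching used in place of a real annotator are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a character span plus an ontology term."""
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    ontology: str   # ontology name, used here only as a label
    term: str       # illustrative term label, not a real ontology identifier


def annotate(text, phrase, ontology, term):
    """Stand-in for one single-ontology run: mark every occurrence of phrase."""
    found, pos = [], text.find(phrase)
    while pos != -1:
        found.append(Annotation(pos, pos + len(phrase), ontology, term))
        pos = text.find(phrase, pos + 1)
    return found


def merge_runs(*runs):
    """Combine several runs into one de-duplicated list ordered by position."""
    merged = {ann for run in runs for ann in run}
    return sorted(merged, key=lambda a: (a.start, a.end, a.ontology, a.term))


text = "Knockout of gene G in organism O causes its death."

# One run per ontology, as the tools discussed above would produce.
go_run = annotate(text, "gene G", "GO", "gene")
taxon_run = annotate(text, "organism O", "NCBITaxon", "organism")
phare_run = annotate(text, "Knockout", "PHARE", "knockout technique")

merged = merge_runs(go_run, taxon_run, phare_run)
for ann in merged:
    print(ann.start, ann.end, ann.ontology, text[ann.start:ann.end])
```

Keeping the annotations as offsets into the unchanged source text means that runs from different ontologies can be combined without producing a new copy of the text for each ontology. A curator reading the merged list could then add the extra manual annotation from the motivating example, namely that G is essential for O.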