Analyzing Tools for Biomedical Text Annotation with Multiple Ontologies

Kele T. Belloze1,*, Daniel Igor S. B Monteiro2, Tulio F. Lima2, Floriano P. Silva-Jr1 and Maria Claudia Cavalcanti2

1 Laboratório de Bioquímica de Proteínas e Peptídeos, Instituto Oswaldo Cruz - FIOCRUZ, Rio de Janeiro, Brasil
2 Departamento de Ciência da Computação, Instituto Militar de Engenharia, Rio de Janeiro, Brasil


Textual databases such as PubMed, which contain references to scientific articles, are a major source for the extraction of useful information, since many scientific discoveries are recorded only in text form. However, because these databases are so large, computational approaches are needed to extract information from the texts. Currently, PubMed has more than 20 million citations from the biomedical literature.

Ontology-based semantic annotation is an approach that enriches a text with semantic descriptions, thereby facilitating the extraction of information based on the semantic content embedded in it. The ontologies used are usually focused on one domain. However, scientific articles often deal with several areas. For example, texts about drug targets refer to molecular biology, pharmacology, chemical compounds and organism names. This suggests that semantic annotation has to be done with multiple ontologies, to cover all or most of the domains of the text. Our motivation for annotating with multiple ontologies is to be able to identify extra annotations that would then be made manually. For example, imagine a text in which the name of a gene is annotated with the Gene Ontology, the organism with NCBITaxon and the pharmacogenomic relationship of interest (such as the knockout technique) with the PHARE ontology. On realizing that the text states that the knockout technique applied to gene G of organism O causes the death of O, it would be useful to annotate manually that G is essential for O.

Many tools that support semantic annotation are available on the web. This paper aims to identify and compare these tools, focusing on texts and ontologies in the biomedical area and using two main characteristics: 1) form of annotation and 2) flexibility in ontology loading. The form of annotation can be automatic or manual. In this study, we investigated automatic tools and checked whether they also offer manual annotation. Only automatic tools were selected because of the large volume of texts and because of the difficulty and high cost of keeping specialists responsible for manual annotation. Manual annotation as an additional feature is important because it would be used to insert the extra annotations. With respect to flexibility in ontology loading, we observed the size and format of the ontologies supported and the possibility of using an arbitrary ontology (chosen by the user). The use of arbitrary ontologies allows different domains to be covered by the annotation. The tools with these characteristics have also been tested. Other characteristics observed were: representation of the annotation (intrusive, in the text itself, or non-intrusive, as an attached file), types of documents supported (which file extensions the tool accepts as input), availability of documentation and development platform (web or desktop).

Although tools such as KIM, Ontea and RDFace generate automatic annotations, they use their own ontologies, which are not in the biomedical domain. Knowtator is a plugin for Protégé Server, and only a few tasks of its annotation process are automated. The MnM, GoNTogle and RDFa Editor tools perform automatic annotation and are flexible in loading arbitrary ontologies, but could not be used due to support problems. The NCBO Annotator is a web service that annotates full texts using the biomedical ontologies available at NCBO BioPortal. However, it is not available for immediate use, since it requires the development of a client for that web service.

AutôMeta and GATE can perform automatic annotation of documents and are also flexible in loading arbitrary ontologies. These tools were selected to be tested and used in our experiment on semantic annotation.

AutôMeta uses RDFa (an annotation language recommended by the W3C), has a reasoner to infer new annotations and supports loading large ontologies such as the Gene Ontology and the NCI Thesaurus, among others. The texts to be annotated must be in 'txt' format, and annotation is intrusive. GATE is a tool for natural language processing. It is solid and mature in the task of semantic annotation, using its language and processing resources. Its distinguishing feature is the ability to load documents with different extensions (txt, pdf, doc, etc.). Additionally, it performs non-intrusive annotation and stores the annotations in 'xml' files. Ontologies are loaded as processing resources, which can be very slow for large ontologies. Both tools are well documented and free.

Moreover, it is noteworthy that these tools work with a set of input texts but use only one ontology at a time. Therefore, it is possible to have texts annotated with multiple ontologies, but only in separate files, which generates a new volume of texts and many output files. Simultaneous annotation with multiple ontologies is still an unsolved problem.
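
As an illustration only, the idea of gathering annotations from several ontologies into one non-intrusive (stand-off) set, instead of one output file per ontology, can be sketched in Python. This is not how AutôMeta or GATE work: the lexicons, the term identifiers and the `annotate` function are hypothetical placeholders, and simple label matching stands in for a real annotator. The sketch does not address disambiguation, large ontologies or reasoning.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int      # character offset where the match begins
    end: int        # character offset just past the match
    text: str       # matched surface text
    ontology: str   # name of the ontology that supplied the term
    term_id: str    # identifier of the term (placeholder values here)


# Hypothetical toy lexicons: label -> term identifier.
# The identifiers are placeholders, not real ontology IDs.
LEXICONS = {
    "GeneOntology": {"gene g": "GO:EXAMPLE_1"},
    "NCBITaxon": {"organism o": "NCBITaxon:EXAMPLE_2"},
    "PHARE": {"knockout": "PHARE:EXAMPLE_3"},
}


def annotate(text, lexicons):
    """Match every label of every lexicon and return one merged,
    offset-sorted list of stand-off annotations."""
    annotations = []
    for ontology, terms in lexicons.items():
        for label, term_id in terms.items():
            pattern = r"\b" + re.escape(label) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                annotations.append(
                    Annotation(m.start(), m.end(), m.group(0), ontology, term_id)
                )
    return sorted(annotations, key=lambda a: (a.start, a.end, a.ontology))


text = "Knockout of gene G in organism O causes its death."
result = annotate(text, LEXICONS)
for a in result:
    print(a.start, a.end, a.text, a.ontology, a.term_id)
```

Because the annotations are kept as offsets next to the text rather than written into it, the annotations from all ontologies can coexist in a single structure, and a specialist could later add a manual annotation such as "G is essential for O" on top of them.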

