=Paper= {{Paper |id=None |storemode=property |title=Unifying Phenotypes to Support Semantic Descriptions |pdfUrl=https://ceur-ws.org/Vol-1041/ontobras-2013_paper50.pdf |volume=Vol-1041 |dblpUrl=https://dblp.org/rec/conf/ontobras/MirandaS13 }} ==Unifying Phenotypes to Support Semantic Descriptions== https://ceur-ws.org/Vol-1041/ontobras-2013_paper50.pdf
     Unifying Phenotypes to Support Semantic Descriptions
                       Eduardo Miranda¹, André Santanchè¹
                 ¹Institute of Computing – State University of Campinas
           Av. Albert Einstein, 1251 – Cidade Universitária, Campinas, Brazil
     eduardo.miranda@students.ic.unicamp.br, santanche@ic.unicamp.br

    Abstract. In life sciences, there are several biological datasets shared through
    the web. All this abundance of data carries a great opportunity to explore com-
    plex relationships among the diversity of species. However, their physical for-
    mat varies from independent data files to databases, which are heterogeneous
    in model and representation, hampering their integration. Ontologies are one
    of the promising choices to address this challenge. However, the existing digital
    phenotypic descriptions are stored in semi-structured formats, making extensive
    use of natural language. While this heritage is highly relevant, converting it
    into ontologies is not a straightforward task. The present article addresses this
    problem by adding an intermediate step between semi-structured phenotypic
    descriptions and ontologies. It remodels semi-structured descriptions into a graph
    abstraction in which the data are linked. Graph transformations support the
    transition from a semi-structured data representation to a more formalized
    representation through ontologies.

1. Introduction
Bioinformatics is the science of integrating, managing, mining and interpreting informa-
tion from biological data [Gibas and Jambeck 2001]. In the life science field, there are a
large number of distributed biological datasets freely available and ready to use. However,
this wealth of information has hardly been tapped even today due to its distributed nature,
heterogeneity and complex data types and representation [Parr et al. 2012]. In this sce-
nario, their combination and interconnection are barely feasible [Quan 2007]. A massive
amount of relevant information is hidden in the potential connection of unrelated files.
         In this work we are interested in a specific biology context, in which biologists
apply computational tools to build and share digital descriptions of living beings as
phenotypes. These descriptions are a fundamental starting point for several biology tasks,
such as the identification of living beings and phylogenetic tree analysis. Even though the
latest generation of these tools is based on open standards (e.g., XML), the descriptions are
still based on textual sentences in natural language [Balhoff et al. 2010].
        Semantic integration in this context is one of the main challenges. Besides
ontologies to support phenotype description, there are tools to annotate descriptions by
associating ontology concepts with textual descriptions [Balhoff et al. 2010]. This
separation between descriptions and their ontology-based annotations does not consider that
descriptions can, conversely, contribute to ontology expansion and revision. The challenge in
this work is to establish a model to represent a common denominator among phenotypic
description standards, which will support the discovery of latent semantics implicit in
relations, following a strategy inspired by folksonomies. These semantics can guide the
interaction between textual descriptions and ontologies.

        In a previous work [Alves and Santanchè 2013], we showed that the latent semantics
present in tags and their correlations, as a product of organic work collectively
produced by a community on the web (the folksonomies), can be exploited to expand and
revise ontologies. While the model behind folksonomies is based on the correlation of
three elements – tags, resources and users – descriptions in the biological context present
more complex and specialized structures. Co-occurrence is a strong principle we
considered to extract latent semantics. The main idea is that the set of tags put together in a
given resource can provide a “context” to interpret each tag. Consider a tag cell, which
can have a distinct interpretation according to the context. The co-occurrence with the
tags cytoplasm or organelle will put it in the biology context. Moreover, the compilation
of data concerning the occurrence and co-occurrence of millions of tags can support the
analysis of similarity among terms – see more details in [Alves and Santanchè 2013]. We
consider that we can apply an equivalent technique to put terms of phenotype descriptions
in a context, to improve their interpretation and correlation.
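
The co-occurrence principle can be illustrated with a minimal sketch. The resources and tag sets below are made up for illustration, and the helper names (`cooccurrence_counts`, `context_of`) are hypothetical; the sketch only counts pairs of tags that share a resource and is not the analysis described in [Alves and Santanchè 2013].

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged resources; the tag sets are illustrative only.
resources = {
    "r1": {"cell", "cytoplasm", "organelle"},
    "r2": {"cell", "organelle"},
    "r3": {"cell", "battery"},
}


def cooccurrence_counts(tagged):
    """Count how often each unordered pair of tags appears on the same resource."""
    counts = Counter()
    for tags in tagged.values():
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts


def context_of(tag, counts):
    """Return the tags that co-occur with `tag`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == tag:
            related[b] += n
        elif b == tag:
            related[a] += n
    return related.most_common()


counts = cooccurrence_counts(resources)
print(context_of("cell", counts))
```

In this toy data the tag cell co-occurs most often with organelle, which is the kind of signal that would place it in the biology context.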
        The present paper addresses this problem by exploiting existing biology assets
related to phenotypic descriptions, and the latent semantics resulting from their
interconnection, to support their development towards a richer semantic representation, as part
of ontologies. This implies promoting relations among concepts to first-class citizens.
Accordingly, we designed a three-layer method, illustrated in Figure 1, in which graph
databases mediate the evolution from fragmentary data sources to fully integrated
descriptions as ontologies.
        Our approach remodels semi-structured descriptions to a graph abstraction, in
which the data can be integrated more easily. Graph transformations are applied for the
transition from a semi-structured data representation to a more formalized representa-
tion through ontologies. As we will further explain, this graph representation will also
support an analytical tool to compare data across studies, wherein it will help evolution-
ary biologists to answer evolutionary questions. This paper presents a work in progress
concerning the first step of this method, focusing on the integration of data from the
semi-structured data layer and their transition to the graph data abstraction layer. Our proposed
graph-based model is derived from a comparative analysis among four standards related
to phenotype description, plus a practical experiment.




        [Figure 1 diagram: three stacked layers, from top to bottom: Ontology concepts; Graph data abstraction; Semi-structured data]

                         Figure 1. Three layers method diagram.


        This paper is organized as follows: Section 2 summarizes the related work;
Section 3 presents the comparative analysis that underpins our minimal common denominator
model; Section 4 presents our graph-based model; Section 5 shows a practical
experiment of unifying phenotypes; Section 6 presents concluding remarks.

2. Related Work
Integration is a key point, as humans are increasingly unable to handle the sheer volume
of data presented [Bell et al. 2009]. It is an important step towards knowledge discovery
[Lenzerini 2002]. The integration of digital phenotype descriptions is a relevant challenge
in this context since they support fundamental biology tasks, such as the building of identification
keys for living beings, and can support the creation of a complete evolutionary Tree of
Life [Parr et al. 2012] assembling genomic and morphological data so as to congregate the
phylogenetic relationships among all living or extinct organisms [Ciccarelli et al. 2006].
Likewise, integrating these data may contribute to better understanding of how a morpho-
logical trait became organized and evolved over time [Mabee 2006].
        Recent approaches enrich descriptions via ontology annotations, using the
Entity-Quality (EQ) formalism for phenotype modeling. EQ is a representation
[Balhoff et al. 2010] which associates ontology entity terms (E) – e.g., bone or vertebra
from Teleost Anatomy Ontology (TAO) – with quality terms (Q) – e.g., triangular,
horizontal, smooth from the Phenotype and Trait Ontology (PATO) [Dahdul et al. 2010].
Ontologies have gained wide acceptance in biology due to their ability to represent
knowledge and the advantage of supporting querying and reasoning over information
[Gkoutos et al. 2004]. Furthermore, semantic web standards to represent ontology concepts
with unique identifiers facilitate interoperability across databases [Mabee et al. 2007].
Recently, several
tools have emerged to support annotation of biological phenotypes using ontologies,
e.g., Phenex (http://phenoscape.org/wiki/Phenex) and Phenote (http://www.phenote.org/ ),
both curation tools designed for annotation of phenotypic characters with ontology con-
cepts using EQ formalism [Balhoff et al. 2010].
        [Dahdul et al. 2010] developed a workflow for curation of phenotypic characters
extracted from scientific publications. It is important to note the limitations of this cura-
tion process, considering that it is very time-consuming since it is manually carried out
by domain experts.

3. Common Denominator
There is a wide variety of representation formats for phenotype description, adopted by
information systems and open standards, which represent the same information
differently. In this section, we analyze four of them – Xper2, SDD, Nexus and NeXML – looking
for a minimal common denominator, which is the foundation for our graph-based model,
to be used to link related information.
        SDD, Nexus and NeXML are widely adopted open standards, detailed further below.
Xper2 (http://lis-upmc.snv.jussieu.fr/lis/) is a management system adopted by the
systematist community for storing, editing and analyzing phenotype descriptive data. It
focuses mainly on taxonomic descriptions, allowing the creation, sharing and comparison
of identification keys [Ung et al. 2010a, Ung et al. 2010b]. Xper2 was developed in the
Laboratoire Informatique & Systématique of the University Pierre et Marie Curie and

this work is part of a bigger project in collaboration with this lab. Therefore, Xper2 was
adopted for our practical experiments.
         In order to illustrate our analysis, let us consider a practical case, in which a biol-
ogist is building a phenotype description of monitor lizards (genus Varanus). The process
starts with the biologist collecting observations of lizards, organized as characters and
character states (C, CS). [Pimentel and Riggins 1987] defined character as “a feature of
organisms that can be evaluated as a variable with two or more mutually exclusive and
ordered states”. The observations involved the species Varanus albiguralis and Varanus
brevicauda. The final result is the character-by-taxon matrix illustrated in Figure 2.

            Taxon                  Nostrils' form   Transversal section of the tail   Nuchal scales
            Varanus albiguralis          2                        1                          2
            Varanus brevicauda           1                        2                          1

            Nostrils' form: 1 – well round; 2 – oval or split-like
            Transversal section of the tail: 1 – laterally compressed; 2 – roundish
            Nuchal scales: 1 – same size than head scales; 2 – bigger than head scales

                                  Figure 2. Character-by-taxon matrix

        In order to transform these observations to digital records and generalize them –
e.g., devising general characters and states observed in a genus of monitor lizards – the
biologist will use a tool like Xper2. Phenotype descriptions can be stored in the Xper2
native format or can be exported to the SDD open format. The Structured Descriptive Data
(SDD) (http://wiki.tdwg.org/SDD) is a platform and application-independent XML-based
standard developed by the Biodiversity Information Standards (historic acronym: TDWG)
for recording and exchanging descriptions of biological and biodiversity data of any type
[Hagedorn 2007]. SDD is adopted by several other phenotype description tools – e.g.,
Lucid Central (http://www.lucidcentral.org) and Linnaeus II (http://www.eti.uva.nl/ ).
        We now introduce some key elements of the SDD format, which recur
in the formats compared in this section. An SDD description comprises, in a single file,
a domain schema and its instances. Figure 3 shows a diagram with a fragment of an SDD
file containing the description of a Varanus lizard. A (C,CS) description in SDD has two
main blocks: (i) defines the characters involved and their possible states – Figure 3 top;
(ii) describes an Operational Taxonomic Unit (OTU) using the characters defined in (i) –
Figure 3 bottom. OTU is a biology term which refers to a given entity at the sampling level
adopted for the study – e.g., a specimen, a genus etc.
        CategoricalCharacter elements and their States (shown in Figure 3 top) are
primitives to describe an OTU [Hagedorn 2007]. Each CategoricalCharacter has its
Representation – comprising a label and a description as plain texts – and a set of
StateDefinition elements with their possible states. CategoricalCharacter and
StateDefinition elements defined here will be referred to throughout the XML document
by their ids.
        The CodedDescription (Figure 3 bottom) links the OTU being described
to States of each CategoricalCharacter. It has two essential items: (i) the OTU

        Datasets
          Dataset
            CategoricalCharacter id=“c6”
              Representation
                Label: “nostrils' form”
                Detail: “Monitors' nostrils may have different forms...”
              States
                StateDefinition id=“s12”
                  Label: “well round”
                  Detail: “Nostrils look like a quite per...”
                StateDefinition id=“s13”
                  Label: “oval or split-like”
                  Detail: “Nostrils are not perfectly rou...”
            CodedDescription id=“D1”
              Representation
                Label: “V. albiguralis”
                Detail: “White-throated monitor. Distribution: Africa (West...”
              SummaryData
                Categorical ref=“c6”
                  State ref=“s13”

                       Figure 3. Fragment of SDD Schema with Instances¹


being described, where its name and description are listed in natural language under
Representation; (ii) a set of characters and values (Categorical and State elements),
which address the characters defined in the previous block through the ref attribute.
It is possible and usual to define multiple states for a character of a given OTU. A first
integration problem observed here is that each character or OTU described does not have
a globally unique identifier across documents. Therefore, the description can only be
used by the document where it was declared and it is not possible to guarantee the
equivalence of two or more CategoricalCharacter elements declared in different documents.
         In Figure 5 we extend our analysis to the Xper2 native format, Nexus and NeXML.
Our study addresses mainly morphological character descriptions. Figure 5 provides
simplified diagrams focusing on the elements used to record descriptions, which are compared
here. Figure 4 presents the symbols adopted in the diagrams. All the formats adopt XML
and the symbols represent the relations among elements and their respective cardinality.
Five types of elements, which are the focus of our analysis, receive special symbols: the
Entity being described, which can be a taxon or a specimen; the Character definition
and its respective association with entities (Character instance); and the State
definition and its respective association with entities (State instance).
         Nexus [Maddison et al. 1997] is an extensively used file format developed for the
storage and exchange of phylogenetic data, including morphological and molecular characters,
taxa distances, genetic codes, phylogenetic trees etc. It was designed in 1987 and it is
still used by many popular software packages, such as Xper2 (http://lis-upmc.snv.jussieu.fr/lis/),
Mesquite (http://mesquiteproject.org/) and MrBayes (http://mrbayes.sourceforge.net/), and by data
repositories, like TreeBASE (http://treebase.org/) and Dryad (http://datadryad.org/). Nexus
gathers together (C,CS) based descriptions and related trees [Vos et al. 2012].

   ¹ Knowledge base of the genus Varanus from http://lis-upmc.snv.jussieu.fr/xper2/infosXper2Bases/liste-bases-recherche.php


        [Figure 4 legend. Element types: structural element; exclusive option. Relationship types: one to one; one to one (zero or one); one to many (zero or more); one to many (one or more). Structural element specializations: Entity; Character; Character Instance; State; State Instance.]

                   Figure 4. Symbols and semantics used in the diagrams


        NeXML (http://www.nexml.org) [Vos et al. 2012] is a standard inspired by
Nexus. It supports and extends Nexus functionalities and addresses some Nexus limitations
– e.g., it connects objects with ontology concepts and supports citations and annotations
[Vos et al. 2012]. In order to accomplish full compatibility and interoperability among
different environments, NeXML defines a formalized XSD grammar and enables semantic
annotations of any element in a NeXML document, which is a step towards a “Minimum
Information About a Phylogenetic Analysis” (MIAPA) standard.
        These comparative diagrams show that even if the structures are arranged dif-
ferently, they address the same key elements. All formats organize data in accordance
with the (C,CS) data model that, in practice, is an entity-attribute-value (EAV) model,
in which entities are OTUs, attributes are characters and values are character-states
[Vos et al. 2012]. Nexus and NeXML formats define a matrix, in which OTUs are listed in
rows, characters are columns and the cells contain a numeric code for a specific character-
state (see Figure 2). Although Xper2 and SDD do not define a matrix, both formats have
a similar structure to describe OTUs with their (C, CS) records.
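
The correspondence between the matrix view and the EAV view can be sketched as follows. The data are the characters, states and codes of Figure 2; the function name `matrix_to_eav` and the dictionary layout are assumptions made for illustration and do not correspond to any of the four formats.

```python
# Character definitions and state codes, as listed in Figure 2.
characters = {
    "nostrils' form": {1: "well round", 2: "oval or split-like"},
    "transversal section of the tail": {1: "laterally compressed", 2: "roundish"},
    "nuchal scales": {1: "same size than head scales", 2: "bigger than head scales"},
}

# Character-by-taxon matrix: one row per OTU, one numeric state code per character.
matrix = {
    "Varanus albiguralis": [2, 1, 2],
    "Varanus brevicauda": [1, 2, 1],
}


def matrix_to_eav(matrix, characters):
    """Expand each matrix cell into an (entity, attribute, value) triple."""
    names = list(characters)  # column order follows the character definitions
    triples = []
    for otu, codes in matrix.items():
        for name, code in zip(names, codes):
            triples.append((otu, name, characters[name][code]))
    return triples


triples = matrix_to_eav(matrix, characters)
for triple in triples:
    print(triple)
```

Each cell of the matrix yields one triple, so the two rows and three columns of Figure 2 produce six (OTU, character, character-state) records.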

4. From XML Structures to Graphs
The next step in our three-layer method is designing a graph model. In a previous
work [Alves and Santanchè 2013], we compared several approaches to capture latent
relations and semantics among tags produced collaboratively. Graph models to represent
and analyze data were a common denominator. The role of the graph is not to reflect all
details of the original model. The central challenge is how to abstract the key elements
among which we expect to discover potential relations. It is a movement from
latent semantics to explicit semantics expressed as links.
         On the one hand, we identified in the previous section the common denominator we
are looking for: OTUs, characters and character states. On the other hand, a second
important ingredient is defining our target in ontologies. As mentioned in Section 2,
a predominant ontology model for phenotype descriptions is the Entity-Quality
(EQ) [Balhoff et al. 2010]. An Entity refers to the “part” of the OTU being described,
which is related to one or more Qualities. In a comparison with the (C, CS) approach,
a Character comprises an Entity plus the Quality involved in the description in a single
textual sentence. A State is a complementary part of the Quality. Even though it is not a
trivial task to split Characters into their components of Entity and Quality, a first step will

        [Figure 5 diagrams. Elements shown in each panel:]
        (a) Nexus: Characters; charLabels; stateLabels; charStateLabel; character-name; character-number; state-name; matrix; row; taxon-name; entry (elements linked by index)
        (b) NeXML: OTUs; OTU; Format; States; State; matrix; row; OTU; cell; char; state
        (c) SDD: CategoricalCharacter; Representation; States; StateDefinition; CodedDescription; Representation; SummaryData; Categorical; State
        (d) Xper2: Variables; Variable; name; mode; Individuals; Individual; name; description_list; description_element

                              Figure 5. Formats for representing phylogenetic data



be linking dispersed elements referring to the same semantic concept.
        Starting from the key elements identified in the previous section, we can devise
the following link discovery challenges:
     • Which OTUs in the graph refer to the same real world OTU (link OTU-OTU)?
     • Which characters can be applied to each OTU (link OTU-character)?
     • Which states for each character can be observed in each OTU (link OTU-
       character-state)? Conversely, which OTUs have a given character+state?
       The answers to these questions will make it possible to integrate, summarize and compare
data concerning each OTU and each character. Therefore, it becomes possible to answer
queries like:

     • What are the possible colors of a Varanus tongue?
     • Which animals present an oval nostrils form?
        The discovery process is carried out by graph transformations. As graphs are crucial
for our modeling approach, our method was built over graph databases. These databases
reduce the gap between how data are modeled (as graphs) and how they are stored. They are
capable of representing data structures with high fidelity. Compared with relational databases,
graph databases do not require explicit join operations, because related records are reached by
traversing the graph from node to node. Graph databases are less schema-dependent and, for this
reason, they can scale more easily in size and complexity as the application evolves.
        The questions stated before were the basis for the model presented in
Figure 6. We adopted the property graph model, in which nodes and relationships can
maintain extra metadata as a set of key/value pairs. Moreover, relationships are typed,
which makes it possible to create multi-relational networks with heterogeneous sets of edges.
Unlike single-relational networks, in which all edges are of the same type, multi-relational
networks are more appropriate to represent complex domain models, due to the variety of
relationship types in the same graph [Rodriguez and Shinavier 2010].
       In our graph model, OTUs and character-states are nodes connected by characters
(edges). Therefore the statement “V. albiguralis has a well round tail shape” becomes V.
albiguralis (node) → tail shape (edge) → well round (node).

                                          Character
                     OTU                                         Character-State
            Type      OTU                                       Type       State
                                       Type      Character
            Label                                               Label
                                       Detail
            Detail                                              Detail


          Figure 6. Property graph model to represent phenotype descriptions.
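
As an illustration only, the model of Figure 6 can be sketched in plain Python, without a graph database. The class and method names are hypothetical, and storing the character name in a `label` property on the edge is an assumption (Figure 6 lists only Type and Detail for the Character edge). The statements are taken from Figure 2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    type: str          # "OTU" or "State", as in Figure 6
    label: str
    detail: str = ""


@dataclass(frozen=True)
class Edge:
    source: Node       # OTU node
    target: Node       # character-state node
    label: str         # character name; keeping it on the edge is an assumption
    type: str = "Character"
    detail: str = ""


@dataclass
class PropertyGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def describe(self, otu, character, state):
        """Add the statement: otu (node) -> character (edge) -> state (node)."""
        source, target = Node("OTU", otu), Node("State", state)
        self.nodes.update((source, target))
        self.edges.append(Edge(source, target, character))

    def states_of(self, otu, character):
        """Which states of a character are observed in a given OTU?"""
        return [e.target.label for e in self.edges
                if e.source.label == otu and e.label == character]

    def otus_with(self, character, state):
        """Which OTUs present a given character and state?"""
        return [e.source.label for e in self.edges
                if e.label == character and e.target.label == state]


graph = PropertyGraph()
graph.describe("Varanus albiguralis", "nostrils' form", "oval or split-like")
graph.describe("Varanus brevicauda", "nostrils' form", "well round")
graph.describe("Varanus albiguralis", "nuchal scales", "bigger than head scales")

print(graph.otus_with("nostrils' form", "oval or split-like"))
print(graph.states_of("Varanus albiguralis", "nuchal scales"))
```

The two helper queries mirror the link discovery questions above: one goes from an OTU and a character to its states, the other from a character and state back to the OTUs that present it.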


5. Practical Experiment of Unifying Phenotypes
We have implemented an automatic process to ingest SDD files into a graph database,
in order to show the linking possibilities raised by our model. In our experiments, we
use Neo4j (http://www.neo4j.org/), an open-source graph database. Our data integration
processing flow is divided into three main stages: preprocessing, data ingestion and data
linkage.
        One of the problems faced in bioinformatics is related to the identification of ob-
jects within and across repositories [Page 2008]. More precisely, an object may refer
to a taxon, gene, anatomical feature, phenotypic description, geographical location etc.
Uniquely identifying those objects is undoubtedly a key point for the success of our pro-
posed solution.
       In order to address this issue, some organizations – e.g., Universal Biological
Indexer and Organizer (uBio), Integrated Taxonomic Information System (ITIS), Cata-
logue of Life (CoL), The International Plant Names Index (IPNI), National Center for
Biotechnology Information (NCBI) etc. – incorporated into their projects the Life Science
Identifiers (LSIDs), which were proposed by the Object Management Group (OMG)

(http://www.omg.org/). An LSID is a persistent, location-independent resource identifier,
whose purpose is to uniquely identify biological resources [Clark et al. 2004]. The
persistence property refers to the fact that LSIDs are unique, can be assigned to only
one object forever and never expire. The location-independence property specifies
that each authority creates LSIDs locally and is responsible for guaranteeing their
uniqueness.
         We applied LSIDs to unify OTUs in the graph that refer to the same real-world
object. In order to find a valid LSID, we adopted the Global Names Resolver (GNR) web
service (http://resolver.globalnames.org/), which executes exact or fuzzy matching against
canonical forms of scientific names in 170 distinct data sources. The canonical form (cf)
is the simplest, most complete and unambiguous form of a name. The canonical form of a
scientific name consists of the genus and species – when applicable – with no authorship,
rank, nomenclatural annotation or subgenus.
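
The reduction to a canonical form can be illustrated with a small heuristic. The sketch
below is not the procedure used by the GNR service. It only assumes that parenthesised
segments hold a subgenus or authorship, that rank markers come from a short hypothetical
list, and that the first capitalised or punctuated token after the genus starts the
authorship. The function name canonical_form and the input strings are illustrative.

```python
import re

# Hypothetical, incomplete list of rank and annotation markers to discard.
RANK_MARKERS = {"subsp.", "ssp.", "var.", "f.", "cf.", "aff.", "sp.", "spp."}


def canonical_form(name: str) -> str:
    """Reduce a scientific name to genus plus lowercase epithets (heuristic)."""
    # Drop parenthesised segments such as a subgenus or bracketed authorship.
    without_parens = re.sub(r"\([^)]*\)", " ", name)
    tokens = without_parens.split()
    if not tokens:
        return ""
    kept = [tokens[0].capitalize()]  # the first token is taken as the genus
    for token in tokens[1:]:
        if token.lower() in RANK_MARKERS:
            continue  # skip rank markers and annotations
        if re.fullmatch(r"[a-z-]+", token):
            kept.append(token)  # lowercase epithet
        else:
            break  # anything else is treated as the start of the authorship
    return " ".join(kept)


# Illustrative inputs; the authorship strings only exercise the heuristic.
print(canonical_form("Varanus (Odatria) timorensis Gray, 1831"))
print(canonical_form("Varanus gouldii (Gray, 1838)"))
```

Note that this heuristic also keeps infraspecific epithets when they are written in
lowercase, which is a simplification of the definition given above.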
        Our system used three of the six types of matching offered by the GNR resolver:
(i) exact matching; (ii) exact matching of canonical forms – this process reduces a given
name to its canonical form and checks it with an exact match; (iii) fuzzy matching of
canonical forms – this uses a modified version of the TaxaMatch algorithm [Rees 2008] and
is intended to work around misspellings. It fuzzy-matches the canonical form of a given
name – even with mistakes – against spellings considered correct. The GNR resolver
reports the matching quality (“confidence score”) for each match.
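
The cascade from exact to fuzzy matching can be sketched as follows. This is not the
TaxaMatch algorithm and does not call the GNR service: as a stand-in, it uses the
similarity ratio of Python's standard difflib module. The function name resolve, the
threshold of 0.85 and the reference list are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def resolve(name: str, reference: Iterable[str],
            threshold: float = 0.85) -> Optional[Tuple[str, str, float]]:
    """Return (matched name, match type, score), trying exact before fuzzy matching."""
    candidates = list(reference)
    if name in candidates:
        return (name, "exact", 1.0)
    best_name, best_score = None, 0.0
    for candidate in candidates:
        # Similarity in [0, 1]; a stand-in for a taxon-specific fuzzy matcher.
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_name is not None and best_score >= threshold:
        return (best_name, "fuzzy", best_score)
    return None  # no candidate is similar enough


names = ["Varanus timorensis", "Varanus gouldii", "Varanus scalaris"]
print(resolve("Varanus gouldii", names))      # exact match
print(resolve("Varanus timorenssis", names))  # misspelling resolved by fuzzy matching
```

The score returned for a fuzzy match plays the role of the confidence score mentioned
above: a caller can reject matches whose score is too low.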
        The matching module of the system is still a work in progress, but we have already
obtained some relevant results that show the viability of our approach. From the
LIS knowledge base we collected 7 distinct morphological descriptions: genus Varanus;
species Varanus gouldii, Varanus timorensis, Varanus auffenbergi and Varanus scalaris;
species groups Varanus indicus, Varanus prasinus, Varanus salvator; and Australian
spiny-tailed monitor lizards. Through Xper2 those morphological descriptions were exported
to the SDD format and imported into the graph database, with no preprocessing. Figure 7(a)
shows an overview of the resulting graph without labels. We can note the disconnectedness
of the graph, which falls into seven separate parts (7-partite graph). On the other hand,
Figure 7(b) shows the same knowledge after employing the LSID unification. The graph
became connected. Before applying the LSID unification the graph had 74 distinct taxonomic
units (TUs). After performing the LSID unification the total dropped to 44 TUs, i.e., 30
taxonomic units (about 40%) were recurring and were each merged into a single shared node.
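
The unification step can be pictured as grouping nodes by their resolved identifier. The
sketch below works on plain Python structures rather than on the graph database. The
function name unify_by_lsid, the identifiers and the assignment of taxa to files are
hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def unify_by_lsid(nodes: Iterable[Tuple[str, str]],
                  lsid_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Merge (source file, name) nodes that resolve to the same identifier."""
    merged = defaultdict(set)
    for source_file, name in nodes:
        # Unresolved names keep a file-specific key, so they are never merged.
        key = lsid_of.get(name, f"unresolved:{source_file}:{name}")
        merged[key].add(source_file)
    return {key: sorted(files) for key, files in merged.items()}


# Hypothetical nodes and identifiers.
nodes = [
    ("Varanus.sdd", "Varanus prasinus"),
    ("Prasinus.sdd", "Varanus prasinus"),
    ("Prasinus.sdd", "Varanus beccarii"),
    ("Varanus.sdd", "Varanus komodoensis"),
]
lsid_of = {
    "Varanus prasinus": "urn:lsid:example.org:names:1",
    "Varanus beccarii": "urn:lsid:example.org:names:2",
    "Varanus komodoensis": "urn:lsid:example.org:names:3",
}
unified = unify_by_lsid(nodes, lsid_of)
print(len(nodes), "nodes before,", len(unified), "after unification")

# Reduction reported in the experiment: 74 TUs before, 44 after.
reduction = (74 - 44) / 74 * 100
print(f"{reduction:.1f}% of the taxonomic units were recurring")
```

A node that appears in more than one description file becomes the bridge that connects
the previously separate subgraphs.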
        The next step is to link equivalent characters of the same OTU, enabling the
integration of states of the same character. In the present stage of this research we apply
a simple matching algorithm. One example of our preliminary results is presented in the
diagram of Figure 8. As can be seen, our algorithm was able to unify all “nuchal scales”
characters by assigning the same type to their edges. Moreover, we unified and gathered
the possible states observed for this character across different description files.
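
A minimal sketch of such a matching step is given below. It assumes that two character
labels are equivalent when they are equal after lowercasing and collapsing whitespace,
which is a simplification chosen for illustration and not necessarily the rule applied by
our algorithm. The function names and the OTU placeholders are hypothetical; the state
labels are taken from Figure 8.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalise(label: str) -> str:
    """Lowercase a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def unify_characters(
    observations: Iterable[Tuple[str, str, str, str]]
) -> Dict[str, List[str]]:
    """Group states under one edge type per normalised character label.

    Each observation is (source file, OTU, character label, state).
    """
    states_by_character = defaultdict(set)
    for _source_file, _otu, character, state in observations:
        states_by_character[normalise(character)].add(state)
    return {char: sorted(states) for char, states in states_by_character.items()}


# The pairing of OTUs and states below is illustrative only.
observations = [
    ("Prasinus.sdd", "OTU A", "Nuchal scales", "smooth, unkeeled"),
    ("Prasinus.sdd", "OTU B", "nuchal  scales", "strongly keeled"),
    ("Varanus.sdd", "OTU C", "nuchal scales", "bigger than head scales"),
]
print(unify_characters(observations))
```

All three observations end up under the single edge type “nuchal scales”, with the states
from both description files gathered in one place.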

                (a) Graph 7-partite                                       (b) Connected graph

                               Figure 7. Varanus knowledge base

[Diagram: the OTU nodes Varanus bogerti, Varanus beccarii, Varanus prasinus and Varanus
komodoensis are linked by edges of type “nuchal scales” to the State nodes “smooth,
unkeeled”, “granular to slightly keeled”, “strongly keeled”, “triangular keeled,
hull-shaped”, “same size than head scales” and “bigger than head scales”; the states
come from the description files Prasinus.sdd and Varanus.sdd.]

                                      Figure 8. Graph Diagram


6. Conclusion
Several initiatives propose to relate phenotype descriptions with ontologies to enable
semantic integration. The challenge is how to expand and revise the ontology as new
descriptions are created. Tools that annotate descriptions with ontologies treat ontologies
as external artifacts crafted apart, disregarding the synergy between building an ontology
and using it. [Shirky 2005] emphasizes the importance of the semantics organically
built by a community, where a binary categorization approach – in which a concept A “is”
or “is not” part of a category B – gives way to a probabilistic approach – in which a
percentage of people relates A to B. This work contributes in this direction. Inspired by
previous work, which explores latent semantics in folksonomies, this work analyzes
standards to describe phenotypes to find a common denominator, which is the basis to link
descriptions.
       The main contribution of this work is to create the basis to exploit the latent
semantics in the descriptions. The viability and the potential of our approach were tested
by experiments. These experiments are the first steps towards exploiting latent semantics
in a broader scenario. Moreover, the capability of integrating knowledge around taxonomic
units will enable, for instance, evolutionary biologists to generate new research
questions, gain predictive insight or confront evolutionary hypotheses. More complete
answers might be provided as new data sources are integrated.
        Our representation in a graph database is aligned with the RDF
[Manola and Miller 2004] graph-based representation, which will be the next step to
achieve the third layer. The challenge will be to map labels of characters and character
states to RDF properties and values. The unification of characters and states, as shown in
this preliminary work, is a first and highly relevant step for this mapping. Since several
ontologies related to phenotype descriptions are in OWL, the relations discovered in
our graph can support a better matching of labels and concepts in OWL ontologies by
comparing relations. For example, to enhance the match of a character label (in the
graph database) with an OWL property, it is possible to consider the states allowed by
the character, comparing them with the property range (values allowed by the property).
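
One possible way to compare the states of a character with the range of a property is a
set-overlap score. The sketch below uses the Jaccard index, which is an assumption on our
part: the text above does not fix a particular measure. The function names, the property
names and the value sets are hypothetical, and the ranges are given as plain sets rather
than being read from an OWL ontology.

```python
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over the size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_properties(states: Iterable[str],
                    ranges: Dict[str, Iterable[str]]) -> List[Tuple[str, float]]:
    """Order candidate properties by how well their range overlaps the states."""
    observed = {state.lower() for state in states}
    scored = [
        (prop, jaccard(observed, {value.lower() for value in values}))
        for prop, values in ranges.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical character states and property ranges.
character_states = ["smooth", "keeled"]
property_ranges = {
    "ex:bodyColour": ["green", "black"],
    "ex:scaleTexture": ["smooth", "keeled", "granular"],
}
for prop, score in rank_properties(character_states, property_ranges):
    print(prop, round(score, 2))
```

A high overlap does not prove that the character and the property are equivalent, but it
can be combined with label similarity to rank candidate matches.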
        There are several possible ways to extend this work. One is to incorporate
morphological descriptions stored in other knowledge bases, e.g., MorphoBank
(http://morphobank.org/) or Dryad (http://datadryad.org/). Another direction is to
investigate correlations between State nodes and ontology terms.

Acknowledgment
Work partially financed by CNPq (138197/2011-3), the Microsoft Research FAPESP
Virtual Institute (NavScales project), CNPq (MuZOO Project and PRONEX-FAPESP),
INCT in Web Science (CNPq 557.128/2009-9) and CAPES, as well as individual grants
from CNPq.

References
Alves, H. and Santanchè, A. (2013). Folksonomized Ontology and the 3E Steps Technique
  to Support Ontology Evolvement. Journal of Web Semantics, 18(1):19–30.
Balhoff, J. P., Dahdul, W. M., Kothari, C. R., Lapp, H., Lundberg, J. G., Mabee, P.,
  Midford, P. E., Westerfield, M., and Vision, T. J. (2010). Phenex: Ontological annotation
  of phenotypic diversity. PLoS ONE, 5(5):e10500.
Bell, G., Hey, T., and Szalay, A. (2009). Beyond the data deluge. Science,
  323(5919):1297–1298.
Ciccarelli, F. D., Doerks, T., Von Mering, C., Creevey, C. J., Snel, B., and Bork, P.
  (2006). Toward automatic reconstruction of a highly resolved tree of life. Science,
  311(5765):1283–1287.
Clark, T., Martin, S., and Liefeld, T. (2004). Globally distributed object identification for
  biological knowledgebases. Briefings in Bioinformatics, 5(1):59–70.
Dahdul, W. M., Balhoff, J. P., Engeman, J., Grande, T., Hilton, E. J., Kothari, C., Lapp,
  H., Lundberg, J. G., Midford, P. E., Vision, T. J., Westerfield, M., and Mabee, P. M.
  (2010). Evolutionary characters, phenotypes and ontologies: Curating data from the
  systematic biology literature. PLoS ONE, 5(5):e10708.
Gibas, C. and Jambeck, P. (2001). Developing bioinformatics computer skills. O’Reilly
  Media, Inc.

Gkoutos, G., Green, E., Mallon, A.-M., Hancock, J., and Davidson, D. (2004). Using
  ontologies to describe mouse phenotypes. Genome Biology, 6(1):R8.
Hagedorn, G. (2007). Structuring Descriptive Data of Organisms – Requirement Analysis
  and Information Models. PhD thesis, Universität Bayreuth, Fakultät für Biologie,
  Chemie und Geowissenschaften.
Lenzerini, M. (2002). Data integration: A theoretical perspective. In Proceedings of the
  twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
  systems, pages 233–246. ACM.
Mabee, P. M. (2006). Integrating evolution and development: the need for bioinformatics
  in evo-devo. BioScience, 56(4):301–309.
Mabee, P. M., Ashburner, M., Cronk, Q., Gkoutos, G. V., Haendel, M., Segerdell, E.,
  Mungall, C., and Westerfield, M. (2007). Phenotype ontologies: the bridge between
  genomics and evolution. Trends in Ecology & Evolution, 22(7):345–350.
Maddison, D. R., Swofford, D. L., and Maddison, W. P. (1997). Nexus: An extensible file
  format for systematic information. Systematic Biology, 46(4):590–621.
Manola, F. and Miller, E. (2004). RDF Primer – W3C Recommendation. Technical report,
  W3C.
Page, R. (2008). Biodiversity informatics: the challenge of linking data and the role of
  shared identifiers. Briefings in Bioinformatics, 9(5):345–354.
Parr, C. S., Guralnick, R., Cellinese, N., and Page, R. D. (2012). Evolutionary
  informatics: unifying knowledge about the diversity of life. Trends in Ecology &
  Evolution, 27(2):94–103.
Pimentel, R. A. and Riggins, R. (1987). The nature of cladistic data. Cladistics,
  3(3):201–209.
Quan, D. (2007). Improving life sciences information retrieval using semantic web
  technology. Briefings in Bioinformatics, 8(3):172–182.
Rees, T. (2008). Taxamatch, a “fuzzy” matching algorithm for taxon names, and potential
  applications in taxonomic databases. In Weitzman, A. and Belbin, L., editors,
  Provisional Abstracts of the 2008 Annual Conference of the Taxonomic Databases Working
  Group, Fremantle, Australia. Biodiversity Information Standards (TDWG) and the
  Missouri Botanical Garden.
Rodriguez, M. A. and Shinavier, J. (2010). Exposing multi-relational networks to
  single-relational network analysis algorithms. Journal of Informetrics, 4(1):29–41.
Ung, V., Causse, F., and Vignes Lebbe, R. (2010a). Xper2: managing descriptive data
  from their collection to e-monographs.
Ung, V., Dubus, G., Zaragüeta-Bagils, R., and Vignes-Lebbe, R. (2010b). Xper2:
  introducing e-taxonomy. Bioinformatics, 26(5):703–704.
Vos, R. A., Balhoff, J. P., Caravas, J. A., Holder, M. T., Lapp, H., Maddison, W. P.,
  Midford, P. E., Priyam, A., Sukumaran, J., Xia, X., et al. (2012). NeXML: rich,
  extensible, and verifiable representation of comparative data and metadata. Systematic
  Biology, 61(4):675–689.

