<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building conceptual spaces for exploring and linking biomedical resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>R. Berlanga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E. Jiménez-Ruiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V. Nebot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Lenguajes y Sistemas Informáticos, Universitat Jaume I</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The establishment of links between data (e.g., patient records) and Web resources (e.g., literature), and the proper visualization of the knowledge so discovered, are still a challenge in most Life Science domains (e.g., biomedicine). In this paper we present our contribution to the community in the form of an infrastructure to annotate information resources, to discover relationships among them, and to represent and visualize the newly discovered knowledge. Furthermore, we have implemented a Web-based prototype tool that integrates the proposed infrastructure.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The ever-increasing volume of web resources, together with the data generated by
automated applications, is challenging current approaches to biomedical
information processing and analysis. One current trend is to build semantic spaces
onto which corporate data and knowledge resources can be mapped in order to ease
their exploration and integration. Semantic spaces are usually defined in terms of
widely accepted knowledge resources (e.g., thesauri and domain ontologies), and
they are populated by applying (semi)automatic semantic annotation processes.</p>
      <p>Apart from these semantic spaces, it is also crucial to propose new
summarization tools that help both users and machines to better analyze and extract
knowledge from these spaces. On-Line Analytical Processing (OLAP) techniques
have been used very successfully to analyze summarized data from different
perspectives (dimensions) and detail levels (categories). However, OLAP cannot be
directly applied to the aforementioned semantic spaces for two reasons: first,
data resources and knowledge are highly heterogeneous and dynamic; second,
semantic annotations are based on graph structures, which makes their
translation to OLAP multidimensional spaces difficult. Despite these limitations, OLAP
operators could be very useful, as they provide an intuitive and interactive way
to explore multidimensional spaces.</p>
      <p>In this paper we propose a new visual paradigm, called 3D conceptual maps,
which allows users to explore and analyze interesting associations derived from
data and web resources that have previously been annotated with a reference
domain ontology. Conceptual maps can be dynamically built according to the
users' analysis requirements, and they provide interactivity through operators
similar to traditional OLAP operators (e.g., drill-down and roll-up). The main
novelty of the new operators is that they are semantics-aware, that is, they take
into account the semantics of the domain ontologies to summarize the data that
are visualized in the conceptual maps. We also present a web-based prototype
tool called 3D knowledge browser (3DKB), which integrates this visual
paradigm and its operations.</p>
      <p>
        As far as we know, there are no similar tools in the literature which allow
summarizing and exploring discovered concepts and relationships from
different biomedical sources (not only literature). Previous work exists on discovering
biomedical relationships from semantic annotations, for example [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] to
mention a few, but they are limited to presenting results as tabular data, and the target
collection is always PubMed abstracts. In contrast, our proposal aims to deal
with multiple sources (e.g., PubMed abstracts, patient records, public databases,
and so on), and it provides mechanisms to explore the discovered relationships
through the reference ontologies.
      </p>
      <p>The paper is organized as follows. In Section 2 we introduce the motivating
scenario. Then, Sections 3 and 4 present our prototype and its use through
two use cases. Section 5 is devoted to the methodological aspects. First, we
describe the normalization formalism to represent both the knowledge resources
and the target collections. Then, we introduce the main operators required over
the normalized representation to provide interactivity with the conceptual maps.
Finally, we give some conclusions and future work.
</p>
      <p>2 Motivating Scenario</p>
      <p>
The need to semantically integrate different biomedical sources arose in the
context of the European Health-e-Child (HeC) [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] integrated project. HeC
aimed to develop an integrated health care platform to allow European
paediatricians to access, analyse, evaluate, enhance and exchange integrated biomedical
information focused on three paediatric diseases: (1) heart disorders, (2)
inflammatory disorders and (3) brain tumours. The biomedical information sources
covered six distinct levels of granularity (also referred to as vertical levels),
classified as molecular (e.g., genomic and proteomic data), cellular (e.g., results of
blood tests), tissue (e.g., synovial fluid tests), organ (e.g., affected joints, heart
description), individual (e.g., examinations, treatments), and population (e.g.,
epidemiological studies).
      </p>
      <p>The 3DKB tool is mainly aimed at providing an integrated and interactive
way to browse biomedical concepts as well as to access external information
(e.g., PubMed abstracts) and HeC patient data related to those concepts. The
3DKB is intended to facilitate the integration by providing the clinician with a
predefined subset of semantically annotated web objects that are relevant to her
domain. These objects are thus implicitly linked to clinician and patient data,
which are also semantically annotated with the same knowledge resource.</p>
      <p>
        In our current implementation, we selected the Unified Medical Language
System Metathesaurus (UMLS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as the knowledge resource with which
semantic annotations are generated. UMLS represents the main effort for the
creation of a multipurpose reference thesaurus. UMLS contains concepts from more
than one hundred terminologies, classifications, and thesauri; e.g. FMA, MeSH,
SNOMED CT or ICD. UMLS includes two million terms and more than three
million term names, a hypernymy classification with more than one million
relationships, and around forty million relationships of other kinds.
      </p>
      <p>3 Prototype Implementation</p>
      <p>
The current prototype has been developed using AJAX (Asynchronous JavaScript
and XML) technologies. Figure 1 shows an overall view of the 3DKB tool [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for
the juvenile idiopathic arthritis (JIA) domain. It consists of three main parts: 1) the configuration of
the 3D Conceptual Map (from now on 3D-Map), which contains the selected
vertical levels (i.e., HeC levels) and an optional free text query to evaluate against
the visualized concepts, 2) the 3D-Map itself, which contains the biomedical
concepts stratified in vertical levels according to the previous configuration, and
3) a series of tabs that contain a ranked list of objects associated to a selected
concept from the 3D-Map. In the latter, each tab represents a different type of
object (e.g., PubMed abstract, Swissprot protein and HeC patient data). There
is a special tab entitled “Tree” which contains all the possible levels that can be
selected to configure and build the 3D-Map. The levels are based on the UMLS
semantic types [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], which are grouped within the corresponding HeC levels as
in [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The layers of the 3D-Map can be defined by selecting levels of the
“Tree” tab and also through a keyword-based query. In the second case, only
the most specific concepts whose lexical forms match the query are visualized.
      </p>
      <p>The visual paradigm of 3D-Maps relies on the vertical integration vision
proposed in HeC. That is, all the involved knowledge, data and information
are organized into different disjoint conceptual levels (i.e., vertical levels), each
one representing a different perspective of the biomedical research. In this way,
the 3DKB presents a stratified view of the information based on vertical levels
(see Individual.Disease and Organ boxes in 3D-Map of Figure 1). Within each
level, biomedical concepts deemed relevant for both the clinician domain (e.g.,
rheumatology, cardiology and oncology) and the clinician information requests
are shown as balls in the 3D-Map. Relevance of concepts is defined in terms of
the collection frequency (e.g., PubMed abstracts), and it is represented in the
3D-Map through the ball size. Regarding the color of the ball, normal concepts
are displayed in blue, expanded concepts in red and concepts containing query
entities in green.</p>
      <p>Semantic bridges are another important visual element of the 3D-Map. They
are defined as links between concepts of two different vertical levels and are
represented as 3D lines in the 3D-Map. Semantic bridges can represent either
co-occurrences of concepts in the target collection or well-known relationships
between concepts stated in some domain ontology (e.g., UMLS). Semantic bridges
can help clinicians to select the context in which the required information must
hold. For example, from the 3D-Map in Figure 1 we can retrieve documents or
patient IDs about arthritis related to limb joints by clicking an existing bridge
between the concepts Arthritis and Limb Joints. Finally, each semantic bridge
also has an associated relevance index, which depends on the correlation measure
chosen for its definition (e.g., count, log ratio, or odds ratio).</p>
      <p>Another interesting feature of 3D-Maps is the ability to browse through the
taxonomical hierarchies of the biomedical concepts (e.g., the UMLS hierarchy). In
the example of Figure 2, the user can expand the concepts Operation and
Implantation (the biggest balls in Figure 2(a)). The resulting concepts are coloured red
(Figure 2(b)) and represent more specific concepts such as Catheterisation,
Surgical repair, Intubation, or Cardiovascular Operations.</p>
      <p>In order to manage the elements of the 3D-Map, a series of operations is
provided in the 3D-Map tools panel (see the left-hand side of Figure 1). These
operations are split into two categories: operations to manage the whole 3D-Map
(rotate, zoom and shift) and concept-related operations. The operations to
manage the concept visualization involve (1) the retrieval of the objects associated
with the clicked concept, (2) the expansion of the clicked concept, (3) the removal
of the concepts of a level with the exception of the clicked concept, and (4) the
deletion of the clicked concept.</p>
      <p>4 Use Cases</p>
      <p>
In this section we will show the functionalities of the 3DKB through two use
cases based on some HeC clinician information requests.
</p>
      <p>4.1 Case 1: Exploring the relation between procedures and results
in the Tetralogy of Fallot (ToF) domain</p>
      <p>
In this case, the clinician is interested in knowing the relation between the
different surgical techniques reported in the literature and the findings and results that
are usually correlated with them. For this purpose, the clinician builds the 3D-Map
for the semantic levels Individual.Health Procedures and Individual.Finding. As
a result, the clinician obtains the map presented in Figure 3(a). However, the
clinician is only interested in repair operations, so she refines the query by
specifying the keyword repair in the query input field. The resulting 3D-Map is
shown in Figure 3(b), where relevant concepts are coloured green. Each
relevant concept has at least one sub-concept (possibly the concept itself) matching
the specified query. Now, the clinician can select one of the green-coloured
concepts, for example Repair Fallot Tetralogy, in order to filter the map to just those
concepts that are related to it (see Figure 3(c)). Finally, she finds an interesting
bridge between the selected concept and the finding concept Death. Figure 3(d)
shows the documents that are retrieved by clicking this bridge. Notice that these
abstracts are about death cases related to ToF repair.</p>
      <p>4.2 Case 2: Finding potential proteins that can be related to
different types of a disease within the Brain Tumours (BT)
domain</p>
      <p>
In this use case, the clinician is interested in comparing the proteins related to a
disease and its subtypes. Taking the brain tumour domain, the clinician specifies
the concept query epilepsy without selecting any vertical level. As a result, she
obtains the 3D-Map of Figure 4(a) which contains the concepts attack epileptic,
epilepsy intractable, epilepsy lobe temporal, epilepsy extratemporal and epilepsy
focal.</p>
      <p>To retrieve the proteins related to these diseases, the tab @SwissProt is
selected. For example, Figure 4(b) shows the proteins related to attack epileptic.
The user can then get much more information about these proteins by
clicking the buttons NCBI and KEGG, which jump to the corresponding pages
in the Entrez Gene and KEGG sites, respectively. Note that the relevance of each
protein entry is calculated from the frequency of the concept and its sub-concepts
in the Swissprot DB description of the protein.</p>
      <p>5 Method</p>
      <p>
OLAP (On-line Analytical Processing) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] tools were introduced to ease
information analysis and navigation from large amounts of transactional data.
OLAP systems rely on multidimensional data models, which are based on the
fact/dimension dichotomy. Data are represented as facts (i.e. subject of
analysis), while dimensions contain a hierarchy of levels, which provide different
granularities to aggregate the data. One fact and several dimensions to analyze
it give rise to what is known as the data cube. Common operations include slice
(i.e. performing a selection on one dimension of the given cube, thus resulting
in a sub-cube), dice (i.e. similar to slice but performing a selection on two or
more dimensions), drill-down (i.e. navigating among levels of data ranging from
the most summarized (up) to the most detailed (down)), roll-up (i.e. inverse of
drill-down, that is, climbing up the concept hierarchy) and pivot (i.e. rotate the
data to provide an alternative presentation).
      </p>
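      <p>As a rough illustration of the operations just listed, the following Python sketch applies slice, dice and roll-up to a tiny in-memory fact table. The table, the dimension names and the one-level hierarchy are invented for the example and are not taken from the HeC data; a real OLAP engine would operate over a data cube rather than over a list of records.</p>

```python
from collections import defaultdict

# Hypothetical fact table: two dimensions (disease, year) and one measure (count).
FACTS = [
    {"disease": "arthritis", "year": 2008, "count": 3},
    {"disease": "arthritis", "year": 2009, "count": 5},
    {"disease": "carditis", "year": 2008, "count": 2},
    {"disease": "carditis", "year": 2009, "count": 4},
]

# Hypothetical one-step hierarchy used for roll-up: disease to disease group.
PARENT = {"arthritis": "inflammatory", "carditis": "heart"}


def slice_cube(facts, dimension, value):
    """Slice: keep the facts whose `dimension` equals `value`."""
    return [f for f in facts if f[dimension] == value]


def dice_cube(facts, selections):
    """Dice: keep the facts that satisfy a selection on several dimensions."""
    return [f for f in facts
            if all(f[dim] in allowed for dim, allowed in selections.items())]


def roll_up(facts, dimension, parent):
    """Roll-up: aggregate the measure one level up the hierarchy of `dimension`."""
    totals = defaultdict(int)
    for f in facts:
        totals[parent[f[dimension]]] += f["count"]
    return dict(totals)
```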
      <p>Since multidimensionality provides a friendly, easy-to-understand and
intuitive visualization of data for non-expert end-users, we have borrowed the
previous concepts and operations to apply them to our 3D conceptual maps.</p>
      <p>5.1 Representation of Semantic Spaces</p>
      <p>
In order to achieve a browsable analytical semantic space, it is necessary to
normalize the representation of both the knowledge resource and the target
collection (e.g., patient records, PubMed abstracts, and so on). This normalization
consists of two main steps: (1) arranging existing concepts into a well-structured
multidimensional schema, and (2) representing the object collection under this
schema. The first step must be guided by a series of predefined dimensions, which
roughly represent semantic groups. For example, in the HeC project, dimensions
correspond to vertical levels: population, disease, organ, and so on. The main
issue to be addressed in this step is the irregular structure of the taxonomies
provided by existing knowledge resources. The second step has two main tasks:
(1) semantically annotating the object collection with concepts from the
knowledge resource, and (2) normalizing the annotation set of each object to the
multidimensional schema defined in the previous step. The following subsections
describe this process in detail.</p>
      <p>Semantic Annotation. In recent years, we have witnessed great
interest in massively annotating the biomedical scientific literature. Most
current annotators rely on well-known lexical/ontological resources such as MeSH,
Uniprot, UMLS and so on. These knowledge resources usually provide both the
lexical variants for each inventory concept and the concept taxonomies. Some
knowledge resources are more formal (e.g., FMA and Galen), providing logic
definitions for concepts from which the taxonomy can be inferred.</p>
      <p>In our work, the knowledge resource used to generate semantic annotations
is called the reference ontology, denoted O. The lexical variants associated with each
ontology concept c are denoted lex(c), which is a list of strings. The
taxonomic relation between two concepts a and b is represented as
<inline-formula><tex-math>a \preceq b</tex-math></inline-formula>, meaning that a is a sub-concept of b. A semantic
annotation of a text fragment T consists of identifying the concepts in O
that are most likely to represent the meaning of T.</p>
      <p>
        Most semantic annotation systems are dictionary look-up approaches, that
is, they rely on the lexicon provided by the ontology in order to map text words
to concept lexical variants. Some popular annotation systems in the biomedical
domain are Whatizit [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and MetaMap [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Current research on semantic
annotation focuses on scalability issues and on the definition of gold [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and silver
standards [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to evaluate the quality of these systems. In these standards, an
XML format, IeXML, has been proposed to represent the generated semantic
annotations. An example of this format, generated with our annotation
system, is shown in Figure 5.
&lt;e id="UMLS:C1709323:T062::1,2"&gt;&lt;w id="1"&gt;Open&lt;/w&gt; &lt;w id="2"&gt;label&lt;/w&gt;&lt;/e&gt;
&lt;e id="UMLS:C0282460:T062::1,2,3"&gt;&lt;w id="1"&gt;phase&lt;/w&gt; &lt;w id="2"&gt;II&lt;/w&gt;&lt;w id="3"&gt;trial&lt;/w&gt;&lt;/e&gt;
of &lt;e id="UMLS:C0205171:T081"&gt;single&lt;/e&gt;, &lt;e id="UMLS:C0205385:T080"&gt;ascending&lt;/e&gt;
&lt;e id="UMLS:C0439568:T079"&gt;doses&lt;/e&gt; of MRA in
&lt;e id="UMLS:C0007457:T098|UMLS:C0043157:T098"&gt;Caucasian&lt;/e&gt;&lt;e id="UMLS:C0008059:T100"&gt;children&lt;/e&gt;
with &lt;e id="UMLS:C0205082:T080"&gt;severe&lt;/e&gt;
&lt;e id="UMLS:C1384600:T047::1,2,3,4|UMLS:C0682057:T100::2"&gt;&lt;w id="1"&gt;systemic&lt;/w&gt;
&lt;w id="2"&gt;juvenile&lt;/w&gt;&lt;w id="3"&gt;idiopathic&lt;/w&gt;&lt;w id="4"&gt;arthritis&lt;/w&gt;&lt;/e&gt;: proof of principle of
the &lt;e id="UMLS:C1707887:T062"&gt;efficacy&lt;/e&gt; of
&lt;e id="UMLS:C0063717:T116,T129,T192::1,2"&gt;&lt;w id="1"&gt;IL-6&lt;/w&gt; &lt;w id="2"&gt;receptor&lt;/w&gt; &lt;/e&gt;
&lt;e id="UMLS:C0332206:T169"&gt;blockade&lt;/e&gt; in this
&lt;e id="UMLS:C0332307:T080|UMLS:C0455704:T170"&gt;type&lt;/e&gt; of arthritis and demonstration of
&lt;e id="UMLS:C0439590:T079"&gt;prolonged&lt;/e&gt; &lt;e id="UMLS:C0205210:T080"&gt;clinical&lt;/e&gt; improvement.&lt;/s&gt;
      </p>
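      <p>The id attributes in the example above pack several fields into one string. A minimal Python sketch of how such a value could be split is given below. The field layout (source, concept identifier, semantic types, then token positions after a double colon, with a vertical bar separating alternative readings) is inferred from the example itself rather than from the IeXML specification, and the function name is hypothetical.</p>

```python
def parse_annotation_id(id_value):
    """Split an id value such as 'UMLS:C0063717:T116,T129,T192::1,2' into fields.

    Assumed layout, read off the example: source:concept:types[::token positions],
    with '|' separating alternative readings of the same text span.
    """
    readings = []
    for alternative in id_value.split("|"):
        head, _, positions = alternative.partition("::")
        source, concept, types = head.split(":")
        readings.append({
            "source": source,
            "concept": concept,
            "semantic_types": types.split(","),
            # Token positions are optional; single-word annotations omit them.
            "tokens": [int(p) for p in positions.split(",")] if positions else [],
        })
    return readings
```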
      <p>
        One of the main drawbacks of current semantic annotation systems is that
they usually focus on very specific entity types like proteins and diseases. In
our work, we aim to generate semantic annotations of any entity type involved
in the biomedical research. For this reason, we have chosen the UMLS-Meta as
knowledge resource, which provides more than 100 entity types (semantic types).
However, just a few annotation systems are able to manage the huge amount of
lexical information provided by UMLS-Meta, and they are too slow to deal with
large text collections. As a consequence we developed a novel annotation system,
called Concept Retrieval, which is based on information retrieval techniques to
efficiently perform the text annotation [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This annotation system was tested
in the CALBC competition over a collection of 150,000 PubMed abstracts about
immunology [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Knowledge normalization. In order to build semantic spaces for analyzing
document collections, the reference ontology O associated with the knowledge
resource is normalized as follows:
– First, a set of dimensions <inline-formula><tex-math>(D_1, \cdots, D_n)</tex-math></inline-formula> is defined, which represents a partition
of the concepts in the domain ontology. Each dimension <inline-formula><tex-math>D_i</tex-math></inline-formula> represents a
different semantic space (e.g., semantic types or vertical levels) and cannot
share any sub-concept with the other dimensions.
– Each dimension <inline-formula><tex-math>D_i</tex-math></inline-formula> can define a set of categories or levels <inline-formula><tex-math>L_i^j</tex-math></inline-formula>, which in
turn form a partition over <inline-formula><tex-math>D_i</tex-math></inline-formula>, subject to the following constraints: (1) there cannot
be two concepts c and d in <inline-formula><tex-math>L_i^j</tex-math></inline-formula> such that either <inline-formula><tex-math>c \preceq d</tex-math></inline-formula> or <inline-formula><tex-math>d \preceq c</tex-math></inline-formula>, and (2) all
the concepts in <inline-formula><tex-math>L_i^j</tex-math></inline-formula> have a common super-concept that belongs to <inline-formula><tex-math>D_i</tex-math></inline-formula>.
– Every concept of the ontology is encoded under the labeling scheme presented
in [18]. Thus, each concept c ∈ O is represented with the following descriptor:
<disp-formula><tex-math>\langle c,\ \mathit{pre\_index},\ \mathit{anc\_index},\ \mathit{desc\_intervals},\ \mathit{anc\_intervals},\ \mathit{topo\_order} \rangle</tex-math></disp-formula>
where pre_index is the pre-order index in the spanning tree of O, desc_intervals
is the list of index intervals of the descendants of c (i.e., <inline-formula><tex-math>\{c' \mid c' \preceq c\}</tex-math></inline-formula>),
anc_index is the pre-order index in the reversed spanning tree, and anc_intervals
is the list of index intervals of the ancestors of c. Finally, topo_order is the
topological order of the concept in the spanning tree of O. More specifically,
this descriptor represents two labeling schemes, namely L− for descendants
and L+ for ancestors. Under these labeling schemes, queries over the
taxonomical relationships are efficiently computed with a specific interval algebra
[18].</p>
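      <p>The idea behind interval labeling can be sketched in Python for the simplest case, a taxonomy that is a tree, where a single interval per concept suffices. This is only an illustration under that assumption: the scheme of [18] works over the spanning tree of a general ontology and keeps lists of intervals for both descendants and ancestors. The toy taxonomy and the function names are hypothetical.</p>

```python
def label_tree(children, root):
    """Assign each concept its pre-order index and one descendant interval.

    Returns {concept: (pre_index, last_descendant_index)}. A concept's
    descendants are exactly the concepts whose pre_index lies in its interval.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        labels[node] = (pre, counter - 1)

    visit(root)
    return labels


def is_subconcept(labels, a, b):
    """True if a is b itself or a descendant of b (interval containment test)."""
    low, high = labels[b]
    return high >= labels[a][0] >= low


# Hypothetical three-level taxonomy.
CHILDREN = {"Disease": ["Arthritis", "Carditis"], "Arthritis": ["Juvenile arthritis"]}
LABELS = label_tree(CHILDREN, "Disease")
```

      <p>With such labels, a containment query reduces to two integer comparisons instead of a traversal of the taxonomy.</p>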
      <p>One interesting application of the labeling scheme L+ is the efficient
construction of ontology fragments tailored to an input set of concepts, called signature.
In this way, we can automatically build each dimension Di with the ontology
fragment obtained with the signature formed by all the concepts identified in
the collection (through semantic annotation) and that belong to some semantic
group representing the dimension (e.g. disease, protein, and so on). To obtain
the categories of a dimension Di, we take into consideration the taxonomic
relationships in the fragment and the previous restrictions over dimensions and
their categories.</p>
      <p>Data and resource normalization. After semantic annotation, each document
of the target collection Col has an associated list of concepts from the reference
ontology O. However, these annotation sets are not suited for multidimensional
analysis, and therefore a normalization process similar to that applied to the
ontology must be performed. The main goal of object normalization is to
represent the semantic annotations within the normalized multidimensional space.
Thus, each document d ∈ Col is represented as the multidimensional fact:
<disp-formula><tex-math>fact(d) = (D_1 = c_1, \cdots, D_n = c_n)</tex-math></disp-formula>
where <inline-formula><tex-math>c_i</tex-math></inline-formula> (<inline-formula><tex-math>1 \le i \le n</tex-math></inline-formula>) is either a concept from the dimension <inline-formula><tex-math>D_i</tex-math></inline-formula> or the null
value. Remember that concepts are represented under the labeling scheme L−,
and consequently they are expressed through their pre_index numbers.</p>
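      <p>As a semantic annotator can tag more than one concept of the same
dimension, the normalization process consists in selecting the most relevant concepts
for each dimension. For this purpose, for each document d we first build a
concept affinity matrix <inline-formula><tex-math>M^d</tex-math></inline-formula> of size <inline-formula><tex-math>N_c \times N_c</tex-math></inline-formula>, where <inline-formula><tex-math>N_c</tex-math></inline-formula> is the number of distinct
concepts present in the annotations of d. This matrix is initialized as follows:
– <inline-formula><tex-math>M^d_{ij} = M^d_{ji} = 1</tex-math></inline-formula>, if <inline-formula><tex-math>c_i</tex-math></inline-formula> and <inline-formula><tex-math>c_j</tex-math></inline-formula> co-occur in the same sentence of the document d,
– <inline-formula><tex-math>M^d_{ij} = 0.5</tex-math></inline-formula> and <inline-formula><tex-math>M^d_{ji} = 1</tex-math></inline-formula>, if <inline-formula><tex-math>c_i \preceq c_j</tex-math></inline-formula> in the reference ontology O,
– <inline-formula><tex-math>M^d_{ii} = 1</tex-math></inline-formula>,
– otherwise <inline-formula><tex-math>M^d_{ij} = 0</tex-math></inline-formula>, with <inline-formula><tex-math>i \neq j</tex-math></inline-formula>.</p>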
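      <p>The initialization rules above can be sketched in Python as follows. The text does not say which rule wins when two concepts both co-occur in a sentence and are taxonomically related; this sketch assumes that the co-occurrence rule is applied last. The function name, the argument layout and the example data are hypothetical.</p>

```python
def affinity_matrix(concepts, sentences, is_subconcept):
    """Build the affinity matrix of one document as a list of rows.

    concepts: distinct concepts annotated in the document;
    sentences: one set of concepts per sentence;
    is_subconcept(a, b): True if a is a sub-concept of b in the ontology.
    """
    index = {c: k for k, c in enumerate(concepts)}
    n = len(concepts)
    m = [[0.0] * n for _ in range(n)]
    for k in range(n):
        m[k][k] = 1.0                                # diagonal entries
    for ci in concepts:
        for cj in concepts:
            if ci != cj and is_subconcept(ci, cj):
                m[index[ci]][index[cj]] = 0.5        # from sub-concept to super-concept
                m[index[cj]][index[ci]] = 1.0        # from super-concept to sub-concept
    for sentence in sentences:
        present = [c for c in concepts if c in sentence]
        for ci in present:
            for cj in present:
                m[index[ci]][index[cj]] = 1.0        # co-occurrence (assumed to win)
    return m


# Hypothetical document with three concepts and one sentence.
CONCEPTS = ["arthritis", "jia", "il6"]
M = affinity_matrix(CONCEPTS, [{"jia", "il6"}],
                    lambda a, b: (a, b) == ("jia", "arthritis"))
```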
      <p>The affinity matrix can be used in several existing graph-based algorithms
that aim to rank the nodes according to their neighbors' contributions. We have
chosen the regularization framework proposed in [19], which can be summarized
with the following formula:</p>
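      <p><disp-formula><label>(1)</label><tex-math>R^d = \left((1 - \alpha) \cdot (I - \alpha S^d)^{-1} \cdot Y^T\right)^T</tex-math></disp-formula></p>
      <p>Here, <inline-formula><tex-math>R^d</tex-math></inline-formula> is the vector representing the rank of concepts. It is obtained by
finding an optimal smoothed function that best fits a given vector Y, which
is achieved by applying the Laplacian operator over the affinity matrix <inline-formula><tex-math>M^d</tex-math></inline-formula> as
follows:</p>
      <p><disp-formula><tex-math>S^d = D^{-1/2} \cdot M^d \cdot D^{-1/2}</tex-math></disp-formula></p>
      <p>In our case, the vector Y consists of the frequencies of each concept in the
document d. The parameter α is directly related to the smoothness of the
approximation function (we set it to α = 0.9).</p>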
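      <p>A minimal Python sketch of this ranking step follows. Instead of inverting the matrix, it iterates R ← αSR + (1 − α)Y, whose fixed point is the closed form of Formula 1 for α below 1. The sketch assumes that D is the diagonal matrix of the row sums of the affinity matrix, which is the usual choice in this kind of regularization framework but is not stated explicitly in the text; the function name and the iteration count are hypothetical.</p>

```python
import math


def rank_concepts(m, y, alpha=0.9, iterations=200):
    """Approximate R = (1 - alpha) * (I - alpha * S)^-1 * Y by fixed-point iteration.

    m: affinity matrix as a list of rows; y: concept frequencies in the document.
    """
    n = len(m)
    degree = [sum(row) for row in m]           # assumed diagonal of D: row sums of M
    # S = D^(-1/2) * M * D^(-1/2), computed entry by entry.
    s = [[m[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    r = list(y)
    for _ in range(iterations):
        r = [alpha * sum(s[i][j] * r[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return r
```

      <p>For the small matrices that arise from a single document the closed form could be evaluated directly with a linear algebra library; the iteration is used here only to keep the sketch free of dependencies.</p>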
      <p>An alternative to this method is to use a centrality-based algorithm over <inline-formula><tex-math>M^d</tex-math></inline-formula>.
Our preliminary experiments over the HeC collections showed that this alternative
obtains ranks very similar to those of the previous method.</p>
      <p>Once the rank <inline-formula><tex-math>R^d</tex-math></inline-formula> is obtained, the normalization process consists in selecting
the top-scored concept of each dimension to represent the fact of d.</p>
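      <p>This selection step can be sketched in Python as below. The concept identifiers are taken from the annotation example of Figure 5, but the rank values and the assignment of concepts to dimensions are invented for illustration, and the function name is hypothetical.</p>

```python
def build_fact(rank, dimension_of, dimensions):
    """Keep the top-ranked concept of each dimension; None stands for the null value."""
    fact = {dim: None for dim in dimensions}
    best = {}
    for concept, score in rank.items():
        dim = dimension_of.get(concept)
        if dim in fact and score > best.get(dim, float("-inf")):
            best[dim] = score
            fact[dim] = concept
    return fact


# Hypothetical rank values for four annotated concepts of one document.
RANK = {"C1384600": 0.4, "C0008059": 0.3, "C0682057": 0.1, "C0063717": 0.2}
DIMENSION_OF = {"C1384600": "Disease", "C0008059": "AgeGroup",
                "C0682057": "AgeGroup", "C0063717": "ImmunologyFactor"}
FACT = build_fact(RANK, DIMENSION_OF,
                  ["Disease", "AgeGroup", "ImmunologyFactor", "Organ"])
```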
      <p>As an example, the multidimensional fact resulting from the document
presented in Figure 5 is as follows:
(ResearchActivity:C1709323, PopulationGroup:C0007457, AgeGroup:C0008059,
Disease:C1384600, ImmunologyFactor:C0063717, ...)</p>
      <p>5.2 Building 3D conceptual maps</p>
      <p>
As mentioned in the introduction, our main aim is to build a browseable
representation of the semantic spaces defined in the previous section. For this purpose,
we define the 3D conceptual map, which is a sequence of different layers that
correspond to different dimensions expressed at some detail level (category). In
this map, concepts are visualized as balls, which are placed within their
corresponding layer with a size proportional to their relevance w.r.t. the target
collection. Concept bridges (or conceptual associations) are visualized as links
between concepts of adjacent layers. 3D maps are built from the normalized
conceptual representation described in the previous section, by using a series of
basic operations, which are described in turn.</p>
      <p>Basic operations. The basic operations that can be defined over a dimension
<inline-formula><tex-math>D_i</tex-math></inline-formula> are the following:
– Layer definition, which establishes the concepts that will be placed at the
layer. This operation can be done either by specifying one dimension category
or through a keyword-based query. In the first case, all the concepts of the
dimension category are visualized, whereas in the second case only the most
specific concepts in <inline-formula><tex-math>D_i</tex-math></inline-formula> whose lexical forms match the query are visualized.
– Concept containment, which returns all the sub-concepts of a selected concept q of
a dimension <inline-formula><tex-math>D_i</tex-math></inline-formula>. Formally,
<disp-formula><tex-math>descendants(D_i, q) = \{c \mid c \in D_i \wedge c \preceq q\}</tex-math></disp-formula>
– Text containment, which checks whether there exists some concept <inline-formula><tex-math>c \preceq q</tex-math></inline-formula>
whose lexicon, lex(c), matches the specified keywords, and returns the matching concepts:
<disp-formula><tex-math>contains(D_i, q, kywds) = \{c \mid c \in D_i \wedge c \preceq q \wedge matches(lex(c), kywds)\}</tex-math></disp-formula>
– Direct sub-concepts, denoted <inline-formula><tex-math>children(D_i, c)</tex-math></inline-formula>, which returns the set of direct
sub-concepts of a concept c. This operation is used to browse the taxonomy
downwards (drill-down operation).</p>
      <p>All these operations are efficiently performed by using the interval algebra
over the L− scheme associated with the ontology concepts.</p>
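      <p>The behaviour of these operations can be illustrated with a small Python sketch over explicit child lists. This is not the interval algebra used in the prototype, only a plain recursive version for a single dimension. The concept names come from the use cases, but the hierarchy among them, the lexicon and the helper names are assumptions made for the example.</p>

```python
# Hypothetical fragment of one dimension (direct sub-concepts of each concept).
CHILDREN = {
    "Operation": ["Surgical repair", "Catheterisation"],
    "Surgical repair": ["Repair Fallot Tetralogy"],
}
# Hypothetical lexicon: one lower-cased variant per concept.
LEX = {c: [c.lower()] for c in
       ["Operation", "Surgical repair", "Catheterisation", "Repair Fallot Tetralogy"]}


def children(c):
    """Direct sub-concepts of c (used for drill-down)."""
    return CHILDREN.get(c, [])


def descendants(q):
    """Concept containment: q together with all of its sub-concepts."""
    found = [q]
    for child in children(q):
        found.extend(descendants(child))
    return found


def contains(q, keywords):
    """Text containment: sub-concepts of q with a variant containing all keywords."""
    return [c for c in descendants(q)
            if any(all(k in variant for k in keywords) for variant in LEX[c])]


def most_specific(concepts):
    """Layer definition by query: drop concepts that have a matching sub-concept."""
    return [c for c in concepts
            if not any(d != c and d in descendants(c) for d in concepts)]
```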
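      <p>Aggregations. Summarization is one of the main purposes of the proposed
analytical tool, as it facilitates the exploration of the collection contents. As in
OLAP-like systems, summarization is performed through well-defined
aggregations over the semantic annotations of the object collections. More specifically,
the following aggregations are performed to visualize summarized information:</p>
      <p>Concept Relevance. The relevance of a concept c is calculated by aggregating
the relevance of its sub-concepts w.r.t. each specific collection. Formally,
<disp-formula><tex-math>Rel_{Col}(c, D_i) = \Gamma_{\forall c' \in descendants(D_i, c)}\ score_{Col}(c')</tex-math></disp-formula>
where Γ is an aggregation function (e.g., sum, avg, and so on) and score is
the function that is evaluated against the collection. The simplest scoring
function is the number of hits, namely:
<disp-formula><tex-math>score_{Col}(c') = hits_{Col}(c') = count(\{d \mid d \in Col,\ fact(d)[D_i] = c'\})</tex-math></disp-formula>
Alternatively, the scoring function can take into account the relevance of each
concept in the documents in which it appears. Thus, we can aggregate the relevance
scores estimated to select concept facts (see Formula 1) as follows:
<disp-formula><tex-math>score_{Col}(c') = \sum_{d \in Col,\ \exists i,\ fact(d)[D_i] = c'} R^d[c']</tex-math></disp-formula></p>
      <p>Concept Associations. Given two dimension levels <inline-formula><tex-math>L_i^n</tex-math></inline-formula> and <inline-formula><tex-math>L_j^m</tex-math></inline-formula>, belonging to
dimensions <inline-formula><tex-math>D_i</tex-math></inline-formula> and <inline-formula><tex-math>D_j</tex-math></inline-formula> (<inline-formula><tex-math>i \neq j</tex-math></inline-formula>) respectively, the following 2D cube stores the
aggregated contingency tables necessary for correlation analysis:
<disp-formula><tex-math>CUBE_{Col}(L_i^n, L_j^m) = \{(c_i, c_j, n_{i,j}, n_i, n_j) \mid c_i \in L_i^n \wedge c_j \in L_j^m\}</tex-math></disp-formula>
Here <inline-formula><tex-math>n_{i,j}</tex-math></inline-formula> measures the number of objects in the collection where <inline-formula><tex-math>c_i</tex-math></inline-formula> and <inline-formula><tex-math>c_j</tex-math></inline-formula>
co-occur, <inline-formula><tex-math>n_i</tex-math></inline-formula> is the number of objects where <inline-formula><tex-math>c_i</tex-math></inline-formula> occurs, and <inline-formula><tex-math>n_j</tex-math></inline-formula> is the number of
objects where <inline-formula><tex-math>c_j</tex-math></inline-formula> occurs. Notice that <inline-formula><tex-math>n_i</tex-math></inline-formula> and <inline-formula><tex-math>n_j</tex-math></inline-formula> are calculated in a similar way
to concept relevance. The contingency table for each pair <inline-formula><tex-math>(c_i, c_j)</tex-math></inline-formula> is calculated
as shown in Table 1, where <inline-formula><tex-math>N_{Col}</tex-math></inline-formula> is the total number of objects in the collection.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th /><th><inline-formula><tex-math>c_i</tex-math></inline-formula></th><th><inline-formula><tex-math>\bar{c}_i</tex-math></inline-formula></th></tr>
          </thead>
          <tbody>
            <tr><td><inline-formula><tex-math>c_j</tex-math></inline-formula></td><td><inline-formula><tex-math>n_{i,j}</tex-math></inline-formula></td><td><inline-formula><tex-math>n_j - n_{i,j}</tex-math></inline-formula></td></tr>
            <tr><td><inline-formula><tex-math>\bar{c}_j</tex-math></inline-formula></td><td><inline-formula><tex-math>n_i - n_{i,j}</tex-math></inline-formula></td><td><inline-formula><tex-math>N_{Col} - n_i - n_j + n_{i,j}</tex-math></inline-formula></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The measures <inline-formula><tex-math>n_{i,j}</tex-math></inline-formula>, <inline-formula><tex-math>n_i</tex-math></inline-formula> and <inline-formula><tex-math>n_j</tex-math></inline-formula> are calculated as follows:
<disp-formula><tex-math>n_{i,j} = |\{d \mid d \in Col \wedge fact(d)[D_i] \preceq c_i \wedge fact(d)[D_j] \preceq c_j\}|</tex-math></disp-formula>
<disp-formula><tex-math>n_i = |\{d \mid d \in Col \wedge fact(d)[D_i] \preceq c_i\}|</tex-math></disp-formula>
<disp-formula><tex-math>n_j = |\{d \mid d \in Col \wedge fact(d)[D_j] \preceq c_j\}|</tex-math></disp-formula></p>
      <p>Semantic Bridges. A semantic bridge is a strong association between
concepts which has good evidence in the target collection. Bridges are
calculated from contingency tables by defining a scoring function <inline-formula><tex-math>\phi(c_i, c_j)</tex-math></inline-formula>. In this
way, bridges will be those concept associations whose score is greater than
a specified threshold δ:
<disp-formula><tex-math>Bridges^{\phi}_{Col}(L_i, L_j) = \{(c_i, c_j, \phi(c_i, c_j)) \mid \phi(c_i, c_j) &gt; \delta\}</tex-math></disp-formula>
As an example, we can use the interest factor as score, that is:
<disp-formula><tex-math>\phi(c_i, c_j) = \frac{n_{i,j} \cdot N_{Col}}{n_i \cdot n_j}</tex-math></disp-formula>
In our current setting, we use a series of well-known interestingness
measures such as log likelihood ratio, mutual information, interest factor and
F1-measure.</p>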
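      <p>The aggregations of this subsection can be sketched in Python over a list of normalized facts. The sketch uses the hit count as scoring function, sum as the aggregation Γ, and the interest factor as bridge score. Facts are plain dictionaries, the sub-concept sets of each level are given explicitly rather than computed by interval algebra, and the example collection (including the finding Fever) is invented; all function names are hypothetical.</p>

```python
def count_facts(facts, dim, concepts):
    """Hits summed over a set of concepts: facts whose value on `dim` is in the set."""
    return sum(1 for f in facts if f.get(dim) in concepts)


def contingency(facts, dim_i, desc_i, dim_j, desc_j):
    """Return (n_ij, n_i, n_j); desc_i and desc_j are the sub-concept sets of c_i, c_j."""
    n_ij = sum(1 for f in facts
               if f.get(dim_i) in desc_i and f.get(dim_j) in desc_j)
    return n_ij, count_facts(facts, dim_i, desc_i), count_facts(facts, dim_j, desc_j)


def interest_factor(n_ij, n_i, n_j, n_total):
    """Interest factor n_ij * N / (n_i * n_j); 0.0 when a marginal count is zero."""
    return n_ij * n_total / (n_i * n_j) if n_i and n_j else 0.0


def bridges(facts, dim_i, level_i, dim_j, level_j, delta):
    """Concept pairs whose score exceeds delta.

    level_i and level_j map each concept of a level to the set of its sub-concepts.
    """
    found = []
    for ci, desc_i in level_i.items():
        for cj, desc_j in level_j.items():
            counts = contingency(facts, dim_i, desc_i, dim_j, desc_j)
            score = interest_factor(*counts, len(facts))
            if score > delta:
                found.append((ci, cj, score))
    return found


# Hypothetical collection of four normalized facts.
FACTS = [
    {"Procedure": "Repair Fallot Tetralogy", "Finding": "Death"},
    {"Procedure": "Repair Fallot Tetralogy", "Finding": "Death"},
    {"Procedure": "Catheterisation", "Finding": "Fever"},
    {"Procedure": "Catheterisation", "Finding": None},
]
LEVEL_I = {"Surgical repair": {"Surgical repair", "Repair Fallot Tetralogy"},
           "Catheterisation": {"Catheterisation"}}
LEVEL_J = {"Death": {"Death"}, "Fever": {"Fever"}}
```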
      <p>Browsing conceptual maps. Two main browsing operations can be performed
in a conceptual map: (1) expanding a concept into its sub-concepts, and (2) going to
a ranked list of objects associated with the clicked map elements (concepts and
bridges). The semantics of these operations corresponds to the well-known
drill-down and drill-through OLAP operations.</p>
      <p>Drill-down: If we expand a concept c in the 3D map, the map must be updated
accordingly. Thus, the concept c is replaced by its children in the
taxonomy of O, the bridges involving c are removed from the map, and new bridges
are calculated for the sub-concepts of c and drawn in the map.</p>
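      <p>The drill-down update just described can be sketched in Python for one layer and one adjacent layer. The data structures (a list of concepts per layer, bridges as triples whose first element belongs to the expanded layer), the scoring callback and the threshold are assumptions made for the example and do not describe the 3DKB implementation.</p>

```python
def drill_down(layer, other_layer, bridges, concept, children, score_bridge, delta):
    """Expand `concept` in `layer` and refresh the bridges towards `other_layer`.

    bridges: list of (ci, cj, score) with ci in `layer` and cj in `other_layer`;
    score_bridge(ci, cj): any association measure; delta: bridge threshold.
    Returns the updated (layer, bridges) pair without modifying the inputs.
    """
    kids = children.get(concept, [])
    if not kids:
        return list(layer), list(bridges)      # leaf concept: nothing to expand
    position = layer.index(concept)
    # The expanded concept is replaced in place by its direct sub-concepts.
    new_layer = layer[:position] + kids + layer[position + 1:]
    # Bridges that involve the expanded concept are removed.
    kept = [b for b in bridges if b[0] != concept]
    # New bridges are computed for the sub-concepts and kept if strong enough.
    for kid in kids:
        for cj in other_layer:
            score = score_bridge(kid, cj)
            if score > delta:
                kept.append((kid, cj, score))
    return new_layer, kept
```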
      <p>Drill-through: If a concept (or a bridge) is selected for drill-through, the
system must retrieve the objects of the target collection relevant to it. The
ranked list of objects is shown in a separate view (e.g., a tab), ordered by
relevance. Note that we can simply use the score calculated to construct facts
(i.e., Rd) for ranking documents w.r.t. concepts, formally:</p>
      <p><disp-formula><tex-math>\mathrm{relevance}(d, c) = R_d[c]</tex-math></disp-formula>
For ranking documents w.r.t. bridges, we multiply the scores of the
concepts involved in the selected bridge:</p>
      <p><disp-formula><tex-math>\mathrm{relevance}(d, (c_i, c_j, \phi)) = \mathrm{relevance}(d, c_i) \cdot \mathrm{relevance}(d, c_j)</tex-math></disp-formula></p>
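      <p>The two relevance formulas translate almost directly into code. The Python sketch below ranks the documents of a collection for a selected concept or bridge. It assumes that each Rd is available as a dictionary from concept to score and that a concept missing from Rd has relevance 0; the function names and the tie-breaking rule are choices made for this example only.</p>
<preformat>
```python
def relevance(r_d, concept):
    """relevance(d, c) = R_d[c]; concepts missing from R_d score 0.0."""
    return r_d.get(concept, 0.0)


def bridge_relevance(r_d, bridge):
    """Product of the relevance of both concepts of a bridge (c_i, c_j, phi)."""
    c_i, c_j, _phi = bridge
    return relevance(r_d, c_i) * relevance(r_d, c_j)


def drill_through(scores, target):
    """Rank document ids by relevance to a concept or to a bridge triple.

    `scores` maps a document id to its R_d dictionary (concept -> score).
    Documents with zero relevance are dropped; ties are broken by id.
    """
    rel = bridge_relevance if isinstance(target, tuple) else relevance
    ranked = [(d, rel(r_d, target)) for d, r_d in scores.items()]
    ranked = [(d, s) for d, s in ranked if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```
</preformat>
      <p>Because the bridge score is a product, a document is listed for a bridge only if it is relevant to both of its concepts.</p>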
      <p>6 Conclusions</p>
      <p>
In this paper we have presented a novel semantics-aware integration and
visualization paradigm that allows users to easily explore and navigate
discovered relations between data and web resources. The contribution is two-fold.
On the one hand, we provide the infrastructure for integrating different information
resources through semantic annotation with domain ontologies. On the other
hand, users can interactively build conceptual maps according to their
requirements and explore them with classical OLAP-style operations such as roll-up
and drill-down. Future work includes refining the created dimension
hierarchies to support more meaningful aggregations and devising a more
efficient calculation of new bridges. Finally, we plan to develop
an on-line service to provide conceptual maps on-demand.</p>
    </sec>
  </body>
  <back>
  </back>
</article>