Conceptual Exploration of Documents and Digital
            Libraries in the Biomedical Domain

               Leyla Jael Garcia Castro1, John Gómez1, Alexander Garcia2

                                       1 biotea.ws Project
                                {leylajael, johncar}@gmail.com
             2 Florida State University, School of Library and Information Science,
                                    Tallahassee, Florida, USA
                                     alexgarciac@gmail.com


       Abstract. In this demo we present our approach to the use of Semantic Web
       technology in scholarly communications; it entails the understanding of re-
       search papers as an interface to the Web of Data. We are using the connectivity
       tissue provided by RDF technologies in order to facilitate semantic retrieval as
       well as to improve the user experience when interacting with biomedical literature.
       Availability: http://biotea.idiginfo.org, http://biotea.idiginfo.org/gene/
       Keywords: Semantic publications, linked data, biomedical visualization.


1   Introduction

As biomedical literature grows exponentially, so do the connections across docu-
ments. Research articles are deeply interconnected to one another, and to resources in
the web –e.g., biomedical databases. Such interconnectedness makes the paper an
ideal interface to the Web of Data (WoD). Connections may be as evident as those
due to bibliographic references; they may also be as complex as those due to similari-
ties in topics, research questions, materials and methods, etc. Making use of such
connectivity tissue requires a semantically processed dataset that delivers
self-descriptive documents fully interoperable with the Web. However, a linked
dataset alone may not be enough; semantic search engines, tools, and mash-ups should also
be provided in order to facilitate access to the data as well as to improve the
end-user experience. RDF4PMC [1] is a semantic dataset for the open-access subset of
PubMed Central (PMC); it comprises and makes available (i) a set of RDF files generated
from that subset and enriched with semantic annotations, (ii) a Web Services API for
querying the RDF dataset, (iii) a SPARQL Protocol
and RDF Query Language (SPARQL) endpoint containing a subset of the RDF files
as a proof of concept, (iv) an article-centric prototype that acts as an interface to the
WoD, and (v) an implemented transformation process from our RDF files to
Bio2RDF (http://bio2rdf.org). The self-descriptive documents delivered by the RDF4PMC
model make it possible to focus on the user experience by building browsers that
“know” the data types presented in the documents –biomedical entities. In this paper
we present a prototype that utilizes the resulting dataset from RDF4PMC; particular-
ly, it enables search and retrieval for human genes. Our browser presents not only
metadata and content for the retrieved articles, but also enriches the user experience
by adding graphical components that include information relevant to the biomedical
entities identified in the articles. Our browser also makes it possible for users to ex-
amine the network of relationships across documents.


2   Semantic browsing of biomedical literature

RDF4PMC orchestrates ontologies such as DoCO (http://purl.org/spar/doco), BIBO
(http://purl.org/ontology/bibo), DC (http://dublincore.org), and FOAF
(http://xmlns.com/foaf/0.1) to model the metadata and content of the article.
Meaningful terms in the content are represented as annotations modeled with the
Annotation Ontology (AO) [2]; terms are linked to biological entities from different
specialized vocabularies –proteins, genes, drugs, and diseases, among others. Our
architecture, as presented in Fig.1, is layer-based. The Presentation layer consists of a
web-based Search & Retrieval interface. Once the text to be searched for has been typed in
the Presentation layer, the Communication layer is used to retrieve the requested in-
formation. Once the relevant information has been retrieved, the Presentation layer is
in charge of organizing this data. More search options will be powered by ontology
mapping and ontology indexes. In this early version of our browser, users initiate the
search by providing the name of a human gene. From the gene name, the corresponding
protein accession is retrieved from the GeneWiki RDF; GeneWiki [3] is a Wikipedia
project comprising about 10,000 pages on human genes, including mappings to
proteins, diseases, chemicals, and literature, among others.
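
   The gene-to-protein lookup described above can be illustrated with a minimal sketch over in-memory triples. This is not the RDF4PMC implementation: the predicate names (`ex:label`, `ex:encodesProtein`), the `ex:` identifiers, and the sample triples are assumptions made only for illustration, and the actual GeneWiki RDF vocabulary may differ.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate names; the real GeneWiki RDF vocabulary may differ.
LABEL = "ex:label"
ENCODES = "ex:encodesProtein"

# Tiny illustrative sample, not an excerpt of the actual dataset.
SAMPLE_TRIPLES: List[Triple] = [
    ("ex:gene/INS", LABEL, "insulin"),
    ("ex:gene/INS", ENCODES, "ex:protein/P01308"),
    ("ex:gene/CAT", LABEL, "catalase"),
    ("ex:gene/CAT", ENCODES, "ex:protein/P04040"),
]


def protein_accessions(triples: Iterable[Triple], gene_name: str) -> List[str]:
    """Return the protein accessions linked to the genes labelled `gene_name`."""
    triples = list(triples)  # materialize, since the data is scanned twice
    wanted = gene_name.strip().lower()
    # Step 1: find the gene resources whose label matches the typed name.
    genes = {s for s, p, o in triples if p == LABEL and o.lower() == wanted}
    # Step 2: follow the (assumed) gene-to-protein predicate and keep the
    # last path segment of each protein identifier as the accession.
    return sorted(
        o.rsplit("/", 1)[-1]
        for s, p, o in triples
        if p == ENCODES and s in genes
    )


print(protein_accessions(SAMPLE_TRIPLES, "Insulin"))
```

   In a deployed system the same two steps would more likely be expressed as a query against the SPARQL endpoint; the in-memory version only makes the data flow explicit.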


Fig.1. Architecture for the RDF4PMC browser; arrows between layers show the data flow.
Current components are drawn with continuous lines, future components with dashed lines.

   Our RDF4PMC browser makes it possible for users to search for a human gene
and to retrieve related papers; the results are presented as an alphabetically ordered
list of articles. For each article, metadata (title, authors, and abstract) and links
(GeneWiki, PubMed, PMC, and DOI) are displayed. Whenever possible, links to
identifiers.org (http://identifiers.org), a system of resolvable, persistent identifiers
for biology-related information, and to Bio2RDF are also provided. Additionally, a cloud
of tags complementing the bibliographic data for each article is presented; this cloud
contains the biologically relevant terms identified in the article. The weight of each
term in the cloud depends on the number of biological entities associated with it.
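
   The weighting rule for the tag cloud can be sketched as follows. The code assumes that annotations are available as (term, entity identifier) pairs and that each distinct entity counts once; the linear font-size mapping and the placeholder identifiers are illustrative assumptions, not details taken from the prototype.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cloud_weights(annotations: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Weight of a term = number of distinct biological entities linked to it."""
    entities_by_term = defaultdict(set)
    for term, entity_id in annotations:
        entities_by_term[term].add(entity_id)
    return {term: len(ids) for term, ids in entities_by_term.items()}


def font_sizes(weights: Dict[str, int],
               smallest: float = 10.0,
               largest: float = 32.0) -> Dict[str, float]:
    """Linearly map weights to font sizes (the scaling rule is an assumption)."""
    if not weights:
        return {}
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        # All terms weigh the same: use the middle of the range.
        return {term: (smallest + largest) / 2 for term in weights}
    scale = (largest - smallest) / (high - low)
    return {term: smallest + (w - low) * scale for term, w in weights.items()}


# Placeholder identifiers, for illustration only.
sample_annotations = [
    ("insulin", "vocab1:e1"),
    ("insulin", "vocab2:e7"),
    ("insulin", "vocab1:e1"),  # repeated entity, counted once
    ("glucose", "vocab2:e9"),
]
weights = cloud_weights(sample_annotations)
print(weights, font_sizes(weights))
```
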
   Fig.2 presents a detailed view for one of the articles retrieved after searching for
the human gene “insulin”. As illustrated in Fig.2 a), whenever a term in the cloud is
selected, the vocabularies and, subsequently, the biological entities are displayed;
different colors are used for different vocabularies. The interactive zone, Fig.2 b),
changes depending on the selection in the cloud: (i) when a term is selected, the
paragraphs containing that term are displayed; a simple navigation bar allows the user to
move from one paragraph to the other; similarly, (ii) when a biological entity is
selected, relevant information is displayed, e.g., sequences and 3D structures for
proteins, structures for chemicals, or images for species. Our prototype identifies the
type of the selected biological entity and uses Biojs components
(http://code.google.com/p/biojs) in order to display the graphical data.
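
   The behavior of the interactive zone can be sketched as a simple dispatch on the kind of selection. The viewer names below are hypothetical labels and do not correspond to actual Biojs component identifiers; the case-insensitive substring match used to find paragraphs is likewise an assumption, since the prototype relies on annotations rather than string matching.

```python
from typing import Dict, List

# Hypothetical viewer names keyed by entity type; not actual Biojs identifiers.
VIEWERS_BY_TYPE: Dict[str, List[str]] = {
    "protein": ["sequence-viewer", "3d-structure-viewer"],
    "chemical": ["structure-viewer"],
    "species": ["image-viewer"],
}


def viewers_for(entity_type: str) -> List[str]:
    """Graphical components to show when an entity of this type is selected."""
    return list(VIEWERS_BY_TYPE.get(entity_type.strip().lower(), []))


def paragraphs_with(term: str, paragraphs: List[str]) -> List[int]:
    """Indices of the paragraphs to show when a term is selected in the cloud."""
    needle = term.lower()
    return [i for i, text in enumerate(paragraphs) if needle in text.lower()]


article = [
    "Catalase activity was measured in all samples.",
    "Samples were stored at low temperature.",
    "The catalase gene was amplified by PCR.",
]
print(paragraphs_with("catalase", article))
print(viewers_for("Protein"))
```

   The list returned by `paragraphs_with` is what a navigation bar would step through, one matching paragraph at a time.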


Fig.2. a) Search and retrieval based on human gene names. b) Enriched content based on anno-
tations is displayed in the interactive zone.

   A graph-based display is also possible; it facilitates navigation and filtering de-
pending on terms contained in the retrieved articles. As articles are connected by
shared terms, they are naturally arranged in a graph where articles are represented as
nodes while terms are represented as edges. Whenever two articles share a term, there
will be an edge connecting them; the number of biological entities associated with the
term determines the weight of the edge. The term-browsing graph is created from
a term searched for by the user, for instance catalase. Once the searched term has
been provided, the graph is generated with all articles containing that term; edges are
added depending on the other shared terms between any pair of articles. As docu-
ments and terms are retrieved, the graph is reorganized following a force-based algo-
rithm. Fig. 3 shows the term-browsing graph for the term catalase; it also illustrates
some features facilitating the navigation: mouse-over events on nodes and edges, as
well as filtering options. For nodes, mouse-over activates a close icon in the
upper-right corner so the article can be removed from the graph; in the figure it is
possible to observe the close icon for the article with title “Localization of the
Carnation Italian ringspot virus […]”. For edges, mouse-over displays the actual term; in
the figure it is possible to observe the term PROTEASE. Terms can also be filtered out; as
multiple terms can be shared amongst articles, an edge-based filtering feature has been
defined: depending on the weight, i.e., number of biological entities, terms can be
excluded from the graph; in the figure the minimum weight has been set to 30.
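
   The construction and filtering of the term-browsing graph can be sketched as follows. The input shapes (a mapping from article to its set of terms, and a mapping from term to its number of biological entities) are assumptions, as is the choice to keep edges whose weight is greater than or equal to the minimum; the sample data is invented.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str, int]  # (article, article, shared term, weight)


def build_term_graph(article_terms: Dict[str, Set[str]],
                     searched_term: str,
                     term_weight: Dict[str, int],
                     min_weight: int = 0) -> Tuple[List[str], List[Edge]]:
    """Nodes are the articles containing `searched_term`; every other term
    shared by a pair of articles yields one edge, kept only if its weight
    (number of biological entities) reaches `min_weight`."""
    nodes = sorted(a for a, terms in article_terms.items()
                   if searched_term in terms)
    edges: List[Edge] = []
    for a, b in combinations(nodes, 2):
        shared = (article_terms[a] & article_terms[b]) - {searched_term}
        for term in sorted(shared):
            weight = term_weight.get(term, 0)
            if weight >= min_weight:  # assumption: the threshold is inclusive
                edges.append((a, b, term, weight))
    return nodes, edges


# Invented sample data, for illustration only.
sample_articles = {
    "A": {"catalase", "protease", "oxidase"},
    "B": {"catalase", "protease"},
    "C": {"catalase", "oxidase"},
    "D": {"protease"},
}
sample_weights = {"catalase": 50, "protease": 42, "oxidase": 12}
print(build_term_graph(sample_articles, "catalase", sample_weights, min_weight=30))
```

   With the minimum weight set to 30, only the edge labelled protease survives in this sample; lowering the threshold to 0 also restores the oxidase edge between A and C.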


Fig.3. Partial connectivity graphs for the term catalase –some nodes are not displayed.
Additional terms with more than 30 biological entities are displayed as edges.
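
   The text does not say which force-based algorithm reorganizes the graph. As an illustration only, the sketch below performs one iteration of a generic spring-embedder in the style of Fruchterman and Reingold: every pair of nodes repels, and nodes joined by an edge attract. The function name, the constants `k` and `step`, and the representation of edges as plain (article, article) pairs are all assumptions.

```python
import math
from typing import Callable, Dict, Iterable, Tuple

Point = Tuple[float, float]


def layout_step(pos: Dict[str, Point],
                edges: Iterable[Tuple[str, str]],
                k: float = 1.0,
                step: float = 0.05) -> Dict[str, Point]:
    """Return new positions after one force-directed iteration."""
    disp = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)

    def apply(a: str, b: str, force: Callable[[float], float]) -> None:
        # Positive force pushes a and b apart; negative pulls them together.
        dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
        dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
        f = force(dist)
        fx, fy = dx / dist * f, dy / dist * f
        disp[a][0] += fx
        disp[a][1] += fy
        disp[b][0] -= fx
        disp[b][1] -= fy

    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            apply(a, b, lambda d: k * k / d)   # repulsion between all pairs
    for a, b in edges:
        apply(a, b, lambda d: -(d * d) / k)    # attraction along edges

    return {n: (pos[n][0] + step * disp[n][0],
                pos[n][1] + step * disp[n][1]) for n in pos}


start = {"A": (0.0, 0.0), "B": (4.0, 0.0)}
print(layout_step(start, [("A", "B")]))
```

   Repeating the step until the displacements become small yields a layout in which articles sharing terms end up close together; a production layout would typically also cap the displacement per iteration.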


3   Conclusions and future work

The components for the RDF4PMC browser that have been described in this paper
illustrate how our browser acts as an interface to the WoD, particularly the data con-
tained in the open subset of PMC. RDF4PMC delivers self-describing content that
allows the implementation of semantic search, exploration, and retrieval mechanisms
throughout the dataset as well as a new reading experience when interacting with the
documents. The former is achieved by making extensive use of the interlinked nature
of the RDF dataset; the latter is the consequence of using specialized visualization and
manipulation gadgets that make it possible for the reader to browse the document
focusing on those aspects he/she considers relevant for the research at hand. Exploring
the network of interconnected documents facilitates the formulation of more precise
queries; being able to post-process queries based on information found in the
documents delivers a new pivot from which to build queries. In the near future we are
planning to explore the use of ontology mappings and expansion; users should not
need to know the exact name of a gene, and should be able to search for related in-
formation, e.g., diseases or drugs. We will focus on knowledge that could be easily
inferred as a consequence of the interrelated nature of the semantic model.


References
1. García A, García Castro LJ, McLaughlin C, Flager S: RDFising PubMed Central. In:
   Bioontologies: 2012; Long Beach, CA, USA; 2012.
2. Ciccarese P, Ocana M, Garcia Castro L, Das S, Clark T: An open annotation ontology for
   science on web 3.0. Journal of Biomedical Semantics 2011, 2(Suppl 2):S4.
3. Huss JW, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su
   AI: The Gene Wiki: community intelligence applied to human gene annotation.
   Nucleic Acids Research 2010, 38(suppl 1):D633-D639.