Conceptual Exploration of Documents and Digital Libraries in the Biomedical Domain

Leyla Jael Garcia Castro1, John Gómez1, Alexander Garcia2

1 biotea.ws Project
{leylajael, johncar}@gmail.com
2 Florida State University, School of Library and Information Science, Tallahassee, Florida, USA
alexgarciac@gmail.com

Abstract. In this demo we present our approach to the use of Semantic Web technology in scholarly communications; it entails understanding research papers as an interface to the Web of Data. We use the connectivity tissue provided by RDF technologies to facilitate semantic retrieval and to improve the user experience when interacting with biomedical literature.

Availability: http://biotea.idiginfo.org, http://biotea.idiginfo.org/gene/

Keywords: Semantic publications, linked data, biomedical visualization.

1 Introduction

As biomedical literature grows exponentially, so do the connections across documents. Research articles are deeply interconnected with one another and with resources on the web, e.g., biomedical databases. Such interconnectedness makes the paper an ideal interface to the Web of Data (WoD). Connections may be as evident as those due to bibliographic references; they may also be as complex as those due to similarities in topics, research questions, materials and methods, etc. Making use of such connectivity tissue requires a semantically processed dataset that delivers self-descriptive documents fully interoperable with the Web. However, a linked dataset alone may not be enough; semantic search engines, tools, and mash-ups should also be provided in order to facilitate access to the data and to improve the end-user experience.

RDF4PMC [1] is a semantic dataset for the open access subset of PubMed Central (PMC); it comprises and makes available (i) a set of RDF files generated from the open access subset of PMC and enriched with semantic annotations, (ii) a Web Services API for querying the RDF dataset, (iii) a SPARQL Protocol and RDF Query Language (SPARQL) endpoint containing a subset of the RDF files as a proof of concept, (iv) an article-centric prototype that acts as an interface to the WoD, and (v) an implemented transformation process from our RDF files to Bio2RDF (http://bio2rdf.org). The self-descriptive documents delivered by the RDF4PMC model make it possible to focus on the user experience by building browsers that "know" the data types presented in the documents, i.e., biomedical entities. In this paper we present a prototype that uses the dataset produced by RDF4PMC; in particular, it enables search and retrieval for human genes. Our browser presents not only metadata and content for the retrieved articles, but also enriches the user experience by adding graphical components that include information relevant to the biomedical entities identified in the articles. Our browser also makes it possible for users to examine the network of relationships across documents.

2 Semantic browsing of biomedical literature

RDF4PMC orchestrates ontologies such as DoCO (http://purl.org/spar/doco), BIBO (http://purl.org/ontology/bibo), DC (http://dublincore.org), and FOAF (http://xmlns.com/foaf/0.1) to model the metadata and content of the article. Meaningful terms in the content are represented as annotations modeled with the Annotation Ontology (AO) [2]; terms are linked to biological entities from different specialized vocabularies: proteins, genes, drugs, and diseases, among others. Our architecture, as presented in Fig. 1, is layer-based. The Presentation layer consists of a web-based Search & Retrieval interface.
Once the text to be searched for has been typed in the Presentation layer, the Communication layer is used to retrieve the requested information. Once the relevant information has been retrieved, the Presentation layer is in charge of organizing this data. More search options will be powered by ontology mapping and ontology indexes. In this early version of our browser, users initiate the search by providing the name of a human gene. From the gene name, the corresponding protein accession is retrieved from the GeneWiki RDF; GeneWiki [3] is a Wikipedia project comprising about 10,000 pages on human genes, including mappings to proteins, diseases, chemicals, and literature, among others.

Fig. 1. Architecture for the RDF4PMC browser; data flow is shown by arrows between layers. Current components are drawn with continuous lines; future components are drawn with dashed lines.

Our RDF4PMC browser makes it possible for users to search for a human gene and to retrieve related papers. The results are presented as an alphabetically ordered list of articles, each displaying metadata (title, authors, and abstract) as well as links (GeneWiki, PubMed, PMC, and DOI). Whenever possible, links to identifiers.org (http://identifiers.org), a resolvable, persistent system for biology-related information, and to Bio2RDF are also provided. Additionally, a cloud of tags complementing the bibliographic data for each article is presented; this cloud contains the biologically relevant terms identified in the article. The weight of each term in the cloud depends on the number of biological entities associated with it. Fig. 2 presents a detailed view for one of the articles retrieved after searching for the human gene "insulin". As illustrated in Fig. 2 a), whenever a term in the cloud is selected, the vocabularies and, subsequently, the biological entities are displayed; different colors are used for different vocabularies.
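
The weighting rule for the tag cloud described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the browser's actual code: the shape of the annotation tuples, the function name tag_cloud_weights, and the sample identifiers are all hypothetical. The only rule taken from the text is that the weight of a term is the number of biological entities associated with it.

```python
from collections import defaultdict


def tag_cloud_weights(annotations):
    """Map each term to the number of distinct biological entities linked to it.

    `annotations` is an iterable of (term, vocabulary, entity_id) tuples;
    this tuple shape is an assumption made for the example.
    """
    entities_by_term = defaultdict(set)
    for term, vocabulary, entity_id in annotations:
        # A set ensures that a repeated annotation is counted only once.
        entities_by_term[term].add((vocabulary, entity_id))
    return {term: len(entities) for term, entities in entities_by_term.items()}


# Hypothetical annotations for one article (identifiers are placeholders).
sample = [
    ("insulin", "proteins", "entity-1"),
    ("insulin", "genes", "entity-2"),
    ("insulin", "proteins", "entity-1"),  # duplicate annotation
    ("glucose", "chemicals", "entity-3"),
]
weights = tag_cloud_weights(sample)
```

Counting distinct (vocabulary, entity) pairs rather than raw annotations is a design choice of this sketch; the text does not say whether repeated mentions of the same entity contribute to the weight.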
The interactive zone, Fig. 2 b), changes depending on the selection in the cloud: (i) when a term is selected, the paragraphs containing that term are displayed, and a simple navigation bar allows the user to move from one paragraph to the next; similarly, (ii) when a biological entity is selected, relevant information is displayed, e.g., sequences and 3D structures for proteins, structures for chemicals, or images for species. Our prototype identifies the type of the selected biological entity and uses BioJS components (http://code.google.com/p/biojs) to display the graphical data.

Fig. 2. a) Search and retrieval based on human gene names. b) Enriched content based on annotations is displayed in the interactive zone.

A graph-based display is also possible; it facilitates navigation and filtering depending on the terms contained in the retrieved articles. As articles are connected by shared terms, they are naturally arranged in a graph where articles are represented as nodes and terms are represented as edges. Whenever two articles share a term, there is an edge connecting them; the number of biological entities associated with the term determines the weight of the edge. The term-browsing graph is created from a term searched for by the user, for instance, catalase. Once the search term has been provided, the graph is generated with all articles containing that term; edges are added depending on the other terms shared between any pair of articles. As documents and terms are retrieved, the graph is reorganized following a force-based algorithm. Fig. 3 shows the term-browsing graph for the term catalase; it also illustrates some features that facilitate navigation: mouse-over events on nodes and edges, as well as filtering options.
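
The construction of the term-browsing graph can be sketched as follows. This is a minimal sketch under stated assumptions rather than the prototype's implementation: the function name build_term_graph, the dictionary-based inputs, and the sample data are hypothetical, and the min_weight parameter is only one assumed way a weight-based filtering option could work. The force-based layout is omitted; the sketch covers only which nodes and edges exist.

```python
from itertools import combinations


def build_term_graph(article_terms, term_weights, searched_term, min_weight=0):
    """Return (nodes, edges) for the articles containing `searched_term`.

    article_terms: dict mapping article id -> set of terms found in the article
    term_weights:  dict mapping term -> number of associated biological entities
    Each edge is an (article_a, article_b, term, weight) tuple, one per *other*
    shared term whose weight is at least `min_weight` (hypothetical filter).
    """
    nodes = sorted(a for a, terms in article_terms.items() if searched_term in terms)
    edges = []
    for a, b in combinations(nodes, 2):
        # The searched term is shared by every node, so it is not drawn as an edge.
        shared = (article_terms[a] & article_terms[b]) - {searched_term}
        for term in sorted(shared):
            weight = term_weights.get(term, 0)
            if weight >= min_weight:
                edges.append((a, b, term, weight))
    return nodes, edges


# Hypothetical articles and weights, for illustration only.
article_terms = {
    "A1": {"catalase", "protease", "peroxisome"},
    "A2": {"catalase", "protease"},
    "A3": {"catalase", "peroxisome"},
    "A4": {"insulin"},
}
term_weights = {"catalase": 50, "protease": 40, "peroxisome": 12}
nodes, edges = build_term_graph(article_terms, term_weights, "catalase", min_weight=30)
```

With these sample inputs, A4 is left out because it does not contain the searched term, and the peroxisome edge between A1 and A3 is dropped because its weight falls below the threshold.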
For nodes, mouse-over activates a close icon in the upper-right corner so that the article can be removed from the graph; in the figure, the close icon can be observed for the article titled "Localization of the Carnation Italian ringspot virus […]". For edges, mouse-over displays the actual term; in the figure, the term PROTEASE can be observed. Terms can also be filtered out: as multiple terms can be shared among articles, an edge-based filtering feature has been defined. Depending on their weight, i.e., the number of biological entities, terms can be excluded from the graph; in the figure, the minimum weight has been set to 30.

Fig. 3. Partial connectivity graph for the term catalase (some nodes are not displayed). Additional terms with more than 30 biological entities are displayed as edges.

3 Conclusions and future work

The components of the RDF4PMC browser described in this paper illustrate how our browser acts as an interface to the WoD, particularly to the data contained in the open access subset of PMC. RDF4PMC delivers self-describing content that allows the implementation of semantic search, exploration, and retrieval mechanisms throughout the dataset, as well as a new reading experience when interacting with the documents. The former is achieved by making extensive use of the interlinked nature of the RDF dataset; the latter is the consequence of using specialized visualization and manipulation gadgets that make it possible for the reader to browse the document focusing on those aspects he/she considers relevant for the research at hand. Exploring the network of interconnected documents facilitates the formulation of more precise queries; being able to post-process queries based on information found in the documents provides a new pivot from which to build queries.
In the near future we plan to explore the use of ontology mappings and expansion; users should not need to know the exact name of a gene, and should be able to search for related information, e.g., diseases or drugs. We will focus on knowledge that can be easily inferred as a consequence of the interrelated nature of the semantic model.

References

1. García A, García Castro LJ, McLaughlin C, Flager S: RDFising PubMed Central. In: Bioontologies 2012; Long Beach, CA, USA; 2012.
2. Ciccarese P, Ocana M, Garcia Castro L, Das S, Clark T: An open annotation ontology for science on web 3.0. Journal of Biomedical Semantics 2011, 2(Suppl 2):S4.
3. Huss JW, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su AI: The Gene Wiki: community intelligence applied to human gene annotation. Nucleic Acids Research, 38(suppl 1):D633-D639.