=Paper=
{{Paper
|id=Vol-1747/IP22_ICBO2016
|storemode=property
|title=SourceData: Making Data Discoverable
|pdfUrl=https://ceur-ws.org/Vol-1747/IP22_ICBO2016.pdf
|volume=Vol-1747
|authors=Nancy George,Sara El-Gebali,Thomas Lemberger
|dblpUrl=https://dblp.org/rec/conf/icbo/GeorgeEL16
}}
==SourceData: Making Data Discoverable ==
Nancy George (2,*), Robin Liechti (1,*), Sara El-Gebali (2,*), Lou Götz (1,*), Isaac Crespo (1), Ioannis Xenarios (1), Thomas Lemberger (1,3)

(*) contributed equally; (1) Vital-IT, Swiss Institute of Bioinformatics, Lausanne, Switzerland; (2) SourceData, EMBO, Heidelberg, Germany; (3) Correspondence to: thomas.lemberger@embo.org

In molecular and cell biology, most of the data presented in published papers are not available in formats that allow direct analysis and systematic mining. The goal of the SourceData project (http://sourcedata.embo.org) is to make published data easier to find, to connect papers containing related information, and to promote the reuse and novel analysis of published data.

The main concept underlying the project is that the structure of a dataset provides information about the design of the study in question and can be exploited in powerful data-oriented search strategies. SourceData has therefore developed tools to generate machine-readable descriptive metadata from figures in published manuscripts. Experimentally tested hypotheses are represented as directed relationships between standardized biological entities. Once the figures are processed, a comprehensive ‘scientific knowledge graph’ can be generated from these data (see demo video 1 at https://vimeo.com/sourcedata/kg), making the body of data efficiently searchable. Importantly, this graph is grounded in the published data themselves rather than in potentially subjective interpretations of the results.

SourceData has developed algorithms to search the data-oriented knowledge graph efficiently, together with an interface, shown in Figure 1, that enables users to find papers based on their data content.

Figure 1. SourceData Search interface

The search capabilities have also been incorporated into the SmartFigure viewer. This application can be embedded directly into online publications and allows the visualisation of figures in the context of related data published elsewhere.
Readers can then navigate from one figure to the next by following linked entities (see Figure 2 and demo video 2 at http://vimeo.com/sourcedata/search).

Figure 2. ‘SmartFigures’ viewer

Programmatic access to the SourceData database is provided through a public Application Programming Interface (API) (http://sourcedata.vital-it.ch/public/#/api), so that developers can build their own software solutions or machine-driven analyses based on the SourceData data format.

Future perspectives for the project include integrating a structured representation of time and incorporating descriptions of experimental procedures and reagents. Furthermore, SourceData will develop portable ‘figure/data packages’ that combine and cross-link the human-interpretable figure with the underlying machine-readable metadata and data files. Linking the original experimental dataset to the representative figure in this way is intended to make the data easier to reuse and more transparent. Finally, we plan to adapt the SourceData model to integrate existing approaches for the representation of large-scale biological data.

With SourceData, we are developing a platform that simultaneously improves the discoverability and utility of research data and of the scientific articles in which these data are reported. It will therefore provide the basis for a reward system that incentivizes authors to share their data openly, thus driving broader adoption of open data and open science by the community.
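The abstract describes tested hypotheses as directed relationships between standardized entities, assembled into a searchable knowledge graph. The following is a minimal Python sketch of that idea, not the SourceData implementation: the class names, the `source`/`target` fields, and the placeholder entities and paper identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A standardized biological entity (hypothetical minimal form)."""
    name: str
    kind: str  # e.g. "gene", "protein", "small molecule"


@dataclass(frozen=True)
class TestedHypothesis:
    """One directed relationship source -> target, tied to a figure in a paper."""
    source: Entity
    target: Entity
    paper: str
    figure: str


class KnowledgeGraph:
    """Directed graph whose edges are tested hypotheses (illustrative only)."""

    def __init__(self):
        # Map each source entity to the hypotheses that start from it.
        self._outgoing = defaultdict(list)

    def add(self, hypothesis):
        self._outgoing[hypothesis.source].append(hypothesis)

    def papers_testing(self, source, target):
        """Papers containing at least one figure that tests source -> target."""
        edges = self._outgoing.get(source, [])
        return sorted({h.paper for h in edges if h.target == target})

    def linked_entities(self, source):
        """Names of entities reachable in one step, for figure-to-figure navigation."""
        edges = self._outgoing.get(source, [])
        return sorted({h.target.name for h in edges})


# Placeholder data; these are not real SourceData records.
entity_a = Entity("entityA", "gene")
entity_b = Entity("entityB", "protein")
entity_c = Entity("entityC", "protein")

kg = KnowledgeGraph()
kg.add(TestedHypothesis(entity_a, entity_b, paper="paper-1", figure="Fig. 1A"))
kg.add(TestedHypothesis(entity_a, entity_b, paper="paper-2", figure="Fig. 3C"))
kg.add(TestedHypothesis(entity_a, entity_c, paper="paper-2", figure="Fig. 4B"))

print(kg.papers_testing(entity_a, entity_b))
print(kg.linked_entities(entity_a))
```

Because every edge carries its paper and figure, a query over entity pairs returns papers by their data content rather than by keywords, which is the kind of search the abstract describes.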