<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semantic Data Integration Methodology for Translational Neurodegenerative Disease Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sumit Madan</string-name>
          <email>sumit.madan@scai.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksims Fiosins</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Bonn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juliane Fluck</string-name>
          <email>juliane.fluck@scai.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)</institution>
          ,
          <addr-line>Schloss Birlinghoven, Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Center for Neurodegenerative Diseases</institution>
          ,
          <addr-line>Tuebingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>German National Library of Medicine (ZB MED) - Information Centre for Life Sciences</institution>
          ,
          <addr-line>Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute of Geodesy and Geoinformation, University of Bonn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institute of Medical Systems Biology, Center for Molecular Neurobiology, University Medical Center Hamburg-Eppendorf</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The advancement of omics technologies and the execution of large-scale clinical studies have led to the production of large, heterogeneous patient datasets. Researchers at DZNE (German Center for Neurodegenerative Diseases) and Fraunhofer SCAI (Fraunhofer Institute for Algorithms and Scientific Computing), located at several sites, focus on the generation, integration, and analysis of such data, especially in the field of neurodegenerative diseases. In order to extract meaningful and valuable biological insights, they analyze such datasets separately and, more importantly, in a combined manner. Blending such datasets, which are often located at different sites and lack semantic annotation, requires the development of novel data integration methodologies. We use the concept of federated semantic data layers to disseminate and create a unified view of different types of datasets. In addition to the semantically enriched data in such data layers, we propose another level of condensed information, providing only derived results, that is integrated in a central integration platform. Furthermore, a semantic lookup platform holds all semantic concepts needed for the data integration. This eases the creation of connections and hence improves interoperability between multiple datasets. Further integration of biologically relevant relationships between several entity classes such as genes, SNPs, drugs, or miRNAs from public databases leverages existing knowledge. In this paper, we describe the semantic-aware service-oriented infrastructure, including the semantic data layers, the semantic lookup platform, and the integration platform, and additionally give examples of how data can be queried and visualized. The proposed architecture makes it easier to support such an infrastructure or adapt it to new use cases. Furthermore, the semantic data layers containing derived results can be used for data publication.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Data Integration</kwd>
        <kwd>Semantic Data Layer</kwd>
        <kwd>Translational Research</kwd>
        <kwd>Neurodegenerative Diseases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Translational medicine in a disease area aims to shorten the time between new scientific
findings in laboratories and new therapies for patients. Especially in the area of
neurodegenerative diseases and dementia, fast translation is necessary to reduce the
suffering of patients and their families and the economic burden that a
growing ageing population places on society. In industrialized countries, late-stage dementia and
complications due to underlying dementia have become the most common causes of death
besides heart diseases, malignant growth, and cerebrovascular diseases; dementia currently
affects 1.6 million people in Germany [1]. As the population ages, the number of people
suffering from dementia and related neurodegenerative disorders (NDD) will
rise substantially. To date, all attempts to slow down disease progression by medical (e.g.
pharmacological) or non-medical interventions have failed.</p>
      <p>To address these challenges, the German Center for Neurodegenerative Diseases
(DZNE) was founded in 2009 as an institute of the Helmholtz Association. The DZNE
has ten sites distributed over Germany that integrate the leading national expertise in
the field of neurodegeneration research. DZNE covers a wide range of research topics,
from fundamental research through clinical research to health care and population research. Its
broad scope enables the DZNE to follow a translational approach with the ultimate goal
to develop novel preventive or therapeutic solutions for neurodegenerative diseases. A
current bottleneck in analysing the heterogeneous data generated at the distributed
DZNE sites is that different data entities for the same disease or even the same patient
are analysed separately and the full potential of a holistic analysis of all data is not
leveraged. The key aim of the BMBF-funded project Integrative Data Semantics for
Neurodegeneration research (IDSN) (www.idsn.info/en/) is the ability to integrate and
query data from the different DZNE research fields and combine this with existing
disease information and biomedical databases.</p>
      <p>To achieve coherent data integration, several general and NDD-specific data
integration tasks have to be addressed. In general, there is a need to integrate large-scale
data coming from high-throughput screening, clinical cohort and/or clinical routine
data. Other large-scale data becoming standard in many disease fields are for example
automated cellular assays or imaging data. Many more data types will be standard in
the future. On the other hand, task-specific data types vary significantly depending on
the use case and the disease area, and annotation of data and metadata is needed in such
a way that they are interoperable and can be reused. These demands are well described
as requests within the FAIR data principles [2].</p>
      <p>To cope with these diverse requirements, we present a novel semantic integration
methodology for linked biological and clinical data. We realize the architecture by
using existing open-source tools in concordance with identified requirements and describe
the technical details of the implementation. Key elements of the presented integration
platform are (1) a central semantic lookup platform for the vocabulary used within, (2)
the modularity of its components, (3) the semantic integration of the different data types
and (4) their compliance with the FAIR data principles, (5) a data integration platform
for different types of data, and finally (6) query environments allowing for integrative
analysis of data by end-users such as clinicians or researchers.</p>
      <p>In the following, we give an overview of related approaches, describe the IDSN
architecture, and give examples of how to integrate the data and how this data can be queried
and visualised.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Several studies demonstrated the usefulness of data integration in various ways.
Daemen et al. [3] were able to enhance the predictive power of clinical decision support
models that were used to define personalized therapies for rectal and prostate cancer
patients. They used a methodological integration framework while incorporating data
from different genome-wide sources including genomics, proteomics, transcriptomics,
and epigenetics.</p>
      <p>Dawany et al. [4] used a large-scale integration of complex data from
high-throughput experiments (HTP) to identify the shared genes and pathways in different cancer
types. The integration and normalization methodology was applied on microarray data
from more than 80 laboratories with more than 4000 samples that included more than
10 cancer tissue types. For each cancer type an organized list of genes was identified
as the potential biomarkers including various kinases and transcription factors.</p>
      <p>Iyappan et al. [5] employed RDF-based technologies to link and publish large
volumes and different types of neuroscience-related data. They transformed the
data into a simple triple format (subject, predicate, and object) that represents
relationships between entities. As usual in RDF, the nodes and the edges are encoded using
Uniform Resource Identifiers (URIs) that are provided by biology-specific
ontologies. They integrated semantically enriched data such as PPI networks,
miRNA-target interaction networks, transcriptomic data (from GEO and ArrayExpress), and
relationships from further biological databases. Although RDF-based technologies are
well suited for data interoperability, they are not suitable for every dataset type.
Furthermore, they often perform poorly and consume a large amount of disk space with
huge datasets.</p>
      <p>In the Open PHACTS discovery platform, RDF and especially Application
Programming Interfaces (APIs) are used extensively for the design and development of linked
data applications that integrate pharmacology data [6]. The API layer
provides output in JSON format for the application developers, hiding RDF that is
considered complex for the purpose of user interaction and data presentation. The Open
PHACTS discovery platform provides integrated access to more than eleven linked
datasets that cover information about chemistry, pathways, and proteins.</p>
      <p>The Neuroscience Information Framework (NIF), published in 2008 and initiated by
the National Institutes of Health Blueprint for Neuroscience Research, is an ecosystem that
provides a repository of searchable neuroscience resources [7]. Experimental databases,
brain atlases, neuroscience-specific literature, commercial tools and several other data
types are supported by NIF. It provides access to the data through a single web-based
platform that is mainly available for finding such resources.</p>
    </sec>
    <sec id="sec-3">
      <title>Architecture</title>
      <p>The main purpose of the IDSN architecture presented in Figure 1 is to provide a
modular platform for the integrated analysis of data supporting translational
neurodegenerative disease research. The data is derived from clinical routine and cohort data
or from different high-throughput analyses of biomaterial and originates from distributed
sources. The architecture supports the storage and integration of different types of data
and should be easily extendable to new data types.</p>
      <p>In a nutshell, the primary data, which is stored in distributed or federated data stores,
is analysed locally to derive analysis results and is supplied with semantic annotations
using standard ontologies and terminologies to make the data interoperable. As a result, a
standardized semantic layer is generated for each data source, in which the
standardised vocabulary ensures data interoperability. Links to the original data are incorporated
to ensure data provenance.</p>
      <p>The ontologies and terminologies used are stored in a separate semantic lookup
platform, which is used to retrieve appropriate concepts to annotate data, to provide
information about the concepts, and to provide mappings between different ontologies and
terminologies. This service is used not only by the federated data storage systems but
also by the semantic linked data hub.</p>
      <p>The semantically enriched data is transferred as a semantic layer to a data
management platform. The data management platform is the component responsible for unified
data access to all semantic data layers. Furthermore, the data management platform
ensures that data from all semantic layers conforms to common platform standards
(for example, the FAIR principles) and that data consistency is checked.</p>
      <p>The semantic linked data hub, which fetches and indexes the data from the data
management platform, is the central part of the IDSN architecture. It stores the data in
various appropriate formats to allow fast data queries and retrieval. In addition, further
external data (secondary data) is added within the semantic linked data hub to provide
additional information about primary data, or additional links (associations) between
primary data elements. Furthermore, external services can be used by the semantic
linked data hub to analyse the data or to provide background information. Finally,
graphical user interfaces allow for dedicated data visualisation and interactive
visual and computational analysis.</p>
      <p>In this service-oriented architecture, the key functionality of each service is
accessible and consumed through a well-defined, rich API that conforms to the popular
Representational State Transfer (REST) paradigm. Available services are plugged into this
architecture by systematically adopting and reusing their existing APIs,
which can be extended where needed. The newly developed API
for the semantic data integration platform, which provides programmatic access to the
integrated data, is designed to answer common scientific queries of the users.</p>
      <p>Currently, the IDSN platform can handle three main types of bio-medical data,
which correspond to the following semantic layers:</p>
      <list list-type="bullet">
        <list-item><p>The omics data layer contains expression data of small RNAs and RNAs, as well
as mutation variants from healthy and NDD subjects.</p></list-item>
        <list-item><p>The pharmacological (assay) data layer contains compound activity rates for the
induction of various cellular processes such as apoptosis or the induction of
protein expression such as CASP3 induction.</p></list-item>
        <list-item><p>The clinical data layer includes longitudinal clinical routine and longitudinal
cohort data from healthy and NDD subjects.</p></list-item>
      </list>
      <p>Subsequent sections will describe the important modules of the architecture in more
detail. As an example for the conversion from source data to a semantic layer, we
describe the content of the ‘omics’ semantic layer in more detail.</p>
      <sec id="sec-3-1">
        <title>Semantic layer for omics data</title>
        <p>The main purpose of a semantic layer is to provide meaningful semantics to data, in
order to enable the connection between primary and secondary data. Another key aspect
of the introduced semantic layer is that it stores derived data, which represents the
analysed and, in some cases, interpreted data by experts. We focus extensively on the
usage of derived data for several reasons:</p>
        <list list-type="order">
          <list-item><p>Raw data consumes a lot of space, and it is costly to store it redundantly at several
locations.</p></list-item>
          <list-item><p>Often the raw data needs to be manipulated (e.g. cleansed, transformed,
standardized, harmonized, normalized) to prepare it for combined or
comparative analysis.</p></list-item>
          <list-item><p>The development and execution of analysis pipelines for
processing raw data needs expert knowledge, which is generally available at the local
sites.</p></list-item>
          <list-item><p>Raw datasets are not necessarily interoperable, as their metadata annotations
might not use the common vocabulary or, even worse, might not exist at all.</p></list-item>
        </list>
        <p>As a result, a semantic layer reduces complexity and facilitates user data access,
search, and understanding of data. Note that by “semantic layer” we mean a data
structure for a single type of data, such as RNA sequencing (RNA-seq) and cap analysis of gene
expression (CAGE) for RNA expression, or whole exome sequencing (WES)
of protein-coding genes (about 1% of the genome) to find mutation variants.
In the case of RNA source data in FASTQ format, counts of expressed
RNA or small RNA are calculated in a first step. This is done with available tools such as OASIS
(https://oasis.dzne.de/) [8] for the calculation of small RNA counts. In a second step,
differential expression scores (p-values) between healthy and diseased subjects are
calculated.</p>
        <p>The derived data, i.e. the RNA counts and the p-values for differential
expression together with fold change information, is stored in the semantic layer instead
of the initial FASTQ data (for an overview, see Table 1). In addition, the RNA entities
are normalized to their corresponding genes in HGNC (https://www.genenames.org/)
and Ensembl (http://ensemblgenomes.org/). Furthermore, all metadata annotations are
normalized as well and stored in the semantic layer. Examples of metadata annotations
are the organism, tissue, cell type or disease type. For the normalization of annotations,
vocabularies from the semantic lookup service are used (cf. next subsection “Data
annotation”).</p>
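        <p>As an illustration of what one entry of such a semantic layer might hold, the following minimal Python sketch stores counts, a p-value, and normalized annotations, and derives a log2 fold change from the counts. The class name, field names, pseudocount, and all values are assumptions made for this example; they do not describe the actual IDSN data model, and the p-value is simply taken over from an upstream tool rather than computed here.</p>
        <preformat preformat-type="code">
```python
import math
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DerivedExpressionRecord:
    """One derived-data entry of a hypothetical omics semantic layer."""
    gene: str                  # normalized identifier, e.g. "HGNC:BCL2"
    counts_healthy: list       # read counts per healthy sample
    counts_diseased: list      # read counts per diseased sample
    p_value: float             # taken over from the upstream analysis tool
    annotations: dict = field(default_factory=dict)

    def log2_fold_change(self, pseudocount: float = 1.0) -> float:
        """log2 ratio of mean diseased to mean healthy counts.

        The pseudocount avoids division by zero for unexpressed genes.
        """
        diseased = mean(self.counts_diseased) + pseudocount
        healthy = mean(self.counts_healthy) + pseudocount
        return math.log2(diseased / healthy)


record = DerivedExpressionRecord(
    gene="HGNC:BCL2",
    counts_healthy=[10, 14, 12],
    counts_diseased=[50, 52, 51],
    p_value=0.001,  # illustrative value, not a real result
    annotations={"organism": "Homo sapiens", "tissue": "brain"},
)
print(record.log2_fold_change())
```
        </preformat>
        <p>Keeping only such compact records, together with a link back to the primary data, is what allows the layer to be shared without moving the FASTQ files themselves.</p>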
        <p>In the case of WES data, genetic variants are normalized to dbSNP
(https://www.ncbi.nlm.nih.gov/snp) entities. The variant frequency is calculated with
the help of external reference data sources. Furthermore, for the calculation of the
disease burden for genetic variants, the CADD (Combined Annotation Dependent
Depletion) [9] score is used. CADD can quantitatively prioritize functional, deleterious, and
disease causal variants across a wide range of functional categories including effect
sizes and genetic architectures. It can be used to prioritize causal variation in both
research and clinical settings. The variant frequency as well as the CADD score is stored
in the semantic layer for WES data. In addition to the internal data, information from
external resources is integrated. Gene-disease and variant-disease relationships are integrated
from DisGeNET (http://www.disgenet.org), which assembles this information from
databases as well as from literature. Furthermore, miRTarBase and SCAIView
(https://www.scaiview.com/) are used to integrate miRNA-gene relations.</p>
        <p>For data annotation, the designed semantic layers utilize the controlled
neuro-specific vocabularies and mappings available in the semantic lookup platform. This
includes consistent annotation of the data with particular semantic terms, which are
common to different data types as well as metadata. The annotations are stored as key:value
descriptions using controlled vocabulary for both key and value terms. An example
key:value pair is HGNC:BCL2, where the key HGNC refers to the
terminology and the value BCL2 is the HGNC label of the gene. Currently, the semantic lookup
platform contains more than 20 predefined annotation keys; however, new (not
normalized) keys are allowed as well. Using predefined keys allows users to preserve the
semantic meaning of annotations. Non-normalized keys and values are uploaded
into the semantic lookup platform as additional terminology.</p>
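        <p>The key:value convention can be sketched in a few lines of Python. The set of predefined keys below is a hypothetical subset chosen for illustration (the platform is described as holding more than 20), and the function name is likewise an assumption.</p>
        <preformat preformat-type="code">
```python
# Hypothetical subset of the predefined annotation keys.
PREDEFINED_KEYS = {"HGNC", "organism", "tissue", "disease"}


def parse_annotation(text):
    """Split a 'key:value' annotation and flag whether the key is predefined."""
    key, separator, value = text.partition(":")
    key, value = key.strip(), value.strip()
    if not separator or not key or not value:
        raise ValueError(f"not a key:value annotation: {text!r}")
    return {"key": key, "value": value, "normalized_key": key in PREDEFINED_KEYS}


print(parse_annotation("HGNC:BCL2"))
print(parse_annotation("batch:run-7"))
```
        </preformat>
        <p>An annotation whose key is not predefined is still accepted; it is only flagged, so that it can later be added to the lookup platform as additional terminology.</p>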
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Types of omics data in the IDSN platform. From different primary data,
expressed small RNAs and genes and gene variants are identified and normalized to
the corresponding concepts from miRBase, HGNC, Ensembl or dbSNP. Small RNA
counts, RNA counts and variant counts, as well as differential expression for RNA and
CADD scores for variants, are calculated and stored in the semantic layers. Within the
semantic integration platform, further relationships such as gene-variant, miRNA-gene
or gene-disease relations are added from external resources.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Primary data</th><th>Data type measured</th><th>Controlled vocabulary</th><th>Derived data</th></tr>
            </thead>
            <tbody>
              <tr><td>small RNA-seq</td><td>small RNA</td><td>miRBase, Ensembl</td><td>counts, differential expression (p-value)</td></tr>
              <tr><td>RNA-seq</td><td>RNA</td><td>HGNC, Ensembl</td><td>counts, differential expression (p-value)</td></tr>
              <tr><td>CAGE</td><td>RNA</td><td>HGNC, Ensembl</td><td>counts, differential expression (p-value)</td></tr>
              <tr><td>WES</td><td>mutation variant</td><td>dbSNP</td><td>variant calling, variant annotation (CADD)</td></tr>
            </tbody>
          </table>
          <table>
            <thead>
              <tr><th>Secondary data</th><th>External source</th></tr>
            </thead>
            <tbody>
              <tr><td>miRNA-gene relations</td><td>miRTarBase, SCAIView</td></tr>
              <tr><td>gene-disease relations</td><td>DisGeNET</td></tr>
              <tr><td>gene-variant relations, variant-disease relations</td><td>dbSNP, DisGeNET</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-1-4">
          <title>Data annotation</title>
          <p>For the annotation of data sets, an annotation tool that integrates the semantic lookup
platform was developed. The annotation of data is designed as a semi-automated
process: the system automatically suggests a ranked list of normalized concepts for
existing annotations based on the Levenshtein distance between database entries and
controlled vocabulary. These suggestions are provided within a user interface to the users
for manual curation. It enables editing, adding, and deleting as well as searching for
concepts within the integrated semantic lookup platform interface.</p>
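          <p>The ranking step can be illustrated with a small, self-contained Python sketch. It implements the standard dynamic-programming Levenshtein distance and sorts a vocabulary by distance to a free-text annotation. The function names and the three-term vocabulary are assumptions for the example; the actual annotation tool may normalize or weight strings differently.</p>
          <preformat preformat-type="code">
```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suggest_concepts(annotation, vocabulary, limit=3):
    """Rank vocabulary labels by edit distance to a free-text annotation."""
    query = annotation.lower()
    ranked = sorted(
        vocabulary,
        key=lambda label: (levenshtein(query, label.lower()), label),
    )
    return ranked[:limit]


# Hypothetical controlled vocabulary for illustration only.
vocabulary = ["Alzheimer's disease", "Parkinson's disease", "Huntington's disease"]
print(suggest_concepts("Alzheimers disease", vocabulary))
```
          </preformat>
          <p>The ranked list is only a suggestion; as described above, the final choice is left to manual curation.</p>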
          <p>We also annotate biological conditions that are part of the metadata. Biological
condition annotation allows the samples of a dataset to be grouped in such a way that the samples of
one group correspond to a particular biological condition. Examples of biological
conditions are healthy and diseased patients, several diseases, or several stages or
conditions of a particular disease. Annotation of biological conditions allows
some data analyses to be performed automatically for the semantic layer or within the semantic data
integration platform. For example, differential expression analysis of small RNA
datasets based on the annotation of biological conditions can be computed directly using
OASIS.</p>
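          <p>Grouping samples by an annotated biological condition could look like the following Python sketch. The annotation key "condition", the sample identifiers, and the fallback group name are assumptions for the example.</p>
          <preformat preformat-type="code">
```python
from collections import defaultdict


def group_by_condition(samples, key="condition"):
    """Group sample ids by the value of one annotation key."""
    groups = defaultdict(list)
    for sample_id, annotations in samples.items():
        groups[annotations.get(key, "unannotated")].append(sample_id)
    return dict(groups)


# Hypothetical sample annotations.
samples = {
    "s1": {"condition": "healthy"},
    "s2": {"condition": "diseased"},
    "s3": {"condition": "healthy"},
    "s4": {},
}
print(group_by_condition(samples))
```
          </preformat>
          <p>The resulting groups are the input a downstream comparison, such as a differential expression analysis between two conditions, would need.</p>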
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Semantic lookup platform</title>
        <p>In translational bio-medical research, controlled bio-medical vocabularies such as
terminologies, ontologies, and taxonomies play an important role in the annotation,
integration, and analysis of biological data. These vocabularies are essential for data
interoperability across departments and institutes. The semantic lookup platform extends the
proposed architecture by providing access to semantics through bio-medical
vocabularies, facilitating data integration and interoperability. It covers the terms
used within the semantic data layers, provides a detailed description of each term, and provides
mappings between equivalent concepts from different vocabularies.</p>
        <p>After reviewing several existing open-source software projects (such as AberOWL,
Ontobee, BioPortal, Ontology Lookup Service, Ontology Cross-reference Service), we
chose two services: as the entity resolution service (ERS) we selected the Ontology Lookup
Service (OLS) [10], and as the entity mapping service (EMS) we chose the Ontology
Cross-reference Service (OXO), both developed by EMBL-EBI.</p>
        <p>Both services provide a web-based user interface for exploring and visualizing the
vocabularies (ERS) and mappings (EMS) and, additionally, a flexible REST-based API
to programmatically access these resources. Additionally, they both provide a utility to
regularly update vocabularies and mappings. Furthermore, the ERS includes a search
engine for terms and synonyms with autocomplete functionality. To manage
vocabularies, the ERS also uses a flexible configuration system.</p>
        <p>An important extension of the provided services is the incorporation of
terminologies that make it possible to annotate and map all entities in the different semantic layers. These
are mainly relevant life science entities such as genes, SNPs, miRNAs, organisms, and cell
lines, as well as terminologies for the description of neuroscience-relevant clinical conditions.</p>
        <p>The designed semantic data layers utilize the vocabularies available in the semantic
lookup platform. Terms in such controlled vocabularies have several characteristics that
make them suitable for annotation and curation of data. A single term often represents
a formal specification of a biomedical concept. They are defined and standardized by
assigning a persistent identifier (with an IRI), a unique primary label, and a textual
description. They can also include further metadata such as abbreviations, synonyms,
and cross-references. Additionally, these terms can be hierarchically organized and put
in a relationship with each other. Using such vocabularies allows the alignment of
datasets, makes datasets semantically meaningful, and facilitates data understanding by
end users.</p>
        <p>An advantage of using hierarchical vocabularies or ontologies within the semantic
lookup platform is the possibility to search by parent terms. For example, a search for
“neurodegenerative disease” will find all samples annotated with any subclass of this
disease category, and a search for “brain” will find samples annotated with any of the brain
parts.</p>
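        <p>Searching by parent term amounts to expanding the query term to all of its subclasses before matching sample annotations. The Python sketch below shows this with a breadth-first traversal; the tiny hierarchy and the sample annotations are invented for illustration, whereas real hierarchies would come from the ontologies held in the lookup platform.</p>
        <preformat preformat-type="code">
```python
from collections import deque

# Hypothetical child -> parents links standing in for an ontology.
PARENTS = {
    "Alzheimer's disease": ["neurodegenerative disease"],
    "Parkinson's disease": ["neurodegenerative disease"],
    "neurodegenerative disease": ["disease"],
    "influenza": ["disease"],
}


def descendants(term):
    """Return the term itself plus every direct or indirect subclass."""
    children = {}
    for child, parents in PARENTS.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    found, queue = {term}, deque([term])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


def search_samples(samples, term):
    """Select sample ids whose disease annotation falls under the term."""
    matching = descendants(term)
    return sorted(sid for sid, disease in samples.items() if disease in matching)


samples = {"s1": "Alzheimer's disease", "s2": "influenza", "s3": "Parkinson's disease"}
print(search_samples(samples, "neurodegenerative disease"))
```
        </preformat>
        <p>The visited set also guards against revisiting terms that have more than one parent, which is common in ontologies.</p>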
      </sec>
      <sec id="sec-3-3">
        <title>Data management platform</title>
        <p>The importance of good data management for supporting scientific discovery and
innovation is strongly emphasized by Wilkinson et al. [11], who developed
the FAIR (findable, accessible, interoperable, and reusable) guiding principles for
managing scientific data. In addition, it is important for the various DZNE
departments to discover datasets published by other departments so that these
datasets can be evaluated and re-used for further experiments. Thus, we incorporated
a data management platform that provides services to catalog and search (derived)
datasets into the proposed architecture.</p>
        <p>We use the open source software DKAN (https://getdkan.org/) to catalog and publish
the biomedical datasets generated at different DZNE sites. The data management
platform generates a formal citation for each added dataset. To cite the data, it supports the
popular Open Data Metadata Schema (https://project-open-data.cio.gov/v1.1/schema/),
which is based on the Data Catalog Vocabulary (DCAT) (https://www.w3.org/TR/vocab-dcat/),
a W3C recommendation designed to facilitate interoperability between
data catalogs. It provides a persistent identifier as soon as a dataset is published.
Additionally, the datasets include (neuroscience-specific) metadata, licenses, authors, and
version information, all of which is cataloged, indexed, and searchable through a
web-based user interface. The software also offers several REST API endpoints to
communicate with other services, which allow browsing the datasets, accessing metadata,
and retrieving the datasets.</p>
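        <p>To give an impression of what a catalog entry in this style might contain, the following Python sketch builds a dataset description with field names modelled on the Open Data Metadata Schema v1.1 and checks that a set of core fields is present. The list of required fields reflects our reading of the schema for non-federal publishers and should be verified against the published schema; all values (titles, names, URLs, dates) are placeholders, and the sketch does not use or represent the DKAN API.</p>
        <preformat preformat-type="code">
```python
import json

# Core fields assumed to be required; verify against the published schema.
REQUIRED_FIELDS = (
    "title", "description", "keyword", "modified",
    "publisher", "contactPoint", "identifier", "accessLevel",
)


def missing_fields(record):
    """List the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder values throughout; this is not a real dataset.
dataset = {
    "@type": "dcat:Dataset",
    "title": "Small RNA counts, example cohort",
    "description": "Derived small RNA counts and differential expression results.",
    "keyword": ["small RNA", "neurodegenerative disease"],
    "modified": "2018-01-01",
    "publisher": {"@type": "org:Organization", "name": "Example Institute"},
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Data Steward",
        "hasEmail": "mailto:data@example.org",
    },
    "identifier": "https://example.org/dataset/1",
    "accessLevel": "restricted public",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(missing_fields(dataset))
print(json.dumps(dataset, indent=2))
```
        </preformat>
        <p>A check of this kind could run before a dataset is published, so that incomplete descriptions are caught before an identifier is assigned.</p>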
      </sec>
      <sec id="sec-3-4">
        <title>Semantic Data Integration Platform</title>
        <p>The semantic data integration platform is the central part of the proposed
architecture. It interlinks different types of derived biomedical data together with
annotations and secondary data. The primary goal of the integrated semantic data hub is
to enable end users to answer their research questions. For example, researchers of
neurodegenerative diseases may be interested in investigating the role of a particular gene in
different types of diseases. Clinicians may be interested in the interpretation of
genetic tests of a particular patient.</p>
        <p>The fundamental data structure for the indexed data in the semantic hub is
represented as a graph that we consider as essential for the analysis of the integrated data.
The nodes in the graph represent the entities and edges represent the relations that are
used to connect entities with each other. Furthermore, nodes and edges may have
additional properties or metadata such as the context information or the provenance attached
to them. The platform also allows flexible data modeling to integrate heterogeneous
datasets that are not (fully) suited for the graph-based structure such as clinical routine
data. Hence, the platform design additionally covers the combination of two further
database types to integrate relational and document-based data.</p>
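        <p>The graph structure described here, with properties attached to both nodes and edges, can be sketched in plain Python as follows. This is a minimal in-memory illustration under our own naming; it does not represent the graph database used in the platform, and the identifiers and relations in the example are placeholders rather than real associations.</p>
        <preformat preformat-type="code">
```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    properties: dict = field(default_factory=dict)  # e.g. provenance, context


class PropertyGraph:
    """Minimal directed graph with properties on nodes and edges."""

    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = []

    def add_node(self, node_id, **properties):
        self.nodes.setdefault(node_id, {}).update(properties)

    def add_edge(self, source, relation, target, **properties):
        # Make sure both endpoints exist as nodes.
        for node_id in (source, target):
            self.add_node(node_id)
        self.edges.append(Edge(source, relation, target, properties))

    def neighbours(self, node_id, relation=None):
        """Targets reachable from node_id, optionally filtered by relation."""
        return [
            edge.target for edge in self.edges
            if edge.source == node_id and relation in (None, edge.relation)
        ]


graph = PropertyGraph()
graph.add_node("HGNC:BCL2", type="gene")
# Placeholder entities and relations, not real associations.
graph.add_edge("miRNA:example-1", "regulates", "HGNC:BCL2", provenance="miRTarBase")
graph.add_edge("HGNC:BCL2", "associated_with", "disease:example", provenance="DisGeNET")
print(graph.neighbours("HGNC:BCL2"))
```
        </preformat>
        <p>Storing the provenance on the edge rather than on the node keeps track of which external resource contributed each individual relation.</p>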
        <p>During the indexing process in the semantic hub, data is connected and aligned
with secondary data. This data has been incorporated from external resources and is
necessary to link the different entity types. Examples are the regulation of genes by
miRNAs, extracted from miRTarBase and from SCAIView. Other
external resources are listed in Table 1. As such associations are graph-like by nature, they fit
well into the graph database of the platform.</p>
        <p>As for the other services, the REST-based API is a key component of the data hub
for common scientific queries. According to the needs of the queries, the implemented
interfaces access, filter, and combine integrated data from the databases. In this way, the
API can wrap well-optimized queries built specifically for graph, relational,
and/or document-based databases. This enables us to provide a high-performance
platform with fast responses. During the development we also focused on the requirements
of the developers who build dedicated (web-based) user interfaces or apply analytical
approaches to the integrated data. Furthermore, the platform also communicates with
the semantic lookup platform to retrieve entity-based information, or with further
external services such as SCAIView, OASIS, and NeuroMMSig
(https://neurommsig.scai.fraunhofer.de/) to retrieve secondary data relevant to the scientific
questions asked.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The Small RNA Expression Atlas as visualisation and computational analysis use case</title>
      <p>The Small RNA Expression Atlas (SEA) is a web application that allows for the search
of known and novel small RNAs across ten organisms using standardized search terms
and ontologies (http://sea.ims.bio/) [12]. It is based on the IDSN semantic hub, but
covers only one particular primary data type (small RNAs). In contrast to proprietary
patient data that is not publicly available, SEA incorporates publicly available datasets
from GEO. For the generation of the semantic layers, all data is semantically annotated
with the support of the annotation tool and the semantic lookup platform. Furthermore,
derived data is obtained by using the OASIS web application. The derived data includes
small RNA counts as well as pathogen expression. Future analyses to be incorporated within
the semantic layer include differential expression (DE) and classification relevance scores, i.e.
p-values for DE and Gini indices for classification, respectively. SEA supports
interactive result visualisation of the data within the semantic integration platform. It allows
for querying and displaying sRNA expression information, primary and derived data
visualization, as well as visual analysis for disease-specific biomarker detection based
on relevance scores. In addition, it supports the re-analysis of selected data and contains
a user model for user-specific data management.</p>
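      <p>The text does not state how the Gini index for classification is computed. One common
reading is the decrease in Gini impurity obtained when samples are split on the expression of
a single small RNA, as used in tree-based classifiers. The following sketch illustrates that
reading only; the function names, the threshold, and the expression values are assumptions
made for illustration and are not taken from SEA or OASIS.</p>
      <preformat>
```python
def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease when samples are split at `threshold` on one sRNA's expression."""
    low, high = [], []
    for value, label in zip(values, labels):
        (high if value > threshold else low).append(label)
    n = len(labels)
    weighted = (len(low) * gini_impurity(low) + len(high) * gini_impurity(high)) / n
    return gini_impurity(labels) - weighted


# Invented read counts for four samples; labels are the sample conditions.
labels = ["control", "control", "disease", "disease"]
informative = gini_decrease([5, 8, 40, 55], labels, threshold=20)
uninformative = gini_decrease([5, 40, 8, 55], labels, threshold=20)
print(informative, uninformative)
```
      </preformat>
      <p>Under this reading, a small RNA whose expression separates the conditions cleanly receives
a higher score than one whose expression is unrelated to the condition, which is what makes
the score usable for ranking biomarker candidates.</p>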
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this manuscript we discussed a novel semantic data integration architecture with a
primary focus on neurodegenerative disease research. The architecture makes it possible to create a
unified view of distributed biological data using the concept of federated semantic data
layers, and to integrate data together with derived and secondary data in a central
integration platform.</p>
      <p>The use of semantic concepts gives data semantic meaning, which facilitates
querying by end users and enables interoperability between different types of biological
data. The semantic lookup platform provides all necessary semantic concepts.</p>
      <p>The architecture demonstrated its efficiency by serving as the basis for the Small RNA
Expression Atlas (SEA). SEA allows semantic integration of a large amount of publicly
available small RNA data, and linked storage of small RNA and pathogen information together
with DE and classification results as well as small RNA-gene and small RNA-disease
associations from external databases.</p>
      <p>In this manuscript, we focused primarily on the integration of omics data. For two
other types of data, pharmacological assays and clinical information, a similar
approach was used. The resulting semantics-aware architecture will form the basis
for DZNE data integration, which will allow querying across the various highlighted
data types.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>The project IDSN is supported by the German Federal Ministry of Education and
Research (BMBF) as part of the program "i:DSem – Integrative Data Semantics in the
Systems Medicine", project number 031L0029 [A-C].</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bickel</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          :
          <article-title>Die Häufigkeit von Demenzerkrankungen</article-title>
          , https://www.deutschealzheimer.de/fileadmin/alz/pdf/factsheets/infoblatt1_haeufigkeit_demenzerkrankungen_dalzg.pdf, (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aalbersberg</surname>
          </string-name>
          , Ij.J.,
          <string-name>
            <surname>Appleton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axton</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomberg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boiten</surname>
            , J.-W., da Silva Santos,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourne</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouwman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brookes</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crosas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dillo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edmunds</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evelo</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkers</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Beltran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>A.J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grethe</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heringa</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , 't Hoen, P.A.,
          <string-name>
            <surname>Hooft</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kok</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kok</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lusher</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martone</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mons</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Packer</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Persson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocca-Serra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roos</surname>
            , M.,
            <given-names>van Schaik</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Sansone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-A.</given-names>
            ,
            <surname>Schultes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Sengstag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Slater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Strawn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Swertz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.A.</given-names>
            ,
            <surname>Thompson</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>van der Lei</surname>
          </string-name>
          , J., van
          <string-name>
            <surname>Mulligen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velterop</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waagmeester</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wittenburg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolstencroft</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mons</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          .
          <source>Sci. Data</source>
          .
          <volume>3</volume>
          ,
          <issue>160018</issue>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Daemen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevaert</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojeda</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debucquoy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suykens</surname>
            ,
            <given-names>J.A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sempoux</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Machiels</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haustermans</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moor</surname>
          </string-name>
          , B. De:
          <article-title>A kernel-based integration of genomewide data for clinical decision support</article-title>
          .
          <source>Genome Med</source>
          .
          <volume>1</volume>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Dawany</surname>
            ,
            <given-names>N.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dampier</surname>
            ,
            <given-names>W.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tozeren</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Large-scale integration of microarray data reveals genes and pathways common to multiple cancer types</article-title>
          .
          <source>Int. J. Cancer</source>
          .
          <volume>128</volume>
          ,
          <fpage>2881</fpage>
          -
          <lpage>2891</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Iyappan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawalia</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raschka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann-Apitius</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senger</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>NeuroRDF: Semantic integration of highly curated data to prioritize biomarker candidates in Alzheimer's disease</article-title>
          .
          <source>J. Biomed. Semantics</source>
          .
          <volume>7</volume>
          ,
          <issue>45</issue>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loizou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>A.J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harland</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettifer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>API-centric Linked Data integration: The Open PHACTS Discovery Platform case study</article-title>
          .
          <source>J. Web Semant</source>
          .
          <volume>29</volume>
          ,
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akil</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ascoli</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowden</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bug</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donohue</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grafstein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grethe</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halavi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kennedy</surname>
            ,
            <given-names>D.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marenco</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martone</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
          </string-name>
          , H.-M.,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shepherd</surname>
            ,
            <given-names>G.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sternberg</surname>
          </string-name>
          , P.W.,
          <string-name>
            <surname>Van Essen</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          :
          <article-title>The Neuroscience Information Framework: A Data and Knowledge Environment for Neuroscience</article-title>
          .
          <source>Neuroinformatics</source>
          .
          <volume>6</volume>
          ,
          <fpage>149</fpage>
          -
          <lpage>160</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>R.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethune</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fiosins</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magruder</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Capece</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shomroni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Oasis 2: Improved online analysis of small RNA-seq data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>BMC Bioinformatics</source>
          .
          <volume>19</volume>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          Genet.
          <volume>46</volume>
          ,
          <fpage>310</fpage>
          -
          <lpage>315</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>CEUR Workshop Proc</source>
          .
          <volume>1686</volume>
          ,
          <issue>160018</issue>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>