<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SemNExT: A Framework for Semantically Integrating and Exploring Numeric Analyses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evan W. Patton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabeth Brown</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Poegel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hannah De Los Santos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Fasano</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kristin P. Bennett</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Mathematical Sciences, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Neural Stem Cell Institute</institution>
          ,
          <addr-line>Rensselaer, NY</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Combining statistical techniques with semantic data representations holds the potential to enhance understandability of scientific results. It can augment scientific findings with existing data sources in a reproducible manner through provenance capture, as well as enable further analysis and deduction through computer and human understandable definitions of terms. We present a framework for semantically integrating and exploring numerical analyses. We call our work SemNExT for Semantic Numeric Exploration Technology. We apply our approach to data analysis aimed at improving understanding of human brain development that leverages the Cortecon RNA-Seq data repository. Our approach supports enrichment of Cortecon data through combinations with structured data sources available via SQL or SPARQL from the web to provide semantically enhanced analyses combined with statistical analyses. Our results are encoded as RDF graphs that may be used as input to reasoners and may drive provenance-aware visualizations. We introduce our infrastructure, describe its use on transcriptomic data analysis of a model of cerebral cortex development, and discuss some emerging suggestions for best practices and future research challenges.</p>
      </abstract>
      <kwd-group>
        <kwd>modeling</kwd>
        <kwd>statistical processes</kwd>
        <kwd>provenance</kwd>
        <kwd>linked data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Our work at the intersection of semantics and data analysis began after
discussions with successful data analysts detailed the level of human investigation
typically required. We hypothesized that some of the human interpretation and
vetting of results could be automated with semantic technologies and
structured web-available resources. We explore this hypothesis in the setting of brain
development data analysis. Recent work by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], among others, has explored the
application of machine learning and statistical techniques to transcriptomic data
from an in-vitro stem cell model of the development of the human cerebral cortex
called Cortecon.<sup>4</sup> Roughly, Cortecon captures snapshots over 77 days of RNA
presence in an experiment that models "human brain development in a dish." In
the model, embryonic stem cells differentiate and then begin forming the six layers
of the cerebral cortex. Through Cortecon we can ethically examine human brain
development and how diseases such as autism and Alzheimer's may emerge from
gene mutations and exposure to toxins. We build on this work and explore
techniques for annotating and publishing statistical analyses and their provenance
as linked data. We explore the potential of leveraging knowledge from structured
resources to enhance the statistical output with unambiguous term definitions,
links to related content, and connections to query and reasoning services.
      </p>
      <p>4 http://cortecon.neuralsci.org</p>
      <p>Our Semantic Numeric Exploration Technology (SemNExT) framework is
designed to facilitate statistical analyses that can be easily linked to external
structured resources and published as linked data. This framework serves as an
exemplar for complementing the work of statisticians with the use of provenance
and linked data best practices.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        BioGPS provides a pluggable workspace for interacting with different genomic
and proteomic datasets available on the World Wide Web [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, this
framework does not provide a machine readable representation of the data,
making it difficult to repurpose and link with additional resources unless they can
be easily expressed via a URL templating scheme.
      </p>
      <p>
        Structured scientific workflows such as Kepler [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provide inspiration for
this work. Kepler provides a provenance module to capture and query workflow
execution history using a SQL-based schema.
      </p>
      <p>
        Mathematicians have also produced modeling languages for mathematics,
such as MathML [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the Mizar Mathematical Library.<sup>5</sup> MathML is a W3C
recommended standard for representing mathematical formulae on the web using
a markup language based on XML. The Mizar Mathematical Library provides
a web-accessible, machine-verifiable library of mathematical functions. For a
thorough review of ontologies and representations for mathematical knowledge
on the semantic web, we refer readers to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        STATO is an ontology for modeling statistical tests and their applications,
as well as probability distributions, variables, spread, and variation metrics.<sup>6</sup>
Our modeling efforts complement the work done by the STATO authors, as
we introduce additional tests and provide an exemplar model that combines
statistical processes with the W3C's Provenance Model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>There is ongoing work at the Gene Ontology Consortium to develop a term
enrichment protocol using REST principles and a JSON-based interchange
format.<sup>7</sup> The protocol allows for metadata exchange to describe the
inputs and outputs supported by enrichment servers. There is also a JSON-LD
context to enable interoperability with linked data clients. It lacks, however,
a clear ability to expose provenance information to the client and a means of
incorporating additional or alternative analyses.</p>
      <sec id="sec-2-1">
        <p>5 http://mizar.org/library/</p>
        <p>6 http://stato-ontology.org/</p>
        <p>7 https://github.com/cmungall/term-enrichment-protocol</p>
        <p>[Figure: SemNExT architecture diagram. Component labels recovered from the figure: SemNExTAPI, Analysis Factory, Analysis, Activity, Annotator, DataSource Factory, DataSource, Entity, DatabaseSource, SPARQLSource.]</p>
        <p>
          Capadisli, Auer, and Riedl explore linked statistical data using RDF Data
Cube [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to enable a semantic web frontend to the R statistical language [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It
uses distributed SPARQL queries to integrate data from multiple linked data
repositories for performing analysis. PROV-O was used to model data
provenance. We build on these ideas by incorporating the ability to interface with
SQL-based data sources and aim to provide a more general programming
mechanism for semantic integration across multiple data sources.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Architecture Overview</title>
      <p>The SemNExT architecture is divided into six major components: (1) Data
Source, (2) Analysis, (3) Annotator, (4) Provenance, (5) Web Service, and
(6) Visualization. The interaction of these components is shown in Fig. 1. Briefly,
SemNExT instantiates a set of data sources and analyses using factories. Each
data source provides one or more annotators that exploit the semantics of the
underlying dataset to extract relevant attributes and links about entities. The
acts of annotating and analyzing entities are captured as activities, and the results
of these operations are cached and returned to the CHeM diagram interface.</p>
      <p>Data Source. SemNExT is an extensible framework for integrating datasets
sourced from across the web. We describe the datasets used for our analyses
and visualizations of transcriptomic data in Section 6.</p>
      <p>
        Analysis. SemNExT can be set up to perform a variety of analyses on numeric
data provided by data sources via the rpy2 Python package. Section 4 describes
at a high level the different analyses used on the Cortecon genetic data.
      </p>
      <p>
        Annotator. SemNExT data sources export Annotator objects that the
framework combines to extract information about entities across many datasets.
Annotators provide rules to help the framework order operations appropriately at
runtime. For example, UMLS provides relations and attributes for instances of
type umls-sn:Gene or Genome, among others, so it contributes an Annotator that
annotates entities of that type when they are detected. Via
subclass relationships or inverse functional properties, the framework identifies
appropriate entities, extracts the associated relations and attributes, and returns
the enriched entity in the result. Subsection 6.5 presents an example.
      </p>
      <p>
        Provenance. SemNExT relies heavily on the W3C-recommended PROV
ontology [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for its provenance data model. Analyses and annotators are modeled as
prov:Entity and their applications as prov:Activity. Section 5 details the
RDF and PROV models for SemNExT.
      </p>
      <p>Web Service. SemNExT provides a RESTful web services interface that is used
for searching for diseases, genes, and KEGG pathways<sup>8</sup> and for obtaining
semantically enriched networks through the integration of datasets.</p>
      <p>Visualization. SemNExT provides a hybrid visualization called chord-heat map
(CHeM) diagrams built using the JavaScript library D3. These diagrams are
discussed further in Section 7.</p>
    </sec>
    <sec id="sec-4">
      <title>Statistical Methods</title>
      <p>
        The Cortecon data set is an in-vitro stem cell model of human cerebral cortex
development [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The data consist of RNA expression captured by RNA-Seq counts
measured over 9 time points (days 0, 7, 12, 19, 26, 33, 49, 63, and 77). The data
contain 14,065 significant differentially expressed genes, which were filtered using
criteria decided upon by the Neural Stem Cell Institute. Our goal was to gain
a clearer understanding of the biological processes underlying human
corticogenesis and to provide insight into the developmental pathologies of neurological
disorders. By analyzing genes with known mutations linked to specific diseases
with respect to the temporal gene transcription patterns revealed by
Singular Value Decomposition (SVD), and by understanding their role in the stages of
cortical development revealed by fuzzy C-means (FCM) clustering, we can help
identify the stages of corticogenesis where the root causes of the diseases may
be found.
      </p>
      <p>To conduct the analysis, we first normalized the data by gene into z-scores
using the mean and sample standard deviation of each gene across time. We then
performed SVD and FCM clustering. SVD, a form of dimensionality reduction,
reveals waves of gene transcription by plotting the genes in the space spanned
by the first two principal components. The angle of each gene within this space produces a
temporal ordering of the genes. FCM extracts clusters of related genes based on
their temporal activity, resulting in six clusters that correspond to the stages
of development: Pluripotency, Ectoderm, Neural Differentiation, Cortical
Specification, Deep Layers, and Upper Layers. The analysis reveals a "transcriptomic
clock," which allows one to understand the role of genes with respect to these
developmental stages, which are analogous to hours on a clock. All stages except
Ectoderm were identified by van de Leemput et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Characterization of this
newly identified Ectoderm cluster is ongoing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>8 http://www.genome.jp/kegg/pathway.html</p>
      <p>
        [Fig. 2: overview of the relevant portion of the SemNExT ontology. Labels recovered from the figure: sn:Analysis, sn:DataSource, sn:ComputedObservation, qb:Observation, qb:dataset, prov:Entity, prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom.]
      </p>
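      <p>As a rough illustration of the normalization and temporal ordering steps described in this section, the following Python sketch uses only the standard library. It is not the authors' R code: the function names, gene names, counts, and component coordinates are hypothetical, the ordering step assumes each gene already has coordinates on the first two components, and the counterclockwise direction of the ordering is an assumption.</p>
      <preformat>
```python
import math
import statistics


def zscore_by_gene(counts):
    """Normalize each gene's time series to z-scores.

    counts maps a gene name to its expression values over the time points.
    The sample standard deviation (N - 1 denominator) is used, and every
    series is assumed to vary (a constant series would divide by zero).
    """
    normalized = {}
    for gene, values in counts.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation
        normalized[gene] = [(v - mean) / sd for v in values]
    return normalized


def order_by_angle(coords):
    """Order genes by their angle in the plane of the first two components."""
    def angle(gene):
        x, y = coords[gene]
        return math.atan2(y, x) % (2 * math.pi)  # angle in [0, 2*pi)
    return sorted(coords, key=angle)


# Hypothetical counts for two genes at three time points.
counts = {"GENE_A": [2.0, 4.0, 6.0], "GENE_B": [10.0, 10.0, 40.0]}
z = zscore_by_gene(counts)

# Hypothetical coordinates on the first two components.
coords = {"GENE_A": (0.0, 1.0), "GENE_B": (1.0, 0.0), "GENE_C": (-1.0, 0.0)}
ordering = order_by_angle(coords)
```
      </preformat>
      <p>In this toy input, GENE_A has mean 4 and sample standard deviation 2, so its z-scores are -1, 0, and 1; the three genes are ordered by increasing angle starting from the positive first-component axis.</p>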
    </sec>
    <sec id="sec-5">
      <title>Modeling SemNExT processes in RDF</title>
      <p>
        In order to link these data and statistical analyses with structured resources, we
begin by converting them from their original tabular form into an RDF graph structure
using the semantic techniques of Lebo et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Observations are modeled as
qb:Observation and we introduce a subclass sn:ComputedObservation<sup>9</sup>
for modeling the statistical results. Fig. 2 provides an overview of the relevant
portion of the SemNExT ontology.
      </p>
      <p>The code implementing the statistical analyses from Section 4 is modeled as
prov:Entity and applications of analyses to observations are captured using
prov:Activity. We make extensive use of STATO terminology (subclasses of
obi:0200000 in Fig. 2) and include extensions to STATO where appropriate.<sup>10</sup></p>
      <p>Consider the task of normalization. The code to compute the z-score is
modeled as a method in an R file that, given a set of observations, computes the mean
and standard deviation of the set, and returns the z-score for each observation.
When this task is applied to the dataset of interest, the sn:Analysis used to
model this activity is an instantiation of the transformation class stato:0000104
that uses the specified R code and input data. The output set of z-scores is
derived from the input observations and modeled as sn:ComputedObservation.</p>
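      <p>One way the z-score example above might be written down as triples is sketched below in plain Python. Only the class and property names that appear in this paper (sn:Analysis, sn:ComputedObservation, qb:Observation, stato:0000104, prov:Entity, prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom) come from the text. The ex: identifiers, the function name, and the exact way the terms are connected are assumptions made for illustration and may differ from the SemNExT ontology.</p>
      <preformat>
```python
def zscore_provenance(run_id, code_id, input_ids):
    """Return (subject, predicate, object) triples for one z-score run.

    All ex: names are hypothetical placeholders.
    """
    activity = f"ex:{run_id}"
    code = f"ex:{code_id}"
    triples = [
        (activity, "rdf:type", "sn:Analysis"),
        (activity, "rdf:type", "stato:0000104"),  # transformation class
        (code, "rdf:type", "prov:Entity"),        # the R code
        (activity, "prov:used", code),
    ]
    for obs in input_ids:
        source = f"ex:{obs}"
        result = f"ex:{obs}-z"
        triples += [
            (source, "rdf:type", "qb:Observation"),
            (activity, "prov:used", source),
            (result, "rdf:type", "sn:ComputedObservation"),
            (result, "prov:wasGeneratedBy", activity),
            (result, "prov:wasDerivedFrom", source),
        ]
    return triples


triples = zscore_provenance("run1", "zscore-code", ["obs1", "obs2"])
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```
      </preformat>
      <p>Each computed observation points back both to the run that produced it and to the input observation it was derived from, which is the kind of trace that makes differences between implementations visible.</p>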
      <p>The resulting RDF products are then made available as a dataset using a
combination of the Vocabulary of Interlinked Datasets (VoID) and the
W3C-recommended Data Catalog Vocabulary (DCAT) and Data Cube Vocabulary.</p>
      <sec id="sec-5-1">
        <p>9 https://semnext.tw.rpi.edu/ontology/semnext#</p>
        <p>10 https://semnext.tw.rpi.edu/docs/STATO-extensions.html</p>
        <p>Databases</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Linking Statistical Analyses to External Resources</title>
      <p>
        The DatabaseSource object is used by SemNExT to interface with relational
databases. We briefly describe each database incorporated by SemNExT.
      </p>
      <p>
        Cortecon. The Cortecon project from the Neural Stem Cell Institute provides
gene read data for a stem cell culture differentiated into neurons during the first
77 days of development [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Ensembl. Ensembl is a product of EMBL-EBI and the Wellcome Trust Sanger
Institute to automatically annotate eukaryotic genomes.<sup>11</sup> SemNExT uses its
mappings from genes to transcripts to proteins to link different datasets.</p>
      <p>StringDB. StringDB is a protein-protein interaction (PPI) database containing
both known and predicted interactions based on literature text mining.<sup>12</sup>
SemNExT uses StringDB as one of two background PPI knowledge bases.</p>
      <p>Unified Medical Language System (UMLS). UMLS provides a metathesaurus
integrating healthcare and bioinformatics databases.<sup>13</sup> SemNExT uses UMLS as
a link set to bridge databases.</p>
      <p>
        6.2 Linked Data.
SemNExT integrates directly with a number of linked data resources via
the built-in SPARQLSource data source. The framework makes heavy use of
owl:sameAs links as well as inverse functional properties on identifiers to infer
sameness between entities in different datasets and drive the annotation process.
      </p>
      <p>
        Bio2RDF. We make use of a number of linked life science resources through
the Bio2RDF project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Of particular interest are the curated Uniprot Gene
Ontology Annotations and descriptions of KEGG pathways.
      </p>
      <p>
        ReDrugS. The Repurposing Drugs with Semantics (ReDrugS) project<sup>14</sup> is a rich
linked data resource that links Online Mendelian Inheritance in Man (OMIM),
iRefIndex, and DrugBank from Bio2RDF to identify likely paths for off-label
drug uses [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It provides a rich scoring mechanism based on how evidence is
obtained, e.g. via randomized controlled trials.
      </p>
      <p>
        Uniprot. We use the SPARQL endpoint provided by the Uniprot Consortium<sup>15</sup>
to obtain curated Gene Ontology annotations and expression data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>11 http://ensembl.org/</p>
      <p>12 http://string-db.org/</p>
      <p>13 http://www.nlm.nih.gov/research/umls/</p>
      <p>14 http://redrugs.tw.rpi.edu</p>
      <p>15 http://beta.sparql.uniprot.org/</p>
      <p>6.3 Linking strategies.
Reuse of existing identification schemes for linking is a primary goal of
SemNExT. Our URI schemes make extensive use of database identifiers for genes,
e.g., Entrez IDs and Ensembl IDs. However, a clear identification
scheme is not always available for a particular class. SemNExT therefore uses
three different strategies to perform text-based entity linking (e.g., by matching
rdfs:labels), the most trivial being exact text and substring matching.</p>
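      <p>The two simplest strategies named above, exact text matching and substring matching on labels, can be sketched as follows. This is an illustrative sketch rather than the SemNExT implementation: the function name and the ex: identifiers are hypothetical, and the case-insensitive comparison is an assumption. The third strategy is not described in the text and is not covered here.</p>
      <preformat>
```python
def match_labels(query, labels):
    """Return (exact, partial) lists of identifiers whose label matches query.

    labels maps an identifier to its rdfs:label text. Matching ignores case
    and surrounding whitespace; exact matches are not repeated as partial.
    """
    needle = query.strip().lower()
    exact, partial = [], []
    for identifier, label in labels.items():
        text = label.strip().lower()
        if text == needle:
            exact.append(identifier)
        elif needle in text:
            partial.append(identifier)
    return exact, partial


# Hypothetical identifiers and labels.
labels = {
    "ex:d1": "Holoprosencephaly",
    "ex:d2": "Holoprosencephaly 2",
    "ex:d3": "Autism",
}
exact, partial = match_labels("holoprosencephaly", labels)
```
      </preformat>
      <p>Keeping exact and partial matches separate lets a caller prefer an exact hit and fall back to substring candidates only when none exists.</p>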
      <p>The matching algorithm extensively uses type hierarchies across data sources
to limit the search space and return relevant matches. The SemNExT ontology
is a hand-curated ontology mapping that integrates relevant concepts from our
input data sources to enable this work.</p>
      <p>When multiple concepts in a dataset match an entity, the framework will
exploit broader relationships in an attempt to find a root, or archetype, node
encompassing all of the matched nodes. This allows us to answer with, for example,
a single disease when many genetic variations may exist. If the user provides a
specific variation name, the more general concept will not be used to override it.</p>
      <p>
        6.4 Enrichment analysis using linked data resources.
After the data sources have been linked, we can use the structured resource
to perform a gene set enrichment analysis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] against the Gene Ontology.<sup>16</sup>
This is accomplished by generating a log-odds matrix and computing the
p-value for each Gene Ontology term using a hypergeometric distribution. Such a
GO enrichment analysis helped reveal the role of the newly described Ectoderm
stage. Examining the enrichment of stages using genes with mutations causing
neurological diseases suggests hypotheses for how the disease pathologies arise
during corticogenesis.
      </p>
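      <p>The hypergeometric p-value for a single Gene Ontology term can be computed as in the following sketch, which covers only the p-value step and not the log-odds matrix. The function name, parameter names, and the counts in the example are hypothetical, and the one-sided upper-tail test shown here is an assumption about how enrichment is scored.</p>
      <preformat>
```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement.

    population: number of genes in the background set
    annotated:  background genes carrying the term
    sample:     size of the gene set under test
    hits:       genes in the test set carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 annotated, 4 drawn, 2 hits.
p = hypergeom_pvalue(population=20, annotated=5, sample=4, hits=2)
```
      </preformat>
      <p>For these counts the tail sums to 1205 of the 4845 possible draws, a p-value of roughly 0.249, so this toy gene set would not be considered enriched for the term.</p>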
      <p>We also attempted to perform a similar analysis on the Tissue Specificity
Annotations provided by Uniprot. Unfortunately, the annotations provided are
textual rather than URI-based, which leads to very poor results without
additional natural language processing.</p>
      <p>6.5 Linking Example: Septin-9.
Consider the gene Septin-9 (SEPT9), which is important in cell division
and whose mutations are implicated in some cancers. To build a network with this
gene, SemNExT begins by searching all of its datasets for the identifier SEPT9
(e.g., dc:identifier). It then queries for relations and attributes of SEPT9. For
example, the system finds SEPT9's Entrez ID (10801) in Cortecon and adds that
information to the gene's shared in-memory model. A UMLS database annotator
then executes because it can map the Entrez ID to the UMLS concept identifier
for SEPT9. Extracted information includes the fact that SEPT9 is part of the
process "Cytokinesis" and that it has an Ensembl Gene ID (ENSG00000184640).
Ensembl's annotator triggers on the addition of the Ensembl Gene ID to the
shared model, which allows the system to pick up the Gene Ontology annotations
provided by Ensembl, for example "cytoskeleton." This procedure continues until
no new information can be obtained from the background knowledge sources.</p>
      <p>16 http://geneontology.org/</p>
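      <p>The SEPT9 walk-through amounts to running annotators against a shared model until nothing new is added. The sketch below illustrates that loop under simplifying assumptions: the function name, the dictionary-based model, and the (required key, function) representation of an annotator are hypothetical, and the lambdas are toy lookups standing in for the Cortecon, UMLS, and Ensembl queries. The identifier values are the ones given in the example above.</p>
      <preformat>
```python
def annotate(entity, annotators):
    """Apply annotators until a full pass adds nothing new.

    entity is a dict acting as the shared model. Each annotator is a pair
    (required_key, function); the function runs once required_key is present
    and returns a dict of new facts. Existing facts are never overwritten.
    """
    changed = True
    while changed:
        changed = False
        for required_key, func in annotators:
            if required_key not in entity:
                continue
            for key, value in func(entity).items():
                if key not in entity:
                    entity[key] = value
                    changed = True
    return entity


# Toy annotators, deliberately listed in the "wrong" order.
annotators = [
    ("ensembl_id", lambda e: {"go_terms": ["cytoskeleton"]}),
    ("entrez_id", lambda e: {"process": "Cytokinesis",
                             "ensembl_id": "ENSG00000184640"}),
    ("symbol", lambda e: {"entrez_id": "10801"}),
]
model = annotate({"symbol": "SEPT9"}, annotators)
```
      </preformat>
      <p>Because the loop repeats until a pass makes no change, the result does not depend on the order in which annotators are registered; here three passes add facts and a fourth confirms that nothing new can be obtained.</p>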
    </sec>
    <sec id="sec-7">
      <title>Semantic Statistical Visualizations using JSON-LD</title>
      <p>SemNExT provides a RESTful interface that produces JSON-LD descriptions
of a genetic network. These descriptions drive visualizations such as our Chord and Heat
Map (CHeM) diagrams, implemented using D3.js,<sup>17</sup> for understanding how mutations
associated with disease may produce changes in cerebral cortex development.
Figure 3 shows a CHeM diagram for Holoprosencephaly, a disease in which the
brain fails to form two hemispheres. Genes are organized around the outside of
the diagram according to the transcriptomic clock revealed by SVD. The stages
of development found by FCM are shown by different colors in the inner band.
Chords between genes are PPIs extracted from structured sources of
protein-coding genes. The width of each gene varies with the number of connections.
The diagram's outer ring shows the normalized RNA-Seq counts of each gene as
a heatmap for each day of measurement, with day 0 closest to the center and
day 77 in the outermost ring.</p>
      <p>17 A demonstration is available at https://semnext.tw.rpi.edu/chords.html</p>
      <p>We chose to use JSON-LD as our primary serialization to exploit the D3
visualization toolkit while maintaining interoperability with semantic-web-capable
agents. The framework publishes JSON-LD contexts for its network
representation that make use of best-practice ontologies for linked statistics, including
STATO, RDF Data Cube, PROV-O, and the Gene Ontology.</p>
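      <p>To illustrate why JSON-LD suits both audiences, the sketch below builds a small document whose context maps plain keys to vocabulary terms. The context, the short key names, and every example.org URL are hypothetical and are not the contexts SemNExT publishes; only the PROV, Data Cube, and RDFS namespace URIs are the standard ones.</p>
      <preformat>
```python
import json

# Hypothetical context: the short keys and their mappings are illustrative.
context = {
    "prov": "http://www.w3.org/ns/prov#",
    "qb": "http://purl.org/linked-data/cube#",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "wasGeneratedBy": {"@id": "prov:wasGeneratedBy", "@type": "@id"},
    "nodes": {"@id": "http://example.org/semnext#node", "@container": "@set"},
}

document = {
    "@context": context,
    "@id": "http://example.org/network/1",
    "wasGeneratedBy": "http://example.org/activity/1",
    "nodes": [
        {"@id": "http://example.org/gene/SEPT9", "label": "SEPT9"},
    ],
}

serialized = json.dumps(document, indent=2)

# A visualization client can read the plain keys without knowing about RDF.
gene_labels = [node["label"] for node in json.loads(serialized)["nodes"]]
```
      </preformat>
      <p>A visualization script consumes the document as ordinary JSON, while a linked data client can expand the same keys through the context into full property URIs.</p>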
    </sec>
    <sec id="sec-8">
      <title>Evaluation</title>
      <p>We evaluated our integration by determining the overlap of concepts across the
data sources that SemNExT interfaces with. Table 1 presents a breakdown of
the number of concepts identified in each dataset. Queries to match genes and
proteins were limited to the species Homo sapiens. Coverage was computed by
pairwise selecting datasets containing a shared concept (disease, gene, protein,
directed and undirected protein interaction) by using the techniques identified
in Section 6 to infer owl:sameAs links. Intersections of all datasets sharing a
concept were computed in a similar fashion, except that the entity being linked
was required to exist in every dataset containing its concept to be counted.</p>
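      <p>The two coverage counts described above can be sketched with set operations, assuming that linking has already mapped every entity to one canonical identifier per owl:sameAs group. The function names and the identifiers g1 through g4 are hypothetical, and the dataset contents are invented for illustration, not taken from Table 1.</p>
      <preformat>
```python
from itertools import combinations


def pairwise_overlap(datasets):
    """Count shared canonical identifiers for every pair of datasets."""
    return {
        (a, b): len(datasets[a].intersection(datasets[b]))
        for a, b in combinations(sorted(datasets), 2)
    }


def full_overlap(datasets):
    """Identifiers that appear in every dataset."""
    return set.intersection(*datasets.values())


# Hypothetical gene identifiers after sameAs resolution.
datasets = {
    "Cortecon": {"g1", "g2", "g3"},
    "UMLS": {"g1", "g3", "g4"},
    "Ensembl": {"g1", "g4"},
}
pairs = pairwise_overlap(datasets)
shared_by_all = full_overlap(datasets)
```
      </preformat>
      <p>The pairwise counts correspond to the coverage between two datasets, and the full intersection corresponds to the stricter requirement that an entity exist in every dataset containing its concept.</p>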
      <p>We found that differentiation of classes in different resources was a challenge.
For example, Cortecon uses high-level concepts such as Disease that in UMLS are
decomposed further into concepts such as "Disease or Syndrome," "Congenital
Abnormality," and "Pathologic Function." Where appropriate, we attempted to
limit linking to relevant classes to improve search and integration responsiveness.</p>
      <p>Lastly, edge extraction for the chord diagrams uses any available data source
that asserts an edge between two proteins coded by genes. However, systems such
as ReDrugS that we incorporate have more robust scoring mechanisms validated
by biostatisticians that could further refine appropriate edge choices.</p>
    </sec>
    <sec id="sec-9">
      <title>Discussion</title>
      <p>The combination of statistics and semantics can provide robust error checking
and synchronicity between members of a project. While migrating the CHeM
visualizations from a Matlab version to the JSON-LD based API written in
Python, a project developer noted discrepancies in the values of z-scores between
the two implementations.</p>
      <p>Listing 1.1. An excerpt of a provenance trace from a SemNExT analysis where z-score
values were mismatched between Matlab and Python analyses.</p>
      <p>Comparing the provenance of the two methods (Listing 1.1) made it clear that
z-scores from the Matlab implementation were based on the corrected sample
standard deviation<sup>18</sup> whereas the Python implementation z-scores were based
on an uncorrected sample standard deviation.<sup>19</sup> This difference in behavior was
the result of Matlab's implementation defaulting to a normalization of N - 1
whereas NumPy's normalized by a default of N.</p>
      <p>Due to the use of manual ontology mappings between data sources, our approach
is limited in its direct applicability to other domains without a time investment
by knowledge representation experts. However, the mapping we generated
contained a limited number ( 100) of axioms so its construction was relatively short
(on the order of hours). Larger, more complex domains may face scaling issues
unless automated ontology mapping approaches are employed.</p>
      <p>18 https://www.mathworks.com/help/matlab/ref/std.html</p>
      <p>19 http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html</p>
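      <p>The effect of the two defaults can be reproduced with Python's standard library, where statistics.stdev divides by N - 1 (as Matlab's std does by default) and statistics.pstdev divides by N (as numpy.std does by default, with ddof=0). The values below are made up for illustration and are not taken from the Cortecon data.</p>
      <preformat>
```python
import statistics

# Hypothetical expression values for one gene at three time points.
values = [2.0, 4.0, 6.0]
mean = statistics.mean(values)

sd_sample = statistics.stdev(values)       # divides by N - 1
sd_population = statistics.pstdev(values)  # divides by N

z_sample = [(v - mean) / sd_sample for v in values]
z_population = [(v - mean) / sd_population for v in values]

print(z_sample)
print(z_population)
```
      </preformat>
      <p>With only three values the two conventions give z-scores of about 1.0 and 1.22 for the same observation, a discrepancy large enough to be noticed when comparing implementations. In NumPy, passing ddof=1 to numpy.std selects the N - 1 denominator.</p>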
      <p>Using relational databases is another limitation. Annotators accessing databases
must be hand-coded whereas SPARQL-based accesses make use of class
hierarchies, class mappings, owl:sameAs assertions, and reuse of URIs. We leave
exploration of tools to replace hand-coded SQL accesses (e.g., D2R) to future work
and note that more resources are becoming available via SPARQL.
</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusions &amp; Future Work</title>
      <p>We modeled statistical analyses of genomic data using best-in-class ontologies.
These statistical outputs were linked with additional structured resources using
linked open data techniques. The resulting RDF graphs were then visualized
using a combination of chord diagrams and heat maps. We evaluated our mapping
techniques by comparing individual mappings inferred by our API to the total
number of individuals across all datasets for a relevant selection of classes. Lastly,
we identified some benefits to being able to represent statistical and linked data
transformations and their provenance in RDF.</p>
      <p>We recommend a number of practices based on our experiences: (1)
Statistical resources should be modeled with the RDF Data Cube and annotated
with appropriate provenance and frameworks should take advantage of these
structured resources when available; (2) JSON-LD provides a balance between
the verbosity of existing semantic web serializations and the succinctness of
formats such as CSV to incorporate semantics into web-based visualizations; and
(3) Capturing of provenance of statistical operators, especially when such
operations may have differing default behavior between implementations, is important
as it enables replicability and can aid in debugging analyses.</p>
      <p>10.1 Future Work.
We are looking to expand the number of supported analyses and to apply the
SemNExT framework to datasets outside of bioinformatics. SemNExT currently
does not automatically reconcile concepts with adjective variations in labels
that tend to violate traditional stemming rules, for example "Spinal cancer"
in Cortecon compared with "Spine cancer" in UMLS. We intend to investigate
resources such as WordNet as an initial means of resolving these conflicts and will
explore other natural language processing techniques as necessary to improve the
framework's ability to link resources without identifiers.</p>
      <p>We are also interested in modeling the inputs and outputs of statistical
analyses using languages such as OWL-S.<sup>20</sup> This will enable linking of relevant
analyses at runtime rather than at code design time.</p>
      <p>20 http://www.w3.org/Submission/OWL-S/</p>
      <p>Mr. Patton was supported by an NSF Graduate Research Fellowship and RPI
internal funds. Ms. Brown, Ms. H. De Los Santos and Dr. Bennett were supported
by NSF Grant 1331023 and RPI internal funds.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ausbrooks</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buswell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlisle</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dalmas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devitt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Froumentin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ion</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohlhase</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Mathematical Markup Language (MathML) version 2.0</article-title>
          . Tech. rep.,
          <source>World Wide Web Consortium</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          ,
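          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>los Santos</surname>
            ,
            <given-names>H.D.</given-names>
          </string-name>
          ,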
          <string-name>
            <surname>Boles</surname>
            ,
            <given-names>N.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiehl</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patton</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Temple</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fasano</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>Temporal analysis of differentiating pluripotent stem cells using singular value decomposition</article-title>
          .
          <source>In Preparation (nd)</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Callahan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz-Toledo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ansell</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Bio2RDF release 2: Improved coverage, interoperability and provenance of life science linked data</article-title>
          .
          <source>In: The Semantic Web: Semantics and Big Data</source>
          , pp.
          <fpage>200</fpage>
          –
          <lpage>212</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Linked statistical data analysis</article-title>
          .
          <source>In: Proc. SemStats 2013</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The RDF data cube vocabulary</article-title>
          .
          <source>Tech. rep., W3C</source>
          (
          <year>2014</year>
          ), http://www.w3.org/TR/vocab-data-cube/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Ontologies and languages for representing mathematical knowledge on the semantic web</article-title>
          .
          <source>Semantic Web</source>
          <volume>4</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          –
          <lpage>158</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lebo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erickson</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>G.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DiFranzo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michaelis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flores</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shangguan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Producing and using linked open government data in the TWC LOGD portal</article-title>
          . In:
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (ed.)
          <source>Linking Government Data</source>
          . Springer, New York, NY (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lebo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>PROV-O: The PROV ontology</article-title>
          .
          <source>Tech. rep., W3C</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>van de Leemput</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boles</surname>
            ,
            <given-names>N.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiehl</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corneo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lederman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menon</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levi</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          , et al.:
          <article-title>CORTECON: a temporal transcriptome analysis of in vitro human cerebral cortex development from human embryonic stem cells</article-title>
          .
          <source>Neuron</source>
          <volume>83</volume>
          (
          <issue>1</issue>
          ),
          <fpage>51</fpage>
          –
          <lpage>68</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Ludascher,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Altintas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Berkley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Higgins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Jaeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.A.</given-names>
            ,
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Scientific workflow management and the Kepler system</article-title>
          .
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>18</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1039</fpage>
          –
          <lpage>1065</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solanki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erickson</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dordick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>A nanopublication framework for biological networks using cytoscape.js</article-title>
          .
          <source>In: Proc. 5th ICBO</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamayo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mootha</surname>
            ,
            <given-names>V.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukherjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebert</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gillette</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pomeroy</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golub</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lander</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          , et al.:
          <article-title>Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles</article-title>
          .
          <source>Proc. Natl Acad Sci USA</source>
          <volume>102</volume>
          (
          <issue>43</issue>
          ),
          <fpage>15545</fpage>
          –
          <lpage>15550</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <collab>UniProt Consortium</collab>
          , et al.:
          <article-title>The universal protein resource (UniProt)</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>36</volume>
          (
          <issue>suppl 1</issue>
          ),
          <fpage>D190</fpage>
          –
          <lpage>D195</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orozco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leglise</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodale</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batalov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodge</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janes</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huss</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          , et al.:
          <article-title>BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources</article-title>
          .
          <source>Genome Biol</source>
          <volume>10</volume>
          (
          <issue>11</issue>
          ),
          <elocation-id>R130</elocation-id>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>