<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Linked Data Representation for Summary Statistics and Grouping Criteria</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>s P. M</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>l Dumonti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shruthi Ch</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Rensselaer Polytechnic Institute.</institution>
          <addr-line>110 8th Street, Troy, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maastricht University.</institution>
          <addr-line>Minderbroedersberg 4-6, 6211 LK Maastricht</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of the Virgin Islands. Charlotte Amalie</institution>
          ,
          <addr-line>St. Thomas, USVI</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Summary statistics are fundamental to data science and are the building blocks of statistical reasoning. Most of the data and statistics made available on government web sites are aggregate; until now, however, no suitable linked data representation has been available for them. We propose a way to express summary statistics across aggregate groups as linked data using Web Ontology Language (OWL) class-based sets, where members of the set contribute to the overall aggregate value. Additionally, many clinical studies in the biomedical field rely on demographic summaries of their study cohorts and the patients assigned to each arm. While most data query languages, including SPARQL, allow for computation of summary statistics, they do not provide a way to integrate those values back into the RDF graphs they were computed from. We represent this knowledge, which would otherwise be lost, through the use of OWL 2 punning semantics, the expression of aggregate grouping criteria as OWL classes with variables, and constructs from the Semanticscience Integrated Ontology (SIO) and the World Wide Web Consortium's provenance ontology, PROV-O, providing interoperable representations that are well supported across the web of Linked Data. We evaluate these semantics using a Resource Description Framework (RDF) representation of patient case information from the Genomic Data Commons, a data portal from the National Cancer Institute.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Representation</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Provenance</kwd>
        <kwd>Summary Statistics</kwd>
        <kwd>Data Science</kwd>
        <kwd>Transparency</kwd>
        <kwd>Interoperability</kwd>
        <kwd>Data Exploration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Copyright 2019 for this paper by its authors (J. McCusker et al.). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        One of the most common forms of data analysis involves the use of basic
statistics over groups of things. Sums, counts, and averages provide the foundation
        for understanding data, and because of this, these aggregation functions form
the basis of statistics and statistical analysis that underpins much of modern
science. We provide a means to express aggregate facts about classes of entities
in a way that avoids issues with the open world assumption by asserting those
facts relative to specific graphs, and then closing those graphs using
cryptographic graph hash identities. This method can be used to automatically create
classes through user interaction with semantically-aware, aggregation-based data
exploration and analysis tools based on OnLine Analytical Processing (OLAP)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and statistical languages such as R [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        This paper provides background on existing work to formalize aggregation
criteria as OWL classes with URIs derived from those aggregation criteria using
the RDF graph digest algorithm RGDA1 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We extend it with a method for
asserting aggregate facts about those OWL classes using the Semanticscience
Integrated Ontology (SIO) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in order to provide support for realist
representations of scientific data and knowledge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These aggregate facts are treated as
attributes of the class, linking different aggregate measures (like count,
mean, and standard deviation) to specific kinds of attributes (like age and
survival). Identifiers for the source RDF graph, computed using the graph digest algorithm
RGDA1 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], provide, if used in the provenance of those aggregate facts,
a closed set of assertions that the aggregate facts were computed over. We
therefore provide a formal method to express the semantics of aggregate functions
over well-defined grouping criteria. This allows us to create summary graphs of
data that can be used in multiple contexts and introspected using reusable
software. These mappings are illustrated using patient case information from the
Genomic Data Commons (GDC) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a National Cancer Institute data portal
for reusable cancer data.
      </p>
      <sec id="sec-2-2">
        <title>Example Data Using Semanticscience Integrated Ontology</title>
        <p>The expression of summary statistics requires a vocabulary that includes support
for attributes, objects, and their interrelationships. The Semanticscience
Integrated Ontology (SIO) includes terminology
for reified roles, attributes (including quantities and qualities), time, and
processes. As an integrated ontology, it provides both an upper level framework
for semantics and detailed semantics for general science. SIO is often extended
for particular domains, and has been used effectively, for example, for modeling
scientific data using relationships between entities and attributes. The top-level
class of sio:entity<sup>4</sup> is subclassed by sio:object (continuants), sio:attribute
(characteristics of any entity), and sio:process (occurrents). All kinds of entities can
have attributes. The subtype of an attribute determines which attribute is
being characterized, and sio:'has value' relates literal forms of the value of the
attribute to it. Entities are related to their attributes using sio:'has attribute'.</p>
        <sec id="sec-2-2-1">
          <p><sup>4</sup> SIO properties and classes are expressed here using their labels, and quoted when necessary.</p>
          <p>
            We use case data from the Genomic Data Commons (GDC) [<xref ref-type="bibr" rid="ref11">11</xref>] for our examples, including patient demographics, diagnosis, and survival. This kind of data is representative of biomedical information and includes attributes with and without temporality, roles between entities, and some processes. GDC is particularly interesting because much of the exploration of data within the GDC is based on displaying aggregate statistics, even though there is no way to use those statistics outside of the user interfaces they provide. Our example can set us on a path to pre-computing aggregate statistics and providing not just visualizations of them, but also generalized API access to that information in new ways. While GDC does not provide examples of all possible RDF-based data, it is sufficient to illustrate our summary statistics approach. We also discuss below how to apply these approaches to all possible RDF by providing reification rules for both datatype and object properties. The data is provided as supplementary material as gdc cases.ttl [<xref ref-type="bibr" rid="ref16">16</xref>], and contains case information about the 33,549 patients that were available in the portal as of March 2019. The RDF data was converted from the cases API endpoint. The structure of the study information follows major classes and properties from SIO, as shown in Figure 1. We resolve data values to classes in NCI Thesaurus [<xref ref-type="bibr" rid="ref6">6</xref>] using the BioPortal Annotator [<xref ref-type="bibr" rid="ref13">13</xref>]. The resulting RDF is loaded into a triple store, along with the summary statistics, which are also represented in RDF.
          </p>
          <p>
            In order to create aggregate function semantics, we first have to formally define the grouping criteria. Fortunately, the grouping criteria used in OWL restrictions cover most cases. We therefore define an aggregate value as an attribute of an OWL Class. The aggregation semantics from Calvanese et al. [<xref ref-type="bibr" rid="ref2">2</xref>] effectively introduce variables into OWL class definitions. A conventional OWL class contains references to classes, properties, individuals, and literals:<sup>5</sup>
          </p>
          <preformat>Class: GDC_Subject
    EquivalentTo: sio:human
        and sio:'has role' some (sio:'subject role'
            and sio:'in relation to' some sio:investigation)</preformat>
          <p>This can be expressed as this SPARQL query:</p>
          <preformat>SELECT ?GDC_Subject WHERE {
  ?GDC_Subject a sio:SIO_000485 ;   # human
    sio:SIO_000228 [                # has role
      a sio:SIO_000883 ;            # study subject
      sio:SIO_000668 [              # in relation to
        a sio:SIO_000747            # investigation
      ]
    ] .
}</preformat>
          <p>When this class is applied to the sample data in Supplemental Materials it
results in 33,549 matches, one for each human:</p>
          <disp-formula><tex-math>\mathrm{GDC\_Subject} = \left\{ \begin{array}{l} \text{case:d4f90900-3b81-4015-8e11-4b4525345063} \\ \text{case:d52a195d-7d63-4eb6-81c2-3c473ba57979} \\ \ldots \end{array} \right\}</tex-math></disp-formula>
          <p>An aggregate query in Calvanese et al. is expressed as:</p>
          <disp-formula><tex-math>q(x, \alpha(y)) \leftarrow \phi</tex-math></disp-formula>
          <p>where x is a sequence of grouping variables, α(y) is the aggregation term, and φ is the query condition expressed in first order logic. We translate this into Manchester notation through the following template:</p>
          <preformat>Class: x
    SubClassOf: φ</preformat>
          <p><sup>5</sup> The following prefixes are used in all SPARQL, Manchester notation, and Turtle examples:</p>
          <preformat>rdf:     http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:    http://www.w3.org/2000/01/rdf-schema#
sio:     http://semanticscience.org/resource/
prov:    http://www.w3.org/ns/prov#
case:    http://example.com/gdc/case/
project: http://example.com/gdc/project/</preformat>
          <p>We will introduce y in Section 5, as Calvanese et al. do not provide a way to represent y in a knowledge graph. In order to more explicitly treat their uses of variables, we define a function G(g1, ..., gn) = x:</p>
          <preformat>Class: G(g1, ..., gn)
    SubClassOf: φ</preformat>
          <p>An aggregate query of study subjects by investigation can now be expressed as:</p>
          <preformat>Class: G(?x)
    SubClassOf: sio:human
        and sio:'has role' some (sio:'subject role'
            and sio:'in relation to' value ?x)</preformat>
          <p>The selection SPARQL query would look like this:</p>
          <preformat>SELECT ?GDC_Subject ?x WHERE {
  ?GDC_Subject a sio:SIO_000485 ;   # human
    sio:SIO_000228 [                # has role
      a sio:SIO_000883 ;            # study subject
      sio:SIO_000668 ?x             # in relation to
    ] .
  ?x a sio:SIO_000747               # investigation
}</preformat>
          <p>A class is defined for every matched value in the knowledge base:</p>
          <preformat>Class: G(case:FM-AD)
    EquivalentTo: sio:human
        and sio:'has role' some (sio:'subject role'
            and sio:'in relation to' value case:FM-AD)

Class: G(case:TARGET-NBL)
    EquivalentTo: sio:human
        and sio:'has role' some (sio:'subject role'
            and sio:'in relation to' value case:TARGET-NBL)</preformat>
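          <p>The generation of one class per matched grouping value can be sketched in Python. This is a minimal illustration, assuming the query results have already been flattened into (subject, investigation) pairs; the function names group_classes and manchester and the sample case identifiers are hypothetical and not part of the original work.</p>
          <preformat>
```python
from collections import defaultdict


def group_classes(pairs):
    """Group subjects by the value bound to ?x, one G(?x) class per value."""
    members = defaultdict(set)
    for subject, investigation in pairs:
        members[investigation].add(subject)
    # Name each generated class after its grouping value, as in G(case:FM-AD).
    return {"G(%s)" % value: subjects for value, subjects in members.items()}


def manchester(value):
    """Render the Manchester-syntax definition for one grouping value."""
    return (
        "Class: G(%s)\n"
        "    EquivalentTo: sio:human\n"
        "        and sio:'has role' some (sio:'subject role'\n"
        "            and sio:'in relation to' value %s)" % (value, value)
    )


# Hypothetical query results: (subject, investigation) bindings.
pairs = [
    ("case:a1", "case:FM-AD"),
    ("case:a2", "case:FM-AD"),
    ("case:b1", "case:TARGET-NBL"),
]
classes = group_classes(pairs)
print(manchester("case:FM-AD"))
```
          </preformat>
          <p>Each key of classes names a generated class and each value holds its members, which is the information the rdf:type statements would record.</p>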
          <p>The members of each grouping criterion G(g1, ..., gn) are therefore members
of the generated class G(), and the RDF graph that describes these summary
statistics can provide rdf:type statements for each member as provenance. There
are 45 different investigations in GDC in total; above we show the three with the
most subjects. These variables can replace classes, properties, and individuals,
and can be mixed in with non-variable criteria, as was shown in the above
example. Calvanese et al. discuss the computation of the aggregate operations
MIN, MAX, COUNT DISTINCT, SUM, and AVG (mean), but do not specify
how the values relate to the computed classes. The following sections relate the
work done by Calvanese et al. to a complete representation of both the grouping
criteria and aggregate statistics in RDF. This includes methods for computable
URIs for the grouping criteria classes G(), how to provide aggregate statistics
on all G() using sio:'has attribute', and how to link those aggregate statistics to
their source graphs.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Computable URIs for Grouping Criteria</title>
        <p>We use the following method for computing URIs for G, which allows for
alignment of grouping criteria independent of the source graph or individual naming
schemes. Take the concise bounded description (CBD, or C()), minus
annotations, of G in a separate RDF graph C(G), where G itself is represented as a
blank node. A CBD consists of the direct connections of a resource in the graph along
with the transitive direct connections of any blank nodes in the description.
Compute the graph digest of C(G) and use the digest value to rewrite the URI
for G in the original graph. Using the CBD means that the URI should be computable
in any case. The URI will be different if the inferred closure of statements on
G is computed instead of the minimal grouping criteria, so users will need to
maintain consistency there.</p>
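        <p>The extraction of C(G) can be sketched as follows. This is a minimal illustration over plain (subject, predicate, object) tuples, assuming blank nodes are written with a "_:" prefix; the helper names, the choice of annotation properties, and the sample class ex:G are hypothetical. The digest step itself is not shown and is left to an RGDA1 implementation.</p>
        <preformat>
```python
def is_blank(node):
    """Blank nodes are written with the conventional '_:' prefix here."""
    return isinstance(node, str) and node.startswith("_:")


def concise_bounded_description(triples, resource):
    """Collect the triples with `resource` as subject, then follow
    blank-node objects transitively."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))
    cbd, seen, todo = set(), set(), [resource]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for triple in by_subject.get(node, []):
            cbd.add(triple)
            if is_blank(triple[2]):
                todo.append(triple[2])
    return cbd


def drop_annotations(triples, annotation_properties=("rdfs:label", "rdfs:comment")):
    """Remove annotation triples so that labels do not change the digest."""
    return {t for t in triples if t[1] not in annotation_properties}


def replace_node(triples, old, new):
    """Rewrite G as a blank node so the digest is independent of its name."""
    def swap(n):
        return new if n == old else n
    return {(swap(s), p, swap(o)) for s, p, o in triples}


# Hypothetical grouping class with two nested restrictions.
graph = [
    ("ex:G", "rdf:type", "owl:Class"),
    ("ex:G", "rdfs:label", "subjects of FM-AD"),
    ("ex:G", "rdfs:subClassOf", "_:r1"),
    ("_:r1", "owl:onProperty", "sio:SIO_000228"),
    ("_:r1", "owl:someValuesFrom", "_:r2"),
    ("_:r2", "owl:onProperty", "sio:SIO_000668"),
    ("_:r2", "owl:hasValue", "case:FM-AD"),
    ("case:FM-AD", "rdf:type", "sio:SIO_000747"),
]
cbd = concise_bounded_description(graph, "ex:G")
c_g = replace_node(drop_annotations(cbd), "ex:G", "_:G")
```
        </preformat>
        <p>The resulting c_g holds only the class definition and its restrictions; statements about the named individual case:FM-AD are not followed because it is not a blank node.</p>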
      </sec>
      <sec id="sec-2-5">
        <title>Generating Facts About Classes using Aggregate Attributes</title>
        <p>
          We use OWL 2 punning [
          <xref ref-type="bibr" rid="ref10 ref23">23, 10</xref>
          ] to provide natural metamodeling of aggregate
attributes of classes and to reify non-SIO-based triples into a SIO-compatible
format. "Punning" here refers to the ability to separate OWL Classes, Properties,
and Individuals, even if they share the same URI. For instance, giving an OWL class
a meta-type in addition to owl:Class would place that ontology in OWL-Full, as
would any facts (as opposed to annotations).
        </p>
        <p>Once a grouping operation has been performed over a set of data to produce
owl:Classes, it is now possible to compute aggregate functions over the members
of the class. The aggregate functions available in SIO include mean, median,
standard deviation, count, mode, minimal and maximal values, and can be
extended with new statistical concepts using similar representation patterns. This
is accomplished by using the following steps, with predicates mapped to OWL
properties as shown in Table 1:</p>
        <list list-type="order">
          <list-item><p>Define an OWL class that is a subclass of the aggregation criteria, as shown in Section 3.</p></list-item>
          <list-item><p>Pun the owl:Class to an owl:Individual in order to make assertions about it. In RDF nothing needs to be explicitly done to do this.</p></list-item>
          <list-item><p>If the predicate is not a subproperty of sio:'is related to', reify the predicate of the values being aggregated using the following rules:</p></list-item>
        </list>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\begin{array}{l} \forall S, P, O:\ P(S, O) \land P \not\sqsubseteq \mathrm{rel} \land P \in \mathrm{DatatypeProperty} \Rightarrow \exists A:\ A \in P \land \mathrm{attr}(S, A) \land \mathrm{val}(A, O) \\ \forall S, P, O:\ P(S, O) \land P \not\sqsubseteq \mathrm{rel} \land P \in \mathrm{ObjectProperty} \Rightarrow \exists A:\ A \in P \land \mathrm{role}(S, A) \land \mathrm{to}(A, O) \end{array}</tex-math>
        </disp-formula>
        <p>Note that this operation is simply mapping a non-SIO property into the SIO
framework. Punning is not needed for native SIO attributes.</p>
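        <p>The two reification rules can be sketched as a small Python function. This is an illustration under stated assumptions: statements are plain tuples, the caller supplies whether the property is a datatype or object property and which properties are already SIO relations, and the names reify, fresh_node, and ex:ageAtDiagnosis are hypothetical.</p>
        <preformat>
```python
import itertools

_counter = itertools.count(1)


def fresh_node():
    """Mint a new blank node for the reified attribute or role A."""
    return "_:a%d" % next(_counter)


def reify(s, p, o, property_kind, sio_properties):
    """Apply the reification rules to a single statement P(S, O)."""
    if p in sio_properties:
        # P is already a subproperty of sio:'is related to'; keep it as-is.
        return [(s, p, o)]
    a = fresh_node()
    typed = (a, "rdf:type", p)  # A is an instance of the class punned from P
    if property_kind == "datatype":
        return [typed, (s, "sio:'has attribute'", a), (a, "sio:'has value'", o)]
    if property_kind == "object":
        return [typed, (s, "sio:'has role'", a), (a, "sio:'in relation to'", o)]
    raise ValueError("unknown property kind: %r" % property_kind)


# Hypothetical non-SIO datatype property carrying an age in days.
sio_properties = {"sio:'has attribute'", "sio:'has role'"}
triples = reify("case:a1", "ex:ageAtDiagnosis", 21582, "datatype", sio_properties)
untouched = reify("case:a1", "sio:'has role'", "_:r", "object", sio_properties)
```
        </preformat>
        <p>The property URI is reused as the type of the new node, which is where punning comes in: the same identifier acts as a property in the source data and as a class in the reified form.</p>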
        <p>We can now determine a way to represent the aggregation term α(y). This is
expressed in RDF where y is an attribute of G(g1, ..., gn), and each α(y) is the
attribute of type α:</p>
        <disp-formula><tex-math>\forall G, \alpha(y)\ \exists A \in \alpha, Y \in y:\ \mathrm{attr}(G, Y) \land \mathrm{attr}(Y, A) \land \mathrm{val}(A, \alpha(y))</tex-math></disp-formula>
        <p>Following the data in the Supplemental Materials, when G(case:TCGA-BRCA)
is defined as an aggregate as expressed above:</p>
        <preformat>Class: G(case:TCGA-BRCA)
    SubClassOf: sio:human
        and sio:'has role' some (sio:'subject role'
            and sio:'in relation to' value case:TCGA-BRCA)</preformat>
        <p>the aggregate facts can then be asserted about G(case:TCGA-BRCA) in this
way:</p>
        <preformat>G(case:TCGA-BRCA) sio:'has attribute'
    [ a sio:count ; sio:'has value' 1098 ] ,
    [ a sio:age ;
      sio:'has attribute'
        [ a sio:mean ; sio:'has value' 21582 ] ,
        [ a sio:'maximal value' ; sio:'has value' ... ] ,
        [ a sio:'minimal value' ; sio:'has value' ... ]
    ] .</preformat>
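        <p>The nesting of attr(G, Y), attr(Y, A), and val(A, α(y)) can be sketched in Python as follows. This is a minimal illustration that emits plain tuples rather than RDF; the function name aggregate_attributes, the blank node labels, and the sample ages are hypothetical.</p>
        <preformat>
```python
from statistics import mean


def aggregate_attributes(group_class, attribute_type, values):
    """Build count, mean, minimum and maximum facts for one grouping class."""
    y = "_:y"
    triples = [
        # The count is attached directly to the class G.
        (group_class, "sio:'has attribute'", "_:count"),
        ("_:count", "rdf:type", "sio:count"),
        ("_:count", "sio:'has value'", len(values)),
        # Y is the kind of attribute being summarized (for example age).
        (group_class, "sio:'has attribute'", y),
        (y, "rdf:type", attribute_type),
    ]
    measures = {
        "sio:mean": mean(values),
        "sio:'minimal value'": min(values),
        "sio:'maximal value'": max(values),
    }
    for i, (measure, value) in enumerate(measures.items()):
        a = "_:a%d" % i
        # Each aggregate A is an attribute of Y and carries the computed value.
        triples += [
            (y, "sio:'has attribute'", a),
            (a, "rdf:type", measure),
            (a, "sio:'has value'", value),
        ]
    return triples


# Hypothetical ages in days for three members of one grouping class.
facts = aggregate_attributes("G(case:TCGA-BRCA)", "sio:age", [20000, 21000, 25000])
```
        </preformat>
        <p>Further measures, such as median or standard deviation, would follow the same pattern by adding entries to measures.</p>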
        <p>By providing formal representations for aggregations, it becomes possible to
formally define them using grouping criteria. Additional facts can be provided
about each set through aggregate functions, which can be extended with more
sophisticated statistical functions. Since each aggregation becomes a defined and
denoted thing, it becomes possible to provide the provenance of those definitions,
which would include the members of the class and the aggregate query used to define
it. These classes and facts about these classes can now be defined automatically
using grouping functions and instances. Because of this, OLAP-like tools can
use and generate assertions about the aggregate sets that they produce through
user interaction, since OLAP relies on GROUP BY, filtering criteria, and
aggregation functions. These classes can then be subjected to statistical analysis,
and the definitions can be re-applied to further datasets for hypothesis testing
or published as nanopublications.</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Predicate</th><th>Property</th><th>Definition</th></tr>
            </thead>
            <tbody>
              <tr><td>rel()</td><td>sio:'is related to'</td><td>A is related to B iff there is some relation between A and B.</td></tr>
              <tr><td>attr()</td><td>sio:'has attribute'</td><td>A relation that associates an entity with an attribute where an attribute is an intrinsic characteristic such as a quality, capability, disposition, function, or is an externally derived attribute determined from some descriptor.</td></tr>
              <tr><td>val()</td><td>sio:'has value'</td><td>A relation between an informational entity and its actual value (numeric, date, text, etc).</td></tr>
              <tr><td>role()</td><td>sio:'has role'</td><td>A relation between an entity and a role that it bears.</td></tr>
              <tr><td>to()</td><td>sio:'in relation to'</td><td>A comparative relation to indicate that the instance of the class holding the relation exists in relation to another entity.</td></tr>
              <tr><td>der()</td><td>prov:wasDerivedFrom</td><td>A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-10">
      <sec id="sec-10-1">
        <title>Closing Graphs Over Aggregates</title>
        <p>
          Aggregation techniques face a number of challenges when dealing with the open world assumption. First, aggregation functions assume that the available data is complete and whole. For instance, asserting that a class has ten instances implicitly means that ten known instances have been counted, but that number could be higher (because of unknown instances). Additionally, because of the non-unique naming assumption, if those instances are actually identical but not known to be, then the class may actually have fewer than ten instances. Expressing aggregate values is useful, but the open world assumption prevents any final conclusions about aggregate statistics, because there can always be more facts to be discovered. We need to close the graph to additional statements. A number of approaches have been proposed to compute the content digest of an RDF graph [<xref ref-type="bibr" rid="ref15 ref17 ref20 ref3">20, 3, 17, 15</xref>], and an implementation of [<xref ref-type="bibr" rid="ref17">17</xref>] has been published as part of RDFlib [<xref ref-type="bibr" rid="ref18">18</xref>]. The algorithm in [<xref ref-type="bibr" rid="ref17">17</xref>] has the additional benefit of efficiently computing reproducible identifiers for blank nodes within the graph, producing stable identifiers for all RDF graphs. The approach in [<xref ref-type="bibr" rid="ref15">15</xref>] uses nanopublications to provide a mechanism for referencing RDF graph content by URI similar to the approach in [<xref ref-type="bibr" rid="ref17">17</xref>], but its approach to blank node skolemization (by providing a UUID for each blank node) means that the graph identifiers are not consistent. When different graph identifiers are created for the same graph, it becomes impossible to verify that graphs from different sources are the same. By encoding the graph digest as part of a URI scheme, the aggregate attributes can be encoded with a prov:wasDerivedFrom link to the digest-identified graph:
        </p>
        <disp-formula><tex-math>\forall G, \alpha(y), N\ \exists A \in \alpha, Y \in y:\ \left( \mathrm{attr}(G, Y) \land \mathrm{attr}(Y, A) \land \mathrm{val}(A, \alpha(y)) \land \mathrm{der}(A, N) \right)</tex-math></disp-formula>
        <p>This allows for the reuse of the classes G across aggregations, while providing
evidence for the computation of A and Y in specific graphs N. The grouping criteria
on G can be re-applied to other datasets for further validation. For instance,
a new graph can either be merged with the original, or computed separately
for independent validation. Members of each G() can be classified using OWL
inference (due to the equivalence restrictions), and aggregation across those sets
can be computed within the new members. Each aggregate value would, because
of its derivation statement, be traceable to a specific RDF graph, and changes
in the input can be validated simply by comparing URIs of the derivations to
determine the reproducibility and/or repeatability of the aggregate value.</p>
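        <p>The idea of a digest-identified source graph can be sketched in Python. The digest below is a deliberately simplified stand-in and is not RGDA1: it hashes sorted statements and is only order-independent for graphs without blank nodes, which RGDA1 handles and this sketch does not. The function names and the urn:example URI prefix are hypothetical.</p>
        <preformat>
```python
import hashlib


def ground_graph_digest(triples):
    """SHA-256 over the sorted statements of a graph WITHOUT blank nodes.

    Simplified stand-in, not RGDA1: sorting gives a stable order only
    because every node already has a stable name.
    """
    lines = sorted("%s %s %s ." % t for t in triples)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def derived_from(aggregate_node, triples, prefix="urn:example:graph:sha256:"):
    """Link an aggregate attribute A to the digest-identified graph N."""
    graph_uri = prefix + ground_graph_digest(triples)
    return (aggregate_node, "prov:wasDerivedFrom", graph_uri)


# Hypothetical source graph with two typed subjects.
g1 = [
    ("case:a1", "rdf:type", "sio:human"),
    ("case:a2", "rdf:type", "sio:human"),
]
link = derived_from("_:mean", g1)
same = derived_from("_:mean", list(reversed(g1)))
changed = derived_from("_:mean", g1 + [("case:a3", "rdf:type", "sio:human")])
```
        </preformat>
        <p>Here link and same carry the same graph URI, while changed does not, so comparing derivation URIs is enough to tell whether two aggregate values were computed over the same statements.</p>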
      </sec>
      <sec id="sec-10-3">
        <title>Related Work</title>
        <p>
          Aggregation semantics in logic programming is well researched [<xref ref-type="bibr" rid="ref1 ref21 ref22">22, 21, 1</xref>], but in many ways these works do not address how to integrate the output of aggregate functions into a knowledge representation that integrates with the original data. The HiFun query language provides the means to express analytic queries directly through a relational algebra, but does not provide a formalized knowledge representation in the process. The RDF Data Cube Vocabulary, or QB, expresses entities-in-context as information artifacts, instead of as attributes of entities [<xref ref-type="bibr" rid="ref5">5</xref>]. It has been used in a number of cases to create specialized representations of summary statistics [<xref ref-type="bibr" rid="ref14 ref19 ref9">14, 19, 9</xref>]. The formalisms of aggregation are not clear with this approach, as the aggregate values are not expressed in RDF. Additionally, the RDF Data Cube Vocabulary treats statistical data as information artifacts, and does not support realist representations of scientific data and knowledge [<xref ref-type="bibr" rid="ref8">8</xref>].
        </p>
      </sec>
      <sec id="sec-10-4">
        <title>Evaluation: Exploring the Genomic Data Commons</title>
        <p>In most publications, aggregate statistics are expressed in human-readable tables
or in figures. While human readers can conceptually make the connection back
to the input data that has been aggregated, no internal representation
exists, so the connection between source and aggregate is not maintained and
cannot be used. Using our knowledge representation, we can query and persist
those statistics as linked data that integrates with the original data. Using the
supplementary materials script create summaries.ipynb, we were able to compute
count statistics and age mean, min, and max by diagnosis<sup>6</sup>. The source data is
contained in gdc cases.nt. The class URIs for these aggregates are recomputable
from their definitions, and we are able to construct summary visualizations using
the graphs. Figure 2 contains summary counts for the top 20 diagnoses in GDC,
retrieved from a summary semantics graph. This figure was generated from the
RDF graph in the supplementary materials file age by diagnosis.ttl.</p>
        <p>[Figure 2: bar chart of # of cases (axis from 0 to 5,000) by diagnosis. Diagnoses shown: Adenocarcinoma; Carcinoma; Squamous Cell Carcinoma; Ductal Breast Carcinoma; Endometrioid Adenocarcinoma; Glioblastoma; Serous Cystadenocarcinoma; Gastric Papillary Adenocarcinoma; Melanoma; Non-Small Cell Carcinoma; Diffuse Large B-Cell Lymphoma; Acinar Cell Carcinoma; Neuroendocrine Carcinoma; Small Cell Carcinoma; Papillary Carcinoma; Mucinous Adenocarcinoma; Thymoma; Adult Cholangiocarcinoma; Cervical Adenocarcinoma; Acute Myeloid Leukemia Not Otherwis…]</p>
        <p><sup>6</sup> age by diagnosis.ttl</p>
        <sec id="sec-10-4-1">
          <title>Description Logic Complexity</title>
          <p>The Description Logic (DL) complexity of G(g1, ..., gn) is not impacted by
the way we express these classes, since they have no direct semantics in DL,
and the DL-based definitions of each class are used to compute a URI for
each G(). The generated G(g1, ..., gn) classes are themselves in ALE. SIO is
in SRIQ(D). Additionally, we were able to use Pellet to perform a full inference of
age by diagnosis.ttl with SIO without any inconsistencies or errors.</p>
        </sec>
        <sec id="sec-10-4-2">
          <title>Overhead</title>
          <p>The storage overhead of our knowledge representation is fairly limited. For
instance, in our GDC dataset, expressing age by diagnosis for the entire dataset
required 4,992 statements using 304 classes, or about 16 statements (with
RDFS labels mixed in) per class. When using multiple grouping criteria, the
number of expressed classes will expand geometrically with the number of
combined class criteria. The underlying data contained 8,048,537
statements, so the summary is smaller by a factor of roughly 1,600.</p>
        </sec>
      </sec>
      <sec id="sec-10-5">
        <title>Conclusions</title>
        <p>The approach of OWL representations for grouping criteria plus SIO-based
attributes for the summary statistics is a natural extension of both representations,
and makes interoperability within Linked Data much easier. Assertions of
aggregate information about groups of entities can now be formally expressed in
ways that are traceable to their source. In some cases, these assertions can even
be computed based on the available summary statistics of the data at hand.
This improves the ability of researchers to build facts and hypotheses about
their entities of interest from their data. This method builds on existing
ontologies in provenance (e.g. PROV-O) and eScience (e.g. SIO) and results in
assertions with justifiable explanations. These assertions and their explanations
can be published as nanopublications. The proposed knowledge representation
is extensible across any aggregate functions that can be applied to defined sets
of entities. This paper also provides a way of computing URIs for classes based
on their necessary and sufficient conditions using graph digests of those OWL
restrictions, making access to those parameterized classes stable across datasets.
Finally, use of these aggregate semantics enables the use of existing analytical
tools to generate explainable and exportable assertions about the data being
analyzed and produce grouping criteria that can be re-applied to other datasets
for further validation.</p>
      </sec>
      <sec id="sec-10-6">
        <title>Acknowledgements</title>
        <p>Thank you to James Michaelis and John Erickson for feedback and examples.
This work is supported by IBM Research AI through the AI Horizons Network.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Afrati</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolaitis</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          :
          <article-title>Answering aggregate queries in data exchange</article-title>
          .
          <source>In: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</source>
          . pp.
          <fpage>129</fpage>
          –
          <lpage>138</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kharlamov</surname>
          </string-name>
          , E.:
          <article-title>Aggregate Queries Over Ontologies</article-title>
          .
          <source>Proceedings of the 2nd international workshop on Ontologies and information systems for the semantic web</source>
          (
          <year>2008</year>
          ), http://dl.acm.org/citation.cfm?id=1458484.1458500
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stickler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Named graphs, provenance and trust</article-title>
          .
          <source>In: Proceedings of the 14th international conference on World Wide Web</source>
          . pp.
          <fpage>613</fpage>
          –
          <lpage>622</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Codd</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Codd</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Providing OLAP (on-line analytical processing)</article-title>
          .
          <source>Codd and Date</source>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The RDF data cube vocabulary</article-title>
          .
          <source>W3C recommendation</source>
          ,
          W3C (Jan
          <year>2014</year>
          ), http://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>De Coronado</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haber</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sioutos</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuttle</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>L.W.</given-names>
          </string-name>
          , et al.:
          <article-title>NCI thesaurus: using science-based terminology to integrate cancer research results</article-title>
          . In: Medinfo. pp.
          <fpage>33</fpage>
          –
          <lpage>37</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baran</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callahan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chepelev</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz-Toledo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Rio</surname>
            ,
            <given-names>N.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlong</surname>
            ,
            <given-names>L.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keath</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al.:
          <article-title>The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>14</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoehndorf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Realism for scientific ontologies</article-title>
          .
          <source>In: FOIS</source>
          . pp.
          <fpage>387</fpage>
          –
          <lpage>399</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ermilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Linked open data statistics: Collection and exploitation</article-title>
          .
          <source>In: International Conference on Knowledge Engineering and the Semantic Web</source>
          . pp.
          <fpage>242</fpage>
          –
          <lpage>249</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>OWL 2: The next step for OWL</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <fpage>309</fpage>
          –
          <lpage>322</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferretti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varmus</surname>
            ,
            <given-names>H.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowy</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kibbe</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staudt</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          :
          <article-title>Toward a shared vision for cancer genomic data</article-title>
          .
          <source>New England Journal of Medicine</source>
          <volume>375</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1109</fpage>
          –
          <lpage>1112</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ihaka</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gentleman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>R: A language for data analysis and graphics</article-title>
          .
          <source>Journal of Computational and Graphical Statistics</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>299</fpage>
          –
          <lpage>314</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jonquet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>N.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The open biomedical annotator</article-title>
          .
          <source>Summit on translational bioinformatics</source>
          <volume>2009</volume>
          ,
          <fpage>56</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Kampgen,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>ORiain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Interacting with statistical linked data via OLAP operations</article-title>
          .
          <source>In: Extended Semantic Web Conference</source>
          . pp.
          <fpage>87</fpage>
          –
          <lpage>101</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data</article-title>
          .
          <source>In: European semantic web conference</source>
          . pp.
          <fpage>395</fpage>
          –
          <lpage>410</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Supplementary Materials for A Linked Data Representation for Summary Statistics and Grouping Criteria</article-title>
          (
          <year>2019</year>
          ). https://doi.org/10.7910/DVN/OK0BUG
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>WebSig: a digital signature framework for the web</article-title>
          .
          <source>Ph.D. thesis</source>
          , Rensselaer Polytechnic Institute (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. RDFLib Team:
          <source>rdflib 4.2.2</source>
          (
          <year>2013</year>
          ), https://rdflib.readthedocs.io, accessed 4/1/2019
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Salas</surname>
            ,
            <given-names>P.E.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Da Mota</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breitman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Publishing statistical data on the web</article-title>
          .
          <source>In: 2012 IEEE Sixth International Conference on Semantic Computing</source>
          . pp.
          <fpage>285</fpage>
          –
          <lpage>292</lpage>
          . IEEE
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sayers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karp</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Computing the digest of an RDF graph</article-title>
          .
          <source>Mobile and Media Systems Laboratory</source>
          , HP Laboratories, Palo Alto, USA,
          <source>Tech. Rep. HPL-2003-235 1</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sudarshan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beeri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Extending the well-founded and valid semantics for aggregation</article-title>
          .
          <source>In: ILPS</source>
          . pp.
          <fpage>590</fpage>
          –
          <lpage>608</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Van Gelder</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The well-founded semantics of aggregation</article-title>
          .
          <source>In: Proceedings of the eleventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems</source>
          . pp.
          <fpage>127</fpage>
          –
          <lpage>138</lpage>
          . ACM
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golbreich</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>OWL 2 Web Ontology Language New Features and Rationale (Second Edition)</article-title>
          .
          <source>W3C recommendation</source>
          ,
          W3C (Dec
          <year>2012</year>
          ), http://www.w3.org/TR/2012/REC-owl2-new-features-20121211/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>