=Paper=
{{Paper
|id=Vol-89/paper-7
|storemode=property
|title=Representing Contextualized Data using Semantic Web Tools
|pdfUrl=https://ceur-ws.org/Vol-89/macgregor-et-al.pdf
|volume=Vol-89
|dblpUrl=https://dblp.org/rec/conf/psss/MacGregorK03
}}
==Representing Contextualized Data using Semantic Web Tools==
https://ceur-ws.org/Vol-89/macgregor-et-al.pdf
Representing Contextualized Data using Semantic Web
Tools
Robert MacGregor
(Information Sciences Institute, University of Southern California, U.S.A.
macgregor@isi.edu)
In-Young Ko
(Information Sciences Institute, University of Southern California, U.S.A.
iko@isi.edu)
Abstract: RDF-based tools promise to provide a base for reasoning about metadata and about
situated data—data describing entities situated in time and space—that is superior to alternatives
such as relational databases or object-oriented databases. However, essential representational
machinery is missing from the current generation of Semantic Web tools and languages. When that
machinery is added, the resulting capabilities offer a combination of novelty and flexibility that
may usher in a wave of commercial Semantic Web tool-based applications that precedes the true
arrival of the Semantic Web. We have constructed a system, the Semantic Engineering Workbench
(SEW), that is proficient at managing situated data. Achieving a practical implementation
necessitated extending the basic RDF tools (Hewlett-Packard’s Jena and Stanford’s Protégé) to
support contexts. In the SEW, a context references a set of statements having common spatial,
temporal (and other metadata) attributes. We investigated multiple possible implementations of
contexts, and found significant drawbacks in the most common approaches. The clear winners are
quads (adding a fourth field of type ‘context’ to each triple results in a quadruple, or quad), and
object-oriented contexts (a context mechanism that references individuals instead of statements).
Most existing Semantic Web tools (e.g., Jena and Protégé) do not understand contextualized data.
For these tools, object-oriented contexts provide an elegant solution. We invented a new semantic
primitive, called ‘theRealThing’, that is a generalization of the ‘owl:sameIndividualAs’ property. If
and hold, then e1 and e2 are distinct resources (having
different sets of attributes) that denote the same real-world entity. The SEW uses the
‘theRealThing’ property to automatically generate abstractions of related sets of resources. Our
CHIME visualization tool utilizes the SEW to generate a continuous stream of abstracted entities
representing summarizations of spatio-temporally situated entities. CHIME offers a preview of
novel capabilities enabled by Semantic Web technology.
Categories: H.4.0 – General Information Systems Applications, H.3 – Information Storage and
Retrieval
1 Introduction
The n-Dimensional Information Management project at the University of Southern
California’s Information Sciences Institute is using Semantic Web tools as the base
representation technology for a data visualization project called CHIME
(www.isi.edu/chime/). CHIME imports spatio-temporally situated data from multiple
data sources, normalizes it, and displays it in multiple ways: as map overlays, in
event maps, and in a contextualized tabular display.
CHIME datasets naturally separate into two classes of data, which intelligence analysts
often call “internal” and “external.” Internal data is the “ordinary” kind—entities,
relationships between entities, and attribute values; external data is metadata such as
author, source, observation date, entry date, location, etc. Most information representation
systems are very clumsy at representing external data. We would like to claim that RDF
[Brickley and Guha 03] is well-adapted for representing external data. Unfortunately, this
is not the case—we found it necessary to build a fairly sophisticated representation layer
above RDF to achieve a satisfactory match between language and requirements. Our
extensions include contexts, and a generalization of the owl:sameIndividualAs
property.
We use the term contextualized to refer to sets of data attributes that vary according to
the context in which they are viewed (examples of contextualized data include data that
changes across time, or data that changes according to a security setting). Ordinary
provenance data (e.g., author, creation date) is not normally contextualized. One of our
key findings was that contextualized data is strictly harder to represent than provenance
data (this is a practical result, not a theoretical one). We will show how commonly used
conventions for representing provenance data fail miserably when representing
contextualized data.
In this paper, we will examine various aspects of the RDF language, to see what’s
useful and where RDF falls short. Contexts are a continuing subject of debate within the
RDF community: they are generally regarded as important, if not essential, for many
applications, but there are many different ways to represent them. Here we will try to
separate out some of the good ideas from the bad.
Section 2 provides an example of a query over temporally-situated (contextualized)
data. Section 3 examines several different forms of contexts. Section 4 contains a brief
advertisement for quads. Section 5 introduces the theRealThing predicate and the
notion of a snapshot, and shows how they can be used to define an alternate form of
context. Section 6 illustrates how snapshots are used to automatically compute
abstractions of entities. Section 7 provides some background on the SEW architecture.
Section 8 summarizes our conclusions.
2 An Example of Contextualized Data
One of the CHIME datasets consists of a large XML file containing data about ship
sightings. Each top-level XML component describes the location of a ship at a particular
time, along with attributes such as what kinds of cargo it contains. Using the map display
and a time slider, CHIME makes it easy to see where many different ships are located at
any given time. We are working to gradually increase the complexity of queries
representable using CHIME, the difficulty being that we want ordinary users to be able to
compose the queries. An example of a query that is still a bit beyond us today (but we
know how to get there) is the following: “Retrieve freighters that visited Antwerp in
April 2003 whose cargo included aluminum pipes.” We have submitted a challenge
problem to the Semantic Web community for examples of how to phrase this query in an
RDF query language in a form that is cognitively palatable, and have not yet received a
satisfactory answer. However, if we extend RDF to embrace “quads” and contexts, then
the query can be expressed quite succinctly. We will use that as the starting point for our
discussion, and then work backwards to RDF.
A quad is a four-tuple of the form [C S P O], where C is a context, and S, P, and O are
the RDF subject, predicate, and object fields.2 Section 3 discusses the semantics of our
context ‘C.’ Here is the query in an RDQL variant that supports a quad syntax instead of
a triple syntax:
SELECT ?f
WHERE ((null ?f rdf:type ex:Freighter),
(null ?c rdf:type ex:Context),
(?c ?f ex:location ex:antwerp),
(?c ?f ex:hasCargo ?cargo),
(?c ?cargo ex:consistsOf ex:AluminumPipe),
(null ?c ex:beginDate ?begin),
(null ?c ex:endDate ?end),
(null ?begin ex:before "May 1 2003"),
(null ?end ex:after "March 31 2003"))
This is actually quite a reasonable query. It's fairly concise, and fairly readable.
Unfortunately, there are few Semantic Web systems that implement quads—for many of
us in the Semantic Web community, quads are still a wish that has not come true. 3
The above RDQL query assumes that temporal data describing the time of a ship sighting
is attached to a context rather than to the ship itself. This is crucial for several reasons.
There are many sightings of each ship, so we can’t attach all of the temporal data for a ship
to a single resource. In our application, we “condition” the source XML data being
translated into RDF by creating a new context object for each ship sighting, and attaching
the temporal data (and other external data) to the context. 4 CHIME contains an n-
dimensional filtering mechanism that requires a uniform approach to representing spatial,
temporal, and other external data. Our use of contexts as the single point of attachment
for external data provides that uniformity.
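The conditioning step described above can be sketched in Python. This is only an illustration under stated assumptions: the XML element and attribute names (`sighting`, `ship`, `location`, `begin`, `end`) are hypothetical, since the actual CHIME source format is not shown here, and quads are modeled as plain 4-tuples with `None` standing in for the null context.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical input: element and attribute names are invented for illustration.
SAMPLE_XML = """
<sightings>
  <sighting ship="f1" location="antwerp" begin="2003-04-03" end="2003-04-04"/>
  <sighting ship="f1" location="rotterdam" begin="2003-04-10" end="2003-04-11"/>
</sightings>
"""

def condition(xml_text):
    """Return a list of (context, subject, predicate, object) quads."""
    quads = []
    ids = count(1)
    for s in ET.fromstring(xml_text).iter("sighting"):
        cxt = f"_:cxt{next(ids)}"  # one fresh context object per sighting
        ship = "_:" + s.get("ship")
        # Internal data: asserted only within the new context.
        quads.append((cxt, ship, "ex:location", "ex:" + s.get("location")))
        # External data: attached to the context itself, with a null context.
        quads.append((None, cxt, "rdf:type", "ex:Context"))
        quads.append((None, cxt, "ex:beginDate", s.get("begin")))
        quads.append((None, cxt, "ex:endDate", s.get("end")))
    return quads

quads = condition(SAMPLE_XML)
```

Because every sighting gets its own context, the two sightings of the same ship keep separate begin and end dates, and a filter only ever has to look at context objects to find temporal data.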
3 Contexts
The notion of contexts has been around a long time, and there is no consensus on the
semantics of a context. Cognitively, a context consists of a set of facts (here, RDF
statements) and a description of an environment within which those facts are believed to
be true. A context implementation includes some kind of mapping from a context object
to the statements in it. There are many ways to define such a mapping. Also, the context
points to some form of definition of the “environment.” Our system defines the
environment by directly attaching assertions to the context.
2 Some people prefer to write quads with the context field in fourth position.
3 Intellidimension’s RDF Gateway supports quads (www.intellidimension.com).
4 Location data is also copied to the context object, but for simplicity we are only focusing on the
temporal data.
Quads were invented to make it easy to map from a context to a statement. The
meaning of a quad is that the triple represented by arguments two through four belongs to
the context referenced by the first argument. Some quads point to triples whose semantics
are not context dependent; in that case, we put a null value in the context position. Here
is an example of a set of quad statements:
[_:cxt1 _:f1 ex:location ex:antwerp]
[_:cxt1 _:f1 ex:hasCargo _:cargo1]
[null _:cxt1 ex:beginDate "April 3 2003"]
[null _:cxt1 ex:endDate "April 4 2003"]
These statements assert that, for the time period April 3 to April 4 2003, the facts “the
location of f1 is Antwerp” and “f1 has cargo cargo1” are both true.
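To make the quad semantics concrete, here is a minimal Python sketch of the freighter query from Section 2 evaluated over an in-memory list of quads. It is not the SEW implementation or an RDQL engine: the `match` helper, the `ANY` wildcard, and the use of `datetime.date` values in place of date strings are assumptions made for illustration, and the type and cargo statements are added to the example data so that the query has something to find.

```python
from datetime import date

ANY = object()  # wildcard, kept distinct from None (the null context)

quads = [
    (None, "_:f1", "rdf:type", "ex:Freighter"),
    (None, "_:cxt1", "rdf:type", "ex:Context"),
    ("_:cxt1", "_:f1", "ex:location", "ex:antwerp"),
    ("_:cxt1", "_:f1", "ex:hasCargo", "_:cargo1"),
    ("_:cxt1", "_:cargo1", "ex:consistsOf", "ex:AluminumPipe"),
    (None, "_:cxt1", "ex:beginDate", date(2003, 4, 3)),
    (None, "_:cxt1", "ex:endDate", date(2003, 4, 4)),
]

def match(quads, c=ANY, s=ANY, p=ANY, o=ANY):
    """Return all quads that agree with the pattern in every non-wildcard field."""
    pattern = (c, s, p, o)
    return [q for q in quads
            if all(want is ANY or want == got for want, got in zip(pattern, q))]

def freighters_in_antwerp(quads, window_start, window_end):
    """Freighters located in Antwerp, carrying aluminum pipes, in a context
    whose time span overlaps [window_start, window_end]."""
    hits = set()
    for _, cxt, _, _ in match(quads, None, ANY, "rdf:type", "ex:Context"):
        begin = match(quads, None, cxt, "ex:beginDate")[0][3]
        end = match(quads, None, cxt, "ex:endDate")[0][3]
        if begin > window_end or end < window_start:
            continue  # context lies entirely outside the window
        for _, f, _, _ in match(quads, cxt, ANY, "ex:location", "ex:antwerp"):
            if not match(quads, None, f, "rdf:type", "ex:Freighter"):
                continue
            for _, _, _, cargo in match(quads, cxt, f, "ex:hasCargo"):
                if match(quads, cxt, cargo, "ex:consistsOf", "ex:AluminumPipe"):
                    hits.add(f)
    return hits

april = freighters_in_antwerp(quads, date(2003, 4, 1), date(2003, 4, 30))
may = freighters_in_antwerp(quads, date(2003, 5, 1), date(2003, 5, 31))
```

Each pattern passed to `match` corresponds to one line of the quad query, which is why the Python version stays close to the RDQL one in length.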
The sad fact is that quads are not supported by most Semantic Web tools, so we need
some other way to map a context to a set of statements. RDF has provided an
exceedingly clumsy way to do this, using reified statements. Here is an equivalent
set of statements, expressed as triples using reified statements:
[_:st1 rdf:subject _:f1]
[_:st1 rdf:predicate ex:location]
[_:st1 rdf:object ex:antwerp]
[_:st1 rdf:type rdf:Statement]
[_:st2 rdf:subject _:f1]
[_:st2 rdf:predicate ex:hasCargo]
[_:st2 rdf:object _:cargo1]
[_:st2 rdf:type rdf:Statement]
[_:st1 ex:inContext _:cxt1]
[_:st2 ex:inContext _:cxt1]
[_:cxt1 ex:beginDate "April 3 2003"]
[_:cxt1 ex:endDate "April 4 2003"]
Pretty hideous, isn’t it? Some RDF proponents argue that reified statements aren’t
really so bad, because a system can be built that compresses the storage blow-up that you
see here back down to something equivalent to our first set of statements. However, an
informal poll has failed to discover a remedy for the significant cognitive overload
engendered by the use of reified statements. Here is our original RDQL query, rewritten
to execute against triples and reified statements:
SELECT ?f
WHERE ((?f rdf:type ex:Freighter),
(?st1 rdf:type rdf:Statement),
(?st1 rdf:subject ?f),
(?st1 rdf:predicate ex:location),
(?st1 rdf:object ex:antwerp),
(?st2 rdf:type rdf:Statement),
(?st2 rdf:subject ?f),
(?st2 rdf:predicate ex:hasCargo),
(?st2 rdf:object ?cargo),
(?st3 rdf:type rdf:Statement),
(?st3 rdf:subject ?cargo),
(?st3 rdf:predicate ex:consistsOf),
(?st3 rdf:object ex:AluminumPipe),
(?st1 ex:inContext ?c),
(?st2 ex:inContext ?c),
(?st3 ex:inContext ?c),
(?c ex:beginDate ?begin),
(?c ex:endDate ?end),
(?begin ex:before "May 1 2003"),
(?end ex:after "March 31 2003"))
This query captures the intended meaning accurately, but it is really quite awful. Not
only is it harder to write, and much less readable, but it is likely to be much less efficient
than the quad representation. Why is that? First of all, the number of “joins” is much
larger. Second, and possibly more damaging, the optimizer now has to optimize over
predicates like “subject,” “predicate” and “object” that mix together extensions of many
different predicates.
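The storage blow-up behind this comparison can be sketched as a small Python function that expands quads into reified triples. This is a hedged illustration, not code from the SEW: quads are modeled as 4-tuples with `None` as the null context, and the statement identifiers `_:st1`, `_:st2`, and so on are generated here only to mirror the listing above.

```python
def reify(quads):
    """Expand (context, s, p, o) quads into plain triples using RDF reification."""
    triples = []
    n = 0
    for c, s, p, o in quads:
        if c is None:
            # Context-free statements stay ordinary triples.
            triples.append((s, p, o))
            continue
        n += 1
        st = f"_:st{n}"
        # Each contextualized statement costs five triples instead of one quad.
        triples += [
            (st, "rdf:subject", s),
            (st, "rdf:predicate", p),
            (st, "rdf:object", o),
            (st, "rdf:type", "rdf:Statement"),
            (st, "ex:inContext", c),
        ]
    return triples

example_quads = [
    ("_:cxt1", "_:f1", "ex:location", "ex:antwerp"),
    ("_:cxt1", "_:f1", "ex:hasCargo", "_:cargo1"),
    (None, "_:cxt1", "ex:beginDate", "April 3 2003"),
    (None, "_:cxt1", "ex:endDate", "April 4 2003"),
]
triples = reify(example_quads)
```

The four quads become twelve triples, matching the reified listing. The same five-to-one ratio applies to query patterns, which is where the extra joins come from.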
Before we finish our initial discussion of contexts, we discuss several variations on
representing contexts. Each of them has significant drawbacks:
Some folks have advocated using resources of type rdf:Bag to point to statements in
a context. In place of our “inContext” triples, one writes (reversing the arguments):
[_:cxt1 rdf:_1 _:st1]
[_:cxt1 rdf:_2 _:st2]
A query that interprets bags (instead of using a property such as ex:inContext)
becomes even less readable, because it's necessary to substitute a null predicate for
“ex:inContext” to match a bag to its members. It is hard to imagine why anyone
would want to use bags to solve this problem.
Recently, lists were added to RDF. If we use lists instead of bags, we can’t write a
query anymore in RDQL, because RDF does not provide a list membership predicate.
Also, the list implementation uses double the storage of bags (in terms of number of
edges) and can no longer provide a constant time membership test.
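The edge counts and the membership cost can be illustrated with a short Python sketch. It rests on assumptions: list cells are given hypothetical node names (`_:l1`, `_:l2`, ...), the edge that would link a context to the head of its list is not counted, and the constant-time bag test presumes an index keyed on (subject, object) pairs, which a given triple store may or may not provide.

```python
def bag_triples(context, statements):
    """One rdf:_n edge per member: n edges for n statements."""
    return [(context, f"rdf:_{i}", st) for i, st in enumerate(statements, 1)]

def list_triples(statements):
    """One rdf:first and one rdf:rest edge per member: 2n edges."""
    nodes = [f"_:l{i}" for i in range(1, len(statements) + 1)]
    triples = []
    for i, (node, st) in enumerate(zip(nodes, statements)):
        nxt = nodes[i + 1] if i + 1 < len(nodes) else "rdf:nil"
        triples.append((node, "rdf:first", st))
        triples.append((node, "rdf:rest", nxt))
    head = nodes[0] if nodes else "rdf:nil"
    return head, triples

def bag_contains(index, context, st):
    """Membership via a hash index on (subject, object) pairs."""
    return (context, st) in index

def list_contains(triples, head, st):
    """Membership by walking the chain; returns (found, cells visited)."""
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}
    node, steps = head, 0
    while node != "rdf:nil":
        steps += 1
        if first[node] == st:
            return True, steps
        node = rest[node]
    return False, steps

statements = ["_:st1", "_:st2", "_:st3"]
bag = bag_triples("_:cxt1", statements)
bag_index = {(s, o) for s, _, o in bag}
head, cells = list_triples(statements)
```

With three statements the bag uses three edges and the list uses six, and finding the last member of the list requires visiting every cell.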
Another possibility is to eliminate contexts altogether, and attach the metadata (the
context definitions) directly to the reified statement resources. If the average context
contains fewer than two statements (probably not the norm), this saves space. However,
with this scheme you have lost a capability for “context switching”—metadata is not
grouped into convenient subgraphs. This scheme also misses out on the opportunity for a
uniform treatment of contextualized data. When a context class is defined, one is advised
to identify a set of predicates that provide a standard means for representing the most
commonly-occurring types of metadata. For example, our SEW (see Section 7) adopts a
representation for “beginDate” and “endDate” that is independent of any
representational scheme adopted by statements within a context. It does the same for
latitude and longitude—it copies positional information from internal data to the context
(making it part of the external data). This uniformity makes it easy to optimize spatio-
temporal filtering on contexts. The same goes for security information, source
information, etc.
Finally, some folks advocate using models as if they were contexts. Whether this is
viable or not depends on several factors. Most RDF engines are not equipped to handle
large numbers of models per query. If contexts are very coarse grained (relatively few
contexts and many statements per context), then this might work out. We envision that
applications that make serious use of contextualized data will want their attachments to be
more fine-grained than that. Our CHIME application has hundreds of contexts per model,
and will later on support thousands (or more) of contexts per model. Translating that into
RDQL yields FROM statements that contain hundreds or thousands of URIs. We doubt that
many RDF query systems are tuned to handle those kinds of numbers. Also, we are still
waiting for someone to tell us what kind of query syntax RDF systems use with this
kind of context mechanism.
Several different knowledge representation systems (e.g., Loom [Loom 03], Epikit
[Genesereth 92], CycL [Lenat and Guha 90], PowerLoom [Loom 03]) implement contexts
(CycL calls them “microtheories”), and manage them using a model-as-context kind of
semantics. All of these systems support quantification over contexts/models, and they
assume that contexts can be arranged in a hierarchy, so that statements that are true in a
context are inherited by its child contexts. The contexts in these systems tend to
be relatively coarse-grained, and are not particularly well-suited to representing
provenance and temporal data. Each of these languages adopts an “isT” (is true in
context) predicate to relate a context to a statement. When applied to a binary relation,
the “isT” syntax is some variation of
“(isT