Bootstrapping to a Semantic Grid
Jens Schwidder (1), Tara Talbott (2), Jim Myers (2)
(1) Oak Ridge National Laboratory, (2) Pacific Northwest National Laboratory
schwidderj@ornl.gov, Tara.Talbott@pnl.gov, Jim.Myers@pnl.gov
Abstract

The Scientific Annotation Middleware (SAM) is a set of components and services that enable researchers, applications, problem solving environments (PSE) and software agents to create metadata and annotations about data objects and document the semantic relationships between them. Developed starting in 2001, SAM allows applications to encode metadata within files or to manage metadata at the level of individual relationships as desired. SAM then provides mechanisms to expose metadata and relationships encoded either way as WebDAV properties. In this paper, we report on work to further map this metadata into RDF and discuss the role of middleware such as SAM in bridging between traditional and semantic grid applications.

1. Introduction

Scientific progress depends increasingly on effective collaboration between widely distributed communities of researchers at various institutions around the world. The amount of data produced and shared is enormous, and more effective ways to organize the information and keep track of dependencies are becoming very important. The semantic data grid (SDG), an anticipated merger of semantic web and data grid concepts, is envisioned as the solution to this problem: a scalable means of sharing data, and its context of descriptive information and relationship to other data, through standard protocols and description languages.

However, many obstacles remain before SDGs can fulfill their promise. SDG concepts and software are still evolving and, while the potential uses of data with explicit semantics are compelling, the mechanics of how semantic information will be captured, as well as the economics of metadata production and consumption, are very unclear. In particular, while SDGs enable a new class of applications that will become critical to information intensive science efforts, it is not so clear that they provide enough direct benefit to traditional science applications to justify upgrading them to use semantic technologies. Further, since traditional applications are the producers of primary data and metadata, the SDG may have a bootstrapping problem.

The Scientific Annotation Middleware (SAM), being developed by researchers at Pacific Northwest National Laboratory and Oak Ridge National Laboratory, has been created, in part, to serve as a research platform for understanding these issues. SAM provides general data/metadata storage capabilities that can be accessed via a number of interfaces with varying levels of metadata awareness. Further, SAM provides configurable datatype-specific mechanisms to map information submitted via a simple interface into information with explicit semantics exposed via other interfaces. For example, as described below, this capability can be used to expose information within binary files as RDF-encoded relationships. This type-specific mechanism provides an alternative to more generic methods of extracting metadata from text, web pages, and XML [1]. Further, as middleware, SAM allows the metadata extraction process to be defined independently of the data format and the producing application and, therefore, allows the costs of metadata generation to potentially be transferred to those who can benefit from it.

In the following sections, we provide additional information about SAM in general and describe the mechanisms we have developed to support metadata extraction and bridging between multiple metadata management interfaces, focusing in particular on work to expose metadata via RDF. These are then discussed in terms of their potential to support semantic applications such as semantic data discovery, annotation, and provenance services over data provided by more traditional applications.

2. Background

As shown in Figure 1, SAM is a layered set of middleware components and services for managing data annotations and the semantic relationships among data objects [2]. Conceptually, SAM presents applications with a schema-less store that can manage arbitrary metadata and relationships that are defined by namespace qualified names. As such, it is well suited to a written-by-one-read-by-many usage model in which multiple semantic
applications contribute unique information about different aspects of federated data generated by independent scientific applications, all of which (data and metadata) must be presented to the user and further analysis tools as an integrated data context.

Figure 1. Scientific Annotation Middleware

SAM is built on the Jakarta Slide [3] content management system and implements the web Distributed Authoring and Versioning (webDAV) protocol [4]. WebDAV and its extensions adopt the Web’s HTTP model of resources accessed via a URL, adding standard methods for creating new collections (directories) and resources, adding and querying name–value-pair properties (arbitrary strings or XML) associated with each resource, and supporting versioning, locks, and list-based access control [5,6]. WebDAV is an IETF standard and is supported by a wide range of client and server applications including open-source and commercial projects, such as Jakarta Slide, Apache Tomcat, Adobe Acrobat, Mac OS X, and Microsoft Windows [7]. Also among the clients are file system drivers that allow accessing a webDAV server like a local file system. For the purposes of this paper, the most relevant methods are PUT for uploading content, and PROPPATCH and PROPFIND for setting and retrieving properties, respectively.

Slide implements a webDAV-centric content repository as middleware that can store data and metadata in multiple independent underlying data stores, which could be remote, e.g. GridFTP servers and Grid metadata catalogs. When new resources are created using webDAV, Slide generates standard webDAV properties that describe the resource, such as its type, size, owner, and creation date.

SAM extends Slide in a number of ways that enhance its ability to function as a bridging mechanism. To make activities in SAM visible to third-party software, we have modified Slide to produce Java Messaging Service (JMS) events whenever the resources are accessed or modified via webDAV. Supplementing Slide’s default internal authentication method, we’ve added a Java Authentication and Authorization Services (JAAS) based mechanism to allow SAM to be configured to use external authentication services, e.g. a Grid MyProxy server [8].

3. Mapping Between Embedded Metadata and Properties

Through webDAV, SAM can be accessed either as a file system using third party drivers or natively as a resource-plus-properties repository. In designing SAM, we wished to map between these two models and add support for an RDF/graph-based interaction model. Towards these ends, we have added a number of capabilities to allow SAM administrators and end users to specify correlations between metadata in files and properties, and between properties and RDF. As shown in Figure 2, this enables end-to-end scenarios where desktop, file-based applications with custom data formats can directly contribute to a shared network of semantic information.

To extract metadata from files, we developed a configurable, automated mechanism that can run a series of user-defined scripts and web services to produce properties. The mechanism invokes, in order, during a webDAV PUT call, a Binary Format Description (BFD) language script, web service, and/or an XSLT script that have been registered for the relevant content MIME-type.
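The invocation order described here can be illustrated with a minimal sketch. It is not SAM's implementation: the names `ExtractorChain`, `register`, and `on_put` are hypothetical, and the two toy stages are plain Python functions standing in for a real BFD parser and a real XSLT processor. It assumes only what the text states: stages are registered per MIME type, run in the fixed order BFD, web service, XSLT, and any stage may be absent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A stage consumes the previous stage's output and returns transformed bytes.
Stage = Callable[[bytes], bytes]


@dataclass
class ExtractorChain:
    """Hypothetical registration record for one MIME type."""
    bfd: Optional[Stage] = None          # binary/ASCII content -> XML
    web_service: Optional[Stage] = None  # XML -> XML
    xslt: Optional[Stage] = None         # XML -> PROPPATCH-style payload

    def stages(self) -> List[Stage]:
        # Fixed order; stages that were not registered are skipped.
        candidates = (self.bfd, self.web_service, self.xslt)
        return [s for s in candidates if s is not None]


REGISTRY: Dict[str, ExtractorChain] = {}


def register(mime_type: str, chain: ExtractorChain) -> None:
    REGISTRY[mime_type] = chain


def on_put(mime_type: str, content: bytes) -> Optional[bytes]:
    """Run the chain registered for mime_type; None if nothing is registered."""
    chain = REGISTRY.get(mime_type)
    if chain is None:
        return None
    data = content
    for stage in chain.stages():
        data = stage(data)
    return data


def toy_bfd(content: bytes) -> bytes:
    # Stand-in for a BFD script: turns one "key=value" line into an XML element.
    key, _, value = content.decode("ascii").strip().partition("=")
    return f"<{key}>{value}</{key}>".encode("ascii")


def toy_xslt(xml: bytes) -> bytes:
    # Stand-in for an XSLT script: wraps the element as a PROPPATCH body.
    return (b'<D:propertyupdate xmlns:D="DAV:"><D:set><D:prop>'
            + xml + b"</D:prop></D:set></D:propertyupdate>")


# Assumed MIME type, for illustration only.
register("application/x-demo", ExtractorChain(bfd=toy_bfd, xslt=toy_xslt))
payload = on_put("application/x-demo", b"creator=John Smith\n")
```

Because the chain is looked up by MIME type at upload time, the producing application needs no knowledge of the extractors, which is the decoupling the paper relies on.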
Figure 2. SAM’s mechanisms for mapping metadata allow file-based applications, metadata aware applications using webDAV, and RDF-based tools to all contribute to a network of semantic information.
BFD [9] is an extension of the eXtensible Scientific Interchange Language (XSIL) [10] that can describe the layout of a binary or ASCII file format in terms of an XML data model. (BFD is one of the languages influencing the design of the Data Format Description Language (DFDL) standard being pursued through the Global Grid Forum [11].) Analogous with XSLT, a BFD parser can ingest a BFD description and a content file and produce a transformed XML output. In SAM, this output can be piped to a web service supporting a simple WSDL interface that includes a transform method. Any registered XSLT script is invoked in a final step and the resulting output is interpreted as though it were the payload of a webDAV PROPPATCH method. This mechanism is shown in the top half of Figure 3. While we in general describe this capability as a means of semantically labeling information already within the data in some form, i.e. as metadata extraction, it should be noted that it can also be used for additional metadata annotation, e.g. to document inter-file relationships implicit in the design of applications that store data sets as multifile collections, facts that cannot be inferred from the data files alone.

A similar mechanism can be invoked to generate translations and views in SAM. SAM creates a “hastranslations” property specifying ‘virtual’ URLs for the translated content that can be generated by BFD, web service, and XSLT sequences. Translations are then created dynamically, instantiating the translation URLs when they are requested. While this feature has primarily been used for file translations and web pages showing file content (static HTML pages or pages invoking Java applets), we have recently added a means of specifying that the URL for the data and/or the set of webDAV properties be included in the stream being transformed, allowing the translator to include information from properties in an output file and thereby providing a mechanism to map backwards from properties to content.

4. Mapping Between Properties and RDF

Enabling metadata in SAM to be accessed via RDF requires adding two related pieces of functionality: a mapping between the syntax of webDAV properties and RDF, and new access methods for retrieving and adding RDF statements. Our initial work to extend SAM in these directions is described below, followed in the next section by a more general discussion of the advantages and limitations of the described approach.

At a basic level, webDAV properties map well to RDF statements. Resource URLs become subjects, property names are predicates, and the property value can be interpreted as the object. WebDAV follows the XML namespace conventions for property names, which makes it straightforward to interpret properties as predicates of RDF statements. For the simplest properties, i.e. those with string values, this mapping is fairly intuitive. For example, a webpage, http://www.example.org/index.html, which has a “creator” property as defined in the Dublin Core [12] (hereafter shown as dc:creator) whose value is “John Smith” would result in the following RDF:

    <rdf:Description rdf:about="http://www.example.org/index.html">
      <dc:creator>John Smith</dc:creator>
    </rdf:Description>

However, for properties containing XML values, a number of issues arise. In theory, the use of XML in webDAV property values raises all the same issues as when attempting to interpret general XML documents as RDF [13]. To date, however, the use cases we’ve encountered use XML within property values for a relatively limited set of reasons. We have seen this in work within the SAM project to adapt notebook and wiki applications and in collaborations with other projects adapting science applications, portals, and problem solving environments. For example, XML is being used to overcome the webDAV limitation of one property with a given name per resource, i.e. to list multiple dc:creators for a document. XML is also being used to clearly identify URIs rather than leaving them encoded as strings. Perhaps most interesting is the use of XML nesting to represent the sources of individual relationships within a property. For example, in the ELN electronic notebook [14], it is possible to include a given entry in two notebooks, e.g. as a means of including content from a public notebook in a group notebook where it will be further annotated. Thus, samns:children relationships written by the ELN need to be scoped as to which notebook they belong to.

To interpret these types of XML properties, we have initially implemented logic hardcoding a few conventions sufficient to cover these common use cases. For example, we consider multiple top-level XML elements in a property, or a single top-level rdf:Bag element containing multiple rdf:li subelements, as preferred within the Collaboratory for Multiscale Chemical Science project [15], to imply multiple RDF relationships with a common subject and predicate. Elements including an XLink href attribute are interpreted as identifying the href as the intended RDF object, while elements with text values are interpreted such that the text is used as the RDF object. Lastly, we have chosen to interpret the format used by the ELN, with an additional layer of XML elements representing the source of the relationships, in terms of RDF reification. The results for a simple multi-valued property and an ELN samns:children property are shown below, with the overall process of mapping from binary/ASCII files to properties and then to RDF shown in Figure 3.
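The conventions for turning a property into RDF statements can be made concrete with a small sketch. This is an illustration under stated assumptions, not SAM's code: the function name `property_to_triples` and the tuple representation of triples are hypothetical, XML values are detected with a naive "starts with `<`" check, any namespace prefixes must be declared inside the property value itself, and the ELN reification case is left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"

# (subject URL, predicate URI, (object kind, object value))
Triple = Tuple[str, str, Tuple[str, str]]


def _object_of(elem: ET.Element) -> Tuple[str, str]:
    # An xlink:href attribute names a URI object; otherwise use the text.
    href = elem.get(f"{{{XLINK_NS}}}href")
    if href is not None:
        return ("uri", href)
    return ("literal", (elem.text or "").strip())


def property_to_triples(resource_url: str, predicate: str,
                        value: str) -> List[Triple]:
    """Map one webDAV property to zero or more RDF statements."""
    text = value.strip()
    if not text.startswith("<"):
        # Simplest case: a plain string value becomes a literal object.
        return [(resource_url, predicate, ("literal", text))]
    # Wrap in a synthetic root so several top-level elements can be parsed.
    root = ET.fromstring(f"<wrapper>{text}</wrapper>")
    items = list(root)
    # A single top-level rdf:Bag contributes its rdf:li children instead.
    if len(items) == 1 and items[0].tag == f"{{{RDF_NS}}}Bag":
        items = [c for c in items[0] if c.tag == f"{{{RDF_NS}}}li"]
    # Each item yields one statement with the shared subject and predicate.
    return [(resource_url, predicate, _object_of(e)) for e in items]


page = "http://www.example.org/index.html"
simple = property_to_triples(page, DC_NS + "creator", "John Smith")
bag = property_to_triples(
    page, DC_NS + "creator",
    f'<rdf:Bag xmlns:rdf="{RDF_NS}"><rdf:li>A</rdf:li><rdf:li>B</rdf:li></rdf:Bag>',
)
mixed = property_to_triples(
    page, "http://purl.org/dc/terms/references",
    f'<ref xmlns:xlink="{XLINK_NS}" xlink:href="http://example.org/a"/>'
    "<ref>plain text</ref>",
)
```

Keeping the object as a tagged pair avoids committing to a particular RDF serialization; a real implementation would hand these statements to an RDF library for output.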
Figure 3. A simple example showing the process [...] generate webDAV properties and RDF available as a query response. (The optional invocation of a web [...] Panels: Multivalued Property: dcterms:references, with its inferred RDF; Complex Property: samns:children, with its inferred RDF.

Since webDAV, with our conventions for interpreting XML property values, provides a basic means of reading and writing semantic relationships about a resource, our initial focus in providing RDF-based functionality was in returning provenance information, i.e. subgraphs of related resources. Towards this end, very early in the SAM project we implemented dynamically generated properties whose values include all resources linked to the current resource by a specified subset of properties, down [...] Graph Exchange Language (GXL) [16] which can be consumed directly by a number of graph display toolkits. (As these properties were intended as a temporary measure primarily supporting the CMCS project (see Discussion), they are both in the CMCS http://purl.oclc.org/NET/SAM/cmcs namespace.)

More recently, we have been implementing RDF related capabilities as extensions to the implementation of the DAV Searching and Locating (DASL) SEARCH method now available within Slide. DASL defines a basic grammar having an SQL-like format, as well as a means to define extended grammars. The basic grammar supports returning a set of properties for all resources within a specified scope and meeting the specified
conditions. For example, one could request the DAV:displayname of all documents within “/projects” whose dc:creator property includes “Jane Smith”. For SAM, we have extended this grammar in two ways. First, we allow the scope to be specified in terms of a root resource, a set of properties to follow, and a depth, allowing the query to be run over a subgraph analogous to that returned through the pedigreerdf property. Second, we have extended the select mechanism to enable RDF-encoding of the return value, i.e. returning the set of properties on matching nodes as a set of RDF statements generated using the conventions discussed previously. Implementing these capabilities through the SEARCH method instead of through properties allows the set of properties to follow and the depth limit to be specified per query rather than configured per server. Further, it separates the list of properties to be returned from those used to define the scope.

5. Discussion

SAM’s ability to separate the effort required for making data semantics explicit from the development and use of scientific applications has a number of potential benefits in the context of community-wide collaborations and grid-based computing. Most directly, SAM allows the costs of describing data semantics explicitly to be borne by third parties and/or delayed until the benefits of such labeling can be realized. With SAM’s approach, groups wishing to take advantage of metadata-based searching, provenance tracking, annotation services, and other semantic capabilities do not have to involve the developers of all of the domain software they intend to use in reaching agreement on shared ontologies and upgrading software. Instead, groups can independently define metadata extractors/annotators as needed that expose as much or as little semantic detail as required, mapping it directly into the desired vocabulary. This elevates the concept of a virtual organization as an administrative unit managing access controls and allocations to one that may also manage shared semantics.

The decision to base SAM initially on webDAV and an open source content repository implementation has allowed us to quickly gain practical experience. Most important in the context of this paper has been a collaboration with the Collaboratory for Multiscale Chemical Sciences (CMCS). As reported elsewhere, CMCS has integrated SAM into its collaborative framework (named “KnECS”) and has made heavy use of SAM’s metadata extractor and translation capabilities to customize the framework and portal for chemical science. CMCS gathers metadata using extractors, through web forms, and from webDAV-enabled applications, PSEs, and web services. CMCS provides a number of general tools that make use of the federated metadata ranging from a data browser and metadata-based search tool to a provenance graphing portlet. Feedback from the CMCS project has been invaluable in refining SAM capabilities and prioritizing development and, while the CMCS project is ongoing, their experience suggests that the decoupling SAM allows will be very important in allowing groups to assemble a comprehensive, living corpus of semantically tagged data and for scaling and evolving collaborative tools in general.

Discussions with CMCS, other collaborators, and developers interested in semantic technologies in general have identified a number of strengths and limitations of SAM’s current capabilities and indicated several promising directions for enhancements. While WebDAV and our mapping of properties to RDF statements clearly provide only a subset of what RDF can encode, they have largely proved sufficient to represent the metadata being produced by traditional scientific applications as well as by tools such as electronic notebooks. The webDAV PROPFIND and PROPPATCH methods are conceptually similar to the HTTP extensions proposed as part of the URI Query Agent Model [17] for accessing semantic information, and, by emphasizing access based on a resource URL, webDAV presents a set of information very similar to the Concise Bounded Description of a resource, i.e. a subgraph of outbound relationships [18].

In general, SAM’s configurable mechanism for mapping between metadata in files and webDAV properties has worked well. While we anticipate migrating from BFD to DFDL as implementations appear, which should broaden the range of files that can be handled and simplify script development, and we may add some enhancements such as a mechanism to allow extractors to be registered for multiple file types at once, the current capabilities largely address the requirements that have been identified.

The mapping between webDAV properties and RDF and the interface(s) to RDF are less mature and we expect a number of changes. While the conventions we’ve implemented appear to cover most of the current use cases, there is clearly a desire from developers and end users to have more control over the layout of property values – simply alternate ways of specifying multiple relationships within a property and one level of reification. Further, we anticipate a need to represent more complex graphs in the future as new semantic applications are developed. Towards these ends, we intend to provide a mechanism analogous to that used to move from metadata in files to properties to allow the mapping from properties to RDF to be configured on a per property basis. Following the DFDL model (annotating an XML schema with instructions on how to populate an instance of the schema from ASCII/binary data), this might involve the annotation of an RDF Schema or OWL description with instructions for creating an instance from webDAV
properties. We also intend to investigate adding a SPARQL-based [19] grammar for DASL, implementing the URIQA MPUT, MGET, and MDELETE HTTP methods, and/or implementing semantic grid service interfaces as they are standardized. Lastly, while SAM currently maps semantic relationships to webDAV properties and stores them as such, if the usage of RDF increases, we can potentially invert the mapping direction and use a native RDF store and map to webDAV properties dynamically from the RDF rather than the other way around.

6. Conclusions

With the capabilities reported here, SAM now provides a complete binary-to-RDF pathway for exposing semantic information implicit in science applications and their output file formats. We believe that SAM demonstrates the viability of a bridging approach to include existing scientific applications in semantic data grids. Further, the use of SAM in projects such as CMCS is beginning to demonstrate the value of this approach in reducing integration and system evolution costs in collaborative systems. Absent a strong driver for upgrading current scientific software to be semantically explicit, the ability to provide and track semantic information without having to rewrite an existing application will be needed for quite some time. While SAM is still evolving and does not implement a full SDG, we believe that the concepts being explored within SAM will be critical to the successful realization of SDGs capable of seamlessly integrating an evolving mix of applications and supporting collaboration at the scale required for next-generation information intensive research.

7. Acknowledgements

This work was supported as part of the Scientific Annotation Middleware (SAM) project. Employees of Battelle Memorial Institute, which operates Pacific Northwest National Laboratory for the US Department of Energy under contract DE-AC06-76RL01830 and Oak Ridge National Laboratory under contract DE-AC05-00OR22725, wrote this manuscript. The authors also acknowledge helpful discussions and ongoing collaborations with members of the Collaboratory for Multiscale Chemical Science (CMCS) project.

8. References

[1] Bettina Berendt, Andreas Hotho, Gerd Stumme, “Towards Semantic Web Mining”, International Semantic Web Conference (ISWC), 2002
[2] James D. Myers, Alan R. Chappell, Matthew Elder, Al Geist, Jens Schwidder, “Re-Integrating The Research Record”, Computing in Science and Engineering, May/June 2003, pp. 44-50
[3] Jakarta Slide Java Content Management System Website, http://jakarta.apache.org/slide/index.html
[4] Distributed Authoring and Versioning (DAV) website, http://webdav.org
[5] J. Whitehead, “WebDAV: Versatile Collaboration Multiprotocol”, IEEE Internet Computing, vol. 9, no. 1, Jan/Feb 2005, pp. 66-74. http://www.soe.ucsc.edu/~ejw/papers/dav-ic-2005-final.pdf
[6] DAV Searching and Locating (DASL) Draft Specification, http://webdav.org/dasl
[7] webDAV Projects, http://webdav.org/projects/
[8] J. Novotny, S. Tuecke, and V. Welch, “An Online Credential Repository for the Grid: MyProxy”, Proceedings of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
[9] Binary Format Description (BFD) Language Home Page, http://collaboratory.emsl.pnl.gov/sam/bfd/
[10] Roy Williams, “XSIL: Java/XML for Scientific Data”, July 2000, http://www.cacr.caltech.edu/SDA/xsil
[11] Document Format Description Language (DFDL), http://forge.gridforum.org/projects/dfdl-wg/
[12] Dublin Core Metadata Initiative, http://dublincore.org/
[13] Stephen Buswell, “Extracting Semantics from XML Structure”, http://www.w3.org/2001/sw/Europe/reports/xslt_schematron_tool/
[14] James Myers, Elena Mendoza, and Bonnie Hoopes, “A Collaborative Electronic Notebook”, Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications (IMSA 2001), August 13-16, 2001, Honolulu, Hawaii
[15] James D. Myers, Thomas C. Allison, Sandra Bittner, Brett Didier, Michael Frenklach, William H. Green, Jr., Yen-Ling Ho, John Hewson, Wendy Koegler, Carina Lansing, David Leahy, Michael Lee, Renata McCoy, Michael Minkoff, Sandeep Nijsure, Gregor von Laszewski, David Montoya, Carmen Pancerella, Reinhardt Pinzon, William Pitz, Larry A. Rahn, Branko Ruscic, Karen Schuchardt, Eric Stephan, Al Wagner, Theresa Windus, Christine Yang, “A Collaborative Informatics Infrastructure for Multi-scale Science”, Proceedings of the Challenges of Large Applications in Distributed Environments (CLADE) Workshop, June 7, 2004, Honolulu, HI, pp. 24-33
[16] A. Winter, B. Kullbach, V. Riediger, “An Overview of the GXL Graph Exchange Language”, in S. Diehl (ed.), Software Visualization, International Seminar, Dagstuhl Castle, Germany, May 20-25, 2001, Revised Lectures, Springer Verlag, available from http://www.gupro.de/GXL/index.html
[17] Patrick Stickler, “URIQA: The Nokia URI Query Agent Model”, specification, 2003-2004, http://swdev.nokia.com/uriqa/URIQA.html
[18] Patrick Stickler, “CBD – Concise Bounded Description”, specification, 2003-2004, http://swdev.nokia.com/uriqa/CBD.html
[19] “SPARQL Query Language for RDF”, W3C Working Draft, http://www.w3.org/TR/rdf-sparql-query/