Bootstrapping to a Semantic Grid

               Jens Schwidder (1), Tara Talbott (2), Jim Myers (2)
   (1) Oak Ridge National Laboratory, (2) Pacific Northwest National Laboratory
          schwidderj@ornl.gov, Tara.Talbott@pnl.gov, Jim.Myers@pnl.gov


Abstract

The Scientific Annotation Middleware (SAM) is a set of components and
services that enable researchers, applications, problem solving environments
(PSEs), and software agents to create metadata and annotations about data
objects and document the semantic relationships between them. Developed
starting in 2001, SAM allows applications to encode metadata within files or
to manage metadata at the level of individual relationships as desired. SAM
then provides mechanisms to expose metadata and relationships encoded either
way as WebDAV properties. In this paper, we report on work to further map
this metadata into RDF and discuss the role of middleware such as SAM in
bridging between traditional and semantic grid applications.

1. Introduction

   Scientific progress depends increasingly on effective collaboration
between widely distributed communities of researchers at various institutions
around the world. The amount of data produced and shared is enormous, and
more effective ways to organize the information and keep track of
dependencies are becoming very important. The semantic data grid (SDG), an
anticipated merger of semantic web and data grid concepts, is envisioned as
the solution to this problem: a scalable means of sharing data, and its
context of descriptive information and relationship to other data, through
standard protocols and description languages.
   However, many obstacles remain before SDGs can fulfill their promise. SDG
concepts and software are still evolving and, while the potential uses of
data with explicit semantics are compelling, the mechanics of how semantic
information will be captured, as well as the economics of metadata production
and consumption, are very unclear. In particular, while SDGs enable a new
class of applications that will become critical to information intensive
science efforts, it is not so clear that they provide enough direct benefit
to traditional science applications to justify upgrading them to use semantic
technologies. Further, since traditional applications are the producers of
primary data and metadata, the SDG may have a bootstrapping problem.
   The Scientific Annotation Middleware (SAM), being developed by researchers
at Pacific Northwest National Laboratory and Oak Ridge National Laboratory,
has been created, in part, to serve as a research platform for understanding
these issues. SAM provides general data/metadata storage capabilities that
can be accessed via a number of interfaces with varying levels of metadata
awareness. Further, SAM provides configurable datatype-specific mechanisms to
map information submitted via a simple interface into information with
explicit semantics exposed via other interfaces. For example, as described
below, this capability can be used to expose information within binary files
as RDF-encoded relationships. This type-specific mechanism provides an
alternative to more generic methods of extracting metadata from text, web
pages, and XML [1]. Further, as middleware, SAM allows the metadata
extraction process to be defined independently of the data format and the
producing application and therefore allows the costs of metadata generation
to potentially be transferred to those who can benefit from it.
   In the following sections, we provide additional information about SAM in
general and describe the mechanisms we have developed to support metadata
extraction and bridging between multiple metadata management interfaces,
focusing in particular on work to expose metadata via RDF. These are then
discussed in terms of their potential to support semantic applications such
as semantic data discovery, annotation, and provenance services over data
provided by more traditional applications.

2. Background

   As shown in Figure 1, SAM is a layered set of middleware components and
services for managing data annotations and the semantic relationships among
data objects [2]. Conceptually, SAM presents applications with a schema-less
store that can manage arbitrary metadata and relationships that are defined
by namespace-qualified names. As such, it is well suited to a
written-by-one-read-by-many usage model in which multiple semantic
applications contribute unique information about different aspects of
federated data generated by independent scientific applications, all of which
(data and metadata) must be presented to the user and further analysis tools
as an integrated data context.

Figure 1. Scientific Annotation Middleware

   SAM is built on the Jakarta Slide [3] content management system and
implements the Web Distributed Authoring and Versioning (WebDAV) protocol
[4]. WebDAV and its extensions adopt the Web's HTTP model of resources
accessed via a URL, adding standard methods for creating new collections
(directories) and resources, adding and querying name-value pair properties
(arbitrary strings or XML) associated with each resource, and supporting
versioning, locks, and list-based access control [5,6]. WebDAV is an IETF
standard and is supported by a wide range of client and server applications
including open-source and commercial projects, such as Jakarta Slide, Apache
Tomcat, Adobe Acrobat, Mac OS X, and Microsoft Windows [7]. Also among the
clients are file system drivers that allow accessing a WebDAV server like a
local file system. For the purposes of this paper, the most relevant methods
are PUT for uploading content, and PROPPATCH and PROPFIND for setting and
retrieving properties, respectively.
   Slide implements a WebDAV-centric content repository as middleware that
can store data and metadata in multiple independent underlying data stores,
which could be remote, e.g. GridFTP servers and Grid metadata catalogs. When
new resources are created using WebDAV, Slide generates standard WebDAV
properties that describe the resource, such as its type, size, owner, and
creation date.
   SAM extends Slide in a number of ways that enhance its ability to function
as a bridging mechanism. To make activities in SAM visible to third-party
software, we have modified Slide to produce Java Message Service (JMS) events
whenever resources are accessed or modified via WebDAV. Supplementing
Slide's default internal authentication method, we have added a Java
Authentication and Authorization Service (JAAS) based mechanism to allow SAM
to be configured to use external authentication services, e.g. a Grid
MyProxy server [8].

3. Mapping Between Embedded Metadata and Properties

   Through WebDAV, SAM can be accessed either as a file system using third
party drivers or natively as a resource-plus-properties repository. In
designing SAM, we wished to map between these two models and add support for
an RDF/graph-based interaction model. Towards these ends, we have added a
number of capabilities to allow SAM administrators and end users to specify
correlations between metadata in files and properties, and between properties
and RDF. As shown in Figure 2, this enables end-to-end scenarios where
desktop, file-based applications with custom data formats can directly
contribute to a shared network of semantic information.
   To extract metadata from files, we developed a configurable, automated
mechanism that can run a series of user-defined scripts and web services to
produce properties. During a WebDAV PUT call, the mechanism invokes, in
order, a Binary Format Description (BFD) language script, a web service,
and/or an XSLT script that have been registered for the relevant content
MIME type.

Figure 2. SAM's mechanisms for mapping metadata allow file-based
applications, metadata-aware applications using WebDAV, and RDF-based tools
to all contribute to a network of semantic information.
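
As a rough illustration of the extraction sequence described in Section 3
(stages registered per MIME type and run in order during a PUT), the sketch
below chains two stand-in stages. It is not SAM's implementation: the
ExtractorRegistry class, the toy binary layout, and the application/x-toy
MIME type are all hypothetical, and plain Python functions stand in for a
real BFD script and a real XSLT script. The optional web service stage is
left out.

```python
import struct
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional

Stage = Callable[[bytes], bytes]  # each stage maps one payload to the next

DAV_NS = "DAV:"
DC_NS = "http://purl.org/dc/elements/1.1/"


class ExtractorRegistry:
    """Hypothetical registry: MIME type -> ordered list of stages."""

    def __init__(self) -> None:
        self._stages: Dict[str, List[Stage]] = {}

    def register(self, mime_type: str, stages: List[Stage]) -> None:
        self._stages[mime_type] = list(stages)

    def on_put(self, mime_type: str, content: bytes) -> Optional[bytes]:
        """Run the registered stages in order; None if nothing is registered."""
        stages = self._stages.get(mime_type)
        if not stages:
            return None
        payload = content
        for stage in stages:
            payload = stage(payload)
        return payload


def describe_binary(content: bytes) -> bytes:
    """Stand-in for a BFD script: expose a toy binary header as XML.

    Assumed toy layout: a 4-byte big-endian sample count followed by a
    16-byte NUL-padded ASCII creator name.
    """
    count, raw_name = struct.unpack(">I16s", content[:20])
    root = ET.Element("header")
    ET.SubElement(root, "samples").text = str(count)
    ET.SubElement(root, "creator").text = raw_name.rstrip(b"\x00").decode("ascii")
    return ET.tostring(root)


def to_proppatch(xml_bytes: bytes) -> bytes:
    """Stand-in for an XSLT script: emit a PROPPATCH-style request body."""
    header = ET.fromstring(xml_bytes)
    update = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_elem = ET.SubElement(update, f"{{{DAV_NS}}}set")
    prop = ET.SubElement(set_elem, f"{{{DAV_NS}}}prop")
    ET.SubElement(prop, f"{{{DC_NS}}}creator").text = header.findtext("creator")
    return ET.tostring(update)


registry = ExtractorRegistry()
registry.register("application/x-toy", [describe_binary, to_proppatch])

toy_file = struct.pack(">I16s", 3, b"John Smith") + b"\x01\x02\x03"
body = registry.on_put("application/x-toy", toy_file)
print(body.decode())
```

Keeping each stage as a bytes-to-bytes function means a stage can be added,
removed, or replaced per MIME type without touching the application that
wrote the file, which is the decoupling the text attributes to SAM.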
   BFD [9] is an extension of the eXtensible Scientific Interchange Language
(XSIL) [10] that can describe the layout of a binary or ASCII file format in
terms of an XML data model. (BFD is one of the languages influencing the
design of the Data Format Description Language (DFDL) standard being pursued
through the Global Grid Forum [11].) Analogous with XSLT, a BFD parser can
ingest a BFD description and a content file and produce a transformed XML
output. In SAM, this output can be piped to a web service supporting a simple
WSDL interface that includes a transform method. Any registered XSLT script
is invoked in a final step and the resulting output is interpreted as though
it were the payload of a WebDAV PROPPATCH method. This mechanism is shown in
the top half of Figure 3. While we in general describe this capability as a
means of semantically labeling information already within the data in some
form, i.e. as metadata extraction, it should be noted that it can also be
used for additional metadata annotation, e.g. to document inter-file
relationships implicit in the design of applications that store data sets as
multifile collections, facts that cannot be inferred from the data files
alone.
   A similar mechanism can be invoked to generate translations and views in
SAM. SAM creates a “hastranslations” property specifying ‘virtual’ URLs for
the translated content that can be generated by BFD, web service, and XSLT
sequences. Translations are then created dynamically, instantiating the
translation URLs when they are requested. While this feature has primarily
been used for file translations and web pages showing file content (static
HTML pages or pages invoking Java applets), we have recently added a means of
specifying that the URL for the data and/or the set of WebDAV properties be
included in the stream being transformed, allowing the translator to include
information from properties in an output file and thereby providing a
mechanism to map backwards from properties to content.

4. Mapping Between Properties and RDF

   Enabling metadata in SAM to be accessed via RDF requires adding two
related pieces of functionality: a mapping between the syntax of WebDAV
properties and RDF, and new access methods for retrieving and adding RDF
statements. Our initial work to extend SAM in these directions is described
below, followed in the next section by a more general discussion of the
advantages and limitations of the described approach.
   At a basic level, WebDAV properties map well to RDF statements. Resource
URLs become subjects, property names are predicates, and the property value
can be interpreted as the object. WebDAV follows the XML namespace
conventions for property names, which makes it straightforward to interpret
properties as predicates of RDF statements. For the simplest properties, i.e.
those with string values, this mapping is fairly intuitive. For example, a
web page, http://www.example.org/index.html, which has a “creator” property
as defined in the Dublin Core [12] (hereafter shown as dc:creator) whose
value is “John Smith” would result in the following RDF:

<rdf:Description rdf:about="http://www.example.org/index.html">
   <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">
     John Smith
   </dc:creator>
</rdf:Description>

   However, for properties containing XML values, a number of issues arise.
In theory, the use of XML in WebDAV property values raises all the same
issues as when attempting to interpret general XML documents as RDF [13]. To
date, however, the use cases we have encountered use XML within property
values for a relatively limited set of reasons. We have seen this in work
within the SAM project to adapt notebook and wiki applications and in
collaborations with other projects adapting science applications, portals,
and problem solving environments. For example, XML is being used to overcome
the WebDAV limitation of one property with a given name per resource, i.e. to
list multiple dc:creators for a document. XML is also being used to clearly
identify URIs rather than leaving them encoded as strings. Perhaps most
interesting is the use of XML nesting to represent the sources of individual
relationships within a property. For example, in the ELN electronic notebook
[14], it is possible to include a given entry in two notebooks, e.g. as a
means of including content from a public notebook in a group notebook where
it will be further annotated. Thus, samns:children relationships written by
the ELN need to be scoped as to which notebook they belong to.
   To interpret these types of XML properties, we have initially implemented
logic hardcoding a few conventions sufficient to cover these common use
cases. For example, we consider multiple top-level XML elements in a
property, or a single top-level rdf:Bag element containing multiple rdf:li
subelements, as preferred within the Collaboratory for Multiscale Chemical
Science project [15], to imply multiple RDF relationships with a common
subject and predicate. Elements including an XLink href attribute are
interpreted as identifying the href as the intended RDF object, while
elements with text values are interpreted such that the text is used as the
RDF object. Lastly, we have chosen to interpret the format used by the ELN,
with an additional layer of XML elements representing the source of the
relationships, in terms of RDF reification. The results for a simple
multi-valued property and an ELN samns:children property are shown below,
with the overall process of mapping from binary/ASCII files to properties and
then to RDF shown in Figure 3.
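
The first three conventions (string values, multiple values or an rdf:Bag,
and XLink hrefs) can be sketched in a few lines of Python. This is a minimal
sketch under stated assumptions rather than SAM's actual logic: the function
names and the tuple layout are hypothetical, the namespace URIs in the sample
are the standard Dublin Core terms, RDF, and XLink URIs supplied here for
illustration, and the ELN reification convention is omitted.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# (subject, predicate, object, object_is_resource)
Statement = Tuple[str, str, str, bool]


def _predicate_uri(tag: str) -> str:
    """Turn ElementTree's '{namespace}local' tag into namespace + local."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace + local
    return tag


def _object_of(elem: ET.Element) -> Optional[Tuple[str, bool]]:
    """An xlink:href names a resource; otherwise non-empty text is a literal."""
    href = elem.get(XLINK_HREF)
    if href is not None:
        return href, True
    text = (elem.text or "").strip()
    return (text, False) if text else None


def property_to_statements(subject: str, prop: ET.Element) -> List[Statement]:
    """Map one WebDAV property element to statements about `subject`."""
    predicate = _predicate_uri(prop.tag)
    values = list(prop)
    if not values:
        # Simplest case: a string-valued property yields one statement.
        obj = _object_of(prop)
        return [(subject, predicate, obj[0], obj[1])] if obj else []
    # A single top-level rdf:Bag stands for its rdf:li members.
    if len(values) == 1 and values[0].tag == f"{{{RDF_NS}}}Bag":
        values = [li for li in values[0] if li.tag == f"{{{RDF_NS}}}li"]
    statements: List[Statement] = []
    for value in values:
        # The object may sit on the element itself or on one nested element.
        for candidate in [value, *value]:
            obj = _object_of(candidate)
            if obj:
                statements.append((subject, predicate, obj[0], obj[1]))
                break
    return statements


BAG = """
<dcterms:references xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <rdf:Bag>
    <rdf:li><rdf:href xlink:href="http://collab/paper1.pdf"/></rdf:li>
    <rdf:li><rdf:href xlink:href="http://collab/paper2.pdf"/></rdf:li>
  </rdf:Bag>
</dcterms:references>
"""

for statement in property_to_statements(
    "/sam/files/nb1/chapter_1", ET.fromstring(BAG.strip())
):
    print(statement)
```

Every statement produced for one property shares the same subject and
predicate, so a multi-valued property differs from a string-valued one only
in how many objects are found.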
Multivalued Property: dcterms:references

<dcterms:references xmlns="...">
  <rdf:Bag>
    <rdf:li>
      <rdf:href xlink:type="simple"
         xlink:title="Paper 1"
         xlink:href="http://collab/paper1.pdf"/>
    </rdf:li>
    <rdf:li>
      <rdf:href xlink:type="simple"
         xlink:title="Paper 2"
         xlink:href="http://collab/paper2.pdf"/>
    </rdf:li>
  </rdf:Bag>
</dcterms:references>

Inferred RDF

<rdf:RDF xmlns="...">
  <rdf:Description
      rdf:about="/sam/files/nb1/chapter_1">
    <dcterms:references
       rdf:resource="http://collab/paper2.pdf"/>
    <dcterms:references
       rdf:resource="http://collab/paper1.pdf"/>
  </rdf:Description>
</rdf:RDF>

Complex Property: samns:children

<samns:notebookroot xmlns="..."
     xlink:href="/files/nb_1/">
  <samns:child
    xlink:href="/files/nb1/chapter_1/"
    xlink:title="c1" />
  <samns:child
    xlink:href="/files/nb1/chapter_2/"
    xlink:title="c2" />
</samns:notebookroot>

Inferred RDF

<rdf:RDF xmlns="...">
  <rdf:Description rdf:about="/sam/files/nb_1">
    <samns:children
        rdf:resource="/files/nb1/chapter_2/"
        rdf:ID="statement1" />
    <samns:children
        rdf:resource="/files/nb1/chapter_1/"
        rdf:ID="statement2" />
  </rdf:Description>
  <rdf:Description rdf:about="#statement1">
    <samns:notebookroot
        rdf:resource="/files/nb_1/" />
  </rdf:Description>
  <rdf:Description rdf:about="#statement2">
    <samns:notebookroot
        rdf:resource="/files/nb_1/" />
  </rdf:Description>
</rdf:RDF>

   Since WebDAV, with our conventions for interpreting XML property values,
provides a basic means of reading and writing semantic relationships about a
resource, our initial focus in providing RDF-based functionality was on
returning provenance information, i.e. subgraphs of related resources.
Towards this end, very early in the SAM project we implemented dynamically
generated properties whose values include all resources linked to the current
resource by a specified subset of properties, down to a specified maximum
link depth. These properties rely on a common configuration resource that
specifies the desired properties and the maximum traversal depth. For the
pedigreerdf property, the value is in RDF. For the pedigreegxl property, the
same subgraph is encoded in the Graph Exchange Language (GXL) [16], which can
be consumed directly by a number of graph display toolkits. (As these
properties were intended as a temporary measure primarily supporting the CMCS
project (see Discussion), they are both in the CMCS
http://purl.oclc.org/NET/SAM/cmcs namespace.)

Figure 3. A simple example showing the process within SAM to extract metadata
from within files and generate WebDAV properties and RDF available as a query
response. (The optional invocation of a web service before the XSLT step is
not shown.)

   More recently, we have been implementing RDF-related capabilities as
extensions to the implementation of the DAV Searching and Locating (DASL)
SEARCH method now available within Slide. DASL defines a basic grammar having
an SQL-like format, as well as a means to define extended grammars. The basic
grammar supports returning a set of properties for all resources within a
specified scope and meeting the specified
conditions. For example, one could request the DAV:displayname of all
documents within “/projects” whose dc:creator property includes “Jane Smith”.
For SAM, we have extended this grammar in two ways. First, we allow the scope
to be specified in terms of a root resource, a set of properties to follow,
and a depth, allowing the query to be run over a subgraph analogous to that
returned through the pedigreerdf property. Second, we have extended the
select mechanism to enable RDF encoding of the return value, i.e. returning
the set of properties on matching nodes as a set of RDF statements generated
using the conventions discussed previously. Implementing these capabilities
through the SEARCH method instead of through properties allows the set of
properties to follow and the depth limit to be specified per query rather
than configured per server. Further, it separates the list of properties to
be returned from those used to define the scope.

5. Discussion

   SAM’s ability to separate the effort required for making data semantics
explicit from the development and use of scientific applications has a number
of potential benefits in the context of community-wide collaborations and
grid-based computing. Most directly, SAM allows the costs of describing data
semantics explicitly to be borne by third parties and/or delayed until the
benefits of such labeling can be realized. With SAM’s approach, groups
wishing to take advantage of metadata-based searching, provenance tracking,
annotation services, and other semantic capabilities do not have to involve
the developers of all of the domain software they intend to use in reaching
agreement on shared ontologies and upgrading software. Instead, groups can
independently define metadata extractors/annotators as needed that expose as
much or as little semantic detail as required, mapping it directly into the
desired vocabulary. This elevates the concept of a virtual organization from
an administrative unit managing access controls and allocations to one that
may also manage shared semantics.
   The decision to base SAM initially on WebDAV and an open source content
repository implementation has allowed us to quickly gain practical
experience. Most important in the context of this paper has been a
collaboration with the Collaboratory for Multiscale Chemical Sciences (CMCS).
As reported elsewhere, CMCS has integrated SAM into its collaborative
framework (named “KnECS”) and has made heavy use of SAM’s metadata extractor
and translation capabilities to customize the framework and portal for
chemical science. CMCS gathers metadata using extractors, through web forms,
and from WebDAV-enabled applications, PSEs, and web services. CMCS provides a
number of general tools that make use of the federated metadata, ranging from
a data browser and metadata-based search tool to a provenance graphing
portlet. Feedback from the CMCS project has been invaluable in refining SAM
capabilities and prioritizing development and, while the CMCS project is
ongoing, their experience suggests that the decoupling SAM allows will be
very important in allowing groups to assemble a comprehensive, living corpus
of semantically tagged data and for scaling and evolving collaborative tools
in general.
   Discussions with CMCS, other collaborators, and developers interested in
semantic technologies in general have identified a number of strengths and
limitations of SAM’s current capabilities and indicated several promising
directions for enhancements. While WebDAV and our mapping of properties to
RDF statements clearly provide only a subset of what RDF can encode, they
have largely proved sufficient to represent the metadata being produced by
traditional scientific applications as well as by tools such as electronic
notebooks. The WebDAV PROPFIND and PROPPATCH methods are conceptually similar
to the HTTP extensions proposed as part of the URI Query Agent Model [17] for
accessing semantic information, and, by emphasizing access based on a
resource URL, WebDAV presents a set of information very similar to the
Concise Bounded Description of a resource, i.e. a subgraph of outbound
relationships [18].
   In general, SAM’s configurable mechanism for mapping between metadata in
files and WebDAV properties has worked well. While we anticipate migrating
from BFD to DFDL as implementations appear, which should broaden the range of
files that can be handled and simplify script development, and we may add
some enhancements such as a mechanism to allow extractors to be registered
for multiple file types at once, the current capabilities largely address the
requirements that have been identified.
   The mapping between WebDAV properties and RDF and the interface(s) to RDF
are less mature and we expect a number of changes. While the conventions we
have implemented appear to cover most of the current use cases, there is
clearly a desire from developers and end users to have more control over the
layout of property values: simply alternate ways of specifying multiple
relationships within a property and one level of reification. Further, we
anticipate a need to represent more complex graphs in the future as new
semantic applications are developed. Towards these ends, we intend to provide
a mechanism, analogous to that used to move from metadata in files to
properties, to allow the mapping from properties to RDF to be configured on a
per-property basis. Following the DFDL model (annotating an XML schema with
instructions on how to populate an instance of the schema from ASCII/binary
data), this might involve the annotation of an RDF Schema or OWL description
with instructions for creating an instance from WebDAV
properties. We also intend to investigate adding a SPARQL-based [19] grammar
for DASL, implementing the URIQA MPUT, MGET, and MDELETE HTTP methods, and/or
implementing semantic grid service interfaces as they are standardized.
Lastly, while SAM currently maps semantic relationships to WebDAV properties
and stores them as such, if the usage of RDF increases, we can potentially
invert the mapping direction and use a native RDF store and map to WebDAV
properties dynamically from the RDF rather than the other way around.

6. Conclusions

   With the capabilities reported here, SAM now provides a complete
binary-to-RDF pathway for exposing semantic information implicit in science
applications and their output file formats. We believe that SAM demonstrates
the viability of a bridging approach to include existing scientific
applications in semantic data grids. Further, the use of SAM in projects such
as CMCS is beginning to demonstrate the value of this approach in reducing
integration and system evolution costs in collaborative systems. Absent a
strong driver for upgrading current scientific software to be semantically
explicit, the ability to provide and track semantic information without
having to rewrite an existing application will be needed for quite some time.
While SAM is still evolving and does not implement a full SDG, we believe
that the concepts being explored within SAM will be critical to the
successful realization of SDGs capable of seamlessly integrating an evolving
mix of applications and supporting collaboration at the scale required for
next-generation information intensive research.

7. Acknowledgements

   This work was supported as part of the Scientific Annotation Middleware
(SAM) project. Employees of Battelle Memorial Institute, which operates
Pacific Northwest National Laboratory for the US Department of Energy under
contract DE-AC06-76RL01830 and Oak Ridge National Laboratory under contract
DE-AC05-00OR22725, wrote this manuscript. The authors also acknowledge
helpful discussions and ongoing collaborations with members of the
Collaboratory for Multiscale Chemical Science (CMCS) project.

8. References

[1] Bettina Berendt, Andreas Hotho, Gerd Stumme, “Towards Semantic Web
    Mining”, International Semantic Web Conference (ISWC), 2002.
[2] James D. Myers, Alan R. Chappell, Matthew Elder, Al Geist, Jens
    Schwidder, “Re-Integrating The Research Record”, Computing in Science
    and Engineering, May/June 2003, pp. 44-50.
[3] Jakarta Slide Java Content Management System Website,
    http://jakarta.apache.org/slide/index.html
[4] Distributed Authoring and Versioning (DAV) website, http://webdav.org
[5] J. Whitehead, “WebDAV: Versatile Collaboration Multiprotocol”, IEEE
    Internet Computing, vol. 9, no. 1, Jan/Feb 2005, pp. 66-74.
    http://www.soe.ucsc.edu/~ejw/papers/dav-ic-2005-final.pdf
[6] DAV Searching and Locating (DASL) Draft Specification,
    http://webdav.org/dasl
[7] WebDAV Projects, http://webdav.org/projects/
[8] J. Novotny, S. Tuecke, and V. Welch, “An Online Credential Repository
    for the Grid: MyProxy”, Proceedings of the Tenth International Symposium
    on High Performance Distributed Computing (HPDC-10), IEEE Press, August
    2001.
[9] Binary Format Description (BFD) Language Home Page,
    http://collaboratory.emsl.pnl.gov/sam/bfd/
[10] Roy Williams, “XSIL: Java/XML for Scientific Data”, July 2000,
    http://www.cacr.caltech.edu/SDA/xsil
[11] Data Format Description Language (DFDL),
    http://forge.gridforum.org/projects/dfdl-wg/
[12] Dublin Core Metadata Initiative, http://dublincore.org/
[13] Stephen Buswell, “Extracting Semantics from XML Structure”,
    http://www.w3.org/2001/sw/Europe/reports/xslt_schematron_tool/
[14] James Myers, Elena Mendoza, and Bonnie Hoopes, “A Collaborative
    Electronic Notebook”, Proceedings of the IASTED International Conference
    on Internet and Multimedia Systems and Applications (IMSA 2001), August
    13-16, 2001, Honolulu, Hawaii.
[15] James D. Myers, Thomas C. Allison, Sandra Bittner, Brett Didier,
    Michael Frenklach, William H. Green, Jr., Yen-Ling Ho, John Hewson,
    Wendy Koegler, Carina Lansing, David Leahy, Michael Lee, Renata McCoy,
    Michael Minkoff, Sandeep Nijsure, Gregor von Laszewski, David Montoya,
    Carmen Pancerella, Reinhardt Pinzon, William Pitz, Larry A. Rahn, Branko
    Ruscic, Karen Schuchardt, Eric Stephan, Al Wagner, Theresa Windus,
    Christine Yang, “A Collaborative Informatics Infrastructure for
    Multi-scale Science”, Proceedings of the Challenges of Large
    Applications in Distributed Environments (CLADE) Workshop, June 7, 2004,
    Honolulu, HI, pp. 24-33.
[16] A. Winter, B. Kullbach, V. Riediger, “An Overview of the GXL Graph
    Exchange Language”, in S. Diehl (ed.), Software Visualization,
    International Seminar, Dagstuhl Castle, Germany, May 20-25, 2001,
    Revised Lectures, Springer Verlag, available from
    http://www.gupro.de/GXL/index.html
[17] Patrick Stickler, “URIQA: The Nokia URI Query Agent Model”,
    specification, 2003-2004, http://swdev.nokia.com/uriqa/URIQA.html
[18] Patrick Stickler, “CBD - Concise Bounded Description”, specification,
    2003-2004, http://swdev.nokia.com/uriqa/CBD.html
[19] “SPARQL Query Language for RDF”, W3C Working Draft,
    http://www.w3.org/TR/rdf-sparql-query/