Adding eScience Assets to the Data Web

Herbert Van de Sompel, Los Alamos National Laboratory, Los Alamos, NM USA, herbertv@lanl.gov
Carl Lagoze, Cornell University, Ithaca, NY USA, lagoze@cs.cornell.edu
Michael L. Nelson, Old Dominion University, Norfolk, VA USA, mln@cs.odu.edu
Simeon Warner, Cornell University, Ithaca, NY USA, simeon@cs.cornell.edu
Robert Sanderson, University of Liverpool, Liverpool, UK, azaroth@liv.ac.uk
Pete Johnston, Eduserv Foundation, Bath, UK, pete.johnston@eduserv.org.uk

ABSTRACT                                                             In parallel with this change in research methodology there
Aggregations of Web resources are increasingly important in       has been substantial change in the way that research results
scholarship as it adopts new methods that are data-centric,       are communicated. With the emergence of the Web, schol-
collaborative, and network-based. The same notion of ag-          arly publishers, both commercial and learned societies, al-
gregations of resources is common to the mashed-up, socially      most universally deliver journal papers, conference proceed-
networked information environment of Web 2.0. We present          ings, and monographs via the Web. While Web delivery of
a mechanism to identify and describe aggregations of Web          research results has improved their accessibility and search-
resources that has resulted from the Open Archives Initia-        ability, it represents an evolution of traditional publication
tive - Object Reuse and Exchange (OAI-ORE) project. The           practices rather than a fundamental change in the scholarly
OAI-ORE specifications are based on the principles of the         communication paradigm. Even in their digital manifesta-
Architecture of the World Wide Web, the Semantic Web,             tions, scholarly publications are mostly textually-based and
and the Linked Data effort. Therefore, their incorporation        static. To date, there are few examples of scholarly com-
into the cyberinfrastructure that supports eScholarship will      munication that move beyond the dissemination of these
ensure the integration of the products of scholarly research      traditional artifacts into a more data-centric, semantically-
into the Data Web.                                                linked, and social network-embedded scholarly communica-
                                                                  tion model that resembles the profound changes in social,
                                                                  political, and economic discourse characteristic of Web 2.0.
Categories and Subject Descriptors                                This radically different model would expose process as well
H.5.4 [Information Systems]: Hypertext/Hypermedia                 as product [39], improving opportunities to verify the repro-
                                                                  ducibility of research results, and making the full spectrum
                                                                  of artifacts generated in the scholarly value chain available
General Terms                                                     for reuse [41].
Design, Standardization                                              The deployment of radically new models depends on the
                                                                  development of basic technical infrastructure, so-called cy-
                                                                  berinfrastructure. This cyberinfrastructure must include a
Keywords                                                          number of components. These include a means to identify
Cyberinfrastructure, eScience, OAI-ORE, Web Architecture,         and cite datasets in the scholarly discourse (e.g., [38, 1]),
Linked Data, RDF, Atom                                            a standard for identifying scholarly authors to unambigu-
                                                                  ously tie them to their creations and improve the quality of
1.   INTRODUCTION                                                 scientometric information (e.g., ResearcherID1 and Digital
                                                                  Author Identifier2 ), and standards to allow machine read-
  The rapid evolution of computing, networking, and data
                                                                  ability of the products of scholarly process thereby facilitat-
capturing technologies, along with advances in data mining
                                                                  ing computational analysis and extraction of secondary and
and analysis, are fundamentally changing the way scholarly
                                                                  tertiary knowledge products. Semantic technologies are an
research is conducted [2, 5]. Although there are differences
                                                                  important component of this cyberinfrastructure, providing
amongst disciplines in their receptivity to change [13], an
                                                                  a foundation for open agreements on data formats, metadata
increasing number of scholars in the natural sciences, social
                                                                  frameworks to describe data, and ontology-based solutions
sciences, and humanities have adopted new research meth-
                                                                  for formal representation of scientific knowledge, all of which
ods that are network-based, highly collaborative, and data-
                                                                  are important components of promoting a machine-readable
intensive. Because of the central role of vast amounts of data
                                                                  scholarly record.
in these new research methods, there has been increased
                                                                     This paper focuses on one aspect of this cyberinfrastruc-
attention to sustainable infrastructures for registering, pre-
serving, and sharing datasets [17].
1 http://www.researcherid.com/
2 http://www.surffoundation.nl/smartsite.dws?ch=eng&id=13480
Copyright is held by the author/owner(s).
LDOW 2009, April 20, 2009, Madrid, Spain.
ture that arises from the changing nature of publications         document. In addition, prototypes exist of applications that
that are characteristic of collaborative, data-centric scholar-   allow authoring, storing, and disseminating more complex
ship. These emerging publications are aggregations of multi-      scholarly publications in the form of aggregations [8, 33,
ple resources. Such aggregations are already prevalent in ex-     42]. These more complex aggregations may consist of a tex-
isting scholarly repositories, which commonly offer access to     tual article, one or more datasets that led to the discoveries
textual documents in multiple formats, each available from        reported in the article, perhaps a visualization of a specific
a different network location. But, the changes in scholarship     state of the dataset, and the software used to generate the
described above, and especially the need to include data in       visualization. All constituents of such an aggregation are
the publication process, increases the complexity of these        distributed on the Web. One notable aspect of these more
aggregations and calls for the adoption of a common ap-           complex visions of an aggregate scholarly publication is the
proach to handle them. In the remainder of this paper,            importance of semantic relationships among constituents of
we describe our work within Open Archives Initiative - Ob-        the aggregation. These relationships include citation, ver-
ject Reuse and Exchange (OAI-ORE), a two-year project             sioning, provenance, commentary, and the like.
to investigate common methods to handle aggregations of              Some characteristics of the aggregations that are already
Web resources that culminated in October 2008 with the            common in scholarship can be illustrated by means of a doc-
release of the OAI-ORE specifications [28]. These specifica-      ument from arXiv.org, a well-known repository of physics,
tions were motivated by the resource aggregations common          mathematics, and computer science research results. The
to scholarly communication. We believe that their generic,        human start page, or “splash page”, for this document is
Web-centric approach makes them applicable to use cases in        shown in Figure 1. Some aspects of the page relevant to the
the Web at large, providing the basis for improved search re-     resource aggregation problem are highlighted in red rectan-
sults, improved information navigation, and richer services       gles, each with a number. The meanings of the highlighted
within browsers for a large class of Web applications.            areas are as follows:
   The OAI-ORE specifications leverage the principles of the
Architecture of the World Wide Web, the Semantic Web,               1. The URI http://arxiv.org/abs/astro-ph/0601007
and the Linked Data effort. As a result, future develop-               of the human start page for the arXiv document.
ments in cyberinfrastructure and scholarly communication            2. The formats in which the document is available, i.e.
that are based on OAI-ORE will integrate well with the                 PostScript, PDF, etc. These are effectively the con-
Web and with the tools, agents and applications that oper-             stituents of the aggregation that is the arXiv docu-
ate within it. This will make it possible to embed or mash up          ment.
the products of scholarship into cyber-learning efforts, co-
operative reference tools such as Wikipedia, and the larger         3. The title of the arXiv document.
social discourse that is now characteristic of Web 2.0. The
essence of the OAI-ORE solution to the resource aggregation         4. The authors of the arXiv document.
problem can be summarized as follows:                               5. The creation and last modification date of the arXiv
     • The data model is expressed in terms of the primi-              document.
       tives of Web Architecture and the Semantic Web: Re-          6. Identifiers of resources that are in some manner compa-
       sources, Representations, URIs and RDF triples.                 rable to this arXiv document. For example, a version
     • The central entity in the data model, the Aggregation,          of this document was later published as an article in a
       is a Resource that stands for a set of other Resources.         peer-reviewed journal, and the Digital Object Identi-
       An Aggregation is a Resource with a URI but without             fier of that article is shown.
       a Representation (we refer to this as a non-document         7. The versions of this arXiv document.
       Resource from now on). This approach is aligned with
       the manner in which real-world entities or concepts are      8. Links to other arXiv documents in the same collection
       included in the Web via the mechanisms proposed by              (i.e., astro-ph).
       the Linked Data effort [4].
                                                                    9. Citations made by this arXiv document, and citations
     • Another Resource, the Resource Map, has a Represen-             it received from other documents.
       tation that is a description of the Aggregation. The
       Resource Map is accessible via the URI of the Aggre-          This rather simple example highlights the core issues that
       gation using the mechanisms defined for Cool URIs for      OAI-ORE addresses. First, although the URI of the hu-
       the Semantic Web [36].                                     man start page is commonly used as the URI for the entire
                                                                  arXiv document, within the Web Architecture that URI only
     • The Representation of a Resource Map is a serializa-       identifies the page itself, and not the aggregation that is the
       tion of the triples that describe the Aggregation. The     arXiv document. The ability to cite, annotate, version, and
       specification describes RDF/XML, RDFa, and Atom            associate properties with the aggregation itself relies on it
       serialization syntaxes.                                    having a unique identity, distinct from the splash page or
                                                                  the resources linked from it.
2.    AGGREGATIONS                                                   Second, without the use of (frequently imperfect) heuris-
                                                                  tics unique to the specific human start page, it is not read-
2.1     Aggregations in Scholarly Communication                   able by machines and agents. Because the HTML of this
   Most institutional repositories [24, 31] routinely store and   human start page usually leaves the semantics of hyperlinks
disseminate relatively simple aggregations, consisting of mul-    undefined, a machine agent cannot unambiguously distin-
tiple access formats (e.g., PDF, HTML, LaTeX) for the same        guish between links to constituents (e.g. the PostScript,
                                                                    OAI-PMH specific manner, often preventing general Web
                                                                    clients that are unaware of the protocol from accessing the
                                                                    available metadata [19].
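The protocol-specific URI construction that OAI-PMH expects of its clients (a repository base URI combined with an item identifier and a format tag) can be illustrated with a minimal sketch. The base URI and item identifier below are hypothetical; the verb and argument names (GetRecord, identifier, metadataPrefix) are those of the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Combine the three protocol-specific parts into one OAI-PMH request URI."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for fetching one record
        "identifier": identifier,           # identifier of the item of interest
        "metadataPrefix": metadata_prefix,  # the format tag
    })
    return f"{base_url}?{query}"


# Hypothetical repository base URI and item identifier, for illustration only.
url = build_getrecord_url(
    "http://repository.example.org/oai",
    "oai:repository.example.org:1234",
    "oai_dc",
)
print(url)
```

A general Web client that knows only the URI of an item has no way to derive this request without knowing the protocol, which is the shortcoming discussed above.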
                                                                       The Web-centric, resource-centric approach of OAI-ORE
                                                                    rectifies this architectural shortcoming and thereby provides
                                                                    the foundation for full accessibility of the products of eScience
                                                                    in the general Web environment. Furthermore, it makes the
                                                                    solution available to a broader class of Web applications in
                                                                    which the practice of aggregating resources is quite com-
                                                                    mon. For example, we accumulate URLs in bookmarks or
                                                                    favorites lists in our browser, collect photos into sets in pop-
                                                                    ular sites like Flickr, browse over multiple page documents
                                                                    that are linked together through “prev” and “next” tags,
                                                                    and talk about Web sites as if they had some real existence
                                                                    beyond the set of pages of which they consist. Despite our
                                                                    frequent use of these aggregations, their existence on the
                                                                    Web is quite ephemeral because there is no common way
                                                                    to identify, describe, and hence handle them. This is what
                                                                    OAI-ORE provides.
Figure 1: The implicitly defined members of a scholarly aggregation.

                                                                    3.    THE OAI-ORE SOLUTION
PDF, etc.) of the document and links that point at infor-              In this section we describe the various elements of the
mation that is clearly outside of the document such as the          OAI-ORE solution to the resource aggregation problem out-
navigational aids shown as (8) in Figure 1. Similarly, agents       lined above. It encompasses an RDF-based data model, syn-
can not interpret relationships of the document to other doc-       taxes for serializing instances of the data model, and mech-
uments, identifiers related to this document, versions of this      anisms for providing HTTP access to those serializations.
document, etc.                                                      Complete details are available through the OAI-ORE docu-
   In essence, the problem is that there is no standard way         mentation suite [28].
to describe the constituents or boundary of an aggregation,            As noted earlier, this solution is based on the primitives
or to qualify and identify a resource as being an aggregation.      defined in the Architecture of the World Wide Web [23] that
While a robot could learn the semantics implied by arXiv’s          defines a Resource as an item of interest; a URI as a global
HTML in Figure 1, such “screen scraping” is brittle and not         identifier for a Resource; and a Representation as a datas-
scalable for applications accessing aggregations in thousands       tream corresponding to the state of a Resource at the time
of different repositories, each with their own presentation         its URI is dereferenced via some protocol (e.g. HTTP). In
idiom.                                                              addition, the solution is grounded in the principles intro-
                                                                    duced by the Semantic Web, in which URIs are also used
2.2    Integrating Aggregations into the Web                        to identify non-document Resources, such as real-world enti-
   A number of early efforts in cyberinfrastructure, for exam-      ties (e.g. people or cars), or even abstract entities (e.g. ideas
ple the initial grid architecture [40] and technologies for digi-   or classes). These non-document Resources have no Repre-
tal libraries, leveraged aspects of the Web infrastructure but      sentation to indicate their meaning. OAI-ORE adopts the
often failed to fully conform with Web Architecture princi-         following approach, proposed by the Linked Data effort [4],
ples. For example, institutional repositories frequently have       for obtaining information about those Resources:
identifier schemes and access protocols distinct from those
existing on the Web at large. As a result, much of their                 • Use of HTTP URIs to identify those non-document
content is accessible on the Web, but it poorly integrates                 Resources;
with mainstream Web applications and may even be over-                   • Publication of another Resource with a Representation
looked by major search engines, unless the search engines                  that provides information about the non-document Re-
make special accommodations for their protocols and access                 source at a HTTP URI other than the HTTP URI of
schemes.                                                                   the non-document Resource;
   Our prior work on the Open Archives Initiative Proto-
col For Metadata Harvesting (OAI-PMH) [26] demonstrates                  • Leverage of HTTP mechanisms to allow discovery of
this problem. OAI-PMH is an interoperability specification                 the HTTP URI of the published resource from the
released in 2001 aimed at streamlining the process of incre-               HTTP URI of the non-document resource.
mentally collecting XML metadata (typically bibliographic
metadata) from information systems. It shares many de-              3.1     Data Model
sign characteristics with Atom [35] and is widely adopted in           The essence of the RDF-based data model is described
its targeted community of scholarly repositories. But, OAI-         here and is illustrated in Figure 2. The full details are
PMH, in contrast to Atom, has not gained broader adoption,          available in the OAI-ORE Abstract Data Model specifica-
mainly because its architecture is not well aligned with the        tion [27].
Resource/URI/Representation foundations of the Web Ar-                 In order to be able to unambiguously refer to an aggre-
chitecture. For example, OAI-PMH clients must construct             gation of Web resources, a new Resource is introduced that
a request URI by combining a repository specific base URI,          stands for a set or collection of other Resources. This new
the identifier of the item of interest, and a format tag in an      Resource, named an Aggregation, has a URI just like any
            Figure 2: A Resource Map describes an Aggregation with three Aggregated Resources.


other Resource on the Web. And, since an Aggregation is
a conceptual construct, it is a non-document Resource that
does not have a Representation.
   Following the Linked Data guidelines, another Resource
is introduced to make information about the Aggregation
available. This new Resource, named a Resource Map, has
a URI and a machine-readable Representation that provides
details about the Aggregation. In essence, a Resource Map
expresses which Aggregation it describes (the ore:describes
relationship in Figure 2), and it lists the Aggregated Re-
sources that are part of the Aggregation (the ore:aggregates
relationship in Figure 2, a subproperty of
dcterms:hasPart). But, a Resource Map can also express
relationships and properties pertaining to all these Resources,
as well as metadata pertaining to the Resource Map itself,
e.g. who published it and when it was most recently modified
(the dcterms:creator and dcterms:modified relationships in
Figure 2). A Resource Map can also express relationships of the
Aggregation, Aggregated Resources, and the Resource Map itself
with any arbitrary other Resource, as long as the resulting RDF
graph is connected.
   In addition, for discovery purposes, the data model allows
a Resource Map to express that an Aggregated Resource of a
specific Aggregation is also part of another Aggregation. This
is achieved by means of the ore:isAggregatedBy relationship
(the inverse of ore:aggregates) between the Aggregated Resource
and that other Aggregation. Stating that an Aggregated Resource
is itself an Aggregation (nesting Aggregations) is also
supported. For that purpose, an ore:isDescribedBy relationship
(the inverse of ore:describes, and a subproperty of
rdfs:seeAlso) is expressed between the Aggregated Resource and
a Resource Map that describes it as being itself an
Aggregation. Furthermore, the use of non-protocol-based
identifiers (such as DOIs) that can be expressed as URIs is
quite common for referencing scholarly assets. In order to
support this practice, the ore:similarTo relationship can be
expressed between an Aggregation and a resource that is in some
way equivalent and is identified by a non-protocol-based URI.
The specificity of ore:similarTo is situated between
rdfs:seeAlso and owl:sameAs.

3.2   Proxies: Aggregated Resources in Context
   We note that the URI asserted in a Resource Map to denote an
Aggregated Resource of a particular Aggregation is no different
than the URI that denotes that Resource independent of the
Aggregation. However, it is important in scholarly
communication, among other things for the purpose of citing and
expressing provenance, that a resource such as a dataset
included in some context, for example a specific article, be
distinct from the same dataset outside the context of that
article, or in the context of another article.
   To accomplish this differentiation, OAI-ORE introduces the
notion of a Proxy. A Proxy is a Resource that stands for an
Aggregated Resource in the context of a specific Aggregation.
The URI of a Proxy provides a mechanism for denoting a Resource
in context. Figure 3 shows the ore:proxyFor and ore:proxyIn
relationships between a Proxy and an Aggregated Resource and an
Aggregation, respectively. It also illustrates how citing the
Aggregated Resource is different from citing its Proxy: the
former cites a Resource “as is”, the latter cites that Resource
as it exists in the context of a specific Aggregation. In order
to work seamlessly in the Web and to provide context
information to OAI-ORE aware clients, resolution of HTTP URIs
assigned to Proxies must lead to the Aggregated Resource, and
the response must include an HTTP Link header [34] that points
to the Aggregation.

Figure 3: Citing a Resource in the context of an Aggregation.
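As a rough illustration of the data model and of Proxies, the relationships can be written as plain subject-predicate-object triples, and Proxy resolution as a function that returns a redirect together with a Link header. This is only a sketch: all example.org URIs are hypothetical, and the use of a 303 status code and of the proxyIn term as the Link relation value are assumptions made for the example rather than details taken from the text above.

```python
ORE = "http://www.openarchives.org/ore/terms/"

# Hypothetical URIs, used only for illustration.
REM = "http://example.org/rem/1"            # Resource Map
AGG = "http://example.org/aggregation/1"    # Aggregation (non-document Resource)
ARTICLE = "http://example.org/article.pdf"  # Aggregated Resource
DATASET = "http://example.org/dataset.csv"  # Aggregated Resource
PROXY = "http://example.org/proxy/1/dataset"

# Triples a Resource Map could assert about the Aggregation and one Proxy.
graph = {
    (REM, ORE + "describes", AGG),
    (AGG, ORE + "aggregates", ARTICLE),
    (AGG, ORE + "aggregates", DATASET),
    (PROXY, ORE + "proxyFor", DATASET),  # the Resource the Proxy stands for
    (PROXY, ORE + "proxyIn", AGG),       # the context in which it stands for it
}


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def resolve_proxy(triples, proxy_uri):
    """Sketch of a response to a GET on a Proxy URI.

    Leads the client to the Aggregated Resource and points to the Aggregation
    in a Link header. Status code and rel value are assumptions of this sketch.
    """
    target = objects(triples, proxy_uri, ORE + "proxyFor")[0]
    aggregation = objects(triples, proxy_uri, ORE + "proxyIn")[0]
    headers = {
        "Location": target,
        "Link": f'<{aggregation}>; rel="{ORE}proxyIn"',
    }
    return 303, headers


status, headers = resolve_proxy(graph, PROXY)
print(objects(graph, AGG, ORE + "aggregates"))
print(status, headers)
```

A citation of PROXY rather than DATASET thus refers to the dataset as it is used in this particular Aggregation, while any client that simply follows the redirect still arrives at the dataset itself.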
3.3   Resource Map Serializations
   A Resource Map has a Representation that describes an
Aggregation in some serialization syntax. OAI-ORE ex-
plicitly specifies three serialization syntaxes, Atom XML,
RDF/XML, and RDFa, while other serialization syntaxes
are possible. Which one to choose will largely depend on
the use case and on the technical environment available to a
Resource Map publisher. For example, in cases where an ex-
pressive HTML splash page exists an RDFa approach might
be attractive. Note that multiple Resource Maps, each us-
ing a different serialization syntax can describe the same
Aggregation, and that these may differ in expressiveness (footnote 3).
   Although the data model is based on RDF, we were com-
mitted to also specify a serialization based on Atom, to al-
low Aggregations to become the subject of Web 2.0 reuse
scenarios and of workflows based on the Atom Publishing          Figure 4: Discovering a Resource Map from an Ag-
Protocol [18]. The Atom Publishing Protocol adds a uni-          gregation using Cool URIs for the Semantic Web.
form read/write approach to Web 2.0, which could be of
significant benefit in scholarly communication scenarios.
   However, the task of reconciling the data model with the
Atom model proved to be non-trivial due to tensions be-          dress this we distinguish between Authoritative and Non-
tween the RDF model and the XML-oriented Atom spec-              Authoritative Resource Maps in the same way as the Linked
ification. The former is graph-based, with precise seman-        Data guidelines. An Authoritative Resource Map is one
tics that are global rather than local to a specific document.   that is accessible by dereferencing the URI of the Aggrega-
The latter is hierarchical, (XML) document-centric, and has      tion that it describes, for example using the aforementioned
intentionally loose element definitions. It took several, dra-   Cool URI mechanisms. A Non-Authoritative Resource Map
matically different iterations of the Atom serialization to      is one not reachable in this manner. The rationale for this
arrive at an acceptable solution.                                approach is that the party that introduces a new Aggrega-
   The resulting approach expresses an Aggregation by means      tion simultaneously mints URIs for both the Aggregation
of an Atom entry, and makes use of Atom’s extensibility          and the Resource Map, and actually controls both.
mechanisms in much the same way as Google Data does. For
example, Atom’s link element with an OAI-ORE-specific            4.    EARLY DEMONSTRATORS
value for the rel attribute is used to aggregate resources.
                                                                    Since the OAI-ORE specifications have only been released
And, awaiting a solution from the Atom community to
                                                                 recently, an in-depth evaluation of functionality, adoption,
express triples, an ore:triples element was introduced to
                                                                 and impact is premature. Still, in this section we give an
act as a wrapper for RDF descriptions. To support un-
                                                                 insight in efforts by early adopters to leverage the specifica-
ambiguous interpretation of Atom serializations of Resource
                                                                 tions. Four use cases are described below. Additional illus-
Maps, a GRDDL transform was implemented that extracts
                                                                 trations of its application are provided by the submissions
all contained triples that pertain to the OAI-ORE data model,
                                                                 to the ORE Challenge at RepoCamp 2008 (footnote 5).
both from the native Atom elements and from the ore:triples
extension element, and expresses them in RDF/XML4 .              4.1    Foresite: Revealing Aggregations
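The Atom approach of Section 3.3, in which an entry stands for an Aggregation and link elements with an OAI-ORE-specific rel value list the Aggregated Resources, can be sketched with the Python standard library. This is an illustrative fragment, not a complete or validated Resource Map: the rel value is assumed to be the URI of the ore:aggregates term, the PDF and PostScript URIs are illustrative, and elements that a full Atom entry requires (such as id and updated) are left out.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumed rel value: the URI of the ore:aggregates term.
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def build_aggregation_entry(title, aggregated_uris):
    """Build a minimal Atom entry whose link elements list Aggregated Resources."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    for uri in aggregated_uris:
        ET.SubElement(
            entry,
            f"{{{ATOM_NS}}}link",
            {"rel": ORE_AGGREGATES, "href": uri},
        )
    return entry


def extract_aggregated_uris(entry):
    """Read back the href of every link that carries the aggregates rel value."""
    return [
        link.get("href")
        for link in entry.findall(f"{{{ATOM_NS}}}link")
        if link.get("rel") == ORE_AGGREGATES
    ]


# Illustrative constituent URIs for the arXiv document of Figure 1.
parts = [
    "http://arxiv.org/pdf/astro-ph/0601007",
    "http://arxiv.org/ps/astro-ph/0601007",
]
entry = build_aggregation_entry("Example arXiv aggregation", parts)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
print(extract_aggregated_uris(ET.fromstring(xml_text)))
```

An Atom-aware client that does not know the rel value can still process the entry as ordinary Atom, while an OAI-ORE-aware client can recover the list of Aggregated Resources from the same links.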
3.4   Leveraging HTTP                                              In order to provide feedback on the evolving OAI-ORE
                                                                 specification, the UK’s Joint Information Systems Commit-
   In order to make OAI-ORE work in the HTTP-based
                                                                 tee (JISC)6 funded an experiment to investigate applying it
Web, both the Aggregation and the Resource Map are as-
                                                                 to an extensive scholarly collection: the approximately four
signed HTTP URIs, and the Cool URIs for the Semantic
                                                                 million articles that are part of the JSTOR7 collection. By
Web guidelines [36] are adopted to support discovery of the
                                                                 developing open source OAI-ORE libraries8 and applying
HTTP URI of a Resource Map given the HTTP URI of an
                                                                 them to produce interlinked Resource Maps, the Foresite
Aggregation. Figure 4 illustrates a situation in which the
                                                                 project effectively demonstrated the feasibility of exposing
arXiv Aggregation is described by both an Atom XML and
                                                                 common scholarly artifacts to the Data Web in the manner
an RDF/XML Resource Map, and in which a client is led
                                                                 proposed by OAI-ORE. The project provided valuable feed-
to the Atom version via an HTTP 303 redirect and Content
                                                                 back that helped refine the OAI-ORE specifications, and
Negotiation.
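The discovery step of Section 3.4 can be sketched as a function that decides how a server answers a GET on the Aggregation URI. Since the Aggregation has no Representation, the function never returns content; it returns a 303 redirect to one of the Resource Maps, selected from the Accept header. The Resource Map URIs and the fallback choice are hypothetical, and the Accept parsing is deliberately simplified (it ignores q-values and wildcards).

```python
# Hypothetical Resource Map URIs for one Aggregation, keyed by media type.
RESOURCE_MAPS = {
    "application/atom+xml": "http://example.org/rem/1.atom",
    "application/rdf+xml": "http://example.org/rem/1.rdf",
}
DEFAULT_MEDIA_TYPE = "application/rdf+xml"  # assumed fallback for this sketch


def respond_to_aggregation_request(accept_header: str):
    """Return (status, headers) for a GET on the Aggregation URI.

    The first media type in the Accept header for which a Resource Map
    exists wins; otherwise the default Resource Map is used.
    """
    chosen = DEFAULT_MEDIA_TYPE
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()  # drop parameters such as q=0.9
        if media_type in RESOURCE_MAPS:
            chosen = media_type
            break
    # 303 See Other: the target describes the Aggregation, it is not the Aggregation.
    return 303, {"Location": RESOURCE_MAPS[chosen], "Vary": "Accept"}


print(respond_to_aggregation_request("application/atom+xml"))
print(respond_to_aggregation_request("text/html, application/rdf+xml;q=0.9"))
```

Under these assumptions, a client asking for Atom is led to the Atom Resource Map, as in the situation of Figure 4, and a client with no matching preference is led to the RDF/XML one.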
                                                                 had a significant impact on the aforementioned discussions
3.5   Authoritative Resource Maps                                regarding the Atom serialization of Resource Maps.
                                                                   The overall structure of the Aggregations, and associated
  After one party has published a Resource Map that con-
                                                                 Resource Maps, produced for the JSTOR collection mirrors
tains a description and a URI for a new Aggregation, any
                                                                 the journal - issue - article hierarchy of the JSTOR content.
other party can publish competing or even conflicting Re-
                                                                 Each journal is modeled as an Aggregation of journal issues;
source Maps that describe the same Aggregation. To ad-
                                                                 5
3
  See http://www.openarchives.org/ore/atom for detailed            http://www.openarchives.org/ore/RepoCamp2008/
                                                                 6
Atom and RDF/XML versions of Resources Maps corre-                 http://www.jisc.ac.uk/
                                                                 7
sponding to Figure 1.                                              http://www.jstor.org/
4                                                                8
  http://www.openarchives.org/ore/atom-grddl                       http://foresite-toolkit.googlecode.com/
                                                                   Figure 6: The Foresite plug-in models Flickr Sets as
Figure 5: The hierarchical structure of the JSTOR                  OAI-ORE Aggregations, and visualizes them.
collection mapped to the OAI-ORE data model.
Note that 1..1 cardinalities are omitted from the
diagram for clarity.
each issue is an Aggregation of articles; and each article is an
Aggregation of individual page images and a PDF-formatted version of the
entire article (Figure 5). The Aggregated Resources at each level are also
the subject and/or object of a fst:followedBy relationship introduced to
preserve the page-turning order for pages within an article, articles
within an issue and so forth. Because fst:followedBy is not a global
relationship, but rather only applies within the context of a specific
Aggregation, Proxies for these Aggregated Resources were introduced. The
article Aggregations interlink via dcterms:references relationships for
citations, further confirming the necessity of the graph-based nature of
the OAI-ORE data model, even though the main JSTOR content hierarchy is
tree-shaped. The Resource Maps were published on a Web server at the
University of Liverpool.
   The resulting OAI-ORE descriptions are of immediate business importance
to JSTOR. While JSTOR stores the OCR-ed full-text of each article, it is
only able to openly expose this kind of topological metadata, and would
lose its market advantage (and the participation of contributing
publishers) if the full-text were exposed. Having the topology of their
collection available in a standardized format that provides links back to
their protected full-text documents and images facilitates reuse in third
party applications that can help drive traffic to the JSTOR site and
increase its customer base.
   In order to provide a value-added service on the basis of the generated
Resource Maps without requiring JSTOR to integrate prototype code into
their production portal, the Foresite Explorer, a visualization
application9, was developed using GreaseMonkey10 and its cross-site capable
XmlHttpRequest. This one-click-install plug-in for Firefox11 extracts the
URI of the resource that is currently being viewed in the JSTOR Web
interface and retrieves the associated RDF/XML Resource Map that describes
the Aggregation to which the Web resource corresponds from the Liverpool
Web server. The plug-in then parses and displays the Resource Map graph via
dynamic SVG. Nodes in the display represent Aggregations, Aggregated
Resources, and related Resources. Nodes for Aggregations can be clicked to
expand or contract the visualization; in case of expansion, new Resource
Maps are obtained, parsed, and again visualized.
   Further experiments using the same approach were carried out on
mainstream Web portals, leveraging the provided Web service APIs to obtain
metadata, and to express it according to the ORE data model. Flickr12 and
Amazon13 were selected, and wrapper services were built to generate
Resource Maps on demand through REST interactions, and to publish them on
the Liverpool server. Flickr provides a rich dataset with photos, photo
sets, users, groups, favorites and even comments and tags that can all be
modeled as Aggregations. Figure 6 shows a visualization of the structure of
the Flickr Set “Glaciers” that consists of five photographs. In the
Foresite Explorer, this set is represented with an Aggregation visualized
as the top right node within the OAI-ORE logo (left bottom of Figure 6),
emitting a red dcterms:creator arc and a white ore:aggregates arc. The
latter leads to the five photographs. The third photograph is selected, and
another white ore:aggregates arc reaches out to the available image files
(differing image resolutions) represented as black nodes. The purple nodes
indicate other aggregations in which the selected photo is aggregated.

9   http://foresite.cheshire3.org/explorer/
10  http://www.greasespot.net/
11  http://www.mozilla.com/firefox/
12  http://www.flickr.com/
13  http://www.amazon.com/

   Amazon offers fewer constructs that readily map to the OAI-ORE data
model, but user wishlists are a compelling one. The mapping to the data
model is as follows: a wishlist becomes an Aggregation, and wished-for
items become Aggregated Resources. Interestingly, each item in an Amazon
wishlist has a unique identifier by which it is purchased. That identifier
is only valid within that specific wishlist to allow tracking of individual
items, once purchased. These wishlist-specific constructs map directly to
the Proxies of the OAI-ORE model. The GreaseMonkey script was updated to
discover these identifiers that are necessary to interact with the Amazon
Web services, and Proxy-based relationships
were added to the visualization.
   Overall, the Foresite experiment has illustrated the applicability of
the OAI-ORE resource aggregation model as well as the feasibility of
leveraging it to create a value-added service. It has demonstrated this for
both common scholarly communication artifacts and specific constructs used
by popular Web portals. The Foresite experiment will be described in more
detail in a dedicated, future publication.

4.2      Astronomy Publication Workflow
   Datasets are of fundamental importance in observational sciences such as
astronomy. The astronomy community has developed sophisticated repositories
and data standards, exemplified by the Sloan Digital Sky Survey14 and the
National Virtual Observatory15, which provide excellent facilities for
registering and accessing large datasets. However, when submitting an
article, both new datasets that were created to arrive at findings reported
in an article, and data citation information that reveals the reuse of
existing datasets are often lost, “left behind” on the personal computer of
the author.
   A team at Johns Hopkins University is collaborating with the American
Astronomical Society to capture datasets as part of the publication
workflow [9]. In the newly devised publication workflows, OAI-ORE
Aggregations are used to glue an article and its associated datasets
together, and Resource Maps that describe these Aggregations are the tokens
that move around between author, publisher and dataset repository as the
publication process proceeds [10]. At each stage of the publication
workflow, the Resource Map is used to convey the current state of the
Aggregation, and is then updated to reflect the new state that is then
passed on to the next workflow phase. For example, as a Resource Map is
passed from the publisher to the dataset repository and back again, it is
updated to contain the URIs of datasets that are registered in the
repository, and that were used for the article. This allows the publisher
to link to the datasets that were used for a specific article, and the
repository to link to papers that used a specific dataset.
   Generally, the availability of these Aggregations enables new services
to be built on both the publishing platform and the data repository. If the
practices proposed by this novel publication workflow became commonplace,
they would represent a significant improvement in the efficiency of
scientific communication.

4.3      Authoring, Editing and Reusing
   The success of OAI-ORE depends on the ease with which Aggregations and
Resource Maps are authored and disseminated on the Web. In many cases, they
will be generated automatically based on information that is available in
an information system. For example, the arXiv.org database contains all
information that is necessary to automatically generate Aggregations and
their associated Resource Maps, as shown in the Appendices. And, in the
astronomy project described above, the ability to create Resource Maps is
built into familiar authoring environments in a manner that makes it a
side-effect of the authoring process and thus minimizes the burden on
authors.
   Like all cyberinfrastructure, the success of such authoring environments
depends on the manner in which assembling all resources that relate to a
particular research task or publication fits into the normal scholarly
workflow. Two authoring environments that demonstrate this are the
Literature Object Reuse and Exchange (LORE) tool created by Gerber et
al.16, and the SCOPE work of Cheung et al. [8, 21]. LORE is a Firefox
extension that communicates via Ajax with a Sesame2 data store for
maintaining the OAI-ORE graphs that are generated. LORE allows for the
generation of fine-grained metadata and relationships, for example by
indicating that a certain resource is contextual information about the
literature work that is being studied. The SCOPE work led to the
development of the Provenance Explorer, a stand-alone Java application with
functionalities similar to those of LORE, but aimed at the creation,
editing and publication of scientific compound objects.

4.4    Enhanced Publications
   The Dutch SURFshare program17 and the European DRIVER II project18 are
collaborating on cyberinfrastructure to join a multitude of scientific
repositories that hold publications and research data. The goal is to give
researchers better means to share and access scientific materials through
innovative services. One of the envisioned services relates to enhanced
publications, composites of textual publications and supporting resources
such as research data, visualizations, annotations, related websites, etc.
To ensure the integrity and usability of such enhanced publications it is
important that all their components and their interrelations are preserved.
   A study into object models suitable for the representation of enhanced
publications recommended the use of OAI-ORE. As a result, a demonstrator
project [20] was launched in which enhanced publications for multiple
scientific disciplines ranging from engineering to journalism were modeled
according to OAI-ORE, and in which approaches to meet a variety of
requirements were explored, including presentation, navigation, persistent
identification, granularity of referencing, handling of sequentially
ordered resources, visualization of interrelationships, etc. The results
are available at the project site19. The project chose RDF/XML to express
Resource Maps and uses an XSLT-based approach to dynamically generate an
HTML “splash page” from them. In each splash page, a Content tab (Figure 7)
lists all crucial metadata about the enhanced publication, prominently
shows its textual component and associated metadata, and neatly lists
additional resources again with metadata. Many of these resources are
themselves modeled as Aggregations, and hence also have their own splash
page. To support an understanding of the relationships among resources of
an Aggregation and of nested Aggregations, a Relations tab that loads a
Java applet fueled by Resource Map content is introduced. Overall, the
demonstrator is remarkable because of the elegance and simplicity of the
ORE implementation. It clearly illustrates that ORE can be used as a basic
model for enhanced publications, and points at the need for
community-defined vocabularies to convey expressive relationships among
scientific resources.

14  http://www.sdss.org/
15  http://www.us-vo.org/
16  http://www.openarchives.org/ore/RepoCamp2008/#LORE
17  http://www.surffoundation.nl/en/
18  http://www.driver-community.eu/
19  http://driver2.dans.knaw.nl/demonstrator/html/
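
   The splash-page generation described in Section 4.4 can be illustrated
with a small sketch. The demonstrator itself uses XSLT; the Python below is
only a stand-in that shows the same idea of reading an RDF/XML Resource Map
and listing the Aggregated Resources as HTML. The sample Resource Map, its
example.org URIs, and the function name splash_page are hypothetical, and
the parser assumes the simple rdf:Description layout of the sample rather
than arbitrary RDF/XML.

```python
import xml.etree.ElementTree as ET
from html import escape

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"
DCT = "http://purl.org/dc/terms/"

# Hypothetical Resource Map; every example.org URI is made up.
RESOURCE_MAP = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ore="http://www.openarchives.org/ore/terms/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/rem/1">
    <ore:describes rdf:resource="http://example.org/aggregation/1"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/aggregation/1">
    <dcterms:title>An enhanced publication</dcterms:title>
    <ore:aggregates rdf:resource="http://example.org/article.pdf"/>
    <ore:aggregates rdf:resource="http://example.org/dataset.csv"/>
  </rdf:Description>
</rdf:RDF>"""


def splash_page(rdf_xml: str) -> str:
    """Render a minimal HTML fragment for the Aggregation in a Resource Map."""
    root = ET.fromstring(rdf_xml)

    # The Resource Map points at its Aggregation via ore:describes.
    describes = root.find(f".//{{{ORE}}}describes")
    if describes is None:
        raise ValueError("no ore:describes statement found")
    agg_uri = describes.get(f"{{{RDF}}}resource")

    title = agg_uri  # fall back to the URI when no dcterms:title is given
    members = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        if desc.get(f"{{{RDF}}}about") != agg_uri:
            continue
        title_el = desc.find(f"{{{DCT}}}title")
        if title_el is not None and title_el.text:
            title = title_el.text
        # Each ore:aggregates statement names one Aggregated Resource.
        for member in desc.findall(f"{{{ORE}}}aggregates"):
            members.append(member.get(f"{{{RDF}}}resource"))

    items = "\n".join(
        f'  <li><a href="{escape(uri)}">{escape(uri)}</a></li>' for uri in members
    )
    return f"<h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>"
```

   Calling print(splash_page(RESOURCE_MAP)) prints a heading followed by a
two-item list. A fuller implementation would more likely rely on a complete
RDF parser, since the same graph can be serialized in many different
RDF/XML shapes.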
Figure 7: The splash page for an enhanced publication of the DRIVER II
project, dynamically rendered from an RDF/XML Resource Map.

5.      RELATED WORK
   Given the widespread use of aggregations in both the physical and the
Web world, it comes as no surprise that other efforts have investigated
this domain. Prior work in the Web realm can be grouped in two main
categories depending on the party that introduces aggregations. In one
case, that is the Web navigator (agent or reader); in the other case, it is
the administrator of a Web-based information system. We look at a number of
efforts in both categories, and evaluate their capabilities to identify
aggregations, to enumerate the constituent resources of an aggregation, to
express relationships among resources, and to accommodate resources that
are distributed on the Web.
   In the Web navigator case, either an interactive user groups resources
based on some intent, or a robot tries to infer the implicitly defined
members of an aggregation. The robotic approaches range from heuristics
[30, 14] to machine-learning [12, 11]. While these approaches are useful,
they are imperfect and dependent on the perception of those encoding the
heuristics or training set, and they do not necessarily reflect the
intention of the original authors of the Web resources. And, while these
approaches may succeed at selecting the distributed resources that are part
of an implicitly defined aggregation, they are not capable of inferring the
relationships between those resources, nor do they propose a way to
unambiguously describe the aggregation.
   The approaches that involve an interactive user include tools such as
GroupMe!20 and LinkBunch21. LinkBunch lets users submit several URIs that
are then assigned a new HTTP URI that, when dereferenced, returns an HTML
page that lists and links to the originally submitted URIs. The “bunch” has
a new HTTP URI identity, it enumerates its members, and it readily handles
distributed Web resources. However, the identity of the bunch is the same
as that of the HTML page that describes it, and expressing relationships
between the bunched resources is not supported. GroupMe! is similar, with
the addition of social tagging capabilities, but has the same problems as
LinkBunch.
   Some Web navigator approaches work in an opposite granular direction,
supporting disaggregation of a single Web resource (i.e., an HTML page)
into multiple resources. This can be done automatically, such as for
segmented display on limited devices such as PDAs [7] or for recovering
structured records from Web pages [15]. Decomposition can also be done
manually, such as for reuse and sharing of parts of a Web page (e.g.,
ClipMarks22). All these approaches, manual or automatic, can be thought of
as adding (or inferring) HTML anchors where none exist. These approaches
assign identities to the newly created resources (fragments of the original
resource), but they provide no means to describe the original resource as
an aggregation of these new resources, nor do they allow expressing
relationships among them.
   In approaches that have the administrator of a Web information system in
the driver's seat, several technologies exist to deal with resource
aggregations. Sitemaps were briefly considered as a serialization option
for Resource Maps. Google, Yahoo and Microsoft support the Sitemap Protocol
[16], a simple XML file format that allows Web sites to list the URIs they
want crawled by robots. Sitemaps provide for minimal metadata (e.g., last
modification date, update frequency and crawl priority), but no attempt is
made to provide semantic typing, and handling arbitrary distributed
resources is not supported. Indeed, in the interest of trust, the Sitemap
Protocol specifies a significant limitation on URI paths that can be listed
in a Sitemap file. For example, a Sitemap at level www.foo.com/a/b can list
URIs at level a/b and below, but it cannot list URIs at www.foo.com/a/c,
www.foo.com/d/ or www.bar.com/.
   We made a deliberate decision to avoid the many existing packaging
formats, such as MPEG-21 DIDL [3], METS [32], FOXML [25], IMS-CP [22], and
BagIt [6]. First, packaging base64-encoded content in a wrapper document
does not resonate well with the Resource/URI/Representation paradigm of the
Web Architecture. Still, most of these formats also support a by-reference
mechanism to deliver content, in which URIs can be used. However, although
these formats are prominent in their respective communities, they have not
gained an adoption comparable to that of Atom or RDF/XML. And while these
approaches can address identification and enumeration of distributed
resources, they have uneven capabilities to express the graph-based OAI-ORE
model, due to their hierarchical perspective.
   In the course of the OAI-ORE effort, we also attempted to model
aggregations as Atom feeds, not entries [29]. We ultimately decided that
was the wrong granularity, especially since common Web 2.0 reuse scenarios,
including use with the Atom Publishing Protocol, work at the level of Atom
entries. The Atom Syndication Format was preferred over the various RSS
formats in anticipation of using the Atom Publishing Protocol [18].

20  http://groupme.org/
21  http://linkbun.ch/
22  http://clipmarks.com/

   Some elements of the POWDER [37] specifications that
were developed in the same timeframe as OAI-ORE address a problem space
similar to that of OAI-ORE. However, POWDER’s focus is significantly
broader, and it approaches the problem from the opposite perspective,
focusing on capabilities to assert (via “Description Resources”) that a
group of resources share certain properties (e.g. access rights), rather
than asserting arbitrary properties about resources that, for some reason,
are grouped into an aggregation. That is, in POWDER the notion of shared
properties defines an aggregation, whereas in OAI-ORE an aggregation can be
created for any reason deemed important by its creator. Also, while POWDER
provides capabilities to describe a group of resources using a variety of
approaches including regular expressions, it does not introduce an identity
for the aggregation.

6.   CONCLUSIONS
   This paper has introduced the OAI-ORE solution to the resource
aggregation problem, which we argue meets a critical need in the
development of cyberinfrastructure and the next generation scholarly
communication infrastructure. Because the solution is aligned with the Web
Architecture and leverages the practices of the Semantic Web and Linked
Data effort, it will facilitate better integration of scholarly
communication with the mainstream Web, it will make scholarly artifacts
more readily usable with common Web tools and applications, and it will
benefit the broader community by making research materials more visible and
verifiable, and by facilitating unexpected reuse.
   While OAI-ORE was motivated by scholarly communication, we believe that
the proposed solution has broader applicability. Aggregations, sets, and
collections are as common on the Web as they are in the everyday physical
world. In many situations it would benefit agents and services if
aggregations were unambiguously enumerated and described, essentially
layering an additional level of resource granularity upon the Web.
   Evaluation of the OAI-ORE work depends on its adoption and evolution
over time. The work has so far benefited from significant community
involvement throughout the specification process, and the international
team that developed the solution includes representatives with backgrounds
in scholarly publishing, eScience, repository infrastructure, digital
libraries, Web search engines, linked data, and information
interoperability. Work by early adopters, such as the Foresite project and
the Johns Hopkins publication workflow project, is a promising indicator
that these community contributions have led to a solution that stands a
realistic chance of significant adoption.

7.   ACKNOWLEDGMENTS
   This work was supported by the National Science Foundation Divisions of
Information and Intelligent Systems and Undergraduate Education through
grant numbers IIS-0430906, IIS-0643784 and DUE-0840744, the Andrew W.
Mellon Foundation, Microsoft, and the Coalition for Networked Information.
Development of OAI-ORE was based on input from the OAI-ORE Technical
Committee, the OAI-ORE Liaison Group, the OAI-ORE Advisory Committee,
contributors to the OAI-ORE Google discussion group, and members of the
Digital Library Research & Prototyping Team of the Los Alamos National
Laboratory. Individuals are listed at http://www.openarchives.org/ore/.

8.   REFERENCES
 [1] M. Altman and G. King. A proposed standard for the scholarly citation
     of quantitative data. D-Lib Magazine, 13(3/4), 2007.
 [2] D. E. Atkins, K. K. Droegemeier, S. I. Feldman, H. Garcia-Molina,
     M. L. Klein, D. G. Messerschmitt, P. Messina, J. P. Ostriker, and
     M. H. Wright. Revolutionizing science and engineering through
     cyberinfrastructure, 2003.
 [3] J. Bekaert, E. De Kooning, and H. Van de Sompel. Representing digital
     objects using MPEG-21 Digital Item Declaration. International Journal
     on Digital Libraries, 6(2):159–173, 2006.
 [4] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the
     web, 2007.
     http://sites.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/.
 [5] C. L. Borgman. Scholarship in the digital age: information,
     infrastructure, and the Internet. MIT Press, Cambridge, Mass., 2007.
 [6] A. Boyko, J. Kunze, J. Littman, and L. Madden. The bagit file package
     format (v0.95), Internet Draft, July 2008.
 [7] D. Chakrabarti, R. Kumar, and K. Punera. A graph-theoretic approach to
     webpage segmentation. In WWW ’08: Proceedings of the 17th
     international conference on World Wide Web, pages 377–386, 2008.
 [8] K. Cheung, J. Hunter, A. Lashtabeg, and D. J. SCOPE - a scientific
     compound object publishing and editing system. In 3rd International
     Digital Curation Conference, 2007.
 [9] S. Choudhury, T. DiLauro, A. Szalay, E. Vishniac, R. Hanisch,
     J. Steffen, R. Milkey, T. Ehling, and R. Plante. Digital data
     preservation for scholarly publications in astronomy. International
     Journal of Digital Curation, 2(2), 2007.
[10] T. DiLauro. OAI-ORE for publishing workflows: Data archiving for
     journals of the American Astronomical Society. In Open Repositories
     2008, 2008.
[11] P. Dmitriev. As we may perceive: finding the boundaries of compound
     documents on the web. In WWW ’08: Proceedings of the 17th
     international conference on World Wide Web, pages 1029–1030, 2008.
[12] P. Dmitriev, C. Lagoze, and B. Suchkov. As we may perceive: inferring
     logical documents from hypertext. In Proceedings of the sixteenth ACM
     conference on Hypertext and Hypermedia, pages 66–74, 2005.
[13] P. N. Edwards, S. J. Jackson, G. C. Bowker, and C. P. Knobel.
     Understanding infrastructure: Dynamics, tensions, and design.
     Technical report, National Science Foundation, January 2007.
[14] N. Eiron and K. McCurley. Untangling compound documents on the web.
     In Proceedings of the fourteenth ACM conference on Hypertext and
     Hypermedia, pages 85–94, 2003.
[15] D. Embley, Y. Jiang, and Y. Ng. Record-boundary discovery in Web
     documents. In Proceedings of the 1999 ACM SIGMOD international
     conference on Management of data, pages 467–478, 1999.
[16] Google, Microsoft, and Yahoo. Sitemaps XML format,             From hypermedia to datuments. Journal of Digital
     2008. http://www.sitemaps.org/protocol.php.                    Information, 5(1), 2004.
[17] J. Gray, A. S. Szalay, A. Thakar, C. Stoughton, and       [34] M. Nottingham. HTTP header linking, Internet Draft,
     J. vandenBerg. Online scientific data curation,                March 2008.
     publication, and archiving. Technical Report arXiv        [35] M. Nottingham and R. Sayre. The Atom syndication
     cs.DL/0208012, 2002.                                           format, Internet RFC-4287, December 2005.
[18] J. Gregorio and B. de hOra. The Atom publishing           [36] L. Sauermann and R. Cyganiak. Cool URIs for the
     protocol, Internet RFC-5023, December 2007.                    semantic web. Technical Report W3C Interest Group
[19] B. Haslhofer and B. Schandl. The OAI2LOD Server:               Note 31 March 2008, W3C, 2008.
     Exposing OAI-PMH Metadata as Linked Data. In              [37] K. Scheppe and D. Pentecost. Protocol for Web
     Proceedings of WWW 2008 Workshop Linked Data on                Description Resources (POWDER): Primer. Technical
     the Web (LDOW2008), Beijing, 2008.                             Report W3C Working Draft – 14 November 2008,
[20] M. Hoogerwerf. Durable enhanced publications. In               W3C, 2008.
     Proceedings of African Digital Scholarship & Curation     [38] J. E. Sieber and B. E. Trumbo. (not) giving credit
     2009, 2009.                                                    where credit is due: Citation of data sets. Science and
[21] J. Hunter and K. Cheung. Provenance explorer - a               Engineering Ethics, 1:11–20, 1995.
     graphical interface for constructing scientific           [39] A. Smith. The research library in the 21st century:
     publication packages from provenance trails.                   collecting, preserving, and making it accessible
     International Journal on Digital Libraries,                    resources for scholarship. In No Brief Candle:
     7(1-2):99–107.                                                 Reconceiving Research Libraries for the 21st Century.
[22] IMS Global Learning Consortium. IMS content                    Council on Library and Information Resources, 2008.
     packaging XML binding specification version 1.1.3.        [40] S. Tuecke, K. Czajkowski, I. Foster, J. Frey,
     http://www.imsglobal.org/content/packaging/, 2003.             S. Graham, C. Kesselman, T. Maquire, T. Sandholm,
[23] I. Jacobs and N. Walsh. Architecture of the world              D. Snelling, and P. Vanderbilt. Open Grid Services
     wide web, volume one. Technical Report W3C                     Infrastructure (OGSI): Version 1.0. Technical Report
     Recommendation 15 December 2004, W3C, 2004.                    draft-ggf-ogsi-gridservice-33, Global Grid Forum,
[24] R. K. Johnson. Institutional repositories: Partnering          January 27 2003.
     with faculty to enhance scholarly communication.          [41] H. Van de Sompel, S. Payette, J. Erickson, C. Lagoze,
     D-Lib Magazine, 8(11), 2002.                                   and S. Warner. Rethinking scholarly communication:
[25] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora:         Building the system that scholars deserve. D-Lib
     an architecture for complex objects and their                  Magazine, 10(9), 2004.
     relationships. International Journal on Digital           [42] R. Williams, R. Moore, and R. Hanisch. A virtual
     Libraries, 6(2):124–138, 2006.                                 observatory vision based on publishing and virtual
[26] C. Lagoze and H. Van de Sompel. The Open Archives              data. Technical report, US National Virtual
     Initiative: building a low-barrier interoperability            Observatory, 2003.
     framework. In JCDL ’01: Proceedings of the 1st
     ACM/IEEE-CS Joint Conference on Digital Libraries,
     pages 54–62, 2001.
[27] C. Lagoze, H. Van de Sompel, P. Johnston, M. Nelson,
     R. Sanderson, and S. Warner. ORE Specification -
     Abstract Data Model, 2008.
     http://www.openarchives.org/ore/datamodel.
[28] C. Lagoze, H. Van de Sompel, P. Johnston, M. Nelson,
     R. Sanderson, and S. Warner. ORE Specification and
     User Guide - Table of Contents, 2008.
     http://www.openarchives.org/ore/1.0/toc.
[29] C. Lagoze, H. Van de Sompel, P. Johnston, M. L.
     Nelson, R. Sanderson, and S. Warner. Object Re-Use
     & Exchange: A Resource-Centric Approach. Technical
     Report arXiv:0804.2273, 2008.
[30] W. Li, O. Kolak, Q. Vu, and H. Takano. Defining
     logical domains in a web site. In Proceedings of the
     eleventh ACM Conference on Hypertext and Hypermedia, pages
     123–132, 2000.
[31] C. A. Lynch. Institutional repositories: Essential
     infrastructure for scholarship in the digital age. ARL:
     A Bimonthly Report, (226), 2003.
[32] J. P. McDonough. METS: Standardized encoding for
     digital library objects. International Journal on
     Digital Libraries, 6(2):148–158, 2006.
[33] P. Murray-Rust and H. Rzepa. The next big thing: