=Paper=
{{Paper
|id=Vol-538/paper-9
|storemode=property
|title=Adding eScience Assets to the Data Web
|pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper8.pdf
|volume=Vol-538
}}
==Adding eScience Assets to the Data Web==
Herbert Van de Sompel, Los Alamos National Laboratory, Los Alamos, NM USA, herbertv@lanl.gov
Carl Lagoze, Cornell University, Ithaca, NY USA, lagoze@cs.cornell.edu
Michael L. Nelson, Old Dominion University, Norfolk, VA USA, mln@cs.odu.edu
Simeon Warner, Cornell University, Ithaca, NY USA, simeon@cs.cornell.edu
Robert Sanderson, University of Liverpool, Liverpool, UK, azaroth@liv.ac.uk
Pete Johnston, Eduserv Foundation, Bath UK, pete.johnston@eduserv.org.uk
ABSTRACT
Aggregations of Web resources are increasingly important in scholarship as it adopts new methods that are data-centric, collaborative, and network-based. The same notion of aggregations of resources is common to the mashed-up, socially networked information environment of Web 2.0. We present a mechanism to identify and describe aggregations of Web resources that has resulted from the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) project. The OAI-ORE specifications are based on the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. Therefore, their incorporation into the cyberinfrastructure that supports eScholarship will ensure the integration of the products of scholarly research into the Data Web.

Categories and Subject Descriptors
H.5.4 [Information Systems]: Hypertext/Hypermedia

General Terms
Design, Standardization

Keywords
Cyberinfrastructure, eScience, OAI-ORE, Web Architecture, Linked Data, RDF, Atom

Copyright is held by the author/owner(s).
LDOW 2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION
The rapid evolution of computing, networking, and data capturing technologies, along with advances in data mining and analysis, is fundamentally changing the way scholarly research is conducted [2, 5]. Although there are differences amongst disciplines in their receptivity to change [13], an increasing number of scholars in the natural sciences, social sciences, and humanities have adopted new research methods that are network-based, highly collaborative, and data-intensive. Because of the central role of vast amounts of data in these new research methods, there has been increased attention to sustainable infrastructures for registering, preserving, and sharing datasets [17].

In parallel with this change in research methodology there has been substantial change in the way that research results are communicated. With the emergence of the Web, scholarly publishers, both commercial and learned societies, almost universally deliver journal papers, conference proceedings, and monographs via the Web. While Web delivery of research results has improved their accessibility and searchability, it represents an evolution of traditional publication practices rather than a fundamental change in the scholarly communication paradigm. Even in their digital manifestations, scholarly publications are mostly textually-based and static. To date, there are few examples of scholarly communication that move beyond the dissemination of these traditional artifacts into a more data-centric, semantically-linked, and social network-embedded scholarly communication model that resembles the profound changes in social, political, and economic discourse characteristic of Web 2.0. This radically different model would expose process as well as product [39], improving opportunities to verify the reproducibility of research results, and making the full spectrum of artifacts generated in the scholarly value chain available for reuse [41].

The deployment of radically new models depends on the development of basic technical infrastructure, so-called cyberinfrastructure. This cyberinfrastructure must include a number of components. These include a means to identify and cite datasets in the scholarly discourse (e.g., [38, 1]), a standard for identifying scholarly authors to unambiguously tie them to their creations and improve the quality of scientometric information (e.g., ResearcherID¹ and Digital Author Identifier²), and standards to allow machine readability of the products of scholarly process, thereby facilitating computational analysis and extraction of secondary and tertiary knowledge products. Semantic technologies are an important component of this cyberinfrastructure, providing a foundation for open agreements on data formats, metadata frameworks to describe data, and ontology-based solutions for formal representation of scientific knowledge, all of which are important components of promoting a machine-readable scholarly record.

¹ http://www.researcherid.com/
² http://www.surffoundation.nl/smartsite.dws?ch=eng&id=13480

This paper focuses on one aspect of this cyberinfrastructure that arises from the changing nature of publications that are characteristic of collaborative, data-centric scholarship. These emerging publications are aggregations of multiple resources. Such aggregations are already prevalent in existing scholarly repositories, which commonly offer access to textual documents in multiple formats, each available from a different network location. But the changes in scholarship described above, and especially the need to include data in the publication process, increase the complexity of these aggregations and call for the adoption of a common approach to handle them. In the remainder of this paper, we describe our work within Open Archives Initiative - Object Reuse and Exchange (OAI-ORE), a two-year project to investigate common methods to handle aggregations of Web resources that culminated in October 2008 with the release of the OAI-ORE specifications [28]. These specifications were motivated by the resource aggregations common to scholarly communication. We believe that their generic, Web-centric approach makes them applicable to use cases in the Web at large, providing the basis for improved search results, improved information navigation, and richer services within browsers for a large class of Web applications.

The OAI-ORE specifications leverage the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. As a result, future developments in cyberinfrastructure and scholarly communication that are based on OAI-ORE will integrate well with the Web and with the tools, agents and applications that operate within it. This will make it possible to embed or mash up the products of scholarship into cyber-learning efforts, cooperative reference tools such as Wikipedia, and the larger social discourse that is now characteristic of Web 2.0. The essence of the OAI-ORE solution to the resource aggregation problem can be summarized as follows:

• The data model is expressed in terms of the primitives of Web Architecture and the Semantic Web: Resources, Representations, URIs and RDF triples.
• The central entity in the data model, the Aggregation, is a Resource that stands for a set of other Resources. An Aggregation is a Resource with a URI but without a Representation (we refer to this as a non-document Resource from now on). This approach is aligned with the manner in which real-world entities or concepts are included in the Web via the mechanisms proposed by the Linked Data effort [4].
• Another Resource, the Resource Map, has a Representation that is a description of the Aggregation. The Resource Map is accessible via the URI of the Aggregation using the mechanisms defined for Cool URIs for the Semantic Web [36].
• The Representation of a Resource Map is a serialization of the triples that describe the Aggregation. The specification describes RDF/XML, RDFa, and Atom serialization syntaxes.

2. AGGREGATIONS

2.1 Aggregations in Scholarly Communication
Most institutional repositories [24, 31] routinely store and disseminate relatively simple aggregations, consisting of multiple access formats (e.g., PDF, HTML, LaTeX) for the same document. In addition, prototypes exist of applications that allow authoring, storing, and disseminating more complex scholarly publications in the form of aggregations [8, 33, 42]. These more complex aggregations may consist of a textual article, one or more datasets that led to the discoveries reported in the article, perhaps a visualization of a specific state of the dataset, and the software used to generate the visualization. All constituents of such an aggregation are distributed on the Web. One notable aspect of these more complex visions of an aggregate scholarly publication is the importance of semantic relationships among constituents of the aggregation. These relationships include citation, versioning, provenance, commentary, and the like.

Some characteristics of the aggregations that are already common in scholarship can be illustrated by means of a document from arXiv.org, a well-known repository of physics, mathematics, and computer science research results. The human start page, or “splash page”, for this document is shown in Figure 1. Some aspects of the page relevant to the resource aggregation problem are highlighted in red rectangles, each with a number. The meanings of the highlighted areas are as follows:

1. The URI http://arxiv.org/abs/astro-ph/0601007 of the human start page for the arXiv document.
2. The formats in which the document is available, i.e. PostScript, PDF, etc. These are effectively the constituents of the aggregation that is the arXiv document.
3. The title of the arXiv document.
4. The authors of the arXiv document.
5. The creation and last modification date of the arXiv document.
6. Identifiers of resources that are in some manner comparable to this arXiv document. For example, a version of this document was later published as an article in a peer-reviewed journal, and the Digital Object Identifier of that article is shown.
7. The versions of this arXiv document.
8. Links to other arXiv documents in the same collection (i.e., astro-ph).
9. Citations made by this arXiv document, and citations it received from other documents.

This rather simple example highlights the core issues that OAI-ORE addresses. First, although the URI of the human start page is commonly used as the URI for the entire arXiv document, within the Web Architecture that URI only identifies the page itself, and not the aggregation that is the arXiv document. The ability to cite, annotate, version, and associate properties with the aggregation itself relies on it having a unique identity, distinct from the splash page or the resources linked from it.

Second, without the use of (frequently imperfect) heuristics unique to the specific human start page, it is not readable by machines and agents. Because the HTML of this human start page usually leaves the semantics of hyperlinks undefined, a machine agent cannot unambiguously distinguish between links to constituents (e.g. the PostScript,
PDF, etc.) of the document and links that point at information that is clearly outside of the document such as the navigational aids shown as (8) in Figure 1. Similarly, agents cannot interpret relationships of the document to other documents, identifiers related to this document, versions of this document, etc.

Figure 1: The implicitly defined members of a scholarly aggregation.

In essence, the problem is that there is no standard way to describe the constituents or boundary of an aggregation, or to qualify and identify a resource as being an aggregation. While a robot could learn the semantics implied by arXiv’s HTML in Figure 1, such “screen scraping” is brittle and not scalable for applications accessing aggregations in thousands of different repositories, each with their own presentation idiom.

2.2 Integrating Aggregations into the Web
A number of early efforts in cyberinfrastructure, for example the initial grid architecture [40] and technologies for digital libraries, leveraged aspects of the Web infrastructure but often failed to fully conform with Web Architecture principles. For example, institutional repositories frequently have identifier schemes and access protocols distinct from those existing on the Web at large. As a result, much of their content is accessible on the Web, but it integrates poorly with mainstream Web applications and may even be overlooked by major search engines, unless the search engines make special accommodations for their protocols and access schemes.

Our prior work on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [26] demonstrates this problem. OAI-PMH is an interoperability specification released in 2001 aimed at streamlining the process of incrementally collecting XML metadata (typically bibliographic metadata) from information systems. It shares many design characteristics with Atom [35] and is widely adopted in its targeted community of scholarly repositories. But OAI-PMH, in contrast to Atom, has not gained broader adoption, mainly because its architecture is not well aligned with the Resource/URI/Representation foundations of the Web Architecture. For example, OAI-PMH clients must construct a request URI by combining a repository-specific base URI, the identifier of the item of interest, and a format tag in an OAI-PMH specific manner, often preventing general Web clients that are unaware of the protocol from accessing the available metadata [19].

The Web-centric, resource-centric approach of OAI-ORE rectifies this architectural shortcoming and thereby provides the foundation for full accessibility of the products of eScience in the general Web environment. Furthermore, it makes the solution available to a broader class of Web applications in which the practice of aggregating resources is quite common. For example, we accumulate URLs in bookmarks or favorites lists in our browser, collect photos into sets in popular sites like Flickr, browse over multi-page documents that are linked together through “prev” and “next” tags, and talk about Web sites as if they had some real existence beyond the set of pages of which they consist. Despite our frequent use of these aggregations, their existence on the Web is quite ephemeral because there is no common way to identify, describe, and hence handle them. This is what OAI-ORE provides.

3. THE OAI-ORE SOLUTION
In this section we describe the various elements of the OAI-ORE solution to the resource aggregation problem outlined above. It encompasses an RDF-based data model, syntaxes for serializing instances of the data model, and mechanisms for providing HTTP access to those serializations. Complete details are available through the OAI-ORE documentation suite [28].

As noted earlier, this solution is based on the primitives defined in the Architecture of the World Wide Web [23], which defines a Resource as an item of interest; a URI as a global identifier for a Resource; and a Representation as a datastream corresponding to the state of a Resource at the time its URI is dereferenced via some protocol (e.g. HTTP). In addition, the solution is grounded in the principles introduced by the Semantic Web, in which URIs are also used to identify non-document Resources, such as real-world entities (e.g. people or cars), or even abstract entities (e.g. ideas or classes). These non-document Resources have no Representation to indicate their meaning. OAI-ORE adopts the following approach, proposed by the Linked Data effort [4], for obtaining information about those Resources:

• Use of HTTP URIs to identify those non-document Resources;
• Publication of another Resource with a Representation that provides information about the non-document Resource at an HTTP URI other than the HTTP URI of the non-document Resource;
• Leverage of HTTP mechanisms to allow discovery of the HTTP URI of the published resource from the HTTP URI of the non-document resource.

3.1 Data Model
The essence of the RDF-based data model is described here and is illustrated in Figure 2. The full details are available in the OAI-ORE Abstract Data Model specification [27].

Figure 2: A Resource Map describes an Aggregation with three Aggregated Resources.

In order to be able to unambiguously refer to an aggregation of Web resources, a new Resource is introduced that stands for a set or collection of other Resources. This new Resource, named an Aggregation, has a URI just like any other Resource on the Web. And, since an Aggregation is a conceptual construct, it is a non-document Resource that does not have a Representation.
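
Because an Aggregation has no Representation, a client that dereferences its URI has to be pointed to a separate, describing Resource. The following is a minimal Python sketch of the three-step Linked Data pattern listed at the start of Section 3, using an in-memory stand-in for a Web server rather than real HTTP. The example.org URIs, the `DESCRIBED_BY` and `REPRESENTATIONS` tables, and the function names are hypothetical and introduced only for illustration; the 303 See Other response is assumed here as one possible discovery mechanism, not as the only one.

```python
# Hypothetical URIs; nothing here is taken from a real deployment.
AGGREGATION = "http://example.org/aggregation/1"
RESOURCE_MAP = "http://example.org/rem/1.rdf"

# Non-document Resources map to the URI of the Resource that describes them.
DESCRIBED_BY = {AGGREGATION: RESOURCE_MAP}

# Document Resources have a Representation: (media type, body).
REPRESENTATIONS = {
    RESOURCE_MAP: ("application/rdf+xml", "<rdf:RDF>...</rdf:RDF>"),
}


def dereference(uri):
    """Return (status, headers, body) the way a simple server might."""
    if uri in REPRESENTATIONS:
        media_type, body = REPRESENTATIONS[uri]
        return 200, {"Content-Type": media_type}, body
    if uri in DESCRIBED_BY:
        # No Representation exists, so point the client at the description.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    return 404, {}, ""


def fetch_description(uri, max_hops=5):
    """Follow 303 responses until something other than a redirect comes back."""
    for _ in range(max_hops):
        status, headers, body = dereference(uri)
        if status != 303:
            return uri, status, body
        uri = headers["Location"]
    raise RuntimeError("too many redirects")
```

In this sketch the Aggregation URI never returns a body of its own: `fetch_description(AGGREGATION)` ends at the Resource Map URI, which keeps the identity of the Aggregation distinct from the document that describes it.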
Following the Linked Data guidelines, another Resource is introduced to make information about the Aggregation available. This new Resource, named a Resource Map, has a URI and a machine-readable Representation that provides details about the Aggregation. In essence, a Resource Map expresses which Aggregation it describes (the ore:describes relationship in Figure 2), and it lists the Aggregated Resources that are part of the Aggregation (the ore:aggregates relationship in Figure 2, a subproperty of dcterms:hasPart). But a Resource Map can also express relationships and properties pertaining to all these Resources, as well as metadata pertaining to the Resource Map itself, e.g. who published it and when it was most recently modified (the dcterms:creator and dcterms:modified relationships in Figure 2). A Resource Map can also express relationships of the Aggregation, Aggregated Resources, and the Resource Map itself with any arbitrary other Resource, as long as the resulting RDF graph is connected.

In addition, for discovery purposes, the data model allows a Resource Map to express that an Aggregated Resource of a specific Aggregation is also part of another Aggregation. This is achieved by means of the ore:isAggregatedBy relationship (the inverse of ore:aggregates) between the Aggregated Resource and that other Aggregation. Stating that an Aggregated Resource is itself an Aggregation (nesting Aggregations) is also supported. To that purpose, an ore:isDescribedBy relationship (the inverse of ore:describes, and a subproperty of rdfs:seeAlso) is expressed between the Aggregated Resource and a Resource Map that describes it as being itself an Aggregation. Furthermore, the use of non-protocol-based identifiers (such as DOIs) that can be expressed as URIs is quite common for referencing scholarly assets. In order to support this practice, the ore:similarTo relationship can be expressed between an Aggregation and a somehow equivalent resource identified by a non-protocol-based URI. The specificity of ore:similarTo is situated between rdfs:seeAlso and owl:sameAs.

3.2 Proxies: Aggregated Resources in Context
We note that the URI asserted in a Resource Map to denote an Aggregated Resource of a particular Aggregation is no different than the URI that denotes that Resource independent of the Aggregation. However, it is important in scholarly communication, among other things for the purpose of citing and expressing provenance, that a resource such as a dataset included in some context, for example a specific article, be distinct from the same dataset outside the context of that article, or in the context of another article.

Figure 3: Citing a Resource in the context of an Aggregation.

To accomplish this differentiation, OAI-ORE introduces the notion of a Proxy. A Proxy is a Resource that stands for an Aggregated Resource in the context of a specific Aggregation. The URI of a Proxy provides a mechanism for denoting a Resource in context. Figure 3 shows the ore:proxyFor and ore:proxyIn relationships between a Proxy and an Aggregated Resource and an Aggregation, respectively. It also illustrates how citing the Aggregated Resource is different from citing its Proxy: the former cites a Resource “as is”, the latter cites that Resource as it exists in the context of a specific Aggregation. In order to work seamlessly in the Web and to provide context information to OAI-ORE aware clients, resolution of HTTP URIs assigned to Proxies must lead to the Aggregated Resource, and the response must include an HTTP Link header [34] that points to the Aggregation.
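
The relationships introduced in Sections 3.1 and 3.2 can be written down as a small set of triples. The sketch below, in plain Python without an RDF library, builds such a set for one Aggregation with two Aggregated Resources and one Proxy, and checks the connectedness condition mentioned in Section 3.1. All example.org URIs and the helper names are hypothetical, prefixed names such as `ore:aggregates` are kept as plain strings for brevity, and treating "connected" as connectivity of the undirected graph is an assumption made for this illustration.

```python
from collections import defaultdict

# Hypothetical URIs used only for illustration.
REM = "http://example.org/rem/1"
AGG = "http://example.org/aggregation/1"
PDF = "http://example.org/article.pdf"
DATA = "http://example.org/dataset.csv"
PROXY = "http://example.org/proxy/dataset-in-1"

triples = [
    (REM, "ore:describes", AGG),
    (REM, "dcterms:creator", "http://example.org/people/publisher"),
    (AGG, "ore:aggregates", PDF),
    (AGG, "ore:aggregates", DATA),
    # The Proxy stands for DATA in the context of AGG.
    (PROXY, "ore:proxyFor", DATA),
    (PROXY, "ore:proxyIn", AGG),
]


def is_connected(triples):
    """True if the triples form one connected graph, ignoring arc direction."""
    neighbours = defaultdict(set)
    for subject, _, obj in triples:
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)
    if not neighbours:
        return True
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(neighbours)


def aggregated_resources(triples, aggregation):
    """List the Resources that the given Aggregation aggregates."""
    return sorted(o for s, p, o in triples
                  if s == aggregation and p == "ore:aggregates")
```

A statement about a Resource that is not linked to the Resource Map, the Aggregation, or an Aggregated Resource would make `is_connected` return False, which is the situation the connectedness condition is meant to rule out.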
3.3 Resource Map Serializations
A Resource Map has a Representation that describes an Aggregation in some serialization syntax. OAI-ORE explicitly specifies three serialization syntaxes, Atom XML, RDF/XML, and RDFa, while other serialization syntaxes are possible. Which one to choose will largely depend on the use case and on the technical environment available to a Resource Map publisher. For example, in cases where an expressive HTML splash page exists an RDFa approach might be attractive. Note that multiple Resource Maps, each using a different serialization syntax, can describe the same Aggregation, and that these may differ in expressiveness³.
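
To make the idea of a serialization concrete, the following Python sketch writes two of the data model's relationships (ore:describes and ore:aggregates) as RDF/XML using only the standard library. It is an illustration under stated assumptions, not the normative OAI-ORE RDF/XML profile: the function name `resource_map_rdfxml` and the example.org URIs in the usage note are hypothetical, and metadata about the Resource Map itself (such as dcterms:creator and dcterms:modified) is omitted.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE_NS = "http://www.openarchives.org/ore/terms/"

# Use readable prefixes instead of ns0/ns1 in the output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ore", ORE_NS)


def resource_map_rdfxml(rem_uri, aggregation_uri, aggregated_uris):
    """Serialize a bare-bones Resource Map as an RDF/XML string."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")

    # The Resource Map describes the Aggregation.
    rem = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                        {f"{{{RDF_NS}}}about": rem_uri})
    ET.SubElement(rem, f"{{{ORE_NS}}}describes",
                  {f"{{{RDF_NS}}}resource": aggregation_uri})

    # The Aggregation aggregates each constituent Resource.
    agg = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                        {f"{{{RDF_NS}}}about": aggregation_uri})
    for uri in aggregated_uris:
        ET.SubElement(agg, f"{{{ORE_NS}}}aggregates",
                      {f"{{{RDF_NS}}}resource": uri})

    return ET.tostring(root, encoding="unicode")
```

Calling `resource_map_rdfxml("http://example.org/rem/1", "http://example.org/aggregation/1", ["http://example.org/a.pdf", "http://example.org/a.ps"])` yields a single `rdf:RDF` element with one description per subject. An Atom or RDFa serialization would carry the same triples in a different syntax.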
Although the data model is based on RDF, we were committed to also specify a serialization based on Atom, to allow Aggregations to become the subject of Web 2.0 reuse scenarios and of workflows based on the Atom Publishing Protocol [18]. The Atom Publishing Protocol adds a uniform read/write approach to Web 2.0, which could be of significant benefit in scholarly communication scenarios.

However, the task of reconciling the data model with the Atom model proved to be non-trivial due to tensions between the RDF model and the XML-oriented Atom specification. The former is graph-based, with precise semantics that are global rather than local to a specific document. The latter is hierarchical, (XML) document-centric, and has intentionally loose element definitions. It took several, dramatically different iterations of the Atom serialization to arrive at an acceptable solution.

The resulting approach expresses an Aggregation by means of an Atom entry, and makes use of Atom’s extensibility mechanisms in much the same way as Google Data does. For example, Atom’s link element with an OAI-ORE-specific value for the rel attribute is used to aggregate resources. And, awaiting a solution from the Atom community to express triples, an ore:triples element was introduced to act as a wrapper for RDF descriptions. To support unambiguous interpretation of Atom serializations of Resource Maps, a GRDDL transform was implemented that extracts all contained triples that pertain to the OAI-ORE data model, both from the native Atom elements and from the ore:triples extension element, and expresses them in RDF/XML⁴.

³ See http://www.openarchives.org/ore/atom for detailed Atom and RDF/XML versions of Resource Maps corresponding to Figure 1.
⁴ http://www.openarchives.org/ore/atom-grddl

3.4 Leveraging HTTP
In order to make OAI-ORE work in the HTTP-based Web, both the Aggregation and the Resource Map are assigned HTTP URIs, and the Cool URIs for the Semantic Web guidelines [36] are adopted to support discovery of the HTTP URI of a Resource Map given the HTTP URI of an Aggregation. Figure 4 illustrates a situation in which the arXiv Aggregation is described by both an Atom XML and an RDF/XML Resource Map, and in which a client is led to the Atom version via an HTTP 303 redirect and Content Negotiation.

Figure 4: Discovering a Resource Map from an Aggregation using Cool URIs for the Semantic Web.

3.5 Authoritative Resource Maps
After one party has published a Resource Map that contains a description and a URI for a new Aggregation, any other party can publish competing or even conflicting Resource Maps that describe the same Aggregation. To address this we distinguish between Authoritative and Non-Authoritative Resource Maps in the same way as the Linked Data guidelines. An Authoritative Resource Map is one that is accessible by dereferencing the URI of the Aggregation that it describes, for example using the aforementioned Cool URI mechanisms. A Non-Authoritative Resource Map is one not reachable in this manner. The rationale for this approach is that the party that introduces a new Aggregation simultaneously mints URIs for both the Aggregation and the Resource Map, and actually controls both.

4. EARLY DEMONSTRATORS
Since the OAI-ORE specifications have only been released recently, an in-depth evaluation of functionality, adoption, and impact is premature. Still, in this section we give an insight into efforts by early adopters to leverage the specifications. Four use cases are described below. Additional illustrations of their application are provided by the submissions to the ORE Challenge at RepoCamp 2008⁵.

⁵ http://www.openarchives.org/ore/RepoCamp2008/

4.1 Foresite: Revealing Aggregations
In order to provide feedback on the evolving OAI-ORE specification, the UK’s Joint Information Systems Committee (JISC)⁶ funded an experiment to investigate applying it to an extensive scholarly collection: the approximately four million articles that are part of the JSTOR⁷ collection. By developing open source OAI-ORE libraries⁸ and applying them to produce interlinked Resource Maps, the Foresite project effectively demonstrated the feasibility of exposing common scholarly artifacts to the Data Web in the manner proposed by OAI-ORE. The project provided valuable feedback that helped refine the OAI-ORE specifications, and had a significant impact on the aforementioned discussions regarding the Atom serialization of Resource Maps.

⁶ http://www.jisc.ac.uk/
⁷ http://www.jstor.org/
⁸ http://foresite-toolkit.googlecode.com/

The overall structure of the Aggregations, and associated Resource Maps, produced for the JSTOR collection mirrors the journal - issue - article hierarchy of the JSTOR content. Each journal is modeled as an Aggregation of journal issues; each issue is an Aggregation of articles; and each article is an Aggregation of individual page images and a PDF-formatted version of the entire article (Figure 5). The Aggregated Resources at each level are also the subject and/or object of a fst:followedBy relationship introduced to preserve the page-turning order for pages within an article, articles within an issue and so forth. Because fst:followedBy is not a global relationship, but rather only applies within the context of a specific Aggregation, Proxies for these Aggregated Resources were introduced. The article Aggregations interlink via dcterms:references relationships for citations, further confirming the necessity of the graph-based nature of the OAI-ORE data model, even though the main JSTOR content hierarchy is tree-shaped. The Resource Maps were published on a Web server at the University of Liverpool.

Figure 5: The hierarchical structure of the JSTOR collection mapped to the OAI-ORE data model. Note that 1..1 cardinalities are omitted from the diagram for clarity.

The resulting OAI-ORE descriptions are of immediate business importance to JSTOR. While JSTOR stores the OCR-ed full-text of each article, it is only able to openly expose this kind of topological metadata, and would lose its market advantage (and the participation of contributing publishers) if the full-text were exposed. Having the topology of their collection available in a standardized format that provides links back to their protected full-text documents and images facilitates reuse in third party applications that can help drive traffic to the JSTOR site and increase its customer base.

In order to provide a value-added service on the basis of the generated Resource Maps without requiring JSTOR to integrate prototype code into their production portal, the Foresite Explorer, a visualization application⁹, was developed using GreaseMonkey¹⁰ and its cross-site capable XMLHttpRequest. This one-click-install plug-in for Firefox¹¹ extracts the URI of the resource that is currently being viewed in the JSTOR Web interface and retrieves the associated RDF/XML Resource Map that describes the Aggregation to which the Web resource corresponds from the Liverpool Web server. The plug-in then parses and displays the Resource Map graph via dynamic SVG. Nodes in the display represent Aggregations, Aggregated Resources, and related Resources. Nodes for Aggregations can be clicked to expand or contract the visualization; in case of expansion, new Resource Maps are obtained, parsed, and again visualized.

⁹ http://foresite.cheshire3.org/explorer/
¹⁰ http://www.greasespot.net/
¹¹ http://www.mozilla.com/firefox/

Further experiments using the same approach were carried out on mainstream Web portals, leveraging the provided Web service APIs to obtain metadata, and to express it according to the ORE data model. Flickr¹² and Amazon¹³ were selected, and wrapper services were built to generate Resource Maps on demand through REST interactions, and to publish them on the Liverpool server. Flickr provides a rich dataset with photos, photo sets, users, groups, favorites and even comments and tags that can all be modeled as Aggregations. Figure 6 shows a visualization of the structure of the Flickr Set “Glaciers” that consists of five photographs. In the Foresite Explorer, this set is represented with an Aggregation visualized as the top right node within the OAI-ORE logo (left bottom of Figure 6), emitting a red dcterms:creator arc and a white ore:aggregates arc. The latter leads to the five photographs. The third photograph is selected, and another white ore:aggregates arc reaches out to the available image files (differing image resolutions) represented as black nodes. The purple nodes indicate other aggregations in which the selected photo is aggregated.

Figure 6: The Foresite plug-in models Flickr Sets as OAI-ORE Aggregations, and visualizes them.

¹² http://www.flickr.com/
¹³ http://www.amazon.com/

Amazon offers fewer constructs that readily map to the OAI-ORE data model, but the user wishlist is a compelling one. The mapping to the data model is as follows: a wishlist becomes an Aggregation, and wished-for items become Aggregated Resources. Interestingly, each item in an Amazon wishlist has a unique identifier by which it is purchased. That identifier is only valid within that specific wishlist to allow tracking of individual items, once purchased. These wishlist-specific constructs map directly to the Proxies of the OAI-ORE model. The GreaseMonkey script was updated to discover these identifiers that are necessary to interact with the Amazon Web services, and Proxy-based relationships
were added to the visualization.
Overall, the Foresite experiment has illustrated the applicability of the OAI-ORE resource aggregation model as well as the feasibility of leveraging it to create a value-added service. It has demonstrated this for both common scholarly communication artifacts and specific constructs used by popular Web portals. The Foresite experiment will be described in more detail in a dedicated, future publication.
4.2 Astronomy Publication Workflow
Datasets are of fundamental importance in observational sciences such as astronomy. The astronomy community has developed sophisticated repositories and data standards, exemplified by the Sloan Digital Sky Survey14 and the National Virtual Observatory15, which provide excellent facilities for registering and accessing large datasets. However, when an article is submitted, both the new datasets that were created to arrive at the findings reported in the article and the data citation information that reveals the reuse of existing datasets are often lost, “left behind” on the personal computer of the author.
A team at Johns Hopkins University is collaborating with the American Astronomical Society to capture datasets as part of the publication workflow [9]. In the newly devised publication workflows, OAI-ORE Aggregations are used to glue an article and its associated datasets together, and Resource Maps that describe these Aggregations are the tokens that move around between author, publisher and dataset repository as the publication process proceeds [10]. At each stage of the publication workflow, the Resource Map is used to convey the current state of the Aggregation, and is then updated to reflect the new state, which is passed on to the next workflow phase. For example, as a Resource Map is passed from the publisher to the dataset repository and back again, it is updated to contain the URIs of datasets that are registered in the repository and that were used for the article. This allows the publisher to link to the datasets that were used for a specific article, and the repository to link to papers that used a specific dataset.
Generally, the availability of these Aggregations enables new services to be built on both the publishing platform and the data repository. If the practices proposed by this novel publication workflow became commonplace, it would represent a significant improvement in the efficiency of scientific communication.
4.3 Authoring, Editing and Reusing
The success of OAI-ORE depends on the ease with which Aggregations and Resource Maps are authored and disseminated on the Web. In many cases, they will be generated automatically based on information that is available in an information system. For example, the arXiv.org database contains all information that is necessary to automatically generate Aggregations and their associated Resource Maps, as shown in the Appendices. And, in the astronomy project described above, the ability to create Resource Maps is built into familiar authoring environments in a manner that makes it a side-effect of the authoring process and thus minimizes the burden on authors.
Like all cyberinfrastructure, the success of such authoring environments depends on the manner in which assembling all resources that relate to a particular research task or publication fits into the normal scholarly workflow. Two authoring environments that demonstrate this are the Literature Object Reuse and Exchange (LORE) tool created by Gerber et al.16 and the SCOPE work of Cheung et al. [8, 21]. LORE is a Firefox extension that communicates via Ajax with a Sesame2 data store for maintaining the OAI-ORE graphs that are generated. LORE allows for the generation of fine-grained metadata and relationships, for example, indicating that a certain resource is contextual information about the literature work that is being studied. The SCOPE work led to the development of the Provenance Explorer, a stand-alone Java application with functionalities similar to those of LORE, but aimed at the creation, editing and publication of scientific compound objects.
4.4 Enhanced Publications
The Dutch SURFshare program17 and the European DRIVER II project18 are collaborating on cyberinfrastructure to join a multitude of scientific repositories that hold publications and research data. The goal is to give researchers better means to share and access scientific materials through innovative services. One of the envisioned services relates to enhanced publications, composites of textual publications and supporting resources such as research data, visualizations, annotations, related websites, etc. To ensure the integrity and usability of such enhanced publications it is important that all their components and interrelations are preserved.
A study into object models suitable for the representation of enhanced publications recommended the use of OAI-ORE. As a result, a demonstrator project [20] was launched in which enhanced publications for multiple scientific disciplines ranging from engineering to journalism were modeled according to OAI-ORE, and in which approaches to meet a variety of requirements were explored, including presentation, navigation, persistent identification, granularity of referencing, handling of sequentially ordered resources, visualization of interrelationships, etc. The results are available at the project site19. The project chose RDF/XML to express Resource Maps and uses an XSLT-based approach to dynamically generate an HTML “splash page” from them. In each splash page, a Content tab (Figure 7) lists all crucial metadata about the enhanced publication, prominently shows its textual component and associated metadata, and neatly lists additional resources, again with metadata. Many of these resources are themselves modeled as Aggregations, and hence also have their own splash page. To support an understanding of the relationships among resources of an Aggregation and of nested Aggregations, a Relations tab that loads a Java applet fueled by Resource Map content is introduced. Overall, the demonstrator is remarkable because of the elegance and simplicity of the ORE implementation. It clearly illustrates that ORE can be used as a basic model for enhanced publications, and points at the need for community-defined vocabularies to convey expressive relationships among scientific resources.
14 http://www.sdss.org/
15 http://www.us-vo.org/
16 http://www.openarchives.org/ore/RepoCamp2008/#LORE
17 http://www.surffoundation.nl/en/
18 http://www.driver-community.eu/
19 http://driver2.dans.knaw.nl/demonstrator/html/
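As a rough illustration of how an HTML splash page can be derived from an RDF/XML Resource Map (Section 4.4), the sketch below parses a small Resource Map and emits an HTML list of the aggregated resources. It is not the demonstrator's XSLT: it swaps in Python's standard xml.etree.ElementTree module purely for illustration. The Resource Map content, the example.org URIs, and the function name render_splash_page are assumptions made for this example; only the ore:describes and ore:aggregates terms are taken from the OAI-ORE vocabulary.

```python
import xml.etree.ElementTree as ET
from html import escape

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical Resource Map; every example.org URI is invented for illustration.
RESOURCE_MAP = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ore="http://www.openarchives.org/ore/terms/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/rem/1">
    <ore:describes rdf:resource="http://example.org/aggregation/1"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/aggregation/1">
    <dcterms:title>An enhanced publication</dcterms:title>
    <ore:aggregates rdf:resource="http://example.org/article.pdf"/>
    <ore:aggregates rdf:resource="http://example.org/dataset.csv"/>
  </rdf:Description>
</rdf:RDF>
"""


def qname(namespace, local):
    """Build the '{namespace}local' form that ElementTree uses for names."""
    return "{%s}%s" % (namespace, local)


def render_splash_page(rdf_xml):
    """Return a minimal HTML page listing the resources of the Aggregation."""
    root = ET.fromstring(rdf_xml)
    descriptions = root.findall(qname(RDF, "Description"))

    # Step 1: the Resource Map names the Aggregation it describes.
    aggregation_uri = None
    for desc in descriptions:
        describes = desc.find(qname(ORE, "describes"))
        if describes is not None:
            aggregation_uri = describes.get(qname(RDF, "resource"))
            break
    if aggregation_uri is None:
        raise ValueError("no ore:describes statement found")

    # Step 2: collect the title and the aggregated resources.
    title = aggregation_uri
    members = []
    for desc in descriptions:
        if desc.get(qname(RDF, "about")) != aggregation_uri:
            continue
        title_el = desc.find(qname(DCTERMS, "title"))
        if title_el is not None and title_el.text:
            title = title_el.text
        for member in desc.findall(qname(ORE, "aggregates")):
            members.append(member.get(qname(RDF, "resource")))

    # Step 3: emit the HTML, escaping all values taken from the Resource Map.
    items = "\n".join(
        '    <li><a href="%s">%s</a></li>' % (escape(uri), escape(uri))
        for uri in members
    )
    return (
        "<html>\n  <head><title>%s</title></head>\n  <body>\n"
        "  <h1>%s</h1>\n  <ul>\n%s\n  </ul>\n  </body>\n</html>"
        % (escape(title), escape(title), items)
    )
```

Calling render_splash_page(RESOURCE_MAP) yields a page whose heading is the Aggregation's title and whose list links to each aggregated resource. A nested Aggregation would appear as one more list entry, whose own Resource Map could be rendered the same way.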
Figure 7: The splash page for an enhanced publication of the DRIVER II project, dynamically rendered from an RDF/XML Resource Map.
5. RELATED WORK
Given the widespread use of aggregations in both the physical and the Web world, it comes as no surprise that other efforts have investigated this domain. Prior work in the Web realm can be grouped in two main categories depending on the party that introduces aggregations. In one case, that is the Web navigator (agent or reader); in the other case it is the administrator of a Web-based information system. We look at a number of efforts in both categories, and evaluate their capabilities to identify aggregations, to enumerate the constituent resources of an aggregation, to express relationships among resources, and to accommodate resources that are distributed on the Web.
In the Web navigator case, either an interactive user groups resources based on some intent, or a robot tries to infer the implicitly defined members of an aggregation. The robotic approaches range from heuristics [30, 14] to machine-learning [12, 11]. While these approaches are useful, they are imperfect and dependent on the perception of those encoding the heuristics or training set, and they do not necessarily reflect the intention of the original authors of the Web resources. And, while these approaches may succeed at selecting the distributed resources that are part of an implicitly defined aggregation, they are not capable of inferring the relationships between those resources, nor do they propose a way to unambiguously describe the aggregation.
The approaches that involve an interactive user include tools such as GroupMe!20 and LinkBunch21. LinkBunch lets users submit several URIs that are then assigned a new HTTP URI that, when dereferenced, returns an HTML page that lists and links to the originally submitted URIs. The “bunch” has a new HTTP URI identity, it enumerates its members, and it readily handles distributed Web resources. However, the identity of the bunch is the same as that of the HTML page that describes it, and expressing relationships between the bunched resources is not supported. GroupMe! is similar, with the addition of social tagging capabilities, but has the same problems as LinkBunch.
Some Web navigator approaches work in an opposite granular direction, supporting disaggregation of a single Web resource (i.e., an HTML page) into multiple resources. This can be done automatically, such as for segmented display on limited devices such as PDAs [7] or for recovering structured records from Web pages [15]. Decomposition can also be done manually, such as for reuse and sharing of parts of a Web page (e.g., ClipMarks22). All these approaches, manual or automatic, can be thought of as adding (or inferring) HTML anchors where none exist. These approaches assign identities to the newly created resources (fragments of the original resource), but they provide no approach to describe the original resource as an aggregation of these new resources, nor do they allow expressing relationships among them.
In approaches that have the administrator of a Web information system in the driver’s seat, several technologies exist to deal with resource aggregations. Sitemaps were briefly considered as a serialization option for Resource Maps. Google, Yahoo and Microsoft support the Sitemap Protocol [16], a simple XML file format that allows Web sites to list the URIs they want crawled by robots. Sitemaps provide for minimal metadata (e.g., last modification date, update frequency and crawl priority), but no attempt is made to provide semantic typing, and handling arbitrary distributed resources is not supported. Indeed, in the interest of trust, the Sitemap Protocol specifies a significant limitation on URI paths that can be listed in a Sitemap file. For example, a Sitemap at level www.foo.com/a/b can list URIs at level a/b and below, but it cannot list URIs at www.foo.com/a/c, www.foo.com/d/ or www.bar.com/.
We made a deliberate decision to avoid the many existing packaging formats, such as MPEG-21 DIDL [3], METS [32], FOXML [25], IMS-CP [22], and BagIt [6]. First, packaging base64-encoded content in a wrapper document does not resonate well with the Resource/URI/Representation paradigm of the Web Architecture. Still, most of these formats also support a by-reference mechanism to deliver content, in which URIs can be used. However, although these formats are prominent in their respective communities, they have not gained an adoption comparable to that of Atom or RDF/XML. And while these approaches can address identification and enumeration of distributed resources, they have uneven capabilities to express the graph-based OAI-ORE model, due to their hierarchical perspective.
In the course of the OAI-ORE effort, we also attempted to model aggregations as Atom feeds, not entries [29]. We ultimately decided that was the wrong granularity, especially since common Web 2.0 reuse scenarios, including use with the Atom Publishing Protocol, work at the level of Atom entries. The Atom Syndication Format was preferred over the various RSS formats in anticipation of using the Atom Publishing Protocol [18].
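The path restriction that the Sitemap Protocol places on listed URIs, as described in the Sitemap discussion of this section, can be sketched as a simple check. This is a minimal reading of the rule as stated here (same scheme and host, and a path at or below the directory that holds the Sitemap file), not a complete implementation of the protocol. The function name sitemap_may_list and the file name sitemap.xml are assumptions made for this example.

```python
from urllib.parse import urlsplit


def sitemap_may_list(sitemap_url, candidate_url):
    """Return True if candidate_url is at or below the Sitemap's directory.

    Sketch of the restriction described in the text: a Sitemap may only
    list URIs on its own host, within its own directory subtree.
    """
    sitemap = urlsplit(sitemap_url)
    candidate = urlsplit(candidate_url)

    # A Sitemap cannot list resources on another scheme or host.
    if (sitemap.scheme, sitemap.netloc) != (candidate.scheme, candidate.netloc):
        return False

    # Directory that contains the Sitemap file, e.g. "/a/b/".
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    return candidate.path.startswith(base_dir)


# Hypothetical Sitemap location matching the example in the text.
SITEMAP = "http://www.foo.com/a/b/sitemap.xml"
```

With the Sitemap at http://www.foo.com/a/b/sitemap.xml, the check accepts http://www.foo.com/a/b/page.html and rejects URIs under /a/c/, under /d/, and on www.bar.com. This is exactly the limitation that makes Sitemaps unsuitable for describing aggregations of resources distributed across the Web.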
Some elements of the POWDER [37] specifications that were developed in the same timeframe as OAI-ORE address a problem space similar to that of OAI-ORE. However, POWDER’s focus is significantly broader, and it approaches the problem from the opposite perspective, focusing on capabilities to assert (via “Description Resources”) that a group of resources share certain properties (e.g. access rights), rather than asserting arbitrary properties about resources that, for some reason, are grouped into an aggregation. That is, in POWDER the notion of shared properties defines an aggregation, whereas in OAI-ORE an aggregation can be created for any reason deemed important by its creator. Also, while POWDER provides capabilities to describe a group of resources using a variety of approaches including regular expressions, it does not introduce an identity for the aggregation.
20 http://groupme.org/
21 http://linkbun.ch/
22 http://clipmarks.com/
6. CONCLUSIONS
This paper has introduced the OAI-ORE solution to the resource aggregation problem, which we argue meets a critical need in the development of cyberinfrastructure and the next-generation scholarly communication infrastructure. By aligning the solution with the Web Architecture, and by leveraging the practices of the Semantic Web and Linked Data effort, it will facilitate better integration of scholarly communication with the mainstream Web, it will make scholarly artifacts more readily usable with common Web tools and applications, and it will benefit the broader community by making research materials more visible and verifiable, and by facilitating unexpected reuse.
While OAI-ORE was motivated by scholarly communication, we believe that the proposed solution has broader applicability. Aggregations, sets, and collections are as common on the Web as they are in the everyday physical world. In many situations it would benefit agents and services if aggregations were unambiguously enumerated and described, essentially layering an additional level of resource granularity upon the Web.
Evaluation of the OAI-ORE work depends on its adoption and evolution over time. The work has so far benefited from significant community involvement throughout the specification process, and the international team that developed the solution includes representatives with backgrounds in scholarly publishing, eScience, repository infrastructure, digital libraries, Web search engines, linked data, and information interoperability. Work by early adopters, such as the Foresite project and the Johns Hopkins publication workflow project, is a promising indicator that these community contributions have led to a solution that stands a realistic chance of significant adoption.
7. ACKNOWLEDGMENTS
This work was supported by the National Science Foundation Divisions of Information and Intelligent Systems and Undergraduate Education through grant numbers IIS-0430906, IIS-0643784 and DUE-0840744, the Andrew W. Mellon Foundation, Microsoft, and the Coalition for Networked Information. Development of OAI-ORE was based on input from the OAI-ORE Technical Committee, the OAI-ORE Liaison Group, the OAI-ORE Advisory Committee, contributors to the OAI-ORE Google discussion group, and members of the Digital Library Research & Prototyping Team of the Los Alamos National Laboratory. Individuals are listed at http://www.openarchives.org/ore/.
8. REFERENCES
[1] M. Altman and G. King. A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine, 13(3/4), 2007.
[2] D. E. Atkins, K. K. Droegemeier, S. I. Feldman, H. Garcia-Molina, M. L. Klein, D. G. Messerschmitt, P. Messina, J. P. Ostriker, and M. H. Wright. Revolutionizing science and engineering through cyberinfrastructure, 2003.
[3] J. Bekaert, E. De Kooning, and H. Van de Sompel. Representing digital objects using MPEG-21 Digital Item Declaration. International Journal on Digital Libraries, 6(2):159–173, 2006.
[4] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web, 2007. http://sites.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/.
[5] C. L. Borgman. Scholarship in the digital age: information, infrastructure, and the Internet. MIT Press, Cambridge, Mass., 2007.
[6] A. Boyko, J. Kunze, J. Littman, and L. Madden. The bagit file package format (v0.95), Internet Draft, July 2008.
[7] D. Chakrabarti, R. Kumar, and K. Punera. A graph-theoretic approach to webpage segmentation. In WWW ’08: Proceedings of the 17th international conference on World Wide Web, pages 377–386, 2008.
[8] K. Cheung, J. Hunter, A. Lashtabeg, and D. J. SCOPE - a scientific compound object publishing and editing system. In 3rd International Digital Curation Conference, 2007.
[9] S. Choudhury, T. DiLauro, A. Szalay, E. Vishniac, R. Hanisch, J. Steffen, R. Milkey, T. Ehling, and R. Plante. Digital data preservation for scholarly publications in astronomy. International Journal of Digital Curation, 2(2), 2007.
[10] T. DiLauro. OAI-ORE for publishing workflows: Data archiving for journals of the American Astronomical Society. In Open Repositories 2008, 2008.
[11] P. Dmitriev. As we may perceive: finding the boundaries of compound documents on the web. In WWW ’08: Proceedings of the 17th international conference on World Wide Web, pages 1029–1030, 2008.
[12] P. Dmitriev, C. Lagoze, and B. Suchkov. As we may perceive: inferring logical documents from hypertext. In Proceedings of the sixteenth ACM conference on Hypertext and Hypermedia, pages 66–74, 2005.
[13] P. N. Edwards, S. J. Jackson, G. C. Bowker, and C. P. Knobel. Understanding infrastructure: Dynamics, tensions, and design. Technical report, National Science Foundation, January 2007.
[14] N. Eiron and K. McCurley. Untangling compound documents on the web. In Proceedings of the fourteenth ACM conference on Hypertext and Hypermedia, pages 85–94, 2003.
[15] D. Embley, Y. Jiang, and Y. Ng. Record-boundary discovery in Web documents. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 467–478, 1999.
[16] Google, Microsoft, and Yahoo. Sitemaps XML format, 2008. http://www.sitemaps.org/protocol.php.
[17] J. Gray, A. S. Szalay, A. Thakar, C. Stoughton, and J. vandenBerg. Online scientific data curation, publication, and archiving. Technical Report arXiv cs.DL/0208012, 2002.
[18] J. Gregorio and B. de hOra. The Atom publishing protocol, Internet RFC-5023, December 2007.
[19] B. Haslhofer and B. Schandl. The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data. In Proceedings of WWW 2008 Workshop Linked Data on the Web (LDOW2008), Beijing, 2008.
[20] M. Hoogerwerf. Durable enhanced publications. In Proceedings of African Digital Scholarship & Curation 2009, 2009.
[21] J. Hunter and K. Cheung. Provenance explorer - a graphical interface for constructing scientific publication packages from provenance trails. International Journal on Digital Libraries, 7(1-2):99–107.
[22] IMS Global Learning Consortium. IMS content packaging XML binding specification version 1.1.3. http://www.imsglobal.org/content/packaging/, 2003.
[23] I. Jacobs and N. Walsh. Architecture of the world wide web, volume one. Technical Report W3C Recommendation 15 December 2004, W3C, 2004.
[24] R. K. Johnson. Institutional repositories: Partnering with faculty to enhance scholarly communication. D-Lib Magazine, 8(11), 2002.
[25] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora: an architecture for complex objects and their relationships. International Journal on Digital Libraries, 6(2):124–138, 2006.
[26] C. Lagoze and H. Van de Sompel. The Open Archives Initiative: building a low-barrier interoperability framework. In JCDL ’01: Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, pages 54–62, 2001.
[27] C. Lagoze, H. Van de Sompel, P. Johnston, M. Nelson, R. Sanderson, and S. Warner. ORE Specification - Abstract Data Model, 2008. http://www.openarchives.org/ore/datamodel.
[28] C. Lagoze, H. Van de Sompel, P. Johnston, M. Nelson, R. Sanderson, and S. Warner. ORE Specification and User Guide - Table of Contents, 2008. http://www.openarchives.org/ore/1.0/toc.
[29] C. Lagoze, H. Van de Sompel, P. Johnston, M. L. Nelson, R. Sanderson, and S. Warner. Object Re-Use & Exchange: A Resource-Centric Approach. Technical Report arXiv:0804.2273, 2008.
[30] W. Li, O. Kolak, Q. Vu, and H. Takano. Defining logical domains in a web site. In Proceedings of the eleventh ACM on Hypertext and Hypermedia, pages 123–132, 2000.
[31] C. A. Lynch. Institutional repositories: Essential infrastructure for scholarship in the digital age. ARL: A Bimonthly Report, (226), 2003.
[32] J. P. McDonough. METS: Standardized encoding for digital library objects. International Journal on Digital Libraries, 6(2):148–158, 2006.
[33] P. Murray-Rust and H. Rzepa. The next big thing: From hypermedia to datuments. Journal of Digital Information, 5(1), 2004.
[34] M. Nottingham. HTTP header linking, Internet Draft, March 2008.
[35] M. Nottingham and R. Sayre. The Atom syndication format, Internet RFC-4287, December 2005.
[36] L. Sauermann and R. Cyganiak. Cool URIs for the semantic web. Technical Report W3C Interest Group Note 31 March 2008, W3C, 2008.
[37] K. Scheppe and D. Pentecost. Protocol for Web Description Resources (POWDER): Primer. Technical Report W3C Working Draft – 14 November 2008, W3C, 2008.
[38] J. E. Sieber and B. E. Trumbo. (not) giving credit where credit is due: Citation of data sets. Science and Engineering Ethics, 1:11–20, 1995.
[39] A. Smith. The research library in the 21st century: collecting, preserving, and making accessible resources for scholarship. In No Brief Candle: Reconceiving Research Libraries for the 21st Century. Council on Library and Information Resources, 2008.
[40] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, T. Maquire, T. Sandholm, D. Snelling, and Vanderbilt. Open Grid Services Infrastructure (OGSI): Version 1.0. Technical Report draft-ggf-ogsi-gridservice-33, Global Grid Forum, January 27 2003.
[41] H. Van de Sompel, S. Payette, J. Erickson, C. Lagoze, and S. Warner. Rethinking scholarly communication: Building the system that scholars deserve. D-Lib Magazine, 10(9), 2004.
[42] R. Williams, R. Moore, and R. Hanisch. A virtual observatory vision based on publishing and virtual data. Technical report, US National Virtual Observatory, 2003.