Adding eScience Assets to the Data Web

Herbert Van de Sompel, Los Alamos National Laboratory, Los Alamos, NM USA, herbertv@lanl.gov
Carl Lagoze, Cornell University, Ithaca, NY USA, lagoze@cs.cornell.edu
Michael L. Nelson, Old Dominion University, Norfolk, VA USA, mln@cs.odu.edu
Simeon Warner, Cornell University, Ithaca, NY USA, simeon@cs.cornell.edu
Robert Sanderson, University of Liverpool, Liverpool, UK, azaroth@liv.ac.uk
Pete Johnston, Eduserv Foundation, Bath UK, pete.johnston@eduserv.org.uk

Copyright is held by the author/owner(s). LDOW 2009, April 20, 2009, Madrid, Spain.

ABSTRACT

Aggregations of Web resources are increasingly important in scholarship as it adopts new methods that are data-centric, collaborative, and network-based. The same notion of aggregations of resources is common to the mashed-up, socially networked information environment of Web 2.0. We present a mechanism to identify and describe aggregations of Web resources that has resulted from the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) project. The OAI-ORE specifications are based on the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. Therefore, their incorporation into the cyberinfrastructure that supports eScholarship will ensure the integration of the products of scholarly research into the Data Web.

Categories and Subject Descriptors
H.5.4 [Information Systems]: Hypertext/Hypermedia

General Terms
Design, Standardization

Keywords
Cyberinfrastructure, eScience, OAI-ORE, Web Architecture, Linked Data, RDF, Atom

1. INTRODUCTION

The rapid evolution of computing, networking, and data capturing technologies, along with advances in data mining and analysis, are fundamentally changing the way scholarly research is conducted [2, 5]. Although there are differences amongst disciplines in their receptivity to change [13], an increasing number of scholars in the natural sciences, social sciences, and humanities have adopted new research methods that are network-based, highly collaborative, and data-intensive. Because of the central role of vast amounts of data in these new research methods, there has been increased attention to sustainable infrastructures for registering, preserving, and sharing datasets [17].

In parallel with this change in research methodology there has been substantial change in the way that research results are communicated. With the emergence of the Web, scholarly publishers, both commercial and learned societies, almost universally deliver journal papers, conference proceedings, and monographs via the Web. While Web delivery of research results has improved their accessibility and searchability, it represents an evolution of traditional publication practices rather than a fundamental change in the scholarly communication paradigm. Even in their digital manifestations, scholarly publications are mostly textually-based and static. To date, there are few examples of scholarly communication that move beyond the dissemination of these traditional artifacts into a more data-centric, semantically-linked, and social network-embedded scholarly communication model that resembles the profound changes in social, political, and economic discourse characteristic of Web 2.0. This radically different model would expose process as well as product [39], improving opportunities to verify the reproducibility of research results, and making the full spectrum of artifacts generated in the scholarly value chain available for reuse [41].

The deployment of radically new models depends on the development of basic technical infrastructure, so-called cyberinfrastructure. This cyberinfrastructure must include a number of components. These include a means to identify and cite datasets in the scholarly discourse (e.g., [38, 1]), a standard for identifying scholarly authors to unambiguously tie them to their creations and improve the quality of scientometric information (e.g., ResearcherID¹ and Digital Author Identifier²), and standards to allow machine readability of the products of scholarly process, thereby facilitating computational analysis and extraction of secondary and tertiary knowledge products. Semantic technologies are an important component of this cyberinfrastructure, providing a foundation for open agreements on data formats, metadata frameworks to describe data, and ontology-based solutions for formal representation of scientific knowledge, all of which are important components of promoting a machine-readable scholarly record.

¹ http://www.researcherid.com/
² http://www.surffoundation.nl/smartsite.dws?ch=eng&id=13480

This paper focuses on one aspect of this cyberinfrastructure that arises from the changing nature of publications that are characteristic of collaborative, data-centric scholarship. These emerging publications are aggregations of multiple resources. Such aggregations are already prevalent in existing scholarly repositories, which commonly offer access to textual documents in multiple formats, each available from a different network location. But the changes in scholarship described above, and especially the need to include data in the publication process, increase the complexity of these aggregations and call for the adoption of a common approach to handle them. In the remainder of this paper, we describe our work within Open Archives Initiative - Object Reuse and Exchange (OAI-ORE), a two-year project to investigate common methods to handle aggregations of Web resources that culminated in October 2008 with the release of the OAI-ORE specifications [28]. These specifications were motivated by the resource aggregations common to scholarly communication. We believe that their generic, Web-centric approach makes them applicable to use cases in the Web at large, providing the basis for improved search results, improved information navigation, and richer services within browsers for a large class of Web applications.

The OAI-ORE specifications leverage the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. As a result, future developments in cyberinfrastructure and scholarly communication that are based on OAI-ORE will integrate well with the Web and with the tools, agents and applications that operate within it. This will make it possible to embed or mash up the products of scholarship into cyber-learning efforts, cooperative reference tools such as Wikipedia, and the larger social discourse that is now characteristic of Web 2.0. The essence of the OAI-ORE solution to the resource aggregation problem can be summarized as follows:

• The data model is expressed in terms of the primitives of Web Architecture and the Semantic Web: Resources, Representations, URIs and RDF triples.

• The central entity in the data model, the Aggregation, is a Resource that stands for a set of other Resources. An Aggregation is a Resource with a URI but without a Representation (we refer to this as a non-document Resource from now on). This approach is aligned with the manner in which real-world entities or concepts are included in the Web via the mechanisms proposed by the Linked Data effort [4].

• Another Resource, the Resource Map, has a Representation that is a description of the Aggregation. The Resource Map is accessible via the URI of the Aggregation using the mechanisms defined for Cool URIs for the Semantic Web [36].

• The Representation of a Resource Map is a serialization of the triples that describe the Aggregation. The specification describes RDF/XML, RDFa, and Atom serialization syntaxes.

2. AGGREGATIONS

2.1 Aggregations in Scholarly Communication

Most institutional repositories [24, 31] routinely store and disseminate relatively simple aggregations, consisting of multiple access formats (e.g., PDF, HTML, LaTeX) for the same document. In addition, prototypes exist of applications that allow authoring, storing, and disseminating more complex scholarly publications in the form of aggregations [8, 33, 42]. These more complex aggregations may consist of a textual article, one or more datasets that led to the discoveries reported in the article, perhaps a visualization of a specific state of the dataset, and the software used to generate the visualization. All constituents of such an aggregation are distributed on the Web. One notable aspect of these more complex visions of an aggregate scholarly publication is the importance of semantic relationships among constituents of the aggregation. These relationships include citation, versioning, provenance, commentary, and the like.

Some characteristics of the aggregations that are already common in scholarship can be illustrated by means of a document from arXiv.org, a well-known repository of physics, mathematics, and computer science research results. The human start page, or "splash page", for this document is shown in Figure 1. Some aspects of the page relevant to the resource aggregation problem are highlighted in red rectangles, each with a number. The meanings of the highlighted areas are as follows:

1. The URI http://arxiv.org/abs/astro-ph/0601007 of the human start page for the arXiv document.

2. The formats in which the document is available, i.e. PostScript, PDF, etc. These are effectively the constituents of the aggregation that is the arXiv document.

3. The title of the arXiv document.

4. The authors of the arXiv document.

5. The creation and last modification date of the arXiv document.

6. Identifiers of resources that are in some manner comparable to this arXiv document. For example, a version of this document was later published as an article in a peer-reviewed journal, and the Digital Object Identifier of that article is shown.

7. The versions of this arXiv document.

8. Links to other arXiv documents in the same collection (i.e., astro-ph).

9. Citations made by this arXiv document, and citations it received from other documents.

Figure 1: The implicitly defined members of a scholarly aggregation.

This rather simple example highlights the core issues that OAI-ORE addresses. First, although the URI of the human start page is commonly used as the URI for the entire arXiv document, within the Web Architecture that URI only identifies the page itself, and not the aggregation that is the arXiv document. The ability to cite, annotate, version, and associate properties with the aggregation itself relies on it having a unique identity, distinct from the splash page or the resources linked from it.

Second, without the use of (frequently imperfect) heuristics unique to the specific human start page, it is not readable by machines and agents. Because the HTML of this human start page usually leaves the semantics of hyperlinks undefined, a machine agent cannot unambiguously distinguish between links to constituents (e.g. the PostScript, PDF, etc.) of the document and links that point at information that is clearly outside of the document, such as the navigational aids shown as (8) in Figure 1. Similarly, agents cannot interpret relationships of the document to other documents, identifiers related to this document, versions of this document, etc.

In essence, the problem is that there is no standard way to describe the constituents or boundary of an aggregation, or to qualify and identify a resource as being an aggregation. While a robot could learn the semantics implied by arXiv's HTML in Figure 1, such "screen scraping" is brittle and not scalable for applications accessing aggregations in thousands of different repositories, each with their own presentation idiom.

2.2 Integrating Aggregations into the Web

A number of early efforts in cyberinfrastructure, for example the initial grid architecture [40] and technologies for digital libraries, leveraged aspects of the Web infrastructure but often failed to fully conform with Web Architecture principles. For example, institutional repositories frequently have identifier schemes and access protocols distinct from those existing on the Web at large. As a result, much of their content is accessible on the Web, but it integrates poorly with mainstream Web applications and may even be overlooked by major search engines, unless the search engines make special accommodations for their protocols and access schemes.

Our prior work on the Open Archives Initiative Protocol For Metadata Harvesting (OAI-PMH) [26] demonstrates this problem. OAI-PMH is an interoperability specification released in 2001 aimed at streamlining the process of incrementally collecting XML metadata (typically bibliographic metadata) from information systems. It shares many design characteristics with Atom [35] and is widely adopted in its targeted community of scholarly repositories. But OAI-PMH, in contrast to Atom, has not gained broader adoption, mainly because its architecture is not well aligned with the Resource/URI/Representation foundations of the Web Architecture. For example, OAI-PMH clients must construct a request URI by combining a repository specific base URI, the identifier of the item of interest, and a format tag in an OAI-PMH specific manner, often preventing general Web clients that are unaware of the protocol from accessing the available metadata [19].

The Web-centric, resource-centric approach of OAI-ORE rectifies this architectural shortcoming and thereby provides the foundation for full accessibility of the products of eScience in the general Web environment. Furthermore, it makes the solution available to a broader class of Web applications in which the practice of aggregating resources is quite common. For example, we accumulate URLs in bookmarks or favorites lists in our browser, collect photos into sets in popular sites like Flickr, browse over multiple page documents that are linked together through "prev" and "next" tags, and talk about Web sites as if they had some real existence beyond the set of pages of which they consist. Despite our frequent use of these aggregations, their existence on the Web is quite ephemeral because there is no common way to identify, describe, and hence handle them. This is what OAI-ORE provides.

3. THE OAI-ORE SOLUTION

In this section we describe the various elements of the OAI-ORE solution to the resource aggregation problem outlined above. It encompasses an RDF-based data model, syntaxes for serializing instances of the data model, and mechanisms for providing HTTP access to those serializations. Complete details are available through the OAI-ORE documentation suite [28].

As noted earlier, this solution is based on the primitives defined in the Architecture of the World Wide Web [23], which defines a Resource as an item of interest; a URI as a global identifier for a Resource; and a Representation as a datastream corresponding to the state of a Resource at the time its URI is dereferenced via some protocol (e.g. HTTP). In addition, the solution is grounded in the principles introduced by the Semantic Web, in which URIs are also used to identify non-document Resources, such as real-world entities (e.g. people or cars), or even abstract entities (e.g. ideas or classes). These non-document Resources have no Representation to indicate their meaning. OAI-ORE adopts the following approach, proposed by the Linked Data effort [4], for obtaining information about those Resources:

• Use of HTTP URIs to identify those non-document Resources;

• Publication of another Resource with a Representation that provides information about the non-document Resource at an HTTP URI other than the HTTP URI of the non-document Resource;

• Leverage of HTTP mechanisms to allow discovery of the HTTP URI of the published resource from the HTTP URI of the non-document resource.

3.1 Data Model

The essence of the RDF-based data model is described here and is illustrated in Figure 2. The full details are available in the OAI-ORE Abstract Data Model specification [27].

Figure 2: A Resource Map describes an Aggregation with three Aggregated Resources.

In order to be able to unambiguously refer to an aggregation of Web resources, a new Resource is introduced that stands for a set or collection of other Resources. This new Resource, named an Aggregation, has a URI just like any other Resource on the Web. And, since an Aggregation is a conceptual construct, it is a non-document Resource that does not have a Representation.

Following the Linked Data guidelines, another Resource is introduced to make information about the Aggregation available. This new Resource, named a Resource Map, has a URI and a machine-readable Representation that provides details about the Aggregation.
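
The three-step Linked Data approach listed above (an HTTP URI for the non-document Resource, a separate URI for the describing Resource, and an HTTP mechanism leading from the first to the second) can be sketched as a toy resolver. This is only an illustration under stated assumptions: it picks one of the mechanisms described by Cool URIs for the Semantic Web [36], a 303 See Other redirect combined with a simplified form of content negotiation; the example.org URIs, the media-type table, and the function names are hypothetical; and q-values in the Accept header are ignored for brevity.

```python
# Toy resolver: all URIs below are made up for illustration.
AGGREGATION = "http://example.org/aggregation/astro-ph-0601007"

# One Resource Map per serialization, keyed by media type (hypothetical layout).
RESOURCE_MAPS = {
    "application/atom+xml": "http://example.org/rem/atom/astro-ph-0601007",
    "application/rdf+xml": "http://example.org/rem/rdf/astro-ph-0601007",
}
DEFAULT_TYPE = "application/rdf+xml"


def preferred_type(accept):
    """Return the first media type listed in the Accept header that we can serve.

    Simplification: q-values and wildcards are not interpreted; the order of
    the header entries decides, and anything unrecognised falls back to the
    default type.
    """
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in RESOURCE_MAPS:
            return media_type
    return DEFAULT_TYPE


def dereference(uri, accept=""):
    """Return (status, headers, body) for a GET request on `uri`."""
    if uri == AGGREGATION:
        # The Aggregation is a non-document Resource: it has no Representation,
        # so the client is redirected to a Resource Map that describes it.
        location = RESOURCE_MAPS[preferred_type(accept)]
        return 303, {"Location": location, "Vary": "Accept"}, ""
    for media_type, rem_uri in RESOURCE_MAPS.items():
        if uri == rem_uri:
            # A Resource Map does have a Representation (placeholder body here).
            body = "<serialized triples describing %s>" % AGGREGATION
            return 200, {"Content-Type": media_type}, body
    return 404, {}, ""
```

A client that starts from the Aggregation URI would call `dereference(AGGREGATION, "application/atom+xml")`, follow the `Location` header of the 303 response, and then receive the Resource Map with a 200 response. Keeping the two URIs distinct is what lets the Aggregation be cited separately from any one description of it.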
In essence, a Resource Map expresses which Aggregation it describes (the ore:describes relationship in Figure 2), and it lists the Aggregated Resources that are part of the Aggregation (the ore:aggregates relationship in Figure 2, a subproperty of dcterms:hasPart). But a Resource Map can also express relationships and properties pertaining to all these Resources, as well as metadata pertaining to the Resource Map itself, e.g. who published it and when it was most recently modified (the dcterms:creator and dcterms:modified relationships in Figure 2). A Resource Map can also express relationships of the Aggregation, Aggregated Resources, and the Resource Map itself with any arbitrary other Resource, as long as the resulting RDF graph is connected.

In addition, for discovery purposes, the data model allows a Resource Map to express that an Aggregated Resource of a specific Aggregation is also part of another Aggregation. This is achieved by means of the ore:isAggregatedBy relationship (the inverse of ore:aggregates) between the Aggregated Resource and that other Aggregation. Stating that an Aggregated Resource is itself an Aggregation (nesting Aggregations) is also supported. To that purpose, an ore:isDescribedBy relationship (the inverse of ore:describes, and a subproperty of rdfs:seeAlso) is expressed between the Aggregated Resource and a Resource Map that describes it as being itself an Aggregation. Furthermore, the use of non-protocol-based identifiers (such as DOIs) that can be expressed as URIs is quite common for referencing scholarly assets. In order to support this practice, the ore:similarTo relationship between an Aggregation and a somehow equivalent resource identified by a non-protocol-based URI is expressed. The specificity of ore:similarTo is situated between rdfs:seeAlso and owl:sameAs.

3.2 Proxies: Aggregated Resources in Context

Figure 3: Citing a Resource in the context of an Aggregation.

We note that the URI asserted in a Resource Map to denote an Aggregated Resource of a particular Aggregation is no different than the URI that denotes that Resource independent of the Aggregation. However, it is important in scholarly communication, among others for the purpose of citing and expressing provenance, that a resource such as a dataset included in some context, for example a specific article, be distinct from the same dataset outside the context of that article, or in the context of another article.

To accomplish this differentiation, OAI-ORE introduces the notion of a Proxy. A Proxy is a Resource that stands for an Aggregated Resource in the context of a specific Aggregation. The URI of a Proxy provides a mechanism for denoting a Resource in context. Figure 3 shows the ore:ProxyFor and ore:ProxyIn relationships between a Proxy and an Aggregated Resource and an Aggregation, respectively. It also illustrates how citing the Aggregated Resource is different from citing its Proxy: the former cites a Resource "as is", the latter cites that Resource as it exists in the context of a specific Aggregation. In order to work seamlessly in the Web and to provide context information to OAI-ORE aware clients, resolution of HTTP URIs assigned to Proxies must lead to the Aggregated Resource, and the response must include an HTTP Link Header [34] that points to the Aggregation.

3.3 Resource Map Serializations

A Resource Map has a Representation that describes an Aggregation in some serialization syntax. OAI-ORE explicitly specifies three serialization syntaxes, Atom XML, RDF/XML, and RDFa, while other serialization syntaxes are possible. Which one to choose will largely depend on the use case and on the technical environment available to a Resource Map publisher. For example, in cases where an expressive HTML splash page exists, an RDFa approach might be attractive.
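
Whatever serialization is chosen, the data model of Sections 3.1 and 3.2 amounts to a small set of triples. The following is a minimal sketch in Python, assuming plain string tuples rather than an RDF library. All example.org URIs and the DOI are made up. The Proxy properties are written in the lower-case spelling (ore:proxyFor, ore:proxyIn) that the published ORE vocabulary appears to use, which is an assumption relative to the spelling in the prose above. The `is_connected` helper interprets the "connected graph" requirement as connectivity when arc direction is ignored, which is also an assumption about the intended meaning.

```python
from collections import defaultdict

# Hypothetical identifiers, for illustration only.
REM = "http://example.org/rem/1"              # Resource Map
AGG = "http://example.org/aggregation/1"      # Aggregation (non-document Resource)
PDF = "http://example.org/paper.pdf"          # Aggregated Resource
DATA = "http://example.org/dataset.csv"       # Aggregated Resource
PROXY = "http://example.org/proxy/1/dataset"  # the dataset in the context of AGG

TRIPLES = {
    (REM, "ore:describes", AGG),
    (AGG, "ore:aggregates", PDF),
    (AGG, "ore:aggregates", DATA),
    # Metadata about the Resource Map itself.
    (REM, "dcterms:creator", "http://example.org/people/publisher"),
    (REM, "dcterms:modified", "2009-04-20"),
    # A non-protocol-based identifier for a somehow equivalent resource (made-up DOI).
    (AGG, "ore:similarTo", "info:doi/10.9999/example"),
    # Proxy: the dataset as it exists in the context of this Aggregation.
    (PROXY, "ore:proxyFor", DATA),
    (PROXY, "ore:proxyIn", AGG),
}


def aggregated_resources(triples, aggregation):
    """Return the sorted URIs that `aggregation` aggregates."""
    return sorted(o for s, p, o in triples
                  if s == aggregation and p == "ore:aggregates")


def is_connected(triples):
    """Check that every node is reachable from every other, ignoring direction."""
    neighbours = defaultdict(set)
    for s, _, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)
    if not neighbours:
        return True
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:  # depth-first traversal over the undirected view of the graph
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(neighbours)
```

Adding a triple whose subject and object are both unrelated to the rest, for example a title for some other resource, would make `is_connected` return False, which is the situation the connectivity condition rules out.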
Note that multiple Resource Maps, each using a different serialization syntax, can describe the same Aggregation, and that these may differ in expressiveness³.

³ See http://www.openarchives.org/ore/atom for detailed Atom and RDF/XML versions of Resource Maps corresponding to Figure 1.

Although the data model is based on RDF, we were committed to also specify a serialization based on Atom, to allow Aggregations to become the subject of Web 2.0 reuse scenarios and of workflows based on the Atom Publishing Protocol [18]. The Atom Publishing Protocol adds a uniform read/write approach to Web 2.0, which could be of significant benefit in scholarly communication scenarios. However, the task of reconciling the data model with the Atom model proved to be non-trivial due to tensions between the RDF model and the XML-oriented Atom specification. The former is graph-based, with precise semantics that are global rather than local to a specific document. The latter is hierarchical, (XML) document-centric, and has intentionally loose element definitions. It took several, dramatically different iterations of the Atom serialization to arrive at an acceptable solution.

The resulting approach expresses an Aggregation by means of an Atom entry, and makes use of Atom's extensibility mechanisms in much the same way as Google Data does. For example, Atom's link element with an OAI-ORE-specific value for the rel attribute is used to aggregate resources. And, awaiting a solution from the Atom community for expressing triples, an ore:triples element was introduced to act as a wrapper for RDF descriptions. To support unambiguous interpretation of Atom serializations of Resource Maps, a GRDDL transform was implemented that extracts all contained triples that pertain to the OAI-ORE data model, both from the native Atom elements and from the ore:triples extension element, and expresses them in RDF/XML⁴.

⁴ http://www.openarchives.org/ore/atom-grddl

3.4 Leveraging HTTP

In order to make OAI-ORE work in the HTTP-based Web, both the Aggregation and the Resource Map are assigned HTTP URIs, and the Cool URIs for the Semantic Web guidelines [36] are adopted to support discovery of the HTTP URI of a Resource Map given the HTTP URI of an Aggregation. Figure 4 illustrates a situation in which the arXiv Aggregation is described by both an Atom XML and an RDF/XML Resource Map, and in which a client is led to the Atom version via an HTTP 303 redirect and Content Negotiation.

Figure 4: Discovering a Resource Map from an Aggregation using Cool URIs for the Semantic Web.

3.5 Authoritative Resource Maps

After one party has published a Resource Map that contains a description and a URI for a new Aggregation, any other party can publish competing or even conflicting Resource Maps that describe the same Aggregation. To address this we distinguish between Authoritative and Non-Authoritative Resource Maps in the same way as the Linked Data guidelines. An Authoritative Resource Map is one that is accessible by dereferencing the URI of the Aggregation that it describes, for example using the aforementioned Cool URI mechanisms. A Non-Authoritative Resource Map is one not reachable in this manner. The rationale for this approach is that the party that introduces a new Aggregation simultaneously mints URIs for both the Aggregation and the Resource Map, and actually controls both.

4. EARLY DEMONSTRATORS

Since the OAI-ORE specifications have only been released recently, an in-depth evaluation of functionality, adoption, and impact is premature. Still, in this section we give insight into efforts by early adopters to leverage the specifications. Four use cases are described below. Additional illustrations of its application are provided by the submissions to the ORE Challenge at RepoCamp 2008⁵.

⁵ http://www.openarchives.org/ore/RepoCamp2008/

4.1 Foresite: Revealing Aggregations

In order to provide feedback on the evolving OAI-ORE specification, the UK's Joint Information Systems Committee (JISC)⁶ funded an experiment to investigate applying it to an extensive scholarly collection: the approximately four million articles that are part of the JSTOR⁷ collection. By developing open source OAI-ORE libraries⁸ and applying them to produce interlinked Resource Maps, the Foresite project effectively demonstrated the feasibility of exposing common scholarly artifacts to the Data Web in the manner proposed by OAI-ORE. The project provided valuable feedback that helped refine the OAI-ORE specifications, and had a significant impact on the aforementioned discussions regarding the Atom serialization of Resource Maps.

⁶ http://www.jisc.ac.uk/
⁷ http://www.jstor.org/
⁸ http://foresite-toolkit.googlecode.com/

The overall structure of the Aggregations, and associated Resource Maps, produced for the JSTOR collection mirrors the journal - issue - article hierarchy of the JSTOR content. Each journal is modeled as an Aggregation of journal issues; each issue is an Aggregation of articles; and each article is an Aggregation of individual page images and a PDF-formatted version of the entire article (Figure 5). The Aggregated Resources at each level are also the subject and/or object of a fst:followedBy relationship introduced to preserve the page-turning order for pages within an article, articles within an issue and so forth. Because fst:followedBy is not a global relationship, but rather only applies within the context of a specific Aggregation, Proxies for these Aggregated Resources were introduced. The article Aggregations interlink via dcterms:references relationships for citations, further confirming the necessity of the graph-based nature of the OAI-ORE data model, even though the main JSTOR content hierarchy is tree-shaped. The Resource Maps were published on a Web server at the University of Liverpool.

Figure 5: The hierarchical structure of the JSTOR collection mapped to the OAI-ORE data model. Note that 1..1 cardinalities are omitted from the diagram for clarity.

The resulting OAI-ORE descriptions are of immediate business importance to JSTOR. While JSTOR stores the OCR-ed full-text of each article, it is only able to openly expose this kind of topological metadata, and would lose its market advantage (and the participation of contributing publishers) if the full-text were exposed. Having the topology of their collection available in a standardized format that provides links back to their protected full-text documents and images facilitates reuse in third party applications that can help drive traffic to the JSTOR site and increase its customer base.

In order to provide a value-added service on the basis of the generated Resource Maps without requiring JSTOR to integrate prototype code into their production portal, the Foresite Explorer, a visualization application⁹, was developed using GreaseMonkey¹⁰ and its cross-site capable XmlHttpRequest. This one-click-install plug-in for Firefox¹¹ extracts the URI of the resource that is currently being viewed in the JSTOR Web interface and retrieves the associated RDF/XML Resource Map that describes the Aggregation to which the Web resource corresponds from the Liverpool Web server. The plug-in then parses and displays the Resource Map graph via dynamic SVG. Nodes in the display represent Aggregations, Aggregated Resources, and related Resources. Nodes for Aggregations can be clicked to expand or contract the visualization; in case of expansion, new Resource Maps are obtained, parsed, and again visualized.

⁹ http://foresite.cheshire3.org/explorer/
¹⁰ http://www.greasespot.net/
¹¹ http://www.mozilla.com/firefox/

Further experiments using the same approach were carried out on mainstream Web portals, leveraging the provided Web service APIs to obtain metadata, and to express it according to the ORE data model. Flickr¹² and Amazon¹³ were selected, and wrapper services were built to generate Resource Maps on demand through REST interactions, and to publish them on the Liverpool server. Flickr provides a rich dataset with photos, photo sets, users, groups, favorites and even comments and tags that can all be modeled as Aggregations. Figure 6 shows a visualization of the structure of the Flickr Set "Glaciers" that consists of five photographs. In the Foresite Explorer, this set is represented with an Aggregation visualized as the top right node within the OAI-ORE logo (left bottom of Figure 6), emitting a red dcterms:creator arc and a white ore:aggregates arc. The latter leads to the five photographs. The third photograph is selected, and another white ore:aggregates arc reaches out to the available image files (differing image resolutions) represented as black nodes. The purple nodes indicate other aggregations in which the selected photo is aggregated.

¹² http://www.flickr.com/
¹³ http://www.amazon.com/

Figure 6: The Foresite plug-in models Flickr Sets as OAI-ORE Aggregations, and visualizes them.

Amazon offers fewer constructs that readily map to the OAI-ORE data model, but the user wishlist is a compelling one. The mapping to the data model is as follows: a wishlist becomes an Aggregation, and wished-for items become Aggregated Resources. Interestingly, each item in an Amazon wishlist has a unique identifier by which it is purchased. That identifier is only valid within that specific wishlist to allow tracking of individual items, once purchased. These wishlist specific constructs map directly to the Proxies of the OAI-ORE model. The GreaseMonkey script was updated to discover these identifiers that are necessary to interact with the Amazon Web services, and Proxy-based relationships were added to the visualization.
Overall, the Foresite experiment has illustrated the applicability of the OAI-ORE resource aggregation model as well as the feasibility to leverage it to create a value-added service. It has demonstrated this for both common scholarly communication artifacts and specific constructs used by popular Web portals. The Foresite experiment will be described in more detail in a dedicated, future publication.

4.2 Astronomy Publication Workflow

Datasets are of fundamental importance in observational sciences such as astronomy. The astronomy community has developed sophisticated repositories and data standards, exemplified by the Sloan Digital Sky Survey¹⁴ and the National Virtual Observatory¹⁵, which provide excellent facilities for registering and accessing large datasets. However, when submitting an article, both new datasets that were created to arrive at findings reported in an article, and data citation information that reveals the reuse of existing datasets are often lost, "left behind" on the personal computer of the author.

A team at Johns Hopkins University is collaborating with the American Astronomical Society to capture datasets as part of the publication workflow [9]. In the newly devised publication workflows, OAI-ORE Aggregations are used to glue an article and its associated datasets together, and Resource Maps that describe these Aggregations are the tokens that move around between author, publisher and dataset repository as the publication process proceeds [10]. At each stage of the publication workflow, the Resource Map is used to convey the current state of the Aggregation, and is then updated to reflect the new state that is then passed on to the next workflow phase. For example, as a Resource Map is passed from the publisher to the dataset repository and back again, it is updated to contain the URIs of datasets that are registered in the repository, and that were used for the article. This allows the publisher to link to the datasets that were used for a specific article, and the repository to link to papers that used a specific dataset.

Generally, the availability of these Aggregations enables new services to be built on both the publishing platform and the data repository. If the practices proposed by this novel publication workflow became commonplace, it would represent a significant improvement in the efficiency of scientific communication.

4.3 Authoring, Editing and Reusing

The success of OAI-ORE depends on the ease with which Aggregations and Resource Maps are authored and disseminated on the Web.

all resources that relate to a particular research task or publication fits into the normal scholarly workflow. Two authoring environments that demonstrate this are the Literature Object Reuse and Exchange (LORE) tool created by Gerber et al.¹⁶, and the SCOPE work of Cheung et al. [8, 21]. LORE is a Firefox extension that communicates via Ajax with a Sesame2 data store for maintaining the OAI-ORE graphs that are generated. LORE allows for the generation of fine-grained metadata and relationships, for example, making it possible to indicate that a certain resource is contextual information about the literature work that is being studied. The SCOPE work led to the development of the Provenance Explorer, a stand-alone Java application with functionalities similar to those of LORE, but aimed at the creation, editing and publication of scientific compound objects.

4.4 Enhanced Publications

The Dutch SURFshare program¹⁷ and the European DRIVER II project¹⁸ are collaborating on cyberinfrastructure to join a multitude of scientific repositories that hold publications and research data. The goal is to give researchers better means to share and access scientific materials through innovative services. One of the envisioned services relates to enhanced publications, composites of textual publications and supporting resources such as research data, visualizations, annotations, related websites, etc. To ensure the integrity and usability of such enhanced publications it is important that all their components and their interrelations are preserved.

A study into object models suitable for the representation of enhanced publications recommended the use of OAI-ORE. As a result, a demonstrator project [20] was launched in which enhanced publications for multiple scientific disciplines ranging from engineering to journalism were modeled according to OAI-ORE, and in which approaches to meet a variety of requirements were explored, including presentation, navigation, persistent identification, granularity of referencing, handling of sequentially ordered resources, visualization of interrelationships, etc. The results are available at the project site¹⁹. The project chose RDF/XML to express Resource Maps and uses an XSLT-based approach to dynamically generate an HTML "splash page" from them. In each splash page, a Content tab (Figure 7) lists all crucial metadata about the enhanced publication, prominently shows its textual component and associated metadata, and neatly lists additional resources again with metadata. Many of these resources are themselves modeled as Aggregations, and hence also have their own splash page. To support an understanding of the relationships among resources of an
In many cases, they will be generated Aggregation and of nested Aggregations, a Relations tab automatically based on information that is available in an that loads a Java applet fueled by Resource Map content information system. For example, the arXiv.org database is introduced. Overall, the demonstrator is remarkable be- contains all information that is necessary to automatically cause of the elegance and simplicity of the ORE implemen- generate Aggregations and their associated Resource Maps, tation. It clearly illustrates that ORE can be used as a basic as shown in the Appendices. And, in the astronomy project model for enhanced publications, and points at the need for described above, the ability to create Resource Maps is built community-defined vocabularies to convey expressive rela- into familiar authoring environments in a manner that makes tionships among scientific resources. it a side-effect of the authoring process and thus minimizes the burden on authors. 16 Like all cyberinfrastructure, the success of such authoring http://www.openarchives.org/ore/RepoCamp2008/ #LORE environments depends on the manner in which assembling 17 http://www.surffoundation.nl/en/ 14 18 http://www.sdss.org/ http://www.driver-community.eu/ 15 19 http://www.us-vo.org/ http://driver2.dans.knaw.nl/demonstrator/html/ “bunch” has a new HTTP URI identity, it enumerates its members, and it readily handles distributed Web resources. However, the identity of the bunch is the same as that of the HTML page that describes it, and expressing relationships between the bunched resources is not supported. GroupMe! is similar, with the addition of social tagging capabilities, but has the same problems as LinkBunch. Some Web navigator approaches work in an opposite gran- ular direction, supporting disaggregation of a single Web re- source (i.e., an HTML page) into multiple resources. 
This can be done automatically, such as for segmented display on limited devices such as PDAs [7] or for recovering struc- tured records from Web pages [15]. Decomposition can also be done manually, such as for reuse and sharing of parts of a Web page (e.g., ClipMarks22 ). All these approaches, man- ually or automatically, can be thought of as adding (or in- ferring) HTML anchors where none exist. These approaches assign identities to the newly created resources (fragments of the original resource), but they provide no approach to describe the original resource as an aggregation of these new resources, nor do they allow expressing relationships among them. In approaches that have the administrator of a Web infor- Figure 7: The splash page for an enhanced publi- mation system in the diver seat, several technologies exist to cation of the DRIVER II project, dynamically ren- deal with resource aggregations. Sitemaps were briefly con- dered from an RDF/XML Resource Map. sidered as a serialization option for Resource Maps. Google, Yahoo and Microsoft support the Sitemap Protocol [16], a simple XML file format that allows Web sites to list the URIs they want crawled by robots. Sitemaps provide for minimal 5. RELATED WORK metadata (e.g., last modification date, update frequency and Given the widespread use of aggregations in both the crawl priority), but no attempt is made to provide semantic physical and the Web world, it comes as no surprise that typing, and handling arbitrary distributed resources is not other efforts have investigated this domain. Prior work in supported. Indeed, in the interest of trust, the Sitemap Pro- the Web realm can be grouped in two main categories de- tocol specifies a significant limitation on URI paths that can pending on the party that introduces aggregations. In one be listed in a Sitemap file. 
For example, a Sitemap at level case, that is the Web navigator (agent or reader), in the www.foo.com/a/b can list URIs at level a/b and below, but other case it is the administrator of a Web-based information it cannot list URIs at www.foo.com/a/c, www.foo.com/d/ or system. We look at a number of efforts in both categories, www.bar.com/. and evaluate their capabilities to identify aggregations, to We made a deliberate decision to avoid the many exist- enumerate the constituent resources of an aggregation, to ing packaging formats, such as MPEG-21 DIDL [3], METS express relationships among resources, and to accommodate [32], FOXML [25], IMS-CP [22], and BagIt [6]. First, pack- resources that are distributed on the Web. aging base64-encoded content in a wrapper document does In the Web navigator case, either an interactive user groups not resonate well with the Resource/URI/Representation resources based on some intent, or a robot tries to infer the paradigm of the Web Architecture. Still, most of these for- implicitly defined members of an aggregation. The robotic mats also support a by-reference mechanism to deliver con- approaches range from heuristics [30, 14] to machine-learning tent, in which URIs can be used. However, although these [12, 11]. While these approaches are useful, they are imper- formats are prominent in their respective communities, they fect and dependent on the perception of those encoding the have not gained an adoption comparable to that of Atom or heuristics or training set and they do not necessarily reflect RDF/XML. And while these approaches can address iden- the intention of the original authors of the Web resources. tification, and enumeration of distributed resources, they And, while these approaches may succeed at selecting the have uneven capabilities to express the graph-based OAI- distributed resources that are part of an implicitly defined ORE model, due to their hierarchical perspective. 
aggregation, they are not capable of inferring the relation- In the course of the OAI-ORE effort, we also attempted to ships between those resources, nor do they propose a way to model aggregations as Atom feeds, not entries [29]. We ul- unambiguously describe the aggregation. timately decided that was the wrong granularity, especially The approaches that involve an interactive user include since common Web 2.0 reuse scenarios, including use with tools such as GroupMe!20 and LinkBunch21 . LinkBunch the Atom Publishing Protocol, work at the level of Atom lets users submit several URIs that are then assigned a new entries. The Atom Syndication Format was preferred over HTTP URI that, when dereferenced, returns an HTML page the various RSS formats in anticipation of using the Atom that lists and links to the originally submitted URIs. The Publishing Protocol [18]. Some elements of the POWDER [37] specifications that 20 http://groupme.org/ 21 22 http://linkbun.ch/ http://clipmarks.com/ were developed in the same timeframe as OAI-ORE ad- http://www.openarchives.org/ore/. dress a problem space similar to that of OAI-ORE. However, POWDER’s focus is significantly broader, and it approaches the problem from the opposite perspective, 8. REFERENCES focusing on capabilities to assert (via “Description Re- [1] M. Altman and G. King. A proposed standard for the sources”) that a group of resources share certain properties scholarly citation of quantitative data. D-Lib (e.g. access rights), rather than asserting arbitrary prop- Magazine, 13(3/4), 2007. erties about resources that, for some reason, are grouped [2] D. E. Atkins, K. K. Droegemeier, S. I. Feldman, into an aggregation. That is, in POWDER the notion of H. Garcia-Molina, M. L. Klein, D. G. Messerschmitt, shared properties defines an aggregation, whereas in OAI- P. Messina, J. P. Ostriker, and M. H. Wright. 
ORE an aggregation can be created for any reason deemed Revolutionizing science and engineering through important by its creator. Also, while POWDER provides cyberinfrastructure, 2003. capabilities to describe a group of resources using a vari- [3] J. Bekaert, E. De Kooning, and H. Van de Sompel. ety of approaches including regular expressions, it does not Representing digital objects using MPEG-21 Digital introduce an identity for the aggregation. Item Declaration. International Journal on Digital Libraries, 6(2):159–173, 2006. 6. CONCLUSIONS [4] C. Bizer, R. Cyganiak, and T. Heath. How to publish This paper has introduced the OAI-ORE solution to the linked data on the web, 2007. http://sites.wiwiss.fu- resource aggregation problem, which we argue meets a crit- berlin.de/bizer/pub/LinkedDataTutorial/. ical need in the development of cyberinfrastructure and the [5] C. L. Borgman. Scholarship in the digital age : next generation scholarly communication infrastructure. By information, infrastructure, and the Internet. MIT aligning the solution with the Web Architecture, and by Press, Cambridge, Mass., 2007. leveraging the practices of the Semantic Web and Linked [6] A. Boyko, J. Kunze, J. Littman, and L. Madden. The Data effort, it will facilitate better integration of scholarly bagit file package format (v0.95), Internet Draft, July communication with the mainstream Web, it will make schol- 2008. arly artifacts more readily usable with common Web tools [7] D. Chakrabarti, R. Kumar, and K. Punera. A and applications, and it will benefit the broader community graph-theoretic approach to webpage segmentation. In by making research materials more visible, verifiable, and WWW ’08: Proceedings of the 17th international by facilitating unexpected reuse. conference on World Wide Web, pages 377–386, 2008. While OAI-ORE was motivated by scholarly communi- [8] K. Cheung, J. Hunter, A. Lashtabeg, and D. J. 
cation, we believe that the proposed solution has broader SCOPE - a scientific compound object publishing and applicability. Aggregations, sets, and collections are as com- editing system. In 3rd International Digital Curation mon on the Web as they are in the everyday physical world. Conference, 2007. In many situations it would benefit agents and services if ag- [9] S. Choudhury, T. DiLauro, A. Szalay, E. Vishniac, gregations were unambiguously enumerated and described, R. Hanisch, J. Steffen, R. Milkey, T. Ehling, and essentially layering an addition level of resource granularity R. Plante. Digital data preservation for scholarly upon the Web. publications in astronomy. International Journal of Evaluation of the OAI-ORE work depends on its adop- Digital Curation, 2(2), 2007. tion and evolution over time. The work has so far ben- [10] T. DiLauro. OAI-ORE for publishing workflows: Data efited from significant community involvement throughout archiving for journals of the American Astronomical the specification process, and the international team that Society. In Open Repositories 2008, 2008. developed the solution includes representatives with back- [11] P. Dmitriev. As we may perceive: finding the grounds in scholarly publishing, eScience, repository infras- boundaries of compound documents on the web. In tructure, digital libraries, Web search engines, linked data, WWW ’08: Proceedings of the 17th international and information interoperability. Work by early adopters, conference on World Wide Web, pages 1029–1030, such as the Foresite project and John’s Hopkins publica- 2008. tion workflow project, are promising indicators that these [12] P. Dmitriev, C. Lagoze, and B. Suchkov. As we may community contributions have led to a solution that stands perceive: inferring logical documents from hypertext. realistic chances for significant adoption. In Proceedings of the sixteenth ACM conference on Hypertext and Hypermedia, pages 66–74, 2005. 7. ACKNOWLEDGMENTS [13] P. 
N. Edwards, S. J. Jackson, G. C. Bowker, and C. P. This work was supported by the National Science Foun- Knobel. Understanding infrastructure: Dynamics, dation Divisions of Information and Intelligent Systems and tensions, and design. Technical report, National Undergraduate Education through grant numbers IIS-0430906, Science Foundation, January 2007. IIS-0643784 and DUE-0840744, the Andrew W. Mellon Foun- [14] N. Eiron and K. McCurley. Untangling compound dation, Microsoft, and the Coalition for Networked Informa- documents on the web. In Proceedings of the tion. Development of OAI-ORE was based on input from fourteenth ACM conference on Hypertext and the OAI-ORE Technical Committee, the OAI-ORE Liaison Hypermedia, pages 85–94, 2003. Group, the OAI-ORE Advisory Committee, contributors to [15] D. Embley, Y. Jiang, and Y. Ng. Record-boundary the OAI-ORE Google discussion group, and members of discovery in Web documents. In Proceedings of the the Digital Library Research & Prototyping Team of the 1999 ACM SIGMOD international conference on Los Alamos National Laboratory. Individuals are listed at Management of data, pages 467–478, 1999. [16] Google, Microsoft, and Yahoo. Sitemaps XML format, From hypermedia to datuments. Journal of Digital 2008. http://www.sitemaps.org/protocol.php. Information, 5(1), 2004. [17] J. Gray, A. S. Szalay, A. Thakar, C. Stoughton, and [34] M. Nottingham. HTTP header linking, Internet Draft, J. vandenBerg. Online scientific data curation, March 2008. publication, and archiving. Technical Report arXiv [35] M. Nottingham and R. Sayre. The Atom syndication cs.DL/0208012, 2002. format, Internet RFC-4287, December 2005. [18] J. Gregorio and B. de hOra. The Atom publishing [36] L. Sauermann and R. Cyganiak. Cool URIs for the protocol, Internet RFC-5023, December 2007. semantic web. Technical Report W3C Interest Group [19] B. Haslhofer and B. Schandl. The OAI2LOD Server: Note 31 March 2008, W3C, 2008. Exposing OAI-PMH Metadata as Linked Data. 
In [37] K. Scheppe and D. Pentecost. Protocol for Web Proceedings of WWW 2008 Workshop Linked Data on Description Resources (POWDER): Primer. Technical the Web (LDOW2008), Beijing, 2008. Report W3C Working Draft – 14 November 2008, [20] M. Hoogerwerf. Durable enhanced publications. In W3C, 2008. Proceedings of African Digital Scholarship & Curation [38] J. E. Sieber and B. E. Trumbo. (not) giving credit 2009, 2009. where credit is due: Citation of data sets. Science and [21] L. Hunter J., Chueng. Provenance explorer - a Engineering Ethics, 1:11–20, 1995. graphical interface for constructing scientific [39] A. Smith. The research library in the 21st century: publication pack ages from provenance trails. collecting, preserving, and making it accessible International Journal on Digital Libraries, resources for scholarship. In No Brief Candle: 7(1-2):99–107. Reconceiving Research Libraries for the 21st Century. [22] IMS Global Learning Consortium. IMS content Council on Library and Information Resources, 2008. packaging XML binding specification version 1.1.3. [40] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, http://www.imsglobal.org/content/packaging/, 2003. S. Graham, C. Kesselman, T. Maquire, T. Sandholm, [23] I. Jacobs and N. Walsh. Architecture of the world D. Snelling, and Vanderbilt. Open Grid Services wide web, volume one. Technical Report W3C Infrastructure (OGSI): Version 1.0. Technical Report Recommendation 15 December 2004, W3C, 2004. draft-ggf-ogsi-gridservice-33, Global Grid Forum, [24] R. K. Johnson. Institutional repositories: Partnering January 27 2003. with faculty to enhance scholarly communication. [41] H. Van de Sompel, S. Payette, J. Erickson, C. Lagoze, D-Lib Magazine, 8(11), 2002. and S. Warner. Rethinking scholarly communication: [25] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora: Building the system that scholars deserve. D-Lib an architecture for complex objects and their Magazine, 10(9), 2004. relationships. 
International Journal on Digital [42] R. Williams, R. Moore, and R. Hanisch. A virtual Libraries, 6(2):124–138, 2006. observatory vision based on publishing and virtual [26] C. Lagoze and H. Van de Sompel. The Open Archives data. Technical report, US National Virtual Initiative: building a low-barrier interoperability Observatory, 2003. framework. In JCDL ’01: Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, pages 54–62, 2001. [27] C. Lagoze, H. Van de Sompel, P. Johnston, M. Nelson, R. Sanderson, and S. Warner. ORE Specification - Abstract Data Model, 2008. http://www.openarchives.org/ore/datamodel. [28] C. Lagoze, H. Van de Sompel, P. Johnston, M. Nelson, R. Sanderson, and S. Warner. ORE Specification and User Guide - Table of Contents, 2008. http://www.openarchives.org/ore/1.0/toc. [29] C. Lagoze, H. Van de Sompel, P. Johnston, M. L. Nelson, R. Sanderson, and S. Warner. Object Re-Use & Exchange: A Resource-Centric Approach. Technical Report arXiv:0804.2273, 2008. [30] W. Li, O. Kolak, Q. Vu, and H. Takano. Defining logical domains in a web site. In Proceedings of the eleventh ACM on Hypertext and Hypermedia, pages 123–132, 2000. [31] C. A. Lynch. Institutional repositories: Essential infrastructure for scholarship in the digital age. ARL: A Bimonthly Report, (226), 2003. [32] J. P. McDonough. METS: Standardized encoding for digital library objects. International Journal on Digital Libraries, 6(2):148–158, 2006. [33] P. Murray-Rust and H. Rzepa. The next big thing: