
An Ontology Design Pattern for Data Integration in the Library Domain

Patrick O'Brien

David Carral

Mixter

Pascal Hitzler

Montana State University

Wright State University

A university's institutional repository (IR) contains the intellectual output of its faculty, staff and students. Its content is extensive and heterogeneous, which complicates data aggregation and discovery tasks. To address these challenges, we propose the use of a conceptual ontology design pattern to model information for the IR domain which is general enough to be reused across different IR datasets.

Introduction

A university's institutional repository (IR) contains the intellectual output of its faculty, staff and students. Content can be diverse and may include theses and dissertations, proceedings, books, preprints and post-print journal articles, as well as grey literature and datasets that support research conclusions. While there are a number of Linked Open Datasets (LOD) with structured bibliographic records on the web (e.g., DBLP, CiteSeer, and Semantic Web Dog Food), none offers open access to a full-text version of the scholarly article or a robust view of the academic output of an entire university.

Currently there are more than 2,400 IRs affiliated with universities or disciplinary societies that are built on the principle of open access [7]. Most IRs include full-text versions of the scholarly work encoded as media objects (PDF, CSV, etc.). IRs contain a vast amount of data encapsulating information that can provide unique perspectives on institutional research activities, such as the interdisciplinary collaboration among researchers, departments and colleges.

However, this valuable information is typically locked in bibliographic records as simple text strings, or blobs, that are difficult for machines to isolate, ingest and interpret. Unstructured IR data also hinder discovery by making indexing by scholarly search engines difficult [1].

To unlock the full potential of open access IRs, it is necessary to dissect each bibliographic record to identify, and link together, the entities contained within. The research question, then, is whether a repeatable structured data model can improve access to and discovery of IR content by improving the quality of IR data.

This paper describes a generic Ontology Design Pattern (ODP) based on a project to convert bibliographic records from Montana State University's Open Access Institutional Repository (IR) into linked data while still improving access and discovery by services such as Google and Google Scholar. As at most libraries, the IR metadata at Montana State University were maintained in multiple production systems that used different formats, specifically MAchine Readable Cataloging (MARC) and Metadata Object Description Schema (MODS), to describe and access the same scholarly papers encoded as full-text PDF files.

The challenge was producing a single accurate and robust description of the materials contained within the IR. This required staff to extract, consolidate, and parse records into individual text strings and transform them into RDF. This was done using a model based upon Schema.org and Dublin Core, extended with the Citation Style Language for granular details. Once converted into RDF, the data were reconciled against the university's internal Faculty Activity Database to establish instance data of people with their Colleges and Departments. The RDF data were then linked to the external sources DBpedia and the Library of Congress Subject Headings (LCSH). While the process succeeded in publishing Montana State University's IR as LOD [6], it required significant ad hoc and manual work to identify and address data quality issues.

We propose that a generic Ontology Design Pattern (ODP) developed with the three characteristics below would help IR managers publish IR content as quality LOD more quickly and efficiently:
1. Directly applicable to a variety of IR datasets, thus lowering the initial hurdle for IRs to publish Linked Data [2].
2. Easily extensible, e.g., by aligning with existing library ontologies, foundational ontologies, and other domain-specific vocabularies.
3. Helpful to IR data managers in improving the quality of IR metadata by reducing the practice of manually reviewing bibliographic records for accuracy.

Deriving such an ODP requires a generic use case which captures recurring problems in different application domains. Competency questions are queries that a domain expert would be expected to run against a knowledge base and are recognized as a good approach for modeling requirements from multiple domains. For the proposed ODP, such competency questions include:
1. Which records violate existing conditions required for scholarly citation?
2. What is the topic diversity of an organization's intellectual output?
3. What is the depth of an organization's intellectual output?
4. Are there authors with "weak ties" to my domain of expertise whom I can explore for "novel ideas" or collaboration in my research?

Formalization

This section discusses the more interesting classes, properties, and axioms of the library pattern. Description Logics (DL) notation has been used to present the axioms. To encode the pattern, we make use of the logic fragment SROIQ as defined in [5], which is the basis for the OWL 2 DL standard [4]. The proposed ODP has been formally encoded using the Web Ontology Language (OWL).¹ A schematic view of the pattern is shown in Figure 1.

¹ The pattern can be downloaded from www.dropbox.com/sh/88jh5qwdgpxueqz/AAAj_kgmL5ErPL2JaPWtCvEsa?dl=0.

CreativeWork: a generic class of creative work that includes things like books, movies or software programs. A subclass of CreativeWork, ScholarlyWork, contains all creative works related to scholarly research. The CreativeWork and ScholarlyWork class relationship is enforced by axiom (1). Axiom (2) indicates that every scholarly work must have some author and exactly one publication date.

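$$\mathsf{ScholarlyWork} \sqsubseteq \mathsf{CreativeWork} \tag{1}$$
$$\mathsf{ScholarlyWork} \sqsubseteq \exists \mathsf{hasCreator}.\mathsf{Creator} \sqcap {=}1\, \mathsf{hasPublicationDate}.\mathsf{Date} \tag{2}$$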
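Competency question 1 can be approximated directly from axiom (2). The following is a minimal Python sketch, not part of the published pattern, that flags records lacking a creator or not having exactly one publication date. The record class, its field names, and the sample data are hypothetical. It is a closed-world check over the data at hand; an OWL reasoner, working under the open-world assumption, would not report a missing creator as an inconsistency.

```python
from dataclasses import dataclass, field


@dataclass
class WorkRecord:
    """Hypothetical, simplified view of one IR record (not the pattern's OWL vocabulary)."""
    identifier: str
    creators: list = field(default_factory=list)
    publication_dates: list = field(default_factory=list)


def axiom_2_violations(record):
    """List the ways a record fails a closed-world reading of axiom (2)."""
    problems = []
    # "some creator": at least one creator must be recorded.
    if not record.creators:
        problems.append("no creator")
    # "exactly one publication date": count distinct recorded dates.
    distinct_dates = set(record.publication_dates)
    if len(distinct_dates) != 1:
        problems.append(f"{len(distinct_dates)} publication dates, expected exactly 1")
    return problems


# Made-up sample data for illustration only.
records = [
    WorkRecord("thesis-1", ["A. Author"], ["2014-05-01"]),
    WorkRecord("thesis-2", [], ["2013-12-15"]),
    WorkRecord("thesis-3", ["B. Author"], ["2012", "2013"]),
]

report = {r.identifier: axiom_2_violations(r) for r in records}
flagged = {key: value for key, value in report.items() if value}
print(flagged)
```

Records that appear in `flagged` are candidates for manual review rather than proven errors, since the source metadata may simply be incomplete.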

Creator: some person or organization responsible for generating some creative work. All creators must have created at least some CreativeWork (3).

$$\mathsf{Creator} \sqsubseteq \exists \mathsf{isCreatorOf}.\mathsf{CreativeWork} \tag{3}$$

InstitutionalRepository: a repository which contains a set of creative works. It is related to some organization. An institutional repository must contain some type of scholarly work from some creator.

$$\mathsf{InstitutionalRepository} \sqsubseteq \exists \mathsf{containsWorksFrom}.\mathsf{Organization} \sqcap \exists \mathsf{holdsIntelectualOutput}.\mathsf{CreativeWork} \tag{4}$$

Organization: An entity that formally links a group of people to a common goal. A relevant class of Organization for our context is ScholarlyOrganization (5). Universities, colleges, academic departments, and libraries are scholarly organizations (6-9).

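$$\mathsf{ScholarlyOrganization} \sqsubseteq \mathsf{Organization} \tag{5}$$
$$\mathsf{University} \sqsubseteq \mathsf{ScholarlyOrganization} \tag{6}$$
$$\mathsf{College} \sqsubseteq \mathsf{ScholarlyOrganization} \tag{7}$$
$$\mathsf{Department} \sqsubseteq \mathsf{ScholarlyOrganization} \tag{8}$$
$$\mathsf{Library} \sqsubseteq \mathsf{ScholarlyOrganization} \tag{9}$$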

Universities have at least one college and at least one academic department (10). Colleges are part of at most one university (11). Academic departments are part of exactly one university (12).

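$$\mathsf{University} \sqsubseteq \exists \mathsf{hasCollege}.\mathsf{College} \sqcap \exists \mathsf{hasDepartment}.\mathsf{AcademicDepartment} \tag{10}$$
$$\mathsf{College} \sqsubseteq {\leq}1\, \mathsf{isCollegeOf}.\mathsf{University} \tag{11}$$
$$\mathsf{Department} \sqsubseteq {=}1\, \mathsf{isDepartmentOf}.\mathsf{University} \tag{12}$$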
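The difference between "at most one" in (11) and "exactly one" in (12) can be made concrete on instance data. The sketch below is an illustrative closed-world check in Python, assuming that distinct names denote distinct universities; the function names and sample organizations are hypothetical and are not taken from the pattern's OWL file.

```python
from collections import defaultdict


def targets_by_source(pairs):
    """Group (source, target) pairs into a mapping source -> set of targets."""
    grouped = defaultdict(set)
    for source, target in pairs:
        grouped[source].add(target)
    return grouped


def check_cardinalities(colleges, departments, is_college_of, is_department_of):
    """Return (colleges breaking (11), departments breaking (12)), each sorted by name."""
    college_unis = targets_by_source(is_college_of)
    department_unis = targets_by_source(is_department_of)
    # (11): zero or one university is acceptable, two or more is not.
    too_many = sorted(c for c in colleges if len(college_unis[c]) > 1)
    # (12): anything other than exactly one university is flagged.
    not_exactly_one = sorted(d for d in departments if len(department_unis[d]) != 1)
    return too_many, not_exactly_one


# Made-up sample data: "Letters" and "History" have no recorded university.
colleges = {"Engineering", "Letters"}
departments = {"ComputerScience", "History"}
is_college_of = {("Engineering", "MSU")}
is_department_of = {("ComputerScience", "MSU")}

bad_colleges, bad_departments = check_cardinalities(
    colleges, departments, is_college_of, is_department_of
)
print(bad_colleges, bad_departments)
```

In this sample, the college without a recorded university passes the check for (11), while the department without one is flagged under (12).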

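We introduce subproperty statements (13-14) and declare the property hasSubOrganization transitive (15) with the following axioms:²

$$\mathsf{hasCollege} \sqsubseteq \mathsf{hasSubOrganization} \tag{13}$$
$$\mathsf{hasDepartment} \sqsubseteq \mathsf{hasSubOrganization} \tag{14}$$
$$\mathsf{hasSubOrganization} \circ \mathsf{hasSubOrganization} \sqsubseteq \mathsf{hasSubOrganization} \tag{15}$$

The following role chains enable automatic determination of an organization's intellectual output:

$$\mathsf{hasSubOrganization} \circ \mathsf{hasAffiliate} \sqsubseteq \mathsf{hasAffiliate} \tag{16}$$
$$\mathsf{hasAffiliate} \circ \mathsf{isCreatorOf} \sqsubseteq \mathsf{producesIntellectualOutput} \tag{17}$$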
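To illustrate what a reasoner would derive from axioms (13) through (17), the following Python sketch materializes the inferred relations over sets of pairs. It is a hand-rolled illustration under stated assumptions, not the pattern's implementation or an OWL reasoner: the function names are hypothetical, and the sample data (including a college that has a department) are made up.

```python
def compose(first, second):
    """Relational composition: (a, c) whenever (a, b) is in first and (b, c) is in second."""
    successors = {}
    for b, c in second:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in first for c in successors.get(b, ())}


def infer_intellectual_output(has_college, has_department, has_affiliate, is_creator_of):
    """Derive producesIntellectualOutput pairs (organization, work)."""
    # (13)-(14): both properties are subproperties of hasSubOrganization.
    sub_org = set(has_college) | set(has_department)
    # (15): close hasSubOrganization under transitivity.
    while True:
        new_pairs = compose(sub_org, sub_org) - sub_org
        if not new_pairs:
            break
        sub_org |= new_pairs
    # (16): affiliates of a sub-organization are affiliates of its parents.
    # One pass suffices because sub_org is already transitively closed.
    affiliate = set(has_affiliate) | compose(sub_org, has_affiliate)
    # (17): works created by an affiliate are the organization's output.
    return compose(affiliate, is_creator_of)


# Made-up sample data for illustration only.
has_college = {("MSU", "Engineering")}
has_department = {("Engineering", "ComputerScience")}
has_affiliate = {("ComputerScience", "alice")}
is_creator_of = {("alice", "thesis-1")}

output = infer_intellectual_output(has_college, has_department, has_affiliate, is_creator_of)
print(sorted(output))
```

Although the thesis is asserted only for a person affiliated with the department, the derived pairs also attribute it to the enclosing college and university, which is the behaviour the role chains are intended to provide.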

Conclusions and Future Work

Applying an ODP to IR data will improve the efficiency and effectiveness of library metadata management workflows by quickly identifying data issues that are currently found through manual review. Improving the quality of IR metadata and publishing it for syndication on the Semantic Web will aid machine-assisted discovery and help address the limited availability of datasets that contain adequate information linked to full-text scholarly research capable of supporting semantics-driven Literature-Based Discovery [3].

We are planning future iterations that extend the axiomatization and populate the pattern using previous domain modeling and a real-world dataset from Montana State University [6].

² Many axioms that follow intuitively from property labels, such as $\mathsf{isCollegeOf} \equiv \mathsf{hasCollege}^{-}$, are omitted. For a comprehensive list, see our submission at www.dropbox.com/sh/88jh5qwdgpxueqz/AAAj_kgmL5ErPL2JaPWtCvEsa?dl=0.

References

1. Arlitsch, K., O'Brien, P.S.: Invisible institutional repositories: Addressing the low indexing ratios of IRs in Google Scholar. Library Hi Tech 30(1), 60–81 (2012), http://dx.doi.org/10.1108/07378831211213210

2. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 1–22 (2009), http://dx.doi.org/10.4018/jswis.2009081901

3. Cameron, D., Bodenreider, O., Yalamanchili, H., Danh, T., Vallabhaneni, S., Thirunarayan, K., Sheth, A.P., Rindflesch, T.C.: A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications. Journal of Biomedical Informatics 46(2), 238–251 (2013), http://dx.doi.org/10.1016/j.jbi.2012.09.004

4. Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P.F., Rudolph, S. (eds.): OWL 2 Web Ontology Language: Primer. W3C Recommendation (27 October 2009), available at http://www.w3.org/TR/owl2-primer/

5. Horrocks, I., Kutz, O., Sattler, U.: The even more irresistible SROIQ. In: Proc. of the 10th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2006). pp. 57–67. AAAI Press (2006)

6. Mixter, J., O'Brien, P., Arlitsch, K.: Describing theses and dissertations using schema.org. In: Proceedings of the 2014 International Conference on Dublin Core and Metadata Applications. pp. 138–146. DCMI'14, Dublin Core Metadata Initiative (2014), http://dl.acm.org/citation.cfm?id=2771234.2771249

7. Pinfield, S., Salter, J., Bath, P.A., Hubbard, B., Millington, P., Anders, J.H.S., Hussain, A.: Open-access repositories worldwide, 2005–2012: Past growth, current characteristics, and future possibilities. JASIST 65(12), 2404–2421 (2014), http://dx.doi.org/10.1002/asi.23131