kOre: Using Linked Data for OpenScience Information
Integration
Ivan Ermilov Konrad Höffner Jens Lehmann
University of Leipzig, Institute of Computer Science, AKSW Group
Augustusplatz 10, D-04109 Leipzig, Germany
{iermilov,hoeffner,lehmann}@informatik.uni-leipzig.de
Dmitry Mouromtsev
ITMO University
49 Kronverksky Ave., St.Petersburg, 197101, Russia
d.muromtsev@gmail.com
ABSTRACT
While the amount of data on the Web grows at 57 % per year, the Web of Science maintains a considerable amount of inertia, as yearly growth varies between 1.6 % and 14 %. On the other hand, the Web of Science consists of high-quality information created and reviewed by the international community of researchers. While it is a complicated process to switch from traditional publishing methods to methods that enable data publishing in machine-readable formats, the situation can be improved by at least exposing metadata about scientific publications in a machine-readable format. In this paper we target metadata hidden inside universities' internal databases, reports and other hard-to-discover sources. We extend the VIVO ontology and create the VIVO+ ontology. We define and describe a framework for the automatic conversion of university data to RDF. We showcase the VIVO+ ontology and the framework using the example of ITMO University.

1. INTRODUCTION
While the amount of data on the Web grows at 57 % per year [4], the Web of Science maintains a considerable amount of inertia, with growth varying between 1.6 % and 14 % [7] depending on the type of publication and research area. The share of the Web of Science inside the whole Web is small. For example, DBLP lists only 2 892 316 publications up to the date of writing [1]. The Library of Congress, with over 26 million books, only consumes up to 10 terabytes, while the size of the Web is measured in exabytes (i.e. millions of terabytes).

On the other hand, the Web of Science consists of high-quality information created and reviewed by the international community of researchers. Moreover, because research contributions are published using traditional methods targeted at print and screen media (i.e. PDF documents), the data has become inaccessible for automatic processing. In the new era of Big Data and the Web of Data, the scientific community has started developing new publication methods which reflect the 3V concept (i.e. volume, velocity, variety), such as nanopublications [5]. While it is a complicated process to switch from traditional publishing methods to methods that enable data publishing in machine-readable formats, the situation can be improved by at least exposing metadata about scientific publications in a machine-readable format. In this paper we target metadata which is hidden inside universities' internal databases, reports and other hard-to-discover sources. Unlocking this hidden metadata will facilitate the integration of the Web of Science with the new scientific data publication media (i.e. RDF datasets, nanopublications etc.), resulting in a Data Web of Science.

The Data Web of Science will enable new ways of data access based on data semantics. In particular, complex search queries based on metadata will become available, increasing the availability and discoverability of research papers. Given the availability of annotated PDF documents, it will be possible to measure key performance indicators across universities as well as browse university data worldwide. The Data Web of Science will also facilitate the search for specialists and researchers in a particular research field, thus enabling new collaborations.

In this paper we present a framework to expose metadata about research institutions according to the Linked Data principles [2]. In particular, our contributions are as follows:

• We extend the VIVO ontology and create the VIVO+ ontology to tackle several deficiencies which we identified when using it.
• We define and describe a framework for the automatic conversion of university data to RDF.
• We made the implementation of our approach freely available (on GitHub).
• We showcase how academic RDF data can be utilized and integrated with other data sources in an efficient way, using the example of ITMO University in St. Petersburg.
2. ONTOLOGY MODELLING
Using an established ontology for a particular domain reduces the development effort and leads to higher quality and better integration. In our usage scenario, we aim to publish organizational university data within a joint project of the University of Leipzig and ITMO. According to our research, the best-fitting candidate ontology for such a task is VIVO [3]. However, it leaves a number of gaps, which we address with the VIVO+ ontology, built upon VIVO and described in this section. VIVO+ adds missing key concepts, models and relationships. Moreover, it provides a more flexible and general modelling of relationships regarding academic degrees, which provides better coverage, e.g. by allowing the representation of Russian particularities.

Missing Concepts.
VIVO lacks many key concepts, while others are modelled in insufficient detail. For example, it contains a concept for a PhD student, but this concept does not reference any other concept of this domain except its super class. While a human user can understand the concept using its label, the more fine-grained modelling of VIVO+ with related concepts and their connections allows easier extension, higher expressiveness, better querying and automatic processing.

The following example shows the VIVO+ modelling of the student-degree-qualification domain:

:Alice a foaf:Person ;
    :academicQualification :AliceQualification ;
    :aspiresDegree :PhDDegree .

:AliceQualification
    :academicDegree :MasterDegree ;
    :academicSubject :Chemistry .

Use of Established Vocabularies.
Besides the core ontologies RDF, RDFS, OWL, DCMI Metadata Terms, XSD and FOAF, VIVO+ uses the LinkedGeoData and GeoNames ontologies to specify locations. In order to categorize the research topic of a laboratory, we use the 6-digit subject matter identifiers of the Russian State Index of Scientific and Technical Information (SISTI)1.

Design Principles.
VIVO+ was designed with the following design principles in mind, following [6]. Clarity is provided by labels and comments in both English and Russian for all described classes and properties. URIs have speaking names. Class and property definitions are formally stated using consistent OWL restrictions as well as domain and range definitions, which ensures the coherence of datasets that conform to VIVO+. Future extendibility of the ontology and related datasets is provided through the usage of established vocabularies where fitting, along with general definitions that are not restricted to the presented use cases. Using VIVO+ on top of VIVO presents minimal ontological commitment, as it merely extends the latter ontology. Encoding bias was minimized by designing the ontology according to the presented use cases.

Peculiarities of the Russian Education System.
We aim to provide an ontology that is sufficiently generic to be usable internationally. Different countries have different educational systems, however, and VIVO is modelled according to the system of the USA. Adapting VIVO to ITMO University poses three challenges: (1) identifying the peculiarities of the Russian system, (2) modelling the resulting additional concepts and (3) appropriately linking them to existing concepts. One difference between Russia and most of the world is that it has two different doctoral degrees: the Candidate of Sciences (кандидат наук, kandidat nauk) and the Doctor of Sciences (доктор наук, doktor nauk). The Candidate of Sciences is equivalent to the PhD, while the Doctor of Sciences can be earned after a period of further study following the award of the Candidate of Sciences degree and requires five to fifteen years beyond that award. VIVO+ contains classes for both degrees and relates them to existing concepts.

3. KORE: AUTOMATIZED MAPPING FRAMEWORK
In this section, we describe the transformation of the university data from an RDBMS to RDF modelled according to the VIVO+ ontology. To perform such a transformation, we developed the kOre2 framework on top of the Sparqlify SPARQL-SQL rewriter3, depicted in Figure 1. The framework is evaluated using the ITMO University infrastructure (i.e. an Oracle RDBMS) in section 4.

The main components of the kOre framework are:

• The Mappings Repository contains mappings in SML (Sparqlify Mapping Language). Mappings are used by the Sparqlify SPARQL-SQL rewriter to convert the data from the RDBMS to RDF. Mappings have to be maintained and updated by the framework maintainer in order to reflect university RDBMS schema changes and tackle new data inside the RDBMS.
• The Sparqlify SPARQL-SQL rewriter establishes the connection to the RDBMS and converts the data to RDF using mappings from the Mappings Repository.
• The Triple store stores RDF data and executes queries over it.
• The SPARQL Endpoint publishes RDF data on the Web, thus enabling various web applications which are capable of using the W3C semantic web standards.

We define interactions between components with four basic operations:

1. Poll. As a university RDBMS is a live system updated with new information several times a day, the Sparqlify SPARQL-SQL rewriter polls the Mappings Repository periodically to check for changes and perform transformations.
2. Push. After successful transformation of the data from the RDBMS, the data has to be pushed to the Triple store. The Push operation adds/deletes/updates RDF data inside the Triple store.

1 Translated; in the Russian original: Государственный рубрикатор научно-технической информации (ГРНТИ), see http://grnti.ru/
2 kOre: Using Linked Data for OpenScience Information Integration.
3 http://aksw.org/Projects/Sparqlify.html
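The poll operation only needs to re-run a transformation when a mapping actually changed. A minimal sketch of such change detection in Python (hypothetical file layout; the deployed kOre scripts are crontab-scheduled shell scripts, see section 4):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return a SHA-256 digest of a mapping file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def poll(repository: Path, seen: dict) -> list:
    """Compare current digests of *.sml files against the previous poll.

    Returns the names of mapping files that are new or changed and
    updates `seen` in place, so a caller can re-run the Sparqlify
    transformation only for those mappings."""
    changed = []
    for mapping in sorted(repository.glob("*.sml")):
        digest = file_digest(mapping)
        if seen.get(mapping.name) != digest:
            changed.append(mapping.name)
            seen[mapping.name] = digest
    return changed
```

On each scheduled tick, `poll` is called and only the returned mappings are fed to the rewriter; unchanged mappings are skipped.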
PREFIX ifmolod:
PREFIX vivoplus:
PREFIX vivo:
PREFIX rdfs:
PREFIX xsd:

CREATE VIEW LaboratoryResearchFields AS CONSTRUCT {
    ?laboratory vivoplus:locatedIn ?country .
}
WITH
    ?laboratory = uri(concat("http://lod.ifmo.ru/Laboratory", ?NET_DEP_ID))
    ?country = uri(?URI)
FROM
    [[ SELECT
        NET_DEP_ID,
        URI
    FROM sem_country_info, sem_country_lgd
    WHERE
        country_id=id
    ]]

Mapping Example
Figure 1: kOre: architecture overview.
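Each row returned by the relational part of the mapping example yields one triple: the laboratory URI is built by concatenating a namespace with NET_DEP_ID, while the country URI is taken verbatim from the URI column. A rough illustration in plain Python (not part of the framework; the expansion of the vivoplus prefix to the ontology URI from Table 1 is an assumption):

```python
def row_to_triple(net_dep_id: str, country_uri: str) -> str:
    """Mimic the CONSTRUCT/WITH part of the LaboratoryResearchFields
    view: emit one N-Triples line per (NET_DEP_ID, URI) row."""
    laboratory = f"http://lod.ifmo.ru/Laboratory{net_dep_id}"
    # Assumed expansion of vivoplus:locatedIn, based on the ontology
    # URI given in Table 1.
    predicate = "http://vivoplus.aksw.org/ontology#locatedIn"
    return f"<{laboratory}> <{predicate}> <{country_uri}> ."
```

In the real framework this expansion is performed by the Sparqlify rewriter, not by hand-written code.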
3. Publish. The Triple store publishes the data by providing a SPARQL Endpoint, thereby making it accessible on the Web.
4. Consume. Web applications can consume RDF via the SPARQL Endpoint, which provides a set of interfaces (e.g. an RDF/JSON interface).

The presented framework provides a persistent layer on top of the RDBMS infrastructure. Schema changes of the underlying RDBMS are reflected in the SML mappings, therefore the publicly available data provided by the framework is consistent over time. Thus, data consumers can rely on the data schema provided by the kOre framework and save the development time required for tackling schema changes inside applications.

4. MAPPING FRAMEWORK DEPLOYMENT AT ITMO
We implemented and deployed the kOre framework using the infrastructure of ITMO (Saint Petersburg State University of Information Technologies, Mechanics and Optics) University.4 ITMO University uses an Oracle RDBMS to store the data about laboratories, people and publications, among others. The database is only accessible from the university network, therefore a VPN connection is necessary.

For deployment we set up an Ubuntu 14.04 server with 6 GB of RAM and 35 GB of disk space. The Mappings Repository on the deployed framework corresponds to a system folder. Sparqlify is installed for all users and available through a command-line interface. As a Triple store we use Virtuoso Open Source, which provides a SPARQL Endpoint with an RDF/JSON interface out of the box. Poll and push operations are configured using shell scripts and scheduled with crontab. The publish operation is performed by Virtuoso Open Source internally. The consume operation is possible through the SPARQL Endpoint using POST requests.5

At the moment of writing, the ITMO LOD dataset contains information about 43 laboratories, 188 research areas and 1001 persons. The dump of the dataset is available online8. Table 1 provides statistics and links to services such as the SPARQL endpoint. The dataset is openly licensed under the PDDL 1.0 in accordance with the open definition9.

URI              http://vivoplus.aksw.org/ontology#
Version Date     2014-11-01
Version Number   1.0
License          PDDL 1.0
Triples          15 572
SPARQL Endpoint  http://lod.ifmo.ru/sparql
N-Triples Dump   http://lod.ifmo.ru/data.itmo.nt

Table 1: Technical Details of the ITMO LOD dataset

5. APPLICATIONS
In this section we describe how the consume operation can be utilized by developers and end-users.

For end-users we published the data from the ITMO LOD SPARQL endpoint under the lod.ifmo.ru domain using the Pubby10 Linked Data interface. Pubby makes dataset URIs dereferenceable and enables navigation between linked entities inside the dataset with an easy-to-use web interface. For example, in Figure 2 we show that a user is able to navigate through the persons and research areas of a particular laboratory.

Developers can consume the data through the SPARQL endpoint. Here we showcase how the ITMO LOD dataset can be utilized in such a way by implementing the LabMap application11 (the code is published in the GitHub repository), which shows

4 The implementation is freely available on GitHub: https://github.com/AKSW/itmolod
5 The SPARQL endpoint for the ITMO LOD project is located at http://lod.ifmo.ru/sparql
8 http://lod.ifmo.ru/data/itmo.nt
9 http://opendefinition.org/
10 http://wifo5-03.informatik.uni-mannheim.de/pubby/
11 http://lod.ifmo.ru/usecases/heatmap.html
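A consume call is an ordinary HTTP POST against the endpoint, with results returned in the W3C SPARQL JSON format. A minimal stdlib sketch (the query is illustrative; `labels` operates on an already-fetched result document, so no network access is shown):

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://lod.ifmo.ru/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?lab ?label WHERE { ?lab rdfs:label ?label } LIMIT 10
"""

def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL POST request asking for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        endpoint, data=data, method="POST",
        headers={"Accept": "application/sparql-results+json"})

def labels(results: dict) -> list:
    """Extract (lab, label) pairs from a SPARQL JSON result document."""
    return [(b["lab"]["value"], b["label"]["value"])
            for b in results["results"]["bindings"]]
```

Passing `build_request(ENDPOINT, QUERY)` to `urllib.request.urlopen` and decoding the body as JSON yields the document that `labels` flattens.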
Figure 2: Linked Data Interface enables browsing between linked entities (on the left). LabMap application
shows the laboratories of ITMO university filtered by collaborating countries (on the right).
laboratories of ITMO University by collaborating countries. A user is able to see the list of laboratories per country as well as the persons working in a particular laboratory (see Figure 2).

6. CONCLUSIONS AND FUTURE WORK
We implemented the kOre framework, which facilitates data publishing for universities. To support our framework, we extended the VIVO ontology, resulting in VIVO+. To showcase the applicability of the kOre framework, we deployed a Linked Data interface for end-users and implemented a simple web application as an example of data consumption for developers.

Unlocking the hidden metadata of universities is a step forward in the integration effort between the original Web of Science and the Data Web of Science. In the future we plan to deploy the kOre framework for the University of Leipzig. We plan to support the ongoing effort of ITMO University to expose more data publicly. We also plan to involve scientific personnel and students in the development of interesting applications using the available data.

7. ACKNOWLEDGMENTS
We want to thank our colleagues at AKSW and ITMO University who supported the ITMO LOD project:
• Claus Stadler for adapting Sparqlify to Oracle SQL on our request.
• Maxim Kolchin for administrating the server environment.
• Denis Varenikov for helping to get access to the ITMO RDBMS and exposing the data for us.
This work was partially supported financially by the Government of the Russian Federation, Grant 074-U01.

References
[1] DBLP publications per year. http://dblp.uni-trier.de/statistics/publicationsperyear, accessed: 11-06-2015
[2] Linked Data Design Issues. http://www.w3.org/DesignIssues/LinkedData.html, accessed: 11-06-2015
[3] VIVO-ISF Ontology. https://wiki.duraspace.org/display/VIVO/VIVO-ISF+Ontology, accessed: 11-06-2015
[4] Gantz, J.F., Reinsel, D.: The expanding digital universe: A forecast of worldwide information growth through 2010. IDC (2007), http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
[5] Groth, P., Gibson, A., Velterop, J.: The anatomy of a nanopublication. Information Services and Use 30(1), 51–56 (2010)
[6] Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 43(5), 907–928 (1995)
[7] Larsen, P.O., Von Ins, M.: The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics 84(3), 575–603 (2010)