kOre: Using Linked Data for OpenScience Information
Integration
Ivan Ermilov Konrad Höffner Jens Lehmann
University of Leipzig, Institute of Computer Science, AKSW Group
Augustusplatz 10, D-04109 Leipzig, Germany
{iermilov,hoeffner,lehmann}@informatik.uni-leipzig.de
Dmitry Mouromtsev
ITMO University
49 Kronverksky Ave., St.Petersburg, 197101, Russia
d.muromtsev@gmail.com
ABSTRACT
While the amount of data on the Web grows at 57 % per year, the Web of Science maintains a considerable amount of inertia, as yearly growth varies between 1.6 % and 14 %. On the other hand, the Web of Science consists of high-quality information created and reviewed by the international community of researchers. While it is a complicated process to switch from traditional publishing methods to methods that enable data publishing in machine-readable formats, the situation can be improved by at least exposing metadata about scientific publications in a machine-readable format. In this paper we target metadata hidden inside universities' internal databases, reports and other hard-to-discover sources. We extend the VIVO ontology and create the VIVO+ ontology. We define and describe a framework for the automatic conversion of university data to RDF. We showcase the VIVO+ ontology and the framework using the example of ITMO University.

1. INTRODUCTION
While the amount of data on the Web grows at 57 % per year [4], the Web of Science maintains a considerable amount of inertia, with growth varying between 1.6 % and 14 % [7] depending on the type of publication and research area. The share of the Web of Science inside the whole Web is small. For example, DBLP lists only 2 892 316 publications up to the date of writing [1]. The Library of Congress, with over 26 million books, only consumes up to 10 terabytes, while the size of the Web is measured in exabytes (i.e. millions of terabytes).

On the other hand, the Web of Science consists of high-quality information created and reviewed by the international community of researchers. Moreover, because research contributions are published using traditional methods targeted at print and screen media (i.e. PDF documents), the data has become inaccessible for automatic processing. In the new era of Big Data and the Web of Data, the scientific community has started developing new publication methods which reflect the 3V concept (i.e. volume, velocity, variety), such as nanopublications [5]. While it is a complicated process to switch from traditional publishing methods to methods that enable data publishing in machine-readable formats, the situation can be improved by at least exposing metadata about scientific publications in a machine-readable format. In this paper we target metadata which is hidden inside universities' internal databases, reports and other hard-to-discover sources. Unlocking this hidden metadata will facilitate the integration of the Web of Science with the new scientific data publication media (i.e. RDF datasets, nanopublications etc.), resulting in a Data Web of Science.

The Data Web of Science will enable new ways of data access based on data semantics. In particular, complex search queries based on metadata will become available, increasing the availability and discoverability of research papers. Given the availability of annotated PDF documents, it will be possible to measure key performance indicators across universities as well as browse university data worldwide. The Data Web of Science will also facilitate the search for specialists and researchers in a particular research field, thus enabling new collaborations.

In this paper we present a framework to expose metadata about research institutions according to the Linked Data principles [2]. In particular, our contributions are as follows:

• We extend the VIVO ontology and create the VIVO+ ontology to tackle several deficiencies which we identified when using it.
• We define and describe a framework for the automatic conversion of university data to RDF.
• We made the implementation of our approach freely available (on GitHub).
• We showcase how academic RDF data can be utilized and integrated with other data sources in an efficient way, using the example of ITMO University in St. Petersburg.
2. ONTOLOGY MODELLING
Using an established ontology for a particular domain reduces the development effort and leads to higher quality and better integration. In our usage scenario, we aim to publish organizational university data within a joint project of the University of Leipzig and ITMO. According to our research, the best-fitting candidate ontology for such a task is VIVO [3]. However, it leaves a number of gaps, which we address with the VIVO+ ontology, built upon VIVO and described in this section. VIVO+ adds missing key concepts, models and relationships. Moreover, it provides a more flexible and general modelling of relationships regarding academic degrees, which provides better coverage, e.g. by allowing the representation of Russian particularities.

Missing Concepts.
VIVO lacks many key concepts, while others are modelled in insufficient detail. For example, it contains a concept for a PhD student, but this concept does not reference any other concept of this domain except its super class. While a human user can understand the concept using its label, the more fine-grained modelling of VIVO+ with related concepts and their connections allows easier extension, higher expressiveness, better querying and automatic processing.

The following example shows the VIVO+ modelling of the student-degree-qualification domain:

:Alice a foaf:Person ;
    :academicQualification :AliceQualification ;
    :aspiresDegree :PhDDegree .

:AliceQualification
    :academicDegree :MasterDegree ;
    :academicSubject :Chemistry .

Use of Established Vocabularies.
Besides the core ontologies RDF, RDFS, OWL, DCMI Metadata Terms, XSD and FOAF, VIVO+ uses the LinkedGeoData and GeoNames ontologies to specify locations. In order to categorize the research topic of a laboratory, we use the 6-digit subject matter identifiers of the Russian State Index of Scientific and Technical Information (SISTI)1.

Design Principles.
VIVO+ was designed with the following design principles in mind, following [6]. Clarity is provided by labels and comments in both English and Russian for all described classes and properties. URIs have speaking names. Class and property definitions are formally stated using consistent OWL restrictions as well as domain and range definitions, which ensures the coherence of datasets that conform to VIVO+. Future extendibility of the ontology and related datasets is provided through the usage of established vocabularies where fitting, along with general definitions that are not restricted to the presented use cases. Using VIVO+ on top of VIVO presents minimal ontological commitment, as it merely extends the latter ontology. Encoding bias was minimized by designing the ontology according to the presented use cases.

Peculiarities of the Russian Education System.
We aim to provide an ontology that is sufficiently generic to be usable internationally. Different countries have different educational systems, however, and VIVO is modelled according to the system of the USA. Adapting VIVO to ITMO University poses three challenges: (1) identifying the peculiarities of the Russian system, (2) modelling the resulting additional concepts and (3) appropriately linking them to existing concepts. One difference between Russia and most of the world is that it has two different doctoral degrees: the Candidate of Sciences (кандидат наук, kandidat nauk) and the Doctor of Sciences (доктор наук, doktor nauk). The Candidate of Sciences is equivalent to the PhD, while the Doctor of Sciences can be earned after a period of further study following the award of the Candidate of Sciences degree and requires five to fifteen years beyond that award. VIVO+ contains classes for both degrees and relates them to existing concepts.

3. KORE: AUTOMATIZED MAPPING FRAMEWORK
In this section, we describe the transformation of the university data from an RDBMS to RDF modelled according to the VIVO+ ontology. To perform such a transformation, we developed the kOre2 framework on top of the Sparqlify SPARQL-SQL rewriter3, depicted in Figure 1. The framework is evaluated using the ITMO University infrastructure (i.e. an Oracle RDBMS) in section 4.

The main components of the kOre framework are:

• The Mappings Repository contains mappings in SML (Sparqlify Mapping Language). Mappings are used by the Sparqlify SPARQL-SQL rewriter to convert the data from the RDBMS to RDF. Mappings have to be maintained and updated by the framework maintainer in order to reflect university RDBMS schema changes and tackle new data inside the RDBMS.
• The Sparqlify SPARQL-SQL rewriter establishes the connection to the RDBMS and converts the data to RDF using mappings from the Mappings Repository.
• The Triple store stores RDF data and executes queries over it.
• The SPARQL Endpoint publishes RDF data on the Web, thus enabling various web applications which are capable of using the W3C semantic web standards.

We define interactions between components with four basic operations:

1. Poll. As a university RDBMS is a live system updated with new information several times a day, the Sparqlify SPARQL-SQL rewriter polls the Mappings Repository periodically to check for changes and perform transformations.
2. Push. After successful transformation of the data from the RDBMS, the data has to be pushed to the Triple store. The Push operation adds/deletes/updates RDF data inside the Triple store.

1 Translated; in the Russian original: Государственный рубрикатор научно-технической информации (ГРНТИ), see http://grnti.ru/
2 kOre: Using Linked Data for OpenScience Information Integration.
3 http://aksw.org/Projects/Sparqlify.html
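The poll operation only needs to re-run a transformation when a mapping actually changed. A minimal sketch of such change detection in Python (hypothetical file layout; the deployed kOre scripts are crontab-scheduled shell scripts, see section 4):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return a SHA-256 digest of a mapping file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def poll(repository: Path, seen: dict) -> list:
    """Compare current digests of *.sml files against the previous poll.

    Returns the names of mapping files that are new or changed and
    updates `seen` in place, so a caller can re-run the Sparqlify
    transformation only for those mappings."""
    changed = []
    for mapping in sorted(repository.glob("*.sml")):
        digest = file_digest(mapping)
        if seen.get(mapping.name) != digest:
            changed.append(mapping.name)
            seen[mapping.name] = digest
    return changed
```

On each scheduled tick, `poll` is called and only the returned mappings are fed to the rewriter; unchanged mappings are skipped.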
PREFIX ifmolod:
PREFIX vivoplus:
PREFIX vivo:
PREFIX rdfs:
PREFIX xsd:

CREATE VIEW LaboratoryResearchFields AS CONSTRUCT {
    ?laboratory vivoplus:locatedIn ?country .
}
WITH
    ?laboratory = uri(concat("http://lod.ifmo.ru/Laboratory", ?NET_DEP_ID))
    ?country = uri(?URI)
FROM
    [[ SELECT
        NET_DEP_ID,
        URI
    FROM sem_country_info, sem_country_lgd
    WHERE
        country_id=id
    ]]

Mapping Example
Figure 1: kOre: architecture overview.
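Each row returned by the relational part of the mapping example yields one triple: the laboratory URI is built by concatenating a namespace with NET_DEP_ID, while the country URI is taken verbatim from the URI column. A rough illustration in plain Python (not part of the framework; the expansion of the vivoplus prefix to the ontology URI from Table 1 is an assumption):

```python
def row_to_triple(net_dep_id: str, country_uri: str) -> str:
    """Mimic the CONSTRUCT/WITH part of the LaboratoryResearchFields
    view: emit one N-Triples line per (NET_DEP_ID, URI) row."""
    laboratory = f"http://lod.ifmo.ru/Laboratory{net_dep_id}"
    # Assumed expansion of vivoplus:locatedIn, based on the ontology
    # URI given in Table 1.
    predicate = "http://vivoplus.aksw.org/ontology#locatedIn"
    return f"<{laboratory}> <{predicate}> <{country_uri}> ."
```

In the real framework this expansion is performed by the Sparqlify rewriter, not by hand-written code.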
3. Publish. The Triple store publishes the data by providing a SPARQL Endpoint, thereby making it accessible on the Web.
4. Consume. Web applications can consume RDF via the SPARQL Endpoint, which provides a set of interfaces (e.g. an RDF/JSON interface).

The presented framework provides a persistent layer on top of the RDBMS infrastructure. Schema changes of the underlying RDBMS are reflected in the SML mappings, therefore the publicly available data provided by the framework is consistent over time. Thus, data consumers can rely on the data schema provided by the kOre framework and save the development time required for tackling schema changes inside applications.

4. MAPPING FRAMEWORK DEPLOYMENT AT ITMO
We implemented and deployed the kOre framework using the infrastructure of ITMO (Saint Petersburg State University of Information Technologies, Mechanics and Optics) University.4 ITMO University uses an Oracle RDBMS to store the data about laboratories, people and publications, among others. The database is only accessible from the university network, therefore a VPN connection is necessary.

For deployment we set up an Ubuntu 14.04 server with 6 GB of RAM and 35 GB of disk space. The Mappings Repository on the deployed framework corresponds to a system folder. Sparqlify is installed for all users and available through a command-line interface. As a Triple store we use Virtuoso Open Source, which provides a SPARQL Endpoint with an RDF/JSON interface out of the box. Poll and push operations are configured using shell scripts and scheduled with crontab. The publish operation is performed by Virtuoso Open Source internally. The consume operation is possible through the SPARQL Endpoint using POST requests.5

At the moment of writing, the ITMO LOD dataset contains information about 43 laboratories, 188 research areas and 1001 persons. The dump of the dataset is available online8. Table 1 provides statistics and links to services such as the SPARQL endpoint. The dataset is openly licensed under the PDDL 1.0 in accordance with the open definition9.

URI              http://vivoplus.aksw.org/ontology#
Version Date     2014-11-01
Version Number   1.0
License          PDDL 1.0
Triples          15 572
SPARQL Endpoint  http://lod.ifmo.ru/sparql
N-Triples Dump   http://lod.ifmo.ru/data.itmo.nt

Table 1: Technical Details of the ITMO LOD dataset

5. APPLICATIONS
In this section we describe how the consume operation can be utilized by developers and end-users.

For end-users we published the data from the ITMO LOD SPARQL endpoint under the lod.ifmo.ru domain using the Pubby10 Linked Data interface. Pubby makes dataset URIs dereferenceable and enables navigation between linked entities inside the dataset with an easy-to-use web interface. For example, in Figure 2 we show that a user is able to navigate through the persons and research areas of a particular laboratory.

Developers can consume the data through the SPARQL endpoint. Here we showcase how the ITMO LOD dataset can be utilized in such a way by implementing the LabMap application11 (the code is published in the GitHub repository), which shows

4 The implementation is freely available on GitHub: https://github.com/AKSW/itmolod
5 The SPARQL endpoint for the ITMO LOD project is located at http://lod.ifmo.ru/sparql
8 http://lod.ifmo.ru/data/itmo.nt
9 http://opendefinition.org/
10 http://wifo5-03.informatik.uni-mannheim.de/pubby/
11 http://lod.ifmo.ru/usecases/heatmap.html
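A consume call is an ordinary HTTP POST against the endpoint, with results returned in the W3C SPARQL JSON format. A minimal stdlib sketch (the query is illustrative; `labels` operates on an already-fetched result document, so no network access is shown):

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://lod.ifmo.ru/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?lab ?label WHERE { ?lab rdfs:label ?label } LIMIT 10
"""

def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL POST request asking for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        endpoint, data=data, method="POST",
        headers={"Accept": "application/sparql-results+json"})

def labels(results: dict) -> list:
    """Extract (lab, label) pairs from a SPARQL JSON result document."""
    return [(b["lab"]["value"], b["label"]["value"])
            for b in results["results"]["bindings"]]
```

Passing `build_request(ENDPOINT, QUERY)` to `urllib.request.urlopen` and decoding the body as JSON yields the document that `labels` flattens.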
Figure 2: Linked Data Interface enables browsing between linked entities (on the left). LabMap application
shows the laboratories of ITMO university filtered by collaborating countries (on the right).
laboratories of ITMO University by collaborating countries. A user is able to see the list of laboratories per country as well as the persons working in a particular laboratory (see Figure 2).

6. CONCLUSIONS AND FUTURE WORK
We implemented the kOre framework, which facilitates data publishing for universities. To support our framework, we extended the VIVO ontology, resulting in VIVO+. To showcase the applicability of the kOre framework, we deployed a Linked Data interface for end-users and implemented a simple web application as an example of data consumption for developers.

Unlocking the hidden metadata of universities is a step forward in the integration effort between the original Web of Science and the Data Web of Science. In the future we plan to deploy the kOre framework for the University of Leipzig. We plan to support the ongoing effort of ITMO University to expose more data publicly. We also plan to involve scientific personnel and students in the development of interesting applications using the available data.

7. ACKNOWLEDGMENTS
We want to thank our colleagues at AKSW and ITMO University who supported the ITMO LOD project:
• Claus Stadler for adapting Sparqlify to Oracle SQL on our request.
• Maxim Kolchin for administrating the server environment.
• Denis Varenikov for helping to get access to the ITMO RDBMS and exposing the data for us.
This work was partially supported financially by the Government of the Russian Federation, Grant 074-U01.

References
[1] DBLP publications per year. http://dblp.uni-trier.de/statistics/publicationsperyear, accessed: 11-06-2015
[2] Linked Data Design Issues. http://www.w3.org/DesignIssues/LinkedData.html, accessed: 11-06-2015
[3] VIVO-ISF Ontology. https://wiki.duraspace.org/display/VIVO/VIVO-ISF+Ontology, accessed: 11-06-2015
[4] Gantz, J.F., Reinsel, D.: The expanding digital universe: A forecast of worldwide information growth through 2010. IDC (2007), http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
[5] Groth, P., Gibson, A., Velterop, J.: The anatomy of a nanopublication. Information Services and Use 30(1), 51–56 (2010)
[6] Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 43(5), 907–928 (1995)
[7] Larsen, P.O., Von Ins, M.: The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics 84(3), 575–603 (2010)