<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>kOre: Using Linked Data for OpenScience Information Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Ermilov</string-name>
          <email>iermilov@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konrad Höffner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>lehmann@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
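        <contrib contrib-type="author">
          <string-name>Dmitry Mouromtsev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>49 Kronverksky Ave., St. Petersburg, 197101</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>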
        <aff id="aff1">
          <label>1</label>
          <institution>University of Leipzig, Institute of Computer Science, AKSW Group</institution>
          ,
          <addr-line>Augustusplatz 10, D-04109 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>67</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>While the amount of data on the Web grows at 57 % per year, the Web of Science maintains a considerable amount of inertia, as its yearly growth varies between 1.6 % and 14 %. On the other hand, the Web of Science consists of high-quality information created and reviewed by the international community of researchers. While it is a complicated process to switch from traditional publishing methods to methods that enable data publishing in machine-readable formats, the situation can be improved by at least exposing metadata about scientific publications in a machine-readable format. In this paper we target metadata hidden inside universities' internal databases, reports and other hard-to-discover sources. We extend the VIVO ontology and create the VIVO+ ontology. We define and describe a framework for automatic conversion of university data to RDF. We showcase the VIVO+ ontology and the framework using the example of ITMO University.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        While the amount of data on the Web grows at 57 % per
year [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ], the Web of Science maintains a considerable amount
of inertia, where growth varies between 1.6 % and 14 % [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ]
depending on the type of publication and research area. The
share of the Web of Science inside the whole Web is small.
For example, DBLP lists only 2 892 316 publications up to the
date of writing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The Library of Congress, with more than
26 million books, only consumes up to 10 terabytes, while
the size of the Web is measured in exabytes (i.e. millions of
terabytes).
      </p>
      <p>
        On the other hand, the Web of Science consists of high-quality
information created and reviewed by the international
community of researchers. Yet, because research
contributions are published using traditional methods, which are targeted
at print and screen media (i.e. PDF documents), the data
has become inaccessible for automatic processing. In the
new era of Big Data and the Web of Data, the scientific
community has started developing new publication methods that
reflect the 3V concept (i.e. volume, velocity, variety), such
as nanopublications [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ]. While it is a complicated process
to switch from traditional publishing methods to methods
that enable data publishing in machine-readable formats, the
situation can be improved by at least exposing metadata
about scientific publications in a machine-readable format.
In this paper we target metadata that is hidden inside
universities' internal databases, reports and other hard-to-discover
sources. Unlocking this hidden metadata will
facilitate the integration of the Web of Science with the new scientific
data publication media (i.e. RDF datasets, nanopublications
etc.), resulting in a Data Web of Science.
      </p>
      <p>The Data Web of Science will enable new ways of data
access based on the data semantics. In particular, complex
search queries based on metadata will become possible, which
increases the availability and discoverability of research papers.
Given the availability of annotated PDF documents,
it will be possible to measure key performance indicators
across universities as well as to browse university data
worldwide. The Data Web of Science will also facilitate the search for
specialists and researchers in a particular research field,
thus enabling new collaborations.</p>
      <p>
        In this paper we present a framework to expose metadata
about research institutions according to the Linked Data
principles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, our contributions are as follows:
      </p>
      <list list-type="bullet">
        <list-item><p>We extend the VIVO ontology and create the VIVO+
ontology to tackle several deficiencies which we
identified when using it.</p></list-item>
        <list-item><p>We define and describe a framework for automatic
conversion of university data to RDF.</p></list-item>
        <list-item><p>We made the implementation of our approach freely
available (on GitHub).</p></list-item>
        <list-item><p>We showcase how academic RDF data can be utilized
and integrated with other data sources in an efficient
way using the example of ITMO University in St.
Petersburg.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. ONTOLOGY MODELLING</title>
      <p>Using an established ontology for a particular domain
reduces the development effort and leads to higher quality
and better integration. In our usage scenario, we aim to
publish organizational university data within a joint project
of the University of Leipzig and ITMO. According to our
research, the best-fitting candidate ontology for such a task
is VIVO [3]. However, it leaves a number of gaps, which
we address with the VIVO+ ontology, built upon VIVO
and described in this section. VIVO+ adds missing key
concepts, models and relationships. Moreover, it provides a
more flexible and general modelling of relationships regarding
academic degrees, which gives better coverage, e.g. by
making it possible to represent Russian particularities.</p>
      <sec id="sec-2-1">
        <title>Missing Concepts.</title>
        <p>VIVO lacks many key concepts, while others are modelled
in insufficient detail. For example, it contains a concept for
a PhD student, but this concept does not reference any other
concept of this domain except its superclass. While a human
user can understand the concept from its label, the more
fine-grained modelling of VIVO+ with related concepts and their
connections allows easier extension, higher expressiveness,
better querying and automatic processing.</p>
        <p>The following example shows the VIVO+ modelling of the
student-degree-qualification domain.</p>
        <preformat>:Alice a foaf:Person ;
    :academicQualification :AliceQualification ;
    :aspiresDegree :PhDDegree .
:AliceQualification
    :academicDegree :MasterDegree ;
    :academicSubject :Chemistry .</preformat>
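        <p>To illustrate why the connected modelling eases querying, the following minimal sketch stores the triples of the Alice example as plain Python tuples and looks up every person who aspires to a PhD while already holding a Master's degree. It is only an illustration under stated assumptions: prefixed names are kept as plain strings, the Turtle keyword a is written out as rdf:type, no RDF library is used, and the helper names match and aspirants_with_degree are hypothetical and not part of VIVO+.</p>
        <preformat><![CDATA[
```python
# Triples of the Alice example as (subject, predicate, object) tuples.
# Prefixed names stay plain strings; no RDF library is assumed.
TRIPLES = {
    (":Alice", "rdf:type", "foaf:Person"),
    (":Alice", ":academicQualification", ":AliceQualification"),
    (":Alice", ":aspiresDegree", ":PhDDegree"),
    (":AliceQualification", ":academicDegree", ":MasterDegree"),
    (":AliceQualification", ":academicSubject", ":Chemistry"),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def aspirants_with_degree(triples, aspired, held):
    """People aspiring to `aspired` whose qualification carries the degree `held`."""
    found = set()
    for person, _, _ in match(triples, p=":aspiresDegree", o=aspired):
        for _, _, qualification in match(triples, s=person, p=":academicQualification"):
            if match(triples, s=qualification, p=":academicDegree", o=held):
                found.add(person)
    return sorted(found)


print(aspirants_with_degree(TRIPLES, ":PhDDegree", ":MasterDegree"))
```
]]></preformat>
        <p>The lookup follows two links (person to qualification, qualification to degree), which is only possible because the qualification is modelled as a resource of its own rather than as a label on the student.</p>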
      </sec>
      <sec id="sec-2-2">
        <title>Use of Established Vocabularies.</title>
        <p>Besides the core ontologies RDF, RDFS, OWL, DCMI
Metadata Terms, XSD and FOAF, VIVO+ uses the
LinkedGeoData and GeoNames ontologies to specify locations.</p>
        <p>In order to categorize the research topic of a laboratory,
we use the six-digit subject matter identifiers of the Russian
State Index of Scientific and Technical Information (SISTI)<sup>1</sup>.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Design Principles.</title>
        <p>
          VIVO+ was designed with the following design principles
in mind, following [
          <xref ref-type="bibr" rid="ref5">6</xref>
          ]. Clarity is provided by labels and
comments in both English and Russian for all described
classes and properties. URIs have speaking names. Class and
property definitions are formally stated using consistent OWL
restrictions as well as domain and range definitions, which
ensures the coherence of datasets that conform to VIVO+.
Future extensibility of the ontology and related datasets is
provided by using established vocabularies where
fitting, along with general definitions that are not restricted
to the presented use cases. Using VIVO+ on top of VIVO
requires minimal ontological commitment, as it merely
extends the latter ontology. Encoding bias was minimized
by designing the ontology according to the presented
use cases.
        </p>
        <p><sup>1</sup> Translated; in the Russian original: Государственный
рубрикатор научно-технической информации (ГРНТИ), see
http://grnti.ru/</p>
      </sec>
      <sec id="sec-2-4">
        <title>Peculiarities of the Russian Education System.</title>
        <p>We aim to provide an ontology that is sufficiently generic
to be usable internationally. However, different countries have different
educational systems, and VIVO is modelled
according to the system of the USA. Adapting VIVO to ITMO
University poses three challenges: (1) identifying the
peculiarities of the Russian system, (2) modelling the resulting
additional concepts and (3) appropriately linking them to
existing concepts. One difference between Russia and most
of the world is that it has two different doctoral degrees: the
Candidate of Sciences (кандидат наук, kandidat nauk) and
the Doctor of Sciences (доктор наук, doktor nauk). The
Candidate of Sciences is equivalent to the PhD, while the
Doctor of Sciences can be earned after a period of further
study following the award of the Candidate of Sciences
degree, which requires five to fifteen years beyond that
award. VIVO+ contains classes for both
degrees and relates them to existing concepts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. KORE: AUTOMATIZED MAPPING FRAMEWORK</title>
      <p>In this section, we describe the transformation of the
university data from an RDBMS to RDF modelled according to
the VIVO+ ontology. To perform such a transformation, we
developed the kOre<sup>2</sup> framework on top of the Sparqlify
SPARQL-SQL rewriter<sup>3</sup>, depicted in Figure 1. The framework is
evaluated using the ITMO University infrastructure (i.e. an Oracle
RDBMS) in Section 4.</p>
      <p>The main components of the kOre framework are:</p>
      <list list-type="bullet">
        <list-item><p>The Mappings Repository contains mappings in
SML (Sparqlify Mapping Language). The mappings are
used by the Sparqlify SPARQL-SQL rewriter to
convert the data from the RDBMS to RDF. They have
to be maintained and updated by the framework
maintainer in order to reflect schema changes in the university RDBMS
and to cover new data inside the RDBMS.</p></list-item>
        <list-item><p>The Sparqlify SPARQL-SQL rewriter establishes
the connection to the RDBMS and converts the data to
RDF using mappings from the Mappings
Repository.</p></list-item>
        <list-item><p>The Triple store stores RDF data and executes queries
over it.</p></list-item>
        <list-item><p>The SPARQL Endpoint publishes RDF data on the
Web, thus enabling various web applications which are
capable of using the W3C Semantic Web standards.</p></list-item>
      </list>
      <p>We define the interactions between components with four basic
operations:</p>
      <list list-type="order">
        <list-item><p>Poll. As a university RDBMS is a live system
updated with new information several times a day, the
Sparqlify SPARQL-SQL rewriter polls the Mappings
Repository periodically to check for changes and
perform transformations.</p></list-item>
        <list-item><p>Push. After a successful transformation of the data from
the RDBMS, the data has to be pushed to the Triple
store. The Push operation adds, deletes and updates RDF
data inside the Triple store.</p></list-item>
        <list-item><p>Publish. The Triple store publishes the data by
providing a SPARQL Endpoint, thereby making it
accessible on the Web.</p></list-item>
        <list-item><p>Consume. Web applications can consume RDF via the
SPARQL Endpoint, which provides a set of
interfaces (e.g. an RDF/JSON interface).</p></list-item>
      </list>
      <p><sup>2</sup> kOre: Using Linked Data for OpenScience Information
Integration.</p>
      <p><sup>3</sup> http://aksw.org/Projects/Sparqlify.html</p>
      <p>[Figure 1: the kOre framework. Components: Mappings Repository, Sparqlify, Triplestore, SPARQL Endpoint, LabMap, third-party apps. Operations: poll, transform, push, publish, consume. Inset: Mapping Example.]</p>
      <p>Mapping Example:</p>
      <preformat>PREFIX ifmolod:&lt;http://lod.ifmo.ru/&gt;
PREFIX vivoplus:&lt;http://vivoplus.aksw.org/ontology#&gt;
PREFIX vivo:&lt;http://vivoweb.org/ontology/core#&gt;
PREFIX rdfs:&lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX xsd:&lt;http://www.w3.org/2001/XMLSchema#&gt;
CREATE VIEW LaboratoryResearchFields AS CONSTRUCT {
  ?laboratory vivoplus:locatedIn ?country .
}
WITH
  ?laboratory = uri(concat("http://lod.ifmo.ru/Laboratory", ?NET_DEP_ID))
  ?country = uri(?URI)
FROM
  [[ SELECT
       NET_DEP_ID,
       URI
     FROM sem_country_info, sem_country_lgd
     WHERE
       country_id=id ]]</preformat>
      <p>The presented framework provides a persistent layer on
top of the RDBMS infrastructure. Schema changes of the underlying
RDBMS are absorbed by the SML mappings, so the publicly
available data provided by the framework stays consistent over
time. Data consumers can therefore rely on the data schema provided
by the kOre framework and save the development time otherwise required
for handling schema changes inside their applications.</p>
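      <p>The transform and push steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the kOre or Sparqlify implementation: rows_to_triples mimics the URI construction of the mapping example (a concatenated laboratory URI and a country URI taken from the URI column), and plan_push models an update as one deletion plus one addition by comparing the triples currently stored with the freshly generated ones. Both function names, the sample rows and the example.org country URIs are hypothetical.</p>
      <preformat><![CDATA[
```python
LAB_PREFIX = "http://lod.ifmo.ru/Laboratory"
LOCATED_IN = "http://vivoplus.aksw.org/ontology#locatedIn"


def rows_to_triples(rows):
    """Mimic the view: laboratory = uri(concat(prefix, NET_DEP_ID)), country = uri(URI)."""
    return {
        (LAB_PREFIX + str(row["NET_DEP_ID"]), LOCATED_IN, row["URI"])
        for row in rows
    }


def plan_push(stored, fresh):
    """Split the difference between stored and fresh triples into additions and deletions."""
    return fresh - stored, stored - fresh


# Hypothetical rows; the country URIs are placeholders, not real dataset values.
old_rows = [{"NET_DEP_ID": 1, "URI": "http://example.org/country/A"}]
new_rows = [
    {"NET_DEP_ID": 1, "URI": "http://example.org/country/B"},
    {"NET_DEP_ID": 2, "URI": "http://example.org/country/A"},
]

to_add, to_delete = plan_push(rows_to_triples(old_rows), rows_to_triples(new_rows))
print(len(to_add), "triples to add,", len(to_delete), "to delete")
```
]]></preformat>
      <p>Because consumers only ever see the generated triples, a renamed column in the RDBMS would change the row keys read by the mapping step but not the URIs and predicates that reach the Triple store.</p>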
    </sec>
    <sec id="sec-6">
      <title>4. MAPPING FRAMEWORK DEPLOYMENT AT ITMO</title>
      <p>We implemented and deployed the kOre framework using the
infrastructure of ITMO University (Saint Petersburg State University of
Information Technologies, Mechanics and Optics).<sup>4</sup>
ITMO University uses an Oracle RDBMS to store
data about laboratories, people and publications, among others.
The database is only accessible from the university network,
so a VPN connection is necessary.</p>
      <p>For deployment we set up an Ubuntu 14.04 server with
6 GB of RAM and 35 GB of disk space. The Mappings
Repository on the deployed framework corresponds to a
system folder. Sparqlify is installed for all users and is available
through a command-line interface. As a Triple store we use
Virtuoso Open Source, which provides a SPARQL Endpoint
with an RDF/JSON interface out of the box. The poll and push
operations are configured using shell scripts and scheduled with
crontab. The publish operation is performed by Virtuoso Open
Source internally. The consume operation is possible through
the SPARQL Endpoint using POST requests.<sup>5</sup></p>
      <p><sup>4</sup> The implementation is freely available on GitHub:
https://github.com/AKSW/itmolod</p>
      <p><sup>5</sup> The SPARQL endpoint for the ITMO LOD project is located
at http://lod.ifmo.ru/sparql</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <tbody>
            <tr><td>URI</td><td>http://vivoplus.aksw.org/ontology#</td></tr>
            <tr><td>Version Date</td><td>2014-11-01</td></tr>
            <tr><td>Version Number</td><td>1.0</td></tr>
            <tr><td>License</td><td>PDDL 1.0<sup>7</sup></td></tr>
            <tr><td>Triples</td><td>15 572</td></tr>
            <tr><td>SPARQL Endpoint</td><td>http://lod.ifmo.ru/sparql</td></tr>
            <tr><td>N-Triples Dump</td><td>http://lod.ifmo.ru/data.itmo.nt</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>At the time of writing, the ITMO LOD dataset contains
information about 43 laboratories, 188 research areas and
1001 persons. The dump of the dataset is available online<sup>8</sup>.
Table 1 provides statistics and links to services such as the
SPARQL endpoint. The dataset is openly licensed under the
PDDL 1.0, in accordance with the Open Definition<sup>9</sup>.</p>
    </sec>
    <sec id="sec-7">
      <title>5. APPLICATIONS</title>
      <p>In this section we describe how the consume operation can
be utilized by developers and end-users.</p>
      <p>For end-users we published the data from the ITMO LOD
SPARQL endpoint under the lod.ifmo.ru domain using the Pubby<sup>10</sup>
Linked Data interface. Pubby makes dataset URIs
dereferenceable and enables navigation between linked entities
inside the dataset with an easy-to-use web interface. For
example, Figure 2 shows that a user is able to
navigate through persons and research areas for a particular
laboratory.</p>
      <p>Developers can consume the data through the SPARQL
endpoint. We showcase how the ITMO LOD dataset can be
utilized in this way by implementing the LabMap application<sup>11</sup>
(the code is published in the GitHub repository), which shows
the laboratories of ITMO University by collaborating countries.
The user is able to see the list of laboratories per country as well
as the persons working in a particular laboratory (see Figure 2).</p>
      <p><sup>8</sup> http://lod.ifmo.ru/data/itmo.nt</p>
      <p><sup>9</sup> http://opendefinition.org/</p>
      <p><sup>10</sup> http://wifo5-03.informatik.uni-mannheim.de/pubby/</p>
      <p><sup>11</sup> http://lod.ifmo.ru/usecases/heatmap.html</p>
      <p>[Figure 2: Laboratory View (http://lod.ifmo.ru/page/Laboratory87847; vivoplus:locatedIn, vivoplus:hasResearchArea), Person View (http://lod.ifmo.ru/page/Person100628; vivo:affiliatedOrganization, foaf:firstName, foaf:lastName) and Research Area View (vivoplus:hasResearchArea), connected by links.]</p>
      <p>[3] VIVO-ISF Ontology. https://wiki.duraspace.org/display/VIVO/VIVO-ISF+Ontology,
accessed: 11-06-2015</p>
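      <p>A developer-side consume step could look roughly like the following Python sketch. It is not the LabMap code: it only prepares a form-encoded POST request for the endpoint http://lod.ifmo.ru/sparql and groups laboratories by country, the kind of aggregation a per-country view needs. The sketch assumes that the endpoint answers with the standard SPARQL JSON results format when asked for it via the Accept header; the helper names build_request and group_by_country, the hand-written sample response and its example.org country URI are hypothetical.</p>
      <preformat><![CDATA[
```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

ENDPOINT = "http://lod.ifmo.ru/sparql"

QUERY = """
PREFIX vivoplus: <http://vivoplus.aksw.org/ontology#>
SELECT ?laboratory ?country WHERE {
  ?laboratory vivoplus:locatedIn ?country .
}
"""


def build_request(endpoint, query):
    """Prepare a SPARQL protocol POST request with a form-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def group_by_country(results):
    """Group laboratory URIs by country URI from a SPARQL JSON results document."""
    grouped = defaultdict(list)
    for binding in results["results"]["bindings"]:
        grouped[binding["country"]["value"]].append(binding["laboratory"]["value"])
    return dict(grouped)


# Hand-written sample response (placeholder country URI), so the sketch runs offline.
SAMPLE = json.loads("""
{"head": {"vars": ["laboratory", "country"]},
 "results": {"bindings": [
   {"laboratory": {"type": "uri", "value": "http://lod.ifmo.ru/Laboratory87847"},
    "country": {"type": "uri", "value": "http://example.org/country/A"}}
 ]}}
""")

request = build_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
print(group_by_country(SAMPLE))

# To query the live endpoint instead of the sample:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```
]]></preformat>
      <p>Keeping the request construction separate from the result handling lets the grouping logic be exercised against a stored response, without network access to the endpoint.</p>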
    </sec>
    <sec id="sec-8">
      <title>6. CONCLUSIONS AND FUTURE WORK</title>
      <p>We implemented the kOre framework, which facilitates
data publishing for universities. To support our framework,
we extended the VIVO ontology, resulting in VIVO+. To
showcase the applicability of the kOre framework, we deployed
a Linked Data interface for end-users and implemented a simple web
application as an example of data consumption for developers.</p>
      <p>Unlocking the hidden metadata of universities is a step
forward in the integration effort between the original Web
of Science and the Data Web of Science. In the future, we
plan to deploy the kOre framework for the University of Leipzig
and to support the ongoing effort of ITMO University to
expose more data publicly. We also plan to involve scientific
personnel and students in the development of interesting
applications using the available data.</p>
    </sec>
    <sec id="sec-9">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>We thank our colleagues at AKSW and
ITMO University who supported the ITMO LOD project:</p>
      <list list-type="bullet">
        <list-item><p>Claus Stadler, for adapting Sparqlify to Oracle SQL at
our request.</p></list-item>
        <list-item><p>Maxim Kolchin, for administering the server
environment.</p></list-item>
        <list-item><p>Denis Varenikov, for helping us gain access to the ITMO
RDBMS and exposing data for us.</p></list-item>
      </list>
      <p>This work was partially financially supported by the Government
of the Russian Federation, Grant 074-U01.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>DBLP publications per year</article-title>
          . http://dblp.uni-trier.de/statistics/publicationsperyear, accessed: 11-06-2015
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>Linked Data Design Issues</article-title>
          . http://www.w3.org/DesignIssues/LinkedData.html, accessed: 11-06-2015
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Gantz</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reinsel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The expanding digital universe: A forecast of worldwide information growth through 2010</article-title>
          . IDC (
          <year>2007</year>
          ), http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velterop</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The anatomy of a nanopublication</article-title>
          .
          <source>Information Services and Use</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Gruber</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          :
          <article-title>Toward principles for the design of ontologies used for knowledge sharing</article-title>
          .
          <source>International Journal of Human-Computer Studies</source>
          <volume>43</volume>
          (
          <issue>5</issue>
          ),
          <fpage>907</fpage>
          -
          <lpage>928</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Larsen</surname>
            ,
            <given-names>P.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>von Ins</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index</article-title>
          .
          <source>Scientometrics</source>
          <volume>84</volume>
          (
          <issue>3</issue>
          ),
          <fpage>575</fpage>
          -
          <lpage>603</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>