<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Graph Database for Persistent Identifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>n Bing</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramin Yahyapour</string-name>
          <email>ramin.yahyapour@gwdg.de</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gesellschaft fur wissenschaftliche Datenverarbeitung mbH Gottingen Am Fa berg 11</institution>
          ,
          <addr-line>37077 Gottingen https://</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L3S Research Center / KBS Group, Leibniz University Hannover, Appelstraße 4</institution>
          ,
          <addr-line>30167 Hannover</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Handle Software manages references to resources of information. However, it does not support a search functionality. A prior implementation with Elasticsearch could not efficiently capture the complex structure of our dataset, especially the relationships between handles. In this paper, we apply a graph database together with Elasticsearch to provide more search capabilities to users. In addition, the graph can efficiently store meta-data provided during handle creation. Further use cases for this graph include redundancy detection (two or more handles pointing to the same URL) and bibliographic network analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Persistent identifier</kwd>
        <kwd>Neo4j</kwd>
        <kwd>Elasticsearch</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Nowadays, people often locate digital objects using Uniform Resource Locators
(URLs). However, URLs tend to break over time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To overcome this
problem, the concept of the Persistent Identifier (PID) was introduced. As the name
suggests, a PID is an identifier that remains valid for a long time. In practice, a PID
is mapped to an up-to-date URL [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        According to the FAIR (Findable, Accessible, Interoperable, Reusable)
principles, data with PIDs and their meta-data are supposed to be findable [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
However, there is currently no efficient tool to find PIDs from their meta-data. In
prior work, a search engine was created using Elasticsearch. Although it solved
the search problem, it did not efficiently capture the complexity of our dataset.
The contribution of this paper is to introduce a graph database as a tool that
can perform advanced searches on PID data, including searches based
on the relationships between digital objects.
      </p>
      <p>The paper is organized as follows. Section 2 discusses the system design. The
system is evaluated in Section 3, and the conclusion is presented in Section 4.</p>
      <p>
        Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>
        Graph databases, a special category of NoSQL databases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], represent
information as nodes and relationships and store data in so-called properties. The
purpose of our system is to employ a graph database to maintain the complex
structure of the handle data and to provide a search function together with
the ability to explore and analyze the data. This requires a database that
is optimized for graph storage and traversal. Therefore, the graph
database Neo4j<sup>4</sup>, which implements the property graph model, was chosen. One
implementation of PIDs is the Handle Software, which is developed by CNRI<sup>5</sup>.
Every handle consists of two parts: its naming authority (known as its prefix)
and a unique local name under that naming authority (known as its suffix). The
main disadvantage of the Handle Software is that it does not provide a search
function. There are no restrictions on the creation of handle values. Hence, when
the system processes a handle value, it does not know the meaning of each data
type due to the lack of standardization. To overcome this problem, our system
uses the schema shown in Figure 1. The solution is to use many small
nodes, each containing only one property, instead of one node with
many properties. In this schema, except for the handle nodes, which are labeled
Handle, the labels of nodes and relationships are the data types of the handle
values, such as URL or Institute. Every node is unique: handles
that have the same handle values point to the same nodes.
      </p>
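      <p>As an illustration of this schema, the following minimal Python sketch splits a handle into its prefix and suffix and stores every handle value as its own single-property node, so that handles sharing a value point to the same node. It is not the implementation described in this paper: an in-memory set stands in for Neo4j, and the names split_handle and SchemaGraph as well as the example values are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Set, Tuple


def split_handle(handle: str) -> Tuple[str, str]:
    """Split a handle into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a valid handle: {handle!r}")
    return prefix, suffix


class SchemaGraph:
    """Hypothetical in-memory stand-in for the graph database."""

    def __init__(self) -> None:
        # A node is (label, value); storing nodes in a set keeps them unique.
        self.nodes: Set[Tuple[str, str]] = set()
        # An edge is (handle, relationship label, target node).
        self.edges: Set[Tuple[str, str, Tuple[str, str]]] = set()

    def add_handle(self, handle: str, values: Dict[str, str]) -> None:
        split_handle(handle)  # reject strings without prefix/suffix
        self.nodes.add(("Handle", handle))
        for data_type, value in values.items():
            # One small node per handle value; the data type is used as
            # the label of both the node and the relationship.
            node = (data_type, value)
            self.nodes.add(node)
            self.edges.add((handle, data_type, node))


graph = SchemaGraph()
graph.add_handle("10.123/456", {"URL": "http://www.google.com", "INST": "GWDG"})
graph.add_handle("10.123/789", {"URL": "http://www.google.com"})
print(split_handle("10.123/456"))
print(len(graph.nodes), len(graph.edges))
```
]]></preformat>
      <p>In this sketch both handles share the single URL node, which is the property that makes redundancy detection a matter of following relationships back from one value node.</p>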
      <p>Figure 1: Graph schema. A :Handle node (property: handle) is connected by a :Handle_value relationship to a :Handle_value node (property: handle_value).</p>
      <p>Footnotes: 4 https://neo4j.com; 5 https://www.handle.net; 6 http://dtr.pidconsortium.eu</p>
      <p>[…] the handle node 10.123/456 to the handle node 10.123/789. Only if the
handle node 10.123/789 is not found does the system create that node first and
then add a relationship between the two handle nodes. Lastly, there are many
cases where a handle value is not an atomic value but a JavaScript
Object Notation (JSON) object. In a naive approach, the system would create a new node
and put the whole JSON string inside it. However, doing so has disadvantages. First,
because it is just a string, it is very hard to distinguish between the keys and
the values to search on. Second, all the structure inside the JSON as well as the
connections with other nodes are lost. Hence, our system must parse each JSON
object and create an appropriate graph from it. The graph in Figure 2 shows
what our system generates from the example data in Table 1. As can be seen
in the figure, there are two empty nodes in the graph. These empty nodes
result from the JSON parsing process; their purpose is to group
related data together.</p>
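      <p>The JSON handling described above can be sketched as a recursive traversal. The following Python example is an assumption-laden illustration, not the parser used in our system: the function name json_to_graph, the list-based node and edge containers, and the example JSON (which only reuses values visible in Figure 2) are hypothetical. Atomic values become single-property nodes, while nested objects become empty grouping nodes.</p>
      <preformat><![CDATA[
```python
import json
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]
Edge = Tuple[int, str, int]  # (source node id, relationship label, target node id)


def json_to_graph(handle: str, data_type: str, raw_json: str) -> Tuple[List[Node], List[Edge]]:
    """Turn one JSON handle value into nodes and relationships."""
    nodes: List[Node] = [{"id": 0, "label": "Handle", "props": {"Handle": handle}}]
    edges: List[Edge] = []

    def add_node(label: str, props: Dict[str, Any]) -> int:
        node_id = len(nodes)
        nodes.append({"id": node_id, "label": label, "props": props})
        return node_id

    def attach(parent_id: int, key: str, value: Any) -> None:
        if isinstance(value, dict):
            # Nested object: an empty node that groups the related data.
            child = add_node(key, {})
            edges.append((parent_id, key, child))
            for sub_key, sub_value in value.items():
                attach(child, sub_key, sub_value)
        elif isinstance(value, list):
            # Assumption: each list item is attached under the same key.
            for item in value:
                attach(parent_id, key, item)
        else:
            # Atomic value: one node with exactly one property.
            child = add_node(key, {key: value})
            edges.append((parent_id, key, child))

    attach(0, data_type, json.loads(raw_json))
    return nodes, edges


# Hypothetical handle value of data type "Creator".
raw = (
    '{"first_name": "Triet", "last_name": "Doan", "INST": "GWDG", '
    '"address": {"city": "Göttingen", "country": "Germany"}}'
)
nodes, edges = json_to_graph("10.123/456", "Creator", raw)
empty_nodes = [n["label"] for n in nodes if not n["props"]]
print(empty_nodes)
```
]]></preformat>
      <p>For this example input, the two empty nodes correspond to the two nested objects (the Creator value itself and its address); the sketch does not deduplicate nodes, which is left to the nodeId mechanism.</p>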
      <p>Figure 2: Graph generated from the example data in Table 1. Recoverable node values: Handle: 10.123/456; Handle: 10.123/789; Name: Triet Doan; INST: GWDG; city: Göttingen; country: Germany; first_name: Triet; last_name: Doan; URL: http://www.google.com. Recoverable relationship labels: Email, URL, Name, INST, address, city, country, first_name, last_name, Creator, isPreviousVersionOf.</p>
      <p>To improve performance, each node has one more property called
nodeId. This property is unique among nodes and is used as the key of a node.
For non-handle nodes, the nodeId is calculated when the node is created by hashing
the value of that node. Because the handle string
is already unique, the nodeId of a handle node is the handle string itself. The
nodeId property is indexed with a unique constraint. While indexing enhances
the performance of READ operations, the other operations (CREATE, UPDATE,
DELETE) are slowed down because the index must be updated. Indeed, our
graph database is under a heavy load of CREATE and DELETE operations.
However, because every node in our graph database is unique, a
single CREATE or DELETE involves many READ operations, which benefit greatly
from the index. We hence observed that indexing leads to a large boost
in the performance of the system (see Section 3).</p>
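      <p>The nodeId mechanism can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code of our system: the hash function (SHA-256), the inclusion of the label in the hashed string, and the names node_id and IndexedStore are hypothetical, and a dictionary stands in for the uniquely indexed nodeId property in Neo4j.</p>
      <preformat><![CDATA[
```python
import hashlib
from typing import Dict, Tuple


def node_id(label: str, value: str) -> str:
    """Return the key of a node."""
    if label == "Handle":
        # A handle string is already unique, so it is used directly.
        return value
    # Assumption: SHA-256 over label and value, joined by a separator
    # character, so equal values under different labels get different keys.
    digest = hashlib.sha256(f"{label}\x1f{value}".encode("utf-8"))
    return digest.hexdigest()


class IndexedStore:
    """Dictionary keyed by nodeId, standing in for a unique index."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Dict[str, str]] = {}

    def merge(self, label: str, value: str) -> Tuple[str, bool]:
        """Create the node only if its nodeId is not present yet."""
        key = node_id(label, value)
        created = key not in self._by_id  # the READ that the index speeds up
        if created:
            self._by_id[key] = {"label": label, "value": value, "nodeId": key}
        return key, created

    def __len__(self) -> int:
        return len(self._by_id)


store = IndexedStore()
first = store.merge("URL", "http://www.google.com")
second = store.merge("URL", "http://www.google.com")
handle = store.merge("Handle", "10.123/456")
print(first[1], second[1], len(store))
```
]]></preformat>
      <p>The uniqueness check in merge is a single key lookup, whereas without a key every CREATE would have to compare the new value against all existing nodes; this is the effect the evaluation measures.</p>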
    </sec>
    <sec id="sec-2">
      <title>Evaluation</title>
      <p>The execution time was measured while the system was running under a heavy-load
scenario. During this time, the system had to retrieve data from two data
sources and build a graph with around 1 million nodes and 2.5 million
relationships. Figure 3 shows the number of handle values processed by the system
per minute. The lower green line shows the throughput when data was
collected without hashing and indexing. As can be seen from the chart, the system
ran quite fast at the beginning, with around 1000 handle values processed per
minute. However, it quickly slowed down over time. The reason for this
performance loss is that whenever a node is created, the system must make sure that
the node is unique. Therefore, the more nodes the graph has, the longer the check
takes. After around 107 hours, which is about 4.5 days, the system had become too
slow: it processed only 130 handle values per minute. This test was stopped
after 119 hours (almost 5 days). If continued, it would have taken around 7 days
to finish. For the second approach, with hashing and indexing, the performance
was greatly improved, as shown by the upper blue line in Figure 3. The system
ran quite stably, with the number of processed handle values fluctuating
between 2000 and more than 3000 per minute. By exploiting the indexing feature,
the performance increased by a factor of 7.</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion and Conclusion</title>
      <p>Figure 3: The comparison of performance between the indexing and non-indexing approaches. Series: with indexing, without indexing; x-axis: hours (0 to 120).</p>
      <p>Our first achievement is an appropriate graph schema for the handle data. This
schema is able to deal with the flexibility in the creation of handle values
while maintaining good system performance. A search engine for
handles is the second achievement. It offers a variety of search options from
Elasticsearch and the ability to manage relationships between handles from Neo4j.
Basic usage is possible through the Graphical User Interface (GUI), while
a web-based tool is available for more advanced purposes, such as analyses
that discover hidden knowledge inside the graph. A topic for
future work is the interoperability of the system: the graph database
can be enriched by importing data from other platforms, such as DOI, ARK,
ISBN, or ORCID.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hakala</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Persistent identifiers – an overview</article-title>
          .
          <source>KIM Technology Watch Report</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Markwell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Broken links: The ephemeral nature of educational WWW hyperlinks</article-title>
          .
          <source>Journal of Science Education and Technology</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          ),
          <fpage>105</fpage>
          –
          <lpage>108</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wiese</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Advanced data management: for SQL, NoSQL, cloud and distributed databases</article-title>
          . de Gruyter Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aalbersberg</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Appleton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axton</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomberg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boiten</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>da Silva Santos</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourne</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          , et al.:
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          .
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>