<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using the semantic web for author disambiguation - are we there yet?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cornelia Hedeler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <email>bijan.parsiag@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brigitte Mathiak</string-name>
          <email>brigitte.mathiak@gesis.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Unter Sachsenhausen 6-8, 50667 Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, The University of Manchester</institution>
          ,
          <addr-line>Oxford Road, M13 9PL Manchester</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The quality, and therefore the usability and reliability, of data in digital libraries depends on author disambiguation, i.e., the correct assignment of publications to a particular person. Author disambiguation aims to resolve name ambiguity, i.e., synonyms (the same author publishing under different names) and polysemes (different authors with the same name), and to assign publications to the correct person. However, author disambiguation is difficult given that the information available in digital libraries is sparse and, when integrated from multiple data sources, contains inconsistencies in representation, e.g., of person names or venue titles. Here we analyse and evaluate the usability of person-centred reference data available as linked data to complement the information present in digital libraries and aid author disambiguation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Users of digital libraries are not only interested in literature related to a
particular topic or research field of interest, but more frequently also in literature
written by a particular author [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, as digital libraries tend to integrate
information from various sources, they suffer from inconsistencies in the
representation of, e.g., author names or venue titles, despite best efforts to maintain
high data quality. For the actual disambiguation process, a wide variety of
additional metadata is used, e.g., journal or conference names, author affiliations,
co-author networks, and keywords or topics [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ]. However, in some digital
libraries the available metadata can be quite sparse, providing too little
information, in too little detail, to disambiguate authors efficiently.
      </p>
      <p>
        To complement the sometimes sparse bibliographic information, a number of
approaches surveyed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] utilise information available elsewhere, e.g., using web
searches, and most of the proposed approaches are evaluated using high-quality gold
standard data sets, such as Google Scholar author profiles. However,
to the best of the authors' knowledge, these high-quality data sets have so far not
been used as part of the disambiguation process itself. Here we analyse
person-centred reference data available on the semantic web and evaluate whether it
contains sufficient detail and content to provide additional information and aid
author disambiguation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data sets</title>
      <sec id="sec-2-1">
        <title>Digital library data sets</title>
        <p>
          In contrast to the wealth of metadata available in some digital libraries, the
records in the two digital library data sets used here offer only limited metadata.
DBLP. Most publication records in the DBLP Computer Science Bibliography
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] consist only of author names, publication titles, and venue information, such
as names of conferences and journals. In addition to the publication records,
DBLP also contains person records, which are created as a result of ongoing efforts
for author disambiguation [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Sowiport. The portal Sowiport (http://sowiport.gesis.org) is provided by GESIS
and contains publication records relevant to the social sciences. Here we focus
only on a subset of just over 500,000 literature entries in Sowiport from three
data sources within GESIS (SOFIS, SOLIS, SSOAR) that have been annotated
with keywords from TheSoz, a German thesaurus for the social sciences.</p>
        <p>So far, no author disambiguation has taken place in these records, and
inconsistencies, in particular in author names, make it hard for users to find all
publications by a particular author. An analysis of the search logs has shown
that the authors most frequently searched for are those with large numbers of
publications, who tend to have entries in DBpedia and GND, motivating the use
of the reference data sources introduced below.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Person-centred reference data</title>
        <p>GND authority file and GND publication information. As the literature
in Sowiport, in particular the subset used here, is heavily biased towards German
literature, we use the Integrated Authority File (GND) of the German-speaking
countries and the bibliographic data offered as part of the linked data service
by the German National Library (http://www.dnb.de/EN/lds). Amongst other
information, which also includes keywords, the GND file contains differentiated
person records, each of which refers to a real person; these are the records used here.</p>
        <p>
          DBpedia [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is available for download (http://wiki.dbpedia.org/Downloads39)
and comes in various data sets containing different kinds of data, amongst them
'Persondata', with information about people, such as their date and place of
birth and death. As the Persondata subset itself does not contain much
additional detail, other data sets are required to obtain information useful for author
disambiguation. The data is available either as raw infobox data or as cleaned
mapping-based data, which we use here.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach for author disambiguation</title>
      <p>Our approach for author disambiguation should be seen as preliminary, as the main
focus of this work was to evaluate whether there is sufficient information available
in such reference data sets to make this a viable approach. It uses a domain-specific
heuristic as similarity function, and the reference data sets introduced
above as additional (web information) evidence. To limit the number of records
that need to be compared in detail, we use an index on the author/person names</p>
      <p>in GND, and DBpedia. (ii) A random subset of 250 computer scientists from
person records in DBLP with a link to the corresponding Wikipedia page, and
run the part of the author disambiguation algorithm that identifies the GND and
DBpedia entries of an author of a publication record in Sowiport and DBLP. The
precision, ranging between 0.7 and 1, is encouraging (in detail: social scientists
with entry in German DBpedia: 0.97; with entry in English DBpedia: 1; with
entry in GND: 0.92; computer scientists with entry in DBpedia: 0.89, taking into
account the language of the false positives; with entry in GND: 0.7). However,
the data set used here is fairly small and does not contain many people with
common names, who contribute the majority of the false positives.</p>
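      <p>As a rough illustration of how an index on author/person names can limit the number of detailed comparisons, the following Python sketch groups person records under a crude blocking key (lower-cased surname plus first initial, with diacritics stripped). The key, the function names and the record identifiers are assumptions made for this example only; they are not taken from the implementation evaluated here.</p>
      <preformat>
```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Return a crude blocking key: (lower-cased surname, first initial).

    Assumes names are written either 'Surname, Given' or 'Given Surname'.
    """
    # Strip diacritics so that e.g. 'Müller' and 'Muller' share a key.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    if "," in plain:
        surname, _, given = plain.partition(",")
    else:
        given, _, surname = plain.strip().rpartition(" ")
    return (surname.strip().lower(), given.strip().lower()[:1])


def build_index(person_records):
    """Map each blocking key to the ids of the person records sharing it.

    person_records is an iterable of (record_id, name) pairs; the ids are
    hypothetical placeholders, not real GND or DBpedia identifiers.
    """
    index = defaultdict(list)
    for record_id, name in person_records:
        index[name_key(name)].append(record_id)
    return index


def candidates(index, author_name):
    """Return the person records worth comparing in detail with this author."""
    return index.get(name_key(author_name), [])
```
      </preformat>
      <p>With the hypothetical records ("ref:1", "Müller, Hans") and ("ref:2", "Hans Muller"), a lookup for "H. Müller" returns both identifiers as candidates; only these would then be passed on to the more expensive similarity function.</p>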
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>The analysis and evaluation of DBpedia and GND has shown that the
semantic markup of the information in DBpedia is still lacking in various aspects.
How much of an issue this lack of appropriately detailed information and of
completeness causes for a given task depends not only on the corresponding
subset of the reference data and its properties, but also on the remainder of the
reference data set and on the digital library data set. This suggests that a
quality measure that assesses the suitability of a reference data set for author
disambiguation should take into account the following: (i) tuple completeness,
(ii) the specificity of the annotation with ontologies, (iii) how much of the
information is provided in the form of ontologies, thesauri or, even worse, literal strings,
which provides an indication of the expected heterogeneity of the information
across different data sets, and (iv) the number of people in the reference data
set who share their names.</p>
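      <p>Two of these indicators, tuple completeness (i) and the share of people with a shared name (iv), are simple enough to sketch in Python. The record layout (one dictionary per person) and the field names are assumptions made for this illustration; indicators (ii) and (iii) depend on the ontologies in use and are not covered.</p>
      <preformat>
```python
from collections import Counter


def tuple_completeness(records, fields):
    """Fraction of (record, field) cells that hold a non-empty value."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(1 for rec in records for field in fields if rec.get(field))
    return filled / total


def shared_name_fraction(records, name_field="name"):
    """Fraction of person records whose name also occurs on another record."""
    if not records:
        return 0.0
    counts = Counter(rec.get(name_field) for rec in records)
    shared = sum(1 for rec in records if counts[rec.get(name_field)] > 1)
    return shared / len(records)
```
      </preformat>
      <p>Both functions return a value between 0 and 1, so they could be reported side by side for each reference data set; how they should be weighted against each other is left open here.</p>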
      <p>To put this into context with the digital library data set, one could also
determine whether, and how many of, the author names are shared with several
person records in the reference data set. In these cases in particular, sufficiently
detailed information is vital in order to identify the correct person
record or to determine that there is no person record available for that particular
person, even though there are plenty of records for people with the same name.</p>
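      <p>Such a count could be obtained along the following lines. The Python sketch assumes that author names and reference person names have already been normalised to comparable strings and are matched exactly; the function name and the three categories are illustrative choices.</p>
      <preformat>
```python
from collections import Counter


def ambiguity_profile(author_names, reference_names):
    """Classify each distinct author name by its number of reference records.

    'no_match'  : no person record carries this name
    'unique'    : exactly one person record carries this name
    'ambiguous' : several person records carry this name
    """
    ref_counts = Counter(reference_names)
    profile = {"no_match": 0, "unique": 0, "ambiguous": 0}
    for name in set(author_names):
        matches = ref_counts.get(name, 0)
        if matches == 0:
            profile["no_match"] += 1
        elif matches == 1:
            profile["unique"] += 1
        else:
            profile["ambiguous"] += 1
    return profile
```
      </preformat>
      <p>The 'ambiguous' count indicates for how many author names the reference data must supply enough additional detail to pick the right person record, or to conclude that none of the records fits.</p>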
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ferreira</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goncalves</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laender</surname>
            ,
            <given-names>A.H.F.</given-names>
          </string-name>
          :
          <article-title>A brief survey of automatic methods for author name disambiguation</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Herskovic</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanaka</surname>
            ,
            <given-names>L.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstam</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          :
          <article-title>A day in the life of PubMed: analysis of a typical day's query log</article-title>
          .
          <source>Journal of the American Medical Informatics Association : JAMIA</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <fpage>212</fpage>
          –
          <lpage>220</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Kleef</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>DBLP: some lessons learned</article-title>
          .
          <source>In: VLDB'09</source>
          . pp.
          <fpage>1493</fpage>
          –
          <lpage>1500</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Reuther</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klink</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Managing the Quality of Person Names in DBLP</article-title>
          .
          <source>In: ECDL'06</source>
          . pp.
          <fpage>508</fpage>
          –
          <lpage>511</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Smalheiser</surname>
            ,
            <given-names>N.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torvik</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>Author name disambiguation</article-title>
          .
          <source>Annual Review of Information Science and Technology</source>
          <volume>43</volume>
          (
          <issue>1</issue>
          )
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>