<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain-aware Matching of Events to DBpedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kristian Slabbekoorn</string-name>
          <email>k.slabbekoorn@student.tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Hollink</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Geert-Jan Houben</string-name>
          <email>g.j.p.m.houben@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Web Information Systems Group Delft University of Technology</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our work on the enrichment of the EventMedia dataset as provided by the DeRiVE data challenge with links to DBpedia. Our main contribution is an exploration into the use of domain knowledge in the matching process. As a starting point we take DBpedia Spotlight, an off-the-shelf tool for matching textual resources to DBpedia. We present a bootstrap method to automatically derive the needed domain knowledge from an initial set of high-confidence matches, and compare this to a baseline method without any domain knowledge, and an 'oracle' method with perfect domain knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper, we present our work on the enrichment of the EventMedia dataset
as provided by the DeRiVE data challenge with links to DBpedia. Tools and
algorithms have emerged that automate the task of matching two ontologies or
datasets. Most of these systems use string similarity measures and/or structural
measures to determine the similarity between a pair of resources. However, little
is known about how one can incorporate knowledge of the data's domain into the
matching process. In our case, we know that the EventMedia dataset is about events,
performing artists and venues. The main contribution of this paper is an exploration
into the use of this knowledge of the domain to produce better or more matches.
In addition, the resulting matches are made publicly available for download.</p>
      <p>
        As a starting point we take DBpedia Spotlight, an off-the-shelf tool for
matching textual resources to DBpedia. DBpedia Spotlight has been shown to be able
to compete with established annotation systems while remaining largely
configurable [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This configurability allows us to include various forms of domain
knowledge and test their effect on the resulting matches. It also means, however,
that we have to choose values for a relatively large number of parameters that
potentially influence the results. To limit this effect, we set the parameters
systematically and transparently, as described in section 2.1.
      </p>
      <p>We present a bootstrap method to automatically derive the needed domain
knowledge from an initial set of high-confidence matches, and compare this to
a baseline method without any domain knowledge, and an 'oracle' method with
perfect domain knowledge. To explore the generalizability of the derived domain
knowledge, we perform an evaluation in which we derive the domain knowledge
from one dataset and use it to find matches in another dataset.</p>
      <p>
        Several bootstrapping methods to derive links between Linking Open Data
(LOD) datasets have been proposed previously. Kobilarov et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] match concepts by finding
candidates in DBpedia, then comparing classifications of their own concepts to
the classes and categories of the DBpedia candidate concepts. In our case, we
do not assume that a classification of the source data is available. BLOOMS+
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] uses the Wikipedia category hierarchy to bootstrap the process of finding
schema-level links between LOD datasets. We exploit the Wikipedia category
hierarchy in a similar fashion: not to find matches directly, but to find categories
(and classes) that effectively describe our domain of interest.
      </p>
      <sec id="sec-1-1">
        <title>Dataset and Reference Alignment</title>
        <p>All experiments are performed on the EventMedia dataset provided as part of
the DeRiVE data challenge, containing RDF statements about events, artists
and venues from the websites Last.fm, Eventful.com and Upcoming.yahoo.com.
We have chosen to focus on matching artists, as they are more likely than venues
and events to have pages dedicated to them on Wikipedia: pages can
be found for roughly 35% of the artists contained in the Last.fm dataset, and
45% of the artists contained in the Eventful dataset. Upcoming does not
contain explicit mentions of artists. We evaluate our approach by comparison to a
manually composed reference alignment of 1500 randomly picked artists (1000
from Last.fm and 500 from Eventful.com) to DBpedia resources.</p>
      </sec>
      <sec id="sec-1-2">
        <title>DBpedia Spotlight</title>
        <p>
          Throughout this work we have used DBpedia Spotlight [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a powerful tool for
automatically annotating natural language texts with links to DBpedia resources.
It does so by first finding surface forms in the text that could be mentions of
DBpedia resources (the 'spotting' function), then disambiguating to link to the
right DBpedia resources based on context similarity measures (the
'disambiguation' function). Its results can be directed towards high precision or high recall
by setting two parameters: a 'support' threshold for minimum popularity of the
Wikipedia page (i.e. the number of inlinks from other Wikipedia pages) and a
'confidence' threshold for minimum similarity between source text and context
associated with DBpedia surface forms. The latter has been normalized to the
range [0, 1]. In addition, Spotlight's 'black- and whitelists' allow one to filter
the results to exclude/include only members of certain classes and categories
that correspond to the domain of the source text.
        </p>
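        <p>To make the role of these settings concrete, the following is a minimal sketch of how the two thresholds and the black- and whitelists could jointly restrict a set of candidate matches. It does not use or reproduce Spotlight's actual API: the Candidate type, its fields and the function filter_candidates are hypothetical names introduced only for illustration.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A hypothetical candidate match; not a DBpedia Spotlight data type."""
    uri: str
    support: int        # number of inlinks from other Wikipedia pages
    confidence: float   # contextual similarity, normalized to [0, 1]
    types: set = field(default_factory=set)  # classes and categories of the resource


def filter_candidates(candidates, min_support, min_confidence,
                      whitelist=None, blacklist=None):
    """Keep candidates that pass both thresholds and the optional domain filters."""
    kept = []
    for cand in candidates:
        passes_thresholds = (cand.support >= min_support
                             and cand.confidence >= min_confidence)
        if not passes_thresholds:
            continue
        # Whitelist: the resource must belong to one of the listed types.
        if whitelist is not None and not cand.types.intersection(whitelist):
            continue
        # Blacklist: the resource must not belong to any listed type.
        if blacklist is not None and cand.types.intersection(blacklist):
            continue
        kept.append(cand)
    return kept
```

        <p>Raising either threshold trades recall for precision, while a whitelist removes candidates outside the domain regardless of their scores.</p>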
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>In this section we present our approach to domain-aware matching. We compare
our results to a baseline approach, where we run Spotlight without any domain
filters, and an 'oracle' approach, where the optimal classes and categories are
chosen as a domain filter, based on the best matching results in hindsight.</p>
      <p>We do our matching in two passes. In the first pass we attempt to match
the full label. We do this by marking the full rdfs:label of an artist for
disambiguation and appending the dc:description value, if available, as context
for Spotlight's 'disambiguate' function. We attempt to increase the number of
links by running a second pass with Spotlight's 'spotting' function on artists
not matched initially to search for surface forms 'hidden' inside the label (for
instance, some labels include more than one artist).</p>
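      <p>The control flow of the two passes can be sketched as follows. This is only an illustration under stated assumptions: disambiguate and spot_and_disambiguate are hypothetical callables standing in for calls to Spotlight, and the input layout (artist id mapped to a label and an optional description) is assumed rather than taken from the EventMedia schema.</p>

```python
def match_two_pass(artists, disambiguate, spot_and_disambiguate):
    """Link artists to DBpedia URIs in two passes.

    artists: dict mapping an artist id to (label, description); the
        description may be empty.
    disambiguate(label, context): hypothetical pass-1 call that treats the
        full label as one surface form and returns a list of URIs.
    spot_and_disambiguate(label): hypothetical pass-2 call that searches
        for surface forms inside the label and returns a list of URIs.
    """
    links = {}
    for artist_id, (label, description) in artists.items():
        # Pass 1: full label, with the description appended as context.
        context = label + " " + description if description else label
        uris = disambiguate(label, context)
        # Pass 2: only for artists that pass 1 could not match.
        if not uris:
            uris = spot_and_disambiguate(label)
        if uris:
            links[artist_id] = list(uris)
    return links
```

      <p>Because the second pass only sees artists left unmatched by the first, it can add links without overriding a full-label match.</p>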
      <p>Bootstrap approach: deriving domain filters. We bootstrap the selection of
domain filters by first running DBpedia Spotlight without any knowledge of the
domain, with parameters set towards high precision, to obtain an initial set of
links from our data to DBpedia resources. From these resources we gather the
associated DBpedia classes, YAGO classes and Wikipedia categories and use
these as a domain knowledge filter to find further matches.</p>
      <p>To get a set of classes and categories that concisely describes our domain, we
first gather all DBpedia and YAGO classes of the matched DBpedia resources,
including all super-classes up to the root of their respective hierarchies. For
categories, we gather only up to 4 ancestors of each, because the size and
messy structure of the Wikipedia category hierarchy would otherwise quickly make
the set too large and broad to be useful.</p>
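      <p>The gathering step can be sketched as below. The sketch assumes the hierarchy is available as a dictionary from each class or category to its direct parents, and it reads "up to 4 ancestors" as four levels of the hierarchy, which is one possible interpretation. The function names are hypothetical.</p>

```python
from collections import Counter


def ancestors(node, parents, max_depth=None):
    """Collect the ancestors of node level by level.

    parents: dict mapping a class or category to its direct parents.
    max_depth: number of levels to climb; None climbs up to the root.
    """
    seen = set()
    frontier = [node]
    depth = 0
    while frontier and (max_depth is None or max_depth > depth):
        next_frontier = []
        for current in frontier:
            for parent in parents.get(current, ()):
                # The seen-check also guards against cycles in the hierarchy.
                if parent not in seen and parent != node:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
        depth += 1
    return seen


def count_with_ancestors(resource_types, parents, max_depth=None):
    """Count, per class or category, how many matched resources carry it
    directly or through one of its descendants."""
    counts = Counter()
    for types in resource_types:
        expanded = set(types)
        for t in types:
            expanded |= ancestors(t, parents, max_depth)
        counts.update(expanded)
    return counts
```

      <p>Under these assumptions, classes would be counted with max_depth=None and categories with max_depth=4.</p>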
      <p>Second, we select the appropriate classes and categories from this long list
as follows. We count the number of occurrences of each class or category. We
filter out all classes whose occurrence count is less than r% of the total number
of classes found. General (super-)classes occur more frequently than specific
(sub-)classes. Therefore, the higher the value of r, the more general our list of
classes will be. For categories, this effect is less strong since we do not gather
super-categories up to the root. Therefore we simply select the t categories that
occur most often. To avoid too much redundancy in the list of classes and categories, i.e.
to avoid including a super-class plus all its sub-classes, we filter out all
super-classes for which the summed occurrence count of their sub-classes is more
than 90% of the occurrence count of the super-class. The same procedure is
followed for categories. The resulting list of DBpedia classes, YAGO classes and
categories represents our domain filter.</p>
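      <p>A minimal sketch of this selection step is given below. Two details are assumptions made for the sake of a runnable example: the "total number of classes found" is taken to be the sum of all occurrence counts, and only direct sub-classes that survived the frequency filter are summed in the redundancy check. All function names are hypothetical.</p>

```python
from collections import Counter


def select_classes(counts, r):
    """Keep classes whose count is at least r percent of all occurrences."""
    total = sum(counts.values())
    return {cls for cls, n in counts.items() if n * 100 >= r * total}


def top_categories(counts, t):
    """Keep the t most frequent categories (none when t is 0)."""
    return {cat for cat, _ in Counter(counts).most_common(t)}


def drop_redundant_superclasses(selected, counts, children, ratio=0.9):
    """Drop a super-class when its selected direct sub-classes together
    account for more than ratio of its own occurrence count."""
    kept = set(selected)
    for cls in selected:
        subs = [s for s in children.get(cls, ()) if s in selected]
        if subs and sum(counts[s] for s in subs) > ratio * counts[cls]:
            kept.discard(cls)
    return kept


# Toy counts, invented purely for illustration.
counts = Counter({"Person": 100, "Artist": 95, "MusicalArtist": 60, "Band": 2})
children = {"Person": ["Artist"], "Artist": ["MusicalArtist"]}
selected = select_classes(counts, r=1)
domain_filter = drop_redundant_superclasses(selected, counts, children)
```

      <p>In this toy example Band falls below the 1% threshold, and Person is dropped because Artist alone covers more than 90% of its occurrences, leaving Artist and MusicalArtist as the filter.</p>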
      <sec id="sec-2-1">
        <title>Spotlight parameter optimization</title>
        <p>In this paper we assume an application that values precision and recall equally,
and therefore we optimize the parameters for high F-measure (F). To allow a
fair comparison between the three approaches, we set the parameters of each
approach independently to the values that are optimal for that approach.</p>
        <p>For each approach, we need to optimize 'confidence' c and 'support' s for pass
1 and pass 2, resulting in the four parameters c1, c2, s1 and s2. We determine
the best setting of all parameters by evaluating the resulting matches against
our reference alignment. First, we keep s1 fixed at 0 and vary c1 between 0 and
1 in steps of 0.1. Next, we take values around and including the value of c1
that provided the highest F and vary s1 between 0 and 50 in steps of 5. We
take this relatively low range of inlinks due to the nature of our dataset, which
largely consists of obscure entities that are likely not often linked to. We settle
on whichever combination of c1 and s1 gives the best F. To set the parameters
c2 and s2 of the second pass, analogous experiments are performed, this time
varying c2 and s2 respectively.</p>
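        <p>The search procedure for one pass can be sketched as follows. The sketch assumes a hypothetical callable run_matcher(confidence, support) that returns the set of links produced with those settings, takes "values around" the best confidence to mean its immediate neighbours on the 0.1 grid, and uses the balanced F-measure computed over link pairs. None of these names come from Spotlight.</p>

```python
def f_measure(found, reference):
    """Balanced F-measure of the found links against the reference links."""
    found, reference = set(found), set(reference)
    correct = len(found.intersection(reference))
    if correct == 0:
        return 0.0
    precision = correct / len(found)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)


def sweep(run_matcher, reference):
    """Return (confidence, support, F) for the best setting found."""
    confidences = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    supports = range(0, 51, 5)                  # 0, 5, ..., 50

    # Step 1: fix support at 0 and vary the confidence threshold.
    best_c = max(confidences,
                 key=lambda c: f_measure(run_matcher(c, 0), reference))

    # Step 2: vary support for the best confidence and its grid neighbours.
    idx = confidences.index(best_c)
    nearby = confidences[max(idx - 1, 0): idx + 2]
    grid = [(c, s) for c in nearby for s in supports]
    c, s = max(grid,
               key=lambda cs: f_measure(run_matcher(cs[0], cs[1]), reference))
    return c, s, f_measure(run_matcher(c, s), reference)
```

        <p>Ties are resolved in favour of the first setting encountered, which is an arbitrary choice of this sketch. The same routine would be run again for the second pass with c2 and s2.</p>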
        <p>Baseline approach. Figure 1a shows that the highest F-measure (Fmax) is
obtained when parameters are set as follows: c1 = 0.3, s1 = 0, c2 = 0.8 and
s2 = 15.</p>
        <p>Oracle approach. Optimal parameters for this approach are c1 = 0.0, s1 = 0,
c2 = 0.75 and s2 = 0. See figure 1b.</p>
        <p>Bootstrap approach. Our aim is to evaluate our bootstrap approach with
varying numbers of classes and categories to specify the domain. The
optimal parameters for each variant could be different and hence they need to
be determined independently. For space reasons, figure 1c only shows the
parameters for the variant that gave the best results: c1 = 0.0, s1 = 0,
c2 = 0.75 and s2 = 20.
</p>
        <p>Figure 1. (a) The baseline approach, Fmax = 0.811. (b) The oracle approach, Fmax = 0.939. (c) The bootstrap approach (r = 1; t = 0), Fmax = 0.895.</p>
        <p>
Table 1 shows the results of our experiments on the Last.fm artist dataset. We
see that our domain-aware bootstrap method gives results that are better than
the baseline, and close to the 'oracle'. The results are slightly better when we do
not consider categories at all (t = 0). One reason is that it is difficult to
detect and filter out overly general categories (for example, Category:People is
always part of our result). As expected, there is also a trade-off
between precision and recall when we choose r as either 1 or 2. We additionally
run the baseline, oracle and best-performing bootstrap approach on the Eventful
artist dataset and evaluate against a ground truth of 500 artist labels derived
in a similar way to the Last.fm ground truth. These results again show the value
of domain knowledge, with Fmax = 0.726 for the baseline, Fmax = 0.905 for the
oracle, and Fmax = 0.901 for the derived approach.</p>
        <p>The DBpedia links created with the oracle approach for both datasets are
available for download. Also included are interlinks between Last.fm and
Eventful artists that have all of their links in common. For Last.fm, we have 50120
entities in total and end up with 17116 DBpedia links. For Eventful, we have
6540 entities and 2724 links. There are 2450 interlinks between the two datasets.</p>
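        <p>The interlinking rule can be illustrated with a short sketch. It assumes that each dataset's links are held as a dictionary from artist id to a collection of DBpedia URIs, and that "all links in common" means the two link sets are identical and non-empty. The function name is hypothetical.</p>

```python
def interlink(lastfm_links, eventful_links):
    """Pair Last.fm and Eventful artists whose DBpedia link sets are identical."""
    # Index Eventful artists by their (non-empty) set of DBpedia URIs.
    by_links = {}
    for artist, uris in eventful_links.items():
        if uris:
            by_links.setdefault(frozenset(uris), []).append(artist)

    pairs = []
    for artist, uris in lastfm_links.items():
        # Artists without links never match, since empty sets are not indexed.
        for other in by_links.get(frozenset(uris), []):
            pairs.append((artist, other))
    return pairs
```

        <p>Indexing one dataset by link set avoids comparing every Last.fm artist with every Eventful artist.</p>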
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Discussion and Future Work</title>
      <p>In this paper we presented a bootstrapping method to improve the matching
of concepts within a particular domain to DBpedia resources. We found that
our bootstrapping method performs better than general domain-independent
matching, and that the F-measure associated with our best derived model is
consistent across two datasets. It is not yet clear to what extent our proposed method
is applicable to other domains of a different nature. We are currently
exploring how robust our method is against different sets of initial high-confidence
matches, varying the size as well as the domain.</p>
      <p>We found that we often end up with rather generic classes/categories, such
as YAGO class LivingPeople and category People, in our final selections for
a domain filter. Our future work focuses on improving the class and category
selection algorithm in order to filter out these general cases.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia Spotlight: Shedding Light on the Web of Documents</article-title>
          .
          <source>Proc. of the 7th International Conference on Semantic Systems (I-Semantics)</source>
          , Graz, Austria, 7-9 September
          <year>2011</year>
          . (to appear)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raimond</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sizemore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smethurst</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Media Meets Semantic Web - How the BBC Uses DBpedia and Linked Data to Make Connections</article-title>
          .
          <source>Proc. of ESWC</source>
          <year>2009</year>
          , Heraklion, Crete.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jain</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeh</surname>
            <given-names>P. Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verma</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasquez</surname>
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damova</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            <given-names>A. P.</given-names>
          </string-name>
          :
          <article-title>Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with Proton</article-title>
          . In G. Antoniou (Ed.),
          <source>Proc. of ESWC</source>
          <year>2011</year>
          , Heraklion, Crete.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>