<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Redundancy for Pattern-based Relation Instantiation using tOKo</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor de Boer</string-name>
          <email>v.de.boer@cs.vu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bob J. Wielinga</string-name>
          <email>bj.wielinga@few.vu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten W. van Someren</string-name>
          <email>m.w.vansomeren@uva.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anjo A. Anjewierden</string-name>
          <email>a.a.anjewierden@gw.utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Behavioural Science IST, University of Twente</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Web &amp; Media, Vrije Universiteit Amsterdam</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Informatics Institute, Universiteit van Amsterdam</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The Semantic Web calls for semi-automatic methods to
learn, populate and enrich ontologies. In this work, we
present a method for the extraction of domain-specific
relations between instances (relation instantiation). The method
uses hand-crafted extraction patterns which are executed on
a text corpus using the tOKo text analysis tool [Anjewierden,
2006]. Additionally, the extracted candidate relation instances
can be filtered in a post-processing phase by using
domain- and task-specific background knowledge.</p>
      <p>
        The tOKo pattern language allows for patterns that
include references to semantic classes. This makes it possible
to vary the generality of the patterns over a wide range
        <xref ref-type="bibr" rid="ref2">(cf. [Califf and Mooney,
2003])</xref>
        . When very specific patterns are used, we can
expect a high precision but a relatively low recall. If more
general patterns are used, recall is expected to go up. This
will negatively affect precision, but if we exploit the
redundancy of the relation instances in the corpus by putting a
threshold on the frequency of pattern matches, we can
compensate for this loss in precision. Especially for extraction
tasks where the expected recall is very low, boosting the
recall is very beneficial to the overall performance
when measured in terms of the F-measure. In this paper, we
show how exploiting the redundancy in this way improves
the performance of the method. An extended version of this
work can be found in [de Boer, 2010].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK AND METHOD</title>
      <p>We define the task of relation instantiation from a corpus
as follows: given two classes Ci and Cj in a partly
populated ontology, with sets of instances Ii and Ij, and given
a relation R : Ci × Cj, identify for an instance i ∈ Ii all
instances j ∈ Ij such that the relation R(i, j) holds given
the information in the corpus. In this work we will discuss
both the situation where all elements of Ii or Ij are known
and the situation where we discover new instances of
the class Ci or Cj.</p>
      <p>EKAW 2010, 11-15 October 2010, Lisbon, Portugal.</p>
      <p>The tOKo tool and its pattern language. The open
source tool tOKo [Anjewierden, 2006] has a large number
of interactive text analysis and ontology engineering
functionalities that can be accessed through a user interface or
through a Prolog API. The tool also provides a powerful
pattern search functionality. The pattern language includes
'standard' syntactic abstractions such as matches on
exact words, lemmas, word classes, numbers, punctuation,
special characters, etc. tOKo also allows the use of
populated ontology concepts in these patterns (denoted by square
brackets), where all term instances of that class are matched
in a text corpus. For example, the pattern I ate an [fruit]
matches the phrases "I ate an apple" and "I ate an
orange", assuming that the class fruit is populated with these
instances.</p>
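      <p>The class-slot matching described above can be sketched in a few lines of Python. This is an illustrative sketch only, not tOKo's implementation or its pattern syntax: the ONTOLOGY dictionary and the functions compile_pattern and find_instances are hypothetical names, and a regular expression stands in for tOKo's matcher.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical populated ontology: class name -> known instance labels.
ONTOLOGY = {"fruit": ["apple", "orange"]}


def compile_pattern(pattern, ontology):
    """Turn a pattern such as 'I ate an [fruit]' into a compiled regex.

    Each [class] slot becomes a group that matches any known instance
    label of that class; all other text is matched literally.
    """
    parts = []
    # re.split with a capturing group puts the class names at odd indices.
    for index, piece in enumerate(re.split(r"\[([^\]]+)\]", pattern)):
        if index % 2 == 1:
            # Longest labels first, so a short label cannot shadow a longer one.
            labels = sorted(ontology[piece], key=len, reverse=True)
            parts.append("(?:" + "|".join(re.escape(label) for label in labels) + ")")
        else:
            parts.append(re.escape(piece))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def find_instances(pattern, text, ontology):
    """Return the phrases in the text that match the pattern."""
    regex = compile_pattern(pattern, ontology)
    return [match.group(0) for match in regex.finditer(text)]
```
]]></preformat>
      <p>With this toy ontology, calling find_instances on the pattern I ate an [fruit] returns only those phrases whose final word is a known instance of the class fruit.</p>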
      <p>Relation Instantiation using patterns. The input for
the method is a specific relation R and the related concepts
Ci and Cj from the ontology and any instances Ii and Ij
from the knowledge base. In the first step, we create a
corpus for the task using the labels from the concepts and the
relation. These labels are submitted as a query to the Google search engine.
The first N pages are retrieved to form the corpus. On this
corpus, a manually constructed tOKo extraction pattern is
executed. A pattern query consists of three sub-patterns
corresponding to the concept Ci, the relation R and the
concept Cj respectively. The sub-patterns for Ci and Cj are
constructed using tOKo's sub-concept retrieval feature. If
the task also includes populating one of the classes, the
expected word class can be used to match potential candidate
instances. The generality of a relation instantiation pattern
can be adjusted by choosing more general pattern constructs
for the sub-pattern for R (I verb an [fruit] is more general
than I eat an [fruit]).</p>
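      <p>To illustrate how the generality of the sub-pattern for R changes what is matched, the following sketch compares a specific and a more general pattern on three toy sentences. It is written under explicit assumptions: the regular expressions only imitate tOKo patterns, the gap helper assumes that a bounded gap means "up to N arbitrary tokens", the noun slot is a crude stand-in for a word-class match, and all names and sentences are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

GODS = ["Mars", "Venus"]  # hypothetical instances of the subject class
GOD_SLOT = "(?P<subject>" + "|".join(map(re.escape, GODS)) + ")"
NOUN_SLOT = r"(?P<object>[a-z]+)"  # crude stand-in for a noun word-class match


def gap(max_tokens):
    """Match between zero and max_tokens whitespace-separated tokens (assumed gap semantics)."""
    return r"(?:\s+\S+){0," + str(max_tokens) + r"}?"


# Specific: the relation sub-pattern is a fixed phrase.
SPECIFIC = re.compile(GOD_SLOT + r"\s+is\s+the\s+god\s+of\s+" + NOUN_SLOT)
# General: a bounded gap and an alternation replace the fixed phrase.
GENERAL = re.compile(GOD_SLOT + gap(10) + r"\s+(?:god|goddess)\s+of\s+" + NOUN_SLOT)

SENTENCES = [
    "Mars is the god of war",
    "Venus , the Roman goddess of love",
    "Mars was worshipped as a god of war",
]


def extract(regex, sentences):
    """Return a (subject, object) pair for every sentence the regex matches."""
    pairs = []
    for sentence in sentences:
        match = regex.search(sentence)
        if match:
            pairs.append((match.group("subject"), match.group("object")))
    return pairs
```
]]></preformat>
      <p>On these sentences the specific pattern matches only the first one, while the general pattern matches all three and finds the pair (Mars, war) twice, which is the kind of repeated evidence a frequency threshold can use.</p>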
      <p>Next, the specific phrases that are the result of the
Information Extraction phase are converted to RDF triples by
mapping the three different sub-phrases to the
corresponding instances of Ci, R and Cj respectively using the tOKo
API. Synonyms, misspellings and abbreviations are mapped
to single instances. The output is a list of candidate
relation instances ordered by their associated frequencies in the
corpus. In our experiments, we evaluate the performance of
the method by putting a threshold
on the frequency of the candidate relations.</p>
      <p>[Figure: legend entries pattern 1 to pattern 5; axis ticks from 0 to 0.5.]</p>
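      <p>The conversion and thresholding step can be sketched as follows. This is a minimal sketch under stated assumptions, not the tOKo API: the CANONICAL mapping, the relation label is_god_of and both function names are hypothetical, and the threshold is assumed to be inclusive (a candidate is kept when its frequency is at least the threshold).</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical mapping from surface forms (including misspellings) to instances.
CANONICAL = {"mars": "Mars", "marss": "Mars", "venus": "Venus"}


def candidate_relations(matches, relation, canonical):
    """Count how often each (subject, relation, object) triple was extracted."""
    counts = Counter()
    for subject, obj in matches:
        # Map variant surface forms onto a single instance before counting.
        subject = canonical.get(subject.lower(), subject)
        counts[(subject, relation, obj.lower())] += 1
    return counts


def above_threshold(counts, threshold):
    """Keep triples seen at least `threshold` times, most frequent first."""
    return [(triple, n) for triple, n in counts.most_common() if n >= threshold]
```
]]></preformat>
      <p>Because variants are merged before counting, three extractions that differ only in spelling or capitalisation contribute to one candidate with frequency 3 rather than to three candidates with frequency 1.</p>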
      <p>Background knowledge about the classes Ci and Cj and
the relation R can be used to improve the performance of
the method and to reduce unwanted redundancy in the
candidate relation instances. In Section 4 we give an example.</p>
    </sec>
    <sec id="sec-3">
      <title>3. EXP. 1: ROMAN GODS</title>
      <p>For this experiment, we constructed an extremely simple
'ontology' consisting of two classes: gods:Roman God,
populated with 259 instances, and gods:Domain (unpopulated),
with the relation gods:is god of between the two. We
constructed a corpus by retrieving from the web the first 1000
pages returned for the Google query 'Roman +God
+Goddess'. We constructed the following 5 patterns of varying
generality:
1: [Roman god] is the {god|} of ⟨noun⟩
2: [Roman god] the {god|goddess} of ⟨noun⟩
3: [Roman god]{| } ...10 the god|goddess of ⟨noun⟩
4: [Roman god]{| } ...10 god|goddess of ⟨noun⟩
5: [Roman god]{| } ...10 god|goddess ...10 ⟨noun⟩</p>
      <p>The results show the expected tradeoff between precision
and recall depending on the generality of the pattern. To
show the combined performance, we plotted the harmonic
mean of precision and recall, the F-measure, against
the threshold value in Figure 3 for all patterns. This figure
shows that using a general pattern and a threshold on the
frequency is preferable to using specific patterns. This is
the case when a large number of relation instances are to be
found and recall is the main contributor to the F-measure.</p>
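      <p>The evaluation at a given threshold can be sketched as below. The F-measure is the standard harmonic mean of precision and recall; everything else is an assumption made for illustration: the function names are hypothetical, the gold set stands for a manually judged set of correct relation instances, and the threshold is taken to be inclusive.</p>
      <preformat><![CDATA[
```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(counts, gold, threshold):
    """Score the candidates whose frequency reaches the threshold against a gold set.

    counts maps each candidate relation instance to its corpus frequency;
    gold is the set of relation instances judged to be correct.
    """
    accepted = {item for item, n in counts.items() if n >= threshold}
    correct = len(accepted & gold)
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```
]]></preformat>
      <p>Calling evaluate for a range of threshold values on the candidate list of one pattern gives the points of one curve in an F-measure against threshold plot.</p>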
      <p>To test the performance of the method in a second domain
and to show the post-processing step, we attempt a second
relation instantiation task where the goal is to extract
instances of the relation painter born in birthplace. The
subject and object classes were populated with 1808
European painters and 47,000 European birthplaces. Three
patterns of varying generality were constructed:
1: [painter] was (born) in [place]
2: [painter] was (born) {in|at} ...10 [place]
3: [painter] ...10 (born) ...20 [place]</p>
      <p>We manually evaluated the results. Again, more general
patterns lead to higher recall, while more specific patterns
lead to higher precision. In Figure 2 we plot the values for
the F-measure for different threshold values. Here we also
observe that the value of the F-measure for more general
patterns is higher than that of more specific patterns for all
threshold values that are evaluated. Thus we can conclude
that if the harmonic mean is used as an evaluation criterion,
using more general patterns results in better performance.</p>
      <p>[Figure 2: legend entries Pattern 1, Pattern 2, Pattern 3 and Pattern 3 (postprocessed); axis ticks from 0 to 0.05 and from 0 to 50.]</p>
      <p>We also performed a postprocessing step on this data
where we exploit the hierarchical structure of the geographic
places in the TGN (Getty's Thesaurus of Geographic Names). Candidate relation instances that are
hierarchically equivalent are mapped to a single relation, where
occurrence frequencies are summed. Figure 2 also shows
the results of the evaluation of this postprocessed candidate
relation instance set, which shows a significantly higher
F-measure value.</p>
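      <p>One possible reading of this postprocessing step is sketched below. The text does not specify how hierarchical equivalence is decided or which place is kept, so the sketch makes its own assumptions: two candidate places for the same painter are treated as equivalent when one lies above the other in the hierarchy, the more specific place is kept, and ties are broken alphabetically. The PARENT dictionary is a hypothetical toy hierarchy, not TGN data, and all function names are illustrative.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical fragment of a place hierarchy: place -> broader place.
PARENT = {
    "Amsterdam": "Netherlands",
    "Netherlands": "Europe",
    "Paris": "France",
    "France": "Europe",
}


def ancestors(place, parent):
    """Return the chain of broader places above `place`, nearest first."""
    chain = []
    while place in parent:
        place = parent[place]
        chain.append(place)
    return chain


def merge_hierarchical(counts, parent):
    """Fold each (painter, place) count into a more specific candidate place
    for the same painter when one exists, summing the frequencies."""
    merged = Counter()
    for (painter, place), n in counts.items():
        # Candidate places of this painter that lie below `place`.
        narrower = [p for (q, p) in counts if q == painter and place in ancestors(p, parent)]
        # Assumption: keep the deepest place; ties are broken alphabetically.
        target = max(narrower, key=lambda p: (len(ancestors(p, parent)), p), default=place)
        merged[(painter, target)] += n
    return merged
```
]]></preformat>
      <p>In this toy setting, a painter with candidate birthplaces Netherlands (frequency 3) and Amsterdam (frequency 2) ends up with a single candidate Amsterdam of frequency 5, which is more likely to pass a frequency threshold than either candidate alone.</p>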
    </sec>
    <sec id="sec-4">
      <title>5. CONCLUSIONS</title>
      <p>We have shown the working of the various steps of the
extraction method and the performance-boosting effect of
the post-processing step. In both experiments, the values of
the F1-measure are largely determined by the relatively low
recall values. If the corpus is finite and the list of instances
to be found is large enough, this data sparseness will occur
for all patterns. In that case, using more general patterns in
combination with a threshold, thereby exploiting the
redundancy, will have a beneficial influence on the performance.
For relation instantiation tasks, where semi-automatic
methods are most needed due to the large number of target
relation instances, using redundancy will be beneficial.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Anjewierden, 2006]
          <string-name>
            <surname>Anjewierden</surname>,
            <given-names>A.</given-names>
          </string-name>
          (<year>2006</year>).
          <article-title>tOKo and Sigmund: text analysis support for ontology development and social research</article-title>.
          http://www.toko-sigmund.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Califf and Mooney, 2003]
          <string-name>
            <surname>Califf</surname>,
            <given-names>M. E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mooney</surname>,
            <given-names>R. J.</given-names>
          </string-name>
          (<year>2003</year>).
          <article-title>Bottom-up relational learning of pattern matching rules for information extraction</article-title>.
          <source>J. Mach. Learn. Res.</source>,
          <volume>4</volume>:
          <fpage>177</fpage>-<lpage>210</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [de Boer, 2010]
          <string-name>
            <surname>de Boer</surname>,
            <given-names>V.</given-names>
          </string-name>
          (<year>2010</year>).
          <article-title>Ontology Enrichment from Heterogeneous Sources on the Web</article-title>.
          <source>PhD thesis</source>, Universiteit van Amsterdam.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>Getty's Thesaurus of Geographic Names (TGN)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>