Exploiting Redundancy for Pattern-based Relation
                         Instantiation using tOKo

                                  Viktor de Boer                                             Maarten W. van Someren
          Department of Computer Science, Web & Media,                                    Informatics Institute, Universiteit van
           Vrije Universiteit Amsterdam, the Netherlands                                      Amsterdam, the Netherlands
                              v.de.boer@cs.vu.nl                                             m.w.vansomeren@uva.nl
                                Bob J. Wielinga                                                  Anjo A. Anjewierden
          Department of Computer Science, Web & Media,                               Behavioural Science IST, University of Twente,
           Vrije Universiteit Amsterdam, the Netherlands                                           the Netherlands
                           bj.wielinga@few.vu.nl                                         a.a.anjewierden@gw.utwente.nl

1.     INTRODUCTION                                                                  both the situation where all elements of Ii or Ij are known
   The Semantic Web calls for semi-automatic methods to                              as well as the situation where we discover new instances of
learn, populate and enrich ontologies. In this work, we                              the class Ci or Cj .
present a method for the extraction of domain-specific rela-                            The tOKo tool and its pattern language The open
tions between instances (relation instantiation). The method                         source tool tOKo [Anjewierden, 2006] has a large number
uses hand-crafted extraction patterns which are executed on                          of interactive text analysis and ontology engineering func-
a text corpus using the tOKo text analysis tool[Anjewierden,                         tionalities that can be accessed through a user-interface or
2006]. Additionaly, the extracted candidate relation instances                       through a Prolog API. The tool also provides a powerful
can be filtered in a post-processing phase by using domain-                          pattern search functionality. The pattern language includes
and task-specific background knowledge.                                              ’standard’ syntactic abstractions such as matches on ex-
   The tOKo pattern language allows for patterns that in-                            act words, lemma’s, word classes, numbers, punctuations,
clude references to semantic classes. This allows for a wider                        special characters, etc. TOKo also allows the use of popu-
variety of generality of the patterns (cf. [Califf and Mooney,                       lated ontology concepts in these patterns (denoted by square
2003]). When very specific patterns are used, we can ex-                             brackets) where all term instances of that class are matched
pect a high precision but a relatively low recall. If more                           in a text corpus. For example, the pattern I ate an [f ruit]
general patterns are used, recall is expected to go up. This                         matches the phrases ”‘I ate an apple”’ and ”‘I ate an or-
will negatively affect precision, but if we exploit the redun-                       ange”’, assuming that the class fruit is populated with these
dancy of the relation instances in the corpus by putting a                           instances.
threshold on the frequency of pattern matches, we can com-                              Relation Instantiation using patterns. The input for
pensate for this loss in precision. Especially for extraction                        the method is a specific relation R and the related concepts
tasks where the expected recall is very low, boosting the                            Ci and Cj from the ontology and any instances Ii and Ij
recall is very beneficial to increase the overall performance                        from the knowledge base. In the first step, we create a cor-
when measured in terms of the F-measure. In this paper, we                           pus for the task using the labels from the concepts and the
show how exploiting the redundancy in this way improves                              relation. These are presented to the Google search engine.
performance of the method. An extended version of this                               The first N pages are retrieved to form the corpus. On this
work can be found in [de Boer, 2010].                                                corpus, a manually constructed tOKo extraction pattern is
                                                                                     executed. A pattern query consists of three sub-patterns
                                                                                     corresponding to the concept Ci , the relation R and the
2.     TASK AND METHOD                                                               concept Cj respectively. The sub-patterns for Ci and Cj are
   We define the task of relation instantiation from a corpus                        constructed using tOKo’s sub-concept retrieval feature. If
as follows: Given two classes Ci and Cj in a partly popu-                            the task also includes populating one of the classes, the ex-
lated ontology, with sets of instances Ii and Ij and given                           pected word class can be used to match potential candidate
a relation R : Ci × Cj , identify for an instance i ∈ Ii all                         instances. The generality of a relation instantiation pattern
instances j ∈ Ij such that the relation R(i, j) holds given                          can be adjusted by choosing more general pattern constructs
the information in the corpus. In this work we will discuss                          for the subpattern for R (I verb an [f ruit] is more general
                                                                                     than I eat an [f ruit].
                                                                                        Next, the specific phrases that are the result of the Infor-
                                                                                     mation Extraction phase are converted to RDF triples by
Permission to make digital or hard copies of all or part of this work for            mapping the three different sub-phrases to the correspond-
personal or classroom use is granted without fee provided that copies are            ing instances of Ci , R and Cj respectively using the tOKo
not made or distributed for profit or commercial advantage and that copies           API. Synonyms, misspellings and abbreviations are mapped
bear this notice and the full citation on the first page. To copy otherwise, to      to single instances. The output is a list of candidate rela-
republish, to post on servers or to redistribute to lists, requires prior specific   tion instances ordered by their associated frequencies in the
permission and/or a fee.
EKAW 2010 11-15 October 2010 - Lisbon, Portugal
                                                                                     corpus. In our experiments, we evaluate the performance of
.                                                                                    the method for various experiments by putting a threshold
 0.5

0.45                                                                    0.05

 0.4
                                                                        0.04
0.35

 0.3
                                                                        0.03
0.25                                           pattern 1                                                       Pattern 1
                                               pattern 2
 0.2                                           pattern 3                                                       Pattern 2
                                               pattern 4                0.02
                                                                                                               Pattern 3
0.15                                           pattern 5
                                                                                                               Pattern 3 (post-
 0.1                                                                    0.01                                   processed)

0.05

     0                                                                    0
         0        10      20      30      40         50    60                  0   10      20       30        40              50


Figure 1: F-measure values for the five patterns for                Figure 2: F-measure values threshold values for Ex-
Experiment 1.                                                       periment 2, including post-processed results


on the frequency of the candidate relations.                        lead to higher precision. In Figure 2 we plot the values for
  Background knowledge about the classes Ci and Cj and              the F-measure for different threshold values. We here also
the relation R can be used to improve the performance of            observe that the value of the F-measure for more general
the method and to reduce unwanted redundancy in the can-            patterns is higher than that of more specific patterns for all
didate relation instances. In Section 4 we give an example.         threshold values that are evaluated. Thus we can conclude
                                                                    that if the harmonic mean is used as an evaluation criterion,
3.           EXP. 1: ROMAN GODS                                     using more general patterns results in a better performance.
  For this experiment, we constructed an extremely simple              We also performed a postprocessing step on this data
’ontology’ consisting of two classes: gods:Roman God pop-           where we exploit the hierarchical structure of the geographic
ulated with 259 instances and gods:Domain (unpopulated),            places in the TGN1 . Candidate relation instances that are hi-
with the relation gods:is god of between the two. We con-           erachically equivalent are mapped to a single relation, where
structed a corpus by extracting from the web the first 1000         occurrence frequencies are summed. Figure 2 also shows
pages resulting from the google query ’Roman +God +God-             the results of the evaluation this postprocessed candidate
dess’. We constructed the following 5 patterns of varying           relation instance set, which shows a significantly higher F-
generality:                                                         measure value.
             1:   [Roman god] is the {god|} of hnouni
             2:   [Roman god] ∗ the {god|goddess} of hnouni         5.         CONCLUSIONS
             3:   [Roman god]{| } ...10 the god|goddess of hnouni      We have shown the working of the various steps of the
             4:   [Roman god]{| } ...10 god|goddess of hnouni
             5:   [Roman god]{| } ...10 god|goddess ...10 hnouni    extraction method and the performance-boosting effect of
                                                                    the post-processing step. In both experiments, the values of
   The results show the expected tradeoff between precision         the F1-measures are largely determined by the relatively low
and recall depending on the generality of the pattern. To           recall values. If the corpus is finite and the list of instances
show the combined performance, we plotted the harmonic              to be found is large enough this data sparseness will occur
mean of both precision and recall, the F-measure against            for all patterns. In that case, using more general patterns in
the threshold value in Figure 3 for all patterns. This figure       combination with a threshold, thereby exploiting the redun-
shows that using a general pattern and a threshold on the           dancy will have a beneficial influence on the performance.
frequency is preferable to using specific patterns. This is         For relation instantiation tasks, where semi-automatic meth-
the case when a large number of relation instances are to be        ods are most needed due to the large number of target rela-
found and recall is the main contributor to the F-measure.          tion instances, using redundancy will be beneficial.

4.           EXP. 2: ARTISTS’ BIRTH PLACES                          6.         REFERENCES
  To test the performance of the method in a second domain          [Anjewierden, 2006] Anjewierden, A. (2006). toko and
and to show the post-processing step, we attempt a second             sigmund: text analysis support for ontology development
relation instantiation task where the goal is to extract in-          and social research. http://www.toko-sigmund.org.
stances of the relation painter born in birthplace the
                                                                    [Califf and Mooney, 2003] Califf, M. E. and Mooney, R. J.
subject and object classes were populated with 1808 Eu-
                                                                      (2003). Bottom-up relational learning of pattern
ropean painters and 47.000 European birthplaces. Three
                                                                      matching rules for information extraction. J. Mach.
patterns of varying generality were constructed:
                                                                      Learn. Res., 4:177–210.
                  1:   [painter] was (born) in [place]              [de Boer, 2010] de Boer, V. (2010). Ontology Enrichment
                  2:   [painter] was (born) {in|at} ...10 [place]
                  3:   [painter] ...10 (born) ...20 [place]
                                                                      from Heterogeneous Sources on the Web. PhD thesis,
                                                                      Universiteit van Amsterdam.
  We manually evaluated the results. Again, more general
                                                                    1
patterns lead to higher recall, while more specific patterns            Getty’s Thesaurus of Geographic Names