
Exploiting Redundancy for Pattern-based Relation Instantiation using tOKo

Viktor de Boer

v.de.boer@cs.vu.nl 1

Bob J. Wielinga

bj.wielinga@few.vu.nl 1

Maarten W. van Someren

m.w.vansomeren@uva.nl 2

Anjo A. Anjewierden

a.a.anjewierden@gw.utwente.nl 0

0 Behavioural Science IST, University of Twente, the Netherlands
1 Department of Computer Science, Web & Media, Vrije Universiteit Amsterdam, the Netherlands
2 Informatics Institute, Universiteit van Amsterdam, the Netherlands

1. INTRODUCTION

The Semantic Web calls for semi-automatic methods to learn, populate and enrich ontologies. In this work, we present a method for the extraction of domain-specific relations between instances (relation instantiation). The method uses hand-crafted extraction patterns which are executed on a text corpus using the tOKo text analysis tool [Anjewierden, 2006]. Additionally, the extracted candidate relation instances can be filtered in a post-processing phase by using domain- and task-specific background knowledge.

The tOKo pattern language allows for patterns that include references to semantic classes. This allows for patterns of widely varying generality (cf. [Califf and Mooney, 2003]). When very specific patterns are used, we can expect a high precision but a relatively low recall. If more general patterns are used, recall is expected to go up. This will negatively affect precision, but if we exploit the redundancy of the relation instances in the corpus by putting a threshold on the frequency of pattern matches, we can compensate for this loss in precision. Especially for extraction tasks where the expected recall is very low, boosting the recall is very beneficial to the overall performance when measured in terms of the F-measure. In this paper, we show how exploiting the redundancy in this way improves the performance of the method. An extended version of this work can be found in [de Boer, 2010].
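For reference, the trade-off described above can be written down with the standard definitions of precision, recall and F-measure. The notation is introduced here for illustration and is not taken from the paper: E_t is the set of candidate relation instances whose match frequency is at least the threshold t, and G is the set of correct relation instances.

```latex
\mathrm{Prec}(t) = \frac{|E_t \cap G|}{|E_t|}, \qquad
\mathrm{Rec}(t) = \frac{|E_t \cap G|}{|G|}, \qquad
F(t) = \frac{2\,\mathrm{Prec}(t)\,\mathrm{Rec}(t)}{\mathrm{Prec}(t) + \mathrm{Rec}(t)}
```

Raising t shrinks E_t, which typically raises precision and lowers recall. A more general pattern enlarges E_t at every threshold.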

2. TASK AND METHOD

We define the task of relation instantiation from a corpus as follows: given two classes Ci and Cj in a partly populated ontology, with sets of instances Ii and Ij, and given a relation R : Ci × Cj, identify for an instance i ∈ Ii all instances j ∈ Ij such that the relation R(i, j) holds given the information in the corpus. In this work we will discuss both the situation where all elements of Ii or Ij are known as well as the situation where we discover new instances of the class Ci or Cj.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

EKAW 2010, 11-15 October 2010, Lisbon, Portugal.

The tOKo tool and its pattern language. The open source tool tOKo [Anjewierden, 2006] has a large number of interactive text analysis and ontology engineering functionalities that can be accessed through a user interface or through a Prolog API. The tool also provides a powerful pattern search functionality. The pattern language includes 'standard' syntactic abstractions such as matches on exact words, lemmas, word classes, numbers, punctuation, special characters, etc. tOKo also allows the use of populated ontology concepts in these patterns (denoted by square brackets), where all term instances of that class are matched in a text corpus. For example, the pattern I ate an [fruit] matches the phrases "I ate an apple" and "I ate an orange", assuming that the class fruit is populated with these instances.
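The class-reference mechanism can be imitated in a few lines of Python. This is only a sketch of the idea, not tOKo's pattern engine or its Prolog API: the `ONTOLOGY` dictionary and the functions `compile_pattern` and `match_instances` are hypothetical names introduced here, and a class reference is simply expanded into a regular-expression alternation over the class instances.

```python
import re

# Hypothetical populated class; in tOKo the instances would come from the ontology.
ONTOLOGY = {"fruit": ["apple", "orange"]}

def compile_pattern(pattern: str, ontology: dict[str, list[str]]) -> re.Pattern:
    """Expand each [class] reference into an alternation over its instances."""
    regex_parts = []
    for part in re.split(r"(\[[^\]]+\])", pattern):
        if part.startswith("[") and part.endswith("]"):
            instances = ontology[part[1:-1]]
            # Longest labels first, so a long label is preferred over its prefix.
            alternatives = sorted(instances, key=len, reverse=True)
            regex_parts.append("(" + "|".join(re.escape(a) for a in alternatives) + ")")
        else:
            regex_parts.append(re.escape(part))  # literal words must match exactly
    return re.compile(r"\b" + "".join(regex_parts) + r"\b", re.IGNORECASE)

def match_instances(pattern: str, text: str, ontology=ONTOLOGY) -> list[str]:
    """Return the instance matched by the first class reference, per match."""
    return [m.group(1).lower() for m in compile_pattern(pattern, ontology).finditer(text)]

text = "Yesterday I ate an apple, and today I ate an orange. I ate a pear."
print(match_instances("I ate an [fruit]", text))
```

The phrase "I ate a pear" is not matched, both because the literal words differ and because pear is not an instance of the class in this toy ontology.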

Relation Instantiation using patterns. The input for the method is a specific relation R and the related concepts Ci and Cj from the ontology, together with any instances Ii and Ij from the knowledge base. In the first step, we create a corpus for the task using the labels of the concepts and the relation. These are presented to the Google search engine, and the first N pages are retrieved to form the corpus. On this corpus, a manually constructed tOKo extraction pattern is executed. A pattern query consists of three sub-patterns corresponding to the concept Ci, the relation R and the concept Cj respectively. The sub-patterns for Ci and Cj are constructed using tOKo's sub-concept retrieval feature. If the task also includes populating one of the classes, the expected word class can be used to match potential candidate instances. The generality of a relation instantiation pattern can be adjusted by choosing more general pattern constructs for the sub-pattern for R (I <verb> an [fruit] is more general than I eat an [fruit]).

Next, the specific phrases that result from the Information Extraction phase are converted to RDF triples by mapping the three different sub-phrases to the corresponding instances of Ci, R and Cj respectively using the tOKo API. Synonyms, misspellings and abbreviations are mapped to single instances. The output is a list of candidate relation instances ordered by their associated frequencies in the corpus. In our experiments, we evaluate the performance of the method by putting a threshold on the frequency of the candidate relations.

[Figure: F-measure against frequency threshold for patterns 1 to 5; y-axis from 0 to 0.5.]
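The normalisation, counting and thresholding steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the `CANONICAL` lookup table, the predicate label `is_god_of` and the example matches are made up, and the real mapping is done through the tOKo API.

```python
from collections import Counter

# Hypothetical table mapping surface forms (synonyms, variants) to one instance label.
CANONICAL = {"mars": "Mars", "mavors": "Mars", "neptune": "Neptune", "neptunus": "Neptune"}

def rank_candidates(matches, canonical=CANONICAL):
    """Map (subject, object) matches to triples and count how often each occurs."""
    counts = Counter()
    for subj, obj in matches:
        subject = canonical.get(subj.lower(), subj)  # unknown forms are kept as they are
        counts[(subject, "is_god_of", obj.lower())] += 1
    return counts.most_common()  # most frequent candidates first

def apply_threshold(ranked, threshold):
    """Keep only candidate triples seen at least `threshold` times."""
    return [triple for triple, freq in ranked if freq >= threshold]

# Made-up pattern matches, for illustration only.
matches = [("Mars", "war"), ("Mavors", "war"), ("Mars", "War"),
           ("Neptune", "sea"), ("Mars", "agriculture")]
ranked = rank_candidates(matches)
print(apply_threshold(ranked, 2))
```

Because the three surface variants of the first candidate collapse onto one triple, it passes a threshold of 2 while the candidates seen only once are dropped.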

Background knowledge about the classes Ci and Cj and the relation R can be used to improve the performance of the method and to reduce unwanted redundancy in the candidate relation instances. In Section 4 we give an example.

3. EXP. 1: ROMAN GODS

For this experiment, we constructed an extremely simple 'ontology' consisting of two classes: gods:Roman God, populated with 259 instances, and gods:Domain (unpopulated), with the relation gods:is god of between the two. We constructed a corpus by extracting from the web the first 1000 pages resulting from the Google query 'Roman +God +Goddess'. We constructed the following 5 patterns of varying generality:

1: [Roman god] is the {god|} of <noun>
2: [Roman god] the {god|goddess} of <noun>
3: [Roman god]{| } ...10 the god|goddess of <noun>
4: [Roman god]{| } ...10 god|goddess of <noun>
5: [Roman god]{| } ...10 god|goddess ...10 <noun>
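To make the difference in generality concrete, the sketch below imitates patterns 2 and 4 with Python regular expressions. Several things here are assumptions: the reading of "...10" as "at most 10 intervening tokens", the use of `\w+` as a stand-in for the noun word class (no part-of-speech tagging is done), the two-instance `GODS` list, and the example sentences. It is not tOKo syntax.

```python
import re

GODS = ["Neptune", "Mars"]  # hypothetical instances of the populated class
GOD = "|".join(re.escape(g) for g in GODS)
GAP10 = r"(?:\W+\w+){0,10}?"  # assumed reading of "...10": at most 10 intervening tokens

# Rough analogue of pattern 2: [Roman god] the {god|goddess} of <noun>
SPECIFIC = re.compile(rf"\b({GOD})\s+the\s+(?:god|goddess)\s+of\s+(\w+)", re.IGNORECASE)
# Rough analogue of pattern 4: [Roman god] ...10 god|goddess of <noun>
GENERAL = re.compile(rf"\b({GOD})\b{GAP10}\W+(?:god|goddess)\s+of\s+(\w+)", re.IGNORECASE)

def extract(regex, text):
    """Return (god, domain) pairs; any single word is accepted as the domain."""
    return [(m.group(1), m.group(2)) for m in regex.finditer(text)]

sentences = "Neptune the god of water. Mars, son of Juno, was the god of war."
print(extract(SPECIFIC, sentences))  # finds only the first sentence
print(extract(GENERAL, sentences))   # finds both sentences
```

The general pattern recovers the second relation instance because it tolerates the intervening words, which is the recall gain the method relies on. The same tolerance also lets it match unrelated text, which is why a frequency threshold is applied afterwards.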

The results show the expected tradeoff between precision and recall depending on the generality of the pattern. To show the combined performance, we plotted the harmonic mean of precision and recall, the F-measure, against the threshold value in Figure 3 for all patterns. This figure shows that using a general pattern and a threshold on the frequency is preferable to using specific patterns. This is the case when a large number of relation instances are to be found and recall is the main contributor to the F-measure.
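A curve of F-measure against threshold, as described above, can be computed along the following lines. This is a minimal sketch with made-up candidates and a made-up gold standard, and the function name `f_measure_by_threshold` is introduced here for illustration. It does not reproduce the paper's data.

```python
def f_measure_by_threshold(candidates, gold, thresholds):
    """candidates: dict mapping a relation instance to its match frequency.
    gold: set of correct relation instances.
    Returns {threshold: (precision, recall, f_measure)}."""
    scores = {}
    for t in thresholds:
        kept = {c for c, freq in candidates.items() if freq >= t}
        correct = len(kept & gold)
        precision = correct / len(kept) if kept else 0.0
        recall = correct / len(gold) if gold else 0.0
        denominator = precision + recall
        f = 2 * precision * recall / denominator if denominator else 0.0
        scores[t] = (precision, recall, f)
    return scores

# Toy data: four extracted candidates, four correct instances, one of them never found.
candidates = {"a": 5, "b": 3, "c": 1, "d": 1}
gold = {"a", "b", "c", "x"}
for t, (p, r, f) in f_measure_by_threshold(candidates, gold, [1, 2, 6]).items():
    print(t, round(p, 3), round(r, 3), round(f, 3))
```

In this toy case the stricter threshold raises precision but costs enough recall to lower the F-measure, which mirrors the situation where recall is the main contributor.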

4. EXP. 2: PAINTERS AND BIRTHPLACES

To test the performance of the method in a second domain and to show the post-processing step, we attempt a second relation instantiation task where the goal is to extract instances of the relation painter born in birthplace. The subject and object classes were populated with 1808 European painters and 47,000 European birthplaces. Three patterns of varying generality were constructed:

1: [painter] was (born) in [place]
2: [painter] was (born) {in|at} ...10 [place]
3: [painter] ...10 (born) ...20 [place]

We manually evaluated the results. Again, more general patterns lead to higher recall, while more specific patterns lead to higher precision. In Figure 2 we plot the values for the F-measure for different threshold values. Here we also observe that the value of the F-measure for more general patterns is higher than that of more specific patterns for all threshold values that are evaluated. Thus we can conclude that if the harmonic mean is used as an evaluation criterion, using more general patterns results in a better performance.

[Figure 2: F-measure against threshold value (0 to 50) for Pattern 1, Pattern 2, Pattern 3 and Pattern 3 (postprocessed); y-axis from 0 to 0.05.]

We also performed a postprocessing step on this data where we exploit the hierarchical structure of the geographic places in the TGN (see footnote 1). Candidate relation instances that are hierarchically equivalent are mapped to a single relation, where occurrence frequencies are summed. Figure 2 also shows the results of the evaluation of this postprocessed candidate relation instance set, which shows a significantly higher F-measure value.
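One possible reading of this post-processing step is sketched below. The assumption, which the text does not spell out, is that two candidates for the same painter are "hierarchically equivalent" when one place is an ancestor of the other, and that the merged relation keeps the most specific place. The `PARENT` table is a hypothetical fragment and not actual TGN data, and the frequencies are made up.

```python
# Hypothetical fragment of a place hierarchy (child -> parent); not actual TGN data.
PARENT = {"Leiden": "Netherlands", "Netherlands": "Europe",
          "Paris": "France", "France": "Europe"}

def ancestors(place, parent=PARENT):
    """Return all places above `place` in the hierarchy."""
    seen = set()
    while place in parent and parent[place] not in seen:  # the guard stops on cycles
        place = parent[place]
        seen.add(place)
    return seen

def merge_hierarchical(candidates, parent=PARENT):
    """candidates: dict mapping (painter, place) to a frequency.
    Folds each candidate into the most specific candidate below it for the same
    painter and sums the frequencies. If several branches are equally deep, the
    first one encountered is used."""
    merged = dict(candidates)
    for (painter, place), freq in candidates.items():
        below = [(p, pl) for (p, pl) in candidates
                 if p == painter and place in ancestors(pl, parent)]
        if below:
            target = max(below, key=lambda key: len(ancestors(key[1], parent)))
            merged[target] += freq
            del merged[(painter, place)]
    return merged

candidates = {("Rembrandt", "Leiden"): 4, ("Rembrandt", "Netherlands"): 3,
              ("Rembrandt", "Paris"): 1}
print(merge_hierarchical(candidates))
```

Summing the counts of the broader and the narrower place lifts the merged candidate further above the frequency threshold, while the unrelated candidate keeps its low count.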

5. CONCLUSIONS

We have shown the working of the various steps of the extraction method and the performance-boosting effect of the post-processing step. In both experiments, the values of the F1-measure are largely determined by the relatively low recall values. If the corpus is finite and the list of instances to be found is large enough, this data sparseness will occur for all patterns. In that case, using more general patterns in combination with a threshold, thereby exploiting the redundancy, will have a beneficial influence on the performance. For relation instantiation tasks, where semi-automatic methods are most needed due to the large number of target relation instances, using redundancy will be beneficial.

REFERENCES

[Anjewierden, 2006] Anjewierden, A. (2006). tOKo and Sigmund: text analysis support for ontology development and social research. http://www.toko-sigmund.org.

[Califf and Mooney, 2003] Califf, M. E. and Mooney, R. J. (2003). Bottom-up relational learning of pattern matching rules for information extraction. J. Mach. Learn. Res., 4:177-210.

[de Boer, 2010] de Boer, V. (2010). Ontology Enrichment from Heterogeneous Sources on the Web. PhD thesis, Universiteit van Amsterdam.

Footnote 1: Getty's Thesaurus of Geographic Names