Exploiting Redundancy for Pattern-based Relation Instantiation using tOKo Viktor de Boer Maarten W. van Someren Department of Computer Science, Web & Media, Informatics Institute, Universiteit van Vrije Universiteit Amsterdam, the Netherlands Amsterdam, the Netherlands v.de.boer@cs.vu.nl m.w.vansomeren@uva.nl Bob J. Wielinga Anjo A. Anjewierden Department of Computer Science, Web & Media, Behavioural Science IST, University of Twente, Vrije Universiteit Amsterdam, the Netherlands the Netherlands bj.wielinga@few.vu.nl a.a.anjewierden@gw.utwente.nl 1. INTRODUCTION both the situation where all elements of Ii or Ij are known The Semantic Web calls for semi-automatic methods to as well as the situation where we discover new instances of learn, populate and enrich ontologies. In this work, we the class Ci or Cj . present a method for the extraction of domain-specific rela- The tOKo tool and its pattern language The open tions between instances (relation instantiation). The method source tool tOKo [Anjewierden, 2006] has a large number uses hand-crafted extraction patterns which are executed on of interactive text analysis and ontology engineering func- a text corpus using the tOKo text analysis tool[Anjewierden, tionalities that can be accessed through a user-interface or 2006]. Additionaly, the extracted candidate relation instances through a Prolog API. The tool also provides a powerful can be filtered in a post-processing phase by using domain- pattern search functionality. The pattern language includes and task-specific background knowledge. ’standard’ syntactic abstractions such as matches on ex- The tOKo pattern language allows for patterns that in- act words, lemma’s, word classes, numbers, punctuations, clude references to semantic classes. This allows for a wider special characters, etc. TOKo also allows the use of popu- variety of generality of the patterns (cf. [Califf and Mooney, lated ontology concepts in these patterns (denoted by square 2003]). When very specific patterns are used, we can ex- brackets) where all term instances of that class are matched pect a high precision but a relatively low recall. If more in a text corpus. For example, the pattern I ate an [f ruit] general patterns are used, recall is expected to go up. This matches the phrases ”‘I ate an apple”’ and ”‘I ate an or- will negatively affect precision, but if we exploit the redun- ange”’, assuming that the class fruit is populated with these dancy of the relation instances in the corpus by putting a instances. threshold on the frequency of pattern matches, we can com- Relation Instantiation using patterns. The input for pensate for this loss in precision. Especially for extraction the method is a specific relation R and the related concepts tasks where the expected recall is very low, boosting the Ci and Cj from the ontology and any instances Ii and Ij recall is very beneficial to increase the overall performance from the knowledge base. In the first step, we create a cor- when measured in terms of the F-measure. In this paper, we pus for the task using the labels from the concepts and the show how exploiting the redundancy in this way improves relation. These are presented to the Google search engine. performance of the method. An extended version of this The first N pages are retrieved to form the corpus. On this work can be found in [de Boer, 2010]. corpus, a manually constructed tOKo extraction pattern is executed. A pattern query consists of three sub-patterns corresponding to the concept Ci , the relation R and the 2. TASK AND METHOD concept Cj respectively. The sub-patterns for Ci and Cj are We define the task of relation instantiation from a corpus constructed using tOKo’s sub-concept retrieval feature. If as follows: Given two classes Ci and Cj in a partly popu- the task also includes populating one of the classes, the ex- lated ontology, with sets of instances Ii and Ij and given pected word class can be used to match potential candidate a relation R : Ci × Cj , identify for an instance i ∈ Ii all instances. The generality of a relation instantiation pattern instances j ∈ Ij such that the relation R(i, j) holds given can be adjusted by choosing more general pattern constructs the information in the corpus. In this work we will discuss for the subpattern for R (I verb an [f ruit] is more general than I eat an [f ruit]. Next, the specific phrases that are the result of the Infor- mation Extraction phase are converted to RDF triples by Permission to make digital or hard copies of all or part of this work for mapping the three different sub-phrases to the correspond- personal or classroom use is granted without fee provided that copies are ing instances of Ci , R and Cj respectively using the tOKo not made or distributed for profit or commercial advantage and that copies API. Synonyms, misspellings and abbreviations are mapped bear this notice and the full citation on the first page. To copy otherwise, to to single instances. The output is a list of candidate rela- republish, to post on servers or to redistribute to lists, requires prior specific tion instances ordered by their associated frequencies in the permission and/or a fee. EKAW 2010 11-15 October 2010 - Lisbon, Portugal corpus. In our experiments, we evaluate the performance of . the method for various experiments by putting a threshold 0.5 0.45 0.05 0.4 0.04 0.35 0.3 0.03 0.25 pattern 1 Pattern 1 pattern 2 0.2 pattern 3 Pattern 2 pattern 4 0.02 Pattern 3 0.15 pattern 5 Pattern 3 (post- 0.1 0.01 processed) 0.05 0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 Figure 1: F-measure values for the five patterns for Figure 2: F-measure values threshold values for Ex- Experiment 1. periment 2, including post-processed results on the frequency of the candidate relations. lead to higher precision. In Figure 2 we plot the values for Background knowledge about the classes Ci and Cj and the F-measure for different threshold values. We here also the relation R can be used to improve the performance of observe that the value of the F-measure for more general the method and to reduce unwanted redundancy in the can- patterns is higher than that of more specific patterns for all didate relation instances. In Section 4 we give an example. threshold values that are evaluated. Thus we can conclude that if the harmonic mean is used as an evaluation criterion, 3. EXP. 1: ROMAN GODS using more general patterns results in a better performance. For this experiment, we constructed an extremely simple We also performed a postprocessing step on this data ’ontology’ consisting of two classes: gods:Roman God pop- where we exploit the hierarchical structure of the geographic ulated with 259 instances and gods:Domain (unpopulated), places in the TGN1 . Candidate relation instances that are hi- with the relation gods:is god of between the two. We con- erachically equivalent are mapped to a single relation, where structed a corpus by extracting from the web the first 1000 occurrence frequencies are summed. Figure 2 also shows pages resulting from the google query ’Roman +God +God- the results of the evaluation this postprocessed candidate dess’. We constructed the following 5 patterns of varying relation instance set, which shows a significantly higher F- generality: measure value. 1: [Roman god] is the {god|} of hnouni 2: [Roman god] ∗ the {god|goddess} of hnouni 5. CONCLUSIONS 3: [Roman god]{| } ...10 the god|goddess of hnouni We have shown the working of the various steps of the 4: [Roman god]{| } ...10 god|goddess of hnouni 5: [Roman god]{| } ...10 god|goddess ...10 hnouni extraction method and the performance-boosting effect of the post-processing step. In both experiments, the values of The results show the expected tradeoff between precision the F1-measures are largely determined by the relatively low and recall depending on the generality of the pattern. To recall values. If the corpus is finite and the list of instances show the combined performance, we plotted the harmonic to be found is large enough this data sparseness will occur mean of both precision and recall, the F-measure against for all patterns. In that case, using more general patterns in the threshold value in Figure 3 for all patterns. This figure combination with a threshold, thereby exploiting the redun- shows that using a general pattern and a threshold on the dancy will have a beneficial influence on the performance. frequency is preferable to using specific patterns. This is For relation instantiation tasks, where semi-automatic meth- the case when a large number of relation instances are to be ods are most needed due to the large number of target rela- found and recall is the main contributor to the F-measure. tion instances, using redundancy will be beneficial. 4. EXP. 2: ARTISTS’ BIRTH PLACES 6. REFERENCES To test the performance of the method in a second domain [Anjewierden, 2006] Anjewierden, A. (2006). toko and and to show the post-processing step, we attempt a second sigmund: text analysis support for ontology development relation instantiation task where the goal is to extract in- and social research. http://www.toko-sigmund.org. stances of the relation painter born in birthplace the [Califf and Mooney, 2003] Califf, M. E. and Mooney, R. J. subject and object classes were populated with 1808 Eu- (2003). Bottom-up relational learning of pattern ropean painters and 47.000 European birthplaces. Three matching rules for information extraction. J. Mach. patterns of varying generality were constructed: Learn. Res., 4:177–210. 1: [painter] was (born) in [place] [de Boer, 2010] de Boer, V. (2010). Ontology Enrichment 2: [painter] was (born) {in|at} ...10 [place] 3: [painter] ...10 (born) ...20 [place] from Heterogeneous Sources on the Web. PhD thesis, Universiteit van Amsterdam. We manually evaluated the results. Again, more general 1 patterns lead to higher recall, while more specific patterns Getty’s Thesaurus of Geographic Names