<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploiting Redundancy for Pattern-based Relation Instantiation using tOKo</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Viktor</forename><surname>De Boer</surname></persName>
							<email>v.de.boer@cs.vu.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science, Web &amp; Media</orgName>
								<orgName type="institution">Vrije Universiteit Amsterdam</orgName>
								<address>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maarten</forename><forename type="middle">W</forename><surname>Van Someren</surname></persName>
							<email>m.w.vansomeren@uva.nl</email>
							<affiliation key="aff1">
								<orgName type="department">Informatics Institute</orgName>
								<orgName type="institution">Universiteit van Amsterdam</orgName>
								<address>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bob</forename><forename type="middle">J</forename><surname>Wielinga</surname></persName>
							<email>bj.wielinga@few.vu.nl</email>
							<affiliation key="aff2">
								<orgName type="department">Department of Computer Science, Web &amp; Media</orgName>
								<orgName type="institution">Vrije Universiteit Amsterdam</orgName>
								<address>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anjo</forename><forename type="middle">A</forename><surname>Anjewierden</surname></persName>
							<email>a.a.anjewierden@gw.utwente.nl</email>
							<affiliation key="aff3">
								<orgName type="department">Behavioural Science IST</orgName>
								<orgName type="institution">University of Twente</orgName>
								<address>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploiting Redundancy for Pattern-based Relation Instantiation using tOKo</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">7D25D9C9EDE065D0721912B5B893F0AD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T23:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The Semantic Web calls for semi-automatic methods to learn, populate and enrich ontologies. In this work, we present a method for the extraction of domain-specific relations between instances (relation instantiation). The method uses hand-crafted extraction patterns which are executed on a text corpus using the tOKo text analysis tool <ref type="bibr" target="#b0">[Anjewierden, 2006]</ref>. Additionally, the extracted candidate relation instances can be filtered in a post-processing phase by using domain- and task-specific background knowledge.</p><p>The tOKo pattern language allows for patterns that include references to semantic classes. This permits patterns across a wider range of generality (cf. <ref type="bibr" target="#b0">[Califf and Mooney, 2003]</ref>). When very specific patterns are used, we can expect a high precision but a relatively low recall. If more general patterns are used, recall is expected to go up. This will negatively affect precision, but if we exploit the redundancy of the relation instances in the corpus by putting a threshold on the frequency of pattern matches, we can compensate for this loss in precision. Especially for extraction tasks where the expected recall is very low, boosting recall strongly benefits overall performance as measured by the F-measure. In this paper, we show how exploiting the redundancy in this way improves the performance of the method. An extended version of this work can be found in <ref type="bibr" target="#b1">[de Boer, 2010]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">TASK AND METHOD</head><p>We define the task of relation instantiation from a corpus as follows: Given two classes Ci and Cj in a partly populated ontology, with sets of instances Ii and Ij, and given a relation R : Ci × Cj, identify for an instance i ∈ Ii all instances j ∈ Ij such that the relation R(i, j) holds given the information in the corpus. In this work we discuss both the situation where all elements of Ii or Ij are known and the situation where we discover new instances of the class Ci or Cj.</p><p>The tOKo tool and its pattern language. The open source tool tOKo <ref type="bibr" target="#b0">[Anjewierden, 2006]</ref> has a large number of interactive text analysis and ontology engineering functionalities that can be accessed through a user interface or through a Prolog API. The tool also provides a powerful pattern search functionality. The pattern language includes 'standard' syntactic abstractions such as matches on exact words, lemmas, word classes, numbers, punctuation, special characters, etc. tOKo also allows the use of populated ontology concepts in these patterns (denoted by square brackets), where all term instances of that class are matched in a text corpus. For example, the pattern "I ate an [fruit]" matches the phrases "I ate an apple" and "I ate an orange", assuming that the class fruit is populated with these instances.</p><p>Relation Instantiation using patterns. The input for the method is a specific relation R and the related concepts Ci and Cj from the ontology and any instances Ii and Ij from the knowledge base. In the first step, we create a corpus for the task using the labels of the concepts and the relation. These labels are submitted as a query to the Google search engine, and the first N result pages are retrieved to form the corpus. On this corpus, a manually constructed tOKo extraction pattern is executed. A pattern query consists of three sub-patterns corresponding to the concept Ci, the relation R and the concept Cj respectively. The sub-patterns for Ci and Cj are constructed using tOKo's sub-concept retrieval feature. If the task also includes populating one of the classes, the expected word class can be used to match potential candidate instances. The generality of a relation instantiation pattern can be adjusted by choosing more general pattern constructs for the sub-pattern for R ("I verb an [fruit]" is more general than "I eat an [fruit]").</p><p>Next, the specific phrases that result from the Information Extraction phase are converted to RDF triples by mapping the three sub-phrases to the corresponding instances of Ci, R and Cj respectively using the tOKo API. Synonyms, misspellings and abbreviations are mapped to single instances. The output is a list of candidate relation instances ordered by their associated frequencies in the corpus. In our experiments, we evaluate the performance of the method at various threshold values on the frequency of the candidate relations. Background knowledge about the classes Ci and Cj and the relation R can be used to improve the performance of the method and to reduce unwanted redundancy in the candidate relation instances. In Section 4 we give an example.</p></div>
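<div xmlns="http://www.tei-c.org/ns/1.0"><p>The frequency threshold on candidate relation instances and the F-measure evaluation described above can be illustrated with a minimal Python sketch. It is not the tOKo implementation: the function names filter_by_threshold and f_measure, the toy pattern matches and the gold standard are all assumptions made for illustration only.</p>
<p>
```python
from collections import Counter


def filter_by_threshold(candidates, threshold):
    """Keep candidate relation instances matched at least `threshold` times."""
    counts = Counter(candidates)
    return {pair for pair, freq in counts.items() if freq >= threshold}


def f_measure(extracted, gold):
    """Harmonic mean of precision and recall for a set of extracted pairs."""
    correct = len(extracted.intersection(gold))
    if correct == 0:
        return 0.0
    precision = correct / len(extracted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pattern matches (one entry per match in the corpus).
matches = [
    ("Mars", "war"), ("Mars", "war"),
    ("Venus", "love"), ("Venus", "love"),
    ("Mars", "love"),      # spurious match of a general pattern
    ("Neptune", "sea"),
]
# Hypothetical gold standard of correct relation instances.
gold = {("Mars", "war"), ("Venus", "love"), ("Neptune", "sea")}

for threshold in (1, 2):
    kept = filter_by_threshold(matches, threshold)
    print(threshold, sorted(kept), round(f_measure(kept, gold), 3))
```
</p>
<p>Raising the threshold removes the spurious low-frequency candidate but also drops a correct one that occurs only once, which is the precision and recall trade-off that the threshold is meant to control.</p></div>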
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXP. 1: ROMAN GODS</head><p>For this experiment, we constructed an extremely simple 'ontology' consisting of two classes: gods:Roman God, populated with 259 instances, and gods:Domain (unpopulated), with the relation gods:is god of between the two. We constructed a corpus by retrieving from the Web the first 1000 pages returned for the Google query 'Roman +God +Goddess'. We constructed the following 5 patterns of varying generality:</p><p>The results show the expected tradeoff between precision and recall depending on the generality of the pattern. To show the combined performance, we plotted the F-measure, the harmonic mean of precision and recall, against the threshold value for all patterns in Figure <ref type="figure" target="#fig_0">1</ref>. This figure shows that using a general pattern and a threshold on the frequency is preferable to using specific patterns. This is the case when a large number of relation instances are to be found and recall is the main contributor to the F-measure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">EXP. 2: ARTISTS' BIRTH PLACES</head><p>To test the performance of the method in a second domain and to demonstrate the post-processing step, we attempt a second relation instantiation task, where the goal is to extract instances of the relation painter born in birthplace. The subject and object classes were populated with 1808 European painters and 47,000 European birthplaces. Three patterns of varying generality were constructed:</p><p>1: [painter] was (born) in [place]</p><p>2: [painter] was (born) {in|at} ...10 [place]</p><p>3: [painter] ...10 (born) ...20 [place]</p><p>We manually evaluated the results. Again, more general patterns lead to higher recall, while more specific patterns lead to higher precision. In Figure <ref type="figure" target="#fig_2">2</ref> we plot the values of the F-measure for different threshold values. Here too, we observe that the F-measure for more general patterns is higher than that for more specific patterns at all threshold values that are evaluated. Thus we can conclude that if the harmonic mean is used as an evaluation criterion, using more general patterns results in better performance. We also performed a post-processing step on this data in which we exploit the hierarchical structure of the geographic places in the TGN<ref type="foot" target="#foot_0">1</ref>. Candidate relation instances that are hierarchically equivalent are mapped to a single relation, where occurrence frequencies are summed. Figure <ref type="figure" target="#fig_2">2</ref> also shows the results of the evaluation of this post-processed candidate relation instance set, which shows a significantly higher F-measure value.</p></div>
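<div xmlns="http://www.tei-c.org/ns/1.0"><p>The post-processing step can be sketched in Python under explicit assumptions. The text does not define hierarchical equivalence precisely, so this sketch assumes that two candidate birthplaces for the same painter are equivalent when one lies above the other in the place hierarchy, and that the merged relation keeps the more specific place. The PARENT table is a hypothetical toy excerpt, not real TGN data, and the function names are illustrative.</p>
<p>
```python
# Hypothetical child-to-parent excerpt of a place hierarchy (not real TGN data).
PARENT = {
    "Leiden": "Netherlands",
    "Delft": "Netherlands",
    "Netherlands": "Europe",
}


def ancestors(place):
    """Return the set of places above `place` in the hierarchy."""
    found = set()
    while place in PARENT:
        place = PARENT[place]
        found.add(place)
    return found


def merge_equivalent(candidates):
    """Merge (painter, place) candidates along the hierarchy, summing frequencies.

    Assumption: a candidate for a broader place is folded into every more
    specific candidate place below it for the same painter.
    """
    merged = {}
    for (painter, place), freq in candidates.items():
        mentioned = {p for (a, p) in candidates if a == painter}
        # Skip places that have a more specific candidate below them.
        if any(place in ancestors(other) for other in mentioned):
            continue
        # Add the frequencies of all broader candidate places.
        support = sum(
            candidates[(painter, up)]
            for up in ancestors(place)
            if (painter, up) in candidates
        )
        merged[(painter, place)] = freq + support
    return merged


# Hypothetical candidate relation instances with corpus frequencies.
candidates = {
    ("Rembrandt", "Leiden"): 3,
    ("Rembrandt", "Netherlands"): 2,
    ("Vermeer", "Delft"): 1,
}
print(merge_equivalent(candidates))
```
</p>
<p>Summing the frequencies lets a correct but sparsely mentioned specific place pass a frequency threshold that it would miss on its own count.</p></div>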
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CONCLUSIONS</head><p>We have shown how the various steps of the extraction method work and demonstrated the performance-boosting effect of the post-processing step. In both experiments, the F-measure values are largely determined by the relatively low recall values. If the corpus is finite and the list of instances to be found is large enough, this data sparseness will occur for all patterns. In that case, using more general patterns in combination with a threshold, thereby exploiting the redundancy, will have a beneficial influence on performance. For relation instantiation tasks, where semi-automatic methods are most needed due to the large number of target relation instances, using redundancy will be beneficial.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Figure 1: F-measure values for the five patterns for Experiment 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>1: [Roman god] is the {god|} of noun 2: [Roman god] * the {god|goddess} of noun 3: [Roman god]{| } ...10 the god|goddess of noun 4: [Roman god]{| } ...10 god|goddess of noun 5: [Roman god]{| } ...10 god|goddess ...10 noun</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: F-measure values for different threshold values for Experiment 2, including post-processed results.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Getty's Thesaurus of Geographic Names</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">tOKo and Sigmund: text analysis support for ontology development and social research</title>
		<author>
			<persName><forename type="first">A</forename><surname>Anjewierden</surname></persName>
		</author>
		<ptr target="http://www.toko-sigmund.org" />
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Bottom-up relational learning of pattern matching rules for information extraction</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Califf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Mooney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="177" to="210" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Ontology Enrichment from Heterogeneous Sources on the Web</title>
		<author>
			<persName><forename type="first">V</forename><surname>De Boer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
		<respStmt>
			<orgName>Universiteit van Amsterdam</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
