<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Scaling up Pattern Induction for Web Relation Extraction through Frequent Itemset Mining</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sebastian</forename><surname>Blohm</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute AIFB</orgName>
								<orgName type="institution">Universität Karlsruhe (TH)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Philipp</forename><surname>Cimiano</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute AIFB</orgName>
								<orgName type="institution">Universität Karlsruhe (TH)</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Scaling up Pattern Induction for Web Relation Extraction through Frequent Itemset Mining</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">30FF107D2E3EF870B69D7B26B131B6D8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T04:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we address the problem of extracting relational information from the Web at a large scale. In particular we present a bootstrapping approach to relation extraction which starts with a few seed tuples of the target relation and induces patterns which can be used to extract further tuples. Our contribution in this paper lies in the formulation of the pattern induction task as a well-known machine learning problem, i.e. the one of determining frequent itemsets on the basis of a set of transactions representing patterns. The formulation of the extraction problem as the task of mining frequent itemsets is not only elegant, but also speeds up the pattern induction step considerably with respect to previous implementations of the bootstrapping procedure. We evaluate our approach in terms of standard measures with respect to seven datasets of varying size and complexity. In particular, by analyzing the extraction rate (extracted tuples per time) we show that our approach reduces the pattern induction complexity from quadratic to linear (in the size of the occurrences to be generalized), while mantaining extraction quality at similar (or even marginally better) levels.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>A problem which has received much attention in the last years is the extraction of (binary) relations from the Web. Automatic extraction of relations is useful whenever the amount of text to analyze is not manageable manually. As an example, a car manufacturer may want to monitor upcoming market developments by analyzing news and blogs on the Web. Relation extraction can extract the presentedAt relation in order to compile a list of upcoming car models and where they will be presented (e.g presentedAt(Audi Q7, Detroit Motor Show)). To address this problem, several supervised approaches have been examined which induce a classifier from training data and then apply it to discover new examples of the relation in question. These approaches typically work on a closed corpus and rely on positive and (implicit) negative examples provided in the form of annotations <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b7">8]</ref> or a handful of positive and negative examples <ref type="bibr" target="#b4">[5]</ref>. The obvious drawback of such methods is that they can inherently not scale to the Web as they would require the application of the classifier to the whole textual data on the Web, thus being linear in its size.</p><p>Alternative approaches to address the problem of extracting relations from the Web have been presented (we discuss a couple of systems below). These approaches rely on the induction of patterns on the basis of occurrences of a few examples of the relation in question. Such explicit textual patterns allow to take a shortcut to linearly scanning the whole Web by relying on standard index structures to evaluate the string patterns as standard search engine queries using off-the-shelf search engine APIs. This circumvents the need to linearly process the whole Web (see e.g. <ref type="bibr" target="#b2">[3]</ref>). Some approaches perform pattern induction in an iterative fashion in a cyclic approach which uses the new examples derived in one iteration for the induction of new patterns in the next iteration <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b0">1]</ref>. In this paper we follow this latter approach and in particular examine more in detail the empirical complexity of the pattern induction step. As in these approaches the induction of patterns proceeds in a bootstrapping-like fashion, the complexity of the pattern induction step crucially determines the time complexity of the whole approach. Earlier implementations of the approach have used greedy strategies for the pairwise comparison of the occurrences of seed examples. In this paper we show how the Apriori algorithm for discovering frequent itemsets can be used to derive patterns with a minimal support in linear time. Our empirical evaluation shows that with this approach pattern induction can be reduced to linear time while maintaining extraction quality comparable (and even marginally better) to earlier implementations of the algorithm.</p><p>The remainder of this paper is organized as follows. In the next section we describe the approach of pattern-based relation extraction using Web search engines in more detail. In section Pattern Induction as Frequent Itemset Mining, we give a brief introduction to Frequent Itemset Mining before describing how it is applied in order to induce patterns for relation extraction. We describe our experimental results in section Experimental Results, before discussing related work and giving some concluding remarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Iterative Pattern Induction</head><p>The goal of pattern induction is, given a set of seed examples (pairs) S of a relation R as well as occurrences Occ(S) in the corpus (the Web in our case) of these seeds, to induce a set of patterns P which are general enough to extract many more tuples standing in the relation R (thus having a good coverage) and which at the same time do not overgenerate in the sense that they produce too many spurious examples. The challenging issues here are on the one hand that the hypothesis space is huge, corresponding to the power set of the set of possible patterns P representing abstractions over the set of occurrences Occ(S). We will denote this hypothesis space as 2 P . On the other hand, the complete extension ext R of the relation R is unknown (it is the goal of the whole approach to approximate this extension as closely as possible at the end of the cycle), such that we cannot use it to compute an objective function: o : 2 P → R to determine the patterns' accuracy with respect to the extension ext R .</p><p>The general algorithm for iterative induction of patterns is presented in Figure <ref type="figure" target="#fig_0">1</ref>. It subsumes many of the approaches mentioned in the introduction which implement similar bootstrapping-like procedures. The key idea is to co-evolve P (which at the beginning is assumed to be empty) as well as a constantly growing set of examples S which at the beginning corresponds to the seed examples. The candidate patterns can be generated in a greedy fashion by abstracting over the occurrences Occ(S). Abstracting requires finding common properties, which in principle is a quadratic task as it requires pairwise comparison between the different occurrences.  The algorithm starts with a set of initial tuples S of the relation in questionso called seeds -and loops over a procedure which starts by acquiring occurrences of the tuples currently in S (e.g. by querying a search engine with "Stockholm" "Sweden") for the relation locatedIn. Further patterns are then learned by abstracting over the text occurrences of the tuples. The new patterns are then evaluated and filtered before they are matched. A resulting pattern could be "flights to ARG 1 , ARG 2 from * airport" and thus may contain wildcards and argument place holders. From these matches, new tuples are extracted, evaluated and filtered. The process is repeated until a termination condition DONE is fulfilled. The learning is thus inductive in nature, abstracting over individual positive examples in a bottom-up manner.</p><p>For our experiments we have used the implementation of the above algorithm as described in <ref type="bibr" target="#b2">[3]</ref>. They have shown in previous work that in absence of an objective function to maximize, we can reasonably estimate the quality of the set P of patterns by a heuristic function. Among the different functions examined in the above mentioned work, a simple function which assesses the quality of a pattern on the basis of its support, i.e. the different occurrences which it was generated from and therefore covers, is shown to be a good choice compared to other more elaborate measures such as the pointwise mutual information used in the Espresso <ref type="bibr" target="#b11">[12]</ref> and other systems (e.g. KnowItAll <ref type="bibr" target="#b8">[9]</ref>). Therefore, a reasonable choice is to select those patterns which have a minimal support and meet some heuristic syntactic criteria to prevent too general patterns <ref type="foot" target="#foot_0">1</ref> . We describe in the following section how this problem can be formulated as the one of determining frequent itemsets using the well-known apriori algorithm. With this move, we also reduce the complexity of the pattern induction step from quadratic to linear in the number of occurrences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Pattern Induction as Frequent Itemset Mining</head><p>In our approach, we translate textual occurrences of a certain relation into set representations and use the Apriori algorithm to find patterns in these occurrences that exceed a certain minimum support. This task is typically called frequent itemset mining (FIM).</p><p>The mining for frequent itemsets is a subtask of Association Rule Mining. Association rules are used to derive statements like "Clients who bought product X also bought product Y" from transaction databases. A transaction t ∈ DB constitutes a process with several items a from an alphabet of items A (e.g. products that have been jointly purchased). DB is thus a (multi) set of subsets of A.</p><p>In a database DB of transactions the frequent itemsets F ⊂ 2 A are defined as those sets that occur at least f req min times as subset of a transaction, i.e.</p><formula xml:id="formula_0">F = {f ∈ 2 A ||{t ∈ DB|f ⊂ t}| ≥ f req min }.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">The Apriori Algorithm</head><p>Apriori <ref type="bibr" target="#b1">[2]</ref> is an algorithm for finding all frequent itemsets given a database and a frequency threshold. It is based on the observation that an itemset f of size |f | = n can only be frequent in DB if all its subsets are also frequent in DB. Apriori thus significantly reduces the amount of itemsets for which the frequency has to be counted by first deriving all frequent itemsets of size n = 1 and then progressively increasing n so that the above subset condition can be checked when generating the candidates for n + 1 as all subsets of size n are known. The Apriori algorithm looks as follows in pseudocode:</p><formula xml:id="formula_1">APRIORI (Alphabet A, Database DB ⊂ 2 A , T hreshold f reqmin) 1 C ← {{a}|a ∈ A} 2 n ← 1 3 while C = ∅ 4 do 5 ∀c ∈ C : COUNTSUPPORT (c, DB) 6 Fn ← {c ∈ C|SUPPORT(c) &gt;= f reqmin} 7 C ← {f ∪ g|f, g ∈ Fn ∧ MERGABLE (f, g)} 8 C ← PRUNE(C, Fn) 9 n ← n + 1</formula><p>The algorithm stores all frequent itemsets of size n in a set F n after verifying for each itemset that it occurrs at least f req min times in DB. The set of candidates for the first iteration is given by all elements of the alphabet. For the following iterations it is then generated by taking all elements of F n and combining them if the condition MERGABLE(f, g) is fulfilled, which makes sure that f and g overlap in n − 1 elements. PRUNE(C, F n ) removes all itemsets c from C (which all have length n + 1) for which one or more of all possible subsets of c of size n are not contained in F n which is the above-mentioned necessary condition for c to be frequent.</p><p>The performance of the Apriori algorithm depends on the efficient implementation of the operations COUNTSUPPORT(c, DB), MERGABLE(f, g) and PRUNE(C, F n ). It is common to use a Trie data structure (also called Prefix Tree) for this purpose. Given an arbitrary total order on A, one can represent the itemsets as ordered sequences with respect to that sequence. Tries are trees that represent sequences as paths in the tree along with their frequency counts. After constructing a Trie from the DB, one can find and count non-continuous subsequences of DB entries very efficiently, which is the task of COUNTSUPPORT. Similarly, MERGABLE and PRUNE can be implemented as traversal operations on the Trie (as described in <ref type="bibr" target="#b10">[11]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Mining for Text Patterns with Apriori</head><p>The general idea of applying frequent itemset mining for text pattern induction is that a text pattern "flights to * , * " can be considered the frequent itemset of the set of text occurrences it has been generated from (e.g. DB = {"We offer flights to London, England.","I look for flights to Palo Alto, CA."}). In order to ensure that, in spite of the set character of itemsets, word order is preserved, a special encoding is used, allowing at the same time to express additional constraints over words. While sequence mining algorithms such as the one used by Jindal and Liu <ref type="bibr" target="#b9">[10]</ref> can be applied, it is not straightforward to encode multiple constraints per token. Thus, in our approach we exploit the more general model of unordered itemsets and encode word order and other constraints as described below.</p><p>We use the notion of constraints for describing the textual occurrences and patterns. Each constraint has a type, a position and a value. A constraint is fulfilled for a given text segment if the value is present at the given position in a way described by the constraint type. The positions are the token numbers (aligned by the positions of the arguments). Types can be for example surface string, capitalization and part-of-speech with their obvious sets of possible values. The pattern "We offer flights to * , * " may be represented as the following set of constraints:</p><formula xml:id="formula_2">surf ace1 = we, capitalization1 = true surf ace2 = offer, capitalization2 = f alse surf ace3 = flights, capitalization3 = f alse surf ace4 = to, capitalization4 = f alse surf ace6 = COMMA, capitalization6 = f alse</formula><p>Note that no constraints are posed for positions 5 and 7 because those are the argument positions (reflected by the * wildcard above). In our implementation we ensure that all occurrences are aligned such that the position numbers are always the same relative to the argument positions.</p><p>We encode each constraint as a positive integer value using a bijective function encode : T ype × P osition × V alue → N: encode(con, pos, value) = value * maxCon * maxP os + (pos + maxP os * (con − 1)). where con is the number of the constraint type, pos the position and value a numerical value reflecting frequency. The remaining variables reflect the respective maximal values with respect to the given database. One can think of this as the process of first "flattening" the structured information contained in the constraints to items like: in * " is frequent, then " * was * in * " is frequent as well). In order to avoid such too general patterns and at the same time avoiding too specific ones (e.g. "Wolfgang Amadeus * was born in * "), we introduce the following rule for removing more general patterns: if pattern a has all constraints also present in b and one more, b is removed unless SUPPORT(b) is at least 20% higher than SUPPORT(a). This rule is applied starting with the smallest patterns. We experimentally determined that the threshold of 20% leads to a generally rather appropriate set of patterns. The remaining unwanted patterns are left to be eliminated by further filtering.</p><formula xml:id="formula_3">{surf</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Evaluation</head><p>The goal of our experimental evaluation is to demonstrate the advantages of modeling the pattern abstraction subtask of iterative pattern induction as a frequent itemset mining (FIM) problem. We do so by comparing the performance achieved by our itemset-based implementation with the abstraction algorithm used in previous implementations (compare <ref type="bibr" target="#b2">[3]</ref>). We do not intend to show the superiority of the approach based on Frequent Itemset Mining to those from the literature as this would require a common benchmark for large-scale Web Relation Extraction or at least a common basis of implementation. Such a standard does not exist due to the diversity of applications and pattern representation formalisms in the literature. Yet, we evaluate our results on a fairly diverse set of non-taxonomic relations to ensure generality. The datasets we use have already been used in <ref type="bibr" target="#b2">[3]</ref> and are provided for download by the authors. As in these experiments, we have also used the same 10 seeds selected by hand and the same automatic evaluation procedure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>In our experiments, we rely on the widely used precision and recall measures to evaluate our system's output with respect to the full extension of the relation<ref type="foot" target="#foot_1">2</ref> . To give an shows precision, recall and F-measure for three configurations of the system: the classic configuration, the FIM configuration which uses the proposed modeling of the learning problem with all parameters unchanged and FIM tuned for which the parameters have been optimized for the new learning algorithm. In particular, as FIM is more efficient than the classic merge procedure, we can process a higher number of tuples, such that we set the number of occurrences downloaded to 200 (versus a decreasing number as used in <ref type="bibr" target="#b2">[3]</ref>). All the other parameters of the algorithm have been chosen as described there. Overall, there is a small superiority of FIM over the classic version in terms of precision and recall (29% vs. 27% and 15% vs. 11%). Most importantly, there is a clear superiority in terms of extraction rate (0.19 vs. 0.05 occurrences/second). This difference is statistically significant (two-sides paired Student's t-test with an α-Level of 0.05).</p><p>Table <ref type="table" target="#tab_1">1</ref> shows the different relations together with the size of their extension, the precision yielded by a manual evaluation of a sample of 100 tuples of each relation (P manual ), the precision yielded by the classic pattern induction algorithm P classic as well as the relative improvements yielded by our formulation of the problem as a frequent itemset mining (FIM) task relative to the precision P classic calculated automatically with respect to the relation's extension <ref type="foot" target="#foot_2">3</ref> . The best results for each relation are highlighted. In general, we see that while the results vary for each relation, overall the FIM version of the algorithm does not deteriorate the results, but even slightly improves them on average (+1,9% for the FIM version and +4.7% for the tuned FIM version).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Discussion</head><p>In principle, there are no reasons for any of the abstraction algorithms to show better precision and recall because they both explore all possible frequently occurring patterns in a breadth-first-search manner. Differences are due to minor modeling issues (see below), the slightly different evaluation of patterns based directly on support counts produced by apriori and, most importantly, the fact that learning is cut off after one hour per iteration. Indeed the standard implementation frequently reached this time limit of an hour, thus leading to better results for the FIM version of the algorithm which does not suffer from this time limit.</p><p>One example of slight modeling differences which influenced performance is the treatment of multi-word instances. The learner has to decide whether to insert one wildcard * in an argument position (nearly always matching exactly one word) or two (allowing for two or more words). The classic version heuristically takes the number of words in the argument of the first occurrence used for pattern creation as sample for the wildcard structure. The FIM version encodes the fact that an argument has more than one word as an additional constraint. If this item is contained in a learned frequent itemset, a double wildcard is inserted. The stronger performance with the bornInYear (+48%), currencyOf (+10.9%) and productOf (+12%) relations can be explained in that way (compare Table <ref type="table" target="#tab_1">1</ref>). For example, the FIM version learns that person names have typically length 2 and birth years always have length 1 while the classic induction approach does not allow this additional constraint. This explains the decreased performance of the classic approach for the relations mentioned above for which at least one argument has a rather fixed length (e.g. years).</p><p>As indicated in Figure <ref type="figure">2</ref>, the clear benefit of the FIM abstraction step lies in its runtime behavior. The duration of a pattern generation process is plotted over the number of sample instances to be generalized. To measure these times, both learning modules were provided with the same sets of occurrences isolated from the rest of the induction procedure. The FIM shows a close to linear increase of processing duration for the given occurrence counts. Even though implemented with a number of optimizations (see <ref type="bibr" target="#b2">[3]</ref>), the classic induction approach clearly shows a quadratic increase in computation time w.r.t. the number of input occurrences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>The iterative induction of textual patterns is a method widely used in large-scale information extraction. Sergey Brin pioneered the use of Web search indices for this purpose <ref type="bibr" target="#b3">[4]</ref>. Recent successful systems include KnowItAll which has been extended to automatic learning of patterns <ref type="bibr" target="#b8">[9]</ref> and Espresso <ref type="bibr" target="#b11">[12]</ref>. The precision of Espresso on various relations ranges between 49% and 85%, which is comparable to our range of precisions P manual . Concerning the standard restriction to binary relations, Xu et al. <ref type="bibr" target="#b16">[17]</ref> have shown how approaches used for extracting binary relations can be applied to n-ary relations in a rather generic manner by considering binary relations as projections of these. These and the many other related systems vary considerably with respect to the representation of patterns and in the learning algorithms used for pattern induction. The methods used include Conditional Random Fields <ref type="bibr" target="#b15">[16]</ref>, vector space clustering <ref type="bibr" target="#b0">[1]</ref>, suffix trees <ref type="bibr" target="#b13">[14]</ref> and minimizing edit distance <ref type="bibr" target="#b12">[13]</ref>. In this paper, we have proposed to model different representational dimensions of a pattern such as word order, token at a certain position, part-of-speech etc. as constraints. Our approach allows straightfor-wardly to represent all these dimensions by an appropriate encoding. Given such an encoding, we have shown how frequent itemset mining techniques can be used to efficiently find patterns with a minimal support. Apart from pattern-based approaches, a variety of supervised and semi-supervised classification algorithms have been applied to relation extraction. The methods include kernel-based methods <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b7">8]</ref> and graph-labeling techniques <ref type="bibr" target="#b5">[6]</ref>. The advantage of such methods is that abstraction and partial matches are inherent features of the learning algorithm. In addition, kernels allow incorporating more complex structures like parse trees which cannot be reflected in text patterns. However, such classifiers require testing all possible relation instances while with text patterns extraction can be significantly speeded up using search indices. From the point of view of execution performance, a pattern-based approach is superior to a classifier which incorporates a learned model which can not be straightforwardly used to query a large corpus such as the web. Classification thus requires linear-time processing of the corpus while search-patterns can lead to faster extraction. Recently, the A similar approach to ours is the one by Jindal and Liu <ref type="bibr" target="#b9">[10]</ref>. They use Sequential Pattern Mining -a modification of Frequent Itemet Mining -to derive textual patterns for classifying comparative sentences in product descriptions. While, like our approach, encoding sequence information, their model is not able to account for several constraints per word. Additionally, the scalability aspect has not been focus of their study as mining has only be performed on a corpus of 2684 sentences with a very limited alphabet. Another approach orthogonal to ours is presented by <ref type="bibr" target="#b6">[7]</ref>. Each occurrence is abstracted over in a bottom up manner which saves pairwise occurrence comparison at the expense of evaluating the large amounts of pattern candidates with respect to the training set. The algorithm seems thus more appropriate for fully supervised settings of limited size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>Our contribution in this paper lies in the formulation of the pattern induction step as a well-known machine learning problem, i.e. the one of mining frequent itemsets. On the one hand, this formulation is elegant and advantageous as we can import all the results from the literature on association mining for further optimization (an overview of which is given in and <ref type="bibr" target="#b14">[15]</ref>). On the other hand, we have shown that this formulation leads to a significant decrease in the running time of the extraction. In particular, we have shown that the running time behavior decreases from quadratic to linear with the number of occurrences to be generalized with respect to previous implementations. Further, we have also shown that the quality of the generated tuples even slightly increases in terms of F-measure compared to the standard pattern induction algorithm. This increase is mainly due to the modeling of argument length as an additional constraint which can be straightforwardly encoded in our FIM framework. Overall, modeling the different representational dimensions of a pattern as constraints is elegant as it allows to straightforwardly add more information. In future work we plan to consider taxonomic as well as other linguistic knowledge.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Iterative pattern induction algorithm starting with initial tuples S or (alternatively) patterns P .</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .Figure 2</head><label>22</label><figDesc>Fig. 2. Precision, recall, F-measure and extraction rate for the individual configurations averaged over all relations (left); Time (sec.) taken by a run of the classical induction algorithm (squares) and the FIM-based algorithm (circles) over the numbers of sample occurrences. (right)objective measure for temporal performance, we use the Extraction Rate, that is, the number of correctly extracted tuples T P over the duration D of the extraction process in seconds (on a dual core machine with 4GB of RAM): Ex = T P D Figure2shows precision, recall and F-measure for three configurations of the system: the classic configuration, the FIM configuration which uses the proposed modeling of the learning problem with all parameters unchanged and FIM tuned for which the parameters have been optimized for the new learning algorithm. In particular, as FIM is more efficient than the classic merge procedure, we can process a higher number of tuples, such that we set the number of occurrences downloaded to 200 (versus a decreasing number as used in<ref type="bibr" target="#b2">[3]</ref>). All the other parameters of the algorithm have been chosen as described there. Overall, there is a small superiority of FIM over the classic version in terms of precision and recall (29% vs. 27% and 15% vs. 11%). Most importantly, there is a clear superiority in terms of extraction rate (0.19 vs. 0.05 occurrences/second). This difference is statistically significant (two-sides paired Student's t-test with an α-Level of 0.05).Table1shows the different relations together with the size of their extension, the precision yielded by a manual evaluation of a sample of 100 tuples of each relation (P manual ), the precision yielded by the classic pattern induction algorithm P classic as well as the relative improvements yielded by our formulation of the problem as a frequent itemset mining (FIM) task relative to the precision P classic calculated automatically with respect to the relation's extension3 . The best results for each relation are highlighted. In general, we see that while the results vary for each relation, overall the FIM version of the algorithm does not deteriorate the results, but even slightly improves them on average (+1,9% for the FIM version and +4.7% for the tuned FIM version).</figDesc><graphic coords="7,134.76,70.05,170.00,97.27" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>ace 1 we, capitalization 1 true, surf ace 2 off er, capitalization 2 f alse, surf ace 3 f lights, capitalization 3 f alse, surf ace 4 to, capitalization 4 f alse, surf ace 6 COMMA, capitalization 6 f alse} Relations with precision scores obtained by the classic system (manual evaluation) and differences (∆) measured with the two FIM conditions.</figDesc><table><row><cell>and subsequently translated to integer values: {987, 435, 656634, 4235, 234, 6453, 64,</cell></row><row><cell>242, 786, 89}. During the application of Apriori, only those subsets are retained that</cell></row><row><cell>reflect a frequently occurring textual pattern: {6453,64,242,786,89}= "flights to *, *".</cell></row><row><cell>Apriori generates all patterns that exceed a given frequency threshold. Inevitably,</cell></row><row><cell>this yields multiple patterns that are subsumed by each other (e.g. if " * was born</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">In particular, we ensure that the patterns have a minimal number of token constraints (and not only wildcards) as well as that they have been generated from at least two different tuples.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Note that this is different from the evaluation of other similar systems which calculate these measures with respect to a specific corpus, thus yielding higher scores. Also due to the abscence of a closed corpus in our Web scenario, our notion of recall is is not comparable. We use "relative recall" in the sense that it reflects extractions compared to the highest yield count obtained over all experimental settings we applied.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Note here that the precision P classic calculated automatically with respect to the datasets is much lower than the precision obtained through sampled manual evaluation (P manual ). This is due to the in some cases unavoidable in-completeness of the datasets and orthographic differences in test data and extraction results.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was funded by Deutsche Forschungsgemeinschaft (MULTIPLA project, grant Nr 38457858) and the X-Media project (www.x-media-project.org) sponsored by the European Commission as part of the Information Society Technologies (IST) program under EC grant number IST-FP6-026978.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Snowball: extracting relations from large plain-text collections</title>
		<author>
			<persName><forename type="first">E</forename><surname>Agichtein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gravano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifth ACM conference on Digital Libraries (DL)</title>
				<meeting>the fifth ACM conference on Digital Libraries (DL)</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Fast algorithms for mining association rules in large databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th International Conference on Very Large Data Bases (VLDB&apos;94)</title>
				<meeting>the 20th International Conference on Very Large Data Bases (VLDB&apos;94)</meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="487" to="499" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Harvesting relations from the web -quantifiying the impact of filtering functions</title>
		<author>
			<persName><forename type="first">S</forename><surname>Blohm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stemle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AAAI&apos;07</title>
				<meeting>AAAI&apos;07</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1316" to="1323" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Extracting patterns and relations from the world wide web</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the WebDB Workshop at the 6th International Conference on Extending Database Technology (EDBT)</title>
				<meeting>the WebDB Workshop at the 6th International Conference on Extending Database Technology (EDBT)</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning to extract relations from the web using minimal supervision</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bunescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mooney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACL 2007</title>
				<meeting>ACL 2007</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Relation extraction using label propagation based semi-supervised learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Niu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING-ACL 2006</title>
				<meeting>COLING-ACL 2006</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="129" to="136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Adaptive information extraction from text by rule induction and generalisation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ciravegna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of IJCAI 2001</title>
				<meeting>IJCAI 2001</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="1251" to="1256" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Dependency tree kernels for relation extraction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Culotta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sorensen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd ACL</title>
				<meeting>the 42nd ACL</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="423" to="429" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Learning text patterns for web information extraction and assessment</title>
		<author>
			<persName><forename type="first">D</forename><surname>Downey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Soderland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weld</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Workshop on Adaptive Text Extraction and Mining</title>
				<meeting>the AAAI Workshop on Adaptive Text Extraction and Mining</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Mining comparative sentences and relations</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jindal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AAAI&apos;06</title>
				<meeting>AAAI&apos;06</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Fast sequential and parallel algorithms for association rule mining: a comparison</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mueller</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1995">1995</date>
			<pubPlace>, MD, USA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>College Park</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Espresso: Leveraging generic patterns for automatically harvesting semantic relations</title>
		<author>
			<persName><forename type="first">P</forename><surname>Pantel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pennacchiotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING-ACL&apos;06</title>
				<meeting>COLING-ACL&apos;06</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="113" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Towards terascale knowledge acquisition</title>
		<author>
			<persName><forename type="first">P</forename><surname>Pantel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ravichandran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">H</forename><surname>Hovy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING-04</title>
				<meeting>COLING-04</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="771" to="777" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Learning surface text patterns for a question answering system</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ravichandran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the ACL</title>
				<meeting>the 40th Annual Meeting of the ACL</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="41" to="47" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt-Thieme</surname></persName>
		</author>
		<title level="m">Assoziationsregel-Algorithmen für Daten mit komplexer Struktur</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
		<respStmt>
			<orgName>Universität Karlsruhe</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A context pattern induction method for named entity extraction</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P</forename><surname>Talukdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brants</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liberman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th CoNLL</title>
				<meeting>the 10th CoNLL<address><addrLine>New York City</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A seed-driven bottom-up machine learning framework for extracting relations of various complexity</title>
		<author>
			<persName><forename type="first">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 45th ACL</title>
				<meeting>the 45th ACL</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="584" to="591" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Kernel methods for relation extraction</title>
		<author>
			<persName><forename type="first">D</forename><surname>Zelenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Aone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Richardella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1083" to="1106" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
