Minimizing the Costs of the Training Data for Learning Web Wrappers

Rolando Creo, Valter Crescenzi, Disheng Qiu, Paolo Merialdo
Dipartimento di Informatica ed Automazione
Università degli Studi Roma Tre
Via della Vasca Navale, 79 – Rome, Italy
{creor, disheng, crescenz, merialdo}@dia.uniroma3.it

ABSTRACT
Data extraction from the Web is an important issue. Several approaches have been developed to bring the wrapper generation process to the web scale. Although they rely on different techniques and formalisms, they all learn a wrapper given a set of sample pages. Unsupervised approaches require just a set of sample pages; supervised ones also need training data. Unfortunately, the accuracy obtained by unsupervised techniques is not sufficient for many applications. On the other hand, obtaining training data is not cheap at the web scale. This paper addresses the issue of minimizing the costs of collecting training data for learning web wrappers. We show that two interleaved problems affect this issue: the choice of the sample pages, and the expressiveness of the wrapper language. We propose a solution that leverages contributions from the field of learning theory, and we discuss the promising results of an experimental evaluation of our approach.

1. INTRODUCTION
The huge amount of information available on the web has inspired several research efforts towards the development of tools and techniques to infer web wrappers for extracting data from script-generated HTML pages.

Unsupervised approaches take as input a set of sample pages and analyze regularities and differences to infer a wrapper based on the underlying HTML template [6, 2]. They can scale over the number of sources, but the accuracy of the generated wrappers is limited. Supervised approaches can produce accurate wrappers, but their scalability is limited because they need training data, i.e. annotations over the values published in the pages. In early approaches these data were provided by means of human intervention [9]. More recently, solutions that rely on data stored in existing repositories have been proposed [7]. Unfortunately, in many domains suitable training data do not exist at all (consider pages that publish subjective values, such as customer ratings, or real-time data, such as stock quote prices), or they might be biased towards specific instances (typically the most popular ones). This prevents the generation of a wrapper that is correct over the broader set of instances, and thus raises the need for additional training data.

Overall, obtaining training data represents a relevant cost in the wrapper inference process. This paper presents a framework to minimize the cost of generating a wrapper, expressed as the number of membership queries (MQ) needed by a supervised inference system for the training [1]. Membership queries are the simplest form of queries, since they admit only a yes/no answer, e.g. "is rick@usr.edu a value to extract?". It is worth observing that the simplicity of these queries makes them suitable to be answered through crowdsourcing platforms, which make the training costs explicit.

As the following example illustrates, these costs depend on two interrelated features: the representativeness of the sample and the expressiveness of the extraction language.

The Sampling Problem. Suppose we are interested in wrapping pages containing information about professors. For the sake of simplicity, let us represent pages as tables, where data is organized in rows and columns. Figure 1(a) shows sample pages depicted according to this simplification. A wrapper can be described as a set of extraction rules. In our abstraction, an extraction rule specifies the cell containing the relevant data, and it can be expressed by absolute coordinates (e.g. first row, second column), or by relative coordinates, that is, with respect to another cell (e.g. the first cell located at the right of the cell containing 'Email'). Correspondingly, suppose that we choose to adopt XPath extraction rules of one of two possible forms: absolute extraction rules (e.g. /html[1]/table[1]/tr[x]/td[y]), which we denote abs(x,y), and relative extraction rules (e.g. //tr[td[contains(.,'x')]]/td[2]), which we denote right-of('x').

(a) sample pages:

  Rick                               Mark                               Bill
  Home        rick.usr.edu           Home        mark.usm.edu           Affiliation  SE Dept. Univ. of G.
  Awards      Best paper             Affiliation IT Dept. Univ. of M.   Position     Ass. Prof.
  Affiliation CS Dept. Univ. of R.   Position    Ass. Prof.             Email        bill@usg.edu
  Position    Full Prof.             Email       mark@usm.edu           Phone        123-454-3210
  Email       rick@usr.edu           Phone       987-654-3210
  Phone       123-345-6789

  (the initial annotation v0+ = mark@usm.edu marks Mark's email)

(b) extraction rules:

  r1  abs(5,2)
  r2  right-of('Email')
  r3  above('987-654-3210')
  r4  below('Ass. Prof.')

(c) values extracted:

        r1            r2            r3            r4
  pr    Full Prof.    rick@usr.edu  nil           nil
  pm    mark@usm.edu  mark@usm.edu  mark@usm.edu  mark@usm.edu
  pb    123-454-3210  bill@usg.edu  nil           bill@usg.edu

Figure 1: Running Example

For example, according to Rick's page, candidate extraction rules for Name are abs(1,1) and above('Home'). Similarly, rules for Position are abs(5,2) and right-of('Position'). Now suppose that the Position of professors is a relevant attribute to extract. If the sample set is composed only of awarded professors (such as Rick), the inferred rules might not work for the broader set of all professors, including those without any award (such as Mark and Bill). For example, the rule abs(5,2) for Position might work for all the awarded professors, but it does not extract the position of the other professors.

The usual approach to address this issue is to work with a large set of annotations that hopefully covers all the possible types of target pages. However, according to our cost model, this strategy is inefficient.
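To make the running example concrete, the following minimal Java sketch evaluates the two rule forms over pages modeled as row/column matrices, mirroring Figure 1(a). The class and method names are illustrative only, not the paper's prototype.

    import java.util.*;

    public class RuleDemo {
        // abs(x,y): absolute coordinates, 1-based as in the paper; null plays nil
        static String abs(List<String[]> page, int row, int col) {
            if (row > page.size() || col > page.get(row - 1).length) return null;
            return page.get(row - 1)[col - 1];
        }
        // right-of('x'): the cell at the right of the cell containing the pivot
        static String rightOf(List<String[]> page, String pivot) {
            for (String[] r : page)
                if (r.length > 1 && r[0].equals(pivot)) return r[1];
            return null;
        }
        public static void main(String[] args) {
            List<String[]> rick = List.of(
                new String[]{"Rick"},
                new String[]{"Home", "rick.usr.edu"},
                new String[]{"Awards", "Best paper"},
                new String[]{"Affiliation", "CS Dept. Univ. of R."},
                new String[]{"Position", "Full Prof."},
                new String[]{"Email", "rick@usr.edu"},
                new String[]{"Phone", "123-345-6789"});
            List<String[]> mark = List.of(
                new String[]{"Mark"},
                new String[]{"Home", "mark.usm.edu"},
                new String[]{"Affiliation", "IT Dept. Univ. of M."},
                new String[]{"Position", "Ass. Prof."},
                new String[]{"Email", "mark@usm.edu"},
                new String[]{"Phone", "987-654-3210"});
            // The absolute rule learned on the awarded professor does not generalize:
            System.out.println(abs(rick, 5, 2));            // Full Prof.
            System.out.println(abs(mark, 5, 2));            // mark@usm.edu
            System.out.println(rightOf(mark, "Position"));  // Ass. Prof.
        }
    }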
The Expressiveness Problem. Consider again the running example, and suppose we are interested in extracting the professors' Name and Home. We now show that the size of the sample set actually depends on the expressiveness of the language used to specify the rules, and hence on the set of available rules.

Suppose that we choose to adopt only absolute extraction rules: a correct rule for Name is abs(1,1). Note that only one labeled sample would suffice to infer this rule. Suppose now to adopt a more expressive language, which also includes relative rules. Using just one page, say Rick's, several rules are generated to extract Name: abs(1,1), above('Home'), above('rick.usr.edu'). To determine the correct rule, at least another well-chosen example is required: only with the help of Mark's page do we have the evidence that the rule above('rick.usr.edu') does not work.

To summarize, the more expressive the model, the larger the size of the representative sample set [1, 5] (intuitively, the space of hypotheses is larger, and thus more examples are needed to discard the incorrect ones). However, the additional expressiveness should be carefully handled, since its actual need depends on the input pages and on the desired attributes, whereas it always entails additional costs.

The usual approach to address this issue is to work with overly expressive languages that cover all needs. However, as made explicit by our cost model, this strategy is inefficient.

Overview. In this paper we address the above issues. We show that they cannot be tackled separately, and we propose an approach that carefully handles both the expressiveness of the wrapping language and the choice of the samples (i.e. the pages to be annotated).

Our approach, inspired by a statistical learning technique [14, 13], enlarges the expressiveness of the language at runtime. We organize the class of candidate rules R into a hierarchy of classes {R_h}_{0≤h≤m} of increasing size: initially the correct rule is searched only within R_0, and then the set of available extraction rules might be expanded to a larger class R_h. The goal is to achieve the expressiveness of the largest classes only whenever it is detected as actually needed. For all those cases that can be solved within the smaller classes of rules, the algorithm will save many samples.

The approach is independent of the details of the formalism used to express the extraction rules. Our hierarchy {R_h} makes use of simple rules, as follows: R_0 is the class of absolute XPath rules described above; R_h, for 0 < h ≤ m, is obtained by adding to R_{h−1} the class of relative XPath rules with distance at most h from a pivoting leaf to the value.¹ However, this is just one possibility: the hierarchy can be built with other classes of rules (a sketch of this organization is given at the end of this section).

A key decision is whether and when to expand the set of candidate rules. We introduce a probabilistic model to dynamically characterize the probability of correctness of the rules in the current class of candidate rules R_h. We propose an original active learning algorithm [12] that exploits the probabilistic model to enlarge R_h lazily, i.e. only whenever there is enough evidence that the correct rule is not amongst the current set of candidates. The algorithm actively chooses the next membership query to find a representative sample set: it selects a sample page and poses a membership query on an extracted value. By accurately choosing this value, the number of queries can be minimized.

The paper is organized as follows: Section 2 discusses related work; Section 3 formalizes our setting and states the problem definition; Section 4 develops a probabilistic model to characterize the correctness of extraction rules; Section 5 presents an active learning algorithm for extraction rules based on the model; finally, Section 6 discusses our preliminary experiments with a set of sources from the Web.

¹In the current prototype all textual leaves are used as candidate pivots. The distance from the pivot to the extracted node is measured according to the number of edges crossed in the DOM representation of the HTML pages, but considering contiguous siblings at distance 1.
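The following minimal Java sketch shows one way the hierarchy {R_h} could be materialized, bucketing candidate rules by their pivot-to-value distance (level 0 for absolute rules). The Rule representation and the distance values assigned below are illustrative assumptions, not the prototype's actual data structures.

    import java.util.*;

    public class HierarchyDemo {
        // distance 0 marks an absolute rule; h > 0 a relative rule whose pivot
        // lies at distance h from the extracted value
        record Rule(String xpath, int pivotDistance) {}

        // R_h: every generable rule whose level does not exceed h
        static List<Rule> classUpTo(List<Rule> generable, int h) {
            return generable.stream().filter(r -> r.pivotDistance() <= h).toList();
        }

        public static void main(String[] args) {
            List<Rule> generable = List.of(
                new Rule("/html[1]/table[1]/tr[5]/td[2]", 0),       // abs(5,2)
                new Rule("//tr[td[contains(.,'Email')]]/td[2]", 1), // right-of('Email')
                new Rule("//tr[td[contains(.,'987-654-3210')]]"
                         + "/preceding-sibling::tr[1]/td[2]", 2));  // above(...)
            for (int h = 0; h <= 2; h++)
                System.out.println("R_" + h + ": " + classUpTo(generable, h).size() + " rules");
        }
    }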
2. RELATED WORK
In machine learning, the number of labeled samples needed by a supervised learning algorithm to infer a good hypothesis is called sample complexity, and it has been studied from several perspectives. For instance, similarly to our setting, [1] discusses the problem of exactly inferring a concept, i.e. a set of elements, by means of membership queries, i.e. questions of the type "is this an element of the target concept?". However, the main idea underlying our approach has been proposed by the statistical learning community [14], in which a loss function is given in order to characterize the quality of the produced hypothesis. The structural risk minimization (SRM) technique [13], i.e. the decomposition of the set of hypotheses into a hierarchy of subclasses, aims at avoiding the overfitting problem: since the class of hypotheses studied by this community might be so expressive as to arbitrarily reduce the loss, a trade-off with other quality criteria is needed to prevent the learning algorithm from selecting the hypothesis that describes the training data perfectly, rather than their underlying patterns.

Many researchers have proposed variations of the learning paradigm to make it practically feasible in different applicative contexts: the learning approaches in which the inference algorithm is free to choose which sample to label next are usually called active [12]. These have recently gained interest since, as clarified in [3], they might produce exponential improvements in the number of samples wrt traditional supervised approaches.

To the best of our knowledge, and differently from our proposal, all the approaches for inferring wrappers over structured websites developed by the wrapper community [6, 2, 9, 4, 15] define the set of hypotheses statically, i.e. before performing the inference; once set, the set of candidate rules cannot be changed without seriously revisiting the inference algorithm. Therefore they oversize the expressiveness of the formal language used to specify the extraction rules, and additional samples are required only to compensate for the excess of expressiveness.

Active learning approaches for wrapper induction have been proposed in [11, 10]. However, also in these works the expressiveness is statically defined. Moreover, the latter approach requires complex user interaction, since the user has to choose the correct wrapper within a set of ranked proposals.

A few recent proposals try to scale wrapper inference to the web [7, 8]. In [7] the authors leverage an available dataset, but it is not clear how they can avoid biased samples (as suggested by their own running example, which is based on popular objects), while [8] needs domain knowledge that only a human expert can provide.
3. PROBLEM DEFINITION
Preliminary Definitions: Let U = {p1, p2, ..., pn} be a set of pages. Every page publishes several attributes of interest (e.g. a professor's Position, Email, etc.). For simplicity, we develop the discussion concentrating on one attribute, and we assume that its values are either a textual leaf of the DOM tree representation of the pages, or a distinguished nil value. We write v ∈ p to denote that v is a value of the page p, p_v to denote the page in which the value v is located, and |p| to denote the number of values in p.

We refer to a generic extraction rule (or simply rule) r over the set of pages U as a concrete tool to build a vector of values indexed by the pages in U, such that r(p) ∈ p ∪ {nil}. Every rule extracts one vector of values from U, denoted r(U). Figure 1(c) shows the vectors extracted by the rules r1, r2, r3, r4 of Figure 1(b). We denote with R(U) the set of vectors obtained by applying a set of rules R over U, and blur the distinction between a rule and the vector it extracts from U. Note that |R(U)| ≤ |R|, with the strict inequality holding whenever the same vector is extracted by different rules.

We denote with {R_h} the hierarchy of classes of all the generable rules, i.e. the extraction rules that can be generated to extract vectors from U. The classes of this family are countable but potentially not finite. We will work with a finite restriction of them, called the set of generated rules, as discussed in the following.

Labeled Sample Sequences: We introduce the concept of labeled sample v^l, where v ∈ p_v is a value from a page p_v, and l ∈ {+, −} is either a positive or a negative label. In the following, v+ and v− denote a positive sample (or annotation) and a negative sample, respectively, that is, the two possible answers to a MQ.

A rule r is admissible wrt a set of samples L (denoted L(r)) iff it is compliant with the labels in the set:

$$L(r) \iff \forall v^l \in L:\;\begin{cases} l = + \;\Rightarrow\; r(p_v) = v \\ l = - \;\Rightarrow\; r(p_v) \neq v \end{cases}$$

The concept can be trivially extended to a set of rules R, and we denote with R_L = {r ∈ R : L(r)} the subset of admissible rules in R wrt L, and with V̂^R_L(U) all the values they extract from U: V̂^R_L(U) = {v : v = r(p), r ∈ R_L, p ∈ U}.

Example: Let pr, pm and pb be the pages in Figure 1(a), and let U = {pr, pm, pb}. The attribute Email is extracted by the rule r2 = right-of('Email'): two positive samples are v0+ = 'mark@usm.edu' and v1+ = 'bill@usg.edu'; a negative sample is v2− = '123-454-3210'. Observe that r2 is admissible wrt L = {v0, v1, v2}. Now consider another rule, r1 = abs(5,2), and the set of rules R = {r1, r2}. Then r1 is not admissible wrt L, since r1(pb) = '123-454-3210', which is the negative sample v2−. Hence, R_L = {r2} and V̂^R_L(U) = {v0, v1}.

In the following, given a set of rules R, we will only consider special ordered sets of samples, called labeled sample sequences, which are formed by an initial annotation, followed only by new values that are still admissible with respect to the samples already seen. Intuitively, a labeled sample sequence is the list of answers to the MQ posed to learn a rule.

Definition: A Labeled Sample Sequence (l.s.s.) L wrt a set of rules R and a set of pages U is specified by a sequence of labeled samples v0, ..., vk, ... that defines a sequence of (observed) sets L_k, with L_{k+1} = L_k ∪ {v_k} = {v0+, v1, ..., vk}, such that: (i) it begins with an annotation v0+ ≠ nil, and (ii) ∀k > 0, v_k ∈ V^R_{L_k}(U) = V̂^R_{L_k}(U) \ L_k.

The constraint (i) on the first annotation v0+ of the sequence is useful to get a finite² R_{L_1}, whereas the constraint (ii) on the remaining samples entails that the new sample v_k that forms L_{k+1} from L_k leads to smaller and smaller admissible sets: R_{L_{k+1}} ⊆ R_{L_k}. It is worth noting that R_{L_{k+1}} plays the role of what the learning communities call the version space [12], i.e. the set of hypotheses still plausible after having considered an input set of labeled samples.

In the following we will uniformly refer to both L and one of its observed subsets L_k, blurring the differences between the two concepts whenever the context clarifies which one is actually involved.

²The first annotation can also be conveniently seen as the specification of the desired attribute.
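To make admissibility and the shrinking version space concrete, the following minimal Java sketch replays the example above, representing each rule by the vector of values it extracts (null plays nil). All names are illustrative assumptions, not the paper's prototype.

    import java.util.*;
    import java.util.function.Function;

    public class AdmissibilityDemo {
        record Sample(String page, String value, boolean positive) {}

        // L(r): r agrees with every labeled sample seen so far
        static boolean admissible(Function<String, String> rule, List<Sample> L) {
            for (Sample s : L) {
                boolean extractsIt = s.value().equals(rule.apply(s.page()));
                if (s.positive() != extractsIt) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            // r1 = abs(5,2) and r2 = right-of('Email'), as in Figure 1(c)
            Map<String, String> r1 = Map.of(
                "pr", "Full Prof.", "pm", "mark@usm.edu", "pb", "123-454-3210");
            Map<String, String> r2 = Map.of(
                "pr", "rick@usr.edu", "pm", "mark@usm.edu", "pb", "bill@usg.edu");
            List<Sample> L = List.of(
                new Sample("pm", "mark@usm.edu", true),    // v0+
                new Sample("pb", "123-454-3210", false));  // v2-
            System.out.println("r1 admissible: " + admissible(r1::get, L)); // false
            System.out.println("r2 admissible: " + admissible(r2::get, L)); // true
        }
    }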
It can always be decided whether a rule extracting the desired vector exists. However, since it is not known in advance whether that rule is in the set of all candidate rules, the only certain way to be sure of its presence is to check every single page [1]. In order to minimize the number of MQ, we develop a probabilistic characterization of the rules, and we restate our problem in that sense.

Problem Definition: Given a hierarchy of classes of extraction rules {R_h}_{0≤h≤m} over a set of pages U, and a threshold δ, find either a correct rule r ∈ R_m, or conclude that no rule is correct in any R_h with probability greater than 1 − δ, by minimizing the length of the input l.s.s. L.

4. A PROBABILISTIC MODEL FOR STRUCTURAL RISK MINIMIZATION
We introduce a probabilistic model for evaluating the correctness of an extraction rule, given a l.s.s. L and a class of rules R. As a consequence, the model is also able to compute the probability that a correct extraction rule has not been generated. The notation used for the main events covered by our analysis, and for their probabilities, is summarized in Table 1.

Table 1: The main events considered in the bayesian analysis

  Notation                       Event
  P(r) = P(r(U))                 prior probability that r ∈ R is correct
  P(R) / P̄(R)                   prior probability that some/no rule in R is correct
  P(r|L_k)                       probability that r is correct, once a l.s.s. L_k is observed
  P(R̄|L_k)                      probability that the correct rule is not in R, observed L_k
  P(v_k^l|r, L_k)                likelihood of v_k^l if r is correct, observed L_k
  P(v_k^l|R̄, L_k)               likelihood of v_k^l if the correct rule is not in R_{L_k}, observed L_k
  P(L_{k+1}) = P(v_k^l, L_k)     probability of a l.s.s. L_{k+1}

We assume that the probability of an extraction rule is determined by the extracted values: r1(U) = r2(U) ⇒ P(r1) = P(r2).

Given a new labeled sample v_k^l that forms L_{k+1} = {v_k^l} ∪ L_k, we denote with P(L_{k+1}) the probability P(v_k^l, L_k). By applying Bayes' theorem, the probabilities of the two main events of interest are:

$$P(r \mid L_{k+1}) = \frac{P(v_k^l \mid r, L_k)\,P(r \mid L_k)}{P(v_k^l \mid L_k)} \qquad (1)$$

$$P(\bar{R} \mid L_{k+1}) = \frac{P(v_k^l \mid \bar{R}, L_k)\,P(\bar{R} \mid L_k)}{P(v_k^l \mid L_k)} \qquad (2)$$

where P(v_k^l | L_k) is a normalization factor that can be expressed as:

$$P(v_k^l \mid L_k) = \sum_{r \in R_{L_k}} P(v_k^l \mid r, L_k)\,P(r \mid L_k) \;+\; P(v_k^l \mid \bar{R}, L_k)\,P(\bar{R} \mid L_k)$$

For any k, P(r|L_{k+1}) and P(R̄|L_{k+1}) can thus be defined iteratively by means of P(v_k^l|r, L_k), P(v_k^l|R̄, L_k), P(R̄|L_k) and P(r|L_k). In turn, P(v_k^l|r, L_k) and P(v_k^l|R̄, L_k) can be defined by abstracting the actual process that leads to the observed l.s.s. into a simple generative model. This is essentially equivalent to defining a p.d.f. over every l.s.s.

By repeatedly applying the bayesian updating rules expressed by equations 1 and 2, the model allows the computation of P(R̄|L_{k+1}) and P(r|L_{k+1}) for any k, starting from the prior probabilities P(R̄|L_0) = P̄(R), that no correct rule has been generated, and P(r|L_0) = P(r), that r is a correct rule. The iteration continues as long as admissible rules exist, i.e. while R_{L_k} ≠ ∅.
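The following minimal Java sketch runs one update step of equations (1) and (2) on illustrative numbers, assuming uniform priors over three admissible rules and likelihoods computed with the uniform generative model of Sec. 4.1. None of the numbers comes from the paper.

    public class BayesStepDemo {
        public static void main(String[] args) {
            int nQueryable = 10;                 // |V^R_{L_k}(U)|, still-queryable values
            double pBar = 0.05;                  // P(R_bar | L_k)
            double pRule = (1 - pBar) / 3;       // uniform P(r | L_k) over r1, r2, r3

            // Suppose v_k gets a positive label and is extracted by r1 and r2
            // only; then P(v_k^+ | r3, L_k) = 0 and r3 drops out of the sum.
            double likIfCorrect = 1.0 / nQueryable;        // P(v_k^l | r, L_k)
            double likIfMissing = 1.0 / (2 * nQueryable);  // eq. (3), Sec. 4.1

            double norm = 2 * likIfCorrect * pRule + likIfMissing * pBar;
            double postRule = likIfCorrect * pRule / norm;  // eq. (1), for r1 or r2
            double postBar = likIfMissing * pBar / norm;    // eq. (2)
            System.out.printf("P(r1|L_{k+1}) = %.3f, P(R_bar|L_{k+1}) = %.3f%n",
                              postRule, postBar);
        }
    }

The update concentrates probability mass on the rules that keep agreeing with the labels, while the mass of R̄ grows only when the observed labels are unlikely under every candidate rule.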
4.1 Generative Model
In this section we describe a simple generative model for the l.s.s., useful for deriving the posterior probability P(r|L_k) that r(U) is the correct vector to extract, once a l.s.s. L_{k+1} has been observed. This vector is not known in advance, but the values forming the l.s.s. L_{k+1} will be labeled as either positive or negative according to it.

Let P(R) be the prior probability that the correct vector can be extracted by a rule belonging to R. We suppose that the acquisition of a new labeled sample v_k to form L_{k+1} from L_k follows a uniform p.d.f. amongst all values still queryable, i.e. the values in V^R_{L_k}(U) = V̂^R_{L_k}(U) \ L_k. Similarly, given a correct rule r, let V^+(L_k, r) = V^R_{L_k}(U) ∩ r(U) denote the set of all and only the values that can form new positive samples, and V^−(L_k, r) = V^R_{L_k}(U) \ V^+(L_k, r) the set of values that can form negative samples. It follows:

$$P(v_k^l \mid r, L_k) = \begin{cases} \frac{1}{|V^R_{L_k}(U)|} & \text{iff } v_k \in V^l(L_k, r) \\[4pt] 0 & \text{otherwise} \end{cases}$$

Similarly, we can compute P(v_k^l|R̄, L_k) following an approach based on a uniform p.d.f. over all possible values. These are essentially all the values in V^R_{L_k}(U), but only the values in V^R_{L_k}(U) ∩ R_{L_k}(U) can be labeled either positive or negative (and we assume with the same probability), while the values in V^R_{L_k}(U) \ R_{L_k}(U) will surely be labeled negative. Therefore, it follows that:

$$P(v_k^l \mid \bar{R}, L_k) = \begin{cases} \frac{1}{2\cdot|V^R_{L_k}(U)\,\cap\,R_{L_k}(U)|} & \text{iff } v_k \in V^R_{L_k}(U) \cap R_{L_k}(U) \\[4pt] \frac{1}{|V^R_{L_k}(U)\,\setminus\,R_{L_k}(U)|} & \text{iff } l = - \text{ and } v_k \in V^R_{L_k}(U) \setminus R_{L_k}(U) \\[4pt] 0 & \text{iff } v_k \not\in V^R_{L_k}(U) \end{cases}$$

Note that the exact computation of the set R_{L_k}(U) can be expensive: given a value v ∈ V^R_{L_k}(U), in order to figure out whether v ∈ R_{L_k}(U), we should enumerate a potentially very large number of vectors in R_{L_k}(U). We adopt an approximate and efficient solution, based on the assumption that the equivalences holding for k = 0,³ namely V^R_{L_k}(U) ∩ R_{L_k}(U) = V^R_{L_k}(U) and V^R_{L_k}(U) \ R_{L_k}(U) = ∅, also hold for any k > 0. Hence, the likelihood can be rewritten as:

$$P(v_k^l \mid \bar{R}, L_k) \simeq \begin{cases} \frac{1}{2\cdot|V^R_{L_k}(U)|} & \text{iff } v_k \in V^R_{L_k}(U) \\[4pt] 0 & \text{iff } v_k \not\in V^R_{L_k}(U) \end{cases} \qquad (3)$$

Actually, this is an oversimplification when k gets bigger and approaches |U|: both R_{L_k}(U) and V^R_{L_k}(U) get smaller and smaller, and V^R_{L_k}(U) \ R_{L_k}(U) ≠ ∅. Since our algorithm looks for the correct rule while minimizing k, in our setting this simplification does not significantly affect the results.

³Admitting that every value is extracted by at least one rule.

4.2 A-priori Probabilities
Our probabilistic model is based on the following priors: the prior probability P(R_h) that a correct rule has been generated in the set of rules R_h, and the prior probability P(r) that the extraction rule r extracts the correct vector of values from the input set of pages U. As regards the latter prior, P(r), it is set by using a uniform p.d.f. over all rules in R_{L_1}.

For the former priors, P(R_h), we follow a standard approach, and estimate the priors on a sufficiently large set of attributes as the frequency of the involved events: P(R_h) has been set equal to n_h/N, where n_h is the number of attributes extracted by a rule in R_h and not in R_{h−1}, and N is the number of attributes used in our estimation (we sampled N = 356 attributes). For example, 95% of the attributes can be extracted by rules in R_3, while to cover the remaining 5% of attributes we should stretch our set of candidate rules to R_10.
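As a complement, this small Java sketch shows the frequency-based estimation of the priors P(R_h) described above; the counts are hypothetical stand-ins, not the figures measured on the paper's N = 356 attributes.

    public class PriorEstimationDemo {
        public static void main(String[] args) {
            // n_h = attributes whose correct rule first appears at level h
            int[] nh = {280, 30, 18, 10, 6};   // hypothetical counts, n_0 .. n_4
            int N = 356;                       // attributes used in the estimation
            double coverage = 0;
            for (int h = 0; h < nh.length; h++) {
                double prior = (double) nh[h] / N;   // P(R_h)
                coverage += prior;                   // fraction solvable within R_h
                System.out.printf("h=%d  P(R_h)=%.3f  coverage=%.3f%n",
                                  h, prior, coverage);
            }
        }
    }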
5. LEARNING EXTRACTION RULES
The probabilistic model just developed aims at computing, observed an l.s.s. L_{k+1}, the probability P(r|L_{k+1}) that a given extraction rule r within a set of candidate rules R is correct, and the probability P(R̄_{L_{k+1}}) that the correct rule has not been generated at all, i.e. it is not inside R.

Here, we start by presenting an active learning algorithm that exploits these probabilities (Sec. 5.1); then, we describe a strategy to choose the next sample to be labeled (Sec. 5.2); finally, we discuss how the set of candidate rules is dynamically expanded according to the "observed need of expressiveness", rather than statically predetermined (Sec. 5.3).

Listing 1: An active learning algorithm for extraction rules

Input: a set of sample pages U = {p1, ..., p|U|}
Input: an initial positive sample L_1 = {v0+}
Parameter: a hierarchy of generable rules {R_h} over U
Output: P(r|L_{k+1}) for r ∈ (R_h)_{L_{k+1}}, and P(R̄_h|L_{k+1})

 1: let k ← 1; let h ← 0;
 2: let R ← R_h; R ← R_{L_1};
 3: while (R ≠ ∅ and not halt(R, L_k)) do
 4:     v_k ← chooseSample(R, L_k);
 5:     l ← oracle(v_k);
 6:     L_{k+1} ← L_k ∪ {v_k^l};
 7:     R ← (R_h)_{L_{k+1}};
 8:     compute P(r|L_{k+1}), ∀r ∈ R, according to eq. 1;
 9:     compute P(R̄_h|L_{k+1}) according to eq. 2;
10:     h ← h + expandRuleSet(R, L_{k+1});
11:     k ← k + 1;
12: end while
13: if (R ≠ ∅) then
14:     return (R_h)_{L_{k+1}}, P(r|L_{k+1}) and P(R̄_h|L_{k+1});
15: end if
16: return ⊥;

5.1 Active Learning Algorithm
As shown by the pseudo-code in Listing 1, our algorithm takes as input a labeled sample sequence L built by actively [12] asking an oracle (here modeled by means of the subprogram oracle()) for the label of a sample chosen by the subprogram chooseSample(). The algorithm is parametric wrt a class of extraction rules decomposed into a hierarchy of subclasses {R_h}, and it makes use of the probabilistic model detailed before (lines 8-9).

Initially, R_0 is taken as the initial set of candidate rules, and the set of rules admissible wrt the initial sample, (R_0)_{L_1}, is computed (lines 1-2). In every iteration, the oracle is asked to label a new sample v_k (lines 4-5), and the l.s.s. is updated to obtain L_{k+1} (line 6). Then, the set of admissible rules is updated (line 7) (recall that R_{L_{k+1}} ⊆ R_{L_k}), and the probabilities P(r|L_{k+1}) and P(R̄_{L_{k+1}}) are consequently updated (lines 8-9). expandRuleSet() decides whether the set of candidate rules should be expanded (line 10).

This algorithm can be instantiated by appropriately choosing the semantics of three subprograms: chooseSample(), which selects the next sample to be labeled by the user; halt(), which establishes an exit criterion before the l.s.s. naturally expires (i.e. before R becomes empty); and finally, expandRuleSet(), which decides at runtime whether R_h should be expanded with new candidate rules by incrementing h. The latter decision is usually based on P(R̄_h|L_{k+1}): the higher its value, the more likely it is that new candidate rules are needed.

The implementation of halt() strongly depends on the overall goal of the search strategy. We leave further investigation of this aspect to future work; in the present paper, for illustrative purposes, we use a minimum threshold λr on the probability of the best rule:

halt(R, L) { return (max_{r∈R_L} P(r|L) > λr); }

In the following, we detail chooseSample() and expandRuleSet(), respectively.
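Before detailing them, the following compact Java paraphrase of Listing 1 may clarify its control flow. Rules are modeled as the vectors they extract (as in Sec. 3), while the probabilistic halt and expansion tests are replaced by a crude "expand when the version space empties" stand-in; every name is illustrative, not the prototype's.

    import java.util.*;

    public class ActiveLearningLoop {
        interface Oracle { boolean label(String page, String value); }  // one MQ
        record Sample(String page, String value, boolean positive) {}

        static boolean admissible(Map<String, String> r, List<Sample> L) {
            return L.stream().allMatch(
                s -> s.positive() == s.value().equals(r.get(s.page())));
        }

        static List<Map<String, String>> run(List<List<Map<String, String>>> hierarchy,
                                             Oracle oracle, List<String> pages) {
            int h = 0;
            List<Sample> L = new ArrayList<>();                    // the l.s.s.
            List<Map<String, String>> R = new ArrayList<>(hierarchy.get(0));
            for (String p : pages) {
                if (R.isEmpty() && h + 1 < hierarchy.size())       // expandRuleSet()
                    hierarchy.get(++h).stream()
                             .filter(r -> admissible(r, L)).forEach(R::add);
                if (R.isEmpty()) break;
                String v = R.get(0).get(p);                        // chooseSample(), simplified
                if (v == null) continue;
                L.add(new Sample(p, v, oracle.label(p, v)));       // oracle()
                R.removeIf(r -> !admissible(r, L));                // R_{L_{k+1}}
            }
            return R;                                              // surviving version space
        }

        public static void main(String[] args) {
            Map<String, String> truth = Map.of("pr", "rick@usr.edu", "pm", "mark@usm.edu");
            Map<String, String> r1 = Map.of("pr", "Full Prof.", "pm", "mark@usm.edu");
            Oracle oracle = (p, v) -> v.equals(truth.get(p));      // answers each MQ
            // R_0 = {r1}; R_1 also adds the vector of right-of('Email')
            System.out.println(run(List.of(List.of(r1), List.of(r1, truth)),
                                   oracle, List.of("pr", "pm", "pr")));
        }
    }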
5.2 Choosing the Right Samples
For instantiating chooseSample() we propose two variants: Entropy, plus a baseline algorithm, Random.

Random: It chooses a random admissible sample:

chooseSample(R, L) { return a random v ∈ V^R_L(U); }

It serves two purposes: as a baseline against other strategies, and as a measure of the sample complexity of the extraction task.

Entropy: It bases the sample choice on the p.d.f. of the extracted values: a simple strategy is to choose the sample on which the rules most disagree, appropriately weighted according to their probability. This is equivalent to computing the vote entropy [12] for each v ∈ V^R_{L_k}(U) (a sketch is given at the end of this section):

$$H(v) = -\left[P(v^+ \mid L_k)\,\log P(v^+ \mid L_k) + P(v^- \mid L_k)\,\log P(v^- \mid L_k)\right] \qquad (4)$$

where:

$$P(v^+ \mid L_k) = \sum_{r \in \{r \in R_{L_k} :\, r(p_v) = v\}} P(r \mid L_k) \qquad\quad P(v^- \mid L_k) = \sum_{r \in \{r \in R_{L_k} :\, r(p_v) \neq v\}} P(r \mid L_k)$$

are the probabilities that v is respectively either a value to extract or an incorrect value. The next sample is the one maximizing the vote entropy:

chooseSample(R, L) { return argmax_{v ∈ V^R_L(U)} H(v); }

Note that this choice essentially resolves the most uncertain sample.

5.3 Dynamically Expanding the Rule Set
expandRuleSet() is in charge of deciding whether and when to expand the hierarchy of candidate rules R_h. It makes use of the probability P(R̄_h|L_k) that the correct rule is not present in the current set of candidate rules R_h after observing as input a given l.s.s. L_k.

We leave a thorough discussion of the best criteria underlying this analysis to future work; in this paper we use a simple implementation of expandRuleSet() based on a predefined fixed threshold λR over P(R̄_{L_k}):

expandRuleSet(R, L) {
    if (R = R_m) return 0;          // max expansion reached
    else if (P(R̄_L) > λR) return +1;
    else return 0;
}

The set of rules is therefore enlarged lazily, i.e. only whenever, according to P(R̄_L), there is evidence that a correct rule is not amongst the currently available candidate rules.
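To illustrate the Entropy variant of Sec. 5.2, the following minimal Java sketch computes the vote entropy of equation (4) over a handful of values; the per-value probabilities P(v+|L_k) are illustrative stand-ins for the masses that would be derived from the current P(r|L_k).

    import java.util.*;

    public class VoteEntropyDemo {
        // H(v) from eq. (4), simplified by taking P(v-|L_k) = 1 - P(v+|L_k)
        static double entropy(double pPlus) {
            double h = 0;
            if (pPlus > 0) h -= pPlus * Math.log(pPlus);
            if (1 - pPlus > 0) h -= (1 - pPlus) * Math.log(1 - pPlus);
            return h;
        }

        public static void main(String[] args) {
            Map<String, Double> pPlus = Map.of(
                "rick@usr.edu", 0.95,   // rules mostly agree: low entropy
                "Full Prof.", 0.50,     // rules split evenly: maximal entropy
                "123-454-3210", 0.10);
            String next = pPlus.entrySet().stream()
                .max(Comparator.comparingDouble(e -> entropy(e.getValue())))
                .orElseThrow().getKey();
            System.out.println("next MQ on: " + next);  // Full Prof.
        }
    }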
6. EXPERIMENTS AND FUTURE WORK
We built a Java prototype implementation of our algorithms. We report the results of some preliminary experiments, mainly focused on evaluating the effectiveness of the SRM technique. We downloaded pages from 101 websites publishing information in several domains. For each website, we downloaded a small set of sample pages (about 20) sharing a common HTML template, and considered as relevant 2-4 attributes, for which we manually crafted a golden extraction rule to obtain the correct values as a reference, totally considering 240 attributes.

The results have been collected by running our prototype with λr = 0.99 and λR = 0.95. Table 2 reports: the (rounded) average number of MQ posed by Random over 10 executions (Rnd);⁴ the aggregated number of attributes extracted by Random (#A); the number of MQ posed by Entropy with R_5 and SRM disabled (H); and the number of MQ when SRM starts from the class of extraction rules R_i, with i = 0, 1, 2, 3 (SRM-R_i). The last row reports the average precision of the values extracted by the output rules.

Table 2: Experimental results

                     Static   |-------- SRM --------|
  Rnd    #A          H        R0      R1      R2      R3
  ≥9     10          4        1.90    2.33    2.89    3.56
  8      16          4.12     1.56    1.78    3.22    3.25
  7      12          4.08     1.66    1.69    2.7     3.63
  6      15          3.86     1.26    1.58    3.22    3.40
  5      15          3.5      1.55    2.06    2.39    2.77
  4      32          3.58     1.46    1.96    2.48    3.13
  3      53          3.45     1.55    2.06    2.52    3.14
  2      55          3.01     1.76    1.54    2.11    2.38
  1      128         2.27     1.59    1.65    1.71    1.76
  Precision          96.2%    86.7%   93.79%  95.85%  96.2%

SRM always saves MQ. Whenever the initial expressiveness is low, the number of MQ saved is higher, but the precision suffers. This can be explained by expandRuleSet() making wrong decisions, i.e. it trusts the current imprecise rules rather than betting on more expressive classes. Note that even Entropy cannot reach 100% precision, since there exist attributes that need rules outside R_5.

Since our dataset consists of a small number of pages per site, it is not suitable for evaluating the sampling issue, which we leave to future work. To overcome the precision loss of the SRM approach, it is possible to add further information, e.g. type analysis (to give more weight to vectors with homogeneous types). Also, we aim at integrating SRM with automatic annotators that rely on data available in external sources, such as Freebase, and with crowdsourcing platforms.

⁴These values represent the sample complexity.

7. REFERENCES
[1] D. Angluin. Queries revisited. Theor. Comput. Sci., 313(2):175–194, 2004.
[2] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD 2003.
[3] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3), 2010.
[4] C.-H. Chang and S.-C. Lui. IEPAD: information extraction based on pattern discovery. In WWW 2001.
[5] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5):731–779, 2004.
[6] V. Crescenzi and P. Merialdo. Wrapper inference for ambiguous web pages. JAAI, 22(1&2):21–52, 2008.
[7] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219–230, 2011.
[8] T. Furche, G. Gottlob, G. Grasso, O. Gunes, X. Guo, A. Kravchenko, G. Orsi, C. Schallhart, A. J. Sellers, and C. Wang. DIADEM: domain-centric, intelligent, automated data extraction methodology. In WWW (Companion Volume) 2012.
[9] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, and S. Flesca. The Lixto data extraction project - back and forth between theory and practice. In PODS 2004.
[10] U. Irmak and T. Suel. Interactive wrapper generation with minimal user effort. In WWW 2006.
[11] I. Muslea, S. Minton, and C. A. Knoblock. Active learning with multiple views. JAIR, 27:203–233, 2006.
[12] B. Settles. Active learning literature survey. CS T.R. 1648, Univ. of Wisconsin–Madison, 2009.
[13] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[14] V. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
[15] Y. Zhai and B. Liu. Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng., 18(12):1614–1628, 2006.