Minimizing the Costs of the Training Data for Learning Web Wrappers

Rolando Creo, Valter Crescenzi, Disheng Qiu, Paolo Merialdo
Dipartimento di Informatica ed Automazione
Università degli Studi Roma Tre
Via della Vasca Navale, 79 – Rome, Italy
{creor, disheng, crescenz, merialdo}@dia.uniroma3.it

ABSTRACT
Data extraction from the Web is an important issue. Several approaches have been developed to bring the wrapper generation process to the web scale. Although they rely on different techniques and formalisms, they all learn a wrapper given a set of sample pages. Unsupervised approaches require just a set of sample pages; supervised ones also need training data. Unfortunately, the accuracy obtained by unsupervised techniques is not sufficient for many applications. On the other hand, obtaining training data is not cheap at the web scale. This paper addresses the issue of minimizing the costs of collecting training data for learning web wrappers. We show that two interleaved problems affect this issue: the choice of the sample pages, and the expressiveness of the wrapper language. We propose a solution that leverages contributions from the field of learning theory, and we discuss the promising results of an experimental evaluation of our approach.

1. INTRODUCTION
The huge amount of information available on the web has inspired several research efforts towards the development of tools and techniques to infer web wrappers for extracting data from script-generated HTML pages.

Unsupervised approaches take as input a set of sample pages and analyze regularities and differences to infer a wrapper based on the underlying HTML template [6, 2]. They can scale over the number of sources, but the accuracy of the generated wrappers is limited. Supervised approaches can produce accurate wrappers, but their scalability is limited because they need training data, i.e. annotations over the values published in the pages. In early approaches these data were provided by means of human intervention [9]. More recently, solutions that rely on data stored in existing repositories have been proposed [7]. Unfortunately, in many domains suitable training data do not exist at all (consider pages that publish subjective values, such as customer ratings, or real-time data, such as stock quote prices), or they might be biased towards specific instances (typically the most popular ones). This prevents the generation of a wrapper that is correct over the broader set of instances, and thus raises the need for additional training data.

Overall, obtaining training data represents a relevant cost in the wrapper inference process. This paper presents a framework to minimize the cost of generating a wrapper, expressed as the number of membership queries (MQ) needed by a supervised inference system for the training [1]. Membership queries are the simplest form of queries, since they admit only a yes/no answer, e.g. "is rick@usr.edu a value to extract?". It is worth observing that the simplicity of these queries makes them suitable to be answered through crowdsourcing platforms, which make the training costs explicit.

As the following example illustrates, these costs depend on two interrelated features: the representativeness of the sample and the expressiveness of the extraction language.

The Sampling Problem. Suppose we are interested in wrapping pages containing information about professors. For the sake of simplicity, let us represent pages as tables, where data is organized in rows and columns. Figure 1(a) shows sample pages depicted according to this simplification. A wrapper can be described as a set of extraction rules. In our abstraction, an extraction rule specifies the cell containing the relevant data, and it can be expressed by absolute coordinates (e.g. first row, second column), or by relative coordinates, that is, with respect to another cell (e.g. the first cell located at the right of the cell containing 'Email'). Correspondingly, suppose that we choose to adopt XPath extraction rules of one of two possible forms: absolute extraction rules (e.g. /html[1]/table[1]/tr[x]/td[y]), which we denote abs(x,y), and relative extraction rules (e.g. //tr[td[contains(.,'x')]]/td[2]), which we denote right-of('x').

(a) sample pages:

  Rick                               Mark                               Bill
  Home        rick.usr.edu           Home        mark.usm.edu           Affiliation  SE Dept. Univ. of G.
  Awards      Best paper             Affiliation IT Dept. Univ. of M.   Position     Ass. Prof.
  Affiliation CS Dept. Univ. of R.   Position    Ass. Prof.             Email        bill@usg.edu
  Position    Full Prof.             Email       mark@usm.edu           Phone        123-454-3210
  Email       rick@usr.edu           Phone       987-654-3210
  Phone       123-345-6789

  (the initial annotation v0+ = mark@usm.edu marks Mark's email)

(b) extraction rules:

  r1  abs(5,2)
  r2  right-of('Email')
  r3  above('987-654-3210')
  r4  below('Ass. Prof.')

(c) values extracted:

        r1            r2            r3            r4
  pr    Full Prof.    rick@usr.edu  nil           nil
  pm    mark@usm.edu  mark@usm.edu  mark@usm.edu  mark@usm.edu
  pb    123-454-3210  bill@usg.edu  nil           bill@usg.edu

Figure 1: Running Example

For example, according to Rick's page, candidate extraction rules for Name are abs(1,1) and above('Home'). Similarly, rules for Position are abs(5,2) and right-of('Position'). Now suppose that the Position of professors is a relevant attribute to extract. If the sample set is composed only of awarded professors (such as Rick), the inferred rules might not work for the broader set of all professors, including those without any award (such as Mark and Bill). For example, the rule abs(5,2) for Position might work for all the awarded professors, but it does not extract the position of the other professors.

The usual approach to address this issue is to work with a large set of annotations that hopefully covers all the possible types of target pages. However, according to our cost model, this strategy is inefficient.
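To make the running example concrete, the following minimal Java sketch evaluates the two rule forms over pages modeled as row/column matrices, mirroring Figure 1(a). The class and method names are illustrative only, not the paper's prototype.

    import java.util.*;

    public class RuleDemo {
        // abs(x,y): absolute coordinates, 1-based as in the paper; null plays nil
        static String abs(List<String[]> page, int row, int col) {
            if (row > page.size() || col > page.get(row - 1).length) return null;
            return page.get(row - 1)[col - 1];
        }
        // right-of('x'): the cell at the right of the cell containing the pivot
        static String rightOf(List<String[]> page, String pivot) {
            for (String[] r : page)
                if (r.length > 1 && r[0].equals(pivot)) return r[1];
            return null;
        }
        public static void main(String[] args) {
            List<String[]> rick = List.of(
                new String[]{"Rick"},
                new String[]{"Home", "rick.usr.edu"},
                new String[]{"Awards", "Best paper"},
                new String[]{"Affiliation", "CS Dept. Univ. of R."},
                new String[]{"Position", "Full Prof."},
                new String[]{"Email", "rick@usr.edu"},
                new String[]{"Phone", "123-345-6789"});
            List<String[]> mark = List.of(
                new String[]{"Mark"},
                new String[]{"Home", "mark.usm.edu"},
                new String[]{"Affiliation", "IT Dept. Univ. of M."},
                new String[]{"Position", "Ass. Prof."},
                new String[]{"Email", "mark@usm.edu"},
                new String[]{"Phone", "987-654-3210"});
            // The absolute rule learned on the awarded professor does not generalize:
            System.out.println(abs(rick, 5, 2));            // Full Prof.
            System.out.println(abs(mark, 5, 2));            // mark@usm.edu
            System.out.println(rightOf(mark, "Position"));  // Ass. Prof.
        }
    }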
The Expressiveness Problem. Consider again the running example, and suppose we are interested in extracting the professors' Name and Home. We now show that the size of the sample set actually depends on the expressiveness of the language used to specify the rules, and hence on the set of available rules.

Suppose that we choose to adopt only absolute extraction rules: a correct rule for Name is abs(1,1). Note that only one labeled sample would suffice to infer this rule. Suppose now to adopt a more expressive language, which also includes relative rules. Using just one page, say Rick's, several rules are generated to extract Name: abs(1,1), above('Home'), above('rick.usr.edu'). To determine the correct rule, at least another well-chosen example is required: only with the help of Mark's page do we have the evidence that the rule above('rick.usr.edu') does not work.

To summarize, the more expressive the model, the larger the size of the representative sample set [1, 5] (intuitively, the space of hypotheses is larger, and thus more examples are needed to discard the incorrect ones). However, the additional expressiveness should be carefully handled, since its actual need depends on the input pages and on the desired attributes, whereas it always entails additional costs.

The usual approach to address this issue is to work with overly expressive languages that cover all needs. However, as made explicit by our cost model, this strategy is inefficient.

Overview. In this paper we address the above issues. We show that they cannot be tackled separately, and we propose an approach that carefully handles both the expressiveness of the wrapping language and the choice of the samples (i.e. the pages to be annotated).

Our approach, inspired by a statistical learning technique [14, 13], enlarges the expressiveness of the language at runtime. We organize the class of candidate rules R into a hierarchy of classes {R_h}_{0≤h≤m} of increasing size: initially the correct rule is searched only within R_0, and then the set of available extraction rules might be expanded to a larger class R_h. The goal is to achieve the expressiveness of the largest classes only whenever it is detected as actually needed. For all those cases that can be solved within the smaller classes of rules, the algorithm will save many samples.

The approach is independent of the details of the formalism used to express the extraction rules. Our hierarchy {R_h} makes use of simple rules, as follows: R_0 is the class of absolute XPath rules described above; R_h, for 0 < h ≤ m, is obtained by adding to R_{h−1} the class of relative XPath rules with distance at most h from a pivoting leaf to the value.¹ However, this is just one possibility: the hierarchy can be built with other classes of rules (a sketch of this organization is given at the end of this section).

A key decision is whether and when to expand the set of candidate rules. We introduce a probabilistic model to dynamically characterize the probability of correctness of the rules in the current class of candidate rules R_h. We propose an original active learning algorithm [12] that exploits the probabilistic model to enlarge R_h lazily, i.e. only whenever there is enough evidence that the correct rule is not amongst the current set of candidates. The algorithm actively chooses the next membership query to find a representative sample set: it selects a sample page and poses a membership query on an extracted value. By accurately choosing this value, the number of queries can be minimized.

The paper is organized as follows: Section 2 discusses related work; Section 3 formalizes our setting and states the problem definition; Section 4 develops a probabilistic model to characterize the correctness of extraction rules; Section 5 presents an active learning algorithm for extraction rules based on the model; finally, Section 6 discusses our preliminary experiments with a set of sources from the Web.

¹In the current prototype all textual leaves are used as candidate pivots. The distance from the pivot to the extracted node is measured according to the number of edges crossed in the DOM representation of the HTML pages, but considering contiguous siblings at distance 1.
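The following minimal Java sketch shows one way the hierarchy {R_h} could be materialized, bucketing candidate rules by their pivot-to-value distance (level 0 for absolute rules). The Rule representation and the distance values assigned below are illustrative assumptions, not the prototype's actual data structures.

    import java.util.*;

    public class HierarchyDemo {
        // distance 0 marks an absolute rule; h > 0 a relative rule whose pivot
        // lies at distance h from the extracted value
        record Rule(String xpath, int pivotDistance) {}

        // R_h: every generable rule whose level does not exceed h
        static List<Rule> classUpTo(List<Rule> generable, int h) {
            return generable.stream().filter(r -> r.pivotDistance() <= h).toList();
        }

        public static void main(String[] args) {
            List<Rule> generable = List.of(
                new Rule("/html[1]/table[1]/tr[5]/td[2]", 0),       // abs(5,2)
                new Rule("//tr[td[contains(.,'Email')]]/td[2]", 1), // right-of('Email')
                new Rule("//tr[td[contains(.,'987-654-3210')]]"
                         + "/preceding-sibling::tr[1]/td[2]", 2));  // above(...)
            for (int h = 0; h <= 2; h++)
                System.out.println("R_" + h + ": " + classUpTo(generable, h).size() + " rules");
        }
    }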
2. RELATED WORK
In machine learning, the number of labeled samples needed by a supervised learning algorithm to infer a good hypothesis is called sample complexity, and it has been studied from several perspectives. For instance, similarly to our setting, [1] discusses the problem of exactly inferring a concept, i.e. a set of elements, by means of membership queries, i.e. questions of the type "is this an element of the target concept?". However, the main idea underlying our approach has been proposed by the statistical learning community [14], in which a loss function is given in order to characterize the quality of the produced hypothesis. The structural risk minimization (SRM) technique [13], i.e. the decomposition of the set of hypotheses into a hierarchy of subclasses, aims at avoiding the overfitting problem: since the class of hypotheses studied by this community might be so expressive as to arbitrarily reduce the loss, a trade-off with other quality criteria is needed to prevent the learning algorithm from selecting the hypothesis that describes the training data perfectly, rather than their underlying patterns.

Many researchers have proposed variations of the learning paradigm to make it practically feasible in different applicative contexts: the learning approaches in which the inference algorithm is free to choose which sample to label next are usually called active [12]. These have recently gained interest since, as clarified in [3], they might produce exponential improvements in the number of samples wrt traditional supervised approaches.

To the best of our knowledge, and differently from our proposal, all the approaches for inferring wrappers over structured websites developed by the wrapper community [6, 2, 9, 4, 15] define the set of hypotheses statically, i.e. before performing the inference; once set, the set of candidate rules cannot be changed without seriously revisiting the inference algorithm. Therefore they oversize the expressiveness of the formal language used to specify the extraction rules, and additional samples are required only to compensate for the excess of expressiveness.

Active learning approaches for wrapper induction have been proposed in [11, 10]. However, also in these works the expressiveness is statically defined. Moreover, the latter approach requires complex user interaction, since the user has to choose the correct wrapper within a set of ranked proposals.

A few recent proposals try to scale wrapper inference to the web [7, 8]. In [7] the authors leverage an available dataset, but it is not clear how they can avoid biased samples (as suggested by their own running example, which is based on popular objects), while [8] needs domain knowledge that only a human expert can provide.
3. PROBLEM DEFINITION
Preliminary Definitions: Let U = {p1, p2, ..., pn} be a set of pages. Every page publishes several attributes of interest (e.g. a professor's Position, Email, etc.). For simplicity, we develop the discussion concentrating on one attribute, and we assume that its values are either a textual leaf of the DOM tree representation of the pages, or a distinguished nil value. We write v ∈ p to denote that v is a value of the page p, p_v to denote the page in which the value v is located, and |p| to denote the number of values in p.

We refer to a generic extraction rule (or simply rule) r over the set of pages U as a concrete tool to build a vector of values indexed by the pages in U, such that r(p) ∈ p ∪ {nil}. Every rule extracts one vector of values from U, denoted r(U). Figure 1(c) shows the vectors extracted by the rules r1, r2, r3, r4 of Figure 1(b). We denote with R(U) the set of vectors obtained by applying a set of rules R over U, and blur the distinction between a rule and the vector it extracts from U. Note that |R(U)| ≤ |R|, with the strict inequality holding whenever the same vector is extracted by different rules.

We denote with {R_h} the hierarchy of classes of all the generable rules, i.e. the extraction rules that can be generated to extract vectors from U. The classes of this family are countable but potentially not finite. We will work with a finite restriction of them, called the set of generated rules, as discussed in the following.

Labeled Sample Sequences: We introduce the concept of labeled sample v^l, where v ∈ p_v is a value from a page p_v, and l ∈ {+, −} is either a positive or a negative label. In the following, v+ and v− denote a positive sample (or annotation) and a negative sample, respectively, that is, the two possible answers to a MQ.

A rule r is admissible wrt a set of samples L (denoted L(r)) iff it is compliant with the labels in the set:

$$L(r) \iff \forall v^l \in L:\;\begin{cases} l = + \;\Rightarrow\; r(p_v) = v \\ l = - \;\Rightarrow\; r(p_v) \neq v \end{cases}$$

The concept can be trivially extended to a set of rules R, and we denote with R_L = {r ∈ R : L(r)} the subset of admissible rules in R wrt L, and with V̂^R_L(U) all the values they extract from U: V̂^R_L(U) = {v : v = r(p), r ∈ R_L, p ∈ U}.

Example: Let pr, pm and pb be the pages in Figure 1(a), and let U = {pr, pm, pb}. The attribute Email is extracted by the rule r2 = right-of('Email'): two positive samples are v0+ = 'mark@usm.edu' and v1+ = 'bill@usg.edu'; a negative sample is v2− = '123-454-3210'. Observe that r2 is admissible wrt L = {v0, v1, v2}. Now consider another rule, r1 = abs(5,2), and the set of rules R = {r1, r2}. Then r1 is not admissible wrt L, since r1(pb) = '123-454-3210', which is the negative sample v2−. Hence, R_L = {r2} and V̂^R_L(U) = {v0, v1}.

In the following, given a set of rules R, we will only consider special ordered sets of samples, called labeled sample sequences, which are formed by an initial annotation, followed only by new values that are still admissible with respect to the samples already seen. Intuitively, a labeled sample sequence is the list of answers to the MQ posed to learn a rule.

Definition: A Labeled Sample Sequence (l.s.s.) L wrt a set of rules R and a set of pages U is specified by a sequence of labeled samples v0, ..., vk, ... that defines a sequence of (observed) sets L_k, with L_{k+1} = L_k ∪ {v_k} = {v0+, v1, ..., vk}, such that: (i) it begins with an annotation v0+ ≠ nil, and (ii) ∀k > 0, v_k ∈ V^R_{L_k}(U) = V̂^R_{L_k}(U) \ L_k.

The constraint (i) on the first annotation v0+ of the sequence is useful to get a finite² R_{L_1}, whereas the constraint (ii) on the remaining samples entails that the new sample v_k that forms L_{k+1} from L_k leads to smaller and smaller admissible sets: R_{L_{k+1}} ⊆ R_{L_k}. It is worth noting that R_{L_{k+1}} plays the role of what the learning communities call the version space [12], i.e. the set of hypotheses still plausible after having considered an input set of labeled samples.

In the following we will uniformly refer to both L and one of its observed subsets L_k, blurring the differences between the two concepts whenever the context clarifies which one is actually involved.

²The first annotation can also be conveniently seen as the specification of the desired attribute.
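To make admissibility and the shrinking version space concrete, the following minimal Java sketch replays the example above, representing each rule by the vector of values it extracts (null plays nil). All names are illustrative assumptions, not the paper's prototype.

    import java.util.*;
    import java.util.function.Function;

    public class AdmissibilityDemo {
        record Sample(String page, String value, boolean positive) {}

        // L(r): r agrees with every labeled sample seen so far
        static boolean admissible(Function<String, String> rule, List<Sample> L) {
            for (Sample s : L) {
                boolean extractsIt = s.value().equals(rule.apply(s.page()));
                if (s.positive() != extractsIt) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            // r1 = abs(5,2) and r2 = right-of('Email'), as in Figure 1(c)
            Map<String, String> r1 = Map.of(
                "pr", "Full Prof.", "pm", "mark@usm.edu", "pb", "123-454-3210");
            Map<String, String> r2 = Map.of(
                "pr", "rick@usr.edu", "pm", "mark@usm.edu", "pb", "bill@usg.edu");
            List<Sample> L = List.of(
                new Sample("pm", "mark@usm.edu", true),    // v0+
                new Sample("pb", "123-454-3210", false));  // v2-
            System.out.println("r1 admissible: " + admissible(r1::get, L)); // false
            System.out.println("r2 admissible: " + admissible(r2::get, L)); // true
        }
    }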
It can always be decided whether a rule extracting the desired vector exists. However, since it is not known in advance whether that rule is in the set of all candidate rules, the only certain way to be sure of its presence is to check every single page [1]. In order to minimize the number of MQ, we develop a probabilistic characterization of the rules, and we restate our problem in that sense.

Problem Definition: Given a hierarchy of classes of extraction rules {R_h}_{0≤h≤m} over a set of pages U, and a threshold δ, find either a correct rule r ∈ R_m, or conclude that no rule is correct in any R_h with probability greater than 1 − δ, by minimizing the length of the input l.s.s. L.

4. A PROBABILISTIC MODEL FOR STRUCTURAL RISK MINIMIZATION
We introduce a probabilistic model for evaluating the correctness of an extraction rule, given a l.s.s. L and a class of rules R. As a consequence, the model is also able to compute the probability that a correct extraction rule has not been generated. The notation used for the main events covered by our analysis, and for their probabilities, is summarized in Table 1.

Table 1: The main events considered in the bayesian analysis

  Notation                       Event
  P(r) = P(r(U))                 prior probability that r ∈ R is correct
  P(R) / P̄(R)                   prior probability that some/no rule in R is correct
  P(r|L_k)                       probability that r is correct, once a l.s.s. L_k is observed
  P(R̄|L_k)                      probability that the correct rule is not in R, observed L_k
  P(v_k^l|r, L_k)                likelihood of v_k^l if r is correct, observed L_k
  P(v_k^l|R̄, L_k)               likelihood of v_k^l if the correct rule is not in R_{L_k}, observed L_k
  P(L_{k+1}) = P(v_k^l, L_k)     probability of a l.s.s. L_{k+1}

We assume that the probability of an extraction rule is determined by the extracted values: r1(U) = r2(U) ⇒ P(r1) = P(r2).

Given a new labeled sample v_k^l that forms L_{k+1} = {v_k^l} ∪ L_k, we denote with P(L_{k+1}) the probability P(v_k^l, L_k). By applying Bayes' theorem, the probabilities of the two main events of interest are:

$$P(r \mid L_{k+1}) = \frac{P(v_k^l \mid r, L_k)\,P(r \mid L_k)}{P(v_k^l \mid L_k)} \qquad (1)$$

$$P(\bar{R} \mid L_{k+1}) = \frac{P(v_k^l \mid \bar{R}, L_k)\,P(\bar{R} \mid L_k)}{P(v_k^l \mid L_k)} \qquad (2)$$

where P(v_k^l | L_k) is a normalization factor that can be expressed as:

$$P(v_k^l \mid L_k) = \sum_{r \in R_{L_k}} P(v_k^l \mid r, L_k)\,P(r \mid L_k) \;+\; P(v_k^l \mid \bar{R}, L_k)\,P(\bar{R} \mid L_k)$$

For any k, P(r|L_{k+1}) and P(R̄|L_{k+1}) can thus be defined iteratively by means of P(v_k^l|r, L_k), P(v_k^l|R̄, L_k), P(R̄|L_k) and P(r|L_k). In turn, P(v_k^l|r, L_k) and P(v_k^l|R̄, L_k) can be defined by abstracting the actual process that leads to the observed l.s.s. into a simple generative model. This is essentially equivalent to defining a p.d.f. over every l.s.s.

By repeatedly applying the bayesian updating rules expressed by equations 1 and 2, the model allows the computation of P(R̄|L_{k+1}) and P(r|L_{k+1}) for any k, starting from the prior probabilities P(R̄|L_0) = P̄(R), that no correct rule has been generated, and P(r|L_0) = P(r), that r is a correct rule. The iteration continues as long as admissible rules exist, i.e. while R_{L_k} ≠ ∅.
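The following minimal Java sketch runs one update step of equations (1) and (2) on illustrative numbers, assuming uniform priors over three admissible rules and likelihoods computed with the uniform generative model of Sec. 4.1. None of the numbers comes from the paper.

    public class BayesStepDemo {
        public static void main(String[] args) {
            int nQueryable = 10;                 // |V^R_{L_k}(U)|, still-queryable values
            double pBar = 0.05;                  // P(R_bar | L_k)
            double pRule = (1 - pBar) / 3;       // uniform P(r | L_k) over r1, r2, r3

            // Suppose v_k gets a positive label and is extracted by r1 and r2
            // only; then P(v_k^+ | r3, L_k) = 0 and r3 drops out of the sum.
            double likIfCorrect = 1.0 / nQueryable;        // P(v_k^l | r, L_k)
            double likIfMissing = 1.0 / (2 * nQueryable);  // eq. (3), Sec. 4.1

            double norm = 2 * likIfCorrect * pRule + likIfMissing * pBar;
            double postRule = likIfCorrect * pRule / norm;  // eq. (1), for r1 or r2
            double postBar = likIfMissing * pBar / norm;    // eq. (2)
            System.out.printf("P(r1|L_{k+1}) = %.3f, P(R_bar|L_{k+1}) = %.3f%n",
                              postRule, postBar);
        }
    }

The update concentrates probability mass on the rules that keep agreeing with the labels, while the mass of R̄ grows only when the observed labels are unlikely under every candidate rule.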
4.1 Generative Model
In this section we describe a simple generative model for the l.s.s., useful for deriving the posterior probability P(r|L_k) that r(U) is the correct vector to extract, once a l.s.s. L_{k+1} has been observed. This vector is not known in advance, but the values forming the l.s.s. L_{k+1} will be labeled as either positive or negative according to it.

Let P(R) be the prior probability that the correct vector can be extracted by a rule belonging to R. We suppose that the acquisition of a new labeled sample v_k to form L_{k+1} from L_k follows a uniform p.d.f. amongst all values still queryable, i.e. the values in V^R_{L_k}(U) = V̂^R_{L_k}(U) \ L_k. Similarly, given a correct rule r, let V^+(L_k, r) = V^R_{L_k}(U) ∩ r(U) denote the set of all and only the values that can form new positive samples, and V^−(L_k, r) = V^R_{L_k}(U) \ V^+(L_k, r) the set of values that can form negative samples. It follows:

$$P(v_k^l \mid r, L_k) = \begin{cases} \frac{1}{|V^R_{L_k}(U)|} & \text{iff } v_k \in V^l(L_k, r) \\[4pt] 0 & \text{otherwise} \end{cases}$$

Similarly, we can compute P(v_k^l|R̄, L_k) following an approach based on a uniform p.d.f. over all possible values. These are essentially all the values in V^R_{L_k}(U), but only the values in V^R_{L_k}(U) ∩ R_{L_k}(U) can be labeled either positive or negative (and we assume with the same probability), while the values in V^R_{L_k}(U) \ R_{L_k}(U) will surely be labeled negative. Therefore, it follows that:

$$P(v_k^l \mid \bar{R}, L_k) = \begin{cases} \frac{1}{2\cdot|V^R_{L_k}(U)\,\cap\,R_{L_k}(U)|} & \text{iff } v_k \in V^R_{L_k}(U) \cap R_{L_k}(U) \\[4pt] \frac{1}{|V^R_{L_k}(U)\,\setminus\,R_{L_k}(U)|} & \text{iff } l = - \text{ and } v_k \in V^R_{L_k}(U) \setminus R_{L_k}(U) \\[4pt] 0 & \text{iff } v_k \not\in V^R_{L_k}(U) \end{cases}$$

Note that the exact computation of the set R_{L_k}(U) can be expensive: given a value v ∈ V^R_{L_k}(U), in order to figure out whether v ∈ R_{L_k}(U), we should enumerate a potentially very large number of vectors in R_{L_k}(U). We adopt an approximate and efficient solution, based on the assumption that the equivalences holding for k = 0,³ namely V^R_{L_k}(U) ∩ R_{L_k}(U) = V^R_{L_k}(U) and V^R_{L_k}(U) \ R_{L_k}(U) = ∅, also hold for any k > 0. Hence, the likelihood can be rewritten as:

$$P(v_k^l \mid \bar{R}, L_k) \simeq \begin{cases} \frac{1}{2\cdot|V^R_{L_k}(U)|} & \text{iff } v_k \in V^R_{L_k}(U) \\[4pt] 0 & \text{iff } v_k \not\in V^R_{L_k}(U) \end{cases} \qquad (3)$$

Actually, this is an oversimplification when k gets bigger and approaches |U|: both R_{L_k}(U) and V^R_{L_k}(U) get smaller and smaller, and V^R_{L_k}(U) \ R_{L_k}(U) ≠ ∅. Since our algorithm looks for the correct rule while minimizing k, in our setting this simplification does not significantly affect the results.

³Admitting that every value is extracted by at least one rule.

4.2 A-priori Probabilities
Our probabilistic model is based on the following priors: the prior probability P(R_h) that a correct rule has been generated in the set of rules R_h, and the prior probability P(r) that the extraction rule r extracts the correct vector of values from the input set of pages U. As regards the latter prior, P(r), it is set by using a uniform p.d.f. over all rules in R_{L_1}.

For the former priors, P(R_h), we follow a standard approach, and estimate the priors on a sufficiently large set of attributes as the frequency of the involved events: P(R_h) has been set equal to n_h/N, where n_h is the number of attributes extracted by a rule in R_h and not in R_{h−1}, and N is the number of attributes used in our estimation (we sampled N = 356 attributes). For example, 95% of the attributes can be extracted by rules in R_3, while to cover the remaining 5% of attributes we should stretch our set of candidate rules to R_10.
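As a complement, this small Java sketch shows the frequency-based estimation of the priors P(R_h) described above; the counts are hypothetical stand-ins, not the figures measured on the paper's N = 356 attributes.

    public class PriorEstimationDemo {
        public static void main(String[] args) {
            // n_h = attributes whose correct rule first appears at level h
            int[] nh = {280, 30, 18, 10, 6};   // hypothetical counts, n_0 .. n_4
            int N = 356;                       // attributes used in the estimation
            double coverage = 0;
            for (int h = 0; h < nh.length; h++) {
                double prior = (double) nh[h] / N;   // P(R_h)
                coverage += prior;                   // fraction solvable within R_h
                System.out.printf("h=%d  P(R_h)=%.3f  coverage=%.3f%n",
                                  h, prior, coverage);
            }
        }
    }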
5. LEARNING EXTRACTION RULES
The probabilistic model just developed aims at computing, observed an l.s.s. L_{k+1}, the probability P(r|L_{k+1}) that a given extraction rule r within a set of candidate rules R is correct, and the probability P(R̄_{L_{k+1}}) that the correct rule has not been generated at all, i.e. it is not inside R.

Here, we start by presenting an active learning algorithm that exploits these probabilities (Sec. 5.1); then, we describe a strategy to choose the next sample to be labeled (Sec. 5.2); finally, we discuss how the set of candidate rules is dynamically expanded according to the "observed need of expressiveness", rather than statically predetermined (Sec. 5.3).

Listing 1: An active learning algorithm for extraction rules

Input: a set of sample pages U = {p1, ..., p|U|}
Input: an initial positive sample L_1 = {v0+}
Parameter: a hierarchy of generable rules {R_h} over U
Output: P(r|L_{k+1}) for r ∈ (R_h)_{L_{k+1}}, and P(R̄_h|L_{k+1})

 1: let k ← 1; let h ← 0;
 2: let R ← R_h; R ← R_{L_1};
 3: while (R ≠ ∅ and not halt(R, L_k)) do
 4:     v_k ← chooseSample(R, L_k);
 5:     l ← oracle(v_k);
 6:     L_{k+1} ← L_k ∪ {v_k^l};
 7:     R ← (R_h)_{L_{k+1}};
 8:     compute P(r|L_{k+1}), ∀r ∈ R, according to eq. 1;
 9:     compute P(R̄_h|L_{k+1}) according to eq. 2;
10:     h ← h + expandRuleSet(R, L_{k+1});
11:     k ← k + 1;
12: end while
13: if (R ≠ ∅) then
14:     return (R_h)_{L_{k+1}}, P(r|L_{k+1}) and P(R̄_h|L_{k+1});
15: end if
16: return ⊥;

5.1 Active Learning Algorithm
As shown by the pseudo-code in Listing 1, our algorithm takes as input a labeled sample sequence L built by actively [12] asking an oracle (here modeled by means of the subprogram oracle()) for the label of a sample chosen by the subprogram chooseSample(). The algorithm is parametric wrt a class of extraction rules decomposed into a hierarchy of subclasses {R_h}, and it makes use of the probabilistic model detailed before (lines 8-9).

Initially, R_0 is taken as the initial set of candidate rules, and the set of rules admissible wrt the initial sample, (R_0)_{L_1}, is computed (lines 1-2). In every iteration, the oracle is asked to label a new sample v_k (lines 4-5), and the l.s.s. is updated to obtain L_{k+1} (line 6). Then, the set of admissible rules is updated (line 7) (recall that R_{L_{k+1}} ⊆ R_{L_k}), and the probabilities P(r|L_{k+1}) and P(R̄_{L_{k+1}}) are consequently updated (lines 8-9). expandRuleSet() decides whether the set of candidate rules should be expanded (line 10).

This algorithm can be instantiated by appropriately choosing the semantics of three subprograms: chooseSample(), which selects the next sample to be labeled by the user; halt(), which establishes an exit criterion before the l.s.s. naturally expires (i.e. before R becomes empty); and finally, expandRuleSet(), which decides at runtime whether R_h should be expanded with new candidate rules by incrementing h. The latter decision is usually based on P(R̄_h|L_{k+1}): the higher its value, the more likely it is that new candidate rules are needed.

The implementation of halt() strongly depends on the overall goal of the search strategy. We leave further investigation of this aspect to future work; in the present paper, for illustrative purposes, we use a minimum threshold λr on the probability of the best rule:

halt(R, L) { return (max_{r∈R_L} P(r|L) > λr); }

In the following, we detail chooseSample() and expandRuleSet(), respectively.
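Before detailing them, the following compact Java paraphrase of Listing 1 may clarify its control flow. Rules are modeled as the vectors they extract (as in Sec. 3), while the probabilistic halt and expansion tests are replaced by a crude "expand when the version space empties" stand-in; every name is illustrative, not the prototype's.

    import java.util.*;

    public class ActiveLearningLoop {
        interface Oracle { boolean label(String page, String value); }  // one MQ
        record Sample(String page, String value, boolean positive) {}

        static boolean admissible(Map<String, String> r, List<Sample> L) {
            return L.stream().allMatch(
                s -> s.positive() == s.value().equals(r.get(s.page())));
        }

        static List<Map<String, String>> run(List<List<Map<String, String>>> hierarchy,
                                             Oracle oracle, List<String> pages) {
            int h = 0;
            List<Sample> L = new ArrayList<>();                    // the l.s.s.
            List<Map<String, String>> R = new ArrayList<>(hierarchy.get(0));
            for (String p : pages) {
                if (R.isEmpty() && h + 1 < hierarchy.size())       // expandRuleSet()
                    hierarchy.get(++h).stream()
                             .filter(r -> admissible(r, L)).forEach(R::add);
                if (R.isEmpty()) break;
                String v = R.get(0).get(p);                        // chooseSample(), simplified
                if (v == null) continue;
                L.add(new Sample(p, v, oracle.label(p, v)));       // oracle()
                R.removeIf(r -> !admissible(r, L));                // R_{L_{k+1}}
            }
            return R;                                              // surviving version space
        }

        public static void main(String[] args) {
            Map<String, String> truth = Map.of("pr", "rick@usr.edu", "pm", "mark@usm.edu");
            Map<String, String> r1 = Map.of("pr", "Full Prof.", "pm", "mark@usm.edu");
            Oracle oracle = (p, v) -> v.equals(truth.get(p));      // answers each MQ
            // R_0 = {r1}; R_1 also adds the vector of right-of('Email')
            System.out.println(run(List.of(List.of(r1), List.of(r1, truth)),
                                   oracle, List.of("pr", "pm", "pr")));
        }
    }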
5.2 Choosing the Right Samples
For instantiating chooseSample() we propose two variants: Entropy, plus a baseline algorithm, Random.

Random: It chooses a random admissible sample:

chooseSample(R, L) { return a random v ∈ V^R_L(U); }

It serves two purposes: as a baseline against other strategies, and as a measure of the sample complexity of the extraction task.

Entropy: It bases the sample choice on the p.d.f. of the extracted values: a simple strategy is to choose the sample on which the rules most disagree, appropriately weighted according to their probability. This is equivalent to computing the vote entropy [12] for each v ∈ V^R_{L_k}(U) (a sketch is given at the end of this section):

$$H(v) = -\left[P(v^+ \mid L_k)\,\log P(v^+ \mid L_k) + P(v^- \mid L_k)\,\log P(v^- \mid L_k)\right] \qquad (4)$$

where:

$$P(v^+ \mid L_k) = \sum_{r \in \{r \in R_{L_k} :\, r(p_v) = v\}} P(r \mid L_k) \qquad\quad P(v^- \mid L_k) = \sum_{r \in \{r \in R_{L_k} :\, r(p_v) \neq v\}} P(r \mid L_k)$$

are the probabilities that v is respectively either a value to extract or an incorrect value. The next sample is the one maximizing the vote entropy:

chooseSample(R, L) { return argmax_{v ∈ V^R_L(U)} H(v); }

Note that this choice essentially resolves the most uncertain sample.

5.3 Dynamically Expanding the Rule Set
expandRuleSet() is in charge of deciding whether and when to expand the hierarchy of candidate rules R_h. It makes use of the probability P(R̄_h|L_k) that the correct rule is not present in the current set of candidate rules R_h after observing as input a given l.s.s. L_k.

We leave a thorough discussion of the best criteria underlying this analysis to future work; in this paper we use a simple implementation of expandRuleSet() based on a predefined fixed threshold λR over P(R̄_{L_k}):

expandRuleSet(R, L) {
    if (R = R_m) return 0;          // max expansion reached
    else if (P(R̄_L) > λR) return +1;
    else return 0;
}

The set of rules is therefore enlarged lazily, i.e. only whenever, according to P(R̄_L), there is evidence that a correct rule is not amongst the currently available candidate rules.
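To illustrate the Entropy variant of Sec. 5.2, the following minimal Java sketch computes the vote entropy of equation (4) over a handful of values; the per-value probabilities P(v+|L_k) are illustrative stand-ins for the masses that would be derived from the current P(r|L_k).

    import java.util.*;

    public class VoteEntropyDemo {
        // H(v) from eq. (4), simplified by taking P(v-|L_k) = 1 - P(v+|L_k)
        static double entropy(double pPlus) {
            double h = 0;
            if (pPlus > 0) h -= pPlus * Math.log(pPlus);
            if (1 - pPlus > 0) h -= (1 - pPlus) * Math.log(1 - pPlus);
            return h;
        }

        public static void main(String[] args) {
            Map<String, Double> pPlus = Map.of(
                "rick@usr.edu", 0.95,   // rules mostly agree: low entropy
                "Full Prof.", 0.50,     // rules split evenly: maximal entropy
                "123-454-3210", 0.10);
            String next = pPlus.entrySet().stream()
                .max(Comparator.comparingDouble(e -> entropy(e.getValue())))
                .orElseThrow().getKey();
            System.out.println("next MQ on: " + next);  // Full Prof.
        }
    }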
6. EXPERIMENTS AND FUTURE WORK
We built a Java prototype implementation of our algorithms. We report the results of some preliminary experiments, mainly focused on evaluating the effectiveness of the SRM technique. We downloaded pages from 101 websites publishing information in several domains. For each website, we downloaded a small set of sample pages (about 20) sharing a common HTML template, and considered as relevant 2-4 attributes, for which we manually crafted a golden extraction rule to obtain the correct values as a reference, totally considering 240 attributes.

The results have been collected by running our prototype with λr = 0.99 and λR = 0.95. Table 2 reports: the (rounded) average number of MQ posed by Random over 10 executions (Rnd);⁴ the aggregated number of attributes extracted by Random (#A); the number of MQ posed by Entropy with R_5 and SRM disabled (H); and the number of MQ when SRM starts from the class of extraction rules R_i, with i = 0, 1, 2, 3 (SRM-R_i). The last row reports the average precision of the values extracted by the output rules.

Table 2: Experimental results

                     Static   |-------- SRM --------|
  Rnd    #A          H        R0      R1      R2      R3
  ≥9     10          4        1.90    2.33    2.89    3.56
  8      16          4.12     1.56    1.78    3.22    3.25
  7      12          4.08     1.66    1.69    2.7     3.63
  6      15          3.86     1.26    1.58    3.22    3.40
  5      15          3.5      1.55    2.06    2.39    2.77
  4      32          3.58     1.46    1.96    2.48    3.13
  3      53          3.45     1.55    2.06    2.52    3.14
  2      55          3.01     1.76    1.54    2.11    2.38
  1      128         2.27     1.59    1.65    1.71    1.76
  Precision          96.2%    86.7%   93.79%  95.85%  96.2%

SRM always saves MQ. Whenever the initial expressiveness is low, the number of MQ saved is higher, but the precision suffers. This can be explained by expandRuleSet() making wrong decisions, i.e. it trusts the current imprecise rules rather than betting on more expressive classes. Note that even Entropy cannot reach 100% precision, since there exist attributes that need rules outside R_5.

Since our dataset consists of a small number of pages per site, it is not suitable for evaluating the sampling issue, which we leave to future work. To overcome the precision loss of the SRM approach, it is possible to add further information, e.g. type analysis (to give more weight to vectors with homogeneous types). Also, we aim at integrating SRM with automatic annotators that rely on data available in external sources, such as Freebase, and with crowdsourcing platforms.

⁴These values represent the sample complexity.

7. REFERENCES
[1] D. Angluin. Queries revisited. Theor. Comput. Sci., 313(2):175–194, 2004.
[2] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD 2003.
[3] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3), 2010.
[4] C.-H. Chang and S.-C. Lui. IEPAD: information extraction based on pattern discovery. In WWW 2001.
[5] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5):731–779, 2004.
[6] V. Crescenzi and P. Merialdo. Wrapper inference for ambiguous web pages. JAAI, 22(1&2):21–52, 2008.
[7] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219–230, 2011.
[8] T. Furche, G. Gottlob, G. Grasso, O. Gunes, X. Guo, A. Kravchenko, G. Orsi, C. Schallhart, A. J. Sellers, and C. Wang. DIADEM: domain-centric, intelligent, automated data extraction methodology. In WWW (Companion Volume) 2012.
[9] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, and S. Flesca. The Lixto data extraction project - back and forth between theory and practice. In PODS 2004.
[10] U. Irmak and T. Suel. Interactive wrapper generation with minimal user effort. In WWW 2006.
[11] I. Muslea, S. Minton, and C. A. Knoblock. Active learning with multiple views. JAIR, 27:203–233, 2006.
[12] B. Settles. Active learning literature survey. CS T.R. 1648, Univ. of Wisconsin–Madison, 2009.
[13] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[14] V. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
[15] Y. Zhai and B. Liu. Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng., 18(12):1614–1628, 2006.