DBCrowd 2013: First VLDB Workshop on Databases and Crowdsourcing

Wrapper Generation Supervised by a Noisy Crowd

Valter Crescenzi, Paolo Merialdo, Disheng Qiu
Dipartimento di Ingegneria, Università degli Studi Roma Tre
Via della Vasca Navale, 79 – Rome, Italy
{crescenz, merialdo, disheng}@dia.uniroma3.it

ABSTRACT
We present solutions based on crowdsourcing platforms to support the large-scale production of accurate wrappers around data-intensive websites. Our approach is based on supervised wrapper induction algorithms that shift the burden of generating the training data to the workers of a crowdsourcing platform. Workers are paid for answering simple membership queries chosen by the system. We present two algorithms: a single-worker algorithm (alfη) and a multiple-workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries. alfred estimates the workers' error rate to decide at runtime how many workers are needed. The experiments that we conducted on real and synthetic data are encouraging: our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.

1. INTRODUCTION
The abundance of data contained in web pages has motivated many research efforts towards the development of effective methods and tools for generating web wrappers, i.e., rules that allow the extraction of data from web pages. Supervised approaches to inferring web wrappers have limited scalability, mainly because they require a set of training data, typically provided as labeled values. Unsupervised approaches (e.g., [3, 6]) have been investigated as an attempt to "scale up" the wrapper generation process by overcoming the need for training data. Unfortunately, they have limited applicability because of the low precision of the produced wrappers.

Crowdsourcing platforms represent an intriguing opportunity to "scale out" supervised wrapper inference approaches. These platforms support the assignment of mini-tasks to people recruited on the Web, and thus allow the engagement of a large number of workers to produce massive amounts of training data. However, generating wrappers with the support of crowdsourcing platforms introduces a number of challenges that were not addressed in the literature: the mini-tasks submitted to the platform should be extremely simple, since they are performed by non-expert workers, and their number should be minimized to contain the costs.

We are developing a framework, alf [7], that addresses the above issues to let the crowd effectively and efficiently supervise the wrapper generation process.¹ alf progressively infers a wrapper by posing membership queries (MQ), which are the simplest form of queries [1], since they admit only a yes/no answer (e.g., "Is the string 'John Wayne' a value to extract?"); alf implements an active learning algorithm to select the queries that most quickly lead to the generation of an accurate wrapper, thus reducing the costs [14]; finally, here we extend alf to adopt a probabilistic model that accounts for errors (wrong answers) introduced by inaccurate workers [2, 15].

We have experimentally observed that with perfect workers, i.e., workers that do not make any mistake in answering the proposed membership queries, alf generates the correct extraction rules while reducing the number of queries by 4× on average with respect to a random choice [7]. However, it is well known that the workers recruited on crowdsourcing platforms are far from perfect. In an empirical evaluation that we conducted on a popular crowdsourcing platform, we experienced a significant number of incorrect answers (around 10% on average) even for the simple membership queries posed by our system.

This paper extends our framework to manage workers that may return incorrect answers. First, we introduce alfη, which extends the underlying model of alf to deal with the noisy answers of a single worker. The presence of errors introduces the challenging issue of deciding when to stop the learning process. Intuitively, when a worker is inaccurate, the cost of acquiring her answers may not be justified by the increment of quality that these answers produce in the inferred wrapper. alfη needs an estimate of the worker's error rate to reason about the optimal number of queries that should be assigned to the worker. Unfortunately, at most a rough estimate of the error rate is available when the worker is engaged.

To overcome this issue, we then introduce alfred (alf with redundancy), an algorithm that builds on alfη to find the best solution according to the training data provided by multiple workers. It adopts the conventional technique of facing the presence of errors by engaging multiple workers to solve the same tasks. However, alfred decides the number of workers during the learning process, at runtime, and minimizes the costs by engaging only the workers actually needed to achieve the desired quality. alfred exploits the weighted consensus among multiple workers to estimate their error rates, so that better stopping conditions can be crafted to take into account both the cost (quantified as the number of queries) and the quality of the wrapper (as estimated by a probabilistic model).

¹A demo is available at http://alfred.dia.uniroma3.it.

Copyright © 2013 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
Contributions. Overall, the paper makes several contributions: (i) we extend our crowd-based wrapper inference framework [7] to manage noisy answers; (ii) we propose a principled approach to decide at runtime how many workers should be engaged to deal with the presence of noisy answers; (iii) we show how to estimate the workers' error rates during the learning; (iv) we define several termination strategies aiming at a fair trade-off between output quality and cost; (v) we report a set of preliminary experimental results with both synthetic and real answers collected from the crowd.

Paper outline. The paper is organized as follows: Section 2 formalizes our setting and presents the extension of our previous probabilistic model to deal with noisy answers; Section 3 introduces alfη, the active learning algorithm for a single noisy worker, and Section 4 presents alfred, its generalization to multiple workers. Section 5 reports the experimental results. Section 6 discusses related work, and Section 7 concludes the paper.

2. MODELING WRAPPER QUALITY AND NOISY WORKERS
We focus on data-intensive websites whose pages are generated by scripts that embed data from an underlying database into an HTML template. Let U = {p_1, ..., p_n} be an ordered set of pages generated by the same script. Given an attribute of interest published in the pages, its values can be extracted by means of an extraction rule (or simply rule). The value extracted by a rule r from a page p, denoted by r(p), is either a string occurrence from the HTML source code of p, or a special nil value. A rule r over the pages in U returns the ordered set of values r(p_1), ..., r(p_n) and represents a concrete tool to build a vector of values, denoted by r(U), indexed by the pages of U.

We propose a wrapper induction process that requires as input the set U of pages to be wrapped, and only a single initial annotation v_0 (which is assumed correct) of the attribute value to extract.²

The inference process starts by generating a space of hypotheses, i.e., a set R_{v0} of candidate rules that extract the given initial annotation v_0. We consider extraction rules defined by means of expressions belonging to a simple fragment of XPath; namely, we use absolute and relative XPath expressions that specify paths to the leaf node containing the value to be extracted. Absolute rules specify paths that start either from the document root or from a node having an 'id'; relative rules start from a template node working as pivot. Textual leaves that occur once in every input page are considered template nodes [3].

The inference process evaluates the candidate rules in R_{v0} by posing questions to workers recruited from a crowdsourcing platform: they are shown a page and asked whether a given value v from that page corresponds to a value of the target attribute. The goal is to select the rule that works not only on the annotated sample page from which it has been generated, but also for all the other pages in the input collection U. Each query is formed by picking a value from the set V_{R_{v0}}(U) of values extracted from the pages in U by the candidate rules R_{v0}.

Figure 1 shows an example: suppose that we are interested in generating a wrapper that extracts the Title from the fictional set of movie pages U = {p_1, p_2, p_3} whose DOM trees are sketched in Figure 1 (left). Assume that the initial annotation v_0 = 'City of God' is supplied on the sample page p_1. Figure 1 (right) shows the set R_{v0} = {r_1, r_2, r_3} of candidate rules generated from this initial annotation. The queries composed by the inference process use the values V_{R_{v0}}(U) that appear as elements of the vectors extracted by the rules in R_{v0} from the pages in U.

    rules \ pages    p_1            p_2          p_3
    r_1              City of God    Inception    Oblivion
    r_2              City of God    Inception    nil
    r_3              City of God    nil          Oblivion

    r_1 = /html/table/tr[1]/td
    r_2 = //td[contains(.,"Ratings:")]/../../tr[1]/td
    r_3 = //td[contains(.,"Director:")]/../../tr[1]/td

Figure 1: (Left) DOM of three sample pages; (Right) Extraction rules in R_{v0} = {r_1, r_2, r_3} with v_0 = 'City of God' and the set of values V_{R_{v0}}(U) they extract from the pages in U = {p_1, p_2, p_3}.

²Such an input annotation may be supplied either manually or automatically, by looking up in the page a golden value from an available database. The approach can be easily generalized to deal with multiple initial annotations.
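To make the notion of candidate rules concrete, the following Python sketch (an editorial illustration, not part of the original system) applies the three XPath rules of Figure 1 to toy stand-ins for the pages p_1, p_2, p_3, assuming the lxml library; the page contents are invented for the example.

import lxml.html

# Toy stand-ins for the three movie pages of Figure 1 (assumed markup).
pages = {
    "p1": "<html><table><tr><td>City of God</td></tr>"
          "<tr><td>Ratings:</td></tr><tr><td>Director:</td></tr></table></html>",
    "p2": "<html><table><tr><td>Inception</td></tr>"
          "<tr><td>Ratings:</td></tr></table></html>",
    "p3": "<html><table><tr><td>Oblivion</td></tr>"
          "<tr><td>Director:</td></tr></table></html>",
}

rules = {
    "r1": "/html/table/tr[1]/td",
    "r2": "//td[contains(.,'Ratings:')]/../../tr[1]/td",
    "r3": "//td[contains(.,'Director:')]/../../tr[1]/td",
}

def extract(rule_xpath, page_source):
    """Return the textual value extracted by the rule, or None (nil)."""
    nodes = lxml.html.fromstring(page_source).xpath(rule_xpath)
    return nodes[0].text_content().strip() if nodes else None

for name, xpath in rules.items():
    # Each rule yields a vector of values indexed by the pages of U.
    print(name, [extract(xpath, src) for src in pages.values()])
    # e.g. r2 -> ['City of God', 'Inception', None]

Running the sketch reproduces the table of Figure 1: r_2 extracts nil on p_3 and r_3 extracts nil on p_2, since their pivot template nodes are missing from those pages.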
The binary answer l, with l ∈ {−, +}, supplied by a worker adorns the queried value v with either a positive or a negative label, producing a labeled value denoted by v^l. An ordered set of k labeled values composes a training sequence (t.s.), denoted L^k, so that L^1 = {v_0^+} and L^{k+1} = L^k ∪ {v_k^l}.

Given a t.s. L^k, we introduce a probabilistic model for estimating the probability P(r|L^k) of each candidate rule r ∈ R_{v0} being correct for the whole set of input pages U, and the probability P(R̄_{v0}|L^k) that the correct rule is not present in R_{v0}. These probabilities are updated after each new labeled value v_k^l is observed, i.e., a worker labels with l the value provided by a MQ on v_k, and the t.s. is expanded to L^{k+1} = L^k ∪ {v_k^l}.

Bayesian update rules compute the posterior probabilities P(r|L^{k+1}) and P(R̄_{v0}|L^{k+1}) starting from the probabilities P(r|L^k) and P(R̄_{v0}|L^k), respectively, available before observing v_k^l. The process can be repeated, treating each posterior probability as the prior required by the next iteration. The whole process is triggered using as base case the prior P(R̄_{v0}) of not having generated the correct rule in R_{v0}, and the prior probability P(r) that r is a correct rule. Assuming that R_{v0} is generated from a class of rules sufficiently expressive to include the target rule, we can fix P(R̄_{v0}) = 0,³ and uniformly set P(r) = 1/|R_{v0}|.

³In [7], where we study the possibility of dynamically tuning the expressiveness of the class, we develop the opposite assumption.
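The iterated Bayesian update can be sketched as follows (our illustration, in our own notation): each labeled value scales every rule's current probability by its likelihood and renormalizes, so the posterior after L^k becomes the prior for L^{k+1}. The likelihood function is left abstract here; it is made concrete by Eq. (1) below.

def update_posteriors(posteriors, labeled_value, likelihood):
    """One Bayesian update step over the candidate rules.

    posteriors: dict mapping rule -> P(r | L^k).
    likelihood(labeled_value, rule): P(v^l | r, L^k), e.g. Eq. (1).
    Returns a dict mapping rule -> P(r | L^{k+1}).
    """
    unnormalized = {r: p * likelihood(labeled_value, r)
                    for r, p in posteriors.items()}
    z = sum(unnormalized.values())  # evidence P(v^l | L^k); assumed > 0
    return {r: p / z for r, p in unnormalized.items()}

# Base case: uniform priors over the candidate rules,
# with P(correct rule not in R_{v0}) fixed to 0.
rules = ["r1", "r2", "r3"]
posteriors = {r: 1.0 / len(rules) for r in rules}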
Bayesian update rules require the definition of the p.d.f. P(v_k^l | r, L^k). This is usually obtained by introducing a probabilistic generative model that abstracts the actual process leading to the generation of the t.s. For the sake of simplicity, we adopt the Classification Noise Process (CNP) model [2] to describe inaccurate workers that may produce independent and random mistakes with an expected error rate η, and we assume that the next value to query is randomly chosen according to a uniform p.d.f. over all values in V_{R_{v0}}(U) \ L^k. Letting P_η(·) denote a p.d.f. over all possible t.s. in the presence of noise, it follows that:

    P_η(v_k^l | r, L^k) = P(v_k^l | r, L^k) · (1 − η) + P(v_k^{−l} | r, L^k) · η    (1)

where (1 − η) is the probability that a worker correctly labels the provided value v_k with l, and η is the probability that she wrongly provides the opposite label, here denoted by −l. P(v_k^l | r, L^k) is the corresponding noise-free p.d.f. and can be expressed as in [7]:

    P(v_k^l | r, L^k) = 1 / (|V_{R_{v0}}(U)| − |L^k|),   iff v_k ∈ V^l(r) \ L^k
                      = 0,                               otherwise

where V^l(r) is the subset of values in V_{R_{v0}}(U) that should be labeled l if r is the correct rule. In our generative model the values composing the t.s. L^k cannot be queried again; therefore V_{R_{v0}}(U) \ L^k is the set of values that can be used to generate new queries.
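A minimal sketch of the two likelihoods, under our assumed data model: `universe` stands for V_{R_{v0}}(U), `queried` for the set of already-labeled values, and `extracts(rule, value)` tells whether the rule would extract the value (these helper names are ours, not the paper's).

def noise_free_likelihood(value, label, rule, universe, queried, extracts):
    """P(v^l | r, L^k): uniform over the not-yet-queried values that the
    rule would label l; zero otherwise."""
    if value in queried:
        return 0.0
    consistent = (label == '+') == extracts(rule, value)
    return 1.0 / (len(universe) - len(queried)) if consistent else 0.0

def noisy_likelihood(value, label, rule, universe, queried, extracts, eta):
    """Eq. (1): mix the noise-free likelihoods of the observed label and
    of the opposite label according to the worker's error rate eta."""
    opposite = '-' if label == '+' else '+'
    return (noise_free_likelihood(value, label, rule,
                                  universe, queried, extracts) * (1 - eta)
            + noise_free_likelihood(value, opposite, rule,
                                    universe, queried, extracts) * eta)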
As we discuss in the next section, these probabilities are used to effectively choose the next question, and to establish a termination condition for the learning process.

3. LEARNING EXTRACTION RULES
The probabilistic model developed in the previous section aims at computing, given an observed t.s. L^k, the probability P(r|L^k) that a given extraction rule r within a set of candidate rules R_{v0} is correct. In this section we present alfη, an active learning algorithm that exploits these probabilities to minimize the number of queries to the crowd workers.

Listing 1  alfη: Active Learning Algorithm for a Single Noisy Worker
Input: a set of pages U
Input: the set of candidate rules R_{v0}
Input: a worker w and its associated error rate η_w
Output: a training sequence L^k
1: let k = 1; let L^1 = {v_0^+};
2: while (not halt_alf(L^k)) do
3:   v_k ← chooseQuestion(L^k);
4:   l ← getAnswer(w, v_k);
5:   L^{k+1} ← L^k ∪ {v_k^l};
6:   compute P(r|L^{k+1}), ∀r ∈ R_{v0};
7:   k ← k + 1;
8: end while
9: return L^k;

Listing 1 contains the pseudo-code of the alfη algorithm: it processes a t.s. L^k built by actively asking a worker (here modeled by means of the subprogram getAnswer()) for the label of a value chosen by the subprogram chooseQuestion(); alfη computes a p.d.f. describing the probability of correctness over the rules in R_{v0}. In every iteration (lines 2–8), the worker is asked to label a new value v_k (lines 3–4) and the t.s. is expanded (line 5). Then the probability P(r|L^{k+1}) is updated (line 6).

chooseQuestion() selects the next value to be labeled by the worker, i.e., the next membership query. The chosen value is the one on which the rules most disagree, appropriately weighted according to their probability. This is equivalent to computing the vote entropy [14] for each v ∈ V_{R_{v0}}(U):

    H(v) = −[ P(v^+|L^k) log P(v^+|L^k) + P(v^−|L^k) log P(v^−|L^k) ]

where

    P(v^+|L^k) = Σ_{r ∈ R_{v0} : r(p_v) = v} P(r|L^k)   and   P(v^−|L^k) = Σ_{r ∈ R_{v0} : r(p_v) ≠ v} P(r|L^k)

are the probabilities that v is, respectively, either a value to extract or an incorrect value (p_v denotes the page containing v). Intuitively, the entropy measures the uncertainty of a value, and querying the value with the highest entropy removes the most uncertain value:

    chooseQuestion(L^k) { return argmax_{v ∈ V_{R_{v0}}(U)} H(v); }

Since different termination policies can be appropriate depending on the budget constraints and on the quality targets, we propose several implementations of halt_alf(L^k).

halt_r: A simple policy is to stop when the probability of the best rule overcomes a threshold λ_r:

    halt_r(L^k) { return (max_{r ∈ R_{v0}} P(r|L^k) > λ_r); }

The main limitation of this strategy is that it does not take the costs into account.

halt_MQ: A simple policy to upper-bound the costs is to stop as soon as the algorithm runs out of a "budget" of λ_MQ membership queries:

    halt_MQ(L^k) { return (|L^k| > λ_MQ); }

The main limitation of this strategy is that it does not consider the quality of the output rules.

halt_H: A trade-off between quality and cost can be set by posing only queries that contribute enough to the quality of the inferred rules, and stopping as soon as the costs are considered not adequately rewarded by the increment of quality in the output rules. This can be easily modeled in terms of the maximum entropy: halt_H terminates as soon as the maximum entropy of the values is below a threshold λ_H, i.e., no value is uncertain enough to deserve a query:

    halt_H(L^k) { return (max_{v ∈ V_{R_{v0}}(U)} H(v) < λ_H); }
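The vote-entropy selection and the three halting predicates can be sketched as follows (our illustration; `extracts` is the hypothetical helper introduced earlier, `posteriors` maps each rule to P(r|L^k), and `candidates` are the not-yet-queried values; the default thresholds are those used in the experiments of Section 5).

import math

def value_entropy(value, posteriors, extracts):
    """Vote entropy H(v) of a candidate value."""
    p_plus = sum(p for r, p in posteriors.items() if extracts(r, value))
    p_minus = 1.0 - p_plus  # holds since P(R̄_{v0}) = 0 and Σ_r P(r|L^k) = 1
    h = 0.0
    for p in (p_plus, p_minus):
        if p > 0:
            h -= p * math.log(p)
    return h

def choose_question(candidates, posteriors, extracts):
    """Pick the value on which the weighted rules most disagree."""
    return max(candidates, key=lambda v: value_entropy(v, posteriors, extracts))

def halt_r(posteriors, lambda_r=0.8):   # quality only
    return max(posteriors.values()) > lambda_r

def halt_mq(ts, lambda_mq=5):           # budget only
    return len(ts) > lambda_mq

def halt_h(candidates, posteriors, extracts, lambda_h=0.2):  # trade-off
    return max(value_entropy(v, posteriors, extracts)
               for v in candidates) < lambda_h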
Whichever termination policy alfη adopts, its efficacy in achieving an optimal trade-off between quality and the costs of the learning tasks is strongly affected by a correct estimation of the worker's error rate, η. An incorrect estimate of η can lead to a waste of queries (when the error rate is overestimated) or to a quality loss (when it is underestimated), as confirmed by the experiments with different termination policies reported in Section 5.2.

In our experimental evaluation, we use as termination policy the following combination:

    halt_alfη(W, L^k) = halt_H(L^k) or halt_MQ(L^k).

This policy leverages the same trade-off between quality and cost as halt_H, but focuses on the cost side by limiting through halt_MQ the budget allocated for each worker.

4. INFERENCE WITH A NOISY CROWD
We now introduce another algorithm, alfred, that follows the conventional approach based on redundancy [15] to deal with inaccurate workers. alfred improves the robustness of alfη by dynamically recruiting additional workers to whom it dispatches redundant tasks. It combines the answers provided by a set W of multiple workers on the same task to estimate the workers' error rates and, at the same time, to find the most likely rule.

Our probabilistic model can easily deal with multiple workers by appending the t.s. L_w of every worker w ∈ W into a unique t.s., denoted by L = ⊎_{w∈W} L_w, which is then used to train alfη. However, this approach raises two issues: (i) it is not clear how many workers should be involved in the learning process to achieve a good trade-off between output quality and cost; (ii) a correct estimation of the workers' error rates strongly affects alfη performance, and each worker could have an error rate radically different from the others.

To overcome these issues, alfred dynamically chooses the amount of redundancy needed, and it exploits the cumulative knowledge both to find the solution on which most of the weighted workers' consensus converges, and to estimate the workers' error rates.

Listing 2  alfred: Active Learning Algorithm with Multiple Noisy Workers
Input: a set of pages U
Input: the set of candidate rules R_{v0}
Parameter: number of initial workers W_0
Parameter: a threshold λ_η for error rate convergence
Output: the most probable extraction rule r ∈ R_{v0}
1: let W = ∅;                              // set of engaged workers
2: repeat
3:   repeat
4:     let w = engageWorker();
5:     let η_w = getErrorRate(w);
6:     let L_w = alfη(U, R_{v0}, w);
7:     W ← W ∪ {w};
8:   until (|W| ≥ W_0);                    // wait for enough workers
9:   let L = ⊎_{w∈W} L_w;                  // append all t.s.
10:  repeat
11:    compute P(r|L) by using η_w, ∀r ∈ R_{v0};
12:    let η_w^prev = η_w, ∀w ∈ W;         // save previous η
13:    compute η_w by using P(r|L), ∀w ∈ W;
14:  until (Σ_{w∈W} (η_w − η_w^prev)² < λ_η · |W|);
15: until (halt_alfred(W, L));
16: return argmax_{r ∈ R_{v0}} P(r|L);

Listing 2 illustrates the alfred pseudo-code: the main loop (lines 2–15) alternates two phases. First, the workers are engaged (lines 3–8): each worker w trains an alfη instance, thus producing a t.s. L_w. In this phase, the worker is associated with an initial error rate (line 5).⁴ The second phase (lines 10–14) starts as soon as the first phase has accumulated enough workers (at least W_0). The goal of this phase is to leverage all the t.s. provided by the workers in order to compute both the p.d.f. of the rules, P(r|L), and the individual error rate η_w of each worker w ∈ W.

Our solution is inspired by the work on Truth Finding Problems [10], and exploits the mutual dependency between the probability of correctness of the rules and the error rates of the workers answering the MQ. The error rate is defined in terms of probability as follows:

    η_w = (1/|L_w|) · Σ_{v_k^l ∈ L_w} q(v_k^l),   where  q(v_k^l) = 1 − P(v_k^+|L)  iff l = +,  and  q(v_k^l) = P(v_k^+|L)  iff l = −

and it can be interpreted as the probability of the worker incorrectly answering a MQ, estimated by averaging over all the MQ in the t.s. L_w that she has provided. Since the rule probabilities are in turn defined in terms of the error rates, as shown in Eq. 1, the two computations are interleaved (lines 11 and 13) until their values do not significantly change anymore.⁵

⁴We use the average error rate empirically estimated in our experiment with real workers; another option is to derive this value from the worker ranks provided by the crowdsourcing platform.
⁵We have empirically observed the convergence of the algorithm. A formal proof is left as future work.
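The interleaved estimation of lines 10–14 can be sketched as a fixed-point iteration (our illustration, with hypothetical helper names): rule probabilities are recomputed from the current error rates, then each worker's error rate is re-estimated as the average probability that her answers are wrong, until the rates converge.

def interleaved_estimation(workers_ts, recompute_posteriors, prob_positive,
                           initial_eta, lambda_eta=1e-4):
    """Fixed-point estimation of rule probabilities and worker error rates.

    workers_ts: dict worker -> list of (value, label) pairs, i.e. L_w.
    recompute_posteriors(etas): P(r|L) from the merged t.s., given the etas.
    prob_positive(value, posteriors): P(v^+ | L).
    initial_eta: dict worker -> initial error rate estimate (line 5).
    """
    etas = dict(initial_eta)
    while True:
        posteriors = recompute_posteriors(etas)          # line 11
        prev = dict(etas)                                # line 12
        for w, ts in workers_ts.items():                 # line 13
            wrong = [1 - prob_positive(v, posteriors) if l == '+'
                     else prob_positive(v, posteriors)
                     for v, l in ts]
            etas[w] = sum(wrong) / len(wrong)
        # line 14: stop once the squared change in the etas is small enough
        if sum((etas[w] - prev[w]) ** 2 for w in etas) < lambda_eta * len(etas):
            return posteriors, etas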
The termination condition of alfred can be set according to different policies by specifying halt_alfred. In our implementation, in order to take into account both budget constraints and quality targets, we used the following combination:

    halt_alfred(W, L) = halt_r(L) or halt_MQ(L)

Observe that the recruitment of new workers negatively influences the latency of the system. A strategy to contain the latency is to recruit workers in bulk, but this is beyond the scope of the present paper.

5. EXPERIMENTAL EVALUATION
We use a dataset composed of 5 collections of pages: actor and movie pages from www.imdb.com; band and album pages from www.allmusic.com; and stock quote pages from www.nasdaq.com. For each collection we selected about 10 attributes, for a total of 40 attributes. Then we manually crafted a golden XPath rule for each attribute. We randomly selected a training sample set of 500 pages from each collection of pages, and another (disjoint) test set of 2,000 pages. Our algorithm was run on the training sample, and the output extraction rules were evaluated on the test set. We compared the (non-null) values extracted by the golden rules against those extracted by the rules generated by our algorithm. For every generated rule r, and given a test set of pages U, we computed precision (P), recall (R), and F-measure (F) at the value level w.r.t. the corresponding golden rule r_g, as follows:

    P = |r_g(U) ∩ r(U)| / |r(U)|;   R = |r_g(U) ∩ r(U)| / |r_g(U)|;   F = 2 · (P · R) / (P + R).
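A sketch of the value-level metrics (our illustration): since r(U) is a vector indexed by the pages of U, we interpret the intersection page-wise, counting a match when the generated rule extracts the same non-null value as the golden rule on the same page; this reading is our assumption, not spelled out in the paper.

def prf(golden_vector, extracted_vector):
    """Value-level precision, recall, and F-measure of a generated rule.

    golden_vector, extracted_vector: values indexed by page (None = nil).
    """
    matches = sum(1 for g, e in zip(golden_vector, extracted_vector)
                  if g is not None and g == e)
    n_extracted = sum(1 for e in extracted_vector if e is not None)
    n_golden = sum(1 for g in golden_vector if g is not None)
    p = matches / n_extracted if n_extracted else 0.0
    r = matches / n_golden if n_golden else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f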
We report the results of two sets of experiments: the first one was conducted to test alfη with a single worker; in the second set of experiments we consider alfred in the presence of multiple workers. We evaluate our algorithms by using synthetic workers following the CNP probabilistic model [2], i.e., workers making random and independent errors with a fixed error rate η*.

In order to estimate the error rate distribution over a population of real workers, we also conducted a preliminary experiment with workers engaged by means of CrowdFlower, a meta-platform that offers services to recruit workers on AMT.

5.1 Experiments with AMT Workers
The main intent of this preliminary experiment was to evaluate the error rate distribution of a population of real workers recruited on a crowdsourcing platform. We submitted 100 tasks to 100 distinct workers. Each task was paid 10¢ and consisted of a set of 20 MQ to generate the extraction rules for several attributes of our dataset. The ground truth was straightforwardly obtained from the results of our golden extraction rules. We used alfη configured with halt_r as termination condition (λ_r = 0.8), and with η = 0.1.

The average error rate of an AMT worker was η̂ = 0.09, with a standard deviation of σ̂_η = 0.11. About 1/3 of the workers responded correctly to all the queries, and the response time for each MQ was around 7 seconds. On average the number of MQ posed by the system to infer a rule was 4, and each task contained enough queries to learn 5 rules.

The information obtained from this experiment was then used to set up realistic configurations of the synthetic workers for the other experiments: the empirically observed average error rate η̂ is used to set alfη's parameter η = η̂, as well as the initial worker error rate estimate of alfred, η_w = η̂; also, the synthetic workers used to test alfred are created using the same error rate distribution observed on real workers.

5.2 Single Noisy Worker
The main goal of this experiment was to evaluate the effects of the workers' mistakes on alfη, our algorithm in the absence of redundancy. In Figure 2 we show the effects of an inaccurate worker with different termination strategies, setting η = η̂ = 0.09. We simulated workers with an error rate η* increasing from 0.05 to 0.4. The thresholds of the three termination conditions considered, halt_r, halt_MQ and halt_H, were set to λ_r = 0.8, λ_MQ = 5 and λ_H = 0.2, respectively.

Figure 2: alfη (η = 0.09) sensitivity to the worker error rate η*: cost (left) and quality (right) of the output wrapper.

Figure 3: alfη sensitivity to the expected worker error rate η with a noisy worker (η* = 0.09): cost (left) and quality (right) of the output wrapper.

As the error rate η* of the worker increases, the results degrade with every termination strategy, as expected. However, halt_H detects the wider presence of uncertain values, and tries to compensate with a greater number of queries; conversely, since halt_r focuses only on the most likely rule, it poses a rather steady number of queries, and the output quality is more seriously compromised.

We then empirically evaluated how an incorrect setting of the parameter η, i.e., the expected worker error rate, influences alfη performance. We used a single worker with η* = η̂ = 0.09, and repeated several inference processes, configuring alfη with η ranging from 0 to 0.4, as reported in Figure 3. When the system overestimates the accuracy of the worker (η < η*) we observe a reduction of the number of MQ, but the quality of the wrapper drops: the system trusts the worker and terminates quickly, thus posing fewer questions than actually needed. When the system underestimates the worker's accuracy (η > η*), some MQ are wasted since the system does not trust the worker. With an η larger than η* by +0.3, halt_r requires more than 40 MQ, i.e., 5× those required when η = η*. Observe that many of these MQ are wasted, since the F-measure gain is less than 5%.

Table 1: alfη inference with synthetic workers

                 average              max
           |W|    F      #MQ    |W|   #MQ    σ_F
    alfη    1     0.92   7.58    1    11     0.17

Table 1 reports the alfη results when the termination policy halt_alfη has been instantiated by setting the parameters λ_H = 0.2 and λ_MQ = 10. alfη requires just a few queries (#MQ = 7.58) to learn rather accurate wrappers (F = 0.92). However, there is a significant standard deviation (σ_F = 17%) in the output quality, which makes the algorithm not that robust to workers' errors.
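The synthetic workers used throughout this section can be sketched as follows (our illustration): a CNP worker answers each membership query independently, flipping the true label with a fixed probability η*. The helper `true_label` (derived from the golden rules) is an assumption of the sketch.

import random

def make_cnp_worker(true_label, eta_star, rng=None):
    """Return a getAnswer-like function for a synthetic CNP worker.

    true_label(value) -> '+' or '-' gives the ground-truth label;
    eta_star is the worker's fixed error rate.
    """
    rng = rng or random.Random(0)
    def answer(value):
        label = true_label(value)
        if rng.random() < eta_star:          # independent random mistake
            return '-' if label == '+' else '+'
        return label
    return answer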
5.3 Multiple Noisy Workers
As discussed in Section 4, alfred builds on alfη, and recruits additional workers to estimate their error rates and to find the correct rule at the same time. We rely on the termination strategy of the outer algorithm (halt_alfred) to achieve the target quality, while for the inner alfη instances we use the same cost-focused termination policy (halt_alfη) as in the previous experiment. To evaluate alfred's performance, we organized as many tasks as the number of attributes of our experiment (40). For each task we executed alfred (with λ_η = 10⁻⁴), recruiting workers from a virtual population with the same error rate distribution observed over the real workers (the results have been averaged over 20 executions).

Table 2: alfred inference with synthetic workers

                        average                        max
                 |W|    F     #MQ     |η_w − η*|   |W|   #MQ    σ_F
    alfred_no    2.33   1     18.6    —             9     83     0.01
    alfred       2.07   1     16.1    0.8%          4     44     0.01
    alfred*      2.05   1     16.07   0%            4     40     0.01

alfred's results in Table 2 demonstrate the role played by the workers' error rate estimation. We compare the algorithm against a baseline (alfred_no) in which the error rate estimation is disabled (we just set η_w = η̂), and against a bound (alfred*) in which an oracle sets η_w = η*. The workers' error rate estimation is precise (|η_w − η*| = 0.8% when the learning terminates), and it allows the system to save queries (16.1 vs 18.6 on average). The average number of MQ posed by alfred to learn the correct rule is only a negligible amount larger than the lower bound set by alfred*. The costs are more than twice those paid running alfη with a single worker (16.1 vs. 7.58). However, notice that alfred always concluded the tasks with a perfect result, and that it is robust to workers' error rates (σ_F = 1%). alfred terminates in most of the cases (94%) engaging only 2 workers, and seldom recruited 3 or 4 workers (5% and 1% of the cases, respectively). Overall, alfred recruited more workers, thus paying for their answers, only when needed to achieve the target quality of the output wrapper.
6. RELATED WORK
Wrapper induction for extracting data from web pages has been the subject of much research [5]. A wrapper inference approach tolerant to noise in the training data has been proposed in [8]; however, it applies only to domains where it is possible to automatically obtain a set of annotations. Active learning approaches [14] have recently gained interest, as they can produce exponential improvements in the number of samples w.r.t. traditional supervised approaches [4]. The advent of crowdsourcing platforms has led to new challenges; the main issue is to learn from noisy observations generated by non-expert users.

Wrapper induction techniques that rely on active learning approaches have been proposed in [11, 13]. These studies rely on a more complicated user interaction than ours, since the user has to choose the correct wrapper within a set of ranked solutions. Also, they do not consider the presence of noise in the training data. Many works have studied the problem of learning with noisy data coming from crowdsourcing platform workers, e.g., [9, 12, 15]. [15] shows that when labeling is not perfect, the selective acquisition of multiple good labels is crucial, and that repeated labeling can improve both label quality and model quality. [12] proposes a crowdsourcing-assisted system, CDAS, for data analytics tasks; the system faces the presence of noisy answers by submitting redundant tasks and adopts a voting strategy to estimate the correct solution. Compared to our approach, the number of workers as well as their accuracies are statically determined, based on the workers' historical performances.

In our previous work [7], we studied wrapper inference with a single and perfect worker.

7. CONCLUSIONS
We presented wrapper inference algorithms specifically tailored to work with the support of crowdsourcing platforms. Our approach allows wrappers to be generated by posing simple questions, membership queries, to workers engaged on a crowdsourcing platform. We proposed two algorithms that consider the possibility of noisy answers: alfη recruits a single worker; alfred can dynamically engage multiple workers to improve the quality of the solution. We showed that alfred can produce high-quality wrappers at reasonable costs, and that the quality of the output wrapper is highly predictable.

8. REFERENCES
[1] D. Angluin. Queries revisited. Theor. Comput. Sci., 313(2):175–194, 2004.
[2] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, Apr. 1988.
[3] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD 2003.
[4] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.
[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng., 18(10):1411–1428, 2006.
[6] V. Crescenzi and P. Merialdo. Wrapper inference for ambiguous web pages. Applied Artificial Intelligence, 22(1&2):21–52, 2008.
[7] V. Crescenzi, P. Merialdo, and D. Qiu. A framework for learning web wrappers from the crowd. In WWW 2013.
[8] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219–230, 2011.
[9] A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems on the world-wide web. Commun. ACM, 54(4):86–96, Apr. 2011.
[10] X. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Solomon: Seeking the truth via copying detection. PVLDB, 3(2):1617–1620, 2010.
[11] U. Irmak and T. Suel. Interactive wrapper generation with minimal user effort. In WWW 2006.
[12] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: a crowdsourcing data analytics system. In VLDB 2012.
[13] I. Muslea, S. Minton, and C. A. Knoblock. Active learning with multiple views. JAIR 2006.
[14] B. Settles. Active learning literature survey. CS Tech. Rep. 1648, University of Wisconsin–Madison, 2009.
[15] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD 2008.