<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wrapper Generation Supervised by a Noisy Crowd</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valter Crescenzi</string-name>
          <email>crescenz@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Merialdo</string-name>
          <email>merialdo@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Disheng Qiu</string-name>
          <email>disheng@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Dipartimento di Ingegneria, Università degli Studi Roma Tre, Via della Vasca Navale</institution>
          ,
          <addr-line>79 - Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
<p>We present solutions based on crowdsourcing platforms to support the large-scale production of accurate wrappers around data-intensive websites. Our approach is based on supervised wrapper induction algorithms that delegate the burden of generating the training data to the workers of a crowdsourcing platform. Workers are paid for answering simple membership queries chosen by the system. We present two algorithms: a single-worker algorithm (alfη) and a multiple-workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers are needed. The experiments that we conducted on real and synthetic data are encouraging: our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. INTRODUCTION</title>
      <p>
        The abundance of data contained in web pages has motivated many research efforts towards the development of effective methods and tools for generating web wrappers, i.e., rules that allow the extraction of data from web pages. Supervised approaches to inferring web wrappers have limited scalability, mainly because they require a set of training data, typically provided as labeled values. Unsupervised approaches (e.g., [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ]) have been investigated as an attempt to “scale up” the wrapper generation process by overcoming the need for training data. Unfortunately, they have limited applicability because of the low precision of the produced wrappers.
      </p>
      <p>Crowdsourcing platforms represent an intriguing
opportunity to “scale-out” supervised wrapper inference approaches.
These platforms support the assignment of mini-tasks to
people recruited on the Web, and thus allow the engagement
of a large number of workers to produce massive amounts of
training data.</p>
      <p>Copyright © 2013 for the individual papers by the papers’ authors. Copying
permitted for private and academic purposes. This volume is published and
copyrighted by its editors.</p>
      <p>However, generating wrappers with the support of crowdsourcing platforms introduces a number of challenges that have not been addressed in the literature: the mini-tasks submitted to the platform should be extremely simple, since they are performed by non-expert workers, and their number should be minimized to contain the costs.</p>
      <p>
        We are developing a framework, alf [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], that addresses the above issues to let the crowd effectively and efficiently supervise the wrapper generation process (a demo is available at http://alfred.dia.uniroma3.it). alf progressively infers a wrapper by posing membership queries (MQ), which are the simplest form of queries [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], since they admit only a yes/no answer (e.g., “Is the string ‘John Wayne’ a value to extract?”); alf implements an active learning algorithm to select the queries that lead most quickly to an accurate wrapper, thus reducing the costs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; finally, in this paper we extend alf with a probabilistic model that accounts for errors (wrong answers) introduced by inaccurate workers [
        <xref ref-type="bibr" rid="ref15 ref2">2, 15</xref>
        ].
      </p>
      <p>
        We have experimentally observed that with perfect workers, i.e., workers that do not make any mistake in answering the proposed membership queries, alf generates the correct extraction rules while reducing the number of queries by 4× on average with respect to a random choice [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, it is well known that the workers recruited on crowdsourcing platforms are far from perfect. In an empirical evaluation that we conducted on a popular crowdsourcing platform, we observed a significant number of incorrect answers (around 10% on average) even for the simple membership queries posed by our system.
      </p>
      <p>This paper extends our framework to manage workers that may return incorrect answers. First, we introduce alfη, which extends the underlying model of alf to deal with the noisy answers of a single worker. The presence of errors introduces the challenging issue of deciding when to stop the learning process. Intuitively, when a worker is inaccurate, the cost of acquiring her answers may no longer be justified by the increase in quality that these answers produce in the inferred wrapper. alfη needs an estimate of the worker’s error rate to reason about the optimal number of queries that should be assigned to the worker. Unfortunately, at most a rough estimate of the error rate is available when the worker is engaged.</p>
      <p>Then, to overcome this issue, we introduce alfred (alf with redundancy), an algorithm that builds on alfη to find the best solution according to the training data provided by multiple workers. It adopts the conventional technique of coping with errors by engaging multiple workers on the same tasks. However, alfred decides the number of workers at runtime, during the learning process, and minimizes the costs by engaging only the workers actually needed to achieve the desired quality. alfred exploits the weighted consensus among multiple workers to estimate their error rates, so that better stopping conditions can be crafted to take into account both the cost (quantified as the number of queries) and the quality of the wrapper (as estimated by a probabilistic model).</p>
      <p>
        Contributions. Overall, the paper makes several contributions: (i) we extend our crowd-based wrapper inference framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in order to manage noisy answers; (ii) we propose a principled approach to decide at runtime how many workers should be engaged to deal with the presence of noisy answers; (iii) we show how to estimate the workers’ error rates during the learning; (iv) we devise several termination strategies aimed at a fair trade-off between output quality and cost; (v) we report a set of preliminary experimental results with both synthetic and real answers collected from the crowd.
      </p>
      <p>Paper outline. The paper is organized as follows: Section 2 formalizes our setting and presents the extension of our previous probabilistic model to deal with noisy answers; Section 3 introduces alfη, the active learning algorithm for a single noisy worker, and Section 4 presents alfred, its generalization to multiple workers. Section 5 reports the experimental results. Section 6 discusses related work, and Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MODELING WRAPPER QUALITY AND NOISY WORKERS</title>
      <p>We focus on data-intensive websites whose pages are generated by scripts that embed data from an underlying database into an HTML template. Let U = {p1, . . . , pn} be an ordered set of pages generated by the same script. Given an attribute of interest published in the pages, its values can be extracted by means of an extraction rule (or simply rule). The value extracted by a rule r from a page p, denoted by r(p), is either a string occurrence from the HTML source code of p or a special nil value. A rule r applied over the pages in U returns the ordered set of values r(p1), . . . , r(pn), and thus provides a concrete tool to build a vector of values, denoted by r(U), indexed by the pages of U.</p>
      <p>We propose a wrapper induction process that requires as input the set U of pages to be wrapped and only a single initial annotation v0 (which is assumed correct) of the attribute value to extract. Such an input annotation may be supplied either manually or automatically, by looking up in the page a golden value from an available database; the approach can be easily generalized to deal with multiple initial annotations.</p>
      <p>
        The inference process starts by generating a hypothesis space, i.e., a set Rv0 of candidate rules that extract the given initial annotation v0. We consider extraction rules defined by means of expressions belonging to a simple fragment of XPath; namely, we use absolute and relative XPath expressions that specify paths to the leaf node containing the value to be extracted. Absolute rules specify paths that start either from the document root or from a node having an ‘id’; relative rules start from a template node working as pivot. Textual leaves that occur once in every input page are considered template nodes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The inference process evaluates the candidate rules in Rv0 by posing questions to workers recruited from a crowdsourcing platform: they are shown a page and asked whether a given value v from that page corresponds to a value of the target attribute. The goal is to select the rule working not only on the annotated sample page from which it has been generated, but also for all the other pages in the input collection U. Each query is formed by picking a value from the set VvR0(U) of values extracted from the pages in U by the candidate rules in Rv0.</p>
      <p>Figure 1 shows an example: suppose that we are interested in generating a wrapper that extracts the Title from the fictional set of movie pages U = {p1, p2, p3} whose DOM trees are sketched in Figure 1 (left). Assume that the initial annotation v0 = ‘City of God’ is supplied on the sample page p1. Figure 1 (right) shows the set Rv0 = {r1, r2, r3} of candidate rules generated from this initial annotation. The queries composed by the inference process use the values VvR0(U) that appear as elements of the vectors extracted by the rules in Rv0 from the pages in U.</p>
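      <p>The two rule classes just described can be exercised on a toy version of the Figure 1 scenario. The sketch below (hypothetical markup and helper names; the real system generates XPath expressions) evaluates one absolute rule and two pivot-based relative rules over three pages, reproducing the vectors r1(U), r2(U), r3(U):</p>

```python
import xml.etree.ElementTree as ET

# Toy pages mirroring Figure 1 (hypothetical markup): every page has the
# title in the first table row; p2 lacks the "Director:" row and p3
# lacks the "Ratings:" row.
PAGES = [
    "<html><table><tr><td>City of God</td></tr>"
    "<tr><td>Ratings: 8.6</td></tr><tr><td>Director: F. Meirelles</td></tr></table></html>",
    "<html><table><tr><td>Inception</td></tr><tr><td>Ratings: 8.8</td></tr></table></html>",
    "<html><table><tr><td>Oblivion</td></tr><tr><td>Director: J. Kosinski</td></tr></table></html>",
]

def absolute_rule(page):
    """Absolute rule in the spirit of r1: the cell of the first table row."""
    td = ET.fromstring(page).find("table/tr[1]/td")
    return None if td is None else td.text

def pivot_rule(pivot_text):
    """Relative rule: start from the template node containing pivot_text,
    climb to the table, and return the cell of its first row."""
    def rule(page):
        root = ET.fromstring(page)
        parent = {c: p for p in root.iter() for c in p}
        for td in root.iter("td"):
            if td.text and pivot_text in td.text:
                table = parent[parent[td]]      # td -> tr -> table
                return table.find("tr[1]/td").text
        return None                             # pivot missing: nil
    return rule

r1, r2, r3 = absolute_rule, pivot_rule("Ratings:"), pivot_rule("Director:")
vectors = {name: [r(p) for p in PAGES]
           for name, r in (("r1", r1), ("r2", r2), ("r3", r3))}
# vectors["r2"] is (City of God, Inception, nil): its pivot is missing on p3
```

      <p>All three candidate rules agree on p1 (where the annotation was given) but disagree elsewhere, which is exactly the ambiguity the membership queries must resolve.</p>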
      <p>The binary answer l, with l ∈ {−, +}, supplied by a worker adorns the queried value v with either a positive or a negative label, producing a labeled value denoted by vl. An ordered set of labeled values composes a training sequence (t.s.), denoted Lk, so that L0 = {v0+} and Lk+1 = Lk ∪ {vkl}.</p>
      <p>Given a t.s. Lk, we introduce a probabilistic model for
estimating the probability P (r|Lk) of each candidate rule
r ∈ Rv0 of being correct for the whole set of input pages U ,
and the probability P (Rv0 |Lk) that the correct rule is not
present in Rv0 . These probabilities are updated after each
new labeled value vkl is observed, i.e., a worker labels with l
the value provided by a MQ on vk and the t.s. is expanded
to Lk+1 = Lk ∪ {vkl}.</p>
      <p>
        Bayesian update rules compute the posterior probabilities P(r|Lk+1) and P(Rv0|Lk+1) starting from the probabilities P(r|Lk) and P(Rv0|Lk), respectively, available before observing vkl. The process can be repeated treating each posterior probability as the prior required by the next iteration. The whole process is triggered using as base case the prior P(Rv0) that the correct rule has not been generated in Rv0, and the prior probability P(r) that r is a correct rule. Assuming that Rv0 is generated from a class of rules sufficiently expressive to include the target rule, we can fix P(Rv0) = 0 and uniformly set P(r) = 1/|Rv0|. (In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where we study the possibility of dynamically tuning the expressiveness of the class, we develop the opposite assumption.)
      </p>
      <p>
        Bayesian update rules require the definition of the p.d.f. P(vkl|r, Lk). This is usually obtained by introducing a probabilistic generative model that abstracts the actual process leading to the generation of the t.s. For the sake of simplicity, we adopt the Classification Noise Process (CNP) model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to describe inaccurate workers that may make independent and random mistakes with an expected error rate η, and we assume that the next value to query is randomly chosen according to a uniform p.d.f. over the values VvR0(U) \ Lk.
      </p>
      <p>Let Pη(·) denote a p.d.f. over all possible t.s. in the presence of noise; it follows that:</p>
      <p>Pη(vkl|r, Lk) = P(vkl|r, Lk) · (1 − η) + P(vk−l|r, Lk) · η   (1)</p>
      <p>[Figure 1: three sample movie pages p1, p2, p3 (left) and the candidate rules Rv0 = {r1, r2, r3} (right), where:
r1 = /html/table/tr[1]/td
r2 = //td[contains(.,“Ratings:”)]/../../tr[1]/td
r3 = //td[contains(.,“Director:”)]/../../tr[1]/td
and the extracted vectors are r1(U) = (City of God, Inception, Oblivion), r2(U) = (City of God, Inception, nil), r3(U) = (City of God, nil, Oblivion).]</p>
      <p>
        Here (1 − η) is the probability that a worker correctly labels with l the provided value vk, and η is the probability that she wrongly provides the opposite label, denoted by −l. P(vkl|r, Lk) is the corresponding noise-free p.d.f. and can be expressed as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
      </p>
      <p>P(vkl|r, Lk) = 1/(|VvR0(U)| − |Lk|) if vk ∈ Vl(r) \ Lk, and 0 otherwise, where Vl(r) is the subset of values in VvR0(U) that should be labeled l if r is the correct rule. In our generative model the values composing the t.s. Lk cannot be queried again; therefore VvR0(U) \ Lk is the set of values that can be used to generate new queries.</p>
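      <p>Because every unlabeled value is, under a given rule r, either a value to extract or not, the uniform 1/(|VvR0(U)| − |Lk|) factor is shared by all rules and cancels when the posterior is normalized; one Bayesian step of Eq. (1) then reduces to reweighting each rule by (1 − η) or η. A minimal sketch (helper names are hypothetical):</p>

```python
def bayes_update(post, agrees, eta):
    """One posterior update under the CNP noise model (Eq. 1).
    agrees[r] is True when rule r predicts the label the worker gave
    (i.e., the queried value is in V^l(r)); the shared uniform factor
    cancels in the normalization and is omitted."""
    new = {r: p * ((1 - eta) if agrees[r] else eta) for r, p in post.items()}
    z = sum(new.values())
    return {r: p / z for r, p in new.items()}

# e.g., the worker labels '+' a value extracted by r1 and r3 but not by r2
post = bayes_update({"r1": 1 / 3, "r2": 1 / 3, "r3": 1 / 3},
                    {"r1": True, "r2": False, "r3": True}, eta=0.1)
```

      <p>After this single answer the disagreeing rule r2 keeps posterior 0.1/1.9 ≈ 0.05, while r1 and r3 each rise to 0.9/1.9 ≈ 0.47; further queries are needed to separate them.</p>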
      <p>As we discuss in the next section, these probabilities are
used to effectively choose the next question, and to establish
a termination condition for the learning process.</p>
    </sec>
    <sec id="sec-4">
      <title>3. LEARNING EXTRACTION RULES</title>
      <p>The probabilistic model developed in the previous section aims at computing, given an observed t.s. Lk, the probability P(r|Lk) that a given extraction rule r within a set of candidate rules Rv0 is correct. In this section we present alfη, an active learning algorithm that exploits these probabilities to minimize the number of queries to the crowd workers.</p>
      <p>Listing 1 alfη: Active Learning Algorithm for a Single Noisy Worker
Input: a set of pages U
Input: the set of candidate rules Rv0
Input: a worker w and its associated error rate ηw
Output: a training sequence Lk
1: let k = 1; let L1 = {v0+};
2: while (not haltalf(Lk)) do
3:   vk ← chooseQuestion(Lk);
4:   l ← getAnswer(w, vk);
5:   Lk+1 ← Lk ∪ {vkl};
6:   compute P(r|Lk+1), ∀r ∈ Rv0;
7:   k ← k + 1;
8: end while
9: return Lk;</p>
      <p>Listing 1 contains the pseudo-code of the alfη algorithm: it processes a t.s. Lk built by actively asking a worker (here modeled by means of the subprogram getAnswer()) for the label of a value chosen by the subprogram chooseQuestion(); alfη computes a p.d.f. describing the probability of correctness over the rules in Rv0.</p>
      <p>In every iteration (lines 2–8), the worker is asked to label
a new value vk (lines 3–4) and the t.s. is expanded (line 5).
Then the probability P (r|Lk+1) is updated (line 6).</p>
      <p>
        chooseQuestion() selects the next value to be labeled by the worker, i.e., the next membership query. The chosen value is the one on which the rules most disagree, appropriately weighted according to their probability. This is equivalent to computing the vote entropy [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] for each v ∈ VvR0(U):
H(v) = −[P(v+|Lk) log P(v+|Lk) + P(v−|Lk) log P(v−|Lk)]
where
P(v+|Lk) = Σ r∈Rv0 : r(pv)=v P(r|Lk)   and   P(v−|Lk) = Σ r∈Rv0 : r(pv)≠v P(r|Lk)
are the probabilities that v is, respectively, either a value to extract or an incorrect value (pv denotes the page containing v). Intuitively, the entropy measures the uncertainty of a value, and querying the value with the highest entropy removes the most uncertain value:
chooseQuestion(Lk) { return argmax v∈VvR0(U) H(v); }
      </p>
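      <p>The vote entropy computation follows directly from the two sums above. In this simplified fragment (names hypothetical), post holds the current P(r|Lk) and, for each candidate value, a map records which rules extract it on its page:</p>

```python
import math

def vote_entropy(post, extracts):
    """H(v): entropy of the label of v induced by the rules' weighted vote.
    extracts[r] is True when rule r extracts v on v's page."""
    p_pos = sum(p for r, p in post.items() if extracts[r])   # P(v+|Lk)
    return -sum(q * math.log(q) for q in (p_pos, 1 - p_pos) if q > 0)

def choose_question(post, candidates):
    """candidates: {value: {rule: extracts-v?}} for still-unlabeled values."""
    return max(candidates, key=lambda v: vote_entropy(post, candidates[v]))

post = {"r1": 0.5, "r2": 0.3, "r3": 0.2}
pick = choose_question(post, {
    "v_all":   {"r1": True, "r2": True, "r3": True},    # no disagreement
    "v_split": {"r1": True, "r2": False, "r3": False},  # maximal disagreement
})
```

      <p>The value on which the probability mass splits evenly has the highest entropy and is queried first; a value all rules agree on carries no information.</p>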
      <p>Since different termination policies can be appropriate depending on the budget constraints and on the quality targets, we propose several implementations of haltalf(Lk).</p>
      <p>haltr: A simple policy is to stop when the probability of the best rule exceeds a threshold λr:
haltr(Lk) { return (maxr∈Rv0 P(r|Lk) &gt; λr); }
The main limitation of this strategy is that it does not take into account the costs.</p>
      <p>haltMQ: A simple policy to upper bound the costs is to stop as soon as the algorithm runs out of a “budget” of λMQ membership queries:
haltMQ(Lk) { return (|Lk| &gt; λMQ); }
The main limitation of this strategy is that it does not consider the quality of the output rules.</p>
      <p>haltH: A trade-off between quality and cost can be set by posing only queries that contribute enough to the quality of the inferred rules, and stopping as soon as the costs are no longer adequately rewarded by the increment of quality in the output rules. This can be easily modeled in terms of the maximum entropy: haltH terminates as soon as the maximum entropy of the values is below a threshold λH, i.e., no value is uncertain enough to deserve a query:
haltH(Lk) { return (maxv∈VvR0(U) H(v) &lt; λH); }</p>
      <p>Whichever termination policy alfη adopts, its efficacy in achieving an optimal trade-off between quality and cost is strongly affected by a correct estimation of the worker’s error rate η. An incorrect evaluation of η can lead to a waste of queries (when the error rate is overestimated) or to a quality loss (when it is underestimated), as confirmed by the experiments with different termination policies (reported in Section 5.2).</p>
      <p>In our experimental evaluation, we use as a termination
policy the following combination:</p>
      <p>haltalfη (W, Lk) = haltH (Lk) or haltMQ(Lk).</p>
      <p>This policy leverages the same trade-off between quality and
cost as in haltH , but focuses on the cost side by limiting
through haltMQ the budget allocated for each worker.</p>
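      <p>The alfη loop of Listing 1, combined with the haltH-or-haltMQ policy just described, can be sketched end-to-end on the toy instance of Figure 1. This is a simplified, hypothetical implementation: rules are represented directly by the values they extract, and the worker is a callable returning '+' or '-':</p>

```python
import math

# Candidate rules of Figure 1, each represented by the value it extracts
# per page (None = nil); r1 is the correct rule.
RULES = {
    "r1": {"p1": "City of God", "p2": "Inception", "p3": "Oblivion"},
    "r2": {"p1": "City of God", "p2": "Inception", "p3": None},
    "r3": {"p1": "City of God", "p2": None, "p3": "Oblivion"},
}
VALUES = {(p, v) for ext in RULES.values() for p, v in ext.items() if v}

def entropy(post, pv):
    page, val = pv
    q = sum(p for r, p in post.items() if RULES[r][page] == val)  # P(v+|Lk)
    return -sum(x * math.log(x) for x in (q, 1 - q) if x > 0)

def alf_eta(worker, eta=0.09, lam_h=0.2, lam_mq=10):
    post = {r: 1 / len(RULES) for r in RULES}
    ts = set()
    while len(ts) < lam_mq:                      # haltMQ: query budget
        pending = VALUES - ts
        if not pending:
            break
        v = max(pending, key=lambda x: entropy(post, x))  # chooseQuestion()
        if entropy(post, v) < lam_h:             # haltH: nothing uncertain
            break
        label = worker(v)                        # getAnswer()
        ts.add(v)
        # Bayesian update under the CNP noise model (Eq. 1)
        like = {r: (1 - eta) if ((RULES[r][v[0]] == v[1]) == (label == "+"))
                else eta for r in RULES}
        z = sum(post[r] * like[r] for r in RULES)
        post = {r: post[r] * like[r] / z for r in RULES}
    return post

# a perfect worker whose ground truth is r1
post = alf_eta(lambda pv: "+" if RULES["r1"][pv[0]] == pv[1] else "-")
```

      <p>On this instance two queries suffice: after labeling the two disagreement values, the posterior of r1 rises above the other rules and haltH fires because no uncertain value remains.</p>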
    </sec>
    <sec id="sec-5">
      <title>4. INFERENCE WITH A NOISY CROWD</title>
      <p>
        We now introduce another algorithm, alfred, that
follows the conventional approach based on redundancy [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
to deal with inaccurate workers. alfred improves alfη
robustness by dynamically recruiting additional workers to
whom it dispatches redundant tasks. It combines the
answers provided by a set W of multiple workers on the same
task to estimate the workers’ error rate, as well as to find
the most likely rule, at the same time.
      </p>
      <p>Our probabilistic model can easily deal with multiple workers by appending the t.s. Lw of every worker w ∈ W into a unique t.s., denoted by L = ⊎w∈W Lw, which is then used to train alfη. However, this approach raises two issues: (i) it is not clear how many workers should be involved in the learning process to achieve a good trade-off between output quality and cost; (ii) a correct estimation of the workers’ error rates strongly affects alfη performance, and each worker could have an error rate radically different from the others.</p>
      <p>To overcome these issues, alfred dynamically chooses the
amount of redundancy needed, and it exploits the
cumulative knowledge both to find the solution on which most of
the weighted workers consensus converges, and to estimate
the workers’ error rates.</p>
      <p>Listing 2 illustrates the alfred pseudo-code: the main loop (lines 2–15) alternates two phases. First, the workers are engaged (lines 3–8): each worker w trains an alfη instance, thus producing a t.s. Lw. In this phase, the worker is associated with an initial error rate (line 5); we use the average error rate empirically estimated in our experiment with real workers, but another option is to derive this value from the worker ranks provided by the crowdsourcing platform. The second phase (lines 10–14) starts as soon as the first phase has accumulated enough workers (at least W0). The goal of this phase is to leverage all the t.s. provided by the workers in order to compute both the p.d.f. of the rules P(r|L) and the individual error rate ηw of each worker w ∈ W.</p>
      <p>
        Our solution is inspired by the work on Truth Finding Problems [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and exploits the mutual dependency between the probability of correctness of the rules and the error rates of the workers answering the MQ. The error rate is defined in terms of probability as follows:
ηw = (1/|Lw|) · Σ vkl∈Lw [ 1 − P(vk+|L) if l = + ; P(vk+|L) if l = − ]
and it can be interpreted as the probability of the worker incorrectly answering a MQ, estimated by averaging over all the MQ in the t.s. Lw that she has provided. Since the probability is in turn defined in terms of the error rates, as shown in Eq. 1, their computation is interleaved (lines 11 and 13) until their values do not significantly change anymore (we have empirically observed the convergence of the algorithm; a formal proof is left as future work).
      </p>
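      <p>Listing 2 itself is not reproduced in the text; the following is a hypothetical sketch of its two-phase loop together with the error-rate estimator defined by the displayed formula. The consensus, per-worker training, and halting components are stubbed, and all names are illustrative rather than the paper's API:</p>

```python
def estimate_eta(ts_w, p_pos):
    """Error rate of one worker: the average, over her t.s. of
    (value, label) pairs, of the probability that each answer is wrong
    under the current consensus p_pos[v] = P(v+|L)."""
    s = sum((1 - p_pos[v]) if l == "+" else p_pos[v] for v, l in ts_w)
    return s / len(ts_w)

def alfred(workers, run_alf_eta, consensus, halt, w0=2, eta0=0.09):
    ts, etas = {}, {}
    while not halt(ts):
        w = workers.pop(0)                  # phase 1: engage a new worker
        etas[w] = eta0                      # initial error-rate guess
        ts[w] = run_alf_eta(w, etas[w])     # the worker trains an alf-eta instance
        if len(ts) >= w0:                   # phase 2: interleave consensus
            for _ in range(20):             # and error-rate estimation
                p_pos = consensus(ts, etas)  # (fixed iterations stand in for
                etas = {u: estimate_eta(ts[u], p_pos) for u in ts}  # convergence)
    return ts, etas

# toy run: two workers disagree on value "v"; a stub consensus trusts "+"
run = lambda w, eta: [("v", "+")] if w == "w1" else [("v", "-")]
ts, etas = alfred(["w1", "w2"], run, lambda ts, etas: {"v": 0.9},
                  halt=lambda ts: len(ts) >= 2)
```

      <p>With the stub consensus P(v+|L) = 0.9, the worker who answered '+' is estimated at η = 0.1 and the dissenting worker at η = 0.9, illustrating how the weighted consensus separates reliable from unreliable workers.</p>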
      <p>The termination condition of alfred can be set
according to different policies by specifying haltalfred. In our
implementation, in order to take into account both budget
constraints and quality targets, we used the following
combination:</p>
      <p>haltalfred(W, L) = haltr(L) or haltMQ(L)</p>
      <p>Observe that the recruitment of new workers negatively influences the latency of the system. A strategy to contain the latency is to recruit workers in bulk, but this is beyond the scope of the present paper.</p>
    </sec>
    <sec id="sec-6">
      <title>5. EXPERIMENTAL EVALUATION</title>
      <p>We use a dataset composed of 5 collections of pages: actor and movie pages from www.imdb.com; band and album pages from www.allmusic.com; stock quote pages from www.nasdaq.com. For each collection we selected about 10 attributes, for a total of 40 attributes. Then we manually crafted a golden XPath rule for each attribute. We randomly selected a training sample set of 500 pages from each collection of pages, and another (disjoint) test set of 2,000 pages. Our algorithm was run on the training sample, and the output extraction rules were evaluated on the test set. We compared the (non-null) values extracted by the golden rules against those extracted by the rules generated by our algorithm. For every generated rule r, and given a test set of pages U, we computed precision (P), recall (R), and F-measure (F) at the value level w.r.t. the corresponding golden rule rg, as follows: P = |rg(U) ∩ r(U)| / |r(U)|; R = |rg(U) ∩ r(U)| / |rg(U)|; F = 2·P·R / (P + R).</p>
      <p>[Figure 2: number of MQ posed under the haltH and haltr policies as the worker error rate η∗ ranges from 0.05 to 0.4.]</p>
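      <p>The value-level metrics can be computed by comparing the non-nil (page, value) pairs produced by the generated rule against those of the golden rule (a small sketch; the helper name is hypothetical):</p>

```python
def prf(golden, extracted):
    """Value-level precision, recall and F-measure of an inferred rule's
    non-nil (page, value) pairs against the golden rule's."""
    g, e = set(golden), set(extracted)
    inter = len(g & e)
    p = inter / len(e) if e else 0.0
    r = inter / len(g) if g else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# the inferred rule extracts the wrong value on page 2
scores = prf({(1, "a"), (2, "b")}, {(1, "a"), (2, "c")})
```

      <p>Pairing each value with its page ensures that identical strings extracted from different pages are not conflated when intersecting rg(U) and r(U).</p>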
      <p>
        We report the results of two sets of experiments: the first one was conducted to test alfη with a single worker; in the second set of experiments we consider alfred in the presence of multiple workers. We evaluate our algorithms by using synthetic workers following the CNP probabilistic model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], i.e., workers making random and independent errors with a fixed error rate η∗.
      </p>
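      <p>A CNP synthetic worker is straightforward to simulate (a sketch; the seed and helper name are arbitrary choices, not the paper's setup):</p>

```python
import random

def synthetic_worker(error_rate, seed=0):
    """CNP model: return the true label, flipped independently
    with probability error_rate."""
    rng = random.Random(seed)
    def answer(true_label):
        if rng.random() < error_rate:
            return "-" if true_label == "+" else "+"
        return true_label
    return answer
```

      <p>At the extremes the behavior is deterministic: an error rate of 0 yields a perfect worker, while an error rate of 1 always flips the label.</p>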
      <p>In order to estimate the error rate distribution over a population of real workers, we also conducted a preliminary experiment with workers engaged by means of CrowdFlower, a meta-platform that offers services to recruit workers on AMT.</p>
    </sec>
    <sec id="sec-7">
      <title>5.1 Experiments with AMT Workers</title>
      <p>The main intent of this preliminary experiment was to evaluate the error rate distribution of a population of real workers recruited on a crowdsourcing platform. We submitted 100 tasks to 100 distinct workers. Each task was paid 10¢ and consisted of a set of 20 MQ to generate the extraction rules for several attributes of our dataset. The ground truth was straightforwardly obtained from the results of our golden extraction rules. We used alfη configured with haltr as termination condition (λr = 0.8), and with η = 0.1.</p>
      <p>The average error rate of an AMT worker was η̂ = 0.09, with a standard deviation ση̂ = 0.11. About 1/3 of the workers responded correctly to all the queries, and the response time for each MQ was around 7 seconds. On average the number of MQ posed by the system to infer a rule was 4, and each task contained enough queries to learn 5 rules.</p>
      <p>The information obtained from this experiment was then used to set up realistic configurations of the synthetic workers for the other experiments: the empirically observed average error rate η̂ is used to set alfη’s parameter η = η̂, as well as the initial worker error rate estimate of alfred, ηw = η̂; also, the synthetic workers used to test alfred are created using the same error rate distribution observed on real workers.</p>
    </sec>
    <sec id="sec-8">
      <title>5.2 Single Noisy Worker</title>
      <p>The main goal of the experiment was to evaluate the effects of the workers’ mistakes on alfη, our algorithm in the absence of redundancy. Figure 2 shows the effects of an inaccurate worker with different termination strategies, obtained by setting η = η̂ = 0.09. We simulated workers with an error rate η∗ increasing from 0.05 to 0.4. The thresholds of the three termination conditions considered, haltr, haltMQ and haltH, were set to λr = 0.8, λMQ = 5 and λH = 0.2, respectively.</p>
      <p>As the error rate η∗ of the worker increases, the results
degrade with every termination strategy, as expected.
However, haltH detects the wider presence of uncertain values,
and tries to compensate with a greater number of queries;
conversely, since haltr focuses only on the most likely rule,
it poses a rather steady number of queries, and the output
quality is more seriously compromised.</p>
      <p>We then empirically evaluated how an incorrect setting of the parameter η, i.e., the expected worker error rate, influences alfη performance. We used a single worker with η∗ = η̂ = 0.09, and repeated several inference processes, configuring alfη with η ranging from 0 to 0.4, as reported in Figure 3.</p>
      <p>When the system overestimates the accuracy of the worker (η &lt; η∗) we observe a reduction of the number of MQ, but the quality of the wrapper drops: the system trusts the worker and terminates quickly, thus posing fewer questions than actually needed. When the system underestimates the worker’s accuracy (η &gt; η∗), some MQ are wasted since the system does not trust the worker. With an η larger than η∗ by +0.3, haltr requires more than 40 MQ, i.e., 5× those required when η = η∗. Observe that many of these MQ are wasted, since the F-measure gain is less than 5%.</p>
      <p>[Table 1: alfη results with |W| = 1: average F = 0.92, average #MQ = 7.58, max #MQ = 11, σF = 0.17.]</p>
      <p>Table 1 reports alfη results when the termination policy haltalfη has been instantiated by setting the parameters λH = 0.2 and λMQ = 10. alfη requires just a few queries (#MQ = 7.58) to learn rather accurate wrappers (F = 0.92). However, there is a significant standard deviation (σF = 17%) in the output quality that makes the algorithm not that robust to workers’ errors.</p>
    </sec>
    <sec id="sec-9">
      <title>5.3 Multiple Noisy Workers</title>
      <p>As discussed in Section 4, alfred builds on alfη, and recruits additional workers to estimate their error rate and to find the correct rule at the same time. We rely on the termination strategy of the outer algorithm (haltalfred) to achieve the target quality, while for the inner alfη instance we use the same termination policy (haltalfη) focused on the costs as in the previous experiment. To evaluate alfred performance, we organized as many tasks as the number of attributes of our experiment (40). For each task we executed alfred (with λη = 10−4) by recruiting workers from a virtual population with the same error rate distribution observed over the real workers (the results have been averaged over 20 executions).</p>
      <p>[Table 2: alfred results. alfredno: average |W| = 2.33, F = 1, #MQ = 18.6, max |W| = 9; alfred: average |W| = 2.07, F = 1, #MQ = 16.1, |ηw − η∗| = 0.8%, max |W| = 4; alfred∗: average |W| = 2.05, F = 1, #MQ = 16.07, |ηw − η∗| = 0%, max |W| = 4.]</p>
      <p>alfred’s results in Table 2 demonstrate the role played by the workers’ error rate estimation. We compare the algorithm against a baseline (alfredno) in which the error rate estimation is disabled (we just set ηw = η̂), and against a bound (alfred∗) in which an oracle sets ηw = η∗. The workers’ error rate estimation is precise (|ηw − η∗| = 0.8% when the learning terminates), and it allows the system to save queries (16.1 vs 18.6 on average). The average number of MQ posed by alfred to learn the correct rule is only a negligible amount larger than the lower bound set by alfred∗. The costs are more than twice those paid running alfη with a single worker (16.1 vs. 7.58). However, notice that alfred always concluded the tasks with a perfect result, and that it is robust to workers’ error rates (σF = 1%).</p>
      <p>alfred terminated in most of the cases (94%) engaging only 2 workers, and seldom recruited 3 or 4 workers (5% and 1% of the cases, respectively). Overall, alfred recruited more workers, thus paying for their answers, only when needed to achieve the target quality of the output wrapper.</p>
    </sec>
    <sec id="sec-10">
      <title>6. RELATED WORK</title>
      <p>
        Wrapper induction for extracting data from web pages has been the subject of much research [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A wrapper inference approach tolerant to noise in the training data has been proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; however, it applies only to domains where it is possible to automatically obtain a set of annotations.
      </p>
      <p>
        Active learning approaches [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have recently gained interest as they can produce exponential improvements in the number of samples w.r.t. traditional supervised approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The advent of crowdsourcing platforms has led to new challenges. The main issue is to learn from noisy observations generated by non-expert users.
      </p>
      <p>
        Wrapper induction techniques that rely on active
learning approaches have been proposed in [
        <xref ref-type="bibr" rid="ref11 ref13">11, 13</xref>
        ]. These studies
rely on a more complicated user interaction than ours, since
the user has to choose the correct wrapper from a set of
ranked solutions. Also, they do not consider the presence
of noise in the training data. Many works have studied the
problem of learning with noisy data coming from
crowdsourcing platform workers, e.g., [
        <xref ref-type="bibr" rid="ref12 ref15 ref9">9, 12, 15</xref>
        ]. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] shows that
when labeling is not perfect, selective acquisition of multiple
good labels is crucial and that repeated-labeling can improve
label quality and model quality. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposes a
crowdsourcing assisted system, CDAS, for data analytics tasks; the
system faces the presence of noisy answers by submitting
redundant tasks and adopts a voting strategy to estimate the
correct solution. In contrast to our approach, the number of
workers, as well as their accuracies, is determined statically,
based on the workers’ historical performance.
      </p>
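The voting strategies over redundant labels discussed above can be sketched, under the simplifying assumption that per-worker accuracies are known, as a log-odds weighted majority vote. This is an illustrative sketch of the general technique, not the mechanism of any cited system; names and weights are hypothetical.

```python
import math

def weighted_vote(labels_with_accuracy):
    """labels_with_accuracy: list of (label: bool, accuracy in (0.5, 1)).
    Each vote is weighted by the log-odds of the worker's accuracy, so
    more accurate workers count more. Returns the winning label."""
    score = 0.0
    for label, acc in labels_with_accuracy:
        weight = math.log(acc / (1 - acc))
        score += weight if label else -weight
    return score > 0

# One very accurate worker outvotes two mediocre dissenters.
print(weighted_vote([(True, 0.95), (False, 0.6), (False, 0.6)]))  # True
```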
      <p>
        In our previous work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we studied wrapper inference
with a single and perfect worker.
      </p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSIONS</title>
      <p>We presented wrapper inference algorithms specifically
tailored for working with the support of crowdsourcing
platforms. Our approach allows the wrappers to be generated
by posing simple questions, membership queries, to
workers engaged on a crowdsourcing platform. We proposed two
algorithms that cope with the possibility of noisy answers:
alfη recruits a single worker, while alfred can dynamically
engage multiple workers to improve the quality of the solution.
We showed that alfred can produce high quality wrappers
at reasonable costs, and that the quality of the output
wrapper is highly predictable.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Angluin</surname>
          </string-name>
          .
          <article-title>Queries revisited</article-title>
          .
          <source>Theor. Comput. Sci.</source>
          ,
          <volume>313</volume>
          (
          <issue>2</issue>
          ):
          <fpage>175</fpage>
          -
          <lpage>194</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Angluin</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Laird</surname>
          </string-name>
          .
          <article-title>Learning from noisy examples</article-title>
          .
          <source>Mach. Learn.</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>343</fpage>
          -
          <lpage>370</lpage>
          , Apr.
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arasu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Extracting structured data from web pages</article-title>
          .
          <source>In SIGMOD</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Balcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hanneke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          .
          <article-title>The true sample complexity of active learning</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>80</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>111</fpage>
          -
          <lpage>139</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Girgis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. F.</given-names>
            <surname>Shaalan</surname>
          </string-name>
          .
          <article-title>A survey of web information extraction systems</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng.</source>
          ,
          <volume>18</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1411</fpage>
          -
          <lpage>1428</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          .
          <article-title>Wrapper inference for ambiguous web pages</article-title>
          .
          <source>JAAI</source>
          ,
          <volume>22</volume>
          (
          <issue>1&amp;2</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>52</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Qiu</surname>
          </string-name>
          .
          <article-title>A framework for learning web wrappers from the crowd</article-title>
          .
          <source>In WWW</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Soliman</surname>
          </string-name>
          .
          <article-title>Automatic wrappers for large scale web extraction</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>219</fpage>
          -
          <lpage>230</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Crowdsourcing systems on the world-wide web</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>54</volume>
          (
          <issue>4</issue>
          ):
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          , Apr.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Berti-Equille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Solomon: Seeking the truth via copying detection</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1617</fpage>
          -
          <lpage>1620</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Irmak</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <article-title>Interactive wrapper generation with minimal user effort</article-title>
          .
          <source>In WWW</source>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>CDAS: a crowdsourcing data analytics system</article-title>
          .
          <source>In VLDB</source>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Muslea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          .
          <article-title>Active learning with multiple views</article-title>
          .
          <source>JAIR</source>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          .
          <article-title>Active learning literature survey</article-title>
          .
          <source>CS Tech. Rep. 1648</source>
          , University of Wisconsin-Madison,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Provost</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          .
          <article-title>Get another label? Improving data quality and data mining using multiple, noisy labelers</article-title>
          .
          <source>In KDD</source>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>