DBCrowd 2013: First VLDB Workshop on Databases and Crowdsourcing

Wrapper Generation Supervised by a Noisy Crowd

Valter Crescenzi, Paolo Merialdo, Disheng Qiu
Dipartimento di Ingegneria, Università degli Studi Roma Tre
Via della Vasca Navale, 79 – Rome, Italy
{crescenz, merialdo, disheng}@dia.uniroma3.it

ABSTRACT
We present solutions based on crowdsourcing platforms to support the large-scale production of accurate wrappers around data-intensive websites. Our approach is based on supervised wrapper induction algorithms that shift the burden of generating the training data to the workers of a crowdsourcing platform. Workers are paid for answering simple membership queries chosen by the system. We present two algorithms: a single-worker algorithm (alfη) and a multiple-workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries. alfred estimates the workers' error rate to decide at runtime how many workers are needed. The experiments that we conducted on real and synthetic data are encouraging: our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.

1. INTRODUCTION
The abundance of data contained in web pages has motivated many research efforts towards the development of effective methods and tools for generating web wrappers, i.e., rules that allow the extraction of data from web pages. Supervised approaches to inferring web wrappers have limited scalability, mainly because they require a set of training data, typically provided as labeled values. Unsupervised approaches (e.g., [3, 6]) have been investigated as an attempt to "scale up" the wrapper generation process by overcoming the need for training data. Unfortunately, they have limited applicability because of the low precision of the produced wrappers.

Crowdsourcing platforms represent an intriguing opportunity to "scale out" supervised wrapper inference approaches. These platforms support the assignment of mini-tasks to people recruited on the Web, and thus allow the engagement of a large number of workers to produce massive amounts of training data. However, generating wrappers with the support of crowdsourcing platforms introduces a number of challenges that were not addressed in the literature: the mini-tasks submitted to the platform should be extremely simple, since they are performed by non-expert workers, and their number should be minimized to contain the costs.

We are developing a framework, alf [7], that addresses the above issues to let the crowd effectively and efficiently supervise the wrapper generation process.¹ alf progressively infers a wrapper by posing membership queries (MQ), which are the simplest form of queries [1], since they admit only a yes/no answer (e.g., "Is the string 'John Wayne' a value to extract?"); alf implements an active learning algorithm to select the queries that most quickly lead to the generation of an accurate wrapper, thus reducing the costs [14]; finally, here we extend alf to adopt a probabilistic model that accounts for errors (wrong answers) introduced by inaccurate workers [2, 15].

We have experimentally observed that with perfect workers, i.e., workers that do not make any mistake in answering the proposed membership queries, alf generates the correct extraction rules while reducing the number of queries by 4× on average with respect to a random choice [7]. However, it is well known that the workers recruited on crowdsourcing platforms are far from perfect. In an empirical evaluation that we conducted on a popular crowdsourcing platform, we experienced a significant number of incorrect answers (around 10% on average) even for the simple membership queries posed by our system.

This paper extends our framework to manage workers that may return incorrect answers. First, we introduce alfη, which extends the underlying model of alf to deal with the noisy answers of a single worker. The presence of errors introduces the challenging issue of deciding when to stop the learning process. Intuitively, when a worker is inaccurate, the cost of acquiring her answers may not be justified by the increment of quality that these answers produce in the inferred wrapper. alfη needs an estimate of the worker's error rate to reason about the optimal number of queries that should be assigned to the worker. Unfortunately, at most a rough estimate of the error rate is available when the worker is engaged.

To overcome this issue, we then introduce alfred (alf with redundancy), an algorithm that builds on alfη to find the best solution according to the training data provided by multiple workers. It adopts the conventional technique of facing the presence of errors by engaging multiple workers to solve the same tasks. However, alfred decides the number of workers during the learning process, at runtime, and minimizes the costs by engaging only the workers actually needed to achieve the desired quality. alfred exploits the weighted consensus among multiple workers to estimate their error rates, so that better stopping conditions can be crafted to take into account both the cost (quantified as the number of queries) and the quality of the wrapper (as estimated by a probabilistic model).

¹A demo is available at http://alfred.dia.uniroma3.it.

Copyright © 2013 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
Contributions. Overall, the paper makes several contributions: (i) we extend our crowd-based wrapper inference framework [7] to manage noisy answers; (ii) we propose a principled approach to decide at runtime how many workers should be engaged to deal with the presence of noisy answers; (iii) we show how to estimate the workers' error rates during the learning; (iv) we define several termination strategies aiming at a fair trade-off between output quality and cost; (v) we report a set of preliminary experimental results with both synthetic and real answers collected from the crowd.

Paper outline. The paper is organized as follows: Section 2 formalizes our setting and presents the extension of our previous probabilistic model to deal with noisy answers; Section 3 introduces alfη, the active learning algorithm for a single noisy worker, and Section 4 presents alfred, its generalization to multiple workers. Section 5 reports the experimental results. Section 6 discusses related work, and Section 7 concludes the paper.

2. MODELING WRAPPER QUALITY AND NOISY WORKERS
We focus on data-intensive websites whose pages are generated by scripts that embed data from an underlying database into an HTML template. Let U = {p_1, ..., p_n} be an ordered set of pages generated by the same script. Given an attribute of interest published in the pages, its values can be extracted by means of an extraction rule (or simply rule). The value extracted by a rule r from a page p, denoted by r(p), is either a string occurrence from the HTML source code of p, or a special nil value. A rule r over the pages in U returns the ordered set of values r(p_1), ..., r(p_n) and represents a concrete tool to build a vector of values, denoted by r(U), indexed by the pages of U.

We propose a wrapper induction process that requires as input the set U of pages to be wrapped, and only a single initial annotation v_0 (which is assumed correct) of the attribute value to extract.²

The inference process starts by generating a space of hypotheses, i.e., a set R_{v0} of candidate rules that extract the given initial annotation v_0. We consider extraction rules defined by means of expressions belonging to a simple fragment of XPath; namely, we use absolute and relative XPath expressions that specify paths to the leaf node containing the value to be extracted. Absolute rules specify paths that start either from the document root or from a node having an 'id'; relative rules start from a template node working as pivot. Textual leaves that occur once in every input page are considered template nodes [3].

The inference process evaluates the candidate rules in R_{v0} by posing questions to workers recruited from a crowdsourcing platform: they are shown a page and asked whether a given value v from that page corresponds to a value of the target attribute. The goal is to select the rule that works not only on the annotated sample page from which it has been generated, but also for all the other pages in the input collection U. Each query is formed by picking a value from the set V_{R_{v0}}(U) of values extracted from the pages in U by the candidate rules R_{v0}.

Figure 1 shows an example: suppose that we are interested in generating a wrapper that extracts the Title from the fictional set of movie pages U = {p_1, p_2, p_3} whose DOM trees are sketched in Figure 1 (left). Assume that the initial annotation v_0 = 'City of God' is supplied on the sample page p_1. Figure 1 (right) shows the set R_{v0} = {r_1, r_2, r_3} of candidate rules generated from this initial annotation. The queries composed by the inference process use the values V_{R_{v0}}(U) that appear as elements of the vectors extracted by the rules in R_{v0} from the pages in U.

    rules \ pages    p_1            p_2          p_3
    r_1              City of God    Inception    Oblivion
    r_2              City of God    Inception    nil
    r_3              City of God    nil          Oblivion

    r_1 = /html/table/tr[1]/td
    r_2 = //td[contains(.,"Ratings:")]/../../tr[1]/td
    r_3 = //td[contains(.,"Director:")]/../../tr[1]/td

Figure 1: (Left) DOM of three sample pages; (Right) Extraction rules in R_{v0} = {r_1, r_2, r_3} with v_0 = 'City of God' and the set of values V_{R_{v0}}(U) they extract from the pages in U = {p_1, p_2, p_3}.

²Such an input annotation may be supplied either manually or automatically, by looking up in the page a golden value from an available database. The approach can be easily generalized to deal with multiple initial annotations.
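To make the notion of candidate rules concrete, the following Python sketch (an editorial illustration, not part of the original system) applies the three XPath rules of Figure 1 to toy stand-ins for the pages p_1, p_2, p_3, assuming the lxml library; the page contents are invented for the example.

import lxml.html

# Toy stand-ins for the three movie pages of Figure 1 (assumed markup).
pages = {
    "p1": "<html><table><tr><td>City of God</td></tr>"
          "<tr><td>Ratings:</td></tr><tr><td>Director:</td></tr></table></html>",
    "p2": "<html><table><tr><td>Inception</td></tr>"
          "<tr><td>Ratings:</td></tr></table></html>",
    "p3": "<html><table><tr><td>Oblivion</td></tr>"
          "<tr><td>Director:</td></tr></table></html>",
}

rules = {
    "r1": "/html/table/tr[1]/td",
    "r2": "//td[contains(.,'Ratings:')]/../../tr[1]/td",
    "r3": "//td[contains(.,'Director:')]/../../tr[1]/td",
}

def extract(rule_xpath, page_source):
    """Return the textual value extracted by the rule, or None (nil)."""
    nodes = lxml.html.fromstring(page_source).xpath(rule_xpath)
    return nodes[0].text_content().strip() if nodes else None

for name, xpath in rules.items():
    # Each rule yields a vector of values indexed by the pages of U.
    print(name, [extract(xpath, src) for src in pages.values()])
    # e.g. r2 -> ['City of God', 'Inception', None]

Running the sketch reproduces the table of Figure 1: r_2 extracts nil on p_3 and r_3 extracts nil on p_2, since their pivot template nodes are missing from those pages.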
The binary answer l, with l ∈ {−, +}, supplied by a worker adorns the queried value v with either a positive or a negative label, producing a labeled value denoted by v^l. An ordered set of k labeled values composes a training sequence (t.s.), denoted L^k, so that L^1 = {v_0^+} and L^{k+1} = L^k ∪ {v_k^l}.

Given a t.s. L^k, we introduce a probabilistic model for estimating the probability P(r|L^k) of each candidate rule r ∈ R_{v0} being correct for the whole set of input pages U, and the probability P(R̄_{v0}|L^k) that the correct rule is not present in R_{v0}. These probabilities are updated after each new labeled value v_k^l is observed, i.e., a worker labels with l the value provided by a MQ on v_k, and the t.s. is expanded to L^{k+1} = L^k ∪ {v_k^l}.

Bayesian update rules compute the posterior probabilities P(r|L^{k+1}) and P(R̄_{v0}|L^{k+1}) starting from the probabilities P(r|L^k) and P(R̄_{v0}|L^k), respectively, available before observing v_k^l. The process can be repeated, treating each posterior probability as the prior required by the next iteration. The whole process is triggered using as base case the prior P(R̄_{v0}) of not having generated the correct rule in R_{v0}, and the prior probability P(r) that r is a correct rule. Assuming that R_{v0} is generated from a class of rules sufficiently expressive to include the target rule, we can fix P(R̄_{v0}) = 0,³ and uniformly set P(r) = 1/|R_{v0}|.

³In [7], where we study the possibility of dynamically tuning the expressiveness of the class, we develop the opposite assumption.
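The iterated Bayesian update can be sketched as follows (our illustration, in our own notation): each labeled value scales every rule's current probability by its likelihood and renormalizes, so the posterior after L^k becomes the prior for L^{k+1}. The likelihood function is left abstract here; it is made concrete by Eq. (1) below.

def update_posteriors(posteriors, labeled_value, likelihood):
    """One Bayesian update step over the candidate rules.

    posteriors: dict mapping rule -> P(r | L^k).
    likelihood(labeled_value, rule): P(v^l | r, L^k), e.g. Eq. (1).
    Returns a dict mapping rule -> P(r | L^{k+1}).
    """
    unnormalized = {r: p * likelihood(labeled_value, r)
                    for r, p in posteriors.items()}
    z = sum(unnormalized.values())  # evidence P(v^l | L^k); assumed > 0
    return {r: p / z for r, p in unnormalized.items()}

# Base case: uniform priors over the candidate rules,
# with P(correct rule not in R_{v0}) fixed to 0.
rules = ["r1", "r2", "r3"]
posteriors = {r: 1.0 / len(rules) for r in rules}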
Bayesian update rules require the definition of the p.d.f. P(v_k^l | r, L^k). This is usually obtained by introducing a probabilistic generative model that abstracts the actual process leading to the generation of the t.s. For the sake of simplicity, we adopt the Classification Noise Process (CNP) model [2] to describe inaccurate workers that may produce independent and random mistakes with an expected error rate η, and we assume that the next value to query is randomly chosen according to a uniform p.d.f. over all values in V_{R_{v0}}(U) \ L^k. Letting P_η(·) denote a p.d.f. over all possible t.s. in the presence of noise, it follows that:

    P_η(v_k^l | r, L^k) = P(v_k^l | r, L^k) · (1 − η) + P(v_k^{−l} | r, L^k) · η    (1)

where (1 − η) is the probability that a worker correctly labels the provided value v_k with l, and η is the probability that she wrongly provides the opposite label, here denoted by −l. P(v_k^l | r, L^k) is the corresponding noise-free p.d.f. and can be expressed as in [7]:

    P(v_k^l | r, L^k) = 1 / (|V_{R_{v0}}(U)| − |L^k|),   iff v_k ∈ V^l(r) \ L^k
                      = 0,                               otherwise

where V^l(r) is the subset of values in V_{R_{v0}}(U) that should be labeled l if r is the correct rule. In our generative model the values composing the t.s. L^k cannot be queried again; therefore V_{R_{v0}}(U) \ L^k is the set of values that can be used to generate new queries.
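A minimal sketch of the two likelihoods, under our assumed data model: `universe` stands for V_{R_{v0}}(U), `queried` for the set of already-labeled values, and `extracts(rule, value)` tells whether the rule would extract the value (these helper names are ours, not the paper's).

def noise_free_likelihood(value, label, rule, universe, queried, extracts):
    """P(v^l | r, L^k): uniform over the not-yet-queried values that the
    rule would label l; zero otherwise."""
    if value in queried:
        return 0.0
    consistent = (label == '+') == extracts(rule, value)
    return 1.0 / (len(universe) - len(queried)) if consistent else 0.0

def noisy_likelihood(value, label, rule, universe, queried, extracts, eta):
    """Eq. (1): mix the noise-free likelihoods of the observed label and
    of the opposite label according to the worker's error rate eta."""
    opposite = '-' if label == '+' else '+'
    return (noise_free_likelihood(value, label, rule,
                                  universe, queried, extracts) * (1 - eta)
            + noise_free_likelihood(value, opposite, rule,
                                    universe, queried, extracts) * eta)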
As we discuss in the next section, these probabilities are used to effectively choose the next question, and to establish a termination condition for the learning process.

3. LEARNING EXTRACTION RULES
The probabilistic model developed in the previous section aims at computing, given an observed t.s. L^k, the probability P(r|L^k) that a given extraction rule r within a set of candidate rules R_{v0} is correct. In this section we present alfη, an active learning algorithm that exploits these probabilities to minimize the number of queries to the crowd workers.

Listing 1  alfη: Active Learning Algorithm for a Single Noisy Worker
Input: a set of pages U
Input: the set of candidate rules R_{v0}
Input: a worker w and its associated error rate η_w
Output: a training sequence L^k
1: let k = 1; let L^1 = {v_0^+};
2: while (not halt_alf(L^k)) do
3:   v_k ← chooseQuestion(L^k);
4:   l ← getAnswer(w, v_k);
5:   L^{k+1} ← L^k ∪ {v_k^l};
6:   compute P(r|L^{k+1}), ∀r ∈ R_{v0};
7:   k ← k + 1;
8: end while
9: return L^k;

Listing 1 contains the pseudo-code of the alfη algorithm: it processes a t.s. L^k built by actively asking a worker (here modeled by means of the subprogram getAnswer()) for the label of a value chosen by the subprogram chooseQuestion(); alfη computes a p.d.f. describing the probability of correctness over the rules in R_{v0}. In every iteration (lines 2–8), the worker is asked to label a new value v_k (lines 3–4) and the t.s. is expanded (line 5). Then the probability P(r|L^{k+1}) is updated (line 6).

chooseQuestion() selects the next value to be labeled by the worker, i.e., the next membership query. The chosen value is the one on which the rules most disagree, appropriately weighted according to their probability. This is equivalent to computing the vote entropy [14] for each v ∈ V_{R_{v0}}(U):

    H(v) = −[ P(v^+|L^k) log P(v^+|L^k) + P(v^−|L^k) log P(v^−|L^k) ]

where

    P(v^+|L^k) = Σ_{r ∈ R_{v0} : r(p_v) = v} P(r|L^k)   and   P(v^−|L^k) = Σ_{r ∈ R_{v0} : r(p_v) ≠ v} P(r|L^k)

are the probabilities that v is, respectively, either a value to extract or an incorrect value (p_v denotes the page containing v). Intuitively, the entropy measures the uncertainty of a value, and querying the value with the highest entropy removes the most uncertain value:

    chooseQuestion(L^k) { return argmax_{v ∈ V_{R_{v0}}(U)} H(v); }

Since different termination policies can be appropriate depending on the budget constraints and on the quality targets, we propose several implementations of halt_alf(L^k).

halt_r: A simple policy is to stop when the probability of the best rule overcomes a threshold λ_r:

    halt_r(L^k) { return (max_{r ∈ R_{v0}} P(r|L^k) > λ_r); }

The main limitation of this strategy is that it does not take the costs into account.

halt_MQ: A simple policy to upper-bound the costs is to stop as soon as the algorithm runs out of a "budget" of λ_MQ membership queries:

    halt_MQ(L^k) { return (|L^k| > λ_MQ); }

The main limitation of this strategy is that it does not consider the quality of the output rules.

halt_H: A trade-off between quality and cost can be set by posing only queries that contribute enough to the quality of the inferred rules, and stopping as soon as the costs are considered not adequately rewarded by the increment of quality in the output rules. This can be easily modeled in terms of the maximum entropy: halt_H terminates as soon as the maximum entropy of the values is below a threshold λ_H, i.e., no value is uncertain enough to deserve a query:

    halt_H(L^k) { return (max_{v ∈ V_{R_{v0}}(U)} H(v) < λ_H); }
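The vote-entropy selection and the three halting predicates can be sketched as follows (our illustration; `extracts` is the hypothetical helper introduced earlier, `posteriors` maps each rule to P(r|L^k), and `candidates` are the not-yet-queried values; the default thresholds are those used in the experiments of Section 5).

import math

def value_entropy(value, posteriors, extracts):
    """Vote entropy H(v) of a candidate value."""
    p_plus = sum(p for r, p in posteriors.items() if extracts(r, value))
    p_minus = 1.0 - p_plus  # holds since P(R̄_{v0}) = 0 and Σ_r P(r|L^k) = 1
    h = 0.0
    for p in (p_plus, p_minus):
        if p > 0:
            h -= p * math.log(p)
    return h

def choose_question(candidates, posteriors, extracts):
    """Pick the value on which the weighted rules most disagree."""
    return max(candidates, key=lambda v: value_entropy(v, posteriors, extracts))

def halt_r(posteriors, lambda_r=0.8):   # quality only
    return max(posteriors.values()) > lambda_r

def halt_mq(ts, lambda_mq=5):           # budget only
    return len(ts) > lambda_mq

def halt_h(candidates, posteriors, extracts, lambda_h=0.2):  # trade-off
    return max(value_entropy(v, posteriors, extracts)
               for v in candidates) < lambda_h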
Whichever termination policy alfη adopts, its efficacy in achieving an optimal trade-off between quality and the costs of the learning tasks is strongly affected by a correct estimation of the worker's error rate, η. An incorrect estimate of η can lead to a waste of queries (when the error rate is overestimated) or to a quality loss (when it is underestimated), as confirmed by the experiments with different termination policies reported in Section 5.2.

In our experimental evaluation, we use as termination policy the following combination:

    halt_alfη(W, L^k) = halt_H(L^k) or halt_MQ(L^k).

This policy leverages the same trade-off between quality and cost as halt_H, but focuses on the cost side by limiting through halt_MQ the budget allocated for each worker.

4. INFERENCE WITH A NOISY CROWD
We now introduce another algorithm, alfred, that follows the conventional approach based on redundancy [15] to deal with inaccurate workers. alfred improves the robustness of alfη by dynamically recruiting additional workers to whom it dispatches redundant tasks. It combines the answers provided by a set W of multiple workers on the same task to estimate the workers' error rates and, at the same time, to find the most likely rule.

Our probabilistic model can easily deal with multiple workers by appending the t.s. L_w of every worker w ∈ W into a unique t.s., denoted by L = ⊎_{w∈W} L_w, which is then used to train alfη. However, this approach raises two issues: (i) it is not clear how many workers should be involved in the learning process to achieve a good trade-off between output quality and cost; (ii) a correct estimation of the workers' error rates strongly affects alfη performance, and each worker could have an error rate radically different from the others.

To overcome these issues, alfred dynamically chooses the amount of redundancy needed, and it exploits the cumulative knowledge both to find the solution on which most of the weighted workers' consensus converges, and to estimate the workers' error rates.

Listing 2  alfred: Active Learning Algorithm with Multiple Noisy Workers
Input: a set of pages U
Input: the set of candidate rules R_{v0}
Parameter: number of initial workers W_0
Parameter: a threshold λ_η for error rate convergence
Output: the most probable extraction rule r ∈ R_{v0}
1: let W = ∅;                              // set of engaged workers
2: repeat
3:   repeat
4:     let w = engageWorker();
5:     let η_w = getErrorRate(w);
6:     let L_w = alfη(U, R_{v0}, w);
7:     W ← W ∪ {w};
8:   until (|W| ≥ W_0);                    // wait for enough workers
9:   let L = ⊎_{w∈W} L_w;                  // append all t.s.
10:  repeat
11:    compute P(r|L) by using η_w, ∀r ∈ R_{v0};
12:    let η_w^prev = η_w, ∀w ∈ W;         // save previous η
13:    compute η_w by using P(r|L), ∀w ∈ W;
14:  until (Σ_{w∈W} (η_w − η_w^prev)² < λ_η · |W|);
15: until (halt_alfred(W, L));
16: return argmax_{r ∈ R_{v0}} P(r|L);

Listing 2 illustrates the alfred pseudo-code: the main loop (lines 2–15) alternates two phases. First, the workers are engaged (lines 3–8): each worker w trains an alfη instance, thus producing a t.s. L_w. In this phase, the worker is associated with an initial error rate (line 5).⁴ The second phase (lines 10–14) starts as soon as the first phase has accumulated enough workers (at least W_0). The goal of this phase is to leverage all the t.s. provided by the workers in order to compute both the p.d.f. of the rules, P(r|L), and the individual error rate η_w of each worker w ∈ W.

Our solution is inspired by the work on Truth Finding Problems [10], and exploits the mutual dependency between the probability of correctness of the rules and the error rates of the workers answering the MQ. The error rate is defined in terms of probability as follows:

    η_w = (1/|L_w|) · Σ_{v_k^l ∈ L_w} q(v_k^l),   where  q(v_k^l) = 1 − P(v_k^+|L)  iff l = +,  and  q(v_k^l) = P(v_k^+|L)  iff l = −

and it can be interpreted as the probability of the worker incorrectly answering a MQ, estimated by averaging over all the MQ in the t.s. L_w that she has provided. Since the rule probabilities are in turn defined in terms of the error rates, as shown in Eq. 1, the two computations are interleaved (lines 11 and 13) until their values do not significantly change anymore.⁵

⁴We use the average error rate empirically estimated in our experiment with real workers; another option is to derive this value from the worker ranks provided by the crowdsourcing platform.
⁵We have empirically observed the convergence of the algorithm. A formal proof is left as future work.
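The interleaved estimation of lines 10–14 can be sketched as a fixed-point iteration (our illustration, with hypothetical helper names): rule probabilities are recomputed from the current error rates, then each worker's error rate is re-estimated as the average probability that her answers are wrong, until the rates converge.

def interleaved_estimation(workers_ts, recompute_posteriors, prob_positive,
                           initial_eta, lambda_eta=1e-4):
    """Fixed-point estimation of rule probabilities and worker error rates.

    workers_ts: dict worker -> list of (value, label) pairs, i.e. L_w.
    recompute_posteriors(etas): P(r|L) from the merged t.s., given the etas.
    prob_positive(value, posteriors): P(v^+ | L).
    initial_eta: dict worker -> initial error rate estimate (line 5).
    """
    etas = dict(initial_eta)
    while True:
        posteriors = recompute_posteriors(etas)          # line 11
        prev = dict(etas)                                # line 12
        for w, ts in workers_ts.items():                 # line 13
            wrong = [1 - prob_positive(v, posteriors) if l == '+'
                     else prob_positive(v, posteriors)
                     for v, l in ts]
            etas[w] = sum(wrong) / len(wrong)
        # line 14: stop once the squared change in the etas is small enough
        if sum((etas[w] - prev[w]) ** 2 for w in etas) < lambda_eta * len(etas):
            return posteriors, etas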
The termination condition of alfred can be set according to different policies by specifying halt_alfred. In our implementation, in order to take into account both budget constraints and quality targets, we used the following combination:

    halt_alfred(W, L) = halt_r(L) or halt_MQ(L)

Observe that the recruitment of new workers negatively influences the latency of the system. A strategy to contain the latency is to recruit workers in bulk, but this is beyond the scope of the present paper.

5. EXPERIMENTAL EVALUATION
We use a dataset composed of 5 collections of pages: actor and movie pages from www.imdb.com; band and album pages from www.allmusic.com; and stock quote pages from www.nasdaq.com. For each collection we selected about 10 attributes, for a total of 40 attributes. Then we manually crafted a golden XPath rule for each attribute. We randomly selected a training sample set of 500 pages from each collection of pages, and another (disjoint) test set of 2,000 pages. Our algorithm was run on the training sample, and the output extraction rules were evaluated on the test set. We compared the (non-null) values extracted by the golden rules against those extracted by the rules generated by our algorithm. For every generated rule r, and given a test set of pages U, we computed precision (P), recall (R), and F-measure (F) at the value level w.r.t. the corresponding golden rule r_g, as follows:

    P = |r_g(U) ∩ r(U)| / |r(U)|;   R = |r_g(U) ∩ r(U)| / |r_g(U)|;   F = 2 · (P · R) / (P + R).
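A sketch of the value-level metrics (our illustration): since r(U) is a vector indexed by the pages of U, we interpret the intersection page-wise, counting a match when the generated rule extracts the same non-null value as the golden rule on the same page; this reading is our assumption, not spelled out in the paper.

def prf(golden_vector, extracted_vector):
    """Value-level precision, recall, and F-measure of a generated rule.

    golden_vector, extracted_vector: values indexed by page (None = nil).
    """
    matches = sum(1 for g, e in zip(golden_vector, extracted_vector)
                  if g is not None and g == e)
    n_extracted = sum(1 for e in extracted_vector if e is not None)
    n_golden = sum(1 for g in golden_vector if g is not None)
    p = matches / n_extracted if n_extracted else 0.0
    r = matches / n_golden if n_golden else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f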
We report the results of two sets of experiments: the first one was conducted to test alfη with a single worker; in the second set of experiments we consider alfred in the presence of multiple workers. We evaluate our algorithms by using synthetic workers following the CNP probabilistic model [2], i.e., workers making random and independent errors with a fixed error rate η*.

In order to estimate the error rate distribution over a population of real workers, we also conducted a preliminary experiment with workers engaged by means of CrowdFlower, a meta-platform that offers services to recruit workers on AMT.

5.1 Experiments with AMT Workers
The main intent of this preliminary experiment was to evaluate the error rate distribution of a population of real workers recruited on a crowdsourcing platform. We submitted 100 tasks to 100 distinct workers. Each task was paid 10¢ and consisted of a set of 20 MQ to generate the extraction rules for several attributes of our dataset. The ground truth was straightforwardly obtained from the results of our golden extraction rules. We used alfη configured with halt_r as termination condition (λ_r = 0.8), and with η = 0.1.

The average error rate of an AMT worker was η̂ = 0.09, with a standard deviation of σ̂_η = 0.11. About 1/3 of the workers responded correctly to all the queries, and the response time for each MQ was around 7 seconds. On average the number of MQ posed by the system to infer a rule was 4, and each task contained enough queries to learn 5 rules.

The information obtained from this experiment was then used to set up realistic configurations of the synthetic workers for the other experiments: the empirically observed average error rate η̂ is used to set alfη's parameter η = η̂, as well as the initial worker error rate estimate of alfred, η_w = η̂; also, the synthetic workers used to test alfred are created using the same error rate distribution observed on real workers.

5.2 Single Noisy Worker
The main goal of this experiment was to evaluate the effects of the workers' mistakes on alfη, our algorithm in the absence of redundancy. In Figure 2 we show the effects of an inaccurate worker with different termination strategies, setting η = η̂ = 0.09. We simulated workers with an error rate η* increasing from 0.05 to 0.4. The thresholds of the three termination conditions considered, halt_r, halt_MQ and halt_H, were set to λ_r = 0.8, λ_MQ = 5 and λ_H = 0.2, respectively.

Figure 2: alfη (η = 0.09) sensitivity to the worker error rate η*: cost (left) and quality (right) of the output wrapper.

Figure 3: alfη sensitivity to the expected worker error rate η with a noisy worker (η* = 0.09): cost (left) and quality (right) of the output wrapper.

As the error rate η* of the worker increases, the results degrade with every termination strategy, as expected. However, halt_H detects the wider presence of uncertain values, and tries to compensate with a greater number of queries; conversely, since halt_r focuses only on the most likely rule, it poses a rather steady number of queries, and the output quality is more seriously compromised.

We then empirically evaluated how an incorrect setting of the parameter η, i.e., the expected worker error rate, influences alfη performance. We used a single worker with η* = η̂ = 0.09, and repeated several inference processes, configuring alfη with η ranging from 0 to 0.4, as reported in Figure 3. When the system overestimates the accuracy of the worker (η < η*) we observe a reduction of the number of MQ, but the quality of the wrapper drops: the system trusts the worker and terminates quickly, thus posing fewer questions than actually needed. When the system underestimates the worker's accuracy (η > η*), some MQ are wasted since the system does not trust the worker. With an η larger than η* by +0.3, halt_r requires more than 40 MQ, i.e., 5× those required when η = η*. Observe that many of these MQ are wasted, since the F-measure gain is less than 5%.

Table 1: alfη inference with synthetic workers

                 average              max
           |W|    F      #MQ    |W|   #MQ    σ_F
    alfη    1     0.92   7.58    1    11     0.17

Table 1 reports the alfη results when the termination policy halt_alfη has been instantiated by setting the parameters λ_H = 0.2 and λ_MQ = 10. alfη requires just a few queries (#MQ = 7.58) to learn rather accurate wrappers (F = 0.92). However, there is a significant standard deviation (σ_F = 17%) in the output quality, which makes the algorithm not that robust to workers' errors.
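The synthetic workers used throughout this section can be sketched as follows (our illustration): a CNP worker answers each membership query independently, flipping the true label with a fixed probability η*. The helper `true_label` (derived from the golden rules) is an assumption of the sketch.

import random

def make_cnp_worker(true_label, eta_star, rng=None):
    """Return a getAnswer-like function for a synthetic CNP worker.

    true_label(value) -> '+' or '-' gives the ground-truth label;
    eta_star is the worker's fixed error rate.
    """
    rng = rng or random.Random(0)
    def answer(value):
        label = true_label(value)
        if rng.random() < eta_star:          # independent random mistake
            return '-' if label == '+' else '+'
        return label
    return answer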
5.3 Multiple Noisy Workers
As discussed in Section 4, alfred builds on alfη, and recruits additional workers to estimate their error rates and to find the correct rule at the same time. We rely on the termination strategy of the outer algorithm (halt_alfred) to achieve the target quality, while for the inner alfη instances we use the same cost-focused termination policy (halt_alfη) as in the previous experiment. To evaluate alfred's performance, we organized as many tasks as the number of attributes of our experiment (40). For each task we executed alfred (with λ_η = 10⁻⁴), recruiting workers from a virtual population with the same error rate distribution observed over the real workers (the results have been averaged over 20 executions).

Table 2: alfred inference with synthetic workers

                        average                        max
                 |W|    F     #MQ     |η_w − η*|   |W|   #MQ    σ_F
    alfred_no    2.33   1     18.6    —             9     83     0.01
    alfred       2.07   1     16.1    0.8%          4     44     0.01
    alfred*      2.05   1     16.07   0%            4     40     0.01

alfred's results in Table 2 demonstrate the role played by the workers' error rate estimation. We compare the algorithm against a baseline (alfred_no) in which the error rate estimation is disabled (we just set η_w = η̂), and against a bound (alfred*) in which an oracle sets η_w = η*. The workers' error rate estimation is precise (|η_w − η*| = 0.8% when the learning terminates), and it allows the system to save queries (16.1 vs 18.6 on average). The average number of MQ posed by alfred to learn the correct rule is only a negligible amount larger than the lower bound set by alfred*. The costs are more than twice those paid running alfη with a single worker (16.1 vs. 7.58). However, notice that alfred always concluded the tasks with a perfect result, and that it is robust to workers' error rates (σ_F = 1%). alfred terminates in most of the cases (94%) engaging only 2 workers, and seldom recruited 3 or 4 workers (5% and 1% of the cases, respectively). Overall, alfred recruited more workers, thus paying for their answers, only when needed to achieve the target quality of the output wrapper.
6. RELATED WORK
Wrapper induction for extracting data from web pages has been the subject of much research [5]. A wrapper inference approach tolerant to noise in the training data has been proposed in [8]; however, it applies only to domains where it is possible to automatically obtain a set of annotations. Active learning approaches [14] have recently gained interest, as they can produce exponential improvements in the number of samples w.r.t. traditional supervised approaches [4]. The advent of crowdsourcing platforms has led to new challenges; the main issue is to learn from noisy observations generated by non-expert users.

Wrapper induction techniques that rely on active learning approaches have been proposed in [11, 13]. These studies rely on a more complicated user interaction than ours, since the user has to choose the correct wrapper within a set of ranked solutions. Also, they do not consider the presence of noise in the training data. Many works have studied the problem of learning with noisy data coming from crowdsourcing platform workers, e.g., [9, 12, 15]. [15] shows that when labeling is not perfect, the selective acquisition of multiple good labels is crucial, and that repeated labeling can improve both label quality and model quality. [12] proposes a crowdsourcing-assisted system, CDAS, for data analytics tasks; the system faces the presence of noisy answers by submitting redundant tasks and adopts a voting strategy to estimate the correct solution. Compared to our approach, the number of workers as well as their accuracies are statically determined, based on the workers' historical performances.

In our previous work [7], we studied wrapper inference with a single and perfect worker.

7. CONCLUSIONS
We presented wrapper inference algorithms specifically tailored to work with the support of crowdsourcing platforms. Our approach allows wrappers to be generated by posing simple questions, membership queries, to workers engaged on a crowdsourcing platform. We proposed two algorithms that consider the possibility of noisy answers: alfη recruits a single worker; alfred can dynamically engage multiple workers to improve the quality of the solution. We showed that alfred can produce high-quality wrappers at reasonable costs, and that the quality of the output wrapper is highly predictable.

8. REFERENCES
[1] D. Angluin. Queries revisited. Theor. Comput. Sci., 313(2):175–194, 2004.
[2] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, Apr. 1988.
[3] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD 2003.
[4] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.
[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng., 18(10):1411–1428, 2006.
[6] V. Crescenzi and P. Merialdo. Wrapper inference for ambiguous web pages. Applied Artificial Intelligence, 22(1&2):21–52, 2008.
[7] V. Crescenzi, P. Merialdo, and D. Qiu. A framework for learning web wrappers from the crowd. In WWW 2013.
[8] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219–230, 2011.
[9] A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems on the world-wide web. Commun. ACM, 54(4):86–96, Apr. 2011.
[10] X. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Solomon: Seeking the truth via copying detection. PVLDB, 3(2):1617–1620, 2010.
[11] U. Irmak and T. Suel. Interactive wrapper generation with minimal user effort. In WWW 2006.
[12] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: a crowdsourcing data analytics system. In VLDB 2012.
[13] I. Muslea, S. Minton, and C. A. Knoblock. Active learning with multiple views. JAIR 2006.
[14] B. Settles. Active learning literature survey. CS Tech. Rep. 1648, University of Wisconsin–Madison, 2009.
[15] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD 2008.