DBCrowd 2013: First VLDB Workshop on Databases and Crowdsourcing

The Palm-tree Index: Indexing with the crowd∗

Ahmed R. Mahmood (Purdue University), amahmoo@purdue.edu
Walid G. Aref (Purdue University), aref@cs.purdue.edu
Eduard Dragut (Purdue University), edragut@purdue.edu
Saleh Basalamah (Umm Al-Qura University), smbasalamah@uqu.edu.sa

∗This work was partially supported by the National Science Foundation under Grants III-1117766 and IIS-0964639.

ABSTRACT

Crowdsourcing services allow employing human intelligence in tasks that are difficult to accomplish with computers, such as image tagging and data collection. At a relatively low monetary cost, and through web interfaces such as Amazon's Mechanical Turk (AMT), humans can act as computational operators in large systems. Recent work has been conducted to build database management systems that can harness crowd power in database operators such as sort, join, and count. However, the fundamental problem of indexing within crowdsourced databases has not been studied. In this paper, we study the problem of tree-based indexing within crowd-enabled databases. We investigate the effect of various tree and crowdsourcing parameters on the quality of index operations, and we propose new algorithms for index search, insert, and update.

1. INTRODUCTION

Crowdsourcing is the practice of solving large problems by dividing them into smaller tasks, each of which is then solved by humans from an online community. The tasks are usually difficult to perform automatically by a computer, as they require human intelligence. Typical crowdsourcing tasks involve labelling, ranking, data cleaning, data filtering, data collection, and entity matching (e.g., see [11, 6, 2, 15, 5, 14]). Websites, e.g., Amazon's Mechanical Turk (MTurk) [1], provide an infrastructure for organizations to submit numerous micro-tasks and collect their results after they are fulfilled by human workers recruited at those websites. In a typical crowdsourced application, tasks are replicated and are answered by multiple people to avoid single-user errors. Each person answering a task incurs a (monetary) cost. Thus, a challenging problem for crowdsourcing is that of optimizing the number of tasks while maintaining the accuracy of the results.

Consider the following scenario where a crowdsourcing tree-based index can be useful. Assume that a car repair shop wants to provide an online service that allows users to submit pictures of their damaged cars to get an estimate of the repair cost. The repair shop maintains images of previously repaired cars associated with their actual repair costs. A good repair estimate would be the actual repair cost of a previously repaired car in a similar condition. Building a tree index over the images of previously repaired cars, sorted by their repair cost, allows performing crowdsourcing-based similarity queries on the index. These queries help users find cars in similar conditions and hence anticipate a similar repair cost for their own damaged car. The key to the success of this scenario is that the cost of repair is directly proportional to the condition of the car. The same idea can be applied when estimating the selling price of a used car, the cleaning cost of rental rooms, the placement of a new soccer player in player rankings, or the ranking of new scientific publications.

In this paper, we introduce the palm-tree index, a crowdsourcing tree-based index. We call our proposed index the palm-tree index because real palm trees also need humans, to move the pollen from a male tree to the female trees in order to produce fruits (dates). The palm-tree index aims at indexing keys based on properties that need human intelligence, e.g., comparing images against each other in order to descend and navigate the index. Crowdsourcing-based tree indexing is useful for (1) ordering a set of keys based on a subjective property, (2) performing index nested-loops joins, and (3) answering range queries, among other typical uses of an index. The problem of crowdsourcing-based indexing is challenging because of the following issues. First, crowd-based comparisons are subjective and are prone to error. Hence, we need to find strategies that give the most accurate and consistent results. Second, comparison tasks performed by the crowd incur a (monetary) cost. Hence, optimizing the number of tasks while preserving accuracy is very important.
The contributions of this paper are summarized as follows:

• We introduce the problem of crowdsourcing tree-based indexing as a technique for ordering a set of items with "subjective" keys.
• We present a taxonomy for crowdsourcing tree-based indexes.
• We introduce the palm-tree index, a crowdsourced index, along with several accompanying index traversal algorithms.
• We study the quality of the results retrieved using the palm-tree index.

The rest of this paper is organized as follows. Section 2 introduces the taxonomy for crowdsourcing tree-based indexes. Section 3 presents notations used throughout the paper. Section 4 presents the palm-tree index and its search algorithms. Section 5 analyzes the proposed algorithms. Preliminary experimental results are given in Section 6. Related work is presented in Section 7. Section 8 contains concluding remarks.

2. PROBLEM TAXONOMY

In this section, we introduce a taxonomy for crowdsourcing tree-based indexes. This taxonomy can be seen as an extension of the one presented in [8], which addresses only worker motivation, task replication, and task assignment. Our taxonomy adds concepts related to tree indexing (e.g., tree height) and to crowdsourcing in general (e.g., worker experience). Figure 1 depicts the taxonomy.

[Figure 1: A taxonomy of crowdsourcing tree-based indexing. Dimensions: tree height (balanced, unbalanced); number of dimensions (one, multi); worker motivation (volunteer, rewarded); worker experience (equal, variable, malicious); task assignment (user select, server assign); task replication (single, replicated); task submission (batch, sequential, hybrid); cost bounds (unlimited, limited). The concepts utilized in this paper are shown in boldface.]

Tree height: It specifies whether the tree is balanced or not. The proposed palm-tree index is a balanced B+-tree-like structure, so as to obtain predictable query cost. More detail about query cost is given in Section 3.

Number of dimensions: An index can be either one-dimensional or multi-dimensional. In this paper, we consider one-dimensional indexing.

Worker motivation: In some situations, workers may be willing to perform tasks for free, e.g., see [8]. However, in most instances workers seek rewards for their services. This raises several issues, e.g., budget estimation and the presence of low-quality work due to workers solving tasks quickly to obtain more money. In this paper, we consider rewarded workers.

Worker experience: A simple model to describe workers is to assume that all workers have equal experience, e.g., as in [11]. An alternative and viable model is to differentiate workers based on their level of expertise, e.g., as in [3]. In the palm-tree index, workers are assumed to have indistinguishable, i.e., equal, expertise. We plan to extend our analysis to the general case where workers have variable experience. One special type of worker is the malicious worker, who aims at degrading output quality by providing intentionally faulty and misleading answers. Coping with malicious workers is not addressed in this paper.

Task assignment: There are two cases to consider: (1) the worker chooses the tasks to work on, or (2) the tasks are automatically assigned to the worker without the worker's interference in the selection. This dimension is particularly important when workers are of variable expertise. For example, it is important to avoid assigning difficult tasks to inexperienced workers. In this paper, we assume Case 2, above.

Task replication: In a single-task, no-replication model, workers are trusted to deliver high-quality answers. In many other situations, workers' responses incur error. The replicated task model allows aggregating multiple workers' opinions to increase the quality of the results. In this paper, we assume the replicated model.

Task submission model: A task submission model describes how replications of the same task are submitted to workers. One alternative is to submit all task replicas in a single batch. This technique can reduce the response time to get the final results, as task replicas are solved by multiple workers concurrently. Nonetheless, one may end up submitting more tasks than needed. In order to reduce the cost needed for simple tasks, the replicas of the same task can be submitted sequentially. In the sequential model, a replica of a task is submitted only when the outcome of the previous replicas does not produce results of acceptable quality. The sequential technique increases the task response time, as replicas wait for each other. A hybrid submission model is a middle ground between both alternatives: small-sized batches of the same task are submitted sequentially. In this paper, we only consider the batch task submission model, i.e., replica tasks are submitted in parallel to the designated workers at the same time.

Cost bounds: Having unlimited cost per worker is a rare occurrence. It may occur when the quality of the result is of major importance. Typically, the cost for tasks is limited, and optimizing the overall task error within cost bounds is of major importance. In this paper, we consider the limited-cost case.
3. PRELIMINARIES AND INDEX MODEL

3.1 Problem definition

Let S be a set of N items and q be a query item. The keys of these items are subjective or imprecise (e.g., the item is the photo of a damaged car and the key is the cost of repair for this car). We need to address the following issues:

• Order the items in S according to their keys.
• Construct (build) an index on S.
• Submit query q on S to the crowd.
• Aggregate the crowd's responses to query q to produce a final result.

3.2 Index model

The structure of the palm-tree index is composed of two components: the index manager and a B+-tree. Figure 2 gives the main components of the palm-tree index. We distinguish between two main concepts in the palm-tree: jobs and tasks. A job is the unit of work submitted by a palm-tree user to perform either a query or an insertion on the indexed data. A task, or HIT (Human Intelligence Task), is the smallest unit of work assigned to a worker. Typically, a job involves generating multiple tasks, as illustrated in the sketch below.

[Figure 2: Palm-tree index model.]
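To make the job/task distinction concrete, the following is a minimal sketch in Python (our choice; the paper does not prescribe an implementation language). All names (Task, Job, answer) are illustrative assumptions: a task packages one node's n sorted items with the query item, and a worker's answer is a slot index between 0 and n, i.e., one of the n + 1 possible placements of the query among the node's items.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Task:
        """One HIT: place the query item among a node's n sorted items."""
        node_items: List[str]    # e.g., URLs of the n images in one B+-tree node
        query_item: str          # e.g., URL of the query image
        answer: int = -1         # worker's slot, 0..n; -1 means not yet solved

    @dataclass
    class Job:
        """Unit of work submitted by a palm-tree user: a query or an insert."""
        query_item: str
        tasks: List[Task] = field(default_factory=list)  # tasks spawned so far

Under this reading, a search job spawns k replicated tasks for every node it visits; how the replicated answers are combined is described by the index manager in Section 3.2.2.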
3.2.1 The B+-tree Component

The palm-tree index is built on top of a B+-tree index. Each node in the B+-tree has n ordered keys and represents one task. [11] observes that human beings are capable of performing n-ary operations with ease. Therefore, we give a worker a sorted list A of n items, e.g., images or videos, along with the query item q, e.g., a query image or video, and ask the worker to place q in A based on how q compares to the items in A. For instance, A can be a list of 5 pictures of cars sorted according to the extent to which each car is damaged, while q can be the picture of a newly damaged car. The fanout f of the B+-tree (the maximum allowed number of children of a node other than the root) naturally captures the maximum arity of the operations that a human can accomplish with ease. Clearly, a human being can handle a list A with 5 pictures much more easily than one with 20. Determining the optimal f for a given problem requires an initial phase of training. We discuss this issue in more detail in Section 5.

Construction (building) of the palm-tree depends on the type of the indexed keys. We have two types of keys: quantitative and qualitative keys.

DEFINITION 3.1. A quantitative key is a key to be indexed that has two proportional properties, namely, a subjective property and an assessed value. For quantitative keys, the palm-tree index can be constructed by sorting the keys based on their assessed values and then bulk-loading the index.

DEFINITION 3.2. A qualitative key is a key to be indexed based on a subjective property only. It does not have any associated assessed value. For qualitative keys, a palm-tree index can be constructed only using successive insertions and human-based comparisons.

One example of a quantitative key is an image of a damaged car associated with its actual repair cost. The degree of car damage is the subjective property of the key, while the repair cost is its assessed value. An example of a qualitative key is an image of a butterfly, with an ordering based on beauty as the subjective property. Regular B+-tree splitting/merging algorithms are used during insertion into/deletion from the B+-tree. For both types of keys, queries use human intelligence to make the comparisons needed to search the B+-tree. All comparisons are based on the subjective property only.

Table 1: Palm-tree index operations

Operation | Quantitative keys                  | Qualitative keys
Query     | Crowd-search                       | Crowd-search
Insert    | B+-tree insert                     | Crowd-search then B+-tree insert
Build     | Bulk loading or successive inserts | Successive inserts
Delete    | Query then B+-tree delete          | Query then B+-tree delete
Update    | Delete then insert                 | Delete then insert

3.2.2 Index manager

The palm-tree index manager is responsible for tree construction and query processing within the palm-tree. It initiates jobs for users' requests, generates tasks for every job, assigns tasks to workers, and aggregates workers' responses. We use majority voting to aggregate the responses of workers. In the palm-tree index, a task consists of the items (e.g., pictures) in a node of the B+-tree. A worker is asked to choose the best location for a query item among these items. Figure 3 gives an example task for estimating the repair cost of a car. The image on top represents the query image, while the images below are the items in a node of the tree. The worker assesses where the query key fits among the current node's keys.

[Figure 3: Sample task for choosing the best branch to estimate the car repair cost. The query image is shown on top; below it are the node's keys k1, k2, k3, ordered from less expensive to more expensive.]
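Since Section 3.2.2 aggregates replicated answers by majority voting, a minimal sketch of that step could look as follows, assuming the answers are the slot indices defined earlier and assuming ties break toward the smaller slot (the paper does not specify a tie-breaking rule):

    from collections import Counter
    from typing import List

    def majority_vote(slots: List[int]) -> int:
        """Return the slot chosen by the largest number of workers.
        Ties break toward the smaller slot index (an assumption)."""
        counts = Counter(slots)
        return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]

    # Example: five workers place the query at slots 2, 2, 1, 2, 3 -> slot 2 wins.
    assert majority_vote([2, 2, 1, 2, 3]) == 2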
4. ALGORITHMS

Table 1 describes how the build, insert, update, delete, and query operations are performed in the palm-tree index. Regular B+-tree algorithms (e.g., node splitting during insertion) are performed directly and do not involve the crowd. We focus on the crowd-search operation (i.e., querying the palm-tree index using the crowd) because the other operations depend on it.

The original search algorithm of the B+-tree index is not directly applicable to the crowdsourced palm-tree index. The palm-tree search algorithm depends on the way the palm-tree is traversed and the way the workers' answers are aggregated. We identify three search strategies for the palm-tree search algorithm. All three search strategies share the following two steps. The first step is root task generation, which generates tasks at the root level and is common to all three strategies. The second step is the response handler, which generates tasks at the lower levels of the tree and varies according to the three strategies. Algorithm 1 describes the root task generation step.

Algorithm 1: generate_root_tasks(Tree tree, Key q)
  h ← tree.height                  // tasks are generated at the root level
  k ← get_replications(tree, h)
  solved ← 0                       // zero tasks solved so far
  for i = 1 to k do
    w ← choose_worker()
    t ← create_task(tree, h, q, w)
    submit_task(t, response_handler)
  end

4.1 Leaf-Only Aggregation Strategy

In this technique, a job is replicated to K workers. A worker performs the index search by descending the tree from the root to a leaf, based solely on this worker's own decisions or assessments, to find the best match for the query item. All workers' responses and final results are aggregated only at the leaf level. Algorithm 2 gives an outline of the leaf-only aggregation strategy.

Algorithm 2: handle_leaf_only_aggregation(Task t)
  if t.level ≠ 0 then              // task at non-leaf level
    tree ← get_subtree(t.response, t.tree)
    t ← create_task(tree, t.level − 1, t.q, t.w)
    submit_task(t, handle_leaf_only_aggregation)
  else
    solved++
    if solved == k then            // all workers finished
      result ← aggregate_all_results()
      return result
    end
  end
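For intuition, a single worker's pass in the leaf-only strategy is an ordinary root-to-leaf descent driven by that worker's placements alone; only the K reached leaves are aggregated afterwards. A sketch, assuming a node type with a children list (empty at leaves) and an ask(node, q) callback standing in for task submission and response collection:

    def leaf_only_descent(root, q, ask):
        """One worker's root-to-leaf pass in the leaf-only strategy.
        ask(node, q) returns the slot this worker chose for q at node.
        The K leaves reached by the K workers are aggregated at the end."""
        node = root
        while node.children:            # non-leaf: follow the chosen branch
            slot = ask(node, q)         # one task for this worker at this node
            node = node.children[slot]  # slot i leads to the i-th child
        return node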
4.2 All-levels Aggregation Strategy

An alternative search strategy is to move down the tree level by level. The path from the root to the final search position is collectively determined by the crowd as follows. The search is started at the root, where K_0 tasks are generated at this level. After all workers complete their tasks at the root level, their answers are aggregated, say by majority voting, and the child node that corresponds to the item with the highest vote is selected. Let this node be v. We generate K_1 new tasks for v. We aggregate the answers from the K_1 workers and decide which of the children of v to go to next. We repeat the process until we reach the leaves.

There are two variations of this search technique: uninformed and informed workers. In the former, a worker is not aware of the other workers' decisions. In the latter, a worker is allowed to view a summary of the other workers' responses to the current task. Algorithm 3 provides an outline of this search procedure.

Algorithm 3: handle_all_levels_aggregation(Task t)
  solved++
  if solved == k then              // all workers finished at this level
    result ← aggregate_all_results()
    if t.level ≠ 0 then            // task at non-leaf level
      tree ← get_subtree(result, t.tree)
      k ← get_replications(tree, t.level − 1)
      for i = 1 to k do
        w ← choose_worker()
        t ← create_task(tree, t.level − 1, t.q, w)
        submit_task(t, handle_all_levels_aggregation)
      end
    else
      return result
    end
  end

4.3 All-levels Aggregation Strategy with Backtracking

The two search procedures above, namely the leaf-only and all-levels aggregation strategies, cannot compensate for an error. Backtracking is one way to correct potentially wrong decisions of the workers. A priority queue of pointers to the unexplored nodes in the index is maintained. Unexplored nodes are ordered in the priority queue according to their height (or level) in the tree. Nodes at the same level of the B+-tree index are ordered by the percentage of votes given to these nodes. To detect bad decisions, workers are progressively shown the leaf-node bounds of the current sub-tree at hand. If the workers indicate that the position of the query key falls outside the range of the keys in the leaf nodes, then the current sub-tree is abandoned and another node is picked from the top of the priority queue to resume the search.

Figure 4 gives an example of the backtracking search strategy. In Figure 4(a), a palm-tree of fanout 2 is used to index keys 1 through 8. Figure 4(b) illustrates the steps of traversing the tree. The search starts at the root node A. Node B is at the top of the priority queue, as 60% of the workers indicate this node. The priority queue keeps track of other alternatives besides B. In the next step, we show the range (i.e., the min and max) of the keys in the sub-tree rooted at B, that is, from 1 to 4. Assume that the workers indicate that the query key is larger than 4; then we have an error. Hence, we backtrack to node C and insert node D in the queue. Algorithm 4 outlines the search steps.

[Figure 4: Backtracking all-levels aggregation strategy. (a) A palm-tree of fanout 2 indexing keys 1 through 8, with 60% of the root-level votes for node B and 40% for node C. (b) The contents of the priority queue at each step of the traversal.]

Algorithm 4: handle_backtrack_all_levels_aggregation(Task t)
  solved++
  if solved == k then              // all workers finished at this level
    result ← aggregate_all_results()
    if result is outside the range of the current subtree then
      tree ← pop_queue()
    else
      tree ← get_subtree(result, t.tree)
    end
    push_queue(other voting alternatives)
    if t.level ≠ 0 then            // task at non-leaf level
      k ← get_replications(tree, t.level − 1)
      for i = 1 to k do
        w ← choose_worker()
        t ← create_task(tree, t.level − 1, t.q, w)
        submit_task(t, handle_backtrack_all_levels_aggregation)
      end
    else
      return result
    end
  end
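The bookkeeping behind Algorithm 4 can be sketched with a priority queue keyed by (level, vote share), matching the ordering described above: higher levels first, and better-voted nodes first within a level. Python's heapq is a min-heap, so both keys are negated. The node.level and node.children attributes and the vote_for and in_range callbacks are assumed interfaces; task generation and vote collection are elided.

    import heapq

    def backtracking_search(root, q, vote_for, in_range):
        """Sketch of the all-levels strategy with backtracking.
        vote_for(node, q) -> (winner_slot, runner_ups), where runner_ups
            is a list of (slot, vote_share) pairs for the other branches.
        in_range(node, q) -> False when workers say q falls outside the
            leaf-key bounds of the subtree rooted at node."""
        pq = []                                # unexplored alternatives, best first
        node = root
        while node.children:                   # still above the leaf level
            winner, runner_ups = vote_for(node, q)
            for slot, share in runner_ups:     # remember the other branches
                child = node.children[slot]
                heapq.heappush(pq, (-child.level, -share, id(child), child))
            node = node.children[winner]
            if not in_range(node, q) and pq:   # bad decision detected: backtrack
                node = heapq.heappop(pq)[-1]   # resume from the best alternative
        return node                            # leaf where q is finally placed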
5. ANALYSIS

In this section, we study how to set the parameters of the palm-tree index: tree order and cost distribution. We also introduce the main performance metrics of the palm-tree index, namely error and cost.

5.1 Performance metrics

The location of a key is the rank of the key in the ordered list of keys at the leaf level of the tree. Let q be a query key with a ground-truth location, say loc_truth, at the leaf level, and let loc_ret be the location retrieved using one of the search strategies.

DEFINITION 5.1. Error E is the distance between the ground-truth location and the retrieved location of the query key: E = |loc_truth − loc_ret|.

DEFINITION 5.2. Cost C is the total number of tasks needed to complete a job.

5.2 Tree order and height

The order (fanout) of a regular B+-tree index is usually constrained by the size of the disk page. In contrast, in the palm-tree index, the order of the B+-tree depends on the number of keys that workers are able to process at once, in a single task (see Section 3).

Notice that increasing the order of the tree increases a worker's probability of making a wrong decision at a given node. However, from a different point of view, in order to get a correct final result, correct decisions must be made at all levels of the tree. Increasing the tree order reduces the number of levels required to be correct, and hence reduces the final probability of error.

5.3 Cost selection

We adopt majority voting to aggregate workers' decisions. Therefore, increasing the number of replicated tasks increases the quality of the aggregated decisions. However, there is a saturation point beyond which increasing the number of replicated tasks is of little, if any, benefit.
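The saturation point in Section 5.3 can be made concrete. If each of k replicated workers independently picks the correct branch with probability p, the majority is correct with probability Σ_{i > k/2} C(k, i) p^i (1 − p)^(k−i). A sketch under these simplifying assumptions (independent, equally accurate workers; odd k, so there are no ties):

    from math import comb

    def majority_accuracy(p: float, k: int) -> float:
        """Probability that the majority of k independent workers is correct,
        given per-worker accuracy p (k odd, so a strict majority exists)."""
        return sum(comb(k, i) * p**i * (1 - p)**(k - i)
                   for i in range(k // 2 + 1, k + 1))

    # With p = 0.7: k=1 -> 0.700, k=3 -> 0.784, k=5 -> 0.837,
    # k=9 -> 0.901, k=15 -> 0.950: the gain per added replica shrinks quickly.
    for k in (1, 3, 5, 9, 15):
        print(k, round(majority_accuracy(0.7, k), 3))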
5.4 Cost distribution

Assume that there is a cost bound, say C, to search the index. An important issue is how this cost is distributed among tasks, in terms of the number of replicas per tree level, as tasks are assigned to workers. In the leaf-only aggregation strategy, every worker has to perform exactly h tasks, from the root level to the leaf level. Hence, we can have at most k = ⌊C/h⌋ workers.

In the all-levels aggregation strategy, we propose two techniques to distribute tasks among levels, namely even and probabilistic distribution. In the even distribution technique, every level in the palm-tree index is assigned ⌊C/h⌋ tasks. Even cost distribution assumes equal importance for all the levels of the palm-tree index. However, tree levels are of different importance: when a wrong decision is made at the higher levels (closer to the root level) of the palm-tree index, it results in a higher final error. This is not the case when wrong decisions are made at the lower levels of the palm-tree index. On the other hand, wrong decisions are more likely to occur at the lower levels than at the higher levels. The reason is that the spacing in-between the keys within a node decreases as we descend the tree. To accommodate the effect of a node's level on error severity, we replicate tasks in a way that is proportional to the expected distance error per level, as sketched below.

DEFINITION 5.3. The probability of a distance-d error at level l (P_dl) is the probability of deviating d branches from the ground-truth branch at level l.

We estimate the final error that would result from a distance-d error at level l to be d × f^(l−1). For example, a distance-1 error at level 1 (i.e., the leaf level) deviates the final result from the correct one by distance 1. A distance-1 error at level h (i.e., the root level) deviates the final result by about the number of leaf keys in a child subtree of the root node, that is, f^(h−1). We calculate the expected distance error at level l (EE_l) to be EE_l = Σ_d (d × P_dl) × f^(l−1).

Techniques for cost distribution for level-by-level voting with backtracking are left for future investigation.
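The two allocations in Section 5.4 can be sketched as follows. Even allocation gives every level ⌊C/h⌋ replicas, while the error-proportional allocation weights level l by its expected distance error EE_l, computed elsewhere and passed in by the caller. Rounding down, which may leave a few replicas of the budget unspent, is our assumption:

    from typing import List

    def even_allocation(C: int, h: int) -> List[int]:
        """floor(C/h) task replicas for each of the h tree levels."""
        return [C // h] * h

    def proportional_allocation(C: int, expected_errors: List[float]) -> List[int]:
        """Distribute a budget of C replicas across levels in proportion
        to each level's expected distance error EE_l (leaf level first)."""
        total = sum(expected_errors)
        return [int(C * e / total) for e in expected_errors]

    # Example: budget C = 30 over h = 3 levels.
    print(even_allocation(30, 3))                        # [10, 10, 10]
    print(proportional_allocation(30, [1.0, 2.0, 3.0]))  # [5, 10, 15]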
6. EXPERIMENTAL EVALUATION

In order to test the palm-tree index and its search strategies, we implement a web site to submit tasks to workers. This helps in further studying the problem under controlled settings. Two datasets are used in the experiments. The first dataset is a set of 200 images of squares of different sizes. The second dataset is a set of 1300 images of used cars with desired selling prices. A custom-built web crawler is used to collect the dataset of cars from a website of used-car ads.

Figures 5 and 6 give the experimental results of the mean error rate and mean cost for the uninformed all-levels aggregation and leaf-only aggregation search strategies. The parameters of the palm-tree index are set as follows: the fanout ranges from 2 to 4, and task replication is set to 3 and 5.

[Figure 5: Mean error while changing tree fanout. (a) Squares dataset; (b) Cars dataset.]

Figure 5 gives the mean error rates for the two search algorithms when applied to the cars and squares datasets. The error rate for either algorithm is much higher on the cars dataset than on the squares dataset. The reason is that it is easier for ordinary users to compare images of squares based on their sizes than it is to compare cars based on their prices. It is also clear that all-levels aggregation outperforms leaf-only aggregation on both datasets. The reason is that in the leaf-only aggregation strategy, if a worker makes a wrong decision at a higher level of the tree, this error significantly affects his/her answers at the lower levels of the tree. In the all-levels aggregation, if a worker makes a wrong decision, then the other workers can usually compensate for it (we use the majority principle). It is also evident from Figure 5 that increasing the number of workers reduces the error rate.

[Figure 6: Mean cost while changing tree fanout. (a) Squares dataset; (b) Cars dataset.]

Figure 6 gives the mean cost needed to perform a query on the palm-tree. The cost of the search algorithms is higher on the cars dataset than on the squares dataset. The reason is that the size (and accordingly the tree height) of the cars dataset is larger than that of the squares dataset. It is evident that increasing the number of replications increases the cost. It is also evident that increasing the tree fanout generally reduces the cost, as the height of the tree decreases. One technique used to reduce the cost is to allow workers to stop at non-leaf levels if they agree that a key at a non-leaf node matches the search key.

We are in the process of implementing more extensive experiments that will include larger crowds and other search strategies (i.e., informed all-levels aggregation and all-levels aggregation with backtracking). We plan to study the effect of the cost distribution alternatives (even and expected-distance-error-based). We also plan to study the quality of sorts and joins using the palm-tree index.

7. RELATED WORK

Lately, there has been an increasing interest in crowdsourcing. Several crowd-powered database systems, e.g., [6, 15, 12, 11], have been proposed to use human intelligence to perform database operations. These operations mainly focus on sorts and joins, e.g., [11], counting, e.g., [10], and max-finding, e.g., [7, 16]. Other works focus on algorithms to answer top-k queries, e.g., [4], skyline queries, e.g., [9], and spatial queries, e.g., [8].

The problem of optimizing the number of questions needed to find a set of target nodes within a general directed acyclic graph (DAG) using yes/no questions is studied in [13]. However, the B+-tree within the palm-tree index possesses more specific properties than a general DAG (i.e., balanced height and multiple ordered keys within nodes) that require more specific strategies for tree search and task aggregation (Section 4) than in the case when yes/no questions are used. To our knowledge, this is the first work to address the problem of indexing with the crowd.

8. CONCLUSION

In this paper, we study the problem of crowdsourcing-based indexing. We motivate the problem with several application scenarios. We propose a taxonomy of the problem and highlight its main challenges. We propose techniques for index construction and querying. We intend to complete this study with an extensive analysis and experimental evaluation. Our current work focuses on one-dimensional indexing. We plan to study other variations of crowdsourced indexing, such as multidimensional indexing, spatial and spatio-temporal indexing, and fuzzy indexing.

9. REFERENCES

[1] The Amazon Mechanical Turk website. https://www.mturk.com, 2013.
[2] A. Bozzon, M. Brambilla, and S. Ceri. Answering search queries with CrowdSearcher. In WWW, pages 1009–1018, 2012.
[3] A. Bozzon, M. Brambilla, S. Ceri, M. Silvestri, and G. Vesci. Choosing the right crowd: expert finding in social networks. In EDBT, pages 637–648, 2013.
[4] S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the crowd for top-k and group-by queries. In ICDT, 2013.
[5] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469–478, 2012.
[6] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: answering queries with crowdsourcing. In SIGMOD, pages 61–72, 2011.
[7] S. Guo, A. G. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the crowd. In SIGMOD, pages 385–396, 2012.
[8] L. Kazemi and C. Shahabi. GeoCrowd: enabling query answering with spatial crowdsourcing. In ACM SIGSPATIAL GIS, pages 189–198, 2012.
[9] C. Lofi, K. El Maarry, and W.-T. Balke. Skyline queries in crowd-enabled databases. In EDBT, pages 465–476, 2013.
[10] A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, pages 109–120, 2013.
[11] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. PVLDB, 5:13–24, 2011.
[12] A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Crowdsourced databases: query processing with people. In CIDR, 2011.
[13] A. Parameswaran, A. D. Sarma, H. Garcia-Molina, N. Polyzotis, and J. Widom. Human-assisted graph search: it's okay to ask questions. PVLDB, 4:267–278, 2011.
[14] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. CrowdScreen: algorithms for filtering data with humans. In SIGMOD, pages 361–372, 2012.
[15] H. Park, R. Pang, A. G. Parameswaran, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: a system for declarative crowdsourcing. PVLDB, 5, 2012.
[16] P. Venetis, H. Garcia-Molina, K. Huang, and N. Polyzotis. Max algorithms in crowdsourcing environments. In WWW, pages 989–998, 2012.