Classical vs. Crowdsourcing Surveys for Eliciting Geographic Relevance Criteria

Stefano De Sabbata (1), Omar Alonso (2), and Stefano Mizzaro (3)

(1) University of Zurich-Irchel, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland, stefano.desabbata@geo.uzh.ch
(2) Microsoft Corp., 1065 La Avenida, Mountain View, CA, USA, omar.alonso@microsoft.com
(3) University of Udine, Via delle Scienze 206, 33100 Udine, Italy, mizzaro@uniud.it

Abstract. Geographic relevance aims to assess the relevance of physical entities (e.g., shops and museums) in geographic space for a mobile user in a given context, thereby shifting the focus from the digital world (the realm of classical information retrieval) to the physical world. We study the elicitation of geographic relevance criteria by means of both a classical survey and a survey run on Amazon Mechanical Turk (a crowdsourcing platform). This allows us to obtain three results: first, we gather a set of criteria and their relative importance; second, we gain a first insight into the differences between geographic relevance and classical relevance as commonly understood in the IR field; and third, we draw some considerations on the agreement, regarding the importance of specific criteria, between the participants in the classical survey and those in the crowdsourcing survey.

Keywords: Relevance, Crowdsourcing, Amazon Mechanical Turk, SurveyMonkey

1 Introduction

The elicitation of relevance criteria dates back to the 1990s, if not earlier [7]. Although such criteria seemed quite well established at that time [2], the issue has recently been studied again [1]. This is probably due to the Web, which on the one hand provides novel search services that might entail a different notion of relevance, and on the other hand offers more convenient methods for preparing surveys involving many participants. In this short paper, we concentrate on Geographic Relevance (GR), a recent area of Information Retrieval (IR), and we discuss the elicitation of relevance criteria by means of:
– SurveyMonkey (SM, www.surveymonkey.com), a Web service that allows the preparation of an online survey whose participants are then invited by email, and
– Amazon Mechanical Turk (AMT, www.mturk.com), a crowdsourcing platform that allows requesters to outsource specific tasks to the crowd for a small amount of money.

The aim of this research is threefold:
– to find suitable GR criteria, which might differ from the classical relevance criteria;
– to gain a first insight into the difference between GR and the classical concept of relevance in the IR field;
– to understand whether AMT provides reliable results, or at least whether those results agree with the SM ones, which are obtained in a more classical way.

AMT quality and reliability are important issues [6]: there is no guarantee that AMT workers provide reliable answers and carry out their tasks properly; for example, workers might cheat to gain money quickly. This is all the more critical as crowdsourcing is emerging as a widespread alternative for relevance evaluations.

In the following, we first define GR (Section 2) and discuss crowdsourcing and AMT (Section 3); we then present the experimental study and its results (Section 4), and finally summarize the main findings (Section 5).

2 Geographic Relevance Criteria

The basic idea of GR is to assess the relevance of physical entities (e.g., shops and museums) in geographic space for a mobile user in a given context [8].
This definition implies a shift from the informational world, the focus of IR, which is devoted to retrieving information from unstructured collections of digital documents, to the physical world. In other terms, the aim of GR is to apply the principles and concepts developed in the field of IR not only in the informational world, but also in the physical world [3]. GR differs from Geographic Information Retrieval because the latter still focuses on digital entities: the aim of Geographic Information Retrieval is to retrieve geographic information from digital documents, or to find relevant digital documents that can satisfy a user's need for geographic information. GR, instead, uses digital entities (e.g., the objects in a collection within a Geographic Information System, documents, images, etc.) as a means to estimate the relevance of the physical entities they refer to, rather than aiming to evaluate the relevance of the digital entities themselves.

In shifting the focus from the digital world to the physical world, a first question is whether the criteria of relevance developed in IR [7, 2, 1] can be applied to assess GR. A second question is whether other criteria are needed in order to fully understand the relevance of a physical entity. We ground our study on the set of GR criteria proposed in [4]; these criteria are listed in Table 1. We do not have the space here to discuss these criteria in detail; a comprehensive description of each criterion, together with a more in-depth analysis, is provided in [5].

Properties:    Topicality, Appropriateness, Coverage, Novelty
Geography:     Spatial proximity, Temporal proximity, Spatio-temporal proximity, Directionality, Visibility, Hierarchy, Cluster, Co-location, Association rule
Information:   Specificity, Availability, Accuracy, Currency, Reliability, Verification, Affectiveness, Curiosity, Familiarity, Variety
Presentation:  Accessibility, Clarity, Tangibility, Dynamism, Presentation quality

Table 1. Four sets of GR criteria, classified as in [4].

3 Crowdsourcing

Crowdsourcing has emerged as a feasible alternative for relevance evaluation because it brings the flexibility of the editorial approach to a larger scale. AMT is an example of a crowdsourcing platform: it is an Internet service that gives developers the ability to include human intelligence as a core component of their applications. Developers use a web services API to submit tasks, approve completed tasks, and incorporate the answers into their software applications. To the application, the transaction looks very much like any remote procedure call: the application sends the request, and the service returns the results. People (the "crowd") come to the web site looking for tasks and receive payment for their completed work. In addition to the API, there is also the option to interact through a dashboard that includes several useful features for prototyping experiments. Participation by large numbers of online users from all over the world keeps increasing, which provides a large and diverse sample. The individual or organization who has work to be performed is known as the requester; a person who signs up to perform work is known as a worker.

One issue with AMT and similar crowdsourcing platforms is quality [6]: there is no guarantee that workers provide correct answers and carry out their tasks in a reliable way. For example, workers might cheat to gain money quickly.
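As a concrete illustration of this RPC-like requester workflow, the sketch below uses the present-day boto3 MTurk client. It is a hypothetical example rather than the tooling used for the surveys described in this paper, and the endpoint, title, reward, worker count, and question file are illustrative assumptions.

```python
# Hypothetical sketch of the requester workflow described above, using the
# boto3 MTurk client (not the tooling used for the surveys in this paper).
import boto3

mturk = boto3.client(
    "mturk", region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com")

# Submit a task (HIT): the request/response pair behaves like a remote procedure call.
question_xml = open("likert_survey.xml").read()  # illustrative question form
hit = mturk.create_hit(
    Title="Which criteria make a nearby place relevant to you?",
    Description="Rate each statement on a 7-point scale.",
    Reward="0.15",                      # per-worker payment, as in the experiments described later
    MaxAssignments=50,                  # number of workers sought (illustrative)
    LifetimeInSeconds=7 * 24 * 3600,
    AssignmentDurationInSeconds=1800,
    Question=question_xml,
)

# Later: collect submitted answers and approve the completed work.
assignments = mturk.list_assignments_for_hit(
    HITId=hit["HIT"]["HITId"], AssignmentStatuses=["Submitted"])
for a in assignments["Assignments"]:
    mturk.approve_assignment(AssignmentId=a["AssignmentId"])
```

In practice, the same create/collect/approve cycle can also be driven entirely from the dashboard mentioned above, which is convenient for prototyping.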
One of the aims of this paper is to compare a survey carried out by means of AMT with a similar one carried out by more classical means, such as SM.

1. Considering a place that fits your needs by its category (e.g., a restaurant, if you want to go out for dinner), which other criteria would you take into account?
   – A place that offers just the services you need is more relevant than a place that also offers other services.
   – A place that offers all the services you need is more relevant than a place that offers just some of them.
   – A place that was previously unknown to you is more relevant than an already known place.
2. Considering a place that fits your needs, do you take into account the following criteria related to the presented information and the way it is presented (for example on your mobile device) to judge its relevance?
   – The more information available about a place, the higher is the relevance of the place.
   – The more accurate the information about a place, the higher is the relevance of the place.
   – The more current, recent, timely, up-to-date the information about a place, the higher is the relevance of the place.
   – The more dynamic, active or interactive the presentation of information, the higher is the relevance of the presented place.
   – The more the information about a place is presented in a certain format or style, or offers output in a way that is helpful, desirable, or preferable, the higher is its relevance.

Fig. 1. Questions 1 and 2 as framed in SMs and AMTs1.

4 Experiments

4.1 Experimental design

We selected a subset of the criteria listed in Table 1: the 14 criteria shown in italics in the table. We chose most of the geographic criteria, leaving out spatial proximity and temporal proximity (we took into account spatio-temporal proximity, which combines both) and association rule (which is difficult to explain and can be misunderstood if not explained in detail). From each of the other groups we selected two or three criteria, choosing those easier to explain in a few words and, probably, the most intuitive ones. Towards the aims stated in Section 1, we ran three experiments:
– A SM survey (referred to as SMs) sent by email to researchers and students in IR and similar subjects.
– A first AMT survey (AMTs1) obtained by simplifying the SM survey and focusing on some items only.
– A second AMT survey (AMTs2) obtained, after the responses to AMTs1, by fine-tuning the language to tailor it to the AMT environment, where workers are usually not keen to spend much time on a task.
The questions were asked in an indirect way: for example, we did not ask literally whether "spatio-temporal proximity is an important GR criterion"; rather, we asked whether "it is important to take into account whether the place (or a related event) will be available at the time you will be able to reach it (e.g., whether you can reach the shop before it closes)."

The questionnaire included a total of 14 items, arranged into three main questions. Figure 1 shows two of the three questions (each one grouping some items) as framed in SMs and AMTs1. In SMs, a first page was dedicated to the criteria not related to geographic concepts (e.g., novelty), whereas a second page was dedicated to the geography-related criteria. The same items were used in AMTs1, where the three questions were all presented on one page. Figure 2 shows the same items as framed in AMTs2, where we slightly modified the questions (but not the items, which were almost identical to those in SMs and AMTs1; the only differences, as shown in the figures, are the change of "offer" into "provide" and the use of boldface to highlight some terms), each question being presented on a separate page. Participants assessed each item on a 7-point Likert scale from "1 - Strongly disagree" to "7 - Strongly agree" (all the scale values appear on the ordinal axis in Figure 3).

1. Given a place in the right category (e.g., a restaurant, if you want to go out for dinner), which other criteria would you take into account?
   – A place that offers just the services you need is more relevant than a place that also provides other services.
   – A place that offers all the services you need is more relevant than a place that provides just some of them.
   – A place that was previously unknown to you is more relevant than an already known place.
2. Considering a place that fits your needs, do you take into account the following criteria to judge its relevance?
   – The more information available about a place, the higher is the relevance of the place.
   – The more accurate the information about a place, the higher is the relevance of the place.
   – The more current, recent, timely, up-to-date the information about a place, the higher is the relevance of the place.
   – The more dynamic, active or interactive the presentation of information, the higher is the relevance of the presented place.
   – The more the information about a place is presented in a certain format or style, or offers output in a way that is helpful, desirable, or preferable, the higher is its relevance.

Fig. 2. Questions 1 and 2 as framed in AMTs2.

4.2 Results

The number of participants in the three cases is similar: SMs got 53 participants, AMTs1 43, and AMTs2 42 (we discarded two outliers from each AMT survey since they were far too quick). The collected demographics show that SMs participants were familiar with digital maps (71% use them at least several times a week), mobile maps (51% use them on their mobile), and online yellow pages (only 30% of the participants have never used them). We did not collect demographic data for AMT (we plan to do that in future experiments). We paid $0.15 to each AMT worker; the total cost for both AMT experiments was $16.

Fig. 3. Median value for each criterion.

The Kolmogorov-Smirnov normality test was negative (the responses cannot be assumed to be normally distributed), so we treated the variables as ordinal. Figure 3 shows the median importance of each criterion in the three surveys. By analyzing the relative importance of the criteria, three groups can be singled out: a first one including the three leftmost criteria (coverage, spatio-temporal proximity, and currency), whose importance seems very high according to all three surveys; a second group including the central seven criteria, whose importance is tangible but somewhat lower than in the first group; and a final group of the four rightmost criteria, whose importance seems rather low and is more inconsistent among the three surveys.
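As a minimal sketch of this kind of nonparametric analysis (the normality check, the per-criterion medians, and the pairwise Mann-Whitney comparisons reported below), the following assumes the ratings are stored as one array of 1-7 values per criterion and per survey; the data layout, values, and names are illustrative assumptions, not the original study materials.

```python
# Minimal sketch: Kolmogorov-Smirnov normality check, per-criterion medians,
# and a pairwise Mann-Whitney U test between two surveys.
# The data layout and values are illustrative, not the original responses.
import numpy as np
from scipy import stats

# responses[survey][criterion] -> array of 1-7 Likert ratings (toy example data)
responses = {
    "SMs":   {"coverage": np.array([7, 6, 7, 5, 6, 7]), "dynamism": np.array([3, 4, 2, 3, 5, 4])},
    "AMTs1": {"coverage": np.array([6, 7, 7, 6, 7, 5]), "dynamism": np.array([5, 6, 5, 4, 6, 5])},
}

for survey, items in responses.items():
    for criterion, x in items.items():
        # K-S test against a normal distribution fitted to the sample;
        # a small p-value suggests treating the ratings as ordinal.
        ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
        print(f"{survey:6s} {criterion:10s} median={np.median(x):.1f} KS p={ks_p:.3f}")

# Pairwise Mann-Whitney U test on a single criterion across two surveys.
u_stat, p_value = stats.mannwhitneyu(
    responses["SMs"]["dynamism"], responses["AMTs1"]["dynamism"],
    alternative="two-sided")
print(f"dynamism, SMs vs AMTs1: U={u_stat:.1f}, p={p_value:.3f}")
```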
Turning to the agreement among the participants in the three surveys, we can note first that the SMs median values are generally lower than those of AMTs1 and AMTs2. Moreover, agreement differs from criterion to criterion, as confirmed by a Mann-Whitney test:
– a highly significant (p < .01) difference was found between SMs and AMTs1, and also between SMs and AMTs2, for the criteria availability, accuracy, dynamism, and presentation quality;
– a highly significant (p < .01) difference was found between SMs and AMTs1 for the criterion hierarchy, and between SMs and AMTs2 for the criterion visibility;
– a significant (p < .05) difference was found between SMs and AMTs1 for the criteria currency and visibility, and between SMs and AMTs2 for the criterion co-location;
– no statistically significant difference was found between AMTs1 and AMTs2 for any criterion.

Besides differences in quality per se, there are other characteristics that may influence the choice of platform for conducting surveys. We present the most important aspects in Table 2.

                     SM                                   AMT
Demographics         Targeted practitioners and experts   Crowd (unknown workers)
Incentive            Volunteer                            Money
Development cost     Low                                  Low
Service fee          $30 per month                        Free
Participant fee      None                                 $0.15 per participant
Cost dependencies    Time and service level               Number of participants per survey
Total incurred cost  $60                                  $8 + $8
Time to completion   45 days                              3 days for AMTs1, 6 days for AMTs2

Table 2. SM vs. AMT comparison.

5 Conclusions

Overall, the results hint that:
– The most important GR criteria seem to be coverage, spatio-temporal proximity, and currency.
– SM and AMT surveys provide slightly different results.
– The differences mainly concern the importance of four criteria (availability, accuracy, dynamism, and presentation quality).
– None of these four criteria belongs to the Geography set (see Table 1).

This last point is perhaps surprising, since one would expect the heterogeneous background and cultural differences of the international AMT population to particularly affect the elicitation of geographic criteria. In our experiments, however, disagreement mainly concerned classical relevance criteria. A further point to remark is that the average quality of the AMT workers' answers was good, as demonstrated by the good agreement with SM, even though we did not require qualified workers, as would have been possible in AMT. Finally, as future work, we are considering a more "visual" survey, based on images or scenarios rather than the pure text used in this work.

References

1. O. Alonso and S. Mizzaro. Relevance criteria for e-commerce: a crowdsourcing-based experimental analysis. In SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference, pages 760–761, 2009.
2. C. L. Barry and L. Schamber. Users' criteria for relevance evaluation: A cross-situational comparison. Information Processing & Management, 34(2-3):219–236, May 1998.
3. P. Coppola, V. Della Mea, L. Di Gaspero, and S. Mizzaro. The concept of relevance in mobile and ubiquitous information access. In Mobile HCI Workshop on Mobile and Ubiquitous Information Access, volume 2954 of LNCS, pages 1–10. Springer, 2003.
4. S. De Sabbata. Criteria of geographic relevance. In 6th Int'l Conf. on Geographic Information Science, 2010.
5. S. De Sabbata and T. Reichenbacher. Criteria of geographic relevance: an experimental study. International Journal of Geographic Information Science, forthcoming.
6. P. Marsden. Crowdsourcing. Contagious Magazine, 18:24–28, 2009.
7. S. Mizzaro. Relevance: The whole history. Journal of the American Society for Information Science, 48(9):810–832, 1997.
8. T. Reichenbacher, P. Crease, and S. De Sabbata.
The concept of geographic relevance. In Proceedings of the 6th Int'l Symposium on LBS & TeleCartography, 2009.