IR Game: How well do you know information retrieval papers?

Jan Rybak (Norwegian University of Science and Technology, jan.rybak@idi.ntnu.no)
Krisztian Balog (University of Stavanger, krisztian.balog@uis.no)
Kjetil Nørvåg (Norwegian University of Science and Technology, kjetil.norvag@idi.ntnu.no)

Abstract

In this paper we demonstrate how a gamification approach increases the attractiveness of an assessment exercise in the context of expertise profiling. We present an online game, in two difficulty modes, where users have to guess the authors of publications. We analyze the collected data along different dimensions and identify four types of gaming personalities based on behavioral patterns. Further, we examine the relation between popularity and recognizability for both papers and authors. Finally, we provide insights into game mechanics that extend beyond our specific use case.

1 Introduction

Gamification is a method for keeping users involved in a task for longer periods of time or for encouraging them to repeatedly undergo otherwise not-so-entertaining tasks. Another reason for gamification is the desire to generate useful data as a by-product of the playing activity [VAD08]. Our work takes place in the context of (temporal) expertise profiling, where we are concerned with identifying what topics people are knowledgeable about [RBN14b]. Specifically, we focus on the academic domain, where scientific publications constitute the best available evidence from which to draw conclusions regarding a person's expertise. Of course, what one holds as her most important publication(s) does not necessarily correspond with what others (i.e., the scientific community) consider as such. Arguably, the latter is more important. In our experience, getting the first question answered (what a person considers his most important publications) is not easy; we developed an assessment interface for this purpose [RBN14a] and found that people were not especially willing to spend time with it. (It has to be mentioned, however, that selecting the most important publications was only part of the task; the assessment procedure was more involved than that.) Our main motivation, therefore, is to attract users who can represent the relevant scientific community's general opinion and can generate data to be used in our efforts to evaluate temporal expert profiling approaches [RBN14b].

Gamification is not the only alternative we considered. In many scenarios, crowdsourcing is a valid option for the delivery of simple yet time-consuming or repetitive tasks such as data annotation or evaluation. Eickhoff [Eic14] examines the crowd-powered expert paradigm, where the majority of the workload is preprocessed by crowd workers and experts are only needed for specialized steps. None of these approaches, however, is applicable in our case; here, the entire task is strictly domain-specific and requires the involvement of domain experts.

We cast our assessment exercise as a simple question-answering quiz that tests the user's knowledge of information retrieval (IR) papers, i.e., the "IR game." The player is presented with the title of a publication, from selected top conferences, and her task is to attribute the paper to the corresponding author(s). (This game should not be unfamiliar to academics; many of us perform a similar "authorship attribution" exercise, albeit not deliberately, when performing blind reviews. The main difference is that here we offer instant feedback while providing limited context.) The game comes in two difficulty modes. In "beginner" mode, the user has to select the right set of authors from three options, while in "advanced" mode authors have to be picked out individually. The goal in each mode is to answer as many questions, i.e., collect as many points, as possible. The game ends after three wrong answers. A "leader board" is provided to track the highest scoring players. The game is available at http://bit.ly/ir-game.

In the course of this work, we examine whether we are able to attract more interest (and collect more data) by presenting our assessment exercise indirectly, as a game, as opposed to dealing with it explicitly using a purpose-built interface. In addition, we address a number of more specific questions:

• Which level of difficulty is preferred, the easy mode or the advanced one?
• Does a competitive element, such as a leader board, increase the level of engagement?
• When do users stop playing?
• Do users return to play again? After how long?
• What types of players can we identify?
• Are more cited papers also more easily recognized?
• Are more popular authors also more easily recognized?
• Do people prefer to play anonymously?

Our findings confirm the premise that interweaving game mechanics into a non-game environment is beneficial in terms of task attractiveness. We also demonstrate that the leader board is a powerful motivator for many people.
2 The Game

The "IR game" is a simple quiz that tests users' knowledge of publications in the field of information retrieval.

2.1 Game rules

Given a publication title, the player's task is to select the correct authors within a given time limit. The game can be played in two modes, beginner and advanced.

Beginner mode. The user has to select the (entire) group of authors from three options. Only one variant is correct, and all authors are listed in the same order as on the paper. See Figure 1(a).

Advanced mode. In the more difficult game mode, individual author names are offered and the user has to decide which of them belong to the paper. The number of names offered does not necessarily correspond to the actual number of the paper's authors, and the same applies to the order of names. See Figure 1(b).

[Figure 1: IR Game in (a) beginner and (b) advanced mode.]

The user is credited with the corresponding F1-score for each answer (i.e., comparing the selected set of authors against the correct one); answers with an F1-score below 0.5 count as wrong. In both modes, the game ends after three wrong answers.
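For concreteness, the per-answer credit is the standard F1-score of the selected author set against the true one. The following minimal sketch illustrates the rule; the function name and the edge-case handling are our own assumptions, not the game's actual implementation.

```python
def answer_score(correct, selected):
    """F1-score of the player's selection against the true author set.

    A minimal sketch of the scoring rule described above; names and
    edge-case handling are our assumptions, not the game's code.
    """
    correct, selected = set(correct), set(selected)
    if not correct or not selected:
        return 0.0
    overlap = len(correct & selected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(correct)
    return 2 * precision * recall / (precision + recall)

# Answers scoring below 0.5 count as wrong; three wrong answers end the game.
score = answer_score({"K. Balog", "J. Rybak"}, {"K. Balog", "K. Nørvåg"})
wrong = score < 0.5  # here F1 = 0.5, so the answer still (just) counts
```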
A separate leader board is available for each game mode, listing the highest scoring players (with score, name, and timestamp). Players who made it to the leader board were offered the opportunity to "brag" about their achievement on Twitter.

2.2 Data

The collection of publications used in this game comprises the 1111 top-cited IR papers (according to the ACM DL, http://dl.acm.org) from the period 2004-2014 that were presented in one of the following conference series: SIGIR, WWW, CIKM, KDD, and WSDM. For each publication, besides its own set of original authors, a set of "fictitious" authors is randomly selected from other documents within the same data set.

Usage data is collected while the game is being played. Specifically, for each user, we store the questions that have been asked in the game, correct and wrong answers, scores, date and time, time to answer, number of attempts to copy text from the webpage, and rough location. In order to recognize returning users, we use browser cookies with a unique identifier. We also track the site's traffic using Google Analytics (http://www.google.com/analytics/).
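The random distractor selection described above could be implemented along the following lines. This is only a sketch: the paper-as-dict schema, the function name, and the exclusion of a paper's true authors from the distractor pool are our assumptions rather than details given in the paper.

```python
import random

def distractor_authors(paper, collection, k):
    """Draw k "fictitious" author names for a question, uniformly at
    random, from other papers in the same data set (cf. Section 2.2).

    Papers are assumed to be dicts with an "authors" list; this schema
    is our illustration, not the game's actual data model.
    """
    pool = {name
            for other in collection if other is not paper
            for name in other["authors"]}
    pool -= set(paper["authors"])  # assume a true author is never offered as a distractor
    return random.sample(sorted(pool), k)  # sorted() only to make the pool order deterministic

# Hypothetical usage:
papers = [{"title": "Paper A", "authors": ["X", "Y"]},
          {"title": "Paper B", "authors": ["Z"]}]
print(distractor_authors(papers[0], papers, 1))  # -> ['Z']
```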
2.3 Usage statistics

The game was promoted on Twitter (hashtag #irgame), aiming at people from the IR community. In this paper, we analyze traffic from the first five days of the game's existence (i.e., from January 31 to February 4, 2015). During this period, 302 unique visitors from 33 countries visited the site, and more than one third of them participated in the game. Figure 2 presents the geographic distribution of visitors; this roughly corresponds to the distribution of IR groups in the world (albeit Norway is admittedly over-represented in this figure). Figure 3 shows the time of day when the game was played (normalized according to users' timezones).

[Figure 2: Geographic distribution of visitors.]
[Figure 3: Time of day vs. number of games played.]

Table 1 presents usage statistics, in terms of number of games played and number of unique players. We observe that the beginner game mode was almost 9 times more successful than the advanced one in terms of game counts and almost 7 times in terms of player counts. These statistics also show that people who took part in the easier version of the game were more likely to play again.

Table 1: Usage statistics

                            Beginner   Advanced   Total
  #unique players           111        16         116
  #games played             347        39         387
  avg. #games per player    3.14       2.44       3.34

3 Analysis of results

Next, we analyze the collected data in different ways: by answers (§3.1), by players (§3.2), by papers (§3.3), and by authors (§3.4).

3.1 Answers

Time to answer. In both game modes, users' response time is limited to 15 seconds. If the time limit is exceeded, the answer is considered wrong. On average, it took about half of the specified time limit to provide an answer; more precisely, it was 7.58s. There is a notable difference, 1.8s, between the average response times for correct (6.74s) and wrong (8.53s) answers. Looking at the distribution of answer times in Figure 4, we find that the higher response time for wrong answers is due to timeouts. It is also visible from this plot that when the user knows the correct answer, he is less likely to use up all the time available.

[Figure 4: Time to answer in beginner mode.]

Total scores. Figures 5(a) and 5(b) depict the distribution of total game scores for beginner and advanced modes, respectively. These resemble power-law distributions, although the existing data (especially for the advanced mode) are too sparse to infer their parameters. In both cases there is a noticeable drop when going from score 3 to 4. This has to do with the fact that the game ends after 3 mistakes.

[Figure 5: Distribution of game scores in (a) beginner and (b) advanced mode.]

3.2 Players

Returning visitors. An interesting measure of success of a game is the number of returning players. We examined all cases where a user played more than once (56 cases), searching for interesting patterns. Most commonly, a user plays again right after she finishes a game: in 42 cases, users played again within the same hour. Figure 6 presents the time intervals between users' returns.

[Figure 6: Time elapsed between games (from the first game) for returning visitors.]

Player types. From the analysis of the game series, we can derive four types of players, depending on when they decide to leave the game.

Jumpers are visitors who come and play a single game; they leave after that, no matter what the score is.

Give-upers are players who return repeatedly but leave the game out of demotivation when, in a series of games, their score drops.

Fighters do the exact opposite: they leave the game at the top of their form, when, in a series of games, they reach their highest score.

Achievers care about winning. They keep returning and playing the game until they are back on top of the leader board.

Figure 7 shows an example user for each of the player types.

[Figure 7: Examples of player types: (a) Jumper, (b) Give-uper, (c) Fighter, (d) Achiever. Session boundaries are marked with vertical red lines.]
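One way to make this grouping concrete is a small heuristic over a player's chronological score series, sketched below. The paper assigns types by inspecting game series; the exact rules here, and the hypothetical leaderboard_top parameter, are our assumptions.

```python
def player_type(scores, leaderboard_top=None):
    """Label a player from their chronological game scores.

    A rough operationalization of the four types described above; the
    thresholds and the leaderboard_top parameter are our assumptions,
    not the classification procedure used in the paper.
    """
    if len(scores) == 1:
        return "jumper"      # one game, then gone, whatever the score
    if leaderboard_top is not None and scores[-1] >= leaderboard_top:
        return "achiever"    # kept playing until back on top of the leader board
    if scores[-1] == max(scores):
        return "fighter"     # quit at the top of their form
    return "give-uper"       # quit after their score dropped
```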
3.3 Papers

Are popular papers, i.e., papers with more citations, recognized more easily? In order to answer this question, we first introduce the concept of a paper's recognition ratio: the number of times the publication was successfully recognized by users (players) divided by the total number of times it was shown to users. Next, we divide all publications that appeared in the game into three groups based on the number of citations they received (according to the ACM DL). We report the average recognition ratio for each group in Figure 8, where the leftmost bar represents papers with the highest number of citations. As expected, we find that more cited papers are in general better recognized (left and middle vs. right), albeit papers in the middle of the citation range seem to perform best in this regard. We note that these findings may not be conclusive due to data sparsity.

[Figure 8: Citation counts vs. recognition ratio.]

3.4 Authors

Are popular authors, i.e., people with more publications, recognized more easily? Similarly to papers, we define an author's recognition ratio as the number of times the author's publications were successfully recognized by users (players) divided by the total number of times her publications were shown to users. In Figure 9 we plot authors' popularity, measured by the number of publications (in our paper selection), against recognition ratio. We find that there is a significant difference between authors with a single publication and authors with multiple publications; not surprisingly, having multiple publications benefits recognition. On the other hand, it appears that having many more publications does not improve recognition any further.

[Figure 9: Author popularity (number of publications) vs. recognition ratio.]
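Both definitions reduce to simple counting over the game log. The sketch below computes paper- and author-level recognition ratios under an assumed log format; the (paper_id, recognized) event schema and the names are illustrative, not the game's actual data model.

```python
from collections import Counter

def recognition_ratios(events, authors_of):
    """Paper- and author-level recognition ratios from game-log events.

    `events` is assumed to be an iterable of (paper_id, recognized)
    pairs and `authors_of` a mapping from paper_id to its author list;
    a sketch of the definitions in Sections 3.3 and 3.4.
    """
    shown, hit = Counter(), Counter()
    a_shown, a_hit = Counter(), Counter()
    for paper_id, recognized in events:
        shown[paper_id] += 1
        hit[paper_id] += int(recognized)
        for author in authors_of[paper_id]:  # author ratio aggregates over her papers
            a_shown[author] += 1
            a_hit[author] += int(recognized)
    paper_ratio = {p: hit[p] / shown[p] for p in shown}
    author_ratio = {a: a_hit[a] / a_shown[a] for a in a_shown}
    return paper_ratio, author_ratio
```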
4 Observations

Based on the analysis of results as well as informal feedback from the users, we make a number of observations that may generalize beyond our specific use case.

Learning. From user feedback we know that this game was also used to learn about relevant, previously unseen publications. In the spirit of Comenius' saying, "Much can be learned in play that will afterwards be of use when the circumstances demand it," we believe that our game might also prove useful for exploring and discovering more about scientific literature.

Unfair behavior. We suspect that in at least one case a user acted dishonestly in order to get into the lead. This user's score was unreasonably high compared to the second best. More importantly, her average time to answer was 12.84s (compared to the overall average of 7.85s), which seems just about enough time to use a web search engine to look up the paper in question. This behavior was reported by another competitor (which supports the assumption that players do care about their position on the leader board).

Head-start. In at least one case we observed that a user was restarting the game until he was able to answer the first question correctly. Beginning the game with a set of easier questions, then gradually increasing difficulty, might therefore be helpful in keeping users engaged.

Engaging users. From the statistics (§3.2) we can see that users are not very likely to return to the game days after their first visit. However, the chances that they repeatedly participate in the game within the first hour after their initial visit are much higher. It is therefore of vital importance to keep the user in the game as long as possible when she comes for the first time.

Identity. Some people (∼10%) opted to use their full civil name as opposed to a nickname. We hypothesize that this was a deliberate choice, made in case they make it to the leader board.

5 Conclusions and Future work

The study presented in this paper started with the following main question in mind: could we make an assessment exercise, in the context of expertise profiling, more appealing for users? We have answered this question positively. Our experiment has shown that users find it more desirable to participate in a game-like assessment task than to evaluate results explicitly using a purpose-built interface. We have analyzed the collected data along different dimensions and have identified four types of gaming personalities based on behavioral patterns. On top of the analysis of the game mechanics, this experiment has allowed us to gather valuable data about authors and publications. This has let us perform an initial examination of the relation between popularity and recognizability for both papers and authors.

In future work we plan to enhance the game in several ways. The main purpose of the game is to indirectly measure how researchers recognize each other's publications. In this first version, fictitious authors were selected randomly; however, interesting experiments could be conducted if the selection of alternative authors was biased in a controlled way. This would allow us to adjust the difficulty of the questions as the game progresses. We also plan to add new game modes (e.g., time trial), expand the data set (i.e., add more publications), and possibly explore other research fields/communities.

References

[Eic14] Carsten Eickhoff. Crowd-powered experts: Helping surgeons interpret breast cancer images. In Proceedings of the First International Workshop on Gamification for Information Retrieval, pages 53-56, 2014.

[RBN14a] Jan Rybak, Krisztian Balog, and Kjetil Nørvåg. ExperTime: Tracking expertise over time. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, pages 1273-1274, 2014.

[RBN14b] Jan Rybak, Krisztian Balog, and Kjetil Nørvåg. Temporal expertise profiling. In Proceedings of the 36th European Conference on Advances in Information Retrieval, ECIR '14, pages 540-546, 2014.

[VAD08] Luis von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58-67, 2008.