=Paper=
{{Paper
|id=Vol-1345/gamifir15_1
|storemode=property
|title=IR Game: How well do you Know Information Retrieval Papers?
|pdfUrl=https://ceur-ws.org/Vol-1345/gamifir15_1.pdf
|volume=Vol-1345
|dblpUrl=https://dblp.org/rec/conf/ecir/RybakBN15
}}
==IR Game: How well do you Know Information Retrieval Papers?==
IR Game: How well do you know information retrieval papers?

Jan Rybak, Norwegian University of Science and Technology, jan.rybak@idi.ntnu.no
Krisztian Balog, University of Stavanger, krisztian.balog@uis.no
Kjetil Nørvåg, Norwegian University of Science and Technology, kjetil.norvag@idi.ntnu.no
Abstract

In this paper we demonstrate how a gamification approach increases the attractiveness of an assessment exercise in the context of expertise profiling. We present an online game, in two difficulty modes, where users have to guess the authors of publications. We analyze the collected data along different dimensions and identify four types of gaming personalities based on behavioral patterns. Further, we examine the relation between popularity and recognizability for both papers and authors. Finally, we provide insights into game mechanics that extend beyond our specific use case.

Copyright © 2015 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: F. Hopfgartner, G. Kazai, U. Kruschwitz, and M. Meder (eds.): Proceedings of the GamifIR'15 Workshop, Vienna, Austria, 29 March 2015, published at http://ceur-ws.org

1 Introduction

Gamification is a method for keeping users involved in a task for longer periods of time, or for encouraging them to repeatedly undergo otherwise not so entertaining tasks. Another reason for gamification is the desire to generate useful data as a by-product of the playing activity [VAD08]. Our work takes place in the context of (temporal) expertise profiling, where we are concerned with identifying what topics people are knowledgeable about [RBN14b]. Specifically, we focus on the academic domain, where scientific publications constitute the best available evidence from which to draw conclusions regarding a person's expertise. Of course, what one holds as her most important publication(s) does not necessarily correspond with what others (i.e., the scientific community) consider as such. Arguably, the latter is more important. In our experience, getting the first question answered (what a person considers his most important publications) is not easy; we developed an assessment interface for this purpose [RBN14a] and found that people were not especially willing to spend time with it. (It has to be mentioned, however, that selecting the most important publications was only part of the task; the assessment procedure was more involved than that.) Our main motivation, therefore, is to attract users who can represent the relevant scientific community's general opinion and who can generate data to be used in our efforts to evaluate temporal expert profiling approaches [RBN14b].

Gamification is not the only alternative we considered. In many scenarios, crowdsourcing is a valid option for the delivery of simple yet time-consuming or repetitive tasks such as data annotation or evaluation. Eickhoff [Eic14] examines the crowd-powered expert paradigm, where the majority of the workload is preprocessed by crowd workers and experts are only needed for specialized steps. None of these approaches, however, is applicable in our case; here, the entire task is strictly domain-specific and requires the involvement of domain experts.

We cast our assessment exercise as a simple question-answering quiz that tests the user's knowledge of information retrieval (IR) papers, i.e., the "IR game." The player is presented with the title of a publication, from selected top conferences, and her task is to attribute the paper to the corresponding author(s).¹ The game comes in two difficulty modes. In "beginner" mode, the user has to select the right set of authors from three options, while in "advanced" mode authors have to be picked out individually. The goal in each mode is to answer as many questions, i.e., collect as many points, as possible. The game ends after three wrong answers. A "leader board" is provided to track the highest scoring players. The game is available at http://bit.ly/ir-game.

¹ This game should not be unfamiliar to academics; many of us perform a similar "authorship attribution" exercise, albeit not deliberately, when performing blind reviews. The main difference is that here we offer instant feedback while providing limited context.

In the course of this work, we examine whether we are able to attract more interest (and collect more data) by presenting our assessment exercise indirectly, as a game, as opposed to dealing with it explicitly using a purpose-built interface. In addition, we address a number of more specific questions:

• Which level of difficulty is preferred, the easy mode or the advanced one?
• Does a competitive element, such as a leader board, increase the level of engagement?
• When do users stop playing?
• Do users return to play again? After how long?
• What types of players can we identify?
• Are more cited papers also more easily recognized?
• Are more popular authors also more easily recognized?
• Do people prefer to play anonymously?

Our findings confirm the premise that interweaving game mechanics into a non-game environment is beneficial in terms of task attractiveness. We also demonstrate that the leader board is a powerful motivator for many people.
2 The Game

The "IR game" is a simple knowledge quiz that tests users' knowledge of publications in the field of Information Retrieval.

2.1 Game rules

Given a publication title, the player's task is to select the correct authors within a given time limit. The game can be played in two modes, beginner and advanced.

Beginner mode: The user has to select the (entire) group of authors from three options. Only one variant is correct, and all authors are listed in the same order as on the paper. See Figure 1 (a).

Advanced mode: In the more difficult game mode, individual author names are offered and the user has to decide which names belong to the paper. The number of authors offered does not necessarily correspond with the actual number of the paper's authors, and the same applies to the order of names. See Figure 1 (b). The user is credited with the corresponding F1-score for each answer (i.e., correct vs. selected set of authors); answers below an F1-score of 0.5 count as wrong.
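To make the scoring rule above concrete, the following is a minimal Python sketch of how such an F1-based credit could be computed from the selected and the true author sets; it illustrates the formula only and is not the game's actual implementation.

```python
# Sketch (not the authors' code) of the F1-based credit used in advanced mode:
# the selected author set is compared against the paper's true author set,
# and answers with F1 below 0.5 count as one of the three allowed mistakes.

def f1_credit(selected: set[str], correct: set[str]) -> float:
    """Return the F1 score of the selected author set vs. the true one."""
    if not selected or not correct:
        return 0.0
    true_positives = len(selected & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)

def counts_as_wrong(selected: set[str], correct: set[str], threshold: float = 0.5) -> bool:
    """An answer below the F1 threshold is treated as a wrong answer."""
    return f1_credit(selected, correct) < threshold
```

For example, selecting two of a paper's three authors plus one fictitious name gives precision 2/3 and recall 2/3, i.e., a credit of about 0.67, which is above the 0.5 threshold and therefore counted as correct.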
In both modes, the game ends after three wrong answers. A separate leader board is available for each game mode that lists the highest scoring players (with score, name, and timestamp). Players that made it to the leader board were offered the opportunity to "brag" about their achievement on Twitter.

Figure 1: IR Game in (a) beginner and (b) advanced mode.

2.2 Data

The collection of publications used in this game comprises the 1111 top cited IR papers (according to the ACM DL, http://dl.acm.org) from the period 2004-2014 that were presented in one of the following conference series: SIGIR, WWW, CIKM, KDD, and WSDM. For each publication, besides its own set of original authors, a set of "fictitious" authors is randomly selected from other documents within the same data set.
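As an illustration of how such distractor names might be drawn, here is a short hypothetical sketch; the `title` and `authors` fields and the uniform sampling over other papers' authors are our own assumptions for illustration, not the game's actual code.

```python
import random

# Illustrative only: sample k "fictitious" author names for a question,
# assuming distractors are drawn from the authors of other papers in the
# same collection (field names are hypothetical).

def fictitious_authors(paper: dict, collection: list[dict], k: int) -> list[str]:
    """Sample k author names that do not appear on the given paper."""
    pool = {name
            for other in collection
            if other["title"] != paper["title"]
            for name in other["authors"]}
    pool -= set(paper["authors"])
    return random.sample(sorted(pool), k)
```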
Usage data is collected while the game is being played. Specifically, for each user, we store the questions that have been asked in the game, correct and wrong answers, scores, date and time, time to answer, number of attempts to copy text from the webpage, and rough location. In order to recognize returning users, we use browser cookies with a unique identifier. We also track the site's traffic using Google Analytics (http://www.google.com/analytics/).
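For illustration, a per-answer usage record of the kind described above might look roughly as follows; the field names are ours and do not reflect the game's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical per-answer record corresponding to the fields listed in §2.2.

@dataclass
class AnswerRecord:
    user_id: str            # unique identifier stored in a browser cookie
    game_mode: str          # "beginner" or "advanced"
    paper_title: str        # the question shown to the player
    selected_authors: list[str]
    correct: bool           # whether the answer was counted as correct
    score: float            # credit received (F1-based in advanced mode)
    answered_at: datetime   # date and time of the answer
    time_to_answer: float   # seconds taken to answer
    copy_attempts: int      # attempts to copy text from the webpage
    location: str           # rough location, e.g., country
```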
2.3 Usage statistics

The game was promoted on Twitter (hashtag #irgame), aiming at people from the IR community. In this paper, we analyze traffic from the first five days of the game's existence (i.e., from January 31 to February 4, 2015). During this period, 302 unique visitors from 33 countries visited the site and more than one third of them participated in the game. Figure 2 presents the geographic distribution of visitors; this roughly corresponds to the distribution of IR groups in the world (albeit Norway is admittedly over-represented in this figure). Figure 3 shows the time of the day when the game was played (normalized according to users' timezones).

Figure 2: Geographic distribution of visitors.

Figure 3: Time of day vs. number of games played.

Table 1 presents usage statistics, in terms of number of games played and number of unique players. We observe that the beginner game mode was almost 8 times more successful in terms of game counts, and almost 9 times in terms of player counts, than the advanced one. These statistics show that people who took part in the easier version of the game were more likely to play again.

Table 1: Usage statistics

                              Game mode            Total
                          Beginner   Advanced
 #unique players             111         16         116
 #games played               347         39         387
 avg. #games per player     3.14       2.44        3.34

3 Analysis of results

Next, we analyze the collected data in different ways: by answers (§3.1), by players (§3.2), by papers (§3.3), and by authors (§3.4).

3.1 Answers

Time to answer

In both game modes, users' response time is limited to 15 seconds. In case the time limit is exceeded, the answer is considered wrong. On average, it took about half of the specified time limit to provide an answer; more precisely, it was 7.58s. There is a notable difference, 1.8s, between average response times for correct (6.74s) and wrong (8.53s) answers. Looking at the distribution of answer times, Figure 4, we find that the higher response time for wrong answers is due to timeouts. It is also visible from this plot that when the user knows the correct answer, he is less likely to use up all the time available.

Figure 4: Time to answer in beginner mode.

Total scores

Figures 5(a) and 5(b) depict the distribution of total game scores for beginner and advanced modes, respectively. These correspond to power-law shaped distributions, although the existing data (esp. for the advanced mode) are too sparse to infer their parameters. In both cases there is a noticeable drop when going from score 3 to 4. This has to do with the fact that the game ends after 3 mistakes.

Figure 5: Distribution of game scores, in (a) beginner mode and (b) advanced mode.

3.2 Players

Returning visitors

An interesting measure of success of a game is the number of returning players. We examined all games that were played more than once (56 games), searching for interesting patterns. It is most likely that a user plays again right after she finishes the game. In 42 cases, users played again within the same hour. The plot in Figure 6 presents the time intervals between users' returns.
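As a sketch of how such return intervals could be derived from the timestamped game records (cf. §2.2), assuming a simple list of (user, game start time) pairs:

```python
from collections import defaultdict
from datetime import datetime

# Illustrative only: compute, per returning user, the time elapsed between the
# first game and each later game (as plotted in Figure 6), in hours.

def return_intervals(games: list[tuple[str, datetime]]) -> dict[str, list[float]]:
    by_user: dict[str, list[datetime]] = defaultdict(list)
    for user_id, started_at in games:
        by_user[user_id].append(started_at)
    intervals: dict[str, list[float]] = {}
    for user_id, times in by_user.items():
        if len(times) < 2:
            continue  # not a returning visitor
        times.sort()
        first = times[0]
        intervals[user_id] = [(t - first).total_seconds() / 3600.0 for t in times[1:]]
    return intervals
```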
Figure 6: Time elapsed between games (from the first game) for returning visitors.

Player types

From the analysis of the game series, we can derive four types of players, depending on when they decide to leave the game.

Jumpers are visitors who come and play a single game; they leave after that, no matter what their score is.

Give-upers are players who return repeatedly but leave the game, demotivated, when their score drops over a series of games.

Fighters do the exact opposite. They leave the game at the top of their form, when they reach their highest score in a series of games.

Achievers care about winning. They keep returning and playing the game until they are back on top of the leader board.

Figure 7 shows an example user for each of the player types.

Figure 7: Examples of player types: (a) a jumper, (b) a give-uper, (c) a fighter, and (d) an achiever. Session boundaries are marked with vertical red lines.
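One possible way to operationalize these four types is sketched below; the rule ordering and conditions are our own reading of the descriptions above, not the classification procedure used in the analysis.

```python
# Heuristic sketch (ours, for illustration) that labels a finished series of
# game scores for one player with one of the four player types.

def player_type(scores: list[float], best_leaderboard_score: float) -> str:
    """Classify a player's completed series of game scores."""
    if len(scores) == 1:
        return "jumper"            # played a single game and left
    if scores[-1] >= best_leaderboard_score:
        return "achiever"          # kept playing until back on top of the leader board
    if scores[-1] == max(scores):
        return "fighter"           # left right after reaching their personal best
    if scores[-1] < scores[-2]:
        return "give-uper"         # left after their score dropped
    return "unclassified"
```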
3.3 Papers

Are popular papers, i.e., papers with more citations, recognized more easily? In order to answer this question, we first introduce the concept of a paper's recognition ratio. It is defined to be the number of times the publication was successfully recognized by users (players) divided by the total number of times it was shown to users. Next, we divide all publications that appeared in the game into three groups based on the number of citations they received (according to the ACM DL). We report the average recognition ratio for each group in Figure 8, where the leftmost bar represents papers with the highest number of citations. As expected, we find that more cited papers are in general better recognized (left and middle vs. right), albeit papers in the middle of the citation range seem to perform best in this regard. We note that these findings may not be conclusive due to data sparsity.

Figure 8: Citation counts vs. recognition ratio.

3.4 Authors

Are popular authors, i.e., people with more publications, recognized more easily? Similarly to papers, we define an author's recognition ratio to be the number of times the author's publications were successfully recognized by users (players) divided by the total number of times her publications were shown to users. In Figure 9 we plot authors' popularity, measured in the number of publications (in our paper selection), against recognition ratio. We find that there is a significant difference between authors with a single publication and authors with multiple publications; not surprisingly, having multiple publications benefits recognition. On the other hand, it appears that having many more publications does not improve recognition any further.

Figure 9: Author popularity (number of publications) vs. recognition ratio.

4 Observations

Based on the analysis of results as well as informal feedback from the users, we make a number of observations that may generalize beyond our specific use case.

Learning: From user feedback we know that this game was also used to learn about relevant, previously unseen, publications. In the spirit of Comenius' saying, "Much can be learned in play that will afterwards be of use when the circumstances demand it.", we believe that our game might also prove to be useful for exploring and discovering more about scientific literature.

Unfair behavior: We have a suspicion that in at least one case a user acted dishonestly in order to get into the lead. This user's score was unreasonably high compared to the second best. More importantly, her average time to answer was 12.84s (compared to the average of 7.85s), which seems just about enough time to use a web search engine to look up the paper in question. This behavior was reported by another competitor (which supports the assumption that players do care about their position in the leader board).

Head-start: In at least one case we observed that a user was restarting the game until he was able to answer the first question correctly. Beginning the game with a set of easier questions, then gradually increasing difficulty, might therefore be helpful in keeping users engaged.

Engaging users: From the statistics (§3.2) we can see that users are not very likely to return to the game days after their first visit. However, the chances that they repeatedly participate in the game within the first hour after their initial visit are much higher. It is therefore of vital importance to keep the user in the game as long as possible when she comes for the first time.

Identity: Some people (∼10%) opted to use their full civil name as opposed to a nickname. We hypothesize that this was a deliberate choice, made in case they make it to the leader board.

5 Conclusions and Future work

The study presented in this paper started with the following main question in mind: could we make an assessment exercise, in the context of expertise profiling, more appealing for users? We have answered this question positively. Our experiment has shown that it is more desirable for users to participate in a game-like assessment task than to have to evaluate results explicitly using a purpose-built interface. We have analyzed the collected data along different dimensions and have identified four types of gaming personalities based on behavioral patterns. On top of the analysis of the game mechanics, this experiment has allowed us to gather valuable data about authors and publications. This has let us perform an initial examination of the relation between popularity and recognizability for both papers and authors.
In future work we plan to enhance the game in several ways. The main purpose of the game is to indirectly measure how researchers recognize each other's publications. In this first version, fictitious authors were selected randomly; however, interesting experiments could be conducted if the selection of alternative authors were biased in a controlled way. This would allow us to adjust the difficulty of the questions as the game progresses. We also plan to add new game modes (e.g., time trial), expand the data set (i.e., add more publications), and possibly explore other research fields/communities.
References

[Eic14] Carsten Eickhoff. Crowd-powered experts: Helping surgeons interpret breast cancer images. In Proceedings of the First International Workshop on Gamification for Information Retrieval, pages 53–56, 2014.

[RBN14a] Jan Rybak, Krisztian Balog, and Kjetil Nørvåg. ExperTime: Tracking expertise over time. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, pages 1273–1274, 2014.

[RBN14b] Jan Rybak, Krisztian Balog, and Kjetil Nørvåg. Temporal expertise profiling. In Proceedings of the 36th European Conference on Advances in Information Retrieval, ECIR '14, pages 540–546, 2014.

[VAD08] Luis von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58–67, 2008.