IR Game: How well do you know information retrieval papers?

Jan Rybak (Norwegian University of Science and Technology, jan.rybak@idi.ntnu.no)
Krisztian Balog (University of Stavanger, krisztian.balog@uis.no)
Kjetil Nørvåg (Norwegian University of Science and Technology, kjetil.norvag@idi.ntnu.no)

Abstract

In this paper we demonstrate how a gamification approach increases the attractiveness of an assessment exercise in the context of expertise profiling. We present an online game, in two difficulty modes, where users have to guess the authors of publications. We analyze the collected data along different dimensions and identify four types of gaming personalities based on behavioral patterns. Further, we examine the relation between popularity and recognizability for both papers and authors. Finally, we provide insights into game mechanics that extend beyond our specific use case.

1 Introduction

Gamification is a method for keeping users involved in a task for longer periods of time or for encouraging them to repeatedly undergo otherwise not-so-entertaining tasks. Another reason for gamification is the desire to generate useful data as a by-product of the playing activity [VAD08]. Our work takes place in the context of (temporal) expertise profiling, where we are concerned with identifying what topics people are knowledgeable about [RBN14b]. Specifically, we focus on the academic domain, where scientific publications constitute the best available evidence from which to draw conclusions regarding a person's expertise. Of course, what one holds as her most important publication(s) does not necessarily correspond with what others (i.e., the scientific community) consider as such. Arguably, the latter is more important. In our experience, getting the first question answered (what a person considers his most important publications) is not easy; we developed an assessment interface for this purpose [RBN14a] and found that people were not especially willing to spend time with it. (It has to be mentioned, however, that selecting the most important publications was only part of the task; the assessment procedure was more involved than that.) Our main motivation, therefore, is to attract users who can represent the relevant scientific community's general opinion and can generate data to be used in our efforts to evaluate temporal expert profiling approaches [RBN14b].

Gamification is not the only alternative we considered. In many scenarios, crowdsourcing is a valid option for the delivery of simple yet time-consuming or repetitive tasks such as data annotation or evaluation. Eickhoff [Eic14] examines the crowd-powered expert paradigm, where the majority of the workload is preprocessed by crowd workers and experts are only needed for specialized steps. None of these approaches, however, is applicable in our case; here, the entire task is strictly domain-specific and requires the involvement of domain experts.

We cast our assessment exercise as a simple question-answering quiz that tests the user's knowledge of information retrieval (IR) papers, i.e., the "IR game." The player is presented with the title of a publication, from selected top conferences, and her task is to attribute the paper to the corresponding author(s). (This game should not be unfamiliar to academics; many of us perform a similar "authorship attribution" exercise, albeit not deliberately, when performing blind reviews. The main difference is that here we offer instant feedback while providing limited context.) The game comes in two difficulty modes. In "beginner" mode, the user has to select the right set of authors from three options, while in "advanced" mode authors have to be picked out individually. The goal in each mode is to answer as many questions, i.e., collect as many points, as possible. The game ends after three wrong answers. A "leader board" is provided to track the highest scoring players. The game is available at http://bit.ly/ir-game.

In the course of this work, we examine whether we are able to attract more interest (and collect more data) by presenting our assessment exercise indirectly, as a game, as opposed to dealing with it explicitly using a purpose-built interface. In addition, we address a number of more specific questions:

• Which level of difficulty is preferred, the easy mode or the advanced one?
• Does a competitive element, such as a leader board, increase the level of engagement?
• When do users stop playing?
• Do users return to play again? After how long?
• What types of players can we identify?
• Are more cited papers also more easily recognized?
• Are more popular authors also more easily recognized?
• Do people prefer to play anonymously?

Our findings confirm the premise that interweaving game mechanics into a non-game environment is beneficial in terms of task attractiveness. We also demonstrate that the leader board is a powerful motivator for many people.
2 The Game

The "IR game" is a simple quiz that tests users' knowledge of publications in the field of information retrieval.

2.1 Game rules

Given a publication title, the player's task is to select the correct authors within a given time limit. The game can be played in two modes, beginner and advanced.

Beginner mode. The user has to select the (entire) group of authors from three options. Only one variant is correct, and all authors are listed in the same order as on the paper. See Figure 1(a).

Advanced mode. In the more difficult game mode, individual author names are offered and the user has to decide which of them belong to the paper. The number of names offered does not necessarily correspond to the actual number of the paper's authors, and the same applies to the order of names. See Figure 1(b).

[Figure 1: IR Game in (a) beginner and (b) advanced mode.]

The user is credited with the corresponding F1-score for each answer (i.e., comparing the selected set of authors against the correct one); answers with an F1-score below 0.5 count as wrong. In both modes, the game ends after three wrong answers.
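For concreteness, the per-answer credit is the standard F1-score of the selected author set against the true one. The following minimal sketch illustrates the rule; the function name and the edge-case handling are our own assumptions, not the game's actual implementation.

```python
def answer_score(correct, selected):
    """F1-score of the player's selection against the true author set.

    A minimal sketch of the scoring rule described above; names and
    edge-case handling are our assumptions, not the game's code.
    """
    correct, selected = set(correct), set(selected)
    if not correct or not selected:
        return 0.0
    overlap = len(correct & selected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(correct)
    return 2 * precision * recall / (precision + recall)

# Answers scoring below 0.5 count as wrong; three wrong answers end the game.
score = answer_score({"K. Balog", "J. Rybak"}, {"K. Balog", "K. Nørvåg"})
wrong = score < 0.5  # here F1 = 0.5, so the answer still (just) counts
```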
A separate leader board is available for each game mode, listing the highest scoring players (with score, name, and timestamp). Players who made it to the leader board were offered the opportunity to "brag" about their achievement on Twitter.

2.2 Data

The collection of publications used in this game comprises the 1111 top-cited IR papers (according to the ACM DL, http://dl.acm.org) from the period 2004-2014 that were presented in one of the following conference series: SIGIR, WWW, CIKM, KDD, and WSDM. For each publication, besides its own set of original authors, a set of "fictitious" authors is randomly selected from other documents within the same data set.

Usage data is collected while the game is being played. Specifically, for each user, we store the questions that have been asked in the game, correct and wrong answers, scores, date and time, time to answer, number of attempts to copy text from the webpage, and rough location. In order to recognize returning users, we use browser cookies with a unique identifier. We also track the site's traffic using Google Analytics (http://www.google.com/analytics/).
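The random distractor selection described above could be implemented along the following lines. This is only a sketch: the paper-as-dict schema, the function name, and the exclusion of a paper's true authors from the distractor pool are our assumptions rather than details given in the paper.

```python
import random

def distractor_authors(paper, collection, k):
    """Draw k "fictitious" author names for a question, uniformly at
    random, from other papers in the same data set (cf. Section 2.2).

    Papers are assumed to be dicts with an "authors" list; this schema
    is our illustration, not the game's actual data model.
    """
    pool = {name
            for other in collection if other is not paper
            for name in other["authors"]}
    pool -= set(paper["authors"])  # assume a true author is never offered as a distractor
    return random.sample(sorted(pool), k)  # sorted() only to make the pool order deterministic

# Hypothetical usage:
papers = [{"title": "Paper A", "authors": ["X", "Y"]},
          {"title": "Paper B", "authors": ["Z"]}]
print(distractor_authors(papers[0], papers, 1))  # -> ['Z']
```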
2.3 Usage statistics

The game was promoted on Twitter (hashtag #irgame), aiming at people from the IR community. In this paper, we analyze traffic from the first five days of the game's existence (i.e., from January 31 to February 4, 2015). During this period, 302 unique visitors from 33 countries visited the site, and more than one third of them participated in the game. Figure 2 presents the geographic distribution of visitors; this roughly corresponds to the distribution of IR groups in the world (albeit Norway is admittedly over-represented in this figure). Figure 3 shows the time of day when the game was played (normalized according to users' timezones).

[Figure 2: Geographic distribution of visitors.]
[Figure 3: Time of day vs. number of games played.]

Table 1 presents usage statistics, in terms of number of games played and number of unique players. We observe that the beginner game mode was almost 9 times more successful than the advanced one in terms of game counts and almost 7 times in terms of player counts. These statistics also show that people who took part in the easier version of the game were more likely to play again.

Table 1: Usage statistics

                            Beginner   Advanced   Total
  #unique players           111        16         116
  #games played             347        39         387
  avg. #games per player    3.14       2.44       3.34

3 Analysis of results

Next, we analyze the collected data in different ways: by answers (§3.1), by players (§3.2), by papers (§3.3), and by authors (§3.4).

3.1 Answers

Time to answer. In both game modes, users' response time is limited to 15 seconds. If the time limit is exceeded, the answer is considered wrong. On average, it took about half of the specified time limit to provide an answer; more precisely, it was 7.58s. There is a notable difference, 1.8s, between the average response times for correct (6.74s) and wrong (8.53s) answers. Looking at the distribution of answer times in Figure 4, we find that the higher response time for wrong answers is due to timeouts. It is also visible from this plot that when the user knows the correct answer, he is less likely to use up all the time available.

[Figure 4: Time to answer in beginner mode.]

Total scores. Figures 5(a) and 5(b) depict the distribution of total game scores for beginner and advanced modes, respectively. These resemble power-law distributions, although the existing data (especially for the advanced mode) are too sparse to infer their parameters. In both cases there is a noticeable drop when going from score 3 to 4. This has to do with the fact that the game ends after 3 mistakes.

[Figure 5: Distribution of game scores in (a) beginner and (b) advanced mode.]

3.2 Players

Returning visitors. An interesting measure of success of a game is the number of returning players. We examined all cases where a user played more than once (56 cases), searching for interesting patterns. Most commonly, a user plays again right after she finishes a game: in 42 cases, users played again within the same hour. Figure 6 presents the time intervals between users' returns.

[Figure 6: Time elapsed between games (from the first game) for returning visitors.]

Player types. From the analysis of the game series, we can derive four types of players, depending on when they decide to leave the game.

Jumpers are visitors who come and play a single game; they leave after that, no matter what the score is.

Give-upers are players who return repeatedly but leave the game out of demotivation when, in a series of games, their score drops.

Fighters do the exact opposite: they leave the game at the top of their form, when, in a series of games, they reach their highest score.

Achievers care about winning. They keep returning and playing the game until they are back on top of the leader board.

Figure 7 shows an example user for each of the player types.

[Figure 7: Examples of player types: (a) Jumper, (b) Give-uper, (c) Fighter, (d) Achiever. Session boundaries are marked with vertical red lines.]
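One way to make this grouping concrete is a small heuristic over a player's chronological score series, sketched below. The paper assigns types by inspecting game series; the exact rules here, and the hypothetical leaderboard_top parameter, are our assumptions.

```python
def player_type(scores, leaderboard_top=None):
    """Label a player from their chronological game scores.

    A rough operationalization of the four types described above; the
    thresholds and the leaderboard_top parameter are our assumptions,
    not the classification procedure used in the paper.
    """
    if len(scores) == 1:
        return "jumper"      # one game, then gone, whatever the score
    if leaderboard_top is not None and scores[-1] >= leaderboard_top:
        return "achiever"    # kept playing until back on top of the leader board
    if scores[-1] == max(scores):
        return "fighter"     # quit at the top of their form
    return "give-uper"       # quit after their score dropped
```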
3.3 Papers

Are popular papers, i.e., papers with more citations, recognized more easily? In order to answer this question, we first introduce the concept of a paper's recognition ratio: the number of times the publication was successfully recognized by users (players) divided by the total number of times it was shown to users. Next, we divide all publications that appeared in the game into three groups based on the number of citations they received (according to the ACM DL). We report the average recognition ratio for each group in Figure 8, where the leftmost bar represents papers with the highest number of citations. As expected, we find that more cited papers are in general better recognized (left and middle vs. right), albeit papers in the middle of the citation range seem to perform best in this regard. We note that these findings may not be conclusive due to data sparsity.

[Figure 8: Citation counts vs. recognition ratio.]

3.4 Authors

Are popular authors, i.e., people with more publications, recognized more easily? Similarly to papers, we define an author's recognition ratio as the number of times the author's publications were successfully recognized by users (players) divided by the total number of times her publications were shown to users. In Figure 9 we plot authors' popularity, measured by the number of publications (in our paper selection), against recognition ratio. We find that there is a significant difference between authors with a single publication and authors with multiple publications; not surprisingly, having multiple publications benefits recognition. On the other hand, it appears that having many more publications does not improve recognition any further.

[Figure 9: Author popularity (number of publications) vs. recognition ratio.]
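Both definitions reduce to simple counting over the game log. The sketch below computes paper- and author-level recognition ratios under an assumed log format; the (paper_id, recognized) event schema and the names are illustrative, not the game's actual data model.

```python
from collections import Counter

def recognition_ratios(events, authors_of):
    """Paper- and author-level recognition ratios from game-log events.

    `events` is assumed to be an iterable of (paper_id, recognized)
    pairs and `authors_of` a mapping from paper_id to its author list;
    a sketch of the definitions in Sections 3.3 and 3.4.
    """
    shown, hit = Counter(), Counter()
    a_shown, a_hit = Counter(), Counter()
    for paper_id, recognized in events:
        shown[paper_id] += 1
        hit[paper_id] += int(recognized)
        for author in authors_of[paper_id]:  # author ratio aggregates over her papers
            a_shown[author] += 1
            a_hit[author] += int(recognized)
    paper_ratio = {p: hit[p] / shown[p] for p in shown}
    author_ratio = {a: a_hit[a] / a_shown[a] for a in a_shown}
    return paper_ratio, author_ratio
```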
4 Observations

Based on the analysis of results as well as informal feedback from the users, we make a number of observations that may generalize beyond our specific use case.

Learning. From user feedback we know that this game was also used to learn about relevant, previously unseen publications. In the spirit of Comenius' saying, "Much can be learned in play that will afterwards be of use when the circumstances demand it," we believe that our game might also prove useful for exploring and discovering more about scientific literature.

Unfair behavior. We suspect that in at least one case a user acted dishonestly in order to get into the lead. This user's score was unreasonably high compared to the second best. More importantly, her average time to answer was 12.84s (compared to the overall average of 7.85s), which seems just about enough time to use a web search engine to look up the paper in question. This behavior was reported by another competitor (which supports the assumption that players do care about their position on the leader board).

Head-start. In at least one case we observed that a user was restarting the game until he was able to answer the first question correctly. Beginning the game with a set of easier questions, then gradually increasing difficulty, might therefore be helpful in keeping users engaged.

Engaging users. From the statistics (§3.2) we can see that users are not very likely to return to the game days after their first visit. However, the chances that they repeatedly participate in the game within the first hour after their initial visit are much higher. It is therefore of vital importance to keep the user in the game as long as possible when she comes for the first time.

Identity. Some people (∼10%) opted to use their full civil name as opposed to a nickname. We hypothesize that this was a deliberate choice, made in case they make it to the leader board.

5 Conclusions and Future work

The study presented in this paper started with the following main question in mind: could we make an assessment exercise, in the context of expertise profiling, more appealing for users? We have answered this question positively. Our experiment has shown that users find it more desirable to participate in a game-like assessment task than to evaluate results explicitly using a purpose-built interface. We have analyzed the collected data along different dimensions and have identified four types of gaming personalities based on behavioral patterns. On top of the analysis of the game mechanics, this experiment has allowed us to gather valuable data about authors and publications. This has let us perform an initial examination of the relation between popularity and recognizability for both papers and authors.

In future work we plan to enhance the game in several ways. The main purpose of the game is to indirectly measure how researchers recognize each other's publications. In this first version, fictitious authors were selected randomly; however, interesting experiments could be conducted if the selection of alternative authors was biased in a controlled way. This would allow us to adjust the difficulty of the questions as the game progresses. We also plan to add new game modes (e.g., time trial), expand the data set (i.e., add more publications), and possibly explore other research fields/communities.

References

[Eic14] Carsten Eickhoff. Crowd-powered experts: Helping surgeons interpret breast cancer images. In Proceedings of the First International Workshop on Gamification for Information Retrieval, pages 53-56, 2014.

[RBN14a] Jan Rybak, Krisztian Balog, and Kjetil Nørvåg. ExperTime: Tracking expertise over time. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, pages 1273-1274, 2014.

[RBN14b] Jan Rybak, Krisztian Balog, and Kjetil Nørvåg. Temporal expertise profiling. In Proceedings of the 36th European Conference on Advances in Information Retrieval, ECIR '14, pages 540-546, 2014.

[VAD08] Luis von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58-67, 2008.