=Paper=
{{Paper
|id=Vol-1956/GHItaly17_paper_01
|storemode=property
|title=Using User Created Game Reviews for Sentiment Analysis: A Method for Researching User Attitudes
|pdfUrl=https://ceur-ws.org/Vol-1956/GHItaly17_paper_01.pdf
|volume=Vol-1956
|authors=Bjorn Straatt,Harko Verhagen
|dblpUrl=https://dblp.org/rec/conf/chitaly/StraattV17
}}
==Using User Created Game Reviews for Sentiment Analysis: A Method for Researching User Attitudes==
<pdf width="1500px">https://ceur-ws.org/Vol-1956/GHItaly17_paper_01.pdf</pdf>
<pre>
           Using User Created Game Reviews for Sentiment
          Analysis: A Method for Researching User Attitudes
                                      Björn Strååt                                 Harko Verhagen
                                  Stockholm University                           Stockholm University
                                 Department of Computer                         Department of Computer
                                  and Systems Sciences                           and Systems Sciences
                                   Stockholm, Sweden                              Stockholm, Sweden
                                   bjor-str@dsv.su.se                          harko.verhagen@dsv.su.se
GHITALY17: 1st Workshop on Games-Human Interaction, April 18th, 2017, Cagliari, Italy.
Copyright © 2017 for the individual papers by the papers' authors. Copying permitted for
private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
This paper presents a method for gathering and evaluating user attitudes towards previously
released video games. All user reviews from two video game franchises were collected. The
most frequently mentioned words of the games were derived from this dataset through word
frequency analysis. The words, called “aspects”, were then further analyzed through a manual
aspect based sentiment analysis. The final analysis shows that the rating of a user review
correlates to a high degree with the sentiment of the aspect in question, if the data set is
large enough. This knowledge is valuable for a developer who wishes to learn more about
previous games' success or failure factors.

Author Keywords
Sentiment; sentiment analysis; user created content; reviews

ACM Classification Keywords
H.5.m. Information interfaces and presentation (e.g., HCI): Evaluation/methodology

INTRODUCTION
It is commonly acknowledged that designers and developers have much to gain from knowing the
needs and expectations of their future customers and users. In interaction design, many
methods of exploring this exist: interviews, observations, surveys, and other techniques are
used both in industry and in the academic world [1].

In relatively recent times, customer/user attitude researchers have turned to user created
content, such as social media, internet forums, and user reviews, with the intent of mining
user attitudes from within. Online text content is not a new source, but the phenomenon was
earlier more focused on expert rather than user created content. Users often express
themselves regarding their experiences; many services provide user ratings and reviews, e.g.
for products on the amazon.com online store, tourist guides such as Yelp.com and
TripAdvisor.com, movie review sites such as rottentomatoes.com, and many more. The video game
community is no different. The content provider service Steam allows the users to vote and
comment on games, and the website Metacritic.com presents both expert and user created
reviews. User created content offers a vast and varied source of data for anyone who wishes
to explore the user sentiment beyond the basic rating of previously released products.

In this study, we have performed an Aspect Based Sentiment Analysis (ABSA) [2] based on data
gathered from user reviews of two video game series on Metacritic.com. Our purpose was to
explore whether the sentiment in which an aspect (a commonly used word in the reviews) was
used would reflect the overall rating from the reviewers. A positive result would imply that
user reviews can be used to explain user attitudes (positive/negative sentiment) from a
root-cause point of view (the aspects).

The results show that, given that the data set is extensive enough, there is a strong
connection between the sentiment of the aspect and the rating the reviewer provided.

BACKGROUND
The use of video game reviews as a resource for game studies is not a common phenomenon. Most
of the studies that have been performed have used professional reviews: Pinelle, Wong &
Stach [3] used professional reviews as a source to find common video game issues, which they
compiled into a set of design patterns; Zagal, Ladd & Johnson [4] found that game reviews
often include design suggestions and serious discussions on game designers' intentions and
goals. User created reviews have been used as well, but not as frequently: Strååt &
Verhagen [5] used user reviews to evaluate video game heuristics, Zagal & Tomuro [6] studied
cultural differences and similarities in user created reviews from Japan and the USA, and
quite recently, Koehler, Arnold, Greenhalgh, Owens Boltz & Burdell published their article
“A Taxonomy Approach to Studying How Gamers Review Games” [7]. They used an existing
theoretical model, a video game taxonomy, and compared user submitted reviews with the
categories of the taxonomy. They found that users to a certain degree used the same
concepts as the taxonomy, and that there was a difference in use of the concepts depending on
the game rating. As more researchers move into the field, we would like to propose our method
as presented in this paper.

Metacritic
Metacritic.com is a site that aggregates professional reviewer scores from various online
media review sources. Television shows, movies, music and video games (various platforms) are
examples of media that are presented. Metacritic calculates an average score called
Metascore, based on the various professional reviewers by converting the reviewers’ local
score into a score of 0 to 100 (e.g. a local score of 8 out of 10 renders a Metascore of 80).
These scores are weighted (based on the quality and overall stature of the source) and
finalized into a professional Metascore.

Regular non-professional users are also allowed to score the media on a scale of 0 to 10. The
unweighted average of this score is presented by Metacritic as the Userscore. Non-professional
users can also post their own reviews along with their score. The Userscore does not consider
the length or quality of these reviews; a simple four-word comment, such as “this game is
good”, is valued the same as an analytical 500-word essay. User reviews and scores are posted
anonymously under a self-selected user name. The Userscore is divided into three tiers:
Positive, Neutral and Negative, where Positive is ratings 8 to 10, Neutral is ratings 5 to 7,
and Negative is ratings 0 to 4. The rating tiers are color coded in green for Positive, yellow
for Neutral and red for Negative.

Metacritic has been the subject of many discussions. The validity and value of the
professional reviews have been questioned in various video game blogs and online
magazines [8] [9], and the site has been used in game and social studies, e.g. as an
examination and comparison of player experience vis-à-vis professional reviews [10], or as a
key factor in assessing game value and quality [11]. Most commonly, the discussion has been
around the professional reviews. In this study, however, we have only looked at the Userscore
and user comments.

Games in this study
The goal of this study is to see if the user sentiment differs between games that are
released in a series. To this end, we decided to examine the user comments of the game series
“Dragon Age” and “Mass Effect”. At the time of the study, Dragon Age had three installments:
Dragon Age: Origins (DA1) [12], Dragon Age 2 (DA2) [13], and Dragon Age: Inquisition
(DA3) [14]. Mass Effect has four installments, but only the first three existed when we
performed the data collection. These are Mass Effect (ME1) [15], Mass Effect 2 (ME2) [16],
and Mass Effect 3 (ME3) [17].

We chose these franchises since they are widely known and represent a relatively common and
popular game genre (role playing games), and most importantly, they have received varying
ratings from players. The PC version of DA1 received a Userscore of 8.7/10.0 on Metacritic,
DA2 received 4.5/10.0, and DA3 received 5.9/10.0.

ME1 received a Userscore of 8.6/10.0, ME2 received 8.8/10.0, and ME3 was rated 5.6/10.0.

The sudden drop in ratings from DA1 to DA2, and from ME1/2 to ME3, tells us that something
has changed in the series, either with the games or the users. This is the phenomenon we
wanted to explore by analyzing the user reviews.

METHOD
In this section, we describe our scientific approach and methods for data gathering and
analysis. We use a qualitatively driven mixed methods approach, where quantitative methods
supplement and improve the study’s results. The qualitative analysis was done through a
manual aspect based sentiment analysis. The quantitative analysis was done through hypothesis
testing using a Chi-square test.

Aspect Based Sentiment Analysis
An aspect based sentiment analysis (ABSA) [2] is performed when user sentiment towards
certain aspects of a multi-aspect entity is to be measured in a dataset gathered from user
comments, such as online forums or user created reviews. Video games have plenty of aspects
that the user considers when playing, e.g. playability, graphics, storyline.

Aspects are words or phrases that exist either explicitly or implicitly in the dataset.
Explicit aspects are the actual word in context, and implicit aspects are inferred from the
context. For example, if the aspect is gameplay, an explicit occurrence could be “I really
enjoyed the gameplay”, and an implicit one could be “I really enjoyed the challenges and the
features of X.”

The aspects are determined through a word frequency analysis. After the dataset is collected,
product or domain relevant words that occur at a frequency above a pre-set threshold are
retained for the following sentiment analysis step. The sentiment analysis is then performed
either through a scripted natural language processing algorithm, or through a manual read
through. The result will show the sentiment for each aspect, for example in terms of
positive, neutral, or negative sentiment.

Word frequency and selection
The data collection for our ABSA was performed in the following steps. First, we collected
from Metacritic.com all user reviews of the PC-version of the three games from the Dragon Age
franchise (DA1, DA2, and DA3) and of the first three games from the Mass Effect franchise
(ME1, ME2, and ME3). As mentioned in the Metacritic description in the background section,
Metacritic authors rate their own reviews to reflect their experience of the game in
question. This is a rating from 0 to 10, but in effect it will categorize the comment as one
of three tiers: low, medium, or high rated. We decided to only work with the reviews of the PC-
version (the games exist for multiple platforms) as these were the versions that we were
familiar with.

For each game, we did a word frequency analysis, using AntConc¹, to find which aspects were
most frequently used in the reviews. As we had no previous practice of this method in this
context, the threshold was set after we saw the results: we decided to pursue the three most
frequent explicit aspects that were shared by all three games. These explicit aspects were:
Story, Combat, and Character. All reviews that did not contain any of the aspects were
omitted from the dataset. As the reviews were rated by the authors, we already had the rating
categories.

¹ AntConc, by Anthony (2012), is a freeware concordance and text analysis tool by Dr Laurence
  Anthony at the Faculty of Science and Engineering at Waseda University, Japan
  (http://www.antlab.sci.waseda.ac.jp/index.html).

Since the review rating and the sentiment of the aspect may differ (for example, a high rated
review may use an aspect in a negative way), it was important to collect all reviews of all
ratings that contained at least one aspect. Figure 1 is an illustration of how frequent the
aspects were in relation to review rating. As can be seen, the aspects tend to be more
frequent in low rated reviews than in high and mid rated reviews. This was true for all
games, but for reasons of limited space, only one figure is shown.

[Bar chart titled "Story-concept in relation to game rating": groups DA1 Story, DA2 Story,
and DA3 Story; series High rate, Mid rate, and Low rate; vertical axis from 0 to 2500.]
Figure 1: Relation between aspect and review rating

After the data collection, we had a dataset of reviews for each game, regarding the three
aspects (story, combat, character). Each review was categorized into its original rating
level.

So, in conclusion of this section:
    •    Aspects were determined through word frequency analysis of all the user reviews
    •    The three most frequent aspects were combat, story, character.
    •    Each game had a number of reviews
    •    A review contains at least one of the aspects
    •    A review is rated as either low, medium, or high
    •    The dataset contains all reviews, sorted by game, rating, and aspect.

Manual Sentiment Analysis
The sentiment analysis was performed online, through an online crowdsourcing service.² The
rating and name of the game were omitted for the evaluators to limit the risk of bias. The
evaluators were asked to read a review, or an excerpt of a review, which contained one of the
aspects, and to determine if the author of the review had used the aspect in a positive,
neutral, or negative way. The following quote is an example of an excerpt that the evaluators
judged:

    “The menus, crafting and combat are so totally and completely cumbersome. Everything is
    very statically organized and takes so much time. I spent an ungodly amount of hours
    collecting resources, crafting things, comparing items to what I already owned and it is
    just so, so, so cumbersome and tiresome, it really damages the game”

² www.crowdflower.com; a data mining and crowdsourcing service where researchers can upload
  their data e.g. for manual sentiment analysis by anonymous evaluators.

The aspect of combat occurs in the quote, and the overall use of the aspect is considered
negative. In total, 8268 review excerpts from the DA series and 3357 from the ME series were
analyzed this way, and each aspect was judged by at least three evaluators. If an excerpt
contained more than one aspect, it was run again through a second (or third) sentiment
analysis, where that aspect was in focus for the evaluator. When the sentiment analysis was
done, the dataset was reconstructed with rating and game name.

Chi-square analysis
Chi square is a common test for hypothesis testing. At its core, it calculates the
differences between observed frequencies and expected frequencies in a row by row and column
by column calculation, adding the calculations for each cell together into one comprehensive
measure. Depending on the degrees of freedom (number of rows minus one, times number of
columns minus one) and the chosen significance level, cut-off values have been calculated. A
Chi square above the cut-off value means that the probability of obtaining such data if the
variables were independent (the null hypothesis) is below the significance level (usually .05
or lower). In general, for 2*2 tables, a lower threshold of 5 for each expected frequency is
thought to be needed, even if some
debate exists concerning this value. Thus, for a 3*3 table such as ours, at least 45
observations need to exist from the start for Chi square to be a reliable test by general
agreement.

RESULT
After the sentiment analysis, we processed the data from an analytical standpoint. Table 1
shows the complete data set for all three DA games, distributed on review ratings, aspects
and sentiment, and table 2 shows the same for the ME games.

We tested the relevance of each of the three aspects for the overall review. We constructed
the following null hypothesis: There is no relationship between the values of aspect X
(character, combat or story) and the overall review rating.

 Dragon Age              Review rating
 Aspect   Sentiment     Low     Mid    High   Total   Aspect total
 Char.    bad           633     257      87     977
          neutral      1038      72      92    1202
          good           68     148     543     759       2938
 Comb.    bad           520     211      72     803
          neutral       358      50      48     456
          good           43      83     353     479       1738
 Story    bad           993     278      69    1340
          neutral      1056     119     129    1304
          good           72     142     734     948       3592
 Total                 4781    1360    2127    8268

Table 1: The aspects distributed on review ratings, for all three games in the Dragon Age
franchise. The values are from the evaluators' sentiment analysis.

 Mass Effect             Review rating
 Aspect   Sentiment     Low     Mid    High   Total   Aspect total
 Char.    bad           256     120      33     409
          neutral        28      34      49     111
          good           53      77     411     541       1061
 Comb.    bad            88      66      35     189
          neutral        10      19      80     109
          good           25      45     191     261        559
 Story    bad           425     164      65     654
          neutral        59      70     128     257
          good           58      91     667     826       1737
 Total                 1012     686    1659    3357

Table 2: The aspects distributed on review ratings, for all three games in the Mass Effect
franchise. The values are from the evaluators' sentiment analysis.

Chi Square Test Results
Using the Chi square test, we obtained the values presented in tables 3 and 4. The tables
show Chi-square per aspect for each game.

 Game         Aspect       Chi square
 DA1          Character         120.2
              Combat            100.4
              Story             196.6
 DA2          Character         304.9
              Combat            299.6
              Story             426.6
 DA3          Character        1072.5
              Combat            374.4
              Story            1250.2
 DA series    Character        1541.3
              Combat            813.8
              Story            1963.4

Table 3: Chi-square values for each aspect from the DA series

All values exceed the threshold at p = 0.001 and 4 degrees of freedom (18.465); thus, in all
cases of the DA series, the null hypothesis can be rejected. We conclude that there is a
correlation between the aspect value and the overall review value.

 Game         Aspect       Chi square
 ME1          Character         94.12
              Combat            31.15
              Story            139.48
 ME2          Character        252.13
              Combat            75.20
              Story            470.90
 ME3          Character        466.21
              Combat           163.41
              Story            797.27

Table 4: Chi-square values for each aspect from the ME series
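
As an illustration of how a value in Table 3 relates to the counts in Table 1, the following
Python sketch recomputes the Pearson Chi-square statistic for the Character aspect of the
whole DA series. The function name chi_square and the hard-coded table are illustrative
assumptions and not part of the original study; the critical value is the one quoted in the
text for p = 0.001 and 4 degrees of freedom. It is a sketch of the standard test, not
necessarily the exact procedure the authors used.

```python
def chi_square(observed):
    """Pearson chi-square for a contingency table given as a list of rows.

    Returns (statistic, degrees_of_freedom, smallest_expected_count).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    min_expected = float("inf")
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of sentiment and rating
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
            min_expected = min(min_expected, expected)

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof, min_expected


# Table 1, Character aspect, DA series.
# Rows: bad, neutral, good sentiment. Columns: Low, Mid, High review rating.
da_character = [
    [633, 257, 87],
    [1038, 72, 92],
    [68, 148, 543],
]

# Critical value as quoted in the text (p = 0.001, 4 degrees of freedom)
CRITICAL_P001_DF4 = 18.465

statistic, dof, min_expected = chi_square(da_character)
print(f"chi-square = {statistic:.1f}, df = {dof}")
print(f"smallest expected count = {min_expected:.1f}")
print("reject null hypothesis:", statistic > CRITICAL_P001_DF4)
```

With these counts the statistic should come out close to the 1541.3 reported for the DA
series Character aspect. The smallest expected count is returned as well because the text
treats an expected frequency of at least 5 per cell as a precondition for the test.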
Given the minimum value of 5, each row or column should have at least 15 observations, which
in the case of ME1 does not hold for any of the aspects as there are fewer than 15
observations in the "Low" column for each of the 3 aspects. The same goes for ME2 for the
Combat aspect (only 5 with score "Low" and only 7 with score "Mid").

DISCUSSION
Our results show that if an aspect occurs in a review, the sentiment of that aspect will
reflect the rating of the review. The null hypothesis was rejected for all games and all
aspects, except for two of the games in the ME series, ME1 and ME2.

This implies that the aspects reflect areas in the games that are disliked by the users. The
relatively high frequency of the aspects is an indication that these areas are the most
important ones for the users. It also indicates that the root cause of the low rated reviews
is to be found within the game features that the aspects represent.

The null hypothesis could not be rejected for ME1 and ME2 due to the lack of data for these
two games. Looking at table 2, we can see that there are only 10 Low review/Combat neutral
excerpts, meaning that this data point cannot be calculated using Chi-square. This is a good
indication that the threshold for the word frequency analysis (please see method section,
word frequency and selection) must be at least 45 for the analysis to be valid.

However, a game designer might not need the analysis to be statistically valid. Consider
figure 1. The number of user reviews increases for each instalment of the game franchise, but
a large majority of the increase is within the negatively rated reviews. This is our first
clue that the related aspect is important to the users. This is not a statistically validated
result, but it gives us an indication of whether we are looking at something that needs to be
further investigated. The number of low rated reviews that contain at least one of the
aspects may indicate that these aspects are part of the reasons that users didn’t appreciate
the games. From a video game developer standpoint, we could stop here. It wouldn’t take too
long to manually read through a few pages of these comments to get an estimated overview of
whether the aspects are used in a negative sentiment or not. A developer can, at this stage,
get this overview and regard their design choices accordingly.

The frequency of the aspects implies that they are important to the users, which in turn
implies that the authors of the low rated reviews are disappointed with the aspects as
presented in the games. A future research task would be to perform a more qualitative
analysis, on user review level, to pinpoint the root cause of the problems that the users
experience. A content analysis of the material, for example, would give a more detailed
insight. Furthermore, we have only worked with the PC-reviews of the game franchises. A full
analysis of all the platforms for all the games would possibly render a different result, or
enhance the one presented in this paper.

REFERENCES

[1] D. Benyon, P. Turner and S. Turner, Designing interactive systems: People, activities,
    contexts, technologies, Pearson Education, 2005.

[2] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos and
    S. Manandhar, "Semeval-2014 task 4: Aspect based sentiment analysis," Proceedings of
    SemEval, pp. 27-35, 2014.

[3] D. Pinelle, N. Wong and T. Stach, "Heuristic Evaluation for Games: Usability Principles
    for Video Game Design," in Proceedings of the ACM Conference on Human Factors in
    Computing Systems (CHI 2008), 2008.

[4] J. P. Zagal, A. Ladd and T. Johnson, "Characterizing and understanding game reviews," in
    Proceedings of the 4th international Conference on Foundations of Digital Games, 2009.

[5] B. Strååt and H. Verhagen, "VOX POPULI - A Case Study of User Comments on Contemporary
    Video Games in Relation to Video Game Heuristics?," United Kingdoms, 2014.

[6] J. P. Zagal and N. Tomuro, "Cultural differences in game appreciation: A study of player
    game reviews," in FDG, 2013.

[7] M. J. Koehler, B. Arnold, S. P. Greenhalgh, L. O. Boltz and G. P. Burdell, "A taxonomy
    approach to studying how gamers review games," Simulation & Gaming, vol. 48, no. 3,
    pp. 363-380, 2017.

[8] "Metacritic Matters: How Review Scores Hurt Video Games," 08 08 2015. [Online].
    Available: http://kotaku.com/metacritic-matters-how-review-scores-hurt-video-games-472462218.
    [Accessed 18 04 2016].

[9] "Time to kill Metacritic," 15 10 2014. [Online].
    Available: http://www.mcvuk.com/news/read/time-to-kill-metacritic/0139824.
    [Accessed 18 04 2016].

[10] D. Johnson, C. Watling, J. Gardner and L. Nacke, "The edge of glory: The relationship
     between metacritic scores and player experience," in Proceedings of the first ACM SIGCHI
     annual symposium on Computer-human interaction in play, 2014.

[11] A. Greenwood-Ericksen, S. R. Poorman and R. Papp, "On the Validity of Metacritic in
     Assessing Game Value," Eludamos. Journal for computer Game Culture, vol. 7, no. 1,
     pp. 101-127, 2013.

[12] Dragon Age: Origins, BioWare, 2009.

[13] Dragon Age II, BioWare, 2011.

[14] Dragon Age: Inquisition, BioWare, 2014.

[15] BioWare, Mass Effect, USA: Electronic Arts, 2007.

[16] BioWare, Mass Effect 2, USA: Electronic Arts, 2010.

[17] BioWare, Mass Effect 3, USA: Electronic Arts, 2012.

</pre>