Summarizing Game Reviews: First Contact

Aris Kosmopoulos (SciFY PNPC and NCSR Demokritos, Athens, Greece) akosmo@scify.org
Antonios Liapis (Institute of Digital Games, University of Malta, Msida, Malta) antonios.liapis@um.edu.mt
George Giannakopoulos (SciFY PNPC and NCSR Demokritos, Athens, Greece) ggianna@iit.demokritos.gr
Nikiforos Pittaras (NCSR Demokritos and Kapodistrian University of Athens, Athens, Greece) npittaras@di.uoa.gr

ABSTRACT
In recent years the number of players willing to submit a video game review has increased drastically. This is due to a combination of factors, such as the raw increase in the number of video gamers and the wide use of gaming platforms that facilitate the review submission process. The vast amount of data produced by reviewers makes extracting actionable knowledge difficult, both for companies and for other players, especially if the extraction is to be completed in a timely and efficient manner. In this paper we experiment with a game review summarization pipeline that aims to automatically produce review summaries through aspect identification and sentiment analysis. We build upon early experiments on the feasibility of evaluation for the task, designing and performing the first evaluation of its kind. Thus, we apply variants of a main analysis pipeline on an appropriate dataset, studying the results to better understand possible future directions. To this end, we propose and implement an evaluation procedure for the produced summaries, creating a benchmark setting for future works on game review summarization.

CCS CONCEPTS
• Applied computing → Computer games; • Computing methodologies → Information extraction.

KEYWORDS
summarization, natural language processing, sentiment analysis, game reviews, Steam

GAITECUS0, September 2–4, 2020, Athens
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The ever-expanding popularity of digital games is evidenced by the large profit margins of the commercial game industry sector [4], the vast and diverse swathes of the population that play games [17], and the appeal of games and gamification beyond the purposes of entertainment [13]. A large factor in the market penetration of digital games is distribution platforms such as Steam and the Google Play Store. Not only do these distribution platforms allow interested players to purchase and download new games, they also cultivate a player community, with players returning to rate and comment on their favorite game or even contribute user-created content, strategies, cheats, etc. This community-driven content often informs other users' purchases (e.g. via an aggregated review score) but is also carefully monitored by developers and publishers in order to gauge opinions on specific aspects of the game which can be patched or improved in updates or in sequels. For both players and developers, being able to succinctly monitor other players' views is highly beneficial. The website www.metacritic.com aggregates reviews by players and professional critics, returning a percentage score for the game and highlighting diverse reviews along the spectrum of positive versus negative. The Steam platform also aggregates its users' reviews into different categories ('Mixed', 'Overwhelmingly Positive', 'Mostly Negative', etc.), which is another criterion for sorting and (likely) promoting games. The simple aggregation of reviews into a general score is important, but it obfuscates the nuances of the different reviewers' grievances and is of limited use to designers who wish to improve their game. This paper explores techniques for text summarization in order to provide a multi-dimensional and holistic summary of Steam reviews for a particular game.

We explore the topic of summarization for game reviews using a large dataset of Steam reviews from 12 selected games. The goal of the summarization pipeline is to extract users' views on different facets of games such as graphics, audio, and gameplay [28], leveraging textual sentiment analysis to identify positive and negative review snippets and creating a composite summary of indicative comments on a specific game facet. Unlike the numerical aggregation of Metacritic or Steam, this approach extracts individual sentences (and criticisms) contained within a usually dense review and attempts to classify them as positive or negative automatically (rather than based on the user's binary recommendation). The presentation of the game's summary, which is split based on different aspects typically criticized in games, can be valuable for both players and designers. For players, the statistics derived from this process (e.g. the ratio of positive versus negative comments in one aspect) can act as an expanded game scoring system, not unlike professional game reviews which gave separate scores to graphics, audio, etc. For designers, the indicative comments split per sentiment and aspect allow for quick monitoring of players' current favorite features. Moreover, the flexible way in which aspects are defined allows designers to explicitly redefine the keywords they are interested in, personalizing the summary to their design priorities.

There has been very limited attention to game review summarization, besides student projects [55].
Inspired by the only work that performs aspect-based game review summarization [50], this paper evaluates the outcomes of a straightforward summarization pipeline in a small-scale user survey. Using the twelve most reviewed games in a 2017 dataset of Steam reviews, the resulting summaries are evaluated by a small set of experts. The paper studies pipeline variants to better sketch what is important in game review summarization. Based on the outcomes of the different summarization processes, and a small-scale study where the different outcomes were compared, a number of potential improvements were identified. The paper also highlights the many directions which game review summarization research can follow so that it can serve designers and players, through different pipeline implementations, alternative visualizations, bottom-up aspect discovery, or text processing driven by domain knowledge.

The paper is structured as follows. We start with a review of related works in Section 2. We then describe the proposed summarization pipeline and its variants in Section 3. We describe the dataset in Section 4 and present two different user studies in Sections 5 and 6. We then discuss the results in Section 7 and conclude the paper in Section 8.

2 RELATED WORK
User reviews are a rich source of information, although the extraction and analysis of this information can be challenging, not only due to the textual nature of the medium but also because users tend to have mixed opinions about various features [31]. Approaches such as sentiment analysis as well as summarization have been applied to various datasets, such as product reviews [24, 31], movie reviews [53, 54], or hotel reviews [25]. Section 2.1 surveys relevant approaches for the different phases of a summarization pipeline, while Section 2.2 discusses the nuances of the Steam platform and early work in game review summarization. For interested readers, [25] provides a more thorough overview of review summarization according to the type of corpora used as input.

2.1 Summarization Pipeline
Summarization can be extractive, when relevant portions (usually sentences) of the input are copied and combined, or abstractive, when new text is generated to rephrase and summarize the input [18]. The summarization pipeline requires a number of steps before the raw textual input can produce a summary; algorithms and approaches for each step are discussed below.

2.1.1 Pre-processing and parsing. A fundamental step towards summarization (and natural language processing more broadly) is the pre-processing and extraction of features from the dataset. In the analysis below, the term documents is used to describe any type of text, e.g. a sentence, a paragraph, or an academic paper. One popular if naive approach for pre-processing data is the bag-of-words, which collects all words in the document, disregarding their order and grammar. This method counts the number of instances of the same word, and the frequency of occurrence of each word is used as a feature to measure similarity between documents. Since many words (such as articles or pronouns) are far more frequent across all documents, terms are weighted based on their frequency via tf-idf [41], where the term frequency (tf) is multiplied by the inverse document frequency (idf). Unlike the bag-of-words approach, word order is considered in many other approaches, as it can capture a word's importance. For instance, the first and last sentences in a larger document tend to be more important [33]. Other approaches tag words by their part-of-speech (POS) [37], e.g. nouns (NN), verbs (VB), or adverbs (RB). This is useful for pre-processing, e.g. selecting only sentences with a noun and an adjective as a corpus for review summarization [25]. Another use of POS tags is to select N-grams (i.e. sequences of words) with specific parts of speech, such as a comparative adverb followed by an adjective [47].

2.1.2 Topic Modeling. Identifying the topic of a document, sentence, or review is often necessary for clustering opinions on the same topic together. When the topics of interest are known in advance, experts usually provide the keywords used to filter the relevant documents. For instance, TweetElect used an initial set of 38 keywords related to the 2016 US elections (including candidates' names) for streaming relevant tweets [11]. However, a boolean check of whether a keyword is specifically mentioned is rarely sufficient due to the nuances of language; query expansion is applied to create a larger set of terms related to each original keyword [29]. Supervised learning is often applied for topic modelling, showing positive and negative examples of relevant documents to a classifier [29]. When topics are unknown and must be discovered from the data, a simple approach is to identify the most frequent terms and cluster emergent terms based on co-occurrence [16]. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [8] can more efficiently discover topics without domain knowledge, following a bag-of-words approach which disregards word and document order. LDA randomly chooses a set of topics and decomposes the probability distribution matrix of words in documents into two matrices: the distribution of topics in each document and the distribution of words in each topic. Due to the vast number of possible topic structures, sampling-based algorithms are used to find the sample topics which best approximate the posterior distribution [7]. LDA has often been applied to find topics within reviews, primarily in order to identify reviews' sentiments towards these topics, e.g. in [26].
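As a minimal illustration of the LDA-based topic discovery described above, the sketch below fits a topic model over a handful of toy review sentences with scikit-learn. The vectorizer settings, number of topics, and example sentences are arbitrary choices made for the illustration and are unrelated to the experiments in this paper.

    # Minimal LDA topic-modeling sketch (illustrative only; parameters are arbitrary).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    reviews = [
        "The soundtrack and voice acting are superb.",
        "Constant lag and server crashes ruin the multiplayer.",
        "Beautiful scenery and smooth animations throughout.",
    ]

    # Bag-of-words counts, dropping English stop words.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(reviews)

    # Decompose the document-word matrix into document-topic and topic-word distributions.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)

    # Print the top words per discovered topic.
    terms = vectorizer.get_feature_names_out()
    for t, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-5:][::-1]]
        print(f"topic {t}: {top}")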
2.1.3 Sentiment Analysis. The sentiment behind utterances is important for summarization, especially when the corpus consists of reviews of any kind. Turney [47] highlighted that reviews may or may not recommend a certain product, movie, or travel destination; a summary should therefore account for both positive and negative reviews. Turney's study was the first to perform sentiment analysis on text-based reviews, based on AltaVista internet search queries measuring how near the phrases were to the word 'excellent' (for recommended) and the word 'poor' (for not recommended). Manually created lexica of words that express sentiment have been used in conjunction with fuzzy logic, vector distance, etc. to classify positive and negative text [12, 45]. In the same context, there has been extensive work on extracting opinion words which express subjective opinions within sentences [49]. It has been found that subjective sentences are statistically correlated with the presence of adjectives [49], and much research in product review summarization uses adjectives to determine sentiment polarity. For instance, Hu et al. [24] used a frequency-based algorithm to find relevant domain features, and then extracted adjectives near such domain features. Using a labeled set of adjectives and expanding the initial set via WordNet, Hu et al. classified the extracted adjectives' polarity and assigned that positive or negative sentiment to the nearby domain feature. The SentiWordNet database is constructed on the same principles as the domain-specific adjective classification of [24], using a manually annotated set of seed words and WordNet term relationships to expand the training set, which is then used as the ground truth for machine learning classifiers [2]. SentiWordNet, and similar general-purpose models for sentiment prediction [46], have been used for polarity detection in reviews, e.g. in [23, 40].

2.2 Steam Review Summarization
Since its 2003 release, the Steam platform has become the largest digital distribution platform for PC gaming [15], hosting over 34,000 games and tens of millions of active users daily. This paper focuses on user-created reviews on Steam, although other initiatives such as the Steam Workshop allow users to upload their mods or strategies and comment on others' content. User reviews can be submitted only by people that have purchased the game from Steam, although they are visible to all. As noted above, Steam aggregates user reviews into a category and provides a number of companion statistics, including a timeline of reviewers' scores. Reviews themselves consist of a single binary recommendation (Recommended versus Not Recommended) and a text explaining the user's opinion. Other users can review the quality of the review itself by tagging it helpful, not helpful, funny, or breaking the Rules of Conduct. By default, Steam shows the most helpful reviews submitted within the last 30 days, although users can also choose to sort reviews by other criteria.

As noted in the introduction, there is no systematic academic research in Steam review summarization. To the best of our knowledge, the only academic publication that tackles the problem of aspect-based summarization on such data is by Yauris and Khodra [50]. In their approach, only relevant portions of sentences were extracted via conditions applied on text tagged with parts of speech; these portions were usually small, e.g. the phrase could be "amount of content" [50]. Similar to our approach, a pre-specified set of keywords is used for aspect categorization. The aspects and keywords are similar but not identical to ours (e.g. the aspects in [50] are gameplay, story, graphic, music, community, and general/others), while the aspect described in a phrase was chosen based on the cosine similarity of each word of the phrase to the aspect's keywords. The output summary consists of many aspects (most of which are outside the pre-specified keywords) and a single adjective for each, unlike our current work which extracts complete sentences with different polarities. The summarization pipeline was tested on a single game (Skyrim), exploring different sentiment extraction approaches using precision and recall as performance metrics. While our current work does not explore as many parameters for sentiment analysis, it is the first instance where game review summaries are evaluated by humans in a small-scale but thorough user study.

3 SUMMARIZATION PIPELINES
Figure 1 visualizes the main components of our pipeline:

Preprocessing, which aims to prepare the input reviews for further analysis. This may imply cleaning, chunking text in snippets or sentences, part-of-speech tagging, and other similar tasks.
Aspect Identification, which identifies interesting aspects (or topics) in the reviews. These topics may be expressed as a set of words, e.g. "visual, aesthetic, scenery" or "soundscape, audio experience, sound effects".
Aspect Labeling, which assigns clear, descriptive labels to the discovered aspects, e.g. "graphics" or "audio".
Sentiment Analysis, which gathers information related to the sentiment expressed within the reviews. This information may later be used to update the final summary appropriately. For example, one may need only positive views in the summary, or, most probably, a sampling of all the views, be they positive or negative.
Summary Creation, which denotes the process that, given all the information gathered in the previous steps, forms and renders the final summary for the user.

Given the above pipeline, we implemented three different variants. The first two are based on keyword detection and Clustering (CL). The first variant does not perform Sentiment Analysis, while the second uses the full pipeline. The third is another full-pipeline method, based on Deep Learning (DL), that focuses on improving the Aspect Labeling and Summary Creation steps.
[Figure 1: the five pipeline stages, applied per aspect (e.g. Community, Graphics, Gameplay): 1 Review Preprocessing, 2 Aspect Identification, 3 Aspect Labeling, 4 Sentiment Analysis, 5 Summary Creation.]
Figure 1: The full pipeline represents both the Clustering variant (CL Full) and the Deep Learning variant (DL Full), while
variant CL AsDe produces summaries by skipping the Sentiment Analysis step.
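To make the division of work between the stages of Figure 1 concrete, the following skeleton sketches them as plain Python functions. The function names, signatures, and stub bodies are illustrative placeholders invented for this sketch; they are not the implementation used in this paper.

    # Illustrative skeleton of the five pipeline stages; names and stub bodies are
    # placeholders chosen for this sketch, not the authors' actual code.
    from typing import Dict, List

    def preprocess(reviews: List[str]) -> List[str]:
        """Split reviews into cleaned sentences (stage 1)."""
        return [s.strip() for r in reviews for s in r.split(".") if s.strip()]

    def identify_aspects(sentences: List[str]) -> Dict[str, List[str]]:
        """Group sentences into aspect-related sets (stage 2)."""
        return {"aspect_0": sentences}  # placeholder grouping

    def label_aspects(aspect_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
        """Attach human-readable labels such as 'graphics' or 'audio' (stage 3)."""
        return {f"label_{k}": v for k, v in aspect_sets.items()}

    def analyze_sentiment(sentence: str) -> float:
        """Return a polarity score in [-1, 1] (stage 4); constant here as a stub."""
        return 0.0

    def create_summary(labeled: Dict[str, List[str]]) -> Dict[str, List[str]]:
        """Pick a handful of sentences per labeled aspect (stage 5)."""
        return {label: sents[:3] for label, sents in labeled.items()}

    if __name__ == "__main__":
        reviews = ["Great soundtrack. Terrible server lag.", "Gorgeous visuals."]
        summary = create_summary(label_aspects(identify_aspects(preprocess(reviews))))
        print(summary)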

3.1 CL pipeline
During the preprocessing step, each review is split into sentences and each sentence is cleaned, in order to create the basic elements on which the final summaries will be based. The cleaning process consisted of some character replacements so that each sentence could be presentable (e.g. starting with a capital letter and ending with a period) even if it originated from a larger sentence that was split during sentence splitting. Moreover, preprocessing prepared the lemmatized versions of the sentences, which are used for aspect detection. In these lemmatized sentences, general stop words are removed. For all preprocessing steps, we used the default functions (and stop word lists) of the nltk Python library [5].

The aspect detection process is split into two parts: aspect identification and aspect labeling. Aspect identification splits sentences into sets that focus on a specific aspect, while aspect labeling identifies this aspect in order to present it in the final review summary.

Our approach uses a predefined set of aspects, presented in Table 1. We selected these six aspects since they are well-established facets of games [28] and are popular dimensions within professional reviews.

Aspect      | Keywords
Graphics    | graphic, visual, aesthetic, animation, scenery
Gameplay    | mission, item, map, weapon, mode, multiplayer, control
Audio       | audio, sound, music, soundtrack, melody, voice
Community   | community, toxic, friendly
Performance | server, bug, connection, lag, latency, ping, crash, glitch, optimization
Story       | dialog, romance, ending, cutscene, story

Table 1: Aspects and keywords used for the identification of dominant aspects in review clusters.

A simple approach for aspect labeling is to use a dictionary of keywords per aspect, such as the ones presented in Table 1. In order to include sentences even when they do not contain the exact keywords, k-means clustering is applied to all sentences to find clusters with similar text. Terms are weighted based on their frequency via tf-idf, which has been used extensively for sentence similarity in bag-of-words approaches (see Section 2.1). The result is K clusters of sentences with similar words to each other; in all our experiments we set K = 20 based on prior evidence [35]. Once all sentences are assigned to a cluster based on their distance to the center, all sentences in all clusters are processed in the following fashion:

(1) If the sentence contains the exact keywords of only one aspect, the sentence is assigned to that aspect and is flagged as a candidate that can be used by the summary of that aspect.
(2) If keywords from multiple aspects are found in the sentence, the sentence is flagged as an unsuitable candidate for any summary and removed.
(3) If no aspect keywords are found in the sentence, the most common aspect within the sentences of the same cluster will be used to label this sentence and flag it as a candidate. For instance, if a sentence does not contain any keyword, but sentences in its cluster predominantly belong to the aspect Gameplay via case (1), then the sentence is also assigned to that aspect and flagged as a candidate.

Using the sentences from cases (1) and (3), a set of candidate sentences is created per aspect. Using these sets, the first variation of our pipeline can already produce a summary. This variation, named Clustering Aspect Detection summary (CL AsDe), chooses N sentences at random from each aspect's set. A sample CL AsDe summary can be found in Table 2 for Tom Clancy's The Division.

The next step of the process is Sentiment Analysis, which is used by the next summarization variant (CL Full). Using the different sets of candidate sentences per aspect, the sentiment polarity (positive or negative) of each sentence is calculated by averaging the sentiment score of each word it contains. As above, sentiment analysis of each word is done via the default functions of the nltk Python library [5]. The library calculates probabilities for each polarity class (positive, neutral, negative). We took into account sentences which were assigned a class with a probability of at least 0.5. In order to select a number of sentences per category, a k-means clustering approach (using tf-idf) is applied within the set of sentences with the same polarity. In the CL Full implementation of this paper, only two sentences per polarity are selected (k = 2), as the ones closest to each cluster's centroid. If there exist sufficient positive and negative sentences, this approach returns 6 sentences as bullet points. Note that if fewer than two sentences are above the threshold for positive (or below the threshold, for negative), then fewer sentences may be included in the summary. An example summary from the CL Full variant can be found in Table 2 for Tom Clancy's The Division.
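A minimal sketch of the labeling logic in cases (1)-(3) is given below, using scikit-learn's TfidfVectorizer and KMeans as stand-ins for the tf-idf clustering described above. The helper names, the reduced keyword dictionary, and the parameters are illustrative assumptions, not the authors' actual code.

    # Sketch of keyword-based aspect labeling with a cluster-majority fallback.
    # Helper names and data are illustrative; not the authors' actual code.
    from collections import Counter
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    ASPECTS = {
        "Graphics": {"graphic", "visual", "aesthetic", "animation", "scenery"},
        "Audio": {"audio", "sound", "music", "soundtrack", "melody", "voice"},
    }

    def keyword_aspects(sentence: str) -> set:
        words = set(sentence.lower().split())
        return {a for a, kws in ASPECTS.items() if words & kws}

    def label_sentences(sentences, n_clusters=20):
        vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        n_clusters = min(n_clusters, len(sentences))
        clusters = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)

        labels, cluster_votes = {}, {c: Counter() for c in set(clusters)}
        for i, s in enumerate(sentences):
            hits = keyword_aspects(s)
            if len(hits) == 1:                  # case (1): single-aspect keyword hit
                labels[i] = next(iter(hits))
                cluster_votes[clusters[i]][labels[i]] += 1
            elif len(hits) > 1:                 # case (2): ambiguous, discard
                labels[i] = None
        for i, s in enumerate(sentences):
            if i not in labels:                 # case (3): inherit the cluster majority
                votes = cluster_votes[clusters[i]]
                labels[i] = votes.most_common(1)[0][0] if votes else None
        return labels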
3.2 DL pipeline
After experimenting with the first two variant pipelines and taking into account the feedback of the first user study (see Section 5), we decided to focus on improving the following:

• Keyword detection and clustering-based Aspect Labeling must be improved, to avoid sentences such as "If those things all sound good to you you will like the game." being labeled as audio sentences.
• The final summary should somehow provide information regarding the overall sentiment of the given aspect, and not just the sentiment of the selected sentences.
• The final summary should use a better sentence extraction approach in order to deal with redundancy.

Taking all the above into account, the DL pipeline makes changes to the Aspect Detection and Summary Creation steps of the CL pipeline described in Section 3.1.

For Aspect Detection, we used the BERT model [14] to generate embeddings for game reviews. BERT is a deep neural language model that uses a bidirectional, multilayer transformer architecture, exploiting cross- and self-attention to capture word interdependencies effectively [3, 48]. The approach relies on multi-head attention modules for sequence encoding, with word order information being retained via additive positional encoding vectors. BERT is trained in an unsupervised setting on large quantities of English text, using masked language modelling and next sentence prediction objectives. These tasks require the prediction of hidden sequence tokens and the generation of an entire sequence, given an input sequence (e.g. for tasks such as question answering, text entailment, etc.). This pretraining scheme and architecture have been shown to perform exceptionally well for a variety of natural language understanding tasks.

To obtain the representation for a game review, we feed the text to the model using a sequence length of 16 tokens. We use the BERT-BASE model variant, which produces 768-dimensional sequence embeddings, learned during training for classification purposes. The implementation and pretrained model utilized are provided by the transformers software package from huggingface (https://huggingface.co/). Using the produced embeddings as features, we trained a binary Ridge Logistic Regression classifier [19] (one-vs-all) for each aspect. We also trained a seventh classifier to detect sentences unfit for any aspect. For each candidate sentence, a confidence score was calculated by each aspect classifier. Only sentences with a high prediction confidence for the given aspect and a low confidence on every other classifier were selected as summary candidates for the next steps of the pipeline.

During Summary Creation we applied the following strategy to the 100 most probable candidate sentences of each aspect. First, the NewSum Toolkit [20] was used to select the sentences that provide the most representative information. NewSum uses language-agnostic methods based on n-gram graphs that not only extract the most representative sentences, but also deal with redundancy. In the end we had 20 candidate sentences per aspect. The final summary was composed of 6 sentences, using the following strategy:

• Select the most positive sentence (Sentiment Analysis).
• Select the most negative sentence (Sentiment Analysis).
• Select the first 3 sentences provided by the NewSum Toolkit (excluding the previously selected sentences).
• Create an artificial sentence using the polarities provided by Sentiment Analysis over all the aspect's sentences. The polarity of each sentence was mapped to 1, 0 or -1 (positive, neutral, negative) using thresholds. Given an aspect and the mean polarity score P̄, the possible produced sentences reflect opinions that fall in the following categories:
  - Mixed: P̄ ≈ 0, high standard deviation.
  - Mostly neutral: P̄ ≈ 0, low standard deviation.
  - Mostly positive: P̄ > 0, above a threshold.
  - Mostly negative: P̄ < 0, below a threshold.

The final summary is composed by randomly shuffling these 6 sentences. An example summary from the DL Full variant can be found in Table 2 for Tom Clancy's The Division.

- In a few words the game is single dimensional this might sound vague but it becomes apparent that there is not much depth as you play once you're a couple hours in.
- Clothes sound "right" when you move in them.
- They sound good and looked good with ability to mod for better stats or even rerolling stats.
- They have improved the pve portion of the game and crazy as it sounds the pvp too.
- No music and something feels so strangely abandnded about it.
- Like how if there's a blizzard your cap and shoulder will be covered in snow and that npc voices will echo when they are standing in hallways with hollow walls.
- Very good voice acting.
- Great abilities pretty good sounds; indoor echos reverb off objects etc.
- If those things all sound good to you you will like the game.
- Superb voice acting and ambient city sounds are also a good plus for this game.
- It sounds hyperbolic but I'm being dead serious.
- Sounds terrible right
- Most opinions are positive regarding audio.
- The voice acting in the game is in the higher tiers as is most ubisoft games.
- There are not a lot of different voices and some of the voice acting for them is bad.
- Ubisoft - bugs - the textures are so fucked up that nobody can play this game anymore.
- And it clearly shows I want to play it and that I try to.
- I'm gonna be honest the cinematics are pretty great.

Table 2: Summaries generated by different pipelines, for aspect Audio of Tom Clancy's The Division. From top to bottom: CL AsDe (only aspect detection), CL Full (aspect detection with sentiment analysis) and DL Full (Deep learning combined with a sophisticated summarizer).
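The aspect classification step of the DL pipeline can be approximated with off-the-shelf components, as in the sketch below: BERT-base embeddings from the transformers package feeding a one-vs-all logistic regression with an L2 (ridge) penalty per aspect. The checkpoint name, mean pooling, and toy training labels are assumptions made for this example; the paper does not report these details.

    # Sketch: BERT embeddings + one-vs-all L2-regularized logistic regression per aspect.
    # Checkpoint, pooling and toy labels are illustrative assumptions.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentences, max_length=16):
        """Return one 768-d vector per sentence (mean-pooled token embeddings)."""
        batch = tokenizer(sentences, padding=True, truncation=True,
                          max_length=max_length, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**batch).last_hidden_state        # (batch, seq, 768)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    # Toy training data: 1 = sentence belongs to the 'Audio' aspect.
    sentences = ["Superb voice acting and ambient city sounds.",
                 "The servers crash constantly.",
                 "The soundtrack is forgettable.",
                 "Gorgeous lighting and textures."]
    labels = [1, 0, 1, 0]

    audio_clf = LogisticRegression(penalty="l2", max_iter=1000)
    audio_clf.fit(embed(sentences), labels)

    # Confidence scores for new candidate sentences; in the paper only sentences with
    # high confidence for one aspect and low confidence for all others are kept.
    print(audio_clf.predict_proba(embed(["No music and the city feels empty."])))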
4 DATASET
As a first demonstration of the summarization pipeline, we follow [35] and select the most helpful reviews on Steam, splitting them per game. This paper parses the Steam review dataset gathered by Zuo [55], which consists of over 7 million reviews obtained via Steam's API. Each review text comes with a plethora of features concerning both the game being reviewed and the reviewer, although only a subset of features is used for this experiment. Since Steam users can vote a review as helpful, unhelpful, or spam, we only consider as 'valid' those reviews with 10 or more 'helpful' user votes. With this criterion (and a minimum of at least 1,000 'helpful' reviews), we select the twelve games with the most valid reviews (see Table 3). The games selected have a desirable diversity both in terms of genres (shooting, survival, adventure, open-world, multi-player, single-player, etc.) and in terms of general audience reception (shown by the Metacritic score, which aggregates professional and users' reviews).

For each of the selected games we kept the 10 thousand most up-voted reviews. As already discussed in Section 3, each of these reviews was split into sentences to create a sentence pool per game. On average, the sentence pool consisted of around 50 thousand sentences per game. The smallest pool of sentences was for PAYDAY 2 (37K), while the largest one was for Elite Dangerous (70K). The average length of the sentences was 85.7 characters and 16.4 words. In terms of both characters and words, the longest sentences were those of Darkest Dungeon (an average of 91.8 characters and 17.3 words) and the shortest ones were those of Just Survive (an average of 79.9 characters and 15.6 words).

In terms of aspects, the most common one on average was Gameplay. Performance was the next most popular aspect, and in certain games such as ARK: Survival Evolved it was the most popular one. The least popular aspect was Audio, with a ratio of 1 to 5 compared to the Gameplay aspect.

In terms of sentiment, the majority of sentences were neutral rather than positive or negative. Between positive and negative sentiment, no general safe conclusions can be drawn, since the results varied given different combinations of aspects and games. In general, we can say that the aspect Performance was characterized as negative more frequently. The opposite was true for the aspect Graphics. On the other hand, the sentiment ratio (positive vs. negative) towards the aspect Community varied between games.

Game Title                  | Publisher                     | Year | Reviews | MC
No Man's Sky                | Hello Games                   | 2016 | 4146    | 61%
DayZ                        | Bohemia Interactive           | 2018 | 3349    | –
PAYDAY 2                    | Starbreeze                    | 2017 | 2573    | 79%
ARK: Survival Evolved       | Studio Wildcard               | 2017 | 2368    | 70%
Grand Theft Auto V          | Rockstar Games                | 2015 | 2104    | 96%
Firewatch                   | Campo Santo                   | 2016 | 1599    | 81%
Darkest Dungeon             | Red Hook Studios              | 2016 | 1564    | 84%
Just Survive                | Daybreak Game Company         | 2015 | 1463    | –
Killing Floor 2             | Tripwire Interactive          | 2016 | 1276    | 75%
Elite Dangerous             | Frontier Developments         | 2015 | 1270    | 80%
Tom Clancy's 'The Division' | Ubisoft                       | 2016 | 1091    | 79%
Subnautica                  | Unknown Worlds Entertainment  | 2018 | 1056    | 87%

Table 3: Games selected from the dataset, sorted by the number of 'valid' reviews (10 or more 'helpful' votes). The Metacritic score (MC) is included for reference.
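The review selection described in this section can be sketched as follows, assuming a pandas DataFrame with hypothetical column names (game, helpful_votes, review_text); the actual fields of the Zuo dataset may differ.

    # Sketch of the dataset filtering: 'valid' reviews have >= 10 helpful votes,
    # and the 10,000 most up-voted reviews per game are split into sentences.
    # File name and column names are hypothetical.
    import pandas as pd
    from nltk.tokenize import sent_tokenize  # requires nltk's 'punkt' data

    reviews = pd.read_csv("steam_reviews.csv")

    valid = reviews[reviews["helpful_votes"] >= 10]

    # Keep the 10,000 most up-voted valid reviews per game.
    top_reviews = (valid.sort_values("helpful_votes", ascending=False)
                        .groupby("game")
                        .head(10_000))

    # Build a sentence pool per game.
    sentence_pools = {
        game: [s for text in group["review_text"] for s in sent_tokenize(str(text))]
        for game, group in top_reviews.groupby("game")
    }
    print({g: len(p) for g, p in sentence_pools.items()})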
5 FIRST USER STUDY
As a first experiment, we evaluated the two variations of the CL pipeline (CL AsDe and CL Full) in a small-scale user study with summaries of aspects of the 12 games of Table 3.

5.1 Annotation Protocol
A pairwise comparison process was followed, rather than a scale-based rating approach, due to (a) evidence that comparison-based evaluation can be less demanding cognitively [9] and (b) a rich body of literature that has applied pairwise evaluation for summarization tasks [34] (e.g. the single-document summarization task in [21]).

To this end, we created an online evaluation user interface (UI) (see Figure 2) which supported comparative pairwise evaluation of summaries. We initialized the system by providing two sets of summaries A, B, one from system A and one from system B. Each summary in A corresponded to a summary in B, as they both summarize the same set of reviews and the same aspect (e.g. the aspect Graphics of DayZ). During the experiment, each system's summary was randomly placed first or second to minimize any ordering-related bias.

Figure 2: User interface for online evaluation of summaries produced by CL AsDe and CL Full methods.

The UI also informed the user of the title of the game being summarized, plus the aspect (e.g. Graphics). The user was then asked to select their preferred summary (A or B) and explain the reasons for this preference. For the latter annotation, the user could select one or more tickboxes among the following options:

• It repeats the same information less (Less Redundant)
• It seems to be more coherent and/or complete
• For other (or even unclear) reasons

The first two options aim to assess whether redundancy is a concern and, similarly, whether coherence and completeness are useful in the task. Redundancy has traditionally been a summarization evaluation indicator [1], especially in multi-document summarization. The completeness and coherence aspect is essentially a (more nuanced) version of overall responsiveness, as this has been used in DUC/TAC summarization tracks and related work [10].

5.2 Participants
The evaluation was carried out by eight adult evaluators (3 female), fluent in English, with gaming experience. The evaluators were selected explicitly among the authors' network of contacts and invited directly by the authors. Participants were asked to connect to the online system and evaluate all 72 pairs of summaries (produced by CL AsDe and CL Full), which covered all predefined aspects (see Table 1) of the twelve games in Table 3. There was no time limit for completing the evaluation, but there was a requirement that all pairs were evaluated in a single session.

5.3 Results
The data collected from the experiment was a total of 576 observations, including the preference of each evaluator for each pair of summaries and the reasons for this choice. The primary goals of the user study are to assess (a) whether the annotators prefer one of the two summarization approaches, and (b) which criteria they explicitly (via the three tickboxes) or implicitly (based on properties of the summary) consider when selecting their preference. Towards this end, the data is processed based on the 8 users' annotations on 72 game/aspect pairs (for a total of 576 data points), and all statistical tests are performed at a 5% significance threshold. Our assumption is that the complete CL pipeline, which includes both aspect detection and sentiment analysis, will offer a richer and more diverse summary than AsDe alone.

Regarding users' preference of one summarization technique, results were mixed: overall, annotators had no clear preference, with CL AsDe being marginally more often selected (53%). Table 4 shows the distribution of selections split per aspect. The table shows that the main factor for the skew of the overall preference towards CL AsDe was the graphics summaries, as the other aspects are fairly evenly preferred between the two approaches.

Aspect      | CL Full | CL AsDe
Audio       | 43%     | 57%
Community   | 51%     | 49%
Gameplay    | 55%     | 45%
Graphics    | 30%     | 70%
Performance | 54%     | 46%
Story       | 49%     | 51%
Overall     | 47%     | 53%

Table 4: First user study: annotators' preference of one summarization algorithm over the other, per aspect and overall.

To further assess which factors led to the annotators' preference of one summary over the other, we conducted an analysis of variance (ANOVA) test between the preferred approach (represented as a binary choice) and other features such as the aspect. Table 5 shows the results in terms of significant differences, and verifies that there is a systematic influence between the aspect and preference. On the other hand, the game does not seem to affect users' preference of one summary or the other; this is a promising finding as the methods are supposed to be applicable to any game. There is also clear evidence that preference varied strongly from annotator to annotator, and annotators rarely agreed with each other even in this simple pairwise preference task.

Factor     | Df | F value | p value
game       | 11 | 1.519   | 0.120
aspect     | 5  | 3.912   | 0.001 *
evaluator  | 6  | 7.945   | 0.000 *
coherence  | 1  | 18.6491 | 0.000 *
redundancy | 1  | 5.7604  | 0.017 *
other      | 1  | 0.5639  | 0.453

Table 5: Analysis of variance between the preference of one approach and different factors. Significant findings are shown with an asterisk. The analysis is made on the F statistic and the degrees of freedom (Df) are also noted.
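For reference, an analysis like the one reported in Table 5 can be reproduced along the lines of the sketch below, which runs statsmodels on randomly generated stand-in data; the column names and generated values are placeholders, not the study's actual annotations.

    # Sketch of an ANOVA over binary preference vs. study factors (stand-in data).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    n = 576  # 8 evaluators x 72 game/aspect pairs
    df = pd.DataFrame({
        "preference": rng.integers(0, 2, n),
        "game": rng.choice([f"game_{i}" for i in range(12)], n),
        "aspect": rng.choice(["Audio", "Community", "Gameplay",
                              "Graphics", "Performance", "Story"], n),
        "evaluator": rng.choice([f"user_{i}" for i in range(8)], n),
        "coherence": rng.integers(0, 2, n),
        "redundancy": rng.integers(0, 2, n),
        "other": rng.integers(0, 2, n),
    })

    # Fit a linear model of preference on the factors and report the F statistics.
    model = ols("preference ~ C(game) + C(aspect) + C(evaluator)"
                " + coherence + redundancy + other", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))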
To get a better understanding of the reasons annotators gave for their preference, we looked further into the statistics of the winning observations of CL AsDe vs. CL Full. When CL AsDe was preferred, annotators explained their preference mainly via better coherence (63%), lower redundancy (28%), but also 'other reasons' (26%). When CL Full was preferred, annotators chose 'other reasons' (50%), and less often coherence (41%) or low redundancy (17%). This finding shows that summaries by CL AsDe were more coherent, but annotators still often preferred summaries by CL Full for other reasons. This points to a limitation of the experimental protocol, as the interface did not provide annotators with enough options to explain the reasons for their summary preference. This was addressed in the second user study (see Section 6) with an extra option on the UI. It should be noted that better coherence was selected far more often overall (53% of instances) than lower redundancy (23%), while 'other reasons' were also chosen often (37%). Redundancy and coherence were chosen together in only 5% of instances, and thus it is evident that these two axes of evaluation are fairly independent. These findings, coupled with the statistically significant influence (via ANOVA) between the preference of summarization approach and tagged coherence and redundancy, support our conclusion that both coherence and redundancy were important factors for annotators' preference.

6 SECOND USER STUDY
Based on the findings and limitations identified in the first user study, we conducted a second study with more participants but fewer games, testing the best CL approaches against the novel DL Full pipeline. Due to participants' concerns about the long duration of the 72-item survey in the first experiment, we opted to use only two games to lower the time required from annotators, as fatigue would likely introduce noise to the participants' responses. Details on how the games and annotation options were chosen are given in Section 6.1.

6.1 Annotation Protocol
The user interface for the second user study was largely the same as in the first (see Section 5.1). Based on the first study's finding that 'other reasons' for an annotator's preference were often chosen, a fourth option was added to the UI as a tickbox stating "The summary was more focused and contained less irrelevant information." We refer to this additional option as Focus in the analysis that follows.

As noted above, to reduce the time required for the study, only two games were chosen to be annotated. We chose among the games from the first user study, taking the game where CL Full had the highest preference (Tom Clancy's The Division, where CL Full was chosen 60% of the time) and the game where CL AsDe had the highest preference (Elite Dangerous, where CL AsDe was chosen 60% of the time). For each of the two games, the preferred method was chosen to present to the user, juxtaposed with the summary for the same game and aspect produced by DL Full. Therefore, the participant had to annotate 12 items: 6 aspects for Tom Clancy's The Division comparing the CL Full summary with the DL Full summary, and 6 aspects for Elite Dangerous comparing the CL AsDe summary with the DL Full summary. The rationale was to select the most successful game summaries (for both CL variants) and compare them with the novel DL pipeline.
them with the novel DL pipeline. We refer to CL and DL summaries           of pre-specified game facets. two small-scale user surveys exam-
in this paper, referring to the best CL summary (CL Full or CL AsDe)       ined the preference of users in the presence of different pipeline
as shown to the user.                                                      implementations. Results indicate that (a) aspect extraction is im-
   As with the first user study, the order of the two options was          portant for summarization, although deep-learning does not neces-
randomized (i.e. sometimes CL summaries were shown first, some-            sarily improve the aspect extraction process compared to a simpler
times second). Unlike the previous experiment, however, the order          clustering-based method; (b) between the clustering-based pipeline
of the sentences within the same summary was also randomized;              variants (CL AsDe, CL Full), there was no clear winner with respect
the rationale was to avoid ordering effects when the participant           to the summary outputs; (c) evaluators had strong and individual
starts by reading an incoherent sentence first.                            opinions on which variant was better; (d) sentiment-based crite-
                                                                           ria and/or confidence-based criteria for selecting sentences do not
                                                                           seem to perform better than the random selection performed by CL
6.2    Participants
                                                                           AsDe.
Fourteen participants completed this annotation task. Unlike the              While the aspects chosen for this experiment were intuitive,
previous study, a snowball method for soliciting participants was          based on typical facets of games that players and professional crit-
followed, soliciting feedback from a broader group. Thus, this study       ics focus on, some of the resulting aspect-based summaries were
lacks data on the demographics and gaming experience of partici-           less coherent than others. The choice to assign a sentence to an
pants, although participants were all adults and had experience in         aspect even if its cluster only had a slim majority in keyword fre-
data analysis and artificial intelligence.                                 quency likely introduced inconsistency. For CL aspect detection,
                                                                           the most significant factor for the lack of coherence was the choice
6.3    Results
The data collected from the experiment comprised a total of 168 observations. Overall, CL summaries were slightly more preferred by participants (55%), although the difference is not statistically significant (paired t-test, p-value 0.22). Interestingly, for Elite Dangerous (which was summarized by CL AsDe) the difference was more pronounced (CL AsDe preferred 60% of the time over DL Full); for Tom Clancy's The Division the two methods (CL Full and DL Full) were chosen evenly. Since only one game was tested per CL variant, it is difficult to assess whether the preference was due to the game itself or the sentiment-based selection component. Moreover, while DL Full includes sentiment-based selection, this part accounts for 2 of the 6 sentences and thus it is even more difficult to estimate the reasons for the users' preference. This ambiguity points to further refinements needed for the annotation protocol, which are discussed in Section 7.
   In terms of the reasons offered by participants for their choice, coherence was still the most commonly chosen (62% of responses), followed closely by focus (56%). Low redundancy was chosen less often (23%), while 'other reasons' were chosen in only 14% of responses. The addition of the focus option seems to have mitigated the prevalence of 'other reasons' observed in the first study. Unlike the first study, however, low redundancy was often chosen in conjunction with one other reason (56% of the time) or two other reasons (36% of the time). Combined with its low overall prevalence, it is possible that low redundancy may no longer be necessary as a separate reason in the UI, although a broader user study with more games is needed to validate this hypothesis.
   Pearson's Chi-squared tests were also used in order to test whether any of the above reasons is correlated with the preferred summary. Only redundancy was found to be correlated with the type of summary (p-value 0.001). This clearly indicates the importance of handling redundancy satisfactorily in any future approach.
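For reference, such a test can be reproduced with a standard statistics library; the sketch below uses scipy with placeholder counts (not the counts collected in this study) purely to illustrate the procedure.

import numpy as np
from scipy.stats import chi2_contingency

# Placeholder 2x2 contingency table (illustrative only, NOT the study's counts):
# rows = whether "low redundancy" was given as a reason, columns = preferred summary.
table = np.array([[30,  9],    # low redundancy cited:     CL preferred, DL preferred
                  [62, 67]])   # low redundancy not cited:  CL preferred, DL preferred

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")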
7    DISCUSSION
This paper introduced a number of possible pipelines for identifying, grouping, and extracting the opinions of users in terms of pre-specified game facets. Two small-scale user surveys examined the preference of users in the presence of different pipeline implementations. Results indicate that (a) aspect extraction is important for summarization, although deep learning does not necessarily improve the aspect extraction process compared to a simpler clustering-based method; (b) between the clustering-based pipeline variants (CL AsDe, CL Full), there was no clear winner with respect to the summary outputs; (c) evaluators had strong and individual opinions on which variant was better; (d) sentiment-based criteria and/or confidence-based criteria for selecting sentences do not seem to perform better than the random selection performed by CL AsDe.
   While the aspects chosen for this experiment were intuitive, based on typical facets of games that players and professional critics focus on, some of the resulting aspect-based summaries were less coherent than others. The choice to assign a sentence to an aspect even if its cluster only had a slim majority in keyword frequency likely introduced inconsistency. For CL aspect detection, the most significant factor for the lack of coherence was the choice of keywords. Specifically, the keyword "sound" was often found in sentences unrelated to game audio, used as a verb: e.g. "On paper this game sounds great". To a degree, such artefacts were removed in the DL aspect detection pipeline via (a) the latent sentence representation and (b) fine-tuning the model based on manual annotations on this specific corpus. However, a more sophisticated method for aspect detection seems necessary. For instance, an adaptive query expansion as followed by [29] could create a much larger set of keywords automatically, although it may overlook the nuances of game terminology. On the other hand, a Word2Vec model [30] trained on the entire corpus of Steam reviews (or even larger game-related corpora such as game FAQs and fansites) could be used to derive a similarity score with specific aspects. Building a game ontology for this task or using an existing one [36, 39] could further assist in discovering more keywords or in calculating an ontology-based semantic similarity measure [42]. Finally, a completely different direction could see the discovery of topics specific to each game rather than focusing on the same pre-specified topics every time. This would be valuable as different genres have a different focus (e.g. multiplayer games focus on balance or lag, while horror games focus on the emotional response), but could make it difficult to maintain the same presentation format across games and thus confuse end-users.
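As a rough illustration of the Word2Vec-based alternative mentioned above, the sketch below uses gensim; the seed keyword lists and the tokenized_reviews variable are placeholders, not the keyword sets used by the CL pipeline.

from gensim.models import Word2Vec

# Sketch only: `tokenized_reviews` is assumed to be a list of token lists drawn
# from the Steam review corpus; the seed keywords below are illustrative.
model = Word2Vec(sentences=tokenized_reviews, vector_size=100,
                 window=5, min_count=5, workers=4)

ASPECT_SEEDS = {
    "audio":    ["sound", "music", "soundtrack", "audio"],
    "visuals":  ["graphics", "art", "visuals", "textures"],
    "gameplay": ["gameplay", "mechanics", "controls", "combat"],
}

def aspect_similarity(sentence_tokens, aspect):
    """Mean cosine similarity between the sentence tokens and the aspect's seed words."""
    seeds = [w for w in ASPECT_SEEDS[aspect] if w in model.wv]
    words = [w for w in sentence_tokens if w in model.wv]
    if not seeds or not words:
        return 0.0
    return float(model.wv.n_similarity(words, seeds))

A sentence could then be assigned to the aspect with the highest similarity, or left unassigned if no aspect exceeds a threshold.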
   Sentiment analysis was also often problematic, primarily due to the informal and idiosyncratic language in which game reviews are often written. Reviews are often rife with sarcasm and negation, e.g. "Have fun spending huge amounts of hours for very little progress." Moreover, many reviews' sentences have poor syntax and are very short or very long (e.g. "Good: + great aesthetic."). Sentiment analysis treated the sentence as a bag-of-words, exacerbating the problem. In general, sentiment analysis cannot capture negation or sarcasm and handles incomplete sentences poorly. Performance would likely be improved with a more appropriate pre-trained lexicon for informal utterances on the Social Web, such as SentiStrength [46] or other sentiment- and negation-aware approaches [22]. Alternatively, a custom classifier for sentiment analysis could be trained using text from a Steam review as input and the user's recommendation as polarity. Complementing the training set with experts' annotations could refine such a model, especially when dealing with sarcasm. Another promising alternative to SentiWordNet for sentiment analysis would be the use of an authored dictionary of opinion words [24] or game-specific adjectives annotated in terms of polarity [51].
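A minimal sketch of the custom classifier mentioned above, assuming the review texts and their thumbs-up/down recommendation flags are available as parallel lists (the variable names are ours, not part of the pipeline described in this paper):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# `reviews` is a list of review texts, `recommended` the matching list of
# Steam thumbs-up/down flags (True/False); both are assumed, not shown here.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, recommended, test_size=0.2, random_state=42)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True),
    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

The predicted probability of a positive recommendation could then serve as a sentence-level sentiment score once reviews are split into sentences, although the label noise introduced by applying whole-review labels to individual sentences would need to be handled.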
   Our findings also showed no clear winner between the two CL variants or between CL and DL summaries. This ambiguity in the findings could well be a by-product of the experimental protocol followed. Findings from the first user study pointed to a missing reason for players to report, and the second study included a "focus" reason which improved the quality of the data collected but raised questions about the importance of the "low redundancy" reason. The fatigue reported by users in the first experiment led to fewer items in the second study to alleviate the burden on annotators. However, this increased the locality of the findings in the second study, as it was unclear whether preferences were due to the game or the algorithm. In future studies, summaries for more games should be annotated by more participants, showing only two games to each user but randomizing which games are shown when the user starts the study. More importantly, the current experimental protocol forces participants to select one summary as preferred and provide at least one reason. The forced choice between two summaries does not allow the user to provide more nuanced feedback. A four-alternative forced-choice (4-AFC) with options "A", "B", "both A and B", "neither A nor B" would allow the user to point out cases where both summaries are equally good or equally bad. The fairly even split between the two alternatives in both user studies could be due to the fact that users considered some of the summaries shown equally bad and selected randomly. On the other hand, a 4-AFC questionnaire would likely need many more participants, since much of the data will be removed when no ranking is given. The need for more games, more annotation choices, and perhaps more algorithm variants (DL AsDe, for instance) points to the need for a large-scale user survey among the general gaming community, which will be performed in future work in this vein.
   As discussed in Section 1 and explored at a high level during the user study, game review summarization can be valuable both to consumers (players) and producers (game developers). However, each stakeholder has different priorities and will likely respond differently to different summary formats. The extractive summarization process was visualized as 'pure text' bullet points, which was not as engaging to either type of audience. It would be important to explore alternative visualizations for players and developers. For players, the summary could provide more structure (based on pre-specified game facets), focus more on the weights and scoring of each aspect (including visualizations such as pie charts), show only a few polar opposites in terms of review sentences, and perhaps cross-reference these findings with other games' review summaries. For developers, on the other hand, a bottom-up topic discovery would likely be beneficial in order to identify unexpected points of contention among users. Moreover, presenting the context of the reviewers' chosen sentences would also be valuable for designers, e.g. how many reviewers agree with or echo this comment, when this comment was made, and whether general sentiment has shifted since then. Such context can be important regarding the urgency of addressing certain concerns or to gauge whether patches and updates have improved reviewers' perception, not unlike Steam's use of most recent reviews.
   There are many directions for future research depending on the purpose of the game review summarization. As a tool for game evaluation, primarily targeted towards players or producers, the game's context is important in order to choose which reviews or topics to highlight. Additional research in this vein would need to find topics or patterns in similar games (e.g. of the same genre, publisher, or publication date) and then compare the current game's reviews in terms of those topics or against other games' reviews. User experience research would also be important to find how best to present such results, as interactive summaries where the user can zoom in and out of different games and/or different topics within games would make the summaries more intuitive and manageable. As a tool for game analysis, bottom-up probabilistic topic modelling [7] in games of the same genre could help identify design patterns [6] and players' expectations based on their repertoire [27]. As a tool for knowledge discovery, game reviews can serve as raw text or multi-modal corpora from which structured data can be automatically extracted as entities and relations [44], concept hierarchies [43, 52], or even a complete game ontology [32, 38].
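As an illustration of the bottom-up alternative, topics for a single genre could be discovered with a standard LDA implementation; the sketch below uses gensim with arbitrary parameter choices, and genre_reviews is an assumed list of tokenized reviews for games of one genre.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `genre_reviews`: assumed list of token lists, one per review, for a single genre.
dictionary = Dictionary(genre_reviews)
dictionary.filter_extremes(no_below=10, no_above=0.5)  # drop rare and ubiquitous tokens
corpus = [dictionary.doc2bow(tokens) for tokens in genre_reviews]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=12, passes=10)
for topic_id, words in lda.show_topics(num_topics=12, num_words=8, formatted=False):
    print(topic_id, [w for w, _ in words])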
8    CONCLUSION
This paper highlighted the challenges and opportunities of game review summarization via natural language processing. The paper introduced a pipeline for grouping Steam users' comments into pre-specified aspects such as visuals or performance, and studied different renderings of the final summary, exploiting positive and negative sentences based on sentiment analysis. The small-scale user surveys revealed differences in how different annotators assess the reviews, highlighted possible foci of research for better game review summarization systems, and suggested a number of refinements to the process in this promising subfield of game artificial intelligence.
REFERENCES
[1] Rasim M Alguliev, Ramiz M Aliguliyev, Makrufa S Hajirahimova, and Chingiz A Mehdiyev. 2011. MCMR: Maximum coverage and minimum redundant text summarization model. Expert Systems with Applications 38, 12 (2011), 14514–14522.
[2] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the International Conference on Language Resources and Evaluation, Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias (Eds.).
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat] (2014).
[4] BBC News. 2019. Gaming worth more than video and music combined. https://www.bbc.com/news/technology-46746593. Accessed 26 January 2020.
[5] Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.
[6] Staffan Bjork and Jussi Holopainen. 2004. Patterns in Game Design. Charles River Media.
[7] David M. Blei. 2012. Probabilistic Topic Models. Communications of the ACM 55, 4 (2012), 77–84.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[9] Andrew P Clark, Kate L Howard, Andy T Woods, Ian S Penton-Voak, and Christof Neumann. 2018. Why rate when you could compare? Using the "EloChoice" package to assess pairwise comparisons of perceived physical strength. PloS one 13, 1 (2018).
[10] Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 Update Summarization Task. In Proceedings of the Text Analysis Conference.
[11] Kareem Darwish, Walid Magdy, and Tahar Zanouda. 2017. Trump vs. Hillary: What Went Viral During the 2016 US Presidential Election. In Proceedings of the International Conference on Social Informatics. Springer International Publishing, 143–161.
[12] Sanmay Das and Mike Y. Chen. 2001. Yahoo! for Amazon: extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific finance association annual conference.
[13] Sebastian Deterding, Dan Dixon, Rilla Khaled, and Lennart Nacke. 2011. From Game Design Elements to Gamefulness: Defining "Gamification". In Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments. 9–15.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[15] Cliff Edwards. 2013. Valve Lines Up Console Partners in Challenge to Microsoft, Sony. https://www.bloomberg.com/news/articles/2013-11-04/valve-lines-up-console-partners-in-challenge-to-microsoft-sony. Accessed 26 January 2020.
[16] Eslam Elsawy, Moamen Mokhtar, and Walid Magdy. 2014. TweetMogaz v2: Identifying News Stories in Social Media. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management.
[17] Entertainment Software Association. 2018. Essential Facts About the Computer and Video Game Industry report. https://www.theesa.com/wp-content/uploads/2019/03/ESA_EssentialFacts_2018.pdf. Accessed: 5 Sep 2019.
[18] Angela Fan, David Grangier, and Michael Auli. 2017. Controllable Abstractive Summarization. In Proceedings of the ACL Workshop on Neural Machine Translation and Generation.
[19] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
[20] George Giannakopoulos, George Kiomourtzis, and Vangelis Karkaletsis. 2014. NewSum: "n-gram graph"-based summarization in the real world. In Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding. IGI Global, 205–230.
[21] George Giannakopoulos, Jeff Kubina, John Conroy, Josef Steinberger, Benoit Favre, Mijail Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. MultiLing 2015: multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 270–274.
[22] Maria Giatsoglou, Manolis G Vozalis, Konstantinos Diamantaras, Athena Vakali, George Sarigiannidis, and Konstantinos Ch. Chatzisavvas. 2017. Sentiment analysis leveraging emotions and word embeddings. Expert Systems with Applications 69 (2017), 214–224.
[23] Hongyu Han, Yongshi Zhang, Jianpei Zhang, Jing Yang, and Xiaomei Zou. 2018. Improving the performance of lexicon-based review sentiment analysis method by reducing additional introduced sentiment bias. PLOS ONE 13 (2018).
[24] Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 168–177.
[25] Ya-Han Hu, Yen-Liang Chen, and Hui-Ling Chou. 2017. Opinion mining from online hotel reviews – A text summarization approach. Information Processing & Management 53, 2 (2017), 436–449.
[26] San-Yih Hwang, Chia-Yu Lai, Jia-Jhe Jiang, and Shanlin Chang. 2014. The identification of Noteworthy Hotel Reviews for Hotel Management. Pacific Asia Journal of the Association for Information Systems 6 (2014).
[27] Jesper Juul. 2005. Half Real. Videogames between Real Rules and Fictional Worlds. MIT Press.
[28] Antonios Liapis, Georgios N. Yannakakis, Mark J. Nelson, Mike Preuss, and Rafael Bidarra. 2019. Orchestrating Game Generation. IEEE Transactions on Games 11, 1 (2019), 48–68.
[29] Walid Magdy and Tamer Elsayed. 2016. Unsupervised adaptive microblog filtering for broad dynamic topics. Information Processing & Management 52, 4 (2016), 513–528.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
[31] Subhabrata Mukherjee and Pushpak Bhattacharyya. 2012. Feature Specific Sentiment Analysis for Product Reviews. In Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, 475–487.
[32] Roberto Navigli and Paola Velardi. 2004. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics 30, 2 (2004), 151–179.
[33] Chikashi Nobata and Satoshi Sekine. 2004. CRL/NYU Summarization System at DUC-2004. In Document Understanding Workshop 2004.
[34] Karolina Owczarzak, John M Conroy, Hoa Trang Dang, and Ani Nenkova. 2012. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, 1–9.
[35] George Panagiotopoulos, George Giannakopoulos, and Antonios Liapis. 2019. A Study on Video Game Review Summarization. In Proceedings of the MultiLing Workshop.
[36] Janne Parkkila, Filip Radulovic, Daniel Garijo, María Poveda-Villalón, Jouni Ikonen, Jari Porras, and Asuncion Gomez-Perez. 2016. An ontology for videogame interoperability. Multimedia Tools and Applications 76 (2016).
[37] M.F. Porter. 2006. An algorithm for suffix stripping. Program 14 (2006), 130–137.
[38] Ligaj Pradhan, Chengcui Zhang, and Steven Bethard. 2016. Towards extracting coherent user concerns and their hierarchical organization from user reviews. In 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI). IEEE, 582–590.
[39] Owen Sacco, Antonios Liapis, and Georgios N. Yannakakis. 2017. Game Character Ontology (GCO): A Vocabulary for Extracting and Describing Game Character Information from Web Content. In Proceedings of the International Conference on Semantic Systems.
[40] Hassan Saif, Miriam Fernandez, and Harith Alani. 2015. Contextual semantics for sentiment analysis of Twitter. Information Processing & Management 52 (2015).
[41] Gerard Salton and C.S. Yang. 1973. On the specification of term values in automatic indexing. Journal of Documentation 29, 4 (1973), 351–372.
[42] David Sánchez, Montserrat Batet, David Isern, and Aida Valls. 2012. Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 39 (2012).
[43] Mark Sanderson and W. Bruce Croft. 1999. Deriving concept hierarchies from text. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
[44] Min Song, Won Chul Kim, Dahee Lee, Go Eun Heo, and Keun Young Kang. 2015. PKDE4J: Entity and relation extraction for public knowledge discovery. Journal of Biomedical Informatics 57 (2015), 320–332.
[45] Pero Subasic and Alison Huettner. 2001. Affect analysis of text using fuzzy semantic typing. IEEE Transactions on Fuzzy Systems 2 (2001), 483–496.
[46] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2012. Sentiment Strength Detection for the Social Web. Journal of the American Society for Information Science and Technology 63, 1 (2012), 163–173.
[47] Peter D. Turney. 2002. Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[49] Janyce Wiebe. 2000. Learning Subjective Adjectives from Corpora. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. 735–740.
[50] Kevin Yauris and Masayu Leylia Khodra. 2017. Aspect-based summarization for game review using double propagation. In Proceedings of the International Conference on Advanced Informatics, Concepts, Theory, and Applications.
[51] José P. Zagal, Noriko Tomuro, and Andriy Shepitsen. 2012. Natural Language Processing in Game Studies Research: An Overview. Simulation & Gaming 43, 3 (2012), 356–373.
[52] Elias Zavitsanos, Georgios Paliouras, George A Vouros, and Sergios Petridis. 2010. Learning subsumption hierarchies of ontology concepts from texts. Web Intelligence and Agent Systems: An International Journal 8, 1 (2010), 37–51.
[53] Lili Zhao and Chunping Li. 2009. Ontology Based Opinion Mining for Movie Reviews. In Knowledge Science, Engineering and Management. Springer Berlin Heidelberg, 204–214.
[54] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie Review Mining and Summarization. In Proceedings of the ACM International Conference on Information and Knowledge Management. 43–50.
[55] Zhen Zuo. 2018. Sentiment Analysis of Steam Review Datasets using Naive Bayes and Decision Tree Classifier. Technical Report. University of Illinois at Urbana–Champaign.