Summarizing Game Reviews: First Contact

Aris Kosmopoulos (SciFY PNPC and NCSR Demokritos, Athens, Greece) akosmo@scify.org
Antonios Liapis (Institute of Digital Games, University of Malta, Msida, Malta) antonios.liapis@um.edu.mt
George Giannakopoulos (SciFY PNPC and NCSR Demokritos, Athens, Greece) ggianna@iit.demokritos.gr
Nikiforos Pittaras (NCSR Demokritos and Kapodistrian University of Athens, Athens, Greece) npittaras@di.uoa.gr

ABSTRACT
In recent years the number of players willing to submit a video game review has increased drastically. This is due to a combination of factors, such as the raw increase in the number of video gamers and the wide use of gaming platforms that facilitate the review submission process. The vast amount of data produced by reviewers makes extracting actionable knowledge difficult, both for companies and for other players, especially if the extraction is to be completed in a timely and efficient manner. In this paper we experiment with a game review summarization pipeline that aims to automatically produce review summaries through aspect identification and sentiment analysis. We build upon early experiments on the feasibility of evaluation for the task, designing and performing the first evaluation of its kind. Thus, we apply variants of a main analysis pipeline on an appropriate dataset, studying the results to better understand possible future directions. To this end, we propose and implement an evaluation procedure for the produced summaries, creating a benchmark setting for future works on game review summarization.

CCS CONCEPTS
• Applied computing → Computer games; • Computing methodologies → Information extraction.

KEYWORDS
summarization, natural language processing, sentiment analysis, game reviews, Steam

GAITECUS0, September 2–4, 2020, Athens
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The ever-expanding popularity of digital games is evidenced by the large profit margins of the commercial game industry sector [4], the vast and diverse swathes of the population that play games [17], and the appeal of games and gamification beyond the purposes of entertainment [13]. A large factor in the market penetration of digital games is distribution platforms such as Steam and the Google Play Store. Not only do these distribution platforms allow interested players to purchase and download new games, they also cultivate a player community, with players returning to rate and comment on their favorite game or even contribute user-created content, strategies, cheats, etc. This community-driven content often informs other users' purchases (e.g. via an aggregated review score) but is also carefully monitored by developers and publishers in order to gauge opinions on specific aspects of the game which can be patched or improved in updates or in sequels. For both players and developers, being able to succinctly monitor other players' views is highly beneficial. The website www.metacritic.com aggregates reviews by players and professional critics, returning a percentage score for the game and highlighting diverse reviews along the spectrum of positive versus negative. The Steam platform also aggregates its users' reviews into different categories ('Mixed', 'Overwhelmingly Positive', 'Mostly Negative', etc.), which is another criterion for sorting and (likely) promoting games. The simple aggregation of reviews into a general score is important, but it obfuscates the nuances of the different reviewers' grievances and is of limited use to designers who wish to improve their game. This paper explores techniques for text summarization in order to provide a multi-dimensional and holistic summary of Steam reviews for a particular game.

We explore the topic of summarization for game reviews using a large dataset of Steam reviews from 12 selected games. The goal of the summarization pipeline is to extract users' views on different facets of games such as graphics, audio, and gameplay [28], leveraging textual sentiment analysis to identify positive and negative review snippets and creating a composite summary of indicative comments on a specific game facet. Unlike the numerical aggregation of Metacritic or Steam, this approach extracts individual sentences (and criticisms) contained within a usually dense review and attempts to classify them as positive or negative automatically (rather than based on the user's binary recommendation). The presentation of the game's summary, which is split based on different aspects typically criticized in games, can be valuable for both players and designers. For players, the statistics derived from this process (e.g. the ratio of positive versus negative comments in one aspect) can act as an expanded game scoring system, not unlike professional game reviews which gave separate scores to graphics, audio, etc. For designers, the indicative comments split per sentiment and aspect allow for quick monitoring of players' current favorite features. Moreover, the flexible way in which aspects are defined allows designers to explicitly redefine the keywords they are interested in, personalizing the summary to their design priorities.

There has been very limited attention to game review summarization, besides student projects [55].
Inspired by the only work that performs aspect-based game review summarization [50], this paper evaluates the outcomes of a straightforward summarization pipeline in a small-scale user survey. Using the twelve most reviewed games in a 2017 dataset of Steam reviews, the resulting summaries are evaluated by a small set of experts. The paper studies pipeline variants to better sketch what is important in game review summarization. Based on the outcomes of the different summarization processes, and a small-scale study where the different outcomes were compared, a number of potential improvements were identified. The paper also highlights the many directions which game review summarization research can follow so that it can serve designers and players, through different pipeline implementations, alternative visualizations, bottom-up aspect discovery, or text processing driven by domain knowledge.

The paper is structured as follows. We start with a review of related works in Section 2. We then describe the proposed summarization pipeline and its variants in Section 3. We describe the dataset in Section 4 and present two different user studies in Sections 5 and 6. We then discuss the results in Section 7 and conclude the paper in Section 8.

2 RELATED WORK
User reviews are a rich source of information, although the extraction and analysis of this information can be challenging, not only due to the textual nature of the medium but also because users tend to have mixed opinions about various features [31]. Approaches such as sentiment analysis as well as summarization have been applied to various datasets, such as product reviews [24, 31], movie reviews [53, 54], or hotel reviews [25]. Section 2.1 surveys relevant approaches for the different phases of a summarization pipeline, while Section 2.2 discusses the nuances of the Steam platform and early work in game review summarization. For interested readers, [25] provides a more thorough overview of review summarization according to the type of corpora used as input.

2.1 Summarization Pipeline
Summarization can be extractive, when relevant portions (usually sentences) of the input are copied and combined, or abstractive, when new text is generated to rephrase and summarize the input [18]. The summarization pipeline requires a number of steps before the raw textual input can produce a summary; algorithms and approaches for each step are discussed below.

2.1.1 Pre-processing and parsing. A fundamental step towards summarization (and natural language processing more broadly) is the pre-processing and extraction of features from the dataset. In the analysis below, the term documents is used to describe any type of text, e.g. a sentence, a paragraph, or an academic paper. One popular if naive approach for pre-processing data is the bag-of-words, which collects all words in the document, disregarding their order and grammar. This method counts the number of instances of the same word, and the frequency of occurrence of each word is used as a feature to measure similarity between documents. Since many words (such as articles or pronouns) are far more frequent across all documents, terms are weighted based on their frequency via tf-idf [41], where the term frequency (tf) is multiplied by the inverse document frequency (idf). Unlike the bag-of-words approach, word order is considered in many other approaches, as it can capture a word's importance. For instance, the first and last sentences in a larger document tend to be more important [33]. Other approaches tag words by their part-of-speech (POS) [37], e.g. nouns (NN), verbs (VB), or adverbs (RB). This is useful for pre-processing, e.g. selecting only sentences with a noun and an adjective as a corpus for review summarization [25]. Another use of POS tags is to select N-grams (i.e. sequences of words) with specific parts of speech, such as a comparative adverb followed by an adjective [47].

2.1.2 Topic Modeling. Identifying the topic of a document, sentence, or review is often necessary for clustering opinions on the same topic together. When the topics of interest are known in advance, experts usually provide the keywords used to filter the relevant documents. For instance, TweetElect used an initial set of 38 keywords related to the 2016 US elections (including candidates' names) for streaming relevant tweets [11]. However, a boolean check of whether a keyword is specifically mentioned is rarely sufficient due to the nuances of language; query expansion is applied to create a larger set of terms related to each original keyword [29]. Supervised learning is often applied for topic modelling, showing positive and negative examples of relevant documents to a classifier [29]. When topics are unknown and must be discovered from the data, a simple approach is to identify the most frequent terms and cluster emergent terms based on co-occurrence [16]. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [8] can more efficiently discover topics without domain knowledge, following a bag-of-words approach which disregards word and document order. LDA randomly chooses a set of topics and decomposes the probability distribution matrix of words in documents into two matrices: the distribution of topics in each document and the distribution of words in each topic. Due to the vast number of possible topic structures, sampling-based algorithms are used to find the sample topics which best approximate the posterior distribution [7]. LDA has often been applied to find topics within reviews, primarily in order to identify reviews' sentiments towards these topics, e.g. in [26].
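As a minimal illustration of the LDA-based topic discovery described above, the sketch below fits a topic model over a handful of toy review sentences with scikit-learn. The vectorizer settings, number of topics, and example sentences are arbitrary choices made for the illustration and are unrelated to the experiments in this paper.

    # Minimal LDA topic-modeling sketch (illustrative only; parameters are arbitrary).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    reviews = [
        "The soundtrack and voice acting are superb.",
        "Constant lag and server crashes ruin the multiplayer.",
        "Beautiful scenery and smooth animations throughout.",
    ]

    # Bag-of-words counts, dropping English stop words.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(reviews)

    # Decompose the document-word matrix into document-topic and topic-word distributions.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)

    # Print the top words per discovered topic.
    terms = vectorizer.get_feature_names_out()
    for t, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-5:][::-1]]
        print(f"topic {t}: {top}")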
2.1.3 Sentiment Analysis. The sentiment behind utterances is important for summarization, especially when the corpus consists of reviews of any kind. Turney [47] highlighted that reviews may or may not recommend a certain product, movie, or travel destination; a summary should therefore account for both positive and negative reviews. Turney's study was the first to perform sentiment analysis on text-based reviews, based on AltaVista internet search queries measuring how near the phrases were to the word 'excellent' (for recommended) and the word 'poor' (for not recommended). Manually created lexica of words that express sentiment have been used in conjunction with fuzzy logic, vector distance, etc. to classify positive and negative text [12, 45]. In the same context, there has been extensive work on extracting opinion words which express subjective opinions within sentences [49]. It has been found that subjective sentences are statistically correlated with the presence of adjectives [49], and much research in product review summarization uses adjectives to determine sentiment polarity. For instance, Hu et al. [24] used a frequency-based algorithm to find relevant domain features, and then extracted adjectives near such domain features. Using a labeled set of adjectives and expanding the initial set via WordNet, Hu et al. classified the extracted adjectives' polarity and assigned that positive or negative sentiment to the nearby domain feature. The SentiWordNet database is constructed on the same principles as the domain-specific adjective classification of [24], using a manually annotated set of seed words and WordNet term relationships to expand the training set, which is then used as the ground truth for machine learning classifiers [2]. SentiWordNet, and similar general-purpose models for sentiment prediction [46], have been used for polarity detection in reviews, e.g. in [23, 40].

2.2 Steam Review Summarization
Since its 2003 release, the Steam platform has become the largest digital distribution platform for PC gaming [15], hosting over 34,000 games and tens of millions of active users daily. This paper focuses on user-created reviews on Steam, although other initiatives such as the Steam Workshop allow users to upload their mods or strategies and comment on others' content. User reviews can be submitted only by people that have purchased the game from Steam, although they are visible to all. As noted above, Steam aggregates user reviews into a category and provides a number of companion statistics, including a timeline of reviewers' scores. Reviews themselves consist of a single binary recommendation (Recommended versus Not Recommended) and a text explaining the user's opinion. Other users can review the quality of the review itself by tagging it helpful, not helpful, funny, or breaking the Rules of Conduct. By default, Steam shows the most helpful reviews submitted within the last 30 days, although users can also choose to sort reviews by other criteria.

As noted in the introduction, there is no systematic academic research in Steam review summarization. To the best of our knowledge, the only academic publication that tackles the problem of aspect-based summarization on such data is by Yauris and Khodra [50]. In their approach, only relevant portions of sentences were extracted via conditions applied on text tagged with parts of speech; these portions were usually small, e.g. the phrase could be "amount of content" [50]. Similar to our approach, a pre-specified set of keywords is used for aspect categorization. The aspects and keywords are similar but not identical to ours (e.g. the aspects in [50] are gameplay, story, graphic, music, community, and general/others), while the aspect described in a phrase was chosen based on the cosine similarity of each word of the phrase to the aspect's keywords. The output summary consists of many aspects (most of which are outside the pre-specified keywords) and a single adjective for each, unlike our current work which extracts complete sentences with different polarities. The summarization pipeline was tested on a single game (Skyrim), exploring different sentiment extraction approaches using precision and recall as performance metrics. While our current work does not explore as many parameters for sentiment analysis, it is the first instance where game review summaries are evaluated by humans in a small-scale but thorough user study.

3 SUMMARIZATION PIPELINES
Figure 1 visualizes the main components of our pipeline:

Preprocessing, which aims to prepare the input reviews for further analysis. This may imply cleaning, chunking text in snippets or sentences, part-of-speech tagging, and other similar tasks.
Aspect Identification, which identifies interesting aspects (or topics) in the reviews. These topics may be expressed as a set of words, e.g. "visual, aesthetic, scenery" or "soundscape, audio experience, sound effects".
Aspect Labeling, which assigns clear, descriptive labels to the discovered aspects, e.g. "graphics" or "audio".
Sentiment Analysis, which gathers information related to the sentiment expressed within the reviews. This information may later be used to update the final summary appropriately. For example, one may need only positive views in the summary, or, most probably, a sampling of all the views, be they positive or negative.
Summary Creation, which denotes the process that, given all the information gathered in the previous steps, forms and renders the final summary for the user.

Given the above pipeline, we implemented three different variants. The first two are based on keyword detection and Clustering (CL). The first variant does not perform Sentiment Analysis, while the second uses the full pipeline. The third is another full-pipeline method, based on Deep Learning (DL), that focuses on improving the Aspect Labeling and Summary Creation steps.
[Figure 1: the five pipeline stages, applied per aspect (e.g. Community, Graphics, Gameplay): 1 Review Preprocessing, 2 Aspect Identification, 3 Aspect Labeling, 4 Sentiment Analysis, 5 Summary Creation.]
Figure 1: The full pipeline represents both the Clustering variant (CL Full) and the Deep Learning variant (DL Full), while
variant CL AsDe produces summaries by skipping the Sentiment Analysis step.
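To make the division of work between the stages of Figure 1 concrete, the following skeleton sketches them as plain Python functions. The function names, signatures, and stub bodies are illustrative placeholders invented for this sketch; they are not the implementation used in this paper.

    # Illustrative skeleton of the five pipeline stages; names and stub bodies are
    # placeholders chosen for this sketch, not the authors' actual code.
    from typing import Dict, List

    def preprocess(reviews: List[str]) -> List[str]:
        """Split reviews into cleaned sentences (stage 1)."""
        return [s.strip() for r in reviews for s in r.split(".") if s.strip()]

    def identify_aspects(sentences: List[str]) -> Dict[str, List[str]]:
        """Group sentences into aspect-related sets (stage 2)."""
        return {"aspect_0": sentences}  # placeholder grouping

    def label_aspects(aspect_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
        """Attach human-readable labels such as 'graphics' or 'audio' (stage 3)."""
        return {f"label_{k}": v for k, v in aspect_sets.items()}

    def analyze_sentiment(sentence: str) -> float:
        """Return a polarity score in [-1, 1] (stage 4); constant here as a stub."""
        return 0.0

    def create_summary(labeled: Dict[str, List[str]]) -> Dict[str, List[str]]:
        """Pick a handful of sentences per labeled aspect (stage 5)."""
        return {label: sents[:3] for label, sents in labeled.items()}

    if __name__ == "__main__":
        reviews = ["Great soundtrack. Terrible server lag.", "Gorgeous visuals."]
        summary = create_summary(label_aspects(identify_aspects(preprocess(reviews))))
        print(summary)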

3.1 CL pipeline
During the preprocessing step, each review is split into sentences and each sentence is cleaned, in order to create the basic elements on which the final summaries will be based. The cleaning process consisted of some character replacements so that each sentence could be presentable (e.g. starting with a capital letter and ending with a period) even if it originated from a larger sentence that was split during sentence splitting. Moreover, preprocessing prepared the lemmatized versions of the sentences, which are used for aspect detection. In these lemmatized sentences, general stop words are removed. For all preprocessing steps, we used the default functions (and stop word lists) of the nltk Python library [5].

The aspect detection process is split into two parts: aspect identification and aspect labeling. Aspect identification splits sentences into sets that focus on a specific aspect, while aspect labeling identifies this aspect in order to present it in the final review summary.

Our approach uses a predefined set of aspects, presented in Table 1. We selected these six aspects since they are well-established facets of games [28] and are popular dimensions within professional reviews.

Aspect      | Keywords
Graphics    | graphic, visual, aesthetic, animation, scenery
Gameplay    | mission, item, map, weapon, mode, multiplayer, control
Audio       | audio, sound, music, soundtrack, melody, voice
Community   | community, toxic, friendly
Performance | server, bug, connection, lag, latency, ping, crash, glitch, optimization
Story       | dialog, romance, ending, cutscene, story

Table 1: Aspects and keywords used for the identification of dominant aspects in review clusters.

A simple approach for aspect labeling is to use a dictionary of keywords per aspect, such as the ones presented in Table 1. In order to include sentences even when they do not contain the exact keywords, k-means clustering is applied to all sentences to find clusters with similar text. Terms are weighted based on their frequency via tf-idf, which has been used extensively for sentence similarity in bag-of-words approaches (see Section 2.1). The result is K clusters of sentences with similar words to each other; in all our experiments we set K = 20 based on prior evidence [35]. Once all sentences are assigned to a cluster based on their distance to the center, all sentences in all clusters are processed in the following fashion:

(1) If the sentence contains the exact keywords of only one aspect, the sentence is assigned to that aspect and is flagged as a candidate that can be used by the summary of that aspect.
(2) If keywords from multiple aspects are found in the sentence, the sentence is flagged as an unsuitable candidate for any summary and removed.
(3) If no aspect keywords are found in the sentence, the most common aspect within the sentences of the same cluster will be used to label this sentence and flag it as a candidate. For instance, if a sentence does not contain any keyword, but sentences in its cluster predominantly belong to the aspect Gameplay via case (1), then the sentence is also assigned to that aspect and flagged as a candidate.

Using the sentences from cases (1) and (3), a set of candidate sentences is created per aspect. Using these sets, the first variation of our pipeline can already produce a summary. This variation, named Clustering Aspect Detection summary (CL AsDe), chooses N sentences at random from each aspect's set. A sample CL AsDe summary can be found in Table 2 for Tom Clancy's The Division.

The next step of the process is Sentiment Analysis, which is used by the next summarization variant (CL Full). Using the different sets of candidate sentences per aspect, the sentiment polarity (positive or negative) of each sentence is calculated by averaging the sentiment score of each word it contains. As above, sentiment analysis of each word is done via the default functions of the nltk Python library [5]. The library calculates probabilities for each polarity class (positive, neutral, negative). We took into account sentences which were assigned a class with a probability of at least 0.5. In order to select a number of sentences per category, a k-means clustering approach (using tf-idf) is applied within the set of sentences with the same polarity. In the CL Full implementation of this paper, only two sentences per polarity are selected (k = 2), as the ones closest to each cluster's centroid. If there exist sufficient positive and negative sentences, this approach returns 6 sentences as bullet points. Note that if fewer than two sentences are above the threshold for positive (or below the threshold, for negative), then fewer sentences may be included in the summary. An example summary from the CL Full variant can be found in Table 2 for Tom Clancy's The Division.
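A minimal sketch of the labeling logic in cases (1)-(3) is given below, using scikit-learn's TfidfVectorizer and KMeans as stand-ins for the tf-idf clustering described above. The helper names, the reduced keyword dictionary, and the parameters are illustrative assumptions, not the authors' actual code.

    # Sketch of keyword-based aspect labeling with a cluster-majority fallback.
    # Helper names and data are illustrative; not the authors' actual code.
    from collections import Counter
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    ASPECTS = {
        "Graphics": {"graphic", "visual", "aesthetic", "animation", "scenery"},
        "Audio": {"audio", "sound", "music", "soundtrack", "melody", "voice"},
    }

    def keyword_aspects(sentence: str) -> set:
        words = set(sentence.lower().split())
        return {a for a, kws in ASPECTS.items() if words & kws}

    def label_sentences(sentences, n_clusters=20):
        vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        n_clusters = min(n_clusters, len(sentences))
        clusters = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)

        labels, cluster_votes = {}, {c: Counter() for c in set(clusters)}
        for i, s in enumerate(sentences):
            hits = keyword_aspects(s)
            if len(hits) == 1:                  # case (1): single-aspect keyword hit
                labels[i] = next(iter(hits))
                cluster_votes[clusters[i]][labels[i]] += 1
            elif len(hits) > 1:                 # case (2): ambiguous, discard
                labels[i] = None
        for i, s in enumerate(sentences):
            if i not in labels:                 # case (3): inherit the cluster majority
                votes = cluster_votes[clusters[i]]
                labels[i] = votes.most_common(1)[0][0] if votes else None
        return labels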
3.2 DL pipeline
After experimenting with the first two variant pipelines and taking into account the feedback of the first user study (see Section 5), we decided to focus on improving the following:

• Keyword detection and clustering-based Aspect Labeling must be improved, to avoid sentences such as "If those things all sound good to you you will like the game." being labeled as audio sentences.
• The final summary should somehow provide information regarding the overall sentiment of the given aspect, and not just the sentiment of the selected sentences.
• The final summary should use a better sentence extraction approach in order to deal with redundancy.

Taking all the above into account, the DL pipeline makes changes to the Aspect Detection and Summary Creation steps of the CL pipeline described in Section 3.1.

For Aspect Detection, we used the BERT model [14] to generate embeddings for game reviews. BERT is a deep neural language model that uses a bidirectional, multilayer transformer architecture, exploiting cross- and self-attention to capture word interdependencies effectively [3, 48]. The approach relies on multi-head attention modules for sequence encoding, with word order information being retained via additive positional encoding vectors. BERT is trained in an unsupervised setting on large quantities of English text, using masked language modelling and next sentence prediction objectives. These tasks require the prediction of hidden sequence tokens and the generation of an entire sequence, given an input sequence (e.g. for tasks such as question answering, text entailment, etc.). This pretraining scheme and architecture have been shown to perform exceptionally well for a variety of natural language understanding tasks.

To obtain the representation for a game review, we feed the text to the model using a sequence length of 16 tokens. We use the BERT-BASE model variant, which produces 768-dimensional sequence embeddings, learned during training for classification purposes. The implementation and pretrained model utilized are provided by the transformers software package from huggingface (https://huggingface.co/). Using the produced embeddings as features, we trained a binary Ridge Logistic Regression classifier [19] (one-vs-all) for each aspect. We also trained a seventh classifier to detect sentences unfit for any aspect. For each candidate sentence, a confidence score was calculated by each aspect classifier. Only sentences with a high prediction confidence for the given aspect and a low confidence on every other classifier were selected as summary candidates for the next steps of the pipeline.

During Summary Creation we applied the following strategy to the 100 most probable candidate sentences of each aspect. First, the NewSum Toolkit [20] was used to select the sentences that provide the most representative information. NewSum uses language-agnostic methods based on n-gram graphs that not only extract the most representative sentences, but also deal with redundancy. In the end we had 20 candidate sentences per aspect. The final summary was composed of 6 sentences, using the following strategy:

• Select the most positive sentence (Sentiment Analysis).
• Select the most negative sentence (Sentiment Analysis).
• Select the first 3 sentences provided by the NewSum Toolkit (excluding the previously selected sentences).
• Create an artificial sentence using the polarities provided by Sentiment Analysis over all the aspect's sentences. The polarity of each sentence was mapped to 1, 0 or -1 (positive, neutral, negative) using thresholds. Given an aspect and the mean polarity score P̄, the possible produced sentences reflect opinions that fall in the following categories:
  - Mixed: P̄ ≈ 0, high standard deviation.
  - Mostly neutral: P̄ ≈ 0, low standard deviation.
  - Mostly positive: P̄ > 0, above a threshold.
  - Mostly negative: P̄ < 0, below a threshold.

The final summary is composed by randomly shuffling these 6 sentences. An example summary from the DL Full variant can be found in Table 2 for Tom Clancy's The Division.

- In a few words the game is single dimensional this might sound vague but it becomes apparent that there is not much depth as you play once you're a couple hours in.
- Clothes sound "right" when you move in them.
- They sound good and looked good with ability to mod for better stats or even rerolling stats.
- They have improved the pve portion of the game and crazy as it sounds the pvp too.
- No music and something feels so strangely abandnded about it.
- Like how if there's a blizzard your cap and shoulder will be covered in snow and that npc voices will echo when they are standing in hallways with hollow walls.
- Very good voice acting.
- Great abilities pretty good sounds; indoor echos reverb off objects etc.
- If those things all sound good to you you will like the game.
- Superb voice acting and ambient city sounds are also a good plus for this game.
- It sounds hyperbolic but I'm being dead serious.
- Sounds terrible right
- Most opinions are positive regarding audio.
- The voice acting in the game is in the higher tiers as is most ubisoft games.
- There are not a lot of different voices and some of the voice acting for them is bad.
- Ubisoft - bugs - the textures are so fucked up that nobody can play this game anymore.
- And it clearly shows I want to play it and that I try to.
- I'm gonna be honest the cinematics are pretty great.

Table 2: Summaries generated by different pipelines, for aspect Audio of Tom Clancy's The Division. From top to bottom: CL AsDe (only aspect detection), CL Full (aspect detection with sentiment analysis) and DL Full (Deep learning combined with a sophisticated summarizer).
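The aspect classification step of the DL pipeline can be approximated with off-the-shelf components, as in the sketch below: BERT-base embeddings from the transformers package feeding a one-vs-all logistic regression with an L2 (ridge) penalty per aspect. The checkpoint name, mean pooling, and toy training labels are assumptions made for this example; the paper does not report these details.

    # Sketch: BERT embeddings + one-vs-all L2-regularized logistic regression per aspect.
    # Checkpoint, pooling and toy labels are illustrative assumptions.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentences, max_length=16):
        """Return one 768-d vector per sentence (mean-pooled token embeddings)."""
        batch = tokenizer(sentences, padding=True, truncation=True,
                          max_length=max_length, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**batch).last_hidden_state        # (batch, seq, 768)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    # Toy training data: 1 = sentence belongs to the 'Audio' aspect.
    sentences = ["Superb voice acting and ambient city sounds.",
                 "The servers crash constantly.",
                 "The soundtrack is forgettable.",
                 "Gorgeous lighting and textures."]
    labels = [1, 0, 1, 0]

    audio_clf = LogisticRegression(penalty="l2", max_iter=1000)
    audio_clf.fit(embed(sentences), labels)

    # Confidence scores for new candidate sentences; in the paper only sentences with
    # high confidence for one aspect and low confidence for all others are kept.
    print(audio_clf.predict_proba(embed(["No music and the city feels empty."])))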
4 DATASET
As a first demonstration of the summarization pipeline, we follow [35] and select the most helpful reviews on Steam, splitting them per game. This paper parses the Steam review dataset gathered by Zuo [55], which consists of over 7 million reviews obtained via Steam's API. Each review text comes with a plethora of features concerning both the game being reviewed and the reviewer, although only a subset of features is used for this experiment. Since Steam users can vote a review as helpful, unhelpful, or spam, we only consider as 'valid' those reviews with 10 or more 'helpful' user votes. With this criterion (and a minimum of at least 1,000 'helpful' reviews), we select the twelve games with the most valid reviews (see Table 3). The games selected have a desirable diversity both in terms of genres (shooting, survival, adventure, open-world, multi-player, single-player, etc.) and in terms of general audience reception (shown by the Metacritic score, which aggregates professional and users' reviews).

For each of the selected games we kept the 10 thousand most up-voted reviews. As already discussed in Section 3, each of these reviews was split into sentences to create a sentence pool per game. On average, the sentence pool consisted of around 50 thousand sentences per game. The smallest pool of sentences was for PAYDAY 2 (37K), while the largest one was for Elite Dangerous (70K). The average length of the sentences was 85.7 characters and 16.4 words. In terms of both characters and words, the longest sentences were those of Darkest Dungeon (an average of 91.8 characters and 17.3 words) and the shortest ones were those of Just Survive (an average of 79.9 characters and 15.6 words).

In terms of aspects, the most common one on average was Gameplay. Performance was the next most popular aspect, and in certain games such as ARK: Survival Evolved it was the most popular one. The least popular aspect was Audio, with a ratio of 1 to 5 compared to the Gameplay aspect.

In terms of sentiment, the majority of sentences were neutral rather than positive or negative. Between positive and negative sentiment, no general safe conclusions can be drawn, since the results varied given different combinations of aspects and games. In general, we can say that the aspect Performance was characterized as negative more frequently. The opposite was true for the aspect Graphics. On the other hand, the sentiment ratio (positive vs. negative) towards the aspect Community varied between games.

Game Title                  | Publisher                     | Year | Reviews | MC
No Man's Sky                | Hello Games                   | 2016 | 4146    | 61%
DayZ                        | Bohemia Interactive           | 2018 | 3349    | –
PAYDAY 2                    | Starbreeze                    | 2017 | 2573    | 79%
ARK: Survival Evolved       | Studio Wildcard               | 2017 | 2368    | 70%
Grand Theft Auto V          | Rockstar Games                | 2015 | 2104    | 96%
Firewatch                   | Campo Santo                   | 2016 | 1599    | 81%
Darkest Dungeon             | Red Hook Studios              | 2016 | 1564    | 84%
Just Survive                | Daybreak Game Company         | 2015 | 1463    | –
Killing Floor 2             | Tripwire Interactive          | 2016 | 1276    | 75%
Elite Dangerous             | Frontier Developments         | 2015 | 1270    | 80%
Tom Clancy's 'The Division' | Ubisoft                       | 2016 | 1091    | 79%
Subnautica                  | Unknown Worlds Entertainment  | 2018 | 1056    | 87%

Table 3: Games selected from the dataset, sorted by the number of 'valid' reviews (10 or more 'helpful' votes). The Metacritic score (MC) is included for reference.
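The review selection described in this section can be sketched as follows, assuming a pandas DataFrame with hypothetical column names (game, helpful_votes, review_text); the actual fields of the Zuo dataset may differ.

    # Sketch of the dataset filtering: 'valid' reviews have >= 10 helpful votes,
    # and the 10,000 most up-voted reviews per game are split into sentences.
    # File name and column names are hypothetical.
    import pandas as pd
    from nltk.tokenize import sent_tokenize  # requires nltk's 'punkt' data

    reviews = pd.read_csv("steam_reviews.csv")

    valid = reviews[reviews["helpful_votes"] >= 10]

    # Keep the 10,000 most up-voted valid reviews per game.
    top_reviews = (valid.sort_values("helpful_votes", ascending=False)
                        .groupby("game")
                        .head(10_000))

    # Build a sentence pool per game.
    sentence_pools = {
        game: [s for text in group["review_text"] for s in sent_tokenize(str(text))]
        for game, group in top_reviews.groupby("game")
    }
    print({g: len(p) for g, p in sentence_pools.items()})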
5 FIRST USER STUDY
As a first experiment, we evaluated the two variations of the CL pipeline (CL AsDe and CL Full) in a small-scale user study with summaries of aspects of the 12 games of Table 3.

5.1 Annotation Protocol
A pairwise comparison process was followed, rather than a scale-based rating approach, due to (a) evidence that comparison-based evaluation can be less demanding cognitively [9] and (b) a rich body of literature that has applied pairwise evaluation for summarization tasks [34] (e.g. the single-document summarization task in [21]).

To this end, we created an online evaluation user interface (UI) (see Figure 2) which supported comparative pairwise evaluation of summaries. We initialized the system by providing two sets of summaries A, B, one from system A and one from system B. Each summary in A corresponded to a summary in B, as they both summarize the same set of reviews and the same aspect (e.g. the aspect Graphics of DayZ). During the experiment, each system's summary was randomly placed first or second to minimize any ordering-related bias.

Figure 2: User interface for online evaluation of summaries produced by CL AsDe and CL Full methods.

The UI also informed the user of the title of the game being summarized, plus the aspect (e.g. Graphics). The user was then asked to select their preferred summary (A or B) and explain the reasons for this preference. For the latter annotation, the user could select one or more tickboxes among the following options:

• It repeats the same information less (Less Redundant)
• It seems to be more coherent and/or complete
• For other (or even unclear) reasons

The first two options aim to assess whether redundancy is a concern and, similarly, whether coherence and completeness are useful in the task. Redundancy has traditionally been a summarization evaluation indicator [1], especially in multi-document summarization. The completeness and coherence aspect is essentially a (more nuanced) version of overall responsiveness, as this has been used in DUC/TAC summarization tracks and related work [10].

5.2 Participants
The evaluation was carried out by eight adult evaluators (3 female), fluent in English, with gaming experience. The evaluators were selected explicitly among the authors' network of contacts and invited directly by the authors. Participants were asked to connect to the online system and evaluate all 72 pairs of summaries (produced by CL AsDe and CL Full), which covered all predefined aspects (see Table 1) of the twelve games in Table 3. There was no time limit for completing the evaluation, but there was a requirement that all pairs were evaluated in a single session.

5.3 Results
The data collected from the experiment was a total of 576 observations, including the preference of each evaluator for each pair of summaries and the reasons for this choice. The primary goals of the user study are to assess (a) whether the annotators prefer one of the two summarization approaches, and (b) which criteria they explicitly (via the three tickboxes) or implicitly (based on properties of the summary) consider when selecting their preference. Towards this end, the data is processed based on the 8 users' annotations on 72 game/aspect pairs (for a total of 576 data points), and all statistical tests are performed at a 5% significance threshold. Our assumption is that the complete CL pipeline, which includes both aspect detection and sentiment analysis, will offer a richer and more diverse summary than AsDe alone.

Regarding users' preference of one summarization technique, results were mixed: overall, annotators had no clear preference, with CL AsDe being marginally more often selected (53%). Table 4 shows the distribution of selections split per aspect. The table shows that the main factor for the skew of the overall preference towards CL AsDe was the graphics summaries, as the other aspects are fairly evenly preferred between the two approaches.

Aspect      | CL Full | CL AsDe
Audio       | 43%     | 57%
Community   | 51%     | 49%
Gameplay    | 55%     | 45%
Graphics    | 30%     | 70%
Performance | 54%     | 46%
Story       | 49%     | 51%
Overall     | 47%     | 53%

Table 4: First user study: annotators' preference of one summarization algorithm over the other, per aspect and overall.

To further assess which factors led to the annotators' preference of one summary over the other, we conducted an analysis of variance (ANOVA) test between the preferred approach (represented as a binary choice) and other features such as the aspect. Table 5 shows the results in terms of significant differences, and verifies that there is a systematic influence between the aspect and preference. On the other hand, the game does not seem to affect users' preference of one summary or the other; this is a promising finding as the methods are supposed to be applicable to any game. There is also clear evidence that preference varied strongly from annotator to annotator, and annotators rarely agreed with each other even in this simple pairwise preference task.

Factor     | Df | F value | p value
game       | 11 | 1.519   | 0.120
aspect     | 5  | 3.912   | 0.001 *
evaluator  | 6  | 7.945   | 0.000 *
coherence  | 1  | 18.6491 | 0.000 *
redundancy | 1  | 5.7604  | 0.017 *
other      | 1  | 0.5639  | 0.453

Table 5: Analysis of variance between the preference of one approach and different factors. Significant findings are shown with an asterisk. The analysis is made on the F statistic and the degrees of freedom (Df) are also noted.
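For reference, an analysis like the one reported in Table 5 can be reproduced along the lines of the sketch below, which runs statsmodels on randomly generated stand-in data; the column names and generated values are placeholders, not the study's actual annotations.

    # Sketch of an ANOVA over binary preference vs. study factors (stand-in data).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    n = 576  # 8 evaluators x 72 game/aspect pairs
    df = pd.DataFrame({
        "preference": rng.integers(0, 2, n),
        "game": rng.choice([f"game_{i}" for i in range(12)], n),
        "aspect": rng.choice(["Audio", "Community", "Gameplay",
                              "Graphics", "Performance", "Story"], n),
        "evaluator": rng.choice([f"user_{i}" for i in range(8)], n),
        "coherence": rng.integers(0, 2, n),
        "redundancy": rng.integers(0, 2, n),
        "other": rng.integers(0, 2, n),
    })

    # Fit a linear model of preference on the factors and report the F statistics.
    model = ols("preference ~ C(game) + C(aspect) + C(evaluator)"
                " + coherence + redundancy + other", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))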
To get a better understanding of the reasons annotators gave for their preference, we looked further into the statistics of the winning observations of CL AsDe vs. CL Full. When CL AsDe was preferred, annotators explained their preference mainly via better coherence (63%), lower redundancy (28%), but also 'other reasons' (26%). When CL Full was preferred, annotators chose 'other reasons' (50%), and less often coherence (41%) or low redundancy (17%). This finding shows that summaries by CL AsDe were more coherent, but annotators still often preferred summaries by CL Full for other reasons. This points to a limitation of the experimental protocol, as the interface did not provide annotators with enough options to explain the reasons for their summary preference. This was addressed in the second user study (see Section 6) with an extra option on the UI. It should be noted that better coherence was selected far more often overall (53% of instances) than lower redundancy (23%), while 'other reasons' were also chosen often (37%). Redundancy and coherence were chosen together in only 5% of instances, and thus it is evident that these two axes of evaluation are fairly independent. These findings, coupled with the statistically significant influence (via ANOVA) between the preference of summarization approach and tagged coherence and redundancy, support our conclusion that both coherence and redundancy were important factors for annotators' preference.

6 SECOND USER STUDY
Based on the findings and limitations identified in the first user study, we conducted a second study with more participants but fewer games, testing the best CL approaches against the novel DL Full pipeline. Due to participants' concerns about the long duration of the 72-item survey in the first experiment, we opted to use only two games to lower the time required from annotators, as fatigue would likely introduce noise to the participants' responses. Details on how the games and annotation options were chosen are given in Section 6.1.

6.1 Annotation Protocol
The user interface for the second user study was largely the same as in the first (see Section 5.1). Based on the first study's finding that 'other reasons' for an annotator's preference were often chosen, a fourth option was added to the UI as a tickbox stating "The summary was more focused and contained less irrelevant information." We refer to this additional option as Focus in the analysis that follows.

As noted above, to reduce the time required for the study, only two games were chosen to be annotated. We chose among the games from the first user study, taking the game where CL Full had the highest preference (Tom Clancy's The Division, where CL Full was chosen 60% of the time) and the game where CL AsDe had the highest preference (Elite Dangerous, where CL AsDe was chosen 60% of the time). For each of the two games, the preferred method was chosen to present to the user, juxtaposed with the summary for the same game and aspect produced by DL Full. Therefore, the participant had to annotate 12 items: 6 aspects for Tom Clancy's The Division comparing the CL Full summary with the DL Full summary, and 6 aspects for Elite Dangerous comparing the CL AsDe summary with the DL Full summary. The rationale was to select the most successful game summaries (for both CL variants) and compare them with the novel DL pipeline.
them with the novel DL pipeline. We refer to CL and DL summaries           of pre-specified game facets. two small-scale user surveys exam-
in this paper, referring to the best CL summary (CL Full or CL AsDe)       ined the preference of users in the presence of different pipeline
as shown to the user.                                                      implementations. Results indicate that (a) aspect extraction is im-
   As with the first user study, the order of the two options was          portant for summarization, although deep-learning does not neces-
randomized (i.e. sometimes CL summaries were shown first, some-            sarily improve the aspect extraction process compared to a simpler
times second). Unlike the previous experiment, however, the order          clustering-based method; (b) between the clustering-based pipeline
of the sentences within the same summary was also randomized;              variants (CL AsDe, CL Full), there was no clear winner with respect
the rationale was to avoid ordering effects when the participant           to the summary outputs; (c) evaluators had strong and individual
starts by reading an incoherent sentence first.                            opinions on which variant was better; (d) sentiment-based crite-
                                                                           ria and/or confidence-based criteria for selecting sentences do not
                                                                           seem to perform better than the random selection performed by CL
6.2    Participants
                                                                           AsDe.
Fourteen participants completed this annotation task. Unlike the              While the aspects chosen for this experiment were intuitive,
previous study, a snowball method for soliciting participants was          based on typical facets of games that players and professional crit-
followed, soliciting feedback from a broader group. Thus, this study       ics focus on, some of the resulting aspect-based summaries were
lacks data on the demographics and gaming experience of partici-           less coherent than others. The choice to assign a sentence to an
pants, although participants were all adults and had experience in         aspect even if its cluster only had a slim majority in keyword fre-
data analysis and artificial intelligence.                                 quency likely introduced inconsistency. For CL aspect detection,
                                                                           the most significant factor for the lack of coherence was the choice
6.3    Results
The data collected from the experiment comprised a total of 168 observations. Overall, CL summaries were slightly more preferred by participants (55%), although the difference is not statistically significant (paired t-test, p-value 0.22). Interestingly, for Elite Dangerous (which was summarized by CL AsDe) the difference was more pronounced (CL AsDe preferred 60% of the time over DL Full); for Tom Clancy's The Division the two methods (CL Full and DL Full) were chosen evenly. Since only one game was tested per CL variant, it is difficult to assess whether the preference was due to the game itself or the sentiment-based selection component. Moreover, while DL Full includes sentiment-based selection, this part accounts for 2 of the 6 sentences and thus it is even more difficult to estimate the reasons for the users' preference. This ambiguity points to further refinements needed for the annotation protocol, which are discussed in Section 7.
   In terms of the reasons offered by participants for their choice, coherence was still the most commonly chosen (62% of responses), followed closely by focus (56%). Low redundancy was chosen less often (23%), while 'other reasons' were chosen in only 14% of responses. The addition of the focus option seems to have mitigated the prevalence of 'other reasons' observed in the first study. Unlike the first study, however, low redundancy was often chosen in conjunction with one other reason (56% of the time) or two other reasons (36% of the time). Combined with its low overall prevalence, it is possible that low redundancy may no longer be necessary as a separate reason in the UI, although a broader user study with more games is needed to validate this hypothesis.
   Pearson's Chi-squared tests were also used in order to test whether any of the above reasons is correlated with the preferred summary. Only redundancy was found to be correlated with the type of summary (p-value 0.001). This clearly indicates the importance of handling redundancy satisfactorily in any future approach.
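For reference, such a test can be reproduced with a standard statistics library; the sketch below uses scipy with placeholder counts (not the counts collected in this study) purely to illustrate the procedure.

import numpy as np
from scipy.stats import chi2_contingency

# Placeholder 2x2 contingency table (illustrative only, NOT the study's counts):
# rows = whether "low redundancy" was given as a reason, columns = preferred summary.
table = np.array([[30,  9],    # low redundancy cited:     CL preferred, DL preferred
                  [62, 67]])   # low redundancy not cited:  CL preferred, DL preferred

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")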
7    DISCUSSION
This paper introduced a number of possible pipelines for identifying, grouping, and extracting the opinions of users in terms of pre-specified game facets. Two small-scale user surveys examined the preference of users in the presence of different pipeline implementations. Results indicate that (a) aspect extraction is important for summarization, although deep learning does not necessarily improve the aspect extraction process compared to a simpler clustering-based method; (b) between the clustering-based pipeline variants (CL AsDe, CL Full), there was no clear winner with respect to the summary outputs; (c) evaluators had strong and individual opinions on which variant was better; (d) sentiment-based criteria and/or confidence-based criteria for selecting sentences do not seem to perform better than the random selection performed by CL AsDe.
   While the aspects chosen for this experiment were intuitive, based on typical facets of games that players and professional critics focus on, some of the resulting aspect-based summaries were less coherent than others. The choice to assign a sentence to an aspect even if its cluster only had a slim majority in keyword frequency likely introduced inconsistency. For CL aspect detection, the most significant factor for the lack of coherence was the choice of keywords. Specifically, the keyword "sound" was often found in sentences unrelated to game audio, used as a verb: e.g. "On paper this game sounds great". To a degree, such artefacts were removed in the DL aspect detection pipeline via (a) the latent sentence representation and (b) fine-tuning the model based on manual annotations on this specific corpus. However, a more sophisticated method for aspect detection seems necessary. For instance, an adaptive query expansion as followed by [29] could create a much larger set of keywords automatically, although it may overlook the nuances of game terminology. On the other hand, a Word2Vec model [30] trained on the entire corpus of Steam reviews (or even larger game-related corpora such as game FAQs and fansites) could be used to derive a similarity score with specific aspects. Building a game ontology for this task or using an existing one [36, 39] could further assist in discovering more keywords or in calculating an ontology-based semantic similarity measure [42]. Finally, a completely different direction could see the discovery of topics specific to each game rather than focusing on the same pre-specified topics every time. This would be valuable as different genres have a different focus (e.g. multiplayer games focus on balance or lag, while horror games focus on the emotional response), but could make it difficult to maintain the same presentation format across games and thus confuse end-users.
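As a rough illustration of the Word2Vec-based alternative mentioned above, the sketch below uses gensim; the seed keyword lists and the tokenized_reviews variable are placeholders, not the keyword sets used by the CL pipeline.

from gensim.models import Word2Vec

# Sketch only: `tokenized_reviews` is assumed to be a list of token lists drawn
# from the Steam review corpus; the seed keywords below are illustrative.
model = Word2Vec(sentences=tokenized_reviews, vector_size=100,
                 window=5, min_count=5, workers=4)

ASPECT_SEEDS = {
    "audio":    ["sound", "music", "soundtrack", "audio"],
    "visuals":  ["graphics", "art", "visuals", "textures"],
    "gameplay": ["gameplay", "mechanics", "controls", "combat"],
}

def aspect_similarity(sentence_tokens, aspect):
    """Mean cosine similarity between the sentence tokens and the aspect's seed words."""
    seeds = [w for w in ASPECT_SEEDS[aspect] if w in model.wv]
    words = [w for w in sentence_tokens if w in model.wv]
    if not seeds or not words:
        return 0.0
    return float(model.wv.n_similarity(words, seeds))

A sentence could then be assigned to the aspect with the highest similarity, or left unassigned if no aspect exceeds a threshold.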
   Sentiment analysis was also often problematic, primarily due to the informal and idiosyncratic language in which game reviews are often written. Reviews are often rife with sarcasm and negation, e.g. "Have fun spending huge amounts of hours for very little progress." Moreover, many reviews' sentences have poor syntax and are very short or very long (e.g. "Good: + great aesthetic."). Sentiment analysis treated the sentence as a bag-of-words, exacerbating the problem. In general, sentiment analysis cannot capture negation or sarcasm and handles incomplete sentences poorly. Performance would likely be improved with a more appropriate pre-trained lexicon for informal utterances on the Social Web, such as SentiStrength [46] or other sentiment- and negation-aware approaches [22]. Alternatively, a custom classifier for sentiment analysis could be trained using text from a Steam review as input and the user's recommendation as polarity. Complementing the training set with experts' annotations could refine such a model, especially when dealing with sarcasm. Another promising alternative to SentiWordNet for sentiment analysis would be the use of an authored dictionary of opinion words [24] or game-specific adjectives annotated in terms of polarity [51].
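A minimal sketch of the custom classifier mentioned above, assuming the review texts and their thumbs-up/down recommendation flags are available as parallel lists (the variable names are ours, not part of the pipeline described in this paper):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# `reviews` is a list of review texts, `recommended` the matching list of
# Steam thumbs-up/down flags (True/False); both are assumed, not shown here.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, recommended, test_size=0.2, random_state=42)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True),
    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

The predicted probability of a positive recommendation could then serve as a sentence-level sentiment score once reviews are split into sentences, although the label noise introduced by applying whole-review labels to individual sentences would need to be handled.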
   Our findings also showed no clear winner between the two CL variants or between CL and DL summaries. This ambiguity in the findings could well be a by-product of the experimental protocol followed. Findings from the first user study pointed to a missing reason for players to report, and the second study included a "focus" reason which improved the quality of the data collected but raised questions about the importance of the "low redundancy" reason. The fatigue reported by users in the first experiment led to fewer items in the second study to alleviate the burden on annotators. However, this increased the locality of the findings in the second study, as it was unclear whether preferences were due to the game or the algorithm. In future studies, summaries for more games should be annotated by more participants, showing only two games to each user but randomizing which games are shown when the user starts the study. More importantly, the current experimental protocol forces participants to select one summary as preferred and provide at least one reason. The forced choice between two summaries does not allow the user to provide more nuanced feedback. A four-alternative forced-choice (4-AFC) with options "A", "B", "both A and B", "neither A nor B" would allow the user to point out cases where both summaries are equally good or equally bad. The fairly even split between the two alternatives in both user studies could be due to the fact that users considered some of the summaries shown equally bad and selected randomly. On the other hand, a 4-AFC questionnaire would likely need many more participants, since much of the data will be removed when no ranking is given. The need for more games, more annotation choices, and perhaps more algorithm variants (DL AsDe, for instance) points to the need for a large-scale user survey among the general gaming community, which will be performed in future work in this vein.
   As discussed in Section 1 and explored at a high level during the user study, game review summarization can be valuable both to consumers (players) and producers (game developers). However, each stakeholder has different priorities and will likely respond differently to different summary formats. The extractive summarization process was visualized as 'pure text' bullet points, which was not as engaging to either type of audience. It would be important to explore alternative visualizations for players and developers. For players, the summary could provide more structure (based on pre-specified game facets), focus more on the weights and scoring of each aspect (including visualizations such as pie charts), show only a few polar opposites in terms of review sentences, and perhaps cross-reference these findings with other games' review summaries. For developers, on the other hand, a bottom-up topic discovery would likely be beneficial in order to identify unexpected points of contention among users. Moreover, presenting the context of the reviewers' chosen sentences would also be valuable for designers, e.g. how many reviewers agree with or echo this comment, when this comment was made, and whether general sentiment has shifted since then. Such context can be important regarding the urgency of addressing certain concerns or to gauge whether patches and updates have improved reviewers' perception, not unlike Steam's use of most recent reviews.
   There are many directions for future research depending on the purpose of the game review summarization. As a tool for game evaluation, primarily targeted towards players or producers, the game's context is important in order to choose which reviews or topics to highlight. Additional research in this vein would need to find topics or patterns in similar games (e.g. of the same genre, publisher, or publication date) and then compare the current game's reviews in terms of those topics or against other games' reviews. User experience research would also be important to find how best to present such results, as interactive summaries where the user can zoom in and out of different games and/or different topics within games would make the summaries more intuitive and manageable. As a tool for game analysis, bottom-up probabilistic topic modelling [7] in games of the same genre could help identify design patterns [6] and players' expectations based on their repertoire [27]. As a tool for knowledge discovery, game reviews can serve as raw text or multi-modal corpora from which structured data can be automatically extracted as entities and relations [44], concept hierarchies [43, 52], or even a complete game ontology [32, 38].
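As an illustration of the bottom-up alternative, topics for a single genre could be discovered with a standard LDA implementation; the sketch below uses gensim with arbitrary parameter choices, and genre_reviews is an assumed list of tokenized reviews for games of one genre.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `genre_reviews`: assumed list of token lists, one per review, for a single genre.
dictionary = Dictionary(genre_reviews)
dictionary.filter_extremes(no_below=10, no_above=0.5)  # drop rare and ubiquitous tokens
corpus = [dictionary.doc2bow(tokens) for tokens in genre_reviews]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=12, passes=10)
for topic_id, words in lda.show_topics(num_topics=12, num_words=8, formatted=False):
    print(topic_id, [w for w, _ in words])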
8    CONCLUSION
This paper highlighted the challenges and opportunities of game review summarization via natural language processing. The paper introduced a pipeline for grouping Steam users' comments into pre-specified aspects such as visuals or performance, and studied different renderings of the final summary, exploiting positive and negative sentences based on sentiment analysis. The small-scale user surveys revealed differences in how different annotators assess the reviews, highlighted possible foci of research for better game review summarization systems, and suggested a number of refinements to the process in this promising subfield of game artificial intelligence.
REFERENCES
[1] Rasim M Alguliev, Ramiz M Aliguliyev, Makrufa S Hajirahimova, and Chingiz A Mehdiyev. 2011. MCMR: Maximum coverage and minimum redundant text summarization model. Expert Systems with Applications 38, 12 (2011), 14514–14522.
[2] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the International Conference on Language Resources and Evaluation, Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias (Eds.).
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat] (2014).
[4] BBC News. 2019. Gaming worth more than video and music combined. https://www.bbc.com/news/technology-46746593. Accessed 26 January 2020.
[5] Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.
[6] Staffan Bjork and Jussi Holopainen. 2004. Patterns in Game Design. Charles River Media.
[7] David M. Blei. 2012. Probabilistic Topic Models. Communications of the ACM 55, 4 (2012), 77–84.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[9] Andrew P Clark, Kate L Howard, Andy T Woods, Ian S Penton-Voak, and Christof Neumann. 2018. Why rate when you could compare? Using the "EloChoice" package to assess pairwise comparisons of perceived physical strength. PloS one 13, 1 (2018).
[10] Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 Update Summarization Task. In Proceedings of the Text Analysis Conference.
[11] Kareem Darwish, Walid Magdy, and Tahar Zanouda. 2017. Trump vs. Hillary: What Went Viral During the 2016 US Presidential Election. In Proceedings of the International Conference on Social Informatics. Springer International Publishing, 143–161.
[12] Sanmay Das and Mike Y. Chen. 2001. Yahoo! for Amazon: extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific finance association annual conference.
[13] Sebastian Deterding, Dan Dixon, Rilla Khaled, and Lennart Nacke. 2011. From Game Design Elements to Gamefulness: Defining "Gamification". In Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments. 9–15.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[15] Cliff Edwards. 2013. Valve Lines Up Console Partners in Challenge to Microsoft, Sony. https://www.bloomberg.com/news/articles/2013-11-04/valve-lines-up-console-partners-in-challenge-to-microsoft-sony. Accessed 26 January 2020.
[16] Eslam Elsawy, Moamen Mokhtar, and Walid Magdy. 2014. TweetMogaz v2: Identifying News Stories in Social Media. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management.
[17] Entertainment Software Association. 2018. Essential Facts About the Computer and Video Game Industry report. https://www.theesa.com/wp-content/uploads/2019/03/ESA_EssentialFacts_2018.pdf. Accessed: 5 Sep 2019.
[18] Angela Fan, David Grangier, and Michael Auli. 2017. Controllable Abstractive Summarization. In Proceedings of the ACL Workshop on Neural Machine Translation and Generation.
[19] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
[20] George Giannakopoulos, George Kiomourtzis, and Vangelis Karkaletsis. 2014. NewSum: "n-gram graph"-based summarization in the real world. In Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding. IGI Global, 205–230.
[21] George Giannakopoulos, Jeff Kubina, John Conroy, Josef Steinberger, Benoit Favre, Mijail Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. MultiLing 2015: multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 270–274.
[22] Maria Giatsoglou, Manolis G Vozalis, Konstantinos Diamantaras, Athena Vakali, George Sarigiannidis, and Konstantinos Ch. Chatzisavvas. 2017. Sentiment analysis leveraging emotions and word embeddings. Expert Systems with Applications 69 (2017), 214–224.
[23] Hongyu Han, Yongshi Zhang, Jianpei Zhang, Jing Yang, and Xiaomei Zou. 2018. Improving the performance of lexicon-based review sentiment analysis method by reducing additional introduced sentiment bias. PLOS ONE 13 (2018).
[24] Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 168–177.
[25] Ya-Han Hu, Yen-Liang Chen, and Hui-Ling Chou. 2017. Opinion mining from online hotel reviews – A text summarization approach. Information Processing & Management 53, 2 (2017), 436–449.
[26] San-Yih Hwang, Chia-Yu Lai, Jia-Jhe Jiang, and Shanlin Chang. 2014. The identification of Noteworthy Hotel Reviews for Hotel Management. Pacific Asia Journal of the Association for Information Systems 6 (2014).
[27] Jesper Juul. 2005. Half Real. Videogames between Real Rules and Fictional Worlds. MIT Press.
[28] Antonios Liapis, Georgios N. Yannakakis, Mark J. Nelson, Mike Preuss, and Rafael Bidarra. 2019. Orchestrating Game Generation. IEEE Transactions on Games 11, 1 (2019), 48–68.
[29] Walid Magdy and Tamer Elsayed. 2016. Unsupervised adaptive microblog filtering for broad dynamic topics. Information Processing & Management 52, 4 (2016), 513–528.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
[31] Subhabrata Mukherjee and Pushpak Bhattacharyya. 2012. Feature Specific Sentiment Analysis for Product Reviews. In Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, 475–487.
[32] Roberto Navigli and Paola Velardi. 2004. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics 30, 2 (2004), 151–179.
[33] Chikashi Nobata and Satoshi Sekine. 2004. CRL/NYU Summarization System at DUC-2004. In Document Understanding Workshop 2004.
[34] Karolina Owczarzak, John M Conroy, Hoa Trang Dang, and Ani Nenkova. 2012. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, 1–9.
[35] George Panagiotopoulos, George Giannakopoulos, and Antonios Liapis. 2019. A Study on Video Game Review Summarization. In Proceedings of the MultiLing Workshop.
[36] Janne Parkkila, Filip Radulovic, Daniel Garijo, María Poveda-Villalón, Jouni Ikonen, Jari Porras, and Asuncion Gomez-Perez. 2016. An ontology for videogame interoperability. Multimedia Tools and Applications 76 (2016).
[37] M.F. Porter. 2006. An algorithm for suffix stripping. Program 14 (2006), 130–137.
[38] Ligaj Pradhan, Chengcui Zhang, and Steven Bethard. 2016. Towards extracting coherent user concerns and their hierarchical organization from user reviews. In 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI). IEEE, 582–590.
[39] Owen Sacco, Antonios Liapis, and Georgios N. Yannakakis. 2017. Game Character Ontology (GCO): A Vocabulary for Extracting and Describing Game Character Information from Web Content. In Proceedings of the International Conference on Semantic Systems.
[40] Hassan Saif, Miriam Fernandez, and Harith Alani. 2015. Contextual semantics for sentiment analysis of Twitter. Information Processing & Management 52 (2015).
[41] Gerard Salton and C.S. Yang. 1973. On the specification of term values in automatic indexing. Journal of Documentation 29, 4 (1973), 351–372.
[42] David Sánchez, Montserrat Batet, David Isern, and Aida Valls. 2012. Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 39 (2012).
[43] Mark Sanderson and W. Bruce Croft. 1999. Deriving concept hierarchies from text. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
[44] Min Song, Won Chul Kim, Dahee Lee, Go Eun Heo, and Keun Young Kang. 2015. PKDE4J: Entity and relation extraction for public knowledge discovery. Journal of Biomedical Informatics 57 (2015), 320–332.
[45] Pero Subasic and Alison Huettner. 2001. Affect analysis of text using fuzzy semantic typing. IEEE Transactions on Fuzzy Systems 2 (2001), 483–496.
[46] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2012. Sentiment Strength Detection for the Social Web. Journal of the American Society for Information Science and Technology 63, 1 (2012), 163–173.
[47] Peter D. Turney. 2002. Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[49] Janyce Wiebe. 2000. Learning Subjective Adjectives from Corpora. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. 735–740.
[50] Kevin Yauris and Masayu Leylia Khodra. 2017. Aspect-based summarization for game review using double propagation. In Proceedings of the International Conference on Advanced Informatics, Concepts, Theory, and Applications.
[51] José P. Zagal, Noriko Tomuro, and Andriy Shepitsen. 2012. Natural Language Processing in Game Studies Research: An Overview. Simulation & Gaming 43, 3 (2012), 356–373.
[52] Elias Zavitsanos, Georgios Paliouras, George A Vouros, and Sergios Petridis. 2010. Learning subsumption hierarchies of ontology concepts from texts. Web Intelligence and Agent Systems: An International Journal 8, 1 (2010), 37–51.
[53] Lili Zhao and Chunping Li. 2009. Ontology Based Opinion Mining for Movie Reviews. In Knowledge Science, Engineering and Management. Springer Berlin Heidelberg, 204–214.
[54] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie Review Mining and Summarization. In Proceedings of the ACM International Conference on Information and Knowledge Management. 43–50.
[55] Zhen Zuo. 2018. Sentiment Analysis of Steam Review Datasets using Naive Bayes and Decision Tree Classifier. Technical Report. University of Illinois at Urbana–Champaign.