=Paper=
{{Paper
|id=Vol-2844/games5
|storemode=property
|title=Summarizing Game Reviews: First Contact
|pdfUrl=https://ceur-ws.org/Vol-2844/games5.pdf
|volume=Vol-2844
|authors=Aris Kosmopoulos,Antonios Liapis,George Giannakopoulos,Nikiforos Pittaras
|dblpUrl=https://dblp.org/rec/conf/setn/KosmopoulosLGP20
}}
==Summarizing Game Reviews: First Contact==
Aris Kosmopoulos, SciFY PNPC and NCSR Demokritos, Athens, Greece (akosmo@scify.org)
Antonios Liapis, Institute of Digital Games, University of Malta, Msida, Malta (antonios.liapis@um.edu.mt)
George Giannakopoulos, SciFY PNPC and NCSR Demokritos, Athens, Greece (ggianna@iit.demokritos.gr)
Nikiforos Pittaras, NCSR Demokritos and Kapodistrian University of Athens, Athens, Greece (npittaras@di.uoa.gr)

ABSTRACT
In recent years the number of players that are willing to submit a video game review has increased drastically. This is due to a combination of factors such as the raw increase of video gamers and the wide use of gaming platforms that facilitate the review submission process. The vast data produced by reviewers make extracting actionable knowledge difficult, both for companies and other players, especially if the extraction is to be completed in a timely and efficient manner. In this paper we experiment with a game review summarization pipeline that aims to automatically produce review summaries through aspect identification and sentiment analysis. We build upon early experiments on the feasibility of evaluation for the task, designing and performing the first evaluation of its kind. Thus, we apply variants of a main analysis pipeline on an appropriate dataset, studying the results to better understand possible future directions. To this end, we propose and implement an evaluation procedure regarding the produced summaries, creating a benchmark setting for future works on game review summarization.

CCS CONCEPTS
• Applied computing → Computer games; • Computing methodologies → Information extraction.

KEYWORDS
summarization, natural language processing, sentiment analysis, game reviews, Steam.

content, strategies, cheats, etc. This community-driven content often informs other users' purchases (e.g. via an aggregated review score) but is also carefully monitored by developers and publishers in order to gauge opinions on specific aspects of the game which can be patched or improved in updates to the game or in sequels. For both players and developers, being able to succinctly monitor other players' views is highly beneficial. The website www.metacritic.com aggregates reviews by players and professional critics, returning a percentage score for the game and highlighting diverse reviews along the spectrum of positive versus negative. The Steam platform also aggregates its users' reviews into different categories ('Mixed', 'Overwhelmingly Positive', 'Mostly Negative' etc.) which is another criterion for sorting and (likely) promoting games. The simple aggregation of reviews into a general score is important, but it obfuscates the nuances of the different reviewers' grievances and is of limited use to designers who wish to improve their game. This paper explores techniques for text summarization in order to provide a multi-dimensional and holistic summary of Steam reviews for a particular game.
We explore the topic of summarization for game reviews using a large dataset of Steam reviews from 12 selected games. The goal of the summarization pipeline is to extract users' views on different facets of games such as graphics, audio, and gameplay [28], leveraging textual sentiment analysis to identify positive and negative review snippets, creating a composite summary of indicative comments on a specific game facet. Unlike the nu-
merical aggregation of Metacritic or Steam, this approach extracts individual sentences (and criticisms) contained within a usually 1 INTRODUCTION dense review and attempts to classify those in terms of positive The ever-expanding popularity of digital games is evidenced by the or negative automatically (rather than based on the user’s binary large profit margins of the commercial game industry sector [4], recommendation). The presentation of the game’s summary, which the vast and diverse swathes of the population that play games [17], is split based on different aspects typically criticized in games, can and the appeal of games and gamification beyond the purposes be valuable for both players and designers. For players, the statis- of entertainment [13]. A large factor for the market penetration tics derived from this process (e.g. ratio of positive versus negative of digital games are distribution platforms such as Steam and the comments in one aspect) can act as an expanded game scoring Google Play Store. Not only do these distribution platforms allow system not unlike professional game reviews which gave a score to interested players to purchase and download new games, they also graphics, audio etc. For designers, the indicative comments split per cultivate a player community with players returning to rate and sentiment and aspect allows for a quick monitoring of players’ cur- comment on their favorite game or even contribute user-created rent favorite features. Moreover, the flexible way in which aspects are defined allows designers to explicitly redefine the keywords they are interested in, personalizing the summary to their design GAITECUS0, September 2–4, 2020, Athens priorities. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). There has been very limited attention to game review summa- rization, besides student projects [55]. Inspired by the only work that performs aspect-based game review summarization [50], this word order is considered in many other approaches as it can capture paper evaluates the outcomes of a straightforward summarization a word’s importance. For instance, the first and last sentences in a pipeline in a small-scale user survey. Using the twelve most re- larger document tend to be more important [33]. Other approaches viewed games in a 2017 dataset of Steam reviews, the resulting tag words on their part-of-speech (POS) [37], e.g. nouns (NN), verbs summaries are evaluated by a small set of experts. The paper stud- (VB), or adverbs (RB). This is useful for pre-processing, e.g. selecting ies pipeline variants to better sketch what is important in game only sentences with a noun and adjective as a corpus for review review summarization. Based on the outcomes of the different sum- summarization [25]. Another use of POS tags is to select N-grams marization processes, and a small-scale study where the different (i.e. a sequence of words) with specific parts of speech, such as a outcomes were compared, a number of potential improvements comparative adverb followed by an adjective [47]. were identified. The paper also highlights the many directions 2.1.2 Topic Modeling. Identifying the topic of a document, sen- which game review summarization research can follow so that it tence, or review is often necessary for clustering opinions on the can serve designers and players through different pipeline imple- topic together. 
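As an illustration of the POS-based selection mentioned in Section 2.1.1 (keeping sentences or N-grams with particular part-of-speech patterns, as in [25] and [47]), the sketch below uses NLTK's default tagger. It is not the cited authors' code: the tag patterns and the example sentence are assumptions.

```python
# Illustrative sketch of POS-pattern extraction with NLTK; tags follow the Penn Treebank
# set returned by nltk.pos_tag. Not the exact method of [25] or [47].
import nltk

for pkg in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(pkg, quiet=True)

def pos_bigrams(sentence, patterns=(("JJ", "NN"), ("RB", "JJ"))):
    """Return two-word phrases whose POS tags match one of the (tag1, tag2) patterns,
    e.g. adjective followed by noun, or (comparative) adverb followed by adjective."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # prefix match so that NNS also matches NN, RBR also matches RB, etc.
        if any(t1.startswith(a) and t2.startswith(b) for a, b in patterns):
            phrases.append(f"{w1} {w2}")
    return phrases

print(pos_bigrams("The visuals are gorgeous but the servers have terrible lag."))
```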
When the topics of interest are known in advance, mentations, alternative visualizations, bottom-up aspect discovery, experts usually provide the keywords used to filter the relevant doc- or text processing driven by domain knowledge. uments. For instance, TweetElect used an initial set of 38 keywords The paper is structured as follows. We start with a review of related to the 2016 US elections (including candidates’ names) for related works in Section 2. We then describe the proposed summa- streaming relevant tweets [11]. However, a boolean check whether rization pipeline and variants in Section 3. We describe the dataset a keyword is specifically mentioned is rarely sufficient due to the in Section 4 and present two different user studies in Sections 5 and nuances of language; query expansion is applied to create a larger set 6. We then discuss the results in Section 7 and conclude the paper of terms related to each original keyword [29]. Supervised learning in Section 8. is often applied for topic modelling, showing positive and negative examples of relevant documents to a classifier [29]. When topics are 2 RELATED WORK unknown and must be discovered from the data, a simple approach User reviews are a rich source of information, although the extrac- is to identify the most frequent terms and cluster emergent terms tion and analysis of this information can be challenging not only based on co-occurrence [16]. Probabilistic topic models such as due to the textual nature of the medium but also because users tend Latent Dirichlet Allocation (LDA) [8] can more efficiently discover to have a mixed opinion about various features [31]. Approaches topics without domain knowledge, following a bag-of-words ap- such as sentiment analysis as well as summarization have been proach which disregards word or document order. LDA randomly applied to various datasets, such as product reviews [24, 31], movie chooses a set of topics and decomposes the probability distribu- reviews [53, 54], or hotel reviews [25]. Section 2.1 surveys relevant tion matrix of words in a document into two matrices consisting approaches for the different phases of a summarization pipeline, of the distribution of topics in a document and the distribution of while Section 2.2 discusses the nuances of the Steam platform and words in a topic. Due to the vast number of possible topic structures, early work in game review summarization. For interested readers, sampling-based algorithms are used to find the sample topics which [25] provides a more thorough overview on review summarization best approximate the posterior distribution [7]. LDA has often been according to the type of corpora used as input. applied to find topics within reviews, primarily in order to identify review’s sentiments towards these topics, e.g. in [26]. 2.1 Summarization Pipeline 2.1.3 Sentiment Analysis. The sentiment behind utterances is im- Summarization can be extractive when relevant portions (usually portant for summarization, especially when the corpus is reviews sentences) of the input are copied and combined, or abstractive of any kind. Turney [47] highlighted that reviews may recommend when new text is generated to rephrase and summarize the input or not a certain product, movie, or travel destination; a summary [18]. The summarization pipeline requires a number of steps before therefore should account for both positive and negative reviews. 
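As a concrete illustration of the two representations discussed above, tf.idf weighting [41] and LDA topic discovery [8], the following sketch uses scikit-learn; the library choice and the toy documents are assumptions, not part of the cited works.

```python
# Minimal sketch: tf-idf features for similarity, and an LDA topic model on bag-of-words counts.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The graphics and animations are gorgeous.",
    "Servers crash and the lag is unbearable.",
    "Great soundtrack and solid voice acting.",
]

# tf-idf: term frequency multiplied by inverse document frequency
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)

# LDA operates on raw term counts, disregarding word and document order
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}: {top}")
```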
the raw textual input can produce a summary; algorithms and Turney’s study was the first to perform sentiment analysis on text- approaches for each step are discussed below. based reviews based on responses of the AltaVista internet search 2.1.1 Pre-processing and parsing. A fundamental step towards sum- query on how near the phrases were to the word ‘excellent’ (for marization (and natural language processing more broadly) is the recommended) and the word ‘poor’ (for not recommended). Man- pre-processing and extraction of features from the dataset. In the ually created lexica for words that express sentiment have been analysis below, the term documents is used to describe any type of used in conjunction with fuzzy logic, vector distance, etc. to clas- text, e.g. a sentence, a paragraph, or an academic paper. One pop- sify positive and negative [12, 45]. In the same context, there has ular if naive approach for pre-processing data is the bag-of-words been extensive work on extracting opinion words which express which collects all words in the document, disregarding their order subjective opinions within sentences [49]. It has been found that and grammar. This method counts the number of instances of the subjective sentences are statistically correlated with the presence same word, and the frequency of occurrence of each word is used of adjectives [49], and much research in product review summariza- as a feature to measure similarity between documents. Since many tion uses adjectives to determine sentiment polarity. For instance, words (such as articles or pronouns) are far more frequent in all Hu et al. [24] used a frequency-based algorithm to find relevant documents, terms are weighted based on their frequency via tf.idf domain features, and then extracted nearby adjectives to such do- [41] where the term frequency (𝑡 𝑓 ) is multiplied by the inverse main features. Using a labeled set of adjectives and expanding the document frequence (𝑖𝑑 𝑓 ). Unlike the bag-of-words approach, the initial set via WordNet, Hu et al. classified the extracted adjectives’ polarity and assigned that positive or negative sentiment to the in snippets or sentences, Part-of-Speech tagging, and other nearby domain feature. The SentiWordNet database is constructed similar tasks. based on the same principles of the domain-specific adjective clas- Aspect Identification which identifies interesting aspects (or sification of [24], using a manually annotated set of seed words and topics) in the reviews. These topics may be expressed as a using WordNet term relationships to expand the training set, which set of words, e.g. "visual, aesthetic, scenery" or "soundscape, is then used as the ground truth for machine learning classifiers [2]. audio experience, "sound effects". SentiWordNet, and similar general-purpose models for sentiment Aspect Labeling which assigns clear, descriptive labels to the prediction [46], have been used for polarity detection in reviews, discovered aspects. E.g. "graphics", "audio". e.g. in [23, 40]. Sentiment Analysis which gathers information related to the sentiment expressed within the reviews. This information 2.2 Steam Review Summarization may later be used to update the final summary appropriately. 
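The lexicon-based polarity detection discussed in Section 2.1.3 can be sketched with the SentiWordNet interface bundled in NLTK. This shows only a word-level lookup, not the construction procedure of [2], and averaging over all senses of a word is a simplifying assumption.

```python
# Minimal sketch of a SentiWordNet polarity lookup via NLTK.
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

def word_polarity(word, pos="a"):
    """Average positive-minus-negative score over all senses of `word` (adjective by default)."""
    synsets = list(swn.senti_synsets(word, pos))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

for adjective in ("excellent", "poor", "buggy"):
    print(adjective, round(word_polarity(adjective), 3))
```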
Since its 2003 release, the Steam platform has become the largest dig- For example, one may need only positive views in the sum- ital distribution platform for PC gaming [15], hosting over 34,000 mary, or—most probably—a sampling of all the views, be games and tens of millions of active users daily. This paper fo- they positive or negative. cuses on user-created reviews on Steam, although other initiatives Summary Creation which implies the process which, given such as the Steam workshop allow users to upload their mods or all the information gathered in previous steps, forms and strategies and comment on others’ content. User reviews can be renders the final summary for the user. submitted only by people that have purchased the game from Steam, Given the above pipeline, we implemented three different vari- although they are visible to all. As noted above, Steam aggregates ants. The first two are based on keyword detection and Clustering user reviews into a category and provides a number of companion (CL). The first variant does not do Sentiment Analysis, while the statistics, including a timeline of reviewer’s scores. Reviews them- second one uses the full pipeline. The last one is another full pipe selves consist of a single binary recommendation (Recommended method based on Deep Learning (DL) that focuses on improving on versus Not Recommended) and a text explaining the user’s opinion. Aspect Labeling and Summary Creation steps. Other users can review the quality of the review itself by tagging it helpful, not helpful, funny, or breaking the Rules of Conduct. By 3.1 CL pipeline default, Steam shows the most helpful reviews submitted within the last 30 days, although users can also choose to sort reviews by During the preprocessing step, each review is split into sentences, other criteria. each sentence is cleaned in order to create the basic elements on As noted in the introduction, there is no systematic academic which the final summaries will be based. The cleaning process research in Steam review summarization. To the best of our knowl- included of some character replacements so that each sentence edge, the only academic publication that tackles the problem of could be presentable (e.x. starting with a capital letter and ending aspect-based summarization on such data is by Yauris and Kho- with a period) even if it originated from a larger sentence that was dra [50]. In their approach, only relevant portions of sentences split during sentence splitting. Moreover, preprocessing prepared were extracted via conditions applied on text tagged via Parts of the lemmatized versions of the sentences which are used for aspect Speech; these portions were usually small, e.g. the phrase could be detection. In these lemmatized sentences, general stop words are “amount of content” [50]. Similar to our approach, a pre-specified removed. For all preprocessing steps, we used the default functions set of keywords are used for aspect categorization. The aspects (and stop word lists) of the nltk Python library [5]. and keywords are similar but not identical to our approach (e.g. The aspect detection process is split into two parts: aspect iden- the aspects in [50] are gameplay, story, graphic, music, community, tification and aspect labeling. 
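A minimal sketch of this preprocessing step follows, assuming the NLTK defaults stated above; the exact cleaning rules are not listed in the paper, so the capitalization and punctuation fixes and the example review are illustrative.

```python
# Minimal preprocessing sketch: sentence splitting, light cleaning, lemmatization,
# and stop word removal with NLTK defaults.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(review_text):
    """Return presentable sentences plus lemmatized, stop-word-free token lists."""
    cleaned, lemmatized = [], []
    for s in nltk.sent_tokenize(review_text):
        s = s.strip()
        if not s:
            continue
        # make the sentence presentable: capitalize and terminate with a period
        pretty = s[0].upper() + s[1:]
        if pretty[-1] not in ".!?":
            pretty += "."
        cleaned.append(pretty)
        tokens = [lemmatizer.lemmatize(w.lower()) for w in nltk.word_tokenize(s) if w.isalpha()]
        lemmatized.append([t for t in tokens if t not in STOP])
    return cleaned, lemmatized

print(preprocess("the gunplay feels great. servers keep crashing though"))
```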
Aspect identification splits sentences and general/others), while choosing the aspect described in the into sets that focus on a specific aspect while aspect labeling iden- phrase was based on the cosine similarity from each word of the tifies this aspect in order to present it to the final review summary. phrase to the aspect’s keywords. The output summary consists Our approach uses a predefined set of aspects, presented in of many aspects (most of which are outside the pre-specified key- Table 1. We selected these six aspects since they are well-established words) and a single adjective for each, unlike our current work facets of games [28] and are popular dimensions within professional which extracts complete sentences with different polarities. The reviews. summarization pipeline was tested on a single game (Skyrim), ex- A simple approach for aspect labeling is to use a dictionary of ploring different sentiment extraction approaches using precision keywords per aspect as the ones presented in Table 1. In order to and recall as performance metrics. While our current work does not be able to include sentences even when they do not include the explore as many parameters for sentiment analysis, it is the first exact keywords, a k-means clustering is applied to all sentences to instance where game review summaries are evaluated by humans find clusters with similar text. Terms are weighted based on their in a small-scale but thorough user study. frequency via 𝑡 𝑓 .𝑖𝑑 𝑓 , which has been used extensively for sentence similarity in bag-of-words approaches (see Section 2.1). The result is 𝐾 clusters of sentences with similar words to each other; in all 3 SUMMARIZATION PIPELINES our experiments we set 𝐾 = 20 based on prior evidence [35]. Once Figure 1 visualizes the main components of our pipeline: sentences are all assigned a cluster based on the distance to the Preprocessing which aims to prepare the input reviews for center, all sentences in all clusters are processed in the following further analysis. This may imply cleaning, chunking text fashion: Community Community Graphics Graphics Gameplay Gameplay 1 Review 2 Aspect 3 Aspect 4 Sen�ment 5 Summary Preprocessing Iden�fica�on Labeling Analysis Crea�on Figure 1: The full pipeline represents both the Clustering variant (CL Full) and the Deep Learning variant (DL Full), while variant CL AsDe produces summaries by skipping the Sentiment Analysis step. Aspect Keywords 𝑁 sentences at random from each aspect’s set. A sample CL AsDe Graphics graphic, visual, aesthetic, animation, scenery summary can be found in Table 2 for Tom Clancy’s The Division. Gameplay mission, item, map, weapon, mode, multiplayer, The next step of the process is Sentiment Analysis, which is used control by the next summarization variant (CL Full). Using the different sets Audio audio, sound, music, soundtrack, melody, voice of candidate sentences per aspect, the sentiment polarity (positive or Community community, toxic, friendly negative) of each sentence is calculated by averaging the sentiment Performance server, bug, connection, lag, latency, ping, crash, score of each word it contains. As above, sentiment analysis of each glitch, optimization word is done via the default functions of the nltk Python library [5]. Story dialog, romance, ending, cutscene, story The library calculates probabilities for each polarity class (positive, Table 1: Aspects and keywords used for the identification of neutral, negative). 
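The keyword- and clustering-based aspect labeling just described (the Table 1 dictionary, tf.idf vectors, k-means with K = 20, and the assignment cases listed in Section 3.1) can be sketched as follows; the reduced keyword lists, variable names, and toy sentences are assumptions for brevity.

```python
# Minimal sketch of CL aspect labeling: keyword matching plus tf-idf/k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

ASPECT_KEYWORDS = {
    "Graphics": {"graphic", "visual", "aesthetic", "animation", "scenery"},
    "Audio": {"audio", "sound", "music", "soundtrack", "melody", "voice"},
}

def label_by_keywords(lemmas):
    """Return the single aspect whose keywords appear in the sentence, or a flag."""
    hits = [a for a, kws in ASPECT_KEYWORDS.items() if kws & set(lemmas)]
    if len(hits) == 1:
        return hits[0]       # case (1): unambiguous keyword match
    if len(hits) > 1:
        return "conflict"    # case (2): mixed aspects, sentence is dropped
    return None              # case (3): fall back to the cluster's majority aspect

def cluster_sentences(sentences, k=20):
    """Assign each sentence to one of k tf-idf/k-means clusters (K = 20 in the paper)."""
    k = min(k, len(sentences))
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

sentences = ["The visuals are stunning", "Great soundtrack", "Runs fine on my old PC"]
print(cluster_sentences(sentences, k=2))
print([label_by_keywords(s.lower().split()) for s in sentences])
```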
We took into account sentences which were dominant aspects in review clusters. assigned a class with a probability of at least 0.5. In order to select a number of sentences per category, a 𝑘-means clustering approach (using 𝑡 𝑓 .𝑖𝑑 𝑓 ) is applied within the set of sentences with the same polarity. In the CL Full implementation of this paper, only two sentences per polarity are selected (𝑘 = 2) as the ones closest to (1) If the sentence contains the exact keywords of only one each cluster’s centroid. If there exist sufficient positive and negative aspect, the sentence is assigned to that aspect and is flagged sentences, then this approach returns 6 sentences as bullet points. as a candidate that can be used by the summary of that Note that if fewer than two sentences are above the threshold for aspect. positive (or below the threshold, for negative) then fewer sentences (2) If keywords from multiple aspects are found in the sentence, may be included in the summary. An example summary from CL the sentence is flagged as an unsuitable candidate for any Full variant can be found in Table 2 for Tom Clancy’s The Division. summary and removed. (3) If no aspect keywords are found in the sentence, the most common aspect within the sentences of the same cluster will be used to label this sentence and flag it as a candidate. For 3.2 DL pipeline instance, if a sentence does not contain any keyword, but After experimenting with the first two variant pipelines and taking sentences in its cluster predominantly belong to the aspect into account the feedback of the first user study (see Section 5), we Gameplay via case (1), then the sentence is also assigned to decided to focus on improving the following: the same aspect and flagged as a candidate. Using the sentences from cases (1) and (3), a set of candidate • Keyword detection and clustering based Aspect Labeling sentences is created per aspect. Using these sets, the first varia- must be improved to avoid sentences such as "If those things tion of our pipeline could now produce a summary. This variation, all sound good to you you will like the game." to be labeled named Clustering Aspect Detection summary (CL AsDe), chooses as audio sentences. - In a few words the game is single dimensional this might sound vague but it • The final summary should somehow provide information becomes apparent that there is not much depth as you play once you’re a couple regarding the whole sentiment of the given aspect and not hours in. just by the selected sentences. - Clothes sound "right" when you move in them. - They sound good and looked good with ability to mod for better stats or even • The final summary should use a better sentence extraction rerolling stats. approach in order to deal with redundancy. - They have improved the pve portion of the game and crazy as it sounds the pvp too. Taking all the above into account, the DL pipeline makes changes - No music and something feels so strangely abandnded about it. to the Aspect Detection and Summary Creation steps of the CL - Like how if there’s a blizzard your cap and shoulder will be covered in snow and that npc voices will echo when they are standing in hallways with hollow walls. pipeline described in Section 3.1. - Very good voice acting. For Aspect Detection, we used the BERT model [14] to generate - Great abilities pretty good sounds; indoor echos reverb off objects etc. embeddings for game reviews. BERT is a deep neural language - If those things all sound good to you you will like the game. 
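A sketch of the CL Full selection step described above: sentences whose polarity probability clears the 0.5 threshold are clustered with k-means (k = 2) on tf.idf vectors, and the sentence nearest each centroid is kept. NLTK's VADER analyzer is used here as one plausible reading of "the default functions of the nltk Python library", and applying the threshold to VADER's positive-proportion score is an assumption.

```python
# Minimal sketch of sentiment-filtered representative-sentence selection.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def pick_representatives(sentences, polarity="pos", threshold=0.5, k=2):
    """Keep sentences whose `polarity` score is at least `threshold`, cluster them with
    k-means on tf-idf vectors, and return the sentence closest to each centroid."""
    kept = [s for s in sentences if sia.polarity_scores(s)[polarity] >= threshold]
    if len(kept) <= k:
        return kept
    X = TfidfVectorizer().fit_transform(kept).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
    return [kept[i] for i in closest]

audio_sentences = [
    "Superb voice acting and ambient city sounds.",
    "The soundtrack is fantastic.",
    "No music and something feels abandoned about it.",
]
print(pick_representatives(audio_sentences, polarity="pos"))
```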
- Superb voice acting and ambient city sounds are also a good plus for this game. model that uses a bidirectional, multilayer transformer architecture, - It sounds hyperbolic but I’m being dead serious. exploiting cross and self-attention to capture word interdependen- - Sounds terrible right - Most opinions are positive regarding audio. cies effectively [3, 48]. The approach relies on multi-head attention - The voice acting in the game is in the higher tiers as is most ubisoft games. modules for sequence encoding modelling, with word order infor- - There are not a lot of different voices and some of the voice acting for them is mation being retained with additive positional encoding vectors. bad. - Ubisoft - bugs - the textures are so fucked up that nobody can play this game BERT is trained in an unsupervised setting on large quantities of anymore. English text, using masked language modelling and next sentence - And it clearly shows I want to play it and that I try to. -I’m gonna be honest the cinematics are pretty great. prediction objectives. These tasks require the prediction of hidden Table 2: Summaries generated by different pipelines, for as- sequence tokens and the generation of an entire sequence, given an pect Audio of Tom Clancy’s The Division. From top to bot- input sequence (e.g. for tasks such as question-answering and text tom: CL AsDe (only aspect detection), CL Full (aspect detec- entailment, etc.). This pretraining scheme and architecture have tion with sentiment analysis) and DL Full (Deep learning been shown to perform exceptionally well for a variety of natural combined with a sophisticated summarizer). language understanding tasks. To obtain the representation for a game review, we feed the text to the model using a sequence length of 16 tokens. We use the 𝐵𝐸𝑅𝑇𝐵𝐴𝑆𝐸 model variant, that produces 768-dimensional sequence - Mixed: 𝑃¯ ≈ 0, high standard deviation. embeddings, learned during training for classification purposes. - Mostly neutral: 𝑃¯ ≈ 0, low standard deviation. The implementation and pretrained model utilized are provided by - Mostly positive: 𝑃¯ > 0 above a threshold. the transformers software package from huggingface1 . Using the - Mostly negative: 𝑃¯ < 0 below a threshold. produced embeddings as features we trained a binary Ridge Logistic The final summary is composed by randomly shuffling these 6 Regression classifier [19] (one vs all) for each aspect. We also trained sentences. An example summary from DL Full variant can be found a seventh classifier to detect sentences unfit for any aspect. For each in Table 2 for Tom Clancy’s The Division. candidate sentence a confidence score was calculated by each aspect classifier. Only sentences with a high prediction confidence in the given aspect and a low confidence on each other classifier were 4 DATASET selected as summary candidates for the next steps of the pipeline. As a first demonstration of the summarization pipeline, we follow During the Summary Creation we applied the following strat- [35] and select the most helpful reviews on Steam, splitting them per egy to the 100 most probable candidate sentences of each aspect. game. This paper parses the Steam review dataset gathered by Zuo First, the NewSum Toolkit [20] was used to select the sentences [55], which consists of over 7 million reviews obtained via Steam’s that provide the most representative information. NewSum uses API. 
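The DL aspect classification step can be sketched as follows, under stated assumptions: the bert-base-uncased checkpoint from huggingface, the [CLS] vector as the 768-dimensional sentence embedding, and scikit-learn's L2-regularized LogisticRegression standing in for the "Ridge Logistic Regression" classifier [19] trained one-vs-all per aspect; the toy training labels are hypothetical.

```python
# Minimal sketch: BERT sentence embeddings (sequence length 16) fed to a per-aspect classifier.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(sentences, max_len=16):
    """Encode sentences with BERT-base and return 768-dim [CLS] embeddings."""
    enc = tokenizer(sentences, padding=True, truncation=True, max_length=max_len,
                    return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0, :].numpy()

# toy labelled data: 1 = sentence is about the Audio aspect (hypothetical labels)
train_sentences = ["The soundtrack is great", "Servers crash constantly",
                   "Voice acting is superb", "The map design is clever"]
train_is_audio = [1, 0, 1, 0]

audio_clf = LogisticRegression(penalty="l2", max_iter=1000).fit(
    embed(train_sentences), train_is_audio)

# confidence that a candidate sentence belongs to the Audio aspect
print(audio_clf.predict_proba(embed(["Great ambient city sounds"]))[0, 1])
```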
Each review text comes with a plethora of features concerning language-agnostic methods based on n-gram graphs, that not only both the game being reviewed and the reviewer, although only a extract the most representative sentences, but also deal with redun- subset of features is used for this experiment. Since Steam users dancy. In the end we had 20 candidate sentences per Aspect. The can vote a review as helpful, unhelpful, or spam, we only consider final summary was composed by 6 sentences using the following ‘valid’ reviews those with 10 or more user votes as ‘helpful’. With strategy: this criterion (minimum of at least 1000 of ‘helpful’ reviews), we select twelve games with the most valid reviews (see Table 3). The • Select the most positive sentence (Sentiment Analysis). games selected have a desirable diversity both in terms of genres • Select the most negative sentence (Sentiment Analysis). (shooting, survival, adventure, open-world, multi-player, single- • Select the first 3 sentences provided by NewSum Tookit player, etc.) and in terms of general audience reception (shown (excluding the previously selected sentences). by the Metacritic score which aggregates professional and users’ • Create an artificial sentence using the polarities provided by reviews). Sentiment Analysis of all the aspect sentences. The polarity For each of the selected games we selected to keep the 10 thou- of each sentence was mapped as 1, 0 or -1 (positive, neutral, sand most up-voted reviews. As already discussed in Section 3 each negative) using thresholds. Given an Aspect and the mean of these reviews was split into sentences to create a sentence pool Polarity score 𝑃, ¯ the possible produced sentences reflect per game. On average, the sentence pool consisted of around 50 opinions that fall in the following categories: thousand sentences per game. The smallest pool of sentences was 1 https://huggingface.co/ for PAYDAY 2 (37K), while the largest one was for Elite Dangerous Game Title Publisher Year Reviews MC No Man’s Sky Hello Games 2016 4146 61% DayZ Bohemia Interac- 2018 3349 – tive PAYDAY 2 Starbreeze 2017 2573 79% ARK: Survival Evolved Studio Wildcard 2017 2368 70% Grand Theft Auto V Rockstar Games 2015 2104 96% Firewatch Campo Santo 2016 1599 81% Darkest Dungeon Red Hook Studios 2016 1564 84% Just Survive Daybreak Game 2015 1463 – Company Killing Floor 2 Tripwire Interac- 2016 1276 75% tive Elite Dangerous Frontier Develop- 2015 1270 80% ments Tom Clancy’s ‘The Di- Ubisoft 2016 1091 79% vision’ Subnautica Unknown Worlds 2018 1056 87% Figure 2: User interface for online evaluation of summaries Entertainment produced by CL AsDe and CL Full methods. Table 3: Games selected from the dataset, sorted by the num- ber of ‘valid’ reviews (10 or more ‘helpful’ votes). The Meta- of summaries. We initialized the system by providing two sets critic score (MC) is included for reference. of summaries A, B, one from system 𝐴 and one from system 𝐵. Each summary in A corresponded to a summary in B, as they both summarize the same set of reviews and the same aspect (e.g. the aspect Graphics of DayZ). During the experiment, each system’s (70K). The average length of the sentences was 85.7 in characters summary was randomly placed first or second to minimize any bias and 16.4 in words. In terms of both characters and words, the longest related ordering effect. 
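A minimal sketch of the dataset selection described in Section 4, assuming the Zuo [55] dump is loaded into a pandas DataFrame; the column names (game, review_text, helpful_votes) are hypothetical, while the thresholds follow the text (at least 10 'helpful' votes per review, at least 1000 valid reviews per game, twelve games, and the 10 thousand most up-voted reviews of each).

```python
# Minimal sketch of filtering 'valid' reviews and selecting the most reviewed games.
import pandas as pd

def select_games(reviews: pd.DataFrame, min_helpful=10, min_valid=1000, n_games=12,
                 reviews_per_game=10_000):
    """Keep reviews with >= min_helpful helpful votes, pick the n_games games with the
    most valid reviews, and keep the most up-voted reviews of each."""
    valid = reviews[reviews["helpful_votes"] >= min_helpful]
    counts = valid["game"].value_counts()
    chosen = counts[counts >= min_valid].head(n_games).index
    pool = valid[valid["game"].isin(chosen)]
    return (pool.sort_values("helpful_votes", ascending=False)
                .groupby("game", group_keys=False)
                .head(reviews_per_game))

# toy data for illustration only
reviews = pd.DataFrame({
    "game": ["DayZ", "DayZ", "Firewatch"],
    "review_text": ["Buggy but fun", "Great atmosphere", "Beautiful story"],
    "helpful_votes": [42, 3, 120],
})
print(select_games(reviews, min_valid=1, n_games=2))
```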
sentences were those of Darkest Dungeon (average of 91.8 char- The UI also informed the user of the title of the game being acters and 17.3 words) and the shortest ones were those of Just summarized, plus the aspect (e.g. Graphics). The user was then Survive (average of 79.9 characters and 15.6 words). called to select their preferred summary (A or B) and explain the In terms of aspects, the most common one was Gameplay on av- reasons for this preference. For the latter annotation, the user could erage. Performance was the next most popular aspect and in certain select one or more tickboxes among the following options: games such as ARK: Survival Evolved it was the most popular one. The least popular aspect was Audio with a ratio of 1 to 5 compared • It repeats less the same information (Less Redundant) to the Gameplay aspect. • It seems to be more coherent and/or complete In terms of sentiment, the majority of sentences were more • For other (or even unclear) reasons neutral than positive or negative. Between positive and negative The first two options aim to assess whether redundancy is a con- sentiment, no general safe conclusions can be drawn since the re- cern and, similarly, whether coherence and completeness are useful sults varied given different combinations of aspects and games. in the task. Redundancy has been traditionally a summarization In general, we can say that the aspect Performance was character- evaluation indicator [1], especially in multi-document summariza- ized as negative more frequently. The opposite was true for the tion. The completeness and coherence aspect is essentially a (more aspect Graphics. On the other hand the sentiment ratio (positive vs nuanced) version of overall responsiveness, as this has been used negative) towards the aspect Community varied between different in DUC/TAC summarization tracks and related work [10]. games. 5.2 Participants 5 FIRST USER STUDY The evaluation was carried out by eight adult evaluators (3 female), As a first experiment, we evaluated the two variations of the CL fluent in English, with gaming experience. The evaluators were pipeline (CL AsDe and CL Full) in a small-scale user-study with selected explicitly among the authors’ network of contacts and in- summaries of aspects of the 12 games of Table 3. vited directly by the authors. Participants were asked to connect to the online system and evaluate all 72 pairs of summaries (produced 5.1 Annotation Protocol by CL AsDe and CL Full), which covered all predefined aspects (see A pairwise comparison process was followed, rather than a scale- Table 1) of the ten games in Table 3. There was no time limit for based rating approach, due to (a) evidence that comparison-based completing the evaluation, but there was a requirement that all evaluation can be less demanding cognitively [9] and (b) a rich body pairs were evaluated in a single session. of literature that has applied pairwise evaluation for summarization tasks [34] (e.g. the single document summarization task in [21]). 5.3 Results To this end, we created an online evaluation user interface (UI) The data collected from the experiment was a total of 576 obser- (see Figure 2) which supported comparative pairwise evaluation vations, including the preference of each evaluator for each pair Aspect CL Full CL AsDe To get a better understanding of the reasons annotators gave Audio 43% 57% regarding their preference, we looked further into the statistics Community 51% 49% of the winning observations of CL AsDe vs. CL Full. 
When AsDe Gameplay 55% 45% was preferred, annotators explained their preference mainly due to Graphics 30% 70% better coherence (63%), lower redundancy (28%), but also ‘other rea- Performance 54% 46% sons’ (26%). When CL Full was preferred, annotators chose ‘other Story 49% 51% reasons’ (50%), and less often coherence (41%) or low redundancy Overall 47% 53% (17%). This finding shows that summaries by AsDe were more co- Table 4: First user study: annotators’ preference of one sum- herent but annotators still preferred summaries by CL Full often marization algorithm over the other, per aspect and overall. for other reasons. This points to a limitation of the experimental protocol, as the interface did not provide annotators with enough options to allow them to explain their reasons for their summary preference. This was addressed in the second user study (see Sec- Df F value 𝑝 value tion 6) with an extra option on the UI. It should be noted that better game 11 1.519 0.120 coherence was selected far more often overall (53% of instances) aspect 5 3.912 0.001 * than lower redundancy (23%), while ‘other reasons’ were also cho- evaluator 6 7.945 0.000 * sen often (37%). Redundancy and coherence were chosen together coherence 1 18.6491 0.000 * in only 5% of instances, and thus it is evident that these two axes of redundancy 1 5.7604 0.017 * evaluation are fairly independent. These findings, coupled with the other 1 0.5639 0.453 statistically significant influence (via ANOVA) between preference Table 5: Analysis of variance between the preference of of summarization approach and tagged coherence and redundancy, one approach and different factors. Significant findings are support our conclusion that both coherence and redundancy were shown with an asterisk. The analysis is made on the F statis- important factors for annotators’ preference. tic and the degrees of freedom (Df) are also noted. 6 SECOND USER STUDY Based on the findings and limitations identified in the first user study, conducted a second study with more participants but fewer of summaries and the reasons for this choice. The primary goals games, testing the best CL approaches with the novel DL Full of the user study are to assess (a) whether the annotators prefer pipeline. Due to participants’ concerns on the long duration of one of the two summarization approaches, (b) which criteria they the 72-item survey in the first experiment, we opted to use only explicitly (via the three tickboxes) or implicitly (based on properties two games to lower the time required from annotators; it is ex- of the summary) consider when selecting their preference. Towards pected that fatigue would likely introduce noise to the participants’ this end, the data is processed based on the 8 users’ annotations responses. Details on how the games and annotation options were on 72 game/aspect pairs (for a total of 576 data points), and all chosen are detailed in Section 6.1. statistical tests are performed at a 5% significance threshold. Our assumption is that the complete CL pipeline which includes both aspect detection and sentiment analysis will offer a richer and more 6.1 Annotation Protocol diverse summary than AsDe alone. The user interface for the second user study was largely the same as Regarding users’ preference of one summarization technique, in the first (see Section 5.1). 
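The per-factor analysis of variance reported in Section 5.3 (Table 5) could be run along these lines with statsmodels; the package choice and the toy judgement rows are assumptions, not the study's data.

```python
# Minimal sketch: ANOVA of binary preference against aspect and tagged reasons.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# each row: one pairwise judgement (1 = CL Full preferred, 0 = CL AsDe preferred); toy values
data = pd.DataFrame({
    "preference": [1, 0, 1, 1, 0, 0, 1, 0],
    "aspect":     ["Audio", "Audio", "Graphics", "Story", "Graphics", "Story",
                   "Gameplay", "Gameplay"],
    "coherence":  [1, 0, 1, 1, 0, 1, 0, 0],
})

model = ols("preference ~ C(aspect) + coherence", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # F statistic and p-value per factor
```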
Based on the first study’s finding that results were mixed: overall, annotators had no clear preference with ‘other reasons’ for an annotator’s preference were often chosen, a CL AsDe being marginally more often selected (53%). Table 4 shows fourth option was added to the UI as a tickbox stating “The summary the distribution of selection of CL Full split per aspect. The Table was more focused and contained less irrelevant information.” We shows that the main factor for the skew of the overall preference refer to this additional option as Focus in the analysis that follows. towards CL AsDe was the graphics summaries, as the other aspects As noted above, to reduce the time required for the study only are fairly evenly preferred between the two approaches. two games were chosen to be annotated. We chose among the To further assess which factors led to the annotators’ preference games from the first user study, taking the game where CL Full had of one summary over the other, we conducted an analysis of vari- the highest preference (Tom Clancy’s The Division, where CL Full ance test (ANOVA) between the preferred approach (represented was chosen 60% of the time) and the game where CL AsDe had the as a binary choice) and other features such as the aspect. Table 5 highest preference (Elite Dangerous, where CL AsDe was chosen shows the results in terms of significant differences, and verifies 60% of the time). For each of the two games, the preferred method that there is a systematic influence between the aspect and prefer- was chosen to present to the user, juxtaposed with the summary ence. On the other hand, the game does not seem to affect users’ for the same game and aspect produced by DL Full. Therefore, the preference of one summary or the other; this is a promising finding participant had to annotate 12 items, 6 aspects for Tom Clancy’s the as the methods are supposed to be applicable to any game. There Division comparing the CL Full summary with the DL Full summary is also a clear evidence that preference was highly varying from and 6 aspects for Elite Dangerous comparing the CL AsDe summary annotator to annotator, and annotators rarely agreed with each with the DL Full summary. The rationale was to select the most other even in this simple pair-wise preference task. successful game summaries (for both CL variants) and compare them with the novel DL pipeline. We refer to CL and DL summaries of pre-specified game facets. two small-scale user surveys exam- in this paper, referring to the best CL summary (CL Full or CL AsDe) ined the preference of users in the presence of different pipeline as shown to the user. implementations. Results indicate that (a) aspect extraction is im- As with the first user study, the order of the two options was portant for summarization, although deep-learning does not neces- randomized (i.e. sometimes CL summaries were shown first, some- sarily improve the aspect extraction process compared to a simpler times second). Unlike the previous experiment, however, the order clustering-based method; (b) between the clustering-based pipeline of the sentences within the same summary was also randomized; variants (CL AsDe, CL Full), there was no clear winner with respect the rationale was to avoid ordering effects when the participant to the summary outputs; (c) evaluators had strong and individual starts by reading an incoherent sentence first. 
opinions on which variant was better; (d) sentiment-based crite- ria and/or confidence-based criteria for selecting sentences do not seem to perform better than the random selection performed by CL 6.2 Participants AsDe. Fourteen participants completed this annotation task. Unlike the While the aspects chosen for this experiment were intuitive, previous study, a snowball method for soliciting participants was based on typical facets of games that players and professional crit- followed, soliciting feedback from a broader group. Thus, this study ics focus on, some of the resulting aspect-based summaries were lacks data on the demographics and gaming experience of partici- less coherent than others. The choice to assign a sentence to an pants, although participants were all adults and had experience in aspect even if its cluster only had a slim majority in keyword fre- data analysis and artificial intelligence. quency likely introduced inconsistency. For CL aspect detection, the most significant factor for the lack of coherence was the choice 6.3 Results of keywords. Specifically, the keyword “sound” was often found in The data collected from the experiment was a total of 168 obser- sentences unrelated to game audio, used as a verb: e.g. “On paper vations. Overall CL summaries were slightly more preferred by this game sounds great”. To a degree, such artefacts were removed in participants (55%), although the difference is not statistically signif- the DL aspect detection pipeline via (a) the latent sentence represen- icant (Paired t-test, p-value 0.22). Interestingly, for Elite Dangerous tation and (b) fine-tuning the model based on manual annotations (which was summarized by CL AsDe) the difference was more pro- on this specific corpus. However, a more sophisticated method for nounced (CL AsDe preferred 60% of the time over DL Full); for Tom aspect detection seems necessary. For instance, an adaptive query Clancy’s The Division the two methods (CL Full and DL Full) were expansion as followed by [29] could create a much larger set of key- chosen evenly. Since only one game was tested per CL variant, it words automatically, although it may overlook the nuances of game is difficult to assess whether the preference was due to the game terminology. On the other hand, a Word2Vec model [30] trained itself or the sentiment-based selection component. Moreover, while on the entire corpus of steam reviews (or even larger game-related DL Full includes sentiment-based selection, this part accounts for 2 corpora such as game FAQs and fansites) could be used to derive of the 6 sentences and thus it is even more difficult to estimate the a similarity score with specific aspects. Building a game ontology reasons for the users’ preference. This ambiguity points to further for this task or using an existing one [36, 39] could further assist refinements needed for the annotation protocol which is discussed in discovering more keywords or in calculating an ontology-based in Section 7. semantic similarity measure [42]. Finally, a completely different In terms of the reasons offered by participants for their choice, direction could see the discovery of topics specific to each game coherence was still most commonly chosen (62% of responses), fol- rather than focusing on the same pre-specified topics every time. lowed closely by focus (56%). Low redundancy was chosen less This would be valuable as different genres have a different focus often (23%), while ‘other reasons’ are chosen only in 14% of re- (e.g. 
multiplayer games focus on balance or lag, while horror games sponses). The addition of the focus option seems to have mitigated focus on the emotional response), but could make it difficult to main- the prevalence of ‘other reasons’ in the first study. Unlike the first tain the same presentation format across games and thus confuse study, however, low redundancy was often chosen in conjunction end-users. with one other reason (56% of the time) or two other reasons (36% Sentiment analysis was also often problematic, primarily due to of the time). Combined with its low overall prevalence, it is possible the informal and idiosyncratic language that games reviews were of- that low redundancy may now longer be necessary as a separate ten in. Reviews are often rife with sarcasm and negation, e.g. “Have reason in the UI, although a broader user study with more games fun spending huge amounts of hours for very little progress.”. More- is needed to validate this hypothesis. over, many reviews’ sentences have poor syntax and are very short Pearson’s Chi-squared tests were also used in order to test whether or very long (e.g. “Good: + great aesthetic.”). Sentiment analysis any of the above reasons is correlated to the preferred summary. treated the sentence as a bag-of-words, exacerbating the problem. In Only redundancy was found to be correlated with the type of sum- general, sentiment analysis can not capture negation or sarcasm and mary (p-value 0.001). This clearly indicates the importance of han- handles incomplete sentences poorly. Performance would likely be dling redundancy satisfyingly in any future approach. improved with a more appropriate pre-trained lexicon for informal utterances on the Social Web, such as SentiStrength [46] or other sentiment- and negation-aware approaches [22]. Alternatively, a 7 DISCUSSION custom classifier for sentiment analysis could be trained using text This paper introduced a number of possible pipelines for iden- from a Steam review as input and the user’s recommendation as tifying, grouping, and extracting the opinions of users in terms polarity. Complementing the training set with experts’ annotations could refine such a model, especially when dealing with sarcasm. There are many directions for future research depending on the Another promising alternative to SentiWordNet for sentiment anal- purpose of the game review summarization. As a tool for game eval- ysis would be the use of an authored dictionary of opinion words uation, primarily targeted towards players or producers, the game’s [24] or game-specific adjectives annotated in terms of polarity [51]. context is important in order to choose which reviews or topics to Our findings also showed no clear winner between the two CL highlight. Additional research in this vein would need to find topics variants or between CL and DL summaries. These ambiguity of the or patterns in similar games (e.g. of the same genre, publisher, or findings could well be by-products of the experimental protocol publication date) and then to compare the current game’s reviews followed. Findings from the first user study pointed to a missing in terms of those topics or compared to other games’ reviews. 
User reason for players to report, and the second study included a “focus” experience research would also be important to find how best to reason which improved the quality of the data collected but raised present such results, as interactive summaries where the user can questions about the importance of the “low redundancy” reason. zoom in and out into different games and/or different topics within The users’ reported fatigue in the first experiment led to fewer games would make the summaries more intuitive and manageable. items in the second study to alleviate the burden from annotators. As a tool for game analysis, bottom-up probabilistic topic modelling However, this increased the locality of the findings in the second [7] in games of the same genre could help identify design patterns study as it was unclear whether preferences were due to the game or [6] and players’ expectations based on their repertoire [27]. As a the algorithm. In future studies, summaries for more games should tool for knowledge discovery, game reviews can serve as raw text be annotated by more participants, showing only two games to each or multi-modal corpora from which structured data can be automat- user but randomizing which games are shown when the user starts ically extracted as entities and relations [44], concept hierarchies the study. More importantly, the current experimental protocol [43, 52], or even a complete game ontology [32, 38]. forces participants to select one review as preferred and provide at least one reason. The forced choice between two summaries 8 CONCLUSION does not allow the user to provide more nuanced feedback. A four- This paper highlighted the challenges and opportunities of game alternative forced-choice (4-AFC) with options “A”, “B”, “both A and review summarization via natural language processing. The paper B”, “neither A nor B” would allow the user to point out cases where introduced a pipeline for grouping Steam users’ comments into both summaries are equally good or equally bad. The fairly even pre-specified aspects such as visuals or performance, and studied split between the two alternatives in both user studies could be due different renderings of the final summary, exploiting positive and to fact that users consider some summaries shown equally bad and negative sentences based on sentiment analysis. The small-scale select randomly. On the other hand, a 4-AFC questionnaire would user survey revealed differences in how different annotators assess likely need many more participants since much of the data will the reviews, highlighted possible foci of research for better game be removed when no ranking is given. The need for more games, review summarization systems, and suggested a number of refine- more annotation choices, and perhaps more algorithm variants (DL ments to the process are suggested in this promising subfield of AsDe, for instance) point to the need for a large-scale user survey game artificial intelligence. among the general gaming community, which will be performed in future work in this vein. REFERENCES As discussed in Section 1 and explored on a high-level during [1] Rasim M Alguliev, Ramiz M Aliguliyev, Makrufa S Hajirahimova, and Chingiz A the user study, game review summarization can be valuable both Mehdiyev. 2011. MCMR: Maximum coverage and minimum redundant text summarization model. Expert Systems with Applications 38, 12 (2011), 14514– to consumers (players) and producers (game developers). However, 14522. 
each stakeholder has different priorities and will likely respond [2] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet differently to different summary formats. The extractive summariza- 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the International Conference on Language Resources and Evaluation, tion process was visualized as ‘pure text’ bullet points, which was Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, not as engaging to either type of audience. It would be important Stelios Piperidis, Mike Rosner, and Daniel Tapias (Eds.). to explore alternative visualizations for players and developers. For [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Ma- chine Translation by Jointly Learning to Align and Translate. arXiv1409.0473 players, the summary could provide more structure (based on pre- [cs, stat] (sep 2014). https://doi.org/10.1146/annurev.neuro.26.041002.131047 specified game facets), focus more on the weights and scoring of arXiv:1409.0473 [4] BBC News. 2019. Gaming worth more than video and music combined. https: each aspect (including visualizations such as pie-charts), show only //www.bbc.com/news/technology-46746593. Accessed 26 January 2020. a few polar opposites in terms of review sentences, and perhaps [5] Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing cross-reference these findings with other games’ review summaries. with Python. O’Reilly Media Inc. [6] Staffan Bjork and Jussi Holopainen. 2004. Patterns in Game Design. Charles River For developers, on the other hand, a bottom-up topic discovery Media. would likely be beneficial in order to identify unexpected points of [7] David M. Blei. 2012. Probabilistic Topic Models. Communicatiosn of the ACM 55, contention among users. Moreover, presenting the context of the 4 (2012), 77–84. [8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet reviewers’ chosen sentences would also be valuable for designers, allocation. Journal of Machine Learning Research 3 (2003), 993–1022. e.g. how many reviewers agree with or echo this comment, when [9] Andrew P Clark, Kate L Howard, Andy T Woods, Ian S Penton-Voak, and Christof Neumann. 2018. Why rate when you could compare? Using the “EloChoice” this comment was made and whether general sentiment has shifted package to assess pairwise comparisons of perceived physical strength. PloS one since then. Such context can be important regarding the urgency 13, 1 (2018). of addressing certain concerns or to gauge whether patches and [10] Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 Update Summarization Task. In Proceedings of the Text Analysis Conference. updates have improved reviewers’ perception, not unlike Steam’s [11] Kareem Darwish, Walid Magdy, and Tahar Zanouda. 2017. Trump vs. Hillary: use of most recent reviews. What Went Viral During the 2016 US Presidential Election. In Proceedings of the International Conference on Social Informatics. Springer International Publishing, [35] George Panagiotopoulos, George Giannakopoulos, and Antonios Liapis. 2019. A 143–161. Study on Video Game Review Summarization. In Proceedings of the MultiLing [12] Sanmay Das and Mike Y. Chen. 2001. Yahoo! for Amazon: extracting market Workshop. sentiment from stock message boards. 
[2] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the International Conference on Language Resources and Evaluation, Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias (Eds.).
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat] (Sep 2014).
[4] BBC News. 2019. Gaming worth more than video and music combined. https://www.bbc.com/news/technology-46746593. Accessed 26 January 2020.
[5] Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O’Reilly Media Inc.
[6] Staffan Bjork and Jussi Holopainen. 2004. Patterns in Game Design. Charles River Media.
[7] David M. Blei. 2012. Probabilistic Topic Models. Communications of the ACM 55, 4 (2012), 77–84.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[9] Andrew P. Clark, Kate L. Howard, Andy T. Woods, Ian S. Penton-Voak, and Christof Neumann. 2018. Why rate when you could compare? Using the “EloChoice” package to assess pairwise comparisons of perceived physical strength. PLoS ONE 13, 1 (2018).
[10] Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 Update Summarization Task. In Proceedings of the Text Analysis Conference.
[11] Kareem Darwish, Walid Magdy, and Tahar Zanouda. 2017. Trump vs. Hillary: What Went Viral During the 2016 US Presidential Election. In Proceedings of the International Conference on Social Informatics. Springer International Publishing, 143–161.
[12] Sanmay Das and Mike Y. Chen. 2001. Yahoo! for Amazon: extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference.
[13] Sebastian Deterding, Dan Dixon, Rilla Khaled, and Lennart Nacke. 2011. From Game Design Elements to Gamefulness: Defining “Gamification”. In Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments. 9–15.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[15] Cliff Edwards. 2013. Valve Lines Up Console Partners in Challenge to Microsoft, Sony. https://www.bloomberg.com/news/articles/2013-11-04/valve-lines-up-console-partners-in-challenge-to-microsoft-sony. Accessed 26 January 2020.
[16] Eslam Elsawy, Moamen Mokhtar, and Walid Magdy. 2014. TweetMogaz v2: Identifying News Stories in Social Media. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management.
[17] Entertainment Software Association. 2018. Essential Facts About the Computer and Video Game Industry report. https://www.theesa.com/wp-content/uploads/2019/03/ESA_EssentialFacts_2018.pdf. Accessed 5 Sep 2019.
[18] Angela Fan, David Grangier, and Michael Auli. 2017. Controllable Abstractive Summarization. In Proceedings of the ACL Workshop on Neural Machine Translation and Generation.
[19] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
[20] George Giannakopoulos, George Kiomourtzis, and Vangelis Karkaletsis. 2014. Newsum: “n-gram graph”-based summarization in the real world. In Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding. IGI Global, 205–230.
[21] George Giannakopoulos, Jeff Kubina, John Conroy, Josef Steinberger, Benoit Favre, Mijail Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. MultiLing 2015: multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 270–274.
[22] Maria Giatsoglou, Manolis G. Vozalis, Konstantinos Diamantaras, Athena Vakali, George Sarigiannidis, and Konstantinos Ch. Chatzisavvas. 2017. Sentiment analysis leveraging emotions and word embeddings. Expert Systems with Applications 69 (2017), 214–224.
[23] Hongyu Han, Yongshi Zhang, Jianpei Zhang, Jing Yang, and Xiaomei Zou. 2018. Improving the performance of lexicon-based review sentiment analysis method by reducing additional introduced sentiment bias. PLOS ONE 13 (2018).
[24] Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 168–177.
[25] Ya-Han Hu, Yen-Liang Chen, and Hui-Ling Chou. 2017. Opinion mining from online hotel reviews – A text summarization approach. Information Processing & Management 53, 2 (2017), 436–449.
[26] San-Yih Hwang, Chia-Yu Lai, Jia-Jhe Jiang, and Shanlin Chang. 2014. The Identification of Noteworthy Hotel Reviews for Hotel Management. Pacific Asia Journal of the Association for Information Systems 6 (2014).
[27] Jesper Juul. 2005. Half-Real: Video Games between Real Rules and Fictional Worlds. MIT Press.
[28] Antonios Liapis, Georgios N. Yannakakis, Mark J. Nelson, Mike Preuss, and Rafael Bidarra. 2019. Orchestrating Game Generation. IEEE Transactions on Games 11, 1 (2019), 48–68.
[29] Walid Magdy and Tamer Elsayed. 2016. Unsupervised adaptive microblog filtering for broad dynamic topics. Information Processing & Management 52, 4 (2016), 513–528.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
[31] Subhabrata Mukherjee and Pushpak Bhattacharyya. 2012. Feature Specific Sentiment Analysis for Product Reviews. In Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, 475–487.
[32] Roberto Navigli and Paola Velardi. 2004. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics 30, 2 (2004), 151–179.
[33] Chikashi Nobata and Satoshi Sekine. 2004. CRL/NYU Summarization System at DUC-2004. In Document Understanding Workshop 2004.
[34] Karolina Owczarzak, John M. Conroy, Hoa Trang Dang, and Ani Nenkova. 2012. An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, 1–9.
[35] George Panagiotopoulos, George Giannakopoulos, and Antonios Liapis. 2019. A Study on Video Game Review Summarization. In Proceedings of the MultiLing Workshop.
[36] Janne Parkkila, Filip Radulovic, Daniel Garijo, María Poveda-Villalón, Jouni Ikonen, Jari Porras, and Asuncion Gomez-Perez. 2016. An ontology for videogame interoperability. Multimedia Tools and Applications 76 (2016).
[37] M.F. Porter. 2006. An algorithm for suffix stripping. Program 14 (2006), 130–137.
[38] Ligaj Pradhan, Chengcui Zhang, and Steven Bethard. 2016. Towards extracting coherent user concerns and their hierarchical organization from user reviews. In 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI). IEEE, 582–590.
[39] Owen Sacco, Antonios Liapis, and Georgios N. Yannakakis. 2017. Game Character Ontology (GCO): A Vocabulary for Extracting and Describing Game Character Information from Web Content. In Proceedings of the International Conference on Semantic Systems.
[40] Hassan Saif, Miriam Fernandez, and Harith Alani. 2015. Contextual semantics for sentiment analysis of Twitter. Information Processing & Management 52 (2015).
[41] Gerard Salton and C.S. Yang. 1973. On the specification of term values in automatic indexing. Journal of Documentation 29, 4 (1973), 351–372.
[42] David Sánchez, Montserrat Batet, David Isern, and Aida Valls. 2012. Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 39 (2012).
[43] Mark Sanderson and W. Bruce Croft. 1999. Deriving concept hierarchies from text. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
[44] Min Song, Won Chul Kim, Dahee Lee, Go Eun Heo, and Keun Young Kang. 2015. PKDE4J: Entity and relation extraction for public knowledge discovery. Journal of Biomedical Informatics 57 (2015), 320–332.
[45] Pero Subasic and Alison Huettner. 2001. Affect analysis of text using fuzzy semantic typing. IEEE Transactions on Fuzzy Systems 2 (2001), 483–496.
[46] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2012. Sentiment Strength Detection for the Social Web. Journal of the American Society for Information Science and Technology 63, 1 (2012), 163–173.
[47] Peter D. Turney. 2002. Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[49] Janyce Wiebe. 2000. Learning Subjective Adjectives from Corpora. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. 735–740.
[50] Kevin Yauris and Masayu Leylia Khodra. 2017. Aspect-based summarization for game review using double propagation. In Proceedings of the International Conference on Advanced Informatics, Concepts, Theory, and Applications.
[51] José P. Zagal, Noriko Tomuro, and Andriy Shepitsen. 2012. Natural Language Processing in Game Studies Research: An Overview. Simulation & Gaming 43, 3 (2012), 356–373.
[52] Elias Zavitsanos, Georgios Paliouras, George A. Vouros, and Sergios Petridis. 2010. Learning subsumption hierarchies of ontology concepts from texts. Web Intelligence and Agent Systems: An International Journal 8, 1 (2010), 37–51.
[53] Lili Zhao and Chunping Li. 2009. Ontology Based Opinion Mining for Movie Reviews. In Knowledge Science, Engineering and Management. Springer Berlin Heidelberg, 204–214.
[54] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie Review Mining and Summarization. In Proceedings of the ACM International Conference on Information and Knowledge Management. 43–50.
[55] Zhen Zuo. 2018. Sentiment Analysis of Steam Review Datasets using Naive Bayes and Decision Tree Classifier. Technical Report. University of Illinois at Urbana–Champaign.