=Paper= {{Paper |id=Vol-2282/EXAG_120 |storemode=property |title=Towards Automated Let’s Play Commentary |pdfUrl=https://ceur-ws.org/Vol-2282/EXAG_120.pdf |volume=Vol-2282 |authors=Matthew Guzdial,Shukan Shah,Mark Riedl |dblpUrl=https://dblp.org/rec/conf/aiide/GuzdialSR18 }} ==Towards Automated Let’s Play Commentary== https://ceur-ws.org/Vol-2282/EXAG_120.pdf
                               Towards Automated Let’s Play Commentary

                                      Matthew Guzdial, Shukan Shah, Mark Riedl
                                                   College of Computing
                                               Georgia Institute of Technology
                                                     Atlanta, GA 30332
                              mguzdial3@gatech.edu, shukanshah@gatech.edu, riedl@cc.gatech.edu



                           Abstract                                the best of our knowledge, no existing attempt to automati-
                                                                   cally generate Let’s Play commentary exists.
  We introduce the problem of generating Let’s Play-style com-        The remainder of this paper is organized as follows. First,
  mentary of gameplay video via machine learning. We propose
                                                                   we discuss Let’s Plays as a genre and an initial qualitative
  an analysis of Let’s Play commentary and a framework for
  building such a system. To test this framework we build an       analysis. Second, we propose a framework for the gener-
  initial, naive implementation, which we use to interrogate the   ation of Let’s Plays through machine learning. Third, we
  assumptions of the framework. We demonstrate promising re-       cover relevant related work. Finally, we explore a few ex-
  sults towards future Let’s Play commentary generation.           perimental, initial results in support of our framework.

                                                                            Let’s Play Commentary Analysis
                       Introduction
                                                                   For the purposes of this paper, we refer to the set of utter-
The rise of video streaming sites such as YouTube and              ances made by players of Let’s Plays as commentary. How-
Twitch has given rise to a new medium of entertainment,            ever, despite referring to these utterances as commentary,
known as the “Let’s Play”. Let’s Plays involve video stream-       they are not strictly commenting or reflecting on the game-
ers providing commentary of their own gameplay to an audi-         play. This is in large part due to the necessity of a player to
ence of viewers. The impact of Let’s Plays on todays enter-        speak near constantly during a Let’s Play video to keep the
tainment culture is evident from revenue numbers. Studies          audience engaged. At times there is simply nothing occur-
estimate that Let’s Plays and other streamed game content          ring in the game to talk about.
will generate $3.5 billion in ad revenue by the year 2021             We analyzed hours of Let’s Play footage of various Let’s
(Foye 2017). What makes Let’s Plays unique is the combi-           players and found four major types of Let’s Play commen-
nation of commentary and gameplay, with each part influ-           tary. We list these types below in a rough ordering of fre-
encing the other.                                                  quency, and include a citation for a representative Let’s Play
   Let’s Play serve two major purposes. First, to engage and       video. However, we note that in most Let’s Play videos, a
entertain an audience through humor and interesting com-           player flows naturally between different types of commen-
mentary. Second, to educate an audience, either explicitly as      tary.
when a commentator describes what or why something hap-
pens in a game, or implicitly as viewers experience the game       1. Reaction: The most common type of comment re-
through the Let’s Play. We contend that for these reasons,            lates in some way to the gameplay occurring on screen
and due to the popularity and raw amount of existing Let’s            (McLoughlin 2013). This can be descriptive, educational,
Play videos, this medium serves as an excellent training do-          or humorous. For example, reacting on a death by restat-
main for automated commentary or explanation systems.                 ing what occurred, explaining why it occurred, or down-
   Let’s Plays could serve as a training base for an AI ap-           playing the death with a joke.
proach that learns to generate novel, improvisational, and         2. Storytelling: The second most common comment we
entertaining content for games and video. Such a model                found was some form of storytelling that related events
could easily meet the demand for gameplay commentary                  outside of the game. This storytelling could be biographi-
with a steady throughput of new content generation. Beyond            cal or fictional, improvised or pre-authored. For example,
automating the creation of Let’s Play we anticipate such              Hanson begins a fictional retelling of his life with “at age
systems could find success in other domains that require              six I was born without a face” in (Hanson and Avidan
engagement and explanation, such as eSports commentary,               2015).
game tutorial generation, and other domains at the intersec-
tion of education and entertainment in games. However, to          3. Roleplay: Much less frequently than the first two types,
                                                                      some Let’s players make comments in order to roleplay
                                                                      a specific character. At times this is the focus of an en-
                                                                      tire video and the player never breaks character, in others
   a player may slip in and out of the character throughout         of physical sports (Graefe 2016) or automated highlight gen-
   (Dugdale 2014).                                                  eration for physical sports (Kolekar and Sengupta 2006).
                                                                    These approaches require a log of game events. Our proposal
4. ASMR: ASMR stands for Autonomous Sensory Merid-
                                                                    is for live commentary of an ongoing video game stream.
   ian Response (Barratt and Davis 2015), and there exists
                                                                    Harrison et al. (2017) create explanations of an AI player’s
   a genre of YouTube video dedicated to causing this re-
                                                                    actions for the game Frogger. All of these approaches de-
   sponse in viewers, typically for the purposes of relaxation.
                                                                    pend on access to a game’s engine or the existence of a pub-
   Let’s players have taken notice, with some full Let’s Plays
                                                                    licly accessible logging system.
   in the genre of ASMR, and some Let’s players slipping in
                                                                       To the best of our knowledge, there have been no AI sys-
   and out of this type of commentary. These utterances tend
                                                                    tems that attempt to directly generate streaming commen-
   to resemble whispering nonsense or making other non-
                                                                    tary of live gameplay footage. Nonetheless, work exists that
   word mouth noises (Fischbach 2016).
                                                                    maps visual elements (such as videos and photos) to story-
   These high level types of commentary are roughly de-             like natural language. For example, The Sports Commen-
   fined, and do not fully represent the variance of player         tary Recommendation System or SCoReS (2014) and Plot-
   utterances. These utterances also differ based on the num-       Shot (2016) seek to create stories centered around visual
   ber of Let’s players in a single video, the potential for live   elements measured/recorded by the system. While SCoReS
   interactions with an audience if a game is streamed, and         learns a mapping from specific game states to an appropriate
   variations among the games being played. We highlight            story (Lee, Bulitko, and Ludvig 2014), PlotShot is a narra-
   these types as a means of demonstrating the breadth of           tive planning system that measures the distance between a
   distinct categories in Let’s Play commentary. Any artifi-        photo and the action it portrays (R. Cardona-Rivera 2016).
   cial commentary generation system must be able to iden-          Our domain is different as we are not illustrating stories but
   tify and generate within these distinct categories.              rather trying to construct live commentary on the fly as the
                                                                    system receives a continuous stream of events as input.
                 Proposed Framework                                    Significant prior work has explored Let’s Play as cultural
In the prior section we list some high-level types of com-          artifact and as a medium. For example, prior studies of the
mentary we identified from Let’s Play videos. This vari-            audience of Let’s Plays (Sjöblom and Hamari 2017), content
ance, in conjunction with the challenges present in all natu-       of Let’s Plays (Sjöblom et al. 2017), and building commu-
ral language generation tasks (Reiter and Dale 2000), makes         nities around Let’s Play (Hamilton, Garretson, and Kerne
this problem an open challenge for machine learning ap-             2014). The work described in this paper is preliminary as
proaches.                                                           a means of exploring the possibility for automated genera-
                                                                    tion of Let’s Play commentary. We anticipate future develop-
   We propose the following two stage, high-level frame-
                                                                    ments in this work to more closely engage with scholarship
work for machine learning approaches that generate Let’s
                                                                    in these areas.
Play commentary. First, to handle the variance we anticipate
                                                                       More recent work explores the automated generation of
the need for a pre-processing clustering step. This cluster-
                                                                    content from Let’s Plays, but not automated commentary.
ing step may require human action, for example, separating
                                                                    Both Guzdial and Riedl (2016) and Summerville et al.
Let’s Plays of a specific type, by Let’s player, or by game.
                                                                    (2016) look to use Longplay’s, a variation of Let’s Play gen-
In addition or as an alternative, automated clustering may
                                                                    erally without commentary, as part of a process to gener-
be applied to derive categories of utterances and associated
                                                                    ate video game levels through procedural content generation
gameplay footage. This may reflect the analysis we present
                                                                    via machine learning (Summerville et al. 2017). Other work
in Section 2 or may find groupings specific to a particular
                                                                    has looked at eSport commentators in a similar manner, as
dataset.
                                                                    a means of determining what approaches the commentators
   In the second stage of our framework a machine learn-
                                                                    use that may apply to explainable AI systems (Dodge et al.
ing approach is used to approximate a commentary gener-
                                                                    2018). However, this work only presented an analysis of this
ation function. The most naive interpretation of this would
                                                                    commentary, not of the video, and without any suggested
be to learn a mapping between gameplay footage and com-
                                                                    approach to generate new commentary.
mentary based on the clustered dataset from the first step.
However, we anticipate the need to include prior commen-
tary and its relevant gameplay footage as input to generate
                                                                              Experimental Implementation
coherent commentary. We expect a full-fledged implemen-             In this section we discuss an experimental implementation
tation of this approach would make use of state of the art          of our framework in the domain of Super Mario Bros. Let’s
natural language generation methods.                                Plays. The purpose of this experimental implementation is
                                                                    to allow us to interrogate the assumptions implicit in our
                                                                    framework, in particular that clustering is a necessary and
                       Related Work                                 helpful means of handling the variance of Let’s Play com-
There exists prior work on automatically generating textual         mentary.
descriptions of gameplay. Bardic (2017) creates narrative re-          We visualize a high-level overview of our implementation
ports from Defense of the Ancients 2 (DOTA 2) game logs             in Figure 1. As a preprocessing step we take video and an
for improved and automated player insights. This has some           associated automated transcript from YouTube, to which we
similarity to common approaches for automated journalism            apply ffmpeg to break the video into individual frames and
                             Figure 1: Visual representation of our experimental implementation.


associated commentary lines or utterances. We make use of          centive generality. This random forest predicted from the
1 FPS when extracting frames from video, given most com-           bag of sprites frames representation to a bag of words
ments took at least a few seconds. Therefore most comments         comment representation. This means it produced a bag of
are paired with two or more frames in a sequence, represent-       words instead of the ordered words necessary for a com-
ing the video footage while the player makes that comment.         ment, but it was sufficient for comparison with true com-
   A dataset of paired frames and an associated utterance          ments in the same representation.
serves as input into our training process. We represent each     • KNN: We constructed two different K-Nearest Neighbor
frame as a “bag of sprites”, based on the bag of words rep-        (KNN) approaches, based on a value of k of 5 or 10. In
resentation used in natural language processing. We pull           this approach we grabbed the k closest training elements
the values of individual sprites from each frame using a           to a test element according to frame distance. Given these
spritesheet of Super Mario Bros. with the approach de-             k training examples, we took the set of their words as the
scribed in (Guzdial and Riedl 2016). We combine multiple           output. As with the Forest baseline, this does not repre-
bags of sprites when a comment is associated with multiple         sent a final comment. We note further that there are far
frames. We also represent the utterance as a bag of words          too many words for a single comment using this method,
as well. In this representation we cluster the bag of sprites      which makes it difficult to compare across the other base-
and words according to a given distance function with K-           lines, but we can compare between variations of this base-
Medoids. We determine a value of k for our medoids clus-           line.
tering method according to the distortion ratio (Pham, Di-
mov, and Nguyen 2005). This represents the first step of our        For testing we then take as input some frames without
framework.                                                       or with withheld commentary, determine which cluster it
   For the second step of our framework in this implementa-      would be clustered into, and use a learned frame-to-text
tion, we approximate a frame-to-commentary function using        function to predict an output utterance. This can then be
one of three variations. These variations are as follows:        compared to the true utterance if there is one. We note this
                                                                 is the most naive possible implementation, and true attempts
• Random: Random takes the test input, and simply re-            at this problem will want to include as input to this func-
  turned a random training element’s comment data.               tion prior frames and prior generated comments, in order to
• Forest: We constructed a 10 tree random forest, using          determine an appropriate next comment.
  the default SciPy random forest implementation (Jones,            Clustering approaches require a distance function to de-
  Oliphant, and Peterson 2014). We used all default param-       termine the distance between any two arbitrary elements you
  eters except that we limited the tree depth to 200 to in-      might wish to cluster. Our distance function is made up of
                                                                    Table 1: Comparison of different ML approaches to the
                                                                    frame-to-comment function approximation, trained across
                                                                    all available training data or only within individual clusters.
                                                                    Values are average cosine distance, thus lower is better.
                                                                                            Standard        Per-Cluster
                                                                              Random      0.940±0.102      0.937±0.099
                                                                              Forest      0.993±0.042      0.970±0.090
                                                                              KNN 5       0.886±0.091      0.885±0.100
                                                                              KNN 10      0.854±0.091      0.852±0.101
Figure 2: An example test commentary prediction. The true
frame on the left is associated with the utterance “if you
get to Bowser” and our method returns ”parts like these on          Standard vs. Per-Cluster Experiment
Bowser”.                                                            For this first experiment we wished to interrogate our
                                                                    assumption that automated clustering represented a cost-
                                                                    effective means of handling the variance of Let’s Play com-
                                                                    mentary. To accomplish this, we trained each of our three
two component distance functions; one part measures the             variations (Random, Forest, KNN) according to two differ-
distance between the frame data (represented as a bag of            ent processes. In what we call the standard approach each
sprites) and one part measures the distance between the ut-         variation is trained on the entirety of the dataset. In the per-
terances (represented as a bag of words). For both parts we         cluster approach a model of each variation is trained for
make use of cosine similarity. Cosine similarity, a simple          each cluster. This means that we had one random forest in
technique used popularly in statistical natural language pro-       the standard approach, and six random forests (one for each
cessing, measures how similar two vectors are by measuring          cluster) in the per-cluster approach.
the angle between them. The smaller the angle, the more                We tested all 306 test elements for each approach. For the
similar the two vectors are (and vice versa). An underlying         per-cluster approaches we first clustered each test element
assumption that we make is that the similarity between a            only in terms of its frame data, and then used the associated
pair of comments is more indicative of closeness than sim-          model trained only on that cluster to predict output. In the
ilar frame data, simply because it is very common for two           standard variation we simply ran the test element’s frame
instances to share similar sprites (especially when they’re         data through the trained function. For both approaches we
adjacent in the video). Thus, when calculating distance, the        compare the cosine distance of the true withheld comment
utterance cosine similarity is weighted more heavily (75%)          and the predicted comment. This can be understood as the
than the frame cosine similarity (25%).                             test error of each approach, meaning a lower value is bet-
                                                                    ter. If the per-cluster approach of each variation outperforms
   At present, we use Super Mario Bros. (SMB), a well-              the standard approach, then that would be evidence that the
known and studied game for the NES, as the domain for our           smaller training dataset size of the per-cluster approach was
work. We chose Super Mario Bros because prior work in               more than made up for by the reduction in variance.
this domain has demonstrated the ability to apply machine              We compile the results of this first experiment in Table 1.
learning techniques to scene understanding of SMB game-             In particular, we give the average cosine distance between
play (Guzdial and Riedl 2016).                                      the predicted and true comment and the standard deviation.
                                                                    In all cases the per-cluster variation outperforms the stan-
   For our initial experiments we collected a dataset of two
                                                                    dard variation. The largest improvement came from the Ran-
fifteen minute segments of a popular Let’s Player playing
                                                                    dom Forest baseline, while the KNN baseline with k = 5
through Super Mario Bros. along with the associated text
                                                                    had the smallest improvement. This makes sense as even in
transcripts generated by Youtube. We use one of these videos
                                                                    the standard approach, KNN would largely find the same
as a training set and one as a test set. This means roughly
                                                                    training examples as the per-cluster approach. However, the
three-hundred frame and text comment pairs in both the
                                                                    actual numbers do not matter here. Given the size of the as-
training and testing sets, with 333 for the training set and 306
                                                                    sociated datasets, seeing any trend indicates support for our
for the testing set. This may seem small given that each seg-
                                                                    assumption, which we would anticipate to scale with larger
ment was comprised of fifteen minutes of gameplay footage,
                                                                    datasets and more complex methods.
however it was due to the length and infrequency of the com-
ments.                                                              Cluster Choice Experiment
   We applied the training dataset to our model. In clustering      The prior experiment suggests that our clusters have suc-
we found k = 6 for our K-medoids clustering approach ac-            cessfully cut down on the variance of the problem. However,
cording to the distortion ratio. This lead to six final clusters.   this does not necessarily mean that our clusters represent any
These clusters had a reasonable spread, with a minimum of           meaningful categories of or relationships between frames
31 elements, a maximum of 90 elements, and a median size            and comments. Instead, the performance from the prior ex-
of 50. From this point we ran our two evaluations.                  periment may be due to the clusters reducing the dimension-
                                                                   (two Let’s Play videos for one game), we cannot state with
Table 2: Comparison of different ML approaches to the              certainty that these results will generalize.
frame-to-comment function approximation, trained with the             Future implementations of this framework will incorpo-
true cluster or a random cluster. Values are average cosine        rate more sophisticated methods for natural language pro-
distance, thus lower is better.                                    cessing to generate novel commentary. As stated above, for
                    True Cluster     Random Cluster                the purposes of this initial experiment we went with the
        Random      0.937±0.099       0.936±0.101                  naive approach of predicting output commentary solely from
        Forest      0.970±0.090       0.986±0.058                  an associated sequence of gameplay video frames. We an-
        KNN 5       0.885±0.100       0.901±0.077                  ticipate a more final system will require a history of prior
        KNN 10      0.852±0.101       0.885±0.066                  comments and associated frames.
                                                                      In this initial work we drew upon random forest, and KNN
                                                                   as a means of generating output commentary, represented as
                                                                   a bag of words. We note that as our training data increased,
ality of the problem. It is well-recognized that even arbi-        we would anticipate a runtime increase for both these ap-
trary reductions in the dimensionality of a problem can lead       proaches (though we can limit this in the random forest by
to improved performance for machine learning approaches            limiting depth). If we want to have commentary generated
(Bingham and Mannila 2001). This interpretation would ex-          in real-time, we might instead want to make use of a deep
plain the improvement seen in our random forest baseline,          neural network or similar model of fixed size.
given that this method can be understood as including a fea-          Besides generating entertaining commentary for Let’s
ture selection step. Therefore we ran a secondary experiment       Plays, a final working system could be useful in a variety of
in which we compared the per-cluster variations using ei-          settings. One obvious approach would be to attempt to ex-
ther the assigned cluster or a random cluster. If it is the case   tend such a system to color commentary for eSports games
that the major benefit to the approaches came from the di-         (Dodge et al. 2018). More generally, such a system might
mensionality reduction from the cluster, we would anticipate       help increase user engagement with AI agents by aiding in
equivalent performance no matter which cluster is chosen.          Explainable AI approaches to rationalize decision-making
   We summarize the results of this experiment in Table 2.         (B. Harrison 2017).
Outside of the random variation, the true cluster faired bet-
ter than a random cluster. In the case of the random baseline
the performance was marginally better with a random clus-                                Conclusions
ter, but nearly equivalent. This makes sense given that the        In this paper, we define the problem of automatic commen-
random variation only involves uniformly sampling across           tary of Let’s Play videos via machine learning. Our frame-
the cluster distribution as opposed to learning some mapping       work requires an initialize clustering stage to cut back on
from frame representation to comment representation.               the implicit variance of Let’s Play commentary, followed by
   These results indicate that the clusters do actually rep-       a function approximation for commentary generation. We
resent meaningful types of relationships between frame             present an experimental implementation and multiple exper-
and text. This is further evidenced looking at the av-             imental results. Our results lend support to our framework.
erage comment cosine distance between the test exam-               Although there is much to improve upon, the work is an ex-
ples and the closest medoid according to the test exam-            citing first step towards solving the difficult problem of au-
ple’s frame (0.910±0.115) and a randomly selected medoid           tomated, real-time commentary generation.
(0.915±0.108).
                                                                                     Acknowledgements
Qualitative Example
                                                                   This material is based upon work supported by the National
We include an example of output in Figure 2 using a KNN            Science Foundation under Grant No. IIS-1525967.
with k = 1 in order to get a text comment output as opposed
to a bag of words. Despite the text having almost nothing                                 References
to do with the frame in question the text is fairly similar
from our perspective. It is worth noting again that all our        B. Harrison, U. Ehsan, M. R. 2017. Rationalization: A neu-
data comes from the same Let’s Player, and therefore may           ral machine translation approach to generating natural lan-
represent some style of that commentator.                          guage explanations.
                                                                   Barot, C.; Branon, M.; Cardona-Rivera, R. E.; Eger, M.;
           Limitations and Future Work                             Glatz, M.; Green, N.; Mattice, J.; Potts, C. M.; Robertson,
                                                                   J.; Shukonobe, M.; et al. 2017. Bardic: Generating multi-
In this paper we introduce the problem of creating a ma-           media narrative reports for game logs.
chine learning approach for Let’s Play commentary gener-
ation. Towards this purpose we present an abstract frame-          Barratt, E. L., and Davis, N. J. 2015. Autonomous sensory
work for solving this problem and present a limited, experi-       meridian response (asmr): a flow-like mental state. PeerJ
mental implementation which we interrogate. We find some           3:e851.
results that present initial evidence towards assumptions in       Bingham, E., and Mannila, H. 2001. Random projection
our framework. However, due to the scale of this experiment        in dimensionality reduction: applications to image and text
data. In Proceedings of the seventh ACM SIGKDD interna-         In Twelfth Artificial Intelligence and Interactive Digital En-
tional conference on Knowledge discovery and data mining,       tertainment Conference.
245–250. ACM.                                                   Summerville, A.; Snodgrass, S.; Guzdial, M.; Holmgård, C.;
Dodge, J.; Penney, S.; Hilderbrand, C.; Anderson, A.; and       Hoover, A. K.; Isaksen, A.; Nealen, A.; and Togelius, J.
Burnett, M. 2018. How the experts do it: Assessing and          2017. Procedural content generation via machine learning
explaining agent behaviors in real-time strategy games. In      (pcgml). arXiv preprint arXiv:1702.00539.
Proceedings of the 2018 CHI Conference on Human Factors
in Computing Systems, 562. ACM.
Dugdale, S. 2014. Let’s roleplay the elder scrolls v: Skyrim
episode 1 ”shipwrecked”.
Fischbach, M. 2016. World’s quietest let’s play.
Foye, L. 2017. esports and let’s plays rise of the backseat
gamers. Technical report, Jupiter Research.
Graefe, A. 2016. Guide to automated journalism.
Guzdial, M., and Riedl, M. 2016. Game level generation
from gameplay videos. In Twelfth Artificial Intelligence and
Interactive Digital Entertainment Conference.
Hamilton, W. A.; Garretson, O.; and Kerne, A. 2014.
Streaming on twitch: fostering participatory communities
of play within live mixed media. In Proceedings of the
SIGCHI Conference on Human Factors in Computing Sys-
tems, 1315–1324. ACM.
Hanson, A., and Avidan, D. 2015. Yoshi’s cookie: National
treasure - part 1 - game grumps vs.
Jones, E.; Oliphant, T.; and Peterson, P. 2014. {SciPy}:
open source scientific tools for {Python}.
Kolekar, M. H., and Sengupta, S. 2006. Event-importance
based customized and automatic cricket highlight genera-
tion. In Multimedia and Expo, 2006 IEEE International
Conference on, 1617–1620. IEEE.
Lee, G.; Bulitko, V.; and Ludvig, E. A. 2014. Auto-
mated story selection for color commentary in sports. IEEE
transactions on computational intelligence and ai in games
6(2):144–155.
McLoughlin, S. 2013. Outlast - part 1 — so freaking scary
— gameplay walkthrough - commentary/face cam reaction.
Pham, D. T.; Dimov, S. S.; and Nguyen, C. D. 2005. Selec-
tion of k in k-means clustering. Proceedings of the Institu-
tion of Mechanical Engineers, Part C: Journal of Mechani-
cal Engineering Science 219(1):103–119.
R. Cardona-Rivera, B. L. 2016. Plotshot: Generating
discourse-constrained stories around photos.
Reiter, E., and Dale, R. 2000. Building natural language
generation systems. Cambridge university press.
Sjöblom, M., and Hamari, J. 2017. Why do people watch
others play video games? an empirical study on the mo-
tivations of twitch users. Computers in Human Behavior
75:985–996.
Sjöblom, M.; Törhönen, M.; Hamari, J.; and Macey, J. 2017.
Content structure is king: An empirical study on gratifica-
tions, game genres and content type on twitch. Computers
in Human Behavior 73:161–171.
Summerville, A.; Guzdial, M.; Mateas, M.; and Riedl, M. O.
2016. Learning player tailored content from observation:
Platformer level generation from video traces using lstms.