=Paper=
{{Paper
|id=Vol-2282/EXAG_114
|storemode=property
|title=Deep Learning for Classification of Speech Accents in Video Games
|pdfUrl=https://ceur-ws.org/Vol-2282/EXAG_114.pdf
|volume=Vol-2282
|authors=Sergio Poo Hernandez,Vadim Bulitko,Shelby Carleton,Astrid Ensslin,Tejasvi Goorimoorthee
|dblpUrl=https://dblp.org/rec/conf/aiide/HernandezBCEG18
}}
==Deep Learning for Classification of Speech Accents in Video Games==
Deep Learning for Classification of Speech Accents in Video Games

Sergio Poo Hernandez and Vadim Bulitko (Computing Science, University of Alberta, Edmonton, AB; {pooherna | bulitko}@ualberta.ca), Shelby Carleton (English and Film Studies, University of Alberta, Edmonton, AB; scarleto@ualberta.ca), Astrid Ensslin and Tejasvi Goorimoorthee (Humanities Computing, University of Alberta, Edmonton, AB; {ensslin | tejasvi}@ualberta.ca)

Abstract

In many video games, a wide range of non-playable characters make up the worlds players inhabit. Such characters are often voiced by human actors, and their accents can have an influence on their perceived moral inclination, level of trustworthiness, social class, level of education and ethnic background. We use deep learning to train a neural network to classify speech accents. Such a machine-learned tool would provide game developers with an ability to analyze accent distribution in their titles as well as possibly help screen voiceover actors applying for a role. To make the approach accessible we used a readily available off-the-shelf deep network and commodity GPU hardware. Preliminary results are promising, with a 71% test accuracy achieved over two accents in a commercial video game.

1 Introduction

Modern video games often feature numerous non-playable characters (NPCs) that populate the in-game world, contributing to the atmosphere, gameplay and storytelling. Such characters are usually available to interact with the player and are frequently voiced by well-known actors (e.g., Martin Sheen in Mass Effect 2 (BioWare 2010)). As in movies, different accents in the same language (e.g., English) contribute to an ethnic, social and moral image of an NPC. Thus it is important for game developers to be aware of and control the assignment of accents to in-game characters. Having a fast and low-cost way of determining the accent of a voiceover can help developers screen sent-in audition files or take an inventory of accents within a development prototype.

In this paper we demonstrate how machine learning can be used to automatically classify speech accents in video-game voiceovers. The approach is designed to be accessible to small-scale game developers and individual researchers in the field of game studies. Specifically, we train an off-the-shelf deep neural network on commodity hardware. The network achieved a 71% test accuracy over American and British accents in the commercial videogame Dragon Age: Origins (BioWare 2009).

The rest of the paper is organized as follows. We formulate the problem precisely in Section 2, then discuss related work in Section 3. We then present our approach in Section 4 and detail the results of an empirical evaluation in Section 5. The paper is concluded with a discussion of the current shortcomings and the corresponding future work.

2 Problem Formulation

The problem is mapping from an audio file containing speech to an accent label from a pre-determined set. A file is assumed to have a single speaker whose accent is consistent throughout the file. We evaluate the performance of such a mapping by measuring its accuracy on a test set of files. The objective is then to increase the test accuracy while keeping the approach accessible to game developers as well as researchers from different disciplines.

3 Related Work

In speech recognition, Graves, Mohamed, and Hinton (2013) used recurrent neural networks (Goodfellow, Bengio, and Courville 2016) to recognize phonemes in the TIMIT database (Garofalo et al. 1993). Other work on phoneme classification in speech signals using convolutional neural networks (CNNs) (Palaz, Collobert, and Doss 2013; Song and Cai 2015; Zhang et al. 2017) used logarithmic mel-filter-bank coefficients and hybrid networks composed of a CNN and a recurrent neural network (RNN). Their primary task is different from ours in that they identify phonemes in order to recognize words, as opposed to accents. Yet other work on phoneme recognition (Hinton et al. 2012) highlighted the importance of weight initialization when recognizing phonemes. Once again, the problem they were tackling is substantively different from ours.
Research by Espi et al. (2015) on acoustic event detection emphasized the importance and feasibility of local feature extraction in detecting and classifying non-speech acoustic events occurring, for instance, in conversation scenes. This work is only indirectly related to ours, as it does not label accents in a conversation; however, it could be combined with our approach to detect and remove sections of the audio without speech.

Work has also been done on detecting emotions in speech through spectrograms (Huang et al. 2014; Badshah et al. 2017). While this is not the task we are trying to solve, it is similar in its use of spectrograms as the input to a neural network.

Recently we explored training a neural network on an existing (non-video-game) accent database and then using the trained network to detect accents in audio files from a video game (Ensslin et al. 2017), on the assumption that training on a curated accent database would yield a better classifier. That approach had two problems: it required access to a separate accent database, and its test accuracy was poor. The approach in this paper is similar but trains on video-game audio files directly and yields better test accuracy.

4 Our Approach

To keep our approach accessible to a broad set of game developers and researchers, we used a common off-the-shelf deep neural network: AlexNet (Krizhevsky, Sutskever, and Hinton 2012). This approach has yielded state-of-the-art results when classifying bird species by their song (Knight et al. 2018).

4.1 Converting Audio to Images

As AlexNet was originally designed to classify images, we converted audio files to spectrograms in a fashion similar to our previous work (Ensslin et al. 2017). Specifically, each spectrogram consisted of four image quadrants computed by Algorithm 1 as follows. The audio file S gets partitioned into m parts of w seconds each (line 3). For each part s_x, we apply the Fast Fourier Transform, resulting in a sequence of amplitudes A (line 4). We remove all amplitudes for frequencies below f_min and above f_max (line 5). We then partition the remaining frequency range [f_min, f_max] into b linearly (if L_f holds) or logarithmically spaced segments (line 6). For each segment B(y) we sum the corresponding amplitudes into the scalar a_y (line 8). We then map a_y to a spectrogram pixel I(x, y) using a color mapping C (line 9). Optionally, we take the logarithm of the amplitude a_y (if L_a is false).

Algorithm 1: Create spectrogram quadrant
  input:  S, f_min, f_max, w, b, L_f, L_a, C
  output: image I
  1  m ← ⌈|S| / w⌉
  2  for x ∈ {1, 2, ..., m} do
  3      s_x ← x-th window from S
  4      A ← fft(s_x)
  5      A ← A restricted to [f_min, f_max]
  6      B ← linspace(f_min, f_max, b) if L_f, else logspace(f_min, f_max, b)
  7      for y ∈ {1, 2, ..., b} do
  8          a_y ← sum of A over segment B(y)
  9          I(x, y) ← C(a_y) if L_a, else C(log(a_y))
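To make Algorithm 1 concrete, the following is a minimal MATLAB sketch of computing one spectrogram quadrant under our reading of the pseudocode. The function name, the use of b+1 band edges, and the whole-image normalization applied before the colormap are illustrative assumptions rather than the exact code of our implementation (see Section 5.2).

  % A minimal sketch of Algorithm 1 (one spectrogram quadrant); names are ours.
  %   S: mono audio samples, Fs: sample rate (Hz), w: window length (s),
  %   fmin/fmax: frequency band, b: number of frequency filters,
  %   Lf/La: linear-frequency / linear-amplitude flags, C: colormap, e.g. jet(100).
  function I = spectrogram_quadrant(S, Fs, fmin, fmax, w, b, Lf, La, C)
      n = round(w * Fs);                      % samples per window
      m = ceil(numel(S) / n);                 % number of windows (line 1)
      if Lf
          B = linspace(fmin, fmax, b + 1);    % linearly spaced band edges (line 6)
      else
          B = logspace(log10(fmin), log10(fmax), b + 1);  % log-spaced edges
      end
      M = zeros(b, m);                        % summed amplitude per (band, window)
      for x = 1:m                                          % line 2
          sx = S((x-1)*n + 1 : min(x*n, numel(S)));        % line 3
          A  = abs(fft(sx));                               % line 4
          f  = (0:numel(A)-1)' * Fs / numel(A);            % frequency of each bin
          for y = 1:b                                      % line 7
              ay = sum(A(f >= B(y) & f < B(y+1)));         % lines 5 and 8
              if ~La
                  ay = log(ay + eps);                      % optional log amplitude
              end
              M(y, x) = ay;
          end
      end
      % Map amplitudes to RGB via colormap C (low = cold/blue, high = warm/red).
      idx = 1 + round((M - min(M(:))) ./ (max(M(:)) - min(M(:)) + eps) * (size(C, 1) - 1));
      I = ind2rgb(idx, C);                                 % line 9
  end

A composite spectrogram as in Figure 1 would stack the outputs of this function for the four combinations of L_f and L_a.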
[Figure 1: A composite spectrogram of a piano-roll audio file. The four quadrants correspond to linear vs. logarithmic frequency spacing (rows) and linear vs. logarithmic amplitude (columns).]

Figure 1 illustrates the process on a simple piano-roll audio file. The top-left quadrant of the image is produced by setting both L_f and L_a to true and therefore uses linearly spaced frequency-range segments B. The top-right quadrant has L_a set to false and thus applies a logarithm to the cumulative amplitude a_y before converting it to an RGB color via the color mapping C (which maps lower amplitudes to colder/blue colors and higher amplitudes to warmer/red colors). The bottom-left quadrant spaces the frequency-range segments logarithmically (i.e., L_f is set to false). Finally, the bottom-right quadrant uses logarithmically spaced frequency-range segments and also applies the logarithm to amplitudes.

Each quadrant is a color image with a height of b pixels and a width of m pixels. The resulting composite spectrogram thus has 2b rows and 2m columns. Figure 2 shows a composite spectrogram of an actual videogame audio file.

[Figure 2: A composite spectrogram of a voiceover file with an American accent.]

4.2 Training and Testing the Network

Algorithm 1 converts a set of audio files {S_k} to a set of spectrograms {I_k}. As the audio files increase in duration, the width of each quadrant (m pixels) would have to increase in order to keep the same temporal resolution. Since off-the-shelf deep neural networks tend to require the input image to be of a fixed size, we divided each audio file S_k into segments {S_k^i} of up to s seconds. This ensures that the spectrogram maintains a temporal resolution of at least m/s pixels per second of audio.

Our original dataset of audio files and their accent labels {(S_k, l_k)} thus becomes a dataset of audio segments, each of which inherits the accent label of the original file: {(S_k^i, l_k)}. Once converted to spectrograms by Algorithm 1, these become {(I_k^i, l_k)}.

To get robust results and avoid overfitting the data, we conduct the training and testing process in the standard fashion with T independent trials. On each trial t, we split the dataset {(S_k, l_k)} at the level of complete audio files, using a fraction α for training and the remaining 1 − α for testing: {(S_k, l_k)} = S_t^train ∪ S_t^test with |S_t^train| = ⌊α |{(S_k, l_k)}|⌋. Expressed at the level of spectrograms of audio segments, we have {(I_k^i, l_k)} = I_t^train ∪ I_t^test.

On each trial t, we train a neural network on I_t^train using three hyperparameters: the number of epochs, the batch size for stochastic gradient descent, and the learning rate. Once the network N_t is trained, we freeze its weights and test it on I_t^test. The per-segment accuracy of the trained network is the percentage of audio-file segments for which the accent label output by the network matches the one in the test set:

  A_t^per-segment = |{(I_k^i, l_k) ∈ I_t^test : N_t(I_k^i) = l_k}| / |I_t^test|.

The per-segment accuracy of the network is then averaged over all T trials: A^per-segment = avg_t A_t^per-segment.

We also calculate per-file accuracy. For that we run the network on all segments comprising an audio file from the test set and take a majority vote on the labels the network produces.[1] Thus we define N_t(I_k) as the majority vote of the network's classifications N_t(I_k^i) of the individual segments. The per-file accuracy is then defined as:

  A_t^per-file = |{(I_k, l_k) ∈ I_t^test : N_t(I_k) = l_k}| / |I_t^test|.

As before, A^per-file = avg_t A_t^per-file.

[1] If two or more labels are tied for the most segments, we break the tie in favor of the label of the earliest such segment. For instance, if a five-segment audio file is labeled by the network as [British, British, American, American, Spanish], then we break the tie between British and American in favor of British. This rule arose because we initially assumed there would be no ties, so we always select the first most-frequent segment-level label.
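The per-file majority vote with the tie-breaking rule from footnote [1] can be sketched in a few lines of MATLAB; the function and variable names below are illustrative and are not part of the implementation described in Section 5.2.

  % Majority vote over segment labels with the tie-break from footnote [1]:
  % among the labels tied for the most segments, the label whose segment
  % appears earliest in the file wins.
  function fileLabel = majority_vote(segmentLabels)
      % segmentLabels: cell array of per-segment labels in temporal order,
      % e.g. {'British','British','American','American','Spanish'}
      [labels, ~, idx] = unique(segmentLabels, 'stable');  % keep first-seen order
      counts = accumarray(idx(:), 1);                      % votes per distinct label
      tied = find(counts == max(counts));                  % labels tied at the top
      fileLabel = labels{tied(1)};                         % earliest first occurrence wins
  end

On the example from footnote [1] this returns British, matching the rule described above.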
5 Empirical Evaluation

In this section we present a specific implementation of our approach and an evaluation of its performance on audio files from a commercial video game.

5.1 Data Collection

We used voiceover audio files captured from Dragon Age: Origins (BioWare 2009), a game with a wide variety of accents and characters. The background music in the game was turned off so that only the speech is present. We tried to capture as many NPCs as possible, and only one recording per NPC was used to form our dataset. The audio files were labeled independently by three individuals: each listened to every audio file and labeled its accent, after which the labelers compared their labels and debated any differences until a consensus was reached.[2] This process resulted in 295 audio files such that each file contained a single speaker labeled with a single accent: 147 with an American accent and 148 with a British accent.

The audio files were from 2 to 40 seconds in duration. Using a segment length of 3 seconds, we created a dataset {(I_k^i, l_k)} of 1100 segment spectrograms. The majority-class baseline of this set is 51.1%; that is, if we classified every segment with the most common label, we would classify 51.1% of the segments correctly.

[2] If no agreement could be reached, the audio file was excluded from the set.

5.2 Implementation Details

In our implementation we used the audioread and fft functions in MATLAB to read in each audio file, average its two channels and perform the Fast Fourier Transform. The spectrogram was converted to an RGB image using the jet(100) colormap in MATLAB. The composite spectrogram (four quadrants) was then resized to 227 × 227 pixels for input to the network using the imresize function in MATLAB, which uses bi-cubic interpolation.

We used a version of AlexNet that is available for download as the MATLAB add-on alexnet.[3] We trained it with the MATLAB Neural Network Toolbox via the trainNetwork function, using stochastic gradient descent with a learning rate of 0.01 and a learn-rate drop factor of 0.1. We ran all experiments on an Intel Core i7 980X workstation with a six-core 3.33 GHz CPU and 24 GB of RAM. It hosted two Nvidia Maxwell-based Titan X GPUs with 12 GB of video RAM each, which allowed us to run two learning trials in parallel (one trial per GPU).

[3] We used MATLAB for training because of the convenience of parameter sweeps and data analysis, as well as access to an existing code base.
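For concreteness, the following MATLAB sketch shows one way to wire these pieces together with trainNetwork. The image-folder layout, the reuse of the pretrained AlexNet weights, and the replacement of its final layers for two classes are assumptions on our part; the epoch and batch-size values shown are the best-performing ones from the sweep in Section 5.3, and MATLAB's 'sgdm' solver adds momentum to plain stochastic gradient descent.

  % A sketch of the training/testing setup (assumptions noted above).
  % Composite spectrograms, already resized to 227 x 227, are assumed to sit
  % in class-named subfolders such as spectrograms/train/British.
  imdsTrain = imageDatastore('spectrograms/train', ...
      'IncludeSubfolders', true, 'LabelSource', 'foldernames');
  imdsTest  = imageDatastore('spectrograms/test', ...
      'IncludeSubfolders', true, 'LabelSource', 'foldernames');

  net = alexnet;                           % pretrained AlexNet (MATLAB add-on)
  layers = [
      net.Layers(1:end-3)                  % keep the convolutional/FC stack
      fullyConnectedLayer(2)               % two accent classes
      softmaxLayer
      classificationLayer];

  opts = trainingOptions('sgdm', ...       % stochastic gradient descent (with momentum)
      'InitialLearnRate', 0.01, ...
      'LearnRateSchedule', 'piecewise', ...
      'LearnRateDropFactor', 0.1, ...
      'MaxEpochs', 50, ...                 % best-performing values from Section 5.3
      'MiniBatchSize', 5);

  trainedNet = trainNetwork(imdsTrain, layers, opts);

  % Per-segment test accuracy (Section 4.2).
  predicted = classify(trainedNet, imdsTest);
  perSegmentAccuracy = mean(predicted == imdsTest.Labels);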
5.3 Single-accent Classification

Spectrograms of the audio segments were divided into a training set, used to train the network, and a testing set. The training set contained 75% of the audio files of each class, while the remaining 25% were used for testing. We made sure that all spectrograms belonging to the same audio file were placed in the same set (i.e., there is no k with segments i_1, i_2 such that (I_k^{i_1}, l_k) ∈ I_t^train and (I_k^{i_2}, l_k) ∈ I_t^test).

There were four control parameters we varied for data preparation and network training: the number of epochs and the batch size, which configure the network training, and the number of frequency filters b and the time window size w, which determine the specification of the spectrograms. The values we considered are listed in Table 1.

Table 1: Control parameter space.
  Parameter            Values
  number of epochs     {10, 50, 100, 200}
  batch size           {3, 5, 10, 50, 100}
  window size w        {0.05, 0.025, 0.01, 0.001} seconds
  frequency filters b  {50, 75, 113, 227, 250, 280}

We did not know the best combination for the dataset at hand, so we conducted a parameter sweep. To reduce the sweep time we factored the parameter space into a product of two subspaces: one defined by the number of epochs and the batch size, and the other defined by the number of frequency filters and the time window size. We then fixed a single pair of parameters from the second subspace and tried all 4 · 5 combinations of parameters from the first subspace. For each combination we ran four independent trials of training and testing, splitting the dataset into training and testing partitions randomly on each trial. Test accuracy averaged over the four trials defined the quality of the parameter pair from the first subspace, given the fixed values from the second subspace. We then picked the highest-quality parameter pair from the first subspace and, keeping it fixed, swept the second subspace, trying all of its 4 · 6 pairs. If the best quality and the second-subspace parameters matched those found before, we stopped the process. Otherwise, we picked another (untried) parameter pair from the second subspace and repeated the steps above.

This factored sweep can stop short of finding the global optimum in the overall four-dimensional parameter space. On the positive side, it is likely to be faster as it sweeps a single two-dimensional subspace at a time. In our evaluation the process stopped after 4 iterations, each consisting of two subspace sweeps. Thus only 4 · (4 · 5 + 4 · 6) = 176 parameter combinations were tried in total (as opposed to the 4 · 5 · 4 · 6 = 480 that would be required to sweep the original space exhaustively).
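The factored sweep amounts to an alternating coordinate search over the two two-dimensional subspaces. The sketch below illustrates one possible loop structure: run_trials is a hypothetical stand-in for "train and test four times and return the mean test accuracy", and the simplified stopping test (stop when the second-subspace optimum no longer changes) is our reading of the procedure, not its exact implementation.

  % A sketch of the factored parameter sweep. run_trials is hypothetical.
  epochsVals = [10 50 100 200];          batchVals  = [3 5 10 50 100];
  windowVals = [0.05 0.025 0.01 0.001];  filterVals = [50 75 113 227 250 280];

  best2 = [windowVals(1) filterVals(1)];   % initial (w, b) pair from subspace 2
  prevBest2 = [];
  while ~isequal(best2, prevBest2)
      prevBest2 = best2;
      % Sweep subspace 1 (epochs x batch size) with (w, b) fixed.
      bestAcc = -inf;
      for e = epochsVals
          for bs = batchVals
              acc = run_trials(e, bs, best2(1), best2(2));
              if acc > bestAcc, bestAcc = acc; best1 = [e bs]; end
          end
      end
      % Sweep subspace 2 (window size x frequency filters) with (epochs, batch) fixed.
      bestAcc = -inf;
      for w = windowVals
          for b = filterVals
              acc = run_trials(best1(1), best1(2), w, b);
              if acc > bestAcc, bestAcc = acc; best2 = [w b]; end
          end
      end
  end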
We ran four trials per parameter combination and report the accuracy averaged over those trials. We found that the best parameters were 280 frequency filters and a window size of 0.05 seconds, with 50 epochs and a batch size of 5; these yield a test accuracy of A^per-segment = 63.25 ± 4%. We then locked in these control parameters and ran 10 additional trials. The resulting confusion matrix, averaged over these trials, is listed in Table 2, left.

File-level Labeling. We then examined test accuracy at the file level. As described earlier in the paper, the file-level test accuracy A_t^per-file is computed by training the network to classify segments but then labeling each file with a majority vote over its segments. For instance, if an audio file was split into 7 segments, and the network labeled 3 of them as American and 4 as British, we would label the entire audio file as British. We re-ran the experiment; the best parameters for this run were 250 frequency filters and a window size of 0.05 seconds, with 50 epochs and a batch size of 5, which yielded an average accuracy of A^per-file = 68 ± 3.6%. We ran 10 additional trials to compute the confusion matrix: Table 3, left.

Segment Duration. Given that some audio files were shorter than 3 seconds, we also tried a segment duration of 1 second. For per-segment accuracy, the best parameters were 280 frequency filters and a window size of 0.05 seconds, with 50 epochs and a batch size of 5. These parameters yielded an average accuracy of A^per-segment = 63.5 ± 3% over four trials, which is similar to that with 3-second segments. The corresponding confusion matrix (over 10 additional trials) is found in Table 2, right. For per-file accuracy the best parameters were 75 frequency filters and a window size of 0.01 seconds, with a test accuracy of A^per-file = 71 ± 4.5% averaged over 4 trials. The corresponding confusion matrix computed over 10 additional trials is found in Table 3, right.

Table 2: The confusion matrix for per-segment labeling. Left: 3-second segments. Right: 1-second segments.
  3-second segments                      1-second segments
                      Actual                                 Actual
  Classified as    British  American     Classified as    British  American
  British          63.6%    35.3%        British          64.3%    38.4%
  American         36.4%    64.7%        American         35.7%    61.6%

Table 3: The confusion matrix for per-file labeling. Left: 3-second segments. Right: 1-second segments.
  3-second segments                      1-second segments
                      Actual                                 Actual
  Classified as    British  American     Classified as    British  American
  British          67.4%    31.2%        British          71.2%    29.1%
  American         32.6%    68.8%        American         28.8%    70.9%

6 Current Challenges and Future Work

Humans may use certain speech features (e.g., the way the speaker pronounces 'r') to identify accents in an audio file. Those features are present only occasionally, and thus short audio files can be mislabeled by humans. Furthermore, human labelers can be inconsistent in their labels. Such problems with the dataset may reduce test accuracy. Future work will scale up the number of human labelers as well as the length of the files to produce a more accurate and consistent dataset. We will also attempt to train a network for more than two accents, including fantasy accents, and extend the control parameter space to gain a better understanding of how the parameters affect the accuracy of the network.

It will also be of interest to segment audio files in a content-aware way (instead of fixed 1- or 3-second segments). Doing so may also allow the classifier to automatically remove silent parts of an audio file and thus avoid diluting the dataset with meaningless data. Future work will compare the spectrogram-based representation of an audio file to mel-filter-bank coefficients (Palaz, Collobert, and Doss 2013; Song and Cai 2015; Zhang et al. 2017) as well as explore other neural networks such as VGG (Simonyan and Zisserman 2014). Finally, measuring the portability of a deep neural accent detector across games, as well as its sensitivity to background music, is a natural direction for future work.

7 Conclusions

Accent classification is an important task in video-game development (e.g., for quality control and pre-screening of voiceover auditions). In the spirit of reducing game-production costs we proposed and evaluated an approach for doing so automatically, via the use of deep learning. To keep our approach low-cost and accessible to practitioners outside of Computing Science, we used a readily available off-the-shelf deep neural network and a standard deep-learning method. We evaluated our approach on a database of voiceover files from the commercial video game Dragon Age: Origins, where the network achieved a test accuracy of 71%. These results demonstrate the promise of using off-the-shelf deep learning for game development and open a number of exciting follow-up directions.
8 Acknowledgments

We appreciate the support from the Kule Institute for Advanced Study (KIAS), the Social Sciences and Humanities Research Council of Canada (SSHRC) via the Refiguring Innovation in Games (ReFiG) project, the Alberta Conservation Association, the Alberta Biodiversity Monitoring Institute, and Nvidia.

References

Badshah, A. M.; Ahmad, J.; Rahim, N.; and Baik, S. W. 2017. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), 1–5.

BioWare. 2009. Dragon Age: Origins.

BioWare. 2010. Mass Effect 2.

Ensslin, A.; Goorimoorthee, T.; Carleton, S.; Bulitko, V.; and Poo Hernandez, S. 2017. Deep Learning for Speech Accent Detection in Videogames. In Proceedings of the Experimental AI in Games (EXAG) Workshop at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), 69–74.

Espi, M.; Fujimoto, M.; Kinoshita, K.; and Nakatani, T. 2015. Exploiting spectro-temporal locality in deep learning based acoustic event detection. EURASIP Journal on Audio, Speech, and Music Processing 2015(1):26.

Garofalo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.; and Dahlgren, N. L. 1993. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. Linguistic Data Consortium.

Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Graves, A.; Mohamed, A.-R.; and Hinton, G. 2013. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6645–6649.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29(6):82–97.

Huang, Z.; Dong, M.; Mao, Q.; and Zhan, Y. 2014. Speech Emotion Recognition Using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, 801–804.

Knight, E. C.; Poo Hernandez, S.; Bayne, E. M.; Bulitko, V.; and Tucker, B. V. 2018. Pre-processing spectrogram parameters improve the accuracy of birdsong classification using convolutional neural networks. Under review.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 1097–1105.

Palaz, D.; Collobert, R.; and Doss, M. M. 2013. Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks. arXiv preprint arXiv:1304.1018.

Simonyan, K., and Zisserman, A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.

Song, W., and Cai, J. 2015. End-to-End Deep Neural Network for Automatic Speech Recognition. Technical report.

Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Laurent, C.; Bengio, Y.; and Courville, A. 2017. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. arXiv preprint arXiv:1701.02720.