1 Introduction

Deep Neural Networks for Acoustic Mod- eling in Speech Recognition: The shared views of four re- search groups. IEEE Signal Processing Magazine

Deep Learning for Classification of Speech Accents in Video Games

Sergio Poo Hernandez and Vadim Bulitko

bulitkog@ualberta.ca fpooherna j bulitkog@ualberta.ca 0

Shelby Carleton

scarleto@ualberta.ca 1

Astrid Ensslin and Tejasvi Goorimoorthee

fensslin j tejasvig@ualberta.ca tejasvig@ualberta.ca 2 0 Computing Science, University of Alberta , Edmonton, AB 1 English and Film Studies, University of Alberta , Edmonton, AB 2 Humanities Computing, University of Alberta , Edmonton, AB

2012

29 6

In many video games, a wide range of non-playable characters make up the worlds players inhabit. Such characters are often voiced by human actors and their accents can have an influence on their perceived moral inclination, level of trustworthiness, social class, level of education and ethnic background. We use deep learning to train a neural network to classify speech accents. Such a machine-learned tool would provide game developers with an ability to analyze accent distribution in their titles as well as possibly help screening voiceover actors applying for a role. To make the approach accessible we used a readily available off the shelf deep network and commodity GPU hardware. Preliminary results are promising with a 71% test accuracy achieved over two accents in a commercial video game.

1 Introduction

Modern video games often feature numerous non-playable characters (NPCs) that populate the in-game world, contributing to the atmosphere, gameplay and storytelling. Such characters are usually available to interact with the player and are frequently voiced by well-known actors (e.g., Martin Sheen in Mass Effect 2 (BioWare 2010) ). As in movies, different accents in the same language (e.g., English) contribute to an ethnic, social and moral image of an NPC. Thus it is important for game developers to be aware of and control assignment of accents to in-game characters. Having a fast and low-cost way of determining an accent of a voiceover can help developers screen sent-in audition files or take an inventory of accents within a development prototype.

In this paper we demonstrate how machine learning can be used to automatically classify speech accents in videogame voiceovers. The approach is designed to be accessible to small-scale game developers and individual researchers in the field of game studies. Specifically, we train an offthe-self deep neural network on commodity hardware. The network achieved a 71% test accuracy over American and British accents in the commercial videogame Dragon Age: Origins (BioWare 2009) .

The rest of the paper is organized as follows. We formulate the problem precisely in Section 2, then discuss related work in Section 3. We then present our approach in Section 4 and detail results of an empirical evaluation in Section 5. The paper is concluded with a discussion of the current shortcomings and the corresponding future work.

Problem Formulation

The problem is mapping from an audio file containing speech to an accent label from a pre-determined set. A file is assumed to have a single speaker whose accent is consistent throughout the file. We evaluate performance of such a mapping by measuring its accuracy on a test set of files. The objective is then to increase the test accuracy while keeping the approach accessible to game developers as well as researchers from different disciplines.

Related Work

In speech recognition Graves, Mohamed, and Hinton (2013) used recurrent neural networks (Goodfellow, Bengio, and Courville 2016) to recognize phonemes in the TIMIT database (Garofalo et al. 1993) . Other work in phoneme classification in speech signals using convolutional neural networks (CNN) (Palaz, Collobert, and Doss 2013; Song and Cai 2015; Zhang et al. 2017) used logarithmic mel-filterbank coefficients and hybrid networks composed of a CNN and a recurrent neural networks (RNN). Their primary task is different from ours in that they are identifying phonemes to recognize words as opposed to accents. Yet other work on phoneme recognition (Hinton et al. 2012) highlighted the importance of weight initialization when recognizing phonemes. Once again the problem they were tackling is substantively different from ours. Research by Espi et al. (2015) of acoustic event detection emphasized the importance and feasibility of local feature extraction in detecting and classifying non-speech acoustic events occurring, for instance, in conversation scenes. This work is indirectly related to our current work as it does not label accents in a conversation, however it can be combined with our approach to detect and remove sections of the audio without speech.

Work has been done on detecting emotions in speech through spectrograms (Huang et al. 2014; Badshah et al. 2017). While this is not the task we were trying to solve, it is similar in its use of spectrograms as the input to their neural network.

Recently we explored training a neural network on an existing (non-video-game) accent database and then used the trained network to detect accents in audio files from a video game (Ensslin et al. 2017) . They thought that training on curated accent database would yield a better classifier. There were two problems with their approach. First it required access to a separate accent database. Second, their test accuracy was poor. Our approach is similar but trains on videogame audio files directly and yields better test accuracy. 4

Our Approach

To keep our approach accessible to a broad set of game developers and researchers, we used a common off the shelf deep neural network: AlexNet (Krizhevsky, Sutskever, and Hinton 2012). This approach has yielded state-of-the-art results when classifying bird species using their song (Knight et al. 2018). 4.1

Converting Audio to Images

As AlexNet was originally designed to classify images, we converted audio files to spectrograms in a fashion similar to our previous work (Ensslin et al. 2017) . Specifically, each spectrogram consisted of four different image quadrants computed by Algorithm 1 as follows.

The audio file S gets partitioned into m parts of w seconds each (line 3). For each part sx, we apply the Fast Fourier Transform to it, resulting in a sequence of amplitudes A (line 4). We remove all amplitudes for frequencies below fmin and above fmax (line 5). We then partition the remaining frequency range [fmin; fmax] into b linearly (if Lf holds) or logarithmically spaced segments (line 6). For each segment B(y) we sum the corresponding amplitudes into the scalar ay (line 8). We then map ay to a spectrogram pixel I(x; y) using a color mapping C (line 9). Optionally we take a logarithm of the amplitude ay (if La is false).

Algorithm 1: Create spectrogram quadrant input : S; fmin; fmax; w; b; Lf ; La; C output: spectrogram image I 1 m l jwSj m 2 for x 2 f1; 2; : : : ; mg do 3 sx xth window from S 4 A fft(sx) 5 6

Figure 1 illustrates the process on a simple piano-roll audio file. The top-left quadrant of the image is produced by setting both Lf and La to true and therefore uses linearly spaced frequency-range segments B. The top-right quadrant has La set to false and thus runs a logarithm on cumulative Linear frequency

Log frequency

Linear amplitude

Log amplitude amplitude ay before converting it to the RGB color via color mapping C (which maps lower amplitudes to colder/blue colors and higher amplitudes to warmer/red colors). The bottom-left quadrant spaces frequency-range segments logarithmically (i.e., Lf is set to false). Finally, the bottom-right quadrant uses logarithmically spaced frequency-range segments as well as applies logarithm to amplitudes.

Each quadrant is a color image of the height of b pixels and the width of m pixels. The resulting composite spectrogram thus has 2b rows and 2m columns. Figure 2 shows a composite spectrogram of an actual videogame audio file. Algorithm 1 converts a set of audio files fSkg to a set of spectrograms Ik. As the audio files increase in duration, the width of each quadrant (m pixels) will necessarily increase in order to keep the same temporal resolution. Since off the shelf deep neural networks tend to require the input image to be of a fixed size, we divided each audio file Sk into segments fSkig of up to s seconds. This ensures that the spectrogram maintains a temporal resolution of at least ms pixels per second of audio.

Our original dataset of audio files and their accent labels f(Sk; lk)g thus becomes a dataset of audio segments each of which inherits the accent label of the original file: f(Ski; lk)g. Once converted to spectrograms by Algorithm 1 these become f(Iki ; lk)g.

To get robust results, and avoid overfitting the data, we conduct the training and testing process in the standard fashion with T independent trials. On each trial t, we split the dataset f(Sk; lk)g into (complete) audio files to be used for training and 1 which are used for testing: f(Sk; lk)g = Sttrain [ Sttest with jSttrainj = b jf(Sk; lk)gjc. Expressed at the level of spectrograms of audio segments we have f(Iki ; lk)g = Itrain [ Ittest.

t t

On each trial t, we train a neural network on Itrain using three hyperparameters: the number of epochs, the batch size for stochastic gradient descent and the learning rate. Once the network Nt is trained we freeze its weights and test it on Ittest. The per-segment accuracy of the trained network is the percentage of audio file segments for which the accent level output by the network matched that in the test set: Aper-segment = jf(Iki ; lk) 2 Itest j Nt(Iki ) = lkgj :

t t t jItestj The per-segment accuracy of the net is then averaged over all T trials: Aper-segment = avgt Aper-segment. t

We also calculate per-file accuracy. For that we run the network on all segments comprising an audio file from the test set and take the majority vote on the labels the network produces.1 Thus we define Nt(Ik) as the majority vote of the network’s classifications of each segment: Nt(Iki ). Then the per-file accuracy is defined as:

t Aper-file = jf(Ik; lk) 2 Itest j Nt(Ik) = lkgj : t t

jItestj As before: Aper-file = avgt Atper-file.

Empirical Evaluation

In this section we will present a specific implementation of our approach and an evaluation of its performance on audio files from a commercial video game. 5.1

Data Collection

We used voiceover audio files captured from Dragon Age: Origins (BioWare 2009) — a game with a wide variety of accents and characters.

The background music in the game was turned off so only the speech is present. We tried to capture as many NPCs 1If an equal number of segments was labeled with the same accent then we break the tie between the labels in favor of the label of the earliest such segment. For instance, if a five-segment audio file is labeled by the network as [British, British, American, American, Spanish] then we break the tie between British and American in favor of British. This method was used since we initially assumed there are no ties, so we always select the first most frequent segment-level label. as possible and only one recording per NPC was used to form our dataset. The audio files were separately labeled by three individuals Each individual listened to each audio file and labeled its accent. Then the multiple labelers compared their labels and debated any differences until a consensus was reached.2 This process resulted in 295 audio files such that each file contained a single speaker labeled with a single accent label: 147 with an American accent and 148 with a British accent.

The audio files were from 2 to 40 seconds in duration. Using a segment length of 3 seconds, we created a data set f(Iki ; lk)g of 1100 segment spectrograms. The majority classification average of this set is 51:1%; this means that if we classified the data by selecting the label with the most elements we would classify 51:1% of the segments correctly. 5.2

Implementation Details

In our implementation of the approach we used the audioread and fft functions in MATLAB to read each audio file in, average the two channels of the clip and perform the Fast Fourier transform. The spectrogram was converted to an RGB image using jet(100) colormap in MATLAB. The composite spectrogram (four quadrants) was then resized to 227 227 pixels for input to the network using the imresize function in MATLAB, which uses a bi-cubic interpolation method.

We used a version of AlexNet that is available for download as a MATLAB add-on alexnet.3 We trained it with the MATLAB neural network toolbox via trainNetwork function using stochastic gradient descent with a learning rate of 0:01 with a drop learn rate factor of 0:1. We ran all experiments on an Intel Core i7 980X workstation with a six-core 3:33 GHz CPU and 24Gb of RAM. It hosted two Nvidia Maxwell-based Titan X GPUs with 12Gb of videoRAM each. This allowed us to run two learning trials in parallel (one trial per GPU). 5.3

Single-accent Classification

Spectrograms for audio segments were divided into a training set, used to train the network, and a testing set. The training set contained 75% of the audio files of each class, while the remaining 25% were used in the testing set. We made sure that all spectrograms belonging to the same audio file are in the same set (i.e., 8k@i1; i2 (Iki1 ; lk) 2 Itrain & (Iki2 ; lk) 2 Itest ).

There were four control parameters we varied for the data preparation and network training: the number of epochs and the batch size which relate to the network training configuration; the number of frequency filters b and the time window size w which determine the specification of the spectrograms. We did not know the best combination for the dataset at hand so we conducted a parameter sweep. To reduce the sweep time we factored the parameter space into a product 2If no agreement could be achieved then the audio file was excluded from the set.

3We used MATLAB for training because of the convenience of parameter sweep and data analysis as well as access to an existing code base. of two subspaces: one defined by the number of epochs and the batch size and the other defined by the number of frequency filters and the time window size.

We then fixed a single pair of parameters from the second subspace and tried all 4 5 combinations of parameters from the first subspace. For each try we ran four independent trials of training and testing, splitting the dataset into training and testing partitions randomly on each trial. Test accuracy averaged over the four trials defined the quality of the parameter pair from the first subspace, given the fixed values from the second subspace. We then picked the highest-quality parameter pair from the first subspace and, keeping it fixed, swept the second subspace trying all of its 4 6 pairs. If the best quality and the second-subspace parameters found matched those found before, we stopped the process. Otherwise, we picked another (untried) parameter pair from the second subspace and repeated the steps above.

This factored sweep can stop short of finding the global optimum in the overall four-dimensional parameter space. On the positive side, it is likely to be faster as it sweeps a single two-dimensional subspace at a time. In our evaluation the process stopped after 4 iterations, each consisting of two subspace sweeps. Thus only 4 (4 5+4 6) = 176 parameter combinations were tried in total (as opposed to 4 5 4 6 = 480 that would be required to sweep the original space).

We ran four trials per parameter combination and reported the average accuracy of the four trials. We found that the best parameters were 280 frequency filters and window size of 0:05 seconds, 50 epochs and a batch size of 5; these yield a test accuracy of Aper-segment = 63:25 4%.

We then locked in the control parameters listed above and ran 10 additional trials. The resulting confusion matrix averaged over the four trials is listed in Table 2, left. File-level Labeling. We then examined test accuracy at the file level. As described earlier in the paper file-level test accuracy Aper-file is computed by training the network to clast sify segments but then labeling the file with a majority vote over the segments. For instance, if an audio file was split into 7 segments, and the network labeled 3 of them as American accent and 4 as British accent, we would label the entire audio file as British.

We re-ran the experiment, the best parameters for this run were 250 frequency filters and a window size of 0:05 seconds, 50 epochs and a batch size of 5, which yielded an average accuracy of Aper-file = 68 3:6%. We ran 10 additional trials to compute the confusion matrix: Table 3, left. Segment Duration. Given that some audio files were shorter than 3 seconds, we also tried the segment duration of 1 second. For per-segment accuracy, the best parameters were 280 frequency filters and a window size of 0:05 seconds, 50 epochs and a batch size of 5. These parameters yielded an average accuracy of Aper-segment = 63:5 3% over four trials which is similar to that with 3-second segments. The corresponding confusion matrix (over additional 10 trials) is found in Table 2, right. For per-file accuracy the best parameters were 75 frequency filters and a window size of 0:01 seconds with a test accuracy of Aper-file = 71 4:5% averaged over 4 trials. The corresponding confusion matrix computed over 10 additional trials is found in Table 3, right. 6

Current Challenges and Future Work

Humans may use certain speech features (e.g., the way the speaker pronounces ‘r’) to identify accents in an audio file. Those features are present only occasionally and thus short audio files can be mislabeled by humans. Furthermore, human labelers can be inconsistent in their labels. Such problems with the dataset may reduce test accuracy. Future work will scale up the number of human labelers as well as the length of the files to produce a more accurate/consistent dataset. We will also attempt to train a network for more than two accents, including fantasy accents. We will also extend the space of control parameter space to gain a better understanding on how they affect the accuracy of the network.

It will also be of interest to segment audio files in a content-aware way (instead of fixed 1- or 3-second segments). Doing so may also allow the classifier to automatically remove silent parts of an audio file and thus avoid dilution of dataset with meaningless data. Future work will compare the spectrogram-based representation of an audio file to mel-filter-bank coefficients (Palaz, Collobert, and Doss 2013; Song and Cai 2015; Zhang et al. 2017) as well as use other neural networks such as VGG (Simonyan and Zisserman 2014).

Finally, measuring portability of a deep neural accent detector across games as well as its sensitivity to background music is a natural direction for future work.

Conclusions

Accent classification is an important task in video-game development (e.g., for quality control and pre-screening for voiceover auditions). In the spirit of reducing gameproduction costs we proposed and evaluated an approach for doing so automatically, via the use of deep learning. To keep our approach low-cost and accessible to practitioners outside of Computing Science, we used a readily available off the shelf deep neural network and a standard deep learning method. We evaluated our approach on a database of voiceover files from a commercial video game Dragon Age: Origins where the network achieved the test accuracy of 71%. These results demonstrate a promise of using off the shelf deep learning for game development and open a number of exciting follow-up directions.

Acknowledgments

We appreciate the support from Kule Institute for Advanced Study (KIAS), the Social Sciences and Humanities Council

Classified as

of Canada (SSHRC) via the Refiguring Innovation in Games (ReFiG) project, the Alberta Conservation Association, the Alberta Biodiversity Monitoring Institute, and Nvidia,.

Courville, A.

Press. http: Graves, A.; Mohamed, A.-R.; and Hinton, G. 2013. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6645–6649.

Huang, Z.; Dong, M.; Mao, Q.; and Zhan, Y. 2014. Speech Emotion Recognition Using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, 801– 804.

Knight, E. C.; Poo Hernandez, S.; Bayne, E. M.; Bultiko, V.; and Tucker, B. V. 2018. Pre-processing spectrogram parameters improve the accuracy of birdsong classification using convolutional neural networks. Under review. Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 1097–1105.

Palaz, D.; Collobert, R.; and Doss, M. M. 2013. Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks. arXiv preprint arXiv:1304.1018.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Song, W., and Cai, J. 2015. End-to-End Deep Neural Network for Automatic Speech Recognition. Technical Report. Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Bengio, C. L. Y.; and Courville, A. 2017. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. arXiv preprint arXiv:1701.02720.

2017. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network . In Proceedings of 2017 International Conference on Platform Technology and Service (PlatCon) , 1 - 5 .

BioWare.

2009 . Dragon Age: Origins.

BioWare.

2010 . Mass Effect 2 .

Ensslin , A. ; Goorimoorthee , T. ; Carleton , S. ; Bulitko , V. ; and

Poo

Hernandez , S. 2017 . Deep Learning for Speech Accent Detection in Videogames . In Proceedings of the Experimental AI in Games (EXAG) Workshop at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE) , 69 - 74 .

2015. Exploiting spectro-temporal locality in deep learning based acoustic event detection . EURASIP Journal on Audio, Speech, and Music Processing 2015 ( 1 ): 26 .

Garofalo , J. S. ; Lamel , L. F. ; Fisher, W. M.; Fiscus , J. G. ; Pallett , D. S. ; and Dahlgren , N. L. 1993 . The DARPA TIMIT acoustic-phonetic continuous speech corpus cdrom .

Goodfellow , I. ; Bengio, Y. ; and 2016. Deep Learning . MIT //www.deeplearningbook.org.