=Paper=
{{Paper
|id=Vol-2282/EXAG_114
|storemode=property
|title=Deep Learning for Classification of Speech Accents in Video Games
|pdfUrl=https://ceur-ws.org/Vol-2282/EXAG_114.pdf
|volume=Vol-2282
|authors=Sergio Poo Hernandez,Vadim Bulitko,Shelby Carleton,Astrid Ensslin,Tejasvi Goorimoorthee
|dblpUrl=https://dblp.org/rec/conf/aiide/HernandezBCEG18
}}
==Deep Learning for Classification of Speech Accents in Video Games==
Deep Learning for Classification of Speech Accents in Video Games

Sergio Poo Hernandez and Vadim Bulitko (Computing Science, University of Alberta, Edmonton, AB; {pooherna | bulitko}@ualberta.ca), Shelby Carleton (English and Film Studies, University of Alberta, Edmonton, AB; scarleto@ualberta.ca), Astrid Ensslin and Tejasvi Goorimoorthee (Humanities Computing, University of Alberta, Edmonton, AB; {ensslin | tejasvi}@ualberta.ca)

Abstract

In many video games, a wide range of non-playable characters make up the worlds players inhabit. Such characters are often voiced by human actors, and their accents can have an influence on their perceived moral inclination, level of trustworthiness, social class, level of education and ethnic background. We use deep learning to train a neural network to classify speech accents. Such a machine-learned tool would provide game developers with an ability to analyze accent distribution in their titles as well as possibly help screen voiceover actors applying for a role. To make the approach accessible we used a readily available off-the-shelf deep network and commodity GPU hardware. Preliminary results are promising, with a 71% test accuracy achieved over two accents in a commercial video game.

1 Introduction

Modern video games often feature numerous non-playable characters (NPCs) that populate the in-game world, contributing to the atmosphere, gameplay and storytelling. Such characters are usually available to interact with the player and are frequently voiced by well-known actors (e.g., Martin Sheen in Mass Effect 2 (BioWare 2010)). As in movies, different accents in the same language (e.g., English) contribute to an ethnic, social and moral image of an NPC. Thus it is important for game developers to be aware of and control the assignment of accents to in-game characters. Having a fast and low-cost way of determining the accent of a voiceover can help developers screen sent-in audition files or take an inventory of accents within a development prototype.

In this paper we demonstrate how machine learning can be used to automatically classify speech accents in video-game voiceovers. The approach is designed to be accessible to small-scale game developers and individual researchers in the field of game studies. Specifically, we train an off-the-shelf deep neural network on commodity hardware. The network achieved a 71% test accuracy over American and British accents in the commercial videogame Dragon Age: Origins (BioWare 2009).

The rest of the paper is organized as follows. We formulate the problem precisely in Section 2, then discuss related work in Section 3. We then present our approach in Section 4 and detail the results of an empirical evaluation in Section 5. The paper is concluded with a discussion of the current shortcomings and the corresponding future work.

2 Problem Formulation

The problem is mapping from an audio file containing speech to an accent label from a pre-determined set. A file is assumed to have a single speaker whose accent is consistent throughout the file. We evaluate the performance of such a mapping by measuring its accuracy on a test set of files. The objective is then to increase the test accuracy while keeping the approach accessible to game developers as well as researchers from different disciplines.

3 Related Work

In speech recognition, Graves, Mohamed, and Hinton (2013) used recurrent neural networks (Goodfellow, Bengio, and Courville 2016) to recognize phonemes in the TIMIT database (Garofalo et al. 1993). Other work on phoneme classification in speech signals using convolutional neural networks (CNNs) (Palaz, Collobert, and Doss 2013; Song and Cai 2015; Zhang et al. 2017) used logarithmic mel-filter-bank coefficients and hybrid networks composed of a CNN and a recurrent neural network (RNN). Their primary task is different from ours in that they identify phonemes in order to recognize words, as opposed to accents. Yet other work on phoneme recognition (Hinton et al. 2012) highlighted the importance of weight initialization when recognizing phonemes. Once again, the problem they were tackling is substantively different from ours.
Research by Espi et al. (2015) on acoustic event detection emphasized the importance and feasibility of local feature extraction in detecting and classifying non-speech acoustic events occurring, for instance, in conversation scenes. This work is only indirectly related to ours, as it does not label accents in a conversation; however, it could be combined with our approach to detect and remove sections of the audio without speech.

Work has also been done on detecting emotions in speech through spectrograms (Huang et al. 2014; Badshah et al. 2017). While this is not the task we are trying to solve, it is similar in its use of spectrograms as the input to a neural network.

Recently we explored training a neural network on an existing (non-video-game) accent database and then using the trained network to detect accents in audio files from a video game (Ensslin et al. 2017), on the assumption that training on a curated accent database would yield a better classifier. That approach had two problems: it required access to a separate accent database, and its test accuracy was poor. The approach in this paper is similar but trains on video-game audio files directly and yields better test accuracy.

4 Our Approach

To keep our approach accessible to a broad set of game developers and researchers, we used a common off-the-shelf deep neural network: AlexNet (Krizhevsky, Sutskever, and Hinton 2012). This approach has yielded state-of-the-art results when classifying bird species by their song (Knight et al. 2018).

4.1 Converting Audio to Images

As AlexNet was originally designed to classify images, we converted audio files to spectrograms in a fashion similar to our previous work (Ensslin et al. 2017). Specifically, each spectrogram consisted of four image quadrants computed by Algorithm 1 as follows. The audio file S gets partitioned into m parts of w seconds each (line 3). For each part s_x, we apply the Fast Fourier Transform, resulting in a sequence of amplitudes A (line 4). We remove all amplitudes for frequencies below f_min and above f_max (line 5). We then partition the remaining frequency range [f_min, f_max] into b linearly (if L_f holds) or logarithmically spaced segments (line 6). For each segment B(y) we sum the corresponding amplitudes into the scalar a_y (line 8). We then map a_y to a spectrogram pixel I(x, y) using a color mapping C (line 9). Optionally, we take the logarithm of the amplitude a_y (if L_a is false).

Algorithm 1: Create spectrogram quadrant
  input:  S, f_min, f_max, w, b, L_f, L_a, C
  output: image I
  1  m ← ⌈|S| / w⌉
  2  for x ∈ {1, 2, ..., m} do
  3      s_x ← x-th window from S
  4      A ← fft(s_x)
  5      A ← A restricted to [f_min, f_max]
  6      B ← linspace(f_min, f_max, b) if L_f, else logspace(f_min, f_max, b)
  7      for y ∈ {1, 2, ..., b} do
  8          a_y ← sum of A over segment B(y)
  9          I(x, y) ← C(a_y) if L_a, else C(log(a_y))
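To make Algorithm 1 concrete, the following is a minimal MATLAB sketch of computing one spectrogram quadrant under our reading of the pseudocode. The function name, the use of b+1 band edges, and the whole-image normalization applied before the colormap are illustrative assumptions rather than the exact code of our implementation (see Section 5.2).

  % A minimal sketch of Algorithm 1 (one spectrogram quadrant); names are ours.
  %   S: mono audio samples, Fs: sample rate (Hz), w: window length (s),
  %   fmin/fmax: frequency band, b: number of frequency filters,
  %   Lf/La: linear-frequency / linear-amplitude flags, C: colormap, e.g. jet(100).
  function I = spectrogram_quadrant(S, Fs, fmin, fmax, w, b, Lf, La, C)
      n = round(w * Fs);                      % samples per window
      m = ceil(numel(S) / n);                 % number of windows (line 1)
      if Lf
          B = linspace(fmin, fmax, b + 1);    % linearly spaced band edges (line 6)
      else
          B = logspace(log10(fmin), log10(fmax), b + 1);  % log-spaced edges
      end
      M = zeros(b, m);                        % summed amplitude per (band, window)
      for x = 1:m                                          % line 2
          sx = S((x-1)*n + 1 : min(x*n, numel(S)));        % line 3
          A  = abs(fft(sx));                               % line 4
          f  = (0:numel(A)-1)' * Fs / numel(A);            % frequency of each bin
          for y = 1:b                                      % line 7
              ay = sum(A(f >= B(y) & f < B(y+1)));         % lines 5 and 8
              if ~La
                  ay = log(ay + eps);                      % optional log amplitude
              end
              M(y, x) = ay;
          end
      end
      % Map amplitudes to RGB via colormap C (low = cold/blue, high = warm/red).
      idx = 1 + round((M - min(M(:))) ./ (max(M(:)) - min(M(:)) + eps) * (size(C, 1) - 1));
      I = ind2rgb(idx, C);                                 % line 9
  end

A composite spectrogram as in Figure 1 would stack the outputs of this function for the four combinations of L_f and L_a.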
[Figure 1: A composite spectrogram of a piano-roll audio file. The four quadrants correspond to linear vs. logarithmic frequency spacing (rows) and linear vs. logarithmic amplitude (columns).]

Figure 1 illustrates the process on a simple piano-roll audio file. The top-left quadrant of the image is produced by setting both L_f and L_a to true and therefore uses linearly spaced frequency-range segments B. The top-right quadrant has L_a set to false and thus applies a logarithm to the cumulative amplitude a_y before converting it to an RGB color via the color mapping C (which maps lower amplitudes to colder/blue colors and higher amplitudes to warmer/red colors). The bottom-left quadrant spaces the frequency-range segments logarithmically (i.e., L_f is set to false). Finally, the bottom-right quadrant uses logarithmically spaced frequency-range segments and also applies the logarithm to amplitudes.

Each quadrant is a color image with a height of b pixels and a width of m pixels. The resulting composite spectrogram thus has 2b rows and 2m columns. Figure 2 shows a composite spectrogram of an actual videogame audio file.

[Figure 2: A composite spectrogram of a voiceover file with an American accent.]

4.2 Training and Testing the Network

Algorithm 1 converts a set of audio files {S_k} to a set of spectrograms {I_k}. As the audio files increase in duration, the width of each quadrant (m pixels) would have to increase in order to keep the same temporal resolution. Since off-the-shelf deep neural networks tend to require the input image to be of a fixed size, we divided each audio file S_k into segments {S_k^i} of up to s seconds. This ensures that the spectrogram maintains a temporal resolution of at least m/s pixels per second of audio.

Our original dataset of audio files and their accent labels {(S_k, l_k)} thus becomes a dataset of audio segments, each of which inherits the accent label of the original file: {(S_k^i, l_k)}. Once converted to spectrograms by Algorithm 1, these become {(I_k^i, l_k)}.

To get robust results and avoid overfitting the data, we conduct the training and testing process in the standard fashion with T independent trials. On each trial t, we split the dataset {(S_k, l_k)} at the level of complete audio files, using a fraction α for training and the remaining 1 − α for testing: {(S_k, l_k)} = S_t^train ∪ S_t^test with |S_t^train| = ⌊α |{(S_k, l_k)}|⌋. Expressed at the level of spectrograms of audio segments, we have {(I_k^i, l_k)} = I_t^train ∪ I_t^test.

On each trial t, we train a neural network on I_t^train using three hyperparameters: the number of epochs, the batch size for stochastic gradient descent, and the learning rate. Once the network N_t is trained, we freeze its weights and test it on I_t^test. The per-segment accuracy of the trained network is the percentage of audio-file segments for which the accent label output by the network matches the one in the test set:

  A_t^per-segment = |{(I_k^i, l_k) ∈ I_t^test : N_t(I_k^i) = l_k}| / |I_t^test|.

The per-segment accuracy of the network is then averaged over all T trials: A^per-segment = avg_t A_t^per-segment.

We also calculate per-file accuracy. For that we run the network on all segments comprising an audio file from the test set and take a majority vote on the labels the network produces.[1] Thus we define N_t(I_k) as the majority vote of the network's classifications N_t(I_k^i) of the individual segments. The per-file accuracy is then defined as:

  A_t^per-file = |{(I_k, l_k) ∈ I_t^test : N_t(I_k) = l_k}| / |I_t^test|.

As before, A^per-file = avg_t A_t^per-file.

[1] If two or more labels are tied for the most segments, we break the tie in favor of the label of the earliest such segment. For instance, if a five-segment audio file is labeled by the network as [British, British, American, American, Spanish], then we break the tie between British and American in favor of British. This rule arose because we initially assumed there would be no ties, so we always select the first most-frequent segment-level label.
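The per-file majority vote with the tie-breaking rule from footnote [1] can be sketched in a few lines of MATLAB; the function and variable names below are illustrative and are not part of the implementation described in Section 5.2.

  % Majority vote over segment labels with the tie-break from footnote [1]:
  % among the labels tied for the most segments, the label whose segment
  % appears earliest in the file wins.
  function fileLabel = majority_vote(segmentLabels)
      % segmentLabels: cell array of per-segment labels in temporal order,
      % e.g. {'British','British','American','American','Spanish'}
      [labels, ~, idx] = unique(segmentLabels, 'stable');  % keep first-seen order
      counts = accumarray(idx(:), 1);                      % votes per distinct label
      tied = find(counts == max(counts));                  % labels tied at the top
      fileLabel = labels{tied(1)};                         % earliest first occurrence wins
  end

On the example from footnote [1] this returns British, matching the rule described above.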
5 Empirical Evaluation

In this section we present a specific implementation of our approach and an evaluation of its performance on audio files from a commercial video game.

5.1 Data Collection

We used voiceover audio files captured from Dragon Age: Origins (BioWare 2009), a game with a wide variety of accents and characters. The background music in the game was turned off so that only the speech is present. We tried to capture as many NPCs as possible, and only one recording per NPC was used to form our dataset. The audio files were labeled independently by three individuals: each listened to every audio file and labeled its accent, after which the labelers compared their labels and debated any differences until a consensus was reached.[2] This process resulted in 295 audio files such that each file contained a single speaker labeled with a single accent: 147 with an American accent and 148 with a British accent.

The audio files were from 2 to 40 seconds in duration. Using a segment length of 3 seconds, we created a dataset {(I_k^i, l_k)} of 1100 segment spectrograms. The majority-class baseline of this set is 51.1%; that is, if we classified every segment with the most common label, we would classify 51.1% of the segments correctly.

[2] If no agreement could be reached, the audio file was excluded from the set.

5.2 Implementation Details

In our implementation we used the audioread and fft functions in MATLAB to read in each audio file, average its two channels and perform the Fast Fourier Transform. The spectrogram was converted to an RGB image using the jet(100) colormap in MATLAB. The composite spectrogram (four quadrants) was then resized to 227 × 227 pixels for input to the network using the imresize function in MATLAB, which uses bi-cubic interpolation.

We used a version of AlexNet that is available for download as the MATLAB add-on alexnet.[3] We trained it with the MATLAB Neural Network Toolbox via the trainNetwork function, using stochastic gradient descent with a learning rate of 0.01 and a learn-rate drop factor of 0.1. We ran all experiments on an Intel Core i7 980X workstation with a six-core 3.33 GHz CPU and 24 GB of RAM. It hosted two Nvidia Maxwell-based Titan X GPUs with 12 GB of video RAM each, which allowed us to run two learning trials in parallel (one trial per GPU).

[3] We used MATLAB for training because of the convenience of parameter sweeps and data analysis, as well as access to an existing code base.
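For concreteness, the following MATLAB sketch shows one way to wire these pieces together with trainNetwork. The image-folder layout, the reuse of the pretrained AlexNet weights, and the replacement of its final layers for two classes are assumptions on our part; the epoch and batch-size values shown are the best-performing ones from the sweep in Section 5.3, and MATLAB's 'sgdm' solver adds momentum to plain stochastic gradient descent.

  % A sketch of the training/testing setup (assumptions noted above).
  % Composite spectrograms, already resized to 227 x 227, are assumed to sit
  % in class-named subfolders such as spectrograms/train/British.
  imdsTrain = imageDatastore('spectrograms/train', ...
      'IncludeSubfolders', true, 'LabelSource', 'foldernames');
  imdsTest  = imageDatastore('spectrograms/test', ...
      'IncludeSubfolders', true, 'LabelSource', 'foldernames');

  net = alexnet;                           % pretrained AlexNet (MATLAB add-on)
  layers = [
      net.Layers(1:end-3)                  % keep the convolutional/FC stack
      fullyConnectedLayer(2)               % two accent classes
      softmaxLayer
      classificationLayer];

  opts = trainingOptions('sgdm', ...       % stochastic gradient descent (with momentum)
      'InitialLearnRate', 0.01, ...
      'LearnRateSchedule', 'piecewise', ...
      'LearnRateDropFactor', 0.1, ...
      'MaxEpochs', 50, ...                 % best-performing values from Section 5.3
      'MiniBatchSize', 5);

  trainedNet = trainNetwork(imdsTrain, layers, opts);

  % Per-segment test accuracy (Section 4.2).
  predicted = classify(trainedNet, imdsTest);
  perSegmentAccuracy = mean(predicted == imdsTest.Labels);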
5.3 Single-accent Classification

Spectrograms of the audio segments were divided into a training set, used to train the network, and a testing set. The training set contained 75% of the audio files of each class, while the remaining 25% were used for testing. We made sure that all spectrograms belonging to the same audio file were placed in the same set (i.e., there is no k with segments i_1, i_2 such that (I_k^{i_1}, l_k) ∈ I_t^train and (I_k^{i_2}, l_k) ∈ I_t^test).

There were four control parameters we varied for data preparation and network training: the number of epochs and the batch size, which configure the network training, and the number of frequency filters b and the time window size w, which determine the specification of the spectrograms. The values we considered are listed in Table 1.

Table 1: Control parameter space.
  Parameter            Values
  number of epochs     {10, 50, 100, 200}
  batch size           {3, 5, 10, 50, 100}
  window size w        {0.05, 0.025, 0.01, 0.001} seconds
  frequency filters b  {50, 75, 113, 227, 250, 280}

We did not know the best combination for the dataset at hand, so we conducted a parameter sweep. To reduce the sweep time we factored the parameter space into a product of two subspaces: one defined by the number of epochs and the batch size, and the other defined by the number of frequency filters and the time window size. We then fixed a single pair of parameters from the second subspace and tried all 4 · 5 combinations of parameters from the first subspace. For each combination we ran four independent trials of training and testing, splitting the dataset into training and testing partitions randomly on each trial. Test accuracy averaged over the four trials defined the quality of the parameter pair from the first subspace, given the fixed values from the second subspace. We then picked the highest-quality parameter pair from the first subspace and, keeping it fixed, swept the second subspace, trying all of its 4 · 6 pairs. If the best quality and the second-subspace parameters matched those found before, we stopped the process. Otherwise, we picked another (untried) parameter pair from the second subspace and repeated the steps above.

This factored sweep can stop short of finding the global optimum in the overall four-dimensional parameter space. On the positive side, it is likely to be faster as it sweeps a single two-dimensional subspace at a time. In our evaluation the process stopped after 4 iterations, each consisting of two subspace sweeps. Thus only 4 · (4 · 5 + 4 · 6) = 176 parameter combinations were tried in total (as opposed to the 4 · 5 · 4 · 6 = 480 that would be required to sweep the original space exhaustively).
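The factored sweep amounts to an alternating coordinate search over the two two-dimensional subspaces. The sketch below illustrates one possible loop structure: run_trials is a hypothetical stand-in for "train and test four times and return the mean test accuracy", and the simplified stopping test (stop when the second-subspace optimum no longer changes) is our reading of the procedure, not its exact implementation.

  % A sketch of the factored parameter sweep. run_trials is hypothetical.
  epochsVals = [10 50 100 200];          batchVals  = [3 5 10 50 100];
  windowVals = [0.05 0.025 0.01 0.001];  filterVals = [50 75 113 227 250 280];

  best2 = [windowVals(1) filterVals(1)];   % initial (w, b) pair from subspace 2
  prevBest2 = [];
  while ~isequal(best2, prevBest2)
      prevBest2 = best2;
      % Sweep subspace 1 (epochs x batch size) with (w, b) fixed.
      bestAcc = -inf;
      for e = epochsVals
          for bs = batchVals
              acc = run_trials(e, bs, best2(1), best2(2));
              if acc > bestAcc, bestAcc = acc; best1 = [e bs]; end
          end
      end
      % Sweep subspace 2 (window size x frequency filters) with (epochs, batch) fixed.
      bestAcc = -inf;
      for w = windowVals
          for b = filterVals
              acc = run_trials(best1(1), best1(2), w, b);
              if acc > bestAcc, bestAcc = acc; best2 = [w b]; end
          end
      end
  end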
We ran four trials per parameter combination and report the accuracy averaged over those trials. We found that the best parameters were 280 frequency filters and a window size of 0.05 seconds, with 50 epochs and a batch size of 5; these yield a test accuracy of A^per-segment = 63.25 ± 4%. We then locked in these control parameters and ran 10 additional trials. The resulting confusion matrix, averaged over these trials, is listed in Table 2, left.

File-level Labeling. We then examined test accuracy at the file level. As described earlier in the paper, the file-level test accuracy A_t^per-file is computed by training the network to classify segments but then labeling each file with a majority vote over its segments. For instance, if an audio file was split into 7 segments, and the network labeled 3 of them as American and 4 as British, we would label the entire audio file as British. We re-ran the experiment; the best parameters for this run were 250 frequency filters and a window size of 0.05 seconds, with 50 epochs and a batch size of 5, which yielded an average accuracy of A^per-file = 68 ± 3.6%. We ran 10 additional trials to compute the confusion matrix: Table 3, left.

Segment Duration. Given that some audio files were shorter than 3 seconds, we also tried a segment duration of 1 second. For per-segment accuracy, the best parameters were 280 frequency filters and a window size of 0.05 seconds, with 50 epochs and a batch size of 5. These parameters yielded an average accuracy of A^per-segment = 63.5 ± 3% over four trials, which is similar to that with 3-second segments. The corresponding confusion matrix (over 10 additional trials) is found in Table 2, right. For per-file accuracy the best parameters were 75 frequency filters and a window size of 0.01 seconds, with a test accuracy of A^per-file = 71 ± 4.5% averaged over 4 trials. The corresponding confusion matrix computed over 10 additional trials is found in Table 3, right.

Table 2: The confusion matrix for per-segment labeling. Left: 3-second segments. Right: 1-second segments.
  3-second segments                      1-second segments
                      Actual                                 Actual
  Classified as    British  American     Classified as    British  American
  British          63.6%    35.3%        British          64.3%    38.4%
  American         36.4%    64.7%        American         35.7%    61.6%

Table 3: The confusion matrix for per-file labeling. Left: 3-second segments. Right: 1-second segments.
  3-second segments                      1-second segments
                      Actual                                 Actual
  Classified as    British  American     Classified as    British  American
  British          67.4%    31.2%        British          71.2%    29.1%
  American         32.6%    68.8%        American         28.8%    70.9%

6 Current Challenges and Future Work

Humans may use certain speech features (e.g., the way the speaker pronounces 'r') to identify accents in an audio file. Those features are present only occasionally, and thus short audio files can be mislabeled by humans. Furthermore, human labelers can be inconsistent in their labels. Such problems with the dataset may reduce test accuracy. Future work will scale up the number of human labelers as well as the length of the files to produce a more accurate and consistent dataset. We will also attempt to train a network for more than two accents, including fantasy accents, and extend the control parameter space to gain a better understanding of how the parameters affect the accuracy of the network.

It will also be of interest to segment audio files in a content-aware way (instead of fixed 1- or 3-second segments). Doing so may also allow the classifier to automatically remove silent parts of an audio file and thus avoid diluting the dataset with meaningless data. Future work will compare the spectrogram-based representation of an audio file to mel-filter-bank coefficients (Palaz, Collobert, and Doss 2013; Song and Cai 2015; Zhang et al. 2017) as well as explore other neural networks such as VGG (Simonyan and Zisserman 2014). Finally, measuring the portability of a deep neural accent detector across games, as well as its sensitivity to background music, is a natural direction for future work.

7 Conclusions

Accent classification is an important task in video-game development (e.g., for quality control and pre-screening of voiceover auditions). In the spirit of reducing game-production costs we proposed and evaluated an approach for doing so automatically, via the use of deep learning. To keep our approach low-cost and accessible to practitioners outside of Computing Science, we used a readily available off-the-shelf deep neural network and a standard deep-learning method. We evaluated our approach on a database of voiceover files from the commercial video game Dragon Age: Origins, where the network achieved a test accuracy of 71%. These results demonstrate the promise of using off-the-shelf deep learning for game development and open a number of exciting follow-up directions.
8 Acknowledgments

We appreciate the support from the Kule Institute for Advanced Study (KIAS), the Social Sciences and Humanities Research Council of Canada (SSHRC) via the Refiguring Innovation in Games (ReFiG) project, the Alberta Conservation Association, the Alberta Biodiversity Monitoring Institute, and Nvidia.

References

Badshah, A. M.; Ahmad, J.; Rahim, N.; and Baik, S. W. 2017. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), 1–5.

BioWare. 2009. Dragon Age: Origins.

BioWare. 2010. Mass Effect 2.

Ensslin, A.; Goorimoorthee, T.; Carleton, S.; Bulitko, V.; and Poo Hernandez, S. 2017. Deep Learning for Speech Accent Detection in Videogames. In Proceedings of the Experimental AI in Games (EXAG) Workshop at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), 69–74.

Espi, M.; Fujimoto, M.; Kinoshita, K.; and Nakatani, T. 2015. Exploiting spectro-temporal locality in deep learning based acoustic event detection. EURASIP Journal on Audio, Speech, and Music Processing 2015(1):26.

Garofalo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.; and Dahlgren, N. L. 1993. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. Linguistic Data Consortium.

Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Graves, A.; Mohamed, A.-R.; and Hinton, G. 2013. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6645–6649.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29(6):82–97.

Huang, Z.; Dong, M.; Mao, Q.; and Zhan, Y. 2014. Speech Emotion Recognition Using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, 801–804.

Knight, E. C.; Poo Hernandez, S.; Bayne, E. M.; Bulitko, V.; and Tucker, B. V. 2018. Pre-processing spectrogram parameters improve the accuracy of birdsong classification using convolutional neural networks. Under review.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 1097–1105.

Palaz, D.; Collobert, R.; and Doss, M. M. 2013. Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks. arXiv preprint arXiv:1304.1018.

Simonyan, K., and Zisserman, A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.

Song, W., and Cai, J. 2015. End-to-End Deep Neural Network for Automatic Speech Recognition. Technical report.

Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Laurent, C.; Bengio, Y.; and Courville, A. 2017. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. arXiv preprint arXiv:1701.02720.