A new Pitch Tracking Smoother based on Deep Neural Networks

Michele Ferro
FICLIT, University of Bologna, Italy
lele.ferro4@gmail.com

Fabio Tamburini
FICLIT, University of Bologna, Italy
fabio.tamburini@unibo.it

Abstract

English. This paper presents a new pitch tracking smoother based on deep neural networks (DNN). The proposed system has been extensively tested using two reference benchmarks for English and exhibited very good performance in correcting the output of pitch detection algorithms.

Italiano (English translation). This contribution presents a program for smoothing the intonation profile based on deep neural networks. The system was verified on two reference corpora, and its performance in correcting the errors of several pitch detection algorithms is decidedly good.

1 Introduction

Pitch, and in particular the fundamental frequency (F0), which represents its physical counterpart, is one of the most relevant perceptual parameters of spoken language and one of the fundamental phenomena to be carefully considered when analysing linguistic data at the phonetic and phonological level. As a consequence, the automatic extraction of F0 has long been a subject of study, and the literature contains many works that aim to develop algorithms able to reliably extract F0 from the acoustic component of utterances; such algorithms are commonly referred to as Pitch Detection Algorithms (PDAs).

Technically, the extraction of F0 is far from trivial, and the great variety of methodologies applied to it demonstrates its complexity. It is especially difficult to design a PDA that works optimally across different recording conditions, since parameters such as speech type, noise, overlap, etc. can heavily influence the performance of this type of algorithm.

Scholars have worked hard on increasingly sophisticated techniques for these particular cases, which are extremely relevant for the construction of real applications, while considering solved, or perhaps simply abandoning, the problem of F0 extraction for so-called "clean speech". However, anyone who has used the most common programs for the automatic extraction of F0 is well aware that halving or doubling errors in the F0 value, to cite only one type of problem, are far from rare, and that the automatic identification of voiced regions within the utterance still poses numerous problems.

Every work that proposes a new method for the automatic extraction of F0 should evaluate its performance in relation to other PDAs. Usually, however, these assessments suffer from typical shortcomings: they examine a very limited set of algorithms, often not available in their implementation, and typically rely on corpora that are not distributed, are tied to specific languages and/or contain particular types of spoken language (pathological, disturbed by noise, etc.) (Veprek, Scordilis, 2002; Wu et al., 2003; Kotnik et al., 2006; Jang et al., 2007; Luengo et al., 2007; Chu, Alwan, 2009; Bartosek, 2010; Huang, Lee, 2012; Chu, Alwan, 2012). Only a few recent studies have performed fairly complete evaluations based on freely downloadable corpora (de Cheveigné, Kawahara, 2002; Camacho, 2007; Wang, Loizou, 2012). These studies very often use a single metric that measures a single type of error, ignoring, or only partly considering, the whole panorama of indicators developed since the pioneering work of Rabiner and colleagues (1976); therefore, in our opinion, their results seem rather partial.

Tamburini (2013) performed an in-depth study of the performance of several widely used PDAs using standard evaluation metrics and well-established corpus benchmarks.

Starting from that study, the main purpose of our research was to improve the performance of the best PDAs identified in Tamburini (2013) by introducing a post-processing smoother. In particular, we implemented a pitch smoother using Keras (https://keras.io/), a powerful high-level neural network application programming interface (API), written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

2 Pitch error correction and smoothing

Typical PDAs are organised into two modules: the first stage detects pitch frequencies frame by frame and, in the second stage, the pitch candidates or probabilities are connected into pitch contours using dynamic programming techniques (Bagshaw, 1994; Chu, Alwan, 2012; Gonzalez, Brookes, 2014) or hidden Markov models (HMMs) (Jin, Wang, 2011; Wu et al., 2003). These techniques are, however, not completely satisfactory, and various kinds of errors remain in the intonation profile. For this reason the literature contains various studies proposing pitch profile smoothers. Some works try to correct the intonation profile by applying traditional techniques (Zhao et al., 2007; So et al., 2012; Jlassi et al., 2016), while a few others (see for example Kellman, Morgan, 2017; Han, Wang, 2014) are based on DNNs (either Multi-Layer Perceptrons or Elman Recurrent Neural Networks).

The pitch smoother we propose is based on recurrent neural networks: it processes the entire sequence of raw pitch values computed by a PDA and tries to correct it by removing, mainly, halving/doubling errors and other kinds of glitches that can appear in raw pitch profiles.

At the input layer we inject one-hot vectors representing the frame pitch value, in the interval 0-499 Hz, as detected by the PDA. We kept the frame size required by each PDA, imposing only a frame shift of 0.01 s for every PDA. For the hidden layer we employed a bidirectional Long Short-Term Memory (LSTM) with 100 neurons for each direction. The two directions are joined together and fed into a TimeDistributed wrapper layer, so that one value per timestep is predicted (instead of one value for each sequence) given the full sequence of one-hot vectors provided as input.

At the output softmax layer we obtain a probability distribution over the pitch values in the same interval 0-499 Hz, and take the most likely value as the network prediction. This means that the network input and output layers contain 500 neurons each.

3 Experimental setup

3.1 Tested PDAs

We chose the three PDAs exhibiting the best performance in Tamburini (2013), namely RAPT, SWIPE' and YAAPT. Even though they were originally developed as MATLAB functions, we decided to adopt the corresponding Python implementations.

The primary purpose in the development of RAPT (A Robust Algorithm for Pitch Tracking) (Talkin, 1995) was to obtain the most robust and accurate estimates possible, with little thought to computational complexity, memory requirements or inherent processing delay. This PDA is designed to work at any sampling frequency and frame rate over a wide range of possible F0 values, speakers and noise conditions. To determine the pitch profile, a Normalized Cross-Correlation Function (NCCF) is used and each F0 candidate is estimated by means of dynamic programming techniques. The Python implementation is available at http://sp-tk.sourceforge.net/.

SWIPE (The Sawtooth Inspired Pitch Estimator) (Camacho, 2007) improves pitch tracking performance through the following measures: it avoids the use of the logarithm of the spectrum; it applies a monotonically decaying weight to the harmonics; it observes the spectrum in the neighbourhood of the harmonics and at the middle points between harmonics; and it uses smooth weighting functions. We adopted SWIPE', a variant of this PDA that uses only the main harmonics for pitch estimation; its Python implementation is again available at http://sp-tk.sourceforge.net/.

YAAPT (Yet Another Algorithm for Pitch Tracking) (Zahorian, Hu, 2008) is a fundamental frequency (pitch) tracking algorithm designed to be highly accurate and very robust for both high-quality and telephone speech. In general, a preprocessing step is used to create multiple versions of the signal. Then, spectral harmonics correlation (SHC) techniques and a Normalized Cross-Correlation Function (NCCF, as in RAPT) are adopted. The final F0 profile is estimated by means of dynamic programming techniques. For our experiments we employed pYAAPT, a Python implementation available at http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html.

3.2 Gold Standards

The evaluation tests were based on two English corpora considered as gold standards, both freely available and widely used in the literature for the evaluation of PDAs:

• Keele Pitch Database (Plante et al., 1995): it is composed of 10 speakers, 5 males and 5 females, who read, in a controlled environment, a small balanced passage (the 'North Wind story'). The corpus also contains the output of a laryngograph, from which the value of F0 can be accurately estimated.

• FDA (Bagshaw et al., 1993): it is a small corpus containing 5 minutes of recordings divided into 100 utterances, read by two speakers, a male and a female, and particularly rich in fricative, nasal, liquid and glide sounds, which are particularly problematic for PDAs to analyse. In this case too, the gold standard for the F0 values is estimated from the output of the laryngograph.

3.3 Evaluation metrics

Proper evaluation mechanisms have to introduce suitable quantitative measures of performance that are able to capture the different critical aspects of the problem under examination. Rabiner et al. (1976) established a de facto standard for PDA assessment measures, which has been used by many others since (e.g. Chu, Alwan, 2009). Let $E_{voi \to unv}$ be the number of voiced frames erroneously classified as unvoiced, $E_{unv \to voi}$ the number of unvoiced frames erroneously classified as voiced, and $E_{f0}$ the number of voiced frames in which the pitch value produced by the PDA differs from the gold standard by more than 16 Hz. We can then define:

• Gross Pitch Error:
  $GPE = E_{f0} / N_{voi}$

• Voiced Detection Error:
  $VDE = (E_{voi \to unv} + E_{unv \to voi}) / N_{frame}$

where $N_{voi}$ is the number of voiced frames in the gold standard and $N_{frame}$ is the number of frames in the utterance. These indicators, taken individually or in pairs, have been used in a large number of works to evaluate the performance of PDAs. The two indicators, however, measure very different errors. It is possible to measure performance using only one indicator, usually GPE, but it evaluates only part of the problem and hardly provides a faithful picture of PDA behaviour. On the other hand, considering both measures makes the results difficult to compare.

To try to remedy these problems, Lee and Ellis (2012) suggested slightly different metrics, which allow the definition of a single indicator:

• Voiced Error:
  $VE = (E_{f0} + E_{voi \to unv}) / N_{voi}$

• Unvoiced Error:
  $UE = E_{unv \to voi} / N_{unv}$

• Pitch Tracking Error:
  $PTE = (VE + UE) / 2$

where $N_{unv}$ is the number of unvoiced frames in the gold standard. However, interpreting the results obtained by a PDA in the light of the PTE measure is rather complex: the most relevant source of errors cannot be identified immediately from the results.

In the light of the above, it seems appropriate to adopt a measure that captures the performance of a PDA in a single, clear indicator and that considers all types of possible errors to be equally relevant. So, following Tamburini (2013), we adopt the Pitch Error Rate as performance metric, defined as:

  $PER = (E_{f0} + E_{voi \to unv} + E_{unv \to voi}) / N_{frame}$

This measure sums all the types of possible errors without privileging or reducing the contribution of any component, allowing a simpler interpretation of the outcomes.

4 Results

We repeated the same experiments as in Tamburini (2013) with the Python implementations of the chosen algorithms (see Table 1) in order to derive common baselines. We also computed the median of the values, as in Tamburini (2013), as a simple smoothing method. As in the cited work, it emerges quite clearly that combining different algorithms with the median method improves the PER results.

Keele Pitch Database
PDA       PER      E_f0     E_voi→unv  E_unv→voi
pYAAPT    0.14056  0.04278  0.04411    0.05366
RAPT      0.12596  0.03789  0.05252    0.03554
SWIPE'    0.14236  0.02762  0.06985    0.04488
Median    0.08814  0.02656  0.03359    0.03564

FDA Corpus
PDA       PER      E_f0     E_voi→unv  E_unv→voi
pYAAPT    0.11912  0.03023  0.03399    0.0549
RAPT      0.09533  0.01978  0.03438    0.04116
SWIPE'    0.10594  0.01385  0.04773    0.04434
Median    0.10182  0.02537  0.03686    0.03917

Table 1: The experiments in Tamburini (2013) reproduced using the Python implementations of the considered PDAs.

After the influential paper by Reimers and Gurevych (2017), it is clear to the community that a single score reported for each DNN training session can be heavily affected by the system initialisation point. We should instead report the mean and standard deviation of several runs with the same setting, in order to get a more accurate picture of real system performance and to make more reliable comparisons between systems.

In order to carry out the experiments with our new pitch smoother we had to split our datasets into training, validation and test sets. For the final evaluation of our pitch smoother, we considered only the PER measure. This metric was computed at each epoch during the training phase for all the subsets, and training was stopped at the epoch with the minimum PER on the validation set. We performed 10 runs for each experiment, computing means, standard deviations and significance tests.

We also tested our pitch smoother on a mixed configuration, joining our datasets and adopting the same procedures.

Table 2 shows all the results. The proposed system always exhibits the best results in every experiment, with relevant performance gains with respect to the base outputs of the PDAs. All the differences proved highly significant when applying a t-test. Given the very small standard deviation in all the experiments, we can conclude that, in this case, the initialisation point did not affect network performance too much.

Keele Pitch Database
PDA       PDA PER  Smoother PER µ  Smoother PER σ
pYAAPT    0.14056  0.05458         0.00157
RAPT      0.12596  0.08726         0.00193
SWIPE'    0.14236  0.09666         0.00298

FDA Corpus
PDA       PDA PER  Smoother PER µ  Smoother PER σ
pYAAPT    0.11912  0.06530         0.00277
RAPT      0.09533  0.06698         0.00133
SWIPE'    0.10594  0.07205         0.00215

Mixed Keele+FDA Corpus
PDA       PDA PER  Smoother PER µ  Smoother PER σ
pYAAPT    0.06951  0.05415         0.00128
RAPT      0.09859  0.07341         0.00133
SWIPE'    0.08758  0.08288         0.00163

Table 2: PER mean (µ) and standard deviation (σ) obtained by the proposed pitch profile smoother. A one-sample t-test returns p < 0.001 for all experiments. N.B.: even though the number of runs is small (10), the power analysis of the t-tests is always equal to 1.0, showing maximum t-test reliability.

5 Conclusions

This paper presented a new pitch smoother based on deep neural networks that obtained excellent results when evaluated using standard benchmarks for English and evaluation metrics proposed in the literature.

Future work could mix corpora in different languages, in order to test whether it is possible to derive a pitch smoother that works properly regardless of the language and, possibly, of specific corpora and language registers.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

Bagshaw, P.C. 1994. Automatic prosodic analysis for computer-aided pronunciation teaching. PhD Thesis, University of Edinburgh.

Bagshaw, P.C., Hiller, S.M. and Jack, M.A. 1993. Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In Proceedings of Eurospeech '93, Berlin, 1003–1006.

Bartosek, J. 2010. Pitch Detection Algorithm Evaluation Framework. In Proceedings of 20th Czech-German Workshop on Speech Processing, Prague, 118–123.

Camacho, A. 2007. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD Thesis, University of Florida.

Chu, W. and Alwan, A. 2009. Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP 2009, 3969–3972.

Chu, W. and Alwan, A. 2012. SAFE: A statistical approach to F0 estimation under clean and noisy conditions. IEEE Trans. Audio, Speech, Lang. Process., 20(3):933–944.

de Cheveigné, A. and Kawahara, H. 2002. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111, 1917–1930.

Gonzalez, S. and Brookes, M. 2014. PEFAC-A pitch estimation algorithm robust to high levels of noise. IEEE Trans. Audio, Speech, Lang. Process., 22(2):518–530.

Han, K. and Wang, D. 2014. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE Trans. Audio, Speech, Lang. Process., 22(12):2158–2168.

Huang, F. and Lee, T. 2012. Robust Pitch Estimation Using l1-regularized Maximum Likelihood Estimation. In Proceedings of 13th Annual Conference of the International Speech Communication Association - Interspeech 2012, Portland (OR).

Jang, S.J., Choi, S.H., Kim, H.M., Choi, H.S. and Yoon, Y.R. 2007. Evaluation of performance of several established pitch detection algorithms in pathological voices. In Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society - EMBC, Lyon, 620–623.

Jin, Z. and Wang, L. 2011. HMM-based multipitch tracking for noisy and reverberant speech. IEEE Trans. Audio, Speech, Lang. Process., 19(5):1091–1102.

Jlassi, W., Bouzid, A. and Ellouze, N. 2016. A new method for pitch smoothing. In 2nd International Conference on Advanced Technologies for Signal and Image Processing, Monastir, Tunisia, 657–661.

Kellman, M. and Morgan, N. 2017. Robust Multi-Pitch Tracking: a trained classifier based approach. ICSI Technical Report, Berkeley, CA.

Kotnik, B., Höge, H. and Kacic, Z. 2006. Evaluation of Pitch Detection Algorithms in Adverse Conditions. In Proceedings of Speech Prosody 2006, Dresden, PS2883.

Lee, B.S. and Ellis, D. 2012. Noise Robust Pitch Tracking by Subband Autocorrelation Classification. In Proceedings of 13th Annual Conference of the International Speech Communication Association - Interspeech 2012, Portland (OR).

Luengo, I., Saratxaga, I. and Navas, E. 2007. Evaluation of Pitch Detection Algorithm under Real Conditions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP 2007, Honolulu, Hawaii, 4, 1057–1060.

Plante, F., Ainsworth, W.A. and Meyer, G. 1995. A Pitch Extraction Reference Database. In Proceedings of Eurospeech '95, Madrid, 837–840.

Rabiner, L.R., Cheng, M.J., Rosenberg, A.E. and McGonegal, C.A. 1976. A Comparative Performance Study of Several Pitch Detection Algorithms. IEEE Transactions on Acoustics, Speech and Signal Processing, 24, 399–418.

Reimers, N. and Gurevych, I. 2017. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, 338–348.

So, Y., Jia, J. and Cai, L. 2012. Analysis and Improvement of Auto-correlation Pitch Extraction Algorithm Based on Candidate Set. In Zhihong Q., Lei C., Weilian S., Tingkai W., Huamin Y. (eds) Recent Advances in Computer Science and Information Engineering: Volume 5, Springer Berlin Heidelberg, 697–702.

Talkin, D. 1995. A robust algorithm for pitch tracking (RAPT). In Kleijn W.B., Paliwal, K.K. (eds) Speech Coding and Synthesis, New York: Elsevier, 495–518.

Tamburini, F. 2013. Una valutazione oggettiva dei metodi più diffusi per l'estrazione automatica della frequenza fondamentale. In Atti del IX Convegno Nazionale dell'Associazione Italiana di Scienze della Voce (AISV2013), Bulzoni: Roma, 427–434.

Veprek, P. and Scordilis, M.S. 2002. Analysis, enhancement and evaluation of five pitch determination techniques. Speech Communication, 37, 249–270.

Wang, D. and Loizou, P.C. 2012. Pitch Estimation Based on Long Frame Harmonic Model and Short Frame Average Correlation Coefficient. In Proceedings of 13th Annual Conference of the International Speech Communication Association - Interspeech 2012, Portland (OR).

Wu, M., Wang, L. and Brown, G.J. 2003. A multipitch tracking algorithm for noisy speech. IEEE Trans. Audio, Speech, Lang. Process., 11(3):229–241.

Zahorian, S.A. and Hu, H. 2008. A Spectral/temporal method for Robust Fundamental Frequency Tracking. Journal of the Acoustical Society of America, 123, 4559–4571.

Zhao, X., O'Shaughnessy, D. and Minh-Quang, N. 2007. A Processing Method for Pitch Smoothing Based on Autocorrelation and Cepstral F0 Detection Approaches. In Proceedings of the International Symposium on Signals, Systems and Electronics, Montreal, Canada, 59–62.
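The network described in Section 2 (one-hot input over 0-499 Hz, a bidirectional LSTM with 100 units per direction, and a per-timestep softmax over 500 pitch classes) could be written in Keras roughly as follows. This is only a sketch under stated assumptions: the function name `build_smoother`, the Adam optimizer and the categorical cross-entropy loss are not given in the paper and are chosen here purely for illustration.

```python
# Sketch of the Section 2 smoother. The optimizer, the loss and the function
# name are assumptions for illustration; the paper does not specify them.
from tensorflow import keras
from tensorflow.keras import layers

N_PITCH = 500   # one class per Hz in the interval 0-499 Hz
N_HIDDEN = 100  # LSTM units for each direction


def build_smoother():
    model = keras.Sequential([
        # Variable-length sequence of one-hot pitch vectors.
        keras.Input(shape=(None, N_PITCH)),
        # Forward and backward LSTM outputs are concatenated (200 features).
        layers.Bidirectional(layers.LSTM(N_HIDDEN, return_sequences=True)),
        # One softmax distribution over pitch values for every timestep.
        layers.TimeDistributed(layers.Dense(N_PITCH, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

With such a model, the corrected pitch for each frame would be the index of the largest softmax output along the last axis.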
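The error measures of Section 3.3 (GPE, VDE, VE, UE, PTE and PER) can be computed from two frame-aligned pitch tracks as sketched below. The function name `pitch_errors` and the convention that a value of 0 Hz marks an unvoiced frame are assumptions made for this illustration; the 16 Hz tolerance is the one stated in the text.

```python
def _ratio(num, den):
    # Return 0.0 for an empty denominator (e.g. no unvoiced frames).
    return num / den if den else 0.0


def pitch_errors(gold, pred, tol_hz=16.0):
    """Compute the Section 3.3 measures for two per-frame F0 tracks in Hz.

    Assumed convention: a value of 0 marks an unvoiced frame.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of frames")
    n_frame = len(gold)
    n_voi = sum(1 for g in gold if g > 0)
    n_unv = n_frame - n_voi
    e_f0 = e_voi_unv = e_unv_voi = 0
    for g, p in zip(gold, pred):
        if g > 0 and p <= 0:
            e_voi_unv += 1            # voiced frame classified as unvoiced
        elif g <= 0 and p > 0:
            e_unv_voi += 1            # unvoiced frame classified as voiced
        elif g > 0 and abs(g - p) > tol_hz:
            e_f0 += 1                 # gross pitch error on a voiced frame
    ve = _ratio(e_f0 + e_voi_unv, n_voi)
    ue = _ratio(e_unv_voi, n_unv)
    return {
        "GPE": _ratio(e_f0, n_voi),
        "VDE": _ratio(e_voi_unv + e_unv_voi, n_frame),
        "VE": ve,
        "UE": ue,
        "PTE": (ve + ue) / 2,
        "PER": _ratio(e_f0 + e_voi_unv + e_unv_voi, n_frame),
    }
```

Because PER divides the three error counts by the same total number of frames, the three error columns reported alongside PER in Table 1 can be read as its additive components.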
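The median baseline mentioned in Section 4 can be illustrated as follows, assuming that the median is taken frame by frame across the outputs of the three PDAs and that 0 Hz marks an unvoiced frame. Both points are assumptions of this sketch, as is the function name `median_combine`; the paper only says that the median of the values was computed as in Tamburini (2013).

```python
from statistics import median


def median_combine(*tracks):
    """Combine frame-aligned F0 tracks by taking the per-frame median."""
    if len({len(t) for t in tracks}) != 1:
        raise ValueError("all tracks must have the same number of frames")
    # zip(*tracks) yields, for each frame, the values proposed by every PDA.
    return [median(frame) for frame in zip(*tracks)]


# Hypothetical toy tracks (Hz), not taken from the corpora used in the paper.
rapt = [0, 100, 102, 200, 0]
swipe = [0, 101, 51, 0, 0]
yaapt = [0, 99, 100, 205, 150]
combined = median_combine(rapt, swipe, yaapt)
```

In this toy example the halved value (51 Hz) in the third frame and the isolated voicing decisions in the last two frames are outvoted by the other two trackers, which is the intuition behind using the median as a simple smoother.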
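The reporting protocol of Section 4 (mean and standard deviation over several runs, plus a one-sample t-test against the PER of the unsmoothed PDA) can be sketched with the standard library alone. The function name `summarise_runs`, the use of the sample standard deviation (n - 1 denominator) and the run values below are assumptions for illustration; the values are made up and are not results from the paper. Only the t statistic is computed here, since obtaining a p-value would require a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def summarise_runs(run_pers, baseline_per):
    """Return mean, sample std and one-sample t statistic of the run PERs."""
    n = len(run_pers)
    if n < 2:
        raise ValueError("at least two runs are needed")
    mu = mean(run_pers)
    sigma = stdev(run_pers)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("t statistic undefined for identical runs")
    # One-sample t statistic against the fixed baseline PER.
    t_stat = (mu - baseline_per) / (sigma / sqrt(n))
    return mu, sigma, t_stat


# Made-up PER values for four hypothetical runs and a hypothetical baseline.
mu, sigma, t_stat = summarise_runs([0.05, 0.06, 0.05, 0.06], baseline_per=0.14)
```

A strongly negative t statistic indicates that the smoother's mean PER lies well below the baseline relative to the run-to-run variation.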