      A new Pitch Tracking Smoother based on Deep Neural Networks

                  Michele Ferro                               Fabio Tamburini
         FICLIT, University of Bologna, Italy         FICLIT, University of Bologna, Italy
           lele.ferro4@gmail.com                      fabio.tamburini@unibo.it

Abstract

English. This paper presents a new pitch tracking smoother based on deep neural
networks (DNN). The proposed system has been extensively tested using two reference
benchmarks for English and exhibited very good performance in correcting the outputs
of pitch detection algorithms.

Italiano. Questo contributo presenta un programma di smoothing del profilo intonativo
basato su reti neurali deep. Il sistema è stato verificato utilizzando due corpora di
riferimento e le sue prestazioni nella correzione degli errori di alcuni algoritmi per
l’identificazione del pitch sono decisamente buone.

1   Introduction

Pitch, and in particular the fundamental frequency (F0), which represents its physical
counterpart, is one of the most relevant perceptual parameters of spoken language and
one of the fundamental phenomena to be carefully considered when analysing linguistic
data at the phonetic and phonological levels. As a consequence, the automatic
extraction of F0 has long been a subject of study, and the literature contains many
works that aim to develop algorithms able to reliably extract F0 from the acoustic
component of utterances, commonly known as Pitch Detection Algorithms (PDAs).

Technically, the extraction of F0 is far from trivial, and the great variety of
methodologies applied to this problem demonstrates its complexity. It is especially
difficult to design a PDA that works optimally across different recording conditions,
since parameters such as speech type, noise and overlap can heavily influence the
performance of this type of algorithm.

Scholars have worked hard searching for increasingly sophisticated techniques for
these particular cases, which are extremely relevant for the construction of real
applications, while considering solved, or perhaps simply abandoning, the problem of
F0 extraction for so-called “clean speech”. However, anyone who has used the most
common programs available for the automatic extraction of F0 is well aware that
halving or doubling errors of the F0 value, to cite only one type of problem, are far
from rare, and that the automatic identification of voiced regions within the
utterance still poses numerous problems.

Every work that proposes a new method for the automatic extraction of F0 should
evaluate its performance in relation to other PDAs but, usually, these assessments
suffer from typical shortcomings: they examine a very limited set of algorithms, often
without an available implementation, and they typically rely on corpora that are not
distributed, that are tied to specific languages and/or that contain particular
typologies of spoken language (pathological, disturbed by noise, etc.) (Veprek,
Scordilis, 2002; Wu et al., 2003; Kotnik et al., 2006; Jang et al., 2007; Luengo et
al., 2007; Chu, Alwan, 2009; Bartosek, 2010; Huang, Lee, 2012; Chu, Alwan, 2012). Few
recent studies have performed reasonably complete evaluations based on freely
downloadable corpora (de Cheveigné, Kawahara, 2002; Camacho, 2007; Wang, Loizou,
2012). These studies very often use a single metric that measures a single type of
error, ignoring or only partly considering the whole panorama of indicators developed
since the pioneering work of Rabiner and colleagues (1976); therefore, in our opinion,
the results obtained seem rather partial.

Tamburini (2013) performed an in-depth study of the different performances exhibited
by several
widely used PDAs by using standard evaluation metrics and well-established corpus
benchmarks.

Starting from this study, the main purpose of our research was to improve the
performance of the best Pitch Detection Algorithms identified in Tamburini (2013) by
introducing a post-processing smoother. In particular, we implemented a pitch smoother
adopting Keras (https://keras.io/), a powerful high-level neural networks application
program interface (API), written in Python and capable of running on top of
TensorFlow, CNTK, or [...]

2   Pitch error correction and smoothing

Typical PDAs are organised into two stages: the first tries to detect pitch
frequencies frame by frame, and the second connects the pitch candidates or
probabilities into pitch contours using dynamic programming techniques (Bagshaw, 1994;
Chu, Alwan, 2012; Gonzalez, Brookes, 2014) or hidden Markov models (HMMs) (Jin, Wang,
2011; Wu et al., 2003).

These techniques are, however, not completely satisfactory, and various kinds of
errors remain in the intonation profile. For this reason, the literature offers
various studies proposing pitch profile smoothers. Some works try to correct the
intonation profile by applying traditional techniques (Zhao et al., 2007; So et al.,
2017; Jlassi et al., 2016), while a few others (see, for example, Kellman, Morgan,
2016; Han, Wang, 2014) are based on DNNs (either Multi-Layer Perceptrons or Elman
Recurrent Neural Networks).

The pitch smoother we propose is based on recurrent neural networks, so that it can
process the entire sequence of raw pitch values computed by the various PDAs and try
to correct it by removing, mainly, halving/doubling errors and other kinds of glitches
that can appear in raw pitch profiles.

At the input layer we inject one-hot vectors representing the frame pitch value in the
interval 0-499 Hz as detected by the PDA. We kept the pitch frame size required by
each PDA, imposing only a frame shift of 0.01 s for every PDA. For the hidden layer we
employed a bidirectional Long Short-Term Memory (LSTM) with 100 neurons for each
direction. The two directions are joined together and fed into a TimeDistributed
wrapper layer so that one value per timestep is predicted (instead of one value for
each sequence), given the full sequence of one-hot vectors provided as input.

At the output softmax layer we expect to get a probability distribution over the pitch
values in the same interval 0-499 Hz, taking the most likely one as the actual network
prediction. This means that the network input and output layers contain 500 neurons
each.

3   Experiments setup

3.1   Tested PDAs

We chose the three PDAs exhibiting the best performance in Tamburini (2013), namely
RAPT, SWIPE’ and YAAPT. Even though they were originally developed as MATLAB
functions, we decided to adopt the corresponding Python implementations.

The primary purpose in the development of RAPT (A Robust Algorithm for Pitch Tracking)
(Talkin, 1995) was to obtain the most robust and accurate estimates possible, with
little thought to computational complexity, memory requirements or inherent processing
delay. This PDA is designed to work at any sampling frequency and frame rate over a
wide range of possible F0, speaker and noise conditions. For the determination of the
pitch profile, a Normalized Cross-Correlation Function (NCCF) is used and each F0
candidate is estimated by means of dynamic programming techniques. The Python
implementation is available at http://sp-tk.sourceforge.net/.

SWIPE (The Sawtooth Inspired Pitch Estimator) (Camacho, 2007) improves the performance
of pitch tracking by adopting these measures: it avoids the use of the logarithm of
the spectrum; it applies a monotonically decaying weight to the harmonics; it observes
the spectrum in the neighbourhood of the harmonics and of the middle points between
harmonics; and it uses smooth weighting functions. We adopted SWIPE’, a variant of
this PDA that only uses the main harmonics for pitch estimation, implemented in Python
and available, again, at [...]

YAAPT (Yet Another Algorithm for Pitch Tracking) (Zahorian, Hu, 2007) is a fundamental
frequency (pitch) tracking algorithm designed to be highly accurate and very robust
for both high-quality and telephone speech. In general, a preprocessing step is used
to create multiple versions of the signal. Then, spectral harmonics correlation (SHC)
techniques and a Normalized Cross-Correlation Function (NCCF, as in RAPT) are adopted.
The final F0 profile is estimated by means of dynamic programming techniques. For our
experiments we employed pYAAPT, a Python implementation available at
http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html.

3.2   Gold Standards

The evaluation tests were based on two English corpora considered as gold standards,
both freely available and widely used in the literature for the evaluation of PDAs:

  • Keele Pitch Database (Plante et al., 1995): it is composed of 10 speakers, 5 males
    and 5 females, who read, in a controlled environment, a small balanced passage
    (the ’North Wind story’). The corpus also contains the output of a laryngograph,
    from which it is possible to accurately estimate the value of F0.

  • FDA (Bagshaw et al., 1993): it is a small corpus containing 5 minutes of recordings
    divided into 100 utterances, read by two speakers, a male and a female, and
    particularly rich in fricative, nasal, liquid and glide sounds, which are
    particularly problematic for PDAs to analyse. Also in this case the gold standard
    for the F0 values is estimated from the output of the laryngograph.

3.3   Evaluation metrics

Proper evaluation mechanisms have to introduce suitable quantitative measures of
performance that are able to capture the different critical aspects of the problem
under examination. Rabiner et al. (1976) established a de facto standard for PDA
assessment measures, which has been used by many others since (e.g. Chu, Alwan, 2009).
If $E_{voi \to unv}$ and $E_{unv \to voi}$ respectively represent the number of voiced
frames erroneously classified as unvoiced and vice versa, while $E_{f0}$ represents
the number of voiced frames in which the pitch value produced by the PDA differs from
the gold standard by more than 16 Hz, then we can define:

  • Gross Pitch Error:
        $\mathrm{GPE} = E_{f0} / N_{voi}$

  • Voiced Detection Error:
        $\mathrm{VDE} = (E_{voi \to unv} + E_{unv \to voi}) / N_{frame}$

where $N_{voi}$ is the number of voiced frames in the gold standard and $N_{frame}$ is
the number of frames in the utterance. These indicators, taken individually or in
pairs, have been used in a large number of works to evaluate the performance of PDAs.
The two indicators, however, measure very different errors; it is possible to measure
the performance using only one indicator, usually GPE, but it evaluates only part of
the problem and hardly provides a faithful picture of PDA behaviour. On the other
hand, considering both measures makes the comparison of results difficult.

To try to remedy these problems, Lee and Ellis (2012) suggested slightly different
metrics, which allow the definition of a single indicator:

  • Voiced Error:
        $\mathrm{VE} = (E_{f0} + E_{voi \to unv}) / N_{voi}$

  • Unvoiced Error:
        $\mathrm{UE} = E_{unv \to voi} / N_{unv}$

  • Pitch Tracking Error:
        $\mathrm{PTE} = (\mathrm{VE} + \mathrm{UE}) / 2$

where $N_{unv}$ is the number of unvoiced frames contained in the gold standard.
However, interpreting the results obtained by a PDA in light of the PTE measurement is
rather complex: it is not straightforward to identify from the obtained results the
most relevant source of errors.

In light of the above, it seems appropriate to introduce a new measure that captures
the performance of a PDA in a single, clear indicator that considers all types of
possible errors to be equally relevant. So, following Tamburini (2013), we adopt the
Pitch Error Rate as performance metric, defined as:

        $\mathrm{PER} = (E_{f0} + E_{voi \to unv} + E_{unv \to voi}) / N_{frame}$

This measure sums all the types of possible errors without privileging or reducing the
contribution of any component, allowing a simpler interpretation of the obtained
outcomes.
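The metrics defined in this section can be computed directly from two frame-aligned
pitch tracks. The following is a minimal sketch, not the evaluation code used for the
experiments: the function name pitch_metrics, the convention that an F0 value of 0
marks an unvoiced frame, and the toy pitch tracks are all illustrative assumptions.

```python
def pitch_metrics(gold, pred, tol_hz=16.0):
    """Frame-level PDA error metrics from two aligned F0 tracks (in Hz).

    Assumption (not stated in the paper): a value of 0 marks an unvoiced frame.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of frames")

    n_frame = len(gold)
    n_voi = sum(1 for g in gold if g > 0)   # voiced frames in the gold standard
    n_unv = n_frame - n_voi                 # unvoiced frames in the gold standard

    e_f0 = e_vu = e_uv = 0
    for g, p in zip(gold, pred):
        if g > 0 and p <= 0:
            e_vu += 1                       # voiced frame classified as unvoiced
        elif g <= 0 and p > 0:
            e_uv += 1                       # unvoiced frame classified as voiced
        elif g > 0 and abs(p - g) > tol_hz:
            e_f0 += 1                       # voiced in both, off by more than tol_hz

    def ratio(num, den):
        # Undefined ratios (e.g. no voiced frames at all) are reported as NaN.
        return num / den if den else float("nan")

    ve = ratio(e_f0 + e_vu, n_voi)
    ue = ratio(e_uv, n_unv)
    return {
        "GPE": ratio(e_f0, n_voi),
        "VDE": ratio(e_vu + e_uv, n_frame),
        "VE": ve,
        "UE": ue,
        "PTE": (ve + ue) / 2,
        "PER": ratio(e_f0 + e_vu + e_uv, n_frame),
    }


# Toy example: 10 frames, 6 voiced and 4 unvoiced in the gold standard.
gold = [0, 0, 100, 100, 100, 200, 0, 150, 150, 0]
pred = [0, 120, 100, 50, 0, 205, 0, 150, 300, 0]
metrics = pitch_metrics(gold, pred)
print(metrics)
```

In this toy example there are two gross pitch errors, one voiced-to-unvoiced error and
one unvoiced-to-voiced error over 10 frames, so PER is 4/10 = 0.4, while PTE averages
VE = 3/6 and UE = 1/4 to give 0.375. The two indicators differ because PER normalises
every error by the total number of frames, whereas PTE normalises voiced and unvoiced
errors separately.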
4   Results

We repeated the same experiments as in Tamburini (2013) with the Python
implementations of the chosen algorithms (see Table 1) in order to derive common
baselines. We also computed the median of the values, as in Tamburini (2013), as a
simple smoothing method. As in the cited work, it emerges quite clearly that the
combination of different algorithms with the median method improves the PER results.

                     Keele Pitch Database
    PDA       PER       E_f0      E_voi→unv   E_unv→voi
    pYAAPT    0.14056   0.04278   0.04411     0.05366
    RAPT      0.12596   0.03789   0.05252     0.03554
    SWIPE’    0.14236   0.02762   0.06985     0.04488
    Median    0.08814   0.02656   0.03359     0.03564

                         FDA Corpus
    PDA       PER       E_f0      E_voi→unv   E_unv→voi
    pYAAPT    0.11912   0.03023   0.03399     0.0549
    RAPT      0.09533   0.01978   0.03438     0.04116
    SWIPE’    0.10594   0.01385   0.04773     0.04434
    Median    0.10182   0.02537   0.03686     0.03917

Table 1: The experiments in Tamburini (2013) reproduced using the Python
implementations of the considered PDAs.

After the influential paper by Reimers and Gurevych (2017), it is clear to the
community that a single score reported for each DNN training session can be heavily
affected by the system initialisation point, and that we should instead report the
mean and standard deviation of various runs with the same setting in order to get a
more accurate picture of the real performance of the systems and make more reliable
comparisons between them.

In order to carry out the experiments with our new pitch smoother we had to split our
datasets into training/validation/test sets. For the final evaluation of our pitch
smoother, we considered only the PER measure. This metric was computed for each epoch
during the training phase for all subsets in order to determine the stopping epoch,
i.e. the one at which we get the minimum PER on the validation set. We performed 10
runs for each experiment, computing means, standard deviations and significance tests.

We also tested our pitch smoother on a mixed configuration joining our datasets and
adopting the same procedures.

Table 2 shows all the obtained results. The pro[...] any experiment with relevant
performance gains with respect to the PDAs base outputs. All the differences proved
highly significant when applying a t-test. Given the very small standard deviation in
all the experiments we can conclude that, in this case, the initialisation point did
not affect the network performance too much.

                 Keele Pitch Database
    PDA       PDA PER   Smoother PER µ   Smoother PER σ
    pYAAPT    0.14056   0.05458          0.00157
    RAPT      0.12596   0.08726          0.00193
    SWIPE’    0.14236   0.09666          0.00298

                      FDA Corpus
    PDA       PDA PER   Smoother PER µ   Smoother PER σ
    pYAAPT    0.11912   0.06530          0.00277
    RAPT      0.09533   0.06698          0.00133
    SWIPE’    0.10594   0.07205          0.00215

                Mixed Keele+FDA Corpus
    PDA       PDA PER   Smoother PER µ   Smoother PER σ
    pYAAPT    0.06951   0.05415          0.00128
    RAPT      0.09859   0.07341          0.00133
    SWIPE’    0.08758   0.08288          0.00163

Table 2: PER mean (µ) and standard deviation (σ) obtained by the proposed pitch
profile smoother. A one-sample t-test returns p < 0.001 for all experiments. N.B.:
Even if the number of experiments is small (10), the power analysis of the t-tests is
always equal to 1.0, showing maximum t-test reliability.

5   Conclusions

This paper presented a new pitch smoother based on deep neural networks that obtained
excellent results when evaluated using standard benchmarks for English and evaluation
metrics proposed in the literature.

Future work could investigate the intermixing of various corpora in different
languages in order to test the possibility of deriving a pitch smoother that works
properly regardless of language and, possibly, of specific corpora and language
registers.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the
Titan Xp GPU used for this research.
