<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Cognitive Load of Modern TTS Systems Under Noisy Conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Avashna Govender</string-name>
          <email>agovender1@csir.co.za</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon King</string-name>
          <email>simon.king@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Speech Technology Research, University of Edinburgh</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Council for Scientific and Industrial Research</institution>
          ,
          <country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Past measurements of the cognitive load of text-to-speech (TTS) synthesis systems consistently showed that synthetic speech is more difficult to process than human speech in the presence of noise. However, the systems previously evaluated are no longer state-of-the-art. The quality of speech produced by modern TTS systems is considered indistinguishable from that of human speech. Does this mean that the cognitive load demanded by such systems is now equivalent to that of human speech? The work presented in this paper sets out to answer this question by measuring the cognitive load of modern TTS systems under noisy conditions. Results show that the gap in cognitive load between TTS and human speech is narrowing for systems such as Tacotron 2 and Fastspeech 2. However, differences in cognitive load between these systems are still present. Therefore, despite modern TTS systems producing high-quality speech, not all of them demand the same amount of cognitive load, and thus not all TTS systems will provide the same user experience when embedded into real-world applications. Interestingly, results suggest that vocoded speech demands the same cognitive load as human speech, which shows that it is possible to generate synthetic speech that imposes a cognitive load equivalent to that of human speech.</p>
      </abstract>
      <kwd-group>
        <kwd>text-to-speech</kwd>
        <kwd>pupillometry</kwd>
        <kwd>cognitive load</kwd>
        <kwd>listening effort</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Speech technology is becoming increasingly popular, and evaluating the user experience is
therefore crucial. An important aspect of this evaluation is understanding the
difficulty, if any, that listeners experience when listening to synthetic speech. To understand
this difficulty, one needs to understand how synthetic speech interacts with the
human cognitive processing system during listening. In other words, a measurement of
cognitive load is required [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Previous work has investigated the cognitive load of synthetic
speech produced by various text-to-speech (TTS) systems [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ], all of which were shown
to demand a higher cognitive load than human speech. However, TTS architectures are
constantly evolving, and the most recent TTS systems are capable of producing synthetic speech
that is considered to be indistinguishable from human speech in terms of intelligibility and
naturalness [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This therefore makes us question whether the cognitive load of modern TTS
systems is also becoming indistinguishable from that of human speech. In this work, we set out to
measure the cognitive load of modern TTS systems to determine whether synthetic speech
produced by modern TTS systems is still more difficult to listen to than human speech.
      </p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>2. Experimental design</title>
      <sec id="sec-3-1">
        <title>2.1. Models evaluated</title>
        <p>Two state-of-the-art TTS systems were selected for the evaluation, namely Tacotron 2 and
Fastspeech 2, the latter being an adapted version of the original that is not publicly available. As a
lower bound in the evaluation, an older model, the Merlin<sup>1</sup> TTS system, was included. TTS
systems comprise two key components, namely the acoustic model and the vocoder. The
acoustic model is responsible for the conversion from the text representation to an acoustic
representation, whilst the vocoder is responsible for generating the speech from the acoustic
representation. To evaluate whether contributions to increased cognitive load stem from the
acoustic model alone and not the vocoder, we included samples generated by the vocoder in the
evaluation. All models were implemented in conjunction with the MultiBand-MelGAN vocoder.
As the upper bound, human speech taken from the original samples in the dataset was
included.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Experimental setup</title>
        <p>
          The pupillometry paradigm we proposed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] was used to measure the cognitive load of the
various models and human speech. The experiment was set up in the same manner as reported
in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. An SR-Eyelink eye tracker was used to measure the pupil response in a light- and
sound-controlled lab whilst participants listened to audio samples in the presence of noise through
headphones. Three experiments were conducted, each measuring the cognitive load of
the various systems in the presence of speech-shaped noise, at signal-to-noise ratios (SNRs) of -1 dB (Exp. A, N=15), -3
dB (Exp. B, N=20) and -5 dB (Exp. C, N=25) respectively. As in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], for each experiment, stimuli
were blocked by system, resulting in 5 blocks, each containing 20 sentences. The block order
was balanced using a 5x5 Latin square design to ensure all listeners, systems and sentences were
equally represented. At the end of each block, self-reported cognitive load scores were collected
on a 5-point rating scale to support the results collected from the pupillometry paradigm. The
same pre-processing and analysis procedures as reported in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] were applied and the same
event-related pupil dilation (ERPD) percentage formula was used. All analyses were carried out
in R.
        </p>
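        <p>The block counterbalancing described above can be illustrated with a minimal Python sketch. It builds a cyclic 5x5 Latin square over the five conditions and assigns each listener one row as their block order. The cyclic construction, the function names and the rule mapping listeners to rows are assumptions made for illustration only; the text does not state which Latin square or assignment rule was used.</p>
        <preformat><![CDATA[
```python
from typing import List

# The five conditions evaluated (names as they appear in the paper).
SYSTEMS = ["Human", "MultiBand-MelGAN", "Tacotron 2", "Fastspeech 2", "Merlin"]


def cyclic_latin_square(items: List[str]) -> List[List[str]]:
    """Return an n x n Latin square in which row r is `items` rotated left by r."""
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]


def block_order(listener_index: int, square: List[List[str]]) -> List[str]:
    """Pick a row for a listener by cycling through the rows (hypothetical rule)."""
    return square[listener_index % len(square)]


square = cyclic_latin_square(SYSTEMS)
for listener in range(5):
    # Each listener hears all five blocks; each system occupies each
    # block position exactly once across the five rows.
    print(listener, block_order(listener, square))
```
]]></preformat>
        <p>Under this assumed assignment rule, the listener counts of 15, 20 and 25 are multiples of 5, so every row of the square would be used equally often within each experiment.</p>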
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Analysis</title>
        <p>
          In this work, Growth Curve Analysis (GCA) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] was used to analyse the time course of the
ERPD within a specific time period in which the peak is observed. The overall time course of
the data was captured using a third-order (cubic) orthogonal polynomial with fixed effects of
condition (the systems being compared) and random effects of participant and item (sentence
stimulus). Using GCA, parameter estimates are generated from the model fits, and statistical
differences were obtained using post-hoc tests to get comparisons across all systems. In [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] a
table is presented that describes what each time term represents. In this work, we focus only
on the intercept term, which represents the overall mean pupil dilation.
        </p>
        <p><sup>1</sup> This version is an adapted version of https://github.com/CSTR-Edinburgh/merlin</p>
        <p>[Figure 1: boxplot of self-reported cognitive load (y-axis, scale 1 to 5) per system (x-axis: Human, MultiBand-MelGAN, Tacotron 2, Fastspeech 2, Merlin), grouped by experiment (-1 dB, -3 dB, -5 dB).]
        </p>
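        <p>To make the cubic orthogonal polynomial time terms concrete, the following Python sketch constructs them with Gram-Schmidt orthogonalisation. It is only an illustration of the time-term construction, comparable in spirit to what the poly function in R produces (possibly up to sign), and it does not fit the mixed-effects model itself. The function name and the choice of 50 time samples are hypothetical and not taken from the paper.</p>
        <preformat><![CDATA[
```python
import math
from typing import List


def orthogonal_poly(n_points: int, degree: int = 3) -> List[List[float]]:
    """Orthonormal polynomial time terms of order 1..degree over n_points samples.

    Gram-Schmidt is applied to the columns 1, t, t^2, ..., t^degree. The constant
    column is used for orthogonalisation only and is not returned.
    """
    if n_points <= degree:
        raise ValueError("need more time points than the polynomial degree")
    # Centre and scale time to [-1, 1] to keep the computation well conditioned.
    t = [(2.0 * i - (n_points - 1)) / (n_points - 1) for i in range(n_points)]
    raw = [[ti ** d for ti in t] for d in range(degree + 1)]
    basis: List[List[float]] = []
    for col in raw:
        v = col[:]
        for b in basis:
            # Remove the component of v that lies along an earlier basis vector.
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis[1:]


# Hypothetical analysis window of 50 samples: linear, quadratic and cubic terms.
terms = orthogonal_poly(n_points=50, degree=3)
print(len(terms), len(terms[0]))
```
]]></preformat>
        <p>Because each term sums to zero over the window, the intercept of a model using these terms reflects the overall mean of the curve, which is consistent with the interpretation of the intercept given above.</p>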
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results</title>
    </sec>
    <sec id="sec-5">
      <title>3.1. Recall accuracy</title>
      <p>
        After each trial, the listener was expected to repeat the sentence they heard verbatim. Recall
accuracy (RA) was calculated by summing additions, substitutions and insertions and dividing
by the total number of words in the sentence. Sentences consisted of approximately 8 words.
Recall accuracy in Experiments A, B and C was 75%, 64% and 51% respectively. The expected recall
accuracy at -1 dB, -3 dB and -5 dB as reported in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is approximately 80%, 60% and 40% for human
speech respectively. Therefore, the recall accuracy for the experiments in this work was in line
with the expected correct percentages except for the -1 dB experiment, which is a consequence
of the Merlin TTS system performing poorly (RA=66%). In all experiments Merlin performed
the worst and was significantly different (p &lt; 0.05) from all other systems except Fastspeech 2. All
other systems were found to be equivalent. Since all systems (except Merlin) were found to be
equivalent, we can be confident that contributions to increased cognitive load observed in this
work are not a consequence of poor intelligibility.
      </p>
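      <p>As a rough illustration of word-level scoring, the Python sketch below counts the minimum number of word substitutions, insertions and deletions between the reference sentence and the listener's response, and reports one minus the resulting error rate. This is an assumption about how such a score could be computed and not the exact scoring procedure used in this work; the function names and the example sentences are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List


def word_errors(reference: List[str], response: List[str]) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    prev = list(range(len(response) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, resp_word in enumerate(response, start=1):
            cost = 0 if ref_word == resp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing (deletion)
                            curr[j - 1] + 1,      # extra response word (insertion)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def recall_accuracy(reference: str, response: str) -> float:
    """Assumed score: 1 - errors / number of reference words, floored at 0."""
    ref = reference.lower().split()
    resp = response.lower().split()
    if not ref:
        raise ValueError("reference sentence must contain at least one word")
    return max(0.0, 1.0 - word_errors(ref, resp) / len(ref))


# Hypothetical 8-word sentence: one word dropped and one word changed.
reference = "the boy ran quickly across the wet field"
response = "the boy ran across the wet fields"
print(recall_accuracy(reference, response))
```
]]></preformat>
      <p>In this hypothetical example, two errors over eight reference words give a score of 0.75.</p>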
      <sec id="sec-5-1">
        <title>3.2. Self-reported measures</title>
        <p>The self-reported cognitive load measures are presented in a boxplot in Figure 1. Merlin and
Fastspeech 2 are perceived to be the hardest to listen to whilst vocoded speech was reported as
the easiest to listen to with human speech closely following.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.3. Pupil responses</title>
        <p>The intercept parameter estimates from the GCA for each experiment are presented in Table 1,
and the average pupil responses in each noise condition are presented in Figure 2.</p>
        <p>
          For all systems, the intercept increases or stays equivalent between -1 dB and -3 dB, except for Merlin,
which appears to be similar across all noise conditions. At -5 dB, Fastspeech 2 has a smaller
estimate than at -3 dB, and this decline in pupil response is clearly visible in Figure 2. Tacotron
2 remained more or less constant between -3 dB and -5 dB. This would mean that cognitive
load is either decreasing or unchanged when listening in a noisier condition, which
is unlikely. In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], it is reported that when an evoked pupil response is smaller than expected,
this suggests the possibility of the listener withdrawing from the task. Since we observe a
reduced pupil response for Fastspeech 2 in the -5 dB condition, this result suggests that listeners
have withdrawn from the task. In other words, the listener has reached their ceiling cognitive
capacity in attempting to process the speech, perhaps because it was too challenging to
listen to. Similarly, the mean for Tacotron 2 remained the same for -3 dB and -5 dB. Again, it is
unlikely that the listener experiences similar loads in the most difficult SNR condition. It
is more plausible that at -5 dB a reduced pupil response is also being observed for Tacotron 2.
Since both systems reached cognitive ceiling capacity at -5 dB, we compare their estimates at
-3 dB, where the estimate for Fastspeech 2 is significantly greater than that for Tacotron 2. Therefore, Tacotron 2
demands less cognitive load than Fastspeech 2. Since the estimates for Merlin were all constant,
it may be that Merlin was too challenging to listen to even in the easiest condition and therefore
evoked a small pupil response throughout. Thus, of all systems evaluated, Merlin is the
most difficult TTS system to listen to. This is not surprising: the recall accuracy for Merlin
was poor and its self-reported cognitive load was high, as it was deliberately selected as the lower
bound. For human speech, the pupil response increases gradually as the SNR decreases but is
still manageable at -5 dB. Vocoded speech is found to be equivalent to human speech at -3 dB
and -5 dB, but at -1 dB it appears to behave differently from all other conditions and systems (see
Figure 2). Overall, these findings suggest that vocoded speech is equivalent in cognitive load to
human speech under noisy conditions. This finding is also an important one, as it shows that
increased cognitive load contributions in TTS systems do not stem from the vocoder but rather
from the acoustic model.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Conclusion</title>
      <p>
        The cognitive load of modern TTS was measured in this paper. From the self-reported cognitive
load, vocoded speech was perceived to be the easiest to process, with human speech closely
following, whilst Merlin was perceived to be the most difficult. Fastspeech 2 was perceived to be
slightly more challenging to listen to than Tacotron 2. When the cognitive load of TTS
systems was evaluated using the pupillometry paradigm, the pupil response revealed the same results, thereby
validating the self-reported measures. These findings suggest that modern TTS systems are
becoming more manageable to listen to even in noisy conditions. More specifically, such results
reveal that Merlin, a statistical parametric speech synthesis (SPSS) model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], demands high
cognitive load and should not be used in real-world solutions that embed TTS. Fastspeech 2 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and Tacotron 2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] both utilise a sequence-to-sequence based architecture, which is shown
to reduce cognitive load compared to SPSS. However, since differences were still observed
between these two state-of-the-art systems, further investigation is necessary to understand where exactly
increased cognitive load contributions stem from. Furthermore, given that vocoded speech
demanded the least cognitive load, this finding shows that increased cognitive load in TTS
systems does not stem from the vocoder. In conclusion, despite modern TTS systems producing
high-quality speech, there are still differences between them, vocoded speech and human speech in
terms of their cognitive load. Modern TTS systems are therefore moving in the direction of
being equivalent to human speech, but not all systems will provide the same user experience.
The listeners’ experience will therefore vary depending on the architecture that is used within
a given application. Having identified in this work that Fastspeech 2 and Tacotron 2 impose
differing cognitive loads, we can use this information to
further unpack where possible contributions to increased cognitive load within the model stem
from. Such information informs us on how to develop better TTS models that are equivalent to
human speech or, if possible, even better. Evaluating the cognitive load of TTS is therefore necessary
for selecting the right architectures to be embedded into real-world applications such that the
lowest possible strain is imposed on the listeners using them.
      </p>
      <p>I would like to acknowledge Ella Crocker for collecting the pupil responses for this work.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>McGarrigle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Munro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dawes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Barry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amitay</surname>
          </string-name>
          ,
          <article-title>Listening effort and fatigue: What exactly are we measuring? A British Society of Audiology Cognition in Hearing Special Interest Group 'white paper'</article-title>
          ,
          <source>International Journal of Audiology</source>
          <volume>53</volume>
          (
          <year>2014</year>
          )
          <fpage>433</fpage>
          -
          <lpage>445</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Govender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Using pupillometry to measure the cognitive load of synthetic speech</article-title>
          ,
          <source>in: Interspeech</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2838</fpage>
          -
          <lpage>2842</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Govender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise</article-title>
          ,
          <source>in: Interspeech</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1551</fpage>
          -
          <lpage>1555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Govender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Valentini-Botinhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Measuring the contribution to cognitive load of each predicted vocoder speech parameter in DNN-based speech synthesis</article-title>
          ,
          <source>in: 10th ISCA Speech Synthesis Workshop</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tachibana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Okamoto</surname>
          </string-name>
          ,
          <article-title>Text-to-speech synthesis</article-title>
          ,
          <source>Speech-to-Speech Translation</source>
          (
          <year>2020</year>
          )
          <fpage>39</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mirman</surname>
          </string-name>
          ,
          <source>Growth curve analysis and visualization using R</source>
          ,
          <publisher-name>Chapman and Hall/CRC</publisher-name>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cooke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Valentini-Botinhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Stylianou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sauert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Evaluating the intelligibility benefit of speech modifications in known noise conditions</article-title>
          ,
          <source>Speech Communication</source>
          <volume>55</volume>
          (
          <year>2013</year>
          )
          <fpage>572</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Toffanin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Başkent</surname>
          </string-name>
          ,
          <article-title>The timing and effort of lexical access in natural and degraded speech</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>7</volume>
          (
          <year>2016</year>
          )
          <fpage>398</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <article-title>Statistical parametric speech synthesis using deep neural networks</article-title>
          ,
          <source>in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>7962</fpage>
          -
          <lpage>7966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Fastspeech 2: Fast and high-quality end-to-end text to speech</article-title>
          , arXiv preprint arXiv:2006.04558 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jaitly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Skerry-Ryan</surname>
          </string-name>
          , et al.,
          <article-title>Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions</article-title>
          ,
          <source>in: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>4779</fpage>
          -
          <lpage>4783</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>