<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating visual prosody using articulography</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Languages and Literature, Lund University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Speech, Music and Hearing, KTH</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Swedish, Linnaeus University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Logopedics, Phoniatrics and Audiology, Clinical Sciences, Lund University</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Lund University Humanities Lab, Lund University</institution>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0000</lpage>
      <abstract>
        <p>In this paper we describe ongoing work on multimodal prosody based on simultaneous recordings of articulation and head movements. Earlier work has explored the patterning, usage and machine-learning-based detection of focal pitch accents, head beats and eyebrow beats in audiovisual recordings. Kinematic data obtained through articulography allow for more comparable and accurate measurements, and provide three-dimensional data. Our current approach therefore examines speech and body movements concurrently, using electromagnetic articulography (EMA). We have previously recorded large amounts of this kind of data for other purposes. Here we present results from a study of the interplay between head movements and phrasing, and find tendencies for upward movements to occur before, and downward movements after, prosodic boundaries.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal prosody</kwd>
        <kwd>EMA</kwd>
        <kwd>head movements</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This study is part of a project investigating levels of multimodal prosodic prominence,
as resulting from an interplay of verbal prosody (pitch accents) and visual prosody
(head and eyebrow beats). Facial beat gestures align with pitch accents in speech,
functioning as visual prominence markers. However, it is not yet well understood
whether and how gestures and pitch accents might be combined to create different
types of multimodal prominence, and how specifically visual prominence cues are
used in spoken communication.</p>
      <p>In earlier work, Ambrazaitis &amp; House (2017) explored the patterning and usage of
focal pitch accents, head beats and eyebrow beats. The material consisted of Swedish
television news broadcasts and comprised audiovisual recordings of five news readers
(two female, three male). They found that head beats occur more frequently in the
second than in the first part of a news reading, and also that the distribution of head
beats might to some degree be governed by information structure, as the text-initial
clause often defines a common ground or presents the theme of the news story. The
choice between focal accent, head beat and a combination of them is subject to
variation which might represent a degree of freedom for the speaker to use the markers
expressively.</p>
      <p>Based on the same, but extended data, Frid et al. (2017) developed a system for
detection of speech-related head movements. The corpus was manually labelled for head
movement, applying a simplistic annotation scheme consisting of a binary decision
about absence/presence of a movement in relation to a word. They then used a
video-based face detection procedure to extract the head positions and movements over
time, and from these they calculated velocity and acceleration features. A
machine learning system was then trained to predict the absence or presence of head
movement. The system achieved an F1 score of 0.69 (precision = 0.72, recall = 0.66) in
10-fold cross-validation. Furthermore, the area under the ROC curve was 0.77, indicating
that the system may be helpful for head movement labelling.</p>
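      <p>As a quick consistency check, the F1 score reported above is the harmonic mean of the reported precision and recall. The following minimal Python sketch reproduces it; the function name f1_score is chosen here for illustration and is not part of the system described by Frid et al. (2017).</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the head-movement detector.
f1 = f1_score(0.72, 0.66)
print(round(f1, 2))  # prints 0.69
```
]]></preformat>
      <p>The unrounded value is about 0.689, which agrees with the reported score of 0.69.</p>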
    </sec>
    <sec id="sec-2">
      <title>Kinematic vs audiovisual data</title>
      <p>One difficulty in studying the relationship between speech and body gestures is
that it requires simultaneously recorded kinematic and acoustic measurements.
Previous studies have used audiovisual data to study this link, but such data do not
allow the synchronization of gestures to be compared directly. Kinematic data allow for
more comparable and accurate measurements. Our current approach therefore
involves examining speech and body movements concurrently, using electromagnetic
articulography (EMA). This method allows for simultaneous recording of audio and 3D
movements of the articulators (tongue, lips, and jaw), and sensors can also be placed
on the head. Head sensor data are typically used only to normalise the articulator data
for head movement, but they can also be used as raw data, giving us the co-occurring
position of the head. Compared with video, this gives us 3D rather than 2D coordinates,
better temporal resolution (video normally has a much lower frame rate) and better
synchronization with the audio signal.</p>
      <p>This study also serves as an example of data reuse (Pasquetto et al. 2017):
our material was recorded in other projects for other purposes, but we are able to use
it here to study co-occurring properties of speech and head movements. Here
we use data from one of these projects (see below).</p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>
        The data were recorded as part of the VOKART project (Schötz et al. 2013). Twenty-nine native
speakers (age: 20-63) of the Stockholm (n = 9), Gothenburg (n = 10), and Malmö (n = 10)
variants of Swedish were recorded by means of EMA using an AG500 (Carstens
Medizinelektronik) with a sampling frequency of 200 Hz. Ten sensors were attached to the
lips, jaw and tongue, along with two reference sensors on the nose ridge and behind
the ear to correct for head movements, using Cyano Veneer Fast dental glue. Audio
was recorded using a Sony ECM-T6 electret condenser microphone.
      </p>
      <p>The speech material consisted of 15–20 repetitions by each speaker of target words
in carrier sentences of the type “Det va inte hV1t utan hV2t ja sa” (It was not hV1t,
but hV2t I said), where V1 and V2 were different vowels. The target words
containing the vowels were stressed and produced with contrastive focus. The sentences were
displayed in random order on a computer screen, and the speakers were instructed to
read each sentence in their own dialect at a comfortable speech rate. In order to
familiarise the speakers with the sensors and the experimental setup the actual test
sentences were preceded by two phonetically rich and challenging sentences, which the
speakers were asked to repeat three times each. The two sentences were:</p>
      <p>1) Mobiltelefonen är nittiotalets stora fluga, både bland företagare och
privatpersoner. (The mobile phone is the big hit of the nineties, both among business people
and private persons.)</p>
      <p>2) Flyget, tåget och bilbranschen tävlar om lönsamhet och folkets gunst. (Airlines,
train companies and the automobile industry are competing for profitability and
people's appreciation.)</p>
      <p>In addition, the speakers were asked to describe a painting displayed on the
computer screen, resulting in about half a minute of spontaneous speech with several
focused words and phrase boundaries. A contour of the palate was obtained by having the
speakers move their tongue tips back and forth several times along the midline of
their palate. For this study, we focused on the phonetically rich sentences.</p>
    </sec>
    <sec id="sec-4">
      <title>Analysis: head movements and phrase boundaries</title>
      <p>We analyzed the sentence data by looking at sentence-level patterns of head
movements and comparing them word by word. Sentence 1 above consists of two phrases,
with an intonational boundary between the words fluga and både. Sentence 2 is
essentially one phrase, but starts with a list that may cause boundary signalling. We
examined the material by looking for possible head movement reflections of those
boundaries. To annotate the word boundaries of the sentences, we used the
forced alignment method provided by the Praat program (Boersma &amp; Weenink 2018),
which speeds up the process but still requires manual post-checking. Utterances that
contained misreadings and/or missing parts (because the recording stopped before the
reader finished the sentence) were discarded. In total, there were 86 examples of
Sentence 1 and 80 examples of Sentence 2.</p>
      <p>First we measured the angular velocity, in the sagittal plane, of the angle between 1) an
imaginary line between the two reference sensors (behind the ear and on the nose ridge)
and 2) a line running along the transverse plane (parallel to the ground). This
effectively measures how fast the head tilts up or down in this plane. We then
calculated the mean angular velocity per word for each sentence and grouped the data
by word. Figures 1 and 2 show summaries of the data in the form of boxplots. For
Sentence 1 (Figure 1), we note that the boundary-preceding word fluga has a
positive median, whereas both the preceding word, stora, and the following word, både,
have negative medians. A similar, but less prominent, pattern can be observed in
Sentence 2 (Figure 2), where the first word, Flyget, has a positive median, whereas the
following word, tåget, has a negative median.</p>
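      <p>The measurement described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the coordinate convention (first coordinate towards the front of the head, second coordinate upwards, so that a positive velocity means an upward tilt), the function names, the first-difference velocity estimate and the synthetic input are all assumptions made here. Only the 200 Hz sampling rate is taken from the Data section.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Dict, List, Sequence, Tuple

SAMPLE_RATE_HZ = 200.0  # EMA sampling frequency stated in the Data section

Point = Tuple[float, float]  # (front-back, up-down) position in the sagittal plane


def tilt_angle(ear: Point, nose: Point) -> float:
    """Angle (rad) between the ear-to-nose line and the horizontal.

    Assumed convention: raising the nose relative to the ear increases the angle.
    """
    return math.atan2(nose[1] - ear[1], nose[0] - ear[0])


def angular_velocity(angles: Sequence[float],
                     rate_hz: float = SAMPLE_RATE_HZ) -> List[float]:
    """First-difference estimate of angular velocity (rad/s)."""
    return [(b - a) * rate_hz for a, b in zip(angles, angles[1:])]


def mean_velocity_per_word(velocity: Sequence[float],
                           words: Sequence[Tuple[str, float, float]],
                           rate_hz: float = SAMPLE_RATE_HZ) -> Dict[str, float]:
    """Average the velocity samples inside each (label, start_s, end_s) interval."""
    means = {}
    for label, start_s, end_s in words:
        first = int(round(start_s * rate_hz))
        last = int(round(end_s * rate_hz))
        segment = velocity[first:last]
        if segment:
            means[label] = sum(segment) / len(segment)
    return means


def synthetic_angle(t: float) -> float:
    """Synthetic tilt (not recorded data): up at 0.1 rad/s, then down at 0.05 rad/s."""
    if t > 0.5:
        return 0.05 - 0.05 * (t - 0.5)
    return 0.1 * t


# One second of synthetic sensor positions; the ear sensor is kept at the origin.
times = [i / SAMPLE_RATE_HZ for i in range(201)]
ear = (0.0, 0.0)
angles = [
    tilt_angle(ear, (100.0 * math.cos(synthetic_angle(t)),
                     100.0 * math.sin(synthetic_angle(t))))
    for t in times
]
velocity = angular_velocity(angles)

# Hypothetical word intervals, standing in for forced-alignment output.
means = mean_velocity_per_word(velocity, [("w1", 0.0, 0.5), ("w2", 0.5, 1.0)])
print({word: round(value, 3) for word, value in means.items()})
```
]]></preformat>
      <p>With the synthetic input, the first interval yields a positive mean (upward tilt) and the second a negative one (downward tilt), which is the kind of per-word contrast summarised in the boxplots. With real data, each word interval would come from the checked forced alignment.</p>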
      <p>Fig. 1. Boxplots of mean angular velocity per word in Sentence 1, n=86. Black horizontal lines
are medians, hinges correspond to the first and third quartiles, and black dots are outliers.</p>
      <p>Fig. 2. Boxplots of mean angular velocity per word in Sentence 2, n=80. Black horizontal lines
are medians, hinges correspond to the first and third quartiles, and black dots are outliers.</p>
      <p>We used R (R Core Team, 2018) and lme4 (Bates, Maechler, Bolker &amp; Walker, 2015) to
perform a linear mixed effects analysis of the relationship between mean angular velocity (mav)
and word. Linear mixed models were used to account for repeated measures. As a fixed
effect, we entered word into the model. As random effects, we had intercepts for
subjects as well as by-subject random slopes for the effect of word. P-values were
obtained by likelihood ratio tests of the full model with the effect in question against the
model without the effect in question. P-values &lt; 0.05 were considered significant. We
tested all pairs of words w1 and w2 within each sentence, where w2 was the word
following w1. Table 1 summarizes the results. The results confirm the observation
that the boundary-preceding word fluga has a higher mean angular velocity than its
neighbouring words. Furthermore, the initial word in each sentence (Mobiltelefonen
and Flyget, respectively) is associated with a higher mean angular velocity than the
word following it, as is the word tävlar compared to the word om.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Result</th></tr>
          </thead>
          <tbody>
            <tr><td>word affected mav (χ<sup>2</sup>(1) = 8.5201, p = 0.003512), lowering it by about 0.077 rad/s ± 0.017 (standard errors)</td></tr>
            <tr><td>word affected mav (χ<sup>2</sup>(1) = 8.4946, p = 0.003562), increasing it by about 0.077 rad/s ± 0.017 (standard errors)</td></tr>
            <tr><td>word affected mav (χ<sup>2</sup>(1) = 5.8811, p = 0.0153), lowering it by about 0.043 rad/s ± 0.012 (standard errors)</td></tr>
            <tr><td>word affected mav (χ<sup>2</sup>(1) = 3.913, p = 0.04792), lowering it by about 0.043 rad/s ± 0.017 (standard errors)</td></tr>
            <tr><td>word affected mav (χ<sup>2</sup>(1) = 4.3803, p = 0.03636), lowering it by about 0.032 rad/s ± 0.012 (standard errors)</td></tr>
          </tbody>
        </table>
      </table-wrap>
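      <p>The p-values in Table 1 can be cross-checked from the reported test statistics. In a likelihood ratio test the statistic is twice the difference in log-likelihood between the full and the reduced model, and with one degree of freedom its upper-tail probability under the chi-square distribution has the closed form erfc(sqrt(x / 2)). The sketch below uses only the Python standard library; the function name chi2_sf_df1 is chosen here for illustration, and the code does not refit the mixed models, which were fitted in R with lme4.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if x > 0:
        return math.erfc(math.sqrt(x / 2.0))
    return 1.0


# (chi-square statistic, reported p-value) pairs as given in Table 1.
reported = [
    (8.5201, 0.003512),
    (8.4946, 0.003562),
    (5.8811, 0.0153),
    (3.913, 0.04792),
    (4.3803, 0.03636),
]

for stat, p_reported in reported:
    p = chi2_sf_df1(stat)
    print(f"chi2(1) = {stat}: p = {p:.4g} (reported {p_reported})")
```
]]></preformat>
      <p>Each recomputed value should agree with the reported p-value to within rounding, which supports the reading of the bracketed 1 in the table as the degrees of freedom of the test.</p>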
    </sec>
    <sec id="sec-5">
      <title>Discussion/Conclusions</title>
      <p>Using EMA recordings to analyze head movement, by comparing the kinematic
patterns of the sensors with the audio signal, is a promising way to obtain
information on the synchronization of head movements with, for example, prosodic
signals of prominence such as F0 excursions and syllable lengthening.</p>
      <p>The results presented here show a tendency for participants to tilt the
head upwards rather than downwards during boundary-preceding words.
Conversely, the words following the boundary show the opposite tendency, indicating
more downward movement. There is also a tendency for sentence-initial words to
have a higher mean angular velocity than the words following them.</p>
      <p>EMA data must be recorded on-line, and obtaining them is laborious, which makes the
method less suitable for collecting large amounts of data. Audiovisual (AV) data are
easier to collect. However, EMA recordings could be used to augment existing AV corpora
with estimated articulatory information (Ouni 2013): since we record audio and
EMA data concurrently, we could build models that map acoustics to articulatory data. In this way,
AV corpora may be enriched with articulatory information.</p>
      <p>Previously, motion capture data have been used to investigate temporal coordination
between head movement and the audio signal (Alexanderson et al. 2013) and between
head movement and EMA articulation data (Krivokapić et al. 2017; Esteve-Gibert et
al. 2018). EMA methodology has also been used to analyze head movements alone,
but the current data will enable us to investigate the temporal coordination of head
movements, tongue and lip movements, and the audio signal in the same system. We
plan to use this methodology to investigate the role of head movement, articulation
and prosody in signaling prominence in the context of the newly initiated PROGEST
project and thereby contribute to the body of knowledge of multimodality in digital
humanities and digital representations of speech and gestures in communication.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was supported by grants from the Swedish Research Council: Swe-Clarin
(VR 2013-2003) and Progest (VR 2017-02140).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alexanderson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>House</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Beskow</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Aspects of co-occurring syllables and head nods in spontaneous dialogue</article-title>
          .
          <source>In Proc. of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013)</source>
          . Annecy, France.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ambrazaitis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>House</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings</article-title>
          ,
          <source>Speech Communication</source>
          ,
          <volume>95</volume>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>113</lpage>
          , https://doi.org/10.1016/j.specom.2017.08.008
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bates</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maechler</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolker</surname>
            <given-names>B.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Walker</surname>
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Fitting Linear Mixed-Effects Models Using lme4</article-title>
          .
          <source>Journal of Statistical Software</source>
          ,
          <volume>67</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>48</lpage>
          . doi:10.18637/jss.v067.i01.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Boersma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Weenink</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Praat: doing phonetics by computer</article-title>
          [Computer program].
          Version 6.0.39, retrieved 3 April 2018 from http://www.praat.org/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Esteve-Gibert</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loevenbruck</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dohen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>D'Imperio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Head movements highlight important information in speech: an EMA study with French speakers</article-title>
          .
          <source>XIV AISV Conference - Speech in Natural Context</source>
          , 25-27 January 2018, Bozen-Bolzano. doi:10.13140/RG.2.2.21796.78727
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Frid</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambrazaitis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svensson-Lundmark</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>House</surname>
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Towards classification of head movements in audiovisual recordings of read news</article-title>
          ,
          <source>Proceedings of the 4th European and 7th Nordic Symposium on Multimodal Communication (MMSYM 2016)</source>
          , Copenhagen, 29-30 September 2016, Issue
          <issue>141</issue>
          , 2017-09-21, pp.
          <fpage>4</fpage>
          -
          <lpage>9</lpage>
          , ISSN 1650-3740
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Krivokapić</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiede</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Tyrone</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>A Kinematic Study of Prosodic Structure in Articulatory and Manual Gestures: Results from a Novel Method of Data Collection</article-title>
          .
          <source>Laboratory Phonology</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ), 3. Published online 13 March 2017. doi:10.5334/labphon.75
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ouni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <article-title>Multimodal Speech: from articulatory speech to audiovisual speech</article-title>
          .
          <source>Machine Learning [cs.LG]</source>
          . Université de Lorraine,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pasquetto</surname>
            ,
            <given-names>I.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Randles</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Borgman</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          , (
          <year>2017</year>
          ).
          <article-title>On the Reuse of Scientific Data</article-title>
          .
          <source>Data Science Journal</source>
          ,
          <volume>16</volume>
          , p.
          <fpage>8</fpage>
          . DOI: http://doi.org/10.5334/dsj-2017-008
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <collab>R Core Team</collab>
          (
          <year>2018</year>
          ).
          <article-title>R: A language and environment for statistical computing</article-title>
          .
          <publisher-name>R Foundation for Statistical Computing</publisher-name>
          , Vienna, Austria. URL https://www.R-project.org/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schötz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frid</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gustafsson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Löfqvist</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Functional Data Analysis of Tongue Articulation in Palatal Vowels: Gothenburg and Malmöhus Swedish /i:, y:, ʉ:/</article-title>
          .
          <source>Proceedings of Interspeech 2013</source>
          . Lyon.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>