=Paper= {{Paper |id=Vol-1347/paper03 |storemode=property |title=Perception of gesturally distinct consonants in Persian |pdfUrl=https://ceur-ws.org/Vol-1347/paper03.pdf |volume=Vol-1347 |dblpUrl=https://dblp.org/rec/conf/networds/FalahatiB15 }} ==Perception of gesturally distinct consonants in Persian== https://ceur-ws.org/Vol-1347/paper03.pdf
                 Perception of gesturally distinct consonants in Persian
           Reza Falahati                                                                     Chiara Bertini
     Laboratorio di Linguistica                                                         Laboratorio di Linguistica
     Scuola Normale Superiore                                                           Scuola Normale Superiore
       Piazza dei Cavalieri 7                                                             Piazza dei Cavalieri 7
          56126 Pisa, Italy                                                                 56126 Pisa, Italy
        reza.falahati@sns.it                                                              chiara.bertini@sns.it


                                                                    alveolar gesture, some had gestural overlap that
                          Abstract                                  masked at least some of the acoustic information
      This study explores the sensitivity of                        for [t], and some had reduced alveolar gestures.
      listeners to the residual gestures                            The current study tests listeners’ sensitivity to
      remaining after the simplification of                         these three types of /t/ realizations.
      consonant clusters. Three sets of target
      stimuli having full, reduced, and zero                        2   Background
      alveolar gestures along with the control
      stimuli were used in a perceptual                             Choosing the basic units or building blocks by
      identification task. The results of the                       which the phenomena in a discipline could be
      experiment showed that subjects reliably                      explained is fundamentally important. Due to the
      distinguished the three target sets with                      “complex” nature of language, there is no
      varying residual gestures from the                            consensus among linguists as to the nature of this
      control. The results also showed that the                     basic unit in the field. The controversy over
      degree of residual gestures affects the                       choosing the building blocks extends to the
                                                                    domain of speech perception where different
      rate of [t] perception by the subjects;
                                                                    models have postulated various basic units of
      however, this was not statistically                           processing and storage.
      significant. The results are discussed in                        In general, there are two major theoretical
      the context of different theories of speech                   approaches to speech perception: gesturalist
      perception.                                                   theories versus auditory and exemplar theories.
                                                                    The two main gestural theories of speech
1       Introduction                                                perception are Motor Theory and Direct Realism
This study investigates the perception of three                     (MT and DR, henceforth). In motor theories, the
categories of consonant clusters that are                           intended phonetic gestures of the speaker are
perceptually similar but gesturally distinct. In                    considered to be the objects of speech
Persian, word-final coronal stops are optionally                    perception. These gestures are “represented in
deleted when they are preceded by obstruents or                     the brain as invariant motor commands that call
the homorganic nasal /n/. For example, the final                    for movements of the articulators through certain
                                                                    linguistically      significant    configurations”
clusters in the words /ræft/ “went”, /duχt/ “sew”                   (Liberman and Mattingly 1985, p. 2). The main
and /qæsd/ “intention” are optionally simplified¹                   motivation for choosing such a basic unit in MT,
in fast/casual speech, resulting in: [ræf], [duχ],                  among other factors, is the existence of
                                                                    patterns where different acoustic cues could give
and [qæs], respectively. The articulatory study                     rise to the same phonetic percept or where
conducted on this process in Persian by Falahati                    variant phonetic percepts were found for the
(2013) has shown that the gestures of the deleted                   same synthetic speech across different contexts
segments are often still present. More                              (Delattre et al., 1955, 1964; Liberman 1957;
specifically, the findings showed that of the                       Liberman and Mattingly 1985). Despite the
clusters that sounded simplified, some had no                       fact that this theory has gone through significant
1
                                                                    changes from its inception, all the versions share
 The term “simplification” is used here for the acoustic and
                                                                    the idea that the objects of speech perception are
perceptual consequence of apparent coronal consonant
deletion, regardless of whether there is a residual                 articulatory events rather than acoustic or
articulatory gesture.                                               auditory events.

              Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
    In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final
                              Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

   An intended gesture is produced by a number              complex stimuli with structured variance (Diehl
of muscles that act in concert sometimes ranging            et al., 2004). According to this approach, the
over more than one articulator. For instance,               phonological representations are assumed to be
constriction needed for producing coronal stops             speaker independent and they are associated with
involves the action of the tip/blade of the tongue          each word in the listener’s mental lexicon. The
and the jaw; however, such a constriction is                proponents of this approach take, for example,
considered one gesture. According to MT, the                categorical perception of non-speech sounds or
orchestration among gestures is quite systematic            categorical-like perception by non-human
and listeners can use the systematically varying            animals as evidence for their argument. They
acoustic cues for coronal stops as information to           also consider some of the cross-linguistic sound
detect the related consonant gestures.                      patterns and the “maximal auditory dispersion”
   MT assumes a biological link between                     in vowel systems as further support for their
perception and production. According to this                claim (Ohala 1990, 1995).
perspective both speech perception and speech                  Exemplar theories form another approach to
production share the same set of invariants and             speech perception where words and frequently-
are governed by auditory principles. “The                   used grammatical constructions are represented
motivation for articulatory and coarticulatory              in memory as large sets of exemplars containing
maneuvers is to produce just those acoustic                 fine phonetic information. Listeners are sensitive
patterns that fit the language-independent                  to phonetic details existing in the speech signal.
characteristics of the auditory system” (Liberman           In such a speech perception model, a mechanism
and Mattingly, 1985, p. 6). The acoustic signal             is needed for gradiently changing the lexical
only serves as a source of information about the            representations over time. In order to do so, the
gestures. It is the gestures which define the               perceptual system must be capable of making
phonetic category.                                          fine phonetic distinctions (Johnson 1997).
   The other main gestural theory of speech                   These different approaches to speech
perception is direct realism. Both DR and MT                perception have been tested in different studies.
share the claim that listeners to speech perceive           Beddor et al. (2013), for example, used eye-tracking
vocal tract gestures. However, in DR it is the              to assess listeners' use of coarticulatory
phonological gestures of the vocal tract, rather            vowel nasalization as that information unfolded
than the intended gestures, which are the                   in real time. In the experiment, subjects heard the
perceptual objects (Fowler 1981, 1984, 1996).               nasalized vowels with two different time
According to DR, “the temporal overlap of                   latencies. The prediction was that subjects would
vowels and consonants does not result in a                  fixate on the related image sooner when they
physical merging or assimilation of gestures;               hear the nasalized vowel earlier. The results
instead, the vowel and consonant gestures are               showed that listeners use relevant acoustic cues,
coproduced. That is, they remain, to a                      which was argued to allow listeners to track the
considerable extent, separate and independent               gestural information. Nalon (1992) in an
events...” (Diehl et al., 2004, p. 153). If we could        identification task tested whether participants
extend this to the gestures of two adjacent                 could identify different degrees of velar
consonants, one should expect that the gestures             assimilation. He used four different articulation
related to them also remain separate and distinct           types called full alveolar, residual alveolar, zero
from each other.                                            alveolar (i.e., full assimilation to the following
   In contrast to gestural theories, the auditory           velar), and nonalveolar (i.e., velar in underlying
theories assume that speech sounds are perceived            representation). The results of his study showed
via general cognitive and learning mechanisms.              that the participants perceived full alveolar
In this view, speech is not special and listeners           tokens with 100% accuracy with /d/ responses
do not perceive gestures. The auditory approach             while less than half the tokens with residual
to perception mainly considers general auditory
                                                            alveolar were identified with /d/ responses. In
mechanisms        responsible      for    perceptual
performance. According to this view, the speech             another study, Pisoni showed that the nonspeech
and nonspeech stimuli do not invoke a special or            analogs of VOT stimuli are perceived
speech-specific module. Gestures have no                    categorically. Studies like this were taken
mediating role in the perception of speech                  as evidence against MT, which claimed
sounds in this approach. Listeners use multiple             categorical perception as a specific feature of the
imperfect acoustic cues in order to categorize the          speech mode of perception.




   In this study, we use three sets of simplified            Target Full_G: [æχᵗᵗ kɑ], [æfᵗᵗ bæ], [ufᵗᵗ bɑ]
consonant clusters which are auditorily similar
                                                             Target Partial_G: [æχᵗ kɑ], [æfᵗ bæ], [ufᵗ bɑ]
but gesturally different. The consonant clusters
(i.e., C1C2#) happen in the coda of the words                Target Zero_G: [æχ kɑ], [æf bæ], [uf bɑ]
followed by another word which also starts with              Control: [æχ ke], [æf bæ], [uf bɑ]
a consonant, therefore giving us three consonants
in a row in an intervocalic environment (i.e.,               The four sets of target and control nonwords
V1C1C2#C3V2). The prediction is that if subjects             presented above are the excised tokens taken
are sensitive to residual gestures, their judgments          from the full words presented below:
should differ across stimulus sets. The stimulus set with no
coronal gesture is expected to show the same
                                                             Target:    /sæχt   kɑr/    “hard-working”,       /næft
pattern as the control (with zero coronal gesture
in the underlying representation). The stimuli               bærɑje/ “oil for”, /kuft bɑʃeh/ “be cheap”
with overlapped gestures and reduced gestures
are predicted to show a pattern different both               Control: /næχ ke/ “thread that”, /sæf bærɑje/
from control and the stimuli with zero residual
                                                             “queue for”, /mæruf bɑʃeh/ “be famous”
gestures. The following section introduces the
methodology of the study.
                                                             3.3    Procedure
3     Methodology
                                                             All the participants listened to forty stimuli (10
3.1    Participants                                          stimuli in each category) with eight repetitions
Thirty-two Persian-speaking students from the                (total of 320 tokens) in a sound booth located at
Università di Pisa and Sant’Anna, seventeen                  the linguistics laboratory in Scuola Normale
females and fifteen males, aged 18-38, participated          Superiore. The software Presentation was used to
in this study. The results of eight of them are not          present the stimuli to the listeners as an
considered for analysis because they reported                identification task. The participants were asked
being bilingual and mainly used a language other             to listen very carefully and decide as quickly as
than Persian at home or with their close friends.            possible whether it is likely that there has been a
This left twenty-four participants, twelve females           [t] at the end of the first part of each stimulus. For
and twelve males. None of them reported any hearing          each stimulus, the participants were asked to
problem.                                                     press either the green or the blue button on a
                                                             Cedrus response pad. On the screen of a
                                                             computer, listeners could also see “T” or “NO T”
3.2    Stimuli                                               corresponding to the response buttons. The
Three sets of target words varying in only the               stimuli were shuffled and presented in blocks in
degree/amount of alveolar residual gestures and              a way that participants could either begin by
one control stimuli set were used in the                     hearing all the tokens with [f] or [χ]. They also
experiment. The three target categories are                  had the choice of taking a break after listening to
mainly the same except for the degree of alveolar            every 80 tokens. All the participants received a
residual gestures. The Target Full_G category has a          short training before the start of the experiment.
full coronal gesture but with overlap, hence marked          The following section contains the results of the
with a double superscript [tt]. The Target Partial_G         study.
category has a partial residual gesture, marked with
a single superscript [t], whereas Target Zero_G has no
gestural leftover. The stimuli in the control are            4     Results
used as the baseline since they don’t have any
                                                             The main goal of this study is to test listeners’
underlying coronal stop in the coda position of
                                                             sensitivity to different degrees of residual
the first word. Some examples of the target and
                                                             gestures remaining after the simplification of
control words are given below:
                                                             consonant clusters. The response type and
                                                             reaction time are the dependent variables in this
                                                             study; however, only the results related to
                                                             response type are presented here. Figure 1 below
                                                             shows the perception rate of [t] by all subjects




across the four conditions. According to this, the          5    Discussion and Conclusion
subjects show the highest rate of [t] perception in
                                                            This research investigated listeners’ sensitivity to
tokens with full alveolar gesture (i.e., 59.69%)
and the lowest for the ones in the control (i.e.,           three types of /t/ realizations as target and
36.09%). The condition with partial alveolar                compared the results with the control. The target
gestures shows the rate of 56.20% which is very             categories included simplified consonant clusters
close to the full condition. The stimuli in zero            with full, partial, and zero alveolar gestures. The
alveolar condition show an intermediate level               stimuli used as the baseline in the control had no
between the control and the other two target                alveolar gesture in the underlying form. The
conditions with the rate of 49.84%. This shows              general results of the study showed that subjects
almost a similar pattern between the two target             reliably distinguished the three target sets with
conditions with full and partial gestures, an               varying residual gestures from the control. This
intermediate situation for the target condition             could be due to more similarity in tongue
with zero gesture, and a pattern for the control            configuration in realizing these varying degrees
which is different from the three target                    of coronal stop articulation compared to the
conditions.                                                 control condition where there is no alveolar
                                                            gesture in the underlying form. Any articulatory
                                                            modification is expected to trigger acoustic
                                                            changes. The acoustic results of the stimuli used
   [Bar chart: rate of [t] perception for the mean          in this study by Falahati (2013) showed no
   subject; y-axis 0% to 80%; x-axis: Control,              significant difference between the simplified
   Full_G, Partial_G, Zero_G]                               tokens (i.e., the three target sets with varying
                                                            degrees of residual gestures labeled all together
                                                            as simplified) and control tokens. The acoustic
                                                            parameters used in the analysis were V1 duration,
                                                            consonant cluster duration, and formant
                                                            transitions. Despite the fact that the results did
                                                            not show any significant difference between
                                                            simplified and control conditions, the duration of
                                                            V1 and consonant clusters in the simplified
                                                            condition was always longer than in the control
Figure 1: The Rate of [t] Perception by all Subjects        condition. It could be the case that these acoustic
                                                            cues, although not very strong, are enough for the
In order to examine the relation between the two            human auditory system to register the presence
categorical variables in the study, namely the              of a segment.
response type and stimuli condition, a Pearson                 The results of the current study also showed
chi-square test was run. The null hypothesis is             that participants perceived almost 36% of the
that there is no relation between [t] perception and        tokens with no underlying coronal stop as having
the four conditions in the study. The results of            [t]. This is very similar to the results of the study
the test with [t] perception as the dependent               reported by Nalon (1992) where 20% of the
variable showed a significant main effect of                control nonalveolar tokens were perceived as
condition, χ² (3, N = 960) = 46.2, p < 0.001. This          having [d]. In his study, however, the control
shows that there is a significant relation between          tokens showed a pattern similar to that of the target
the stimuli conditions and response type. In order          with zero alveolar (i.e., full assimilation). He
to determine whether the difference in the                  attributes this to both subjects’ natural language
perception of [t] across four categories is really          experience as well as the inherent ambiguity in
significant or it is due to chance variation, a             the stimuli. He states that subjects are “willing to
column proportions test was performed. This test            “undo” its effects” and therefore, in the case of
uses a z-test to make the comparisons. The result           the current study, report coronal stops even
showed that the perception of [t] in the control            where there is no evidence for them.
                                                               The results of our study also showed that
was significantly different from all the target
categories. The next section presents the                   participants perceived more [t] in the tokens with
discussion and concluding remarks of the study.             full and partial alveolar gestures compared to the
                                                            ones with zero alveolar gestures. The difference
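
The Pearson chi-square test and the z-test for column proportions described in the Results section can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the response rates are the ones reported in the text, but the per-condition count of 240 is an assumption (N = 960 split equally over four conditions), and the counts are rebuilt by rounding. The statistic it yields therefore need not match the reported value. The function names are hypothetical.

```python
import math

# [t]-response rates per condition, as reported in the Results text.
RATES = {"Control": 0.3609, "Full_G": 0.5969, "Partial_G": 0.5620, "Zero_G": 0.4984}
# ASSUMPTION: N = 960 responses split equally over the four conditions.
N_PER_CONDITION = 240


def counts_from_rates(rates, n):
    """Rebuild illustrative (t, no_t) counts from rates; not the raw data."""
    return {cond: (round(r * n), n - round(r * n)) for cond, r in rates.items()}


def pearson_chi_square(table):
    """Pearson chi-square statistic and df for a conditions x responses table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def two_proportion_z(a, b):
    """Pooled two-proportion z statistic and two-sided p for (t, no_t) pairs."""
    n1, n2 = sum(a), sum(b)
    pooled = (a[0] + b[0]) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (a[0] / n1 - b[0] / n2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


table = counts_from_rates(RATES, N_PER_CONDITION)
stat, df = pearson_chi_square(table)
print(f"chi-square({df}) = {stat:.2f}, p = {chi2_sf_df3(stat):.2g}")
for target in ("Full_G", "Partial_G", "Zero_G"):
    z, p = two_proportion_z(table[target], table["Control"])
    print(f"{target} vs Control: z = {z:.2f}, p = {p:.4f}")
```

The pairwise comparisons here are uncorrected. A column proportions test as implemented in statistical packages typically adjusts the p-values for multiple comparisons (for example with a Bonferroni correction), which this sketch leaves out for brevity.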




between the three categories, however, did not             their perception of [t]. The variation across
reach the significance level. Such a result could          individuals regarding speech perception could be
shed more light on the theories of speech                  a good source of information for the specialists
perception discussed earlier in this paper. In             in the field. Moreover, the degree to which an
order to discuss this issue, first we need to              individual’s speech production could map to
further explore the nature of the three categories         his/her perception is an interesting topic which
in the target stimuli. Of the three groups in the          remains to be explored.
target stimuli, one group categorically had no
alveolar gesture while the other two had different         Acknowledgments
degrees of the gesture either as a result of
overlap or reduction. We argue that the gradient           We are very grateful to Patrice Beddor for her
gestural reduction and overlap are due to low-             comments and suggestions on this study.
level phonetic and mechanical reasons while the
categorical deletion, which results in tokens with
zero gestures, is caused by the cognitive system.
In the former groups, speakers neither intend to
reduce nor plan to overlap gestures while the
latter process is intended by the speaker.
   According to MT and DR, listeners’ target in
speech perception is the intended or phonological
gestures. Therefore, the overlapped and reduced
stimuli should show a different perceptual pattern
compared to the stimuli with no residual gesture.
The results in this study did not show a striking
difference between these three target sets. The
existence of acoustic cues pertaining to the
presence of gestures is a prerequisite to their
perception by the listener. If distinguishing
acoustic details could be found between these
three categories, then this would not support the
gesturalist approach to speech perception.
However, with the current results, such a claim
cannot be made. Further acoustic analysis
between these three target sets is needed to
examine this idea further.
   The findings in our experiment could be best
explained by referring to exemplar models of
speech perception. In such models, the lexical
representations of words change in a gradient
way over time. This is due to the nature of some
phonological processes in languages which are
not categorical. According to this view, the
perceptual mechanism is capable of making fine
phonetic distinctions. However, it is the mapping
between the gradient stimuli and the auditory
system which fails and does not result in
invariant forms.
   The lack of such a one-to-one mapping will
bring variation across subjects in the speech
community. The degree of such variation is
determined by the amount of an individual’s
exposure to the specific variants. A closer look at
the results for individual subjects showed that all
twenty-four participants in the study could fall
into three or four dominant patterns based on




References                                                     John Ohala. 1990. Respiratory activity in speech. In
                                                                 W. J. Hardcastle & A. Marchal (eds.), Speech
Patrice S. Beddor, Kevin B. McGowan, Julie Boland,               Production and Speech Modeling, 23-53.
  Andries W. Coetzee, and Anna Brasher. 2013.                    Netherlands: Kluwer Academic Publishers.
  The perceptual time course of coarticulation.
  Journal of the Acoustical Society of America, 133,           John Ohala. 1995. The perceptual basis of some
  2350-2366.                                                     sound patterns. In B. Connell and A. Arvaniti (eds.),
                                                                 Phonology and phonetic evidence, Papers in
Pierre Delattre, Alvin M. Liberman, and Franklin S.              Laboratory Phonology IV, 87-92. Cambridge:
   Cooper. 1955. Acoustic loci and transitional cues             Cambridge University Press.
   for consonants. Journal of the Acoustical Society of
   America, 27, 769-773.
Pierre Delattre, Alvin M. Liberman, and Franklin S.
   Cooper. 1964. Formant transition and loci as
   acoustic correlates of place of articulation in
   American fricatives. Stud. Linguist. 18, 104-121.
Randy L. Diehl, Andrew J. Lotto, and Lori L. Holt.
  2004. Speech perception. Annual Review of
  Psychology, 55, 149-179.
Reza Falahati. 2013. Gradient and Categorical
  Consonant Cluster Simplification in Persian:
  An Ultrasound and Acoustic Study. Ph.D.
  dissertation, University of Ottawa, Ottawa.

Carol C. Fowler. 1981. Production and perception of
 coarticulation among stressed and unstressed
 vowels. Journal of Speech and Hearing Research,
 46, 127-139.

Carol C. Fowler. 1984. Segmentation of coarticulated
 speech in perception. Perception & Psychophysics,
 36, 359-368.

Carol C. Fowler. 1996. Listeners do hear sounds, not
 tongues. Journal of the Acoustical Society of
 America, 99, 1730-1741.

Keith Johnson. 1997. Speech perception without
 speaker normalization: an exemplar model. In K.
 Johnson and J. W. Mullennix (eds.), Talker
 variability in speech processing, 145-165. San
 Diego: Academic Press.

Alvin M. Liberman and Ignatius G. Mattingly. 1985.
  The motor theory of speech perception revised.
  Cognition, 21, 1-36.
Francis Nalon. 1992. The descriptive role of
  segments: evidence from assimilation. In G. J.
  Docherty and R. Ladd (eds.), Papers in Laboratory
  Phonology IV, 261-289. Cambridge: Cambridge
  University Press.



