=Paper= {{Paper |id=Vol-1347/paper03 |storemode=property |title=Perception of gesturally distinct consonants in Persian |pdfUrl=https://ceur-ws.org/Vol-1347/paper03.pdf |volume=Vol-1347 |dblpUrl=https://dblp.org/rec/conf/networds/FalahatiB15 }} ==Perception of gesturally distinct consonants in Persian== https://ceur-ws.org/Vol-1347/paper03.pdf

Perception of gesturally distinct consonants in Persian
Reza Falahati and Chiara Bertini
Laboratorio di Linguistica
Scuola Normale Superiore
Piazza dei Cavalieri 7
56126 Pisa, Italy
reza.falahati@sns.it, chiara.bertini@sns.it

Abstract

This study explores the sensitivity of individuals to the residual gestures remaining after the simplification of consonant clusters. Three sets of target stimuli with full, reduced, and zero alveolar gestures, along with control stimuli, were used in a perceptual identification task. The results of the experiment showed that subjects reliably distinguished the three target sets with varying residual gestures from the control. The results also showed that the degree of residual gesture affects the rate of [t] perception by the subjects; however, this effect was not statistically significant. The results are discussed in the context of different theories of speech perception.

1 Introduction

This study investigates the perception of three categories of consonant clusters that are perceptually similar but gesturally distinct. In Persian, word-final coronal stops are optionally deleted when they are preceded by obstruents or the homorganic nasal /n/. For example, the final clusters in the words /ræft/ “went”, /duχt/ “sew” and /qæsd/ “intention” are optionally simplified¹ in fast/casual speech, resulting in [ræf], [duχ], and [qæs], respectively. The articulatory study of this process in Persian by Falahati (2013) has shown that the gestures of the deleted segments are often still present. More specifically, the findings showed that, of the clusters that sounded simplified, some had no alveolar gesture, some had gestural overlap that masked at least some of the acoustic information for [t], and some had reduced alveolar gestures. The current study tests listeners’ sensitivity to these three types of /t/ realizations.

¹ The term “simplification” is used here for the acoustic and perceptual consequence of apparent coronal consonant deletion, regardless of whether there is a residual articulatory gesture.

2 Background

Choosing the basic units or building blocks by which the phenomena in a discipline can be explained is fundamentally important. Due to the “complex” nature of language, there is no consensus among linguists as to the nature of this basic unit in the field. The controversy over choosing the building blocks extends to the domain of speech perception, where different models have postulated various basic units of processing and storage.

In general, there are two major theoretical approaches to speech perception: gesturalist theories versus auditory and exemplar theories. The two main gestural theories of speech perception are Motor Theory and Direct Realism (MT and DR, henceforth). In MT, the intended phonetic gestures of the speaker are considered to be the objects of speech perception. These gestures are “represented in the brain as invariant motor commands that call for movements of the articulators through certain linguistically significant configurations” (Liberman and Mattingly 1985, p. 2). The main motivation for MT’s choice of such a basic unit, among other factors, is the existence of patterns where different acoustic cues give rise to the same phonetic percept, or where varying phonetic percepts were found for the same synthetic speech across different contexts (Delattre et al., 1955, 1964; Liberman 1957; Liberman and Mattingly 1985). Despite the fact that this theory has gone through significant changes since its inception, all its versions share the idea that the objects of speech perception are articulatory events rather than acoustic or auditory events.

Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final
Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

An intended gesture is produced by a number of muscles that act in concert, sometimes ranging over more than one articulator. For instance, the constriction needed for producing coronal stops involves the action of the tip/blade of the tongue and the jaw; however, such a constriction is considered one gesture. According to MT, the orchestration among gestures is quite systematic, and listeners can use the systematically varying acoustic cues for coronal stops as information to detect the related consonant gestures.

MT assumes a biological link between perception and production. According to this perspective, both speech perception and speech production share the same set of invariants and are governed by auditory principles. “The motivation for articulatory and coarticulatory maneuvers is to produce just those acoustic patterns that fit the language-independent characteristics of the auditory system” (Liberman and Mattingly, 1985, p. 6). The acoustic signal only serves as a source of information about the gestures. It is the gestures which define the phonetic category.

The other main gestural theory of speech perception is Direct Realism. Both DR and MT share the claim that listeners to speech perceive vocal tract gestures. However, in DR it is the phonological gestures of the vocal tract, rather than the intended gestures, which are the perceptual objects (Fowler 1981, 1984, 1996). According to DR, “the temporal overlap of vowels and consonants does not result in a physical merging or assimilation of gestures; instead, the vowel and consonant gestures are coproduced. That is, they remain, to a considerable extent, separate and independent events...” (Diehl et al., 2004, p. 153). If this is extended to the gestures of two adjacent consonants, one should expect that the gestures related to them also remain separate and distinct from each other.

In contrast to gestural theories, the auditory theories assume that speech sounds are perceived via general cognitive and learning mechanisms. In this view, speech is not special and listeners do not perceive gestures. The auditory approach to perception mainly considers general auditory mechanisms responsible for perceptual performance. According to this view, speech and nonspeech stimuli do not invoke a special or speech-specific module. Gestures have no mediating role in the perception of speech sounds in this approach. Listeners use multiple imperfect acoustic cues in order to categorize the complex stimuli with structured variance (Diehl et al., 2004). According to this approach, the phonological representations are assumed to be speaker independent and are associated with each word in the listener’s mental lexicon. The proponents of this approach take, for example, categorical perception of non-speech sounds or categorical-like perception by non-human animals as evidence for their argument. They also consider some of the cross-linguistic sound patterns and the “maximal auditory dispersion” in vowel systems as further support for their claim (Ohala 1990, 1995).

Exemplar theories form another approach to speech perception, in which words and frequently used grammatical constructions are represented in memory as large sets of exemplars containing fine phonetic information. Listeners are sensitive to phonetic details existing in the speech signal. In such a speech perception model, a mechanism is needed for gradiently changing the lexical representations over time. In order to do so, the perceptual system must be capable of making fine phonetic distinctions (Johnson 1997).

These different approaches to speech perception have been tested in different studies. Beddor et al. (2013), for example, used eye-tracking to assess listeners’ use of coarticulatory vowel nasalization as that information unfolded in real time. In the experiment, subjects heard the nasalized vowels with two different time latencies. The prediction was that subjects would fixate on the related image sooner when they heard the nasalized vowel earlier. The results showed that listeners use relevant acoustic cues, which was argued to allow listeners to track the gestural information. Nalon (1992) tested in an identification task whether participants could identify different degrees of velar assimilation. He used four different articulation types called full alveolar, residual alveolar, zero alveolar (i.e., full assimilation to the following velar), and nonalveolar (i.e., velar in the underlying representation). The results of his study showed that the participants perceived full alveolar tokens with 100% accuracy with /d/ responses, while less than half of the tokens with residual alveolar were identified with /d/ responses. In another study, Pisoni showed that nonspeech analogs of VOT stimuli are perceived categorically. Studies like this were taken as evidence against MT, which claimed categorical perception as a specific feature of the speech mode of perception.

In this study, we use three sets of simplified consonant clusters which are auditorily similar but gesturally different. The consonant clusters (i.e., C₁C₂#) occur in the coda of words followed by another word which also starts with a consonant, therefore giving three consonants in a row in an intervocalic environment (i.e., V₁C₁C₂#C₃V₂). The prediction is that if subjects are sensitive to residual gestures, they should judge the stimuli differently. The stimuli set with no coronal gesture is expected to show the same pattern as the control (with zero coronal gesture in the underlying representation). The stimuli with overlapped gestures and reduced gestures are predicted to show a pattern different both from the control and from the stimuli with zero residual gestures. The following section introduces the methodology of the study.

3 Methodology

3.1 Participants

Thirty-two Persian-speaking students from the Università di Pisa and Sant’Anna (seventeen females and fifteen males, aged 18-38) participated in this study. The results of eight of them are not considered for analysis because they reported being bilingual and mainly using a language other than Persian at home or with their close friends. This left twenty-four participants (twelve females and twelve males). None of them reported any hearing problem.

3.2 Stimuli

Three sets of target words varying only in the degree/amount of alveolar residual gesture and one control stimuli set were used in the experiment. The three target categories are the same except for the degree of alveolar residual gesture. The Target Full_G category has a full coronal gesture with overlap, hence marked with two superscripts [ᵗᵗ]. The Target Partial_G category has a partial residual gesture, marked with one superscript [ᵗ], whereas Target Zero_G has no gestural leftover. The stimuli in the control are used as the baseline since they do not have any underlying coronal stop in the coda position of the first word. Some examples of the target and control words are given below:

Target Full_G: [æχᵗᵗ kɑ], [æfᵗᵗ bæ], [ufᵗᵗ bɑ]
Target Partial_G: [æχᵗ kɑ], [æfᵗ bæ], [ufᵗ bɑ]
Target Zero_G: [æχ kɑ], [æf bæ], [uf bɑ]
Control: [æχ ke], [æf bæ], [uf bɑ]

The four sets of target and control nonwords presented above are tokens excised from the full words presented below:

Target: /sæχt kɑr/ “hard-working”, /næft bærɑje/ “oil for”, /kuft bɑʃeh/ “be cheap”
Control: /næχ ke/ “thread that”, /sæf bærɑje/ “queue for”, /mæruf bɑʃeh/ “be famous”

3.3 Procedure

All the participants listened to forty stimuli (10 stimuli in each category), each repeated eight times (320 tokens in total), in a sound booth located at the linguistics laboratory of the Scuola Normale Superiore. The software Presentation was used to present the stimuli to the listeners as an identification task. The participants were asked to listen very carefully and decide as quickly as possible whether it was likely that there had been a [t] at the end of the first part of each stimulus. For each stimulus, the participants were asked to press either the green or the blue button on a Cedrus response pad. On a computer screen, listeners could also see “T” or “NO T” corresponding to the response buttons. The stimuli were shuffled and presented in blocks such that participants began by hearing either all the tokens with [f] or all the tokens with [χ]. They also had the choice of taking a break after every 80 tokens. All the participants received a short training before the start of the experiment. The following section contains the results of the study.

4 Results

The main goal of this study is to test listeners’ sensitivity to different degrees of residual gestures remaining after the simplification of consonant clusters. The response type and reaction time are the dependent variables in this study; however, only the results related to response type are presented here. Figure 1 below shows the perception rate of [t] by all subjects

across the four conditions. According to this figure, the subjects show the highest rate of [t] perception in tokens with full alveolar gesture (i.e., 59.69%) and the lowest in the control tokens (i.e., 36.09%). The condition with partial alveolar gestures shows a rate of 56.20%, which is very close to the full condition. The stimuli in the zero alveolar condition show an intermediate level between the control and the other two target conditions, with a rate of 49.84%. This indicates an almost identical pattern for the two target conditions with full and partial gestures, an intermediate situation for the target condition with zero gesture, and a pattern for the control which is different from the three target conditions.

[Bar chart: “The rate of [t] perception for the mean subject”; y-axis from 0% to 80%; x-axis conditions Control, Full_G, Partial_G, Zero_G]

Figure 1: The Rate of [t] Perception by all Subjects

In order to examine the relation between the two categorical variables in the study, namely the response type and the stimuli condition, a Pearson chi-square test was run. The null hypothesis is that there is no relation between [t] perception and the four conditions in the study. The test, with [t] perception as the dependent variable, found a significant effect of condition, χ²(3, N = 960) = 46.2, p < 0.001. This shows that there is a significant relation between the stimuli conditions and response type. In order to determine whether the difference in the perception of [t] across the four categories is really significant or is due to chance variation, a column proportions test was performed. This test uses a z-test to make the comparisons. The result showed that the perception of [t] in the control was significantly different from all the target categories. The next section presents the discussion and concluding remarks of the study.

5 Discussion and Conclusion

This research investigated listeners’ sensitivity to three types of /t/ realizations as targets and compared the results with the control. The target categories included simplified consonant clusters with full, partial, and zero alveolar gestures. The stimuli used as the baseline in the control had no alveolar gesture in the underlying form. The general results of the study showed that subjects reliably distinguished the three target sets with varying residual gestures from the control. This could be due to greater similarity in tongue configuration among these varying degrees of coronal stop articulation, compared to the control condition where there is no alveolar gesture in the underlying form. Any articulatory modification is expected to trigger acoustic changes. The acoustic analysis of the stimuli used in this study, reported by Falahati (2013), showed no significant difference between the simplified tokens (i.e., the three target sets with varying degrees of residual gestures, labeled all together as simplified) and the control tokens. The acoustic parameters used in the analysis were V₁ duration, consonant cluster duration, and formant transitions. Despite the fact that the results did not show any significant difference between the simplified and control conditions, the durations of V₁ and of the consonant clusters in the simplified condition were always higher than in the control condition. It could be the case that these acoustic cues, although not very strong, are enough for the human auditory system to trigger the perception of a segment.

The results of the current study also showed that participants perceived almost 36% of the tokens with no underlying coronal stop as having [t]. This is similar to the results of the study reported by Nalon (1992), where 20% of the control nonalveolar tokens were perceived as having [d]. In his study, however, the control tokens showed a pattern similar to that of the target with zero alveolar (i.e., full assimilation). He attributes this both to subjects’ natural language experience and to the inherent ambiguity in the stimuli. He states that subjects are “willing to ‘undo’ its effects” and therefore, in the case of the current study, report coronal stops even where there is no evidence for them.

The results of our study also showed that participants perceived more [t] in the tokens with full and partial alveolar gestures compared to the ones with zero alveolar gestures. The difference

between the three categories, however, did not reach the significance level. Such a result could shed more light on the theories of speech perception discussed earlier in this paper. In order to discuss this issue, we first need to further explore the nature of the three categories in the target stimuli. Of the three groups in the target stimuli, one group categorically had no alveolar gesture, while the other two had different degrees of the gesture, either as a result of overlap or of reduction. We argue that gradient gestural reduction and overlap are due to low-level phonetic and mechanical reasons, while categorical deletion, which results in tokens with zero gestures, is caused by the cognitive system. In the former groups, speakers neither intend to reduce nor plan to overlap gestures, while the latter process is intended by the speaker. According to MT and DR, listeners’ target in speech perception is the intended or phonological gestures. Therefore, the overlapped and reduced stimuli should show a different perceptual pattern compared to the stimuli with no residual gesture. The results of this study did not show a striking difference between these three target sets. The existence of acoustic cues pertaining to the presence of gestures is a prerequisite to their perception by the listener. If distinguishing acoustic details could be found between these three categories, then this would not support the gesturalist approach to speech perception. However, with the current results, such a claim cannot be made. Further acoustic analysis of these three target sets is needed to examine this idea.

The findings of our experiment could be best explained by referring to exemplar models of speech perception. In such models, the lexical representations of words change in a gradient way over time. This is due to the nature of some phonological processes in languages which are not categorical. According to this view, the perceptual mechanism is capable of making fine phonetic distinctions. However, it is the mapping between the gradient stimuli and the auditory system which fails and does not result in invariant forms.

The lack of such a one-to-one mapping will bring variation across subjects in the speech community. The degree of such variation is determined by the amount of an individual’s exposure to the specific variants. A closer look at the results for individual subjects showed that all twenty-four participants in the study could fall into three or four dominant patterns based on their perception of [t]. The variation across individuals regarding speech perception could be a good source of information for specialists in the field. Moreover, the degree to which an individual’s speech production maps onto his/her perception is an interesting topic which remains to be explored.

Acknowledgments

We are very grateful to Patrice Beddor for her comments and suggestions on this study.
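As a rough illustration of the two tests reported in Section 4, the following Python sketch computes a Pearson chi-square statistic for a 2 × 4 table of response type by condition, followed by a pooled two-proportion z-test comparing the control with one target condition. The counts are hypothetical placeholders rather than the data of this study, the function names are invented for this sketch, and no correction for multiple comparisons is applied; it should not be read as the exact analysis procedure used here.

```python
from math import erfc, sqrt


def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p) for a pooled two-proportion z-test."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# HYPOTHETICAL counts (not the study's data): 100 responses per condition.
conditions = ["Control", "Full_G", "Partial_G", "Zero_G"]
t_responses = [40, 60, 55, 50]       # "T" button presses
no_t_responses = [60, 40, 45, 50]    # "NO T" button presses

statistic, dof = pearson_chi_square([t_responses, no_t_responses])
print(f"chi-square({dof}) = {statistic:.2f}")

# Compare the control column with the Full_G column.
z, p_value = two_proportion_z(t_responses[0], 100, t_responses[1], 100)
print(f"Control vs Full_G: z = {z:.2f}, p = {p_value:.4f}")
```

With four conditions and two response types, the chi-square test has (2 − 1) × (4 − 1) = 3 degrees of freedom, which matches the value reported in Section 4.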

References

Patrice S. Beddor, Kevin B. McGowan, Julie Boland, Andries W. Coetzee, and Anna Brasher. 2013. The perceptual time course of coarticulation. Journal of the Acoustical Society of America, 133, 2350-2366.

Pierre Delattre, Alvin M. Liberman, and Franklin S. Cooper. 1955. Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 27, 769-773.

Pierre Delattre, Alvin M. Liberman, and Franklin S. Cooper. 1964. Formant transition and loci as acoustic correlates of place of articulation in American fricatives. Stud. Linguist., 18, 104-121.

Randy L. Diehl, Andrew J. Lotto, and Lorri L. Holt. 2004. Speech perception. Annual Review of Psychology, 55, 149-179.

Reza Falahati. 2013. Gradient and Categorical Consonant Cluster Simplification in Persian: An Ultrasound and Acoustic Study. Ph.D. dissertation, University of Ottawa, Ottawa.

Carol C. Fowler. 1981. Production and perception of coarticulation among stressed and unstressed vowels. Journal of Speech and Hearing Research, 46, 127-139.

Carol C. Fowler. 1984. Segmentation of coarticulated speech in perception. Perception & Psychophysics, 36, 359-368.

Carol C. Fowler. 1996. Listeners do hear sounds, not tongues. Journal of the Acoustical Society of America, 99, 1730-1741.

Keith Johnson. 1997. Speech perception without speaker normalization: an exemplar model. In K. Johnson and J. W. Mullennix (eds.), Talker variability in speech processing, 145-165. San Diego: Academic Press.

Alvin M. Liberman and Ignatius G. Mattingly. 1985. The motor theory of speech perception revised. Cognition, 21, 1-36.

Francis Nalon. 1992. The descriptive role of segments: evidence from assimilation. In G. J. Docherty and R. Ladd (eds.), Papers in Laboratory Phonology IV, 261-289. Cambridge: Cambridge University Press.

John Ohala. 1990. Respiratory activity in speech. In W. J. Hardcastle and A. Marchal (eds.), Speech Production and Speech Modeling, 23-53. Netherlands: Kluwer Academic Publishers.

John Ohala. 1995. The perceptual basis of some sound patterns. In B. Connell and A. Arvaniti (eds.), Phonology and phonetic evidence, Papers in Laboratory Phonology IV, 87-92. Cambridge: Cambridge University Press.