<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual gender cues elicit agent expectations: different mismatches in situated language comprehension.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Burigo (mburigo@cit-ec.uni-bielefeld.de)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pia Knoeferle (knoeferl@cit-ec.uni-bielefeld.de)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cognitive Interaction Technology Excellence Cluster, Bielefeld University</institution>
          ,
          <addr-line>Bielefeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>234</fpage>
      <lpage>239</lpage>
      <abstract>
        <p>Previous research has shown that visual cues (depicted events) can have a strong effect on language comprehension and guide attention more than stereotypical thematic role knowledge ('depicted / recent event preference'). We examined to what extent this finding generalizes to another visual cue (gender inferred from the hands of an agent) and to what extent it is modulated by picture-sentence incongruence. Participants inspected videos of hands performing an action, and then listened to non-canonical German OVS sentences while we monitored their eye gaze to the faces of two potential subjects / agents (one male and one female). In Experiment 1, the sentential verb phrase matched (vs. mismatched) the video action, and in Experiment 2, the sentential subject matched (vs. mismatched) the gender of the agent's hands in the video. Additionally, both experiments manipulated gender stereotypicality congruence (i.e. whether the gender stereotypicality of the described actions matched or mismatched the gender of the hands in the video). Participants overall preferred to inspect the target agent face (i.e. the face whose gender matched that of the hands seen in the previous video), suggesting that the depicted event preference observed in previous studies generalizes to visual gender cues. Stereotypicality match did not seem to modulate this gaze behavior. However, when there was a mismatch between the sentence and the previous video, participants tended to look away from the target face (post-verbally for action-verb mismatches and at the final subject region for hand-gender / subject-gender mismatches), suggesting that outright picture-sentence incongruence can modulate the preference to inspect the face whose gender matched that of the hands seen in the previous video.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual world</kwd>
        <kwd>situated sentence comprehension</kwd>
        <kwd>gender</kwd>
        <kwd>stereotypes</kwd>
        <kwd>mismatch</kwd>
        <kwd>eye tracking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Studies on the role of world-knowledge during situated
language comprehension (i.e. when language relates to a
visual, non-linguistic context) have provided evidence that
comprehenders tend to visually anticipate contextually
plausible objects before their mention. In a visual
context in which a little girl, a man, a motorbike and a
carousel were depicted, participants directed more
anticipatory looks to the motorbike after hearing The man
will ride compared to The girl will ride
        <xref ref-type="bibr" rid="ref7">(Kamide, Altmann,
&amp; Haywood, 2003)</xref>
        . Other studies have examined the
influence of prior visual context on ensuing attention and
language comprehension by showing participants a depicted
event before the sentence. What these studies have shown so
far is that visually grounded information enjoys priority
over other types of information, such as relevant linguistic
or world-knowledge. For instance, when depicted events
were pitted against world knowledge related to occupational
stereotypes, the former were preferred
        <xref ref-type="bibr" rid="ref10">(Knoeferle and
Crocker, 2007, Experiment 1)</xref>
        : During the comprehension of
non-canonical German sentences like ‘The pilotACC spies on
soon the…’ (literal translation) participants looked more
often to the (location of the) agent that had been involved in
the previously depicted spying event (e.g. a wizard) even
when a stereotypically more plausible character had also
been depicted (e.g. a detective).
      </p>
      <p>
        Manipulations of verb tense have given rise to similar
results in cases in which two objects were plausible role (i.e.
theme) fillers for the action indicated by the verb
        <xref ref-type="bibr" rid="ref10">(Knoeferle
&amp; Crocker, 2007, Experiment 3)</xref>
        . Participants saw an agent
performing an action over one of two available objects
(candelabra, crystal glasses) in a visual setting (e.g. a waiter
polishing candelabra) and then listened to a sentence such as
Der Kellner poliert… (‘The waiter polish…’). The verb was
tense ambiguous and the continuation could be about the
recent polishing event or a future polishing event (involving
the other object: the crystal glasses). Tense disambiguation
towards the simple past (i.e. …polierte soeben… ‘polished
recently’) or the future tense (i.e. …poliert demnächst…
‘polishes soon’) removed this ambiguity. Eye-movement
patterns during the verb and the adverb regions showed a
preference for the recent event target over the future event
target, regardless of tense (and before either target was
mentioned, ‘recent-event preference’). This preference has
been replicated with 5-year-old children
        <xref ref-type="bibr" rid="ref19">(Zhang, Kornbluth,
&amp; Knoeferle, 2012)</xref>
        . It also persisted in real-world settings
when the actor performed both the recent and a “future”
event (after the sentence) in each trial
        <xref ref-type="bibr" rid="ref11">(e.g., Knoeferle,
Carminati, Abashidze, &amp; Essig, 2011)</xref>
        . A frequency bias
towards the future events (filler trials showing more
frequent post-sentential/future events) did give rise to an
earlier increase of looks to the future event targets, but the
overall recent event preference persisted
        <xref ref-type="bibr" rid="ref1 ref12 ref14">(Abashidze,
Knoeferle, &amp; Carminati, 2014)</xref>
        .
      </p>
      <p>In summary, during language comprehension, people
seem to have a strong tendency to anticipate what has
previously been visually grounded, even when this does not
match their world knowledge. The preference seems to
persist even when future tense cues should favor a future
event. However, other biases (e.g. frequency) have
modulated (but not overridden) this preference.</p>
      <sec id="sec-1-1">
        <title>Picture-sentence verification: mismatch effects</title>
        <p>
          One interesting question is to what extent the preference to
rely on visual cues (e.g., depicted events) can be modulated
by incongruence between language and the world. If we
encounter frequent mismatches and verify the congruence
between the world (pictures) and language, our reliance on
what is depicted might decrease. Picture-sentence
verification experiments have a
long tradition in psycholinguistics
          <xref ref-type="bibr" rid="ref5 ref6">(Gough, 1965; Just &amp;
Carpenter, 1971)</xref>
          and, when combined with continuous
measures like eye tracking or ERPs (event-related brain
potentials), they can provide insights into real-time language
processing
          <xref ref-type="bibr" rid="ref12 ref17 ref18">(Knoeferle, Urbach, &amp; Kutas, 2014; Vissers,
Kolk, van de Meerendonk, &amp; Chwilla, 2008; Wassenaar &amp;
Hagoort, 2007)</xref>
          .
        </p>
        <p>Wassenaar and Hagoort (2007) conducted a picture-
sentence verification experiment with both healthy older
adults and individuals with aphasia. Participants inspected a
line drawing showing either a reversible event involving an
agent and a patient (e.g. a man pushing a woman) or an
irreversible event involving an agent and a theme (e.g. a
woman reading a book). Then they read a sentence in either
the active (for the semantically irreversible and reversible
cases) or the passive voice (only for the semantically
reversible cases,
e.g., ‘the tall man on this picture pushes the young woman’,
translation from Dutch). Sentences could either match or
mismatch the depicted visual information. For healthy older
participants, mismatches (vs. matches) elicited larger early
negative amplitudes time-locked to verb onset in reversible
active sentences. For the irreversible active sentences and
the reversible passive ones, the early negativity was
followed by a late positive shift, resembling a P600. The
authors argued that these ERP effects indexed processes of
thematic role assignment. Vissers and colleagues (2008)
observed a similar pattern for mismatches between the
depicted spatial relations of objects (e.g. a square followed
by a circle) and linguistically described spatial relations.
They argued in particular that the P600 was an index of a
general monitoring mechanism.</p>
        <p>
          Even more relevant to the current study,
          <xref ref-type="bibr" rid="ref12">Knoeferle et al.
(2014)</xref>
          measured ERPs as participants read English subject-
verb-object sentences (e.g. The gymnast punches the
journalist) and verified whether they matched a recently
seen picture. Sentences could either fully match the pictures
or contain different types of mismatches, i.e. verb-action,
role-relations, or both. These different mismatches elicited
distinct ERP responses. Role mismatches (vs. matches)
elicited larger anterior mean amplitude negativities (200-
400 ms after subject onset) while verb-action mismatches
(vs. matches) elicited a somewhat later centro-parietal
negativity (300-500 ms after verb onset), resembling an
N400 effect. Thus, different picture-sentence relations might
actually implicate functionally distinct cognitive
mechanisms. Overall, incongruence seems to affect
language comprehension rapidly and could thus modulate
visually-based attention preferences.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>The current study</title>
      <p>The present study is motivated both by the studies on the
depicted-event preference in situated language
comprehension and by those on picture-sentence
verification. We used visual gender cues (the gender of
hands in an action video) to examine whether this sort of
visual cue (much like a recent action) would also elicit
anticipation of an upcoming depicted agent (a gender-
matching face). Participants inspected videos of pairs of
hands performing an action, and then listened to non-
canonical German object-verb-subject (OVS) sentences
while looking at the pictures of two potential agents’ faces
(one male and one female). Post-comprehension,
participants verified via a button press whether the sentence
they listened to matched the video they had just seen. We
expected faster responses for matches than mismatches.</p>
      <p>
        Expectations regarding the agent depend on the successful
association between the agents’ dimorphic gender cues from
the videotaped events (i.e. the hands) and the faces shown
during sentence comprehension. Gender categorization as
such has been characterized as straightforward
        <xref ref-type="bibr" rid="ref13 ref15 ref4">(Stangor,
Lynch, Duan, &amp; Glass, 1992; Macrae &amp; Martin, 2007;
Gaetano, van der Zwan, Blair, &amp; Brooks, 2014)</xref>
        . If this
process takes place rapidly and if gender cues influence
subsequent language comprehension, then we should
observe an early inspection preference for the target agent
face (e.g. a woman when the hands in the video belonged to
a woman) relative to the competitor agent (a man). This
anticipation would support the idea that the depicted-event
preference
        <xref ref-type="bibr" rid="ref11">(Knoeferle et al., 2011)</xref>
        generalizes to hand-gender
cues. The competitor, however, could also receive
attention if participants base their anticipation on
stereotypical knowledge associated with the video event
(e.g., building a toy model is a stereotypically male action).
      </p>
      <p>If visual anticipation occurs, then we could furthermore
begin to examine to what extent different video-language
relations modulate it. In fact, we know little about how
different sorts of mismatches influence gender-based agency
expectations. Sentences either contained a mismatch
between the videotaped action and the sentential verb phrase
(object and verb, Experiment 1, henceforth ‘action-verb
mismatches’) or a mismatch between the gender of the
hands of the agent in the video (no face was shown in the
video) and the gender of the sentential subject (Experiment
2). Furthermore, both experiments manipulated gender
stereotypicality (i.e. the described action either
stereotypically matched or mismatched the gender of the
hands in the video).</p>
      <p>If participants rapidly relate the unfolding sentence to
their representation of previous events, both action-verb
(phrase) and hands-subject gender mismatches should
influence their expectations, resulting in attentional shifts
away from the target face and towards the competitor face.
Given that action-verb mismatches (Experiment 1) could be
detected at the sentence-initial object noun (and ensuing
verb), mismatch effects might emerge as early as the first
noun phrase or the following verb. Hands-subject gender
mismatches (Experiment 2), on the other hand, occurred at
the subject noun phrase; we should therefore observe an
effect at that region.</p>
      <p>Furthermore, if gender stereotypes have a strong influence
on visual attention during language processing, they could
potentially override or at least modulate the visual
anticipation of the target face. In that case we should
observe fewer looks to the target face in stereotypicality
mismatching cases (e.g. when female hands performed a
stereotypically male action and the sentence described a
stereotypically male action) compared to stereotype
matches.</p>
    </sec>
    <sec id="sec-3">
      <title>Experiments 1 and 2</title>
      <sec id="sec-3-1">
        <title>Participants</title>
        <p>32 different participants (16 females, 19-32 years) took part
in each experiment. Each participant received 6 Euro for
participation. All were German native speakers, had normal
or corrected to normal vision, and gave informed consent
before the experiment.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Materials and Design</title>
        <p>We conducted a norming study on 104 verbalized actions
(e.g., polishing nails, repairing a radio) to assess their
gender stereotypicality. Participants (N=20) rated these on a
bipolar 7-point scale. Based on the rating results, we
selected 32 stereotypically female and 32 stereotypically
male actions as our linguistic stimuli. We paired these with
128 videos (both stereotypically female and male actions
were video-taped with a female and a male actor of which
only the hands were shown, see Fig. 1). We used two female
and two male actors. Videos were close-ups of pairs of
hands acting upon objects on a table. The target displays
(presented during sentence comprehension) each consisted
of two close-up photographs of a male and a female face.</p>
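        <p>As an illustration of the norming-based item selection, the following minimal Python sketch assumes per-rater scores on the bipolar 7-point scale stored in a file norming_ratings.csv with columns action and rating, and scale anchors 1 (stereotypically female) to 7 (stereotypically male); the file name, column names, and anchor direction are illustrative assumptions, not the original materials.</p>
        <preformat>
# Minimal sketch of the norming-based selection of 32 stereotypically
# female and 32 stereotypically male actions (file name, column names,
# and scale anchors are illustrative assumptions).
import pandas as pd

# one row per rater x action; rating: 1 (stereotypically female)
# through 7 (stereotypically male)
ratings = pd.read_csv("norming_ratings.csv")

means = ratings.groupby("action")["rating"].mean()

female_actions = means.nsmallest(32).index.tolist()  # most female-typed
male_actions = means.nlargest(32).index.tolist()     # most male-typed
        </preformat>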
        <p>From these materials we created 32 experimental items
consisting of video pairs (one stereotypically female and
one stereotypically male action) and their corresponding
German sentence pairs with a non-canonical object-
verb-subject (OVS) structure (see Tables 1 and 2). We used
this structure in order to monitor participants’ expectations
regarding the upcoming subject (agent). The sentences were
recorded with neutral intonation. We synchronized the
syllable length of the constituents and their onsets for the
sentences within an item.</p>
        <p>In Experiment 1 we manipulated two factors. The first was
action-verb congruence (the action described in the sentence
as expressed by object-verb combinations either matched or
mismatched the action in the video); a second factor was
stereotypicality congruence (the action described by the
sentence either matched or mismatched stereotypically with
the gender implied by the hands in the video). For instance,
a congruous condition would feature female hands in the
video and a sentence about a stereotypically female action;
an incongruous example included female hands in the video
and a sentence about a stereotypically male action.
Crossing these factors yielded 4 conditions (see Table 1),
which were counterbalanced across experimental lists in a
Latin Square manner. Word order and the use of the
post-verbal adverb gleich were constant across conditions.
Experiment 2 was identical to Experiment 1 except that we
replaced the verb-action congruence manipulation with
another congruence manipulation (between the gender cued
by the hands in the video and the gender of the sentential
subject, Table 2).</p>
        <p>Tables 1 and 2 (abridged): example sentence material
included the verbs bautV ('builds') and backtV ('bakes'), the
adverb gleichADV ('soon'), and the subject SusannaNP2; the
hands-subject gender match was either given ('yes') or not
('no').</p>
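        <p>To make the Latin-square counterbalancing concrete, the following minimal sketch rotates the four conditions of Experiment 1 across four lists, so that each item occurs exactly once per list and each condition equally often within a list; the condition labels are illustrative.</p>
        <preformat>
# Minimal sketch of Latin-square list construction for the 2 x 2
# design (action-verb congruence x stereotypicality congruence);
# labels are illustrative.
conditions = ["AV-match/stereo-match", "AV-match/stereo-mismatch",
              "AV-mismatch/stereo-match", "AV-mismatch/stereo-mismatch"]
n_items, n_lists = 32, 4

lists = {lst: [] for lst in range(n_lists)}
for item in range(n_items):
    for lst in range(n_lists):
        # rotate the condition assignment across items and lists
        cond = conditions[(item + lst) % len(conditions)]
        lists[lst].append((item, cond))

# each list holds 32 items, 8 per condition; each item once per list
assert all(len(trials) == n_items for trials in lists.values())
        </preformat>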
        <p>Fillers (total N=70) contained videos of actions which
were not stereotypically marked (e.g. filling out a
crossword), combined with the same sentence structures as
in the experimental items (N=18); videos showing two pairs
of hands engaged in an action with dative constructions
(N=18; 9 dative-first and 9 dative-middle sentences) and
pictures of objects paired with various sentence structures
(N=34). Half of the fillers contained video-sentence
mismatches of different types (e.g., action, gender of the
agents and color).</p>
      </sec>
      <sec id="sec-3-6">
        <title>Procedure</title>
        <p>An Eyelink 1000 Desktop Mounted Eye-Tracker recorded
participants’ eye movements. Viewing was binocular but we
tracked only participants’ right eye. Participants completed
10 practice trials before the experiment. Each trial started
with a video of the hands in action (3500 ms). The final
frame (displaying both hands in rest position and the object)
stayed for another 1500 ms. Next, a centrally-presented
cross appeared for 1000 ms and then a target screen with
two photographs of a female and a male face appeared (face
position was counterbalanced across trials within a list, Fig.
1). After 1500 ms the sentence was played and eye
movements to the pictures were recorded. Participants
indicated as quickly and accurately as possible via a (“yes”
or “no”) button press (Cedrus RB 834) whether the preceding
video matched the sentence. The position of the response buttons
was counterbalanced across participants.</p>
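        <p>Summarized as a trial timeline (durations as reported above; the event names are illustrative, not taken from the experiment software), a trial unfolded as follows.</p>
        <preformat>
# Sketch of one trial's event sequence; durations in ms as reported
# in the Procedure, event names are illustrative.
TRIAL_TIMELINE = [
    ("action_video",   3500),  # video of the hands in action
    ("final_frame",    1500),  # hands at rest plus the object
    ("fixation_cross", 1000),  # centrally presented cross
    ("face_preview",   1500),  # female and male face photographs
    ("sentence_audio", None),  # spoken OVS sentence; gaze recorded
    ("verification",   None),  # yes/no button press (Cedrus RB 834)
]
        </preformat>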
      </sec>
      <sec id="sec-3-7">
        <title>Analysis</title>
        <p>For each participant, we calculated the mean reaction times
(time-locked to the beginning of sentence presentation) and
the percentage of correct responses per condition and
subjected these to ANOVAs. For the eye-tracking data, we
divided each experimental sentence into four time regions
(the object noun phrase: NP1; the verb: V; the adverb:
ADV; and the subject noun phrase: NP2). Each region
extended from its onset to the onset of the next region
except for NP2, which ended at sentence offset.</p>
        <p>
          Because looks to one of the characters implied fewer looks
to the other character in the visual scene, we computed the
mean log-gaze probability ratios for each separate sentence
region to measure the bias of inspecting the target agent (i.e.
the face which matched in gender the hands in the previous
video) over the competitor agent (the other face): ln(P(target
agent)/P(competitor)). Values above zero reflect a target
agent preference, while values below zero represent a
preference for the competitor. These scores are suitable for
parametric tests such as ANOVAs
          <xref ref-type="bibr" rid="ref3">(Arai, Van Gompel &amp;
Scheepers, 2007)</xref>
          . We calculated mean log probability ratios
per region by subjects and by items, which we subjected to
ANOVAs. For the time-course graphs, we plotted
the gaze probability ratios in successive 20 ms time slots
from the beginning of the sentence. Missing and incorrect
responses were excluded from both the eye movement and
response time analyses.
        </p>
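        <p>As a minimal sketch of this measure, the following computes per-trial log-gaze probability ratios per sentence region from sample counts and averages them by subject; the data frame, its column names, and the smoothing constant eps (which keeps the ratio defined when one count is zero) are illustrative assumptions rather than the original analysis code.</p>
        <preformat>
# Minimal sketch: mean log-gaze probability ratio
# ln(P(target agent) / P(competitor)) per sentence region
# (cf. Arai, Van Gompel, and Scheepers, 2007). Column names and
# the smoothing constant eps are illustrative assumptions.
import numpy as np
import pandas as pd

def log_gaze_ratio(target, competitor, eps=0.5):
    """Log ratio of eye-tracking samples on target vs. competitor face."""
    return np.log((target + eps) / (competitor + eps))

# one row per trial x region: counts of samples on each face
gaze = pd.DataFrame({
    "subject":    [1, 1, 2, 2],
    "region":     ["V", "ADV", "V", "ADV"],
    "target":     [130, 210, 95, 60],
    "competitor": [40, 25, 80, 150],
})
gaze["log_ratio"] = log_gaze_ratio(gaze["target"], gaze["competitor"])

# by-subject means per region feed the ANOVAs; values above zero
# indicate a target preference, values below zero a competitor
# preference
by_subject = gaze.groupby(["subject", "region"])["log_ratio"].mean()
print(by_subject)
        </preformat>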
      </sec>
      <sec id="sec-3-8">
        <title>Time course of the eye movements</title>
        <p>The graphs in Fig. 2 show the time course of
eye movements to the target agent face relative to the
competitor face (ln(P(target agent)/P(competitor))) per
condition for each experiment. As no effect emerged in the
first noun phrase region (NP1), we plot the data from mean
verb onset. A first observation is that the log gaze
probability ratios were positive in all plotted time regions,
suggesting preferential inspection of the target agent (the
woman if the hands in the video had belonged to a woman)
over the competitor during the entire sentence.</p>
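        <p>The curves themselves can be derived by binning per-sample area-of-interest codes into successive 20 ms slots from sentence onset, as in the following sketch; the 1000 Hz sampling rate and the coding scheme are assumptions for illustration.</p>
        <preformat>
# Sketch: per-trial time course of the log gaze probability ratio in
# successive 20 ms bins from sentence onset (cf. Fig. 2). Sampling
# rate and AOI coding are illustrative assumptions; the plotted
# per-condition curves would average these over trials and subjects.
import numpy as np

SAMPLE_HZ = 1000
BIN_MS = 20
SAMPLES_PER_BIN = SAMPLE_HZ * BIN_MS // 1000  # 20 samples per bin

def timecourse_ratio(aoi, eps=0.5):
    """aoi: NumPy array of per-sample codes from sentence onset
    (1 = target face, -1 = competitor face, 0 = elsewhere).
    Returns one log ratio per 20 ms bin."""
    n_bins = len(aoi) // SAMPLES_PER_BIN
    ratios = []
    for b in range(n_bins):
        chunk = aoi[b * SAMPLES_PER_BIN:(b + 1) * SAMPLES_PER_BIN]
        target = np.sum(chunk == 1)
        competitor = np.sum(chunk == -1)
        ratios.append(np.log((target + eps) / (competitor + eps)))
    return np.array(ratios)
        </preformat>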
        <p>Upon encountering video-sentence mismatches, the
matching (green lines) and mismatching conditions (red
lines) diverged: values for the matching conditions
increased, while values for the mismatching conditions
started to decrease, indicating that participants started to
look away from the
target face. For action-verb mismatches (Panel A) we can
see this happening at the post-verbal region, while for
hands-subject gender mismatches (Panel B), this happens at
the sentence-final subject region.</p>
      </sec>
      <sec id="sec-3-9">
        <title>Results Experiment 1</title>
        <p>Response times and accuracy: Participants were faster (ps &lt;
.001) and responded more accurately (ps &lt; .001) to action-verb
mismatches than matches.</p>
        <p>Eye-movement analysis: For all regions, the estimated
grand mean of the log ratios remained significantly above
zero, suggesting an
overall preference for the target face. No significant effects
of the independent variables emerged for the NP1 and V
regions. However, a significant effect of action-verb match
emerged post-verbally, at the ADV region (ps &lt; .001) and
continued into the NP2 region (ps &lt; .001): Participants
directed more looks to the target face for action-verb
matches than mismatches.</p>
      </sec>
      <sec id="sec-3-10">
        <title>Results Experiment 2</title>
        <p>Response times and accuracy: There were no significant
differences between conditions.</p>
        <p>Eye-movement analysis: Similar to Experiment 1, the log
ratios remained above zero for all regions. There was a
hands-subject match effect at NP2 (ps &lt; .01).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>In two eye-tracking verification experiments, we assessed
the generality of the depicted event preference for another
visual (gender) cue by monitoring whether the gender of
hands performing an action would affect expectations about
the agent. We combined dimorphic visual gender cues with
gender-stereotype knowledge and two types of sentence
mismatches regarding the previously seen events. We varied
the video-sentence match manipulation from action-verb
(Experiment 1) to hands-subject gender congruence
(Experiment 2) and assessed in both experiments the
influence of gender stereotype mismatches. Below we discuss
the analyses of response latencies and accuracy in the
verification task as well as the analyses of the eye-
movement data.</p>
      <p>
        Contrary to our predictions, participants responded faster
and were more accurate for action-verb mismatches than
matches (Experiment 1). This might be due to judgment
facilitation for utterly mismatching verbal information
compared to action-verb matches, which has also been seen
in studies using a similar paradigm
        <xref ref-type="bibr" rid="ref1 ref12 ref14">(Münster, Carminati &amp;
Knoeferle, 2014, but see Knoeferle, Urbach, &amp; Kutas,
2014)</xref>
        . However, there was no such effect in Experiment 2.
It is possible that the reliable mismatch RT effect in
Experiment 1 came about because people could detect (and
thus respond to) the mismatch in principle at the first noun.
As a result, they responded faster to mismatches than
matches. In Experiment 2, the sentence-final noun phrase
revealed the hand-gender / subject-gender mismatch,
meaning that participants had to wait until sentence end
before responding to both the matches and mismatches
alike. The late emergence of the mismatching region may
have eliminated the response time differences observed in
Experiment 1.
      </p>
      <p>Regarding the eye movements, participants preferred to
look at the target face (the face whose gender matched that
of the hands in the video) relative to the competitor
throughout the sentence. This preference emerged in both
experiments and regardless of the stereotypical content of
the sentence. Importantly, video-sentence mismatches
modulated this preference, even though we did not observe
an attentional shift towards the competitor. Participants
tended to look away from the target agent when a mismatch
was encountered, suggesting that both verbal and subject
information affect expectations about the agent. This effect
emerged unexpectedly late, at the post-verbal region for
action-verb mismatches (Experiment 1) and rapidly at the
final subject region for hands-subject gender mismatches
(Experiment 2).</p>
      <p>The delayed emergence of action-verb mismatch effects
(Experiment 1) at the post-verbal region (rather than at the
first noun or verb when the mismatch could have been
detected) could indicate processes of integrating the non-
canonical object with the verb while reconciling both object
and verb with the representation of the previous event. Note
that delays in visual attention (albeit not in a mismatch
design) have been reported in studies using the same word
order
        <xref ref-type="bibr" rid="ref8">(Kamide, Scheepers, &amp;
Altmann, 2003)</xref>
        . Perhaps the
non-canonical structure was partly responsible for the delays
in visual attention and the action-verb congruence effect in
Experiment 1.</p>
      <p>
        Mismatch effects at the final subject region (Experiment
2), unlike action-verb mismatches, seemed more immediate,
arguably because this type of incongruence involves a single
linguistic entity. Perhaps participants adopted a referential
strategy upon encountering the subject noun. This is
compatible with the fact that they looked away from the
initially considered (target) agent in mismatching cases
(e.g., participants looked away from the male face when the
video depicted male hands but the final noun referred to a
female agent). However, a referential strategy would in
addition predict a rapid shift of attention towards the
competitor agent (e.g., the female face) closely time-locked
to its referring word
        <xref ref-type="bibr" rid="ref16">(Tanenhaus, Spivey-Knowlton,
Eberhard &amp; Sedivy, 1995)</xref>
        , which was not the case. Future
research is necessary to ascertain to what extent the
anticipation of the target face and its modulation through the
mismatches reflect referential (vs. compositional) processes.
      </p>
      <p>
        In addition, future research could examine the effects of
other mismatches and linguistic manipulations with a view
to enriching current models on situated language
comprehension
        <xref ref-type="bibr" rid="ref2 ref9">(Altmann &amp; Mirković, 2009; Knoeferle &amp;
Crocker, 2006)</xref>
        as we continue to explore the limits of visual
inspection preferences.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This project has received funding from the European
Union's 7th Framework Program for research, technological
development and demonstration under grant agreement
n°316748 and CITEC 277 (German Research Council).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abashidze</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carminati</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>How robust is the recent-event preference?</article-title>
          <source>In: Proceedings of the 36th Annual Meeting of the Cognitive Science Society</source>
          . Austin, TX: Cognitive Science Society:
          <fpage>92</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Altmann</surname>
            ,
            <given-names>G. T. M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Mirković</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Incrementality and prediction in human sentence processing</article-title>
          .
          <source>Cognitive Science</source>
          ,
          <volume>33</volume>
          ,
          <fpage>583</fpage>
          -
          <lpage>609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Arai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gompel</surname>
            ,
            <given-names>R. P. G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Scheepers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Priming ditransitive structures in comprehension</article-title>
          .
          <source>Cognitive Psychology</source>
          ,
          <volume>54</volume>
          ,
          <fpage>218</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Gaetano</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Zwan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blair</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Hands as Sex Cues: Sensitivity Measures, Male Bias Measures, and Implications for Sex Perception Mechanisms</article-title>
          .
          <source>PLoS ONE</source>
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>e91032</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Gough</surname>
            ,
            <given-names>P. B.</given-names>
          </string-name>
          (
          <year>1965</year>
          ).
          <article-title>Grammatical transformations and speed of understanding</article-title>
          .
          <source>Journal of Verbal Learning &amp; Verbal Behavior</source>
          .
          <volume>4</volume>
          :
          <fpage>107</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Just</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Carpenter</surname>
            ,
            <given-names>P. A.</given-names>
          </string-name>
          (
          <year>1971</year>
          ).
          <article-title>Comprehension of negation with qualification</article-title>
          .
          <source>Journal of Verbal Learning and Verbal Behavior</source>
          ,
          <volume>10</volume>
          ,
          <fpage>244</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Kamide</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altmann</surname>
            ,
            <given-names>G. T. M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Haywood</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>The time-course of prediction in incremental sentence processing: Evidence from anticipatory eye movements</article-title>
          .
          <source>Journal of Memory and Language</source>
          ,
          <volume>49</volume>
          ,
          <fpage>133</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Kamide</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Scheepers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Altmann</surname>
            ,
            <given-names>G.T.M.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Integration of syntactic and semantic information in predictive processing: Cross-linguistic evidence from German and English</article-title>
          .
          <source>Journal of Psycholinguistic Research</source>
          ,
          <volume>32</volume>
          ,
          <fpage>37</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Crocker</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>The coordinated interplay of scene, utterance, and world knowledge: Evidence from eye-tracking</article-title>
          .
          <source>Cognitive Science</source>
          ,
          <volume>30</volume>
          ,
          <fpage>481</fpage>
          -
          <lpage>529</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Crocker</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>The influence of recent scene events on spoken comprehension: Evidence from eye movements</article-title>
          .
          <source>Journal of Memory and Language</source>
          ,
          <volume>57</volume>
          (
          <issue>4</issue>
          ):
          <fpage>519</fpage>
          -
          <lpage>543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carminati</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abashidze</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Essig</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Preferential inspection of recent real-world events over future events: evidence from eye tracking during spoken sentence comprehension</article-title>
          .
          <source>Frontiers in Psychology</source>
          <volume>2</volume>
          :
          <fpage>306</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urbach</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kutas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Different mechanisms for role relations versus verb-action congruence effects: Evidence from ERPs in picture-sentence verification</article-title>
          .
          <source>Acta Psychologica</source>
          ,
          <volume>152</volume>
          ,
          <fpage>133</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Macrae</surname>
            ,
            <given-names>C. N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>A boy primed sue: Feature-based processing and person construal</article-title>
          .
          <source>European Journal of Social Psychology</source>
          ,
          <volume>37</volume>
          ,
          <fpage>793</fpage>
          -
          <lpage>805</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Münster</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carminati</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>How Do Static and Dynamic Emotional Faces Prime Incremental Semantic Interpretation?: Comparing Older and Younger Adults</article-title>
          .
          <source>In: Proceedings of the 36th Annual Meeting of the Cognitive Science Society</source>
          . Austin, TX: Cognitive Science Society:
          <fpage>2675</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Stangor</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lynch</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>Categorization of individuals on the basis of multiple social features</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>62</volume>
          :
          <fpage>207</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Tanenhaus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spivey-Knowlton</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eberhard</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sedivy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Integration of visual and linguistic information during spoken language comprehension</article-title>
          .
          <source>Science</source>
          ,
          <volume>268</volume>
          ,
          <fpage>1632</fpage>
          -
          <lpage>1634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Vissers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van de Meerendonk</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chwilla</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Monitoring in language perception: Evidence from ERPs in a picture-sentence matching task</article-title>
          .
          <source>Neuropsychologia</source>
          ,
          <fpage>967</fpage>
          -
          <lpage>982</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Wassenaar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hagoort</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Thematic role assignment in patients with Broca's aphasia: Sentence-picture matching electrified</article-title>
          .
          <source>Neuropsychologia</source>
          ,
          <fpage>716</fpage>
          -
          <lpage>740</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kornbluth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Knoeferle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>The role of recent versus future events in child and adult language comprehension: Evidence from eye tracking</article-title>
          . Poster presented at the 18th Annual Conference on Architectures and Mechanisms for Language Processing (AMLaP), Riva del Garda, Italy
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>