=Paper=
{{Paper
|id=Vol-2029/paper1
|storemode=property
|title=Towards a Systematic Analysis of Linguistic and Visual Complexity in Disambiguation and Structural Prediction
|pdfUrl=https://ceur-ws.org/Vol-2029/paper1.pdf
|volume=Vol-2029
|authors=Özge Alaçam,Tobias Staron,Wolfgang Menzel
|dblpUrl=https://dblp.org/rec/conf/simbig/AlacamSM17
}}
==Towards a Systematic Analysis of Linguistic and Visual Complexity in Disambiguation and Structural Prediction==
Özge Alaçam, Tobias Staron and Wolfgang Menzel
Department of Informatics, University of Hamburg
alacam, staron, menzel@informatik.uni-hamburg.de

===Abstract===

Situated language processing in humans involves the interaction of linguistic and visual processing, and this cross-modal integration helps to resolve ambiguities and to predict what will be revealed next in an unfolding sentence. However, most state-of-the-art parsing approaches rely solely on the language modality. This paper introduces a new multi-modal data-set (containing sentences and the respective images and audio files) that addresses challenging linguistic and visual complexities which state-of-the-art parsers should be able to cope with. It also briefly reports a proof-of-concept study showing the contribution of employing external visual information during disambiguation.

===1 Disambiguation and Structural Predictions===

A better understanding of human perceptual and comprehension processes in multi-modal environments is one of the crucial factors for realizing dynamic human-computer interaction. A large body of empirical evidence in psycholinguistics suggests that human language processing successfully integrates information acquired from different modalities in order to resolve linguistic ambiguities (syntactic, semantic or discourse-level) and to predict what will be revealed next in the unfolding sentence (Tanenhaus et al., 1995; Altmann and Kamide, 1999; Knoeferle, 2005). During spoken communication, online disambiguation and prediction allow us to have more accurate and fluent conversations. In contrast, state-of-the-art parsing algorithms are still far away from that accuracy and fluency when it comes to challenging linguistic or visual situations. Therefore, by developing a cross-modal parser that exploits visual knowledge, we expect to enhance syntactic disambiguation, e.g. concerning relative clause attachments and various scope ambiguities.

One of the most frequently investigated cases of syntactic ambiguity is the prepositional phrase (PP) attachment ambiguity, where different semantic interpretations are possible depending on the assignment of thematic roles (Tanenhaus et al., 1995). A well-known example is the imperative sentence "put the apple on the towel in the box", where the PP "on the towel" can be interpreted as a modifier of the apple (its location), as marked in [1] below, or as the goal location, as in [2].

:[1] put [the apple on the towel]obj [in the box]goal
:[2] put [the apple]obj [on the towel in the box]goal

The re-analysis of the interpretation during online language comprehension makes such sentences classic garden-path examples. In a multi-modal setting where the scene contains an empty towel or an apple on a towel, the visual information constrains the referential choices as well as the possible interpretations, helping the disambiguation process.

Tanenhaus and his colleagues (1995) showed that visual information influences incremental thematic role disambiguation by narrowing down the possible interpretations. Further evidence supporting this conclusion was provided by Knoeferle (2005), who addressed relatively more complex scenes containing more agents and relations, for both English and German. Those results also indicated that the influence occurs independently of the experiment language. Furthermore, Altmann and Kamide (1999) documented that listeners are able to predict the complements of a verb based on its selectional constraints. For example, when people hear the verb 'break', their attention is directed only towards the breakable objects in the scene. Some nouns may also produce expectations for certain semantic classes of verbs by activating so-called event schema knowledge (McRae et al., 2001). Besides verbs and nouns, Van Berkum et al. (2005) showed the effect of syntactic gender cues on the anticipation of upcoming words in Dutch. Similar to German, pre-nominal adjectives as well as nouns are gender-marked in Dutch, and the gender of the adjective has to agree with the gender of the noun. Their results showed that the human language processing system uses the gender cue, as soon as it becomes available, to predict the target object if its gender differs from the gender of the other objects in the environment. They interpreted this as evidence for the incremental nature of the human language system, which can predict upcoming words and immediately begin incremental parsing operations. In more recent work, Coco and Keller (2015) investigated the language-vision interaction and how it influences the interpretation of syntactically ambiguous sentences in a simple but real-world setting. Their study provided further evidence that visual and linguistic information influence the interpretation of a sentence at different points during online processing.

The empirical studies mentioned above provided insights regarding psycholinguistically plausible parsing. However, they were limited to simple (written) linguistic or visual stimuli in which object-action relations could be predicted relatively easily. Building on this prior research, our project studies the underlying mechanisms of human cross-modal language processing for incrementally revealed utterances with accompanying visual scenes, with the aim of using the empirically gained insights to develop a psycholinguistically plausible cross-modal and incremental syntactic parser which can be deployed, for instance, on a service robot. A parser that processes only linguistic information is expected to handle syntactically unambiguous cases successfully by using linguistic constraints or statistical methods. However, without external information from the visual modality, neither humans nor parsers can resolve references in syntactically ambiguous cases; they may have preferences, but the accuracy of those preferences is bounded by chance. Humans, on the other hand, naturally use external information from other modalities for disambiguation whenever it is available. By incorporating this capability, cross-modal parsers may also resolve such ambiguities and reach the correct interpretations of the visually depicted events. Therefore, a better understanding of human language processing in cross-modal environments is one of the crucial factors for the realization of dynamic human-computer interaction. Furthermore, comparing the performance of the computational model with human performance (e.g. whether ambiguities were resolved correctly, at which point of a spoken utterance a correct resolution was achieved, and how many changes were made before reaching the correct thematic role assignment) also provides valuable information about the plausibility and effectiveness of the proposed parsing architecture. Constructing a data-set that contains challenging linguistic and visual cases and complex multi-modal settings, where state-of-the-art parsers often fail, is fundamental to achieving this ultimate goal. In this paper, we therefore introduce a multi-modal data-set consisting of garden-path (fully or temporarily syntactically ambiguous) sentences.

This paper is structured as follows. In Section 2, a data-set of ambiguous German sentences and their multi-modal representations is presented. A brief description of our cross-modal parser follows in Section 3, which also reports a test run conducted on the fully ambiguous sentence structures. Section 4 summarizes the results of this work and draws conclusions.

===2 Linguistic and Visual Complexities===

Recently, a corpus of language and vision ambiguities (LAVA) in English has been released (Berzak et al., 2016). The LAVA corpus contains 237 sentences with linguistic ambiguities that can only be disambiguated using external visual information, provided as short videos or static images with real-world complexity. It addresses a wide range of syntactic ambiguities, including prepositional phrase and verb phrase attachments as well as ambiguities in the interpretation of conjunctions. However, this corpus does not take into account linguistically challenging cases like relative clause attachments or scope ambiguities, which may also give valuable insights into the underlying mechanisms of cross-modal interaction. To our knowledge, reference resolution for these linguistic cases and the effect of linguistic complexity in visually disambiguated situations have scarcely been investigated. Our multi-modal data-set consists of challenging linguistic cases in German (itemized below) which become fully unambiguous in the presence of visual stimuli.

Our main question from the psycholinguistic point of view is whether the presence of linguistic ambiguity and the degree of linguistic complexity affect the processing of multi-modal stimuli. From the computational perspective, we focus on whether and to what extent visual information is useful for disambiguation and structural prediction, in order to develop more fluent and accurate computational parsing.

Our data-set currently consists of 191 sentences (the short-term goal is to increase the sample size to 450 sentences) and addresses 8 linguistically challenging cases concerning relative clause attachments, agent/patient agreement, verb/subject agreement, and scope ambiguities for conjunctions and negations. The sentence sets for each structure are generated using the part-of-speech templates given in Table 1. Parsers often have problems with correct reference resolution for such linguistic expressions because they usually attach the relative clause to the nearest option, according to the statistical distributions in their training data or to explicitly stated rules.

Knoeferle's (2005) sentence set was used as a baseline, since the co-occurrence frequencies between the action and the Agent in the sentence, as well as between the action and the Patient, were controlled in order to single out the effect of semantic associations or preferences during parsing operations. For a syntactic parser this may seem irrelevant; however, in order to develop a comparable experimental setup for human comprehension, this parameter needs to be taken into account.

German has three grammatical genders: each noun is either feminine (f), masculine (m) or neuter (n). In a sentence that contains a relative clause attachment, the gender of the relative pronoun has to be the same as the gender of its antecedent. Sentence [3] illustrates an example containing a relative clause licensed by the NP.

:[3] Sie schmückt das Fenster(n), das(n) er säubert. (She decorates the window that he cleans.)

In sentence [4], the NP is modified by an additional NP, i.e. a genitive object. In this case, since the gender of the relative pronoun matches only the first NP, it is clear that the window is being cleaned, not the car.

:[4] Sie schmückt das Fenster(n) des Wagens(m), das(n) er säubert. (She decorates the window of the car that he cleans.)

However, due to the ambiguous German case-marking, if the nouns of both NPs have the same gender, as in sentence [5], both the far and the near attachment are possible. Furthermore, the verb is semantically congruent with both the NP and the PP. Correct reference resolution can then not be achieved on the basis of linguistic information alone. Having access to visual information, on the other hand, eliminates the competing interpretations and favors a single one, assuming that there is no ambiguity in the visual modality (see Figures 1 and 2).

:[5] Sie schmückt das Fenster(n) des Zimmers(n), das(n) er säubert. (She decorates the window of the room that he cleans.)

Figure 1: First interpretation of the syntactically ambiguous sentence [5], near attachment of the relative clause: syntactic gold-standard annotation and visual scene.

Figure 2: Second interpretation of the syntactically ambiguous sentence [5], far attachment of the relative clause: syntactic gold-standard annotation and visual scene.

'''Fully Ambiguous Sentence Structures'''

[1] RPA (Relative Pronoun Agreement) with a Genitive NP
:Sie schmückt das Fenster(n) des Zimmers(n), das er säubert.
:She decorates the window of the room that he cleans.
:Int.1 (Int. = interpretation): He cleans the room (near attachment).
:Int.2: He cleans the window (far attachment).

[2] RPA Scope Ambiguities
:Ich sehe Äpfel(pl) und Bananen(pl), die(pl) auf dem Tisch liegen.
:I see apples and bananas that lie on the table.
:Int.1: Both the apples and the bananas are on the table.
:Int.2: Only the bananas are on the table.

[3] RPA with a Dative PP
:Da befindet sich ein Becher(m) auf einem Tisch(m), den(m) sie beschädigt.
:There is a mug on a table, which she damages.
:Int.1: She damages the table (near attachment).
:Int.2: She damages the mug (far attachment).

[4] RPA with an Agent/Patient Ambiguity
:Da ist eine Japanerin(f), die(f, RP-nom/acc) die Putzfrau(f) soeben attackiert.
:There is a Japanese woman who(m) the cleaning lady attacks.
:Int.1: The cleaning lady attacks the Japanese woman.
:Int.2: The Japanese woman attacks the cleaning lady.

[5] Negative Scope Ambiguities
:Die Sängerin kauft die Jacke nicht, weil sie rot ist.
:The singer does not buy the jacket because it is red.
:Int.1: The singer does not buy the jacket because of its color.
:Int.2: The singer actually buys the jacket, but not because it is red.

All of the fully ambiguous sentence structures presented above (except the negative scope sentences) can also be transformed into temporarily ambiguous sentence structures by replacing the noun in either of the NPs (or PPs) with another noun whose article has a different gender. Below, three additional types of temporary ambiguity are presented, which are convenient for investigating how and when structural prediction mechanisms are employed during the parsing process.

'''Temporarily Ambiguous Sentence Structures'''

[6] Agent-Patient Agreement (following the data-set designed by Knoeferle (2005))
:* Die Arbeiterin kostümiert mal eben den jungen Mann. (The worker(f) just dresses up the young man(m).)
:* Die Arbeiterin verköstigt mal eben der Astronaut. (The worker(f) is just fed by the astronaut(m); the original German sentence is in active voice with OVS word order.)

[7] Verb-Subject Agreement
:* Die Sänger waschen den Arzt. (The singers wash the doctor(m).)
:* Die Sänger wäscht der Offizier. (The singers are washed by the officer(m); active voice, OVS word order in German.)

[8] Conjunction Scope Ambiguities
:* Die Sängerin bemalt den Offizier und die Ärztin. (The singer(f) paints the officer(m) and the doctor(f).)
:* Die Sängerin bemalt den Offizier und die Ärztin wäscht den Radfahrer. (The singer(f) paints the officer(m), and the doctor(f) washes the cyclist(m).)
:* Die Sängerin bemalt den Offizier und die Ärztin besprüht der Radfahrer. (The singer(f) paints the officer(m), and the doctor(f) is sprayed by the cyclist(m); active voice, OVS word order in German.)
====2.1 Image Construction and Visual Complexity====

Besides the effect of linguistic complexity, the data-set was designed for the investigation of the following research questions: how, when and to what degree does visual complexity affect sentence comprehension, and are visual cues in such complex linguistic cases still strong enough to support the correct interpretation?

The 2D visual scenes were created with the SketchUp Make software (http://www.sketchup.com/, retrieved on 03.08.2016), and all 3D objects were exported from the original SketchUp 3D Warehouse. The images were set to a resolution of 1250 x 840 pixels. Moreover, the target objects and agents are located in different parts of the visual scene for each stimulus.

It should be noted that the computational model itself does not need the visual depictions; their semantic representations are sufficient. The visual depictions, however, are crucial for conducting comparable experimental studies with human subjects. Furthermore, the automatic extraction of semantic roles from the images is another task we are aiming at. For these reasons, not only the semantic representations but also the images themselves are an integral part of our data-set.

The following figures illustrate how complexity is systematically controlled for one of the cases in the data-set, namely Agent-Patient agreement. In the original case, each scenario contains three characters (one Patient, one Agent and one ambiguous Agent/Patient character) and two possible actions, while each sentence addresses only one action and two characters (see the sentences in [6]). For each scenario, four different complexity levels were designed. In the first condition, the visual scene contains three characters in an environment without additional background objects (see Figure 3). This set-up resembles Knoeferle's (2005) images and provides a baseline for comparing our results with previous research. The images in the second condition also contain three characters, but in an environment with non-interacting distractor objects (see Figure 4). In the last two conditions, a fourth character in an Agent role, who acts on the ambiguous character, is added to the scene. While the images in the third condition do not contain additional objects, the images in the fourth condition show a cluttered environment as in condition 2 (see Figures 5 and 6). It should be noted that the background objects and the fourth character do not have any semantic association with the actions mentioned in the sentences. Moreover, the visual complexities can be further diversified, e.g. by adding another Patient character to the scene or by adding semantically congruent distractor objects.

Figure 3: 3 agents in an environment with no background objects; a Patient (a young boy on the left), an Agent (an astronaut on the right) and an ambiguous Agent/Patient character (a female worker in the middle).

Figure 4: 3 agents in an environment with background objects.

Figure 5: 4 agents in an environment with no background objects.

Figure 6: 4 agents in an environment with background objects.

====2.2 Semantic Annotations====

The objects, characters and actions in the images were annotated manually with respect to their semantic roles, similar to McCrae's approach (McCrae, 2010); see also Mayberry et al. (2006). Semantic roles are used to establish a relation between the semantic and the syntactic level, an important part of modeling the cross-modal interaction. Semantic roles are linguistic abstractions for distinguishing and classifying the different functions of the participants of an action in an utterance; in other words, they are a useful tool to specify "who did what to whom". The most common set of semantic roles includes Agent, Theme, Patient, Instrument, Location, Goal and Path. Figure 7 shows an exemplary semantic annotation for the visual scene displayed in Figure 1: "Sie" is the Agent, who performs the decorating action, and "das Fenster" is the Patient, the entity undergoing a change of state caused by the action.

Figure 7: One exemplary semantic annotation for the visual scene shown in Figure 1.

{| class="wikitable"
! !! Ambiguity Type !! Template !! # of unique items !! # of samples
|-
| 1 || RPA with a Genitive NP || PRO1nom VP1 NP1acc NP2gen, WDT*acc PRO2nom VP2 || Pro(2), NPacc/gen(48), VP(48) || 24
|-
| 2 || RPA Scope Ambiguities || PROnom VP1 NP1nom,pl. NP2nom,pl., WDTacc,pl. VP2 PP1 || Pro(3), VP(36), NPacc/dat(72) || 24
|-
| 3 || RPA with a Dative PP || NPit-cleft VP1 NP1nom NP2dat, WDTdat PRO3rd-sing. ADV VP2 || NPs(44), VP(23), ADV(24) || 20
|-
| 4 || RPA Ambiguous Gender Case Marking || EX Vaux NP1nom WDTnom NP2acc ADV VP1; EX Vaux NP1nom, WDTacc NP2acc ADV VP1 || NPs(30), VP(20), ADV(12) || 24
|-
| 5 || Negative Scope Ambiguities || NP1nom VP1 NP2acc NEG, Conj. PROnom ADJ VP2 || NPs(6), VP(6), ADJ(12), ADV(6) || 12
|-
| 6 || Agent Patient Agreement (all in 3rd P. Sing.) || NP1nom VP NP2acc; N1acc V N2nom || NPs(37), VP(48), ADV(6) || 48
|-
| 7 || Verb Subject Agreement || NP1nom-3rd Pl. VP3rd Pl. NP2acc-3rd Sing.; NP1acc-3rd Pl. V3rd Sing. NP2nom-3rd Sing. || NPs(3), VP(6), ADV(6) || 12
|-
| 8 || Conjunction Scope Ambiguities (all in 3rd P. Sing.) || NP1nom VP1 NP2acc Conj. NP3acc; NP1nom VP1 NP2acc Conj. NP3nom VP2 NP4acc; NP1nom VP1 NP2acc Conj. NP3acc VP2 NP4nom || NPs(32), VP(27), ADV(6) || 27
|-
| || TOTAL || || || 191
|}

Table 1: POS templates, the number of sentences for each ambiguity case, and the number of unique items in each POS category (WDT* = relative pronoun).

To wrap up, the current version of our multi-modal data-set in German, constructed with the aim of studying disambiguation and structural prediction from both a psycholinguistic and a computational-linguistic perspective, contains the following items for each scenario (the data-set can be accessed at https://gitlab.com/natsCML/SIMBig2017):

* a linguistic form and a sentence in German with its English translation
* gold standard annotations
* possible interpretations
* a target interpretation
* a visual depiction of the target interpretation in four different visual complexities
* a semantic representation of the visual depiction of the target interpretation
* an audio file and a data file with the marked onsets/offsets (in msec) of each linguistic entity in the sentence

===3 Cross-modal Parsing===

As suggested by the literature discussed in Section 1, cross-modal integration facilitates resolving ambiguities and predicting what will be revealed next in an unfolding sentence. However, most state-of-the-art parsing approaches rely solely on the language modality. McCrae (2009) proposed a system for the integration of contextual knowledge into a rule-based syntactic and semantic parser to resolve ambiguities in German, e.g. the genitive-dative ambiguity of feminine nouns or PP attachment ambiguities. Baumgärtner et al. (2012) extended that system with incremental processing capabilities, leading to the only cross-modal and incremental syntactic parser so far. In their study of visually guided natural language processing, Baumgärtner et al. (2012) propose a computational model that successfully integrates visual context to improve the processing of German sentences; semantic information derived from the language input is used to guide the parser to the correct referent in the description of the visual context.

In contrast to those rule-based parsers (McCrae, 2009; Baumgärtner et al., 2012), we employ statistical parsing, with the aim of achieving state-of-the-art results and of developing a language-independent parser. To realize cross-modality, we interface a data-driven parser (RBGParser, Zhang et al. (2014)), which searches for the most plausible disambiguation of a given sentence among all possible dependency trees, with a rule-based component (jwcdg, Beuck et al. (2011)), which evaluates the analyses produced by RBG with respect to the visual knowledge. This contextual information guides the parsing process and narrows down the hypotheses towards the most plausible representation for a given sentence.

Another possible approach would have been to train a parser on combined linguistic and visual features (Salama and Menzel, 2016). However, due to the lack of data to train such a parser, RBG does not process the contextual information directly in our approach. Instead, we embed a constraint-based component that is able to evaluate a dependency tree based on symbolic knowledge, i.e. the semantic role annotations. jwcdg is used to link the semantic roles with which the visual scenes are annotated to the syntactic level of RBG. For example, the Agent of an active sentence is supposed to be its Subject. Instead of developing a full grammar covering the relations between every semantic role and the syntactic level, our grammar covers the cases relevant to our test data and has been developed for German only. It will, however, be extended to further cases during the remainder of this project, and we also plan to extend it to English, Turkish and Chinese. To the best of our knowledge, no comparable system for cross-modal broad-coverage syntactic parsing exists yet. Since our aim here is to introduce the corpus of fully or temporarily ambiguous sentences in German, further technical aspects of the current parser are left out of scope.

====3.1 A Test Run====

This section presents the results of our proof-of-concept test run, in which the performance of the cross-modal parser was compared with the performance of the original RBG model in order to see whether the contextual information improves the parsing results. The task for the computational model in this test run is to assign the thematic roles correctly with respect to the visual depiction of the event. The disambiguation task was therefore performed by the cross-modal parser on the fully ambiguous sentences (see Table 1, Types 1-4, 108 sentences in total). For each sentence, the corresponding visual stimulus was manually annotated as described in Subsection 2.2.

The RBG models had been trained on the first 100k sentences of part A of the Hamburg Dependency Treebank (HDT) (Foth et al., 2014), a German corpus that is freely available for research purposes. All sentences, which come from the German news website Heise Online (https://www.heise.de), are manually dependency-annotated. TurboTagger, which is distributed together with TurboParser (Martins et al., 2013), was used to predict the PoS tags, from the tag set of Schiller et al. (1995), instead of using the gold-standard ones.

The RPA-Genitive case (Type 1) comprises 24 sentences. In one half, the relative clause is attached to the first NP (far attachment); in the other half, it is attached to the genitive object (near attachment). The original RBG was unable to attach the relative clause correctly in any of the 12 far-attachment cases, while there was no wrong attachment for the near attachments, as expected given the respective statistical distribution in the training data. In contrast, our cross-modal parser attached all relative clauses correctly by utilizing the external contextual information.

RPA-Scope (Type 2) also consists of 24 sentences; in one half, the relative clause attaches to both NPs (wide scope), while in the other half it attaches only to the closest NP (narrow scope). A similar pattern as in the previous case was observed in the parsing results: while the original RBG was not able to make any correct attachment for the wide-scope cases, our model correctly attached all relative clauses.

The RPA-Dative set (Type 3) contains 20 sentences, one half far-attached and the other half near-attached. The previous pattern was observed once again: while the original RBG was blind to far attachments, our parser was able to disambiguate the sentences by using the external cues.

In the RC-gender case, RBG attached all Agents and Patients correctly, but with wrong syntactic labels in 20 out of 40 cases. Our cross-modal parser improved on this result by labeling only 10 Agents/Patients wrongly. This performance is expected to improve further with a fine-tuning of the semantic annotations employed during the parsing operations.

===4 Discussion===

Knowing which linguistic entity resolves the ambiguities for humans under different ambiguity and complexity conditions gives us valuable information about the underlying mechanisms of language-vision interaction in a situated setting, enabling us to improve a psycholinguistically plausible parser. Designing such a parser, however, requires not only an understanding of two endeavors, namely the cognitive aspects of language processing and the technical aspects of parsing technology, but also a carefully and systematically designed multi-modal data-set that contains very challenging garden-path (fully or temporarily ambiguous) cases for both areas. This paper addresses this bridging component.

Here we introduced a multi-modal data-set of ambiguous German sentences addressing 8 different linguistic and four different visual complexities. Furthermore, the contribution of external information to parsing operations was shown in a proof-of-concept study. Further studies will address the comparison between the performance of human subjects and of the computational model on both the disambiguation and the structural prediction tasks, for the entire data-set.

===Acknowledgments===

This research was funded by the German Research Foundation (DFG) in the project "Crossmodal Learning", TRR-169.

===References===

* Gerry T. M. Altmann and Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition 73(3):247–264.
* Christopher Baumgärtner, Niels Beuck, and Wolfgang Menzel. 2012. An architecture for incremental information fusion of cross-modal representations. In Multisensor Fusion and Integration for Intelligent Systems (MFI), 2012 IEEE Conference on. IEEE, Hamburg, Germany, pages 498–503.
* Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. 2016. Do you see what I mean? Visual resolution of linguistic ambiguities. arXiv preprint arXiv:1603.08079.
* Niels Beuck, Arne Köhn, and Wolfgang Menzel. 2011. Incremental parsing and the evaluation of partial dependency analyses. In DepLing 2011, Proceedings of the 1st International Conference on Dependency Linguistics.
* Moreno I. Coco and Frank Keller. 2015. The interaction of visual and linguistic saliency during syntactic ambiguity resolution. The Quarterly Journal of Experimental Psychology 68(1):46–74.
* Kilian A. Foth, Arne Köhn, Niels Beuck, and Wolfgang Menzel. 2014. Because size does matter: The Hamburg Dependency Treebank. In Proceedings of the Language Resources and Evaluation Conference 2014. European Language Resources Association (ELRA), Reykjavik, Iceland.
* Pia Stefanie Knoeferle. 2005. The role of visual scenes in spoken language comprehension: Evidence from eye-tracking. Ph.D. thesis, Universitätsbibliothek.
* André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 617–622.
* M. R. Mayberry, Matthew W. Crocker, and Pia Knoeferle. 2006. A connectionist model of the coordinated interplay of scene, utterance, and world knowledge. In Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 567–572.
* Patrick McCrae. 2009. A model for the cross-modal influence of visual context upon language processing. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets, Bulgaria, pages 230–235.
* Patrick McCrae. 2010. A computational model for the influence of cross-modal context upon syntactic parsing.
* Ken McRae, Mary Hare, Todd Ferretti, and Jeffrey L. Elman. 2001. Activating verbs from typical agents, patients, instruments, and locations via event schemas. In Proceedings of the Twenty-Third Annual Conference of the Cognitive Science Society. Erlbaum, Mahwah, NJ, pages 617–622.
* Amr Rekaby Salama and Wolfgang Menzel. 2016. Multimodal graph-based dependency parsing of natural language. In International Conference on Advanced Intelligent Systems and Informatics. Springer International Publishing, pages 22–31.
* Anne Schiller, Simone Teufel, and Christine Thielen. 1995. Guidelines für das Tagging deutscher Textcorpora mit STTS. Universität Stuttgart und Universität Tübingen.
* Michael K. Tanenhaus, Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science 268(5217):1632.
* Jos J. A. Van Berkum, Colin M. Brown, Pienie Zwitserlood, Valesca Kooijman, and Peter Hagoort. 2005. Anticipating upcoming words in discourse: Evidence from ERPs and reading times. Journal of Experimental Psychology: Learning, Memory, and Cognition 31(3):443.
* Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola, and Amir Globerson. 2014. Steps to excellence: Simple inference with refined scoring of dependency trees. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland, pages 197–207.