=Paper= {{Paper |id=Vol-2029/paper1 |storemode=property |title=Towards a Systematic Analysis of Linguistic and Visual Complexity in Disambiguation and Structural Prediction |pdfUrl=https://ceur-ws.org/Vol-2029/paper1.pdf |volume=Vol-2029 |authors=Özge Alaçam,Tobias Staron, Wolfgang Menzel |dblpUrl=https://dblp.org/rec/conf/simbig/AlacamSM17 }} ==Towards a Systematic Analysis of Linguistic and Visual Complexity in Disambiguation and Structural Prediction== https://ceur-ws.org/Vol-2029/paper1.pdf
    Towards a Systematic Analysis of Linguistic and Visual Complexity
              in Disambiguation and Structural Prediction

                  Özge Alaçam , Tobias Staron and Wolfgang Menzel
                               Department of Informatics
                                  University of Hamburg
             alacam, staron, menzel@informatik.uni-hamburg.de



                                        Abstract

    Situated language processing in humans involves the interaction of linguistic and visual
    processing, and this cross-modal integration helps resolve ambiguities and predict what
    will be revealed next in an unfolding sentence. However, most state-of-the-art parsing
    approaches rely solely on the language modality. This paper introduces a new multi-modal
    data-set (containing sentences and respective images and audio files) addressing
    challenging linguistic and visual complexities, which state-of-the-art parsers should be
    able to cope with. It also briefly addresses a proof-of-concept study that shows the
    contribution of employing external visual information during disambiguation.

1   Disambiguation and Structural Predictions

A better understanding of human perceptual and comprehension processes concerning multi-modal
environments is one of the crucial factors for realizing dynamic human-computer interaction. A
large body of empirical evidence in psycholinguistics suggests that human language processing
successfully integrates available information acquired from different modalities in order to
resolve linguistic ambiguities (i.e. syntactic, semantic or discourse) and predict what will be
revealed next in the unfolding sentence (Tanenhaus et al., 1995; Altmann and Kamide, 1999;
Knoeferle, 2005). During spoken communication, online disambiguation and prediction processes
allow us to have more accurate and fluent conversations. In contrast, state-of-the-art parsing
algorithms are still far away from that accuracy and fluency when it comes to challenging
linguistic or visual situations. Therefore, by developing a cross-modal parser to exploit visual
knowledge, we expect to enhance syntactic disambiguation, e.g. concerning relative clause
attachments and various scope ambiguities.

   One of the most frequently investigated syntactic ambiguity cases is the prepositional phrase
(PP) attachment ambiguity, where different semantic interpretations are possible depending on
the assignment of different thematic roles (Tanenhaus et al., 1995). A well-known example is the
imperative sentence “put the apple on the towel in the box”, where the PP “on the towel” can be
interpreted as a modifier of the apple (as the location of the apple), as marked in 1 below, or
as the goal location, as in 2.

   [1] put [the apple on the towel]obj [in the box]goal

   [2] put [the apple]obj [on the towel in the box]goal

The re-analysis of the interpretation during online language comprehension is termed a
garden-path effect. In a multi-modal setting where the scene contains an empty towel or an apple
on a towel, the visual information constrains the referential choices as well as the possible
interpretations, helping the disambiguation process.

   The study by Tanenhaus and his colleagues (1995) showed that visual information influences
incremental thematic role disambiguation by narrowing down the possible interpretations. Further
evidence that supports this conclusion was provided by Knoeferle (2005), who addressed relatively
more complex scenes containing more agents and relations, for both English and German. The
results also indicated that this influence occurs independently of the experiment language.
Furthermore, Altmann and Kamide's (1999) study has documented that listeners are able to predict
complements of a verb based on its selectional constraints. For example, when people hear the
verb 'break', their attention is directed only towards breakable objects in the scene. Some nouns
may also produce expectations for certain semantic classes of verbs by activating so-called event
schema knowledge (McRae et al., 2001). Besides verbs and nouns, Van Berkum et al.'s (2005) study
also showed the effect of syntactic gender cues for Dutch in the anticipation of upcoming words.
Similar to German, pre-nominal adjectives as well as nouns are gender-marked in Dutch, and the
gender of the adjective has to agree with the gender of the noun. Their results showed that the
human language processing system uses the gender cue, when it becomes available, to predict the
target object if its gender differs from the gender of the other objects in the environment. They
interpreted this as evidence for the incremental nature of the human language system, which can
predict the upcoming words and immediately begin incremental parsing operations. In a more recent
work, Coco and Keller (2015) investigated the language-vision interaction and how it influences
the interpretation of syntactically ambiguous sentences in a simple but real-world setting. Their
study provided further evidence that visual and linguistic information influence the
interpretation of a sentence at different points during online processing. The aforementioned
empirical studies provided insights regarding psycholinguistically plausible parsing. However,
those studies were limited to simple (written) linguistic or visual stimuli where object-action
relations could be predicted relatively easily.

   Based on the prior research, our project also focuses on studying the underlying mechanisms of
human cross-modal language processing of incrementally revealed utterances with accompanying
visual scenes, with the aim of using the empirically gained insights to develop a
psycholinguistically plausible cross-modal and incremental syntactic parser which can be
implemented e.g. on a service robot. A parser that processes only linguistic information is
expected to be able to successfully handle syntactically unambiguous cases by using linguistic
constraints or statistical methods. However, without external information from the visual
modality, neither humans nor parsers can resolve references in syntactically ambiguous cases.
They may have preferences, but the accuracy of the preferences is bounded by chance. On the other
hand, humans naturally use external information from other modalities for disambiguation when
available. Incorporating this feature, cross-modal parsers may also resolve those ambiguities and
reach correct interpretations of the visually depicted events. Therefore, a better understanding
of human language processing concerning cross-modal environments is one of the crucial factors in
the realization of dynamic human-computer interaction. Furthermore, comparing the performance of
the computational model with human performance (e.g. whether ambiguities were resolved correctly,
at which point of a spoken utterance a correct resolution was achieved, how many changes were made
before reaching the correct thematic role assignment) also provides valuable information about the
plausibility and the effectiveness of the proposed parsing architecture. Constructing a data-set
that contains challenging linguistic and visual cases and complex multi-modal settings, where
state-of-the-art parsers often fail, is fundamental towards achieving this ultimate goal. In this
paper, we aim to introduce a multi-modal data-set consisting of garden-path (fully/temporally
syntactically ambiguous) sentences.

   This paper is structured as follows. In Section 2, a data-set of ambiguous German sentences and
their multi-modal representations are presented. A brief description of our cross-modal parser is
presented in Section 3. Section 3 also addresses a test run conducted on fully ambiguous sentence
structures. Section 4 summarizes the results of this work and draws conclusions.

2   Linguistic and Visual Complexities

Recently, a corpus of language and vision ambiguities (LAVA) in English has been released (Berzak
et al., 2016). The LAVA corpus contains 237 sentences with linguistic ambiguities that can only be
disambiguated using external visual information provided as short videos or static visual images
with real world complexity. It addresses a wide range of syntactic ambiguities including
prepositional phrase or verb phrase attachments and ambiguities in the interpretation of
conjunctions. However, this corpus does not take linguistically challenging cases like relative
clause attachments
or scope ambiguities, which may also give valuable insights into the underlying mechanisms of
cross-modal interactions, into account. To our knowledge, the reference resolution concerning
these linguistic cases and the effect of linguistic complexity in visually disambiguated
situations have been scarcely investigated. Our multi-modal data-set consists of challenging
linguistic cases in German (itemized below), which become fully unambiguous in the presence of
visual stimuli. Our main question from the psycholinguistic point of view is whether the presence
of linguistic ambiguity and the linguistic complexity affect the processing of multi-modal
stimuli. On the other hand, from the computational perspective, we focus on whether and to what
extent visual information is useful for the disambiguation and structural prediction processes in
order to develop more fluent and accurate computational parsing.

   German has three grammatical genders, i.e. each noun is either feminine(f), masculine(m), or
neuter(n). In a sentence that contains a relative clause attachment, the gender of the relative
pronoun has to be the same as the gender of its antecedent. Sentence [3] illustrates an example,
which contains a relative clause licensing the NP.

[3] Sie schmückt das Fenster(n), das(n) er säubert. (She decorates the window that he cleans.)

   In Sentence [4], the NP is modified by an additional NP, i.e. a genitive object. In this case,
since the gender of the relative pronoun matches only the first NP, it is clear that the window is
being cleaned, not the car. However, due to ambiguous German case-marking, if the genders of the
nouns of both NPs are the same, as in sentence [5], both far and near attachments are possible.
Furthermore, the verb is semantically congruent with both NP and PP as well. Correct reference
resolution cannot be achieved based on linguistic information alone. On the other hand, having
access to visual information eliminates the other interpretations and favors only one, assuming
there is no ambiguity in the visual modality (see Figures 1 and 2).

[4] Sie schmückt das Fenster(n) des Wagens(m), das(n) er säubert. (She decorates the window of
    the car that he cleans.)

[5] Sie schmückt das Fenster(n) des Zimmers(n), das(n) er säubert. (She decorates the window of
    the room that he cleans.)

   Our data-set currently consists of 191 sentences¹ and addresses 8 linguistically challenging
cases concerning relative clause attachments, agent/patient agreement, verb/subject agreement,
and scope ambiguities for conjunctions and negations. The sentence sets for each structure are
generated by using the part-of-speech templates given in Table 1. Parsers often have problems with
correct reference resolution for such linguistic expressions because they usually attach the
relative clause to the nearest option with respect to statistical distributions in their training
data or explicitly stated rules.

   Knoeferle's (2005) sentence set was used as a baseline since the co-occurrence frequencies
between the action and the Agent in the sentence, as well as between the action and the Patient,
were controlled to single out the effect of semantic associations or preferences during parsing
operations. For a syntactic parser, this may seem irrelevant; however, in order to develop a
comparable experimental setup for human comprehension, this parameter needs to be taken into
account.

       Fully Ambiguous Sentence Structures

[1] RPA² - a Genitive NP
    Sie schmückt das Fenster(n) des Zimmers(n), das er säubert.
    She decorates the window of the room that he cleans.
    Int.1³: He cleans the room (near-attachment).
    Int.2: He cleans the window (far-attachment).

[2] RPA - Scope Ambiguities
    Ich sehe Äpfel(pl) und Bananen(pl), die(pl) auf dem Tisch liegen.
    I see apples and bananas that lie on the table.
    Int.1: Both apples and bananas are on the table.
    Int.2: Only bananas are on the table.

[3] RPA - a Dative PP
    Da befindet sich ein Becher(m) auf einem Tisch(m), den(m) sie beschädigt.
    It is the mug on the table that she damages.
    Int.1: She damages the table (near-attachment).
    Int.2: She damages the mug (far-attachment).

   ¹ The short-term goal is to increase the sample size to 450 sentences.
   ² Relative Pronoun Agreement
   ³ Int.=Interpretation
 [Figure 1: dependency tree and visual scene not reproduced. Edge labels appearing in the tree:
 S, SUBJ, OBJA, DET, GMOD, DET, REL, OBJA, SUBJ.]

      Sie schmückt das Fenster des Zimmers, das er säubert.


 Figure 1: First interpretation of syntactically ambiguous sentence [5]: near attachment of relative clause
 - syntactic gold standard annotation and visual scene.




 [Figure 2: dependency tree and visual scene not reproduced. Edge labels appearing in the tree:
 S, SUBJ, OBJA, DET, REL, GMOD, DET, OBJA, SUBJ.]

      Sie schmückt das Fenster des Zimmers, das er säubert.


 Figure 2: Second interpretation of syntactically ambiguous sentence [5]: far attachment of relative clause
 - syntactic gold standard annotation and visual scene.
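
The gender-agreement constraint behind sentences [3] to [5], and the two attachments contrasted in Figures 1 and 2, can be illustrated with a minimal Python sketch. It is not the parser described in this paper: the `Noun` class, the `resolve` function and the `visual_patient` argument are hypothetical names introduced here, and the sketch checks gender only, ignoring number and case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Noun:
    form: str    # surface form as it appears in the sentence
    gender: str  # "f", "m" or "n"


def attachment_candidates(nouns: List[Noun], pronoun_gender: str) -> List[Noun]:
    """Return the nouns whose gender agrees with the relative pronoun."""
    return [n for n in nouns if n.gender == pronoun_gender]


def resolve(nouns: List[Noun], pronoun_gender: str,
            visual_patient: Optional[str] = None) -> Optional[Noun]:
    """Pick the antecedent of the relative clause.

    Linguistic information decides when exactly one noun agrees in gender.
    Otherwise the (assumed) visual annotation names the entity that is
    acted upon; without it the sentence stays ambiguous and None is returned.
    """
    candidates = attachment_candidates(nouns, pronoun_gender)
    if len(candidates) == 1:
        return candidates[0]
    for noun in candidates:
        if noun.form == visual_patient:
            return noun
    return None


# Sentence [4]: only "Fenster" agrees with the neuter pronoun "das".
sentence4 = [Noun("Fenster", "n"), Noun("Wagens", "m")]
# Sentence [5]: both nouns are neuter, so both attachments are possible.
sentence5 = [Noun("Fenster", "n"), Noun("Zimmers", "n")]

print(resolve(sentence4, "n"))
print(resolve(sentence5, "n"))
print(resolve(sentence5, "n", visual_patient="Zimmers"))
```

For sentence [5] the linguistic check alone leaves two candidates; only the extra visual argument selects one of them, which mirrors the role the visual scene plays in the data-set.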


[4] RPA with an agent/patient ambiguity
    Da ist eine Japanerin(f), die(f, RPnom/acc) die Putzfrau(f) soeben attackiert.
    There is a Japanese, who(m) the cleaning lady attacks.
    Int.1: The cleaning lady attacks the Japanese woman.
    Int.2: The Japanese woman attacks the cleaning lady.

[5] Negative Scope Ambiguities
    Die Sängerin kauft die Jacke nicht, weil sie rot ist.
    The singer does not buy the coat because it is red.
    Int.1: The singer does not buy the coat because of its color.
    Int.2: The singer actually buys the coat but not because it is red.

   All the sentence structures for the fully ambiguous set presented above (except negative scope
sentences) can also be transformed into temporally ambiguous sentence structures by replacing the
noun in either of the NPs (or PPs) with another noun whose article has a different gender. Below,
three additional types of temporal ambiguities are presented, which are convenient for
investigating how and when structural prediction mechanisms are employed during the parsing
process.

       Temporally Ambiguous Sentence Structures

[6] Agent-Patient Agreement (following the data-set designed by Knoeferle (2005))

     • Die Arbeiterin kostümiert mal eben den jungen Mann.
       The worker(f) just dresses up the young man(m).
     • Die Arbeiterin verköstigt mal eben der Astronaut.
       The worker(f) is just fed⁴ by the astronaut(m).

[7] Verb-Subject Agreement

     • Die Sänger waschen den Arzt.
       The singers wash the doctor(m).
     • Die Sänger wäscht der Offizier.
       The singers are washed⁴ by the officer(m).

[8] Conjunction Scope Ambiguities

     • Die Sängerin bemalt den Offizier und die Ärztin.
       The singer(f) paints the officer(m) and the doctor(f).
     • Die Sängerin bemalt den Offizier und die Ärztin wäscht den Radfahrer.
       The singer(f) paints the officer(m) and the doctor(f) washes the cyclist(m).
     • Die Sängerin bemalt den Offizier und die Ärztin besprüht der Radfahrer.
       The singer(f) paints the officer(m) and the doctor(f) is sprayed⁴ by the cyclist(m).

   ⁴ The original German sentence is in active voice in OVS word order.

2.1 Image Construction and Visual Complexity

Besides the effect of linguistic complexity, the data-set was designed to be used in the
investigation of the following research questions: how, when and to what degree does visual
complexity affect sentence comprehension, and are visual cues in such a complex linguistic case
still strong enough to enhance correct interpretation?

   The 2D visual scenes were created with the SketchUp Make Software⁵ and all 3D objects were
exported from the original SketchUp 3D Warehouse. The images were set to 1250 x 840 resolution.
Moreover, target objects and agents are located in different parts of the visual scene for each
stimulus. It should be noted that for the computational model, we do not need visual depictions;
their semantic representations are sufficient. However, the visual depictions are crucial to
conduct comparable experimental studies with human subjects. Furthermore, an automatic extraction
of semantic roles from the images is another task that we are aiming for. That is the reason why
not just the semantic representations but the images themselves are an integral part of our
data-set.

   ⁵ http://www.sketchup.com/ - retrieved on 03.08.2016

   The following figures illustrate how complexity is systematically controlled on one of the
cases in the data-set, namely Agent-Patient agreement. In the initial/original case, each scenario
contains three characters (one Patient, one Agent and one ambiguous Agent/Patient character) and
two possible actions. On the other hand, each sentence addresses only one action and two
characters, see the sentences in [6]. For each scenario, four different complexity levels were
designed. In the first condition, a visual scene contains three characters in an environment where
there is no additional background object, see Figure 3. This set-up resembles Knoeferle's (2005)
images and provides a baseline to compare our results with previous research. The images in the
second condition also contain three characters, but in an environment with noninteracting
distractor objects, see Figure 4. In the last two conditions, a fourth character in an Agent role,
who acts on the ambiguous character, is added to the scene. While the images in the third
condition do not have additional objects, the images in the fourth condition are in a cluttered
environment as in condition 2 (see Figure 5 and Figure 6). It should be noted that background
objects and the fourth character do not have any semantic association with the actions mentioned
in the sentences. Besides, visual complexities can be further diversified, e.g. by adding another
patient character to the scene or by adding semantically congruent distractor objects.

Figure 3: 3 agents in an environment with no background objects; a Patient (a young boy on the
left), an Agent (an astronaut on the right) and an ambiguous Agent/Patient character (a female
worker in the middle).

2.2   Semantic Annotations

The objects, characters and actions in the images were annotated manually with respect to their
semantic roles, similar to McCrae's approach (McCrae, 2010), see also Mayberry et al. (2006).
Semantic roles are used to establish a relation between semantic and syntactic levels as an
important part of modeling the cross-modal interaction. Semantic roles are linguistic abstractions
to distinguish and classify the different functions of
  #  Ambiguity type                      Template                                                      # of unique items                # of samples
  1  RPA with a Genitive NP              PRO1nom VP1 NP1acc NP2gen , WDT*acc PRO2nom VP2               Pro(2), NPacc/gen(48), VP(48)    24
  2  RPA Scope Ambiguities               PROnom VP1 NP1nom,pl. NP1nom,pl. , WDT acc,pl. VP2 PP1        Pro(3), VP(36), NPacc/dat(72)    24
  3  RPA with a Dative-PP                NPit cleft VP1 NP1nom NP2dat , WDT dat PRO3rd-sing. ADV VP2   NPs(44), VP(23), ADV(24)         20
  4  RPA Ambiguous Gender Case Marking   EX Vaux NP1nom WDTnom NP2acc ADV VP1                          NPs(30), VP(20), ADV(12)         24
                                         EX Vaux NP1nom , WDTacc NP2acc ADV VP1
  5  Negative Scope Ambiguities          NP1nom VP1 NP2acc NEG, Conj. PROnom ADJ VP2                   NPs(6), VP(6), ADJ(12), ADV(6)   12
  6  Agent Patient Agreement             NP1nom VP NP2acc                                              NPs(37), VP(48), ADV(6)          48
     (all in 3rd P. Sing.)               N1acc V N2nom
  7  Verb Subject Agreement              NP1nom-3rd Pl. VP 3rd Pl. NP2acc-3rd Sing.                    NPs(3), VP(6), ADV(6)            12
                                         NP1acc-3rd Pl. V 3rd Sing. NP2nom-3rd Sing.
  8  Conjunction Scope Ambiguities       NP1nom VP1 NP2acc Conj. NP3acc                                NPs(32), VP(27), ADV(6)          27
     (all in 3rd P. Sing.)               N1nom VP1 NP2acc Conj. NP3nom VP2 NP4acc
                                         NP1nom VP1 NP2acc Conj. NP3acc VP2 NP4nom
     TOTAL                                                                                                                              191

Table 1: POS templates, the number of sentences for each ambiguity case, and the number of unique
items in each POS category (*Relative Pronoun)
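
How a POS template such as the first Agent-Patient template in Table 1 (NP1nom VP NP2acc) can be expanded into sentences may be sketched as follows. The paper does not describe the generation procedure, so the Cartesian-product expansion, the slot names and the `fill_template` helper are assumptions made for illustration; the small lexicon reuses words from the example sentences, and the sketch deliberately ignores the adverb slot and the co-occurrence control discussed in the text.

```python
from itertools import product
from typing import Dict, List

# Hypothetical slot names for the template "NP1nom VP NP2acc".
TEMPLATE = ["NP_nom", "VP", "NP_acc"]

# Illustrative lexicon built from words in the paper's example sentences.
LEXICON: Dict[str, List[str]] = {
    "NP_nom": ["Die Arbeiterin", "Die Sängerin"],
    "VP": ["kostümiert", "bemalt"],
    "NP_acc": ["den jungen Mann", "den Offizier"],
}


def fill_template(template: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Expand a template into every combination of its slot fillers."""
    slots = [lexicon[tag] for tag in template]
    return [" ".join(words) + "." for words in product(*slots)]


sentences = fill_template(TEMPLATE, LEXICON)
for sentence in sentences:
    print(sentence)
```

A full expansion grows multiplicatively with the number of fillers per slot, so a real item set would presumably be filtered afterwards, for instance for semantic plausibility.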




Figure 4: 3 agents in an environment with background objects.

Figure 6: 4 agents in an environment with background objects.

the action in an utterance; in other words, they are a useful tool to specify “who did what to
whom”. The most common set of semantic roles includes Agent, Theme, Patient, Instrument,
Location, Goal and Path. Figure 7 shows one exemplary semantic annotation for the visual scene
displayed in Figure 1. There, “Sie” is the Agent, who performs the decorating action, and “das
Fenster” is the Patient, the entity undergoing a change of state caused by the action.
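
A role annotation of this kind can be pictured as a small set of (action, role, entity) records. The sketch below is only an illustration: the record layout and the `who_did_what_to_whom` helper are hypothetical and do not reproduce the annotation format shown in Figure 7. The first two records follow the description above; the two records for the cleaning action are inferred from the near-attachment reading of Figure 1.

```python
from typing import List, NamedTuple, Optional, Tuple


class RoleAssignment(NamedTuple):
    action: str  # lemma of the depicted action
    role: str    # semantic role, e.g. "Agent" or "Patient"
    entity: str  # the character or object filling the role


# Assumed annotation of the scene in Figure 1 (near attachment).
scene: List[RoleAssignment] = [
    RoleAssignment("schmücken", "Agent", "Sie"),
    RoleAssignment("schmücken", "Patient", "das Fenster"),
    RoleAssignment("säubern", "Agent", "er"),
    RoleAssignment("säubern", "Patient", "das Zimmer"),
]


def who_did_what_to_whom(
    annotations: List[RoleAssignment], action: str
) -> Tuple[Optional[str], str, Optional[str]]:
    """Return (Agent, action, Patient) for one action in the scene."""
    roles = {a.role: a.entity for a in annotations if a.action == action}
    return roles.get("Agent"), action, roles.get("Patient")


print(who_did_what_to_whom(scene, "schmücken"))
print(who_did_what_to_whom(scene, "säubern"))
```

For the far-attachment scene of Figure 2, only the Patient of the cleaning action would change, to "das Fenster".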
Figure 5: 4 agents in an environment with no background objects.

Figure 7: One exemplary semantic annotation for the visual scene shown in Figure 1.

   To wrap up, the current version of our multi-modal data-set in German, which we constructed
with the aim of studying disambiguation and structural prediction from both psycholinguistic and
computational linguistic perspectives, contains the following items for each scenario in the
data-set⁶

 • a linguistic form and a sentence in German with its English translation
 • gold standard annotations
 • possible interpretations
 • a target interpretation
 • a visual depiction of the target interpretation in four different visual complexities

guide the parser to find the correct referent in the description of visual context.

   However, in contrast to those rule-based parsers (McCrae, 2009; Baumgärtner et al., 2012), we
employ statistical parsing with the aim of achieving state-of-the-art results and developing a
language-independent parser. To realize cross-modality, we interface the data-driven parser
(RBGParser, Zhang et al. (2014)), which is utilized to search for the most plausible
disambiguation of a given sentence among all possible dependency trees, with a rule-based
component (jwcdg, Beuck et al. (2011)), which evaluates possible analyses produced by RBG with
respect to the visual knowledge. This contextual information guides the parsing process and
narrows down the hypotheses towards the most plausible representation for a given sentence.

   Another approach that could have been used is to train a parser on combined linguistic and
visual features (Salama and Menzel, 2016). How-
                                                             ever, due to lack of available data to train the
 • a semantic representation of the visual depic-            parser with, RBG is not dedicated to process the
   tion of the target interpretation                         contextual information in our approach. Instead,
 • an audio file and a data file with marked on-             we embed a constraint-based component that is
   set/offsets (in msec.) of each linguistic entities        able to evaluate a dependency tree based on sym-
   in the sentence                                           bolic knowledge, i. e. the semantic role annota-
                                                             tions. jwcdg is utilized to link the semantic roles
3   Cross-modal Parsing                                      that the visual scenes are annotated with and the
                                                             syntactic level of RBG. For example, the Agent
As suggested by the literature mentioned in Sec-
                                                             of an active sentence is supposed to be its Sub-
tion 1, cross-modal integration facilitates to re-
                                                             ject. Instead of developing a full grammar that
solve ambiguities and predict what will be re-
                                                             covers all relations between every semantic role
vealed next in an unfolding sentence. How-
                                                             and the syntactic level, our grammar covers the
ever, most state-of-the-art parsing approaches rely
                                                             cases relevant with respect to our test data and has
solely on the language modality. McCrae (2009)
                                                             been developed for German only. But, our gram-
proposed a system for the integration of contex-
                                                             mar will be extended to further cases during the
tual knowledge into a rule-based syntactic and
                                                             remainder of this project. Also, we plan to ex-
semantic parser to resolve ambiguities in Ger-
                                                             tended it to English, Turkish and Chinese. To the
man, e.g. Genitive-Dative ambiguity of feminine
                                                             best of our knowledge, there exists no comparable
nouns or PP attachment ambiguities. Baumgärtner
                                                             system for cross-modal broad-coverage syntactic
et al. (2012) extended that system by adding incre-
                                                             parsing yet. Since we aim to introduce the corpus
mental processing capabilities leading to the only
                                                             of fully/temporally ambiguous sentences in Ger-
cross-modal and incremental syntactic parser so
                                                             man, more technical aspects of the current parser
far. In their study of visually guided natural lan-
                                                             have been left out of scope here.
guage processing, Baumgärtner et al. (2012) pro-
pose a computational model that successfully in-
tegrates visual context to improve the processing            3.1   A Test Run
of sentences of German, and semantic informa-                This section presents the results of our proof-of-
tion derived from language input that is used to             concept test run, where the performance of our
  6
    The data-set can be accessed from https://               developed cross-modal parser has been tested and
gitlab.com/natsCML/SIMBig2017                                compared with the performance of the original



                                                        44
RBG model in order to see whether the contextual information improves parsing
results.
   Our task for the computational model in this test run is to assign thematic
roles correctly with respect to the visual depiction of the event. Therefore,
the disambiguation task was performed by the cross-modal parser on fully
ambiguous sentences (see Table 1, Type [1 - 4], 108 sentences in total). For
each sentence, the corresponding visual stimulus has been manually annotated as
described in Subsection 2.2.
   The RBG models had been trained on the first 100k sentences of the Hamburg
Dependency Treebank (HDT) (Foth et al., 2014) part A, a German corpus that is
freely available for research purposes. All sentences, which are from the
German news website Heise Online7, are manually dependency-annotated.
TurboTagger8 is used to predict the PoS tags, from the tag set of Schiller et
al. (1995), instead of using the gold standard ones.

   7 https://www.heise.de
   8 TurboTagger is distributed together with TurboParser (Martins et al., 2013)

   The RPA-Genitive (Type [1]) case involves 24 sentences. In one half, the
relative clause is attached to the first NP (far attachment), and in the other
half it is attached to the genitive object (near attachment). The original RBG
attached the relative clause incorrectly in all 12 far-attachment cases, while
there was no wrong attachment among the near attachments, as expected given the
respective statistical distribution in the training data. In contrast, our
cross-modal parser was able to attach all relative clauses correctly by
utilizing external contextual information.
   The RPA-Scope (Type [2]) case also consists of 24 sentences; in one half,
the relative clause is attached to both NPs (wide scope), while in the rest it
is attached only to the closest NP (narrow scope). The parsing results showed a
pattern similar to the previous case. While the original RBG was not able to
make any correct attachment for the wide-scope cases, our model correctly
attached all relative clauses.
   The RPA-Dative (Type [3]) set contains 20 sentences; one half is
far-attached and the other half is near-attached. The previous pattern was
again observed in this case. While the original RBG was blind to far
attachments, our parser was able to disambiguate the sentences by using
external cues.
   In the case of RC-gender, RBG attached all agents and patients correctly but
with wrong syntactic labels in 20 out of 40 cases. Our cross-modal parser
improved those results by labeling only 10 agents/patients wrongly. Performance
on this case is expected to improve with fine-tuning of the semantic
annotations employed during parsing operations.

4   Discussion
Knowing which linguistic entity resolves the ambiguities for humans under
different ambiguity and complexity conditions gives us valuable information
about the underlying mechanism of language-vision interaction in a situated
setting, enabling us to improve a psycho-linguistically plausible parser.
However, designing such a parser requires, in addition to an understanding of
two endeavors, namely the cognitive aspects of language processing and the
technical aspects of parsing technology, a carefully designed multi-modal
data-set that systematically covers very challenging garden-path (fully or
temporally ambiguous) cases for both areas. This paper addresses this bridging
component.
   Here we introduce a multi-modal set6 for ambiguous German sentences
addressing eight different linguistic and four different visual complexities.
Furthermore, the contribution of the external information in parsing operations
was shown by a proof-of-concept study. Further studies will address the
comparison between the performance of human subjects and the computational
model on both disambiguation and structural prediction tasks concerning the
entire data-set.

Acknowledgments

This research was funded by the German Research Foundation (DFG) in project
“Crossmodal Learning”, TRR-169.

References

Gerry TM Altmann and Yuki Kamide. 1999. Incremental interpretation at verbs:
  Restricting the domain of subsequent reference. Cognition 73(3):247–264.

Christopher Baumgärtner, Niels Beuck, and Wolfgang Menzel. 2012. An
  architecture for incremental information fusion of cross-modal
  representations. In Multisensor Fusion and Integration for Intelligent
  Systems (MFI), 2012 IEEE Conference on. IEEE, Hamburg, Germany, pages
  498–503.




Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman.
  2016. Do you see what I mean? Visual resolution of linguistic ambiguities.
  arXiv preprint arXiv:1603.08079.

Niels Beuck, Arne Köhn, and Wolfgang Menzel. 2011. Incremental parsing and the
  evaluation of partial dependency analyses. In DepLing 2011, Proceedings of
  the 1st International Conference on Dependency Linguistics.

Moreno I Coco and Frank Keller. 2015. The interaction of visual and linguistic
  saliency during syntactic ambiguity resolution. The Quarterly Journal of
  Experimental Psychology 68(1):46–74.

Kilian A. Foth, Arne Köhn, Niels Beuck, and Wolfgang Menzel. 2014. Because size
  does matter: The Hamburg Dependency Treebank. In Nicoletta Calzolari
  (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente
  Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis,
  editors, Proceedings of the Language Resources and Evaluation Conference
  2014. LREC, European Language Resources Association (ELRA), Reykjavik,
  Iceland.

Pia Stefanie Knoeferle. 2005. The role of visual scenes in spoken language
  comprehension: Evidence from eye-tracking. Ph.D. thesis,
  Universitätsbibliothek.

André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the
  turbo: Fast third-order non-projective turbo parsers. In Proceedings of the
  Annual Meeting of the Association for Computational Linguistics (ACL). pages
  617–622.

MR Mayberry, Matthew W Crocker, and Pia Knoeferle. 2006. A connectionist model
  of the coordinated interplay of scene, utterance, and world knowledge. In
  Proceedings of the 28th annual conference of the Cognitive Science Society.
  pages 567–572.

Patrick McCrae. 2009. A model for the cross-modal influence of visual context
  upon language processing. In Proceedings of the International Conference
  Recent Advances in Natural Language Processing (RANLP 2009). Borovets,
  Bulgaria, pages 230–235.

Patrick McCrae. 2010. A computational model for the influence of cross-modal
  context upon syntactic parsing.

Ken McRae, Mary Hare, Todd Ferretti, and Jeffrey L Elman. 2001. Activating
  verbs from typical agents, patients, instruments, and locations via event
  schemas. In Proceedings of the Twenty-Third Annual Conference of the
  Cognitive Science Society. Erlbaum, Mahwah, NJ, pages 617–622.

Amr Rekaby Salama and Wolfgang Menzel. 2016. Multimodal graph-based dependency
  parsing of natural language. In International Conference on Advanced
  Intelligent Systems and Informatics. Springer International Publishing, pages
  22–31.

Anne Schiller, Simone Teufel, and Christine Thielen. 1995. Guidelines für das
  Tagging deutscher Textcorpora mit STTS. Universität Stuttgart und Universität
  Tübingen.

Michael K Tanenhaus, Michael J Spivey-Knowlton, Kathleen M Eberhard, and Julie
  C Sedivy. 1995. Integration of visual and linguistic information in spoken
  language comprehension. Science 268(5217):1632.

Jos JA Van Berkum, Colin M Brown, Pienie Zwitserlood, Valesca Kooijman, and
  Peter Hagoort. 2005. Anticipating upcoming words in discourse: Evidence from
  ERPs and reading times. Journal of Experimental Psychology: Learning, Memory,
  and Cognition 31(3):443.

Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola, and Amir Globerson. 2014.
  Steps to excellence: Simple inference with refined scoring of dependency
  trees. In Proceedings of the 52nd Annual Meeting of the Association for
  Computational Linguistics (Volume 1: Long Papers). Association for
  Computational Linguistics, Baltimore, Maryland, pages 197–207.
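
As a supplementary illustration of the architecture described in Section 3, the following minimal sketch shows one way in which candidate analyses from a statistical parser could be re-ranked against the semantic role annotation of a visual scene. It is not the interface of RBGParser or jwcdg: the `Candidate` class, the `violations` and `rerank` functions, the penalty weight, the label names and the Patient entry in `EXPECTED_LABEL` are all hypothetical and chosen only for illustration. The only constraint taken from the text is that the Agent of an active sentence is supposed to be its Subject.

```python
from dataclasses import dataclass

# Hypothetical mapping from semantic roles to expected syntactic labels in an
# active sentence. Only Agent -> Subject is stated in the text; the Patient
# entry and the label names are assumptions made for illustration.
EXPECTED_LABEL = {"Agent": "SUBJ", "Patient": "OBJA"}


@dataclass
class Candidate:
    """One candidate analysis: a parser score and a word -> label mapping."""
    score: float
    labels: dict


def violations(candidate, roles):
    """Count words whose syntactic label contradicts their annotated role."""
    count = 0
    for word, role in roles.items():
        expected = EXPECTED_LABEL.get(role)
        if expected is not None and candidate.labels.get(word) != expected:
            count += 1
    return count


def rerank(candidates, roles, penalty=1.0):
    """Return the candidate with the highest score after role penalties."""
    return max(candidates, key=lambda c: c.score - penalty * violations(c, roles))


# Role annotation of the visual scene, following the "Sie" / "das Fenster"
# example; the scores below are invented for the demonstration.
roles = {"Sie": "Agent", "das Fenster": "Patient"}
text_only_best = Candidate(0.9, {"Sie": "OBJA", "das Fenster": "SUBJ"})
scene_consistent = Candidate(0.7, {"Sie": "SUBJ", "das Fenster": "OBJA"})
best = rerank([text_only_best, scene_consistent], roles)
```

In this toy setting the analysis preferred by the language-only score is overruled, because both of its labels conflict with the scene annotation. A soft penalty rather than a hard filter is used here so that a candidate is still returned when no analysis satisfies every constraint; whether the actual system behaves this way is not stated in the text.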




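
The counts reported for the test run in Section 3.1 can be collected in a small table, sketched below. The RPA-Genitive and RC-gender rows are read directly from the text. For RPA-Scope and RPA-Dative the text only says that the same pattern was observed, so the error counts in those two rows are an interpretation (all far or wide-scope attachments wrong for the original RBG, none wrong for the cross-modal parser) and not figures stated by the authors. The names `RESULTS` and `accuracy` are hypothetical.

```python
# (ambiguity type, sentences, errors of original RBG, errors of cross-modal parser)
RESULTS = [
    ("RPA-Genitive", 24, 12, 0),  # stated: all 12 far attachments wrong for RBG
    ("RPA-Scope", 24, 12, 0),     # interpreted from "a similar pattern"
    ("RPA-Dative", 20, 10, 0),    # interpreted from "blind to far attachments"
    ("RC-gender", 40, 20, 10),    # stated: 20 vs. 10 wrongly labeled cases
]


def accuracy(total, errors):
    """Share of correctly analysed sentences."""
    return (total - errors) / total


total = sum(n for _, n, _, _ in RESULTS)
rbg_errors = sum(e for _, _, e, _ in RESULTS)
cross_modal_errors = sum(e for _, _, _, e in RESULTS)

for name, n, rbg, cm in RESULTS:
    print(f"{name:13s} n={n:3d}  RBG={accuracy(n, rbg):.2f}  cross-modal={accuracy(n, cm):.2f}")
print(f"{'all':13s} n={total:3d}  RBG={accuracy(total, rbg_errors):.2f}  "
      f"cross-modal={accuracy(total, cross_modal_errors):.2f}")
```

The row sizes add up to the 108 fully ambiguous sentences mentioned in Section 3.1, which serves as a consistency check on the reading of the text; the aggregate accuracies printed in the last line depend on the interpreted rows and should be treated accordingly.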