Asymmetries in Extraction From Nominal Copular Sentences:
                a Challenging Case Study for NLP Tools

                 Paolo Lorusso, Matteo Greco, Cristiano Chesi, Andrea Moro
                         NEtS at Scuola Universitaria Superiore IUSS.
                            P.zza Vittoria 15, I-27100 Pavia (Italy)
                          {paolo.lorusso, matteo.greco,
                  andrea.moro, cristiano.chesi}@iusspavia.it


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                          Abstract

    In this paper we discuss two types of nominal copular sentences (Canonical and Inverse, Moro 1997) and we show that the peculiarities of these two configurations are hardly taken into account by standard NLP tools that are currently publicly available. We show that example-based MT tools (e.g. Google Translate) as well as other NLP tools (UDpipe, LinguA, Stanford Parser, and Google Cloud AI API) fail to capture the critical distinctions between the two structures. In the end they produce both wrong analyses and, possibly as a consequence of a non-coherent (or missing) structural analysis, incorrect translations in the case of MT tools. To support the proposed analysis, we also present an empirical study showing that native speakers are indeed sensitive to the critical distinctions. This poses a sharp challenge for NLP tools that aim at being cognitively plausible or at least descriptively adequate (Chowdhury & Zamparelli 2018).

1. Introduction

The main hypothesis of this paper is that sentence comprehension cannot be achieved independently of a coherent structural analysis. To support this claim, we first present a precise structural analysis that is critical for recovering the relevant dependencies within specific constructions. We then show that the crucial structural properties captured by the theoretical framework are in fact correctly perceived by native speakers, but not revealed by some widely used Natural Language Processing (NLP) tools. This leads to poor performance in tasks like Machine Translation (MT).

This argument seems to us especially relevant in those structural configurations in which a non-local dependency must be established. In parsing, for instance, correctly interpreting a wh- dependency requires that the dependent (the wh- phrase) and the dependee (the head selecting the wh- phrase as its argument/modifier) are identified, and that the nature of the dependence is disambiguated (e.g. argument vs. modifier). In (1) we exemplify the special case of a non-local dependency between a wh- PP and a DP it depends on (a co-indexed underscore signals the possible extraction sites, hence the dependent constituent; the diacritic “*” prefixes, as usual, illegal sites):

(1) [Di quale segnale]i [i telescopi *_i] hanno scoperto *_i [un’interferenza _i]?
    Of which signal     the telescopes    have  discovered    an interference
    ‘[which signal]i did the telescopes discover an interference of _i?’

The second DP un’interferenza (an interference), the internal argument, is the dependee of the wh- phrase; neither the subject DP nor the predicate can host this wh- dependency instead.

According to Google Translate (as of 12th July 2019), this second option seems indeed a viable one:

(2) What signal did the telescopes find an interference?

The translation is ill-formed, since the internal argument of find is filled both by the wh- phrase and
the DP an interference (which cannot take a wh- DP as its own argument due to the absence of a relevant preposition).

In this work we focus on a similar non-local dependency involving two kinds of copular sentences: Canonical (3.a) and Inverse (3.b). Using these constructions, we will test the availability of wh- PP sub-extraction from both the first and the second DP, as exemplified in (4).

(3) a. le foto del muro sono la causa della rivolta
       the pictures of the wall are the cause of the riot
    b. la causa della rivolta sono le foto del muro
       the cause of the riot are the pictures of_the wall
       ‘the cause of the riot is the pictures of the wall’

(4) a. [Di quale rivolta]i le foto del muro sono la causa _i ?
        of which riot      the pictures of_the wall are the cause
    b. [Di quale muro]i le foto _i sono la causa della rivolta?
        of which wall   the pictures are the cause of_the riot

In the first part of this paper (§2), we will briefly present an analysis for these constructions, then we will demonstrate that native speakers are selectively sensitive both to the copular structural configuration (Canonical vs. Inverse) and to the extraction site (subject vs. predicate) (§3). In §4 we will test the insensitivity of some freely available NLP tools (Google Translate, the Natural Language service of Google Cloud AI API, UDpipe, Stanford Parser and LinguA) to the syntactic oppositions previously discussed.

2.   The structure of nominal copular sentences

Copular sentences are those sentences whose main verb is to be (the copula) or its equivalents across languages. A subset of copular sentences involves two DPs, linearly ordered as DP V DP. These are dubbed nominal copular sentences. In this configuration, one nominal phrase realizes the predicate of the sentence (“the cause…” in (3)) while the other is the subject of the predicate (“the pictures…” in (3)). According to Moro (1997), nominal copular sentences can be divided into two subtypes: Canonical copular sentences (3.a), in which the order is subject-copula-predicative expression, and Inverse copular sentences (3.b), in which the order is inverted, i.e. predicative expression-copula-subject.

Moro (1991, 1997, 2006) showed that these two types of copular constructions can be distinguished on the basis of different diagnostics, such as agreement on the verb, grammaticality of the extraction of DPs (wh- or clitic) and pronominal binding.

Traditionally, copular sentences are analyzed as involving the raising of a DP from the same base-generated structure (Stowell 1978). Moro (1997, 2018) showed that predicate DPs (including there and its equivalents across languages) can be raised, just as subject DPs can, to the preverbal position from the so-called Small Clause (SC), a structure resulting from merging two DPs (Moro 2000, 2009; Chomsky 2013; Rizzi 2016). In other words, while in Canonical copular sentences the subject DP raises to the preverbal position and the predicative DP stays in situ inside the small clause in the postverbal position (5), in Inverse copular sentences the predicative DP raises to the preverbal position and the subject DP stays in situ inside the small clause in the postverbal position (6).

(5) Canonical copular sentence structure

            IP
          /    \
     DPsubj     VP
               /  \
              V    SC
                  /  \
                ti    DPpred

(6) Inverse copular sentence structure

            IP
          /    \
     DPpred     VP
               /  \
              V    SC
                  /  \
             DPsubj   ti

2.1    Asymmetries in copular sentences

These two different representations offer a principled explanation for many asymmetries across languages. Distinguishing between Canonical and Inverse copular sentences is not
always easy or possible (see Jespersen 1924 as cited in Moro 1997). However, agreement and PP/ne sub-extraction offer robust diagnostics. For example, verbs invariably agree with the subject DP in Italian (7), regardless of its pre-verbal or post-verbal position, while they invariably agree with the preverbal DP in English (8):

(7) a. le foto        sono/*è la causa
       the pictures   are/*is the cause
    b. la causa       sono/*è le foto
       the cause      are/*is the pictures
                                              Italian
(8) a. the pictures are/*is the cause.
    b. the cause *are/is the pictures
                                              English

Extraction is only allowed from the post-verbal DP (the predicate) in Canonical sentences (9), whereas it is not allowed from the post-verbal DP (the subject) in Inverse copular sentences (10).

(9) a. [which riot]i do you think a picture of the wall was the cause of _i?
    b. [di quale rivolta]i pensi che una foto del muro sia la causa _i?
       [of which riot]i do you think that a picture of_the wall is the cause _i?

(10) a. *[which wall]i do you think a cause of the riot was a picture of _i?
     b. *[di quale muro]i pensi che la causa della rivolta sia una foto _i?
        [of which wall]i you think that the cause of_the riot is a picture _i?

3.   Experimental evidence supporting the analysis of copular sentences

Before considering the computational side of the proposed structural analysis, we investigated whether the human parser is sensitive to the critical distinctions illustrated here. Two experiments are discussed, testing the processing of Canonical vs Inverse copular sentences (first factor) involving the extraction of a wh- element from a DP embedded either under the subject or under the predicate (second factor).

Our prediction was that sensitivity to agreement and to the argumental vs. predicative role distinction for the two DPs involved would influence both the online and the offline performance of native speakers. Participants should show an advantage in parsing Canonical copular sentences (vs. Inverse ones), since only the Canonical configuration allows extraction from the predicate DP, whereas all the other kinds of extraction (from the subject in Canonical, and from both the subject and the predicate in Inverse) should be disallowed (§2.1).

In order to test these hypotheses, we performed (i) a Self-Paced Reading (SPR) experiment with a Sentence Comprehension Task at the end, and (ii) an Acceptability Judgement Task (AJT).

3.1      Material and methods

In both the SPR and AJT the set of stimuli was the same: 128 items (divided in 4 conditions) and 40 fillers, in SPR, and 60 fillers, in AJT per condition (72 items per experiment in SPR, 92 in AJT). The 2x2 design produced four experimental conditions, exemplified in (11):

(11) Condition 1:
     Canonical + Extraction from the Subject
     *[PP Di quale muro]i … [DP le foto _i]a sono [SC [_a] [DP la causa [PP della rivolta]]]?
          Of which wall       the pictures   are             the cause      of_the riot?

     Condition 2:
     Canonical + Extraction from the Predicate
     [PP Di quale rivolta]k … [DP le foto [PP del muro]]a sono [SC [_a] [la causa _k]]
         Of which riot            the pictures of_the wall are          the cause?

     Condition 3:
     Inverse + Extraction from the Subject
     *[PP Di quale muro]i … [la causa [PP della rivolta]]b sono [SC [le foto _i] [_b]]?
          Of which wall      the cause    of_the riot      are (=is) the pictures?

     Condition 4:
     Inverse + Extraction from the Predicate
     *[PP Di quale rivolta]k … [la causa _k]b sono [SC [DP le foto [PP del muro]] [_b]]?
          Of which riot …       the cause     are (=is) the pictures  of_the wall

3.2      Self-Paced Reading

32 native Italian speakers participated in the experiment. Stimuli consisted of questions and their answers; participants had to read the question word by word and, then, the answer. Finally, they had to judge the appropriateness of the answer.
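The 2x2 design of §3.1 and the prediction stated above (sub-extraction is licensed only from the predicate DP of a Canonical sentence) can be written down compactly. The following is a minimal sketch, not the authors' experimental script: the names `Condition`, `build_conditions` and `list_length` are hypothetical. The count check at the end reflects only one reading of the figures in §3.1, namely that the reported totals of 72 and 92 correspond to 128/4 = 32 experimental items plus the fillers.

```python
from dataclasses import dataclass
from itertools import product

STRUCTURES = ("Canonical", "Inverse")
EXTRACTION_SITES = ("Subject", "Predicate")


@dataclass(frozen=True)
class Condition:
    number: int
    structure: str  # "Canonical" or "Inverse"
    site: str       # DP the wh- PP is extracted from

    @property
    def predicted_grammatical(self) -> bool:
        # Prediction from section 2.1: extraction is allowed only from
        # the predicate DP of a Canonical copular sentence.
        return self.structure == "Canonical" and self.site == "Predicate"


def build_conditions():
    """Cross the two factors in the order used in example (11)."""
    pairs = product(STRUCTURES, EXTRACTION_SITES)
    return [Condition(n, s, e) for n, (s, e) in enumerate(pairs, start=1)]


def list_length(total_items: int, n_conditions: int, fillers: int) -> int:
    """Assumed reading of section 3.1: items per condition plus fillers."""
    return total_items // n_conditions + fillers


CONDITIONS = build_conditions()

if __name__ == "__main__":
    for c in CONDITIONS:
        mark = "" if c.predicted_grammatical else "*"
        print(f"Condition {c.number}: {mark}{c.structure} + "
              f"Extraction from the {c.site}")
    print("SPR list length:", list_length(128, 4, 40))
    print("AJT list length:", list_length(128, 4, 60))
```

Under these assumptions only Condition 2 comes out unstarred, matching the diacritics in (11).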
3.3    Results

Participants showed higher accuracy in answering comprehension questions when the extraction occurred from the post-verbal DP in Canonical copular sentences (the predicate DP, Condition 2) than in Inverse copular sentences (the subject DP, Condition 3), while extraction from the Inverse copular constructions induced lower accuracy (-0.41, z=-2.054, p=0.04; Fig. 1). This confirms that the structural asymmetry between referential subjects and predicative DPs has a central role in both the processing and the comprehension of nominal copular sentences. Similarly, the Inverse vs Canonical opposition seems relevant, since extractions from both sites in the Inverse copular constructions produce less accurate answers than extraction from the predicate in Canonical copular sentences (consistent with Moro 1997, 2006, which predict both DPs in Inverse constructions to be illegal extraction sites).

Fig.1 Percentage of correct answers across conditions.

Reading times, on the other hand, revealed a clear difference at the copular region for the two conditions (t=3.37, p=0.002), suggesting a penalty for the Inverse copular constructions compared to the Canonical ones. At the first DP region, too, the Predicate vs Subject distinction produces a reliable difference (t>2, p=0.008), indicating that the “la causa” (“the cause”) and “le foto” (“the pictures”) conditions, respectively the predicate and the subject condition, are perceived as different.

3.4    Acceptability Judgement Task

40 native Italian speakers participated in the experiment. Stimuli were the same as in SPR. Participants had to rate the acceptability of questions on a scale from 1 to 7.

3.5    Results

The results (Fig. 2) confirm the previous on-line findings and show that (i) Canonical constructions were more acceptable than Inverse ones and that (ii) among the different types of copular sentences, the ones with extraction from predicates receive higher ratings than the ones with extraction from subjects.

Fig.2 Acceptance rates across conditions.

4. Parsing copular sentences

To evaluate the state of the art of NLP with respect to the contrasts we discussed (Canonical vs Inverse copular sentences) in a configuration where overt agreement disambiguates the critical roles (predicate vs subject), we ran a few tests using the following tools:

1. UDpipe (Straka et al 2016)
2. Stanford Parser - English (Chen & Manning 2014)
3. LinguA parser (Attardi, Dell’Orletta 2009)
4. Google Translate (translate.google.com)
5. Google Cloud AI Solutions (cloud.google.com)

We first tested standard Canonical (3.a) and Inverse (3.b) copular constructions, then we tried to assess qualitatively the output analyses provided by these tools with respect to sub-extraction from the predicate in Canonical sentences (9.a-b), here repeated for convenience:

(3) a. le foto del muro sono la causa della rivolta
       the pictures of the wall are the cause of the riot
    b. la causa della rivolta sono le foto del muro
       the cause of the riot are the pictures of_the wall
       ‘the cause of the riot is the pictures of the wall’
(9) a. [which riot]i do you think a picture of the wall was the cause of _i?
    b. [di quale rivolta]i pensi che una foto del muro sia la causa _i?
       [of which riot]i do you think that a picture of the wall is the cause _i?

4.1    UDpipe

The UDPipe Natural Language Processing - Text Annotation interface (Wijffels 2018, Straka et al 2016) provides a handy tool that is easily integrated in the R environment. Various pre-trained models are available for many languages. We ran our analyses using the pre-trained model italian-isdt-ud-2.4-190531. The results of the analysis for Canonical (12.a) and Inverse (12.b) sentences are simply the same. In fact, not even the basic local dependencies are fully recovered (e.g. det-noun). The analysis of sub-extraction from the predicate in Canonical structures (13.a) is paradoxically less disastrous than the other analyses, but if we try to analyze sub-extraction from the subject of a Canonical construction, we obtain a wrong analysis (13.b) (the wh- item is considered an extra argument of cause):

(12) a. Canonical copular sentence analysis

     b. Inverse copular sentence analysis

(13) a. sub-extraction from predicate in Canonical configuration

     b. sub-extraction from subject in Canonical configuration

4.2    Stanford Parser

The Stanford parser (Chen & Manning 2014) can be considered the state-of-the-art parser for English. Canonical constructions, in fact, gave it the opportunity to live up to expectations: the analysis of the Canonical copular sentence (14.a) is perfectly in line with the analysis presented in §2-§2.1 (cause is identified as predicate and pictures as its subject). Unfortunately, the same analysis is proposed for Inverse copular constructions (14.b).

(14) a. Canonical copular sentence analysis

     b. Inverse copular sentence analysis

The quality of the analysis for the sub-extraction case confirms every suspicion: the sub-extracted wh- item (which riot) is wrongly associated with the matrix predicate (think) (15).

(15) sub-extraction from predicate in Canonical configuration

4.3    LinguA

The LinguA annotation pipeline (a service provided on-line by ItaliaNLP Lab at Istituto di Linguistica Computazionale "Antonio Zampolli" ILC in Pisa) has been used for our tests on Italian. It implements a version of the Attardi & Dell’Orletta (2009) parser (currently the state-of-the-art parser for Italian). The analyses of this parser are definitely more precise than the ones proposed by the UDpipe tool, but the symmetric results returned for both Canonical and Inverse copular sentences did not identify either the dependency between the predicate and the subject or their actual role in the structure (16.a-b). The analysis of the extraction, interestingly, attempts an interpretation of the wh- item as an (extra) argument of the first DP (le foto [di quale rivolta] (del muro)). This is a wrong analysis, but it is coherent with the slow-down observed in the self-paced reading experiment (§3.3) at the first DP region, though the parser does not make the relevant distinction between subject (17.a) and predicate (17.b) (in this second case, sub-extraction is interpreted as a copula argument).
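The observation that a parser returns "simply the same" analysis for Canonical and Inverse sentences can be checked mechanically when the output is available in CoNLL-U format (the format UDPipe processes, per Straka et al. 2016). The sketch below compares the dependency skeletons of two parses, i.e. the head index and relation label of each token, ignoring word forms. The helper names are hypothetical, and the three parses are hand-written illustrations over the short sentences in (7), not actual output of any tool discussed here.

```python
def rows_to_conllu(rows):
    """Build 10-column CoNLL-U text from (form, upos, head, deprel) tuples."""
    lines = []
    for idx, (form, upos, head, deprel) in enumerate(rows, start=1):
        cols = [str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines)


def skeleton(conllu):
    """Return (head, deprel) for every word line, ignoring the word forms."""
    pairs = []
    for line in conllu.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges ("2-3"), empty nodes ("2.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        pairs.append((int(cols[6]), cols[7]))
    return pairs


def same_analysis(parse_a, parse_b):
    """True when two parses assign identical heads and relation labels."""
    return skeleton(parse_a) == skeleton(parse_b)


def subject_form(conllu):
    """Return the form of the first token labelled nsubj, or None."""
    for line in conllu.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[7] == "nsubj":
            return cols[1]
    return None


# Hand-written, hypothetical parses (NOT real tool output).
# Canonical "le foto sono la causa": predicate "causa" is the root.
CANONICAL = rows_to_conllu([
    ("le", "DET", 2, "det"), ("foto", "NOUN", 5, "nsubj"),
    ("sono", "AUX", 5, "cop"), ("la", "DET", 5, "det"),
    ("causa", "NOUN", 0, "root"),
])
# Inverse "la causa sono le foto", analysed purely by linear position.
INVERSE_POSITIONAL = rows_to_conllu([
    ("la", "DET", 2, "det"), ("causa", "NOUN", 5, "nsubj"),
    ("sono", "AUX", 5, "cop"), ("le", "DET", 5, "det"),
    ("foto", "NOUN", 0, "root"),
])
# Same Inverse sentence with the roles argued for in section 2.
INVERSE_ROLE_AWARE = rows_to_conllu([
    ("la", "DET", 2, "det"), ("causa", "NOUN", 0, "root"),
    ("sono", "AUX", 2, "cop"), ("le", "DET", 5, "det"),
    ("foto", "NOUN", 2, "nsubj"),
])
```

With these illustrative inputs, `same_analysis(CANONICAL, INVERSE_POSITIONAL)` holds, which is the symmetric pattern criticized in the text, whereas the role-aware parse differs from the Canonical skeleton and keeps "foto" as the subject.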
(16) a. Canonical copular sentence analysis

     b. Inverse copular sentence analysis

(17) a. sub-extraction from predicate in Canonical configuration

     b. sub-extraction from subject in Inverse configuration

4.4    Google AI

We finally investigated the Natural Language service, one of the tools provided by the Google Cloud AI Solutions API, which returns syntactic representations of sentences (https://cloud.google.com/natural-language/). In English, the analyses of both Canonical and Inverse copular sentences are equivalent to the ones provided by the Stanford Parser (hence partially consistent with our analyses). In Italian, given the Canonical copular sentence ‘[le intercettazioni]k [sono]k [la documentazione]i’ (‘the interceptions are the documentation’), the tool incorrectly analyzes the predicate DP the documentation as an attribute (Fig. 4) (this might be a consistent annotation of all nominal predicates adopted by Google, but it is clearly misleading here). Moreover, when it is provided with the Inverse form of the sentence, ‘la documentazione sono le intercettazioni’ (lit. the documentation are the interceptions; ‘The documentation is the interceptions’), the tool incorrectly analyzes the raised predicative DP the documentation (a singular noun) as the subject, putting it in a wrong agreement relation with the verb (plural form) (Fig. 5). In the end, this parser fails to recognize the critical difference between Canonical and Inverse copular sentences, giving exactly the same analysis for both cases (3.a) and (3.b).

Fig.4 The structural analysis of the Canonical sentence ‘le intercettazioni sono la documentazione’ (‘The interceptions are the documentation’) given by Google Natural Language.

Fig.5 Structural analysis of the Inverse copular sentence ‘la documentazione sono le intercettazioni’ (lit. the documentation are the interceptions; ‘The documentation is the interceptions’) given by Google Natural Language.

4.5    Google Translate

In order to evaluate the impact of these wrong analyses on a practical NLP task, we finally carried out our conclusive experiments on one of the most famous and widely used machine translation tools: Google Translate.

Starting with simple examples, we observed that when the tool is provided with the Italian Inverse copular sentence ‘La causa della rivolta sono le foto del muro’ (lit. the cause of the riot are the pictures of the wall; ‘The cause of the riot is the pictures of the wall’), it gives the wrong English translation ‘*The cause of the uprising are the photos of the wall’ (Fig. 6), in which the verb does not agree with the pre-verbal DP “the cause of the uprising”, contrary to what is required in English (as we saw in 8).

Fig.6 Example from Google Translate: https://translate.google.it/?hl=it#view=home&op=translate&sl=auto&tl=en&text=La%20causa%20della%20rivolta%20sono%20le%20foto%20del%20muro
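The agreement diagnostics in (7) and (8) amount to a simple rule that can be applied to a translation once the roles and number features of the two DPs are known: in Italian the copula agrees with the subject wherever it surfaces, in English it agrees with the preverbal DP. The sketch below encodes that rule. It assumes hand-annotated number features, and the names `CopularSentence`, `expected_copula_number` and `agreement_ok` are hypothetical; it is not a description of how any of the tested tools work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopularSentence:
    language: str          # "it" or "en"
    preverbal_role: str    # "subject" (Canonical) or "predicate" (Inverse)
    subject_number: str    # "sg" or "pl", annotated by hand
    predicate_number: str  # "sg" or "pl", annotated by hand
    copula_number: str     # number marked on the copula in the string


def expected_copula_number(s: CopularSentence) -> str:
    """Apply the diagnostics in (7) and (8)."""
    if s.language == "it":
        # Italian: agreement with the subject, pre- or post-verbal.
        return s.subject_number
    if s.language == "en":
        # English: agreement with whichever DP is preverbal.
        if s.preverbal_role == "subject":
            return s.subject_number
        return s.predicate_number
    raise ValueError(f"no agreement rule for language {s.language!r}")


def agreement_ok(s: CopularSentence) -> bool:
    return s.copula_number == expected_copula_number(s)


# 'La causa della rivolta sono le foto del muro' (Inverse, Italian).
SOURCE_IT = CopularSentence("it", "predicate", "pl", "sg", "pl")
# '*The cause of the uprising are the photos of the wall' (reported output).
OUTPUT_EN = CopularSentence("en", "predicate", "pl", "sg", "pl")
# 'The cause of the riot is the pictures of the wall' (expected output).
TARGET_EN = CopularSentence("en", "predicate", "pl", "sg", "sg")

if __name__ == "__main__":
    for name, s in [("source", SOURCE_IT), ("output", OUTPUT_EN),
                    ("target", TARGET_EN)]:
        print(name, "agreement ok:", agreement_ok(s))
```

On these annotations the Italian source and the expected English sentence pass the check, while the translation reported in Fig. 6 fails it because plural agreement has been carried over from the post-verbal subject.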
Interestingly, reversing the translation direction (from English to Italian), for the cause of the riot is the pictures of the wall the system correctly produces la causa della rivolta sono le immagini del muro, where proper agreement (with the post-verbal subject) is in place. Since the analysis provided by any tool we tested is theoretically inconsistent with this result, we hypothesized that this translation could have been obtained by adopting an example-based approach. It was then worth testing whether the correct agreement with the post-verbal subject is just an accident (this is a well-known prototypical sentence, widely discussed in the literature, and it might have been included in the Google Translate training set) or whether the analysis generalizes to any possible subject/predicate pair.

A sentence like la documentazione sono le intercettazioni (lit. the documentation are the interceptions, meaning ‘The documentation is the interceptions’) suits our purpose nicely.

In the Italian > English direction the correct singular copular agreement is produced (“the documentation is the interceptions”), but from English to Italian this time the wrong agreement is obtained, totally ignoring the number of the real post-verbal subject (the documentation is the interceptions > la documentazione è le intercettazioni). We concluded then that no deep analysis is attempted so as to distinguish between subject and predicate roles, and this turns out to be fatal.

5.   Conclusion

In this paper we demonstrated that nominal copular sentences constitute a clear challenge for computational analysis, since the same string of elements [DP V DP] can in principle have two different syntactic representations (hence two different meanings), depending on which kind of copular sentence is realized (Canonical or Inverse). We spotted various glitches in the automatic analyses, which in the end led either to significant failures (Google Translate) or to rough structural hypotheses that bluntly ignore the relevant contrasts discussed here. Our empirical study, testing both online and offline the wh- PP sub-extraction possibilities from both subject and predicate DPs, shows that native speakers are sensitive to the different structural roles; in addition, they perceive, as expected, the underlying structural representation of Canonical vs. Inverse copular constructions. None of the NLP tools we tested succeeded in providing a full set of coherent analyses, with the exception of the Stanford Parser for English, which at least succeeded in correctly analyzing the Canonical copular sentences. This analysis was however insufficient in the case of Inverse constructions and in the case of sub-extraction, confirming that non-local dependencies are critical configurations that native speakers are able to parse but machines are not, yet.

References

Attardi G., Dell’Orletta F. (2009). Reverse Revision and Linear Tree Combination for Dependency Parsing. In: NAACL-HLT 2009 - North American Chapter of the Association for Computational Linguistics - Human Language Technologies (Boulder, Colorado, June 2009). Proceedings, Association for Computational Linguistics, 2009. pp. 261-264.

Chen D., C. D. Manning. (2014). A Fast and Accurate Dependency Parser using Neural Networks. Proceedings of EMNLP 2014. pp. 740-750.

Chomsky, N., (2013). ‘Problems of projection.’ Lingua 130: 33-49.

Chowdhury, S. A., & Zamparelli, R. (2018, August). ‘RNN simulations of grammaticality judgments on long-distance dependencies.’ In Proceedings of the 27th International Conference on Computational Linguistics (pp. 133-144).

Jespersen, O., (1924). The Philosophy of Grammar, Allen & Unwin, London.

Moro, A., (1991). The raising of predicates: copula, expletives and existence. MIT Working Papers in Linguistics 15: 119-181.

Moro, A., (1997). The Raising of Predicates. Cambridge: Cambridge UP.

Moro, A., (2000). Dynamic Antisymmetry. Linguistic Inquiry Monograph, Series, MIT Press.

Moro, A., (2006). ‘Copular sentences.’ In Everaert, M. & H. van Riemsdijk (eds.), MA. Blackwell Companion to Syntax II, Blackwell, Oxford, 1-23.

Moro, A., (2009). ‘Rethinking Symmetry: A Note on Labelling and the EPP.’ In La grammatica tra storia e teoria: Scritti in onore di Giorgio Graffi, edited by P. Cotticelli Kurras and A. Tomaselli, 129-31. Alessandria: Edizioni dell’Orso; also at http://www.ledonline.it/snippets/allegati/snippets19007.pdf.

Moro, A., (2018). ‘Copular sentences.’ In Everaert, M. & H. van Riemsdijk (eds.), MA. Blackwell Companion to Syntax, Revised edition vol. II, Blackwell, Oxford, 1-23.

Rizzi, L., (2016). ‘Labeling, maximality, and the head-
  phrase distinction.’ The Linguistic Review 33, 103–
  127.
Straka, M., Hajic, J., & Straková, J. (2016). UDPipe:
   trainable pipeline for processing CoNLL-U files
   performing tokenization, morphological analysis,
   pos tagging and parsing. In Proceedings of the tenth
   international conference on language resources and
   evaluation (LREC 2016) (pp. 4290-4297).
Stowell, T., (1978). ‘What was there before there was
   there.’ In D. Farkas et al., eds., Papers from the
   Fourteenth Regional Meeting, Chicago Linguistic
   Society. Chicago Linguistic Society, University of
   Chicago.
Wijffels, J. (2018). udpipe: Tokenization, Parts of
  Speech Tagging, Lemmatization and Dependency
  Parsing with the 'UDPipe' 'NLP' Toolkit. R
  package version 0.5.