<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reflexives, Impersonals and Their Kin: a Classification Problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kledia Topciu</string-name>
          <email>kledia.topciu@student.unisi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <email>cristiano.chesi@iusspavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NETS - IUSS</institution>
          ,
          <addr-line>P.zza Vittoria 15, I-27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi Di Siena</institution>
          ,
          <addr-line>Via Roma 56, I-53100 Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although true reflexives always require a local antecedent, automatic referential resolution is often far from trivial: in many languages, reflexives are morphologically indistinguishable from impersonals and both particles are sensitive to the syntactic structure in a non-trivial sense. Focusing on Italian, we annotated part of the Repubblica Corpus to attempt an automatic classification of the reflexive and impersonal si constructions. In this preliminary study we show that the accuracy of automatic classification methods that do not use any relevant structural information is rather modest. We discuss the structural analysis required to distinguish among different contexts and suggest that these structural configurations are not easily recoverable using a purely distributional approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The non-triviality of reflexive/impersonal
constructions in Italian is exemplified in (1):</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>d. pro<sub>i</sub> Si<sub>i/*j</sub> tolse la giacca<sub>i</sub>.
pro<sub>i</sub> SI<sub>i/*j</sub> took<sub>3-SG-PAST</sub> off the jacket
‘S/He took off the jacket.’</p>
      <p>e. Il compagno<sub>j</sub> di Ada<sub>i</sub> si<sub>*i/j</sub> presentò.
The friend<sub>j</sub> of A.<sub>i</sub> SI<sub>*i/j</sub> introduced<sub>3-SG-PAST</sub>
‘A.’s friend introduced her/him-self.’</p>
      <p>f. Riconosciuto il compagno<sub>j</sub> di Ada<sub>i</sub>, pro<sub>k</sub> si<sub>*i/*j/k</sub> presentò.
Recognized<sub>3-SG-P.PART</sub> the friend<sub>j</sub> of A.<sub>i</sub>, pro<sub>k</sub> SI<sub>*i/*j/k</sub> introduced<sub>3-SG-PAST</sub>.
‘Once s/he recognized A.’s friend, s/he introduced her/him-self.’</p>
      <p>g. Si<sub>generic</sub> pensa sempre a salvarsi la pelle.
SI<sub>generic</sub> thinks always to save<sub>INF-REFL</sub> the skin
‘We always think about saving our own skin.’</p>
      <p>
Expecting the co-referential DP always to be
“immediately to the left” of the reflexive form
quickly leads to wrong predictions: while this
generalization might seem sufficient in (1a), it is
plainly wrong in (1b), where we need to assume an
empty referent
        <xref ref-type="bibr" rid="ref20">(pro, Rizzi 1986)</xref>
        before the
reflexive (see §1.1). Moreover, we should accept
that the coreferential DP can be placed sometimes
to the right of the predicate
        <xref ref-type="bibr" rid="ref3">(structurally speaking,
pro and post-verbal subject options are related,
Belletti 2002)</xref>
        ; in this case, the
(focalized/dislocated) post-verbal subject is a good
candidate, (1b). Being “the closest DP” is, however,
not a sufficient condition, as suggested by
examples (1c-d). Hence, the null subject hypothesis,
as well as a structural analysis unravelling the role
of each DP surrounding the predicate, is required
for the identification of the correct local binding
domain (1e-f). Last but not least, a proper
classification of the predicate admitting a reflexive
or an impersonal pronoun is needed (1g). From
this perspective, we decided to run a small
experiment to verify the consistency of a
“usage-based” approach
        <xref ref-type="bibr" rid="ref23">(Tomasello 2003)</xref>
        in this specific
context and consider whether the “structural
analysis”
        <xref ref-type="bibr" rid="ref6">(Chomsky 1995; 2008)</xref>
        can be proved to
be an outdated approach for the classification of the
distinct kinds of si. In the remaining part of this
introduction we will present the (possibly outdated)
structural analyses proposed for reflexive (§1.1)
and impersonal (§1.2) clitic si. We will then present
our experiment consisting of the annotation of a
small fragment of the Repubblica Corpus
        <xref ref-type="bibr" rid="ref1">(Baroni
et al. 2004)</xref>
        that we used to train and test a set of
Machine Learning classification algorithms (§2).
The results (§3) and their discussion (§4)
follow.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 The reflexivization configuration</title>
      <p>
        A popular structural analysis of reflexives is the
unaccusative one: under this perspective, the
subject of reflexives is an underlying object (just
like the subject of unaccusatives) which has to raise
to the subject position for Case reasons (reflexive
morphology absorbs its Case). Two main variants
of this approach are discussed in the literature: a
lexical and a syntactic one. The lexical version
predicts that the external argument is absorbed in
the lexicon
        <xref ref-type="bibr" rid="ref12 ref14">(Marantz 1984 and Grimshaw 1990)</xref>
        ,
while the syntactic one proposes that the external
argument is present in syntax via the reflexive clitic
se
        <xref ref-type="bibr" rid="ref17 ref22">(Kayne 1988, Pesetsky 1995, Sportiche 1998)</xref>
        .
      </p>
      <p>A different analysis is proposed by Reinhart &amp;
Siloni (1999, 2005): reflexives should be
unergative entries, since unaccusativity tests (e.g.
ne cliticization, (2b)) fail with reflexive
constructions:</p>
      <p>(2) a. Ne sono arrivati tre.
of+them<sub>cl</sub> are arrived three
‘Three of them arrived.’</p>
      <p>b. *Se ne sono vestiti tre.
SI of+them<sub>cl</sub> are dressed three
‘Three of them got dressed.’</p>
      <p>Since only the internal argument can be cliticized
and the reflexive verb fails the ne test, we conclude
that the subject of reflexives is an external
argument, unlike the subject of unaccusatives.
Another test that helps to tease apart external from
internal argument structures is reduced relative
modification: when the modification is
implemented via a past participle, predicates with
an external argument are not allowed. The
reduced relative in (3a) contains a reflexive
predicate, while the one in (3b) is an impossible
cliticization of a transitive reflexive past participle.</p>
      <p>(3) a. Il bicchiere rottosi ieri apparteneva a mio nonno.
the glass broken-him/herself yesterday belonged to my grandfather</p>
      <p>b. *L'uomo lavatosi ieri è mio nonno.
the man washed-him/herself yesterday is my grandfather</p>
      <p>
        Robust evidence supports the idea that the subject
of reflexive verbs patterns with the subject of
unergatives, hence confirming its external
argument nature
        <xref ref-type="bibr" rid="ref16">(but see Pescarini 2015:42ff)</xref>
        .
      </p>
      <p>Kayne (1975) observes that reflexives occur in
environments where transitive verbs are
disallowed, e.g. in French causative constructions:
when the verb embedded under the causative verb
faire ‘make’ is a transitive verb (4a), its subject
must be introduced by the preposition à ‘to’; when
the lower verb is intransitive or reflexive, its
subject cannot be introduced by à (4b/c).</p>
      <p>(4) a. Je ferai laver Jean *(à) Luc.
I make<sub>FUT</sub> wash Jean to Luc
‘I will make Jean wash Luc.’</p>
      <p>b. Je ferai courir (*à) Jean.
I make<sub>FUT</sub> run Jean
‘I will make Jean run.’</p>
      <p>c. Je ferai se laver (*à) Jean.
I make<sub>FUT</sub> SE wash Jean
‘I will make Jean wash himself.’</p>
      <p>When the lower verb is reflexive, its subject
appears without the preposition, exactly like the
subject of unergative verbs. Therefore, reflexive
verbs are not transitive entries either.</p>
      <p>
        <xref ref-type="bibr" rid="ref19">Reinhart &amp; Siloni (2005)</xref>
        suggest that these
reflexive constructions are unergative entries
derived from their transitive alternate by a
reduction operation targeting the internal argument
(identified with the external one). They take verbal
reflexivization even further and propose a
lexicon-syntax parameter: arity operations (on θ-roles) can
apply either in the syntax or in the lexicon.
Reflexivization is essentially the same
phenomenon cross-linguistically: two
available θ-roles are assigned to the same syntactic
argument or, more precisely, the operation of
reflexivization takes two θ-roles and forms one
complex θ-role.
      </p>
      <p>The distinctions follow from two different
modes of operation: a lexical mode and a syntactic
one. Languages such as Hebrew, English, Russian
and Dutch have the parameter set to “lexicon”,
while in Romance languages, Greek and German
the “syntax” value of the parameter is set. In the
syntactic option (which is relevant here), what is to
become a reflexive verb leaves the lexicon with the
same number of θ-roles, which need to be assigned,
as the basic verbal entry. Since the clitic itself
cannot be viewed as an argument (the lack of Case
blocks its merge), the “extra” θ-role has to be
explained by an arity reduction operation.</p>
      <p>
        In conclusion, an automatic classification
algorithm attempting to identify the type of
the si reflexive pronoun should necessarily have
access to the verbal subcategorization frame and
postulate an arity reduction, as suggested by
        <xref ref-type="bibr" rid="ref19">(Reinhart &amp; Siloni 2005)</xref>
        . If this information is not
available as a lexical resource, we might try to rely
on structural cues to infer the correct argument
structure
        <xref ref-type="bibr" rid="ref13 ref15 ref2">(as in Merlo &amp; Stevenson 2001, Basili et
al 1997 or Ienco et al. 2008)</xref>
        . On the other hand, if
statistical cues were available, annotating this
information overtly would be unnecessary.
      </p>
      <p>
        A further complication, however, is associated
with the existence of a class of “reflexive” predicates
(e.g. alzarsi, ‘to stand up’) which are bona fide
unaccusatives
        <xref ref-type="bibr" rid="ref16">(inherent/lexical si constructions
Pescarini 2015)</xref>
        . In this case, the overlap
between the bare verbal root and a transitive form
of some inherent si predicates does not help the
automatic classification task (e.g. in “si lava la
mano”, ‘he/she washes his/her hand’, due to the
transitive nature of lavare ‘to wash’, the post-verbal
DP “la mano” could be analyzed either as a direct
object or as a post-verbal subject).
      </p>
    </sec>
    <sec id="sec-3">
      <title>1.2 Impersonal si constructions</title>
      <p>
        The reflexive reading is not the only available
option when the si pronoun is present: an
impersonal reading is also possible. Impersonal si
constructions are used to introduce a generic,
unspecified subject and to make general statements
about groups of people
        <xref ref-type="bibr" rid="ref8">(Cinque 1988,
Dobrovie-Sorin 1998, 1999 a.o.)</xref>
        . In Italian, si
constructions are exemplified in (5a). The subject
is unspecified and the sentence has a generic
reading because of si; its absence would
result in a sentence with a specific subject (5b),
Italian being a pro-drop language
        <xref ref-type="bibr" rid="ref20">(Rizzi 1986)</xref>
        .
      </p>
      <p>(5) a. In Italia si mangia troppo.
In Italy SI eats<sub>3-SG</sub> too much
‘In Italy, people eat too much.’</p>
      <p>b. In Italia pro mangia troppo.
In Italy pro eats<sub>3-SG</sub> too much
‘In Italy, he/she eats too much.’</p>
      <p>Notice that the adverbial modal modification
“troppo” is coherent with the generic reading,
while a punctual temporal adverbial modification
would be inconsistent with it (“#In Italia si mangia
domani” vs. “In Italia si mangia sempre”).</p>
      <p>
        As for the argumental status of si, there is
considerable disagreement in the linguistic community:
        <xref ref-type="bibr" rid="ref8">Cinque (1988)</xref>
        proposes the existence of two
different si items. The presence of si is usually
restricted to finite clauses; however, it is also
permitted in certain untensed clauses, namely in
Aux-to-Comp (6) and Raising structures (7) with
transitive and unergative verbs.
      </p>
      <p>(6) Non essendosi ancora scoperto il colpevole…
not being<sub>GERUND</sub>-SI yet discovered<sub>P-PART-SG-MASC</sub> the culprit<sub>SG-MASC</sub>
‘Not having yet discovered the culprit...’</p>
      <p>(7) Sembra non essersi ancora scoperto il colpevole …
seems<sub>3RD-SG</sub> not being-SI yet discovered<sub>P-PART-SG-MASC</sub> the culprit<sub>SG-MASC</sub>
‘It seems that the culprit has not been discovered yet.’</p>
      <p>
        Cinque considers these instances of si as
argumental ones (+arg), which can be present in
general only with verbs that project an external
θ-role. The other si is a non-argumental one (-arg),
which can be present with any verb class
(therefore, also with verbs that do not assign an
external θ-role).
      </p>
      <p>
        <xref ref-type="bibr" rid="ref10">Dobrovie-Sorin (1998</xref>
        , 1999) argues that it is
not necessary to postulate this: according to her,
what Cinque calls a +arg si is actually a middle
passive Accusative si. The only Nominative si is
Cinque’s -arg si. She argues that si is not licensed
in non-finite clauses because it is a Nominative clitic
and, in Italian, Nominative clitics are not allowed
in non-finite clauses. Only transitive and
unergative Aux-to-Comp and Raising structures
allow si as Accusative. Dobrovie-Sorin tries to
unify all the uses of SE in Romance languages and
assumes that si is not a special lexical item that
absorbs a θ-role or Case. Her analysis accounts for
special cases, such as Romanian, which has si
constructions but does not have Nominative clitics.
Italian si constructions, on the other hand, rely
either on Nominative (8) or Accusative (which also
includes reflexive configurations) (9).
      </p>
      <p>(8) Non si<sub>i</sub> e<sub>i</sub> è mai contenti.
not SI is<sub>3RD-SG</sub> ever satisfied
‘One is never satisfied.’</p>
      <p>(9) Il greco<sub>i</sub> si<sub>i</sub> traduce e<sub>i</sub> facilmente.
the Greek SI translates<sub>3RD-SG</sub> easily
‘Greek translates easily.’</p>
      <p>In (8), si is an anaphor and, if we assume a restricted
theory of binding, the anaphoric status of the clitic
is transferred to its trace. The indexing
configuration corresponds to a single argument, the
Theme. On the other hand, the si in (9) is not an
anaphor and therefore imposes no relation between
the subject and object positions; it binds an empty
category in the subject A-position.</p>
      <p>
        A reformulation of Dobrovie-Sorin’s proposal is
offered by
        <xref ref-type="bibr" rid="ref21">Salvi (2018)</xref>
        , who argues that in
modern Italian there are two reflexive si
constructions: a passive one and an impersonal one
        <xref ref-type="bibr" rid="ref16">(the reader should refer to Pescarini 2015 for a
more detailed discussion of a richer classification)</xref>
        .
The first one, exemplified in (10b), is characterized
by the cancellation of the subject (10a) and the
transformation of the direct object into the
grammatical subject (triggering agreement); the
derived grammatical subject can also occur in the
canonical preverbal position (10c):
      </p>
      <p>(10) a. Il preside ha consegnato i diplomi.
the dean has awarded the diplomas</p>
      <p>b. Si sono consegnati i diplomi.
SI<sub>generic</sub> are awarded the diplomas
‘Diplomas got awarded’</p>
      <p>c. I diplomi si consegnano (agli studenti).
the diplomas SI<sub>generic</sub> awarded (to the students)
‘Diplomas are getting awarded (to the students)’</p>
      <p>This construction is only possible with
(di)transitive predicates, since the promotion of the
object to the grammatical subject role is only
possible when a direct object is available.</p>
      <p>On the other hand, the impersonal version of si
does not induce the promotion of the internal
argument to the grammatical subject role and, in
fact, this construction is available without any
verbal class restriction:</p>
      <p>(11) a. Si guarda la partita
SI<sub>generic</sub> watches the game
‘We watch the game’</p>
      <p>b. Si dorme
SI<sub>generic</sub> sleeps
‘We sleep’</p>
      <p>c. Si cade
SI<sub>generic</sub> falls
‘We fall’</p>
      <p>In sum, the verbal subcategorization frame (i.e. the verbal
argument structure) could help in isolating the
passive si construction, but not the impersonal one.
As for reflexive si, the full argument structure must
be identified and then either the passive strategy
(deletion and promotion) or the impersonal one
(simple deletion) considered. As a consequence of
the null subject option in Italian, the difference
between impersonal and passive si is often blurred.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Materials and methods</title>
      <p>
        From Repubblica Corpus
        <xref ref-type="bibr" rid="ref1">(Baroni et al 2004)</xref>
        , we
extracted all contexts in which the “si” lemma was
present: the simple query returned 2,737,558 contexts,
each including a left and right context of
at most 8 words around the si + predicate
cluster; each left and right context was cut at full
stops, colons, semicolons, exclamation marks and
question marks, whenever those were found within
the 8-token context. The tagset used in the
Repubblica Corpus distinguishes neither among
reflexive and various types of impersonal forms
(“CLI/si” is the generic tag used) nor among
different verbal classes with respect to their
argument structure (only VB for “be”, VH for
“have”, and VV for other verbs are included). We
therefore decided to manually annotate the first 2,000
contexts returned by our query (0.07% of the total)
using the following scheme, much simplified with
respect to the structural asymmetries revealed by
the discussion in §1: I (impersonal); L (local: the DP
immediately preceding “si” is the correct co-referent); PV
(post-verbal: the first DP after the predicate
following “si” is the correct co-referent); LM
(the DP immediately preceding, in the hierarchical
sense, the reflexive “si” is the correct one, but such
DP is “modified” by a PP or a relative clause); and
A (the referent is not present/retrievable in the
extracted context; the great majority of these are
pro-drop cases, and in just two cases the referent was
lexically realized outside the isolated context).
Both authors annotated the corpus independently
and discussed the cases of disagreement (less
than 1% of the sample) in order to reach an
agreement on the annotation. Table 1 indicates the
distribution of the classes across the annotated
corpus fragment, while Table 2 exemplifies the
classification. Due to the simplicity of this
classification (which essentially focuses on the
identification of the reflexive antecedent, if
present/necessary), we would expect a better
performance compared to any richer classification,
which is apparently necessary according to the
structural analysis previously discussed.
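        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Example</th><th>Translation</th></tr>
            </thead>
            <tbody>
              <tr><td>[i fedeli]<sub>i</sub> si<sub>i</sub> sono tuttavia sciolti</td><td>the faithful, nevertheless, split up</td></tr>
              <tr><td>[il vertice di Dublino]<sub>i</sub> si<sub>i</sub> è dimostrato</td><td>the Dublin summit proved to be …</td></tr>
              <tr><td>nel cortile si<sub>i</sub> stendono [le stuoie]<sub>i</sub></td><td>in the courtyard the mats are unfolded</td></tr>
              <tr><td>per 16 anni si<sub>i</sub> è occupato dei processi</td><td>for 16 years [he] took care of the trials</td></tr>
            </tbody>
          </table>
        </table-wrap>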
      </p>
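      <p>The context extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the query actually run on the corpus: the function name cut_context, the tokenized input and the toy sentence are introduced here for the example, and the boundary marks themselves are assumed to be excluded from the returned contexts.</p>
      <preformat><![CDATA[
```python
# Punctuation at which a context window is cut
# (full stops, colons, semicolons, exclamation and question marks).
BOUNDARIES = {".", ":", ";", "!", "?"}


def cut_context(tokens, start, end, window=8):
    """Return (left, right) contexts around the cluster tokens[start:end].

    Each side holds at most `window` tokens and stops at the first
    boundary mark met when moving away from the cluster.
    """
    left = []
    for tok in reversed(tokens[max(0, start - window):start]):
        if tok in BOUNDARIES:
            break
        left.append(tok)
    left.reverse()

    right = []
    for tok in tokens[end:end + window]:
        if tok in BOUNDARIES:
            break
        right.append(tok)
    return left, right


# Toy input: the middle clause is an example from Table 2,
# the surrounding material is invented for illustration.
tokens = "Ieri pioveva . Nel cortile si stendono le stuoie ; poi si riposa".split()
left, right = cut_context(tokens, start=5, end=7)  # cluster = "si stendono"
print(left, right)
```
]]></preformat>
      <p>In this toy case both sides are shorter than 8 tokens because a full stop and a semicolon fall inside the window.</p>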
    </sec>
    <sec id="sec-5">
      <title>2.1 Classifiers descriptions</title>
      <p>
        Under the “usage-based” approach the
disambiguation (i.e. the interpretation of the correct
referent, if necessary) of the distinct si
constructions should be possible on the basis of the
purely statistical distribution of the (implicit)
features across the corpus
        <xref ref-type="bibr" rid="ref23">(Tomasello 2003 and
related works)</xref>
        . To test this hypothesis we created a
set of classifiers using the Weka environment
        <xref ref-type="bibr" rid="ref11">(Frank et al 2016)</xref>
        . Four different classifiers were built,
each based on the original extracted context of
at most 8 words before and after the clitic si +
predicate cluster (Table 3): a pure Bag-of-Words
(BoW) approach was used for the first two
classifiers, one with only the left context included,
the other with both left and right context; we then
manipulated the left context classifier by substituting
the words with their POS tags (classifier C3-POS-L) and
with a coarser set of POS tags (C4-CPOS-L).
POS and CPOS annotations were obtained using a
free online tool
        <xref ref-type="bibr" rid="ref7">(ItaliaNLP REST API, Cimino &amp;
Dell’Orletta 2016)</xref>
        .
      </p>
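      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Class. ID</th><th>Approach</th><th>Context</th></tr>
          </thead>
          <tbody>
            <tr><td>C1-BOW-L</td><td>BoW</td><td>Left context</td></tr>
            <tr><td>C2-BOW-LR</td><td>BoW</td><td>Left &amp; Right context</td></tr>
            <tr><td>C3-POS-L</td><td>POS</td><td>Left context</td></tr>
            <tr><td>C4-CPOS-L</td><td>CPOS</td><td>Left context</td></tr>
          </tbody>
        </table>
      </table-wrap>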
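      <p>To make the difference between the word-based and the tag-based representations concrete, the following Python sketch builds count vectors for three left contexts taken from the examples in Table 2. It is a simplified illustration, not the Weka pipeline used in the experiments: the function names and the TOY_TAGS mapping are assumptions made here, and the tags are not the ItaliaNLP tagset.</p>
      <preformat><![CDATA[
```python
def build_vocabulary(contexts):
    """Map each distinct symbol found in the contexts to a column index."""
    symbols = sorted({tok for ctx in contexts for tok in ctx})
    return {tok: i for i, tok in enumerate(symbols)}


def bag_of_words(context, vocab):
    """Count vector for one context; the order of the symbols is discarded."""
    vector = [0] * len(vocab)
    for tok in context:
        if tok in vocab:
            vector[vocab[tok]] += 1
    return vector


# Left contexts of three examples from Table 2.
left_contexts = [
    ["i", "fedeli"],
    ["il", "vertice", "di", "Dublino"],
    ["nel", "cortile"],
]

# Hypothetical coarse tags, for illustration only.
TOY_TAGS = {
    "i": "DET", "il": "DET", "fedeli": "NOUN", "vertice": "NOUN",
    "cortile": "NOUN", "di": "ADP", "nel": "ADP", "Dublino": "PROPN",
}

# Word-based representation (as in a BoW classifier).
word_vocab = build_vocabulary(left_contexts)
word_vectors = [bag_of_words(ctx, word_vocab) for ctx in left_contexts]

# Tag-based representation: each word is replaced by its tag first.
tag_contexts = [[TOY_TAGS[tok] for tok in ctx] for ctx in left_contexts]
tag_vocab = build_vocabulary(tag_contexts)
tag_vectors = [bag_of_words(ctx, tag_vocab) for ctx in tag_contexts]

print(len(word_vocab), len(tag_vocab))
print(tag_vectors)
```
]]></preformat>
      <p>In this toy setting the three contexts share no word, so their word vectors do not overlap, whereas after tag substitution all of them share the NOUN column; this is the kind of generalization that replacing words with tags is meant to provide.</p>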
    </sec>
    <sec id="sec-6">
      <title>2.2 Classification algorithms</title>
      <p>Given the baseline accuracy of 49.7%,
obtained by always choosing the
reflexive local class (L classification), we
compared Naïve Bayesian algorithms (i.e.
NaïveBayes, n.bayes in Table 4, and
NaïveBayesMultinomial, n.bayes.mul. in Table 4)
with a decision tree-based algorithm (i.e. J48) and
then with both a 3-layer convolutional network (with an LSTM
layer; conv.net in Table 4) and a simple recurrent
neural network (srnn.net in Table 4), the latter two using
the Weka wrappers for Deeplearning4j 1.5.13, for a
total of 5 algorithms. We ran our experiments
within the Weka 3.8.3 environment with NVIDIA
GPU support (CUDA 10.1). Word embeddings were built
using a larger fragment of left and right contexts
(+/-10 words at most, breaking the left/right
context at full stops) extracted from the Repubblica
corpus including the “si” seed (the first 1,000,000
sentences returned using the publicly available
Sketch Engine search interface).</p>
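      <p>The baseline mentioned above corresponds to a majority-class strategy: every instance is assigned the most frequent label. A minimal Python sketch is given below; the function name and the ten toy labels are invented for illustration and do not reproduce the distribution of the annotated sample.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy obtained
    by assigning that label to every instance."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy annotation using the five classes of the scheme (I, L, PV, LM, A).
toy_labels = ["L"] * 5 + ["A"] * 2 + ["I"] * 2 + ["PV"]
print(majority_baseline(toy_labels))
```
]]></preformat>
      <p>Any classifier that does not improve on this figure has learned nothing beyond the class frequencies.</p>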
    </sec>
    <sec id="sec-7">
      <title>3. Results</title>
      <p>
        The results of the classification tests are reported in
Table 4. Accuracy is the rate of correct
classifications, averaged over
10 experiments with cross-validation
(the standard deviation is given in parentheses);
significance is expressed with respect to the
baseline:  indicates that the accuracy is
significantly better than baseline,  significantly
worse and no sign means no significant difference
        <xref ref-type="bibr" rid="ref24">(pair-wise comparison using corrected resampled
T-Test, Witten &amp; Frank 2005)</xref>
        .
      </p>
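      <p>The correction involved in this kind of test can be made explicit. The sketch below assumes the commonly cited form of the corrected resampled t statistic, in which the variance of the k accuracy differences is multiplied by (1/k + n_test/n_train) instead of 1/k, to account for the overlap between training sets across runs. The function name, the toy differences and the 1800/200 split are illustrative assumptions, not values from the experiments reported here.</p>
      <preformat><![CDATA[
```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for paired accuracy differences.

    diffs   -- per-run differences in accuracy between two classifiers
    n_train -- number of training instances in each run
    n_test  -- number of test instances in each run
    """
    k = len(diffs)
    var = variance(diffs)  # sample variance (k - 1 in the denominator)
    return mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * var)


# Toy differences (classifier minus baseline) over five runs.
toy_diffs = [0.02, 0.04, 0.03, 0.05, 0.01]
t = corrected_resampled_t(toy_diffs, n_train=1800, n_test=200)
print(round(t, 3))
```
]]></preformat>
      <p>The resulting value is then compared with a Student t distribution with k - 1 degrees of freedom; the correction always yields a smaller statistic than the uncorrected paired test, which makes it more conservative.</p>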
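      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>Class. ID</th><th>Algorithm</th><th>Accuracy (SD)</th></tr>
          </thead>
          <tbody>
            <tr><td>baseline</td><td></td><td>49.70%</td></tr>
            <tr><td>C1-BOW-L</td><td>n.bayes</td><td>56.95% (2.79)</td></tr>
            <tr><td>C1-BOW-L</td><td>n.bayes.mul.</td><td>54.28% (2.03)</td></tr>
            <tr><td>C1-BOW-L</td><td>J48</td><td>58.34% (2.48)</td></tr>
            <tr><td>C1-BOW-L</td><td>conv.net</td><td>51.88% (1.44)</td></tr>
            <tr><td>C1-BOW-L</td><td>srnn.net</td><td>39.63% (11.79)</td></tr>
            <tr><td>C2-BOW-LR</td><td>n.bayes</td><td>49.21% (3.40)</td></tr>
            <tr><td>C2-BOW-LR</td><td>n.bayes.mul.</td><td>51.61% (1.17)</td></tr>
            <tr><td>C2-BOW-LR</td><td>J48</td><td>48.66% (2.53)</td></tr>
            <tr><td>C2-BOW-LR</td><td>conv.net</td><td>49.77% (0.41)</td></tr>
            <tr><td>C2-BOW-LR</td><td>srnn.net</td><td>39.05% (12.77)</td></tr>
            <tr><td>C3-POS-L</td><td>n.bayes</td><td>54.49% (2.35)</td></tr>
            <tr><td>C3-POS-L</td><td>n.bayes.mul.</td><td>53.26% (1.99)</td></tr>
            <tr><td>C3-POS-L</td><td>J48</td><td>60.76% (2.97)</td></tr>
            <tr><td>C3-POS-L</td><td>conv.net</td><td>57.58% (1.98)</td></tr>
            <tr><td>C3-POS-L</td><td>srnn.net</td><td>43.52% (7.17)</td></tr>
            <tr><td>C4-CPOS-L</td><td>n.bayes</td><td>59.96% (2.85)</td></tr>
            <tr><td>C4-CPOS-L</td><td>n.bayes.mul.</td><td>50.89% (1.03)</td></tr>
            <tr><td>C4-CPOS-L</td><td>J48</td><td>61.49% (3.08)</td></tr>
            <tr><td>C4-CPOS-L</td><td>conv.net</td><td>49.70% (0.25)</td></tr>
            <tr><td>C4-CPOS-L</td><td>srnn.net</td><td>44.20% (6.17)</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>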
In both the left and the left-right context classifiers, the BoW
approach (C1-BOW-L and C2-BOW-LR) is clearly
not sufficient to solve the classification problem;
the introduction of a right context (C2-BOW-LR)
significantly reduces the performance of the
classifier. Notice that in almost 10% of the cases
the referent is post-verbal (PV
classification). Decision trees (J48), overall,
perform better (M=58.34% SD=2.48), but this
performance represents a significant improvement
only with the C1-BOW-L and C4-CPOS-L classifiers.
None of the deep learning approaches (conv.net
and srnn.net) is significantly better than decision
trees (in some cases SRNs perform significantly
worse). The best absolute performance is obtained
by substituting words with coarse POS tags (C4-CPOS-L).
In this case J48 obtains the best accuracy
(M=61.49% SD=3.08).</p>
    </sec>
    <sec id="sec-8">
      <title>4. Discussion</title>
      <p>
        In this paper, we discussed the nature of some si
constructions in Italian, suggesting that, despite
their apparent simplicity, their structural intricacies
require a deep syntactic analysis to correctly
identify the type of the clitic in various
contexts and to retrieve, when necessary, a proper
referent. Even with a simplified set of five classes
(I = impersonal; L = local, immediately preceding
coreferential DP; PV = local, immediately
post-verbal coreferential DP; LM = local preceding
coreferential DP but with prepositional phrase or
relative clause modification; A = absent referent),
we showed that, on an annotated sample
of the Repubblica corpus, no classifier
exceeded an accuracy of 61.49%.
This is well below any reasonable human
performance (as suggested by the 99% agreement
in classification between annotators). These
results, even though still based on a small fragment
of the Repubblica Corpus, extend
        <xref ref-type="bibr" rid="ref5">Chesi &amp; Moro
(2018)</xref>
        original considerations using a wider
dataset and more advanced ML algorithms.
The results show that neither the algorithms
used nor the extension of the context (both left and
right) helped in correctly classifying the instances
of “si” when the referent had to be retrieved
non-locally or in impersonal “si” cases. Replacing the
words with their POS mildly improved
the performance of some classifiers (especially
using the coarse tagset), with the decision tree
classifier (J48) obtaining the best performance (on
average) across the tests.
      </p>
      <p>Given the poor performance of the classifiers
tested, we conclude that the “usage-based”
intuition is not sufficient here to account for the
acquisition of the discriminative capabilities that any
native speaker of Italian possesses and that enable her/him
to correctly identify the relevant referent both
pre- and post-verbally, even in the case of complex
subjects (referent DPs modified by prepositional
phrases or relative clauses), as well as to recognize
when no referent is needed (in generic/impersonal readings) or
when it must be recovered in case of pro-drop. We might then expect
that a richer syntactic annotation could help to
boost the automatic classification results, in
accordance with the structural analysis
summarized in §1.1 and §1.2: first, a verbal
subcategorization specification properly describing
the predicate argument structure could be useful;
then, a correct analysis of the subject phrase
structure, including agreement cues, should be used,
as well as a richer classification of temporal/modal
adverbials/modifiers.</p>
      <p>
        As suggested by an anonymous reviewer,
information structure, which is largely obliterated
in written texts, is expected to disambiguate
between reflexive and impersonal constructions:
for instance, non-dislocated preverbal subjects
(L(M) in our classification) should be ruled out in
impersonal constructions
        <xref ref-type="bibr" rid="ref18">(see Raposo &amp;
Uriagereka 1996)</xref>
        ; moreover, non-focalized (or
right-dislocated) postverbal subjects (PV in our
classification) should be ruled out in reflexive
constructions. Thus, despite the fact that
prosody/information structure cannot be assessed
within a corpus-based study, we might expect an
improvement in the classifiers’ performance
by considering some relevant features associated with
these configurations: e.g. post-verbal subject
annotation in connection with the verbal class, and
adverbial placement between the subject and the verb
indicating a dislocated subject.
      </p>
      <p>A follow-up of this study should test these
predictions and, possibly, extend the study to the
whole Repubblica corpus, confirming (or
disconfirming) our preliminary results, which suggest
that we cannot avoid a deep structural analysis of these
constructions if we want to classify (and interpret) them
correctly.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>Marco</given-names>
            , Silvia Bernardini, Federica Comastri, Lorenzo Piccioni, Alessandra Volpi, Guy Aston, and Marco
          </string-name>
          <string-name>
            <surname>Mazzoleni</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Introducing the La Repubblica Corpus: A Large, Annotated, TEI (XML)-compliant Corpus of Newspaper Italian</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC</source>
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Basili</surname>
            , Roberto, Maria Teresa Pazienza, and
            <given-names>Michele</given-names>
          </string-name>
          <string-name>
            <surname>Vindigni</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Corpus-driven unsupervised learning of verb subcategorization frames</article-title>
          .
          <source>Congress of the Italian Association for Artificial Intelligence</source>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Belletti</surname>
          </string-name>
          , Adriana.
          <year>2002</year>
          .
          <article-title>Aspects of the low IP area. Forthcoming in The structure of IP and CP</article-title>
          .
          <source>The Cartography of Syntactic Structures</source>
          , vol. 2,
          <string-name>
            <surname>L.</surname>
          </string-name>
          Rizzi (ed.). New York: Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Burzio</surname>
            ,
            <given-names>Luigi</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>On the morphology of reflexives and impersonals. Theoretical analyses in Romance linguistics</article-title>
          . Amsterdam: Benjamins,
          <fpage>399</fpage>
          -
          <lpage>414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chesi</surname>
          </string-name>
          , Cristiano, &amp;
          <string-name>
            <surname>Moro</surname>
            ,
            <given-names>Andrea</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Il divario (apparente) tra gerarchia e tempo</article-title>
          .
          <source>Sistemi intelligenti</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>11</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Chomsky</surname>
            ,
            <given-names>Noam</given-names>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>The Minimalist Program</article-title>
          . Cambridge, MA: MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cimino</surname>
            ,
            <given-names>Andrea</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>Felice</given-names>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Building the state-of-the-art in POS tagging of Italian Tweets</article-title>
          . In
          <source>Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian</source>
          , 7 December, Napoli, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Cinque</surname>
            ,
            <given-names>Guglielmo</given-names>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>On si constructions and the theory of arb</article-title>
          .
          <source>Linguistic Inquiry</source>
          ,
          <volume>19</volume>
          (
          <issue>4</issue>
          ),
          <fpage>521</fpage>
          -
          <lpage>581</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
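          <string-name>
            <surname>Dillon</surname>
            ,
            <given-names>Brian</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Alan</given-names>
            <surname>Mishler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shayne</given-names>
            <surname>Sloggett</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Colin</given-names>
            <surname>Phillips</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Contrasting intrusion profiles for agreement and anaphora: experimental and modeling evidence</article-title>
          .
          <source>Journal of Memory and Language</source>
          ,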
          <volume>69</volume>
          ,
          <fpage>85</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Dobrovie-Sorin</surname>
            ,
            <given-names>Carmen</given-names>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Impersonal se constructions in Romance and the passivization of unergatives</article-title>
          .
          <source>Linguistic Inquiry</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ),
          <fpage>399</fpage>
          -
          <lpage>437</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>Eibe</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Mark A.</given-names>
            <surname>Hall</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The WEKA Workbench</article-title>
          . Online Appendix for
          <source>Data Mining: Practical Machine Learning Tools and Techniques</source>
          , Fourth Edition. Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Grimshaw</surname>
            ,
            <given-names>Jane</given-names>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Argument Structure</article-title>
          . Cambridge, MA: MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Ienco</surname>
            ,
            <given-names>Dino</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Automatic extraction of subcategorization frames for Italian</article-title>
          . In
          <source>LREC08</source>
          , pp.
          <fpage>2094</fpage>
          -
          <lpage>2100</lpage>
          . European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Marantz</surname>
            ,
            <given-names>Alec</given-names>
          </string-name>
          .
          <year>1984</year>
          .
          <article-title>On the Nature of Grammatical Relations</article-title>
          . Cambridge, MA: MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Merlo</surname>
            ,
            <given-names>Paola</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>Suzanne</given-names>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Automatic verb classification based on statistical distributions of argument structure</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>27</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>373</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Pescarini</surname>
            ,
            <given-names>Diego</given-names>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Le costruzioni con si. Italiano, dialetti, lingue romanze</article-title>
          . Roma: Carocci.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Pesetsky</surname>
            ,
            <given-names>David</given-names>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Zero Syntax</article-title>
          . Cambridge, MA: MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Raposo</surname>
            ,
            <given-names>Eduardo</given-names>
          </string-name>
          , &amp;
          <string-name>
            <given-names>Juan</given-names>
            <surname>Uriagereka</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Indefinite SE</article-title>
          .
          <source>Natural Language and Linguistic Theory</source>
          <volume>14</volume>
          :
          <fpage>749</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Reinhart</surname>
            ,
            <given-names>Tanya</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Siloni</surname>
            ,
            <given-names>Tal</given-names>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The lexicon-syntax parameter: Reflexivization and other arity operations</article-title>
          .
          <source>Linguistic Inquiry</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <fpage>389</fpage>
          -
          <lpage>436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Rizzi</surname>
            ,
            <given-names>Luigi</given-names>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Null objects in Italian and the theory of 'pro'</article-title>
          .
          <source>Linguistic Inquiry</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>501</fpage>
          -
          <lpage>558</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Salvi</surname>
            ,
            <given-names>Giampaolo</given-names>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>La formazione della costruzione impersonale in italiano</article-title>
          .
          <source>Linguística: Revista de Estudos Linguísticos da Universidade do Porto</source>
          ,
          <volume>3</volume>
          ,
          <fpage>13</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Sportiche</surname>
            ,
            <given-names>Dominique</given-names>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Partitions and atoms of clause structure: Subjects, agreement, Case and clitics</article-title>
          . New York: Routledge.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Tomasello</surname>
            ,
            <given-names>Michael</given-names>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Constructing a language: A usage-based theory of language acquisition</article-title>
          . Cambridge, MA: Harvard University Press.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>Ian H.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Eibe</given-names>
            <surname>Frank</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques</article-title>
          , 2nd edition. San Francisco: Morgan Kaufmann.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>