<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>There and Back Again: Cross-Lingual Transfer Learning for Event Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmet Üstün</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rijksuniversiteit Groningen</institution>
          ,
          <addr-line>Groningen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <volume>2</volume>
      <fpage>719</fpage>
      <lpage>725</lpage>
      <abstract>
        <p>In this contribution we investigate the generalisation abilities of a pre-trained multilingual Language Model, namely Multilingual BERT, in different transfer learning scenarios for event detection and classification for Italian and English. Our results show that zero-shot models obtain satisfying, although not optimal, performance in both languages (average F1 higher than 60 for event detection vs. average F1 ranging between 40 and 50 for event classification). We also show that adding extra fine-tuning data of the evaluation language is not simply beneficial but results in better models when compared to the corresponding non-zero-shot transfer ones, achieving highly competitive results when compared to state-of-the-art systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Recently, pre-trained word representations
encoded in Language Models (LMs) have gained
a lot of popularity in Natural Language
Processing (NLP) thanks to their ability to encode
high-level syntactic-semantic language features and
produce state-of-the-art results in various tasks,
such as Named Entity Recognition
        <xref ref-type="bibr" rid="ref21">(Peters et
al., 2018)</xref>
        , Machine Translation
        <xref ref-type="bibr" rid="ref17 ref23">(Johnson et al.,
2017; Ramachandran et al., 2017)</xref>
        , and Text
Classification
        <xref ref-type="bibr" rid="ref13 ref9">(Eriguchi et al., 2018; Chronopoulou et
al., 2019)</xref>
        , among others. These models are
pre-trained on large amounts of unannotated text and
then fine-tuned using the induced LM structure
to generalise over specific training data. Given
their success in monolingual environments,
especially for English, there has been a growing
interest in the development of cross-lingual as well
as multilingual representations
        <xref ref-type="bibr" rid="ref10 ref2 ref20 ref29 ref4">(Vulić and Moens,
2015; Ammar et al., 2016; Conneau et al., 2018;
Artetxe et al., 2018)</xref>
        to investigate different
cross-lingual transfer learning scenarios, including
zero-shot transfer, i.e. the direct application of a model
fine-tuned using data in one language to a different
test language.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>
        Following the approach in Pires et al. (2019),
in this paper we investigate the generalisation
abilities of Multilingual BERT
        <xref ref-type="bibr" rid="ref12">(Devlin et al.,
2019)</xref>
        <sup>1</sup> on English (EN) and Italian (IT).
Multilingual BERT is particularly well suited for this
task because it easily allows the implementation
of cross-lingual transfer learning, including
zero-shot transfer.
      </p>
      <p>
        We use event detection as our downstream task,
a highly complex semantic task with a
well-established tradition in NLP
        <xref ref-type="bibr" rid="ref1 ref14 ref16 ref20 ref25 ref29">(Ahn, 2006; Ji and
Grishman, 2008; Ritter et al., 2012; Nguyen and
Grishman, 2015; Huang et al., 2018)</xref>
        . The goal of
the task is to identify event mentions, i.e.
linguistic expressions describing “things” that happen or
hold as true in the world, and subsequently
classify them according to a (pre-defined) taxonomy.
The complexity of the task lies in its high
dependence on the context of occurrence of the
expressions that may trigger an event mention. Indeed,
the eventiveness of an expression is prone to
ambiguity because there exists a continuum between
eventive and non-eventive readings in the space
of event semantics (Araki et al., 2018). Such
intrinsic ambiguity of event expressions challenges
the generalisation abilities of stochastic models
and allows us to investigate the advantages and limits of
transfer learning approaches when semantics has a
pivotal role in the resolution of a problem or task.
      </p>
      <p>We explore different multilingual and
cross-lingual aspects of transfer learning with respect
to event detection through a series of experiments,
focusing on the following research questions:</p>
      <p>RQ1 How well do Multilingual BERT fine-tuned
models generalise in zero-shot transfer
learning scenarios on both languages?</p>
      <p>RQ2 Do we obtain more robust models by
fine-tuning zero-shot models with additional
(training) data of the evaluation language?</p>
      <p>Our results show that Multilingual BERT
obtains satisfying performance in zero-shot
scenarios for the identification of event triggers
(average F1 63.53 on Italian and 66.79 on English),
while this is not the case for event classification
(average F1 42.86 on Italian and 51.26 on
English). We also show that extra fine-tuning of the
zero-shot models with data of the evaluation
language is not just beneficial, but actually gives
better results than models fine-tuned on the
corresponding test language only (i.e. fine-tuning
and test in the same language), and achieves
results competitive with state-of-the-art systems
developed using dedicated architectures. Our
code is available (https://github.com/ahmetustun/BertForEvent).</p>
      <p><sup>1</sup> https://github.com/google-research/bert</p>
    </sec>
    <sec id="sec-2">
      <title>2 Data</title>
      <p>
        We have used two corpora annotated with event
information: the TempEval-3 corpus (TE3) for
English
        <xref ref-type="bibr" rid="ref26">(UzZaman et al., 2013)</xref>
        and the EVENTI
corpus for Italian
        <xref ref-type="bibr" rid="ref7">(Caselli et al., 2014)</xref>
        . The corpora
have been independently annotated with
language-specific annotation schemes, grounded on a shared
metadata markup language for temporal
information processing, ISO-TimeML
        <xref ref-type="bibr" rid="ref15">(ISO, 2008)</xref>
        , thus
sharing definitions and tag names for the
markable expressions. The corpora are composed of
contemporary news articles<sup>2</sup> and have been
developed in the context of two evaluation campaigns
for temporal processing, namely TempEval-3 and
EVENTI@EVALITA 2014.
      </p>
      <p>Events are defined as anything that can
be said to happen, or occur, or hold true,
with no restriction to parts-of-speech (POS),
including verbs, nouns, adjectives, and also
prepositional phrases (PP). Every event
mention is further assigned to one of 7
possible classes: OCCURRENCE, ASPECTUAL,
PERCEPTION, REPORTING, I(NTENSIONAL)_STATE,
I(NTENSIONAL)_ACTION, and STATE,
capturing the relationship the event participates in
(such as factual, evidential, reported, intensional).
Although the two annotation schemes are semantically interoperable, one of the
most relevant annotation differences that may
impact the evaluation of the zero-shot models
concerns the marking of modal verbs and copulas
introducing event nouns, adjectives or PPs. While
in English these elements are never annotated as
event triggers, they are in Italian. A detailed
description of additional language-specific
adaptations and differences between English and Italian
is reported in Caselli and Sprugnoli (2017).</p>
      <p><sup>2</sup> We have excluded the extra test set on historical news
from the Italian data set, and the automatically annotated
training set from the English one.</p>
      <p>
        Tables 1 and 2 illustrate the distribution of the
annotation of events for POS (token based) and
classes (event based), respectively. Both corpora,
when released, did not explicitly have a
development section. Following previous work
        <xref ref-type="bibr" rid="ref5 ref8">(Caselli,
2018)</xref>
        , we generated development sets by
excluding from the training data all the documents that
composed the test data for Italian and English in
the SemEval 2010 TempEval-2 campaign
        <xref ref-type="bibr" rid="ref28">(Verhagen et al., 2010)</xref>
        .
      </p>
      <p>The Italian corpus is larger than the
corresponding English version, although the distribution of
events, both per POS and per class, is
comparable. The different distribution of the
REPORTING, I_STATE, I_ACTION, and STATE classes
reflects differences in annotation instructions rather
than language-specific characteristics. For
instance, in Italian, the class REPORTING is
assigned only if the event mention is an instance of
a speech verb/noun (verba/nomina dicendi), while
in English this constraint is less strict.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Model</title>
      <p>
        Multilingual BERT (Bidirectional Encoder
Representations from Transformers) shares the
same framework as the monolingual English
BERT<sub>BASE</sub>
        <xref ref-type="bibr" rid="ref12">(Devlin et al., 2019)</xref>
        . BERT is
a pre-trained LM that improves over existing
fine-tuning approaches by jointly conditioning on
both left and right contexts in all layers to generate
pre-trained deep bidirectional representations.
Multilingual BERT’s architecture contains an
encoder consisting of 12 Transformer blocks with
12 self-attention heads
        <xref ref-type="bibr" rid="ref27">(Vaswani et al., 2017)</xref>
        , and a
hidden size of 768.
      </p>
      <p>Unlike the original BERT, Multilingual BERT
is pre-trained on the concatenation of monolingual
Wikipedia pages of 104 languages with a shared
WordPiece vocabulary. One of the peculiar
characteristics of this multilingual model is that it does
not make use of any special marker to signal the
input language, nor does it have any mechanism that
explicitly indicates that translation-equivalent pairs
should have similar representations.</p>
      <p>For the fine-tuning, we use a standard sequence
tagging model. We apply a softmax classifier over
each token by passing the token’s last-layer
activation to the softmax layer to make a tag
prediction. Since BERT’s WordPiece tokenizer can split
words into multiple tokens, we take the prediction
for the first token (piece) per word, ignoring the
rest. No parameter tuning was performed: the
learning rate was set to 1e-4 and the batch size to 8.</p>
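      <p>The first-piece strategy described above can be illustrated with a minimal sketch. It does not reproduce the released code: the function names and the toy tokenizer (which simply cuts words into four-character chunks) are hypothetical stand-ins for the real WordPiece tokenizer, and the per-piece tags are made up in place of the softmax output.</p>
      <preformat><![CDATA[
```python
from typing import Callable, List, Sequence, Tuple


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: cut a word into
    chunks of at most max_len characters, marking continuations with '##'."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def first_piece_mask(words: Sequence[str],
                     tokenize: Callable[[str], List[str]]
                     ) -> Tuple[List[str], List[int]]:
    """Return the flat piece sequence and a 0/1 mask that is 1 only on
    the first piece of every word."""
    pieces: List[str] = []
    mask: List[int] = []
    for word in words:
        word_pieces = tokenize(word)
        pieces.extend(word_pieces)
        mask.extend([1] + [0] * (len(word_pieces) - 1))
    return pieces, mask


def word_level_tags(piece_tags: Sequence[str],
                    mask: Sequence[int]) -> List[str]:
    """Keep the tag predicted for the first piece of each word and
    ignore the tags predicted for the remaining pieces."""
    return [tag for tag, keep in zip(piece_tags, mask) if keep == 1]


words = ["The", "earthquake", "destroyed", "houses"]
pieces, mask = first_piece_mask(words, toy_wordpiece)
# One made-up tag per piece, standing in for the classifier output.
piece_tags = ["O", "OCCURRENCE", "O", "O", "OCCURRENCE", "O", "O", "O", "STATE"]
print(word_level_tags(piece_tags, mask))
# prints ['O', 'OCCURRENCE', 'OCCURRENCE', 'O']
```
]]></preformat>
      <p>Note that the tag assigned to the last piece of “houses” (STATE) is discarded, because only the first piece of a word contributes to the word-level prediction.</p>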
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <p>Event detection is best described as composed of
two sub-tasks: first, identify whether a word, <italic>w</italic>, in a
given sentence <italic>S</italic> is an instance of an event
mention, <italic>ev<sub>w</sub></italic>; and subsequently, assign it to a class
<italic>C</italic>, <italic>ev<sub>w</sub></italic> ∈ <italic>C</italic>. We break the experiments into two
blocks: in the first block, we investigate the
quality of the fine-tuned Multilingual BERT models
on the identification of the event mentions only.
This is an easier task with respect to
classification, as it can be framed as a binary classification
task. In this way, we obtain a sort of
maximal threshold for the performance of the
zero-shot cross-lingual transfer learning models. In the
second block of experiments, we investigate the
ability of the models to perform the two
sub-tasks “at once”, i.e. identifying and classifying
an event mention. This is a more complex task,
especially in zero-shot transfer learning scenarios,
because the ISO-TimeML classes are assigned
following syntactic-semantic criteria: the same word
can be assigned to different classes according to
the specific syntactic context in which it occurs.
For each language pair and direction of the transfer
(i.e. EN<sub>train</sub>–IT<sub>test</sub> vs. IT<sub>train</sub>–EN<sub>test</sub>), we also
benchmark the performance in monolingual
fine-tuned transfer scenarios (i.e. IT<sub>train</sub>–IT<sub>test</sub> vs.
EN<sub>train</sub>–EN<sub>test</sub>), to have an upper bound
for Multilingual BERT and indirect evidence of
the intrinsic quality of the proposed multilingual
model. For the English data, we also test the
performance using English BERT<sub>BASE</sub>, so as to better
understand the limits of the multilingual model.</p>
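      <p>As a hedged illustration of how the first block of experiments can be derived from the same annotations as the second, the sketch below collapses class labels into binary detection labels. The tag names "O" (non-event) and "EVENT" are assumptions made for this example and are not taken from the released code.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence

# The seven ISO-TimeML event classes listed in the Data section.
EVENT_CLASSES = {
    "OCCURRENCE", "ASPECTUAL", "PERCEPTION", "REPORTING",
    "I_STATE", "I_ACTION", "STATE",
}
NON_EVENT = "O"    # assumed tag for tokens that are not event mentions
EVENT = "EVENT"    # assumed tag for the binary detection sub-task


def to_detection_labels(class_labels: Sequence[str]) -> List[str]:
    """Collapse every event class into a single EVENT tag so that the
    detection sub-task becomes a binary decision per token."""
    detection = []
    for label in class_labels:
        if label == NON_EVENT:
            detection.append(NON_EVENT)
        elif label in EVENT_CLASSES:
            detection.append(EVENT)
        else:
            raise ValueError(f"unknown label: {label}")
    return detection


sentence_labels = ["O", "REPORTING", "O", "O", "OCCURRENCE", "O", "STATE"]
print(to_detection_labels(sentence_labels))
# prints ['O', 'EVENT', 'O', 'O', 'EVENT', 'O', 'EVENT']
```
]]></preformat>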
      <p>
        Finally, we compare our results to the best
systems that participated in the corresponding
evaluation campaigns in each language, as well as to
state-of-the-art systems. In particular, we selected:
- HLT-FBK
        <xref ref-type="bibr" rid="ref19 ref7">(Mirza and Minard, 2014)</xref>
        , a
feature-based SVM model for Italian (best
system at EVENTI@EVALITA);
- ATT1
        <xref ref-type="bibr" rid="ref18">(Jung and Stent, 2013)</xref>
        , a
feature-based MaxEnt model for English (best
system for event detection and classification at
TempEval-3);
- CRF4TimeML
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref8">(Caselli and Morante, 2018)</xref>
        ,
a feature-based CRF model for English that
has obtained state-of-the-art results on event
classification;
- Bi-LSTM-CRF
        <xref ref-type="bibr" rid="ref23 ref24 ref5 ref6 ref8">(Reimers and Gurevych,
2017; Caselli, 2018)</xref>
        , a neural network
model based on a Bi-LSTM using a CRF
classifier as final layer. The architecture
has been originally developed and tested
on English
        <xref ref-type="bibr" rid="ref23 ref24 ref6">(Reimers and Gurevych, 2017)</xref>
        ,
and subsequently adapted to Italian
        <xref ref-type="bibr" rid="ref5 ref8">(Caselli,
2018)</xref>
        . The English version of the system
reports state-of-the-art scores for the event
detection task only, while the Italian version
obtained state-of-the-art results for detection
and classification.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5 Results</title>
      <p>
        All scores for the Multilingual BERT models
have been averaged over 5 runs
        <xref ref-type="bibr" rid="ref23 ref24 ref6">(Reimers and
Gurevych, 2017)</xref>
        . Subscript numbers correspond
to standard deviation scores. Tables 3 and 4
illustrate the results on the Italian test data for the event
detection and the event detection and classification
sub-tasks, respectively. Results on the English test
are illustrated in Table 5 for event detection and
in Table 6 for event detection and classification.
For each experiment, we also report the number of
fine-tuning epochs.
      </p>
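      <p>The way a score and its subscript can be obtained from repeated runs is sketched below. The five F1 values are hypothetical and serve only to show the computation; the use of the sample (rather than population) standard deviation is also an assumption, since the text does not specify which one was used.</p>
      <preformat><![CDATA[
```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarise_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and the sample standard deviation of the scores
    obtained over several runs, both rounded to two decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)


# Hypothetical F1 scores from five runs (not values from the paper).
runs = [70.0, 72.0, 74.0, 76.0, 78.0]
average, deviation = summarise_runs(runs)
print(f"{average:.2f} (std {deviation:.2f})")
# prints 74.00 (std 3.16)
```
]]></preformat>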
      <p>The main take-away is that the portability of
the zero-shot models is not the same for the two
sub-tasks: for the event detection sub-task, both
models obtain close results (average F1 63.53 on
Italian vs. average F1 66.79 on English), while
this is not the case for the event detection and
classification sub-task (average F1 42.86 on
Italian vs. average F1 51.26 on English),
suggesting that this sub-task is intrinsically more
difficult. We also observe that the zero-shot models
behave differently with respect to Precision
and Recall: the zero-shot transfer on Italian has
a high Precision and a low Recall, while the
opposite happens on English.<sup>4</sup> The stability of the
zero-shot models seems to be influenced by the
size of the fine-tuning training data. In particular,
zero-shot transfer learning on English consistently
results in more stable models, as shown by the lower
standard deviation scores when compared to
the Italian counterpart (± 2.04 for EVENTI<sub>train</sub>
on the TE3 test data vs. ± 7.45 for TE3<sub>train</sub> on
the EVENTI test data for the event detection
sub-task; ± 2.67 for EVENTI<sub>train</sub> on the TE3 test
data vs. ± 3.15 for TE3<sub>train</sub> on the EVENTI test
data for the event detection and classification
sub-task).</p>
      <p><sup>4</sup> For instance, average Precision for event detection is
93.11 on Italian vs. 53.19 on English, while average Recall is
51.71 on Italian and 89.92 on English. A similar
pattern is observed for the detection and classification
sub-task.</p>
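      <p>For readers who want to reproduce this kind of comparison, the following sketch computes token-level Precision, Recall and F1 from gold and predicted tags. It is an assumption about the scoring, not the official evaluation scripts of either campaign: a prediction counts as correct only if it matches the gold tag exactly, so a class mismatch is counted both as a false positive and as a false negative. The tag "O" for non-events and the toy sequences are illustrative.</p>
      <preformat><![CDATA[
```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[str], pred: Sequence[str],
                        non_event: str = "O") -> Tuple[float, float, float]:
    """Token-level scores with strict matching of the predicted tag."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p != non_event and p == g)
    fp = sum(1 for g, p in pairs if p != non_event and p != g)
    fn = sum(1 for g, p in pairs if g != non_event and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example of a cautious tagger: everything it predicts is right,
# but it misses most events (high Precision, low Recall).
gold = ["O", "EVENT", "EVENT", "O", "EVENT", "EVENT"]
pred = ["O", "EVENT", "O", "O", "O", "O"]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
# prints P=1.00 R=0.25 F1=0.40
```
]]></preformat>
      <p>Note that an F1 averaged over runs is in general not equal to the F1 computed from the averaged Precision and Recall, which is why the averages in the footnote cannot be combined directly into the reported average F1.</p>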
      <p>Annotation differences between the two languages
have an impact on the evaluation of the zero-shot
models. To measure this, we excluded all modal
and copula verbs, both as predictions on the
English test by the zero-shot Italian model, and as
gold labels from the Italian test when applying the
zero-shot English model. In both cases we observe
an improvement, with an increase of the average
F1 to 72.26 on English and 66.01 on Italian.
Although other language-specific annotations may be
at play, the Italian zero-shot model appears to be
more powerful than the English one.</p>
      <p>
        The addition of extra fine-tuning with data from
the evaluation language results in a positive
outcome, improving performances in both sub-tasks.
In three out of the four cases (event detection on
English, and event detection and classification on
English and Italian) the extra-fine tuning with the
full training set of the evaluation language results
in better models than the corresponding non
zeroshot ones. Adding training material targeting the
evaluation test is a well know technique in domain
adaptation
        <xref ref-type="bibr" rid="ref11">(Daume´ III, 2007)</xref>
        . Quite surprisingly
with respect to previous work that used this
approach, we observe an improvement also with
respect to fine-tuned transfer scenarios, i.e. models
tuned and tested on the same language,
suggesting that the multilingual model is actually learning
from both languages.
      </p>
      <p>In terms of absolute scores, our results for the
zero-shot scenarios are in line with the findings
reported in Pires et al. (2019) for typologically
related languages, such as English and Italian.
However, limits of zero-shot transfer scenarios seem
more evident in semantic tasks when compared to
morpho-syntactic ones. For instance, Pires et al.
(2019) report absolute F1 scores comparable to
ours on Named Entity Recognition on 4 language
pairs, while results on POS tagging achieve an
accuracy above 80% on all language pairs. More
recently, Wu and Dredze (2019) have shown a
similar behavior to our zero-shot scenarios of
Multilingual BERT in a text classification task.</p>
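      <table-wrap>
        <table>
          <thead>
            <tr><th>Fine Tuning</th><th>Epochs</th><th>EVENTI F1</th></tr>
          </thead>
          <tbody>
            <tr><td>TE3<sub>train</sub> - zero-shot</td><td>1</td><td>42.86<sub>3.15</sub></td></tr>
            <tr><td>TE3<sub>train</sub> + EVENTI<sub>dev</sub></td><td>1 + 2</td><td>55.38<sub>1.34</sub></td></tr>
            <tr><td>TE3<sub>train</sub> + EVENTI<sub>train</sub></td><td>1 + 1</td><td>73.90<sub>0.45</sub></td></tr>
            <tr><td>EVENTI<sub>train</sub></td><td>1</td><td>73.69<sub>0.80</sub></td></tr>
            <tr><td><xref ref-type="bibr" rid="ref5 ref8">(Caselli, 2018)</xref></td><td>n/a</td><td>72.97</td></tr>
            <tr><td>HLT-FBK</td><td>n/a</td><td>67.14</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Fine Tuning</th><th>Epochs</th></tr>
          </thead>
          <tbody>
            <tr><td>EVENTI<sub>train</sub> - zero-shot</td><td>2</td></tr>
            <tr><td>EVENTI<sub>train</sub> + TE3<sub>dev</sub></td><td>1 + 2</td></tr>
            <tr><td>EVENTI<sub>train</sub> + TE3<sub>train</sub></td><td>1 + 3</td></tr>
            <tr><td>TE3<sub>train</sub></td><td>2</td></tr>
            <tr><td><xref ref-type="bibr" rid="ref23 ref24 ref6">(Reimers and Gurevych, 2017)</xref><sup>3</sup></td><td>n/a</td></tr>
            <tr><td>ATT1</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        <bold>Extra fine-tuning</bold> Extra fine-tuning, even with
a minimal amount of data as shown by the results
using the development sets, shifts the model’s
predictions to be more in line with the
corresponding language-specific annotations. Furthermore, it
reduces the effects of cross-lingual transfer based
on the presence of the same word pieces between
the fine-tuned and the evaluation languages due to
the single multilingual vocabulary of Multilingual
BERT
        <xref ref-type="bibr" rid="ref22">(Pires et al., 2019)</xref>
        . This also results in an
increasing stability of the models and a reduction
of the differences in the average scores for
Precision and Recall with respect to the zero-shot
models.
      </p>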
    </sec>
    <sec id="sec-6">
      <title>Comparison to other systems</title>
      <p>Zero-shot models obtain satisfying, though not optimal, results
as they fall far from both the state-of-the-art
models and the best performing systems in the
corresponding evaluation exercises (i.e. HLT-FBK for
Italian and ATT1 for English). Extra fine-tuning
with the development data provides competitive
models against the best systems in the evaluation
exercises only. When the full training data is used
for extra fine-tuning in the target evaluation
language, results are very close to the state of the
art, although only in one case the Multilingual
BERT model is actually outperforming it (namely,
on event detection and classification for Italian).
These models also obtain very competitive results
with respect to state-of-the-art systems, indicating
that multilinguality does not seem to negatively
affect the quality of the pre-trained LM.
However, results on English using English BERT<sub>BASE</sub>
appear to be partially in line with this
observation. By applying the same settings, we obtain
an average F1 on event detection of 82.85<sup>5</sup> and
an average F1 for event detection and
classification of 71.09. Although results of the
monolingual model are expected to be higher in general, in
this case, we observe that the differences in
performance between the two tasks are not in the same
range. BERT<sub>BASE</sub> obtains an increase of 2% on
event detection but it reaches almost 11% on event
detection and classification. Differences in class
labelling between English and Italian (see
Section 2) can partially explain this behaviour.
However, given the sensitivity of event classification to
the syntactic context, these results call for further
investigation on the encoding of syntactic
information between the monolingual and the
multilingual BERT models.</p>
      <p><bold>Errors</bold> Comparing the errors of the zero-shot
models is not an easy task, mainly because of the
language-specific annotations in the two corpora.
However, focusing on the three major POS, i.e.
nouns, verbs, and adjectives, and on the False
Negatives only, both models present similar
proportions of errors, with nouns representing the hardest
case (53.84% on Italian vs. 54.90% on English),
followed by verbs (30.29% on Italian vs. 17.64%
on English), and by adjectives (7.51% on Italian
vs. 5.88% on English). When observing the
classification mismatches (i.e. correct event mention but
wrong class), both models overgeneralise the
OCCURRENCE class in the majority of cases.
However, zero-shot transfer on English actually
extends misclassification errors, mirroring the
distribution of the classes of the Italian training data. In
particular, it wrongly classifies English
REPORTING events as I_ACTION (33.33%), and
OCCURRENCE as STATE (15.51%) or I_ACTION
(34.48%). Although the syntactic context may
have influenced the classification errors, these
patterns further highlight the differences in
annotations between the two languages.</p>
      <p><sup>5</sup> Precision: 81.26; Recall: 84.70</p>
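      <p>A breakdown of False Negatives by part of speech, like the one reported above, can be computed along the following lines. This is only a sketch: the input format (one triple of POS tag, gold event flag and predicted event flag per token) and the toy data are assumptions made for illustration, and the percentages it prints are not results from the paper.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Token = Tuple[str, bool, bool]  # (POS tag, gold is event, predicted is event)


def false_negative_share_by_pos(tokens: Iterable[Token]) -> Dict[str, float]:
    """Percentage of False Negatives (gold event, no event predicted)
    that falls on each part of speech."""
    misses = Counter(pos for pos, gold, pred in tokens if gold and not pred)
    total = sum(misses.values())
    if total == 0:
        return {}
    return {pos: 100.0 * count / total for pos, count in misses.items()}


# Hypothetical tokens, not data from either corpus.
tokens = [
    ("NOUN", True, False),
    ("NOUN", True, False),
    ("VERB", True, False),
    ("ADJ", True, False),
    ("VERB", True, True),    # correctly detected, not a False Negative
    ("NOUN", False, False),  # not an event at all
]
print(false_negative_share_by_pos(tokens))
# prints {'NOUN': 50.0, 'VERB': 25.0, 'ADJ': 25.0}
```
]]></preformat>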
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion</title>
      <p>In this contribution we investigated the
generalisation abilities of Multilingual BERT on Italian
and English using event detection as a downstream
task. The results show that Multilingual BERT
seems to handle cross-lingual generalisation
between Italian and English in a satisfying way,
although with some limitations. Limitations in
this case come from two sources: annotation
differences in the two languages and, partially, the
shared multilingual vocabulary. Zero-shot systems
appear to be particularly sensitive to the
fine-tuning data, and, in these experiments, they
provide empirical evidence of the impact of different
annotation decisions for events in English and
Italian.</p>
      <p>We have shown that extra fine-tuning with data
of the evaluation language not only is beneficial
but it may lead to better systems, suggesting that
the multilingual model may be combining
information from the two languages, and thus obtaining
competitive results with respect to task-specific
architectures. This opens up new strategies for
the development of systems by using interoperable
annotated data in different languages to improve
performances and possibly obtain more robust and
portable models across different data distributions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Ahn</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The stages of event extraction</article-title>
          .
          <source>In Proceedings of the Workshop on Annotating and Reasoning about Time and Events</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Waleed</given-names>
            <surname>Ammar</surname>
          </string-name>
          , George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and
          <string-name>
            <given-names>Noah A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Massively multilingual word embeddings</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          2018.
          <article-title>Interoperable annotation of events and event relations across domains</article-title>
          .
          <source>In Proceedings 14th Joint ACL - ISO Workshop on Interoperable Semantic Annotation</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>20</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Mikel</given-names>
            <surname>Artetxe</surname>
          </string-name>
          , Gorka Labaka, and
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>789</fpage>
          -
          <lpage>798</lpage>
          , Melbourne, Australia, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roser</given-names>
            <surname>Morante</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Systems Agreements and Disagreements in Temporal Processing: An Extensive Error Analysis of the TempEval-3 Task</article-title>
          . In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors,
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          , Paris, France, May. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rachele</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>ItTimeML and the Ita-TimeBank: Language Specific Adaptations for Temporal Annotation</article-title>
          . In Nancy Ide and James Pustejovsky, editors,
          <source>Handbook of Linguistic Annotation - Volume II</source>
          , pages
          <fpage>969</fpage>
          -
          <lpage>988</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          , Rachele Sprugnoli, Manuela Speranza, and
          <string-name>
            <given-names>Monica</given-names>
            <surname>Monachini</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>EVENTI: EValuation of Events and Temporal INformation at Evalita 2014</article-title>
          .
          <source>In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 &amp; and of the Fourth International Workshop EVALITA</source>
          <year>2014</year>
          , pages
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          . Pisa University Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Italian Event Detection Goes Deep Learning</article-title>
          .
          <source>In Proceedings of the 5th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Alexandra</given-names>
            <surname>Chronopoulou</surname>
          </string-name>
          , Christos Baziotis, and
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Potamianos</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>An embarrassingly simple approach for transfer learning from pretrained language models</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>2089</fpage>
          -
          <lpage>2095</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          , Ruty Rinott, Guillaume Lample,
          <string-name>
            <given-names>Adina</given-names>
            <surname>Williams</surname>
          </string-name>
          , Samuel Bowman, Holger Schwenk, and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>XNLI: Evaluating cross-lingual sentence representations</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2475</fpage>
          -
          <lpage>2485</lpage>
          , Brussels, Belgium, October-November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Hal</given-names>
            <surname>Daumé</surname>
          </string-name>
          III.
          <year>2007</year>
          .
          <article-title>Frustratingly easy domain adaptation</article-title>
          .
          <source>ACL</source>
          <year>2007</year>
          , page 256.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Akiko</given-names>
            <surname>Eriguchi</surname>
          </string-name>
          , Melvin Johnson, Orhan Firat, Hideto Kazawa, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Macherey</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Zero-shot cross-lingual classification using multilingual neural machine translation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Lifu</given-names>
            <surname>Huang</surname>
          </string-name>
          , Heng Ji, Kyunghyun Cho, Ido Dagan,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Riedel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Clare</given-names>
            <surname>Voss</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Zero-shot transfer learning for event extraction</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2160</fpage>
          -
          <lpage>2170</lpage>
          , Melbourne, Australia, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          SemAf/Time Working Group ISO.
          <year>2008</year>
          .
          <source>ISO DIS 24617-1:2008</source>
          .
          <article-title>Language resource management - Semantic annotation framework - Part 1: Time and events</article-title>
          . ISO Central Secretariat, Geneva.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Heng</given-names>
            <surname>Ji</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Grishman</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Refining event extraction through cross-document inference</article-title>
          .
          <source>Proceedings of ACL-08: HLT</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Melvin</given-names>
            <surname>Johnson</surname>
          </string-name>
          , Mike Schuster,
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Google's multilingual neural machine translation system: Enabling zero-shot translation</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          :
          <fpage>339</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Hyuckchul</given-names>
            <surname>Jung</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amanda</given-names>
            <surname>Stent</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>ATT1: Temporal annotation using big windows and rich syntactic and semantic features</article-title>
          .
          <source>In Second Joint Conference on Lexical and Computational Semantics (*SEM)</source>
          , Volume
          <volume>2</volume>
          :
          <source>Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval</source>
          <year>2013</year>
          ), pages
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          , Atlanta, Georgia, USA, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Paramita</given-names>
            <surname>Mirza</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anne-Lyse</given-names>
            <surname>Minard</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>FBKHLT-time: a complete Italian Temporal Processing system for EVENTI-EVALITA 2014</article-title>
          .
          <source>In Fourth International Workshop EVALITA 2014</source>
          , pages
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Thien Huu</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Grishman</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Event detection and domain adaptation with convolutional neural networks</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>2</volume>
          : Short Papers), pages
          <fpage>365</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Matthew E</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Telmo</given-names>
            <surname>Pires</surname>
          </string-name>
          , Eva Schlinger, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Garrette</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>How multilingual is multilingual BERT?</article-title>
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          , Florence, Italy, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Prajit</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          , Peter Liu, and
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Unsupervised pretraining for sequence to sequence learning</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>383</fpage>
          -
          <lpage>391</lpage>
          , Copenhagen, Denmark, September. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>348</lpage>
          , Copenhagen, Denmark, September. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Alan</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sam</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.
          <year>2012</year>
          .
          <article-title>Open domain event extraction from Twitter</article-title>
          .
          <source>In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>1104</fpage>
          -
          <lpage>1112</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Naushad</given-names>
            <surname>UzZaman</surname>
          </string-name>
          , Hector Llorens, Leon Derczynski, James Allen,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Verhagen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>SemEval-2013 task 1: TempEval-3: Evaluating time expressions, events, and temporal relations</article-title>
          .
          <source>In Second Joint Conference on Lexical and Computational Semantics (*SEM)</source>
          , Volume
          <volume>2</volume>
          :
          <source>Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval</source>
          <year>2013</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          , Atlanta, Georgia, USA, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Marc</given-names>
            <surname>Verhagen</surname>
          </string-name>
          , Roser Saurí, Tommaso Caselli, and
          <string-name>
            <given-names>James</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>SemEval-2010 task 13: TempEval-2</article-title>
          .
          <source>In Proceedings of the 5th International Workshop on Semantic Evaluation</source>
          , pages
          <fpage>57</fpage>
          -
          <lpage>62</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Vulić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marie-Francine</given-names>
            <surname>Moens</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Shijie</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dredze</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT</article-title>
          .
          <source>arXiv preprint arXiv:1904.09077</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>