                              There and Back Again:
                Cross-Lingual Transfer Learning for Event Detection

                                  Tommaso Caselli, Ahmet Üstün
                       Rijksuniversiteit Groningen, Groningen, The Netherlands
                               {t.caselli|a.ustun}@rug.nl



Abstract

English. In this contribution we investigate the generalisation abilities of a pre-trained multilingual Language Model, namely Multilingual BERT, in different transfer learning scenarios for event detection and classification for Italian and English. Our results show that zero-shot models have satisfying, although not optimal, performances in both languages (average F1 higher than 60 for event detection vs. average F1 ranging between 40 and 50 for event classification). We also show that adding extra fine-tuning data of the evaluation language is not simply beneficial but results in better models when compared to the corresponding non zero-shot transfer ones, achieving highly competitive results when compared to state-of-the-art systems.

1   Introduction

Recently, pre-trained word representations encoded in Language Models (LM) have gained a lot of popularity in Natural Language Processing (NLP) thanks to their ability to encode high-level syntactic-semantic language features and produce state-of-the-art results in various tasks, such as Named Entity Recognition (Peters et al., 2018), Machine Translation (Johnson et al., 2017; Ramachandran et al., 2017), and Text Classification (Eriguchi et al., 2018; Chronopoulou et al., 2019), among others. These models are pre-trained on large amounts of unannotated text and then fine-tuned using the induced LM structure to generalise over specific training data. Given their success in monolingual environments, especially for English, there has been a growing interest in the development of cross-lingual as well as multilingual representations (Vulić and Moens, 2015; Ammar et al., 2016; Conneau et al., 2018; Artetxe et al., 2018) to investigate different cross-lingual transfer learning scenarios, including zero-shot transfer, i.e. the direct application of a model fine-tuned using data in one language to a different test language.

Following the approach in Pires et al. (2019), in this paper we investigate the generalisation abilities of Multilingual BERT (Devlin et al., 2019; https://github.com/google-research/bert) on English (EN) and Italian (IT). Multilingual BERT is particularly well suited for this task because it easily allows the implementation of cross-lingual transfer learning, including zero-shot transfer.

We use event detection as our downstream task, a highly complex semantic task with a well-established tradition in NLP (Ahn, 2006; Ji and Grishman, 2008; Ritter et al., 2012; Nguyen and Grishman, 2015; Huang et al., 2018). The goal of the task is to identify event mentions, i.e. linguistic expressions describing "things" that happen or hold as true in the world, and subsequently classify them according to a (pre-defined) taxonomy. The complexity of the task lies in its high dependence on the context of occurrence of the expressions that may trigger an event mention. Indeed, the eventiveness of an expression is prone to ambiguity because there exists a continuum between eventive and non-eventive readings in the space of event semantics (Araki et al., 2018). Such intrinsic ambiguity of event expressions challenges the generalisation abilities of stochastic models and allows us to investigate the advantages and limits of transfer learning approaches when semantics plays a pivotal role in the resolution of a problem/task.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
We explore different multilingual and cross-lingual aspects of transfer learning with respect to event detection through a series of experiments, focusing on the following research questions:

RQ1 How well do Multilingual BERT fine-tuned models generalise in zero-shot transfer learning scenarios on both languages?

RQ2 Do we obtain more robust models by fine-tuning zero-shot models with additional (training) data of the evaluation language?

Our results show that Multilingual BERT obtains satisfying performances in zero-shot scenarios for the identification of event triggers (average F1 63.53 on Italian and 66.79 on English), while this is not the case for event classification (average F1 42.86 on Italian and 51.26 on English). We also show that extra fine-tuning of the zero-shot models with data of the evaluation language is not just beneficial, but actually gives better results than models fine-tuned on the corresponding test language only (i.e. fine-tuning and test in the same language), and achieves competitive results with state-of-the-art systems developed using dedicated architectures. Our code is available at https://github.com/ahmetustun/BertForEvent.

2   Data

We have used two corpora annotated with event information: the TempEval-3 corpus (TE3) for English (UzZaman et al., 2013) and the EVENTI corpus for Italian (Caselli et al., 2014). The corpora have been independently annotated with language-specific annotation schemes, grounded on a shared metadata markup language for temporal information processing, ISO-TimeML (ISO, 2008), thus sharing definitions and tag names for the markable expressions. The corpora are composed of contemporary news articles (we have excluded the extra test set on historical news from the Italian data set, and the automatically annotated training set from the English one) and have been developed in the context of two evaluation campaigns for temporal processing, namely TempEval-3 and EVENTI@EVALITA 2014.

Events are defined as anything that can be said to happen, or occur, or hold true, with no restriction on parts-of-speech (POS), including verbs, nouns, adjectives, and also prepositional phrases (PP). Every event mention is further assigned to one of 7 possible classes: OCCURRENCE, ASPECTUAL, PERCEPTION, REPORTING, I(NTENSIONAL) STATE, I(NTENSIONAL) ACTION, and STATE, capturing the relationship in which the event participates (such as factual, evidential, reported, intensional). Although the two schemes are semantically interoperable, one of the most relevant annotation differences that may impact the evaluation of the zero-shot models concerns the marking of modal verbs and copulas introducing event nouns, adjectives or PPs: while in English these elements are never annotated as event triggers, in Italian they are. A detailed description of additional language-specific adaptations and differences between English and Italian is reported in Caselli and Sprugnoli (2017).

Tables 1 and 2 illustrate the distribution of the annotated events per POS (token based) and per class (event based), respectively. Neither corpus, as released, had an explicit development section. Following previous work (Caselli, 2018), we generated development sets by excluding from the training data all the documents that composed the test data for Italian and English in the SemEval 2010 TempEval-2 campaign (Verhagen et al., 2010); a sketch of this procedure is given below.

The Italian corpus is larger than the corresponding English version, although the distribution of events, both per POS and per class, is comparable. The different distribution of the REPORTING, I_STATE, I_ACTION, and STATE classes reflects differences in annotation instructions rather than language-specific characteristics. For instance, in Italian, the class REPORTING is assigned only if the event mention is an instance of a speech verb/noun (verba/nomina dicendi), while in English this constraint is less strict.
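As an illustration only (not the authors' released code), the development-set construction described above amounts to carving the SemEval 2010 TempEval-2 test documents out of each training set. The directory layout, file extension, and document IDs below are hypothetical placeholders:

    # Hedged sketch: split a training corpus into train/dev by excluding
    # the documents that served as TempEval-2 test data.
    from pathlib import Path

    # Hypothetical: the actual TempEval-2 test document IDs for the language.
    TE2_TEST_DOC_IDS = {"wsj_0026", "wsj_0032"}

    def split_train_dev(train_dir: str):
        """Return (train_docs, dev_docs): TempEval-2 test documents become dev."""
        train_docs, dev_docs = [], []
        for doc in sorted(Path(train_dir).glob("*.tml")):
            if doc.stem in TE2_TEST_DOC_IDS:
                dev_docs.append(doc)
            else:
                train_docs.append(doc)
        return train_docs, dev_docs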
                          TE3                        EVENTI
POS            Train     Dev    Test     Train     Dev    Test    Examples
Verb           8,141     393     542    11,269     193   2,426    en: run; it: correre
Noun           2,268     124     175     6,710     111   1,499    en: attack; it: attacco
Adjective        165       8      21       610       9     118    en: (is) dormant; it: (è) dormiente
Other/PP          29       1       8       146       1      25    en: on board; it: a bordo
Total         10,603     526     746    18,735     314   4,068

Table 1: Distribution of events per POS in each corpus per Training, Development, and Test data.

                          TE3                        EVENTI
Class          Train     Dev    Test     Train     Dev    Test    Examples
OCCURRENCE     6,530     302     466     9,041     162   1,949    en: run; it: correre
ASPECTUAL        264      33      35       446      14     107    en: start; it: inizio
PERCEPTION        79       4       2       162       2      37    en: see; it: vedere
REPORTING      1,544      67      92       714       8     149    en: say; it: dire
I_STATE          651      29      36     1,599      29     355    en: like; it: piacere
I_ACTION         827      57      47     1,476      25     357    en: attempt; it: tentare
STATE            708      34      68     4,090      61     843    en: keep; it: tenersi
Total         10,603     526     746    17,528     301   3,798

Table 2: Distribution of event classes in each corpus per Training, Development, and Test data.


3   Model

Multilingual BERT (Bidirectional Encoder Representations from Transformers) shares the same framework as the monolingual English BERT_BASE (Devlin et al., 2019). BERT is a pre-trained LM that improves over existing fine-tuning approaches by jointly conditioning on both left and right contexts in all layers to generate pre-trained deep bidirectional representations. Multilingual BERT's architecture contains an encoder consisting of 12 Transformer blocks with 12 self-attention heads (Vaswani et al., 2017) and a hidden size of 768.

Unlike the original BERT, Multilingual BERT is pre-trained on the concatenation of the monolingual Wikipedia pages of 104 languages with a shared wordpiece vocabulary. One of the peculiar characteristics of this multilingual model is that it does not make use of any special marker to signal the input language, nor does it have any mechanism that explicitly indicates that translation-equivalent pairs should have similar representations.

For the fine-tuning, we use a standard sequence tagging model. We apply a softmax classifier over each token by passing the token's last layer of activation to the softmax layer to make a tag prediction. Since BERT's wordpiece tokenizer can split words into multiple tokens, we take the prediction for the first token (piece) per word, ignoring the rest. No parameter tuning was performed; the learning rate was set to 1e-4 and the batch size to 8.
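A minimal sketch of this setup, written with the HuggingFace transformers library, is given below. It is an illustration under stated assumptions rather than the authors' released code (for which see the repository linked in Section 1); the tag inventory is illustrative and the training loop is omitted.

    # Sketch of the sequence tagging head described above: a linear (softmax)
    # classifier over the last-layer activations of Multilingual BERT, keeping
    # the prediction of the first wordpiece of each word.
    import torch
    from torch import nn
    from transformers import BertModel, BertTokenizerFast

    NUM_TAGS = 8  # illustrative: 7 ISO-TimeML classes + O (non-event)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
    encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
    classifier = nn.Linear(encoder.config.hidden_size, NUM_TAGS)

    def tag_sentence(words):
        """Predict one tag per word of a pre-tokenised sentence."""
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        hidden = encoder(**enc).last_hidden_state        # (1, n_pieces, 768)
        piece_preds = classifier(hidden).argmax(-1)[0]   # argmax of the softmax scores
        tags, seen = [], set()
        for i, word_id in enumerate(enc.word_ids()):
            # Keep only the prediction for the first piece of each word.
            if word_id is not None and word_id not in seen:
                seen.add(word_id)
                tags.append(int(piece_preds[i]))
        return tags

    # Settings reported in the paper: learning rate 1e-4, batch size 8.
    # The choice of Adam here is illustrative; the paper does not name the optimiser.
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)
    BATCH_SIZE = 8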
4   Experiments

Event detection is best described as composed of two sub-tasks: first, identify whether a word, w, in a given sentence S is an instance of an event mention, ev_w; and subsequently, assign it to a class C, ev_w ∈ C. We break the experiments into two blocks. In the first block, we investigate the quality of the fine-tuned Multilingual BERT models on the identification of the event mentions only. This is an easier task with respect to classification, as it can be framed as a binary classification task; in this way, we can obtain a sort of maximal threshold for the performance of the zero-shot cross-lingual transfer learning models. In the second block of experiments, we investigate the ability of the models to perform the two sub-tasks "at once", i.e. identifying and classifying an event mention. This is a more complex task, especially in zero-shot transfer learning scenarios, because the ISO-TimeML classes are assigned following syntactic-semantic criteria: the same word can be assigned to different classes according to the specific syntactic context in which it occurs.
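To make the two framings concrete, the following illustration (not taken from the paper's code) shows how the detection sub-task can be derived from class-level token tags by collapsing the seven ISO-TimeML classes into a single EVENT label:

    # Illustration of the two tagging schemes: 7-way class tags for detection
    # plus classification, and their binary collapse for detection only.
    CLASSES = {"OCCURRENCE", "ASPECTUAL", "PERCEPTION", "REPORTING",
               "I_STATE", "I_ACTION", "STATE"}

    def to_detection_tags(class_tags):
        """Collapse class-level tags into the binary detection scheme."""
        return ["EVENT" if tag in CLASSES else "O" for tag in class_tags]

    # e.g. "The army attacked the village": only "attacked" is an event trigger.
    gold = ["O", "O", "OCCURRENCE", "O", "O"]
    assert to_detection_tags(gold) == ["O", "O", "EVENT", "O", "O"]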
For each language pair and direction of the transfer (i.e. EN_train–IT_test vs. IT_train–EN_test), we also benchmark the performance in monolingual fine-tuned transfer scenarios (i.e. IT_train–IT_test vs. EN_train–EN_test), to obtain an upper-bound limit for Multilingual BERT and indirect evidence of the intrinsic quality of the proposed multilingual model. For the English data, we also test the performance using English BERT_BASE, so as to better understand the limits of the multilingual model.

Finally, we compare our results to the best systems that participated in the corresponding evaluation campaigns in each language, as well as to state-of-the-art systems. In particular, we selected:

- HLT-FBK (Mirza and Minard, 2014), a feature-based SVM model for Italian (best system at EVENTI@EVALITA);

- ATT1 (Jung and Stent, 2013), a feature-based MaxEnt model for English (best system for event detection and classification at TempEval-3);

- CRF4TimeML (Caselli and Morante, 2018), a feature-based CRF model for English that has obtained state-of-the-art results on event classification;

- Bi-LSTM-CRF (Reimers and Gurevych, 2017; Caselli, 2018), a neural network model based on a Bi-LSTM with a CRF classifier as final layer. The architecture was originally developed and tested on English (Reimers and Gurevych, 2017), and subsequently adapted to Italian (Caselli, 2018). The English version of the system reports state-of-the-art scores for the event detection task only, while the Italian version obtained state-of-the-art results for detection and classification.

5   Results

All scores for the Multilingual BERT models have been averaged over 5 runs (Reimers and Gurevych, 2017); the standard deviations over the runs are reported in parentheses in the tables. Tables 3 and 4 illustrate the results on the Italian test data for the event detection and the event detection and classification sub-tasks, respectively. Results on the English test data are illustrated in Table 5 for event detection and in Table 6 for event detection and classification. For each experiment, we also report the number of fine-tuning epochs.

The main take-away is that the portability of the zero-shot models is not the same for the two sub-tasks: for the event detection sub-task, both models obtain close results (average F1 63.53 on Italian vs. average F1 66.79 on English), while this is not the case for the event detection and classification sub-task (average F1 42.86 on Italian vs. average F1 51.26 on English), suggesting that the latter sub-task is intrinsically more difficult. We also observe that the zero-shot models have different behaviours with respect to Precision and Recall: the zero-shot transfer on Italian has a high Precision and a low Recall, while the opposite happens on English (for instance, average Precision for event detection is 93.11 on Italian vs. 53.19 on English, while average Recall is 51.71 on Italian vs. 89.92 on English; a similar pattern is observed for the detection and classification sub-task). The stability of the zero-shot models seems to be influenced by the size of the fine-tuning training data. In particular, zero-shot transfer learning on English consistently results in more stable models, as the lower standard deviation scores show when compared to the Italian counterpart (+/- 2.04 for EVENTI_train on the TE3 test data vs. +/- 7.45 for TE3_train on the EVENTI test data for the event detection sub-task; +/- 2.67 for EVENTI_train on the TE3 test data vs. +/- 3.15 for TE3_train on the EVENTI test data for the event detection and classification sub-task).

Annotation differences between the two languages have an impact on the evaluation of the zero-shot models. To measure this, we excluded all modal and copula verbs both as predictions on the English test by the zero-shot Italian model, and as gold labels from the Italian test when applying the zero-shot English model. In both cases we observe an improvement, with an increase of the average F1 to 72.26 on English and 66.01 on Italian. Although other language-specific annotations may be at play, the Italian zero-shot model appears to be more powerful than the English one.

The addition of extra fine-tuning with data from the evaluation language results in a positive outcome, improving performances in both sub-tasks. In three out of the four cases (event detection on English, and event detection and classification on English and Italian), the extra fine-tuning with the full training set of the evaluation language results in better models than the corresponding non zero-shot ones. Adding training material targeting the evaluation test set is a well-known technique in domain adaptation (Daumé III, 2007). Quite surprisingly with respect to previous work that used this approach, we observe an improvement also with respect to fine-tuned transfer scenarios, i.e. models tuned and tested on the same language, suggesting that the multilingual model is actually learning from both languages.

In terms of absolute scores, our results for the zero-shot scenarios are in line with the findings reported in Pires et al. (2019) for typologically related languages, such as English and Italian. However, the limits of zero-shot transfer scenarios seem more evident in semantic tasks than in morpho-syntactic ones. For instance, Pires et al. (2019) report absolute F1 scores comparable to ours for Named Entity Recognition on 4 language pairs, while results on POS tagging achieve an accuracy above 80% on all language pairs. More recently, Wu and Dredze (2019) have shown a behaviour similar to our zero-shot scenarios for Multilingual BERT in a text classification task.
Fine Tuning                    Epochs   EVENTI F1
TE3_train (zero-shot)          1        63.53 (7.45)
TE3_train + EVENTI_dev         1+2      77.57 (1.73)
TE3_train + EVENTI_train       1+1      87.17 (0.56)
EVENTI_train                   1        87.36 (1.16)
(Caselli, 2018)                n/a      87.79
HLT-FBK                        n/a      86.68

Table 3: Event mention detection - test on Italian. Standard deviation over the 5 runs in parentheses.

Fine Tuning                    Epochs   EVENTI F1
TE3_train (zero-shot)          2        42.86 (3.15)
TE3_train + EVENTI_dev         1+2      55.38 (1.34)
TE3_train + EVENTI_train       1+3      73.90 (0.45)
EVENTI_train                   2        73.69 (0.80)
(Caselli, 2018)                n/a      72.97
HLT-FBK                        n/a      67.14

Table 4: Event detection and classification - test on Italian. Standard deviation over the 5 runs in parentheses.

Fine Tuning                    Epochs   TE3 F1
EVENTI_train (zero-shot)       1        66.79 (2.04)
EVENTI_train + TE3_dev         1+2      80.67 (1.11)
EVENTI_train + TE3_train       1+1      81.87 (0.13)
TE3_train                      1        81.39 (1.23)
(Reimers and Gurevych, 2017)   n/a      83.45
ATT1                           n/a      81.05

Table 5: Event mention detection - test on English. Standard deviation over the 5 runs in parentheses.

Fine Tuning                    Epochs   TE3 F1
EVENTI_train (zero-shot)       2        51.26 (2.67)
EVENTI_train + TE3_dev         1+2      64.16 (2.82)
EVENTI_train + TE3_train       1+3      68.97 (0.94)
TE3_train                      2        63.36 (1.47)
CRF4TimeML                     n/a      72.24
ATT1                           n/a      71.88

Table 6: Event detection and classification - test on English. Standard deviation over the 5 runs in parentheses.
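To make the evaluation adjustment from Section 5 concrete, the sketch below drops modal and copula verbs before scoring. The lemma stoplists are illustrative placeholders rather than the exact lists used in the paper:

    # Hedged sketch of the modal/copula exclusion applied when scoring the
    # zero-shot models (Section 5). The stoplists below are illustrative.
    EN_MODALS_COPULAS = {"can", "could", "may", "might", "must", "shall",
                         "should", "will", "would", "be"}
    IT_MODALS_COPULAS = {"potere", "dovere", "volere", "essere"}

    def drop_modals_and_copulas(tokens, tags, lemmas, stoplist):
        """Remove (token, tag) pairs whose lemma is a modal or copula."""
        return [(tok, tag) for tok, tag, lem in zip(tokens, tags, lemmas)
                if lem not in stoplist]

    # Usage: for the English test under the Italian zero-shot model the filter
    # is applied to the predictions; for the Italian test under the English
    # zero-shot model it is applied to the gold labels.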


6   Discussion

Extra fine-tuning  Extra fine-tuning, even with a minimal amount of data, as shown by the results using the development sets, shifts the model's predictions to be more in line with the corresponding language-specific annotations. Furthermore, it reduces the effects of cross-lingual transfer based on the presence of the same word pieces in the fine-tuning and the evaluation languages due to the single multilingual vocabulary of Multilingual BERT (Pires et al., 2019). This also results in an increased stability of the models and a reduction of the differences in the average scores for Precision and Recall with respect to the zero-shot models.

Comparison to other systems  Zero-shot models obtain satisfying, though not optimal, results, as they fall far from both the state-of-the-art models and the best performing systems in the corresponding evaluation exercises (i.e. HLT-FBK for Italian and ATT1 for English). Extra fine-tuning with the development data provides models that are competitive only against the best systems in the evaluation exercises. When the full training data of the target evaluation language is used for extra fine-tuning, results come very close to the state of the art, although only in one case does the Multilingual BERT model actually outperform it (namely, on event detection and classification for Italian). These models also obtain very competitive results with respect to state-of-the-art systems, indicating that multilinguality does not seem to negatively affect the quality of the pre-trained LM. However, the results on English using English BERT_BASE appear to be partially in line with this observation. By applying the same settings, we obtain an average F1 on event detection of 82.85 (Precision: 81.26; Recall: 84.70) and an average F1 for event detection and classification of 71.09. Although the results of the monolingual model are expected to be higher in general, in this case we observe that the differences in performance between the two tasks are not in the same range: BERT_BASE obtains an increase of 2% on event detection, but it reaches almost 11% on event detection and classification. Differences in class labelling between English and Italian (see Section 2) can partially explain this behaviour. However, given the sensitivity of event classification to the syntactic context, these results call for further investigation of the encoding of syntactic information in the monolingual and the multilingual BERT models.

Errors  Comparing the errors of the zero-shot models is not an easy task, mainly because of the language-specific annotations in the two corpora. However, focusing on the three major POS, i.e. nouns, verbs, and adjectives, and on the False Negatives only, both models present a similar proportion of errors, with nouns representing the hardest case (53.84% on Italian vs. 54.90% on English), followed by verbs (30.29% on Italian vs. 17.64% on English) and by adjectives (7.51% on Italian vs. 5.88% on English).
When observing the classification mismatches (i.e. correct event mention but wrong class), both models overgeneralise the OCCURRENCE class in the majority of cases. However, the zero-shot transfer on English actually extends the mis-classification errors, mirroring the distribution of the classes in the Italian training data: in particular, it wrongly classifies English REPORTING events as I_ACTION (33.33%), and OCCURRENCE as STATE (15.51%) or I_ACTION (34.48%). Although the syntactic context may have influenced the classification errors, these patterns further highlight the differences in annotations between the two languages.

7   Conclusion

In this contribution we investigated the generalisation abilities of Multilingual BERT on Italian and English using event detection as a downstream task. The results show that Multilingual BERT seems to handle cross-lingual generalisation between Italian and English in a satisfying way, although with some limitations. Limitations in this case come from two sources: annotation differences in the two languages and, partially, the shared multilingual vocabulary. Zero-shot systems appear to be particularly sensitive to the fine-tuning data and, in these experiments, they provide empirical evidence of the impact of different annotation decisions for events in English and Italian.

We have shown that extra fine-tuning with data of the evaluation language is not only beneficial but may lead to better systems, suggesting that the multilingual model may be combining information from the two languages, thus obtaining competitive results with respect to task-specific architectures. This opens up new strategies for the development of systems that use interoperably annotated data in different languages to improve performance and possibly obtain more robust and portable models across different data distributions.

References

David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8. Association for Computational Linguistics.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings.

Jun Araki, Lamana Mulaffer, Arun Pandian, Yukari Yamakawa, Kemal Oflazer, and Teruko Mitamura. 2018. Interoperable annotation of events and event relations across domains. In Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation, pages 10–20. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia, July. Association for Computational Linguistics.

Tommaso Caselli and Roser Morante. 2018. Systems Agreements and Disagreements in Temporal Processing: An Extensive Error Analysis of the TempEval-3 Task. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, May. European Language Resources Association (ELRA).

Tommaso Caselli and Rachele Sprugnoli. 2017. It-TimeML and the Ita-TimeBank: Language Specific Adaptations for Temporal Annotation. In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation - Volume II, pages 969–988. Springer.

Tommaso Caselli, Rachele Sprugnoli, Manuela Speranza, and Monica Monachini. 2014. EVENTI: EValuation of Events and Temporal INformation at Evalita 2014. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014, pages 27–34. Pisa University Press.

Tommaso Caselli. 2018. Italian Event Detection Goes Deep Learning. In Proceedings of the 5th Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095, Minneapolis, Minnesota, June. Association for Computational Linguistics.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium, October-November. Association for Computational Linguistics.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. ACL 2007, page 256.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Akiko Eriguchi, Melvin Johnson, Orhan Firat, Hideto Kazawa, and Wolfgang Macherey. 2018. Zero-shot cross-lingual classification using multilingual neural machine translation.

Lifu Huang, Heng Ji, Kyunghyun Cho, Ido Dagan, Sebastian Riedel, and Clare Voss. 2018. Zero-shot transfer learning for event extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2160–2170, Melbourne, Australia, July. Association for Computational Linguistics.

SemAf/Time Working Group ISO. 2008. ISO DIS 24617-1: 2008 Language resource management - Semantic annotation framework - Part 1: Time and events. ISO Central Secretariat, Geneva.

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, pages 254–262.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Hyuckchul Jung and Amanda Stent. 2013. ATT1: Temporal annotation using big windows and rich syntactic and semantic features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 20–24, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

Paramita Mirza and Anne-Lyse Minard. 2014. FBK-HLT-time: a complete Italian Temporal Processing system for EVENTI-EVALITA 2014. In Fourth International Workshop EVALITA 2014, pages 44–49.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 365–371.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, July. Association for Computational Linguistics.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391, Copenhagen, Denmark, September. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark, September. Association for Computational Linguistics.

Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104–1112. ACM.

Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62. Association for Computational Linguistics.

Ivan Vulić and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 719–725.

Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077.