There and Back Again: Cross-Lingual Transfer Learning for Event Detection

Tommaso Caselli, Ahmet Üstün
Rijksuniversiteit Groningen, Groningen, The Netherlands
{t.caselli|a.ustun}@rug.nl

Abstract

English. In this contribution we investigate the generalisation abilities of a pre-trained multilingual Language Model, namely Multilingual BERT, in different transfer learning scenarios for event detection and classification for Italian and English. Our results show that zero-shot models have satisfying, although not optimal, performances in both languages (average F1 higher than 60 for event detection vs. average F1 ranging between 40 and 50 for event classification). We also show that adding extra fine-tuning data of the evaluation language is not simply beneficial but results in better models when compared to the corresponding non zero-shot transfer ones, achieving highly competitive results when compared to state-of-the-art systems.

1 Introduction

Recently, pre-trained word representations encoded in Language Models (LM) have gained a lot of popularity in Natural Language Processing (NLP) thanks to their ability to encode high-level syntactic-semantic language features and to produce state-of-the-art results in various tasks, such as Named Entity Recognition (Peters et al., 2018), Machine Translation (Johnson et al., 2017; Ramachandran et al., 2017), and Text Classification (Eriguchi et al., 2018; Chronopoulou et al., 2019), among others. These models are pre-trained on large amounts of unannotated text and then fine-tuned using the induced LM structure to generalise over specific training data. Given their success in monolingual environments, especially for English, there has been a growing interest in the development of cross-lingual as well as multilingual representations (Vulić and Moens, 2015; Ammar et al., 2016; Conneau et al., 2018; Artetxe et al., 2018) to investigate different cross-lingual transfer learning scenarios, including zero-shot transfer, i.e. the direct application of a model fine-tuned on data in one language to a different test language.

Following the approach in Pires et al. (2019), in this paper we investigate the generalisation abilities of Multilingual BERT (Devlin et al., 2019)¹ on English (EN) and Italian (IT). Multilingual BERT is particularly well suited for this task because it easily allows the implementation of cross-lingual transfer learning, including zero-shot transfer.

We use event detection as our downstream task, a highly complex semantic task with a well-established tradition in NLP (Ahn, 2006; Ji and Grishman, 2008; Ritter et al., 2012; Nguyen and Grishman, 2015; Huang et al., 2018). The goal of the task is to identify event mentions, i.e. linguistic expressions describing "things" that happen or hold as true in the world, and to subsequently classify them according to a (pre-defined) taxonomy. The complexity of the task lies in its high dependence on the context of occurrence of the expressions that may trigger an event mention. Indeed, the eventiveness of an expression is prone to ambiguity because there exists a continuum between eventive and non-eventive readings in the space of event semantics (Araki et al., 2018). This intrinsic ambiguity of event expressions challenges the generalisation abilities of stochastic models and makes it possible to investigate the advantages and limits of transfer learning approaches when semantics plays a pivotal role in the resolution of a problem/task.

We explore different multilingual and cross-lingual aspects of transfer learning with respect to event detection through a series of experiments, focusing on the following research questions:

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://github.com/google-research/bert
RQ1 How well do fine-tuned Multilingual BERT models generalise in zero-shot transfer learning scenarios in both languages?

RQ2 Do we obtain more robust models by fine-tuning zero-shot models with additional (training) data of the evaluation language?

Our results show that Multilingual BERT obtains satisfying performances in zero-shot scenarios for the identification of event triggers (average F1 63.53 on Italian and 66.79 on English), while this is not the case for event classification (average F1 42.86 on Italian and 51.26 on English). We also show that extra fine-tuning of the zero-shot models with data of the evaluation language is not just beneficial, but actually gives better results than models fine-tuned on the corresponding test language only (i.e. fine-tuning and test in the same language), and achieves competitive results with state-of-the-art systems developed using dedicated architectures. Our code is available (https://github.com/ahmetustun/BertForEvent).

2 Data

We have used two corpora annotated with event information: the TempEval-3 corpus (TE3) for English (UzZaman et al., 2013) and the EVENTI corpus for Italian (Caselli et al., 2014). The corpora have been independently annotated with language-specific annotation schemes, grounded on a shared metadata markup language for temporal information processing, ISO-TimeML (ISO, 2008), thus sharing definitions and tag names for the markable expressions. The corpora are composed of contemporary news articles² and have been developed in the context of two evaluation campaigns for temporal processing, namely TempEval-3 and EVENTI@EVALITA 2014.

Events are defined as anything that can be said to happen, occur, or hold true, with no restriction on parts-of-speech (POS), including verbs, nouns, adjectives, and also prepositional phrases (PP). Every event mention is further assigned to one of 7 possible classes: OCCURRENCE, ASPECTUAL, PERCEPTION, REPORTING, I(NTENSIONAL) STATE, I(NTENSIONAL) ACTION, and STATE, capturing the relationship in which the event participates (such as factual, evidential, reported, intensional).

Although the two schemes are semantically interoperable, one of the most relevant annotation differences that may impact the evaluation of the zero-shot models concerns the marking of modal verbs and copulas introducing event nouns, adjectives, or PPs. While in English these elements are never annotated as event triggers, in Italian they are. A detailed description of additional language-specific adaptations and differences between English and Italian is reported in Caselli and Sprugnoli (2017).

Tables 1 and 2 illustrate the distribution of the annotation of events per POS (token based) and per class (event based), respectively. Neither corpus, as released, included an explicit development section. Following previous work (Caselli, 2018), we generated development sets by excluding from the training data all the documents that composed the test data for Italian and English in the SemEval 2010 TempEval-2 campaign (Verhagen et al., 2010).

The Italian corpus is larger than the corresponding English version, although the distribution of events, both per POS and per class, is comparable. The different distribution of the REPORTING, I STATE, I ACTION, and STATE classes reflects differences in annotation instructions rather than language-specific characteristics. For instance, in Italian, the class REPORTING is assigned only if the event mention is an instance of a speech verb/noun (verba/nomina dicendi), while in English this constraint is less strict.

² We have excluded the extra test set on historical news from the Italian data set, and the automatically annotated training set from the English one.
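The development-set construction described above, holding out from the training data the documents that form the TempEval-2 test sets, can be sketched as follows. This is an illustration, not the authors' released code; the helper name and document identifiers are assumptions.

```python
# Minimal sketch (assumed implementation) of the train/dev split described
# above: documents whose ids appear in the TempEval-2 test sets are moved
# from the training data into the development set.

def split_train_dev(train_docs, tempeval2_test_ids):
    """Partition a mapping of doc_id -> document by membership in the held-out ids."""
    dev = {doc_id: doc for doc_id, doc in train_docs.items()
           if doc_id in tempeval2_test_ids}
    train = {doc_id: doc for doc_id, doc in train_docs.items()
             if doc_id not in tempeval2_test_ids}
    return train, dev

# toy example with invented document ids
docs = {"wsj_0006": "...", "APW_001": "...", "wsj_0026": "..."}
te2_test = {"wsj_0006", "wsj_0026"}
train, dev = split_train_dev(docs, te2_test)
print(sorted(train), sorted(dev))  # ['APW_001'] ['wsj_0006', 'wsj_0026']
```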
3 Model

Multilingual BERT (Bidirectional Encoder Representations from Transformers) shares the same framework as the monolingual English BERT_BASE (Devlin et al., 2019). BERT is a pre-trained LM that improves over previous fine-tuning approaches by jointly conditioning on both left and right contexts in all layers to generate pre-trained deep bidirectional representations. Multilingual BERT's architecture contains an encoder consisting of 12 Transformer blocks with 12 self-attention heads (Vaswani et al., 2017) and a hidden size of 768.

Unlike the original BERT, Multilingual BERT is pre-trained on the concatenation of the monolingual Wikipedia pages of 104 languages with a shared wordpiece vocabulary. One peculiar characteristic of this multilingual model is that it makes no use of any special marker to signal the input language, nor does it have any mechanism that explicitly indicates that translation-equivalent pairs should have similar representations.

For the fine-tuning, we use a standard sequence tagging model. We apply a softmax classifier over each token by passing the token's last layer of activation to the softmax layer to make a tag prediction. Since BERT's wordpiece tokenizer can split words into multiple tokens, we take the prediction for the first token (piece) of each word, ignoring the rest. No parameter tuning was performed; the learning rate was set to 1e-4 and the batch size to 8.

            TE3                     EVENTI
POS         Train   Dev  Test       Train    Dev  Test     Examples
Verb        8,141   393  542        11,269   193  2,426    en: run; it: correre
Noun        2,268   124  175        6,710    111  1,499    en: attack; it: attacco
Adjectives  165     8    21         610      9    118      en: (is) dormant; it: (è) dormiente
Other/PP    29      1    8          146      1    25       en: on board; it: a bordo
Total       10,603  526  746        18,735   314  4,068

Table 1: Distribution of events per POS in each corpus per Training, Development, and Test data.

            TE3                     EVENTI
Classes     Train   Dev  Test       Train    Dev  Test     Examples
OCCURRENCE  6,530   302  466        9,041    162  1,949    en: run; it: correre
ASPECTUAL   264     33   35         446      14   107      en: start; it: inizio
PERCEPTION  79      4    2          162      2    37       en: see; it: vedere
REPORTING   1,544   67   92         714      8    149      en: say; it: dire
I STATE     651     29   36         1,599    29   355      en: like; it: piacere
I ACTION    827     57   47         1,476    25   357      en: attempt; it: tentare
STATE       708     34   68         4,090    61   843      en: keep; it: tenersi
Total       10,603  526  746        17,528   301  3,798

Table 2: Distribution of event classes in each corpus per Training, Development, and Test data.

4 Experiments

Event detection is best described as composed of two sub-tasks: first, identify whether a word w in a given sentence S is an instance of an event mention, ev_w; and subsequently, assign it to a class C, i.e. ev_w ∈ C. We break the experiments into two blocks. In the first block, we investigate the quality of the fine-tuned Multilingual BERT models on the identification of event mentions only. This is an easier task than classification, as it can be framed as a binary classification task; in this way, we obtain a sort of maximal threshold for the performance of the zero-shot cross-lingual transfer learning models. In the second block of experiments, we investigate the ability of the models to perform the two sub-tasks "at once", i.e. to identify and classify an event mention. This is a more complex task, especially in zero-shot transfer learning scenarios, because the ISO-TimeML classes are assigned following syntactic-semantic criteria: the same word can be assigned to different classes according to the specific syntactic context in which it occurs.
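The first-piece prediction scheme described in Section 3 (one prediction is kept per word, namely the one made on its first wordpiece) can be sketched as follows. The toy tokenizer below merely stands in for BERT's actual wordpiece tokenizer; it is an assumption for illustration, not the authors' code.

```python
# Minimal sketch of first-wordpiece label selection for sequence tagging.
# toy_wordpiece is a stand-in for BERT's wordpiece tokenizer: it splits
# long words into two pieces, mimicking subword segmentation.

def toy_wordpiece(word):
    """Stand-in tokenizer: split words longer than 6 characters in two."""
    if len(word) > 6:
        return [word[:4], "##" + word[4:]]
    return [word]

def first_piece_predictions(words, piece_predictions):
    """Keep one prediction per word: the one on its first wordpiece."""
    kept, i = [], 0
    for word in words:
        pieces = toy_wordpiece(word)
        kept.append(piece_predictions[i])  # prediction on the first piece
        i += len(pieces)                   # skip predictions on the rest
    return kept

words = ["the", "attackers", "ran"]
# pieces: ["the", "atta", "##ckers", "ran"] -> one prediction per piece
piece_preds = ["O", "EVENT", "O", "EVENT"]
print(first_piece_predictions(words, piece_preds))  # ['O', 'EVENT', 'EVENT']
```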
For each language pair and direction of the transfer (i.e. EN_train–IT_test vs. IT_train–EN_test), we also benchmark the performance in monolingual fine-tuned transfer scenarios (i.e. IT_train–IT_test vs. EN_train–EN_test), to obtain an upper bound for Multilingual BERT and indirect evidence of the intrinsic quality of the multilingual model. For the English data, we also test the performance using English BERT_BASE, so as to better understand the limits of the multilingual model.

Finally, we compare our results to the best systems that participated in the corresponding evaluation campaigns in each language, as well as to state-of-the-art systems. In particular, we selected:

- HLT-FBK (Mirza and Minard, 2014), a feature-based SVM model for Italian (best system at EVENTI@EVALITA);

- ATT1 (Jung and Stent, 2013), a feature-based MaxEnt model for English (best system for event detection and classification at TempEval-3);

- CRF4TimeML (Caselli and Morante, 2018), a feature-based CRF model for English that has obtained state-of-the-art results on event classification;

- Bi-LSTM-CRF (Reimers and Gurevych, 2017; Caselli, 2018), a neural network model based on a Bi-LSTM with a CRF classifier as the final layer. The architecture was originally developed and tested on English (Reimers and Gurevych, 2017) and subsequently adapted to Italian (Caselli, 2018). The English version of the system reports state-of-the-art scores for the event detection task only, while the Italian version obtained state-of-the-art results for both detection and classification.

5 Results

All scores for the Multilingual BERT models have been averaged over 5 runs (Reimers and Gurevych, 2017); standard deviations are reported alongside each average score. Tables 3 and 4 illustrate the results on the Italian test data for the event detection and the event detection and classification sub-tasks, respectively. Results on the English test set are illustrated in Table 5 for event detection and in Table 6 for event detection and classification. For each experiment, we also report the number of fine-tuning epochs.

The main take-away is that the portability of the zero-shot models is not the same for the two sub-tasks: for the event detection sub-task, both models obtain close results (average F1 63.53 on Italian vs. average F1 66.79 on English), while this is not the case for the event detection and classification sub-task (average F1 42.86 on Italian vs. average F1 51.26 on English), suggesting that this sub-task is intrinsically more difficult. We also observe that the zero-shot models have different behaviours with respect to Precision and Recall: the zero-shot transfer on Italian has a high Precision and a low Recall, while the opposite happens on English.⁴ The stability of the zero-shot models seems to be influenced by the size of the fine-tuning training data. In particular, zero-shot transfer learning on English consistently results in more stable models, as the lower standard deviation scores show when compared to the Italian counterpart (+/- 2.04 for EVENTI_train on the TE3 test data vs. +/- 7.45 for TE3_train on the EVENTI test data for the event detection sub-task; +/- 2.67 for EVENTI_train on the TE3 test data vs. +/- 3.15 for TE3_train on the EVENTI test data for the event detection and classification sub-task).

⁴ For instance, average Precision for event detection is 93.11 on Italian vs. 53.19 on English, while average Recall is 51.71 on Italian and 89.92 on English, respectively. A similar pattern is observed for the detection and classification sub-task.

Annotation differences in the two languages have an impact on the evaluation of the zero-shot models. To measure this, we excluded all modal and copula verbs both as predictions on the English test set by the zero-shot Italian model, and as gold labels from the Italian test set, when applying the zero-shot English model. In both cases we observe an improvement, with an increase of the average F1 to 72.26 on English and 66.01 on Italian. Although other language-specific annotations may be at play, the Italian zero-shot model appears to be more powerful than the English one.
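The exclusion experiment above amounts to a filtering step applied before scoring. The sketch below is illustrative only: the modal/copula inventory and the data structures are assumptions, not the evaluation script used in the paper.

```python
# Illustrative sketch of the modal/copula exclusion: (token, label) pairs
# whose token is a modal or copula verb are dropped from both predictions
# and gold labels before computing F1. The inventory below is a toy set.

MODALS_COPULAS = {"can", "could", "may", "might", "must", "should",
                  "would", "will", "is", "are", "was", "were", "be"}

def exclude_modals(tokens, labels):
    """Drop token/label pairs whose token is a modal or copula verb."""
    kept = [(t, l) for t, l in zip(tokens, labels)
            if t.lower() not in MODALS_COPULAS]
    return [t for t, _ in kept], [l for _, l in kept]

toks = ["He", "may", "attack", "the", "fort"]
gold = ["O", "EVENT", "EVENT", "O", "O"]   # Italian-style: the modal is annotated
toks2, gold2 = exclude_modals(toks, gold)
print(toks2, gold2)  # ['He', 'attack', 'the', 'fort'] ['O', 'EVENT', 'O', 'O']
```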
The addition of extra fine-tuning with data from the evaluation language results in a positive outcome, improving performances in both sub-tasks. In three out of the four cases (event detection on English, and event detection and classification on English and Italian), extra fine-tuning with the full training set of the evaluation language results in better models than the corresponding non zero-shot ones. Adding training material targeting the evaluation test set is a well-known technique in domain adaptation (Daumé III, 2007). Quite surprisingly with respect to previous work that used this approach, we observe an improvement also with respect to the fine-tuned transfer scenarios, i.e. models tuned and tested on the same language, suggesting that the multilingual model is actually learning from both languages.

In terms of absolute scores, our results for the zero-shot scenarios are in line with the findings reported in Pires et al. (2019) for typologically related languages, such as English and Italian. However, the limits of zero-shot transfer scenarios seem more evident in semantic tasks than in morpho-syntactic ones. For instance, Pires et al. (2019) report absolute F1 scores comparable to ours for Named Entity Recognition on 4 language pairs, while their results on POS tagging achieve an accuracy above 80% on all language pairs. More recently, Wu and Dredze (2019) have shown a behaviour similar to our zero-shot scenarios for Multilingual BERT in a text classification task.

Fine Tuning                   Epochs  EVENTI F1
TE3_train (zero-shot)         1       63.53 (7.45)
TE3_train + EVENTI_dev        1+2     77.57 (1.73)
TE3_train + EVENTI_train      1+1     87.17 (0.56)
EVENTI_train                  1       87.36 (1.16)
(Caselli, 2018)               n/a     87.79
HLT-FBK                       n/a     86.68

Table 3: Event mention detection - test on Italian. Best scores in bold.

Fine Tuning                   Epochs  EVENTI F1
TE3_train (zero-shot)         2       42.86 (3.15)
TE3_train + EVENTI_dev        1+2     55.38 (1.34)
TE3_train + EVENTI_train      1+3     73.90 (0.45)
EVENTI_train                  2       73.69 (0.80)
(Caselli, 2018)               n/a     72.97
HLT-FBK                       n/a     67.14

Table 4: Event detection and classification - test on Italian. Best scores in bold.

Fine Tuning                   Epochs  TE3 F1
EVENTI_train (zero-shot)      1       66.79 (2.04)
EVENTI_train + TE3_dev        1+2     80.67 (1.11)
EVENTI_train + TE3_train      1+1     81.87 (0.13)
TE3_train                     1       81.39 (1.23)
(Reimers and Gurevych, 2017)³ n/a     83.45
ATT1                          n/a     81.05

Table 5: Event mention detection - test on English. Best scores in bold.

Fine Tuning                   Epochs  TE3 F1
EVENTI_train (zero-shot)      2       51.26 (2.67)
EVENTI_train + TE3_dev        1+2     64.16 (2.82)
EVENTI_train + TE3_train      1+3     68.97 (0.94)
TE3_train                     2       63.36 (1.47)
CRF4TimeML                    n/a     72.24
ATT1                          n/a     71.88

Table 6: Event detection and classification - test on English. Best scores in bold.

6 Discussion

Extra fine-tuning. Extra fine-tuning, even with a minimal amount of data, as shown by the results obtained using the development sets, shifts the model's predictions to be more in line with the corresponding language-specific annotations. Furthermore, it reduces the effects of cross-lingual transfer based on the presence of the same word pieces in the fine-tuned and evaluation languages, due to the single multilingual vocabulary of Multilingual BERT (Pires et al., 2019). This also results in increased stability of the models and a reduction of the differences between the average Precision and Recall scores with respect to the zero-shot models.
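The two-stage extra fine-tuning procedure can be summarised schematically as follows. This is an illustration, not the released code: `fine_tune` is a toy stand-in that only records the fine-tuning history, while the dataset names and epoch counts follow Table 3.

```python
# Schematic view of "extra fine-tuning": the model is first fine-tuned on
# the source language, then fine-tuning continues on data from the
# evaluation language. fine_tune is a toy stand-in for a training run.

def fine_tune(history, dataset, epochs):
    """Toy stand-in for a fine-tuning run: record what the model was tuned on."""
    return history + [(dataset, epochs)]

zero_shot   = fine_tune([], "TE3_train", 1)            # tune on EN, test on IT
extra_dev   = fine_tune(zero_shot, "EVENTI_dev", 2)    # + IT dev data  (1+2)
extra_train = fine_tune(zero_shot, "EVENTI_train", 1)  # + IT train data (1+1)
print(extra_train)  # [('TE3_train', 1), ('EVENTI_train', 1)]
```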
Comparison to other systems. Zero-shot models obtain satisfying, though not optimal, results, as they fall far from both the state-of-the-art models and the best performing systems in the corresponding evaluation exercises (i.e. HLT-FBK for Italian and ATT1 for English). Extra fine-tuning with the development data provides models that are competitive only against the best systems in the evaluation exercises. When the full training data is used for extra fine-tuning in the target evaluation language, results are very close to the state of the art, although only in one case does the Multilingual BERT model actually outperform it (namely, on event detection and classification for Italian). These models also obtain very competitive results with respect to state-of-the-art systems, indicating that multilinguality does not seem to negatively affect the quality of the pre-trained LM. However, the results on English using English BERT_BASE appear to be only partially in line with this observation. By applying the same settings, we obtain an average F1 of 82.85 on event detection⁵ and an average F1 of 71.09 on event detection and classification. Although the results of the monolingual model are expected to be higher in general, in this case we observe that the differences in performance between the two tasks are not in the same range: BERT_BASE obtains an increase of 2% on event detection, but the increase reaches almost 11% on event detection and classification. Differences in class labelling between English and Italian (see Section 2) can partially explain this behaviour. However, given the sensitivity of event classification to the syntactic context, these results call for further investigation of the encoding of syntactic information in the monolingual and multilingual BERT models.

⁵ Precision: 81.26; Recall: 84.70.

Errors. Comparing the errors of the zero-shot models is not an easy task, mainly because of the language-specific annotations in the two corpora. However, focusing on the three major POS (nouns, verbs, and adjectives) and on the False Negatives only, both models present similar proportions of errors, with nouns representing the hardest case (53.84% on Italian vs. 54.90% on English), followed by verbs (30.29% on Italian vs. 17.64% on English) and adjectives (7.51% on Italian vs. 5.88% on English). When observing the classification mismatches (i.e. correct event mention but wrong class), both models overgeneralise the OCCURRENCE class in the majority of cases. However, zero-shot transfer on English actually extends the mis-classification errors, mirroring the distribution of the classes in the Italian training data. In particular, it wrongly classifies English REPORTING events as I ACTION (33.33%), and OCCURRENCE events as STATE (15.51%) or I ACTION (34.48%). Although the syntactic context may have influenced the classification errors, these patterns further highlight the differences in annotations between the two languages.

7 Conclusion

In this contribution we investigated the generalisation abilities of Multilingual BERT on Italian and English using event detection as a downstream task. The results show that Multilingual BERT handles cross-lingual generalisation between Italian and English in a satisfying way, although with some limitations. The limitations come from two sources: annotation differences between the two languages and, partially, the shared multilingual vocabulary. Zero-shot systems appear to be particularly sensitive to the fine-tuning data and, in these experiments, they provide empirical evidence of the impact of different annotation decisions for events in English and Italian.

We have shown that extra fine-tuning with data of the evaluation language is not only beneficial but may lead to better systems, suggesting that the multilingual model may be combining information from the two languages, thus obtaining competitive results with respect to task-specific architectures. This opens up new strategies for the development of systems that use interoperable annotated data in different languages to improve performances and possibly obtain more robust and portable models across different data distributions.

References

David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8. Association for Computational Linguistics.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings.

Jun Araki, Lamana Mulaffer, Arun Pandian, Yukari Yamakawa, Kemal Oflazer, and Teruko Mitamura. 2018. Interoperable annotation of events and event relations across domains. In Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation, pages 10–20. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia. Association for Computational Linguistics.

Tommaso Caselli and Roser Morante. 2018. Systems' agreements and disagreements in temporal processing: An extensive error analysis of the TempEval-3 task. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Tommaso Caselli and Rachele Sprugnoli. 2017. It-TimeML and the Ita-TimeBank: Language specific adaptations for temporal annotation. In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation - Volume II, pages 969–988. Springer.

Tommaso Caselli, Rachele Sprugnoli, Manuela Speranza, and Monica Monachini. 2014. EVENTI: EValuation of Events and Temporal INformation at EVALITA 2014. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014, pages 27–34. Pisa University Press.

Tommaso Caselli. 2018. Italian event detection goes deep learning. In Proceedings of the 5th Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. ACL 2007, page 256.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Akiko Eriguchi, Melvin Johnson, Orhan Firat, Hideto Kazawa, and Wolfgang Macherey. 2018. Zero-shot cross-lingual classification using multilingual neural machine translation.

Lifu Huang, Heng Ji, Kyunghyun Cho, Ido Dagan, Sebastian Riedel, and Clare Voss. 2018. Zero-shot transfer learning for event extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2160–2170, Melbourne, Australia. Association for Computational Linguistics.

SemAf/Time Working Group ISO. 2008. ISO DIS 24617-1:2008 Language resource management - Semantic annotation framework - Part 1: Time and events. ISO Central Secretariat, Geneva.

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, pages 254–262.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Hyuckchul Jung and Amanda Stent. 2013. ATT1: Temporal annotation using big windows and rich syntactic and semantic features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 20–24, Atlanta, Georgia, USA. Association for Computational Linguistics.

Paramita Mirza and Anne-Lyse Minard. 2014. FBK-HLT-time: a complete Italian temporal processing system for EVENTI-EVALITA 2014. In Fourth International Workshop EVALITA 2014, pages 44–49.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark. Association for Computational Linguistics.

Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104–1112. ACM.

Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62. Association for Computational Linguistics.

Ivan Vulić and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 719–725.

Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077.