Italian Event Detection Goes Deep Learning

Tommaso Caselli
CLCG, Rijksuniversiteit Groningen
Oude Kijk in 't Jatstraat 26, 9712 EK Groningen (NL)
t.caselli@rug.nl

Abstract

English. This paper reports on a set of experiments with different word embeddings to initialize a state-of-the-art Bi-LSTM-CRF network for event detection and classification in Italian, following the EVENTI evaluation exercise. The network obtains a new state-of-the-art result by improving the F1 score for detection by 1.3 points, and by 6.5 points for classification, using a single-step approach. The results also provide further evidence that embeddings have a major impact on the performance of such architectures.

Italiano. This paper describes a series of experiments with different distributional word representations (word embeddings) used to initialize a state-of-the-art Bi-LSTM-CRF neural network for the detection and classification of events in Italian, following the EVENTI evaluation exercise. The network improves the state of the art by 1.3 F1 points for detection, and by 6.5 points for classification, addressing the task with a single system. The analysis of the results provides further support for the fact that distributional word representations have a very high impact on the results of these architectures.

1 Introduction

Current societies are exposed to a continuous flow of information that results in a large production of data (e.g. news articles, micro-blogs, social media posts, among others) at different moments in time. In addition to this, the consumption of information has dramatically changed: more and more people directly access information through social media platforms (e.g. Facebook and Twitter), and are less and less exposed to a diversity of perspectives and opinions. The combination of these factors may easily result in information overload and impenetrable "filter bubbles". Events, i.e. things that happen or hold true in the world, are the basic components of such data streams. Being able to correctly identify and classify them plays a major role in developing robust solutions to deal with the current stream of data (e.g. the storyline framework (Vossen et al., 2015)), as well as in improving the performance of many Natural Language Processing (NLP) applications, such as automatic summarization and question answering (QA).

Event detection and classification has seen growing interest in the NLP community thanks to the availability of annotated corpora (LDC, 2005; Pustejovsky et al., 2003a; O'Gorman et al., 2016; Cybulska and Vossen, 2014) and evaluation campaigns (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013; Bethard et al., 2015; Bethard et al., 2016; Minard et al., 2015). In the context of the 2014 EVALITA Workshop, the EVENTI evaluation exercise (Caselli et al., 2014; https://sites.google.com/site/eventievalita2014/) was organized to promote research in Italian Temporal Processing, of which event detection and classification is a core subtask.

Since the EVENTI campaign, there has been a lack of further research, especially in the application of deep learning models to this task in Italian. The contributions of this paper are the following: i.) the adaptation of a state-of-the-art sequence-to-sequence (seq2seq) neural system to event detection and classification for Italian in a single-step approach; ii.) an investigation of the quality of existing Italian word embeddings for this task; iii.) a comparison against a state-of-the-art discrete classifier.
The pre-trained models and the scripts to run the system (or re-train it) are publicly available at https://github.com/tommasoc80/Event_detection_CLiC-it2018.

2 Task Description

We follow the formulation of the task as specified in the EVENTI exercise: determine the extent and the class of event mentions in a text, according to the It-TimeML <EVENT> tag definition (Subtask B in EVENTI). In EVENTI, the <EVENT> tag is applied to every linguistic expression denoting a situation that happens or occurs, or a state in which something obtains or holds true, regardless of the specific part-of-speech that may realize it. EVENTI distinguishes between single-token and multi-token events, where the latter are restricted to specific cases of eventive multi-word expressions found in lexicographic dictionaries (e.g. "fare le valigie" [to pack]), verbal periphrases (e.g. "(essere) in grado di" [(to be) able to]; "c'è" [there is]), and named events (e.g. "la strage di Beslan" [the Beslan school siege]).

Each event is further assigned to one of 7 possible classes, namely: OCCURRENCE, ASPECTUAL, PERCEPTION, REPORTING, I(NTENSIONAL) STATE, I(NTENSIONAL) ACTION, and STATE. These classes are derived from the English TimeML Annotation Guidelines (Pustejovsky et al., 2003). The TimeML event classes distinguish themselves from other classifications, such as ACE (LDC, 2005) or FrameNet (Baker et al., 1998), because they express relationships the target event participates in (such as factual, evidential, reported, intensional) rather than semantic categories denoting the meaning of the event. This means that the EVENT classes are assigned by taking into account both the semantic and the syntactic context of occurrence of the target event. Readers are referred to the EVENTI Annotation Guidelines for more details (https://sites.google.com/site/eventievalita2014/file-cabinet).

2.1 Dataset

The EVENTI corpus consists of three datasets: the Main Task training data, the Main Task test data, and the Pilot Task test data. The Main Task data consist of contemporary news articles, while the Pilot Task data consist of historical news articles. For our experiments, we focused only on the Main Task. In addition to the training and test data, we also created a Main Task development set by excluding from the training data all the articles that composed the test data of the Italian dataset at the SemEval 2010 TempEval-2 campaign (Verhagen et al., 2010). The new partition of the corpus results in the following distribution of the <EVENT> tag: i.) 17,528 events in the training data, of which 1,207 are multi-token mentions; ii.) 301 events in the development set, of which 13 are multi-token mentions; and finally, iii.) 3,798 events in the Main Task test data, of which 271 are multi-token mentions.

Tables 1 and 2 report, respectively, the distribution of the events per token part-of-speech (POS) and per event class. Not surprisingly, verbs are the largest annotated category, followed by nouns, adjectives, and prepositional phrases. Such a distribution reflects both a kind of "natural" distribution of the realization of events in an Indo-European language and, at the same time, specific annotation choices. For instance, adjectives have been annotated only when in predicative position and introduced by a copula or a copular construction. As for the classes, OCCURRENCE and STATE represent the large majority of all events, followed by the intensional ones (I_STATE and I_ACTION), which express some factual relationship between the target events and their arguments, and finally the others (REPORTING, ASPECTUAL, and PERCEPTION).

POS                      Training   Dev.   Test
Noun                     6,710      111    1,499
Verb                     11,269     193    2,426
Adjective                610        9      118
Preposition              146        1      25
Overall Event Tokens     18,735     314    4,068

Table 1: Distribution of the event mentions per POS per token in all datasets of the EVENTI corpus.

Class            Training   Dev.   Test
OCCURRENCE       9,041      162    1,949
ASPECTUAL        446        14     107
I_STATE          1,599      29     355
I_ACTION         1,476      25     357
PERCEPTION       162        2      37
REPORTING        714        8      149
STATE            4,090      61     843
Overall Events   17,528     301    3,798

Table 2: Distribution of the event mentions per class in all datasets of the EVENTI corpus.

3 System and Experiments

We adapted a publicly available Bi-LSTM network with a CRF classifier as the last layer (Reimers and Gurevych, 2017; https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf). Reimers and Gurevych (2017) demonstrated that word embeddings, among other hyper-parameters, have a major impact on the performance of the network, regardless of the specific task. On the basis of these experimental observations, we decided to investigate the impact of different Italian word embeddings on the Subtask B Main Task of the EVENTI exercise. We thus selected 5 word embeddings for Italian to initialize the network, differing from each other either in the representation model used (word2vec vs. GloVe; CBOW vs. skip-gram), in dimensionality (300 vs. 100), or in the corpora used for their generation (Italian Wikipedia vs. crawled web documents vs. large textual corpora or archives):

• Berardi2015_w2v (Berardi et al., 2015): 300-dimension word embeddings generated using the word2vec (Mikolov et al., 2013) skip-gram model (negative sampling 10, context window 10) from a 2015 dump of the Italian Wikipedia;

• Berardi2015_glove (Berardi et al., 2015): 300-dimension word embeddings generated using the GloVe model (Pennington et al., 2014) from the same 2015 dump of the Italian Wikipedia;

• Fastext-It: 300-dimension word embeddings from the Italian Wikipedia (dump not specified) obtained using the fastText skip-gram model (Bojanowski et al., 2016), where each word is represented as a bag of character n-grams (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md);

• ILC-ItWack (Cimino and Dell'Orletta, 2016): 300-dimension word embeddings generated using the word2vec CBOW model (context window 5) from the ItWack corpus;

• DH-FBK_100 (Tonelli et al., 2017): 100-dimension word and phrase embeddings, generated using the word2vec and phrase2vec models from a 1.3 billion word corpus (Italian Wikipedia, OpenSubtitles2016 (Lison and Tiedemann, 2016), the PAISÀ corpus (http://www.corpusitaliano.it/), and the Gazzetta Ufficiale).

As for the other parameters, the network maintains the optimized configurations used for the event detection task for English (Reimers and Gurevych, 2017): two LSTM layers of 100 units each, Nadam optimizer, variational dropout (0.5, 0.5), gradient normalization (τ = 1), and batch size of 8. Character-level embeddings, learned using a Convolutional Neural Network (CNN) (Ma and Hovy, 2016), are concatenated with the word embedding vector before being fed into the LSTM network. The final layer of the network is a CRF classifier.
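For reference, the hyper-parameter setting and the embedding line-up above can be summarized as a small configuration sketch. The key names here are illustrative, chosen for readability; they do not necessarily match the option names of the UKPLab toolkit.

```python
# Hyper-parameters of the Bi-LSTM-CRF network as reported above
# (Reimers and Gurevych, 2017). Key names are illustrative only.
NETWORK_CONFIG = {
    "lstm_units": [100, 100],         # two LSTM layers of 100 units each
    "optimizer": "nadam",
    "variational_dropout": (0.5, 0.5),
    "clipnorm": 1.0,                  # gradient normalization, tau = 1
    "batch_size": 8,
    "char_representation": "cnn",     # char-CNN features (Ma and Hovy, 2016)
    "classifier": "crf",              # final CRF layer
}

# The five embedding initializations compared in the experiments.
EMBEDDINGS = {
    "Berardi2015_w2v":   {"dim": 300, "model": "word2vec skip-gram"},
    "Berardi2015_glove": {"dim": 300, "model": "GloVe"},
    "Fastext-It":        {"dim": 300, "model": "fastText skip-gram"},
    "ILC-ItWack":        {"dim": 300, "model": "word2vec CBOW"},
    "DH-FBK_100":        {"dim": 100, "model": "word2vec/phrase2vec"},
}

# Only DH-FBK_100 departs from the 300-dimension setting.
small = [name for name, cfg in EMBEDDINGS.items() if cfg["dim"] < 300]
print(small)  # ['DH-FBK_100']
```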
Final layer of the network is a dimensions word embeddings generated us- CRF classifier. ing the GloVe model (Pennington et al., Evaluation is conducted using the EVENTI 2014) from the Italian Wikipedia6 ; evaluation framework. Standard Precision, Recall, • Fastext-It: 300 dimension word embeddings and F1 apply for the event detection. Given that from the Italian Wikipedia 7 obtained us- the extent of an event tag may be composed by ing Bojanovsky’s skip-gram model represen- more than one tokens, systems are evaluated both tation (Bojanowski et al., 2016), where each for strict match, i.e. one point only if all tokens word is represented as a bag of character n- which compose an tag are correctly grams 8 ; identified, and relaxed match, i.e. one point for any correct overlap between the system output and • ILC-ItWack (Cimino and Dell’Orletta, the reference gold data. The classification aspect 2016): 300 dimension word embeddings is evaluated using the F1-attribute score (UzZa- generated by using the word2vec CBOW man et al., 2013), that captures how well a system model 9 from the ItWack corpus; identify both the entity (extent) and attribute (i.e. class) together. • DH-FBK 100 (Tonelli et al., 2017): 100 We approached the task in a single-step by de- dimension word and phrase embeddings, tecting and classifying event mentions at once generated using the word2vec and rather than in the standard two step approach, phrase2vec models, from 1.3 billion i.e. detection first and classification on top of the word corpus (Italian Wikipedia, OpenSub- detected elements. The task is formulated as a titles2016 (Lison and Tiedemann, 2016), seq2seq problem, by converting the original an- PAISA corpus 10 , and the Gazzetta Ufficiale). notation format into an BIO scheme (Beginning, Inside, Outside), with the resulting alphabet being As for the other parameters, the network main- B-class label, I-class label and O. 
Example 1 be- tains the optimized configurations used for the low illustrates a simplified version of the problem 5 Parameters: negative sampling 10, context window 10 for a short sentence: 6 Berardi2015 w2v and Berardi2015 glove uses a 2015 dump of the Italian Wikipedia (1) input problem solution 7 Wikipedia dump not specified. Marco (B-STATE | I-STATE | . . . | O) O 8 https://github.com/facebookresearch/ pensa (B-STATE | I-STATE | . . . | O) B-ISTATE fastText/blob/master/pretrained-vectors. di (B-STATE | I-STATE | . . . | O) O md andare (B-STATE | I-STATE | . . . | O) B-OCCUR 9 Parameters: context window 5. a (B-STATE | I-STATE | . . . | O) O 10 http://www.corpusitaliano.it/ casa (B-STATE | I-STATE | . . . | O) O Strict Evaluation Relaxed Evaluation Embedding Parameter R P F1 F1-class R P F1 F1-class Berardi2015 w2v 0.868 0.868 0.868 0.705 0.892 0.892 0.892 0.725 Berardi2015 Glove 0.848 0.872 0.860 0.697 0.870 0.895 0.882 0.714 Fastext-It 0.897 0.863 0.880 0.736 0.921 0.887 0.903 0.756 ILC-ItWack 0.831 0.884 0.856 0.702 0.860 0.914 0.886 0.725 DH-FBK 100 0.855 0.859 0.857 0.685 0.881 0.885 0.883 0.705 FBK-HLT@EVENTI 2014 0.850 0.884 0.867 0.671 0.868 0.902 0.884 0.685 Table 3: Results for Bubtask B Main Task - Event detection and classification. . (B-STATE | I-STATE | . . . | O) O 3.1 Results and Discussion Results for the experiments are illustrated in Ta- ble 3. We also report the results of the best sys- tem that participated at EVENTI Subtask B, FBK- HLT (Mirza and Minard, 2014). FBK-HLT is a cascade of two SVM classifiers (one for detection Figure 1: Plots of F1 scores of the Bi-LSTM-CRF and one for classification) based on rich linguis- systems against the FBK-HLT system for Event tic features. Figure 1 plots charts comparing F1 Extent (left side) and Event Class (right side). 
The network obtains the best F1 score both for detection (F1 of 0.880 for strict evaluation and 0.903 for relaxed evaluation with the Fastext-It embeddings) and for classification (F1-class of 0.736 for strict evaluation and 0.756 for relaxed evaluation with the Fastext-It embeddings).

The results of the Bi-LSTM-CRF network vary in both evaluation configurations. The differences are mainly due to the embeddings used to initialize the network. The best embedding configuration is Fastext-It, which differs from all the others in the approach used for generating the embeddings. Embedding dimensionality impacts the performance, supporting the findings in (Reimers and Gurevych, 2017), but it seems that the quantity (and variety) of data used to generate the embeddings can have a mitigating effect, as shown by the results of the DH-FBK_100 configuration (especially in the classification subtask, and in the Recall scores for the event extent subtask). Coverage of the embeddings (and consequently, tokenization of the dataset and the embeddings) is a further aspect to take into account, but it seems to have a minor impact with respect to dimensionality. It turns out that (Berardi et al., 2015)'s embeddings are those suffering the most from out-of-vocabulary (OOV) tokens (2.14% and 1.06% in training, 2.77% and 1.84% in test for the word2vec and GloVe models, respectively) with respect to the others. However, they still outperform DH-FBK_100 and ILC-ItWack, whose OOV rates are much lower (0.73% in training and 1.12% in test for DH-FBK_100; 0.74% in training and 0.83% in test for ILC-ItWack).

Although FBK-HLT suffers in the classification subtask, it qualifies as a highly competitive system for the detection subtask. By observing the strict F1 scores, FBK-HLT beats three configurations (DH-FBK_100, ILC-ItWack, Berardi2015_glove; p-value < 0.005 only against Berardi2015_glove and DH-FBK_100, with McNemar's test), almost equals one (Berardi2015_w2v; p-value > 0.005, McNemar's test), and is outperformed only by one (Fastext-It; p-value < 0.005, McNemar's test). In the relaxed evaluation setting, DH-FBK_100 is the only configuration that does not beat FBK-HLT (although the difference is only 0.001 points). Nevertheless, it is remarkable to observe that FBK-HLT has a very high Precision (0.902 in relaxed evaluation mode), which is surpassed by only one embedding configuration, ILC-ItWack. The results also indicate that word embeddings make a major contribution to Recall, supporting observations that distributed representations have better generalization capabilities than discrete feature vectors. This is further supported by the fact that these results are obtained using a single-step approach, where the network has to deal with a total of 15 possible different labels.

We further compared the outputs of the best model, i.e. Fastext-It, against FBK-HLT. As for the event detection subtask, we adopted an event-based analysis rather than a token-based one, as this provides better insights on errors concerning multi-token events and event parts-of-speech (see Table 1 for reference; note that POS are manually tagged for events, not for their components). By analyzing the True Positives, we observe that the Fastext-It model performs better than FBK-HLT on nouns (77.78% vs. 65.64%, respectively) and prepositional phrases (28.00% vs. 16.00%, respectively). Performances are very close for verbs (88.04% vs. 88.49%, respectively) and adjectives (80.50% vs. 79.66%, respectively). These results, especially those for prepositional phrases, indicate that the Bi-LSTM-CRF network structure and embeddings are also much more robust at detecting multi-token instances of events, and difficult realizations of events, such as nouns.

Concerning classification, we focused on the mismatches between correctly identified events (extent layer) and class assignment. The Fastext-It model wrongly assigns the class to only 557 event tokens, compared to 729 cases for FBK-HLT. The distribution of the class errors, in terms of absolute numbers, is the same between the two systems, with the top three wrong classes being, in both cases, OCCURRENCE, I_ACTION and STATE. OCCURRENCE, not surprisingly, is the class that tends to be assigned more often by both systems, being also the most frequent. However, while FBK-HLT largely overgeneralizes OCCURRENCE (59.53% of all class errors), this corresponds to only one third of the errors (37.70%) in the Bi-LSTM-CRF network. Other notable differences concern the I_ACTION (27.82% of errors for the Bi-LSTM-CRF vs. 17.28% for FBK-HLT), STATE (8.79% for the Bi-LSTM-CRF vs. 15.22% for FBK-HLT) and REPORTING (7.89% for the Bi-LSTM-CRF vs. 2.33% for FBK-HLT) classes.

4 Conclusion and Future Work

This paper has investigated the application of different word embeddings for the initialization of a state-of-the-art Bi-LSTM-CRF network to solve the event detection and classification task in Italian, according to the EVENTI exercise. We obtained new state-of-the-art results using the Fastext-It embeddings, and improved the F1-class score by 6.5 points in strict evaluation mode. As for the event detection subtask, we observe a limited improvement (+1.3 points in strict F1), mainly due to gains in Recall. Such results are extremely positive, as the task has been modeled in a single-step approach, i.e. detection and classification at once, for the first time in Italian. Further support that embeddings have a major impact on the performance of neural architectures is provided by the variations in performance of the Bi-LSTM-CRF models. This is due to a combination of factors, such as dimensionality, the (raw) data, and the method used for generating the embeddings. Future work should focus on the development of embeddings that move away from the basic word level, integrating extra layers of linguistic analysis (e.g. syntactic dependencies) (Komninos and Manandhar, 2016), which have proven to be very powerful for the same task in English.

Acknowledgments

The author wants to thank all researchers and research groups who made available their word embeddings and their code. Sharing is caring.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

Giacomo Berardi, Andrea Esuli, and Diego Marcheggiani. 2015. Word embeddings go to Italy: A comparison of models and training datasets. In IIR.

Steven Bethard, Leon Derczynski, Guergana Savova, James Pustejovsky, and Marc Verhagen. 2015. SemEval-2015 Task 6: Clinical TempEval. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 806–814.

Steven Bethard, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen. 2016. SemEval-2016 Task 12: Clinical TempEval. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1052–1062.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016.
Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

T. Caselli, R. Sprugnoli, M. Speranza, and M. Monachini. 2014. EVENTI: EValuation of Events and Temporal INformation at EVALITA 2014. In C. Bosco, F. Dell'Orletta, S. Montemagni, and M. Simi, editors, Evaluation of Natural Language and Speech Tools for Italian, volume 1, pages 27–34. Pisa University Press.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian tweets. In CLiC-it/EVALITA.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 26-31.

Alexandros Komninos and Suresh Manandhar. 2016. Dependency based embeddings for sentence classification tasks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1490–1500.

LDC. 2005. ACE (Automatic Content Extraction) English annotation guidelines for events ver. 5.4.3 2005.07.01. Linguistic Data Consortium.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Anne-Lyse Minard, Manuela Speranza, Eneko Agirre, Itziar Aldabe, Marieke van Erp, Bernardo Magnini, German Rigau, and Ruben Urizar. 2015. SemEval-2015 Task 4: TimeLine: Cross-document event ordering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 778–786.

Paramita Mirza and Anne-Lyse Minard. 2014. FBK-HLT-time: A complete Italian temporal processing system for EVENTI-EVALITA 2014. In Fourth International Workshop EVALITA 2014, pages 44–49.

Tim O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer event description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pages 47–56. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

James Pustejovsky, José M. Castaño, Robert Ingria, Roser Saurí, Robert J. Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R. Radev. 2003. TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3:28–34.

James Pustejovsky, José Castaño, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003a. TimeML: Robust specification of event and temporal expressions in text. In Fifth International Workshop on Computational Semantics (IWCS-5).

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark, September. Association for Computational Linguistics.

Sara Tonelli, Alessio Palmero Aprosio, and Marco Mazzon. 2017. The impact of phrases on Italian lexical simplification. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy.

N. UzZaman, H. Llorens, L. Derczynski, J. Allen, M. Verhagen, and J. Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Proceedings of SemEval-2013, pages 1–9. Association for Computational Linguistics, Atlanta, Georgia, USA.

M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, and J. Pustejovsky. 2007. SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of SemEval-2007, pages 75–80, June.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62. Association for Computational Linguistics.

Piek Vossen, Tommaso Caselli, and Yiota Kontzopoulou. 2015. Storylines for structuring massive streams of news. In Proceedings of the First Workshop on Computing News Storylines, pages 40–49.