Italian Event Detection Goes Deep Learning

                                       Tommaso Caselli
                 CLCG, Rijksuniversiteit Groningen Oude Kijk in’t Jaatsraaat, 26
                                  9712 EK Groningen (NL)
                          t.caselli@{rug.nl}{gmail.com}


                     Abstract                          media platforms (e.g. Facebook and Twitter), and
                                                       are less and less exposed to a diversity of perspec-
    English. This paper reports on a set of            tives and opinions. The combination of these fac-
    experiments with different word embed-             tors may easily result in information overload and
    dings to initialize a state-of-the-art Bi-         impenetrable “filter bubbles”. Events, i.e. things
    LSTM-CRF network for event detection               that happen or hold as true in the world, are the ba-
    and classification in Italian, following the       sic components of such data stream. Being able to
    EVENTI evaluation exercise. The net-               correctly identify and classify them plays a major
    work obtains a new state-of-the-art result         role to develop robust solutions to deal with the
    by improving the F1 score for detection of         current stream of data (e.g. the storyline frame-
    1.3 points, and of 6.5 points for classifica-      work (Vossen et al., 2015)), as well to improve the
    tion, by using a single step approach. The         performance of many Natural Language Process-
    results also provide further evidence that         ing (NLP) applications such as automatic summa-
    embeddings have a major impact on the              rization and question answering (Q.A.).
    performance of such architectures.
                                                          Event detection and classification has seen a
    Italiano. Questo contributo descrive una           growing interest in the NLP community thanks to
    serie di esperimenti con diverse rappre-           the availability of annotated corpora (LDC, 2005;
    sentazioni distribuzionali di parole (word         Pustejovsky et al., 2003a; O’Gorman et al., 2016;
    embeddings) per inizializzare una rete             Cybulska and Vossen, 2014) and evaluation cam-
    neurale stato dell’arte di tipo Bi-LSTM-           paigns (Verhagen et al., 2007; Verhagen et al.,
    CRF per il riconoscimento e la classi-             2010; UzZaman et al., 2013; Bethard et al., 2015;
    ficazione di eventi in italiano, in base           Bethard et al., 2016; Minard et al., 2015). In
    all’esercizio di valutazione EVENTI. La            the context of the 2014 EVALITA Workshop, the
    rete migliora lo stato dell’arte di 1.3 punti      EVENTI evaluation exercise (Caselli et al., 2014)1
    di F1 per il riconoscimento, e di 6.5              was organized to promote research in Italian Tem-
    punti per la classificazione, affrontando il       poral Processing, of which event detection and
    compito in un unico sistema. L’analisi             classification is a core subtask.
    dei risultati fornisce ulteriore supporto al          Since the EVENTI campaign, there has been a
    fatto che le rappresentazioni distribuzion-        lack of further research, especially in the applica-
    ali di parole hanno un impatto molto alto          tion of deep learning models to this task in Italian.
    nei risultati di queste architetture.              The contributions of this paper are the followings:
                                                       i.) the adaptation of a state-of-the-art sequence to
                                                       sequence (seq2seq) neural system to event detec-
1   Introduction
                                                       tion and classification for Italian in a single step
Current societies are exposed to a continuous flow     approach; ii.) an investigation on the quality of ex-
of information that results in a large production of   isting Italian word embeddings for this task; iii.) a
data (e.g. news articles, micro-blogs, social me-      comparison against a state-of-the-art discrete clas-
dia posts, among others), at different moments in      sifier. The pre-trained models and scripts running
time. In addition to this, the consumption of infor-
mation has dramatically changed: more and more           1
                                                           https://sites.google.com/site/
people directly access information through social      eventievalita2014/
the system (or re-train it) are publicly available. 2 .   addition to the training and test data, we have cre-
                                                          ated also a Main Task development set by exclud-
2     Task Description                                    ing from the training data all the articles that com-
We follow the formulation of the task as specified        posed the test data of the Italian dataset at the Se-
in the EVENTI exercise: determine the extent and          mEval 2010 TempEval-2 campaign (Verhagen et
the class of event mentions in a text, according          al., 2010). The new partition of the corpus results
to the It-TimeML <EVENT> tag definition (Sub-             in the following distribution of the <EVENT>
task B in EVENTI).                                        tag: i) 17,528 events in the training data, of which
   In EVENTI, the tag <EVENT> is applied to               1,207 are multi-token mentions; ii.) 301 events
every linguistic expression denoting a situation          in the development set, of which 13 are multi-
that happens or occurs, or a state in which some-         token mentions; and finally, iii.) 3,798 events in
thing obtains or holds true, regardless of the spe-       the Main task test, of which 271 are multi-token
cific parts-of-speech that may realize it. EVENTI         mentions.
distinguishes between single token and multi-                Tables 1 and 2 report, respectively, the distribu-
tokens events, where the latter are restricted to spe-    tion of the events per token part-of speech (POS)
cific cases of eventive multi-word expressions in         and per event class. Not surprisingly, verbs are the
lexicographic dictionaries (e.g. “fare le valigie”        largest annotated category, followed by nouns, ad-
[to pack]), verbal periphrases (e.g. “(essere) in         jectives, and prepositional phrases. Such a distri-
grado di” [(to be) able to]; “c’è” [there is]), and      bution reflects both a kind of “natural” distribution
named events (e.g. “la strage di Beslan” [Beslan          of the realization of events in an Indo-european
school siege]).                                           language, and, at the same time, specific annota-
   Each event is further assigned to one                  tion choices. For instance, adjectives have been
of 7 possible classes, namely:              OCCUR-        annotated only when in a predicative position and
RENCE,         ASPECTUAL,            PERCEPTION,          when introduced by a copula or a copular con-
REPORTING,                I(NTESIONAL) STATE,             struction. As for the classes, OCCURRENCE and
I(NTENSIONAL) ACTION,               and      STATE.       STATE represent the large majority of all events,
These classes are derived from the English                followed by the intensional ones (I STATE and
TimeML Annotation Guidelines (Pustejovsky                 I ACTION), expressing some factual relationship
et al., 2003). The TimeML event classes dis-              between the target events and their arguments, and
tinguishes with respect to other classifications,         finally the others (REPORTING, ASPECTUAL,
such as ACE (LDC, 2005) or FrameNet (Baker                and PERCEPTION).
et al., 1998), because they expresses relationships
the target event participates in (such as factual,        3   System and Experiments
evidential, reported, intensional) rather than            We adapted a publicly available Bi-LSTM net-
semantic categories denoting the meaning of the           work with a CRF classifier as last layer (Reimers
event. This means that the EVENT classes are              and Gurevych, 2017). 4 (Reimers and Gurevych,
assigned by taking into account both the semantic         2017) demonstrated that word embeddings,
and the syntactic context of occurrence of the            among other hyper-parameters, have a major im-
target event. Readers are referred to the EVENTI          pact on the performance of the network, regardless
Annotation Guidelines for more details3 .                 of the specific task. On the basis of these experi-
2.1    Dataset                                            mental observations, we decided to investigate the
                                                          impact of different Italian word embeddings for
The EVENTI corpus consists of three datasets: the         the Subtask B Main Task of the EVENTI exercise.
Main Task training data, the Main task test data,         We thus selected 5 word embeddings for Italian
and the Pilot task test data. The Main Task data          to initialize the network, differentiating one with
are on contemporary news articles, while the Pi-          respect to each other either for the representation
lot Task on historical news articles. For our ex-         model used (word2vec vs. GloVe; CBOW
periments, we focused only on the Main Task. In           vs. skip-gram), dimensionality (300 vs. 100),
  2
    https://github.com/tommasoc80/Event_                  or corpora used for their generation (Italian
detection_CLiC-it2018
  3                                                         4
    https://sites.google.com/site/                            https://github.com/UKPLab/
eventievalita2014/file-cabinet                            emnlp2017-bilstm-cnn-crf
                                                                   Class              Training    Dev.    Test
   POS                    Training   Dev.    Test                  OCCURRENCE            9,041    162    1,949
   Noun                      6,710   111    1,499                  ASPECTUAL               446     14      107
   Verb                    11,269    193    2,426                  I STATE               1,599     29      355
   Adjective                   610     9      118                  I ACTION              1,476     25      357
   Preposition                 146     1       25                  PERCEPTION              162      2       37
   Overall Event Tokens    18,735    314    4,068                  REPORTING               714      8      149
                                                                   STATE                 4,090     61      843
Table 1: Distribution of the event mentions per                    Overall Events      17,528     301    3,798
POS per token in all datasets of the EVENTI
corpus.                                                       Table 2: Distribution of the event mentions per
                                                              class in all datasets of the EVENTI corpus.


Wikipedia vs. crawled web document vs. large               event detection task for English (Reimers and
textual corpora or archives):                              Gurevych, 2017): two LSTM layers of 100 units
                                                           each, Nadam optimizer, variational dropout (0.5,
  • Berardi2015 w2v (Berardi et al., 2015): 300            0.5), with gradient normalization (τ = 1), and
    dimension word embeddings generated using              batch size of 8. Character-level embeddings,
    the word2vec (Mikolov et al., 2013) skip-              learned using a Convolutional Neural Network
    gram model 5 from the Italian Wikipedia;               (CNN) (Ma and Hovy, 2016), are concatenated
                                                           with the word embedding vector to feed into the
  • Berardi2015 glove (Berardi et al., 2015): 300
                                                           LSTM network. Final layer of the network is a
    dimensions word embeddings generated us-
                                                           CRF classifier.
    ing the GloVe model (Pennington et al.,
                                                              Evaluation is conducted using the EVENTI
    2014) from the Italian Wikipedia6 ;
                                                           evaluation framework. Standard Precision, Recall,
  • Fastext-It: 300 dimension word embeddings              and F1 apply for the event detection. Given that
    from the Italian Wikipedia 7 obtained us-              the extent of an event tag may be composed by
    ing Bojanovsky’s skip-gram model represen-             more than one tokens, systems are evaluated both
    tation (Bojanowski et al., 2016), where each           for strict match, i.e. one point only if all tokens
    word is represented as a bag of character n-           which compose an <EVENT> tag are correctly
    grams 8 ;                                              identified, and relaxed match, i.e. one point for
                                                           any correct overlap between the system output and
  • ILC-ItWack (Cimino and Dell’Orletta,                   the reference gold data. The classification aspect
    2016): 300 dimension word embeddings                   is evaluated using the F1-attribute score (UzZa-
    generated by using the word2vec CBOW                   man et al., 2013), that captures how well a system
    model 9 from the ItWack corpus;                        identify both the entity (extent) and attribute (i.e.
                                                           class) together.
  • DH-FBK 100 (Tonelli et al., 2017): 100                    We approached the task in a single-step by de-
    dimension word and phrase embeddings,                  tecting and classifying event mentions at once
    generated using the word2vec and                       rather than in the standard two step approach,
    phrase2vec models, from 1.3 billion                    i.e. detection first and classification on top of the
    word corpus (Italian Wikipedia, OpenSub-               detected elements. The task is formulated as a
    titles2016 (Lison and Tiedemann, 2016),                seq2seq problem, by converting the original an-
    PAISA corpus 10 , and the Gazzetta Ufficiale).         notation format into an BIO scheme (Beginning,
                                                           Inside, Outside), with the resulting alphabet being
   As for the other parameters, the network main-
                                                           B-class label, I-class label and O. Example 1 be-
tains the optimized configurations used for the
                                                           low illustrates a simplified version of the problem
   5
     Parameters: negative sampling 10, context window 10   for a short sentence:
   6
     Berardi2015 w2v and Berardi2015 glove uses a 2015
dump of the Italian Wikipedia                              (1) input    problem                          solution
   7
     Wikipedia dump not specified.                              Marco (B-STATE | I-STATE | . . . | O)    O
   8
     https://github.com/facebookresearch/                       pensa (B-STATE | I-STATE | . . . | O)    B-ISTATE
fastText/blob/master/pretrained-vectors.                        di     (B-STATE | I-STATE | . . . | O)   O
md                                                              andare (B-STATE | I-STATE | . . . | O)   B-OCCUR
   9
     Parameters: context window 5.                              a      (B-STATE | I-STATE | . . . | O)   O
  10
     http://www.corpusitaliano.it/                              casa   (B-STATE | I-STATE | . . . | O)   O
                                                        Strict Evaluation                   Relaxed Evaluation
          Embedding Parameter               R       P      F1 F1-class          R       P        F1 F1-class
          Berardi2015 w2v               0.868   0.868   0.868      0.705    0.892   0.892     0.892      0.725
          Berardi2015 Glove             0.848   0.872   0.860      0.697    0.870   0.895     0.882      0.714
          Fastext-It                    0.897   0.863   0.880      0.736    0.921   0.887     0.903     0.756
          ILC-ItWack                    0.831   0.884   0.856      0.702    0.860   0.914     0.886     0.725
          DH-FBK 100                    0.855   0.859   0.857      0.685    0.881   0.885     0.883      0.705
          FBK-HLT@EVENTI 2014           0.850   0.884   0.867      0.671    0.868   0.902     0.884     0.685

             Table 3: Results for Bubtask B Main Task - Event detection and classification.

      .      (B-STATE | I-STATE | . . . | O)    O


3.1   Results and Discussion
Results for the experiments are illustrated in Ta-
ble 3. We also report the results of the best sys-
tem that participated at EVENTI Subtask B, FBK-
HLT (Mirza and Minard, 2014). FBK-HLT is a
cascade of two SVM classifiers (one for detection            Figure 1: Plots of F1 scores of the Bi-LSTM-CRF
and one for classification) based on rich linguis-           systems against the FBK-HLT system for Event
tic features. Figure 1 plots charts comparing F1             Extent (left side) and Event Class (right side). F1
scores of the network initialized with each of the           scores refers to the
five embeddings against the FBK-HLT system for
the event detection and classification tasks, respec-        0.83% in test for ILC-ItWack).
tively.                                                         The network obtains the best F1 score, both for
   The results of the Bi-LSTM-CRF network are                detection (F1 of 0.880 for strict evaluation and
varied in both evaluation configurations. The dif-           0.903 for relaxed evaluation with Fastext-It em-
ferences are mainly due to the embeddings used to            beddings) and for classification (F1-class of 0.756
initialize the network. The best embedding con-              for strict evaluation, and 0.751 for relaxed evalua-
figuration is Fastext-It that differentiate from all         tion with Fastext-It embeddings). Although FBK-
the others for the approach used for generating              HLT suffers in the classification subtask, it quali-
the embeddings. Embedding’s dimensionality im-               fies as a highly competitive system for the detec-
pacts on the performances supporting the findings            tion subtask. By observing the strict F1 scores,
in (Reimers and Gurevych, 2017), but it seems                FBK-HLT beats three configurations (DH-FBK-
that the quantity (and variety) of data used to gen-         100, ILC-ItWack, Berardi2015 Glove) 11 , almost
erate the embeddings can have a mitigating effect,           equals one (Berardi2015 w2v) 12 , and it is outper-
as shown by the results of the DH-FBK-100 con-               formed only by one (Fastext-It) 13 . In the relaxed
figuration (especially in the classification subtask,        evaluation setting, DH-FBK-100 is the only con-
and in the Recall scores for the event extent sub-           figuration that does not beat FBK-HLT (although
task). Coverage of the embeddings (and conse-                the difference is only 0.001 point). Nevertheless, it
quenlty, tokenization of the dataset and the em-             is remarkable to observe that FBK-HLT has a very
beddings) is a further aspect to keep into account,          high Precision (0.902, relaxed evaluation mode),
but it seems to have a minor impact with respect             that is overcome by only one embedding config-
to dimensionality. It turns out that (Berardi et al.,        uration, ILC-ItWack. The results also indicates
2015)’s embeddings are those suffering the most              that word embeddings have a major contribution
from out of vocabulary (OVV) tokens (2.14% and               on Recall, supporting observations that distributed
1.06% in training, 2.77% and 1.84% in test for the           representations have better generalization capabil-
word2vec model and GloVe, respectively) with                 ities than discrete feature vectors. This is further
respect to the others. However, they still outper-              11
                                                                 p-value < 0.005 only against Berardi2015 Glove and
form DH-FBK 100 and ILC-ItWack, whose OVV                    DH-FBK-100, with McNemar’s test.
are much lower (0.73% in training and 1.12%                   12
                                                                 p-value > 0.005 with McNemar’s test.
                                                              13
in test for DH-FBK 100; 0.74% in training and                    p-value < 0.005 with McNemar’s test.
supported by the fact that these results are obtained         solve the event detection and classification task
using a single step approach, where the network               in Italian, according to the EVENTI exercise.
has to deal with a total of 15 possible different la-         We obtained new state-of-the-art results using the
bels.                                                         Fastext-It embeddings, and improved the F1-class
   We further compared the outputs of the best                score of 6.5 points in strict evaluation mode. As
model, i.e. Fastext-It, against FBK-HLT. As for               for the event detection subtask, we observe a lim-
the event detection subtask, we have adopted an               ited improvement (+1.3 points in strict F1), mainly
event-based analysis rather than a token based                due to gains in Recall. Such results are extremely
one, as this will provide better insights on errors           positive as the task has been modeled in a single
concerning multi-token events and event parts-of-             step approach, i.e. detection and classification at
speech (see Table 1 for reference). 14 By analyzing           once, for the first time in Italian. Further sup-
the True Positives, we observe that the Fastext-              port that embeddings have a major impact in the
It model has better performances than FBK-HLT                 performance of neural architectures is provided,
with nouns (77.78% vs. 65.64%, respectively) and              as the variations in performance of the Bi-LSMT-
prepositional phrases (28.00% vs. 16.00%, re-                 CRF models show. This is due to a combination
spectively). Performances are very close for verbs            of factors such as dimensionality, (raw) data, and
(88.04% vs. 88.49%, respectively) and adjectives              the method used for generating the embeddings.
(80.50% vs. 79.66%, respectively). These re-                     Future work should focus on the development of
sults, especially those for prepositional phrases,            embeddings that move away from the basic word
indicates that the Bi-LSTM-CRF network struc-                 level, integrating extra layers of linguistic analy-
ture and embeddings are also much more robust                 sis (e.g. syntactic dependencies) (Komninos and
at detecting multi-tokens instances of events, and            Manandhar, 2016), that have proven to be very
difficult realizations of events, such as nouns.              powerful for the same task in English.
   Concerning the classification, we focused
on the mismatches between correctly identified                Acknowledgments
events (extent layer) and class assignment. The               The author wants to thank all researchers and re-
Fastext-It model wrongly assigns the class to only            search groups who made available their word em-
557 event tokens compared to the 729 cases for                beddings and their code. Sharing is caring.
FBK-HLT. The distribution of the class errors, in
terms of absolute numbers, is the same between
the two systems, with the top three wrong classes             References
being, in both cases, OCCURRENCE, I ACTION
                                                              Collin F Baker, Charles J Fillmore, and John B Lowe.
and STATE. OCCURRENCE, not surprisingly, is                     1998. The berkeley framenet project. In Proceed-
the class that tends to be assigned more often by               ings of the 17th international conference on Compu-
both systems, being also the most frequent. How-                tational linguistics-Volume 1, pages 86–90. Associ-
ever, if FBK-HLT largely overgeneralizes OC-                    ation for Computational Linguistics.
CURRENCE (59.53% of all class errors), this cor-              Giacomo Berardi, Andrea Esuli, and Diego Marcheg-
responds to only one third of the errors (37.70%)               giani. 2015. Word embeddings go to italy: A com-
in the Bi-LSTM-CRF network. Other notable dif-                  parison of models and training datasets. In IIR.
ferences concern I ACTION (27.82% of errors for
                                                              Steven Bethard, Leon Derczynski, Guergana Savova,
the Bi-LSTM-CRF vs. 17.28% for FBK-HLT),                         James Pustejovsky, and Marc Verhagen. 2015.
STATE (8.79% for the Bi-LSTM-CRF vs. 15.22%                      Semeval-2015 task 6: Clinical tempeval. In Pro-
for FBK-HLT) and REPORTING (7.89% for the                        ceedings of the 9th International Workshop on Se-
Bi-LSTM-CRF vs. 2.33% for FBK-HLT) classes.                      mantic Evaluation (SemEval 2015), pages 806–814.

                                                              Steven Bethard, Guergana Savova, Wei-Te Chen, Leon
4   Conclusion and Future Work                                   Derczynski, James Pustejovsky, and Marc Verhagen.
                                                                 2016. Semeval-2016 task 12: Clinical tempeval. In
This paper has investigated the application of                   Proceedings of the 10th International Workshop on
different word embeddings for the initialization                 Semantic Evaluation (SemEval-2016), pages 1052–
of a state-of-the-art Bi-LSTM-CRF network to                     1062.
   14                                                         Piotr Bojanowski, Edouard Grave, Armand Joulin,
      Note that POS are manually tagged for events, not for
their components.                                                and Tomas Mikolov. 2016. Enriching word vec-
  tors with subword information.       arXiv preprint       event coreference with temporal, causal and bridg-
  arXiv:1607.04606.                                         ing annotation. In Proceedings of the 2nd Workshop
                                                            on Computing News Storylines (CNS 2016), pages
T. Caselli, R. Sprugnoli, M. Speranza, and M. Mona-         47–56. Association for Computational Linguistics.
   chini. 2014. Eventi.EValuation of Events and Tem-
   poral INformation at Evalita 2014. In C. Bosco,        Jeffrey Pennington, Richard Socher, and Christo-
   F. DellOrletta, S. Montemagni, and M. Simi, editors,      pher D. Manning. 2014. Glove: Global vectors for
   Evaluation of Natural Language and Speech Tools           word representation. In Empirical Methods in Nat-
   for Italian, volume 1, pages 27–34. Pisa University       ural Language Processing (EMNLP), pages 1532–
   Press.                                                    1543.

Andrea Cimino and Felice Dell’Orletta. 2016. Build-       James Pustejovsky, José M Castano, Robert Ingria,
  ing the state-of-the-art in pos tagging of italian        Roser Sauri, Robert J Gaizauskas, Andrea Set-
  tweets. In CLiC-it/EVALITA.                               zer, Graham Katz, and Dragomir R Radev. 2003.
                                                            Timeml: Robust specification of event and tempo-
Agata Cybulska and Piek Vossen. 2014. Using a               ral expressions in text. New directions in question
  sledgehammer to crack a nut? Lexical diversity and        answering, 3:28–34.
  event coreference resolution. In Proceedings of the
  9th Language Resources and Evaluation Conference        James Pustejovsky, José Castao, Robert Ingria, Roser
  (LREC2014), Reykjavik, Iceland, May 26-31.                Saurı̀, Robert Gaizauskas, Andrea Setzer, and Gra-
                                                            ham Katz. 2003a. TimeML: Robust Specification
Alexandros Komninos and Suresh Manandhar. 2016.             of Event and Temporal Expressions in Text. In Fifth
  Dependency based embeddings for sentence classi-          International Workshop on Computational Seman-
  fication tasks. In Proceedings of the 2016 Confer-        tics (IWCS-5).
  ence of the North American Chapter of the Associ-
  ation for Computational Linguistics: Human Lan-         Nils Reimers and Iryna Gurevych. 2017. Report-
  guage Technologies, pages 1490–1500.                      ing score distributions makes a difference: Perfor-
                                                            mance study of lstm-networks for sequence tagging.
LDC. 2005. Ace (automatic content extraction)               In Proceedings of the 2017 Conference on Empiri-
  english annotation guidelines for events ver. 5.4.3       cal Methods in Natural Language Processing, pages
  2005.07.01. In Linguistic Data Consortium.                338–348, Copenhagen, Denmark, September. Asso-
                                                            ciation for Computational Linguistics.
Pierre Lison and Jörg Tiedemann. 2016. Opensub-
   titles2016: Extracting large parallel corpora from     Sara Tonelli, Alessio Palmero Aprosio, and Marco
   movie and tv subtitles.                                  Mazzon. 2017. The impact of phrases on ital-
                                                            ian lexical simplification. In Proceedings of the
Xuezhe Ma and Eduard Hovy. 2016. End-to-end                 Fourth Italian Conference on Computational Lin-
  sequence labeling via bi-directional lstm-cnns-crf.       guistics (CLiC-it 2017), Rome, Italy.
  In Proceedings of the 54th Annual Meeting of the
  Association for Computational Linguistics (Volume       N. UzZaman, H. Llorens, L. Derczynski, J. Allen,
  1: Long Papers), pages 1064–1074. Association for         M. Verhagen, and J. Pustejovsky. 2013. SemEval-
  Computational Linguistics.                                2013 task 1: Tempeval-3: Evaluating time expres-
                                                            sions, events, and temporal relations. In Proceed-
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-        ings of SemEval-2013, pages 1–9. Association for
  rado, and Jeff Dean. 2013. Distributed representa-        Computational Linguistics, Atlanta, Georgia, USA.
  tions of words and phrases and their compositional-
                                                          M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple,
  ity. In Advances in neural information processing
                                                            G. Katz, and J. Pustejovsky. 2007. SemEval-2007
  systems, pages 3111–3119.
                                                            Task 15: TempEval Temporal Relation Identifica-
Anne-Lyse Minard, Manuela Speranza, Eneko                   tion. In Proceedings of SemEval 2007, pages 75–80,
  Agirre, Itziar Aldabe, Marieke van Erp, Bernardo          June.
  Magnini, German Rigau, Ruben Urizar, and Fon-           Marc Verhagen, Roser Sauri, Tommaso Caselli, and
  dazione Bruno Kessler. 2015. Semeval-2015 task           James Pustejovsky. 2010. Semeval-2010 task 13:
  4: Timeline: Cross-document event ordering. In           Tempeval-2. In Proceedings of the 5th international
  Proceedings of the 9th International Workshop            workshop on semantic evaluation, pages 57–62. As-
  on Semantic Evaluation (SemEval 2015), pages             sociation for Computational Linguistics.
  778–786.
                                                          Piek Vossen, Tommaso Caselli, and Yiota Kont-
Paramita Mirza and Anne-Lyse Minard. 2014. Fbk-              zopoulou. 2015. Storylines for structuring massive
  hlt-time: a complete italian temporal processing sys-      streams of news. In Proceedings of the First Work-
  tem for eventi-evalita 2014. In Fourth International       shop on Computing News Storylines, pages 40–49.
  Workshop EVALITA 2014, pages 44–49.

Tim O’Gorman, Kristin Wright-Bettner, and Martha
  Palmer. 2016. Richer event description: Integrating