<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>ProTestA: Identifying and Extracting Protest Events in News. Notebook for ProtestNews Lab at CLEF 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angelo Basile</string-name>
          <email>angelo.basile@symanto.net</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <email>t.caselli@rug.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rijksuniversiteit Groningen</institution>
          ,
          <addr-line>Groningen</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Symanto Research GmbH &amp; Co.</institution>
          ,
          <addr-line>Nurnberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>9</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This notebook describes our participation in the ProtestNews Lab, identifying protest events in news articles in English. Systems are challenged to perform unsupervised domain adaptation across three sub-tasks: document classification, sentence classification, and event extraction. We describe the final submitted systems for all sub-tasks, as well as a series of negative results. Results indicate fairly robust performance in all tasks (average F1 of 0.705 for the document classification sub-task, average F1 of 0.592 for the sentence classification sub-task, and average F1 of 0.528 for the event extraction sub-task), ranking in the top 4 systems, although the drops on the out-of-domain test sets are not negligible.</p>
      </abstract>
      <kwd-group>
        <kwd>document classification</kwd>
        <kwd>sentence classification</kwd>
        <kwd>event extraction</kwd>
        <kwd>protest events</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The growth of the Web has made more and more data available, and the need for
Natural Language Processing (NLP) systems that are able to generalize across
data distributions has become an urgent topic. In addition, the portability
of models across data sets, even when these are assumed to be from the same domain, is
still a big challenge in NLP. Indeed, recent studies have shown that systems,
even when using architectures based on Neural Networks and distributed word
representations, are highly dependent on their training sets and can hardly
generalise [
        <xref ref-type="bibr" rid="ref12 ref5">12,30,5</xref>
        ].
      </p>
      <p>The 2019 CLEF ProtestNews Lab targets models' portability and
unsupervised domain adaptation in the area of social protest events to support
comparative social studies. The lab is organised along three tasks: a.) document
classification (Task 1); b.) sentence classification (Task 2); and c.) event trigger
and argument extraction (Task 3).</p>
      <p>Tasks 1 and 2 are essentially text classification tasks. The goal is to distinguish
between documents and sentences that report on or contain mentions of protest
events. Task 3 is an event extraction task, where systems have to identify the
correct event trigger, in this case a protest event, and its associated arguments
in every sentence of a document.</p>
      <p>
        As described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the creation of the data sets followed a very detailed
procedure to ensure maximal agreement among the annotators as well as to avoid
errors. Furthermore, the task is designed as a cascade of sub-tasks: first, identify
whether a document reports a protest event (Task 1); then identify which sentences
actually describe the protest event in the specific document (Task 2); and,
finally, for each protest event sentence, identify the actual event mention(s) and
its arguments (Task 3). However, there is no overlap among the training and
test data of the three tasks.
      </p>
      <p>As already mentioned, the lab's main challenge is unsupervised domain
adaptation. The lab organisers made available training and development data for one
domain, namely news reporting protest events in India, and asked the
participants to test their models both on in-domain data and on out-of-domain ones,
namely news about protest events in China. In the remainder of the notebook,
we will refer to these two test distributions as India and China.</p>
      <p>When analysing the three tasks, it seems evident that the first two
are very similar and can be targeted with a common architecture, and
possibly common features, modulo the granularity of the text, i.e. full documents vs.
sentences. On the other hand, the third task requires a dedicated and radically
different approach.</p>
      <p>In the remainder of this contribution, we illustrate the systems we developed
for the final submissions, and provide some data analysis that may help
understand the drops in performance across the two test sets. We also describe and
reflect on what we tried but did not work as expected.</p>
    </sec>
    <sec id="sec-2">
      <title>Final Systems</title>
      <p>The three tasks have been addressed with two separate systems. In particular,
for Tasks 1 and 2, we opted for a feature-based stacked ensemble model built on
a set of different basic Logistic Regression classifiers, while for Task 3, we used
a Bi-LSTM architecture optimized for sequence labelling tasks.</p>
      <sec id="sec-2-1">
        <title>Training Materials</title>
        <p>The lab organisers made available training and development data for each task.
Table 1 summarises the label distributions of the training and
development data for Task 1 and Task 2, i.e. document and sentence classification,
respectively. As the figures show, the positive class, i.e. protest documents
or sentences, is heavily unbalanced with respect to the negative one, i.e.
non-protest, ranging from 22.41% in Task 1 to 16.78% in Task 2 in
training. The class distribution is mirrored in the development data, with
minor differences for Task 2, where the positive class is slightly larger than in
training (20.81% vs. 16.78%). For training our systems, we did not use any
additional training material. The development data was used to identify which
methods to use for the final systems rather than to fine-tune the models, given
that at test time the models have to perform optimally on two different
data distributions, in- and out-of-domain (India vs. China).</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Label distributions of the training and development data for Task 1 and Task 2.</p></caption>
          <table>
            <thead>
              <tr><th>Task</th><th>Data set</th><th>Protest</th><th>Not Protest</th></tr>
            </thead>
            <tbody>
              <tr><td>Task 1 (Document Classification)</td><td>Train</td><td>769 (22.41%)</td><td>2,661 (77.58%)</td></tr>
              <tr><td/><td>Dev.</td><td>102 (22.31%)</td><td>355 (77.68%)</td></tr>
              <tr><td>Task 2 (Sentence Classification)</td><td>Train</td><td>988 (16.78%)</td><td>4,897 (83.21%)</td></tr>
              <tr><td/><td>Dev.</td><td>138 (20.81%)</td><td>525 (79.18%)</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The average number of event triggers per sentence is 1.41 in training and 1.35
in development, indicating that multiple event triggers can occur in the same
sentence. As for the arguments, the average per event trigger is 2.24 in training
and 1.85 in development, indicating both that arguments are shared among event
triggers in the same sentence and that not all arguments occur in every
sentence. As for Tasks 1 and 2, only the available training data was used to
train the system.</p>
        <p>4 https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf</p>
      </sec>
      <sec id="sec-2-2">
        <title>Classifying Documents and Sentences (Task 1 and Task 2)</title>
        <p>
          The document and sentence classification tasks have been formulated as
standard classification tasks. With the aim of maximizing the system results on
both test distributions, we developed a stacked ensemble model of
Logistic Regression classifiers, following a previously successful implementation that
showed robust portability [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          We extracted three different sets of features. Each set of features was used
to train a basic Logistic Regression classifier, as available in the scikit-learn
platform [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], and to obtain a 10-fold cross-validation prediction for each training
set. This means that for each document/sentence we have 3 basic classifiers as
well as their corresponding predictions, resulting in a total of 6 meta-level features
(3 classifiers × 2 classes per task) per document/sentence as input for
the final meta-classifier. The meta-classifier is an additional Logistic Regression
classifier. In training, we used the default value of the C parameter and
balanced class weights. Pre-processing of the data is limited to lowercasing and
removal of special characters (e.g. #, (, ...) and digits.
        </p>
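The stacking scheme described above can be sketched as follows. This is an illustrative reading of the setup, not the authors' code: the three feature "views" below are random stand-ins for the actual embedding and n-gram features, and the toy data is invented.

```python
# Sketch of a stacked ensemble of Logistic Regression classifiers:
# each feature set trains a base classifier whose 10-fold out-of-fold
# class probabilities (2 per classifier) become meta-features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stack_features(feature_sets, y, n_folds=10):
    """Collect out-of-fold probabilities from one base classifier per view."""
    meta = []
    for X in feature_sets:
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        # cross-validated predictions avoid leaking training labels upward
        probs = cross_val_predict(clf, X, y, cv=n_folds, method="predict_proba")
        meta.append(probs)
    return np.hstack(meta)  # (n_samples, 3 views * 2 classes) = 6 meta-features

# Toy data: three "views" of 40 documents with a binary protest label.
rng = np.random.RandomState(0)
y = np.array([0, 1] * 20)
views = [rng.randn(40, 5) + y[:, None] for _ in range(3)]

X_meta = stack_features(views, y)
# The meta-classifier is itself a Logistic Regression over the 6 features.
meta_clf = LogisticRegression(class_weight="balanced").fit(X_meta, y)
```

In practice the base classifiers would be refit on the full training set before producing meta-features for unseen test documents.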
        <p>
          Word embeddings features. We used the pre-trained FastText embeddings [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
for English with 300 dimensions and sub-word information. 5 For each
document/sentence, we obtain a 300-dimensional representation by applying average
pooling to the token embeddings; whenever a token is not present in the
embedding vocabulary, we extract sub-words of length 3 or greater and check whether
they are present in the embeddings. This strategy maximizes the
information in the training data as well as reducing out-of-vocabulary (OOV) tokens
across the different test distributions.
        </p>
        <p>Most important token and character n-gram features. These two sets of features
were identified as useful for increasing the robustness and portability
of the models across data sets. The features were extracted by computing
two sets of TF-IDF scores over each training set (i.e. Task 1 and Task 2) to
select the most important token and character n-grams per class (i.e. protest
documents/sentences vs. non-protest documents/sentences). For each extracted
token, the maximum and minimum cosine similarity is obtained with respect to
all tokens in a document/sentence using the FastText embeddings. As for
the word embeddings feature, in case a token is not present in the embeddings,
we checked for sub-word embeddings. The character n-grams are represented
by means of Boolean features indicating whether or not they are present in the
document/sentence, thus capturing and representing different information.</p>
        <p>5 We used the wiki-news-300d-1M-subword.vec model, available at
https://fasttext.cc/docs/en/english-vectors.html.</p>
        <p>
          Table 3 illustrates the settings used to tune the system for each test
distribution and task. The number of token and character n-grams varies per task
as well as per test distribution. Although more experiments are needed, during
the submission phase we observed, not surprisingly, that the more
token and character n-grams are extracted, the better the model
performs on the same test data distribution, thus losing portability (for
instance, with 4,000 token n-grams and 1,000 character n-grams, the F1 of Task
2 on China drops from 0.553 to 0.536).
        </p>
        <p>
          We framed the event mention and argument extraction task as a supervised
sequence labelling problem, following a well-established practice
in NLP [
          <xref ref-type="bibr" rid="ref1 ref19 ref2 ref26 ref8">1,8,26,2,19</xref>
          ]. In particular, given a sentence S, the system is asked
to identify all linguistic expressions w ∈ S, where w is a mention of a protest
event, ev<sub>w</sub>, as well as all linguistic expressions y ∈ S where y is a mention of an
argument, arg<sub>y</sub>, associated with a specific event mention ev<sub>w</sub>.
        </p>
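To make the sequence labelling framing concrete, here is one invented sentence annotated in a BIO-style scheme. The tag names are illustrative only; the lab defines its own inventory of trigger and argument labels.

```python
# One sentence framed for sequence labelling: the first model tags event
# triggers, the second tags arguments of those triggers.
tokens       = ["Hundreds", "protested", "against", "the", "new", "law"]
trigger_tags = ["O", "B-trigger", "O", "O", "O", "O"]            # step 1
arg_tags     = ["B-participant", "O", "O",
                "B-target", "I-target", "I-target"]              # step 2

# A sequence labeller predicts exactly one tag per token.
assert len(tokens) == len(trigger_tags) == len(arg_tags)
```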
        <p>
          We have implemented a two-step approach using a common sequence
labelling model based on a publicly available Bi-LSTM network with a CRF
classifier as the last layer [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. 6 In more detail, we developed two different models: first,
we detect event trigger mentions and, subsequently, the event arguments (and
their roles). This approach is inspired by SRL architectures [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], where predicates
are first identified and disambiguated, and afterwards the relevant
arguments are labelled. In our case, we first identify sentences that contain relevant
event triggers, and then look for event arguments in these sentences.
        </p>
        <sec id="sec-2-2-1">
          <p>6 https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf</p>
          <p>
            We did not fine-tune the hyperparameters, but followed the suggestions
in [
            <xref ref-type="bibr" rid="ref24 ref25">25,24</xref>
            ] for sequence labelling tasks. In Table 4, we report the shared
parameters of the networks for both tasks. The "LSTM Layers" entry refers separately to the
number of forward and backward layers.
          </p>
          <p>
            Training is stopped after 5 consecutive epochs with no improvement.
The Komninos and Manandhar [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] pre-trained word embeddings are used to initialize
the ELMo embeddings [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] and fine-tune them with respect to the training
data. The ELMo embeddings are used to enhance the network's generalisation
capabilities for event and argument detection over both test data distributions.
For the event trigger detection sub-task, the embedding representations are
further concatenated with character-level embeddings [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] and part-of-speech
(POS) embeddings. POS tags were obtained with the Stanford CoreNLP
toolkit [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. 7 This minimal set of features is further extended with embedding
representations for dependency relations and event triggers for the argument
detection sub-task. At test time, the protest event triggers are obtained from the
event mentions model.
          </p>
          <p>For both the event trigger and argument detection sub-tasks, we
conducted five different runs to better assess the variability of the deep learning
models due to random initialisation. At test time, we selected the model that
obtained the best F1 score on the development set out of the five runs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results on Test Data</title>
      <p>In this section we illustrate the results for all three tasks. 8 Notice that for Task
3 the scores are cumulative for both event trigger and participant detection. For
all tasks, the ranking is based on the average F1 of the systems on the two test
distributions (i.e. India and China). Table 5 reports the results for each task and
the corresponding rankings. For ease of comparison, we also report the distance
from the best-ranking system for each task, expressed as a difference in F1 scores.</p>
      <p>7 We used version 3.9.2.</p>
      <p>8 Results and rankings were taken from the Codalab page of the ProtestNews Lab,
available at https://competitions.codalab.org/competitions/22349#results.</p>
      <p>Quite disappointingly, the drops on the China test data are not negligible,
although they span a fairly wide range. The smallest drop is on Task 2, where the
system does not perform optimally on the India test data either (F1 0.631). On the
other hand, the largest drop is on Task 1, where the system loses 0.21 F1 points
when applied to the China data set. As for Task 3, the drop is still substantial
(0.144 points). However, we also noticed that Precision and Recall
on the China data are quite well balanced (P = 0.492; R = 0.425), indicating some
robustness of the trained model.</p>
    </sec>
    <sec id="sec-4">
      <title>Data Analysis</title>
      <p>
        To better understand the results of our systems on the three tasks, we
conducted an analysis of the data to highlight similarities and differences.
Following recent work [
        <xref ref-type="bibr" rid="ref13 ref23">23,13</xref>
        ], we embrace the view that corpora, even when on the
same domain, are not monolithic entities but rather regions in a high-dimensional
space of latent factors, including topics, genres, writing styles, and years
of publication, among others, that express similarities and diversities. In
particular, we investigated to what extent the test sets of the three tasks occupy
similar (or different) portions of this space with respect to their
corresponding training distributions. To do so, we used two metrics, the Jensen-Shannon
(J-S) divergence and the out-of-vocabulary (OOV) token rate, which previous
work in transfer learning [28] has shown to be particularly useful for this kind
of analysis.
      </p>
      <p>The J-S divergence assesses the similarity between two probability distributions,
q and r, and is a smoothed, symmetric variant of the Kullback-Leibler
divergence. The OOV rate, on the other hand, can be used to assess the differences
between data distributions, as it highlights the percentage of unknown tokens.
All measures were computed between the training data and the two test
distributions, i.e. India and China. Results are reported in Table 6.</p>
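Both measures are simple to compute over unigram distributions. A minimal sketch, with an invented toy corpus (our reading of the setup; base-2 logarithm is one common convention for J-S):

```python
# Jensen-Shannon divergence and OOV rate over token lists.
import math
from collections import Counter

def js_divergence(tokens_a, tokens_b):
    """Smoothed, symmetric variant of KL divergence: JS(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m)."""
    vocab = set(tokens_a) | set(tokens_b)
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    p = {w: ca[w] / len(tokens_a) for w in vocab}
    q = {w: cb[w] / len(tokens_b) for w in vocab}
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}  # mixture distribution
    def kl(x, y):
        return sum(x[w] * math.log2(x[w] / y[w]) for w in vocab if x[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def oov_rate(train_tokens, test_tokens):
    """Share of test tokens unseen in training."""
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

train = "protesters march in delhi streets".split()
identical = js_divergence(train, train)  # 0.0 for identical distributions
```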
      <p>As the figures in Table 6 show, the test distributions, India and China, can be
seen as occupying quite different portions of this space. Not
surprisingly, the China distributions are very different from their respective training
ones, although to varying degrees. This variation in similarity, however, also
affects the India test distributions, where the highest similarity is observed for
Task 1 (0.922) and the lowest for Task 3 (0.703). The J-S scores show that the
test distributions for Task 3 and Task 2 are even more different than those for
Task 1, indicating that the differences in performance of the models across the
three tasks are subject to these variations in similarity. As further support
for this observation, we found a significant positive correlation
between the J-S similarity scores and the F1 values across the three tasks (0.901,
p&lt;.05).</p>
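The correlation check reported above can be reproduced with a plain Pearson coefficient. A self-contained sketch; the numeric pairs below are invented for illustration, not the paper's actual (J-S, F1) values:

```python
# Pearson correlation coefficient, computed from scratch.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented similarity/F1 pairs, just to show the call shape.
r = pearson([0.922, 0.703, 0.75, 0.80, 0.85, 0.72],
            [0.790, 0.456, 0.553, 0.62, 0.631, 0.50])
```

With only a handful of tasks, the p-value should of course be interpreted cautiously.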
      <p>The OOV rates support the observations based on the J-S divergence.
The OOV rates for the India test distributions are much lower than those for
China, clearly signalling strong lexical differences among
the data sets. The OOV rates for India and China on Task 3 are much closer
than those for Tasks 1 and 2, and yet the difference in overall F1 scores for this
task between the two test distributions is quite large (F1 0.600 for India vs.
F1 0.456 for China), suggesting that OOV is actually a weaker predictor
of performance differences between data distributions. Indeed, we found
a non-significant negative correlation between OOV rates and the
F1 scores across the tasks (-0.804, p&gt;.05).</p>
      <p>Finally, a further aspect that accounts for the behavior of the models concerns
the proportions of the predictions. In particular, for Tasks 1 and 2 the proportions
of predictions in the two classes (protest vs. non-protest) are the same as
in the training sets for the India data (i.e. in-domain), while they drop when
the models are applied to the China tests. In Task 3, the system predicts a similar proportion
of event triggers in both test distributions (0.90 events per sentence on India, and
0.95 events per sentence on China, respectively), although slightly lower than that
observed in training. On the other hand, the proportions of predicted arguments
per event differ: on the India data, they are in line with those of the
training set (2.51 vs. 2.24 in training), while they are lower on the
China data (1.94 vs. 2.24 in training). These observations further
indicate that the systems for Tasks 1 and 2 are more dependent on the training
set, while the system for Task 3 appears to be more resilient to out-of-domain
data.</p>
    </sec>
    <sec id="sec-5">
      <title>Alternative Methods: What Did Not Work</title>
      <p>
        In this section we briefly report on alternative methods that actually turned out
to be detrimental to performance with respect to the final settings. We mainly
focused on changing modelling strategies, using different algorithms and
paradigms, rather than attempting to extend the training materials.
      </p>
      <p>
        Task 1 and 2 - Inductive Transfer Learning. In an attempt to build a system
that generalises better across data sets, we tried exploiting recent advances
in transfer learning. Combining a fine-tuned language model with a classifier has
been shown to be a sound strategy for classifying text [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We thus experimented
with two pre-trained contextualized embedding models, ELMo [
        <xref ref-type="bibr" rid="ref21 ref22">22,21</xref>
        ] and BERT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In both cases, we extracted fixed sentence representations and used them in
combination with a linear SVM with the default hyper-parameters provided by
the scikit-learn implementation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. With BERT, we obtain fixed
representations by applying average pooling to the second-to-last hidden layer of the
pre-trained BERT base model. For Task 1, we represented a document as the
average of the sentence embeddings obtained with ELMo or BERT, using
spaCy to split documents into sentences. We experimented with frozen
and fine-tuned weights: with ELMo, we fine-tuned the inner three layers, while
with BERT we started from the pre-trained base English model and trained it for 10,000 steps
on the India training set. In this latter case, the training data was
assembled by combining the document and sentence training corpora. Finally,
we also experimented with an ensemble model using both word and character
n-grams and BERT embedding representations.
      </p>
      <p>We obtained promising results on the development set but, surprisingly,
performance dropped when applied to the test distributions (F1 = 0.466 for
ELMo, and F1 = 0.567 for BERT, respectively). The ensemble model using
both dense and sparse representations outperformed the simpler model by 0.1
F1 points.</p>
      <p>
        Task 1 and 2 - Convolutional Neural Networks (CNN). It has been shown
that character-level convolutional neural networks perform well in document and
sentence classification tasks [31] and, being character-based, these models are
in theory not severely harmed by OOV words, making portability across
test distributions less error-prone. We experimented with the architecture
described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], randomly initializing character embeddings, which are then
passed as input to a stack of convolutional networks with kernel sizes ranging
from 3 to 7. As a regularization method, we use 10% dropout [29]. As with
the inductive transfer learning approaches, good results on the
development set were followed by very poor test results (F1 = 0.427). For
this approach, we hypothesize that the training data is too small to effectively
use randomly initialized embeddings, even character-based ones.
      </p>
      <p>
        Task 1 and 2 - FastText. We experimented with Facebook's FastText system,
an off-the-shelf supervised classifier [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We trained two versions of the
system using different token n-gram representations (i.e. bigrams and trigrams),
the wiki-news-300d-1M-subword.vec FastText embeddings with subwords for
English, and varying learning rates (ranging from 0.1 up to 1.0). We fine-tuned
the learning rate against the development data. Pre-processing of the data is
the same as for the final system, namely lowercasing and removal of
special characters (e.g. #, (, ...) and digits. When applied to the test data,
the best model scored F1 0.608. We also observed that bigrams performed
best for Task 1, regardless of the test distribution, while for Task 2, bigrams
worked best for the in-domain test distribution, i.e. India, and trigrams for the
out-of-domain one, i.e. China.
      </p>
      <p>
        Task 3 - Multi-task Learning. We investigated whether a multi-task learning
architecture, still based on the Bi-LSTM network, could be a viable solution to
improve the system's performance and portability. Given the incompatibility of
the Task 3 annotations with other existing corpora for event extraction (e.g.
ACE and POLCON [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), we opted for a multi-task learning approach, as it has
proven useful to address the scarcity of labelled data. However, we adopted a slightly
different strategy: rather than using an alternative task in support of our
target task, e.g. semantic role labelling in support of opinion role labelling [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
we used an alternative data set annotated with different labels but targeting
the same problem. We thus extracted all sentences annotated with Attack and
Demonstrate events from the ACE corpus and used them as a support task in a
multi-task learning setting. In this case, we achieved an average F1 of 0.517 over
both test distributions, 0.011 points lower than the final submitted system.
On the positive side, however, we observed that the multi-task model obtains the
best Precision score on the China test data (0.541), although at a large expense
of Recall.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>Our contribution mainly focused on two aspects: a.) assessing the most viable
approach for each task at stake and maximizing portability with limited effort; b.)
explaining the limits of the trained models in terms of similarities and differences
between training and test distributions rather than limiting ourselves to technical
aspects of the systems.</p>
      <p>Task 1 and Task 2 have shown that a simple system can obtain competitive
results in an unsupervised domain adaptation setting. This is actually
encouraging and calls for further investigation in this direction by focusing efforts
on parameter optimisation. We also believe that the lack of any material for the
out-of-domain distributions is a further challenge to take into account, as no
fine-tuning of the models on the target domain was possible. However much
effort we put into developing maximally "generalisable" systems, the
dependence of the models on the training materials remains high, thus raising the
question of whether we are just modelling data sets rather than linguistic phenomena.</p>
      <p>Task 3 has highlighted the contribution of both more complex
architectures, such as a Bi-LSTM-CRF network, and contextualised embedding
representations, such as ELMo. In this specific case, the trained model is able
to predict a comparable number of event triggers across the two test
distributions, although it suffers on the argument sub-task, where fewer arguments are
predicted for the out-of-domain data. Unfortunately, the evaluation format does
not allow quantifying the losses per sub-task and per trained model.</p>
      <p>Finally, the similarity and diversity measures (i.e. J-S divergence and OOV rate)
proved useful tools for better understanding the different behaviors of the
systems on the two test distributions. It is worth noticing how the J-S similarity scores
correlate with the F1 scores of the trained models, suggesting that it could be
possible to quantify, or predict, the performance loss of a system before applying it to
out-of-domain test distributions and, consequently, to take action to minimize the
losses.
</p>
      <p>Association for Computational Linguistics (Volume 1: Long Papers). pp.
1192-1202. Association for Computational Linguistics, Berlin, Germany
(Aug 2016). https://doi.org/10.18653/v1/P16-1113, https://www.aclweb.org/anthology/P16-1113</p>
      <p>28. Ruder, S., Plank, B.: Learning to select data for transfer learning with Bayesian
optimization. In: Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. pp. 372-382. Association for Computational Linguistics,
Copenhagen, Denmark (September 2017), https://www.aclweb.org/anthology/D17-1038</p>
      <p>29. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research 15, 1929-1958 (2014)</p>
      <p>30. Weber, N., Shekhar, L., Balasubramanian, N.: The fine line between
linguistic generalization and failure in seq2seq-attention models. arXiv preprint
arXiv:1805.01445 (2018)</p>
      <p>31. Zhang, X., Zhao, J.J., LeCun, Y.: Character-level convolutional networks for text
classification. In: NIPS (2015)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The stages of event extraction</article-title>
          .
          <source>In: Proceedings of the Workshop on Annotating and Reasoning about Time and Events</source>
          . pp.
          <volume>1</volume>
          -
          <issue>8</issue>
          . Association for Computational Linguistics (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bethard</surname>
          </string-name>
          , S.:
          <string-name>
            <surname>ClearTK-TimeML</surname>
          </string-name>
          :
          <article-title>A minimalist approach to TempEval 2013</article-title>
          . In: Second Joint Conference on Lexical and
          <article-title>Computational Semantics (* SEM)</article-title>
          .
          <source>vol. 2</source>
          , pp.
          <volume>10</volume>
          -
          <issue>14</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Glockner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shwartz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Breaking NLI systems with sentences that require simple lexical inferences</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <fpage>650</fpage>
          -
          <lpage>655</lpage>
          . Association for Computational Linguistics (
          <year>2018</year>
          ), http://aclweb.org/anthology/P18-2103
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Hurriyetoglu,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , Yoruk, E., Yuret,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Yoltar</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          , Gurel,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Durusan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Mutlu</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.:</surname>
          </string-name>
          <article-title>A task set proposal for automatic protest information collection across multiple countries</article-title>
          . In: Azzopardi,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Mayr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Hau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (eds.) Advances in Information Retrieval. pp.
          <volume>316</volume>
          {
          <fpage>323</fpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Refining event extraction through cross-document inference</article-title>
          .
          <source>Proceedings of ACL-08: HLT</source>
          pp.
          <fpage>254</fpage>
          -
          <lpage>262</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>arXiv preprint arXiv:1607.01759</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Komninos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manandhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Dependency based embeddings for sentence classification tasks</article-title>
          .
          <source>In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>1490</fpage>
          -
          <lpage>1500</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lake</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks</article-title>
          .
          <source>arXiv preprint arXiv:1711.00350</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batchelor</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Automatic recognition of conceptualization zones in scientific articles and two life science applications</article-title>
          .
          <source>Bioinformatics</source>
          <volume>28</volume>
          (
          <issue>7</issue>
          ),
          <fpage>991</fpage>
          -
          <lpage>1000</lpage>
          (Apr
          <year>2012</year>
          ). https://doi.org/10.1093/bioinformatics/bts071
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lorenzini</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makarov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriesi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wueest</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Towards a dataset of automatically coded protest events from English-language newswire documents</article-title>
          .
          <source>In: Paper presented at the Amsterdam Text Analysis Conference</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF</article-title>
          .
          <source>arXiv preprint arXiv:1603.01354</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Stanford CoreNLP Natural Language Processing Toolkit</article-title>
          .
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. System Demonstrations</source>
          . pp.
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Marasovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SRL4ORL: Improving opinion role labeling using multi-task learning with semantic role labeling</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)</source>
          . pp.
          <fpage>583</fpage>
          -
          <lpage>594</lpage>
          . Association for Computational Linguistics, New Orleans, Louisiana (Jun
          <year>2018</year>
          ). https://doi.org/10.18653/v1/N18-1054, https://www.aclweb.org/anthology/N18-1054
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>TUWienKBS at GermEval 2018: German abusive tweet detection</article-title>
          .
          <source>In: 14th Conference on Natural Language Processing KONVENS 2018</source>
          . p.
          <fpage>45</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Event detection and domain adaptation with convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>
          . vol. 2, pp.
          <fpage>365</fpage>
          -
          <lpage>371</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (Oct),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>To tune or not to tune? Adapting pretrained representations to diverse tasks</article-title>
          .
          <source>arXiv preprint arXiv:1903.05987</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proc. of NAACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Plank</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Noord</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Effective measures of domain similarity for parsing</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          . pp.
          <fpage>1566</fpage>
          -
          <lpage>1576</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks</article-title>
          .
          <source>CoRR abs/1707.06799</source>
          (
          <year>2017</year>
          ), http://arxiv.org/abs/1707.06799
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>338</fpage>
          -
          <lpage>348</lpage>
          . Association for Computational Linguistics, Copenhagen, Denmark (
          September
          <year>2017</year>
          ), https://www.aclweb.org/anthology/D17-1035
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Open domain event extraction from twitter</article-title>
          .
          <source>In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>1104</fpage>
          -
          <lpage>1112</lpage>
          . ACM
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Neural semantic role labeling with dependency path embeddings</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>