<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>ProTestA: Identifying and Extracting Protest Events in News. Notebook for ProtestNews Lab at CLEF 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angelo Basile</string-name>
          <email>angelo.basile@symanto.net</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <email>t.caselli@rug.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rijksuniversiteit Groningen</institution>
          ,
          <addr-line>Groningen</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Symanto Research GmbH &amp; Co.</institution>
          ,
          <addr-line>Nurnberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>9</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This notebook describes our participation in the ProtestNews Lab, identifying protest events in news articles in English. Systems are challenged to perform unsupervised domain adaptation across three sub-tasks: document classification, sentence classification, and event extraction. We describe the final submitted systems for all sub-tasks, as well as a series of negative results. Results indicate fairly robust performance in all tasks (average F1 of 0.705 for the document classification sub-task, average F1 of 0.592 for the sentence classification sub-task, and average F1 of 0.528 for the event extraction sub-task), ranking in the top 4 systems, although the drops on the out-of-domain test sets are not negligible.</p>
      </abstract>
      <kwd-group>
        <kwd>document classification</kwd>
        <kwd>sentence classification</kwd>
        <kwd>event extraction</kwd>
        <kwd>protest events</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The growth of the Web has made more and more data available, and the need for
Natural Language Processing (NLP) systems that are able to generalize across
data distributions has become an urgent topic. In addition, the portability
of models across data sets, even when these are assumed to be from the same domain, is
still a big challenge in NLP. Indeed, recent studies have shown that systems,
even when using architectures based on Neural Networks and distributed word
representations, are highly dependent on their training sets and can hardly
generalise [
        <xref ref-type="bibr" rid="ref12 ref5">12,30,5</xref>
        ].
      </p>
      <p>The 2019 CLEF ProtestNews Lab targets models' portability and
unsupervised domain adaptation in the area of social protest events to support
comparative social studies. The lab is organised along three tasks: a.) document
classification (Task 1); b.) sentence classification (Task 2); and c.) event trigger
and argument extraction (Task 3).</p>
      <p>Tasks 1 and 2 are essentially text classification tasks. The goal is to distinguish
between documents and sentences that report on or contain mentions of protest
events. Task 3 is an event extraction task, where systems have to identify the
correct event trigger, in this case a protest event, and its associated arguments
in every sentence of a document.</p>
      <p>
        As described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the creation of the data sets followed a very detailed
procedure to ensure maximal agreement among the annotators as well as to avoid
errors. Furthermore, the task is designed as a cascade of sub-tasks: first, identify
whether a document reports a protest event (Task 1); then identify which sentences
actually describe the protest event in the specific document (Task 2); and,
finally, for each protest event sentence, identify the actual event mention(s) and
its arguments (Task 3). However, there is no overlap among the training and
test data of the three tasks.
      </p>
      <p>As already mentioned, the lab's main challenge is unsupervised domain
adaptation. The lab organisers made available training and development data for one
domain, namely news reporting protest events in India, and asked the
participants to test their models both on in-domain data and on out-of-domain ones,
namely news about protest events in China. In the remainder of the notebook,
we will refer to these two test distributions as India and China.</p>
      <p>When analysing the three tasks, it seems evident that the first two
are very similar and can be targeted with a common architecture, and
possibly common features, modulo the granularity of the text, i.e. full documents vs.
sentences. On the other hand, the third task requires a dedicated and radically
different approach.</p>
      <p>In the remainder of this contribution, we illustrate the systems we developed
for the final submissions, and provide some data analysis that may help
understand the drops in performance across the two test sets. We also describe and
reflect on what we tried but did not work as expected.</p>
    </sec>
    <sec id="sec-2">
      <title>Final Systems</title>
      <p>The three tasks have been addressed with two separate systems. In particular,
for Tasks 1 and 2, we opted for a feature-based stacked ensemble model built on
a set of different basic Logistic Regression classifiers, while for Task 3, we used
a Bi-LSTM architecture optimized for sequence labelling tasks.</p>
      <sec id="sec-2-1">
        <title>Training Materials</title>
        <p>The lab organisers made available training and development data for each task.
Table 1 summarises the label distributions of the training and
development data for Task 1 and Task 2, i.e. document and sentence classification,
respectively. As the figures show, the positive class, i.e. protest documents
or sentences, is heavily unbalanced with respect to the negative one, i.e.
non-protest, ranging from 22.41% in Task 1 to 16.78% in Task 2 in
training. The class distribution is mirrored in the development data, with
minor differences for Task 2, where the positive class is slightly larger than in
training (20.81% vs. 16.78%). For training our systems, we did not use any
additional training material. The development data was used to identify which
methods to use for the final systems rather than to fine-tune the models, given
that at test time the models have to perform optimally on two different
data distributions, in- and out-of-domain (India vs. China).</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Label distributions of the training and development data for Task 1 and Task 2.</p></caption>
          <table>
            <thead>
              <tr><th>Task</th><th>Data set</th><th>Protest</th><th>Not Protest</th></tr>
            </thead>
            <tbody>
              <tr><td>Task 1 (Document Classification)</td><td>Train</td><td>769 (22.41%)</td><td>2,661 (77.58%)</td></tr>
              <tr><td/><td>Dev.</td><td>102 (22.31%)</td><td>355 (77.68%)</td></tr>
              <tr><td>Task 2 (Sentence Classification)</td><td>Train</td><td>988 (16.78%)</td><td>4,897 (83.21%)</td></tr>
              <tr><td/><td>Dev.</td><td>138 (20.81%)</td><td>525 (79.18%)</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The average number of event triggers per sentence is 1.41 in training and 1.35
in development, indicating that multiple event triggers can occur in the same
sentence. As for the arguments, the average per event trigger is 2.24 in training
and 1.85 in development, indicating both that arguments are shared among event
triggers in the same sentence and that not all arguments occur in every
sentence. As for Tasks 1 and 2, only the available training data was used to
train the system.</p>
        <p>4 https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf</p>
      </sec>
      <sec id="sec-2-2">
        <title>Classifying Documents and Sentences (Task 1 and Task 2)</title>
        <p>
          The document and sentence classification tasks have been formulated as
standard classification tasks. With the aim of maximizing the system results on
both test distributions, we developed a stacked ensemble model of
Logistic Regression classifiers, following a previously successful implementation that
showed robust portability [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          We extracted three different sets of features. Each set of features was used
to train a basic Logistic Regression classifier, as available in the scikit-learn
platform [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], and to obtain a 10-fold cross-validation prediction for each training
set. This means that for each document/sentence we have 3 basic classifiers as
well as their corresponding predictions, resulting in a total of 6 meta-level features
(3 classifiers × 2 classes per task) per document/sentence as input for
the final meta-classifier. The meta-classifier is an additional Logistic Regression
classifier. In training, we used the default value of the C parameter and
balanced class weights. Pre-processing of the data is limited to lowercasing and
removal of special characters (e.g. #, (, ...) and digits.
        </p>
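The stacking scheme described above can be sketched as follows. This is an illustrative reading of the setup, not the authors' code: the three feature "views" below are random stand-ins for the actual embedding and n-gram features, and the toy data is invented.

```python
# Sketch of a stacked ensemble of Logistic Regression classifiers:
# each feature set trains a base classifier whose 10-fold out-of-fold
# class probabilities (2 per classifier) become meta-features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stack_features(feature_sets, y, n_folds=10):
    """Collect out-of-fold probabilities from one base classifier per view."""
    meta = []
    for X in feature_sets:
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        # cross-validated predictions avoid leaking training labels upward
        probs = cross_val_predict(clf, X, y, cv=n_folds, method="predict_proba")
        meta.append(probs)
    return np.hstack(meta)  # (n_samples, 3 views * 2 classes) = 6 meta-features

# Toy data: three "views" of 40 documents with a binary protest label.
rng = np.random.RandomState(0)
y = np.array([0, 1] * 20)
views = [rng.randn(40, 5) + y[:, None] for _ in range(3)]

X_meta = stack_features(views, y)
# The meta-classifier is itself a Logistic Regression over the 6 features.
meta_clf = LogisticRegression(class_weight="balanced").fit(X_meta, y)
```

In practice the base classifiers would be refit on the full training set before producing meta-features for unseen test documents.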
        <p>
          Word embeddings features. We used the pre-trained FastText embeddings [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
for English with 300 dimensions and sub-word information. 5 For each
document/sentence, we obtain a 300-dimensional representation by applying average
pooling to the token embeddings; whenever a token is not present in the
embedding vocabulary, we extract sub-words of length 3 or greater and check whether
they are present in the embeddings. This strategy maximizes the
information in the training data as well as reducing out-of-vocabulary (OOV) tokens
across the different test distributions.
        </p>
        <p>Most important token and character n-gram features. These two sets of features
were identified as useful for increasing the robustness and portability
of the models across data sets. The features were extracted by computing
two sets of TF-IDF scores over each training set (i.e. Task 1 and Task 2) to
select the most important token and character n-grams per class (i.e. protest
documents/sentences vs. non-protest documents/sentences). For each extracted
token, the maximum and minimum cosine similarity is obtained with respect to
all tokens in a document/sentence using the FastText embeddings. As for
the word embeddings feature, in case a token is not present in the embeddings,
we checked for sub-word embeddings. The character n-grams are represented
by means of Boolean features indicating whether or not they are present in the
document/sentence, thus capturing and representing different information.</p>
        <p>5 We used the wiki-news-300d-1M-subword.vec model, available at
https://fasttext.cc/docs/en/english-vectors.html.</p>
        <p>
          Table 3 illustrates the settings used to tune the system for each test
distribution and task. The number of token and character n-grams varies per task
as well as per test distribution. Although more experiments are needed, during
the submission phase we observed, not surprisingly, that the more
token and character n-grams are extracted, the better the model
performs on the same test data distribution, thus losing portability (for
instance, with 4,000 token n-grams and 1,000 character n-grams, the F1 of Task
2 on China drops from 0.553 to 0.536).
        </p>
        <p>
          We framed the event mention and argument extraction task as a supervised
sequence labelling problem, following a well-established practice
in NLP [
          <xref ref-type="bibr" rid="ref1 ref19 ref2 ref26 ref8">1,8,26,2,19</xref>
          ]. In particular, given a sentence S, the system is asked
to identify all linguistic expressions w ∈ S, where w is a mention of a protest
event, ev<sub>w</sub>, as well as all linguistic expressions y ∈ S where y is a mention of an
argument, arg<sub>y</sub>, associated with a specific event mention ev<sub>w</sub>.
        </p>
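To make the sequence labelling framing concrete, here is one invented sentence annotated in a BIO-style scheme. The tag names are illustrative only; the lab defines its own inventory of trigger and argument labels.

```python
# One sentence framed for sequence labelling: the first model tags event
# triggers, the second tags arguments of those triggers.
tokens       = ["Hundreds", "protested", "against", "the", "new", "law"]
trigger_tags = ["O", "B-trigger", "O", "O", "O", "O"]            # step 1
arg_tags     = ["B-participant", "O", "O",
                "B-target", "I-target", "I-target"]              # step 2

# A sequence labeller predicts exactly one tag per token.
assert len(tokens) == len(trigger_tags) == len(arg_tags)
```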
        <p>
          We have implemented a two-step approach using a common sequence
labelling model based on a publicly available Bi-LSTM network with a CRF
classifier as the last layer [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. 6 In more detail, we developed two different models: first,
we detect event trigger mentions and, subsequently, the event arguments (and
their roles). This approach is inspired by SRL architectures [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], where predicates
are first identified and disambiguated, and afterwards the relevant
arguments are labelled. In our case, we first identify sentences that contain relevant
event triggers, and then look for event arguments in these sentences.
        </p>
        <sec id="sec-2-2-1">
          <p>6 https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf</p>
          <p>
            We did not fine-tune the hyperparameters, but followed the suggestions
in [
            <xref ref-type="bibr" rid="ref24 ref25">25,24</xref>
            ] for sequence labelling tasks. In Table 4, we report the shared
parameters of the networks for both tasks. The "LSTM Layers" entry refers separately to the
number of forward and backward layers.
          </p>
          <p>
            Training is stopped after 5 consecutive epochs with no improvement.
The Komninos and Manandhar [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] pre-trained word embeddings are used to initialize
the ELMo embeddings [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] and fine-tune them with respect to the training
data. The ELMo embeddings are used to enhance the network's generalisation
capabilities for event and argument detection over both test data distributions.
For the event trigger detection sub-task, the embedding representations are
further concatenated with character-level embeddings [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] and part-of-speech
(POS) embeddings. POS tags were obtained with the Stanford CoreNLP
toolkit [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. 7 This minimal set of features is further extended with embedding
representations for dependency relations and event triggers for the argument
detection sub-task. At test time, the protest event triggers are obtained from the
event mentions model.
          </p>
          <p>For both the event trigger and argument detection sub-tasks, we
conducted five different runs to better assess the variability of the deep learning
models due to random initialisation. At test time, we selected the model that
obtained the best F1 score on the development set out of the five runs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results on Test Data</title>
      <p>In this section we illustrate the results for all three tasks. 8 Notice that for Task
3 the scores are cumulative for both event trigger and participant detection. For
all tasks, the ranking is based on the average F1 of the systems on the two test
distributions (i.e. India and China). Table 5 reports the results for each task and
the corresponding rankings. For ease of comparison, we also report the distance
from the best-ranking system for each task, expressed as a difference in F1 scores.</p>
      <p>7 We used version 3.9.2.</p>
      <p>8 Results and rankings were taken from the Codalab page of the ProtestNews Lab,
available at https://competitions.codalab.org/competitions/22349#results.</p>
      <p>Quite disappointingly, the drops on the China test data are not negligible,
although they span a fairly wide range. The smallest drop is on Task 2, where the
system does not perform optimally on the India test data either (F1 0.631). On the
other hand, the largest drop is on Task 1, where the system loses 0.21 F1 points
when applied to the China data set. As for Task 3, the drop is still substantial
(0.144 points). However, we also noticed that Precision and Recall
on the China data are quite well balanced (P = 0.492; R = 0.425), indicating some
robustness of the trained model.</p>
    </sec>
    <sec id="sec-4">
      <title>Data Analysis</title>
      <p>
        To better understand the results of our systems on the three tasks, we
conducted an analysis of the data to highlight similarities and differences.
Following recent work [
        <xref ref-type="bibr" rid="ref13 ref23">23,13</xref>
        ], we embrace the view that corpora, even when on the
same domain, are not monolithic entities but rather regions in a high-dimensional
space of latent factors, including topics, genres, writing styles, and years
of publication, among others, that express similarities and diversities. In
particular, we investigated to what extent the test sets of the three tasks occupy
similar (or different) portions of this space with respect to their
corresponding training distributions. To do so, we used two metrics, the Jensen-Shannon
(J-S) divergence and the out-of-vocabulary (OOV) token rate, which previous
work in transfer learning [28] has shown to be particularly useful for this kind
of analysis.
      </p>
      <p>The J-S divergence assesses the similarity between two probability distributions,
q and r, and is a smoothed, symmetric variant of the Kullback-Leibler
divergence. The OOV rate, on the other hand, can be used to assess the differences
between data distributions, as it highlights the percentage of unknown tokens.
All measures were computed between the training data and the two test
distributions, i.e. India and China. Results are reported in Table 6.</p>
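Both measures are simple to compute over unigram distributions. A minimal sketch, with an invented toy corpus (our reading of the setup; base-2 logarithm is one common convention for J-S):

```python
# Jensen-Shannon divergence and OOV rate over token lists.
import math
from collections import Counter

def js_divergence(tokens_a, tokens_b):
    """Smoothed, symmetric variant of KL divergence: JS(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m)."""
    vocab = set(tokens_a) | set(tokens_b)
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    p = {w: ca[w] / len(tokens_a) for w in vocab}
    q = {w: cb[w] / len(tokens_b) for w in vocab}
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}  # mixture distribution
    def kl(x, y):
        return sum(x[w] * math.log2(x[w] / y[w]) for w in vocab if x[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def oov_rate(train_tokens, test_tokens):
    """Share of test tokens unseen in training."""
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

train = "protesters march in delhi streets".split()
identical = js_divergence(train, train)  # 0.0 for identical distributions
```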
      <p>As the figures in Table 6 show, the test distributions, India and China, can be
seen as occupying quite different portions of this space. Not
surprisingly, the China distributions are very different from their respective training
ones, although to varying degrees. This variation in similarity, however, also
affects the India test distributions, where the highest similarity is observed for
Task 1 (0.922) and the lowest for Task 3 (0.703). The J-S scores show that the
test distributions for Task 3 and Task 2 are even more different than those for
Task 1, indicating that the differences in performance of the models across the
three tasks are subject to these variations in similarity. As further support
for this observation, we found a significant positive correlation
between the J-S similarity scores and the F1 values across the three tasks (0.901,
p&lt;.05).</p>
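The correlation check reported above can be reproduced with a plain Pearson coefficient. A self-contained sketch; the numeric pairs below are invented for illustration, not the paper's actual (J-S, F1) values:

```python
# Pearson correlation coefficient, computed from scratch.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented similarity/F1 pairs, just to show the call shape.
r = pearson([0.922, 0.703, 0.75, 0.80, 0.85, 0.72],
            [0.790, 0.456, 0.553, 0.62, 0.631, 0.50])
```

With only a handful of tasks, the p-value should of course be interpreted cautiously.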
      <p>The OOV rates support the observations based on the J-S divergence.
The OOV rates for the India test distributions are much lower than those for
China, clearly signalling strong lexical differences among
the data sets. The OOV rates for India and China on Task 3 are much closer
than those for Tasks 1 and 2, and yet the difference in overall F1 scores for this
task between the two test distributions is quite large (F1 0.600 for India vs.
F1 0.456 for China), suggesting that OOV is actually a weaker predictor
of performance differences between data distributions. Indeed, we found
a non-significant negative correlation between OOV rates and the
F1 scores across the tasks (-0.804, p&gt;.05).</p>
      <p>Finally, a further aspect that accounts for the behavior of the models concerns
the proportions of the predictions. In particular, for Tasks 1 and 2 the proportions
of predictions in the two classes (protest vs. non-protest) are the same as
in the training sets for the India data (i.e. in-domain), while they drop when
the models are applied to the China tests. In Task 3, the system predicts a similar proportion
of event triggers in both test distributions (0.90 events per sentence on India, and
0.95 events per sentence on China, respectively), although slightly lower than that
observed in training. On the other hand, the proportions of predicted arguments
per event differ: on the India data, they are in line with those of the
training set (2.51 vs. 2.24 in training), while they are lower on the
China data (1.94 vs. 2.24 in training). These observations further
indicate that the systems for Tasks 1 and 2 are more dependent on the training
set, while the system for Task 3 appears to be more resilient to out-of-domain
data.</p>
    </sec>
    <sec id="sec-5">
      <title>Alternative Methods: What Did Not Work</title>
      <p>
        In this section we briefly report on alternative methods that actually turned out
to be detrimental to performance with respect to the final settings. We mainly
focused on changing modelling strategies, using different algorithms and
paradigms, rather than attempting to extend the training materials.
      </p>
      <p>
        Task 1 and 2 - Inductive Transfer Learning. In an attempt to build a system
that generalises better across data sets, we tried exploiting recent advances
in transfer learning. Combining a fine-tuned language model with a classifier has
been shown to be a sound strategy for classifying text [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We thus experimented
with two pre-trained contextualized embedding models, ELMo [
        <xref ref-type="bibr" rid="ref21 ref22">22,21</xref>
        ] and BERT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In both cases, we extracted fixed sentence representations and used them in
combination with a linear SVM with the default hyper-parameters provided by
the scikit-learn implementation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. With BERT, we obtain fixed
representations by applying average pooling to the second-to-last hidden layer of the
pre-trained BERT base model. For Task 1, we represented a document as the
average of the sentence embeddings obtained with ELMo or BERT, using
spaCy to split documents into sentences. We experimented with frozen
and fine-tuned weights: with ELMo, we fine-tuned the inner three layers, while
with BERT we started from the pre-trained base English model and trained it for 10,000 steps
on the India training set. In this latter case, the training data was
assembled by combining the document and sentence training corpora. Finally,
we also experimented with an ensemble model using both word and character
n-grams and BERT embedding representations.
      </p>
      <p>We obtained promising results on the development set but, surprisingly,
performance dropped when applied to the test distributions (F1 = 0.466 for
ELMo, and F1 = 0.567 for BERT, respectively). The ensemble model using
both dense and sparse representations outperformed the simpler model by 0.1
F1 points.</p>
      <p>
        Task 1 and 2 - Convolutional Neural Networks (CNN). It has been shown
that character-level convolutional neural networks perform well in document and
sentence classification tasks [31] and, being character-based, these models are
in theory not severely harmed by OOV words, making portability across
test distributions less error-prone. We experimented with the architecture
described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], randomly initializing character embeddings, which are then
passed as input to a stack of convolutional networks with kernel sizes ranging
from 3 to 7. As a regularization method, we use 10% dropout [29]. As with
the inductive transfer learning approaches, good results on the
development set were followed by very poor test results (F1 = 0.427). For
this approach, we hypothesize that the training data is too small to effectively
use randomly initialized embeddings, even character-based ones.
      </p>
      <p>
        Task 1 and 2 - FastText. We experimented with Facebook's FastText system,
an off-the-shelf supervised classifier [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We trained two versions of the
system using different token n-gram representations (i.e. bigrams and trigrams),
the wiki-news-300d-1M-subword.vec FastText embeddings with subwords for
English, and varying learning rates (ranging from 0.1 up to 1.0). We fine-tuned
the learning rate against the development data. Pre-processing of the data is
the same as for the final system, namely lowercasing and removal of
special characters (e.g. #, (, ...) and digits. When applied to the test data,
the best model scored F1 0.608. We also observed that bigrams performed
best for Task 1, regardless of the test distribution, while for Task 2, bigrams
worked best for the in-domain test distribution, i.e. India, and trigrams for the
out-of-domain one, i.e. China.
      </p>
      <p>
        Task 3 - Multi-task Learning. We investigated whether a multi-task learning
architecture, still based on the Bi-LSTM network, could be a viable solution to
improve the system's performance and portability. Given the incompatibility of
the Task 3 annotations with other existing corpora for event extraction (e.g.
ACE and POLCON [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), we opted for a multi-task learning approach, as it has
proven useful to address the scarcity of labelled data. However, we adopted a slightly
different strategy: rather than using an alternative task in support of our
target task, e.g. semantic role labelling in support of opinion role labelling [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
we used an alternative data set annotated with different labels but targeting
the same problem. We thus extracted all sentences annotated with Attack and
Demonstrate events from the ACE corpus and used them as a support task in a
multi-task learning setting. In this case, we achieved an average F1 of 0.517 over
both test distributions, 0.011 points lower than the final submitted system.
On the positive side, however, we observed that the multi-task model obtains the
best Precision score on the China test data (0.541), although at a large expense
of Recall.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>Our contribution mainly focused on two aspects: a.) assessing the most viable
approach for each task at stake and maximizing portability with limited effort; b.)
explaining the limits of the trained models in terms of similarities and differences
between training and test distributions rather than limiting ourselves to technical
aspects of the systems.</p>
      <p>Task 1 and Task 2 have shown that a simple system can obtain competitive
results in an unsupervised domain adaptation setting. This is actually
encouraging and calls for further investigation in this direction by focusing efforts
on parameter optimisation. We also believe that the lack of any material for the
out-of-domain distributions is a further challenge to take into account, as no
fine-tuning of the models on the target domain was possible. However much
effort we put into developing maximally "generalisable" systems, the
dependence of the models on the training materials remains high, thus raising the
question of whether we are just modelling data sets rather than linguistic phenomena.</p>
      <p>Task 3 has highlighted the contribution of both more complex
architectures, such as a Bi-LSTM-CRF network, and contextualised embedding
representations, such as ELMo. In this specific case, the trained model is able
to predict a comparable number of event triggers across the two test
distributions, although it suffers on the argument sub-task, where fewer arguments are
predicted for the out-of-domain data. Unfortunately, the evaluation format does
not allow quantifying the losses per sub-task and per trained model.</p>
      <p>Finally, the similarity and diversity measures (i.e. J-S divergence and OOV rate)
proved useful tools for better understanding the different behaviors of the
systems on the two test distributions. It is worth noticing how the J-S similarity scores
correlate with the F1 scores of the trained models, suggesting that it could be
possible to quantify, or predict, the performance loss of a system before applying it to
out-of-domain test distributions and, consequently, to take action to minimize the
losses.
</p>
      <p>Association for Computational Linguistics (Volume 1: Long Papers). pp.
1192-1202. Association for Computational Linguistics, Berlin, Germany
(Aug 2016). https://doi.org/10.18653/v1/P16-1113, https://www.aclweb.org/anthology/P16-1113</p>
      <p>28. Ruder, S., Plank, B.: Learning to select data for transfer learning with Bayesian
optimization. In: Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. pp. 372-382. Association for Computational Linguistics,
Copenhagen, Denmark (September 2017), https://www.aclweb.org/anthology/D17-1038</p>
      <p>29. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research 15, 1929-1958 (2014)</p>
      <p>30. Weber, N., Shekhar, L., Balasubramanian, N.: The fine line between
linguistic generalization and failure in seq2seq-attention models. arXiv preprint
arXiv:1805.01445 (2018)</p>
      <p>31. Zhang, X., Zhao, J.J., LeCun, Y.: Character-level convolutional networks for text
classification. In: NIPS (2015)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The stages of event extraction</article-title>
          .
          <source>In: Proceedings of the Workshop on Annotating and Reasoning about Time and Events</source>
          . pp.
          <volume>1</volume>
          -
          <issue>8</issue>
          . Association for Computational Linguistics (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bethard</surname>
          </string-name>
          , S.:
          <string-name>
            <surname>ClearTK-TimeML</surname>
          </string-name>
          :
          <article-title>A minimalist approach to TempEval 2013</article-title>
          . In: Second Joint Conference on Lexical and
          <article-title>Computational Semantics (* SEM)</article-title>
          .
          <source>vol. 2</source>
          , pp.
          <volume>10</volume>
          -
          <issue>14</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Glockner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shwartz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Breaking NLI systems with sentences that require simple lexical inferences</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <fpage>650</fpage>
          -
          <lpage>655</lpage>
          . Association for Computational Linguistics (
          <year>2018</year>
          ), http://aclweb.org/anthology/P18-2103
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Hurriyetoglu,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , Yoruk, E., Yuret,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Yoltar</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          , Gurel,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Durusan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Mutlu</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.:</surname>
          </string-name>
          <article-title>A task set proposal for automatic protest information collection across multiple countries</article-title>
          . In: Azzopardi,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Mayr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Hau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (eds.) Advances in Information Retrieval. pp.
          <volume>316</volume>
          {
          <fpage>323</fpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Refining event extraction through cross-document inference</article-title>
          .
          <source>Proceedings of ACL-08: HLT</source>
          pp.
          <fpage>254</fpage>
          -
          <lpage>262</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>arXiv preprint arXiv:1607.01759</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Komninos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manandhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Dependency based embeddings for sentence classification tasks</article-title>
          .
          <source>In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>1490</fpage>
          -
          <lpage>1500</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lake</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks</article-title>
          .
          <source>arXiv preprint arXiv:1711.00350</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batchelor</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Automatic recognition of conceptualization zones in scientific articles and two life science applications</article-title>
          .
          <source>Bioinformatics</source>
          <volume>28</volume>
          (
          <issue>7</issue>
          ),
          <fpage>991</fpage>
          -
          <lpage>1000</lpage>
          (Apr
          <year>2012</year>
          ). https://doi.org/10.1093/bioinformatics/bts071
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lorenzini</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makarov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriesi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wueest</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Towards a dataset of automatically coded protest events from English-language newswire documents</article-title>
          .
          <source>In: Paper presented at the Amsterdam Text Analysis Conference</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF</article-title>
          .
          <source>arXiv preprint arXiv:1603.01354</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Stanford CoreNLP Natural Language Processing Toolkit</article-title>
          .
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. System Demonstrations</source>
          . pp.
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Marasovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SRL4ORL: Improving opinion role labeling using multi-task learning with semantic role labeling</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)</source>
          . pp.
          <fpage>583</fpage>
          -
          <lpage>594</lpage>
          . Association for Computational Linguistics, New Orleans, Louisiana (Jun
          <year>2018</year>
          ). https://doi.org/10.18653/v1/N18-1054, https://www.aclweb.org/anthology/N18-1054
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>TUWienKBS at GermEval 2018: German abusive tweet detection</article-title>
          .
          <source>In: 14th Conference on Natural Language Processing KONVENS 2018</source>
          . p.
          <fpage>45</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Event detection and domain adaptation with convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>
          . vol. 2, pp.
          <fpage>365</fpage>
          -
          <lpage>371</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (Oct),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>To tune or not to tune? Adapting pretrained representations to diverse tasks</article-title>
          .
          <source>arXiv preprint arXiv:1903.05987</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proc. of NAACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Plank</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Noord</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Effective measures of domain similarity for parsing</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          . pp.
          <fpage>1566</fpage>
          -
          <lpage>1576</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks</article-title>
          .
          <source>CoRR abs/1707.06799</source>
          (
          <year>2017</year>
          ), http://arxiv.org/abs/1707.06799
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>338</fpage>
          -
          <lpage>348</lpage>
          . Association for Computational Linguistics, Copenhagen, Denmark (
          September
          <year>2017</year>
          ), https://www.aclweb.org/anthology/D17-1035
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Open domain event extraction from twitter</article-title>
          .
          <source>In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>1104</fpage>
          -
          <lpage>1112</lpage>
          . ACM
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Neural semantic role labeling with dependency path embeddings</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>