        Multitask Models for Supervised Protest
                  Detection in Texts

                     Benjamin J. Radford1[0000−0002−8440−0655]

         University of North Carolina at Charlotte, Charlotte NC 28223, USA
                             benjamin.radford@uncc.edu



        Abstract. The CLEF 2019 ProtestNews Lab tasks participants to iden-
        tify text relating to political protests within larger corpora of news data.
        The three tasks are article classification, sentence detection, and event
        extraction. I apply multitask neural networks capable of producing pre-
        dictions for two and three of these tasks simultaneously. The multitask
        framework allows the model to learn relevant features from the train-
        ing data of all three tasks. This paper demonstrates performance near
        or above the reported state-of-the-art for automated political event cod-
        ing, though noted differences in research design make direct comparisons
        difficult.

        Keywords: event data · neural networks · political protests




1     Introduction

Hürriyetoğlu et al. propose a competitive lab in which participants are tasked
with producing models to automatically identify indicators of protest in cross-
country (but monolingual) text corpora [13]. With respect to application area,
this challenge builds on work done in political science on deriving structured data
about political events of interest from unstructured texts (i.e. news). Method-
ologically, the lab is structured as a competition in which select data are provided
to competitors for model training and other data are withheld for evaluation pur-
poses. The tasks themselves fall into the categories of text classification at the
document (task 1) and sentence (task 2) levels and semantic role labeling (task
3).
    This paper proceeds by first introducing the three challenge tasks and the
provided data. Then the two models are described: one model for tasks 1 and 2
and a second model for tasks 1, 2, and 3. Results for each task are discussed and
compared to the published state-of-the-art on similar tasks. Finally, directions
for future research are highlighted.
    Copyright © 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 Septem-
    ber 2019, Lugano, Switzerland.
1.1   Data and Task Description
The competition comprises three tasks. All three tasks evaluate participants’
ability to identify indicators of protest events in English text data. However,
the three tasks differ in resolution (document-, sentence-, and word-level data)
and in provided data sets. Each task comprises four data sets: train, dev, test,
and China. As its name implies, the train data set is used to train models. The
dev data set is a validation set provided to participants for model fine-tuning.
The test data set is out-of-sample and therefore the labels associated with these
data are withheld from participants. Similarly, the China data set is an out-of-
sample set used to evaluate cross-country performance of models. Train, dev, and
test contain text data from English-language news stories collected from Indian
sources. The China data set is composed of English-language news collected
from Chinese sources. Lab participants are able to observe X and y, the texts
and associated labels, for train and dev. Participants can only observe X, the
texts, for test and China.
    The role of the China set is to evaluate the performance of models in a cross-
country setting; more specifically, the China data appear similar in form to the
test data (in that they are English news wire text) but are generated by different
underlying data generating processes (DGP). The DGPs of the train, dev, and test
sets represent Indian political and social processes as well as reporting norms,
standards, and laws. The China data set represents the same for China.
    A small amount of data preprocessing is performed prior to modeling. All
non-alphanumeric characters are removed and all whitespace characters (e.g.
tabs, newlines, spaces) are replaced with a single space. For tasks 1 and 2,
characters are all converted to lowercase.1 All sequences are zero-padded such
that every sequence within a given task’s corpus is of equal length. The sequence
length for each corpus is equivalent to the maximum sequence length observed
in that corpus prior to padding (given in the following subsections). This is done
to satisfy a software requirement that input sequences are of the same length
during model training.
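A minimal sketch of this preprocessing in Python is given below. The exact regular expressions, the padding side, and the helper names are assumptions on my part; only the operations themselves (stripping non-alphanumeric characters, collapsing whitespace, optional lowercasing, and zero-padding to the corpus maximum length) follow the description above.

import re
import numpy as np

def clean_text(text, lowercase=True):
    """Remove non-alphanumeric characters and collapse whitespace runs."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)   # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()     # tabs/newlines/spaces -> one space
    return text.lower() if lowercase else text   # lowercasing: tasks 1 and 2 only

def pad_corpus(vector_sequences, dim=300):
    """Zero-pad a list of (length_i, dim) arrays to the corpus maximum length."""
    max_len = max(len(seq) for seq in vector_sequences)
    padded = np.zeros((len(vector_sequences), max_len, dim))
    for i, seq in enumerate(vector_sequences):
        if len(seq):
            padded[i, :len(seq), :] = seq        # post-padding is assumed here
    return padded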

Document Classification Task 1 challenges participants to classify docu-
ments, in this case news articles, as either relating to a protest or not
relating to a protest. Documents in the train data set vary in length from 44 words
to 1599 words. The mean document length is 312 words. Total data set sizes are
given in Table 1.

Sentence Classification Task 2 is similar to task 1 but is performed at the
sentence level rather than the document level. Given a sentence, the model is
tasked with predicting whether the sentence describes a protest event. Task 2 train
1
  For task 3, characters are not converted to lowercase because it would have nega-
  tively impacted the performance of the named entity recognition preprocessing step,
  described later.
2
  Including control words that indicate the beginning and end of sequences.
                             Table 1. Data set size by task.

                                    Task 1       Task 2      Task 3
                      Data set   (documents)  (sentences)  (words2 )
                      Train            3,429        5,884     21,873
                      Dev                456          662      3,224
                      Test               686        1,106      6,586
                      China            1,800        1,234      4,387



data set sentences range in length from one word to 150 words and have a mean
length of 24 words.3


Semantic Role Labeling Task 3 differs from tasks 1 and 2 in that it is ef-
fectively a multiclass classification problem. Given sentences tokenized at the
word level, participants are tasked with identifying sets of words (or phrases)
that represent particular roles in the context of a protest. These roles include
triggers, locations, facilities, organizers, participants, event times, and targets.
Tokens are labeled using IOB (inside, outside, beginning) tags [22]. Tokens la-
beled “O” are outside of a role tag. Tokens labeled “B” represent the beginning
of a phrase that is associated with one of the roles. Tokens labeled “I” are inside
of an identified phrase associated with a role. An example from the train dataset
is given in Figure 1.


           The    protesters       blocked     the   Rayakottai    road       .
           O      B-participant    B-trigger   O     B-fname       I-fname    O

Fig. 1. An example of a role-labeled sentence from the train data set. The top row is
the provided sentence and the bottom row consists of token-level role annotations.




1.2     Prior Work

A robust research effort within political science has seen many iterations on tech-
niques for both manual and automatic coding of event records from unstructured
texts. Most previous work on automated event coding relies on large dictionaries
of terms and phrases organized into known ontological categories. These dictio-
naries are provided alongside text data to event-coding software that performs
pattern matching to identify instances of dictionary phrases within the texts.
If those phrases found in the texts match a set of heuristics, the software pro-
duces an event record. Protests, the event category of interest here, represent
3
    In fact, two entries appear to have no words – they are empty strings. It is unclear if
    this is a problem with the original data, the download process, or the pre-processing
    steps.
just one ontological category within the CAMEO ontology, the most common
of event-coding ontologies in use today [9]. Dictionary-based event coding soft-
ware includes TABARI [24] and PETRARCH [25], which have been used in the
production of many event data sets including ICEWS [19], GDELT [15], and
Phoenix [2].
    While most dictionaries for coding event data are hand-coded by researchers,
recent efforts have sought to largely automate the dictionary generation process
as well – a step towards a fully-automated event data pipeline.4 One such ef-
fort makes use of distributed word vectors to populate dictionaries given a small
input set of exemplar (“seed”) phrases [20, 21]. Similar work leverages label prop-
agation to expand a given set of terms and phrases for event coding [16].
    Most recently, supervised learning has been applied to the problem of event
identification within text with the goal of producing an end-to-end solution. A
neural network technique similar to that presented here was used by Beieler
to label sentences according to Schrodt's QuadClass ontology [3, 23]. That
research assumed the existence of an event in the provided text and tasked a
model to classify the event as one of four types; this differs from the task at
hand – to predict event existence versus non-existence and to identify the key
actors and actions relevant to a protest event.


2     Models

While the model presented here for tasks 1 and 2 differs from the model for
task 3, they have several properties in common. Both models are examples of
recurrent neural networks (RNNs). RNNs expect time-ordered inputs and are
able to model time-dependent sequences by persisting information across time
steps. These models differ from traditional autoregressive statistical models in
that the lag structure is variable, not pre-determined. Both models are also
multitask; that is, each model is trained on examples from two or more of the
tasks simultaneously. Finally, the inputs to both models are, at least in part,
sequences of words (or tokens). However, prior to the modeling stage, every
word has been replaced by its corresponding word vector. The word vectors
are pre-trained on the English Wikipedia corpus using FastText, a neural net-
work language model that leverages both contextual information and sub-word
information to produce word vectors [4]. Word vectors are real-valued numer-
ical vectors that encode semantic and syntactic relationships between words.
The word vector representations of synonyms should be close to one another
4
    It is arguable that a fully-automated event data pipeline is not desirable – if one
    intends to produce structured event records, they likely desire those records to con-
    form to some mental model. Withholding the desired mental model from the event
    data collection process risks producing records that do not conform to the desired
    categories or ontology. Therefore, it is difficult to imagine scenarios where fully-
    unsupervised event data collection is preferable to supervised or semi-supervised
    event data collection.
(where “closeness” often means having a high cosine similarity). FastText mod-
els words as the combination of sub-word n-grams (letter sequences). The use of
pre-trained word vectors has become common for applications in which training
a novel word embedding model may be infeasible due to, for example, corpus size
or compute resources [17]. FastText-based vectors were chosen because, unlike
word2vec-based vectors, out-of-sample inference can be performed with Fast-
Text. If words exist in our corpora that do not exist in the vocabulary that the
FastText model was trained on, new word vectors for those out-of-sample words
can be derived from the sub-word (i.e. character n-gram) information of those
out-of-sample words.
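As a brief illustration of this property (a sketch only; the paper does not state which library was used to query the vectors, and the gensim loader and file name below are assumptions), an out-of-vocabulary token still receives a vector composed from its character n-grams:

from gensim.models.fasttext import load_facebook_vectors

# Pre-trained English Wikipedia FastText vectors with sub-word information;
# the file name is an assumption about how the model is stored locally.
vectors = load_facebook_vectors("wiki.en.bin")

print(vectors["protest"].shape)       # (300,): ordinary in-vocabulary lookup

# A misspelled token unlikely to appear in the training vocabulary still
# receives a 300-dimensional vector built from its character n-grams.
print(vectors["protesterz"].shape)    # (300,)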



[Fig. 2 diagram: word-vector inputs w1 ... wn feed a biLSTM layer whose final-step output
connects to two dense sigmoid layers producing the task 1 and task 2 predictions.]

Fig. 2. Model architecture for tasks 1 and 2. Sequence input is shown on the left for
word vectors w1 through wn.




2.1    Tasks 1 and 2 Model

The model for tasks 1 and 2 is simple. It takes as input a time-ordered sequence
of tokens (i.e. words) of arbitrary length (and possibly padded with zero vec-
tors) and outputs a document-level and sentence-level prediction that the given
sequence describes a protest event. The input tokens are length 300 real-valued
distributed word vectors derived from the pre-trained FastText model. The
model's first layer consists of 10 bidirectional long short-term memory (LSTM) RNN
cells with no activation function [11]. The layer outputs only the activation val-
ues of the cells at the final sequence token – a 10 × 1 real-valued vector. This
output connects to two dense, fully-connected layers of size 10 × 1 that compute
the weighted sum of the 10 activation values from the LSTM’s output. One of
these two layers is trained only on examples from task 1 (documents) and the other
is trained only on examples from task 2 (sentences). Both layers' outputs are
subject to a sigmoid activation function that maps output values between 0 and
1 corresponding to predictions of non-protest or protest, respectively. Dropout of
between 0.4 and 0.6 is applied between each layer (including the input layer) and
values are chosen empirically using the dev data set. The selected loss function
is log loss and the model is fit with RMSProp [10]. The model architecture is
shown in Figure 2.
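A minimal Keras sketch of this architecture follows. It is an illustration rather than the submitted code: the exact dropout rates, the doubling of the output dimension by the bidirectional wrapper, and the mechanics of alternating batches between the two tasks are assumptions on my part.

from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, EMBED_DIM = None, 300                 # padded FastText word-vector sequences

inputs = keras.Input(shape=(SEQ_LEN, EMBED_DIM))
x = layers.Dropout(0.5)(inputs)                # dropout applied to the input layer
x = layers.Bidirectional(layers.LSTM(10, activation=None))(x)  # final-step output only
x = layers.Dropout(0.5)(x)

# One sigmoid head per task; each is trained only on its own task's examples.
doc_out = layers.Dense(1, activation="sigmoid", name="task1_document")(x)
sent_out = layers.Dense(1, activation="sigmoid", name="task2_sentence")(x)

model = keras.Model(inputs, [doc_out, sent_out])
model.compile(optimizer="rmsprop", loss="binary_crossentropy")   # log loss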
    The multitask nature of this model, having separate outputs for document
and sentence-level predictions, allows the two tasks to jointly train the single
LSTM layer; this effectively increases the training data size for this layer. Having
two outputs allows the model to specialize for the two subtly-different tasks.
Given a document, it may be the case that only a small portion of the document
(a handful of words) refers to a protest. On the other hand, given a single sentence
about a protest, it is likely that a relatively larger portion of the words in that
sentence refer to the protest in question. Task 1 requires that the model be
sensitive to the small proportion of words indicative of a protest event in a
larger document; task 2 is not necessarily so constrained. However, in hindsight,
separating these tasks may not have been necessary: the document sub-model
performs comparably to the sentence sub-model on sentence input and vice versa.


[Fig. 3 diagram: word-vector inputs w1 ... wn feed a shared biGRU; its max-pooled output
connects to dense sigmoid layers for the task 1 and task 2 predictions, while the full
biGRU output sequence is concatenated with the word vectors and the embedded entity
inputs ent1 ... entn and passed through a tanh biGRU and a softmax GRU for the task 3
prediction.]

Fig. 3. Model architecture for task 3. Sequence input is shown on the left for word
vectors w1 through wn and corresponding one-hot encoded entity values ent1 through
entn.




2.2        Task 3 Model
The model architecture for task 3 differs from that of tasks 1 and 2.5 The input
is still a time-ordered sequence of word vectors representative of a document
or sentence. The LSTM layer has been replaced by a layer of 20 bidirectional
gated recurrent units (GRU) [6]. Instead of outputting the activation values of
the GRU layer for the last sequence token, all activation values for the sequence
are output. For the portions of the model corresponding to tasks 1 and 2, the
5
    This is due, at least in part, to the fact that the competition was structured in such
    a way that tasks 1 and 2 were judged simultaneously and task 3 was judged later. It is
    my belief that the model for task 3 would have fared similarly well had it diverged
    less from the model for tasks 1 and 2.
output sequences from the GRU layer are collapsed along the time axis by taking
the maximum value of each GRU unit across timesteps. Aside from these minor changes
and adjustments to dropout rates, the task 1 and task 2 sub-models are as
described above. Task 3, the semantic role labeling task, requires a more compli-
cated model. In particular, this sub-model consists of an additional bidirectional
GRU layer with hyperbolic tangent activation and a subsequent unidirectional
GRU layer that outputs a sequence of softmax-normalized predictions for each
word’s semantic role. The bidirectional GRU layer in this sub-model inputs not
only the output sequence produced by the shared GRU layer but also inputs
the original sequence of word embeddings as well as a sequence corresponding to
the named entities identified in the input sequence. The three input sequences
(shared GRU output, word vectors, and named entities) are concatenated word-
for-word. Named entities are discovered using Spacy, a natural language process-
ing module written in Python [12]. Dropout of 0.25 is included between every
layer of the task 3 model. The full model architecture is shown in Figure 3.
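The sketch below illustrates this architecture in Keras. It is again an approximation rather than the submitted code: the number of role tags, the number and embedding size of the spaCy entity types, and the optimizer and losses are assumptions, and the recurrent softmax output layer simply mirrors the description above.

from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, EMBED_DIM = None, 300
N_ROLES = 15          # number of IOB role tags; the exact count is an assumption
N_ENTITY_TYPES = 20   # number of spaCy entity types; an assumption

word_in = keras.Input(shape=(SEQ_LEN, EMBED_DIM), name="word_vectors")
ent_in = keras.Input(shape=(SEQ_LEN,), name="entity_ids")   # integer-coded entities

shared = layers.Dropout(0.25)(word_in)
shared = layers.Bidirectional(layers.GRU(20, return_sequences=True))(shared)
shared = layers.Dropout(0.25)(shared)

# Tasks 1 and 2: collapse the time axis by max pooling, then sigmoid heads.
pooled = layers.GlobalMaxPooling1D()(shared)
doc_out = layers.Dense(1, activation="sigmoid", name="task1_document")(pooled)
sent_out = layers.Dense(1, activation="sigmoid", name="task2_sentence")(pooled)

# Task 3: concatenate the shared GRU output, the word vectors, and the
# embedded entity tags, then a tanh biGRU and a softmax GRU over every token.
ent_emb = layers.Embedding(N_ENTITY_TYPES, 8)(ent_in)       # embedding size assumed
merged = layers.Concatenate()([shared, word_in, ent_emb])
merged = layers.Dropout(0.25)(merged)
merged = layers.Bidirectional(
    layers.GRU(20, activation="tanh", return_sequences=True))(merged)
merged = layers.Dropout(0.25)(merged)
role_out = layers.GRU(N_ROLES, activation="softmax",
                      return_sequences=True, name="task3_roles")(merged)

model = keras.Model([word_in, ent_in], [doc_out, sent_out, role_out])
model.compile(optimizer="rmsprop",                           # optimizer assumed
              loss=["binary_crossentropy", "binary_crossentropy",
                    "categorical_crossentropy"])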



3     Results

The two models are able to perform all three tasks at levels competitive with
the reported results of similar research efforts.
    The models were each trained for 100 epochs on consumer-grade hardware
including a 6 core CPU and an NVIDIA 1070Ti GPU.6 Training times were
typically under 30 minutes. Models were written in Keras [7], a machine learning
library for Python that wraps TensorFlow [1].
    Due to the setup of the Lab, the true outcome values (y) for all test and China
data sets are unavailable. Therefore, only the high-level summary metrics (F1
scores) provided as feedback by the ProtestNews Lab online evaluation system
are reported for those data sets. More complete results (including precision and
recall) are provided for the train and dev data sets because their labels are
available to participants and these metrics can be computed directly.
    While the models are able to generalize out-of-sample, their performance
degrades noticeably as the task resolution becomes finer (from documents to
sentences to words) and as the data transition from in-sample to validation,
out-of-sample, and out-of-DGP. This is not surprising as the training data sets
for tasks 1, 2, and 3 contain less overall information for each subsequent task
(and, in the case of task 3, ask more of that limited amount of information).
The decreasing performance from in-sample to out-of-sample data sets points to
overfitting, a common problem in models with many parameters and one that
can sometimes be remedied with additional training data and data augmentation
techniques.

6
    An epoch is defined as 20 batches per task where a batch is comprised of 128 training
    examples.
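One way to implement this alternating-batch training scheme, continuing the Keras sketch from Section 2.1, is to compile a single-head view of the model per task so that each task's batches update only the shared LSTM layer and that task's own output head. This is an illustration, not the competition code; the helper and array names below (sample_batch, X_task1, y_task1, and so on) are placeholders.

import numpy as np
from tensorflow import keras

def sample_batch(X, y, size=128):
    """Draw a random training batch of 128 examples (per footnote 6)."""
    idx = np.random.choice(len(X), size=size, replace=False)
    return X[idx], y[idx]

# Single-head views that share the layers of `model` from the Section 2.1 sketch.
doc_model = keras.Model(model.input, model.get_layer("task1_document").output)
sent_model = keras.Model(model.input, model.get_layer("task2_sentence").output)
doc_model.compile(optimizer="rmsprop", loss="binary_crossentropy")
sent_model.compile(optimizer="rmsprop", loss="binary_crossentropy")

for epoch in range(100):          # 100 epochs
    for _ in range(20):           # 20 batches per task per epoch
        Xb, yb = sample_batch(X_task1, y_task1)
        doc_model.train_on_batch(Xb, yb)
        Xb, yb = sample_batch(X_task2, y_task2)
        sent_model.train_on_batch(Xb, yb)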
[Fig. 4 diagram: row-normalized confusion matrices on the dev set. (a) Task 1 (Document):
true No Protest is predicted as No Protest 95% of the time and as Protest 5%; true
Protest is predicted as No Protest 21% of the time and as Protest 79%. (b) Task 2
(Sentence): true No Protest is predicted as No Protest 96% of the time and as Protest
4%; true Protest is predicted as No Protest 49% of the time and as Protest 51%.]

                      Fig. 4. Model performance for tasks 1 and 2.



3.1             Task 1 and 2 Results

The multitask model for tasks 1 and 2 is able to classify documents with 92%
accuracy on the dev set with an F1 score of 0.80.7 As can be seen in Figure 4(a),
the model correctly predicts 95% of non-protest events and 79% of protest events
in the dev set. In Table 2 we can see the true out-of-sample F1 scores associated
with the test and China sets are 0.84 and 0.66, respectively. This result on the
test set is encouraging because it is actually above the corresponding value on
the dev set, 0.80, and suggests that the model has not been overfit to the dev set
through hyperparameter selection.
    Model performance deteriorates somewhat for task 2, sentence classification.
While the model accuracy is still high at 87%, its precision has dropped markedly.
In other words, the model is able to accurately classify non-protest event sen-
tences (96% accuracy) but only classifies protest events correctly 51% of the
time. This can be seen in Figure 4(b). In out-of-sample tests, the model achieves
an F1 score of 0.66 on the test set and 0.46 on the China set.
7
  Note that the dev set was available at training time but was at no point provided
  to the model before inference was performed. Therefore it is out-of-sample but was
  available for hyperparameter tuning. Only limited results are available for the true
  out-of-sample datasets as these sets are held by the lab organizers.
8
  All values computed using contest-provided code. Precision and recall values were
  not provided by the contest organizers in the output of test and China data set eval-
  uations and are therefore unavailable here. Unfortunately, the random seed values
  for the models trained and evaluated on the test and China data sets were lost and
  so the (train and dev ) and (test and China) results are from two different model
  runs.
                       Table 2. Results for task 1 and 2 model.8

                               Task 1                   Task 2
                       Precision Recall      F1 Precision Recall     F1
                 Train      0.93   0.84     0.88     0.63   0.82    0.71
                 Dev        0.79   0.81     0.80     0.51   0.79    0.62
                 Test          –      –     0.84        –      –    0.66
                 China         –      –     0.66        –      –    0.46



3.2     Task 3 Results

The model that produced the Lab-submitted results for task 3 is actually capable
of performing all three tasks, the results of which are shown in Table 3. However,
I focus here on the results for semantic role labeling as this model was only
evaluated on the test and China data sets for that particular task. The precision,
recall, and F1 scores shown here are multiclass weighted averages computed with
a Lab-provided script. The unweighted average accuracy of this model on the
dev set is very high, 94%, due largely to class imbalance. The model correctly
predicts that most words in each sentence are not one of the selected roles.
However, the model appears to generalize poorly: the F1 score for the train data
set is 0.82 but drops to 0.50, 0.52, and 0.39 for the dev, test, and China data sets,
respectively. This is indicative of a model that is overfit to the training data.


                          Table 3. Results for task 3 model.9

                   Task 1              Task 2              Task 3
           Precision Recall F1 Precision Recall F1 Precision Recall               F1
     Train      0.99   0.98 0.98    0.87   0.97 0.91    0.83   0.81              0.82
     Dev        0.84   0.88 0.86    0.55   0.84 0.66    0.50   0.50              0.50
     Test          –      –    –       –      –    –    0.63   0.44              0.52
     China         –      –    –       –      –    –    0.54   0.31              0.39



    An example of a task 3 dev data set sentence with actual and predicted
annotations is shown in Figure 5. This example illustrates five of the seven
role categories and includes a target, participants, an organizer, triggers, and a
location. For each row of text there are up to two rows of annotations. The top
row of annotations represents the true role values provided by the Lab organizers.
The bottom row of annotations are those predicted by the model.10
    One-versus-all classification performance for the various role types is shown
in Table 4. These metrics are evaluated on the out-of-sample dev set. The model
performs better on common role labels than less common labels; it achieves
9
   All values computed using contest-provided code. Due to time constraints imposed
   by the contest structure, performance was not evaluated for tasks 1 and 2 on the test
   and China data sets. Unfortunately, the random seed values for the models trained
   and evaluated on the test and China data sets were lost and so the (train and dev)
   and (test and China) results are from two different model runs.
[Fig. 5 text: "Govt Readies to Wield the Stick as Striking Doctors Decide to Harden Their
Stand | 15th September 05:49 AM THIRUVANANTHAPURAM: With the government appearing to be
in no mood to meet the demand of the doctors of the health service, the Kerala Government
Medical Officers Association spearheading the hunger strike in front of the state
secretariat has called for intensifying the agitation in the coming days." True and
predicted annotations for targets, triggers, participants, an organizer, and a location
are marked beneath the relevant phrases.]

Fig. 5. An example of a task 3 dev data set sentence with actual and predicted annota-
tions. The top row of annotations shows the labels provided with the data (i.e. "true
labels"). The bottom row of annotations shows those predicted by the model. For example,
"in front of the state secretariat" is coded as a location in the dev data but only the
words "front of the state secretariat" are identified by the model as a location.




                  Table 4. Role-wise performance on task 3 dev set

                                          Precision Recall F1
                          event time           0.04  0.03 0.03
                          facility name        0.13  0.07 0.09
                          location             0.00  0.00 0.00
                          organizer            0.52  0.43 0.47
                          participant          0.65  0.60 0.63
                          place                0.65  0.48 0.55
                          target               0.13  0.41 0.20
                          trigger              0.76  0.75 0.76
F1 scores greater than 0.5 on triggers, participants, and places. The model fails
to label any locations correctly. This is probably due to the model’s failure to
recognize the prepositions preceding locations as the B tokens in the location
phrase. For example, “in front of the state secretariat” should be labeled “B-
loc, I-loc, I-loc, I-loc, I-loc, I-loc.” Instead, the model predicts “O, I-loc, I-loc,
I-loc, I-loc, I-loc.” Another example from the dev set reads “near a mosque” and
should be labeled “B-loc, I-loc, I-loc.” The model instead predicts “B-fname,
I-loc, I-loc,” where “fname” represents the role “facility name.”
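The sketch below shows why a missed "B" tag can cost the model an entire phrase under a span-level reading of the IOB labels. The Lab's scoring script is not reproduced here; the exact matching rules it applies are an assumption, and the function is only an illustration.

def iob_spans(tags):
    """Extract (role, start, end) phrase spans from a flat IOB tag sequence.

    A span opens at a "B-" tag and extends over subsequent "I-" tags of the
    same role; an "I-" tag without a preceding "B-" opens nothing, which is
    why a missed "B-loc" loses the whole location phrase.
    """
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], i, i + 1]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = i + 1
        else:
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans

# The location example from the text: without the opening B-loc on "in",
# no location span is recovered at all.
true = ["B-loc", "I-loc", "I-loc", "I-loc", "I-loc", "I-loc"]
pred = ["O",     "I-loc", "I-loc", "I-loc", "I-loc", "I-loc"]
print(iob_spans(true))   # [('loc', 0, 6)]
print(iob_spans(pred))   # []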
     The model for task 3 is also able to perform document and sentence clas-
sification. While test and China set results are unavailable for this model with
respect to these two tasks, the model’s performance on train and dev improves
upon the results presented in Table 2 across the board. Future work should de-
termine whether this is due to the second model’s ability to overfit to these tasks
and data sets or due to the inclusion of task 3 data in the model’s first GRU
layer. One approach for exploring this is discussed in the paper’s final section.

3.3    Comparison to the State-of-the-Art
Hürriyetoğlu et al. present preliminary findings for tasks 1 and 2 [13]. The results
shown here compare favorably to the best of their models on both tasks.11 A
model based on BERT [8], for example, is reported to score F1=0.90 and F1=0.64
on data sets roughly equivalent to test and China for task 1, just above and
below the scores of 0.84 and 0.66 reported here. The authors report a high score
of F1=0.56 on task 2 test data achieved by a support vector machine model; this
falls short of the bidirectional LSTM that scored F1=0.66.
    Previous studies have evaluated the performance of both human and machine-
based coding for political event data. One of these reports that the ICEWS
Jabari-NLP system achieves an average top-level event category precision of
75.6% (document level). The authors further report that the system achieves
average top-level event category recall values for documents and sentences of
65% and 59%, respectively [5]. This evaluation matches most closely with tasks
1 and 2 here and, in all cases, the above-presented protest models outperform
the Jabari-NLP system results. Of course, the Jabari-NLP system was burdened
with classifying 19 different event types while the task at hand represents only
one.
    One previous study of undergraduate coders tasked with classifying top-level
event categories found that the three coders achieved precision values of 39%,
48%, and 55% [14]. As was the case with the Jabari-NLP comparison, these
particular precision values are not directly comparable to those reported for the
protest models because the human coders were given a multiclass classification
task, not a binary one.
10
   In fact, the model must distinguish between the beginning token of a role phrase
   and the "internal" tokens. For example, "Kerala Government Medical Officers As-
   sociation" would be annotated, word-for-word, "B-organizer I-organizer I-organizer
   I-organizer I-organizer." These distinctions are omitted from Figure 5 for clarity.
11
   The data set used in [13] is similar to, but may not be identical to, the data set used
   here. Therefore, caution should be taken when comparing the results between these
   two papers.
    Using convolutional neural networks, Beieler [3] reports precision scores of
0.85 and 0.60 for QuadClass classification on English and machine-translated
texts, respectively, when word tokens are used. If character-based tokens are
used, these scores increase to 0.94 (English) and 0.93 (native Arabic). However,
the task presented is one of event classification conditional on the existence of
an event in the text. This contrasts with the binary event/non-event objective
of tasks 1 and 2. Nonetheless, these results point to a path forward for continued
work on protest event detection via character-based models and convolutional
neural networks.


4      Discussion

A drawback of the multitask RNN models used above is that they do not lend
themselves to model interrogation – they are typically viewed as black box mod-
els whose parameters resist simple interpretation. However, the addition of an
attention layer to the input sequences would allow researchers to identify those
input tokens (i.e. words) that contribute the most (or least) to a given prediction.
Attention layers do this by masking some input tokens and not others, condi-
tional on the input sequence itself. This would help to answer the question of
which words contribute to accurate or inaccurate model predictions and whether
those informative words differ from task to task.12
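A minimal sketch of one such layer is given below: a simple additive attention pooling layer that produces per-token weights which can be inspected after a prediction. This is an illustration of the general idea rather than a layer used in the submitted models, and the specific formulation is only one of many attention variants.

import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """Weight each timestep, conditional on the sequence, and pool the result."""

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(dim, 1),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, inputs):
        scores = tf.matmul(tf.tanh(inputs), self.w)     # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)         # attention over the time axis
        context = tf.reduce_sum(weights * inputs, axis=1)
        return context, weights                         # weights are interpretable

# Usage: insert between a sequence-returning recurrent layer and a task head,
# then read the returned weights to see which tokens drove a given prediction.
inputs = tf.keras.Input(shape=(None, 300))
sequence = layers.Bidirectional(layers.GRU(20, return_sequences=True))(inputs)
context, weights = AttentionPooling()(sequence)
output = layers.Dense(1, activation="sigmoid")(context)
model = tf.keras.Model(inputs, [output, weights])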
    The models presented here make use of sub-word (i.e. character) information
but only in the construction of word vectors from a pre-trained FastText model.
By the time the sequences are input to the recurrent neural network models,
the sub-word information has been aggregated to word-level tokens. Based on
previous research that demonstrates the advantages of character-based models
[3], foregoing aggregation to the word vector level altogether may be beneficial.
Instead, distributed character n-gram vectors could form the input sequences to
a neural network classifier like those discussed above. This may, for example,
allow the model to learn that n-gram vectors representing capitalized letters are
more likely to occur in proper nouns, even if those proper nouns have never
before been seen by the model.
    Finally, the CLEF 2019 ProtestNews Lab has provided the research commu-
nity with a valuable “ground truth” data set on protest (and non-protest) events.
The lack of hand-annotated and curated event data sets has made difficult the
evaluation of event coding systems. Furthermore, due to copyright concerns that
the ProtestNews organizers have cleverly overcome, previous event data sets have
12
     While there does not appear to be a single best citation for attention neural networks,
     the earliest use of attention in RNN models may be [18].
not published the underlying text data from which event records were derived.13
Now that annotated text data are available, future solutions for deriving struc-
tured event records and their attributes from text should take advantage of this
resource to evaluate their performance.
    The results presented here indicate that supervised learning can achieve
strong results in identifying politically-relevant events within unstructured text.
However, the generalization of these models to out-of-sample data is imperfect;
the ease with which neural network models like those used here can overfit to
the training data means that care must be taken to ensure that the models
continue to perform well on out-of-sample data. This is especially true if there
is reason to believe that the out-of-sample data may represent a different data
generating process than the in-sample data, as is the case here with the China
data set. Nonetheless, when sufficient training data are available (perhaps only
a few thousand examples), supervised learning can play an important role in
generating political event data.
13
   The organizers provided a script that allowed participants to download the story
   data themselves from the original source websites.


References

 1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
    G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A.,
    Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg,
    J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
    Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V.,
    Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng,
    X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015),
    http://tensorflow.org/, software available from tensorflow.org
 2. Beieler, J.: Creating a real-time, reproducible event dataset. CoRR
    abs/1612.00866 (2016), http://arxiv.org/abs/1612.00866
 3. Beieler, J.: Generating politically-relevant event data. CoRR abs/1609.06239
    (2016), http://arxiv.org/abs/1609.06239
 4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word
    vectors with subword information. CoRR abs/1607.04606 (2016),
    http://arxiv.org/abs/1607.04606
 5. Boschee, E., Lautenschlager, J., O’Brien, S., Shellman, S., Starz, J., Ward, M.:
    Bbn accent event coding evaluation.updated v01.pdf. In: ICEWS Coded Event
    Data. Harvard Dataverse (2015). https://doi.org/10.7910/DVN/28075/GBAGXI,
    https://doi.org/10.7910/DVN/28075/GBAGXI
 6. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., Bengio, Y.:
    Learning phrase representations using RNN encoder-decoder for statistical machine
    translation. CoRR abs/1406.1078 (2014), http://arxiv.org/abs/1406.1078
 7. Chollet, F., et al.: Keras. https://keras.io (2015)
 8. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirec-
    tional transformers for language understanding. CoRR abs/1810.04805 (2018),
    http://arxiv.org/abs/1810.04805
 9. Gerner, D.J., Abu-Jabr, R., Schrodt, P.A., Yilmaz, Ö.: Conflict and mediation
    event observations (CAMEO): A new event data framework for the analysis of for-
    eign policy interactions. Paper presented at the International Studies Association
    (2002)
10. Hinton, G., Srivastava, N., Swersky, K.: Neural networks for machine learning:
    Lecture 6a, overview of mini-batch gradient descent.
    http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Com-
    put. 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735,
    http://dx.doi.org/10.1162/neco.1997.9.8.1735
12. Honnibal, M., Montani, I.: spacy 2: Natural language understanding with bloom
    embeddings, convolutional neural networks and incremental parsing. To appear
    (2017)
13. Hürriyetoğlu, A., Yörük, E., Yüret, D., Yoltar, Ç., Gürel, B., Duruşan, F., Mutlu,
    O.: A task set proposal for automatic protest information collection across multiple
    countries. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra,
    D. (eds.) Advances in Information Retrieval. pp. 316–323. Springer International
    Publishing, Cham (2019)
14. King, G., Lowe, W.: An automated information extraction tool for inter-
    national conflict data with performance as good as human coders: A rare
    events evaluation design. International Organization 57, 617–642 (Summer 2003),
    http://gking.harvard.edu/files/gking/files/infoex.pdf?m=1360039060
15. Leetaru, K., Schrodt, P.A.: Gdelt: Global data on events, location, and tone. ISA
    Annual Convention (2013)
16. Makarov, P.: Automated acquisition of patterns for coding political event data:
    Two case studies. In: Proceedings of Workshop on Computational Linguistics for
    Cultural Heritage, Social Sciences, Humanities and Literature. pp. 103–112 (Au-
    gust 2018)
17. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-
    training distributed word representations. In: Proceedings of the International Con-
    ference on Language Resources and Evaluation (LREC 2018) (2018)
18. Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual
    attention (2014), https://arxiv.org/abs/1406.6247
19. O’Brien, S.P.: Crisis early warning and decision support: Contemporary approaches
    and thoughts on future research. International Studies Review 12(1), 87–104
    (2010), http://www.jstor.org/stable/40730711
20. Radford, B.J.: Automated Learning of Event Coding Dictionaries for Novel Do-
    mains with an Application to Cyberspace. Ph.D. thesis, Duke University (2016),
    http://hdl.handle.net/10161/13386
21. Radford, B.J.: Automated dictionary generation for political event coding.
    Political Science Research and Methods pp. 1–15 (Forthcoming).
    https://doi.org/10.1017/psrm.2019.1
22. Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning.
    CoRR cmp-lg/9505040 (1995), http://arxiv.org/abs/cmp-lg/9505040
23. Schrodt, P.A.: Forecasting Conflict in the Balkans using Hidden Markov Models,
    pp. 161–184. Springer Netherlands, Dordrecht (2006)
24. Schrodt, P.A.: TABARI: Textual analysis by augmented replacement instructions,
    version 0.8.4 manual (2014),
    http://eventdata.parusanalytics.com/tabari.dir/TABARI.0.8.4b3.manual.pdf
25. Schrodt, P.A., Beieler, J., Idris, M.: Three’s a charm?: Open event data coding with
    el:diablo, petrarch, and the open event data alliance. version 1.0 (March 2014),
    http://eventdata.parusanalytics.com/papers.dir/Schrodt-Beieler-Idris-ISA14.pdf