A Comparative Study on Generalizability of Information Extraction Models on Protest News

Erkan Başar1,2, Simge Ekiz1, and Antal van den Bosch2

1 FloodTags, Binckhorstlaan 36, M2.11, 2516 BE, The Hague, The Netherlands
{basar,sekiz}@floodtags.com
2 Centre for Language Studies, Radboud University, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
a.vandenbosch@let.ru.nl

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract. Information extraction applications can help social scientists obtain the information they need to understand the reasons behind certain social dynamics. Many recent state-of-the-art information extraction approaches are based on supervised machine learning, which can recognize information whose patterns resemble those seen during training. Recognizing relevant information with never-seen patterns, however, is still a challenging task. In this study, we design a Recurrent Neural Network (RNN) architecture employing ELMo embeddings and residual Bidirectional Long Short-Term Memory layers to address this challenge in the context of the CLEF 2019 ProtestNews shared task. Furthermore, we train a classical Conditional Random Fields (CRF) model as a strong baseline, to contrast a state-of-the-art classical machine learning approach with a recent neural network method both in performance and in generalizability. We show that the RNN model outperforms the classical CRF model and shows better promise of generalizability.

Keywords: information extraction · recurrent neural networks · conditional random fields · word embeddings

1 Introduction

Social science studies can benefit from analyzing and comparing protest event information from multiple countries to understand the reasons behind certain social dynamics, such as emerging welfare regimes. Although online mass media agencies report major incidents within a day, manually collecting such amounts of data is time-consuming, expensive, and hard to maintain. Automating the process with natural language processing (NLP) allows us to harness such information at large scale and high speed.

Many recent state-of-the-art NLP approaches are based on supervised machine learning, where the machine discovers patterns in manually prepared gold-standard data to learn how to detect relevant information. With recent developments, it is possible to teach information extraction models to detect information whose patterns are similar to the training data at reasonable performance levels. However, building a model that can correctly recognize relevant data with never-observed patterns remains a challenging task. The generalizability of a model especially affects the results when it is used on cross-cultural text. Even if a text is written in the same language the model was trained on, cultural differences can affect the context and the language use, such as word choice and ordering. This issue can introduce semantic bias [3], potentially degrading a model to the point of uselessness, and eventually forcing us to build models specific to individual cultures and regions.

The CLEF organizers prepared the ProtestNews Task 3: Event Information Extraction shared task [14], stressing both the importance of automatically extracting protest event information from news and the impact of generalizability on natural language processing applications.
The aim of the shared task is to develop generalizable NLP tools that are robust enough to be used regardless of their data source, with a focus on detecting and extracting relevant information from protest news.

We design a Recurrent Neural Network (RNN) using residual Bidirectional Long Short-Term Memory (BiLSTM) layers with pretrained Embeddings from Language Models (ELMo) word embeddings [20], and a classical Conditional Random Fields (CRF) [15] based machine learning model. We take the CRF-based model as a strong baseline and compare both the performance and the generalizability of the two models. In doing so, we contrast one of the classical machine learning approaches with one of the latest methods. We aim for this automation to be as accurate as possible, knowing that our methods will probably not be as precise as human annotation.

2 Background

Information Extraction (IE) is the task of automatically extracting structured information from unstructured text, and is studied as part of the natural language processing field [6]. Previous studies have demonstrated that end-to-end information extraction systems can be developed to analyze news media data by employing probabilistic approaches [23, 10]. In the ProtestNews shared task, event information extraction is cast as an entity sequence labelling task. The goal of sequence labelling is to label sequences of relevant words in text with a single categorical class.

A classical machine learning framework for labelling sequential data is linear-chain Conditional Random Fields [15]. CRFs prevent the label bias problem, which occurs when there is uncertainty about the previous tag in the sequence [19]. The strength of CRFs also comes from their ability to deal with arbitrary, overlapping features of the input [7]. CRFs are accepted as one of the state-of-the-art approaches, and CRF-based models have been applied to sequence labelling tasks [24, 8], including some of the state-of-the-art named entity recognition tools [9].

In recent years, studies have shown promise with Recurrent Neural Networks [16]. Long Short-Term Memory (LSTM) [11] based RNNs can learn temporal dependencies between sequence elements, and also when to forget them. Moreover, Bidirectional LSTM [13] based networks read the data once from beginning to end and once from end to beginning, which lets them learn stronger relations. Furthermore, neural networks have come to be used for feature extraction, generating word representations that are effective in supervised sequence labelling problems [22]. In common state-of-the-art approaches, word representations are generated over individual tokens [17]. ELMo [20] is a recent state-of-the-art word representation network that generates embeddings over entire word sequences instead of individual tokens, providing an advantage in sequence-based tasks.

3 Dataset

In this study, we use only the dataset provided by the ProtestNews shared task organizers. The provided data comes already split into training and development sets, respectively containing 250 and 36 English news sentences from newspapers published in India. To test the generalizability of the models, two separate unlabelled test sets are provided. The main test set contains 80 English news sentences from India, and the secondary test set contains 39 English news sentences from China.

As shown in Figure 1, the average sentence length in both the training and the test sets is around 50 tokens. However, while the maximum length in the training data is 440, the test data contains two sentences with lengths 579 and 643.

Fig. 1: The distribution of sentence lengths in the training data (left) and in the combined test sets (right).

The whole dataset is pre-tokenized and shared in the standard CoNLL format. There are 8 different classes in the dataset, named participant, trigger, loc, place, etime, fname, organizer, and target, given in beginning-inside-outside (BIO) tagging format. The sample sizes of the classes, counted over complete entities, vary as shown in Figure 2: the trigger class is used 970 times, the place class 318 times, and the least frequent class is fname with 128 labels.

Fig. 2: The distribution of label usage in the training data. The graph shows a count of the labels per complete entity, although the data is provided in BIO tagging format. The uninformative outside class is disregarded.

4 Methods

Below, we describe the two models we trained for the task and the evaluation process.

4.1 Conditional Random Fields Algorithm

We employ Conditional Random Fields through the Python binding of the CRFsuite library [18]. The CRF model is trained on the given training data, and the hyperparameters are optimized on the given development set using the Random Search method [2], in particular the regularization parameters known as the "C" parameters.

In classical supervised machine learning, the features have a great impact on classification accuracy. For feature extraction we use a sliding window with a length of 5 tokens, meaning that when we extract features for the token at position i, the sliding window contains token_{i-2}, token_{i-1}, token_i, token_{i+1}, and token_{i+2}. Besides extracting the features of the individual tokens in the sliding window independently, we also include bi-grams and tri-grams of the tokens for most of the features. The features extracted to train the CRF model are listed below; a sketch of the extraction code follows the list.

– Context: The focus token and its surroundings in the sliding window, and their n-gram combinations.
– Lemmas: Lemmatized versions of the focus token and its surroundings in the sliding window, and their n-gram combinations, obtained using the spaCy NLP tool [12].
– Orthographic Types: The orthographic types of the tokens in the sliding window, and their n-gram combinations.
– Part-of-Speech: Part-of-speech tags of the tokens in the sliding window, and their n-gram combinations, obtained using the spaCy NLP tool [12].
– Temporal Tags: Temporal tags of the tokens in the sliding window, and their n-gram combinations, obtained using the HeidelTime temporal tagging tool [21].
– Named Entities: Named entity tags of the tokens in the sliding window, and their n-gram combinations, obtained using the spaCy NLP tool [12].
– isCapital: A boolean value indicating whether the first letter of token_i is capitalized.
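To make the feature template concrete, the following is a minimal sketch of the sliding-window extraction in the dictionary style expected by the Python bindings of CRFsuite (e.g., python-crfsuite or sklearn-crfsuite). The function names and the sent structure are our own illustration, and only a subset of the features above is shown (context, lemmas, part-of-speech, isCapital); the full system additionally uses orthographic types, temporal tags, named entities, and further n-gram combinations.

    # Illustrative sketch, not the exact production template.
    # `sent` is one pre-tokenized sentence as a list of (token, lemma, pos)
    # triples, e.g. produced by spaCy.
    def word2features(sent, i):
        features = {
            "bias": 1.0,
            "isCapital": sent[i][0][:1].isupper(),  # first letter capitalized?
        }
        # Sliding window of length 5: offsets -2 .. +2 around the focus token.
        for offset in range(-2, 3):
            j = i + offset
            if 0 <= j < len(sent):
                token, lemma, pos = sent[j]
                features[f"{offset}:token"] = token.lower()
                features[f"{offset}:lemma"] = lemma
                features[f"{offset}:pos"] = pos
        # One example n-gram combination: the bigram of the focus token and its
        # right neighbour; the full template adds more bi- and tri-grams.
        if i + 1 < len(sent):
            features["bigram:token"] = sent[i][0].lower() + " " + sent[i + 1][0].lower()
        return features

    def sent2features(sent):
        return [word2features(sent, i) for i in range(len(sent))]

In sklearn-crfsuite, for instance, such per-token feature dictionaries are passed directly to the CRF class, whose c1 and c2 arguments are the L1/L2 regularization weights, i.e., the "C" parameters that the random search would tune.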
4.2 Recurrent Neural Network Approach

We propose a Recurrent Neural Network architecture that consists of the ELMo word embedding network, two residual BiLSTM layers, and a time-distributed dense layer with softmax activation at the end, as shown in Figure 3.

Fig. 3: Architecture of the proposed network, consisting of the ELMo word embedding network and two residual BiLSTM layers.

We use the ELMo network pretrained on the 1 Billion Word Benchmark [4] and distributed through TensorFlow Hub (https://www.tensorflow.org/hub, accessed on 24/05/2019). ELMo is thus left out of our training process and used only as a feature extraction layer. Unlike other popular embedding methods, ELMo generates embeddings over entire word sequences; consequently, the input layer of our network takes tokenized sentences as sequences. We set ELMo to return word embeddings of 1024 dimensions for an input sentence sequence. The embeddings are then fed to the residually connected BiLSTM layers, each with 512 units and a dropout rate of 0.2. We use the Adam optimizer with a learning rate of 0.001 to minimize the sparse categorical cross-entropy loss function.

Observing the relation between the epochs and the training and validation losses, we initially train our network for 30 epochs; after the 10th epoch, however, we do not see any further improvement in validation accuracy. The learning rate, dropout rate, number of units, and dimensions were decided heuristically. The network and the loss function are implemented in Keras (v2.2.4) [5] with the TensorFlow (v1.14.1) [1] back-end.
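A minimal sketch of this architecture in Keras with the TensorFlow 1.x ELMo module from TensorFlow Hub is given below. Wrapping the Hub module in a Lambda layer is common practice for this setup rather than our exact code, and the fixed max_len, batch_size, and empty-string padding are our assumptions, not details reported above.

    import tensorflow as tf
    import tensorflow_hub as hub
    from keras import backend as K
    from keras.layers import Input, Lambda, Bidirectional, LSTM, TimeDistributed, Dense, add
    from keras.models import Model
    from keras.optimizers import Adam

    max_len = 50     # assumed padding length; not reported in the text above
    batch_size = 32  # assumed; the "tokens" signature needs explicit lengths
    n_tags = 17      # 8 entity classes in BIO format plus the outside tag

    sess = tf.Session()
    K.set_session(sess)

    # Pretrained ELMo module, kept frozen and used purely for feature extraction.
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])

    def elmo_embedding(x):
        # x: (batch_size, max_len) batch of token strings, padded with "".
        return elmo(
            inputs={"tokens": tf.cast(x, tf.string),
                    "sequence_len": tf.constant(batch_size * [max_len])},
            signature="tokens", as_dict=True,
        )["elmo"]  # shape: (batch_size, max_len, 1024)

    input_text = Input(shape=(max_len,), dtype=tf.string)
    embedding = Lambda(elmo_embedding, output_shape=(max_len, 1024))(input_text)
    # Two BiLSTM layers (2 x 512 units each way) with a residual connection.
    x = Bidirectional(LSTM(512, return_sequences=True, dropout=0.2))(embedding)
    x_rnn = Bidirectional(LSTM(512, return_sequences=True, dropout=0.2))(x)
    x = add([x, x_rnn])  # residual connection between the two BiLSTM layers
    out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

    model = Model(input_text, out)
    model.compile(optimizer=Adam(lr=0.001), loss="sparse_categorical_crossentropy")

Because the sequence_len constant bakes the batch size into the graph, training and prediction in this sketch must use batches of exactly batch_size sentences, each padded (or truncated) to max_len tokens with empty strings.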
4.3 Evaluation

The evaluation of the algorithms is carried out on the submission platform of the ProtestNews shared task [14]. The evaluation metrics used are precision, recall, and F1-score.
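For local development it can be useful to approximate these scores before submitting. One option (our suggestion, not the official scorer) is the seqeval package, which computes precision, recall, and F1 over complete entities from BIO tag sequences:

    from seqeval.metrics import f1_score, precision_score, recall_score

    # Toy example with one sentence: two gold entities, one predicted entity.
    y_true = [["B-trigger", "I-trigger", "O", "B-place"]]
    y_pred = [["B-trigger", "I-trigger", "O", "O"]]

    # seqeval counts complete entities rather than individual BIO tokens.
    print(precision_score(y_true, y_pred))  # 1.0  (the predicted entity is correct)
    print(recall_score(y_true, y_pred))     # 0.5  (one of two gold entities found)
    print(f1_score(y_true, y_pred))         # ~0.667

Whether this exactly matches the platform's scoring is an assumption on our part; all scores reported below are the official platform scores.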
5 Results

On the primary test set, we observe a macro-average F1-score of 37.64 with the CRF-based model and 55.75 with the RNN model, as shown in Figure 4. However, we also see that the CRF model attains a higher precision score than the RNN, while the RNN outperforms the CRF model on recall. The RNN model thus displays a more balanced performance than the CRF. On the secondary test set, the performance of the CRF model drops significantly, and the RNN dominates the CRF on every score.

Fig. 4: Results of the proposed methods on both test sets. The India-English test set is referred to as the primary test set; the China-English test set as the secondary test set.

In Figure 5, we observe that both models lose performance when tested on the secondary test set. The performance decrease of the CRF-based method is highly visible on the secondary test set, whereas the RNN-based method compensates for this drop better. Overall, our best performing model achieves an average F1-score of 55.75.

Fig. 5: F1-score comparisons of the CRF and RNN models with respect to generalizability.

6 Conclusions

In this study we have proposed a Recurrent Neural Network based approach to extract information from protest news. Furthermore, we have built a classical Conditional Random Fields model as a strong baseline, to contrast a classical machine learning approach with a more recent method both in performance and in generalizability.

We have seen that the Recurrent Neural Network based model significantly outperforms the Conditional Random Fields based approach on both the primary and the secondary test set. In a setup designed to test the generalizability of each model, we have shown that the CRF-based model suffers from the generalizability problem stated in our initial claims. While both models lost performance on the secondary test set, we conclude that the RNN-based approach shows better promise of generalizability than the CRF method.

In future studies, we would like to include character embeddings and evaluate whether including character-level information improves the results of the RNN-based model.

References

1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), http://tensorflow.org/, software available from tensorflow.org
2. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012)
3. Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from language corpora contain human-like biases. Science 356(6334), 183–186 (2017)
4. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P.: One billion word benchmark for measuring progress in statistical language modeling. Computing Research Repository (CoRR) abs/1312.3005, 1–6 (2013)
5. Chollet, F., et al.: Keras. https://keras.io (2015)
6. Cowie, J., Lehnert, W.: Information extraction. Commun. ACM 39(1), 80–91 (Jan 1996). https://doi.org/10.1145/234173.234209
7. Culotta, A., McCallum, A.: Confidence estimation for information extraction. In: Proceedings of HLT-NAACL 2004: Short Papers. pp. 109–112. Association for Computational Linguistics (2004)
8. Cuong, N.V., Chandrasekaran, M.K., Kan, M.Y., Lee, W.S.: Scholarly document information extraction using extensible features for efficient higher order semi-CRFs. In: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. pp. 61–64. ACM (2015)
9. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. pp. 363–370. Association for Computational Linguistics (2005)
10. Gupta, D., Strötgen, J., Berberich, K.: EventMiner: Mining events from annotated documents. In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. pp. 261–270. ACM (2016)
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735
12. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)
13. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991 (2015), http://arxiv.org/abs/1508.01991
14. Hürriyetoğlu, A., Yörük, E., Yüret, D., Yoltar, Ç., Gürel, B., Duruşan, F., Mutlu, O.: A task set proposal for automatic protest information collection across multiple countries. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) Advances in Information Retrieval. pp. 316–323. Springer International Publishing, Cham (2019)
15. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML). vol. 1, pp. 282–289 (2001)
16. Li, B., Liu, T., Zhao, Z., Du, X.: Attention-based recurrent neural network for sequence labeling. In: Cai, Y., Ishikawa, Y., Xu, J. (eds.) Web and Big Data. pp. 340–348. Springer International Publishing, Cham (2018)
17. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. pp. 3111–3119. NIPS'13, Curran Associates Inc., USA (2013), http://dl.acm.org/citation.cfm?id=2999792.2999959
18. Okazaki, N.: CRFsuite: A fast implementation of conditional random fields (CRFs) (2007), http://www.chokkan.org/software/crfsuite/
19. Peng, F., McCallum, A.: Information extraction from research papers using conditional random fields. Information Processing & Management 42(4), 963–979 (2006)
20. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018). https://doi.org/10.18653/v1/N18-1202, https://www.aclweb.org/anthology/N18-1202
21. Strötgen, J., Gertz, M.: Multilingual and cross-domain temporal tagging. Language Resources and Evaluation 47(2), 269–298 (2013). https://doi.org/10.1007/s10579-012-9179-y
22. Turian, J., Ratinov, L., Bengio, Y.: Word representations: A simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. pp. 384–394. ACL '10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1858681.1858721
23. Vossen, P., Agerri, R., Aldabe, I., Cybulska, A., van Erp, M., Fokkens, A., Laparra, E., Minard, A.L., Aprosio, A.P., Rigau, G., Rospocher, M., Segers, R.: NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Special Issue, Knowledge-Based Systems, Elsevier (2016). https://doi.org/10.1016/j.knosys.2016.07.013
24. Wu, F., Weld, D.S.: Open information extraction using Wikipedia. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. pp. 118–127. Association for Computational Linguistics (2010)