Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets

Jitin Krishnan, Department of Computer Science, George Mason University, Fairfax, VA, jkrishn2@gmu.edu
Hemant Purohit, Department of Information Sciences & Technology, George Mason University, Fairfax, VA, hpurohit@gmu.edu
Huzefa Rangwala, Department of Computer Science, George Mason University, Fairfax, VA, rangwala@gmu.edu

ABSTRACT
State-of-the-art models for cross-lingual language understanding such as XLM-R [7] have shown great performance on benchmark data sets. However, they typically require some fine-tuning or customization to adapt to downstream NLP tasks for a domain. In this work, we study the unsupervised cross-lingual text classification task in the context of the crisis domain, where rapidly filtering relevant data regardless of language is critical to improving the situational awareness of emergency services. Specifically, we address two research questions: a) Can a custom neural network model over XLM-R, trained only in English for such a classification task, transfer knowledge to multilingual data and vice-versa? b) By employing an attention mechanism, does the model attend to words relevant to the task regardless of the language? To this end, we present an attention realignment mechanism that utilizes a parallel language classifier to minimize linguistic differences between the source and target languages. Additionally, we pseudo-label the tweets from the target language and augment them with the tweets in the source language for retraining the model. We conduct experiments using Twitter posts (tweets) labelled as a 'request' in the open-source data set by Appen¹, consisting of multilingual tweets for crisis response. Experimental results show that attention realignment and pseudo-labelling improve the performance of unsupervised cross-lingual classification. We also present an interpretability analysis by evaluating the performance of attention layers on original versus translated messages.

KEYWORDS
Social Media, Crisis Management, Text Classification, Unsupervised Cross-Lingual Adaptation, Interpretability

ACM Reference Format:
Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. In Proceedings of the KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

¹ https://appen.com/datasets/combined-disaster-response-data/

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020), San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, August 24, 2020, San Diego, California, USA.
© 2020 Copyright held by the author(s).
https://doi.org/10.1145/nnnnnnn.nnnnnnn

Figure 1: Problem: Unsupervised cross-lingual tweet classification, e.g., train a model using English tweets, predict labels for multilingual tweets, and vice-versa.

1 INTRODUCTION
Social media platforms such as Twitter provide valuable information to aid emergency response organizations in gaining real-time situational awareness during the sudden onset of crisis situations [4]. Extracting critical information about affected individuals, infrastructure damage, medical emergencies, or food and shelter needs can help emergency managers make time-critical decisions and allocate resources efficiently [15, 21, 22, 30, 31, 36]. Researchers have designed numerous classification models to help towards this humanitarian goal of converting real-time social media streams into actionable knowledge [1, 22, 26, 28, 29]. Recently, with the advent of multilingual models such as multilingual BERT [9] and XLM [20], researchers have started adapting them to multilingual disaster tweets [6, 25]. Since XLM-R [7] has been shown to be the leading model for cross-lingual language understanding, we restrict our work to this model to explore the aspects of cross-lingual transfer of knowledge and interpretability.

In this work, we address two questions. The first is to examine whether XLM-R is effective in capturing multilingual knowledge, by constructing a custom model over it to analyze whether a model trained using English-only tweets will generalize to multilingual data and vice-versa. Social media streams are generally different from other text, given the user-generated content. For example, tweets are usually short, with possible errors and ambiguity in the behavioral expressions. These properties in turn make classification and representation extraction more challenging.
The second question is to examine whether word translations are attended equally by the attention layers. For instance, the words with higher attention weights in a sentence in Haitian Creole such as "Tanpri nou bezwen tant avek dlo nou zon silo mesi" should align with the words in its corresponding translated tweet in English "Please, we need tents and water. We are in Silo, Thank you!". Our core idea is that if 'dlo' in the Haitian tweet has a higher weight, so should its English translation 'water'. This word-level language-agnostic property can make machine learning models more interpretable. It also brings several benefits to downstream tasks such as knowledge graph construction using keywords extracted from tweets. In situations where data is available only in one language, this similarity in attention would still allow us to extract relevant phrases in cross-lingual settings. To the best of our knowledge, aligning attention in a cross-lingual setting has not been attempted before in the crisis analytics domain. In this work, we focus our classification experiments only on tweets containing the 'request' intent, which will be expanded to other behaviors, tasks, and datasets in the future.

Contributions: We propose a novel attention realignment method which promotes the task classifier to be more language agnostic, and which in turn tests how effectively the XLM-R model captures multilingual knowledge for crisis tweets; and a pseudo-labelling procedure to further enhance the model's generalizability. Further, incorporating the attention-based mechanism allows us to perform an interpretability analysis on the model by comparing how words are attended in the original versus translated tweets.

2 RELATED WORK AND BACKGROUND
There are numerous prior works (cf. surveys [4, 14]) that focus specifically on disaster-related data to perform classification and other rapid assessments during the onset of a new disaster event. A crisis period is an important but challenging situation, as collecting labeled data during an ongoing event is very expensive. This problem led to several works on domain adaptation techniques in which machine learning models can learn and generalize to unseen crisis events [3, 10, 18, 23]. In the context of crisis data, Nguyen et al. [28] designed a convolutional neural network model which does not require any feature engineering, and Alam et al. [1] designed a CNN architecture with adversarial training on graph embeddings. Krishnan et al. [19] showed that sharing a common layer across multiple tasks can improve the performance of tasks with limited labels.

In the multilingual or cross-lingual direction, many works [8, 17] tried to align word embeddings (such as fastText [27]) from different languages into the same space so that a word and its translations have the same vector. These models have been superseded by models such as multilingual BERT [9] and XLM-R [7], which produce contextual embeddings and can be pretrained on several languages together to achieve impressive performance gains on multilingual use-cases.

The attention mechanism [2, 24] is one of the most widely used methods in deep learning; it constructs a context vector by weighing the entire input sequence, which improves over previous sequence-to-sequence models [13, 34, 35]. As the model produces weights associated with each word in a sentence, this allows for evaluating interpretability by comparing the words that are given priority in original versus translated tweets.

With more and more machine learning systems being adopted by diverse application domains, transparency in decision-making inevitably becomes an essential criterion, especially in high-risk scenarios [12] where trust is of utmost importance. With deep neural networks, including natural language systems, shown to be easily fooled [16], there have been many promising ideas that empower machine learning systems with the ability to explain their predictions [5, 32]. Gilpin et al. [11] present a survey of interpretability in machine learning, which provides a taxonomy of research that addresses various aspects of this problem. Similar to the work by Ross et al. [33], we employ an attention-based approach to evaluate model interpretability applied to the crisis domain.

3 METHODOLOGY

3.1 Problem Statement: Unsupervised Cross-Lingual Crisis Tweet Classification
Consider tweets in language A and their corresponding translated tweets in language B. The task of unsupervised cross-lingual classification is to train a classifier using data only from the source language and predict the labels for the data in the target language. This experimental setup is usually represented as A → B for training a model on A and testing on B, or B → A for training a model on B and testing on A. X refers to the data and Y refers to the ground-truth labels. The multilingual dataset used in our experiments consists of original multilingual (ml) tweets and their translated (en) tweets in English. To summarize:

Experiment A (en → ml): Input: X_en, Y_en, X_ml; Output: Y_ml^pred ← predict(X_ml)
Experiment B (ml → en): Input: X_ml, Y_ml, X_en; Output: Y_en^pred ← predict(X_en)
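A minimal sketch of this protocol is shown below; it is illustrative only, and build_classifier and the data arrays are placeholders for the task classifier of Section 3.3 and the splits described in Section 4.

```python
def run_experiment(X_source, Y_source, X_target):
    """Train only on the source language and predict labels for the target;
    no target-language labels are used at any point (unsupervised transfer)."""
    model = build_classifier()          # placeholder for the task classifier
    model.fit(X_source, Y_source)
    return model.predict(X_target)

Y_ml_pred = run_experiment(X_en, Y_en, X_ml)   # Experiment A: en -> ml
Y_en_pred = run_experiment(X_ml, Y_ml, X_en)   # Experiment B: ml -> en
```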
3.2 Overview
In the following sections, we propose two methodologies to enhance cross-lingual classification: 1) Attention Realignment and 2) Pseudo-Labelling. Attention realignment utilizes a language classifier that is trained in parallel to realign the attention layer of the task classifier such that the weights are geared more towards task-specific words regardless of the language. Pseudo-labelling further enhances the classifier by adding high-quality seeds from the target language that are pseudo-labelled by the task classifier.

3.3 Attention Realignment by Parallel Language Classifier
As depicted in Figure 2, the model on the left side is the task classifier and the model on the right side is a language classifier that is trained in parallel. The purpose of this language classifier is to pick up aspects that are missed by the XLM-R model. These could be tweet-specific, crisis-specific, or other linguistic nuances that can separate original tweets from translated tweets. Note that, semantically, translated words are expected to have similar XLM-R representations.
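To make the two-branch setup concrete, the sketch below shows one possible Keras wiring of a task classifier and a parallel language classifier over precomputed XLM-R features; it is an illustration under our own assumptions (layer sizes, the attention_branch helper, and layer names such as task_alpha are ours, not from the released code).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

T_x, d, units = 30, 1024, 64       # T_x = 30 as in Table 3; d = XLM-R (large) hidden size

def attention_branch(x, name):
    """BiLSTM + additive attention; returns (context vector, attention weights)."""
    h = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    scores = layers.Dense(1)(h)                               # (batch, T_x, 1)
    alpha = layers.Softmax(axis=1, name=f"{name}_alpha")(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, alpha])
    return context, alpha

inp = layers.Input(shape=(T_x, d))                            # precomputed XLM-R features
task_ctx, alpha_task = attention_branch(inp, "task")          # task classifier branch (M1)
lang_ctx, alpha_lang = attention_branch(inp, "lang")          # language classifier branch (M2)

task_out = layers.Dense(2, activation="softmax", name="task")(task_ctx)   # request vs. not
lang_out = layers.Dense(2, activation="softmax", name="lang")(lang_ctx)   # original vs. translated

model = Model(inp, [task_out, lang_out])
model.summary()
```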




Figure 2: Attention Realignment with Pseudo-Labelling over the XLM-R model

Notation     Definition
en           Tweets translated to English ('message' column in the dataset)
ml           Multilingual tweets ('original' column in the dataset)
α            Attention layer
T            A component that uses task-specific data, i.e., + and − 'request' tweets
L            A component that uses language-specific data, i.e., en and ml tweets
a_BiLSTM     Activation from the BiLSTM layer
β, γ, ζ      Hyperparameters
Table 1: Notations

Attention realignment is a mechanism we introduce to promote the task classifier to be more language independent. The main idea is that words given higher attention by the language classifier should be less important to the task classifier. For example, 'dlo' in Haitian Creole and 'water' in English should have the same vector representation in language-agnostic models, while the sentence structure, grammar, and other nuances can vary. We enforce this rule by constructing two operations:

(1) Attention Difference: When a sentence goes through model M1, it also goes through model M2. For the same sentence, this returns two attention weight vectors: one from the task classifier (α_T) and the other from the language classifier (α'_T). Directly subtracting α'_T from α_T poses two issues: 1) we do not know whether they are comparable, and 2) α'_T may have negative values. A simple solution is to normalize both vectors and clip α'_T such that it lies between 0 and 1. Thus, the attention subtraction step for the task classifier is:

    \frac{\vec{\alpha}_T}{\|\vec{\alpha}_T\|} - \gamma_T \, \mathrm{clip}\!\left(\frac{\vec{\alpha}'_T}{\|\vec{\alpha}'_T\|},\, 0,\, 1\right)    (1)

where γ_T is a hyperparameter to tune the amount of subtraction needed for the task classifier. Similarly, for the language classifier,
    \frac{\vec{\alpha}'_L}{\|\vec{\alpha}'_L\|} - \gamma_L \, \mathrm{clip}\!\left(\frac{\vec{\alpha}_L}{\|\vec{\alpha}_L\|},\, 0,\, 1\right)    (2)

(2) Attention Loss: Along with the attention difference, the model can also be trained by inserting an additional loss term that penalizes the similarity between the attention weights from the two classifiers. We use the Frobenius norm:

    L_{A_t} = \left\| \vec{\alpha}_T^{\top} \vec{\alpha}'_T \right\|_F^2    (3)

    L_{A_l} = \left\| \vec{\alpha}_L^{\top} \vec{\alpha}'_L \right\|_F^2    (4)

for the task and language classifiers, respectively. The resulting final loss function for joint training is:

    L(\theta) = \zeta_T \left( CE_T + \beta_T L_{A_t} \right) + \zeta_L \left( CE_L + \beta_L L_{A_l} \right)    (5)
where β is the hyperparameter that tunes the attention-loss weight, ζ is the hyperparameter that tunes the joint training loss, and CE denotes the binary cross-entropy loss:

    CE = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]    (6)

It is important to note that the Frobenius norm is not simply taken between the attention weights of the two models in general, but between the attention weights produced by the two models on the same input tweet. For example, for a given tweet, the task classifier attends more to task-specific words and the language classifier attends to language-specific words; the mechanism makes sure that they remain distinct.
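As an illustration only (this is not the authors' released implementation), the two operations can be written as follows. Attention weights are assumed to be softmax outputs of length T_x, Eqs. (3)/(4) are read here as a squared Frobenius norm over a batch of attention weight matrices, p_* denote predicted probabilities of the positive class, and the hyperparameter defaults follow Table 3.

```python
import tensorflow as tf

def attention_difference(alpha, alpha_other, gamma=0.01):
    """Eq. (1)/(2): normalize both attention vectors, clip the other
    classifier's weights to [0, 1], and subtract a scaled copy."""
    a = alpha / tf.norm(alpha)
    b = tf.clip_by_value(alpha_other / tf.norm(alpha_other), 0.0, 1.0)
    return a - gamma * b

def attention_loss(alpha, alpha_prime):
    """Eqs. (3)/(4): squared Frobenius norm of alpha^T alpha' for a batch of
    attention weight matrices with shape (batch, T_x)."""
    interaction = tf.matmul(alpha, alpha_prime, transpose_a=True)   # (T_x, T_x)
    return tf.reduce_sum(tf.square(interaction))

def joint_loss(y_task, p_task, alpha_T, alpha_Tp,
               y_lang, p_lang, alpha_L, alpha_Lp,
               zeta_T=1.0, zeta_L=0.1, beta_T=0.01, beta_L=0.01):
    """Eq. (5): cross-entropy terms (Eq. 6) plus the attention-loss terms,
    weighted by zeta and beta."""
    bce = tf.keras.losses.BinaryCrossentropy()
    return (zeta_T * (bce(y_task, p_task) + beta_T * attention_loss(alpha_T, alpha_Tp)) +
            zeta_L * (bce(y_lang, p_lang) + beta_L * attention_loss(alpha_L, alpha_Lp)))
```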
3.4 Pseudo-Labelling
To enhance the model further, we pseudo-label the data in the target language. For example, if we are training a model using the English tweets, we use the original tweets before translation for pseudo-labelling. The idea is simply to gather high-quality seeds from the target language to retrain the model. Note that we still do not use any target labels here, in keeping with the unsupervised goal. Thus, for retraining model M1 for en → ml, the new dataset consists of X_en^+ and X_ml^{pseudo+} as positive examples, and X_en^- and X_ml^{pseudo-} as negative examples.
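A minimal sketch of this selection-and-augmentation step is given below; model_m1, X_ml, X_en, and Y_en are placeholders, and the 0.7 confidence threshold is the one reported for Model M2 in Section 5.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.7):
    """Keep only high-confidence target-language tweets; `probs` is the task
    classifier's predicted probability of the 'request' class per tweet."""
    keep = (probs > threshold) | (probs < 1.0 - threshold)
    pseudo_y = (probs[keep] > 0.5).astype(int)   # pseudo positive / negative seeds
    return keep, pseudo_y

# `model_m1`, `X_ml`, `X_en`, `Y_en` are placeholders for the trained model and data.
probs_ml = model_m1.predict(X_ml).ravel()
keep, pseudo_y = select_pseudo_labels(probs_ml)

X_retrain = np.concatenate([X_en, X_ml[keep]])   # source tweets plus pseudo-labelled seeds
Y_retrain = np.concatenate([Y_en, pseudo_y])
# Retraining the task classifier on (X_retrain, Y_retrain) yields Model M2.
```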

3.5 XLM-R Usage
The recommended feature usage of XLM-R (https://github.com/facebookresearch/XLM) is either fine-tuning on the task or aggregating features from all of its 25 layers. We employ the latter to extract the multilingual embeddings for the tweets.
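The paper extracts features with the XLM repository; purely as an illustration of the same idea, the sketch below aggregates the hidden states of all 25 layers of XLM-R using the Hugging Face transformers API as a stand-in. Mean pooling across layers is our assumption, since the exact aggregation is not specified.

```python
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaModel

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-large",
                                          output_hidden_states=True)

def embed(tweet, max_len=30):                     # T_x = 30, as in Table 3
    tokens = tokenizer(tweet, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=max_len)
    with torch.no_grad():
        outputs = encoder(**tokens)
    # hidden_states: 25 tensors (embedding layer + 24 layers), each (1, T_x, 1024)
    stacked = torch.stack(outputs.hidden_states, dim=0)
    return stacked.mean(dim=0).squeeze(0)          # aggregate across layers -> (T_x, 1024)

features = embed("Tanpri nou bezwen tant avek dlo nou zon silo mesi")
```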
4 DATASET & EXPERIMENTAL SETUP
We use the open-source dataset from Appen (https://appen.com/datasets/combined-disaster-response-data/) consisting of multilingual crisis response tweets. The dataset statistics for tweets with 'request' behavior labels are shown in Table 2. For all experiments, the dataset is balanced for each split.

             Train     Validation    Test
Positive     3554      418           496
Negative     17473     2152          2128
Table 2: Dataset statistics for both en and ml

Each experiment is denoted as A → B, where A is the data used to train the model and B is the data used to test the model. For example, en → ml means we train the model using English tweets and test on multilingual tweets.

Models are implemented in Keras, and the details are shown in Table 3. Hyperparameters β_T, β_L, γ_T, and γ_L are not exhaustively tuned; we leave this exploration for future work.

T_x                          30
Deep Learning Library        Keras
Optimizer                    Adam [lr = 0.005, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01]
Maximum Epoch                100
Dropout                      0.2
Early Stopping Patience      10
Batch Size                   32
ζ_T                          1
ζ_L                          0.1
β_T, β_L, γ_T, γ_L           0.01
Table 3: Implementation Details
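For reference, the settings in Table 3 map onto a Keras training setup roughly as follows; this is a sketch under the older Keras/TF 2 optimizer interface (which still accepts decay), and model, X_train, y_train, X_val, and y_val are placeholders for the task classifier and the balanced splits above.

```python
import tensorflow as tf

# Optimizer and training-loop settings taken from Table 3.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9,
                                     beta_2=0.999, decay=0.01)
early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)

# `model`, `X_train`, `y_train`, `X_val`, `y_val` are placeholders.
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop])
```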
5 RESULTS & DISCUSSION

           Baseline    Model M1    Model M2
en → ml    59.98       62.53       66.79
           (80.57)     (77.02)     (82.39)
ml → en    60.93       65.69       70.95
           (70.07)     (63.50)     (73.84)
Table 4: Performance comparison (accuracy in %) for Source → Target (Source → Source). Baseline = XLM-R + BiLSTM + Attention; Model M1 = Baseline + Attention Realignment; Model M2 = Model M1 + Pseudo-Labelling.

Table 4 shows the cross-lingual performance comparison of all the models. The three models are described below:

(1) Baseline: The baseline model consists of embeddings retrieved from XLM-R, fed into BiLSTM and attention layers. This is a traditional sequence (text) classifier enhanced with an attention mechanism. Activations from the BiLSTM layers are weighed by the attention layer to construct the context vector, which is then passed through a dense layer and a softmax function to produce the classification output.

(2) Model M1: Adding attention realignment to the baseline model produces model M1. Attention realignment is achieved through a language classifier which is trained in parallel with the goal of making the task classifier more language agnostic.
The attention weights of the task and language classifiers are manipulated by each other during training through a process of subtraction (attention difference) as well as a loss component (attention loss); see Section 3.3.

(3) Model M2: Adding the pseudo-labelling procedure to model M1 produces model M2. Using model M1, which is trained to be language agnostic, tweets from the target languages are pseudo-labelled. High-quality seeds are selected (using Model M1, p > 0.7) and added to the original training dataset to retrain the task classifier.

Results show that, for cross-lingual evaluation on en → ml, model M1 outperforms the baseline by +4.3% and model M2 by +11.4%. On ml → en, model M1 outperforms the baseline by +7.8% and model M2 by +16.5%. This shows that both models are effective for cross-lingual crisis tweet classification. An interesting observation is that using attention realignment alone decreased the classification performance in the same language, which is brought back up by pseudo-labelling. These scores are shown in brackets in Table 4. A deeper investigation in this direction on various other tasks can shed more light on the impact of the realignment mechanism.

5.1 Interpretability: Attention Visualization

Figure 3: Attention visualization example for 'request' tweets: words and their attention weights for two tweets in Haitian Creole and their translations in English (the darker the shade, the higher the attention).

We follow an attention architecture similar to the one in [18]. The context vector is constructed as the dot product of the attention weights and the word activations. This represents the interpretable layer in our architecture. The attention weights represent the importance of each word in the classification process. Two examples are shown in Figure 3. In the first example, both en → en and ml → ml give attention to the word 'hungry' (i.e., 'grangou' in Haitian Creole). Note that these two are results from models trained in the same language in which they are tested; thus, ideal performance is expected. For the baseline model in the cross-lingual setup en → ml, although it correctly predicts the label, the attention weights are more spread apart. In model M2, with attention realignment and pseudo-labelling, although with some spread,
the attention weights are shifted more toward 'grangou'. Similarly, in example 2, the attention weights in the baseline model are more spread apart. The cross-lingual performance of model M2 aligns more closely with en → en and ml → ml. These examples show the importance of having interpretability as a key criterion in cross-lingual crisis tweet classification problems; it can also be used for downstream tasks such as extracting relevant keywords for knowledge graph construction.
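For illustration, per-token weights of the kind shown in Figure 3 can be read out from an attention layer as in the sketch below. It reuses the hypothetical task_alpha layer name from the earlier Keras sketch, and model, features, and tokens are placeholders (features is assumed to be a (T_x, d) NumPy array of XLM-R embeddings).

```python
import numpy as np
from tensorflow.keras import Model

def token_attention(model, features, tokens):
    """Return (token, weight) pairs from the task classifier's attention layer,
    sorted by decreasing attention weight."""
    alpha_model = Model(model.input, model.get_layer("task_alpha").output)
    weights = alpha_model.predict(features[np.newaxis, ...])[0, :, 0]
    return sorted(zip(tokens, weights[: len(tokens)]), key=lambda x: -x[1])

# e.g., highest-weighted tokens for a Haitian Creole example like the one in Figure 3:
# print(token_attention(model, features, ["Nou", "grangou", "anpil"]))
```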
6 CONCLUSION
We presented a novel approach to the unsupervised cross-lingual crisis tweet classification problem, using a combination of an attention realignment mechanism and a pseudo-labelling procedure (over the state-of-the-art multilingual model XLM-R) to promote the task classifier to be more language agnostic. Performance evaluation showed that models M1 and M2 outperformed the baseline by +4.3% and +11.4% respectively for cross-lingual text classification from English to multilingual data. We also presented an interpretability analysis by comparing the attention layers of the models. It shows the importance of incorporating a word-level language-agnostic characteristic in the learning process when training data is available only in one language. Performing extensive hyperparameter tuning and expanding the idea to other tasks (including cross-task and multi-task settings) are left as future work. Another direction for future work is to incorporate human-engineered knowledge from multilingual knowledge graphs such as BabelNet into our model architecture, which could improve the learning of similar concepts across languages that are critical to crisis response agencies.

Reproducibility: Source code is available at: https://github.com/jitinkrishnan/Cross-Lingual-Crisis-Tweet-Classification

7 ACKNOWLEDGEMENT
The authors would like to thank the U.S. National Science Foundation, grants IIS-1815459 and IIS-1657379, for partially supporting this research.
REFERENCES
[1] Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151 (2018).
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 120–128.
[4] Carlos Castillo. 2016. Big Crisis Data: Social Media in Disasters and Time-Critical Situations. Cambridge University Press.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. 2172–2180.
[6] Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020. Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 292–298.
[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[8] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word Translation Without Parallel Data. arXiv preprint arXiv:1710.04087 (2017).
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014).
[11] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 80–89.
[12] David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017).
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR) 47, 4 (2015), 1–38.
[15] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894 (2016).
[16] Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328 (2017).
[17] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745 (2018).
[18] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Diversity-Based Generalization for Neural Unsupervised Text Classification under Domain Shift. https://arxiv.org/pdf/2002.10937.pdf (2020).
[19] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Social Web Data for Emergency Services. arXiv preprint arXiv:2003.04991 (2020).
[20] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019).
[21] Kathy Lee, Ankit Agrawal, and Alok Choudhary. 2013. Real-time disease surveillance using Twitter data: Demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1474–1477.
[22] Hongmin Li, Doina Caragea, Cornelia Caragea, and Nic Herndon. 2018. Disaster response aided by tweet classification with a domain adaptation approach. Journal of Contingencies and Crisis Management 26, 1 (2018), 16–27.
[23] Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. 2018. Hierarchical attention transfer network for cross-domain sentiment classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[25] Guoqin Ma. 2019. Tweets Classification with BERT in the Field of Disaster Management. https://pdfs.semanticscholar.org/d226/185fa1e14118d746cf0b04dc5be8f545ec24.pdf.
[26] Reza Mazloom, Hongmin Li, Doina Caragea, Cornelia Caragea, and Muhammad Imran. 2019. A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets. International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 11, 2 (2019), 1–19.
[27] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[28] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902 (2016).
[29] Ferda Ofli, Patrick Meier, Muhammad Imran, Carlos Castillo, Devis Tuia, Nicolas Rey, Julien Briant, Pauline Millet, Friedrich Reinhard, Matthew Parkan, et al. 2016. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big Data 4, 1 (2016), 47–59.
[30] Bahman Pedrood and Hemant Purohit. 2018. Mining help intent on Twitter during disasters via transfer learning with sparse coding. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 141–153.
[31] Hemant Purohit, Carlos Castillo, Fernando Diaz, Amit Sheth, and Patrick Meier. 2013. Emergency-relief coordination on social media: Automatically matching resource requests and offers. First Monday 19, 1 (Dec. 2013).
[32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[33] Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717 (2017).
[34] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[36] István Varga, Motoki Sano, Kentaro Torisawa, Chikara Hashimoto, Kiyonori Ohtake, Takao Kawai, Jong-Hoon Oh, and Stijn De Saeger. 2013. Aid is out there: Looking for help from tweets during a large scale disaster. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1619–1629.