=Paper=
{{Paper
|id=Vol-2657/paper3
|storemode=property
|title=Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets
|pdfUrl=https://ceur-ws.org/Vol-2657/paper3.pdf
|volume=Vol-2657
|authors=Jitin Krishnan,Hemant Purohit,Huzefa Rangwala
|dblpUrl=https://dblp.org/rec/conf/kdd/KrishnanPR20
}}
==Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets==
Jitin Krishnan, Department of Computer Science, George Mason University, Fairfax, VA, jkrishn2@gmu.edu

Hemant Purohit, Department of Information Sciences & Technology, George Mason University, Fairfax, VA, hpurohit@gmu.edu

Huzefa Rangwala, Department of Computer Science, George Mason University, Fairfax, VA, rangwala@gmu.edu

===ABSTRACT===
State-of-the-art models for cross-lingual language understanding such as XLM-R [7] have shown great performance on benchmark data sets. However, they typically require some fine-tuning or customization to adapt to downstream NLP tasks for a domain. In this work, we study the unsupervised cross-lingual text classification task in the context of the crisis domain, where rapidly filtering relevant data regardless of language is critical to improving the situational awareness of emergency services. Specifically, we address two research questions: a) Can a custom neural network model over XLM-R, trained only in English for such a classification task, transfer knowledge to multilingual data and vice-versa? b) By employing an attention mechanism, does the model attend to words relevant to the task regardless of the language? To this end, we present an attention realignment mechanism that utilizes a parallel language classifier to minimize any linguistic differences between the source and target languages. Additionally, we pseudo-label the tweets from the target language, which are then combined with the tweets in the source language for retraining the model. We conduct experiments using Twitter posts (tweets) labelled as a ‘request’ in the open source data set by Appen (https://appen.com/datasets/combined-disaster-response-data/), consisting of multilingual tweets for crisis response. Experimental results show that attention realignment and pseudo-labelling improve the performance of unsupervised cross-lingual classification. We also present an interpretability analysis by evaluating the performance of attention layers on original versus translated messages.

KEYWORDS: Social Media, Crisis Management, Text Classification, Unsupervised Cross-Lingual Adaptation, Interpretability

ACM Reference Format: Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML’20). 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). KiML’20, August 24, 2020, San Diego, California, USA, © 2020 Copyright held by the author(s). https://doi.org/10.1145/nnnnnnn.nnnnnnn

===1 INTRODUCTION===
Social media platforms such as Twitter provide valuable information to aid emergency response organizations in gaining real-time situational awareness during the sudden onset of crisis situations [4]. Extracting critical information about affected individuals, infrastructure damage, medical emergencies, or food and shelter needs can help emergency managers make time-critical decisions and allocate resources efficiently [15, 21, 22, 30, 31, 36]. Researchers have designed numerous classification models to help towards this humanitarian goal of converting real-time social media streams into actionable knowledge [1, 22, 26, 28, 29]. Recently, with the advent of multilingual models such as multilingual BERT [9] and XLM [20], researchers have started applying them to multilingual disaster tweets [6, 25]. Since XLM-R [7] has been shown to be the strongest model in cross-lingual language understanding, we restrict our work to this model to explore the aspects of cross-lingual transfer of knowledge and interpretability.

Figure 1: Problem: Unsupervised cross-lingual tweet classification, e.g., train a model using English tweets, predict labels for Multilingual tweets, and vice-versa.

In this work, we address two questions. The first is to examine whether XLM-R is effective in capturing multilingual knowledge, by constructing a custom model over it to analyze whether a model trained using English-only tweets will generalize to multilingual data and vice-versa. Social media streams are generally different from other text, given the user-generated content. For example, tweets are usually short, with possible errors and ambiguity in the behavioral expressions. These properties in turn make the classification task, or extracting representations, more challenging. The second question is to examine whether word translations will be equally attended by the attention layers. For instance, the words with higher attention weights in a sentence in Haitian Creole such as “Tanpri nou bezwen tant avek dlo nou zon silo mesi” should align with the words in its corresponding translated tweet in English, “Please, we need tents and water. We are in Silo, Thank you!”. Our core idea is that if ‘dlo’ in the Haitian tweet has a higher weight, so should its English translation ‘water’. This word-level language agnostic property can make machine learning models more interpretable. It also brings several benefits to downstream tasks such as knowledge graph construction using keywords extracted from tweets. In situations where data is available only in one language, this similarity in attention would still allow us to extract relevant phrases in cross-lingual settings. To the best of our knowledge, aligning attention in a cross-lingual setting has not been attempted before in the crisis analytics domain. In this work, we restrict our classification experiments to tweets containing ‘request’ intent; we plan to expand to other behaviors, tasks, and datasets in the future.

Contributions: We propose a novel attention realignment method which promotes the task classifier to be more language agnostic, which in turn tests the effectiveness of multilingual knowledge capture of the XLM-R model for crisis tweets; and a pseudo-labelling procedure to further enhance the model’s generalizability. Further, incorporating the attention-based mechanism allows us to perform an interpretability analysis on the model, by comparing how words are attended in the original versus translated tweets.

===2 RELATED WORK AND BACKGROUND===
There are numerous prior works (cf. surveys [4, 14]) that focus specifically on disaster related data to perform classification and other rapid assessments during the onset of a new disaster event. A crisis period is an important but challenging situation, where collecting labeled data during an ongoing event is very expensive. This problem led to several works on domain adaptation techniques in which machine learning models can learn and generalize to unseen crisis events [3, 10, 18, 23]. In the context of crisis data, Nguyen et al. [28] designed a convolutional neural network model which does not require any feature engineering, and Alam et al. [1] designed a CNN architecture with adversarial training on graph embeddings. Krishnan et al. [19] showed that sharing a common layer for multiple tasks can improve performance of tasks with limited labels.

In the multilingual or cross-lingual direction, many works [8, 17] tried to align word embeddings (such as fastText [27]) from different languages into the same space so that a word and its translations have the same vector. These models are superseded by models such as multilingual BERT [9] and XLM-R [7] that produce contextual embeddings which can be pretrained using several languages together to achieve impressive performance gains on multilingual use-cases.

The attention mechanism [2, 24] is one of the most widely used methods in deep learning. It constructs a context vector by weighing the entire input sequence, which improves over previous sequence-to-sequence models [13, 34, 35]. As the model produces weights associated with each word in a sentence, this allows for evaluating interpretability by comparing the words that are given priority in original versus translated tweets.

With more and more machine learning systems being adopted by diverse application domains, transparency in decision-making inevitably becomes an essential criterion, especially in high-risk scenarios [12] where trust is of utmost importance. With deep neural networks, including natural language systems, shown to be easily fooled [16], there have been many promising ideas that empower machine learning systems with the ability to explain their predictions [5, 32]. Gilpin et al. [11] present a survey of interpretability in machine learning, which provides a taxonomy of research that addresses various aspects of this problem. Similar to the work by Ross et al. [33], we employ an attention-based approach to evaluate model interpretability applied to the crisis domain.

===3 METHODOLOGY===
====3.1 Problem Statement: Unsupervised Cross-Lingual Crisis Tweet Classification====
Consider tweets in language A and their corresponding translated tweets in language B. The task of unsupervised cross-lingual classification is to train a classifier using the data only from the source language and predict the labels for the data in the target language. This experimental setup is usually represented as 𝐴 → 𝐵 for training a model using A and testing on B, or 𝐵 → 𝐴 for training a model using B and testing on A. 𝑋 refers to the data and 𝑌 refers to the ground truth labels. The multilingual dataset used in our experiments consists of original multilingual (𝑚𝑙) tweets and their translated (𝑒𝑛) tweets in English. To summarize:

Experiment 𝐴 (𝑒𝑛 → 𝑚𝑙):

Input: <math>X_{en}, Y_{en}, X_{ml}</math>

Output: <math>Y_{ml}^{pred} \leftarrow predict(X_{ml})</math>

Experiment 𝐵 (𝑚𝑙 → 𝑒𝑛):

Input: <math>X_{ml}, Y_{ml}, X_{en}</math>

Output: <math>Y_{en}^{pred} \leftarrow predict(X_{en})</math>

====3.2 Overview====
In the following sections, we propose two methodologies to enhance cross-lingual classification: 1) Attention Realignment and 2) Pseudo-Labelling. Attention realignment utilizes a language classifier, trained in parallel, to realign the attention layer of the task classifier such that the weights are more geared towards task-specific words regardless of the language. Pseudo-Labelling further enhances the classifier by adding high quality seeds from the target language that are pseudo-labelled by the task classifier.

====3.3 Attention Realignment by Parallel Language Classifier====
As depicted in Fig 2, the model on the left side is the task classifier and the model on the right side is a language classifier that is trained in parallel. The purpose of this language classifier is to pick up aspects that are missed by the XLM-R model. These could be tweet-specific, crisis-specific, or other linguistic nuances that can separate original tweets and translated tweets. Note that semantically, translated words are expected to have similar XLM-R representations.

Attention realignment is a mechanism we introduce to promote the task classifier to be more language independent. The main idea
is that the words that are given higher attention in a language classifier should be less important in a task classifier. For example, ‘dlo’ in Haitian and ‘water’ in English should have the same vector representation in language agnostic models, while the sentence structure, grammar, and other nuances can vary.

Figure 2: Attention Realignment with Pseudo-Labelling over XLM-R model

{| class="wikitable"
|+ Table 1: Notations
! Notation !! Definition
|-
| 𝑒𝑛 || Tweets translated to English (‘message’ column in the dataset)
|-
| 𝑚𝑙 || Multilingual Tweets (‘original’ column in the dataset)
|-
| 𝛼 || Attention Layer
|-
| 𝑇 || A component that uses Task-specific data, i.e., + and − ‘Request’ tweets
|-
| 𝐿 || A component that uses Language-specific data, i.e., 𝑒𝑛 and 𝑚𝑙 tweets
|-
| <math>a_{BiLSTM}</math> || Activation from the BiLSTM layer
|-
| 𝛽, 𝛾, 𝜁 || Hyperparameters
|}

We enforce this rule by constructing two operations:

(1) Attention Difference: When a sentence goes through model M1, it also goes through model M2. For the same sentence, this returns two attention layer weights: one from the task classifier (<math>\vec{\alpha}_T</math>) and the other from the language classifier (<math>\vec{\alpha}'_T</math>). Directly subtracting <math>\vec{\alpha}'_T</math> from <math>\vec{\alpha}_T</math> poses two issues: 1) we do not know whether they are comparable and 2) <math>\vec{\alpha}'_T</math> may have negative values. A simple solution is to normalize both vectors and clip <math>\vec{\alpha}'_T</math> such that it is between 0 and 1. Thus, an attention subtraction step is as follows:

<math>\frac{\vec{\alpha}_T}{\|\vec{\alpha}_T\|} - \gamma_T \, \mathrm{clip}\left(\frac{\vec{\alpha}'_T}{\|\vec{\alpha}'_T\|}, 0, 1\right)</math> (1)

where <math>\gamma_T</math> is a hyperparameter to tune the amount of subtraction needed for the task classifier. Similarly, for the language classifier,

<math>\frac{\vec{\alpha}'_L}{\|\vec{\alpha}'_L\|} - \gamma_L \, \mathrm{clip}\left(\frac{\vec{\alpha}_L}{\|\vec{\alpha}_L\|}, 0, 1\right)</math> (2)

(2) Attention Loss: Along with attention difference, the model can also be trained by inserting an additional loss function term that penalizes the similarity between the attention weights from the two classifiers. We use the Frobenius norm:

<math>L_{At} = \|\vec{\alpha}_T^{\top} \vec{\alpha}'_T\|_F^2</math> (3)

<math>L_{Al} = \|\vec{\alpha}_L^{\top} \vec{\alpha}'_L\|_F^2</math> (4)

for task and language respectively. The resulting final loss function of joint training is:

<math>L(\theta) = \zeta_T CE_T + \beta_T L_{At} + \zeta_L CE_L + \beta_L L_{Al}</math> (5)

where 𝛽 is the hyperparameter to tune the attention loss weight, 𝜁 is the hyperparameter to tune the joint training loss, and 𝐶𝐸 denotes the binary cross entropy loss,

<math>CE = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]</math> (6)

It is important to note that the Frobenius norm is not simply between the attention weights of the two models but rather between the attention weights produced by the two models on the same input tweet. For example, for a given tweet, the task classifier attends more to task-specific words and the language classifier attends to language-specific words. So the mechanism makes sure that they are distinct.

====3.4 Pseudo-Labelling====
To enhance the model further, we pseudo-label the data in the target language. For example, if we are training a model using the English tweets, we use the original tweets before translation for pseudo-labelling. The idea is simply to gather high-quality seeds from the target to retrain the model. Note that we still do not use any target labels here, in keeping with the unsupervised goal. Thus, for retraining model M1 for 𝑒𝑛 → 𝑚𝑙, the new dataset would consist of <math>X_{en}^{+}</math> and <math>X_{ml}^{pseudo+}</math> as positive examples and <math>X_{en}^{-}</math> and <math>X_{ml}^{pseudo-}</math> as negative examples.

====3.5 XLM-R Usage====
The recommended feature usage of XLM-R (https://github.com/facebookresearch/XLM) is either by fine-tuning to the task or by aggregating features from all the 25 layers. We employ the latter to extract the multilingual embeddings for the tweets.

===4 DATASET & EXPERIMENTAL SETUP===
{| class="wikitable"
|+ Table 2: Dataset Statistics for both 𝑒𝑛 and 𝑚𝑙
! !! Train !! Validation !! Test
|-
| Positive || 3554 || 418 || 496
|-
| Negative || 17473 || 2152 || 2128
|}

We use the open source dataset from Appen (https://appen.com/datasets/combined-disaster-response-data/) consisting of multilingual crisis response tweets. The dataset statistics for tweets with ‘request’ behavior labels are shown in Table 2. For all the experiments, the dataset is balanced for each split.

Each experiment is denoted as 𝐴 → 𝐵, where 𝐴 is the data that is used to train the model and 𝐵 is the data that is used for testing the model. For example, 𝑒𝑛 → 𝑚𝑙 means we train the model using English tweets and test on multilingual tweets.

Models are implemented in Keras and the details are shown in Table 3. Hyperparameters <math>\beta_T</math>, <math>\beta_L</math>, <math>\gamma_T</math>, and <math>\gamma_L</math> are not exhaustively tuned; we leave this exploration for future work.

{| class="wikitable"
|+ Table 3: Implementation Details
! Setting !! Value
|-
| <math>T_x</math> || 30
|-
| Deep Learning Library || Keras
|-
| Optimizer || Adam [lr = 0.005, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01]
|-
| Maximum Epoch || 100
|-
| Dropout || 0.2
|-
| Early Stopping Patience || 10
|-
| Batch Size || 32
|-
| <math>\zeta_T</math> || 1
|-
| <math>\zeta_L</math> || 0.1
|-
| <math>\beta_T, \beta_L, \gamma_T, \gamma_L</math> || 0.01
|}

===5 RESULTS & DISCUSSION===
Table 4 shows the cross-lingual performance comparison of all the models.

{| class="wikitable"
|+ Table 4: Performance Comparison (Accuracy in %) for Source → Target, with Source → Source in brackets. Baseline = XLMR + BiLSTM + Attention. Model M1 = Baseline + Attention Realignment. Model M2 = Model M1 + Pseudo-Labelling.
! !! Baseline !! Model M1 !! Model M2
|-
| 𝑒𝑛 → 𝑚𝑙 || 59.98 (80.57) || 62.53 (77.02) || 66.79 (82.39)
|-
| 𝑚𝑙 → 𝑒𝑛 || 60.93 (70.07) || 65.69 (63.50) || 70.95 (73.84)
|}

The three models are described below:

(1) Baseline: The baseline model consists of embeddings retrieved from XLM-R, trained over BiLSTMs and Attention layers. This is a traditional sequence (text) classifier enhanced with an attention mechanism. Activations from the BiLSTM layers are weighed by the attention layer to construct the context vector, which is then passed through a dense layer and softmax function to produce the classification output.

(2) Model M1: Adding attention realignment to the baseline model produces model M1. Attention realignment is achieved through a language classifier which is trained in parallel with the goal of making the task classifier more language agnostic. The attention weights for both task and language classifiers are manipulated by each other during training by a process of subtraction (attention difference) as well as a loss component (attention loss). See section 3.3.

(3) Model M2: Adding the pseudo-labelling procedure to model M1 produces model M2. Using Model M1, which is trained to be language agnostic, tweets from the target languages are pseudo-labelled. High quality seeds are selected (using Model M1, 𝑝 > 0.7) and added to the original training dataset to retrain the task classifier.

Results show that, for cross-lingual evaluation on 𝑒𝑛 → 𝑚𝑙, model M1 outperforms the baseline by a relative +4.3% (59.98 → 62.53) and model M2 outperforms it by +11.4% (59.98 → 66.79). On 𝑚𝑙 → 𝑒𝑛, model M1 outperforms the baseline by +7.8% (60.93 → 65.69) and model M2 outperforms it by +16.5% (60.93 → 70.95). This shows that both models are effective in cross-lingual crisis tweet classification. An interesting observation is that using attention realignment alone decreased the classification performance in the same language, which is brought back up by pseudo-labelling. These scores are shown in brackets in Table 4. A deeper investigation in this direction on various other tasks can shed more light on the impact of the realignment mechanism.

Figure 3: Attention visualization example for ‘request’ tweets: words and their attention weights for two tweets in Haitian Creole and its translation in English (darker the shade, higher the attention).

====5.1 Interpretability: Attention Visualization====
We follow an attention architecture similar to the one shown in [18]. The context vector is constructed as the dot product between the attention weights and word activations. This represents the interpretable layer in our architecture. The attention weights represent the importance of each word in the classification process. Two examples are shown in Figure 3.
In the first example, both 𝑒𝑛 → 𝑒𝑛 and 𝑚𝑙 → 𝑚𝑙 give attention to the word ‘hungry’ (i.e., ‘grangou’ in Haitian Creole). Note that these two are results from the models that are trained in the same language in which they are tested, so ideal performance is expected. For the baseline model in the cross-lingual setup 𝑒𝑛 → 𝑚𝑙, although it correctly predicts the label, the attention weights are more spread apart. In model M2 with attention realignment and pseudo-labelling, although with some spread, the attention weights are shifted more toward ‘grangou’. Similarly in example 2, the attention weights in the baseline model are more spread apart. Cross-lingual performance of model M2 aligns more with 𝑒𝑛 → 𝑒𝑛 and 𝑚𝑙 → 𝑚𝑙. These examples show the importance of having interpretability as a key criterion in cross-lingual crisis tweet classification problems; the attention weights can also be used for downstream tasks such as extracting relevant keywords for knowledge graph construction.

===6 CONCLUSION===
We presented a novel approach to the unsupervised cross-lingual crisis tweet classification problem using a combination of an attention realignment mechanism and a pseudo-labelling procedure (over the state-of-the-art multilingual model XLM-R) to promote the task classifier to be more language agnostic. Performance evaluation showed that models M1 and M2 outperformed the baseline by +4.3% and +11.4% respectively for cross-lingual text classification from English to Multilingual. We also presented an interpretability analysis by comparing the attention layers of the models. It shows the importance of incorporating a word-level language agnostic characteristic in the learning process when training data is available only in one language. Performing extensive hyperparameter tuning and expanding the idea to other tasks (including cross-task/multi-task) are left as future work. As another direction for future work, we plan to incorporate human-engineered knowledge from multilingual knowledge graphs such as BabelNet into our model architecture, which could improve the learning of similar concepts across languages that are critical to crisis response agencies.

Reproducibility: Source code is available at: https://github.com/jitinkrishnan/Cross-Lingual-Crisis-Tweet-Classification

===7 ACKNOWLEDGEMENT===
The authors would like to thank U.S. National Science Foundation grants IIS-1815459 and IIS-1657379 for partially supporting this research.

===REFERENCES===
[1] Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151 (2018).

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

[3] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing. 120–128.

[4] Carlos Castillo. 2016. Big crisis data: social media in disasters and time-critical situations. Cambridge University Press.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems. 2172–2180.

[6] Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020. Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 292–298.

[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).

[8] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word Translation Without Parallel Data. arXiv preprint arXiv:1710.04087 (2017).

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[10] Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014).

[11] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA). IEEE, 80–89.

[12] David Gunning. 2017. Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017).

[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.

[14] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR) 47, 4 (2015), 1–38.

[15] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894 (2016).

[16] Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328 (2017).

[17] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745 (2018).

[18] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Diversity-Based Generalization for Neural Unsupervised Text Classification under Domain Shift. https://arxiv.org/pdf/2002.10937.pdf (2020).

[19] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Social Web Data for Emergency Services. arXiv preprint arXiv:2003.04991 (2020).

[20] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019).

[21] Kathy Lee, Ankit Agrawal, and Alok Choudhary. 2013. Real-time disease surveillance using twitter data: demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 1474–1477.

[22] Hongmin Li, Doina Caragea, Cornelia Caragea, and Nic Herndon. 2018. Disaster response aided by tweet classification with a domain adaptation approach. Journal of Contingencies and Crisis Management 26, 1 (2018), 16–27.

[23] Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. 2018. Hierarchical attention transfer network for cross-domain sentiment classification. In Thirty-Second AAAI Conference on Artificial Intelligence.

[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).

[25] Guoqin Ma. 2019. Tweets Classification with BERT in the Field of Disaster Management. https://pdfs.semanticscholar.org/d226/185fa1e14118d746cf0b04dc5be8f545ec24.pdf.

[26] Reza Mazloom, Hongmin Li, Doina Caragea, Cornelia Caragea, and Muhammad Imran. 2019. A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets. International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 11, 2 (2019), 1–19.

[27] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

[28] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902 (2016).

[29] Ferda Ofli, Patrick Meier, Muhammad Imran, Carlos Castillo, Devis Tuia, Nicolas Rey, Julien Briant, Pauline Millet, Friedrich Reinhard, Matthew Parkan, et al. 2016. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big data 4, 1 (2016), 47–59.

[30] Bahman Pedrood and Hemant Purohit. 2018. Mining help intent on twitter during disasters via transfer learning with sparse coding. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 141–153.

[31] Hemant Purohit, Carlos Castillo, Fernando Diaz, Amit Sheth, and Patrick Meier. 2013. Emergency-relief coordination on social media: Automatically matching resource requests and offers. First Monday 19, 1 (Dec. 2013).

[32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135–1144.

[33] Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717 (2017).

[34] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.

[35] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.

[36] István Varga, Motoki Sano, Kentaro Torisawa, Chikara Hashimoto, Kiyonori Ohtake, Takao Kawai, Jong-Hoon Oh, and Stijn De Saeger. 2013. Aid is out there: Looking for help from tweets during a large scale disaster. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1619–1629.