=Paper=
{{Paper
|id=Vol-2936/paper-189
|storemode=property
|title=Detection of hate speech spreaders using convolutional neural networks
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-189.pdf
|volume=Vol-2936
|authors=Marco Siino,Elisa Di Nuovo,Ilenia Tinnirello,Marco La Cascia
|dblpUrl=https://dblp.org/rec/conf/clef/SiinoNTC21
}}
==Detection of hate speech spreaders using convolutional neural networks==
Detection of hate speech spreaders using convolutional neural networks
Notebook for PAN at CLEF 2021

Marco Siino¹, Elisa Di Nuovo², Ilenia Tinnirello¹ and Marco La Cascia¹
¹ Università degli Studi di Palermo, Dipartimento di Ingegneria, Palermo, 90128, Italy
² Università degli Studi di Torino, Dipartimento di Lingue e Letterature Straniere e Culture Moderne, Torino, 10124, Italy

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
marco.siino@unipa.it (M. Siino); elisa.dinuovo@unito.it (E. Di Nuovo); ilenia.tinnirello@unipa.it (I. Tinnirello); marco.lacascia@unipa.it (M. La Cascia)
ORCID: 0000-0002-4453-5352 (M. Siino); 0000-0002-4814-982X (E. Di Nuovo); 0000-0002-1305-0248 (I. Tinnirello); 0000-0002-8766-6395 (M. La Cascia)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
In this paper we describe a deep learning model based on a Convolutional Neural Network (CNN). The model was developed for the Profiling Hate Speech Spreaders (HSSs) task proposed by the PAN 2021 organizers and hosted at the 2021 CLEF Conference. Our approach to the task of classifying an author as HSS or not (nHSS) takes advantage of a CNN based on a single convolutional layer. In this binary classification task, on tests performed using a 5-fold cross validation, the proposed model reaches a maximum accuracy of 0.80 on the multilingual (i.e., English and Spanish) training set, and a minimum loss value of 0.51 on the same set. As announced by the task organizers, the trained model reaches an overall accuracy of 0.79 on the full test set, obtained by averaging the accuracies achieved on the two languages. In particular, the organizers announced that our model achieves an accuracy of 0.85 on the Spanish test set and of 0.73 on the English test set. Thanks to the model presented in this paper, our team won the 2021 PAN competition on profiling HSSs.

Keywords
Author Profiling, Hate Speech, Twitter, Spanish, English

1. Introduction
The aim of the PAN 2021 Profiling Hate Speech Spreaders (HSSs) on Twitter task [1, 2] was to investigate whether the author of a given Twitter thread is likely to spread tweets containing hate speech or not. The multilingual dataset provided by the task organizers, consisting of an English and a Spanish dataset, comprised 120,000 tweets: 200 tweets per author, 200 authors per language in the training set and 100 authors per language in the test set [3]. The model we used to compete in the task consists of a shallow Convolutional Neural Network (CNN). Broadly speaking, our network preprocesses each sample in the dataset to build a dictionary¹ where the key is an integer number and the value is an n-gram resulting from our custom preprocessing function. Each integer (i.e., a dictionary key) is then mapped to a single point in a 100-dimensional word embedding space.
Then, a 1-dimensional convolution is applied. The output of the convolution layer is fed into an average pooling layer and then into a global average pooling layer, which is fully connected to a single dense output unit.

The remainder of this paper is organized as follows. In Section 2 we present some related work on the usage of deep learning methods for similar text classification tasks. In Section 3 we describe our approach in detail, explaining our choices and the configuration of each layer of the model. In Section 4 we report the results obtained in the final 5-fold cross validation and on the test set. In Section 5 we discuss some interesting future works and in Section 6 we conclude the paper.

¹ In this paper we use the well-known computer science concept of a dictionary: a dictionary (or associative array) is a data type composed of a collection of (key, value) pairs. However, the TextVectorization class used in our model uses the term vocabulary, referring to a list of tokens (e.g., n-grams returned after preprocessing some input text) in which keys are token indices for that list and values are the tokens.

2. Related works
To address the problem of identifying HSSs on the Twitter microblogging platform, we started from the analysis and study of state-of-the-art techniques for text classification [4, 5, 6]. Our choice of a deep learning-based approach was dictated by the notable performance reached in various text classification tasks, where deep learning methods outperformed classical techniques used in natural language processing (e.g., Bayes, Decision Tree, K-Nearest Neighbour, Support Vector Machine), as reported in [7, 8]. Hybrid approaches are also present in the literature, as in the case of a CNN used to extract textual features combined with an SVM to carry out the classification and prediction [9]; however, results similar to those obtained by a CNN alone can be achieved. As a starting point to develop our model we considered the work conducted in [10], in which, for the first time, a CNN was used to address a text classification task, obtaining promising performance compared to the state of the art. To represent text in vector format, a common representation in most deep learning models is based on word embeddings [11, 12]. Thanks to word embedding-based methods, since 2015 [13] both deep and non-deep models have performed very well on several text classification tasks.

As already mentioned, our work on the task of identifying users prone to spread hate speech on Twitter by analysing their last 200 tweets was experimentally tested on the dataset provided by the PAN 2021 competition organizers. A similar author profiling task was organized the previous year [14], in which participants had to identify authors prone to spread fake news based on their last 100 tweets. The winners were [15] and [16]; their models obtained an overall accuracy of 0.77 on the provided test set. The winning approaches were based on an SVM with n-grams and on an ensemble of different machine learning models. As reported in [14], the only approach based on CNNs was the one presented in [17], achieving an overall accuracy of 0.72.

3. Proposed model
The architecture of the model is presented in Figure 1, in which the dimensions of the inputs and outputs of each model layer are highlighted.

Figure 1: Model architecture. Numbers in brackets indicate tensor dimensions; None stands for the batch dimension, which is not known before running the model. Layers as depicted on our Google Colab Notebook.

Each layer of the network was chosen taking into account the works presented in Section 2, as well as those cited below, and the hints gained from our extensive tests conducted on the provided training set. Indeed, many experiments were carried out to determine appropriate hyperparameter values and the network architecture.
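As an overview, a minimal TensorFlow/Keras sketch consistent with Figure 1 is shown below. This is our own reconstruction, not the authors' notebook code: the name build_model is ours, the hyperparameter values are those reported in the following subsections, and details such as the pooling stride or padding may differ from the submitted model (the actual model also embeds the text vectorization step of Section 3.1 as its first layer).

```python
import tensorflow as tf

SEQUENCE_LENGTH = 3911   # length of the longest training sample (Section 3.1)
EMBEDDING_DIM = 100      # word embedding dimension (Section 3.2)

def build_model(vocab_size: int) -> tf.keras.Model:
    # Sketch of the stack in Figure 1, from token indices to the single output unit.
    inputs = tf.keras.Input(shape=(SEQUENCE_LENGTH,), dtype="int32")   # (None, 3911)
    x = tf.keras.layers.Embedding(vocab_size, EMBEDDING_DIM)(inputs)   # (None, 3911, 100)
    x = tf.keras.layers.Conv1D(64, 36, activation="relu")(x)           # 64 filters of size 36 (Section 3.3)
    x = tf.keras.layers.AveragePooling1D(pool_size=8, strides=1)(x)    # pool size 8 (Section 3.4)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)                    # (None, 64)
    outputs = tf.keras.layers.Dense(1, activation="linear")(x)         # single float output (Section 3.5)
    return tf.keras.Model(inputs, outputs)
```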
In what follows, we present each layer of the network and the chosen hyperparameter values. Before discussing the network architecture, it is important to bear in mind that each set of the dataset (training and test, per language) is made of XML files, one per author, each containing 200 tweets by that author. In addition, a ground truth file matching the labels 0 and 1 to each XML file is provided for the training set. Before training, our system reads the ground truth file and organizes the XML files into two folders (i.e., 0 and 1). Then each sample (i.e., a single XML file) is read by the model for training or testing, depending on the fold considered during validation. This sample-reading function is accomplished by the first layer (namely, the InputLayer).

3.1. Text vectorization
The first layer of our model reads the text in the XML files and applies our custom preprocessing function to split it into n-grams. In what follows, we refer to an n-gram as a sequence of characters. This sequence of characters is determined by looking at the spaces before and after the sequence considered (i.e., n-grams are split from the input text at spaces). We then build a dictionary where the keys are integer numbers and the values are the n-grams from the training set. When this space-based tokenization is applied to the English dataset, we likely obtain n-grams that correspond to traditional tokens, or syntactic words; this is not the case when it is applied to Spanish. Hence, n-gram being the broader term as defined above, we prefer saying n-grams instead of tokens. Since the classification of HSSs is approached as an author profiling task, we decided to keep punctuation and capitalization to preserve a certain amount of stylistic information in our dictionary entries. For example, when splitting text we associate different dictionary entries to the words Hello, hello and Hello!. The hyperparameters characterizing this layer are described below.

• Standardize. This is the preprocessing function applied to the text before proceeding with its vectorization. In our case, in addition to removing tabulations and newline characters, this function substitutes the occurrences of the CDATA tag with a space followed by a minus sign, adds a space between the closing of one tag and the next, and then splits the text into n-grams at each space;
• Max tokens. This parameter is the dictionary size. To obtain this value we simply count the number of different n-grams resulting from our preprocessing step. It is worth noting that our dictionary is built by scanning both the Spanish and the English training sets;
• Output mode. This parameter is the type of token index returned by the vectorization function. We used the INT type, so that every n-gram is mapped to a positive integer number;
• Sequence length. Although each XML document contains 200 tweets, the number of produced n-grams differs for each sample because of the different length of each tweet. For this reason, we decided to use the length of the longest sample of the training set as the size value, padding the shorter documents. As shown in Figures 1 and 2, this size is 3,911. As mentioned, padding is used for documents with fewer than 3,911 resulting token indices; longer documents in the test set would be cut at this value.

The resulting output of this layer is a sequence of 3,911 positive integers corresponding to the dictionary keys of the n-grams of the XML document considered. Some random examples of value → key pairs in our dictionary are shown below.

rock → 210
Hi! → 2315
pregunta → 1508
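A minimal configuration sketch of this layer is given below. It is our own approximation, not the authors' code: the regular expressions in custom_standardization only roughly reproduce the replacements described above, and max_tokens is left at its default so that every distinct n-gram is kept in the dictionary.

```python
import tensorflow as tf

SEQUENCE_LENGTH = 3911  # longest training sample (Section 3.1)

def custom_standardization(text: tf.Tensor) -> tf.Tensor:
    # Rough approximation (our assumption) of the standardize step described above:
    # drop tabs and newlines, strip the CDATA markers, and separate adjacent XML tags
    # so that the whitespace split below yields the intended n-grams.
    text = tf.strings.regex_replace(text, r"[\t\n]", " ")
    text = tf.strings.regex_replace(text, r"<!\[CDATA\[|\]\]>", " ")
    text = tf.strings.regex_replace(text, r"><", "> <")
    return text

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    split="whitespace",                        # n-grams are split at spaces
    output_mode="int",                         # each n-gram mapped to a positive integer
    output_sequence_length=SEQUENCE_LENGTH,    # pad or truncate to 3,911 indices
)
# The dictionary would be built by adapting the layer on both training sets, e.g.:
# vectorize_layer.adapt(train_texts)  # train_texts: a tf.data.Dataset of raw XML strings
```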
3.2. Embedding
This layer takes as input a tensor of 3,911 integer numbers generated as described in the previous subsection. Each integer value of this tensor is mapped to a 100-dimensional word embedding tensor; in this way, each integer from the previous layer is mapped to a single tensor consisting of 100 floating point values. A notable difference with respect to the previous layer is that the 100 coordinate values of each tensor are updated at each optimization step while training the model. More precisely, we trained and tested multiple models varying the word embedding space from 2 to 800 dimensions, as also discussed for a similar Twitter text classification problem [18]. The best performance over different 5-fold cross validation tests was obtained with a 100-dimensional embedding space.

3.3. Convolution
In our model a single 1D-convolution layer is implemented. This layer consists of 64 filters of size 36. The layer performs convolution over windows of 36 n-gram embeddings with a stride of 1 (i.e., after each convolution, the convolutional filter is shifted by one word embedding tensor). For this layer no padding is added and ReLU [19, 20] is used as the activation function on the output values. The number of filters and the filter size (i.e., the two main parameters of this layer) are of paramount importance for the global performance of the model. Indeed, the filter size determines the size of the windows over the text of the input sample; we observed that a filter of size 36 generally covers n-grams from 3–4 different tweets at a time. Similarly, the number of filters used (i.e., 64) determines the number of different feature maps relevant for the classification task. Both parameters were determined after extensive experiments conducted on the training set over many 5-fold cross validation runs. To fine-tune these two hyperparameters, we performed a binary search [21, 22] for both, looking in the range 1–1,024. We found that a number of filters greater than 256 increases the overfitting of the model, while a filter size greater than 1,024 does not allow the model to reach an accuracy of 1.0 even on the training fold considered.

3.4. Average and global average pooling
The average pooling layer [23] downsamples the input representation by taking the average value over the window defined by a pool size parameter; the window is shifted by the stride. As an example, consider the single-dimension array X = [1.0, 2.0, 3.0, 4.0, 5.0]. Defining a 1D-average pooling layer with a pool size of 2 and a stride of 1 and providing X as input to such a layer, the array Y = [1.5, 2.5, 3.5, 4.5] is returned. In this case too, in the attempt to find the best values for the hyperparameters of this layer, we performed a binary search and found an optimum value of 8 for the pool size and 1 for the stride. The pool size of 8 represents the number of convolution outputs averaged at each step. We suppose that the optimal stride value of 1 might be due to our tokenization choices.
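The small worked example above can be reproduced directly with the Keras layer cited in [23]. The following sketch is ours and uses pool size 2 and stride 1 as in the example, whereas the model itself uses a pool size of 8.

```python
import tensorflow as tf

# Reproduces the worked example above with tf.keras.layers.AveragePooling1D [23].
x = tf.reshape(tf.constant([1.0, 2.0, 3.0, 4.0, 5.0]), (1, 5, 1))  # (batch, steps, channels)

pooled = tf.keras.layers.AveragePooling1D(pool_size=2, strides=1)(x)

print(tf.reshape(pooled, (-1,)).numpy())  # -> [1.5 2.5 3.5 4.5]
```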
A final 1D-Global Average Pooling layer is similar to the previously described average pooling layer. In this case, however, it is not the average over a window of the defined pool size that is returned as output but a global average along the first dimension of the previous layer's output. Looking at Figure 1, the output of the AveragePooling1D layer is made of 484x64 elements; the global average is calculated along the first dimension, of size 484, in fact reducing the dimension of the input tensor by one. As an example, consider the following matrix X:

$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{d1} & x_{d2} & x_{d3} & \dots & x_{dn} \end{bmatrix}$$

Providing X as input to a 1D-Global Average Pooling layer returns as output the following array Y, where each y_i is calculated by averaging all the values along the i-th column of the matrix X:

$$Y = \begin{bmatrix} y_1 & y_2 & y_3 & \dots & y_n \end{bmatrix}$$

3.5. Dense
The Global Average Pooling 1D layer is fully connected to the last layer, which is a single dense output unit. The layer is followed by a simple linear activation (i.e., a(x) = x). The final output is a single float value: positive values are classified as HSSs and negative ones as nHSSs. A threshold of 0.0 is thus used to determine the accuracy of the model in predicting the label of the provided sample.

3.6. Model training
The values assigned to the various hyperparameters were originally set taking into account many of the decisions adopted in the studies conducted in [24, 25] and were subsequently fine-tuned to improve the accuracy achieved by our model. To initialize the weights of the model we used a Glorot uniform initializer [26]. The model is compiled with a binary cross entropy loss function; this function calculates the loss with respect to the two classes (i.e., 0 and 1) as defined in Equation (1):

$$\mathrm{Loss}_{BCE} = -\frac{1}{N}\sum_{n=1}^{N}\left[\, y_n \log\left(h_\theta(x_n)\right) + (1 - y_n)\log\left(1 - h_\theta(x_n)\right) \right] \qquad (1)$$

where:
• N is the number of training examples;
• y_n is the target label for training sample n;
• x_n is input sample n;
• h_θ is the neural network model with weights θ.

Optimization is performed with the Adam optimizer [27] after each batch of data is given as input. We performed a binary search to find the optimal batch size; the model achieved the best overall accuracy with a batch size of 2.
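To make the training setup concrete, a hedged sketch of how this configuration could be expressed in Keras is given below. It is not the authors' code: build_model refers to the hypothetical helper sketched earlier, train_ds and val_ds are placeholder datasets for the current fold, and from_logits=True is our assumption, consistent with the linear output activation and the 0.0 threshold of Section 3.5.

```python
import tensorflow as tf

# Hedged sketch of the training configuration in Section 3.6 (not the authors' code).
# Glorot uniform initialization is the Keras default for Dense/Conv1D kernels [26].
model = build_model(vocab_size)  # hypothetical helper from the architecture sketch above

model.compile(
    optimizer=tf.keras.optimizers.Adam(),                        # Adam optimizer [27]
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),   # Equation (1); from_logits is our assumption
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],    # 0.0 decision threshold (Section 3.5)
)

# train_ds / val_ds: placeholder tf.data.Dataset objects yielding
# (token-index sequence, 0/1 label) pairs for the current fold.
# Batch size of 2 as found via binary search; 15 epochs per fold as in Section 4.2.
model.fit(train_ds.batch(2), validation_data=val_ds.batch(2), epochs=15)
```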
The model architecture is depicted in Figure 2, where the number of parameters of each layer is provided.

Figure 2: Model representation showing the number of parameters involved at each layer. Such a small number of parameters allows a low computational load for training and testing. Figure as depicted on our Google Colab notebook.

Table 1: Dataset summary showing the number of samples (i.e., authors) for each set, language and class.

                     Training Set   Test Set
English | class 0        100           50
English | class 1        100           50
Spanish | class 0        100           50
Spanish | class 1        100           50

4. Experimental evaluation and results
In this section we report the results obtained by our model during evaluation with a 5-fold cross validation on the training set. Then, we report the results of the trained model on the test set.

4.1. Experiments
Table 1 reports the number of samples within each set. Each sample in all the sets is an XML file whose name corresponds to the author id, and each XML file contains 200 tweets of the considered author. Considering that we have 200 tweets per author (the number of authors is shown in Table 1), the whole dataset consists of 120,000 tweets. The task organizers invited participants to deploy their models on TIRA [28]; however, as communicated by email by the organizers, TIRA has been experiencing technical issues. Therefore, our model was developed and tested as a Jupyter Notebook in Google Colab using TensorFlow. The complete source code is publicly available and reusable.²

² Model notebook: https://colab.research.google.com/drive/1hUwn_uk0YPC6Tpo3MK1gDVGPQxmzPh_E

To validate our model and fine-tune its hyperparameters, we ran a 5-fold cross validation for each test performed. We considered the full training set made by the union of both language sets, shuffled this multilingual training set and used it for the cross validations, splitting it 80–20 for each fold. Specifically, the first fold was made using the first 80% of the samples for training and the remaining 20% for validation. The remaining folds were made as reported in Table 2, which gives the order of the percentages of samples taken for training and validation from the complete training set.

Table 2: 5-fold splitting applied on the complete multilingual training set. T indicates that the corresponding percentage of samples is used for training and V for validation on that fold.

1-Fold: 80%T - 20%V
2-Fold: 60%T - 20%V - 20%T
3-Fold: 40%T - 20%V - 40%T
4-Fold: 20%T - 20%V - 60%T
5-Fold: 20%V - 80%T
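As an illustration of the splitting scheme in Table 2, the following sketch (our own reconstruction, not the authors' code) builds the five contiguous train/validation partitions from a hypothetical list of (XML path, label) pairs.

```python
import random

# Sketch of the 5-fold splitting in Table 2 (our reconstruction).
# `samples` is a hypothetical list of (xml_path, label) pairs for the merged
# English + Spanish training set (400 authors in total).
def make_folds(samples, n_folds=5, seed=42):
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # shuffle the multilingual training set
    fold_size = len(shuffled) // n_folds       # 400 / 5 = 80 authors per block
    folds = []
    for f in range(1, n_folds + 1):
        start = (n_folds - f) * fold_size      # fold 1 validates on the last 20%,
        end = start + fold_size                # fold 5 on the first 20% (Table 2)
        val = shuffled[start:end]
        train = shuffled[:start] + shuffled[end:]
        folds.append((train, val))
    return folds
```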
4.2. Results
In Table 3 we report the results obtained adopting a 5-fold cross validation on the complete multilingual training dataset. The five folds were made as explained in the previous subsection. Table 3 reports the accuracy and loss values achieved on the validation set used at each fold, together with the arithmetic mean and standard deviation. For each fold we trained the model for 15 epochs and report the highest accuracy, and the related loss, over the 15 training epochs with respect to the validation set of that fold. As can be noted, some splits achieved better performance, and this could be due to a higher level of similarity between the considered training and validation sets.

Table 3: Results achieved by the model on a 5-fold cross validation on the complete multilingual training set (i.e., Spanish and English data). Both loss and accuracy are computed on the validation set of the indicated fold. The last two columns report the arithmetic mean and the standard deviation over the 5 folds.

Fold Nr.    1        2        3        4        5        Avg.     Dev.
Accuracy    0.6625   0.7000   0.6750   0.8000   0.6875   0.7050   0.0491
Loss        0.6097   0.7070   0.7771   0.5074   0.6234   0.6449   0.0916

Finally, as reported on the PAN website, our model achieved an accuracy of 0.73 on the English test set and of 0.85 on the Spanish test set.³ Considering these results, the overall accuracy (i.e., the arithmetic mean of the accuracies achieved per language) is 0.79.

³ PAN 2021 task results: https://pan.webis.de/clef21/pan21-web/author-profiling.html#results

5. Future works
In future developments, it would be interesting to analyze the behaviour of our model and the output of each layer. Another point deserving investigation concerns the reasons why the tested models are better at identifying HSSs in Spanish than in English: in every test performed, our models performed better on the Spanish set, even when using a non-deep model (i.e., a Naïve Bayes model using the same preprocessing function). A qualitative analysis of the dataset tweets could be beneficial for understanding why profiling HSSs in Spanish achieves a higher accuracy than in English. This investigation could help us shed some light on this accuracy difference between the two datasets (which also concerned last year's PAN author profiling task), as well as on how our model works, by looking at both correct and wrong predictions. Another direction to improve accuracy in profiling HSSs could be to add more complexity to the model, for example using some additional layers. Given the size of the provided dataset, some data augmentation techniques could also be applied. Finally, some investigation of the content of each tweet could guide us in applying techniques to remove noise (i.e., non-relevant features) from the input samples.

6. Conclusion
In this paper, we described the model submitted to the Profiling HSSs on Twitter task at PAN 2021. It consists of a CNN based on a single convolutional layer. To obtain a more accurate evaluation of the model performance, we ran a 5-fold cross validation for each different hyperparameter configuration; in particular, several binary searches were conducted to fine-tune the hyperparameters involved in training our proposed model. After finding the model achieving the highest accuracy in our cross validation tests, we trained that model on the entire training set to submit our predictions on the test set. The model maintained the good accuracy achieved during cross validation also when tested on the test set. Overall, as announced by the organizers, our software, achieving an average accuracy of 0.79 (0.73 on the English test set and 0.85 on the Spanish test set), ranked first in the PAN 2021 HSSs profiling task. Our model, developed in TensorFlow, is publicly available as a Jupyter Notebook on Google Colab.

Acknowledgments
We would like to thank the anonymous reviewers for their comments and suggestions, which have helped to improve the presentation of the paper.

CRediT Authorship Contribution Statement
Marco Siino: Conceptualization, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing - Original draft, Writing - Review & editing. Elisa Di Nuovo: Formal analysis, Investigation, Writing - Original draft, Writing - Review & editing. Ilenia Tinnirello: Supervision, Writing - Review & editing. Marco La Cascia: Supervision, Writing - Review & editing.

References
[1] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021, p. 1.
[2] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021, p. 1.
[3] F. Rangel, M. A. Chulvi, G. L. De La Pena, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter [Data set], https://zenodo.org/record/4603578, 2021.
[4] M. Thangaraj, M. Sivakami, Text classification techniques: A literature review, Interdisciplinary Journal of Information, Knowledge & Management 13 (2018).
[5] B. Altınel, M. C. Ganiz, Semantic text classification: A survey of past and recent advances, Information Processing & Management 54 (2018) 1129–1153.
[6] R. Oshikawa, J. Qian, W. Y. Wang, A survey on natural language processing for fake news detection, arXiv preprint arXiv:1811.00770 (2018).
[7] H. Wu, Y. Liu, J. Wang, Review of text classification methods on deep learning, CMC-Computers, Materials & Continua 63 (2020) 1309–1321.
[8] S. Hashida, K. Tamura, T. Sakai, Classifying tweets using convolutional neural networks with multi-channel distributed representation, IAENG International Journal of Computer Science 46 (2019) 68–75.
[9] Z. Wang, Z. Qu, Research on web text classification algorithm based on improved CNN and SVM, in: 2017 IEEE 17th International Conference on Communication Technology (ICCT), IEEE, 2017, pp. 1958–1961.
[10] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882 (2014).
[11] G. E. Hinton, et al., Learning distributed representations of concepts, in: Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, Amherst, MA, 1986, p. 12.
[12] S. Wang, W. Zhou, C. Jiang, A survey of word embeddings based on deep learning, Computing 102 (2020) 717–740.
[13] A. Severyn, A. Moschitti, Twitter sentiment analysis with deep convolutional neural networks, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015, pp. 959–962.
[14] F. Rangel, A. Giachanou, B. Ghanem, P. Rosso, Overview of the 8th author profiling task at PAN 2020: Profiling fake news spreaders on Twitter, in: CLEF, 2020, p. 1.
[15] J. Pizarro, Using n-grams to detect fake news spreaders on Twitter, in: CLEF, 2020, p. 1.
[16] J. Buda, F. Bolonyai, An ensemble model using n-grams and statistical features to identify fake news spreaders on Twitter, in: CLEF, 2020, p. 1.
[17] M. P. Chilet L., Profiling fake news spreaders on Twitter, in: CLEF, 2020.
[18] X. Yang, C. Macdonald, I. Ounis, Using word embeddings in Twitter election classification, Information Retrieval Journal 21 (2018) 183–207.
[19] K. Fukushima, Visual feature extraction by a multilayered network of analog threshold elements, IEEE Transactions on Systems Science and Cybernetics 5 (1969) 322–333.
[20] K. Fukushima, S. Miyake, Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition, in: Competition and Cooperation in Neural Nets, Springer, 1982, pp. 267–285.
[21] L. F. Williams Jr, A modification to the half-interval search (binary search) method, in: Proceedings of the 14th Annual Southeast Regional Conference, 1976, pp. 95–101.
[22] D. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1973.
[23] TensorFlow, AveragePooling1D layer, https://keras.io/api/layers/pooling_layers/average_pooling1d/, 2021.
[24] Y. Zhang, B. Wallace, A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification, arXiv preprint arXiv:1510.03820 (2015).
[25] A. Jacovi, O. S. Shalom, Y. Goldberg, Understanding convolutional neural networks for text classification, arXiv preprint arXiv:1809.08037 (2018).
[26] Keras, Layer weight initializers, https://keras.io/api/layers/initializers/, 2021.
[27] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
[28] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019, p. 1. doi:10.1007/978-3-030-22948-1_5.
A. Online Resources
The source code of our model is available via:
• GitHub,
• Google Colab.