A Tweet Text Binary Artificial Neural Network Classifier

Theodore Nikoletopoulos 1, Claudia Wolff 2
1 Unaffiliated, Athens, Greece
2 Kiel University, Kiel, Germany
theo_nikoletopoulos@yahoo.co.uk, wolff@geographie.uni-kiel.de

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online

ABSTRACT
We present an Artificial Neural Network (ANN) text classifier for the task of automatically detecting whether a tweet is flood-related or not. The framework for classifying flood-related tweets consists of three basic ANN models. Each model is a different ANN type, and the final output is determined by a majority rule on the individual model outputs. The overall F1-score on the test set was 0.5405, significantly lower than on the training/validation set, suggesting that we overfitted the training set.

1 INTRODUCTION
This research was conducted as part of the 'Flood-Related Multimedia Task' challenge provided by the Multimedia Evaluation Benchmark (MediaEval) 2020 [1]. The goal of the task is to automatically identify and classify tweets which are relevant to flooding in Northeastern Italy. For this binary classification problem, we used different types of ANNs to automatically classify the tweet's text [2]. As different types of ANNs might capture different characteristics of the input, we chose to implement three different types and to determine the final decision by a majority rule on the individual ANN outputs.

2 APPROACH

2.1 Text Vectorization
To convert the tweet's text to a numeric format, as required by the ANNs' input layers, we make use of word embeddings [3]. Word embeddings map words onto low-dimensional vectors (compared to other numerical text representation formats), with the important property that words with similar meaning are mapped to vectors which are close to each other (e.g. in Euclidean distance) in the associated vector space [3]. Word embeddings are calculated by ANNs trained on large corpora, and many sets of such embeddings exist for many different languages. However, rather than using pre-calculated word embeddings, we found that including an Embedding layer in our models and learning the embeddings from scratch, jointly with the classification task, produced better F1-scores on the dev. set.

To calculate the desired word embeddings, we first tokenize the text, i.e. decompose it into individual words, symbols, punctuation marks, etc. Each token is assigned an index, and we consider a vocabulary of the most frequent tokens. Further, we set the length of the text's representation as a sequence of tokens to a fixed value. Both the vocabulary size and the sequence length are hyperparameters with which one can experiment.
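The vectorization step above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the helper names are hypothetical, whitespace splitting stands in for a real tokenizer, and the constants reuse the values reported in Section 3.1.

```python
# Hypothetical sketch of the tokenization/vectorization step: build a
# vocabulary of the most frequent tokens and map each tweet to a
# fixed-length sequence of token indices.
from collections import Counter

VOCAB_SIZE = 3000   # number of most frequent tokens kept (Sec. 3.1)
SEQ_LEN = 40        # fixed sequence length (Sec. 3.1)
PAD, UNK = 0, 1     # reserved indices for padding and unknown tokens

def build_vocab(texts, vocab_size=VOCAB_SIZE):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    # indices 0 and 1 are reserved, so frequent tokens start at index 2
    return {tok: i + 2
            for i, (tok, _) in enumerate(counts.most_common(vocab_size - 2))}

def vectorize(text, vocab, seq_len=SEQ_LEN):
    # truncate to seq_len, map out-of-vocabulary tokens to UNK
    idx = [vocab.get(tok, UNK) for tok in text.lower().split()][:seq_len]
    return idx + [PAD] * (seq_len - len(idx))  # pad to fixed length
```

The resulting index sequences are what the Embedding layer consumes; a production tokenizer would also handle punctuation and tweet-specific entities.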
2.2 Undersampling
As mentioned in [1], the dataset is skewed/imbalanced: there are fewer samples of the positive class (i.e. flood-related) than of the negative class (approximately 20%-80%). This makes training the model hard, because during training it is presented with more negative samples; consequently it 'learns' the negative class better and misclassifies many positive samples, leading to a poor F1-score.

To tackle this issue, we use undersampling as follows: we keep all positive samples of the training set and randomly select some (not all) of the negative samples, so as to obtain a negative-to-positive class ratio closer to one and therefore a more balanced set. The value of this ratio is a hyperparameter which can be fine-tuned.
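The undersampling scheme above can be sketched as follows; this is a hypothetical helper written for illustration (the function name and seeding are assumptions), with the target ratio defaulting to the value reported in Section 3.1.

```python
# Hypothetical sketch of the undersampling step: keep every positive
# sample and draw a random subset of negatives so that the
# negative-to-positive ratio approaches a target value (1.75 in Sec. 3.1).
import random

def undersample(samples, labels, ratio=1.75, seed=0):
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    rng = random.Random(seed)
    n_neg = min(len(neg), int(ratio * len(pos)))  # target negative count
    neg = rng.sample(neg, n_neg)                  # random subset, no replacement
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)  # mix the classes before training
    return data
```

With the 20%-80% split described above, a ratio of 1.75 roughly halves the negative class while discarding no positive samples.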
2.3 ANN Models
Many ANN types for different tasks exist [2]. In this study, we deal with a binary classification problem whose solution may be viewed as a partition of the embedding space into two sets, one for each class. This can be achieved by a Multi-Layer Perceptron (MLP) added after the Embedding layer of the model. We chose a simple architecture: one hidden layer with 32 units and a ReLU activation function, followed by a single output unit with a sigmoid activation function.

We then build on the previous model by adding a layer of the so-called Recurrent Neural Networks (RNN), consisting of 32 bidirectional LSTM units. RNNs are models whose units have an internal state acting as memory; they are thus capable of processing and learning sequence characteristics, since they can 'remember' inputs seen in the past. A typical application of RNNs is time-series prediction, but since text is a sequence of (correlated) words, they are also widely used in Natural Language Processing (NLP). The LSTM layer is placed after the Embedding layer, and on top of it we keep the previous MLP structure.

Finally, we employed another type of ANN capable of handling sequences, the Convolutional Neural Network (CNN). Here, learning a sequence is achieved via a different mechanism, which exploits the mathematical operation of convolution of the input sequence with a small kernel. We thus placed after the Embedding layer two parallel convolutional layers with 32 kernels of length 5 each. The outputs of these parallel layers are then merged and fed into the previous MLP architecture.

To convert the continuous ANN output (between zero and one) to a binary label (i.e. flood-related input text or not), we use a threshold. Texts with output above the threshold are labelled as flood-related (i.e. one), and texts with output below the threshold are labelled zero. The threshold is chosen for each model separately by maximizing the F1-score. Finally, the text's class was assigned by a majority rule on the three models' outputs.
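The decision rule described above can be sketched as follows. The helper names are hypothetical, written for illustration only; the thresholds are the ones reported in Section 3.1.

```python
# Hypothetical sketch of the decision rule: each model's continuous
# output is binarized with its own threshold, and the final label is
# the majority vote of the three models.
def binarize(score, threshold):
    # 1 = flood-related, 0 = not flood-related
    return 1 if score > threshold else 0

def majority_vote(scores, thresholds):
    votes = [binarize(s, t) for s, t in zip(scores, thresholds)]
    return 1 if sum(votes) >= 2 else 0  # at least 2 of 3 models agree

# thresholds reported in Sec. 3.1 for the MLP, RNN, and CNN respectively
THRESHOLDS = (0.40, 0.65, 0.40)
```

Because each model votes after its own threshold is applied, a model with a higher operating threshold (here the RNN) contributes a positive vote only for confident predictions.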
3 RESULTS AND DISCUSSION

3.1 Model setup and performance
After experimenting with various values, we settled on a vocabulary of size 3000, a sequence length of 40, an embedding vector dimension of 300, and an undersampling ratio of 1.75. The vocabulary size and sequence length are small compared to typical Natural Language Processing (NLP) applications, due to the short form of tweet text. The architecture of the ANNs used is described above.

The ANNs were trained and evaluated individually on the same train/validation sets, which were created by splitting the devset at an 80-20% ratio. The F1-scores on the validation set were 0.59 for the MLP and 0.60 for both the RNN and the CNN. These scores were obtained by choosing thresholds of 0.40, 0.65, and 0.40, respectively.
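The per-model threshold selection mentioned above can be sketched as follows; the helpers are hypothetical illustrations (the candidate grid and function names are assumptions, not the authors' code).

```python
# Hypothetical sketch of threshold selection: scan candidate thresholds
# on the validation set and keep the one that maximizes the F1-score.
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, y_true, candidates=None):
    # assumed grid of candidate thresholds in (0, 1)
    candidates = candidates or [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: f1_score(y_true,
                                      [1 if s > t else 0 for s in scores]))
```

Selecting thresholds on the validation set in this way is itself a form of fitting, which is consistent with the gap between validation and test scores reported below.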
Finally, we combined the three ANN outputs by assigning to each input the majority class of the three outputs. We chose this strategy hoping that each ANN would capture different idiosyncrasies of the input. The overall F1-score improved slightly to 0.61. Our score on the test set was 0.5405, significantly lower, suggesting that we overfitted the training set.

3.2 Limitations of the study
The main challenge of the task was related to the labelling of the training dataset. We noticed that many samples looked flood-related on visual inspection but were not labelled as such (some example ids are: 940319294084202496, 944240672294531073, 950753737466830940, 1059017654088790018, 1055172135587536896). Further, we noticed that many positive samples stem from meteorological alerts. This could restrict the training set, explain the model's difficulties in generalizing well, and thus influence the overall model performance.

3.3 Outlook: ways to improve the performance
Experimenting with simpler text representations, such as Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) vectors, together with a Logistic Regression classifier, revealed that taking into account tweet entities such as hashtags, in addition to the plain text, improved predictive performance. However, due to time limitations, this approach was not implemented in our ANN framework. Furthermore, it would require more sophisticated tokenization schemes, able to extract hashtags, than those used for the ANNs' input.

Geographical information of tweets, either in the form of metadata (e.g. coordinates, place attribute) or location mentions in the tweet's text, could be exploited to geolocate the tweet and possibly be used as additional input to the model, especially since the dev. set focuses on a particular study area [1].

Finally, let us mention that this study focused solely on the tweet's text, without considering the associated image. A two-branch model, where one branch would be the model presented here without its output layer and the other branch an image classifier, both feeding the same output layer, could be used to handle both text and image input.

3.4 Code availability
The model was implemented as a Google Colab IPython notebook, and code is available upon request (theo_nikoletopoulos@yahoo.co.uk).

REFERENCES
[1] Stelios Andreadis, Ilias Gialampoukidis, Anastasios Karakostas, Stefanos Vrochidis, Ioannis Kompatsiaris, Roberto Fiorin, Daniele Norbiato, and Michele Ferri. 2020. The Flood-related Multimedia Task at MediaEval 2020. In MediaEval 2020.
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. www.deeplearningbook.org
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111-3119.