A Tweet Text Binary Artificial Neural Network Classifier

Theodore Nikoletopoulos 1, Claudia Wolff 2
1 Unaffiliated, Athens, Greece
2 Kiel University, Kiel, Germany
theo_nikoletopoulos@yahoo.co.uk, wolff@geographie.uni-kiel.de

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online

ABSTRACT
We present an Artificial Neural Network (ANN) text classifier for the task of automatically detecting whether a tweet is flood-related or not. The framework for classifying flood-related tweets consists of three basic ANN models. Each model is a different ANN type, and the final output is determined by a majority rule on the individual model outputs. The overall F1-score on the test set was 0.5405, significantly lower than on the training/validation set, suggesting that we overfitted the training set.

1 INTRODUCTION
This research was conducted as part of the 'Flood-Related Multimedia Task' challenge provided by the Multimedia Evaluation Benchmark (MediaEval) 2020 [1]. The goal of the task is to automatically identify and classify tweets which are relevant to flooding in Northeastern Italy. For this binary classification problem, we used different types of ANNs to automatically classify the tweet's text [2]. As different types of ANNs might capture different characteristics of the input, we chose to implement three different types and to determine the final decision by a majority rule on the individual ANN outputs.

2 APPROACH

2.1 Text Vectorization
To convert the tweet's text to a numeric format, as required by the ANNs' input layers, we make use of word embeddings [3]. Word embeddings map words onto low-dimensional vectors (compared to other numerical text representation formats), with the important property that words with similar meaning are mapped to vectors which are close to each other (e.g. in Euclidean distance) in the associated vector space [3]. Word embeddings are calculated by ANNs trained on large corpora, and many sets of such embeddings exist for many different languages. However, rather than using pre-calculated word embeddings, we found that including an Embedding layer in our models and learning the embeddings from scratch, jointly with the classification task, produced better F1-scores on the dev. set.

To calculate the desired word embeddings, we first tokenize the text, i.e. decompose it into individual words, symbols, punctuation marks, etc. Each token is assigned an index, and we consider a vocabulary of the most frequent tokens. Further, we set the length of the text's representation as a sequence of tokens to a fixed value. Both the vocabulary size and the sequence length are hyperparameters with which one can experiment.
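The vectorization step above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the helper names are hypothetical, whitespace splitting stands in for a real tokenizer, and the constants reuse the values reported in Section 3.1.

```python
# Hypothetical sketch of the tokenization/vectorization step: build a
# vocabulary of the most frequent tokens and map each tweet to a
# fixed-length sequence of token indices.
from collections import Counter

VOCAB_SIZE = 3000   # number of most frequent tokens kept (Sec. 3.1)
SEQ_LEN = 40        # fixed sequence length (Sec. 3.1)
PAD, UNK = 0, 1     # reserved indices for padding and unknown tokens

def build_vocab(texts, vocab_size=VOCAB_SIZE):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    # indices 0 and 1 are reserved, so frequent tokens start at index 2
    return {tok: i + 2
            for i, (tok, _) in enumerate(counts.most_common(vocab_size - 2))}

def vectorize(text, vocab, seq_len=SEQ_LEN):
    # truncate to seq_len, map out-of-vocabulary tokens to UNK
    idx = [vocab.get(tok, UNK) for tok in text.lower().split()][:seq_len]
    return idx + [PAD] * (seq_len - len(idx))  # pad to fixed length
```

The resulting index sequences are what the Embedding layer consumes; a production tokenizer would also handle punctuation and tweet-specific entities.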
2.2 Undersampling
As mentioned in [1], the dataset is skewed/imbalanced: there are fewer samples of the positive class (i.e. flood-related) than of the negative class (approximately 20%-80%). This makes training the model hard, because during training it is presented with more negative samples; consequently it 'learns' the negative class better and misclassifies many positive samples, leading to a poor F1-score.

To tackle this issue, we use undersampling as follows: we keep all positive samples of the training set and randomly select some (not all) of the negative samples, so as to obtain a negative-to-positive class ratio closer to one and therefore a more balanced set. The value of this ratio is a hyperparameter which can be fine-tuned.
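The undersampling scheme above can be sketched as follows; this is a hypothetical helper written for illustration (the function name and seeding are assumptions), with the target ratio defaulting to the value reported in Section 3.1.

```python
# Hypothetical sketch of the undersampling step: keep every positive
# sample and draw a random subset of negatives so that the
# negative-to-positive ratio approaches a target value (1.75 in Sec. 3.1).
import random

def undersample(samples, labels, ratio=1.75, seed=0):
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    rng = random.Random(seed)
    n_neg = min(len(neg), int(ratio * len(pos)))  # target negative count
    neg = rng.sample(neg, n_neg)                  # random subset, no replacement
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)  # mix the classes before training
    return data
```

With the 20%-80% split described above, a ratio of 1.75 roughly halves the negative class while discarding no positive samples.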
2.3 ANN Models
Many ANN types for different tasks exist [2]. In this study, we deal with a binary classification problem whose solution may be viewed as a partition of the embedding space into two sets, one for each class. This can be achieved by a Multi-Layer Perceptron (MLP) added after the Embedding layer of the model. We chose a simple architecture: one hidden layer with 32 units and a ReLU activation function, followed by a single output unit with a sigmoid activation function.

We then build on the previous model by adding a layer of the so-called Recurrent Neural Networks (RNN), consisting of 32 bidirectional LSTM units. RNNs are models whose units have an internal state acting as memory; they are thus capable of processing and learning sequence characteristics, since they can 'remember' inputs seen in the past. A typical application of RNNs is time-series prediction, but since text is a sequence of (correlated) words, they are also widely used in Natural Language Processing (NLP). The LSTM layer is placed after the Embedding layer, and on top of it we keep the previous MLP structure.

Finally, we employed another type of ANN capable of handling sequences, the Convolutional Neural Network (CNN). Here, learning a sequence is achieved via a different mechanism, which exploits the mathematical operation of convolution of the input sequence with a small kernel. We thus placed after the Embedding layer two parallel convolutional layers with 32 kernels of length 5 each. The outputs of these parallel layers are then merged and fed into the previous MLP architecture.

To convert the continuous ANN output (between zero and one) to a binary label (i.e. flood-related input text or not), we use a threshold. Texts with output above the threshold are labelled as flood-related (i.e. one), and texts with output below the threshold are labelled zero. The threshold is chosen for each model separately by maximizing the F1-score. Finally, the text's class was assigned by a majority rule on the three models' outputs.
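The decision rule described above can be sketched as follows. The helper names are hypothetical, written for illustration only; the thresholds are the ones reported in Section 3.1.

```python
# Hypothetical sketch of the decision rule: each model's continuous
# output is binarized with its own threshold, and the final label is
# the majority vote of the three models.
def binarize(score, threshold):
    # 1 = flood-related, 0 = not flood-related
    return 1 if score > threshold else 0

def majority_vote(scores, thresholds):
    votes = [binarize(s, t) for s, t in zip(scores, thresholds)]
    return 1 if sum(votes) >= 2 else 0  # at least 2 of 3 models agree

# thresholds reported in Sec. 3.1 for the MLP, RNN, and CNN respectively
THRESHOLDS = (0.40, 0.65, 0.40)
```

Because each model votes after its own threshold is applied, a model with a higher operating threshold (here the RNN) contributes a positive vote only for confident predictions.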
3 RESULTS AND DISCUSSION

3.1 Model setup and performance
After experimenting with various values, we settled on a vocabulary of size 3000, a sequence length of 40, an embedding vector dimension of 300, and an undersampling ratio of 1.75. The vocabulary size and sequence length are small compared to typical Natural Language Processing (NLP) applications, due to the short form of tweet text. The architecture of the ANNs used is described above.

The ANNs were trained and evaluated individually on the same train/validation sets, which were created by splitting the devset at an 80-20% ratio. The F1-scores on the validation set were 0.59 for the MLP and 0.60 for both the RNN and the CNN. These scores were obtained by choosing thresholds of 0.40, 0.65, and 0.40, respectively.
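The per-model threshold selection mentioned above can be sketched as follows; the helpers are hypothetical illustrations (the candidate grid and function names are assumptions, not the authors' code).

```python
# Hypothetical sketch of threshold selection: scan candidate thresholds
# on the validation set and keep the one that maximizes the F1-score.
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, y_true, candidates=None):
    # assumed grid of candidate thresholds in (0, 1)
    candidates = candidates or [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: f1_score(y_true,
                                      [1 if s > t else 0 for s in scores]))
```

Selecting thresholds on the validation set in this way is itself a form of fitting, which is consistent with the gap between validation and test scores reported below.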
Finally, we combined the three ANN outputs by assigning to each input the majority class of the three outputs. We chose this strategy hoping that each ANN would capture different idiosyncrasies of the input. The overall F1-score improved slightly to 0.61. Our score on the test set was 0.5405, significantly lower, suggesting that we overfitted the training set.

3.2 Limitations of the study
The main challenge of the task was related to the labelling of the training dataset. We noticed that many samples looked flood-related on visual inspection but were not labelled as such (some example ids are: 940319294084202496, 944240672294531073, 950753737466830940, 1059017654088790018, 1055172135587536896). Further, we noticed that many positive samples stem from meteorological alerts. This could restrict the training set, explain the model's difficulties in generalizing well, and thus influence the overall model performance.

3.3 Outlook: ways to improve the performance
Experimenting with simpler text representations, such as Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) vectors, together with a Logistic Regression classifier, revealed that taking into account tweet entities such as hashtags, in addition to the plain text, improved predictive performance. However, due to time limitations, this approach was not implemented in our ANN framework. Furthermore, it would require more sophisticated tokenization schemes, able to extract hashtags, than those used for the ANNs' input.

Geographical information of tweets, either in the form of metadata (e.g. coordinates, place attribute) or location mentions in the tweet's text, could be exploited to geolocate the tweet and possibly be used as additional input to the model, especially since the dev. set focuses on a particular study area [1].

Finally, let us mention that this study focused solely on the tweet's text, without considering the associated image. A two-branch model, where one branch would be the model presented here without its output layer and the other branch an image classifier, both feeding the same output layer, could be used to handle both text and image input.

3.4 Code availability
The model was implemented as a Google Colab IPython notebook, and code is available upon request (theo_nikoletopoulos@yahoo.co.uk).

REFERENCES
[1] Stelios Andreadis, Ilias Gialampoukidis, Anastasios Karakostas, Stefanos Vrochidis, Ioannis Kompatsiaris, Roberto Fiorin, Daniele Norbiato, and Michele Ferri. 2020. The Flood-related Multimedia Task at MediaEval 2020. In MediaEval 2020.
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. www.deeplearningbook.org
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111-3119.