<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Tweet Text Binary Artificial Neural Network Classifier</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kiel University</institution>
          ,
          <addr-line>Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Theodore Nikoletopoulos</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Unaffiliated</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We present an Artificial Neural Network (ANN) text classifier for the task of automatically detecting whether a tweet is flood-related or not. The framework for classifying flood-related tweets consists of three basic ANN models. Each model is a different ANN type, and the final output is determined by a majority rule on the individual model outputs. The overall F1 score on the test set was 0.5405, significantly lower than on the training/validation set, suggesting that we overfitted the training set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        This research was conducted as part of the ‘Flood-Related
Multimedia Task’ challenge provided by the Multimedia
Evaluation Benchmark (MediaEval) 2020 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The goal of the task
is to automatically identify and classify tweets which are relevant
to flooding in Northeastern Italy. For this binary classification
problem, we used different types of ANNs to automatically classify
the tweet’s text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As different types of ANNs might capture
different characteristics of the ANN input, we chose to implement
three different types and determine the final decision by using a
majority rule on the individual ANN outputs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Text Vectorization</title>
      <p>
        To convert the tweet’s text to the numeric format required by the
ANNs’ input layers, we make use of word embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Word
embeddings map words onto low-dimensional vectors (low-dimensional
compared to other numerical text representations),
with the important property that words with similar meaning are
mapped to vectors which are close to each other (e.g. in Euclidean
distance) in the associated vector space [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Word embeddings are calculated by ANNs trained on large
corpora, and many sets of such embeddings exist for many different
languages. However, rather than using pre-calculated word
embeddings, we found that including an Embedding layer in our
models and learning the embeddings from scratch, jointly
with the classification task, produced better F1-scores on the
development set.</p>
      <p>In order to calculate the desired word embeddings, we first tokenize
the text, i.e. decompose it into individual words, symbols, punctuation
marks etc. Each token is assigned an index, and we consider a
vocabulary of the most frequent tokens. Further, we fix the length
of the text’s representation as a sequence of tokens.
Both the vocabulary size and the sequence length are
hyperparameters with which one can experiment.</p>
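      <p>The tokenization and indexing steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the regular expression, the reserved padding and unknown-token indices, and the helper names are all assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved indices (an assumption of this sketch)


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(texts, vocab_size):
    """Map the most frequent tokens to indices 2 .. vocab_size - 1."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = counts.most_common(vocab_size - 2)
    return {tok: i + 2 for i, (tok, _) in enumerate(most_common)}


def encode(text, vocab, seq_len):
    """Convert text to a list of exactly seq_len token indices."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))


texts = ["Flood warning for Venice!", "Nice weather in Venice today"]
vocab = build_vocab(texts, vocab_size=3000)
encoded = [encode(t, vocab, seq_len=8) for t in texts]
```
]]></preformat>
      <p>Longer texts are truncated and shorter ones padded, so every tweet reaches the Embedding layer as an index sequence of the same length.</p>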
    </sec>
    <sec id="sec-4">
      <title>2.2 Undersampling</title>
      <p>
        As mentioned in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the dataset is imbalanced: there are
fewer samples of the positive (i.e. flood-related) class than of the
negative class (approximately 20% versus 80%). This makes training the
model hard because it is presented with more
negative samples during training, consequently ‘learns’ the negative
class better and misclassifies many positive samples, leading to a
poor F1-score.
      </p>
      <p>To tackle this issue, we use undersampling as follows: we keep all
positive samples of the training set and randomly select some (not
all) of the negative samples in order to obtain a set with a
negative-to-positive class ratio closer to one, and therefore a more balanced set.
The value of this ratio is a hyperparameter which can be fine-tuned.</p>
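      <p>A minimal sketch of this undersampling step is given below. The function name, the fixed random seed and the toy data are assumptions made for illustration; the ratio is interpreted as the number of negative samples kept per positive sample.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def undersample(samples, labels, neg_pos_ratio, seed=0):
    """Keep all positives and a random subset of the negatives.

    neg_pos_ratio is the desired number of negatives per positive.
    """
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    n_neg = min(len(neg), round(neg_pos_ratio * len(pos)))
    rng = random.Random(seed)
    kept_neg = rng.sample(neg, n_neg)
    data = [(s, 1) for s in pos] + [(s, 0) for s in kept_neg]
    rng.shuffle(data)  # avoid presenting all positives first
    return data


# Toy data with roughly the 20% / 80% class split described above.
samples = [f"tweet {i}" for i in range(100)]
labels = [1] * 20 + [0] * 80
balanced = undersample(samples, labels, neg_pos_ratio=1.75)
```
]]></preformat>
      <p>With 20 positive samples and a ratio of 1.75, the sketch keeps 35 of the 80 negative samples.</p>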
    </sec>
    <sec id="sec-5">
      <title>2.3 ANN Models</title>
      <p>
        Many ANN types for different tasks exist [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this study, we are
dealing with a binary classification problem whose solution may be
viewed as a partition of the embedding space into two sets, one for
each class. This can be achieved by a Multi-Layer Perceptron (MLP)
added after the Embedding layer of the model. We chose a simple
architecture of one hidden layer with 32 units having a ReLU
activation function, followed by a single output unit with a sigmoid
activation function.
      </p>
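      <p>To make the MLP head concrete, the following sketch computes its forward pass in plain Python. It assumes that the embedded tweet has already been reduced to a single feature vector (how this is done is not specified here), and the random weights and the input size of 10 are placeholders rather than trained values.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(features, w_hidden, b_hidden, w_out, b_out):
    """One hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


rng = random.Random(0)
n_features, n_hidden = 10, 32  # 32 hidden units as in the text; 10 is arbitrary
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
            for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
features = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
score = mlp_forward(features, w_hidden, b_hidden, w_out, 0.0)
```
]]></preformat>
      <p>The returned score lies strictly between zero and one and can be read as the model’s confidence that the tweet is flood-related.</p>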
      <p>We then build on the previous model by adding a Recurrent
Neural Network (RNN) layer consisting of 32
bidirectional LSTM units. RNNs are models whose units have an
internal state acting as memory; they are thus capable of processing
and learning sequence characteristics, since they can ‘remember’
inputs seen in the past. A typical application of RNNs is time series
prediction, but since text is a sequence of (correlated) words, they
are also widely used in Natural Language Processing (NLP). The
LSTM layer is placed after the Embedding layer, and on top of that
we have the previous MLP structure.</p>
      <p>Finally, we employed another type of ANN capable of handling
sequences, the Convolutional Neural Network (CNN). Here,
learning a sequence is achieved via a different mechanism, which
exploits the mathematical operation of convolution of the input
sequence with a small kernel. We thus placed two parallel layers
with 32 kernels of length 5 each after the Embedding layer. The
outputs of those parallel Convolutional layers are then merged and
fed into the previous MLP architecture.</p>
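      <p>The convolution mechanism can be illustrated with a single kernel sliding over a sequence of embedded tokens. The sketch below is a simplified, assumed version: it uses one hand-written kernel instead of 32 learned ones, omits any activation function, and, as is usual in deep learning libraries, does not flip the kernel.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one kernel over a sequence of embedding vectors.

    sequence: list of seq_len vectors of dimension d
    kernel:   list of k vectors of dimension d
    returns:  list of seq_len - k + 1 responses, one per window
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias + sum(
            w * x
            for w_vec, x_vec in zip(kernel, window)
            for w, x in zip(w_vec, x_vec)
        )
        outputs.append(total)
    return outputs


# Toy input: 40 tokens with 3-dimensional embeddings; kernel of length 5.
sequence = [[float(i % 4), 1.0, -1.0] for i in range(40)]
kernel = [[0.5, 0.0, 0.0] for _ in range(5)]
feature_map = conv1d(sequence, kernel)
```
]]></preformat>
      <p>Each kernel thus responds to patterns spanning five consecutive tokens, and a sequence of 40 tokens yields 36 responses per kernel.</p>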
      <p>To convert the continuous (between zero and one) ANN output to a
binary label (i.e. flood-related input text or not), we use a threshold. Texts
with an output above the threshold are labelled as flood-related (i.e.
one), and texts with an output below the threshold are labelled zero.
The threshold is chosen for each model separately by maximizing
the F1-score. Finally, the text’s class is assigned by a majority
rule on the three models’ outputs.</p>
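      <p>Choosing a threshold by maximizing the F1-score can be sketched as a simple grid search. The candidate grid, the toy scores and the function names below are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_score(y_true, y_pred):
    """F1 of the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Return the (threshold, f1) pair with the highest F1."""
    best = (None, -1.0)
    for thr in candidates:
        preds = [1 if s > thr else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best[1]:  # ties keep the lowest threshold
            best = (thr, f1)
    return best


y_true = [1, 0, 1, 0, 0, 1]
scores = [0.92, 0.47, 0.61, 0.22, 0.53, 0.37]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, f1 = best_threshold(y_true, scores, candidates)
```
]]></preformat>
      <p>On this toy data the search selects a threshold of 0.55, which gives an F1-score of 0.8.</p>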
    </sec>
    <sec id="sec-6">
      <title>3 RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Model setup and performance</title>
      <p>After experimenting with various values, we settled on a
vocabulary size of 3000, a sequence length of 40, an embedding vector
dimension of 300 and an undersampling ratio of 1.75. The vocabulary
size and sequence length are small compared to typical
NLP applications because tweet texts are short.
The architecture of the ANNs used is described
above.</p>
      <p>The ANNs were trained and evaluated individually on the same
training/validation sets, which were created by splitting the development set in
an 80/20 ratio. The F1-scores on the validation set were 0.59 for
the MLP and 0.60 for both the RNN and the CNN. Those scores were obtained
by choosing thresholds of 0.40, 0.65 and 0.40, respectively. Finally, we
combined the three ANN outputs by assigning to each input the
majority class of the three ANN outputs. We chose this strategy
hoping that each ANN would capture different
idiosyncrasies of the input. The overall F1 score improved slightly
to 0.61. Our score on the test set was 0.5405, significantly lower,
suggesting that we overfitted the training set.</p>
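      <p>The majority rule with per-model thresholds can be sketched as follows. The thresholds are those reported above; the model scores are invented for illustration and the function name is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
THRESHOLDS = {"mlp": 0.40, "rnn": 0.65, "cnn": 0.40}  # values reported above


def majority_vote(scores):
    """scores: dict mapping model name to its sigmoid output for one tweet."""
    votes = sum(1 for name, s in scores.items() if s > THRESHOLDS[name])
    return 1 if votes >= 2 else 0


examples = [
    {"mlp": 0.55, "rnn": 0.50, "cnn": 0.45},  # MLP and CNN vote 1
    {"mlp": 0.30, "rnn": 0.70, "cnn": 0.35},  # only the RNN votes 1
]
labels = [majority_vote(e) for e in examples]
```
]]></preformat>
      <p>With three voters and binary votes, ties cannot occur, so no tie-breaking rule is needed.</p>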
    </sec>
    <sec id="sec-8">
      <title>3.2 Limitations of the study</title>
      <p>The main challenge of the task was related to the labelling of the
training dataset. We noticed that many samples looked
flood-related on visual inspection but were not labelled as such (some
example ids are: 940319294084202496, 944240672294531073,
950753737466830940, 1059017654088790018,
1055172135587536896). Further, we noticed that many positive
samples are from meteorological alerts. This could restrict
the training set and may explain the model’s difficulty in
generalizing well, and thus influence the overall model
performance.</p>
      <p>T. Nikoletopoulos et al.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Outlook - Ways to improve the performance</title>
      <p>Experiments with simpler text representations, such as Bag of
Words (BOW) and Term Frequency-Inverse Document Frequency
(TF-IDF) vectors, combined with a Logistic Regression classifier revealed that
taking into account tweet entities such as hashtags, in addition to
the plain text, improved predictive performance.</p>
      <p>However, due to time limitations, this approach was not
implemented in our ANN framework. Further, it would require
tokenization schemes that are able to extract hashtags and are thus
more sophisticated than those used for the ANNs’ input.</p>
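      <p>One possible hashtag-aware tokenization scheme is sketched below. The regular expression and function names are assumptions made for illustration and were not part of the framework evaluated in this study.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hashtags are matched first so that "#flood" stays a single token.
TOKEN_RE = re.compile(r"#\w+|\w+|[^\w\s]")


def tokenize_with_hashtags(text):
    """Tokenize text, keeping each hashtag as one token."""
    return TOKEN_RE.findall(text.lower())


def hashtags(text):
    """Return only the hashtag tokens of a text."""
    return [tok for tok in tokenize_with_hashtags(text) if tok.startswith("#")]


tweet = "Acqua alta a Venezia! #flood #Venezia"
tokens = tokenize_with_hashtags(tweet)
tags = hashtags(tweet)
```
]]></preformat>
      <p>The extracted hashtags could then be added to the vocabulary as ordinary tokens or be passed to the model as a separate input.</p>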
      <p>
        Geographical information of tweets, either in the form of metadata
(e.g. coordinates, place attribute) or location mentions in the
tweet’s text, could be exploited to geolocate the tweet and
possibly be used as additional input to the model, especially since
the development set focuses on a particular study area [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Finally, let us mention that this study focused solely on the tweet’s
text without considering the associated image. A two-branch
model, where one branch would be the model presented here
excluding the output layer and the other branch an image classifier,
both feeding the same output layer, could be used to handle both
text and image input.</p>
    </sec>
    <sec id="sec-10">
      <title>3.4 Code availability</title>
      <p>The model was implemented as a Google Colab IPython notebook,
and the code is available upon request
(theo_nikoletopoulos@yahoo.co.uk).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Stelios</given-names>
            <surname>Andreadis</surname>
          </string-name>
          , Ilias Gialampoukidis, Anastasios Karakostas, Stefanos Vrochidis, Ioannis Kompatsiaris, Roberto Fiorin, Daniele Norbiato, and
          <string-name>
            <given-names>Michele</given-names>
            <surname>Ferri</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>The Flood-related Multimedia Task at MediaEval 2020</article-title>
          . In MediaEval
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Yoshua Bengio, and Aaron Courville:
          <article-title>Deep learning</article-title>
          . www.deeplearningbook.org
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>Advances in neural information processing systems</source>
          (pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>