<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Learning Based Framework for Classification of Water Quality in Social Media Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Hanif</string-name>
          <email>hanif.soomro@nu.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ammar Khawer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Atif Tahir</string-name>
          <email>atif.tahir@nu.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Rafi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Computer and Emerging Sciences, Karachi Campus</institution>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper describes the method proposed by team FAST-NU-DS for the task of WaterMM: Water Quality in Social Multimedia at MediaEval 2021. The task focuses on water security, safety, and quality, and requires building a classifier that differentiates whether a tweet is discussing water quality issues. The dataset consists of tweets with their text and metadata; a few tweets also contain images. The proposed method applies pre-processing steps to the text and tags of the dataset and uses Bidirectional Encoder Representations from Transformers (BERT). For the binary classification of images, it applies the Visual Geometry Group network (VGG16) pre-trained on the ImageNet dataset. The method achieved a 0.31 F1-score for text-only content, while the combination of text and images yielded a 0.24 F1-score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The enormous amount of data generated by social media is being
investigated for the solution of various problems. Various social
media platforms, including Twitter, allow users to share text and
image content, which can be used for situational awareness at any
time. The task of "WaterMM: Water Quality in Social Multimedia"
at MediaEval, 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], focuses on examining water safety,
quality, and security using social media data. The task aims to
assist with complaints regarding the quality and condition of
drinking water expressed through social media, helping water
utility and protection agencies better serve their communities at
large.
      </p>
    </sec>
    <sec id="sec-2">
      <title>LITERATURE REVIEW</title>
      <p>
        One research effort utilized BERT and several competing methods for the
representation of disaster-related tweets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The experiments
showed that BERT surpassed various embedding
methods, including GloVe [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and FastText [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Another research
effort applied BERT to two different datasets, in English and Italian,
focusing on noise avoidance and the handling of various
web-related noisy objects, including
emoticons, emojis, mentions, and hashtags [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Researchers
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have performed multi-label classification on disaster-related
tweets, producing state-of-the-art results on the
dataset using two variants of BERT. Another research framework
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has been proposed to investigate flooding situations. The
framework collects real-time image and text data and estimates
their relevance to flooding disasters. It
classifies tweets based on their text; if a tweet contains
an image, the image features are also considered to strengthen the
prediction.
      </p>
    </sec>
    <sec id="sec-3">
      <title>PROPOSED APPROACH</title>
      <p>
        The proposed method for WaterMM: Water Quality in Social
Multimedia at MediaEval, 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], uses a bilingual text-based
dataset containing tweets in either Italian or English. At the
next stage, images are added alongside the text to
perform binary classification of whether a tweet discusses
water quality-related issues.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Approach for text data</title>
      <p>
        For the first sub-task, only the text content is utilized to classify
tweets and predict whether a tweet discusses water quality. For the
processing of text extracted from tweets, both the tweet descriptions
and tags are considered for the binary classification task. Since the
dataset for "WaterMM: Water Quality in Social Multimedia" at
MediaEval 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] contains tweets in both the English and Italian languages, the
googletrans [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] library is utilized to translate each Italian tweet into English,
so that all the data is available in a single language
(English).
      </p>
      <p>After translating the train and test data, various pre-processing
steps are performed to clean the text content. First, Uniform
Resource Locators (URLs) are removed from the tweet descriptions.
Hash symbols and punctuation are also removed from each tweet in the
training and testing sets. The pre-processing step further removes
smileys and emoticons from the tweet text and converts the content
to lowercase. Finally, numbers and symbols are eliminated from the
data to make the content more meaningful. The description part of a
tweet also contains stop words, which carry little importance for
the binary classification task, so stop words are removed as well.</p>
      <p>
        It has been observed that the dataset of WaterMM: Water Quality
in Social Multimedia at MediaEval, 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is highly imbalanced.
The minority class contains 1140 tweets that discuss water quality,
while the majority class has 4248 tweets that do not, a
minority-to-majority ratio of almost 1:4. An oversampling
technique has been used to reduce the class imbalance: the minority
class is oversampled three times to decrease the gap between the
classes.
      </p>
      <p>Later, the Bidirectional Encoder Representations from
Transformers (BERT) model is trained using the train-set of the dataset.
Each training instance is created by combining the text and tags of a
tweet, which are then converted into tokens. The 'bert-base-uncased'
model is selected for processing; it lowercases the content before
converting it into tokens. The special tokens [CLS] and [SEP] are added
to delimit the dataset instances. The maximum length for a single
text-based instance is set to 256 tokens. The training set is split by
allocating 10% of the training data to a validation set and the
remaining 90% to training. Furthermore, the AdamW optimizer is used
with the learning rate set to 2e-5 and epsilon set to 1e-8. The model
is trained for three epochs. Finally, the trained model is used to
predict the 1920 test-set instances; the predictions, in the form of
0 or 1, are collected and stored in comma-separated format.</p>
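      <p>The instance construction and train/validation split described above can be sketched as follows. Whitespace splitting stands in for BERT's WordPiece tokenizer here, and the shuffle seed is an assumption; in practice the tokenizer of 'bert-base-uncased' performs the tokenization.</p>

```python
import random

MAX_TOKENS = 256

def build_instances(tweets, tags_list, labels, val_frac=0.10, seed=42):
    """Combine each tweet's text and tags into one instance, truncate to the
    256-token budget, wrap with [CLS]/[SEP], and split 90/10 into
    train/validation sets."""
    instances = []
    for text, tags, label in zip(tweets, tags_list, labels):
        tokens = (text + " " + " ".join(tags)).split()
        # Reserve 2 slots for the [CLS] and [SEP] special tokens.
        tokens = tokens[: MAX_TOKENS - 2]
        instances.append((["[CLS]"] + tokens + ["[SEP]"], label))
    rng = random.Random(seed)
    rng.shuffle(instances)
    n_val = int(len(instances) * val_frac)
    return instances[n_val:], instances[:n_val]  # train, validation
```

      <p>Each resulting token sequence is what would then be mapped to input IDs and fed to the model for fine-tuning.</p>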
    </sec>
    <sec id="sec-5">
      <title>Approach for text and image data</title>
      <p>
        The second sub-task utilizes the text as well as the images available
for the tweets. Very few tweets contain images: only 954 tweets from
the train-set and 245 tweets from the test-set. Due to this
insufficient quantity of images, oversampling has been performed for
both the minority and majority classes. The class of images showing
evidence of water quality has only 264 images. The minority class is
oversampled by creating five different augmented samples of each
image. For augmentation, the Python-based library "Augmentor" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
is utilized. Random samples are created by varying different
parameters, including rotation, zoom, and flip. Similarly, the majority
class is extended with two additional augmented copies of each of its
instances.
      </p>
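      <p>The effect of this augmentation on class balance can be checked with a small calculation, assuming the original images are kept alongside the augmented copies:</p>

```python
def augmented_counts(n_minority, n_majority, minority_copies=5, majority_copies=2):
    """Class sizes after augmentation: each minority image gains five
    augmented samples and each majority image gains two, with the
    originals retained."""
    return (n_minority * (1 + minority_copies),
            n_majority * (1 + majority_copies))
```

      <p>With 264 minority images and the remaining 690 majority images among the 954 image-bearing training tweets, this gives 1584 and 2070 images respectively, a far more balanced ratio than the original 264:690.</p>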
      <p>
        The increased quantity of images is utilized for the
classification by applying Visual Geometry Group (VGG16) model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
pretrained on ImageNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] dataset. The model is fine-tuned by
retraining its last four layers; the rest of the model is frozen
to preserve what was previously learned on the ImageNet dataset. The
learning rate is set to 1e-5 and the dropout value to 0.3. Since the
problem is binary classification, a sigmoid output is used. 20% of
the instances are allocated to the validation set, and the remaining
80% are utilized for training. The model is retrained for 25
epochs, and the trained model is saved for prediction on the test-set
images. The model predicts the test-set images, and the confidence
score for each image is stored.
      </p>
      <p>On the other hand, the prediction confidence for the text instances is
obtained from BERT, with predictions normalized between 0 and 1.
The predictions of the VGG16-based model are then combined with the
BERT-based predictions and averaged, and the sigmoid function is
applied to produce the final prediction. The approach is depicted in
Figure 1.</p>
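      <p>The late-fusion step can be sketched as follows. The decision threshold is an assumption (the paper does not state one); note that a sigmoid applied to an average of scores in [0, 1] always yields at least 0.5, so a threshold above 0.5 is needed for both classes to be predicted.</p>

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fuse(bert_conf: float, vgg_conf: float, threshold: float = 0.62) -> int:
    """Late fusion as described above: average the two normalized
    confidence scores, pass the mean through a sigmoid, and threshold.
    The threshold value 0.62 is a hypothetical choice for illustration."""
    score = sigmoid((bert_conf + vgg_conf) / 2.0)
    return 1 if score >= threshold else 0
```

      <p>For instance, two high confidences (0.9 each) give sigmoid(0.9) ≈ 0.71 and a positive label, while two low confidences (0.1 each) give sigmoid(0.1) ≈ 0.52 and a negative label.</p>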
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>The proposed method achieved a 31.67% F1-score for the first run,
which utilized the descriptions and tags of the tweets for its
prediction. The second run, which additionally utilized images for a
limited number of instances, achieved a 24.45% F1-score. The results
for the textual approach and the combination of visual and textual
approaches are summarized in Table 1. It has been observed that the
test set's score is lower than those of the train and validation sets.
One reason for the less effective results may be the presence of very
similar tweets in both classes: the tweet text contains various words,
such as "water" and "bottles", that appear in both classes and might
have confused the algorithms. The quantity of tweets containing
images is less than sufficient for deep learning models, due to which
Run 2 produced a lower evaluation score than Run 1. Moreover,
observation revealed that a few images assigned to the class showing
water quality do not visualize anything related to water. On the
other hand, images in the negative class, which does not discuss
water quality, also include water-related content such as water
bottles. This has confused the deep learning models in discriminating
between the two classes. Results may be improved by using multiple
deep learning based models for image classification. The text-based
classification method can also be improved by enriching the minority
class, where instead of simple over-sampling, synonyms may be used to
increase the number of instances.</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>This research has proposed a Bidirectional Encoder Representations
from Transformers (BERT) based approach for finding water
quality-related tweets. The method also utilizes the Visual Geometry
Group network (VGG16), pre-trained on the ImageNet dataset, to
binary-classify images based on whether they contain evidence of
water quality. The research can be enhanced by using the Places
dataset, which describes scene-based information. Furthermore,
advanced oversampling techniques may be used, such as
translation-based or synonym-based oversampling.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research work was funded by the Higher Education Commission
(HEC) Pakistan and the Ministry of Planning, Development and Reforms
under the National Center in Big Data and Cloud Computing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Stelios</given-names>
            <surname>Andreadis</surname>
          </string-name>
          , Ilias Gialampoukidis, Aristeidis Bozas, Anastasia Moumtzidou, Roberto Fiorin, Francesca Lombardo, Anastasios Karakostas, Daniele Norbiato, Stefanos Vrochidis, Michele Ferri, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>WaterMM: Water Quality in Social Multimedia Task at MediaEval 2021</article-title>
          .
          <source>In Proceedings of the MediaEval 2021 Workshop</source>
          , Online.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Marcus D</given-names>
            <surname>Bloice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christof</given-names>
            <surname>Stocker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Holzinger</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Augmentor: an image augmentation library for machine learning</article-title>
          .
          <source>arXiv preprint arXiv:1708.04680</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          (
          <year>2017</year>
          ),
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ashis</given-names>
            <surname>Kumar Chanda</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Efficacy of BERT embeddings on predicting disaster from Twitter data</article-title>
          .
          <source>arXiv preprint arXiv:2108.10698</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wei Dong, Richard Socher,
          <string-name>
            <given-names>Li-Jia</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          . In
          <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          . IEEE
          ,
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Suhun</given-names>
            <surname>Han</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>googletrans 3.0.0</article-title>
          . https://pypi.org/project/ googletrans/. (
          <year>2020</year>
          ).
          <source>Accessed: 2020-11-1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Anastasia</given-names>
            <surname>Moumtzidou</surname>
          </string-name>
          , Stelios Andreadis, Ilias Gialampoukidis, Anastasios Karakostas, Stefanos Vrochidis, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Flood relevance estimation from visual and textual content in social media streams</article-title>
          .
          <source>In Companion Proceedings of the The Web Conference</source>
          <year>2018</year>
          .
          <fpage>1621</fpage>
          -
          <lpage>1627</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          .
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Pota</surname>
          </string-name>
          , Mirko Ventura, Hamido Fujita, and
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Esposito</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>181</volume>
          (
          <year>2021</year>
          ),
          <fpage>115119</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Hamada M</given-names>
            <surname>Zahera</surname>
          </string-name>
          , Ibrahim A Elgendy, Rricha Jalota, Mohamed Ahmed Sherif, EM Voorhees, and
          <string-name>
            <given-names>A</given-names>
            <surname>Ellis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Fine-tuned BERT Model for Multi-Label Tweets Classification</article-title>
          .
          <source>In TREC. 1-7.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>