=Paper= {{Paper |id=Vol-3181/paper29 |storemode=property |title=Deep Learning Based Framework for Classification of Water Quality in Social Media Data |pdfUrl=https://ceur-ws.org/Vol-3181/paper29.pdf |volume=Vol-3181 |authors=Muhammad Hanif,Ammar Khawer,Muhammad Atif Tahir,Muhammad Rafi |dblpUrl=https://dblp.org/rec/conf/mediaeval/HanifKTR21 }} ==Deep Learning Based Framework for Classification of Water Quality in Social Media Data== https://ceur-ws.org/Vol-3181/paper29.pdf
      Deep Learning Based Framework for Classification of Water
                    Quality in Social Media Data
                      Muhammad Hanif, Ammar Khawer, Muhammad Atif Tahir, Muhammad Rafi
                           National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
                                      {hanif.soomro,k201414,atif.tahir,muhammad.rafi}@nu.edu.pk

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
This paper describes the method proposed by team FAST-NU-DS for the WaterMM: Water Quality in
Social Multimedia task at MediaEval 2021. The task aims to analyze water security, safety, and
quality, and to build a classifier that determines whether a tweet discusses water quality issues.
The task provides a dataset of tweets containing the tweet's text and metadata; a few tweets also
contain images. The proposed method applies pre-processing steps to the text and tags of the
dataset and then applies Bidirectional Encoder Representations from Transformers (BERT). For the
binary classification of images, the method applies Visual Geometry Group (VGG16) pre-trained on
the ImageNet dataset. The proposed method achieved a 0.31 F1 score for text-only content, while
the combination of text and images yielded a 0.24 F1 score.

1    INTRODUCTION
The enormous amount of data generated by social media is being investigated to solve various
problems. Social media platforms, including Twitter, allow users to share text and image content,
which can be used for situational awareness at any time. The "WaterMM: Water Quality in Social
Multimedia" task at MediaEval 2021 [1] focuses on examining water safety, quality, and security
using social media data. The task aims to assist with complaints about the quality and condition
of drinking water raised through social media, which will help water utilities and protection
agencies to better serve communities at large.

2    LITERATURE REVIEW
One research effort utilized BERT and various competitors for the representation of
disaster-related tweets [4]. It showed experimentally that BERT surpasses various embedding
methods, including GloVe [8] and FastText [3]. Another research effort took two different
datasets, in English and Italian, and applied BERT. That work focused on avoiding noise and
handling various noisy web-related objects, including emoticons, emojis, mentions, hashtags, and
so on [9]. Researchers [11] have performed multi-label classification on disaster-based tweets;
their method produced state-of-the-art results on the dataset by using two variants of BERT.
Another research framework [7] has been proposed to investigate flooding situations. The
framework collects real-time image and text data and determines its relevance or irrelevance to
flooding disasters. The framework classifies tweets based on their text and checks whether the
tweet contains an image; if so, image features are also considered to make a stronger prediction.

3    PROPOSED APPROACH
The proposed method for WaterMM: Water Quality in Social Multimedia at MediaEval 2021 [1] utilizes
a bilingual text dataset containing tweets in either Italian or English. At the next stage, images
are added along with the text to perform binary classification of whether or not a tweet discusses
water quality-related issues.

Figure 1: Data flow diagram showing processing of visual and text data

3.1    Approach for text data
For the first sub-task, only text content is utilized to classify tweets and predict whether a
tweet discusses water quality. For the processing of text extracted from tweets, the description
and tags of each tweet are considered for the binary classification task. The dataset for
"WaterMM: Water Quality in Social Multimedia" at MediaEval 2021 [1] contains tweets in English and
in Italian. Therefore, the googletrans [6] library is utilized to translate each Italian tweet
into English, so that all the data is available in one language (English).
   After translation of the train and test data, various pre-processing steps are performed to
clean the text. First, Uniform Resource Locators (URLs) are removed from the tweet descriptions.
Hash symbols and punctuation are also removed from each tweet of the training and testing sets.
The pre-processing step also removes smileys and emoticons from the text and converts the content
to lowercase. Finally, numbers and symbols are eliminated from the data to make the content more
meaningful. The tweet descriptions also contain stop words, which carry little importance for the
binary classification task, so stop words are removed as well.
   It has been observed that the dataset of WaterMM: Water Quality in Social Multimedia at
MediaEval 2021 [1] is highly imbalanced. The minority class contains 1140 tweets, which discuss
water quality, whereas the majority class has 4248 tweets, which do not. The minority-to-majority
class ratio is therefore almost 1:4. Oversampling has been used to reduce the class imbalance: the
minority class is oversampled three times to decrease the imbalance between classes.
   Later, Bidirectional Encoder Representations from Transformers (BERT) is trained on the
train-set of the dataset. Each instance of the training set is created by combining the text and
tags of the tweet, which are then converted into tokens. The ’bert-base-uncased’ model is selected
for the processing, which lowercased the
contents and then converted them into tokens. For further processing, the [CLS] and [SEP] special
tokens are added to delimit each instance. The maximum length of a single text instance is set to
256 tokens. The training set is divided into train and validation sets by allocating 10% of the
training data to validation and the remaining 90% to training. The AdamW optimizer is used with
the learning rate set to 2e-5 and epsilon set to 1e-8. The model is trained for three epochs.
Finally, the trained model is used to predict the 1920 test set instances. The predictions, in the
form of 0 or 1, are collected and stored in a comma-separated format for the test set.

3.2    Approach for text and image data
The second sub-task utilizes text as well as the images available for the tweets. However, very
few tweets contain images: only 954 tweets from the train-set and 245 tweets from the test-set.
Due to the insufficient quantity of images, oversampling has been performed for both the minority
and majority classes. The image class representing water quality has only 264 images. The minority
class is oversampled by creating five different augmented samples of each image. For augmentation,
the Python library "Augmentor" [2] is utilized. The random samples are created by varying
different parameters, including rotate, zoom, and flip. Similarly, two additional augmented copies
of each image are added to the majority class.
   The enlarged image set is used for classification by applying the Visual Geometry Group (VGG16)
model [10], pre-trained on the ImageNet [5] dataset. The model is fine-tuned by retraining its
last four layers; the rest of the model is frozen to retain what was learned on ImageNet. The
learning rate is set to 1e-5 and the dropout value to 0.3. The sigmoid function is used, as the
problem is one of binary classification. 20% of the instances are allocated to the validation set,
and the remaining 80% are used for training. The model is retrained for 25 epochs, and the trained
model is saved for prediction on test-set images. The model predicts the test set images, and the
confidence score for each image is stored.
   On the other hand, the prediction confidence for text instances is retrieved using BERT, and
the predictions are normalized between 0 and 1. The prediction obtained from the VGG16-based model
is then combined with the BERT-based prediction, and the average is calculated. Then, the sigmoid
function is applied for the final prediction. The approach is depicted in Figure 1.

Table 1: Results of Proposed method on Test Data

    Run#     Method        F1-Score
    Run 1    BERT          31.67%
    Run 3    BERT+VGG16    24.45%

4    RESULTS AND ANALYSIS
The proposed method achieved a 31.67% F1-score for the first run, which utilized the descriptions
and tags of the tweets for its prediction. The second run, however, achieved a 24.45% F1-score; it
utilized the descriptions and tags of tweets, plus images for a limited number of instances. The
results for the text-only run and for the combination of visual and textual content are summarized
in Table 1. It has been observed that the test set score is lower than the scores on the train and
validation sets. One reason for the less effective results may be that very similar tweets occur
in both classes: the tweet text in both classes contains similar words, such as water and bottles,
which might have confused the algorithms. The number of tweets containing images is insufficient
for deep learning models, which is why the second run produced a lower evaluation score than
Run 1. Moreover, observations revealed that a few of the images labeled as showing water quality
do not depict anything related to water. Conversely, images in the negative class, which does not
discuss water quality, also include water-related content such as water bottles. This has made it
difficult for the deep learning models to discriminate between the two classes. Results may be
improved by using multiple deep learning based models for image classification. The text-based
classification method can also be improved by enlarging the minority class with synonym-based
instances instead of simple over-sampling.

5    DISCUSSION AND OUTLOOK
The research has proposed a Bidirectional Encoder Representations from Transformers (BERT)
approach for finding water-quality related tweets. The method has also utilized Visual Geometry
Group (VGG16), pre-trained on the ImageNet dataset, for binary classification of the images based
on whether they contain evidence of water quality. The research can be enhanced by using the
Places dataset, which describes scene-based information. Furthermore, advanced oversampling
techniques, such as translation-based or synonym-based oversampling, may be used.
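
The text cleaning steps listed in Section 3.1 (URL removal, hash symbol and punctuation removal, emoticon removal, lowercasing, removal of numbers and symbols, stop-word removal) could be sketched roughly as follows. This is only an illustration: the paper does not publish its code, so the regular expressions, the function name `clean_tweet`, and the tiny stop-word list are all assumptions.

```python
import re
import string

# Hypothetical, deliberately small stop-word list; the paper does not say which list was used.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "this", "my", "from"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Simple ASCII emoticons such as :) ;-( :D (assumed pattern, not exhaustive).
EMOTICON_RE = re.compile(r"(?<!\w)[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps of Section 3.1 to one (already translated) tweet."""
    text = URL_RE.sub(" ", text)            # 1. remove URLs
    text = EMOTICON_RE.sub(" ", text)       # 2. remove ASCII emoticons before punctuation is stripped
    text = text.translate(PUNCT_TABLE)      # 3. remove '#' and other ASCII punctuation
    text = text.lower()                     # 4. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # 5. drop digits, emojis and remaining symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # 6. remove stop words
    return " ".join(tokens)
```

Emoticons are removed before punctuation here, since stripping punctuation first would leave nothing for the emoticon pattern to match. For instance, `clean_tweet("The tap water in #Venice is BROWN :( 2 days now! https://t.co/abc123")` returns `"tap water venice brown days now"`.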
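
The late-fusion step at the end of Section 3.2 (average the BERT and VGG16 confidences, then apply the sigmoid) might look like the sketch below. Two points are assumptions rather than statements from the paper: tweets without an image fall back to the text confidence alone, and the decision threshold `THRESHOLD` is hypothetical. Because the averaged score lies in [0, 1], its sigmoid lies between 0.5 and about 0.731, so a cut-off of exactly 0.5 on the sigmoid output would label every tweet positive; the sketch therefore uses sigmoid(0.5) as the cut-off.

```python
import math
from typing import Optional


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical cut-off: equivalent to thresholding the averaged confidence at 0.5.
THRESHOLD = sigmoid(0.5)


def fuse(text_conf: float, image_conf: Optional[float] = None) -> int:
    """Combine a BERT confidence with an optional VGG16 confidence, both in [0, 1]."""
    if image_conf is None:
        score = text_conf                        # assumed fallback for tweets without an image
    else:
        score = (text_conf + image_conf) / 2.0   # average of the two model confidences
    return int(sigmoid(score) >= THRESHOLD)      # sigmoid, then binary decision
```

Since the sigmoid is monotonic, this decision is the same as comparing the averaged confidence with 0.5 directly; the sigmoid is kept only to mirror the order of operations described in the text.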

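
The effect of the text oversampling described in Section 3.1 on the class balance can be checked with a few lines of arithmetic. The sketch assumes that "oversampled three times" means each minority tweet appears three times in total; if it instead means three extra copies, the minority count would be 4560. The function name and the placeholder tweets are hypothetical.

```python
import random


def oversample_minority(minority, majority, factor=3, seed=0):
    """Return a shuffled training list in which every minority item appears `factor` times."""
    combined = list(minority) * factor + list(majority)
    random.Random(seed).shuffle(combined)
    return combined


# Class sizes reported in Section 3.1; the tweet texts are placeholders.
minority = [("water-quality tweet", 1)] * 1140
majority = [("other tweet", 0)] * 4248

train = oversample_minority(minority, majority)
positives = sum(label for _, label in train)
negatives = len(train) - positives
print(positives, negatives)  # 3420 4248
```

Under this reading the minority-to-majority ratio moves from roughly 1:3.7 to roughly 1:1.24.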

ACKNOWLEDGMENTS
This research work was funded by Higher Education Commission
(HEC) Pakistan and Ministry of Planning Development and Reforms
under the National Center in Big Data and Cloud Computing.

REFERENCES
 [1] Stelios Andreadis, Ilias Gialampoukidis, Aristeidis Bozas, Anastasia Moumtzidou,
     Roberto Fiorin, Francesca Lombardo, Anastasios Karakostas, Daniele Norbiato,
     Stefanos Vrochidis, Michele Ferri, and Ioannis Kompatsiaris. 2021. WaterMM: Water
     Quality in Social Multimedia Task at MediaEval 2021. In Proceedings of the
     MediaEval 2021 Workshop, Online.
 [2] Marcus D Bloice, Christof Stocker, and Andreas Holzinger. 2017. Augmentor: an image
     augmentation library for machine learning. arXiv preprint arXiv:1708.04680 (2017).
 [3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching
     word vectors with subword information. Transactions of the Association for
     Computational Linguistics 5 (2017), 135–146.
 [4] Ashis Kumar Chanda. 2021. Efficacy of BERT embeddings on predicting disaster from
     Twitter data. arXiv preprint arXiv:2108.10698 (2021).
 [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009.
     ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on
     computer vision and pattern recognition. IEEE, 248–255.
 [6] Suhun Han. 2020. googletrans 3.0.0. https://pypi.org/project/googletrans/. (2020).
     Accessed: 2020-11-1.
 [7] Anastasia Moumtzidou, Stelios Andreadis, Ilias Gialampoukidis, Anastasios
     Karakostas, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. Flood relevance
     estimation from visual and textual content in social media streams. In Companion
     Proceedings of the The Web Conference 2018. 1621–1627.
 [8] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global
     vectors for word representation. In Proceedings of the 2014 conference on empirical
     methods in natural language processing (EMNLP). 1532–1543.
 [9] Marco Pota, Mirko Ventura, Hamido Fujita, and Massimo Esposito. 2021. Multilingual
     evaluation of pre-processing for BERT-based sentiment analysis of tweets. Expert
     Systems with Applications 181 (2021), 115119.
[10] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for
     large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[11] Hamada M Zahera, Ibrahim A Elgendy, Rricha Jalota, Mohamed Ahmed Sherif,
     EM Voorhees, and A Ellis. 2019. Fine-tuned BERT Model for Multi-Label Tweets
     Classification. In TREC. 1–7.