<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLP Techniques for Water Quality Analysis in Social Media Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Asif Ayub</string-name>
          <email>asifayub836@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khubaib Ahmad</string-name>
          <email>khubaibtakkar@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kashif Ahmad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nasir Ahmad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ala Al-Fuqaha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Systems Engineering, University of Engineering and Technology</institution>
          ,
          <addr-line>Peshawar</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University</institution>
          ,
          <addr-line>Qatar Foundation, Doha</addr-line>
          ,
          <country country="QA">Qatar</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents our contributions to the MediaEval 2021 task “WaterMM: Water Quality in Social Multimedia”. The task aims at analyzing social media posts relevant to water quality, with a particular focus on aspects such as water color, smell, taste, and related illnesses. To this end, a multimodal dataset containing both textual and visual information along with metadata is provided. Considering the quality and quantity of the available content, we mainly focus on textual information by employing three different models individually and jointly in a late-fusion manner. These models are (i) Bidirectional Encoder Representations from Transformers (BERT), (ii) the Robustly Optimized BERT Pre-training Approach (XLM-RoBERTa), and (iii) a custom Long Short-Term Memory (LSTM) model, which obtain overall F1-scores of 0.794, 0.717, and 0.663 on the official test set, respectively. In the fusion scheme, all the models are treated equally, and no significant improvement over the best-performing individual model is observed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        In recent years, social media has emerged as a valuable tool and
platform to discuss and convey concerns over different challenges and
daily life issues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The literature covers a diversified list of societal,
environmental, and technological topics, such as racism and hate
speech [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], public health [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], natural disasters and rehabilitation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
and technological conspiracies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], discussed in social media outlets.
More recently, there have been debates in social networks on
environmental issues, especially the quality of air and drinking water
in different parts of the world. The discussions generally revolve
around topics such as strange color, smell, bad taste, and diseases
caused by polluted water. This information could help in several
ways. For instance, it can serve as valuable feedback for public
authorities on the water distribution network. However, extracting
information from such informal sources is very challenging. It is
possible that social media posts containing water-quality-related
keywords do not represent discussions on polluted water. In this
regard, Machine Learning (ML) and Natural Language Processing
(NLP) techniques could be employed to automatically analyze and
filter out irrelevant posts. In order to explore the potential of ML
and NLP techniques in this challenging problem, a task named
“WaterMM: Water Quality in Social Multimedia” has been introduced
in the benchmark MediaEval 2021 competition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>This paper provides a detailed description of the methods
proposed by team CSE-Innoverts for the water quality analysis
task of MediaEval 2021. The dataset provided for the task
covers multimodal information including textual, visual, and
metadata. However, images are available for very few posts, and
the majority of the available images are not relevant. Thus, we
mainly focus on textual information by proposing four different
solutions, as detailed in Section 2.</p>
    </sec>
    <sec id="sec-2">
      <title>2 PROPOSED APPROACHES</title>
      <p>
        In total, we submitted four different runs by employing three
different Neural Network (NN) architectures, namely BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and LSTM, individually and jointly in a late fusion
scheme. Run 1 is based on late fusion, where we jointly
employ the models by aggregating the classification scores obtained
with the individual models. Figure 1 provides the block diagram
of the proposed methodology for Run 1. Run 2, Run 3, and Run 4
are based on the individual models, namely BERT, XLM-RoBERTa,
and LSTM, respectively. The details of the individual model-based
solutions are provided below.
      </p>
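      <p>As a rough illustration of the equal-weight late fusion used in Run 1, the following sketch averages the positive-class scores of the individual models and thresholds the result. The function name, the choice of a plain mean as the aggregation, the 0.5 threshold, and the example scores are illustrative assumptions rather than details of the submitted system.</p>
      <preformat><![CDATA[
```python
from statistics import fmean


def late_fusion(model_scores, threshold=0.5):
    """Average per-model positive-class scores and threshold them.

    model_scores: one list of scores per model, aligned by post.
    Returns (fused_scores, predicted_labels).
    """
    if not model_scores:
        raise ValueError("at least one model is required")
    n_posts = len(model_scores[0])
    if any(len(scores) != n_posts for scores in model_scores):
        raise ValueError("all models must score the same posts")
    # All models are treated equally: a plain mean across models.
    fused = [fmean(per_post) for per_post in zip(*model_scores)]
    # Assumed decision rule: label 1 when the fused score reaches the threshold.
    labels = [1 if score >= threshold else 0 for score in fused]
    return fused, labels


# Hypothetical scores for three posts (not taken from the experiments).
bert_scores = [0.9, 0.2, 0.6]
xlmr_scores = [0.8, 0.4, 0.3]
lstm_scores = [0.7, 0.3, 0.3]
fused, labels = late_fusion([bert_scores, xlmr_scores, lstm_scores])
```
]]></preformat>
      <p>In this toy example the third post is rejected although the first model scores it at 0.6, because the two other models pull the mean below the threshold.</p>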
      <list list-type="bullet">
        <list-item>
          <p>BERT-based Solution (Run 2): In this solution, we rely on a
pre-trained BERT model, which is fine-tuned on the development set
provided by the task organizers. Before fine-tuning the model, the
necessary pre-processing is performed, using TensorFlow libraries,
to bring the data into the form required for training. Since it is a
binary classification task, we use the binary cross-entropy loss
function with the Adaptive Moment Estimation (Adam) optimizer.</p>
        </list-item>
        <list-item>
          <p>XLM-RoBERTa-based Solution (Run 3): In this approach, we rely on
the multilingual pre-trained XLM-RoBERTa model, which is fine-tuned
on the development set. As a first step, the input text is tokenized
in the pre-processing phase. The pre-trained model is then fine-tuned
on the pre-processed data using the Adam optimizer with a binary
cross-entropy loss function.</p>
        </list-item>
        <list-item>
          <p>LSTM-based Solution (Run 4): In this approach, we rely on a
custom LSTM model composed of three layers: an input layer, an LSTM
layer, and an output layer. We used this model as a baseline for our
experiments. However, the model obtained encouraging results on the
development set and was thus included in the fusion scheme.</p>
        </list-item>
      </list>
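      <p>The transformer-based runs are trained with a binary cross-entropy loss. A minimal plain-Python sketch of that loss is given below to make the objective concrete. The function name and the clipping constant eps are illustrative assumptions; in practice the loss implementation shipped with the deep learning framework would be used.</p>
      <preformat><![CDATA[
```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical predictions: the same labels scored well and scored badly.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals ln(2)
```
]]></preformat>
      <p>The loss is small for confident correct predictions, large for confident wrong ones, and equal to ln 2 for a model that always outputs 0.5.</p>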
      <p>We also cleaned the data before feeding it into the models by
removing URLs, account handles, emojis, and unnecessary punctuation.</p>
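      <p>The cleaning step can be sketched with regular expressions as follows. The patterns, the function name, the emoji code-point ranges (which are not exhaustive), and the decision to keep sentence punctuation such as periods, commas, and question and exclamation marks are assumptions made for illustration, since the exact rules are not listed above.</p>
      <preformat><![CDATA[
```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
# Rough emoji coverage (pictographs, symbols, dingbats, flags); not exhaustive.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# Assumption: basic sentence punctuation is kept, everything else is dropped.
KEEP = ".,!?'"
DROP = "".join(c for c in string.punctuation if c not in KEEP)
PUNCT_RE = re.compile("[" + re.escape(DROP) + "]")


def clean_post(text):
    """Remove URLs, account handles, emojis, and unneeded punctuation."""
    text = URL_RE.sub(" ", text)      # URLs first, before punctuation is touched
    text = HANDLE_RE.sub(" ", text)   # @account mentions
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.split())     # collapse the whitespace left behind


# Hypothetical post used only to exercise the function.
raw = "@acqua_news Tap water smells odd today!! \U0001F4A7 https://example.com/x #water"
cleaned = clean_post(raw)
```
]]></preformat>
      <p>URLs are removed before punctuation so that the colon and slashes of a link do not leave fragments behind.</p>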
      <p>[Figure 1: Block diagram of the proposed methodology for Run 1 (Input Text, Models, Late Fusion, Predicted_Label).]</p>
      <p>Moreover, in all the proposed solutions, we used an up-sampling
technique to balance the dataset.</p>
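      <p>One common way to up-sample is to randomly duplicate samples of the smaller class until both classes have the same size. The sketch below assumes this random over-sampling with replacement; the function name, the fixed seed, and the toy data are hypothetical, and the specific up-sampling technique is not detailed above.</p>
      <preformat><![CDATA[
```python
import random


def upsample(texts, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    # Every class is grown to the size of the largest one.
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((text, label) for text in items + extra)
    rng.shuffle(balanced)
    return [t for t, _ in balanced], [y for _, y in balanced]


# Toy data: four irrelevant posts (0) and one relevant post (1).
texts = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
bal_texts, bal_labels = upsample(texts, labels)
```
]]></preformat>
      <p>Up-sampling of this kind is normally applied to the training split only, so that duplicated posts do not leak into the validation data.</p>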
    </sec>
    <sec id="sec-3">
      <title>3 RESULTS AND ANALYSIS</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Evaluation Metric</title>
      <p>For the evaluation of the proposed methods, we used four different
metrics, namely (i) accuracy, (ii) micro precision, (iii) micro recall,
and (iv) micro F1-score. Precision, recall, and F1-score are the
official metrics, while accuracy is used as an additional metric
for the evaluation of the methods on the development set.</p>
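      <p>For reference, the four metrics can be computed from the predicted and true labels as sketched below. Micro-averaging pools the true positives, false positives, and false negatives over a set of classes before computing the ratios. The function names and the classes parameter are assumptions; which classes the official scorer pools over is not stated here.</p>
      <preformat><![CDATA[
```python
def micro_scores(y_true, y_pred, classes=(1,)):
    """Pool TP/FP/FN over `classes`, then compute precision, recall, and F1."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for six posts.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
positive_only = micro_scores(y_true, y_pred, classes=(1,))
both_classes = micro_scores(y_true, y_pred, classes=(0, 1))
acc = accuracy(y_true, y_pred)
```
]]></preformat>
      <p>When every post has exactly one label and the counts are pooled over both classes, micro precision, micro recall, and micro F1 all coincide with accuracy; restricting the pooling to the positive class gives values that can differ from accuracy.</p>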
    </sec>
    <sec id="sec-5">
      <title>3.2 Experimental Results on the Development Set</title>
      <p>Table 1 provides the experimental results of our proposed solutions
on the development set. To this end, a separate validation set
composed of 1,810 samples is used. Run 1 represents our fusion-based
solution, while Run 2, Run 3, and Run 4 represent our solutions
based on the individual models, namely BERT, XLM-RoBERTa, and LSTM,
respectively. On the development set, the best overall results are
obtained with the BERT-based solution, which achieves an F1-score
and accuracy of 0.950 and 0.929, respectively. The lowest
F1-score and accuracy are observed for XLM-RoBERTa.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Experimental Results on the Test Set</title>
      <p>Table 2 provides the official results on the test set in terms of
precision, recall, and F1-score. Among the individual model-based
solutions, the best results are obtained with BERT, while the lowest
scores are observed for the LSTM-based solution. Interestingly,
the fusion-based solution brings no significant improvement
over the best-performing individual model-based solution. One
possible reason is the influence of the low-performing models, since
all the models are treated equally by simply aggregating the obtained
posterior probabilities. This limitation could be addressed by using
merit-based fusion, where weights are assigned to the contributing
models based on their performance.</p>
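      <p>A merit-based fusion of this kind could, for instance, weight each model in proportion to a score it obtained on held-out validation data. The sketch below shows one such scheme under that assumption. The function name, the proportional weighting rule, the 0.5 threshold, and all numbers are hypothetical and do not describe an implemented or evaluated system.</p>
      <preformat><![CDATA[
```python
def merit_weighted_fusion(model_scores, merits, threshold=0.5):
    """Fuse per-model scores with weights proportional to each model's merit.

    model_scores: one list of positive-class scores per model, aligned by post.
    merits: one non-negative merit value per model (e.g. a validation score).
    """
    if len(model_scores) != len(merits):
        raise ValueError("one merit value is needed per model")
    total = sum(merits)
    if total <= 0:
        raise ValueError("merits must sum to a positive value")
    weights = [m / total for m in merits]  # normalized so the weights sum to 1
    fused = [
        sum(w * s for w, s in zip(weights, per_post))
        for per_post in zip(*model_scores)
    ]
    labels = [1 if score >= threshold else 0 for score in fused]
    return fused, labels


# Hypothetical scores for two posts and hypothetical validation merits.
strong_model = [0.70, 0.20]
weak_model_a = [0.35, 0.60]
weak_model_b = [0.35, 0.60]
fused, labels = merit_weighted_fusion(
    [strong_model, weak_model_a, weak_model_b], merits=[0.9, 0.6, 0.5]
)
```
]]></preformat>
      <p>For the first toy post, an unweighted mean of the three scores is about 0.47 and would be rejected at a 0.5 threshold, whereas the weighted score is about 0.51 because the stronger model counts for more.</p>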
    </sec>
    <sec id="sec-7">
      <title>4 CONCLUSIONS AND FUTURE WORK</title>
      <p>The quantity and quality of the images associated with the social
media posts were not sufficient to contribute to the task. Thus,
we focused on the textual information only by employing several
NN-based solutions. In total, we proposed four different solutions: a
fusion-based solution and three individual model-based solutions. In
the current implementation, we used a simple fusion mechanism that
aggregates the posterior probabilities obtained with each individual
model.</p>
      <p>In the future, we aim to employ more sophisticated fusion schemes
by assigning merit-based weights to the contributing models. We
also aim to make use of the additional information available in the
form of metadata in our future fusion-based solutions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kashif</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , Konstantin Pogorelov, Michael Riegler, Nicola Conci, and
          <string-name>
            <given-names>Pal</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Social media and satellites: Disaster event detection, linking and summarization</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>78</volume>
          ,
          <issue>3</issue>
          (
          <year>2019</year>
          ),
          <fpage>2837</fpage>
          -
          <lpage>2875</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Stelios</given-names>
            <surname>Andreadis</surname>
          </string-name>
          , Ilias Gialampoukidis, Aristeidis Bozas, Anastasia Moumtzidou, Roberto Fiorin, Francesca Lombardo, Anastasios Karakostas, Daniele Norbiato, Stefanos Vrochidis, Michele Ferri, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>WaterMM: Water Quality in Social Multimedia Task at MediaEval 2021</article-title>
          .
          <source>In Proceedings of the MediaEval 2021 Workshop</source>
          , Online.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Abdullah</given-names>
            <surname>Hamid</surname>
          </string-name>
          , Nasrullah Shiekh, Naina Said, Kashif Ahmad, Asma Gul, Laiq Hassan, and
          <string-name>
            <given-names>Ala</given-names>
            <surname>Al-Fuqaha</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Fake news detection in social media using graph neural networks and NLP Techniques: A COVID-19 use-case</article-title>
          .
          <source>arXiv preprint arXiv:2012.07517</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          .
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ariadna</given-names>
            <surname>Matamoros-Fernández</surname>
          </string-name>
          and
          <string-name>
            <given-names>Johan</given-names>
            <surname>Farkas</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Racism, Hate Speech, and Social Media: A Systematic Review and Critique</article-title>
          .
          <source>Television &amp; New Media</source>
          <volume>22</volume>
          ,
          <issue>2</issue>
          (
          <year>2021</year>
          ),
          <fpage>205</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Salman Bin</given-names>
            <surname>Naeem</surname>
          </string-name>
          , Rubina Bhatti, and
          <string-name>
            <given-names>Aqsa</given-names>
            <surname>Khan</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>An exploration of how fake news is taking over social media and putting public health at risk</article-title>
          .
          <source>Health Information &amp; Libraries Journal</source>
          <volume>38</volume>
          ,
          <issue>2</issue>
          (
          <year>2021</year>
          ),
          <fpage>143</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Naina</given-names>
            <surname>Said</surname>
          </string-name>
          , Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Laiq Hassan, Nasir Ahmad, and
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Conci</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Natural disasters detection in social media and satellite imagery: a survey</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>78</volume>
          ,
          <issue>22</issue>
          (
          <year>2019</year>
          ),
          <fpage>31267</fpage>
          -
          <lpage>31302</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>