<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KCE DALab-APDA@FIRE2019: Author Profiling and Deception Detection in Arabic using Weighted Embedding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sharmila Devi V</string-name>
          <email>sharmiladevi1002@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kannimuthu S</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ravikumar G</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <email>anandkumar@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering</institution>
          ,
          <addr-line>CIET, Coimbatore</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Technology, Karpagam College of Engineering</institution>
          ,
          <addr-line>Coimbatore</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Information Technology, National Institute of Technology Karnataka</institution>
          ,
          <addr-line>Surathkal</addr-line>
          ,
          <country>India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper explains the work submitted to the Author Profiling and Deception Detection in Arabic Tweets shared task organized at the Forum for Information Retrieval Evaluation (FIRE) 2019. The first task, author profiling, involves identifying the categories of authors based on their Arabic tweets. In the second task, the aim is to detect deception in Arabic for two genres, Twitter and news. Deception detection means automatically identifying false messages in text content on social networks or in news. For each task, we submitted three different systems. For submission 1, we used Term Frequency and Inverse Document Frequency (TFIDF) features with Support Vector Machine (SVM) classification, and for submission 2, we used the fastText classifier. For submission 3, we proposed a low-dimensional weighted document embedding (TFIDF + word embedding) with SVM classification. We attained second place in deception detection and third place in author profiling. The performance difference between the top team's results and the submitted runs is only 3.34% for author profiling and 1.16% for deception detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Author profiling</kwd>
        <kwd>Deception detection</kwd>
        <kwd>Arabic tweets</kwd>
        <kwd>Machine learning</kwd>
        <kwd>TFIDF</kwd>
        <kwd>Word embeddings</kwd>
        <kwd>fastText classifier</kwd>
        <kwd>Weighted document embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In our busy day-to-day lives, social media, a computer-based technology, plays a
major role in sharing information, ideas and thoughts from one person to another.
Many people send their personal messages, documents, videos and
photos through social media networks such as Twitter, Facebook and WhatsApp.
Author profiling is the method of analysing the demographic features of an
author, such as age, gender and language variety. Applications
of author profiling include forensics, security and marketing. For example, in
marketing, it is useful to find which customer profiles like or dislike
a product. This analysis helps companies achieve better market segmentation.
From a forensic viewpoint, it is important to find out the profile of the person
who wrote a suspicious text. Deception detection is the method of analysing
whether a given message is a lie or the truth. The rest of the paper is organized as
follows. Section 2 surveys the literature on author profiling
and deception detection in various languages. Section 3 describes the dataset
and its statistics. Section 4 explains the methodology and
Section 5 discusses the results obtained. Section 6 concludes the paper with
limitations and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        The peculiarities of the Arabic dialectal varieties used in social media are
described, and an annotation framework is proposed, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The Arabic Author Profiling for Cyber-Security project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on whether the author of a suspicious message is a potential threat.
A framework for improving deception detection accuracy for the veracity of
online digital news is proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Bayesian classification and the K-means clustering algorithm are used to
detect deception in Twitter profile characteristics and to analyze user behavior [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Various feature extraction methods are proposed for deception detection from
Arabic Twitter posts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The SVM with trigrams outperformed the other
classifiers, with an accuracy of 91.55%. Arabic word correction to manipulate the vulnerability is
explained in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The authors achieved an accuracy of 96.5% in detecting abusive Arabic
tweets.
      </p>
      <p>
        An author profiling system for Urdu is proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], using word- and
character-based term frequency and TFIDF features with a support vector machine classifier.
Weighted embeddings based on a novel median-based loss function are explained in
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], with experimental results on Wikipedia and Twitter data. Variations
of the doc2vec embedding, evaluated on a new task using TripAdvisor
reviews and on the CQADupStack benchmark, are proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Word
mover's embedding, which enables unsupervised document embedding from
pre-trained word embeddings, is proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Identification of the age and gender
of blog authors is proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where experiments with information retrieval
features yielded the best predictions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Description</title>
      <p>
        The dataset for Arabic author profiling is given as five different categories, where
each consists of three natives. The details of the nativity are given in the overview
of the shared task [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The dataset consists of three age groups (Under 25, Between 25
and 34, and Above 35) and two genders (male and female) in all the categories.
The primary difference between the given deception and profiling datasets is in the
representation. In author profiling, each XML file, which consists of 100 tweets,
needs to be labeled with gender, age group and language variety. In deception
detection, by contrast, each tweet should be identified as truth or lie. Two different
domains, News and Tweets, are given for deception detection. We have
submitted 6 runs for deception detection.
      </p>
      <p>All five training datasets for author profiling and deception detection are
completely balanced, and the number of documents in the different classes is given
in Tables 1 and 2.</p>
      <p>In total, we submitted three methods, based on TFIDF features
with an SVM classifier, word bi-grams with the fastText classifier, and TFIDF-weighted
document embeddings. We submitted 21 runs for Arabic author profiling
and deception detection. For deception detection, we tried the
same approaches followed for the Arabic author profiling task. The three methods
are explained below.</p>
      <p>
        Submission-1: The first run is based on the conventional method, where we
used word and character n-gram features with an SVM classifier [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Word
uni-grams and character bi-grams, trigrams and four-grams are considered as
features. Out of all features, we considered a maximum of 5000 features for
words and 5000 for characters. These feature values are weighted with TFIDF
values. The final feature matrix is given to a linear SVM for classification. The
SVM uses an L2-norm penalty with C = 1, and multi-class classification uses
one-versus-rest. We followed the same method for Arabic author profiling
and deception detection.
      </p>
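      <p>The feature construction described for Submission-1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the submitted system: the helper names (word_unigrams, char_ngrams, tfidf_matrix), the whitespace tokenizer, the smoothed idf formula and the frequency-based selection used to apply the feature cap are all assumptions, and the toy documents are invented. The linear SVM step is omitted.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def word_unigrams(text):
    """Split on whitespace; each token is one word uni-gram feature."""
    return text.split()


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max (bi-, tri-, four-grams)."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def tfidf_matrix(docs, analyzer, max_features=5000):
    """Build a TFIDF matrix (one row per document) over the most frequent features."""
    counts = [Counter(analyzer(doc)) for doc in docs]
    # Keep at most max_features features, ranked by total corpus frequency
    # (an assumed selection rule; the paper only states the cap of 5000).
    totals = Counter()
    for c in counts:
        totals.update(c)
    vocab = sorted(feat for feat, _ in totals.most_common(max_features))
    n_docs = len(docs)
    # Smoothed inverse document frequency (an assumed variant).
    idf = {}
    for feat in vocab:
        df = sum(1 for c in counts if feat in c)
        idf[feat] = math.log((1 + n_docs) / (1 + df)) + 1.0
    rows = [[c[feat] * idf[feat] for feat in vocab] for c in counts]
    return vocab, rows


# Invented toy documents standing in for the tweet collections.
docs = ["good news today", "bad news today", "good day"]
word_vocab, word_rows = tfidf_matrix(docs, word_unigrams)
char_vocab, char_rows = tfidf_matrix(docs, char_ngrams)
# Concatenate the word block and the character block into one feature matrix.
features = [w + c for w, c in zip(word_rows, char_rows)]
```
]]></preformat>
      <p>Each row of features would then be passed to the linear SVM, with up to 5000 word columns followed by up to 5000 character n-gram columns.</p>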
      <p>
        Submission-2: In the second run, we used the well-known fastText
embedding and classifier [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for profiling the Arabic authors and identifying
deception. The fastText classifier is suited to sentence classification
tasks, so for deception detection we used the fastText classifier as such. In
the author profiling task, however, the input is an XML file. Fortunately, all the
training and testing XML files consist of an equal number (100) of tweets. So we
converted the input into individual tweets and trained the model as a sentence
classification task. After tagging the tweets during testing, we counted the labels
of each XML file and selected the most frequent label as the label for the corresponding
XML file. The main drawback of this approach is that the cross-validation
results are difficult to infer. The parameters of fastText are fixed as follows: word bi-grams, learning
rate lr=0.25 and 40 epochs. We used softmax as the loss function.
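The per-file voting step of Submission-2 can be sketched as follows. This is a minimal illustration rather than the submitted code: the function name file_label, the file names and the per-tweet predictions are hypothetical, and the paper does not specify how ties are broken.
        <preformat><![CDATA[
```python
from collections import Counter


def file_label(tweet_labels):
    """Return the label predicted most often among the tweets of one XML file."""
    # most_common(1) yields the (label, count) pair with the highest count.
    # Tie handling is left to Counter; the paper does not specify it.
    return Counter(tweet_labels).most_common(1)[0][0]


# Hypothetical per-tweet gender predictions for two author files.
predictions = {
    "author_a.xml": ["male", "female", "male", "male"],
    "author_b.xml": ["female", "female", "male"],
}
file_labels = {name: file_label(labels) for name, labels in predictions.items()}
```
]]></preformat>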
      </p>
      <p>
        Submission-3: We developed a weighted word embedding model for
the third submission. Here, we used Arabic word vectors pre-trained
on Arabic tweets and web pages [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The complete architecture of the model
is shown in Figure 1. In the case of author profiling, word unigram
features are first vectorized using a conventional TFIDF vectorizer. The number of
features is limited to 5000, so every XML document in the training data is
represented as a TFIDF vector over at most 5000 unique
words. The existing skip-gram based Arabic pre-trained
vectors [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] of size 300 are used to create the embedding matrix for the unique
words. The words which are not present in the pre-trained vectors are considered
unknown words; for these words, the embeddings are generated randomly from
the word vectors. Finally, we take the dot product between the TFIDF matrix
and the embedding matrix, which transforms each document into a
low-dimensional document vector. This set of vectors is considered the TFIDF-weighted
document embeddings, which are then used to train an SVM.
      </p>
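      <p>The TFIDF-weighted document embedding of Submission-3 amounts to multiplying a documents-by-vocabulary TFIDF matrix with a vocabulary-by-dimension embedding matrix. The sketch below illustrates this with toy values and should be read as an assumption-laden example, not the submitted implementation: the function names, the toy vocabulary and vectors, and the uniform sampling used for unknown words are all hypothetical (the paper states only that unknown-word embeddings are generated randomly).</p>
      <preformat><![CDATA[
```python
import random


def embedding_matrix(vocab, pretrained, dim, seed=0):
    """Stack one vector per vocabulary word, in vocabulary order."""
    rng = random.Random(seed)
    rows = []
    for word in vocab:
        if word in pretrained:
            rows.append(list(pretrained[word]))
        else:
            # Unknown word: draw a random vector (the sampling scheme is assumed).
            rows.append([rng.uniform(-0.25, 0.25) for _ in range(dim)])
    return rows


def weighted_document_embeddings(tfidf, embeddings):
    """Dot product of a (docs x vocab) TFIDF matrix and a (vocab x dim) matrix."""
    dim = len(embeddings[0])
    return [
        [sum(w * vec[j] for w, vec in zip(row, embeddings)) for j in range(dim)]
        for row in tfidf
    ]


# Toy data: 3-word vocabulary and 2-dimensional vectors
# (the paper uses up to 5000 words and 300 dimensions).
vocab = ["news", "today", "rare"]
pretrained = {"news": [1.0, 0.0], "today": [0.0, 1.0]}
tfidf = [[0.5, 0.5, 0.0], [2.0, 0.0, 0.0]]

E = embedding_matrix(vocab, pretrained, dim=2)
doc_vectors = weighted_document_embeddings(tfidf, E)
```
]]></preformat>
      <p>With the sizes reported in the paper, each document is reduced from a 5000-dimensional TFIDF vector to a 300-dimensional vector before SVM training.</p>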
      <p>For author profiling in Arabic tweets, the accuracy is 0.7667 for gender, 0.5722 for age and
0.9694 for language variety. The performance was also evaluated jointly, where the accuracy
gained is 0.4222. The top accuracy gained for deception detection is
0.7331 for news and 0.8541 for Twitter; the average accuracy
obtained is 0.7887.</p>
      <p>In this paper, we describe our work on the identification of age, gender and
language variety in author profiling and on deception detection in Arabic (APDA).
Using the given training dataset, we developed three systems:
TFIDF features with SVM, the fastText
classifier, and weighted word embeddings with SVM. Compared with the
traditional model, the weighted embeddings, from which the most was expected, attained lower
accuracy. The main reason for the lower accuracy is that certain words in the given
dataset are not present in the pre-trained model. Even though we used a
pre-trained model of Arabic tweets, around 30% of the words in
the training data are unknown. This can be resolved with recent character-specific word
embeddings. Even with this 30% information loss, the proposed
low-dimensional document embedding attained decent accuracy
on author profiling. In the future, this can be enhanced with character-specific embeddings and
by retraining the pre-trained models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zaghouani</surname>
            , Wajdi, and
            <given-names>Anis</given-names>
          </string-name>
          <string-name>
            <surname>Charfi</surname>
          </string-name>
          .
          <article-title>"Guidelines and Annotation Framework for Arabic Author Profiling."</article-title>
          <source>arXiv preprint arXiv:1808.07678</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rosso</surname>
            , Paolo, Francisco Rangel, Bilal Ghanem, and
            <given-names>Anis</given-names>
          </string-name>
          <string-name>
            <surname>Charfi</surname>
          </string-name>
          .
          <article-title>"ARAP: Arabic Author Profiling Project for Cyber-Security."</article-title>
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>61</volume>
          (
          <year>2018</year>
          ):
          <fpage>135</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Eembi@ Jamil, Normala Che, Iskandar Ishak, and
          <string-name>
            <given-names>Fatimah</given-names>
            <surname>Sidi</surname>
          </string-name>
          .
          <article-title>"Deception detection approach for data veracity in online digital news: Headlines vs contents." AIP Conference Proceedings</article-title>
          . Vol.
          <year>1891</year>
          . No.
          <article-title>1</article-title>
          .
          <string-name>
            <given-names>AIP</given-names>
            <surname>Publishing</surname>
          </string-name>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Alowibdi</surname>
            <given-names>JS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buy</surname>
            <given-names>UA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philip</surname>
            <given-names>SY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghani</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mokbel</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Deception detection in Twitter. Social network analysis and mining</article-title>
          .
          <source>2015 Dec</source>
          <volume>1</volume>
          ;
          <issue>5</issue>
          (
          <issue>1</issue>
          ):
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Al-Saif</surname>
          </string-name>
          , Hissah, and
          <string-name>
            <surname>Hmood</surname>
          </string-name>
          Al-Dossari.
          <article-title>"Detecting and Classifying Crimes from Arabic Twitter Posts using Text Mining Techniques."</article-title>
          <source>International Journal of Advanced Computer Science and Applications</source>
          <volume>9</volume>
          .10 (
          <year>2018</year>
          ):
          <fpage>377</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Abozinadah</surname>
          </string-name>
          , Ehab A.,
          <article-title>and</article-title>
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>"Improved micro-blog classification for detecting abusive Arabic Twitter accounts."</article-title>
          <source>International Journal of Data Mining and Knowledge Management Process (IJDKP)</source>
          <volume>6</volume>
          .6 (
          <year>2016</year>
          ):
          <fpage>17</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Joulin</surname>
            , Armand, Edouard Grave, Piotr Bojanowski, and
            <given-names>Tomas</given-names>
          </string-name>
          <string-name>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>"Bag of tricks for efficient text classification."</article-title>
          <source>arXiv preprint arXiv:1607.01759</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Sharmila</given-names>
            <surname>Devi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Kannimuthu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Ravikumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Anand</surname>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>"KCe Dalab@maponsms-Fire2018: Effective word and character-based features for multilingual author profiling" (</article-title>
          <year>2018</year>
          ) CEUR Workshop Proceedings,
          <volume>2266</volume>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>De</surname>
            <given-names>Boom</given-names>
          </string-name>
          , Cedric, Steven Van Canneyt,
          <string-name>
            <surname>Thomas Demeester</surname>
            , and
            <given-names>Bart</given-names>
          </string-name>
          <string-name>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>"Representation learning for very short texts using weighted word embedding aggregation."</article-title>
          <source>arXiv preprint arXiv:1607.00570</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schmidt</surname>
          </string-name>
          , Craig W.
          <article-title>"Improving a tf-idf weighted document vector embedding."</article-title>
          <source>arXiv preprint arXiv:1902.09875</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , Lingfei,
          Ian EH Yen
          , Kun Xu, Fangli Xu, Avinash Balakrishnan,
          <string-name>
            <surname>Pin-Yu</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Pradeep Ravikumar, and
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Witbrock</surname>
          </string-name>
          .
          <article-title>"Word Mover's Embedding: From Word2Vec to Document Embedding."</article-title>
          <source>arXiv preprint arXiv:1811.01713</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Weren</surname>
          </string-name>
          ,
          <string-name>
            <surname>Edson</surname>
            <given-names>RD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            <given-names>U</given-names>
          </string-name>
          . Kauer, Lucas Mizusaki,
          <string-name>
            <given-names>Viviane P.</given-names>
            <surname>Moreira</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Palazzo M. de Oliveira</surname>
          </string-name>
          , and
          <string-name>
            <surname>Leandro</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wives</surname>
          </string-name>
          .
          <article-title>"Examining Multiple Features for Author Profiling."</article-title>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Abu Bakr Soliman, Kareem Eisa, and
          <string-name>
            <surname>Samhaa</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>El-Beltagy</surname>
          </string-name>
          ,
          <article-title>AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP</article-title>
          ,
          <source>in proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing</source>
          <year>2017</year>
          ), Dubai,
          <string-name>
            <surname>UAE</surname>
          </string-name>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charfi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaghouani</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghanem</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez-Junquera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the track on author profiling and deception detection in Arabic</article-title>
          . In: Mehta P.,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            <given-names>M</given-names>
          </string-name>
          . (Eds.)
          <article-title>Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019)</article-title>
          . CEUR Workshop Proceedings. In: CEUR-WS.org, Kolkata, India, December 12-15
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>