<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Model for Profiling Fake News Spreaders on Twitter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>H.L Shashirekha</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anusha M.D</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nitin S Prakash</string-name>
          <email>3nitinsp123@gmail.com</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
<p>Capturing millions of users around the globe, social media has become one of the most popular means of communication and information sharing. Yet, using social media such as Facebook, WhatsApp, Twitter, and Instagram is like wielding a double-edged sword. On one side are the easy access, low cost, and rapid dissemination of information simultaneously to a large audience, while on the other side is the dangerous exposure to fake news. The wide spread of fake news has the potential to create extremely negative impacts on individuals and society. Hence, detecting fake news spreaders is the need of the day. To tackle this problem, this paper presents an Ensemble model for profiling fake news spreaders on Twitter, which uses multiple heterogeneous classifiers and combines their results, just like a team decision rather than an individual decision. The first step in building the model is to extract three feature sets, namely Unigram TFIDF, N-gram TF and Doc2vec, from the PAN 2020 fake news spreader dataset and to apply feature reduction to Unigram TFIDF and N-gram TF. Two Linear SVC classifiers and a Logistic Regression classifier are built using the three feature sets and are ensembled using majority voting. The results obtained on the PAN 2020 test set are encouraging.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In today’s world, social media has become one of the main platforms for the
exchange of information. Social media such as Facebook, WhatsApp, Twitter, and
Instagram have the advantage of reaching a wider audience simultaneously at a much
faster rate, which has made them popular, especially among the younger generation. It is
very easy and economical to generate news online and disseminate it rapidly through
social media. However, the information available on social media may be false or
faked purposefully with various intentions, such as to deceive readers, to insult an
individual or a group by creating embarrassing situations, or to impact individuals
and society negatively. Identifying fake news spreaders is therefore an important
task in this modern era and is gaining attention day by day.</p>
      <p>
        Fake news is usually incomplete, unstructured and noisy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Hence, effective
methods are required to extract useful features that differentiate fake news
from genuine news. As the news has to be identified as fake or not, the task can
be modeled as a binary Text Classification (TC) problem with two labels:
‘yes’/‘1’ representing fake news and ‘no’/‘0’ representing genuine news. Text
classification is the process of automatically assigning one of a set of predefined
tags/categories/labels to a new or unlabeled text according to its content. It is one of the
fundamental tasks in Natural Language Processing (NLP), with broad applications
such as fake news detection, sentiment analysis, topic labeling, spam detection, and
intent detection. Researchers have explored several TC algorithms that give good
performance. However, an algorithm that performs well on some datasets may not
give even average performance on others, so it is very difficult to claim that a
particular classifier is good for all datasets. As a result, instead of using a single
classifier, it is better to use a group of classifiers and take a collective decision,
just like the decision of a team rather than an individual. This approach, called the
Ensemble approach, compensates for the weakness of one classifier with the strength
of another and gives better performance than any individual classifier.
      </p>
      <p>In this paper, we propose an Ensemble approach for profiling fake news spreaders
on Twitter. The rest of the paper is organized as follows. Section 2 highlights the
related work, and the proposed Ensemble approach for profiling fake news spreaders
on Twitter is described in Section 3. Experiments and results are described in Section
4, and the paper concludes in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Works</title>
      <p>Researchers have developed several algorithms for profiling fake news spreaders.
A few important and relevant ones are highlighted below:</p>
      <p>
        A Machine Learning (ML) model for spam identification proposed by Kyumin et
al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] uses social honeypots on MySpace and Twitter as fake profiles that act as
traps for spammers. Since the collected spam data contained signals strongly correlated
with observable profile features such as content, friend information and posting
patterns, these features were fed to various ML classifiers, obtaining F1 scores in
the range of 95% to 98%. A framework for collecting, preprocessing, annotating
and analyzing bots on Twitter proposed by Zafar et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] extracts several features
such as the number of likes, retweets, user replies, mentions, URLs and follower-friend
ratio, and it has been found that humans share novel content whereas bots
rely more on retweets or sharing URLs. A bot and gender profiling study submitted to
PAN 2019 by Hamed et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] exploited TF-IDF features to train a
model for human-bot detection, and an ensemble voting classifier was then built
using three base models on character-level and word-level representations of the
documents. The proposed model achieved accuracies of 83% and 73% for English
and Spanish respectively on the test set provided by PAN 2019.
      </p>
      <p>
        An author profiling model submitted to PAN 2017 by Basile et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] constructed
an Ensemble model with a voting classifier and a dynamic, ad-hoc grid
search approach. The model was trained with character-level and word-level N-gram
representations of the documents, and the results aggregated using majority voting
achieved 85% accuracy on the test set. An ML model for gender classification of
Twitter users using n-grams has been proposed by Burger et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In addition to using
the user’s full name as the most informative field with respect to gender, they used word
unigrams, bigrams and character 1- to 5-grams as features and obtained an accuracy of
89.1%. Daneshvar et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed an ML model that applies latent semantic
analysis to the TF-IDF matrix with a linear SVM classifier on the dataset provided
by PAN 2018. Their model reported accuracies of 82.21%, 82% and 80.9% on the
English, Spanish, and Arabic datasets respectively and was the best-performing
model. John et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] addressed the bot identification problem from an
emotional perspective and evaluated their model on the India Election Dataset collected
from July 15th, 2013 to March 24th, 2014. They also performed a comparative
study of classifiers including and excluding sentiment features; the sentiment model
achieved an accuracy of 95%, which is better than the other models. Tetreault et al.
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed an ML model using Logistic Regression classifiers to build an Ensemble
classifier for Native Language Identification (NLI). A wide range of features such as
word and character n-grams, POS tags, function words, writing quality markers and
spelling errors was used to build the classifier. The proposed model achieved an
accuracy of 90.1% on the International Corpus of Learner English (ICLE). Filter-based
feature selection methods for the prediction of risks in hepatitis disease have been
proposed by Pinar et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Their work includes a preprocessing module
that standardizes non-standard language expressions, such as replacing slang
words, contractions, abbreviations, and emoticons by their corresponding normalized
language expressions. The proposed feature selection methods improved learning
accuracy, reduced learning time, and obtained an accuracy of 84% on a dataset
collected from the UCI machine learning repository that contains 19 fields with
one class attribute.
      </p>
      <p>
        Emotions play a key role in deceiving the reader. Based on emotions, Bilal et al.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed an emotionally-infused LSTM neural network model to detect
false news. The proposed model obtained accuracies of 80.72% and 64.82% for news
articles and Twitter, based on a large set of emotions, and the results illustrate that false
information has different emotional patterns in each of its types. A model based on
Low Dimensionality Representation (LDR) proposed by Francisco et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for
language variety identification was applied to the age and gender identification task
at the PAN Lab at CLEF, and the obtained results are quite competitive with the best
model in the author profiling task at PAN.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <p>Our proposed Ensemble approach for profiling fake news spreaders on Twitter is
explained in this section. The architecture for building the Ensemble model using the fake
news spreader training set provided by PAN 2020 [13] is shown in Figure 1. The
proposed approach consists of the following modules:</p>
      <sec id="sec-3-1">
        <title>3.1 Data Preparation</title>
        <p>This step combines all the tweets of each user into one text document and assigns the
corresponding label as per the data given by PAN 2020 [13].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Preprocessing</title>
        <p>The first step in preprocessing is demojification, the process of converting
emojis to text. Emojis are visual representations of emotions, objects or symbols which
can be inserted individually or together to create a string. As these visual
representations convey certain meanings, they are converted to words conveying the
corresponding meaning and used like other words in the text. Following this, all
punctuation symbols and numeric information are removed as they do not contribute
to TC. The remaining text is tokenized into words and, to reduce the number of words,
unwanted words including all stopwords, uninformative words, and words with length less
than three are removed. Further, stemming and lemmatization are applied to reduce
the words to their root forms. Text data is by default high dimensional; on
average, preprocessing reduces the dimension of the text by 20-25%. The remaining
words are given as input to the feature extraction step.</p>
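        <p>The preprocessing steps above can be sketched in plain Python. This is a minimal illustration only: the emoji map and stopword list below are tiny hypothetical stand-ins for full demojification and stopword resources, and the stemming/lemmatization step is omitted to keep the sketch dependency-free.</p>

```python
import re

# Hypothetical tiny emoji map and stopword list, for illustration only.
EMOJI_MAP = {"\U0001F44D": "thumbs up", "\U0001F600": "grinning face"}
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}

def preprocess(text):
    # 1. Demojification: replace each emoji with words conveying its meaning.
    for emoji, meaning in EMOJI_MAP.items():
        text = text.replace(emoji, " " + meaning + " ")
    # 2. Remove punctuation symbols and numeric information.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 3. Tokenize, lowercase, drop stopwords and words shorter than three characters.
    tokens = [w.lower() for w in text.split()]
    return [w for w in tokens if w not in STOPWORDS and len(w) >= 3]

print(preprocess("The vote 2020 is rigged!! \U0001F44D"))
# -> ['vote', 'rigged', 'thumbs']
```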
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Feature Extraction</title>
        <p>The feature extraction module is responsible for extracting the distinguishing features
from the given text collection that describe the original dataset. The
following feature extraction methods are used in our model:
• Unigram Term Frequency/Inverse Document Frequency (TF/IDF) is a
weighting scheme often used in Information Retrieval and Text Mining (TM)
which represents the relative importance of a word in the document and the
entire corpus. It is a common technique in any text analysis application,
including TC. A single word is considered as a feature, hence the name
Unigram TF/IDF.
• An N-gram is a sequence of N words (word N-grams) or N characters (character
N-grams), widely used in TM and NLP. This simple idea
has been found to be effective in many applications [14]. An N-gram model predicts
the occurrence of a word based on the occurrence of its previous N-1 words. The
TF of bi-grams, tri-grams and four-grams is used in our model.
• Doc2Vec is an NLP tool for representing text documents as vectors and is a
generalization of the word2vec method. It is an unsupervised algorithm that
generates vectors for variable-length pieces of text such as sentences, paragraphs,
or documents, and has the advantage of capturing the semantics of the input texts.
The vectors generated by doc2vec can be used for tasks like finding the similarity of
texts at various levels: sentences, paragraphs or documents. A vector of size
300 is used in our model.</p>
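        <p>The first two feature sets can be sketched with scikit-learn on toy documents (not the PAN 2020 data). The Doc2vec feature set of vector size 300 would come from an implementation such as gensim's Doc2Vec and is omitted here to keep the sketch dependency-light.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy user documents standing in for the combined tweets of each author.
docs = [
    "breaking news the election was rigged",
    "lovely weather today in the city",
]

# Unigram TF-IDF: each single word is a feature weighted by tf-idf.
unigram_tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# N-gram TF: raw term frequencies of bi-grams, tri-grams and four-grams.
ngram_tf = CountVectorizer(ngram_range=(2, 4)).fit_transform(docs)

print(unigram_tfidf.shape, ngram_tf.shape)
# -> (2, 11) (2, 24)
```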
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Feature Reduction</title>
        <p>Text data is by default high dimensional, as it consists of words and words are
the features for any text analysis application. The complexity of any algorithm increases
with the number of features. Hence, reducing the number of features while still
representing the original dataset becomes very important to reduce the complexity
of the algorithms. Feature reduction plays an important role in reducing the number of
features by eliminating redundant and irrelevant features, thereby improving the
performance of the algorithm [15].</p>
        <p>Three feature selection algorithms, namely the Chi-square test, Mutual Information
and the F-test, are used for feature reduction. All three are based on the filter
approach, in which a statistical measure is used to assign a score to each feature and
the features are ranked by that score. Only the top k ranked features are selected
for further processing. Brief descriptions of the feature selection algorithms are
given below:
• The Chi-square test helps to solve the feature selection problem by testing the
relationship between a feature and the response variable. It is based on the difference
between the observed and the expected values for each category. A higher Chi-square
value indicates that the feature is more dependent on the response variable and can be
selected for model training.
• Mutual Information is a powerful feature selection technique that can measure the
relationship between variables, including non-linear relationships. It is invariant
under transformations of the feature space that are invertible and differentiable,
e.g. translations, rotations, and any transformation preserving the order of the
original elements of the feature vectors. The main advantage of this method is its
rapidity of execution.
• The F-test is a class of statistical tests that calculates the ratio between variance
values, such as the variance of two different samples or the explained and unexplained
variance of a statistical test like ANOVA. The scikit-learn library provides an
implementation of the ANOVA F-test in the f_classif() function. It is most often used
to compare statistical models that have been fitted to a dataset, in order to identify
the model that best fits the population from which the data were sampled.</p>
        <p>Feature reduction is applied only to the Unigram TF/IDF and N-gram models, as the
number of features in these two models is very high. The features are ranked using
the above methods and the top 5000 features are considered for each method. Then, a
reduced feature set consisting of the union of all these features is used for
building the classifiers.</p>
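        <p>The three filter methods and the union of their top-k rankings can be sketched with scikit-learn's SelectKBest. The data below is a toy stand-in for the sparse TF/IDF matrix, and TOP_K is reduced from the paper's 5000 for illustration.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

# Toy non-negative matrix standing in for TF/IDF; chi2 requires non-negative input.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)
X = np.abs(X)

TOP_K = 10  # the paper keeps the top 5000 features per method

# Rank features with each filter method and take the union of the top-k indices.
selected = set()
for score_fn in (chi2, f_classif, mutual_info_classif):
    selector = SelectKBest(score_func=score_fn, k=TOP_K).fit(X, y)
    selected.update(selector.get_support(indices=True))

X_reduced = X[:, sorted(selected)]
print(X_reduced.shape)
```

Because the three rankings overlap, the union typically contains fewer than three times TOP_K features.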
      </sec>
      <sec id="sec-3-5">
        <title>3.5 Classifier Construction</title>
        <p>Classifiers are supervised learning models constructed using the training set.
Two classifiers, namely Linear SVC and Logistic Regression, available in scikit-learn
(Machine Learning in Python, scikit-learn.org), are used in our work. Brief descriptions
of these two classifiers are given below:
• Linear SVC is a Linear Support Vector Classifier, similar to a Support
Vector Machine classifier with a linear kernel, available in scikit-learn. It accepts
the training data and returns a "best fit" hyperplane that categorizes the given
data. As Linear SVC is efficient, flexible and fast, it is highly
recommended for TC.
• Logistic Regression is a very effective classification method for text data. It
is a statistical model that in its basic form uses a logistic function to model a
binary dependent variable, although many more complex extensions exist.</p>
        <p>
          After preprocessing the training set provided by PAN 2020 [13], Unigram TFIDF,
N-gram TF and Doc2vec features are extracted. Further, feature reduction is applied
to the Unigram TFIDF and N-gram TF models only, as these two models have more
features. All three feature sets are scaled using MaxAbsScaler. This scaler scales
the data to a [-1, 1] range based on the absolute maximum. It is meant for data that
is already centered at zero or sparse data; it does not shift/center the data and
thus does not destroy any sparsity. These scaled feature sets are used to build the
learning models.
        </p>
        <p>Two Linear SVC models are constructed using the scaled Unigram TFIDF and N-gram
TF features, and a Logistic Regression model is constructed using the scaled Doc2vec
features of vector size 300.</p>
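        <p>Scaling and classifier construction can be sketched as scikit-learn pipelines. The toy feature matrix below stands in for one of the scaled feature sets; in the actual model each pipeline would be fitted on its own feature set.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

# Toy data standing in for one scaled feature set (e.g. Unigram TFIDF).
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# MaxAbsScaler scales each feature to [-1, 1] without destroying sparsity.
svc = make_pipeline(MaxAbsScaler(), LinearSVC())
logreg = make_pipeline(MaxAbsScaler(), LogisticRegression())

svc.fit(X, y)
logreg.fit(X, y)
print(svc.score(X, y), logreg.score(X, y))
```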
      </sec>
      <sec id="sec-3-6">
        <title>3.6 Ensemble</title>
        <p>No single classifier always gives good results on all datasets. Hence,
instead of using a single classifier, an ensemble of the three classifiers mentioned above
(Linear SVC using Unigram TFIDF, Linear SVC using N-gram TF features, and
Logistic Regression using Doc2vec) is used with majority voting [16].</p>
        <p>The ensemble classifier accepts the test data from the PAN 2020 organizers and
assigns either ‘1’ or ‘0’, representing the ‘fake’ and ‘not fake’ labels respectively, as
shown in Figure 2. The test data is preprocessed and the three sets of features
mentioned above, namely Unigram TFIDF, N-gram TF and Doc2vec, are extracted.
Only those features present in the reduced feature set are input to the ensemble
model to assign the suitable label, ignoring the rest.</p>
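        <p>Because each base classifier is trained on a different feature set, majority voting can be sketched as a manual hard vote over the three prediction vectors. The label arrays below are toy predictions, not actual model outputs.</p>

```python
import numpy as np

def majority_vote(predictions):
    # predictions: one array of 0/1 labels per base classifier.
    votes = np.vstack(predictions)
    # With three voters, a label wins when at least two of them agree.
    return (votes.sum(axis=0) >= 2).astype(int)

pred_svc_tfidf = np.array([1, 0, 1, 1])   # Linear SVC on Unigram TFIDF
pred_svc_ngram = np.array([1, 1, 0, 1])   # Linear SVC on N-gram TF
pred_lr_doc2vec = np.array([0, 0, 1, 1])  # Logistic Regression on Doc2vec

print(majority_vote([pred_svc_tfidf, pred_svc_ngram, pred_lr_doc2vec]))
# -> [1 0 1 1]
```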
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiment Results</title>
      <p>The code is implemented in Python using scikit-learn (Machine Learning in
Python). The uncompressed dataset provided by PAN 2020 consists of two folders, one
for the English and one for the Spanish language. Each folder contains one XML file
per author (Twitter user) with 100 tweets, and the name of the XML file corresponds
to the unique author id. The author list and ground truth are given in the truth.txt
file. Details of the dataset are given in Table 1. Two separate models are created for
the two languages, English and Spanish.</p>
      <p>The proposed model is evaluated through the PAN submission system TIRA
(Integrated Research Architecture), a modularized platform for shared tasks
[17], and the performance of the model is reported by the task evaluators. Our model
obtained an accuracy of 73.50% on English text and 67.50% on Spanish text,
an overall accuracy of 70.50%, with a runtime of 00:01:52 (hh:mm:ss).</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>With the increasing popularity of social media, more and more people consume
news from social media instead of traditional news media. However, social media has
also been used to spread fake news, which has a strong negative impact on individuals
and society. In this paper, an Ensemble model is built for the profiling fake news
spreaders on Twitter task at PAN 2020. The Ensemble approach uses majority voting
over three classifiers (two Linear SVC classifiers and a Logistic Regression
classifier) built using the Unigram TF/IDF, N-gram TF and Doc2Vec feature sets. The
proposed model obtained accuracies of 73.50% and 67.50% on the English and Spanish
languages respectively.
</p>
      <p>13. Rangel F., Giachanou A., Ghanem B., Rosso P.: “Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter”. In: Cappellato L., Eickhoff C., Ferro N., Névéol A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CEUR-WS.org, 2020.</p>
      <p>14. Keselj V., Peng F., Cercone N., Thomas C.: “N-gram-based Author Profiles for Authorship Attribution”. In: Proc. of the Conference of the Pacific Association for Computational Linguistics, 2003.</p>
      <p>15. Gao W., Kannan S., Oh S., Viswanath P.: “Estimating mutual information for discrete-continuous mixtures”. arXiv preprint arXiv:1709.06212, 2017.</p>
      <p>16. Löfström T.: “On Effectively Creating Ensembles of Classifiers: Studies on Creation Strategies, Diversity and Predicting with Confidence”. Ph.D. thesis, Stockholm University, 2015.</p>
      <p>17. Potthast M., Gollub T., Wiegmann M., Stein B.: “TIRA Integrated Research Architecture”. In: Information Retrieval Evaluation in a Changing World, pp. 123-160. Springer, Cham, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Tang J., Chang Y., Liu H.: “Mining social media with social theories: a survey”. ACM SIGKDD Explorations Newsletter, 15(2), pp. 20-29, 2014.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Lee K., Eoff B.D., Caverlee J.: “Seven months with the devils: A long-term study of content polluters on Twitter”. In: Fifth International AAAI Conference on Weblogs and Social Media, 2011.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Gilani Z., Farahbakhsh R., Tyson G., Wang L., Crowcroft J.: “Of bots and humans (on Twitter)”. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 349-354, ACM, 2017.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Babaei Giglou H., Rahgouy M., Rahgooy T., Karami Sheykhlan M., Mohammadzadeh E.: “Author profiling: Bot and gender prediction using a multi-aspect ensemble approach. Notebook for PAN at CLEF 2019”. In: Cappellato L., Ferro N., Losada D.E., Müller H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers, 2019.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Basile A., Dwyer G., Medvedeva M., Rawee J., Haagsma H., Nissim M.: “N-GrAM: New Groningen Author-profiling Model. Notebook for PAN at CLEF 2017”. In: Cappellato L., Ferro N., Goeuriot L., Mandl T. (eds.) CLEF 2017 Evaluation Labs and Workshop, Working Notes Papers, 11-14 September, Dublin, Ireland, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Burger J.D., Henderson J., Kim G., Zarrella G.: “Discriminating gender on Twitter”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 1301-1309, Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Daneshvar S., Inkpen D.: “Gender Identification in Twitter using N-grams and LSA. Notebook for PAN at CLEF 2018”. In: Cappellato L., Ferro N., Nie J.Y., Soulier L. (eds.) CLEF 2018 Evaluation Labs and Workshop, Working Notes Papers, 10-14 September, Avignon, France, 2018.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Dickerson J.P., Kagan V., Subrahmanian V.S.: “Using sentiment to detect bots on Twitter: Are humans more opinionated than bots?”. In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 620-627, 2014.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Tetreault J., Blanchard D., Cahill A., Chodorow M.: “Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification”. In: Proceedings of COLING 2012, pp. 2585-2602, 2012.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Yildirim P.: “Filter Based Feature Selection Methods for Prediction of Risks in Hepatitis Disease”. International Journal of Machine Learning and Computing, 5(4), 2015.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Ghanem B., Rosso P., Rangel F.: “An emotional analysis of false information in social media and news articles”. ACM Transactions on Internet Technology (TOIT), 20(2), pp. 1-18, 2020.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Rangel F., Franco-Salvador M., Rosso P.: “A low dimensionality representation for language variety identification”. In: International Conference on Intelligent Text Processing and Computational Linguistics, pp. 156-169, Springer, Cham, 2016.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>