<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>HAHA at FakeDeS 2021: A Fake News Detection Method Based on TF-IDF and Ensemble Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kun Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2664</volume>
      <abstract>
        <p>This paper describes our participation in the FakeDeS [5] Task at IberLEF 2021: Fake News Detection in Spanish. For this task, we propose classic TF-IDF feature extraction combined with a Stacking ensemble learning method built on weak classifiers. Our approach not only analyzes the content of the news, but also incorporates effective information such as publishers and topics to improve the performance of our model. We used five machine learning models, achieved very competitive results on both the validation set and the test set, and took second place in the final evaluation phase.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News Detection</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Ensemble Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Fake news refers to a kind of public opinion that uses false information to deceive its audience in order to achieve a certain purpose. It fails to truly reflect the original appearance of objective events and contains false elements. The information provided by fake news is designed to manipulate people for different purposes: terrorism, political elections, advertising, satire, etc. In social networks, fake news spreads in seconds among thousands of people, and research has shown that misinformation spreads faster, farther, deeper, and more widely than true information [12], so it is necessary to develop tools that help control the amount of fake news on the network.</p>
      <p>
        A few years ago, fake news detection mainly relied on analyzing effective features from various sources, including the content of the text, user data, and the way the news spreads. Such work distinguishes true from false news mainly through linguistic features such as writing style and distinctive headlines [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and lexical and syntactic analysis [10]. In addition to linguistic features [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], some studies have proposed classification schemes based on user features and temporal features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Recent fake news detection methods mainly use machine learning and deep learning techniques, with special attention to language-based methods [
        <xref ref-type="bibr" rid="ref8">8, 11, 14, 15</xref>
        ]. Some works use TF-IDF feature extraction for fake news detection and have achieved good results [
        <xref ref-type="bibr" rid="ref1 ref2">1, 13, 2</xref>
        ].
      </p>
      <p>A fake news detection system is designed to help users detect and filter potentially deceptive news. A predictive method for deliberately misleading news is based on the analysis of previously vetted real and deceptive news, that is, an annotated corpus. Text is the main carrier of news information, and studying the news text helps to effectively identify fake news. The specific task is: given the text of a news event, determine whether the event is real news or fake news. For the evaluation of systems, a new testing corpus is used that contains news related to COVID-19 and news from other Ibero-American countries. Its availability introduces two main challenges to the task: thematic and language variation. Our systems need to take into account that part of the testing corpus contains news on a topic that does not exist in the training corpus; likewise, the other part of the testing corpus contains news in a different variety of Spanish than the one in the training corpus. This paper proposes a fake news detection method based on TF-IDF and ensemble machine learning.</p>
      <p>TF-IDF is simple and fast to compute, and it performs well on long texts. Section 2 introduces the corpus and analyzes the composition and distribution of the data. Section 3 introduces the methodology: the data processing methods, the feature extraction methods, the base models used, and the final ensemble model. The experimental settings and results are presented in Section 4. Finally, Section 5 outlines the conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Corpus Description</title>
      <p>The Spanish fake news corpus [9] was collected from several online sources: existing newspaper websites, media company websites, special websites dedicated to verifying fake news, and websites flagged by different reporters as regular publishers of fake news. All the articles are written in Mexican Spanish. The corpus contains 971 news items from different sources, collected from January to July 2018, split into a training set of 676 items and a test set of 295 items. Only two categories (True or Fake) are considered for labeling, and each item has the following fields:
- Category: Fake/True.
- Topic: Science/Sport/Economy/Education/Entertainment/Politics/Health/Security/Society.
- Headline: the title of the news.
- Text: the complete text of the news.
- Link: the URL where the news was published.</p>
      <p>The numbers of fake and real news items are fairly balanced, and within each topic they are almost equal as well. However, the share of fake news varies widely across websites: some websites publish almost exclusively fake news, while others publish only real news. This observation guides our feature extraction, so we consider the impact of "Link" and "Topic" on the classification results in our experiments.</p>
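<p>The per-source imbalance described above can be checked with a few lines of Python; the rows below are toy placeholders following the field names listed in this section (the real corpus has 676 training and 295 test items):</p>

```python
from collections import Counter

# Toy rows mimicking the corpus schema above; all values are illustrative,
# not taken from the real corpus.
rows = [
    {"Category": "Fake", "Topic": "Politics", "Source": "siteA",
     "Headline": "h1", "Text": "t1", "Link": "https://www.siteA.com/a"},
    {"Category": "True", "Topic": "Politics", "Source": "siteB",
     "Headline": "h2", "Text": "t2", "Link": "https://www.siteB.mx/b"},
    {"Category": "Fake", "Topic": "Health", "Source": "siteA",
     "Headline": "h3", "Text": "t3", "Link": "https://www.siteA.com/c"},
]

# Overall label balance, and the fake/real split per source website.
label_counts = Counter(r["Category"] for r in rows)
per_source = Counter((r["Source"], r["Category"]) for r in rows)
print(label_counts, per_source)
```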
    </sec>
    <sec id="sec-3">
      <title>Method and Technology</title>
      <p>This section includes four parts: data preprocessing, feature extraction methods, classification models, and the ensemble model method.</p>
      <sec id="sec-3-1">
        <title>Data Preprocessing</title>
        <p>Data preprocessing is essential for good results and is usually the first step in a natural language processing task. First, we extract the most informative part of the Link field, the host name, e.g. www.elruinaversal.com; our analysis of the news sources showed that on some websites almost all the news is fake. Each row of the data consists of Category, Topic, Source, Headline, Text and Link, and we merge the contents of these columns into a single new input. Finally, we clean the merged data with regular expressions, removing links, special symbols, punctuation, etc. The longest text has length 2578 and the shortest 31, so we considered using nltk to remove the stop words; after stop-word removal the longest text has length 1379 and the shortest 18. However, our experiments (described below) show that removing stop words reduces the performance of the model. We therefore compare four data processing settings: 1) merge all columns + remove stop words; 2) merge all columns + keep stop words; 3) only Text + remove stop words; 4) only Text + keep stop words.</p>
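<p>A minimal sketch of these preprocessing steps follows. The function names are illustrative; the paper uses nltk's Spanish stop-word list, replaced here by a tiny hard-coded subset to keep the example self-contained, and Category is excluded from the merge since it is the label:</p>

```python
import re

# Tiny illustrative subset; the paper uses nltk.corpus.stopwords.words("spanish").
STOPWORDS = {"de", "la", "el", "en", "y", "que", "los"}

def extract_domain(link):
    # Keep only the host part of Link, e.g. www.elruinaversal.com.
    m = re.search(r"https?://([^/]+)", link)
    return m.group(1) if m else link

def clean(text, remove_stopwords=True):
    # Remove links, then special symbols and punctuation, as described above.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)

def merge_row(row, remove_stopwords=False):
    # Merge the non-label columns into one input string (the default here is
    # setting 2 above: merge all columns, keep stop words).
    body = clean(row["Headline"] + " " + row["Text"], remove_stopwords)
    return " ".join([row["Topic"], row["Source"],
                     extract_domain(row["Link"]), body])
```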
      </sec>
      <sec id="sec-3-2">
        <title>Feature Extraction</title>
        <p>Content-based methods focus on extracting various features of fake news content, including knowledge-based and style-based features. This paper uses two methods to extract features: 1) LabelEncoder; 2) TF-IDF.
- LabelEncoder: We use sklearn's LabelEncoder to hard-code discrete text features, that is, to encode each distinct value as an integer in [0, n-1], where n is the number of distinct values. We apply LabelEncoder to the Topic and Source features, and it proved very useful in the experiments.
- TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical method for weighting keywords, used to evaluate the importance of a word to a document set or a corpus. The importance of a word is positively correlated with the number of times it appears in the article and negatively correlated with the number of times it appears across the corpus, so TF-IDF effectively discounts common words and improves the relevance between keywords and articles. TF is the number of times a word appears in the article; it is usually normalized by the total number of words in the article, which prevents the score from being biased toward long documents (the same word usually has a higher raw frequency in a long document than in a short one). IDF is the inverse document frequency: the fewer documents contain a word, the greater its IDF value, indicating that the word discriminates strongly between documents. TF-IDF extracts text features from the Spanish news well. For texts several thousand tokens long, TF-IDF extracts features from long texts better than RNNs and other neural network models, and it also copes easily with the challenge of language variation.</p>
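<p>The two extractors above can be sketched with sklearn as follows; the documents are toy inputs, and the vectorizer uses default settings since the paper does not report its exact parameters:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_matrix, hstack

# Toy documents; the real input is the merged Spanish news text.
texts = ["el presidente anuncia una nueva ley",
         "esta vacuna cura todas las enfermedades"]
topics = ["Politics", "Health"]

# TF-IDF features over the text.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(texts)

# LabelEncoder maps each distinct Topic/Source value to an integer in [0, n-1].
le = LabelEncoder()
topic_ids = le.fit_transform(topics)

# Append the encoded topic as one extra column next to the TF-IDF matrix.
X = hstack([X_text, csr_matrix(topic_ids.reshape(-1, 1))])
print(X.shape)
```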
      </sec>
      <sec id="sec-3-3">
        <title>Base Classi cation Model</title>
        <p>We use five basic weak classifiers as our base models.</p>
        <p>- LogisticRegression (LR): Logistic regression models the connection between features and the output and is used for classification problems in supervised machine learning; it is closely related to neural networks, which can be regarded as stacked logistic regression classifiers. Logistic regression can handle both binary and multi-class classification problems.
- SGDClassifier (SGDC): Mainly used on large-scale sparse data problems. It is a family of linear classifiers trained with the stochastic gradient descent algorithm; by default it is a linear (soft-margin) support vector machine classifier, but in this paper we use the logistic loss.
- PassiveAggressiveClassifier (PAC): A classic online linear classifier that continuously incorporates new samples to adjust the classification model and enhance its classification ability. It can extract features from streaming data and supports incremental learning.
- RidgeClassifier (RC): This classifier fits the classification model with a penalized least-squares loss, which gives it a different computational-performance profile.
- LinearSVC (LSVC): A linear support vector machine classifier implemented with liblinear, usable for binary or multi-class classification.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Ensemble Model</title>
        <p>
          LightGBM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a fast, distributed, high-performance gradient boosting framework based on the decision tree algorithm. It can be used for ranking, classification, regression, and many other machine learning tasks. LightGBM improves on the GBDT algorithm [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: instead of the traditional pre-sorting approach, it builds histograms over the feature values. Weak classifiers are trained iteratively to obtain the optimal model, which trains effectively and is not prone to overfitting. The training method is GBDT, an algorithm that classifies or regresses data by fitting an additive model that continuously reduces the residuals generated during training.
        </p>
        <p>We choose the above LogisticRegression (LR), SGDClassifier (SGDC), PassiveAggressiveClassifier (PAC), RidgeClassifier (RC), and LinearSVC (LSVC) as the weak base models. Figure 1 shows how the Stacking ensemble learning method uses all the trained base models to predict on the entire training set, each base model producing one classification prediction. Each base model is trained with 5-fold cross-validation; the classification predictions of the base models are concatenated, and finally all these features are fed to the final LightGBM model for training.</p>
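<p>A minimal sketch of this stacking scheme is shown below, using synthetic data, two of the five base models for brevity, and sklearn's GradientBoostingClassifier (a GBDT implementation) standing in for LightGBM to avoid an extra dependency:</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the real features are the TF-IDF/LabelEncoder matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=1017)

# Two of the five base models, for brevity.
base_models = [LogisticRegression(random_state=1017, C=3),
               RidgeClassifier(random_state=1017)]

# 5-fold out-of-fold predictions: one column of stacked features per base model.
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# Final-level booster trained on the concatenated base-model predictions.
meta = GradientBoostingClassifier(random_state=1017)
meta.fit(oof, y)
```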
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>First, data processing is performed, and then all the hyperparameters of the experiment are set. Almost all base models use default parameters, and each base model is trained with 5-fold cross-validation; the LightGBM model is also trained with 5-fold cross-validation. The boosting type of the LightGBM model is GBDT; the learning rate is set to 0.01; the maximum number of iterations (num-boost-round) is set to 10000; progress is displayed every 50 iterations. Finally, the model outputs the final accuracy and F1-macro. The hyperparameters of each classifier are as follows: LR (random-state=1017, C=3), SGDC (random-state=1017, loss='log'), PAC (random-state=1017, C=2), RC (random-state=1017), LSVC (random-state=1017).</p>
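<p>The LightGBM settings above can be collected into a parameter dictionary as sketched below; the objective key is an assumption (the paper does not list it), and the lightgbm.train call is shown only for orientation, assuming lightgbm is installed and train_set is a lightgbm.Dataset of the stacked features:</p>

```python
# Reported LightGBM settings; "objective" is assumed, not reported.
lgb_params = {
    "boosting_type": "gbdt",   # training method: GBDT
    "learning_rate": 0.01,
    "objective": "binary",     # assumption: binary True/Fake task
}
num_boost_round = 10000        # maximum number of iterations

# Hypothetical training call (progress logged every 50 iterations):
# model = lightgbm.train(lgb_params, train_set,
#                        num_boost_round=num_boost_round,
#                        callbacks=[lightgbm.log_evaluation(period=50)])
```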
      </sec>
      <sec id="sec-4-2">
        <title>Result</title>
        <p>We evaluate the model with accuracy and the F1 measure on the "fake" class. The results of the base models and the ensemble model on the training data are shown in Table 1, where "Merge-All" means concatenating all the information and "Stopwords" means removing stop words from the text. Accuracy and F1-macro are both measured on the validation set. Table 1 shows that the ensemble model always outperforms the base models, regardless of whether merging is used or stop words are removed. For the same model, the highest accuracy, 91.1%, and the highest F1-macro score, 91.3%, are obtained by applying both data processing methods at the same time. At the same time, removing stop words actually weakens the model, while merging all column information significantly improves the base models and also improves the ensemble model to a certain extent.</p>
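<p>For clarity, the two reported metrics are computed as follows; the labels here are toy predictions, not our system's actual outputs:</p>

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy predictions; the paper's numbers come from the real validation set.
y_true = ["Fake", "True", "Fake", "True", "Fake"]
y_pred = ["Fake", "True", "True", "True", "Fake"]

acc = accuracy_score(y_true, y_pred)                  # overall accuracy
f1_macro = f1_score(y_true, y_pred, average="macro")  # reported F1-macro
f1_fake = f1_score(y_true, y_pred, pos_label="Fake",
                   average="binary")                  # F1 on the "fake" class
print(acc, f1_macro, f1_fake)
```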
        <p>For the same base model, merging improves both accuracy and the F1-macro score; without merging, the accuracy of the model drops significantly, and removing stop words from the text reduces performance. At the same time, no matter how the data is processed, the ensemble model beats the base models in accuracy, and its F1-macro score is better with merging. The accuracy of the ensemble model is at least 2.6% and at most 7% higher than that of the base models, and the F1-macro score improves by at least 2.1% and at most 6.8%. This fully illustrates that our ensemble learning model is very effective at improving performance. The more information the model receives, the better it performs, but too much data affects the efficiency of the model.</p>
        <p>On the test data set we submitted only two runs: 1) no merging, with stop words removed; 2) merging, with stop words kept. The F1-macro score of the first run is only 0.6975, while that of the second reaches 0.7548, as shown in Table 2, which once again shows that data processing and ensemble learning methods can effectively improve the performance of the model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Further Work</title>
      <p>This article describes our work on the fake news detection classification task at IberLEF 2021. We used classic feature extraction methods and machine learning techniques, and through ensemble methods achieved high performance on the development set. Compared with other deep learning and machine learning methods, the performance on the test set is also very competitive. Compared with MEX-A3T 2020 [13], the accuracy on the validation set increased by about 8% and the F1-macro score by 6%. Compared with last year's best papers, the results of our model are also very competitive. The best F1-macro score we obtained on the test set was 0.7548, which took second place. Due to changes in language and content, the performance on the test set is still lower than on the development set.</p>
      <p>Future work will explore more advanced techniques and better feature extraction methods to achieve stronger results in the next competition. Secondly, we also plan to apply our model to other languages and to better address the current flood of COVID-19 misinformation. Finally, treating all news from a single link as uniformly fake or real is harmful; we will address this problem in future work.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the organizers for the opportunity to participate and for the organization of this task, as well as our teachers and senior colleagues for their help. Finally, we thank our school for supporting this research, and the reviewers for their patient work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Traore</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Detecting opinion spams and fake news using text classification</article-title>
          .
          <source>Security and Privacy</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <year>e9</year>
          (
          <year>2018</year>
          ). https://doi.org/10.1002/spy2.9, https://onlinelibrary.wiley.com/doi/abs/10.1002/spy2.9
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Arce-Cardenas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fajardo-Delgado</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmona</surname>
            ,
            <given-names>M.A.A.</given-names>
          </string-name>
          : Tecnm at MEXA3T 2020:
          <article-title>Fake news and aggressiveness analysis in Mexican Spanish</article-title>
          .
          <source>In: IberLEF@SEPLN. CEUR Workshop Proceedings</source>
          , vol.
          <volume>2664</volume>
          , pp.
          <volume>265</volume>
          -
          <fpage>272</fpage>
          . CEURWS.org (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendoza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poblete</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Information credibility on twitter</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on World Wide Web</source>
          . p.
          <volume>675</volume>
          -
          <fpage>684</fpage>
          . WWW '
          <volume>11</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2011</year>
          ). https://doi.org/10.1145/1963405.1963500
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Greedy function approximation: A gradient boosting machine</article-title>
          .
          <source>The Annals of Statistics</source>
          <volume>29</volume>
          (
          <issue>5</issue>
          ),
          <volume>1189</volume>
          -
          <fpage>1232</fpage>
          (
          <year>2001</year>
          ). https://doi.org/10.1214/aos/1013203451
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Duran</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bel-Enguix</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of FakeDeS task at IberLEF 2020: Fake news detection in Spanish</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <issue>0</issue>
          ) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            ,
            <surname>Liu</surname>
          </string-name>
          , T.Y.:
          <article-title>LightGBM: A highly efficient gradient boosting decision tree</article-title>
          .
          <source>In: Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          . p.
          <volume>3149</volume>
          -
          <fpage>3157</fpage>
          . NIPS'
          <volume>17</volume>
          , Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kwon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Prominent features of rumor propagation in online social media</article-title>
          .
          <source>In: 2013 IEEE 13th International Conference on Data Mining</source>
          . pp.
          <volume>1103</volume>
          -
          <issue>1108</issue>
          (
          <year>2013</year>
          ). https://doi.org/10.1109/ICDM.2013.61
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Oshikawa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , W.Y.:
          <article-title>A survey on natural language processing for fake news detection</article-title>
          .
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . pp.
          <volume>6086</volume>
          -
          <fpage>6093</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          , Marseille, France (May
          <year>2020</year>
          ), https://www.aclweb.org/anthology/2020.lrec-1.747
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>