=Paper=
{{Paper
|id=Vol-2936/paper-42
|storemode=property
|title=Classifier for fake news detection and Topical Domain of News Articles
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-42.pdf
|volume=Vol-2936
|authors=William Kana Tsoplefack
|dblpUrl=https://dblp.org/rec/conf/clef/Tsoplefack21
}}
==Classifier for fake news detection and Topical Domain of News Articles==
Kana Tsoplefack William, University Duisburg-Essen, Germany

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania. Contact: williamkana46@gmail.com, https://boby024.github.io/Webpage/ (K. T. William). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===Abstract===

Digitization has resulted in a plethora of new ways to read articles or excerpts online using smartphones or tablets. Nowadays, everything is available online, and dealing with false information has become an increasingly pressing problem. Online newspapers cannot check the veracity of all social media posts, and in order to combat the spread of fake news, machine learning algorithms might be beneficial in classifying articles based on labels provided by experts. This paper presents relevant algorithms and their outcomes.

Keywords: Fake News, Machine learning

===1. Introduction===

Today, spreading article content is as simple as clicking a link, and because content circulates so rapidly, fake news articles spread in the same way. With this expansion, it has become more vital to give users a concise indication of which class an article belongs to: whether its content is misleading, false, or true. Fact-checking websites are therefore in high demand, and on top of the content they have already classified, the amount of work to be done on each article by experts in the relevant field keeps growing.

Faced with this, machine learning algorithms may be useful in detecting fake news or identifying topics. First, article classification may be useful when there is a shortage of experts or there are insufficient resources to hire them. Second, knowing the topic of articles may aid the hiring process by giving companies an estimate of how many specialists they will need to classify their articles. This study gives a basic description of how machine learning can be applied to classify articles and discusses the results obtained.

===2. Related Work===

Many studies have addressed classification using machine learning. For topics such as hate speech and fake news, [1] presents different classification methods used to detect hate speech with particular feature extraction techniques: simple surface features such as Bag of Words (BoW) or Term Frequency-Inverse Document Frequency (TFIDF), which show the benefits of using character n-grams and word n-grams. [2] and [3] describe how BoW and TFIDF operate, respectively. This study simply applies machine learning algorithms to article classification.

===3. Method===

Many machine learning algorithms have been developed to assist humans in classification tasks, and they are becoming increasingly accurate. This paper discusses findings obtained by classifying articles for fake news detection and topic identification using two classifiers, Multinomial Naive Bayes and Random Forest. These algorithms were preferred to others for the following reasons.

First, both Multinomial Naive Bayes and Random Forest [4] can predict non-binary outcomes. Second, Multinomial Naive Bayes uses term frequency, i.e. the number of times a specific term appears in a document. The raw term frequency can be normalised by dividing it by the document length, and the term frequencies are used to compute the maximum-likelihood estimate, based on the training data, of the conditional probability of a term given a class. Random Forest, on the other hand, builds many decision trees and then merges them for a more precise and consistent prediction.

===4. Dataset===

The dataset was downloaded from Zenodo [5].
Two datasets have been labelled. The first is aimed at fake news detection and contains 900 articles in the training set and 364 articles in the testing set. The training set contains labels such as false, partially false, true, and other, the last being used when an article could not be assigned to any of the preceding categories. The second is for topic identification and includes a training set of 318 articles and a testing set of 137 articles. Its training set also contains many label values, such as crime, climate, economy, education, elections, and health. The dataset used in this paper was obtained through [6, 7, 8]. It was collected using the approach described in [10, 11], and the four different categories of data are described in [12, 13].

A preprocessing step using the NLTK library [9] is necessary. First, the cleaning process removes email addresses, hyperlinks, numerals, and special characters. Second, words are lemmatised in order to reduce them to their dictionary form.

====4.1. Implementation====

Most machine learning algorithms are simple to implement in many programming languages; Python was used for this paper's task due to familiarity. The Scikit-learn library [14] has made implementing many machine learning algorithms much easier. The classifiers are used with Scikit-learn's default parameters, without any additional tuning; for example, n_estimators keeps its default value of 100 for the Random Forest classifier.

===5. Results and Discussion===

====5.1. Results====

The above-mentioned machine learning algorithms were tested for fake news detection and for identifying the topic of given articles. Two approaches were tested in order to obtain a more comprehensive view of the classifiers and of the noise that could influence the results. Consequently, two distinct outcomes were obtained, as shown in Tables 1 and 2. The values reported in the tables are F1 scores, where F1 is the harmonic mean of precision and recall.
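As an illustration of how an F1 score can be computed for a multi-class labelling such as the one described above, the following is a minimal sketch in plain Python. The paper does not state how per-class scores were averaged, so the support-weighted average used here is an assumption, and the function names and the toy labels are hypothetical and are not taken from the dataset.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support.

    Assumption: support-weighted averaging; the paper does not specify this.
    """
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 * support for _, _, f1, support in scores.values()) / len(y_true)


# Hypothetical toy labels, for illustration only.
gold = ["false", "false", "true", "other"]
predicted = ["false", "true", "true", "other"]
print(round(weighted_f1(gold, predicted), 2))  # 0.75
```

In Scikit-learn, the corresponding call would be `f1_score(y_true, y_pred, average="weighted")` from `sklearn.metrics`; whether the reported scores were obtained that way is not stated in the paper.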
Table 1 shows the results on the training dataset based on the labels as provided, meaning that the algorithms must determine the exact label value. For example, if an expert classifies the article as "mostly false," the algorithm must return this value. Table 2, on the other hand, puts label values such as false and mostly false (or true and mostly true) in the same category. For example, two articles A and B with labels true and mostly true will both obtain the value "true."

{| class="wikitable"
|+ Table 1: Classification Results 1
! Classifier !! Multinomial Naive Bayes !! Random Forest
|-
| accuracy || 0.18 || 0.18
|}

{| class="wikitable"
|+ Table 2: Classification Results 2
! Classifier !! Multinomial Naive Bayes !! Random Forest
|-
| accuracy || 0.59 || 0.73
|}

Table 3 displays F1 scores for category classification, which aims to identify whether an article belongs to a category such as "politic" or "health" using the algorithms mentioned in the preceding section.

{| class="wikitable"
|+ Table 3: Category Classification
! Classifier !! Multinomial Naive Bayes !! Random Forest
|-
| accuracy || 0.81 || 0.64
|}

===6. Discussion===

As shown in the preceding section, the classifiers used here to detect fake news are insufficient when the label is a non-binary outcome, but they can be exploited when labels are grouped, such as false and partially false in the same group with the value "false". This statement is supported by Table 2. The task of topic detection yielded considerably better results, particularly with Multinomial Naive Bayes, which outperformed Random Forest (Table 3); this suggests that, by adjusting its parameters, this classifier could perform even better.

===7. Conclusion===

In this paper, the advantages of machine learning algorithms for categorizing articles have been presented. Despite the lack of precision, these algorithms produced positive outcomes in this paper. This work could help in the fight against fake news by classifying articles and could support online newspapers in this challenge.
Machine learning techniques, supported by innovative human insights, are becoming more effective at text labeling, so that they are now being used to classify text and, in this case, articles. Other features, such as n-grams [15, 16] combined with machine learning algorithms, have also demonstrated advantages in classification tasks. Alternative machine learning techniques, such as deep learning, may also be useful in fake news detection and topic identification.

===References===

[1] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 1–10. URL: https://www.aclweb.org/anthology/W17-1101. doi:10.18653/v1/W17-1101.

[2] Y. Zhang, R. Jin, Z.-H. Zhou, Understanding bag-of-words model: a statistical framework, International Journal of Machine Learning and Cybernetics 1 (2010) 43–52.

[3] T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in: Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997, pp. 143–151.

[4] U. Parida, M. Nayak, A. K. Nayak, News text categorization using random forest and naïve Bayes, in: 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON), 2021, pp. 1–4. doi:10.1109/ODICON50556.2021.9428925.

[5] G. K. Shahi, J. M. Struß, T. Mandl, Task 3: Fake news detection at CLEF-2021 CheckThat!, 2021. URL: https://doi.org/10.5281/zenodo.4714517. doi:10.5281/zenodo.4714517.

[6] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 43rd European Conference on Information Retrieval, ECIR ’21, Lucca, Italy, 2021, pp. 639–649. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_75.

[7] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, S. Modha, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 12th International Conference of the CLEF Association: Information Access Evaluation Meets Multilinguality, Multimodality, and Visualization, CLEF ’2021, Bucharest, Romania (online), 2021.

[8] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2021 – Conference and Labs of the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021.

[9] S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O’Reilly Media, Inc., 2009.

[10] G. K. Shahi, AMUSED: An annotation framework of multi-modal social media data, 2020. arXiv:2010.00502.

[11] G. K. Shahi, D. Röchert, S. Stieglitz, COVID CT: Analysis and detection of different conspiracy theories on YouTube in the context of COVID-19 (????).

[12] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of COVID-19 misinformation on Twitter, Online Social Networks and Media (2021) 100104.

[13] G. K. Shahi, T. A. Majchrzak, Exploring the Spread of COVID-19 Misinformation on Twitter, Technical Report, EasyChair, 2021.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[15] S. Malmasi, M. Zampieri, Detecting hate speech in social media, arXiv preprint arXiv:1712.06427 (2017).

[16] A. Gaydhani, V. Doma, S. Kendre, L. Bhagwat, Detecting hate speech and offensive language on Twitter using machine learning: An n-gram and TFIDF based approach, arXiv preprint arXiv:1809.08651 (2018).