<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An adaptive approach to detecting fake news based on generalized text features</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andrii</forename><surname>Shupta</surname></persName>
							<email>andrii.shupta@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>Institutes 11 st. 29016</addrLine>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olexander</forename><surname>Barmak</surname></persName>
							<email>alexander.barmak@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>Institutes 11 st. 29016</addrLine>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Adam</forename><surname>Wierzbicki</surname></persName>
							<email>adamw@pjwstk.edu.pl</email>
							<affiliation key="aff1">
								<orgName type="institution">Polish-Japanese Academy of Information Technology</orgName>
								<address>
									<addrLine>Koszykowa 86 st. 02-008</addrLine>
									<settlement>Warsaw</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tetiana</forename><surname>Skrypnyk</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>Institutes 11 st. 29016</addrLine>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">An adaptive approach to detecting fake news based on generalized text features</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1ECA8C87F4DE3749A216982567A5BA67</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-06-19T15:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Fake news</term>
					<term>Fake news detection</term>
					<term>Natural Language Processing</term>
					<term>0009-0000-9771-5579 (A. Shupta)</term>
					<term>0000-0003-0739-9678 (O. Barmak)</term>
					<term>0000-0003-0075-7030 (A. Wierzbicki)</term>
					<term>0000-0002-8531-5348 (T. Skrypnyk)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fake news has become a serious problem in recent years, as it can quickly spread through social media and other online platforms. Various methods and materials can be used to detect fake news. One approach involves analyzing the content of the news, including the text and accompanying images or videos. Another approach involves considering the social context in which the news is spread, such as the news source and the mood of the people sharing it. An adaptive approach for detecting fake news using Natural Language Processing is presented in this work. It is proposed to use a feature vector constructed from generalized characteristics of news texts. The possibility of expanding the feature vector and training data sets to adapt the classifier to new types of fake news is also proposed. The experimental results, presented qualitatively (visual analytics) and quantitatively (statistical metrics), demonstrate the ability of the proposed approach to detect fake news with sufficient quality (90%). Overall, the research aims to contribute to the development of a reliable and accurate system for detecting fake news, which may have important consequences for addressing this problem in modern society.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Fake news has become a serious problem in modern society, as it can quickly spread through social media and other online platforms, influencing people's thoughts and beliefs. Detecting fake news has become an important task that requires the use of various methods and techniques to accurately identify false or misleading information.</p><p>Social media is a primary means of news consumption, especially for younger individuals, but as the popularity of consuming news on social media platforms increases, so does the prevalence of misinformation, including false information and unsupported claims. Various methods based on text and social context have been developed to identify fake news on social media, but recent studies have explored the limitations and weaknesses of these fake news detectors <ref type="bibr" target="#b0">[1]</ref>.</p><p>There are various social media platforms available to users, enabling them to post and share news online. These platforms lack verification measures for users and their posts, leading to the spread of false information by some users. Such misinformation can include propaganda targeted at individuals, society, organizations, or political parties. Due to the sheer volume of content, it is challenging for humans to detect all instances of fake news, highlighting the need for automated machine learning classifiers <ref type="bibr" target="#b1">[2]</ref>.</p><p>Fake news detection methods are commonly trained on data that is available at the time of training, which may not be applicable to future events. This is because many of the labeled samples used for training on verified fake news may become outdated quickly as new events emerge <ref type="bibr" target="#b2">[3]</ref>.</p><p>In the study, an adaptive approach to detecting fake news is proposed, based on a transparent, interpreted feature vector constructed from generalized characteristics of news texts.
The adaptability of the approach lies in the ability to supplement the feature vector with new characteristics and to build a set of classifiers on different training sets.</p><p>The contributions of the article are as follows:</p><p>• an adaptive approach to fake news detection is proposed based on a feature vector constructed from generalized content characteristics; • the ability of the proposed approach to detect fake news with acceptable values of statistical metrics is demonstrated. The structure of the article is as follows: Section 2. Related works provides an overview and analysis of modern approaches to fake news detection and formulates the research goal. Section 3. Methods and Materials describes the proposed adaptive approach to fake news detection. Section 4. Results and discussion presents the research results, including visual and numerical values of statistical metrics, their correlation with similar research, and the confirmation of the proposed approach's ability to detect fake news. The further prospects of the proposed approach are discussed. Finally, the conclusions are presented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Numerous studies have been conducted to detect fake news using different techniques and methods. In <ref type="bibr" target="#b3">[4]</ref>, the authors proposed a novel method for detecting fake news by combining various features, including text and user-based features, and using deep learning models. The fundamental algorithms used in their study are an extension of traditional Convolutional Neural Networks (CNNs) to graphs. This enables the combination of dissimilar types of data such as content, user profile and activity, social graph, and news propagation. They achieved an accuracy of 92.7%.</p><p>Another study <ref type="bibr" target="#b4">[5]</ref> focused on using linguistic features. Their study utilized a dataset comprised of two datasets containing an equal number of true and fake news articles related to politics. To extract linguistic and stylometric features, text fields from the dataset were utilized, and bag-of-words TF and BOW TF-IDF vectors were generated. A variety of machine learning models, including bagging and boosting methods, were then applied to achieve the highest level of accuracy.</p><p>In study <ref type="bibr" target="#b5">[6]</ref>, two machine learning algorithms were evaluated using word n-gram and character n-gram analysis for fake news detection. The experimental results showed that character n-grams combined with Term Frequency-Inverse Document Frequency (TF-IDF) achieved better performance, with a Gradient Boosting Classifier achieving an accuracy of 96%.</p><p>Finally, in <ref type="bibr" target="#b6">[7]</ref>, the authors of this article proposed a theory-driven model to detect fake news, which examines news content at different levels, including the lexicon, syntax, semantics, and discourse.
They used well-established theories in social and forensic psychology to represent news at each level and conducted fake news detection within a supervised machine learning framework. As an interdisciplinary study, their work aims to explore potential patterns in fake news, improve interpretability in fake news feature engineering, and investigate the relationships between fake news, deception/disinformation, and clickbait.</p><p>Based on the analysis of related work, various weaknesses in the approaches can be identified. One of them is the inadequate quality of the data on which the model is based. If the model is trained on incorrect or insufficient data, it may classify news incorrectly.</p><p>Another factor is the speed at which news spreads on the Internet. Fake news can quickly gain popularity and spread faster than any model can detect it. It is also important to consider that fake news may contain some truthful information, making its detection more difficult.</p><p>Yet another reason is the changing technologies and approaches to creating fake news. As new technologies emerge over time that allow for more convincing fake news, models created to detect previous versions of fake news may be ineffective. It is also important to consider that most approaches to detecting fake news are based on machine learning, which can be vulnerable to attacks by malicious actors. For example, malicious actors can train the model to classify a certain type of news as fake by changing the content of the news.</p><p>Therefore, the aim of this work is to propose an approach that can adapt to the changing nature of fake news. The approach should retrain on new data, use previous results, and improve the accuracy of detecting fake news. Additionally, the approach should allow for expanding the set of features to detect new types of fake news.
In summary, the adaptive approach should add new features, combine them with existing ones, and provide explanations of what exactly influences the result of fake news detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods and materials</head><p>In this work, a new approach is proposed for detecting fake news, which is based on analyzing generalized characteristics of the content rather than just the text itself. To detect fake news, experts use a set of generalized content characteristics. Typically, the text is examined for faulty reasoning (arguments are supported by "rotten" evidence, quotes are attributed to unknown sources, numerical figures are presented without indicating their sources, etc.). Indicators of faulty reasoning include: theses not supported by credible evidence, common myths instead of arguments, lack of specific data and sources, and so on. Text can also be evaluated for emotionally charged content that manipulates the reader, with the goal of making the reader a "useful idiot." This is achieved through exaggeration, epithets, negatively connotated words, and strong emotional appeals that shut down the reader's logic and encourage them to act based on outrage. The industry of creating fake news is constantly evolving, and other methods of creating them are possible. Therefore, there is a need to propose an approach that would allow an expert to analyze the text based on its existing characteristics and also provide tools to add new characteristics and "retrain" classifiers on new sets of fake news. The proposed approach consists of a method of training classifiers (based on various characteristics of the text and training data sets) (Figure <ref type="figure" target="#fig_0">1</ref>) and a method of classification using the selected classifier (Figure <ref type="figure" target="#fig_1">2</ref>).</p><p>As can be seen from Figure <ref type="figure" target="#fig_0">1</ref>, the input information for the classifier training method consists of a training set of labeled fake and non-fake news and a set of methods for obtaining numerical characteristics of the text. The next step is text preprocessing.
Then, the news text is transformed into a feature vector using the methods of obtaining numerical characteristics. The resulting labeled set of feature vectors is fed into the classifier. The classifier can be any machine learning (ML) or deep learning (DL) method. The resulting classifier is analyzed for its ability to classify both the training and testing data sets. After evaluation, the classifier can be used for detecting fake news.</p><p>For the classification method (Figure <ref type="figure" target="#fig_1">2</ref>), the input information is arbitrary news text and a classifier selected by an expert. The result of the method is to determine whether the news belongs to a fake or non-fake category.</p><p>Further, we will describe in detail the main steps of the presented methods and the algorithms and methods used in the research for transforming text information.</p></div>
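The training and classification methods described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `MajorityClassifier` is a trivial stand-in for any ML/DL estimator with `fit`/`predict` (an SVM from scikit-learn would fit the same interface), and the feature functions are invented placeholders for the generalized text characteristics detailed below.

```python
class MajorityClassifier:
    """Trivial stand-in for any ML/DL model with fit/predict (e.g. an SVM)."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)  # remember the majority label
        return self

    def predict(self, X):
        return [self.label] * len(X)

def train_fake_news_classifier(texts, labels, feature_funcs, clf):
    """Figure 1 sketch: texts -> feature vectors -> fitted classifier."""
    X = [[f(t) for f in feature_funcs] for t in texts]
    return clf.fit(X, labels)

def classify(text, feature_funcs, clf):
    """Figure 2 sketch: one article -> feature vector -> fake / non-fake label."""
    return clf.predict([[f(text) for f in feature_funcs]])[0]

# Hypothetical feature functions standing in for f1..f10 described below.
features = [lambda t: len(t.split()) / 100.0, lambda t: t.count("!") / max(len(t), 1)]

clf = train_fake_news_classifier(
    ["Shocking!!! You will not believe this!", "Council approves budget."],
    ["fake", "fake"], features, MajorityClassifier())
print(classify("Unbelievable scandal!!!", features, clf))  # fake
```

The adaptability of the approach shows up directly in this shape: extending the feature set means appending one more function to `feature_funcs` and refitting on an updated training set.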
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Textual content analysis tools</head><p>To analyze the ability of the proposed approach to detect fake news, the spaCy Python NLP library <ref type="bibr" target="#b7">[8]</ref> was used, which includes a range of natural language processing tools, including named entity recognition, part-of-speech tagging, and dependency parsing. The large spaCy English model was used, which includes pre-trained word embeddings that can be used for computing similarity between texts, as well as the spacytextblob <ref type="bibr" target="#b8">[9]</ref> library for determining sentiment and polarity. Additionally, scikit-learn <ref type="bibr" target="#b9">[10]</ref> was used for computing Multidimensional Scaling <ref type="bibr" target="#b10">[11]</ref> and Support Vector Machine <ref type="bibr" target="#b11">[12]</ref>. Although there are several NLP libraries available, the use of spaCy and scikit-learn was due to their ease of use and access to pre-trained models, such as the BERT base model, and the ability to work with a pre-trained Ukrainian model. Other alternative libraries include NLTK, Stanford CoreNLP, and Gensim. However, the analysis showed that spaCy provides the best combination of performance and ease of use to achieve the research goal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Pre-processing</head><p>The first step in preparing text for NLP processing involves cleaning the text and removing any irrelevant or unnecessary information <ref type="bibr" target="#b12">[13]</ref>. This typically involves removing punctuation marks, numbers, and stop words, which are common words that do not carry much meaning, such as "the," "and," and "a." In the proposed approach, the built-in stop word list from spaCy is used to remove stop words from the text. Removing stop words is important because it can help reduce noise in the text and facilitate the identification of important words and phrases. After cleaning the text, it is tokenized using the spaCy tokenizer, which breaks the text into individual tokens or words. Each token is assigned a part-of-speech tag that indicates the role the word plays in the sentence, such as noun, verb, or adjective. Next, the spaCy lemmatizer is used to reduce each token to its base form or lemma. Lemmatization is important because it can help reduce the complexity of the text and facilitate comparisons between words that have the same root or meaning. These processing steps can be useful in detecting fake news by facilitating the identification of important words and phrases in the text and removing irrelevant or unnecessary information. Reducing the complexity of the text and identifying key words allows for better detection of patterns and features in the text that indicate fake news or biased language.</p></div>
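The cleaning steps above can be sketched as follows. This is a simplified stand-in: the actual pipeline uses spaCy's built-in stop-word list, tokenizer, and statistical lemmatizer, while this sketch uses a toy stop-word list and plain whitespace tokenization (no lemmatization) purely to illustrate the order of operations.

```python
import re
import string

# Toy stop-word list; an assumption standing in for spaCy's built-in English list.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def preprocess(text: str) -> list[str]:
    """Lowercase, drop numbers and punctuation, remove stop words, tokenize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                                  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("The quick brown fox, and 2 dogs, jumped in the park!"))
# ['quick', 'brown', 'fox', 'dogs', 'jumped', 'park']
```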
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Characteristics of textual content</head><p>Next, the set of text characteristics used in this study will be considered. It should be noted that it is not fixed. These characteristics are used to analyze the ability of the proposed approach to solve the task at hand. It should also be noted that the proposed approach is adaptive, allowing for the expansion of both the set of text characteristics and the training data sets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">Persuasion and influence</head><p>Fake news can be convincing and influential because it often uses language and tactics aimed at manipulating the reader's emotions and beliefs. For example, fake news can use biased language to appeal to the reader's existing beliefs and values, or use persuasion techniques such as repetition, paraphrasing, and dehumanizing language to influence the reader's perception of the topic.</p><p>Language bias refers to language that expresses preference or bias towards a particular group or belief system. In fake news, biased language can be used to draw attention to readers who share similar beliefs or values, as well as to reinforce the beliefs of those who already agree with the message. For example, a news article that criticizes a particular political figure may use derogatory language to appeal to readers who already oppose that figure, while also strengthening negative beliefs among readers.</p><p>Subjectivity is another important factor in biased language, as it can complicate an objective evaluation of the content of a news article. Fake news can use intentionally subjective or emotional language to sway the reader's opinion or beliefs. For example, an article that presents a certain political figure in a negative light may use language intended to provoke the reader's feelings of anger or sadness in order to influence their beliefs about that figure.</p><p>Other methods commonly used in fake news include paraphrasing, repetitive narratives, dehumanizing language, and objectification. These techniques can be used to reinforce the message of an article and make it more memorable and influential for the reader.
For example, an article that criticizes a certain group may use dehumanizing language to make the group seem less sympathetic or relatable, making it easier for the reader to dismiss their concerns or opinions.</p><p>According to the given characteristics of the text, it is suggested to use the following parameters:</p><p>• 𝑓 1 -paraphrased_ratio: the paraphrasing coefficient measures the share of information that has already been stated but is repeated for some purpose; this parameter is calculated by comparing each sentence with the sentences that precede it; measured from 0 to 1, where 0 means no paraphrasing and 1 means a complete repetition of the text; • 𝑓 2 -dehumanizing_language_ratio: the coefficient of "deprivation" of human treatment; this parameter is computed by counting proper nouns in the sentence and part-of-speech mismatches; measured from 0 to 1, where 0 means normal address and 1 means maximal dehumanization; • 𝑓 3 -subjective_words_ratio: the coefficient of subjective words shows the subjectivity of the text; determined using the spacytextblob component, which contains a ready-made subjectivity indicator for English words; measured from 0 to 1 according to increasing subjectivity.</p></div>
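A minimal sketch of the paraphrasing coefficient 𝑓1 described above. The actual pipeline compares sentences with spaCy's vector similarity; here Jaccard overlap of token sets is used as a hypothetical stand-in, and the 0.5 similarity threshold is an assumed parameter.

```python
def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity; a stand-in for spaCy's vector similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

def paraphrased_ratio(sentences: list[str], threshold: float = 0.5) -> float:
    """f1 sketch: share of sentences that largely repeat an earlier sentence."""
    token_sets = [set(s.lower().split()) for s in sentences]
    repeated = sum(
        any(jaccard(token_sets[i], token_sets[j]) >= threshold for j in range(i))
        for i in range(1, len(token_sets))
    )
    return repeated / len(sentences) if sentences else 0.0

sentences = [
    "the economy is collapsing",
    "experts disagree with the claim",
    "the economy is collapsing fast",
]
print(paraphrased_ratio(sentences))  # one of three sentences repeats an earlier one
```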
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Narrative</head><p>Narrative is one of the components of detecting fake news. It is important for news to have a clear and consistent narrative that is related to the headline and the overall essence of the text. The narrative is revealed through analyzing the context of the news and describes the logical order of events or information contained in the text.</p><p>Special attention should be paid to the narrative in the case of fake news, as they may contain illogical connections between the information that reaches the reader and the headline. Fake news often contains attempts to change the audience's opinion or create a nonexistent problem, which can lead to social division or panic. In such cases, the narrative may be inconsistent and illogical, which is a sign of a fake.</p><p>In analyzing the narrative, it is important to evaluate not only the connections between the news headline and the text, but also the logical connections between events and facts presented in the text. This makes it possible to detect fake news that may contain illogical and conflicting connections between facts and events.</p><p>Therefore, detecting fake news depends on how clearly they are structured and logically connected. 
The more attention is paid to the narrative, the greater the possibility of detecting fake news and preventing the spread of false information among the audience.</p><p>According to the given characteristics of the text, it is suggested to use the following parameter:</p><p>• 𝑓 4 -header_summary_similarity_ratio: the similarity coefficient between the title of the article and its body; determined by comparing the title and body of the article using the similarity method of the spaCy library; a summary of the body of the article is obtained by selecting the most important sentences based on their similarity to the rest of the article; measured from 0 to 1, where 0 means dissimilarity and 1 means identity of title and body.</p></div>
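The title-body similarity coefficient 𝑓4 can be illustrated as follows. The actual pipeline uses spaCy's similarity method over pre-trained word embeddings; this sketch substitutes a simple bag-of-words cosine similarity and skips the summarization step, so it is only an approximation of the described behavior.

```python
import math
from collections import Counter

def cosine_bow(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for spaCy's similarity method."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

title = "city council approves new budget"
body = "the city council approves a new budget for schools and roads"
print(round(cosine_bow(title, body), 3))  # high value: the title matches the body
```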
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3.">Sentiment and Linguistic Analysis</head><p>Sentiment analysis and linguistic analysis are widely used methods in detecting fake news. An important part of these methods is identifying unusual and illogical textual structures, as fake news may contain vague and unmotivated statements that contradict the headline or general idea of the news.</p><p>To analyze the sentiment of fake news texts, methods that allow the determination of the average, positive, negative, and neutral mood are used. Machine learning algorithms and language analysis are typically applied for this purpose. Identifying such parameters helps to distinguish fake news from real news because fake news may have an overly positive or negative sentiment that does not correspond to the content of the news. These methods are an important tool in combating the harmful effects of fake news on society and enable informed conclusions to be made about the veracity of the text.</p><p>According to the given characteristics of the text, it is suggested to use the following parameters:</p><p>• 𝑓 5 -unusual_inappropriate_language_ratio: the coefficient of unusual or inappropriate language shows how many unusual words there are; determined by checking tokens (words and other elements) against the standard categories is_alpha and is_punct and for presence in the vocabulary; measured from 0 to 1 relative to the number of words in the entire text; • 𝑓 6 -awkward_text_ratio: the coefficient of awkward, complex, or convoluted sentence structures; determined by counting the dependency labels "amod", "compound", "nsubj", "dobj", and "pobj" in the linguistic tagging of the text; measured from 0 to 1 according to the number of complex "tokens"; • 𝑓 7 -avg_sentiment: the sentiment coefficient shows the average sentiment of the text; determined using the spacytextblob component, which contains a ready-made polarity indicator; measured from -1 to 1, where -1 is negative, 0 is neutral, and 1 is positive; • 𝑓 8 -positive_ratio: the positivity coefficient shows how positive the text is; determined as the share of positive words in the entire text; measured from 0 to 1; • 𝑓 9 -neutral_ratio: the neutrality coefficient shows how neutral the text is; determined as the share of neutral words in the entire text; measured from 0 to 1; • 𝑓 10 -negative_ratio: the negativity coefficient shows how negative the text is; determined as the share of negative words in the entire text; measured from 0 to 1.</p></div>
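A toy sketch of the positivity, neutrality, and negativity coefficients 𝑓8 through 𝑓10. The POSITIVE and NEGATIVE word lists below are invented for illustration only; in the actual pipeline, word-level sentiment comes from the spacytextblob component.

```python
# Toy sentiment lexicons; assumptions standing in for spacytextblob's scores.
POSITIVE = {"good", "great", "win", "success", "hope"}
NEGATIVE = {"bad", "crisis", "fail", "disaster", "fear"}

def sentiment_ratios(tokens: list[str]) -> dict[str, float]:
    """f8-f10 sketch: shares of positive, neutral, and negative words."""
    n = len(tokens) or 1
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return {
        "positive_ratio": pos / n,
        "neutral_ratio": (n - pos - neg) / n,
        "negative_ratio": neg / n,
    }

print(sentiment_ratios(["great", "crisis", "ahead", "fear", "spreads"]))
```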
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Evaluation of the validity of the proposed feature vector</head><p>To assess the quality of the proposed feature vector for classification tasks, the use of the Multidimensional Scaling (MDS) method is proposed. This is one of the methods for reducing the dimensionality of the vector space. The aim of the method is to reduce the dimensionality to a level that can be visualized (3D or 2D). The criterion for dimensionality reduction is, for example, the Euclidean distance between vectors. That is, by solving an optimization problem, an R^n → R^2 mapping is found that makes it possible to obtain a two-dimensional graph of the mutual arrangement of vector points and visually assess the quality of the model for the classification task. Visual criteria have been proposed to assess the quality of modeling (Figure <ref type="figure">3</ref>). The proposed criteria are recommended to be used to verify the quality of the proposed feature vector. The feature vector will be considered correct if the values of the results appear as shown.</p><p>Subsequently, if the feature vector allows for the separation of two classes of news, a classifier is proposed to be obtained.</p><p>The next step is to evaluate the quality of the proposed classifier using the following metrics: precision, recall, and F1-score.</p><p>In machine learning, precision and recall are indicators of performance <ref type="bibr" target="#b13">[14]</ref>.
They apply to information retrieved from a sample, collection, or corpus.</p><p>Precision shows what proportion of the retrieved results are relevant to the query <ref type="bibr" target="#b14">[15]</ref>, and is given by the formula:</p><formula xml:id="formula_0">precision = |relevant documents ∩ retrieved documents| / |retrieved documents|<label>(2)</label></formula><p>The best result for classification problems is a score of 1.0, when every sample predicted to belong to a class actually belongs to it (however, the number of relevant samples that were missed is unknown). Relevant documents here correspond to correctly classified ones.</p><p>Recall shows the share of relevant documents that are successfully retrieved <ref type="bibr" target="#b15">[16]</ref>, and is formally defined as follows:</p><formula xml:id="formula_1">recall = |relevant documents ∩ retrieved documents| / |relevant documents|<label>(3)</label></formula><p>The F-measure is calculated from precision and recall. It is common to use the measure Fβ, in which β, depending on its value, gives more weight to either precision or recall. However, the F1 measure is most often used. F1 is the harmonic mean of precision and recall <ref type="bibr" target="#b16">[17]</ref>, which can be formally written as follows:</p><formula xml:id="formula_2">F1 = 2 × (precision × recall) / (precision + recall)<label>(4)</label></formula><p>The best score for F1 is 1.0, which means that precision and recall are ideal. The worst score is 0, when either precision or recall is zero. Given the popularity of the F1 measure, it should be noted that it can give misleading results on an unbalanced data set, so it should be used only on a balanced set <ref type="bibr" target="#b17">[18]</ref>.</p><p>These metrics are used to evaluate the results.</p></div>
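Formulas (2) through (4) can be computed directly from binary labels; a small self-contained example (the label vectors are invented for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute formulas (2)-(4) for a binary labeling (1 = fake, 0 = non-fake)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1]  # hypothetical gold labels
y_pred = [1, 0, 1, 0, 1, 1]  # hypothetical classifier output
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```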
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and discussion</head><p>A number of experiments were conducted to test the proposed approach and evaluate the validity of the feature vector. Below are their results and discussion. A description of the dataset used in the experiments is given. The result of the application of visual analytics to assess the ability of the proposed features of fake news texts to be divided into two classes is given. Visual and numerical (statistical metrics) results of classifier training (using SVM) are given. The discussion was carried out and the prospects of the proposed approach were given.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Dataset</head><p>The dataset <ref type="bibr" target="#b18">[19]</ref> contains over 20,000 true and fake news articles, labeled and categorized. It is very popular in the data science community and has been used in many articles and studies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">MDS</head><p>The results of applying the MDS method to the input data (generalized features of news texts) in 2-dimensional space are shown in Figure <ref type="figure" target="#fig_3">4</ref>. As can be seen from Figure <ref type="figure" target="#fig_3">4</ref>, the result is satisfactory: the classification was successful for most of the texts from the training set. Analysis of the small number of misclassified texts showed that some true articles are written with poorer text quality and vice versa.</p></div>
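The MDS projection can be sketched with scikit-learn on synthetic data. The two Gaussian clusters below are invented placeholders for the 10-dimensional feature vectors of fake and real news, not the paper's dataset; the point is only to show the R^10 to R^2 mapping that preserves pairwise Euclidean distances.

```python
import numpy as np
from sklearn.manifold import MDS

# Synthetic placeholders for the 10-dimensional feature vectors (f1..f10)
# of fake and real news; invented clusters, not the paper's data.
rng = np.random.default_rng(0)
fake = rng.normal(0.7, 0.1, size=(20, 10))
real = rng.normal(0.3, 0.1, size=(20, 10))
X = np.vstack([fake, real])

# Embed the 10-dimensional vectors in 2D for visual inspection of class overlap.
embedding = MDS(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)  # (40, 2)
```

The 2D `embedding` can then be scattered and judged against the visual criteria of Figure 3.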
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">SVM</head><p>After calculating the MDS, the data can be passed to the train_test_split method to split it into training and test samples. Using the SVM implementation from the scikit-learn library, we obtained the following results: after the number of news articles exceeded 2000, the results became consistent, and we can consider them representative of the whole dataset.</p><p>The obtained numerical results show the high accuracy of the proposed approach for detecting fake news. The given values of the statistical metrics are within the range of, or even better than, recently published results of other researchers.</p></div>
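The SVM step can be sketched with scikit-learn on synthetic feature vectors (the two clusters below are invented placeholders, not the dataset used in the experiments):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic feature vectors as placeholders for the dataset's f1..f10 values.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.7, 0.1, (200, 10)), rng.normal(0.3, 0.1, (200, 10))])
y = np.array([1] * 200 + [0] * 200)  # 1 = fake, 0 = real

# Hold out a quarter of the articles for evaluation, then fit an RBF-kernel SVM.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```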
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations of the approach and further research</head><p>The main limitation of the proposed approach is the lack of high-quality labeled datasets for successful training of classifiers, especially for the Ukrainian language. Another limitation is the insufficient number of generalized characteristics of texts that allow detecting more hidden ways of creating fakes. However, it should be noted that these limitations are not significant for the proposed approach since it allows for adaptation, building new interpreted and transparent classifiers using both new datasets and additional generalized text features.</p><p>The future development of the approach to fact-checking may include the integration of external APIs to gather more detailed information and fact-check claims made in articles. These APIs may be from verified sources such as news agencies, government organizations, or other fact-checking organizations. This will help improve the accuracy and reliability of the fact-checking process.</p><p>Another potential development could be checking information on different social media platforms such as Twitter to verify the popularity and authenticity of claims. This can be done by analyzing the number of likes, retweets, and article publications, as well as verifying the sources of information to ensure their reliability. Additionally, the approach can also detect the toxicity of comments on social media platforms such as Twitter.</p><p>Finally, the approach can be extended to detect content created by artificial intelligence (AI). This may involve analyzing the language and structure of text to detect patterns that are commonly used in AI-generated content. Detecting AI-generated content will help prevent the spread of misinformation and disinformation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>An adaptive approach to identifying fake news using natural language processing techniques and machine learning algorithms is presented in this work. A thorough review of related work was conducted to ensure the novelty and effectiveness of the proposed approach. Ten parameters (generalized text features) were used to model the text, and multidimensional scaling (MDS) was applied to obtain visual analytics as one of the criteria for evaluating the quality of the proposed approach. A support vector machine (SVM) classifier was trained to classify texts into different categories. The research results show that the proposed approach matches or surpasses existing methods in accuracy (overall accuracy above 90%).</p><p>Limitations of the proposed approach include the absence of high-quality annotated datasets (especially for the Ukrainian language) for successful classifier training and an insufficient number of generalized text features (for detecting more hidden ways of creating fake news). These limitations are not critical, as the proposed adaptive approach is capable of incorporating new datasets and new generalized features for retraining.</p><p>Future improvements to the approach will be directed towards increasing the accuracy of identifying fake news and achieving greater interpretability and understanding of classification results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Scheme of the classifier training method</figDesc><graphic coords="3,78.48,362.40,202.56,348.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Scheme of the classification method</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The quality of the feature vector for the classification problem: (a) ideal, (b) acceptable, (c) satisfactory. Criteria 1: an ideal feature vector for text classification. Figure 3 (a) shows that the two classes are clearly separated. Criteria 2: an acceptable feature vector for text classification. Figure 3 (b) shows that the two classes border each other, but individual members of the classes do not intersect. Criteria 3: a satisfactory model level for text classification. Figure 3 (c) shows that the two classes overlap somewhat. With such an indicator, the model can be considered workable, but an additional expert opinion will be required to confirm the classification.</figDesc><graphic coords="7,222.36,96.84,132.00,61.32" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: MDS results for 2000 articles. As can be seen from Figure 4, the result is satisfactory: the classification was successful for most texts in the training set. Analysis of the small number of misclassified texts showed that some true articles are written with poorer text quality, and vice versa.</figDesc><graphic coords="8,72.00,384.36,452.16,242.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: SVM decision boundary for 2000 elements</figDesc><graphic coords="9,72.00,234.48,440.52,216.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: SVM decision boundary for 200 elements</figDesc><graphic coords="9,72.00,515.04,408.00,221.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Comparison of the metrics for the classification problem</figDesc><table><row><cell>Number of news articles, N</cell><cell>Precision</cell><cell>Recall</cell><cell>F1</cell></row><row><cell>20</cell><cell>1.0</cell><cell>1.0</cell><cell>1.0</cell></row><row><cell>200</cell><cell>0.88</cell><cell>0.82</cell><cell>0.85</cell></row><row><cell>2000</cell><cell>0.93</cell><cell>0.92</cell><cell>0.93</cell></row></table></figure>
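The metrics in Table 1 follow the standard definitions in terms of true positives (tp), false positives (fp), and false negatives (fn). A minimal worked example (the counts below are hypothetical, chosen only to illustrate the arithmetic):

```python
# Precision, recall and F1 from hypothetical confusion-matrix counts.
tp, fp, fn = 92, 7, 8  # hypothetical true positives, false positives, false negatives

precision = tp / (tp + fp)  # fraction of predicted fakes that really are fakes
recall = tp / (tp + fn)     # fraction of actual fakes that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # prints: 0.93 0.92 0.92
```

F1 is used because it penalizes classifiers that trade one error type for the other, which matters when the cost of missing a fake differs from the cost of flagging a true article.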
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Haoran</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yingtong</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Canyu</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lichao</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Shu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2302.07363</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.07363" />
		<title level="m">Attacking Fake News Detectors via Manipulating News Social Engagement</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Detecting Fake News Using Machine Learning</title>
		<author>
			<persName><forename type="first">Alim</forename><forename type="middle">Al Ayub</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ayman</forename><surname>Aljabouh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Praveen</forename><surname>Kumar Donepudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Myung</forename><forename type="middle">Suh</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2102.04458</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2102.04458" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
	<title level="a" type="main">Fake news detection based on news content and social contexts: a transformer-based approach</title>
		<author>
			<persName><forename type="first">Shaina</forename><surname>Raza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><surname>Ding</surname></persName>
		</author>
		<idno type="DOI">10.1007/s41060-021-00302-z</idno>
		<ptr target="https://doi.org/10.1007/s41060-021-00302-z" />
	</analytic>
	<monogr>
		<title level="j">International Journal of Data Science and Analytics</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Fake News Detection on Social Media using Geometric Deep Learning</title>
		<author>
			<persName><forename type="first">Federico</forename><surname>Monti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabrizio</forename><surname>Frasca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Davide</forename><surname>Eynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Damon</forename><surname>Mannion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">M</forename><surname>Bronstein</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1902.06673</idno>
		<ptr target="https://doi.org/10.48550/arXiv.1902.06673" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Machine Learning based Fake News Detection using linguistic features and word vector features</title>
		<author>
			<persName><forename type="first">Mayank</forename><forename type="middle">Kumar</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dinesh</forename><surname>Gopalani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yogesh</forename><forename type="middle">Kumar</forename><surname>Meena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajesh</forename><surname>Kumar</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/document/9376576" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Hnin</forename><forename type="middle">Ei</forename><surname>Wynne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zar</forename><forename type="middle">Zar</forename><surname>Wint</surname></persName>
		</author>
		<idno type="DOI">10.1145/3366030.3366116</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3366030.3366116" />
		<title level="m">Content Based Fake News Detection Using N-Gram Models</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Xinyi</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Atishay</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vir</forename><forename type="middle">V</forename><surname>Phoha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Reza</forename><surname>Zafarani</surname></persName>
		</author>
		<idno type="DOI">10.1145/3377478</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3377478" />
		<title level="m">Fake News Early Detection: A Theory-driven Model</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<ptr target="https://spacy.io" />
		<title level="m">spaCy, Python library for NLP processing</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<ptr target="https://spacy.io/universe/project/spacy-textblob" />
		<title level="m">Sentiment analysis component for spaCy</title>
				<imprint/>
	</monogr>
	<note>spacytextblob</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<ptr target="https://scikit-learn.org/stable" />
		<title level="m">scikit-learn, library for SVM classification, MDS, and related methods</title>
				<imprint/>
	</monogr>
	<note>MDS</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Review of the Development of Multidimensional Scaling Methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mead</surname></persName>
		</author>
		<idno>JSTOR 234863</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of the Royal Statistical Society. Series D (The Statistician)</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="27" to="39" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Support-vector networks</title>
		<author>
			<persName><forename type="first">Corinna</forename><surname>Cortes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vladimir</forename><surname>Vapnik</surname></persName>
		</author>
		<idno type="DOI">10.1007/BF00994018</idno>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="273" to="297" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<ptr target="https://iq.opengenus.org/text-preprocessing-in-spacy" />
		<title level="m">Text Preprocessing in Python using spaCy</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Yacouby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Axman</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.eval4nlp-1.9</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Association for Computational Linguistics (ACL)</title>
				<meeting>the First Workshop on Evaluation and Comparison of NLP Systems, Association for Computational Linguistics (ACL)<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="79" to="91" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A Survey on Performance Metrics for Object-Detection Algorithms</title>
		<author>
			<persName><forename type="first">R</forename><surname>Padilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Netto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A B</forename><surname>Da Silva</surname></persName>
		</author>
		<idno type="DOI">10.1109/IWSSIP48289.2020</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Systems, Signals and Image Processing (IWSSIP)</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Precision-Recall Curve (PRC) Classification Trees</title>
		<author>
			<persName><forename type="first">J</forename><surname>Miao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.1007/s12065-021-00565-2</idno>
	</analytic>
	<monogr>
		<title level="j">Evolutionary Intelligence</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">An improved ensemble approach for dos attacks detection</title>
		<author>
			<persName><forename type="first">R</forename><surname>Alguliyev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aliguliyev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ya</forename><surname>Imamverdiyev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sukhostat</surname></persName>
		</author>
		<idno type="DOI">10.15588/1607-3274-2018-2-8</idno>
	</analytic>
	<monogr>
		<title level="j">Radio Electronics, Informatics, Management</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="73" to="82" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Classification assessment methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tharwat</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.aci.2018.08.003</idno>
	</analytic>
	<monogr>
		<title level="j">Applied Computing and Informatics</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="168" to="192" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">Clement</forename><surname>Bisaillon</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset" />
		<title level="m">Fake and real news dataset</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<ptr target="https://scikit-learn.org/0.18/auto_examples/svm/plot_iris.html" />
		<title level="m">SVM Decision Boundary</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
