=Paper=
{{Paper
|id=Vol-3723/paper24
|storemode=property
|title=Information technology for identifying disinformation sources and inauthentic chat users' behaviours based on machine learning
|pdfUrl=https://ceur-ws.org/Vol-3723/paper24.pdf
|volume=Vol-3723
|authors=Victoria Vysotska,Lyubomyr Chyrun,Sofia Chyrun,Ilia Holets
|dblpUrl=https://dblp.org/rec/conf/modast/VysotskaCCH24
}}
==Information technology for identifying disinformation sources and inauthentic chat users' behaviours based on machine learning==
Victoria Vysotska¹,†, Lyubomyr Chyrun²,†, Sofia Chyrun¹,† and Ilia Holets¹,∗,†
¹ Lviv Polytechnic National University, Stepan Bandera 12, 79013 Lviv, Ukraine
² Ivan Franko National University of Lviv, University 1, 79000 Lviv, Ukraine
Abstract
For this study, the main tasks were formulated as the recognition of propaganda messages and the analysis of how propaganda messages spread between groups, thereby uncovering propaganda networks. The relevance, object and subject of the research are described. Means of influencing public opinion are studied, in particular propaganda, especially russian propaganda. A search for and review of data for the study is conducted. Several datasets of different structures are selected for further research. The quality of the labelled data, the balance of the labels and the presence of empty values are checked. Several regularities of specific datasets are deduced (regarding the length of messages and the use of certain emoticons and keywords). Text data are processed using a combination of different methods (conversion to lowercase, extraction of characters with regular expressions, removal of stop words, stemming). The unique words present in the texts of the dataset are reviewed. A word-frequency table is compiled, which demonstrates some regularities for certain words. The best parameters for the binary classification of propaganda with machine learning models are selected, such as logistic regression (1-2-3-grams, 5000 features). A classification accuracy of ~0.85 is achieved. Binary classification is also tested with artificial neural networks, namely: a simple fully connected neural network with count vectors at the input, a neural network with embeddings at the input, and a transformer. Different methods of measuring the similarity (distance) between texts are tested: cosine similarity with different types of input vectors, Jaccard similarity, Levenshtein distance, and cosine similarity on loaded word2vec-google-news-300 embeddings. The last method proved to be the best, so it is used further. On several datasets, part of the messages were compared with each other using the chosen distance measure. Similar messages are found that most likely were relayed, both within the same group and between different groups. A method of counting relays (reposts) between groups has been created, which allows the recognition of propaganda distribution networks.
Keywords
Disinformation, fake, propaganda, linguistic analysis, natural language processing, machine learning, cyber warfare, artificial intelligence, semantic analysis, information security
MoDaST-2024: 6th International Workshop on Modern Data Science Technologies, May 31 - June 1, 2024, Lviv-Shatsk, Ukraine
∗ Corresponding author.
† These authors contributed equally.
victoria.a.vysotska@lpnu.ua (V. Vysotska); Lyubomyr.Chyrun@lnu.edu.ua (L. Chyrun); sofiia.chyrun.sa.2022@lpnu.ua (S. Chyrun); illia.holets.sa.2020@lpnu.ua (I. Holets)
ORCID: 0000-0001-6417-3689 (V. Vysotska); 0000-0002-9448-1751 (L. Chyrun); 0000-0002-2829-0164 (S. Chyrun); 0000-0003-4747-3684 (I. Holets)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction
The spread of disinformation, propaganda, and fake information has become a
particularly acute problem with the spread of the Internet and social media [1-3]. Now
anyone can create their site or group on a social network and share almost any
information. The problem has already been studied, but this area of research is still quite
young. In addition, the methods of creating and spreading dishonest information are
constantly changing and improving. Moreover, during the war, the spread of russian
propaganda is a big problem for Ukraine and the whole world [4-6].
Misinformation is defined as "factually incorrect information that is not supported by
evidence." Disinformation in social networks has become an urgent and vital problem,
especially in areas related to the war in Ukraine. Such information obtained from social
media, including thematic online communities, can influence the results of public opinion
formation, control public sentiment, and accordingly affect the course of war as a whole
[1-3]. Concerns about misinformation have grown with the rise of requests for relevant
information on social media. The lack of safeguards during discussions in online
communities contributes to the spread and reinforcement of misinformation. The existing
literature mostly focuses on the detection of fake reviews and fake news; however, the
literature lacks a comprehensive theoretical framework designed to detect
misinformation, especially in the context of an online community. Considering the huge
amount of misinformation about the war in Ukraine that is spreading in the relevant online
communities, there is a need to develop an effective model to achieve automatic
detection of disinformation in the context of identification of inauthentic behaviour (bots)
of coordinated groups. Stopping the spread of disinformation in social networks during
an information war has long been a public concern, as the spread of such disinformation
can hurt the population as a consumer of this content and, accordingly, the course of the
war itself. Usually, the detection of thematic online disinformation is based on the linguistic features of the textual content of publications. But fakes multiply and spread faster than they can be identified and blocked. Therefore, identifying the sources of similar content, potential authors, and distribution mechanisms, i.e. analyzing and identifying the behaviour of potential generators of fakes, is a priority task for improving the means of cyber-fighting against disinformation on the Internet.
Features of online disinformation can be classified into two levels: central (including
features of the topic) and peripheral (including linguistic features, features of attitudes
and features of user behaviour). Behavioural features need to be found to reflect user
interaction characteristics: discussion initiation, interaction involvement, sphere of
influence, relational mediation, and informational independence.
To build models and methods for identifying misinformation on the Internet, many
researchers have devoted themselves to identifying the features of misinformation.
Misinformation on social media can be seen as messages that are posted to persuade
other users. To identify effective misinformation detection features in thematic online communities, it is necessary to use a model that can help understand how misinformation on the Internet, particularly in social networks and online communities, persuades users.
Users usually construct an attitude toward a message through both central and
peripheral routes. In the central route, users scrutinize the quality and strength of
information; whereas in the peripheral route, users care more about surface factors such
as source reputation, visual appeal, and presentation. In addition to the content of the
message, some secondary information (for example, the number of likes and stars)
significantly increases the perceived validity and reliability of messages. Therefore, central-level features persuade users based on message content, while peripheral-level features persuade users through the influence of message authors. The best features
for detecting misinformation in social networks may be those that look at user
characteristics, messages, topics, and user behaviour. The creation of a misinformation detection model that combines central-level features (in particular, topic features) and peripheral-level features (in particular, linguistic features, mood features, and user behaviour features) requires further research. Based on these features, it is necessary
to evaluate their ability to automatically distinguish misinformation from truth within a
topical online community using various machine learning techniques. The developed
system for rapid identification of sources of disinformation should be based on the
analysis of the inauthentic behaviour of participants in the distribution of fakes. The
results have not only demonstrated the effectiveness of behavioural features in
disinformation detection but also offered both methodological and theoretical
contributions to disinformation detection in terms of integrating features of messages as
well as features of message authors. The project is aimed at the application of artificial
intelligence for the development and improvement of cyber warfare tools, in particular for
the fight against disinformation on the Internet, namely for the automatic detection of
sources of disinformation and inauthentic behaviour (bots) of coordinated groups.
The goal of the project is to increase the level of information security of the state by
developing mathematical models, methods and means of cyber-fighting against
disinformation, in particular, automatic detection of sources of disinformation and
inauthentic behaviour (bots) of coordinated groups on the Internet based on stylistic
analysis and linguistic processing of the text of fakes and propaganda, the features of their distribution and reposting, as well as machine learning methods.
The main tasks of the project are to develop methods and tools for monitoring and
detecting misinformation on the Internet, in particular:
1) stylistic analysis and linguistic processing of disinformation to identify common
characteristic features of fakes of the same author's collective;
2) identification of disinformation that is potentially similar in style to form a set of
potential authors and participants in the dissemination of propaganda;
3) identification of primary sources of the publication of disinformation based on the
analysis of the results of the search for distribution routes of thematically and content-
like texts to determine a set of criteria for evaluating the inauthentic behaviour of a group
of participants;
4) analysis of inauthentic behaviour of chat users to form their informational portraits
with their classification, in particular, into people/bots;
5) implementation of an information system for identifying sources of misinformation
and inauthentic behaviour of chat users, its experimental testing,
collection/processing/analysis of the obtained results to calculate the accuracy/efficiency
of functioning.
The purpose of the research is to recognize propaganda texts and ways of their
distribution on the Internet, in particular, in social networks. The objectives of the
research are:
1. Propaganda recognition in the text using computer linguistics and machine
learning.
2. Recognition of similar propaganda messages.
3. Identification of propaganda distribution networks.
The object of the research is the processes and mechanisms of influence on public
opinion through the dissemination of disinformation, propaganda and fake in mass
media.
The subject of the study is the methods and means of spreading propaganda
(especially russian) on the Internet.
The detection of propaganda in itself is not new, but it is also not fully researched,
and in the context of the war in Ukraine and the fight against russian information attacks,
this issue is more relevant than ever. Most of the research in this field was conducted on
American data, such as the analysis of the situation of the election of Trump as the
president of the United States. This study aims to adapt current knowledge to improve
the situation in Ukraine.
2. The current state of the problem
Online media and social networks allow rapid exchange of information, including
misinformation, both purposefully and randomly/chaotically [1-6]. Along with their main advantage of providing quick access for all who want operational and up-to-date information, online media are often used to spread deliberately misleading content such as fakes and propaganda about specific events, people or organizations,
including governments [7]. Recent vivid examples of the spread of disinformation are the russian government's attempts to control information during the war in Ukraine since 2014, for example, around the MH17 plane crash [8]. In parallel, much online information is subject to regional censorship in certain territories due to political, economic,
social, religious and other factors, for example, to control/manage the opinion of the
people of that region, for example, in the occupied territories of russia to control the future
voters of the bunker president. It is easy for an average person to get lost, and hard to navigate, in this mass of content with contradictory facts and causes of events/phenomena [9]. It
is unethical, illegal and impractical to control what to show/hide (censor) Internet content
to the average user in democracies without direct evidence of the presence of
disinformation/fake/propaganda for a purposeful violation of the information security of
an organization/country. This is one of the first steps in the transition to totalitarianism.
Providing information, for example, to journalists about a possible thematic fake for
conducting a journalistic investigation or warning the average reader about the possibility
of disinformation in this content/resource is, on the one hand, support for freedom of
speech, on the other hand, giving a person the opportunity to choose what to believe
and what not. This makes it possible to gain an understanding of events and orientation
in the flow of information both for solving everyday tasks and adjusting business
strategies, etc. [10-15].
Blocking disinformation and sources of its dissemination, as well as identification of
potential authors based on analysis of inauthentic behaviour, are usually the functional
responsibilities of authorized bodies, especially during information warfare. But it is now
so quickly and efficiently generated/distributed based on the use of modern information
technologies and artificial intelligence that no one can cope with this task 100% without
the use of new methods and tools based on machine learning [16-27]. Significant and
massive dissemination of (dis)information against the background of the war in Ukraine
without appropriate analysis potentially leads to panic among the relevant strata/region
of the population, significantly affecting the process of adjusting plans/strategies of
business, social services, etc. Against the background of the information war, a lot of time and resources are spent on the collection and analysis of relevant content and the formation of appropriate conclusions about it. This is also
influenced by the language of the information, which may partially/significantly change
the content when translated. The system will not be able to completely replace human
activity in this direction. But it can be a significant helper for the rapid formation of
relevant bases of such content, stylistic and linguistic analysis of the disinformation text
to form an informational portrait of the authors, search for authors and distributors based
on the analysis of inauthentic behaviour and the results of the analysis of the style of
content writing, as well as responding to local changes or to changes in the dynamics of the content flow, marking certain content as potentially fake with a certain percentage of confidence.
There will never be enough resources to fully analyze all new and old content, and by the time the analysis is done, the disinformation itself will have become obsolete. Here, the quick formation, modification and replenishment of databases of content marked as blocked/unblocked in a certain region, sorted by appropriate metrics (time, topic, blocking region, language, etc.) from more relevant to less relevant for further analysis with NLP/ML methods, will significantly speed up the process of navigating the chaos of new information on the Internet. Determining the topic/reason for
content blocking (censorship) in a certain region will improve the quality of identifying
fakes/propaganda/disinformation on the relevant topic. Therefore, it is urgent and
necessary to develop a system for the automatic detection of sources of disinformation
and inauthentic behaviour of chat users in cyberspace. The system should be
implemented based on new principles of information security (data monitoring, threat
detection, forecasting), which will make it possible to identify, monitor, report on the
threat level and predict cyber threats, as well as the degree of probable informational
and psychological impact on public opinion. Because of this, the project is relevant, timely and promising for increasing the degree of information security of the state based on the identification, monitoring, forecasting and analysis of threats in the cyberspace of Ukraine.
3. The project's novelty
The scientific novelty consists in the development of the following methods:
- stylistic analysis and linguistic processing of disinformation to identify common
characteristic features of fakes of one author's collective based on methods of
processing natural language and artificial intelligence, linguistic analysis of information
messages, classification/clustering of text, etc. to identify linguistic signs of destructive
and manipulative attempts to influence the reader;
- detection of potentially similar disinformation in terms of style to form a set of
potential authors and participants in the dissemination of propaganda based on the
collection/monitoring/detection/classification of information threats in the Internet space;
- identification of primary sources of disinformation publication based on the analysis
of the results of the search for distribution routes of thematically and content-similar texts
to determine a set of criteria for evaluating the inauthentic behaviour of a group of
participants based on the analysis of social networks through graph theory and intelligent
data analysis.
The practical novelty consists in the development of an information system for
detecting sources of misinformation and inauthentic behaviour of chat users,
experimental testing, collection/processing/analysis of the obtained results to calculate
the accuracy/efficiency of functioning based on the implementation of the following
software modules:
- a module for intelligent search, collection, marking, linguistic analysis and
classification of information messages for the further formation of a set of potential fakes,
as well as monitoring, management, detection and tracking of information threat data
based on machine learning;
- a module of stylistic analysis of a set of fakes for identification of similar styles for
one author collective with subsequent classification (human/bot) based on methods of
machine learning and linguistic statistical data analysis;
- a module for the analysis of inauthentic behaviour of chat users to form their
information portraits with their classification, in particular, into people/bots through the
study of optimization models of attackers’ actions based on the graph of reposts,
optimization models and scenarios of inauthentic behaviour of participants, methods of
intelligent search for disinformation distribution routes.
4. Related works
The only off-the-shelf solution similar enough to this research is Mantis Analytics (Fig.
1). Link: https://mantisanalytics.com/.
Figure 1: Mantis Analytics Home Page
The website states that the program uses machine learning, Natural Language
Processing and Large Language Models to analyze a large amount of unstructured data
(Fig. 2). Although the website describes how the program works and what tasks it can
perform (Fig. 3), to use it, you need to enter your email address and request access to
the demo version. Most likely, the program is not available to the general public, so it will
not be considered further.
Figure 2: A description of the program from Mantis Analytics
Figure 3: Features of Mantis Analytics
5. Research Methodology
As the basis of the research methodology, we offer a synthesized technology based on
the methods of artificial intelligence, computer linguistics, machine learning, intelligent
data analysis, statistical data processing, systems theory and system analysis, computer
and simulation modelling, etc. The problem consists of two main components –
identifying a set of information as fake and, based on it, finding sources and analyzing
the inauthentic behaviour of participants.
The principle of operation of the information system of automatic detection of sources
of disinformation and inauthentic behaviour of chat users:
Stage 1. Defining a set of information as fake:
Step 1.1. Collection and integration of content in the relevant language from relevant resources into the Data Store.
Step 1.2. Checking whether content is blocked from a specific resource in a specific
region.
Step 1.3. Marking of each content as blocked/unblocked in a certain region with
corresponding additional metrics (time, resource, frequency of appearance of
blocked/unblocked duplicates, presence of relevant marked words in the
title/digest/annotation, for example, proper names, etc.).
Step 1.4. Formation of an intermediate database of branded sorted data.
Step 1.5. Applying NLP methods to the content to calculate the potential of a fake and/or the topic as a reason for blocking content in a certain region, based on dictionaries and a set of metrics. NLP diagram of the content topic definition process:
1.5.1. Definition of a set of keywords of the relevant content and a set of available
marker words (proper names, abbreviations, top words of the relevant topic, etc.).
Determination, if possible, of the topic of the content (method of text classification).
1.5.2. If it is difficult to determine the topic by keywords, identify stable phrases and, if possible, define the content topic.
1.5.3. If it is difficult to define a topic based on stable phrases, perform a semantic analysis and build an ontology and, if possible, define the content topic.
1.5.4. If, according to the results of the semantic analysis, it is impossible to do this,
mark it accordingly and transfer it to the list for the work of the content moderator.
1.5.5. For a specific topic, if the content is marked as blocked, check it against the list of previously blocked topics in this region. If it is not in the list, add it; if it is, increment the count of blocks of this topic as censorship in the specific region.
Step 1.6. Applying ML technologies to improve data analysis/labelling/NLP. Pre-
training ML models on a validated training dataset.
Step 1.7. Generating models/patterns of potential fakes to update the list of labelled
content sorting metrics in step 1.3 and metrics/dictionaries for NLP.
Step 1.8. Constant updating of the intermediate database of branded sorted data and
transfer of outdated content to the archive.
Step 1.9. Updating the training dataset to improve ML models. General scheme of the training process of the disinformation analysis module:
Pipeline 1.9.1. Pre-labeled data → NLP → ML → Models/Patterns/Metrics.
Pipeline 1.9.2. Input of new data → Data labelling (blocked/unblocked) → NLP → ML → Content labelling (fake/not fake) or finding a potential reason for blocking (not fake, but this particular event/topic is banned in a certain region for the average audience).
The general scheme of the system is presented in Fig. 4.
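The topic-determination cascade of steps 1.5.1-1.5.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the keyword dictionaries, phrase lists and helper functions are illustrative stand-ins, not the actual dictionaries or models used in the system, and the semantic-analysis step is stubbed out.

```python
# Illustrative topic dictionaries (stand-ins for the real marker-word lists)
TOPIC_KEYWORDS = {
    "war": {"offensive", "troops", "shelling"},
    "economy": {"sanctions", "prices", "grain"},
}

def topic_by_keywords(tokens):
    # 1.5.1: match individual marker words against per-topic dictionaries
    for topic, markers in TOPIC_KEYWORDS.items():
        if markers & set(tokens):
            return topic
    return None

def topic_by_phrases(text):
    # 1.5.2: fall back to stable multi-word phrases (stubbed here)
    phrases = {"special operation": "war", "grain deal": "economy"}
    for phrase, topic in phrases.items():
        if phrase in text:
            return topic
    return None

def topic_by_semantics(text):
    # 1.5.3: semantic analysis / ontology building would go here; stubbed
    return None

def determine_topic(text):
    tokens = text.lower().split()
    topic = (topic_by_keywords(tokens)
             or topic_by_phrases(text.lower())
             or topic_by_semantics(text.lower()))
    if topic is None:
        return ("moderator", None)   # 1.5.4: hand over to a content moderator
    return ("auto", topic)

blocked_topic_counts = {}            # 1.5.5: per-region censorship counters

def register_block(region, topic):
    key = (region, topic)
    blocked_topic_counts[key] = blocked_topic_counts.get(key, 0) + 1
```

Each stage only runs when the cheaper one before it fails, which mirrors the escalation from keywords to phrases to full semantic analysis described above.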
Stage 2. Identification of sources and analysis of inauthentic behaviour of
participants.
Step 2.1. Creation of a disinformation detection model that combines central-level
features (in particular, topic features) and peripheral-level features (in particular,
linguistic features, mood features, and user behaviour features).
Step 2.2. Evaluating the ability of central and peripheral features to automatically
distinguish misinformation from truth within a topical online community using different
machine learning techniques.
Figure 4: General scheme of the system for identifying disinformation, fakes and
propaganda
Step 2.3. Intelligent search for fakes based on machine learning.
Step 2.4. Finding a set of stylistically similar fakes for one author.
Step 2.5. Finding the sources of the fake based on the analysis of the distribution graph.
Step 2.6. Analysis of the behaviour of the author/team/bot over a long period to form
a set of main characteristic behavioural traits.
Step 2.7. Finding other fakes of the author by his writing style and behaviour.
Step 2.8. Formation of a portrait of the author's behaviour and behaviour prediction
models.
Step 2.9. Based on the analysis of the information portraits of various authors, form forecasts of the development and spread of fakes (frequency, density, subject matter), for example for an informational and psychological operation (PSYOP).
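The idea behind steps 2.5-2.9 of tracing how the same message travels between groups can be illustrated with a minimal sketch that counts near-identical messages shared by different groups. The similarity measure actually used in the study is cosine similarity over word2vec embeddings; here an exact-match-after-normalization stand-in is used, and the group names and messages are hypothetical.

```python
from collections import Counter
from itertools import combinations

def normalize(text):
    # Stand-in for real text similarity: lowercase and collapse whitespace
    return " ".join(text.lower().split())

def count_relays(messages):
    """messages: list of (group, text) pairs. Returns a Counter mapping
    unordered group pairs to the number of near-identical messages they
    shared, i.e. candidate edges of a propaganda distribution network."""
    relays = Counter()
    for (g1, t1), (g2, t2) in combinations(messages, 2):
        if g1 != g2 and normalize(t1) == normalize(t2):
            relays[frozenset((g1, g2))] += 1
    return relays

msgs = [
    ("group_a", "NATO is to blame for everything"),
    ("group_b", "nato is to blame   for everything"),
    ("group_c", "Weather forecast for tomorrow"),
]
links = count_relays(msgs)
# group_a and group_b shared one near-identical message; group_c shared none
```

Pairs with high relay counts are candidates for coordinated distribution and can be examined further as a graph of reposts.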
6. Experiments, results and discussions
The goal of the project is to recognize propaganda and ways of its dissemination. To do
this, we will analyze several datasets to study the parameters and markers of
disinformation. The first dataset (Fig. 5) is the Twitter ru Propaganda Classification. The
author is Bohdan Mynzar. The data are available in two versions - in English and in Ukrainian.
Among the important features: it contains almost 13 thousand records and has three
useful features - date + time of creation, message text, and label (propaganda / not
propaganda). Link to the dataset:
https://www.kaggle.com/datasets/bohdanmynzar/twitter-propaganda-
classification?resource=download. The second dataset (Fig. 6) is russian propaganda
tweets. The author is Darius Alexandru. The dataset consists of two files - one contains
only propaganda messages, and the other contains only non-propaganda messages.
There are no labels in the dataset, but the split into separate files compensates for their absence. Among the important features: the dataset has 22,000 records and more than 30 features, but most of them are not very useful because they have many empty values (90+%). Important and well-filled features are the date of posting the message, the time of
posting, the name of the group that posted, and the text of the message. Link to the
dataset: https://www.kaggle.com/datasets/dariusalexandru/russian-propaganda-tweets-
vs-western-tweets-war?select=russian_propaganda_tweets.csv.
Figure 5: View of the first dataset
Figure 6: View of the second dataset
Figure 7: View of the third dataset
The third dataset (Fig. 7) – russian invasion of Ukraine | Live News. The author is
Hladkiy Ivan. The dataset consists of one file containing messages from the Telegram social network. It contains more than 400 thousand records and three important features - group name, publication date, and message text. Link:
https://www.kaggle.com/datasets/falloutbabe/russian-invasion-of-ukraine-live-news-
dataset. For simplicity, the first dataset will be used at the beginning, as it contains
enough, but not too many records and features, and has English text and labels.
The project is implemented in the Google Colab environment in Python Notebook.
The advantages of working in Python Notebook are the ability to conveniently run small
blocks of code and write program comments in separate text blocks. The advantages of
using Google Colab are the ability to easily connect to various Google services and the ability to work from anywhere you can open a browser and log into your account.
First, we will download the datasets to our Google Drive; then, using the google.colab library's drive module and the pandas library, we will load the first dataset from the drive into the Colab notebook.
This is what the dataset looks like in the notebook:
In the dataset, the first two columns do not carry important information, so we discard
them.
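A minimal sketch of the loading and cleaning steps above. In Colab one would first mount Google Drive and read the CSV from a drive path; here a tiny in-memory CSV with illustrative column names stands in for the real dataset file.

```python
import io
import pandas as pd

# In Colab one would mount the drive and read the file, e.g.:
#   from google.colab import drive
#   drive.mount('/content/drive')
#   df = pd.read_csv('/content/drive/MyDrive/<dataset file>.csv')
# The CSV below is an illustrative stand-in with assumed column names.
csv = io.StringIO(
    "id,username,created_at,text,label\n"
    "1,u1,2022-03-01 10:00,Glory to the armed forces,0\n"
    "2,u2,2022-03-01 11:00,NATO provoked everything,1\n"
)
df = pd.read_csv(csv)

# The first two columns carry no useful information, so we discard them
df = df.iloc[:, 2:]

# Check whether the positive/negative labels are balanced
balance = df["label"].value_counts()
```

With the real dataset, `balance` showing equal counts for both labels is what the paper reports as a perfectly balanced dataset.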
Let's check how well-labelled the data are, since the assessment of whether a
message is propaganda is a subjective decision of the author of the dataset. To do this,
we output ten random messages, analyze their text and evaluate whether the label is
well placed.
In our opinion, the dataset is labelled adequately. Next, we check whether the dataset
is balanced.
The dataset contains an equal number of positive and negative labels, so it is perfectly
balanced.
It also does not contain empty values. For further checks, add a column with the length
of the message to the dataframe.
An interesting feature is that many messages are 140 characters long.
The texts of the messages are probably truncated. This may be caused either by the
desire of the author of the dataset or by the fact that at one time the maximum message
length on Twitter was 140 characters. Also interesting is that Google Colab understands
emoticons from messages. Using the matplotlib library, we will display the average
length of messages as a graph.
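The length column and the length plot described above can be produced roughly as follows; the sample messages are illustrative, and a histogram is one possible way to show the length distribution.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside notebooks
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the dataset; two messages hit the 140-character cap
df = pd.DataFrame({"text": ["short one", "x" * 140, "y" * 138]})

# Add a column with the length of each message
df["length"] = df["text"].str.len()

# Plot the distribution of message lengths
df["length"].plot(kind="hist", bins=20, title="Message length distribution")
plt.savefig("lengths.png")
```

On the real dataset, the spike at 140 characters is immediately visible in such a plot.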
Most messages are close to the maximum length. This is probably because, in the
news, proper names, first names and surnames of politicians are often used, which takes
up a lot of characters. We are interested in emoticons from those messages, so let's
analyze them a little. First, let's output all the unique characters that are in the texts.
In total, there are 516 unique symbols, most of them are letters, numbers, all kinds of
symbols and emoticons. We do not see the point in analyzing them all, but we will
analyze some of them through the author's substring_check function, which accepts a
certain string as input, not necessarily a smiley, counts all the texts of messages
containing this string, and outputs how many of such messages are propaganda and
how many not.
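A possible reconstruction of the author's substring_check function (the actual implementation is not shown in the paper, and the column names and sample data here are assumptions):

```python
import pandas as pd

def substring_check(df, substring):
    """Count messages containing `substring` (an emoticon, a surname, ...)
    and report how many of them are labelled propaganda and how many not."""
    subset = df[df["text"].str.contains(substring, regex=False)]
    propaganda = int((subset["label"] == 1).sum())
    return len(subset), propaganda, len(subset) - propaganda

# Illustrative sample data (label 1 = propaganda)
df = pd.DataFrame({
    "text": ["🇷🇺 Zakharova commented", "Zakharova again", "lightning news ⚡"],
    "label": [1, 1, 0],
})
total, prop, normal = substring_check(df, "Zakharova")
# Two matching messages, both labelled propaganda
```

Passing `regex=False` makes the search a plain substring match, so emoticons and other special characters are handled literally.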
For example, all 679 messages containing the russian flag emoticon are propaganda.
From our own experience, the lightning emoticon is often found in Ukrainian groups,
so it is expected that the proportion of normal messages will be higher.
99% of messages containing the surname "Zakharova" are propaganda.
To continue working with the text, it is necessary to remove everything unnecessary
for this task (noise). This means uppercase letters, punctuation marks, emoticons, and
so on.
To do this, we will write one large function, into which you can pass text and receive
processed text that has passed all the planned stages. The following methods will be
used in the function: converting to lowercase, cleaning up redundant character structures
like links, mentioning people in messages, and everything else, with the help of the "re"
regular expression library. Also, with the help of the "nltk" library, we will remove stop
words from the texts (for example, the, are, or...) and conduct word stemming so that
words in different forms are counted as one (for example, imported -> import, replace -
>replac...).
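The described cleaning function might look like the following sketch. The exact regular expressions are not given in the paper, so these patterns are illustrative; the authors use the full nltk stop-word list, while a small inline subset keeps this example self-contained (nltk's PorterStemmer is used for stemming, producing reductions like imported -> import).

```python
import re
from nltk.stem import PorterStemmer

# Small illustrative subset of English stop words (the paper uses nltk's full list)
STOP = {"the", "a", "an", "are", "is", "or", "and", "to", "of"}
STEMMER = PorterStemmer()

def preprocess(text):
    text = text.lower()                          # convert to lowercase
    text = re.sub(r"https?://\S+", " ", text)    # strip links
    text = re.sub(r"@\w+", " ", text)            # strip mentions of people
    text = re.sub(r"[^a-z\s]", " ", text)        # keep letters only (drops emoticons, punctuation)
    tokens = [t for t in text.split() if t not in STOP]
    return " ".join(STEMMER.stem(t) for t in tokens)

cleaned = preprocess("The goods are IMPORTED!!! See https://example.com @user")
# -> "good import see"
```

Applying such a function across the whole dataset yields the Before/After views shown below.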
Now we will apply the function on the texts from the dataset, the Before and After view
is shown below.
We will also demonstrate the Before and After view on the example of one random
message.
Add the Message length column after processing to the dataframe, and save this
dataframe as a clean_df file for convenience.
After processing the messages, it was noticed that the length of some texts became
0.
There are only 3 such rows; most likely, they contained no useful information, which is why they became empty. Therefore, we simply discard them from the dataset and update the row indices both in clean_df and in the original df.
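Discarding the empty rows and renumbering the indices can be sketched in pandas as follows (the column names are our assumptions):

```python
import pandas as pd

# Toy stand-in for clean_df after text processing; two texts became empty.
clean_df = pd.DataFrame({
    "text": ["gas supplies poland stopped", "", "special operation begins", ""],
    "propaganda": [1, 0, 1, 0],
})
clean_df["length"] = clean_df["text"].str.len()

# Drop rows whose processed text is empty and renumber the index from 0.
clean_df = clean_df[clean_df["length"] > 0].reset_index(drop=True)
print(clean_df)
```

The same mask would be applied to the original df so that both dataframes stay row-aligned.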
Next, let's check how many unique words there are in the dataset. To do this, we
import CountVectorizer from the sklearn library.
We will also compile a word frequency table, which shows which words are used most
often in general, in propaganda and non-propaganda. To do this, import the numpy
library and use the already created CountVectorizer.
We will display the top 10 words by frequency of appearance from the table.
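Such a per-class frequency table can be sketched with collections.Counter in place of CountVectorizer and numpy; the toy messages below are illustrative only:

```python
from collections import Counter

# Toy labelled messages standing in for the cleaned dataset.
messages = [
    ("rt foreign ministry statement", 1),
    ("rt russia stops gas supplies", 1),
    ("ukraine defends kyiv", 0),
]

total, propaganda, normal = Counter(), Counter(), Counter()
for text, label in messages:
    words = text.split()
    total.update(words)                        # overall frequencies
    (propaganda if label else normal).update(words)  # per-class frequencies

# Top words overall, in propaganda, and in non-propaganda.
print(total.most_common(3))
print(propaganda.most_common(3))
print(normal.most_common(3))
```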
The table shows that the word "rt" occurs most often. This is the name of a propaganda group, and there are most likely many messages from it in the dataset, which is why the word is so common. Given the topic of the dataset, frequent words such as russia, russian, Ukraine and Ukrainian are also expected. The table reveals interesting dependencies as well. For example, propaganda often uses the word "foreign", probably to set the population against everything foreign. Also, as already found, 99% of messages containing the surname "Zakharova" are propaganda. Let's display the next top 10 words.
Interesting finds continue. Propaganda rarely uses the word "war", probably because its authors avoid it, preferring phrases like "special operation". Also, approximately 99% of messages with the surname "lavrov" are propaganda.
The next step will be to describe methods of text classification, to search for the best
parameters.
The first idea for recognizing propaganda was a binary classification (propaganda /
not propaganda). There was also the idea of a multi-class classification to understand what specific type of propaganda was used (appeals to authority, cult of personality, labelling...). However, no large datasets were available for this task, so we settled on binary classification.
To do this, we will first use simple models, for example, logistic regression, and then
we will try to build artificial neural networks. To assess the quality of the model, it is
necessary to determine the minimum baseline accuracy. Then you can choose the best
parameters, such as the number of n-grams in the text and the number of features, and
train the model on such data.
First, let's convert the labels True and False into numbers 1 and 0, and also split the
data into training, testing, and validation using the train_test_split function from the
sklearn library.
We determine the baseline accuracy by the number of positive labels in the validation
set.
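A minimal sketch of the split and the baseline computation, assuming toy data with roughly the same class balance as in the paper:

```python
from sklearn.model_selection import train_test_split

# Toy labels: 51% positive, mimicking the slight imbalance in the real data.
texts = [f"msg {i}" for i in range(100)]
labels = [1] * 51 + [0] * 49

# Split off 60% train, then halve the rest into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.4, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

# Baseline accuracy: label everything as propaganda and count how often
# that is correct on the validation set.
baseline = sum(y_val) / len(y_val)
print(round(baseline, 2))
```

The exact split proportions in the paper are not stated, so the 60/20/20 scheme here is an assumption.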
This means that if every text is classified as propaganda, the accuracy will be 0.514. Therefore, future models should be more accurate than this.
We import and check the finished TextBlob solution.
Its accuracy turned out to be about the same as the baseline, even slightly worse, so it is not suitable.
Let's try to use logistic regression for classification, selecting the best parameters in
parallel.
We will check: which vectors are better, Count or TF-IDF, the number of n-grams and
the number of features.
To check all this, we will write the optimal_features function and its auxiliary function accuracy_pipeline.
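A possible shape for these two functions, sketched on toy data (the exact grids of n-gram ranges and feature counts in the paper differ); TF-IDF is checked analogously by swapping in TfidfVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def accuracy_pipeline(vectorizer, X_train, y_train, X_val, y_val):
    """Fit vectorizer + logistic regression and return validation accuracy."""
    model = Pipeline([("vec", vectorizer),
                      ("clf", LogisticRegression(max_iter=1000))])
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)


def optimal_features(X_train, y_train, X_val, y_val,
                     ngram_ranges=((1, 1), (1, 2), (1, 3)),
                     feature_counts=(500, 1000, 5000)):
    """Try every (n-gram range, feature count) pair; keep the most accurate."""
    best_acc, best_params = -1.0, None
    for ngrams in ngram_ranges:
        for n_feat in feature_counts:
            vec = CountVectorizer(ngram_range=ngrams, max_features=n_feat)
            acc = accuracy_pipeline(vec, X_train, y_train, X_val, y_val)
            if acc > best_acc:
                best_acc, best_params = acc, (ngrams, n_feat)
    return best_acc, best_params


X_train = ["russia stops gas supplies", "rt foreign ministry statement",
           "ukraine defends kyiv", "sunny weather in lviv today"]
y_train = [1, 1, 0, 0]
X_val = ["rt gas statement", "kyiv weather today"]
y_val = [1, 0]
print(optimal_features(X_train, y_train, X_val, y_val))
```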
First, we check the Count vectors:
The best result is an accuracy of 0.8529 with 1-2-3-grams and 5000 features.
We check the same for TF-IDF vectors.
The best result is 0.8444 with 1-2-3-grams and 9500 features.
On these data, Count vectors provide higher accuracy with fewer features, so these vectors and their optimal parameters will be used further.
The next step will be to use neural networks for text classification.
Let's use neural networks to classify propaganda. We start with a simple fully connected neural network, but before that, we need to bring the data into a format that can be fed to the network's input. We convert texts into vectors using
the TextVectorization layer from the TensorFlow library:
Now the texts look like this:
Neural network structure: Input is fed to a Dense layer of 256 neurons, from it to the
next Dense layer of 256 neurons, and then to the output Dense layer of 2 neurons, which
classifies the input as propaganda or non-propaganda.
Neural network parameters: optimizer – Adam, loss function – sparse categorical
cross-entropy, metrics – accuracy.
Training the network for 5 epochs:
The best result was on the first epoch, where the validation accuracy val_accuracy =
0.8253. The result is a couple of percent worse than it was in logistic regression.
We will try to improve the result by using embeddings instead of ordinary vectors to represent texts. To do this, we split the texts into tokens, and each token is represented as a vector. We also slightly improve the structure of the neural network.
Now the input looks like this:
We determine the average and maximum length of tokens:
We add padding to the tokens so that they are all the same size:
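Padding can be sketched in pure Python (the paper presumably uses a library utility such as Keras's pad_sequences; this stand-in shows only the idea):

```python
def pad_tokens(token_lists, pad_id=0):
    """Right-pad every token sequence to the length of the longest one."""
    max_len = max(len(t) for t in token_lists)
    return [t + [pad_id] * (max_len - len(t)) for t in token_lists]


sequences = [[5, 12, 7], [3], [9, 4]]
print(pad_tokens(sequences))  # [[5, 12, 7], [3, 0, 0], [9, 4, 0]]
```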
New neural network structure: embedding layer, LSTM (Long short-term memory)
layer, followed by the same Dense layers with 2 output classes.
The network training parameters remain the same.
Of the three epochs, the second, with val_accuracy = 0.8360, had the highest accuracy. This is better than the previous neural network, but still worse than logistic regression.
Let's try to use a transformer. To do this, we will build a transformer block.
We will also use embeddings here, and for this, we will build an embedding layer.
The structure of the model itself: input layer, embedding layer, transformer block,
pooling layer, dropout layer, fully connected layer, dropout again, output fully connected
layer for 2 neurons that classifies.
Learning parameters are unchanged:
Model training on 5 epochs:
The best result was on the second epoch with val_accuracy = 0.8268.
For some reason, neural networks cannot outperform logistic regression. In our opinion, this depends on the data: on such a small dataset of 13 thousand records, a neural network cannot acquire sufficient generalizing properties.
The next step is to find the most suitable way to measure the distance between texts.
The task is to find chains of propagation of propaganda between groups. To do this,
we will search for similar texts and, using metadata such as the date and time of the
message, we will investigate how these similar messages were distributed.
But before that, it is necessary to choose a method of measuring the similarity of
messages. We plan to test lexical and semantic methods.
First, we manually find several messages of varying degrees of similarity to compare the methods. To do this, we display all messages containing the keyword, in this case the surname "kyrilenko".
To see what specific messages looked like before and after text processing, we write the showme function, which takes the indices of the messages of interest.
Several messages were selected using this function:
- 1162 - the main message, with which we compare the others;
- 1393 - a message very similar to the main one;
- 11628 - slightly less, but still similar to the main one;
- 12510 - shares several common words with the main one;
- 6666 - not similar to the main one at all (as far as the nature of the dataset allows).
Now let's write the compare_distances function for a quick comparison of text
similarity measurement methods.
Next, we will write functions for measuring cosine similarity (on Count and TF-IDF
vectors), Jaccard similarity and Levenshtein distance. After that, we will compare their
results.
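Minimal pure-Python versions of these measures, written on word counts and character edits rather than on the paper's vectorizer output, might look as follows:

```python
from collections import Counter
from math import sqrt


def cosine_count(a, b):
    """Cosine similarity on raw word-count vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Intersection over union of the word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def levenshtein(a, b):
    """Character-level edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]


main = "russia stops gas supplies to poland"
similar = "russia stops gas supplies"
print(cosine_count(main, similar), jaccard(main, similar),
      levenshtein(main, similar))
```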
The results:
The closest to what we want is shown by cosine similarity on Count vectors; on TF-IDF vectors the values are closer to 0, with Jaccard similarity the values are closer to 0.5, and as for the Levenshtein distance, we realized that it is not suitable for this task at all.
Now let's try to take the semantic content into account. To do this, we download word2vec embeddings trained on Google News.
Let's write a function for cosine similarity on embeddings embed_cos_sim and an
auxiliary function get_word2vec.
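A sketch of embed_cos_sim and get_word2vec; the averaging of word vectors is our assumption about the authors' implementation, and the toy 3-dimensional embeddings below merely stand in for word2vec-google-news-300:

```python
from math import sqrt

# Toy 3-d embeddings standing in for word2vec-google-news-300.
EMB = {
    "gas":     [0.9, 0.1, 0.0],
    "fuel":    [0.8, 0.2, 0.1],
    "weather": [0.0, 0.9, 0.4],
}


def get_word2vec(text):
    """Average the embeddings of the known words in `text`."""
    vecs = [EMB[w] for w in text.split() if w in EMB]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def embed_cos_sim(a, b):
    """Cosine similarity between averaged text embeddings."""
    va, vb = get_word2vec(a), get_word2vec(b)
    if va is None or vb is None:
        return 0.0
    dot = sum(x * y for x, y in zip(va, vb))
    norm = (sqrt(sum(x * x for x in va)) * sqrt(sum(x * x for x in vb)))
    return dot / norm if norm else 0.0


print(embed_cos_sim("gas", "fuel"), embed_cos_sim("gas", "weather"))
```

With real pretrained vectors, gensim's KeyedVectors lookup would replace the EMB dictionary.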
Result:
This is almost what we wanted to get, but for some reason completely different texts show higher similarity than slightly different ones. This, however, depends on how these embeddings were trained.
The next step is to search for propaganda networks by examining messages with
similar texts.
Using the selected method of measuring the distance between texts (cosine similarity
on word2vec-google-news-300 embeddings), we want to find similar messages among
the entire dataset. To do this, we create 2 dataframes containing only propaganda
messages (since we are investigating the spread of propaganda). The first will contain
the texts in their original form, and the second in a cleaned form.
Let's write a small function to compare each message with each other.
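Such an all-pairs comparison can be sketched as follows; the flat, row-major result list is what makes the index arithmetic discussed below necessary (the similarity function is pluggable, here a trivial equality check):

```python
def compare_all(texts, similarity):
    """Compare every message with every other; O(n^2) similarity calls.

    Returns a flat list of n*n scores in row-major order: the first n
    entries compare message 0 with all messages, the next n compare
    message 1 with all messages, and so on.
    """
    results = []
    for a in texts:
        for b in texts:
            results.append(similarity(a, b))
    return results


def same(a, b):
    # Trivial stand-in similarity for demonstration.
    return 1.0 if a == b else 0.0


texts = ["msg a", "msg b", "msg c"]
scores = compare_all(texts, same)
print(len(scores))  # 9 = 3 * 3, with 1.0 at the diagonal positions
```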
We did a test run with 100 x 100 messages, measured the elapsed time, and realized we do not have the computing power to compare all 6500 x 6500 messages. However, since the research is not finished yet, we take a small fraction of 300 x 300 messages and work with them.
For better clarity, we create a data frame from the result.
If the result of the comparison is represented as a matrix, the main diagonal will always contain ones, since there each message is compared with itself. Therefore, we select from the results only those values that are less than one and, at the same time, greater than 0.8, that is, very similar messages that were potentially rewritten.
To conclude whether the messages are similar, we want to look at their original appearance. However, after the 300 x 300 comparison, the indices in the resulting dataframe no longer match the original ones. Knowing how these indices were formed allows us to recover the originals. For example, if a result has index 302, it corresponds to the second message compared with the third, because the first 300 records (0 - 299) are comparisons of the first message with all others, and records 300 - 599 are comparisons of the second message with all others.
In this way, to get the index of the first message, you need to apply floor division ("//") to the index from the result, which returns only the whole part, and for the second index, take the remainder of the division ("%", modulus).
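This index recovery can be sketched directly (the function name is ours):

```python
n = 300  # messages compared in the 300 x 300 run


def original_indices(flat_index, n=n):
    """Recover (first, second) message indices from a row-major flat index."""
    return flat_index // n, flat_index % n


print(original_indices(302))  # (1, 2): second message vs third, 0-based
```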
We are interested in this line, so we will generate original views for it.
We followed the links in these messages and made sure that the messages can be considered the same.
Link:
https://x.com/RT_com/status/1519692929065009155
https://x.com/i/web/status/1519292781226868742
They both say that russia is stopping gas supplies to Poland because it does not want
to pay for gas in rubles.
This is a success in terms of finding similar messages, but both of these messages came from the same group, a day apart. We want to find connections between different groups. To do this, we download a second dataset containing message metadata such as date, time, and group, and during loading, we select only the columns that interest us.
We use the old function for text processing:
We create a dataframe for cleaned texts:
We use the old function to compare messages, this time 1000 x 1000:
We transfer the result to the dataframe:
It seems that after processing, some texts became empty, but they will still be discarded when selecting messages with a similarity of less than 1.
We create a function similar to the previous one for viewing the original appearance
of messages.
We check this line:
Both messages are about the beginning of a full-scale invasion, and they both say
that "the people of the DNR and LNR are asking russia for help, and according to Article
51 of the United Nations Charter, russia is launching a 'special operation' in those lands".
The texts are similar, now it's worth checking where they were taken from.
The messages were posted on the same day, an hour apart, and from different groups. It should be taken into account that the event was so large-scale that all the groups may well have written about it. However, the texts of the messages are so similar that we believe this is evidence of a propaganda network. In the mfa_russia group, the message was posted at 14:24, and in the russianembassy group at 15:17, so the chain looks like this: mfa_russia -> russianembassy.
Let's try to write a function to automate the search for such chains.
We compile a dictionary from the obtained results using the collections library:
Next, symmetric chains can be merged, that is, cases from the first group to the second and from the second to the first. One can then look at how many cases of rewriting there are between groups and, based on this, conclude that a propaganda network exists. Longer chains of 3-4 or more groups can also be built. Potentially, further functions could be written that would themselves examine the numbers in the dictionary and conclude whether a network exists.
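The counting of rewriting cases between groups, including the merging of symmetric chains, can be sketched with collections.Counter (the group names come from the example above; the counts are illustrative):

```python
from collections import Counter

# (source_group, target_group) pairs found earlier -- toy stand-ins.
chains = [
    ("mfa_russia", "russianembassy"),
    ("mfa_russia", "russianembassy"),
    ("russianembassy", "mfa_russia"),
    ("rt_com", "mfa_russia"),
]

counts = Counter(chains)

# Merge symmetric chains (A -> B and B -> A) into one undirected link.
undirected = Counter()
for (src, dst), n in counts.items():
    undirected[tuple(sorted((src, dst)))] += n

print(undirected.most_common())
```

A high count for a pair of groups would then be the signal for a potential propaganda network.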
However, this is where our research ends, as we do not have sufficient computing
power to cover all available data. We could not process 6500 x 6500 messages, and in
this dataset, there are as many as 22500 x 22500. Therefore, here we end our review
and, accordingly, research.
7. Conclusions
At the beginning of the research, the topic of the project is formulated in the article,
and the purpose and tasks of the research are described:
1. Recognition of propaganda messages.
2. Analysis of the spread of propaganda messages between groups, thereby
finding propaganda networks.
The relevance, object and subject of the research are described. The means of influencing public opinion are studied, in particular propaganda, especially russian propaganda. During the war with russia, methods of combating the enemy's information attacks are extremely important. A search and review of data for the study was conducted. Several datasets of different structures were selected for further research.
The quality of tagged data, the balance of tags and the presence of empty values are
checked. Several regularities of specific datasets are deduced (regarding the length of
messages, and the use of certain emoticons and keywords).
Processing of text data was carried out using a combination of different methods
(conversion to lowercase, extraction of characters through regular expressions,
extraction of stop words, stemming). The unique words present in the texts of the dataset
were reviewed. In particular, a table of word frequencies was compiled, which
demonstrates some regularities for certain words (for example, frequent use of the word
"foreign" in propaganda). The best parameters for the binary classification of propaganda
on machine learning models were selected, such as logistic regression (1-2-3-grams,
5000 features). A classification accuracy of ~0.85 was achieved.
Binary classification was tested using artificial neural networks, namely a simple fully connected neural network with count vectors at the input, a neural network with embeddings at the input, and a transformer. Notably, the neural networks were unable to outperform logistic regression. Different methods of measuring the similarity (distance) between texts were tested: cosine similarity with different types of input vectors, Jaccard similarity, Levenshtein distance, and cosine similarity on downloaded word2vec-google-news-300 embeddings. The last method proved to be the best, so it was used
further. On several datasets, part of the messages were compared with each other, using
the selected method of distance measurement. Similar messages were found, which
most likely were reinterpreted, both within the same group and between different groups.
A method of calculating the number of reinterpretations between groups has been
created, which allows the recognition of propaganda networks.
The results of the project:
1. For the first time, the fundamentals and main principles of the synthesized
information technology for the automatic detection of sources of disinformation
and inauthentic behaviour of chat users were developed, which will allow
timely detection of destructive and suspicious communities in various social
networks, identify their leaders and curators, identify information threats in
user messages, prevent the spread of fake and harmful information.
2. For the first time, a method of stylistic analysis and linguistic processing of
disinformation was developed to form an information portrait of the author/text
content generation bot as part of the search parameters for both similar author
content and distribution channels.
3. Developed criteria and parameters of inauthentic behaviour of chat users for
the formation of information portraits of potential disinformation disseminators
and detection of distribution routes and mechanisms, frequency of fake
generation, topics and keywords characteristic of the relevant group.
Identifying content as fake/not fake with NLP is a complex process, as it depends not only on the speed/quality of pre-collected/integrated and processed content (blocked/unblocked in a certain region, content topics) but also on an effectively selected machine learning model and its training datasets. Usually, a fake is not blocked.
The purpose of its creation is to spread it as quickly as possible both throughout the
world and in those regions where usually true information (not fake) can potentially be
blocked (not guaranteed). If non-fake information is blocked in a certain territory, and the
opposite (fake) information is distributed from this territory, then the chance of identifying
a fake increases. If the non-fake is not blocked and the fake is freely distributed in
parallel, NLP methods will not help here. They can only label two sets with opposite
explanations for an event/phenomenon. And only with additional statistical research, it is
possible to identify which set is fake and which is not. The difficulty also lies in the language of the content itself, in particular Ukrainian. In comparison with English-language content, Ukrainian and russian texts are quite difficult to process automatically, especially when analyzing semantics and building an ontology. Standard and traditional methods used for processing English-language texts are not suitable for processing Slavic languages, including for identifying disinformation and the stylistic features of authors who generate fakes and propaganda. Similarly, the inauthentic behaviour of chat users, both people and bots, varies widely: people differ in motivation (belief in propaganda, work for money, or simply a kind of vandalism and, so to speak, leisure), nationality, education, gender, mentality, knowledge of the language of the text, degree of conviction, intelligence, etc. All this significantly affects the process of determining behavioural criteria for different communities and within the same community, which in turn significantly affects the formation of an informational portrait of the inauthentic behaviour of users of various chats (what is typical for a Muslim propagandist differs significantly from a representative of russia or, respectively, of the so-called lpr/dpr).
Justification of the practical value of the planned results of the project for the economy
and society:
- Reducing the amount of disinformation, fakes and propaganda and the
frequency/regularity of publication by tracking stylistically similar content and distribution
routes.
- Reducing the negative impact of disinformation on public sentiment and reducing
the degree of control of public opinion through the spread of propaganda in information
warfare. For example, suppressing the psyche of young people (including mental
disorders, leading to lethal consequences), encouraging them to engage in antisocial
behaviour, forming groups of civil disobedience or aggressive behaviour on fabricated
pretexts, analyzing the social consequences of cyberattacks, etc.
- Reducing the cost of finding, identifying and blocking disinformation,
authors/targeted distributors and sources.
The development of the above-described methods is aimed at identifying threats,
third-party intervention (attacks) in the early stages, classifying threats by type, and
further countering each type of threat. Description of the ways and methods of further
use of the results of the project in social practice.
1. Implementation of information security in the form of response to cyber threats
related to the spread of fakes and propaganda in information warfare is an
important modern practice in the USA and in many countries of Asia and the
European Union.
2. The results of this project can be used in the information security organization
of Ukraine to identify not only disinformation but also targeted groups with
primary sources of dissemination. This synthesized information technology
will make it possible to significantly reduce the overall negative impact on
public sentiment and opinion, eliminate deceptive elements of public opinion
management in society, and also reduce the cost of automatically finding,
processing and blocking disinformation/distribution sources.
3. In the future, the results of the project may have a long-term social impact in
the information sphere, in particular for the implementation of PSYOP among
such directed groups in favour of Ukraine to ensure the cyber security of the
state and inflict damage on the enemy in the information war.
The following risks may affect the implementation of the project:
- the risk associated with the war in Ukraine, the conduct of hostilities on the territory of the country, periodic mass shelling and the possibility of a blackout;
- the risk associated with the instability of economic legislation and the current economic situation;
- incompleteness and inaccuracy of information on the dynamics of technical and economic indicators, parameters of new equipment and technology.
Organizational, production and financial risks are not expected. Possible technical risks are that the level of automation of the stages may not fully satisfy the initial requirements; in that case, some tasks can be implemented partly manually, with the corresponding methods adjusted. The basic mathematical methods of the project may also need to be replaced in the development of new models and methods, with a likely transition to traditional approaches or to new ones (if new approaches and information technologies are developed in the world in the next two years). Finally, the development of individual methods and tools may exceed the established deadlines; in that case, during the implementation of the demo version of the system, traditional approaches will be applied to ensure the minimal functionality of the methods.