<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Electronic &amp; Information Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv.2404.00463</article-id>
      <title-group>
        <article-title>Method for analysis and formation of representative text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olexander Barmak</string-name>
          <email>barmak@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Sobko</string-name>
          <email>olenasobko.ua@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olexander Mazurets</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maryna Molchanova</string-name>
          <email>m.o.molchanova@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iurii Krak</string-name>
          <email>yuri.krak@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Glushkov Cybernetics Institute</institution>
          ,
          <addr-line>Kyiv, 40, Glushkov ave., 03187</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>Khmelnytskyi, 11, Institutes str., 29016</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Kyiv, 64/13, Volodymyrska str., 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>6</volume>
      <fpage>561</fpage>
      <lpage>582</lpage>
      <abstract>
        <p>The paper is devoted to the creation and approbation of a method for the analysis and formation of representative text datasets according to the FATE fairness principle for subject areas. The method analyzes the representativeness of a dataset according to ethical aspects, and on this basis performs a representative adjustment of the dataset. When adjusting the dataset, an optimization problem is solved both for the selection of redundant elements for removal and for the formation of requirements for the ethical-aspect membership of each element for data augmentation. To investigate the effectiveness of the method, software was created that uses machine learning models to classify texts according to various ethical aspects: age, gender, religion, ethnicity, etc. The obtained deviations of the sample distributions by ethical-aspect classes of the dataset transformed by the created method, compared to the ideal representative distribution, were: minimum 0.00%, maximum 0.04%, average 0.02%. The obtained results contribute to the improvement of the representativeness of text datasets and the fair and unbiased representation of demographic groups in them, which increases trust in decisions made by artificial intelligence.</p>
      </abstract>
      <kwd-group>
        <kwd>NLP</kwd>
        <kwd>data ethical correctness</kwd>
        <kwd>ethical principles</kwd>
        <kwd>non-discrimination</kwd>
        <kwd>text datasets representativeness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In today's world, numerous solutions using artificial intelligence are being actively developed to solve the various tasks that people face every day. The results generated by artificial intelligence depend on the datasets on which it was trained; in other words, the content of these datasets directly affects the final result. Lack of transparency regarding the sources and characteristics of the data used to train AI algorithms reduces confidence in the results obtained. In this case, users are often unable to assess the potential biases or discriminatory elements built into these algorithms. Insufficient awareness of the content of training datasets increases the risk of spreading unfair or inaccurate decisions, which can have serious consequences for individuals and society as a whole [1].</p>
      <p>Means for evaluating the representativeness of a textual data set in accordance with the
principles of ethical non-discrimination are currently lacking. This is especially relevant for
socially important and sensitive tasks according to SDG3 (good health and well-being), SDG4
(quality education) and SDG16 (peace, justice, and strong institutions), for example, detecting
cyberbullying, determining the emotional state of people based on text messages, etc. The lack of
attention to ethical components when creating and using datasets leads to bias in algorithms,
which negatively affects the fairness and reliability of the decisions made [2].</p>
      <p>Well-known datasets for training neural networks, for example [3, 4], are actively used by researchers because they contain a large amount of data, but they were not validated by their authors for representativeness according to the fairness principle; therefore, the use of such datasets for training artificial intelligence algorithms may potentially violate ethical principles and, hence, lead to low reliability of the decisions made.</p>
      <p>The representativeness of the data in datasets not only affects the accuracy of the results and models, but is also closely related to the principles of FATE (Fairness, Accountability, Transparency, Ethics) in the use of data and the development of artificial intelligence technologies. If a dataset does not include adequate representation of all social, demographic, or cultural groups, it can lead to discriminatory patterns that prioritize one group over another and are therefore not fair. The representativeness of datasets according to the FATE fairness principle is achieved by correct balancing according to various ethical aspects: gender, religion, age, etc. [5].</p>
      <p>The main contribution of the paper is the development and validation of an approach to the
analysis and formation of representative text samples of data according to the principle of fairness
of FATE for subject areas.</p>
      <p>Further, Section 2 reviews work related to the topic of the study, namely the formation of representative text samples and the issue of impartial representation of demographic groups according to the fairness principle. Section 3 describes the method of analysis and formation of representative samples of text data, and presents the datasets used for the subsequent experimental studies of the effectiveness of the method. Section 4 contains the experimental study. Section 5 presents the results and discussion. Section 6 concludes the work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Many works have been devoted to the study of the representativeness of text samples and the fair
and unbiased representation of demographic groups in them, since the concepts of
representativeness, fairness and impartiality are important in the creation of ethical and fair
machine learning models [6]. Natural language processing tools are widely used for this purpose
[7]. Recently, authors have increasingly paid attention to the issue of representativeness of data in
samples, but the current state suggests that data sets have gaps in the representation of gender and
race, and the complex nature of demographic variables makes classification difficult and
inconsistent. Thus, the question of representativeness of data in sets that include people with
disabilities and the elderly is considered. The authors recommend increasing representativeness by
adding samples for underrepresented groups, including by collecting additional data or using
synthetic data methods to improve representation of minorities and people with disabilities.</p>
      <p>In the article [8], the authors raise the important problem of sample representativeness in the context of machine learning and artificial intelligence, emphasizing the need for accurate representation of population data. The main strategy the authors propose for achieving high-quality models is the use of stratified samples, which make it possible to reduce the variability between subgroups and accurately reflect the proportions between different categories in the population.</p>
      <p>The authors of the study [9] consider biases arising both from class imbalances in the data and
from sensitive (protected) characteristics such as race or gender. The approach increases model
accuracy by balancing classes and reduces dependence on sensitive features, which improves group
fairness.</p>
      <p>IBM researchers have developed an open-source AI Fairness 360 toolkit for evaluating and
reducing discrimination in machine learning models [10]. The main purpose of the toolkit is to
detect bias based on attributes such as race, gender or age, and to provide methods for
representation of all given social groups at different stages of model development.</p>
      <p>The article [11] highlights the problem of intersectional biases in natural language processing
(NLP) models, namely the unrepresentative and biased representation of different groups of people
in textual datasets. The results showed that although existing debiasing methods (for example, for
BERT or RoBERTa) preserve the predictive accuracy of the models well, their ability to reduce
intersectional biases is limited.</p>
      <p>The authors of [12] propose a specialized model of machine learning to detect and minimize
bias in textual data, in particular, in news articles. The authors claim that their approach is
effective because of deep models and transformative architectures that are able to detect and
correct biases at different stages of machine learning.</p>
      <p>The article [13] presents the problem of gender bias in natural language processing models, addressing it using two main approaches: statistical and causal fairness. Researchers use techniques such as counterfactual data augmentation for causal debiasing, as well as resampling and reweighing methods for statistical debiasing. The results showed that the combination of these techniques allows a significant reduction of bias in the models by both statistical and causal metrics.</p>
      <p>Article [14] is devoted to solving the problem of intersectional bias in the predictions of
machine learning models, in particular deep neural networks. Researchers propose a new method
based on the Apriori algorithm for automatically detecting biased subgroups in data. It allows
efficient generation of frequent subgroups and calculation of fairness metrics for them.</p>
      <p>In [15], the authors identify and classify bias in natural language processing using transformer
models such as BERT. The authors explore different ways to identify bias, including identifying
social characteristics such as gender, race, religion, and sexual orientation.</p>
      <p>The study [16] examines the problem of cyberbullying, which is a threat to people based on
different characteristics, such as religion, age, ethnicity, and gender. The data set used by the
authors has been modified with ethical considerations in mind, which ensures responsible AI.</p>
      <p>The cited works show that the formation of representative and unbiased samples is a relevant research area; however, most of the works are devoted either to the detection of bias or to the analysis of the representativeness or unbiasedness of data samples, whereas data samples must also be modified to achieve compliance with the FATE principles.</p>
      <p>So, summarizing, it is possible to highlight the features of the modern approach, which is
applied to the development of AI models (Fig. 1). However, this approach does not take into
account existing ethical principles and non-discriminatory, representative presentation of existing
population subgroups, which should be applied to obtain AI models.</p>
      <p>The purpose of the work is to ensure compliance with the ethical aspects (gender, religion, age, etc.) of the FATE fairness principle [5] for training datasets, which consists in creating a method of analysis and formation of representative (according to the specified aspects) text samples of data. To achieve the specified goal, it is necessary to propose a method that will implement the following research tasks:
• to develop an approach to the analysis and formation of relevant representative datasets according to the FATE fairness principle for subject areas;
• to investigate the effectiveness of the proposed approach by using it for the applied analysis of a text dataset and bringing it to a representative view according to the aspects of the FATE fairness principle: gender, age and religion.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method for analysis and formation of representative text datasets</title>
      <p>In contrast to the existing approach to training AI models (see Fig. 1), the study proposes a new approach (Fig. 2), which will ensure the representativeness and ethical correctness of the datasets used for training AI models.</p>
      <p>In order to implement the proposed approach, we will present: the information model and
presentation of the task of forming representative samples of text data as an optimization task;
steps of the method of analysis and formation of representative samples; a way to obtain a typical
ML model for the ethical aspect; description of the composition of the datasets for the study.</p>
      <sec id="sec-3-1">
        <title>3.1. Information model</title>
        <p>The problem of obtaining a representative, ethically unbiased text dataset can be presented in the
framework of an information model of the following form:</p>
        <p>{DS, DSʹ, C, A, M, F}, (1)
where DS is the text dataset for analysis and correction, DSʹ is the text dataset after correction, C is the set of classes of the subject domain of the dataset, A is the set of ethical aspects, M is the set of trained machine learning models (separate for each ethical aspect), and F is the objective function minimizing the deviation between the current and desired ratios for all ethical aspects.</p>
        <p>In (1), the initial dataset DS and the corrected dataset DSʹ can be represented as:</p>
        <p>DS = {D ∪ Metadata}, (2)</p>
        <p>DSʹ = {Dʹ ∪ Metadataʹ}, (3)
where D is the set of elements of the DS dataset, Metadata is the set of metadata of the DS dataset, Dʹ is the set of elements of the DSʹ dataset, and Metadataʹ is the set of metadata of the DSʹ dataset.</p>
        <p>Each element of the set of elements of the dataset D in (2) and each element of the set of elements of the dataset Dʹ in (3) is a tuple of the following form:</p>
        <p>d = dʹ = (text, cx, ACx), (4)
where the attribute text is the textual content of element d or dʹ; cx is the class of the subject area of the dataset to which the element belongs, cx ∈ C; and ACx is the set of classes of dataset element belonging to ethical aspects.</p>
        <p>Thus, in (4), cx and ACx are the labeling of the content of the text element.</p>
        <p>In (4), the set of classes of membership of the element of the dataset DS or DSʹ in (1) to the ethical aspects Ax is presented in the form of a tuple:</p>
        <p>ACx = (a1, a2, …, ak), (5)
where ax are the classes of element belonging to the ethical aspects; k is the number of ethical aspects to be analyzed, k = |Ax|.</p>
        <p>At the same time, in (5), according to (1), Ax ⊂ A, and the classes of dataset elements belonging to the ethical aspects are elements of the corresponding sets, unique for each of the ethical aspects:</p>
        <p>a1 ∈ A1, a2 ∈ A2, …, ak ∈ Ak, (6)</p>
        <p>A1 ∪ A2 ∪ … ∪ Ak = Ax. (7)</p>
        <p>The Metadata set of the DS dataset in (2) includes:</p>
        <p>Metadata = {nDS, ANDS, ATDS, nʹDS, ANʹDS, ATʹDS}, (8)
where nDS is the number of elements in D, nDS = |D|; ANDS is the set of quantities of dataset elements belonging to each class of each ethical aspect from Ax; ATDS is the set of available proportions of elements for each class relative to the total number for each ethical aspect from Ax; nʹDS is the target number of elements in Dʹ; ANʹDS is the set of target quantities of dataset elements belonging to each class of each ethical aspect from Ax; and ATʹDS is the set of target proportions of elements for each class relative to the total number for each ethical aspect from Ax.</p>
        <p>At the same time, in (8), each element anDS,i of the set ANDS corresponds to a separate i-th ethical aspect and is represented by a tuple of the following form:</p>
        <p>anDS,i = (nDS,i,1, nDS,i,2, …, nDS,i,j, …, nDS,i,k), (9)
where nDS,i,1 is the number of elements in the dataset of the 1st class of the i-th ethical aspect, nDS,i,2 is the number of elements in the dataset of the 2nd class of the i-th ethical aspect, nDS,i,j is the number of elements in the dataset of the j-th class of the i-th ethical aspect, and k is the number of classes of the i-th ethical aspect.</p>
        <p>Similarly to (9), in (8) the proportions of the elements atDS,i of the i-th ethical aspect are represented by a tuple of the following form:</p>
        <p>atDS,i = (tDS,i,1, tDS,i,2, …, tDS,i,j, …, tDS,i,k), (10)
where tDS,i,1 is the ratio of the number of elements in the dataset of the 1st class of the i-th ethical aspect to the total number of elements in the dataset, tDS,i,2 is the ratio of the number of elements in the dataset of the 2nd class of the i-th ethical aspect to the total number of elements in the dataset, and tDS,i,j is the ratio of the number of elements in the dataset of the j-th class of the i-th ethical aspect to the total number of elements in the dataset.</p>
        <p>At the same time, for the values (9) and (10), in accordance with (8), for each i-th ethical aspect the following equalities hold:</p>
        <p>nDS,i,1 + nDS,i,2 + … + nDS,i,k = nDS, (11)</p>
        <p>tDS,i,1 + tDS,i,2 + … + tDS,i,k = 1. (12)</p>
        <p>In contrast to (8), the set Metadataʹ of the DSʹ dataset in (3) includes:</p>
        <p>Metadataʹ = {nʹʹDS, ANʹʹDS, ATʹʹDS}, (13)
where nʹʹDS is the number of elements actually contained in Dʹ as a result of adjustment, nʹʹDS = |Dʹ|; ANʹʹDS is the set of quantities of dataset elements belonging to each class of each ethical aspect from Ax actually obtained as a result of adjustment; and ATʹʹDS is the set of proportions of elements for each class relative to the total number for each ethical aspect from Ax actually obtained as a result of adjustment.</p>
        <p>Thus, in (8) and (13), relations of the form (9) and (11) hold for ANʹDS and ANʹʹDS, and relations of the form (10) and (12) hold for ATʹDS and ATʹʹDS.</p>
        <p>Thus, according to (4), (6) and (7), the text dataset D has the number of elements n = nDS = |D| and can be presented in the form:</p>
        <p>D = {d1, d2, …, dn}, di = (texti, ci, a1, a2, …, am), i = 1, …, n, (14)
where C = {c1, c2, …, ck}, k is the number of classes of the dataset D, and m is the number of ethical aspects.</p>
        <p>According to (6) – (10), the solution of the problem is aimed at obtaining the dataset Dʹ, which contains the total number of elements nʹ = nʹDS = |Dʹ|, quantitatively balanced according to the ethical aspects Ai from the set of ethical aspects A:</p>
        <p>A = {A1, A2, …, Am}, Ai = (Ci, Ti), i = 1, …, m, (15)
where each aspect Ai contains classes Ci and target proportions of classes Tij for each class of Ci; Ci is the set of classes of the ethical aspect Ai, Ci = {c1, c2, …, cj}; and j is the number of classes of the ethical aspect Ai.</p>
        <p>To balance the dataset for each ethical aspect, it is necessary to use already trained classifier models, or to train an appropriate number of them; these can be deep learning models, for example BERT, LSTM and GRU, as well as machine learning models such as Logistic Regression, Naive Bayes, Support Vector Machines and k-Nearest Neighbors [17]. According to (1), the set of trained classifier models M is presented in the form:</p>
        <p>M = {M1, M2, …, Mm}, m = |A|. (16)</p>
        <p>Thus, within the framework of the proposed information model, it is necessary to perform the transformation D ⇒ Dʹ with the condition of maximal correspondence nʹʹDS → nʹDS, ANʹʹDS → ANʹDS and ATʹʹDS → ATʹDS.</p>
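        <p>For illustration, the metadata components nDS, ANDS and ATDS from (8) can be computed directly from a labeled sample. The sketch below is illustrative only: the dictionary-based element representation and the aspect and class names are assumptions, not structures from the paper.</p>

```python
from collections import Counter

def dataset_metadata(elements, aspects):
    """Compute n_DS, AN_DS (class counts per aspect) and AT_DS
    (class proportions per aspect) for a labeled sample.

    elements: list of dicts {"text": ..., "class": ..., "aspects": {aspect: class}}
    aspects:  list of ethical aspect names, e.g. ["gender", "age"]
    """
    n = len(elements)  # n_DS = |D|
    an, at = {}, {}
    for aspect in aspects:
        counts = Counter(e["aspects"][aspect] for e in elements)
        an[aspect] = dict(counts)                           # AN_DS for this aspect
        at[aspect] = {c: k / n for c, k in counts.items()}  # AT_DS; sums to 1, as in (12)
    return n, an, at

# Toy sample with two ethical aspects (illustrative labels)
sample = [
    {"text": "t1", "class": "bullying", "aspects": {"gender": "female", "age": "20-29"}},
    {"text": "t2", "class": "neutral",  "aspects": {"gender": "male",   "age": "20-29"}},
    {"text": "t3", "class": "neutral",  "aspects": {"gender": "female", "age": "30-39"}},
    {"text": "t4", "class": "bullying", "aspects": {"gender": "female", "age": "20-29"}},
]
n, an, at = dataset_metadata(sample, ["gender", "age"])
```

        <p>On this toy sample the gender aspect yields counts {female: 3, male: 1} and proportions {0.75, 0.25}, and each aspect's proportions sum to 1, matching equality (12).</p>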
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Idea of the approach</title>
        <p>The study proposes to reduce the task of building a representative, ethically unbiased dataset to the
task of multi-criteria optimization. The optimization task consists in minimizing the deviation
between the current and desired class ratios, taking into account the limitations on the number of
samples in the classes and the possibilities of generating synthetic data.</p>
        <p>Input data: textual dataset DS, set of ethical aspects A, requirements for representative
distribution DSʹ.</p>
        <p>The goal of the problem: to create a representative sample for all ethical aspects that achieves the
target class proportions for each ethical aspect D ⇒ Dʹ.</p>
        <p>Variables: xij – number of samples of class Cj in aspect Ai after sequestration and augmentation.</p>
        <p>The objective function F is the minimization of the deviation between the current and desired ratios for all ethical aspects simultaneously, taking into account constraints (18) – (21):</p>
        <p>F = Σi=1..m Σj=1..ni | xij / nʹ − Tij | → min, (17)
where ni is the number of classes in the aspect Ai.</p>
        <p>Limitations of the task:</p>
        <p>1) the sum of all class samples within one aspect is equal to the target number of samples for this aspect:
xi,1 + xi,2 + … + xi,ni = nʹ, ∀i ∈ {1, 2, …, m}; (18)</p>
        <p>2) the number of samples for each class should correspond to the target proportion of classes:
xij / nʹ ≈ Tij, ∀i ∈ {1, 2, …, m}, ∀j ∈ {1, 2, …, ni}; (19)</p>
        <p>3) the estimated number of samples cannot be negative:
xij ≥ 0, ∀i ∈ {1, 2, …, m}, ∀j ∈ {1, 2, …, ni}; (20)</p>
        <p>4) the ability to add new samples should match the ability to generate new data for each class and aspect:
xij − nDS,i,j ≤ Gij, ∀i ∈ {1, 2, …, m}, ∀j ∈ {1, 2, …, ni}, (21)
where Gij is the maximum possible number of samples of class Cj in aspect Ai that can be added.</p>
        <p>Based on the set optimization task of forming a representative dataset (17), we present below the steps of the method of analysis and formation of representative samples of text data.</p>
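        <p>The constrained choice of the xij values for one ethical aspect can be sketched with a simple greedy scheme: start from the target proportions, clip by the current counts plus the augmentation capacity Gij, and redistribute any shortfall to classes with spare capacity. This is an illustrative feasible-solution sketch under assumed inputs, not the authors' optimizer.</p>

```python
def balance_counts(n_target, targets, current, capacity):
    """Choose x_j for one ethical aspect A_i: approximate x_j/n' = T_j (19),
    with sum(x_j) = n' (18), x_j >= 0 (20) and x_j <= current_j + G_j (21).

    targets:  {class: T_j}, current: {class: current count}, capacity: {class: G_j}
    """
    # Start from the ideal counts implied by the target proportions
    x = {c: round(n_target * t) for c, t in targets.items()}
    # Clip by what the data plus augmentation capacity can supply
    x = {c: min(x[c], current[c] + capacity[c]) for c in x}
    shortfall = n_target - sum(x.values())
    if shortfall > 0:
        # Top up classes that still have spare capacity, largest targets first
        for c in sorted(x, key=lambda c: targets[c], reverse=True):
            room = current[c] + capacity[c] - x[c]
            add = min(shortfall, room)
            x[c] += add
            shortfall -= add
            if shortfall == 0:
                break
    return x

# Both classes targeted at 50%; class "b" needs augmentation within its capacity
x = balance_counts(100, {"a": 0.5, "b": 0.5}, {"a": 70, "b": 20}, {"a": 0, "b": 40})
# Here "b" hits its capacity of 10 + 30 = 40, and the shortfall goes to "a"
x2 = balance_counts(100, {"a": 0.5, "b": 0.5}, {"a": 80, "b": 10}, {"a": 20, "b": 30})
```

        <p>When a class reaches its generation limit, constraint (19) is only satisfied approximately, which is why the formulation uses ≈ rather than strict equality.</p>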
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Main steps of method</title>
        <p>The method for analysis and formation of representative text datasets is presented in the form of three consecutive stages: preprocessing, analysis of representativeness according to ethical aspects, and representative adjustment of the dataset. Each stage consists of its own steps, which are shown in Figure 3.</p>
        <p>The input data of the method is the dataset DS for analysis, which according to (2) and (8) contains the target number of elements nʹDS, the set of ethical aspects A with subsets of classes, the target proportions of classes ATʹDS and the target numbers of elements in the classes of ethical aspects ANʹDS, respectively, as well as the trained set of models M for the ethical aspects from A, each of which was trained on a balanced sample for its ethical aspect.</p>
        <p>At stage 1, a sample of text data in D ⊂ DS is pre-processed, namely, the removal of
noninformative text fragments such as punctuation marks, numbers and special characters [18].
Removal of emoticons is not performed, as in many cases including emoticons in the analysis
improves the accuracy of machine learning models used to classify texts based on emotional or
mood content [19]. Incorrect records (empty, uninformative, etc.) are also deleted.</p>
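        <p>Stage 1 can be sketched as a small cleaning routine that removes punctuation, numbers and special characters while preserving emoticons. The emoticon list, the lowercasing and the exact rules below are illustrative assumptions; the paper's preprocessing follows [18] and [19].</p>

```python
import re

EMOTICONS = [":)", ":(", ":D", ";)", ":-)", ":-("]  # illustrative list

def preprocess(text):
    """Remove punctuation, digits and special characters; keep emoticons."""
    kept = []
    # Protect emoticons before stripping punctuation characters
    for emo in EMOTICONS:
        if emo in text:
            kept.append(emo)
            text = text.replace(emo, " ")
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # drop digits and punctuation
    text = re.sub(r"\s+", " ", text).strip().lower()  # normalize whitespace and case
    return (text + " " + " ".join(kept)).strip() if kept else text

cleaned = preprocess("Call me at 555-1234!!! :)")
```

        <p>Empty or otherwise uninformative records would simply be dropped after cleaning, e.g. by filtering out elements whose cleaned text is empty.</p>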
        <p>At stage 2, an analysis of the representativeness of the sample of textual data is carried out,
taking into account ethical aspects. First, it is necessary to vectorize and classify each element
∀d ∈ D of the data sample using separate machine learning models m ∈ M for each of the
ethical aspects Ai ∈ A . The existing proportions of ANDS and ATDS classes for each of the ethical
aspects are determined. The amount of shortage or excess of elements of each class for each of
the ethical aspects is calculated. After that, the sufficiency of the data in the sample for
augmentation is analyzed (minimum availability of samples of the relevant classes, etc.).</p>
        <p>Stage 3 involves a representative adjustment of the data sample taking into account ethical considerations. Adjustment includes removal and addition operations.</p>
        <p>The deletion operation is performed to remove redundant elements of each class for each of the
ethical aspects with minimal damage to other distributions, for which the optimization problem of
selecting redundant elements in the framework of (17), which should be removed to achieve the
target proportions of classes, is solved.</p>
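        <p>The greedy flavor of the removal step can be illustrated as follows: repeatedly delete an element whose class labels are over-represented in the largest number of ethical aspects, so that each deletion reduces several excesses at once and damages other distributions minimally. The data structures and the scoring rule are assumptions for illustration, not the paper's exact optimization procedure.</p>

```python
def select_removals(elements, aspects, excess):
    """Greedily pick element indices to delete.

    elements: list of dicts {"aspects": {aspect: class}}
    excess:   {(aspect, class): how many elements of this class must go}
    """
    excess = dict(excess)
    removals = []

    def score(e):
        # How many of this element's aspect classes are still in excess
        return sum(1 for a in aspects if excess.get((a, e["aspects"][a]), 0) > 0)

    while any(v > 0 for v in excess.values()):
        best = max(
            (i for i in range(len(elements)) if i not in removals and score(elements[i]) > 0),
            key=lambda i: score(elements[i]),
            default=None,
        )
        if best is None:
            break  # no remaining element reduces any excess
        removals.append(best)
        for a in aspects:
            key = (a, elements[best]["aspects"][a])
            if excess.get(key, 0) > 0:
                excess[key] -= 1
    return removals

# Element 0 is over-represented in both aspects, so it is removed first
elems = [
    {"aspects": {"gender": "male", "age": "20-29"}},
    {"aspects": {"gender": "female", "age": "20-29"}},
    {"aspects": {"gender": "male", "age": "30-39"}},
]
removed = select_removals(elems, ["gender", "age"], {("gender", "male"): 1, ("age", "20-29"): 1})
```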
        <p>The add operation is performed to create new items using one of the known methods, for
example, the SMOTE method [20]. Requirements are created in the form of the necessary
combination of classes of each of the ethical aspects for each new element, for which the
optimization problem of forming requirements for the missing elements is solved within the
framework of (17).</p>
        <p>The output data of the method is a text dataset Dʹ ⊂ DSʹ, which has the required volume nʹDS and
is balanced according to the required proportions ATʹDS according to the selected ethical aspects Ax
⊂ A.</p>
        <p>The steps of the method of analysis and formation of representative samples of text data make it possible to generate text samples that are non-discriminatory and unbiased and reflect a proportional representation of the samples relative to the actual demographic subgroups of the population, which will affect the accuracy and transparency of training machine learning models for solving various problems.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets for research</title>
        <p>To test the method of analysis and formation of representative samples of textual data, an input dataset was formed based on two datasets, "Cyberbullying Classification" [3] and "Cyberbully Detection Dataset" [4]. The "Cyberbullying Classification" dataset contains 46,017 tweets, labeled by type of cyberbullying into 6 classes. The "Cyberbully Detection Dataset" contains 99,989 tweets, also labeled by type of cyberbullying. Both datasets are unlabeled with respect to the gender, age group, religion, and ethnicity of the message author.</p>
        <p>To train the machine learning models that will be used to label the input dataset, datasets were used for three example ethical aspects of the fairness principle: gender, age and religion.</p>
        <p>The English-language dataset "Tweet Files for Gender Guessing" [21], which contains 34,146 unique text entries divided into two classes, female and male, with 17,073 entries in each class, was used to train the ML model for the ethical aspect of the gender of the author of the message. On the basis of the English-language dataset "CyberBullying Detection Dataset" [22], which contains 20,109 text samples, a sample was created for training the classifier and labeling the input dataset according to the religious ethical aspect. The Italian-language dataset "TAG-it Dataset Distribution" [23], which contains 21,948 text messages divided into the age classes 0-19, 20-29, 30-39, 40-49 and 50-100 years old, was translated into English and used to bring the working dataset to a representative view by age.</p>
<p>Since the classes in these datasets are imbalanced, which would negatively affect the quality of
machine learning model training, all classes in the datasets were balanced by sample count. The
final number of samples in each class of the training samples for each ethical aspect is shown in
Fig. 4.</p>
<p>As a result of this work on creating training samples, datasets balanced by the number of text
messages per class were obtained. Such datasets make it possible to correctly assess the
representativeness of working text datasets.</p>
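As an illustration, the class-balancing step described above can be sketched as a simple downsampling of every class to the size of the smallest class. This is a minimal sketch; the function name, toy data, and fixed random seed are assumptions for illustration, not the authors' implementation.

```python
import random

def balance_classes(samples_by_class, seed=42):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    target = min(len(samples) for samples in samples_by_class.values())
    return {label: rng.sample(samples, target)
            for label, samples in samples_by_class.items()}

# Hypothetical toy data standing in for the labeled tweet collections.
data = {"female": ["f%d" % i for i in range(25)],
        "male":   ["m%d" % i for i in range(40)]}
balanced = balance_classes(data)
```

After balancing, every class holds the same number of samples (here, 25), drawn without replacement from the original class.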
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Software for research</title>
<p>To study the effectiveness of the method of analysis and formation of a representative sample of
text data, a software implementation was created in the Python programming language. The
TensorFlow library (https://www.tensorflow.org/) was used to classify the cyberbullying input
dataset by gender, age, and religion. Fig. 5 shows an example of classification based on the
religious aspect of the FATE fairness principle.</p>
<p>To form the set of trained ethical machine learning models, one for each ethical aspect, various
classifier models were analyzed, and their quality was evaluated by statistical indicators such as
Accuracy, Precision, Recall, and F1-score [24] to select the best of them. Both deep learning
models, such as BERT, GPT, LSTM, and GRU, and classifiers such as Logistic Regression,
Naive Bayes, Support Vector Machines, and k-Nearest Neighbors, were studied [25]. The classifier
for each ethical aspect was then trained on the selected ML model using the corresponding
annotated dataset.</p>
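The quality indicators named above can be computed directly from predictions. Below is a minimal pure-Python sketch of Accuracy together with macro-averaged Precision, Recall, and F1-score; here the macro F1 is taken as the harmonic mean of the macro precision and macro recall, which is one common convention (not necessarily the exact formula used in the study).

```python
def classification_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged Precision, Recall and F1-score."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(
    ["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"])
```

Macro averaging weights every class equally, which matters precisely when class sizes differ, as in the datasets discussed here.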
<p>As a result, different architectures were chosen as classifiers: the classical FastForest and SVM
classifiers and the LSTM and BERT deep learning models [26]. Machine learning models such as
FastForest, SVM, LSTM, and BERT are effective tools for text classification tasks, including
determining a person's gender, religion, and age from user text posts. Classical approaches such as
FastForest and SVM have demonstrated their effectiveness in text classification: FastForest works
efficiently with large datasets and prevents overfitting, while SVM is known for its ability to
handle high-dimensional data, which is especially useful for text classification, where each word
or phrase can be represented as a separate feature [27]. Deep learning models such as LSTM and
BERT are able to recognize complex patterns in text sequences, preserving context at all stages of
analysis [28]. A distinctive feature of LSTM is its ability to retain information about previous
parts of the text, which makes this model effective for complex classification tasks where the
overall context of the message is important; studies have shown that such a model can achieve an
accuracy of up to 92% in text classification tasks [29]. The BERT model, in turn, analyzes text
bidirectionally, taking into account both the preceding and the following context of words
[30].</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Analysis of research results</title>
<p>To form a representative sample of text data by age and gender for the target class proportions,
the population of Ukraine was taken as a reference. According to the M. V. Ptukh Institute of
Demography and Social Research of the National Academy of Sciences of Ukraine
(https://idss.org.ua/forecasts/nation_pop_proj), as of July 2023 the total population of Ukraine is
estimated at 35,596,216 people, distributed across age subgroups as follows: 0-19 years -
6,659,068 people, 20-29 years - 3,623,143 people, 30-39 years - 6,022,345 people, 40-49 years -
5,431,140 people, 50-100 years - 13,860,520 people. Regarding the gender structure of the
population of Ukraine in 2023, 16,951,527 are women and 18,644,689 are men (idss.org.ua).
Note that within the scope of this work, the cisgender group is considered in the analysis of the
gender ethical aspect.</p>
<p>To study the effectiveness of the described method of analysis and formation of a representative
sample of text data, several machine learning models were trained. The results of calculating
statistical metrics such as Accuracy, Precision, Recall, and F1-score [24] of the machine learning
models for the gender, age, and religious ethical aspects are shown in Table 1.</p>
<p>Different levels of linear separability were obtained for the different aspects. For religion, the
BERT classifier, which showed the best result among the trained machine learning models for
classifying text samples by the religious ethical aspect, found the data to be well separated; for
gender, the LSTM classifier, which showed the best performance compared to the other models,
found the data moderately separable; and for age, using the SVM classifier, the data proved
poorly separable.</p>
<p>In addition, it was found that the dataset is not representative: the classes of the various
ethical aspects contain numbers of text samples that do not correspond to the proportions of the
demographic subgroups of the population of Ukraine, so they need balancing to acquire a
representative form.</p>
<p>Therefore, according to the steps of the method of analysis and formation of a representative
sample of text data, the sample requires data augmentation to become representative. For this, an
optimization problem must be solved to correctly remove redundant elements from each class for
each of the ethical aspects, followed by augmentation of the data sample to the target
requirements (number of elements and class proportions).</p>
<p>Table 2 presents the percentages of samples by age in the sample of textual data and of
individuals of the population in the age demographic subgroups, and also gives the new
distribution of the sample classes if only one ethical aspect, age, were taken into account.</p>
        <p>Table 3 presents the percentages of samples by gender in the sample of textual data and of
individuals of the population in the gender demographic subgroups, and also gives the new
distribution of the sample classes if only one ethical aspect, gender, were taken into account.</p>
<p>The deviation of the sample distributions by classes of the age ethical aspect of the dataset,
transformed according to the created method, from the ideal representative distribution was:
minimum 0.01%, maximum 0.04%, average 0.02%; for the gender ethical aspect: minimum 0.03%,
maximum 0.03%, average 0.03%.</p>
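These deviation statistics can be computed directly from the achieved and target class shares, as in the following minimal sketch (the class names and share values are hypothetical; shares are expressed in percentage points):

```python
def distribution_deviation(actual_shares, target_shares):
    """Minimum, maximum and mean absolute deviation (in percentage
    points) of achieved class shares from target representative shares."""
    deviations = [abs(actual_shares[c] - target_shares[c])
                  for c in target_shares]
    return min(deviations), max(deviations), sum(deviations) / len(deviations)

lo, hi, avg = distribution_deviation({"men": 9.65, "women": 9.05},
                                     {"men": 9.67, "women": 9.04})
```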
        <p>However, the optimization task of forming a representative sample of textual data is a
multi-criteria one, in which the criteria are the formation of a sample based on age and gender
ethical aspects, so the goal is to minimize the deviation between the current and desired class
ratios, taking into account the limitations on the number of samples and the possibility of
generating new data. As a result of solving the optimization problem for the formation of a
representative sample by age and gender ethical aspects on the example of demographic
subgroups of the population of Ukraine, a representative sample of text data was obtained by
augmentation, the balance of classes of which is presented in Table 4, Fig. 6 and Fig. 7.</p>
        <table-wrap id="table4">
          <label>Table 4</label>
          <caption>
            <p>Percentage ratio of demographic groups by gender and age in the population of Ukraine and in the text sample</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Age demographic subgroups</th>
                <th>0-19 years</th>
                <th>20-29 years</th>
                <th>30-39 years</th>
                <th>40-49 years</th>
                <th>50-100 years</th>
              </tr>
            </thead>
            <tbody>
              <tr><td colspan="6">Percentage ratio of demographic groups by gender and age in the population of Ukraine</td></tr>
              <tr><td>Men</td><td>9.67%</td><td>5.64%</td><td>8.96%</td><td>7.79%</td><td>15.56%</td></tr>
              <tr><td>Women</td><td>9.04%</td><td>4.53%</td><td>7.96%</td><td>7.47%</td><td>23.38%</td></tr>
              <tr><td colspan="6">Percentage ratio of demographic groups by gender and age in the text sample</td></tr>
              <tr><td>Men</td><td>9.65%</td><td>5.62%</td><td>8.94%</td><td>7.80%</td><td>15.57%</td></tr>
              <tr><td>Women</td><td>9.05%</td><td>4.57%</td><td>7.97%</td><td>7.45%</td><td>23.38%</td></tr>
              <tr><td colspan="6">The resulting deviation from a representative distribution</td></tr>
              <tr><td>Men</td><td>0.02%</td><td>0.02%</td><td>0.02%</td><td>0.01%</td><td>0.02%</td></tr>
              <tr><td>Women</td><td>0.01%</td><td>0.04%</td><td>0.01%</td><td>0.02%</td><td>0.00%</td></tr>
            </tbody>
          </table>
        </table-wrap>
<p>The deviation of the sample distributions by classes of the age and gender ethical aspects
simultaneously, for the dataset transformed according to the created method, from the ideal
representative distribution was: minimum 0.00%, maximum 0.04%, average 0.02%.</p>
<p>So, as a result of performing the steps of the method of analysis and formation of representative
samples of text data, a text sample was formed that is non-discriminatory and unbiased and
reflects a composition proportional to the real demographic subgroups of the population of
Ukraine.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
<p>Thus, the goal of the study was achieved through the development of the method for the analysis
and formation of representative text datasets according to the FATE fairness principle for
subject areas.</p>
<p>To investigate the effectiveness of the method of analysis and formation of a representative
text dataset, software was created that uses machine learning models to classify texts
according to various ethical aspects: age, gender, religion, ethnicity, etc. To classify the
text samples by the age ethical aspect, SVM was used; LSTM was used for gender and BERT
for religion, these being the models with the best statistical metric scores.</p>
<p>As a result of the practical application of the developed method, it was established that the
available dataset is not representative compared to objective demographic statistics, so a
multi-criteria optimization problem was solved and the dataset was transformed into a
representative one in terms of the age and gender ethical aspects. The obtained deviations of
the sample distributions by classes of the ethical aspects of the transformed dataset from the
ideal representative distribution were: minimum 0.00%, maximum 0.04%, average 0.02%, with
an initial dataset volume of 47,692 elements, a minimum initial class size of 1,007 elements,
and a maximum initial class size of 28,112 elements. The demonstrated efficiency proves that
the developed method allows analyzing the representativeness of text datasets and bringing
them to a representative form according to various aspects of the FATE fairness
principle.</p>
<p>The obtained results contribute to improving the representativeness of text datasets and the
fair and unbiased representation of demographic groups in them, which increases trust in
decisions made by artificial intelligence and aligns with the goals SDG 3 (good health and
well-being), SDG 4 (quality education), and SDG 16 (peace, justice, and strong institutions).</p>
<p>Further plans for improving the method of analysis and formation of representative samples
of text data include forming a sample that is non-discriminatory not only in the number of
samples, but also in content, by detecting and removing text samples that contain a biased
attitude towards representatives of various demographic subgroups, according to the ethical
aspects of the FATE fairness principle.</p>
<p>Prospects for further research also include applying the developed method to adjust
textual datasets of subject areas and using them to solve applied problems such as the
detection and classification of cyberbullying, analysis of the emotional tonality of messages,
and detection of the physical and mental state of users based on their posts. Measuring the
performance gains from using ethically balanced text datasets will provide feedback for
improving the developed method.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checking and DeepL Translate for translating some phrases into English. After using these
tools/services, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Shah, N. Sureja, A Comprehensive Review of Bias in Deep Learning Models: Methods, Impacts, and Future Directions, Arch Computat Methods Eng (2024). doi:10.1007/s11831-024-10134-2</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Y. Yusyn, N. Rybachok, Dictionary-based deterministic method of generation of text corpora, Computer Systems and Information Technologies 3 (2024) 67-73. doi:10.31891/csit-2024-3-9</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Kaggle.com, Cyberbullying Classification, 2021. URL: https://www.kaggle.com/datasets/andrewmvd/cyberbullyingclassification?resource=download</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Kaggle.com, CyberBullying Detection Dataset, 2024. URL: https://www.kaggle.com/datasets/sayankr007/cyber-bullying-data-for-multi-labelclassification</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] B. C. Stahl, D. Eke, The ethics of ChatGPT - Exploring the ethical issues of an emerging technology, International Journal of Information Management 74 (2024) 102700. doi:10.1016/j.ijinfomgt.2023.102700</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Manziuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mazurets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pylypiak</surname>
          </string-name>
          ,
          <article-title>Structural alignment method of conceptual categories of ontology and formalized domain</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3003</volume>
          (
          <year>2021</year>
          ) pp.
          <fpage>11</fpage>
          -
          <lpage>22</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mazurets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smolarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gromaszek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Smailova</surname>
          </string-name>
          ,
          <article-title>Information technology for creation of semantic structure of educational materials</article-title>
          ,
          <source>Proceedings of SPIE - The International Society for Optical Engineering</source>
          <volume>11176</volume>
          (
          <year>2019</year>
          ), pp.
          <fpage>147</fpage>
          -
          <lpage>156</lpage>
          . doi:10.1117/12.2537064
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L. K. H.</given-names>
            <surname>Clemmensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Rune</surname>
          </string-name>
          ,
          <article-title>Data Representativity for Machine Learning and AI Systems</article-title>
          ,
          <year>2022</year>
          . URL: https://ar5iv.labs.arxiv.org/html/2203.04706
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dablain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <article-title>Towards a holistic view of bias in machine learning: bridging algorithmic fairness and imbalanced learning</article-title>
          .
          <source>Discov Data</source>
          <volume>2</volume>
          ,
          <issue>4</issue>
          (
          <year>2024</year>
          ).
          doi:10.1007/s44248-024-00007-1
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] R. K. E. Bellamy, AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias, IBM Journal of Research and Development 63 (4/5) (2019) 1-15. doi:10.1147/JRD.2019.2942287</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Lalor, Y. Yang, K. Smith, N. Forsgren, A. Abbasi, Benchmarking Intersectional Biases in NLP, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2022, pp. 3598-3609. doi:10.18653/v1/2022.naacl-main.263</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Reji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Dbias: detecting biases and ensuring fairness in news articles</article-title>
          .
          <source>International Journal of Data Science and Analytics</source>
          , volume
          <volume>17</volume>
          (
          <year>2024</year>
          ), pp.
          <fpage>39</fpage>
          -
          <lpage>59</lpage>
          . doi:10.1007/s41060-022-00359-4
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>Addressing Both Statistical and Causal Gender Fairness in NLP Models</article-title>
          .
          <source>In Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>