<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Method for analysis and formation of representative text datasets ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Olena</forename><surname>Sobko</surname></persName>
							<email>olenasobko.ua@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>11, Institutes str</addrLine>
									<postCode>29016</postCode>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olexander</forename><surname>Mazurets</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>11, Institutes str</addrLine>
									<postCode>29016</postCode>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maryna</forename><surname>Molchanova</surname></persName>
							<email>m.o.molchanova@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>11, Institutes str</addrLine>
									<postCode>29016</postCode>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Iurii</forename><surname>Krak</surname></persName>
							<email>yuri.krak@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>64/13, Volodymyrska str</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Glushkov Cybernetics Institute</orgName>
								<address>
									<addrLine>Kyiv, 40, Glushkov ave</addrLine>
									<postCode>03187</postCode>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olexander</forename><surname>Barmak</surname></persName>
							<email>barmak@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Khmelnytskyi National University</orgName>
								<address>
									<addrLine>11, Institutes str</addrLine>
									<postCode>29016</postCode>
									<settlement>Khmelnytskyi</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">International Workshop on Advanced Applied Information Technologies</orgName>
								<address>
									<addrLine>December 5</addrLine>
									<postCode>2024</postCode>
									<settlement>Khmelnytskyi, Zilina</settlement>
									<country>Ukraine, Slovakia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Method for analysis and formation of representative text datasets ⋆</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AC6C29D3C583A95AE3965900D52B0ABA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>NLP, data ethical correctness, ethical principles, non-discrimination, text datasets representative</term>
					<term>0000-0001-5371-5788 (O. Sobko)</term>
					<term>0000-0002-8900-0650 (O. Mazurets)</term>
					<term>0000-0001-9810-936X (M. Molchanova)</term>
					<term>0000-0002-8043-0785 (I. Krak)</term>
					<term>0000-0003-0739-9678 (O. Barmak)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paper is devoted to the development and validation of a method for the analysis and formation of representative text datasets according to the FATE fairness principle for subject areas. The method analyzes the representativeness of a dataset with respect to ethical aspects and then adjusts the dataset to make it representative with respect to those aspects. During adjustment, an optimization problem is solved both to select redundant elements for removal and to formulate, for each element to be created by data augmentation, the required class membership in every ethical aspect. To investigate the effectiveness of the method, software was created that uses machine learning models to classify texts according to various ethical aspects: age, gender, religion, ethnicity, etc. The deviations of the class distributions by ethical aspect in the dataset transformed by the method, compared with the ideal representative distribution, were: minimum 0.00%, maximum 0.04%, average 0.02%. The results contribute to improving the representativeness of text datasets and the fair and unbiased representation of demographic groups in them, which increases trust in decisions made by artificial intelligence.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Numerous solutions based on artificial intelligence (AI) are being actively developed to solve tasks that people face every day. The results produced by an AI model depend on the datasets on which it was trained; in other words, the content of these datasets directly affects the final result. Lack of transparency regarding the sources and characteristics of the data used to train AI algorithms reduces confidence in the results obtained, because users are often unable to assess the potential biases or discriminatory elements built into these algorithms. Insufficient awareness of the content of training datasets increases the risk of unfair or inaccurate decisions, which can have serious consequences for individuals and for society as a whole <ref type="bibr" target="#b0">[1]</ref>.</p><p>Means for evaluating the representativeness of a text dataset in accordance with the principles of ethical non-discrimination are currently lacking. This is especially relevant for socially important and sensitive tasks related to SDG3 (good health and well-being), SDG4 (quality education) and SDG16 (peace, justice, and strong institutions), for example, detecting cyberbullying or determining the emotional state of people from text messages. 
The lack of attention to ethical components when creating and using datasets leads to bias in algorithms, which negatively affects the fairness and reliability of the decisions made <ref type="bibr" target="#b1">[2]</ref>.</p><p>Well-known datasets for training neural networks, for example <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>, are actively used by researchers because they contain a large amount of data. However, their authors did not validate them for representativeness according to the principle of fairness, so using such datasets to train AI algorithms may violate ethical principles and, hence, reduce the reliability of the decisions made.</p><p>The representativeness of the data in datasets not only affects the accuracy of the results and models, but is also closely related to the principles of FATE (Fairness, Accountability, Transparency, Ethics) in the use of data and the development of AI technologies. If a dataset does not adequately represent all social, demographic, or cultural groups, it can lead to discriminatory patterns that prioritize one group over another and are therefore unfair. Representativeness of datasets according to the FATE fairness principle is achieved by correct balancing with respect to various ethical aspects: gender, religion, age, etc. <ref type="bibr" target="#b4">[5]</ref>.</p><p>The main contribution of the paper is the development and validation of an approach to the analysis and formation of representative text datasets according to the FATE fairness principle for subject areas.</p><p>Section 2 reviews works related to the topic of the study, namely the formation of representative text samples and the impartial representation of demographic groups according to the principle of fairness. 
Section 3 describes the method for analysis and formation of representative text datasets and the datasets used for the experimental study of its effectiveness. Section 4 contains an experimental study. Section 5 presents the results and discussion. Section 6 concludes the work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Many works have been devoted to the representativeness of text samples and the fair and unbiased representation of demographic groups in them, since representativeness, fairness and impartiality are important for creating ethical and fair machine learning models <ref type="bibr" target="#b5">[6]</ref>. Natural language processing tools are widely used for this purpose <ref type="bibr" target="#b6">[7]</ref>. Recently, authors have paid increasing attention to the representativeness of data in samples, but the current state suggests that datasets have gaps in the representation of gender and race, and the complex nature of demographic variables makes classification difficult and inconsistent. The representativeness of data in sets that include people with disabilities and the elderly is also considered. The authors recommend increasing representativeness by adding samples for underrepresented groups, including by collecting additional data or using synthetic data methods to improve the representation of minorities and people with disabilities.</p><p>In <ref type="bibr" target="#b7">[8]</ref>, the authors raise the problem of sample representativeness in the context of machine learning and artificial intelligence, emphasizing the need for accurate representation of population data. The main strategy that the authors propose for achieving high-quality models is the use of stratified samples, which reduce the variability between subgroups and accurately reflect the proportions between different categories in the population.</p><p>The authors of the study <ref type="bibr" target="#b8">[9]</ref> consider biases arising both from class imbalances in the data and from sensitive (protected) characteristics such as race or gender. 
The approach increases model accuracy by balancing classes and reduces dependence on sensitive features, which improves group fairness.</p><p>IBM researchers have developed the open-source AI Fairness 360 toolkit for evaluating and reducing discrimination in machine learning models <ref type="bibr" target="#b9">[10]</ref>. The main purpose of the toolkit is to detect bias based on attributes such as race, gender or age, and to provide methods for the representation of all given social groups at different stages of model development.</p><p>The article <ref type="bibr" target="#b10">[11]</ref> highlights the problem of intersectional biases in natural language processing (NLP) models, namely the unrepresentative and biased representation of different groups of people in text datasets. The results showed that although existing debiasing methods (for example, for BERT or RoBERTa) preserve the predictive accuracy of the models well, their ability to reduce intersectional biases is limited.</p><p>The authors of <ref type="bibr" target="#b11">[12]</ref> propose a specialized machine learning model to detect and minimize bias in text data, in particular in news articles. The authors claim that their approach is effective because of deep models and transformer architectures that are able to detect and correct biases at different stages of machine learning.</p><p>The article <ref type="bibr" target="#b12">[13]</ref> addresses the problem of gender bias in natural language processing models using two main approaches: statistical and causal fairness. The researchers use techniques such as counterfactual data augmentation for causal debiasing, as well as resampling and reweighting methods for statistical debiasing. 
The results showed that the combination of these techniques significantly reduces bias in the models according to both statistical and causal metrics.</p><p>Article <ref type="bibr" target="#b13">[14]</ref> is devoted to the problem of intersectional bias in the predictions of machine learning models, in particular deep neural networks. The researchers propose a new method based on the Apriori algorithm for automatically detecting biased subgroups in data. It allows efficient generation of frequent subgroups and calculation of fairness metrics for them.</p><p>In <ref type="bibr" target="#b14">[15]</ref>, the authors identify and classify bias in natural language processing using transformer models such as BERT. The authors explore different ways to identify bias, including identifying social characteristics such as gender, race, religion, and sexual orientation.</p><p>The study <ref type="bibr" target="#b15">[16]</ref> examines the problem of cyberbullying, which threatens people on the basis of different characteristics, such as religion, age, ethnicity, and gender. The dataset used by the authors has been modified with ethical considerations in mind, which supports responsible AI.</p><p>The cited works show that the formation of representative and unbiased samples is a relevant research area. However, most of the works are devoted either to the detection of bias or to the analysis of the representativeness of data samples, whereas data samples must also be modified to achieve compliance with FATE principles.</p><p>In summary, it is possible to highlight the features of the modern approach applied to the development of AI models (Fig. <ref type="figure" target="#fig_0">1</ref>). However, this approach does not take into account existing ethical principles and the non-discriminatory, representative presentation of existing population subgroups, which should be applied to obtain AI models. 
The purpose of the work is to ensure compliance with the ethical aspects (gender, religion, age, etc.) of the FATE fairness principle <ref type="bibr" target="#b4">[5]</ref> for training datasets by creating a method for the analysis and formation of text datasets that are representative with respect to the specified aspects. To achieve this goal, it is necessary to propose a method that accomplishes the following research tasks:</p><p>• to develop an approach to the analysis and formation of relevant representative datasets according to the FATE fairness principle for subject areas; • to investigate the effectiveness of the proposed approach by applying it to the analysis of a text dataset and bringing that dataset to a representative form according to the aspects of the FATE fairness principle: gender, age and religion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method for analysis and formation of representative text datasets</head><p>In contrast to the existing approach to training AI models (see Fig. <ref type="figure" target="#fig_0">1</ref>), the study proposes a new approach (Fig. <ref type="figure" target="#fig_1">2</ref>), which is intended to ensure the representativeness and ethical correctness of the datasets used for training AI models. To implement the proposed approach, we present: the information model and the formulation of the task of forming representative text datasets as an optimization task; the steps of the method for analysis and formation of representative samples; a way to obtain a typical ML model for an ethical aspect; and a description of the composition of the datasets used in the study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Information model</head><p>The problem of obtaining a representative, ethically unbiased text dataset can be presented in the framework of an information model of the following form:</p><formula xml:id="formula_0">\{DS, DS', C, A, M, F\},<label>(1)</label></formula><p>where DS is the text dataset for analysis and correction, DSʹ is the text dataset after correction, C is the set of classes of the subject domain of the dataset, A is the set of ethical aspects, M is the set of trained machine learning models (a separate model for each ethical aspect), and F is the objective function minimizing the deviation between current and desired ratios for all ethical aspects.</p><p>In (1), the initial dataset DS and the corrected dataset DSʹ can be represented as:</p><formula xml:id="formula_1">DS = \{D \cup \mathit{Metadata}\},<label>(2)</label></formula><formula xml:id="formula_2">DS' = \{D' \cup \mathit{Metadata}'\}, (<label>3</label></formula><formula xml:id="formula_3">)</formula><p>where D is the set of elements of the DS dataset, Metadata is the set of metadata of the DS dataset, Dʹ is the set of elements of the DSʹ dataset, Metadataʹ is the set of metadata of the DSʹ dataset.</p><p>Each element of the set D in (2) and each element of the set Dʹ in (3) is a tuple of the following form:</p><formula xml:id="formula_4">d = d' = (\mathit{text}, c_x, AC_x),<label>(4)</label></formula><p>where the attribute text is the textual content of element d or dʹ; cx is the class of the subject area of the dataset to which the element belongs, cx ∈ C; ACx is the set of classes to which the dataset element belongs for the ethical aspects.</p><p>Thus, in (4) cx and ACx are the labeling of the content of the text element. 
In (4), the set of classes to which an element of the dataset DS or DSʹ from (1) belongs for the ethical aspects, Ax, is presented in the form of a tuple:</p><formula xml:id="formula_5">A_x = (a_{1x}, a_{2x}, \ldots, a_{kx}),<label>(5)</label></formula><p>where ax are the classes to which the element belongs for the ethical aspects; k is the number of ethical aspects to be analyzed, k = |Ax|.</p><p>At the same time, in (5) according to (1) Ax ⊂ A, and the classes to which dataset elements belong for the ethical aspects are elements of the corresponding sets, unique for each of the ethical aspects:</p><formula xml:id="formula_6">a_{1,x} \in A_1, \; a_{2,x} \in A_2, \; \ldots, \; a_{k,x} \in A_k,<label>(6)</label></formula><formula xml:id="formula_7">A_1 \cup A_2 \cup \ldots \cup A_k = A.<label>(7)</label></formula><p>The Metadata set of the DS dataset in (2) includes:</p><formula xml:id="formula_8">\mathit{Metadata}_{DS} = \{n_{DS}, AN_{DS}, AT_{DS}, n'_{DS}, AN'_{DS}, AT'_{DS}\}, (<label>8</label></formula><formula xml:id="formula_9">)</formula><p>where nDS is the number of elements in D, nDS = |D|; ANDS is the set of quantities of dataset elements belonging to each class of each ethical aspect from Ax; ATDS is the set of available proportions of elements for each class relative to the total for each ethical aspect from Ax; nʹDS is the target number of elements in Dʹ; ANʹDS is the set of target quantities of dataset elements belonging to each class of each ethical aspect from Ax; ATʹDS is the set of target proportions of elements for each class relative to the total amount for each ethical aspect from Ax.</p><p>At the same time, in (8), each element anDS,i of the set ANDS corresponds to a separate i-th ethical aspect and is represented by a tuple of the following form:</p><formula xml:id="formula_10">an_{DS,i} = (n_{DS,i,1}, n_{DS,i,2}, \ldots, n_{DS,i,j}, \ldots, n_{DS,i,k}),<label>(9)</label></formula><p>where nDS,i,1 is the number of elements in the dataset of the 1st class of the i-th ethical aspect, 
nDS,i,2 is the number of elements in the dataset of the 2nd class of the i-th ethical aspect, nDS,i,j is the number of elements in the dataset of the j-th class of the i-th ethical aspect, k is the number of classes of the i-th ethical aspect.</p><p>Similarly to (<ref type="formula" target="#formula_10">9</ref>), in (<ref type="formula" target="#formula_8">8</ref>) the proportions of the elements atDS,i of the i-th ethical aspect are represented by a tuple of the following form:</p><formula xml:id="formula_11">at_{DS,i} = (t_{DS,i,1}, t_{DS,i,2}, \ldots, t_{DS,i,j}, \ldots, t_{DS,i,k}),<label>(10)</label></formula><p>where tDS,i,1 is the ratio of the number of elements in the dataset of the 1st class of the i-th ethical aspect to the total number of elements in the dataset, tDS,i,2 is the ratio of the number of elements in the dataset of the 2nd class of the i-th ethical aspect to the total number of elements in the dataset, tDS,i,j is the ratio of the number of elements in the dataset of the j-th class of the i-th ethical aspect to the total number of elements in the dataset.</p><p>At the same time, for the values ( <ref type="formula" target="#formula_10">9</ref>) and (10) in accordance with (<ref type="formula" target="#formula_8">8</ref>) for each i-th ethical aspect, the following equalities hold:</p><formula xml:id="formula_12">n_{DS,i,1} + n_{DS,i,2} + \ldots + n_{DS,i,k} = n_{DS},<label>(11)</label></formula><formula xml:id="formula_13">t_{DS,i,1} + t_{DS,i,2} + \ldots + t_{DS,i,k} = 1.<label>(12)</label></formula><p>In contrast to (<ref type="formula" target="#formula_8">8</ref>), the set Metadataʹ of the DSʹ dataset in (3) includes:</p><formula xml:id="formula_14">\mathit{Metadata}'_{DS} = \{n''_{DS}, AN''_{DS}, AT''_{DS}\}, (<label>13</label></formula><formula xml:id="formula_15">)</formula><p>where nʹʹDS is the number of elements in Dʹ actually obtained as a result of adjustment, nʹʹDS = |Dʹ|; ANʹʹDS is the set actually obtained as a result of adjusting the quantities of dataset 
elements belonging to each class of each ethical aspect from Ax; ATʹʹDS is the set actually obtained as a result of adjusting the proportions of elements for each class relative to the total number for each ethical aspect from Ax.</p><p>Thus, in ( <ref type="formula" target="#formula_8">8</ref>) and ( <ref type="formula" target="#formula_14">13</ref>), ( <ref type="formula" target="#formula_10">9</ref>) and ( <ref type="formula" target="#formula_12">11</ref>) hold for ANʹDS and ANʹʹDS, and ( <ref type="formula" target="#formula_11">10</ref>) and ( <ref type="formula" target="#formula_13">12</ref>) hold for ATʹDS and ATʹʹDS.</p><p>Thus, according to (4), ( <ref type="formula" target="#formula_6">6</ref>) and (<ref type="formula" target="#formula_7">7</ref>), the text dataset D has n = nDS = |D| elements and can be presented in the form:</p><formula xml:id="formula_16">D = \{d_1, d_2, \ldots, d_n\}, \quad d_i = (\mathit{text}_i, c_i, A_1, A_2, \ldots, A_m), \quad i = \overline{1, n},<label>(14)</label></formula><p>where C = {c1, c2, …, ck}, k is the number of classes of dataset D, and m is the number of ethical aspects.</p><p>According to ( <ref type="formula" target="#formula_6">6</ref>) -( <ref type="formula" target="#formula_11">10</ref>), the solution of the problem is aimed at obtaining the dataset Dʹ, which contains the total number of elements nʹ = nʹDS = |Dʹ|, quantitatively balanced according to the ethical aspects Ai from the set of ethical aspects A:</p><formula xml:id="formula_17">A = \{A_1, A_2, \ldots, A_m\}, \quad A_i = (C_i, T_{ij}), \quad i = \overline{1, m},<label>(15)</label></formula><p>where each aspect Ai contains classes Ci and target proportions Tij for each class of Ci; Ci is the set of classes of the ethical aspect Ai, Ci = {c1, c2, …, cj}; j is the number of classes of the ethical aspect Ai.</p><p>To balance the dataset for each ethical aspect, it is necessary to use trained or train an appropriate number of 
classifier models, which can be deep learning models (for example, BERT, LSTM, GRU) or classical machine learning models such as Logistic Regression, Naive Bayes, Support Vector Machines, k-Nearest Neighbors, etc. <ref type="bibr" target="#b16">[17]</ref>. According to (1), the set of trained classifier models M is presented in the form:</p><formula xml:id="formula_18">M = \{M_1, M_2, \ldots, M_m\}, \quad m = |A|.<label>(16)</label></formula><p>Thus, within the framework of the proposed information model, it is necessary to perform the transformation D ⇒ Dʹ under the condition of maximal correspondence nʹʹDS → nʹDS, ANʹʹDS → ANʹDS and ATʹʹDS → ATʹDS.</p></div>
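<div xmlns="http://www.tei-c.org/ns/1.0"><p>The metadata in (8) to (12) can be illustrated with a minimal Python sketch. It is not the authors' software: the class name Element, the function metadata, the aspect names and the four sample records are hypothetical and chosen only for illustration. The sketch assumes that every element already carries one class label per ethical aspect and that the dataset is not empty; it then derives nDS, the class counts ANDS and the class proportions ATDS.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Element:
    """One dataset element d = (text, c_x, AC_x), following (4)."""
    text: str
    subject_class: str    # c_x, class of the subject area
    aspect_classes: dict  # AC_x: aspect name -> class label (hypothetical layout)


def metadata(elements, aspects):
    """Return (n_DS, AN_DS, AT_DS) for a non-empty list of elements."""
    n = len(elements)
    # AN_DS: per aspect, how many elements fall into each class, as in (9)
    counts = {a: Counter(e.aspect_classes[a] for e in elements) for a in aspects}
    # AT_DS: per aspect, the share of each class in the whole dataset, as in (10)
    props = {a: {c: k / n for c, k in counts[a].items()} for a in aspects}
    return n, counts, props


# Illustrative records only; they are not taken from the paper's datasets.
sample = [
    Element("text one", "bullying", {"gender": "female", "age": "20-29"}),
    Element("text two", "neutral", {"gender": "male", "age": "20-29"}),
    Element("text three", "neutral", {"gender": "female", "age": "30-39"}),
    Element("text four", "bullying", {"gender": "female", "age": "20-29"}),
]
n_ds, an_ds, at_ds = metadata(sample, ["gender", "age"])
```
]]></code></p><p>In this sketch, the counts of each aspect sum to nDS as in (11), and the proportions of each aspect sum to 1 as in (12).</p></div>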
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Idea of the approach</head><p>The study proposes to reduce the task of building a representative, ethically unbiased dataset to a multi-criteria optimization task. The optimization task consists in minimizing the deviation between the current and desired class ratios, taking into account the limits on the number of samples in the classes and the possibilities of generating synthetic data.</p><p>Input data: text dataset DS, set of ethical aspects A, requirements for the representative distribution of DSʹ.</p><p>The goal of the problem: to create a representative sample for all ethical aspects that achieves the target class proportions for each ethical aspect, D ⇒ Dʹ.</p><p>Variables: xij is the number of samples of class Cj in aspect Ai after removal and augmentation.</p><p>The objective function F is the minimization of the deviation between the current and desired ratios for all ethical aspects simultaneously, taking into account constraints ( <ref type="formula" target="#formula_21">18</ref>) -( <ref type="formula" target="#formula_24">21</ref>):</p><formula xml:id="formula_19">F = \arg\min \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left| \frac{x_{ij}}{n'} - T_{ij} \right|. 
(<label>17</label></formula><formula xml:id="formula_20">)</formula><p>Constraints of the task: 1) the sum of all class samples within one aspect is equal to the target number of samples for this aspect (4):</p><formula xml:id="formula_21">\sum_{j=1}^{n_i} x_{ij} = n', \quad \forall i \in \{1, 2, \ldots, m\},<label>(18)</label></formula><p>where ni is the number of classes in the aspect Ai; 2) the number of samples for each class should correspond to the target proportion of classes:</p><formula xml:id="formula_22">\frac{x_{ij}}{n'} \approx T_{ij}, \quad \forall i \in \{1, 2, \ldots, m\}, \; \forall j \in \{1, 2, \ldots, n_i\};<label>(19)</label></formula><p>3) the estimated number of samples cannot be negative:</p><formula xml:id="formula_23">x_{ij} \geq 0, \quad \forall i \in \{1, 2, \ldots, m\}, \; \forall j \in \{1, 2, \ldots, n_i\};<label>(20)</label></formula><p>4) the ability to add new samples should match the ability to generate new data for each class and aspect:</p><formula xml:id="formula_24">x_{ij} \leq G_{ij}, \quad \forall i \in \{1, 2, \ldots, m\}, \; \forall j \in \{1, 2, \ldots, n_i\}, (<label>21</label></formula><formula xml:id="formula_25">)</formula><p>where Gij is the maximum possible number of samples of class Cj in aspect Ai that can be added.</p><p>Based on the optimization task (<ref type="formula" target="#formula_19">17</ref>) formulated for forming a representative dataset, we present the steps of the method for analysis and formation of representative text datasets.</p></div>
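<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a hedged illustration of (17) to (21), the following Python sketch evaluates the objective for a candidate set of class counts and checks the hard constraints. It only scores a candidate; it does not solve the optimization problem, and it is not the authors' implementation. The function names objective and feasible, the dictionary layout and the numbers are assumptions made for the example. The bound Gij is applied to xij exactly as written in (21).</p>
<p><code lang="python"><![CDATA[
```python
def objective(x, targets, n_target):
    """Sum over aspects i and classes j of |x_ij / n' - T_ij|, as in (17)."""
    return sum(
        abs(x[aspect].get(cls, 0) / n_target - t)
        for aspect, classes in targets.items()
        for cls, t in classes.items()
    )


def feasible(x, n_target, capacity):
    """Check constraints (18), (20) and (21) for candidate counts x."""
    for aspect, counts in x.items():
        if sum(counts.values()) != n_target:        # (18): counts sum to n'
            return False
        for cls, k in counts.items():
            if k < 0 or k > capacity[aspect][cls]:  # (20) and (21)
                return False
    return True


# Hypothetical numbers: one aspect, two classes, target size n' = 100.
targets = {"gender": {"female": 0.5, "male": 0.5}}
capacity = {"gender": {"female": 80, "male": 60}}
before = {"gender": {"female": 75, "male": 25}}
after = {"gender": {"female": 50, "male": 50}}

f_before = objective(before, targets, 100)
f_after = objective(after, targets, 100)
ok_after = feasible(after, 100, capacity)
```
]]></code></p><p>Here the unbalanced candidate has a total deviation of 0.5 and the balanced candidate has a deviation of 0, which is the direction in which the method adjusts the dataset.</p></div>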
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Main steps of method</head><p>The method for analysis and formation of representative text datasets consists of three consecutive stages: preprocessing, analysis of representativeness according to ethical aspects, and representative adjustment of the dataset. Each stage consists of its own steps, which are shown in Figure <ref type="figure" target="#fig_2">3</ref>.</p><p>The input data of the method are: the dataset DS for analysis, which according to ( <ref type="formula" target="#formula_1">2</ref>) and ( <ref type="formula" target="#formula_8">8</ref>) contains the target number of elements nʹDS; the set of ethical aspects A with subsets of classes; the target class proportions ATʹDS and the target numbers of elements ANʹDS in the classes of the ethical aspects; and the set of models M trained for each ethical aspect from A on samples balanced for that aspect.</p><p>At stage 1, the text data D ⊂ DS are pre-processed: non-informative text fragments such as punctuation marks, numbers and special characters are removed <ref type="bibr" target="#b17">[18]</ref>. Emoticons are not removed, since in many cases including emoticons in the analysis improves the accuracy of machine learning models used to classify texts by emotional or mood content <ref type="bibr" target="#b18">[19]</ref>. Incorrect records (empty, uninformative, etc.) are also deleted.</p><p>At stage 2, the representativeness of the text data sample is analyzed with respect to the ethical aspects. First, each element d ∈ D of the data sample is vectorized and classified using a separate machine learning model m ∈ M for each of the ethical aspects Ai ∈ A. The existing class counts ANDS and class proportions ATDS for each of the ethical aspects are determined. 
The amount of shortage or excess of elements of each class for each of the ethical aspects is calculated. After that, the sufficiency of the data in the sample for augmentation is analyzed (minimum availability of samples of the relevant classes, etc.).</p></div>
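<div xmlns="http://www.tei-c.org/ns/1.0"><p>The shortage or excess computed at stage 2 can be sketched as the difference between the current class counts and the target counts implied by the target proportions and the target size. The snippet below is an illustration under stated assumptions, not the authors' code: the function name surplus, the rounding of target counts to integers and the example numbers are all hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
def surplus(counts, target_props, n_target):
    """Current count minus target count for every class of every aspect.

    A positive value is an excess (candidates for removal);
    a negative value is a shortage (to be covered by augmentation).
    """
    result = {}
    for aspect, props in target_props.items():
        result[aspect] = {
            # Target count is rounded to an integer (an assumption of this sketch).
            cls: counts[aspect].get(cls, 0) - round(p * n_target)
            for cls, p in props.items()
        }
    return result


# Hypothetical counts for one aspect and a target size of 80 elements.
current = {"age": {"0-19": 40, "20-29": 10, "30-39": 30}}
wanted = {"age": {"0-19": 0.5, "20-29": 0.25, "30-39": 0.25}}
gaps = surplus(current, wanted, 80)
```
]]></code></p><p>In this example the class 20-29 is short by 10 elements and the class 30-39 has 10 elements in excess, while the class 0-19 already matches its target.</p></div>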
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Stage 3 involves a representative adjustment of the data sample to take into account ethical considerations. Adjustments include removing and adding elements.</p><p>The deletion operation removes redundant elements of each class for each of the ethical aspects with minimal damage to the other distributions; for this, the optimization problem of selecting the redundant elements to be removed to achieve the target class proportions is solved within the framework of (<ref type="formula" target="#formula_19">17</ref>).</p><p>The add operation creates new elements using one of the known methods, for example, the SMOTE method <ref type="bibr" target="#b19">[20]</ref>. Requirements are created in the form of the necessary combination of classes of each of the ethical aspects for each new element; for this, the optimization problem of forming requirements for the missing elements is solved within the framework of (<ref type="formula" target="#formula_19">17</ref>). The output of the method is a text dataset Dʹ ⊂ DSʹ, which has the required volume nʹDS and is balanced according to the required proportions ATʹDS for the selected ethical aspects Ax ⊂ A.</p><p>The steps of the method make it possible to generate text samples that are non-discriminatory and unbiased and that represent the actual demographic subgroups of the population proportionally, which will affect the accuracy and transparency of training machine learning models for solving various problems.</p></div>
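<div xmlns="http://www.tei-c.org/ns/1.0"><p>The deletion operation of stage 3 can be approximated by a simple greedy heuristic, shown below as a sketch. This is a swapped-in technique for illustration: the paper states that an optimization problem is solved, but does not specify the solver, so the greedy rule, the function names deviation and greedy_remove and the toy data are assumptions. At each step the sketch removes the one element whose removal most reduces the total gap between current and target class counts across all aspects, and it stops when no removal helps.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter


def deviation(counts, target_counts):
    """Total absolute gap between current and target class counts."""
    return sum(abs(counts[key] - want) for key, want in target_counts.items())


def greedy_remove(elements, target_counts):
    """Drop one element at a time while doing so lowers the deviation.

    elements: list of dicts mapping aspect name -> class label.
    target_counts: dict mapping (aspect, class) -> desired count.
    """
    kept = list(elements)
    counts = Counter(pair for e in kept for pair in e.items())
    while kept:
        best_index, best_value = None, deviation(counts, target_counts)
        for index, element in enumerate(kept):
            trial = counts.copy()
            for pair in element.items():
                trial[pair] -= 1
            value = deviation(trial, target_counts)
            if value < best_value:
                best_index, best_value = index, value
        if best_index is None:  # no single removal improves the balance
            break
        for pair in kept.pop(best_index).items():
            counts[pair] -= 1
    return kept


# Toy data (hypothetical): too many "female" and too many "young" elements.
pool = (
    [{"gender": "female", "age": "young"}] * 3
    + [{"gender": "female", "age": "old"}]
    + [{"gender": "male", "age": "young"}]
    + [{"gender": "male", "age": "old"}]
)
targets = {
    ("gender", "female"): 2, ("gender", "male"): 2,
    ("age", "young"): 2, ("age", "old"): 2,
}
kept = greedy_remove(pool, targets)
```
]]></code></p><p>The heuristic removes elements that are redundant in several aspects at once, which mirrors the requirement of minimal damage to the other distributions. It is quadratic in the number of elements and may stop at a local optimum, so it is suitable only as an illustration.</p></div>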
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets for research</head><p>To test the method of analysis and formation of representative samples of textual data, an input dataset was formed from two datasets: "Cyberbullying Classification" <ref type="bibr" target="#b2">[3]</ref> and "Cyberbully Detection Dataset" <ref type="bibr" target="#b3">[4]</ref>. The "Cyberbullying Classification" dataset contains 46,017 tweets, which are labeled by type of cyberbullying into 6 classes. The "Cyberbully Detection Dataset" contains 99,989 tweets, which are also labeled by type of cyberbullying. Neither dataset is labeled for the gender, age group, religion or ethnicity of the message author.</p><p>To train the machine learning models used to label the input dataset, training datasets were selected for three ethical aspects of the principle of justice: gender, age and religion.</p><p>The English-language dataset "Tweet Files for Gender Guessing" <ref type="bibr" target="#b20">[21]</ref> was used to train the ML model for the ethical aspect of the gender of the message author. It contains 34,146 unique text entries divided into two classes, female and male, with 17,073 entries in each class. On the basis of the English-language dataset "CyberBullying Detection Dataset" <ref type="bibr" target="#b21">[22]</ref>, which contains 20,109 test samples, a sample was created for training the classifier and labeling the input dataset according to the religious ethical aspect. 
The Italian-language dataset "TAG-it Dataset Distribution" <ref type="bibr" target="#b22">[23]</ref>, which contains 21,948 text messages divided into the age classes 0-19, 20-29, 30-39, 40-49 and 50-100 years, was translated into English and used to bring the working dataset to a representative form by age.</p><p>The classes in these datasets are not balanced and contain different numbers of samples, which would negatively affect the quality of training of the machine learning models, so all classes in the datasets were balanced in number. The final number of samples in each class of the training samples for each ethical aspect is shown in Fig. <ref type="figure" target="#fig_3">4</ref>.</p></div>
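<div xmlns="http://www.tei-c.org/ns/1.0"><p>The text does not state how the classes were balanced. One common option, assumed here purely for illustration, is random undersampling to the size of the rarest class. The function name and the toy records are hypothetical.</p>

```python
import random
from collections import defaultdict


def undersample(records, seed=0):
    """records: list of (text, label) pairs. Returns a list in which every
    label has as many records as the rarest label (random undersampling)."""
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced


# Toy records with hypothetical texts and age-class labels.
toy = [("a", "0-19"), ("b", "0-19"), ("c", "0-19"), ("d", "20-29"), ("e", "20-29")]
balanced = undersample(toy)
counts = {label: sum(1 for _, l in balanced if l == label) for label in ("0-19", "20-29")}
print(counts)  # {'0-19': 2, '20-29': 2}
```

<p>Undersampling discards data; oversampling or synthetic generation would be alternatives when the rarest class is very small.</p></div>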
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Software for research</head><p>To study the effectiveness of the method of analysis and formation of a representative sample of text data, a software implementation was created in the Python programming language. The tensorflow library (https://www.tensorflow.org/) was used to classify the input cyberbullying dataset by gender, age, and religion. Fig. <ref type="figure" target="#fig_4">5</ref> shows an example of classification according to the religious aspect of the FATE-principle of justice. To form the set of trained machine learning models, which are separate for each ethical aspect, various classifier models were analyzed, and to select the best of them, their quality was evaluated by statistical indicators such as Accuracy, Precision, Recall, and F1-score <ref type="bibr" target="#b23">[24]</ref>. Both deep learning models, such as BERT, GPT, LSTM, GRU, etc., and classifiers such as Logistic Regression, Naive Bayes, Support Vector Machines, k-Nearest Neighbors, etc., were studied <ref type="bibr" target="#b24">[25]</ref>. After that, the selected ML model is trained on the annotated dataset for the corresponding ethical aspect.</p><p>As a result, different architectures were chosen as classifiers: the FastForest and SVM classifiers and the LSTM and BERT deep learning models <ref type="bibr" target="#b25">[26]</ref>. Machine learning models such as FastForest, SVM, LSTM, and BERT are effective tools for solving text classification tasks, including determining a person's gender, religion, and age from user text posts. Classical approaches such as FastForest and SVM have also demonstrated their effectiveness in text classification. FastForest works efficiently with large datasets and helps to prevent overfitting. 
SVM, in turn, is known for its ability to work with high-dimensional data, which is especially useful for text classification, where each word or phrase can be represented as a separate feature <ref type="bibr" target="#b26">[27]</ref>. Deep learning models, such as LSTM and BERT, are able to recognize complex patterns in text sequences, preserving the context at all stages of analysis <ref type="bibr" target="#b27">[28]</ref>. A distinctive feature of LSTM is its ability to retain information about previous parts of the text, which makes this model effective for complex classification tasks where the overall context of the message is important. Studies have shown that such a model can achieve an accuracy of up to 92% in text classification tasks <ref type="bibr" target="#b28">[29]</ref>. The BERT model, in turn, is characterized by the ability to analyze the text in two directions, that is, to take into account both the preceding and the following context of words <ref type="bibr" target="#b29">[30]</ref>.</p></div>
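<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four statistical indicators used to compare the classifiers follow the standard definitions based on true and false positives and negatives. The sketch below computes them for a binary task in plain Python; the function name and the toy labels are illustrative assumptions, and the input is assumed to be non-empty.</p>

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, Precision, Recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): "f" = female, "m" = male.
y_true = ["f", "f", "f", "m", "m", "m", "m", "f"]
y_pred = ["f", "f", "m", "m", "m", "f", "m", "f"]
scores = binary_metrics(y_true, y_pred, positive="f")
print(scores)  # all four values equal 0.75 for this toy input
```

<p>For multi-class aspects such as age, these values are usually computed per class and then averaged; which averaging the authors used is not stated.</p></div>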
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Analysis of research results</head><p>The population of Ukraine was taken as the reference for the target class proportions when forming a representative sample of text data by age and gender. According to the M. V. Ptukh Institute of Demography and Social Research of the National Academy of Sciences of Ukraine (https://idss.org.ua/forecasts/nation_pop_proj), as of July 2023, the total population of Ukraine is estimated at 35,596,216 people. The age subgroups contain the following numbers of people: 0-19 years: 6,659,068; 20-29 years: 3,623,143; 30-39 years: 6,022,345; 40-49 years: 5,431,140; 50-100 years: 13,860,520. Regarding the gender structure of the population of Ukraine in 2023: 16,951,527 are women, and 1,864,689 are men (idss.org.ua). Note that within the scope of this work, the cisgender group is considered in the analysis of the gender ethical aspect.</p><p>To study the effectiveness of the method of analysis and formation of a representative sample of text data described in this work, several machine learning models were trained. The statistical metrics Accuracy, Precision, Recall and F1-score <ref type="bibr" target="#b23">[24]</ref> of the machine learning models for the gender, age and religious ethical aspects are shown in Table <ref type="table" target="#tab_0">1</ref>. 
Different levels of linear separability were obtained for the different ethical aspects. For religion, the BERT classifier showed the best result among the trained machine learning models, and the data turned out to be well separable. For gender, the LSTM classifier showed the best performance compared to the other models, and the data turned out to be moderately separable. For age, with the SVM classifier, the data turned out to be poorly separable.</p><p>In addition, it was found that the dataset is not representative: the numbers of text samples in the classes of the various ethical aspects do not correspond to the proportions of the demographic subgroups of the population of Ukraine, so the classes need balancing to become representative.</p><p>Therefore, according to the steps of the method of analysis and formation of a representative sample of text data, the sample of text data needs augmentation to become representative. For this, it is necessary to solve the optimization problem of correctly removing the redundant elements of each class for each of the ethical aspects, followed by augmentation of the data sample to the target requirements (number of elements and class proportions).</p><p>Table <ref type="table">2</ref> presents the percentages of samples by age in the sample of textual data and of individuals in the age demographic subgroups of the population, together with the new distribution of the sample classes calculated for the case in which only one ethical aspect, age, is taken into account.</p></div>
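<div xmlns="http://www.tei-c.org/ns/1.0"><p>The target class proportions for the age aspect follow directly from the population figures quoted in this section, whose five age subgroups sum to 35,596,216. A minimal sketch of that arithmetic is given below; the variable names are illustrative.</p>

```python
# Age subgroup sizes of the population of Ukraine, as quoted in the text.
population = {
    "0-19": 6_659_068,
    "20-29": 3_623_143,
    "30-39": 6_022_345,
    "40-49": 5_431_140,
    "50-100": 13_860_520,
}

total = sum(population.values())  # 35,596,216
target_share = {group: count / total for group, count in population.items()}

for group, share in target_share.items():
    print(f"{group}: {share:.2%}")
```

<p>These shares play the role of the target proportions against which the class distribution of the text sample is compared.</p></div>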
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Percentage ratios of samples by age in the sample of text data and individuals of the population in age demographic subgroups, %</p><p>Table <ref type="table" target="#tab_1">3</ref> presents the percentages of samples by gender in the sample of textual data and of individuals in the gender demographic subgroups of the population, together with the new distribution of the sample classes calculated for the case in which only one ethical aspect, gender, is taken into account.</p><p>For the dataset transformed according to the created method, the deviation of the class distribution from the ideal representative distribution was: for the age ethical aspect, minimum 0.01%, maximum 0.04%, average 0.02%; for the gender ethical aspect, minimum 0.03%, maximum 0.03%, average 0.03%. However, the optimization task of forming a representative sample of textual data is multi-criteria: the criteria are the formation of the sample by the age and gender ethical aspects, so the goal is to minimize the deviation between the current and desired class ratios, taking into account the limitations on the number of samples and the possibility of generating new data. As a result of solving the optimization problem for the formation of a representative sample by the age and gender ethical aspects on the example of the demographic subgroups of the population of Ukraine, a representative sample of text data was obtained by augmentation; its class balance is presented in Table <ref type="table">4</ref>, Fig. <ref type="figure" target="#fig_5">6</ref> and Fig. <ref type="figure" target="#fig_6">7</ref>.</p></div>
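<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reported minimum, maximum and average deviations can be reproduced for the gender aspect from the values in Table 3. The sketch below assumes that the deviation is the absolute difference, in percentage points, between the new class share and the population share; the function name is illustrative.</p>

```python
def deviation_stats(new_distribution, target_distribution):
    """Minimum, maximum and mean absolute deviation between two
    class distributions given in percent."""
    deviations = [
        abs(new_distribution[cls] - target_distribution[cls])
        for cls in target_distribution
    ]
    return min(deviations), max(deviations), sum(deviations) / len(deviations)


# Gender values in percent, copied from Table 3.
target = {"men": 43.28, "women": 56.72}
adjusted = {"men": 43.25, "women": 56.75}

lowest, highest, mean = deviation_stats(adjusted, target)
print(f"min {lowest:.2f}, max {highest:.2f}, mean {mean:.2f}")
# prints: min 0.03, max 0.03, mean 0.03
```

<p>The result agrees with the gender deviations stated in the text (0.03% in all three cases).</p></div>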
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Distribution of samples in the formed representative sample after data augmentation as a result of solving a multi-criteria optimization problem</p><p>For the dataset transformed according to the created method with the age and gender ethical aspects taken into account simultaneously, the deviation of the class distribution from the ideal representative distribution was: minimum 0.00%, maximum 0.04%, average 0.02%.</p><p>Thus, as a result of performing the steps of the method of analysis and formation of representative samples of text data, a text sample was formed that is non-discriminatory and unbiased and in which the samples are represented in proportion to the real demographic subgroups of the population of Ukraine.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The goal of the study was achieved through the development of the method for analysis and formation of representative text datasets, designed to analyze and form representative text samples according to the FATE principle of fairness for subject areas.</p><p>To investigate the effectiveness of the method, software was created that uses machine learning models to classify texts according to various ethical aspects: age, gender, religion, ethnicity, etc. To classify the text samples according to the age ethical aspect, SVM was used; for the gender aspect, LSTM; and for the religious aspect, BERT, as these models showed the best statistical metrics.</p><p>As a result of the practical application of the developed method, it was established that the available dataset is not representative compared to the objective data of demographic statistics, so a multi-criteria optimization problem was solved and the dataset was transformed into a representative one in terms of the age and gender ethical aspects. The deviations of the class distributions of the transformed dataset from the ideal representative distribution were: minimum 0.00%, maximum 0.04%, average 0.02%, given an initial dataset volume of 47,692 elements, a minimum initial number of samples in a class of 1,007 elements and a maximum initial number of samples in a class of 28,112 elements. 
These results indicate that the developed method makes it possible to analyze the representativeness of text datasets and to bring them to a representative form according to various aspects of the FATE fairness principle.</p><p>The obtained results contribute to improving the representativeness of text datasets and the fair and unbiased representation of demographic groups in them, which increases trust in decisions made by artificial intelligence and complies with the goals SDG3 (good health and wellbeing), SDG4 (quality education) and SDG16 (peace, justice, and strong institutions).</p><p>Further plans for improving the method include not only forming a sample that is non-discriminatory in the number of samples, but also finding and removing text samples that contain a biased attitude towards representatives of various demographic subgroups, according to the ethical aspects of the FATE-principle of justice.</p><p>Prospects for further research also include using the developed method to adjust textual datasets of subject areas and applying them to applied problems, such as detection and classification of cyberbullying, analysis of the emotional tonality of messages, detection of the physical and mental state of users based on their posts, etc. Detecting performance gains from using ethically balanced text datasets will provide feedback for improving the developed method.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An existing approach to training AI models.</figDesc><graphic coords="3,114.90,537.61,370.85,135.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: An approach to the formation of ethically representative datasets.</figDesc><graphic coords="4,91.08,223.74,418.40,138.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Steps of method for analysis and formation of representative text datasets.</figDesc><graphic coords="8,111.05,101.94,378.55,292.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Classes and number of samples in ML training datasets for ethical aspects.</figDesc><graphic coords="9,93.88,184.70,412.90,94.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Developed software for classification of dataset by religious ethical aspect.</figDesc><graphic coords="9,92.58,479.53,414.85,163.98" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: The balance of the distribution of the input dataset according to the age-ethical aspect of the FATE-principle of justice.</figDesc><graphic coords="12,154.70,535.43,293.39,199.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: The balance of the distribution of the input dataset according to the gender ethical aspect of the FATE-principle of justice.</figDesc><graphic coords="13,161.23,62.35,278.20,188.74" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Statistical metrics Accuracy, Precision, Recall and F1-score of machine learning models by gender, age and religious ethical aspects</figDesc><table><row><cell>ML model</cell><cell>Accuracy</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell></row><row><cell>Gender ethical aspect</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>FastForest</cell><cell>0.630</cell><cell>0.640</cell><cell>0.600</cell><cell>0.620</cell></row><row><cell>SVM</cell><cell>0.580</cell><cell>0.580</cell><cell>0.580</cell><cell>0.580</cell></row><row><cell>LSTM</cell><cell>0.700</cell><cell>0.770</cell><cell>0.670</cell><cell>0.720</cell></row><row><cell>BERT</cell><cell>0.690</cell><cell>0.640</cell><cell>0.710</cell><cell>0.670</cell></row><row><cell>Age ethical aspect</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>FastForest</cell><cell>0.535</cell><cell>0.542</cell><cell>0.504</cell><cell>0.504</cell></row><row><cell>SVM</cell><cell>0.815</cell><cell>0.770</cell><cell>0.779</cell><cell>0.770</cell></row><row><cell>LSTM</cell><cell>0.590</cell><cell>0.600</cell><cell>0.560</cell><cell>0.580</cell></row><row><cell>BERT</cell><cell>0.580</cell><cell>0.430</cell><cell>0.450</cell><cell>0.440</cell></row><row><cell cols="2">Religious ethical aspect</cell><cell></cell><cell></cell><cell></cell></row><row><cell>FastForest</cell><cell>0.775</cell><cell>0.800</cell><cell>0.762</cell><cell>0.780</cell></row><row><cell>SVM</cell><cell>0.825</cell><cell>0.850</cell><cell>0.810</cell><cell>0.829</cell></row><row><cell>LSTM</cell><cell>0.850</cell><cell>0.880</cell><cell>0.830</cell><cell>0.854</cell></row><row><cell>BERT</cell><cell>0.910</cell><cell>0.980</cell><cell>0.740</cell><cell>0.840</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Percentages of samples by gender in the sample of text data and individuals of the population in gender demographic subgroups, %</figDesc><table><row><cell>Gender demographic subgroups</cell><cell>Percentage of samples by gender in text dataset</cell><cell>Percentage of population in gender demographic subgroups</cell><cell>Deviation of text samples from population subgroups</cell><cell>New distribution of sample classes</cell><cell>Deviation from representative distribution</cell></row><row><cell>Men</cell><cell>58.94%</cell><cell>43.28%</cell><cell>15.67%</cell><cell>43.25%</cell><cell>0.03%</cell></row><row><cell>Women</cell><cell>41.06%</cell><cell>56.72%</cell><cell>15.67%</cell><cell>56.75%</cell><cell>0.03%</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Declaration on Generative AI</head><p>During the preparation of this work, the authors used Grammarly for grammar and spelling checking and DeepL Translate to translate some phrases into English. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A Comprehensive Review of Bias in Deep Learning Models: Methods, Impacts, and Future Directions</title>
		<author>
			<persName><forename type="first">M</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sureja</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11831-024-10134-2</idno>
	</analytic>
	<monogr>
		<title level="j">Arch Computat Methods Eng</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Dictionary-based deterministic method of generation of text corpora</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yusyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rybachok</surname></persName>
		</author>
		<idno type="DOI">10.31891/csit-2024-3-9</idno>
	</analytic>
	<monogr>
		<title level="j">Computer systems and information technologies</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="67" to="73" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><surname>Kaggle.com</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification?resource=download" />
		<title level="m">Cyberbullying Classification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><surname>Kaggle</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/sayankr007/cyber-bullying-data-for-multi-label-classification" />
		<title level="m">CyberBullying Detection Dataset</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The ethics of ChatGPT - Exploring the ethical issues of an emerging technology</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Stahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eke</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ijinfomgt.2023.102700</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Information Management</title>
		<imprint>
			<biblScope unit="volume">74</biblScope>
			<biblScope unit="page">102700</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Structural alignment method of conceptual categories of ontology and formalized domain</title>
		<author>
			<persName><forename type="first">E</forename><surname>Manziuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Krak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Barmak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mazurets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kuznetsov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pylypiak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">3003</biblScope>
			<biblScope unit="page" from="11" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Information technology for creation of semantic structure of educational materials</title>
		<author>
			<persName><forename type="first">O</forename><surname>Barmak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mazurets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Krak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kulias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Smolarz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azarova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gromaszek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Smailova</surname></persName>
		</author>
		<idno type="DOI">10.1117/12.2537064</idno>
	</analytic>
	<monogr>
		<title level="j">Proceedings of SPIE - The International Society for Optical Engineering</title>
		<imprint>
			<biblScope unit="volume">11176</biblScope>
			<biblScope unit="page" from="147" to="156" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Data Representativity for Machine Learning and AI Systems</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K H</forename><surname>Clemmensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Rune</surname></persName>
		</author>
		<ptr target="https://ar5iv.labs.arxiv.org/html/2203.04706" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Towards a holistic view of bias in machine learning: bridging algorithmic fairness and imbalanced learning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dablain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Krawczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chawla</surname></persName>
		</author>
		<idno type="DOI">10.1007/s44248-024-00007-1</idno>
	</analytic>
	<monogr>
		<title level="j">Discov Data</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K E</forename><surname>Bellamy</surname></persName>
		</author>
		<idno type="DOI">10.1147/JRD.2019.2942287</idno>
	</analytic>
	<monogr>
		<title level="j">IBM Journal of Research and Development</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="issue">4/5</biblScope>
			<biblScope unit="page" from="1" to="15" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Benchmarking Intersectional Biases in NLP</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lalor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Forsgren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abbasi</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.263</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3598" to="3609" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Dbias: detecting biases and ensuring fairness in news articles</title>
		<author>
			<persName><forename type="first">S</forename><surname>Raza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Reji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ding</surname></persName>
		</author>
		<idno type="DOI">10.1007/s41060-022-00359-4</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Data Science and Analytics</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="39" to="59" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Addressing Both Statistical and Causal Gender Fairness in NLP Models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Evans</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2404.00463</idno>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="561" to="582" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Fairpriori: Improving Biased Subgroup Discovery for Deep Neural Network Fairness</title>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2407.01595</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">A Study on Bias Detection and Classification in Natural Language Processing</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Moniz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Coheur</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2408.07479</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Attribute-Specific Cyberbullying Detection Using Artificial Intelligence</title>
		<author>
			<persName><forename type="first">A</forename><surname>Orelaja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ejiofor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sarpong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Imakuh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bassey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Opara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N A</forename><surname>Tettey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Akinola</surname></persName>
		</author>
		<idno type="DOI">10.30564/jeis.v6i1.6206</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Electronic &amp; Information Systems</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="10" to="21" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Text Classification Using Deep Learning Models: A Comparative Review</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zulqarnain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sheikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hussain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sajid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Abbas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Majid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Ullah</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10115-023-01856-z</idno>
	</analytic>
	<monogr>
		<title level="j">Cloud Computing and Data Science</title>
		<imprint>
			<biblScope unit="page" from="80" to="96" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Method for Sentiment Analysis of Ukrainian-Language Reviews in E-Commerce Using RoBERTa Neural Network</title>
		<author>
			<persName><forename type="first">O</forename><surname>Zalutska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Molchanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sobko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mazurets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pasichnyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Barmak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Krak</surname></persName>
		</author>
		<idno type="DOI">10.15407/jai2024.02.085</idno>
	</analytic>
	<monogr>
		<title level="j">CEUR Workshop Proceedings</title>
		<imprint>
			<biblScope unit="volume">3387</biblScope>
			<biblScope unit="page" from="344" to="356" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">EMFSA: Emoji-based multifeature fusion sentiment analysis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1371/journal.pone.0310715</idno>
	</analytic>
	<monogr>
		<title level="j">PLoS One</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">9</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wongvorachan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bulut</surname></persName>
		</author>
		<idno type="DOI">10.3390/info14010054</idno>
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">54</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><surname>Kaggle</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/aharless/tweet-files-for-gender-guessing" />
		<title level="m">Tweet Files for Gender Guessing</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><surname>Kaggle</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/sayankr007/cyber-bullying-data-for-multi-label-classification" />
		<title level="m">CyberBullying Detection Dataset</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<ptr target="https://live.european-language-grid.eu/catalogue/corpus/8112/download/" />
		<title level="m">TAG-it Dataset Distribution</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>european-language-grid</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Edge-informed single image super-resolution</title>
		<author>
			<persName><forename type="first">K</forename><surname>Nazeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Thasarathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ebrahimi</surname></persName>
		</author>
		<ptr target="https://openaccess.thecvf.com/content_ICCVW_2019/html/AIM/Nazeri_Edge-Informed_Single_Image_Super-Resolution_ICCVW_2019_paper.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops</title>
		<meeting>the IEEE/CVF International Conference on Computer Vision Workshops</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Visual Analytics-Based Method for Sentiment Analysis of COVID-19 Ukrainian Tweets</title>
		<author>
			<persName><forename type="first">O</forename><surname>Kovalchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Slobodzian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sobko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Molchanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mazurets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Barmak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Krak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Savina</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-16203-9_33</idno>
	</analytic>
	<monogr>
		<title level="j">Lecture Notes on Data Engineering and Communications Technologies</title>
		<imprint>
			<biblScope unit="volume">149</biblScope>
			<biblScope unit="page" from="591" to="607" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">ExaAUAC: Arabic Twitter user age prediction corpus based on language and metadata features</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sadeghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Akbari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Jaziriyan</surname></persName>
		</author>
		<idno type="DOI">10.1007/s44163-024-00145-0</idno>
	</analytic>
	<monogr>
		<title level="j">Discover Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A Comparative Analysis of Machine Learning Algorithms for Classification Purpose</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Tripathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.procs.2022.12.044</idno>
	</analytic>
	<monogr>
		<title level="j">Procedia Computer Science</title>
		<imprint>
			<biblScope unit="page" from="422" to="431" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Deep Learning-based Text Classification</title>
		<author>
			<persName><forename type="first">Sh</forename><surname>Minaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno type="DOI">10.1145/3439726</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="1" to="40" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Gender Classification on Social Media Messages Using fastText-base Feature Extraction and Long Short-Term Memory</title>
		<author>
			<persName><forename type="first">H</forename><surname>Sa'diah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Faisal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farmadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Indriani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alkaff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Abdullayev</surname></persName>
		</author>
		<idno type="DOI">10.35882/jeeemi.v6i3.407</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Electronics, Electromedical Engineering, and Medical Informatics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="243" to="252" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">An efficient approach for textual data classification using deep learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Alqahtani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Ullah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sh</forename><surname>Alsubai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almadhor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Iqbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Abbas</surname></persName>
		</author>
		<idno type="DOI">10.3389/fncom.2022.992296</idno>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Computational Neuroscience</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
