<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>COLINS-</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Gender Classification of Surnames: Ukrainian aspect</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natalia Borysova</string-name>
          <email>borysova.n.v@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karina Melnyk</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadiia Babkova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoia Kochuieva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktoriia Melnyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv general education school of I-III degrees No 145</institution>
          ,
          <addr-line>Amosova street, 24a, Kharkiv, 61171</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kirpichova street, 2, Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>6</volume>
      <fpage>12</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>This research addresses the gender classification of Ukrainian surnames of text authors in the GRAC corpus. An analytical review of existing solutions and of classification methods for gender determination of texts has been carried out. A mathematical model of the task has been developed. A functional model of the gender classification process has been proposed in the form of a BPMN diagram. The developed approach handles two groups of surnames: those classified by their endings and those without explicit gender features. For the second group, a set of indicators based on text characteristics has been proposed. The efficiency of the proposed classifier has been calculated.</p>
      </abstract>
      <kwd-group>
        <kwd>Gender classification of surnames</kwd>
        <kwd>GRAC corpus</kwd>
        <kwd>gender determination</kwd>
        <kwd>naive Bayesian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The paradigm change in the humanities around the world has brought research on the gender aspects
of language to the fore. This area is called gender linguistics. Recently, the problem of
determining gender identity has become increasingly urgent. Many different classification
criteria can be used in the gender determination task. Usually such criteria are characteristics
that can be identified in a text automatically. They reflect the morphological, lexical, syntactic and stylistic
features of the author’s writing. Gender determination is applied in various areas
of human activity, such as authorship examination, banking, insurance, etc. The task is also relevant in
domain areas where it is impossible to analyze texts. In this case, if some personal information is
available, namely a person’s surname, the gender can often be determined from the last letters of the
surname. However, problems arise when surnames do not have distinct
“male” and “female” forms. If additional information about the person, such as the first name and/or patronymic, is
available, it can be used for gender determination. However, naming practices differ between
countries, so it is necessary to use databases of names of a particular language that already
indicate gender. Automating such processes requires natural language processing
methods.</p>
      <p>
        This study proposes an approach to solution of the problem of determining gender for some authors
of the General Regionally Annotated Corpus of Ukrainian (GRAC). GRAC is the Ukrainian language
corpus with a volume of more than 650 million tokens. It is designed for linguistic research in grammar,
vocabulary, history of the Ukrainian literary language, as well as for use in compiling dictionaries and
grammars. The developers of the corpus are Maria Shvedova and Vasyl Starko. GRAC contains texts
of various genres, styles, topics, regions [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. To date, the gender of 23089 authors of this
corpus is unknown. Thus, the purpose of the study is to develop an approach for resolving this issue.
      </p>
      <p>2021 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formal problem statement</title>
      <p>Let us formulate the task of determining gender from the Ukrainian surnames of authors
and from their texts. In general terms, this is a classification task. The object of classification is the set of
Ukrainian surnames of GRAC authors whose gender is unknown. Each record in the list
of authors must satisfy the following rule: the author’s surname and initials are written in Latin characters
only. Foreign names have been removed at the pre-processing stage. The input data are various
characteristics and indicators of the texts and surnames. Let <inline-formula><tex-math>X = \{x_1, \ldots, x_n\}</tex-math></inline-formula> denote the set of such
characteristics. The result of the gender determination task is one of two classes: male or female. Let
<inline-formula><tex-math>Y = \{y_1, y_2\}</tex-math></inline-formula> denote the set of possible classes. The gender classification task is then a mapping
from one set to the other, <inline-formula><tex-math>f : X \rightarrow Y</tex-math></inline-formula>.</p>
      <p>This issue can be divided into the following tasks:
        <list list-type="bullet">
          <list-item><p>form a set of input indicators for resolving the classification task;</p></list-item>
          <list-item><p>conduct an analytical review of approaches to resolving the given task;</p></list-item>
          <list-item><p>review mathematical classification methods and choose a suitable method;</p></list-item>
          <list-item><p>develop the model of the gender classification of surnames;</p></list-item>
          <list-item><p>assess the proposed approach to resolving the classification task.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Literature Review</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Review of related works</title>
      <p>Software that automatically recognizes and classifies people based on the electronic footprint
they leave on the network has recently become increasingly widespread. The rapid growth of the
Internet has created many ways to share information across time and space. Social networks (Twitter,
Myspace, Facebook), e-commerce sites (eBay, Craigslist), newsgroups and other sites make it possible to accumulate
large amounts of data about users and their surroundings. Some companies no longer require gender information
at registration, so the percentage of users without a declared
gender is expected to increase. Some tasks, such as testing the accuracy and efficiency of machine learning
algorithms for recommending content, need the user’s gender. Therefore, it is necessary to determine
the gender of registered users who do not disclose it, using the information collected during their
registration and/or their further activities. The surname is the main source of information about a person,
so it is the main factor in this study. There are two types of surnames in Ukrainian:
        <list list-type="bullet">
          <list-item><p>surnames with gender characteristics;</p></list-item>
          <list-item><p>surnames without clearly defined gender characteristics.</p></list-item>
        </list>
      </p>
      <p>
        The problem of determining the gender of the author of a written text cannot be solved without basic
knowledge about the system of linguistic gender markers: linguistic units, models, structures that
characterize the language in the gender aspect. Early studies of men’s and women’s
speech behavior, based on material from various linguistic cultures, revealed discursive markers
of male dominance in language practices, lexical indicators of gender specificity, differences in
strategies and tactics of communication, etc. The nature of communication in gender-homogeneous and
gender-heterogeneous groups of communicants also differs [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The works [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4-7</xref>
        ]
demonstrate communicative models, which allow highlighting signs of typically male and typically
female speech.
      </p>
      <p>A review of a wide range of gender studies shows that researchers usually analyze one
type of discourse: they study either spoken or written, business or artistic
speech, while the question of the stability or variability of gender markers across different types of
speech situations is not raised. However, this problem is significant, because solving it
requires adjusting the methodology and the specific procedures of gender studies.</p>
      <p>
        An analysis of available sources has shown that there are many solutions for
determining gender from the full name (last name, first name and patronymic) or from the last and first
name, based on various methods. The article [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] describes the usage of the statistical method for
determining a gender identity in detail. The research [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] indicates how to use the neural network for it.
The forum [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] discusses various methods and algorithms, for example, the use of
databases of names or determining gender from the ending of the patronymic. The website [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
describes how to use the Gender API to determine the gender of customers from the names they
enter in registration forms. Moreover, the authors make both their programs and
training datasets freely available.
      </p>
      <p>The input data of the given task is a list of authors with surnames and initials, so the first name and patronymic
are not specified. Therefore, the available solutions cannot be applied to the given problem,
and it is necessary to develop a dedicated approach for resolving it.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. An analytical review of classification methods</title>
      <p>
        The preceding analysis of the domain shows that determining gender from the
Ukrainian surnames of authors and from their texts is a classification task. It is proposed to use
machine learning methods to solve it. Machine learning comprises algorithms that find
solutions to problems on their own by using statistical data as the source for
forecasts and patterns. Machine learning techniques are useful for many types of problems:
regression, classification, clustering, dimensionality reduction, and
anomaly detection. There are three types of machine learning methods: supervised learning,
unsupervised learning, and reinforcement learning. Supervised
learning is used most often to analyze text data, since algorithms of this class work faster and better with texts.
With machine learning, a classifier can be built that recognizes different classes
of text. The classifier is built on a pre-labeled text corpus (training sample), in which the data are assigned
labels that encode their features. Learning can be defined as identifying common patterns in the
training data. The primary task is to identify features in the data that can predict the target variable
(label). However, classifiers are not transparent to understanding and interpretation. Machine learning
uses various technologies and algorithms, including discriminant analysis, Bayesian classifiers,
artificial neural networks, and many other mathematical methods. An analytical review of classification
methods for the given task has been conducted using a limited but representative set of objects. The
findings are presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Bayes’ method refers to probabilistic classification methods [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. This classification approach
applies Bayes’ theorem under the assumption of class-conditional independence of features. The naive Bayesian classifier
is a simple and easy-to-implement algorithm. It is mainly used in text classification, spam
identification, and recommendation systems. It handles both numerical and categorical data well. Compared
with more complex models, this classifier shows good results when the amount of data is
limited.
      </p>
      <p>
        Linear regression is used to identify the relationship between a dependent variable and one or more
independent variables, and is generally used to predict future outcomes [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The support vector machine is a linear classification method. It gives good results when processing
documents. However, there may be documents that the algorithm assigns to one class although
in reality they belong to another. Such data are called outliers because they introduce error into the
method. Such documents are best ignored when using this classification method.</p>
      <p>
        The k-nearest neighbor method, also known as the KNN algorithm, is a non-parametric algorithm
that classifies data based on its proximity to and association with other available data. Its ease of use
and low computation time make it one of the most popular algorithms among data scientists. However,
as the test set grows, the processing time increases, which makes the method less attractive for classification
problems. KNN is commonly used for recommendation engines and image recognition [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>
        The decision tree method refers to the logical methods of classification. In practice, binary decision
trees are usually used, because in them the decision to move along an edge is made by simply
checking the presence of a feature in the document. When the feature value is less than a certain threshold,
one branch is selected, and when it is greater than or equal to the threshold, another branch is selected. Compared to
other approaches, the decision tree approach is a symbolic (i.e., non-numeric) algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Every method has both advantages and disadvantages. After analyzing and comparing the
aforementioned classification methods and their results, it has been decided to use the naive Bayesian
classifier.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Materials and methods</title>
    </sec>
    <sec id="sec-7">
      <title>4.1. Gender differences of texts</title>
      <p>
        Differences between male and female language are manifested at different levels of the language:
in vocabulary, in phonetics, in grammar. In addition, there are differences in the tactics of conducting
conversations. Linguists claim that gender differences appear most often at the level of vocabulary.
E. A. Zemskaya, M. V. Kitaigorodskaya and M. M. Rozanova [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] note in their works that women tend
to use diminutive forms, especially when talking with children and animals, as well as
approximate designations, hyperbolic expressions and a high concentration of emotionally
evaluative words. Men tend to coarsen the language with lexical means, to prefer exact
nomination, terms and stylistically neutral evaluative vocabulary, and to actively use
professional knowledge outside the sphere of professional communication. In addition, women
widely use adjectives and adverbs that express a general positive assessment. Typical female
evaluative words include adjectives such as wonderful, magical, amazing, unsurpassed and
beautiful, together with the corresponding adverbs.
      </p>
      <p>
        According to O. Jespersen, women are more prone to euphemisms and less prone to obscene expressions
than men. They are also more conservative in their use of neologisms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Women use more pronouns, verbs and particles than men do. Women tend to name emotional qualities of
objects and states, while men prefer concrete nouns. Men use qualitative adjectives mostly in the highest
degree, and not in the comparative or high degree. Women prefer to use exclamations: “ouch” is the
most common one. Constructions with the pronouns “such”, “yes”, “which”, marked with both positive
and negative connotations, are characteristic of women. Women prefer to use diminutives to convey
multifaceted connections with the world, while men prefer to use diminutives when describing
situations with children or loved ones. Men’s speech contains rationalistic assessments [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ].
      </p>
      <p>
        N. L. Pushkareva [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] notes that women often use inversions, exclamation marks and questions.
Their texts are characterized by detailed and expressive sentences. Sentences and texts of men are
laconic, specific and less dynamic.
      </p>
      <p>
        E. I. Goroshko’s study of written speech has shown that the following features are characteristic
of male language: men more often use subordinating rather than coordinating connections and less often
use incomplete sentences, elliptical constructions and reverse word order [
        <xref ref-type="bibr" rid="ref18 ref19 ref4">4, 18, 19</xref>
        ].
      </p>
      <p>
        According to A. V. Kirilina, women tend to focus on their inner world: their vocabulary
contains more words that describe feelings and emotions. They also often use verbs that convey
the emotional and psychological state of a person [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Thus, various studies of the oral and written speech of men and women show that differences in
the use of language units are not accidental. Distinctive features in the speech of people of different sexes
exist at all levels of language and in any language. The gender factor is most fully realized at
the level of vocabulary. Linguists claim that female language is more emotional and expressive and
rarely uses stylistically reduced means and vulgar vocabulary. Women prefer more
detailed sentences and texts. Men tend to use professionalisms and terms, coarsen the
language with lexical means, and prefer concise sentences and texts. However,
although there is a general trend of gender differentiation of language means, the above characteristics
may vary depending on the communicative situation and on the cultural level and social status of the speaker.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2. The forming process of input data</title>
      <p>
        Let us consider the process of forming the input data on the basis of the above analysis, using
Ukrainian anthroponyms as an example. Anthroponyms, or proper personal names, are an important
part of every language. Native speakers, as well as people who study the language, use surnames, first
names, patronymics, and sometimes nicknames to distinguish people in society. Anthroponyms are
necessary factors of verbal communication. Their importance lies in the fact that they contain important
theoretical-linguistic, historical, ethnographic and other scientific and everyday information [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In
Ukraine, the three-term name of a person is most common: last name, first name and patronymic.
      </p>
      <p>It is obvious that all Ukrainian surnames can be divided into two groups: surnames that have explicit
gender characteristics (for example, Markov and Markova, Svintsytsky and Svintsytska, Isayev and
Isayeva, Sanin and Sanina, Berezivsky and Berezivska, Kutsev and Kutseva etc.), and surnames without
such features (Shevchenko, Potebnya, Solomka, Plaksiy, Melnyk, Movchan, Sklyar, Stelmakh,
Markovych, Koval etc.). This division determines the solution of the problem of gender classification
of surnames separately for these two groups.</p>
      <p>The most important stage in resolving the classification task is forming the input data.
For the first group, the classification features are the last four letters of the surname.</p>
      <p>
        The input data of the second group of the surnames are the various characteristics of the texts written
by the authors with these surnames. The analytical review of the literature sources [
        <xref ref-type="bibr" rid="ref20">20-26</xref>
        ] made it possible to
determine the set of text characteristics. The main advantage of this set is that the features do not depend
on the context and have a linguistic interpretation. The chosen classification features have been divided into
five groups.
      </p>
      <p>The first group comprises frequency indicators of the use of punctuation marks and special characters
(comma, period, exclamation mark, question mark, parentheses, dash, quotation marks, single
quotation marks, colon, and semicolon). The signs are characterized by the following indicators:
        <list list-type="bullet">
          <list-item><p>the number of occurrences of a particular sign in the text divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of sentences with a particular sign divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of occurrences of all signs divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of sentences with at least one sign divided by the total number of sentences;</p></list-item>
          <list-item><p>the average number of different signs in a sentence;</p></list-item>
          <list-item><p>the breadth of the author’s use of punctuation marks: the maximum number of different signs in a sentence divided by the number of different signs.</p></list-item>
        </list>
      </p>
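      <p>The first two indicators of this group can be sketched as follows. This is only an illustration under stated assumptions: the names split_sentences and punctuation_indicators, the regular-expression sentence splitter and the use of a plain hyphen for the dash are hypothetical choices, not the implementation used in the study.</p>
      <preformat>
```python
import re

# Signs taken from the list in the text; the dash is approximated by "-" (assumption).
SIGNS = [",", ".", "!", "?", "(", ")", "-", '"', "'", ":", ";"]


def split_sentences(text):
    """Naive splitter: a sentence runs up to '.', '!', '?', a newline or a tab."""
    chunks = re.findall(r"[^.!?\n\t]+[.!?]*", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def punctuation_indicators(text, signs=SIGNS):
    """Return {sign: (occurrences per sentence, share of sentences with the sign)}."""
    sentences = split_sentences(text)
    total = len(sentences)
    if total == 0:
        return {}
    indicators = {}
    for sign in signs:
        occurrences = sum(sentence.count(sign) for sentence in sentences)
        containing = sum(1 for sentence in sentences if sign in sentence)
        indicators[sign] = (occurrences / total, containing / total)
    return indicators


sample = "Hello, world! How are you? Fine, thanks, really."
print(punctuation_indicators(sample)[","])
```
      </preformat>
      <p>The remaining indicators of the group can be derived from the same per-sentence counts; abbreviations and ellipses would need a more careful splitter than the one assumed here.</p>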
      <p>Unlike the syntactic and semantic characteristics of the text, these features cannot be consciously
controlled by a person, so they are the most suitable for authorship examination. At the same time,
the question arises of how these parameters correlate with semantically relevant features.</p>
      <p>The second group comprises frequency indicators of the use of different parts of speech and their
combinations. The main parts of speech and their forms are the following: noun, verb, personal pronoun,
pronoun (all other types), adjective, short form of adjective, adverb, predicative (“unfortunately”, “good”,
“bad”), introductory words, service parts of speech (preposition, conjunction, particle), as well as two
combinations: “adverb + adjective” and “adverb + adverb”. Lists of introductory words and service
parts of speech have been taken from dictionaries. The following values are calculated for each of these
groups:
        <list list-type="bullet">
          <list-item><p>the number of occurrences of a part of speech or combination in the text divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of sentences that contain a certain part of speech or combination divided by the total number of sentences.</p></list-item>
        </list>
      </p>
      <p>The third group describes the length of sentences and words:
        <list list-type="bullet">
          <list-item><p>the average length of sentences in the text, measured in words;</p></list-item>
          <list-item><p>the average length of words in the text, measured in characters.</p></list-item>
        </list>
      </p>
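      <p>A minimal sketch of the two length indicators is given below. The function name length_indicators and the regular expressions used to detect sentences and words are assumptions made for illustration; the tokenization used in the study may differ.</p>
      <preformat>
```python
import re

# A word is a run of letters, optionally joined by inner apostrophes or hyphens (assumption).
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def length_indicators(text):
    """Return (average sentence length in words, average word length in characters)."""
    # Keep only chunks that contain at least one word.
    sentences = [s for s in re.findall(r"[^.!?\n\t]+", text) if WORD.search(s)]
    if not sentences:
        return (0.0, 0.0)
    words = [w for s in sentences for w in WORD.findall(s)]
    average_sentence = len(words) / len(sentences)
    average_word = sum(len(w) for w in words) / len(words)
    return (average_sentence, average_word)


print(length_indicators("Hello, world! How are you?"))
```
      </preformat>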
      <p>The fourth group comprises frequency indicators of the use of idioms and phraseologies. They
contribute to greater diversity and liveliness of language. The use of idioms can indicate the age,
education and mood of the person who speaks or writes. Appropriate dictionaries are needed
to calculate the following indicators:
        <list list-type="bullet">
          <list-item><p>the number of idioms in the text divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of sentences with at least one idiom divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of phraseologies in the text divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of sentences with at least one phraseology from the list divided by the total number of sentences.</p></list-item>
        </list>
      </p>
      <p>The last group contains indicators of the vocabulary:
        <list list-type="bullet">
          <list-item><p>the richness of vocabulary: the number of different words used in the text divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of words that occur in the text once or twice divided by the total number of sentences;</p></list-item>
          <list-item><p>the number of words not found in the dictionary divided by the total number of sentences.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.3. Model of gender classification of surnames</title>
      <p>Let us consider the process of solving the problem of gender classification of surnames in more
detail. Before the process starts, the input data must be preprocessed. Records without
surnames (for example, Biofarm, Orhanizatsiia hromadska, etc.) as well as non-Ukrainian surnames (for
example, Dao, Wu, Aktürk, Algül, etc.) have been removed from the analyzed file. Surnames in Cyrillic
have been transliterated in accordance with modern rules. All characters except letters, spaces,
hyphens and apostrophes have been removed from the records. All records have been converted to a single letter case.</p>
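      <p>The character filtering and case normalization steps can be sketched as below. The function name clean_record and the choice of lower case are assumptions; removing non-surname records and transliteration depend on rules and reference lists that are not given here, so they are deliberately left out.</p>
      <preformat>
```python
# Characters kept in addition to letters: space, hyphen, straight apostrophe.
ALLOWED_EXTRA = " -'"


def clean_record(record):
    """Keep letters, spaces, hyphens and apostrophes; collapse spaces; lower-case."""
    kept = "".join(ch for ch in record if ch.isalpha() or ch in ALLOWED_EXTRA)
    # split() without arguments also drops leading, trailing and repeated spaces.
    return " ".join(kept.split()).lower()


print(clean_record("  Borysova, N. V. (1)"))
```
      </preformat>
      <p>Lower case is used here only because some single case has to be chosen; typographic apostrophes would need to be mapped to the straight one before filtering.</p>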
      <p>The gender classification process consists of two stages. The first stage performs
gender classification of surnames with clear gender characteristics. The model of this business
process is presented in the form of a BPMN diagram in Figure 1.</p>
      <p>The process starts with forming a training dataset of female and male surnames. Typical
endings of female surnames are -ska, -s’ka, -skaya, -skaia, -ova, -ina, -eva, -ieva, -eeva,
-yeva, -lna, -la. The endings of male surnames are the following: -iev, -kov, -sky, -nov, -shev, -skyi,
-khin, -mov, -rov, -skiy, -chev, -cheev, -l'ev, -bov, -eyev, -gin, -kin, -bin, -skii, -gii, -hiy, -ckiy, -lev,
-nin, -voy, -rev, -skyy, -lov, -lui, -lyy. Then the trained classifier carries out gender classification of new
surnames. The results are recorded in two files, one for female and one for male surnames. The next step is
checking the results by an expert. The efficiency of the classifier is evaluated with the Accuracy
metric. The numerical values of the Accuracy metric are provided in the Results and discussions
section. The obtained results are used at the next stage, which is shown in the BPMN diagram in
Figure 1 as the subprocess labeled “Conduct gender classification of surnames without gender
features”.</p>
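      <p>For comparison with the trained classifier, the listed endings can also be used directly as a rule-based baseline. The sketch below is such a lookup, not the classifier described in the text: the function name classify_by_ending and the longest-match rule are assumptions introduced for illustration.</p>
      <preformat>
```python
# Endings copied from the lists in the text (hyphens dropped, straight apostrophes).
FEMALE_ENDINGS = ("ska", "s'ka", "skaya", "skaia", "ova", "ina", "eva",
                  "ieva", "eeva", "yeva", "lna", "la")
MALE_ENDINGS = ("iev", "kov", "sky", "nov", "shev", "skyi", "khin", "mov",
                "rov", "skiy", "chev", "cheev", "l'ev", "bov", "eyev", "gin",
                "kin", "bin", "skii", "gii", "hiy", "ckiy", "lev", "nin",
                "voy", "rev", "skyy", "lov", "lui", "lyy")


def classify_by_ending(surname):
    """Return 'female', 'male' or None, using the longest listed ending that matches."""
    name = surname.lower()
    best_label, best_length = None, 0
    for label, endings in (("female", FEMALE_ENDINGS), ("male", MALE_ENDINGS)):
        for ending in endings:
            if name.endswith(ending) and len(ending) > best_length:
                best_label, best_length = label, len(ending)
    return best_label


for surname in ("Markova", "Markov", "Shevchenko"):
    print(surname, classify_by_ending(surname))
```
      </preformat>
      <p>Surnames whose endings are not in the lists stay undecided (None) in this lookup, which is one reason a classifier trained on the last letters of the surname is used instead of fixed rules.</p>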
      <p>The second stage classifies surnames without clearly expressed gender
features according to the authors’ texts. Its functional model is presented in Figure 2. The input data is a
file with the surnames that were classified incorrectly at the first stage. To use the classifier, it is necessary to form
a training corpus of texts from GRAC according to the available features [27]: one subcorpus with texts
written by women and another with texts written by men. The set of texts has been formed based on the subset of
well-defined surnames from the first stage.</p>
      <p>The user can run the classification in one or two stages, depending on the results obtained. If a surname
has clear gender characteristics, the result of the first stage will be satisfactory. If the surname
does not have clear gender characteristics and was classified inaccurately, the user can apply the
analysis of the person’s texts, so that the second
stage determines the gender of the author. Because the classification is carried
out in two stages, the analysis of efficiency is also carried out in two stages. The efficiency of the first
stage, which classifies only surnames, is higher.</p>
      <p>Furthermore, the time of writing of the texts should be taken into account. To determine the gender
of authors of a certain period, it is necessary to use a base of texts written in that period. Using
texts from another time period leads to unreliable classification results, since the linguistic
characteristics of a text depend significantly on the period in which it was written.</p>
      <p>
        When the training corpus has been created, the classifier is trained and then applied to new, unknown texts. An
analyzed text is divided into sentences. The following markers of the end of a sentence are used: a period, an
exclamation mark, a question mark, a newline, a tab. Then the classifier performs the following actions for
each sentence: counting the punctuation marks, dividing the sentence into words,
lemmatization and morphological analysis, counting the different parts of speech,
counting the idioms and phraseologies (an additional dictionary is also available in GRAC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]),
and counting the words with errors. The obtained information is the basis for calculating
the classification indicators of the whole text.
      </p>
      <p>
        According to the functional model of resolving the problem of gender determination, this task is a
classification task. It is proposed to use the naïve Bayesian classifier as the classification method, because
it has shown the best results [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Both stages of the functional model of gender classification of
surnames use the classifier. The input data for the first stage are the gender features of surnames. For the dataset of
surnames without clearly expressed gender features, a classification task must be solved in which
the output classes are two groups of texts, depending on the gender of the author. Training and
using the classifier are similar for both stages. Let us consider training and using the
naïve Bayesian classifier in more detail on the example of the text classification task.
      </p>
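      <p>Let us introduce the following notation:
        <list list-type="bullet">
          <list-item><p><inline-formula><tex-math>I</tex-math></inline-formula> is a set of indicators that describe the text;</p></list-item>
          <list-item><p><inline-formula><tex-math>J_i,\ i \in I</tex-math></inline-formula>, is a set of values of the <inline-formula><tex-math>i</tex-math></inline-formula>-th indicator;</p></list-item>
          <list-item><p><inline-formula><tex-math>T</tex-math></inline-formula> is a set of texts;</p></list-item>
          <list-item><p><inline-formula><tex-math>x_{ij}^{t}</tex-math></inline-formula> is the <inline-formula><tex-math>j</tex-math></inline-formula>-th value of the <inline-formula><tex-math>i</tex-math></inline-formula>-th indicator in the <inline-formula><tex-math>t</tex-math></inline-formula>-th text, <inline-formula><tex-math>i \in I,\ j \in J_i,\ t \in T</tex-math></inline-formula>.</p></list-item>
        </list>
      </p>
      <p>It is necessary to determine the gender of the author for each text, so let <inline-formula><tex-math>Y = \{y_k\},\ k = \overline{1,2}</tex-math></inline-formula>, denote the output class,
where <inline-formula><tex-math>y_k</tex-math></inline-formula> is the <inline-formula><tex-math>k</tex-math></inline-formula>-th value of the class. The set of texts is divided into two sets, so
<inline-formula><tex-math>T = \bigcup_{k=1}^{2} T_k</tex-math></inline-formula>, where <inline-formula><tex-math>T_k</tex-math></inline-formula> is the set of texts of the <inline-formula><tex-math>k</tex-math></inline-formula>-th class.
Let us introduce notation for the number of occurrences of certain values of indicators:
        <list list-type="bullet">
          <list-item><p><inline-formula><tex-math>n_{ij}</tex-math></inline-formula> is the number of occurrences of the <inline-formula><tex-math>j</tex-math></inline-formula>-th value of the <inline-formula><tex-math>i</tex-math></inline-formula>-th indicator, where <inline-formula><tex-math>n_{ij} = \sum_{t \in T} x_{ij}^{t}</tex-math></inline-formula>;</p></list-item>
          <list-item><p><inline-formula><tex-math>n_{ij}^{k}</tex-math></inline-formula> is the number of occurrences of the <inline-formula><tex-math>j</tex-math></inline-formula>-th value of the <inline-formula><tex-math>i</tex-math></inline-formula>-th indicator in the texts of the <inline-formula><tex-math>k</tex-math></inline-formula>-th class;</p></list-item>
          <list-item><p><inline-formula><tex-math>n_k</tex-math></inline-formula> is the number of occurrences of <inline-formula><tex-math>y_k</tex-math></inline-formula> as the value of the output class <inline-formula><tex-math>Y</tex-math></inline-formula>.</p></list-item>
        </list>
      </p>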
      <p>Given the notation above, the general algorithm for using the
Bayesian classifier to determine the gender of the authors of texts is presented as an activity diagram in
Figure 13.</p>
      <p>[Figure 13: Activity diagram of the algorithm. Activities: input of the set of input indicators;
training of the naive Bayesian classifier; input of statistical data; determining the class of the
(t+1)-th text; output of the obtained data.]</p>
      <p>The algorithm consists of two stages: the first trains the Bayesian classifier,
and the second applies the classifier trained in the first stage.
This completes the mathematical model for solving the given problem.</p>
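      <p>The two stages of the algorithm can be illustrated with a minimal sketch. It is not the authors'
implementation: the function names, the toy indicators, and the use of add-one smoothing are assumptions made
only for illustration. Training reduces to counting how often each indicator value occurs in each class;
classification picks the class with the highest log-posterior score.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Stage 1: collect the counts the classifier needs.

    samples: iterable of (features, label) pairs, where features is a dict
    mapping an indicator name to its discrete value.
    """
    class_counts = Counter()             # number of texts per class
    value_counts = defaultdict(Counter)  # (class, indicator) -> value counts
    values_seen = defaultdict(set)       # indicator -> distinct values seen
    for features, label in samples:
        class_counts[label] += 1
        for indicator, value in features.items():
            value_counts[(label, indicator)][value] += 1
            values_seen[indicator].add(value)
    return class_counts, value_counts, values_seen


def classify(model, features):
    """Stage 2: return the class with the highest log-posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log prior of the class
        for indicator, value in features.items():
            count = value_counts[(label, indicator)][value]
            n_values = len(values_seen[indicator]) or 1
            # Add-one smoothing (an assumption) avoids log(0) for unseen values.
            score += math.log((count + 1) / (n_label + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data with hypothetical indicators; classes are numbered 1 and 2.
training = [
    ({"indicator_1": "high", "indicator_2": "low"}, 1),
    ({"indicator_1": "high", "indicator_2": "high"}, 1),
    ({"indicator_1": "low", "indicator_2": "high"}, 2),
    ({"indicator_1": "low", "indicator_2": "low"}, 2),
]
model = train(training)
predicted = classify(model, {"indicator_1": "high", "indicator_2": "low"})
print(predicted)
```
]]></preformat>
      <p>Working with logarithms of probabilities rather than their product is a common design choice, since it
avoids numerical underflow when the number of indicators is large.</p>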
    </sec>
    <sec id="sec-10">
      <title>5. Results and discussions</title>
      <p>Previous research has proposed different metrics for estimating the efficiency of the
developed classifier; the choice depends on the problem being solved. The Accuracy metric was chosen
to evaluate the classification of surnames with gender features. The Precision and
Recall metrics were chosen to evaluate the classification of surnames without
gender features.</p>
      <p>The calculated values of the chosen metrics for the preliminary study are presented in Table 1. For the first
stage, the classifier used a dataset containing 500 male and 500 female surnames. The structure of the
training set and the results of using it are as follows: 1163 surnames from the training dataset were
identified as male (1143 of them correctly) and 1156 surnames were identified as female
(1005 of them correctly). For the second stage, the classifier was trained on subcorpora
consisting of texts written by 30 female authors and 30 male authors. In total, 850 texts were taken for analysis.</p>
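      <p>As a worked illustration, the counts quoted above can be turned into ratios. The sketch below assumes
that accuracy is computed as the number of correct identifications divided by the number of all identifications;
the variable and function names are hypothetical, and the text does not confirm that the values in Table 1 were
obtained exactly this way.</p>
      <preformat><![CDATA[
```python
# Counts quoted in the text for the first stage of the preliminary study:
# class -> (surnames identified as this class, of which identified correctly).
reported = {
    "male": (1163, 1143),
    "female": (1156, 1005),
}


def share_correct(identified, correct):
    """Fraction of the surnames assigned to one class that were assigned correctly."""
    return correct / identified


def overall_accuracy(counts):
    """Correct identifications divided by all identifications, over all classes."""
    total = sum(n for n, _ in counts.values())
    correct = sum(c for _, c in counts.values())
    return correct / total


per_class = {label: share_correct(*pair) for label, pair in reported.items()}
accuracy = overall_accuracy(reported)

for label, value in per_class.items():
    print(f"{label}: {value:.4f}")
print(f"overall: {accuracy:.4f}")
```
]]></preformat>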
      <p>The values of the efficiency metrics for the main study are shown in Table 2. At the first stage of the
main study, all available surnames were analyzed. At the second stage, all texts of the authors in question
were analyzed. A distinctive feature of the second stage is that the values of the Precision and
Recall metrics are calculated only for texts written by authors of known gender. This rule is strict,
because only in this case can the quantities needed to calculate
these metrics be determined.</p>
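      <p>The restriction to authors of known gender can be sketched as follows. This is an illustrative sketch
with invented toy data, not the evaluation code used in the study: pairs whose true label is unknown
(represented here by <monospace>None</monospace>, an assumption) are skipped before the true positives, false
positives, and false negatives are counted.</p>
      <preformat><![CDATA[
```python
def precision_recall(pairs, positive):
    """Compute precision and recall for one class.

    pairs: iterable of (true_label, predicted_label); a true_label of None
    marks an author of unknown gender and is excluded from the calculation.
    """
    known = [(t, p) for t, p in pairs if t is not None]
    tp = sum(1 for t, p in known if p == positive and t == positive)
    fp = sum(1 for t, p in known if p == positive and t != positive)
    fn = sum(1 for t, p in known if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy data: (true gender or None, predicted gender).
pairs = [
    ("female", "female"),
    ("female", "male"),
    ("male", "male"),
    (None, "female"),    # unknown gender: ignored
    ("male", "female"),
    ("female", "female"),
]
precision, recall = precision_recall(pairs, positive="female")
print(precision, recall)
```
]]></preformat>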
    </sec>
    <sec id="sec-11">
      <title>6. Conclusions</title>
      <p>The problem of gender classification of surnames is large in scale. It goes beyond the boundaries of linguistics
and requires automated natural language processing methods. The task is relevant,
since it arises in various areas of human activity. A linguist can
solve it manually, but modern information technologies and the large amount
of information available in electronic form can decrease the processing time and increase the quality of the obtained
data. Determining gender is a classification task. Many machine learning methods have
demonstrated their efficiency for this task, as confirmed by numerous studies for
different languages.</p>
      <p>In this study, an approach to determining the gender of Ukrainian
surnames has been proposed. Its main point is that classification is performed in two
stages. The first stage is based only on the gender features of surnames. The second stage classifies
the authors’ surnames according to their texts. A set of texts with known authors and
various relevant literature sources were analyzed to create the set of text characteristics.
The features in this set were divided into five groups according to the meaning of the calculated
values. The conducted research and the analysis of the classifier’s efficiency showed that
the proposed approach to determining the gender of a surname can be used to improve the process of
authorship examination.</p>
    </sec>
    <sec id="sec-12">
      <title>7. Acknowledgment</title>
      <p>The authors are sincerely grateful to Maria Shvedova, the developer of the GRAC corpus, for the data
provided for analysis.</p>
    </sec>
    <sec id="sec-13">
      <title>8. References</title>
      <p>[21] E. I. Goroshko, “Features of male and female verbal behavior: (psycholinguistic analysis),” Ph.D.
dissertation, Russian Academy of Sciences, Institute of Linguistics, Moscow (1996) (in Russian).</p>
      <p>[22] O. Goroshko, Differentiation in Male and Female Speech Styles. Budapest, Hungary: Open
Society Institute Center for Publishing Development Electronic Publishing Program (1999).</p>
      <p>[23] E. S. Oshchepkova, “Gender identification of the author from the written text: Lexical and
grammatical aspect,” Ph.D. dissertation, Moscow State Linguistic University, Moscow (2003) (in Russian).</p>
      <p>[24] A. V. Plusnina, “Characteristics of Man and Female Written Speech in Gender Consciousness of
Communicators,” Yaroslavl Pedagogical Bulletin, no. 1, pp. 184–188, 2012. URL:
http://vestnik.yspu.org/releases/2012_1g/41.pdf (in Russian).</p>
      <p>[25] T. A. Litvinova, “Written text author’s characteristics ascertainment (profiling),” Philology.
Theory and Practice, no. 2(13), pp. 90–94, 2012. URL:
https://www.gramota.net/articles/issn_1997-2911_2012_2_29.pdf (in Russian).</p>
      <p>[26] T. A. Litvinova, O. V. Zagorovskaya, V. A. Chervaneva and O. A. Litvinova, “The problem of
author gender attribution impact of genre,” Russian Journal of Education and Psychology, no.
1(33), 2014. DOI: http://dx.doi.org/10.12731/2218-7405-2014-1-4 (in Russian).</p>
      <p>[27] Sketch Engine. Concordance GRAC v.11. Advanced. URL:
https://parasol.vmguest.unijena.de/grac_crystal/#concordance?corpname=grac11</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <source>GRAC</source>, <year>2022</year>. URL: http://uacorpus.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name><given-names>M.</given-names> <surname>Shvedova</surname></string-name>, “<article-title>The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality</article-title>,” in <source>Proc. of the 4th International Conference on Computational Linguistics and Intelligent Systems (COLINS'2020)</source>, vol. I: Main Conference, Lviv, Ukraine, Apr. <year>2020</year>, pp. <fpage>489</fpage>–<lpage>506</lpage>. URL: http://ceur-ws.org/Vol-2604/paper36.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><given-names>A. V.</given-names> <surname>Kirilina</surname></string-name>, <source>Gender: linguistic aspects</source>. Moscow (<year>1999</year>) (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><given-names>E. I.</given-names> <surname>Goroshko</surname></string-name>, <string-name><given-names>A. V.</given-names> <surname>Kirilina</surname></string-name>, “<article-title>Gender Research in Linguistics Today</article-title>,” <source>Gender Research</source>, no. <issue>2</issue>, pp. <fpage>234</fpage>–<lpage>241</lpage>, Kharkiv (<year>1999</year>) (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name><given-names>E. A.</given-names> <surname>Zemskaya</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>Kitaygorodskaya</surname></string-name>, <string-name><given-names>N. N.</given-names> <surname>Rozanova</surname></string-name>, “<article-title>Features of male and female speech</article-title>,” <source>Russian language and its functioning</source>, Moscow, Science (<year>1999</year>), pp. <fpage>90</fpage>–<lpage>136</lpage> (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name><given-names>A. V.</given-names> <surname>Anishchenko</surname></string-name>, “<article-title>On the gender characteristics of the realization of emotional reactions</article-title>,” <source>Gender: Language, Culture, Communication</source>, Moscow (<year>2003</year>), pp. <fpage>18</fpage>–<lpage>19</lpage> (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name><given-names>E. G.</given-names> <surname>Borisova</surname></string-name>, “<article-title>The use of interjections in the speech of women and men</article-title>,” <source>Gender: Language, Culture, Communication</source>, Moscow (<year>2003</year>), pp. <fpage>28</fpage>–<lpage>29</lpage> (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <article-title>Gender determination by name - when accuracy really matters</article-title>, <year>2016</year>. URL: https://habr.com/ru/post/274499/ (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <source>MlSexDetector</source>. <article-title>Neural network for detecting user's sex by name</article-title>, <year>2019</year>. URL: https://github.com/Rai220/MlSexDetector
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>Gender determination by name</article-title>
          ,
          <year>2017</year>
          . URL: https://ru.stackoverflow.com/questions/655179, (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <source>Gender API</source>, <year>2022</year>. URL: https://gender-api.com/en/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name><given-names>A.</given-names> <surname>Shleiko</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Borysova</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Kochuieva</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Melnyk</surname></string-name>, “<article-title>An overview of existing machine learning methods for gender classification of names</article-title>,” in <source>Proc. of the 5th International Conference on Computational Linguistics and Intelligent Systems (COLINS'2021)</source>, vol. II, Lviv, Ukraine, Apr. <year>2021</year>, pp. <fpage>91</fpage>–<lpage>92</lpage>. URL: http://web.kpi.kharkov.ua/iks/wpcontent/uploads/sites/113/2021/10/CoLInS_Volume2_2021.pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] <article-title>Supervised Learning</article-title>, <year>2020</year>. URL: https://www.ibm.com/cloud/learn/supervised-learning
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] <string-name><given-names>T. B.</given-names> <surname>Kryuchkova</surname></string-name>, “<article-title>Some research of the features of the use of the Russian language by men and women</article-title>,” <source>Problems of psycholinguistics</source>, Moscow, <year>1995</year>, pp. <fpage>186</fpage>–<lpage>199</lpage> (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] <string-name><given-names>I. N.</given-names> <surname>Kavinkina</surname></string-name>, “<article-title>Diminutives as markers of the linguistic consciousness of men and women</article-title>,” <source>Word formation and nominative derivation in Slavic languages</source>, Part 1, pp. <fpage>25</fpage>–<lpage>31</lpage>, Grodno (<year>1998</year>) (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] <string-name><given-names>N.</given-names> <surname>Shkvorchenko</surname></string-name>, “<article-title>Internet discourse as a linguistic category</article-title>,” <source>Current issues of the humanities</source>, vol. <volume>3</volume>, no. <issue>23</issue> (<year>2019</year>), pp. <fpage>62</fpage>–<lpage>72</lpage> (in Ukrainian).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] <string-name><given-names>N. L.</given-names> <surname>Pushkareva</surname></string-name>, “<article-title>Gender Linguistics and Historical Sciences</article-title>,” <source>Ethnographic Review</source> (<year>2001</year>), no. <issue>2</issue>, pp. <fpage>31</fpage>–<lpage>40</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] <string-name><given-names>E. I.</given-names> <surname>Goroshko</surname></string-name>, “<article-title>On the question of the correlation of quantitative and qualitative methods of data analysis in linguistic gender studies</article-title>,” <source>Gender: Language, Culture, Communication</source>, Moscow, <year>2003</year>, pp. <fpage>35</fpage>–<lpage>36</lpage> (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] <string-name><given-names>Yu. P.</given-names> <surname>Maslova</surname></string-name>, “<article-title>Features of the development of gender linguistic research in Ukraine and abroad</article-title>,” <source>Scientific notes of the National University of Ostroh Academy. Series: Philological</source>, no. <issue>57</issue> (<year>2015</year>), pp. <fpage>100</fpage>–<lpage>105</lpage> (in Ukrainian).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] <string-name><given-names>N. V.</given-names> <surname>Vyazigina</surname></string-name>, “<article-title>Gender linguistics and diagnostics of the sex as a problem of authorship's expertise</article-title>,” <source>Legal Linguistics</source>, no. 2(13), pp. <fpage>48</fpage>–<lpage>53</lpage> (<year>2013</year>). DOI: https://doi.org/10.14258/leglin(2013)%25x (in Russian).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>