<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>I. Butko, The use of geospatial information by public authorities to support the
decision making of management. Advanced Information Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.20998/2522-9052.2021.1.05</article-id>
      <title-group>
        <article-title>Information technology for textual content author's gender and age determination based on machine learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>victoria.a.vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lyubomyr Chyrun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Chyrun</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariia Soltys</string-name>
          <email>mariia.soltys.sa.2020@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ivan Franko National University of Lviv, University 1</institution>
          ,
          <addr-line>79000 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera 12, 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Readable data</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>5</volume>
      <issue>1</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This project developed a model that determines an author's age and gender from a text the author has written. Before the work began, earlier studies on the topic were reviewed to establish what has already been researched and tested and what still deserves investigation. These studies also gave many clues about which implementation methods and tools suit the task best. The project work was planned using process diagrams and data flow diagrams. After a review of the available methods and tools, simple Random Forest classification and regression models were chosen: they cope with the task quite well, are far less resource-intensive than large language models, and are easy to use and configure. Two datasets were selected, one of blogs and one of books. The blog dataset was used the most because it contains both the age and the gender of each blog author. The prediction accuracy of the "book" model is 0.8, and that of the blog model is 0.6. Before use, the data was analysed and cleaned, then transformed into embeddings and passed to model training. The results of the model are studied and analysed in detail. Many useful features responsible for classifying the age or gender of the author were extracted from the texts, and many interesting regularities were observed while analysing the results. In addition, a test case is implemented that allows the user to interact with the model easily.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>text analysis</kwd>
        <kwd>dataset</kwd>
        <kwd>author</kwd>
        <kwd>age</kwd>
        <kwd>gender</kwd>
        <kwd>NLP</kwd>
        <kwd>cybersecurity</kwd>
        <kwd>context</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Determining the gender and age of the author of a text is a difficult task,
especially on the Internet, where information is often published anonymously
or under pseudonyms [1-5]. The issue is relevant both for targeting
advertising at an audience, for example in social networks, and for determining
additional parameters of the author of an anonymous text, especially if the text is
fake, propaganda or disinformation [6-11]. Although there are machine learning models that
determine gender and age from photos or videos, for example those posted on social
networks, these approaches are limited because real visual information about the
author is not always available [12-16]. For this reason, researchers turn to text
analysis to determine such parameters, which opens up new opportunities [17-22].
Determining the gender and age of the author from a text depends on various
factors, including the author's writing style, imagery, lexical features, and the
words and phrases used [23-39]. One approach is to apply machine learning
methods to textual data [40-45]. For example, models based on neural networks can
analyse syntactic and semantic features of the text to determine the gender and
age of the author [46-54]. Research in this direction is already underway, and its results
indicate the potential of these approaches [55-60]. On the other hand, determining
gender and age from a text can be a more difficult task because writing styles, context
and other factors vary over time [61-69]. It therefore remains an
active area of research in natural language processing. Finally, the
development of new approaches to the analysis of textual information may in the future
help to determine the age and gender of an author from texts published
on the Internet, as an additional parameter for identifying the potential author of a set
of generated fakes, propaganda or disinformation.</p>
      <p>Determining the age and gender of an author from a text written by that author is a
highly relevant problem today. Such a model could be useful in various fields, for example:
• cyber security and law enforcement, to detect and identify persons who
plan or commit crimes online. It would help in detecting Internet fraudsters
and online criminals, and in investigating cyber security threats;
• historical research, to determine the authorship of texts or to date a writer's
works, which can be important for identifying authors or analysing
the development of language and styles in different historical periods;
• secondary and higher education, to prevent plagiarism and ensure academic
integrity. A model for determining the gender and age of the author from the text
can help determine whether works submitted by students or schoolchildren are
authentic;
• marketing and the analysis of social networks, where the model can be useful for
determining the target audience, creating personalized offers and analysing user
behaviour;
• psychological and sociological research, to understand the peculiarities of
language style and the psychosocial characteristics of different population groups.</p>
      <p>It is also worth noting that, in wartime, such a model would be useful for
Ukraine to identify collaborators, trolls, propagandists or criminals based on their texts
on the Internet and in mass media, including social networks.</p>
      <p>The purpose of the research is to develop an information technology, based on machine
learning, that analyses a text for features indicating the gender and age of its author.</p>
      <p>The object of the research is the process of identifying the linguistic features of the
text content to determine the gender and age of the author.</p>
      <p>The subject of the research is methods and means of determining the gender and
age of authors of texts.</p>
      <p>For the first time, the paper considers the determination of both characteristics at the
same time, which was not investigated in earlier works [61-65]. In addition, this work
explores age and gender characteristics as reflected in texts, as opposed to identifying
these characteristics through images and videos.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>In the research area considered here, namely determining the gender and age of
the author of a text, the need to rely on previous research becomes especially critical.
This is due to several factors. Firstly, there are practically no works on this topic in the
Ukrainian context; the existing studies were mainly carried out by English-speaking
researchers. Secondly, Ukrainian-language datasets that contain texts together with data
on their authors (age and gender) are very scarce, if they exist at all, so a study
of the Ukrainian language that does not include the creation of a completely new dataset
is practically unattainable.</p>
      <p>These initial problems with finding relevant data make research in the Ukrainian
context difficult. Ukrainian data on authors and texts
are absent from the available datasets, which makes it difficult to carry out an
objective analysis. We therefore focus on English-language studies and datasets
to ensure an adequate amount of data for analysis and project development.</p>
      <p>This situation highlights the importance of the study taking into account the results of
other studies carried out in the English-speaking context and meeting global standards
in the field of text analysis to determine the age and gender of the author.</p>
      <p>Social media is important for monitoring the perception of public health issues and for
educating target audiences about health. However, limited information on the
demographics of social media users makes it difficult to identify conversations among
target audiences and limits the effectiveness of using social media for public health
surveillance and educational interventions [66-75]. Only certain social media platforms
provide demographic information about the followers of a user's account, and even where
such information exists, it is not always disclosed. Researchers have therefore developed
machine learning algorithms to predict the demographic characteristics of social network
users, mainly for Twitter [61]. To date, limited research has been conducted on predicting
the demographic characteristics of Reddit users [61]. The study [61] used both data and
metadata about Reddit users, that is, not only their posts but also
the communities in which they post, comment or simply subscribe. The
researchers manually labelled users' data using the SMART app, looking for confirmation
of age in comments or posts where users stated it themselves. Each age category
(adolescents (13-17 years), young adults (18-20 years), and
adults (21-54 years)) had a minimum of 625 records. After the data was labelled by age,
metadata was collected for each user via the Reddit API. Metadata included user-level
information (e.g., year of account creation), submission-level data (e.g., post popularity),
and comment-level data (e.g., commenting frequency). The study focused on specific
metadata that could potentially help distinguish between adolescent and adult age
groups. The research identified 1,523 variables that could potentially indicate the age of
Reddit users:
• Summary statistics: the average score of posts, etc.
• Subreddit frequency: frequency of posts in specific subreddits related to age
groups.
• Emoji usage: frequency of emoji usage in comments.
• Post patterns: percentage of posts that were videos, images, etc.
• Use of terms: TF-IDF scores for specific terms (e.g. "school") used in comments.</p>
      <p>The dataset was divided into training and test sets (80/20), after which various models
(logistic regression, random forest, k-nearest neighbours, gradient boosted trees) that could
potentially perform well on this task were trained and evaluated using
such metrics as AUROC, precision, recall and F1 score. The best result was
shown by the gradient boosted trees model (F1 score: 0.77, AUROC: 0.84). Finally,
the study analysed which features have the greatest influence on
determining the age of users. This study is important because it helps to better
understand what to rely on when determining the age and gender of the
authors of texts, and which features are the most important and influential.</p>
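      <p>The evaluation protocol described above (an 80/20 split followed by precision, recall and F1 score) can be sketched in plain Python. This is only an illustration, not the code of [61]: the helper names split_80_20 and precision_recall_f1, the toy labels and the fixed random seed are all assumptions made for the example.</p>
      <preformat>
```python
import random


def split_80_20(records, seed=0):
    """Shuffle records and return (train, test) with an 80/20 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = adolescent, 0 = adult.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

train, test = split_80_20(range(100))
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```
</preformat>
      <p>AUROC is omitted from the sketch because it needs predicted scores rather than hard labels.</p>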
      <p>The rapid growth of social networks has generated an unprecedented amount of
user-generated data, which provides an excellent opportunity for text mining [62]. The main
purpose of authorship analysis, an important part of text analysis, is to learn as much
as possible about the author of a text through the subtle variations in writing
style that exist across genders, ages, and social groups. Such information has a variety
of uses, including advertising and law enforcement. One of the most accessible sources
of user-generated data is Twitter, which provides free access to most user data through
its data access API. In the study [62], the authors sought to determine the gender of
Twitter users using the Perceptron and Naïve Bayes algorithms with selected 1- to
5-gram features from the tweet text. Streaming implementations of these algorithms were
used for gender prediction to cope with the speed and volume of tweet traffic. Since
informal text such as tweets cannot be easily evaluated using traditional dictionary
methods, the study [62] used n-gram features to represent streaming tweets.
Because the number of 1- to 5-grams is large, only a subset of them can be used in gender
classification; the informative n-gram features are therefore selected using
several feature selection algorithms. In the best case, the Naïve Bayes and Perceptron
algorithms showed accuracy, balanced accuracy and F-measure above 99%.</p>
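      <p>To make the 1- to 5-gram representation concrete, the following sketch extracts word n-grams from short texts and keeps only a small subset of them. It is a minimal illustration under stated assumptions: the function names and the toy tweets are hypothetical, and document frequency is used only as a stand-in for the feature selection algorithms of [62], which are not reproduced here.</p>
      <preformat>
```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=5):
    """Return all word n-grams of a text for n from n_min to n_max."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # A text with fewer than n tokens yields no n-grams of size n.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def top_k_features(texts, k):
    """Keep the k n-grams that occur in the largest number of texts.

    Assumption: document frequency replaces the selection
    algorithms used in the cited study.
    """
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(word_ngrams(text)))
    return [gram for gram, _ in doc_freq.most_common(k)]


# Hypothetical tweets used only for illustration.
tweets = ["omg so excited for tonight", "so excited for the game"]
grams = word_ngrams("so excited for tonight")
selected = top_k_features(tweets, k=3)
```
</preformat>
      <p>A four-word text produces ten n-grams (four unigrams, three bigrams, two trigrams and one 4-gram), which shows how quickly the feature space grows and why selection is needed.</p>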
      <p>The study [62] is based on the analysis of messages and posts on Twitter, and its
main goal is to extract features that reveal personal information
about the author of a tweet. A distinctive aspect of this study is that Twitter users write in
informal language, and the paper is devoted to analysing this informal language
for important identifying features. This approach has its difficulties. First,
tweets were limited to 140 characters, which is a problem for traditional
text analysis, since such analysis usually relies on large segments of text. Secondly,
because the language is informal, users very often use acronyms, text emoticons
and deliberately distorted spellings, which also make analysis more difficult.
Before the study was conducted, the data was carefully filtered and manually labelled using
the API. Six different feature selection mechanisms were compared to
determine which would best help accomplish the task. This process aims to extract
the most informative n-grams from tweets to improve gender prediction accuracy.
Classification is performed by simple models: the Perceptron and Naïve Bayes, a
probabilistic classifier based on Bayes' theorem. The importance of the study [62] is that it clearly
highlights the difficulties in analysing colloquial language and informal writing. Like the
previous one, this study also highlights the importance of the correct choice of features
for improving the accuracy of the model's prediction and, accordingly, the accuracy of the
author's gender classification.</p>
      <p>The study [63] investigated whether wording, stylistic choices and online behaviour
can be used to predict the age category of blog authors. The authors hypothesize that
significant changes in writing style distinguish pre-social media bloggers from post-social
media bloggers. By experimenting with different birth years, the authors found that
the birth dates of those who were college students when social media technologies such as
AIM, SMS text messaging, MySpace, and Facebook became popular gave accurate age
predictions. The authors also determined that features of Internet writing style are important
for predicting age, but lexical content is also needed to obtain significantly more
accurate results. The best results reported in [63] reach an accuracy of 81.57%.</p>
      <p>The study [63] determines the age of blog authors from
stylistic choices and online behaviour. The model is best at
determining the approximate age of a person, namely whether the person was born before
the era of social networks or during it. The blogs were collected from
LiveJournal, using only those blogs where the age of the author is indicated. All
the blogs are by American bloggers.</p>
      <p>Several features were identified that help determine the author's age, including
particular words, stylistic features such as slang or text emoticons, and online
behaviour such as frequency of posting and number of friends. A binary classification
model based on year of birth was used, slightly modified to account for changes in blogging
style brought about by popular social media technologies. The study [63] found that two
age groups (born in 1977-1979 and born in 1982-1984) differed greatly in
blogging style. Both stylistic and content features strongly influenced the prediction
of age, supported by other variables that helped in determining the age group. The
study [63] is important because it addresses the determination of age by text analysis
and the use of certain features in the text, without relying on metadata about
the user. The research can be expanded to determine the geographical location or other
data about the author.</p>
      <p>Although the relationship between discourse patterns and personal
identity has been studied for decades, the study of these patterns using language
technologies is relatively recent [64]. In this more recent tradition, the authors of [64]
framed the prediction of the author's age from text as a regression problem.
They investigated the same task on three very different genres of data
simultaneously: blogs, telephone conversations, and online forum posts. A domain
adaptation technique was also used, which makes it possible to train a joint model on all
three corpora together as well as separately and to analyse the differences in predictive
performance between the combined and corpus-specific aspects of the model. Effective
features include both stylistic features (such as POS patterns) and content-oriented features.
Using a linear regression model based on shallow text features, the authors of [64]
obtained correlations of up to 0.74 and mean absolute errors between 4.1 and 6.8 years.
Three datasets were selected for analysis: a blog corpus, the Fisher telephone
corpus, and a breast cancer forum. Each dataset has a different age distribution, which
affects the determination of the age of users. The blog dataset contains more young people,
while the breast cancer forum dataset contains more older people. The telephone
conversation dataset has the most balanced age distribution. Four different
linear regression models were built for predicting user age. Interestingly, the study [64] states that
the gender of the user significantly affects the identification of age, so it makes
sense to determine both characteristics. The best results were obtained on the dataset
of telephone conversations, closely followed by the blog dataset. The study
also provides examples in which the features that can be used to determine the age of an
Internet user are visible.</p>
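      <p>The two figures of merit quoted for the regression setting, the correlation and the mean absolute error in years, can be computed as in the sketch below. The ages are invented for illustration and do not come from [64]; the function names are likewise assumptions.</p>
      <preformat>
```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical true and predicted ages in years.
true_ages = [18, 25, 33, 41, 60]
predicted_ages = [22, 24, 30, 45, 52]

mae = mean_absolute_error(true_ages, predicted_ages)
r = pearson_r(true_ages, predicted_ages)
```
</preformat>
      <p>The two metrics answer different questions: the correlation shows whether older authors are ranked as older, while the mean absolute error shows by how many years a prediction misses on average.</p>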
      <p>There is growing interest in automatically predicting the gender and age of authors
from texts. However, most research so far ignores that language use is related to
the social identity of speakers, which may differ from their biological identity. In [65], the
authors combined insights from sociolinguistics with data collected through an online
game to highlight the importance of approaching age and gender as social variables
rather than static biological variables. In the game, developed by the authors, thousands
of players guessed the gender and age of Twitter users based on tweets alone. The authors
showed that more than 10% of Twitter users do not use language that the crowd associates
with their biological sex, and that older Twitter users are often perceived as
younger than they are. These conclusions highlight the limitations of current
approaches to gender and age prediction from texts. It is an interesting study
that calls previous studies into question. The authors point out that the behaviour
of users often does not correspond to their biological age or sex, so it makes sense to treat
gender, and likewise age, as a social construct rather than a biological feature.
The results of the game showed a significant difference between the guessed age and
the real age of the authors when only the text of the tweets was used. With this study, the authors
highlighted the problem that the automatic determination of age or gender is often based
on stereotypical features, which may not correspond to reality at all. This limits
the ability of models to account for upbringing and social constructs rather than
biological age alone. The authors of the study call for social and sociocultural
influence and the variability of people's language use to be considered when developing
classification models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and materials</title>
      <p>Many studies highlight the main characteristics by which age or
gender can be identified and which we could use in our study [61-75]. Different studies have
used different data and different models to predict user characteristics [61-75], which makes
it possible to compare them and understand what could be used in this research. For example, the
study [64] analysed how the distribution of data in a dataset affects the
accuracy of the model and, accordingly, the accuracy of the predicted characteristics,
i.e. age or gender. The study [65] allows us to look at our topic critically
and to determine what should be taken into account when developing our own
program, namely that an author's behaviour may often not coincide with the author's
biological sex or age because of certain social constructs or upbringing. To structure the
work, we first define the goal tree of our research.</p>
      <p>A goal tree is a hierarchical tree-like structure obtained by dividing the overall goal
into subgoals, which in turn can be divided into smaller subgoals, functions, etc.
(Fig. 1). Graphically, the tree is depicted with its branches pointing down, and the main goal is
placed at the highest level. The advantage of building a goal tree is that
a large goal that is hard to grasp as a whole is divided into simpler tasks that can be solved
by known methods. At the root of the tree is "Development of a model for determining the age and
gender of the authors of the text", and the following branches go down from the root:</p>
      <p>• Collection of datasets: preparation of datasets for model training and task
execution.
a. Blog Authorship Corpus: a dataset of blogs with information about
the author, used to determine age and gender [73].
b. Spooky Author Identification: a dataset of famous authors and
excerpts from their works, used to determine gender [74].</p>
      <p>• Feature extraction: selection and ranking of the features that most influence
the model output.
a. Bag-of-Words: word counts weighted with TF-IDF.
b. N-grams: sequences of words (bigrams,
trigrams) used as features to capture the context.
c. Embeddings (Word2Vec, GloVe): turn words into dense vectors that
capture semantic meaning.</p>
      <p>• Model training: training of the selected model on cleaned data.
a. Transformer model: pre-trained large language models, suitable
for gender determination.
b. Regression model: models based on a regression function,
suitable for determining age.</p>
      <p>• Analysis of results: construction of graphs, statistical analysis, summarization of
conclusions.</p>
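      <p>The TF-IDF weighting named in the feature extraction branch can be illustrated with a short sketch. It assumes one common variant (term frequency divided by document length, inverse document frequency as the natural logarithm of N over df); libraries usually add smoothing, and nothing here is taken from the project's actual code. The toy documents are hypothetical and must not be empty.</p>
      <preformat>
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical documents used only for illustration.
docs = ["i love my school", "i love my job", "my kids love school"]
vectors = tf_idf(docs)
```
</preformat>
      <p>Words that occur in every document, such as "my" here, receive zero weight, while a word found in only one document, such as "job", receives the highest weight. This is why a term like "school" can separate age groups only if it is used unevenly across them.</p>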
      <p>[Figure 1: goal tree. Recoverable node labels: Dataset collection; Blog Authorship Corpus; Spooky Author Identification.]</p>
      <p>• A wide selection of models can be applied to this task.
• Python libraries make it possible to perform a variety of tasks, from pre-processing to
data analysis.
• Hyperparameters can be adjusted to get the best results.</p>
      <p>Instructions:
• Transformers documentation, for the proper use of large language models.
• Previous research, from which useful information can be gleaned for this research.
• Other documentation, which helps in using the numerous libraries while
developing the model.</p>
      <p>[Figure: process diagrams for determining the age and gender of the authors. Recoverable labels: Transformers documentation; Other documentation; The results of earlier conducted studies; Texts of different authors; Determine the age and gender of the authors; Age and gender prediction model; Various models; Python libraries; Hyperparameters; Determine age and gender (3); Model prediction; Interpret the results (4).]</p>
      <p>A Data Flow Diagram (DFD) is a graphical structural analysis methodology that
describes data sources and destinations external to the system, logical functions, data
flows and the data stores that are accessed (Fig. 4-5). In other words, it describes the data
flows implemented in the project.</p>
      <p>Data repositories:
• Blog dataset: the downloaded Blog Authorship Corpus dataset [73].
• Book dataset: the downloaded Spooky Author Identification dataset [74].
• Documentation: all documentation that governs the developed models and
the software part of the project.</p>
      <p>External entities:
• Developer: a person who develops the model and configures it.
• User: a natural person who uses the finished model.</p>
      <p>Functions:
• Pipeline: the process of pre-processing data and preparing it for use by the
model.
• Age determination model: a machine learning model that predicts the age of the
author based on the texts the author has written.
• Gender determination model: a machine learning model that predicts the gender
of the author based on the texts the author has written.
• Conversion into a convenient format: conversion of the information provided by
the model into a convenient, human-readable format using graphs and
conversion functions.</p>
      <p>[Figures 4-5: data flow diagrams. Recoverable labels: Developer; Parameters; Age and gender determination; Books dataset; Documentation; Age determination model.]</p>
      <p>A workflow diagram (process diagram) is used to model the sequence of steps or
stages in the work process. The main purpose of such a diagram is to visualize and
analyse the workflow to optimize or automate the process. For the project, this is a
visualization of the development process and all its stages (Fig. 6). In the end, a fully
functional model was obtained for determining the age and gender of the author of the
text.</p>
      <p>[Figure 6: workflow diagram of the development process. Recoverable node labels: Start; Data collection; Data analysis; Model training; Results analysis; Conclusions; End.]</p>
      <sec id="sec-3-7">
        <title>Data cleaning</title>
        <p>• Collection of data, i.e. datasets with data on authors of texts [73-74].
• Data analysis: identification of data types, their quantity and other metadata for
model selection.
• Division into two branches: the left branch prepares the data, and the right branch
selects and configures the models.</p>
        <p>Left branch:
• Data cleaning: removal of special characters, unnecessary characters, and
articles.
• Identification of features using the previously described methods.
• Ranking of features to determine the most important for this study.
• Preparation of data for sending to the model.</p>
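        <p>The data cleaning step of the left branch might look like the sketch below. It is an assumption-laden illustration rather than the project's pipeline: the function name clean_text, the choice to keep apostrophes, and the list of English articles are all hypothetical.</p>
        <preformat>
```python
import re

# Assumed list of English articles to drop.
ARTICLES = {"a", "an", "the"}


def clean_text(text):
    """Lowercase, strip non-letter characters and drop English articles."""
    text = text.lower()
    # Replace every run of characters other than letters,
    # apostrophes and spaces with a single space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    tokens = [tok for tok in text.split() if tok not in ARTICLES]
    return " ".join(tokens)


sample = "The cat's toy cost $5, and an owl -- the owl! -- took it."
cleaned = clean_text(sample)
```
</preformat>
        <p>Whether punctuation and articles should really be removed is a design choice: the related studies above treat emoticons and spelling habits as useful stylistic signals, so aggressive cleaning can discard information about the author.</p>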
        <sec id="sec-3-7-1">
          <title>Selection of models:</title>
          <p>a. A model for classifying authors by age.</p>
          <p>b. A model for classifying authors by gender.</p>
          <p>Setting up the model, selecting parameters, optimizers and modifying the
architecture.</p>
        </sec>
        <sec id="sec-3-7-2">
          <title>Joining branches:</title>
          <p>Training of the previously configured model on prepared data.
Evaluation of results using metrics, graphs and analytics.</p>
          <p>Formation of research conclusions.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Statement and justification of the problem</title>
      <p>Statement of the problem: the study addresses the problem of determining the
gender and age of an author from texts written by that author. Its essence is to create
a machine learning model that analyses a text sample and determines the biological data
(age and gender) of its author.</p>
      <p>Technical characteristics: as input, the model accepts a processed and cleaned text sample
in text format (string, char); as output, it returns the age, as a numerical value or
numerical interval, and the gender, as a binary value.</p>
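      <p>The input and output described in the technical characteristics can be written down as a small data structure. The sketch below is hypothetical: the class name AuthorPrediction, the 0/1 encoding of gender and the formatting helper are assumptions made for illustration and are not part of the described model.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AuthorPrediction:
    """Hypothetical output record: an age interval and a binary gender."""
    age_interval: Tuple[int, int]  # inclusive bounds in years
    gender: int                    # binary value, assumed to be 0 or 1

    def __post_init__(self):
        low, high = self.age_interval
        if low > high:
            raise ValueError("age interval bounds are out of order")
        if self.gender not in (0, 1):
            raise ValueError("gender must be encoded as 0 or 1")


def format_prediction(pred):
    """Render a prediction in a human-readable form."""
    low, high = pred.age_interval
    label = "class 1" if pred.gender == 1 else "class 0"
    return f"age {low}-{high}, gender {label}"


example = AuthorPrediction(age_interval=(23, 27), gender=1)
summary = format_prediction(example)
```
</preformat>
      <p>A single numerical age can be represented in the same structure as an interval whose bounds coincide.</p>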
      <p>Business processes:</p>
      <p>Data collection → Data processing → Model selection → Model training → Creating
a practical application for the model, for example in cyber security for
identification.</p>
      <p>Technical means of implementation:
• For data processing: Bag-of-Words, N-grams, Word2Vec, and GloVe.
• For building the model: Transformers, TensorFlow, Keras, PyTorch.</p>
      <p>Application: the model is developed for research purposes to expand the issue of
determining the gender or age of the authors of texts, but it can also be used to identify
a person or verify authorship.</p>
      <p>Expected effects: a contribution to research on identifying the biological data of an
author from a text; development of a model potentially useful in cyber security and
other fields; new knowledge about the development of language models and
the conduct of research.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Comparison of methods and means of the product under development</title>
      <p>5.1. Machine learning models</p>
      <p>Regression models are better suited to predicting age. A few basic ones are compared
below.</p>
      <sec id="sec-5-1">
        <title>1. Linear regression</title>
        <p>Pluses: simple and clear; fast training and fast results.</p>
        <p>Cons: assumes a linear relationship between features and age, which may not hold
for complex textual data.</p>
      </sec>
      <sec id="sec-5-2">
        <title>2. Support Vector Regression (SVR)</title>
        <p>Pluses: effective in high-dimensional spaces; can capture complex relationships
using kernel functions.</p>
        <p>Cons: requires careful tuning of hyperparameters.</p>
      </sec>
      <sec id="sec-5-3">
        <title>3. Gradient Boosting Regression (for example, XGBoost)</title>
        <p>Pluses: resistant to fuzzy and noisy data; can effectively capture non-linear
relationships.</p>
        <p>Cons: higher computational cost compared to linear models.</p>
        <p>Options for using large language models (LLMs) to accomplish this task are also
considered.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4. BERT (Bidirectional Encoder Representations from Transformers)</title>
        <p>Pluses: captures the bidirectional context in the text; can handle complex
relationships and semantics in textual data; pre-trained on a large corpus (e.g. Wikipedia,
books) and then fine-tuned for specific tasks.</p>
        <p>Cons: requires significant computing resources for training and inference; needs a
large amount of memory.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5. Generative language models</title>
        <p>Pluses: create coherent text appropriate to the context; useful for text
prediction.</p>
        <p>Cons: cannot directly output a predicted age or gender; require additional
fine-tuning for a specific task.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.2. Comparison factors</title>
        <p>1. Performance. LLMs are generally excellent at capturing complex patterns and
semantics in textual data, potentially leading to higher predictive accuracy
compared to traditional regression models.</p>
        <p>2. Interpretability. Traditional regression models, such as linear regression, offer
straightforward interpretation, making it easier to understand the relationship
between features and predictions. LLMs, being deep learning models, are
more complex and less interpretable, although techniques such as attention
mechanisms can provide some insight.</p>
        <p>3. Resource requirements. LLMs require significant computational resources (e.g.,
GPU, memory) for training and inference due to their deep architecture and large
number of parameters. Traditional regression models have much lower resource
requirements.</p>
        <p>4. Possibility of adaptation to specific tasks. LLMs can be adapted to specific
tasks, such as age and gender prediction, through transfer learning from
pretrained models. Traditional regression models may require more complex feature
engineering and additional tuning for a specific domain.</p>
        <p>Therefore, both regression models and LLMs are suitable for the task of this study.
They perform differently on different tasks, so it is best to use several models,
give them different parts of the task and compare their performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>The core of the project, which fulfils its main goals of determining age and
gender, is a machine learning model written in the Python programming language.
Nevertheless, the model itself takes up relatively little of the program; most of the
code is devoted to data processing, data analysis, and analysis of the results. It is useful to
consider these parts of the program separately in order to focus on the methods
and processes of each stage.</p>
      <p>6.1. Data analysis</p>
      <p>At the data analysis stage, the dataset is loaded [73-74], and its content, size
and data distribution are analysed; correlations in the data are explored using graphs and
other tools.</p>
      <p>Methods: data loading, manual data cleaning, data visualization.</p>
      <p>Tools: Python libraries (pandas, matplotlib, seaborn).</p>
      <p>Process description: First, the dataset is loaded into the Python
environment for further processing. The dataset is reviewed manually, suitable features
are selected, and unnecessary features are deleted. Next, several visualizations are built
with Python libraries to reveal correlations in the data and to suggest how to work with
it, namely a graph of the gender distribution and a graph of the age distribution in the
dataset, which are used to check its balance.</p>
      <p>6.2. Data processing (pre-processing)</p>
      <p>In the pre-processing stage, the text is cleaned of unnecessary characters that can
negatively affect the accuracy of the model's predictions, and the data is converted into a
numerical format that the model can work with.</p>
      <p>Methods: removal of unnecessary symbols, removal of stop words, tokenization
of sentences, lemmatization of words, division of data into sets, vectorization of
words, encoding of labels.</p>
      <p>Tools: Python libraries (pandas, NLTK).</p>
      <p>Process description: The data from the previous step is first separated into text
and labels. Labels are converted into binary (gender) and categorical or
numeric (age) formats. After that, the text data is cleaned. First, all uppercase
letters are converted to lowercase; then sentences are cleaned of stop words and of all kinds
of signs and markup and, if necessary, lemmatized (in our case, this step turned
out to be unnecessary). Finally, the cleaned data is divided into training
and test sets.</p>
      <p>6.3. Model training</p>
      <p>After cleaning and pre-processing, the data can be transformed into a set of vectors
and fed into a model to make predictions. The model consists of two Random Forest
models, one of which classifies age and the other gender. The prediction
accuracy of both models is evaluated using metrics.</p>
      <p>Methods: text vectorization, model training, model evaluation.</p>
      <p>Tools: Python libraries (scikit-learn).</p>
      <p>Process description: the data cleaned at the previous stage is
passed to the vectorization function, which converts tokens into numerical values
(embeddings). In this numerical form, the data can be passed to the model
for training. Random Forest classification models were used to classify age and
gender, and a Random Forest regressor was used to predict the numerical value
of age. The model is trained on the texts together with their labels. Next, the model is
evaluated, and its accuracy is determined by comparing its predictions with the true
labels.</p>
      <p>6.4. Evaluation of results</p>
      <p>Evaluation of the results is one of the most important stages of any research. It
reveals regularities between the results and the initial data,
which can sometimes even prompt another study. Data visualization, model
accuracy measurement, feature selection, comparison of predictions with real
results, and other methods are used to evaluate research results.</p>
      <p>Methods: visualization of results, construction of predictions, transformation of
predictions into a human-understandable format, comparison of data, calculation
of numerical metrics.</p>
      <p>Tools: Python libraries (matplotlib, seaborn, scikit-learn, pandas, NumPy).</p>
      <p>Process description: the model predicts age and gender on the test data, its
predictions are compared with the true values, and graphs are generated. The most
influential features by which the model determines age and gender were identified
and displayed as graphs (separately for age and gender).
These graphs are among the most important because they help us understand
which words can indicate the biological data of the author of the text. In addition,
several graphs describe the accuracy of the model: the ROC curve, the confusion matrix
(true/false positives and negatives), the histogram of true and
predicted age (for numerical age prediction), and the distribution of true and predicted
age categories (for categorical age prediction).</p>
      <p>6.5. For the user</p>
      <p>This stage provides interaction with the user: the user can enter a text excerpt to
determine the gender and age of its author, and the results are presented
as clearly as possible.</p>
      <p>Methods: calling previously developed functions, outputting results in a
human-understandable format.</p>
      <p>Tools: Python libraries (scikit-learn, NLTK).</p>
      <p>Process description: The task of this stage is to create an extremely simple and
concise section for user interaction. The user enters, in the indicated place, the text
whose author needs to be characterized and runs two cells of code. The entered text is
passed to the previously developed functions: it is cleaned, stripped of stop words,
vectorized and passed to the model; the predictions are then converted into a readable
format and output to the user. The whole process takes 8 lines of code and no
longer than a minute.</p>
      <p>6.6. User manual</p>
      <p>To run the program, the user's device must meet the following requirements:</p>
      <p>Internet connection;</p>
      <p>Operating system: Windows 7 or higher;</p>
      <p>Software: a program that supports the .ipynb format (Jupyter Notebook, the Google
Colab web resource, VS Code);</p>
      <p>Hardware: 8+ GB RAM, CPU (or use Google Colab).</p>
      <p>To use the program, follow these steps:</p>
      <sec id="sec-6-1">
        <title>1. Place the program file and the dataset in one folder.</title>
        <p>2. Open the program file.</p>
        <p>3. Run each cell individually, one by one, using the run button (usually a triangle) to
the left of each cell, or the "Run All" button on the top panel of the program, if
there is one.</p>
        <p>4. Wait until all cells have finished executing (approximately 10-15 minutes).</p>
        <p>5. In the "For user" section (at the end of the file), enter the text to be
classified, then run the cell containing the entered text and the cell that follows it. The
result appears under the second cell.</p>
        <p>6.7. Program code</p>
        <p>Importing the required libraries:</p>
        <preformat>import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer</preformat>
        <p>Loading the dataset:</p>
        <preformat>df = pd.read_csv('./blogtext.csv')
df.head()</preformat>
        <p>Deleting unnecessary columns:</p>
        <preformat>df = df.drop(columns=['id', 'date', 'sign'])  # deleting unnecessary columns
df</preformat>
        <p>Graph of the distribution of data by gender:</p>
        <preformat>plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='gender', color='purple')
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()</preformat>
        <p>Graph of the distribution of data by age:</p>
        <preformat>plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='age', bins=20, kde=True, color='purple')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()</preformat>
        <p>Function for cleaning the data of stop words:</p>
        <preformat>import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def stopwords_removal(text):
    new_words = word_tokenize(text)
    new_filtered_words = [
        # Optional lemmatizing variant:
        # lemmatizer.lemmatize(word.lower()) for word in new_words
        #     if word.lower() not in stop_words
        word for word in new_words if word.lower() not in stop_words
    ]
    return ' '.join(new_filtered_words)</preformat>
        <p>Sampling part of the dataset (30,000 samples):</p>
        <preformat>df = df.sample(n=len(df))       # shuffle the rows
df_short = df[:30000].copy()    # copy so that new columns can be added safely</preformat>
        <p>Converting age into categories (optional):</p>
        <preformat>def categorize_age(age):
    if age &lt; 20:
        return 0  # less than 20
    elif 20 &lt;= age &lt;= 30:
        return 1  # 20-30
    else:
        return 2  # more than 30

df_short['age_category'] = df_short['age'].apply(categorize_age)</preformat>
        <p>The function of removing unnecessary characters and applying all preprocessing
functions to the text:</p>
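        <p>The listing of this function is not reproduced here. As a rough illustration only, a
cleaning step of this kind might look like the sketch below. The names clean_text_sketch and
encode_gender_sketch, the regular expressions, and the set of characters removed are all
assumptions; only the binary gender encoding (0 for female, 1 for male) follows the decoding
function shown later in this listing.</p>
        <preformat><![CDATA[
```python
import re


def clean_text_sketch(text):
    """Hypothetical cleaning step: lowercase, drop links, keep letters and spaces only."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)               # remove digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace


def encode_gender_sketch(gender):
    """Map a gender label to a binary code (0 = female, 1 = male)."""
    return 0 if gender.strip().lower() == "female" else 1


print(clean_text_sketch("Check THIS out: http://example.com !!! 100% cool :)"))
print(encode_gender_sketch("female"), encode_gender_sketch("male"))
```
]]></preformat>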
        <p>Vectorization of text data:</p>
        <preformat>vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf = vectorizer.transform(x_test)</preformat>
        <p>A model for determining age:</p>
        <preformat>from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rf_age_category = RandomForestClassifier(n_estimators=100, random_state=42)
rf_age_category.fit(x_train_tfidf, y_train['age_category'])
age_category_score = rf_age_category.score(x_test_tfidf, y_test['age_category'])
print(f'Random Forest age category prediction accuracy: {age_category_score}')</preformat>
        <p>Calculation of metrics for the age model and feature selection:</p>
        <preformat>age_category_predictions = rf_age_category.predict(x_test_tfidf)
print(classification_report(y_test['age_category'], age_category_predictions))

importances_age_category = rf_age_category.feature_importances_
indices_age_category = np.argsort(importances_age_category)[::-1]  # most important first
top_n = 15
feature_names = vectorizer.get_feature_names_out()
top_features_age_category = [feature_names[i] for i in indices_age_category[:top_n]]
print(f'Top {top_n} features for age category prediction: {top_features_age_category}')</preformat>
        <p>Model for gender determination:</p>
        <preformat>rf_gender = RandomForestClassifier(n_estimators=100, random_state=42)
rf_gender.fit(x_train_tfidf, y_train['gender_bi'])
rf_gender.score(x_test_tfidf, y_test['gender_bi'])
gender_predictions = rf_gender.predict(x_test_tfidf)
print(classification_report(y_test['gender_bi'], gender_predictions))</preformat>
        <p>Selection of features for the gender model:</p>
        <preformat>importances_gender = rf_gender.feature_importances_
indices_gender = np.argsort(importances_gender)[::-1]
top_features_gender = [feature_names[i] for i in indices_gender[:top_n]]
print(f'Top {top_n} features for gender prediction: {top_features_gender}')</preformat>
        <p>Visualization of feature importance for age:</p>
        <preformat>top_importances_age_category = importances_age_category[indices_age_category[:top_n]]
plt.figure(figsize=(10, 6))
plt.barh(range(top_n), top_importances_age_category, align='center', color='salmon')
plt.yticks(range(top_n), top_features_age_category)
plt.gca().invert_yaxis()
plt.xlabel('Feature Importance')
plt.title('Top 15 Features for Age Category Prediction')
plt.show()</preformat>
        <p>Visualization of feature importance for gender:</p>
        <preformat>top_importances_gender = importances_gender[indices_gender[:top_n]]
plt.figure(figsize=(10, 6))
plt.barh(range(top_n), top_importances_gender, align='center', color='purple')
plt.yticks(range(top_n), top_features_gender)
plt.gca().invert_yaxis()
plt.xlabel('Feature Importance')
plt.title('Top 15 Features for Gender Prediction')
plt.show()</preformat>
        <p>Confusion matrix for gender predictions:</p>
        <preformat>from sklearn.metrics import confusion_matrix

predicted_genders = rf_gender.predict(x_test_tfidf)
cm = confusion_matrix(y_test['gender_bi'], predicted_genders)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='RdPu',
            xticklabels=['Female', 'Male'],
            yticklabels=['Female', 'Male'])
plt.title('Confusion Matrix for Gender Prediction')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()</preformat>
        <p>Confusion matrix for age predictions:</p>
        <preformat>cm_age_category = confusion_matrix(y_test['age_category'], age_category_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_age_category, annot=True, fmt='d', cmap='RdPu',
            xticklabels=['&lt;20', '20-30', '&gt;30'],
            yticklabels=['&lt;20', '20-30', '&gt;30'])
plt.title('Confusion Matrix for Age Category Prediction')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

sns.countplot(x=age_category_predictions, palette='Reds', alpha=0.5,
              label='Predicted Age Categories')
plt.title('Distribution of True vs. Predicted Age Categories')
plt.xlabel('Age Category')
plt.ylabel('Count')
plt.legend()
plt.show()</preformat>
        <p>A function for converting predicted data into a human-readable format:</p>
        <preformat>def decategorize(age, gender):
    # age and gender are the single-element arrays returned by predict()
    res = []
    if age == 0:
        res.append(['less than 20'])
    elif age == 1:
        res.append(['20-30'])
    else:
        res.append(['more than 30'])
    if gender == 0:
        res.append(['female'])
    else:
        res.append(['male'])
    return res</preformat>
        <p>Loading the second dataset (with book authors):</p>
        <preformat>book_train = pd.read_csv('train_book.csv')</preformat>
        <p>Conversion of author names to gender and pre-processing of text data using
previously described functions:</p>
        <preformat>book_train = book_train.drop(columns=['id'])

def categorize_gender(author):  # defining the author's gender from the author code
    if author == 'EAP' or author == 'HPL':
        return 1  # male
    else:
        return 0  # female

# Apply gender categorization
book_train['gender'] = book_train['author'].apply(categorize_gender)
# Keep the results of pre-processing by assigning them back to the column
book_train['text'] = book_train['text'].apply(text_prerocess)
book_train['text'] = book_train['text'].apply(stopwords_removal)
x_train_book, x_test_book, y_train_book, y_test_book = train_test_split(
    book_train['text'], book_train[['gender']], test_size=0.2, random_state=42)</preformat>
        <p>Data vectorization:</p>
        <preformat># Note: fit_transform refits the shared vectorizer on the book texts, so its
# vocabulary no longer matches the one the blog models were trained with.
x_train_b = vectorizer.fit_transform(x_train_book)
x_test_b = vectorizer.transform(x_test_book)</preformat>
        <p>Training the additional model purely on new data:</p>
        <preformat>rf_author = RandomForestClassifier(n_estimators=100, random_state=42)
rf_author.fit(x_train_b, y_train_book)
rf_author.score(x_test_b, y_test_book)</preformat>
        <p>Comparison of the results of the old and new gender models on the old and new data:</p>
        <preformat>print('Old model on old data: ', rf_gender.score(x_test_tfidf, y_test['gender_bi']))
print('Old model on new data: ', rf_gender.score(x_test_b, y_test_book))
print('New model on new data: ', rf_author.score(x_test_b, y_test_book))
print('New model on old data: ', rf_author.score(x_test_tfidf, y_test['gender_bi']))</preformat>
        <p>String for entering sample text from the user:</p>
        <preformat>your_text = ('I would love to go to the gallery with you after university! I heard they are '
             'exhibiting the most famous paintings of Monet, he is my favourite painter, '
             'i am so excited!')</preformat>
        <p>Functions for converting text, making predictions, and converting predictions back to text:</p>
        <preformat># Now run this cell and it will give you the result
text = text_prerocess(your_text)
text = stopwords_removal(text)
text = pd.Series([text])  # vectorize the cleaned text, not the raw input
text = vectorizer.transform(text)
pred_age = rf_age_category.predict(text)
pred_gen = rf_gender.predict(text)
result = decategorize(pred_age, pred_gen)
print(f"The author's age is in this range: {result[0]}\nAuthor's gender: {result[1]}")</preformat>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Results</title>
      <p>The program is divided into several sections according to the type of tasks performed.
When starting the program, you first need to import the libraries, immediately after that
the data analysis section begins.</p>
      <p>In the Data Analysis section, the dataset has been imported (if necessary, the
reference to the dataset in the file system must be changed) and unnecessary data
columns (id, sign, date) have been removed. After that, the data balance in the dataset
was checked using data visualization. A quantity graph was used to compare gender,
and a histogram was used to compare age.</p>
      <p>Next, data pre-processing takes place. This section defines two important functions
that are responsible for cleaning data, as well as one optional function:</p>
      <p>def stopwords_removal(text): the function removes stop words, removes capital
letters, tokenizes sentences and provides lemmatized text. As input, it accepts
the text to be cleaned, and it returns the cleaned text.</p>
      <p>def text_prerocess(text): a function for text pre-processing; it removes characters
and unwanted combinations that can cause noise. As input, it accepts the text to
be cleaned, and it returns the cleaned text.</p>
      <p>def categorize_age(age): an optional function, used only when age is predicted as
an age group rather than as a number. It divides authors
into three groups by age. As input, it takes the age; as output, it gives the
number of the group to which the author belongs.</p>
      <p>In general, in this section, the text data is passed through these functions in turn to
obtain tokenized and cleaned text, which can then be immediately vectorized and sent
to the model.</p>
      <p>In the next section, the model itself, or rather two models, is trained to predict age and
gender. However, the data from the previous section is still in text form, which the
model cannot process, so the text is vectorized first. Vectorization converts the text
into a set of numerical vectors using TF-IDF, a weighting scheme that scores
words by their importance, which is essential for our research.</p>
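      <p>To make the weighting concrete, the sketch below computes TF-IDF by hand for three toy
documents. It assumes the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, where N
is the number of documents and df is the number of documents containing the term; scikit-learn's
TfidfVectorizer uses this formula by default and additionally normalizes each vector, which is
omitted here. The function name tfidf and the toy documents are made up for the example.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


docs = ["school homework school", "work meeting", "school work"]
weights = tfidf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, {term: round(w, 3) for term, w in doc_weights.items()})
```
]]></preformat>
      <p>In the second document, "meeting" receives a higher weight than "work" because it occurs
in fewer documents; a term that is rare across the corpus but present in a text is treated as
more informative about that text.</p>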
      <p>The vectorized data is passed to the RandomForestClassifier model to determine
gender or age in categorical form; if the age needs to be predicted in numerical form,
the RandomForestRegressor model is used. These models were chosen because they
allow us to extract feature importances for further analysis.</p>
      <p>Also, in this section, the characteristics that have the greatest influence on the
classification result are highlighted.</p>
      <p>In the next section, the data obtained during the research is analysed. The section
consists entirely of graphs of various types and for various purposes.</p>
      <p>A separate section has also been developed for this project so that users have the
opportunity to more conveniently interact with the model and enter their sentences for
age and gender verification. For the sake of the experiment, 5 different sentences were
given: 3 from women and 2 from men. The following results were obtained:</p>
      <p>Four of the five proposed cases are correctly identified. One of the women's
messages is identified as a message from a man.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Discussion</title>
      <p>The purpose of the research was to create a model for determining an author's age
and gender based on their texts. In this work, the model is built and trained on a part of
the dataset with blogs. The following program execution results were achieved:</p>
      <p>The age determination model demonstrates an accuracy of 64.3% if age is
specified categorically, and 25% if age is predicted in numerical format. The gender
determination model has an accuracy of 64%. These results are not as high as
we would like, but they allow us to extract certain features from the text that help
to determine the biological data of the author. The results could probably be
improved by taking a larger dataset and longer messages.</p>
      <p>Features from the model were selected and sorted, and a rating of the most
influential features for the classification of biological data was obtained. Adding
the subject of writing to the overall text of the message greatly helps the
classification, making it easier for models to predict both age and gender. For age
classification, topics about the place of study or work, and social status (Student,
industry, and others) are the most helpful, for gender it is the field of interests
(technology, finance).</p>
      <p>From the confusion matrix, it can be noted that men's messages are classified
as women's more often than vice versa. This may indicate that men are more
likely to use a feminine way of communicating, or that women are more likely to
indicate their gender as the opposite for various reasons.</p>
      <p>When analysing the dataset of books, where the gender of the author had to be
determined, an additional model was created and trained only on
this dataset. The book dataset was also passed through the previous (blog) model, and the
blog data through the new model. The results were compared: the model trained
on blog data performs better on unfamiliar data than the model trained on the new data.</p>
      <p>One of the authors wrote a message herself as an example; it is
correctly classified as female, age 20 to 30. The predicted numerical
age slightly overestimates the real one: the model estimates that the author is
27, while the author is a little over 20 years old. It can be said that the model is more
likely to determine the psychological age of a person than the biological one.</p>
      <p>Statistics of the model before training:</p>
      <p>1. Statistical analysis of data begins even before the dataset is sent to the model
for making predictions. When the dataset is loaded, two graphs of the
data distribution are constructed. The amount of data by gender is
almost the same, but the ages 13-17 and 23-29 are significantly more
prevalent, and some age ranges between these groups are absent from the dataset
altogether. This can have a rather negative effect on the accuracy of the model. The
dataset is the Blog Authorship Corpus (kaggle.com) [73].</p>
      <p>2. During the training of the model, a classification report
is built to determine its accuracy; it includes such metrics as accuracy, precision,
recall, f1-score and their macro average.</p>
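      <p>For readers unfamiliar with these metrics, the following sketch shows how they are
derived from true and predicted labels. It is a simplified stand-in for the report produced by
scikit-learn, written with the standard library only; the function name classification_metrics
and the toy labels are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy plus per-class precision, recall and F1, and their macro average."""
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro average: unweighted mean over classes, so small classes count equally.
    report["macro avg"] = {
        key: sum(report[label][key] for label in labels) / len(labels)
        for key in ("precision", "recall", "f1")
    }
    report["accuracy"] = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return report


# Toy age categories: 0 = under 20, 1 = 20-30, 2 = over 30.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
report = classification_metrics(y_true, y_pred)
for name, values in report.items():
    print(name, values)
```
]]></preformat>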
      <p>Gender determination has more stable metrics; it can be assumed that models cope
with gender determination more easily than with age determination.</p>
      <p>3. Analysis of the features that predict gender or age is perhaps the most important
step in this research. The results were generated and displayed in the form of a ranked
graph of the 15 most important features for classification.</p>
      <p>The first three places belong to the topic names that users specify when writing a
blog. This is to be expected, because the subject of the text can tell no less about the
user than the text itself. The following positions belong to slang forms, which
are most likely used more often by young people. Next come places of
work or study and some other words, whose meaning can be analysed by
analogy.</p>
      <p>For gender, the first to appear are again the topics of messages, which indicate the
areas of interest of individuals. More technical interests point to the male gender, while
more creative interests point to the female gender. There are also quite a
lot of words for feelings, the expression of which is more characteristic of women.</p>
      <p>4. Confusion matrices can also be a great source of information about a program
and its results.</p>
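      <p>As a reminder of how such a matrix is read, the sketch below builds one by hand for toy
gender labels, with rows for true labels and columns for predicted labels. The function name
confusion_matrix_sketch and the toy labels are made up for the example; they are not results of
this study.</p>
      <preformat><![CDATA[
```python
def confusion_matrix_sketch(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


# Toy labels: 0 = female, 1 = male.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1]
cm = confusion_matrix_sketch(y_true, y_pred, labels=[0, 1])
for name, row in zip(["female", "male"], cm):
    print(name, row)
```
]]></preformat>
      <p>The off-diagonal cells hold the errors: in this toy example the cell in the "male" row
and "female" column is larger than the opposite one, which is the kind of asymmetry discussed
for the gender model.</p>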
      <p>In many previous runs of the program, the number of men
classified as women is relatively higher than the other way around. From this, we can
draw an interesting conclusion: men adopt a female manner of
communication more often than women adopt a male one.</p>
      <p>It can be seen from the matrix that the central group quite strongly dominates the
others, both in the model's classifications and in the dataset. This imbalance may greatly
distort the model's results; it can be addressed by balancing the dataset by age.</p>
      <p>5. The ROC curve plots the true positive rate against the false positive rate at
different decision thresholds and shows how accurately the model separates the classes.</p>
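      <p>The construction of the curve can be sketched for the binary case as follows. This is a
minimal illustration with made-up scores, not the plotting code of this study; the function names
roc_points and auc are assumptions. For the three age categories, one such curve would be drawn
per class in a one-vs-rest fashion.</p>
      <preformat><![CDATA[
```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step over the distinct predicted scores.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy data: 1 marks the positive class, scores are predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
print(curve)
print(round(auc(curve), 3))
```
]]></preformat>
      <p>A curve that rises quickly towards the top-left corner, and hence an area close to 1,
means that the positive class is ranked above the negative class for most thresholds.</p>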
      <p>It can be seen that the model predicts the age group &lt;20 best, although it is not the
most common. Most likely, many signs help to quickly identify this age, such as "school"
or "student".</p>
      <p>6. Comparing predicted and actual ages is one of the best ways to visually evaluate
model predictions.</p>
      <p>As you can see, the model has difficulties with predicting the third category
(&gt;30). Most likely, this is because this category is very small compared to the
others, so the model has not learned to predict it properly. Correcting the dataset
could fix this. A similar graph can also be examined for the numerical
prediction of age.</p>
      <p>Interestingly, the model rarely predicts ages below 17, even though there
are many younger users in the dataset. It predicts a large part of the data as 17-23,
although this range is almost empty in the dataset. Also, as with the
categories, the model practically avoids ages over 30. It seems that better-balanced data
would solve some of these problems.</p>
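      <p>One simple way to balance the data is to downsample every age group to the size of the
smallest one. The sketch below illustrates the idea on toy records; the function name
downsample_balanced, the fixed seed, and the choice of downsampling (rather than, for example,
class weighting) are assumptions, and this step was not part of the reported experiments.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def downsample_balanced(samples, key, seed=42):
    """Randomly keep the same number of samples from every group defined by key(sample)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    smallest = min(len(items) for items in groups.values())
    rng = random.Random(seed)
    balanced = []
    for items in groups.values():
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy records: (text, age category); category 1 dominates.
records = [("a", 0), ("b", 0), ("c", 1), ("d", 1), ("e", 1), ("f", 1), ("g", 2)]
balanced = downsample_balanced(records, key=lambda record: record[1])
print(balanced)
```
]]></preformat>
      <p>The price of this approach is that data from the larger groups is discarded, so it is
most useful when the smallest group is still reasonably large.</p>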
      <p>7. A separate gender classification model has been developed for the dataset of
book authors, which perfectly classifies the gender of authors from this dataset.</p>
      <p>Dataset with book authors: Spooky Author Identification [74].</p>
      <p>However, when the book data was run through the model previously trained on the old
(blog) data, that model performed better than the new model did on the old data. So the old
model, although it has worse accuracy, handles the new data much better than the new model
handles the old data.</p>
      <p>8. As a final test, and to let users interact with the program more conveniently,
the ability to determine the gender and age of the author of a user-specified text
has been developed. One of the authors of this work wrote a sentence,
and a fairly accurate classification was obtained.</p>
      <p>Several experiments were also conducted to verify the model. Three friends of one of
the authors of the study took part in the experiment. They wrote common conversational
sentences, and this author also wrote a few. These sentences were placed in the
following order: author (female, 21), author (female, 21), male 22, female 20, male 21.
The obtained results are arranged in the same order.</p>
      <p>All five sentences are classified correctly by age, and four out of five are
classified correctly by gender: one woman was classified as a man. This
is a pretty good result for a text-based age and gender prediction model.</p>
      <p>A few more experiments were conducted with sentences formulated using the features
identified by the model. Adding the words "school" or "student" reduces the
predicted age of the author, and adding words related to technology changes the predicted
gender of the author to male. This means that the sentence submitted to the model should not
be written to deceive it; it should be sincere and casual.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions</title>
      <p>In the course of this project, a model was developed that determines the
author's age and gender based on their text. Before starting work, similar studies
were reviewed to find out what has already been researched and tested
and what is still worth investigating. These studies also gave many
hints about which implementation methods and tools work better for this task.</p>
      <p>The work on the project was carefully planned using process diagrams and data flows.
The best methods and tools for implementing the project were studied, and
simple Random Forest classification and regression models were chosen. They
cope with the task quite well, are much less
resource-intensive than large language models, and are very easy
to use and configure.</p>
      <p>Two datasets were selected: one of blogs and one of books. The blog dataset was
used the most because it contains both the age and the gender of each blog author.</p>
      <p>Before use, the data were analysed and cleaned, then transformed into embeddings
and passed to model training. The results of the model were studied and analysed in
detail. Many useful features that drive the classification of the author's age or gender
were extracted from the texts, and many interesting regularities were observed while
analysing the results. In addition, a test case was implemented that allows the user to
interact easily with the model.</p>
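      <p>The cleaning step that precedes embedding can be sketched as follows. The concrete
rules (lowercasing, removing URLs and punctuation, dropping very short texts) and the
<monospace>min_words</monospace> threshold are assumptions made for illustration; the embedding and
training steps are out of scope for this sketch.</p>
      <preformat>
```python
import re
from typing import List

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_TEXT_PATTERN = re.compile(r"[^a-z0-9'\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = NON_TEXT_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def clean_corpus(texts: List[str], min_words: int = 3) -> List[str]:
    """Clean every text and keep only those with at least min_words words."""
    cleaned = [clean_text(t) for t in texts]
    return [t for t in cleaned if len(t.split()) >= min_words]


if __name__ == "__main__":
    print(clean_corpus(["Hi!", "I'm going to school today."]))
```
</preformat>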
      <p>Such research is useful in many areas of life as well as for the development of
science. It can help capture relationships between seemingly unrelated features, such
as the reflection of an author's gender and age in the author's texts. We believe that the
experiment could be repeated or modified on a more powerful computer in order to
analyse much larger volumes of data, which could significantly improve the results of the
model. It is worth noting, however, that, as stated in another study, such predictions
reflect a person's psychological age more than biological age, because the manner of
speaking reflects psychological age. In the future, other datasets could be used, if
available, or the current dataset could be rebalanced and the experiment repeated. Such
research can bring many benefits to society if it is used properly.</p>
      <p>[1] O. Tverdokhlib, V. Vysotska, P. Pukach, M. Vovk, Information technology for
identifying hate speech in online communication based on machine learning,
Lecture Notes on Data Engineering and Communications Technologies 195 (2024)
339–369.
[2] N. Borysova, K. Melnyk, N. Babkova, Z. Kochuieva, V. Melnyk, Gender
Classification of Surnames: Ukrainian aspect, CEUR Workshop Proceedings 3171
(2022) 354-364.
[3] L. Stasiuk, Gender Marked Intimate Conversational Interaction of Spouses in
Modern English, CEUR Workshop Proceedings 2870 (2021) 731-742.</p>
      <p>[4] A. Hadzalo, Analysis of Gender-Marked Units: Statistical Approach, CEUR
Workshop Proceedings 2604 (2020) 462-471.
[5] Y. Butelskyy, Statistical Methods to Detect Gender Peculiarities of Communication
in Vkontakte Social Network Groups, in Proceedings of the 11th International
Scientific and Technical Conference on Computer Sciences and Information
Technologies, CSIT, 2016, pp. 132-135. doi: 10.1109/STC-CSIT.2016.7589888.
[6] I. Afanasieva, N. Golian, V. Golian, A. Khovrat, K. Onyshchenko, Application of
Neural Networks to Identify of Fake News, CEUR Workshop Proceedings 3396
(2023) 346-358.
[7] A. Shupta, O. Barmak, A. Wierzbicki, T. Skrypnyk, An Adaptive Approach to
Detecting Fake News Based on Generalized Text Features, CEUR Workshop
Proceedings 3387 (2023) 300-310.
[8] V.-A. Oliinyk, V. Vysotska, Y. Burov, K. Mykich, V. Basto-Fernandes, Propaganda
Detection in Text Data Based on NLP and Machine Learning, CEUR Workshop
Proceedings 2631 (2020) 132-144.
[9] R. A. Dar, R. Hashmy, A Survey on COVID-19 related Fake News Detection
using Machine Learning Models, CEUR Workshop Proceedings 3426 (2023) 36-46.
[10] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP
Tool for Extracting Relevant Information from Criminal Reports or
Fakes/Propaganda Content, in Proceedings of IEEE 17th International Conference
on Computer Sciences and Information Technologies (CSIT), 2022, pp. 93-98, doi:
10.1109/CSIT56902.2022.10000563.
[11] A. Mykytiuk, V. Vysotska, O. Markiv, L. Chyrun, Y. Pelekh, Technology of Fake
News Recognition Based on Machine Learning Methods, CEUR Workshop
Proceedings 3387 (2023) 311-330.
[12] T. Batiuk, V. Vysotska, V. Lytvyn, Intelligent system for socialization by personal
interests on the basis of SEO technologies and methods of machine learning, CEUR
Workshop Proceedings 2604 (2020) 1237-1250.
[13] D. Uhryn, O. Naum, N. Antonyuk, I. Dyyak, L. Chyrun, A. Demchuk, V. Vysotska, Z.
Rybchak, T. Batiuk, Tourist Itineraries Plan Design Based on the Behavior of Bee
Colonies, CEUR Workshop Proceedings 2631 (2020) 516-539.</p>
      <p>[14] T. Batiuk, V. Vysotska, R. Holoshchuk, S. Holoshchuk, Intelligent System for
Socialization of Individual’s with Shared Interests based on NLP, Machine Learning
and SEO Technologies, CEUR Workshop Proceedings 3171 (2022) 572-631.
[15] D. Dosyn, T. Batiuk, A Realization of Visual Biometric Validation to Enhance
Guarded and Efficient Authorization for Intellectual Systems, CEUR Workshop
Proceedings 3668 (2024) 247-268.
[16] T. Batiuk, L. Chyrun, O. Oborska, Ontology Model and Ontological Graph for
Development of Decision Support System of Personal Socialization by Common
Relevant Interests, CEUR Workshop Proceedings 3171 (2022) 877-903.
[17] R. Bekesh, L. Chyrun, P. Kravets, A. Demchuk, Y. Matseliukh, T. Batiuk, I.
Peleshchak, R. Bigun, I. Maiba, Structural modeling of technical text analysis and
synthesis processes, CEUR Workshop Proceedings 2604 (2020) 562–589.</p>
      <p>[18] A. Yarovyi, D. Kudriavtsev, Method of Multi-Purpose Text Analysis Based on a
Combination of Knowledge Bases for Intelligent Chatbot, CEUR Workshop
Proceedings 2870 (2021) 1238-1248.
[19] V. Vasyliuk, Y. Shyika, T. Shestakevych, Information System of Psycholinguistic
Text Analysis, CEUR Workshop Proceedings 2604 (2020) 178-188.</p>
      <p>[20] O. Artemenko, V. Pasichnyk, N. Kunanets, K. Shunevych, Using sentiment text
analysis of user reviews in social media for e-tourism mobile recommender systems,
CEUR Workshop Proceedings 2604 (2020) 259-271.
[21] I. Gruzdo, I. Kyrychenko, G. Tereshchenko, O. Cherednichenko, Application of
Paragraphs Vectors Model for Semantic Text Analysis, CEUR Workshop
Proceedings 2604 (2020) 283-293.
[22] N.B. Shakhovska, R.Yu. Noha, Methods and tools for text analysis of publications
to study the functioning of scientific schools, Journal of Automation and Information
Sciences 47(12) (2015) 29-43.
[23] V. Vysotska, V.B. Fernandes, V. Lytvyn, M. Emmerich, M. Hrendus, Method for
Determining Linguometric Coefficient Dynamics of Ukrainian Text Content
Authorship, Advances in Intelligent Systems and Computing 871 (2019) 132-151.
doi: 10.1007/978-3-030-01069-0_10.
[24] V. Vysotska, Y. Burov, V. Lytvyn, A. Demchuk, Defining Author's Style for Plagiarism
Detection in Academic Environment, in: Proceedings of the International Conference
on Data Stream Mining and Processing, DSMP, 2018, pp. 128-133. doi:
10.1109/DSMP.2018.8478574.
[25] V. Vysotska, O. Kanishcheva, Y. Hlavcheva, Authorship Identification of the
Scientific Text in Ukrainian with Using the Lingvometry Methods, in: Proceedings of
the International Conference on Computer Sciences and Information Technologies,
CSIT, 2018, pp. 34-38. doi: 10.1109/STC-CSIT.2018.8526735.
[26] V. Lytvyn, V. Vysotska, Y. Burov, I. Bobyk, O. Ohirko, The linguometric approach for
co-authoring author's style definition, in: International Symposium on Wireless
Systems within the International Conferences on Intelligent Data Acquisition and
Advanced Computing Systems, IDAACS-SWS, 2018, pp. 29-34. doi:
10.1109/IDAACS-SWS.2018.8525741.
[27] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk,
O. Tereshchuk, M. Komar, Development of the quantitative method for automated
text content authorship attribution based on the statistical analysis of N-grams
distribution, Eastern-European Journal of Enterprise Technologies 6(2-102) (2019)
28-51. doi: 10.15587/1729-4061.2019.186834.
[28] V. Vysotska, O. Markiv, S. Teslia, Y. Romanova, I. Pihulechko, Correlation Analysis
of Text Author Identification Results Based on N-Grams Frequency Distribution in
Ukrainian Scientific and Technical Articles, CEUR Workshop Proceedings 3171
(2022) 277-314.
[29] V. Motyka, Y. Stepaniak, M. Nasalska, V. Vysotska, Lexical Diversity Parameters
Analysis for Author's Styles in Scientific and Technical Publications, CEUR
Workshop Proceedings 3403 (2023) 595–617.
[30] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O. Brodyak,
Intellectual Analysis System Project for Ukrainian-language Artistic Works to
[45] V. Vysotska, Computer Linguistic Systems Design and Development Features for
Ukrainian Language Content Processing, CEUR Workshop Proceedings 3688
(2024) 229–271. URL: https://ceur-ws.org/Vol-3688/paper18.pdf.
[46] S. Albota, Creating a Model of War and Pandemic Apprehension: Textual Semantic
Analysis, CEUR Workshop Proceedings 3396 (2023) 228-243.</p>
      <p>[47] N. Khairova, Y. Holyk, D. Sytnikov, Y. Mishcheriakov, N. Shanidze, Topic Modelling
of Ukraine War-Related News Using Latent Dirichlet Allocation with Collapsed Gibbs
Sampling, CEUR Workshop Proceedings 3688 (2024) 1-15.
[48] S. Mainych, A. Bulhakova, V. Vysotska, Cluster Analysis of Discussions Change
Dynamics on Twitter about War in Ukraine, CEUR Workshop Proceedings 3396
(2023) 490-530.
[49] R. Nazarchuk, S. Albota, Tweets about Ukraine during the russian-Ukrainian War:
Quantitative Characteristics and Sentiment Analysis, CEUR Workshop Proceedings
3426 (2023) 551-560.
[50] N. Khairova, A. Kolesnyk, O. Mamyrbayev, K. Mukhsina, The Aligned
Kazakh-Russian Parallel Corpus Focused on the Criminal Theme, CEUR Workshop
Proceedings 2362 (2019) 116-125.
[51] S. Voloshyn, V. Vysotska, O. Markiv, I. Dyyak, I. Budz, V. Schuchmann, Sentiment
Analysis Technology of English Newspapers Quotes Based on Neural Network as
Public Opinion Influences Identification Tool, in Proceedings of 2022 IEEE 17th
International Conference on Computer Sciences and Information Technologies
(CSIT), 2022, pp. 83-88, doi: 10.1109/CSIT56902.2022.10000627.
[52] N. Khairova, A. Shapovalova, O. Mamyrbayev, N. Sharonova, K. Mukhsina, Using
BERT model to Identify Sentences Paraphrase in the News Corpus, CEUR
Workshop Proceedings 3171 (2022) 38-48.
[53] N. Bondarchuk, I. Bekhta, O. Melnychuk, O. Matviienkiv, Keyword-based Study of
Thematic Vocabulary in British Weather News, CEUR Workshop Proceedings 3171
(2022) 451-460.
[54] S. Voloshyn, O. Markiv, V. Vysotska, I. Dyyak, L. Chyrun, V. Panasyuk, Emotion
Recognition System Project of English Newspapers to Regional E-Business
Adaptation, Proceedings of IEEE 17th International Conference on Computer
Sciences and Information Technologies (CSIT), 2022, pp. 392-397, doi:
10.1109/CSIT56902.2022.10000527.
[55] N. Antonyuk, L. Chyrun, V. Andrunyk, A. Vasevych, S. Chyrun, A. Gozhyj, I. Kalinina,
Y. Borzov, Medical news aggregation and ranking of taking into account the user
needs, CEUR Workshop Proceedings 2488 (2019) 369–382.
[56] V. Andrunyk, A. Vasevych, L. Chyrun, N. Chernovol, N. Antonyuk, A. Gozhyj, V.
Gozhyj, I. Kalinina, M. Korobchynskyi, Development of information system for
aggregation and ranking of news taking into account the user needs, CEUR
Workshop Proceedings 2604 (2020) 1127–1171.</p>
      <p>[57] V. Vysotska, S. Voloshyn, O. Markiv, O. Brodyak, N. Sokulska, V. Panasyuk, Tone
Analysis of Regional Articles in English-Language Newspapers Based on Recurrent
Neural Network Bi-LSTM, in Proceedings of the 5th International Conference on
Advanced Information and Communication Technologies (AICT), 2023, pp.
158–163.
[58] S. Albota, Linguistic and Psychological Features of the Reddit News Post, in
Proceedings of the IEEE 15th International Scientific and Technical Conference on
Computer Sciences and Information Technologies, CSIT, 2020, 1, pp. 295–299.
[59] N. Shakhovska, M. Medykovskyj, L. Bychkovska, Building a smart news annotation
system for further evaluation of news validity and reliability of their sources, Przeglad
Elektrotechniczny 91(7) (2015) 43-44.
[60] V. Vysotska, R. Holoshchuk, S. Goloshchuk, O. Voloshynskyi, M. Shevchenko, V.
Panasyuk, Predicting the Effects of News on the Financial Market Based on Machine
Learning Technology, in Proceedings of the 5th International Conference on
Advanced Information and Communication Technologies (AICT), 2023, pp.
152–157.</p>
      <p>[61] R. Chew, C. Kery, L. Baum, T. Bukowski, A. Kim, M. Navarro, Predicting age
groups of Reddit users based on posting behavior and metadata: classification model
development and validation, JMIR Public Health and Surveillance 7(3) (2021) e25807.</p>
      <p>[62] Z. Miller, B. Dickinson, W. Hu, Gender Prediction on Twitter Using Stream
Algorithms with N-Gram Character Features, International Journal of Intelligence
Science 2 (4A) (2012) 24184, doi:10.4236/ijis.2012.224019.
[63] S. Rosenthal, K. McKeown, Age prediction in blogs: A study of style, content, and
online behavior in pre-and post-social media generations, in: Proceedings of the
49th annual meeting of the association for computational linguistics: human
language technologies. 2011, pp. 763-772.
[64] D. Nguyen, N. A. Smith, C. P. Rosé, Author age prediction from text using linear
regression, in: Proceedings of the 5th ACL Workshop on Language Technology for
Cultural Heritage, Social Sciences, and Humanities, LaTeCH@ ACL 2011, 24 June,
2011, Portland, Oregon, USA. Association for Computational Linguistics, 2011, pp.
115-123.
[65] D. Nguyen, D. Trieschnigg, A. S. Doğruöz, R. Gravel, M. Theune, T. Meder, F. de
Jong, Why gender and age prediction from tweets is hard: Lessons from a
crowdsourcing experiment, in: Proceedings of the Technical Papers 25th
International Conference on Computational Linguistics, August 23-29, 2014, Dublin,
Ireland. Association for Computational Linguistics, 2014, pp. 1950-1961.
[66] I. Khomytska, V. Teslyuk, I. Bazylevych, I. Shylinska, Approach for minimization of
phoneme groups in authorship attribution, International Journal of Computing 19(1)
(2020) 55-62.
[67] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of Methods,
Models and Means for the Author Attribution of a Text, Eastern-European Journal of
Enterprise Technologies 3/2 (93) (2018) 41–46.
[68] I. Khomytska, V. Teslyuk, Authorship Attribution by Differentiation of Phonostatistical
Structures of Styles, in: Proceedings of the XIIIth Scientific and Technical
Conference on Computer Sciences and Information Technologies, CSIT, Lviv, 2018,
pp. 5–8.
[69] I. Khomytska, V. Teslyuk, The Software for Authorship and Style Attribution, in:
Proceedings of the 15th International Conference on CADMS, Polyana, 2019, pp.
23–26.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>