Information technology for textual content author's gender and age determination based on machine learning

Victoria Vysotska1,†, Lyubomyr Chyrun2,†, Sofia Chyrun1,† and Mariia Soltys1,∗,†

1 Lviv Polytechnic National University, Stepan Bandera 12, 79013 Lviv, Ukraine
2 Ivan Franko National University of Lviv, University 1, 79000 Lviv, Ukraine

Abstract
This project developed a model that determines the age and gender of an author from the author's text. Before starting work, studies on similar topics were reviewed to establish what has already been researched and tested and what is still worth investigating. These studies also gave many clues about which implementation methods and tools suit this task best. The project work was planned using process and data flow diagrams. Candidate methods and tools were compared, and simple Random Forest classification and regression models were selected. They were chosen because they cope with the task quite well, are much less resource-intensive than large language models, and are easy to use and configure. Two datasets were selected: a dataset of blogs and a dataset of books. The blog dataset was used the most because it contains both the age and the gender of each blog author. The prediction accuracy of the "book" model is 0.8, and that of the blog model is 0.6. Before use, the data was analysed and cleaned, then transformed into embeddings and passed to model training. The results of the model were studied and analysed in detail. Many useful features that drive the classification of the author's age or gender were extracted from the texts. In addition, many interesting regularities were observed while analysing the results.
Additionally, a test case was implemented that allows the user to interact with the model easily.

Keywords
machine learning, text analysis, dataset, author, age, gender, NLP, cybersecurity, context, content

MoDaST-2024: 6th International Workshop on Modern Data Science Technologies, May, 31 - June, 1, 2024, Lviv-Shatsk, Ukraine
∗ Corresponding author.
† These authors contributed equally.
victoria.a.vysotska@lpnu.ua (V. Vysotska); Lyubomyr.Chyrun@lnu.edu.ua (L. Chyrun); sofiia.chyrun.sa.2022@lpnu.ua (S. Chyrun); mariia.soltys.sa.2020@lpnu.ua (M. Soltys)
0000-0001-6417-3689 (V. Vysotska); 0000-0002-9448-1751 (L. Chyrun); 0000-0002-2829-0164 (S. Chyrun); 0000-0002-5378-4350 (M. Soltys)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction
Determining the gender and age of the author of a text is a difficult task, especially on the Internet, where information is often published anonymously or under pseudonyms [1-5]. The issue is relevant both for distributing advertising to a target audience, for example in social networks, and for determining additional parameters of the author of an anonymous text, especially if the text is fake, propaganda or disinformation [6-11]. Although there are machine learning models that determine gender and age from photos or videos, for example those posted on social networks, these approaches are limited, since real visual information about the author is not always available [12-16]. For this reason, researchers turn to text analysis to determine such parameters, which opens up new opportunities [17-22]. Determining the gender and age of the author from a text depends on various factors, including the author's writing style, imagery, lexical features, and the words and phrases used [23-39].
One of the approaches is the application of machine learning methods to textual data [40-45]. For example, models based on neural networks can analyse syntactic and semantic features of the text to determine the gender and age of the author [46-54]. Research in this direction is already underway, and its results indicate the potential of these approaches [55-60]. On the other hand, determining gender and age from a text can be a more difficult task because writing styles, context and other factors vary over time [61-69]. It therefore remains an active area of research in natural language processing. Finally, new approaches to the analysis of textual information may in the future help to determine the age and gender of an author from texts on the Internet as an additional parameter for identifying the potential author of a set of generated fakes, propaganda or disinformation.

Determining the age and gender of the author from a text written by that author is a very relevant problem today. Such a model could be useful in various areas, for example:
• cyber security and law enforcement, to detect and identify persons who plan or commit crimes online. It can help in detecting Internet fraudsters and online criminals, and in investigating cyber security threats;
• historical research, to determine the authorship of texts or to date a writer's works, which can be important for identifying authors or analysing the development of language and styles in different historical periods;
• secondary and higher education, to prevent plagiarism and ensure academic integrity.
A model for determining the gender and age of the author from the text can help establish whether works written by students or schoolchildren are authentic;
• marketing and analysis of social networks, where the model can be useful for determining the target audience, creating personalized offers and analysing user behaviour;
• psychological and sociological research, to understand the peculiarities of language style and the psychosocial characteristics of different population groups.

It is also worth noting that, in the conditions of war, such a model would be useful for Ukraine to identify collaborators, trolls, propagandists or criminals from their texts on the Internet and in mass media, including social networks.

The purpose of the research is to develop an information technology that analyses text features to determine the gender and age of the author based on machine learning. The object of the research is the process of identifying the linguistic features of text content that indicate the gender and age of the author. The subject of the research is the methods and means of determining the gender and age of authors of texts. For the first time, the paper considers the determination of both characteristics at the same time, which was not investigated in other works [61-65]. In addition, this work explores age and gender characteristics reflected in texts, as opposed to identifying these characteristics through images and videos.

2. Related works
In this research area, namely determining the gender and age of the author of a text, the need to rely on previous research is especially critical. This is due to several factors. Firstly, there are practically no works on this topic in the Ukrainian context; the existing studies were mainly carried out by English-speaking researchers.
Secondly, the availability of Ukrainian-language datasets that pair texts with data on their authors (age and gender) is very limited, if such datasets exist at all, so a study of the Ukrainian language that does not involve creating a completely new dataset is practically unattainable. These initial problems with finding relevant data hinder research in the Ukrainian context and make an objective analysis difficult. We therefore focus on English-language studies and datasets to ensure an adequate amount of data for analysis and project development. This situation highlights the importance of taking into account the results of studies carried out in the English-speaking context and of meeting global standards in text analysis for determining the age and gender of the author.

Social media is important for monitoring the perception of public health issues and for educating target audiences about health. However, limited information on the demographics of social media users makes it difficult to identify conversations among target audiences and limits the effectiveness of social media for public health surveillance and educational interventions [66-75]. Only certain social media platforms provide demographic information about the followers of a user's account, and even where such information exists, it is not always disclosed. Therefore, researchers have developed machine learning algorithms to predict the demographic characteristics of social network users, mainly for Twitter [61]. To date, limited research has been conducted on predicting the demographic characteristics of Reddit users [61]. The study [61] used both data and metadata about Reddit users, that is, not only their posts but also the communities in which they post, comment or simply subscribe.
The researchers manually labelled users' data using the SMART app, looking for confirmation of users' age in comments or posts where they stated it themselves. Data volumes were such that each age category (youth (13-17 years), young adults (18-20 years), and adults (21-54 years)) had a minimum of 625 records. After the data was labelled by age, metadata was collected for each user via the Reddit API. Metadata included user-level information (e.g., year of account creation), submission-level data (e.g., post popularity), and comment-level data (e.g., commenting frequency). The study focused on specific metadata that could potentially help distinguish between adolescent and adult age groups. The research identified 1,523 variables that could potentially indicate the age of Reddit users:
• Summary statistics: average score of posts, etc.
• Subreddit frequency: frequency of posts in specific subreddits related to age groups.
• Emoji usage: frequency of emoji usage in comments.
• Post patterns: percentage of posts that were videos, images, etc.
• Use of terms: TF-IDF scores for specific terms (e.g. "school") used in comments.

The dataset was divided into training and test sets (80/20), after which various models that could potentially perform well on this task (logistic regression, random forest, k-nearest neighbours, gradient boosted trees) were trained and evaluated using metrics such as AUROC, precision, recall and F1 score. The best result was shown by the gradient boosted trees model (F1 score: 0.77, AUROC: 0.84). Finally, the study analysed which features have the greatest influence on determining the age of users. This study is important because it helps to better understand what to rely on when determining the age and gender of the authors of texts, and which features are the most important and influential.
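The precision, recall and F1 metrics named above can be made concrete with a small worked example. The following is a minimal sketch in plain Python, not the code of the study [61]: the labels and predictions are invented toy data, the function name `precision_recall_f1` is hypothetical, and the binary setting is a simplifying assumption (the study distinguished three age groups, for which such scores would be computed per class).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (invented): 1 = "adolescent", 0 = "adult".
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

# Here tp = 3, fp = 1, fn = 1, so all three scores equal 0.75.
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which makes it a common summary metric when age groups are unevenly represented.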
The rapid growth of social networks has generated an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining [62]. The main purpose of authorship analysis, an important part of text analysis, is to learn as much as possible about the author of a text through the subtle variations in writing style that exist across genders, ages and social groups. Such information has a variety of uses, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which provides free access to most user data through its Data Access API. In the study [62], the authors sought to determine the gender of Twitter users using Perceptron and Naïve Bayes classifiers with selected 1- to 5-gram features from the tweet text. Streaming implementations of these algorithms were used for gender prediction in order to cope with the speed and volume of tweet traffic. Since informal text such as tweets cannot easily be evaluated using traditional dictionary methods, the study [62] used n-gram features to represent streaming tweets. Because the number of 1- to 5-grams is very large, only a subset of them can be used in gender classification; for this reason, the informative n-gram features are chosen using several selection algorithms. In the best case, the Naïve Bayes and Perceptron algorithms showed accuracy, balanced accuracy and F-measure above 99%. The study [62] is based on the analysis of messages and posts on Twitter, and its main goal is to extract features that reveal personal information about the author of a tweet. A peculiarity of this study is that Twitter users write informally, and the paper is devoted specifically to analysing informal language for important identifying features.
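To illustrate what 1- to 5-gram features look like, the following minimal sketch extracts them from short texts with the Python standard library. It is not the implementation from [62]: word-level n-grams are an assumption (the description above does not say whether words or characters were used), the example tweets are invented, and keeping the most frequent n-grams is only a simple stand-in for the feature selection algorithms used in that study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, max_n=5):
    """Count every 1- to max_n-gram in one text (word level, an assumption)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def select_frequent(corpus_counts, k):
    """Keep the k most frequent n-grams (stand-in for real feature selection)."""
    return [gram for gram, _ in corpus_counts.most_common(k)]


# Invented example tweets.
tweets = ["i luv this song", "i luv pizza"]

corpus_counts = Counter()
for tweet in tweets:
    corpus_counts.update(extract_features(tweet))

selected = select_frequent(corpus_counts, k=3)
print(len(corpus_counts), "distinct n-grams; selected:", selected)
```

Even two very short texts produce 13 distinct n-grams here, which shows why the feature space grows quickly and why only a selected subset is used for classification.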
This approach has its difficulties. First of all, Twitter limited messages to 140 characters, which is a problem for traditional text analysis, as such analysis usually relies on large segments of text. Secondly, since the language is informal, users very often use acronyms, text emoticons and deliberately distorted spellings, which also make analysis more difficult. Before conducting the study, the data was carefully filtered and manually labelled using the API. Six different feature selection mechanisms were used to identify informative features and determine which ones would best help accomplish the task. This process aims to extract the most informative n-grams from tweets to improve gender prediction accuracy. For classification, a simple probabilistic classifier, namely Naïve Bayes, which is based on Bayes' theorem, is used. The importance of the study [62] is that it clearly highlights the difficulties in analysing conversational language and informal writing. Like the previous one, this study also highlights the importance of choosing the right features to improve the accuracy of the model's predictions and, accordingly, of the author's gender classification.

In [63], it was investigated whether wording, stylistic choices and online behaviour can be used to predict the age category of blog authors. The authors hypothesize that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. By experimenting with different years, the authors found that the birth years of people who were college students when social media technologies such as AIM, SMS texting, MySpace and Facebook became popular gave accurate age predictions. The authors also determined that the characteristics of Internet writing are important for predicting age, but lexical content is also necessary to obtain significantly more accurate results. Their best results reach an accuracy of 81.57%.
The basis of this study [63] is the determination of the age of blog authors from stylistic choices and online behaviour. The model's strength is determining the approximate age of a person, namely whether they were born before the era of social networks or during it. The blogs were collected from the LiveJournal platform, specifically those blogs where the age of the author is indicated. All the posts are from American bloggers. Several features were identified that help determine the author's age, including specific words, stylistic features such as slang or text emoticons, and online behaviour such as frequency of posting and number of friends. A binary classification model based on year of birth was used, slightly modified to address changes in blogging styles driven by popular social media technologies. The study [63] found that two age groups (born in 1977-1979 and born in 1982-1984) differed greatly in blogging style. Both stylistic and content features strongly influenced the prediction of age, alongside other variables that helped in determining the age group. The study [63] is important because it addresses the determination of age primarily by text analysis and the use of certain features in the text, rather than by relying on user metadata. The research can be expanded to determine the geographical location or other data about the author.

Although the relationship between discourse patterns and personal identity has been studied for decades, the study of these patterns using language technologies is relatively recent [64]. In this more recent tradition, the authors of [64] treated the prediction of the author's age from text as a regression problem. They investigated the same task using three very different genres of data simultaneously: blogs, telephone conversations, and online forum posts.
A domain adaptation technique was also used, which allows training a joint model on all three corpora together as well as separately, and analysing the differences in predictive performance between the combined and corpus-specific aspects of the model. Effective features include both stylistic features (such as POS patterns) and content-oriented features. Using a linear regression model based on shallow text features, the authors of [64] obtained correlations of up to 0.74 and mean absolute errors between 4.1 and 6.8 years. Three datasets were selected for analysis in the study: a blog corpus, the Fisher telephone corpus, and a breast cancer forum. Each dataset has a different age distribution, which affects the determination of the age of users. The blog dataset has more young people, while the breast cancer forum dataset has more older people. The telephone conversation dataset has the most balanced age distribution. Four different linear regression models were used to predict user age. Interestingly, the study [64] states that the gender of the user significantly affects the identification of their age, that is, it makes sense to determine both characteristics. The best results were obtained on the dataset of telephone conversations, followed closely by the blog dataset. The study also provides examples that show the features by which the age of an Internet user can be determined.

There is a growing interest in automatically predicting the gender and age of authors based on texts. However, most research so far ignores that language use is related to the social identity of speakers, which may differ from their biological identity. In [65], the authors combined insights from sociolinguistics with data collected through an online game to highlight the importance of approaching age and gender as social variables rather than static biological variables. In the study, thousands of players guessed the gender and age of Twitter users based on tweets alone.
The authors showed that more than 10% of Twitter users do not use language that the crowd associates with their biological sex. It was also shown that older Twitter users are often perceived as younger than they are. The authors' conclusions highlight the limitations of current approaches to gender and age prediction from texts. This is quite an interesting study that calls previous studies into question. The authors point out that the behaviour of users often does not correspond to their biological age or sex, so it makes sense to define gender as a social construct rather than a biological feature; the same applies to age. It is common for people on Twitter to post messages that do not match their gender or age. The research was conducted using a game developed by the authors, in which people guessed the gender and age of a certain Twitter author using only the text of the tweets. Thousands of participants joined the game, and the result showed a significant difference between the guessed age and the real age of the authors. With this study, the authors highlighted the problem that automatic determination of age or gender is often based on stereotypical features, which may not correspond to reality at all. This limits the models, which capture upbringing and social constructs rather than just biological age. The authors of the study call for social and sociocultural influence and the variability of people's language use to be considered when developing classification models.

3. Methods and materials
Many studies highlight the main characteristics by which age or gender can be identified, and which we could use in our study [61-75]. Different studies have used different data and different models to predict user characteristics [61-75].
This allows us to compare them and understand what could be used in our research. For example, the study [64] analysed how different distributions of data in the dataset affect the accuracy of the model and, accordingly, the accuracy of the characteristics it predicts, i.e. age or gender. Research [65] allows us to look at our topic from a critical point of view and to determine what should be taken into account when developing our own program, namely that the author's behaviour may often not coincide with their biological sex or age because of certain social constructs or upbringing.

To do this, we first define the tree of goals of our research. A tree of goals is a hierarchical tree-like structure obtained by dividing the overall goal into subgoals, which in turn can also be divided into smaller subgoals, functions, etc. (Fig. 1). Graphically, the tree is depicted with "branches down", and the main goal is placed at the highest level. The advantage of building a goal tree is that a large, unwieldy goal can be divided into simpler tasks that can be solved by known methods. At the root of the tree is "Development of a model for determining the age and gender of the authors of the text", and the branches of the tree go down from the root:
• Collection of datasets: preparation of datasets for model training and task execution.
  a. Blog Authorship Corpus - a dataset with blogs and information about the author, for determining age and gender [73].
  b. Spooky Author Identification - a dataset with famous authors and excerpts from their works, for determining gender.
• Feature extraction: selection and ranking of the features that most influence the model output.
  a. Bag-of-Words - uses TF-IDF weighting.
  b. N-grams - includes sequences of words (bigrams, trigrams) as features to capture the context.
  c. Embeddings, Word2Vec, GloVe - turn words into dense vectors that capture semantic meaning.
• Model training: training of the selected model on cleaned data.
  a. Transformer model - pre-trained large language models, suitable for gender determination.
  b. Regression model - models based on a regression function, suitable for determining age.
• Analysis of results: construction of graphs, statistical analysis, summarization of conclusions.

Figure 1: Tree of goals

The methodology of functional modelling is used to create a functional model that reflects the structure and functions of the system, as well as the flows of information and material objects connecting these functions. IDEF0 diagrams were built to display the mechanisms and instructions (Fig. 2-3). The main process is to create a model for determining the age and gender of authors based on their text.

Input: excerpts or fragments of texts by different authors.
Output: age and gender prediction model.
Mechanisms:
• Various models: a wide selection of models can be applied to this task.
• Python libraries, which support a variety of tasks, from pre-processing to data analysis.
• Hyperparameters, which can be adjusted to get the best results.
Instructions:
• Transformers documentation for proper use of large language models.
• Previous research from which useful information can be gleaned for this research.
• Other documentation, which helps in the use of numerous libraries while working on and developing the model.
Figure 2: IDEF0

Figure 3: Decomposed IDEF0 (1. Clean data; 2. Customize the model; 3. Determine gender and age; 4. Interpret the results)

A Data Flow Diagram (DFD) is a graphical structural analysis methodology that describes data sources and destinations external to the system, logical functions, data flows, and the data stores that are accessed (Fig. 4-5). That is, it describes the data flows implemented in the project.

Data repositories:
• Blog dataset - downloaded Blog Authorship Corpus dataset [73].
• Book dataset - downloaded Spooky Author Identification dataset [74].
• Documentation - all documentation that governs the developed models and the software part of the project.
External entities:
• Developer - a person who develops and configures the model.
• User - a natural person who uses the ready-made model.
Functions:
• Pipeline - the process of pre-processing data and preparing it for use by the model.
• Age determination model - a machine learning model that predicts the age of the author based on the texts written by them.
• Gender determination model - a machine learning model that predicts the gender of the author based on the texts written by them.
• Conversion into a convenient format - conversion of the information provided by the model into a convenient, human-readable format using graphs and conversion functions.
Figure 4: Data Flow Diagram

Figure 5: Decomposed data flow diagram

A workflow diagram (process diagram) is used to model the sequence of steps or stages in a work process. The main purpose of such a diagram is to visualize and analyse the workflow in order to optimize or automate the process. For this project, it visualizes the development process and all its stages (Fig. 6). In the end, a fully functional model was obtained for determining the age and gender of the author of a text.

Figure 6: Workflow diagram

• Collection of data, i.e. datasets with data on authors of texts [73-74].
• Data analysis: identification of data types, their quantity and other metadata for model selection.
The workflow then divides into branches.
Left branch:
• Data cleaning: removal of special characters, unnecessary characters, and articles.
• Identification of features using the previously described methods.
• Ranking of features to determine the most important for this study.
• Preparation of data for sending to the model.
Right branch:
• Selection of models:
  a. A model for classifying authors by age.
  b. A model for classifying authors by gender.
• Setting up the model: selecting parameters and optimizers and modifying the architecture.
Joining branches:
• Training of the previously configured model on prepared data.
• Evaluation of results using metrics, graphs and analytics.
• Formation of research conclusions.

4. Statement and justification of the problem
Statement of the problem: this study addresses the problem of determining the gender and age of an author based on the texts written by them. Its essence is to create a machine learning model that analyses a text and determines the biological data (age and gender) of its author from a sample of their text.

Technical characteristics: as input, the model accepts a processed and cleaned text sample in text format (string, char); as output, it returns the age, as a numerical value or numerical interval, and the gender, as a binary value.

Business processes:
• Data collection
• Data processing
• Model selection
• Model training
• Creating a practical application for the model, for example in cyber security for identification.

Technical means of implementation:
• Bag-of-Words, N-grams, Word2Vec, and GloVe are used for data processing.
• To build a model: Transformers, TensorFlow, Keras, PyTorch.

Application: the model is developed for research purposes, to advance the study of determining the gender or age of the authors of texts, but it can also be used to identify a person or verify authorship.

Expected effects: a contribution to research on identifying the biological data of an author from their text; development of a model potentially useful in cyber security and other fields; new knowledge about developing language models and conducting research.

5. Comparison of methods and means of the product under development
5.1. Machine learning models
Regression models are better suited to predicting age. A few basic ones are compared below:

1. Linear regression.
Pros:
• Simple and clear.
• Fast training and fast results.
Cons: Assumes a linear relationship between features and age, which may not hold for complex textual data.

2. Support Vector Regression (SVR).
Pros:
• Effective in high-dimensional spaces.
• Can capture complex relationships using kernel functions.
Cons: Requires careful tuning of hyperparameters.

3. Gradient Boosting Regression (for example, XGBoost).
Pros:
• Robust to noisy data.
• Can effectively capture non-linear relationships.
Cons: Higher computational cost compared to linear models.

Options for using large language models (LLM) to accomplish this task are also considered:

4. BERT (Bidirectional Encoder Representations from Transformers).
Pros:
• Captures the bidirectional context in the text.
• Can handle complex relationships and semantics in textual data.
• Pre-trained on a large corpus (e.g. Wikipedia, books) and then fine-tuned for specific tasks.
Cons:
• Requires significant computing resources for training and inference.
• Requires a large amount of memory.

5. GPT (Generative Pre-trained Transformer).
Pros:
• Creates coherent text appropriate to the context.
• Useful for text generation and prediction.
Cons: Cannot directly output a predicted age or gender; requires additional fine-tuning for the specific task.

5.2. Comparison factors
1. Performance. LLMs are generally excellent at capturing complex patterns and semantics in textual data, potentially leading to higher predictive accuracy compared to traditional regression models.
2. Interpretability. Traditional regression models, such as linear regression, offer straightforward interpretation, making it easier to understand the relationship between features and predictions. LLMs, being deep learning models, are more complex and less interpretable, although techniques such as the analysis of attention mechanisms can provide some insight.
3. Resource requirements. LLMs require significant computational resources (e.g., GPU, memory) for training and inference because of their deep architecture and large number of parameters. Traditional regression models have much lower resource requirements.
4. Possibility of adaptation to specific tasks.
LLM can be customized for specific tasks, such as age and gender prediction, using transfer learning using pre- trained models. Traditional regression models may require more complex work with features and additional tuning for a specific area. Therefore, both regression models and LLM models are suitable for the task of this study. They show different performances for different tasks, so it's best to use several models in your work, give them different parts of the task and compare their performance. 6. Experiments The basis of the project, which fulfils its main goals, namely the determination of age and gender, is a machine learning model written in the Python programming language. Despite this, the model itself takes up relatively little space in the program, and most of it is occupied by data processing and analysis, and analysis of results. It is useful to consider all these parts of the program separately to be able to focus on the methods and processes of each stage. 6.1. Data analysis At the stage of data analysis, the dataset itself is loaded [73-74], and its content, amount of data, data distribution, and search for correlation between data using graphs and other tools are analysed.  Methods: data loading, manual data cleaning, data visualization.  Tools: Python libraries (pandas, matplotlib, seaborn).  Process description: First, the data (dataset) is loaded into the Python environment for further processing. There is a manual review of the dataset and the selection of suitable features in the data. Unnecessary features can be deleted. Next, we build several visualizations using Python libraries to better capture data correlation and create an idea of how to work with them, namely a graph of gender distribution and age distribution in the dataset to check its weighting. Figure 7: Manual data cleaning 6.2. 
Data processing (pre-processing)

At the pre-processing stage, the text in the dataset is cleaned of unnecessary characters that can reduce the accuracy of the model's predictions, and the data is converted into a numerical format that the model can work with.
• Methods: removal of unnecessary symbols, removal of stop words, tokenization of sentences, lemmatization of words, division of the data into sets, vectorization of words, encoding of labels.
• Tools: Python libraries (pandas, NLTK).
• Process description: The data from the previous step is first separated into text and labels. Labels are converted into binary (gender) and categorical or numeric (age) formats. After that, the text data is cleaned: all uppercase letters are converted to lowercase, and the sentences are cleaned of stop words and of all kinds of signs and markup and, if necessary, lemmatized (in our case, this step turned out to be unnecessary). Finally, the cleaned data is divided into training and test sets.

Figure 8: Text processing as data pre-processing (removal of stop words and lemmatization)
Figure 9: Clear text

6.3. Model training

After cleaning and pre-processing, the data can be transformed into a set of vectors and fed into a model to make predictions. The model itself consists of two Random Forest models, one of which classifies age and the other gender. The prediction accuracy of both models is evaluated using metrics.
• Methods: text vectorization, model training, model evaluation.
• Tools: Python libraries (scikit-learn).
• Process description: The data cleaned at the previous stage is passed to the vectorization function, which converts tokens into numerical values (embeddings). In this numerical form, the data can be passed to the model for training. Random Forest classification models were used to classify age and gender, and a Random Forest Regressor was used to determine the numerical value of age. The model is trained on the texts together with their labels. Next, the model is evaluated, and its accuracy is determined by comparing its predictions with the real labels.

Figure 10: Age model training
Figure 11: Sex model training

6.4. Evaluation of results

Evaluation of the results is one of the most important stages of any research. It reveals regularities between the results and the initial data, which can sometimes even prompt another study. Data visualization, model accuracy measurement, feature selection, comparison of predictions with real results, and other methods are used to evaluate the research results.
• Methods: visualization of results, construction of predictions, transformation of predictions into a human-understandable format, comparison of data, calculation of numerical metrics.
• Tools: Python libraries (matplotlib, seaborn, scikit-learn, pandas, NumPy).
• Process description: The model predicts age and gender on the test data, its results are compared with the real ones, and graphs are generated. The most influential features by which the model determines age and gender were identified and displayed as graphs (separately for age and gender). These graphs are among the most important because they show which words can indicate the biological data of the author of a text. In addition, several graphs describe the accuracy of the model: the ROC curve, the confusion matrix, the histogram of true and predicted age (for numerical age prediction), and the distribution of true and predicted age categories (for categorical age determination).

6.5.
For the user

This stage provides the interaction with the user: it allows a user to enter a text excerpt in order to determine the gender and age of its author, and it presents the results as clearly as possible.
• Methods: calling previously developed functions, outputting results in a human-understandable format.
• Tools: Python libraries (scikit-learn, NLTK).
• Process description: The task of this stage is to create an extremely simple and concise section for user interaction. The user enters the text whose author needs to be characterized in the indicated place and runs two cells of code. The entered text is passed to the previously developed functions: it is cleaned, stop words are removed, the text is vectorized and passed to the model, and the prediction is converted into a readable format and shown to the user. The whole process takes 8 lines of code and no longer than a minute.

Figure 12: For the user

6.6. User manual

To run the program, the user's device must meet the following requirements:
• Internet connection;
• Operating system: Windows 7 or higher;
• Software: a program that supports the .ipynb format (Jupyter Notebook, the Google Colab web resource, VS Code);
• Hardware: 8+ GB RAM, CPU (or use Google Colab).

To use the program, follow these steps:
1. Place the program file and the dataset in one folder.
2. Open the program file.
3. Run each cell individually, one by one, using the run button (usually a triangle) to the left of each cell, or use the "Run All" button on the top panel of the program, if there is one.
4. Wait until all cells have finished executing (approximately 10-15 minutes).
5. In the "For user" section (at the end of the file), enter the text to be classified, then run the cell with the entered text and the cell that follows it; the result will appear under the second cell.

6.7.
Program code

Importing the required libraries:

    import re

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import train_test_split

Loading the dataset:

    df = pd.read_csv('./blogtext.csv')
    df.head()

Deleting unnecessary columns:

    df = df.drop(columns=['id', 'date', 'sign'])  # columns not used by the models
    df

Graph of the distribution of data by gender:

    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='gender', color='purple')
    plt.title('Gender Distribution')
    plt.xlabel('Gender')
    plt.ylabel('Count')
    plt.show()

Graph of the distribution of data by age:

    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x='age', bins=20, kde=True, color='purple')
    plt.title('Age Distribution')
    plt.xlabel('Age')
    plt.ylabel('Count')
    plt.show()

Function for cleaning the data of stop words:

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # Uncomment on the first run to download the NLTK resources:
    # nltk.download('punkt')
    # nltk.download('stopwords')
    # nltk.download('wordnet')
    # nltk.download('omw-1.4')

    # Build the stop-word set once; calling stopwords.words() for every token is very slow.
    STOP_WORDS = set(stopwords.words('english'))

    def stopwords_removal(text):
        new_words = word_tokenize(text)
        new_filtered_words = [
            # lemmatizer.lemmatize(word.lower()) for word in new_words if word.lower() not in STOP_WORDS
            word for word in new_words if word.lower() not in STOP_WORDS
        ]
        return ' '.join(new_filtered_words)

Sampling part of the dataset (30,000 samples):

    df = df.sample(n=len(df))     # shuffle the rows
    df_short = df[:30000].copy()  # copy, so that new columns can be added safely

Converting age into categories (optional):

    def categorize_age(age):
        if age < 20:
            return 0  # less than 20
        elif 20 <= age <= 30:
            return 1  # 20-30
        else:
            return 2  # more than 30

    df_short['age_category'] = df_short['age'].apply(categorize_age)

The function of removing unnecessary characters and applying all preprocessing functions to the text:
    def text_prerocess(text):  # 10 minutes
        text = re.sub(r'<.*?>', '', text)       # strip HTML tags
        text = re.sub(r'\W+', ' ', text)        # replace runs of non-word characters with a space
        text = text.lower()
        text = re.sub(r'nbsp', '', text)        # leftover HTML entity name
        text = re.sub(r'urllink', '', text)     # link marker left in the blog texts
        text = re.sub(r'\bim\b', 'i am', text)  # expand only the standalone token "im", not "time" etc.
        return text

    df_short['clean_text'] = df_short['text'].apply(text_prerocess)
    df_short['clean_topic'] = df_short['topic'].apply(text_prerocess)
    df_short['clean_topic'] = df_short['clean_topic'].apply(stopwords_removal)
    df_short['clean_text'] = df_short['clean_text'].apply(stopwords_removal)
    df_short['combined_text'] = df_short['clean_text'] + ' ' + df_short['clean_topic']
    df_short['gender_bi'] = df_short['gender'].map({'male': 1, 'female': 0})

    x_train, x_test, y_train, y_test = train_test_split(
        df_short['combined_text'],
        df_short[['age_category', 'gender_bi']],
        test_size=0.2,
        random_state=42,
    )

Vectorization of text data:

    vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
    x_train_tfidf = vectorizer.fit_transform(x_train)
    x_test_tfidf = vectorizer.transform(x_test)

A model for determining age:

    from sklearn.metrics import classification_report
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rf_age_category = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_age_category.fit(x_train_tfidf, y_train['age_category'])
    age_category_score = rf_age_category.score(x_test_tfidf, y_test['age_category'])
    print(f'Random Forest age category prediction accuracy: {age_category_score}')

Calculation of metrics for the age model and feature selection:

    age_category_predictions = rf_age_category.predict(x_test_tfidf)
    print(classification_report(y_test['age_category'], age_category_predictions))

    feature_names = vectorizer.get_feature_names_out()
    importances_age_category = rf_age_category.feature_importances_
    indices_age_category = np.argsort(importances_age_category)[::-1]  # most important first
    top_n = 15
    top_features_age_category = [feature_names[i] for i in indices_age_category[:top_n]]
    print(f'Top {top_n} features for age category prediction: {top_features_age_category}')

Model for gender determination:

    rf_gender = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_gender.fit(x_train_tfidf, y_train['gender_bi'])
    rf_gender.score(x_test_tfidf, y_test['gender_bi'])
    gender_predictions = rf_gender.predict(x_test_tfidf)
    print(classification_report(y_test['gender_bi'], gender_predictions))

Selection of features for the gender model:

    importances_gender = rf_gender.feature_importances_
    indices_gender = np.argsort(importances_gender)[::-1]
    top_features_gender = [vectorizer.get_feature_names_out()[i] for i in indices_gender[:top_n]]
    print(f'Top {top_n} features for gender prediction: {top_features_gender}')

Visualization of the feature importances for age:

    top_importances_age_category = importances_age_category[indices_age_category[:top_n]]
    plt.figure(figsize=(10, 6))
    plt.barh(range(top_n), top_importances_age_category, align='center', color='salmon')
    plt.yticks(range(top_n), top_features_age_category)
    plt.gca().invert_yaxis()
    plt.xlabel('Feature Importance')
    plt.title('Top 15 Features for Age Category Prediction')
    plt.show()

Visualization of the feature importances for gender:

    top_importances_gender = importances_gender[indices_gender[:top_n]]
    plt.figure(figsize=(10, 6))
    plt.barh(range(top_n), top_importances_gender, align='center', color='purple')
    plt.yticks(range(top_n), top_features_gender)
    plt.gca().invert_yaxis()
    plt.xlabel('Feature Importance')
    plt.title('Top 15 Features for Gender Prediction')
    plt.show()

Confusion matrix for gender predictions:

    from sklearn.metrics import confusion_matrix

    predicted_genders = rf_gender.predict(x_test_tfidf)
    cm = confusion_matrix(y_test['gender_bi'], predicted_genders)  # rows: true, columns: predicted
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='RdPu',
                xticklabels=['Female', 'Male'], yticklabels=['Female', 'Male'])
    plt.title('Confusion Matrix for Gender Prediction')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

Confusion matrix for age predictions:

    cm_age_category = confusion_matrix(y_test['age_category'], age_category_predictions)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm_age_category, annot=True, fmt='d', cmap='RdPu',
                xticklabels=['<20', '20-30', '>30'], yticklabels=['<20', '20-30', '>30'])
    plt.title('Confusion Matrix for Age Category Prediction')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

ROC curves for gender and age:

    from sklearn.metrics import roc_curve, auc
    from sklearn.preprocessing import label_binarize
    from itertools import cycle

    # One-vs-rest ROC curves for the three age categories
    y_test_binarized = label_binarize(y_test['age_category'], classes=[0, 1, 2])
    n_classes = y_test_binarized.shape[1]
    age_proba = rf_age_category.predict_proba(x_test_tfidf)  # computed once, reused for every class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_binarized[:, i], age_proba[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Binary ROC curve for gender (probability of class 1, male)
    fpr_gender, tpr_gender, _ = roc_curve(y_test['gender_bi'],
                                          rf_gender.predict_proba(x_test_tfidf)[:, 1])
    roc_auc_gender = auc(fpr_gender, tpr_gender)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    colors = cycle(['darkturquoise', 'darkmagenta', 'lightcoral'])
    for i, color in zip(range(n_classes), colors):
        ax1.plot(fpr[i], tpr[i], color=color, lw=2,
                 label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
    ax1.plot([0, 1], [0, 1], 'k--', lw=2)  # chance level
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.set_title('ROC Curve for Age Category Prediction')
    ax1.legend(loc="lower right")

    ax2.plot(fpr_gender, tpr_gender, color='purple', lw=2,
             label='ROC curve (area = {0:0.2f})'.format(roc_auc_gender))
    ax2.plot([0, 1], [0, 1], 'k--', lw=2)  # chance level
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    ax2.set_xlabel('False Positive Rate')
    ax2.set_ylabel('True Positive Rate')
    ax2.set_title('ROC Curve for Gender Prediction')
    ax2.legend(loc="lower right")
    plt.tight_layout()
    plt.show()

Precision-recall curves:

    from sklearn.metrics import precision_recall_curve

    age_proba = rf_age_category.predict_proba(x_test_tfidf)
    precision = dict()
    recall = dict()
    for i in range(n_classes):
        precision[i], recall[i], _ = precision_recall_curve(y_test_binarized[:, i], age_proba[:, i])

    plt.figure(figsize=(8, 6))
    for i, color in zip(range(n_classes), colors):
        plt.plot(recall[i], precision[i], color=color, lw=2,
                 label='PR curve of class {0}'.format(i))
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve for Age Category Prediction')
    plt.legend(loc="lower left")
    plt.show()

Graph comparing actual and predicted age:

    plt.figure(figsize=(12, 6))
    sns.countplot(x=y_test['age_category'], palette='Blues', alpha=0.5, label='True Age Categories')
    sns.countplot(x=age_category_predictions, palette='Reds', alpha=0.5, label='Predicted Age Categories')
    plt.title('Distribution of True vs. Predicted Age Categories')
    plt.xlabel('Age Category')
    plt.ylabel('Count')
    plt.legend()
    plt.show()

A function for converting predicted data into a human-readable format:

    def decategorize(age, gender):
        # age and gender are the one-element arrays returned by predict()
        age_labels = {0: 'less than 20', 1: '20-30', 2: 'more than 30'}
        gender_labels = {0: 'female', 1: 'male'}
        return [age_labels[int(age[0])], gender_labels[int(gender[0])]]

Loading the second dataset (with book authors):

    book_train = pd.read_csv('train_book.csv')

Conversion of author names to gender and pre-processing of text data using previously described functions:

    book_train = book_train.drop(columns=['id'])  # assign the result, drop() does not modify in place

    def categorize_gender(author):
        # defining the author's gender from the author code
        if author == 'EAP' or author == 'HPL':
            return 1  # male
        else:
            return 0  # female

    # Apply gender categorization and the text cleaning functions
    book_train['gender'] = book_train['author'].apply(categorize_gender)
    book_train['text'] = book_train['text'].apply(text_prerocess)
    book_train['text'] = book_train['text'].apply(stopwords_removal)

    x_train_book, x_test_book, y_train_book, y_test_book = train_test_split(
        book_train['text'], book_train['gender'], test_size=0.2, random_state=42)

Data vectorization:

    # A separate vectorizer is fitted for the book corpus, so that the blog vectorizer
    # (and the models trained on its features) stays valid.
    vectorizer_book = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
    x_train_b = vectorizer_book.fit_transform(x_train_book)
    x_test_b = vectorizer_book.transform(x_test_book)

Training the additional model purely on new data:

    rf_author = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_author.fit(x_train_b, y_train_book)
    rf_author.score(x_test_b, y_test_book)

Comparison of the results of the old and new gender models on the old and new data:

    # Each model receives features produced by the vectorizer it was trained with.
    print('Old model on old data: ', rf_gender.score(x_test_tfidf, y_test['gender_bi']))
    print('Old model on new data: ', rf_gender.score(vectorizer.transform(x_test_book), y_test_book))
    print('New model on new data: ', rf_author.score(x_test_b, y_test_book))
    print('New model on old data: ', rf_author.score(vectorizer_book.transform(x_test), y_test['gender_bi']))

String to enter sample text from the user:

    your_text = 'I would love to go to the gallery with you after university! I heard they are exhibiting the most famous paintings of Monet, he is my favourite painter, i am so excited!'

Functions for converting text, making predictions, converting predictions back to text:

    # Now run this cell and it will give you the result
    text = text_prerocess(your_text)
    text = stopwords_removal(text)
    text = pd.Series([text])  # wrap the cleaned text, not the raw input
    text = vectorizer.transform(text)
    pred_age = rf_age_category.predict(text)
    pred_gen = rf_gender.predict(text)
    result = decategorize(pred_age, pred_gen)
    print(f"The author's age is in this range: {result[0]}\nAuthor's gender: {result[1]}")

7. Results

The program is divided into several sections according to the type of tasks performed. When the program starts, the libraries are imported first; the data analysis section begins immediately after that.

Figure 13: Start the program

In the Data Analysis section, the dataset is imported (if necessary, the path to the dataset in the file system must be changed) and unnecessary data columns (id, sign, date) are removed. After that, the data balance in the dataset is checked using data visualization.
A count plot was used to compare gender, and a histogram was used to compare age.

Figure 14: Data balance graphs

Next, data pre-processing takes place. This section defines two important functions that are responsible for cleaning data, as well as one optional function:
• def stopwords_removal(text): the function tokenizes the text and removes stop words; lemmatization is available in the code but disabled. As input, it accepts the text to be cleaned, and it returns the cleaned text.

Figure 15: Stopwords_removal function

• def text_prerocess(text): the function for text pre-processing; it converts the text to lowercase and removes characters and unwanted combinations that can cause noise. As input, it accepts the text to be cleaned, and it returns the cleaned text.

Figure 16: The text_prerocess function

• def categorize_age(age): an optional function, used only when age is predicted as an age group rather than as a number. It divides authors into three groups by age. As input, it takes the age; as output, it gives the number of the group to which the author belongs.

Figure 17: Age categorization function

Overall, in this section the text data is passed through these functions in turn to obtain tokenized and cleaned text, which can then be vectorized and sent to the model.

Figure 18: Applying functions to data

In the next section, the model itself, or rather two models, is trained to predict age and gender. However, the data from the previous section is still in text form, which the model cannot use, so the text is vectorized first. Vectorization converts the text into a set of embeddings in the form of TF-IDF, a weighting scheme that scores each word by how important it is to a document relative to the whole corpus, which is very important for this research.
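To make the TF-IDF weighting concrete, the following is a minimal, dependency-free sketch, not the code used in the study. It assumes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and the L2 row normalisation that scikit-learn's TfidfVectorizer applies by default; the whitespace tokenisation, the function name tfidf, and the three toy documents are simplifications invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rarer terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy documents, invented for illustration only.
docs = [
    "school homework school exam",
    "work meeting exam",
    "work salary meeting",
]
vocab, rows = tfidf(docs)
weights = dict(zip(vocab, rows[0]))
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "school" receives the largest weight because it is frequent there and absent elsewhere, while "exam", which also appears in another document, is down-weighted. This is the property that lets the Random Forest feature importances point to words that are characteristic of an age group or gender.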
Figure 19: Text vectorization for model training

The vectorized texts are passed to the RandomForestClassifier model to determine gender, or age in categorical form; if the age needs to be predicted in numerical form, the RandomForestRegressor model is used. These models were chosen because they make it possible to extract feature importances for further analysis.

Figure 20: The models used in the work

This section also identifies the features that have the greatest influence on the classification result.

Figure 21: Extracting the main features

In the next section, the data obtained during the research is analysed. The section consists entirely of graphs of various types and for various purposes.

Figure 22: Ranking of signs that affect the determination of gender and age
Figure 23: Confusion matrices of age (categories) and gender
Figure 24: ROC curves
Figure 25: Comparison of predicted categories with actual ones

A separate section has also been developed so that users can interact with the model more conveniently and enter their own sentences for age and gender prediction. For the experiment, 5 different sentences were given: 3 from women and 2 from men. The following results were obtained:

Figure 26: Results of the first case
Figure 27: Results of the second case
Figure 28: Results of the third case
Figure 29: Results of the fourth case
Figure 30: Results of the fifth case

Four of the five cases are identified correctly. One of the women's messages is identified as a message from a man.

8. Discussion

The purpose of the research was to create a model for determining an author's age and gender based on their texts. In this work, the model is built and trained on a part of the dataset with blogs. The following results were achieved:
• The age determination model demonstrates an accuracy of 64.3% if age is predicted as a category, and 25% if age is predicted in numerical format.
The gender determination model has an accuracy of 64%. These results are not as high as we would like, but they still make it possible to extract features from the text that help determine the biological data of the author. The results could probably be improved by using a larger dataset and longer messages.
• Features from the model were selected and sorted, and a ranking of the most influential features for the classification of biological data was obtained. Adding the topic of the message to its text greatly helps the classification, making it easier for the models to predict both age and gender. For age classification, topics about the place of study or work and about social status (Student, industry, and others) are the most helpful; for gender, it is the field of interests (technology, finance).
• The confusion matrix shows that men's messages are classified as women's more often than vice versa. This may indicate that men are more likely to use a feminine way of communicating, or that women are more likely to state their gender as the opposite for various reasons.
• For the dataset of books, where the gender of the author had to be determined, an additional model was created and trained only on this dataset. The book data was also passed through the previous model, and the blog data through the new one, and the results were compared. It turns out that the model trained on blog data performs better on the unfamiliar book data than the model trained on book data performs on the blog data.

Figure 31: Comparison of two models

• One of the authors wrote a message of her own as an example. It is correctly classified as written by a female author aged 20 to 30. The numerical age prediction slightly overestimates the real age: the model predicts that the author is 27, whereas she is a little over 20. It can be said that the model is much more likely to determine the psychological age of a person than the biological one.
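The comparison of "men classified as women" with "women classified as men" depends on how many texts of each gender are in the test set, so per-class error rates are safer to compare than raw counts. The following is a minimal sketch of that calculation, not part of the original program; the helper names confusion_counts and row_error_rates and the toy labels are hypothetical, and the label encoding (0 = female, 1 = male) follows the gender_bi column used in the program code.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a confusion matrix: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def row_error_rates(matrix):
    """For each true class, return the share of its samples assigned to a wrong class."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        rates.append((total - row[i]) / total if total else 0.0)
    return rates


# Toy labels, invented for illustration: 0 = female, 1 = male.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]

cm = confusion_counts(y_true, y_pred, labels=[0, 1])
female_as_male, male_as_female = row_error_rates(cm)
print(cm)
print(f"women classified as men: {female_as_male:.2f}")
print(f"men classified as women: {male_as_female:.2f}")
```

If the two rates differ clearly after this normalisation, the asymmetry is not just an effect of unequal class sizes in the test set.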
Statistics of the model before training:

1. Statistical analysis of the data begins even before the dataset is sent to the model for making predictions. When the dataset is loaded, two graphs of the data distribution in this dataset are constructed. The amount of data by gender is almost the same, but the ages 13-17 and 23-29 are significantly more prevalent, and there are no users at all in the gaps between the age groups, around the ages of 20 and 30. This can have a rather negative effect on the accuracy of the model. Blog Authorship Corpus dataset (kaggle.com) [73].

Figure 32: Graphs of age and gender distribution

2. During the training of the model, a classification report is built to determine its accuracy. The report includes such metrics as accuracy, f1-score, macro avg, precision, and recall.

Figure 33: Classification Report for Age(1) and Gender(2)

Gender determination has more stable metrics, so it can be assumed that the models cope with gender determination more easily than with age.

3. Analysis of the features that predict gender or age is perhaps the most important step in this research. The results were generated and displayed as a ranked graph of the 15 most important features for classification.

Figure 34: Signs for predicting age

The first three places belong to the topic names that users specify when writing a blog. This is to be expected, because the subject of a text can tell as much about the user as the text itself. The following positions belong to slang forms, which are most likely used more often by young people. After those come places of work or study and some other words, whose meaning can be analysed by analogy.

Figure 35: Signs for predicting gender

Again, the first to appear are the topics of messages, which indicate the areas of interest of individuals. More technical interests point to the male gender, while more creative interests point to the female gender.
There are also quite a lot of words describing feelings, the expression of which is more characteristic of women.

4. Confusion matrices can also be a great source of information about a program and its results.

Figure 36: Confusion matrix for gender

In many previous runs of the program, the number of men classified as women was relatively higher than the other way around. From this, we can draw the interesting conclusion that men adopt a female manner of communication more often than women adopt a male one.

Figure 37: Confusion matrix for age (classification)

The matrix shows that the central group quite strongly dominates the others, both in the predictions of the model and in the dataset. This problem could be addressed by balancing the dataset by age, although this may considerably change the model results.

5. The ROC curve plots the true positive rate against the false positive rate at different decision thresholds, and the area under the curve summarizes how well the model separates the classes.

Figure 38: Curves for age and gender

The model predicts the age group <20 best, although it is not the most common. Most likely, many features help to identify this age quickly, such as "school" or "student".

6. Comparing predicted and actual ages is one of the best ways to evaluate model predictions visually.

Figure 39: Comparison of predicted and actual categories (age)

As can be seen, the model has difficulties with the third category (>30). Most likely, this is because this category contains very few samples compared to the others, so the model has not learned to predict it properly. Rebalancing the dataset may fix this. A similar graph can be built for the numerical prediction of age.

Figure 40: Comparison of predictions and true values (age)

Interestingly, the model does not predict ages below 17, even though there are many younger users in the dataset.
It predicts a large part of the data as 17-23, although this range is practically empty in the dataset. As with the categories, the model also practically avoids ages above 30. It seems that more balanced data would solve some of these problems.

7. A separate gender classification model has been developed for the dataset of book authors. It classifies the gender of authors from this dataset well, with an accuracy of about 0.8. Dataset with book authors: Spooky Author Identification [74].

Figure 41: Prediction of the "book" model

However, when the book data was run through the model previously trained on the blog data, that model performed better than the new model did on the blog data. So the old model, although it has lower accuracy, handles the new data much better than the new model handles the old data.

Figure 42: Comparison of models

8. As a final test, and as a way for users to interact with the program more conveniently, the program can determine the gender and age of the author of a text entered by the user. A sentence was written by one of the authors of this work, and a fairly accurate classification was obtained.

Figure 43: Predictions of the model

Several experiments were also conducted to verify the model. Three friends of one of the authors of the study took part in the experiment. They wrote ordinary conversational sentences, and this author also wrote a few. These sentences were placed in the following order: author (female, 21), author (female, 21), male 22, female 20, male 21. The obtained results are arranged in the same order.

Figure 44: Results of predictions for entered sentences

All five sentences are classified correctly by age, and four out of five by gender: one woman was classified as a man. This is a fairly good result for a text-based age and gender prediction model, although five sentences are too few to draw firm conclusions. A few more experiments were conducted with sentences formulated using the features identified by the model.
Thus, adding the words "school" or "student" reduces the predicted age of the author, and adding words related to technology changes the predicted gender of the author to male. This means that the sentence submitted to the model should not be written to deceive it; it should be sincere and casual.

Figure 45: Artificial reduction of predicted age

9. Conclusions

In the course of this project, a model was developed that determines the age and gender of an author based on their text. Before starting work, studies on similar topics were reviewed to find out what has already been researched and tested, and what is still worth investigating. These studies also gave many hints about which implementation methods and tools work better for this task. The work on the project was carefully planned using process diagrams and data flows. The most suitable methods and tools for implementing the project were studied, and simple Random Forest classification and regression models were chosen. These models were chosen because they cope with the task quite well and are much less resource-intensive than large language models; in addition, they are very easy to use and configure. Two datasets were selected: a dataset with blogs and a dataset with books. The dataset with blogs was used the most because it contains both the age and the gender of the blog author. Before use, the data was analysed and cleaned, then transformed into embeddings and sent for model training. The results of the model were studied and analysed in detail. Many useful features were extracted that are responsible for classifying the age or gender of the author of a text. In addition, many interesting regularities were observed in the process of analysing the results.
Additionally, a test case was implemented that allows the user to interact with the model easily. Such research is useful both in many areas of life and for the development of science. Such studies can help capture relationships between seemingly unrelated features, such as the reflection of an author's gender and age in their texts. The experiment could be repeated or modified on a more powerful computer in order to analyse much larger volumes of data, which could significantly improve the results of the model. It is worth noting, however, that, as stated in another study, such predictions reflect a person's psychological age more than their biological age, because the manner of speaking reflects psychological age. In the future, other datasets could be used, if available, or the current dataset could be rebalanced and the experiment repeated. Such research can bring many benefits to society if it is used properly.
References
[1] O. Tverdokhlib, V. Vysotska, P. Pukach, M. Vovk, Information technology for identifying hate speech in online communication based on machine learning, Lecture Notes on Data Engineering and Communications Technologies 195 (2024) 339–369.
[2] N. Borysova, K. Melnyk, N. Babkova, Z. Kochuieva, V. Melnyk, Gender Classification of Surnames: Ukrainian aspect, CEUR Workshop Proceedings 3171 (2022) 354-364.
[3] L. Stasiuk, Gender Marked Intimate Conversational Interaction of Spouses in Modern English, CEUR Workshop Proceedings 2870 (2021) 731-742.
[4] A. Hadzalo, Analysis of Gender-Marked Units: Statistical Approach, CEUR Workshop Proceedings 2604 (2020) 462-471.
[5] Y. Butelskyy, Statistical Methods to Detect Gender Peculiarities of Communication in Vkontakte Social Network Groups, in: Proceedings of the 11th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT, 2016, pp. 132-135. doi: 10.1109/STC-CSIT.2016.7589888.
[6] I. Afanasieva, N. Golian, V. Golian, A.
Khovrat, K. Onyshchenko, Application of Neural Networks to Identify of Fake News, CEUR Workshop Proceedings 3396 (2023) 346-358.
[7] A. Shupta, O. Barmak, A. Wierzbicki, T. Skrypnyk, An Adaptive Approach to Detecting Fake News Based on Generalized Text Features, CEUR Workshop Proceedings 3387 (2023) 300-310.
[8] V.-A. Oliinyk, V. Vysotska, Y. Burov, K. Mykich, V. Basto-Fernandes, Propaganda Detection in Text Data Based on NLP and Machine Learning, CEUR Workshop Proceedings 2631 (2020) 132-144.
[9] R. A. Dar, R. Hashmy, A Survey on COVID-19 related Fake News Detection using Machine Learning Models, CEUR Workshop Proceedings 3426 (2023) 36-46.
[10] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP Tool for Extracting Relevant Information from Criminal Reports or Fakes/Propaganda Content, in: Proceedings of IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 2022, pp. 93-98. doi: 10.1109/CSIT56902.2022.10000563.
[11] A. Mykytiuk, V. Vysotska, O. Markiv, L. Chyrun, Y. Pelekh, Technology of Fake News Recognition Based on Machine Learning Methods, CEUR Workshop Proceedings 3387 (2023) 311-330.
[12] T. Batiuk, V. Vysotska, V. Lytvyn, Intelligent system for socialization by personal interests on the basis of SEO technologies and methods of machine learning, CEUR Workshop Proceedings 2604 (2020) 1237-1250.
[13] D. Uhryn, O. Naum, N. Antonyuk, I. Dyyak, L. Chyrun, A. Demchuk, V. Vysotska, Z. Rybchak, T. Batiuk, Tourist Itineraries Plan Design Based on the Behavior of Bee Colonies, CEUR Workshop Proceedings 2631 (2020) 516-539.
[14] T. Batiuk, V. Vysotska, R. Holoshchuk, S. Holoshchuk, Intelligent System for Socialization of Individual’s with Shared Interests based on NLP, Machine Learning and SEO Technologies, CEUR Workshop Proceedings 3171 (2022) 572-631.
[15] D. Dosyn, T.
Batiuk, A Realization of Visual Biometric Validation to Enhance Guarded and Efficient Authorization for Intellectual Systems, CEUR Workshop Proceedings 3668 (2024) 247-268.
[16] T. Batiuk, L. Chyrun, O. Oborska, Ontology Model and Ontological Graph for Development of Decision Support System of Personal Socialization by Common Relevant Interests, CEUR Workshop Proceedings 3171 (2022) 877-903.
[17] R. Bekesh, L. Chyrun, P. Kravets, A. Demchuk, Y. Matseliukh, T. Batiuk, I. Peleshchak, R. Bigun, I. Maiba, Structural modeling of technical text analysis and synthesis processes, CEUR Workshop Proceedings 2604 (2020) 562–589.
[18] A. Yarovyi, D. Kudriavtsev, Method of Multi-Purpose Text Analysis Based on a Combination of Knowledge Bases for Intelligent Chatbot, CEUR Workshop Proceedings 2870 (2021) 1238-1248.
[19] V. Vasyliuk, Y. Shyika, T. Shestakevych, Information System of Psycholinguistic Text Analysis, CEUR Workshop Proceedings 2604 (2020) 178-188.
[20] O. Artemenko, V. Pasichnyk, N. Kunanets, K. Shunevych, Using sentiment text analysis of user reviews in social media for e-tourism mobile recommender systems, CEUR Workshop Proceedings 2604 (2020) 259-271.
[21] I. Gruzdo, I. Kyrychenko, G. Tereshchenko, O. Cherednichenko, Application of Paragraphs Vectors Model for Semantic Text Analysis, CEUR Workshop Proceedings 2604 (2020) 283-293.
[22] N.B. Shakhovska, R.Yu. Noha, Methods and tools for text analysis of publications to study the functioning of scientific schools, Journal of Automation and Information Sciences 47(12) (2015) 29-43.
[23] V. Vysotska, V.B. Fernandes, V. Lytvyn, M. Emmerich, M. Hrendus, Method for Determining Linguometric Coefficient Dynamics of Ukrainian Text Content Authorship, Advances in Intelligent Systems and Computing 871 (2019) 132-151. doi: 10.1007/978-3-030-01069-0_10.
[24] V. Vysotska, Y. Burov, V. Lytvyn, A.
Demchuk, Defining Author's Style for Plagiarism Detection in Academic Environment, in: Proceedings of the International Conference on Data Stream Mining and Processing, DSMP, 2018, pp. 128-133. doi: 10.1109/DSMP.2018.8478574.
[25] V. Vysotska, O. Kanishcheva, Y. Hlavcheva, Authorship Identification of the Scientific Text in Ukrainian with Using the Lingvometry Methods, in: Proceedings of the International Conference on Computer Sciences and Information Technologies, CSIT, 2018, pp. 34-38. doi: 10.1109/STC-CSIT.2018.8526735.
[26] V. Lytvyn, V. Vysotska, Y. Burov, I. Bobyk, O. Ohirko, The linguometric approach for co-authoring author's style definition, in: International Symposium on Wireless Systems within the International Conferences on Intelligent Data Acquisition and Advanced Computing Systems, IDAACS-SWS, 2018, pp. 29-34. doi: 10.1109/IDAACS-SWS.2018.8525741.
[27] V. Lytvyn, V. Vysotska, I. Budz, Y. Pelekh, N. Sokulska, R. Kovalchuk, L. Dzyubyk, O. Tereshchuk, M. Komar, Development of the quantitative method for automated text content authorship attribution based on the statistical analysis of N-grams distribution, Eastern-European Journal of Enterprise Technologies 6(2-102) (2019) 28-51. doi: 10.15587/1729-4061.2019.186834.
[28] V. Vysotska, O. Markiv, S. Teslia, Y. Romanova, I. Pihulechko, Correlation Analysis of Text Author Identification Results Based on N-Grams Frequency Distribution in Ukrainian Scientific and Technical Articles, CEUR Workshop Proceedings 3171 (2022) 277-314.
[29] V. Motyka, Y. Stepaniak, M. Nasalska, V. Vysotska, Lexical Diversity Parameters Analysis for Author's Styles in Scientific and Technical Publications, CEUR Workshop Proceedings 3403 (2023) 595–617.
[30] R. Romanchuk, V. Vysotska, V. Andrunyk, L. Chyrun, S. Chyrun, O.
Brodyak, Intellectual Analysis System Project for Ukrainian-language Artistic Works to Determine the Text Authorship Attribution Probability, in: Proceedings of the International Conference on Computer Sciences and Information Technologies, CSIT, Lviv, 19-21 October 2023.
[31] O. Levchenko, M. Dilai, Qualitative and Quantitative Markers of Individual Authorial Conceptualization, CEUR Workshop Proceedings 3396 (2023) 1-19.
[32] I. Khomytska, V. Teslyuk, I. Bazylevych, I. Karamysheva, Automated Identification of Authorial Styles, CEUR Workshop Proceedings 3396 (2023) 323-333.
[33] I. Butko, The use of geospatial information by public authorities to support the decision making of management, Advanced Information Systems 5(1) (2021) 39–44. doi: 10.20998/2522-9052.2021.1.05.
[34] V. Shynkarenko, I. Demidovich, Natural Language Texts Authorship Establishing Based on the Sentences Structure, CEUR Workshop Proceedings 3171 (2022) 328-337.
[35] Y. Hlavcheva, O. Kanishcheva, M. Vovk, M. Glavchev, Identification of the Author's Idea Based on the Modified TextRank Method, CEUR Workshop Proceedings 2870 (2021) 118-128.
[36] V. Shynkarenko, I. Demidovich, Authorship Determination of Natural Language Texts by Several Classes of Indicators with Customizable Weights, CEUR Workshop Proceedings 2870 (2021) 832-844.
[37] I. Khomytska, V. Teslyuk, The Multifactor Method Applied for Authorship Attribution on the Phonological Level, CEUR Workshop Proceedings 2604 (2020) 189-198.
[38] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of methods, models, and means for the author attribution of a text, Eastern-European Journal of Enterprise Technologies 3(2(93)) (2018) 41–46. doi: 10.15587/1729-4061.2018.132052.
[39] I. Khomytska, V. Teslyuk, Authorship and Style Attribution by Statistical Methods of Style Differentiation on the Phonological Level, Advances in Intelligent Systems and Computing 871 (2019) 105–118. doi: 10.1007/978-3-030-01069-0_8.
[40] Y. Zhao, J. Da, J.
Yan, Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches, Information Processing & Management 58(1) (2021) 102390.
[41] M. Hartmann, Y. Golovchenko, I. Augenstein, Mapping (dis-)information flow about the MH17 plane crash, arXiv:1910.01363, 2019.
[42] S. Ahmed, Classification of Censored Tweets in Chinese Language using XLNet, in: Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, 2021, pp. 136-139.
[43] V. Vysotska, Modern state and prospects of information technologies development for natural language content processing, CEUR Workshop Proceedings 3668 (2024) 198–234.
[44] I. Zamaruieva, S. Lienkov, O. Babich, A. Shevchenko, Y. Khlaponin, N. Bernaz, Analytical Approaches to News Content Processing during the War in Ukraine in Opposing Geopolitical Alliances Mass Media, CEUR Workshop Proceedings 3403 (2023) 618-631.
[45] V. Vysotska, Computer Linguistic Systems Design and Development Features for Ukrainian Language Content Processing, CEUR Workshop Proceedings 3688 (2024) 229–271. URL: https://ceur-ws.org/Vol-3688/paper18.pdf.
[46] S. Albota, Creating a Model of War and Pandemic Apprehension: Textual Semantic Analysis, CEUR Workshop Proceedings 3396 (2023) 228-243.
[47] N. Khairova, Y. Holyk, D. Sytnikov, Y. Mishcheriakov, N. Shanidze, Topic Modelling of Ukraine War-Related News Using Latent Dirichlet Allocation with Collapsed Gibbs Sampling, CEUR Workshop Proceedings 3688 (2024) 1-15.
[48] S. Mainych, A. Bulhakova, V. Vysotska, Cluster Analysis of Discussions Change Dynamics on Twitter about War in Ukraine, CEUR Workshop Proceedings 3396 (2023) 490-530.
[49] R. Nazarchuk, S. Albota, Tweets about Ukraine during the russian-Ukrainian War: Quantitative Characteristics and Sentiment Analysis, CEUR Workshop Proceedings 3426 (2023) 551-560.
[50] N. Khairova, A. Kolesnyk, O. Mamyrbayev, K.
Mukhsina, The Aligned Kazakh-Russian Parallel Corpus Focused on the Criminal Theme, CEUR Workshop Proceedings 2362 (2019) 116-125.
[51] S. Voloshyn, V. Vysotska, O. Markiv, I. Dyyak, I. Budz, V. Schuchmann, Sentiment Analysis Technology of English Newspapers Quotes Based on Neural Network as Public Opinion Influences Identification Tool, in: Proceedings of 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 2022, pp. 83-88. doi: 10.1109/CSIT56902.2022.10000627.
[52] N. Khairova, A. Shapovalova, O. Mamyrbayev, N. Sharonova, K. Mukhsina, Using BERT model to Identify Sentences Paraphrase in the News Corpus, CEUR Workshop Proceedings 3171 (2022) 38-48.
[53] N. Bondarchuk, I. Bekhta, O. Melnychuk, O. Matviienkiv, Keyword-based Study of Thematic Vocabulary in British Weather News, CEUR Workshop Proceedings 3171 (2022) 451-460.
[54] S. Voloshyn, O. Markiv, V. Vysotska, I. Dyyak, L. Chyrun, V. Panasyuk, Emotion Recognition System Project of English Newspapers to Regional E-Business Adaptation, in: Proceedings of IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), 2022, pp. 392-397. doi: 10.1109/CSIT56902.2022.10000527.
[55] N. Antonyuk, L. Chyrun, V. Andrunyk, A. Vasevych, S. Chyrun, A. Gozhyj, I. Kalinina, Y. Borzov, Medical news aggregation and ranking of taking into account the user needs, CEUR Workshop Proceedingsnn248 (2019) 369–382.
[56] V. Andrunyk, A. Vasevych, L. Chyrun, N. Chernovol, N. Antonyuk, A. Gozhyj, V. Gozhyj, I. Kalinina, M. Korobchynskyi, Development of information system for aggregation and ranking of news taking into account the user needs, CEUR Workshop Proceedings 2604 (2020) 1127–1171.
[57] V. Vysotska, S. Voloshyn, O. Markiv, O. Brodyak, N. Sokulska, V.
Panasyuk, Tone Analysis of Regional Articles in English-Language Newspapers Based on Recurrent Neural Network Bi-LSTM, in: Proceedings of the 5th International Conference on Advanced Information and Communication Technologies (AICT), 2023, pp. 158-163.
[58] S. Albota, Linguistic and Psychological Features of the Reddit News Post, in: Proceedings of the IEEE 15th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT, 2020, 1, pp. 295–299.
[59] N. Shakhovska, M. Medykovskyj, L. Bychkovska, Building a smart news annotation system for further evaluation of news validity and reliability of their sources, Przeglad Elektrotechniczny 91(7) (2015) 43-44.
[60] V. Vysotska, R. Holoshchuk, S. Goloshchuk, O. Voloshynskyi, M. Shevchenko, V. Panasyuk, Predicting the Effects of News on the Financial Market Based on Machine Learning Technology, in: Proceedings of the 5th International Conference on Advanced Information and Communication Technologies (AICT), 2023, pp. 152-157.
[61] R. Chew, C. Kery, L. Baum, T. Bukowski, A. Kim, M. Navarro, Predicting age groups of Reddit users based on posting behavior and metadata: classification model development and validation, JMIR Public Health and Surveillance 7(3) (2021) e25807.
[62] Z. Miller, B. Dickinson, W. Hu, Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features, International Journal of Intelligence Science 2 (4A) (2012) 24184. doi: 10.4236/ijis.2012.224019.
[63] S. Rosenthal, K. McKeown, Age prediction in blogs: A study of style, content, and online behavior in pre- and post-social media generations, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 763-772.
[64] D. Nguyen, N. A. Smith, C. P.
Rosé, Author age prediction from text using linear regression, in: Proceedings of the 5th ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH@ACL 2011, 24 June, 2011, Portland, Oregon, USA. Association for Computational Linguistics, 2011, pp. 115-123.
[65] D. Nguyen, D. Trieschnigg, A. S. Doğruöz, R. Gravel, M. Theune, T. Meder, F. de Jong, Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment, in: Proceedings of the Technical Papers 25th International Conference on Computational Linguistics, August 23-29, 2014, Dublin, Ireland. Association for Computational Linguistics, 2014, pp. 1950-1961.
[66] I. Khomytska, V. Teslyuk, I. Bazylevych, I. Shylinska, Approach for minimization of phoneme groups in authorship attribution, International Journal of Computing 19(1) (2020) 55-62.
[67] I. Khomytska, V. Teslyuk, A. Holovatyy, O. Morushko, Development of Methods, Models and Means for the Author Attribution of a Text, Eastern-European Journal of Enterprise Technologies 3/2 (93) (2018) 41–46.
[68] I. Khomytska, V. Teslyuk, Authorship Attribution by Differentiation of Phonostatistical Structures of Styles, in: Proceedings of the XIIIth Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT, Lviv, 2018, pp. 5–8.
[69] I. Khomytska, V. Teslyuk, The Software for Authorship and Style Attribution, in: Proceedings of the 15th International Conference on CADMS, Polyana, 2019, pp. 23–26.
[70] I. Khomytska, V. Teslyuk, Mathematical Methods Applied for Authorship Attribution on the Phonological Level, in: Proceedings of the XIVth Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT, Lviv, 2019, pp. 7–11.
[71] I. Khomytska, V. Teslyuk, L.
Bordyuk, The Kolmogorov-Smirnov's Test for Authorship Attribution on the Phonological Level, in: Proceedings of the IEEE 15th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT, 2020, pp. 259–262.
[72] I. Khomytska, V. Teslyuk, N. Kryvinska, V. Beregovskyi, The nonparametric method for differentiation of phonostatistical structures of authorial style, Procedia Computer Science 160 (2019) 38–45.
[73] Dataset of Blog Authorship Corpus. URL: https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus.
[74] Dataset of Spooky Author Identification. URL: https://www.kaggle.com/competitions/spooky-author-identification/data?select=train.zip.
[75] O. Prokipchuk, V. Vysotska, P. Pukach, V. Lytvyn, D. Uhryn, Y. Ushenko, Z. Hu, Intelligent Analysis of Ukrainian-language Tweets for Public Opinion Research based on NLP Methods and Machine Learning Technology, International Journal of Modern Education and Computer Science 15(3) (2023) 70–93.