An Automatic Author Profiling from Non-Normative Lithuanian Texts

Monika Briedienė, Vytautas Magnus University, Kaunas, Lithuania, monika.briediene@vdu.lt
Jurgita Kapočiūtė-Dzikienė, Vytautas Magnus University, Kaunas, Lithuania, jurgita.kapociute-dzikiene@vdu.lt

Copyright held by the author(s).

Abstract - This paper presents author profiling research done on Lithuanian texts using automatic machine learning methods. Our research is novel and challenging for the following reasons: 1) a large number of author profiling dimensions, i.e., gender, age, education, marital status and personality type; 2) very short (avg. ~ 24 tokens) non-normative texts; 3) the vocabulary-rich, highly inflective Lithuanian language. We have performed an experimental investigation that resulted in choosing the automatic author profiling methods (in particular, classifiers and feature types) that reach the highest accuracy on pure texts without any meta-information about their authors. Out of a number of experimentally investigated classifiers using lexical or symbolic features, the Naïve Bayes Multinomial method with the character n-gram feature type yielded the best performance, reaching 84.3%, 52.7%, 79.6%, 76.6% and 79.1% of accuracy in the gender, age, education, marital status and personality type detection tasks, respectively.

Keywords - gender detection, age detection, education detection, marital status detection, personality type detection, author profiling, the non-normative Lithuanian language, supervised machine learning

I. INTRODUCTION

In today's world, electronic texts outnumber paper texts several times over. However, the vast majority of these texts are written anonymously or pseudonymously. For this reason, court analysts, web forum administrators and social network supervisors increasingly face impersonation, bullying or harassment, disclosure of confidential information, dissemination of disinformation, and other issues. Uncovering the exact identity of a person is a very complicated and sometimes unsolvable task, whereas revealing his/her meta-information (i.e., demographic features such as age or gender) is easier, but still very useful. For example, revealing that a 50-year-old man is impersonating a 10-year-old girl may encourage the police to examine the data in more detail or even to take decisive action against the criminal offense. Manual monitoring of the Internet and manual text analysis are hardly possible, because they require enormous human resources. Thus, natural language processing technologies become the only practical solution for tackling such problems.

Author profiling experiments confirm that the authors' characteristics can be determined by analyzing the style of the text. This is possible due to the human stylome (an analogue of a genome), which allows each person to formulate sentences and express thoughts in his/her own unique way [1]. Similarly, many research studies claim that this phenomenon occurs not only in the style of an individual, but also in the style of groups sharing the same demographic characteristics (such as age, gender, education or marital status) or the same personality type.

In general, authorship identification has a long history dating back to 1887 [2], but its popularity has grown dramatically in the Internet era. Therefore author profiling, which is responsible for the automatic extraction of meta-information about an author (e.g., age [3], gender [4], psychological status [5], etc.), is nowadays an active and important research area. Author profiling research is mainly focused on the English language, whereas for the Lithuanian language it is a rather new subject. The age, gender and political views profiling tasks have been solved using parliamentary transcripts [6]; the age and gender profiling tasks have been solved using Lithuanian literary texts [17]. However, these research works were done on rather long (~ 217 tokens on average) and normative Lithuanian texts. The non-normative Lithuanian language (which is the object of research in this paper) is much more complicated: it is full of out-of-vocabulary words, jargon, foreign language insertions and neologisms. Besides, it suffers from omitted diacritics (ą, č, ę, ė, į, š, ų, ū, ž are often replaced with their ASCII equivalents). So far, author profiling on non-normative Lithuanian texts has been addressed for the gender dimension only [7]. Moreover, the education, marital status and personality type dimensions have never been addressed before on any type of Lithuanian texts. Consequently, the purpose of this paper is to fill this gap, i.e., to offer methods (classifiers, their parameters, and feature types) able to create automatic author profiles from short non-normative Lithuanian texts (Facebook posts, comments and messages).

The final goal of this research can be achieved after performing the following intermediate tasks: (1) a related work analysis (see Section II), (2) the construction of a representative corpus containing non-normative Lithuanian texts (see Section III), (3) an analytical selection of the most promising methods (see Section IV), (4) a precise experimental evaluation of the selected methods (see Section V). The conclusions
(recommendations) and future research plans for author profiling on short non-normative Lithuanian texts are given in Section VI.

II. RELATED WORKS

Many methods have been used to deal with the author profiling task. The existing approaches can be grouped according to the following criteria: the percentage of training instances in the dataset, the amount of information they provide (i.e., a recognition-training feedback) and the nature of knowledge. Based on these criteria, the approaches are [8]: rule-based, unsupervised machine learning, supervised machine learning, and similarity-based.

The obsolete rule-based methods use rules constructed by human experts. The development process itself is very difficult and requires linguistic competence. In addition, the rules are created for a specific solution and are therefore hardly transferable to new areas.

Unsupervised machine learning (i.e., clustering) is chosen when no meta-information (i.e., no training instances) is provided. Texts are grouped according to their similarity. The main disadvantage of these methods is that their grouping does not necessarily correspond to the grouping a human would expect. Because of their usually low accuracy, these methods are not popular in author profiling tasks.

If the texts are supplemented with the necessary meta-information about a particular author characteristic (the so-called class), supervised machine learning is one of the two best choices. Stylistic, lexical or symbolic text characteristics (the so-called features) are presented as the input. The classifier summarizes the training information and creates a model as its output. This model can afterwards be used for the author profiling of unseen texts. The main disadvantage of all supervised machine learning methods is that they require comprehensive and representative training data to create a reliable model. Their advantage is that they can be flexibly adjusted to new tasks or areas by adding new text samples and retraining the classifier. The deep learning methods [9] [10], which have recently become extremely popular for many text classification tasks, are also representatives of this group. The popularity of neural networks (Convolutional [10], Recurrent [9], etc.) has also been driven by technical progress, which has led to faster computing and the processing of huge amounts of data. Deep learning is used for the author profiling [10] and authorship attribution [9] tasks. Although deep learning methods are successfully applied in many natural language processing tasks, on smaller datasets (as in our paper) they underperform other supervised machine learning approaches, such as Support Vector Machines or Naïve Bayes Multinomial [11].

The similarity-based approaches (often researched and discussed separately) are by their nature very similar to the supervised machine learning approaches. The only difference is that instead of creating a model, they preserve all training instances and use similarity measures to determine to which of the available classes an incoming unseen instance is the most similar. An advantage of the similarity-based methods is that they keep the entire training set, so no information is lost during generalization.

The majority of the research on author profiling involves the popular supervised approaches (e.g., Naïve Bayes [12], Naïve Bayes Multinomial [13], Support Vector Machines [14]), similarity-based approaches (e.g., k-Nearest Neighbor), or comparative experiments showing the superiority of Naïve Bayes Multinomial and Support Vector Machines (as in [15]). Since these approaches are reported to be not only the most popular but also the most accurate for the author profiling tasks, we focus only on these types of methods in the rest of the paper.

When analyzing the Lithuanian non-normative texts, we follow the recommendations formulated for other languages. However, the language factor itself should also be taken into account. The Lithuanian language (used in our research) has a rich vocabulary, morphology and word derivation system, and a relatively free word order in a sentence. Although the Lithuanian language (especially non-normative) is rather complicated, some of these characteristics do not necessarily complicate our tasks: it may turn out that the investigated groups of individuals are bound to very different, but very representative, non-normative sentence structures or vocabularies.

III. CORPUS

Unfortunately, no author profiling benchmark corpora are available on the Internet for the non-normative Lithuanian language; therefore, in this research we use a corpus that was specifically created for our tasks. The corpus is composed of unprocessed posts (without any third-party texts) manually harvested from the Facebook social network in 2016-2017. Author profiling research for other languages mostly focuses on Twitter [15] rather than Facebook [16] texts. This is due to the convenient APIs that help with crawling tweets; besides, in some countries Twitter is more popular than Facebook. We have chosen the Facebook social network due to its popularity in Lithuania and the opportunity to store more demographic characteristics, such as education and marital status (not only age or gender), reported by the users themselves.

Our corpus contains posts, comments and messages of 200 individuals (for statistics see Figure 1), one text per person (to avoid the impact of authorship attribution on the author profiling results). 102 and 98 texts belong to women and men, respectively (see the Gender column in Figure 2). The youngest participant is 18 years old and the oldest is 78; the mean age of the respondents is ~ 36.9. The respondents are divided into six age groups (see the Age column in Figure 2). The selected grouping is used in surveys of psychologists, in social studies, and in the largest European and Lithuanian data archives. Besides, it is also used in similar research works [17], which makes our results more comparable to those previously reported for the Lithuanian language.

The education level of 105 and 95 respondents is higher and secondary, respectively (see the Education column in Figure 2). 114 and 86 individuals reported being married and single, respectively (see the Marital status column in Figure 2). 112 and 88 people described themselves as extrovert and introvert, respectively (see the Personality type column in Figure 2).

The corpus consists of 4,830 tokens in total (including in-vocabulary and out-of-vocabulary words, numbers, and non-normative "words" with embedded digits or punctuation). The shortest text (without symbols and emoticons) is only 2 tokens long and the longest is 161; the average length per text is only ~ 24 tokens.

Fig. 1. Percentage of posts, comments and messages in our corpus (pie chart with slices of 49%, 32% and 19%; the slice labels are not shown here).

Fig. 2. Distribution of respondents according to characteristics (columns: Gender, Age, Education, Marital status, Personality type).

IV. METHODOLOGY

The methodological part covers two main directions: 1) the proper selection of the classifier and 2) the proper selection of the feature type.

To find the best solution, we have analyzed the following classifiers of these groups:

• Supervised machine learning. A representative of this type is the Support Vector Machine (SVM) method (introduced by Cortes C. and Vapnik V. in 1995 [18]). It is a discriminative instance-based approach, currently considered one of the most popular text classification techniques. The method copes effectively with a huge number of features and with sparse feature vectors, and it does not need aggressive feature selection, which could result in the loss of valuable information and accuracy [19]. Other representatives are Naïve Bayes (NB) and its modification Naïve Bayes Multinomial (NBM) (introduced by Lewis D. D. and Gale W. A. in 1994 [20]). These techniques are generative profile-based approaches, often chosen for their simplicity and sufficiently high accuracy. The NB assumption of feature independence allows each parameter to be learned separately; these methods work especially well when the number of equally significant features is high; they are fast and do not require large data storage resources. Moreover, Bayesian methods often play a baseline role in the evaluation.

• Similarity-based. A representative of this type is the IBK method (introduced by Aha D. and Kibler D. in 1991 [21]). This nearest neighbors classifier chooses the appropriate k value by cross-validation after evaluating the distances between a testing instance and all samples in the training set. Another representative is the Kstar method (introduced by Cleary J. G. and Trigg L. E. in 1995 [22]). In contrast to IBK, Kstar does not use a conventional distance measure but a similarity function derived from an entropy-based distance, which distinguishes it from the other approaches of this type. Both of these methods store all available instances and therefore avoid information loss during training.

Our second research direction involved the proper selection of the feature type. In our experiments we have explored:

• Lexical feature types: token unigrams (n=1) (individual tokens) and token tetra-grams (n=4) (sequences of 4 tokens in a window sliding one token at a time). For instance, the phrase "author profiling from the Lithuanian texts" generates 6 unigrams ("author", "profiling", "from", "the", "Lithuanian", "texts") and 3 tetra-grams ("author profiling from the", "profiling from the Lithuanian", "from the Lithuanian texts").

• Character features, in particular character n-grams. Similarly to token n-grams, these are sequences of items, but instead of tokens they contain characters. For instance, the phrase "author profiling" generates the following document-level character 4-grams: "auth", "utho", "thor", "hor_", "or_p", "r_pr", etc. (where "_" marks the whitespace). It is important to mention that the value of n does not necessarily have to be fixed. E.g., with the interval n = [2,4] all bi-grams (n=2), trigrams (n=3), and tetra-grams (n=4) are generated and used as features.

V. EXPERIMENTS AND RESULTS

Our experiments were carried out on the corpus described in Section III using the methods and feature types described in Section IV.

We used the implementations of the methods integrated into the WEKA 3.8 machine learning toolkit (download from: http://www.cs.waikato.ac.nz/ml/weka/downloading.html). WEKA [23] allowed both the extraction of features and the selection of the classifier.

All experiments were performed using stratified 10-fold cross validation and evaluated with the accuracy (1) and f-score (2) metrics. The results are considered acceptable and reasonable if the achieved author profiling accuracy is above the majority (3) and random (4) baselines.

\[ \mathit{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{1} \]

\[ \mathit{F\_score} = \frac{2 \cdot tp}{2 \cdot tp + fp + fn} \tag{2} \]

Here tp (true positives) and tn (true negatives) denote the numbers of correctly classified instances (c_i recognized as c_i, and any other class c_j recognized as not c_i), whereas fp (false positives) and fn (false negatives) denote the numbers of incorrectly classified instances (any other c_j recognized as c_i, and c_i recognized as some other c_j), respectively.

\[ \max_i p_i \tag{3} \]

\[ \sum_i p_i^2 \tag{4} \]

Here p_i denotes the probability of the class c_i.

Our preliminary experiments involved the selection of the most accurate classification technique when using the word tokenizer with unigrams (n=1) (denoted as word1), the n-gram tokenizer with unigrams (n=1) (lex1) and tetra-grams (n=4) (lex4), the alphabetic tokenizer with unigrams (n=1) (alph1), and the character n-gram tokenizer with unigrams (n=1) (char1) and tetra-grams (n=4) (char4) (the best results are presented in Figures 3-7). The overall best results were achieved with the SVM and NBM methods and character n-grams (since the f-score values show the same trend as the accuracies, we do not present them in the figures). These methods also demonstrated the best performance in the author profiling tasks on the morphologically complex Arabic language [15].

In our later experiments we tuned the character n-gram parameter n while keeping the classifier fixed to either SVM or NBM (because these classifiers demonstrated the best performance in the classifier selection experiments). The results obtained for the different author profiling dimensions are reported in Figure 8.

The overall best results on the short non-normative Lithuanian texts in the gender detection task (0.843 of accuracy and 0.843 of f-score) were achieved with NBM and character n-grams of n = [6, 7] as the feature type. The best results on the age dimension (0.527 of accuracy and 0.473 of f-score) were achieved with NBM and character n-grams of n = [5, 5]. In the education detection task NBM and character n-grams of n = [5, 5] demonstrated the best performance, reaching 0.796 of accuracy and 0.796 of f-score. The experiments with the marital status showed the best results (0.766 of accuracy and 0.767 of f-score) with NBM and character n-grams of n = [6, 6]. The tests with the personality type confirmed the superiority of NBM again: the highest accuracy of 0.791 and f-score of 0.792 were achieved with character n-grams of n = [6, 6]. Thus, the Naïve Bayes Multinomial classifier and the feature types reported above are recommended for similar tasks and languages.

In contrast, the best previously reported age and gender profiling results on the normative Lithuanian language were achieved with the SVM classifier and lemma bi-grams as the feature type [17]. This is not surprising, bearing in mind that morphological tools (which deal with normative texts) were maximally helpful there. Besides, the second best feature type was also based on character n-grams. Although our best method achieved slightly higher accuracy than previously reported, a direct comparison is hardly possible due to the very different experimental conditions (datasets and their sizes, language types, text lengths, etc.).

In general, the gender detection task has been solved for a rather large group of languages, reaching ~ 80% and ~ 56.53% of accuracy on normative English in [4] and [24], respectively; 64.73% on Spanish blogs in [24]; and ~ 82.60% on Greek blogs [25]. On non-normative tweet texts the obtained accuracies are still surprisingly high, reaching, e.g., ~ 98% on Arabic in [15] and ~ 99% on English in [26]. However, the reported results, especially for the English language, vary widely (from ~ 56.53% in [24] to as much as ~ 99% in [26]). The age detection task has also been thoroughly researched for many languages, reaching 64.0%, 43.80% and 19.09% on English texts [24], [3], [27]; 64.30% and 37.50% on Spanish [24] [27]; 71.3% on Dutch [28]; and 80% on Chinese [29]. Research on the personality type is mostly done on the normative English language [5] and reaches ~ 58.2% of accuracy. Hence, the observed results are very different, due to the different test samples, methods, or chosen languages.

Fig. 3. Accuracies (in percentage) obtained with different classifiers solving the gender detection task (the upper horizontal line represents the majority baseline, the lower one the random baseline). Every column shows the best result obtained with a different feature type: word tokenizer & unigrams are denoted as word1, alphabetic tokenizer & unigrams as alph1, n-gram tokenizer & unigrams as lex1, n-gram tokenizer & tetra-grams as lex4, character n-gram tokenizer & unigrams as char1, character n-gram tokenizer & tetra-grams as char4.

Fig. 4. Accuracies (in percentage) obtained with different classifiers solving the age detection task.

Fig. 5. Accuracies (in percentage) obtained with different classifiers solving the education detection task. For the other notations see Fig. 3.

Fig. 6. Accuracies (in percentage) obtained with different classifiers solving the marital status detection task. For the other notations see Fig. 3.

Due to the very different experimental conditions (different datasets, methods and language types), these results are hardly comparable with each other, and they are also hardly comparable with the results obtained in our research work.
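As a rough illustration of two ingredients used above, the following Python sketch generates document-level character n-grams for an interval n = [n_min, n_max] and evaluates the majority (3) and random (4) baselines from the class counts given in Section III. It is only a sketch under stated assumptions, not the WEKA pipeline used in the experiments: the function names (`char_ngrams`, `majority_baseline`, `random_baseline`, `compare_with_baselines`) are hypothetical, whitespace is rendered as "_" only to mirror the notation of Section IV, and the age dimension is left out because the sizes of the six age groups are shown only in Fig. 2.

```python
def char_ngrams(text, n_min, n_max):
    """Return document-level character n-grams for every n in [n_min, n_max].

    Spaces are rendered as "_" to mirror the notation of Section IV.
    """
    marked = text.replace(" ", "_")
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n one character at a time.
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def majority_baseline(class_counts):
    """Baseline (3): share of the largest class, max(p_i)."""
    total = sum(class_counts)
    return max(class_counts) / total


def random_baseline(class_counts):
    """Baseline (4): sum of squared class probabilities, sum(p_i ** 2)."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


# Class counts as reported in Section III (age omitted, see the note above).
CLASS_COUNTS = {
    "gender": (102, 98),
    "education": (105, 95),
    "marital status": (114, 86),
    "personality type": (112, 88),
}

# Best accuracies as reported in Section V.
BEST_ACCURACY = {
    "gender": 0.843,
    "education": 0.796,
    "marital status": 0.766,
    "personality type": 0.791,
}


def compare_with_baselines():
    """Map each dimension to (majority, random, accuracy, above_both)."""
    rows = {}
    for dimension, counts in CLASS_COUNTS.items():
        majority = majority_baseline(counts)
        random_ = random_baseline(counts)
        accuracy = BEST_ACCURACY[dimension]
        rows[dimension] = (majority, random_, accuracy,
                           accuracy > max(majority, random_))
    return rows


if __name__ == "__main__":
    # First six 4-grams of the example phrase from Section IV.
    print(char_ngrams("author profiling", 4, 4)[:6])
    for dimension, (majority, random_, accuracy, ok) in compare_with_baselines().items():
        print(f"{dimension}: majority={majority:.4f} random={random_:.4f} "
              f"accuracy={accuracy:.3f} above_both={ok}")
```

For the gender dimension, for example, the counts 102 and 98 give a majority baseline of 0.51 and a random baseline of 0.5002, so the reported accuracy of 0.843 lies well above both.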
Fig. 7. Accuracies (in percentage) obtained with different classifiers solving the personality type detection task. For the other notations see Fig. 3.

Fig. 8. The best summarized accuracies (in percentage) for the different profiling dimensions.

VI. CONCLUSION AND FUTURE WORKS

In this paper we report author profiling results on short (avg. ~ 24 tokens only) Lithuanian non-normative texts harvested from the Facebook social network. During our research we investigated the most popular supervised machine learning (Naïve Bayes, Naïve Bayes Multinomial, Support Vector Machine) and similarity-based (IBK, Kstar) techniques together with various lexical and character feature types.

The best results on the 1) gender (84.3% of accuracy), 2) age (52.7%), 3) education (79.6%), 4) marital status (76.6%) and 5) personality type (79.1%) dimensions were achieved with 1) Naïve Bayes Multinomial and character n-grams of n = [6, 7]; 2) Naïve Bayes Multinomial and character n-grams of n = 5; 3) Naïve Bayes Multinomial and character n-grams of n = 5; 4) Naïve Bayes Multinomial and character n-grams of n = 6; 5) Naïve Bayes Multinomial and character n-grams of n = 6.

In future research our focus on non-normative Lithuanian texts remains. We are planning to enlarge our author profiling corpus and to test different deep learning approaches on it.

REFERENCES

[1] H. Van Halteren, R. H. Baayen, F. Tweedie, M. Haverkort, A. Neijt, "New Machine Learning Methods Demonstrate the Existence of a Human Stylome," Journal of Quantitative Linguistics, 2005.
[2] T. C. Mendenhall, "The Characteristic Curves of Composition," Science, 1887.
[3] J. Schler, M. Koppel, S. Argamon, J. Pennebaker, "Effects of Age and Gender on Blogging," American Association for Artificial Intelligence, 2006.
[4] M. Koppel, S. Argamon, A. R. Shimoni, "Automatically Categorizing Written Texts by Author Gender," Literary and Linguistic Computing, pp. 401-412, November 2002.
[5] S. Argamon, S. Dhawle, M. Koppel, J. Pennebaker, "Lexical Predictors of Personality Type," Joint Annual Meeting of the Interface and the Classification Society of North America, June 2005.
[6] J. Kapočiūtė-Dzikienė, L. Šarkutė, A. Utka, "Automatic author profiling of Lithuanian parliamentary speeches: exploring the influence of features and dataset sizes," Human Language Technologies - The Baltic Perspective, pp. 99-106, 2014.
[7] M. Briedienė, J. Kapočiūtė-Dzikienė, "An automatic gender detection from non-normative Lithuanian texts," CEUR-WS, Kaunas, 2017.
[8] E. Stamatatos, "A Survey of Modern Author," Journal of the American Society for Information Science and Technology, 2009.
[9] D. Bagnall, "Author identification using multi-headed recurrent," PAN 2015, 2015.
[10] S. Sierra, M. Montes-y-Gómez, T. Solorio, F. A. González, "Convolutional Neural Networks for Author Profiling," Notebook for PAN at CLEF 2017, 2017.
[11] E. A. Zanaty, "Support Vector Machines (SVMs) versus Multilayer Perception (MLP) in data classification," Mathematics Dept., Computer Science Section, Faculty of Science, Sohag University, Sohag, Egypt, 2012.
[12] M. Meina, K. Brodzinska, B. Celmer, M. Czokow, M. Patera, J. Pezacki, M. Wilk, "Ensemble-based classification for author profiling using," Notebook for PAN at CLEF 2013, 2013.
[13] T. Raghunadha Reddy, B. Vishnu Vardhan, P. Vijayapal Reddy, "Profile specific Document Weighted approach using a New Term Weighting Measure for Author Profiling," International Journal of Intelligent Engineering and Systems, pp. 136-146, December 2016.
[14] F. Rangel, F. Celli, P. Rosso, M. Potthast, B. Stein, W. Daelemans, "Overview of the 3rd Author Profiling Task at PAN 2015," 2015.
[15] E. AlSukhni, Q. Alequr, "Investigation the Use of Machine Learning Algorithms in Detecting Gender of the Arabic Tweet Author," International Journal of Advanced Computer Science and Applications, 2016.
[16] M. Fatima, K. Hasan, S. Anwar, R. M. A. Nawab, "Multilingual author profiling on Facebook," Information Processing & Management, pp. 886-904, July 2017.
[17] J. Kapočiūtė-Dzikienė, A. Utka, L. Šarkutė, "Authorship Attribution and Author Profiling of Lithuanian Literary Texts," Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing, pp. 96-105, September 2015.
[18] C. Cortes, V. Vapnik, "Support-Vector Networks," Machine Learning, pp. 273-297, 1995.
[19] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," European Conference on Machine Learning, pp. 137-142, 1998.
[20] D. D. Lewis, W. A. Gale, "A Sequential Algorithm for Training Text Classifiers," SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 3-12, July 1994.
[21] D. Aha, D. Kibler, "Instance-based learning algorithms," Machine Learning, pp. 37-66, 1991.
[22] J. G. Cleary, L. E. Trigg, "K*: An Instance-based Learner Using an Entropic Distance Measure," In Proceedings of the 12th International Conference on Machine Learning, 1995.
[23] WEKA, 2016. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.
[24] K. Santosh, R. Bansal, M. Shekhar, V. Varma, "Author Profiling: Predicting Age and Gender from Blogs," Notebook for PAN at CLEF 2013, 2013.
[25] G. K. Mikros, "Authorship Attribution and Gender Identification in Greek Blogs," Methods and Applications of Quantitative Linguistics, pp. 21-32, 2012.
[26] Z. Miller, B. Dickinson, W. Hu, "Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features," International Journal of Intelligence Science, pp. 143-148, 2012.
[27] J. Marquard, G. Farnadi, G. Vasudevan, M-F. Moens, S. Davalos, A. Teredesai, M. De Cock, "Age and Gender Identification in Social Media," CLEF 2014 working notes; PAN 2014, 2014.
[28] C. Peersman, W. Daelemans, L. Van Vaerenbergh, "Predicting Age and Gender in Online Social Networks," SMUC '11 Proceedings of the 3rd international workshop on Search and mining user-generated contents, pp. 37-44, 2010.
[29] Li Chen, Tieyun Qian, Fei Wang, Zhenni You, Qingxi Peng, Ming Zhong, "Age Detection for Chinese Users in Weibo," WAIM 2015: Web-Age Information Management, 2015.
[30] A. Venckauskas, A. Karpavicius, R. Damaševičius, R. Marcinkevičius, J. Kapočiūtė-Dzikienė, C. Napoli, "Open class authorship attribution of Lithuanian Internet comments using one-class classifier," In Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 373-382, 2017.
[31] M. Wróbel, J. T. Starczewski, C. Napoli, "Handwriting recognition with extraction of letter fragments," In International Conference on Artificial Intelligence and Soft Computing, pp. 183-192, 2017.