=Paper=
{{Paper
|id=Vol-3090/paper28
|storemode=property
|title=Defining Kinds of Violence in Russian Short Stories of 1900–1930: A Case of Topic Modelling With LDA and PCA
|pdfUrl=https://ceur-ws.org/Vol-3090/paper28.pdf
|volume=Vol-3090
|authors=Ekaterina Gryaznova,Margarita Kirina
|dblpUrl=https://dblp.org/rec/conf/ims2/GryaznovaK21
}}
==Defining Kinds of Violence in Russian Short Stories of 1900–1930: A Case of Topic Modelling With LDA and PCA==
Defining Kinds of Violence in Russian Short Stories of 1900–1930: A Case of Topic Modelling With LDA and PCA

Ekaterina Gryaznova and Margarita Kirina

National Research University Higher School of Economics, 123 Griboyedova emb., St. Petersburg, 190068, Russia

Abstract
This paper discusses the problem of defining subthemes in literary texts about violence of different kinds, drawn from the Corpus of Russian short stories of the first third of the 20th century. It considers the results of topic modelling with Latent Dirichlet Allocation (LDA), which is used to reveal various kinds of violence, and of principal component analysis (PCA), which is used to compare stories by their level of 'violent lexis saturation'. The experiment, based on short stories that depict violence and death, demonstrates that topic modelling did not detect internal subthemes but did group together stories with similar plots. The LDA algorithm appears to unveil some semantically related episodes of the texts, though this is not always sufficient for a complete interpretation of the resulting topics. The PCA method, on the other hand, successfully distinguishes between the themes of death, execution, and murder. The research confirms that literary works are indeed rather difficult objects for automatic theme detection. In the case of fiction, the explicitness of a theme appears to be a crucial factor in the success of both the LDA and the PCA methods. The authors suggest that, for a more comprehensive analysis of fictional texts, several methods should be applied at the same time.

Keywords
Computational linguistics, machine learning, text mining, violence, Russian fiction, topic modelling, principal component analysis, latent Dirichlet allocation, literary corpus, literature studies

IMS 2021 - International Conference "Internet and Modern Society", June 24-26, 2021, St. Petersburg, Russia
EMAIL: esgryaznova@edu.hse.ru (A. 1); mkirina2412@gmail.com (A. 2)
ORCID: 0000-0001-9844-2664 (A. 1); 0000-0002-7381-676X (A. 2)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Violence is considered an intrinsic part of human interactions in periods when various confrontations, be they social, political or historical, take place; indeed, it lies at the foundation of a majority of social conflicts. As literature is, according to some interpretations [4; 5], a reflection of human experience, it often takes violence as its theme. Although violence is an intercultural phenomenon reflected in a wide variety of texts, defining a 'violent' text is a rather challenging task. According to Reimer, texts determined by this theme "are often assumed by critics of media and literature to be those texts that depict acts of injurious physical force" [15, p. 102]. Though the description of violent acts through lexis is a crucial part of a narrative about violence, there are complications: a violent act is not necessarily placed in the text in an obvious way and is more likely to stay hidden in the rhetorical structures of the story [17, p. 2].

As one of its recurring themes, violence reappears throughout the Corpus of Russian short stories of the first third of the 20th century [10], owing to the specific period in which the texts were written. The beginning of the 20th century in Russia was marked by a number of violent historical events, such as the Russo-Japanese War, World War I, the February and October Revolutions, and the subsequent Civil War.
At the same time, cruel stories include not only examples of socially-induced violent acts and their consequences (death, murder, execution, rape), but also cases of, for instance, cruelty to animals or psychological pressure exerted on other characters. The common feature of all these forms of violence is that they are not necessarily presented explicitly in the episodes of the stories.

This paper aims to explore the theme of violence in the Russian short story of the early 20th century. To investigate the diversity and intensity of violence, we perform topic modelling using Latent Dirichlet Allocation (LDA), one of the most popular and quite effective topic modelling algorithms. Then, in order to scale the explicitness of the violence narrative in the stories under consideration, we apply principal component analysis (PCA) based on a manually compiled list of violent lexis. This research continues the ongoing study of automatic thematic annotation of literary texts on the basis of the Corpus of Russian short stories of 1900–1930 described in [18; 19; 22]. For that reason, it also presents a comparison between the human assessment of literary texts that exploit violence as a theme and the results of theme extraction obtained with computational methods.

2. Data description and preprocessing

The experiment is performed on a part of the annotated subcorpus of the Corpus of Russian short stories of the first third of the 20th century, which includes 310 texts written by 300 different authors with a total of almost 1,000,000 words [10; 11; 12]. The thematic annotation of the subcorpus was done manually by an expert and is described in [20]. As a result, the initial mark-up of 89 themes was normalized and a list of 30 tags was obtained (for details see [18]).

For the present analysis, a 'violence' subcorpus was compiled. It contains 115 texts by 115 Russian writers with the following distribution over the historical periods suggested for the Corpus:
* I period: early 20th century (1900–1913) – 41 stories;
* II period: World War I, the February and October Revolutions, and the Civil War (1917–1922) – 40 stories;
* III period: early Soviet period (1923–1930) – 34 stories.
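The subcorpus compilation itself amounts to a tag-filtering step. Below is a minimal sketch of it, assuming a hypothetical metadata table `corpus_meta` with one row per story and illustrative columns `tags` (a list-column of expert tags) and `year`; this is not the authors' code, only an approximation of the procedure described above.

```r
library(dplyr)

# Keep stories carrying the 'violence' or 'death' tag (hypothetical column names).
violence_meta <- corpus_meta %>%
  filter(sapply(tags, function(t) any(c("violence", "death") %in% t))) %>%
  mutate(period = case_when(
    year >= 1900 & year <= 1913 ~ "I: 1900-1913",
    year >= 1917 & year <= 1922 ~ "II: 1917-1922",
    year >= 1923 & year <= 1930 ~ "III: 1923-1930"
  ))

count(violence_meta, period)  # expected: 41, 40 and 34 stories respectively
```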
The selection of these specific stories was not random: all of them are united by the tags violence and death. The tag death was also chosen because, firstly, it often occurs in the same stories and, secondly, death naturally presents a resolution of a violent conflict. Besides, the vast majority of texts about death refer to cases of unnatural death, mainly violent or self-inflicted. In addition, stories with non-violent types of death pose another interesting challenge: will LDA distinguish them as a separate group? Given the tag's nature, similarly to categories, it can cover multiple themes. Thus, the tag violence includes the following thematic elements: rape, cruelty, and blood (3), while the tag death unites death from gunshot wounds (during the war or on the barricade), death from natural causes (including epidemics and thoughts about death), execution (by shooting as well, and fear of death), sudden and accidental death, suicide, and murder (not at war) (6). The total number of themes suggested by the expert thus equals 9.

It should also be mentioned that some of the stories not only belong to both groups represented by the tags violence and death, but are also described by several themes. For instance, The Seven Who Were Hanged (Rasskaz o semi poveshennykh) by L. Andreev is one such story and, moreover, arguably the most violent text in the subcorpus, as it is labelled with four themes in total: execution, death from natural causes, cruelty, and murder (not at war). Thematic density thus varies from story to story. Another peculiarity of the thematic mark-up in the given corpus is that the stories can, at some point of the narrative, develop themes not related to violence or death at all. One story, for example Matter (Materiya) by M. Krinitskij, carries not only the tag violence, but also such tags as relations, love, sins, and nature. This tendency raises the question of whether the number of themes a story carries can hinder the successful detection of the ones under study.

As regards preprocessing, the texts were tokenized and lemmatized with automatic contextual disambiguation and POS-tagging by MyStem [16]. The total number of tokens is 426,778. Then stop words were removed, together with words specific to fiction that introduce direct speech, such as skazat' (to say), govorit' (to speak), otvechat' (to answer), sprashivat' (to ask), dumat' (to think), and so on, as well as the most frequent character names. The resulting tidy data set contains 228,745 tokens.

3. Topic modelling with LDA

3.1. Determination of the number of topics

Topic modelling is commonly used to detect clusters of semantically connected words in various corpora [13; 14]; a topic thus corresponds to a cluster of texts that share similar content. Topic modelling is widely applied to large collections of texts, mainly non-fictional, where the quantity and quality of topics are easier to determine, since such collections contain far fewer specific and implicit themes than literary works do [1; 3]. One of the most popular algorithms for topic modelling is Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic model [2]. Roughly speaking, it represents each document in the data as a mixture of latent topics. For topic modelling, the LDA implementation from the R package 'topicmodels' was chosen [6].

After testing different numbers of topics, it was noted that the larger the number, the more fine-grained the resulting topics become. On closer inspection, the 20-topic model appeared to capture mainly individual texts rather than groups of texts, so its topics were too detailed and difficult to interpret. This problem is similar to the one described in [21]: since "the highest coherence value does not seem to necessarily correspond to the quality of topics", it was decided to limit the number of topics [ibid., p. 65]. To improve the quality of the topics and to follow the suggestion made in [18] to experiment with the number of topics, the number of topics was set to the value deduced from the expert annotation for the chosen group of texts – 9.
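To make the pipeline concrete, the following is a minimal sketch of the steps from lemmatized text to a fitted 9-topic model with the 'topicmodels' and 'tidytext' packages. It assumes a hypothetical data frame `stories` with one lemma per row (columns `doc_id` and `lemma`, transliterated here for readability); the object names and the stop-word vector are illustrative, not the authors' code.

```r
library(dplyr)
library(tidytext)     # cast_dtm(), tidy()
library(topicmodels)  # LDA(), perplexity()

# Speech verbs and other fiction-specific stop words mentioned above (abridged).
extra_stops <- c("skazat'", "govorit'", "otvechat'", "sprashivat'", "dumat'")

# Document-term matrix of lemma counts per story.
dtm <- stories %>%
  filter(!lemma %in% extra_stops) %>%
  count(doc_id, lemma) %>%
  cast_dtm(document = doc_id, term = lemma, value = n)

# One could first compare models for several numbers of topics,
# e.g. by perplexity (lower is better), before settling on k = 9:
# for (k in c(5, 9, 15, 20)) print(perplexity(LDA(dtm, k = k, control = list(seed = 42))))

lda_9 <- LDA(dtm, k = 9, control = list(seed = 42))

# Highest-weight terms per topic (the "beta" matrix), as reported in Table 1.
top_terms <- tidy(lda_9, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()
```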
3.2. Evaluation of the model with expert annotation and stories per topic distribution

Dealing with short texts, especially ones that often include other themes even though they share the same thematic tags (namely violence and death), we still face certain difficulties. As can be seen from Table 1, which lists the highest-weight words suggested for each topic, in some cases, for instance topics 7 and 9, it is challenging to establish a semantic connection between the terms, let alone assign a name, even when relying on the list of expert themes. That is why, in order to name the topics, we used the corresponding thematic elements from the expert annotation, taking into account not only their distribution within a topic but also their frequencies. Further evaluation of the topics' quality was conducted against the expert themes, as first suggested in [18]. We also looked into the stories that were clustered together on the basis of the per-document topic distributions. The most frequent occurrences of the themes and the highest-ranked stories for each topic are presented in Table 1 below.

Table 1. Distribution of themes and stories per topic. Each entry lists the highest-weight topic terms, the thematic elements from the expert annotation with their frequency (%), and the highest-ranked stories.

Topic 1 "VIOLENCE TOWARDS WOMEN"
* Topic terms: den' (day), god (year), dusha (soul), noch' (night), vremya (time), pis'mo (letter), hotet' (to want), zhenshchina (woman), uhodit' (to leave)
* Thematic elements (freq., %): death from natural causes (28.6), suicide (28.6), cruelty (14.3), rape (7.1)
* Stories of the highest rank: Too Late (Pozdno) by A. Verbitskaya; The Platform 10 (Platforma 10) by L. Charskaya; The Rooms in Kirochnaya Street (Nomera na Kirochnoj) by F. Bogrov

Topic 2 "NON-WAR MURDER"
* Topic terms: dom (house), den' (day), starik (old man), ubivat' (to kill), delo (matter/case), hotet' (to want), hod (move), tolpa (crowd), vdrug (suddenly), ulitsa (street)
* Thematic elements (freq., %): murder (not at war) (26.7), cruelty (20.0), suicide (13.3), death from gunshot wounds (13.3)
* Stories of the highest rank: The Chess (Shakhmaty) by Ya. Braun; The Burning Days (Ognennye dni) by A. Gorelov; Riot (Bunt) by L. Lunts

Topic 3 "DEATH AT WAR"
* Topic terms: zemlya (ground), den' (day), belyj (white), stoyat' (to stand), chjornyj (black), soldat (soldier), muzhik (man), loshad' (horse), doroga (road), storona (side)
* Thematic elements (freq., %): cruelty (23.5), execution (17.6), death from gunshot wounds (17.6), murder (not at war) (17.6)
* Stories of the highest rank: The Sharashka Bureau (Sharashkina kontora) by B. Guber; The Earth Shakes (Zemnoj tryas) by A. Kargopolov; The Outhouse (Fligel') by A. Karavaeva

Topic 4 "DOMESTIC VIOLENCE"
* Topic terms: buryj (fulvous), pojti (to go), syn (son), hotet' (to want), stojat' (to stand), rebjonok (child), batjushka (priest), soldat (soldier), golos (voice), krichat' (to scream)
* Thematic elements (freq., %): cruelty (35.7), suicide (21.4), death from gunshot wounds (21.4)
* Stories of the highest rank: The Fulvous (Buryj) by M. Chernokov; A Nightmare (Koshmar) by Gusev-Orenburgsky; The Barricade (Barricada) by G. Yablochkov

Topic 5 "UNEXPECTED DEATH AND ILLUSIONS"
* Topic terms: starik (old man), vremja (time), stojat' (to stand), zemlja (ground), dver' (door), videt' (to see), golos (voice), kazatsya (to seem), chjornyj (black), voda (water)
* Thematic elements (freq., %): cruelty (21.4), sudden death (21.4), death from natural causes (14.3), suicide (14.3), murder (not at war) (14.3)
* Stories of the highest rank: Rioters (Buntovshchiki) by P. Semynin; The Trophy (Nagrada) by N. Anov; The Forgotten Colliery (Zabytyj rudnik) and Two Bloods (Dva krovnika) by L. Pasynkov

Topic 6 "SUDDEN DEATH"
* Topic terms: vdrug (suddenly), den' (day), smert' (death), hotet' (to want), slovo (word), volk (wolf), kazatsya (to seem), nachinat' (to start), chas (hour), noch' (night)
* Thematic elements (freq., %): death from gunshot wounds (25.0), cruelty (18.8), murder (not at war) (18.8), death from natural causes (12.5), execution (12.5)
* Stories of the highest rank: The Seven Who Were Hanged (Rasskaz o semi poveshennykh) by L. Andreev; The Silent Valley (Gluchaja pad') by L. Ulin; The Wolves (Volky) by L. Zinovyeva-Annibal
Topic 7 "NATURAL DEATH"
* Topic terms: den' (day), drug (friend), kazatsya (to seem), vdrug (suddenly), stojat' (to stand), golos (voice), tolpa (crowd), vremja (time), komnata (room), tjomnyj (dark)
* Thematic elements (freq., %): death from natural causes (23.1), sudden death (23.1), execution (15.4), death from gunshot wounds (15.4)
* Stories of the highest rank: In the Circus (V cyrke) by A. Kuprin; In the Crowd (V tolpe) by F. Sologub; From Another World (Iz drugogo mira) by V. Orlovsky

Topic 8 "LIFE IN PRISON"
* Topic terms: davat' (to give), pojti (to go), delo (case), hotet' (to want), lager' (camp), den' (day), prihodit' (to come), russkij (Russian), zhit' (to live), sidet' (to sit)
* Thematic elements (freq., %): death from natural causes (33.3), death from gunshot wounds (22.2)
* Stories of the highest rank: Behind the Barbed Wire (Za koluchej provolkoj) by K. Levin; How Ivan Spent Time (Kak Ivan provel vremja) by S. Podyachev; The Bad Hat (Neputevyj) by E. Zamyatin

Topic 9 "CRUEL DEATH"
* Topic terms: lipa (Lipa), pojti (to go), delo (case), ded (grandfather), hotet' (to want), bolshoj (big), vyhodit' (to exit), dver' (door), vdrug (suddenly), zemlja (ground)
* Thematic elements (freq., %): execution (23.1), murder (not at war) (23.1), death from natural causes (15.4), cruelty (15.4), suicide (15.4)
* Stories of the highest rank: Savel Semenych (Savel Semenych) by K. Fedin; In the Quiet Corner (V tikhom uglu) by E. Fedorov; Communist (Kommunistka) by A. Tyukhanov

The words that compose the clusters do not differ greatly between the topics. There are, however, a few cases where certain words stand out, for example the nouns that name the places where the action unfolds: dom (house), komnata (room), ulitsa (street), lager' (camp). Topic 8, for instance, seems to have gathered stories that describe prisons and labour camps: the Russian word sidet' (to sit) has the secondary meaning of serving a prison term, and the word lager' (camp) reinforces that theme. The word russkij (Russian) points to the topic of international relations in prisons and camps that can be found in the stories of this group. Topic 1, on the other hand, probably reflects the themes of rape or cruel behaviour towards women at some point of the narrative. After examining the highest-ranked stories contributing to this topic, it would be more accurate to say that all of them present a woman as the central figure, whereas the stories below the third rank deal with other kinds of violence, namely death from natural causes and suicide (or a suicide attempt). A similar pattern is found in topics 5 and 9. It is possible that these kinds of death are only weakly represented in the lexis of the stories, which makes them hard to detect.

Moreover, we deliberately did not exclude verbs from the data, although this procedure is recommended for improving the model [9]. Our assumption was that violence is a theme that presupposes 'active' lexis, so verbs such as to kill, to murder or to rape were expected to appear among the highest-probability terms within the topics. However, the words most helpful for interpretation turned out to be nouns. Interestingly, the same tendency is observed for principal component analysis, which is discussed in the next section.
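Before moving on to PCA, the per-document topic probabilities behind the "stories of the highest rank" entries in Table 1 can be extracted from the fitted model in the same tidy fashion. The sketch below continues from the hypothetical `lda_9` object introduced earlier and is illustrative only.

```r
library(dplyr)
library(tidytext)

# Per-document-per-topic probabilities (the "gamma" matrix).
story_topic <- tidy(lda_9, matrix = "gamma")

# Three highest-ranked stories for each topic, as in Table 1.
top_stories <- story_topic %>%
  group_by(topic) %>%
  slice_max(gamma, n = 3) %>%
  ungroup() %>%
  arrange(topic, desc(gamma))
```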
4. Scaling violence with PCA

4.1. Detection of violent lexis

Principal component analysis (PCA) is an unsupervised machine learning method that reduces the dimensionality of data without losing much statistical information [7]. Textual data often contains variables that either correlate strongly with each other or show little variation, and such variables are of limited use for research. PCA reduces the size of the data by creating new variables that represent it while retaining only the important information. It also helps visualize the correlations between variables, so the method works well for finding dependencies in the data. Unlike LDA, it does not detect deep semantic connections; nevertheless, PCA can be used to scale the explicitness of the violence narrative in the given subcorpus. The PCA implementation used in this research is taken from the R package 'factoextra' [8].

A list of violent words was compiled manually, taking into account cases specific to the period in question: ubit' (to kill), ubivat' (to kill), bit' (to beat), izbit' (to beat up), izbivat' (to beat up), pribit' (to beat to death), dushit' (to choke), pridushit' (to choke to death), udushit' (to choke to death), strelyat' (to shoot), zastrelit' (to kill by shooting), rasstrelyat' (to kill by shooting), pristrelit' (to kill by shooting), rasstrelivat' (to kill by shooting), zarezat' (to slaughter), topit' (to drown), utopit' (to kill by drowning), smert' (death), nasilije (violence), nasilovat' (to rape), iznasilovat' (to rape), pytat' (to torture), pytka (torture), prikonchit' (to kill), rasstrel (shooting), kazn' (execution), krov' (blood), udarit' (to hit), udaryat' (to hit), nasilstvennyj (violent), terror (terror), terrorizirovat' (to terrorize), prigovor (sentence), viselitsa (gallows).

Figure 1: Distribution of violent lexis

According to Figure 1, the lemmas whose usage differs most from that of all the other words are smert' (death), krov' (blood) and kazn' (execution). This means that these words appear more often in certain texts, and that is what distinguishes one text from another. Other words, such as bit' (to beat), udaryat' (to hit) and strelyat' (to shoot), do not stand out. Ubivat' (to kill) does stand out, yet contributes little to distinguishing any particular story; this may mean that the word is simply used more often in general rather than being specific to a certain story. The stories are expected to follow the same pattern. What is more, although the words chosen for the list are mostly verbs, the most striking results, as the graph shows, were again obtained, with the exception of ubivat' (to kill), from nouns. It seems that, despite the active role of the characters who display violent behaviour, nouns contribute the most to the quality of both the LDA model and the PCA.
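A minimal sketch of this step is given below, under the same assumptions as before: `stories` holds one lemma per row, `violent_lexis` is the manually compiled dictionary (abridged here), the PCA itself is computed with base R `prcomp()`, and the plot is drawn with 'factoextra'. The column and object names are illustrative, not the authors' code.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(factoextra)

# Abridged subset of the Section 4.1 dictionary, for illustration only.
violent_lexis <- c("ubit'", "ubivat'", "bit'", "strelyat'", "smert'",
                   "krov'", "kazn'", "rasstrel", "prigovor", "viselitsa")

# Story-by-lemma count matrix restricted to the violent dictionary.
violence_counts <- stories %>%
  filter(lemma %in% violent_lexis) %>%
  count(doc_id, lemma) %>%
  pivot_wider(names_from = lemma, values_from = n, values_fill = 0) %>%
  column_to_rownames("doc_id")

# PCA on scaled counts; zero-variance columns would need to be dropped first.
pca <- prcomp(violence_counts, scale. = TRUE)

# Stories and violent lemmas in one biplot (cf. Figures 2 and 3).
fviz_pca_biplot(pca, repel = TRUE)
```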
4.2. Degree of violence within a story

One practical drawback of the PCA here is that its plots do not remain readable with many data points; therefore, the graphs show only a few stories.

Figure 2: Degree of violent lexis in stories: a few examples

As can be seen from Figure 2, the most 'violent' stories are The Seven Who Were Hanged (Rasskaz o semi poveshennykh) by L. Andreev and Two Bloods (Dva krovnika) by L. Pasynkov, which means that both contain more violent words than any other story. However, they differ in the kind of violence they describe: The Seven Who Were Hanged is strongly associated with the word smert' (death), while Two Bloods is associated with the word krov' (blood) on the other side of the graph.

The stories Viper (Gadyuka) by A. Tolstoy and The Sharashka Bureau (Sharashkina kontora) by B. Guber also contain a lot of violent lexis, but they are not unique in the kind of lexis they contain. Interestingly, the story Blood of a Working Man (Krov' rabochego) by P. Arsky, which has a word from the dictionary in its very title, does not stand out, which means that the word krov' (blood) does not help distinguish this story from the others. Meanwhile, Two Bloods (Dva krovnika) by L. Pasynkov is separated from the group, yet krov' (blood) in this story does not stand for the violent component only: as follows from the plot, the story is about two brothers related by blood and their blood enemy. Although violent episodes and conversations between characters, including, for instance, the spilling of blood, do take place, the word blood here may carry several meanings. Examples like this one illustrate the problem of explicitness in the stories' narratives.

Figure 3: Correlation between lexis and stories

One thing the PCA did excel at, as is best seen in Figure 3, is distinguishing the theme of execution (prigovor (sentence), kazn' (execution), smert' (death) and viselitsa (gallows)) from the other topics, one of which is possibly a distinct topic of murder (ubivat' (to kill), krov' (blood)); however, ubivat' (to kill) does not correlate with any specific story. Possibly, the narrative of execution tends not to describe blood and gore, while the narrative of murder does.

To sum up the PCA results, the method is an efficient tool for measuring the explicitness of topics in a text, in this case violence, since it not only identifies the most explicit stories but also distinguishes death from execution and murder, although it did not separate death from violence. It did not detect other violent acts or causes of death that are hidden in the rhetorical structures and are less explicit.

5. Conclusion

This research has shown that, despite the homogeneous nature of the subcorpus, the LDA and PCA algorithms are able to detect different violent acts, although with some restrictions in terms of their diversity. Topic modelling was able to capture some common plot-related features of the stories, while the PCA made it possible to single out two stories that extensively describe two kinds of violence.

On the whole, the analysis of the LDA model showed that the most probable words for each topic did not represent any violent acts. One possible explanation concerns the data itself, namely the fact that some stories fall into several categories at once, which may complicate the detection of the themes of interest. Another reason for the unsatisfactory results is that the stories, being short, do not cover each of the subthemes comprehensively, so the subthemes are not richly expressed in the lexis of the texts. On the other hand, it turns out that LDA was able to cluster together stories with similar plot details or characters (in terms of gender or social status). For instance, topic 1 unites stories in which a woman is the main character, while the depiction of rape, or of any sexual act at all, is not necessarily present. It appears that, in literature, topic modelling identifies common structures that occur across texts, but these structures do not always constitute their themes.
Since the PCA works with variables, it performed better: the difference between death by execution and murder was detected, and it can be juxtaposed with particular stories. Thus, the most explicitly violent stories – Two Bloods (Dva krovnika) by L. Pasynkov and The Seven Who Were Hanged (Rasskaz o semi poveshennykh) by L. Andreev – are about murder and execution, respectively. What is more, a comparison of the PCA results with the per-document-per-topic probabilities of the LDA sheds light on some interesting tendencies. Two Bloods and The Seven Who Were Hanged, which the PCA singled out as the most violent, are also the highest-ranked stories in topics 5 and 6 of the LDA model, respectively; they do not contribute comparably to any other topic, lying at the bottom of those lists. It appears that these two stories indeed differ from the others in how violence is represented.

To conclude, we suggest that applying topic modelling to literary texts reveals difficulties stemming from the fact that a theme in a fictional text is, as a rule, not explicitly expressed in the story's lexis. In our case, the automatic annotation did not detect as many themes as the expert did. For that reason, when tackling fictional texts that do not differ in genre, several methods need to be applied together. As this study shows, PCA can contribute to the extraction of lexis-specific themes. Additionally, the results of both LDA and PCA could be properly interpreted only with knowledge of the contents of the stories and their thematic assessment by an expert. For future research, experimenting with other topic modelling algorithms (NMF, for instance), on the one hand, and applying supervised machine learning methods to the analysis of literary works, on the other, might yield results that allow a more comprehensive interpretation. Mastering automatic theme extraction may be a step towards human-like textual analysis, allowing literature to be studied by computational means in a more conclusive manner.

6. Acknowledgements

The publication was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2021 (grant # 21-04-053 'Artificial Intelligence Methods in Literature and Language Studies').

7. References

[1] R. Albalawi, T. H. Yeap, M. Benyoucef, Using topic modeling methods for short-text data: A comparative analysis, in: Frontiers in Artificial Intelligence, 3, 2020.
[2] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, in: J. Mach. Learn. Res. 3(4–5), 2003, pp. 993–1022.
[3] B. Blummer, J. M. Kenton, Academic Libraries' Outreach Efforts: Identifying Themes in the Literature, in: Public Services Quarterly 15(3), 2019, pp. 179–204.
[4] J. Carroll, The extremes of conflict in literature: Violence, homicide, and war, in: The Oxford Handbook of Evolutionary Perspectives on Violence, Homicide, and War, 2012.
[5] J. Carroll, Violence in literature: an evolutionary perspective, in: The Evolution of Violence, 2014, pp. 33–52.
[6] B. Grün, K. Hornik, topicmodels: An R Package for Fitting Topic Models, in: Journal of Statistical Software 40(13), 2011, pp. 1–30.
[7] I. T. Jolliffe, J. Cadima, Principal component analysis: a review and recent developments, in: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065), 2016.
[8] A. Kassambara, F. Mundt, factoextra: Extract and Visualize the Results of Multivariate Data Analyses, 2020. URL: https://CRAN.R-project.org/package=factoextra.
[9] F. Martin, M. Johnson, More efficient topic modelling through a noun only approach, in: Proceedings of the Australasian Language Technology Association Workshop, 2015, pp. 111–115.
[10] G. Y. Martynenko, T. Y. Sherstinova, A. G. Melnik, T. I. Popova, Methodological issues related with the compilation of digital anthology of Russian short stories (the first third of the 20th century), in: Proceedings of the XXI International United Conference 'The Internet and Modern Society', IMS-2018, Computational linguistics and computational ontologies, ITMO University, St. Petersburg, Issue 2, 2018, pp. 99–104.
[11] G. Y. Martynenko, T. Y. Sherstinova, T. I. Popova, A. G. Melnik, E. V. Zamirajlova, O printsipakh sozdaniya korpusa russkogo rasskaza pervoy treti XX veka [About Principles of the Creation of the Corpus of Russian Short Stories of the First Third of the 20th Century], in: Proc. of the XV Int. Conf. on Computer and Cognitive Linguistics 'TEL2018', Kazan Federal University, Kazan, 2018, pp. 180–197.
[12] G. Martynenko, T. Sherstinova, Linguistic and Stylistic Parameters for the Study of Literary Language in the Corpus of Russian Short Stories of the First Third of the 20th Century, in: R. Piotrowski's Readings in Language Engineering and Applied Linguistics, Proc. of the III International Conference on Language Engineering and Applied Linguistics (PRLEAL-2019), Saint Petersburg, Russia, November 27, 2019, CEUR Workshop Proceedings, Vol. 2552, 2020, pp. 105–120. URL: http://ceur-ws.org/Vol-2552/.
[13] O. A. Mitrofanova, Modelirovanije tematiki special'nyh tekstov na osnove algoritma LDA [Topic modeling of special texts based on the LDA algorithm], in: XLII Mezhdunarodnaya filologicheskaya konferencija [XLII International Philological Conference], 2014.
[14] S. Nikolenko, S. Koltcov, O. Koltsova, Topic modelling for qualitative studies, in: J. Inf. Sci. 43(1), 2017, pp. 88–102.
[15] M. Reimer, Introduction: Violence and Violent Children's Texts, in: Children's Literature Association Quarterly 22(3), 1997, pp. 102–104.
[16] I. Segalovich, V. Titov, MyStem. Yandex [Computer Software], 2011. URL: https://yandex.ru/dev/MyStem/.
[17] S. Sielke, Reading Rape: The Rhetoric of Sexual Violence in American Literature and Culture, 1790–1990, Princeton, 2009.
[18] T. Sherstinova, O. Mitrofanova, T. Skrebtsova, E. Zamiraylova, M. Kirina, Topic Modelling with NMF vs. Expert Topic Annotation: The Case Study of Russian Fiction, in: Advances in Computational Intelligence, MICAI 2020, Lecture Notes in Computer Science, Vol. 12469, 2020, pp. 134–151.
[19] T. Sherstinova, T. Skrebtsova, Russian Literature Around the October Revolution: A Quantitative Exploratory Study of Literary Themes and Narrative Structure in Russian Short Stories of 1900–1930, in: CompLing (in print).
[20] T. G. Skrebtsova, Thematic Tagging of Literary Fiction: The Case of Early 20th Century Russian Short Stories, in: CompLing, CEUR Workshop Proceedings, Vol. 2813, 2021, pp. 265–276.
[21] I. Uglanova, E. Gius, The Order of Things. A Study on Topic Modelling of Literary Texts, in: Proc. of the CHR 2020: Workshop on Computational Humanities Research, CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2723/long7.pdf.
[22] E. Zamiraylova, O. Mitrofanova, Dynamic topic modeling of Russian fiction prose of the first third of the XXth century by means of non-negative matrix factorization, in: Proc. of the III International Conference on Language Engineering and Applied Linguistics (PRLEAL-2019), Vol. 2552, 2019, pp. 321–339.