<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Topic Modeling of Russian Prose of the First Third of the XXth Century by Means of Non-Negative Matrix Factorization</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Ekaterina Zamiraylova</string-name>
          <email>e.zamiraylova@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Mitrofanova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Saint Petersburg State University</institution>,
          <addr-line>Saint Petersburg, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>14</lpage>
      <abstract>
<p>This paper describes automatic topic spotting in literary texts based on the Russian short stories corpus, comprising stories written in the first third of the XXth century. Non-negative matrix factorization (NMF) is a valuable alternative to existing approaches to dynamic topic modeling, and it can find niche topics and related vocabularies that are not captured by existing methods. The experiments were conducted on text samples extracted from the corpus; the samples contain texts by 300 different authors. This approach makes it possible to trace the topic dynamics of Russian prose over the 30 years from 1900 to 1930.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>In the last decade topic modeling has become one of the most popular research areas in
computational linguistics. Topic modeling is usually understood as building a model that shows which topics
appear in each document [Daud et al., 2010]. The topic model of a collection of text documents
determines to which topics each document belongs, and it generates the list of words
(terms) of which each topic is formed [Blei, Lafferty, 2006]. With this method it is possible
to process large amounts of data (fiction texts, magazine articles, news reports, social media,
reviews, etc.) and automatically obtain information about the topics of texts. Knowing what
people are talking about and understanding their concerns and opinions is very valuable for
science, business, political campaigns, etc.</p>
<sec id="sec-1-1">
        <p>Currently a large number of methods for topic modeling have been created. The most
common in modern applications are methods based on Bayesian networks, which are probabilistic
models on directed graphs. Probabilistic topic models belong to a relatively young field of
research in unsupervised learning. Probabilistic latent semantic analysis (PLSA), based on the
principle of maximum likelihood, was one of the first methods proposed as an alternative
to classical clustering methods based on the calculation of distance functions. After PLSA,
latent Dirichlet allocation (LDA) and its numerous generalizations were proposed.</p>
        <p>Following LDA, similar probabilistic approaches have been developed to track
the evolution of topics over time in a sequentially organized corpus of
documents, such as the dynamic topic model (DTM) [Blei, Lafferty, 2006]. Alternative
algorithms, such as non-negative matrix factorization (NMF) [Lee and Seung, 1999] considered
in this paper, have proven effective in finding underlying topics in text corpora [Wang et al.,
2012]. For this reason, this algorithm was chosen for the present study, which aims at automatic
detection of dynamic topics in the Russian short stories corpus of the first third of the XXth
century [see Sherstinova, Martynenko in this volume].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Selection rationale</title>
<sec id="sec-2-1">
        <p>Non-negative matrix factorization (NMF) is an unsupervised machine learning algorithm
that aims to detect useful features [Müller and Guido, 2017]. It is utilized for dimensionality
reduction of non-negative matrices, because it decomposes the data into factors in such a way that there
are no negative values in them. Therefore, this method can be applied only to data
where features have non-negative values, as a non-negative sum of non-negative components
cannot become negative [ibid.].</p>
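<p>As a minimal illustration of this property, the sketch below implements the multiplicative update rules of [Lee and Seung, 1999] on a toy document-term matrix. This is illustrative NumPy code under stated assumptions, not the code used in the experiments; the toy matrix and the number of iterations are stand-ins. Both factors remain non-negative throughout, because each update only multiplies by ratios of non-negative quantities.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative document-term matrix: 4 documents x 6 terms (stand-in data).
A = np.array([
    [2., 1., 0., 0., 0., 1.],
    [3., 2., 0., 0., 1., 0.],
    [0., 0., 3., 2., 0., 1.],
    [0., 1., 2., 3., 0., 0.],
])

k = 2                                # number of topics
W = rng.random((4, k))               # document-topic weights
H = rng.random((k, 6))               # topic-term weights
eps = 1e-9                           # guards against division by zero

# Multiplicative updates [Lee and Seung, 1999]: W and H stay non-negative
# because each step multiplies the current factors by non-negative ratios.
for _ in range(500):
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

assert (W >= 0).all() and (H >= 0).all()   # no negative values in the factors
print(np.round(W @ H, 1))                  # approximate reconstruction of A
```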
      </sec>
<sec id="sec-2-2">
        <p>One of the advantages of NMF over existing LDA methods is that fewer parameter variants
are used in the modelling process [Darek and Cross, 2016]. In addition, another benefit
is that NMF can identify niche topics that tend to be under-reported in traditional LDA
approaches [O’Callaghan et al., 2015]. Niche topics are sub-topics that can be identified within
a dynamic topic.</p>
        <p>The ability of NMF to consider how significant a word is to a document in a text collection,
based on weighted term frequency values, is particularly useful. In particular, applying a
log-based TF-IDF weighting factor to the data before constructing the topic model
contributes to diverse but semantically coherent topics that are less likely to be represented by
the same high-frequency terms [Darek and Cross, 2016]. This makes NMF a suitable method
for identifying both broad groups of high-level documents and niche topics with specialized
dictionaries [O’Callaghan et al., 2015].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental design</title>
<sec id="sec-3-1">
        <p>The experiment is based on a two-level strategy of topic modeling within the framework of
non-negative matrix factorization applied to the Russian short stories corpus of the first third of the
XXth century. In the first stage, NMF topic modeling is applied to one set of texts from a fixed
period of time; in the second stage, the results of topic modeling from successive periods of time
are combined to detect a set of dynamic topics related to a particular time window or to the whole corpus.</p>
      </sec>
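<p>The two-stage strategy above can be sketched as follows. This is schematic code following the window/dynamic design of [Darek and Cross, 2016]; the window models, topic counts and vocabulary size here are random stand-ins, not the study's actual data.</p>

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Stage 1 (assumed already done): one NMF model per time window, each
# yielding a topic-term matrix over a shared vocabulary (stand-in values).
vocab_size = 30
window_topic_term = [rng.random((5, vocab_size)) for _ in range(3)]

# Stage 2: stack all window topics as "documents" and run NMF again;
# the resulting factors group window topics into dynamic topics.
B = np.vstack(window_topic_term)             # 15 window topics x terms
dyn = NMF(n_components=4, random_state=0)    # 4 dynamic topics
W = dyn.fit_transform(B)

# Each window topic is assigned to its strongest dynamic topic.
assignment = W.argmax(axis=1)
print(assignment)
```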
    </sec>
    <sec id="sec-4">
      <title>Linguistic data set</title>
<sec id="sec-4-1">
        <p>The material for this paper is a selection of data from the Russian short stories corpus of the
first third of the XXth century, which is being developed at the Philology Department of Saint
Petersburg State University in cooperation with the Philology Department of the National Research
University Higher School of Economics, Saint Petersburg [Martynenko et al., 2018a; Sherstinova,
Martynenko, 2019].</p>
      </sec>
      <sec id="sec-4-2">
        <p>The data set consists of 300 stories by 300 unique writers, both
world-famous and barely known. The corpus is a homogeneous resource focused on
one of the most common genres of fiction, the short story. This genre is the most popular
among prose writers, and it is present in the literary legacy of almost all writers.</p>
<p>The corpus under development covers one of the most dramatic periods in the development
of the Russian language and literature. The central point that divides the first third of the
twentieth century into different time periods is the October Revolution of 1917. All other
events are considered either as leading to it or as arising from it. This makes it possible to carry
out quantitative analysis of language changes within a rather wide chronological framework and to
estimate which of the language changes that arose became fixed in the language and began to be used
frequently by speakers, and which disappeared after the revolutionary epoch [Martynenko et al., 2018b].</p>
<p>The base of the corpus provides a means for exploring the language of the first third of the
twentieth century (1900–1930), divided into three main periods: 1) the beginning of the XXth
century and the prerevolutionary years, including the First World War; 2) the revolutionary
years, comprising the February and October revolutions and the Civil War; and 3) the postrevolutionary
years from the end of the Civil War to 1930. Each of these time periods will be analyzed
separately, and the results will be combined into an overall picture reflecting the development
of the Russian language in the first third of the XXth century [Martynenko et al., 2018b].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental procedure</title>
<sec id="sec-5-1">
        <p>The experimental setup began with pre-processing of the texts, which included removal of non-text symbols, abbreviations and stop words, as well as lemmatization. The volumes of the data sets are shown (in tokens) below:</p>
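<p>A minimal sketch of such a pre-processing step. The tokenizer and the tiny stop-word list here are toy stand-ins; the actual pipeline also lemmatized the tokens with a tool the paper does not name.</p>

```python
import re

# Toy stop-word list; a real run would use a full Russian stop-word list.
STOP_WORDS = {"и", "в", "не", "на", "он", "она", "это"}

def preprocess(text: str) -> list[str]:
    """Strip non-letter symbols, lowercase, and drop stop words."""
    tokens = re.findall(r"[а-яёa-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Он писал письмо, и ночь была тиха..."))
```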
      </sec>
<sec id="sec-5-2">
        <p>In the first step a document-term matrix is created, to which TF-IDF weighting and document-length
normalization are applied before each matrix is written. This includes marking the documents
and creating a document-term matrix for each time window, where a window topic model is built by
applying NMF to that window.</p>
<p>Determining the number of topics is a nontrivial task, because choosing too few of
them leads to overly generalized results, while choosing too many entails numerous small,
highly similar topics [Green and Cross, 2015]. For cases when this number is not known
in advance, there are different strategies for automatic or semi-automatic selection of the number
of topics. In particular, it is proposed to build a Word2Vec skip-gram model using the Gensim
library (https://radimrehurek.com/gensim/) from all documents in the corpus. The TC-W2V
measure is used to compare different topic models and then select a model with a suitable
number of topics. More details on TC-W2V are given in [O’Callaghan et al., 2015].</p>
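<p>The TC-W2V measure reduces to the mean pairwise cosine similarity between embeddings of a topic's top terms; a model's coherence is the average over its topics, and the number of topics maximizing it is selected. The sketch below uses random stand-in vectors; a real run would use skip-gram embeddings trained with Gensim as described above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings; the paper builds these with a Word2Vec skip-gram
# model trained on the corpus (via the Gensim library).
embeddings = {w: rng.random(50) for w in
              ["солдат", "офицер", "рота", "река", "лес", "ночь"]}

def tc_w2v(topic_terms: list[str]) -> float:
    """Mean pairwise cosine similarity of a topic's top terms."""
    sims = []
    for i, a in enumerate(topic_terms):
        for b in topic_terms[i + 1:]:
            u, v = embeddings[a], embeddings[b]
            sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean(sims))

# Coherence of one candidate topic; repeat per topic and average per model.
score = tc_w2v(["солдат", "офицер", "рота"])
print(round(score, 3))
```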
      </sec>
<sec id="sec-5-3">
        <p>Applying the method mentioned in [Green and Cross, 2015] to determine the number of topics, the following results were obtained: the top recommendation for the number of topics is 10 for ’1900–1913’ (Table 2), 4 for ’1914–1922’ (Table 3) and 10 for ’1923–1930’ (Table 4); the top recommendation for the number of dynamic topics is 4 (Table 5).</p>
      </sec>
<sec id="sec-5-8">
        <p>The ability of NMF to apply TF-IDF weighting to the data before topic modeling produces
diverse but nonetheless coherent topics that are less likely to be represented by the same
high-frequency terms, allowing identification of both broad and niche topics with specialized
vocabularies [O’Callaghan et al., 2015].</p>
      </sec>
      <sec id="sec-5-9">
        <p>In the context of the study of the first third of the XXth century, the discovery of these
niche topics is an advantage that helps to examine the components of the topics and analyze
the realities of the period in more detail. To illustrate this idea, Table 5 shows the top 10 terms for
4 dynamic topics. Terms in bold are unique to a topic; terms in italics appear both in the overall
description of a topic and in time windows (or even in one time window); terms in bold italics are
found within time windows but not in the overall description of a topic.</p>
      </sec>
<sec id="sec-5-10">
        <p>The above lists of words created by NMF to describe topics are rich and varied; moreover, each time window has its own unique words. Compared with the most common LDA method, as the authors of the model do [Darek and Cross, 2016], NMF is more suitable for niche content analysis, while LDA offers only a good general description of broader topics.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Linguistic interpretation of experimental results</title>
<p>The content of the dynamic topics is of the highest interest for linguistic analysis. Table 5 lists the
top 4 dynamic topics penetrating all time periods (1900–1913, 1914–1922, 1923–1930). Table
6 shows the niche topics and vocabularies of each dynamic topic in a specific time period. For
instance, the first broad topic in the first time window is represented by 40 words with the
largest number of unique terms (писать (to write), любовь (love), любить (to love), сцена
(scene), роль (role), ребенок (child), муж (husband), жена (wife), счастье (happiness),
кабинет (room, office), деньга (money), русский (Russian), пароход (steamer), город (city,
town), etc.). It can be inferred that writers at the beginning of the century
wrote a lot about the mode of life: about family (муж (husband), ребенок (child), жена
(wife), мама (mom), сестра (sister), отец (father)), work (кабинет (room, office), деньга
(money)) and events that occurred to the main characters, who interacted with people of
different professions (купец (merchant), приказчик (manager), извозчик (horse-cab driver),
доктор (doctor)). The number of unique terms is small in the second time window
(1914–1922), which is due to the fact that this is a revolutionary time and the description
of everyday life is minimal. In the post-revolutionary period the vocabulary increases again; there are
unique words that reflect the «new life» (товарищ (comrade), завод (factory), гражданин
(citizen), рабочий (working), etc.).</p>
<sec id="sec-6-1">
        <p>There are more words describing nature in the second dynamic topic: in 1900–1913 (пруд
(pond), река (river), ночь (night), солнце (sun), куст (bush), лес (forest), волк (wolf)),
during 1914–1922 more abstract ones (ветер (wind), море (sea), небо (sky), солнце (sun), ночь
(night), берег (bank)), and in 1923–1930 (сосна (pine), птица (bird), зверь (beast), лес (forest),
болото (swamp)). The third dynamic topic is filled with words related to the military sphere.
However, if one looks through the niche topics and vocabularies of each period, at the beginning
of the century only a few words can be attributed to the military theme (солдат (soldier),
офицер (officer), пост (post)); the rest of the unique terms are more related to the usual way of
life (барин (lord), старик (old man), деревня (village), благородие (honour), etc.), which
may indicate the maintenance of order and the regulation of people's relations. In the second
time window (1914–1922) two unique words, «немецкий» (German) and «немец» (German),
appeared, and there are no abstract words; almost all content refers to the military (солдат
(soldier), офицер (officer), стрелять (to shoot), рота (troop), etc.), which fully reflects the
revolutionary period. There is a large number of unique words in the third time window,
where the following niche topics appear: movement by train (вагон (coach), пассажир
(passenger), поезд (train), станция (station), ход (motion), курс (course)), family (муж
(husband), мама (mom), ребенок (child), мальчик (boy)), house/home (дом (house, home),
кухня (kitchen)). Only the word «солдат» (soldier) can be attributed to the military topic.</p>
      </sec>
<sec id="sec-6-2">
        <p>The fourth dynamic topic has several niche topics: village (телега (telega, horse wagon),
народ (folk), изба (hut, house), etc.) and religion (поп (priest), батюшка (priest), церковь
(church)). It is worth paying attention to the dynamics of changes in the religious topic:
there are more words in the revolutionary time (батюшка (priest), Бог (God), святой (saint))
compared to the beginning of the century (батюшка (priest)) and the postrevolutionary period
(поп (priest), церковь (church)).</p>
      </sec>
<sec id="sec-6-3">
        <p>The above analysis shows that the internal organization of topics, described as a bundle of paradigmatic and syntagmatic connections between the words of the same topic which vary across different time intervals within the same dynamic topic, changes significantly over time and reflects external events.</p>
        <p>If we consider the components of topics from the linguistic viewpoint, the largest number of
words belongs to the nominative class, which is represented by common nouns. Proper names
(Александр (Alexander), Алексей (Alexey), Анна (Anna), Владимир (Vladimir), Володя
(Volodya), Мишка (Mishka), Вера (Vera), etc.) were deliberately removed, since there are
many dialogues in the data: the frequency of names is very high, and their topic distribution is
not conditioned by anything.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results</title>
<p>Most nouns refer to the description of people (ребенок (child), девушка (young lady),
женщина (woman), старик (old man), дед (grandfather, old man), баба (country woman,
peasant's wife), отец (father), мать (mother), муж (husband), жена (wife), батюшка
(father), матушка (mother), сестра (sister), дядя (uncle), etc.), professions (солдат
(soldier), барин (lord), офицер (officer), студент (student), крестьянин (peasant), доктор
(doctor), etc.) and body parts (рука (hand, arm), нога (leg), грудь (chest)). Other groups of
nouns denote everyday realities (комната (room), улица (street), пост (post), изба (hut, house),
дом (house, home), письмо (letter), etc.), nature and animals (лес (forest), куст (bush), волк
(wolf), солнце (sun), лошадь (horse), пруд (pond), река (river), ночь (night), снег (snow)),
abstract notions (жизнь (life), счастье (happiness), мысль (thought), смерть (death), etc.), as well
as collective nouns (толпа (crowd), рота (troop), народ (folk)).</p>
<p>The predicative class is represented by verbs: писать (to write), знать (to know),
любить (to love), хотеть (to want), стоять (to stand), кричать (to shout), бежать
(to run), думать (to think), обедать (to dine), прийти (to come), сидеть (to sit), жить
(to live), работать (to work), спать (to sleep), глянуть (to peep), стрелять (to shoot),
вскочить (to jump up), пойти (to go). From the point of view of the semantic classification of
verbs developed by V. V. Vinogradov and supplemented by G. A. Zolotova [Zolotova, 2004],
these verbs belong to the following main semantic classes:</p>
<p>1) verbs of movement: стоять (to stand), сидеть (to sit), глянуть (to peep), бежать
(to run), прийти (to come), вскочить (to jump up), идти (to go);
2) verbs of speech action: кричать (to shout);
3) verbs of mental action: знать (to know), думать (to think), казаться (to seem);</p>
      <sec id="sec-7-1">
        <p>4) verbs of emotional action: любить (to love);
5) verbs of physiological action: жить (to live), обедать (to dine), спать (to sleep);
6) verbs of activity or occupation: работать (to work), писать (to write), стрелять (to shoot);
7) modal verb: хотеть (to want).</p>
      </sec>
<sec id="sec-7-2">
        <p>The attributive class is the narrowest: it includes qualitative adjectives (хороший (good), большой
(big), черный (black), темный (dark), горбатый (humpbacked), старший (senior)) and
relative adjectives (рабочий (working), русский (Russian), немецкий (German)).</p>
      </sec>
<sec id="sec-7-3">
        <p>The correlation of words in the topics reflects the diversity of paradigmatic and syntagmatic
relations that organize the text [Mitrofanova et al., 2014; Mitrofanova, 2014]. The
language connections within the topics may be described with lexical functions in the
«Meaning &lt;=&gt; Text» model [Melchuk, 1974/1999], which makes it possible to cover the predictable, idiomatic
connections of a word and its lexical correlates.</p>
<p>Among the paradigmatic relations in the topics, synonymy (Syn), antonymy (Anti) and
derivational (Der) relations prevail. For example, Syn: мама (mom) –
мать (mother), друг (friend) – товарищ (comrade), фабрика (plant) – завод (factory),
черный (black) – темный (dark), поп (priest) – батюшка (priest), etc.; Anti: деревня
(village) – город (city, town); Der: работа (work) – рабочий (working) – работать (to
work), немец (German) – немецкий (German), любовь (love) – любить (to love), крик
(shout) – кричать (to shout), команда (command, team) – командир (commanding officer),
etc. Partitive relations: семья (family) – ребенок (child), мама (mom), отец (father),
дядя (uncle), сестра (sister), муж (husband), жена (wife), мать (mother),
тетка (aunt); армия (military) – офицер (officer), солдат (soldier), рота (troop), штаб
(headquarter), капитан (captain), команда (command, team), etc.; природа (nature) – пруд
(pond), река (river), берег (bank), лес (forest), снег (snow), солнце (sun), ветер (wind),
дерево (tree), куст (bush), болото (swamp), etc.; лес (forest) – дерево (tree), куст (bush),
болото (swamp); охота (hunt) – ружье (rifle), зверь (beast), лес (forest), огонь (fire);
деревня (village) – изба (hut, house), телега (telega, horse wagon), крестьянин (peasant),
барин (lord); дом (house, home) – комната (room), дверь (door), окно (window), лампа
(lamp), кухня (kitchen), кабинет (room, office); передвижение на поезде (go by train) –
вагон (coach), пассажир (passenger), поезд (train), станция (station), ход (motion); завод
(factory) – работа (work), рабочий (working), работать (to work), машина (machine), etc.</p>
<p>Syntagmatic relations are realized at the level of valence frames filled with words from
the topic. Among lexical functions, Oper1,2 may be singled out, which connect a verb, the name
of the first or the second actant in the role of the subject, and the name of the situation as a
complement: суп (soup) – обедать (to dine), письмо (letter) – писать (to write), винтовка (gun)
– стрелять (to shoot), ребенок (child) – кричать (to shout), etc. In addition, there are
a number of examples of the lexical function Cap: команда
(command, team) – командир (commanding officer); штаб (headquarter) – начальник (chief);
отряд (squad) – командир (commanding officer); церковь (church) – поп (priest); пароход
(steamer) – капитан (captain), etc. The lexical function Equip ("personnel, staff"): people
(folk) – man (country man, peasant man), etc.; the lexical function Doc (res) ("document that is
the result"): write (to write) – letter (letter); draw – drawing, etc.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
<sec id="sec-8-1">
        <p>The observations made during the experiments confirm the expediency of using non-negative matrix factorization for topic modeling tasks, including the evaluation of the content of texts as a result of semantic compression.</p>
      </sec>
      <sec id="sec-8-2">
        <p>The results obtained in processing the selected data from the Russian short stories corpus of the first third of the XXth century indicate the diversity of the realization of dynamic topics in different time periods.</p>
      </sec>
<sec id="sec-8-3">
        <p>The research data make it possible to interpret the obtained results from the perspective
of the theory of lexical functions, as well as to apply historical and literary approaches for this
purpose. The content of the topics allows conclusions to be drawn about the topic dynamics of
Russian prose over the 30 years from 1900 to 1930.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
<sec id="sec-9-1">
        <p>The research is supported by the Russian Foundation for Basic Research, project # 17-29-09173
“The Russian language on the edge of radical historical changes: the study of
language and style in prerevolutionary, revolutionary and post-revolutionary artistic prose by the
methods of mathematical and computer linguistics (a corpus-based research on Russian short
stories)”.</p>
        <p>[Melchuk, 1974/1999] Melchuk I. A. (1974/1999) Experience of the Theory of Linguistic
Models «Meaning &lt;=&gt; Text». Moscow, 1974/1999. (In Russ.) = Opyt teorii lingvistitcheskix
modelej «Smysl &lt;=&gt; Tekst», Moskva, 1974/1999.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[Daud et al., 2010]
          <string-name><surname>Daud</surname> <given-names>A.</given-names></string-name>,
          <string-name><surname>Li</surname> <given-names>J.</given-names></string-name>,
          <string-name><surname>Zhou</surname> <given-names>L.</given-names></string-name>,
          <string-name><surname>Muhammad</surname> <given-names>F.</given-names></string-name>
          (<year>2010</year>)
          <article-title>Knowledge Discovery through Directed Probabilistic Topic Models: a Survey</article-title>.
          <source>Proceedings of Frontiers of Computer Science in China</source>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Blei, Lafferty, 2006]
<string-name><surname>Blei</surname> <given-names>D. M.</given-names></string-name>,
          <string-name><surname>Lafferty</surname> <given-names>J. D.</given-names></string-name>
          (
          <year>2006</year>
          )
          <article-title>Dynamic topic models</article-title>
          .
          <source>In Proc. 23rd International Conference on Machine Learning</source>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[Lee and Seung, 1999]
          <string-name><surname>Lee</surname> <given-names>D. D.</given-names></string-name>
          and
          <string-name><surname>Seung</surname> <given-names>H. S.</given-names></string-name>
          (<year>1999</year>)
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>.
          <source>Nature</source>
          <volume>401</volume>,
          <fpage>788</fpage>-<lpage>791</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[Wang et al., 2012]
          <string-name><surname>Wang</surname> <given-names>Q.</given-names></string-name>,
          <string-name><surname>Cao</surname> <given-names>Z.</given-names></string-name>,
          <string-name><surname>Xu</surname> <given-names>J.</given-names></string-name>
          and
          <string-name><surname>Li</surname> <given-names>H.</given-names></string-name>
          (<year>2012</year>)
          <article-title>Group matrix factorization for scalable topic modeling</article-title>.
          <source>In Proc. 35th SIGIR Conf. on Research and Development in Information Retrieval</source>,
          pp. <fpage>375</fpage>-<lpage>384</lpage>. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Sherstinova, Martynenko, 2019]
          <string-name>
            <surname>Sherstinova</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martynenko</surname>
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2019</year>
          )
          <article-title>Linguistic and Stylistic Parameters for the Study of Literary Language in the Corpus of Russian Short Stories of the First Third of the 20th Century</article-title>
          . This volume.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[Müller and Guido, 2017]
          <string-name><surname>Müller</surname> <given-names>A.</given-names></string-name>
          and
          <string-name><surname>Guido</surname> <given-names>S.</given-names></string-name>
          (<year>2016</year>)
          <source>Introduction to Machine Learning with Python: A Guide for Data Scientists</source>.
          O'Reilly, 2016.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[Darek and Cross, 2016]
          <string-name><surname>Greene</surname> <given-names>D.</given-names></string-name>
          and
          <string-name><surname>Cross</surname> <given-names>J. P.</given-names></string-name>
          (<year>2016</year>)
          <article-title>Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach</article-title>.
          <source>ArXiv abs/1607.03055</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[O'Callaghan et al., 2015]
          <string-name><surname>O'Callaghan</surname> <given-names>D.</given-names></string-name>,
          <string-name><surname>Greene</surname> <given-names>D.</given-names></string-name>,
          <string-name><surname>Carthy</surname> <given-names>J.</given-names></string-name>
          and
          <string-name><surname>Cunningham</surname> <given-names>P.</given-names></string-name>
          (<year>2015</year>)
          <article-title>An analysis of the coherence of descriptors in topic modeling</article-title>.
          <source>Expert Systems with Applications (ESWA)</source>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Martynenko et al., 2018a]
          <string-name>
            <surname>Martynenko</surname>
            <given-names>G.Ya.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sherstinova</surname>
            <given-names>T.Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melnik</surname>
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popova</surname>
            <given-names>T.I.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Methodological problems of creating a computer anthology of the Russian short story as a language resource for the study of the language and style of Russian prose in the era of revolutionary changes (the first third of the XX century)</article-title>
          .
          <source>Computational linguistics and computational ontologies. Issue 2 (Proceedings of the XXI International Joint Conference "Internet and Modern Society", IMS-2018, St. Petersburg, May 30 - June 2, 2018. Collection of scientific articles)</source>
          . St. Petersburg: ITMO University,
          <year>2018</year>
          . Pp.
          <fpage>99</fpage>
          -
          <lpage>104</lpage>
          . (In Rus.) =
          <article-title>Metodologicheskiye problemy sozdaniya Kompyuternoy antologii russkogo rasskaza kak yazykovogo resursa dlya issledovaniya yazyka i stilya russkoy khudozhestvennoy prozy v epokhu revolyutsionnykh peremen (pervoy treti XX veka)</article-title>
          .
          <source>Kompyuternaya lingvistika i vychislitelnyye ontologii. Vypusk 2 (Trudy XXI Mezhdunarodnoy obyedinennoy konferentsii "Internet i sovremennoye obshchestvo", IMS-2018, Sankt-Peterburg, 30 Maya - 2 Iyunya 2018 g. Sbornik nauchnykh statey)</source>
          . SPb: Universitet ITMO,
          <year>2018</year>
          . S.
          <fpage>99</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Martynenko et al., 2018b]
          <string-name>
            <surname>Martynenko</surname>
            <given-names>G.Ya.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sherstinova</surname>
            <given-names>T.Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popova</surname>
            <given-names>T.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melnik</surname>
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamiraylova</surname>
            <given-names>E.V.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>On the principles of creation of the Russian short stories corpus of the first third of the 20th century</article-title>
          .
          <source>Proceedings of the XV International Conference on Computer and Cognitive Linguistics "TEL 2018"</source>
          . Kazan,
          <year>2018</year>
          . Pp.
          <fpage>180</fpage>
          -
          <lpage>197</lpage>
          . (In Rus.) =
          <article-title>O printsipakh sozdaniya korpusa russkogo rasskaza pervoy treti XX veka</article-title>
          .
          <source>Trudy XV Mezhdunarodnoy konferentsii po kompyuternoy i kognitivnoy lingvistike "TEL 2018"</source>
          . Kazan,
          <year>2018</year>
          . S.
          <fpage>180</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Greene and Cross, 2015]
          <string-name>
            <surname>Greene</surname>
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cross</surname>
            <given-names>J.P.</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Unveiling the Political Agenda of the European Parliament Plenary: A Topical Analysis</article-title>
          .
          <source>ACM Web Science 2015</source>
          , 28 June - 1 July 2015, Oxford, UK.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Zolotova et al., 2004]
          <string-name>
            <surname>Zolotova</surname>
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Onipenko</surname>
            <given-names>N.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorova</surname>
            <given-names>M.Yu.</given-names>
          </string-name>
          (
          <year>2004</year>
          )
          <article-title>Communicative grammar of the Russian language</article-title>
          . Moscow: Nauka,
          <year>2004</year>
          . 544 p. (In Rus.) =
          <article-title>Kommunikativnaya grammatika russkogo yazyka</article-title>
          . M.: Nauka,
          <year>2004</year>
          . 544 s. ISBN 5-88744-050-3.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Mitrofanova et al., 2014]
          <string-name>
            <surname>Mitrofanova</surname>
            <given-names>O.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimorina</surname>
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koltsov</surname>
            <given-names>S.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koltsova</surname>
            <given-names>O.Yu.</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Modeling semantic links in social media texts using the LDA algorithm (based on the Russian-language segment of the LiveJournal)</article-title>
          .
          <source>Structural and Applied Linguistics</source>
          , Vol.
          <volume>10</volume>
          ,
          <fpage>151</fpage>
          -
          <lpage>168</lpage>
          . (In Rus.) =
          <article-title>Modelirovaniye semanticheskikh svyazey v tekstakh sotsialnykh setey s pomoshchyu algoritma LDA (na materiale russkoyazychnogo segmenta Zhivogo Zhurnala)</article-title>
          .
          <source>Strukturnaya i prikladnaya lingvistika</source>
          . Vyp.
          <volume>10</volume>
          . S.
          <fpage>151</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Mitrofanova, 2014]
          <string-name>
            <surname>Mitrofanova</surname>
            <given-names>O.A.</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Topic modeling of special texts based on LDA algorithm</article-title>
          .
          <source>XLII International Philological Conference, March 11-16, 2013. Selected works</source>
          . SPb. (In Rus.) =
          <article-title>Modelirovanije tematiki special'nyh tekstov na osnove algoritma LDA</article-title>
          .
          <source>XLII Mezhdunarodnaya filologicheskaya konferencija, 11-16 marta 2013 g. Izbrannyje trudy</source>
          . SPb.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>