=Paper=
{{Paper
|id=Vol-2953/SEIM_2021_paper_11
|storemode=property
|title=Visualizing Russian Kinship Term Possessive Sequences as Family Trees
|pdfUrl=https://ceur-ws.org/Vol-2953/SEIM_2021_paper_11.pdf
|volume=Vol-2953
|authors=Anna Golub,Alyona Belova,Gleb Bondarenko,Darya Karmaz
}}
==Visualizing Russian Kinship Term Possessive Sequences as Family Trees==
Anna Golub (anna.golub.923@gmail.com), Alyona Belova (belova.alyona.itmo@gmail.com), Gleb Bondarenko (bondoll2001@mail.ru), Darya Karmaz (dasha.karmaz@yandex.ru)

Faculty of Infocommunication Technologies, ITMO University, St. Petersburg, Russia

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Although they occur frequently in texts of various genres, Russian kinship term possessive sequences remain confusing even for native speakers. The paper presents the authors’ original computer science project, whose goal was to suggest a method of extracting such word sequences from a piece of text and visualizing them in an easily comprehensible way. Automating the analysis of kinship relations in this way might contribute to future research in history, linguistics, and literary studies and be of use to those studying Russian as a foreign language.

Keywords: kinship term, possessive structure, family tree, natural language processing, Russian

===I. INTRODUCTION===
Russian possessive sequences that include several kinship terms (e.g. сестра мужа моей тёщи, ‘my mother-in-law’s husband’s sister’; бабушка шурина его брата, ‘his brother’s brother-in-law’s grandma’) often appear confusing in written text as well as in oral speech, since it is difficult to quickly work out the relations between the relatives mentioned. Besides, such phrases often include names of relatives by marriage (тёща, ‘a wife’s mother’; шурин, ‘a wife’s brother’; etc.). Those kinship terms are becoming increasingly obsolescent in modern Russian: they account for only 1.4 per cent of all kinship term entries in the Russian National Corpus [1] in 1990–2020. Therefore, when they appear in a sequence, they make it even harder to comprehend.

The goal of the authors’ original computer science project was to create a computational tool that would extract the sequences in question from a given piece of text and visualize them in an easily comprehensible way. Automating the analysis of kinship relations in this way might contribute to future research in history, linguistics, and literary studies, e.g. scientific analysis and systematization of fiction and memoirs. Moreover, this technology might be used by those studying Russian as a foreign language to make the meaning of Russian kinship terms easier to understand and memorize.

The following text processing stages were suggested for the computational tool:
# finding kinship term possessive sequences in the given text;
# analyzing the family relations that each of the sequences presents and building graphs upon them;
# visualizing the graphs as family trees using the existing tools for data visualization.

===II. STATE OF THE ART===
====A. Kinship Term Possessive Sequences====
To our knowledge, a comprehensive solution to the problem has not yet been suggested. A great deal of research has been done on kinship term systems and the grammar of possession in various languages. However, possessive structures with kinship terms have not been subject to in-depth investigation. A few researchers touched on the topic of kinship terms in their studies of possessive structures (Dahl & Koptjevskaja-Tamm 2001 [2], Paykin & Van Peteghem 2003 [3], Jones 2010 [4]). Unfortunately, the approach taken was exclusively theoretical, and in addition, the only type of phrase discussed was one consisting of a possessive pronoun and a single kinship term, such as ''her sister''. Aside from that, some papers describing data collected through fieldwork mention kinship term possessive sequences, but that kind of research does not seem applicable to building a tool for their computational processing.

====B. Family Tree Visualization====
In terms of visualization, there are many software tools that are suitable for depicting family trees. To begin with, specialized packages (e.g. ggenealogy [5]) provide plotting methods for genealogical data. However, the hierarchical nature of such a structure does not allow horizontal edges and node skipping, which are required for a valid representation of relationships in kinship term possessive sequences. The same holds for libraries with more general visualization functionality (e.g. Graphviz [6], Plotly [7], Toytree [8]). Furthermore, these visualization tools have rather limited customization options, while the goal was to display confusing lineages in the most efficient way.

Fig. 1. Visualizing a family tree with Toytree

Additionally, to our knowledge, some other family tree visualization tools appropriate for the assigned task [9] are unfortunately unavailable for public use.

Moreover, there is genealogy software with intuitive family tree builders for illustrating pedigrees (e.g. Family Historian, Legacy Family Tree, etc. [10]). Although these applications represent kinship relationships in a comprehensible, visually organized manner, they require user interaction, which makes automatic visualization significantly more complicated.

Fig. 2. Visualizing a kinship term sequence with Family Historian

Taking all of the aforementioned into account, we decided to use NetworkX [11] and Matplotlib [12] to develop our own visualization algorithm best fitting the requirements.

===III. SOLUTION===
====A. Text Search====
All the source code is written in Python. First, using NLTK, a Python library for natural language processing [13], the given piece of text is split into sentences; then each of them is tokenized into words. Next, employing pymorphy2 [14] for part-of-speech tagging and further morphological analysis, continuous word sequences are extracted from sentences. At this point the sequences consist of:
* one or more kinship terms, the first of them in any case and all the rest exclusively in the genitive case;
* certain kinds of modifiers, namely long- or short-form adjectives and participles, ordinal numerals and adjective pronouns;
* not more than one non-kinship noun in the genitive case. If included, this word is the last one in the sequence (see sequence type 4 below).

Afterwards, each of the sequences is recognized as one of the sequence types listed below, which cover most instances of kinship term possessive sequences in Russian. The GEN abbreviation stands for the genitive case, while ''n'' signifies the possible number of the word’s occurrences.
# possessive adjective / possessive pronoun + kinship term (GEN)
#: Example: бабушкиному мужу
#: grandma's husband
#: ‘grandma’s husband’
# kinship term + kinship term (GEN), repeated ''n'' = 0, 1, 2, ... times + possessive adjective / possessive pronoun + kinship term (GEN)
#: Example: муж бабушки моей сестры
#: husband grandma my sister
#: ‘my sister’s grandma’s husband’
# kinship term + kinship term (GEN), repeated ''n'' = 0, 1, 2, ... times + possessive adjective / possessive pronoun
#: Example: мужа бабушки сестры моей
#: husband grandma sister my
#: ‘my sister’s grandma’s husband’
# kinship term + kinship term (GEN), repeated ''n'' = 0, 1, 2, ... times + noun (non-kinship) (GEN)
#: Example: мужем бабушки подруги
#: husband grandma friend
#: ‘friend’s grandma’s husband’
# kinship term + kinship term (GEN), repeated ''n'' = 0, 1, 2, ... times
#: Example: мужу бабушки сестры
#: husband grandma sister
#: ‘sister’s grandma’s husband’

Then, the sequence is cast to the so-called normal form:
* kinship terms are put in the nominative singular form;
* non-kinship nouns are put in the nominative case with the number unchanged;
* possessive adjectives are replaced with their stem nouns in the nominative singular form;
* possessive pronouns are replaced with the corresponding personal ones in the nominative case (the forms of the pronoun свой are substituted by некто).

Afterwards, the sequence is reordered so as to put the words in the direct relation order (see examples below). If the sequence then starts with a kinship term, the first-person singular pronoun я is inserted at the beginning.

Examples:
: a. бабушкиному мужу → я бабушка муж
:: grandma’s husband → me grandma husband
: b. мужу бабушки сестры моей → я сестра бабушка муж
:: husband grandma sister my → me sister grandma husband

Thus, all the words in the sequence, except for the first one, turn out to be kinship terms in the nominative singular form. Preprocessed this way, the sequence is fit for further analysis.

====B. Kinship Relations Analysis====
For each word in the sequence, except for the first one, the following actions are performed. First, a graph fragment template is loaded. The template presents the relationship between this and the previous character in the sequence. Here, by character we mean any relative in the chain of connections described by the sequence. Within the template all the characters are connected directly, through either a parent–child or a wife–husband type of connection. See the сестра (sister) template below as an example.

Fig. 3. “Сестра” (“sister”) template

Next, the template is incorporated into the graph that has been built so far: the root of this template is aligned with the top of the previous one. Here, by the template top we mean the character signified by this template’s corresponding word. In turn, the template’s root is the character whose relation to the top of the template is being described by the template word. (In the sister template above, я (me) is the root while сестра (sister) is the top.) Consequently, by the tree root we mean the root of the first template, and by the tree top the top of the last template included in the graph. See the example below: the бабушка (grandmother) template being aligned with the сестра (sister) template. (For now, we only discuss the maternal grandmother.)

Fig. 4. Template alignment

Finally, if multiple nodes of the graph correspond to the same character of the sequence, they are merged into one. Such cases are resolved based on the heuristic that each character may only have one mother and one father. The graph above is thus transformed into the following one.

Fig. 5. The final graph structure

Thus, in the resulting graph all the characters are connected directly to each other. It is worth mentioning that, due to linguistic polysemy, the relationship reflected in a kinship term might be described by a few different templates (e.g. in the above example the grandmother might be either maternal or paternal). Consequently, several graph structures are built, considering all the possible template combinations for the kinship terms in the sequence.

Fig. 6. The alternative graph structure

====C. Graph Visualization====
The family trees are drawn using NetworkX and Matplotlib in Python. All the edges of the graph are added to a list in a loop that goes through the characters and their connections. The root node gets zero coordinates; for other nodes the following rules apply:
* a parent is positioned one point higher than their child;
* a child is positioned one point lower than their parent;
* a wife or husband is drawn at the same level as, and one point to the right of, their spouse;
* if the determined spot is already taken, the node is shifted one point to the right.

As to colors, the following rules apply:
* for the tree top and the tree root nodes, blue is used;
* otherwise, if the character is directly mentioned in the kinship term sequence, the node is light blue;
* if there is no direct mention of the character, the node is light gray.

Gender is displayed through the shape of the nodes: circles for females, squares for males; a rhombus is used for the root node. As a result, a PNG file presents the graph on a gray background together with the kinship term sequence and the original sentence. The file is the final program output. See the visualization of the sequence муж бабушки моей сестры (my sister’s grandmother’s husband) as an example.

Fig. 7. The program output

===IV. EVALUATION===
====A. Evaluation Process====
In order to evaluate the tool’s performance, it was run on a purposefully collected corpus of texts, selected manually from the Russian National Corpus. The corpus consists of 3067 words and includes at least five entries of each of the kinship terms while keeping a rough balance between the sequence types.

For text search evaluation, each kinship term possessive sequence in the corpus was manually classified as follows:
* True Positive: the sequence was found, and its boundaries were identified correctly;
* False Positive: the sequence was found, but extra words were included;
* True Negative: there are no sequences in the sentence, and none were found;
* False Negative: the sequence was found, but some necessary words were excluded.

The precision and recall scores were then calculated, amounting to 0.96 and 0.93 respectively.

Then, for each of the sequences found by the program, the graphs were drawn to evaluate the kinship relations analysis and visualization. For each of the sequences, the number of expected and present correct visualizations was estimated manually, with the resulting accuracy score being 0.95.

====B. Discussion====
As the evaluation has revealed flaws in the tool’s performance, a few areas for future work are suggested.
* Processing proper names. At present, the tool cannot correctly process input sequences including first name + patronym collocations (e.g. сын Анны Ивановны, ‘Anna Ivanovna’s son’) or abbreviated name forms (e.g. сын Вл. Набокова, ‘V. Nabokov’s son’).
* Broadening the range of sequence types. For example, the tool cannot correctly process the sequence below because it does not fit any of the sequence type schemas.
*: a. шурина моего сын
*: wife’s brother my son
*: ‘my wife’s brother’s son’
* Adding context analysis features for better differentiation between the sequence types.
* Updating the template approach. At the relations analysis stage, complex kinship terms can be replaced with their simpler explanations, e.g. turning тёща (mother-in-law) into мать жены (a wife’s mother), so that templates need to be stored only for the basic kinship terms, namely parents, children, siblings and spouses.
* Identifying coreference. At this point, the program does not register several words referring to the same character, as in сын моего отца (my dad’s son), and is unable to depict that in the graph.
* Making the tool adjustable for other languages by removing the language dependencies in the code.

===V. CONCLUSION===
This paper has presented kinship term possessive sequences as a field for natural language processing development, together with the authors’ original tool for the sequences’ human-readable visualization. The program appears quite efficient and performs well on a broad range of input data; however, the suggested paths for further development might push its limits significantly.

The project source code is available on GitHub. The evaluation corpus, as well as the list of kinship terms processed by the program, can also be viewed there.

The tool is also available for public use as a Python package. At present, the users are able to:
* extract sequences from a given piece of text;
* build a graph upon a given sequence;
* visualize an already-built graph or sequences from a given piece of unprocessed text.

===REFERENCES===
* [1] https://ruscorpora.ru/new/
* [2] Dahl, Östen & Koptjevskaja-Tamm, Maria. (2001). 11. Kinship in grammar. 10.1075/tsl.47.12dah.
* [3] Paykin, K., van Peteghem, M. External vs. Internal Possessor Structures and Inalienability in Russian. Russian Linguistics 27, 329–348 (2003).
* [4] Jones, Doug. (2010). Human kinship, from conceptual structure to grammar. The Behavioral and Brain Sciences. 33. 367–404.
* [5] Rutter L, VanderPlas S, Cook D, Graham MA (2019). “ggenealogy: An R Package for Visualizing Genealogical Data.” Journal of Statistical Software, 89(13), 1–31.
* [6] “Graphviz and Dynagraph – Static and Dynamic Graph Drawing Tools”, by John Ellson, Emden R. Gansner, Eleftherios Koutsofios, Stephen C. North, and Gordon Woodhull, in Jünger & Mutzel (2004).
* [7] Plotly Technologies Inc. Collaborative data science. Montréal, QC, 2015. https://plot.ly
* [8] https://toytree.readthedocs.io/en/latest/
* [9] Borges, J. (2019). A contextual family tree visualization design. Information Visualization.
* [10] https://www.toptenreviews.com/software/home/best-genealogy-software/
* [11] Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, “Exploring network structure, dynamics, and function using NetworkX”, in Proceedings of the 7th Python in Science Conference (SciPy2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008.
* [12] J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
* [13] Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
* [14] Korobov M.: Morphological Analyzer and Generator for Russian and Ukrainian Languages // Analysis of Images, Social Networks and Texts, pp. 320–332 (2015).
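The five sequence types of the Text Search stage can be illustrated with a small pattern matcher. This is a minimal sketch and not the authors' implementation: the `(kind, case)` token pairs, the labels `"kin"`, `"poss"`, `"noun"`, the case names and the one-letter codes are all assumptions made here, and the tagging itself (done with pymorphy2 in the paper) is taken as given. Following the stated rule that the first kinship term may stand in any case, the single kinship term of type 1 is not required to be genitive (the paper's own example бабушкиному мужу is dative).

```python
import re

def encode(token):
    """Map a hypothetical (kind, case) pair to a one-letter code."""
    kind, case = token
    if kind == "kin":
        return "g" if case == "gen" else "k"  # genitive vs. any other case
    if kind == "poss":
        return "P"  # possessive adjective or possessive pronoun
    if kind == "noun" and case == "gen":
        return "n"  # non-kinship noun in the genitive case
    return "?"      # anything the schemas do not cover

# "[kg]" = first kinship term in any case, "g*" = n further genitive kinship terms.
SEQUENCE_TYPES = [
    (1, re.compile(r"P[kg]")),
    (2, re.compile(r"[kg]g*Pg")),
    (3, re.compile(r"[kg]g*P")),
    (4, re.compile(r"[kg]g*n")),
    (5, re.compile(r"[kg]g*")),
]

def classify_sequence(tokens):
    """Return the sequence type (1-5), or None if no schema fits."""
    code = "".join(encode(token) for token in tokens)
    for type_id, pattern in SEQUENCE_TYPES:
        if pattern.fullmatch(code):
            return type_id
    return None

# муж бабушки моей сестры: nominative head, genitive chain, pronoun before the last term
print(classify_sequence([("kin", "nom"), ("kin", "gen"), ("poss", "gen"), ("kin", "gen")]))
```

Under these assumptions, the unsupported sequence шурина моего сын from the Discussion section yields `None`, because its last kinship term is not in the genitive case.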
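The casting to normal form and the reordering into the direct relation order, as described in the Text Search stage, can be sketched as follows. The tiny hand-written `LEXICON` stands in for the morphological analysis that the paper performs with pymorphy2; the lexicon, the `KINSHIP_TERMS` set and the function name `to_direct_order` are hypothetical and cover only the example phrases from the text.

```python
# Hypothetical toy lexicon: inflected form -> (normal form, kind).
# Possessive pronouns map to the personal pronoun, possessive adjectives to their stem noun.
LEXICON = {
    "муж": ("муж", "noun"),
    "мужу": ("муж", "noun"),
    "мужем": ("муж", "noun"),
    "бабушки": ("бабушка", "noun"),
    "сестры": ("сестра", "noun"),
    "подруги": ("подруга", "noun"),
    "моей": ("я", "possessive"),
    "бабушкиному": ("бабушка", "possessive"),
}
KINSHIP_TERMS = {"муж", "бабушка", "сестра"}

def to_direct_order(tokens):
    """Normalize a sequence and order it from the possessor to the head relative."""
    nouns, possessor = [], None
    for token in tokens:
        lemma, kind = LEXICON[token.lower()]
        if kind == "possessive":
            possessor = lemma  # owner of the innermost noun of the chain
        else:
            nouns.append(lemma)
    # A genitive chain lists the head first, so the direct order is its reverse.
    chain = nouns[::-1]
    if possessor is not None:
        chain.insert(0, possessor)
    # A chain that starts with a kinship term gets the implicit first person.
    if chain and chain[0] in KINSHIP_TERMS:
        chain.insert(0, "я")
    return chain

print(to_direct_order("мужу бабушки сестры моей".split()))
# ['я', 'сестра', 'бабушка', 'муж']
```

The sketch relies on the observation that in sequence types 1 to 3 the possessive word always modifies the last noun of the chain, so its position before or after that noun does not change the result.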
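The node placement rules of the Graph Visualization stage can be sketched in plain Python, without the NetworkX and Matplotlib drawing calls. The edge format `(placed_node, relation, new_node)` and the function name `place_nodes` are assumptions of this sketch. The example graph is a simplified version of муж бабушки моей сестры in which only the mother is kept as the shared parent.

```python
# Offset of a new node relative to an already placed one, per relation.
OFFSETS = {
    "parent": (0, 1),   # a parent is one point higher than the child
    "child": (0, -1),   # a child is one point lower than the parent
    "spouse": (1, 0),   # a spouse is on the same level, one point to the right
}

def place_nodes(root, edges):
    """Assign (x, y) coordinates; each edge says what new_node is to placed_node.

    Edges must be ordered so that placed_node already has coordinates.
    """
    positions = {root: (0, 0)}  # the root node gets zero coordinates
    taken = {(0, 0)}
    for placed_node, relation, new_node in edges:
        if new_node in positions:
            continue
        x, y = positions[placed_node]
        dx, dy = OFFSETS[relation]
        x, y = x + dx, y + dy
        while (x, y) in taken:  # occupied spot: shift one point to the right
            x += 1
        positions[new_node] = (x, y)
        taken.add((x, y))
    return positions

edges = [
    ("я", "parent", "мать"),
    ("мать", "child", "сестра"),
    ("мать", "parent", "бабушка"),
    ("бабушка", "spouse", "муж"),
]
print(place_nodes("я", edges))
```

Here сестра would land on the root's spot (0, 0), so the collision rule moves her to (1, 0). The resulting dictionary has the shape that NetworkX drawing functions accept as node positions.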