=Paper= {{Paper |id=Vol-2953/SEIM_2021_paper_11 |storemode=property |title=Visualizing Russian Kinship Term Possessive Sequences as Family Trees |pdfUrl=https://ceur-ws.org/Vol-2953/SEIM_2021_paper_11.pdf |volume=Vol-2953 |authors=Anna Golub,Alyona Belova,Gleb Bondarenko,Darya Karmaz }} ==Visualizing Russian Kinship Term Possessive Sequences as Family Trees== https://ceur-ws.org/Vol-2953/SEIM_2021_paper_11.pdf
                            Visualizing Russian kinship term
                           possessive sequences as family trees
                                Anna Golub                                                              Alyona Belova
                Faculty of Infocommunication Technologies                                Faculty of Infocommunication Technologies
                              ITMO University                                                          ITMO University
                           St. Petersburg, Russia                                                   St. Petersburg, Russia
                        anna.golub.923@gmail.com                                              belova.alyona.itmo@gmail.com

                             Gleb Bondarenko                                                            Darya Karmaz
                Faculty of Infocommunication Technologies                                Faculty of Infocommunication Technologies
                              ITMO University                                                          ITMO University
                           St. Petersburg, Russia                                                   St. Petersburg, Russia
                           bondoll2001@mail.ru                                                    dasha.karmaz@yandex.ru



    Abstract: As frequently as they are encountered in texts of
various genres, Russian kinship term possessive sequences remain
confusing even for native speakers. The paper presents the authors’
original computer science project, whose goal was to suggest a
method of extracting such word sequences from a piece of text and
visualizing them in an easily comprehensible way. Such an attempt at
automating the analysis of kinship relations might contribute to
future research in history, linguistics, and literary studies and be
of use to those studying Russian as a foreign language.

    Keywords: kinship term, possessive structure, family tree,
natural language processing, Russian

                         I. INTRODUCTION
    Russian possessive sequences including several kinship terms
(e. g. сестра мужа моей тёщи, ‘my mother-in-law’s husband’s sister’;
бабушка шурина его брата, ‘his brother’s brother-in-law’s grandma’)
often appear confusing in written text as well as in oral speech
since it is difficult to quickly work out the relations between the
relatives mentioned. Besides, such phrases often include names of
relatives by marriage (тёща, ‘a wife’s mother’; шурин, ‘a wife’s
brother’, etc.). Those kinship terms are becoming increasingly
obsolescent in modern Russian: they account for only 1.4 per cent of
all kinship term entries in the Russian National Corpus [1] for
1990–2020. Therefore, appearing in a sequence, they make it even
harder to comprehend.

    The goal of the authors’ original computer science project was
to create a computational tool that would extract the sequences in
question from a given piece of text and visualize them in an easily
comprehensible way. Such an attempt at automating the analysis of
kinship relations might contribute to future research in history,
linguistics, and literary studies, e. g. scientific analysis and
systematization of fiction and memoirs. Moreover, this technology
might be used by those studying Russian as a foreign language to
make the meaning of Russian kinship terms easier to comprehend and
memorize.

    The following text processing stages were suggested for the
computational tool:
     1.   finding kinship term possessive sequences in the given
          text;
     2.   analyzing the family relations that each of the sequences
          represents and building graphs upon them;
     3.   visualizing the graphs as family trees using the existing
          tools for data visualization.

                       II. STATE OF THE ART
A. Kinship Term Possessive Sequences
    To our knowledge, a comprehensive solution to the problem has
not yet been suggested. A great deal of research has been done on
kinship term systems and the grammar of possession in various
languages. However, possessive structures with kinship terms have
not been subject to in-depth investigation. A few researchers
touched on the topic of kinship terms in their studies of possessive
structures (Dahl & Koptjevskaja-Tamm 2001 [2], Paykin & Van Peteghem
2003 [3], Jones 2010 [4]). Unfortunately, the approach taken was
exclusively theoretical, and in addition, the only type of phrase
discussed was that consisting of a possessive pronoun and a single
kinship term, such as her sister. Aside from that, some papers
describing data collected through fieldwork mention kinship term
possessive sequences, but that kind of research does not seem
applicable to building a tool for their computational processing.

B. Family Tree Visualization
    In terms of visualization, there are many software tools that
are suitable for depicting family trees. To begin with, specialized
packages (e. g. ggenealogy [5]) provide plotting methods for
genealogical data. However, the hierarchical nature of the structure
they draw does not allow horizontal edges and node skipping, which
are required for a valid representation of relationships in kinship
term possessive sequences. The same holds for libraries with more
general visualization functionality (e. g. Graphviz [6], Plotly [7],
Toytree [8]). Furthermore, these visualization tools have rather
limited customization options, while the goal was to display
confusing lineages in the most efficient way.

    Additionally, to our knowledge, some other family tree
visualization tools appropriate for the assigned task [9] are
unfortunately unavailable for public use.




Copyright © 2021 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Fig. 1. Visualizing a family tree with Toytree

    Moreover, there is genealogy software with intuitive family
tree builders for illustrating pedigrees (e. g. Family Historian,
Legacy Family Tree, etc. [10]). Although these applications
represent kinship relationships in a comprehensive, visually
organized manner, they require user interaction, which makes
automatic visualization significantly more complicated.

Fig. 2. Visualizing a kinship term sequence with Family Historian

    Taking all of the above into account, we decided to use
NetworkX [11] and Matplotlib [12] to develop our own visualization
algorithm that best fits the requirements.

                         III. SOLUTION
A. Text Search
    All the source code is written in Python. First, using NLTK, a
Python library for natural language processing [13], the given piece
of text is split into sentences; then each of them is tokenized into
words. Next, employing pymorphy2 [14] for part-of-speech tagging and
further morphological analysis, continuous word sequences are
extracted from the sentences. At this point the sequences consist
of:
    • one or more kinship terms, the first of which may be in any
      case while all the rest must be in the genitive case;
    • certain kinds of modifiers, namely long- or short-form
      adjectives and participles, ordinal numerals and adjective
      pronouns;
    • not more than one non-kinship noun in the genitive case. If
      included, this word is the last one in the sequence (see
      sequence type 4 below).

    Afterwards, each of the sequences is recognized as one of the
sequence types listed below, which cover most instances of kinship
term possessive sequences in Russian. The GEN abbreviation stands
for the genitive case, while n signifies the possible number of the
word’s occurrences.

    1.   possessive adjective / possessive pronoun
           + kinship term (GEN)
         Example: бабушкиному мужу
                  grandma's   husband
                  ‘grandma’s husband’

    2.   kinship term
           + kinship term (GEN), n = 0, 1, 2, ...
           + possessive adjective / possessive pronoun
           + kinship term (GEN)
         Example: муж     бабушки моей сестры
                  husband grandma my   sister
                  ‘my sister’s grandma’s husband’

    3.   kinship term
           + kinship term (GEN), n = 0, 1, 2, ...
           + possessive adjective / possessive pronoun
         Example: мужа    бабушки сестры моей
                  husband grandma sister my
                  ‘my sister’s grandma’s husband’

    4.   kinship term
           + kinship term (GEN), n = 0, 1, 2, ...
           + noun (non-kinship) (GEN)
         Example: мужем   бабушки подруги
                  husband grandma friend
                  ‘friend’s grandma’s husband’

    5.   kinship term
           + kinship term (GEN), n = 0, 1, 2, ...
         Example: мужу    бабушки сестры
                  husband grandma sister’s
                  ‘sister’s grandma’s husband’

    Then, the sequence is cast to the so-called normal form:
    • kinship terms are put in the nominative singular form;
    • non-kinship nouns are put in the nominative case with the
      number unchanged;
    • possessive adjectives are replaced with their stem nouns in
      the nominative singular form;
    • possessive pronouns are replaced with the corresponding
      personal ones in the nominative case (the forms of the свой
      pronoun are substituted by некто).

    Afterwards, the sequence is reshuffled so as to put the words
in the direct relation order (see examples below). If the sequence
then starts with a kinship term, the first-person singular pronoun я
is inserted at the beginning.
Examples:
   a. бабушкиному мужу → я бабушка муж
      grandma’s husband → me grandma husband
   b. мужу бабушки сестры моей → я сестра бабушка муж
      husband grandma sister my → me sister grandma husband
   Thus, all the words in the sequence, except for the first
one, turn out to be kinship terms in the nominative singular
form. Preprocessed this way, the sequence is fit for further
analysis.
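
    The casting to normal form and the reordering described in this
subsection can be illustrated with a minimal sketch. It is not the
authors' implementation: the LEXICON dictionary is a hypothetical
stand-in for the morphological analysis that pymorphy2 performs, and
the reordering rule (possessor first, then the kinship terms in
reverse order) is an assumption inferred from examples a and b.

```python
# Hypothetical toy lexicon (illustration only): surface form ->
# (normal form, word kind). A real tool would obtain this from a
# morphological analyzer instead of a hand-written table.
LEXICON = {
    "муж": ("муж", "kin"),
    "мужу": ("муж", "kin"),
    "мужем": ("муж", "kin"),
    "бабушки": ("бабушка", "kin"),
    "сестры": ("сестра", "kin"),
    "бабушкиному": ("бабушка", "poss_adj"),  # possessive adjective -> stem noun
    "моей": ("я", "poss_pron"),              # possessive pronoun -> personal pronoun
    "подруги": ("подруга", "noun"),          # non-kinship noun
}
KINSHIP_TERMS = {"муж", "бабушка", "сестра"}


def to_normal_form(words):
    """Return the sequence in normal form and direct relation order."""
    kin_chain = []      # kinship terms in their original order
    possessor = None    # normalized possessive word or non-kinship noun
    for word in words:
        normal, kind = LEXICON[word.lower()]
        if kind == "kin":
            kin_chain.append(normal)
        else:
            possessor = normal
    # Assumed direct order: possessor first, kinship terms reversed.
    ordered = ([possessor] if possessor else []) + kin_chain[::-1]
    # A sequence that starts with a kinship term is anchored to "я".
    if ordered[0] in KINSHIP_TERMS:
        ordered.insert(0, "я")
    return ordered


print(to_normal_form("бабушкиному мужу".split()))
print(to_normal_form("мужу бабушки сестры моей".split()))
print(to_normal_form("мужем бабушки подруги".split()))
```

    In this sketch a type 4 sequence keeps its non-kinship noun as
the first word, so no pronoun is inserted, while types 1 and 5 start
with a kinship term after normalization and are therefore prefixed
with я.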
B. Kinship Relations Analysis
    For each word in the sequence, except for the first one, the
following actions are performed.
    First, a graph fragment template is loaded. The template
presents the relationship between this and the previous character in
the sequence. Here, by character we mean any relative in the chain
of connections described by the sequence. Within the template all
the characters are connected directly, either through a parent-child
or a wife-husband type of connection. See the сестра (sister)
template below as an example.

              Fig. 3. “Сестра” (“sister”) template

    Next, the template is incorporated into the graph that has been
built so far: the root of this template is aligned with the top of
the previous one. Here, by the template top we mean the character
signified by this template’s corresponding word. In turn, the
template’s root is the character whose relation to the top of the
template is being described by the template word. (In the sister
template above я (me) is the root while сестра (sister) is the top.)
Consequently, by the tree root we mean the root of the first
template and by the tree top the top of the last template included
in the graph. See the example below: the бабушка (grandmother)
template being aligned with the сестра (sister) template. (For now,
we only discuss the maternal grandmother.)

              Fig. 4. Template alignment

    Finally, if multiple nodes of the graph correspond to the same
character of the sequence, they are merged into one. Such cases are
resolved based on the heuristic that each character may only have
one mother and one father. The graph above is thus transformed into
the following one.

              Fig. 5. The final graph structure

    Thus, in the resulting graph all the characters are connected
directly to each other. It is worth mentioning that due to
linguistic polysemy, the relationship reflected in a kinship term
might be described by a few different templates (e. g. in the above
example the grandmother might be either maternal or paternal).
Consequently, several graph structures are built, considering all
the possible template combinations for the kinship terms in the
sequence.

              Fig. 6. The alternative graph structure
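
    The template alignment, node merging and enumeration of
alternatives can be sketched as follows. This is a simplified
illustration under stated assumptions, not the authors' code: the
TEMPLATES table is hypothetical, each template is reduced to a
linear path of basic steps from its root to its top (the actual
templates may contain more characters), and merging is limited to
the one-mother, one-father heuristic.

```python
from itertools import product

# Hypothetical templates: kinship term -> alternative paths of basic
# steps leading from the template root to the template top.
TEMPLATES = {
    "сестра": [["mother", "daughter"]],
    "бабушка": [["mother", "mother"], ["father", "mother"]],  # maternal / paternal
    "муж": [["husband"]],
}
GENDER = {"mother": "F", "father": "M", "daughter": "F",
          "son": "M", "husband": "M", "wife": "F"}
PARENT_STEPS = {"mother", "father"}
CHILD_STEPS = {"daughter", "son"}


def build_graph(paths):
    """Chain templates so that each root is the previous template's top."""
    edges = []            # (source node, step, target node)
    parent_of = {}        # (child node, "mother"/"father") -> parent node
    gender = {0: None}    # node 0 is the tree root; its gender is unknown
    current = 0
    for path in paths:
        for step in path:
            if step in PARENT_STEPS and (current, step) in parent_of:
                # Merge: a character has only one mother and one father.
                current = parent_of[(current, step)]
                continue
            target = len(gender)
            gender[target] = GENDER[step]
            edges.append((current, step, target))
            if step in PARENT_STEPS:
                parent_of[(current, step)] = target
            elif step in CHILD_STEPS and gender[current] is not None:
                role = "mother" if gender[current] == "F" else "father"
                parent_of[(target, role)] = current
            current = target
    return edges, current  # current is now the tree top


def all_graphs(terms):
    """Build one graph per combination of alternative templates."""
    options = [TEMPLATES[term] for term in terms]
    return [build_graph(combo) for combo in product(*options)]


for edges, top in all_graphs(["сестра", "бабушка", "муж"]):
    print(top, edges)
```

    For the normalized sequence я сестра бабушка муж the sketch
yields two graphs. In the maternal one, the sister's mother is
merged with the root's mother, so the grandmother is attached to an
already existing node.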

C. Graph Visualization
    The family trees are drawn using NetworkX and
Matplotlib in Python. All the edges of the graph are added to
the list in the loop that goes through the characters and their
connections. The root node gets zero coordinates; for the other
nodes the following rules apply:
   • parent is positioned one point higher than their child;
   • child is positioned one point lower than their parent;
   • wife/husband is drawn at the same level and one point
     to the right of their spouse;
   • if the determined spot is already taken, the node is
     shifted one point to the right.
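
    The positioning rules above can be sketched in a few lines of
plain Python. The function name, the edge format and the node labels
are assumptions made for illustration, and the edges are assumed to
be listed so that every source node has already been placed.

```python
# Offsets for the three relations; "parent" means that the target is
# the parent of the source, and so on.
OFFSETS = {"parent": (0, 1), "child": (0, -1), "spouse": (1, 0)}


def assign_positions(edges, root):
    """Assign (x, y) coordinates to nodes, starting from the root."""
    positions = {root: (0, 0)}
    for source, relation, target in edges:
        x, y = positions[source]
        dx, dy = OFFSETS[relation]
        spot = (x + dx, y + dy)
        # If the spot is taken, shift the node to the right.
        while spot in positions.values():
            spot = (spot[0] + 1, spot[1])
        positions[target] = spot
    return positions


# Hypothetical edge list for "муж бабушки моей сестры"
# (maternal grandmother variant).
edges = [
    ("я", "parent", "мать"),
    ("мать", "child", "сестра"),
    ("мать", "parent", "бабушка"),
    ("бабушка", "spouse", "муж"),
]
positions = assign_positions(edges, root="я")
print(positions)
```

    A dictionary of this shape, mapping nodes to coordinate pairs,
is the form that NetworkX drawing functions accept through their pos
argument.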
   As to colors, the following rules apply:
   • for the tree top and the tree root nodes, blue color is
     used;
   • otherwise, if the character is directly mentioned in the
     kinship term sequence, the node has a light blue color;
   • if there is no direct mention of the character, the node
     is painted light gray.

    The gender is displayed through the shape of the nodes: circles
for females, squares for males; a rhombus is used for the root node.
As a result, a PNG file presents the graph on a gray background
together with the kinship term sequence and the original sentence.
The file is the final program output. See the visualization of the
sequence муж бабушки моей сестры (my sister's grandmother's husband)
as an example.

                    Fig. 7. The program output

                         IV. EVALUATION
A. Evaluation Process
    In order to evaluate the tool’s performance, it was run on a
purposefully collected corpus of texts, selected manually from the
Russian National Corpus. The corpus consists of 3067 words and
includes at least five entries of each of the kinship terms while
keeping a rough balance between the sequence types.

    For text search evaluation, each kinship term possessive
sequence in the corpus was manually classified as follows:
   • True Positive: the sequence was found, and its boundaries
     were identified correctly;
   • False Positive: the sequence was found, but extra words
     were included;
   • True Negative: there are no sequences in the sentence,
     and none were found;
   • False Negative: the sequence was found, but some
     necessary words were excluded.
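
    Given counts for these classes, precision and recall follow
from the standard formulas, sketched below. The counts in the usage
line are made up purely for illustration and are not the figures
from the evaluation corpus.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from classification counts."""
    precision = tp / (tp + fp)  # share of found sequences with no extra words
    recall = tp / (tp + fn)     # share of found sequences with no missing words
    return precision, recall


# Hypothetical counts, NOT taken from the paper.
precision, recall = precision_recall(tp=90, fp=5, fn=10)
print(round(precision, 2), round(recall, 2))
```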
    The precision and recall scores were then calculated, amounting
to 0.96 and 0.93 respectively.
    Then, for each of the sequences found by the program the graphs
were drawn to evaluate the kinship relations analysis and
visualization. For each of the sequences, the numbers of expected
and actually produced correct visualizations were determined
manually, with the resulting accuracy score being 0.95.

B. Discussion
    As the evaluation has revealed flaws in the tool’s performance,
a few areas for future work are suggested.
   • Processing proper names. As of now, the tool cannot correctly
     process input sequences including first name + patronymic
     collocations (e. g. сын Анны Ивановны, ‘Anna Ivanovna’s son’)
     or abbreviated name forms (e. g. сын Вл. Набокова, ‘V.
     Nabokov’s son’).
   • Broadening the range of sequence types. For example, the tool
     cannot correctly process the sequence below because it does not
     fit any of the sequence type schemas.

       a.   шурина         моего сын
            wife’s brother my    son
            ‘my wife’s brother’s son’

   • Adding context analysis features for better differentiation
     between the sequence types.
   • Updating the template approach. At the relations analysis
     stage, complex kinship terms can be replaced with their simpler
     explanations, e. g. turning тёща (mother-in-law) into мать жены
     (a wife’s mother), which would make it possible to store
     templates only for the basic kinship terms, namely parents,
     children, siblings and spouses.
   • Identifying coreference. At this point, the program does not
     recognize several words referring to the same character, as in
     сын моего отца (my dad’s son), and is unable to depict that in
     the graph.
   • Making the tool adaptable to other languages by removing the
     language dependencies in the code.

                         V. CONCLUSION
    This paper has presented kinship term possessive sequences as a
field for natural language processing development and has
introduced the authors’ original tool for visualizing such sequences
in a human-readable form. The program appears quite efficient and
performs well on a broad range of input data; however, the suggested
paths for further development might push its limits significantly.
    The project source code is available on GitHub. The evaluation
corpus, as well as the list of kinship terms processed by the
program, can also be viewed there.
    The tool is also available for public use as a Python package.
As of now, the users are able to:
   • extract sequences from a given piece of text;
   • build a graph upon a given sequence;
   • visualize an already-built graph or sequences from a given
     piece of unprocessed text.

                           REFERENCES
[1]  https://ruscorpora.ru/new/
[2]  Dahl, Östen & Koptjevskaja-Tamm, Maria. (2001). 11. Kinship in
     grammar. doi: 10.1075/tsl.47.12dah.
[3]  Paykin, K., van Peteghem, M. External vs. Internal Possessor
     Structures and Inalienability in Russian. Russian Linguistics
     27, 329–348 (2003).
[4]  Jones, Doug. (2010). Human kinship, from conceptual structure
     to grammar. The Behavioral and Brain Sciences. 33. 367-404.
[5]  Rutter L, VanderPlas S, Cook D, Graham MA (2019). “ggenealogy:
     An R Package for Visualizing Genealogical Data.” Journal of
     Statistical Software, 89(13), 1–31.
[6]  "Graphviz and Dynagraph – Static and Dynamic Graph Drawing
     Tools", by John Ellson, Emden R. Gansner, Eleftherios
     Koutsofios, Stephen C. North, and Gordon Woodhull, in Jünger &
     Mutzel (2004).
[7]  Plotly Technologies Inc. Collaborative data science. Montréal,
     QC, 2015. https://plot.ly
[8]  https://toytree.readthedocs.io/en/latest/
[9]  Borges, J. (2019). A contextual family tree visualization
     design. Information Visualization.
[10] https://www.toptenreviews.com/software/home/best-genealogy-software/
[11] Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart,
     “Exploring network structure, dynamics, and function using
     NetworkX”, in Proceedings of the 7th Python in Science
     Conference (SciPy2008), Gaël Varoquaux, Travis Vaught, and
     Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008.
[12] J. D. Hunter, "Matplotlib: A 2D Graphics Environment",
     Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95,
     2007.
[13] Bird, Steven, Edward Loper and Ewan Klein (2009), Natural
     Language Processing with Python. O’Reilly Media Inc.
[14] Korobov M.: Morphological Analyzer and Generator for Russian
     and Ukrainian Languages // Analysis of Images, Social Networks
     and Texts, pp 320-332 (2015).