=Paper= {{Paper |id=Vol-1399/paper8 |storemode=property |title=Tolstoy Digital: Mining Biographical Data in Literary Heritage Editions |pdfUrl=https://ceur-ws.org/Vol-1399/paper8.pdf |volume=Vol-1399 |dblpUrl=https://dblp.org/rec/conf/bd/Bonch-Osmolovskaya15 }} ==Tolstoy Digital: Mining Biographical Data in Literary Heritage Editions== https://ceur-ws.org/Vol-1399/paper8.pdf
       Tolstoy Digital: Mining Biographical Data in Literary Heritage Editions
                                     Anastasia Bonch-Osmolovskaya, Matvey Kolbasov
                                   National Research University Higher School of Economics
                                       20 Myasnitskaya Ulitsa; Moscow, Russia; 101000
                                       abonch@gmail.com, matveykolbasov@yandex.ru

                                                              Abstract
This paper presents a solution for mining biographical information from the commentaries on Leo Tolstoy’s letters. It is implemented
as a part of the Tolstoy Digital project, a semantically marked-up web publication of the 90-volume complete collection of Leo
Tolstoy’s works. The extracted biographical information will be used to create an open database of all the persons who were
in some way connected with Tolstoy or Tolstoy’s works. The paper also accounts for various subtleties of the commentary apparatus and
pays special attention to specific difficulties of biographical information extraction, such as the problem of defining the boundaries of
expressions denoting profession, or the problem of non-standardized syntactic constructions for kinship relations.
Keywords: Leo Tolstoy, commentary apparatus, biographical database, semantic edition



                 1. Project Description

The Tolstoy Digital project [1] aims to prepare a web-published, semantically marked-up version of the
90-volume complete collection of Leo Tolstoy’s works [2]. The digital version of the 90-volume edition has
become easy to access thanks to the mass crowdsourcing campaign “All of Tolstoy in one click” [3]. The next
step of the Tolstoy digitization is devoted to the semantic tagging of Tolstoy’s texts and the creation of a
comprehensive database of all the additional reference information that accompanies Tolstoy’s oeuvre and
private archive. The 90-volume edition (Tolstoy 1928-1964) comprises an exhaustive critical apparatus, which
contains relevant information on Tolstoy’s works, his life, and the people connected with him. The current
research, done as part of the Tolstoy Digital project, is ongoing work on fact extraction from literary
commentaries. The edition contains 21 volumes of letters, dated from 1844 to 1910, the year of Tolstoy’s
death. Each letter is followed by a detailed commentary, which provides biographical references for the
addressees and the persons mentioned in the letter. Our aim is to analyze the unstructured text of a
commentary, to extract person names and relevant biographical information, and to use TEI semantic
annotations for the relevant mark-up (Barnard, et al 1995). The extracted data will be aggregated and stored
in a reference database. The database will be linked with the text of the semantic edition.

The main aim of our current experiment was to estimate the effectiveness of a rule-based approach to data
analysis and fact extraction for our type of texts. We started with the most basic concepts and facts: names,
dates, professions and family relations. The problem of automatic named entity recognition and fact
extraction seems to be very well elaborated; see (Grishman 2003, and Jurafsky Martin 2009) for a
comprehensive outline and reference list. Still, information extraction from academic philological texts
pursues slightly different aims than mining the news flow: there are graphical, linguistic and conceptual
problems. Specific graphics (such as abbreviations, inner references, font highlighting) may have non-trivial
semiotic functions. Some patterns that are uncommon in general language may be used to render the latent
recurring logic of the commentary structure. Finally, the notions that are relevant for characterizing
historical events or persons may have little in common with the conceptual space of data mining on the web
(the domain primarily addressed in the works on information extraction). This is why the task of biographical
information extraction from academic editions cannot be solved with existing solutions as they are, but
requires some specific modifications of the process. This paper reports on the first steps that we have made
in this direction. We used the rule-based toolbox Tomita to write and apply basic grammar rules that extract
relevant ontology concepts. In part (2) we present a short outline of the biographical ontology we are going
to elaborate and briefly describe the preliminary data preparation. Part (3) is dedicated to the description
of our grammar and the rules for biographical fact extraction. Evaluation and analysis are presented in
part (4).

[1] See the project description here: http://tolstoy.ru/projects/tolstoy-digital/
[2] http://tolstoy.ru/creativity/90-volume-collection-of-the-works/
[3] See, for example, the review in the Guardian:
    http://www.theguardian.com/books/2013/oct/16/all-leo-tolstoy-one-click-project-digitisation

                 2. Textual material and basic ontology

The commentary apparatus of the epistolary volumes of the complete 90-volume edition consists of letter
commentaries, which constitute about 40% of the entire text. Usually they are organized in a non-structured
way as a sequence of factual comments. Some of them are hard to classify, and many of them do not seem to fit
any database, which makes them redundant. This lack of explicit structure can be explained by the fact that
the commentaries have been created by different authors, and

each of them had used his own text template. As a result, the commentaries are presented as accompanying
text rather than as an enumeration of the properties and parameters of a common database. An example of a
commentary is provided in Table 1.

 Russian
   Анатолий Иванович Фаресов (1852—1928), публицист-народник. [Судился по «делу 193»; был амнистирован в
   1880 г. и перешел в лагерь умеренных либералов.] Сотрудничал в «Новостях», в «Неделе» и других изданиях.
   Познакомился с Толстым 1 февраля 1898 г. [и оставил неопубликованные воспоминания о нем]. Статья Фаресова
   под заглавием «Ко вчерашнему происшествию редакции в «Недели» была напечатана в «Новом времени», № 7208
   от 23 марта

 English
   Anatoly Ivanovich Faresov (1852-1928), writer-populist. [He was tried in “Case 193”; he was pardoned in
   1880 and joined the camp of moderate liberals.] He worked for the “News,” the “Week,” and other
   publications. He met Tolstoy on February 1, 1898 [and left unpublished memoirs about him]. Faresov’s
   article titled “On yesterday’s incident at the editorial office of the ‘Week’” was published in the
   “New Times”, № 7208 of 23 March

Table 1. An example of biographical commentary in the 90-volume Tolstoy’s edition.

This example illustrates the complexity of the information in the comments. We have marked with square
brackets all information that cannot be structurally formalized. That means that, although we have at our
disposal a considerable corpus of short biographical texts, it is still a collection of unstructured texts
with a high degree of lexical variation and a mixture of relevant and excessive factual statements. We aim to
transfer an unstructured commentary text into a structured database with given semantic relationships
between the elements. For the first stage of our project, we limit ourselves to a small list of
relationships. Table 2 shows the relevant facts, which have been filtered out of the text given in Table 1.

 Text | Type of the information | Current relevance of the information
 Anatoly Ivanovich Faresov (1852-1928) | The head of biographical information | YES
 writer-populist | Main fact | YES
 He was tried in “Case 193”; he was pardoned in 1880 and joined the camp of moderate liberals. | Additional political fact | NO
 He worked for the “News,” the “Week,” and other publications. | Additional profession fact | NO
 He met Tolstoy on February 1, 1898 | Tolstoy fact | YES
 and left unpublished memoirs about him. | Additional publicistic fact | NO
 Faresov’s article titled “On yesterday’s incident at the editorial office of the ‘Week’” was published in the “New Times”, № 7208 of 23 March | Faresov’s publicistic fact | YES

Table 2. The choice of facts and relationships to be extracted from the commentaries.

The corpus of letters used for rule development and testing was based on the 63rd volume of the Complete
edition, which contains 281 letters by Tolstoy from the period 1880-1886. We manually annotated 50 letters
using the annotation module of the GATE framework (Cunningham, et al 2002). We annotated all professions,
kinship terms and relationships, and birth and death dates. The observations made on this small annotated
corpus were used to develop grammatical rules for biographical fact extraction. For example, we found that
the mention of a person who has not been commented on before in the text corresponds to a certain robust
text structure. We called it the pattern of first mention. It consists of a personal name, dates of life, and
additional information, which may be a profession or kinship references to other persons (first of all, to

Tolstoy). Finally, we built a small ontology comprising the main entities and their attributes, and matched
it with our annotated sentences. The ontology is presented in Figure 1. The text spans referring to the
annotation are printed in bold.
  
  RUS:      Николай Николаевич Страхов (1828—1896) — критик и философ.
  ENG:      Nikolay Nikolayevich Strahov (1828—1896) — critic and philosopher.
            
            RUS:      Анастасия Васильевна Дмоховская, урожд. Воронец. О ней см. в т. 49
            ENG:      Anastasia Vasil’yevna Dmohovskaya, maiden name Voronec. See vol.49.
            
            RUS:      Евдокия Александровна Новосильцева (р. 1861 г.), в замужестве Регекампф.
            ENG:      Evdokiya Alexandrovna Novosil’ceva (born 1861), married name Regekampf.
  
  RUS:      Владимир Иванович Даль ( 1801 — 1872 ) — известный лексикограф и этнограф.
  ENG:      Vladimir Ivanovich Dal’ ( 1801 — 1872 ) — famous lexicographer and ethnographer.
  
  RUS:      Михаил Александрович Энгельгардт (р. 1861 г. — ум. 21 июля 1915 г.)
  ENG:      Mihail Alexandrovich Engelgardt (born 1861 — died 21 July 1915)
  
  RUS:      Гр. Сергей Николаевич Толстой (1826—1904) — старший брат Льва Николаевича.
  ENG:      Count Sergey Nikolayevich Tolstoy (1826—1904) — the elder brother of Leo Tolstoy.
            
            RUS:      Иван Васильевич Сютаев (р. 1856), крестьянин дер. Шевелино
            ENG:      Ivan Vasil’evich Sutaev (born 1856), peasant of Shevelino village
  
  RUS:      Афанасий Афанасьевич Фет (Шеншин) (1820—1892) — поэт.
  ENG:      Afanasiy Afanas’evich Fet (Shenshin) (1820—1892) — poet.
            
            RUS:      Михаил Матвеевич Стасюлевич (1826—1911) — общественный деятель, историк и
            публицист, с 1865 года редактор-издатель журнала «Вестник Европы».
            ENG:      Mihail Matveevich Stasulevich (1826—1911) — social activist, historian and publicist, since 1865
            editor and publisher of the magazine "Herald of Europe".
  
  
  RUS:      Лев Львович (р. в 1869 г.), третий сын Толстого.
  ENG:      Lev Lvovich (b. 1869), the third son of Tolstoy.
  < friend of: name>
  
  RUS:      Дмитрий Алексеевич Дьяков (1823—1891) — сын Алексея Николаевича и Ирины Дмитриевны,
            рожд. Полторацкой, друг Толстого.
  ENG:      Dmitri Alekseevich Dyakov (1823—1891) — the son of Aleksey Nikolayevich and Irina Dmitrievna, maiden
Figure 1. The ontology of biographical facts and attributes in commentaries to Tolstoy’s letters.
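
The pattern of first mention described above (a personal name, dates of life, then a profession or a kinship
reference) can be approximated with a single regular expression. The following Python sketch is only an
illustration and is not the Tomita grammar used in the project; the names FIRST_MENTION, KINSHIP_TERMS and
parse_first_mention, as well as the tiny kinship word list, are assumptions introduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny kinship word list (not the project's dictionaries).
KINSHIP_TERMS = ("сын", "дочь", "брат", "сестра", "муж", "жена")

# Name, optional alias in parentheses, life dates in parentheses,
# then an optional dash or comma followed by free text.
FIRST_MENTION = re.compile(
    r"^(?P<name>[^()]+?)\s*"
    r"(?:\((?P<alias>[^()\d]+)\)\s*)?"
    r"\(\s*(?:р\.\s*(?:в\s*)?)?(?P<birth>\d{4})(?:\s*г\.)?"
    r"(?:\s*—\s*(?P<death>\d{4}))?\s*\)"
    r"\s*[—,]?\s*(?P<rest>.*)$"
)


@dataclass
class FirstMention:
    name: str
    alias: Optional[str]
    birth: int
    death: Optional[int]
    info: str
    info_kind: str  # "kinship", "profession" or "none"


def parse_first_mention(line: str) -> Optional[FirstMention]:
    """Parse one commentary line; return None if it does not fit the pattern."""
    m = FIRST_MENTION.match(line.strip())
    if m is None:
        return None
    rest = m.group("rest").strip().rstrip(".")
    tokens = re.findall(r"\w+", rest.lower())
    if not rest:
        kind = "none"
    elif any(token in KINSHIP_TERMS for token in tokens):
        kind = "kinship"
    else:
        kind = "profession"
    death = m.group("death")
    return FirstMention(
        name=m.group("name").strip(),
        alias=m.group("alias"),
        birth=int(m.group("birth")),
        death=int(death) if death else None,
        info=rest,
        info_kind=kind,
    )
```

Applied to the Strakhov line from Figure 1, the function yields the name, the years 1828 and 1896, and the
remainder "критик и философ" classified as a profession. Entries that give full day-and-month dates are not
matched by this simplified pattern and would need additional rules.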

For the first stage of our research, we decided to consider the most basic categories, such as name, dates
of life, profession, and kinship. We created several grammatical rules and evaluated them with the help of
our annotated commentary corpus.

                      3. Methods

We used the Tomita parser to create rules for fact extraction (Tomita 1984, 1985). The Tomita parser is a
free NLP platform designed for creating small, lightweight information extraction modules. It can be used as
a tool for extracting structured data (facts) from texts by means of context-free grammars and keyword
dictionaries. The API for the Tomita parser has been developed by the team of Yandex.ru [4] and is available
for free download. Tomita provides modules for morphological analysis, as well as ready-made rules for
extracting names and numbers. Tomita grammars consist of rules. Users may create their own grammars and
dictionaries for a given language. To construct a grammar, the user writes a number of transducer rules,
which may use regular expressions, word lists, and other rules, built-in or user-composed. Each rule has a
left and a right side; the transducer operation is denoted by an arrow separating the two sides. The left
side must be a single nonterminal, while the right side may combine terminals and nonterminals. Examples of
rules can be found in Figures 2-3.

[4] https://tech.yandex.ru/tomita/
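
To make the rule format concrete, the sketch below encodes a few rules as plain Python data: each left side
is a rule name, and each right side is a sequence of terminals (here regular expressions) and other rule
names. This is neither Tomita syntax nor its parsing algorithm; the rule names (LifeDates, DatedName), the
tokenizer and the naive top-down matcher are all assumptions made for illustration.

```python
import re

# Terminals: each one is a regular expression tested against a single token.
TERMINALS = {
    "WORD": re.compile(r"[^\W\d_]+$"),
    "YEAR": re.compile(r"\d{4}$"),
    "DASH": re.compile(r"[—-]$"),
    "LPAR": re.compile(r"\($"),
    "RPAR": re.compile(r"\)$"),
}

# Rules: left side (a nonterminal) -> alternative right sides.
# A right side may mix terminals and other nonterminals.
GRAMMAR = {
    "LifeDates": [
        ["LPAR", "YEAR", "DASH", "YEAR", "RPAR"],
        ["LPAR", "YEAR", "RPAR"],
    ],
    "DatedName": [
        ["WORD", "WORD", "WORD", "LifeDates"],
    ],
}


def tokenize(text):
    """Split text into word/number tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def match(symbol, tokens, pos=0):
    """Return the position after `symbol` matched at tokens[pos:], or None."""
    if symbol in TERMINALS:
        if pos < len(tokens) and TERMINALS[symbol].match(tokens[pos]):
            return pos + 1
        return None
    for right_side in GRAMMAR[symbol]:
        cur = pos
        for part in right_side:
            cur = match(part, tokens, cur)
            if cur is None:
                break  # this alternative failed, try the next one
        else:
            return cur  # every part of this right side matched
    return None
```

For instance, match("DatedName", tokenize("Николай Николаевич Страхов (1828—1896)")) consumes all eight
tokens, because the last element of the right side is itself the nonterminal LifeDates.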

Figure 2. Rule for date extraction.

This rule says that a date is a number consisting of four characters from 0 to 9. Then, we can define date
chains (for example, dates of birth and death) using the previously defined rules. The operator interp in
Figure 3 assigns the extracted text span to a specific fact. Here Dt1 stands for the birth date, while Dt2
corresponds to the date of death:

Figure 3. Rule for date chains extraction.

Using the Tomita parser, we have written a grammar with specific rules and dictionaries aimed at extracting
the following biographical facts from letter commentaries: ProfessionFact (person’s name, dates of life and
professions) and FamilyFact (kinship). We quickly realized that the built-in date-extraction and
name-extraction rules are not well suited to philological commentaries, so we also made some specific
modifications. To describe the kinship relations of a person, we created several dictionaries containing
different kinship types. The kinship dictionary is applied after the pattern is recognized. If no kinship
term is detected in the pattern, the rules that extract profession are applied. Overall, 31 rules and 3
dictionaries were created. The processing scheme is presented in Figure 4.

  INPUT
    RUS: Александр Андреевич Иванов (Шешин) (1806 — 1858) — художник.
    ENG: Alexandr Andreevich Ivanov (Sheshin) (1806 — 1858) — a painter.

    ► PATTERN RECOGNITION
    ► KINSHIP DICTIONARY CHECK
    ► CHOICE OF RULE SET
    ► FACT EXTRACTION

  OUTPUT
    RUS: Александр      ENG: Aleksandr
    RUS: Андреевич      ENG: Andreevich
    RUS: Иванов         ENG: Ivanov
    RUS: Шеньшин        ENG: Shenshin
    1806
    1858
    RUS: художник       ENG: painter

Figure 4. Extracting information from text patterns.

              4. Results and analysis

The performance of these rules was measured on a test commentary corpus of 560,000 tokens. The results are
acceptable for precision but much lower for recall, which is quite typical for rule-based approaches. The
overall scores for both facts are shown in Table 3.
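
The F-measure values reported in Table 3 are consistent with the balanced F-measure, that is, the harmonic
mean of precision and recall. The paper does not state the formula, so the sketch below assumes this standard
definition; the function name f_measure is introduced here for illustration only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 3.
scores = {
    "ProfessionFact": (1.0, 0.82),
    "FamilyFact": (0.78, 0.30),
}

for fact, (p, r) in scores.items():
    print(f"{fact}: F = {f_measure(p, r):.2f}")
# prints:
# ProfessionFact: F = 0.90
# FamilyFact: F = 0.43
```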
  Fact name         Precision         Recall   F-measure
  ProfessionFact    1                 0.82     0.9
  FamilyFact        0.78              0.3      0.43

Table 3. Evaluation of biographical fact extraction.

The analysis of the results reveals that some important problems lie in the conceptual part. First of all,
it is problematic to define a strict classification of what may (or may not) be considered a profession.
Thus, in example 1 the extracted profession is peasant. But in the context of biographical commentaries this
person’s origin (the village of Shevelino) becomes even more important than the profession of peasant itself.
Tolstoy obviously met many peasants during his life, so any specific attributes are of great value in this
context. There remains the problem of distinguishing those cases which are meaningless without additional
information.

Another problem is the boundaries of the text span that refers to the profession itself. Thus, in example 2,
we see a description of Shamil’s social activities. The question is which part of the complex NP should be
extracted as Shamil’s profession. Should it be “leader and consolidator” or the full NP (“leader and
consolidator of hillmen of Daghestan and Chechnya”)?

1) Василий Кириллович Сютаев (1819—1892) — ... крестьянин дер. Шевелино.

Vasiliy Kirillovich Syutaev (1819—1892) — ... peasant of village Shevelino.

2) Шамиль (1797—1874) — знаменитый вождь и объединитель горцев Дагестана и Чечни в их борьбе с русскими.

Shamil (1797-1874) – a famous leader and consolidator of hillmen of Daghestan and Chechnya in their struggle
with Russians.

To sum up, while such professions as writer, musician, and philosopher are good categorizers that define a
group of people connected in one way or another with Tolstoy or his work, such statuses as peasant or
consolidator are meaningless without their attributes (genitive groups), in this case “village Shevelino” and
“hillmen of Daghestan and Chechnya in their struggle with Russians.” Accordingly, a significant problem is
the boundary of the nominal group that determines the professional status. The second problem concerns
intricate chains of kinship relations that have a very specific syntax, as shown in Example 3. In this phrase
there is an inversion (see Figure 5), probably made especially for publication, so that Sophia Tolstaya, the
wife of Leo Tolstoy, would take a more prominent position.

3) Александр Михайлович Кузминский ... муж сестры Софьи Андреевны, Татьяны Андреевны (1846—1925).

Alexandr Mihaylovich Kuzminskiy ... the husband of the sister of Sofiya Andreevna, Tatiana Andreevna
(1846—1925).

         the husband of the sister of SA, TA

Figure 5. Complicated syntactic structures in kinship patterns.

Another problem is the discrepancy between singular and plural forms in descriptions of relations between a
person and a family group, as shown in Example 4.

4) Николай Михайлович Нагорнов (1845—1896) — сын Михаила Михайловича и Надежды Ивановны Нагорновых.

Nikolay Mihaylovich Nagornov (1845—1896) — the son of Mihail Mihaylovich and Nadejda Ivanovna Nagornovs.

Unlike profession extraction, which achieves good quality with a rule-based approach, kinship relation
patterns seem to be much less regular, so it may be worth trying to extract them using machine learning
algorithms. In general, during the next stage of the research we intend to develop the ontology (i.e. add
important locations and relations between person and location), to process all the 31 volumes of letters, to
make an open database of all the persons connected to Tolstoy, and to provide every entry referring to a
person with a short semantically marked-up biography. The database is to be used as an interlinked reference
base for Leo Tolstoy’s 90-volume edition, and also as an aggregator of external information from other
sources. The specific syntactic constructions that are intrinsic to the commentary genre (such as the
abundance of names and titles in a sentence, special means of reference inside the text, etc.) will certainly
be taken into account.

                                     References

Barnard, David T., et al. “Hierarchical encoding of text: Technical problems and SGML solutions.” Text
  Encoding Initiative. Springer Netherlands, 1995. 211-231. Computers and the Humanities 29.3 (1995): 211-231.

Cunningham, Hamish, et al. “GATE: an architecture for development of robust HLT applications.” Proceedings
  of the 40th annual meeting on association for computational linguistics. Association for Computational
  Linguistics, 2002.

Grishman, Ralph. “Information extraction: Techniques and challenges.” Information extraction a
  multidisciplinary approach to an emerging information technology. Springer Berlin Heidelberg, 1997. 10-27.

Tolstoy, Lev. Polnoe sobranie sochinenij v 90 tomah, Moskva, 1928-1954. (Tolstoy, Leo. Complete collected
  works in 90 volumes, Moscow 1928-1954)

Tomita, Masaru. LR parsers for natural languages. COLING. 10th International Conference on Computational
  Linguistics. 1984. P. 354-357.

Tomita, Masaru. An efficient context-free parsing algorithm for natural languages. IJCAI. International
  Joint Conference on Artificial Intelligence. 1985. P. 756-764.

                                  Acknowledgements

This study (research grant No 15-06-99523 А) was supported by the Russian Foundation for Basic Research in
2014-2015.



