<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tolstoy Digital: Mining Biographical Data in Literary Heritage Editions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anastasia Bonch-Osmolovskaya</string-name>
          <email>abonch@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matvey Kolbasov</string-name>
          <email>matveykolbasov@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>20 Myasnitskaya Ulitsa, Moscow, Russia, 101000</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1845</year>
      </pub-date>
      <fpage>48</fpage>
      <lpage>52</lpage>
      <abstract>
        <p>This paper presents a solution for mining the biographical information from commentaries on Leo Tolstoy's letters. It is implemented as a part of Tolstoy Digital Project - a semantically marked-up web publication of the 90-volume complete collection of Leo Tolstoy's works. Extraction of relevant biographical information will be used to create an open database for all the persons who were somehow connected with Tolstoy or Tolstoy's works. The paper also accounts for various subtleties of the commentary apparatus and pays special attention to specific difficulties of biographical information extraction, such as the problem of defining the boundaries of expressions denoting profession, or the problem of non-standardized syntactic constructions for kinship relations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Project Description</title>
      <p>
        The Tolstoy Digital project<sup>1</sup> aims to prepare a
web-published, semantically marked-up version of the
90-volume complete collection of Leo Tolstoy’s works<sup>2</sup>. The
digital version of the 90-volume edition has become easy
to access thanks to the mass crowdsourcing campaign
“All of Tolstoy in one click”<sup>3</sup>. The next step of the
digitization is devoted to the semantic tagging of
Tolstoy’s texts and the creation of a comprehensive
database of all the additional reference information that
accompanies Tolstoy’s works and private archive.
The 90-volume edition
        <xref ref-type="bibr" rid="ref4">(Tolstoy 1928-1964)</xref>
        comprises an
exhaustive critical apparatus, which contains relevant
information on Tolstoy’s works, his life, and the people
connected with him. The current research, carried out as part of the
Tolstoy Digital project, is ongoing work on
fact extraction from literary commentaries. The edition
contains 31 volumes of letters, dated from 1844 to 1910,
the year of Tolstoy’s death. Each letter is followed by a
detailed commentary, which provides biographical references to
the addressees and the persons mentioned in the letter.
Our aim is to analyze the unstructured text of a
commentary, to extract person names and relevant
biographical information, and to mark them up with TEI semantic
annotations
        <xref ref-type="bibr" rid="ref1">(Barnard, et al 1995)</xref>
        .
Extracted data will be aggregated and stored in a
reference database. The database will be linked with the
text of the semantic edition.
      </p>
      <p>The main aim of our current experiment was to estimate
the effectiveness of a rule-based approach to
data analysis and fact extraction for our type of
texts. We started with the most basic concepts and facts:
names, dates, professions and family relations. The
problem of automatic named entity recognition and fact
extraction has been studied extensively; see
(Grishman 2003; Jurafsky and Martin 2009) for a
comprehensive outline and reference list. Still,
information extraction from academic philological texts
pursues slightly different aims than mining the news flow,
and it raises graphical, linguistic and conceptual problems.
Specific graphical devices (such as abbreviations, internal
references, and font highlighting) may have non-trivial
semiotic functions. Some patterns that are uncommon in
general language may be used to render the latent
recurring logic of the commentary structure. Finally, the
notions that are relevant for characterizing historical
events or persons may have little in common with the
conceptual space of data mining on the web (the domain
that most work on information extraction addresses).
For this reason, the task of
biographical information extraction from academic
editions cannot be solved with existing
solutions alone, but requires specific modifications of the process.
This paper reports on the first steps
that we have made in this direction. We used the rule-based
toolbox Tomita to write and apply some basic grammar
rules that extract the relevant ontology concepts.
In Section 2 we present a short outline of the
biographical ontology we are going to elaborate and
briefly describe the preliminary data preparation. Section 3
is dedicated to the description of our grammar and
the rules for biographical fact extraction. Evaluation and
analysis are presented in Section 4.</p>
      <fn-group>
        <fn id="fn1">
          <label>1</label>
          <p>See the project description here: http://tolstoy.ru/projects/tolstoy-digital/</p>
        </fn>
        <fn id="fn2">
          <label>2</label>
          <p>http://tolstoy.ru/creativity/90-volume-collection-of-theworks/</p>
        </fn>
        <fn id="fn3">
          <label>3</label>
          <p>See, for example, the review in the Guardian: http://www.theguardian.com/books/2013/oct/16/all-leotolstoy-one-click-project-digitisation</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-2">
      <title>2. Textual material and basic ontology</title>
      <p>The commentary apparatus of the epistolary volumes of the
complete 90-volume edition consists of letter
commentaries, which constitute about 40% of the entire text.
Usually they are organized in a non-structured way, as a
sequence of factual comments. Some of these comments are hard to
classify, and many do not fit any database schema,
which makes them redundant for our purposes. This lack of explicit
structure can be explained by the fact that the
commentaries were written by different authors,
each of whom used his or her own text template. As a result,
the commentaries are presented as accompanying text rather
than as an enumeration of the properties and parameters of
a common database. An example of a commentary is
provided in Table 1.</p>
      <sec id="sec-2-1">
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>An example of a commentary</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Russian</th>
                <th>English</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Анатолий Иванович Фаресов (1852—1928), публицист-народник. [Судился по «делу 193»; был амнистирован в 1880 г. и перешел в лагерь умеренных либералов.] Сотрудничал в «Новостях», в «Неделе» и других изданиях. Познакомился с Толстым 1 февраля 1898 г. [и оставил неопубликованные воспоминания о нем]. Статья Фаресова под заглавием «Ко вчерашнему происшествию в редакции «Недели»» была напечатана в «Новом времени», № 7208 от 23 марта</td>
                <td>Anatoly Ivanovich Faresov (1852-1928), writer-populist. [He was tried for “Case 193”; he was pardoned in 1880 and joined the camp of moderate liberals.] He worked in the “News,” in the “Week,” and other publications. He met Tolstoy on February 1, 1898 [and left unpublished memoirs of him]. Faresov’s article titled “On yesterday’s incident at the editorial office of the ‘Week’” was published in the “New Times”, № 7208 of 23 March</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>This example illustrates the complexity of the information
in the comments. We have marked with square brackets
all information that cannot be structurally formalized.
Thus, although we have at our disposal a
considerable corpus of short biographical texts, it is
still a collection of unstructured texts with a high degree of
lexical variation and a mixture of relevant and superfluous
factual statements. We aim to convert the unstructured
commentary text into a structured database with defined
semantic relationships between the elements. For the first
stage of our project, we limit ourselves to a small list of
relationships. Table 2 shows the facts extracted from the
text given in Table 1, together with their relevance at the current stage.</p>
      </sec>
      <sec id="sec-2-11">
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption>
            <p>Facts identified in the commentary from Table 1 and their current relevance</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Text</th>
                <th>Type of the information</th>
                <th>Current relevance of the information</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Anatoly Ivanovich Faresov (1852-1928)</td>
                <td>The head of biographical information</td>
                <td>YES</td>
              </tr>
              <tr>
                <td>writer-populist</td>
                <td>Main fact</td>
                <td>YES</td>
              </tr>
              <tr>
                <td>He was tried for “Case 193”; he was pardoned in 1880 and joined the camp of moderate liberals.</td>
                <td>Additional political fact</td>
                <td>NO</td>
              </tr>
              <tr>
                <td>He worked in the “News,” in the “Week,” and other publications.</td>
                <td>Additional profession fact</td>
                <td>NO</td>
              </tr>
              <tr>
                <td>He met Tolstoy on February 1, 1898</td>
                <td>Tolstoy fact</td>
                <td>YES</td>
              </tr>
              <tr>
                <td>and left unpublished memoirs of him.</td>
                <td>Additional publicistic fact</td>
                <td>NO</td>
              </tr>
              <tr>
                <td>Faresov’s article titled “On yesterday’s incident at the editorial office of the ‘Week’” was published in the “New Times”, № 7208 of 23 March</td>
                <td>Faresov’s publicistic fact</td>
                <td>YES</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          The corpus of letters used for rule development and
testing was based on the 63rd volume of the complete
edition, which contains 281 letters written by Tolstoy in the
period 1880-1886. We manually marked up
50 letters using the annotation module of the GATE
framework
          <xref ref-type="bibr" rid="ref2">(Cunningham, et al 2002)</xref>
          . We annotated all
professions, kinship terms and relationships, and birth and
death dates. The observations made on this small
annotated corpus were used to develop grammatical rules
for biographical fact extraction. For example, we
found that the first mention of a person who has not been
commented on before in the text corresponds to a
robust text structure, which we call the pattern of first
mention. It consists of a personal name, dates of life, and
additional information, which may be a profession or
kinship references to other persons (first of all, to
Tolstoy). Finally, we built a small ontology
comprising the main entities and their attributes and
matched it with our annotated sentences. The ontology is
presented in Figure 1. The text spans referring to
annotations are printed in bold.
        </p>
        <p>
For the first stage of our research, we decided to consider
the most basic categories: name, dates of life,
profession, and kinship. We created several
grammatical rules and evaluated them with the
help of our annotated commentary corpus.
        </p>
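        <p>As a rough illustration of the ontology and the pattern of first mention described above, the entities could be modelled along the following lines. This is a minimal Python sketch and not the project's actual data model: the class names Person and KinshipRelation, their fields, and the completeness check are assumptions introduced here for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KinshipRelation:
    """A kinship link from the described person to another person."""
    relation: str        # e.g. "son", "wife"
    relative_name: str   # name of the related person as written in the commentary


@dataclass
class Person:
    """One first-mention record: name, dates of life, and additional facts."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    professions: List[str] = field(default_factory=list)
    kinship: List[KinshipRelation] = field(default_factory=list)

    def is_first_mention_complete(self) -> bool:
        """True when the record has a name, both dates, and at least one extra fact."""
        has_dates = self.birth_year is not None and self.death_year is not None
        has_extra = bool(self.professions) or bool(self.kinship)
        return bool(self.name) and has_dates and has_extra


# Hypothetical record built from the commentary example in Table 1.
faresov = Person(
    name="Anatoly Ivanovich Faresov",
    birth_year=1852,
    death_year=1928,
    professions=["writer-populist"],
)
```
]]></preformat>
        <p>Keeping professions and kinship as separate lists mirrors the two kinds of additional information that the pattern of first mention may carry.</p>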
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        We used the Tomita parser to create rules for fact extraction
        <xref ref-type="bibr" rid="ref5 ref6">(Tomita 1984, 1985)</xref>
        . The Tomita parser is a free NLP
platform designed for creating small, lightweight
information extraction modules. It can be used as a tool
for extracting structured data (facts) from texts by means of
context-free grammars and dictionaries of keywords. The
API for the Tomita parser has been developed by the team of
Yandex.ru<sup>4</sup> and is available for free download. Tomita
provides modules for morphological analysis, as well as
ready-made rules for extracting names and numbers.
Tomita grammars consist of rules. Users may create
their own grammars and dictionaries for a given
language. To construct a grammar, the user writes a
number of transducer rules, which may use regular
expressions, word lists, and other rules, built-in or
user-defined. Each rule has a left and a right side; the
transducer operation is denoted by an arrow separating the
two sides. The left side must be a single
non-terminal, while the right side may contain both terminals and
non-terminals. An example of a rule sequence can be
found in Figures 2-3.</p>
      <sec id="sec-3-1">
        <p><sup>4</sup> https://tech.yandex.ru/tomita/</p>
        <p>The first rule (Figure 2) says that a date is a number consisting of four
digits from 0 to 9. We can then define date chains
(for example, dates of birth and death) using the previously defined
rules. The operator interp in Figure 3 assigns the extracted
text span to a specific fact field. Here Dt1 stands for the birth
date, while Dt2 corresponds to the date of death.</p>
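        <p>The logic of these two rules can be approximated outside Tomita with ordinary regular expressions. The Python sketch below is only an analogue of the behaviour just described, not the Tomita grammar itself: the function name extract_life_dates and the assumption that the date chain is enclosed in parentheses are ours, while the group names Dt1 and Dt2 follow the field names mentioned above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Dict, Optional

# A date is a number consisting of exactly four digits.
YEAR = r"\d{4}"

# A date chain: birth year, a dash (hyphen, en dash or em dash), death year,
# enclosed in parentheses, with optional spaces around the dash.
LIFE_DATES = re.compile(
    r"\(\s*(?P<Dt1>" + YEAR + r")\s*[-\u2013\u2014]\s*(?P<Dt2>" + YEAR + r")\s*\)"
)


def extract_life_dates(text: str) -> Optional[Dict[str, int]]:
    """Return {'Dt1': birth_year, 'Dt2': death_year}, or None if no chain is found."""
    match = LIFE_DATES.search(text)
    if match is None:
        return None
    # Assign each captured span to its field, as interp does for a fact.
    return {"Dt1": int(match.group("Dt1")), "Dt2": int(match.group("Dt2"))}
```
]]></preformat>
        <p>Called on the string "Faresov (1852-1928), writer-populist", the function returns 1852 for Dt1 and 1928 for Dt2; text without a parenthesized pair of four-digit years yields None.</p>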
        <p>Using the Tomita parser, we have written a grammar with
specific rules and dictionaries designed to extract the
following biographical facts from letter commentaries:
ProfessionFact (a person’s name, dates of life and
professions) and FamilyFact (kinship). We quickly
realized that the built-in date-extraction and
name-extraction rules are not well suited to philological
commentaries, so we also made some specific
modifications. To describe the kinship relations of a
person, we created several dictionaries containing
different kinship types. The kinship dictionary is applied
after the pattern is recognized. If no kinship term is
detected in the pattern, the rules that extract
professions are applied. Overall, we created 31
rules and 3 dictionaries. The processing scheme is
presented in Figure 4.</p>
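        <p>The control flow described in this section (pattern recognition, kinship dictionary check, choice of rule set, fact extraction) can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the authors' Tomita grammar: the first-mention regular expression, the small KINSHIP_TERMS set standing in for the kinship dictionaries, and the function name extract_fact are all hypothetical, and the sketch works on English glosses rather than on the Russian originals.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Dict, Optional

# Hypothetical stand-in for the kinship dictionaries (English glosses only).
KINSHIP_TERMS = {"son", "daughter", "wife", "husband", "brother", "sister"}

# Pattern of first mention: name, dates of life in parentheses,
# a dash or comma, and a free-text description.
FIRST_MENTION = re.compile(
    r"^(?P<name>[^()]+?)\s*\((?P<birth>\d{4})\s*[-\u2013\u2014]\s*(?P<death>\d{4})\)"
    r"\s*[-\u2013\u2014,]\s*(?P<description>.+)$"
)


def extract_fact(sentence: str) -> Optional[Dict[str, object]]:
    """Recognize the pattern, check the kinship dictionary, then pick the fact type."""
    match = FIRST_MENTION.match(sentence.strip())
    if match is None:
        return None  # not a first mention, so no fact is extracted

    description = match.group("description").rstrip(". ")
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", description)}

    # The kinship dictionary check decides which rule set is applied.
    fact_type = "FamilyFact" if words & KINSHIP_TERMS else "ProfessionFact"

    return {
        "fact": fact_type,
        "name": match.group("name").strip(),
        "birth": int(match.group("birth")),
        "death": int(match.group("death")),
        "value": description,
    }
```
]]></preformat>
        <p>The sketch deliberately keeps the whole description as the fact value and does not handle name variants given in parentheses before the dates; both simplifications would need dedicated rules in a real grammar.</p>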
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and analysis</title>
      <p>The performance of these rules was measured
on a test commentary corpus of 560,000
tokens. The results are acceptable for
precision but much lower for recall, which is quite
typical of rule-based approaches. The overall scores for
both facts are shown in Table 3.</p>
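      <p>The F-measure values in Table 3 are consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. The paper does not state the formula explicitly, so the short Python check below assumes this standard definition; the function name f_measure is ours.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 3.
scores = {
    "ProfessionFact": (1.0, 0.82),
    "FamilyFact": (0.78, 0.3),
}

for fact_name, (precision, recall) in scores.items():
    print(fact_name, round(f_measure(precision, recall), 2))
```
]]></preformat>
      <p>Rounded to two decimal places, this gives 0.9 for ProfessionFact and 0.43 for FamilyFact, matching the reported values.</p>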
      <sec id="sec-4-1">
        <table-wrap id="tab-3">
          <label>Table 3</label>
          <caption>
            <p>Overall scores for both facts</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Fact name</th>
                <th>Precision</th>
                <th>Recall</th>
                <th>F-measure</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>ProfessionFact</td>
                <td>1</td>
                <td>0.82</td>
                <td>0.9</td>
              </tr>
              <tr>
                <td>FamilyFact</td>
                <td>0.78</td>
                <td>0.3</td>
                <td>0.43</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The analysis of the results reveals that some important
problems are conceptual. First of all, it is
difficult to define a strict classification of what may
(or may not) be considered a profession. Thus, in
example 1 the extracted profession is peasant. But in the
context of biographical commentaries this person’s origin
(the village of Shevelino) becomes even more important than
the profession of peasant itself. Tolstoy obviously
met many peasants during his life, so any distinguishing
attributes are of great value in this context. There remains
the problem of identifying those cases in which the profession is
meaningless without additional information.</p>
        <p>
Figure 4. Processing scheme, illustrated with a sample commentary entry.</p>
        <p>RUS: Александр Андреевич Иванов (Шешин) (1806 — 1858) — художник.</p>
        <p>ENG: Alexandr Andreevich Ivanov (Sheshin) (1806 — 1858) — a painter.</p>
        <p>► PATTERN RECOGNITION ► KINSHIP DICTIONARY CHECK ► CHOICE OF RULE SET ► FACT EXTRACTION</p>
        <p>OUTPUT</p>
        <preformat>&lt;has name:first&gt;   RUS: Александр   ENG: Aleksandr
&lt;has name:middle&gt;  RUS: Андреевич   ENG: Andreevich
&lt;has name:last&gt;    RUS: Иванов      ENG: Ivanov
&lt;has name:last&gt;    RUS: Шеньшин     ENG: Shenshin
&lt;date:birth&gt;       1806
&lt;date:death&gt;       1858
&lt;profession&gt;       RUS: художник    ENG: painter</preformat>
        <p>
Another problem is the boundary of the text span that
refers to the profession itself. In example 2, we see
a description of Shamil’s social activities. The question
is which part of this complex NP should be extracted as
Shamil’s profession: just the head (leader and
consolidator) or the full NP (leader and consolidator of
hillmen of Daghestan and Chechnya)?</p>
        <p>1) Василий Кириллович Сютаев (1819—1892) — ...
крестьянин дер. Шевелино.</p>
        <p>Vasiliy Kirillovich Syutaev (1819—1892) — ... peasant of
the village of Shevelino.</p>
        <p>2) Шамиль (1797—1874) — знаменитый вождь и
объединитель горцев Дагестана и Чечни в их борьбе
с русскими.</p>
        <p>Shamil (1797-1874) – a famous leader and consolidator
of the hillmen of Daghestan and Chechnya in their struggle
with the Russians.</p>
        <p>To sum up, professions such as writer,
musician, and philosopher are good categorizers that
define a group of people connected in one way or another
with Tolstoy or his work, whereas statuses such as
peasant or consolidator are meaningless without their
attributes (genitive groups), in this case “village
Shevelino” and “hillmen of Daghestan and Chechnya in
their struggle with Russians.” Accordingly, the first
significant problem is the boundary of the nominal
group that determines the professional status. The second
problem concerns intricate chains of kinship terms
that have a very specific syntax, as shown in Example 3.
This phrase contains an inversion (see Figure 5),
probably introduced deliberately for publication so that Sophia
Tolstaya, the wife of Leo Tolstoy, would occupy a more
prominent position.</p>
        <p>3) the husband of the sister of SA, TA</p>
        <p>Another problem is the mismatch between singular and
plural forms in descriptions of relations
between a person and a family group, as shown in
Example 4.</p>
        <p>4) Nikolay Mihaylovich Nagornov (1845—1896) — the son
of Mihail Mihaylovich and Nadejda Ivanovna
Nagornovs.</p>
        <p>Unlike profession extraction, for which the
rule-based approach works well, kinship relation patterns seem to
be much less regular, so it may be worth trying to
extract them with machine learning algorithms. In
the next stage of the research we intend to
extend the ontology (i.e. add important locations and
relations between persons and locations), to process all
31 volumes of letters, to build an open database of all
the persons connected to Tolstoy, and to provide every
entry referring to a person with a short semantically
marked-up biography. The database is to be used as an
interlinked reference base for the 90-volume
edition of Leo Tolstoy’s works, and also as an aggregator of external information
from other sources. The specific syntactic constructions
that are intrinsic to the commentary genre (such as the
abundance of names and titles in a sentence, special
means of reference within the text, etc.) will
be taken into account.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Barnard</surname>
            ,
            <given-names>David T.</given-names>
          </string-name>
          , et al. “
          <article-title>Hierarchical encoding of text: Technical problems and SGML solutions</article-title>
          .”
          <source>Text Encoding Initiative</source>
          . Springer Netherlands,
          <year>1995</year>
          .
          <fpage>211</fpage>
          -
          <lpage>231</lpage>
          .
          <source>Computers and the Humanities</source>
          29.3 (
          <year>1995</year>
          ):
          <fpage>211</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>Hamish</given-names>
          </string-name>
          , et al. “
          <article-title>GATE: an architecture for development of robust HLT applications</article-title>
          .”
          <source>Proceedings of the 40th Annual Meeting on Association for Computational Linguistics</source>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>Ralph</given-names>
          </string-name>
          . “
          <article-title>Information extraction: Techniques and challenges</article-title>
          .”
          <source>Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology</source>
          . Springer Berlin Heidelberg,
          <year>1997</year>
          .
          <fpage>10</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Tolstoy</surname>
            ,
            <given-names>Lev</given-names>
          </string-name>
          .
          <source>Polnoe sobranie sochinenij v 90 tomah</source>
          . Moskva, 1928-1954.
          (Tolstoy, Leo.
          <trans-source>Complete collected works in 90 volumes</trans-source>
          . Moscow, 1928-1954.)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Tomita</surname>
            ,
            <given-names>Masaru</given-names>
          </string-name>
          .
          <article-title>LR parsers for natural languages</article-title>
          .
          <source>COLING. 10th International Conference on Computational Linguistics</source>
          .
          <year>1984</year>
          . P.
          <fpage>354</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Tomita</surname>
            ,
            <given-names>Masaru</given-names>
          </string-name>
          .
          <article-title>An efficient context-free parsing algorithm for natural languages</article-title>
          .
          <source>IJCAI. International Joint Conference on Artificial Intelligence</source>
          .
          <year>1985</year>
          . P.
          <fpage>756</fpage>
          -
          <lpage>764</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>This study (research grant No 15-06-99523 А) was supported by the Russian Foundation for Basic Research in 2014-2015.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>