=Paper= {{Paper |id=Vol-1399/paper12 |storemode=property |title=Interpersonal Relations in Biographical Dictionaries. A Case Study |pdfUrl=https://ceur-ws.org/Vol-1399/paper12.pdf |volume=Vol-1399 |dblpUrl=https://dblp.org/rec/conf/bd/StotzSRS15 }} ==Interpersonal Relations in Biographical Dictionaries. A Case Study== https://ceur-ws.org/Vol-1399/paper12.pdf
          Interpersonal Relations in Biographical Dictionaries. A Case Study.

                 Sophia Stotz∗ , Valentina Stuß∗ , Matthias Reinert‡ , Maximilian Schrott‡
                                                       ∗
                                                           University of Paderborn
                                                            stotz,stuss@upb.de

                                                ‡
                                                    Historische Kommission München
                                                       reinert,schrott@hk.badw.de
                                                                 Abstract
Adopting the concept of “Local Grammars” (M. Gross), which was successfully applied in practice by Geierhos (2010) to biographical
information extraction in English, our project aims to detect, encode, and finally visualize relations between persons. Our corpus consists
of the digitised biographical lexicon “Neue Deutsche Biographie (NDB)”, roughly 21,000 biographies in 25 volumes in print since 1953.
We developed local grammars and suitable dictionaries to describe interpersonal relations and applied them to the corpus with Unitex 3.1.
The local grammars were designed to integrate existing TEI-XML structures in the corpus. Using the ability of local grammars in Unitex
to act as transducers, we were able to produce XML tags and encode semantic information. Based on grammars for personal names
and places, we described interpersonal relations such as to study, predecessors and successors, as well as friends and circles. Afterwards we
identified persons (as given in the authority file or index). Finally, we displayed relations on our website in an interactive and dynamic
way. Utilizing the JavaScript library D3.js, we represented named relations between identified individuals as ego-centred network graphs.
Keywords: Local Grammar, Relation Extraction, Visualisation


1.    Introduction

Biographical dictionaries comprise accounts of lives in a condensed, often abbreviated form. They list the most important events in an individual’s life, as well as achievements and contacts with others. Events are expressed in predicates or sometimes idioms. Both carry one or more arguments, at least one of them representing an individual. We call this a predicate-argument structure (Geierhos, 2010, 7f.). Other statements about the influence of publications, innovations or intellectual impact brought about by the subject of the biography are not taken into account.

A subset of these predicate-argument structures contains relational expressions: a second argument represents another person, and the predicate, possibly accompanied by temporal or modal modifiers, represents the relation.

We consider academic teachers, friends and colleagues as direct interpersonal relations, and relations constituted by peer groups that attended the same school or university or shared the same profession or professional institution as indirect relations. Another dimension is hierarchy (patrons, teachers) vs. equality (friends, colleagues) in direct relations, and heredity (family background) vs. transcendence (intellectual influence, schools of thought) in indirect relations. These relations are manifold and occur in modified forms; therefore we have to normalise them. In this paper we demonstrate the extraction of relations expressed by the verb to study.

In order to visualize relations between individuals we need to identify their names. We achieved this by applying simple matching techniques using indexes and scores, and we undertook tests using topic similarities.

Finally, we show the potential of relation extraction between identified individuals by visualizing the relations online using common force-directed graph libraries.

1.1. Method

In the huge field of information extraction we operate on named entity recognition, named entity disambiguation and relation extraction. However, we restricted our efforts to detecting personal names and a limited set of relations. Interesting relations are accompanied by predicates containing further nameable entities as arguments. Our disambiguation aims primarily to align personal names with a knowledge base, namely an index of people already qualified with profession, dates of birth and death, and references to the pages where they occur in the printed volumes.

In order to extract relations we applied methods described by Gross (1997), an approach called local grammars. Gross promoted the idea that idioms tend to be predominant over syntactic rules in language and called for the examination of large corpora in order to extract typical phrases. It is a combined dictionary and graph approach, whereby graphs describe linguistic structures on a sub-sentence level. Linguistic structures or predicate-argument structures are considered as verbal or noun phrases comprising entities carrying information. This reflects the influence of Harris (1974), who put the focus on argument structures.

Recent research into this approach has been undertaken on organization names in English by Mallchok (2005), on descriptors for humans in German by Geierhos (2007), on toponyms in German by Nagel (2008), on biographical facts in English by Geierhos (2010) and on biographical facts in French by Maurel et al. (2011) and Maurel and Friburger (2013).

Just like these studies we rely on the Unitex corpus processor (Paumier, 2013). Unitex adopts the early efforts of W. A. Woods on applying graphs to linguistic phenomena (Woods, 1970). As early as 1980 he proposed to draft and apply subsequent graphs step by step (Woods, 1980). Among others, those ideas and the ability to call sub-graphs and morphological filters have been

implemented in Unitex.

We constructed local grammars in two steps. First we drafted preliminary graphs to describe and detect the specific vocabulary around interesting phrases. This was helpful for setting up auxiliary dictionaries. Like the electronic dictionaries distributed with Unitex, we use the DELA syntax (Dictionnaires Electroniques du LADL [Laboratoire d’Automatique Documentaire et Linguistique]; Paumier, 2013, 29).

                           Figure 1: Example of a simple bootstrap graph detecting place names

Secondly we had to cope with the TEI-XML markup already present in the corpus. We decided not to clean up this information because abbreviations had been tagged and facilitated the detection of sentence boundaries. This was achieved by using subsequent local grammar graphs, a “cascade” mode available in Unitex and described by Maurel and Friburger (2013).

1.2. Dictionaries
Dictionaries are crucial for the adoption of local grammars. We used the general dictionary CISLEX for German developed at the Center for Information and Language Processing (Centrum für Informations- und Sprachverarbeitung, CIS) in Munich (Guenthner and Maier, 1994). CISLEX contains syntactic information about 150,000 entries encoded in DELA format (Paumier, 2013, 47ff).

In addition we extracted dictionaries of denominators for named entities from indices (lists of names, professions) and an authority file (Gemeinsame Normdatei [1]). The Gemeinsame Normdatei (GND) [2] provided personal names and name parts, names for places, regions and organisations. We could derive dictionaries with roughly 1.9 million surnames, 1.5 million forenames and 9.3 million full names for individuals as well as 1.36 million entries for organisational names.

Describing simple local grammars in a bootstrap manner (Gross, 1999) we could extract lists of entities for fields of study, institutions and place names (see Figure 2). These bootstrapped dictionaries are specific to the given corpus and have a simple linguistic structure. They contain almost no syntactic information or declined forms but carry semantic information. We put together another 32,000 descriptors for occupation, 2,000 of them in declined form; 15,000 geographical names; and 3,500 institutional names, mostly multi-word chunks. A special vocabulary (1,000 entries) covered disciplines and the adjectives accompanying them; another covered individual school names, which otherwise interfere with the relation to study.

   Wetzlar,.EN+Topon+ORTSTUD
   Wismar,.EN+Topon+ORTSTUD
   Witzenhausen,.EN+Topon+ORTSTUD
   Włocławek,.EN+Topon+ORTSTUD
   Worpswede,.EN+Topon+ORTSTUD
   Zerbst,.EN+Topon+ORTSTUD

Figure 2: Example of simple dictionary entries denoting place names with lexical category EN (named entity) and semantic categories Topon and ORTSTUD

Bootstrapping dictionaries from the corpus gives the opportunity to revise and optimize the dictionaries.

   [1] http://www.dnb.de/gnd
   [2] http://www.dnb.de/lds GND as Linked Data Service in March 2013.

1.3. The corpus Neue Deutsche Biographie
Our corpus is provided online at www.deutsche-biographie.de. The website consists of the digitised biographical dictionaries “New German Biography” (NDB). The dictionary recently reached the letter T (Tecklenborg) and has published 25 volumes in print since 1953. Available online are about 21,000 articles of the first 24 volumes (A-Stader). These biographical articles have been selected in a peer review process by the editorial team under the guidance of the editor-in-chief. They are composed of a headline, a short genealogy, the account of the life and further technical paragraphs on awards, works, secondary literature and depictions. All articles are signed by an author. Articles are written in modern German (pre-2006 style) in full sentences but show many abbreviations of frequent words (adjectives, nouns) and of the lemma itself (surname or personal name of the subject of the biography).

In addition to the NDB, its precursor “Allgemeine Deutsche Biographie”, finished in 1912 in 55 volumes plus an index volume, enlarges the number of articles available on the website by 27,000. These older articles are written in an outdated orthography and style and have not been taken into account.

We heavily used auxiliary databases listing the individuals

mentioned in the text along with profession or position in life, their birth and death dates and references to the printed volumes. All in all the core database consists of 92,000 individuals and several hundred families. Almost every entry has been aligned with or added to the bibliographic authority file Gemeinsame Normdatei (GND).

The articles were digitised and typographically tagged by an external firm and afterwards structurally tagged in XML according to the TEI guidelines (Text Encoding Initiative, 2009) in the project. For reasons like human readability, easier proof-reading, and tagging of pre-existing XML, we decided neither to follow the stand-off mark-up approach nor the habit of computational linguistics of working on plain text, but to keep the whole tagging, re-use it on occasion and add further tags in line.

              Figure 3: Masking pre-tagged text and entities, using TOKEN-loops with ![ ]-negative context

   2.   A local grammar for the verb to study
In German, there are several ways to express that someone has studied. The verb studieren as well as ein Studium beginnen, aufnehmen, absolvieren, beenden or (sich) an der Universität einschreiben / Vorlesungen (an der Universität) belegen, besuchen, jemanden hören each sets a certain focus on the activity and determines possible arguments. We restrict our grammar to the verb to study and its forms. Our analysis of the corpus resulted in the following structure: The predicate-argument structure of to study is accompanied by several types of entities, like institution, university (Universität Wien, Akademie der bildenden Künste), place, discipline (Physik, Kulturwissenschaften), teacher (bei Virchow und Naunyn, ...

   1. masking pre-existing tags
   2. masking interfering statements on education
   3. to study with arguments in pre- and post-position
   4. to study with arguments in pre-position
   5. to study with arguments in post-position (common case)
   6. deal with the noun study.

Figure 5: Schema of the cascade (Paumier, 2013, 243ff). Each graph is applied repeatedly until no new match is found and in merge mode, i.e. merging outputs with the detected sequences in the corpus

sequence of grammars (cascade). Almost all grammars acted as transducers: they wrote output back into the recognized chunks of text. In this way new XML tags were introduced to mark extracted entities in each step.

Unitex processes a notation of the form {multi word expression,.lexical type|mask(+lexical type|mask)∗} (Paumier, 2013, 44-46). As shown in Figure 3, Unitex recognizes this kind of meta-syntax in order to treat multi-word expressions on the one hand and to assign lexico-semantic types (e.g. CHOICE+UA in Figure 3) to text units on the other hand (Geierhos et al., 2011, 49).

The mask applies to abbreviations already identified and tagged; certain abbreviations are tagged with semantic types. This also applies to personal names, which were similarly identified and tagged with local grammars.
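The cascade principle summarised in Figure 5 (each step applied repeatedly until no new match is found, with output merged into the recognized text) can be illustrated with a minimal Python sketch. It is not the Unitex implementation: regular expressions stand in for the local grammar graphs, the two-entry place list stands in for a DELA dictionary, and the tag names placeName and study as well as the function names are assumptions chosen only for this illustration.

```python
import re
from typing import List, Tuple

# One step of the cascade: a compiled pattern plus a replacement template.
# The template re-emits the match (\1) wrapped in a tag, which imitates
# merge mode: the output is merged with the recognized text, not substituted.
Rule = Tuple[re.Pattern, str]

# Hypothetical mini-dictionary of place names (stand-in for a DELA dictionary).
PLACES = ["Wien", "Berlin"]

RULES: List[Rule] = [
    # Step A: tag place names unless they already sit directly inside a tag
    # (negative context, so that a second pass finds nothing new).
    (re.compile(r"(?<![>\w])(" + "|".join(PLACES) + r")(?![<\w])"),
     r"<placeName>\1</placeName>"),
    # Step B: tag the verb form "studierte" with a place argument in
    # post-position; it relies on the tags written by step A.
    (re.compile(r"(?<!>)(studierte in <placeName>[^<]+</placeName>)"),
     r"<study>\1</study>"),
]


def apply_until_stable(text: str, rule: Rule, max_rounds: int = 20) -> str:
    """Apply one rule repeatedly until it produces no new match."""
    pattern, template = rule
    for _ in range(max_rounds):
        tagged = pattern.sub(template, text)
        if tagged == text:
            break
        text = tagged
    return text


def run_cascade(text: str, rules: List[Rule]) -> str:
    """Run the rules in order; each step sees the tags added by earlier steps."""
    for rule in rules:
        text = apply_until_stable(text, rule)
    return text
```

Under these assumptions, run_cascade("Er studierte in Wien und Berlin.", RULES) first tags both place names and then wraps the verb with its first place argument in a study element. The negative look-behind in each pattern plays the role of the masking step: text that already carries a tag is not matched again, so repeated application terminates.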