Interpersonal Relations in Biographical Dictionaries. A Case Study.

Sophia Stotz∗, Valentina Stuß∗, Matthias Reinert‡, Maximilian Schrott‡
∗ University of Paderborn, stotz,stuss@upb.de
‡ Historische Kommission München, reinert,schrott@hk.badw.de

Abstract

Adopting the concept of "Local Grammars" (M. Gross), which Geierhos (2010) successfully applied in practice to biographical information extraction in English, our project aims to detect, encode, and finally visualize relations between persons. Our corpus consists of the digitised biographical lexicon "Neue Deutsche Biographie (NDB)", roughly 21,000 biographies in 25 volumes, in print since 1953. We developed local grammars and suitable dictionaries to describe interpersonal relations and applied them to the corpus with Unitex 3.1. The local grammars were designed to integrate existing TEI-XML structures in the corpus. Using the ability of local grammars in Unitex to act as transducers, we were able to produce XML tags and encode semantic information. Based on grammars for personal names and places, we described interpersonal relations like to study, predecessors and successors as well as friends and circles. Afterwards we identified persons (as given in the authority file or index). Finally we displayed relations on our website in an interactive and dynamic way. Utilizing the JavaScript library D3.js, we represented named relations between identified individuals as ego-centred network graphs.

Keywords: Local Grammar, Relation Extraction, Visualisation

1. Introduction

Biographical dictionaries comprise accounts of lives in a condensed, often abbreviated form. They list the most important events in an individual's life, as well as achievements and contacts with others. Events are expressed in predicates or sometimes idioms. Both carry one or more arguments, at least one of them representing an individual. We call this a predicate-argument structure (Geierhos, 2010, 7f.). Other statements about the influence of publications, innovations or intellectual impact brought about by the subject of the biography are not taken into account.

A subset of these predicate-argument structures contains relational expressions: a second argument representing another person, and the predicate, possibly accompanied by temporal or modal modifiers, representing the relation.

We consider academic teachers, friends and colleagues as direct interpersonal relations, and relations constituted by peer groups attending the same school and university or sharing the same profession and professional institution as indirect relations. Another dimension is hierarchy (patrons, teachers) vs. equality (friends, colleagues) expressed in direct relations, and heredity (family background) vs. transcendence (intellectual influence, schools of thought) in indirect relations. These relations are manifold and occur in modified forms; we therefore have to normalise them. In this paper we will demonstrate the extraction of relations expressed by the verb to study.

In order to visualize relations between individuals we need to identify their names. We achieved this by applying simple matching techniques using indexes and scores, and we undertook tests using topic similarities.

Finally we show the potential of relation extraction between identified individuals by visualizing the relations online using common force-directed graph libraries.

1.1. Method

In the huge field of information extraction we operate on named entity recognition, named entity disambiguation and relation extraction. But we restricted our efforts to detecting personal names and a restricted set of relations. Interesting relations are accompanied by predicates containing further nameable entities as arguments. Our disambiguation aims primarily to align personal names with a knowledge base, namely an index of people, already qualified with profession, dates of birth and death, and references to the pages where they occur in the printed volumes.

In order to extract relations we applied methods described by Gross (1997), an approach called local grammars. Gross promoted the idea that idioms tend to be predominant over syntactic rules in language and called for examining large corpora in order to extract typical phrases. The approach combines dictionaries and graphs, whereby graphs describe linguistic structures on a sub-sentence level. Linguistic structures or predicate-argument structures are considered as verbal or noun phrases comprising entities that carry information. This reflects the influence of Harris (1974), who put the focus on argument structures.

Recent research into this approach has been undertaken on organization names in English by Mallchok (2005), on descriptors for humans in German by Geierhos (2007), on toponyms in German by Nagel (2008), on biographical facts in English by Geierhos (2010) and on biographical facts in French by Maurel et al. (2011) and Maurel and Friburger (2013).

Just like these studies we rely on the Unitex corpus processor (Paumier, 2013). Unitex adopts the early efforts of W. A. Woods on applying graphs to linguistic phenomena (Woods, 1970). As early as 1980 he proposed to draft and apply subsequent graphs step by step (Woods, 1980). Among others, those ideas and the ability to call sub-graphs and morphological filters have been implemented in Unitex.

Figure 1: Example of a simple bootstrap graph detecting place names

We constructed local grammars in two steps. First we drafted preliminary graphs to describe and detect the specific vocabulary around interesting phrases. This was helpful for setting up auxiliary dictionaries.

Wetzlar,.EN+Topon+ORTSTUD
Wismar,.EN+Topon+ORTSTUD
Witzenhausen,.EN+Topon+ORTSTUD
Włocławek,.EN+Topon+ORTSTUD
Worpswede,.EN+Topon+ORTSTUD
Zerbst,.EN+Topon+ORTSTUD
Figure 2: Example of simple dictionary entries denoting place names, with lexical category EN (named entity) and semantic categories Topon and ORTSTUD

Like the electronic dictionaries distributed with Unitex, we use the DELA syntax (Dictionnaires Electroniques du LADL [Laboratoire d'Automatique Documentaire et Linguistique]) (Paumier, 2013, 29).

Secondly we had to cope with the TEI-XML markup already present in the corpus. We decided not to clean up this information because abbreviations had been tagged and facilitated the detection of sentence boundaries. This was achieved by applying successive local grammar graphs, a mode called "cascade" that is available in Unitex and described by Maurel and Friburger (2013).

1.2. Dictionaries

Dictionaries are crucial for the application of local grammars. We used the general dictionary CISLEX for German, developed at the Center for Information and Language Processing (Centrum für Informations- und Sprachverarbeitung, CIS) in Munich (Guenthner and Maier, 1994). CISLEX contains syntactic information about 150,000 entries encoded in DELA format (Paumier, 2013, 47ff).

In addition we extracted dictionaries of denominators for named entities from indices (list of names, professions) and an authority file (Gemeinsame Normdatei [1]). The Gemeinsame Normdatei (GND) [2] provided personal names and name parts as well as names for places, regions and organisations. We could derive dictionaries with roughly 1.9 million surnames, 1.5 million forenames and 9.3 million full names for individuals as well as 1.36 million entries for organisational names.

Describing simple local grammars in a bootstrap manner (Gross, 1999), we could extract lists of entities for fields of study, institutions and place names (see fig. 2). These bootstrapped dictionaries are specific to the given corpus and linguistically simple in structure. They contain almost no syntactic information or declined forms but carry semantic information. We put together another 32,000 descriptors for occupation, 2,000 of them in declined form; 15,000 geographical names; and 3,500 institutional names, mostly multi-word chunks. A special vocabulary (1,000 entries) covered disciplines and the adjectives accompanying them; another covered individual school names, which would otherwise interfere with the relation to study.

Bootstrapping dictionaries from the corpus gives the opportunity to revise and optimize the dictionaries.

1.3. The corpus Neue Deutsche Biographie

Our corpus is provided online at www.deutsche-biographie.de. The website offers the digitised biographical dictionary "Neue Deutsche Biographie" ("New German Biography", NDB). The dictionary recently reached the letter T (Tecklenborg) and has published 25 volumes in print since 1953. About 21,000 articles of the first 24 volumes (A-Stader) are available online. These biographical articles have been selected in a peer review process by the editorial team under the guidance of the editor in chief. They are composed of a headline, a short genealogy, the account of life and further technical paragraphs on awards, works, secondary literature and depictions. All articles are signed by an author. Articles are written in modern German (pre-2006 style) in full sentences but show many abbreviations of frequent words (adjectives, nouns) and of the lemma itself (surname or personal name of the subject of the biography).

In addition to the NDB, its precursor "Allgemeine Deutsche Biographie", finished in 1912 in 55 volumes plus an index volume, adds 27,000 articles to the website. These older articles are written in an outdated orthography and style and have not been taken into account.

[1] http://www.dnb.de/gnd
[2] http://www.dnb.de/lds (GND as Linked Data Service, March 2013)
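The bootstrapped entries of fig. 2 (section 1.2) are simple enough to be read mechanically. The following minimal sketch, which is not part of the project code, splits one such line into surface form, lemma, lexical category and semantic codes. The names DelaEntry, parse_dela_line and with_semantic_code are hypothetical. The sketch assumes that an empty lemma field stands for "same as the surface form", and it ignores escaped characters and inflectional codes that full DELA dictionaries may contain.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DelaEntry:
    form: str             # surface form as it appears in the text
    lemma: str            # canonical form; falls back to `form` when empty
    category: str         # lexical category, e.g. "EN" (named entity)
    semantics: List[str]  # semantic codes, e.g. ["Topon", "ORTSTUD"]


def parse_dela_line(line: str) -> DelaEntry:
    """Parse a simplified entry such as 'Wetzlar,.EN+Topon+ORTSTUD'.

    Assumption: no escaped commas or dots and no inflectional codes.
    """
    form, comma, rest = line.strip().partition(",")
    # The last dot separates the (possibly empty) lemma from the codes.
    lemma, dot, codes = rest.rpartition(".")
    if not comma or not dot or not codes:
        raise ValueError(f"not a DELA-style entry: {line!r}")
    category, *semantics = codes.split("+")
    return DelaEntry(form, lemma or form, category, semantics)


def with_semantic_code(entries: Iterable[DelaEntry], code: str) -> List[str]:
    """Return the surface forms of all entries carrying a semantic code."""
    return [entry.form for entry in entries if code in entry.semantics]


figure_2 = [
    "Wetzlar,.EN+Topon+ORTSTUD",
    "Wismar,.EN+Topon+ORTSTUD",
    "Zerbst,.EN+Topon+ORTSTUD",
]
entries = [parse_dela_line(line) for line in figure_2]
places_of_study = with_semantic_code(entries, "ORTSTUD")
```

Keeping the semantic codes as a plain list mirrors the observation that these dictionaries carry semantic rather than syntactic information.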
We heavily used auxiliary databases listing the individuals mentioned in the text along with their profession or position in life, their birth and death dates and references to the printed volumes. All in all, the core database consists of 92,000 individuals and several hundred families. Almost every entry has been aligned with or added to the bibliographic authority file Gemeinsame Normdatei (GND).

The articles were digitised and typographically tagged by an external firm and afterwards structurally tagged in XML according to the TEI guidelines (Text Encoding Initiative, 2009) in the project. For reasons such as human readability, easier proof-reading, and the pre-existing XML tagging, we decided neither to follow the stand-off mark-up approach nor the habit of computational linguistics of working on plain text, but to keep the whole tagging, re-use it on occasion and add further tags in line.

2. A local grammar for the verb to study

In German, there are several ways to express that someone has studied. The verb studieren as well as ein Studium beginnen, aufnehmen, absolvieren, beenden or (sich) an der Universität einschreiben / Vorlesungen (an der Universität) belegen, besuchen, jemanden hören each set a certain focus on the activity and determine possible arguments. We restrict our grammar to the verb to study and its forms. Our analysis of the corpus resulted in the following structure: The predicate-argument structure of to study is accompanied by several types of entities, like institution, university (Universität Wien, Akademie der bildenden Künste), place, discipline (Physik, Kulturwissenschaften), teacher (bei Virchow und Naunyn, ...

Figure 3: Masking pre-tagged text and entities, using TOKEN-loops with ![ ]-negative context

1. masking pre-existing tags
2. masking interfering statements on education
3. to study with arguments in pre- and post-position
4. to study with arguments in pre-position
5. to study with arguments in post-position (common case)
6. deal with the noun study.
Figure 5: Schema of the cascade (Paumier, 2013, 243ff). Each graph is applied repeatedly until no new match is found and in merge mode, i.e. merging outputs with the detected sequences in the corpus

sequence of grammars (cascade). Almost all grammars acted as transducers: they wrote output back into the recognized chunks of text. In this way new XML tags were introduced to mark extracted entities in each step.

Unitex processes a notation of the form {multi word expression,.lexical type|mask(+lexical type|mask)*} (Paumier, 2013, 44-46). As shown in fig. 3, Unitex recognizes this kind of meta-syntax in order to treat multi-word expressions on the one hand and to assign lexico-semantic types (e.g. CHOICE+UA in fig. 3) to text units on the other hand (Geierhos et al., 2011, 49).

The mask applies to abbreviations that have already been identified and tagged; certain abbreviations are tagged with semantic types. This also applies to personal names, which were similarly identified and tagged with local grammars.
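The principle of a cascade of transducers in merge mode can be illustrated outside Unitex. The following minimal sketch uses Python regular expressions as a crude stand-in for local grammar graphs; it is not the project's implementation. The place names come from fig. 2. The tag names (placeName, seg type="study"), the pattern "studierte in" and the function names are assumptions made for illustration only. Each step is applied repeatedly until the text no longer changes, and a look-behind plays the role of a negative context that keeps already tagged text from matching again.

```python
import re

# Hypothetical mini-dictionary; the names are taken from fig. 2.
PLACES = ["Wetzlar", "Wismar", "Zerbst"]

# Each step pairs a pattern with an output template. In merge mode the
# output (the XML tags) is merged with the recognized sequence (\1).
# The look-behind (?<![>\w]) skips text that directly follows a tag.
CASCADE = [
    # Step 1: mark place names found in the dictionary.
    (re.compile(r"(?<![>\w])(" + "|".join(map(re.escape, PLACES)) + r")(?![<\w])"),
     r"<placeName>\1</placeName>"),
    # Step 2: mark 'to study' with a place argument in post-position,
    # relying on the tags written by step 1.
    (re.compile(r"(?<![>\w])(studierte in <placeName>[^<]+</placeName>)"),
     r'<seg type="study">\1</seg>'),
]


def apply_until_stable(pattern, template, text, max_rounds=10):
    """Apply one transducer repeatedly until it yields no new match."""
    for _ in range(max_rounds):
        tagged = pattern.sub(template, text)
        if tagged == text:
            break
        text = tagged
    return text


def run_cascade(text):
    """Run all steps in order; later steps see the tags of earlier ones."""
    for pattern, template in CASCADE:
        text = apply_until_stable(pattern, template, text)
    return text


tagged_sentence = run_cascade("Er studierte in Wismar.")
```

Because every step writes its result back into the text, the second pattern can be phrased in terms of the tags introduced by the first, which is the reason for keeping all tagging in line.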