=Paper= {{Paper |id=Vol-2119/paper9 |storemode=property |title=A Prosopographical Information System (APIS) |pdfUrl=https://ceur-ws.org/Vol-2119/paper9.pdf |volume=Vol-2119 |authors=Matthias Schlögl,Katalin Lejtovicz |dblpUrl=https://dblp.org/rec/conf/bd/SchloglL17 }} ==A Prosopographical Information System (APIS)== https://ceur-ws.org/Vol-2119/paper9.pdf
                             A Prosopographical Information System (APIS)
                                              Matthias Schlögl, Katalin Lejtovicz
                                                Austrian Centre for Digital Humanities
                                              Sonnenfelsgasse 19, 1010 Vienna, Austria
                                     matthias.schloegl@oeaw.ac.at, katalin.lejtovicz@oeaw.ac.at

                                                                    Abstract
During recent years massive amounts of biographical datasets have been digitized and - at least some of them - made available open access. However, an easy-to-use system that allows non-experts to work with the data is still missing. The APIS system, designed within the framework of the APIS project at the Austrian Academy of Sciences, is a web-based, highly customizable virtual research environment that allows researchers to work alongside programs designed for processing natural language texts, so-called Natural Language Processing pipelines.

Keywords: biographical data, virtual research environment, natural language processing


                      1     Introduction
During recent years massive amounts of biographical datasets have been digitized and - at least some of them - made available open access (Reinert et al., 2015; Fokkens et al., 2014). Additionally, collaborative efforts such as Wikipedia/Wikidata1 have created even more partly structured prosopographical and biographical datasets (Gergaud et al., 2016). Reference resources such as the Gemeinsame Normdatei2 and the Virtual International Authority File (VIAF)3 have also been utilized for prosopographical research (Andert et al., 2014). Since these first endeavours, researchers have worked on tools that allow for extracting structured data from these biographical texts. Various Natural Language Processing (NLP) techniques have been used for these objectives (local grammars, regular expressions, machine learning and deep learning based approaches, etc.). However, the goal of the researchers was not limited to transforming full-text data into structured data, but also included the interpretation of textual resources by applying statistical and network research methodologies. In this sense, computational linguistic processing, statistical analysis and network visualization of biographies have been started at the ÖBL - the Austrian Biographical Dictionary - in the context of the APIS project. The results of the various analysis methods are later evaluated and interpreted by scholarly researchers. In this paper we describe the Virtual Research Environment (VRE) (Schlögl and Andorfer, 2018) - from now on referred to as APIS - that has been developed during the project, and the Natural Language Processing (NLP) techniques we use for (semi)automatically structuring the data. The APIS VRE is a Django-based web application published under an open-source license (MIT) on GitHub: https://github.com/acdh-oeaw/apis.

                2   APIS virtual research environment
The approaches for extracting structured information from biographical data sets have been brought forward by a relatively small scholarly community using locally run, tailor-made systems that almost never have a user interface. Compared to the conventional methods that researchers apply when evaluating textual data (e.g. taking notes in a Word document, filling out an Excel sheet manually), APIS allows for a semi-automatic exploration of the information in a large-scale data set. It enables researchers to find answers to their research questions more easily and much faster than with conventional methods.
APIS is a web-based, highly customizable VRE that allows traditional researchers to work alongside NLP pipelines. This hybrid approach (the possibility to manually annotate texts and edit entities/relations alongside automatic systems) allows researchers to "use the best of both worlds", and computer scientists to improve the tools directly on real-world data. The web application not only helps researchers to systematically and semi-automatically process large amounts of data, but also to analyze and visualize connections between entities detected in the documents. Visualization of the data allows the researchers to get an overall picture of the entities and relations encoded in the documents that would otherwise be hard to access. APIS provides the users with an easy and intuitive workflow to process large amounts of data.
It therefore tackles two main problems and will make the work with biographical data easier for historians as well as data scientists:

  • It allows historians to annotate biographies with exactly the information they need for their research, easily link the annotations to the Linked Open Data cloud4, and export it for further research.

  • It allows data scientists to easily access annotated data via APIs, use it for (re)training models, store new annotations in the system and use the built-in evaluation system for retrieving precision, recall, F1 and other metrics (see the sketch below).

   1 Or resources such as Freebase that have been included in these endeavours.
   2 An authority file for persons, events, locations, works and institutions, operated cooperatively by the German National Library, the German Union Catalogue of Serials and other institutions. The GND has recognized these developments and will open the system to actors outside traditional libraries. http://www.dnb.de/EN/Standardisierung/GND/gnd_node.html (Kett, 2017)
   3 An international authority file compiled by national libraries. https://viaf.org/.
   4 LOD - data that is published so that it can easily be interlinked with other datasets, which allows for more refined, detailed queries of the content.
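To illustrate the second point, the snippet below shows what pulling annotated relations for (re)training could look like from the data scientist's side. The endpoint path, query parameters and field names are hypothetical placeholders, not the documented apis-core API; a real deployment would expose its own routes.

    # Hypothetical client-side sketch of fetching annotations for (re)training;
    # endpoint path, parameters and field names are placeholders, not the
    # actual apis-core REST API.
    import requests

    BASE = "https://apis.example.org/api"   # placeholder instance URL

    resp = requests.get(f"{BASE}/annotations/",
                        params={"collection": "artists", "relation": "person-place"},
                        timeout=30)
    resp.raise_for_status()

    training_examples = [
        (item["text_span"], item["relation_kind"])   # e.g. ("studierte in Graz", "educated in")
        for item in resp.json()
    ]
    print(len(training_examples), "annotated examples fetched for retraining")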

2.1   General Idea
The design of APIS meets three basic criteria, based on experience from previous projects:

  • a simple datamodel that can be serialized to other formats and datamodels later on

  • use of a solid and widely used software stack to keep the development and maintenance effort as low as possible

  • a hybrid approach that allows researchers as well as automatic tools/pipelines to work in parallel on the same dataset.

While this design has some advantages, it also brings some downsides along. Most commonly used high-level ontologies, such as CIDOC CRM (a structure designed to describe concepts and relationships used in the cultural heritage domain), are based on an event-driven datamodel. Our internal datamodel (discussed in more detail below) is simpler and easier to use, but needs to be mapped to event-based models later on. Similarly, the use of well-proven technologies such as Django and SQL databases brings some obvious advantages in the development of the web application, but in a world of Linked Open Data we will at some point need to serialize our data into RDF triples5 and publish it to include it in the Linked Open Data cloud. However, during the project our design decisions have proven to be successful. Due to the simple datamodel and the easy and fast development of the web application we were able to (manually) annotate much more data than we anticipated.

2.2   Datamodel
The APIS datamodel is a hybrid between an event-based and a relation-based model. Figure 1 shows a simplified version of the APIS datamodel.

       Figure 1: APIS datamodel (simplified version)

It consists of 5 entities (person, place, institution, event and work) that are all interrelated. Relations can be added between persons and places, persons and institutions, institutions and works, persons and persons and so on. All entities share a set of basic attributes (name, start and end dates etc.) and some have additional ones (e.g. place has longitude and latitude). Every entity can be related to several URIs (if they do not share the same top-level domain) and grouped in so-called collections. Relations, on the other hand, have a fixed set of attributes (start and end date, kind, notes, references).6 Every entity can have as many full texts as needed. These full texts in turn can have offset annotations grouped in so-called annotation projects and - if useful - linked to other entities or relations7 in the database. All entities and relations are typed with Simple Knowledge Organisation System (SKOS) vocabularies (SKOS defines standards for working with knowledge organisation systems such as thesauri, taxonomies and classification schemes). Additionally, the system features a very fine-grained user permission system that allows permissions to be set on a collection basis.
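To make the hybrid datamodel more concrete, the sketch below expresses the shared entity attributes and one relation type as Django models, since APIS is a Django-based web application. It is a minimal illustration under our own naming; class and field names are hypothetical and do not reproduce the actual apis-core code.

    # Minimal Django-style sketch of the hybrid datamodel described above.
    # Class and field names are illustrative, not the actual apis-core models.
    from django.db import models


    class Collection(models.Model):
        name = models.CharField(max_length=255)


    class Entity(models.Model):
        """Basic attributes shared by person, place, institution, event and work."""
        name = models.CharField(max_length=255)
        start_date = models.DateField(blank=True, null=True)
        end_date = models.DateField(blank=True, null=True)
        uris = models.TextField(blank=True)            # simplification: several URIs per entity
        collections = models.ManyToManyField(Collection, blank=True)

        class Meta:
            abstract = True


    class Person(Entity):
        pass


    class Place(Entity):
        lat = models.FloatField(blank=True, null=True)   # additional, place-specific attributes
        lng = models.FloatField(blank=True, null=True)


    class PersonPlaceRelation(models.Model):
        """A 'mini-event': exactly two entities plus a fixed set of attributes."""
        person = models.ForeignKey(Person, on_delete=models.CASCADE)
        place = models.ForeignKey(Place, on_delete=models.CASCADE)
        kind = models.CharField(max_length=255)          # SKOS concept, e.g. 'educated in'
        start_date = models.DateField(blank=True, null=True)
        end_date = models.DateField(blank=True, null=True)
        notes = models.TextField(blank=True)
        references = models.TextField(blank=True)

Modelling the relation as its own table with start/end dates and a kind is what makes it a "mini-event" rather than a plain edge between two entities.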

2.3   The web frontend
The APIS web frontend allows users to search the data, work on it and analyze it. The list views can be used to search the data8, to sort and export it and to access the edit views. Figure 2 shows the edit view of a person.

       Figure 2: APIS edit view of person

The view consists of two panes: in the left pane one can work on the entity's metadata, in the right pane the entity can be related to other entities. The forms feature, wherever possible/useful, autocompletes that make the editing process more convenient and less error-prone for the researcher.

   5 RDF is a framework for representing information in the Web. In RDF, statements about resources are expressed in the form of subject-predicate-object, known as triples.
   6 Relations are a kind of mini-event: the relation can only be connected to two entities and has a limited set of attributes, but nonetheless the relation of two entities has some additional data attached. We therefore call our model a hybrid between relation-based and event-based.
   7 Entities and/or relations that are annotated in the full text can be automatically added to the database.
   8 The search fields and functions can be defined in the main settings file of the application.
2.4   Full-text annotation
APIS also allows for the annotation of biographical full texts. Instead of just adding a relation between two entities to the database, this relation can be annotated directly in the text. When highlighting a part of the text, a context menu opens that allows the user to select the relation type.9 After selecting the relation type (e.g. Person-Place) another form is loaded that allows for selecting the related entity (e.g. Vienna) and the kind of relation (e.g. 'educated in'). We already explained that annotations in APIS are stored as offsets and related to the user and to something we call an annotation project. This allows the biography to be viewed from different angles. A simple form allows the user to filter for the annotations one wants to look at (annotation project, user, type of annotation). Additionally, the visualization allows for overlapping annotations. As figure 3 shows, when clicking on overlapping annotations - visualized with a yellow background color - a context window opens and shows a copy of the text snippet for every existing overlapping annotation.

       Figure 3: Image of overlapping annotations

   9 The context menu is defined in a system wide settings file accessible via the admin backend.
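Storing annotations as offsets means that each annotation keeps character positions into the full text instead of a copy of the highlighted string. A minimal sketch of such a record (field names and the example sentence are ours, not the apis-core schema):

    # Illustrative representation of an offset annotation; not the apis-core schema.
    from dataclasses import dataclass


    @dataclass
    class Annotation:
        text_id: int            # full text the annotation belongs to
        start: int              # character offset where the highlighted span starts
        end: int                # character offset where it ends (exclusive)
        user: str               # annotator (human or pipeline user account)
        annotation_project: str
        relation_type: str      # e.g. 'Person-Place'
        kind: str               # e.g. 'educated in'
        target_entity: str      # e.g. the place 'Vienna'


    biography = "Er studierte an der Universität Wien und lebte später in Graz."
    ann = Annotation(text_id=1, start=32, end=36, user="annotator_1",
                     annotation_project="demo", relation_type="Person-Place",
                     kind="educated in", target_entity="Vienna")
    print(biography[ann.start:ann.end])   # -> 'Wien'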

2.4.1   Automatic import of LOD entities
The APIS web application allows the use of external resources - such as Linked Open Data resources - in the autocomplete search. Whenever a researcher searches for an entity in the autocomplete, not only local entries are searched, but also external resources integrated into the APIS system.10 When a researcher selects an entity that is not yet present in the database, the system retrieves the original entity and parses it into the database. The parser can be defined in an instance-wide settings file.

   10 We use a local Apache Stanbol instance for fast access to Geonames and GND, but have also implemented bridges to SPARQL (the query language for RDF data) endpoints for less frequently used sources.
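As a rough illustration of such an external lookup, the snippet below queries the public GeoNames search API directly and maps the first hit onto the fields a place entity needs. In production APIS resolves these lookups through a local Apache Stanbol instance (footnote 10); the demo username shown here is an assumption a deployment would have to configure.

    # Sketch of an external LOD lookup for the autocomplete; the APIS production
    # setup goes through a local Apache Stanbol instance instead (see footnote 10).
    import requests

    GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


    def lookup_place(name: str, username: str = "demo") -> dict | None:
        """Return a minimal place record for the first GeoNames hit, or None."""
        resp = requests.get(GEONAMES_SEARCH,
                            params={"q": name, "maxRows": 1, "username": username},
                            timeout=10)
        resp.raise_for_status()
        hits = resp.json().get("geonames", [])
        if not hits:
            return None
        hit = hits[0]
        return {
            "name": hit["name"],
            "lat": float(hit["lat"]),
            "lng": float(hit["lng"]),
            "uri": f"https://sws.geonames.org/{hit['geonameId']}/",
        }


    print(lookup_place("Graz"))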
2.5   Inter annotator agreement
In section 3 we will elaborate on the Natural Language Processing (NLP) techniques we used to (semi)automatically enrich the ÖBL biographies. One of the prerequisites of automatic text processing is a gold standard of annotations and a high inter-annotator agreement.11 Getting towards a gold standard and a high agreement among the annotators is a time consuming and tedious process. We try to foster this process by visualizing overlapping annotations in the frontend and by providing ready-made metrics to compute the agreement over large collections of texts and/or annotators.12

   11 Most of the time the latter is needed to produce the former.
   12 As described above, the APIS application does not distinguish between human researchers and automatic tools. Tools communicate with the database via a Rest API, researchers via the GUI; both have a user account that allows APIS to version the edits.
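A common choice for such a ready-made agreement metric is Cohen's kappa between two annotators; the sketch below computes it with scikit-learn over the labels two annotators assigned to the same candidate spans. It only illustrates the kind of metric meant here and is not the implementation built into APIS; the labels are invented examples.

    # Illustration of an agreement metric (Cohen's kappa) for two annotators;
    # not the metric implementation used inside APIS.
    from sklearn.metrics import cohen_kappa_score

    # Label assigned by each annotator to the same ten candidate text spans.
    annotator_a = ["educated in", "born in", "O", "educated in", "O",
                   "worked at", "O", "born in", "educated in", "O"]
    annotator_b = ["educated in", "born in", "O", "worked at", "O",
                   "worked at", "O", "born in", "O", "O"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")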
2.6   Versioning
One important aspect of (historic) research is provenance. Ideally, every step in the data generation and data analysis process is logged and reproducible. To allow for full provenance information in the APIS process, we implemented a system that serializes every edit of a data point and adds a timestamp and a user-ID to the serialization. The revisions can be accessed in the GUI and used for recreating any former state of the database. We are currently working on building a Rest API endpoint for providing machine readable access to this versioning system.
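Conceptually, every edit is stored as a serialized snapshot of the data point plus a timestamp and the user-ID of whoever (or whatever tool) made the change. A simplified sketch of such a revision record (our own simplification, not the mechanism used in apis-core):

    # Simplified sketch of a revision log entry; apis-core uses its own mechanism.
    import json
    from datetime import datetime, timezone

    revisions = []   # in APIS this lives in the database, not in memory


    def save_revision(obj: dict, user_id: int) -> None:
        """Serialize the current state of a data point together with provenance."""
        revisions.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "data": json.dumps(obj, ensure_ascii=False),
        })


    person = {"name": "Anna Musterfrau", "start_date": "1881-05-02"}
    save_revision(person, user_id=7)

    person["start_date"] = "1881-05-12"   # a correction by another user
    save_revision(person, user_id=12)

    # Recreate the state before the correction from the first revision.
    print(json.loads(revisions[0]["data"]))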
2.7   Visualization
The APIS system also includes a rudimentary visualization module. Several projects have shown that social network analysis (SNA) is a very useful visualization and analysis method (Armitage, 2016; Warren et al., 2016). The APIS network visualization allows for the iterative creation of networks by specifying the source node13, the relation type and/or kind and/or the target node. The form supports the researcher in creating the network with autocompletes that show existing entries in the database.

       Figure 4: Network visualization

   13 It is also possible to select whole collections of nodes.

Nodes can be extended14 by accessing the context menu of the nodes. Figure 4 shows a network that was created by adding person-place relations with the target node set to 'München', 'Berlin' and 'Graz'. After creating the network it can be downloaded either as JSON15 or graphml.16 The downloaded file includes all the attributes - such as longitudes and latitudes for places - that exist in the database.
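An export of this shape can be reproduced with standard graph tooling; the sketch below builds a small person-place network with node attributes and writes it to GraphML using networkx. The person name is invented and the snippet only mimics the structure of the APIS download, it is not the export code itself.

    # Small person-place network with attributes, exported to GraphML;
    # this mirrors the shape of the APIS download, not its actual code.
    import networkx as nx

    g = nx.Graph()
    g.add_node("Ludwig Beispielmann", kind="person")          # invented example person
    g.add_node("Graz", kind="place", lat=47.07, lng=15.44)
    g.add_node("München", kind="place", lat=48.14, lng=11.58)

    g.add_edge("Ludwig Beispielmann", "Graz", relation="educated in")
    g.add_edge("Ludwig Beispielmann", "München", relation="worked in")

    nx.write_graphml(g, "person_place_network.graphml")
    print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges written")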
The APIS project also cooperates with external partners to explore the potential of other, more experimental visualization methods. One of these methods is the space-time-cube developed by colleagues from the University of Krems (Windhager et al., 2017).17

   14 By 'extending' we mean adding all relations for the node to the visualization.
   15 A format that allows for easy data interchange between applications - see: https://www.json.org/
   16 Graphml is an XML-based format for storing graphs. See http://graphml.graphdrawing.org/ for details.
   17 Please also see Windhager et al. in these proceedings for details.

               3     Information extraction
One of the goals of the APIS project is to offer automated text processing to facilitate the work of researchers. The processing and interpretation of the texts are carried out using computational linguistic methods, which include the identification of entities (individuals, places, institutions, etc.), automatically linking them to Linked Open Data Cloud resources, and disambiguating and manually curating the results. In the following section we will outline the above described steps in more detail.

3.1   Entity Linking
Although the biographies are available in XML format, they do not contain all relevant information about a person's life in structured format, except for some key events such as birth and death. One of the main goals of the project is to reveal information encoded in the natural language text (e.g. names of persons, places, institutions, events, etc.) and to automatically detect relationships between them and the person depicted in the biography. In order to tackle this problem efficiently, we combined automated and manual information retrieval techniques. The information extraction in APIS consists of three main steps: Named Entity Recognition, Entity Linking, and Disambiguation/Curation. For the automatic information extraction we use the open source software Apache Stanbol18, which detects entities in natural language texts and connects them to ontologies and knowledge databases such as the GND, GeoNames19, or DBpedia20. The connections that are created between entities and biographies not only allow for the enrichment of the biographies with semantic information, but also for the automatic correction of missing or erroneous data. The advantage of using Apache Stanbol for Entity Identification and Linking is that it provides a straightforward mechanism for how entities are identified and how any ontology in RDF/XML format can be converted into a semantic reference resource, which is later used for the semantic enrichment of the documents. To perform the semantic annotation, we produce so-called Referenced Sites from the data available in RDF/XML format (i.e. from GeoNames, GND). In the Referenced Sites the indexed data is stored in a Solr21 index.

   18 https://stanbol.apache.org/index.html, last accessed: 26.02.2018
   19 http://www.geonames.org/, last accessed: 26.02.2018
   20 http://wiki.dbpedia.org/, last accessed: 26.02.2018
   21 Solr is an open source search platform, which allows for full-text search, faceted search and hit highlighting amongst other features.

3.1.1   Abbreviations
The information extraction process created in APIS consists of two steps. First, we resolve abbreviations of person names, institution names, academic titles, place names, and common verbs. We developed two versions to resolve abbreviations: a Java program based on regular expressions, and a Python based script that uses regular expressions, a dictionary of German words and a large German-language corpus (AMC) (Ďurčo et al., 2014) to resolve ambiguous abbreviations and choose the correct variant. The program queries the abbreviation and its context in the AMC corpus, and the resolution with the most hits is chosen.

3.1.2   Creating an index
The second step in the semantic annotation process is to create Solr indices from ontologies. During Entity Linking, Apache Stanbol searches for the entities (persons, places, institution names, etc.) in the indexed ontologies. In the APIS project we created indexes from GeoNames and GND to link the place names, personal names and institution names in the text to the Linked Open Data Cloud. The indexes were created as follows: we downloaded the RDF/XML dumps of the aforementioned resources, which were cut into smaller files in order to get manageably sized data and to make it easy to create separate indexes for the different entity types. After this we created the Apache Solr indexes from the above mentioned files using Apache Stanbol's Java package for indexing.

3.1.3   The NLP pipeline
After creating and installing the Solr index, the Entity Linking component is configured. Stanbol allows various configuration options to achieve an accurate and efficient Entity Linking process. For example, one can narrow down the search to proper nouns only. In this case the NLP algorithm of Stanbol identifies proper nouns and queries only them in the Solr index; this yields more accurate Entity Linking and a better runtime. Another configuration option is to use the types of entities in the matching process. If this setting is turned on and the index contains information regarding the type of the entities, the user gets the results categorized into different types such as "Person", "Location", "Event", etc. (depending on what types are available in the index).
Following the configuration of the Entity Linking component, the Natural Language Processing component is constructed, which defines what NLP steps have to be carried out. In APIS we use the Apache OpenNLP22 open source software for the computer linguistic analysis of the biographies.

   22 https://opennlp.apache.org/, last accessed: 27.02.2018

Our pipeline consists of the following steps: determine the language of the input text (langdetect); divide the text into sentences (opennlp-sentence); tokenize the sentences (opennlp-token); determine the part-of-speech tags of the words (opennlp-pos); search for noun phrases (opennlp-chunker); and perform Entity Linking (custom Referenced Site).
In the last step, the nouns and noun phrases are compared with the Solr index (Entity Linking). If a term matches an entry in the index, the entry from the Solr index is returned by the application in the requested output format (e.g. JSON, RDF/XML, Turtle, N3, JSON-LD). If there are multiple results, a score between 0 and 1 indicates which is the most likely result. The advantage of the Apache Stanbol Entity Linking software is that it can effectively index any ontology available in RDF/XML format and allows the user to select the data resource for semantic annotation.
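From the client side, running a text through such a chain amounts to posting plain text to the Stanbol enhancer and picking an output serialization via content negotiation. A rough sketch, assuming a local Stanbol instance exposing its default enhancer endpoint on port 8080; URL, chain and response layout differ per installation.

    # Rough client-side sketch of an enhancement request against a local
    # Apache Stanbol instance; endpoint URL and chain depend on the setup.
    import requests

    STANBOL_ENHANCER = "http://localhost:8080/enhancer"   # assumed default endpoint

    text = "Adalbert Stifter wurde in Oberplan geboren und starb in Linz."

    resp = requests.post(
        STANBOL_ENHANCER,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": "application/ld+json",   # JSON-LD; Turtle, RDF/XML, N3 also possible
        },
        timeout=30,
    )
    resp.raise_for_status()

    # The response is an RDF graph of enhancements; entity annotations carry a
    # confidence score between 0 and 1 that can be used to rank candidate links.
    enhancements = resp.json()
    print(len(enhancements.get("@graph", [])), "enhancement nodes returned")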
3.2   Relation Extraction
Entity Linking is the first step in automatically interpreting the meaning of a natural language document. Through Entity Linking, strings in the documents can be replaced by URIs (Uniform Resource Identifiers). The concepts in the LOD resources are not only clearly identifiable and referenceable by their URIs, but they can also be shared between applications, unstructured texts can be enriched with information attached to them, or inconsistencies in the data can be detected and corrected.
The second step is to determine the relationships and the types of the relationships that hold between the entities, also known as automatic Relation Extraction. During Relation Extraction the NLP module looks for semantic relationships such as 'parent-child', 'traveled to a place', 'learned somewhere' or 'participated in an event' between the people, places, and events detected in the text. We have tried three different methods for automatic relationship recognition; these will be tested and the best solution will be permanently integrated into the APIS system.
The first version is a rule-based algorithm implemented using the GATE framework.23 The implementation uses the JAPE regular expression language of GATE to automatically extract semantic links from the text. In a first step, the output of the Entity Linking module is converted to XML format, where each Named Entity is an element in the XML. These XML files were then uploaded to GATE and processed by the ANNIE NLP module.24 The Entity Linking results as well as the output of the NLP pipeline are stored as annotations in GATE. The JAPE regular expressions work with these annotations and search for linguistic patterns in the documents that can express a relationship. If the application finds a text snippet that corresponds to the pattern that is specific to that relationship, it automatically provides a new annotation, which defines the type of the relationship. The output of the relation extraction was exported to XML - widely used in NLP applications - and imported back into the APIS system.
The second solution we tested was IEPY (Information Extraction in Python)25, an open source software implemented in Python which realizes relation extraction. IEPY performs machine learning based relationship recognition. On the web interface of the application, the user annotates occurrences of predefined relationships (e.g. 'traveled somewhere', 'married somebody', etc.) from which the software learns a model that can be used to identify relations in documents that have not been seen before by the system. In the case of the ÖBL, IEPY has not proven to be a suitable software, because it requires the selection of both members of a relationship (e.g. in the case of 'learned somewhere' both the person and the place). However, in the ÖBL, to avoid the repetition of the person the biography was written about, his/her name is usually only mentioned once, at the beginning of the biography.
The third approach we have examined is the recognition of relations from the tree structure obtained by the syntactic parsing of the sentences with Deep Learning. We use a standard NLP pipeline26 to process the text. When the module finds a named entity it climbs up the parse tree and extracts predefined classes - in the sense of POS tags - of words (e.g. verbs). The extracted list of words is converted into a vector which is used for classification. This method makes use of the inherent advantages a biography brings along: in many cases a biography talks about the portrayed person, therefore we skipped the search for the subject and just assumed that the portrayed person is the subject. First tests with a model trained on roughly 4000 and evaluated on 1000 examples of person-place relations show the potential of the method27, but also the problems automatic tools have with the very specific language of the ÖBL.

   23 GATE is an open source software designed to automatically process natural language documents. See: https://gate.ac.uk/
   24 ANNIE is a system within the GATE framework, which was designed to automatically process and extract information from textual data.
   25 https://github.com/machinalis/iepy
   26 https://spacy.io
   27 Please see https://apis.acdh.oeaw.ac.at/presentation innsbruck17/ for a more detailed presentation and a live version of the model.
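The core of this third approach - start from a recognized entity, climb the dependency tree and collect words of predefined POS classes as features - can be sketched with spaCy, the standard pipeline referenced in footnote 26. Only the feature extraction is shown; vectorizing the word lists and training the classifier on the annotated person-place relations happens downstream, and the model name is the usual spaCy German distribution, not necessarily the exact configuration used in APIS.

    # Feature extraction for relation classification: climb the dependency tree
    # above a place entity and keep verbs/auxiliaries on the path to the root.
    # Sketch only; the classifier trained on these features is not shown here.
    import spacy

    nlp = spacy.load("de_core_news_sm")   # any German spaCy model with parser and NER

    doc = nlp("Er studierte in Graz und arbeitete später als Lehrer in Linz.")

    for ent in doc.ents:
        if ent.label_ not in {"LOC", "GPE"}:
            continue
        features = []
        token = ent.root
        while token.head is not token:         # walk up the parse tree to the root
            token = token.head
            if token.pos_ in {"VERB", "AUX"}:  # predefined POS classes, here verbs
                features.append(token.lemma_)
        print(ent.text, "->", features)
    # e.g. 'Graz' -> ['studieren'], which a classifier can map to 'educated in'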
The training data set was annotated during a small research project dealing with members of the 'Künstlerhaus'.28 Given the rather difficult training data, the (for modern NLP tools) problematic language of the ÖBL, and the relation types to extract29, the model performed rather well, even though it is obviously not precise enough for historians to rely on the extracted data alone. The evaluation on 30 randomly chosen artist biographies30 showed a recall of 0.79 and a precision of 0.44 (F-beta 0.56). The combination of high recall and low precision is due to the named entity recognizer annotating places where a human annotator wouldn't do so (e.g. 'Vienna' in 'University of Vienna'). We believe that the precision of the method can be significantly raised by improving the named entity recognizer.31

   28 The fact that this data was not specifically produced for training purposes is important. It is very unevenly distributed: about 2/3 of all annotations bear only two labels out of eight. The annotations were also done by only one annotator and are therefore not very consistent over the whole corpus.
   29 Relation types were only chosen based on the research question and not for how easy they are to find by automatic tools.
   30 All members of the 'Künstlerhaus' had been annotated and used for training; we therefore used other artists for evaluation.
   31 We will do so by retraining the model, and by implementing some simple rules such as: when the name of an institution contains a place name, the system will annotate the expression as an institution, but not as a place.
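The reported scores follow the usual definitions: precision over the predicted relations, recall over the gold standard, and their harmonic mean as F-score; with P = 0.44 and R = 0.79 the harmonic mean is about 0.57, matching the reported F-beta of 0.56 up to rounding. A small helper for computing these values from predicted and gold relation sets (illustrative only, with invented example relations, not the built-in APIS evaluation):

    # Precision/recall/F1 over predicted vs. gold-standard relation annotations;
    # an illustration of the computation, not the APIS evaluation module.
    def prf(predicted: set, gold: set) -> tuple[float, float, float]:
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1


    gold = {("Person A", "educated in", "Wien"), ("Person A", "worked in", "Graz")}
    predicted = {("Person A", "educated in", "Wien"),
                 ("Person A", "worked in", "Linz")}

    print(prf(predicted, gold))            # (0.5, 0.5, 0.5)

    # Sanity check of the reported figures: P = 0.44, R = 0.79
    p, r = 0.44, 0.79
    print(f"{2 * p * r / (p + r):.3f}")    # 0.565, i.e. the reported F-beta of 0.56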
                                                                             We will do so by retraining the model, and by implementing
There have not been many attempts to automatically extract information from biographical articles so far and no one - to the best of our knowledge - has tried to train models on relations annotated by researchers. However, Fokkens et al. (2014), for example, extract metadata on the portrayed person from full text. While this is not (exactly) the same as extracting relations to other entities, it is comparable (e.g. metadata on education vs. relations to schools and universities). Fokkens et al. (2014) had much higher precision, but significantly lower recall. The overall system performed similarly to our deep learning approach. Dib et al. (2015) used a somewhat similar approach to extract professions from Wikipedia articles. While they also used the parse tree (and especially the verbs) to find the connection between an actor (in our case the portrayed person) and a circumstance (in our case a Named Entity), they did not use a machine learning algorithm to predict the kind of relation, but used a (more or less) fixed set of words that describe the professions. Even if they evaluated it only on a limited number of well suited articles, the overall performance of their system was much higher than ours (recall: 74.1%, precision: 95.2% and F1: 83.3%). However, as it is focused on extracting professions only, the system is not really comparable to ours. Bonch-Osmolovskaya and Kolbasov (2015) also used rules to extract facts from a digital edition of Tolstoy's letters. While the system had a very good performance (comparable to Dib et al. (2015)) for professions, it had an F1 of 0.43 for family facts.
We are currently working on annotating 300 biographies specifically for training the relation extraction tools. While our training material so far focused on certain professions and on a specific research question, the model trained on these annotations should provide us with a baseline. Additionally, we are working on a gold standard for evaluating this baseline model.
We are also working on evaluating the rule based approach for relation extraction discussed above.

                      4    Conclusion
APIS provides an integrated system that allows researchers to annotate biographies and link the annotations to LOD resources (and therefore reuse data that already exists). In a second step it allows for basic visualizations, filtering and export of the data. On the other hand the system provides easy access to the database backend for data scientists and therefore allows the annotations to be used for training models and for out-of-the-box evaluation.
The NLP pipelines have some problems with the non-standard language used in biographic dictionaries such as the ÖBL. However, we found that the rule based approach as well as the trained models show some possibilities: the former - as others have shown before (Dib et al., 2015; Bonch-Osmolovskaya and Kolbasov, 2015) - especially for extracting data of well defined realms such as professions; the latter - even if precision and recall are not high enough yet - to provide historians at least with a useful baseline annotation that they can use as a starting point. This tool will - unlike the rule based approach - allow historians to train it with whatever they are interested in and get a first - even if not very accurate - annotation of the whole dataset.

                      5    Copyrights
These proceedings are published by CEUR. Copyright of the individual submissions remains entirely with the authors. Copyright of the proceedings falls to the editors. For a detailed explanation see: http://ceur-ws.org/

                      6    Acknowledgements
The APIS project is funded by a research grant (project number ÖAW0405) of the Nationalstiftung für Forschung, Technologie und Entwicklung (Programm "Digital Humanities - Langzeitprojekte zum kulturellen Erbe").

                      7    References
Martin Andert, Frank Berger, Paul Molitor, and Jörg Ritter. 2014. An optimized platform for capturing metadata of historical correspondence. 30(4):471-480.
Neil Armitage. 2016. The Biographical Network Method. Sociological Research Online, 21(2):16.
Anastasia Bonch-Osmolovskaya and Matvey Kolbasov. 2015. Tolstoy Digital: Mining Biographical Data in Literary Heritage Editions. In Proceedings of the First Conference on Biographical Data in a Digital World 2015, page 5, Amsterdam.
Firas Dib, Simon Lindberg, and Pierre Nugues. 2015. Extraction of Career Profiles from Wikipedia. In Proceedings of the First Conference on Biographical Data in a Digital World 2015, page 6, Amsterdam.
Antske Fokkens, Serge ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, and Guus Schreiber. 2014. BiographyNet: Methodological Issues when NLP supports historical research. pages 3728-3735.
Olivier Gergaud, Morgane Laouenan, and Etienne Wasmer. 2016. A Brief History of Human Time: Exploring a database of 'notable people'. Sciences Po Economics Discussion Paper 2016-03, Sciences Po Department of Economics, February.
Jürgen Kett. 2017. GND-Entwicklungsprogramm 2017-2021.
M. Reinert, M. Schrott, and B. Ebneth. 2015. From Biographies to Data Curation - The Making of www.deutsche-biographie.de. BD.
Matthias Schlögl and Peter Andorfer. 2018. acdh-oeaw/apis-core: Apis-core, May.
Matej Ďurčo, Karlheinz Mörth, Hannes Pirker, and Jutta Ransmayer. 2014. Austrian Media Corpus 2.0.
C. N. Warren, D. Shore, J. Otis, and L. Wang. 2016. Six Degrees of Francis Bacon: A Statistical Method for Reconstructing Large Historical Social Networks. DHQ, 10(3).
Florian Windhager, Paolo Federico, Saminu Salisu, Matthias Schlögl, and Eva Mayr. 2017. A Synoptic Visualization Framework for the Multi-Perspective Study of Biography and Prosopography Data. October.
