Biographical data exploration as a test-bed for a multi-view, multi-method approach in the Digital Humanities

André Blessing, Andrea Glaser and Jonas Kuhn
Institute for Natural Language Processing (IMS), Universität Stuttgart
Pfaffenwaldring 5b, 70569 Stuttgart, Germany
{firstname.lastname}@ims.uni-stuttgart.de

Abstract

The present paper has two purposes: the main point is to report on the transfer and extension of an NLP-based biographical data exploration system that was developed for Wikipedia data and is now applied to a broader collection of traditional textual biographies from different sources and an additional set of structured biographical resources, also adding membership in political parties as a new property for exploration. Along with this, we argue that this expansion step has many characteristic properties of a typical methodological challenge in the Digital Humanities: resources and tools of different origin and with different accuracy are combined for use in a multidisciplinary context. Hence, we view the project context as an interesting test-bed for some methodological considerations.

Keywords: information extraction, visualization, digital humanities, exploration system

1. Introduction

CLARIN (http://clarin.eu) is a large infrastructure project whose mission is to advance research in the humanities and social sciences. Scholars should be able to understand and exploit the facilities offered by CLARIN (Hinrichs et al., 2010) without technical obstacles. We developed a showcase (Blessing and Kuhn, 2014) called TEA (Textual Emigration Analysis, http://clarin01.ims.uni-stuttgart.de/geovis/showcase.html) to demonstrate how CLARIN can be used in a web-based application. The previously published version of the showcase was based on two data sets: a data set from the Global Migrant Origin Database, and a data set extracted from the German Wikipedia edition. The idea behind the chosen scenario was to give researchers in the humanities access to large textual data. The approach is not limited to the extraction of information; it also integrates interaction and visualization of the results. In particular, transparency is an important aspect in satisfying the needs of researchers in the humanities: each result must be inspectable.
In this work we integrate two new data sets into our application:

• NDB - Neue Deutsche Biographie (New German Biography)

• ÖBL - Österreichisches Biographisches Lexikon 1815-1950 (Austrian Biographical Dictionary)

Furthermore, we investigate new relations which are of high interest to researchers in the humanities, for example, whether a person is or was a member of a party, a company or a corporate body. In addition, we view the project context as an interesting test-bed for some methodological considerations.

1.1. The Exemplary Character of Biographical Data Exploration

The use of computational methods in the Humanities bears an enormous potential. Obviously, moving representations of artifacts and knowledge sources to the digital medium and interlinking them provides new ways of integrated exploration. But while this change of medium could be argued to "merely" speed up the steps a scholar could in principle take with traditional means, there are opportunities that clearly expand the traditional methodological spectrum, (a) through interaction and sharing among scholars, potentially from quite different fields (e.g., shared annotations (Bradley, 2012)), and (b) through scaling to a substantially larger collection of objects of study, which can undergo exploration and qualitative analysis, and of course quantitative analysis (Moretti, 2013; Wilkens, 2011).

However, these novel avenues turn out to be very hard to integrate into established disciplinary frameworks, e.g., in literary or cultural studies, and from the point of view of computational scientists with less scholarly erudition it often appears that the scaling potential of computational analysis and modeling is heavily under-explored (Ramsay, 2003; Ramsay, 2007). It is important to understand what is behind this rather reluctant adoption. Our hypothesis is that humanities scholars perceive a lack of control over the scalable analytical machinery and should be placed in a position to apply fully transparent computational models (including imperfect automatic analysis steps) that invite critical reflection and subsequent adaptation; the bottom-up approach laid out in Blanke and Hedges (2013) seems an effective strategy to counteract this situation. An orthogonal issue lies in the fact that advanced scholarly research tends to target resources and artifacts that have not previously been made accessible and studied in detail. The digitization process therefore takes up a considerable part of a typical project, and a bootstrapping cycle of computational tools and models (as is common in methodologically oriented projects in the computational sciences) cannot be applied to datasets that are sufficiently relevant to the actual scholarly research question.

We believe that biographical data exploration is an excellent test-bed for pushing forward a scalability-oriented program in the Digital Humanities: the compilation of biographical information collections from heterogeneous sources has a long tradition, and every user of traditional, printed resources of this kind is aware of the trade-off between the benefit of large coverage and the cost of high reliability and depth of individual entries. In other words, the intricacies that come with scalable computational models (concerning the reliability of the data extraction procedure, the granularity and compatibility of data models, etc.) have pre-digital predecessors, and an exploration environment may invite a competent negotiation of these factors. Here, a very natural multiple-view presentation in a digital exploration platform can bring in a great deal of transparency: with a brushing-and-linking approach, users can go back and forth between an entity-centered view on biographical data (starting out from individuals or from a visualization of tangible aggregates, e.g., by geographical or temporal affinity) and the sources from which the information was extracted (e.g., natural language text passages or (semi-)structured information sources). This readily invites a critically reflected use of the information. Methodological artifacts tend to stand out in aggregate presentations along an independent dimension, and it does not take specialist knowledge to identify systematic errors (e.g., in an underlying NLP component), which can then be fixed in an interactive working environment. Lastly, an important aspect besides this model character, in terms of the interplay of resources and computational components and the natural options for multi-view visualization, is the relevance of biographical collections to multiple different disciplines in the humanities and social sciences. Hence, sizable resources are already available and in use, and it is likely that improved ways of providing access to such collections and of encouraging interactive improvements of reliability, coverage and connectivity will actually benefit research in various fields (and will hence generate feedback on the methodological questions we are raising).
We are not the first to work on the exploration of different biographical data sets. The BiographyNet project (Fokkens et al., 2014; Ockeloen et al., 2013) tackles similar questions concerning the reliability of resources, the significance of derived output, and how results can be adjusted to improve performance and acceptance.

2. System Overview

Figure 1 shows the architecture of our approach. The system integrates different biographical data sources (top left). Additional biographical data sources can be integrated if they are based on textual data. Textual sources are processed by the NLP pipeline (top middle), which will be explained in the next section. In addition to textual data, structured data sets (top right) are used to enable real-world inference (e.g., mapping extracted knowledge onto a world map). We discuss the structured data set used in more detail later on. The data model (middle), which is central to our system, includes the derived and extracted data and additionally all links to the sources. This enables transparency by providing access to the whole processing pipeline. Finally, several views of the data model (bottom) are provided. These allow the user to visualize the obtained data in different ways. A specific view can be chosen depending on the actual research question.

Figure 1: Overview of the NLP-based biographical data exploration system: unstructured sources (Wikipedia, ÖBL, NDB) and structured sources (GND) are processed by the NLP pipeline into the central data model, which serves geo-centric, entity-centric and statistic-centric views to the DH scholar.

2.1. NLP Pipeline

Natural Language Processing (NLP) is typically done by chaining several tools into a pipeline. The right-hand part of Figure 2 shows some basic tools (Mahlow et al., 2014) which are necessary. This pipeline includes normalization, sentence segmentation, tokenization, part-of-speech tagging, coreference resolution, and named entity recognition. An important property is that these components are not rigidly combined. This allows the user to adjust or substitute single components if the performance of the whole system is not sufficient. The system is also language independent insofar as all NLP tools for one language can be replaced by tools for other languages. Table 1 gives more details about the versions used. These services are designed to process big data and do not require a local installation of linguistic tools; such installations are often time consuming, since most tools use different input and output formats which have to be adapted.

Figure 2: The data model is based on the UIMA framework, which interacts with CLARIN web services; the diagram shows converters, the IMS type system, a TCF wrapper for the TCF exchange format, UIMA modules such as a feature extractor and a named entity recognizer, and ClearTK.

Table 1: Overview of the CLARIN web services used. The PIDs refer to the CMDI descriptions of the services.
• Tokenizer (Schmid, 2000): tokenizer and sentence boundary detector for English, French and German. PID: http://hdl.handle.net/11858/00-247C-0000-0007-3736-B
• TreeTagger (Schmid, 1995): part-of-speech tagging for English, French and German. PID: http://hdl.handle.net/11858/00-247C-0000-0022-D906-1
• RFTagger (Schmid and Laws, 2008): part-of-speech tagging for English, French and German using a fine-grained POS tagset. PID: http://hdl.handle.net/11858/00-247C-0000-0007-3735-D
• German NER (Faruqui and Padó, 2010): German named entity recognizer based on Stanford NLP. PID: http://hdl.handle.net/11858/00-247C-0000-0022-DDA1-3
• Stuttgart Dependency Parser (Bohnet and Kuhn, 2012): Bohnet dependency parser. PID: http://hdl.handle.net/11858/00-247C-0000-0007-3734-F
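To make the loose coupling of the pipeline components concrete, the following minimal Python sketch shows one way such a swappable chain of components could be wired up. The dummy components and the document structure are illustrative assumptions; the actual system chains CLARIN web services and UIMA analysis engines rather than plain Python functions.

```python
# Minimal sketch of a loosely coupled NLP pipeline (illustrative only; the
# real system chains CLARIN web services and UIMA analysis engines).
from typing import Callable, Dict, List

Document = Dict[str, object]          # holds the raw text plus annotation layers
Component = Callable[[Document], Document]

def tokenizer(doc: Document) -> Document:
    # naive whitespace tokenizer standing in for the CLARIN tokenizer service
    doc["tokens"] = str(doc["text"]).split()
    return doc

def pos_tagger(doc: Document) -> Document:
    # dummy tagger: marks capitalized tokens as nouns, everything else as X
    doc["pos"] = ["NN" if t[:1].isupper() else "X" for t in doc["tokens"]]
    return doc

def ner(doc: Document) -> Document:
    # dummy named entity recognizer: capitalized tokens are entity candidates
    doc["entities"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc

def run_pipeline(doc: Document, components: List[Component]) -> Document:
    # components are chained but not rigidly combined: any element of the
    # list can be replaced by an alternative implementation (e.g. a tagger
    # for another language) without touching the rest of the pipeline
    for component in components:
        doc = component(doc)
    return doc

if __name__ == "__main__":
    doc = {"text": "Moritz Oppenheimer emigrierte 1939 in die USA ."}
    print(run_pipeline(doc, [tokenizer, pos_tagger, ner])["entities"])
```

Because the pipeline is just a sequence, substituting, say, the tagger for one language with a tagger for another amounts to replacing one element of that sequence, which is exactly the kind of adjustment the system is meant to allow.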
2.2. Data Model

The data model of our system has to fit several requirements: i) store textual data and linguistic annotations; ii) enable interlinking and exploration of the data; iii) aggregate results for visualization and data export; iv) store process metadata.

CLARIN-D provides its own data format called TCF (Heid et al., 2010), which is designed for efficient processing with minimal overhead. However, such a format is not adequate as the core data model of an application. We decided to use the Unstructured Information Management Architecture (UIMA) framework (Ferrucci and Lally, 2004) as our data model. The core of UIMA provides a data-driven framework for the development and application of NLP processing systems. It provides a customizable annotation scheme, called a type system. This type system is flexible and makes it possible to integrate one's own annotations on different layers (e.g., part-of-speech tags, named entities) in the UIMA framework. It is also possible to keep track of existing structured information (e.g., hyperlinks in Wikipedia articles or highlighted phrases in a biographical lexicon) as the original text's own annotation in UIMA. Automatic annotation components are called analysis engines in UIMA. Each of these engines has to be defined by a description language which includes the enumeration of all input and output types. This allows us to chain different engines, including validation checks. UIMA is a well-accepted data model framework, especially since the most popular UIMA-based application, Watson (Ferrucci et al., 2010), won the US quiz show "Jeopardy!" against human competitors. The flexible type system also enables the separation of content-based annotations and process metadata annotations (Eckart and Heid, 2014), which allows keeping track of the processing history, including versioning. Such tracking of process metadata can also be seen as provenance modeling (Ockeloen et al., 2013). The combination of UIMA and TCF is simple, since only a single bridge annotation engine is needed to map between the two annotation schemata.

ClearTK is used as the machine learning (ML) interface (Ogren et al., 2008). It integrates several ML algorithms (e.g., Maximum Entropy classification). The extraction of relevant features is a customized component of the ClearTK framework. The features used are described in Blessing and Schütze (2010). At the current stage a standard feature set is used (e.g., part-of-speech tags, dependency paths, lemma information).
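As a rough illustration of what the data model has to keep apart, the following Python sketch separates stand-off content annotations over the source text from process metadata recording which component produced them and in which version. It is a simplified stand-in for the UIMA type system, not actual UIMA or ClearTK code; the identifiers and the example sentence are constructed for illustration.

```python
# Simplified stand-in for the UIMA-style data model: stand-off annotations
# over the source text plus process metadata for provenance tracking.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Annotation:
    layer: str          # e.g. "pos", "ner", "emigration-relation"
    begin: int          # character offsets into the source text (stand-off)
    end: int
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class ProcessStep:
    component: str      # which analysis engine produced the annotations
    version: str        # tool version, keeps the processing history replayable
    input_layers: List[str]
    output_layers: List[str]

@dataclass
class Document:
    source_id: str      # link back to the original article (invented ID here)
    text: str
    annotations: List[Annotation] = field(default_factory=list)
    history: List[ProcessStep] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]

doc = Document(source_id="oebl:oppenheimer-moritz",
               text="Moritz Oppenheimer emigrierte 1939 in die USA.")
doc.annotations.append(Annotation("ner", 0, 18, {"type": "PERSON"}))
doc.history.append(ProcessStep("GermanNER", "v1", ["tokens"], ["ner"]))
print(doc.covered_text(doc.annotations[0]))   # -> "Moritz Oppenheimer"
```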
2.3. Textual Emigration Analysis

After the abstract definition of the requirements and the architecture, we give a more detailed view of the extended TEA tool. As mentioned before, we are using the already deployed web-based application that allows researchers to make quantitative and qualitative statements about persons who emigrated to other countries. The visualization of the results on a map helps to understand spatial aspects of the emigration paths, for example, whether people mostly emigrate to nearby regions on the same continent or whether they are spread over the whole world. The visualization contains a second view which aggregates and sums the emigrations between two countries. The aggregated numbers can be inspected in a third view. Thereby, each number is decomposed into all persons who are part of the given emigration path. Not only the person names are shown, but the whole sentence stating the emigration can be visualized. In expert mode, such sentences can also be marked as correct or wrong by the user to increase the performance of the system through retraining or active learning. For more technical details on the base system please consult Blessing and Kuhn (2014).

The extended application, which contains the two new data sets, is shown in Figure 3. In this example the Austrian Biographical Lexicon (ÖBL) is used as the data origin. The user selected the country Germany, and the extended system returned all persons who emigrated from Germany to other countries. This information is represented by arcs on the map and as a table at the bottom of the screen. A key feature of the application is that each number can be grounded in the underlying text snippets. This allows users interested in, e.g., the two persons who emigrated from Germany to the US to click on the details and open an additional view that lists all persons, including the sentence which describes the emigration.

Figure 3: Using the TEA tool to query emigrations from Germany based on the ÖBL data set. The emigration details window refers to the ÖBL source, which states that Moritz Oppenheimer emigrated from Germany to the US in 1939.

The three view types of the TEA application, geo-driven, text-driven and quantity-driven, help to explore the data set from different perspectives, which allows researchers to identify inconsistencies. For example, the geo-driven view can be used to compare emigrations in a region by selecting adjacent countries. Such an analysis helps to find systematic geo-mapping errors (e.g., concerning the former USSR and the Baltic states). In contrast, the text-driven view enables the identification of errors caused by NLP.
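The grounding of every aggregate number in its source sentences can be pictured in a few lines of Python. The record layout is a hypothetical simplification of the actual data model, and the two example records are invented for illustration (the first mirrors the Moritz Oppenheimer case from Figure 3).

```python
# Sketch: aggregate extracted emigration relations per country pair while
# keeping the source sentences, so every count stays inspectable.
from collections import defaultdict

extracted = [  # hypothetical output of the emigration relation extractor
    {"person": "Moritz Oppenheimer", "from": "Germany", "to": "USA",
     "source": "ÖBL", "sentence": "... emigrierte 1939 in die USA."},
    {"person": "Anna Beispiel", "from": "Germany", "to": "USA",
     "source": "NDB", "sentence": "... ging 1938 nach New York."},
]

aggregate = defaultdict(list)
for rel in extracted:
    aggregate[(rel["from"], rel["to"])].append(rel)

# quantitative view: counts per emigration path (drawn as arcs on the map)
for (src, dst), rels in aggregate.items():
    print(f"{src} -> {dst}: {len(rels)} person(s)")

# textual grounding: clicking a count opens the underlying snippets
for rel in aggregate[("Germany", "USA")]:
    print(rel["person"], "|", rel["source"], "|", rel["sentence"])
```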
2.4. Challenges for the extension of the TEA system

To allow a smooth integration of the new biographical data sets, a few modifications to the NLP pipeline were needed. First, the import methods had to be adapted to allow the extraction of the textual elements from the new XML or HTML files. Second, the text normalization component had to be adjusted to biographical texts, because ÖBL and NDB use many more abbreviations, which had to be resolved. This could easily be done using a list of abbreviations provided on the NDB website.

The integration of a new relation was more challenging: a new relation extraction component had to be defined and trained. For the emigration relation the whole process was done manually, which is very time consuming. For the member-of-party relation we therefore switched to a new system currently under development, called the 'extractor creator'. Since this system is at an early stage of engineering, the member-of-party relation was used as a development scenario. Figure 4 shows a screenshot of the extractor creator. Some of the basic methods of the interactive relation extraction component were published in Blessing et al. (2012) and Blessing and Schütze (2010). The novelty of the new system is that more background knowledge is integrated by using person identifiers (based on the German Integrated Authority File, GND) and Wikidata (Erxleben et al., 2014). This leads to more effective filtering in the search, which increases the performance of the whole system. The example in Figure 4 shows the lookup of specific persons and the listing of all Körperschaften (corporate bodies) that are mentioned in the same Wikipedia article. A click on one of the corporate bodies opens the table on the right, which lists all persons whose articles also mention this corporate body. A mouse-over function allows the user to see the textual context of the mention. The human instructor can then add relevant sentences as positive or negative training examples.

Figure 4: Prototype of the interactive relation extraction creator.

The first results of the novel relation extractor showed that, unlike for the emigration relation, a more fine-grained syntactic feature set is needed in the corporate-bodies scenario. Figure 5 shows a simplified example that includes a negation; negations occurred only rarely in the emigration scenario.

Figure 5: Dependency parse of the German sentence "Angela Merkel war kein Mitglied der SED." ("Angela Merkel was not a member of the SED.")
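To illustrate why a finer-grained syntactic feature set matters here, the following Python sketch derives a few features for a (person, organisation) candidate pair from a hand-built dependency parse of the sentence in Figure 5. The tuple format, dependency labels and feature names are simplified assumptions and not the system's actual ClearTK feature extractor.

```python
# Sketch: negation-aware features for the member-of relation. The hand-built
# dependency tuples (id, form, lemma, head, deprel) are simplified and do not
# reproduce actual parser output.
NEGATION_LEMMAS = {"kein", "nicht"}

def member_of_features(tokens, person_id, org_id):
    """Collect simple features for one (person, organisation) candidate pair."""
    feats = set()
    trigger_id = None
    for tid, form, lemma, head, deprel in tokens:
        if lemma == "Mitglied":                 # 'member' noun as lexical trigger
            trigger_id = tid
            feats.add("trigger=Mitglied")
        if lemma in NEGATION_LEMMAS:
            feats.add("negation_in_sentence")
        if tid == person_id:
            feats.add("person_deprel=" + deprel)
    if trigger_id is not None:
        for tid, form, lemma, head, deprel in tokens:
            if head == trigger_id and lemma in NEGATION_LEMMAS:
                feats.add("trigger_negated")    # 'kein Mitglied' flips the label
            if tid == org_id and head == trigger_id:
                feats.add("org_attached_to_trigger")
    return feats

# "Angela Merkel war kein Mitglied der SED." (Angela Merkel was not a member
# of the SED.) -- simplified parse in the spirit of Figure 5
tokens = [
    (1, "Angela",   "Angela",   2, "name"),
    (2, "Merkel",   "Merkel",   5, "subj"),
    (3, "war",      "sein",     5, "cop"),
    (4, "kein",     "kein",     5, "det"),
    (5, "Mitglied", "Mitglied", 0, "root"),
    (6, "der",      "der",      7, "det"),
    (7, "SED",      "SED",      5, "genmod"),
    (8, ".",        ".",        5, "punct"),
]
print(member_of_features(tokens, person_id=2, org_id=7))
# output includes 'trigger_negated', evidence against the member-of relation
```

A classifier trained on such features can learn that 'trigger_negated' speaks against the relation, which plain surface features would miss.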
The given results showed that Wikidata is not complete enough to be example in Figure 4 shows the lookup of specific persons a sustainable gold standard. This observation was made by and the listing of all mentioned Körperschaften (corporate manually evaluating the membership relation in the Social bodies) which are mentioned in the same Wikipedia article. Democratic Party of Germany. In this evaluation scenario A click on one of the corporate bodies opens the table on our extractor found 18 persons which were not represented the right which lists all person who also mention this corpo- in Wikidata. This constitutes 20 percent of the extracted rate body. A mouse-over function allows the user to see the data. As a consequence, we need a larger manually anno- textual context of the mention. The human instructor can tated data set to enable a valid evaluation on precision. then add relevant sentences as positive or negative training Both experiments give evidence that we reached our first examples. goal, which can be seen as a proof-of-concept. The chosen scenarios are not sufficient to enable an exhaustive evalua- tion since we have no well-defined gold standard data sets. The first results of the novel relation extractor showed that However, components like the relation extraction provide unlike the emigration relation a more fine-grained syntactic enough parameters for optimization in the future. feature set is needed in the scenario of corporate bodies. Figure 5 shows a simplified example that includes negations which occurred only rarely in the emigration scenario. 4. Related Work 2.5. Entity disambiguation Since the Message Understanding Conferences (Grishman Along with the extension of the core TEA system, we per- and Sundheim, 1996) in the 1990s, Information Extraction form experiments with special disambiguation techniques (IE) is an established field in NLP research. Chiticariu 57 Figure 4: Prototype of the interactive relation extraction creator. Figure 5: Dependency parse of the German sentence: Angela Merkel war kein Mitglied der SED. proaches (Li et al., 2012). One reason is the economic effi- ciency of rule-based systems which are expensive in devel- opment since the rules are hand crafted but later on the are very efficient without needing huge computational power Ö and resources. For researchers such systems are not as at- 18, BL tractive since their goals are different by working on clean N 428 22 DB gold standard data sets which allow exhaustive evaluation ,14 9 by comparing precision and recall numbers. In our system, 4,782 1,1 47 we experimented with both, ML-based and rule-based ap- proaches. Rule-based systems have the big advantage to provide transparency to the end users. On the other hand, 16,317 small changes on the requested relations need a complete rewriting of the rules. We believe that a hybrid approach which allows the definition of some rule-based constraints to correct the output of supervised systems are the systems Wikipedia + GND which provide the highest acceptance. 250,360 The drawback of ML-based IE systems (Agichtein and Gra- vano, 2000; Suchanek et al., 2009) is the need of expensive manually annotated training data. There are unsupervised approaches (Mausam et al., 2012; Carlson et al., 2010) Figure 6: Size of used data sets. to avoid training data but then the semantics of the ex- tracted information is often not clear. Especially, for DH researchers, which have a clear definition of the informa- et al. 
3. Experiments

The largest data set consists of articles about persons extracted from the German Wikipedia edition. It covers 250,360 persons after filtering by the German Integrated Authority File (GND). The NDB data set contains 22,149 persons and the ÖBL data set 18,428 persons. Figure 6 depicts the overlap of the data sets; only 1,147 persons are part of all three data sets.

Figure 6: Size of the data sets used (Wikipedia + GND: 250,360 persons; NDB: 22,149 persons; ÖBL: 18,428 persons; 1,147 persons occur in all three sets).

We extracted 12,402 instances of the emigration relation from the Wikipedia person data set. For the NDB data set we found 1,932 instances of this relation, and for the ÖBL data set we extracted 1,188 instances. Most of the persons found in Wikipedia are part of neither NDB nor ÖBL, which leads to the higher number of Wikipedia emigrations. Moreover, the overlap of all three data sets is small, meaning that we only have a few cases in which a person who emigrated is represented in all three data sets. An automatic comparison of the found emigration instances is therefore only possible to a limited extent, since the different textual representations are not parallel for all facts.

The member-of-party extraction is at an early development stage. It achieves high accuracy, but its coverage is low. We started to use Wikidata for evaluation purposes, since it also contains the same relation. However, the first results showed that Wikidata is not complete enough to serve as a sustainable gold standard. This observation was made by manually evaluating the membership relation for the Social Democratic Party of Germany: in this evaluation scenario our extractor found 18 persons who were not represented in Wikidata, which amounts to 20 percent of the extracted data. As a consequence, we need a larger manually annotated data set to enable a valid evaluation of precision.

Both experiments give evidence that we reached our first goal, which can be seen as a proof of concept. The chosen scenarios are not sufficient for an exhaustive evaluation, since we have no well-defined gold standard data sets. However, components like the relation extractor provide enough parameters for optimization in the future.
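Once all resources are keyed by a common person identifier such as the GND number, both the data-set overlaps and the comparison against Wikidata reduce to simple set operations, as the following sketch with invented identifiers shows.

```python
# Sketch: data-set overlap and Wikidata comparison, assuming all resources
# are keyed by a common person identifier (GND). The tiny ID sets are
# invented; the real collections contain 250,360 / 22,149 / 18,428 persons.
wikipedia = {"gnd:1", "gnd:2", "gnd:3", "gnd:4"}
ndb       = {"gnd:2", "gnd:3", "gnd:5"}
oebl      = {"gnd:3", "gnd:4", "gnd:6"}

in_all_three = wikipedia & ndb & oebl
print("persons in all three data sets:", len(in_all_three))

# comparison of extracted party memberships against Wikidata: members found
# by the extractor but missing from Wikidata point to gaps in the 'gold' data
extracted_spd_members = {"gnd:2", "gnd:3", "gnd:7"}
wikidata_spd_members  = {"gnd:2", "gnd:3"}
missing_in_wikidata = extracted_spd_members - wikidata_spd_members
share = len(missing_in_wikidata) / len(extracted_spd_members)
print(f"not in Wikidata: {len(missing_in_wikidata)} ({share:.0%} of extracted)")
```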
4. Related Work

Since the Message Understanding Conferences (Grishman and Sundheim, 1996) in the 1990s, Information Extraction (IE) has been an established field of NLP research. Chiticariu et al. (2013) presented a study showing that IE is addressed in completely different ways in research and in industry: 75 percent of NLP papers (2003-2012) use machine learning techniques and only 3.5 percent use rule-based systems, whereas 67 percent of commercial IE systems are rule-based (Li et al., 2012). One reason is the economic efficiency of rule-based systems: they are expensive to develop, since the rules are hand-crafted, but afterwards they run very efficiently without needing huge computational power and resources. For researchers such systems are less attractive, since their goals are different: they work on clean gold standard data sets which allow exhaustive evaluation by comparing precision and recall numbers. In our system we experimented with both ML-based and rule-based approaches. Rule-based systems have the big advantage of providing transparency to the end users. On the other hand, small changes to the requested relations require a complete rewriting of the rules. We believe that a hybrid approach, which allows the definition of rule-based constraints to correct the output of supervised systems, will provide the highest acceptance.

The drawback of ML-based IE systems (Agichtein and Gravano, 2000; Suchanek et al., 2009) is their need for expensive manually annotated training data. There are unsupervised approaches (Mausam et al., 2012; Carlson et al., 2010) that avoid training data, but then the semantics of the extracted information is often not clear. Especially for DH researchers, who have a clear definition of the information to extract, this is not feasible.

Another requirement of DH scholars is that they want to use complete systems, often called end-to-end systems. PROPMINER (Akbik et al., 2013) is such a system, which uses deep-syntactic information. For our use case such a system is not sufficient, since it does not provide several views on the data, which is also a big factor for the usability of a system in the DH community.
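As a minimal illustration of the hybrid strategy advocated above, the following Python sketch lets hand-written constraints veto individual decisions of a supervised extractor, keeping the correction step transparent to the end user. The classifier stub, the single negation rule and the candidate records are illustrative assumptions, not components of the actual system.

```python
# Minimal sketch of a hybrid extractor: a supervised classifier proposes
# relation instances, and transparent hand-written constraints may overrule
# them. The classifier stub and the single rule are illustrative only.
from typing import Callable, Dict, List

Instance = Dict[str, str]            # e.g. {"person": ..., "org": ..., "sentence": ...}
Rule = Callable[[Instance], bool]    # returns True if the instance must be rejected

def negation_rule(inst: Instance) -> bool:
    # reject 'member-of' instances whose sentence negates the membership
    return " kein " in " " + inst["sentence"].lower() + " "

def hybrid_extract(instances: List[Instance],
                   classify: Callable[[Instance], bool],
                   rules: List[Rule]) -> List[Instance]:
    accepted = []
    for inst in instances:
        if classify(inst) and not any(rule(inst) for rule in rules):
            accepted.append(inst)
    return accepted

def always_yes(inst: Instance) -> bool:
    # stub classifier that accepts every candidate (stand-in for the ML model)
    return True

candidates = [
    {"person": "Angela Merkel", "org": "SED",
     "sentence": "Angela Merkel war kein Mitglied der SED."},
    {"person": "Angela Merkel", "org": "CDU",
     "sentence": "Angela Merkel ist Mitglied der CDU."},
]
print(hybrid_extract(candidates, always_yes, [negation_rule]))
# only the CDU instance survives the negation constraint
```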
5. Conclusion

We presented extensions of an experimental system for NLP-based exploration of biographical data. Merging data sources that have non-empty intersections provides an important handle for quality control.

Offering multiple views for data exploration turns out to be useful, not only from a data gathering perspective, but, quite importantly, also as a way of inviting users to keep a critical distance from the presented results. Methodological artifacts that originate from NLP errors or other problems tend to stand out in one of the aggregate visualizations.

5.1. Outlook

We are collaborating with scholars from different fields of the humanities who are interested in using our system. Common questions are: which persons held certain positions at what time? Which persons were members of organizations or smaller groups at the same time? Which persons were educated at the same institutions? We will incrementally integrate such relation extractors into our system and observe the user experience. The combination of data aggregation and transparency is one of the crucial factors for gaining high acceptance among DH scholars. We will also evaluate which additional factors are relevant for the acceptance of such a system.

Acknowledgements

We thank the anonymous reviewers for their valuable questions and comments. This work is supported by CLARIN-D (Common Language Resources and Technology Infrastructure, http://de.clarin.eu/), funded by the German Federal Ministry for Education and Research (BMBF), and by a Nuance Foundation Grant.

6. References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the 5th ACM Conference on Digital Libraries, pages 85–94.

Alan Akbik, Oresti Konomi, and Michail Melnikov. 2013. PROPMINER: A workflow for interactive information extraction and exploration using dependency trees. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 157–162, Sofia, Bulgaria, August. Association for Computational Linguistics.

Tobias Blanke and Mark Hedges. 2013. Scholarly primitives: Building institutional infrastructure for humanities e-science. Future Generation Computer Systems, 29(2):654–661.

Andre Blessing and Jonas Kuhn. 2014. Textual Emigration Analysis (TEA). In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Andre Blessing and Hinrich Schütze. 2010. Self-annotation for fine-grained geospatial relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 80–88.

Andre Blessing, Jens Stegmann, and Jonas Kuhn. 2012. SOA meets relation extraction: Less may be more in interaction. In Proceedings of the Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts, Digital Humanities, pages 6–11.

Bernd Bohnet and Jonas Kuhn. 2012. The best of both worlds – a graph-based completion model for transition-based parsers. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 77–87.

John Bradley. 2012. Towards a richer sense of digital annotation: Moving beyond a media orientation of the annotation of digital objects. Digital Humanities Quarterly, 6(2).

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of the 24th Conference on Artificial Intelligence, pages 1306–1313.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 827–832, Seattle, Washington, USA. Association for Computational Linguistics.

Kerstin Eckart and Ulrich Heid. 2014. Resource interoperability revisited. In Ruppenhofer and Faaß (2014), pages 116–126.

Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandečić. 2014. Introducing Wikidata to the linked data web. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014), volume 8796 of LNCS, pages 50–65. Springer, October.

Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of the Conference on Natural Language Processing (KONVENS), pages 129–133.

David Ferrucci and Adam Lally. 2004. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Christopher Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.

Antske Fokkens, Serge ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, and Guus Schreiber. 2014. BiographyNet: Methodological issues when NLP supports historical research. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 26-31.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In Proceedings of the 16th Conference on Computational Linguistics, pages 466–471.

Ulrich Heid, Helmut Schmid, Kerstin Eckart, and Erhard Hinrichs. 2010. A corpus representation format for linguistic web services: The D-SPIN Text Corpus Format and its relationship with ISO standards. In Proceedings of LREC-2010, Linguistic Resources and Evaluation Conference, Malta. [CD-ROM].

Marie Hinrichs, Thomas Zastrow, and Erhard Hinrichs. 2010. WebLicht: Web-based LRT services in a distributed eScience infrastructure. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). Electronic proceedings.

Yunyao Li, Laura Chiticariu, Huahai Yang, Frederick R. Reiss, and Arnaldo Carreno-fuentes. 2012. WizIE: A best practices guided development environment for information extraction. In Proceedings of the ACL 2012 System Demonstrations, pages 109–114, Stroudsburg, PA, USA. Association for Computational Linguistics.

Cerstin Mahlow, Kerstin Eckart, Jens Stegmann, André Blessing, Gregor Thiele, Markus Gärtner, and Jonas Kuhn. 2014. Resources, tools, and applications at the CLARIN center Stuttgart. In Ruppenhofer and Faaß (2014), pages 127–137.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Franco Moretti. 2013. Distant Reading. Verso, London.

Niels Ockeloen, Antske Fokkens, Serge ter Braake, Piek T. J. M. Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne. 2013. BiographyNet: Managing provenance at multiple levels and from different perspectives. In Paul T. Groth, Marieke van Erp, Tomi Kauppinen, Jun Zhao, Carsten Keßler, Line C. Pouchard, Carole A. Goble, Yolanda Gil, and Jacco van Ossenbruggen, editors, Proceedings of the 3rd International Workshop on Linked Science 2013 - Supporting Reproducibility, Scientific Investigations and Experiments (LISC2013), in conjunction with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 21, 2013, volume 1116 of CEUR Workshop Proceedings, pages 59–71. CEUR-WS.org.

Philip V. Ogren, Philipp G. Wetzler, and Steven Bethard. 2008. ClearTK: A UIMA toolkit for statistical natural language processing. In UIMA for NLP Workshop at the Language Resources and Evaluation Conference, pages 32–38.

Stephen Ramsay. 2003. Toward an algorithmic criticism. Literary and Linguistic Computing, 18:167–174.

Stephen Ramsay. 2007. Algorithmic Criticism, pages 477–491. Blackwell Publishing, Oxford.

Josef Ruppenhofer and Gertrud Faaß, editors. 2014. Proceedings of the 12th Edition of the KONVENS Conference, Hildesheim, Germany, October 8-10, 2014. Universitätsbibliothek Hildesheim.

Helmut Schmid and Florian Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 777–784.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT Workshop, pages 47–50.

Helmut Schmid. 2000. Unsupervised learning of period disambiguation for tokenisation. Technical report, IMS, University of Stuttgart.

Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. 2009. SOFIE: A self-organizing framework for information extraction. In Proceedings of the 18th International Conference on World Wide Web, pages 631–640.

Matthew Wilkens. 2011. Canons, close reading, and the evolution of method. In Matthew K. Gold, editor, Debates in the Digital Humanities. University of Minnesota Press, Minneapolis.