Querying the Deutsches Textarchiv

Bryan Jurish, Christian Thomas, Frank Wiegand
Deutsches Textarchiv · Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstrasse 22/23 · 10117 Berlin · Germany
jurish|thomas|wiegand@bbaw.de

Abstract

Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for conventional search architectures, which typically rely on a static inverted index keyed by orthographic form. Additional steps must therefore be taken in order to improve recall, in particular for single-term bareword queries from non-expert users. This paper describes the query processing architecture currently employed for full-text search of the historical German document collection of the Deutsches Textarchiv project.

Copyright © 2014 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap'14 Workshop, Berlin, Germany, 4 March 2014, published at http://ceur-ws.org

1 Introduction

Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon keyed by orthographic form. Conventional search architectures, on the other hand, typically rely on a static inverted index [Knu73, BCC10] mapping each actually occurring surface string to a list of its locations, implicitly assuming that the source texts adhere to strict orthographic conventions. Since casual or non-expert users cannot be expected to be familiar with the many spelling variants to be found in historical document collections, and since the explicit enumeration of all possible variants can be a time-consuming and error-prone process even for language-historical experts, additional steps must be taken to improve recall [EGF06, HHL+07, GNR+09, Jur12, Efr13].

This paper describes the process architecture for full-text search in the historical German document collection of the Deutsches Textarchiv (DTA). Our approach makes use of an extensive corpus preprocessing phase to annotate the source texts with linguistically salient attributes such as "canonical" contemporary form, part-of-speech tag, and lemma. Building on the richly annotated corpus and a document index structure supporting multiple quasi-independent token-level attributes, naïve bareword searches are expanded into equivalence classes of historical spelling variants by a dedicated external expansion server.

The rest of this paper is organized as follows: section 2 describes the historical text corpus indexed by the DTA, section 3 describes the DTA query processing architecture in greater detail, and section 4 contains a conclusion and a brief description of work currently in progress.

2 Text Corpora

The Deutsches Textarchiv ("German Text Archive", http://www.deutschestextarchiv.de), a project funded by the Deutsche Forschungsgemeinschaft (DFG, "German Research Foundation") at the Language Research Center of the Berlin-Brandenburg Academy of Sciences and Humanities, provides a core corpus of more than 1300 significant German texts from various disciplines originally published between ca. 1600 and 1900. Due to the project's primary focus on the history of the German language, the full-text transcriptions document the original printed works, of which the earliest accessible edition was digitized. The transcriptions were acquired for the most part using the highly accurate double-keying method; optical character recognition (OCR) was used for only ca. 200 volumes, together with extensive manual pre-structuring and post-correction phases. The corpus as a whole therefore displays an exceptionally high accuracy, not only on the level of transcription but also on the annotation level.

The DTA core text sources are published via the Internet as digital facsimiles and as XML-annotated transcriptions together with comprehensive bibliographic meta-data. The annotation consistently follows the well-documented DTA "base format" (DTABf, http://www.deutschestextarchiv.de/doku/basisformat), a TEI subset developed for the representation of (historical) written corpora [CLA12]. As of January 2014, the DTA core corpus comprises 1301 digitized volumes (ca. 680M characters, 100M tokens).

In addition to the core corpus, the DTA currently includes 473 high-quality textual resources provided by cooperating projects or curated from existing text collections such as Wikisource and Project Gutenberg; these resources were integrated in the course of a BMBF-funded CLARIN-D "curation project" (for documentation and a list of the resources integrated, cf. http://www.deutschestextarchiv.de/clarin_kupro). Further additions include the Polytechnisches Journal (1820–1931; 370 volumes, 490M characters, 78M tokens); the original project's web page is http://dingler.culture.hu-berlin.de, and after DFG funding expired, the complete text base was integrated into the BBAW corpus infrastructure with an open DDC search wrapper for corpus queries at http://kaskade.dwds.de/dingleros. In total, the DTA and its extensions comprise approximately 1.2B characters in 195M tokens. In the context of a DFG-funded project, the existing OCR text of the journal Die Grenzboten (1841–1922, http://brema.suub.uni-bremen.de/grenzboten) is currently being structured according to the DTABf and automatically corrected on the character level. The resulting optimized text base will be integrated as an extension to the DTA corpora as well.

All corpus texts are available in DTAQ, a web-based platform for collaborative quality assurance. Within DTAQ, transcriptions can be proofread, and misprints, transcription or annotation errors, and erroneous meta-data can be corrected [Wie13]. The DTA serves as a basis for a reference corpus of the historical New High German language and offers highly relevant primary sources for academic research in various disciplines in the humanities and sciences, as well as for legal scholars and economists.

3 Methods

This section describes the process architecture underlying the DTA's full-text search functionality. Section 3.1 briefly describes the preprocessing techniques used to prepare the corpus for indexing, and section 3.2 deals with the index itself. The query expansion strategy used for runtime term conflation is presented in section 3.3, and section 3.4 describes some accessibility-oriented extensions.

3.1 Corpus Preprocessing

In order to provide a powerful and flexible retrieval environment, the raw text corpus was subjected to an extensive automatic preprocessing phase before being passed to the low-level retrieval engine for indexing. In particular, corpus text was automatically tokenized into paragraph-, sentence-, and word-like units using the waste tokenizer [JW13], extinct historical spelling variants were mapped to "canonical" contemporary forms using both a finite lexicon of known forms and a robust generative canonicalization cascade within the DTA::CAB framework (http://www.deutschestextarchiv.de/demo/cab) [Jur13], and the returned canonical forms were passed to conventional software tools for morphological analysis [GH06], part-of-speech tagging [Jur03], lemmatization, and named-entity recognition [DD09].

3.2 Index Structure

The richly annotated corpus data was passed to the free, open-source DDC concordance tool (http://www.ddc-concordance.org) [Sok03] for indexing of selected document- and token-level attributes. In addition to document-level bibliographic meta-data fields such as title, author, publication date, and genre, DDC also allows each token to be associated with a fixed number of quasi-independent local attributes, Boolean conditions over which may be conjoined in runtime queries. In contrast to many conventional search architectures, the DTA corpus index uses not only a raw text string to represent a corpus token, but also includes the following token-level attributes:

Utf8Token (u) contains the raw token text encoded in UTF-8 [Uni13].

Token (w) contains a deterministic transliteration of the raw token text into that subset of the Latin alphabet used in contemporary German orthography. In the case of historical German, deterministic transliteration is especially useful for mapping the long-s character 'ſ' to a conventional round 's' and for mapping superscript 'e' to the conventional Umlaut diacritic '¨', as in the transliteration Abstaͤnde ↦ Abstände ("distances"). This attribute was used as the default for literal string-identity searches.
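The deterministic transliteration behind the Token attribute can be sketched as a longest-match character rewriting pass. The following Python sketch is a minimal illustrative stand-in, not the DTA's actual rule set; only a handful of example rules are shown:

```python
# Minimal sketch of a deterministic transliteration for the Token attribute:
# long s -> round s, combining superscript e -> Umlaut diacritic.
# The rule table is illustrative, NOT the DTA's actual rule set.
TRANSLIT = {
    "\u017f": "s",    # 'ſ' (long s) -> 's'
    "a\u0364": "ä",   # 'aͤ' (a + combining small e) -> 'ä'
    "o\u0364": "ö",
    "u\u0364": "ü",
}

def transliterate(token: str) -> str:
    """Apply the rules deterministically, longest match first."""
    rules = sorted(TRANSLIT, key=len, reverse=True)
    out, i = [], 0
    while i < len(token):
        for src in rules:
            if token.startswith(src, i):
                out.append(TRANSLIT[src])
                i += len(src)
                break
        else:
            out.append(token[i])
            i += 1
    return "".join(out)

print(transliterate("Ab\u017fta\u0364nde"))  # -> Abstände
```

Because the mapping is a pure function of the surface string, it can be applied identically at indexing time and at query time, so that literal searches over the Token attribute remain deterministic.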
CanonicalToken (v) contains the estimated "canonical" contemporary form for the current token as determined by the corpus preprocessing phase; e.g. Teil for the raw text Theyl ("part") or fragte for the raw text frug ("asked").

Pos (p) contains the part-of-speech (POS) tag automatically assigned to the source token by the moot part-of-speech tagger [Jur03] using the STTS tag-set [STT95].

Lemma (l) contains the lemma or "base form" assigned to the source token by the corpus preprocessing phase, taking into account both the POS tag and the analyses returned by the TAGH morphological analyzer [GH06], if any.

XPath (xpath) contains the "canonical" XPath to the deepest element node containing (the first character of) the current token in the original TEI source document.

Page (page) identifies the source facsimile, for administrative and cross-referencing purposes.

Line (lb) tracks the line number of the source token on the current page, for administrative and cross-referencing purposes.

A traditional inverted index is constructed for each attribute at corpus indexing time. Unlike conventional query interpreters supporting only document-level dependencies, however, the DDC runtime query interpreter ensures that dependencies in a given user query are resolved at the token level. For example, the query (@Böttcher WITH $p=NN) would retrieve all and only those instances of the literal string Böttcher annotated with the part-of-speech tag NN indicating a common noun ("cooper"), whereas (@Böttcher WITH $p=NE) would retrieve those instances tagged as proper names. A conventional query evaluation architecture, on the other hand, would only be capable of retrieving those documents containing some instance of the target word (Böttcher) and some instance of the target part-of-speech tag (NN or NE), regardless of whether or not the tag was assigned to the target word or to some other word in the document.

3.3 Runtime Query Expansion

Despite the rich annotations offered by the indexed corpus, the majority of actual searches are in fact single-term "bareword" queries; this dominance of simple bareword queries is not surprising, as it is well attested in the literature on generic web searching, e.g. [JSS00, SWJS01, WM07]. Of 29,410 total queries between September 2013 and January 2014, 15,977 (54.3%) were single-term bareword searches, 3,302 (11.2%) were phrases composed exclusively of bareword terms, and 9,219 (31.4%) were bareword queries of the 'Lemma' attribute, together accounting for 96.9% of user searches. In order to improve recall for such queries (especially from non-expert users, who cannot be expected to be familiar with the great diversity of spelling variants to be found in historical texts) while still retaining the flexibility of the multi-attribute DDC index, we extended the DDC query language to include user-defined term expansion pipelines with attribute-dependent defaults for both explicit and implicit runtime term conflation. The potential gain is substantial: [JA12] reported an improvement in type-wise recall from 55.7% to 95.7% for canonical-form queries vs. raw string-identity queries in an artificial retrieval task over a small test corpus of 18th- to 19th-century German text, corresponding to a token-wise recall improvement from 78.5% to 99.3%.

In addition to built-in term expanders for e.g. letter-case normalization or legacy rule-based stemming, we introduced a new extendable class of external term expanders accessed via HTTP, as well as a class for chains or "pipelines" of multiple expanders. Each expander x receives as input a finite set T of strings ("terms"; a bareword query is treated as a singleton set for the purposes of term expansion) and returns a finite set x(T) of "equivalent" strings, for some expander-dependent conflation relation ∼_x. The query interpreter evaluates an expanded query as it would any set-valued query, as the Boolean disjunction over all elements of the (expanded) set: ⟦x(T)⟧ = ⋃_{t ∈ x(T)} ⟦t⟧. Prototypically, ∼_x will be a true equivalence relation and x(T) will be a superset of T, so that literal matches to a user query will always be retrieved.

Each token attribute is associated with a default expansion pipeline, so that bareword queries can be assigned equivalence classes in an attribute-dependent manner: it would be counter-productive, for example, to attempt to analyze XPath attribute values as natural language text, whereas Token attribute values are expected to be historical word-forms and may be analyzed as such. The current DTA corpus index configuration defines the following term expanders, among others:

tolower Letter-case expander generating lowercase variants of its input.

toupper Letter-case expander generating uppercase variants of its input. This is the default expander for the Pos attribute.

case Letter-case expander generating upper-, lower-, and initial-uppercase variants of its input. This is the default expander for the Lemma attribute.

morphy Legacy rule-based stemming and re-inflection using Morphy [Lez00].

tagh TAGH-based lemmatization and re-inflection [GH06] via external server.

pho Phonetic equivalence via external DTA::CAB server.

rw Rewrite equivalence via external DTA::CAB server.

eqlemma TAGH-based best-lemma match using a pre-compiled index via external DTA::CAB server. This is the default expander for both the Token and Utf8Token attributes.

Of particular interest are the external CAB-based expanders such as pho, rw, and eqlemma. In order to function efficiently, the associated expansion servers must restrict the strings returned to those actually occurring in the corpus. Since each of the CAB-based expanders is an equivalence relation of the form f_a ∘ f_a⁻¹ for some function f_a on source tokens (e.g. phonetic-form or best-lemma), the bulk of the task can be accomplished during the corpus preprocessing phase by constructing a database mapping the image of the corpus under f_a to the associated surface types, i.e. an extensional inverse map f_a* : f_a[W] → ℘(W) : a ↦ f_a⁻¹(a) ∩ W for a source attribute f_a : W → A from corpus words W to some characteristic set of possible attribute values A. Runtime expansion can then be performed by analyzing each input term t with the function f_a and performing a simple lookup in the extensional database, setting x_a(T) = [f_a ∘ f_a*](T) = ⋃_{t ∈ T} f_a*(f_a(t)).

3.4 Accessibility Extensions

On their own, none of the innovations discussed above "challenge the paradigm of information access as being a single-shot search request submitted to a web search engine" (http://mindthegap2014.dai-labor.de/?page_id=8). On the contrary, the costly corpus preprocessing techniques, the indexing of multiple, partially redundant token attributes, and the use of implicit attribute-dependent default term expansion pipelines can be seen as workarounds for the overwhelming dominance of bareword searches from assumedly non-expert users.

In an attempt to promote user query-language literacy, an attribute-sensitive auto-completion widget was added to the prototype HTML search form (http://kaskade.dwds.de/dtaos). In the absence of a user-specified target attribute, the auto-completion procedure performs a simple prefix search of the Lemma attribute, incorporating the appropriate explicit syntax into the suggestions it returns. Assumedly, this suggestion strategy is largely responsible for the comparatively high ratio of explicit lemma searches (31.4%) we observed.

Additionally, we implemented a simple web-based GUI for visualization, debugging, and fine-tuning of the term expansion process (http://kaskade.dwds.de/dtaos/lizard). This so-called "query lizard" allows users not only to see the effects of changes in the expansion pipeline, but also to fine-tune the term sets actually queried by de-selecting undesirable target values such as miscanonicalizations, foreign-language material, etc. Unlike the auto-completion widget, the query lizard does not seem to have acquired a particularly wide user base: only 321 accesses were observed between September 2013 and January 2014.

4 Conclusion and Outlook

We have described a flexible architecture for full-text search in historical document collections, especially those exhibiting a high degree of spelling variation. By using a corpus preprocessing phase to annotate the source documents with linguistically salient features and incorporating these into the corpus index as quasi-independent token attributes, we were able to implement a query interpreter which robustly interprets naïve bareword queries as equivalence classes of historical spelling variants, while still retaining the full precision of a raw string index.

We are interested in performing a more thorough evaluation of the online term expansion strategy's utility for actual user searches, and in comparing our approach to alternative methods for approximate search in historical document collections, e.g. [Efr13]. We are currently engaged in the development of semantically motivated term expanders and visualizations using both induced distributional semantic models [BDO95, BL09] and the manually constructed lexical network GermaNet [KL02, LK07].

Acknowledgements

The current work was supported by Deutsche Forschungsgemeinschaft grant KL 337/12-2. We are grateful to our colleagues Alexander Geyken, Susanne Haaf, Matthias Schulz, and Kai Zimmer, and to this article's anonymous reviewers for their many helpful comments and suggestions.

References

[BCC10] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge, MA, 2010.

[BDO95] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, December 1995.

[BL09] Marco Baroni and Alessandro Lenci. One distributional memory, many semantic spaces. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, GEMS '09, pages 1–8, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[CLA12] CLARIN-D AP 5. CLARIN-D user guide, version 1.0.1. Technical report, Berlin-Brandenburgische Akademie der Wissenschaften, 19 December 2012.

[DD09] Jörg Didakowski and Marko Drotschmann. Proper noun recognition and classification using weighted finite state transducers. In Jakub Piskorski, Bruce W. Watson, and Anssi Yli-Jyrä, editors, Proceedings of FSMNLP 2008 (Ispra, Italy, 11–12 September 2008), volume 19 of Frontiers in Artificial Intelligence and Applications, pages 50–61. IOS Press, 2009.

[Efr13] Miles Efron. Query representation for cross-temporal information retrieval. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 383–392. ACM, 2013.

[EGF06] Andrea Ernst-Gerlach and Norbert Fuhr. Generating search term variants for text collections with historic spellings. In Mounia Lalmas, Andy MacFarlane, Stefan Rüger, Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky, editors, Advances in Information Retrieval, volume 3936 of Lecture Notes in Computer Science, pages 49–60. Springer, Berlin, 2006.

[GH06] Alexander Geyken and Thomas Hanneforth. TAGH: A complete morphology for German based on weighted finite state automata. In Finite State Methods and Natural Language Processing, 5th International Workshop, FSMNLP 2005, Revised Papers, volume 4002 of Lecture Notes in Computer Science, pages 55–66. Springer, Berlin, 2006.

[GNR+09] Annette Gotscharek, Andreas Neumann, Ulrich Reffle, Christoph Ringlstetter, and Klaus U. Schulz. Enabling information retrieval on historical document collections: the role of matching procedures and special lexica. In Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data, AND '09, pages 69–76. ACM, New York, 2009.

[HHL+07] Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, and Christiane Wanzeck. Information access to historical documents from the Early New High German period. In Proceedings of the IJCAI-07 Workshop on Analytics for Noisy Unstructured Text Data (AND-07), pages 147–154, 2007.

[JA12] Bryan Jurish and Henriette Ast. Using an alignment-based lexicon for canonicalization of historical text. In Proceedings of the International Conference Historical Corpora 2012, Frankfurt am Main, Germany, 6–9 December 2012.

[JSS00] Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing & Management, 36(2):207–227, 2000.

[Jur03] Bryan Jurish. A hybrid approach to part-of-speech tagging. Technical report, Project "Kollokationen im Wörterbuch", Berlin-Brandenburgische Akademie der Wissenschaften, Berlin, 2003.

[Jur12] Bryan Jurish. Finite-State Canonicalization Techniques for Historical German. PhD thesis, Universität Potsdam, January 2012.

[Jur13] Bryan Jurish. Canonicalizing the Deutsches Textarchiv. In Ingelore Hafemann, editor, Proceedings of Perspektiven einer corpusbasierten historischen Linguistik und Philologie (Berlin, 12–13 December 2011), volume 4 of Thesaurus Linguae Aegyptiae, Berlin, Germany, 2013.

[JW13] Bryan Jurish and Kay-Michael Würzner. Word and sentence tokenization with Hidden Markov Models. Journal for Language Technology and Computational Linguistics, 28(2):61–83, 2013.

[KL02] Claudia Kunze and Lothar Lemnitzer. GermaNet: representation, visualization, application. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC '02), pages 1485–1491, Las Palmas, Canary Islands, 2002.

[Knu73] Donald Knuth. The Art of Computer Programming, third edition. Addison-Wesley, Reading, MA, 1998 [1973].

[Lez00] Wolfgang Lezius. Morphy: German morphology, part-of-speech tagging and applications. In Proceedings of the 9th EURALEX International Congress, pages 619–623, 2000.

[LK07] Lothar Lemnitzer and Claudia Kunze. Computerlexikographie: Eine Einführung. Gunter Narr Verlag, Tübingen, 2007.

[Sok03] Alexey Sokirko. A technical overview of DWDS/Dialing Concordance. Talk delivered at the meeting Computational Linguistics and Intellectual Technologies, Protvino, Russia, 2003.

[STT95] Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung, and Universität Tübingen, Seminar für Sprachwissenschaft, 1995.

[SWJS01] Amanda Spink, Dietmar Wolfram, Bernard J. Jansen, and Tefko Saracevic. Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52:226–234, 2001.

[Uni13] Unicode Consortium. The Unicode Standard. The Unicode Consortium, Mountain View, CA, 2013.

[Wie13] Frank Wiegand. TEI/XML editing for everyone's needs. In TEI Members Meeting 2013 (poster session), Sapienza, Italy, 2–5 October 2013.

[WM07] Ryen W. White and Dan Morris. Investigating the querying and browsing behavior of advanced search engine users. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 255–262. ACM, 2007.
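The extensional-inverse construction described in section 3.3 can be made concrete with a toy sketch: build f_a* from a small corpus during "preprocessing", then expand query terms at "runtime" via x_a(T) = ⋃_{t ∈ T} f_a*(f_a(t)). The "phonetic" function f_a and the five-word corpus below are illustrative stand-ins, not DTA::CAB's actual phonetization or API:

```python
# Toy sketch of the extensional inverse map f_a* and runtime expansion
# x_a(T) = union over t in T of f_a*(f_a(t)).
# f_a is a crude "phonetic" stand-in (NOT DTA::CAB's phonetization);
# W is an illustrative five-word corpus.
from collections import defaultdict

def f_a(w: str) -> str:
    """Toy attribute function: map a word to a rough 'phonetic' key."""
    return w.lower().replace("th", "t").replace("y", "i")

# Preprocessing phase: map each attribute value in the image f_a[W]
# to the set of surface types actually occurring in the corpus.
W = ["Theil", "Teil", "teyl", "frug", "fragte"]
f_a_star = defaultdict(set)
for w in W:
    f_a_star[f_a(w)].add(w)

def expand(T):
    """Runtime expansion: look up each term's attribute value in the
    extensional database; unseen terms fall back to themselves."""
    return set().union(*(f_a_star.get(f_a(t), {t}) for t in T))

print(sorted(expand({"Teil"})))  # -> ['Teil', 'Theil', 'teyl']
```

Because the database only ever contains strings drawn from W, the expansion server automatically satisfies the restriction that returned strings actually occur in the corpus.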