Querying the Deutsches Textarchiv

Bryan Jurish, Christian Thomas, Frank Wiegand
Deutsches Textarchiv · Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstrasse 22/23 · 10117 Berlin · Germany
jurish|thomas|wiegand@bbaw.de

Abstract

Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for conventional search architectures, which typically rely on a static inverted index keyed by orthographic form. Additional steps must therefore be taken in order to improve recall, in particular for single-term bareword queries from non-expert users. This paper describes the query processing architecture currently employed for full-text search of the historical German document collection of the Deutsches Textarchiv project.

Copyright © 2014 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap'14 Workshop, Berlin, Germany, 4 March 2014, published at http://ceur-ws.org

1 Introduction

Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon keyed by orthographic form. Conventional search architectures, on the other hand, typically rely on a static inverted index [Knu73, BCC10] mapping each actually occurring surface string to a list of its locations, implicitly assuming that the source texts adhere to strict orthographic conventions. Since casual or non-expert users cannot be expected to be familiar with the many spelling variants to be found in historical document collections, and since the explicit enumeration of all possible variants can be a time-consuming and error-prone process even for language-historical experts, additional steps must be taken to improve recall [EGF06, HHL+07, GNR+09, Jur12, Efr13].

This paper describes the process architecture for full-text search in the historical German document collection of the Deutsches Textarchiv (DTA). Our approach makes use of an extensive corpus preprocessing phase to annotate the source texts with linguistically salient attributes such as "canonical" contemporary form, part-of-speech tag, and lemma. Building on the richly annotated corpus and a document index structure supporting multiple quasi-independent token-level attributes, naïve bareword searches are expanded into equivalence classes of historical spelling variants by a dedicated external expansion server.

The rest of this paper is organized as follows: section 2 describes the historical text corpus indexed by the DTA, section 3 describes the DTA query processing architecture in greater detail, and section 4 contains a conclusion and a brief description of work currently in progress.

2 Text Corpora

The Deutsches Textarchiv ("German Text Archive", http://www.deutschestextarchiv.de), a project funded by the Deutsche Forschungsgemeinschaft (DFG, "German Research Foundation") at the Language Research Center of the Berlin-Brandenburg Academy of Sciences and Humanities, provides a core corpus of more than 1300 significant German texts from various disciplines originally published between ca. 1600 and 1900. Due to the project's primary focus on the history of the German language, the full-text transcriptions document the original printed works, of which the earliest accessible edition was digitized. The transcriptions were acquired for the most part using the highly accurate double-keying method; optical character recognition (OCR) was used for only ca. 200 volumes, together with extensive manual pre-structuring and post-correction phases. The corpus as a whole therefore displays an exceptionally high accuracy, not only on the level of transcription but also on the annotation level.

The DTA core text sources are published via the Internet as digital facsimiles and as XML-annotated transcriptions together with comprehensive bibliographic meta-data. The annotation consistently follows the well-documented DTA "base format" (DTABf, http://www.deutschestextarchiv.de/doku/basisformat), a TEI subset developed for the representation of (historical) written corpora [CLA12]. As of January 2014, the DTA core corpus comprises 1301 digitized volumes (ca. 680M characters, 100M tokens).

In addition to the core corpus, the DTA currently includes 473 high-quality textual resources provided by cooperating projects or curated from existing text collections such as Wikisource and Project Gutenberg; these resources were integrated in the course of a BMBF-funded CLARIN-D "curation project" (for documentation and a list of the resources integrated, cf. http://www.deutschestextarchiv.de/clarin_kupro). Further additions include the Polytechnisches Journal (1820–1931; 370 volumes, 490M characters, 78M tokens); the original project's web page is http://dingler.culture.hu-berlin.de, and after DFG funding expired, the complete text base was integrated into the BBAW corpus infrastructure with an open DDC search wrapper for corpus queries at http://kaskade.dwds.de/dingleros. In total, the DTA and its extensions comprise approximately 1.2B characters in 195M tokens. In the context of a DFG-funded project, the existing OCR text of the journal Die Grenzboten (1841–1922, http://brema.suub.uni-bremen.de/grenzboten) is currently being structured according to the DTABf and automatically corrected on the character level. The resulting optimized text base will be integrated as an extension to the DTA corpora as well.

All corpus texts are available in DTAQ, a web-based platform for collaborative quality assurance. Within DTAQ, transcriptions can be proofread, and misprints, transcription or annotation errors, and erroneous meta-data can be corrected [Wie13]. The DTA serves as a basis for a reference corpus of the historical New High German language and offers highly relevant primary sources for academic research in various disciplines in the humanities and sciences, as well as for legal scholars and economists.

3 Methods

This section describes the process architecture underlying the DTA's full-text search functionality. Section 3.1 briefly describes the preprocessing techniques used to prepare the corpus for indexing, and section 3.2 deals with the index itself. The query expansion strategy used for runtime term conflation is presented in section 3.3, and section 3.4 describes some accessibility-oriented extensions.

3.1 Corpus Preprocessing

In order to provide a powerful and flexible retrieval environment, the raw text corpus was subjected to an extensive automatic preprocessing phase before being passed to the low-level retrieval engine for indexing. In particular, corpus text was automatically tokenized into paragraph-, sentence-, and word-like units using the waste tokenizer [JW13], extinct historical spelling variants were mapped to "canonical" contemporary forms using both a finite lexicon of known forms and a robust generative canonicalization cascade within the DTA::CAB framework (http://www.deutschestextarchiv.de/demo/cab) [Jur13], and the returned canonical forms were passed to conventional software tools for morphological analysis [GH06], part-of-speech tagging [Jur03], lemmatization, and named-entity recognition [DD09].

3.2 Index Structure

The richly annotated corpus data was passed to the free, open-source DDC concordance tool (http://www.ddc-concordance.org) [Sok03] for indexing of selected document- and token-level attributes. In addition to document-level bibliographic meta-data fields such as title, author, publication date, and genre, DDC also allows each token to be associated with a fixed number of quasi-independent local attributes, Boolean conditions over which may be conjoined in runtime queries. In contrast to many conventional search architectures, the DTA corpus index uses not only a raw text string to represent a corpus token, but also includes the following token-level attributes:

Utf8Token (u) contains the raw token text encoded in UTF-8 [Uni13].

Token (w) contains a deterministic transliteration of the raw token text into that subset of the Latin alphabet used in contemporary German orthography. In the case of historical German, deterministic transliteration is especially useful for mapping the long-s character 'ſ' to a conventional round 's' and for mapping superscript 'e' to the conventional Umlaut diacritic '¨', as in the transliteration Abstaͤnde ↦ Abstände ("distances"). This attribute was used as the default for literal string-identity searches.
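The deterministic transliteration behind the Token attribute can be sketched as a longest-match character rewriting pass. The following Python sketch is a minimal illustrative stand-in, not the DTA's actual rule set; only a handful of example rules are shown:

```python
# Minimal sketch of a deterministic transliteration for the Token attribute:
# long s -> round s, combining superscript e -> Umlaut diacritic.
# The rule table is illustrative, NOT the DTA's actual rule set.
TRANSLIT = {
    "\u017f": "s",    # 'ſ' (long s) -> 's'
    "a\u0364": "ä",   # 'aͤ' (a + combining small e) -> 'ä'
    "o\u0364": "ö",
    "u\u0364": "ü",
}

def transliterate(token: str) -> str:
    """Apply the rules deterministically, longest match first."""
    rules = sorted(TRANSLIT, key=len, reverse=True)
    out, i = [], 0
    while i < len(token):
        for src in rules:
            if token.startswith(src, i):
                out.append(TRANSLIT[src])
                i += len(src)
                break
        else:
            out.append(token[i])
            i += 1
    return "".join(out)

print(transliterate("Ab\u017fta\u0364nde"))  # -> Abstände
```

Because the mapping is a pure function of the surface string, it can be applied identically at indexing time and at query time, so that literal searches over the Token attribute remain deterministic.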
CanonicalToken (v) contains the estimated "canonical" contemporary form for the current token as determined by the corpus preprocessing phase; e.g. Teil for the raw text Theyl ("part") or fragte for the raw text frug ("asked").

Pos (p) contains the part-of-speech (POS) tag automatically assigned to the source token by the moot part-of-speech tagger [Jur03] using the STTS tag-set [STT95].

Lemma (l) contains the lemma or "base form" assigned to the source token by the corpus preprocessing phase, taking into account both the POS tag and the analyses returned by the TAGH morphological analyzer [GH06], if any.

XPath (xpath) contains the "canonical" XPath to the deepest element node containing (the first character of) the current token in the original TEI source document.

Page (page) identifies the source facsimile, for administrative and cross-referencing purposes.

Line (lb) tracks the line number of the source token on the current page, for administrative and cross-referencing purposes.

A traditional inverted index is constructed for each attribute at corpus indexing time. Unlike conventional query interpreters supporting only document-level dependencies, however, the DDC runtime query interpreter ensures that dependencies in a given user query are resolved at the token level. For example, the query (@Böttcher WITH $p=NN) would retrieve all and only those instances of the literal string Böttcher annotated with the part-of-speech tag NN indicating a common noun ("cooper"), whereas (@Böttcher WITH $p=NE) would retrieve those instances tagged as proper names. A conventional query evaluation architecture, on the other hand, would only be capable of retrieving those documents containing some instance of the target word (Böttcher) and some instance of the target part-of-speech tag (NN or NE), regardless of whether or not the tag was assigned to the target word or to some other word in the document.

3.3 Runtime Query Expansion

Despite the rich annotations offered by the indexed corpus, the majority of actual searches are in fact single-term "bareword" queries; this dominance of simple bareword queries is not surprising, as it is well attested in the literature on generic web searching, e.g. [JSS00, SWJS01, WM07]. Of 29,410 total queries between September 2013 and January 2014, 15,977 (54.3%) were single-term bareword searches, 3,302 (11.2%) were phrases composed exclusively of bareword terms, and 9,219 (31.4%) were bareword queries of the 'Lemma' attribute, together accounting for 96.9% of user searches. In order to improve recall for such queries (especially from non-expert users, who cannot be expected to be familiar with the great diversity of spelling variants to be found in historical texts) while still retaining the flexibility of the multi-attribute DDC index, we extended the DDC query language to include user-defined term expansion pipelines with attribute-dependent defaults for both explicit and implicit runtime term conflation. The potential gain is substantial: [JA12] reported an improvement in type-wise recall from 55.7% to 95.7% for canonical-form queries vs. raw string-identity queries in an artificial retrieval task over a small test corpus of 18th- to 19th-century German text, corresponding to a token-wise recall improvement from 78.5% to 99.3%.

In addition to built-in term expanders for e.g. letter-case normalization or legacy rule-based stemming, we introduced a new extendable class of external term expanders accessed via HTTP, as well as a class for chains or "pipelines" of multiple expanders. Each expander x receives as input a finite set T of strings ("terms"; a bareword query is treated as a singleton set for the purposes of term expansion) and returns a finite set x(T) of "equivalent" strings, for some expander-dependent conflation relation ∼_x. The query interpreter evaluates an expanded query as it would any set-valued query, as the Boolean disjunction over all elements of the (expanded) set: ⟦x(T)⟧ = ⋃_{t ∈ x(T)} ⟦t⟧. Prototypically, ∼_x will be a true equivalence relation and x(T) will be a superset of T, so that literal matches to a user query will always be retrieved.

Each token attribute is associated with a default expansion pipeline, so that bareword queries can be assigned equivalence classes in an attribute-dependent manner: it would be counter-productive, for example, to attempt to analyze XPath attribute values as natural language text, whereas Token attribute values are expected to be historical word-forms and may be analyzed as such. The current DTA corpus index configuration defines the following term expanders, among others:

tolower Letter-case expander generating lowercase variants of its input.

toupper Letter-case expander generating uppercase variants of its input. This is the default expander for the Pos attribute.

case Letter-case expander generating upper-, lower-, and initial-uppercase variants of its input. This is the default expander for the Lemma attribute.

morphy Legacy rule-based stemming and re-inflection using Morphy [Lez00].

tagh TAGH-based lemmatization and re-inflection [GH06] via external server.

pho Phonetic equivalence via external DTA::CAB server.

rw Rewrite equivalence via external DTA::CAB server.

eqlemma TAGH-based best-lemma match using a pre-compiled index via external DTA::CAB server. This is the default expander for both the Token and Utf8Token attributes.

Of particular interest are the external CAB-based expanders such as pho, rw, and eqlemma. In order to function efficiently, the associated expansion servers must restrict the strings returned to those actually occurring in the corpus. Since each of the CAB-based expanders is an equivalence relation of the form f_a ∘ f_a⁻¹ for some function f_a on source tokens (e.g. phonetic-form or best-lemma), the bulk of the task can be accomplished during the corpus preprocessing phase by constructing a database mapping the image of the corpus under f_a to the associated surface types, i.e. an extensional inverse map f_a* : f_a[W] → ℘(W) : a ↦ f_a⁻¹(a) ∩ W for a source attribute f_a : W → A from corpus words W to some characteristic set of possible attribute values A. Runtime expansion can then be performed by analyzing each input term t with the function f_a and performing a simple lookup in the extensional database, setting x_a(T) = [f_a ∘ f_a*](T) = ⋃_{t ∈ T} f_a*(f_a(t)).

3.4 Accessibility Extensions

On their own, none of the innovations discussed above "challenge the paradigm of information access as being a single-shot search request submitted to a web search engine" (http://mindthegap2014.dai-labor.de/?page_id=8). On the contrary, the costly corpus preprocessing techniques, the indexing of multiple, partially redundant token attributes, and the use of implicit attribute-dependent default term expansion pipelines can be seen as workarounds for the overwhelming dominance of bareword searches from assumedly non-expert users.

In an attempt to promote user query-language literacy, an attribute-sensitive auto-completion widget was added to the prototype HTML search form (http://kaskade.dwds.de/dtaos). In the absence of a user-specified target attribute, the auto-completion procedure performs a simple prefix search of the Lemma attribute, incorporating the appropriate explicit syntax into the suggestions it returns. Assumedly, this suggestion strategy is largely responsible for the comparatively high ratio of explicit lemma searches (31.4%) we observed.

Additionally, we implemented a simple web-based GUI for visualization, debugging, and fine-tuning of the term expansion process (http://kaskade.dwds.de/dtaos/lizard). This so-called "query lizard" allows users not only to see the effects of changes in the expansion pipeline, but also to fine-tune the term sets actually queried by de-selecting undesirable target values such as miscanonicalizations, foreign-language material, etc. Unlike the auto-completion widget, the query lizard does not seem to have acquired a particularly wide user base: only 321 accesses were observed between September 2013 and January 2014.

4 Conclusion and Outlook

We have described a flexible architecture for full-text search in historical document collections, especially those exhibiting a high degree of spelling variation. By using a corpus preprocessing phase to annotate the source documents with linguistically salient features and incorporating these into the corpus index as quasi-independent token attributes, we were able to implement a query interpreter which robustly interprets naïve bareword queries as equivalence classes of historical spelling variants, while still retaining the full precision of a raw string index.

We are interested in performing a more thorough evaluation of the online term expansion strategy's utility for actual user searches, and in comparing our approach to alternative methods for approximate search in historical document collections, e.g. [Efr13]. We are currently engaged in the development of semantically motivated term expanders and visualizations using both induced distributional semantic models [BDO95, BL09] and the manually constructed lexical network GermaNet [KL02, LK07].

Acknowledgements

The current work was supported by Deutsche Forschungsgemeinschaft grant KL 337/12-2. We are grateful to our colleagues Alexander Geyken, Susanne Haaf, Matthias Schulz, and Kai Zimmer, and to this article's anonymous reviewers for their many helpful comments and suggestions.

References

[BCC10] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge, MA, 2010.

[BDO95] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, December 1995.

[BL09] Marco Baroni and Alessandro Lenci. One distributional memory, many semantic spaces. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, GEMS '09, pages 1–8, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[CLA12] CLARIN-D AP 5. CLARIN-D user guide, version 1.0.1. Technical report, Berlin-Brandenburgische Akademie der Wissenschaften, 19 December 2012.

[DD09] Jörg Didakowski and Marko Drotschmann. Proper noun recognition and classification using weighted finite state transducers. In Jakub Piskorski, Bruce W. Watson, and Anssi Yli-Jyrä, editors, Proceedings of FSMNLP 2008 (Ispra, Italy, 11–12 September 2008), volume 19 of Frontiers in Artificial Intelligence and Applications, pages 50–61. IOS Press, 2009.

[Efr13] Miles Efron. Query representation for cross-temporal information retrieval. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 383–392. ACM, 2013.

[EGF06] Andrea Ernst-Gerlach and Norbert Fuhr. Generating search term variants for text collections with historic spellings. In Mounia Lalmas, Andy MacFarlane, Stefan Rüger, Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky, editors, Advances in Information Retrieval, volume 3936 of Lecture Notes in Computer Science, pages 49–60. Springer, Berlin, 2006.

[GH06] Alexander Geyken and Thomas Hanneforth. TAGH: A complete morphology for German based on weighted finite state automata. In Finite State Methods and Natural Language Processing, 5th International Workshop, FSMNLP 2005, Revised Papers, volume 4002 of Lecture Notes in Computer Science, pages 55–66. Springer, Berlin, 2006.

[GNR+09] Annette Gotscharek, Andreas Neumann, Ulrich Reffle, Christoph Ringlstetter, and Klaus U. Schulz. Enabling information retrieval on historical document collections: the role of matching procedures and special lexica. In Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data, AND '09, pages 69–76. ACM, New York, 2009.

[HHL+07] Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, and Christiane Wanzeck. Information access to historical documents from the Early New High German period. In Proceedings of the IJCAI-07 Workshop on Analytics for Noisy Unstructured Text Data (AND-07), pages 147–154, 2007.

[JA12] Bryan Jurish and Henriette Ast. Using an alignment-based lexicon for canonicalization of historical text. In Proceedings of the International Conference Historical Corpora 2012, Frankfurt am Main, Germany, 6–9 December 2012.

[JSS00] Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing & Management, 36(2):207–227, 2000.

[Jur03] Bryan Jurish. A hybrid approach to part-of-speech tagging. Technical report, Project "Kollokationen im Wörterbuch", Berlin-Brandenburgische Akademie der Wissenschaften, Berlin, 2003.

[Jur12] Bryan Jurish. Finite-State Canonicalization Techniques for Historical German. PhD thesis, Universität Potsdam, January 2012.

[Jur13] Bryan Jurish. Canonicalizing the Deutsches Textarchiv. In Ingelore Hafemann, editor, Proceedings of Perspektiven einer corpusbasierten historischen Linguistik und Philologie (Berlin, 12–13 December 2011), volume 4 of Thesaurus Linguae Aegyptiae, Berlin, Germany, 2013.

[JW13] Bryan Jurish and Kay-Michael Würzner. Word and sentence tokenization with Hidden Markov Models. Journal for Language Technology and Computational Linguistics, 28(2):61–83, 2013.

[KL02] Claudia Kunze and Lothar Lemnitzer. GermaNet: representation, visualization, application. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC '02), pages 1485–1491, Las Palmas, Canary Islands, 2002.

[Knu73] Donald Knuth. The Art of Computer Programming, third edition. Addison-Wesley, Reading, MA, 1998 [1973].

[Lez00] Wolfgang Lezius. Morphy: German morphology, part-of-speech tagging and applications. In Proceedings of the 9th EURALEX International Congress, pages 619–623, 2000.

[LK07] Lothar Lemnitzer and Claudia Kunze. Computerlexikographie: Eine Einführung. Gunter Narr Verlag, Tübingen, 2007.

[Sok03] Alexey Sokirko. A technical overview of DWDS/Dialing Concordance. Talk delivered at the meeting Computational Linguistics and Intellectual Technologies, Protvino, Russia, 2003.

[STT95] Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung, and Universität Tübingen, Seminar für Sprachwissenschaft, 1995.

[SWJS01] Amanda Spink, Dietmar Wolfram, Bernard J. Jansen, and Tefko Saracevic. Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52:226–234, 2001.

[Uni13] Unicode Consortium. The Unicode Standard. The Unicode Consortium, Mountain View, CA, 2013.

[Wie13] Frank Wiegand. TEI/XML editing for everyone's needs. In TEI Members Meeting 2013 (poster session), Sapienza, Italy, 2–5 October 2013.

[WM07] Ryen W. White and Dan Morris. Investigating the querying and browsing behavior of advanced search engine users. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 255–262. ACM, 2007.
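The extensional-inverse construction described in section 3.3 can be made concrete with a toy sketch: build f_a* from a small corpus during "preprocessing", then expand query terms at "runtime" via x_a(T) = ⋃_{t ∈ T} f_a*(f_a(t)). The "phonetic" function f_a and the five-word corpus below are illustrative stand-ins, not DTA::CAB's actual phonetization or API:

```python
# Toy sketch of the extensional inverse map f_a* and runtime expansion
# x_a(T) = union over t in T of f_a*(f_a(t)).
# f_a is a crude "phonetic" stand-in (NOT DTA::CAB's phonetization);
# W is an illustrative five-word corpus.
from collections import defaultdict

def f_a(w: str) -> str:
    """Toy attribute function: map a word to a rough 'phonetic' key."""
    return w.lower().replace("th", "t").replace("y", "i")

# Preprocessing phase: map each attribute value in the image f_a[W]
# to the set of surface types actually occurring in the corpus.
W = ["Theil", "Teil", "teyl", "frug", "fragte"]
f_a_star = defaultdict(set)
for w in W:
    f_a_star[f_a(w)].add(w)

def expand(T):
    """Runtime expansion: look up each term's attribute value in the
    extensional database; unseen terms fall back to themselves."""
    return set().union(*(f_a_star.get(f_a(t), {t}) for t in T))

print(sorted(expand({"Teil"})))  # -> ['Teil', 'Theil', 'teyl']
```

Because the database only ever contains strings drawn from W, the expansion server automatically satisfies the restriction that returned strings actually occur in the corpus.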