=Paper=
{{Paper
|id=Vol-1131/mindthegap14_6
|storemode=property
|title=Modeling Patterns in Written Natural Language Questions to Archives
|pdfUrl=https://ceur-ws.org/Vol-1131/mindthegap14_6.pdf
|volume=Vol-1131
|dblpUrl=https://dblp.org/rec/conf/iconference/Hennicke14
}}
==Modeling Patterns in Written Natural Language Questions to Archives==
Steffen Hennicke, Humboldt-Universität zu Berlin, Berlin School of Library and Information Science, Germany, steffen.hennicke@ibi.hu-berlin.de

Copyright © 2014 for the individual papers by the paper’s authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap’14 Workshop, Berlin, Germany, 4-March-2014, published at http://ceur-ws.org

===Abstract===
This short paper is part of an ongoing dissertation project and introduces the idea of creating an ontological model, the Archival Knowledge Model (AKM), of common patterns found in written natural language questions to archives. Such an ontological model can be used to analyze and query archival knowledge bases in order to provide more adequate answers and to enable more relevant discovery facilities. For this purpose, written reference questions to the German Federal Archive, the Bundesarchiv, are being analyzed, and the patterns found are translated to the CIDOC CRM and appropriate extensions.

===1 Introduction===
Archives hold enormous information potential [MH01], which is meant to be explored and accessed through archival aids as well as the expertise of archivists. Although the conceptualization of these descriptive tools is based on elaborate and historically grown archival principles and models, their design is less informed by explicit knowledge about the information needs of archival users [Cox08]. Digital representations of these archival aids typically emulate the original descriptive structures and render a vast amount of information implicit. At the same time, search facilities are mostly simple search interfaces which only allow keyword-based searches and return plain lists of matches.

Research shows that such search and retrieval systems do not properly serve the users. One of the pivotal reasons is a prevailing lack of qualitative in-depth analysis of archival user needs [Cra03, Sin10], which would make it possible to analyze existing archival knowledge bases and to improve digital archival information systems [And04]. This requires, however, adequate ontological and formal representations of the user needs towards archives.

The aim of the study<ref>An extended version of this paper can be found in the preliminary proceedings of the CRMEX workshop (http://www.ontotext.com/CRMEX).</ref> is to give empirical insight into the nature of user inquiries to archives and to investigate how patterns of inquiries can be reasonably represented in an ontological model in order to produce adequate answers. Such reasonable ontological representations of the research interest of the users as queries against an archival target world contribute to the creation of better documentation structures and better query facilities for archival information systems, for example, pattern-based [DKP00] query mechanisms which would go beyond plain keyword searches.

In this paper, an overview of the research data and the methodology is given, and the draft of one pattern, the Documentation-Activity, is introduced. A brief example will demonstrate how existing EAD-encoded archival data can be represented using this pattern.

===2 Research Data===
The term reference question refers to a request of a user to a staff member of a library or archive for information or assistance regarding the provision of any kind of information. Such a request can either be posed in person at a reference desk or remotely by phone, mail, or e-mail.
In this study, only written reference questions by mail or e-mail are being analyzed.

Archival reference questions capture an important phase of research: expressing and formulating the wanted information or research interest as explicitly as possible by providing contextual information for another person. This kind of empirical research data contains a largely unfiltered information need of the user in his own words [DJ01], which constitutes a significant advantage over other methods of data collection like interviews or observation in existing information systems through, for example, log files, both of which elicit data biased by the interviewee or the preconditions of the information system.

Research data has been collected from the Federal Archives of Germany, the Bundesarchiv.<ref>A second, similar sample will be collected from the Norwegian National Archive.</ref> As a state archive, the Federal Archives are responsible for the permanent preservation and accessibility of federal archival documents such as files, papers, cartographic records, pictures, posters, films, sound recordings and machine-readable data.

User files hold physical copies and print-outs of letters or e-mails sent to the Bundesarchiv. The user files and the inquiries analyzed share a general historical and topical horizon, which is Contemporary German History, understood as the history of the 19th and 20th centuries. Altogether, 236 user files have been selected. Of these 236 initially selected user files, 100 were available, of which 60 contained at least one explicit or implicit information request as part of an inquiry by e-mail or letter. From these 60 user files, 546 single questions have been manually extracted based on the methodology outlined in the next section.

===3 Methodological Approach===
Archival reference questions have been largely neglected as research data. The study of Duff and Johnson [DJ01] is one of the few which looks at the type and structure of user reference questions. The study focuses on the types of questions and the types of elements used to contextualize the wanted information. Here, Duff and Johnson adapt a methodology for analyzing library reference questions based on the work by Grogan [Gro92] and Jahoda and Braunagel [JB80].

However, Duff and Johnson mainly focus on the Aussageform of the inquiries from a mostly archival point of view: First, they categorize the inquiries according to the type of question, for example, material-finding, fact-finding, or service request. Secondly, they systematize given and wanted information: The wanted information may be, for example, biographical information, location of a document, or general background information; the given information contextualizes the wanted information by, for example, proper names, place names, or a date.

The current study goes a step further and focuses on the Erkenntnisform of the inquiries, their epistemological form: The wanted information is interpreted regarding the research interest from a user point of view in order to describe reality in a way that fits the perceived epistemological interest of the user and his question. This ultimately means that the wanted information is determined more precisely by contextualizing it through explicit relations to the appropriate historical background as described by the given information. Through reasonable abstractions, the research interest is further generalized to common universals [MBG+03, p. 8], i.e. generic relations and classes which have variations of themselves (e.g. human being) as opposed to particulars which have no variations of themselves (e.g. Fritz).

Regarding epistemological issues of the interpretation itself in relation to historical sciences or theory of history, the approach to interpretation taken here understands itself as meta-theoretical, similar to Gardin [Gar02] in the domain of archeology. The approach is agnostic to specific types of historical sciences but reflects patterns which can be considered applicable to general historical inquiry, for example, the pivotal role of actors and events and, in close relation to the archival target domain, the role of mostly written traces in the archives as evidence or source of information for historical investigations.

The CIDOC CRM [DOS07, DI08] is an ontological model which has been chosen as the means to formalize the results from the interpretations. One of the most important design principles of the CIDOC CRM is to represent the past as discrete events. Material and immaterial persistent items are present at events either as a concept or via a physical information carrier. History, therefore, is conceptualized as meetings of persistent items through events in space-time. Historical facts are described in terms of relations between universals. Since the model has been developed bottom-up from the analysis of a broad range of diverse cultural heritage ontologies, it has a strong empirical background and can be expected to be a suitable compromise between historical and archival conceptualizations.

This study adopts the methodology of the CIDOC CRM and tests whether it partially or completely covers this hypothetical ontology.

===4 Documentation Activity pattern===
Preliminary results show that research interests found in inquiries can be reasonably represented as general patterns using CIDOC CRM. The Documentation-Activity pattern appears to be one of the most significant ones.

This pattern is the result of the interpretation of a broad range of inquiries and represents research interests targeted at documents which are the result of an activity<ref>In CIDOC CRM, E7 Activity is a subclass of E5 Event.</ref> which documents events or, more specifically, observes the activities of people or groups: For example, the members of a parliamentary committee document their meetings through minutes, or a secret agency observes the activities of a person through surveillance and generates a report.

The following question is a simple example for the interpretative analysis and formal representation of the research interest of an inquiry with CIDOC CRM.<ref>Note that the inquiry has been translated from German to English by the author of this paper.</ref> The context given in the inquiry is: “One source I would like to consult are the police- and surveillance reports for the Weimar Republic which are about revolutionary movements. I would like to know what the surveillance agency of the Reich (or the ones of the Länder) had to say about [person name].”<ref>The name of the person referred to has been rendered anonymous.</ref>

The question asked in the inquiry reads: “Do you know if the Bundesarchiv holds such documents?”

The first interpretation step asks if there are probable and adequate answers to the question with regard to the domain of historical inquiry but also to the archival domain. Here, the user is looking for reports which are the result of a policing or surveillance activity targeted at a specific type of group (“revolutionary movements”) or at a specific person (“[person name]”). In that way, this question could even be seen as a two-fold question. The results of these policing or surveillance activities are documents about the activities of the aforementioned actors. Such documents are routinely products of a governmental institution and are now stored in an archive. The user wants to know if such documents are available in the Bundesarchiv. Therefore, the information the user wants is a set of pointers to appropriate documents, for example, call numbers of files likely to contain relevant documents.

The second interpretation step comprises the translation of the question, its context and its interpretation to the CIDOC CRM. The two-fold question can be represented as shown in figure 1. This is a simplified representation expressing the formal basic structure of an answer adequate to satisfy the wanted information or the research interest.<ref>The implicit question for pointers to documents, for example, a set of call numbers, is not the point when translating to CIDOC CRM but the context of the documents of interest. Identification for retrieving the actual physical document is not in the scope of this ontological model.</ref> The interpretation of the question is evident and materialized by the documentation activity<ref>An extension to the CIDOC CRM currently deemed necessary.</ref> in the center of the figure. The documentation activity is seen as being implicit in the historical reality referred to in the question: The police- and surveillance reports have been created during an event, or a series of events, which “documented” some other events and which are qualified by the participation of an actor (“[person name]”) or a specific type of group (“revolutionary movements”). The documentation activity follows a mandate which captures a specific type of “documented plans (...) for deliberate human activities” [CDG+11, p. 15].

Figure 1: The example inquiry represented in the Documentation-Activity pattern.

Most importantly, mandates<ref>This class is another proposed extension to the CIDOC CRM.</ref> specify or govern documentation activities. In the case of the two-fold question the mandate has a specific type of group as its principal target and at the same time aims at a specific actor. Furthermore, the mandate is assigned to an actor, in this case an institution, who carries out the actual documentation activity which, as the last relevant contextual information, falls within the historical period of the Weimar Republic. Documents which are the result of this constellation are relevant documents and may adequately answer the user’s two-fold question.

This brief example demonstrates how the research interest of inquiries can be formally represented in an abstract ontological model. The next section will show how such a pattern could be instantiated with empirical data from a digital archival aid.

===5 AKM and EAD===
The Archival Knowledge Model (AKM) comprises a set of patterns such as the Documentation-Activity. As a Conceptual Reference Model it can be used to analyze and to query archival knowledge bases. Tzompanaki and Doerr [TD12] show how large and complex semantic networks may be queried using CIDOC CRM. Especially in cases where relevant documents can be expected to be distributed among records or holdings, such patterns would provide relevant access points and contexts to retrieve documents.

Here, a brief example shall demonstrate how archival finding aids encoded with EAD could be analyzed as to whether they provide sufficient implicit or explicit information to adequately answer typical user queries.

The Encoded Archival Description<ref>http://www.loc.gov/ead/</ref> (EAD) standard is the de facto standard for the digital encoding of archival aids. One of the essential information entities in a finding aid encoded in EAD is the unittitle element, which typically holds the “name of the described materials”<ref>http://www.loc.gov/ead/tglib/elements/unittitle.html</ref> at any level of the descriptive tree.

The following XML snippet is taken from the existing EAD finding aid Roter Koffer<ref>“Roter Koffer” translates to “Red Suitcase”. For background information on this holding see: http://www.bstu.bund.de/DE/Wissen/Aktenfunde/Roter-Koffer/roter-koffer_inhalt.html</ref> from the Bundesarchiv. In this case it represents a quite informative yet typical entry in an archival finding aid giving the title of a file: <unittitle>Vernehmungsprotokoll Sarah Fodorova vom 9. Nov. 1936</unittitle>. This unittitle contains a lot of implicit information: There has been an interrogation (Vernehmung) of a person named Sarah Fodorova on 9 November 1936 which has been documented by minutes (Vernehmungsprotokoll) which are now stored in the file.

Figure 2 shows an exemplary instantiation (of parts) of the Documentation-Activity pattern with the information from the unittitle. In this representation the information is explicit and formalized according to a pattern which is relevant to a broad range of information needs of typical user inquiries.

Figure 2: The information from the unittitle represented explicitly.

The example also shows that even though the AKM may seem complex, sufficient semantics can be expected to exist in literal information values. The patterns documented in the AKM are evidently implementable by data structures improved accordingly.

Lastly, the intellectual work for the archivist when creating the title remains the same when he serves the seemingly more complex pattern.<ref>The “mechanical” effort might differ in that it is quick and easy to simply type in a literal text. However, this is a question of implementation and of proper tool design for the creation of archival aids.</ref> On the contrary, his intellectual work is preserved in a relevant and explicit representation while it would be lost in a plain literal text.

===6 Conclusion===
In terms of its research data and methodological approach the research introduced in this paper appears to be rare among studies of the information behavior of archival users. The study and its research data are empirical in nature; however, the employed methodology has a strong interpretative approach. Archival reference questions are research data which are difficult to obtain and analyze; however, the interpretative analysis and formalization of written natural language questions from users to archives, as this paper has tried to demonstrate, constitute a valuable source for obtaining meaningful data on original user needs. Only if we gain a significant and deeper understanding of and consensus on archival user needs in general will we be able to build a new generation of more sophisticated pattern-oriented (archival) information systems for the (archival) users.

===References===
* [And04] Ian G. Anderson. Are you being served? Historians and the search for primary sources. Archivaria, (58), 2004.
* [CDG+11] Nick Crofts, Martin Doerr, Tony Gill, Stephen Stead, Matthew Stiff, and ICOM/CIDOC CRM Special Interest Group. Definition of the CIDOC conceptual reference model (version 5.0.4): Produced by the ICOM/CIDOC documentation standards group, continued by the CIDOC CRM special interest group, 2011.
* [Cox08] Richard Cox. Revisiting the archival finding aid. Journal of Archival Organization, 5(4), 2008.
* [Cra03] Barbara Craig. Perimeters with fences? Or thresholds with doors? Two views of a border. American Archivist, 66(1), 2003.
* [DI08] Martin Doerr and Dolores Iorizzo. The dream of a global knowledge network: A new approach. Journal on Computing and Cultural Heritage, 1(1), 2008.
* [DJ01] Wendy M. Duff and Catherine A. Johnson. A virtual expression of need: An analysis of e-mail reference questions. American Archivist, 64(1):43–60, 2001.
* [DKP00] Garett O. Dworman, Steven O. Kimbrough, and Chuck Patch. On pattern-directed search of archives and collections. Journal of the American Society for Information Science, 51(1), 2000.
* [DOS07] Martin Doerr, Christian-Emil Ore, and Stephen Stead. The CIDOC conceptual reference model: A new standard for knowledge sharing. ER2007 tutorial. Challenges in Conceptual Modelling: Tutorials, posters, panels and industrial contributions at the 26th International Conference on Conceptual Modeling, ER 2007, Auckland, New Zealand, November 5-9, 2007, 83, 2007.
* [Gar02] Jean-Claude Gardin. Archaeological discourse, conceptual modelling and digitalisation: An interim report of the logicist program. The Digital Heritage of Archaeology: Computer Applications and Quantitative Methods in Archaeology, Proceedings of the 30th Conference, Heraklion, Crete, April 2002, CAA 2002, 2002.
* [Gro92] Denis Grogan. Practical Reference Work. Library Association Publishing, London, 2nd edition, 1992.
* [JB80] Gerald Jahoda and Judith Schiek Braunagel. The Librarian and Reference Queries: A Systematic Approach. Library and information science. Academic Press, New York, 1980.
* [MBG+03] Claudio Masolo, Stefano Borgo, Nicola Guarino, Alessandro Oltramari, and Luc Schneider. WonderWeb deliverable D17. The WonderWeb library of foundational ontologies. Preliminary report. Deliverable D17, May 2003.
* [MH01] Angelika Menne-Haritz. Access: The reformulation of an archival paradigm. Archival Science, 1, 2001.
* [Sin10] Donghee Sinn. Room for archives? Use of archival materials in No Gun Ri research. Archival Science, 10(2), 2010.
* [TD12] Katerina Tzompanaki and Martin Doerr. A new framework for querying semantic networks. San Diego, 2012.
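The instantiation described in Section 5, where the title of the Roter Koffer file is made explicit as a documentation activity, can be illustrated with a minimal Python sketch. It is not the AKM itself: the node identifiers and the property labels (<code>documented</code>, <code>had_participant</code>, <code>produced</code>) are hypothetical stand-ins, since the paper does not list the actual CIDOC CRM properties or extension classes used in Figure 2. The regular expression also assumes that a title has the shape "document type, person name, vom, date", which holds for the quoted example but not for finding aids in general.

```python
import re
import xml.etree.ElementTree as ET

# EAD title entry quoted in Section 5 of the paper.
UNITTITLE_XML = (
    "<unittitle>Vernehmungsprotokoll Sarah Fodorova vom 9. Nov. 1936</unittitle>"
)

# Illustrative mapping from a document type to the kind of event it documents.
# Only the pair discussed in the paper is included; this is an assumption.
DOCUMENTED_EVENT_TYPE = {"Vernehmungsprotokoll": "Vernehmung"}


def instantiate(unittitle_xml):
    """Turn one unittitle into explicit (subject, property, object) triples.

    Property names are hypothetical labels, not official CIDOC CRM properties.
    Returns an empty list when the title does not match the assumed shape.
    """
    title = (ET.fromstring(unittitle_xml).text or "").strip()
    match = re.fullmatch(
        r"(?P<doctype>\S+)\s+(?P<person>.+?)\s+vom\s+(?P<date>.+)", title
    )
    if match is None:
        return []
    event_type = DOCUMENTED_EVENT_TYPE.get(match["doctype"], "Event")
    return [
        ("activity_1", "type", "DocumentationActivity"),
        ("activity_1", "documented", "event_1"),
        ("activity_1", "produced", "document_1"),
        ("event_1", "type", event_type),
        ("event_1", "had_participant", match["person"]),
        ("event_1", "date", match["date"]),
        ("document_1", "type", match["doctype"]),
    ]


def documents_about(triples, person):
    """Find documents produced by an activity that documented an event
    in which the given person participated."""
    events = {s for s, p, o in triples if p == "had_participant" and o == person}
    activities = {s for s, p, o in triples if p == "documented" and o in events}
    return sorted(o for s, p, o in triples if p == "produced" and s in activities)


if __name__ == "__main__":
    triples = instantiate(UNITTITLE_XML)
    for triple in triples:
        print(triple)
    print(documents_about(triples, "Sarah Fodorova"))
```

In this sketch a query for a person is answered by following the explicit relations from participant to event to documentation activity to document, rather than by matching the name against the literal title string. This is the kind of pattern-based access the paper argues for; a real implementation would use the CIDOC CRM classes and the proposed extensions instead of the ad hoc labels above.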