=Paper= {{Paper |id=Vol-1131/mindthegap14_6 |storemode=property |title=Modeling Patterns in Written Natural Language Questions to Archives |pdfUrl=https://ceur-ws.org/Vol-1131/mindthegap14_6.pdf |volume=Vol-1131 |dblpUrl=https://dblp.org/rec/conf/iconference/Hennicke14 }} ==Modeling Patterns in Written Natural Language Questions to Archives== https://ceur-ws.org/Vol-1131/mindthegap14_6.pdf
         Modeling Patterns in Written Natural Language
                     Questions to Archives

                                               Steffen Hennicke
                                       Humboldt-Universität zu Berlin
                              Berlin School of Library and Information Science
                                                   Germany
                                      steffen.hennicke@ibi.hu-berlin.de



                       Abstract

    This short paper is part of an ongoing dissertation project and
    introduces the idea of creating an ontological model – the Archival
    Knowledge Model (AKM) – of common patterns found in written natural
    language questions to archives. Such an ontological model can be used
    to analyze and query archival knowledge bases in order to provide more
    adequate answers and to enable more relevant discovery facilities. For
    this purpose, written reference questions to the German Federal
    Archive, the Bundesarchiv, are being analyzed and the patterns found
    are translated to the CIDOC CRM and appropriate extensions.

Copyright © 2014 for the individual papers by the paper’s authors. Copying
permitted for private and academic purposes. This volume is published and
copyrighted by its editors.
In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the
MindTheGap’14 Workshop, Berlin, Germany, 4-March-2014, published at
http://ceur-ws.org

1    Introduction

Archives hold enormous information potential [MH01] which is meant to be
explored and accessed through archival aids as well as the expertise of
archivists. Although the conceptualization of these descriptive tools is
based on elaborate and historically grown archival principles and models,
their design is less informed by explicit knowledge about the information
needs of archival users [Cox08]. Digital representations of these archival
aids typically emulate the original descriptive structures and render a
vast amount of information implicit. At the same time, search facilities
are mostly simple search interfaces which only allow keyword-based searches
and return plain lists of matches.
   Research shows that such search and retrieval systems do not properly
serve the users. One of the pivotal reasons is a prevailing lack of
qualitative in-depth analysis of archival user needs [Cra03, Sin10] which
would make it possible to analyze existing archival knowledge bases and to
improve digital archival information systems [And04]. This requires,
however, adequate ontological and formal representations of the needs that
users bring to archives.
   The aim of the study¹ is to give empirical insight into the nature of
user inquiries to archives and to investigate how patterns of inquiries can
be reasonably represented in an ontological model in order to produce
adequate answers. Such reasonable ontological representations of the
research interest of the users as queries against an archival target world
contribute to the creation of better documentation structures and better
query facilities for archival information systems, for example,
pattern-based [DKP00] query mechanisms which would go beyond plain keyword
searches.

    ¹ An extended version of this paper can be found in the preliminary
proceedings of the CRMEX workshop (http://www.ontotext.com/CRMEX).

   In this paper, an overview of the research data and the methodology is
given and the draft of one pattern, the Documentation-Activity, is
introduced. A brief example will demonstrate how existing EAD-encoded
archival data can be represented using this pattern.

2    Research Data

The term reference question refers to a request of a user to a staff member
of a library or archive for information or assistance regarding the
provision of any kind of information. Such a request can either be posed in
person at a reference desk or remotely by
phone, mail, or e-mail. In this study, only written reference questions by
mail or e-mail are being analyzed.
   Archival reference questions capture an important phase of research:
Expressing and formulating the wanted information or research interest as
explicitly as possible by providing contextual information for another
person. This kind of empirical research data contains a largely unfiltered
information need of the user in his own words [DJ01], which constitutes a
significant advantage over other methods of data collection like interviews
or observation in existing information systems through, for example, log
files, both of which elicit data biased by the interviewee or the
preconditions of the information system.
   Research data has been collected from the Federal Archives of Germany,
the Bundesarchiv.² As a state archive, the Federal Archives are responsible
for the permanent preservation and accessibility of federal archival
documents such as files, papers, cartographic records, pictures, posters,
films, sound recordings and machine-readable data.

    ² A second, similar sample will be collected from the Norwegian
National Archive.

   User files hold physical copies and print-outs of letters or e-mails
sent to the Bundesarchiv. The user files and the inquiries analyzed share a
general historical and topical horizon which is Contemporary German
History, understood as the history of the 19th and 20th century.
Altogether, 236 user files have been selected. Of these 236 initially
selected user files, 100 were available, of which 60 contained at least one
explicit or implicit information request as part of an inquiry by e-mail or
letter. From these 60 user files, 546 single questions have been manually
extracted based on the methodology outlined in the next section.

3     Methodological Approach

Archival reference questions have been largely neglected as research data.
The study of Duff and Johnson [DJ01] is one of the few which looks at the
type and structure of user reference questions. The study focuses on the
types of questions and the types of elements used to contextualize the
wanted information. Here, Duff and Johnson adapt a methodology for
analyzing library reference questions based on the work by Grogan [Gro92]
and Jahoda and Braunagel [JB80].
   However, Duff and Johnson mainly focus on the Aussageform (the form of
the statement) of the inquiries from a mostly archival point of view:
First, they categorize the inquiries according to the type of question, for
example, material-finding, fact-finding, or service request. Secondly, they
systematize given and wanted information: The wanted information may be,
for example, biographical information, location of a document, or general
background information; the given information contextualizes the wanted
information by, for example, proper names, place names, or a date.
   The current study goes a step further and focuses on the Erkenntnisform
of the inquiries, their epistemological form: The wanted information is
interpreted regarding the research interest from a user point of view in
order to describe reality in a way that fits the perceived epistemological
interest of the user and his question. This ultimately means that the
wanted information is determined more precisely by contextualizing it
through explicit relations to the appropriate historical background as
described by the given information. Through reasonable abstractions, the
research interest is further generalized to common universals [MBG+03,
p. 8], i.e. generic relations and classes which have variations of
themselves (e.g. human being) as opposed to particulars which have no
variations of themselves (e.g. Fritz).
   Regarding epistemological issues of the interpretation itself in
relation to historical sciences or theory of history, the approach to
interpretation taken here understands itself as meta-theoretical, similar
to Gardin [Gar02] in the domain of archeology. The approach is agnostic to
specific types of historical sciences but reflects patterns which can be
considered applicable to general historical inquiry, for example, the
pivotal role of actors and events and, in close relation to the archival
target domain, the role of mostly written traces in the archives as
evidence or source of information for historical investigations.
   The CIDOC CRM [DOS07, DI08] is an ontological model which has been
chosen as the means to formalize the results from the interpretations. One
of the most important design principles of the CIDOC CRM is to represent
the past as discrete events. Material and immaterial persistent items are
present at events either as a concept or via a physical information
carrier. History, therefore, is conceptualized as meetings of persistent
items through events in space-time. Historical facts are described in terms
of relations between universals. Since the model has been developed
bottom-up from the analysis of a broad range of diverse cultural heritage
ontologies, it has a strong empirical background and can be expected to be
a suitable compromise between historical and archival conceptualizations.
   This study adopts the methodology of the CIDOC CRM and tests whether the
model either partially or completely covers this hypothetical ontology.
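The event-centric view described for the CIDOC CRM can be illustrated with a minimal sketch. It is not taken from the study: the Python class names, the field names and the sample data are assumptions made for illustration, and they only loosely mirror the CIDOC CRM classes E77 Persistent Item and E5 Event and the properties named in the comments.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PersistentItem:
    """Illustrative stand-in for CIDOC CRM E77 Persistent Item."""
    label: str


@dataclass
class Event:
    """Illustrative stand-in for CIDOC CRM E5 Event."""
    label: str
    place: str                   # loosely: P7 took place at
    time_span: tuple[int, int]   # (start year, end year), loosely: P4 has time-span
    present: set[PersistentItem] = field(default_factory=set)  # loosely: P12


def co_present(events, a, b):
    """Return the events at which both items were present, i.e. where they 'met'."""
    return [e for e in events if a in e.present and b in e.present]


# Hypothetical sample data, invented for this sketch only.
fritz = PersistentItem("Fritz")
report = PersistentItem("surveillance report")
committee = PersistentItem("committee")

events = [
    Event("surveillance", "Berlin", (1923, 1923), {fritz, report}),
    Event("committee meeting", "Berlin", (1924, 1924), {committee}),
]

print([e.label for e in co_present(events, fritz, report)])
```

In this sketch a historical fact is never stored as a direct link between two items; it is only recoverable through an event at which both were present, which is the design principle the paragraph above attributes to the CIDOC CRM.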
4     Documentation Activity pattern

Preliminary results show that research interests found in inquiries can be
reasonably represented as general patterns using CIDOC CRM. The
Documentation-Activity pattern appears to be one of the most significant
ones.
   This pattern is the result of the interpretation of a broad range of
inquiries and represents research interests targeted at documents which are
the result of an activity³ which documents events or, more specifically,
observes the activities of people or groups: For example, the members of a
parliamentary committee document their meetings through minutes, or a
secret agency observes the activities of a person through surveillance and
generates a report.
   The following question is a simple example for the interpretative
analysis and formal representation of the research interest of an inquiry
with CIDOC CRM.⁴
   The context given in the inquiry is: “One source I would like to consult
are the police- and surveillance reports for the Weimar Republic which are
about revolutionary movements. I would like to know what the surveillance
agency of the Reich (or the ones of the Länder) had to say about [person
name].”⁵
   The question asked in the inquiry reads: “Do you know if the
Bundesarchiv holds such documents?”

    ³ In CIDOC CRM, E7 Activity is a subclass of E5 Event.
    ⁴ Note that the inquiry has been translated from German to English by
the author of this paper.
    ⁵ The name of the person referred to has been anonymized.

   The first interpretation step asks if there are probable and adequate
answers to the question with regard to the domain of historical inquiry but
also to the archival domain. Here, the user is looking for reports which
are the result of a policing or surveillance activity targeted at a
specific type of group (“revolutionary movements”) or at a specific person
(“[person name]”). In that way, this question could even be seen as a
two-fold question. The results of these policing or surveillance activities
are documents about the activities of the aforementioned actors. Such
documents are routinely products of a governmental institution and are now
stored in an archive. The user wants to know if such documents are
available in the Bundesarchiv. Therefore, the information the user wants
consists of pointers to appropriate documents, for example, call numbers of
files likely to contain relevant documents.

     Figure 1: The example inquiry represented in the Documentation-Activity pattern.

   The second interpretation step comprises the translation of the
question, its context and its interpretation to the CIDOC CRM. The two-fold
question can be represented as shown in Figure 1. This is a simplified
representation expressing the formal basic structure of an answer adequate
to satisfy the wanted information or the research interest.⁶ The
interpretation of the question is evident and materialized by the
documentation activity⁷ in the center of the figure. The documentation
activity is seen as being implicit in the historical reality referred to in
the question: The police and surveillance reports have been created during
an event, or a series of events, which “documented” some other events and
which are qualified by the participation of an actor (“[person name]”) or a
specific type of group (“revolutionary movements”). The documentation
activity follows a mandate which captures a specific type of “documented
plans (...) for deliberate human activities” [CDG+11, p. 15].

    ⁶ The implicit question for pointers to documents, for example, a set
of call numbers, is not the point when translating to CIDOC CRM but the
context of the documents of interest. Identification for retrieving the
actual physical document is not in the scope of this ontological model.
    ⁷ An extension to the CIDOC CRM currently deemed necessary.

   Most importantly, mandates⁸ specify or govern documentation activities.
In the case of the two-fold question, the mandate has a specific type of
group as its principal target and at the same time aims at a specific
actor. Furthermore, the mandate is assigned to an actor, in this case an
institution, who carries out the actual documentation activity which, as
the last relevant contextual information, falls within the historical
period of the Weimar Republic. Documents which are the result of this
constellation are relevant documents and may adequately answer the user’s
two-fold question.

    ⁸ This class is another proposed extension to the CIDOC CRM.

   This brief example demonstrates how the research interest of inquiries
can be formally represented in an abstract ontological model. The next
section will show how such a pattern could be instantiated with empirical
data from a digital archival aid.

5   AKM and EAD

The Archival Knowledge Model (AKM) comprises a set of patterns such as the
Documentation-Activity. As a Conceptual Reference Model it can be used to
analyze and to query archival knowledge bases. Tzompanaki and Doerr [TD12]
show how large and complex semantic networks may be queried using CIDOC
CRM. Especially in cases where relevant documents can be expected to be
distributed among records or holdings, such patterns would provide relevant
access points and contexts to retrieve documents.
   Here, a brief example shall demonstrate how archival finding aids
encoded with EAD could be analyzed to determine whether they provide
sufficient implicit or explicit information to adequately answer typical
user queries.
   The Encoded Archival Description⁹ (EAD) standard is the de facto
standard for the digital encoding of archival aids. One of the essential
information entities in a finding aid encoded in EAD is the element
<unittitle> which typically holds the “name of the described materials”¹⁰
at any level of the descriptive tree.
   The following XML snippet is taken from the existing EAD finding aid
Roter Koffer¹¹ from the Bundesarchiv. In this case it represents a quite
informative yet typical entry in an archival finding aid giving the title
of a file: <unittitle>Vernehmungsprotokoll Sarah Fodorova vom 9. Nov.
1936</unittitle>.
   This <unittitle> contains a lot of implicit information: There has been
an interrogation (Vernehmung) of a person named Sarah Fodorova on
9 November 1936 which has been documented by minutes
(Vernehmungsprotokoll) which are now stored in the file.

    ⁹ http://www.loc.gov/ead/
    ¹⁰ http://www.loc.gov/ead/tglib/elements/unittitle.html
    ¹¹ “Roter Koffer” translates to “Red Suitcase”. For background
information on this holding see:
http://www.bstu.bund.de/DE/Wissen/Aktenfunde/Roter-Koffer/roter-koffer_inhalt.html

     Figure 2: The information from the <unittitle> represented explicitly.

   Figure 2 shows an exemplary instantiation (of parts) of the
Documentation-Activity pattern with the
information from the <unittitle>. In this representation the information is
explicit and formalized according to a pattern which is relevant to a broad
range of information needs of typical user inquiries.
   The example also shows that even though the AKM may seem complex,
sufficient semantics can be expected to exist in literal information
values. The patterns documented in the AKM can evidently be implemented by
data structures that are improved accordingly.
   Lastly, the intellectual work for the archivist when creating the title
remains the same when he serves the seemingly more complex pattern.¹² On
the contrary, his intellectual work is preserved in a relevant and explicit
representation while it would be lost in a plain literal text.

    ¹² The “mechanical” effort might differ in that it is quick and easy to
simply type in a literal text. However, this is a question of
implementation and of proper tool design for the creation of archival aids.

6      Conclusion

In terms of its research data and methodological approach, the research
introduced in this paper appears to be rare among studies of the
information behavior of archival users. The study and its research data are
empirical in nature; however, the employed methodology is strongly
interpretative. Archival reference questions are research data that are
difficult to obtain and analyze; however, the interpretative analysis and
formalization of written natural language questions from users to archives,
as this paper has tried to demonstrate, constitute a valuable source for
obtaining meaningful data on original user needs. Only if we gain a
significant and deeper understanding and consensus on archival user needs
in general will we be able to build a new generation of more sophisticated
pattern-oriented (archival) information systems for the (archival) users.

References

[And04]   Ian G. Anderson. Are you being served? Historians and the search
          for primary sources. Archivaria, (58), 2004.
[CDG+11]  Nick Crofts, Martin Doerr, Tony Gill, Stephen Stead, Matthew
          Stiff, and ICOM/CIDOC CRM Special Interest Group. Definition of
          the CIDOC conceptual reference model (version 5.0.4): Produced by
          the ICOM/CIDOC documentation standards group, continued by the
          CIDOC CRM special interest group, 2011.
[Cox08]   Richard Cox. Revisiting the archival finding aid. Journal of
          Archival Organization, 5(4), 2008.
[Cra03]   Barbara Craig. Perimeters with fences? Or thresholds with doors?
          Two views of a border. American Archivist, 66(1), 2003.
[DI08]    Martin Doerr and Dolores Iorizzo. The dream of a global knowledge
          network: A new approach. Journal on Computing and Cultural
          Heritage, 1(1), 2008.
[DJ01]    Wendy M. Duff and Catherine A. Johnson. A virtual expression of
          need: An analysis of e-mail reference questions. American
          Archivist, 64(1):43–60, 2001.
[DKP00]   Garett O. Dworman, Steven O. Kimbrough, and Chuck Patch. On
          pattern-directed search of archives and collections. Journal of
          the American Society for Information Science, 51(1), 2000.
[DOS07]   Martin Doerr, Christian-Emil Ore, and Stephen Stead. The CIDOC
          conceptual reference model: A new standard for knowledge sharing.
          ER2007 tutorial. Challenges in Conceptual Modelling: Tutorials,
          posters, panels and industrial contributions at the 26th
          International Conference on Conceptual Modeling, ER 2007,
          Auckland, New Zealand, November 5-9, 2007, 83, 2007.
[Gar02]   Jean-Claude Gardin. Archaeological discourse, conceptual
          modelling and digitalisation: An interim report of the logicist
          program. The Digital Heritage of Archaeology: Computer
          Applications and Quantitative Methods in Archaeology, Proceedings
          of the 30th Conference, Heraklion, Crete, April 2002, CAA 2002,
          2002.
[Gro92]   Denis Grogan. Practical Reference Work. Library Association
          Publishing, London, 2nd edition, 1992.
[JB80]    Gerald Jahoda and Judith Schiek Braunagel. The Librarian and
          Reference Queries: A Systematic Approach. Library and information
          science. Academic Press, New York, 1980.
[MBG+03]  Claudio Masolo, Stefano Borgo, Nicola Guarino, Alessandro
          Oltramari, and Luc Schneider. WonderWeb deliverable D17. The
          WonderWeb library of foundational ontologies. Preliminary report.
          Deliverable D17, May 2003.
[MH01]    Angelika Menne-Haritz. Access: The reformulation of an archival
          paradigm. Archival Science, 1, 2001.
[Sin10]   Donghee Sinn. Room for archives? Use of archival materials in No
          Gun Ri research. Archival Science, 10(2), 2010.
[TD12]    Katerina Tzompanaki and Martin Doerr. A new framework for
          querying semantic networks. San Diego, 2012.
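
The instantiation discussed in Section 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AKM itself: the title is assumed to be wrapped in a <unittitle> element, the title convention "<document type> <person name> vom <date>" is a guess generalized from the single example, and the node and property names (`produced`, `documented`, `had_participant`, `has_date`, `has_type`) are hypothetical placeholders rather than CIDOC CRM or AKM identifiers.

```python
import re
import xml.etree.ElementTree as ET

# The title quoted in Section 5, assumed here to sit inside a <unittitle> element.
SNIPPET = "<unittitle>Vernehmungsprotokoll Sarah Fodorova vom 9. Nov. 1936</unittitle>"

# Hypothetical title convention: "<document type> <person name> vom <date>".
TITLE_RE = re.compile(r"^(?P<doctype>\S+)\s+(?P<person>.+?)\s+vom\s+(?P<date>.+)$")


def instantiate(unittitle_xml):
    """Turn one literal title into explicit statements of a documentation activity.

    Returns a list of (subject, property, object) triples, or an empty list
    when the title does not follow the assumed convention.
    """
    title = ET.fromstring(unittitle_xml).text.strip()
    match = TITLE_RE.match(title)
    if match is None:
        return []
    # Placeholder node identifiers, invented for this sketch.
    doc = "_:document"
    activity = "_:documentation_activity"
    event = "_:documented_event"
    return [
        (doc, "has_type", match["doctype"]),
        (activity, "produced", doc),
        (activity, "documented", event),
        (event, "had_participant", match["person"]),
        (event, "has_date", match["date"]),
    ]


for triple in instantiate(SNIPPET):
    print(triple)
```

The sketch makes the point of Section 5 concrete: the interrogation, the person and the date are all recoverable from the literal title, but only once they are stated as separate relations can a pattern-based query reach them.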