=Paper=
{{Paper
|id=Vol-2063/dal-paper2
|storemode=property
|title=Finding Talk About the Past in the Discourse of Non-Historians
|pdfUrl=https://ceur-ws.org/Vol-2063/dal-paper2.pdf
|volume=Vol-2063
|authors=Alex Olieman,Kaspar Beelen,Jaap Kamps
|dblpUrl=https://dblp.org/rec/conf/i-semantics/OliemanBK17
}}
==Finding Talk About the Past in the Discourse of Non-Historians==
Alex Olieman (University of Amsterdam, olieman@uva.nl; Stamkracht BV, alex@stamkracht.com), Kaspar Beelen (University of Amsterdam, k.beelen@uva.nl), Jaap Kamps (University of Amsterdam, kamps@uva.nl)

© 2017 Copyright held by the author/owners. SEMANTiCS 2017 workshops proceedings: Drift-a-LOD, September 11-14, 2017, Amsterdam, Netherlands.

===ABSTRACT===
A heightened interest in the presence of the past has given rise to the new field of memory studies, but there is a lack of search and research tools to support studying how and why the past is evoked in diachronic discourses. Searching for temporal references is not straightforward. It entails bridging the gap between conceptually-based information needs on one side, and term-based inverted indexes on the other. Our approach enables the search for references to (intersubjective) historical periods in diachronic corpora. It consists of a semantically-enhanced search engine that is able to find references to many entities at a time, which is combined with a novel interface that invites its user to actively sculpt the search result set. Until now we have been concerned mostly with user-friendly retrieval and selection of sources, but our tool can also contribute to existing efforts to create reusable linked data from and for research in the humanities.

Keywords: Colligatory Concepts, Semantically-Enhanced Search, Interactive Information Retrieval, Corpus Selection, Digital Humanities

===1 INTRODUCTION===
There has been a monumental shift from the future to the past in the cultural orientation of Western societies, starting in the 1980s. In a sense, the past has increasingly gained in presence: in the literary and artistic expressions of (traumatic) memories that cannot be contained by the evidence that forms the basis of historical studies, the proliferation of museums and archives, and the commodification of the past as marked by docudramas, historically-themed amusement parks, and memorabilia of pasts that never existed [3]. Many of the resulting representations of segments of the past serve a deeply different purpose than the representations created by professional historians. Under the influences of positivist science and literary realism in the 19th century, history was detached from “its former habitation in rhetoric” [10, p. 10] and has developed a rigorous manner of dealing with sources and evidence that leads to the production of distinctly historical accounts of past events. While changes in how historians have interpreted past events have been widely studied (e.g. under the banner of historiography), we are still in the early days of developing methods to analyze how other groups, such as journalists and politicians, have incorporated the past into their narratives.

Identifying and interpreting the presence of the past in the discourse of non-historians is worthwhile, not because we expect to establish new facts about past events, but rather because laypersons and practitioners of other disciplines draw upon the past to make all kinds of judgments and decisions in daily life [10]. The availability of large diachronic corpora has opened new avenues to study how and why people make reference to the past in their discourse (e.g. to convince others or to express emotion), and to analyze differences within particular time intervals as well as across time. The size of such corpora, however, combined with the scatteredness of references to the past, can make these corpora daunting to explore without the right tools.

In response to the “spatial turn” in Digital Humanities, substantial effort has been put into the development of tools that allow for spatial navigation through text collections. In the Pelagios project, for example, the Pleiades gazetteer serves to anchor locations mentioned in text to machine-readable representations of these locations, which can be combined with linked data to form rich map-based visualizations and allow for spatial access to the texts through the Peripleo search interface [5]. The recognition that “space and time are no more separate in human cognition than they are in theoretical physics” [5, p. 43] now motivates the development of tools that provide access to texts by the historical entities that they reference, to complement the evolving spatial approaches.

In this paper, we propose an approach to support researchers who aim to identify and interpret (indirect) references to historical periods in a particular discourse. It consists of a semantically-enhanced search engine that is able to find references to many entities at a time in diachronic corpora, which is combined with an interface that invites its user to actively sculpt the search result set. As several pilot studies have indicated [4], this approach is useful for those studying how groups of people pragmatically feature historical entities in discourse, remember or commemorate them, and understand their own identity in relation to these entities. Moreover, we are able to capture in linked data how definitions of historical periods are operationalized by researchers, and generate semantic annotations for the search results that users consider to be relevant.

===2 SEARCHING FOR COLLIGATORY CONCEPTS===
The search for references to historical periods entails bridging the gap between conceptually-based information needs on one side, and term-based inverted indexes on the other. When a researcher is looking for the fragments of text that refer to, e.g., the French Revolution in a specific collection, they are not served well by providing only the documents that contain the literal phrase “French Revolution.” This is the case because such periods do not pre-exist in reality, waiting to be discovered and named, but rather come into being when the disparate observable elements of a phenomenon are “seen together” as a synthetic whole [1]. Philosophers of science and of history have called this process colligation: a binding together of selected historical facts, culminating in the proposal of a “colligatory concept” which represents the historian’s understanding of the facts as an ‘entwined whole’ in a form that can be communicated to others [7].

Periods are but one kind of colligatory concept that is commonly constructed by historians; the others being characters, such as ‘Louis XVI’ and ‘the French people,’ and ideal types, for instance ‘capitalism’ and ‘revolution’ [7]. Whereas characters are localized in time and space, and ideal types bind together certain aspects of the events in which these characters participate across time and space, periods are bound together by narratives that feature selected events which are distributed within (flexible) temporal and spatial bounds [7]. The PeriodO project has already produced a linked data gazetteer of periods as they are represented in the published works of historians and historiographers, using a nanopublication approach that connects the qualitative definition of a period by an individual author to its name, spatial extent, and a time interval, expressed together in RDF [1, 5].

This model accurately captures the multivocal aspect of period definitions, but is not sufficient to retrieve references to these periods in discourse that is about another subject. Here we are confronted with the difference between colligatory concepts which are necessary constructs in the writing of history, and the subjects that are intended to group together multiple histories that exhibit common “patterns of colligation” [7, p. 1098]. As a colligatory concept, a period is a particular representation of the past that binds together historical entities in a unique narrative. When this individual representation leads to further discourse, the discourse as a whole is not about the original colligation, but rather about a homonymous subject that allows us to, e.g., ask a librarian for ‘novels set during the French revolution.’ In the practice of information organization, we establish common referents and shared structure between colligations in order to group a multiplicity of perspectives under a single label [7].

We have no way of searching directly for the period that the researcher has in mind (it exists only as a cognitive representation), but by starting from its associated subject we can make an educated guess about the elements of the period. Named entities (events, people, artifacts) are central in our search approach, because they serve as the points of consensus that enable the search for as yet unknown perspectives on segments of the past. The aim is to collect (and represent) an intermittent discourse, consisting of sentences that make (indirect) reference to the target period of the search, with as much context as is needed to understand them. To be sure, we do not equate the colligatory concepts that are put forward by historical work with the scattered references to the past that are found in non-historical discourse. Rather, we expect the (re)searcher to have a particular period (i.e. colligatory concept) in mind, and aim for the system to find references to the entities that are bound together by this period.

===3 APPROACH===
The task that we aim to facilitate, selecting a research corpus of text fragments that refer to a particular period, is not supported well by either subject indexing [7] or full-text search [2], which nowadays provide the primary means of access to much of the text in archives and libraries. Subjects are traditionally assigned to whole documents. Our task, however, depends on the retrieval of individual sentences (and their context), given their subjects. It seems infeasible to provide such granular access with manually assigned subjects, but recent advances in Entity Linking have enabled the automated identification of the referents in individual sentences. Even though entity linking systems are prone to errors that human annotators would not make, the entity links that they produce can still be useful to search for many entities simultaneously [4]. Semantically-enhanced search is made possible by incorporating entity links or similar semantic annotations into search indexes [2]. Our approach extends this practice by employing linked data to bridge the semantic gap between the search target (i.e. a period) and references to the entities that are bound together by this concept.

====3.1 Bootstrapping with DBpedia====
Some groupings of entities that people might want to search for already exist as Linked Open Data. The category network of Wikipedia serves an analogous function to that of the subject access systems of librarians and archivists. It is a knowledge organization system (KOS) that relates more specific subjects to more general ones by broader-than relations [8]. As the product of the contributions of a diverse community, it encodes multiple perspectives on the world in a single structure which allows for multiple paths to exist between any two subjects. We used DBpedia’s RDF representation of this category network for our proof-of-concept, because we were working with a corpus that we enriched with entity links that point to DBpedia URIs. Its simple structure, represented in the SKOS ontology, may not be ideal for all research purposes, but it is sufficient to start a search process with a period-as-subject and invite the researcher to operationalize the particular period he/she wants to search for.

In order to obtain possible mappings between periods of interest and the entities that are bound together by such periods, we extracted a subgraph of DBpedia, corresponding to Wikipedia’s category network and its related entities, into a property graph database (see [6]). The category network is used at runtime to select potentially relevant entities given a root category, by traversing skos:broader and dct:subject relations in reverse direction (for namespace prefixes, see https://dbpedia.org/sparql?nsdecl). Starting, for example, from the root category dbc:French_Revolution, the traversal would proceed through subcategories such as dbc:Montagnards and dbc:French_First_Republic to collect entities including dbr:Reign_of_Terror, dbr:Maximilien_Robespierre, dbr:Bastille, and dbr:Drownings_at_Nantes.

Our proof-of-concept makes use of DBpedia, but any knowledge graph that conforms to the SKOS ontology can be loaded easily. Linked data that is structured differently can also be used, as long as a grouping or categorization that is familiar to the intended users can be derived from it. It is important that the representations of entities in this data are identifiable by the same URIs as those used in the entity links in the corpus, to be able to connect periods, via colligated entities, to the text fragments that refer to these entities. Finally, the system needs access to coarse temporal clues about entities. Because DBpedia does not provide this data reliably across entity types, we extract mentioned years from the rdfs:comment values of DBpedia resources with a simple regular expression, and add them to the graph. The same technique may be successful for other linked data sources that provide textual descriptions which often include temporal expressions. It would be preferable, however, to use representations that incorporate structured temporal relations in the form of RDF literals for all entities.

====3.2 Search Interface Design====
Our proof-of-concept, named WideNet [4], provides access to a semantically enriched version of the Dutch parliamentary proceedings: the “verbatim” records of the debates that take place in the Houses of Parliament (the Staten Generaal). These discussions touch on almost every issue that moved Dutch public opinion over more than two centuries.

The interface guides users through three research phases: (1) selection of root category, (2) assessment of the categories’ and entities’ relevance, and (3) close-reading. In the first step the user selects one or several root categories from a typeahead search box (see Figure 1), and demarcates the query by selecting a time period, which is used to prune the underlying entities of the selected categories. WideNet subsequently retrieves the network of narrower categories for each selected root category, and collects the contained entities as potentially relevant query components. Behind the scenes, each entity is compared with the target period, and is considered to be outright relevant to the period, or not, or a borderline case, or as lacking temporal clues altogether. In the current implementation this classification is achieved with simple rules, based on the features: ‘fraction of years within period,’ ‘fraction of intervals that overlap with the period,’ and ‘has at least one year in period.’ The system uses this information to deselect (sub)categories where more than half of the dated member entities are out-of-period, i.e., those categories are excluded from the query.

Figure 1: Initial query specification in WideNet.

The next step for the WideNet user is to assess which of the retrieved subcategories actually contain entities that lead to relevant results. By doing this, researchers can operationalize their own definition of their target period, at least for the purpose of retrieval. The interface facilitates this task by showing, per subcategory, which entities are mentioned in the corpus, and how frequently, as well as which entities did not occur (see Figure 2). It also displays a list of preview results, showing limited context, to offer quick clues about the relevance of the category. This preview is also useful to identify individual entities that are not relevant after all, which can be deselected by the user. At the end of this step, the researcher’s decisions amount to a motivated and organized representation of ‘relevant’ or ‘possibly present’ elements of the target period.

Figure 2: Assessing the relevance of categories and entities.

After inspecting and selecting relevant categories of entities, thereby sculpting the final query, the WideNet interface allows further scrutinizing of the sources by providing an environment in which the retrieved documents can be studied up-close, as shown in Figure 3. By situating the close reading activity within the same interface, the user is able to compile a corpus of relevant documents which may be saved and exported. Moreover, the user can examine the results in relation to the document metadata, e.g. to look for saliency by plotting the annotations over time, or to study bias by comparing how often different political parties refer to the entities of interest. The selected corpus, representing a fragmented discourse, may also be used to analyze changes in how and why the demarcated segment of the past was evoked, establishing contesting perspectives and trends over time, provided that enough references could be found.

Figure 3: A closer look at the retrieved documents.

====3.3 Capturing Reusable Data====
As a product of the search, semantic annotations are created which link the source documents to the entities that are referenced in the corpus. These expert-approved annotations have much more value than the automatically generated entity links. For one, the identity of the referents is established with a greater confidence when a user chooses to include a particular document fragment in his/her research corpus. We cannot interpret this as a direct assessment of the entity links, but when a user has confirmed that the document fragment (indirectly) refers to the target period, it is tempting to assume that the document was retrieved because the entity links are correct, and not by some mistake. To model the provenance of such relevance decisions, it is necessary to produce a representation of the broader context of the search process to which individual assertions can be explicitly related (see [1]).

In the first screen that users encounter when starting a new search process in WideNet (depicted in Figure 1), we are able to capture the motivation and a rough demarcation of the search. This information forms the backbone of a representation to which subsequent assertions can refer. The selection of categories and entities (in Figure 2) provides assertions about their relevance for the search target, according to the researcher and dependent on the corpus. We currently store each decision that is made by the researcher, so when a previously made decision is revisited, we create representations of e.g. both the act of deselecting an entity, and reselecting it after more preview results had been inspected. Capturing this process data, rather than only the final selection of entities, benefits the richness of the provenance of the subsequent assertions about the relevance of text fragments.

The semantic annotations that are derived from relevance assertions on text fragments can provide richly indexed access to statements about the past. This notion is similar to Ryan Shaw’s proposal for “deep gazetteers,” in which multiple descriptions of the same named entity are linked to fragments of discourse in which its name is used [8]. In our case, however, we use representations of periods-as-subjects to connect particular conceptions of these periods to discourse that refers to these periods from a different perspective, rather than use of the same name per se. Our approach can produce “projections” of the parts of the past that were present in a particular discourse, which can be organized according to the KOS that was used to perform the search, but also according to any other KOS that includes representations of the same entities, provided that equivalence of identity (e.g. owl:sameAs) can be established between the knowledge organization systems.

While we capture the described representations of search processes and semantic annotations in our current proof-of-concept, we do not yet publish this data. Our approach is related to ongoing efforts to produce reusable data for research in the humanities (e.g. [1, 8, 9]) and we are still investigating how we can best link to the existing models. We have prioritized the development of a useful search tool over the production of reusable data in order to investigate which data can be captured during actual research in the humanities, rather than designing our models first and finding out later that they are not as usable as we had hoped [9].

===4 CONCLUSION AND OUTLOOK===
In searching diachronic corpora for fragmented discourses about historical periods, we need to come to a fundamental understanding of conceptual difference before we can give any account of conceptual drift over time. Our approach for finding references to periods consists of a semantically-enhanced search engine that is able to find references to many entities at a time in diachronic corpora, which is combined with an interface that invites its user to actively sculpt the search result set. Besides yielding sources that are useful for the searcher, the search tool also produces an operationalization of the search target by the (re)searcher as well as semantic annotations that are much richer than those that we can generate algorithmically.

Although a pragmatic treatment of concepts is sufficient to search for multifaceted subjects, we envision how the products of such search processes can collectively provide a shared source for more elaborate knowledge representations. In designing a tool that implements this approach, we were faced with some trade-offs between usability of the tool and reusability of the data it captures. We prioritize supporting the present-day researcher well, and facilitate publishing the search process and its results as linked open data, so that subsequent refinement of the captured data may be a community effort.

===ACKNOWLEDGMENTS===
We thank the anonymous reviewers for their suggestions and remarks. This work was supported by the Netherlands Organization for Scientific Research (ExPoSe project, NWO CI # 314.99.108).

===REFERENCES===
[1] Patrick Golden and Ryan Shaw. 2016. Nanopublication Beyond the Sciences: the PeriodO period gazetteer. PeerJ Computer Science 2:e44 (2016). https://doi.org/10.7717/peerj-cs.44

[2] Annika Hinze, David Bainbridge, Sally Jo Cunningham, and J. Stephen Downie. 2016. Low-cost Semantic Enhancement to Digital Library Metadata and Indexing. Proceedings of JCDL ’16 (2016), 93–102. https://doi.org/10.1145/2910896.2910910

[3] Andreas Huyssen. 2000. Present Pasts: Media, Politics, Amnesia. Public Culture 12, 1 (2000), 21–38.

[4] Alex Olieman, Kaspar Beelen, Milan van Lange, Jaap Kamps, and Maarten Marx. 2017. Good Applications for Crummy Entity Linkers? The Case of Corpus Selection in Digital Humanities. In Proceedings of SEMANTiCS 2017. arXiv:1708.01162

[5] Adam Rabinowitz, Ryan Shaw, Sarah Buchanan, Patrick Golden, and Eric Kansa. 2016. Making sense of the ways we make sense of the past: The PeriodO project. Bulletin of the Institute of Classical Studies 59, 2 (2016), 42–55.

[6] Marko Rodriguez. 2015. The Gremlin Graph Traversal Machine and Language. In Proc. 15th Symposium on Database Programming Languages. 1–10. arXiv:1508.03843

[7] Ryan Shaw. 2013. Information Organization and the Philosophy of History. Journal of the American Society for Information Science and Technology 64, 6 (jun 2013), 1092–1103. https://doi.org/10.1002/asi.22843

[8] Ryan Shaw. 2016. Gazetteers Enriched: A Conceptual Basis for Linking Gazetteers with Other Kinds of Information. In Placing Names: Enriching and Integrating Gazetteers. 51–63.

[9] Ryan Shaw, Patrick Golden, and Michael Buckland. 2015. Using Linked Library Data in Working Research Notes. In Linked Data and User Interaction. 48–65.

[10] Hayden White. 2014. The Practical Past. Northwestern University Press.
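As an illustration of the entity selection described in Sections 3.1 and 3.2, the following is a minimal sketch and not the WideNet implementation. It assumes a tiny in-memory graph in place of the property graph database. The assignment of entities to categories, the comment texts, the year pattern, and the 0.5 threshold are all hypothetical, and the "fraction of intervals that overlap with the period" feature is left out for brevity.

```python
import re
from collections import deque

# Hypothetical toy graph. NARROWER holds skos:broader edges reversed
# (category -> subcategories); MEMBERS holds dct:subject edges reversed
# (category -> entities). The category memberships are invented.
NARROWER = {
    "dbc:French_Revolution": ["dbc:Montagnards", "dbc:French_First_Republic"],
}
MEMBERS = {
    "dbc:French_Revolution": ["dbr:Bastille"],
    "dbc:Montagnards": ["dbr:Maximilien_Robespierre"],
    "dbc:French_First_Republic": ["dbr:Reign_of_Terror", "dbr:Drownings_at_Nantes"],
}
# Invented stand-ins for rdfs:comment values.
COMMENTS = {
    "dbr:Bastille": "A fortress in Paris, stormed in 1789 and later demolished.",
    "dbr:Maximilien_Robespierre": "French lawyer and statesman (1758 to 1794).",
    "dbr:Reign_of_Terror": "A period of violence in 1793 and 1794.",
    "dbr:Drownings_at_Nantes": "Mass executions by drowning.",
}

# Assumed pattern: four-digit years from 1000 to 2099.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_years(text):
    """Return every year mentioned in a textual description."""
    return [int(y) for y in YEAR_RE.findall(text)]


def collect(root):
    """Walk breadth-first from a root category to all narrower categories.

    Returns a dict mapping each visited category to its member entities.
    """
    seen, queue, found = {root}, deque([root]), {}
    while queue:
        category = queue.popleft()
        found[category] = list(MEMBERS.get(category, []))
        for sub in NARROWER.get(category, []):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return found


def classify(years, start, end):
    """Label an entity using two of the three features named in the text."""
    if not years:
        return "undated"
    inside = sum(start <= y <= end for y in years)
    if inside / len(years) > 0.5:  # assumed threshold
        return "relevant"
    if inside >= 1:  # has at least one year in period
        return "borderline"
    return "out-of-period"


def keep_category(entities, start, end):
    """Deselect a category when more than half of its dated members are out-of-period."""
    labels = [classify(extract_years(COMMENTS.get(e, "")), start, end) for e in entities]
    dated = [label for label in labels if label != "undated"]
    out = sum(label == "out-of-period" for label in dated)
    return out <= len(dated) / 2


if __name__ == "__main__":
    for category, entities in collect("dbc:French_Revolution").items():
        print(category, keep_category(entities, 1789, 1799), entities)
```

With a target period of 1789 to 1799, this sketch treats the Robespierre entry as borderline, because only one of its two mentioned years falls inside the period, and treats the entry without any year as undated, so it does not count toward the decision to deselect its category.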