=Paper= {{Paper |id=Vol-2063/dal-paper2 |storemode=property |title=Finding Talk About the Past in the Discourse of Non-Historians |pdfUrl=https://ceur-ws.org/Vol-2063/dal-paper2.pdf |volume=Vol-2063 |authors=Alex Olieman,Kaspar Beelen,Jaap Kamps |dblpUrl=https://dblp.org/rec/conf/i-semantics/OliemanBK17 }} ==Finding Talk About the Past in the Discourse of Non-Historians== https://ceur-ws.org/Vol-2063/dal-paper2.pdf
Finding Talk About the Past in the Discourse of Non-Historians

Alex Olieman, University of Amsterdam (olieman@uva.nl) and Stamkracht BV (alex@stamkracht.com)
Kaspar Beelen, University of Amsterdam (k.beelen@uva.nl)
Jaap Kamps, University of Amsterdam (kamps@uva.nl)

ABSTRACT
A heightened interest in the presence of the past has given rise to the new field of memory studies, but there is a lack of search and research tools to support studying how and why the past is evoked in diachronic discourses. Searching for temporal references is not straightforward. It entails bridging the gap between conceptually-based information needs on one side, and term-based inverted indexes on the other. Our approach enables the search for references to (intersubjective) historical periods in diachronic corpora. It consists of a semantically-enhanced search engine that is able to find references to many entities at a time, which is combined with a novel interface that invites its user to actively sculpt the search result set. Until now we have been concerned mostly with user-friendly retrieval and selection of sources, but our tool can also contribute to existing efforts to create reusable linked data from and for research in the humanities.

Keywords: Colligatory Concepts, Semantically-Enhanced Search, Interactive Information Retrieval, Corpus Selection, Digital Humanities

© 2017 Copyright held by the author/owners.
SEMANTiCS 2017 workshops proceedings: Drift-a-LOD
September 11-14, 2017, Amsterdam, Netherlands

1 INTRODUCTION
There has been a monumental shift from the future to the past in the cultural orientation of Western societies, starting in the 1980s. In a sense, the past has increasingly gained in presence: in the literary and artistic expressions of (traumatic) memories that cannot be contained by the evidence that forms the basis of historical studies, the proliferation of museums and archives, and the commodification of the past as marked by docudramas, historically-themed amusement parks, and memorabilia of pasts that never existed [3]. Many of the resulting representations of segments of the past serve a deeply different purpose than the representations created by professional historians. Under the influences of positivist science and literary realism in the 19th century, history was detached from “its former habitation in rhetoric” [10, p. 10] and has developed a rigorous manner of dealing with sources and evidence that leads to the production of distinctly historical accounts of past events. While changes in how historians have interpreted past events have been widely studied (e.g. under the banner of historiography), we are still in the early days of developing methods to analyze how other groups, such as journalists and politicians, have incorporated the past into their narratives.

Identifying and interpreting the presence of the past in the discourse of non-historians is worthwhile, not because we expect to establish new facts about past events, but rather because laypersons and practitioners of other disciplines draw upon the past to make all kinds of judgments and decisions in daily life [10]. The availability of large diachronic corpora has opened new avenues to study how and why people make reference to the past in their discourse (e.g. to convince others or to express emotion), and to analyze differences within particular time intervals as well as across time. The size of such corpora, however, combined with the scatteredness of references to the past, can make these corpora daunting to explore without the right tools.

In response to the “spatial turn” in Digital Humanities, substantial effort has been put into the development of tools that allow for spatial navigation through text collections. In the Pelagios project, for example, the Pleiades gazetteer serves to anchor locations mentioned in text to machine-readable representations of these locations, which can be combined with linked data to form rich map-based visualizations and allow for spatial access to the texts through the Peripleo search interface [5]. The recognition that “space and time are no more separate in human cognition than they are in theoretical physics” [5, p. 43] now motivates the development of tools that provide access to texts by the historical entities that they reference, to complement the evolving spatial approaches.

In this paper, we propose an approach to support researchers who aim to identify and interpret (indirect) references to historical periods in a particular discourse. It consists of a semantically-enhanced search engine that is able to find references to many entities at a time in diachronic corpora, which is combined with an interface that invites its user to actively sculpt the search result set. As several pilot studies have indicated [4], this approach is useful for those studying how groups of people pragmatically feature historical entities in discourse, remember or commemorate them, and understand their own identity in relation to these entities. Moreover, we are able to capture in linked data how definitions of historical periods are operationalized by researchers, and generate semantic annotations for the search results that users consider to be relevant.

2 SEARCHING FOR COLLIGATORY CONCEPTS
The search for references to historical periods entails bridging the gap between conceptually-based information needs on one side, and term-based inverted indexes on the other. When a researcher is looking for the fragments of text that refer to, e.g., the French
Revolution in a specific collection, they are not served well by providing only the documents that contain the literal phrase “French Revolution.” This is the case because such periods do not pre-exist in reality, waiting to be discovered and named, but rather come into being when the disparate observable elements of a phenomenon are “seen together” as a synthetic whole [1]. Philosophers of science and of history have called this process colligation: a binding together of selected historical facts, culminating in the proposal of a “colligatory concept” which represents the historian’s understanding of the facts as an ‘entwined whole’ in a form that can be communicated to others [7].

Periods are but one kind of colligatory concept that is commonly constructed by historians; the others are characters, such as ‘Louis XVI’ and ‘the French people,’ and ideal types, for instance ‘capitalism’ and ‘revolution’ [7]. Whereas characters are localized in time and space, and ideal types bind together certain aspects of the events in which these characters participate across time and space, periods are bound together by narratives that feature selected events which are distributed within (flexible) temporal and spatial bounds [7]. The PeriodO project has already produced a linked data gazetteer of periods as they are represented in the published works of historians and historiographers, using a nanopublication approach that connects the qualitative definition of a period by an individual author to its name, spatial extent, and a time interval, expressed together in RDF [1, 5].

This model accurately captures the multivocal aspect of period definitions, but is not sufficient to retrieve references to these periods in discourse that is about another subject. Here we are confronted with the difference between colligatory concepts, which are necessary constructs in the writing of history, and the subjects that are intended to group together multiple histories that exhibit common “patterns of colligation” [7, p. 1098]. As a colligatory concept, a period is a particular representation of the past that binds together historical entities in a unique narrative. When this individual representation leads to further discourse, the discourse as a whole is not about the original colligation, but rather about a homonymous subject that allows us to, e.g., ask a librarian for ‘novels set during the French revolution.’ In the practice of information organization, we establish common referents and shared structure between colligations in order to group a multiplicity of perspectives under a single label [7].

We have no way of searching directly for the period that the researcher has in mind (it exists only as a cognitive representation), but by starting from its associated subject we can make an educated guess about the elements of the period. Named entities (events, people, artifacts) are central in our search approach, because they serve as the points of consensus that enable the search for as yet unknown perspectives on segments of the past. The aim is to collect (and represent) an intermittent discourse, consisting of sentences that make (indirect) reference to the target period of the search, with as much context as is needed to understand them. To be sure, we do not equate the colligatory concepts that are put forward by historical work with the scattered references to the past that are found in non-historical discourse. Rather, we expect the (re)searcher to have a particular period (i.e. colligatory concept) in mind, and aim for the system to find references to the entities that are bound together by this period.

3 APPROACH
The task that we aim to facilitate, selecting a research corpus of text fragments that refer to a particular period, is not supported well by either subject indexing [7] or full-text search [2], which nowadays provide the primary means of access to much of the text in archives and libraries. Subjects are traditionally assigned to whole documents. Our task, however, depends on the retrieval of individual sentences (and their context), given their subjects. It seems infeasible to provide such granular access with manually assigned subjects, but recent advances in Entity Linking have enabled the automated identification of the referents in individual sentences. Even though entity linking systems are prone to errors that human annotators would not make, the entity links that they produce can still be useful to search for many entities simultaneously [4]. Semantically-enhanced search is made possible by incorporating entity links or similar semantic annotations into search indexes [2]. Our approach extends this practice by employing linked data to bridge the semantic gap between the search target (i.e. a period) and references to the entities that are bound together by this concept.

3.1 Bootstrapping with DBpedia
Some groupings of entities that people might want to search for already exist as Linked Open Data. The category network of Wikipedia serves a function analogous to that of the subject access systems of librarians and archivists. It is a knowledge organization system (KOS) that relates more specific subjects to more general ones by broader-than relations [8]. As the product of the contributions of a diverse community, it encodes multiple perspectives on the world in a single structure which allows for multiple paths to exist between any two subjects. We used DBpedia’s RDF representation of this category network for our proof-of-concept, because we were working with a corpus that we enriched with entity links that point to DBpedia URIs. Its simple structure, represented in the SKOS ontology, may not be ideal for all research purposes, but it is sufficient to start a search process with a period-as-subject and invite the researcher to operationalize the particular period they want to search for.

In order to obtain possible mappings between periods of interest and the entities that are bound together by such periods, we extracted a subgraph of DBpedia, corresponding to Wikipedia’s category network and its related entities, into a property graph database (see [6]). The category network is used at runtime to select potentially relevant entities given a root category, by traversing skos:broader¹ and dct:subject relations in reverse direction. Starting, for example, from the root category dbc:French_Revolution, the traversal would proceed through subcategories such as dbc:Montagnards and dbc:French_First_Republic to collect entities including dbr:Reign_of_Terror, dbr:Maximilien_Robespierre, dbr:Bastille, and dbr:Drownings_at_Nantes.

¹ For namespace prefixes, see https://dbpedia.org/sparql?nsdecl.

Our proof-of-concept makes use of DBpedia, but any knowledge graph that conforms to the SKOS ontology can be loaded easily.
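
As a minimal sketch of the reverse traversal over skos:broader and dct:subject relations described in Section 3.1, the following Python fragment collects candidate entities from a root category. It assumes a small in-memory edge list in place of the property graph database. The sample edges are illustrative only and were not checked against DBpedia, and the names `invert` and `collect_entities` are hypothetical, not part of WideNet.

```python
from collections import defaultdict, deque

# Illustrative edges only (assumption, not checked against DBpedia).
# skos:broader, stored as (narrower category, broader category).
BROADER = [
    ("dbc:Montagnards", "dbc:French_Revolution"),
    ("dbc:French_First_Republic", "dbc:French_Revolution"),
    # Deliberate cycle, to show that the traversal still terminates.
    ("dbc:French_Revolution", "dbc:Montagnards"),
]
# dct:subject, stored as (entity, category).
SUBJECT = [
    ("dbr:Bastille", "dbc:French_Revolution"),
    ("dbr:Maximilien_Robespierre", "dbc:Montagnards"),
    ("dbr:Reign_of_Terror", "dbc:French_First_Republic"),
    ("dbr:Drownings_at_Nantes", "dbc:French_First_Republic"),
    ("dbr:Eiffel_Tower", "dbc:Towers_in_Paris"),
]


def invert(pairs):
    """Index (source, target) pairs by target, so edges can be followed in reverse."""
    index = defaultdict(set)
    for source, target in pairs:
        index[target].add(source)
    return index


def collect_entities(root, broader, subject):
    """Breadth-first walk from root to all narrower categories and their members."""
    narrower = invert(broader)   # broader category -> narrower categories
    members = invert(subject)    # category -> member entities
    seen = {root}
    queue = deque([root])
    entities = set()
    while queue:
        category = queue.popleft()
        entities |= members.get(category, set())
        for sub in narrower.get(category, set()):
            if sub not in seen:  # guard against cycles and repeated paths
                seen.add(sub)
                queue.append(sub)
    return seen, entities


categories, entities = collect_entities("dbc:French_Revolution", BROADER, SUBJECT)
print(sorted(categories))
print(sorted(entities))
```

The `seen` set matters because a category network that allows multiple paths between subjects can reach the same category more than once; without it, the walk could revisit categories indefinitely.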
Linked data that is structured differently can also be used, as long as a grouping or categorization that is familiar to the intended users can be derived from it. It is important that the representations of entities in this data are identifiable by the same URIs as those used in the entity links in the corpus, to be able to connect periods, via colligated entities, to the text fragments that refer to these entities.

Finally, the system needs access to coarse temporal clues about entities. Because DBpedia does not provide this data reliably across entity types, we extract mentioned years from the rdfs:comment values of DBpedia resources with a simple regular expression, and add them to the graph. The same technique may be successful for other linked data sources that provide textual descriptions which often include temporal expressions. It would be preferable, however, to use representations that incorporate structured temporal relations in the form of RDF literals for all entities.

3.2 Search Interface Design
Our proof-of-concept, named WideNet [4], provides access to a semantically enriched version of the Dutch parliamentary proceedings, the “verbatim” records of the debates that take place in the Houses of Parliament (the Staten Generaal). These discussions touch on almost every issue that moved Dutch public opinion over more than two centuries.

The interface guides users through three research phases: (1) selection of root category, (2) assessment of the categories’ and entities’ relevance, and (3) close-reading. In the first step the user selects one or several root categories from a typeahead search box (see Figure 1), and demarcates the query by selecting a time period, which is used to prune the underlying entities of the selected categories. WideNet subsequently retrieves the network of narrower categories for each selected root category, and collects the contained entities as potentially relevant query components. Behind the scenes, each entity is compared with the target period, and is classified as outright relevant to the period, not relevant, a borderline case, or lacking temporal clues altogether. In the current implementation this classification is achieved with simple rules, based on the features: ‘fraction of years within period,’ ‘fraction of intervals that overlap with the period,’ and ‘has at least one year in period.’ The system uses this information to deselect (sub)categories where more than half of the dated member entities are out-of-period, i.e., those categories are excluded from the query.

Figure 1: Initial query specification in WideNet.

The next step for the WideNet user is to assess which of the retrieved subcategories actually contain entities that lead to relevant results. By doing this, researchers can operationalize their own definition of their target period, at least for the purpose of retrieval. The interface facilitates this task by showing, per subcategory, which entities are mentioned in the corpus, and how frequently, as well as which entities did not occur (see Figure 2). It also displays a list of preview results, showing limited context, to offer quick clues about the relevance of the category. This preview is also useful to identify individual entities that are not relevant after all, which can be deselected by the user. At the end of this step, the researcher’s decisions amount to a motivated and organized representation of ‘relevant’ or ‘possibly present’ elements of the target period.

Figure 2: Assessing the relevance of categories and entities.

Figure 3: A closer look at the retrieved documents.

After inspecting and selecting relevant categories of entities, thereby sculpting the final query, the user can scrutinize the sources further in an environment provided by the WideNet interface, in which the retrieved documents can be studied up-close, as shown in Figure 3. By situating the close reading activity within the same interface, the user is able to compile a corpus of relevant documents which may be saved and exported. Moreover, the user can examine
the results in relation to the document metadata, e.g. to look for saliency by plotting the annotations over time, or to study bias by comparing how often different political parties refer to the entities of interest. The selected corpus, representing a fragmented discourse, may also be used to analyze changes in how and why the demarcated segment of the past was evoked, establishing contesting perspectives and trends over time, provided that enough references could be found.

3.3 Capturing Reusable Data
As a product of the search, semantic annotations are created which link the source documents to the entities that are referenced in the corpus. These expert-approved annotations have much more value than the automatically generated entity links. For one, the identity of the referents is established with a greater confidence when a user chooses to include a particular document fragment in their research corpus. We cannot interpret this as a direct assessment of the entity links, but when a user has confirmed that the document fragment (indirectly) refers to the target period, it is tempting to assume that the document was retrieved because the entity links are correct, and not by some mistake. To model the provenance of such relevance decisions, it is necessary to produce a representation of the broader context of the search process to which individual assertions can be explicitly related (see [1]).

In the first screen that users encounter when starting a new search process in WideNet (depicted in Figure 1), we are able to capture the motivation and a rough demarcation of the search. This information forms the backbone of a representation to which subsequent assertions can refer. The selection of categories and entities (in Figure 2) provides assertions about their relevance for the search target, according to the researcher and dependent on the corpus. We currently store each decision that is made by the researcher, so when a previously made decision is revisited, we create representations of e.g. both the act of deselecting an entity, and reselecting it after more preview results had been inspected. Capturing this process data, rather than only the final selection of entities, benefits the richness of the provenance of the subsequent assertions about the relevance of text fragments.

The semantic annotations that are derived from relevance assertions on text fragments can provide richly indexed access to statements about the past. This notion is similar to Ryan Shaw’s proposal for “deep gazetteers,” in which multiple descriptions of the same named entity are linked to fragments of discourse in which its name is used [8]. In our case, however, we use representations of periods-as-subjects to connect particular conceptions of these periods to discourse that refers to these periods from a different perspective, rather than use of the same name per se. Our approach can produce “projections” of the parts of the past that were present in a particular discourse, which can be organized according to the KOS that was used to perform the search, but also according to any other KOS that includes representations of the same entities, provided that equivalence of identity (e.g. owl:sameAs) can be established between the knowledge organization systems.

While we capture the described representations of search processes and semantic annotations in our current proof-of-concept, we do not yet publish this data. Our approach is related to ongoing efforts to produce reusable data for research in the humanities (e.g. [1, 8, 9]) and we are still investigating how we can best link to the existing models. We have prioritized the development of a useful search tool over the production of reusable data in order to investigate which data can be captured during actual research in the humanities, rather than designing our models first and finding out later that they are not as usable as we had hoped [9].

4 CONCLUSION AND OUTLOOK
In searching diachronic corpora for fragmented discourses about historical periods, we need to come to a fundamental understanding of conceptual difference before we can give any account of conceptual drift over time. Our approach for finding references to periods consists of a semantically-enhanced search engine that is able to find references to many entities at a time in diachronic corpora, which is combined with an interface that invites its user to actively sculpt the search result set. Besides yielding sources that are useful for the searcher, the search tool also produces an operationalization of the search target by the (re)searcher as well as semantic annotations that are much richer than those that we can generate algorithmically.

Although a pragmatic treatment of concepts is sufficient to search for multifaceted subjects, we envision how the products of such search processes can collectively provide a shared source for more elaborate knowledge representations. In designing a tool that implements this approach, we were faced with some trade-offs between usability of the tool and reusability of the data it captures. We prioritize supporting the present-day researcher well, and facilitate publishing the search process and its results as linked open data, so that subsequent refinement of the captured data may be a community effort.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their suggestions and remarks. This work was supported by the Netherlands Organization for Scientific Research (ExPoSe project, NWO CI # 314.99.108).

REFERENCES
[1] Patrick Golden and Ryan Shaw. 2016. Nanopublication Beyond the Sciences: the PeriodO period gazetteer. PeerJ Computer Science 2:e44 (2016). https://doi.org/10.7717/peerj-cs.44
[2] Annika Hinze, David Bainbridge, Sally Jo Cunningham, and J. Stephen Downie. 2016. Low-cost Semantic Enhancement to Digital Library Metadata and Indexing. Proceedings of JCDL ’16 (2016), 93–102. https://doi.org/10.1145/2910896.2910910
[3] Andreas Huyssen. 2000. Present Pasts: Media, Politics, Amnesia. Public Culture 12, 1 (2000), 21–38.
[4] Alex Olieman, Kaspar Beelen, Milan van Lange, Jaap Kamps, and Maarten Marx. 2017. Good Applications for Crummy Entity Linkers? The Case of Corpus Selection in Digital Humanities. In Proceedings of SEMANTiCS 2017. arXiv:1708.01162
[5] Adam Rabinowitz, Ryan Shaw, Sarah Buchanan, Patrick Golden, and Eric Kansa. 2016. Making sense of the ways we make sense of the past: The PeriodO project. Bulletin of the Institute of Classical Studies 59, 2 (2016), 42–55.
[6] Marko Rodriguez. 2015. The Gremlin Graph Traversal Machine and Language. In Proc. 15th Symposium on Database Programming Languages. 1–10. arXiv:1508.03843
[7] Ryan Shaw. 2013. Information Organization and the Philosophy of History. Journal of the American Society for Information Science and Technology 64, 6 (June 2013), 1092–1103. https://doi.org/10.1002/asi.22843
[8] Ryan Shaw. 2016. Gazetteers Enriched: A Conceptual Basis for Linking Gazetteers with Other Kinds of Information. In Placing Names: Enriching and Integrating Gazetteers. 51–63.
[9] Ryan Shaw, Patrick Golden, and Michael Buckland. 2015. Using Linked Library Data in Working Research Notes. In Linked Data and User Interaction. 48–65.
[10] Hayden White. 2014. The Practical Past. Northwestern University Press.