Data Narrations - Using flexible Data Bindings to support
the Reproducibility of Claims in Digital Library Objects
Denis Nagel, Till Affeldt and Wolf-Tilo Balke
Institute for Information Systems, TU Braunschweig, Braunschweig, Germany


Abstract
Digital libraries support researchers by providing public access to a vast collection of state-of-the-art literature. The considerable variety of statements, claims, observations and insights that form the narrations of these documents can be used as a valuable groundwork for further research. However, when confronted with these narrations, concerns regarding their reproducibility might arise. Tackling these concerns usually requires a careful analysis of the underlying data sets and a search for similar repositories that support the questioned claims. In short, it is necessary to find repositories whose data narrations match those of the publication. Unfortunately, data analysis and mining are far too often reduced to basic statistical analyses that usually fail to be helpful. In this paper, we propose a novel idea to use structured narratives as a template to discover supporting data narrations, hence reducing the problem of assessing the reproducibility of a publication to a simple matching task between a document and data set. To realize this idea, we outline a novel two-step matching strategy by describing the individual steps along the lines of a pharmaceutical use case. We thereby identify the main open research tasks and discuss problems that need to be solved to develop a full-fledged matching algorithm.

Keywords
Narrative Intelligence, Open Data, Digital Libraries


DISCO'21 - Digital Infrastructures for Scholarly Content Objects at JCDL2021, September 30–October 01, 2021, Online
Email: nagel@ifis.cs.tu-bs.de (D. Nagel); t.affeldt@tu-bs.de (T. Affeldt); balke@ifis.cs.tu-bs.de (W. Balke)
ORCID: 0000-0002-5832-9154 (D. Nagel); 0000-0001-6440-5654 (T. Affeldt); 0000-0002-5443-1215 (W. Balke)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Digital libraries and their vast collections of documents represent an invaluable source of knowledge. They provide public access to state-of-the-art research across many domains of science. The scientific narrations provided through these documents form the groundwork for ongoing research that builds upon their claims, insights and observations. However, when working with such documents, concerns about the reproducibility of the encountered narrations might arise [1]. Often these narrations originate from research data that has been collected throughout extensive experiments, evaluations, scientific studies, or surveys. In recent years, increasing efforts have been undertaken to make the ever-increasing amounts of research data publicly available by integrating them into the existing libraries and, in the best case, linking them to their associated documents [2, 3].

Consider a researcher reading a document that has high relevance for her current work, but she is sceptical about a specific claim made by the authors. Is the claim plausible? Moreover, is the data that supports it broadly representative for her research domain, or is it applicable only to the document's specific use case? Answering these questions requires retracing the steps taken by the original authors by thoroughly analysing the underlying data sets. Now, essentially two situations can occur. On the one hand, the required data set might not have been published or might be challenging to find due to missing references by the authors, i.e., the critical link between document and data set might be unavailable [3]. On the other hand, if the data is readily available, it might still be unrepresentative for the domain of interest. Even if it is representative, a thorough data set analysis can be a very time-consuming and exhausting process that many researchers might not be willing to undertake.

Recently, scientific narratives have sparked much interest in the scientific discourse, and their application in digital libraries is the topic of an ongoing discussion [4]. Every document can contain several narrations that connect insights and statements to form a coherent story presented to the reader. For example, consider the following pharmaceutical narrative: "Cardiovascular diseases are the leading reason for premature deaths worldwide and are caused by a multitude of risk factors, many of which are avoidable. To avoid premature deaths, it is thus essential to raise awareness of these risk factors." One instance of this narrative can be encountered in [5], which lists multiple risk factors, like elevated blood pressure, that are claimed to be causal for the occurrence of cardiovascular diseases. Hence, when reading [5], the question might arise whether these claims are plausible.

We believe that, by translating narrations such as our example into structured narratives [6], it is possible to assess the reproducibility of their statements, claims, insights and observations without a costly manual data analysis. Our proposal is based on the assumption that
the intrinsic relations between individual values inside data sets also form implicitly expressed narrations that we call data narrations. Our core idea is that whenever a data set is schematically suitable to reproduce some scientific narrative, it should provide a data narration that can be successfully aligned to that narrative.

Figure 1: Excerpt from a pharmaceutical data set [7], containing clinical data of 303 patients, collected by the Cleveland Clinic Foundation (publicly available through the UCI machine learning repository [8])

Figure 2: Excerpt of a structured narrative extracted from a PubMed publication [5]. The nodes of the narrative graph represent important entities, events and literals, while the labeled edges represent the relations between them

With structured narratives at hand, the problem of assessing reproducibility can thus be reduced to a simple matching approach. Given a structured narrative extracted from a document of interest, in a first step, data sets suitable for the matching can be discovered by considering the available meta-data, i.e., data set descriptions, table headers, or column titles. We then align this meta-data to the events and entities described in the extracted narrative, resulting in a set of possible candidate data sets. The second step looks into the actual data to verify whether the relations between the entities and events of the narrative are also expressed by the data.

The contributions of this paper can be summarized as follows: We propose structured narratives as a tool to assess the reproducibility of a document's statements, claims and insights. For this, we outline a novel approach to discover data sets fitting the document's narrations based on a simple two-step matching strategy.

2. Preliminaries

Data Sets. Data sets usually store empirical data gained through experiments, measurements, studies, or surveys. A problem often encountered when working with data sets is the very high heterogeneity in structure and schema formatting. For ease of understanding, we thus consider data sets in the scope of this paper to store (mainly numerical) information in a tabular format. We denote the set of all possible data sets that comply with this format by 𝐷𝑆. Each data set 𝐷_𝑖 = (𝑉𝑎𝑟, 𝑇) ∈ 𝐷𝑆 consists of a set of variables 𝑉𝑎𝑟, with each 𝑣𝑎𝑟_𝑗 ∈ 𝑉𝑎𝑟 representing a single column, and a set of tuples 𝑇, with each 𝑡_𝑖 ∈ 𝑇 representing an individual record of the data set that comprises a value 𝑑_𝑖𝑗 for each variable 𝑣𝑎𝑟_𝑗 ∈ 𝑉𝑎𝑟. Figure 1 shows an example of a pharmaceutical data set.

Structured Narratives. For this paper, we define structured narratives according to the definition found in [6]. Therein, narratives are defined as directed edge- and node-labeled graphs 𝑁 = (𝑉, 𝑅). The nodes represent events with a temporal component, entities (i.e., real-world objects and concepts) and literals, such as numerical values and lexicographical strings. In contrast, the edges represent the relations between them, which could be the participation of some entity in an event, a causal or temporal relationship between events, or simple facts. As such, the set of nodes is defined as 𝑉 ⊆ 𝐸 ∪ 𝐿 ∪ Γ with 𝐸 being the set of entities, 𝐿 being the set of literals, and Γ being the set of events. Then the set of edges is defined as 𝑅 ⊆ (𝐸 ∪ Γ) × Σ × (𝐸 ∪ Γ ∪ 𝐿), with Σ being the alphabet of available edge labels. Figure 2 shows an excerpt of a structured narrative extracted from [5].

Narrative Bindings and Data Narrations. Narratives can consist of any arbitrary statements and claims without any indication of their plausibility. Hence, in [6] narrative bindings are introduced to connect parts of the narrative to a knowledge repository of any type in the sense of substantiation. Semantically, a narrative binding between a narrative and a repository indicates that the repository supports the statements made throughout the narrative. As such, we can define narrative bindings as follows. Let 𝑁 = (𝑉, 𝑅) be a narrative and KR the set of all knowledge repositories. A narrative binding is a tuple nb = (𝑟, kr) ∈ 𝑅 × KR. We say that 𝑟 is bound to kr by nb.

Narratives can be encountered not only in natural language, such as documents, novels, or human speech, but also behind the intrinsic relations that the individual values of a data set implicitly express. We call this special type of narrative a data narration. Using the notion of narrative bindings, we can define a data narration as follows. Let 𝐷𝑆 ⊂ KR be the set of all data sets. A narrative 𝑁 = (𝑉, 𝑅) is called a data narration of a data set 𝐷 ∈ 𝐷𝑆 iff for every 𝑟 ∈ 𝑅 there exists a narrative binding nb = (𝑟, 𝐷).
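The definitions of this section can be made concrete with a small sketch. The following Python fragment is only an illustration under our own assumptions, not an implementation from [6]: the class names, the function is_data_narration, the edge label "causes", and the toy column names and values are all hypothetical. Bindings are taken as given here, since the definitions do not say how a binding is established.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# An edge r = (subject, label, object) of a narrative graph N = (V, R).
Edge = Tuple[str, str, str]


@dataclass(frozen=True)
class Narrative:
    nodes: FrozenSet[str]   # V: entities, events and literals
    edges: FrozenSet[Edge]  # R: labeled relations between nodes


@dataclass(frozen=True)
class DataSet:
    name: str
    variables: Tuple[str, ...]             # Var: one entry per column
    tuples: Tuple[Tuple[float, ...], ...]  # T: one entry per record


# A narrative binding nb = (r, kr) pairs an edge with a repository.
Binding = Tuple[Edge, DataSet]


def is_data_narration(narrative: Narrative, data_set: DataSet,
                      bindings: Iterable[Binding]) -> bool:
    """True iff every edge r in R has a binding (r, data_set)."""
    bound = {edge for edge, repo in bindings if repo == data_set}
    return narrative.edges <= bound


# Toy example; the values and column names are invented for illustration.
risk = ("Elevated Blood Pressure", "risk factor of", "Cardiovascular Diseases")
death = ("Cardiovascular Diseases", "causes", "Premature Mortality")
cvd = Narrative(
    nodes=frozenset({risk[0], risk[2], death[2]}),
    edges=frozenset({risk, death}),
)
toy = DataSet(
    name="toy clinical table",
    variables=("age", "resting_bp", "diagnosis"),
    tuples=((63.0, 145.0, 1.0), (41.0, 120.0, 0.0)),
)

partial = [(risk, toy)]               # only one of the two edges is bound
full = [(risk, toy), (death, toy)]    # every edge is bound

print(is_data_narration(cvd, toy, partial))
print(is_data_narration(cvd, toy, full))
```

With only the partial bindings, the narrative is not a data narration of the toy data set, because one edge remains unbound. Once every edge has a binding against the same data set, the condition of the definition is met.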
3. Related Work

Reproducibility of scientific results has always been one of the core aspects of good scientific practice. Unfortunately, there are many cases in which it is not or only insufficiently taken into account. As such, terms such as reproducibility crisis can be encountered frequently in the literature. Hence, many authors propose new strategies to improve this situation. Recent examples are [9], where a new system to collect provenance information for data science pipelines is presented, and [10], where the authors developed an integrative platform in the context of the semantic web to capture the provenance information for individual experiments.

In order to tackle the problem of reproducibility, an increasing amount of research data has recently been collected and integrated into digital libraries [3, 11, 12]. At the same time, it is often difficult to find the links connecting a publication to the underlying data. In [3] the authors thus introduced a specialized digital library that offers integrated access to both the documents and their associated research data. Contrary to such strategies, our approach aims at finding arbitrary research data suitable to reproduce the claims from some document, even if it was not the source for these statements.

4. Method Description

In this section, we present our envisioned idea of using structured narratives to assess the reproducibility of scientific claims and statements in more detail (for a visualization, see figure 3). We identify the steps required to develop a full-fledged matching algorithm along the lines of a pharmaceutical use case and discuss difficulties and open tasks that need to be solved.

Figure 3: Our proposed outline to assess the reproducibility of claims using structured narratives

Problem Description. Let us reconsider the PubMed publication [5], in which the narrative described in section 1 can be encountered. For simplicity, we focus on the following excerpt of the narrative, which is visualized in figure 2: The paper claims that cardiovascular diseases (CVDs) are the leading cause of premature mortality worldwide. It lists many risk factors, like tobacco smoking, elevated blood pressure, dyslipidemia and advanced age, among others, that are stated to be associated with CVDs. Our goal is now to assess whether these claims can be reproduced using available open research data sets. We assume that only data sets whose intrinsic data narrations support these claims can be used to reproduce them. The basic idea for our approach is that by translating both the intrinsic messages of the data set and the scientific narration into structured narratives, we can reduce the problem of assessing the reproducibility of the claims to a simple matching problem.

Extraction of Structured Narratives. In order to allow for an automated matching between publications and research data, we need to extract the narrations encountered throughout the document and translate them into a structured narrative. Hence, it is necessary to analyse how each building block of a narrative graph is expressed in natural language. For this, we have to combine multiple techniques from natural language processing (NLP), such as named entity recognition and event detection for the nodes and relation extraction for the edges between them. By applying a manual extraction to our example, we can identify six biomedical concepts and entities, namely Cardiovascular Diseases, Premature Mortality, as well as the four different risk factors (Tobacco Smoking, Elevated Blood Pressure, Dyslipidemia and Advanced Age). Furthermore, the narration claims that there is a causal dependency between CVDs and premature mortality, as well as a relation between CVDs and each of the risk factors. Hence, for our example a structured narrative as defined in section 2 and denoted by 𝑁_𝑒 = (𝑉_𝑒, 𝑅_𝑒) could look as shown in figure 2. As a manual extraction is often cumbersome, developing a sophisticated strategy for the automated extraction of narratives is crucial for the practical applicability of our approach and thus an important task for future work. Different approaches, such as [13], are already actively discussed, thus providing valuable groundwork.

Identification of Data Narrations. Discovering data sets that are suitable to match 𝑁_𝑒 requires awareness of their intrinsic data narrations. The current state-of-the-art approach to make data narrations visible is the application of techniques for data visualization [14, 15]. However, such techniques can only be applied on top of data analysis, i.e., if the intrinsic relations of a data set are already known, which is a requirement that is rarely met in the realm of open data. Especially in large-scale data
sets, finding exactly those narrations associated with the topic of interest is coupled with extensive manual labour and is often not feasible. We can apply a more feasible top-down strategy by relying on the structured narrative as a template for the general data narration.

First Step: Matching Events and Entities. In the first matching step, we focus on the narrative's nodes, i.e., the entities and events that partake in the narration. Let us consider the causal relation 𝑟 = (Elevated Blood Pressure, risk factor of, Cardiovascular Diseases) ∈ 𝑅_𝑒, as well as the data set 𝐷_𝑒 = (𝑉𝑎𝑟_𝑒, 𝑇_𝑒) ∈ 𝐷𝑆 shown in figure 1 [8]. In order to assess whether there exists a successful narrative binding between 𝑟 and 𝐷_𝑒, we have to identify those data values suitable to represent the respective nodes of 𝑟. We do so by matching the narrative's nodes to the values of the data both horizontally and vertically.

The horizontal matching for a node 𝑣 ∈ 𝑉_𝑒 aims at identifying the subset of variables, i.e., 𝑉𝑎𝑟_𝑒^𝑣 ⊆ 𝑉𝑎𝑟_𝑒, that can be semantically associated with 𝑣. By looking into 𝐷_𝑒, we can see that each tuple represents an individual patient and captures different properties, like the blood pressure level in the second column, or observed events, like a diagnosis regarding heart health in the last column. Usually, descriptive meta-information gives insights about the variables of the data set, thus providing valuable hints about the entities and events referred to in the data. By referring to such meta-information, we can infer that only the second and fourth columns contain data values relevant for a binding of 𝑟. While the process of horizontally restricting the data is relatively straightforward in our use case, this is unfortunately not guaranteed. The high heterogeneity in the structure and formatting of meta-information in open data sets makes this a non-trivial task. For example, additional information about the context of a clinical study could be attached externally as a description of the data set, e.g., the study containing only patients with diagnosed diabetes. Similarly, such information may be directly embedded into the data, e.g., measuring units such as mmHg could be included as individual variables of the data set. In order to find precise data narrations, it is therefore essential to consider all available meta-information. Finding ways to effectively identify and assign the correct meta-information to the correct data thus remains a challenging research task.

At this point, it has to be noted that for a successful matching between data set and narrative, both components must draw from a shared vocabulary. Primary candidates for such a vocabulary can be found in extensive and, preferably, well-curated ontologies. For many of these ontologies, specialized NLP tools (such as ScispaCy [16] for biomedical terms) exist that allow for complete annotation pipelines. Using these tools on both the narrative's node labels and the meta-information of the data set yields annotations that we can apply in the matching.

The vertical matching for a node 𝑣 ∈ 𝑉_𝑒 aims at identifying the subset of tuples, i.e., 𝑇_𝑒^𝑣 ⊆ 𝑇_𝑒, whose substitutions of the variables in 𝑉𝑎𝑟_𝑒^𝑣 fulfill constraints imposed by additional qualifiers for 𝑣. When considering the node Elevated Blood Pressure, for example, it becomes apparent that we are only interested in those tuples of the data set that show a sufficiently high value for the respective variable. Although in some cases it might be sufficient to define these qualifiers relative to the actual values, e.g., higher than the average, it will in most cases be inevitable to rely on external domain knowledge.

As the matching result, we receive for each node 𝑣 ∈ 𝑉_𝑒 the set of semantically associated data values consisting of the substitutions for the variables in 𝑉𝑎𝑟_𝑒^𝑣 of each tuple in 𝑇_𝑒^𝑣.

Second Step: Matching Relations. Once we have matched all nodes to their associated data values, we can focus on the relations of the narrative. Essentially, for 𝑁_𝑒 to be a feasible data narration for 𝐷_𝑒, the relations occurring between the individual nodes of the narrative also have to occur between those values matched to them in the previous step. Here it is important to note that a narrative might express a large variety of different relation types. Identifying these relations requires the application of specialized metrics and strategies for each type. While this might sound difficult to realize, we believe that most narrative relations can be assigned to three categories: correlation, causation, and temporal relations. By being able to make reasoned decisions about whether such a relationship occurs between data set values, we can thus expect to be able to handle most narratives. Finding suitable metrics and strategies for these three types is thus a focus of our ongoing research. Deciding whether two columns correlate with each other is a trivial task that can be solved by applying metrics such as the Pearson correlation coefficient. Causation, on the other hand, is complicated to assess. Here, relying on metrics as deployed in clinical studies, such as the relative risk, that build upon the idea of control groups to analyse the effect some factor has on an observed event might be a promising first step. Additionally, we currently analyse the applicability of more sophisticated rule-based approaches as an easy-to-use strategy for causality assessment.

Interpreting the Matching Results. With the outlined two-step matching strategy, we can now assess the reproducibility of a narration by analysing which claims and insights available open data sets can support, thus giving valuable insights into the narration's plausibility. It is thereby possible to discover multiple data sets whose data narrations align to a single narrative. In that case, we can assume that this narrative as a whole is plausible across many domains and maybe even generally applicable. On the other hand, it might be possible that certain parts of the narrative rarely match any data, which could indicate that the respective substories of the narrative are very context-dependent. Thus, even if only parts of a narrative result in successful bindings, it is still possible to draw valuable conclusions about individual claims. For some cases, it might even be feasible to allow for some form of data set augmentation, i.e., combining individual bindings against multiple different data sets to support the narration as a whole.

5. Conclusion and Future Work

Assessing whether claims and statements encountered in scientific publications are representative and thus reproducible in various use cases usually requires a thorough and careful analysis of the underlying data. For this, it is necessary to identify the data narrations formed by the intrinsic relations inside the data sets. In this paper, we outline an approach that relies on translating the claims of scientific publications into structured narratives that form valuable templates for the discovery of additional data sets that can support these claims. For this, we propose a novel two-step matching strategy. As a first step, we rely heavily on the meta-information provided with each data set in order to identify those data values that align to the entities and events encountered in the narrative. This first step thus allows us to identify the relevant parts of the data set that we need to analyse in the second step. We then compute a narrative binding between the data set and the narrative. If the intrinsic relations between the individual data values match the relations expressed in the narrative, we consider the data set to be successfully bound to the narrative. In this case, we argue that the data can reproduce the claims encountered in the publication. In the near future, we would like to build upon this paper by developing and describing the complete matching algorithm in detail and evaluating its practicality in a large-scale evaluation of real-world data. For this, we will focus on solving the open research questions discussed throughout this paper.

References

[1] M. Pawlik, T. Hütter, D. Kocher, W. Mann, N. Augsten, A link is not enough – reproducibility of data, Datenbank-Spektrum 19 (2019).
[2] J. Pakstis, H. Calkins, C. Dobrzynski, S. Lamm, L. McNamara, Advancing reproducibility through shared data: Bridging archival and library practice, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019.
[3] D. Hienert, D. Kern, K. Boland, B. Zapilko, P. Mutschke, A digital library for research data and related information in the social sciences, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019.
[4] C. Meghini, V. Bartalesi, D. Metilli, Representing narratives in digital libraries: The narrative ontology, Semantic Web 12 (2021).
[5] S. Zaninovic, I. Nola, Management of measurable variable cardiovascular disease' risk factors, Current Cardiology Reviews 14 (2018).
[6] H. Kroll, D. Nagel, W.-T. Balke, Modeling narrative structures in logical overlays on top of knowledge repositories, in: International Conference on Conceptual Modeling (ER), Springer, 2020.
[7] R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, V. Froelicher, International application of a new probability algorithm for the diagnosis of coronary artery disease, The American Journal of Cardiology 64 (1989).
[8] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[9] L. Rupprecht, J. C. Davis, C. Arnold, Y. Gur, D. Bhagwat, Improving reproducibility of data science pipelines through transparent provenance capture, Proc. VLDB Endow. 13 (2020).
[10] S. Samuel, Integrative data management for reproducibility of microscopy experiments, in: The Semantic Web - 14th International Conference, ESWC, volume 10250, 2017.
[11] F. Limani, A. Latif, K. Tochtermann, Linked publications and research data: Use cases for digital libraries, in: 22nd International Conference on Theory and Practice of Digital Libraries, TPDL, volume 11057, Springer, 2018.
[12] T. Friedrich, A. O. Kempf, Making research data findable in digital libraries: A layered model for user-oriented indexing of survey data, in: IEEE/ACM Joint Conference on Digital Libraries, JCDL, IEEE Computer Society, 2014.
[13] M. N. Hussain, H. A. Rubaye, K. K. Bandeli, N. Agarwal, Stories from blogs: Computational extraction and visualization of narratives, in: Proceedings of Text2Story - Fourth Workshop on Narrative Extraction From Texts, CEUR-WS.org, 2021.
[14] M. T. Rodríguez, S. Nunes, T. Devezas, Telling stories with data visualization, in: Proceedings of the 2015 Workshop on Narrative & Hypertext, 2015.
[15] E. Segel, J. Heer, Narrative visualization: Telling stories with data, IEEE Transactions on Visualization and Computer Graphics 16 (2010).
[16] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, in: Proc. of the 18th BioNLP Workshop and Shared Task, ACL, 2019.