<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Narrations - Using flexible Data Bindings to support the Reproducibility of Claims in Digital Library Objects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Nagel</string-name>
          <email>nagel@ifis.cs.tu-bs.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Till Afeldt</string-name>
          <email>t.afeldt@tu-bs.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-5443-1215</contrib-id>
          <string-name>Wolf-Tilo Balke</string-name>
          <email>balke@ifis.cs.tu-bs.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Information Systems, TU Braunschweig</institution>
          ,
          <addr-line>Braunschweig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proceedings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Digital libraries support researchers by providing public access to a vast collection of state-of-the-art literature. The considerable variety of statements, claims, observations and insights that form the narrations of these documents can be used as a valuable groundwork for further research. However, when confronted with these narrations, concerns regarding their reproducibility might arise. Tackling these concerns usually requires a careful analysis of the underlying data sets and a search for similar repositories that support the questioned claims. In short, it is necessary to find repositories whose data narrations match those of the publication. Unfortunately, data analysis and mining are far too often reduced to basic statistical analyses that usually fail to be helpful. In this paper, we propose a novel idea to use structured narratives as a template to discover supporting data narrations, hence reducing the problem of assessing the reproducibility of a publication to a simple matching task between a document and data set. To realize this idea, we outline a novel two-step matching strategy by describing the individual steps along the lines of a pharmaceutical use case. We thereby identify the main open research tasks and discuss problems that need to be solved to develop a full-fledged matching algorithm.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital Library</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Digital libraries and their vast collections of documents represent an invaluable source of knowledge. They provide public access to state-of-the-art research across many domains of science. The scientific narrations provided by these documents support ongoing research that builds upon their claims, insights and observations. However, when working with such documents, concerns about the reproducibility of the encountered narrations might arise [1]. Often these narrations originate from research data that has been collected through scientific studies or surveys. In recent years, increasing efforts have been undertaken to make the ever-increasing number of research data sets available by integrating them into the existing libraries and, in the best case, linking them to their associated documents [2, 3].</p>
      <p>Consider a researcher reading a document that has
high relevance for her current work, but she is sceptical
about a specific claim made by the authors. Is the claim
plausible? Moreover, is the data that supports it broadly
representative for her research domain, or is it applicable
only to the document’s specific use case? Answering
these questions requires retracing the steps taken by the authors.</p>
      <p>Our proposal is based on the assumption that the intrinsic relations between individual values inside data sets also form implicitly expressed narrations that we call data narrations. Our core idea is that whenever a data set is schematically suitable to reproduce some scientific narrative, it should provide a data narration that can be successfully aligned to that narrative.</p>
      <p>With structured narratives at hand, the problem of assessing reproducibility can thus be reduced to a simple matching approach. Given a structured narrative extracted from a document of interest, in a first step, data sets suitable for the matching can be discovered by considering the available meta-data, i.e., data set descriptions, table headers, or column titles. We then align this meta-data to the events and entities described in the extracted narrative, resulting in a set of possible candidate data sets. The second step looks into the actual data to verify whether the relations between the entities and events of the narrative are also expressed by the data.</p>
      <p>The contributions of this paper can be summarized as follows: We propose structured narratives as a tool to assess the reproducibility of a document’s statements, claims and insights. For this, we outline a novel approach to discover data sets fitting the document’s narrations based on a simple two-step matching strategy.</p>
    </sec>
    <sec id="sec-preliminaries">
      <title>2. Preliminaries</title>
      <sec id="sec-preliminaries-1">
        <title>Data Sets</title>
        <p>Data sets usually store empirical data gained through experiments, measurements, studies, or surveys. A problem often encountered when working with data sets is the very high heterogeneity in structure and schema formatting. For ease of understanding, we thus consider data sets in the scope of this paper to store (mainly numerical) information in a tabular format. We denote the set of all possible data sets that comply with this format by $D$. Each data set $ds = (Var, T) \in D$ consists of a set of variables $Var$, with each $var \in Var$ representing a single column, and a set of tuples $T$, each $t \in T$ representing an individual record of the data set that comprises a value for each variable $var \in Var$. Figure 1 shows an example for a pharmaceutical data set.</p>
      </sec>
      <sec id="sec-preliminaries-2">
        <title>Structured Narratives</title>
        <p>For this paper, we define structured narratives according to the definition found in [6]. Therein, narratives are defined as directed edge- and node-labeled graphs $N = (V, E)$. The nodes represent events with a temporal component, entities (i.e., real-world objects and concepts) and literals, such as numerical values and lexicographical strings. In contrast, the edges represent the relations between them, which could be the participation of some entity in an event, a causal or temporal relationship between events, or simple facts. As such, the set of nodes is defined as $V \subseteq \mathcal{E} \cup \mathcal{L} \cup \Gamma$, with $\mathcal{E}$ being the set of entities, $\mathcal{L}$ being the set of literals, and $\Gamma$ being the set of events. Then the set of edges is defined as $E \subseteq (\mathcal{E} \cup \Gamma) \times \Sigma \times (\mathcal{E} \cup \Gamma \cup \mathcal{L})$, with $\Sigma$ being the alphabet of available edge labels. Figure 2 shows an excerpt of a structured narrative extracted from [5].</p>
      </sec>
      <sec id="sec-preliminaries-3">
        <title>Narrative Bindings and Data Narrations</title>
        <p>Narratives can consist of any arbitrary statements and claims without any indication of their plausibility. Hence, in [6] narrative bindings are introduced to connect parts of the narrative to a knowledge repository of any type in the sense of substantiation. Semantically, a narrative binding between a narrative and a repository indicates that the repository supports the statements made throughout the narrative. As such, we can define narrative bindings as follows. Let $N = (V, E)$ be a narrative and $KR$ the set of all knowledge repositories. A narrative binding is a tuple $nb = (e, kr) \in E \times KR$. We say that $e$ is bound to $kr$ by $nb$.</p>
        <p>Narratives can be encountered not only in natural language, such as documents, novels, or human speech, but also behind the intrinsic relations that the individual values of a data set implicitly express. We call this special type of narrative a data narration. Using the notion of narrative bindings, we can define a data narration as follows. Let $D \subset KR$ be the set of all data sets. A narrative $N = (V, E)$ is called a data narration of a data set $ds \in D$ if there exists a narrative binding $nb = (e, ds)$ for every $e \in E$.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <p>Reproducibility of scientific results has always been one of the core aspects of good scientific practice. Unfortunately, there are many cases in which it is not or only insufficiently taken into account. As such, terms such as reproducibility crisis can be encountered frequently in the literature. Hence, many authors propose new strategies to improve this situation. Recent examples are [9], where a new system to collect provenance information for data science pipelines is presented, and [10], where the authors developed an integrative platform in the context of the semantic web to capture the provenance information for individual experiments.</p>
      <p>In order to tackle the problem of reproducibility, an increasing amount of research data has recently been collected and integrated into digital libraries [3, 11, 12]. At the same time, it is often difficult to find the links connecting a publication to the underlying data. In [3] the authors thus introduced a specialized digital library that offers integrated access to both the documents and their associated research data. Contrary to such strategies, our approach aims at finding arbitrary research data suitable to reproduce the claims from some document, even if it was not the source for these statements.</p>
    </sec>
    <sec id="sec-method">
      <title>4. Method Description</title>
      <p>In this section, we present our envisioned idea of using structured narratives to assess the reproducibility of scientific claims and statements in more detail (for a visualization, see figure 3). We identify the steps required to develop a full-fledged matching algorithm along the lines of a pharmaceutical use case and discuss difficulties and open tasks that need to be solved.</p>
      <sec id="sec-2-1-1">
        <title>Problem Description</title>
        <p>Let us reconsider the PubMed publication [5], in which the narrative described in section 1 can be encountered. For simplicity, we focus on the following excerpt of the narrative, which is visualized in figure 2: The paper claims that cardiovascular diseases (CVDs) are the leading cause of premature mortality worldwide. It lists many risk factors, like tobacco smoking, elevated blood pressure, dyslipidemia and advanced age, among others, that are stated to be associated with CVDs. Our goal is now to assess whether these claims can be reproduced using available open research data sets. We assume that only data sets whose intrinsic data narrations support these claims can be used to reproduce them. The basic idea for our approach is that by translating both the intrinsic messages of the data set and the scientific narration into structured narratives, we can reduce the problem of assessing the reproducibility of the claims to a simple matching problem.</p>
      </sec>
      <sec id="sec-2-1">
        <title>Extraction of Structured Narratives</title>
        <p>In order to allow for an automated matching between publications and research data, we need to extract the narrations encountered throughout the document and translate them into a structured narrative. Hence, it is necessary to analyse how each building block of a narrative graph is expressed in natural language. For this, we have to combine multiple disciplines from natural language processing (NLP), such as named entity recognition and event detection for the nodes and relation extraction for the edges between them. By applying a manual extraction on our example, we can identify six biomedical concepts and entities, namely Cardiovascular Diseases, Premature Mortality, as well as the four different risk factors (Tobacco Smoking, Elevated Blood Pressure, Dyslipidemia and Advanced Age). Furthermore, the narration claims that there is a causal dependency between CVDs and premature mortality, as well as a relation between CVDs and each of the risk factors. Hence, for our example a structured narrative as defined in section 2 and denoted by $N_p = (V_p, E_p)$ could look as shown in figure 2. As a manual extraction can often result in a cumbersome process, developing a sophisticated strategy for the automated extraction of narratives is crucial for the practical applicability of our approach and thus an important task for future work. Different approaches, such as [13], are already actively discussed, thus providing valuable groundwork.</p>
      </sec>
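      <sec id="sec-sketch-narrative">
        <p>To make the definitions of narratives, narrative bindings and data narrations more tangible, the example narrative could be represented roughly as in the following minimal sketch. It is an illustration only and not the authors' implementation: the class and function names, the edge label "causes", the repository name "heart_disease_ds" and the two lists of bindings are assumptions introduced here. Only the six node labels and the label "risk factor of" are taken from the text.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed, labeled edge (subject, label, object) of a narrative."""
    subject: str
    label: str
    obj: str


@dataclass(frozen=True)
class Narrative:
    """A directed edge- and node-labeled graph N = (V, E)."""
    nodes: frozenset
    edges: frozenset


@dataclass(frozen=True)
class NarrativeBinding:
    """A tuple nb = (e, kr): edge e is bound to knowledge repository kr."""
    edge: Edge
    repository: str


def is_data_narration(narrative, dataset, bindings):
    """Return True if every edge of the narrative is bound to the data set."""
    bound = {nb.edge for nb in bindings if nb.repository == dataset}
    return narrative.edges.issubset(bound)


CVD = "Cardiovascular Diseases"
MORTALITY = "Premature Mortality"
RISK_FACTORS = [
    "Tobacco Smoking",
    "Elevated Blood Pressure",
    "Dyslipidemia",
    "Advanced Age",
]

# "risk factor of" appears in the text; "causes" is an assumed label.
edges = {Edge(CVD, "causes", MORTALITY)}
edges.update(Edge(rf, "risk factor of", CVD) for rf in RISK_FACTORS)
example_narrative = Narrative(
    nodes=frozenset([CVD, MORTALITY, *RISK_FACTORS]),
    edges=frozenset(edges),
)

# Hypothetical matching outcomes against a data set called "heart_disease_ds".
partial = [
    NarrativeBinding(Edge("Elevated Blood Pressure", "risk factor of", CVD),
                     "heart_disease_ds"),
]
complete = [NarrativeBinding(e, "heart_disease_ds")
            for e in example_narrative.edges]

print(is_data_narration(example_narrative, "heart_disease_ds", partial))
print(is_data_narration(example_narrative, "heart_disease_ds", complete))
```
        </preformat>
        <p>With only one bound edge the narrative is not yet a data narration of the data set, whereas it is one as soon as every edge has a binding. Partial results of this kind are still useful, since they show which individual claims a data set supports.</p>
      </sec>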
      <sec id="sec-2-2">
        <title>Identification of Data Narrations</title>
        <p>Discovering data sets that are suitable to match the example narrative $N_p = (V_p, E_p)$ requires awareness of their intrinsic data narrations. The current state-of-the-art approach to make data narrations visible is the application of techniques for data visualization [14, 15]. However, such techniques can only be applied on top of data analysis, i.e., if the intrinsic relations of a data set are already known, which is a requirement that is rarely met in the realm of open data. Especially in large-scale data sets, finding exactly those narrations associated with the topic of interest is coupled with extensive manual labour and is often not feasible. We can apply a more feasible top-down strategy by relying on the structured narrative as a template for the general data narration.</p>
      </sec>
      <sec id="sec-2-3">
        <title>First Step: Matching Events and Entities</title>
        <p>In the first matching step, we focus on the narrative’s nodes, i.e., the entities and events that partake in the narration. Let us consider the causal relation $e = (\textit{Elevated Blood Pressure}, \textit{risk factor of}, \textit{Cardiovascular Diseases}) \in E_p$, as well as the data set $ds = (Var, T) \in D$ shown in figure 1 [8]. In order to assess whether there exists a successful narrative binding between $e$ and $ds$, we have to identify those data values suitable to represent the respective nodes of $e$. We do so by matching the narrative’s nodes to the values of the data both horizontally and vertically.</p>
        <p>The horizontal matching for a node $v \in V_p$ aims at identifying the subset of variables, i.e., $Var_v \subseteq Var$, that can be semantically associated with $v$. By looking into $ds$, we can see that each tuple represents an individual patient and captures different properties, like the blood pressure level in the second column, or observed events, like a diagnosis regarding heart health in the last column. Usually, descriptive meta-information gives insights about the variables of the data set, thus providing valuable hints about the entities and events referred to in the data. By referring to such meta-information, we can infer that only the second and fourth columns contain data values relevant for a binding of $e$. While the process of horizontally restricting the data is relatively straightforward in our use case, this is unfortunately not guaranteed. The high heterogeneity in the structure and formatting of meta-information in open data sets makes this a non-trivial task. For example, additional information about the context of a clinical study could be attached externally as a description of the data set, e.g., the study containing only patients with diagnosed diabetes. Similarly, such information may be directly embedded into the data, e.g., the measuring units such as mm Hg could be included as individual variables of the data set. In order to find precise data narrations, it is therefore essential to consider all available meta-information. Finding ways to effectively identify and assign the correct meta-information to the correct data thus remains a challenging research task.</p>
        <p>At this point, it has to be noted that for a successful matching between data set and narrative, both components must draw from a shared vocabulary. Primary candidates for such a vocabulary can be found in extensive and, preferably, well-curated ontologies. For many of these ontologies, specialized NLP tools (such as ScispaCy [16] for biomedical terms) exist that allow for complete annotation pipelines. Using these tools on both the narrative’s node labels and the meta-information of the data set yields annotations that we can apply in the matching.</p>
        <p>The vertical matching for a node $v \in V_p$ aims at identifying the subset of tuples, i.e., $T_v \subseteq T$, whose substitutions of the variables in $Var_v$ fulfill constraints imposed by additional qualifiers for $v$. When considering the node Elevated Blood Pressure, for example, it becomes apparent that we are only interested in those tuples of the data set that show a sufficiently high value for the respective variable. Although in some cases it might be sufficient to define these qualifiers relative to the actual values, e.g., higher than the average, it will in most cases be inevitable to rely on external domain knowledge.</p>
        <p>As the matching result, we receive for each node $v \in V_p$ the set of semantically associated data values consisting of the substitutions for the variables in $Var_v$ of each tuple in $T_v$.</p>
        <sec id="sec-2-3-1">
          <title>Second Step: Matching Relations</title>
          <p>Once we have matched all nodes to their associated data values, we can focus on the relations of the narrative. Essentially, for $N_p$ to be a feasible data narration for $ds$, the relations occurring between the individual nodes of the narrative have to also occur between those values matched to them in the previous step. Here it is important to note that a narrative might express a large variety of different relation types. Identifying these relations requires the application of specialized metrics and strategies for each type. While this might sound difficult to realize, we believe that most narrative relations can be assigned to three categories: correlation, causation, and temporal relations. By being able to make reasoned decisions about whether such a relationship occurs between data set values, we can thus expect to be able to handle most narratives. Finding suitable metrics and strategies for these three types is thus a focus of our ongoing research. Deciding whether two columns correlate with each other is a trivial task that can be solved by applying metrics such as the Pearson correlation coefficient. Causation, on the other hand, is complicated to assess. Here, relying on metrics as deployed in clinical studies, such as the relative risk, that build upon the idea of control groups to analyse the effect some factor has on an observed event might be a promising first step. Additionally, we currently analyse the applicability of more sophisticated rule-based approaches as an easy-to-use strategy for causality assessment.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Interpreting the Matching Results</title>
        <sec id="sec-2-4-1">
          <p>With the outlined two-step matching strategy, we can now assess the reproducibility of a narration by analysing which claims and insights available open data sets can support, thus giving valuable insights into the narration’s plausibility. It is thereby possible to discover multiple data sets whose data narrations align to a single narrative. In that case, we can assume that this narrative as a whole is plausible across many domains and maybe even generally applicable. On the other hand, it might be possible that certain parts of the narrative rarely match any data, which could indicate that the respective substories of the narrative are very context-dependent. Thus, even if only parts of a narrative result in successful bindings, it is still possible to draw valuable conclusions about individual claims. For some cases, it might even be feasible to allow for some form of data set augmentation, i.e., combining individual bindings against multiple different data sets to support the narration as a whole.</p>
        </sec>
      </sec>
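      <sec id="sec-sketch-matching">
        <p>The two matching steps could be prototyped along the following lines. This is a rough sketch under stated assumptions, not the matching algorithm itself, which the paper leaves to future work. The toy data set, its column names and descriptions, the plain substring matching that stands in for ontology-based annotation, and the threshold of 140 mm Hg that stands in for external domain knowledge are all hypothetical. The Pearson correlation coefficient and the relative risk are the metrics named in the text.</p>
        <preformat>
```python
from math import sqrt

# Hypothetical toy data set ds = (Var, T); every value is made up.
VARIABLES = {
    "age": "age of the patient in years",
    "trestbps": "resting blood pressure in mm Hg",
    "chol": "serum cholesterol in mg/dl",
    "target": "diagnosis of heart disease (1 = present, 0 = absent)",
}
TUPLES = [
    {"age": 63, "trestbps": 150, "chol": 233, "target": 1},
    {"age": 41, "trestbps": 120, "chol": 204, "target": 0},
    {"age": 57, "trestbps": 160, "chol": 290, "target": 1},
    {"age": 45, "trestbps": 118, "chol": 210, "target": 0},
    {"age": 68, "trestbps": 145, "chol": 250, "target": 1},
    {"age": 52, "trestbps": 130, "chol": 199, "target": 0},
    {"age": 60, "trestbps": 155, "chol": 270, "target": 0},
    {"age": 39, "trestbps": 125, "chol": 220, "target": 1},
]


def horizontal_match(term, variables):
    """Step 1a: variables whose description mentions the vocabulary term."""
    return [name for name, text in variables.items()
            if term.lower() in text.lower()]


def vertical_match(tuples, variable, qualifier):
    """Step 1b: tuples whose value for the variable satisfies the qualifier."""
    return [t for t in tuples if qualifier(t[variable])]


def pearson(xs, ys):
    """Step 2 (correlation): Pearson correlation coefficient of two columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def relative_risk(tuples, exposed, outcome):
    """Step 2 (causation hint): outcome risk of exposed vs. unexposed tuples."""
    with_factor = [t for t in tuples if exposed(t)]
    without_factor = [t for t in tuples if not exposed(t)]
    risk_with = sum(1 for t in with_factor if outcome(t)) / len(with_factor)
    risk_without = (sum(1 for t in without_factor if outcome(t))
                    / len(without_factor))
    return risk_with / risk_without


# Assumed qualifier for "Elevated Blood Pressure" (in mm Hg).
ELEVATED = 140

bp_vars = horizontal_match("blood pressure", VARIABLES)
cvd_vars = horizontal_match("heart disease", VARIABLES)
elevated = vertical_match(TUPLES, bp_vars[0], lambda value: value >= ELEVATED)

r = pearson([t[bp_vars[0]] for t in TUPLES],
            [t[cvd_vars[0]] for t in TUPLES])
rr = relative_risk(TUPLES,
                   exposed=lambda t: t[bp_vars[0]] >= ELEVATED,
                   outcome=lambda t: t[cvd_vars[0]] == 1)

print(bp_vars, cvd_vars, len(elevated), round(r, 2), rr)
```
        </preformat>
        <p>In this sketch, a relative risk above 1 together with a positive correlation would count as support for the edge between Elevated Blood Pressure and Cardiovascular Diseases. Where to draw such decision thresholds is one of the open questions discussed above.</p>
      </sec>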
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Work</title>
      <p>Assessing whether claims and statements encountered in scientific publications are representative and thus reproducible in various use cases usually requires a thorough and careful analysis of the underlying data. For this, it is necessary to identify the data narrations formed by the intrinsic relations inside the data sets. In this paper, we outline an approach that relies on translating the claims of scientific publications into structured narratives that form valuable templates for the discovery of additional data sets that can support these claims. For this, we propose a novel two-step matching strategy. As a first step, we rely heavily on the meta-information provided with each data set in order to identify those data values that align to the entities and events encountered in the narrative. This first step thus allows us to identify the relevant parts of the data set that we need to analyse in the second step. We then compute a narrative binding between the data set and the narrative. If the intrinsic relations between the individual data values match the relations expressed in the narrative, we consider the data set to be successfully bound to the narrative. In this case, we argue that the data can reproduce the claims encountered in the publication. In the near future, we would like to build upon this paper by developing and describing the complete matching algorithm in detail and evaluating its practicality in a large-scale evaluation of real-world data. For this, we will focus on solving the open research questions discussed throughout this paper.</p>
      <p>[1] M. Pawlik, T. Hütter, D. Kocher, W. Mann, N. Augsten, A link is not enough – reproducibility of data, Datenbank-Spektrum 19 (2019).</p>
      <p>[2] J. Pakstis, H. Calkins, C. Dobrzynski, S. Lamm, L. McNamara, Advancing reproducibility through shared data: Bridging archival and library practice, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019.</p>
      <p>[3] D. Hienert, D. Kern, K. Boland, B. Zapilko, P. Mutschke, A digital library for research data and related information in the social sciences, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019.</p>
      <p>[4] C. Meghini, V. Bartalesi, D. Metilli, Representing narratives in digital libraries: The narrative ontology, Semantic Web 12 (2021).</p>
      <p>[5] S. Zaninovic, I. Nola, Management of measurable variable cardiovascular disease’ risk factors, Current Cardiology Reviews 14 (2018).</p>
      <p>[6] H. Kroll, D. Nagel, W.-T. Balke, Modeling narrative structures in logical overlays on top of knowledge repositories, in: International Conference on Conceptual Modeling (ER), Springer, 2020.</p>
      <p>[7] R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, V. Froelicher, International application of a new probability algorithm for the diagnosis of coronary artery disease, The American Journal of Cardiology 64 (1989).</p>
      <p>[8] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.</p>
      <p>[9] L. Rupprecht, J. C. Davis, C. Arnold, Y. Gur, D. Bhagwat, Improving reproducibility of data science pipelines through transparent provenance capture, Proc. VLDB Endow. 13 (2020).</p>
      <p>[10] S. Samuel, Integrative data management for reproducibility of microscopy experiments, in: The Semantic Web - 14th International Conference, ESWC, volume 10250, 2017.</p>
      <p>[11] F. Limani, A. Latif, K. Tochtermann, Linked publications and research data: Use cases for digital libraries, in: 22nd International Conference on Theory and Practice of Digital Libraries, TPDL, volume 11057, Springer, 2018.</p>
      <p>[12] T. Friedrich, A. O. Kempf, Making research data findable in digital libraries: A layered model for user-oriented indexing of survey data, in: IEEE/ACM Joint Conference on Digital Libraries, JCDL, IEEE Computer Society, 2014.</p>
      <p>[13] M. N. Hussain, H. A. Rubaye, K. K. Bandeli, N. Agarwal, Stories from blogs: Computational extraction and visualization of narratives, in: Proceedings of Text2Story - Fourth Workshop on Narrative Extraction From Texts, CEUR-WS.org, 2021.</p>
      <p>[14] M. T. Rodríguez, S. Nunes, T. Devezas, Telling stories with data visualization, in: Proceedings of the 2015 Workshop on Narrative &amp; Hypertext, 2015.</p>
      <p>[15] E. Segel, J. Heer, Narrative visualization: Telling stories with data, IEEE Transactions on Visualization and Computer Graphics 16 (2010).</p>
      <p>[16] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, in: Proc. of the 18th BioNLP Workshop and Shared Task, ACL, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>