
Narrations - Using flexible Data Bindings to support the Reproducibility of Claims in Digital Library Objects

Denis Nagel

nagel@ifis.cs.tu-bs.de

Till Afeldt

t.afeldt@tu-bs.de

Wolf-Tilo Balke

balke@ifis.cs.tu-bs.de

Institute for Information Systems, TU Braunschweig, Braunschweig, Germany

Workshop Proceedings

Digital libraries support researchers by providing public access to a vast collection of state-of-the-art literature. The considerable variety of statements, claims, observations and insights that form the narrations of these documents can be used as a valuable groundwork for further research. However, when confronted with these narrations, concerns regarding their reproducibility might arise. Tackling these concerns usually requires a careful analysis of the underlying data sets and a search for similar repositories that support the questioned claims. In short, it is necessary to find repositories whose data narrations match those of the publication. Unfortunately, data analysis and mining are far too often reduced to basic statistical analyses that usually fail to be helpful. In this paper, we propose a novel idea to use structured narratives as a template to discover supporting data narrations, hence reducing the problem of assessing the reproducibility of a publication to a simple matching task between a document and data set. To realize this idea, we outline a novel two-step matching strategy by describing the individual steps along the lines of a pharmaceutical use case. We thereby identify the main open research tasks and discuss problems that need to be solved to develop a full-fledged matching algorithm.


1. Introduction

Digital libraries and their vast collections of documents represent an invaluable source of knowledge. They provide public access to state-of-the-art research across many domains of science. The scientific narrations they provide serve as groundwork for ongoing research that builds upon their claims, insights and observations. However, when working with such documents, concerns about the reproducibility of the encountered narrations might arise [1]. Often these narrations originate from research data that has been collected, for instance, in scientific studies or surveys. In recent years, increasing efforts have been undertaken to make the ever-increasing amount of research data accessible by integrating it into the existing libraries and, in the best case, linking it to the associated documents [2, 3].

Consider a researcher reading a document that is highly relevant to her current work, but she is sceptical about a specific claim made by the authors. Is the claim plausible? Moreover, is the data that supports it broadly representative of her research domain, or is it applicable only to the document’s specific use case? Answering these questions requires retracing the steps taken by the [...]

0000-0002-5443-1215 (W. Balke)

[...] analysis. Our proposal is based on the assumption that the intrinsic relations between individual values inside data sets also form implicitly expressed narrations that we call data narrations. Our core idea is that whenever a data set is schematically suitable to reproduce some scientific narrative, it should provide a data narration that can be successfully aligned to that narrative.

With structured narratives at hand, the problem of assessing reproducibility can thus be reduced to a simple matching approach. Given a structured narrative extracted from a document of interest, in a first step, data sets suitable for the matching can be discovered by considering the available meta-data, i.e., data set descriptions, table headers, or column titles. We then align this meta-data to the events and entities described in the extracted narrative, resulting in a set of possible candidate data sets. The second step looks into the actual data to verify whether the relations between the entities and events of the narrative are also expressed by the data.

The contributions of this paper can be summarized as follows: We propose structured narratives as a tool to assess the reproducibility of a document’s statements, claims and insights. For this, we outline a novel approach to discover data sets fitting the document’s narrations based on a simple two-step matching strategy.

2. Preliminaries

Data Sets

Data sets usually store empirical data gained through experiments, measurements, studies, or surveys. A problem often encountered when working with data sets is very high heterogeneity in structure and schema formatting. For ease of understanding, we thus consider data sets in the scope of this paper to store (mainly numerical) information in a tabular format. We denote the set of all possible data sets that comply with this format by $\mathcal{D}$. Each data set $d = (X, T) \in \mathcal{D}$ consists of a set of variables $X$, with each $x \in X$ representing a single column, and a set of tuples $T$, with each $t \in T$ representing an individual record of the data set that comprises a value $t_x$ for each variable $x \in X$. Figure 1 shows an example for a pharmaceutical data set.

Structured Narratives

For this paper, we define structured narratives according to the definition found in [6]. Therein narratives are defined as directed edge- and node-labeled graphs $N = (V, E)$. The nodes represent events with a temporal component, entities (i.e., real-world objects and concepts) and literals, such as numerical values and lexicographical strings. In contrast, the edges represent the relations between them, which could be the participation of some entity in an event, a causal or temporal relationship between events, or simple facts. As such, the set of nodes is defined as $V \subseteq \mathcal{E} \cup \mathcal{L} \cup \Gamma$, with $\mathcal{E}$ being the set of entities, $\mathcal{L}$ being the set of literals, and $\Gamma$ being the set of events. Then the set of edges is defined as $E \subseteq (\mathcal{E} \cup \Gamma) \times \Sigma \times (\mathcal{E} \cup \Gamma \cup \mathcal{L})$, with $\Sigma$ being the alphabet of available edge labels. Figure 2 shows an excerpt of a structured narrative extracted from [5].

Narrative Bindings and Data Narrations

Narratives can consist of any arbitrary statements and claims without any indication of their plausibility. Hence, in [6] narrative bindings are introduced to connect parts of the narrative to a knowledge repository of any type in the sense of substantiation. Semantically, a narrative binding between a narrative and a repository indicates that the repository supports the statements made throughout the narrative. As such, we can define narrative bindings as follows. Let $N = (V, E)$ be a narrative and $KR$ the set of all knowledge repositories. A narrative binding is a tuple $nb = (e, kr) \in E \times KR$. We say that $e$ is bound to $kr$ by $nb$.

Narratives can be encountered not only in natural languages, such as documents, novels, or human speech, but also behind the intrinsic relations that the individual values of a data set implicitly express. We call this special type of narrative a data narration. Using the notion of narrative bindings, we can define a data narration as follows. Let $\mathcal{D} \subset KR$ be the set of all data sets. A narrative $N = (V, E)$ is called a data narration of a data set $d \in \mathcal{D}$ if there exists a narrative binding $nb = (e, d)$ for every $e \in E$.
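
The definitions of this section can be mirrored in a few lines of Python. The following is only a minimal sketch under our own assumptions: the class names, field names and the helper `is_data_narration` are hypothetical and are not taken from [6]. It illustrates the typing constraint on edges (no edge starts at a literal) and the condition that a data narration requires a binding for every edge.

```python
from dataclasses import dataclass

# All names below are illustrative assumptions, not an API from [6].

@dataclass(frozen=True)
class Node:
    label: str
    kind: str  # one of "entity", "event", "literal"

@dataclass(frozen=True)
class Edge:
    subject: Node
    label: str  # an element of the edge-label alphabet
    obj: Node

    def __post_init__(self):
        # Edges may start at entities or events, but never at literals.
        if self.subject.kind == "literal":
            raise ValueError("an edge must not start at a literal")

@dataclass
class Narrative:
    nodes: set
    edges: set

@dataclass
class DataSet:
    variables: list  # column names
    tuples: list     # records holding one value per variable

def is_data_narration(narrative, data_set, bindings):
    """Return True if every edge of the narrative is bound to data_set.

    bindings is a list of (edge, repository) pairs.
    """
    bound_edges = {edge for edge, repo in bindings if repo is data_set}
    return narrative.edges <= bound_edges
```

A narrative with a single edge is a data narration of a data set only once a binding for that edge exists; with an empty list of bindings the check fails.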

3. Related Work

Reproducibility of scientific results has always been one of the core aspects of good scientific practice. Unfortunately, there are many cases in which it is not or only insufficiently taken into account. As such, terms such as reproducibility crisis can be encountered frequently in the literature. Hence, many authors propose new strategies to improve this situation. Recent examples are [9], where a new system to collect provenance information for data science pipelines is presented, and [10], where the authors developed an integrative platform in the context of the semantic web to capture the provenance information for individual experiments.

In order to tackle the problem of reproducibility, an increasing amount of research data has recently been collected and integrated into digital libraries [3, 11, 12]. At the same time, it is often difficult to find the links connecting a publication to the underlying data. In [3] the authors thus introduced a specialized digital library that offers integrated access to both the documents and their associated research data. Contrary to such strategies, our approach aims at finding arbitrary research data suitable to reproduce the claims from some document, even if it was not the source for these statements.

4. Method Description

In this section, we present our envisioned idea of using structured narratives to assess the reproducibility of scientific claims and statements in more detail (for a visualization, see figure 3). We identify the steps required to develop a full-fledged matching algorithm along the lines of a pharmaceutical use case and discuss difficulties and open tasks that need to be solved.

Problem Description

Let us reconsider the PubMed publication [5], in which the narrative described in section 1 can be encountered. For simplicity, we focus on the following excerpt of the narrative, which is visualized in figure 2: The paper claims that cardiovascular diseases (CVDs) are the leading cause of premature mortality worldwide. It lists many risk factors, like tobacco smoking, elevated blood pressure, dyslipidemia and advanced age, among others, that are stated to be associated with CVDs. Our goal is now to assess whether these claims can be reproduced using available open research data sets. We assume that only data sets whose intrinsic data narrations support these claims can be used to reproduce them. The basic idea for our approach is that by translating both the intrinsic messages of the data set and the scientific narration into structured narratives, we can reduce the problem of assessing the reproducibility of the claims to a simple matching problem.

Extraction of Structured Narratives

In order to allow for an automated matching between publications and research data, we need to extract the narrations encountered throughout the document and translate them into a structured narrative. Hence, it is necessary to analyse how each building block of a narrative graph is expressed in natural language. For this, we have to combine multiple disciplines from natural language processing (NLP), such as named entity recognition and event detection for the nodes and relation extraction for the edges between them. By applying a manual extraction on our example, we can identify six biomedical concepts and entities, namely Cardiovascular Diseases, Premature Mortality, as well as the four different risk factors (Tobacco Smoking, Elevated Blood Pressure, Dyslipidemia and Advanced Age). Furthermore, the narration claims that there is a causal dependency between CVDs and premature mortality, as well as a relation between CVDs and each of the risk factors. Hence, for our example a structured narrative as defined in section 2 and denoted by $N = (V, E)$ could look as shown in figure 2. As a manual extraction can often result in a cumbersome process, developing a sophisticated strategy for the automated extraction of narratives is crucial for the practical applicability of our approach and thus an important task for future work. Different approaches, such as [13], are already actively discussed, thus providing valuable groundwork.

Identification of Data Narrations

Discovering data sets that are suitable to match $N$ requires awareness of their intrinsic data narrations. The current state-of-the-art approach to make data narrations visible is the application of techniques for data visualization [14, 15]. However, such techniques can only be applied on top of data analysis, i.e., if the intrinsic relations of a data set are already known, which is a requirement that is rarely met in the realm of open data. Especially in large-scale data sets, finding exactly those narrations associated with the topic of interest is coupled with extensive manual labour and is often not feasible. We can apply a more feasible top-down strategy by relying on the structured narrative as a template for the general data narration.
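
To make the notion of a template concrete, the excerpt from the problem description can be written down as a small set of labeled triples. This is a hand-written sketch rather than the output of an extraction pipeline. The edge label "risk factor of" is the one used in the text, whereas "leading cause of" is a label we assume here for the causal dependency between CVDs and premature mortality.

```python
# Labeled triples (subject, edge label, object) for the excerpt from [5].
# "risk factor of" appears in the text; "leading cause of" is an assumed label.
RISK_FACTORS = [
    "Tobacco Smoking",
    "Elevated Blood Pressure",
    "Dyslipidemia",
    "Advanced Age",
]

edges = {("Cardiovascular Diseases", "leading cause of", "Premature Mortality")}
edges |= {(rf, "risk factor of", "Cardiovascular Diseases") for rf in RISK_FACTORS}

# The node set follows from the edges: every subject and object is a node.
nodes = {s for s, _, _ in edges} | {o for _, _, o in edges}

narrative = (nodes, edges)
print(len(nodes), len(edges))  # prints "6 5"
```

The six nodes correspond to the six biomedical concepts identified by the manual extraction, and each of the five edges is one relation that a candidate data set would have to support.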

First Step: Matching Events and Entities

In the first matching step, we focus on the narrative’s nodes, i.e., the entities and events that partake in the narration. Let us consider the causal relation $e$ = (Elevated Blood Pressure, risk factor of, Cardiovascular Diseases) $\in E$, as well as the data set $d = (X, T) \in \mathcal{D}$ shown in figure 1 [8]. In order to assess whether there exists a successful narrative binding between $e$ and $d$, we have to identify those data values suitable to represent the respective nodes of $e$. We do so by matching the narrative’s nodes to the values of the data both horizontally and vertically.

The horizontal matching for a node $v \in V$ aims at identifying the subset of variables, i.e. $X_v \subseteq X$, that can be semantically associated with $v$. By looking into $d$, we can see that each tuple represents an individual patient and captures different properties, like the blood pressure level in the second column, or observed events, like a diagnosis regarding heart health in the last column. Usually, descriptive meta-information gives insights about the variables of the data set, thus providing valuable hints about the entities and events referred to in the data. By referring to such meta-information, we can infer that only the second and fourth columns contain data values relevant for a binding of $e$. While the process of horizontally restricting the data is relatively straightforward in our use case, this is unfortunately not guaranteed. The high heterogeneity in the structure and formatting of meta-information in open data sets makes this a non-trivial task. For example, additional information about the context of a clinical study could be attached externally as a description of the data set, e.g., the study containing only patients with diagnosed diabetes. Similarly, such information may be directly embedded into the data, e.g., measuring units such as mm Hg could be included as individual variables of the data set. In order to find precise data narrations, it is therefore essential to consider all available meta-information. Finding ways to effectively identify and assign the correct meta-information to the correct data thus remains a challenging research task.

At this point, it has to be noted that for a successful matching between data set and narrative, both components must draw from a shared vocabulary. Primary candidates for such a vocabulary can be found in extensive and, preferably, well-curated ontologies. For many of these ontologies, specialized NLP tools (such as SciSpacy [16] for biomedical terms) exist that allow for complete annotation pipelines. Using these tools on both the narrative’s node labels and the meta-information of the data set yields annotations that we can apply in the matching.

The vertical matching for a node $v \in V$ aims at identifying the subset of tuples, i.e. $T_v \subseteq T$, whose substitutions of the variables in $X_v$ fulfill constraints imposed by additional qualifiers for $v$. When considering the node Elevated Blood Pressure, for example, it becomes apparent that we are only interested in those tuples of the data set that show a sufficiently high value for the respective variable. Although in some cases it might be sufficient to define these qualifiers relative to the actual values, e.g., higher than the average, it will in most cases be inevitable to rely on external domain knowledge.

As the matching result, we receive for each node $v \in V$ the set of semantically associated data values consisting of the substitutions for the variables in $X_v$ of each tuple in $T_v$.

Second Step: Matching Relations

Once we have matched all nodes to their associated data values, we can focus on the relations of the narrative. Essentially, for $N$ to be a feasible data narration for $d$, the relations occurring between the individual nodes of the narrative have to also occur between those values matched to them in the previous step. Here it is important to note that a narrative might express a large variety of different relation types. Identifying these relations requires the application of specialized metrics and strategies for each type. While this might sound difficult to realize, we believe that most narrative relations can be assigned to three categories: correlation, causation, and temporal relations. By being able to make reasoned decisions about whether such a relationship occurs between data set values, we can thus expect to be able to handle most narratives. Finding suitable metrics and strategies for these three types is thus a focus of our ongoing research. Deciding whether two columns correlate with each other is a trivial task that can be solved by applying metrics such as the Pearson correlation coefficient. Causation, on the other hand, is complicated to assess. Here, relying on metrics as deployed in clinical studies, such as the relative risk, that build upon the idea of control groups to analyse the effect some factor has on an observed event might be a promising first step. Additionally, we currently analyse the applicability of more sophisticated rule-based approaches as an easy-to-use strategy for causality assessment.

Interpreting the Matching Results

With the outlined two-step matching strategy, we can now assess the reproducibility of a narration by analysing which claims and insights available open data sets can support, thus giving valuable insights into the narration’s plausibility. It is thereby possible to discover multiple data sets whose data narrations align to a single narrative. In that case, we can assume that this narrative as a whole is plausible across many domains and maybe even generally applicable. On the other hand, it might be possible that certain parts of the narrative rarely match any data, which could indicate that the respective substories of the narrative are very context-dependent. Thus, even if only parts of a narrative result in successful bindings, it is still possible to draw valuable conclusions about individual claims. For some cases, it might even be feasible to allow for some form of data set augmentation, i.e., combining individual bindings against multiple different data sets to support the narration as a whole.
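
The two matching steps can be illustrated on a toy table. The sketch below is not the envisioned matching algorithm but a simplified stand-in under stated assumptions: the records are made up for illustration and are not the data set from figure 1, the hand-written `VOCABULARY` replaces an ontology-based annotation, and the threshold of 140 mm Hg is an assumed placeholder for external domain knowledge. It restricts the data horizontally and vertically for the edge (Elevated Blood Pressure, risk factor of, Cardiovascular Diseases) and then computes the Pearson correlation coefficient and the relative risk mentioned above.

```python
from math import sqrt

# Toy records (hypothetical, NOT the data set from figure 1):
# age, resting blood pressure in mm Hg, cholesterol, heart disease diagnosed (0/1)
COLUMNS = ["age", "resting_bp", "cholesterol", "heart_disease"]
ROWS = [
    (63, 150, 233, 1),
    (41, 120, 204, 0),
    (58, 160, 284, 1),
    (45, 118, 199, 0),
    (67, 145, 229, 1),
    (50, 125, 210, 0),
    (61, 155, 260, 0),
    (39, 130, 220, 1),
]

# Step 1a, horizontal matching: node label -> column.
# A hand-written mapping stands in for a shared, ontology-based vocabulary.
VOCABULARY = {
    "Elevated Blood Pressure": "resting_bp",
    "Cardiovascular Diseases": "heart_disease",
}

def column(name):
    """Return all values of one variable (column) of the toy table."""
    idx = COLUMNS.index(name)
    return [row[idx] for row in ROWS]

# Step 1b, vertical matching: qualifier for "elevated".
# The threshold is an assumption standing in for domain knowledge.
def is_elevated(bp, threshold=140):
    return bp >= threshold

# Step 2, correlation: Pearson correlation coefficient of two columns.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Step 2, causation proxy: risk among exposed divided by risk among unexposed.
# Assumes both groups are non-empty and the unexposed risk is not zero.
def relative_risk(exposed_flags, outcomes):
    exposed = [o for f, o in zip(exposed_flags, outcomes) if f]
    unexposed = [o for f, o in zip(exposed_flags, outcomes) if not f]
    return (sum(exposed) / len(exposed)) / (sum(unexposed) / len(unexposed))

bp = column(VOCABULARY["Elevated Blood Pressure"])
cvd = column(VOCABULARY["Cardiovascular Diseases"])
exposed = [is_elevated(value) for value in bp]

r = pearson(bp, cvd)
rr = relative_risk(exposed, cvd)
print(f"Pearson r = {r:.2f}, relative risk = {rr:.1f}")
```

On this toy table, three of the four patients with elevated blood pressure and one of the four remaining patients carry a diagnosis, so the relative risk evaluates to 3.0. Whether such a value suffices to accept the edge as bound is a design decision that the sketch deliberately leaves open.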

5. Conclusion and Future Work

Assessing whether claims and statements encountered in scientific publications are representative and thus reproducible in various use cases usually requires a thorough and careful analysis of the underlying data. For this, it is necessary to identify the data narrations formed by the intrinsic relations inside the data sets. In this paper, we outline an approach that relies on translating the claims of scientific publications into structured narratives that form valuable templates for the discovery of additional data sets that can support these claims. For this, we propose a novel two-step matching strategy. As a first step, we rely heavily on the meta-information provided with each data set in order to identify those data values that align to the entities and events encountered in the narrative. This first step thus allows us to identify the relevant parts of the data set that we need to analyse in the second step. We then compute a narrative binding between the data set and the narrative. If the intrinsic relations between the individual data values match the relations expressed in the narrative, we consider the data set to be successfully bound to the narrative. In this case, we argue that the data can reproduce the claims encountered in the publication. In the near future, we would like to build upon this paper by developing and describing the complete matching algorithm in detail and evaluate its practicality in a large-scale evaluation of real-world data. For this, we will focus on solving the open research questions discussed throughout this paper.

References

[1] M. Pawlik, T. Hütter, D. Kocher, W. Mann, N. Augsten, A link is not enough – reproducibility of data, Datenbank-Spektrum 19 (2019).

[2] J. Pakstis, H. Calkins, C. Dobrzynski, S. Lamm, L. McNamara, Advancing reproducibility through shared data: Bridging archival and library practice, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019.

[3] D. Hienert, D. Kern, K. Boland, B. Zapilko, P. Mutschke, A digital library for research data and related information in the social sciences, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019.

[4] C. Meghini, V. Bartalesi, D. Metilli, Representing narratives in digital libraries: The narrative ontology, Semantic Web 12 (2021).

[5] S. Zaninovic, I. Nola, Management of measurable variable cardiovascular disease’ risk factors, Current Cardiology Reviews 14 (2018).

[6] H. Kroll, D. Nagel, W.-T. Balke, Modeling narrative structures in logical overlays on top of knowledge repositories, in: International Conference on Conceptual Modeling (ER), Springer, 2020.

[7] R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, V. Froelicher, International application of a new probability algorithm for the diagnosis of coronary artery disease, The American Journal of Cardiology 64 (1989).

[8] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.

[9] L. Rupprecht, J. C. Davis, C. Arnold, Y. Gur, D. Bhagwat, Improving reproducibility of data science pipelines through transparent provenance capture, Proc. VLDB Endow. 13 (2020).

[10] S. Samuel, Integrative data management for reproducibility of microscopy experiments, in: The Semantic Web - 14th International Conference, ESWC, volume 10250, 2017.

[11] F. Limani, A. Latif, K. Tochtermann, Linked publications and research data: Use cases for digital libraries, in: 22nd International Conference on Theory and Practice of Digital Libraries, TPDL, volume 11057, Springer, 2018.

[12] T. Friedrich, A. O. Kempf, Making research data findable in digital libraries: A layered model for user-oriented indexing of survey data, in: IEEE/ACM Joint Conference on Digital Libraries, JCDL, IEEE Computer Society, 2014.

[13] M. N. Hussain, H. A. Rubaye, K. K. Bandeli, N. Agarwal, Stories from blogs: Computational extraction and visualization of narratives, in: Proceedings of Text2Story - Fourth Workshop on Narrative Extraction From Texts, CEUR-WS.org, 2021.

[14] M. T. Rodríguez, S. Nunes, T. Devezas, Telling stories with data visualization, in: Proceedings of the 2015 Workshop on Narrative & Hypertext, 2015.

[15] E. Segel, J. Heer, Narrative visualization: Telling stories with data, IEEE Transactions on Visualization and Computer Graphics 16 (2010).

[16] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, in: Proc. of the 18th BioNLP Workshop and Shared Task, ACL, 2019.