Data Narrations - Using flexible Data Bindings to support the Reproducibility of Claims in Digital Library Objects

Denis Nagel¹, Till Affeldt¹ and Wolf-Tilo Balke¹
¹ Institute for Information Systems, TU Braunschweig, Braunschweig, Germany

Abstract
Digital libraries support researchers by providing public access to a vast collection of state-of-the-art literature. The considerable variety of statements, claims, observations and insights that form the narrations of these documents can be used as a valuable groundwork for further research. However, when confronted with these narrations, concerns regarding their reproducibility might arise. Tackling these concerns usually requires a careful analysis of the underlying data sets and a search for similar repositories that support the questioned claims. In short, it is necessary to find repositories whose data narrations match those of the publication. Unfortunately, data analysis and mining are far too often reduced to basic statistical analyses that usually fail to be helpful. In this paper, we propose a novel idea to use structured narratives as a template to discover supporting data narrations, hence reducing the problem of assessing the reproducibility of a publication to a simple matching task between a document and a data set. To realize this idea, we outline a novel two-step matching strategy by describing the individual steps along the lines of a pharmaceutical use case. We thereby identify the main open research tasks and discuss problems that need to be solved to develop a full-fledged matching algorithm.

Keywords
Narrative Intelligence, Open Data, Digital Libraries

DISCO'21 - Digital Infrastructures for Scholarly Content Objects at JCDL2021, September 30–October 01, 2021, Online
Email: nagel@ifis.cs.tu-bs.de (D. Nagel); t.affeldt@tu-bs.de (T. Affeldt); balke@ifis.cs.tu-bs.de (W. Balke)
ORCID: 0000-0002-5832-9154 (D. Nagel); 0000-0001-6440-5654 (T. Affeldt); 0000-0002-5443-1215 (W. Balke)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Digital libraries and their vast collections of documents represent an invaluable source of knowledge. They provide public access to state-of-the-art research across many domains of science. The scientific narrations provided through these documents form the groundwork for ongoing research that builds upon their claims, insights and observations. However, when working with such documents, concerns about the reproducibility of the encountered narrations might arise [1]. Often these narrations originate from research data that has been collected throughout extensive experiments, evaluations, scientific studies, or surveys. In recent years, increasing efforts have been undertaken to make the ever-increasing amounts of research data publicly available by integrating them into the existing libraries and, in the best case, linking them to their associated documents [2, 3].

Consider a researcher reading a document that has high relevance for her current work, but she is sceptical about a specific claim made by the authors. Is the claim plausible? Moreover, is the data that supports it broadly representative for her research domain, or is it applicable only to the document's specific use case? Answering these questions requires retracing the steps taken by the original authors by thoroughly analyzing the underlying data sets. Now, essentially two situations can occur. On the one hand, the required data set might not have been published or is challenging to find due to missing references by the authors, i.e., the critical link between document and data set might be unavailable [3]. On the other hand, if the data is readily available, it might still be unrepresentative for the domain of interest. Even if it is representative, a thorough data set analysis can result in a very time-consuming and exhausting process that many researchers might not be willing to undertake.

Recently, scientific narratives have sparked much interest in the scientific discourse, and their application in digital libraries is the topic of an ongoing discussion [4]. Every document can contain several narrations that connect insights and statements to form a coherent story presented to the reader. As an example, consider the following pharmaceutical narrative: Cardiovascular diseases are the leading reason for premature deaths worldwide and are caused by a multitude of risk factors, many of which are avoidable. To avoid premature deaths, it is thus essential to raise awareness of these risk factors. One instance of this narrative can be encountered in [5], where multiple risk factors, like elevated blood pressure, are listed that are claimed to be causal for the occurrence of cardiovascular diseases. Hence, when reading [5], the question might arise whether these claims are plausible.

We believe that, by translating narrations such as our example into structured narratives [6], it is possible to assess the reproducibility of their statements, claims, insights and observations without a costly manual data analysis. Our proposal is based on the assumption that the intrinsic relations between individual values inside data sets also form implicitly expressed narrations that we call data narrations. Our core idea is that whenever a data set is schematically suitable to reproduce some scientific narrative, it should provide a data narration that can be successfully aligned to that narrative.

With structured narratives at hand, the problem of assessing reproducibility can thus be reduced to a simple matching approach. Given a structured narrative extracted from a document of interest, in a first step, data sets suitable for the matching can be discovered by considering the available meta-data, i.e., data set descriptions, table headers, or column titles. We then align this meta-data to the events and entities described in the extracted narrative, resulting in a set of possible candidate data sets. The second step looks into the actual data to verify whether the relations between the entities and events of the narrative are also expressed by the data.

The contributions of this paper can be summarized as follows: We propose structured narratives as a tool to assess the reproducibility of a document's statements, claims and insights. For this, we outline a novel approach to discover data sets fitting the document's narrations based on a simple two-step matching strategy.

2. Preliminaries

Data Sets
Data sets usually store empirical data gained through experiments, measurements, studies, or surveys. A problem often encountered when working with data sets is the very high heterogeneity in structure and schema formatting. For ease of understanding, we thus consider data sets in the scope of this paper to store (mainly numerical) information in a tabular format. We denote the set of all possible data sets that comply with this format by $\mathit{DS}$. Each data set $D_i = (\mathit{Var}, T) \in \mathit{DS}$ consists of a set of variables $\mathit{Var}$, with each $\mathit{var}_j \in \mathit{Var}$ representing a single column, and a set of tuples $T$, with each $t_i \in T$ representing an individual record of the data set that comprises a value $t_{ij}$ for each variable $\mathit{var}_j \in \mathit{Var}$. Figure 1 shows an example of a pharmaceutical data set.

Figure 1: Excerpt from a pharmaceutical data set [7], containing clinical data of 303 patients, collected by the Cleveland Clinic Foundation (publicly available through the UCI machine learning repository [8])

Structured Narratives
For this paper, we define structured narratives according to the definition found in [6]. Therein, narratives are defined as directed edge- and node-labeled graphs $N = (V, R)$. The nodes represent events with a temporal component, entities (i.e., real-world objects and concepts) and literals, such as numerical values and lexicographical strings. The edges, in contrast, represent the relations between them, which could be the participation of some entity in an event, a causal or temporal relationship between events, or simple facts. As such, the set of nodes is defined as $V \subseteq E \cup L \cup \Gamma$ with $E$ being the set of entities, $L$ being the set of literals, and $\Gamma$ being the set of events. Then the set of edges is defined as $R \subseteq (E \cup \Gamma) \times \Sigma \times (E \cup \Gamma \cup L)$, with $\Sigma$ being the alphabet of available edge labels. Figure 2 shows an excerpt of a structured narrative extracted from [5].

Figure 2: Excerpt of a structured narrative extracted from a PubMed publication [5]. The nodes of the narrative graph represent important entities, events and literals, while the labeled edges represent the relations between them

Narrative Bindings and Data Narrations
Narratives can consist of any arbitrary statements and claims without any indication of their plausibility. Hence, in [6] narrative bindings are introduced to connect parts of the narrative to a knowledge repository of any type in the sense of substantiation. Semantically, a narrative binding between a narrative and a repository indicates that the repository supports the statements made throughout the narrative. As such, we can define narrative bindings as follows. Let $N = (V, R)$ be a narrative and $\mathit{KR}$ the set of all knowledge repositories. A narrative binding is a tuple $\mathit{nb} = (r, \mathit{kr}) \in R \times \mathit{KR}$. We say that $r$ is bound to $\mathit{kr}$ by $\mathit{nb}$.

Narratives can be encountered not only in natural language sources, such as documents, novels, or human speech, but also behind the intrinsic relations that the individual values of a data set implicitly express. We call this special type of narrative a data narration. Using the notion of narrative bindings, we can define a data narration as follows. Let $\mathit{DS} \subset \mathit{KR}$ be the set of all data sets. A narrative $N = (V, R)$ is called a data narration of a data set $D \in \mathit{DS}$ iff for every $r \in R$ there exists a narrative binding $\mathit{nb} = (r, D)$.

3. Related Work

Reproducibility of scientific results has always been one of the core aspects of good scientific practice. Unfortunately, there are many cases in which it is not or only insufficiently taken into account. As such, terms such as reproducibility crisis can be encountered frequently in the literature. Hence, many authors propose new strategies to improve this situation. Recent examples are [9], where a new system to collect provenance information for data science pipelines is presented, and [10], where the authors developed an integrative platform in the context of the semantic web to capture the provenance information for individual experiments.

In order to tackle the problem of reproducibility, an increasing amount of research data has recently been collected and integrated into digital libraries [3, 11, 12]. At the same time, it is often difficult to find the links connecting a publication to the underlying data. In [3] the authors thus introduced a specialized digital library that offers integrated access to both the documents and their associated research data. Contrary to such strategies, our approach aims at finding arbitrary research data suitable to reproduce the claims from some document, even if it was not the source for these statements.

4. Method Description

In this section, we present our envisioned idea of using structured narratives to assess the reproducibility of scientific claims and statements in more detail (for a visualization, see figure 3). We identify the steps required to develop a full-fledged matching algorithm along the lines of a pharmaceutical use case and discuss difficulties and open tasks that need to be solved.

Figure 3: Our proposed outline to assess the reproducibility of claims using structured narratives

Problem Description
Let us reconsider the PubMed publication [5], in which the narrative described in section 1 can be encountered. For simplicity, we focus on the following excerpt of the narrative, which is visualized in figure 2: The paper claims that cardiovascular diseases (CVDs) are the leading cause of premature mortality worldwide. It lists many risk factors, like tobacco smoking, elevated blood pressure, dyslipidemia and advanced age, among others, that are stated to be associated with CVDs. Our goal is now to assess whether these claims can be reproduced using available open research data sets. We assume that only data sets whose intrinsic data narrations support these claims can be used to reproduce them. The basic idea for our approach is that by translating both the intrinsic messages of the data set and the scientific narration into structured narratives, we can reduce the problem of assessing the reproducibility of the claims to a simple matching problem.

Extraction of Structured Narratives
In order to allow for an automated matching between publications and research data, we need to extract the narrations encountered throughout the document and translate them into a structured narrative. Hence, it is necessary to analyse how each building block of a narrative graph is expressed in natural language. For this, we have to combine multiple disciplines from natural language processing (NLP), such as named entity recognition and event detection for the nodes and relation extraction for the edges between them. By applying a manual extraction on our example, we can identify six biomedical concepts and entities, namely Cardiovascular Diseases, Premature Mortality, as well as the four different risk factors (Tobacco Smoking, Elevated Blood Pressure, Dyslipidemia and Advanced Age). Furthermore, the narration claims that there is a causal dependency between CVDs and premature mortality, as well as a relation between CVDs and each of the risk factors. Hence, for our example a structured narrative as defined in section 2 and denoted by $N_e = (V_e, R_e)$ could look as shown in figure 2. As a manual extraction can often result in a cumbersome process, developing a sophisticated strategy for the automated extraction of narratives is crucial for the practical applicability of our approach and thus an important task for future work. Different approaches, such as [13], are already actively discussed, thus providing valuable groundwork.

Identification of Data Narrations
Discovering data sets that are suitable to match $N_e$ requires awareness of their intrinsic data narrations. The current state-of-the-art approach to make data narrations visible is the application of techniques for data visualization [14, 15]. However, such techniques can only be applied on top of data analysis, i.e., if the intrinsic relations of a data set are already known, which is a requirement that is rarely met in the realm of open data. Especially in large-scale data sets, finding exactly those narrations associated with the topic of interest is coupled with extensive manual labour and is often not feasible. We can apply a more feasible top-down strategy by relying on the structured narrative as a template for the general data narration.

First Step: Matching Events and Entities
In the first matching step, we focus on the narrative's nodes, i.e., the entities and events that partake in the narration. Let us consider the causal relation $r = (\text{Elevated Blood Pressure}, \text{risk factor of}, \text{Cardiovascular Diseases}) \in R_e$, as well as the data set $D_e = (\mathit{Var}_e, T_e) \in \mathit{DS}$ shown in figure 1 [8]. In order to assess whether there exists a successful narrative binding between $r$ and $D_e$, we have to identify those data values suitable to represent the respective nodes of $r$. We do so by matching the narrative's nodes to the values of the data both horizontally and vertically.

The horizontal matching for a node $v \in V_e$ aims at identifying the subset of variables, i.e., $\mathit{Var}_e^v \subseteq \mathit{Var}_e$, that can be semantically associated with $v$. By looking into $D_e$, we can see that each tuple represents an individual patient and captures different properties, like the blood pressure level in the second column, or observed events, like a diagnosis regarding heart health in the last column. Usually, descriptive meta-information gives insights about the variables of the data set, thus providing valuable hints about the entities and events referred to in the data. By referring to such meta-information, we can infer that only the second and fourth columns contain data values relevant for a binding of $r$. While the process of horizontally restricting the data is relatively straightforward in our use case, this is unfortunately not guaranteed. The high heterogeneity in the structure and formatting of meta-information in open data sets makes this a non-trivial task. For example, additional information about the context of a clinical study could be attached externally as a description of the data set, e.g., the study containing only patients with diagnosed diabetes. Similarly, such information may be directly embedded into the data, e.g., the measuring units such as mmHg could be included as individual variables of the data set. In order to find precise data narrations, it is therefore essential to consider all available meta-information. Finding ways to effectively identify and assign the correct meta-information to the correct data thus remains a challenging research task.

At this point, it has to be noted that for a successful matching between data set and narrative, both components must draw from a shared vocabulary. Primary candidates for such a vocabulary can be found in extensive and, preferably, well-curated ontologies. For many of these ontologies, specialized NLP tools (such as ScispaCy [16] for biomedical terms) exist that allow for complete annotation pipelines. Using these tools on both the narrative's node labels and the meta-information of the data set yields annotations that we can apply in the matching.

The vertical matching for a node $v \in V_e$ aims at identifying the subset of tuples, i.e., $T_e^v \subseteq T_e$, whose substitutions of the variables in $\mathit{Var}_e^v$ fulfill constraints imposed by additional qualifiers for $v$. When considering the node Elevated Blood Pressure, for example, it becomes apparent that we are only interested in those tuples of the data set that show a sufficiently high value for the respective variable. Although in some cases it might be sufficient to define these qualifiers relative to the actual values, e.g., higher than the average, it will in most cases be inevitable to rely on external domain knowledge.

As the matching result, we receive for each node $v \in V_e$ the set of semantically associated data values, consisting of the substitutions for the variables in $\mathit{Var}_e^v$ of each tuple in $T_e^v$.

Second Step: Matching Relations
Once we have matched all nodes to their associated data values, we can focus on the relations of the narrative. Essentially, for $N_e$ to be a feasible data narration for $D_e$, the relations occurring between the individual nodes of the narrative have to also occur between those values matched to them in the previous step. Here it is important to note that a narrative might express a large variety of different relation types. Identifying these relations requires the application of specialized metrics and strategies for each type. While this might sound difficult to realize, we believe that most narrative relations can be assigned to one of three categories: correlation, causation, and temporal relations. By being able to make reasoned decisions about whether such a relationship occurs between data set values, we can thus expect to be able to handle most narratives. Finding suitable metrics and strategies for these three types is thus a focus of our ongoing research. Deciding whether two columns correlate with each other is a comparatively simple task that can be addressed with metrics such as the Pearson correlation coefficient, which measures linear association. Causation, on the other hand, is complicated to assess. Here, relying on metrics as deployed in clinical studies, such as the relative risk, that build upon the idea of control groups to analyse the effect some factor has on an observed event might be a promising first step. Additionally, we currently analyse the applicability of more sophisticated rule-based approaches as an easy-to-use strategy for causality assessment.

Interpreting the Matching Results
With the outlined two-step matching strategy, we can now assess the reproducibility of a narration by analysing which claims and insights available open data sets can support, thus giving valuable insights into the narration's plausibility. It is thereby possible to discover multiple data sets whose data narrations align to a single narrative. In that case, we can assume that this narrative as a whole is plausible across many domains and maybe even generally applicable. On the other hand, it might be possible that certain parts of the narrative rarely match any data, which could indicate that the respective substories of the narrative are very context-dependent. Thus, even if only parts of a narrative result in successful bindings, it is still possible to draw valuable conclusions about individual claims. For some cases, it might even be feasible to allow for some form of data set augmentation, i.e., combining individual bindings against multiple different data sets to support the narration as a whole.

5. Conclusion and Future Work

Assessing whether claims and statements encountered in scientific publications are representative and thus reproducible in various use cases usually requires a thorough and careful analysis of the underlying data. For this, it is necessary to identify the data narrations formed by the intrinsic relations inside the data sets. In this paper, we outline an approach that relies on translating the claims of scientific publications into structured narratives that form valuable templates for the discovery of additional data sets that can support these claims. For this, we propose a novel two-step matching strategy. As a first step, we rely heavily on the meta-information provided with each data set in order to identify those data values that align to the entities and events encountered in the narrative. This first step thus allows us to identify the relevant parts of the data set that we need to analyse in the second step. We then compute a narrative binding between the data set and the narrative. If the intrinsic relations between the individual data values match the relations expressed in the narrative, we consider the data set to be successfully bound to the narrative. In this case, we argue that the data can reproduce the claims encountered in the publication. In the near future, we would like to build upon this paper by developing and describing the complete matching algorithm in detail and evaluating its practicality in a large-scale evaluation of real-world data. For this, we will focus on solving the open research questions discussed throughout this paper.

References

[1] M. Pawlik, T. Hütter, D. Kocher, W. Mann, N. Augsten, A link is not enough – reproducibility of data, Datenbank-Spektrum 19 (2019).
[2] J. Pakstis, H. Calkins, C. Dobrzynski, S. Lamm, L. McNamara, Advancing reproducibility through shared data: Bridging archival and library practice, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019.
[3] D. Hienert, D. Kern, K. Boland, B. Zapilko, P. Mutschke, A digital library for research data and related information in the social sciences, in: 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019.
[4] C. Meghini, V. Bartalesi, D. Metilli, Representing narratives in digital libraries: The narrative ontology, Semantic Web 12 (2021).
[5] S. Zaninovic, I. Nola, Management of measurable variable cardiovascular disease' risk factors, Current Cardiology Reviews 14 (2018).
[6] H. Kroll, D. Nagel, W.-T. Balke, Modeling narrative structures in logical overlays on top of knowledge repositories, in: International Conference on Conceptual Modeling (ER), Springer, 2020.
[7] R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, V. Froelicher, International application of a new probability algorithm for the diagnosis of coronary artery disease, The American Journal of Cardiology 64 (1989).
[8] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[9] L. Rupprecht, J. C. Davis, C. Arnold, Y. Gur, D. Bhagwat, Improving reproducibility of data science pipelines through transparent provenance capture, Proc. VLDB Endow. 13 (2020).
[10] S. Samuel, Integrative data management for reproducibility of microscopy experiments, in: The Semantic Web - 14th International Conference, ESWC, volume 10250, 2017.
[11] F. Limani, A. Latif, K. Tochtermann, Linked publications and research data: Use cases for digital libraries, in: 22nd International Conference on Theory and Practice of Digital Libraries, TPDL, volume 11057, Springer, 2018.
[12] T. Friedrich, A. O. Kempf, Making research data findable in digital libraries: A layered model for user-oriented indexing of survey data, in: IEEE/ACM Joint Conference on Digital Libraries, JCDL, IEEE Computer Society, 2014.
[13] M. N. Hussain, H. A. Rubaye, K. K. Bandeli, N. Agarwal, Stories from blogs: Computational extraction and visualization of narratives, in: Proceedings of Text2Story - Fourth Workshop on Narrative Extraction From Texts, CEUR-WS.org, 2021.
[14] M. T. Rodríguez, S. Nunes, T. Devezas, Telling stories with data visualization, in: Proceedings of the 2015 Workshop on Narrative & Hypertext, 2015.
[15] E. Segel, J. Heer, Narrative visualization: Telling stories with data, IEEE Transactions on Visualization and Computer Graphics 16 (2010).
[16] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, in: Proc. of the 18th BioNLP Workshop and Shared Task, ACL, 2019.
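
To make the definitions of narratives, narrative bindings and data narrations from section 2 more tangible, the following minimal Python sketch encodes them as plain data structures. It is an illustration under stated assumptions, not part of the proposed system: the class names, the data set name `toy_heart_data`, the sample values and the predicate `columns_exist` are hypothetical, and the predicate only checks whether both end nodes of an edge label a column, which is far weaker than the relation matching discussed in section 4.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set, Tuple

# An edge r = (subject, label, object), i.e. one element of the edge set R.
Edge = Tuple[str, str, str]


@dataclass(frozen=True)
class Narrative:
    nodes: FrozenSet[str]   # V
    edges: FrozenSet[Edge]  # R


@dataclass(frozen=True)
class DataSet:
    name: str
    variables: Tuple[str, ...]              # Var, one entry per column
    tuples: Tuple[Tuple[float, ...], ...]   # T, one entry per record


# Hypothetical predicate type: does the data set substantiate a single edge?
Supports = Callable[[Edge, DataSet], bool]


def narrative_bindings(n: Narrative, d: DataSet, supports: Supports) -> Set[Tuple[Edge, str]]:
    """Collect the bindings nb = (r, D) for all edges r that D supports."""
    return {(r, d.name) for r in n.edges if supports(r, d)}


def is_data_narration(n: Narrative, d: DataSet, supports: Supports) -> bool:
    """N is a data narration of D iff every edge r in R is bound to D."""
    return all(supports(r, d) for r in n.edges)


# Two edges of the example narrative; the label "risk factor of" follows section 4.
EDGES = frozenset({
    ("Elevated Blood Pressure", "risk factor of", "Cardiovascular Diseases"),
    ("Tobacco Smoking", "risk factor of", "Cardiovascular Diseases"),
})
N_E = Narrative(nodes=frozenset(x for s, _, o in EDGES for x in (s, o)), edges=EDGES)

# Invented toy data set whose columns are already labeled with narrative nodes.
D_E = DataSet(
    name="toy_heart_data",
    variables=("Elevated Blood Pressure", "Cardiovascular Diseases"),
    tuples=((145.0, 1.0), (120.0, 0.0)),
)


def columns_exist(r: Edge, d: DataSet) -> bool:
    """Placeholder check: both end nodes of the edge must label some column."""
    subject, _label, obj = r
    return subject in d.variables and obj in d.variables


BOUND = narrative_bindings(N_E, D_E, columns_exist)
print(len(BOUND), is_data_narration(N_E, D_E, columns_exist))
```

In this toy setting only the blood pressure edge can be bound, because the data set has no smoking column. The narrative is therefore not a data narration of the data set as a whole, which mirrors the partial bindings discussed under "Interpreting the Matching Results".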
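
The two-step matching strategy of section 4 can likewise be sketched on a toy table. The sketch below is hedged throughout: the column names, the eight records, the annotation dictionary (standing in for ontology-based annotation of meta-information) and the qualifier "at least 140" for Elevated Blood Pressure are all assumptions made for illustration, and the Pearson coefficient and relative risk are simply the two metrics named in the text, not a complete matching algorithm.

```python
from math import sqrt
from statistics import mean

# Hypothetical table in the spirit of figure 1; all values are invented.
VARIABLES = ["age", "resting_blood_pressure", "cholesterol", "heart_disease"]
TUPLES = [
    (63, 145, 233, 1),
    (41, 130, 204, 0),
    (67, 160, 286, 1),
    (37, 120, 250, 0),
    (56, 150, 236, 1),
    (44, 118, 242, 0),
    (62, 138, 294, 1),
    (57, 124, 261, 0),
]

# Assumed annotations of the column meta-information with narrative node labels.
META_ANNOTATIONS = {
    "age": {"Advanced Age"},
    "resting_blood_pressure": {"Elevated Blood Pressure"},
    "cholesterol": {"Dyslipidemia"},
    "heart_disease": {"Cardiovascular Diseases"},
}


def horizontal_match(node):
    """Step 1a: indices of the variables semantically associated with the node."""
    return [j for j, var in enumerate(VARIABLES) if node in META_ANNOTATIONS[var]]


def vertical_match(column, qualifier):
    """Step 1b: indices of the tuples whose value in the column meets the qualifier."""
    return {i for i, t in enumerate(TUPLES) if qualifier(t[column])}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def relative_risk(exposed, outcome, n):
    """Risk of the outcome among exposed tuples divided by the risk among the rest.

    Raises ZeroDivisionError if a group is empty or no unexposed tuple shows the outcome.
    """
    unexposed = set(range(n)) - exposed
    risk_exposed = len(exposed & outcome) / len(exposed)
    risk_unexposed = len(unexposed & outcome) / len(unexposed)
    return risk_exposed / risk_unexposed


# Step 1: match the two nodes of r = (Elevated Blood Pressure, risk factor of, CVDs).
bp_col = horizontal_match("Elevated Blood Pressure")[0]
cvd_col = horizontal_match("Cardiovascular Diseases")[0]
exposed = vertical_match(bp_col, lambda value: value >= 140)  # assumed qualifier
diseased = vertical_match(cvd_col, lambda value: value == 1)

# Step 2: test whether the relation also occurs between the matched values.
r = pearson([t[bp_col] for t in TUPLES], [t[cvd_col] for t in TUPLES])
rr = relative_risk(exposed, diseased, len(TUPLES))
print(f"Pearson r = {r:.2f}, relative risk = {rr:.2f}")
```

A relative risk above 1 together with a positive correlation would, under these assumptions, count as evidence that the toy data set supports the relation. Which thresholds are appropriate for a real binding decision is one of the open questions raised in section 4.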