Data Credit Distribution through Lineage ⋆

Dennis Dosso (dosso@dei.unipd.it), Department of Information Engineering, University of Padua
Gianmaria Silvello (silvello@dei.unipd.it), Department of Information Engineering, University of Padua

Keywords: Data Citation, Data Credit, Data Provenance

Data are a fundamental asset in the current world of research. Data citation is becoming more common and supported by research databases, but it still presents many research challenges. This paper describes Data Credit, a new measure of value for data derived from data citation, that enables us to annotate databases with real values representing their importance. Credit, computed through the citations, can be used alongside them to better understand the importance of data. We introduce the task of Data Credit Distribution, the process by which credit produced by a citation is assigned to the data in a database responsible for producing the output information being cited. We describe how this process can be performed and, through experiments, we show that credit can serve, among other things, to highlight "hotspots" in the database.

Introduction

It is widely accepted that citations are the "currency" of the scientific world, a fundamental means of disseminating knowledge and fostering scientific development [22]. Scientific databases, "populated and updated with a great deal of human effort" [4], are numerous and at the core of scientific research [5]. It is globally accepted that data must be cited and citable [18,7,10].

Data citations should be, among other things, counted alongside traditional citations and contribute to bibliometric indicators to reward scientific database curators for their effort [1,20]. Data citation is often considered in the current literature as a driving force to "facilitate giving scholars credit" [19]. One of its central aspects is how to attribute credit to data creators and curators [6]. Many data creators and curators still do not receive any form of reward for their work; this fosters detrimental phenomena such as the "reward dilemma": researchers' fear of sharing their data and thereby losing their competitive advantage without proper recognition of their work [14].

How to handle and count the credit generated by data citations and how it contributes to traditional and new bibliometrics are long-standing research issues [15,2]. However, even when correctly applied, data citations and the related bibliometrics do not always accurately reward data. Indeed, a query often uses more data than those present in its output result set. The data that are used but not visualized do not receive a citation, nor do their contributors.

To overcome this limitation, in recent years, the idea of crediting data emerged in the academic discussion through the concept of data credit, a real positive value describing the importance of data in a given context. We argue that credit can be used to address some of the limitations highlighted above. Credit is not atomic like a citation. Once computed, it can be divided into portions and assigned to all the data used by a query. Credit can also serve as an annotation, set at different granularity levels within a database, that describes the importance of the annotated data.

In this work, we discuss the problem of data credit distribution, the issuance of credit generated by some query Q on a relational database instance I to the data in I responsible for the generation of Q(I). In particular, we discuss how the distribution is possible in relational databases through lineage, a form of data provenance [9]. While data citation and credit distribution are not limited to relational databases, they are a good test bed for this first approach. In Section 2, we report the related work; Section 3 presents the methods used and the experimental results obtained on a real scientific database, GtoPdb; Section 4 contains the conclusions.

Related Work

Katz in [17] suggests the need for a modified citation system that includes the idea of transitive and fractional credit. Credit is defined as a "quantity" representing the importance of a research entity (a paper, software or data) mentioned in a citation, but these ideas are proposed without any formalism.

Fang in [13] presents a framework to distribute credit generated by a paper to its authors and to the papers in its reference list in a transitive way. Each cited paper's quantity of credit depends on its impact/role in the citing paper. This theoretical framework works for a graph composed of only papers, but it can be extended to another graph model that includes data.

Zeng et al. in [21] proposed the first method designed to compute credit within a network of papers citing data. This is the first step towards an automatic credit computation procedure. However, it is limited to assigning credit to the whole dataset without considering variable data granularity. Therefore, this is not a way to assign credit to a single research entity within a dataset.

Methods and Experiments

Methods. Data Credit is a non-negative real value representing the importance of data in a specific context. It can be computed with different strategies and rationales. In this paper's context, we consider credit as the product of a data citation; therefore, it is a quantity representing the importance of the data being cited in the citing paper. Ideally, the higher the impact of the cited data in the citing paper, the bigger the credit.

The task of Data Credit Distribution (DCD) consists of dividing this credit into portions and assigning it to the recipients in a database responsible for generating the cited data. Formally:

Definition 1. Data Credit Distribution at tuple level (DCD) [12]. Given a database instance I, a query Q over I, and a value $k \in \mathbb{R}_{>0}$, DCD is defined as the computation of the function

\[ f_{I,Q} : TupleLoc \times \mathbb{R}_{>0} \to \mathbb{R}_{\geq 0} \]

such that $f_{I,Q}(t, k) = h$, where $0 \leq h \leq k$ and

\[ \sum_{t \in TupleLoc} f_{I,Q}(t, k) = k. \]

The function f is the Distribution Strategy (DS). It aims to annotate each tuple in I with a portion of the credit (thus we speak of DCD at tuple level). Its only requirement is that it be conservative: no credit is generated or lost during the distribution. A DS can be defined in many different ways, but we prefer a function that distributes credit coherently with the role of the input tuples as defined by Q. That is, only tuples that had some role in generating Q(I) should receive credit.
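The constraints of Definition 1 can be checked mechanically on any candidate distribution. The following is a minimal sketch, not part of the original method: the function name `is_valid_distribution`, the dictionary representation of the shares, and the toy tuple identifiers are all assumptions made for illustration. Exact fractions are used so that the conservativeness test is not affected by floating-point rounding.

```python
from fractions import Fraction


def is_valid_distribution(shares, k):
    """Check the constraints of Definition 1 on one distribution.

    shares: mapping from a (hypothetical) tuple identifier to the credit h
            that the tuple receives.
    k: total credit to distribute; must be strictly positive.
    """
    if k <= 0:
        raise ValueError("k must be strictly positive")
    # Every share h must satisfy 0 <= h <= k.
    in_range = all(0 <= h <= k for h in shares.values())
    # Conservativeness: the shares add up to exactly k.
    conservative = sum(shares.values(), Fraction(0)) == k
    return in_range and conservative


k = Fraction(1)
# Toy distributions over three made-up tuples.
even_split = {"t1": Fraction(1, 2), "t2": Fraction(1, 2), "t3": Fraction(0)}
lossy_split = {"t1": Fraction(1, 3), "t2": Fraction(1, 3), "t3": Fraction(0)}

print(is_valid_distribution(even_split, k))
print(is_valid_distribution(lossy_split, k))
```

In this toy case the first distribution satisfies the definition, while the second loses one third of the credit and is therefore rejected.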

To do so, we propose one definition of DS that exploits the concept of lineage [11]. Given a tuple t ∈ Q(I), its lineage is the set of all and only the tuples that have a role, whatever it is, in the generation of t.

Definition 2. Lineage-based Distribution Strategy [12]. Let I be a database instance, Q a query over I, o ∈ Q(I) an output tuple, and k the credit associated with o. Let L be the lineage of o and t a generic tuple in I. Then t receives a credit equal to:

\[ f_{I,Q}(t, k) = \begin{cases} 0 & \text{if } t \notin L \\ \dfrac{k}{|L|} & \text{if } t \in L \end{cases} \]

As we see, this DS equally rewards the tuples of the lineage of a tuple. To perform the whole distribution on Q(I), it is simply necessary to apply this DS to each tuple o ∈ Q(I).
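As an illustration, the lineage-based strategy and its repeated application to every output tuple could be sketched as follows. This is only a sketch under stated assumptions: the function names `lineage_share` and `distribute`, the string identifiers for tuples, and the two toy lineages are hypothetical and do not come from GtoPdb.

```python
from collections import defaultdict
from fractions import Fraction


def lineage_share(t, lineage, k):
    """Credit given to tuple t for one output tuple whose lineage is `lineage`.

    Tuples in the lineage receive k / |L|; all other tuples receive 0.
    """
    return Fraction(k) / len(lineage) if t in lineage else Fraction(0)


def distribute(output_lineages, k=1):
    """Apply the strategy to each output tuple and sum the shares per input tuple.

    output_lineages: mapping from an output tuple to the set of input tuple
                     identifiers forming its lineage.
    k: credit carried by each output tuple.
    """
    credit = defaultdict(Fraction)
    for lineage in output_lineages.values():
        for t in lineage:
            credit[t] += lineage_share(t, lineage, k)
    return dict(credit)


# Two made-up output tuples that share one input tuple in their lineages.
lineages = {
    "o1": {"family:1", "receptor:10"},
    "o2": {"family:1", "receptor:11", "ligand:7"},
}
credit = distribute(lineages)
for t in sorted(credit):
    print(t, credit[t])
```

The shared tuple accumulates a share from each output tuple, and the total credit handed out equals the number of output tuples times k, so nothing is created or lost.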

Evaluation. We considered the IUPHAR/BPS Guide to Pharmacology (GtoPdb) [16], a famous and highly cited medical database containing information about drugs, targets, and ligands. GtoPdb is maintained and curated by a consortium of 512 scientists collaborating with in-house curators, distributed in committees [3].

GtoPdb is relational in nature, and its information is also organized into webpages describing specific diseases, receptors, ligands, and families of these elements.

To gather data citations, we considered papers published in the British Journal of Pharmacology (BJP) that cite [16]. [16] is a recent version of a series of papers that the GtoPdb consortium releases every two years to describe the database and its evolution. It works as a data journal paper that can be cited in place of the whole database [8]. The papers published in BJP that refer to specific webpages of GtoPdb report the URL of the referenced page. From these URLs, it is possible to reverse-engineer the SQL queries that compute the data contained in the webpages. A webpage is composed of different parts, each created with data extracted from GtoPdb through SQL queries. We use these queries to perform DCD. We focused only on queries referring to the so-called target families¹.

Without any loss of generality, we assumed that each tuple present in the output of these queries contains credit equal to 1, and we performed credit distribution through lineage using these queries that we inferred from the BJP papers. We used the ∼900 BJP papers citing [16] as of October 2020, and we extracted from them more than 1200 SQL queries to families of receptors.
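To make concrete how the lineage of a simple select-project-join query can include tuples that never appear in the output, consider the following sketch. It is not the procedure used in the experiments: the two-table schema, the table contents, and the idea of augmenting the query with the keys of the joined tuples are assumptions made for illustration, and the real GtoPdb schema differs.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical toy schema, loosely inspired by families and receptors.
conn.executescript("""
    CREATE TABLE family (family_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE receptor (receptor_id INTEGER PRIMARY KEY,
                           family_id INTEGER REFERENCES family(family_id),
                           name TEXT);
    INSERT INTO family VALUES (1, 'family A'), (2, 'family B');
    INSERT INTO receptor VALUES (10, 1, 'receptor X'), (11, 1, 'receptor Y'),
                                (12, 2, 'receptor Z');
""")

# The user-facing query would return only receptor names. The augmented
# version below also returns the keys of every input tuple that was joined
# to produce each output row.
rows = conn.execute("""
    SELECT r.name, r.receptor_id, f.family_id
    FROM receptor AS r JOIN family AS f ON r.family_id = f.family_id
    WHERE f.name = 'family A'
""").fetchall()

# Group the contributing input tuples by output tuple (set semantics).
lineage = defaultdict(set)
for name, receptor_id, family_id in rows:
    lineage[(name,)].add(("receptor", receptor_id))
    lineage[(name,)].add(("family", family_id))
lineage = dict(lineage)

for output_tuple in sorted(lineage):
    print(output_tuple, sorted(lineage[output_tuple]))
```

In this toy example the family tuple belongs to the lineage of both output tuples even though no column of the family table is shown in the result, so it would receive credit from both.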

The results of the distribution on the family table of GtoPdb, which contains information about the target families, are shown in the heat-map of Figure 1. Each cell in the map is a tuple, and the intensity of the color represents the assigned quantity of credit. Interestingly, a few tuples receive almost all the credit, following a Pareto distribution. This shows how credit distribution can highlight "hotspots", elements in the database that receive high values of credit. These are tuples that are used frequently by queries. Interestingly, these may also be tuples that are used but not visualized in the final output. This means that credit makes it possible to reward parts of the database that are used but not visualized, overcoming a limitation of traditional citations.
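One simple way to quantify how concentrated the credit is consists of measuring the share of total credit held by the most credited tuples. The sketch below is an assumption-laden illustration: the function `top_share` is hypothetical, and the credit values are invented toy numbers, not the values measured on the family table.

```python
def top_share(credit, fraction):
    """Share of the total credit held by the top `fraction` of tuples.

    credit: mapping from a tuple identifier to its accumulated credit.
    fraction: portion of tuples to consider, e.g. 0.2 for the top 20%.
    """
    values = sorted(credit.values(), reverse=True)
    total = sum(values)
    if total == 0:
        return 0.0
    # Always consider at least one tuple.
    n_top = max(1, round(len(values) * fraction))
    return sum(values[:n_top]) / total


# Invented credit values for ten tuples (not experimental data).
toy_credit = {"t1": 40.0, "t2": 30.0, "t3": 10.0, "t4": 5.0, "t5": 5.0,
              "t6": 4.0, "t7": 3.0, "t8": 1.5, "t9": 1.0, "t10": 0.5}

print(top_share(toy_credit, 0.2))
```

A value close to 1 for a small fraction indicates that the credit is concentrated in a few hotspots.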

To better see how credit differs from traditional citations, consider Figure 2. We report two radar plots presenting the top 10 authors citation-wise and credit-wise (values are normalized between 0 and 1, and the authors' names are replaced with numbers for privacy reasons). To compute the citations, we proceeded as follows: each time a query identifies data curated by an author, that author receives one citation and equally shares the credit assigned to that data with the other co-authors of that data. As we see from Figure 2.a, the top 10 authors citation-wise do not have the highest values of credit. Similarly, in Figure 2.b, the authors with the highest values of credit do not have the highest citation counts. This shows that credit can reward authors whose data have a high impact in the research community, i.e., whose data generated the highest quantity of credit, even if they received fewer citations than other authors. That is, specific citations are "more valuable", credit-wise. Since we assumed that each output tuple carries credit 1, the queries that return outputs with more tuples also generate more credit. In more complex scenarios, where different and more sophisticated techniques may be used to decide how to generate quantities of credit, credit distribution can help to understand how data and their corresponding authors impact the scientific environment.
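The counting procedure described above can be sketched in a few lines. The sketch rests on assumptions that the text does not settle: the function name `author_scores` and its input format are hypothetical, the authors and credit values are invented, and an author is assumed to receive at most one citation per query, regardless of how many of their tuples the query touches.

```python
from collections import defaultdict
from fractions import Fraction


def author_scores(curators, per_query_credit):
    """Return (citations, credit) per author.

    curators: mapping from a tuple identifier to the list of its authors.
    per_query_credit: one mapping per query, from a tuple identifier to the
                      credit assigned to that tuple by the query.
    """
    citations = defaultdict(int)
    credit = defaultdict(Fraction)
    for tuple_credit in per_query_credit:
        cited_authors = set()
        for t, c in tuple_credit.items():
            authors = curators.get(t, [])
            for a in authors:
                # Co-authors of the same data split its credit equally.
                credit[a] += Fraction(c) / len(authors)
                cited_authors.add(a)
        # Assumption: at most one citation per author per query.
        for a in cited_authors:
            citations[a] += 1
    return dict(citations), dict(credit)


# Invented curators and per-query credit assignments.
curators = {"t1": ["A", "B"], "t2": ["B"], "t3": ["C"]}
queries = [
    {"t1": Fraction(1, 2), "t2": Fraction(1, 2)},
    {"t1": Fraction(1)},
    {"t3": Fraction(3)},
]
citations, credit = author_scores(curators, queries)
for a in sorted(citations):
    print(a, citations[a], credit[a])
```

In this toy input, author C is cited by a single query but collects the most credit because that query returns more output tuples, which mirrors the divergence between the two rankings.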

Conclusions

We showed how credit can highlight parts of the database that cover certain topics instead of others, as defined by queries. Credit and citations are correlated measures, but credit offers a new perspective to evaluate the impact of both data and curators. It can highlight parts of the database related to certain query topics, so-called "hotspots". It directly rewards the tuples, and corresponding authors, that contributed to the production of cited data, even those that are not in the output itself. Moreover, it proportionately rewards data and curators based on their impact in the context defined by the issued queries. This helps to reward authors who would otherwise remain unnoticed. In future work, credit can become the basis for new bibliometrics and applications. One example is data pricing, that is, the identification of the price of certain data in a database based on how much they are used by queries.

Fig. 1. Heat-map of the distribution of credit to the family table. Each cell represents a tuple in the table.

Fig. 2. Radar plots showing the top-10 authors of GtoPdb by citations (a) and by credit (b), with their normalized values of credit and citations.

¹ https://www.guidetopharmacology.org/targets.jsp

Acknowledgments

This work is partially supported by the ExaMode project, as part of the European Union Horizon 2020 program under Grant Agreement no. 825292.

References

[1] Belter, C.W.: Measuring the Value of Research Data: A Citation Analysis of Oceanographic Data Sets. PLoS ONE 9(3), e92590 (2014)
[2] Borgman, C.L.: Data Citation as a Bibliometric Oxymoron. In: Sugimoto, C.R. (ed.) Theories of Informetrics and Scholarly Communication. De Gruyter Mouton (2016)
[3] Buneman, P.: How to cite curated databases and how to make them citable. In: 18th International Conference on Scientific and Statistical Database Management. IEEE Computer Society (2006)
[4] Buneman, P., Cheney, J., Tan, W.C., Vansummeren, S.: Curated Databases. In: Proc. of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2008) (2008). doi:10.1145/1376916.1376918
[5] Buneman, P., Davidson, S.B., Frew, J.: Why data citation is a computational problem. Commun. ACM 59(9) (2016)
[6] Buneman, P., Christie, G., Davies, J.A., Dimitrellou, R., Harding, S.D., Pawson, A.J., Sharman, J.L., Wu, Y.: Why data citation isn't working, and what to do about it. Database J. Biol. Databases Curation 2020 (2020). doi:10.1093/databa/baaa022
[7] Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., Ault, L., Bell, P., Bowie, R., Leadbetter, A.M., Lowry, R.K., Moncoiffé, G., Harrison, K., Smith-Haddon, B., Weatherby, Wright, D.: Making Data a First Class Scientific Output: Data Citation and Publication by NERC's Environmental Data Centres. International Journal of Digital Curation 7(1) (2012). doi:10.2218/ijdc.v7i1.218
[8] Candela, L., Castelli, D., Manghi, P., Tani, A.: Data Journals: A Survey. Journal of the Association for Information Science and Technology 66(9) (2015). doi:10.1002/asi.23358
[9] Cheney, J., Chiticariu, L., Tan, W.: Provenance in databases: Why, how, and where. Foundations and Trends in Databases 1(4) (2009)
[10] CODATA-ICSTI Task Group on Data Citation Standards and Practices: Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. Data Science Journal 12 (September 2013). doi:10.2481/dsj.OSOM13-043
[11] Cui, Y., Widom, J., Wiener, J.L.: Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst. 25(2) (2000)
[12] Dosso, D., Silvello, G.: Data credit distribution: A new method to estimate databases impact. Journal of Informetrics 14(4), 101080 (2020)
[13] Fang, H.: A discussion of citations from the perspective of the contribution of the cited paper to the citing paper. JASIST 69(12) (2018)
[14] Fienberg, S.E., Martin, M.E., Straf, M.L.: Sharing research data. National Academy Press (1985)
[15] Garfield, E.: Journal impact factor: a brief review. Can. Med. Assoc. J. (1999)
[16] Harding, S.D., Sharman, J.L., Faccenda, E., Southan, C., Pawson, A.J., Ireland, S., Gray, A.J.G., Bruce, L., Alexander, S.P.H., Anderton, S., Bryant, C., Davenport, A.P., Doerig, C., Fabbro, D., Levi-Schaffer, F., Spedding, M., Davies, J.A., NC-IUPHAR: The IUPHAR/BPS guide to PHARMACOLOGY in 2018: updates and expansion to encompass the new guide to IMMUNOPHARMACOLOGY. Nucleic Acids Research 46 (2018)
[17] Katz, D.: Transitive credit as a means to address social and technological concerns stemming from citation and attribution of digital products. Journal of Open Research Software 2(1) (2014)
[18] Lawrence, B., Jones, C., Matthews, B., Pepler, S., Callaghan, S.: Citation and Peer Review of Data: Moving Towards Formal Data Publication. International Journal of Digital Curation 6(2) (2011)
[19] Data Citation Synthesis Group: Joint declaration of data citation principles. Martone, M. (ed.). FORCE11, San Diego CA (2014). doi:10.25490/a97f-egyk. September 2020
[20] Peters, I., Kraker, P., Lex, E., Gumpenberger, C., Gorraiz, J.: Research data explored: An extended analysis of citations and altmetrics. Scientometrics 107(2) (2016)
[21] Zeng, T., Wu, L., Bratt, S., Acuna, D.E.: Assigning credit to scientific datasets using article citation networks. Journal of Informetrics 14(2) (2020)
[22] Zou, C., Peterson, J.B.: Quantifying the scientific output of new researchers using the zp-index. Scientometrics 106(3) (2016)