<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Fang, H.: A discussion of citations from the perspective of the contribution of the
cited paper to the citing paper. JASIST</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data Credit Distribution through Lineage</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>69</volume>
      <issue>12</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Data are a fundamental asset in the current world of research. Data citation is becoming more common and supported by research databases, but it still presents many research challenges. This paper describes Data Credit, a new measure of value for data derived from data citation, that enables us to annotate databases with real values representing their importance. Credit, computed through the citations, can be used alongside them to better understand the importance of data. We introduce the task of Data Credit Distribution, the process by which credit produced by a citation is assigned to the data in a database responsible for producing the output information being cited. We describe how this process can be performed and, through experiments, we show that credit can serve, among other things, to highlight “hotspots” in the database.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Citation</kwd>
        <kwd>Data Credit</kwd>
        <kwd>Data Provenance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        It is widely accepted that citations are the “currency” of the scientific world, a
fundamental means of disseminating knowledge and fostering scientific
development [22]. Scientific databases, “populated and updated with a great deal
of human effort” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], are numerous and at the core of scientific research [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
It is globally accepted that data must be cited and citable [
        <xref ref-type="bibr" rid="ref10 ref7">18, 7, 10</xref>
        ].
      </p>
      <p>
        Data citations should be, among other things, counted alongside traditional
citations and contribute to bibliometric indicators to reward scientific database
curators for their effort [
        <xref ref-type="bibr" rid="ref1">1, 20</xref>
        ]. Data citation is often considered in the current
literature as a driving force to “facilitate giving scholars credit” [19]. One of its
central aspects is how to attribute credit to data creators and curators [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Many
data creators and curators still do not receive any form of reward for their work;
this fosters detrimental phenomena like the “reward dilemma”,
the fear among researchers that sharing their data will cost them their competitive
advantage without proper recognition of their work [14].
      </p>
      <p>
        How to handle and count the credit generated by data citations and how it
contributes to traditional and new bibliometrics are long-standing research
issues [
        <xref ref-type="bibr" rid="ref2">15, 2</xref>
        ]. However, even when correctly applied, data citations and the related
bibliometrics do not always accurately reward data. Indeed, a query often uses
more data than those present in its output result set. The data that are used but
not visualized do not receive a citation, and neither do their contributors.
      </p>
      <p>To overcome this limitation, in recent years, the idea of crediting data emerged
in the academic discussion through the concept of data credit, a positive real
value describing the importance of data in a given context. We argue that credit
can be used to address some of the limitations highlighted above. Credit is not
atomic like a citation. Once computed, it can be divided into portions and
assigned to all the data used by a query. Credit can be used as an annotation set
at different granularity levels within a database to describe the importance of the
annotated data.</p>
      <p>
        In this work, we discuss the problem of data credit distribution, the issuance
of credit generated by some query Q on a relational database instance I to the
data in I responsible for the generation of Q(I). In particular, we discuss how
the distribution is possible in relational databases through lineage, a form of
data provenance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While data citation and credit distribution are not limited
to relational databases, relational databases are a good test bed for this first approach. In
Section 2, we report the related work; Section 3 presents the methods used and
the experiments carried out on a real scientific database, GtoPdb; Section 4
contains the conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Katz in [17] suggests the need for a modified citation system that includes the idea
of transitive and fractional credit. Credit is defined as a “quantity” representing
the importance of a research entity (a paper, software or data) mentioned in a
citation, but these ideas are proposed without any formalism.</p>
      <p>Fang in [13] presents a framework to distribute credit generated by a paper
to its authors and to the papers in its reference list in a transitive way. Each
cited paper’s quantity of credit depends on its impact/role in the citing paper.
This theoretical framework works for a graph composed of only papers, but it
can be extended to another graph model that includes data.</p>
      <p>Zeng et al. in [21] proposed the first method designed to compute credit
within a network of papers citing data. This is the first step towards an automatic
credit computation procedure. However, it is limited to assigning credit to the
whole dataset without considering variable data granularity. Therefore, this is
not a way to assign credit to a single research entity within a dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods and Experiments</title>
      <p>Methods. Data Credit is a non-negative real value representing the importance
of data in a specific context. It can be computed with different strategies and
rationales. In this paper’s context, we consider credit as the product of a data
citation; therefore, it is a quantity representing the importance of the data being
cited in the citing paper. Ideally, the higher the impact of the cited data in the
citing paper, the bigger the credit.</p>
      <p>
        The task of Data Credit Distribution (DCD) consists of dividing this credit
into portions and assigning it to the recipients in a database responsible for
generating the cited data. Formally:
Definition 1. Data Credit Distribution at tuple level (DCD) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
Given a database instance I, a query Q over I and the value $k \in \mathbb{R}_{&gt;0}$, DCD is
defined as the computation of the function $f_{I,Q} : \mathit{TupleLoc} \times \mathbb{R}_{&gt;0} \to \mathbb{R}_{\geq 0}$ such
that $f_{I,Q}(t, k) = h$ where $0 \leq h \leq k$ and $\sum_{t \in \mathit{TupleLoc}} f_{I,Q}(t, k) = k$.
      </p>
      <p>f is the Distribution Strategy (DS); it aims to annotate each tuple in I with
a portion of the credit (thus we speak of DCD at tuple level). Its only
requirement is that it has to be conservative: no credit is generated or lost
during the distribution. A DS can be defined in many different ways, but we
prefer a function that distributes credit coherently with the role of
the input tuples as defined by Q. That is, only tuples that had some role in
generating Q(I) should receive credit.</p>
      <p>
        To do so, we propose one definition of DS that exploits the concept of
lineage [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Given a tuple t ∈ Q(I), its lineage is the set of all and only the tuples
in I that play some role in the generation of t.
      </p>
      <p>
        Definition 2. Lineage-based Distribution Strategy [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
Let I be a database instance, Q a query over I, o ∈ Q(I) an output tuple and k
the credit associated to o. Let L be the lineage of o and t be a generic tuple in I.
t receives a credit equal to:
$$f_{I,Q}(t, k) = \begin{cases} 0 &amp; \text{if } t \notin L \\ \dfrac{k}{|L|} &amp; \text{if } t \in L \end{cases}$$
      </p>
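      <p>As an illustrative sketch of Definition 2, and not the implementation used in the experiments, the following Python function splits the credit k of one output tuple equally among the tuples in its lineage. The function name <monospace>distribute_credit</monospace>, the tuple identifiers and the example lineage are hypothetical; the lineage is assumed to have been computed beforehand and to be non-empty.</p>
      <preformat>
```python
from fractions import Fraction


def distribute_credit(lineage, k):
    """Split credit k equally among the tuples in `lineage`.

    Tuples outside the lineage are absent from the result,
    which corresponds to a credit of 0 in Definition 2.
    """
    members = set(lineage)
    if not members:
        # Assumption of this sketch: every output tuple has a lineage.
        raise ValueError("the lineage of an output tuple cannot be empty")
    share = Fraction(k) / len(members)
    return {t: share for t in members}


# Hypothetical lineage of one output tuple: three input tuples.
credit = distribute_credit({"family:1", "family:7", "receptor:42"}, 1)

# The strategy is conservative: the shares add up to k.
total = sum(credit.values())
```
      </preformat>
      <p>Exact fractions are used here so that the conservativeness requirement can be checked without floating-point rounding; this is a design choice of the sketch only.</p>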
      <p>As we see, this DS rewards all the tuples in the lineage of an output tuple
equally. To perform the whole distribution on Q(I), it suffices to apply this DS
to each tuple o ∈ Q(I).
      </p>
      <p>
webpages of GtoPdb report the URL of the referenced page. It is possible from
these URLs to reverse-engineer the SQL queries that compute the data contained
in the webpages. A webpage is composed of different parts, each created
with data extracted from GtoPdb through SQL queries. We use these queries
to perform DCD. We focused only on queries referring to the so-called target
families<sup>1</sup>.</p>
      <p>Without any loss of generality, we assumed that each tuple present in the
output of these queries carries credit equal to 1, and we performed credit
distribution through lineage using the queries that we inferred from the BJP
papers. We used the ∼900 BJP papers citing [16] as of October 2020, and we
extracted from them more than 1200 SQL queries on families of receptors.</p>
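      <p>The setup described above can be sketched as follows, assuming the lineage of every output tuple is already available. Each output tuple carries credit 1, which is split equally over its lineage and accumulated per input tuple across all queries. The helper name <monospace>accumulate_credit</monospace> and the toy lineages are hypothetical and are not taken from GtoPdb.</p>
      <preformat>
```python
from collections import defaultdict
from fractions import Fraction


def accumulate_credit(query_outputs, k=1):
    """Accumulate credit per input tuple over many queries.

    `query_outputs` is a list of queries; each query is a list of
    lineages, one lineage (a set of input tuple ids) per output tuple.
    Every output tuple carries the same credit k.
    """
    totals = defaultdict(Fraction)
    for lineages in query_outputs:
        for lineage in lineages:
            share = Fraction(k) / len(lineage)
            for t in lineage:
                totals[t] += share
    return dict(totals)


# Two hypothetical queries: the first returns two output tuples,
# the second returns one.
queries = [
    [{"f1", "f2"}, {"f1"}],
    [{"f1", "f3"}],
]
totals = accumulate_credit(queries)
# f1 collects 1/2 + 1 + 1/2, while f2 and f3 collect 1/2 each.
```
      </preformat>
      <p>In this toy run three output tuples are produced, so three units of credit are distributed in total, in line with the conservativeness requirement.</p>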
      <p>The results of the distribution on the family table of GtoPdb, which contains
information about the target families, are shown in the heat-map of Figure 1.
Each cell in the map is a tuple, and the intensity of the color represents the
assigned quantity of credit. Interestingly, a few tuples receive almost all the credit,
following a Pareto distribution. This shows how credit distribution can highlight
“hotspots”, elements in the database that receive high values of credit. These
are tuples that are used frequently by queries. Interestingly, these may also be
tuples that are used but not visualized in the final output. This means that
credit makes it possible to reward parts of the database that are used but not visualized,
overcoming a limitation of traditional citations.</p>
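      <p>One simple way to quantify how concentrated the credit is, offered here only as a sketch and not as the analysis behind Figure 1, is to compute the share of the total credit held by the highest-ranked tuples. The function name <monospace>top_share</monospace> and the credit values below are hypothetical.</p>
      <preformat>
```python
def top_share(credit_by_tuple, top_n):
    """Fraction of the total credit held by the `top_n` richest tuples."""
    values = sorted(credit_by_tuple.values(), reverse=True)
    total = sum(values)
    if total == 0:
        raise ValueError("no credit has been distributed")
    return sum(values[:top_n]) / total


# Hypothetical credit annotations for five tuples of one table.
credit_by_tuple = {"f1": 40, "f2": 5, "f3": 3, "f4": 1, "f5": 1}

# Share of the total credit held by the single richest tuple.
share = top_share(credit_by_tuple, 1)
```
      </preformat>
      <p>A share close to 1 for a small number of tuples indicates a hotspot-dominated table.</p>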
      <p><sup>1</sup> https://www.guidetopharmacology.org/targets.jsp</p>
      <p>To better see how credit differs from traditional citations, consider Figure
2, which reports two radar plots presenting the top 10 authors citation-wise and
credit-wise. Values are normalized between 0 and 1, and the authors' names were
replaced with numbers for privacy reasons. To compute citations and credit, we proceeded as
follows: each time a query identifies data curated by an author, that author
receives one citation and equally shares the credit assigned to that data with the
other co-authors of that data. As we see from Figure 2.a, the top 10 authors,
citation-wise, do not have the highest values of credit. Similarly, in Figure 2.b,
the authors with the highest values of credit do not have the highest citation
counts.</p>
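      <p>The counting procedure described above can be sketched as follows. The sketch assumes that an author receives one citation per query in which at least one tuple they curated is used, and that the credit of each tuple is split equally among its curators; the text leaves the exact counting unit open, so this is one possible reading. The function name <monospace>author_scores</monospace>, the author labels and all values are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict
from fractions import Fraction


def author_scores(queries, curators):
    """Return (citations, credit) per author.

    `queries` is a list of dicts mapping each tuple used by a query
    to the credit it received; `curators` maps each tuple to the
    list of authors who curated it.
    """
    citations = defaultdict(int)
    credit = defaultdict(Fraction)
    for tuple_credit in queries:
        cited = set()
        for t, c in tuple_credit.items():
            authors = curators[t]
            cited.update(authors)
            for a in authors:
                # Co-authors share the credit of a tuple equally.
                credit[a] += Fraction(c) / len(authors)
        for a in cited:
            # Assumption: one citation per author per query.
            citations[a] += 1
    return dict(citations), dict(credit)


curators = {"f1": ["A", "B"], "f2": ["C"]}
queries = [
    {"f1": Fraction(1, 2)},
    {"f1": Fraction(1, 2)},
    {"f2": 3},
]
citations, credit = author_scores(queries, curators)
```
      </preformat>
      <p>In this toy example, authors A and B are cited twice but collect little credit, while author C is cited once and collects the most, which mirrors the kind of divergence between the two rankings discussed for Figure 2.</p>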
      <p>This shows that credit can reward authors whose data have a high impact in
the research community, i.e., whose data generated the highest quantity of credit,
even if they received fewer citations than other authors. That is, some
citations are “more valuable” than others, credit-wise. Since we assumed that each output tuple
carries credit 1, the queries that return outputs with more tuples also generate
more credit. In more complex scenarios, where different and more sophisticated
techniques may be used to decide how to generate quantities of credit, credit
distribution can help to understand how data and their corresponding authors
impact the scientific environment.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We showed how credit can highlight parts of the database that cover certain
topics instead of others, as defined by queries. Credit and citations are correlated
measures, but credit offers a new perspective to evaluate the impact of both data
and curators. It can highlight parts of the database related to certain query
topics, so-called “hotspots”. It directly rewards the tuples, and corresponding
authors, that contributed to the production of cited data, even those that are not
in the output itself. Moreover, it proportionately rewards data and curators based
on their impact in the context defined by the issued queries. This helps to reward
authors that would otherwise remain unnoticed. In future work, credit can
become the basis for new bibliometrics and applications. One example is data
pricing, that is, determining the price of certain data
in a database based on how much they are used by queries.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the ExaMode project, as part of the
European Union Horizon 2020 program under Grant Agreement no. 825292.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Belter</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          :
          <article-title>Measuring the Value of Research Data: A Citation Analysis of Oceanographic Data Sets</article-title>
          .
          <source>PLoS ONE</source>
          <volume>9</volume>
          (
          <issue>3</issue>
          ),
          <elocation-id>e92590</elocation-id>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Borgman</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Data Citation as a Bibliometric Oxymoron</article-title>
          . In:
          <string-name>
            <surname>Sugimoto</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          (ed.)
          <source>Theories of Informetrics and Scholarly Communication</source>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>116</lpage>
          . De Gruyter Mouton (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Buneman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>How to cite curated databases and how to make them citable</article-title>
          .
          <source>In: 18th International Conference on Scientific and Statistical Database Management, SSDBM</source>
          . pp.
          <fpage>195</fpage>
          -
          <lpage>203</lpage>
          . IEEE Computer Society (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Buneman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vansummeren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Curated Databases</article-title>
          .
          <source>In: Proc. of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2008</year>
          ), https://doi.org/10.1145/1376916.1376918
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buneman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frew</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Why data citation is a computational problem</article-title>
          .
          <source>Commun. ACM</source>
          <volume>59</volume>
          (
          <issue>9</issue>
          ),
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Buneman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrellou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harding</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pawson</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharman</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Why data citation isn't working, and what to do about it</article-title>
          .
          <source>Database J. Biol. Databases Curation</source>
          <volume>2020</volume>
          (
          <year>2020</year>
          ), https://doi.org/10.1093/databa/baaa022
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Callaghan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donegan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirsch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ault</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leadbetter</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowry</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moncoiffé</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harrison</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith-Haddon</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weatherby</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Making Data a First Class Scientific Output: Data Citation and Publication by NERC's Environmental Data Centres</article-title>
          .
          <source>International Journal of Digital Curation</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          (
          <year>2012</year>
          ), http://dx.doi.org/10.2218/ijdc.v7i1.218
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Candela</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castelli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manghi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Data Journals: A Survey</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>66</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1747</fpage>
          -
          <lpage>1762</lpage>
          (
          <year>2015</year>
          ), http://dx.doi.org/10.1002/asi.23358
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiticariu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Provenance in databases: Why, how, and where</article-title>
          .
          <source>Foundations and Trends in Databases</source>
          <volume>1</volume>
          (
          <issue>4</issue>
          ),
          <fpage>379</fpage>
          -
          <lpage>474</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <collab>CODATA-ICSTI Task Group on Data Citation Standards and Practices</collab>
          :
          <article-title>Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data</article-title>
          , vol.
          <volume>12</volume>
          (
          <year>September 2013</year>
          ). https://doi.org/10.2481/dsj.OSOM13-043
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiener</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Tracing the lineage of view data in a warehousing environment</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .
          <volume>25</volume>
          (
          <issue>2</issue>
          ),
          <fpage>179</fpage>
          -
          <lpage>227</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silvello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Data credit distribution: A new method to estimate databases impact</article-title>
          .
          <source>Journal of Informetrics</source>
          <volume>14</volume>
          (
          <issue>4</issue>
          ),
          <elocation-id>101080</elocation-id>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>