<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A. Brand, L. Allen, M. Altman, M. Hlava, J. Scott, Beyond authorship: Attribution, contribution, collaboration, and credit, Learned Publishing</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1087/20150211</article-id>
      <title-group>
        <article-title>Will Open Science Change Authorship for Good? Towards a Quantitative Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Mannocci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ornella Irrera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Manghi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR-ISTI - National Research Council, Institute of Information Science and Technologies “Alessandro Faedo”</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>OpenAIRE AMKE</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>28</volume>
      <issue>2015</issue>
      <fpage>24</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Authorship of scientific articles has profoundly changed from early science until now. While a paper was once authored by a handful of authors, scientific collaborations are, on average, much larger nowadays. As authorship (and citation) is essentially the primary reward mechanism in traditional research evaluation frameworks, it has turned out to be a hot-button topic from which a significant portion of academic disputes stems. However, the novel Open Science practices could be an opportunity to disrupt such dynamics and diversify the credit of the different scientific contributors involved in the diverse phases of the lifecycle of the same research effort. In fact, a paper and research data (or software) contextually published could exhibit different authorship to give credit to the various contributors right where it feels most appropriate. We argue that this can be computationally analysed by taking advantage of the wealth of information in modern Open Science Graphs. Such a study can pave the way to a better understanding of the dynamics and patterns of authorship in linked literature, research data and software, and of how they have evolved over the years.</p>
      </abstract>
      <kwd-group>
        <kwd>authorship</kwd>
        <kwd>open science</kwd>
        <kwd>research literature</kwd>
        <kwd>research data</kwd>
        <kwd>data citation</kwd>
        <kwd>scholarly communication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        While, in early science, most of the papers were authored by a handful of scientists, modern
science is characterised by more extensive collaborations, and the average number of authors
per article has increased across many disciplines [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ]. Indeed, in some fields of
science (e.g., High Energy Physics), it is not infrequent to encounter hundreds or thousands of
authors co-participating in the same piece of research [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such intricate collaboration patterns
make it particularly hard to establish a correct relationship between contributor and scientific
contribution, and hence to assign an accurate and fair reward during research evaluation [7, 8]. Thus,
as is widely known, scientific authorship tends to be a hot-button topic in academia, as
roughly one-fifth of academic disputes among authors stem from it [9].
      </p>
      <p>Open Science, however, has the potential to disrupt such traditional mechanisms by injecting
into the “academic market” new kinds of “currency” for credit attribution, merit and impact
assessment [10, 11]. To this end, the new practices of research data (and software) deposition and
citation could be perceived as an opportunity to diversify scientific attribution and eventually
give credit – right where it feels most appropriate – to the different contributors involved in
the diverse phases of the lifecycle within the same research endeavour [7, 12].</p>
      <p>In this extended abstract, we outline the perspective of using modern Open Science Graphs
(OSGs) to analyse whether this is the case and to understand whether the opportunity has already been
seized. By offering extensive metadata descriptions of both literature and research data
records, and of the semantic relations among them, OSGs are conducive to a computational
analysis of this phenomenon and thus to the study of emerging patterns. In particular,
it will be interesting to analyse whether and how the number, composition, and order of the authors
vary when moving from literature to research data and software.</p>
      <p>It would be interesting, for example, to discover that a significantly larger number of people
are involved in the development of software and the construction of datasets than in the
writing of the related publications. This would suggest that the current reward mechanisms are
obsolete and that there is a substantial, submerged workforce contributing to research that risks
being underrepresented and under-evaluated if the current practices do not change for good.</p>
      <p>Furthermore, modifications in the composition (by shuffle or by omission) of the authors
participating in a publication, as opposed to the ones contributing to related research data (or
software), could reveal other aspects worth investigating. While, on the one
hand, such changes in the two author lists could be contingent [13], on the other, it could be
interesting to relate them to the seniority of authors in order to detect patterns revealing a
possible agency behind such a choice. For example, the data could suggest that senior staff members
are less involved, or in any case less interested, in participating in, or getting rewarded for, data
production and software development, thus confirming a bias towards the status quo of research
assessment.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and methods</title>
      <p>In this section, we describe the dataset we intend to use to power our study, and we provide an
overview of the methodology we intend to adhere to as well as the major caveats and challenges.</p>
      <sec>
        <title>2.1. Data</title>
        <p>The study suggested here is possible thanks to the ever-increasing amount of metadata about
research products made available over the last decade. In particular, Open Science Graphs (OSGs) can be a goldmine
to this end. OSGs are Scientific Knowledge Graphs whose intent is to improve the overall
FAIRness of science by enabling open access to graph representations of metadata about the people,
artefacts, and institutions involved in the research lifecycle, as well as the relations between such
entities, to support stakeholder needs such as discovery, reuse, reproducibility, statistics, trends,
monitoring, validation, and impact assessment. The represented information may span entities
such as research artefacts (e.g., publications, data, software, samples, instruments) and items of
their content (e.g., statistical hypothesis tests reported in publications), research organisations,
researchers, services, projects, and funders. OSGs include relationships between such entities
and sometimes formalised (semantic) concepts characterising them, such as machine-readable
concept descriptions for advanced discoverability, interoperability, and reuse [14].</p>
        <p>For this analysis, we intend to adopt the OpenAIRE Research Graph<sup>1</sup> [15] as our dataset of
reference (hereafter, the Graph). The Graph is one of the core services provided by OpenAIRE
AMKE<sup>2</sup>, a not-for-profit legal entity operating an infrastructure that offers global services in
support of Open Science scholarly workflows. The Graph aggregates metadata from 96,514
scholarly sources (as of October 2021), comprising literature, research data and software
repositories, publishers, and scholarly registries, such as ORCID, ROR, re3data, OpenDOAR, Crossref,
and DataCite. It thus provides a longitudinal view of the global science record by delivering an
extensive collection of heterogeneous research products interconnected by the relevant
semantic relations. The semantic relations conducive to this study adhere to the specification given
in the DataCite Schema documentation<sup>3</sup>. They are collected from DataCite<sup>4</sup>, EMBL-EBI, and
Crossref Event Data, and are also derived from the full-text inference algorithms embedded in the
OpenAIRE Graph provision workflow and from the feedback of OpenAIRE portal users.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Methods</title>
        <p>First and foremost, a strategy to select the relevant literature, research data and software
records needs to be devised. To this end, let <italic>P</italic> be a publication and <italic>D</italic> a research dataset (or a
piece of software) contextually produced within the same research effort (e.g., <italic>P</italic> describes a research
effort to conduct a measuring campaign eventually producing the dataset <italic>D</italic>, released contextually
to the publication). In principle, it is possible to select all the <italic>P</italic> ↔ <italic>D</italic> couples by looking at
the semantics of the relations linking literature records to non-literature records within the
OpenAIRE Research Graph. In our case, we plan to use the relation type IsSupplementTo (and its
inverse IsSupplementedBy), which is the DataCite relation type indicating that a dataset <italic>D</italic> is
supplementary material for a publication <italic>P</italic>.</p>
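        <p>The selection step can be sketched as follows. This is only a minimal illustration, assuming that relations are available as (source, relation type, target) triples and that the type of each record is known; the Relation tuple, the select_couples helper and the record identifiers are hypothetical and do not reflect the actual OpenAIRE Research Graph data model or API.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, NamedTuple


class Relation(NamedTuple):
    source: str    # identifier of the record the relation starts from
    rel_type: str  # DataCite relation type, e.g. "IsSupplementTo"
    target: str    # identifier of the record the relation points to


def select_couples(relations: Iterable[Relation],
                   record_type: dict[str, str]) -> set[tuple[str, str]]:
    """Return (publication, dataset-or-software) couples tied by supplement relations."""
    couples = set()
    for rel in relations:
        if rel.rel_type == "IsSupplementTo":
            non_lit, pub = rel.source, rel.target
        elif rel.rel_type == "IsSupplementedBy":
            pub, non_lit = rel.source, rel.target
        else:
            continue  # any other relation type is ignored at this stage
        if (record_type.get(pub) == "publication"
                and record_type.get(non_lit) in {"dataset", "software"}):
            couples.add((pub, non_lit))
    return couples


# Hypothetical toy input: both directions of one link, plus an unrelated citation.
relations = [
    Relation("d1", "IsSupplementTo", "p1"),
    Relation("p1", "IsSupplementedBy", "d1"),
    Relation("p2", "Cites", "d2"),
]
record_type = {"p1": "publication", "p2": "publication",
               "d1": "dataset", "d2": "dataset"}
couples = select_couples(relations, record_type)
```
]]></preformat>
        <p>Storing the couples in a set means that a link expressed in both directions is counted only once.</p>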
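        <p>Once the relevant <italic>P</italic> ↔ <italic>D</italic> couples have been selected, we need to proceed with the analysis of
the author sets and their interpretation. Let <italic>A<sub>P</sub></italic> be the set of authors of a publication <italic>P</italic>, and <italic>A<sub>D</sub></italic>
the set of authors of a connected research dataset (or software) <italic>D</italic>; the scope of the analysis will be
threefold. Firstly, the cardinalities |<italic>A<sub>P</sub></italic>| and |<italic>A<sub>D</sub></italic>| of the two sets can be compared to understand
whether there is any difference in the workforce when moving from literature items to relevant
research data. Secondly, the composition of the two author sets will be considered in order to
analyse the intersection <italic>A<sub>P</sub></italic> ∩ <italic>A<sub>D</sub></italic> and the symmetric difference <italic>A<sub>P</sub></italic> ∆ <italic>A<sub>D</sub></italic> = (<italic>A<sub>P</sub></italic> ∖ <italic>A<sub>D</sub></italic>) ∪ (<italic>A<sub>D</sub></italic> ∖ <italic>A<sub>P</sub></italic>).
Lastly, the ordering of the two author lists will be inspected to understand whether there are
significant changes in the ranks of the same authors across the two lists and, if this is the case,
which patterns are the most frequent.</p>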
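        <p>The threefold comparison can be illustrated with a short sketch. It assumes that authors have already been disambiguated into comparable identifiers and that each list preserves the byline order; the compare_author_lists helper and the toy names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def compare_author_lists(authors_p: list[str], authors_d: list[str]) -> dict:
    """Compare the ordered author lists of a publication and of a linked dataset."""
    set_p, set_d = set(authors_p), set(authors_d)
    shared = set_p & set_d
    # Rank shift of each shared author: position in D minus position in P
    # (negative values mean the author appears earlier in the dataset byline).
    rank_shift = {a: authors_d.index(a) - authors_p.index(a) for a in shared}
    return {
        "size_p": len(set_p),                    # cardinality of the publication set
        "size_d": len(set_d),                    # cardinality of the dataset set
        "intersection": shared,                  # authors credited on both records
        "symmetric_difference": set_p ^ set_d,   # authors credited on one record only
        "rank_shift": rank_shift,
    }


# Hypothetical, already disambiguated author identifiers.
authors_p = ["alice", "bob", "carol"]
authors_d = ["carol", "alice", "dave", "erin"]
report = compare_author_lists(authors_p, authors_d)
```
]]></preformat>
        <p>In this toy case the dataset credits two people who are absent from the paper, drops one author of the paper, and moves the last author of the paper to the first position.</p>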
        <p>Moreover, if the number of available data points supports it, such an analysis can be put in a
time perspective to analyse whether and how the trend has evolved throughout the years, globally
and across different disciplines.</p>
        <p><sup>1</sup> OpenAIRE Research Graph, https://graph.openaire.eu</p>
        <p><sup>2</sup> OpenAIRE, https://www.openaire.eu</p>
        <p><sup>3</sup> DataCite schema, https://schema.datacite.org</p>
        <p><sup>4</sup> DataCite, https://datacite.org</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Caveats</title>
        <p>This section analyses two identified major caveats and related challenges that we have to face in
this analysis. The first one is related to the inherent uncertainty of semantic relations specified
among literature and research data records, while the second is related to the long-standing
challenge of author disambiguation. For each, we outline the strategy we intend to follow to
solve, or at least mitigate, the side effects of the caveats here described.</p>
        <sec id="sec-2-2-1">
          <title>2.3.1. Semantic relation uncertainty</title>
          <p>Given a <italic>P</italic> ↔ <italic>D</italic> couple, the semantics expressed by the relation is defined by the user (i.e.,
researcher, librarian, curator) taking care of the deposition of the research product
(e.g., on Zenodo). Hence, the semantics is prone to human error, as it might not be
straightforward to tell which relation type is the most appropriate. On Zenodo, for example, the choice is
made from a dropdown menu with limited guidance on the rationale behind each
option.</p>
          <p>In order to mitigate this aspect, we plan to run a heuristic over <italic>P</italic> ↔ <italic>D</italic> couples tied by
“vanilla” relations (e.g., Cites, References) and infer the unintentionally lost relations indicating
supplementary material. A viable strategy could consist in retrofitting as supplementary-material
relations all the Cites (and inverse IsCitedBy) and References (and inverse IsReferencedBy) relations
when the two author sets share at least one author and the dates indicate that the two records are
contextual (e.g., no more than six months apart).</p>
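          <p>A minimal sketch of this heuristic is given below. It assumes that author sets are already comparable and that full publication dates are available for both records; the is_probable_supplement helper and the 183-day threshold (roughly six months) are illustrative assumptions, not settled parameters of the study.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from datetime import date

# "Vanilla" relation types that may hide a supplement relation.
VANILLA = {"Cites", "IsCitedBy", "References", "IsReferencedBy"}
MAX_GAP_DAYS = 183  # roughly six months; an illustrative threshold


def is_probable_supplement(rel_type: str,
                           authors_p: set[str], authors_d: set[str],
                           date_p: date, date_d: date) -> bool:
    """Flag a vanilla relation as a candidate for retrofitting as a supplement relation."""
    if rel_type not in VANILLA:
        return False
    shares_author = bool(authors_p & authors_d)               # at least one common author
    contextual = abs((date_p - date_d).days) <= MAX_GAP_DAYS  # released close in time
    return shares_author and contextual


# Hypothetical examples.
same_team = is_probable_supplement(
    "Cites", {"alice", "bob"}, {"bob"}, date(2020, 3, 1), date(2020, 5, 15))
other_team = is_probable_supplement(
    "Cites", {"alice"}, {"bob"}, date(2020, 3, 1), date(2020, 5, 15))
too_far = is_probable_supplement(
    "References", {"alice"}, {"alice"}, date(2018, 1, 1), date(2020, 1, 1))
```
]]></preformat>
          <p>Only the first example is flagged: the second shares no author, and the third is two years apart.</p>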
          <p>A possible generalisation of the heuristic above would rely on multiple metadata fields,
such as the date of publication, the title and the author list itself, to create a feature vector
describing each research output. The distances between the vectors of publication
records and those of the non-literature records tied to them by a proper supplement relation would then allow us to define
a confidence interval of similarity that characterises literature and related non-literature
records. New supplement relations can be inferred by relying on such a confidence interval: if the
similarity between two feature vectors tied by a “vanilla” relation lies within the interval, then
the semantics has probably been misassigned, and thus the relation can be retrofitted as IsSupplementedBy
(or IsSupplementTo, depending on the direction).</p>
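          <p>One way to picture this generalisation is sketched below, under strong simplifying assumptions: each record is reduced to a bag of tokens drawn from its title, author names and year; similarity is the cosine between two such vectors; and the "interval" is simply the range of similarities observed on couples whose supplement relation is trusted, a crude stand-in for a proper confidence interval. All function names and toy values are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def feature_vector(title: str, authors: list[str], year: int) -> Counter:
    """Bag-of-tokens vector built from title words, author names and year."""
    tokens = title.lower().split()
    tokens += [f"author:{a.lower()}" for a in authors]
    tokens.append(f"year:{year}")
    return Counter(tokens)


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse token-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def similarity_interval(similarities: list[float]) -> tuple[float, float]:
    """Range of similarities seen on trusted supplement couples (crude interval)."""
    return min(similarities), max(similarities)


def within(similarity: float, interval: tuple[float, float]) -> bool:
    low, high = interval
    return low <= similarity <= high


# Hypothetical records tied by a "vanilla" relation.
pub = feature_vector("Immune cell subsets", ["corkum", "ings"], 2015)
data = feature_vector("Additional file 4 of immune cell subsets", ["corkum", "ings"], 2015)
interval = similarity_interval([0.6, 0.8, 0.7])  # toy values from trusted couples
candidate = within(cosine_similarity(pub, data), interval)
```
]]></preformat>
          <p>In practice the choice of features, of the distance, and of how the interval is estimated would all need to be validated against couples whose semantics is known to be correct.</p>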
        </sec>
        <sec id="sec-2-2-2">
          <title>2.3.2. Author names disambiguation</title>
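          <p>Author disambiguation is essential to make the set of authors <italic>A<sub>D</sub></italic> of the dataset (or software) <italic>D</italic>
and the set of authors <italic>A<sub>P</sub></italic> of the publication <italic>P</italic> comparable.</p>
          <p>The metadata definition of an author <italic>a</italic> who contributed both to the publication <italic>P</italic> and to
the supplement dataset <italic>D</italic> may not be the same in <italic>A<sub>P</sub></italic> and <italic>A<sub>D</sub></italic>, respectively. In this case, if the
intersection <italic>A<sub>P</sub></italic> ∩ <italic>A<sub>D</sub></italic> is computed naively, the author <italic>a</italic> will not belong to it because there
are different definitions of <italic>a</italic> in <italic>A<sub>P</sub></italic> and <italic>A<sub>D</sub></italic>. In this context, disambiguation is crucial to correctly
detect that the author <italic>a</italic> is the same in <italic>A<sub>P</sub></italic> and <italic>A<sub>D</sub></italic> despite the multiple definitions.</p>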
          <p>Consider, for example, the publication with DOI https://doi.org/10.1186/s12865-015-0113-0
(Immune cell subsets and their gene expression profiles from human PBMC isolated by Vacutainer
Cell Preparation Tube (CPT™) and standard density gradient). One of the datasets it is
supplemented by is https://doi.org/10.6084/m9.figshare.c.3600443_d4.v1 (Additional file 4: Table S4. of
Immune cell subsets and their gene expression profiles from human PBMC isolated by Vacutainer
Cell Preparation Tube (CPT™) and standard density gradient). The lists of authors <italic>A<sub>P</sub></italic> and <italic>A<sub>D</sub></italic> are:</p>
          <p><italic>A<sub>P</sub></italic> = {Corkum, Christopher <bold>P.</bold>; Ings, Danielle <bold>P.</bold>; Burgess, Christopher; Karwowska, Sylwia; Kroll, Werner; Michalak, Tomasz <bold>I.</bold>}</p>
          <p><italic>A<sub>D</sub></italic> = {Corkum, Christopher; Ings, Danielle; Burgess, Christopher; Karwowska, Sylwia; Kroll, Werner; Michalak, Tomasz}</p>
          <p>If the two lists are compared, three authors co-occur verbatim in both <italic>A<sub>P</sub></italic> and
<italic>A<sub>D</sub></italic> (Burgess, Christopher; Karwowska, Sylwia; and Kroll, Werner). The three remaining authors
differ in how their names are laid out: in <italic>A<sub>P</sub></italic> the first name is followed by another
initial (highlighted in boldface), while in <italic>A<sub>D</sub></italic> it is not. With a coarse strategy such as a
plain string match, and without finer author disambiguation, they would be considered different authors.</p>
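          <p>The effect can be reproduced on the two lists above with a short sketch. The normalise helper below is a deliberately naive assumption (it lowercases names and drops single-letter initials after the first given name); it is not the OpenAIRE deduplication framework, and on real data it could wrongly merge distinct people who share a surname and a first name.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re


def normalise(name: str) -> str:
    """Reduce 'Surname, Given M.' to 'surname, given' by dropping middle initials."""
    surname, _, given = name.partition(",")
    given_tokens = given.split()
    # Keep the first given name; drop later tokens that are a single letter,
    # optionally followed by a dot (e.g. "P." or "I").
    kept = given_tokens[:1] + [t for t in given_tokens[1:]
                               if not re.fullmatch(r"[A-Za-z]\.?", t)]
    return f"{surname.strip().lower()}, {' '.join(kept).lower()}"


with_initials = ["Corkum, Christopher P.", "Ings, Danielle P.", "Burgess, Christopher",
                 "Karwowska, Sylwia", "Kroll, Werner", "Michalak, Tomasz I."]
without_initials = ["Corkum, Christopher", "Ings, Danielle", "Burgess, Christopher",
                    "Karwowska, Sylwia", "Kroll, Werner", "Michalak, Tomasz"]

# Plain string match: only names written identically are matched.
exact = set(with_initials) & set(without_initials)
# Match after normalisation: the differing initials no longer matter.
normalised = ({normalise(a) for a in with_initials}
              & {normalise(a) for a in without_initials})
```
]]></preformat>
          <p>The plain string match finds three common authors, whereas the match on normalised names finds all six.</p>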
          <p>To address the author disambiguation problem, we can rely on the deduplication framework
of OpenAIRE [16, 17] and the distance metrics it provides to compute the distance between
single authors and lists. It is worth noting that, in contrast to the standard deduplication task (i.e.,
establish the equivalence of alike research products), we are comparing lists of authors belonging
to research outputs different in kind (i.e., literature with non-literature); these lists may not
necessarily contain the same authors; hence, the methods provided for the deduplication need
to be customised according to our needs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work was co-funded by the European Commission H2020 project OpenAIRE-Nexus (grant
number: 101017452).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Cronin</surname>
          </string-name>
          ,
          <article-title>Hyperauthorship: A postmodern perversion or evidence of a structural shift in scholarly communication practices?</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>52</volume>
          (
          <year>2001</year>
          )
          <fpage>558</fpage>
          -
          <lpage>569</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1002/asi.1097</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Wren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Z.</given-names>
            <surname>Kozak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Deakyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Schilling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Dellavalle</surname>
          </string-name>
          ,
          <article-title>The write position: A survey of perceived contributions to papers based on byline position and number of authors</article-title>
          ,
          <source>EMBO reports</source>
          <volume>8</volume>
          (
          <year>2007</year>
          )
          <fpage>988</fpage>
          -
          <lpage>991</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Baethge</surname>
          </string-name>
          ,
          <article-title>Publish together or perish: the increasing number of authors per article in academic journals is the consequence of a changing scientific culture. Some researchers define authorship quite loosely</article-title>
          ,
          <source>Deutsches Ärzteblatt International</source>
          <volume>105</volume>
          (
          <year>2008</year>
          )
          <fpage>380</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Monteiro</surname>
          </string-name>
          ,
          <article-title>Evolution in the number of authors of computer science publications</article-title>
          ,
          <source>Scientometrics</source>
          <volume>110</volume>
          (
          <year>2017</year>
          )
          <fpage>529</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Frandsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nicolaisen</surname>
          </string-name>
          ,
          <article-title>What is in a name? Credit assignment practices in different disciplines</article-title>
          ,
          <source>Journal of Informetrics</source>
          <volume>4</volume>
          (
          <year>2010</year>
          )
          <fpage>608</fpage>
          -
          <lpage>617</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.joi.2010.06.010</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aad</surname>
          </string-name>
          , et al.,
          <collab>ATLAS Collaboration</collab>
          ,
          <source>Physical Review D</source>
          <volume>85</volume>
          (
          <year>2012</year>
          )
          <fpage>012003</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>