<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SWH-Analytics Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessia Antelmi</string-name>
          <email>alessia.antelmi@unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Torquati</string-name>
          <email>massimo.torquati@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Gregori</string-name>
          <email>daniele.gregori@e4company.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Polzella</string-name>
          <email>f.polzella@zerodivision.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmarco Spinatelli</string-name>
          <email>g.spinatelli@zerodivision.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Aldinucci</string-name>
          <email>marco.aldinucci@unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, University of Torino</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Software Heritage</institution>
          ,
          <addr-line>Open-source Software, Large-scale analytics, License management</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>11</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>The Software Heritage (SWH) dataset serves as a vast repository for open-source code, with the ambitious goal of preserving all publicly available open-source projects. Although it is designed to archive project files effectively, its size of nearly 1 petabyte makes it challenging to efficiently support Big Data MapReduce or AI systems. To address this disparity and enable seamless custom analytics on the SWH dataset, we present the SWH-Analytics (SWHA) architecture. This development environment quickly and transparently runs custom analytic applications on open-source software data preserved over time by SWH.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        At its core, open-source software is software whose source code is made
available to the public under a free and open-source license, which allows anyone to view, modify,
and distribute the code at no cost [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Over the past twenty years, open-source
software has experienced a remarkable evolution, now enjoying extensive adoption. What
began as a grassroots movement, marked by the advent of the first freely available open-source
operating system, has subsequently evolved into a dominant phenomenon within the developer
community [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The pivotal role of open source was also highlighted by the 2022 GitHub
report [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which revealed that it serves as the cornerstone of over 90% of the world’s software
infrastructure.
      </p>
      <p>
        In this context, the Software Heritage (SWH) initiative represents a valuable source as it aims
to archive, preserve, and make accessible all software publicly available in source code form
ever produced by humankind [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The SWH dataset experiences rapid growth, accumulating
several terabytes of data each month. By July 2023, it had reached close to 1 petabyte in size,
primarily comprising archived software source code files, each with an average size of less than
4 kilobytes. Simultaneously, the metadata graph associated with this archive had expanded
to nearly 20 terabytes. Although designed to archive and de-duplicate small files, this
massive dataset encounters notable difficulties when it comes to effectively serving as input for
Big Data processing frameworks like MapReduce (e.g., Spark) or AI systems for training and
inference. This challenge arises from the need to navigate through successive files in query
results, which may entail traversing the 20-terabyte metadata hash tree and navigating across
the vast 1-petabyte storage object repository without any spatial locality.
      </p>
      <p>To bridge the performance gap between stream-based analytics and the SWH dataset, we
present the SWH-Analytics (SWHA) architecture. Designed and developed within the context of
the ADMIRE European project, SWHA aims to offer a specialized development
and runtime environment tailored for applications that analyze the extensive repository
of open-source software preserved by SWH. A notable feature of SWHA is its capability
to run custom analytic applications written in Scala. More precisely, in this
study, we describe how SWHA could be effectively exploited to investigate license usage in
open-source software at a large scale.</p>
      <p>The remainder of this paper is organized as follows. Section 2 briefly introduces and describes
the SWH dataset. Section 3 details the architecture of SWHA by explaining each component.
Section 4 delineates the workflow of a possible application built on top of SWHA. Section 5
concludes this work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Software Heritage</title>
      <p>
        Software Heritage [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">6, 5, 7</xref>
        ] is a globally renowned non-profit initiative established in 2016
with a mission to archive, preserve, and provide access to all publicly available software in its
source code form, spanning the entirety of human software production. As of July 2023, this
monumental archive encompasses approximately 16.6 billion unique source files and 3.5 billion
unique commits from over 258 million development projects, collecting a total of 1 petabyte
of data. The SWH repository has been exploited in diverse research endeavours, including
examining the geographical and gender diversity in public code contributions [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ], analyzing
license text variations [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], identifying repository forks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and deriving code usage statistics,
such as the most commonly used filenames [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], commit patterns [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and the average size of
the most prevalent file types [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        In SWH, projects are stored as a Merkle directed acyclic graph (DAG) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A Merkle DAG is
characterized by a unique identifier for each node, which is derived from the cryptographic
properties of the node’s content and, in the case of non-leaf nodes, also incorporates the
identifiers of their child nodes. This inherent feature of Merkle DAGs endows them with
versatility and efficiency, making such structures well-suited for a wide range of applications,
including data integrity verification, deduplication, synchronization, and security. The SWH
DAG [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is organized into six logical layers, which represent (i) the raw content of source code
files, (ii) project directories, (iii) project revisions or commits, (iv) project releases or tags, (v)
project snapshots, and (vi) the project origin.
      </p>
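      <p>As a minimal sketch of the identifier scheme described above, the Python fragment below derives a node identifier from the node's own content and, for non-leaf nodes, from the identifiers of its children. The hash function (SHA-1), the byte layout, and the names leaf_id and node_id are illustrative assumptions; they do not reproduce the exact identifier format used by SWH.</p>
      <preformat><![CDATA[
```python
import hashlib


def leaf_id(content: bytes) -> str:
    """Identifier of a leaf node (e.g. a file): a hash of its raw content."""
    return hashlib.sha1(b"leaf\0" + content).hexdigest()


def node_id(label: bytes, children: dict) -> str:
    """Identifier of a non-leaf node (e.g. a directory).

    `children` maps an entry name to the identifier of the child node, so the
    parent's identifier changes whenever any descendant changes.
    """
    h = hashlib.sha1(b"node\0" + label)
    for name in sorted(children):  # sort entries for a deterministic result
        h.update(name.encode() + b"\0" + children[name].encode() + b"\n")
    return h.hexdigest()


# Two directories with identical content share one identifier (deduplication).
readme = leaf_id(b"hello\n")
dir_a = node_id(b"dir", {"README": readme})
dir_b = node_id(b"dir", {"README": readme})
# Changing a file changes the identifier of every ancestor (integrity check).
dir_c = node_id(b"dir", {"README": leaf_id(b"hello!\n")})
```
]]></preformat>
      <p>In this sketch, dir_a and dir_b coincide while dir_c differs, which is the property that makes deduplication and integrity verification possible on such a graph.</p>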
    </sec>
    <sec id="sec-4">
      <title>3. The SWHA infrastructure</title>
      <p>The Software Heritage Analytics (SWHA) framework has been designed and developed in the
context of the ADMIRE European project, whose main objective was to produce software
solutions to enhance the throughput of HPC systems and the performance of individual
applications. The framework’s architecture comprises three primary software layers: storage,
data orchestration, and application layers. These layers cooperate in a parallel computing
environment managed through the Apache Spark Streaming Framework. A description of each
layer follows.</p>
      <p>Storage layer. This layer primarily comprises a data cache known as Cachemire, designed to
speed up data retrieval. Cachemire is a cache of projects, functioning as a
distributed key-value storage system. Each key corresponds to the unique
identifier assigned to a project by SWH, while the associated value contains the project
package in compressed tgz format. The Cachemire interface offers a straightforward
API that exposes PUT and GET functions, whose implementation relies on the locking
mechanisms provided by POSIX-compliant file system primitives. Cachemire adopts the
Least Recently Used (LRU) algorithm as its cache replacement policy. To keep the
cache size under control, an external script runs at regular intervals, monitoring the cache
and ensuring that its size remains within a predefined threshold.</p>
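      <p>The behaviour of such a cache can be sketched in a few lines of Python. This is not the Cachemire implementation: the class FileCache, the function evict_lru, the one-file-per-key layout, the use of fcntl.flock for advisory locking, and the use of the file modification time as the recency marker are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import fcntl
import os
import tempfile


class FileCache:
    """Toy key-value cache: one file per key, guarded by advisory file locks."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        # Assumption: keys become file names once ':' is replaced.
        return os.path.join(self.root, key.replace(":", "_") + ".tgz")

    def put(self, key: str, value: bytes) -> None:
        # Open in append mode so the file is not truncated before it is locked.
        with open(self._path(key), "ab") as f:
            fcntl.flock(f, fcntl.LOCK_EX)      # exclusive lock for writers
            f.truncate(0)
            f.write(value)                     # lock is released on close

    def get(self, key: str):
        path = self._path(key)
        try:
            with open(path, "rb") as f:
                fcntl.flock(f, fcntl.LOCK_SH)  # shared lock for readers
                data = f.read()
        except FileNotFoundError:
            return None                        # cache miss
        os.utime(path)                         # mark the entry as recently used
        return data


def evict_lru(root: str, max_bytes: int) -> list:
    """Delete least recently used entries until the cache fits in max_bytes.

    Intended to be run periodically by an external job.
    """
    entries = [os.path.join(root, name) for name in os.listdir(root)]
    entries.sort(key=os.path.getmtime)         # oldest first
    total = sum(os.path.getsize(p) for p in entries)
    removed = []
    for path in entries:
        if total <= max_bytes:
            break
        total -= os.path.getsize(path)
        os.remove(path)
        removed.append(os.path.basename(path))
    return removed


# Usage: store and retrieve a (fake) project archive.
cache = FileCache(tempfile.mkdtemp())
cache.put("swh:1:dir:example", b"fake tgz bytes")
archive = cache.get("swh:1:dir:example")
```
]]></preformat>
      <p>Keeping eviction in a separate function mirrors the division of labour described above, where PUT and GET stay simple and a periodic external job enforces the size threshold.</p>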
      <p>Data Orchestration layer. This layer includes a pool of data stream generators, referred to as
app controllers, which collaborate with Cachemire in a parallel computing environment
managed through the Apache Spark Streaming Framework. More specifically, the
orchestration layer uses the official SWH APIs to search for and retrieve projects, which are
subsequently processed by Apache Spark workers (as illustrated in Figure 1). Apache
Spark is an open-source, distributed computing framework tailored for big data processing
and analytics. In this project, we harnessed Spark Streaming to construct low-latency
applications, optimizing the time required for data retrieval and computation.</p>
      <p>Application layer. SWHA can execute custom analytics applications written in Scala, ensuring
compatibility with Apache Spark Streaming. Each application analyzes a specific set of
projects defined through a recipe. This term is inherited from SWH and is associated with
preparing a set of projects for download, informally known as cooking. These recipes
contain essential information, such as the SWH identifiers of the projects to analyze and
possibly additional metadata, like the chosen programming language. The application
layer acts as the intermediary for communication between an authenticated user and the
SWHA system. This interaction occurs through a web-based console accessed from a
web browser. The web console simplifies user interaction with the SWHA
system, enabling efficient project searching, application management, and execution
within a user-friendly browser environment.</p>
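      <p>The interplay between a recipe, an app controller, the cache, and a map/reduce style application can be sketched as follows. The sketch is written in plain Python rather than Scala on Spark Streaming, and everything in it is hypothetical: the recipe layout, the made-up project identifiers, the function fetch_from_swh (a placeholder that fabricates a small archive instead of calling the SWH web API), and the dictionary standing in for Cachemire.</p>
      <preformat><![CDATA[
```python
import io
import tarfile
from functools import reduce


def make_tgz(files: dict) -> bytes:
    """Build an in-memory .tgz; stands in for a project cooked by SWH."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def fetch_from_swh(project_id: str) -> bytes:
    """Hypothetical placeholder for a download through the SWH web API."""
    return make_tgz({"README": b"demo\n", "src/main.c": b"int main(){}\n"})


def app_controller(recipe: dict, cache: dict):
    """Yield (project_id, file_name, content) for every file in the recipe."""
    for project_id in recipe["projects"]:
        archive = cache.get(project_id)
        if archive is None:                  # cache miss: retrieve and store
            archive = fetch_from_swh(project_id)
            cache[project_id] = archive
        with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
            for member in tar:
                if member.isfile():
                    yield project_id, member.name, tar.extractfile(member).read()


# A toy analytic application: map each streamed file to its size, then sum.
recipe = {"projects": ["swh:1:ori:demo-a", "swh:1:ori:demo-b"], "language": "C"}
cache = {}
sizes = map(lambda item: len(item[2]), app_controller(recipe, cache))
total_bytes = reduce(lambda a, b: a + b, sizes, 0)
```
]]></preformat>
      <p>Because the controller is a generator, files reach the application one at a time, and a second run over the same recipe would be served entirely from the cache.</p>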
      <p>[Figure 1: SWHA architecture. Labels recoverable from the figure: Web console (swh.map(app).reduce(+)); MapReduce (node 1, node n, app, streams); Data orchestration (app controller); Data cache (cachemire); Software Heritage, accessed over https (~1PB data, ~20TB metadata).]</p>
    </sec>
    <sec id="sec-5">
      <title>4. How to exploit SWHA: a case study of the license checker analytic application</title>
      <p>
        An asset of SWHA is that it allows the execution of custom analytic applications written in
Scala. Specifically, in this work, we show how SWHA can be effectively exploited to study
license inconsistencies potentially across all the revisions of all the source code ever produced or
an arbitrary partition of them. Open-source software has gained widespread adoption, with its
licensing terms significantly impacting community involvement and contributions [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Project
licensing, in particular, plays a crucial role in companies, as any violations of licenses can lead
to substantial legal risks [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. With the extensive use of open-source software from public
repositories in products, it becomes crucial to strategically understand license mismatches. In
this context, the Software Heritage (SWH) dataset is a valuable resource for analyzing project
license patterns across temporal and typological dimensions.
      </p>
      <p>The pipeline of a possible application to verify a project’s license compliance comprises
three main steps. The first step provides the application with the set of
projects a user wishes to examine. This project set is defined using an arbitrarily complex and
customizable ‘recipe’ to query the SWH archive via its web API. Each app controller is responsible
for querying the dataset and streaming each project’s files (one file at a time) to the analytic
application. The second (license identification) and third (license compliance verification)
steps represent the core logic and are handled by the custom analytic application. Specifically,
for each streamed file, the application looks for an explicit license declaration for the whole
project and for in-code licenses attached directly to files. Licenses are automatically detected with
ScanCode (https://github.com/nexB/scancode-toolkit), one of the most popular open-source license scanners available today. For each
project, the application checks whether there is any inconsistency among the licenses detected. If any
inconsistency is found, the application verifies whether there is a license conflict by querying
the OSADL Open Source License Checklist (https://www.osadl.org/OSADL-Open-Source-License-Checklists.oss-compliance-lists.0.html), which offers a compatibility matrix between free
and open-source software (FOSS) licenses.</p>
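      <p>The second and third steps can be sketched as follows, under strong simplifying assumptions. License detection is replaced by a stub that only reads SPDX-License-Identifier tags (it does not call ScanCode), an inconsistency is taken to mean that more than one distinct license was detected in a project, and the conflict table is a one-entry illustrative stand-in rather than the OSADL data. The names detect_license, check_project, and CONFLICTS are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([\w.+-]+)")

# Illustrative stand-in for a compatibility matrix (NOT the OSADL data):
# each frozenset is a pair of licenses treated as conflicting in this sketch.
CONFLICTS = {frozenset({"GPL-2.0-only", "Apache-2.0"})}


def detect_license(content: str):
    """Stub detector: return the SPDX tag found in the text, if any."""
    match = SPDX_TAG.search(content)
    return match.group(1) if match else None


def check_project(files: dict) -> dict:
    """Identify licenses per file, then flag inconsistencies and conflicts."""
    found = {}
    for name, content in files.items():
        license_id = detect_license(content)
        if license_id is not None:
            found[name] = license_id
    licenses = sorted(set(found.values()))
    conflicts = [
        (a, b)
        for i, a in enumerate(licenses)
        for b in licenses[i + 1:]
        if frozenset({a, b}) in CONFLICTS
    ]
    return {
        "licenses": licenses,
        "inconsistent": len(licenses) > 1,  # assumption: several licenses found
        "conflicts": conflicts,
    }


project = {
    "LICENSE": "SPDX-License-Identifier: GPL-2.0-only",
    "src/a.c": "/* SPDX-License-Identifier: GPL-2.0-only */",
    "src/b.c": "/* SPDX-License-Identifier: Apache-2.0 */",
    "README": "no license information here",
}
report = check_project(project)
```
]]></preformat>
      <p>The conflict lookup is only reached through the list of distinct licenses, which reflects the order of the steps in the text: identification first, then the inconsistency check, then the compatibility query.</p>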
      <p>The output is a series of statistics about the types of licenses found and whether inconsistencies
and conflicts have been detected. Specifically, the application’s output may be a JSON file that
provides information for each project, including the number and types of licenses detected,
their categories, and any inconsistencies or conflicts. Additionally, the application can generate
a summary detailing the quantity and types of identified inconsistencies, as well as pairs of
licenses causing conflicts.</p>
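      <p>One possible shape for such an output is sketched below. The field names, the two sample projects, and the summarize function are assumptions chosen for illustration; the paper does not prescribe a schema.</p>
      <preformat><![CDATA[
```python
import json
from collections import Counter

# Hypothetical per-project results, shaped like the description above.
results = {
    "project-a": {"licenses": ["MIT"], "inconsistent": False, "conflicts": []},
    "project-b": {
        "licenses": ["Apache-2.0", "GPL-2.0-only"],
        "inconsistent": True,
        "conflicts": [["Apache-2.0", "GPL-2.0-only"]],
    },
}


def summarize(per_project: dict) -> dict:
    """Aggregate per-project reports into corpus-level statistics."""
    license_counts = Counter()
    conflict_pairs = Counter()
    for report in per_project.values():
        license_counts.update(report["licenses"])
        conflict_pairs.update(" / ".join(pair) for pair in report["conflicts"])
    return {
        "projects": len(per_project),
        "inconsistent_projects": sum(r["inconsistent"] for r in per_project.values()),
        "license_counts": dict(license_counts),
        "conflict_pairs": dict(conflict_pairs),
    }


summary = summarize(results)
output = json.dumps({"projects": results, "summary": summary}, indent=2, sort_keys=True)
```
]]></preformat>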
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>Open-source software is pervasive, and comprehending its development patterns is vital for
ensuring the delivery of high-quality software and adequate support to the developer community.
In this work, we presented SWHA, a framework designed and developed to offer a specialized
development and runtime environment tailored for applications that analyze the extensive
repository of open-source software preserved by SWH. We further described how SWHA could
be effectively harnessed to investigate license usage in open-source software, a critical concern
given the legal issues that license violations may cause. We are currently
implementing such an application. In future work, we aim to assess the performance of the
overall SWHA framework, with a particular focus on evaluating the advantages of employing
Cachemire as a data cache and comparing the use of one ad hoc file system against another.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the EuroHPC JU ADMIRE project (G.A. n. 956748)
and the spoke “FutureHPC &amp; BigData” of the ICSC – Centro Nazionale di Ricerca in
High-Performance Computing, Big Data and Quantum Computing, funded by the European Union –
NextGenerationEU.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] RedHat, What is open source software?, https://www.redhat.com/en/topics/open-source/what-is-open-source-software,
          <year>2022</year>
          . Accessed on 28/09/2023.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>M.-W.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Y.-D.</given-names> <surname>Lin</surname></string-name>,
          <article-title>Open source software development: an overview</article-title>,
          <source>Computer</source>
          <volume>34</volume>
          (<year>2001</year>)
          <fpage>33</fpage>-<lpage>38</lpage>. doi:10.1109/2.928619.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>F.</given-names> <surname>Bordeleau</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Meirelles</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sillitti</surname></string-name>,
          <article-title>Fifteen years of open source software evolution</article-title>,
          in: F. Bordeleau, A. Sillitti, P. Meirelles, V. Lenarduzzi (Eds.),
          <source>Open Source Systems</source>,
          Springer International Publishing, Cham,
          <year>2019</year>, pp.
          <fpage>61</fpage>-<lpage>67</lpage>. doi:10.1007/978-3-030-20883-7_6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] GitHub,
          <article-title>Octoverse 2022: 10 years of tracking open source</article-title>,
          https://github.blog/2022-11-17-octoverse-2022-10-years-of-tracking-open-source,
          <year>2022</year>
          . Accessed on 28/09/2023.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>R.</given-names> <surname>Di Cosmo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>Software Heritage: Why and How to Preserve Software Source Code</article-title>,
          in: <source>iPRES 2017: 14th International Conference on Digital Preservation</source>,
          Kyoto, Japan,
          <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>R.</given-names> <surname>Di Cosmo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          Software Heritage, https://www.softwareheritage.org,
          <year>2016</year>
          . Accessed on 28/09/2023.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>J.-F.</given-names> <surname>Abramatic</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Di Cosmo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>Building the Universal Archive of Source Code</article-title>,
          <source>Communications of the ACM</source>
          <volume>61</volume>
          (<year>2018</year>)
          <fpage>29</fpage>-<lpage>31</lpage>. doi:10.1145/3183558.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>D.</given-names> <surname>Rossi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>Geographic Diversity in Public Code Contributions: An Exploratory Large-Scale Study Over 50 Years</article-title>,
          in: <source>The 2022 Mining Software Repositories Conference (MSR 2022)</source>,
          ACM,
          <year>2022</year>, pp.
          <fpage>80</fpage>-<lpage>85</lpage>. doi:10.1145/3524842.3528471.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>Gender differences in public code contributions: a 50-year perspective</article-title>,
          <source>IEEE Software</source>
          (<year>2021</year>). doi:10.1109/MS.2020.3038765.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>D.</given-names> <surname>Rossi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>Worldwide Gender Differences in Public Code Contributions (and How They Have Been Affected by the COVID-19 Pandemic)</article-title>,
          in: <source>44th International Conference on Software Engineering (ICSE 2022) - Software Engineering in Society (SEIS) Track</source>,
          ACM,
          <year>2022</year>, pp.
          <fpage>172</fpage>-<lpage>183</lpage>. doi:10.1109/ICSE-SEIS55304.2022.9794118.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>A Large-scale Dataset of (Open Source) License Text Variants</article-title>,
          in: <source>The 2022 Mining Software Repositories Conference (MSR 2022)</source>,
          ACM,
          <year>2022</year>, pp.
          <fpage>757</fpage>-<lpage>761</lpage>. doi:10.1145/3524842.3528491.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A.</given-names> <surname>Pietri</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Rousseau</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>Forking without clicking: on how to identify software repository forks</article-title>,
          in: <source>MSR 2020: The 17th International Conference on Mining Software Repositories</source>,
          IEEE,
          <year>2020</year>, pp.
          <fpage>277</fpage>-<lpage>287</lpage>. doi:10.1145/3379597.3387450.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>V.</given-names> <surname>Lorentz</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Di Cosmo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>The Popular Content Filenames Dataset: Deriving Most Likely Filenames from the Software Heritage Archive</article-title>,
          <year>2023</year>
          . URL: https://inria.hal.science/hal-04171177.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pietri</surname>
          </string-name>
          ,
          <article-title>Organizing the graph of public software development for large-scale mining</article-title>
          ,
          <source>Ph.D. thesis</source>
          , Université Paris Cité,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>R.</given-names> <surname>Merkle</surname></string-name>,
          <article-title>A digital signature based on a conventional encryption function</article-title>,
          in: C. Pomerance (Ed.),
          <source>Advances in Cryptology - CRYPTO '87</source>,
          Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>1988</year>, pp.
          <fpage>369</fpage>-<lpage>378</lpage>. doi:10.1007/3-540-48184-2_32.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>A.</given-names> <surname>Pietri</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Spinellis</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zacchiroli</surname></string-name>,
          <article-title>The Software Heritage Graph Dataset: Public software development under one roof</article-title>,
          in: <source>Proceedings of the 16th International Conference on Mining Software Repositories, MSR '19</source>,
          IEEE Press,
          <year>2019</year>, pp.
          <fpage>138</fpage>-<lpage>142</lpage>. doi:10.1109/MSR.2019.00030.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>J.</given-names> <surname>Gamalielsson</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Lundell</surname></string-name>,
          <article-title>On licensing and other conditions for contributing to widely used open source projects: An exploratory analysis</article-title>,
          in: <source>Proc. of the 13th Int. Symp. on Open Collaboration, OpenSym '17</source>,
          ACM, NY, USA,
          <year>2017</year>. doi:10.1145/3125433.3125456.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>T.</given-names> <surname>Wolter</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Barcomb</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Riehle</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Harutyunyan</surname></string-name>,
          <article-title>Open source license inconsistencies on GitHub</article-title>,
          <source>ACM Trans. Softw. Eng. Methodol.</source>
          <volume>32</volume>
          (<year>2023</year>). doi:10.1145/3571852.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>