<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SALTBot: Linking Software and Articles in Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Bolinches</string-name>
          <email>j.bolinches@alumnos.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Garijo</string-name>
          <email>daniel.garijo@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Research Software is becoming a recognized first class citizen to support and reproduce the results of scientific investigations. However, the link between software and their corresponding articles is often absent from Knowledge Graphs like Wikidata, thus making it challenging to retrieve implementations of existing papers. In this work we introduce the Software and Article Linker Toolbot (SALTBot), a bot for linking together GitHub code repositories with their corresponding scholarly articles in Wikidata based on their available citation information. In addition, SALTbot will automatically describe software entities with metadata. We have manually validated SALTbot in 500 code repositories with citation files, adding more than 30 new tools to the Wikidata Knowledge Graph.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <p>Research Software refers to the scripts, tools or computational pipelines developed throughout
an investigation to support the main findings described in a scientific publication [1]. Research
Software is becoming increasingly recognized as a research product,<sup>1</sup> and the scientific
community has developed software citation principles [2] and citation formats [3] in order to recognize
developers with the appropriate credit.</p>
        <p>However, in most existing scholarly Knowledge Graphs to date (e.g., OpenAlex [4],
Wikidata [5], etc.) research software is not usually linked with its corresponding publications.
This leads to three main problems: 1) lack of tool context, as articles usually complement
research software with theoretical background, purpose and experimental results; 2)
paper-implementation availability, as it becomes challenging to know which research papers include
software for others to reuse; 3) author-developer credit, as some developers may have
contributed to a software tool but not to its associated publication.</p>
        <p>This paper makes two contributions:
• A workflow designed to link software and articles with minimal user intervention, based
on a manual analysis of dozens of software repositories with a link to a publication.
• SALTbot,<sup>2</sup> an end-to-end implementation of our workflow [6].</p>
        <p>We have validated SALTbot manually by assessing its performance in over 500 GitHub
repositories with citation files. As a result, we have added 33 new software instances, 104
metadata statements and over 40 new links between software tools and articles in the Wikidata
Knowledge Graph.</p>
        <p>The rest of the paper is structured as follows. We describe background knowledge in Section
2 and introduce SALTbot in Section 3. Section 4 describes our efforts to validate SALTbot, Section
5 discusses the current limitations of our approach and Section 6 concludes the paper.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>In this section we briefly introduce the building blocks of SALTbot: 1) existing tools for
automatically editing Wikibase [7]<sup>3</sup> and Wikidata (Section 2.1) and 2) recent efforts towards
standardizing software citation (Section 2.2).</p>
      <sec id="sec-3-1">
        <title>2.1. Wikibase Bots</title>
        <p>Bots in Wikibase are automated software applications capable of adding, modifying and
removing statements from their corresponding Knowledge Graph. In Wikidata, these bots are
developed by different communities to improve the completeness, accuracy and reliability of the
information in the graph. Wikidata currently receives millions of monthly bot contributions,
even surpassing author contributions during certain months.<sup>4</sup></p>
        <p>Bots are diverse: some fetch data from external sources and adapt and integrate it
into the Wikidata model, some add language tags, and others improve
qualifier descriptions of existing QNodes. There are more than 350 officially approved Wikidata
bots,<sup>5</sup> and some of them enrich existing software tools in Wikidata. For example, Konstin’s
“Github to wikidata bot”<sup>6</sup> enriches entities that have GitHub links with their software release metadata
and project website. However, to the best of our knowledge there are no bots that analyze
the actual contents of a code repository, such as the README and citation files, to link code
repositories with bibliographical entities in Wikibase instances.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Software Citation Files</title>
        <p>The scientific community has developed the Software Citation Principles [2], which led to the
proposal of the Citation File Format [3] as a machine-readable metadata file for citing software
projects. Since GitHub implemented support for this representation,<sup>7</sup> an increasing number of
developers have started to add these files in their repositories to obtain their corresponding
credit (more than 10,000 to date). A CITATION.cff is a YAML file that usually contains the
following information:
• Title: The title of the software project.
• Authors: The names of the software authors and contributors.
• Identifiers: A collection of identifiers (e.g., Digital Object Identifier) to uniquely identify
the software project or its releases.
• License: The software’s license information (e.g., MIT, GPL, Apache, etc.).
• Repository: The URL of the software’s source code repository.
• Preferred citation: If the software project has already been described in a publication, this
field describes the paper to be used to credit the software project’s authors.</p>
        <p><sup>2</sup> https://github.com/SoftwareUnderstanding/SALTbot</p>
        <p><sup>3</sup> https://wikiba.se/</p>
        <p><sup>4</sup> https://stats.wikimedia.org/#/wikidata.org/content/edited-pages/normal|line|1-year|editor_type~group-bot*name-bot*user|monthly</p>
        <p><sup>5</sup> https://hgztools.toolforge.org/botstatistics/?lang=www&amp;project=wikidata&amp;dir=desc&amp;sort=ec</p>
        <p><sup>6</sup> https://github.com/konstin/github-wikidata-bot</p>
        <p>While the adoption of CFF files is growing, a large number of researchers still credit articles
describing their software contributions with plain BibTeX,<sup>8</sup> a common format used to reference
articles in LaTeX publications (e.g., by adding their preferred citation in a README file).</p>
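        <p>As an illustration of the fields listed in this section, the following minimal Python sketch shows a CITATION.cff file as it might look once its YAML has been loaded into a dictionary, together with a helper that extracts the preferred citation. The key names follow our reading of the Citation File Format schema; all sample values and the helper name preferred_citation are hypothetical.</p>
        <preformat>
```python
from typing import Optional

# Hypothetical CITATION.cff content after YAML parsing (all values are made up).
cff = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "title": "Example Tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "identifiers": [{"type": "doi", "value": "10.1234/example.software"}],
    "license": "MIT",
    "repository-code": "https://github.com/example/example-tool",
    "preferred-citation": {
        "type": "article",
        "title": "Example Tool: A Paper Title",
        "doi": "10.1234/example.paper",
    },
}


def preferred_citation(data: dict) -> Optional[dict]:
    """Return the title and DOI of the preferred citation, if one is declared."""
    citation = data.get("preferred-citation")
    if not citation:
        return None
    return {"title": citation.get("title"), "doi": citation.get("doi")}


print(preferred_citation(cff))
```
        </preformat>
        <p>A repository without a preferred citation yields None in this sketch, which corresponds to the case where only the software itself, and no article, is credited.</p>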
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Software and Article Linker Toolbot (SALTBot)</title>
      <p><sup>7</sup> https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files</p>
      <p><sup>8</sup> https://www.bibtex.org/About/</p>
      <sec id="sec-4-1">
        <title>3.1. SALTbot Assumptions</title>
        <p>We designed SALTbot to be compatible with any Wikibase instance holding software tools and
articles. However, since every Wikibase instance may have different node identifiers, SALTbot
assumes that the Wikibase modeling is similar to the Wikidata modeling in terms of the existing
entities, although their respective identifiers (QNodes) may be different. Therefore, the first step to
configure the bot is to query the graph to find the QNodes and PNodes it needs to
operate. The mandatory minimum items that SALTbot needs are the following:
• “instance of” property PNode: property used to check the existence of items of a
specific type (both software and articles must be instances of something).
• “main subject” property PNode: property used to link an article with its specific
software tool.
• “described by source” property PNode: property used to link a software tool with its
specific article. This is the practice by which existing articles and tools are
currently linked in Wikidata, and hence we followed it.
• “Scholarly article” entity QNode: entity used to find scholarly articles in the graph.
Every article must be an instance of this entity.
• “Software category” entity QNode: meta-class used to find software in the graph.
Every software tool must be recursively an instance of a software category.
• “Software” entity QNode: entity used to add the mandatory “instance of something”
statement to the software created by SALTbot.</p>
        <p>If one or more of these items are missing from the target KG, SALTbot will not run.
Additionally, SALTbot queries the graph for some optional information to better characterize the
software entities. These additional elements are:
• “source code repository URL” PNode: property used to link a software entity with its
code repository URL.
• “Free software” entity QNode: entity used to add the mandatory “instance of something”
statement to the software node created by SALTbot (if the software tool has a free license
in the GitHub repository). If a software project does not have a free license, we categorize
it as “Software”.
• “programmed in” PNode: property used to define the programming language in which
a software entity is developed.
• “download link” PNode: property used to link a software entity with a URL from which
it can be downloaded.
• “copyright license” PNode: property used to specify the type of software license used
by a software entity.
• “version control system” and “web interface software” PNodes: properties used as
qualifiers when describing the source code repository of a software project.
• “Git” and “GitHub” QNodes: entities used with the two previous properties to add
qualifiers when assigning a source code repository URL to a software entity.</p>
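        <p>The start-up check on the mandatory items can be sketched as follows. This is only an illustration of the rule that SALTbot does not run when a mandatory item is missing: the function name missing_required, the dictionary layout and the identifiers (which stand for a hypothetical local Wikibase, not Wikidata) are assumptions and are not taken from the SALTbot code base.</p>
        <preformat>
```python
# Labels of the mandatory items described in this section.
REQUIRED = (
    "instance of",
    "main subject",
    "described by source",
    "scholarly article",
    "software category",
    "software",
)


def missing_required(found: dict) -> list:
    """List the mandatory items that were not resolved in the target graph."""
    return [label for label in REQUIRED if not found.get(label)]


# Hypothetical identifiers resolved from a local Wikibase (not Wikidata's).
found = {
    "instance of": "P1",
    "main subject": "P7",
    "described by source": "P9",
    "scholarly article": "Q12",
    "software category": None,  # not present in this graph
    "software": "Q30",
}

missing = missing_required(found)
if missing:
    print("Cannot run; missing items:", ", ".join(missing))
```
        </preformat>
        <p>Optional items could be resolved with the same lookup, with a missing value simply disabling the corresponding enrichment instead of stopping the run.</p>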
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Workflow</title>
        <p>Figure 2 shows an overview of the decision-making workflow followed by SALTbot. We start
from a code repository URL. The first step is to get all the relevant metadata using SOMEF,
which generates a JSON file with all the metadata in the code repository. In particular, SOMEF
detects citation information with one or more preferred citations from the authors in both
BibTeX and CFF (in YAML) formats, which are the ones we focus on.</p>
        <p>SALTbot then calls the Searcher module to parse all the BibTeX and YAML citations in order to
find titles for scholarly articles, as well as other information such as the Digital Object Identifier
(DOI) of an article, if present. Once the candidate titles are extracted, the bot will query the
target KG for entities which are instances of scholarly article and whose label is the title from
the citation, filtering by the corresponding DOI. If no articles are found using the parsed citation,
the Searcher module will attempt to find scholarly article entities using the GitHub repository
name. This strategy is less restrictive and consequently produces vaguer results that need
to be manually verified, but it usually retrieves promising article candidates with a reference to
the software project in their title.</p>
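        <p>A minimal sketch of the article lookup is shown below, assuming an exact label match in English and a DOI comparison that ignores case. The function names are hypothetical; the default identifiers are those we understand Wikidata to use for “instance of” (P31), “scholarly article” (Q13442814) and “DOI” (P356), and they would have to be replaced on another Wikibase instance. The query also assumes that the endpoint predefines the wd, wdt and rdfs prefixes.</p>
        <preformat>
```python
def build_article_query(title, instance_of="P31", scholarly_article="Q13442814",
                        doi_property="P356", lang="en"):
    """Build a SPARQL query for scholarly articles whose label equals the title."""
    # Escape backslashes and double quotes so the title is a valid string literal.
    safe = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?article ?doi WHERE {\n"
        f"  ?article wdt:{instance_of} wd:{scholarly_article} ;\n"
        f'           rdfs:label "{safe}"@{lang} .\n'
        f"  OPTIONAL {{ ?article wdt:{doi_property} ?doi . }}\n"
        "}"
    )


def filter_by_doi(rows, doi):
    """Keep only candidates whose DOI matches the citation, ignoring case."""
    wanted = doi.strip().lower()
    return [row for row in rows if (row.get("doi") or "").lower() == wanted]


print(build_article_query("Software citation principles"))
```
        </preformat>
        <p>Here rows stands for query results already simplified to dictionaries with an article and an optional doi key; candidates that survive the DOI filter need no manual validation, while an empty result would trigger the fallback search by repository name.</p>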
        <p>The same process is also repeated to find software tools: we search for entities which inherit
from the meta-class “software category” and whose label is similar to one of the parsed titles.
These entities are then filtered by comparing their source code repository URL with the
URL provided to SALTbot.</p>
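        <p>The repository URL comparison could look like the sketch below. It assumes that two URLs refer to the same repository when host, owner and name agree after removing a trailing slash or “.git” suffix and ignoring case; the function names and the repo_url key are hypothetical.</p>
        <preformat>
```python
from urllib.parse import urlparse


def normalize_repo_url(url: str) -> str:
    """Reduce a repository URL to a lowercase 'host/owner/name' key."""
    parsed = urlparse(url.strip())
    path = parsed.path.rstrip("/")
    if path.endswith(".git"):
        path = path[:-4]
    return (parsed.netloc + path).lower()


def match_by_repository(candidates, repo_url):
    """Keep software candidates whose recorded repository URL matches repo_url."""
    wanted = normalize_repo_url(repo_url)
    return [c for c in candidates
            if c.get("repo_url") and normalize_repo_url(c["repo_url"]) == wanted]
```
        </preformat>
        <p>Candidates without a recorded repository URL are dropped by this filter, which is the situation in which a user would be asked to pick a candidate or create a new item.</p>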
        <p>We use DOIs to filter articles. If no DOIs are found in the parsed citation, or if these DOIs
do not match those found in the article entities, SALTbot will require manual validation from
users in order to select one of the articles found to proceed with the execution. Similarly,
the repository URL allows identifying whether the software entities found correspond to the
software component in the target repository. If no software candidates are found through their
URL, SALTbot will ask to choose one of the found software components or to create a new one.</p>
        <p>Next, the Analyzer module gathers all the previously existing relationships between the
article and software in the graph. Using the Analyzer output, the Statement Definer will create
a list with the necessary statements to completely link the article and software entities. These
statements fall into one of the following categories:
• If no software was found, SALTbot creates a new item which will be an instance of
“software” and whose label will be the GitHub repository’s name. These software pages
are further enriched by using the repository’s metadata such as the license, the source code
repository URL, the programming languages in which the repository’s code is written and
the fact that it uses Git as a version control system. Additionally, if the license detected is
an open license and the “Free software” QNode was found in the graph, the new software
item will be characterized as free software (i.e., “software distributed under terms that
allow users to freely run, study, change and distribute it and modified versions”<sup>9</sup>).
• If the article is not linked to its corresponding software project, SALTbot adds a new
statement to the article using the “main subject” PNode, with the software as its value.
• If the software project is not linked to the article, SALTbot adds a new statement to the
software using the “described by source” PNode, with the article as its value.</p>
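        <p>The three categories above can be sketched as a small function. This is an illustrative simplification rather than the actual Statement Definer: edits are represented as plain tuples, metadata enrichment is omitted, the “NEW” placeholder is hypothetical, and the default identifiers are the ones we understand Wikidata to use for “main subject” (P921), “described by source” (P1343), “instance of” (P31) and “software” (Q7397).</p>
        <preformat>
```python
def define_statements(article, software, existing, repo_name,
                      main_subject="P921", described_by="P1343",
                      instance_of="P31", software_class="Q7397"):
    """Return the edits needed to fully link an article and a software item.

    `existing` is a set of (subject, property, value) triples already in the graph.
    """
    edits = []
    if software is None:
        # No software item was found: create one labelled with the repository name.
        software = "NEW"  # placeholder until the item receives a real identifier
        edits.append(("create", software, repo_name))
        edits.append(("add", software, instance_of, software_class))
    if (article, main_subject, software) not in existing:
        edits.append(("add", article, main_subject, software))
    if (software, described_by, article) not in existing:
        edits.append(("add", software, described_by, article))
    return edits


# Article Q1 already points to software Q2, so only the reverse link is missing.
print(define_statements("Q1", "Q2", {("Q1", "P921", "Q2")}, "example-tool"))
```
        </preformat>
        <p>Returning a list of pending edits, rather than writing immediately, keeps the decision logic separate from the upload step.</p>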
        <p>Once the number of statements in the list is higher than a batch size defined by users, all
statements are loaded to the target KG using the Updater module. This process is repeated by
SALTbot any number of times for each of the code repository URLs provided as input.</p>
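        <p>The batching behaviour can be illustrated with the following sketch, in which statements accumulate until their number exceeds the user-defined batch size. The class name StatementBatch and the upload callback are hypothetical stand-ins for the Updater module.</p>
        <preformat>
```python
class StatementBatch:
    """Collect statements and hand them to `upload` once the batch is exceeded."""

    def __init__(self, batch_size, upload):
        self.batch_size = batch_size
        self.upload = upload  # callable that writes a list of statements
        self.pending = []

    def extend(self, statements):
        self.pending.extend(statements)
        # Flush only when the list is larger than the user-defined batch size.
        if len(self.pending) > self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.upload(list(self.pending))
            self.pending.clear()


uploaded = []
batch = StatementBatch(batch_size=2, upload=uploaded.append)
batch.extend(["s1", "s2"])  # two statements: not above the batch size yet
batch.extend(["s3"])        # three statements: all uploaded together
batch.flush()               # upload whatever remains at the end of a run
```
        </preformat>
        <p>The final flush call is an addition of this sketch, so that statements left below the threshold after the last repository are not lost.</p>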
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Uploading statements to Wikidata/Wikibase</title>
        <p>SALTbot can be used against a local Wikibase instance or to upload new contents in Wikidata.
We build upon Wikibase Integrator,<sup>10</sup> a Python library designed to read and write data into
Wikibase while solving compatibility and integration problems between different Wikibase
instances.</p>
        <p>In order to edit a specific Wikibase instance, SALTbot provides the necessary wrappers
to automatically configure Wikibase Integrator. The following information is required for
configuring SALTbot:
• A valid username and password in the desired Knowledge Graph
• The MediaWiki API URL of the target graph
• The Knowledge Graph SPARQL endpoint
• The Wikibase URL of the graph.</p>
        <p>Each of the last three configuration items defaults to the corresponding Wikidata value if
left unchanged. SALTbot will process each of the repositories in a semi-autonomous manner,
asking for validation when necessary to decide which article or software to use if multiple
candidates have been found.</p>
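        <p>The configuration described above can be pictured as a small record whose last three fields default to Wikidata. The class and field names below are hypothetical and do not reproduce SALTbot's or Wikibase Integrator's own configuration interface; the default URLs are the public Wikidata endpoints as we understand them, and the local URLs are placeholders.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass
class TargetGraphConfig:
    """Connection settings; the last three default to Wikidata's public endpoints."""
    username: str
    password: str
    mediawiki_api_url: str = "https://www.wikidata.org/w/api.php"
    sparql_endpoint: str = "https://query.wikidata.org/sparql"
    wikibase_url: str = "http://www.wikidata.org"


# Wikidata: only credentials are needed.
wikidata = TargetGraphConfig("ExampleBot", "example-password")

# Hypothetical local Wikibase: override all three URLs.
local = TargetGraphConfig(
    "ExampleBot", "example-password",
    mediawiki_api_url="http://localhost:8181/w/api.php",
    sparql_endpoint="http://localhost:8989/sparql",
    wikibase_url="http://localhost:8181",
)
```
        </preformat>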
        <p><sup>9</sup> https://www.wikidata.org/wiki/Q341</p>
        <p><sup>10</sup> https://github.com/LeMyst/WikibaseIntegrator</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. SALTbot Validation</title>
      <p>In order to assess the correct behaviour of SALTbot, we tested the bot by gathering 500
repositories from GitHub with a “CITATION.cff” file using the GitHub API<sup>11</sup> and validating the results
manually. The rationale behind our approach is to ensure the selection of code repositories
with at least a suggested pointer to a publication.</p>
      <p>Our selected 500 repositories<sup>12</sup> presented the following characteristics before our bot
assessment was completed:
• 378 repositories had one or more mentions of scholarly articles (some refer to code
deposits in archives like Zenodo).
• 46 repositories had their corresponding scholarly article page in Wikidata.
• 35 repositories had their corresponding software page in Wikidata.
• 12 scholarly article entities were previously linked through the property “main subject”
to their corresponding software entity.
• 5 software entities were previously linked through the property “described by source” to their
corresponding article entity.</p>
      <p>In order to perform the validation of SALTbot, we created a bot page<sup>13</sup> and a new username
in Wikidata to keep a record of the contributions performed with the tool. These contributions
can be seen at https://www.wikidata.org/wiki/Special:Contributions/SALTbotDev. Figure 3
shows an example of one of our contributions to Wikidata, which links a newly added tool to
an existing article.</p>
      <p>After our manual validation, SALTbot enriched Wikidata with the following knowledge:
• 33 newly created software entities.
• 104 new software metadata statements.
• 34 scholarly articles linked with their corresponding software entity (this number includes
articles whose software has been created in order to link them).
• 43 software entities linked with their corresponding scholarly articles (this number
includes those software QNodes newly created by SALTbot in order to link them).</p>
      <p>While validating SALTbot, we noticed how our approach blends in with the Wikidata
ecosystem. Shortly after creating new software entities, other bots like Github-wiki-bot started
improving existing page descriptions with their release contents (184 statements regarding
software version identifiers and official pages were added to our newly created software entities).</p>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion</title>
      <p>In 2018, GitHub reached the staggering milestone of holding more than a hundred million code
repositories.<sup>14</sup> In comparison, ten thousand repositories with a CITATION.cff seems like a very [...]</p>
      <p>• Not all repositories have an explicit reference to the article’s DOI.
• Not all scholarly articles are currently linked to their corresponding DOI in Knowledge
Graphs.
• Scholarly articles may have other identifiers, such as an arXiv ID or a Zenodo ID, which
may also be missing in the citation or README files.</p>
      <p><sup>11</sup> https://api.github.com/</p>
      <p><sup>12</sup> Available at: https://github.com/SoftwareUnderstanding/SALTbot/blob/main/WikidataFindings.csv</p>
      <p><sup>13</sup> https://www.wikidata.org/wiki/User:SALTbot</p>
      <p><sup>14</sup> https://github.blog/2018-11-08-100M-repos/</p>
      <p>Currently we address these issues by asking for user input, which hinders full process
automation for some repositories. Relying on external sources like OpenAlex<sup>18</sup> and Crossref<sup>19</sup>
may help address this problem.</p>
      <p><sup>15</sup> https://github.com/codemeta/codemeta/blob/master/crosswalks/Wikidata.csv</p>
      <p><sup>16</sup> https://paperswithcode.com/</p>
      <p><sup>17</sup> https://arxiv.org/</p>
      <p>Finally, SALTbot relies on linking software to publications that already exist in
Wikibase/Wikidata. Papers that are not part of the KG are currently out of the scope of the application. However,
as shown in our manual validation, a significant number of tools belong to articles that are not
currently part of the KG, so creating new article pages may be beneficial to include more tool
implementations.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions and Future Work</title>
      <p>In this paper we introduced SALTbot, our effort towards enriching Wikibase/Wikidata with
the software implementations of existing research articles. We have manually validated our
approach with 500 code repositories, resulting in 33 new software entities and over 40 new
software-paper links. SALTbot contributions are integrated within the Wikidata ecosystem,
with other bots building and expanding on our work. We believe that, as developers continue
adopting best software citation practices, SALTbot will become increasingly useful to the
Wikidata and scientific communities.</p>
      <p>Our future work includes three main improvements. First, we are currently running SALTbot
on nearly ten thousand additional repositories with CFF files, manually validating the results
when needed. Second, we are exploring running the bot on repositories with other types of
citation files (e.g., through BibTeX), which are also detected by SOMEF. Finally, we will explore
automatically creating scholarly article entities in the same way we do with software entities.
However, this feature requires further research, especially when determining how to correctly
characterize scholarly articles in Knowledge Graphs (avoiding possible duplicates), how much
article metadata can be obtained from the citation found in a code repository, and how to assess
the validity of the final results.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the Comunidad de Madrid under the Multiannual Agreement
with Universidad Politécnica de Madrid (UPM) in the line Support for R&amp;D projects for Beatriz
Galindo researchers, in the context of the V PRICIT (Regional Programme of Research and
Technological Innovation) and through the UPM call Research Grants for Young Investigators.</p>
      <p><sup>18</sup> https://openalex.org/</p>
      <p><sup>19</sup> https://www.crossref.org/</p>
      <p>[2] A. M. Smith, D. S. Katz, K. E. Niemeyer, Software citation principles, PeerJ Computer
Science 2 (2016) e86.</p>
      <p>[3] S. Druskat, J. H. Spaaks, N. Chue Hong, R. Haines, J. Baker, S. Bliven, E. Willighagen,
D. Pérez-Suárez, O. Konovalov, Citation File Format, 2021. doi:10.5281/zenodo.5171937.</p>
      <p>[4] J. Priem, H. Piwowar, R. Orr, OpenAlex: A fully-open index of scholarly works, authors,
venues, institutions, and concepts, 2022. arXiv:2205.01833.</p>
      <p>[5] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications
of the ACM 57 (2014) 78–85.</p>
      <p>[6] J. Bolinches, D. Garijo, SoftwareUnderstanding/SALTbot: SALTbot 0.0.1: First stable release,
2023. URL: https://doi.org/10.5281/zenodo.8190001. doi:10.5281/zenodo.8190001.</p>
      <p>[7] D. Diefenbach, M. D. Wilde, S. Alipio, Wikibase as an infrastructure for knowledge graphs:
The EU knowledge graph, in: A. Hotho, E. Blomqvist, S. Dietze, A. Fokoue, Y. Ding,
P. Barnaghi, A. Haller, M. Dragoni, H. Alani (Eds.), The Semantic Web – ISWC 2021,
Springer International Publishing, Cham, 2021, pp. 631–647.</p>
      <p>[8] A. Kelley, D. Garijo, A Framework for Creating Knowledge Graphs of Scientific Software
Metadata, Quantitative Science Studies (2021). doi:10.1162/qss_a_00167.</p>
      <p>[9] A. Mao, D. Garijo, S. Fakhraei, SOMEF: A framework for capturing scientific software
metadata from its documentation, in: 2019 IEEE International Conference on Big Data (Big
Data), 2019, pp. 3032–3037. doi:10.1109/BigData47090.2019.9006447.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Chue Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Lamprecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Psomopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gruenpeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Martinez</surname>
          </string-name>
          , et al.,
          <source>FAIR Principles for Research Software (FAIR4RS Principles)</source>
          ,
          <year>2022</year>
          . doi:10.15497/RDA00068.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>