SALTBot: Linking Software and Articles in Wikidata

Jorge Bolinches1, Daniel Garijo1
1 Ontology Engineering Group, Universidad Politécnica de Madrid

Abstract

Research Software is becoming a recognized first-class citizen to support and reproduce the results of scientific investigations. However, the link between software and their corresponding articles is often absent from Knowledge Graphs like Wikidata, thus making it challenging to retrieve implementations of existing papers. In this work we introduce the Software and Article Linker Toolbot (SALTbot), a bot for linking GitHub code repositories with their corresponding scholarly articles in Wikidata based on their available citation information. In addition, SALTbot automatically describes software entities with metadata. We have manually validated SALTbot on 500 code repositories with citation files, adding more than 30 new tools to the Wikidata Knowledge Graph.

Paper type: Resource
Code repository: https://github.com/SoftwareUnderstanding/SALTbot
Zenodo release: https://doi.org/10.5281/zenodo.8190001

1. Introduction

Research Software refers to the scripts, tools or computational pipelines developed throughout an investigation to support the main findings described in a scientific publication [1]. Research Software is becoming increasingly recognized as a research product,1 and the scientific community has developed software citation principles [2] and citation formats [3] in order to give developers appropriate credit. However, in most existing scholarly Knowledge Graphs to date (e.g., OpenAlex [4], Wikidata [5], etc.) research software is not usually linked with its corresponding publications.
This leads to three main problems: 1) lack of tool context, as articles usually complement research software with theoretical background, purpose and experimental results; 2) paper-implementation availability, as it becomes challenging to know which research papers include software for others to reuse; 3) author-developer credit, as some developers may have contributed to a software tool but not to its associated publication.

In this work, we address these issues by presenting SALTbot, a Software and Article Linker Toolbot, designed to find article and software entities in Wikibase instances in order to enrich and link them together. SALTbot takes as input one or multiple GitHub repositories and inspects them for references to existing articles in Wikidata. Then, if found, SALTbot links software and their corresponding articles, creating a new software instance when necessary and enriching it with metadata. Our work includes two main contributions:

• A workflow designed to link software and articles with minimal user intervention, based on a manual analysis of dozens of software repositories with a link to a publication.
• SALTbot,2 an end-to-end implementation of our workflow [6]. We have validated SALTbot manually by assessing its performance in over 500 GitHub repositories with citation files. As a result, we have added 33 new software instances, 104 metadata statements and over 40 new links between software tools and articles in the Wikidata Knowledge Graph.

The rest of the paper is structured as follows.

Wikidata'23: Wikidata workshop at ISWC 2023
j.bolinches@alumnos.upm.es (J. Bolinches); daniel.garijo@upm.es (D. Garijo)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
1 https://sfdora.org/read/
We describe background knowledge in Section 2 and introduce SALTbot in Section 3. Section 4 describes our efforts to validate SALTbot, Section 5 discusses the current limitations of our approach and Section 6 concludes the paper.

2. Background

In this section we briefly introduce the building blocks of SALTbot: 1) existing tools for automatically editing Wikibase [7]3 and Wikidata (Section 2.1) and 2) recent efforts towards standardizing software citation (Section 2.2).

2.1. Wikibase Bots

Bots in Wikibase are automated software applications capable of adding, modifying and removing statements from their corresponding Knowledge Graph. In Wikidata, these bots are developed by different communities to improve the completeness, accuracy and reliability of the information in the graph. Wikidata currently receives millions of monthly bot contributions, even surpassing human contributions during certain months.4 Bots are diverse, ranging from those that fetch data from external sources and adapt and integrate it into the Wikidata model, to those that add language tags or improve qualifier descriptions of existing QNodes. There are more than 350 officially approved Wikidata bots,5 and some of them enrich existing software tools in Wikidata. For example, Konstin's "Github to wikidata bot"6 enriches entities that have GitHub links with their software release metadata and project website. However, to the best of our knowledge there are no bots that analyze the actual contents of a code repository, such as the README and citation files, to link code repositories with bibliographical entities in Wikibase instances.

2.2.
Software Citation Files

The scientific community has developed the Software Citation Principles [2], which led to the proposal of the Citation File Format [3] as a machine-readable metadata file for citing software projects. Since GitHub implemented support for this representation,7 an increasing number of developers have started to add these files to their repositories to obtain their corresponding credit (more than 10,000 to date). A CITATION.cff is a YAML file that usually contains the following information:

• Title: The title of the software project.
• Authors: The names of the software authors and contributors.
• Identifiers: A collection of identifiers (e.g., Digital Object Identifier) to uniquely identify the software project or its releases.
• License: The software's license information (e.g., MIT, GPL, Apache, etc.).
• Repository: The URL of the software's source code repository.
• Preferred citation: If the software project has already been described in a publication, this field describes the paper to be used to credit the software project's authors.

While the adoption of CFF files is growing, many researchers still credit articles describing their software contributions with plain BibTeX,8 a common format used to reference articles in LaTeX publications (e.g., by adding their preferred citation in a README file).

2 https://github.com/SoftwareUnderstanding/SALTbot
3 https://wikiba.se/
4 https://stats.wikimedia.org/#/wikidata.org/content/edited-pages/normal|line|1-year|editor_type~group-bot*name-bot*user|monthly
5 https://hgztools.toolforge.org/botstatistics/?lang=www&project=wikidata&dir=desc&sort=ec
6 https://github.com/konstin/github-wikidata-bot

3. Software and Article Linker Toolbot (SALTbot)

Figure 1 shows an overview of the architecture of SALTbot.
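To make the CITATION.cff structure from Section 2.2 concrete, the sketch below extracts the preferred citation's title and DOI from a hypothetical file. The file content and the function are illustrative only; SALTbot delegates this extraction to SOMEF, and a real implementation should parse the YAML with a proper library rather than line by line.

```python
import re

# Hypothetical CITATION.cff content for illustration (not from a real repository).
CFF_EXAMPLE = """\
cff-version: 1.2.0
title: example-tool
authors:
  - family-names: Doe
    given-names: Jane
preferred-citation:
  type: article
  title: "An Example Article Describing example-tool"
  doi: 10.5281/zenodo.0000000
"""

def extract_preferred_citation(cff_text):
    """Pull the article title and DOI out of the preferred-citation block.

    A real implementation should parse the YAML properly (e.g., with PyYAML);
    this line-based sketch only covers the simple single-document case.
    """
    in_citation = False
    title, doi = None, None
    for line in cff_text.splitlines():
        if line.startswith("preferred-citation:"):
            in_citation = True
            continue
        if in_citation and line.startswith("  "):  # still inside the block
            m = re.match(r'\s*(title|doi):\s*"?([^"]+?)"?\s*$', line)
            if m:
                if m.group(1) == "title":
                    title = m.group(2)
                else:
                    doi = m.group(2)
        elif in_citation:
            break  # left the indented preferred-citation block
    return title, doi

title, doi = extract_preferred_citation(CFF_EXAMPLE)
print(title)  # An Example Article Describing example-tool
print(doi)    # 10.5281/zenodo.0000000
```

When no preferred citation is present, only the repository-level fields (title, authors, license) are available, which is one reason the bot later needs fallback search strategies.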
Given one or multiple GitHub repository URLs as input, SALTbot finds software and scholarly articles related to each of the repositories in a Wikibase instance, analyzes the existing relationships between these entities, and introduces new links between them to complete a bidirectional relationship between articles and software, creating and characterizing new software instances when they do not exist in the graph. SALTbot is divided into the following modules:

• Orchestrator: The main module of SALTbot. It deals with the Wikibase configuration, processes the input, sends the parsed metadata to the Handler module for each repository and calls the Updater module to introduce data into the graph.
• SOMEF: We reuse the Software Metadata Extraction Framework [8, 9], a tool that, given a repository URL, produces a JSON file with relevant metadata from both the README and CITATION.cff files contained in the code repository.
• Handler: This module is in charge of sending and receiving data from all of SALTbot's modules in order to determine the necessary statements to add to the graph for one repository.
• Searcher: Finds candidate article and software entity QNodes in the graph based on the metadata extracted by SOMEF.
• Analyzer: Assesses the existing relationships between all the articles and software found and prints them to the user.
• Statement Definer: Creates a list of statements and entities to create in order to link an article and software, asking for user validation if needed.
• Updater: Uploads statements to a target Wikibase Knowledge Graph in bulk.

7 https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files
8 https://www.bibtex.org/About/

Figure 1: SALTbot Architecture. SALTbot reuses SOMEF [8] for software citation and metadata extraction, and the Wikibase API to retrieve potential existing paper candidates to link to software components.

3.1.
SALTbot Assumptions

We designed SALTbot to be compatible with any Wikibase instance holding software tools and articles. However, since every Wikibase instance may have different node identifiers, SALTbot assumes that the Wikibase modeling is similar to the Wikidata modeling in terms of the existing entities, although their respective identifiers (QNodes) may differ. Therefore, the first step to configure the bot is to query the graph to find the necessary QNodes and PNodes needed to operate. The minimum mandatory items that SALTbot needs are the following:

• "instance of" property PNode: property used to check the existence of items of a specific type (both software and articles must be instances of something).
• "main subject" property PNode: property used to link an article with its specific software tool.
• "described by source" property PNode: property used to link a software tool with its specific article. This is the current practice by which existing articles and tools are linked in Wikidata, and hence we followed it.
• "Scholarly article" entity QNode: entity used to find scholarly articles in the graph. Every article must be an instance of this entity.
• "Software category" entity QNode: meta-class used to find software in the graph. Every software tool must be recursively an instance of a software category.
• "Software" entity QNode: entity used to add the mandatory "instance of" statement to the software created by SALTbot.

If one or more of these items are missing from the target KG, SALTbot will not run. Additionally, SALTbot queries the graph for some optional information to better characterize the software entities. These additional elements are:

• "source code repository URL" PNode: property used to link a software entity with its code repository URL.
• "Free software" entity QNode: entity used to add the mandatory "instance of" statement to the software node created by SALTbot (if the software tool has a free license in the GitHub repository). If a software project does not have a free license, we categorize it as "Software".
• "programmed in" PNode: property used to define the programming language in which a software entity is developed.
• "download link" PNode: property used to link a software entity with its download URL.
• "copyright license" PNode: property used to specify the type of software license used by a software entity.
• "version control system" and "web interface software" PNodes: properties used as qualifiers when describing the source code repository of a software project.
• "Git" and "GitHub" QNodes: entities used with the two previous properties to add qualifiers when assigning a source code repository URL to a software entity.

3.2. Workflow

Figure 2 shows an overview of the decision-making workflow followed by SALTbot. We start from a code repository URL. The first step is to extract all the relevant metadata using SOMEF, which generates a JSON file with all the metadata in the code repository. In particular, SOMEF detects citation information with one or more preferred citations from the authors in both BibTeX and CFF (in YAML) formats, which are the ones we focus on. SALTbot then calls the Searcher module to parse all the BibTeX and YAML citations in order to find titles of scholarly articles, as well as other information such as the Digital Object Identifier (DOI) of an article, if present. Once the candidate titles are extracted, the bot queries the target KG for entities which are instances of scholarly article and whose label is the title from the citation, filtering by the corresponding DOI. If no articles are found using the parsed citation, the Searcher module will attempt to find scholarly article entities using the GitHub repository name.
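The article lookup just described can be sketched as a SPARQL template. The identifiers below are Wikidata's ("instance of" P31, "scholarly article" Q13442814, DOI P356); since SALTbot resolves such identifiers at configuration time for each Wikibase instance, this template is an illustration rather than the bot's actual query:

```python
# Illustrative SPARQL template for the article search described above.
# P31 = "instance of", Q13442814 = "scholarly article", P356 = "DOI"
# (Wikidata identifiers; other Wikibase instances resolve their own).
ARTICLE_QUERY = """
SELECT ?article ?doi WHERE {{
  ?article wdt:P31 wd:Q13442814 ;
           rdfs:label "{title}"@en .
  OPTIONAL {{ ?article wdt:P356 ?doi . }}
}}
"""

def build_article_query(title: str) -> str:
    # Escape embedded double quotes so the label literal stays well-formed.
    return ARTICLE_QUERY.format(title=title.replace('"', '\\"'))

query = build_article_query('An Example Article Describing "example-tool"')
# If a query like this returns no candidates for any of the citation titles,
# the Searcher falls back to using the GitHub repository name as the label.
```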
This fallback strategy (searching by repository name) is less restrictive and consequently produces vaguer results that need to be manually verified, but it usually retrieves promising article candidates with a reference to the software project in their title. The same process is repeated to find software tools: we search for entities which inherit from the meta-class "software category" and whose label is similar to one of the parsed titles. These entities are then filtered by comparing their source code repository URL with the URL provided to SALTbot.

Figure 2: SALTbot main decision-making workflow. Starting from a code repository URL, SALTbot will extract the main citation metadata using SOMEF, search for the corresponding paper in Wikidata and then attempt to identify whether the software already exists in the target KG. If the software tool exists, SALTbot will link it to the paper. If it does not, the bot will create a new page for the tool.

We use DOIs to filter articles. If no DOIs are found in the parsed citation, or if these DOIs do not match those found in the article entities, SALTbot requires manual validation from users in order to select one of the articles found to proceed with the execution. Similarly, the repository URL allows identifying whether the software entities found correspond to the software component in the target repository. If no software candidates are found through their URL, SALTbot asks the user to choose one of the found software components or to create a new one.

Next, the Analyzer module gathers all the previously existing relationships between the article and software in the graph. Using the Analyzer output, the Statement Definer creates a list with the necessary statements to completely link the article and software entities. These statements fall into one of the following categories:

• If no software was found, SALTbot creates a new item which will be an instance of "software" and whose label will be the GitHub repository's name.
These software pages are further enriched using the repository's metadata, such as the license, the source code repository URL, the programming languages in which the repository's code is written and the fact that it uses Git as a version control system. Additionally, if the detected license is an open license and the "Free software" QNode was found in the graph, the new software item is characterized as free software (i.e., "software distributed under terms that allow users to freely run, study, change and distribute it and modified versions"9).
• If the article is not linked to its corresponding software project, SALTbot adds a new statement to the article using the "main subject" PNode, pointing to the software.
• If the software project is not linked to the article, SALTbot adds a new statement using the "described by source" PNode, pointing to the article.

Once the number of statements in the list exceeds a batch size defined by users, all statements are uploaded to the target KG using the Updater module. This process is repeated by SALTbot for each of the code repository URLs provided as input.

3.3. Uploading statements to Wikidata/Wikibase

SALTbot can be used against a local Wikibase instance or to upload new content to Wikidata. We build upon Wikibase Integrator,10 a Python library designed to read and write data into Wikibase while solving compatibility and integration problems between different Wikibase instances. In order to edit a specific Wikibase instance, SALTbot provides the necessary wrappers to automatically configure Wikibase Integrator. The following information is required for configuring SALTbot:

• A valid username and password in the desired Knowledge Graph.
• The MediaWiki API URL of the target graph.
• The Knowledge Graph SPARQL endpoint.
• The Wikibase URL of the graph.

The last three configuration items default to the corresponding Wikidata values if left unchanged.
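The defaulting behaviour described above can be sketched in plain Python. The field names and helper are illustrative, not SALTbot's actual configuration API (which wraps Wikibase Integrator), but the Wikidata endpoint URLs shown are the real public ones:

```python
# Hedged sketch of SALTbot-style configuration defaults. Field names are
# illustrative; the three endpoint values are the public Wikidata ones.
WIKIDATA_DEFAULTS = {
    "mediawiki_api_url": "https://www.wikidata.org/w/api.php",
    "sparql_endpoint_url": "https://query.wikidata.org/sparql",
    "wikibase_url": "https://www.wikidata.org",
}

def build_config(username, password, **overrides):
    """Credentials are always required; the three endpoint settings fall
    back to Wikidata when not overridden for a local Wikibase instance."""
    config = dict(WIKIDATA_DEFAULTS, **overrides)
    config.update({"username": username, "password": password})
    return config

# Targeting Wikidata: only credentials are needed.
cfg = build_config("SALTbotDev", "secret")
print(cfg["sparql_endpoint_url"])  # https://query.wikidata.org/sparql

# Targeting a local Wikibase: override the endpoints explicitly.
local = build_config("admin", "pw",
                     mediawiki_api_url="http://localhost/w/api.php",
                     sparql_endpoint_url="http://localhost/sparql",
                     wikibase_url="http://localhost")
```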
SALTbot processes each of the repositories in a semi-autonomous manner, asking for validation when necessary to decide which article or software to use if multiple candidates have been found.

9 https://www.wikidata.org/wiki/Q341
10 https://github.com/LeMyst/WikibaseIntegrator

4. SALTbot Validation

In order to assess the correct behaviour of SALTbot, we tested the bot by gathering 500 repositories from GitHub with a "CITATION.cff" file using the GitHub API11 and validating the results manually. The rationale behind our approach is to ensure the selection of code repositories with at least a suggested pointer to a publication. Our selected 500 repositories12 presented the following characteristics before our bot assessment:

• 378 repositories had one or more mentions of scholarly articles (some refer to code deposits in archives like Zenodo).
• 46 repositories had their corresponding scholarly article page in Wikidata.
• 35 repositories had their corresponding software page in Wikidata.
• 12 scholarly article entities were previously linked through the property "main subject" to their corresponding software entity.
• 5 software entities were previously linked through the property "described by source" to their corresponding article entity.

In order to perform the validation of SALTbot, we created a bot page13 and a new username in Wikidata to keep a record of the contributions performed with the tool. These contributions can be seen at https://www.wikidata.org/wiki/Special:Contributions/SALTbotDev. Figure 3 shows an example of one of our contributions to Wikidata, linking a newly added tool to an existing article. After our manual validation, SALTbot enriched Wikidata with the following knowledge:

• 33 newly created software entities.
• 104 new software metadata statements.
• 34 scholarly articles linked with their corresponding software entity (this number includes articles whose software has been created in order to link them).
• 43 software entities linked with their corresponding scholarly articles (this number includes those software QNodes newly created by SALTbot in order to link them).

While validating SALTbot, we noticed how our approach blends in with the Wikidata ecosystem. Shortly after we created new software entities, other bots like Github-wiki-bot started improving the existing page descriptions with their release contents (184 statements regarding software version identifiers and official pages were added to our newly created software entities).

5. Discussion

In 2018, GitHub reached the staggering milestone of holding more than a hundred million code repositories.14 In comparison, ten thousand repositories with a CITATION.cff file seems like a very small fraction, but the number of repositories containing citation files is slowly growing. This number also suggests that many research software projects in GitHub may be lacking a citation file indicating the correct way to cite the software in a machine-readable manner. We believe that continuously running SALTbot will increasingly enrich Wikidata with links between articles and software. In addition, SALTbot enriches software entities with existing metadata by following current Wikidata practices for modeling software. Incorporating additional metadata elements (e.g., from CodeMeta15) may help to further increase the usefulness of our contributions in the target Wikibase KG.

11 https://api.github.com/
12 Available at: https://github.com/SoftwareUnderstanding/SALTbot/blob/main/WikidataFindings.csv
13 https://www.wikidata.org/wiki/User:SALTbot
14 https://github.blog/2018-11-08-100M-repos/

Figure 3: An example result from SALTbot for Widoco, a tool for documenting ontologies. In this case, the bot creates the page for the software tool, describes it with metadata (description, license, code repository, etc.) and links it to the existing article in Wikidata.
Our approach is orthogonal to the efforts of other platforms like Papers With Code16 or arXiv,17 which scan data/software availability statements or whole publications (manually or automatically) to find the corresponding associated code repositories. Instead, we analyze code repositories, assessing the direct citation preference declared by authors.

As for limitations, our main challenge is unambiguously identifying scholarly articles in Wikibase instances. Our approach attempts to use the article's DOI to identify it in the graph; however, this presents the following issues:

• Not all repositories have an explicit reference to the article's DOI.
• Not all scholarly articles are currently linked to their corresponding DOI in Knowledge Graphs.
• Scholarly articles may have other identifiers, such as an arXiv ID or a Zenodo ID, which may also be missing in the citation or README files.

Currently we address these issues by asking for user input, which hinders full process automation for some repositories. Relying on external sources like OpenAlex18 and Crossref19 may help address this problem.

Finally, SALTbot relies on linking software to publications that already exist in Wikibase/Wikidata. Papers that are not part of the KG are currently out of the scope of the application. However, as shown in our manual validation, a significant number of tools belong to articles that are not currently part of the KG, so creating new article pages may be beneficial to include more tool implementations.

15 https://github.com/codemeta/codemeta/blob/master/crosswalks/Wikidata.csv
16 https://paperswithcode.com/
17 https://arxiv.org/

6. Conclusions and Future Work

In this paper we introduced SALTbot, our effort towards enriching Wikibase/Wikidata with the software implementations of existing research articles. We have manually validated our approach with 500 code repositories, resulting in 33 new software entities and over 40 new software-paper links.
SALTbot's contributions are integrated within the Wikidata ecosystem, with other bots building and expanding on our work. We believe that, as developers continue adopting software citation best practices, SALTbot will become increasingly useful to the Wikidata and scientific communities.

Our future work includes three main improvements. First, we are currently running SALTbot on nearly ten thousand additional repositories with CFF files, manually validating the results when needed. Second, we are exploring running the bot on repositories with other types of citation files (e.g., through BibTeX), which are also detected by SOMEF. Finally, we will explore automatically creating scholarly article entities in the same way we do with software entities. However, this feature requires further research, especially when determining how to correctly characterize scholarly articles in Knowledge Graphs (avoiding possible duplicates), how much article metadata can be obtained from the citation found in a code repository, and how to assess the validity of the final results.

Acknowledgments

This work was supported by the Comunidad de Madrid under the Multiannual Agreement with Universidad Politécnica de Madrid (UPM) in the line Support for R&D projects for Beatriz Galindo researchers, in the context of the V PRICIT (Regional Programme of Research and Technological Innovation) and through the UPM call Research Grants for Young Investigators.

18 https://openalex.org/
19 https://www.crossref.org/

References

[1] N. P. Chue Hong, D. S. Katz, M. Barker, A.-L. Lamprecht, C. Martinez, F. E. Psomopoulos, J. Harrow, L. J. Castro, M. Gruenpeter, P. A. Martinez, et al., FAIR Principles for Research Software (FAIR4RS Principles), 2022. doi:10.15497/RDA00068.
[2] A. M. Smith, D. S. Katz, K. E. Niemeyer, Software citation principles, PeerJ Computer Science 2 (2016) e86.
[3] S. Druskat, J. H. Spaaks, N. Chue Hong, R. Haines, J. Baker, S. Bliven, E. Willighagen, D. Pérez-Suárez, O.
Konovalov, Citation File Format, 2021. doi:10.5281/zenodo.5171937.
[4] J. Priem, H. Piwowar, R. Orr, OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts, 2022. arXiv:2205.01833.
[5] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.
[6] J. Bolinches, D. Garijo, SoftwareUnderstanding/SALTbot: SALTbot 0.0.1: First stable release, 2023. URL: https://doi.org/10.5281/zenodo.8190001. doi:10.5281/zenodo.8190001.
[7] D. Diefenbach, M. D. Wilde, S. Alipio, Wikibase as an infrastructure for knowledge graphs: The EU knowledge graph, in: A. Hotho, E. Blomqvist, S. Dietze, A. Fokoue, Y. Ding, P. Barnaghi, A. Haller, M. Dragoni, H. Alani (Eds.), The Semantic Web – ISWC 2021, Springer International Publishing, Cham, 2021, pp. 631–647.
[8] A. Kelley, D. Garijo, A Framework for Creating Knowledge Graphs of Scientific Software Metadata, Quantitative Science Studies (2021). doi:10.1162/qss_a_00167.
[9] A. Mao, D. Garijo, S. Fakhraei, SoMEF: A framework for capturing scientific software metadata from its documentation, in: 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 3032–3037. doi:10.1109/BigData47090.2019.9006447.