<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Softalias-KG: Reconciling Software Mentions in Scientific Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Esteban González-Guardia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hector Lopez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Garijo</string-name>
          <email>daniel.garijo@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Research software (i.e., the tools and scripts developed as part of a research investigation) is key to supporting the results described in academic publications. However, current citation practices make it difficult to automatically identify and reconcile the tools used in existing publications (different names are used for the same tool, citation styles differ, etc.). In this demo we address the reconciliation of software mentions in the biomedical domain with Softalias-KG, a Knowledge Graph of software aliases that is used to reconcile the software tools mentioned in a text.</p>
      </abstract>
      <kwd-group>
        <kwd>software alias</kwd>
        <kwd>software metadata</kwd>
        <kwd>software reconciliation</kwd>
        <kwd>knowledge graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Research Software [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is increasingly becoming a first-class research product due to its
role in supporting computational results.<sup>1</sup> Open platforms like Papers with Code<sup>2</sup> or OpenAIRE<sup>3</sup>
are dedicated to tracking the links between research articles and code, and preprint
archives like arXiv<sup>4</sup> are starting to display such information to readers.
      </p>
      <p>
        In order to ease software citation, the scientific community has developed guidelines and
software citation formats for developers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, researchers often use different names
to refer to the tools they use in their work. For example, the “Statistical Package for the
Social Sciences”<sup>5</sup> is also known as “SPSS” and “SPSS Statistics”, among many other variations.
This makes tool reconciliation challenging and makes it difficult to give tool developers
proper credit.
      </p>
      <p>
        In this work we introduce Softalias-KG, a Knowledge Graph of software aliases created from a
recent analysis and software mention dataset of over 3.8 million papers from PubMed Central [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Our demo uses Softalias-KG as a reconciliation service for similar tool mentions, which we have
integrated with a state-of-the-art Named Entity Recognition (NER) model (Softcite [4]) trained
in the biomedical domain to extract software mentions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Softalias-KG: A Knowledge Graph of Software Aliases</title>
      <p>
        We base our work on [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a recent analysis from the Chan-Zuckerberg Initiative (CZI) in which
the authors extracted software mentions from over 3.8 million papers in PubMed Central<sup>6</sup>
and made their results available online [5]. In that paper, the authors use a clustering algorithm
based on Jaro-Winkler distance [6] to group similar software mentions, and then find the
package managers (PyPI,<sup>7</sup> CRAN,<sup>8</sup> Bioconductor<sup>9</sup>), software registries (SciCrunch<sup>10</sup>) and
repositories (GitHub<sup>11</sup>) to which each cluster of mentions is most likely linked. For each potential
link, basic metadata of the software mention is downloaded from the corresponding platform
(description, package URL, identifier, etc.). Unfortunately, the results are stored in pickle and
CSV files designed to be used in notebooks. We have cleaned and transformed these results into
Softalias-KG, focusing on the clustering analysis of software mentions (i.e., software aliases)
to facilitate reconciliation through SPARQL queries. We have also expanded the results with
software entities from Wikidata [7], in order to generalize the application domain of the KG.
      </p>
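      <p>As an illustration of the string measure behind the clustering step, the following minimal Python sketch computes the Jaro-Winkler similarity between two mentions. It is a textbook formulation written for this example, not the implementation used in [<xref ref-type="bibr" rid="ref3">3</xref>]; the function names, the prefix scaling factor of 0.1 and the four-character prefix cap are assumptions.</p>
      <preformat>
```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Two aliases of the same tool share a long prefix and score highly.
print(round(jaro_winkler("SPSS", "SPSS Statistics"), 3))
```
      </preformat>
      <p>A clustering procedure would then group mentions whose pairwise similarity exceeds a chosen threshold; the threshold and the clustering details are those of the original analysis and are not reproduced here.</p>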
      <sec id="sec-3-1">
        <title>2.1. Representing Aliases and Groups</title>
        <p>
          Softalias-KG includes two main types of entities: software aliases, i.e., software mentions as
detected in any scientific publication, and groups, which represent single software applications
(schema:SoftwareApplication) grouping a cluster of aliases based on [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Groups have
a canonical name, based on their most frequent alias, and are described with metadata
such as URL and license. All metadata are obtained from external
sources (PyPI, CRAN, etc.) and represented using Schema.org [8]. The platform that provided
the metadata for each group is also kept in the KG (schema:provider), in order to preserve the provenance
of each record. Groups are linked to their corresponding aliases through the alias property (a
software group has one or more aliases).
        </p>
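        <p>The group and alias structure described above can be pictured with a small in-memory model. The sketch below is an illustration only: the class name, field names, example URL and mention counts are hypothetical, and the KG itself stores this information as RDF rather than as Python objects.</p>
        <preformat>
```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SoftwareGroup:
    """One software application (schema:SoftwareApplication) and its aliases."""

    url: str        # metadata obtained from the external platform
    provider: str   # platform the metadata came from (schema:provider)
    alias_counts: Counter = field(default_factory=Counter)

    def add_mention(self, alias: str, times: int = 1) -> None:
        """Record that `alias` was mentioned `times` times in the literature."""
        self.alias_counts[alias] += times

    @property
    def canonical_name(self) -> str:
        """The most frequent alias; raises IndexError if there are no aliases."""
        return self.alias_counts.most_common(1)[0][0]


# Hypothetical group: the URL and the counts are made up for this example.
group = SoftwareGroup(url="https://example.org/spss", provider="SciCrunch")
group.add_mention("SPSS", 120)
group.add_mention("SPSS Statistics", 45)
group.add_mention("Statistical Package for the Social Sciences", 30)
print(group.canonical_name)
```
        </preformat>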
      </sec>
      <sec id="sec-3-2">
        <title>2.2. RDF Transformation and Cleaning</title>
        <p>
          The CZI dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] consists of two sets of files. A first set of pickle files describes the software
mention groups matched to each external platform (PyPI, CRAN, etc.), together with a file
assigning an id to each software mention in a publication. A second set of CSV files contains the
metadata for each software entry on each external platform.
        </p>
        <p>We unified all pickle files (.pkl) in a single JSON representation, counting the number of
times a software alias is used in the literature. Next, we removed mentions with null repetitions,
aliases with null ids and groups containing no aliases. Upon further inspection, we
also removed the mention links matched to GitHub from the first version of the KG, due to
inconsistencies found in the reconciliation. The set of CSV files was also cleaned by removing
empty quotes and parentheses. The final JSON and CSV files were then mapped with YARRRML<sup>12</sup>
and converted into RDF using Morph-KGC [9].<sup>13</sup> The final Knowledge Graph contains over
50K aliases, which correspond to over 34K unique software application groups.</p>
        <p>
          <fn><label>6</label><p>https://www.ncbi.nlm.nih.gov/pmc/</p></fn>
          <fn><label>7</label><p>https://pypi.org/</p></fn>
          <fn><label>8</label><p>https://cran.r-project.org/</p></fn>
          <fn><label>9</label><p>https://contributions.bioconductor.org/index.html</p></fn>
          <fn><label>10</label><p>https://scicrunch.org/</p></fn>
          <fn><label>11</label><p>https://github.com/</p></fn>
        </p>
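        <p>The counting and filtering steps described above can be sketched as follows. The record layout (alias id, alias text, group id) and the sample values are assumptions made for illustration, and the original pickle files may be structured differently. Because the counts are derived from the records here, the removal of mentions with null repetitions is not shown.</p>
        <preformat>
```python
from collections import Counter

# Hypothetical input records: (alias_id, alias_text, group_id).
mentions = [
    ("a1", "SPSS", "g1"),
    ("a1", "SPSS", "g1"),
    ("a2", "SPSS Statistics", "g1"),
    (None, "spss v22", "g1"),    # alias with a null id: dropped
    ("a3", "ImageJ", "g2"),
]
group_ids = ["g1", "g2", "g3"]   # g3 has no aliases: dropped


def clean(mentions, group_ids):
    """Count alias usage per group, skipping null ids and empty groups."""
    counts = Counter(
        (group_id, alias)
        for alias_id, alias, group_id in mentions
        if alias_id is not None
    )
    groups = {gid: {} for gid in group_ids}
    for (gid, alias), n in counts.items():
        groups.setdefault(gid, {})[alias] = n
    # Keep only the groups that still contain one or more aliases.
    return {gid: aliases for gid, aliases in groups.items() if aliases}


cleaned = clean(mentions, group_ids)
print(cleaned)
```
        </preformat>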
        <p>We have also enriched Softalias-KG with software entities and metadata from Wikidata, one
of the largest crowdsourced Knowledge Graphs to date. In particular, we have imported over 8K
software applications (free software, wd:Q341), with their corresponding 3K alternative labels
(skos:altLabel).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Demonstration: Reconciling Software Mentions</title>
      <p>Our demo uses Softalias-KG as a software reconciliation and metadata service. Figure 1 shows
an overview of our demo, where users may enter a paragraph of text to search for software
mentions (top part of the figure). After clicking on the “Analyze” button, our service runs
Softcite [4, 10] (version 0.7.1), a named entity recognition model trained on over 5K open
research papers in the biomedical and economics domains. Softcite returns a list of candidate
software mentions, which are highlighted in yellow in the text below the “Analyze” button.</p>
      <p>
        <fn><label>12</label><p>https://rml.io/yarrrml/</p></fn>
        <fn><label>13</label><p>https://github.com/morph-kgc/morph-kgc</p></fn>
      </p>
      <p>Each detected mention is then used to query Softalias-KG in order to find the canonical name
for that software application. Internally, a SPARQL query retrieves the aliases matching the mention
name and returns the groups that each alias belongs to, as well as additional metadata such as the
URL where the software may be found (from PyPI, CRAN, SciCrunch or Wikidata). We also
retrieve other aliases used for that software application, along with their number of mentions in
the scientific literature (when available). Our demo,<sup>14</sup> code [11]<sup>15</sup> and mappings [12]<sup>16</sup> are available
online.</p>
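      <p>The lookup performed by the query can be mimicked in memory. The following sketch is not the SPARQL query used by the demo: it replaces the KG with a small Python dictionary whose group ids, names, URLs and mention counts are invented for illustration, and it only shows the shape of a reconciliation result (canonical name, URL and other aliases with their counts).</p>
      <preformat>
```python
# Toy stand-in for the KG: group id -> canonical name, URL and alias counts.
# Every value below is illustrative and not taken from Softalias-KG.
GROUPS = {
    "g1": {
        "name": "SPSS",
        "url": "https://example.org/spss",
        "aliases": {"SPSS": 120, "SPSS Statistics": 45},
    },
    "g2": {
        "name": "ImageJ",
        "url": "https://example.org/imagej",
        "aliases": {"ImageJ": 80, "Image J": 12},
    },
}


def reconcile(mention: str) -> list:
    """Return one result per group that lists `mention` among its aliases."""
    results = []
    for group_id, group in GROUPS.items():
        aliases = group["aliases"]
        if mention not in aliases:
            continue
        results.append(
            {
                "group": group_id,
                "canonical_name": group["name"],
                "url": group["url"],
                # Other aliases of the same tool, with their mention counts.
                "other_aliases": {
                    a: n for a, n in aliases.items() if a != mention
                },
            }
        )
    return results


print(reconcile("SPSS Statistics"))
```
      </preformat>
      <p>An exact string match is used here for simplicity; a mention that is not a known alias yields an empty result list.</p>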
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions and Future Work</title>
      <p>
        This demo integrates a state-of-the-art named entity recognition model for detecting software
mentions in text (Softcite) with Softalias-KG, a novel Knowledge Graph of software aliases that
is used for tool reconciliation. Softalias-KG is based on an existing analysis of millions of
papers from the scientific literature in the biomedical domain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], contains more than 50K unique
aliases (belonging to more than 30K unique tools) and is enriched with over 8K tools from
Wikidata. Thanks to our demo, we can easily navigate through existing tool aliases, as well as
detect potential omissions in the NER model and our Knowledge Graph.
      </p>
      <p>Our next steps will focus on identifying and addressing clustering errors (e.g., by looking
into software groups with different tool URLs, joining software groups sharing the same aliases,
etc.) and expanding Softalias-KG with SoMeSci KG [13], another, smaller KG of software mentions
that is already aligned with Wikidata.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the Madrid Government (Comunidad de Madrid - Spain) under the
Multiannual Agreement with Universidad Politécnica de Madrid in the line Support for R&amp;D
projects for Beatriz Galindo researchers, in the context of the VPRICIT, and through the call
Research Grants for Young Investigators from Universidad Politécnica de Madrid.</p>
      <p>
        <fn><label>14</label><p>Demo: https://w3id.org/softalias/demo</p></fn>
        <fn><label>15</label><p>Demo code: https://github.com/SoftwareUnderstanding/softalias-rs</p></fn>
        <fn><label>16</label><p>Mappings and transformation scripts: https://github.com/SoftwareUnderstanding/softalias-kg</p></fn>
      </p>
      <ref-list>
        <ref id="ref4">
          <mixed-citation>[4] P. Lopez, C. Du, J. Cohoon, K. Ram, J. Howison, Mining software entities in scientific literature: Document-level NER for an extremely imbalance and large-scale task, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, ACM, New York, NY, USA, 2021, pp. 3986–3995. doi:10.1145/3459637.3481936.</mixed-citation>
        </ref>
        <ref id="ref5">
          <mixed-citation>[5] A.-M. Istrate, B. Veytsman, D. Li, D. Taraborelli, M. Torkar, I. Williams, CZ Software Mentions: A large dataset of software mentions in the biomedical literature, 2022. doi:10.5061/DRYAD.6WWPZGN2C.</mixed-citation>
        </ref>
        <ref id="ref6">
          <mixed-citation>[6] W. Winkler, String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage, Proceedings of the Section on Survey Research Methods (1990). URL: https://eric.ed.gov/?id=ED325505.</mixed-citation>
        </ref>
        <ref id="ref7">
          <mixed-citation>[7] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>[8] R. V. Guha, D. Brickley, S. Macbeth, Schema.org: evolution of structured data on the web, Communications of the ACM 59 (2016) 44–51.</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>[9] J. Arenas-Guerrero, D. Chaves-Fraga, J. Toledo, M. S. Pérez, O. Corcho, Morph-KGC: Scalable knowledge graph materialization with mapping partitions, Semantic Web (2022). doi:10.3233/SW-223135.</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>[10] Softcite software mention recognizer, https://github.com/ourresearch/software-mentions, 2018–2021.</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>[11] E. González-Guardia, SoftwareUnderstanding/softalias-rs: Code used for ISWC2023 demo, 2023. doi:10.5281/zenodo.8338240.</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>[12] D. Garijo, H. Lopez, SoftwareUnderstanding/softalias-kg: Softalias-kg: First release of KG transformation scripts, 2023. doi:10.5281/zenodo.8341333.</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>[13] D. Schindler, F. Bensmann, S. Dietze, F. Krüger, SoMeSci - a 5 star open data gold standard knowledge graph of software mentions in scientific articles, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, ACM, New York, NY, USA, 2021, pp. 4574–4583. doi:10.1145/3459637.3482017.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Chue Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Lamprecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Psomopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gruenpeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Martinez</surname>
          </string-name>
          , et al.,
          <source>FAIR Principles for Research Software (FAIR4RS Principles)</source>
          ,
          <year>2022</year>
          . doi:
          <pub-id pub-id-type="doi">10.15497/RDA00068</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Druskat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Spaaks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chue Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bliven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Willighagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pérez-Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Konovalov</surname>
          </string-name>
          , Citation File Format,
          <year>2021</year>
          . doi:
          <pub-id pub-id-type="doi">10.5281/zenodo.5171937</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Istrate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Taraborelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Torkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Veytsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>A large dataset of software mentions in the biomedical literature</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">2209.00693</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>