<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Canonical Register of Public Sector Entities: Semantic Linking of Procurement Data at Scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Avogadro</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Makgill</string-name>
          <email>ian@spendnetwork.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleena Thomas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmet Soylu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru Roman</string-name>
        </contrib>
        <aff id="aff2">
          <institution>SINTEF AS</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Kristiania University of Applied Sciences</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Spend Network</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Public procurement accounts for over $13 trillion in annual spending, yet data about public buyers and suppliers remains fragmented, inconsistent, and difficult to link across jurisdictions. This paper presents a practical industrial solution developed by Spend Network within the European project enRichMyData to semantically enrich and reconcile procurement data at scale. The proposed pipeline combines large language models (LLMs) with knowledge graphs (KGs) to create and maintain a canonical register of public sector entities. It supports multilingual, cross-border integration and is designed to serve both public transparency and commercial applications. The pipeline has been evaluated on a manually curated benchmark of 1,000 procurement-related entities and demonstrates high precision and scalability in real-world settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Linking</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Procurement Data</kwd>
        <kwd>Semantic Technologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Government procurement is a key area of public spending and accountability, with over $13 trillion
spent worldwide each year. Despite the introduction of standards such as the Open Contracting Data
Standard (OCDS) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], data about government buyers and suppliers remains difficult to reconcile
because of inconsistent naming, multilingual variation, and missing canonical references. This
hampers transparency, compliance checks, and cross-border cooperation. Knowledge graphs such as
Wikidata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] provide a foundation for such reference alignment.
      </p>
      <p>Earlier entity matching efforts based on standard fuzzy matching techniques (e.g., Levenshtein
distance) often performed poorly, producing either too many false negatives or too many false positives,
which made large-scale reconciliation economically unviable.</p>
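      <p>The trade-off between false positives and false negatives can be illustrated with a minimal Python sketch, assuming a plain edit-distance matcher with a fixed threshold. The entity names and the threshold of 2 are hypothetical and chosen only to show both failure modes; they are not taken from the procurement data described in this paper.</p>
      <preformat>
```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_match(a, b, max_distance=2):
    """Hypothetical rule: accept a pair when the distance is within the threshold."""
    return max_distance >= levenshtein(a.lower(), b.lower())


# Different cities, yet only two edits apart: a false positive.
false_positive = is_match("City of Bergen", "City of Berlin")

# Same body written as an abbreviation, yet many edits apart: a false negative.
false_negative = not is_match("MoD", "Ministry of Defence")
```
      </preformat>
      <p>Both flags evaluate to True. Raising the threshold to recover the abbreviation would also accept many more unrelated pairs, which is the difficulty the paragraph above describes.</p>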
      <p>To address this challenge, within the enRichMyData project<sup>1</sup>, we developed a semantic linking
pipeline for Spend Network<sup>2</sup>, which maintains the largest known collection of OCDS procurement records.
The pipeline supports the creation of a canonical, structured, and continuously updated register of public
sector entities. This register supports multiple use cases ranging from compliance (e.g., Environmental,
Social, and Governance (ESG) reporting or procurement law) and cross-border collaboration to sales
intelligence and civil society oversight.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Semantic Linking Strategy</title>
      <p>
        The pipeline follows a hybrid architecture combining knowledge graphs with large language models.
This approach builds on advances in transformer-based models such as BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <list list-type="order">
        <list-item>
          <p>Candidate Generation: Entities are extracted from procurement datasets and matched against
canonical references using hybrid search combining vector similarity and approximate string
matching.</p>
        </list-item>
        <list-item>
          <p>
            Ranking and Validation: LLMs rank and validate candidate entities in context. Entities are
approved automatically when confidence is high or reviewed manually in ambiguous cases. This
process is informed by prior work on zero-shot and neural-based entity linking [
            <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
            ].
          </p>
        </list-item>
        <list-item>
          <p>Reconciliation and Enrichment: Once linked, entities are enriched with structured data from
public registers (e.g., URLs, legal identifiers, sectors) and linked to a central reference.</p>
        </list-item>
        <list-item>
          <p>Access and Integration: The data is made available via APIs and downloadable formats to
support dashboards, compliance systems, and bulk integration with private and public tools.</p>
        </list-item>
      </list>
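      <p>The first two steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the production pipeline: the register entries and identifiers are hypothetical, character trigram counts stand in for a learned embedding, the equal weighting of the two signals is arbitrary, and the combined score stands in for the confidence that an LLM ranker would produce. The 0.85 approval threshold is likewise an assumption.</p>
      <preformat>
```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

# Hypothetical canonical register: identifier -> canonical name.
REGISTER = {
    "REG-001": "Ministry of Defence",
    "REG-002": "Ministry of Justice",
    "REG-003": "Oslo Municipality",
}


def trigram_vector(text):
    """Bag of character trigrams, a toy stand-in for a learned embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def string_similarity(a, b):
    """Approximate string match ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_candidates(mention, top_k=2, vector_weight=0.5):
    """Rank register entries by a weighted mix of both similarity signals."""
    mention_vec = trigram_vector(mention)
    scored = []
    for entity_id, name in REGISTER.items():
        score = (vector_weight * cosine(mention_vec, trigram_vector(name))
                 + (1 - vector_weight) * string_similarity(mention, name))
        scored.append((entity_id, round(score, 3)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def route(candidates, auto_threshold=0.85):
    """Auto-approve a confident top candidate, otherwise queue it for review."""
    if candidates and candidates[0][1] >= auto_threshold:
        return ("approved", candidates[0][0])
    return ("manual_review", None)


decision = route(generate_candidates("Ministry of Defense"))
```
      </preformat>
      <p>With this toy register, the spelling variant "Ministry of Defense" is approved against the hypothetical entry REG-001, while a mention with no close counterpart falls below the threshold and is routed to manual review.</p>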
      <p>As shown in Figure 1, the full pipeline processes OCDS procurement data through enrichment,
linking, and delivery to end-user dashboards. A more detailed view of the semantic linking layer using
LLMs and validation workflows is shown in Figure 2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Deployment Context</title>
      <p>
        Spend Network maintains the largest known collection of OCDS procurement records, with over 180
million entries aggregated from EU and international sources [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Within the enRichMyData project,
the aim is to build a public register of entities that are legally required to publish procurement records
under the EU Procurement Directive. The envisioned service is designed to support both public interest
(e.g., transparency, compliance) and commercial use (e.g., data integrations via API or bulk access). It
builds on Spend Network’s existing infrastructure and leverages the enRichMyData toolbox to support
entity discovery, classification, cleansing, and reconciliation.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Ground Truth and Evaluation</title>
      <p>To evaluate the linking pipeline, we created a ground truth dataset of 1,000 procurement-related entries.
Each row is annotated with a Wikidata entity ID or marked as NIL where no appropriate match exists.
Approximately 22.9% of the rows are NIL cases, reflecting realistic ambiguity and
out-of-knowledge scenarios. The experiments were conducted using the Lion Linker<sup>3</sup>, an open-source entity
linking Python library developed within the enRichMyData project.
        <fn id="fn3">
          <label>3</label>
          <p>https://github.com/enRichMyData/lion_linker</p>
        </fn>
      </p>
      <p>
        We evaluated multiple LLMs across different prompt configurations using precision@1 as our metric.
Since each model outputs exactly one prediction per row and every row has a ground truth label, this
measure effectively captures system accuracy without confounding effects from recall. It aligns with
standard evaluation practices in entity linking and retrieval benchmarks [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The best-performing
model (Gemma3:12b, few-shot prompt) achieved a precision@1 of 77.7%. The ground truth dataset is
publicly available on Zenodo.<sup>4</sup>
      </p>
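      <p>For concreteness, precision@1 in this setting can be computed as in the following minimal Python sketch. It assumes that a NIL prediction counts as correct only when the gold label is also NIL, which is our reading of the setup rather than a documented detail of the evaluation script; the identifiers are hypothetical and not taken from the benchmark. On a 1,000-row benchmark, a precision@1 of 77.7% corresponds to 777 correctly labelled rows.</p>
      <preformat>
```python
NIL = "NIL"


def precision_at_1(predictions, gold):
    """Share of rows whose single prediction equals the gold label.

    Assumption: a NIL prediction is correct only when the gold label is NIL.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


# Hypothetical identifiers, not taken from the benchmark.
gold = ["ID-1", "ID-2", NIL, "ID-4"]
predictions = ["ID-1", "ID-9", NIL, NIL]
score = precision_at_1(predictions, gold)  # two of four rows are correct
```
      </preformat>
      <p>Because every row receives exactly one prediction and one gold label, this value coincides with row-level accuracy.</p>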
    </sec>
    <sec id="sec-5">
      <title>5. Business Value and Use Cases</title>
      <p>The solution addresses concrete needs:</p>
      <list list-type="bullet">
        <list-item>
          <p>
            Public sector: Enables compliance checks, inter-agency collaboration, and accurate public
registers [
            <xref ref-type="bibr" rid="ref12 ref13 ref9">9, 12, 13</xref>
            ].
          </p>
        </list-item>
        <list-item>
          <p>Private sector: Supports due diligence, ESG reporting,
and Customer Relationship Management (CRM) system integration.</p>
        </list-item>
        <list-item>
          <p>Civil society: Empowers journalists, NGOs, and citizens with better transparency tools.</p>
        </list-item>
      </list>
      <p>The pipeline powers OpenOpportunities<sup>5</sup> and has been adopted in multiple use cases, including a
national compliance system in a non-EU European country.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Lessons Learned and Future Work</title>
      <p>Lessons: Hybrid pipelines improve linking accuracy. Confidence scoring reduces human validation
needs. Reconciliation across multilingual and evolving registers remains a challenge.</p>
      <p>Next steps: Scale coverage across the EU, increase multilingual robustness, and expand entity types
to cover beneficial owners and subnational bodies.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the European Union’s Horizon 2020 research and innovation programme
under grant agreements No. 101070284 (enRichMyData) and No. 101093216 (UPCAST). We thank Spend
Network for their contribution.</p>
      <fn-group>
        <fn id="fn4">
          <label>4</label>
          <p>https://zenodo.org/records/15745734</p>
        </fn>
        <fn id="fn5">
          <label>5</label>
          <p>https://www.openopps.com</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>This paper used ChatGPT (OpenAI) for drafting assistance, grammar checking, and paraphrasing. No
AI was used for generating results or conclusions; the authors take full responsibility for the content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elvesaeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Turk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simperl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Konstantinidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Lech</surname>
          </string-name>
          ,
          <article-title>Towards an ontology for public procurement based on the open contracting data standard</article-title>
          ,
          <source>in: Digital Transformation for a Sustainable Society in the 21st Century: 18th IFIP WG 6.11 Conference on e-Business, e-Services, and e-Society, I3E 2019, Trondheim, Norway, September 18-20, 2019, Proceedings 18</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>230</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. E. K.</given-names>
            <surname>Niessen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Paciello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. I. P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <article-title>Anomaly detection in public procurements using the open contracting data standard</article-title>
          ,
          <source>in: 2020 Seventh International Conference on eDemocracy &amp; eGovernment (ICEDEG)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Felizzola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arrieta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jerez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Erazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Camacho</surname>
          </string-name>
          ,
          <article-title>Enhancing transparency in public procurement: A data-driven analytics approach</article-title>
          ,
          <source>Information Systems</source>
          <volume>125</volume>
          (
          <year>2024</year>
          )
          <fpage>102430</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Logeswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Zero-shot entity linking by reading entity descriptions</article-title>
          ,
          <source>arXiv preprint arXiv:1906.07348</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.-E.</given-names>
            <surname>Ganea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <article-title>Deep joint entity disambiguation with local neural attention</article-title>
          ,
          <source>arXiv preprint arXiv:1704.04920</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Jayawardene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Avogadro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roman</surname>
          </string-name>
          ,
          <article-title>Tablinkllm: An llm-based approach for entity linking in tabular data</article-title>
          ,
          <source>in: 2024 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elvesaeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Badenes-Olmedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blount</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yedro Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kovacic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Posinkovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Makgill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Taggart</surname>
          </string-name>
          , et al.,
          <article-title>Theybuyforyou platform and knowledge graph: Expanding horizons in public procurement with open linked data</article-title>
          ,
          <source>Semantic Web</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>265</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Röder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brümmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ceccarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornolti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cherix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eickmann</surname>
          </string-name>
          , et al.,
          <article-title>Gerbil: general entity annotator benchmarking framework</article-title>
          ,
          <source>in: Proceedings of the 24th international conference on World Wide Web</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1133</fpage>
          -
          <lpage>1143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08663</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elvesaeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Badenes-Olmedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Y.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kovacic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Posinkovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Makgill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Taggart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simperl</surname>
          </string-name>
          , et al.,
          <article-title>Enhancing public procurement in the european union through constructing and exploiting an integrated knowledge graph</article-title>
          ,
          <source>in: The Semantic Web – ISWC 2020: 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part II 19</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Simperl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Ruíz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Taggart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Klima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Uliana</surname>
          </string-name>
          , et al.,
          <article-title>Towards a knowledge graph based platform for public procurement</article-title>
          ,
          <source>in: Research Conference on Metadata and Semantics Research</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>317</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>