<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Integrating Public Procurement Data into a Semantic Knowledge Graph</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ahmet Soylu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Corcho</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Simperl</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru Roman</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Y. Martínez</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Taggart</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Makgill</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Elvesæter</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben Symonds</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helen McNally</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Konstantinidis</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuchen Zhao</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Till C. Lech</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>OpenCorporates Ltd</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>OpenOpps Ltd</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SINTEF Digital</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Southampton</institution>
          ,
          <addr-line>Southampton</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Public procurement accounts for a substantial part of public investment and the global economy. Therefore, improving the effectiveness, efficiency, transparency and accountability of government procurement is of broad interest. To this end, in this poster paper, we present our approach for integrating procurement data, including public spending and corporate data, from multiple sources across the EU into a semantic knowledge graph. We aim to improve procurement processes by supporting multiple stakeholders, such as government agencies, companies, control authorities, journalists, researchers, and individual citizens.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graph</kwd>
        <kwd>Public procurement</kwd>
        <kwd>Ontology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Public procurement accounts for a substantial part of public investment and
the global economy. Every year, over 250,000 public authorities in the EU spend
around 14% of GDP on the purchase of services, works and supplies<sup>1</sup>. Therefore,
improving the effectiveness, efficiency, transparency and accountability of government
procurement is of broad interest [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To this end, the European Commission has
put forward several directives, for example on public sector information
(Directive 2003/98/EC) and on public procurement (Directive 2014/24/EU),
to improve public procurement practices. As a result, national public
procurement portals have been created, which coexist with regional, local
and EU-wide public procurement portals. However, there is no common
agreement across the EU (in many cases not even within the same country)
on the data formats or the data models to be used for exposing such data,
which leads to considerable heterogeneity in the data that is exposed.
      </p>
      <p>
        In Europe, contracting portals like Tenders Electronic Daily<sup>2</sup> (TED) may be
seen as a way to homogenise the data that is being provided. Unfortunately,
this portal is only used for contracts that exceed a predefined
budget threshold. It therefore does not cover the full range of
public contracts, nor does it impose its format on contracts that
do not need to be published there. The only relevant data model that is gaining
significant traction worldwide is the Open Contracting Data Standard<sup>3</sup>
(OCDS). However, it has been developed mostly with a focus on transparency in
public contracting procedures. Although several ontologies, such as LOTED2
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], PPROC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PCO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the upcoming eProcurement ontology<sup>4</sup>, have been developed
with different levels of detail and focus for representing procurement data, there is
no solution that integrates supplier and procurement data to enable, for example, the matching
of suppliers and buyers, advanced analytics, and procurement intelligence.
      </p>
      <p>In this poster paper, we present our approach, in the context of the
TheyBuyForYou<sup>5</sup> project, for integrating procurement data, including public spending and
corporate data, from multiple sources across the EU into a knowledge graph. We
aim to improve procurement processes by supporting multiple
stakeholders, such as government agencies, companies, control authorities, journalists,
researchers, and individual citizens. The proposed solution enables developers
to create fully functional, robust, and scalable data integration pipelines, covering
every step from sourcing the data to pre-processing, augmenting, and interlinking it.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Data Sources</title>
      <p>High-quality company (i.e., legal entity) and procurement (e.g., tender and
contract) data are needed to form an interconnected knowledge graph for public
procurement. However, two obstacles stand in the way. Firstly, the vast majority of external
government spending (i.e., not government-to-government) is with companies,
yet the government's own records often contain no explicit and unambiguous reference
to the legal entities involved. Secondly, to truly understand the scope of
procurement data across the EU, we must identify
and record the data sources that exist alongside the formal TED in Europe, such
as the procurement transparency initiatives of individual countries, which include data
from tender alert sites, contract registers and spending data. In our context, we
collect this information from two main providers: OpenCorporates<sup>6</sup> for
supplier (i.e., company) data and OpenOpps<sup>7</sup> for procurement data.</p>
      <p>OpenCorporates makes data on 140 million legal entities, amounting to
hundreds of gigabytes, available through an API. Data is collected from
national company registers and other regulatory sources. OpenCorporates uses
a variety of data extraction methods, depending on the format of the source
data. Where structured data files are available they are imported, whereas
less structured sources require some scraping. OpenCorporates' company
data is mapped to its own schema, and inactive companies and sole traders are
identified and categorised where possible. OpenOpps gathers tender and
contract data from European sites like TED as well as many large national
portals, over 300 GB of data from over 450 European sources, and makes over 2
million documents, dating back to 2010, available through an API. The data
includes details on buyers, suppliers (for contracts), titles, descriptions, values
and categories. OpenOpps extracts data from these sources using scraper scripts,
and the extracted data is formatted according to the OCDS. Where Common
Procurement Vocabulary<sup>8</sup> (CPV) codes, which classify the subjects of procurement
contracts, are not available, the data is augmented with them. Tender notice
documents are gathered and referenced whenever possible.
        <fn id="fn2"><label>2</label><p>http://ted.europa.eu</p></fn>
        <fn id="fn3"><label>3</label><p>http://standard.open-contracting.org</p></fn>
        <fn id="fn4"><label>4</label><p>https://github.com/eprocurementontology/eprocurementontology</p></fn>
        <fn id="fn5"><label>5</label><p>https://theybuyforyou.eu</p></fn>
        <fn id="fn6"><label>6</label><p>https://opencorporates.com</p></fn>
        <fn id="fn7"><label>7</label><p>https://openopps.com</p></fn>
      </p>
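      <p>As an illustration only, the CPV augmentation step could be sketched in Python as follows. The field names (<monospace>tender</monospace>, <monospace>items</monospace>, <monospace>classification</monospace>, <monospace>scheme</monospace>, <monospace>id</monospace>) follow the OCDS release schema. The keyword table, the <monospace>suggest_cpv</monospace> helper, the sample identifier and the choice of CPV codes are assumptions made for this sketch; the text does not describe how OpenOpps actually assigns codes.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from copy import deepcopy

# Hypothetical keyword-to-CPV lookup (illustrative codes only); a production
# system would need a far richer classifier than substring matching.
KEYWORD_TO_CPV = {
    "construction": ("45000000-7", "Construction work"),
    "software": (
        "72000000-5",
        "IT services: consulting, software development, Internet and support",
    ),
}


def suggest_cpv(text):
    """Return a (code, label) pair for the first matching keyword, else None."""
    lowered = text.lower()
    for keyword, cpv in KEYWORD_TO_CPV.items():
        if keyword in lowered:
            return cpv
    return None


def augment_with_cpv(release):
    """Return a copy of an OCDS-style release in which items lacking a
    classification receive a suggested CPV code where one can be found."""
    result = deepcopy(release)
    tender = result.get("tender", {})
    context = " ".join([tender.get("title", ""), tender.get("description", "")])
    for item in tender.get("items", []):
        if item.get("classification"):
            continue  # never overwrite a classification supplied by the source
        match = suggest_cpv(item.get("description", "") + " " + context)
        if match is not None:
            code, label = match
            item["classification"] = {
                "scheme": "CPV",
                "id": code,
                "description": label,
            }
    return result


release = {
    "ocid": "ocds-example-0001",  # made-up identifier for illustration
    "tender": {
        "title": "Software maintenance for the city archive",
        "items": [{"id": "1", "description": "Annual maintenance"}],
    },
}
augmented = augment_with_cpv(release)
print(augmented["tender"]["items"][0]["classification"]["id"])
```
]]></preformat>
      <p>The sketch works on a copy of the release, so the record as delivered by the source stays untouched and the added classification can always be told apart from source data.</p>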
      <p>In the context of our work, OpenOpps and OpenCorporates maintain their own
code for validating, mapping and monitoring the data. Currently, company
registers from new jurisdictions, such as Germany, Russia, and Portugal, are
being added, and more scrapers are being built to add procurement data
from other local and national portals. New sources are identified, prioritised, and audited
with respect to quality criteria (e.g., legal, practical, and technical).
OpenCorporates and OpenOpps data is available under their standard share-alike
attribution Open Database Licences<sup>9,10</sup>.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Architecture and Process</title>
      <p>
        The preliminary architecture for data integration is presented in Fig. 1. OpenOpps
and OpenCorporates run their own processes for gathering, extracting and
curating data from distributed sources, including structured, semi-structured, and
unstructured data. The data is extracted from the OpenOpps and OpenCorporates
databases through an extract, transform, and load (ETL) process using DataGraft
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is a cloud-based service for data transformation and access.
      </p>
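      <p>To make the ETL idea concrete, the following Python sketch turns flat contract records into N-Triples lines. It is not DataGraft's interface. The <monospace>http://example.org/</monospace> namespace, the property names and the record fields are placeholders assumed for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import quote

BASE = "http://example.org/"  # placeholder namespace (assumption)


def escape_literal(value):
    """Escape the characters N-Triples forbids unescaped inside a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def iri(*parts):
    """Build an IRI under BASE, percent-encoding the final path segment."""
    *prefix, last = parts
    return "<" + BASE + "/".join(list(prefix) + [quote(last, safe="")]) + ">"


def record_to_ntriples(record):
    """Transform: turn one flat contract record into N-Triples lines."""
    subject = iri("contract", record["id"])
    title = escape_literal(record["title"])
    lines = [subject + " " + iri("vocab", "title") + ' "' + title + '" .']
    if record.get("supplier_id"):
        supplier = iri("company", record["supplier_id"])
        lines.append(subject + " " + iri("vocab", "supplier") + " " + supplier + " .")
    return lines


def run_etl(rows, sink):
    """Extract rows from an iterable, transform each, load lines into sink."""
    for row in rows:
        sink.extend(record_to_ntriples(row))
    return sink


# Hypothetical input record; the field names are assumptions of this sketch.
rows = [{"id": "c-1", "title": 'Road "A1" repair', "supplier_id": "company-42"}]
triples = run_etl(rows, [])
print("\n".join(triples))
```
]]></preformat>
      <p>Here the sink is a plain list; in practice the load step would write to a triple store, which is outside the scope of this sketch.</p>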
      <p>A series of refinements and enrichments is executed over the extracted data,
such as normalisation (i.e., of data types and formats) and curation (i.e., handling
missing records and duplicate records). Some of these processes are applied on the data
provider side as well. Data will be characterised and integrated through a set
of ontologies as discussed above. The entities in the knowledge graph are linked
and reconciled: for example, legal entities mentioned in procurement
data are linked back to the company record for that entity in the supplier data, and
references to the documents collected in the document store are added.</p>
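      <p>A minimal Python sketch of the normalisation, deduplication and reconciliation steps is given below. It assumes exact matching on normalised names, which is a deliberate simplification and not necessarily the matching strategy used in the project. The suffix list, the company identifiers and the sample names are made up for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
import unicodedata

# Assumed set of legal-form suffixes to ignore at the end of a name.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "as", "gmbh", "sl", "sa"}


def normalise_name(name):
    """Lower-case, strip accents and punctuation, drop trailing legal forms."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", stripped.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def build_index(companies):
    """Map normalised name to company id; ambiguous names map to None."""
    index = {}
    for company in companies:
        key = normalise_name(company["name"])
        index[key] = company["id"] if key not in index else None
    return index


def reconcile(mentions, index):
    """Collapse duplicate supplier mentions and link each distinct one to a
    company id when its normalised name matches exactly one company."""
    links = {}
    for mention in mentions:
        key = normalise_name(mention)
        if key and key not in links:
            links[key] = index.get(key)
    return links


# Made-up company records and supplier mentions.
companies = [
    {"id": "gb/001", "name": "Acme Construction Ltd."},
    {"id": "no/002", "name": "Fjord Bygg AS"},
]
mentions = [
    "ACME CONSTRUCTION LIMITED",
    "Acme Construction Ltd",
    "Fjórd Bygg as",
    "Unknown Supplier",
]
links = reconcile(mentions, build_index(companies))
print(links)
```
]]></preformat>
      <p>Mentions that match no company, or more than one, are kept with a value of <monospace>None</monospace> so that they can be routed to manual review instead of being linked wrongly.</p>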
      <p>The knowledge graph is accessed via a linked data enabled REST API. This
means that the URIs used to identify contracts, tenders, companies, etc.
are dereferenceable and support content negotiation. SPARQL endpoint access is
provided for developers who want to make ad-hoc queries to the knowledge
graph, along with a range of additional services: search over the knowledge
graph and the document store, and reconciliation services that help third parties
use the URIs. Note that not all data from OpenOpps and OpenCorporates
is extracted, but the data is linked back to these databases for further access.
        <fn id="fn8"><label>8</label><p>https://simap.ted.europa.eu/cpv</p></fn>
        <fn id="fn9"><label>9</label><p>https://opencorporates.com/info/licence</p></fn>
        <fn id="fn10"><label>10</label><p>https://openopps.com/legal</p></fn>
      </p>
      <p>[Fig. 1: architecture diagram. Labels recovered from the figure: Distributed datasets; OpenCorporates; OpenOpps; OpenCorporates API; OpenOpps API; DataGraft (extract, transform, load; normalize, curate, link); Supplier data; Procurement data; Triple store (T-box, A-box); Document store; Reconciliation service; SPARQL end-point; Search API; Linked data REST APIs.]</p>
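      <p>The two access paths described in this section, dereferenceable URIs with content negotiation and a SPARQL endpoint, can be sketched from the client side as follows. The addresses are placeholders, since the actual endpoints are not given in the text, and the requests are only constructed, not sent. The query request follows the query-via-GET form of the SPARQL 1.1 Protocol.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder addresses (assumptions); substitute the real service URLs.
RESOURCE_URI = "http://example.org/contract/c-1"
SPARQL_ENDPOINT = "http://example.org/sparql"


def dereference_request(uri, media_type="text/turtle"):
    """Build a GET request that asks for a given RDF serialisation of a
    resource via the HTTP Accept header (content negotiation)."""
    return Request(uri, headers={"Accept": media_type})


def sparql_select_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol query-via-GET request for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

turtle_request = dereference_request(RESOURCE_URI)
json_ld_request = dereference_request(RESOURCE_URI, "application/ld+json")
query_request = sparql_select_request(SPARQL_ENDPOINT, QUERY)

# Passing any of these to urllib.request.urlopen would send it; that step is
# left out so the sketch runs without network access.
print(turtle_request.get_header("Accept"))
print(query_request.full_url)
```
]]></preformat>
      <p>The same URI is requested twice with different <monospace>Accept</monospace> headers, which is the essence of content negotiation: one identifier, several representations.</p>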
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
      <p>Our ultimate goal is to ensure that data providers make their procurement data
available in their own domains/sites, according to our ontology network. However,
since this will not be possible in the short term, we follow a centralised approach
in this work. The data in the resulting knowledge graph will be licensed under a
combination of CC-BY 4.0 and Open Database License.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alvarez-Rodríguez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          , et al.:
          <article-title>New trends on e-Procurement applying semantic technologies: Current status and future challenges</article-title>
          .
          <source>Computers in Industry</source>
          <volume>65</volume>
          (
          <issue>5</issue>
          ),
          <fpage>800</fpage>
          –
          <lpage>820</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Distinto</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , et al.:
          <article-title>LOTED2: An ontology of European public procurement notices</article-title>
          .
          <source>Semantic Web</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>267</fpage>
          –
          <lpage>293</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Muñoz-Soro</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          , et al.:
          <article-title>PPROC, an ontology for transparency in public procurement</article-title>
          .
          <source>Semantic Web</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>295</fpage>
          –
          <lpage>309</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Nečaský</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Linked data support for filing public contracts</article-title>
          .
          <source>Computers in Industry</source>
          <volume>65</volume>
          (
          <issue>5</issue>
          ),
          <fpage>862</fpage>
          –
          <lpage>877</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>DataGraft: One-Stop-Shop for Open Data Management</article-title>
          .
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <issue>4</issue>
          ),
          <fpage>393</fpage>
          –
          <lpage>411</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>