<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI-based Decision Support Systems for the Management of E-procurement Procedures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Discussion Paper</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lops</string-name>
          <email>pasquale.lops@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Di Ciano</string-name>
          <email>m.diciano@innovapuglia.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Lopane</string-name>
          <email>n.lopane@regione.puglia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <email>lucia.siciliani@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Taccardi</string-name>
          <email>vincenzo.taccardi@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleonora Ghizzota</string-name>
          <email>e.ghizzota@studenti.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>giovanni.semeraro@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dip. di Informatica - Università di Bari Aldo Moro</institution>
          ,
          <addr-line>Via E. Orabona 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>InnovaPuglia S.p.A.</institution>
          ,
          <addr-line>str. prov. per Casamassima km. 3.000 - 70010 Valenzano (Ba)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Regione Puglia</institution>
          ,
          <addr-line>Via G. Gentile 52 - 70126 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tenders are a powerful means of investing public funds and represent a strategic resource for development. Improving the efficiency of procuring entities and developing evaluation models is therefore essential to facilitate e-procurement procedures. In this contribution, we present our preliminary research towards a system that supports decision-making and monitoring over the entire course of investments and contracts (SIAP). The system employs artificial intelligence techniques based on natural language processing and machine learning, and focuses on providing tools for extracting useful information from both structured and unstructured (i.e., textual) data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Public procurement, especially when it is aimed at innovation, is a powerful means of
investing public funds. Hence, to ensure transparency and monitoring across the entire
investment and procurement cycle, it is crucial to improve two main aspects. On the one hand, the
engagement process of the RUPs<sup>7</sup>, procuring entities, administrations, and awarding entities
should allow them to fulfill many of their assigned tasks in a more effective, efficient, and sustainable
manner. On the other hand, assessment schemes are needed that correlate particular
logical-temporal sequences of facts and contents that can be traced back to specific anomaly indicators.
Artificial intelligence technologies and automatic Natural Language Processing (NLP) systems
focused on the Italian language represent a new frontier for semantic interpretation, concept
extraction, and correlation of texts and documents. Leveraging such technologies, this research
aims to develop a system that can interface with existing databases, prepare datasets
suitable for processing and analysis, automatically extract relationships between
textual entities, perform correlation tests between portions of text of different lengths
(e.g., paragraphs vs. entire documents), and finally receive queries and return predefined outcomes in
a web-based format (short report, evidence, reference code, etc.).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The high-level architecture is designed so that it can be specialized in the later stages of the
project. It consists of four different modules: Data Collector, Data Pre-Processing,
Tender Analyzer, and Service Tools.</p>
      <p>The Data Collector module gathers data related to tender notices, which can be extracted from
several sources: European (TED<sup>1</sup>), national (SIMOG/ANAC<sup>2</sup>), and regional (EmPULIA<sup>3</sup>) databases,
in both tabular and textual form. Depending on the use-case
requirements, this research may also involve sources that do not focus on tendering, such as RSS feeds
and other government agencies' databases. Because the sources are potentially heterogeneous,
data extraction is partitioned into plug-ins, which makes it simpler to add new sources or
modify existing ones. A further component, Data Integration, deals with this
heterogeneity by combining the information gathered from each source, matching
overlapping data, and reporting potential anomalies.</p>
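      <p>The plug-in and Data Integration ideas can be illustrated with a minimal sketch. It is not the project's implementation: the names SourcePlugin, InMemorySource, and integrate, the choice of the tender code as matching key, and the sample records are all assumptions made for illustration.</p>
      <preformat>
```python
from abc import ABC, abstractmethod


class SourcePlugin(ABC):
    """One plug-in per data source (hypothetical interface)."""

    name: str

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Return raw tender records as dictionaries."""


class InMemorySource(SourcePlugin):
    """Stand-in for a real source; it only replays the records it is given."""

    def __init__(self, name: str, records: list[dict]):
        self.name = name
        self._records = records

    def fetch(self) -> list[dict]:
        return list(self._records)


def integrate(plugins: list[SourcePlugin], key: str = "cig"):
    """Merge records sharing the same key and report conflicting fields."""
    merged: dict[str, dict] = {}
    anomalies: list[tuple] = []
    for plugin in plugins:
        for record in plugin.fetch():
            target = merged.setdefault(record[key], {})
            for field, value in record.items():
                if field in target and target[field] != value:
                    # Overlapping data that disagree: keep the first value, report the clash.
                    anomalies.append((record[key], field, target[field], value))
                else:
                    target[field] = value
    return merged, anomalies


# Placeholder records; "A1" and "X" are not real CIG or CPV codes.
national = InMemorySource("national", [{"cig": "A1", "amount": 1000}])
regional = InMemorySource("regional", [{"cig": "A1", "amount": 1200, "cpv": "X"}])
merged, anomalies = integrate([national, regional])
print(merged)     # {'A1': {'cig': 'A1', 'amount': 1000, 'cpv': 'X'}}
print(anomalies)  # [('A1', 'amount', 1000, 1200)]
```
      </preformat>
      <p>In this sketch, adding a new source only requires a new SourcePlugin subclass; the integration step does not change.</p>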
      <p>The Pre-Processing module transforms the information extracted by the Data Collector into
entities on which the subsequent analysis tasks can be executed effectively. It is therefore necessary
to establish which features or entities are appropriate for representing a tender notice and
then store them in databases or suitable indexes.</p>
      <p>The Tender Analyzer carries out analyses based on the information extracted. Given the
nature of such information (structured or unstructured), this module consists of two different
components:
        <list list-type="bullet">
          <list-item><p>The Data Analyzer deals with the structured information associated with tenders: codes
(such as CIG<sup>4</sup>, CPV<sup>5</sup>, etc.), dates, amounts, etc.</p></list-item>
          <list-item><p>The Content Analyzer processes the unstructured information, e.g., any attachments
belonging to the notice (determinations, specifications, etc.). It performs text analysis using NLP
techniques.</p></list-item>
        </list>
      </p>
      <p>The last module of the architecture is designed as a series of Service Tools. Given the
information and analyses provided by the previous modules, these tools will be linked to a well-defined set of
use cases and will perform specific operations to fulfill their requirements.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Applications</title>
      <p>As discussed in section 2, the proposed framework works on both structured and unstructured
data to fully exploit the information associated with the notice and capture all aspects of it.</p>
      <sec id="sec-3-1">
        <p><sup>1</sup> https://ted.europa.eu/
<sup>2</sup> https://simog.anticorruzione.it/
<sup>3</sup> http://www.empulia.it/
<sup>4</sup> Codice Identificativo Gara, Tender Identification Code
<sup>5</sup> Common Procurement Vocabulary</p>
        <sec id="sec-3-1-1">
          <title>3.1. Structured Data analysis</title>
          <p>
            Applications on structured data involve deriving metrics that can detect anomalies or
deviations from regulatory and normative standards in procurement activities [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Such metrics can be
calculated from the available public procurement Open Data. We propose the following:
Relative tender value, Variance of bids, Difference between the first and second bids, Concentrated
market structure, Static market structure, Cyclical wins, Lack of offers from a previously
active company, Superfluous bidders, Prevalence of incorrect applications, Prevalence of consortia,
and Prevalence of subcontracting.
          </p>
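          <p>As an illustration of how two of these metrics might be computed, the following sketch uses only the Python standard library. The exact definitions are assumptions and are not taken from the cited toolkit: bid dispersion is taken as the coefficient of variation of the bids, and the first-second difference as the relative gap between the two lowest bids. The bid values are invented.</p>
          <preformat>
```python
from statistics import mean, pstdev


def bid_dispersion(bids: list[float]) -> float:
    """Coefficient of variation of the bids (population std / mean)."""
    return pstdev(bids) / mean(bids)


def first_second_gap(bids: list[float]) -> float:
    """Relative gap between the lowest and the second-lowest bid."""
    lowest, second = sorted(bids)[:2]
    return (second - lowest) / lowest


# Invented bids for a single tender (lowest price wins in this sketch).
bids = [100.0, 120.0, 125.0, 130.0]
print(round(bid_dispersion(bids), 3))
print(round(first_second_gap(bids), 3))  # 0.2
```
          </preformat>
          <p>Both functions need at least two bids per tender; single-bid tenders would have to be handled separately.</p>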
          <p>Such feature engineering can facilitate the application of machine learning
to detect suspicious contracts whose award might be the result of collusive
agreements among firms participating in the tender or operating in that market. The main
difficulty of this task is the absence of datasets that record, for a given procurement, whether
a judicial investigation has proven collusion among participants.
For this reason, one of the main research directions is the development of unsupervised models, e.g., clustering
or anomaly detection. If records from the relevant law enforcement authorities
were available in external datasets, supervised learning models could also be used, at
the cost of creating annotated datasets that cross-reference information obtained by data mining
on those sources (court rulings, RSS feeds, etc.) with the available procurement data (ANAC<sup>2</sup>,
TED<sup>1</sup>, etc.).</p>
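          <p>A very simple form of unsupervised anomaly detection on a single metric can be sketched as follows. This is only an illustrative baseline, not the model adopted in the project: it flags values whose modified z-score, based on the median absolute deviation, exceeds a threshold. The threshold of 3.5 and the sample values are assumptions.</p>
          <preformat>
```python
from statistics import median


def robust_scores(values: list[float]) -> list[float]:
    """Modified z-scores based on the median absolute deviation (MAD)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be ranked as anomalous.
        return [0.0 for _ in values]
    return [0.6745 * (v - center) / mad for v in values]


def flag_anomalies(values: list[float], threshold: float = 3.5) -> list[int]:
    """Indices whose absolute modified z-score exceeds the threshold."""
    scores = robust_scores(values)
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Invented per-tender values of one metric (e.g., a first/second bid gap).
gaps = [0.02, 0.03, 0.025, 0.04, 0.03, 0.45]
print(flag_anomalies(gaps))  # [5]
```
          </preformat>
          <p>A flagged tender is only a candidate for closer inspection; without ground-truth labels, it cannot be read as evidence of collusion.</p>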
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Unstructured Data analysis</title>
          <p>
            Unstructured data are handled with automated tools capable of detecting implicit
information, such as the grammatical and semantic structures present in the text, following
approaches based on NLP [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] and Semantic Analysis. Such an approach enables various outcomes,
for instance, a search engine that receives queries and returns the
most relevant documents based on both co-occurrence measures between terms in the query and
the documents and their semantic similarity [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Furthermore, to generate predefined
outcomes, it is possible to use Natural Language Understanding and Generation techniques; e.g.,
the system can generate summaries that condense, at the required granularity, the information
gathered from documents [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. These functionalities simplify access to contract information
and consequently ease the implementation of innovative business processes.
          </p>
          <p>
            The analysis of tender documents may allow the extraction of important information that
is often not included in the available metadata. The CPV<sup>5</sup> code, for example, identifies the type
of the contract scope and is used for classifying tenders. These codes could be used to
train a classifier that assigns a category to a call for tender based on the content of
its documentation. In addition, it is possible to automatically extract relations between textual
entities contained in contract-related documents via Open Information Extraction methods
[
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. In particular, our proposed system extracts triples made of a subject, an object, and a
predicate that relates them. In this manner, we obtain a machine-readable representation of
the information in the documents, namely a fact that can be seen as a truth-bearer and can
be labeled as either true or false, relevant or not relevant. Since our study focuses on the
Italian language, our framework OIE4PA<sup>6</sup> leverages an Italian dataset of labeled and unlabeled
triples extracted from real invitations to tender acquired from EmPULIA. OIE4PA adopts the
methodology of a pre-existing system, WikiOIE [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], and adapts it to the domain of
public tenders by setting up specific tools for extracting text from announcement documents
and by creating an ad hoc dataset with the resulting facts. We test the self-training approach
proposed by WikiOIE on our novel dataset to check whether it improves the performance of
the system.
          </p>
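          <p>The triple representation described above can be sketched as a small data structure. The class name Triple, its field names, and the English sample facts are assumptions for illustration; they do not reproduce the OIE4PA dataset format.</p>
          <preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """A fact extracted from text: subject, predicate, object."""

    subject: str
    predicate: str
    obj: str
    relevant: Optional[bool] = None  # None means the triple is unlabeled


def split_by_label(triples: list[Triple]):
    """Separate labeled triples from unlabeled ones."""
    labeled = [t for t in triples if t.relevant is not None]
    unlabeled = [t for t in triples if t.relevant is None]
    return labeled, unlabeled


# Invented example facts, written in English for readability.
triples = [
    Triple("the contracting authority", "publishes", "the tender notice", True),
    Triple("the notice", "is", "attached", False),
    Triple("the economic operator", "submits", "an offer"),
]
labeled, unlabeled = split_by_label(triples)
print(len(labeled), len(unlabeled))  # 2 1
```
          </preformat>
          <p>In a self-training setting, the labeled triples would train an initial classifier and the unlabeled ones would be the candidates for pseudo-labeling.</p>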
          <p>This kind of work is useful in several scenarios. For instance,
information extracted directly from documents can be cross-checked against structured
data to detect potential anomalies; moreover, both sources can be exploited for preliminary
market analysis, which makes it possible to keep track of specific sectors or contracting authorities. Given
enough data, this information can be used to outline profiles that enable a more
accurate identification of irregularities to be further examined by the contracting or competent
authorities.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.3. Contractor information synthetic overview</title>
          <p>Considering the extensive body of information available, we aim to organize it
around contractors. The objective is, therefore, to contribute to the definition of a "Passport"
for companies or economic operators that gives the contracting authority and the RUP<sup>7</sup> access to
a real-time overview of the available information. For instance, by simulating the audits that
the RUP usually runs and/or the economic operator declares, the application connects to the
available databases (Registro Imprese, Agenzia Entrate, DURC<sup>8</sup>, etc.), extracts and processes the
information relevant to each test and, finally, returns it in a summary format. Moreover,
any results from the archive of tenders (contracts won, tender statistics, location of
interventions, KPIs, etc.) are added to this information to obtain thorough and
automated informative content about the operator in question.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future works</title>
      <p>The research project described so far aims to investigate the use of artificial intelligence
technologies to offer the stakeholders engaged in the public procurement process a number of useful
tools that facilitate their work in both the engagement and assessment phases. The proposed
framework can operate on both tabular and textual data thanks to a modular architecture split
into several plug-ins that take care of retrieving information and developing specific applications
for each data source used.</p>
      <p>Digitization policies are having an increasing impact at both the central and peripheral levels,
and the growing availability of data in digital format opens up new scenarios for the use
and integration of applications assisted by artificial intelligence. The proposed prototype
explores the possibilities within this scenario and, at the
same time, is suitable for future extensions and enhancements.</p>
      <sec id="sec-4-1">
        <p><sup>6</sup> Open Information Extraction For the Public Administration
<sup>7</sup> Responsabile Unico del Procedimento, Head Project Manager
<sup>8</sup> Documento Unico di Regolarità Contributiva</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tóth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fazekas</surname>
          </string-name>
          , Á. Czibik,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Tóth</surname>
          </string-name>
          ,
          <article-title>Toolkit for detecting collusive bidding in public procurement. With examples from Hungary</article-title>
          ,
          Report number: CRC-WP/2014:02 (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition</article-title>
          . Prentice Hall, New Jersey (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caputo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ciano</surname>
          </string-name>
          , G. Grasso, G. Rossiello, G. Semeraro,
          <article-title>Sepir: a semantic and personalised information retrieval tool for the public administration based on distributional semantics</article-title>
          ,
          <source>International Journal of Electronic Governance</source>
          <volume>9</volume>
          (
          <year>2017</year>
          )
          <fpage>132</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rossiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , G. Semeraro,
          <article-title>Centroid-based text summarization through compositionality of word embeddings</article-title>
          , in: G. Giannakopoulos,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lloret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Conroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litvak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Rankel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Workshop on Summarization and Summary Evaluation Across Source Types and Genres, MultiLing@EACL 2017</source>
          , Valencia, Spain, April 3, 2017, Association for Computational Linguistics,
          <year>2017</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>21</lpage>
          . URL: https://doi.org/10.18653/v1/w17-1003. doi:10.18653/v1/w17-1003.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cassotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , M. de Gemmis, P. Lops,
          <article-title>Extracting relations from Italian Wikipedia using unsupervised information extraction</article-title>
          , in: V. W. Anelli,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 11th Italian Information Retrieval Workshop 2021</source>
          , Bari, Italy, September 13-15, 2021, volume
          <volume>2947</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cassotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , M. de Gemmis, P. Lops, G. Semeraro,
          <article-title>Extracting relations from Italian Wikipedia using self-training</article-title>
          , in: E. Fersini,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          , V. Patti (Eds.),
          <source>Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021</source>
          , Milan, Italy, January 26-28, 2022, volume
          <volume>3033</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>