<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Boosting Information Extraction through Semantic Technologies: The KIDs use case at CONSOB</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Maria Scafoglieri</string-name>
          <email>scafoglierig@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Lembo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Limosani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Medda</string-name>
          <email>f.meddag@consob.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Lenzerini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Commissione Nazionale per le Società e la Borsa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza Università di Roma</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we report on the initial results of a project on the integration of Semantic Technologies with Information Extraction (IE) techniques, jointly carried out by Sapienza University of Rome and CONSOB (Commissione Nazionale per le Società e la Borsa), the Italian public authority responsible for regulating the securities market.</p>
        <p>The use case. In the EU, the creators of financial products (a.k.a. financial manufacturers) are obliged by law to make information related to so-called PRIIPs (Packaged Retail and Insurance-based Investment Products) publicly available. The NCAs (National Competent Authorities) have supervisory duties on such products, so that they can be safely placed on the respective national markets. The legislation requires information about PRIIPs to be communicated to NCAs through documents called KIDs (Key Information Documents). In practice, this means that the features to be checked are cast into text reports, typically formatted as PDF files, and that extracting structured data from them (to bootstrap control activities) is left to the authority (in Italy, CONSOB). Due to the massive number of documents to be analyzed (e.g., 700,000 KIDs received by CONSOB in 2019, and more than 1 million in 2020), this process cannot be carried out manually, yet to date it is only partially automated.</p>
        <p>Objectives. Our main aim is thus to develop a solution that streamlines the extraction process and reduces as much as possible (ideally eliminates) the need for manual intervention, while still guaranteeing very high accuracy. At the same time, such a solution should return a data structure that properly accounts for the semantics of the business domain and is suited to rich and highly informative post-extraction analysis.</p>
        <p>Solution. Given the requirements highlighted above, the proposed solution aims at constructing a Knowledge Graph (KG) whose intensional component (expressed in OWL) is designed with the help of domain experts, and whose extensional level is automatically created from KIDs through a rule-based IE mechanism. Structuring the extracted data as a KG not only facilitates integration with other corporate and external data, enabling rich analysis and management at an abstract, conceptual level, but also makes it possible to properly formalize the conceptual distinction between PRIIPs and the KIDs describing them, as well as the continuous updates to which KIDs are subject.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Moreover, the choice of a rule-based approach to IE over a statistical one
is motivated not only by the great effort required to generate an annotated dataset for training
learning algorithms, but also by the lack of transparency, accountability, and human
interpretability of Machine Learning solutions, which makes it excessively difficult to fully
understand the results of the extraction and is therefore ill-suited to the financial context of this use case.
In our first implementation efforts, we focused on a portion of the information to be
extracted, consisting of 12 PRIIP characteristics (e.g., the name of the product and its issue date).
We realized two alternative implementations, briefly described below.
      </p>
      <p>
First realization. We initially adopted CoreNLP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the popular Stanford library for
NLP, which is well documented, fully supported, and easy to use, and provides a rule
engine module, called TokensRegex, that generates annotations on text via a
regex-like rule language. After the rules are applied, the generated annotations pass through a flow
of transformations, realized by components specifically written to translate
annotations into facts of the KG. Although the results of our experiments are particularly
convincing in terms of precision and recall, both averaging around 99% over a dataset
of more than 14,000 KIDs, we encountered two main issues: (i) low performance in
terms of execution time; (ii) a complex rule-definition process, caused by the low
modularity of TokensRegex and the need to complement rules with ad-hoc (Java) code.
      </p>
      <p>
Second realization. We therefore realized a second implementation through
MASTRO SYSTEM-T [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a KG-aided IE tool, which allowed us to solve the above
issues while achieving the same accuracy. As for issue (i), the extraction speed
has increased considerably, reducing the time needed to materialize the KG by 46.15%,
mainly thanks to the use in MASTRO SYSTEM-T of the high-performance IE tool
System-T [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Issue (ii), instead, has been greatly mitigated by virtue of both the full
declarativeness of the language used for the extractors in System-T and the way in
which MASTRO SYSTEM-T casts them into extraction assertions mapping KIDs to KG
predicates. This second implementation also allowed us to follow a new approach to
accessing KID data. Indeed, MASTRO SYSTEM-T may be used as a Virtual KG engine [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
to perform the extraction at query time, so that queries always return fresh data.
      </p>
      <p>
Conclusion. Since all European NCAs need to address the same oversight tasks on
PRIIPs as CONSOB, the impact of our research may go well beyond the single experience
we described, considering also that, to the best of our knowledge, very few authorities
have to date developed solutions supporting automatic IE from KIDs. In particular, the
adoption of our approach by other authorities is enabled by the fact that both the
content and the structure of KIDs must comply with the same common regulation.
      </p>
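      <p>As a minimal illustration of the rule-based pipeline described above, in which rules annotate the text of a KID and the annotations are then translated into facts of the KG, the following Python sketch applies three regular-expression rules to a toy document and emits triples that keep the KID and the PRIIP it describes as distinct nodes. It is not the CoreNLP or MASTRO SYSTEM-T implementation: the namespace, the predicate names, the rule patterns, the sample text, and the product identifier are all assumptions invented for this example.</p>
      <preformat>
```python
import re

# Hypothetical namespace; the real ontology vocabulary is not given in the text.
NS = "http://example.org/kid#"


def to_iso_date(text):
    """Turn a dd/mm/yyyy string into yyyy-mm-dd."""
    day, month, year = text.split("/")
    return f"{year}-{month}-{day}"


# Each rule is (predicate name, pattern, normalizer). The first group of the
# pattern captures the value. All three rules are illustrative assumptions.
RULES = [
    ("productName", re.compile(r"Product name:\s*(.+)"), str.strip),
    ("productId", re.compile(r"Product ID:\s*([A-Z0-9]+)"), str.strip),
    ("issueDate", re.compile(r"Issue date:\s*(\d{2}/\d{2}/\d{4})"), to_iso_date),
]


def extract_facts(kid_id, text):
    """Apply every rule to the KID text and return a list of triples."""
    kid = NS + "kid/" + kid_id
    values = {}
    for predicate, pattern, normalize in RULES:
        match = pattern.search(text)
        if match:
            values[predicate] = normalize(match.group(1))

    facts = [(kid, "rdf:type", NS + "KID")]
    # The PRIIP is a separate node, so that several KID versions
    # can describe the same product over time.
    if "productId" in values:
        priip = NS + "priip/" + values["productId"]
        facts.append((priip, "rdf:type", NS + "PRIIP"))
        facts.append((kid, NS + "describes", priip))
        for predicate, value in values.items():
            facts.append((priip, NS + predicate, value))
    return facts


# Toy KID text, invented for this sketch.
SAMPLE = """Key Information Document
Product name: Example Capital Protected Note
Product ID: XX0000000001
Issue date: 15/03/2019
"""

facts = extract_facts("0001", SAMPLE)
for fact in facts:
    print(fact)
```
      </preformat>
      <p>In this sketch each rule is a self-contained, declarative entry in a table, so adding a new characteristic means adding one entry rather than writing new procedural code. Because the facts are computed by a plain function of the document text, the same function could in principle be called either once to materialize the KG or on demand at query time.</p>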
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Danilevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Reiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>SystemT: Declarative text understanding for enterprise</article-title>
          .
          <source>In Proc. of NAACL-HLT (Industry Papers)</source>
          , pages
          <fpage>76</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Scafoglieri</surname>
          </string-name>
          .
          <article-title>Ontology mediated information extraction with MASTRO SYSTEM-T</article-title>
          .
          <source>In Proc. of ISWC (Demos Track)</source>
          , pages
          <fpage>256</fpage>
          -
          <lpage>261</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Proc. of ACL</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kontchakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zakharyaschev</surname>
          </string-name>
          .
          <article-title>Ontology-based data access: A survey</article-title>
          .
          <source>In Proc. of IJCAI</source>
          , pages
          <fpage>5511</fpage>
          -
          <lpage>5519</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>