<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ptpDG: A Purchase-To-Pay Dataset Generator for Evaluating Knowledge-Graph-Based Services</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Schulze</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schroder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Jilek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Dengel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Technische Universität Kaiserslautern</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Smart Data &amp; Knowledge Services Department, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI)</institution>
          ,
          <addr-line>Kaiserslautern</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces ptpDG, a labeled-dataset generator that produces various data assets for evaluating knowledge graph construction approaches and downstream knowledge services in the purchase-to-pay domain. While organizations sell, purchase and complain about products in a multi-agent-system simulation, a ground truth knowledge graph emerges with different kinds of purchase-to-pay processes. Based on this knowledge graph, heterogeneous electronic purchase-to-pay documents such as e-invoices, credit notes and orders are generated. Noise patterns that we have frequently encountered in real industrial data are then added to those documents. Finally, a provenance graph is generated which contains provenance information between document elements and ground truth triples. In this way, ptpDG enables data-driven evaluation and its publication for such privacy-sensitive scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph Construction</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Simulation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Purchase-to-pay processes are "knowledge-intensive processes" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] consisting of
heterogeneous documents such as orders, e-invoices and credit notes. To support
knowledge workers in such work environments, our research is concerned with
knowledge-graph-based services for users<sup>3</sup>. For such services, knowledge graphs
have to be constructed in the first place, which we also want to evaluate in a
data-driven way. However, publishing real industrial data for scientific evaluation
is rarely possible because this kind of data is often highly sensitive. This also
holds for information contained in real purchase-to-pay documents because it
includes personal information and relates to third parties. In our experience,
industry partners also object to anonymization techniques because a risk
of de-anonymization remains [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Even in the rare cases when data publishing may
be possible, or when it is not sought at all, labeling such data is still a
time-consuming task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 https://comem.ai/SensAI</title>
      <p>
        Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>Therefore, this paper introduces ptpDG, an approach that generates various
data assets for evaluating knowledge graph construction approaches and
downstream knowledge services: (a) synthetic electronic purchase-to-pay documents
such as e-invoices, credit notes or orders to which noise is added (e.g. incomplete
data), (b) a ground truth knowledge graph which contains the triples that can be
constructed from such purchase-to-pay documents, and (c) a provenance graph
that contains relationships between pieces of evidence in the documents and
triples in the ground truth knowledge graph.</p>
      <p>Besides enabling evaluation for such privacy-sensitive cases, ptpDG can be
used as a visualization and presentation tool for knowledge-graph-based services
in the purchase-to-pay domain for stakeholders without the need to work on real
sensitive data in the first place. Also, ptpDG may be leveraged for benchmarking
knowledge graph construction techniques such as RDF mapping engines.</p>
      <sec id="sec-2-1">
        <title>Approach</title>
        <p>This section presents the general approach of ptpDG by means of Figure 1:</p>
        <p>
          Initialization: With a configuration knowledge graph (1), it is possible to
configure scenario-related entities such as organizations, products or persons and
their relations to each other (e.g. 1:1, 1:n, n:m). Based on this configuration, an
initialization knowledge graph (2) for the next steps is generated. It contains
the entities with their labels as well as required ontologies, such as P2P-O [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
for the purchase-to-pay domain.
        </p>
        <p>
          Simulation: Because real purchase-to-pay processes emerge while people in
organizations make decisions, we developed a multi-agent system (MAS)
simulation (3) to realize the decentralized creation of such processes and their
documents. In the simulation, organizations act as agents that purchase, sell and complain
about products. From these actions, purchase-to-pay documents and their contents are
created as triples in the ground truth knowledge graph (4). As a result,
various types of processes emerge that are also specified in the invoicing norm
EN16931 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], for example, processes with sporadic purchase orders, with and
without credit notes or with partial and final invoices. For knowledge workers
in real purchase-to-pay processes, reconstructing such processes is a challenging
task, which is why building knowledge graphs in such scenarios may be a
promising approach in the first place [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Which particular processes are generated in
the simulation depends on the randomized and individual decisions
organizations make during the simulation, e.g., whether to complain about an invoice or
not. For details on how to adjust parameters, we refer the reader to
https://purl.org/ptp-dg#simulation.
        </p>
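        <p>The emergence of processes from individual agent decisions can be illustrated with a minimal sketch. It is not ptpDG's implementation: the function <monospace>simulate</monospace>, the complaint probability and the <monospace>ex:</monospace> predicate names are assumptions made for illustration only, whereas ptpDG itself relies on P2P-O and a full MAS.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def simulate(orgs, products, rounds, complain_prob=0.3, seed=42):
    """Toy simulation: in every round each organization orders one product
    from another organization, the seller sends an invoice, and the buyer
    randomly decides whether to complain, which triggers a credit note.
    All names prefixed with ex: are hypothetical placeholders."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    triples = []
    doc_counter = 0

    def new_doc(kind, sender, receiver):
        # Create a document node and its basic triples in the ground truth.
        nonlocal doc_counter
        doc_counter += 1
        doc = f"ex:{kind}{doc_counter}"
        triples.append((doc, "rdf:type", f"ex:{kind}"))
        triples.append((doc, "ex:sender", sender))
        triples.append((doc, "ex:receiver", receiver))
        return doc

    for _ in range(rounds):
        for buyer in orgs:
            seller = rng.choice([o for o in orgs if o != buyer])
            order = new_doc("Order", buyer, seller)
            triples.append((order, "ex:product", rng.choice(products)))
            invoice = new_doc("Invoice", seller, buyer)
            triples.append((invoice, "ex:refersTo", order))
            # Individual, randomized decision of the buying agent.
            if rng.random() < complain_prob:
                credit = new_doc("CreditNote", seller, buyer)
                triples.append((credit, "ex:refersTo", invoice))
    return triples


triples = simulate(["ex:OrgA", "ex:OrgB", "ex:OrgC"], ["ex:P1", "ex:P2"], rounds=5)
invoices = [t for t in triples if t[1] == "rdf:type" and t[2] == "ex:Invoice"]
print(len(invoices))  # 3 organizations x 5 rounds = 15 invoices
```
]]></preformat>
        <p>In this sketch, whether a process contains a credit note depends only on the random decision of the buyer, so the mix of process types changes with the seed and the assumed complaint probability.</p>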
        <p>
          Purchase-To-Pay Documents: Electronic purchase-to-pay documents,
which at this point exist as triples in the ground truth knowledge graph, are generated with
the Purchase-To-Pay Document Generator (5) in the configured standards, formats
and syntaxes. Because those documents are still too perfect compared to
real-world documents, based on the idea in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], noise is added with patterns found
in real invoices, credit notes etc. (6). The current set of patterns has been
derived from interviewing invoice processing industry experts in the TRAFFIQX
network<sup>4</sup> and from analyzing real documents of this network. For example,
regarding patterns of how purchase-to-pay documents refer to each other (or
not), only the last digits of invoice or order references are displayed, or such
references are left out completely. Another common pattern is that the person who is
responsible for the order or invoice (or his or her name abbreviation) is entered
in the field that is actually reserved for the document reference.
        </p>
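        <p>The three reference-related noise patterns described above can be sketched as small functions applied to a document field. This is a simplified illustration under stated assumptions: the element names <monospace>Invoice</monospace>, <monospace>ID</monospace> and <monospace>OrderReference</monospace>, the abbreviation "JD" and all function names are hypothetical and do not reproduce a real e-invoice syntax or ptpDG's code.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified invoice; real e-invoice syntaxes are richer.
INVOICE_XML = (
    "<Invoice><ID>INV-2021-000123</ID>"
    "<OrderReference>ORD-2021-004567</OrderReference></Invoice>"
)


def keep_last_digits(ref, n=4):
    """Pattern: only the last digits of a reference are displayed."""
    return ref[-n:]


def drop_reference(ref):
    """Pattern: the reference is left out completely."""
    return None


def person_abbreviation(ref, abbreviation="JD"):
    """Pattern: a person's name abbreviation is entered in the reference field."""
    return abbreviation


def add_noise(xml_text, pattern, field="OrderReference"):
    """Apply one noise pattern to a direct child element of the document root."""
    root = ET.fromstring(xml_text)
    elem = root.find(field)
    noisy = pattern(elem.text)
    if noisy is None:
        root.remove(elem)  # the field disappears from the document
    else:
        elem.text = noisy
    return ET.tostring(root, encoding="unicode")


print(add_noise(INVOICE_XML, keep_last_digits))
print(add_noise(INVOICE_XML, drop_reference))
print(add_noise(INVOICE_XML, person_abbreviation))
```
]]></preformat>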
        <p>Provenance Graph: To enable data-driven evaluation of knowledge graphs
constructed from such documents, a provenance knowledge graph (7) is
generated which contains relationships between particular pieces of evidence in
the documents (e.g. the name abbreviation in the order reference field) and the
correct triples that may be constructed from this information. In the current
case of XML documents, the concrete location within a document is represented
as an XPath query. Finally, for the generated dataset and knowledge graphs,
metadata such as configuration parameters is generated.</p>
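        <p>How a provenance entry ties a location in a document to a ground truth triple can be sketched as follows. The record layout, the file name and the <monospace>ex:</monospace> identifiers are assumptions for illustration; ptpDG stores this information as a knowledge graph, not as Python dictionaries. Note that Python's <monospace>xml.etree.ElementTree</monospace> supports only a subset of XPath, which suffices for this simple path.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical noisy document: the order reference field holds a name abbreviation.
DOC = "<Invoice><ID>INV-1</ID><OrderReference>JD</OrderReference></Invoice>"

# Hypothetical provenance record: where the evidence is located and which
# ground truth triple can correctly be constructed from it.
PROVENANCE = [
    {
        "document": "invoice-1.xml",
        "xpath": "./OrderReference",
        "triple": ("ex:Invoice1", "ex:responsiblePerson", "ex:PersonJD"),
    },
]


def evidence_text(xml_text, xpath):
    """Return the text found at the given location, or None if it is absent."""
    node = ET.fromstring(xml_text).find(xpath)
    return None if node is None else node.text


def check_constructed(constructed, provenance):
    """For each provenance record, report whether its triple was constructed."""
    return [(rec["xpath"], rec["triple"] in constructed) for rec in provenance]


print(evidence_text(DOC, PROVENANCE[0]["xpath"]))  # JD
constructed = {("ex:Invoice1", "ex:responsiblePerson", "ex:PersonJD")}
print(check_constructed(constructed, PROVENANCE))
```
]]></preformat>
        <p>Such a lookup makes it possible to trace a missing or wrong triple back to the exact document element, and thus to the noise pattern that may have caused the error.</p>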
      </sec>
      <sec id="sec-2-2">
        <title>Application of ptpDG</title>
        <p>
          On ptpDG's project site https://purl.org/ptp-dg, a tutorial shows how a
dataset with 105k triples was generated in which six organizations trade 30
products over 60 rounds of simulation. In this dataset, 1328 different processes
are generated with 2277 documents in total. To ensure that the resulting documents
comply with the given standards, they have been validated against the respective XSD
specifications. The consistency of the resulting knowledge graphs has been
checked with an OWL reasoner, which also means that the knowledge graphs comply
with the OWL restrictions specified in P2P-O [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. As presented on the project site,
further plausibility checks with SPARQL queries and expected results have been
conducted, for example, to ensure that the number of final invoices and the number
of partial-final invoicing processes are equal.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 https://www.traffiqx.net/en/about-us</title>
      <sec id="sec-3-1">
        <title>Related Work</title>
        <p>
          Different invoice generators exist for presentation purposes and use cases, for
example, for entity extraction from paper-based invoices that have been scanned
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. However, such approaches do not provide labeled data to evaluate knowledge
graph construction approaches. Also, to the best of our knowledge, there is no
approach that considers the process context of invoices. ptpDG is inspired by
a previous approach called Data Sprout [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. It also generates labeled data and
ground truth triples from a given content knowledge graph in the context of
heterogeneous spreadsheet generation. However, besides generating other types of
data for purchase-to-pay processes, ptpDG extends this approach by introducing
a MAS simulation and, as a result, by dispensing with the content knowledge
graph as an input.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Conclusion and Outlook</title>
        <p>This paper introduced ptpDG, a labeled-dataset generator for the sensitive
purchase-to-pay domain based on a MAS simulation. In this way, ptpDG moves
towards enabling data-driven evaluation of knowledge-graph-based services: A
knowledge graph construction approach can now take the generated documents
as an input, and the resulting knowledge graph can be evaluated against the
provided provenance and ground truth knowledge graph.</p>
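        <p>One common way to compare a constructed knowledge graph with the ground truth is to compute precision, recall and F1 over sets of triples. The paper does not prescribe these metrics, so the following is only a sketch under that assumption, with hypothetical <monospace>ex:</monospace> triples and exact triple matching.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def evaluate(constructed, ground_truth):
    """Compare two triple collections by exact match and return
    (precision, recall, f1). Empty inputs yield 0.0 instead of an error."""
    constructed, ground_truth = set(constructed), set(ground_truth)
    true_positives = len(constructed & ground_truth)
    precision = true_positives / len(constructed) if constructed else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ground truth with four triples.
ground_truth = [
    ("ex:Invoice1", "rdf:type", "ex:Invoice"),
    ("ex:Invoice1", "ex:refersTo", "ex:Order1"),
    ("ex:Invoice1", "ex:sender", "ex:OrgA"),
    ("ex:Invoice1", "ex:receiver", "ex:OrgB"),
]
# Hypothetical construction result: two correct triples and one wrong link.
constructed = [
    ("ex:Invoice1", "rdf:type", "ex:Invoice"),
    ("ex:Invoice1", "ex:sender", "ex:OrgA"),
    ("ex:Invoice1", "ex:refersTo", "ex:Order7"),
]

p, r, f1 = evaluate(constructed, ground_truth)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.5 0.57
```
]]></preformat>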
        <p>
          For future work, we plan to extend ptpDG with more heterogeneous
documents, for example, with synthetic emails as purchase orders and other
documents such as dispatch advice and service provision documents. This way, it
will be possible to cover more kinds of processes specified in EN16931 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Also,
we plan to include more patterns (as in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) regarding organization names,
product descriptions, and in general those fields where users can insert
free text, to better align the generated documents with real ones. To further
evaluate the generated data beyond the presented plausibility checks, we are working
on a structural comparison between synthetic and real data. First results
indicate that with the current version of ptpDG it is easier to find a configuration
that generates correct ratios of different kinds of processes and documents than
it is to find a configuration that at the same time generates correct time
intervals. Further, support for additional standards and syntaxes such as
EDIFACT<sup>5</sup> is planned.
        </p>
        <p>Acknowledgements. This work was funded by the Investitions- und
Strukturbank Rheinland-Pfalz (ISB) (project InnoProm) and the BMBF project SensAI
(grant no. 01IW20007).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 https://unece.org/trade/uncefact/introducing-unedifact</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Belad, Y., Belad, A.:
          <article-title>Automatic generation of a custom corpora for invoice analysis and recognition</article-title>
          .
          <source>In: Workshop on Industrial Applications of Document Analysis and Recognition, WIADAR@ICDAR</source>
          <year>2019</year>
          , Sydney, Australia,
          <source>September 22-25</source>
          ,
          <year>2019</year>
          . IEEE (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ciccio</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marrella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Knowledge-intensive processes: Characteristics, requirements and analysis of contemporary approaches</article-title>
          .
          <source>J. Data Semant</source>
          .
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <volume>29</volume>
          –
          <fpage>57</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. EN 16931-1:2017:
          <article-title>Electronic invoicing - part 1: Semantic data model of the core elements of an electronic invoice</article-title>
          .
          <source>Standard</source>
          ,
          <string-name>
            <surname>CEN</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mittal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyah</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey</article-title>
          .
          <source>IEEE Commun. Surv. Tutorials</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ),
          <volume>1305</volume>
          –
          <fpage>1326</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Schroder,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Jilek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Dengel</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Dataset generation patterns for evaluating knowledge graph construction</article-title>
          .
          <source>In: The Semantic Web: ESWC 2021 Satellite Events - Virtual Event, June</source>
          <volume>6</volume>
          -10,
          <year>2021</year>
          ,
          <source>Revised Selected Papers. Lecture Notes in Computer Science</source>
          , vol.
          <volume>12739</volume>
          , pp.
          <volume>27</volume>
          –
          <fpage>32</fpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schulze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Schroder,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Jilek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Albers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Maus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Dengel</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>P2P-O: A purchase-to-pay ontology for enabling semantic invoices</article-title>
          .
          <source>In: The Semantic Web - 18th International Conference, ESWC</source>
          <year>2021</year>
          ,
          <string-name>
            <given-names>Virtual</given-names>
            <surname>Event</surname>
          </string-name>
          , June 6-10,
          <year>2021</year>
          , Proceedings. LNCS, vol.
          <volume>12731</volume>
          , pp.
          <volume>647</volume>
          –
          <fpage>663</fpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>