ptpDG: A Purchase-To-Pay Dataset Generator for Evaluating Knowledge-Graph-Based Services

Michael Schulze¹,²(✉), Markus Schröder¹,², Christian Jilek¹,², and Andreas Dengel¹,²

¹ Computer Science Department, Technische Universität Kaiserslautern, Germany
² Smart Data & Knowledge Services Department, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Kaiserslautern, Germany
{firstname.lastname}@dfki.de

Abstract. This paper introduces ptpDG, a labeled-dataset generator that generates various data assets for evaluating knowledge graph construction approaches and downstream knowledge services in the purchase-to-pay domain: While organizations sell, purchase and complain about products in a multi-agent-system simulation, a ground truth knowledge graph emerges with different kinds of purchase-to-pay processes. Based on this knowledge graph, heterogeneous electronic purchase-to-pay documents such as e-invoices, credit notes and orders are generated. To those documents, noise patterns are added that we have frequently encountered in real industrial data. Finally, a provenance graph is generated which contains provenance information between document elements and ground truth triples. In this way, for such privacy-sensitive scenarios, ptpDG enables data-driven evaluation and its publication.

Keywords: Knowledge Graph Construction · Evaluation · Simulation.

1 Introduction and Motivation

Purchase-to-pay processes are “knowledge-intensive processes” [2] consisting of heterogeneous documents such as orders, e-invoices and credit notes. To support knowledge workers in such work environments, our research is concerned with knowledge-graph-based services for users³. For such services, knowledge graphs have to be constructed in the first place, which we also want to evaluate in a data-driven way. However, publication of real industrial data for scientific evaluation is rarely possible because this kind of data is often highly sensitive.
This also holds for information contained in real purchase-to-pay documents because it consists of personal information and relates to third parties. In our experience, industry partners also have objections to anonymization techniques because a risk of de-anonymization exists [4]. Even in the rare cases when data publishing may be possible, or when it is not intended at all, labeling such data is still a time-consuming task [5].

³ https://comem.ai/SensAI
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Fig. 1. General approach of ptpDG

Therefore, this paper introduces ptpDG, an approach that generates various data assets for evaluating knowledge graph construction approaches and downstream knowledge services: (a) synthetic electronic purchase-to-pay documents such as e-invoices, credit notes or orders to which noise is added (e.g. incomplete data), (b) a ground truth knowledge graph which contains triples that can be constructed from such purchase-to-pay documents, and (c) a provenance graph that contains relationships between information evidences in the documents and triples in the ground truth knowledge graph.

Besides enabling evaluation for such privacy-sensitive cases, ptpDG can be used as a visualization and presentation tool for knowledge-graph-based services in the purchase-to-pay domain, so that stakeholders do not need to work on real sensitive data in the first place. Also, ptpDG may be leveraged for benchmarking knowledge graph construction techniques such as RDF mapping engines.

2 Approach

This section presents the general approach of ptpDG by means of Figure 1:

Initialization: With a configuration knowledge graph (1), it is possible to configure scenario-related entities such as organizations, products or persons and their relations to each other (e.g. 1:1, 1:n, n:m).
Based on this configuration, an initialization knowledge graph (2) for the next steps is generated. It contains the entities with their labels as well as required ontologies, such as P2P-O [6] for the purchase-to-pay domain.

Simulation: Because real purchase-to-pay processes emerge while people in organizations make decisions, we developed a multi-agent system (MAS) simulation (3) to realize the decentralized creation of such processes and their documents. In the simulation, organizations as agents purchase, sell and complain about products, from which purchase-to-pay documents and their contents are created as triples in the ground truth knowledge graph (4). As a result, various types of processes emerge that are also specified in the invoicing norm EN16931 [3], for example, processes with sporadic purchase orders, with and without credit notes or with partial and final invoices. For knowledge workers in real purchase-to-pay processes, reconstructing such processes is a challenging task, which is why building knowledge graphs in such scenarios may be a promising approach in the first place [6]. Which particular processes are generated in the simulation depends on the randomized and individual decisions organizations make during the simulation, e.g., whether to complain about an invoice or not. For ways to adjust the simulation parameters, we refer the reader to https://purl.org/ptp-dg#simulation.

Purchase-To-Pay Documents: Electronic purchase-to-pay documents, which now exist as triples in the ground truth knowledge graph, are generated with the Purchase-To-Pay Document Generator (5) in configured standards, formats and syntaxes. Because those documents are still too perfect compared to real-world documents, noise is added, based on the idea in [5], with patterns found in real invoices, credit notes etc. (6).
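Noise injection of this kind can be pictured as small functions that each rewrite a single field of a generated document. The following is a minimal sketch under stated assumptions, not ptpDG's implementation: the simplified XML layout, the element name `OrderReference`, and the helper names `keep_last_digits`, `drop_reference` and `add_noise` are hypothetical and chosen only for illustration. The two patterns shown truncate a document reference to its last characters or leave it out completely.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified invoice; real e-invoice syntaxes are richer.
CLEAN_XML = (
    "<Invoice>"
    "<ID>INV-2021-000123</ID>"
    "<OrderReference>ORD-2021-004567</OrderReference>"
    "</Invoice>"
)


def keep_last_digits(root, tag, n=4):
    """Noise pattern: keep only the last n characters of a reference field."""
    field = root.find(tag)
    if field is not None and field.text:
        field.text = field.text[-n:]


def drop_reference(root, tag):
    """Noise pattern: leave the reference field out completely."""
    field = root.find(tag)
    if field is not None:
        root.remove(field)


def add_noise(xml_text, rng, probability=0.5):
    """With the given probability, apply one randomly chosen pattern."""
    root = ET.fromstring(xml_text)
    if rng.random() < probability:
        pattern = rng.choice([keep_last_digits, drop_reference])
        pattern(root, "OrderReference")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    for _ in range(3):
        print(add_noise(CLEAN_XML, rng))
```

Keeping each pattern as a separate function makes it possible to record which pattern touched which field, which is the kind of information a provenance graph needs later on.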
The current set of patterns has been derived from interviewing invoice processing industry experts in the TRAFFIQX network⁴ and from analyzing real documents of this network. For example, regarding patterns of how purchase-to-pay documents refer to each other (or not), only the last digits of invoice or order references are displayed, or such references are left out completely. Another common pattern is that the person who is responsible for the order or invoice, or her/his name abbreviation, is entered in the field that is actually reserved for the document reference.

Provenance Graph: To enable data-driven evaluation of knowledge graphs constructed from such documents, a provenance knowledge graph (7) is generated which contains relationships between particular information evidences in the documents (e.g. the name abbreviation in the order reference field) and the correct triples that may be constructed from this information. In the current case of XML documents, the concrete location within a document is represented as an XPath query. Finally, for the generated dataset and knowledge graphs, metadata such as configuration parameters is generated.

3 Application of ptpDG

On ptpDG’s project site https://purl.org/ptp-dg, a tutorial shows how a dataset with 105k triples was generated in which six organizations trade 30 products over 60 rounds of simulation. In this dataset, 1328 different processes are generated with 2277 documents in total. To ensure that resulting documents comply with given standards, they have been validated against the respective XSD specifications. Consistency of the resulting knowledge graphs has been evaluated with an OWL reasoner, which also means that the knowledge graphs comply with the OWL restrictions specified in P2P-O [6].
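To illustrate how the provenance information described in Section 2 could support evaluation on such a dataset, the following minimal sketch links an evidence location in a noisy document to the ground truth triple it supports and then scores a constructed set of triples against the ground truth. All names are assumptions made for illustration: the XML element names, the `ex:` identifiers and the `ProvenanceEntry` structure are hypothetical and do not reflect the vocabulary used by ptpDG or P2P-O. The sketch relies on the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class ProvenanceEntry:
    document: str   # identifier of the generated document
    xpath: str      # location of the evidence inside that document
    triple: Triple  # ground truth triple supported by the evidence


# Hypothetical noisy invoice: the order reference field holds a name
# abbreviation instead of an order number.
NOISY_INVOICE = (
    "<Invoice><ID>INV-1</ID>"
    "<OrderReference><ID>M.S.</ID></OrderReference>"
    "</Invoice>"
)

GROUND_TRUTH: Set[Triple] = {
    ("ex:invoice1", "ex:refersToOrder", "ex:order7"),
    ("ex:order7", "ex:hasResponsiblePerson", "ex:person3"),
}

PROVENANCE = [
    ProvenanceEntry(
        document="invoice1.xml",
        xpath="./OrderReference/ID",
        triple=("ex:order7", "ex:hasResponsiblePerson", "ex:person3"),
    ),
]


def evidence_text(xml_text: str, xpath: str) -> Optional[str]:
    """Resolve an XPath location and return the text found there, if any."""
    node = ET.fromstring(xml_text).find(xpath)
    return None if node is None else node.text


def precision_recall(constructed: Set[Triple], truth: Set[Triple]):
    """Compare constructed triples with ground truth triples by exact match."""
    correct = constructed & truth
    precision = len(correct) / len(constructed) if constructed else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    entry = PROVENANCE[0]
    found = evidence_text(NOISY_INVOICE, entry.xpath)
    print(f"{entry.document} {entry.xpath} -> {found!r} supports {entry.triple}")

    # Output of some construction approach: one correct, one wrong triple.
    constructed = {
        ("ex:order7", "ex:hasResponsiblePerson", "ex:person3"),
        ("ex:invoice1", "ex:refersToOrder", "ex:order9"),
    }
    print(precision_recall(constructed, GROUND_TRUTH))
```

Exact matching of triples is the simplest possible scoring choice. With provenance entries available, errors can additionally be traced back to the document location that should have produced the missing triple.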
As presented on the project site, further plausibility checks with SPARQL queries and expected results have been conducted, for example, to ensure that the number of final invoices equals the number of partial-final invoicing processes.

⁴ https://www.traffiqx.net/en/about-us

4 Related Work

Different invoice generators exist for presentation purposes and use cases, for example, for entity extraction from paper-based invoices that have been scanned [1]. However, such approaches do not provide labeled data to evaluate knowledge graph construction approaches. Also, to the best of our knowledge, there is no approach that considers the process context of invoices. ptpDG is inspired by a previous approach called Data Sprout [5]. Data Sprout also generates labeled data and ground truth triples, in its case from a given content knowledge graph in the context of heterogeneous spreadsheet generation. However, besides generating other types of data for purchase-to-pay processes, ptpDG extends this approach by introducing a MAS simulation and, as a result, by dispensing with the content knowledge graph as an input.

5 Conclusion and Outlook

This paper introduced ptpDG, a labeled-dataset generator for the sensitive purchase-to-pay domain based on a MAS simulation. In this way, ptpDG moves towards enabling data-driven evaluation of knowledge-graph-based services: A knowledge graph construction approach can now take the generated documents as an input, and the resulting knowledge graph can be evaluated against the provided provenance and ground truth knowledge graph.

For future work, we plan to extend ptpDG with more heterogeneous documents, for example, with synthetic emails as purchase orders and other documents such as dispatch advice and service provision documents. This way, it will be possible to cover more kinds of processes specified in EN16931 [3].
Also, we plan to include more patterns (as in [5]) regarding organization names, product descriptions, and in general those fields where users can insert text freely, to better align the generated documents with real ones. To further evaluate the generated data beyond the presented plausibility checks, we are working on a structural comparison between synthetic and real data. First results indicate that with the current version of ptpDG it is easier to find a configuration that generates correct ratios of different kinds of processes and documents than it is to find a configuration that at the same time generates correct time intervals. Further, support for additional standards and syntaxes such as EDIFACT⁵ is planned.

Acknowledgements. This work was funded by the Investitions- und Strukturbank Rheinland-Pfalz (ISB) (project InnoProm) and the BMBF project SensAI (grant no. 01IW20007).

⁵ https://unece.org/trade/uncefact/introducing-unedifact

References

1. Blanchard, J., Belaïd, Y., Belaïd, A.: Automatic generation of a custom corpora for invoice analysis and recognition. In: Workshop on Industrial Applications of Document Analysis and Recognition, WIADAR@ICDAR 2019, Sydney, Australia, September 22-25, 2019. IEEE (2019)
2. Ciccio, C.D., Marrella, A., Russo, A.: Knowledge-intensive processes: Characteristics, requirements and analysis of contemporary approaches. J. Data Semant. 4(1), 29–57 (2015)
3. EN 16931-1:2017: Electronic invoicing - part 1: Semantic data model of the core elements of an electronic invoice. Standard, CEN (2017)
4. Ji, S., Mittal, P., Beyah, R.A.: Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey. IEEE Commun. Surv. Tutorials 19(2), 1305–1326 (2017)
5. Schröder, M., Jilek, C., Dengel, A.: Dataset generation patterns for evaluating knowledge graph construction.
In: The Semantic Web: ESWC 2021 Satellite Events - Virtual Event, June 6-10, 2021, Revised Selected Papers. Lecture Notes in Computer Science, vol. 12739, pp. 27–32. Springer (2021)
6. Schulze, M., Schröder, M., Jilek, C., Albers, T., Maus, H., Dengel, A.: P2P-O: A purchase-to-pay ontology for enabling semantic invoices. In: The Semantic Web - 18th International Conference, ESWC 2021, Virtual Event, June 6-10, 2021, Proceedings. LNCS, vol. 12731, pp. 647–663. Springer (2021)