Natural Language Processing with Process Models (NLP4RE Report Paper)

Jan Mendling, Wirtschaftsuniversität Wien, Vienna, Austria, jan.mendling@wu.ac.at
Henrik Leopold, Kühne Logistics University, Hamburg, Germany, henrik.leopold@the-klu.org
Lucineia Heloisa Thom, Federal University of Rio Grande do Sul, Porto Alegre, Brazil, lucineia@inf.ufrgs.br
Han van der Aa, Humboldt-Universität zu Berlin, Berlin, Germany, han.van.der.aa@hu-berlin.de

Abstract

This report paper focuses on research at the intersection of business process management and requirements engineering. It gives an overview of the research on natural language processing with process models, organized in terms of 25 challenges. This research line is pursued in a cross-university collaboration between the authors and further colleagues. We describe the most important contributions of the authors and highlight directions for future research.

1 Team Overview

The research team has a track record of more than ten years of joint work in the area of natural language processing with process models. Since several members have changed affiliation, the collaboration has evolved into a virtual research team. The team works in the area of business process management [DRMR18] and conducts research on the analysis of business process models, including process model verification, refactoring, change propagation, matching, process mining, conformance checking, guidelines, and human comprehension of process models. In our tradition, business process management strongly builds on research into the design of workflow systems in the 1990s and the configuration of ERP systems in the 2000s with the help of business process models. It is in line with requirements engineering in its ambition of understanding the application domain, operational constraints, and functionality needed by stakeholders [Som05], and more specifically in its focus on a special class of systems, namely systems that support an organization in executing its business processes.

2 Past Research on NLP for Requirements Engineering with Process Models

In order to organize our previous research in the area of NLP for requirements engineering with a specific focus on process models, we have developed a framework that includes a list of 25 challenges. These challenges are associated with integrating requirements into a process model more efficiently, validating their correctness, completeness, and consistency, and extracting information to support the design and implementation of a system that supports the execution of the business process. The 25 challenges can be organized into three major categories, as Figure 1 illustrates: challenges in relation to automatically processing labels (C1-C7), in relation to process models (C8-C19), and in relation to overall repositories (C20-C25) [MLP14]. Several of these challenges have been addressed by our research as well as by other research teams. In the following, we discuss a selection of our works in order to illustrate the spectrum of contributions that have been made in this area of research. Several of these works have been published in renowned journals, including IEEE Transactions on Software Engineering, Information & Software Technology, Decision Support Systems, and Information Systems.

The initial spark for this research was the observation that the textual labels of process models can be formulated in good or bad ways. This observation provided the motivation for utilizing natural language processing techniques to improve the text labels of process models. Such a technique can be understood as a specific type of refactoring of process models with the aim of making them easier for humans to understand. Towards this end, we developed a technique to automatically identify different styles of labels [LSM11] and guideline violations [LEM+13], based on which we could then refactor them [LSM12]. Recently, we developed a novel label parsing technique, which can be used to better address the aforementioned use cases [LvdAOR19]. With these works, we addressed Challenges C1 and C2.
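To make Challenge C1 (identifying the grammatical structure of a label) concrete, the following minimal sketch classifies activity labels with an off-the-shelf part-of-speech tagger. It is only an illustration under simplifying assumptions, not the dedicated techniques of [LSM11] or [LvdAOR19]; in fact, the unreliability of standard taggers on short, sentence-less labels is exactly what motivates those works. The spaCy model name and the heuristics are our own choices.

# Heuristic guess whether an activity label follows the recommended
# verb-object style ("Check invoice") or an action-noun style ("Invoice checking").
# Assumes spaCy with its small English model:
#   pip install spacy && python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

def label_style(label: str) -> str:
    """Classify an activity label by its (heuristically tagged) grammatical style."""
    doc = nlp(label)
    # Verb-object style: the label starts with a token tagged as a verb.
    if doc[0].pos_ == "VERB":
        return "verb-object"
    # Action-noun style: a gerund or nominalised action ends the label,
    # e.g. "Invoice checking", "Order creation".
    last = doc[-1]
    if last.tag_ == "VBG" or last.text.lower().endswith(("ing", "tion", "ment")):
        return "action-noun"
    return "irregular"

if __name__ == "__main__":
    for lbl in ["Check invoice", "Invoice checking", "Order creation", "Status about order"]:
        print(f"{lbl!r}: {label_style(lbl)}")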
This foundational set of techniques was then extended in different directions. The most notable extensions concern translation, semantic processing, and conformance checking between process models and text, as discussed next.

2.1 Translations between Process Models and Text

An important question for the processing of text and models is to what extent automatic translations are feasible. We addressed this question in both directions: from text to process model and from process model to text.

Our research on the translation from text to process model [FMP11] addresses various challenges that we organize into four categories. The first category, Syntactic Leeway, includes problems that stem from changes between active and passive voice in the input text, potential rewordings, changes of order, and conditions that are not made explicit. The second category, Atomicity, refers to the fact that sentences can be as complex as whole model fragments, that activities can be split across sentences, and that relative clauses have to be dealt with. The third category, Relevance, acknowledges that relative clauses, example sentences, or meta-statements should not lead to model elements. The fourth category, Referencing, deals with anaphora, textual links, and end-of-block recognition. The proposed translation technique works from the sentence level up to the text level and creates a process model automatically. Using a test set of 47 text-model pairs, we achieve an average translation accuracy of 77%. This work has recently been extended with a structural analysis of the texts and an analysis of sentence templates in order to address potential issues of ambiguity [STW+18], and it is currently being integrated into a service-oriented architecture for the generation of process-oriented text.

Our complementary research on the translation from process model to text for validation purposes [LMP14] addresses various challenges that stem from parsing the formal structure of the process model. More specifically, we distinguish four categories of challenges. The first category, Text Planning, deals with linguistic information extraction, model linearization, and text structuring. The second category, Sentence Planning, includes lexicalization and message refinement. The third category, Surface Realization, relates to interfacing with established realizers. The fourth category, Flexibility, addresses variations of input data and the adaptation of output. The proposed translation technique starts with information extraction from the process model elements, parses the process model into a refined process structure tree, and structures the text based on the tree fragments. This data is fed into a deep syntax tree, on which a technique for message refinement is applied. Finally, a realizer generates the resulting natural language text. Our evaluation demonstrates that the generated texts are highly accurate and that a back translation hardly entails any loss of information.
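The following minimal sketch illustrates the generation direction in a drastically simplified, template-based form. It is not the pipeline of [LMP14] (no refined process structure tree, no deep syntax trees, no external realizer); the Activity structure, its role and label fields, and the connective templates are our own assumptions for illustration.

# Toy verbalisation of an already linearised, purely sequential fragment.
# Each activity is assumed to carry a performing role and a verb-object label.

from dataclasses import dataclass
from typing import List

@dataclass
class Activity:
    role: str    # e.g. "clerk"
    label: str   # verb-object label, e.g. "check invoice"

def verbalise(activities: List[Activity]) -> str:
    """Turn a sequential fragment into a short textual process description."""
    connectives = ["First", "Then", "Afterwards", "Finally"]
    sentences = []
    for i, act in enumerate(activities):
        verb, _, obj = act.label.partition(" ")
        connective = connectives[min(i, len(connectives) - 1)]
        # Naive third-person inflection; a real surface realizer, as
        # interfaced in [LMP14], handles inflection and aggregation properly.
        sentences.append(f"{connective}, the {act.role} {verb}s the {obj}.")
    return " ".join(sentences)

if __name__ == "__main__":
    fragment = [Activity("clerk", "check invoice"),
                Activity("clerk", "approve payment"),
                Activity("system", "archive document")]
    print(verbalise(fragment))
    # First, the clerk checks the invoice. Then, the clerk approves the
    # payment. Afterwards, the system archives the document.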
Figure 1: 25 Challenges of Semantic Process Modeling [MLP14], grouped into challenges in relation to labels (C1-C7), in relation to models (C8-C19), and in relation to collections (C20-C25).

2.2 Semantic Processing of Process Models and Text

Each of these translation techniques takes the textual content as given. This is problematic because terms are often ambiguous. This observation is the starting point of our research on the automatic detection and resolution of lexical ambiguity in process models [PLM15]. The corresponding technique covers homonym detection and resolution as well as synonym detection and resolution. The technique is evaluated using a collection of more than 2,000 process models from practice with altogether more than 20,000 text labels. The evaluation reveals homonymous usage of terms such as application, case, or incident, as well as synonymous word pairs such as check-control, create-produce, and customer-client. Automatic resolution significantly reduces this ambiguity.
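As a rough illustration of what such detection involves, the following sketch flags candidate homonyms and synonyms among label terms using WordNet sense information. It is not the technique of [PLM15]; the use of NLTK's WordNet interface, the sense-count threshold, and the example labels are simplifying assumptions of our own.

# Flag candidate homonyms and synonyms among activity label terms.
# Assumes NLTK with the WordNet corpus installed:
#   pip install nltk && python -c "import nltk; nltk.download('wordnet')"

from itertools import combinations
from nltk.corpus import wordnet as wn

labels = ["check order", "control order", "create application", "submit application"]
terms = {term for label in labels for term in label.split()}

# Candidate homonyms: terms with many WordNet senses may be used with
# different meanings across a model collection.
for term in sorted(terms):
    senses = wn.synsets(term)
    if len(senses) > 5:
        print(f"possible homonym: {term!r} ({len(senses)} senses)")

# Candidate synonyms: term pairs that share at least one synset.
for t1, t2 in combinations(sorted(terms), 2):
    if set(wn.synsets(t1)) & set(wn.synsets(t2)):
        print(f"possible synonym pair: {t1!r} / {t2!r}")

In practice, deciding which sense a term actually carries requires taking the label and model context into account, which is where the technique of [PLM15] goes far beyond this sketch.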
A key problem when processing the text labels of models from practice is that practitioners often do not use these labels in a canonical way. Examples are activity labels like "Screen delivery documents if necessary" or "update inventory and archive documents". Canonicity refers to the specification of process model elements in such a way that they correspond to exactly one element [LPM17]. The paper identifies a series of patterns of such non-canonical label usage along with automatic refactorings. The transformation rules replace a single model element carrying a non-canonical text label with a fragment of several elements. For example, "Screen delivery documents if necessary" yields a decision block, and "update inventory and archive documents" a sequence.

2.3 Conformance Checking between Process Models and Text

We are also able to automatically check the conformance between process models and corresponding texts. A specific conformance checking technique has been developed that automatically compares recorded process executions (captured in event logs) to natural language specifications of processes [vdALR18]. A particular challenge in this regard is the inherent ambiguity of natural language, which can lead to different possible interpretations of how a process should be executed. The developed technique uses probabilistic conformance checking to take this ambiguity into account and still provide reliable results.

Several works also consider that process models and textual process descriptions are often used alongside each other in organizations, given their complementary nature [vdALvdWR17]. Techniques have been developed that establish alignments between a model and a corresponding text [SvdACP18], that use such alignments to detect inconsistencies [vdALR17], and that search repositories of both textual and model-based process descriptions simultaneously through a unified querying approach [LvdAP+17].

Many of the proposed techniques also help to match process models. Process model matching can be defined as the task of automatically aligning the text labels of one process model with the labels of a second model [C+13]. The task is rather easy if it can be assumed that there is a 1:1 match between the elements. In practice, this is hardly the case. Often, aspects that are represented in one model are not represented in the other, and vice versa. Matches that bridge different levels of granularity, such as 1:n and n:m matches, are also difficult. The process model matching contest promotes research in this area [C+13].
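To make the matching task concrete, the following sketch aligns the activities of two models greedily by the token overlap of their labels. It is merely an illustration, not one of the techniques evaluated in the matching contest [C+13]; the similarity threshold and the example labels are arbitrary assumptions.

# Greedy 1:1 alignment of activity labels based on token overlap.

from itertools import product

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def match(model_a, model_b, threshold=0.3):
    """Return matched label pairs with their similarity scores."""
    candidates = sorted(
        ((jaccard(a, b), a, b) for a, b in product(model_a, model_b)),
        reverse=True,
    )
    matches, used_a, used_b = [], set(), set()
    for score, a, b in candidates:
        if score >= threshold and a not in used_a and b not in used_b:
            matches.append((a, b, round(score, 2)))
            used_a.add(a)
            used_b.add(b)
    return matches

if __name__ == "__main__":
    model_a = ["receive purchase order", "check purchase order", "ship goods"]
    model_b = ["receive order", "verify order", "send goods"]
    print(match(model_a, model_b))
    # [('receive purchase order', 'receive order', 0.67), ('ship goods', 'send goods', 0.33)]

Note that the purely lexical overlap misses the correspondence between "check purchase order" and "verify order"; bridging such synonymy is precisely where the semantic techniques discussed in Section 2.2 come into play.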
3 Research Plan on NLP for Requirements Engineering with Process Models

Many of the developed techniques are important for making business process management smarter [MBBF17], though various challenges remain. Many of them can be related to the 25 challenges illustrated above, but some also go beyond them. In our own future research, we aim to address the following problems.

First, our current approach for identifying process model elements in natural language text is based on a reduced set of BPMN elements (e.g., activity, subprocess, start, intermediate, and end events). As future work, we plan to extend our approach to support a larger number of elements as well as to filter natural language texts by process perspectives such as data and events.

Second, we have observed that the quality of process descriptions in practice is often low. This calls for research on techniques that are able to check the quality of poor text and to refactor it. One option is to use domain ontologies to check the consistency of process descriptions against the respective ontological concepts. The benefits of ontology usage in this context have already been studied empirically in [GMB+17].

Third, while existing work on the extraction of process models from natural language focuses on imperative process descriptions and models, we are currently working on the extraction of declarative process constraints from natural language [vdACLR19]. In this way, we aim to deal with rule-based descriptions of processes.

Acknowledgements

Lucineia Heloisa Thom is a CAPES scholarship holder, Program Professor Visitante no Exterior, Process Number: 88881.172071/2018-01. Han van der Aa is funded as a research fellow of the Alexander von Humboldt Foundation.

References

[C+13] Ugur Cayoglu et al. Report: The process model matching contest 2013. In International Conference on Business Process Management, pages 442–463. Springer, 2013.
[DRMR18] Marlon Dumas, Marcello La Rosa, Jan Mendling, and Hajo A. Reijers. Fundamentals of Business Process Management, Second Edition. Springer, 2018.
[FMP11] Fabian Friedrich, Jan Mendling, and Frank Puhlmann. Process model generation from natural language text. In CAiSE, pages 482–496. Springer, 2011.
[GMB+17] Jonas Bulegon Gassen, Jan Mendling, Amel Bouzeghoub, Lucinéia Heloisa Thom, and José Palazzo M. de Oliveira. An experiment on an ontology-based support approach for process modeling. Information & Software Technology, 83:94–115, 2017.
[LEM+13] Henrik Leopold, Rami-Habib Eid-Sabbagh, Jan Mendling, Leonardo Guerreiro Azevedo, and Fernanda Araujo Baião. Detection of naming convention violations in process models for different languages. Decision Support Systems, 56:310–325, 2013.
[LMP14] Henrik Leopold, Jan Mendling, and Artem Polyvyanyy. Supporting process model validation through natural language generation. IEEE Trans. Software Eng., 40(8):818–840, 2014.
[LPM17] Henrik Leopold, Fabian Pittke, and Jan Mendling. Ensuring the canonicity of process models. Data Knowl. Eng., 111:22–38, 2017.
[LSM11] Henrik Leopold, Sergey Smirnov, and Jan Mendling. Recognising activity labeling styles in business process models. Enterprise Modelling & Inf. Systems Architectures, 6(1):16–29, 2011.
[LSM12] Henrik Leopold, Sergey Smirnov, and Jan Mendling. On the refactoring of activity labels in business process models. Inf. Syst., 37(5):443–459, 2012.
[LvdAOR19] Henrik Leopold, Han van der Aa, Jelmer Offenberg, and Hajo A. Reijers. Using hidden Markov models for the accurate linguistic analysis of process model activity labels. Information Systems (accepted for publication), 2019.
[LvdAP+17] Henrik Leopold, Han van der Aa, Fabian Pittke, Manuel Raffel, Jan Mendling, and Hajo A. Reijers. Searching textual and model-based process descriptions based on a unified data format. Software & Systems Modeling, pages 1–16, 2017.
[MBBF17] Jan Mendling, Bart Baesens, Abraham Bernstein, and Michael Fellmann. Challenges of smart business process management: An introduction to the special issue. Decision Support Systems, 100:1–5, 2017.
[MLP14] Jan Mendling, Henrik Leopold, and Fabian Pittke. 25 challenges of semantic process modeling. Int. J. of Inf. Systems and Software Engineering for Big Companies, 1(1):78–94, 2014.
[PLM15] Fabian Pittke, Henrik Leopold, and Jan Mendling. Automatic detection and resolution of lexical ambiguity in process models. IEEE Trans. Software Eng., 41(6):526–544, 2015.
[Som05] Ian Sommerville. Integrated requirements engineering: A tutorial. IEEE Software, 22(1):16–23, 2005.
[STW+18] Thanner Soares Silva, Lucinéia Heloisa Thom, Aline Weber, José Palazzo Moreira de Oliveira, and Marcelo Fantinato. Empirical analysis of sentence templates and ambiguity issues for business process descriptions. In OTM, Proceedings, Part I, pages 279–297, 2018.
[SvdACP18] Josep Sànchez-Ferreres, Han van der Aa, Josep Carmona, and Lluís Padró. Aligning textual and model-based process descriptions. Data Knowl. Eng., 118:25–40, 2018.
[vdACLR19] Han van der Aa, Claudio Di Ciccio, Henrik Leopold, and Hajo A. Reijers. Extracting declarative process models from natural language. In CAiSE (accepted for publication), 2019.
[vdALR17] Han van der Aa, Henrik Leopold, and Hajo A. Reijers. Comparing textual descriptions to process models - the automatic detection of inconsistencies. Inf. Syst., 64:447–460, 2017.
[vdALR18] Han van der Aa, Henrik Leopold, and Hajo A. Reijers. Checking process compliance against natural language specifications using behavioral spaces. Inf. Syst., 78:83–95, 2018.
[vdALvdWR17] Han van der Aa, Henrik Leopold, Inge van de Weerd, and Hajo A. Reijers. Causes and consequences of fragmented process information: Insights from a case study. In 23rd Americas Conference on Information Systems, 2017.