Keywords: Enterprise Architecture, Business Process, Web Content Mining, Natural Language Processing

Andrii Kopp


Dmytro Orlovskyi

National Technical University “Kharkiv Polytechnic Institute”, Kyrpychova str. 2, Kharkiv, 61002, Ukraine


This paper considers the automatic extraction of enterprise architecture models from websites to simplify the blueprinting of enterprise architecture landscapes at the conceptual level. We propose to call this technique “enterprise architecture web mining”. Nowadays almost all organizations offer their products and services through their websites and, therefore, represent their value-creating processes on the Internet. Thus, enterprise homepages can be considered sources of business information sufficient to understand a company's business process landscape and to make further decisions, depending on the party that uses such information. The proposed approach includes two major stages: business activity detection using hyperlinks of the company's web page, which could represent triggers of certain e-commerce business processes, and enterprise architecture model creation based on the obtained data. The software implementation of the proposed approach uses natural language processing to detect business activities on corporate web pages and produces human-readable enterprise architecture models that describe the business processes offered by the examined organizations and the supporting application and technology environment. The obtained models represent knowledge about the primary business activities conducted by organizations and could be used for decision-making. As a result, enterprise architecture landscapes were built for several organizations using only their publicly available websites. The limitations are discussed, conclusions are drawn, and future work in this field is formulated.

1. Introduction: Related Work and Problem Statement

Information Technology and Implementation (IT&I-2022), November 30 - December 02, 2022, Kyiv, Ukraine
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

be considered as follows [1]:

• data architecture includes data objects, entities, attributes, etc.;
• applications architecture includes application components, services, interfaces, etc.;
• technology architecture includes system software, nodes, devices, artifacts, etc.

Using EA blueprints, organizations can assess how efficiently they move toward current and future objectives and decide on the changes necessary to improve efficiency. Moreover, EA gives a general overview of a whole system, even a large and complex one. Using the EA approach, an organization can identify gaps between the current and desired states from various viewpoints, define initiatives that should be implemented to achieve the future state, and continuously track EA changes over time toward the planned state. The evolution of EA is always defined by its business domain: business processes, and the services they realize for the organization’s external or internal consumers, dictate the necessary landscape of software systems and IT infrastructure. In turn, business processes, services, and created products depend on the organization’s goals and capabilities. Therefore, EA could be considered a structured high-level description of an organization from different viewpoints (i.e. business, data, applications, and technology [3]) that serve each other in a layered bottom-up manner.

This paper proposes an approach and a software tool for the automatic extraction of EA landscapes from websites, which nowadays virtually represent organizations on the Internet. This approach aims at simplifying the procedure of building high-level models in the preliminary stages of EA development. It is well known that today most enterprises offer their products and services on homepages that are top-ranked by multiple search engines. Usually, organizational websites contain information not only about offered products or services but also about related activities that allow customers to receive the respective products or services (e.g. order, buy, learn, etc.).

The study object is the procedure of EA structure extraction from organizational websites that serve as virtual enterprise representations on the Internet. The study subject is the approach and software tool to extract EA landscapes from organizational websites. The study goal is to simplify the process of EA description in the early stages of EA development.

This paper is organized in the following way. In the next subsections, EA frameworks and modeling approaches are discussed, virtual enterprise representation on the Internet is considered, and a formal problem statement is given. Section 2 outlines the proposed approach to automatic EA construction based on organizational websites. Section 3 includes the description of the developed software tool, together with the analysis and discussion of the obtained results. Section 4 contains the conclusion and formulates future work in this field.

1.1. Related Work

1.1.1. Enterprise Architecture Frameworks

The origins of EA date back to the late 1980s, when J. Zachman published the paper “A Framework for Information Systems Architecture”. When the so-called Zachman Framework (ZF) was proposed, organizations had much simpler information systems landscapes than they have today. Thus, over time the ZF was updated and used not only as an information systems framework but also as an Enterprise Architecture Framework (EAF) across various organizations [4].

“Framework” is a term usually met in the software development field. It denotes a set of building blocks that help developers provide the generic capabilities of a software solution. Software development frameworks tend to provide ready source code that only has to be customized or extended to satisfy particular software requirements. Such source code could be given in the form of libraries, toolkits, application programming interfaces (API), etc. [5]. The EAF concept also uses the framework principles mentioned above, but applies them to an organization as a whole, not only to a software system. Existing EA frameworks tend to provide general recommendations and reference solutions that may help in creating and managing EA. EA frameworks also suggest the form of EA description (i.e. models, documents, blueprints, matrices, etc.) [5]. Apart from the ZF, which has lost its relevance to modern business processes and IT infrastructures, the most popular EA frameworks are:

• The Open Group Architecture Framework (TOGAF): an EA framework created and supported by The Open Group that provides a detailed methodology and tools for EA development; its core Architecture Development Method (ADM) provides enterprises with a detailed approach to step-by-step EA development [6];
• Federal Enterprise Architecture Framework (FEAF): a complex framework by the Federal Government of the United States that is focused on developing and maintaining EA capabilities; it provides a standardized method and principles for creating and exchanging EA information among Federal agencies [6];
• Department of Defense Architecture Framework (DoDAF): an architecture framework that is intended to help systems engineers describe complex systems; it emerged in the United States Department of Defense as the structure for EA development that engineering and acquisition staff use to describe the whole system [7];
• Ministry of Defense Architecture Framework (MoDAF): an EAF adapted and extended by the United Kingdom Ministry of Defense from the DoDAF; the unique MoDAF viewpoints added to the original DoDAF include strategic and acquisition views that describe high-level requirements for enterprise change and programmatic details, respectively [7].

However, TOGAF is still the most popular EA framework because its constant development over the last two decades has made it an EA development standard [8].

1.1.2. Enterprise Architecture Modeling

The ArchiMate EA modeling language is also authored by The Open Group, the authors of TOGAF. This language provides a visual notation to illustrate enterprise architecture elements and the relationships between them in a standardized way. Besides the EA domains, this powerful language allows modeling stakeholders, requirements, goals, etc. [9].

ArchiMate describes business processes, including their structure and flows, organizational structure elements, application systems, information flows, and technology infrastructure (Table 1). The goal of ArchiMate modeling is to provide a tool to depict changes in EA elements and relationships, evaluate the consequences of decisions, and communicate EA solutions [10].

1.1.3. Enterprise Architecture Web Mining: State-of-the-Art

As stated in the introduction, the suggested “EA web mining” technique focuses on the automatic construction of EA models using corporate websites as sources of data about EA elements and the relationships between them. Hence, the main problem is finding mentions of business processes and other EA elements in the HyperText Markup Language (HTML) pages of corporate websites. While a direct search in Google Scholar for the key phrase “enterprise architecture web mining” did not return any results, the phrase “enterprise architecture mining” allowed us to discover several studies in this direction:

• in [11] the author states that manual maintenance of EA models is costly and time-consuming, and therefore proposes EA mining algorithms and tools based on process mining;
• the study [12] also considers automatic EA modeling methods that are supposed to reduce the drawbacks of manual EA modeling (error-proneness, time and cost consumption, limited accuracy, etc.);
• the systematic review [13] also states that automatic EA modeling could respond to the challenges of manual EA modeling, but this field is still immature and requires further research.

1.2. Problem Statement

The formal representation of an ArchiMate EA model is the following [14]:

$$EA = \langle V, E, T_V, T_E, \tau_V, \tau_E \rangle, \tag{1}$$

where:

• $V$ is the set of vertices that represent EA model elements;
• $E \subset V \times V$ is the set of edges that represent relationships between EA model elements;
• $T_V$ is the set of ArchiMate element types;
• $T_E$ is the set of ArchiMate relationship types;
• $\tau_V : V \to T_V$ is the mapping between graph vertices and ArchiMate element types;
• $\tau_E : E \to T_E$ is the mapping between graph edges and ArchiMate relationship types.
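The typed-graph structure described above can be illustrated with a small Python data structure. This is only a sketch under stated assumptions: the class name `EAModel`, its methods, and the short lists of element and relationship types are hypothetical and chosen for illustration; they are not taken from the authors' tool.

```python
from dataclasses import dataclass, field

# Assumed, deliberately incomplete subsets of ArchiMate types (illustrative only).
ELEMENT_TYPES = {"BusinessService", "BusinessProcess", "ApplicationService",
                 "ApplicationComponent", "TechnologyService", "Node"}
RELATIONSHIP_TYPES = {"Serving", "Realization", "Triggering"}


@dataclass
class EAModel:
    """Typed directed graph: elements, relationships, and their type mappings."""
    vertices: set = field(default_factory=set)        # EA model elements
    edges: set = field(default_factory=set)           # pairs of elements
    vertex_type: dict = field(default_factory=dict)   # element -> element type
    edge_type: dict = field(default_factory=dict)     # edge -> relationship type

    def add_element(self, name, element_type):
        if element_type not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {element_type}")
        self.vertices.add(name)
        self.vertex_type[name] = element_type

    def add_relationship(self, source, target, relationship_type):
        if relationship_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {relationship_type}")
        if source not in self.vertices or target not in self.vertices:
            raise ValueError("both endpoints must already be model elements")
        self.edges.add((source, target))
        self.edge_type[(source, target)] = relationship_type


# Hypothetical example: a process that realizes a service.
model = EAModel()
model.add_element("Buy a phone", "BusinessProcess")
model.add_element("Mobile connectivity", "BusinessService")
model.add_relationship("Buy a phone", "Mobile connectivity", "Realization")
```

Keeping the type mappings as dictionaries keyed by vertices and edges mirrors the formal definition directly, which makes it straightforward to check that every edge connects existing elements.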

Hence, the tuple (1) should be automatically constructed using the HTML web page tags, their attributes, and inner text fragments. First of all, the web page should be parsed to work with its tags, their attributes, and text content. Formally, this can be given using the following equation:

$$H = \mathrm{parse}(u), \tag{2}$$

where:

• $u$ is the Uniform Resource Locator (URL) of a web page that should be parsed;
• $H$ is the set of tags obtained after the web page parsing;
• $\mathrm{parse} : u \to H$ is the function that defines a mapping between URL addresses of web pages and parsed tags that belong to these web pages.

Then the web page tags obtained using (2) should be used to extract the data about the organization’s activity described on its web page on the Internet. The following formalism describes this step:

$$D = \mathrm{extract}(H), \tag{3}$$

where:

• $D = \{ d = \langle name, attr : key \to value, text \rangle \}$ is the bag of web page tags $d$, each of which has a name $name$, attributes $attr$ (whose values $value$ are accessible through their names $key$), and a text content $text$;
• $\mathrm{extract} : H \to D$ is the function that defines a mapping between web page parsed tags and structured tag data elements.

Using the structured tag data obtained using (3), business activities that help an organization virtually promote its products or services on the Internet should be detected. Formally, this operation could be described using the following equation:

$$A = \mathrm{detect}(D), \tag{4}$$

where:

• $A$ is the set of business activities detected after the processing of the set of structured tag data elements $D$;
• $\mathrm{detect} : D \to A$ is the function that defines a mapping between structured tag data elements $D$ and business activities $A$.

Finally, using the set of business activities obtained using (4) and the previous outcomes, the EA model should be built using the following formalism:

$$EA = \mathrm{build}(u, t, s, A), \tag{5}$$

where $\mathrm{build} : \langle u, t, s, A \rangle \to EA$ is the function that defines a mapping between URL addresses of web pages $u$, web page title $t$ and description $s$ meta tags content, and business activities $A$ on the one side and the ArchiMate EA model $EA$ on the other side.

The conceptual model of automatic EA model construction using the company’s homepage on the Internet, based on introduced transformations (2) – (5), is demonstrated in Fig. 1.

The proposed workflow (Fig. 1) should help to automatically build high-level architectural models using only the websites of organizations; we name the suggested technique “enterprise architecture web mining”. The obtained models may describe landscapes of top-level business processes based on the products or services offered to customers on the company’s homepage. Moreover, the obtained EA models should include application layers to demonstrate website maps, and technology layers to complete the ArchiMate cross-layer architecture. However, the most valuable outcome is still the business architecture layer, which includes the core value-added business processes and the business service offered to the organization’s clients. ArchiMate EA models automatically produced from the company’s website can help to understand the current state of the enterprise, including its customer relationship strategy and its offered products and services. Then, shortcomings could be detected in such an EA model, and decisions to improve the enterprise’s virtual representation on the Internet could be made.
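The first two steps of this workflow, web page parsing and the structuring of tag data, can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it uses the standard-library `html.parser` module in place of Beautiful Soup so that it runs without extra dependencies, and the names `TagData`, `TagCollector`, `fetch_html`, and `extract_tag_data` are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.request import urlopen

# HTML elements that never have a closing tag or text content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class TagData:
    """Structured tag record: name, attributes by name, and text content."""
    name: str
    attrs: dict = field(default_factory=dict)
    text: str = ""


class TagCollector(HTMLParser):
    """Collects every tag together with the text nested inside it."""

    def __init__(self):
        super().__init__()
        self.tags = []    # all tag records in document order
        self._open = []   # stack of tags that are still open

    def handle_starttag(self, tag, attrs):
        record = TagData(tag, {key: value or "" for key, value in attrs})
        self.tags.append(record)
        if tag not in VOID_TAGS:
            self._open.append(record)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag, tolerating unbalanced markup.
        for index in range(len(self._open) - 1, -1, -1):
            if self._open[index].name == tag:
                del self._open[index:]
                break

    def handle_data(self, data):
        # Credit the text to every open tag, so nested markup inside a
        # hyperlink still contributes to the hyperlink's text content.
        for record in self._open:
            record.text += data


def fetch_html(url):
    """Download a page (needs network access, so it is not called below)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_tag_data(html):
    """Turn HTML markup into a list of structured tag records."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    for record in collector.tags:
        record.text = " ".join(record.text.split())  # normalize whitespace
    return collector.tags


SAMPLE_HTML = """<html><head><title>Example Telecom</title>
<meta name="description" content="Phones and plans">
</head><body><a href="/shop"><span>Shop</span> phones</a></body></html>"""
tags = extract_tag_data(SAMPLE_HTML)
```

For a live page, the markup would come from `fetch_html(url)`; the sample string keeps the sketch reproducible offline.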

2. Proposed Approach

2.1. Business Activities Detection in Organizational Web Pages

The first HTML tags that should be processed using the proposed approach are “title” and “meta”. These tags contain descriptive information about a web page and, therefore, about the organization and products or services it virtually offers on the Internet.

The text content of the “title” tag can be obtained by processing the structured tag data elements in the following way, based on tuple calculus formalisms [15]:

$$t \in \{ d : \{ d.text \} \mid d \in D \wedge d.name = \text{"title"} \}, \tag{6}$$

where $t$ is the web page title data.

Then it is proposed to process the “meta” tag whose “name” attribute has the value “description” to get the value of its “content” attribute. This could be formally described using the following equation, based on tuple calculus formalisms [15]:

$$s \in \{ d : \{ d.attr(\text{"content"}) \} \mid d \in D \wedge d.name = \text{"meta"} \wedge d.attr(\text{"name"}) = \text{"description"} \}, \tag{7}$$

where $s$ is the web page description data.

We propose to use the web page description as the “Business service” ArchiMate element to reflect the product(s) or service(s) virtually offered by the organization whose homepage is processed. The web page title is proposed to represent the website as the “Application component” ArchiMate element to demonstrate the software that supports the business processes of product or service delivery through the Internet. The other important ArchiMate elements, “Business process” and “Application service”, are proposed to be created using the hyperlink “a” tags on the organization’s homepage. We assume that hyperlinks reflect actions that customers can take when visiting a website to perform business activities, e.g. order a product, buy a subscription, learn a tutorial, etc. In other words, by using hyperlinks customers trigger business processes on the websites to get products or services. Using the following equation, based on tuple calculus formalisms [15], a set of pairs of hyperlink text content and URL can be received:

$$L = \{ d : \{ d.text, d.attr(\text{"href"}) \} \mid d \in D \wedge d.name = \text{"a"} \}, \tag{8}$$

where $L$ is the set of pairs of hyperlink text content and URL data.
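The three selections described above (title text, description content, and hyperlink text and URL pairs) map naturally onto Python comprehensions. The sketch below assumes that each structured tag is a plain dictionary with `name`, `attrs`, and `text` keys; this representation and the function names are hypothetical and chosen only for illustration.

```python
def page_title(tags):
    """Text content of the first "title" tag, or None if there is none."""
    return next((t["text"] for t in tags if t["name"] == "title"), None)


def page_description(tags):
    """Value of "content" for the "meta" tag named "description", or None."""
    return next(
        (t["attrs"].get("content") for t in tags
         if t["name"] == "meta" and t["attrs"].get("name") == "description"),
        None,
    )


def page_hyperlinks(tags):
    """Pairs of (text content, URL) for every "a" tag that has an href."""
    return [(t["text"], t["attrs"]["href"]) for t in tags
            if t["name"] == "a" and "href" in t["attrs"]]


# Hypothetical structured tag data for a small page.
sample_tags = [
    {"name": "title", "attrs": {}, "text": "Example Telecom"},
    {"name": "meta", "attrs": {"charset": "utf-8"}, "text": ""},
    {"name": "meta",
     "attrs": {"name": "description", "content": "Phones and plans"},
     "text": ""},
    {"name": "a", "attrs": {"href": "/shop"}, "text": "Shop phones"},
    {"name": "a", "attrs": {"id": "top"}, "text": "Back to top"},
]

title = page_title(sample_tags)
description = page_description(sample_tags)
links = page_hyperlinks(sample_tags)
```

Skipping "a" tags without an `href` attribute is a design choice of this sketch: such anchors have no URL that could be turned into an element of the model.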

This ArchiMate EA model should include a “Business service” element based on the web page description, “Business process” elements based on hyperlink text content values, “Application service” elements based on hyperlink URL values, an “Application component” element based on the web page title, a “Technology service” element based on the web page URL, and a “Technology node” element that represents web hosting. The relationships between the EA elements mentioned above are given in Table 3 according to the syntax and semantics of the ArchiMate EA modeling language [3].

The formal description of the ArchiMate model (1) that could be constructed taking into account the suggested EA elements and relationships between them (Table 3) is given below:

The software implementation includes the main module, which serves as the application’s entry point. It depends on four modules that correspond to the steps of the proposed approach (Fig. 1). These are the following software modules:

• “Web Page Parsing”: this module is responsible for HTML page parsing to work with tags, attributes, and text contents;
• “Data Extraction”: this module is responsible for processing the title and description tags, as well as for extracting URL address and text content data from web page hyperlinks;
• “Business Activities Detection”: this module is responsible for processing hyperlinks to detect the ones that denote business activities, which trigger business processes supported by the web application services;
• “EA Generation”: this module is responsible for ArchiMate model generation using the EA elements and relationships formulated in the previous steps and formally described by (14); the output files are produced in the PlantUML diagramming language [18].
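The generation step can be sketched as a function that writes the extracted elements as PlantUML text. This is only an illustration under stated assumptions: the function name `to_plantuml`, the element aliases, and the sample data are hypothetical, the sketch relies on PlantUML's `archimate` element keyword with layer colors, and relationships are left out because they depend on the rules given in Table 3.

```python
def to_plantuml(url, title, description, activities):
    """Render extracted EA elements as PlantUML ArchiMate element lines.

    activities is a list of (hyperlink text, hyperlink URL) pairs.
    """
    def quote(text):
        # Collapse whitespace and avoid breaking the quoted PlantUML label.
        return '"' + " ".join(text.split()).replace('"', "'") + '"'

    lines = ["@startuml"]
    # Business layer: one service from the description, one process per link.
    lines.append(f"archimate #Business {quote(description)} as business_service")
    for index, (label, href) in enumerate(activities, start=1):
        lines.append(
            f"archimate #Business {quote(label)} as business_process_{index}")
        # Application layer: one application service per hyperlink URL.
        lines.append(
            f"archimate #Application {quote(href)} as application_service_{index}")
    lines.append(f"archimate #Application {quote(title)} as application_component")
    # Technology layer: the page URL and a node standing for web hosting.
    lines.append(f"archimate #Technology {quote(url)} as technology_service")
    lines.append('archimate #Technology "Web hosting" as technology_node')
    lines.append("@enduml")
    return "\n".join(lines)


diagram = to_plantuml(
    "https://example.com/",
    "Example Telecom",
    "Phones and plans",
    [("Shop phones", "/shop")],
)
print(diagram)
```

The resulting text can be saved to a file and rendered with PlantUML; adding relationship lines between the aliases would complete the model.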

The software structure is given in the component diagram (Fig. 4). According to this diagram, the application also uses Python standard-library and third-party modules. The following modules are in use:

• “urllib.request” (standard library): this module helps to make HTTP requests and open URLs, taking into account authentication, redirections, cookies, and other features [19];
• “bs4” or “Beautiful Soup” (third-party): this module helps to pull data out of HTML and eXtensible Markup Language (XML) files [20];
• “re” (standard library): this module provides regular expression operations [21];
• “nltk” or “Natural Language Toolkit” (third-party): this module helps to work with human language data in Python [22].

Hence, the created software tool uses the “urllib.request” module to request and open web pages and the “bs4” module to extract data from HTML pages, while the “re” and “nltk” modules are used to process the data extracted from web pages and to detect possible business activities offered by corporate homepages. The “Natural Language Toolkit” module plays a core role in the implemented algorithm for business activity detection (Fig. 2). It is used for part-of-speech tagging of the words in hyperlink text content to detect the hyperlinks that begin with verbs. Then, according to the verb-object activity labeling best practice [16], such hyperlinks are used as sources for extracting business process and application service elements according to the suggested algorithm (Fig. 2).
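The verb-first filtering idea can be sketched as follows. To keep the example self-contained, it replaces the NLTK part-of-speech tagger with a tiny hard-coded verb list, which is a deliberate simplification and not the authors' method; the names `ACTION_VERBS`, `starts_with_verb`, and `detect_activities` are hypothetical. A tagger-based predicate (for instance, one built on `nltk.pos_tag` that accepts tags starting with "VB") could be passed through the `is_verb` parameter instead.

```python
import re

# Tiny stand-in lexicon used instead of a real part-of-speech tagger.
ACTION_VERBS = {"buy", "order", "shop", "learn", "pay", "get", "find",
                "check", "report", "try", "bring", "switch", "join"}


def starts_with_verb(label, is_verb=None):
    """Return True if the first word of the label is recognized as a verb."""
    is_verb = is_verb or (lambda word: word in ACTION_VERBS)
    words = re.findall(r"[a-z']+", label.lower())
    return bool(words) and bool(is_verb(words[0]))


def detect_activities(links, is_verb=None):
    """Keep (text, URL) hyperlink pairs whose text follows the verb-object style."""
    activities = []
    for text, href in links:
        label = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if label and starts_with_verb(label, is_verb):
            activities.append((label, href))
    return activities


sample_links = [
    ("Shop  phones", "/shop"),
    ("Unlimited Phone Plans", "/plans"),
    (" Pay my bill ", "/pay"),
    ("", "/home"),
]
activities = detect_activities(sample_links)
print(activities)
```

Making the verb test a parameter keeps the filtering logic independent of the tagger, so the stand-in lexicon can be swapped for a statistical tagger without touching the rest of the code.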

3. Results and Discussion

To demonstrate the capabilities of the proposed “EA web mining” approach and the corresponding software tool (Fig. 4), let us process the websites of two well-known enterprises from the telecommunications industry. As a result, we expect to obtain EA models that reveal the business processes that users of these websites could trigger to receive services or order products.

The first telecommunications enterprise whose website we used as the source for “EA web mining” is T-Mobile (Fig. 5) [23]. A closer look at the extracted business processes is given in Fig. 6. This model demonstrates only the business process architecture, while the other EA elements and relationships (Fig. 5) are omitted.

Another telecommunications enterprise whose website we used as the source for “EA web mining” is Verizon (Fig. 7) [24]. A closer look at the extracted business processes is given in Fig. 8. This model demonstrates only the business process architecture, while the other EA elements and relationships (Fig. 7) are omitted.

For the sake of the EA models’ readability, the names of the application services in Fig. 5 and Fig. 7 were changed to “…” because the respective hyperlink URLs could be of significant length and, therefore, could horizontally overflow the models and make them unclear for a reader. Thus, the automatically designed T-Mobile (Fig. 5) and Verizon (Fig. 7) EA models contain 16 (Fig. 6) and 12 (Fig. 8) business processes, respectively. However, there are “false positive” business processes that do not correspond to the verb-object activity labeling style [16]:

• the “Unlimited Phone Plans” and “Unlimited Age 55+” elements of the T-Mobile EA model;
• the “Certified pre-owned phones”, “Certified pre-owned watches”, “Charging”, “Gaming”, “Unlimited”, “Connected devices”, “Connected car plans”, and “Moving” elements of the Verizon EA model.

Therefore, we can introduce the following quality measures:

• $TP$ is the number of “true positive” detected business processes: 14 for the T-Mobile EA model (Fig. 6) and 4 for the Verizon EA model (Fig. 8);
• $FP$ is the number of “false positive” detected business processes: 2 for the T-Mobile EA model (Fig. 6) and 8 for the Verizon EA model (Fig. 8).

Hence, the precision of the proposed “EA web mining” approach could be measured as follows:

$$Precision = \frac{TP}{TP + FP} = \frac{14 + 4}{(14 + 4) + (2 + 8)} = \frac{18}{28} \approx 0.64. \tag{15}$$

The calculated precision measure (15) indicates that 64% of the detected business process elements represent business activities offered by the considered websites [23], [24]. The remaining elements recognized as “business processes” represent offers that inform customers but do not usually require any active behavior (such as “bring”, “pay”, “report”, “try”, etc.). Such elements could be changed from active to passive ArchiMate structure elements, such as “business objects”.

The precision measure could be improved by introducing more advanced methods and techniques for business activity detection, e.g. neural networks or other machine learning facilities.
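The precision calculation can be reproduced with a few lines of Python. The counts are the ones reported above for the two websites; the function and variable names are hypothetical.

```python
def precision(true_positives, false_positives):
    """Share of detected business processes that are real business activities."""
    detected = true_positives + false_positives
    if detected == 0:
        raise ValueError("no business processes were detected")
    return true_positives / detected


# (true positives, false positives) per website, as reported in the text.
counts = {"T-Mobile": (14, 2), "Verizon": (4, 8)}

total_tp = sum(tp for tp, _ in counts.values())   # 14 + 4
total_fp = sum(fp for _, fp in counts.values())   # 2 + 8
overall = precision(total_tp, total_fp)           # 18 / 28

print(f"overall precision: {overall:.2f}")
for site, (tp, fp) in counts.items():
    print(f"{site}: {precision(tp, fp):.2f}")
```

Computing the measure per website as well shows that the pooled value hides a large difference between the two homepages, which may help when deciding where detection needs to be improved first.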

However, the final decision on EA design, including possible adjustments, must be made by the EA model designer, since the final goal of automatic EA modeling is to reduce the time and cost consumption of enterprise architecture modeling, while keeping models accurate and relevant to a modeling domain.

4. Conclusion and Future Work

In this paper, we proposed an approach and a software tool for the automatic building of EA models using corporate websites. The proposed technique is named “enterprise architecture web mining” and aims to simplify the process of enterprise architecture blueprinting in the early stages of EA development. It is expected that the proposed approach can reduce the time and cost consumption of EA modeling by making it possible to construct business process-centric EA landscapes directly from company homepages.

The proposed approach uses HTML parsing techniques to extract data from enterprise web pages. It considers the “title” and “description” meta tags as the sources of general business information, and hyperlink tags as the sources of business activity information. Hyperlink text content values are checked against the verb-object labeling style to recognize business activities among all web page hyperlinks. Then the detected business activities are represented as ArchiMate business processes together with the remaining EA elements: the business service (based on the web page description), application services (based on the hyperlink URL values), the application component (based on the web page title), the technology service (based on the web page URL), and the technology node (which represents web hosting).

The software implementation of the proposed approach is based on the Python language with its modules for HTTP request handling, HTML file parsing, regular expression matching, and natural language processing. The software tool was used to apply the “EA web mining” technique to build EA models based on the T-Mobile and Verizon homepages. The obtained ArchiMate EA models demonstrate the business processes discovered on these web pages and the supporting EA elements and relationships. Additional business process architecture models were also built and analyzed taking into account the precision measure. The obtained EA models and their analysis results demonstrate the 64% precision of the suggested “EA web mining” technique. Future work in this field should include the elaboration of business activity detection in enterprise web pages.

5. References

[1] Josey et al., TOGAF® Business Architecture Level 1 Study Guide, TOGAF series, Van Haren, 2019.

[2] Masuda, Viswanathan, Enterprise Architecture for Global Companies in a Digital IT Era: Adaptive Integrated Digital Architecture Framework (AIDAF), Springer, 2019.

[3] Josey, ArchiMate® 3.0.1 - Pocket Guide, Van Haren, 2017.

[4] J. D. McDowall, Complex Enterprise Architecture: A New Adaptive Systems Approach, Apress, 2019.

[5] Kale, Digital Transformation of Enterprise Architecture, CRC Press, 2019.

[6] Zimmermann, Schmidt, L. C. Jain, Architecting the Digital Transformation: Digital Business, Technology, Decision Support, Management, Springer Nature, 2020.

[7] H. A. H. Handley, The Human Viewpoint for System Architectures, Springer, 2019.

[8] Buchalcevova, Software process improvement in small companies as a path to enterprise architecture, Information Systems Development, Springer, New York, NY, 2013, pp. 243-253. doi: 10.1007/978-1-4614-4951-5_20.

[9] Moyle, Kelley, Practical Cybersecurity Architecture: A guide to creating and implementing robust designs for cybersecurity architects, Packt Publishing Ltd, 2020.

[10] Fleischmann, Oppl, Schmidt, Stary, Contextual Process Digitalization: Changing Perspectives - Design Thinking - Value-Led Design, Springer Nature, 2020.

[11] Fajri, Enterprise Architecture Mining, MS thesis, University of Twente, 2019.

[12] Pérez-Castillo, Ruiz, Piattini, A decision-making support system for Enterprise Architecture Modelling, Decision Support Systems 131 (2020) 113249. doi: 10.1016/j.dss.2020.113249.

[13] Perez-Castillo et al., A systematic mapping study on enterprise architecture mining, Enterprise Information Systems 5 (13) (2019) 675-718. doi: 10.1080/17517575.2019.1590859.

[14] Orlovskyi, Kopp, Enterprise Architecture Modeling Support based on Data Extraction from Business Process Models, CEUR Workshop Proceedings 2608 (2020) 499-513. URL: http://ceur-ws.org/Vol-2608/paper38.pdf.

[15] S. W. Dietrich, Understanding Databases: Concepts and Practice, John Wiley & Sons, 2021.

[16] Mendling, Managing structural and textual quality of business process models, International Symposium on Data-Driven Process Discovery and Analysis, Springer, Berlin, Heidelberg, 2012, pp. 100-111. doi: 10.1007/978-3-642-40919-6_6.

[17] Categorizing and Tagging Words. URL: https://www.nltk.org/book/ch05.html.

[18] PlantUML. URL: https://plantuml.com/.

[19] urllib.request - Extensible library for opening URLs. URL: https://docs.python.org/3/library/urllib.request.html.

[20] Beautiful Soup Documentation. URL: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

[21] re - Regular expression operations. URL: https://docs.python.org/3/library/re.html.

[22] Natural Language Toolkit. URL: https://www.nltk.org/.

[23] T-Mobile. URL: https://www.t-mobile.com/.

[24] Verizon. URL: https://www.verizon.com/.