Method for Synthesizing the Semantic Kernel of Web Content Sergey Orekhov, Henadii Malyhon National Technical University “Kharkiv Polytechnic Institute”, Kyrpychova str. 2, Kharkiv, 61002, Ukraine Abstract The purpose of forming the semantic kernel of web content is to increase the efficiency of virtual promotion of goods or services [1-2]. It is a message in the virtual promotion marketing channel. To operate with the semantic kernel, a set of research tasks is solved, the first of which is the problem of its synthesis. It consists in the formation of the message itself, adapted to the structure and properties of the virtual promotion channel. The paper proposes the formulation of this problem, the method and algorithm for its solution. The scientific novelty of the work is three facts. This problem is considered for the first time. Also for the first time, a metric for estimating the semantic kernel based on the C-value index is proposed. Then solving this problem allows us to analyze the aging effect of the semantic kernel, which was recently discovered. The proposed algorithm includes two cycles. The first cycle searches for candidates for the semantic kernel among phrases. The second one adds new words to the set of candidates. The final stage compares the composition of the set of candidates with the search queries that exist on the Internet today. Keywords 1 Semantic kernel, C-value, UML, HTML, Text mining 1. Introduction The semantic kernel is a message in the marketing channel of virtual promotion [1-2]. Then the synthesis of the semantic kernel is to solve the problem of presenting knowledge about the product in the form of a brief description. This description must be run in the virtual promotion channel. Given the nature of virtual promotion, you can determine the following conditions that must be met when forming a semantic kernel. First, virtual promotion is based on the maximum use of Internet technologies. The latter require the presentation of any information or data in the form of HTML (XML) tags, JSON constructs or RDF schemas. Sometimes we can find information in the form of program code in Javascript, PHP, Python and others. But in this paper we will start from HTML code as the most universal option using the Javascript programming language and relevant frameworks. Second, from the point of view of search engine optimization, the semantic kernel is a database of keywords, word forms and morphological forms that most accurately characterize the type of activity, product or service that promotes a given web resource. Therefore, the semantic kernel should be considered as a set of key word forms presented in HTML format. It is likely that these keywords need to be highlighted with special HTML tags [3]. Third, search engines also store various word forms in their database and have data on what and how many word forms users request on the Internet. That is, we must take into account in the semantic kernel of the most popular word forms according to the search engine version. Fourth, the number of words in the semantic kernel is limited in practice. Users are more likely to enter short search queries that match some semantic kernel. We will assume, using the conditions of COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland EMAIL: sergey.v.orekhov@gmail.com (S. Orekhov); gmalygon@gmail.com (H. Malyhon) ORCID: 0000-0002-5040-5861 (S. Orekhov); 0000-0001-5448-2488 (H. Malyhon) ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) the method of analysis of hierarchies, that the number of keywords is limited by the power of a small sample, i.e. ten positions per kernel. In general, the task of presenting knowledge is formulated as follows: to formalize some subject area of knowledge using a conceptual scheme. Typically, such a scheme includes a data structure that brings together all relevant classes of objects and the relationships between them, as well as the rules (theorems and constraints) that exist in a given subject area. We will assume that we have as a data structure - a semantic network, where the vertices are keywords. Such keywords can belong to one of three classes: concept, object or action, as suggested in [4]. Then in the paper it is suggested to consider that the semantic kernel is a semantic network of propositional type [5]. Keywords such as concept and object are considered network nodes. These two classes form the rules that describe the knowledge of the product in the format: what, when, where [6]. That is, what we sell when the product is available and where you can buy it. In the same form it is possible to make out and the description of need of the buyer. Using these assumptions, we build a mathematical model of the semantic kernel as a model of the text [7-8] about the product or need. 2. Problem statement As input we have a text document D, which can be written in any language and presented as an HTML construct or plain text: D = {s1 ,..., si ,..., sn } , (1) where si = {wi1 , sepi1 ..., sepib  , wim , sepi1..., sepik } - it is a sentence that is part of document D. We believe that there will be n sentences in the document.  and sep ik . The number of words The i-th sentence consists of words wij and delimiters sepib  and sep ik . The first set sepib in a sentence will be m. We have two sets of delimiters sepib combines the characters that separate the words in the sentence, and the set sep ik is the characters at the end of the sentence. In our case, we will assume that the set of characters sep ik is unimportant, so we will not take it into account in the future. And the set of characters sepib  will be transformed into a set of connections between nodes of the semantic network [5]. As shown in [4], we have three types of  = {isa , apart , typeof } . connections: "is", "is a part", "is a type". Then the set sepib Each word wij has its own morphological paradigm, or rather token. According to [7-8], we consider that a keyword unites all forms of one word and its different meanings according to the dictionary, for example, the Ukrainian language. It makes no sense to take into account the part of speech for each word in our case, so we will further consider the concept of the lemma of the word wf ij . Lemma is a normal or infinitive form of a word recorded in a dictionary: wf nf = {w, case, n} , wf nf  Wf , (2) where n is the singular and case is the nominative case of the word. Then let there be a function of normalization or lemmatization of words: DNF : Wf → Wf nf . Thus, the semantic kernel will include words (2) from document D, which were lemmatized from sentences (1). Let's move on to identifying candidates (words) that we will include in the semantic kernel. Candidate is a word or phrase that meets the criteria and is potentially a term of a specific subject area, or rather describes a product or need that it covers. Let P be a set of candidates for the semantic kernel. The elements of this set are words wi and phrases p j : P = {w1 ,..., wo , p1 ,..., pr } , where o is the number of candidates for words, and r is the number of candidates for phrases. Next, define the structure of the phrase: p j = {w j1 ,..., w jc c = 1, z} , where j is the number of the phrase, j = 1, r , but c is the number of the word in the phrase p j , z is the maximum number of words in the phrase. It usually is equal to ten, that is z  10 . According to empirical data, it is advisable to consider. Although the most likely limitation is z  5 . In the paper based on typical models of texts without taking into account the properties of the language in which the document D is composed, we describe the construction of the semantic kernel synthesis function as an algorithm that includes two stages: DP : D → P . (3) In the first stage using function (3) the set P of candidates for the kernel from document D is formed. In the second stage the final list of words or phrases inside the kernel is formed by filtering and ranking. Since the power of the set (1) can be any, we are dealing with large text data. It is advisable to use soft calculations or mathematical statistics to process them [9-11]. In the paper we propose to use the statistical metric M in order to compare the element of future kernel: Fm : P → M , M  R, M  0 . Having all the prerequisites, we can formally formulate the problem of semantic kernel synthesis. Given: TD = {d1 ,..., d DOC } - a text body that describes the product and (or) the need that it covers. DOC is the number of web content documents submitted for kernel synthesis. It is necessary to find a set of pairs: T = {( p, m)V } . Each pair is a word or phrase and the value of a statistical metric. V is the number of words or phrases in the TD text box. The set T must be ranked in descending order of the metric M. Thus, the closer to the beginning in the final set T is the candidate p, the more likely it is that it is part of the semantic kernel. This fact does not contradict the similar approach outlined in [4]. Consider a metric that allows us to evaluate a word or phrase. We will take the C-value method as a basis [7-8]. This method is based on the use of such statistical metrics as the frequency of phrases in the text. But the C-value metric also takes into account the length and nesting of the candidate. Nested terms are a concept that is in the input text separately or as part of other concepts [7-8]. Then this metric is calculated by the following formula: log 2 a  f (a ), uninvested  M = 1 log 2 a ( f (a ) − P(T )  f (b)), invested . (4)  a bTa where a - candidate to the kernel, a - length or number of words, f (a ) - frequency of appearance of the candidate a, Ta - set of candidates that include the word a, P(Ta ) - number of candidates to Ta ,  f (b) - sum of frequencies of appearance of candidates b  Ta that include a. That is, a is a nested candidate in the phrase b. The formula makes it possible to draw the following conclusion. The longer the term a, the greater the value of its metrics. This has the following meaning. Longer terms in the text will be less than short ones. Accordingly, the probability of occurrence of the term b in the number of f positions is less than the probability of the occurrence of the term a in the number of f times, provided that a  b . For the same reason, it can be concluded that the phrase b is more likely to be a term than a. In addition, this method is designed with the fact that the higher the number of terms Ta that include a, the greater the degree of independence a. Thus, the solution of the problem of semantic kernel synthesis describes the construction of word- metric pairs. We present a method of obtaining such pairs. 3. Method We verbally describe the method of semantic kernel synthesis as an improvement of the algorithm, which was proposed in [4]. The TD text body is supplied as input. Within the developed method there are two cycles. The first cycle operates with sentences as phrases. The second cycle works with selected sentences and selects words from them. Consider the first cycle and its algorithm. Step 1. We take for processing the i-th document of the TD case. From all sentences of the  , and delimiters of document Di , i = 1, DOC we define words wij and delimiters of type sepib type sep ik and we delete sep ik completely. Step 2. Construct table 1, which accumulates a list of candidates for the semantic kernel, i.e. the set P is formed. As candidates, we enter all the sentences Di , i = 1, DOC . Table 1 Candidates-phrases for entry into the semantic kernel of web content № Candidate (phrase) Frequency С-value Candidate type - - - - - For each candidate, the frequency of occurrence in the text box is estimated. Next, go to step 1 to process the next document Di +1 . Follow these two steps to complete Table 1 fully with all the sentences available in the text box. Step 3. Perform the calculation of statistical metrics (4) for each candidate in table 1. Next, we sort the rows of the table in descending order of the C-value metric. Taking into account the conditions defined above for the text model, we choose PD the first terms as candidates for the semantic kernel. The power of the set PD is defined as the initial condition before the start of the first cycle. The paper proposes to set this value to ten. Step 4. For each candidate from table 1 set its type according to the concept of "4P". To do this, answer three questions: what, where, when. The "what" question describes the name of the product or service. The question "where" is the geography of a good or service, that is, any word that describes a place on a map. But the question of "when" is responsible for words that tell us about time or time intervals. In the simplest case, the semantic kernel includes three words: product name, place of sale and time interval. This completes the first cycle and we have the set PD , which consists only of phrases PD = { p1 ,..., pr } . Consider the second cycle. Its purpose is to add to the set PW individual words {w1 ,..., wo } that are part of the semantic kernel. We believe that P = PW  PD . The following steps are suggested. Step 1. We take for processing the i-th document of the TD case. From all sentences of the  , and delimiters document Di , i = 1, DOC , we define words wij and delimiters of type sepib of type sep ik and we delete sep ik completely. Step 2. Construct table 2, which accumulates a list of candidates for the semantic kernel, i.e. the second part of the set P is formed - PW . As candidates, we enter all the words wij . Determine the frequency of occurrence of the word in the text corpus TD as a whole. Table 2 Candidates-words for entry into the semantic kernel of web content № Candidate (word) Frequency С-value Candidate type Ta value - - - - - Step 3. Perform the calculation of statistical metrics (4) for each candidate in table 2. Next, we sort the rows of the table in descending order of the C-value metric. Taking into account the conditions defined above for the text model, we choose the first terms PW as candidates for the semantic kernel. We also calculate the value Ta as the number of occurrences of words from set PW to set of phrases PD . And again we will re-sort table 2 on condition of increase in value Ta . Thus, we have two sets of candidates to enter the semantic kernel, completing the second cycle. The final stage. The resulting set P = PW  PD is lemmatized. The set of phrases PD is represented in the form of a semantic network according to the algorithm presented in [4]. Then to the semantic kernel we choose those phrases in which the maximum number of words of candidates from the set PW on one side and which are in the filter of the search engine in the first place. It is also advisable to use metric (4) to evaluate the findings of the search engine. It will also allow the formation of "candidate-metric" pairs for a more accurate choice of components of the semantic core. The resulting set PD will be the semantic kernel proved by a set PW and search engine web service. Consider the algorithmic representation of the method proposed in the dissertation 4. Algorithm Using the UML [13-15], an action diagram was prepared to describe the algorithm that implements the method of synthesis of the semantic kernel of the web content - Figure 1. The diagram shows two cycles of forming: first a subset of phrases and then a subset of words that are part of the semantic kernel. The question of typing words and phrases in automatic mode is open. Such a filter can be implemented, for example, using the approach described in [6], where the authors of this paper was directly involved. The next problem is with the search engine. All accumulated candidates from the set PD must be submitted to the web service input to calculate the metric value (4). To build a semantic network, the method described in [4] should be used. It allows us to generate a graph in automated mode. The final choice of the components of the semantic kernel is based on the values of the metric (4) given in Table 2, as well as on the basis of calculating the rules of the type "is a". The last indicator allows us to estimate the number of rules in which a word or phrase that is a candidate for entry into the semantic kernel. We believe that the more rules in which a word or phrase is involved, the higher its value for the semantic kernel [4]. 5. IT solution According to the classical theory of software systems design [16-17] at the first stage it is necessary to form a set of potential software components that will be part of the information system. To do this, we need to create software requirements for future software components. Start of the first cycle (sentence processing) Determine the value of DOC and power Pd We process the document Di Define sentences with words and delimiters i=i+1 Value DOC Is received? NO YES We build a set of candidates for phrases Pd Ranking of many phrases Type definition according to the 4P concept: product, place and time Typing of phrases Start of the second cycle (word processing) We process the document Di Define words and delimiters i=i+1 Value DOC Is received? NO YES We build a set of candidates for the words Pw Filter candidates by search engine Ranking of many words Building a semantic network of candidates Typing words Selection of candidates for the semantic kernel Lemmatization of the set P Figure 1: Activity diagram We take into account the action diagrams for algorithmic support of the proposed methods for solving the problem, shown in Figures 1. According to the SWEBOK concept [17], we have six quality factors that fix non-functional requirements. We will use the following approach in the analysis of quality factors (Figure 2). It is necessary to divide these factors into two groups according to two basic roles in the software development lifecycle: user and developer (Figure 2). The first group includes such factors as functionality, reliability and ease of use. The second group will include efficiency, maintainability and tolerability. It is impossible to fulfill all six factors, so it is necessary to focus on at least two main factors from one and the other group of factors. The most important for the potential user of this information technology are such factors as functionality and portability. To enhance the quality of the software, a description should be added according to the usability factor. Funcrionality Suitability Portability Analyzability Accuracy Installability Interoperability Co-existance Security Changeability Usability Understandability Learnability Operability Attractiveness User Developer Figure 2: Factor of software quality The functionality of the projected information technology is based on the idea of implementing a cycle of situational management, which is triggered by the synthesis of set of semantic kernels. Next, having the first semantic kernel, we can evaluate its effectiveness. According to this assessment, it is decided whether or not to change this semantic kernel in the promotion channel. If a change is required, the next kernel is selected from the set available or a new set of kernels is generated. The possibility of completing the cycle according to a given set of stop criteria is also analyzed. Thus, the choice of these three factors is due to the fact that information technology from the point of view of the end user should cover a maximum of actions (functions) to solve the problem of situational management. This technology must be effectively integrated into existing information environments both at the data level and at the software component level. Also, such integration should be clear, workable and convenient, i.e. easy to use. However, other factors for this case of software systems design are of little importance. For example, the efficiency factor is primarily unimportant from the point of view of time behavior, because the decision-making time is at least thirty days. The reliability of this technology is due to the reliability of the metric values of the semantic core, which does not depend on the requirements for the software itself. From the point of view of the maintainability factor, no further analysis of the technology is required to change it. Such technology either has an effective effect on the promotion of goods or not. That is, in the case of a negative effect, this technology is simply no longer used. The paper proposes to form the following software components: semantic kernel and semantic network – figures 3-4. All components are being designed according to the algorithm of problem solving. The semantic kernel component (Figure 3) is designed according to the MVC template [18-19]. The elements of this component are all components of the algorithm for the synthesis of the semantic kernel (Figure 1): sentence, words, delimiters and HTML content. Also it is necessary to present HTML content or semantic kernel by semantic network (Figure 4). The controller class is also used to start the synthesis procedure. Visualization of the synthesis result is carried out by methods of the KernelView class. Figure 3: Program package – semantic kernel The second component is needed to describe the classes for the software implementation of the method of presenting phrases from web content in the form of a semantic network. This makes it possible to evaluate the semantic kernels generated in terms of artificial intelligence methods, in particular in terms of the number of rules. In order to define program realization of algorithm the sequence diagram was proposed – figure 5. This diagram (figure 5) shows the order in which different class methods are called in software packages by the end user, but exactly according to the synthesis algorithm. The sequence of calls describes two cycles of synthesis of semantic nuclei, which can be implemented both separately and sequentially. The first cycle searches for semantic nuclei in the form of phrases, and the second cycle searches for candidate words. Experiments with the semantic cores of various websites have shown that it is advisable to strengthen the cycle of searching for phrases with a cycle of searching for individual words. And check the result with the help of search engine web services, such as Google API. 6. Results The problem of semantic kernel synthesis and the method of its solution considered in this section is a scientific novelty. At present, there are only verbal descriptions of this problem [3, 20-21], and especially methods of its solution [4]. A distinctive feature of the proposed approach is that the presented statement of the problem is based on the model of text, as well as individual stages, in particular typing, on syntactic models of grammar of the immediate components. Figure 4: Program package – semantic net The semantic network is also used to finally select the semantic kernel components to estimate the number of rules that connect the kernel elements and an additional filter based on the search server database. The last point is especially important because modern verbal approaches to the formation of the nucleus are based on this procedure [3]. But the very use of the principles of systems analysis allows us to operate with the term "synthesis" and solve this problem. In addition, the proposed algorithm immediately forms at least several variants of the semantic kernel, so when you change the kernel to a new one, you can choose the next option. This approach guarantees the application of the ideology of situational management of the semantic kernel, when the transition from one situation (from one kernel) to another. That is, the set PD includes possible variants of the semantic kernel. 7. Conclusions Among the obtained results of scientific novelty are the following: 1. For the first time the problem of synthesis of the semantic kernel of web content is formulated and the method of its solution from the standpoint of system analysis and methods of artificial intelligence is described. 2. For the first time the metrics for estimation of a semantic kernel on the basis of estimations of the text case of web content are offered. 3. The method of presenting the semantic kernel as a semantic network was further developed The direction of further research will be the implementation of this algorithm based on Javascript libraries [22]. In particular, it is planned to implement a semantic kernel synthesis algorithm based on the NodeS library [23]. Figure 5: Sequence diagram 8. References [1] S. Orekhov. Technology of virtual product promotion. Volume 3 of Computer Systems and Information Technologies, 2021, pp. 52-58. [2] S. Orekhov, H. Malyhon. Virtual promotion knowledge management technology. Bulletin of the National Technical University KhPI. Series: System analysis, control and information technology. Volume 1(3) of Collection of Scientific papers. NTU KPI, Kharkiv, 2020, pp.74–79. [3] J. Rowley. Understanding digital content marketing. Volume 24 (5-6) of Journal of Marketing Management, 2008, рр. 517–540. [4] M. Godlevsky, S. Orekhov, E. Orekhova. Theoretical Fundamentals of Search Engine Optimization Based on Machine Learning. Volume 1844 of CEUR Workshop Proceedings, 2017, pp. 23–32. [5] J. Dean. Big Data, Data Mining, and Machine Learning. Value Creation for Business Leaders and Practitioners, John Wiley & Sons Inc., USA, 2014. [6] S. Orekhov, H. Malyhon, T. Goncharenko, I. Liutenko. Using Internet News Flows as Marketing Data Component. Volume 2604 of CEUR Workshop Proceedings, 2020, pp. 358–373. [7] C.C. Aggarwal, C.X. Zhai. A survey of text classification algorithms. Mining Text Data, Springer Science-Business Media, LLC, 2012, pp. 163–222. [8] A. Khan, B. Baharudin, L. Lee, K. Khairullah. A Review of Machine Learning Algorithms for Text-Documents Classification. Volume 1(1) of Journal of advances in information technology, 2010, pp. 4–20. [9] B.R. Prasad, S. Agarwal. Comparative Study of Big Data Computing and Storage Tools: A Review. Volume 9(1) of International Journal of Database Theory and Application, 2016, pp. 45–66. [10] F. Abuqabita, R. Al-Omoush, J. Alwidian. A Comparative Study on Big Data Analytics Frameworks, Data Resources and Challenges. Volume 13(7) of Modern Applied Science, 2019, pp. 1–14. [11] S. Alkatheri, S.A. Abbas, A. S. Muazzam. A Comparative Study of Big Data Frameworks. Volume 17(1) of International Journal of Computer Science and Information Security, 2019, pp. 66–73. [12] J. Nereu, A. Almeida, J. Bernardino. Big Data Analytics: A Preliminary Study of Open Source Platforms, Proceedings of ICSOFT, 2017, pp. 435–440. [13] A. W. Scheer, M. Nuttgens. ARIS architecture and reference models for business process management. Business process management, Springer, Berlin, 2000. [14] IFEF0: Integration Definition for Function Modeling, National Institute of Standards and Technology, Gaithersburg, 1993. [15] B. Rumpe. Agile modeling with UML. Springer, Germany, 2017. [16] J. Ousterhout. A philosophy of software design. Yaknyam Press, USA, 2018. [17] P. Bourque, R. Fairley. SWEBOK. Guide to the Software Engineering Body of Knowledge. Version 3.0, IEEE Computer Society, 2019. [18] E. Gamma, R. Helm, R. Johnson, J. Vlissides. Design Patterns. Elements of Reusable Object- Oriented Software, Addison-Wesley, USA, 1995. [19] T. Winters, T. Manshreck, H. Wright. Software Engineering at Google. Google LLC, USA, 2020. [20] U. Sharma, K.S. Thakur. A Study on Digital Marketing and its Impact on Consumers Purchase. Volume 29(3) of International Journal of Advanced Science and Technology, 2020, pp. 13096 – 13110. [21] J. García, D. Lizcano, C. Ramos, N. Matos. Digital Marketing Actions That Achieve a Better Attraction and Loyalty of Users: An Analytical Study. Volume 11(130) of Future Internet, 2019, рр. 1-16. [22] C. Heilmann. Beginning JavaScript with DOM Scripting and Ajax: From Novice to Professional, Apress, USA, 2006. [23] ReactJS. Notes for Professionals. Stack Overflow, USA, 2019.