Mathematical Model of Semantic Kernel of WEB Site

Sergey Orekhov, Henadii Malyhon and Tetiana Goncharenko
National Technical University "Kharkiv Polytechnic Institute", Kyrpychova str. 2, Kharkiv, 61002, Ukraine

Abstract
Our latest research from search engine optimization projects shows the effect of the semantic kernel of a website. It is a unique set of keywords that depends on time as well as on the current state of the search engine's vector space model. Therefore, the problem of mathematically modeling changes in the semantic kernel of a website over a period of a month to a year becomes urgent. The kernel is formed from three clusters of keywords: product (service), geography (location) and time (duration). The article proposes a model for representing the semantic kernel of a website. This view is supported by a description of the kernel based on the semantic web, followed by its presentation in the form of a resource description framework (RDF) schema. An algorithm for forming the kernel based on hierarchical clustering (agglomerative nesting) is also considered.

Keywords
Semantic kernel, search engine optimization, clustering, degree of proximity

1. Introduction
Since 2009, our team has completed more than thirty projects in the field of search engine optimization. The projects covered various fields: pharmacy, marketing research, jewelry production, construction materials, cosmetics, wood products, furniture production, and automobile spare parts. All the projects were united by one circumstance: they all belong to the field of e-commerce. In each case the main goal of the project was to increase the volume of sales of goods or services via the Internet. Some of the projects were successful, and some clearly ended with negative results.
Analyzing the results obtained, we came to the following conclusion. There is an interesting effect of so-called search engine learning [1]. Search engine optimization is, in essence, the process of training a search engine to respond to given user requests by showing our website in the first places in the list of answers. For this purpose, according to the vector space model, the texts on our website that describe a product or service must have the maximum number of external links (a large citation index). Such links can be divided into two groups: black (created formally, only for the citation index) and white (created by real users who, for example, have already tested the product or service). In our case, white links are of the greatest interest, but such links are generated based on semantics.
Let us ask ourselves a question: how does an ordinary user form such a link? He first learns about the product or service by using it and then writes a review of that use. In other words, feedback is generated. In this way, the search engine reacts to user feedback on our website, or rather, to its texts and images. This reaction is shaped by us, although indirectly. At the first stage, we read texts from websites. Analyzing these texts, we either agree with them or not, and our agreement is expressed as a positive response to the website. The process of such analysis is clearly short-lived: there are a lot of texts and websites, so we react to so-called annotations, that is, short sets of keywords that, in our opinion, describe the meaning of the text on the website as completely and as briefly as possible.
We will call such short descriptions of texts semantic kernels.
Thus, there is an urgent problem of mathematical modeling of the semantic kernel. It is also necessary to describe mathematically the algorithm for its formation from the text of the website.

2. Problem statement
The simplest approach to the semantic kernel is to represent the source text as a bag of terms [2] and then cluster it. As was established in [3], we should distinguish three clusters of terms: geography, product (service) and time. There are three main clustering approaches: probabilistic (statistical), graph and hierarchical algorithms [4-5].
Graph clustering algorithms represent the initial sample in the form of a graph, where the nodes are the sample objects themselves and the edges are pairwise distances between them. The main advantages of graph clustering algorithms are clarity and ease of implementation, as well as the ability to make various improvements based on simple geometric assumptions. How are the edges of the graph constructed, and what distinguishes one method from another? Most graph algorithms rely on a hypothesis about the number of clusters that should be obtained in the end. For example, the Minimum Spanning Tree algorithm sequentially connects each new sample point to the nearest point already connected to the others, and then the k-1 longest edges are removed at the end of the algorithm. The parameter k is in this case a hypothesis about the number of clusters. Since, in principle, it is impossible to know in advance how many clusters of terms there should be, advancing any hypothesis about their number is unacceptable and will only distort the clustering results [6].
Probabilistic (statistical) algorithms assume that each object is a random variable and that each cluster obeys some distribution law. However, since the form of the distribution law is unknown in advance, the operation of such an algorithm requires a hypothesis about the applicability of a certain distribution law, and an incorrectly chosen hypothesis can significantly distort the clustering results. Probabilistic algorithms are therefore also an ineffective approach to the problem of clustering terms: firstly, it is impossible to determine the distribution laws of terms; secondly, the number of terms is finite and rather small in comparison with the number of terms that could be used to describe the given triad (product, time and location); thirdly, probabilistic algorithms often also require assumptions about the number of clusters and the initial cluster centers, which leads to significant qualitative differences in the results of the algorithm.
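To make the drawback of the graph approach concrete, the following minimal sketch (an illustration only, not the method adopted in this work) implements the MST procedure described above; the toy points, the Euclidean distance matrix and the value of k are assumptions made for the example, and k has to be supplied as a hypothesis about the number of clusters.

```python
# A minimal sketch (not the approach adopted in the paper) of MST-based graph
# clustering: build a minimum spanning tree over pairwise distances, drop the
# k-1 longest edges, and read off connected components as clusters.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(distances: np.ndarray, k: int) -> np.ndarray:
    """Cluster objects given a symmetric pairwise distance matrix.

    k is the hypothesised number of clusters -- exactly the assumption
    the text argues cannot be justified for term clustering.
    """
    mst = minimum_spanning_tree(csr_matrix(distances)).toarray()
    edge_weights = mst[mst > 0]
    # Remove the k-1 heaviest edges of the spanning tree.
    if k > 1 and edge_weights.size >= k - 1:
        threshold = np.sort(edge_weights)[-(k - 1)]
        mst[mst >= threshold] = 0
    # The remaining connected components are the clusters.
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

# Example: five "terms" embedded in a toy 2-D space.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(mst_clusters(dist, k=3))  # e.g. [0 0 1 1 2]
```

The result depends directly on the chosen k, which is exactly the hypothesis that cannot be justified when clustering terms.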
Hierarchical (taxonomic) clustering algorithms form not a single partition of the sample into disjoint clusters, but a system of nested partitions. The result of such an algorithm is presented in the form of a dendrogram (Figure 1).

Figure 1: Dendrogram of the semantic kernel (from the root cluster down to empty clusters)

Among hierarchical clustering algorithms there are two main types, depending on the logic of building clusters: divisional and agglomerative [6]. Divisional (top-down) algorithms split the original sample into smaller and smaller clusters. Agglomerative (bottom-up) algorithms combine objects into larger and larger clusters. Having evaluated the advantages and disadvantages of the different approaches, in this study we propose to apply agglomerative hierarchical algorithms.
Let each object (term) be considered a separate cluster at the initial moment of the algorithm. We then start the merging process, where at each iteration a new cluster $A_i \cup A_j$ is formed instead of the pair of closest clusters $A_i$ and $A_j$. The quality of the algorithm lies:
• in the method of determining the distance between $A_i$ and $A_j$;
• in the method of determining the distance between the new cluster $A_i \cup A_j$ formed at the previous step and some cluster $A_f$ to be merged with it.
At the same time, one should not forget the so-called duplication situation: the same terms, seen from the perspective of different users, can describe the same triad of product, time and geography. In addition, when describing a triad the user relies on a plot, that is, a certain scenario within which he applies the given product or service. Consequently, when constructing an algorithm for identifying the semantic kernel, it is necessary to take into account the duplication of terms as well as a possible storyline in which these terms take part. To automatically highlight duplicate and plot terms, it is necessary to formulate the corresponding metric criteria for the proximity of terms.
We will formulate a vector interpretation of an object (term) based on data obtained while processing web content, and introduce metric criteria for the proximity of two terms and an algorithm for combining terms into one of the clusters: geography, product and time. Each term is assigned to one of three categories reflecting its type. Let us denote the set of terms that form the dictionaries of the three categories as $V = \{v_j\},\ j = \overline{1,J}$, where $v_j$ is a term of the dictionary. For the set of three categories we denote the corresponding dictionaries $V^k = \{v_j^k\},\ k = \overline{1,3}$. At the same time, the same terms cannot belong to different category dictionaries, $V^k \cap V^r = \emptyset,\ k \neq r,\ r, k \in \{1,2,3\}$, which makes it possible to distinguish a category unambiguously solely on the basis of the dictionaries of terms. For example, $V^P = \{wood, metal, auto, \ldots\}$, $V^T = \{month, day, week, \ldots\}$ and $V^G = \{latitude, longitude, \ldots\}$. The required identification accuracy is achieved because the number of categories is strictly limited: terms that fall into one category are unlikely to fall into any other.
Let the semantic kernel of web content be based on these dictionaries; that is, the kernel will include only terms from these three clusters (dictionaries). Thus, our task is to build the vector of the semantic kernel by clustering the existing terms of the web content. An analysis based on this approach allows, in a first approximation, to evaluate the kernel of a website; however, the allocation of a unique kernel is only possible with the complete processing of a group of websites on a given topic.
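As an illustration of the agglomerative (bottom-up) scheme on term vectors, the following minimal sketch uses SciPy's hierarchical clustering; the toy terms, their co-occurrence vectors and the distance threshold for cutting the dendrogram are assumptions made for the example rather than part of the proposed model.

```python
# A minimal sketch of agglomerative nesting (AGNES) over term vectors.
# The toy terms, their co-occurrence profiles and the cut threshold are
# illustrative assumptions; in the model the vectors come from web content.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One row per term: a toy co-occurrence profile of each term in the content.
terms = ["wood", "metal", "month", "week", "latitude", "longitude"]
vectors = np.array([
    [3, 1, 0, 0],   # wood
    [2, 1, 0, 0],   # metal
    [0, 0, 4, 1],   # month
    [0, 0, 3, 1],   # week
    [0, 0, 0, 5],   # latitude
    [0, 0, 1, 5],   # longitude
], dtype=float)

# Each term starts as its own cluster; the closest pair A_i, A_j is merged at
# every step ("average" defines the distance to the merged cluster A_i ∪ A_j).
Z = linkage(vectors, method="average", metric="euclidean")

# Cut the dendrogram at an assumed distance threshold to obtain the clusters
# that should correspond to the product, time and geography dictionaries.
labels = fcluster(Z, t=2.5, criterion="distance")
for term, label in zip(terms, labels):
    print(label, term)
```

Cutting the dendrogram at the assumed threshold yields three groups of terms, which in the model would correspond to the product, time and geography dictionaries.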
The problem of processing the flow of websites is that most of the data is non-metric, and therefore it is necessary to switch from a non-metric representation of terms to a metric one, in which a universal criterion for the proximity of two terms of one category can be described.
In our approach, the processing of web content involves the formation of three vectors: a vector of terms $\vec{n}$, reflecting information about what product is offered, where it is located and at what time; a vector $\vec{n}'$ of accompanying terms (synonyms and service terms) recovered from the same web content; and a vector $\vec{n}''$ that refines the vector $\vec{n}$ on the basis of knowledge about the subject area. Together, these vectors describe a semantic kernel.
The vector $\vec{n}$ always has a fixed structure, $\vec{n} = (d, p, G, L, M)$, where $d$ contains the date and time (when), $p \in P$ is the product (what), $g \in G$ is the product geography (where), $l \in L$ belongs to the set of sources of the web content, and $m \in M$ belongs to the set of related products (services) mentioned in the web content (with whom). All the sets $P, G, L, M$ contain terms found in the web content of the given website. Each element of such a set takes the value zero if the term is absent from the web content and one if it is present. The element $d$ stores the date and time of web content creation.
The vector $\vec{n}$ describes one variant of the semantic kernel. The vector $\vec{n}'$ contains the synonym terms that possibly exist in the web content, together with the service terms that only accompany the main meaning of the web content. In addition, we need the third vector $\vec{n}''$, which refines the description using knowledge collected about the subject area.
We will assume that the web content of a web resource contains several semantic kernels, possibly close in meaning. Our task is to assemble several kernels from the terms of the bag of words by clustering and to check them for duplication. In addition, as a rule, the content management system of a web resource can contain several versions of the kernels, possibly also duplicating each other. The search server likewise stores several versions of the semantic kernels of this web resource in its database. The third place where a variant of the semantic kernel of a web resource may be contained is a social network, or rather an account in a social network or marketplace. Thus, by clustering we form a set of kernels, and it is necessary to establish whether they are close and by how much. This affinity is also important because it echoes the idea of linking to a document in a search engine. Therefore, we need a metric indicating the similarity or duplication of kernels both on the web resource and on the Internet as a whole.
The first parameter for evaluation is probably the date and time of publication of the web content. Consider a metric based on this assumption. To merge two kernels as duplicates into one dictionary, the primary condition is the coincidence of the categories of their terms. The date and time of the appearance of a term in the web content should differ within a threshold value $d_{threshold}$ reflecting the dynamics of the market. The degree of proximity for the rest of the values is calculated according to formula (1):

$$F_1(\vec{n}, \bar{n}) = \Big(1 + \sum_{j=0}^{J}(p_j - \bar{p}_j)^2\Big)\Big(1 + \sum_{j=0}^{J}(m_j - \bar{m}_j)^2\Big)\Big(1 + \sum_{j=0}^{J}(l_j - \bar{l}_j)^2\Big)\Big(1 + \sum_{j=0}^{J}(g_j - \bar{g}_j)^2\Big) \to \min, \qquad (1)$$

where the bar here and below marks the corresponding vector of the second kernel variant being compared, and $\vec{M}, \bar{M}$ are the vectors of related products with coordinates $m_j$ and $\bar{m}_j$.
Each element takes a value $m_j \in \{0,1\}$ depending on whether the corresponding term is present in the vector $\vec{n}$; $\vec{L}, \bar{L}$ are the vectors of web content sources taken from the two kernel variants respectively; $\vec{G}, \bar{G}$ are the vectors of geography, where the geography values of the terms should be reduced to a common dimension.
Obviously, in most cases the vectors differ, and complete coincidence, when the criterion equals one, is not observed. This problem can be solved either by obtaining expert estimates of the thresholds for combining terms, or by forming the threshold value analytically. The following analytical method for calculating the error coefficient $\alpha$ is proposed, based on the vector $\vec{n}'$ of other terms. The set of such terms includes those that appear in the web content but do not directly describe the product or service; they only clarify its description, in particular by indicating secondary signs, characteristics, properties, etc.
For the vectors $\vec{n}'$ of other terms we form dictionaries of the synonymous terms mentioned on the given site and, on their basis, the indicator vectors $\vec{O}$ and $\bar{O}$, such that

$$O_i = \begin{cases} 0, & o_i \notin \vec{n}, \\ 1, & o_i \in \vec{n}, \end{cases}$$

whence the criterion for the proximity of the vectors takes the form (2):

$$F_{tertiary} = \sum_{j=0}^{J} (O_j - \bar{O}_j)^2 \to \min. \qquad (2)$$

Based on the coefficient $F_{tertiary}$, the value $\alpha$ can be obtained (3):

$$\alpha(\vec{n}_U) = \frac{\lambda(\vec{n}_U)}{F_{tertiary}}, \qquad (3)$$

where $\vec{n}_U = \vec{n} \cup \bar{n}$ and $\lambda(\vec{n}_U)$ denotes the length of this vector. The meaning of the coefficient $\alpha$ is the following: the greater the degree of similarity according to the tertiary criterion, the more likely it is that the terms describe the same product or service, and the less stringency is required for similarity according to the primary and secondary criteria.
However, these two criteria alone are not enough to assess the degree of similarity of terms, since quite often the information about the product is incomplete. For greater accuracy we introduce the vector $\vec{n}''$ and a closeness criterion based on it, in order to solve the problem of incompleteness of product data in web content. The vector $\vec{n}''$ is formed on the basis of $\vec{n}$ and the collected information about the subject area of the industry; in particular, it takes into account which counterparties operate in which markets and with which goods. For the reconstructed vectors, a criterion similar to the primary one can be calculated (4):

$$F_2(\vec{n}'', \bar{n}'') = \Big(1 + \sum_{j=0}^{J}(m_j'' - \bar{m}_j'')^2\Big)\Big(1 + \sum_{j=0}^{J}(l_j'' - \bar{l}_j'')^2\Big)\Big(1 + \sum_{j=0}^{J}(g_j'' - \bar{g}_j'')^2\Big) \to \min. \qquad (4)$$

Combining criteria (1)-(4) into one, we obtain a metric criterion for the proximity of terms (5):

$$F = \begin{cases} p = \bar{p}, \quad |d - \bar{d}| \le d_{threshold}, \\ F_1 \le \alpha, \quad F_2 \le \alpha F_1. \end{cases} \qquad (5)$$

Criterion (5) is interpreted as follows:
• $F_1 \le \alpha$ means that the terms should be similar according to the primary criterion; the larger the vector $\vec{n}_U$ is, the weaker the similarity that is required of them according to the tertiary criterion;
• $F_2 \le \alpha F_1$ means that the similarity for the reconstructed vector $\vec{n}''$ should be greater (due to its larger number of coordinates) than for the original vector $\vec{n}$, within the error $\alpha$.
The algorithm for identifying a cluster of terms describing a unique kernel in the web stream is shown in Figure 2 [7]. As a result, based on the three vectors that describe the semantic kernel and the introduced proximity metrics, it is possible to obtain a single semantic kernel of the web resource as a whole.
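To make criteria (1)-(5) concrete, here is a minimal sketch in Python; the KernelVector structure, the binary coordinate encoding, the reading of $\lambda(\vec{n}_U)$ as the Euclidean norm of the element-wise union, and the example date threshold are assumptions made for illustration, not part of the formal model.

```python
# A minimal sketch of the proximity criteria (1)-(5). The KernelVector fields
# mirror n = (d, p, G, L, M) plus the "other terms" indicator O and the
# reconstructed coordinates of n''. The binary encoding and the reading of
# lambda(n_U) as the Euclidean norm of the element-wise union are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
import numpy as np

@dataclass
class KernelVector:
    d: datetime        # date and time of content creation (when)
    P: np.ndarray      # binary product coordinates (what)
    M: np.ndarray      # binary related-product coordinates (with whom)
    L: np.ndarray      # binary content-source coordinates
    G: np.ndarray      # binary geography coordinates (where)
    O: np.ndarray      # binary indicator of synonym / service ("other") terms
    RM: np.ndarray     # reconstructed coordinates of n'' (subject-area knowledge)
    RL: np.ndarray
    RG: np.ndarray

def f1(a: KernelVector, b: KernelVector) -> float:
    """Primary criterion (1)."""
    return float((1 + np.sum((a.P - b.P) ** 2)) * (1 + np.sum((a.M - b.M) ** 2))
                 * (1 + np.sum((a.L - b.L) ** 2)) * (1 + np.sum((a.G - b.G) ** 2)))

def f_tertiary(a: KernelVector, b: KernelVector) -> float:
    """Tertiary criterion (2) over the 'other terms' indicator vectors."""
    return float(np.sum((a.O - b.O) ** 2))

def alpha(a: KernelVector, b: KernelVector) -> float:
    """Error coefficient (3): length of the united vector divided by F_tertiary."""
    united = np.concatenate([np.maximum(a.P, b.P), np.maximum(a.M, b.M),
                             np.maximum(a.L, b.L), np.maximum(a.G, b.G)])
    return float(np.linalg.norm(united)) / max(f_tertiary(a, b), 1e-9)

def f2(a: KernelVector, b: KernelVector) -> float:
    """Secondary criterion (4) over the reconstructed vectors n''."""
    return float((1 + np.sum((a.RM - b.RM) ** 2)) * (1 + np.sum((a.RL - b.RL) ** 2))
                 * (1 + np.sum((a.RG - b.RG) ** 2)))

def are_duplicates(a: KernelVector, b: KernelVector,
                   d_threshold: timedelta = timedelta(days=30)) -> bool:
    """Combined criterion (5): same product, close dates, F1 <= alpha, F2 <= alpha*F1."""
    if not np.array_equal(a.P, b.P) or abs(a.d - b.d) > d_threshold:
        return False
    al, primary = alpha(a, b), f1(a, b)
    return primary <= al and f2(a, b) <= al * primary
```

With this reading, a smaller $F_{tertiary}$ (closer "other" terms) yields a larger $\alpha$ and therefore a looser threshold for the primary criterion, which matches the interpretation of $\alpha$ given above.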
However, it should be understood that the kernel is formed at a specific date and time; in addition, there is a retrospective of kernels. Let us first consider how the algorithm for constructing the semantic kernel of a web resource functions, taking into account that the web content itself may contain several kernel variants. Moreover, if the content management system contains several variants of the web content, the number of kernel variants increases accordingly.

3. Proposed algorithm
For each term (Figure 2), a set of potential duplicates is formed among the terms belonging to the same category (product, time or geography). After that, within one set of duplicates, the terms are checked pairwise for similarity. If two terms are not similar, the compared term is excluded from the list of potential duplicates. Otherwise, the vectors of the two terms are combined into a new vector based on the following rule: if two terms describe one product or service and differ only slightly, their vectors are combined into a more accurate description of the product or service. After one term has been considered within its set of duplicates, and if a cluster has been selected, all vectors of the terms included in the cluster are replaced by the cluster vector, which eliminates the formation of duplicates. Highlighting the terms of a story differs from searching for a unique term among duplicates only by the rule for checking the time of appearance of a term in the web content: the dates should be consistent within the same margin of error.

Figure 2: Algorithm for merging duplicate terms within a category (select a term, collect potential duplicates, compute the coefficient α and the first and second criteria, and either exclude the candidate or combine the vectors until all duplicates are excluded)

At the same time, the question of highlighting story chains in the web content of a certain industry remains open: one important event may give rise to a number of less important derivative events. To form a semantic kernel, it is easier to analyze the entire set of terms than to single out a weighty primary source. The accuracy of the approach largely depends on the quality of the primary processing of web content: the quality of the syntactic models [3] and dictionaries. Thus, for an error-free combination of terms in a category, it is necessary to use the developed algorithm and the criterion for assessing the degree of proximity.

4. Software designing
Based on the requirements (Figure 2), we proceed to design the software, taking into account that there are two main roles, user and administrator, which share the functionality. Using UML [7], the following use case diagram was designed (Figure 3).

Figure 3: Use cases (scanning web content with normalization, stop-word removal and lemmatization; building the bag of terms; forming the semantic kernel with a table view, keyword sorting and an RDF view; reporting)

The main idea is to represent the final result of the algorithm execution in the form of an RDF schema. It is a handy tool that includes the basic components of the vector $\vec{n}$. The operating principle of the software is shown in the diagrams in Figures 4 and 5.
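As a hypothetical illustration of such an RDF view (the namespace, the property names and the sample values are assumptions, not the schema actually produced by the software), one kernel vector $\vec{n}$ could be serialized with rdflib as follows.

```python
# A hypothetical sketch of serializing one semantic-kernel vector n = (d, p, G, L, M)
# as RDF/XML. The namespace, property names and sample values are illustrative
# assumptions only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SK = Namespace("http://example.org/semantic-kernel#")  # assumed namespace

g = Graph()
g.bind("sk", SK)

kernel = URIRef("http://example.org/kernels/celestialtiming-2021-05")
g.add((kernel, RDF.type, SK.SemanticKernel))
g.add((kernel, SK.created, Literal("2021-05-01T00:00:00", datatype=XSD.dateTime)))  # d
g.add((kernel, SK.product, Literal("psychological portrait")))                      # p
g.add((kernel, SK.geography, Literal("worldwide")))                                 # G
g.add((kernel, SK.source, URIRef("https://celestialtiming.com")))                   # L
g.add((kernel, SK.relatedProduct, Literal("astrology school")))                     # M

# Save the kernel as an RDF/XML file, as in the "Saving XML file" step of Figure 4.
g.serialize(destination="kernel.rdf", format="xml")
```

rdflib is used here only for brevity; the implementation planned in this work targets the JavaScript language and the WordPos platform (see Section 6).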
As Figures 4 and 5 show, the software takes the web content of a web resource as its initial information. Next, the primary processing of the web texts is performed (lemmatization and normalization). Then our clustering algorithm is applied and, as a result, an XML file containing the RDF schema is formed.

5. Results
Let us look at the potential results of applying the proposed algorithm using the example of web content on the topic of astrology and psychology. Figure 6 shows an example of such content, followed by the extraction of a vocabulary of terms based on the frequency of their occurrence in the text. The web content of this site has changed three times over the last ten years. As can be seen from the Google Analytics data, each such change was accompanied by a surge in user activity. In the example (Figure 7), the semantic kernel was analyzed at the stage of its first change. The main task of the kernel was to fix in the minds of potential users of the site, and of the Internet in general, that the logo of the CelestialTiming site [8] is associated with astrology and psychology. Subsequent changes to the kernel were aimed at the next two stages of the marketing of this web project. The first stage was to promote the services of the website, namely the construction of a psychological portrait based on the user's personal data. The second stage was to promote a new service, a school of psychology and astrology. In accordance with these stages, the semantic kernel of the website was also changed.

Figure 4: Activity diagram (enter web content, process it into a list of keywords, form the semantic kernel, build the table or RDF schema, save the XML file)
Figure 5: Final web form

Unfortunately, the creators of the web project made several mistakes. The first is that the kernels must be linked to each other when they are changed. The second is that the kernel should be changed not only on the website itself but also on the friendly links pointing to this kernel. The third is that each kernel has its own life cycle, different from the others; this can be seen, for example, in Figure 7. The fourth is that, because of their aging effect, the kernels need to be changed more often than the creators of this project change them. The fifth is that all kernel changes must fit within the framework of a single marketing strategy. All these comments were passed on to the developers of the web project.

Figure 6: Testing data
Figure 7: Google Analytics data (the three changes of the web content are marked)

6. Future work
Subsequent research on this topic includes:
• development of an alternative approach to the description of the semantic kernel. A promising idea is to represent the kernel in matrix form using the principles of permutations based on expert assessments of the relationships between terms. In this case the matrix contains the same elements as the search engine matrix, so a vector space model can be used to determine the proximity of kernels. In addition, it is promising to use a genetic algorithm to obtain an optimal semantic kernel from a set of existing or prospective ones;
• development and testing of the software in the JavaScript language on the WordPos platform [9]; this software should be designed as a component that can subsequently be integrated into existing content management systems such as WordPress and OpenCart;
• a promising area of research is also the analysis of profiles in social networks in order to identify semantic kernels and, on their basis, to search for potential buyers of a product or service.

7. References
[1] S. Orekhov, M. Godlevsky, O. Orekhova, Theoretical fundamentals of search engine optimization based on machine learning, in: Proceedings of the 13th International Conference on ICT in Education, Research and Industrial Applications. Integration, Harmonization and Knowledge Transfer, ICTERI 2017, CEUR-WS, Volume 1844, 2017, pp. 23-32.
[2] Y. Goldberg, Neural Network Methods in Natural Language Processing (Synthesis Lectures on Human Language Technologies), Morgan & Claypool, USA, 2017.
[3] S. Orekhov, H. Malyhon, T. Goncharenko, Using Internet News Flows as Marketing Data Component, in: Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Systems. Volume 1: Main Conference, COLINS 2020, CEUR-WS, Volume 2604, 2020, pp. 358-373.
[4] M. Berry, Survey of Text Mining: Clustering, Classification and Retrieval, Springer, USA, 2004.
[5] E. Krupka, N. Tishby, Generalization from Observed to Unobserved Features by Clustering, Journal of Machine Learning Research, Volume 9, 2008, pp. 339-370.
[6] A. Jain, M. Murty, P. Flynn, Data clustering: A review, ACM Computing Surveys, Volume 31, No. 3, 1999, pp. 264-323.
[7] B. Rumpe, Agile Modeling with UML, Springer, Germany, 2017.
[8] Web content source: Psychological self portrait is the path of self discovery based on celestial timing, Celestialtiming.com, 2021.
[9] J. Lengstorf, K. Wald, Pro PHP and jQuery, Apress, USA, 2016.