<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modelling of the Context Links Between the Natural Language Sentences</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>National Aviation University</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>In this paper the author proposes rules for searching for context links between natural language sentences. The approach uses a mathematical apparatus for the formal description of an electronic text document based on predicate logic, which, unlike formal grammars, makes it possible to structure textual information, starting from the lowest level of constructing logical relationships and ending with the text as a whole. These rules help to find context links based on semantic reiteration, the technique used when text is formed from natural language sentences. The author addresses a part of the problem of knowledge extraction from textual information. The study demonstrates different kinds of semantic reiteration: tautological reiteration, thematic reiteration and reiteration of various stylistic interpretations. Depending on the kind, the following replacements are made in the logic and linguistic models: relation into relation, subject into subject, object or subject-matter, and object into subject, object or subject-matter.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language</kwd>
        <kwd>logic and linguistic models</kwd>
        <kwd>text information</kwd>
        <kwd>content modelling</kwd>
        <kwd>context links</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
<p>Nowadays computational linguistics is one of the most essential tools for solving the
problem of knowledge extraction from textual information. This mechanism can
integrate computer modelling, mathematical methods and linguistic rules [1,3,4].
Machine learning, data science and natural language processing, as the most popular
spheres of knowledge extraction, are widespread in different areas of human activity.
According to IBM predictions, 59% of all data science and analytics job demand is in
finance and insurance, professional services, and IT. In Germany in 2019, the data
science and analytics job market needed 15% seniors. In the job market in Switzerland,
more than 60% of openings require middle-level specialists, while around a quarter of
offers look for seniors.</p>
<p>The world is seeing a surge in demand for data science services in various fields, with
market research estimating its potential growth in the near future. In India, 70% of
job postings in this sector are for data scientists with less than five years of work
experience. So data science is very popular nowadays, and there is a great number of
software products that try to model semantic links, for example ABBYY
Compreno, which is based on logical derivation and the syntactical rules of building
natural language sentences [2].</p>
<p>Adam Geitgey shows in his article “Natural language processing is fun” [5] how a
computer can understand human language. All steps of this process, such as sentence
segmentation, word tokenization, predicting parts of speech for each token, text
lemmatization, identifying stop words, dependency parsing, finding noun phrases,
named entity recognition and co-reference resolution, are described in detail.</p>
<p>Real experience is described in the article “How I used natural language
processing to extract context from news headlines” by Gunnvant Saini [8], who tried
to extract interesting information from a large number of news documents.</p>
<p>In the article “Affected Experiencers” [10] the authors proposed a formal analysis of the
affected experiencer construction, illustrated by examples in German,
Albanian, Japanese and Hebrew, and tried to recognize the semantics of natural language
sentences.</p>
<p>Many Ukrainian scientists develop areas of data science in different ways. For
instance, academician of the NAS of Ukraine Shyrokov V.A. created a linguistic corpus for
the Ukrainian natural language [9], Glybovets M.M. solves the problem by applying
various genetic algorithms [6], Lande D.V. used networks and respondents’
perception for searching data [7], etc.</p>
<p>However, none of these works deals with searching for context between different
parts of a textual electronic document. The unsolved part of the problem is finding
instruments and methods for modelling and extracting knowledge from textual
information.</p>
<p>The aim of the research is the formulation and modelling of rules for searching for the
context links between natural language sentences by means of logic and linguistic
models [12].</p>
    </sec>
    <sec id="sec-2">
      <title>2 Materials and Methods</title>
      <p>
        Every simple sentence of natural language that carries indivisible content can be
presented as a general logic and linguistic model [11]:
L^S_p = p(x, g, y, q, z, r, h), (1)
where S is a natural language sentence;
– p – predicate that indicates the content of the sentence: the relation that connects the
subject, the object and the subject-matter of relations in the sentence S;
– x – subject of the sentence S;
– g – characteristic of the subject of the sentence S;
– y – object of the sentence S;
– q – characteristic of the object of the sentence S;
– z – subject-matter of the sentence S;
– r – characteristic of the subject-matter of the sentence S;
– h – characteristic of the relation p in the sentence S.
      </p>
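      <p>Model (1) can be sketched as a small data structure. The sketch below is a
hypothetical Python illustration, not the author's implementation; the field names
follow the definitions above:</p>
      <p>
```python
from typing import NamedTuple

# One simple natural language sentence S as the tuple of model (1):
# p(x, g, y, q, z, r, h); 0 marks an absent component.
class LLModel(NamedTuple):
    p: str       # predicate: the relation of the sentence
    x: object    # subject
    g: object    # characteristic of the subject
    y: object    # object
    q: object    # characteristic of the object
    z: object    # subject-matter of the relation
    r: object    # characteristic of the subject-matter
    h: object    # characteristic of the relation p

# "Indians settled northern areas" as a model instance (sample values):
m = LLModel("settled", "indians", 0, "areas", "northern", 0, 0, 0)
```
      </p>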
      <p>
        Naturally, a complex sentence may combine different logic and linguistic models (1).
So, if we have a textual fragment that includes two sentences of natural language, they
can be represented by means of decomposition into the formal models (1):
L^S1 = ⋀_{ν=1..‖S1‖} L^S_p1^(ν)(h), (2)
L^S2 = ⋀_{μ=1..‖S2‖} L^S_p2^(μ)(h), (3)
where
– L^S_p1^(ν)(h) = p1^(ν)(x1^(ν), g1^(ν), y1^(ν), q1^(ν), z1^(ν), r1^(ν), h1^(ν)) – simple predicate
that describes the part of the sentence S1 that reflects indivisible content;
– L^S_p2^(μ)(h) = p2^(μ)(x2^(μ), g2^(μ), y2^(μ), q2^(μ), z2^(μ), r2^(μ), h2^(μ)) – simple predicate
that describes the part of the sentence S2 that reflects indivisible content;
– ν = 1..‖S1‖, where ‖S1‖ is the number of parts of the sentence S1 that have indivisible content;
– μ = 1..‖S2‖, where ‖S2‖ is the number of parts of the sentence S2 that have indivisible content.
      </p>
      <p>Searching for context links in a textual fragment means filling the array of
additional characteristics for each simple natural language sentence. This paper
offers to find context links between two complex sentences S1 and S2 according to
the rules of content formation in inflecting natural languages, which are called
semantic reiteration. Semantic reiteration is the technique of locating sentences with
similar meaning. The syntax structure of these sentences will be the same, but the
relation, subject or object may be replaced by synonyms. It is possible to implement
semantic reiteration in several different ways. In all situations the order of the
sentences does not matter. The simplest type of semantic reiteration is tautological
reiteration: the elementary arrangement of a link by using identical words or word
forms having one word root. If the relations in two natural language sentences are
similar, p1^(ν) ≈ p2^(μ), where p1^(ν) ∈ R_p and p2^(μ) ∈ R_p relate to the same range
of words with an identical root R_p ⊂ R from the set R, or p1^(ν) ∈ W_p and
p2^(μ) ∈ W_p relate to the same synonymic range W_p ⊂ W from the set W, then it is
necessary to make the replacement p2^(μ) = p1^(ν) in the logic and linguistic models
LS1 and LS2.</p>
      <sec id="sec-2-1">
        <title>After the replacement:</title>
        <p>L^S_p2^(μ)(h) = p1^(ν)(x2^(μ), g2^(μ), y2^(μ), q2^(μ), z2^(μ), r2^(μ), h2^(μ)).</p>
        <p>In this regard the elements of the arrays of characteristics for both sentences will
be: l_i^S1 = p1^(ν), i = 1..N1 and l_j^S2 = p1^(ν), j = 1..N2, where N1, N2 are the numbers
of all elements of the arrays of characteristics in the natural language sentences S1 and S2.</p>
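        <p>The tautological reiteration test can be sketched in Python. The root and
synonym dictionaries below are hypothetical stand-ins for the sets R and W (a real
system would use lexical resources):</p>
        <p>
```python
# Tautological reiteration: two relations are linked when they share a
# word root (range R_p of the set R) or fall into one synonymic range
# (W_p of the set W). Both dictionaries are hypothetical samples.
ROOTS = {"settled": "settl", "settling": "settl", "farmed": "farm"}
SYNONYMS = [{"settled", "inhabited"}, {"farmed", "cultivated"}]

def tautological_link(p1, p2):
    """True if p1 and p2 belong to one root range or synonym range."""
    r1, r2 = ROOTS.get(p1), ROOTS.get(p2)
    if r1 is not None and r1 == r2:
        return True
    return any(p1 in w and p2 in w for w in SYNONYMS)

def unify_predicate(model2, p1):
    """Make the replacement p2^(mu) = p1^(nu) in a model tuple."""
    return (p1,) + tuple(model2[1:])
```
        </p>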
        <p>For instance, the logic and linguistic models for the textual fragment “Indians who
settled in northern areas hunted and fished. Those who settled in the east and southwest
farmed” will be:</p>
        <p>
          LS1 = p1^(1)(x1^(1), 0, y1^(1), q1^(1), 0, 0, 0) &amp; p1^(2)(x1^(2), 0, 0, 0, 0, 0, 0) &amp;
p1^(3)(x1^(3), 0, 0, 0, 0, 0, 0).
        </p>
        <p>LS1 = settled (indians, 0, areas, northern, 0, 0, 0)
&amp; hunted (who, 0, 0, 0, 0, 0, 0) &amp; fished (who, 0, 0, 0, 0, 0, 0).</p>
        <p>
          LS2 = p2^(1)(x2^(1), 0, y2^(1), 0, 0, 0, 0) &amp; p2^(2)(x2^(2), 0, y2^(2), 0, 0, 0, 0) &amp;
p2^(3)(x2^(3), 0, 0, 0, 0, 0, 0).
LS2 = settled (who, 0, east, 0, 0, 0, 0) &amp; settled (who, 0, southwest, 0, 0, 0, 0) &amp;
farmed (who, 0, 0, 0, 0, 0, 0).
        </p>
        <p>
          According to the identity conditions of logic and linguistic models [12], the
anaphoric components of the second and third simple models are identified with the
components of the first one:
L^S_p1^(2)(h) = p1^(2)(x1^(1), p1^(1), y1^(1), q1^(1), 0, 0, 0) and
L^S_p1^(3)(h) = p1^(3)(x1^(1), p1^(1), y1^(1), q1^(1), 0, 0, 0).
        </p>
        <p>LS1 = settled (indians, 0, areas, northern, 0, 0, 0)
&amp; hunted (indians, settled, areas, northern, 0, 0, 0) &amp;
fished (indians, settled, areas, northern, 0, 0, 0).</p>
        <p>
          LS2 = p2^(1)(x2^(1), 0, y2^(1), 0, 0, 0, 0) &amp; p2^(2)(x2^(1), 0, y2^(2), 0, 0, 0, 0) &amp;
p2^(3)(x2^(1), p2^(1), y2^(1), 0, 0, 0, 0) &amp; p2^(3)(x2^(1), p2^(1), y2^(2), 0, 0, 0, 0).
        </p>
        <p>LS2 = settled (who, 0, east, 0, 0, 0, 0) &amp; settled (who, 0, southwest, 0, 0, 0, 0) &amp;
farmed (who, settled, east, 0, 0, 0, 0) &amp; farmed (who, settled, southwest, 0, 0, 0, 0).</p>
        <sec id="sec-2-1-1">
          <title>According to the rule:</title>
          <p>
            LS1 = p1^(1)(x1^(1), 0, y1^(1), q1^(1), 0, 0, 0) &amp; p1^(2)(x1^(1), p1^(1), y1^(1), q1^(1), 0, 0, 0) &amp;
p1^(3)(x1^(1), p1^(1), y1^(1), q1^(1), 0, 0, 0).
          </p>
          <p>
            LS2 = p1^(1)(x1^(1), 0, y2^(1), 0, 0, 0, 0) &amp; p1^(1)(x1^(1), 0, y2^(2), 0, 0, 0, 0) &amp;
p2^(3)(x1^(1), p1^(1), y2^(1), 0, 0, 0, 0) &amp; p2^(3)(x1^(1), p1^(1), y2^(2), 0, 0, 0, 0).
          </p>
          <p>LS1 = settled (indians, 0, areas, northern, 0, 0, 0)
&amp; hunted (indians, settled, areas, northern, 0, 0, 0) &amp;
fished (indians, settled, areas, northern, 0, 0, 0).</p>
          <p>LS2 = settled (indians, 0, east, 0, 0, 0, 0) &amp;
settled (indians, 0, southwest, 0, 0, 0, 0) &amp;
farmed (indians, settled, east, 0, 0, 0, 0) &amp;
farmed (indians, settled, southwest, 0, 0, 0, 0).</p>
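          <p>The worked example above can be replayed in code: the anaphoric subject “who”
in both sentences is replaced by the antecedent subject x1^(1) = “indians” (a sketch
over the concrete models from the text):</p>
          <p>
```python
# Replace the subject of a model tuple (p, x, g, y, q, z, r, h)
# when it matches the anaphoric word.
def replace_subject(model, old, new):
    p, x, g, y, q, z, r, h = model
    return (p, new if x == old else x, g, y, q, z, r, h)

ls2 = [
    ("settled", "who", 0, "east", 0, 0, 0, 0),
    ("settled", "who", 0, "southwest", 0, 0, 0, 0),
    ("farmed", "who", "settled", "east", 0, 0, 0, 0),
    ("farmed", "who", "settled", "southwest", 0, 0, 0, 0),
]
ls2 = [replace_subject(m, "who", "indians") for m in ls2]
```
          </p>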
          <p>
            The result of applying the rule is: l_i^S1 = p1^(1), l_{i+1}^S1 = z1^(1) and
l_j^S2 = p1^(1), l_{j+1}^S2 = z1^(1). And according to the next rule: l_{i+2}^S1 = x1^(1)
and l_{j+2}^S2 = x1^(1).
          </p>
          <p>Another type of semantic reiteration is thematic reiteration: the words in the
sentences show common lexical meaning and indicate different sizes, components or
parts of the elements of one situation. The words inside one thematic group make up
a paradigm which connects various parts of the text.</p>
          <p>If the subjects of the relations in two natural language sentences are similar,
x1^(ν) ≈ x2^(μ), where x1^(ν) ∈ R_x and x2^(μ) ∈ R_x relate to the same range of words
with an identical root R_x ⊂ R from the set R, or x1^(ν) ∈ W_x and x2^(μ) ∈ W_x relate
to the same synonymic range W_x ⊂ W from the set W, then it is necessary to make the
replacement x2^(μ) = x1^(ν) in the logic and linguistic models LS1 and LS2. After that
the elements of the arrays of characteristics will be l_i^S1 = x1^(ν), i = 1..N1 and
l_j^S2 = x1^(ν), j = 1..N2, where N1, N2 are the numbers of all elements of the arrays of
characteristics in the natural language sentences S1 and S2.</p>
          <p>If the subject, object or subject-matter of one sentence is identical to the subject or
subject-matter of the other natural language sentence, that is x1^(ν) ≈ z2^(μ), or
y1^(ν) ≈ z2^(μ), or z1^(ν) ≈ z2^(μ), or x1^(ν) ≈ x2^(μ), or y1^(ν) ≈ x2^(μ), or
z1^(ν) ≈ x2^(μ), where x1^(ν) ∈ R_x, y1^(ν) ∈ R_x, z1^(ν) ∈ R_x, x2^(μ) ∈ R_x and
z2^(μ) ∈ R_x relate to the same range of words with an identical root R_x ⊂ R from the
set R, or x1^(ν) ∈ W_x, y1^(ν) ∈ W_x, z1^(ν) ∈ W_x, x2^(μ) ∈ W_x and z2^(μ) ∈ W_x
relate to the same synonymic range W_x ⊂ W from the set W, then it is necessary to
make the replacement x2^(μ) = x1^(ν), or x2^(μ) = y1^(ν), or x2^(μ) = z1^(ν), or
z2^(μ) = x1^(ν), or z2^(μ) = y1^(ν), or z2^(μ) = z1^(ν) in the logic and linguistic models
LS1 and LS2, after which:</p>
          <p>L^S_p2^(μ)(h) = p2^(μ)(x1^(ν), g2^(μ), y2^(μ), q2^(μ), z2^(μ), r2^(μ), h2^(μ)) or
L^S_p2^(μ)(h) = p2^(μ)(x2^(μ), g2^(μ), y2^(μ), q2^(μ), x1^(ν), r2^(μ), h2^(μ)).</p>
          <p>The elements of the arrays of characteristics are filled accordingly, with i = 1..N1
and j = 1..N2, where N1, N2 are the numbers of all elements of the arrays of characteristics
in the natural language sentences S1 and S2.</p>
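          <p>The six identities enumerated by this rule can be checked mechanically. A sketch
(the root dictionary is a hypothetical stand-in for the set R):</p>
          <p>
```python
# The rule compares a component of S1 (subject x, object y or
# subject-matter z) with the subject or subject-matter of S2 and
# reports which replacement (target in S2, source in S1) to make.
ROOT = {"countries": "country"}  # hypothetical root dictionary

def same(a, b):
    return ROOT.get(a, a) == ROOT.get(b, b)

def matter_subject_links(m1, m2):
    pairs = []
    for target in ("x", "z"):            # components of S2 to replace
        for source in ("x", "y", "z"):   # components of S1 to keep
            if same(m1[source], m2[target]):
                pairs.append((target, source))
    return pairs

m1 = {"x": "USA", "y": "name", "z": "country"}
m2 = {"x": "it", "y": "one", "z": "countries"}
links = matter_subject_links(m1, m2)   # "country" ~ "countries"
```
          </p>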
          <p>For the textual fragment “The USA is the name of the country composed of 50
states joined in a federal republic. It is one of the world’s largest countries” the logic
and linguistic models are:</p>
          <p>
            LS1 = p1^(1)(x1^(1), 0, y1^(1), 0, z1^(1), 0, 0) &amp; p1^(2)(x1^(2), 0, y1^(2), q1^(2), 0, 0, 0) &amp;
p1^(3)(x1^(3), g1^(3), y1^(3), q1^(3), 0, 0, 0).
          </p>
          <p>LS1 = is (USA, 0, name, 0, country, 0, 0) &amp;
composed (country, 0, states, 50, 0, 0, 0) &amp;
joined (states, 50, republic, federal, 0, 0, 0).</p>
          <p>
            LS2 = p2^(1)(x2^(1), 0, y2^(1), 0, z2^(1), r2^(1), 0) &amp; p2^(2)(x2^(2), 0, y2^(2), 0, z2^(2), r2^(2), 0).
          </p>
          <p>LS2 = is (it, 0, one, 0, countries, world’s, 0) &amp;</p>
          <p>is (it, 0, one, 0, countries, largest, 0).</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>According to the previous two rules:</title>
          <p>
            LS1 = p1^(1)(x1^(1), 0, y1^(1), 0, z1^(1), 0, 0) &amp; p1^(2)(x1^(1), 0, y1^(2), q1^(2), 0, 0, 0) &amp;
p1^(3)(y1^(2), g1^(3), y1^(3), q1^(3), 0, 0, 0).
          </p>
          <p>LS1 = is (USA, 0, name, 0, country, 0, 0) &amp;
composed (USA, 0, states, 50, 0, 0, 0) &amp;
joined (states, 50, republic, federal, 0, 0, 0).</p>
          <p>
            LS2 = p2^(1)(x1^(1), 0, y2^(1), 0, z1^(1), r2^(1), 0) &amp; p2^(2)(x1^(1), 0, y2^(2), 0, z1^(1), r2^(2), 0).
          </p>
          <p>LS2 = is (USA, 0, one, 0, countries, world’s, 0) &amp;</p>
          <p>is (USA, 0, one, 0, countries, largest, 0).</p>
          <p>
            The outcome of applying the rules is the elements of the arrays of characteristics:
l_i^S1 = x1^(1), l_{i+1}^S1 = z1^(1) and l_j^S2 = x1^(1), l_{j+1}^S2 = z1^(1).
          </p>
          <p>It is also possible to use only synonyms for linking phrases and integrating the
context of different sentences: reiteration of various stylistic interpretations of one word.</p>
          <p>If the objects of the relations in two natural language sentences are similar,
y1^(ν) ≈ y2^(μ), where y1^(ν) ∈ R_y and y2^(μ) ∈ R_y relate to the same range of words
with an identical root R_y ⊂ R from the set R, or y1^(ν) ∈ W_y and y2^(μ) ∈ W_y relate
to the same synonymic range W_y ⊂ W from the set W, then it is necessary to make the
replacement y2^(μ) = y1^(ν) in the logic and linguistic models</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>LS1 and LS2, after which:</title>
        <p>L^S_p1^(ν)(h) = p1^(ν)(x1^(ν), g1^(ν), y1^(ν), q1^(ν), z1^(ν), r1^(ν), h1^(ν)),</p>
        <p>L^S_p2^(μ)(h) = p2^(μ)(x2^(μ), g2^(μ), y1^(ν), q2^(μ), z2^(μ), r2^(μ), h2^(μ)).</p>
        <p>The elements of the arrays of characteristics for both sentences will be:
l_i^S1 = y1^(ν), i = 1..N1 and l_j^S2 = y1^(ν), j = 1..N2, where N1, N2 are the numbers
of all elements of the arrays of characteristics in the natural language sentences S1 and S2.</p>
        <p>If the subject of the first sentence and the object of the second sentence of natural
language are identical, x1^(ν) ≈ y2^(μ), where x1^(ν) ∈ R_x and y2^(μ) ∈ R_x relate to the
same range of words with an identical root R_x ⊂ R from the set R, or x1^(ν) ∈ W_x and
y2^(μ) ∈ W_x relate to the same synonymic range W_x ⊂ W from the set W, then it is
necessary to make the replacement y2^(μ) = x1^(ν) in the logic and linguistic models
LS1 and LS2, after which:</p>
        <p>L^S_p1^(ν)(h) = p1^(ν)(x1^(ν), g1^(ν), y1^(ν), q1^(ν), z1^(ν), r1^(ν), h1^(ν)),</p>
        <p>L^S_p2^(μ)(h) = p2^(μ)(x2^(μ), g2^(μ), x1^(ν), q2^(μ), z2^(μ), r2^(μ), h2^(μ)).</p>
        <p>The elements of the arrays of characteristics for both sentences will be:
l_i^S1 = x1^(ν), i = 1..N1 and l_j^S2 = x1^(ν), j = 1..N2, where N1, N2 are the numbers
of all elements of the arrays of characteristics in the natural language sentences S1 and S2.</p>
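        <p>Both rules of this section replace the object of the second sentence with a
component of the first one when the words fall into one synonymic range W. A sketch
with hypothetical synonym sets:</p>
        <p>
```python
# Replace the object y2^(mu) of S2 with the object y1^(nu) or the
# subject x1^(nu) of S1 when they share a synonymic range W.
W = [{"country", "state", "land"}, {"USA", "United States"}]

def same_range(a, b):
    return any(a in w and b in w for w in W)

def link_object(m1, m2):
    """m1, m2: tuples (p, x, g, y, q, z, r, h). Returns the updated
    model of S2 and the shared component, if any."""
    p2, x2, g2, y2, q2, z2, r2, h2 = m2
    if same_range(m1[3], y2):      # y1 ~ y2: object into object
        return (p2, x2, g2, m1[3], q2, z2, r2, h2), m1[3]
    if same_range(m1[1], y2):      # x1 ~ y2: subject into object
        return (p2, x2, g2, m1[1], q2, z2, r2, h2), m1[1]
    return m2, None

m1 = ("is", "USA", 0, "name", 0, "country", 0, 0)
m2 = ("visited", "tourists", 0, "United States", "many", 0, 0, 0)
m2u, shared = link_object(m1, m2)
```
        </p>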
        <p>All these rules consider different versions of the first type of making links in
textual information: tautological reiteration. This gives an opportunity for computer
modelling of the context links between the sentences of natural language.
Let us take a textual fragment and search for the context links between its sentences.</p>
        <p>According to the rules we have to create a logic and linguistic model for each
sentence that carries indivisible content.</p>
        <p>Suppose we have such textual information: “Celebrities invite publicity despite
knowing that this will leave them open to public attention.</p>
        <p>Therefore, it is hypocritical for them to complain when the media shows interest in
other aspects of their lives.</p>
        <p>Also, celebrities are influential role models to many people and because of this,
their private lives should be open to public examination.</p>
        <p>Additionally, the public have the right to know about the rich and famous since it
is our money that supports them.”</p>
        <p>Logic and linguistic models for such sentences are:</p>
        <p>
          LS1 = p1^(1)(x1^(1), 0, y1^(1), 0, 0, 0, 0) → p1^(2)(0, 0, x1^(1), q1^(2), z1^(2), r1^(2), 0).
        </p>
        <p>
          LS2 = p2^(1)(x1^(1), 0, 0, 0, 0, 0, h2^(1)) → p2^(2)(x2^(2), 0, y2^(2), 0, z2^(2), r2^(2), 0) &amp;
p2^(3)(x2^(3), 0, y2^(3), 0, z2^(3), r2^(3), 0).
        </p>
        <p>
          LS3 = x1^(1)(x3^(1), g3^(1), y3^(1), q3^(1), 0, 0, h3^(1)) → p3^(2)(x3^(2), x1^(1), y3^(2), r1^(2), 0, 0, 0) &amp;
p3^(3)(x3^(3), x1^(1), y3^(3), r1^(2), 0, 0, 0).
        </p>
        <p>
          LS4 = p4^(1)(r1^(2), 0, y4^(1), q4^(1), z4^(1), 0, h4^(1)) &amp; p4^(2)(r1^(2), 0, y4^(2), q4^(2), z4^(2), 0, h4^(1)) →
p4^(3)(x4^(3), g4^(3), x3^(2), x1^(1), z4^(3), 0, h4^(3)) &amp; p4^(4)(x4^(4), 0, x1^(1), 0, 0, 0, 0).
        </p>
        <p>LS1 = invite (celebrities, 0, publicity, 0, 0, 0, 0) →
will_leave (0, 0, celebrities, open, attention, public, 0).</p>
        <p>LS2 = complain (celebrities, 0, 0, 0, 0, 0, hypocritical) →</p>
        <p>shows (media, 0, interest, 0, aspects, other, 0) &amp;
shows (media, 0, interest, 0, aspects, celebrities_lives, 0).</p>
        <p>LS3 = celebrities (role, influential, people, many, 0, 0, models) →
should_be_open (lives, celebrities, examination, public, 0, 0, 0) &amp;
should_be_open (lives, private, examination, public, 0, 0, 0).</p>
        <p>LS4 = have (public, 0, right, know, rich, 0, since) &amp;
have (public, 0, right, know, famous, 0, since) →
money (right, know, lives, celebrities, famous, 0, since) &amp;</p>
        <p>support (money, 0, celebrities, 0, 0, 0, 0).</p>
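        <p>Each replacement of a component ties two sentences together; collecting the
sentence pairs that share a substituted component yields the link structure shown in
Fig. 1. A sketch over the four replacements made in this example (the component
names are written with the sentence number after the letter):</p>
        <p>
```python
# Each pair (target, source) below is one replacement made in the
# example; a name like "x2(1)" means x of sentence 2 with superscript
# (1), so the digit right after the letter is the sentence number.
replacements = [
    ("x2(1)", "x1(1)"),   # subject of S2 = subject of S1
    ("p3(1)", "x1(1)"),   # relation of S3 = subject of S1
    ("y4(4)", "x1(1)"),   # object of S4 = subject of S1
    ("y4(3)", "x3(2)"),   # object of S4 = subject of S3
]

def sentence_of(component):
    return int(component[1])   # the digit after the letter

# each replacement yields one context link between two sentences
edges = sorted({tuple(sorted((sentence_of(a), sentence_of(b))))
                for a, b in replacements})
```
        </p>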
        <p>The geometric interpretation of these context links between the natural language
sentences can be seen in Figure 1. The following components of the logic and
linguistic models were replaced: x2(2) = x2(1), p3(1) = x1(1) (according to the second rule about
similar objects), y4(4) = x1(1), y4(3) = x3(2) (according to the rule of identical object and
subject of the sentence).</p>
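        <p>The replacement pattern behind these links can be sketched as a simple substitution check. The slot indices and the function name below are assumptions for illustration only; the sketch applies the "identical object and subject" rule, which links the object slot of one sentence to the subject component of another:</p>
```python
# Hypothetical sketch of the "identical object and subject" rule:
# if the object slot of sentence A equals the subject slot of sentence B,
# the shared component is a context link between the two sentences.

SUBJECT, OBJECT = 0, 2  # assumed slot positions in the 7-slot model

def link_object_to_subject(args_a, args_b):
    """Return the shared component if the object of sentence A is
    identical to the subject of sentence B, else None."""
    if args_a[OBJECT] != 0 and args_a[OBJECT] == args_b[SUBJECT]:
        return args_a[OBJECT]
    return None

# will_leave(0, 0, celebrities, ...) and complain(celebrities, ...)
a = (0, 0, "celebrities", "open", "attention", "public", 0)
b = ("celebrities", 0, 0, 0, 0, 0, "hypocritical")
print(link_object_to_subject(a, b))  # → celebrities
```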
        <p>Fig. 1. Geometric interpretation of context links</p>
        <p>Every cube represents a simple natural language sentence within the complex sentence.
Each corner of the cube is a component of the logic and linguistic model. Grey cubes
are sentences with similar construction, differing in only one corner.
The red lines show the context links between the four natural language sentences.</p>
        <p>On the basis of the rules for searching context links between natural language
sentences, the author has developed basic principles for the synthesis of logic and
linguistic models of natural language sentences. These principles rest on identifying
the means of meaningful connection (semantic and deictic repetition, the use of
identical grammatical forms, syntactic or transpositional derivation) in text
documents and serve as a basis for constructing logic and linguistic models of
electronic text documents. The result is a knowledge base for an information
technology of automatic comparative analysis of electronic text documents by content.</p>
        <p>A comparative analysis was performed for systems that report the percentage of
matches between electronic text documents, in order to validate the results of
finding context links between natural language sentences. The systems were:
Advego Plagiatus, Text.ru, StrikePlagiarism, «Антиплагіат» and the
information technology of automatic comparative analysis with the developed knowledge
base (СПАТЛЛМ). The average index for the percentage of matches was obtained for
each system, correspondingly: Advego Plagiatus – 54.55%, Text.ru – 57.98%,
StrikePlagiarism – 57.51%, «Антиплагіат» – 51.81% and СПАТЛЛМ – 69.93%
(Figure 2, Figure 3).</p>
        <p>Fig. 3. The statistics of finding the index for percentage of matches</p>
        <p>The statistics of finding the index for the percentage of matches show that the
information technology which uses the rules for searching context links between
natural language sentences yields considerably better results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Conclusions</title>
      <p>The scientific significance of the work lies in an approach based on predicate
logic for forming meaningful models of text documents. It involves the development
of a mathematical apparatus for the semantic analysis of electronic text documents
which, on the basis of the analysis and synthesis of logic and linguistic models of
natural language sentences, enables the structuring of textual information, ranging
from the lowest level of logical connections to the text as a whole.</p>
      <p>The effectiveness of the proposed rules for searching context links was
demonstrated by the geometric interpretation of the links between four natural
language sentences. According to the proposed rules, the subject of the first
sentence is connected with the subject of the second, the relation of the third and
the object of the fourth ones. Thus, part of the problem of knowledge extraction
from textual information can be solved by searching for semantic reiteration:
tautological reiteration, thematic reiteration and reiteration of various stylistic
interpretations.</p>
      <p>11. Vavilenkova, A.: Basic principles of the synthesis of logical-linguistic models.
Cybernetics and Systems Analysis 51(5), 826-834 (2015).
https://doi.org/10.1007/s10559-015-9776-z</p>
      <p>12. Vavilenkova, A.: Analysis and synthesis of logic and linguistic models. TOV "SIK
GROUP Ukraine" (2017)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Bargesyan</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Kupriyanov</surname>, <given-names>M.S.</given-names></string-name>,
          <string-name><surname>Kholod</surname>, <given-names>I.I.</given-names></string-name>:
          <article-title>The technologies of data analysis: Data Mining, Visual Mining, Text Mining, OLAP</article-title>.
          BHV-Peterburg (<year>2007</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bulgakov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The algorithm of information extraction by ABBYY Compreno</article-title>
          (<year>2015</year>), https://habr.com/ru/company/abbyy/blog/269273/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Lexical concepts, cognitive models and meaning-construction</article-title>
          .
          <source>Cognitive Linguistics</source>
          <volume>17</volume>, pp. <fpage>73</fpage>-<lpage>107</lpage>.
          Edinburgh University Press (<year>2006</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>Cognitive Linguistics</source>.
          Edinburgh University Press (<year>2006</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Geitgey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>Natural Language Processing is Fun</source>
          (
          <year>2018</year>
          ), https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hlomozda</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glybovets</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maksymets</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Automating the Conversion of Colored Petri Nets with Qualitative Tokens Into Colored Petri Nets with Quantitative Tokens</article-title>
          .
          <source>Cybernetics and Systems Analysis</source>
          <volume>54</volume>
          (
          <issue>4</issue>
          ),
          <fpage>650</fpage>
          -
          <lpage>661</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1007/s10559-018-0066-4
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lande</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snarskii</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yagunova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pronoza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volskaya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Hierarchies of Terms on the Euromaidan Events: Networks and Respondents' Perception</article-title>
          .
          <source>In: 12th International Workshop on Natural Language Processing and Cognitive Science NLPCS 2015 Proceedings</source>
          . pp.
          <fpage>127</fpage>
          -
          <lpage>139</lpage>
          . Krakow, Poland
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Saini</surname>, <given-names>G.</given-names></string-name>:
          <article-title>How I used natural language processing to extract context from news headlines</article-title>
          (<year>2018</year>), https://towardsdatascience.com/how-i-used-natural-language-processing-toextract-context-from-news-headlines
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Shyrokov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>The computer lexicography</article-title>
          . Naukova dumka (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Bosse</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Bruening</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Yamada</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Affected Experiencers</article-title>.
          <source>Natural Language and Linguistic Theory</source>
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          (
          <year>2012</year>
          ). https://doi.org/10.1007/s11049-012-9177-1
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>