=Paper=
{{Paper
|id=Vol-2362/paper25
|storemode=property
|title=Semantic Analysis and Natural Language Text Search for Internet Portal
|pdfUrl=https://ceur-ws.org/Vol-2362/paper25.pdf
|volume=Vol-2362
|authors=Tetiana Kovaliuk,Nataliya Kobets
|dblpUrl=https://dblp.org/rec/conf/colins/KovaliukK19a
}}
==Semantic Analysis and Natural Language Text Search for Internet Portal==
Tetiana Kovaliuk1[0000-0002-1383-1589], Nataliya Kobets2[0000-0003-4266-9741]
1 National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”,
37 Prospekt Peremohy, Kyiv 03056, Ukraine
tetyana.kovalyuk@gmail.com
2 Borys Grinchenko Kyiv University, 18/2 Bulvarno-Kudriavska Str., Kyiv 04053, Ukraine
nmkobets@gmail.com
Abstract. The article is devoted to solving a set of problems related to the
semantic analysis of natural language texts. The following problems are
addressed: automating the generation of metadata files that describe the
semantic representation of a web page; constructing a semantic network for a
given set of texts; performing semantic search over a given set of texts using
metadata files; and exporting the semantic network to RDF format. Algorithms
for knowledge extraction from text, semantic network construction, and query
execution on a given semantic network are described. The lexico-syntactic
patterns method was used as the basis for approaching these problems. As part
of the software implementation of the method, a specification for describing
lexico-syntactic patterns has been developed, and a pattern interpreter based
on the morphological dictionary of the Ukrainian language has been created.
Experimental studies have been carried out on the set of patterns for the
«classification of living organisms» subject environment. A modified
Boyer–Moore–Horspool algorithm was used to address the problem of pattern
interpretation.
Keywords: metadata file, semantic network, semantic search, lexico-syntactic
patterns.
1 Introduction
International Data Corporation predicts that the sum of all data, whether
created, captured, or replicated – called the Global Datasphere – is
experiencing tremendous growth: it will grow from 33 zettabytes (ZB) in 2018 to
175 ZB by 2025 [1]. A considerable amount of this information exists in the
form of natural language texts. The problem of developing and applying new,
more advanced approaches to the presentation and analysis of information on the
Internet, including natural language texts, is becoming more and more acute. In
this context, the problem of information search arises [2]. One possible
solution to this problem is the conversion of the World Wide Web to a semantic
representation of data, that is, mapping each web resource to a metadata file
that contains the semantics of the resource and is suitable for machine
analysis. Thus, the problem of automating the creation of metadata files is
relevant.
The purpose of this study is to determine the possibility of reducing the
workload of creating metadata files for web pages, to reduce the time needed to
find relevant query results using metadata, and to improve the quality of
search query results. The objective is to perform semantic analysis of natural
language texts in order to fulfill queries formulated in natural language. This
objective is complex and can be divided into the following sub-objectives:
- to transform the query into a semantic representation;
- to perform analysis of natural language texts in order to extract knowledge
and present it in the form of a semantic network [3];
- to execute the query using logical deduction on a set of predicates of a
semantic network;
- to transform the results of a query from a machine representation into a
natural language form.
The problem of query transformation into a semantic representation, in turn,
decomposes into the following sub-objectives:
- to determine the context of a query, i.e., its subject domain [4];
- to identify the query semantics in the context of a specific subject domain [5].
The problem of analyzing natural language texts in order to extract knowledge
and present it in the form of a semantic network can be broken down into the
following sub-objectives:
- to classify the available texts by the subject environments they belong to;
- to perform semantic analysis of the texts that correspond to the subject
domain of the query; the result of this analysis is knowledge of the subject
environment, represented in the form of relations between objects;
- to construct a semantic network that contains the knowledge gained in the
previous step, as well as the knowledge of the ontology relevant to the subject
domain.
As part of this study, the following problems were solved:
- transformation of the query into a semantic representation;
- analysis of natural language texts to extract knowledge and present it in the
form of a semantic network;
- query execution using logical deduction on a set of predicates of a semantic
network.
A semantic network is an informational model of a subject domain that has the
form of a directed graph whose vertices correspond to the entities of the
subject domain and whose edges define the relations between them. Entities can
represent concepts, events, properties, and processes. In a semantic network,
the concepts of the knowledge base are vertices, and the directed edges define
the relations between them. Thus, the semantic network reflects the semantics
of the subject domain in the form of concepts and the relations between them
[6], [7]. A minimal illustration of this structure in code is given below.
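To make the structure concrete, here is a minimal C# sketch of a semantic network stored as a set of labelled edges; all type and member names are our illustrative assumptions, not the authors' implementation:

```csharp
// Minimal sketch of a semantic network as a set of labelled edges.
// All names here are illustrative assumptions, not the paper's code.
using System.Collections.Generic;
using System.Linq;

public record Triple(string Subject, string Predicate, string Object);

public class SemanticNetwork
{
    private readonly HashSet<Triple> edges = new();

    public void Add(string s, string p, string o) => edges.Add(new Triple(s, p, o));

    // All outgoing relations of an entity, i.e. the edges leaving a vertex.
    public IEnumerable<Triple> Outgoing(string subject) =>
        edges.Where(t => t.Subject == subject);
}
```

For example, `network.Add("dolphin", "is-a", "mammal")` records the edge dolphin → mammal labelled «is-a», matching the «classification of living organisms» domain used later in the paper.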
2 Mathematical formulation of the problem
Suppose there is a set of natural language texts that describes $M$ subject
domains. For each of the $M$ subject domains, the system retains its ontology
$O = \langle P, C, R \rangle$, where $P$ is a set of predicates, $C$ is a set of
concepts that form the basis of the subject domain, and $R$ is a finite set of
functions defined on the concepts and predicates of the ontology. The basis is
a set of predicates and concepts that is sufficient to describe at least 80% of
the texts of a given subject domain.
An input to the system is given in the form of a natural language text. As a
result of the analysis of incoming texts, the system returns the result of the
query execution in natural language.
The problem of semantic analysis and natural language text search (queries are
formulated in natural language) breaks down into the following sub-tasks:
1. The sub-task of query context determination is about finding the
correspondence between the query and some ontology
$O_k = \langle P, C, R \rangle$, $k \in [1, M]$, where $k$ is the index of the
subject domain that corresponds to the query. Task and domain ontologies all
relate to the context of a specific domain (e.g., zoology, biology) or task
(e.g., accounting). The ontology «classification of living organisms» is used.
2. The sub-task of extracting the semantics of the query in the context of a
specific subject domain becomes a semantic analysis of the query, which results
in a set of triplets $\{S, P, O\}$, where $S$ is the subject, $P$ is the
predicate, and $O$ is the object. A predicate (property) is a binary relation
between subject and object (fig. 1).
Fig. 1. Graph of a triplet.
3. The sub-task of classifying the available texts by the subject environments
they belong to comes down to splitting the original set of texts $T$ into $M$
subsets $\{T_1, T_2, \dots, T_M\}$, each of which corresponds to a certain
subject domain and ontology $O_k = \langle P, C, R \rangle$, $k \in [1, M]$.
4. The sub-task of semantic analysis of the texts that correspond to the
subject domain of the query is about extracting knowledge from the subset of
texts $T_k$ that corresponds to the subject domain of the query; the knowledge
is represented in the form of a set of triplets $\{S, P, O\}$, where $S$ is the
subject, $P$ is the predicate, and $O$ is the object.
5. To construct a semantic network containing the knowledge obtained in the
previous step as well as the knowledge of the ontology corresponding to the
subject domain, it is required to augment the resulting set of triplets
$\{S, P, O\}$ with information about the predicates and concepts $\{P, C\}$
that form the basis of the given subject domain. The resulting triplet set can
thus be represented as a directed graph whose vertices correspond to the
objects and concepts of the subject domain and whose edges are the relations
between them.
6. The sub-task of query execution using logical deduction on a set of
predicates of a semantic network involves the application of methods for the
automatic proving of theorems on the triplet set $\{S, P, O\}$ that consists of
the relations of the semantic network and the query. With logical deduction, a
set of triplets is obtained, which is the result of the query execution; a
minimal sketch of the base matching step follows below.
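Full logical deduction applies inference rules over the predicate set; as a hedged sketch of only the base step, the C# fragment below (reusing the Triple record from the earlier sketch; the «?variable» convention is our assumption) matches a query triple that may contain variables against the stored triples:

```csharp
// Matching one query triple (variables prefixed with '?') against the
// network's triple set. A simplified stand-in for the deduction step;
// names and the variable convention are illustrative assumptions.
using System.Collections.Generic;
using System.Linq;

public static class TripleMatcher
{
    static bool Matches(string pattern, string value) =>
        pattern.StartsWith("?") || pattern == value;

    // Returns every stored triple compatible with the query triple.
    public static IEnumerable<Triple> Execute(
        IEnumerable<Triple> network, Triple query) =>
        network.Where(t => Matches(query.Subject, t.Subject)
                        && Matches(query.Predicate, t.Predicate)
                        && Matches(query.Object, t.Object));
}
```

For the query `new Triple("?x", "is-a", "mammal")`, this returns all triples whose predicate is «is-a» and whose object is «mammal».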
3 Method for solving the problem of knowledge extraction from
text and representing knowledge as a semantic network
The choice of a method for solving the problem of semantic analysis and search
is based on the criteria of relevance of query results and on performance
requirements.
One of the most effective approaches to the semantic analysis of texts, which
allows tuning the system for specific subject environments, is the
lexico-syntactic patterns method [8], [9], [10], [11].
A lexico-syntactic pattern is a structural model of a grammatical construction.
The pattern specifies the construction's lexical composition and surface
syntactic properties and thus can be used to detect the construction in the
text and to extract it. The pattern is built as a sequence of elements
describing the corresponding fragments of the grammatical construction in the
order in which they occur in that construction.
The method of lexico-syntactic patterns assumes that lexical relations can be
described using a pattern hierarchy that consists mainly of part-of-speech
indicators and group symbols.
The advantages of the lexico-syntactic patterns method are the simplicity of
implementation and the high relevance of the results, provided there is a
sufficiently complete set of lexico-syntactic patterns covering the basis of
the researched subject domain. The disadvantage of the method is the need to
create lexico-syntactic patterns for each subject environment.
Therefore, lexico-syntactic patterns make it possible to design a semantic
model that corresponds to the text to which they are applied.
The lexico-syntactic patterns method works with the following inputs:
- a set of lexico-syntactic patterns that corresponds to a given subject
environment;
- an ontology containing the basic concepts and predicates for the given
subject environment;
- a text (or set of texts) for semantic analysis.
To work correctly, the method must first classify the analyzed text or query by
the subject environment it belongs to. Classification can be done using the
Bayes algorithm or any other method; one possible illustration is sketched
below.
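As a hedged illustration of this routing step (not the paper's actual classification module), a multinomial naive Bayes scorer over per-domain word counts could look as follows in C#; the training counts, the uniform prior, and all names are our assumptions:

```csharp
// Minimal multinomial naive Bayes scorer for routing a text to a subject
// environment. Training word counts per domain are assumed to be given;
// all names here are illustrative, not the paper's actual module.
using System;
using System.Collections.Generic;
using System.Linq;

public class DomainClassifier
{
    // wordCounts[domain][word] = frequency of the word in that domain's texts
    private readonly Dictionary<string, Dictionary<string, int>> wordCounts;

    public DomainClassifier(Dictionary<string, Dictionary<string, int>> counts)
        => wordCounts = counts;

    public string Classify(IReadOnlyList<string> tokens)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (var (domain, counts) in wordCounts)
        {
            double total = counts.Values.Sum();
            // Sum of log-likelihoods with add-one smoothing; uniform prior.
            double score = tokens.Sum(w =>
                Math.Log((counts.GetValueOrDefault(w) + 1.0) /
                         (total + counts.Count)));
            if (score > bestScore) { bestScore = score; best = domain; }
        }
        return best;
    }
}
```

Any classifier of acceptable accuracy could take its place; only the resulting domain label is consumed by the pattern interpreter.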
3.1 Description of the structure of lexico-syntactic patterns
A lexico-syntactic pattern is a sequence of string-elements and word-elements [12].
A string-element allows writing a desired string of characters into the
pattern, in particular a specific word form, a punctuation mark, or a symbol
that occurs in a formalized grammatical construction, for example «equals»,
«set», «+», and so on.
A word-element corresponds to a single word of the grammatical construction. In
general, the following properties are specified for a word-element:
- part of speech (the first letters of the corresponding names are used – see
Table 1);
- a specific lexeme that defines the set of all its word forms (identified by
name);
- values of the morphological parameters of a word that narrow the set of
possible word forms, for example, c – case, n – number, g – gender, t – tense,
p – person, and so on.
Table 1. Part of speech (legend).
Part of speech Legend
Word W
Noun N
Adjective A
Verb V
Pronoun Pn
Adverb Av
Preposition Pr
Interjection Int
Particle Pt
Numeral Num
For example, the word-element V<to understand, t=present, p=3> describes the
verb «to understand» in the present tense and in the third person, that is, it
defines its word forms «understand» or «understands».
When creating a word-element, the specific lexeme and the values of the
morphological parameters may be omitted, which allows specifying any word form
of a given lexeme (for example, N<file>) or an arbitrary word of a certain part
of speech with the required grammatical characteristics. For example, an
element that matches an adjective in the singular instrumental (ablative) case
can be written A<c=instrumental, n=singular>. In general, a pattern may include
both several words from different parts of speech and several different words
from one part of speech. Numerical indices are used to distinguish them. For
instance, the pattern NN = N1 N2<c=genitive> includes two different nouns N1
and N2, the second one being in the genitive case.
Concord rules point to the grammatical concord of the individual elements of
the formalized grammatical construction and refer to the whole pattern. They
are defined after the description of all elements of the pattern, in the form
of equalities of the values of the concordant morphological parameters. For
instance, the pattern PV = Pn V <Pn.n = V.n, Pn.g = V.g> describes a concordant
pair (pronoun and verb) agreeing in number n and gender g: «it looks», «they
agree», «she thinks», and so on.
Repetition of elements can be specified in the pattern. For example, the
pattern {N<c=genitive>} defines a chain of consecutive nouns in the genitive
case. If there are known limitations on the number of identical elements, they
can be specified in the pattern: the record {A}<1,3> N defines a sequence of
one, two, or three adjectives followed by a noun.
A pattern may also include optional elements (in square brackets): for example,
the element [«not»] specifies the optional nature of the «not» particle in the
language expression. Alternatives of a certain grammatical construction are
recorded using the symbol | (the logical operation «or»). For instance, the
pattern NP = N | Pn describes the alternative between the noun N and the
pronoun Pn. A code-level model of these elements is sketched below.
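Here is a minimal C# sketch of how string-elements and word-elements can be modelled; the type names and the token shape are our assumptions, and the real interpreter checks morphology against the Ukrainian morphological dictionary rather than a plain dictionary lookup:

```csharp
// Illustrative model of lexico-syntactic pattern elements.
// StringElement matches a literal token; WordElement checks part of speech
// and optional morphological constraints (case, number, gender, tense, person).
using System.Collections.Generic;
using System.Linq;

// A token as produced by morphological analysis (assumed shape).
public record Token(string Text, string PartOfSpeech,
                    Dictionary<string, string> Morphology);

public abstract class PatternElement
{
    public abstract bool Matches(Token token);
}

public class StringElement : PatternElement   // class K1
{
    public string Literal { get; init; }
    public override bool Matches(Token token) => token.Text == Literal;
}

public class WordElement : PatternElement     // class K2
{
    public string PartOfSpeech { get; init; }               // e.g. "N", "A", "V"
    public Dictionary<string, string> Constraints { get; init; } = new();

    public override bool Matches(Token token) =>
        token.PartOfSpeech == PartOfSpeech &&
        Constraints.All(c => token.Morphology.GetValueOrDefault(c.Key) == c.Value);
}
```

Under this model, the element A<c=instrumental, n=singular> becomes a WordElement with PartOfSpeech "A" and the constraints c=instrumental and n=singular.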
The full specification of the lexico-syntactic patterns language is given in [13].
Thus, when creating a pattern for a complex grammatical construction, it makes
sense to highlight its constituent parts and describe them one by one in the
form of separate patterns.
3.2 Boyer–Moore–Horspool based algorithm for interpreting lexico-syntactic
patterns
The work of the interpreter is to compare the lexico-syntactic patterns for
each predicate of the subject environment with the analyzed text. Usually, the
subject environment is described by a set of predicates, and for each of them a
set of patterns is created. Since the speed of interpretation plays a critical
role, it is desirable to apply the most efficient interpretation algorithm. One
of the best-performing algorithms is the Boyer–Moore–Horspool (BMH) algorithm
[14], [15], which is the reason why it was used in this work.
BMH is a modified version of the original Boyer–Moore algorithm. It performs
fast matching compared to other algorithms regardless of the size of the
pattern being processed. The BMH algorithm performs as well as the original
Boyer–Moore algorithm. Moreover, the same results show that for almost all
pattern lengths this algorithm outperforms algorithms that use a hardware
instruction to find the occurrence of a designated character. The algorithm
works fast in situations where the pattern is much shorter than the processed
text, or when searching in multiple documents. Usually, the longer the pattern,
the faster the algorithm works [16]. A reference implementation of the classic
algorithm is sketched below.
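For reference, here is a minimal C# implementation of the classic single-pattern Horspool search over raw characters; the interpreter described next adapts this idea to pattern elements rather than characters, so this sketch illustrates only the base algorithm:

```csharp
// Classic Boyer–Moore–Horspool substring search over raw characters.
// Builds the bad-character shift table, then scans right-to-left inside
// each alignment window. Returns the first match position or -1.
using System.Collections.Generic;

public static class Horspool
{
    public static int Search(string text, string pattern)
    {
        int n = pattern.Length;
        if (n == 0) return 0;
        if (n > text.Length) return -1;

        // Shift for a character = distance from its last occurrence
        // (excluding the final position) to the end of the pattern.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < n - 1; i++)
            shift[pattern[i]] = n - 1 - i;

        int pos = 0;
        while (pos <= text.Length - n)
        {
            int j = n - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos; // full match at pos
            // Slide by the shift of the text character under the pattern's end.
            pos += shift.GetValueOrDefault(text[pos + n - 1], n);
        }
        return -1;
    }
}
```

The steps below follow the same scheme: a shift table is built per string-element (formulas (4) and (5)), and the text is scanned with right-to-left comparisons.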
The input data of the algorithm for interpreting lexico-syntactic patterns
consists of:
- the set $L$ of lexico-syntactic patterns for the given ontology
$O_k = \langle P_k, C_k, R_k \rangle$, where each pattern
$l = (l_1, l_2, \dots, l_m)$ is a sequence of elements $l_i$,
$i = \overline{1, m}$, $m$ is the number of elements in the pattern, and
$l_i \in K_1 \cup K_2$, where $K_1$ is the class of string-elements and $K_2$
is the class of word-elements, according to [12];
- a string-element contains a constant string $str$ only: $l_i = \{str\}$,
$l_i \in K_1$;
- a word-element can contain a constant string $str$ as well as data on the
parameters (grammatical categories) of the word: part of speech, gender, case,
tense, and grammatical number;
- the natural language text $T$.
The result of the algorithm is a set of triplets $\{S, P, O\}$, where $S$ is
the subject, $P$ is the predicate, and $O$ is the object.
The following is a step-by-step description of the algorithm.
Step 0. Sort the elements of each pattern by their membership in the classes
$K_1$ and $K_2$. As a result, a sorted set of patterns $L^{sort}$ is obtained
for which the following holds: $\forall l_i, l_j \in L$: if $i < j$, then
$l_i \in K_1$, $l_j \in K_2$.
Step 1. Sort the string-elements by length. As a result, a sorted set
$L^{sort}$ of patterns is obtained for which the following holds:
$\forall l_i, l_j \in L$: if $l_i, l_j \in K_1$ and $i < j$, then the length of
the string $l_i$ is not less than the length of the string $l_j$; in addition,
$\forall l_i, l_j \in L$: if $i < j$, then $l_i \in K_1$, $l_j \in K_2$.
Step 2. Sort the word-elements by their level of concretization. As a result, a
sorted set $L^{sort}$ of patterns is obtained for which the following
statements hold:

$\forall l_i, l_j \in L$: if $l_i, l_j \in K_2$ and $i < j$, then $|l_i| \ge |l_j|$,   (1)
$\forall l_i, l_j \in L$: if $l_i, l_j \in K_1$ and $i < j$, then $|l_i| \ge |l_j|$,   (2)
$\forall l_i, l_j \in L$: if $i < j$, then $l_i \in K_1$, $l_j \in K_2$,   (3)

where $|l_i|$ is the length of the string $l_i$ and $|l_j|$ is the length of
the string $l_j$.
Step 3. Pick one unprocessed pattern $l^{cur}$ from the set of lexico-syntactic
patterns $L$. If all elements of $L$ have been processed, go to step 4;
otherwise, go to step 3.1.
Step 3.1. Select the next unprocessed string-element $l_i^{cur}$ of the pattern
$l^{cur}$ and go to step 3.1.1. If all string-elements have been processed, go
to step 3.2.
Step 3.1.1. For each character $str[i]$ of the string $str = l_i^{cur}$ except
the last one ($i < n - 1$, where $n$ is the length of $str$), define the value
$val[i]$, which is equal to the maximum shift of the character relative to the
beginning of the word:

$val[i] = \max\limits_{j:\, str[j] = str[i]} j, \quad i \in [0; n-1),$   (4)
$val[n-1] = n.$   (5)

As a result of this step, the mapping «character → minimum shift relative to
the end of the string» is obtained.
Step 3.1.2. Align the beginning of the text $T$ with the beginning of the
string-element $l^{cur}$: $shiftText = 0$ and $shiftStr = 0$. Go to step 3.1.3.
Step 3.1.3. If the end of the text $T$ has been reached, go to step 3.1.
Examine the last character $str[n - 1 - shiftStr]$ of the string-element. If
this character matches the corresponding character of the text,
$str[n - 1 - shiftStr] = T[shiftText]$, go to step 3.1.4; otherwise, go to step
3.1.5.
Step 3.1.4. If the current character is the first in the word, that is,
$n - 1 - shiftStr = 0$, go to step 3.1. Otherwise, check the next character
from the end of the string-element $l^{cur}$: $shiftStr = shiftStr + 1$. If
this character matches the corresponding character of the text,
$str[n - 1 - shiftStr] = T[shiftText]$, go to step 3.1.4; otherwise, go to step
3.1.5.
Step 3.1.5. Shift the text $T$ by the number $\Delta shift$ that corresponds to
the current character $T[shiftText]$ in the mapping:
$shiftText = shiftText + \Delta shift$. If the character is not in the table,
the shift equals the length of the word $str$. Reset the shift $shiftStr$
within the current word $str$ to zero: $shiftStr = 0$. Go to step 3.1.3.
Step 3.2. Select the next unprocessed word-element $l_i^{cur}$ of the pattern
$l^{cur}$ and go to step 3.2.1. If all word-elements have been processed, go to
step 3.3.
Step 3.2.1. Check the match between the morphological form of the word-element
$l_i^{cur}$ and the word of the text whose position corresponds to the position
of $l_i^{cur}$ in the lexico-syntactic pattern. If the morphological form
matches in all parameters (part of speech, gender, case, tense, grammatical
number), go to step 3.2. Otherwise, go to step 3.
Step 3.3. Add the triplet $\{S, P, O\}$ corresponding to the pattern $l^{cur}$
to the semantic network.
Step 4. End the algorithm.
4 Software Implementation and Example
The software has a three-tier architecture: it consists of a client
application, a server application, and a database server. The C# language and
the .NET platform were selected for the implementation due to advantages such
as automatic garbage collection, the built-in LINQ query language, the
availability of the ADO.NET classes for database access, and a number of other
benefits.
The client application consists of the following modules:
- an access module for working with the server-side application;
- a semantic network graphical display module;
- a user interface module.
The access module for working with the server-side application is designed to
communicate with the web service, which implements the business logic.
The user interface module is responsible for creating a GUI that allows the
user to conveniently enter the input data and view the results of its
processing.
The semantic network graphical display module is responsible for creating an
image of the directed graph corresponding to the constructed semantic network.
The server application consists of the following modules:
- a semantic network construction module;
- a module for exporting the semantic network from the logical to the RDF
format [17];
- a query analysis module;
- a module for performing logical deduction on the semantic network;
- an access module for working with the database.
The semantic network construction module implements the method of
lexico-syntactic patterns in order to extract knowledge from the text.
The module for exporting the semantic network from the logical to the RDF
format allows retaining the knowledge extracted from the text in a format
suitable for machine processing; a simplified export sketch follows.
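As a minimal sketch of such an export in C# (reusing the Triple record introduced earlier; the base IRI and the N-Triples target are our assumptions — the actual module may emit a different RDF syntax):

```csharp
// Serializing triples into N-Triples: one "<s> <p> <o> ." statement per line.
// The base IRI and the escaping are simplified for illustration only.
using System;
using System.Collections.Generic;
using System.Text;

public static class RdfExporter
{
    const string BaseUri = "http://example.org/onto#"; // assumed namespace

    static string Iri(string name) => $"<{BaseUri}{Uri.EscapeDataString(name)}>";

    public static string ToNTriples(IEnumerable<Triple> triples)
    {
        var sb = new StringBuilder();
        foreach (var t in triples)
            sb.AppendLine($"{Iri(t.Subject)} {Iri(t.Predicate)} {Iri(t.Object)} .");
        return sb.ToString();
    }
}
```

Each statement becomes a single «&lt;s&gt; &lt;p&gt; &lt;o&gt; .» line, which standard RDF tools can load.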
The query analysis module determines the content of the query.
The module for performing logical deduction on the semantic network executes a
query and presents the results in the form of a set of
«object–predicate–subject» triplets.
The access module for working with the database is designed to provide the
application server with access to the morphological dictionary and other data
stored in the database.
For semantic analysis, the user must specify a *.txt, *.doc, or *.html file or
a web resource URL. The program builds the semantic network in the process of
text analysis. To represent the knowledge obtained from the text in a
machine-friendly format, the RDF format is used (fig. 2).
Fig. 2. Presentation of knowledge received from the text in RDF format and an
image of the semantic network.
5 Conclusion
The paper deals with the problems of extracting knowledge from texts,
presenting it in the form of a semantic network, and executing queries on the
constructed network. The main approach to solving these problems was the
lexico-syntactic patterns method. As part of the software implementation of the
method, a specification for describing lexico-syntactic patterns has been
developed, a pattern interpreter based on the morphological dictionary of the
Ukrainian language has been created, and a set of patterns for the
«classification of living organisms» subject environment has been collected. A
modified Boyer–Moore–Horspool algorithm was used to solve the problem of
interpreting lexico-syntactic patterns.
Future work will concentrate on creating tools for simplifying the process of
building new lexico-syntactic patterns, adding new lexico-syntactic patterns to
the database, improving the query execution module, and creating a module for
classifying texts by the subject environment they belong to.
References
1. Data Age 2025. The Digitization of the World from Edge to Core,
https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-
whitepaper.pdf, last accessed 2019/02/10.
2. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, Cambridge (2009)
3. Clark, A., Fox, C., Lappin, S.: Computational Linguistics and Natural Language
Processing. Wiley-Blackwell Publishing (2010)
4. Ijntema, W., Sangers, J., Hogenboom F., Frasincar, F.: A Lexico-Semantic Pattern
Language for Learning Ontology Instances from Text. In: Web Semantics: Science,
Services and Agents on the World Wide Web, Vol. 15, pp. 37-50 (2012)
5. Johnson, M.: Natural Language Processing and Computational Linguistics: from Theory
to Application, http://web.science.mq.edu.au/~mjohnson/papers/CLandTopicModels.pdf,
last accessed 2019/04/12.
6. Hebeler, J., Fisher, M., Blace, R., Perez-Lopez, A.: Semantic Web Programming.
Indianapolis: Wiley Publishing (2009)
7. Pollock, J.T.: Semantic Web for Dummies. Indianapolis: Wiley Publishing (2009)
8. Lexico-Syntactic Pattern Language: language description LSPL, http://www.lspl.ru/, last
accessed 2019/02/10.
9. Zagorul'ko, Yu.A., Sidorova, E.A.: A system for extracting subject terminology from
text based on lexico-syntactic patterns. In: Proc. XIII International Conference on Control
and Modeling Problems in Complex Systems, Samara, pp. 506–511 (2011)
10. Klaussner, C., Zhekova, D.: Lexico-Syntactic Patterns for Automatic Ontology Building.
In: Proceedings of the Student Research Workshop associated with RANLP 2011, Hissar,
Bulgaria, pp. 109–114 (2011)
11. Mochalova, A.V.: Algorithm for semantic text analysis by means of basic semantic
templates with deletion. In: Scientific and Technical Journal of Information Technologies,
Mechanics and Optics. No. 5 (93), pp. 126-132 (2014)
12. Bolshakova, E.I., Baeva, N.V., Bordachenkova, E.A., Vasilyeva, N.E., Morozov, S.S.:
Lexico-syntactic patterns in the tasks of automatic text processing. In: Proceedings Int.
Conference “Dialogue 2007”, pp.70-75 (2007)
13. Language description LSPL (1.0.1), http://www.lspl.ru/articles/LSPL_Refguide_13.pdf,
last accessed 2019/02/10.
14. Horspool, R.N.: Practical Fast Searching in Strings. Software-Practice and Experience,
vol. 10, pp. 501-506 (1980)
15. The Boyer–Moore–Horspool Algorithm,
http://www.mathcs.emory.edu/~cheung/Courses/323/Syllabus/Text/Matching-Boyer-
Moore2.html, last accessed 2019/04/14.
16. Hasan, A.A., Abdul Rashid, N.A., Abu-Hashem, M.A., Abdulrazzaq, A.A.:
Multi-Pattern Boyer–Moore–Horspool Algorithm based Hashing Function for Intrusion
Detection System. In: Lecture Notes on Information Theory, Vol. 1, No. 2 (2013)
17. Description of the data presentation model RDF, http://www.w3.org/RDF/, last
accessed 2019/04/14.
18. Vavilenkova, A.I.: Software for detection of text documents identical in content. In:
Visnyk of Chernihiv State Technological University, No. 2 (65), pp 125-132 (2013)