Intelligent Tools for the Semantic Internet Navigator Design
© Igor Kuznetsov © Mikhail Charnine © Elena Kozerenko
© Nikolay Somin © Vladimir Nikolaev © Andrey Matskevich
Institute of Informatics Problems of the Russian Academy of Sciences,
Moscow
Igor-kuz@mtu-net.ru keywen1@mail.ru kozerenko@mail.ru
somin@post.ru ipiranlab14@yandex.ru
Abstract

This paper describes the methods and instruments for the design of a
semantic web navigator, a novel system providing semantic drive for
Internet users. The solutions proposed rest on the statistical paradigm
for knowledge extraction and on the semantic presentations based on the
Extended Semantic Networks (ESN) mechanism. The approach presented
comprises rule-based and stochastic techniques for text processing and
for mapping the extracted entities and relations onto the structures of
the knowledge base.

The work is supported by the Russian Foundation for Basic Research,
grant 11-06-00476-

Proceedings of the 14th All-Russian Conference "Digital Libraries:
Advanced Methods and Technologies, Digital Collections" RCDL-2012,
Pereslavl Zalesskii, Russia, October 15-18 2012.

1 Introduction
The paper deals with the design and development of new tools that
combine the intelligent methods and systems based on the presentation
mechanism of the extended semantic networks (ESN) [1], which has been
employed to create a wide range of knowledge-based systems, with the
features of the keywords encyclopedia Keywen [2].
As a result of the tremendous growth of the Internet, its users receive
huge volumes of information in response to their queries. Users are
interested in a great variety of questions, and they try out keywords
and phrases by trial and error (addressing search engines and analysing
the answers). This results in tremendous expenditures of labour and in
disappointment because of huge amounts of irrelevant information and/or
its incompleteness.
Hence, to make optimal queries, one has to face the problem of ordering
requests so that they reflect the interests of users, and of creating
directories of subjects and articles. It is necessary to create special
means which allow users to find what interests them in the sea of
information with reduced expenditures of labour. On-line encyclopedias
play the role of such means.
The intelligent web navigator comprises the features of the ESN
linguistic processors, which were developed for different classes of
information systems in the artificial intelligence research field.
The core feature of the development is assigning a semantic structure
to natural language input. The semantic structure is obtained via
semantic categorization and the establishment of semantic relations
between the concepts presented in natural language texts found on the
Internet. Association is the dominant type of semantic relation
supported by the intelligent tools under discussion. The study of the
contexts and co-occurrence of terms and key words makes it possible to
shape the semantic structure of the navigated texts and to perform
automatic categorization.
A hybrid approach is taken for the semantic navigator design: it
incorporates the logical analytical functionality of the intelligent
systems based on the extended semantic networks, statistical methods
and machine learning mechanisms.

2 The ESN Intelligent Systems and their evolution

The intelligent systems developed on the basis of the apparatus of the
extended semantic networks (ESN) [1, 4-6], called the ESN-systems, were
created by the association of developers, including the authors of this
article, at the Institute of Informatics Problems of the Russian Academy
of Sciences over a period of two decades within the framework of
research projects and applied systems oriented at concrete subject
areas and customers.
We single out 4 generations of ESN-systems. The linguistic semantic
ideas laid as the basis of the systems of this class underwent a
specific evolutionary process.
Intelligent ESN-systems contain developed knowledge bases, in which the
knowledge is represented in the form of records in the language of the
extended semantic networks, called ESN-structures. Linguistic knowledge
is, thus, a "special case of knowledge" and it is also represented in
the form of the
records in the language of the extended semantic networks. The basic
structural element of the ESN is the named N-ary predicate, called a
"fragment". The whole set of language objects is given in the form of
predicate-argument structures; the mechanisms for presenting embedded
structures are supported, which gives very powerful means for
describing the objects of different language levels.
The uniformity of language presentations is a very important factor. In
the process of analysis and synthesis of natural language sentences a
formal grammatical apparatus similar to the dependency grammar is used.
With this approach the words and the constructions which perform the
role of predicates in the sentence are the "support" elements, and the
result of the analysis of a sentence must be one predicate, which
corresponds to the predicate of the sentence in question (i.e. to the
basic verb in the tensed form or to another basic predicate
expression). Thus, in the process of analysis the "action words" and
the "relation words" are processed first, i.e., the verbs and other
words which have syntactic-semantic valences. Examples of "relation
words" are "father", "friend", and the like; in this case a "relation"
is a word which assigns strong, clearly expressed syntactic-semantic
expectations.
Semantic analysis in the engineering linguistic understanding is the
process of translating natural language expressions into "internal"
structures of the knowledge base (KB); in our case these "internal"
structures are the records in the ESN language. Thus, a KB structure is
the code of sense in the intelligent information systems. The language
engineering solutions were implemented in the systems with "complete"
linguistic analysis (these are the systems of the 1st and 2nd
generations: DIES1, DIES2, Logos-D [1, 4]) and in the systems with the
"factographic" approach, i.e. the intelligent systems of analytical
decisions support (ISADS) [6], where the goal of analysis is the
extraction of entities and connections from the texts (these are the
systems of the 3rd and 4th generations).
The ESN systems of the 4th generation perform the tasks of semantic
object (named entity) extraction. The set of objects to be extracted
depends on the tasks of a user. At the same time, the quality of a
linguistic processor is to a considerable degree determined by its
capabilities for this extraction. The basic types of information
objects and connections extracted by the ESN semantic processors are
given below:
• persons (by family name, given name and patronymic - FNP) with their
  role features (criminal, victim);
• the verbal description of the persons, their distinctive signs;
• address, posting information attributes;
• date(s) mentioned;
• weapon with its special features;
• telephone numbers, faxes, e-mails with their subsequent
  standardization;
• the means of transport with the indication of the vehicle type, its
  state number, color and other attributes;
• passport data and other documents with their attributes;
• explosives and narcotic substances;
• organizations, positions;
• quantitative characteristics (how many persons or other objects
  participated in an event);
• the numbers of accounts, sums of money with the indication of the
  currency type;
• terrorist groups and organizations;
• participants of terrorist groups with the indication of their roles
  (leader, head of, etc.);
• the armed forces assigned for antiterrorist combat (Military_.Force);
• events (criminal, terrorist, biographical, and so on) with the
  indication of the information objects participating in them;
• time and place of events;
• the connections between different types of information objects (with
  whom a person works in an organization or lives at the same address,
  in what events the person participated together with other objects,
  etc.).
For extracting objects, all versions of an object name, including the
brief forms possible in the text, were considered. Standard objects
(names, dates, addresses, the forms of weapon and others) are reduced
to one (standard) form.
The identification of objects is performed taking into account brief
designations (for example, separate surnames, patronymics, initials),
anaphoric references (indicative and personal pronouns, for example,
"this person", "it..."), definitions and explanations (for example,
"the mayor of Moscow Sobianin" is identified with the subsequent words
"mayor", "Sobianin").
For the extraction of events and connections, the analysis of verbal
forms, participial and adverbial constructions is carried out. An
important task is the identification of objects across the entire text
and the use of indicative pronouns, brief names and anaphoric
references for this purpose.
Taking into account these difficulties and in accordance with the
tasks, the linguistic processor Semantix was developed, which performs
the normalization of words, their grouping with the formation of units,
the identification of objects and the establishment of connections. As
a result, for each NL document a semantic network called the meaningful
document portrait is constructed automatically. These portraits are the
knowledge structures of the knowledge base which serve as the basis for
implementing different forms of semantic search: the search by features
and connections, the search for objects connected at different levels,
the search for similar persons involved and incidents, the search by
distinctive signs (with the use of ontologies).
The extraction of connections is not only the deep analysis of verbal
and other forms. Many connections are given by default. For example, in
the summaries of incidents, as a rule, the names of the persons
involved are followed by their data without an indication of whom the
data belong to and with additional text insertions. For that the
directed
search for the connected objects, i.e., the restoration of default
connections and data, is organized in the Semantix processor.
Special processes are organized in order to connect persons with their
place of stay or place of work, with the vehicles which belong to them,
and so forth. For example, the analysis of the summaries of incidents
is performed as follows. For a number of objects (address, telephone,
date of birth, etc.) a virtual connection is built with other, as yet
unidentified, objects (names, organizations). Then, at the same level
of processing, these objects are searched for with the aid of special
identification rules. These rules indicate the direction of search, the
permissible number of steps, and the features of the words and
punctuation marks at which the search ends. In this case special
filters are required in order not to pick up and connect an alien
object.
This approach showed sufficiently good results in the system Criminal
[11]. The special features of natural language are taken into account,
where the same actions are expressed with the aid of verbs, verbal
nouns and participial constructions. Presented in ESN, they are reduced
to one form, i.e. a complex object. Moreover, forms with verbal nouns
can be components of verbal forms. By analogy, in ESN some objects can
be components of others. The cause-and-effect and temporal dependences
between actions, events, etc. are also represented; they reflect the
logical connection of sentences assigned explicitly with the aid of the
words "therefore", "then", etc.
The quality of a linguistic processor is determined by a number of
factors. First, there is the capability to isolate objects and
connections, i.e. the types of objects being isolated and their
quantity. The Semantix processor identifies up to 40 types of objects,
including very complex ones, which correspond to actions and events.
With an increase in this quantity additional difficulties appear,
connected with collisions between the extraction rules: some rules can
capture words which relate to other objects extracted by other rules.
It becomes important to consider the order of the application of rules,
including the rules of identification. In the second place, an
important factor is the selectivity of the rules and procedures of
identification: the factor of noise and losses. By noise we mean the
presence of excessive words in the objects. Losses are the situations
when an object is not revealed or is revealed only partially: the text
contains words which did not enter into the object. In the Semantix
processor the rules are arranged in such a way that they ensure a high
degree of selectivity and the minimization of noise and losses with a
large number of objects being selected.

3 Conceptual linguistic simulation
Conceptual linguistic simulation (CLS) is the process of constructing a
natural language model of a subject area (SA) (Fig. 1) that synthesizes
the approaches of conceptual and linguistic simulation [4-6].

1. The analysis of the texts
2. Singling out the basic concepts and characteristics
3. Constructing a subject area vocabulary founded on the basic "world
   model" (input: the basic "world model" and the language model)
4. Establishment of the type-kind relations between SA notions
5. Formulation of the situational rules in the form of IF… THEN rules
Figure 1 The flowchart of conceptual linguistic modeling

Construction of the conceptual linguistic model of a certain subject
area is subdivided into the following stages:
- construction of the conceptual model proper, i.e., the ramification
  of fundamental notions, their organization in kind-type trees and the
  determination of the connections between them;
- the development of the ideographic dictionary for the subject area,
  i.e., the lexical population of the conceptual model;
- the introduction of the base rules, which describe "the model of the
  world" in the natural language relevant for the subject area.
The procedure of conceptual-linguistic simulation on the basis of the
ESN apparatus is based on the following principles:
• the model must be "open", i.e., support an effective mechanism of
  expansion and information update;
• the model of the "sense" presentation should consider the facts of
  extra-linguistic reality, which in the form of rules and relations
  compose a certain basic "world model" and the concrete models of
  subject areas;
• the model should be practical, i.e., not overloaded by detailed
  descriptions of connections and relations between the concepts, in
  order to ensure the possibility of its realization, but at the same
  time it should reflect the information relevant for specific
  objectives.
A realistic approach to the formulation of the problem dictates the
need for limitation to a domain-oriented subset of a natural language.
The essence of the limitations consists in the following:
- first, the analyzed text materials contain expert knowledge from
  particular subject areas (we developed systems for such subject areas
  as the diagnostics of microcircuit production failures, forecasting
  in the social sphere, criminology, and others);
- in the second place, for the purposes of the
maximally possible elimination of ambiguity, the dictionary is built
according to the modular principle: there is a certain most general
common part (1-2 levels) supplemented by special dictionaries for each
particular subject area.
The proposed model of lexical semantics is based on the principle of
the "nuclear" meaning realized in the context of the given subject
area, with the subsequent inductive supplementation of other meanings
(if they are actualized in the contexts in question). A taxonomy is
also used, which is realized in the form of hierarchical trees of the
word classes. The general "world model" of the system serves as the
basis for the subject area models.
The classes of words are subdivided into concepts/names, relations,
actions, properties, characteristics of actions, time and place
locatives. The most general notion is "concept", or universal class,
which is subdivided into object, situation, process and others. The
words which relate to the classes of actions and relations are
represented as semantic-syntactic frames, which determine the
predicate-argument structures (government model).
However, in the described approach (let us name it the ESN-approach)
the range of argument values is substantially extended. This extension
consists in the fact that the role of arguments can be played by simple
objects corresponding to individual words and by structural objects
which present word combinations, phrases and clauses, and that the
concept of "case" includes not only semantic, but also syntactic
aspects. The approach based on ESN allows an arbitrary level of
structure embedding to be reflected; it thus makes it possible to
reflect the structural nature of lexical semantics, which in this model
has a hierarchical network structure.
Linguistic knowledge is represented in the system dictionary and the
declarative modules of the linguistic processor. The ESN systems also
implement a dynamically formed semantic dictionary, which the system
expands automatically in the course of processing concrete texts on the
basis of the initial linguistic information.
In Fig. 2 the "internal" description of the verb in the semantic
dictionary is represented. This dictionary is automatically generated
by the ESN-systems DIES2, LOGOS-D, IKS in the course of natural
language text processing.

{( 895__)(DICSEM)
COORD(PROGNOZ1,RUS, 895__,S5
0_31_51_20,%) SUB(UNIV,0+) SUB(UNIV,1+)
SUB(UNIV,2+)
(0-,1-,2-/3+) INFI(3-) (3-)
(3-/4+) FUT1(4-) SUB( ,5+)
Figure 2 An example of the presentation of the verb
vyrabatyvat' - "to manufacture" in the semantic dictionary.

4 General Considerations for Encyclopedia Design

Encyclopedias traditionally played an important part in the study of
new material. However, their creation in electronic form is a huge
work, which requires not simply entering the adequate material into a
computer, but also its additional ordering: creation of subject
directories for allocation of main classes and subclasses, definition
of main notions, building of hyper-references that link the entries
(articles) of the encyclopedia to one another and to primary sources.
What should also be considered is the dynamism of the information
circulating in the Internet: the emergence of new information sources,
which should be taken into account in encyclopedias.
At present the majority of large electronic encyclopedias operating
on-line have been created on the basis of printed materials of
universal encyclopedias: Big Soviet Encyclopedia, Britannica, Big
Brockhaus, Big Larousse and others. Creation of such encyclopedias
requires considerable human labour.
The above leads us to the conclusion that the global problem in the
present situation is the development of methods and software for
automating the most labour-consuming stages of the formation of on-line
Internet encyclopedias.
Such formation requires elements of intellectual activity: choosing the
subject for description, forming articles (entries) and their names,
searching for definitions, etc. Development of the concepts of the
on-line encyclopedia results in reference systems of a more general
kind, providing collection of information and systematized knowledge
representation about different objects which are of interest to the
user:
- about politicians, persons of science, of culture;
- about organizations, companies;
- about events (for example, strikes, their reasons, place and time);
- about goods and objects of a particular class (for example, fuel,
  mining, region) and others.
While building such systems, many common problems appear that are also
vital for the on-line encyclopedia. The only difference is that instead
of articles and their names there would be other objects.
At present the solution of the discussed problems becomes feasible
because many systems and facilities have been designed and developed in
the areas connected with creating different classes of intelligent
systems, language processors, knowledge bases, and statistical
processing of language components [1-14].
The given work is based on the experience of creating the on-line
encyclopedia and is devoted to the principal directions in the
development of solution methods for the mentioned problem.

5 Special Features of Automation
In general, the problem looks like this. The input comprises a stream
of documents from the Internet (all relating to a determined
application domain). The
output is an electronic encyclopedia consisting of brief articles with
names, with hyper-references between articles (if the names of other
articles are encountered in the text) and with hyper-references to the
primary sources, the documents from the Internet.
In addition, an electronic encyclopedia should include the main menu,
article sections, various classifiers and an internal search system
providing quick access to the concrete subjects making up the
application domain. Certainly, it is not possible to automate all these
processes.
Formation of the main menu, subjects and query facilities is done
manually. The computer can help with the selection of material for
articles and the choice of their meaningful components.
Two stages are distinguished: training and operation. At the training
stage, a training sample (documents from the Internet) is given to the
system together with the indicated articles which the system should
select. For example, these can be types of diseases, symptoms, texts of
description falling into, say, preventive maintenance of diseases, and
others. The system should develop decision rules providing allocation
of these articles at the stage of operation on other documents.
Such rules are founded on statistical treatment with discovery of
keywords and standard contexts (meaningful components), providing
selection of articles.
The training stage makes it possible to partly or completely automate
the activity of a developer in the discovery of the data necessary for
system operation. Discovery of keywords and of contexts requires the
use of morphological and semantic blocks of natural language (NL)
analysis.
The first block converts word forms
e.g. TABLE, of TABLE, to TABLE
into the uniform type (TABLE) and is particularly important for
languages where words have a system of cases and other morphological
information, as, for example, the Russian language.
Without such transformation the search in documents for the same
components becomes extremely difficult.
The second block selects word-combinations (they can also contain names
of articles) and verbal forms, which determine the context in most
cases.
Both these blocks of the language processor, which implement the
analysis of natural language sentences, play an important part in the
system.
In the creation of an on-line encyclopedia the following factors are
important: the quality of the created encyclopedia (it is determined by
the closeness to an existing encyclopedia); the difficulty of the
preparatory stage, including creation and input of the basic materials
(dictionaries, catalogues and others) necessary for system operation;
and the development of a system that learns to discover articles, which
is a very difficult programming task.
Simplification of the second and the third factors can dramatically
decrease the quality. At the same time, an "over-complication" of the
task should be avoided. We follow a scheme in which the development is
conducted in stages: first a simple system is developed, and its
features are subsequently reinforced.

6 Semantic Navigator: Encyclopedia of Keywords

In 2002 the first version of the on-line encyclopedia [2] was released
by Michael M. Charnine. It received the name Encyclopedia of keywords
and is largely based on the methods described above. The Encyclopedia
functions on the web-site www.keywen.com. It constantly grows and at
present contains more than 250000 articles on different subjects in
different languages. The majority of the articles are in English, but
there are also more than 3800 German and 1300 Italian articles. The
Encyclopedia of keywords is widely recognized on the Internet. Daily
several thousand people make free use of its information.
Each article of the Encyclopedia consists of key sentences (phrases).
Each of them contains one or several key words. Such phrases are found
on the Internet by a special semantic navigating program named Keywen
Encyclopedia Bot.
At present the Encyclopedia contains more than 5 million key-phrases.
The major part of the articles begin with a section which gives the
definitions of the terms included in the article title. This makes it
possible to understand quickly what the article is about. If a more
profound study of the given subject is required, it is possible to use
the references to Internet sites. Each phrase in the Encyclopedia is
supplied with such a reference. Each article of the Encyclopedia
contains a list of the most important keywords. For each keyword in an
article there is a section in which examples of phrases containing this
keyword are given.
The knowledge of keywords is necessary for the automatic development of
exact requests to search engines. For example, for the article
Knowledge Discovery a typical structure in the paragraph DEFINITIONS is
given: "Knowledge discovery is the extraction of implicit, previously
unknown and potentially useful knowledge from data". The article
contains references to more specialized articles: Business and
Companies, Magazines and, Organizations, Text Mining, Tools. The
article contains the keywords (with examples of phrases) KNOWLEDGE
DISCOVERY, DATA MINING, INTERNATIONAL CONFERENCE, KDD and others. The
Encyclopedia (Keywen.com) contains an internal search engine which
makes it possible to quickly find all key-phrases and the corresponding
articles containing a given key word. As a result, for any keyword it
is possible to quickly find the application domain corresponding to it.
At the beginning of 2004 a version of the electronic encyclopedia of
the Open Project type was created, entitled "Encyclopedia of key
phrases". In the framework of this project each user of the Internet
can contribute to the development of the Encyclopedia. Each user is
given the facilities to move sections of any article according to their
value and also to enter new phrases into the Encyclopedia.
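
The keyword lookup attributed above to the internal search engine (finding every key phrase, and hence every article, that contains a given key word) can be illustrated with an inverted index. The sketch below is only an assumption about how such a lookup might be organized; it is not the Keywen implementation. The function names, the sample data and the placeholder URLs are hypothetical, and the `normalize` function is a crude stand-in for the morphological block described in Section 5.

```python
from collections import defaultdict


def normalize(word):
    """Crude stand-in for morphological normalization (an assumption):
    strip surrounding punctuation and lowercase the word."""
    return word.strip('.,;:!?"()').lower()


def build_index(phrases):
    """Build an inverted index: normalized keyword -> list of entries.

    phrases is an iterable of (article_title, key_phrase, source_url).
    """
    index = defaultdict(list)
    for article, phrase, url in phrases:
        seen = set()  # index each keyword only once per phrase
        for token in phrase.split():
            key = normalize(token)
            if key and key not in seen:
                seen.add(key)
                index[key].append((article, phrase, url))
    return index


def lookup(index, keyword):
    """Return every (article, phrase, url) entry containing the keyword."""
    return index.get(normalize(keyword), [])


# Hypothetical sample data; the URLs are placeholders, not real sources.
PHRASES = [
    ("Knowledge Discovery",
     "Knowledge discovery is the extraction of implicit, previously "
     "unknown and potentially useful knowledge from data",
     "http://example.org/source-1"),
    ("Text Mining",
     "Text mining extracts knowledge from documents",
     "http://example.org/source-2"),
]
INDEX = build_index(PHRASES)

for article, phrase, url in lookup(INDEX, "KNOWLEDGE"):
    print(article, "|", url)
```

Each entry keeps the reference to its primary source next to the phrase, mirroring the statement that every phrase in the Encyclopedia is supplied with such a reference. Multi-word keywords such as DATA MINING would need a collocation step, which this sketch leaves out.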
For Keywen development, constantly growing multilingual text corpora
automatically extracted from the Internet are used. For each subject
domain and for every supported language a particular text corpus is
formed. The text corpora are analysed by the linguistic processor.
The Keywen NLP pipeline includes:
- a text tokenization module,
- a part-of-speech tagging system,
- a sentence boundary detection tool,
- a collocation identification module,
- a named entity recognizer,
- a word sense disambiguation system,
- a full-syntactic parser.
Extraction of term candidates from domain-oriented texts supports
Automatic Term Recognition resulting in Multilingual terminology.
Reordering the list of extracted candidates is based on the
term/keyword candidate relevance ranking.
Extraction of key phrases and definitions provides Automatic
summarization of domain-oriented texts using the TF/DF measure.
Extraction of key phrases and definitions creates Knowledge-Rich
Contexts; automated pattern acquisition is used for the identification
of semantic relations: associations and family trees, which serve as
the basis for the semantic parser.
There are a number of useful advantages of the Keywen apparatus,
including, but not limited to: the ability to build a large-scale
human-readable and semantic-oriented hierarchy of categories; the
ability to generate dynamic and flexible hierarchical categories; the
ability to accept contributions of users with different qualifications
for improving hierarchical categories; the ability to accept a user's
minimal contributions (as little as one click); the ability to have
multiple ways to categories in the polyhierarchy and at the same time
to have hierarchical/directory paths of the categories.
The Keywen apparatus produces a "concrete", substantially repeatable
result. It generates hierarchical categories that are substantially
repeatable. If users were to perform the described steps on multiple
different occasions using the same inputs (e.g., the same collection of
related terms, the same communication with the input/output module),
the users would achieve the same result on each occasion. The
functionality of the technique has been mathematically proven; the
present apparatus for generating hierarchical categories does not use
any empiric, heuristic, or fuzzy considerations.
The following two basic category systems are currently most popular:
- Hierarchical, as in directories (easier for understanding, planning
  and processing);
- Multi-hierarchical, as in Wiki-encyclopedias (more natural, flexible
  and easy to maintain).
The category structure of Keywen is the product of these two systems:
it has the advantages of both and opens greater possibilities than
either. Both the structure of web-directories and the structure of
Wiki-encyclopedias may be viewed as special cases of the Keywen
Category Structure. The category structure of Keywen is more precise,
logically correct, flexible and dynamic. It is convenient for effective
navigation and fast understanding, and helps:
- to see the BIG PICTURE,
- to divide knowledge into parts and select the most important parts,
- to create effective plans for learning and knowledge processing.
A hierarchy is a form of organizational structure in which each unit
has one and only one "parent" unit, except the "top" unit, which has
none.
A polyhierarchy (multi-hierarchy) is like a hierarchy, but nodes can
have multiple parents. In mathematical terms, a polyhierarchy is
represented by a directed acyclic graph, or a partially ordered set. In
terms of object-oriented methodology, it can be viewed as a class
hierarchy with multiple inheritance.
A directory structure is a particular case of a hierarchical structure
(which is the more general concept). For example, UNIX and DOS have a
hierarchical directory structure that allows files to be organized by
categories. The main difference between a hierarchical and a directory
structure is the different naming convention for categories. The
category names in a directory structure can be full or short (local).
The full category names in a directory usually are equal to their paths
from the Top category. A directory contained inside another directory
is called a subdirectory of that directory. Subdirectories are
specified by concatenating the subdirectory short name to the name of
the directory above it in the hierarchy. Together, the directories form
a hierarchy, or tree structure.
The Keywen Category Structure is a polyhierarchy (multi-hierarchy) that
contains one preferred (primary) hierarchy (tree) which contains all
nodes.
The following new technologies are employed in Keywen:
- One-click Keywen technology and electronic Voting System,
- Keywen search engine with large queries,
- Keywen Writing Service.
These technologies can accelerate the encyclopedia growth and can make
a writer's work most effective.

7 Prospects for the development of Semantic-Focused Systems

The development trends of the "Encyclopedias of keywords" and
"Encyclopedias of key phrases" are determined as follows:
- constant increase of the number of encyclopedia articles in different
  European languages, including Russian, with inter-references between
  the relevant articles in different languages;
- the speed of updating of the Encyclopedia will be increased; old
  articles will be kept in the archive of the Encyclopedia, but fresh
  articles will occupy their place with references to the new phrases
  and new articles from the Internet;
- the Rating of articles self-descriptiveness will be
constructed; for this it is necessary to analyze several million references
contained in Keywen.com: those containing more key phrases related to a given
issue should get a high position in the rating.

Further stages of development are connected with the use of a language
processor.

Stage 1. A system for English and Russian morphological analysis, for
transforming words into normal form. Simplified analysis of sentences to
discover definitions of keywords.

Stage 2. A component for sentence analysis that selects frequently occurring
relevant word combinations.

Stage 3. Means for establishing relations between the relevant objects that
form clauses.

Stage 4. Extension of the notion "meaningful components". Not only words and
word combinations are allowed, but also objects described in documents:
people, addresses, organizations, etc.

Stage 5. Incorporation of the XML-based semantic presentations into the
semantic navigator. The XML file represents a meaningful portrait of a
document (the semantic network structure) comprising all objects and
connections revealed by the Semantix text processor. In this connection, the
organization of XML files has definite scientific value as a means of
presenting the semantic structure of sentences and texts. The transformation
of the semantic network into the XML file is performed by the reverse
linguistic processor. The fragments that represent objects, relations, actions
and sentences in the semantic network structure are mapped onto the
corresponding components of the XML file, which will also contain objects,
relations, actions and sentences.

The basic task of the linguistic processor (LP) is to operate as a separate
module within integrated systems of information collection and processing.
The exchange is conducted through XML files [14]. To that end a reverse LP was
developed, which constructs XML files on the basis of meaningful portraits.

Thus, the input for the LP is a natural language text, and the output is an
XML file in which all chosen objects and connections are represented together
with an indication of their sources. This LP, named Semantix, is provided in
the form of an SDK module. It works under Windows, but it can be recompiled to
work under Linux.

The Semantix Processor is an independent module and can be used without the
mentioned systems for the standard tasks of analytical services. There are
means of tuning it to objects of other types by means of linguistic knowledge
or dictionaries.

Let us give some explanations. Each object has the following structure:

where ID="7" is the identifier of an object and TYPE="Organization" is its
type. The text component corresponding to the object is also given. Relations
between objects and their participation in actions are given through the
REF=... references. For example, with the help of the following construction

where the sentence "one of the blows struck the headquarters of the
oppositional group" is represented. For each object or action, the reference
to the sentence is given. The Semantix processor uses sufficiently universal
XML constructions: one object can include another object (through a
reference). Properties are given as arguments. If necessary, the type of the
attribute is indicated.

For example, in the statement

the year is indicated, etc. An XML file contains a complete set of the
information items necessary for use in different integrated systems.

An example of an XML file is given in Figure 3.

Figure 3. An example of an XML file for the semantic structure presentation.

8 Semantic-Focused Systems

The development of the concepts of the on-line encyclopedia results in more
general systems that discover semantically meaningful information in
documents and build an information-reference system on this basis [1, 4-6].
The method of tuning is the introduction into the system of a new template,
with its positions tied to the components of natural language, or a
change in the existing templates and corresponding