Intelligent Tools for the Semantic Internet Navigator Design


        © Igor Kuznetsov                 © Mikhail Charnine            © Elena Kozerenko
        © Nikolay Somin                 © Vladimir Nikolaev           © Andrey Matskevich
              Institute of Informatics Problems of the Russian Academy of Sciences,
                                               Moscow
      Igor-kuz@mtu-net.ru                 keywen1@mail.ru              kozerenko@mail.ru
         somin@post.ru                 ipiranlab14@yandex.ru


Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced
Methods and Technologies, Digital Collections", RCDL-2012, Pereslavl Zalesskii,
Russia, October 15-18, 2012.

Abstract

    This paper describes the methods and instruments for the design of a
    semantic web navigator, a novel system providing semantic drive for
    Internet users. The proposed solutions rest on the statistical paradigm
    for knowledge extraction and on semantic representations based on the
    Extended Semantic Networks (ESN) mechanism. The approach comprises
    rule-based and stochastic techniques for text processing and for mapping
    the extracted entities and relations onto the structures of the knowledge
    base.
    The work is supported by the Russian Foundation for Basic Research,
    grant 11-06-00476-

1 Introduction
The paper deals with the issues of the design and development of new tools
comprising intelligent methods and systems based on the representation
mechanism of the extended semantic networks (ESN) [1], which has been employed
for the creation of a wide range of knowledge-based systems, and on the
features of the keywords encyclopedia Keywen [2].
    As a result of the tremendous growth of the Internet, its users receive
huge volumes of information in response to their queries. Users are interested
in a great variety of questions, and they make their own attempts to employ
keywords and phrases by trial and error (addressing search engines and
analyzing the answers). This results in tremendous expenditures of labour and
in disappointment because of huge amounts of irrelevant information and/or its
incompleteness.
    Hence, to make optimal queries, one has to face the problem of ordering
the requests that reflect the interests of users and of creating directories
of subjects and articles. It is necessary to create special means which allow
users to find what interests them in the sea of information with reduced
expenditures of labour. On-line encyclopedias play the role of such means.
    The intelligent web navigator comprises the features of the ESN linguistic
processors, which were developed for different classes of information systems
relating to the artificial intelligence research field.
    The core feature of the development is the assignment of a semantic
structure to natural language input. The semantic structure is obtained via
semantic categorization and the establishment of semantic relations between
the concepts presented in the natural language texts contained in the
Internet. Association is the dominant type of semantic relation supported by
the intelligent tools under discussion. The study of the contexts and
co-occurrence of terms and key words makes it possible to shape the semantic
structure of the navigated texts and to perform automatic categorization.
    A hybrid approach is taken for the semantic navigator design: it
incorporates the logical analytical functionality of the intelligent systems
based on the extended semantic networks, statistical methods and machine
learning mechanisms.

2 The ESN Intelligent Systems and their evolution
The intellectual systems developed on the basis of the apparatus of the
extended semantic networks (ESN) [1, 4-6], called the ESN-systems, were
created by an association of developers, including the authors of this
article, at the Institute of Informatics Problems of the Russian Academy of
Sciences over a period of two decades within the framework of research
projects and applied systems oriented at concrete subject areas and customers.
    We single out 4 generations of ESN-systems. The linguistic semantic ideas
laid as the basis of the systems of this class underwent a specific
evolutionary process. Intellectual ESN-systems contain developed knowledge
bases, in which the knowledge is represented in the form of records in the
language of the extended semantic networks, called ESN-structures. Linguistic
knowledge is thus a “special case of knowledge”, and it is also represented
in the form of the
records in the language of the extended semantic networks. The basic
structural element of the ESN is the named N-ary predicate, called a
“fragment”. The whole set of language objects is given in the form of
predicate-argument structures; mechanisms for the presentation of embedded
structures are supported, which gives very powerful means for describing the
objects of different language levels.
    The uniformity of language presentations is a very important factor. In
the process of analysis and synthesis of natural language sentences, a formal
grammatical apparatus similar to the dependency grammar is used. With this
approach the words and constructions which perform the role of predicates in
the sentence are the “support” elements, and the result of the analysis of a
sentence must become one predicate, which corresponds to the predicate of the
sentence in question (i.e., to the basic verb in the tensed form or to another
basic predicate expression). Thus, in the process of analysis, the “action
words” and the “relation words” are processed first, i.e., the verbs and other
words which have syntactic-semantic valences. Examples of a “relation word”
are the words “father”, “friend”, and the like; i.e., in this case a
“relation” is a word which assigns strong, clearly expressed
syntactic-semantic expectations.
    Semantic analysis in the engineering linguistic understanding is the
process of translating natural language expressions into the “internal”
structures of the knowledge base (KB); in our case these “internal” structures
are the records in the ESN language. Thus, a KB structure is the code of sense
in the intellectual information systems. The language engineering solutions
were implemented in the systems with “complete” linguistic analysis, which are
the systems of the 1st and 2nd generations, DIES1, DIES2 and Logos-D [1, 4],
and in the systems with the “factographic” approach, i.e., the intelligent
systems of analytical decisions support (ISADS) [6], where the goal of the
analysis is the extraction of entities and connections from the texts; these
are the systems of the 3rd and 4th generations.
    The ESN systems of the 4th generation perform the tasks of semantic object
(named entity) extraction. The set of the objects to be extracted depends on
the tasks of a user. At the same time, the quality of a linguistic processor
is to a considerable degree determined by its capabilities for this
extraction. The basic types of information objects and connections extracted
by the ESN semantic processors are given below:
• persons (by family name, given name and patronymic - FNP) with their role
features (criminal, victim);
• the verbal description of the persons, their distinctive signs;
• address, posting information attributes;
• date(s) mentioned;
• weapon with its special features;
• telephone numbers, faxes, e-mails with their subsequent standardization;
• the means of transport with the indication of the vehicle type, its state
number, color and other attributes;
• passport data and other documents with their attributes;
• explosives and narcotic substances;
• organizations, positions;
• quantitative characteristics (how many persons or other objects participated
in an event);
• the numbers of accounts, sums of money with the indication of the currency
type;
• terrorist groups and organizations;
• participants of terrorist groups with the indication of their roles (leader,
head of, etc.);
• the armed forces assigned for antiterrorist combat (Military_.Force);
• event (criminal, terrorist, biographical, and so on) with the indication of
the information objects participating in them;
• time and the place of events;
• the connection between different types of information objects (with whom a
person works in an organization or lives at the same address, in what events
the person participated together with other objects, etc.).
    For extracting objects, all versions of an object name, including the
brief forms possible in the text, were considered. Standard objects (names,
dates, addresses, the forms of weapon and others) are reduced to one
(standard) form.
    The identification of objects is performed taking into account brief
designations (for example, separate surnames, patronymics, initials),
anaphoric references (demonstrative and personal pronouns, for example, "this
person", "it..."), and definitions and explanations (for example, "the mayor
of Moscow Sobianin" is identified with the subsequent words "mayor",
"Sobianin").
    For the extraction of events and connections, the analysis of verbal forms
and of participial and adverbial constructions is carried out. An important
task is the identification of objects throughout the entire text, using
demonstrative pronouns, brief names and anaphoric references for this purpose.
    Taking into account these difficulties and in accordance with the tasks,
the linguistic processor Semantix was developed, which performs the
normalization of words, their grouping into units, the identification of
objects and the establishment of connections. As a result, a semantic network
called the meaningful document portrait is constructed automatically for each
NL document. These portraits are the knowledge structures of the knowledge
base which serve as the basis for implementing different forms of semantic
search: the search by features and connections, the search for the objects
connected at different levels, the search for similar figurants and incidents,
and the search by distinctive signs (with the use of ontologies).
    The extraction of connections is not limited to the deep analysis of
verbal and other forms. Many connections are given by default. For example, in
the summaries of incidents, the names of figurants are, as a rule, followed by
their data without any indication of whom the data belong to and with
additional text insertions.
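
The search for the objects connected at different levels, one of the forms of semantic search listed above, can be pictured as a breadth-first traversal of the document portrait. The following is only a minimal sketch under stated assumptions: a portrait is reduced here to (object, relation, object) triples, and the names `build_portrait` and `connected_within`, like the sample data, are hypothetical and are not taken from the Semantix processor.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

# Assumption: an extracted connection is an (object, relation, object) triple.
Connection = Tuple[str, str, str]


def build_portrait(connections: Iterable[Connection]) -> Dict[str, Set[str]]:
    """Build an undirected graph of objects from the extracted connections."""
    graph: Dict[str, Set[str]] = {}
    for left, _relation, right in connections:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def connected_within(graph: Dict[str, Set[str]], start: str, max_level: int) -> Dict[str, int]:
    """Return every object reachable from `start` in at most `max_level` steps,
    mapped to the level (number of connections) at which it is reached."""
    levels = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if levels[node] == max_level:
            continue  # do not expand beyond the requested level
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in levels:
                levels[neighbour] = levels[node] + 1
                queue.append(neighbour)
    del levels[start]
    return levels


# Hypothetical sample data: two persons share an organization,
# two persons share an address.
portrait = build_portrait([
    ("Ivanov", "works_at", "Org A"),
    ("Petrov", "works_at", "Org A"),
    ("Petrov", "lives_at", "Address 1"),
    ("Sidorov", "lives_at", "Address 1"),
])
print(connected_within(portrait, "Ivanov", 2))  # {'Org A': 1, 'Petrov': 2}
```

In this toy portrait "Ivanov" and "Sidorov" are connected only at the fourth level, through a shared organization and a shared address. The relation labels are kept in the triples but ignored by the traversal; a search by features would filter on them.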


To recover such default connections, a directed search for the connected
objects, i.e., the restoration of connections and default data, is organized
in the Semantix processor.
    Special processes are organized in order to connect persons with their
place of stay or place of work, with the vehicles which belong to them, and so
forth. For example, the analysis of the summaries of incidents is performed as
follows. For a number of objects (address, telephone, date of birth, etc.) a
virtual connection is built with other objects (names, organizations) that are
as yet unidentified. Then, at the same level of processing, these objects are
searched for with the aid of special identification rules. These rules
indicate the direction of the search, the permissible number of steps, and the
signs of the words and punctuation marks at which the search ends. Special
filters are also required here, in order not to pick up and connect an alien
object.
    This approach showed sufficiently good results in the system Criminal
[11]. The special features of natural language are taken into account, where
the same actions are expressed with the aid of verbs, verbal nouns and
participial constructions. Presented in ESN, they are reduced to one form,
i.e., a complex object.

    1. The analysis of the texts
    2. Singling out the basic concepts and characteristics
    3. Constructing a subject area vocabulary founded on the basic “world model”
       (The basic “world model” and the language model)
    4. Establishment of the type-kind relations between SA notions
    5. Formulation of the situational rules in the form of IF… THEN rules

Figure 1 The flowchart of conceptual linguistic modeling
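
The identification rules described above (a search direction, a permissible number of steps, signs at which the search ends, and a filter against alien objects) can be illustrated with a short sketch. It is not the rule format of the Semantix processor, which the text does not give: the classes `Token` and `IdentificationRule`, the function `find_owner` and the sample summary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass
class Token:
    text: str
    kind: Optional[str] = None  # e.g. "PERSON" or "PHONE"; None for plain words


@dataclass
class IdentificationRule:
    target_kind: str   # kind of object the datum should be attached to
    direction: int     # -1: search to the left, +1: search to the right
    max_steps: int     # permissible number of steps
    stop_marks: FrozenSet[str] = frozenset({".", ";"})    # signs where the search ends
    accept: Callable[[Token], bool] = lambda token: True  # filter against alien objects


def find_owner(tokens: List[Token], start: int, rule: IdentificationRule) -> Optional[int]:
    """Return the index of the object that the datum at `start` belongs to, or None."""
    position = start + rule.direction
    steps = 0
    while 0 <= position < len(tokens) and steps < rule.max_steps:
        token = tokens[position]
        if token.text in rule.stop_marks:
            return None          # a stop sign ends the search
        if token.kind == rule.target_kind and rule.accept(token):
            return position      # the virtual connection is resolved
        position += rule.direction
        steps += 1
    return None


# Hypothetical incident summary: each name is followed by its data.
summary = [
    Token("Ivanov", "PERSON"), Token(","), Token("tel."), Token("123-45-67", "PHONE"),
    Token("."),
    Token("Petrov", "PERSON"), Token(","), Token("tel."), Token("765-43-21", "PHONE"),
]
look_left = IdentificationRule(target_kind="PERSON", direction=-1, max_steps=5)
print(find_owner(summary, 3, look_left))  # 0: the first phone belongs to "Ivanov"
print(find_owner(summary, 8, look_left))  # 5: the second phone belongs to "Petrov"
```

The full stop between the two records keeps the second telephone number from being attached to the first person even if "Petrov" is rejected by a filter, which is the role the text assigns to stop signs and filters.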
    Forms with verbal nouns can also be components of verbal forms. By
analogy, in ESN some objects can be components of others. The cause-and-effect
and temporal dependences between actions, events, etc. are also represented;
they reflect the logical connection of sentences that is assigned explicitly
with the aid of the words “therefore”, “then”, etc.
    The quality of a linguistic processor is determined by a number of
factors. The first is its capability to isolate objects and connections, i.e.,
the types of the objects being isolated and their number. The Semantix
processor identifies up to 40 types of objects, including very complex ones,
which correspond to actions and events. As the number of types grows,
additional difficulties appear that are connected with collisions of the
extraction rules: some rules can seize words which relate to other objects and
are extracted by other rules.
    It therefore becomes important to consider the order in which the rules
are applied, including the rules of identification. The second important
factor is the selectivity of the rules and procedures of identification, i.e.,
the factor of noise and losses. By noise we mean the presence of excessive
words in the objects. Losses are the situations when an object is not revealed
or is revealed only partially: the text contains words which did not enter
into the object. In the Semantix processor the rules are arranged in such a
way that they ensure a high degree of selectivity and the minimization of
noise and losses with a large number of objects being selected.

3 Conceptual linguistic simulation
Conceptual linguistic simulation (CLS) is the process of constructing a
natural language model of a subject area (SA) (Fig. 1) that synthesizes the
approaches of conceptual and linguistic simulation [4-6].
    The construction of the conceptual linguistic model of a certain subject
area is subdivided into the following stages:
• the construction of the conceptual model proper, i.e., the ramification of
fundamental notions, their organization in kind-type trees and the
determination of the connections between them;
• the development of the ideographic dictionary for the subject area, i.e.,
the lexical population of the conceptual model;
• the introduction of the base rules, which describe "the model of the world"
in the natural language relevant for the subject area.
    The procedure of conceptual-linguistic simulation on the basis of the ESN
apparatus is based on the following principles:
• the model must be "open", i.e., support an effective mechanism of expansion
and information update;
• the model of the “sense” presentation should consider the facts of
extra-linguistic reality, which in the form of rules and relations compose a
certain basic "world model" and the concrete models of subject areas;
• the model should be practical, i.e., not overloaded by detailed descriptions
of connections and relations between the concepts, in order to ensure the
possibility of its realization, but at the same time it should reflect the
information relevant for specific objectives.
    A realistic approach to the formulation of the problem dictates the need
to limit it to a domain-oriented subset of a natural language. The essence of
the limitations consists in the following: first, the analyzed text materials
contain expert knowledge from particular subject areas (we developed systems
for such subject areas as the diagnostics of failures in microcircuit
production, forecasting in the social sphere, criminology, and others);
second, for the purposes of the
maximum possible elimination of ambiguity, the dictionary is built according
to the modular principle: there is a certain most general common part (1-2
levels), which is completed by special dictionaries for each particular
subject area.
    The proposed model of lexical semantics is based on the principle of the
"nuclear" meaning realized in the context of the given subject area, with the
subsequent inductive supplementation of other meanings (if they are actualized
in the contexts in question). A taxonomy is also used, which is realized in
the form of hierarchical trees of word classes. The general "world model" of
the system serves as the basis for the subject area models.
    The classes of words are subdivided into concepts/names, relations,
actions, properties, characteristics of actions, and time and place locatives.
The most general notion is “concept”, or the universal class, which is
subdivided into object, situation, process and others. The words which relate
to the classes of actions and relations are represented as semantic-syntactic
frames, which determine the predicate-argument structures (government model).
    However, in the described approach (let us name it the ESN-approach) the
range of argument values is substantially extended. The extension consists in
the fact that the role of arguments can be played by simple objects
corresponding to individual words and by structural objects which represent
word combinations, phrases and clauses, and the concept of "case" includes not
only semantic but also syntactic aspects. The approach based on ESN allows an
arbitrary level of structure embedding to be reflected; this makes it possible
to reflect the structural nature of lexical semantics, which in this model has
a hierarchical network structure.
    Linguistic knowledge is represented in the system dictionary and in the
declarative modules of the linguistic processor. The ESN systems also
implement, on the basis of the initial linguistic information, the function of
a dynamically formed semantic dictionary, which is expanded automatically by
the system in the course of processing concrete texts.
    Fig. 2 shows the “internal” description of a verb in the semantic
dictionary. This dictionary is automatically generated by the ESN-systems
DIES2, LOGOS-D and IKS in the course of natural language text processing.

    {(                895__)(DICSEM)
    COORD(PROGNOZ1,RUS,                       895__,S5
    0_31_51_20,%)      SUB(UNIV,0+)      SUB(UNIV,1+)
    SUB(UNIV,2+)
                  (0-,1-,2-/3+) INFI(3-)           (3-)
               (3-/4+) FUT1(4-) SUB(       ,5+)

Figure 2 An example of the presentation of the verb vyrabatyvat’ - “to
manufacture” in the semantic dictionary.

4 General Considerations for Encyclopedia Design
Encyclopedias have traditionally played an important part in the study of new
material. However, their creation in electronic form is a huge work, which
requires not simply entering the appropriate material into a computer but also
its additional ordering: the creation of subject directories for the
allocation of main classes and subclasses, the definition of main notions, and
the building of hyper-references for linking the entries (articles) of the
encyclopedia with one another, as well as of references to primary sources.
What should also be considered is the dynamism of the information circulating
in the Internet: the emergence of new information sources, which should be
taken into account in encyclopedias.
    At present the majority of large electronic encyclopedias operating
on-line have been created on the basis of the printed materials of universal
encyclopedias: the Great Soviet Encyclopedia, Britannica, Big Brockhaus, Big
Larousse and others. The creation of such encyclopedias requires considerable
human labour.
    The above leads us to the conclusion that the global problem in the
present situation is the development of methods and software for the
automation of the most labour-consuming stages of the formation of on-line
Internet encyclopedias.
    Such formation requires elements of intellectual activity: the choice of
the subject for description, the formation of articles (entries) and their
names, the search for definitions, etc. The development of the concept of an
on-line encyclopedia results in reference systems of a more general kind,
which provide the collection of information and a systematized representation
of knowledge about different objects of interest to the user: politicians and
persons of science and culture; organizations and companies; events (for
example, strikes, their reasons, place and time); goods and objects of a
particular class (for example, fuel, mining, region); and others. While
building such systems, many common problems appear that are also vital for an
on-line encyclopedia.
    The only difference is that instead of articles and their names there
would be other objects.
    At present the solution of the discussed problems becomes feasible because
many systems and facilities have been designed and developed in the areas
connected with creating different classes of intelligent systems, language
processors, knowledge bases, and the statistical processing of language
components [1-14].
    The given work is based on the experience of the creation of the on-line
encyclopedia and is devoted to the principal directions in the development of
solution methods for the mentioned problem.

5 Special Features of Automation
In general, the problem looks like this. The input comprises a stream of
documents from the Internet (all relating to a determined application domain).
The
output is an electronic encyclopedia consisting of brief            conducted in stages: first a simple system is developed
articles with names, with hyper-references between                  with subsequent enhancement of its features.
articles (if the names of other articles are encountered in
the text) and with hyper-references to the primary sources,         6 Semantic Navigator: Encyclopedia of
i.e., the documents from the Internet.                              Keywords
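
The output structure described above (brief named articles, hyper-references between articles whose names are met in the text, and hyper-references to primary sources) can be sketched as a small data structure. This is an illustration only: the class `Article`, the function `link_articles`, the whole-word matching of names and the `example.org` addresses are assumptions and do not describe how Keywen is actually built.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    name: str
    text: str
    source_url: str  # hyper-reference to the primary source (placeholder URL)
    see_also: List[str] = field(default_factory=list)  # hyper-references to other articles


def link_articles(articles: List[Article]) -> None:
    """Fill `see_also` with the names of the other articles met in each text."""
    for article in articles:
        article.see_also = sorted(
            other.name
            for other in articles
            if other is not article
            # Assumption: a name counts as met if it occurs as a whole word,
            # ignoring case.
            and re.search(r"\b" + re.escape(other.name) + r"\b", article.text, re.IGNORECASE)
        )


# Hypothetical sample encyclopedia for a medical application domain.
encyclopedia = [
    Article("Influenza", "Influenza is a viral disease; fever is a common symptom.",
            "http://example.org/doc1"),
    Article("Fever", "Fever is a rise of body temperature.", "http://example.org/doc2"),
]
link_articles(encyclopedia)
print(encyclopedia[0].see_also)  # ['Fever']
```

Whole-word matching is the simplest possible choice; it misses inflected forms and synonyms, which a linguistic processor with word normalization would be expected to handle.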
    In addition an electronic encyclopedia should
include the main menu, article sections, various                    In 2002 the first version of the on-line encyclopedia [2]
classifiers and the internal search system, providing               was released by Michael M. Charnine, having received
quick access to the concrete subjects that make up the              the name Encyclopedia of keywords; it is largely based on
application domain. Certainly, it is not possible to                the methods described above. The Encyclopedia
automate all of these processes.                                    functions on the web-site: www.keywen.com. It
    Formation of the main menu, subjects and query                  constantly grows and at present contains more than
facilities is done manually. The computer can help with             250000 articles on different subjects in different
the selection of material for articles and the choice of            languages. The majority of the articles are in English, but
their meaningful components.                                        there are also more than 3800 German and 1300 Italian
    Two stages are distinguished: training and                      articles. The Encyclopedia of keywords is universally
operation. At the training stage a training sample                  recognized on the Internet. Several thousand people
(documents from the Internet) is given to the system with           use its information free of charge every day.
an indication of the articles which the system should select.           Each article of the Encyclopedia consists of key
    For example, these can be types of diseases, symptoms,          sentences (or phrases). Each of them contains one or
descriptive texts relating to, say, the prevention of               several key words. Such phrases are found on the Internet
diseases, and others. The system                                    by a special semantic navigating program, which is
should develop decision rules that allow such articles              named Keywen Encyclopedia Bot.
to be selected from other documents at the stage of                     At present the Encyclopedia contains more than 5
operation.                                                          million key-phrases. The major part of the articles of
    Such rules are founded on statistical processing with           the Encyclopedia begin with a section in which the
discovery of keywords and standard contexts                         definitions of the terms included in the article title are
(meaningful components), providing selection of                     given. This allows the reader to understand quickly what the
articles.                                                           article is about. If a more profound study of the given
    The training stage makes it possible to partly or               subject is required, it is possible to use the references to
completely automate the activity of a developer in                  Internet sites. Each phrase is supplied with such a
discovering the data necessary for system operation.                reference in the Encyclopedia. Each article of the Encyclopedia
Discovery of keywords and of contexts requires the use of           contains a list of the most important keywords. For each
morphological and semantic blocks of analysis of natural            keyword in an article there is a section in which
language (NL).
    The first block converts word forms                             examples of phrases, containing this keyword are given.
    (e.g. TABLE, of TABLE, to TABLE)                                    The knowledge of keywords is necessary for
into a uniform type (TABLE) and is particularly                     automatic formulation of exact queries to search
important for languages where words have a                          engines. For example, for the article Knowledge
system of cases and other morphological information,                Discovery a typical structure in the paragraph
as, for example, the Russian language.                              DEFINITIONS is given: "Knowledge discovery is the
Without such transformation the search in documents                 extraction of implicit, previously unknown and
for the same components becomes extremely difficult.                potentially useful knowledge from data". An article
    The second block selects word-combinations (they                contains references to more specialized articles:
can also be with names of articles) and verbal forms,               Business      and      Companies,       Magazines      and,
that determine context in most cases.                               Organizations, Text Mining, Tools. An article contains
    Both these blocks of the language processor,                    keywords (with examples of phrases) KNOWLEDGE
which implement the analysis of natural language sentences,         DISCOVERY, DATA MINING, INTERNATIONAL
play an important part in the system.                               CONFERENCE, KDD and others. The Encyclopedia
In the creation of an on-line encyclopedia the following            (Keywen.com) contains an internal search engine that
factors are important: the quality of the created                   makes it possible to quickly find all key-phrases and the
encyclopedia (determined by its closeness to an existing            articles containing a given key word. As a result, for
encyclopedia); the difficulty of the preparatory stage,             any keyword it is possible to quickly find the application
including the creation and input of basic materials                 domain corresponding to it. At the beginning of 2004 a
(dictionaries, catalogues and others) necessary for                 version of the electronic encyclopedia of the
system operation; and the development of system                     Open Project type was created, entitled "Encyclopedia of key
training for article discovery, which is a very difficult           phrases". In the framework of this project each user of
programming task.                                                   the Internet can contribute to the
    Simplification of the second and the third factors              development of the Encyclopedia. Each user is given the
can dramatically decrease the quality. At the same time,            facilities to move sections of any article according to
an "over-complication" of the task should be avoided.               their value and also to enter new phrases in the
    We follow a scheme in which the development is                  Encyclopedia.
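
The internal search described above (finding all key-phrases that contain a given key word) can be illustrated by a minimal Python sketch. It is not the Keywen implementation: the function name, the data layout, the second sample phrase and the example.org references are assumptions introduced only for this illustration.

```python
from collections import defaultdict


def build_keyword_index(phrases, keywords):
    """Map each keyword to the (phrase, source) pairs whose text contains it.

    Matching is a plain case-insensitive substring test; a real system would
    first normalize word forms with a morphological analyzer.
    """
    index = defaultdict(list)
    for phrase, source_url in phrases:
        lowered = phrase.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                index[keyword].append((phrase, source_url))
    return index


# Each key phrase is stored together with a reference to its primary source.
phrases = [
    ("Knowledge discovery is the extraction of implicit, previously unknown "
     "and potentially useful knowledge from data",
     "http://example.org/kdd"),          # placeholder source reference
    ("Data mining is one step of knowledge discovery",   # invented phrase
     "http://example.org/dm"),           # placeholder source reference
]
keywords = ["KNOWLEDGE DISCOVERY", "DATA MINING"]

index = build_keyword_index(phrases, keywords)
for keyword in keywords:
    print(keyword, "->", len(index[keyword]), "phrase(s)")
```

Storing the source reference next to every phrase mirrors the statement that each phrase in the Encyclopedia is supplied with a reference to an Internet site.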


    For Keywen development constantly growing                      precise, logically correct, flexible and dynamic. It is
multilingual text corpora automatically extracted from the         convenient for effective navigation and fast
Internet are used. For each subject domain and for every           understanding, and helps:
supported language a particular text corpus is formed.                 - to see the BIG PICTURE,
The text corpora are analysed by the linguistic                        - to divide knowledge into parts and select the most
processor.                                                         important parts,
    Keywen NLP pipeline includes:                                      - to create effective plans for learning and
      a text tokenization module,                                  knowledge processing.
      a part-of-speech tagging system,                                 Hierarchy is a form of organizational structure in
      a sentence boundary detection tool,                          which each unit has one and only one "parent" unit,
      a collocation identification module,                         except the "top" unit, which has none.
      a named entity recognizer,                                       A Polyhierarchy (multi-hierarchy) is like a
      a word sense disambiguation system,                          hierarchy, but nodes can have multiple parents. In
      a full-syntactic parser.                                     mathematical terms, polyhierarchy is represented by a
      Extraction of term candidates from domain-                   directed acyclic graph, or a partially ordered set. In
oriented texts supports Automatic Term Recognition,                terms of object-oriented methodology, it can be viewed
resulting in multilingual terminology.                             as a class hierarchy with multiple inheritance.
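
The distinction between a hierarchy and a polyhierarchy drawn in this section can be made concrete with a small Python sketch. It assumes Python 3.9 or later (for the standard graphlib module); the function names and the second parent MIDDLE EAST are hypothetical and serve only as an illustration.

```python
from graphlib import CycleError, TopologicalSorter

# child -> set of parents; a node of a polyhierarchy may have several parents
parents = {
    "MESOPOTAMIA": {"IRON AGE", "MIDDLE EAST"},   # second parent is illustrative
    "IRON AGE": {"PREHISTORY"},
    "MIDDLE EAST": {"WORLD"},
    "PREHISTORY": {"HISTORY"},
    "HISTORY": {"SOCIETY"},
}


def is_acyclic(parents):
    """A polyhierarchy must be a directed acyclic graph."""
    try:
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def is_strict_hierarchy(parents):
    """In a strict hierarchy every unit has at most one parent."""
    return all(len(p) <= 1 for p in parents.values())


def ancestors(node, parents):
    """All categories reachable from `node` by following parent links."""
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(is_acyclic(parents), is_strict_hierarchy(parents))
print(sorted(ancestors("MESOPOTAMIA", parents)))
```

The ancestor relation computed here is the partial order mentioned in the text: one category precedes another if it can be reached by following parent links.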
      Reordering of the list of extracted candidates is based          Directory structure is a particular case of
on the term/keyword candidate relevance ranking.                   hierarchical structure (which is the more general concept). For
      Extraction of key phrases and definitions provides           example, UNIX and DOS have a hierarchical directory
automatic summarization of domain-oriented texts                   structure that allows files to be organized by categories.
using the TF/DF measure.                                           The main difference between hierarchical and directory
      Extraction of key phrases and definitions creates            structure is the different naming convention for categories.
Knowledge-Rich Contexts; automated pattern                             The category names in a directory structure can be full
acquisition is used for the identification of semantic             or short (local). The full category names in a directory
relations: associations and family trees, which serve as the       are usually equal to their paths from the Top category. A
basis for the semantic parser. There are a number of useful        directory contained inside another directory is called a
advantages of the Keywen apparatus, including, but not             subdirectory of that directory. Subdirectories are
limited to: the ability to build large scale human-                specified by concatenating the subdirectory short name
readable and semantic-oriented hierarchy of categories;            to the name of the directory above it in the hierarchy.
the ability to generate dynamical and flexible                     Together, the directories form a hierarchy, or tree
hierarchical categories; the ability to accept                     structure.
contributions of users with different qualification for                Keywen Category Structure is a polyhierarchy
improving hierarchical categories; the ability to accept           (multi-hierarchy) that contains one preferred (primary)
user's minimal contributions (as little as one click); the         hierarchy (tree) which contains all nodes.
ability to have multiple ways to categories in the                     The following new technologies are employed in
polyhierarchy and at the same time to have                         Keywen:
hierarchical/directory paths of the categories.                        - One-click Keywen technology and electronic
    The Keywen apparatus produces a "concrete"                     Voting System,
substantially repeatable result. It generates hierarchical             - Keywen search engine with large queries,
categories that are substantially repeatable. If users                 - Keywen Writing Service.
were to perform the claimed steps on multiple different                These technologies can accelerate the encyclopedia
occasions using the same inputs (e.g., the same                    growth and can make a writer's work most effective.
collection of related terms, the same communication
with input/output module), the users would achieve the
same result on each occasion. The functionality of the             7 Prospects for the development of
technique has been mathematically proven; the present              Semantic-Focused Systems
apparatus for generating hierarchical categories does not
use any empirical, heuristic, or fuzzy considerations.             The development trends of the "Encyclopedias of keywords"
     The following two basic category systems are                  and "Encyclopedias of key phrases" are determined as
currently most popular:                                            follows:
    - Hierarchical, as in directories (easier for                  - constant increase of the encyclopedia articles number
understanding, planning and processing);                           in different European languages, including Russian,
    - Multi-hierarchical, as in Wiki-encyclopedias (more           inter-referenced between the relevant articles in
natural, flexible and easy to maintain).                           different languages;
    The category structure of Keywen is the product of             - the speed of updating of Encyclopedia will be
these two systems: it has advantages of both and opens             increased; old articles will be kept in the archive of
greater possibilities than either. Both the structure of           Encyclopedia, but fresh articles will occupy their place
web-directories and the structure of Wiki-encyclopedias            with references to the new phrases and new articles
may be viewed as special cases of the Keywen Category              from the Internet;
Structure. The category structure of Keywen is more                - the Rating of articles self-descriptiveness will be


constructed; for this it is necessary to analyze several             </OBJECT>
million references contained in Keywen.com: those                  where ID="7" – is an identification of an object, the
containing more key phrases to a given issue, should get           TYPE="Organization" is its type. The text component
high position in the rating.                                       corresponding to the object is also given. Objects
    Further stages of development are connected with               relations and their participation in the actions are given
the use of language processor.                                     through the REF=... references. For example, with the
 Stage 1. The system for English and Russian                       help of the following construction
morphological analysis - for transformation of words                 <ACTION ID="15" TYPE="Blow">
into normal form. Simplistic analysis of sentences for                <ARG CONST="At" />
discovery of definitions on keywords.                                 <ARG REF="7" />
Stage 2. The component for analysis of sentences with                </ACTION>
selection of often met relevant word-combinations.                     where the sentence "one of the blows struck the
Stage 3. Means for establishment of relations between              headquarters of the oppositional group" is represented.
relevant objects that form the clauses.                            For each object or action the reference to the sentence is
Stage 4. Extension of the notion "meaningful                       given. The Semantix processor uses sufficiently
components".                                                       universal constructions of XML- file: one object
Not only words and word-combinations are allowed,                  (through the reference) can include another object.
but also objects described in documents: people,                   Properties are given as arguments. If necessary the type
addresses, organizations, etc.                                     of attribute is indicated.
Stage 5. Incorporation of the XML-based semantic                       For example, in the statement
presentations into the semantic navigator. In the XML
                                                                       <ATTR TYPE="YEAR" VALUE="2003"/>
file a meaningful portrait of a document (the semantic
network structure) is represented comprising all objects               the year is indicated, etc. An XML file has a
and connections, revealed by the Semantix text                     complete set of information items necessary for the use
processor. In connection with this the organization of             in different integrated systems.
XML files has the definite scientific value as the means               An example of XML file is given in Figure 3.
for presentation of the semantic structure of sentences
and texts. The transformation of the semantic network
into the XML file is ensured with the aid of the reverse
linguistic processor. In this case the fragments which
present objects, relations, actions and sentences in the
semantic network structure are mapped onto the
appropriate components of the XML file which will
also contain objects, relations, actions and sentences.
    The LP is used mainly as a separate module within
integrated systems of information collection and
processing. Data exchange is conducted through XML
files [14]. To that end a reverse LP was developed,
which constructs XML files on the basis of meaningful
portraits.
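
How a reverse processor of this kind might serialize a meaningful portrait can be sketched in Python with the standard xml.etree.ElementTree module. The element and attribute names OBJECT, ACTION, ARG, SOURCE, ID, TYPE, CONST and REF follow the examples given in this paper; the root element DOCUMENT, the dictionary layout of the input and the function name are assumptions made for this sketch, which does not reproduce the Semantix implementation.

```python
import xml.etree.ElementTree as ET


def build_portrait(objects, actions):
    """Serialize objects and actions of a semantic network into an XML tree."""
    root = ET.Element("DOCUMENT")  # hypothetical root element name
    for obj in objects:
        node = ET.SubElement(root, "OBJECT", ID=str(obj["id"]), TYPE=obj["type"])
        for const in obj.get("consts", []):
            ET.SubElement(node, "ARG", CONST=const)
        if "source" in obj:
            # text fragment of the document that corresponds to the object
            ET.SubElement(node, "SOURCE").text = obj["source"]
    for act in actions:
        node = ET.SubElement(root, "ACTION", ID=str(act["id"]), TYPE=act["type"])
        for const in act.get("consts", []):
            ET.SubElement(node, "ARG", CONST=const)
        for ref in act.get("refs", []):
            # participation of an object in the action is given by reference
            ET.SubElement(node, "ARG", REF=str(ref))
    return root


objects = [{
    "id": 7,
    "type": "Organization",
    "consts": ["Headquarters", "Residence"],
    "source": "Headquarters residence of the opposing group",
}]
actions = [{"id": 15, "type": "Blow", "consts": ["At"], "refs": [7]}]

root = build_portrait(objects, actions)
print(ET.tostring(root, encoding="unicode"))
```

Because the action refers to the object only through REF="7", the same object can take part in any number of actions without being repeated in the file.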
    Thus, the input for the linguistic processor (LP) is a
natural language text, and the output is an XML file,
where all chosen objects and connections with the                  Figure 3. An example of XML file for the semantic
indication of sources are represented. This LP, named              structure presentation.
Semantix, is provided in the form of an SDK module. It
works under Windows, but it can be recompiled to                   8 Semantic-Focused Systems
work under Linux.
                                                                   The development of concepts of on-line encyclopedia
    The Semantix Processor is an independent module                results in more general systems providing discovery of
and it can be used without the mentioned systems for               semantically meaningful information from documents,
the standard tasks of analytical services. There are               and building on this base an information-reference
means of tuning to the objects of other types - due to the         system [1, 4-6]. The method of tuning - introduction
linguistic knowledge or the dictionaries.                          into the system of a new template with the tying of its
    Let us give some explanations. Each object has the             positions to the components of natural language, or a
following structure:                                               change in the existing templates and corresponding
  <OBJECT ID="7" TYPE="Organization">                              linguistic knowledge. At present this system is created
   <ARG CONST="Headquarters" />                                    on the basis of logical-analytical crime detection system
   <ARG CONST="Residence" />                                       ANALYST, and the linguistic processor Semantix
   <SOURCE> Headquarters residence of the opposing                 using the knowledge base and the semantics- oriented
group</SOURCE>                                                     linguistic processor for the tasks of the automatic


formalization of text information, answer to the queries             hybrid approaches comprising hand-made rules and
in free form, etc. [ 4-6 ].                                          statistical means for rapid correction and fine
    More than 40 different types of objects are                      adjustment of linguistic knowledge. In our systems
supported by the Semantix processor. The subject areas               there is an entire complex of such means which ensure
represented in the text documents are as follows.                    rapid tuning to the applications (including the
Documents about terrorism in the Russian language.                   introduction of new objects and connections) taking into
The analysis of the documents, in which the discussion               account the demands of customers.
deals with the terrorist acts and the groups. This feature               Such systems have much in common with the
supports the extraction of 40 types of objects, their                system of electronic encyclopedia construction. The
connections and the degree of participation in the                   significant information corresponds to the names of the
criminal actions. Documents about terrorists in the                  articles of the encyclopedia. Templates are the variety
English language. The objects and links include persons              of schemes, on which are constructed the articles of the
(their family name, name, patronymic – FNP), posts,                  encyclopedia. The layout of the material in accordance
organizations, terrorist groups, instruments of crime,               with the scheme is required, as well as taxonomic
time and place of events and so forth, and also                      formation of hyper-references.
connection with and participation in the actions.
                                                                         A backbone instrument for semantic categorization
- Summaries of incidents. Extraction is ensured                      is the employment of hierarchy. Hierarchy is a form of
of persons involved, their connections, organizations, dates,        organizational structure in which each unit has one and
documents, numbers of bank accounts, details of                      only one "parent" unit, except the "top" unit, which has
weapons, etc. with the indication of their participation             none.
in particular criminal actions.
                                                                         A Polyhierarchy (multi-hierarchy) is like a
- Indictments, information about                                     hierarchy, but nodes can have multiple parents. In
criminal cases. Objects are identified throughout the                mathematical terms, polyhierarchy is represented by a
entire text. Their connections and criminal actions are              directed acyclic graph, or a partially ordered set. In
revealed.                                                            terms of object-oriented methodology, it can be viewed
- Government communications, media issues.                           as class hierarchy with multiple inheritance.
Persons, dates, organizations, positions and other
significant information and also connections and                         Directory structure is a particular case of
participation in the actions are selected.                           hierarchical structure (that is more general concept).
                                                                     The main difference between hierarchical and directory
- Autobiographies in the Russian and English
                                                                     structure is different naming convention for categories.
languages. From the resumes all attributes of people,
periods of time and place of their work, studies,                        The category names in directory structure can be full
language proficiency and so forth are extracted.                     or short (local). The full category names in a directory
- Autobiographies in English. From the English                       usually are equal to their paths from Top category. A
language resumes all attributes of people, periods of                directory contained inside another directory is called a
time and place of their work, studies, language                      subdirectory of that directory. Subdirectories are
proficiency and so forth are extracted.                              specified by concatenating the subdirectory short name
- Documents of media issues in English. From the                     to the name of the directory above it in the hierarchy.
English language texts the persons mentioned in media                Together, the directories form a hierarchy, or tree
issues, positions, organizations, dates, terrorist and anti-         structure.
terrorist groups, weapons, events, their time and place,                 Keywen Category Structure is a polyhierarchy
different connections and other features are extracted.              (multi-hierarchy) that contains one preferred (primary)
    In the processors of the Semantix, Lingua-Master,                hierarchy (tree) which contains all nodes.
“Criminal” systems up to 40 types of objects are                         The method for generating hierarchical categories
extracted with high accuracy and minimum noise. For                  from a collection of related terms contains the following
example, the system "Criminal" was verified on about                 steps:
500 thousand incidents from the summaries of the Moscow                  (a) A huge collection of related terms is
Criminal Police Department, and on the basic objects                 accumulated;
showed unique results: the coefficient of noise (i.e.
excessive words in the objects) is not more than 1-2%                    (b) Information about the relationships of any term is
and losses are not more than 1%. The Semantix                        communicated to users (and agents);
Processor was tuned on a smaller quantity of documents                   (c) Users select multiple parent categories for each
dealing with terrorist activity, and therefore there                 term among its relatives;
can be more noise and losses in it. But this can be                      (d) Many parent-child relationships are accumulated
quickly fixed. The fact is that it is impossible to consider         and form a directed graph; and
everything which can be encountered in NL texts.
                                                                         (e) A variety of hierarchical structures is constructed
    Therefore, in the first place, the representative                from the combined directed graphs of different users.
collections of test documents are extremely important,
                                                                         The last step (e) contains sub steps of:
and in the second place, the means of fixing or tuning of
linguistic processors are as follows: the employment of


    (e1) Directed graphs of different users are combined           9 Conclusions
together according to user contribution ranks so that
better ranked users have the priority in the selection of              Thus by semantic navigation we mean semantic
parents for particular term; and                                   analysis and search for the relevant semantic
                                                                   information in natural language texts in the Web.
    (e2) Any cycles between nodes in the graph are
                                                                   Semantic analysis consists in assigning a semantic
eliminated.
                                                                   structure to natural language input. Semantic structure
    Categories are indicated in the very beginning of an           is obtained via the semantic categorization and
article; one glance at the category will be sufficient to          establishment of semantic relations between concepts
determine the field of the article, since all categories           presented in natural language texts. Association is the
will contain popular terms.                                        dominant type of semantic relations supported by
    For example, in the beginning of an article on                 Keywen and the Navigator under development.
Mesopotamia, the category "SOCIETY > HISTORY >                     Synonyms, taxonomies and other types of paradigmatic
HISTORICAL ERAS > PREHISTORY > IRON AGE                            semantic relations are established within particular
> MESOPOTAMIA" will be indicated.                                  contexts and are viewed as particular cases of the
Even if we do not know the word “MESOPOTAMIA”,                     association relation. Hence we employ the semantic
the easily understandable words "SOCIETY >                         impacts of context and co-occurrence which play the
HISTORY" will clearly indicate the field.                          decisive role in automatic categorization.
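
Steps (c) to (e2) of the category-generation method, together with the resulting category string, can be illustrated by the following Python sketch. It is an interpretation under stated assumptions, not the Keywen algorithm: the data layout, the function names, the user names and ranks, the rule that an edge which would close a cycle is simply skipped, and the choice of the first accepted parent as the primary one are all introduced for this example.

```python
def reaches(graph, start, target):
    """True if `target` can be reached from `start` via child -> parent links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def combine(user_choices, ranks):
    """Merge the parent choices of all users, best-ranked users first (step e1).

    A choice that would close a cycle is dropped (step e2, assumed strategy).
    Returns term -> list of parents; the first parent is treated as primary.
    """
    parents = {}
    for user in sorted(user_choices, key=lambda u: ranks[u], reverse=True):
        for term, parent in user_choices[user]:
            if parent in parents.get(term, []):
                continue
            if term == parent or reaches(parents, parent, term):
                continue  # term is already an ancestor of parent: skip the edge
            parents.setdefault(term, []).append(parent)
    return parents


def category_string(term, parents):
    """Follow primary parents up to the top and join the path with ' > '."""
    path = [term]
    while parents.get(path[-1]):
        path.append(parents[path[-1]][0])
    return " > ".join(reversed(path))


user_choices = {  # hypothetical users and their (term, parent) selections
    "expert": [("MESOPOTAMIA", "IRON AGE"), ("IRON AGE", "PREHISTORY"),
               ("PREHISTORY", "HISTORICAL ERAS"),
               ("HISTORICAL ERAS", "HISTORY"), ("HISTORY", "SOCIETY")],
    "novice": [("MESOPOTAMIA", "MIDDLE EAST"), ("HISTORY", "MESOPOTAMIA")],
}
ranks = {"expert": 2, "novice": 1}

parents = combine(user_choices, ranks)
print(category_string("MESOPOTAMIA", parents))
# prints: SOCIETY > HISTORY > HISTORICAL ERAS > PREHISTORY > IRON AGE > MESOPOTAMIA
```

In this run the lower-ranked choice HISTORY -> MESOPOTAMIA is rejected because it would close a cycle, while MESOPOTAMIA -> MIDDLE EAST is kept as an additional, non-primary parent. The result is therefore a polyhierarchy that still contains one primary tree.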
    Categories are located in the beginning of an article;             Further development includes the detailed
since all categories contain most popular terms, the first         structuring of the Keywen knowledge base with the
glance at the category will make the field of the article          employment of the Semantix linguistic processor and
clear. All category terms correspond to the titles of the          logical processing features, construction of the
articles, which makes the direction of transition, when            encyclopedic articles from definitions and key words
mouse-clicking any term within the category, self-                 automatically extracted from Internet, establishment of
explanatory.                                                       hierarchies / category trees on the basis of key word
    Category String is the line that contains the full             family trees by assigning a dominant category, semi-
name of category, which consists of several terms, such            automatic correction of the category tree, manual and
as                                                                 semi-automatic correction of definitions, manual and
                                                                   semi-automatic correction of articles by the methods of
    "THINKING > NONVERBAL THINKING > BIG-                          digital voting and crowdsourcing. The Keywen
PICTURE THINKING".                                                 technology can be used for terminological data bases
All terms included into the Category String, are located           creation according to the International Standard ISO
in hierarchical order, which makes the internal structure          12620: 2009.
of the category easier to understand and more logical.                 The approach taken combines the methods of the
Every category (as full path to category) in Keywen                rule-based paradigm and machine learning, thus
Category Structure is unique.                                      providing a hybrid platform for design and development
Keywen Category Structure contains 17 top-level                    of the Internet Semantic Navigator.
categories.

3.1 ANIMALS > SEA_ANIMALS > WHALES                                 References
3.2 ARTS > FILM > ANIMATION > ANIME
3.3 BUSINESS > BUSINESS_ECONOMICS                                   [1] Kuznetsov Igor. Semantic Representations.
3.4 COMPUTATION > INTERNET > INTERNET_HISTORY >                        Moscow: Science, 1986. 294 p. (in Russian).
ARPANET
3.5 GAMES > BOARD_GAMES > KINGS_CRIBBAGE
3.6 HEALTH > MEDICINE > HEALTHCARE > THERAPY > ENERGY_THERAPIES > REIKI
3.7 HOME > COOKING > FRUIT_JUICE > LEMONADE
3.8 IDEAS > BOOKS
3.9 MINERALS > CRYSTALS > ZIRCON
3.10 PEOPLE > POETS
3.11 PLANTS > TREES
3.12 RECREATION > TRAVEL > TOURISM
3.13 REFERENCE > REFERENCE_WORKS > ATLASES > CARTOGRAPHY > WEB_MAPPING
3.14 SCIENCE > NATURAL_SCIENCES > SPACE_SCIENCE > SOLAR_SYSTEM > NEPTUNE
3.15 SOCIETY > HISTORY > HISTORICAL_ERAS > PREHISTORY > IRON_AGE > MESOPOTAMIA
3.16 THINKING > NONVERBAL_THINKING > BIG-PICTURE_THINKING
3.17 WORLD > AFRICA > MIDDLE_EAST > NORTH_AFRICA > EGYPT

 [2] Web site for the Keywen encyclopedia of keywords: www.keywen.com
 [3] Salton, G. 1989. Automatic text processing: The transformation,
    analysis, and retrieval of information by computer. New York:
    Addison-Wesley.
 [4] Kuznetsov I., Charnine M. Semantic-Oriented System For Factual
    Search With the Interface in Russian and English // Systems and
    Facilities of Informatics. Moscow: Science, 1995, V 7.
 [5] Kuznetsov I.P., Efimov D.A., Kozerenko E.B. Tools for Tuning the
    Semantix Processor to Application Areas // Proceedings of ICAI'09,
    Vol. I. WORLDCOMP'09, July 13-16, 2009, Las Vegas, Nevada, USA. -
    CSREA Press, USA, 2009. P. 467-472.
 [6] Kuznetsov I.P., Kozerenko E.B., Kuznetsov K.I., Timonina N.O.
    Intelligent System for Entities
    Extraction (ISEE) from Natural Language Texts // Proceedings of the
    International Workshop on Conceptual Structures for Extracting
    Natural Language Semantics - Sense'09, Uta Priss, Galia Angelova
    (Eds.), at the 17th International Conference on Conceptual
    Structures (ICCS'09), University Higher School of Economics,
    Moscow, Russia, 2009. P. 17-25.
 [7] Han J., Pei J., Yin Y., and Mao R. Mining Frequent Patterns
    without Candidate Generation: A Frequent-Pattern Tree Approach //
    Data Mining and Knowledge Discovery, 8(1), 2004. P. 53-87.
 [8] FASTUS: a Cascaded Finite-State Transducer for Extracting
    Information from Natural-Language Text // AIC, SRI International.
    Menlo Park, California, 1996.
 [9] Cunningham H. Automatic Information Extraction // Encyclopedia of
    Language and Linguistics, 2nd ed. Elsevier, 2005.
[10] Dobrov B.V., Lukashevich N.V. Ontologies for natural language
    processing: Description of concepts and lexical senses //
    Computational Linguistics and Intelligent Technologies: Proceedings
    of the International Conference Dialog'06, Bekasovo, May 31 -
    June 4, 2006. P. 138-142.
[11] Kuznetsov I.P., Matskevich A.G. The English Language Version of
    Automatic Extraction of Meaningful Information from Natural
    Language Texts // Proceedings of the Dialog-2005 International
    Conference "Computational Linguistics and Intelligent
    Technologies", Zvenigorod, 2005. P. 303-311.
[12] Web site for Semantic Web:
    http://www.w3.org/standards/semanticweb/
[13] Jackendoff, R. Semantic Structures. MIT Press, Cambridge, MA,
    1990.
[14] Gardner, J. R. and Z. L. Rendon, XSLT and XPATH: A Guide to XML
    Transformations, Prentice Hall, 2001.

