Intelligent Tools for the Semantic Internet Navigator Design

© Igor Kuznetsov © Mikhail Charnine © Elena Kozerenko © Nikolay Somin © Vladimir Nikolaev © Andrey Matskevich
Institute of Informatics Problems of the Russian Academy of Sciences, Moscow
Igor-kuz@mtu-net.ru keywen1@mail.ru kozerenko@mail.ru somin@post.ru ipiranlab14@yandex.ru

Abstract

This paper describes the methods and instruments for the design of a semantic web navigator, a novel system providing semantic drive for Internet users. The proposed solutions rest on the statistical paradigm for knowledge extraction and on semantic presentations based on the Extended Semantic Networks (ESN) mechanism. The approach comprises rule-based and stochastic techniques for text processing and for mapping the extracted entities and relations onto the structures of the knowledge base.

The work is supported by the Russian Foundation for Basic Research, grant 11-06-00476-

1 Introduction

The paper deals with the issues of design and development of new tools comprising the intelligent methods and systems based on the presentation mechanism of the extended semantic networks (ESN) [1], which has been employed for the creation of a wide range of knowledge-based systems, and the features of the keywords encyclopedia Keywen [2].

As a result of the tremendous growth of the Internet, its users receive huge volumes of information in response to their queries. Users are interested in a wide variety of questions, and they make their own attempts to employ keywords and phrases by trial and error (addressing search engines and analysing the answers). This results in tremendous expenditures of labour and in disappointment because of the huge amounts of irrelevant information and/or its incompleteness.

Hence, to make optimal queries, one has to face the problem of ordering requests, reflecting the interests of users, and creating directories of subjects and articles. It is necessary to create special means which allow users to find what interests them in the sea of information with reduced expenditures of labour. On-line encyclopedias play the role of such means.

The intelligent web navigator comprises the features of the ESN linguistic processors which were developed for different classes of information systems relating to the artificial intelligence research field.

The core feature of the development is assigning a semantic structure to natural language input. Semantic structure is obtained via semantic categorization and the establishment of semantic relations between the concepts presented in natural language texts contained in the Internet. Association is the dominant type of semantic relation supported by the intelligent tools under discussion. The study of the contexts and co-occurrence of terms and key words makes it possible to shape the semantic structure of the navigated texts and to perform automatic categorization.

The hybrid approach is taken for the semantic navigator design, which incorporates the logical analytical functionality of the intelligent systems based on the extended semantic networks, statistical methods and machine learning mechanisms.

2 The ESN Intelligent Systems and their evolution

The intellectual systems developed on the basis of the apparatus of the extended semantic networks (ESN) [1, 4-6], called ESN-systems, were created by an association of developers, including the authors of this article, at the Institute of Informatics Problems of the Russian Academy of Sciences over a period of two decades within the framework of research projects and applied systems oriented at concrete subject areas and customers.

We single out 4 generations of ESN-systems. The linguistic semantic ideas laid as the basis of the systems of this class underwent a specific evolutionary process.
Intellectual ESN-systems contain developed knowledge bases; the knowledge is represented in the form of records in the language of the extended semantic networks, called ESN-structures. Linguistic knowledge is thus a "special case of knowledge", and it is also represented in the form of records in the language of the extended semantic networks. The basic structural element of the ESN is the named N-ary predicate, called a "fragment". The whole set of language objects is given in the form of predicate-argument structures; mechanisms for the presentation of embedded structures are supported, which gives very powerful means for describing the objects of different language levels.

Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" RCDL-2012, Pereslavl Zalesskii, Russia, October 15-18 2012.

The uniformity of language presentations is a very important factor. In the process of analysis and synthesis of natural language sentences a formal grammatical apparatus similar to dependency grammar is used. With this approach the words and constructions which perform the role of predicates in the sentence are the "support" elements, and the result of the analysis of a sentence must be one predicate, which corresponds to the predicate of the sentence in question (i.e. to the basic verb in the tensed form or to another basic predicate expression). Thus, in the process of analysis, the "action words" and the "relation words" are processed first, i.e., the verbs and other words which have syntactic-semantic valences. Examples of "relation words" are the words "father", "friend", and the like; in this case a "relation" is a word which assigns strong, clearly expressed syntactic-semantic expectations.

Semantic analysis in the engineering linguistic understanding is the process of translating natural language expressions into "internal" structures of the knowledge base (KB); in our case these "internal" structures are records in the ESN language. Thus, a KB structure is the code of sense in the intellectual information systems. The language engineering solutions were implemented in the systems with "complete" linguistic analysis, which are the systems of the 1st and 2nd generations (DIES1, DIES2, Logos-D [1, 4]), and in the systems with the "factographic" approach, i.e. the intelligent systems of analytical decisions support (ISADS) [6], where the goal of analysis is the extraction of entities and connections from the texts; these are the systems of the 3rd and 4th generations.

The ESN systems of the 4th generation perform the tasks of semantic object (named entity) extraction. The set of objects to be extracted depends on the tasks of a user. At the same time the quality of a linguistic processor is to a considerable degree determined by the possibilities for this extraction. The basic types of information objects and connections extracted by the ESN semantic processors are given below:
• persons (by family name, given name and patronymic - FNP) with their role features (criminal, victim);
• the verbal description of the persons, their distinctive signs;
• address, posting information attributes;
• date(s) mentioned;
• weapon with its special features;
• telephone numbers, faxes, e-mails with their subsequent standardization;
• the means of transport with the indication of the vehicle type, its state number, color and other attributes;
• passport data and other documents with their attributes;
• explosives and narcotic substances;
• organizations, positions;
• quantitative characteristics (how many persons or other objects participated in an event);
• the numbers of accounts, sums of money with the indication of the currency type;
• terrorist groups and organizations;
• participants of terrorist groups with the indication of their roles (leader, head of, etc.);
• the armed forces assigned for antiterrorist combat (Military_.Force);
• event (criminal, terrorist, biographical, and so on) with the indication of the information objects participating in them;
• time and the place of events;
• the connection between different types of information objects (with whom a person works in an organization or lives at the same address, in what events the person participated together with other objects, etc.).

For extracting objects, all versions of an object name possible in the text, including the brief form, were considered. Standard objects (names, dates, addresses, the forms of weapon and others) are reduced to one (standard) form.

The identification of objects is performed taking into account brief designations (for example, separate surnames, patronymics, initials), anaphoric references (indicative and personal pronouns, for example, "this person", "it..."), definitions and explanations (for example, "the mayor of Moscow Sobianin" is identified with the subsequent words "mayor", "Sobianin").

For the extraction of events and connections the analysis of verbal forms, participial and adverbial constructions is carried out. An important task is the identification of objects in the entire text and the use for these purposes of indicative pronouns, brief names and anaphoric references.

Taking into account the difficulties and in accordance with the tasks, the linguistic processor Semantix was developed, which performs normalization of words, their grouping with the formation of units, the identification of objects and the establishment of connections. As a result, for each NL document a semantic network called the meaningful document portrait is constructed automatically. These portraits are the knowledge structures of the knowledge base which serve as the basis for implementing different forms of semantic search: the search by features and connections, the search for objects connected at different levels, the search for similar figurants and incidents, the search by distinctive signs (with the use of ontologies).

The extraction of connections is not only the deep analysis of verbal and other forms. Many connections are given by default. For example, in the summaries of incidents, as a rule, figurants' names are followed by their data without the indication of their belonging and with additional text insertions. For that the directed search for the connected objects, i.e., the restoration of connections and default data, is organized in the Semantix processor. Special processes are organized in order to connect persons with their place of stay or place of work, vehicles which belong to them, and so forth. For example, the analysis of the summaries of incidents is performed as follows. For a number of objects (address, telephone, date of birth, etc.) a virtual connection is built with other, as yet unidentified, objects (names, organizations). Then, at the same level of processing, their search is performed with the aid of special rules for identification. In these rules the direction of search, the permissible number of steps, and also the signs of words and punctuation marks at which the search ends are indicated. In this case special filters are required in order not to take and connect an alien object. This approach showed sufficiently good results in the system Criminal [11].

The special features of natural language are considered where the same actions are identified with the aid of verbs, verbal nouns and participial constructions. Presented in ESN, they are reduced to one form, i.e. a complex object. Moreover, forms with verbal nouns can be components of verbal forms. By analogy, in ESN some objects can be components of others. The cause-and-effect and temporal dependences between actions, events, etc. are represented; they reflect the logical connection of the sentences assigned explicitly with the aid of the words "therefore", "then", etc.

The quality of a linguistic processor is determined by a number of factors. First, the possibility for isolation of objects and connections: the types of objects being isolated and their quantity.

[Figure 1, flowchart boxes: 1. The analysis of the texts > 2. Singling out the basic concepts and characteristics > 3. Constructing a subject area vocabulary founded on the basic "world model" > 4. Establishment of the type-kind relations between SA notions > 5. Formulation of the situational rules in the form of IF… THEN rules; additional box: The basic "world model" and the language model]

Figure 1 The flowchart of conceptual linguistic modeling

Construction of the conceptual linguistic model of a certain subject area is subdivided into the following stages:
- construction of the conceptual model proper, i.e., the ramification of fundamental notions, their organization in kind-type trees and the determination of connections between them;
- the development of the ideographic dictionary for the subject area, i.e., the lexical population of the conceptual model;
- the introduction of the base rules, which describe "the model of the world" in the natural language relevant for the subject area.
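The directed search for connected objects described in Section 2 (rules that fix a direction of search, a permissible number of steps, and the signs at which the search ends, plus a filter against alien objects) can be illustrated with a minimal sketch. This is not the Semantix implementation: the `Token` and `SearchRule` types, the `kind` labels and the function name `find_owner` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Token:
    text: str
    kind: str = "word"  # hypothetical labels: "word", "punct", "person", "phone", ...


@dataclass(frozen=True)
class SearchRule:
    direction: int            # -1: search to the left, +1: search to the right
    max_steps: int            # permissible number of steps
    stop_signs: frozenset     # words / punctuation marks that end the search
    target_kinds: frozenset   # filter: only these object kinds may be connected


def find_owner(tokens: Sequence[Token], start: int, rule: SearchRule) -> Optional[int]:
    """Return the index of the object to connect with tokens[start], or None."""
    pos = start
    for _ in range(rule.max_steps):
        pos += rule.direction
        if pos < 0 or pos >= len(tokens):
            return None                      # ran off the text
        tok = tokens[pos]
        if tok.text in rule.stop_signs:
            return None                      # a stop sign ends the search
        if tok.kind in rule.target_kinds:
            return pos                       # first admissible object wins
    return None                              # step limit exhausted


# Assumed toy summary: "Ivanov, tel. 123-45-67; Petrov, tel. 765-43-21"
tokens = [
    Token("Ivanov", "person"), Token(",", "punct"), Token("tel."),
    Token("123-45-67", "phone"), Token(";", "punct"),
    Token("Petrov", "person"), Token(",", "punct"), Token("tel."),
    Token("765-43-21", "phone"),
]
rule = SearchRule(direction=-1, max_steps=5,
                  stop_signs=frozenset({";", "."}),
                  target_kinds=frozenset({"person", "organization"}))

first_owner = find_owner(tokens, 3, rule)    # phone at index 3
second_owner = find_owner(tokens, 8, rule)   # phone at index 8
```

In this sketch the stop signs keep the second telephone number from being attached to the first person, which is the role the text assigns to the filters.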
The Semantix processor identifies up to 40 types of objects, including very complex ones which correspond to actions and events. As the number of types grows, additional difficulties appear, connected with collisions of the extraction rules: some rules can seize words which relate to other objects and are extracted by other rules. It becomes important to consider the order in which the rules are applied, including the rules of identification. Second, an important factor is the selectivity of the rules and procedures of identification, i.e. the factor of noise and losses. By noise we mean the presence of excessive words in the objects. Losses are the situations when an object is not revealed or is revealed only partially: the text contains words which did not enter into the object. In the Semantix processor the rules are arranged in such a way that they ensure a high degree of selectivity and the minimization of noise and losses with a large number of objects being selected.

3 Conceptual linguistic simulation

Conceptual linguistic simulation (CLS) is the process of constructing a natural language model of a subject area (SA) (Fig. 1) that synthesizes the approaches of conceptual and linguistic simulation [4-6].

The procedure of conceptual-linguistic simulation on the basis of the ESN apparatus is based on the following principles:
• the model must be "open", i.e., support an effective mechanism of expansion and information update;
• the model of the "sense" presentation should consider the facts of extra-linguistic reality, which in the form of rules and relations compose a certain basic "world model" and the concrete models of subject areas;
• the model should be practical, i.e., not overloaded by detailed descriptions of connections and relations between the concepts, in order to ensure the possibility of its realization, but at the same time it should reflect the information relevant for specific objectives.

A realistic approach to the formulation of the problem dictates the need of limitation to a domain-oriented subset of a natural language. The essence of the limitations consists in the following:
- first, the analyzed text materials contain expert knowledge from particular subject areas (we developed systems for the subject areas of diagnostics of microcircuit production failures, forecasting in the social sphere, criminology, and others);
- second, for the purposes of the maximally possible elimination of ambiguity, the dictionary is built according to the modular principle: there is a certain most general common part (1-2 levels) completed by special dictionaries for each particular subject area.

{( 895__)(DICSEM)
COORD(PROGNOZ1,RUS, 895__,S5 0_31_51_20,%)
SUB(UNIV,0+) SUB(UNIV,1+) SUB(UNIV,2+)
(0-,1-,2-/3+) INFI(3-) (3-)
(3-/4+) FUT1(4-) SUB( ,5+)

Figure 2 An example of the presentation of the verb vyrobatyvat' - "to manufacture" in the semantic dictionary.

The proposed model of lexical semantics is based on the principle of the "nuclear" value realized in the context of the given subject area, with the subsequent inductive supplementation of other meanings (if they are actualized in the contexts in question). A taxonomy is also used, which is realized in the form of hierarchical trees of the word classes. The general "world model" of the system serves as the basis for the subject area models.

The classes of words are subdivided into concepts/names, relations, actions, properties, characteristics of actions, and time and place locatives. The most general notion is "concept", or universal class, which is subdivided into object, situation, process and others. The words which relate to the classes of actions and relations are represented as semantic-syntactic frames, which determine the predicate-argument structures (government model).

However, in the described approach (let us name it the ESN-approach) the range of argument values is substantially extended. This extension consists in the fact that the role of arguments can be played by simple objects corresponding to individual words and by structural objects which present word combinations, phrases and clauses, and the concept of "case" includes not only semantic but also syntactic aspects. The approach based on ESN allows an arbitrary level of structure embedding to be reflected; this makes it possible to reflect the structural nature of lexical semantics, which in this model has a hierarchical network structure.

Linguistic knowledge is represented in the system dictionary and the declarative modules of the linguistic processor. In the ESN systems the function of a dynamically formed semantic dictionary, which is expanded automatically by the system in the course of processing concrete texts, is also realized on the basis of initial linguistic information.

In Fig. 2 the "internal" description of the verb in the semantic dictionary is represented. This dictionary is automatically generated by the ESN-systems DIES2, LOGOS-D, IKS in the course of natural language text processing.

4 General Considerations for Encyclopedia Design

Encyclopedias traditionally played an important part in the study of new material. However, their creation in electronic form is a huge work which requires not simply entering the adequate material into a computer, but also its additional ordering: creation of subject directories for allocation of main classes and subclasses, definition of main notions, and building of hyper-references both between the entries (articles) of the encyclopedia and to primary sources. What should also be considered is the dynamism of the information circulating in the Internet: the emergence of new information sources, which should be taken into account in encyclopedias.

At present the majority of large electronic encyclopedias operating on-line have been created on the basis of printed materials of universal encyclopedias: Big Soviet Encyclopedia, Britannica, Big Brockhaus, Big Larousse and others. Creation of such encyclopedias requires considerable human labour.

The above leads us to the conclusion that the global problem in the present situation is the development of methods and software for automation of the most labor-consuming stages of formation of on-line Internet encyclopedias.

Such formation requires elements of intellectual activity: making the choice of the subject for description, formation of articles (entries) and their names, search for definitions, etc. Development of the concepts of the on-line encyclopedia results in reference systems of a more general plan, providing collection of information and systematized knowledge representation about different objects which are of interest to the user:
- about politicians, persons of science, of culture;
- about organizations, companies;
- about events (for example, strikes, their reasons, place and time);
- about goods and objects of a particular class (for example, fuel, mining, region) and others.
While building such systems, many common problems appear that are also vital for the on-line encyclopedia. The only difference is that instead of articles and their names there would be other objects.

At present the solution of the discussed problems becomes realistic because many systems and facilities have been designed and developed in the areas connected with creating different classes of intelligent systems, language processors, knowledge bases, and statistical processing of language components [1-14].

The given work is based on the experience of creating the on-line encyclopedia and is devoted to the principal directions of development of solution methods for the mentioned problem.

5 Special Features of Automation

In general, the problem looks like this. The input comprises a stream of documents from the Internet (all relating to a determined application domain). The output is an electronic encyclopedia consisting of brief articles with names, with hyper-references between articles (if the names of other articles are encountered in the text) and with hyper-references to primary sources (the documents from the Internet).

In addition, an electronic encyclopedia should include the main menu, article sections, various classifiers and an internal search system providing quick access to the concrete subjects making up the application domain. Certainly, it is not possible to automate all these processes.

Formation of the main menu, subjects and query facilities is done manually. The computer can help with the selection of material for articles and the choice of their meaningful components.

Two stages are distinguished: training and operation. At the training stage a training sample (documents from the Internet) is given to the system with the articles which the system should select indicated. For example, these can be types of diseases, symptoms, descriptive texts relating to, say, the prevention of diseases, and others. The system should develop decision rules providing allocation of these articles at the stage of operation on other documents.

Such rules are founded on statistical treatment with the discovery of keywords and standard contexts (meaningful components), providing selection of articles.

The training stage makes it possible to partly or completely automate the activity of a developer in discovering the data necessary for system operation. Discovery of keywords and of contexts requires the use of morphological and semantic blocks of analysis of natural language (NL).

The first block converts word forms, e.g. TABLE, of TABLE, to TABLE, into the uniform type (TABLE) and is particularly important for languages where words have a system of cases and other morphological information, as, for example, the Russian language. Without such transformation the search in documents for the same components becomes extremely difficult.

The second block selects word-combinations (they can also include names of articles) and verbal forms, which determine the context in most cases.

Both these blocks of the language processor implementing the analysis of natural language sentences play an important part in the system.

In the creation of an on-line encyclopedia the following factors are important: the quality of the created encyclopedia (it is determined by the vicinity to the existing encyclopedia); the difficulty of the preparatory stage, including creation and input of the basic materials (dictionaries, catalogues and others) necessary for the system operation; and the development of a system that learns to discover articles, which is a very difficult programming task.

Simplification of the second and the third factors can dramatically decrease the quality. At the same time, an "over-complication" of the task should be avoided. We follow the scheme in which the development is conducted in stages: first a simple system is developed, with subsequent reinforcement of its features.

6 Semantic Navigator: Encyclopedia of Keywords

In 2002 the first version of the on-line encyclopedia [2] was released by Michael M. Charnine; it received the name Encyclopedia of keywords and is largely based on the methods described above. The Encyclopedia functions on the web-site www.keywen.com. It constantly grows and at present contains more than 250000 articles on different subjects in different languages. The majority of the articles are English, but there are also more than 3800 German and 1300 Italian articles. The Encyclopedia of keywords is universally recognized in the Internet. Daily several thousand people have free use of its information.

Each article of the Encyclopedia consists of key sentences (phrases). Each of them contains one or several key words. Such phrases are found in the Internet with a special semantic navigating program named Keywen Encyclopedia Bot.

At present the Encyclopedia contains more than 5 million key-phrases. The major part of the articles of the Encyclopedia begin with the section in which the definitions of the terms included in the article title are given. This allows the reader to understand quickly what the article is about. If a more profound study of the given subject is required, it is possible to use the references to Internet sites. Each phrase in the Encyclopedia is supplied with such a reference. Each article of the Encyclopedia contains a list of the most important keywords. For each keyword in an article there is a section in which examples of phrases containing this keyword are given.

The knowledge of keywords is necessary for the automatic development of exact requests to search engines. For example, for the article Knowledge Discovery a typical structure in the paragraph DEFINITIONS is given: "Knowledge discovery is the extraction of implicit, previously unknown and potentially useful knowledge from data". An article contains references to more specialized articles: Business and Companies, Magazines and, Organizations, Text Mining, Tools. An article contains keywords (with examples of phrases) KNOWLEDGE DISCOVERY, DATA MINING, INTERNATIONAL CONFERENCE, KDD and others. The Encyclopedia (Keywen.com), which contains an internal search engine, allows the user to quickly find all key-phrases and appropriate articles containing this or that key word. As a result, for any keyword it is possible to quickly find the application domain corresponding to it. At the beginning of 2004 a version of the electronic encyclopedia of the Open Project type was created, entitled "Encyclopedia of key phrases". In the framework of this project each user of the Internet can contribute to the development of the Encyclopedia. The facilities to move sections of any article according to their value and also to enter new phrases in the Encyclopedia are given to each user.

For Keywen development, constantly growing multilingual text corpora automatically extracted from the Internet are used. For each subject domain and for every supported language a particular text corpus is formed. The text corpora are analysed by the linguistic processor.

Keywen NLP pipeline includes:
a text tokenization module,
a part-of-speech tagging system,
a sentence boundary detection tool,
a collocation identification module,
a named entity recognizer,
a word sense disambiguation system,
a full-syntactic parser.

Extraction of term candidates from domain-oriented texts supports Automatic Term Recognition resulting in Multilingual terminology.

Reordering the list of extracted candidates is based on the term/keywords candidate relevance ranking.

Extraction of key phrases and definitions provides Automatic summarization of domain-oriented texts using TF/DF measure.

Extraction of key phrases and definitions creates Knowledge-Rich Contexts; automated pattern acquisition is used for the identification of semantic relations: associations and family trees which serve as the basis for the semantic parser. There are a number of useful advantages of the Keywen apparatus, including, but not limited to: the ability to build large scale human-readable and semantic-oriented hierarchy of categories; the ability to generate dynamical and flexible hierarchical categories; the ability to accept

precise, logically correct, flexible and dynamic. It is convenient for effective navigation and fast understanding, helps:
- to see the BIG PICTURE,
- to divide knowledge into parts and select the most important parts,
- to create effective plans for learning and knowledge processing.

Hierarchy is a form of organizational structure in which each unit has one and only one "parent" unit, except the "top" unit, which has none.

A Polyhierarchy (multi-hierarchy) is like a hierarchy, but nodes can have multiple parents. In mathematical terms, a polyhierarchy is represented by a directed acyclic graph, or a partially ordered set. In terms of object-oriented methodology, it can be viewed as a class hierarchy with multiple inheritance.

Directory structure is a particular case of hierarchical structure (which is the more general concept). For example, UNIX and DOS have a hierarchical directory structure that allows files to be organized by categories. The main difference between hierarchical and directory structure is the different naming convention for categories. The category names in a directory structure can be full or short (local). The full category names in a directory are usually equal to their paths from the Top category. A directory contained inside another directory is called a subdirectory of that directory. Subdirectories are specified by concatenating the subdirectory short name to the name of the directory above it in the hierarchy. Together, the directories form a hierarchy, or tree structure.
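The distinction drawn above between a hierarchy, a polyhierarchy (a directed acyclic graph) and directory-style full names can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the Keywen implementation: the class name `Polyhierarchy`, the convention that the first listed parent is the primary one, the `/` separator and the sample categories are all hypothetical.

```python
class Polyhierarchy:
    """Categories with several parents; the first listed parent is primary."""

    def __init__(self, top="Top"):
        self.top = top
        self.parents = {top: []}          # category -> list of parent categories

    def add(self, name, parents):
        # Parents must already exist and a category is never re-parented,
        # so the graph stays acyclic by construction.
        if name in self.parents:
            raise ValueError(f"category already exists: {name}")
        if not parents:
            raise ValueError("every category except the top needs a parent")
        missing = [p for p in parents if p not in self.parents]
        if missing:
            raise KeyError(f"unknown parent(s): {missing}")
        self.parents[name] = list(parents)

    def full_name(self, name, sep="/"):
        """Directory-style full name: the path from Top along primary parents."""
        path = [name]
        while path[-1] != self.top:
            path.append(self.parents[path[-1]][0])
        return sep.join(reversed(path))

    def ancestors(self, name):
        """All categories reachable upwards through any parent link."""
        seen, stack = set(), list(self.parents[name])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents[node])
        return seen


cats = Polyhierarchy()
cats.add("Computing", ["Top"])
cats.add("Science", ["Top"])
cats.add("Linguistics", ["Science"])
cats.add("Knowledge Discovery", ["Computing", "Science"])       # two parents
cats.add("Text Mining", ["Knowledge Discovery", "Linguistics"])

primary_path = cats.full_name("Text Mining")
all_ancestors = cats.ancestors("Text Mining")
```

The primary-parent links alone form a tree that contains every node, so each category has exactly one full name, while `ancestors` still exposes the additional routes through the other parents.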
contributions from users with different qualifications to improve the hierarchical categories; the ability to accept minimal user contributions (as little as one click); and the ability to reach a category in several ways through the polyhierarchy while the categories still keep hierarchical (directory) paths.

The Keywen apparatus produces a "concrete", substantially repeatable result: the hierarchical categories it generates are substantially repeatable. If users were to perform the claimed steps on multiple different occasions using the same inputs (e.g., the same collection of related terms, the same communication with the input/output module), they would achieve the same result on each occasion. The functionality of the technique has been mathematically proven; the present apparatus for generating hierarchical categories does not use any empirical, heuristic, or fuzzy considerations.

The following two basic category systems are currently most popular:
- Hierarchical, as in directories (easier for understanding, planning and processing);
- Multi-hierarchical, as in Wiki-encyclopedias (more natural, flexible and easy to maintain).

The category structure of Keywen is the product of these two systems: it has the advantages of both and opens greater possibilities than either. Both the structure of web directories and the structure of Wiki-encyclopedias may be viewed as special cases of the Keywen Category Structure. The category structure of Keywen is more

Keywen Category Structure is a polyhierarchy (multi-hierarchy) that contains one preferred (primary) hierarchy (tree) which contains all nodes.

The following new technologies are employed in Keywen:
- One-click Keywen technology and electronic Voting System,
- Keywen search engine with large queries,
- Keywen Writing Service.

These technologies can accelerate the growth of the encyclopedia and make a writer's work more effective.

7 Prospects for the development of Semantic-Focused Systems

The development trends for "Encyclopedias of keywords" and "Encyclopedias of key phrases" are as follows:
- a constant increase in the number of encyclopedia articles in different European languages, including Russian, with cross-references between the relevant articles in different languages;
- the speed of updating the Encyclopedia will increase; old articles will be kept in the archive of the Encyclopedia, while fresh articles will take their place, with references to new phrases and new articles from the Internet;
- a rating of article self-descriptiveness will be constructed; for this it is necessary to analyze the several million references contained in Keywen.com: those containing more key phrases on a given issue should get a higher position in the rating.

Further stages of development are connected with the use of the language processor.

Stage 1. A system for English and Russian morphological analysis, used to transform words into normal form. Simplified analysis of sentences to discover definitions of keywords.

Stage 2. A component for the analysis of sentences that selects frequently occurring relevant word combinations.

Stage 3. Means for establishing relations between the relevant objects that form the clauses.

Stage 4. Extension of the notion of "meaningful components". Not only words and word combinations are allowed, but also objects described in documents: people, addresses, organizations, etc.

Stage 5. Incorporation of the XML-based semantic presentations into the semantic navigator.

The XML file represents a meaningful portrait of a document (the semantic network structure) comprising all objects and connections revealed by the Semantix text processor. For this reason the organization of the XML files has definite scientific value as a means of presenting the semantic structure of sentences and texts.

The transformation of the semantic network into the XML file is performed by the reverse linguistic processor. The fragments that present objects, relations, actions and sentences in the semantic network structure are mapped onto the corresponding components of the XML file, which therefore also contains objects, relations, actions and sentences.

The basic purpose of the linguistic processor (LP) is to operate as a separate module within integrated systems of information collection and processing. The exchange is conducted through XML files [14]. To that end a reverse LP was developed, which constructs XML files on the basis of meaningful portraits. Thus, the input of the LP is a natural language text, and the output is an XML file in which all selected objects and connections are represented together with an indication of their sources. This LP, named Semantix, is provided in the form of an SDK module.

In the object description, ID="7" is the identifier of an object and TYPE="Organization" is its type. The text component corresponding to the object is also given. Relations between objects and their participation in actions are given through the REF=... references. For example, the following construction

[XML listing not preserved]

represents the sentence "one of the blows struck the headquarters of the oppositional group". For each object or action, a reference to the sentence is given. The Semantix processor uses sufficiently universal XML constructions: one object can include another object (through a reference). Properties are given as arguments. If necessary, the type of an attribute is indicated. For example, in the statement

[XML listing not preserved]

the year is indicated, etc. An XML file contains the complete set of information items necessary for use in different integrated systems.

An example of an XML file is given in Figure 3.

Figure 3. An example of XML file for the semantic structure presentation.
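As a hedged illustration of the ID, TYPE and REF conventions described above, the following Python sketch parses a small XML fragment and resolves the references of an action to the objects that take part in it. Only the attribute names ID, TYPE and REF and the two quoted texts come from the paper. The element names (DOCUMENT, SENTENCE, OBJECT, ACTION, ARG), the SENT attribute and the action type "Strike" are assumptions made for illustration and should not be read as the actual Semantix output format.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, the SENT attribute and the "Strike"
# type are assumed; only ID, TYPE, REF and the quoted texts are from the paper.
SAMPLE = """<DOCUMENT>
  <SENTENCE ID="1">one of the blows struck the headquarters of the oppositional group</SENTENCE>
  <OBJECT ID="7" TYPE="Organization" SENT="1">Headquarters residence of the opposing group</OBJECT>
  <ACTION ID="8" TYPE="Strike" SENT="1">
    <ARG REF="7"/>
  </ACTION>
</DOCUMENT>"""


def load_portrait(xml_text):
    """Return (objects, actions) extracted from a semantic-portrait XML string."""
    root = ET.fromstring(xml_text)
    # Index every object by its ID so that REF values can be resolved.
    objects = {
        obj.get("ID"): (obj.get("TYPE"), (obj.text or "").strip())
        for obj in root.iter("OBJECT")
    }
    actions = []
    for action in root.iter("ACTION"):
        # Each ARG points to a participating object through REF.
        participants = [
            objects[arg.get("REF")]
            for arg in action.iter("ARG")
            if arg.get("REF") in objects
        ]
        actions.append((action.get("TYPE"), participants))
    return objects, actions


objects, actions = load_portrait(SAMPLE)
for action_type, participants in actions:
    print(action_type, participants)
```

Indexing objects by ID first keeps reference resolution to a single dictionary lookup, which is convenient when one object includes another through a reference.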
The Semantix SDK module works under Windows, but it can be recompiled to work under Linux. The Semantix Processor is an independent module, and it can be used without the mentioned systems for the standard tasks of analytical services. It can be tuned to objects of other types by means of the linguistic knowledge or the dictionaries.

Let us give some explanations. Each object has the following structure:

[XML listing not preserved; its text component reads "Headquarters residence of the opposing group"]

8 Semantic-Focused Systems

The development of the on-line encyclopedia concept leads to more general systems that discover semantically meaningful information in documents and build an information-reference system on this basis [1, 4-6]. The method of tuning is the introduction into the system of a new template, with its positions tied to natural language components, or a change in the existing templates and the corresponding linguistic knowledge. At present such a system, ANALYST, and the linguistic processor Semantix have been created; they use the knowledge base and the semantics-oriented linguistic processor for the tasks of automatic formalization of text information, answering free-form queries, etc. [4-6].

More than 40 different types of objects are supported by the Semantix processor. The subject areas represented in the text documents are as follows.
- Documents about terrorism in the Russian language. Documents that discuss terrorist acts and groups are analyzed. This feature supports the extraction of 40 types of objects, their connections and the degree of participation in the criminal actions.
- Documents about terrorists in the English language. The objects and links include persons (their family name, name, patronymic: FNP), posts, organizations, terrorist groups, instruments of crime, time and place of events and so forth, as well as connections with and participation in the actions.
- Summaries of incidents. The persons involved, their connections, organizations, dates, documents, bank account numbers, details of weapons, etc. are extracted, with an indication of their participation in particular criminal actions.
- Accusatory conclusions, information about criminal cases. Objects are identified across the entire text. Their connections and criminal actions are revealed.
- Government communications, media issues. Persons, dates, organizations, positions and other significant information, as well as connections and participation in the actions, are selected.

Tuning relies on hybrid approaches comprising hand-made rules and statistical means for rapid correction and fine adjustment of linguistic knowledge. In our systems there is an entire complex of such means, which ensure rapid tuning to the applications (including the introduction of new objects and connections), taking into account the demands of customers.

Such systems have much in common with the system of electronic encyclopedia construction. The significant information corresponds to the names of the articles of the encyclopedia. Templates are the variety of schemes on which the articles of the encyclopedia are constructed. The material has to be laid out in accordance with the scheme, and hyper-references have to be formed taxonomically.

A backbone instrument for semantic categorization is the employment of hierarchy. A hierarchy is a form of organizational structure in which each unit has one and only one "parent" unit, except the "top" unit, which has none.

A polyhierarchy (multi-hierarchy) is like a hierarchy, but nodes can have multiple parents. In mathematical terms, a polyhierarchy is represented by a directed acyclic graph, or a partially ordered set. In terms of object-oriented methodology, it can be viewed as a class hierarchy with multiple inheritance.

A directory structure is a particular case of a hierarchical structure (which is the more general concept).
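The definitions above can be made concrete with a minimal Python sketch of a polyhierarchy stored as a directed acyclic graph in which the first parent of every node is treated as its primary parent, so that the primary parents form the single preferred tree. The class name, the method names and the " > " path separator are assumptions chosen for illustration, and the second parent given to MESOPOTAMIA is an invented example; this is not the Keywen implementation.

```python
class Polyhierarchy:
    """Categories as a directed acyclic graph with one primary parent per node."""

    def __init__(self):
        # node -> ordered list of parents; the first entry is the primary parent
        self.parents = {}

    def ancestors(self, node):
        """Return every node reachable by following parent links upward."""
        seen, stack = set(), list(self.parents.get(node, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, []))
        return seen

    def add_parent(self, child, parent):
        """Link child to parent, refusing any link that would close a cycle."""
        self.parents.setdefault(child, [])
        self.parents.setdefault(parent, [])
        if parent == child or child in self.ancestors(parent):
            raise ValueError(f"{parent} -> {child} would create a cycle")
        if parent not in self.parents[child]:
            self.parents[child].append(parent)

    def primary_path(self, node):
        """Directory-style full name: the path along primary parents from the top."""
        path = [node]
        while self.parents[path[-1]]:
            path.append(self.parents[path[-1]][0])
        return " > ".join(reversed(path))


h = Polyhierarchy()
chain = ["SOCIETY", "HISTORY", "HISTORICAL ERAS", "PREHISTORY", "IRON AGE", "MESOPOTAMIA"]
for parent, child in zip(chain, chain[1:]):
    h.add_parent(child, parent)

# An additional, non-primary parent (illustrative assumption).
h.add_parent("MIDDLE_EAST", "WORLD")
h.add_parent("MESOPOTAMIA", "MIDDLE_EAST")

print(h.primary_path("MESOPOTAMIA"))
```

Keeping the primary parent as the first list entry lets the same structure serve both views: all parents give the multi-hierarchy, while the first parents alone give a tree with unique directory-style paths.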
The supported subject areas also include:
- Autobiographies (resumes) in the Russian and English languages. All attributes of people, the periods and places of their work and studies, language proficiency and so forth are extracted from the resumes.
- Autobiographies (resumes) in English. All attributes of people, the periods and places of their work and studies, language proficiency and so forth are extracted from the English-language resumes.
- Documents of media issues in English. The persons mentioned in media issues, positions, organizations, dates, terrorist and anti-terrorist groups, weapons, events, their time and place, different connections and other features are extracted from the English-language texts.

In the processors of the Semantix, Lingua-Master and "Criminal" systems, up to 40 types of objects are extracted with high accuracy and minimum noise. For example, the "Criminal" system was verified on about 500 thousand incidents from the summaries of the Moscow Criminal Police Department and showed unique results on the basic objects: the coefficient of noise (i.e. excessive words in the objects) is not more than 1-2%, and losses are not more than 1%. The Semantix Processor was tuned on a smaller quantity of documents dealing with terrorist activity, and therefore it can show more noise and losses. But this can be quickly fixed. The fact is that it is impossible to take into account everything that can be encountered in natural language texts. Therefore, first, representative collections of test documents are extremely important and, second, means of fixing or tuning the linguistic processors are needed.

The main difference between a hierarchical and a directory structure is the naming convention for categories. The category names in a directory structure can be full or short (local). The full category names in a directory are usually equal to their paths from the Top category. A directory contained inside another directory is called a subdirectory of that directory. Subdirectories are specified by concatenating the short name of the subdirectory to the name of the directory above it in the hierarchy. Together, the directories form a hierarchy, or tree structure.

Keywen Category Structure is a polyhierarchy (multi-hierarchy) that contains one preferred (primary) hierarchy (tree) which contains all nodes.

The method for generating hierarchical categories from a collection of related terms contains the following steps:
(a) A huge collection of related terms is accumulated;
(b) Information about the relationships of any term is communicated to users (and agents);
(c) Users select multiple parent categories for each term among its relatives;
(d) Many parent-child relationships are accumulated and form a directed graph; and
(e) A variety of hierarchical structures is constructed from the combined directed graphs of different users.

The last step (e) contains the following sub-steps:
(e1) The directed graphs of different users are combined according to user contribution ranks, so that better ranked users have priority in the selection of parents for a particular term; and
(e2) Any cycles between nodes in the graph are eliminated.

Categories are indicated at the very beginning of an article; one glance at the category will be sufficient to determine the field of the article, since all categories will contain popular terms.

For example, at the beginning of an article on Mesopotamia, the category "SOCIETY > HISTORY > HISTORICAL ERAS > PREHISTORY > IRON AGE > MESOPOTAMIA" will be indicated. Even if we do not know the word "MESOPOTAMIA", the easily understandable words "SOCIETY > HISTORY" will clearly indicate the field.

Categories are located at the beginning of an article; since all categories contain the most popular terms, the first glance at the category makes the field of the article clear. All category terms correspond to the titles of articles, so it is self-explanatory where a mouse click on any term within the category will lead.

Category String is the line that contains the full name of a category, which consists of several terms, such as

"THINKING > NONVERBAL THINKING > BIG-PICTURE THINKING".

All terms included in the Category String are arranged in hierarchical order, which makes the internal structure of the category more logical and easier to understand.

Every category (as a full path to the category) in Keywen Category Structure is unique.

Keywen Category Structure contains 17 top-level categories.

3.1 ANIMALS > SEA_ANIMALS > WHALES
3.2 ARTS > FILM > ANIMATION > ANIME
3.3 BUSINESS > BUSINESS_ECONOMICS
3.4 COMPUTATION > INTERNET > INTERNET_HISTORY > ARPANET
3.5 GAMES > BOARD_GAMES > KINGS_CRIBBAGE
3.6 HEALTH > MEDICINE > HEALTHCARE > THERAPY > ENERGY_THERAPIES > REIKI
3.7 HOME > COOKING > FRUIT_JUICE > LEMONADE
3.8 IDEAS > BOOKS
3.9 MINERALS > CRYSTALS > ZIRCON
3.10 PEOPLE > POETS
3.11 PLANTS > TREES
3.12 RECREATION > TRAVEL > TOURISM
3.13 REFERENCE > REFERENCE_WORKS > ATLASES > CARTOGRAPHY > WEB_MAPPING
3.14 SCIENCE > NATURAL_SCIENCES > SPACE_SCIENCE > SOLAR_SYSTEM > NEPTUNE
3.15 SOCIETY > HISTORY > HISTORICAL_ERAS > PREHISTORY > IRON_AGE > MESOPOTAMIA
3.16 THINKING > NONVERBAL_THINKING > BIG-PICTURE_THINKING
3.17 WORLD > AFRICA > MIDDLE_EAST > NORTH_AFRICA > EGYPT

9 Conclusions

Thus by semantic navigation we mean semantic analysis and search for the relevant semantic information in natural language texts on the Web. Semantic analysis consists in assigning a semantic structure to natural language input. The semantic structure is obtained via semantic categorization and the establishment of semantic relations between the concepts presented in natural language texts. Association is the dominant type of semantic relation supported by Keywen and the Navigator under development. Synonyms, taxonomies and other types of paradigmatic semantic relations are established within particular contexts and are viewed as particular cases of the association relation. Hence we employ the semantic impacts of context and co-occurrence, which play the decisive role in automatic categorization.

Further development includes the detailed structuring of the Keywen knowledge base with the employment of the Semantix linguistic processor and logical processing features; construction of encyclopedic articles from definitions and key words automatically extracted from the Internet; establishment of hierarchies / category trees on the basis of key word family trees by assigning a dominant category; semi-automatic correction of the category tree; manual and semi-automatic correction of definitions; and manual and semi-automatic correction of articles by the methods of digital voting and crowdsourcing. The Keywen technology can be used for the creation of terminological databases according to the International Standard ISO 12620:2009.

The approach taken combines the methods of the rule-based paradigm and machine learning, thus providing a hybrid platform for the design and development of the Internet Semantic Navigator.

References

[1] Kuznetsov Igor. Semantic Representations. Moscow: Science, 1986. 294 p. (in Russian).
[2] Web site for the Keywen encyclopedia of keywords: www.keywen.com
[3] Salton, G. 1989. Automatic text processing: The transformation, analysis, and retrieval of information by computer. New York: Addison-Wesley.
[4] Kuznetsov I., Charnine M. Semantic-Oriented System For Factual Search With the Interface in Russian and English // Systems and Facilities of Informatics. Moscow: Science, 1995, V 7.
[5] Kuznetsov I.P., Efimov D.A., Kozerenko E.B. Tools for Tuning the Semantix Processor to Application Areas // Proceedings of ICAI'09, Vol. I. WORLDCOMP'09, July 13-16, 2009, Las Vegas, Nevada, USA. - CRSEA Press, USA, 2009. P. 467-472.
[6] Kuznetsov I.P., Kozerenko E.B., Kuznetsov K.I., Timonina N.O. Intelligent System for Entities Extraction (ISEE) from Natural Language Texts // Proceedings of the International Workshop on Conceptual Structures for Extracting Natural Language Semantics - Sense'09, Uta Priss, Galia Angelova (Eds.), at the 17th International Conference on Conceptual Structures (ICCS'09), University Higher School of Economics, Moscow, Russia, 2009. P. 17-25.
[7] Han J., Pei J., Yin Y., and Mao R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach // Data Mining and Knowledge Discovery, 8(1), 2004. P. 53-87.
[8] FASTUS: a Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. // AIC, SRI International. Menlo Park. California, 1996.
[9] Cunningham H. Automatic Information Extraction // Encyclopedia of Language and Linguistics, 2nd ed. Elsevier, 2005.
[10] Dobrov B.V., Lukashevich N.V. Ontologies for natural language processing: Description of concepts and lexical senses // Computational Linguistics and Intelligent Technologies: Proceedings of the International Conference Dialog'06, Bekasovo, May 31 - June 4, 2006. P. 138-142.
[11] Kuznetsov I.P., Matskevich A.G. The English Language Version of Automatic Extraction of Meaningful Information from Natural Language Texts // Proceedings of the Dialog-2005 International Conference "Computational Linguistics and Intelligent Technologies", Zvenigorod, 2005. pp. 303-311.
[12] Web site for Semantic Web: http://www.w3.org/standards/semanticweb/
[13] Jackendoff, R. Semantic Structures. MIT Press, Cambridge, MA, 1990.
[14] Gardner, J. R. and Z. L. Rendon, XSLT and XPATH: A Guide to XML Transformations, Prentice Hall, 2001.