<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Heterogeneous Semantic Networks for Text Representation in Intelligent Search Engine EXACTUS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Smirnov</string-name>
          <email>ivs@isa.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Tikhomirov</string-name>
          <email>matandra@isa.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Systems Analysis of RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents an approach to text representation for search tasks. Heterogeneous semantic networks are defined and their construction from natural language is described. The success of application of semantic networks in intelligent search engine is shown. When we talk about intelligent search, we mean 'understanding' a query and finding documents relevant to the query by meaning. To understand the meaning of a query and documents we should use some semantic model which allows us to semantically match a query with documents. Since we deal with natural language documents we should use an adequate text representation model to understand them. In our opinion, heterogeneous semantic networks are a good model for text representation in search tasks. Surely a query on natural language expresses the search request far better than simply a list of keywords does. That is why queries in the form of natural language discourse should be allowed in search engines. From this follows the necessity of semantic analysis of the query text itself and of the required documents. Below we give a definition for the heterogeneous semantic networks, describe linguistic procedures for constructing them and their usage in search, and discuss the experimental results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In the studies devoted to the methods for representing natural language constructions, one
may distinguish between the problems of lеxiсаl/lеxiсо-semantiс level representation and
those of semantic/pragmatic level representation [1][2]. (By semantics one usually means the
interpretation of а natural language construction in some model, e.g. in а domain model; by
pragmatics - changes in the model that are initiated by this construction). The first task
includes morphologic and syntactic analysis, sometimes - semantic analysis that uses the
results of the lower level analysis as well as dictionary and reference information in order to·
construct а formalized representation of а natural language text.</p>
      <p>Semantic level implies not оn1у linguistic, but also logica1 relations between language
objects to be represented [3]. Among the approaches to semantic level understanding of а text
one should mention the models of ‘Meaning-Text’ type, preference semantics models [4], and
conceptual dependence model [5]. The ‘Meaning-Text’ model proposes а semantic
representation based on a semantic graph and description of text communicative structure.</p>
      <p>From the standpoint of search tasks, it is important to be able to extract text elements
containing searched object description and behavior laws. To such elements belong concepts,
their properties and relations between concepts. All these elements can be presented as a
heterogeneous semantic network that is performed below.</p>
      <sec id="sec-1-1">
        <title>2.1 Names and attributes</title>
        <p>Name is the main unit of any description. Any language makes а distinction between
individual names, general names, and meta-names. Individual names denote concrete objects
of reality. General names bring into correspondence sentences and set of concepts. A general
name, being the name of а set, determines the volume of а concept, collection of attributes
characterizes the other side of а concept - its content.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2 Semantic relations</title>
        <p>By semantic relation we shall mean а relation between concepts in the conceptual system of а
domain. Parting from the typology of syntaxemes [6], we shall recognize the following kinds
of semantic relations:
• Gen - generative relation, whose one component denotes а person or an object
belonging to an aggregate or category denoted bу the other component;
• Des - destinative relation, whose one component denotes destination of the other
component;
• Dir - directive relation, where one component denotes the way (direction) of the
other component;
• Ins - instrumental relation, whose one component denotes the instrument of the
action denoted bу the other component;
• Caus - causal relation, where оnе component denotes the reason of occurеnсе of the
other component some time later;
• Соm - comitative relation, where оnе component denotes an action, object or person,
accompanying the other component;
• Cor - соrrеlаtivе relation, whose оnе component expresses either the possibility of
occurеnсе of the other component or соrrеspondеnсе of оnе object to the other object
or purpose;
• Neg - negative relation, where оnе component negates, excludes the possibility of
occurеnсе of the other component;
• Lim - limitative relation, whose оnе component limits the application area of the
other component;
• Med - mediative relation, whose оnе component denotes the mode, means of the
other component's action;
• Pos - possessive relation, where оnе component expresses the relation of possession
оf the other component;
• Pot - potensive relation, where оnе component increases the possibility of оссurеnсе
of the other component some time later;
• Res - resultative relation, where оnе component expresses а consequence of the other
component's action;
• Rep - reproductive relation, whose оnе component denotes the starting point for
reproduction or transformation of the other component;
• Sit - situational relation, where оnе component denotes а situation determining the
state or scope of operation of the other component;
• Fin - finitive relation, where one component means the goa1 of the other component.</p>
        <p>In а language, semantic relations are represented by predicators, i.e. words that represent
predicates. It is exactly predicators that ensure the structure of main sentences. A predicate
which expresses some relation has arguments presenting components of the corresponding
relation. Every argument is characterized by a semantic role in the relation. One of arguments
generally has role ‘subject’, while the other presents specific roles of the component in the
relation.</p>
        <p>Concepts, relation and roles can be presented as a network with concepts as vertexes and
relation and roles as edges.
2.3 A formal model for natural language statements representation</p>
      </sec>
      <sec id="sec-1-3">
        <title>Let us consider an algebraic system:</title>
        <p>
          W = (D, S, ℑ , R, F), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
where,
S is some set that we shall call the set of names of objects,
R - а family of relations on S х S,
D - а universe of sets D = {D1,Dm, ... ,Dп},
where еасh set Di is said to be the set of attributes, and for еасh name s ∈ S а subset
Δ ∈ ℑ of tuples from Cartesian product Dk = D1 х Dm ... x Dп of sets from D is related, which
is referred to as the extensional or volume of the object called s.
        </p>
        <p>In this case а pair е = &lt; s, Δ &gt; is called an event with the namе s and extensional Δ . Eасh
tuple δ ∈ Δ represents а unit concept that is а concept related to а unit namе.</p>
        <p>Let F bе а family of functions {f1 , f2, ... ,fm}, which mарs Cartesian products Dk = D1 х
k
D m x D п of the sets from D into some of the sets Dj from D, so that for еасh tuple δ ∈ D ,
а suitable function f from F relates an element f (δ ) from Dj. То put it another way, the
functions from F are some methods defined in an application domain that are for computing
the values of certain attributes, given the values of others.</p>
        <p>
          Let us consider relations from R as given not on the set of names S, but rather on events
&lt;Sj, Δ j&gt; , so that а pair of events &lt;&lt; S1 , Δ 1 &gt;, &lt; S2, Δ 2 &gt;&gt; belongs to а relation Rm from R, if
and only if the pair of соrrеsponding names (s1 ,s2) belongs to the same relation, then (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) takes
the following form:
        </p>
        <p>
          W = &lt;D, S, R, F&gt; (
          <xref ref-type="bibr" rid="ref2">2</xref>
          )
where S is the set of pairs of kind &lt;sj, Δ j&gt;.
        </p>
        <p>For definition of relations Rm from R see [1][2] where соrrеspondеnсе between the set of
relations R and semantic relations of the natural language established.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Linguistic processing: constructing semantic networks from text</title>
      <p>When forming a discourse, syntax deals above all with meaningful units bearing not only
their individual lexical meaning, but also generalized categorical meaning in constructions of
various complexity. These units called syntaxemes. Syntaxemes are minimal indivisible
semantic-syntactic structures of language [6] [7]. Syntaxemes are detected taking into
account: a) categorical semantics of the word; b) morphological form; c) function in the
sentence (according to the constructive capabilities within a sentence there are 3 functional
types of syntaxemes: free/conventional/bound).</p>
      <p>Categorical semantics is a generalized meaning characterizing words that belong to the
same categorical class (for instance, to the class of people, things, attributes for nouns).</p>
      <p>Let us consider some examples of different syntaxemes in Russian that are similar in form,
but different in meaning:
1. Мать привела сына к школе (The mother brings her son to school).
2. Ошибка врача привела к смерти (The medical error caused the death).</p>
      <p>
        In (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) ‘to school’ denotes a spatial noun with the directive meaning (direction of
movement), it is a free syntaxeme, since its meaning does not depend on the position in a
sentence. In (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) ‘death’ denotes an attributive noun meaning logical consequence. It is a
conventional syntaxeme, since its meaning is realized only in a certain complicative model in
the position of a semi-predicative complicator of the model.
      </p>
      <p>In a particular discourse, in a particular sentence of a query a word performs as a
syntaxeme, i.e. has a certain syntactical function, in a certain grammatical form, it realizes
only one of the possible meanings which this word can take in this sentence/phrase. The main
task of the semantic analysis is to reveal semantic meanings of syntaxemes and relations
between syntaxemes.</p>
      <p>Following this general overview let’s now consider the linguistic analysis process in our
system in more detail.</p>
      <p>Semantic processing of the discourse is made in three stages: morphological, syntactic
and semantic analysis itself. Each stage is fulfilled by a separate analyzer with its input and
output data and its own settings.</p>
      <sec id="sec-2-1">
        <title>3.1 Morphological analysis</title>
        <p>At the stage of morphological analysis words and separators are recognized in the discourse.
The list of all possible grammatical forms based on the word morphology is defined for each
word. Word forms corresponding to the same normal dictionary form of the word and to the
same part of speech and those in the same number (singular or plural) (for parts of speech that
can change the number) will be classified into groups which will be further called lexemes
(though they are not lexemes in the strict linguistic sense).</p>
        <p>Obviously, a number of such lexemes can correspond to the same word. In order to reduce
the number of the resulting variants of the sentence, the morphological analyzer has a filter:
for every part of speech it can be defined whether or not it will be taken into account in the
further analysis. The settings allow to ignore interjections and particles by default if there are
variants of the word belonging to another part of speech.</p>
        <p>In the output we get a list of sentences each of them being a list of words, and each word,
in its turn, being a list of lexemes.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Syntactic analysis</title>
        <p>The main task of the syntactic analysis is to establish syntactic dependencies between lexemes
defined at the previous stage. In particular, syntactic analysis is made for extraction of
minimal semantic-syntactic structures (syntaxemes).</p>
        <p>The syntactic analysis can be done within one sentence. Compound sentences are split
into simple clauses which are further processed as separate sentences. A list of variants is
composed for all sentences acquired in the output of morphological analysis so that each word
in each sentence variant has only one lexeme. Since the number of sentence variants is equal
to the product of the number of lexemes for each word, the task of limiting the number of
variants is apparent. To do so heuristics allowing to reject obviously incorrect variants are
applied. Besides, a maximum allowable number of variants can be chosen in the syntactic
analyzer settings.</p>
        <p>And then the algorithm for subordinating syntactic relations discovery should be applied
to each variant. As a result, lexemes are being structured into dependency trees: the lexeme at
the parent node governs all child node lexemes. A minimal semantic-syntactic structure
(syntaxeme) is a tree the root of which is a noun or a preposition that governs the noun. It
should be noted that proper names also belong to nouns. A noun phrase is any syntaxeme
subtree into which the root noun is included.</p>
        <p>Besides searching for revealing syntactic dependencies the syntactic analysis detects
homogeneous parts. Thus at this stage two types of relations between lexemes can be
detected: government and homogeneity. Every time when the program detects some syntactic
relation, the relative weight of the sentence variant is increased. At the end of the sentence
analysis only variants with the maximal weight are kept.</p>
        <p>Thus the syntactic analysis input is a sentence acquired at the output of the morphological
analysis. The output is a sentence in the form of a list of variants, each representing a list of
dependency trees (list of syntaxemes).</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Semantic analysis</title>
        <p>The main task of the semantic analysis or, to be more exact, the semantic analysis described
below (it can be deepened in the case of a properly knowledge base of the domain), is to
reveal semantic meanings of syntaxemes and relations on a set of syntaxemes. In general, a
semantic relation is understood as a relation of concepts in the conceptual system of the
domain (see section 2). The representatives of semantic relations in lexis are predicate words,
i.e. lexemes representing predicates. The verbs are predominant here, since, more often than
not, they hold the central position in the semantic structure of the sentence and that influence
noun phrases and sentences. The information on syntactic compatibility of every verb is
recorded in special tables of relations. The tables of relations for each verb indicate types of
relations between its syntaxemes (arguments).</p>
        <p>A semantic search image consists of an ordered map of triples: &lt;relation, arg1, arg2&gt;,
where &lt;relation&gt; denotes the semantic relation type, and &lt;args&gt; are syntaxemes, i.e.
dependency trees for corresponding NPs or PPs. A semantic image can also be presented as a
semantic graph with syntaxemes as vertexes and relations as edges.</p>
        <p>To do the semantic analysis, first of all it is necessary to extract predicate words
(predicators). If a verb is a predicator in the sentence, it can be detected immediately at the
stage of morphological analysis. In other cases (when the predicator is a participle, a verbal
noun, etc.) additional rules are applied.</p>
        <p>When the predicate word and NPs related to it are detected, the arguments in the
predicator syntaxemes structures should be filled up. The filling is made by using special
linguistic dictionaries where a certain set of syntaxemes is associated with each predicator.
Some rules of context and domain consideration are used to remove polysemy.</p>
        <p>Besides, the dictionary indicates how NPs are interrelated within predicate argument
structures. A set of binary relations within the set of syntaxemes is also specific for each type
of predicate words and is defined a priori. The dictionary was developed by linguists and
covers almost 95% of mostly frequent Russian verbs.</p>
        <p>Let’s consider the structure of the linguistic dictionary on an example for predicate ‘love’.
Predicate= love
Meaning = subject
Syntaxeme = no preposition + subjective case
Categorical class = personal
Meaning = object
Syntaxeme = no preposition + accusative case
Categorical class = any
Meaning = causative
Syntaxeme = for + accusative case
Categorical class = attribute
Relation = CAUS
Syntaxeme1 = subject
Syntaxeme2 = causative</p>
        <p>The collection of NPs and binary relations between them is presented in the form of a
semantic graph describing the situation in the neighborhood of one predicator. This graph is a
fragment of the semantic network describing the semantic image of the whole text.</p>
        <p>Every time when a syntaxeme fills up the predicator argument or when two syntaxemes
correspond to a semantic relation, the program increases the weight of the sentence variant.
Hence in case of simultaneous syntactic and semantic analysis the “heaviest” variants from
the point of view of both syntactic and semantic relations are chosen. That is why
simultaneous analysis is not equal to sequential analysis: in the latter case variants with the
greatest number of syntactic dependencies are first selected and then those ones among them
are chosen in which the argument structure of predicators is filled up in a better way and more
semantic relations are found. If a verb is polysemantic (i.e. there are several entries for one
word in the dictionary), then all variants are considered one by one. Those variants are finally
selected at the further stages of analysis, where syntaxeme meanings for a greater number of
NPs of the fragment are found, and where categorial semantics attributes worked the most
frequently. If there are still more than one equivalent variants again, then the variant with the
maximum ratio of the number of syntaxemes found in the sentence to the total number of
syntaxemes described in the given dictionary entry is to be chosen (i.e. the variant with the
best (complete) verb argument structure filling).</p>
        <p>Participial phrases and adverbial participial phrases are processed after the corresponding
main clauses. The subject of the main clause becomes the subject of an adverbial participial
phrase. The candidates for the subject/object of a participial phrase are the nearest NPs, whose
roots agree with the participle in gender, number and case.</p>
        <p>Syntaxemes are also searched for interrogative words like who, what, where, why, when,
how, how much, what for, at what time, etc. A special attribute is assigned to the found
meanings and when comparing the query with the document (during relevance calculation),
this attribute will allow to coincide interrogative words with any NP with the same meaning
in the document.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Using semantic networks in search</title>
      <p>As we said above search by keywords often does not satisfy the main requirement of the user,
namely the requirement of semantic relevance of the found documents to the query, even in
spite of the fact that all key words of the query are present in these found documents [8].</p>
      <p>The main idea of semantic search is semantic matching of user query with searched
documents. Semantic search involves generation of semantic images of documents and
queries. The semantic image in this case is presented as the semantic network so the semantic
matching consists in comparison of networks vertex by vertex, role by role and relation by
relation. In the result the semantic relevance is calculated that allows removing the documents
obviously not semantically corresponding to the search query. Figure 1 shows examples of the
semantic images of the query and the document. In this example the query includes question
word ‘who’ that is subject in relation ‘Des’ with destination ‘Brussels’. The document
contains subject ‘President of Russian Federation’ and destination ‘Brussels’ which are
connected with ‘Des’, so the document is semantically relevant to the query and in essence is
the answer to the question.</p>
      <p>Semantic analysis essentially enhances the precision and recall of the search and decreases
the number of irrelevant documents returned in the result of the search.</p>
    </sec>
    <sec id="sec-4">
      <title>5 Experimental results</title>
      <p>The experimental search algorithm integrating traditional statistical approaches to search and
linguistic processing based on principals performed above was implemented in the search
engine EXACTUS and tested in ROMIP seminar in 2008 [9], [10]. In many respects ROMIP
seminars are similar to other world information retrieval events such as TREC [11], CLEF,
NTCIR, etc. Similar to TREC ROMIP has cycle nature and is overseen by a program
committee consisting of representatives from academia and industry.</p>
      <p>Given a collection of documents and tasks participants run their own system on the data
and submit results to the organizing committee. Collected results are independently judged
and the cycle ends with a workshop for discussing results and sharing experience.</p>
      <p>Widely known metrics are used for evaluation in ROMIP:
• Precision;
• Precision at level 5 and 10;
• Average precision;
• Recall;
• 11-point TREC precision-recall graph;
• Bpref.</p>
      <p>In 2008 EXACTUS was tested on collection of Belarusian websites and showed the best
results for all metrics in OR-evaluation and for almost all metrics in AND-evaluation. Figure
0,7
0,6
0,5
no 0,4
i
s
i
c
reP0,3
0,2
0,1
0
Series1
Series2
Series3
Exactus1
Series5
Series6
Series7
Series8
Series9
Exactus2
Series11
Series12
Series13
Series14
Series15
2 shows TREC 11-point precision/recall graph for search algorithms EXACTUS and other
participants. It is seen that EXACTUS algorithm applying semantic processing gives the
highest precision/recall values. 37% of evaluated queries contain semantic information (i.e.
semantic meanings and relations for words) so we can state that the semantic analysis makes
important contribution to the search results.</p>
      <p>0
0,1
0,2
0,3
0,4</p>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusion and future work</title>
      <p>We presented an approach to semantic search elaborated within an intelligent search system
EXACTUS. The experiments show the advantage of using linguistic methods together with
statistical methods of search for improvement of search quality.</p>
      <p>Subjects of thorough research are methods of revealing implicit relations as well as
anaphora resolution in case of multiple variants requiring semantic filtering of syntaxemes.</p>
      <p>At present the English version of the system is being developed and the possibility of
multilingual search and translation within the system is being investigated. The Russian
prototype of the intelligent search engine is available at www.exactus.ru.
[3] I.A.Meltchuk, Ехреrienсе in the theory of linguistic models "Meaning-Text". Moscow:</p>
      <p>Nauka, 1974. (in Russian)
[4] Y.Wilks, Preference semantics, Кeenan (ed.) "Fonnal semantics оf natura1 language",</p>
      <p>Cambridge, England: Cambridge University Press, 1975.
[5] R.Schank, Conceptual infonnation processing. Amsterdam: North-Holland, 1975.
[6] G. Zolotova, N. Onipenko, M. Sidorova. Communicative grammar of Russian language.</p>
      <p>Мoscow, 2004. (in Russian)
[7] G.A.Zolotova. Syntactic dictionary: Repertory of elementary units of Russian Syntax.</p>
      <p>Moscow: Nauka, 1988. (in Russian)
[8] Gennady Osipov, Ivan Smirnov, Ilya Tikhomirov. Application of Linguistic Knowledge to
Search Precision Improvement. // Proceedings of 4th International IEEE conference on
Intelligent Systems 2008. Volume 2. - P. 17-2 - 17-5.
[9] Boris Dobrov, Igor Kuralenok, Igor Nekrestyanov, Ilya Segalovich Russian Information
Retrieval Evaluation Seminar, LREC'04, february 2004 //
http://www.romip.ru/docs/lrec2004-final.pdf
[10] ROMIP: Russian Information Retrieval Evaluation Seminar.</p>
      <p>http://www.romip.ru/en/index.html
[11] TREC: Text Retrieval Conference. http://trec.nist.gov/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Osipov</surname>
          </string-name>
          .
          <source>Semantic Types of Natural Language Statements. A Method of Representation. //10th IEEE International Symposium on Intelligent Control</source>
          <year>1995</year>
          , Monterey, California, USA, Aug.
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Osipov</surname>
          </string-name>
          .
          <source>Methods for Extracting Semantic Types of Natural Language Statements from Texts. //10th IEEE International Symposium on Intelligent Control</source>
          <year>1995</year>
          , Monterey, California, USA, Aug.
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>