<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Network Information Navigation Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergei Bel'kov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergei Goldstein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ural Federal University</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>22</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Attention is paid to the problems of automatic document search by search engines, the analysis of documents, and the use and development of network resources such as thesauri and ontologies. Some proposals are also formulated for extending the conceptual model, motivated by the need to reduce the set of documents found by a search engine to a set of relevant documents.</p>
      </abstract>
      <kwd-group>
        <kwd>search engines</kwd>
        <kwd>query optimization</kwd>
        <kwd>analysis of documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The number of ways in which we use the Internet nowadays is extensive. However, the algorithms involved are largely unformalized, and many problems here remain unresolved. The possibility of improving this situation is therefore important.</p>
      <p>In complex cases we have a rather complicated query, and the output is a set of retrieved documents, many of which cannot physically be viewed, duplicate other documents, or are not useful for our tasks. The traditional search scheme can be written as:
SP =&lt; Q, SE, DOC &gt;,
(1)
where Q is the set of queries; SE is the set of search engines; DOC is the resulting set of links to documents (hereafter, documents).</p>
      <p>A query q usually consists of a list of simple keywords or phrases combined as a disjunction of conjuncts, i.e. in disjunctive normal form.</p>
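      <p>A minimal sketch of such a query, assuming a keyword-set view of documents (the query terms and sample texts below are hypothetical illustrations, not taken from the paper):</p>

```python
# A DNF query is a list of conjuncts; a document matches if every keyword
# of at least one conjunct occurs in its text.
def matches(dnf_query, document_text):
    """Return True if the document satisfies any conjunct of the DNF query."""
    words = set(document_text.lower().split())
    return any(all(kw in words for kw in conjunct) for conjunct in dnf_query)

# (search AND engine) OR (query AND optimization)
q = [("search", "engine"), ("query", "optimization")]

print(matches(q, "A survey of search engine design"))    # True
print(matches(q, "Notes on query optimization in SQL"))  # True
print(matches(q, "An essay on gardening"))               # False
```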
      <p>The search methods hidden within specific search engines are usually not obvious to the user.</p>
      <p>In addition, the retrieved documents can be presented in different formats (txt, doc, pdf, ps, djvu, html, xml, and others). Also, the sets of documents obtained from different search engines in response to the same query may vary substantially.</p>
      <p>This raises the following tasks: selecting the most effective (in terms of the search target) search engine; optimizing the structure of the query; selecting from the set of retrieved documents only those that best meet the targets of the search.</p>
      <p>Tasks associated with optimizing the structure of the query and reducing the set of retrieved documents are usually beyond the capabilities of search engines.</p>
      <p>To resolve some of these problems we suggest introducing feedback into the traditional search scheme (Fig. 1).
With this in mind, the search model takes the following form:
SP =&lt; Q, SE, DOC, DSM, DS, MA, A, SES, MO, QO, RDOC &gt;,
(2)
where the first three (traditional) components were given above; DSM denotes the methods of subset selection for document analysis; DS is the document selection procedure; MA denotes the methods of analysis for the selected subset; A is the analysis procedure; SES is the selection or change of the search engine; MO denotes the query-structure optimization methods; QO is the optimization procedure; RDOC is the resulting set of relevant documents.</p>
      <p>Considering some of these components separately, we may also suggest some useful formalisms.</p>
      <p>A serious problem for the analysis may be the large dimension of the set of documents at the search engine output. Restricting the sample to be analyzed may rely on random selection of documents, involve experts, or require the development of additional procedures.</p>
      <p>To select a specific search engine, we may write:</p>
      <p>SSk = Fsel(SE, Csel),
(3)
where Fsel is the selection function; SE is the set of available search engines; Csel are the selection criteria.</p>
      <p>The result of the analysis of the set of documents obtained by applying the k-th search engine is:</p>
      <p>Rk = Fa(DOCk, Ca, Ma),
(4)
where Fa is the analysis function; DOCk is the set of retrieved documents; Ca are the analysis criteria; Ma are the analysis methods.</p>
      <p>We also introduce the concept of an optimal query:</p>
      <p>Qopt = Fopt(Rk),
(5)
where Fopt is the query-structure optimization function (based, for example, on a graph of connections between keywords).</p>
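      <p>As an illustration, the tuple model (2) and the engine-selection function Fsel can be sketched as plain data structures. Only a few components are shown; the engine names and the toy selection criterion are hypothetical placeholders, not part of the paper's model:</p>

```python
# Sketch of the extended search-process tuple SP as a plain data structure;
# component names follow the paper, defaults are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchProcess:
    Q: List[str]                                    # queries
    SE: List[str]                                   # available search engines
    DOC: List[str] = field(default_factory=list)    # retrieved documents
    DSM: str = "random"            # method of selecting a subset for analysis
    MA: str = "vector"             # analysis method for the selected subset
    MO: str = "keyword-graph"      # query-structure optimization method
    RDOC: List[str] = field(default_factory=list)   # resulting relevant documents

def select_engine(SE, criterion):
    """F_sel: pick the engine that maximizes the selection criterion."""
    return max(SE, key=criterion)

sp = SearchProcess(Q=["network navigation"], SE=["engineA", "metaEngineB"])
best = select_engine(sp.SE, criterion=len)   # toy criterion: longest name
print(best)
```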
      <p>Often the set of found documents DOCk is too large (typically tens of thousands). Therefore, one optimality criterion is to reduce the number of documents returned by a query. Other criteria can be adequacy to the search target and completeness of topic coverage.</p>
      <p>With this in mind we may formulate the algorithm of informational (text) search (Fig. 3). It is an algorithm of the first level of decomposition.</p>
      <p>At first this may require some study of search models or search query languages.</p>
      <p>After that we may use, for example, one of the following search models: search by keys, wide primary search, random wide primary search, intellectual search, search by last heuristic, search by random walks, and other types of search.</p>
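      <p>One of the listed models, search by random walks, can be sketched as follows; the link graph, page texts, and keyword are hypothetical illustrations:</p>

```python
# Random-walk search: follow random out-links through a link graph,
# collecting pages whose text mentions the target keyword.
import random

def random_walk_search(graph, start, keyword, pages, steps=200, seed=0):
    """Walk random out-links for a fixed number of steps; return matching pages."""
    rng = random.Random(seed)   # seeded for reproducibility
    found, node = set(), start
    for _ in range(steps):
        if keyword in pages[node]:
            found.add(node)
        neighbors = graph.get(node) or [start]   # restart on dead ends
        node = rng.choice(neighbors)
    return found

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = {"a": "home page", "b": "survey of search engines", "c": "query optimization"}
print(random_walk_search(graph, "a", "search", pages))
```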
      <p>The obtained results may be divided into several groups depending on different criteria or search characteristics.</p>
      <p>Working with the set of found documents demands methods of document analysis. During the analysis of the set of documents the following tasks may appear:
– identifying the documents most similar to the search aims. These may be, for example, a number of documents taken from the beginning of the set (for some search engines these are usually the most relevant to the purpose of the request); more specialized procedures may also be applied here (for example, taking documents of one presentation format);
– dividing the set of documents into groups (for example: unimportant, secondary importance, and high importance documents), areas, or classes.</p>
      <p>The analysis uses a set of keywords or phrases (terms) that are present in the documents. Some of these terms are also present in the query q. The set of keywords describing a document constitutes the image of the document.</p>
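      <p>A minimal sketch of a document image as a keyword set and its overlap with the query terms; the stop-word list and sample text are illustrative assumptions:</p>

```python
# The "image" of a document: its set of terms after removing stop words.
STOP = {"a", "an", "the", "of", "in", "and", "to"}

def image(text):
    """Keyword-set image of a document."""
    return {w for w in text.lower().split() if w not in STOP}

doc = "Optimization of the structure of the search query"
q   = {"search", "query", "optimization"}

img = image(doc)
print(img & q)   # terms the document shares with the query
```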
      <p>
        For a domain we have a dictionary consisting of terms. To determine the degree of connection between two documents, the mathematical apparatus of the following models may be applied: Boolean, extended Boolean, vector, fuzzy logical, probabilistic [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Nevertheless, a direct comparison of these methods is difficult and requires the development of additional mathematical apparatus. In more complex cases the dictionary is transformed into a thesaurus or an ontology. For hypertext, some special form patterns may be used [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
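      <p>For instance, the vector model mentioned above can be sketched with term-frequency vectors compared by cosine similarity; the sample texts are hypothetical, and the Boolean, fuzzy, and probabilistic variants would change only the weighting and the comparison function:</p>

```python
# Vector model: documents as term-frequency vectors over the dictionary,
# compared by cosine similarity.
import math
from collections import Counter

def cosine(doc1, doc2):
    """Cosine similarity between two documents' term-frequency vectors."""
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

a = "search engine query optimization"
b = "query optimization for search"
c = "history of gardening"
print(cosine(a, b) > cosine(a, c))   # related documents score higher
```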
      <p>The resulting document images allow us to move to the problem of classification: either against images of reference documents (supervised learning) or by clustering the documents when no reference images exist (learning without a teacher).</p>
      <p>The resulting matrix of pairwise document proximity allows us to proceed to their classification or clustering. Thus we have the following tasks: exclusion of uninformative (in terms of the search target) documents (information noise); elimination of duplicate documents; partition (classification) of the set of documents into two (important, unimportant) or three main categories (low, medium, and high degree of importance); clustering proper, as a partition of the set of documents into groups according to the properties of their images (feature vectors).</p>
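      <p>A sketch of the proximity matrix and of duplicate elimination, assuming Jaccard similarity on keyword-set images; the documents and the threshold are illustrative:</p>

```python
# Pairwise proximity matrix over document images (Jaccard on keyword sets),
# then greedy duplicate elimination above a similarity threshold.
def jaccard(s1, s2):
    """Jaccard similarity of two keyword sets."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

docs = {
    "d1": {"search", "engine", "query"},
    "d2": {"search", "engine", "query"},   # duplicate of d1
    "d3": {"thesaurus", "ontology"},
}

matrix = {(i, j): jaccard(si, sj)
          for i, si in docs.items() for j, sj in docs.items()}

DUP_THRESHOLD = 0.9
kept = []
for name, img in docs.items():
    if not any(jaccard(img, docs[k]) >= DUP_THRESHOLD for k in kept):
        kept.append(name)
print(kept)   # d2 is removed as a duplicate of d1
```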
      <p>Many of the documents can be excluded on the basis of viewing only their Title or Abstract. Thus three levels of consideration are present: primary, main, and additional analysis. Turning to the image of the document as a set of keywords, we also have several levels of keyword analysis: top (from the query), middle (from the Abstract), and low (from the text content, i.e. known and new keywords).</p>
      <p>Thus, after analyzing the problems arising in modern network navigation, we propose to complement existing search engines with several additional units, in particular units helping to optimize the structure of the query and to limit the set of relevant documents.</p>
      <p>We plan to consider them in detail in our further studies.</p>
      <p>Annotation. The paper presents a list of the main components of information navigation on the Internet. Questions of optimizing search queries and the search process itself are considered. An extended tuple model of search is presented. Several useful formalisms are proposed. Keywords: search engines, query optimization, analysis of documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lande</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snarskii</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bezsudnov</surname>
            ,
            <given-names>I. V.</given-names>
          </string-name>
          :
          <article-title>Internetika: navigation in complex networks</article-title>
          .
          <source>Librokom</source>
          , Moscow (
          <year>2009</year>
          )
          (in Russian)
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Belkov</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          :
          <article-title>Representation of materials of text and hypertext sources by net of patterns</article-title>
          .
          <source>J. Informational Technologies</source>
          .
          <volume>1</volume>
          (
          <issue>161</issue>
          ), p.
          <fpage>29</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>