<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ON THE MULTIPLE-ASPECT APPROACH TO THE POSSIBLE TECHNIQUE FOR DETERMINATION OF THE AUTHOR'S LITERARY STYLE*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boris Melnikov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svetlana Pivneva</string-name>
        </contrib>
      </contrib-group>
      <fpage>311</fpage>
      <lpage>315</lpage>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Мультиэвристический
литературный стиль.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction and Preliminaries</title>
      <p>подход;
специальный
алгоритм
кластеризации;</p>
      <p>We consider in this paper our approach to the determination of the author's literary style. For now,
we are working with Russian and English classical texts and applying our algorithms only to them. In the
nearest future, we are going to consider also German classical texts.</p>
      <p>For now, we only began such works; we are accumulating the statistical data. Therefore we will not
publish the obtained concrete preliminary results. However, these preliminary results are very promising,
and we hope to publish them in the next papers. And it is important to remark, that these preliminary results
are promising for both Russian and English literary texts. Thus, our paper can be considered as the detailed
description of our approach to this problem (which can be considered as a problem of artificial intelligence),
and such approach was already applied in some other problems of artificial intelligence.</p>
      <p>
        Despite such questions (i.e., the automatic determination of the author's literary style) were
considered since at least 1916 ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), author does not know the papers where the literary style is determined
by the investigation of grammatical structures.
      </p>
      <p>
        However, there were a lot of papers where a lot of approaches to considering grammatical
structures were described; but after [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (1993), there was very likely no good reviews of such papers.
However, author does not know papers where the determination of the author's literary style was
considered using grammatical structures.
      </p>
      <p>
        In Russian*, such possible techniques were reviewed in [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ]. However, only some statistical
methods were considered in the techniques reviewed there. Let us remark that in the well-known work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
(also related to the Russian literary texts), also only some statistical methods for the words were considered.
Below, we shall call such methods by "usual" ones.
      </p>
      <p>And we propose in this paper a multi-aspect approach to this problem. Our approach includes the
following stages:</p>
      <p>A “manual” obtaining some characteristics (descriptions) of the considered texts. Remark that some
other aspects of our approach can (i.e., its stages considered below) be considered as automatically
obtaining such characteristics. Applying the "manual" characteristics to the special representation of
considered texts.</p>
      <p>Obtaining the “simple” representation of the text (i.e., representation for “usual” algorithms).</p>
      <p>
        Obtaining the representation of the considered text using so called π-subclasses. It is a special
description of the grammatical structures. This description can be considered as an alternative to the
description using Chomsky hierarchy. We already applied such description in some problems of formal
languages theory ([
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]), and began to apply it in this problem. Most of author's results connected with such
representation were published in Russian; in English, some its applications (for the representation of some
formal languages, at first, simple programming languages) and also the references can be found in [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ].
      </p>
      <p>
        Using special archiver for all the obtained representations. For now for its description (i.e., for
algorithms of archiving and some its practical results), we have a master thesis only [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]; we are going to
publish these results in the nearest future. Let us remark, that the goal of considering algorithm of this
archiver is not the data compression, but automatically obtaining some characteristics of considered texts.
      </p>
      <p>
        For each used representations, applying multiheuristical approach (solving appeared discrete
optimization problems) to special comparing of two long string representing two considered texts. See [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
for the detailed description of such approach.†
      </p>
      <p>
        Using special clustering algorithm, i.e., our version of hierarchical clustering algorithm. Clustering
can be used to obtain some texts which seem to belong to the same author. This version was of hierarchy
clustering algorithm was published only in Russian [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>In the next sections, we shall describe some stages of our approach (i.e., its components which are
interested for the subject of this paper) some more detailed.</p>
    </sec>
    <sec id="sec-3">
      <title>A “manual” obtaining some characteristics of the considered texts</title>
      <p>This section is “most informal”, and, therefore, we cannot describe it detailed. Really, the possible
characteristics (i.e., descriptions) of the considered texts strongly depend on the used language. Therefore
we can consider the obtaining such descriptions as a new sub-problems of artificial intelligence, which are
similar to the problems of obtaining knowledge from the expert and formalization these knowledge in some
expert systems (we mean rule-based approach here).</p>
      <p>Thus, our characteristics could be very different for different considered languages. We shall
consider briefly only two used examples of such characteristics used in our computer programs for Russian
literary texts.</p>
      <p>
        One of them is using phraseological stock phrases. However by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],‡ “much more often, in the
author’s texts there are no pronounced words and word constructions, no prominent stock phrases”. But we
already began to analyze this characteristic.
      </p>
      <p>And the second possible characteristic is the relative frequency of using loan words.</p>
      <p>
        In both such cases, we need some dictionaries (data bases of corresponding text structures) for
obtaining such characteristics. Moreover, in the first case, we need something like data base using our
representation of the grammar structure.§ But let us remark, that working with such data bases cannot be
* We mean here both Russian scientific papers and Russian language for determination of the author's literary style.
† After [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (2005), author has published some other papers on this multiheuristical approach. But it is not important for the subject
considered here, i.e., for the possible techniques for determination of the author's literary style
‡ Remember that [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is devoted to Russian texts.
§ See also Section 3.
considered as working with “simplest statistical method” which was criticized before.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Obtaining the representation using π-subclasses</title>
      <p>
        As we said before, we use a special description of the grammatical structures, which can be
considered as an alternative to the description using Chomsky hierarchy. We have already applied such
description in some problems of formal languages theory ([
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]), and began to apply it in this problem. Most
of author's results connected with such representation were published in Russian; in English, some its
applications (for the representation of some formal languages, at first, simple programming languages) and
also the references can be found in [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ].
      </p>
      <p>
        The next example* can be considered as a connection (a “bridge”) between using π-subclasses in
formal and natural languages. Let us consider grammatical structure of programming language Pascal,
which defines compound statement. We shall use special brackets ’( and )’ to denote iterations and ε for the
empty word. The other designations (i.e., A, B, L, μ and Ψ) were defined in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Thus, let
[compound statement] ::= begin [sequence of statements] end
[sequence of statements] ::= [statement] ’( ;[statement] )’
We can define here
L = { begin [sequence of statements] end }
A = { begin ’( [statement]; )’ ε }
B = { ε ’( ;[statement] )’ end }
      </p>
      <p>Remark that we succeeded in describing the language L without the productions for the
nonterminal</p>
      <p>
        [sequence of statements]. And then the language defined by the construction [compound
statement] is Ψ*(μ,L). Remark that all the needed conditions (i.e., conditions for Proposition 2 of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) hold.
      </p>
      <p>Thus, this example demonstrates a successful applying of π-subclasses. And the same constructions
can be used in natural languages (at first, in English). Without strict mathematical theory, the profit of
applying π-subclasses can be defined for English infinitive phrases (adjective, adverb or noun ones).
Recognizing them, we use the same π-subclasses as we considered in the example for formal programming
language.†</p>
      <p>And the similar acceptable representation can simply be also applied to the following English
phrases: present ones; past participial ones; gerund ones; prepositional ones; absolute ones; appositive
ones; and also adjective, adverb and noun subordinate clauses. All this structures can be successfully
represented using π-subclasses.‡</p>
    </sec>
    <sec id="sec-5">
      <title>A short description of special algorithm of archiving text</title>
      <p>In this section, we shall very briefly describe our simple algorithm of the text archiving. As we said
before, the goal of considering algorithm is not the data compression,§ but automatically obtaining some
characteristics of considered texts.</p>
      <p>Thus, let us consider the usual 8-bit letter representation as the 0th stage of the archiving.**
On the 1st stage we pass to the 9-bit representation. Using this passing, we use the high-order bit
to designate what we have in other 8 bits: the letter or the number of pairs. Such possible 256 pairs
represent the “best pairs” of the input data, i.e., the mostly used pairs in the considered text.††</p>
      <p>
        After that we pass to the 10-bit representation, etc. However, as we said before, the goal is not the
best comparing, but automatically obtaining some characteristics of considered texts. It is important to
remark, that like dealing with the algorithms of neural networks, we often “do not understand what our
algorithm makes”, i.e., we obtain the needed characteristics independently on the texts of our computer
programs. Thus, we formulated the main difference between “manual” and automatically obtaining the
* It can be considered as an modified and improved Example 2 of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
† Moreover, we use some simplified construction, because, for example, we do not need here the unlimited enclosing of considered
constructions.
‡ To finish this section, let us remark the following thing. Algorithms for working with π-subclasses can include auxiliary algorithms
for working with finite automata of special form. (Because π-subclasses can be considered as special representations of regular
formal languages.) And the most acceptable short representation of such automata can be constructed by the theory [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However,
these things are far from our subjects.
§ Or, rather, the goal of this paper is not the data compression. We have obtained some successful results for some groups of archived
data. We are going to publish such results in the nearest future.
** Certainly, for most languages we may use 5 bits (i.e., 32 letters) for this 0th stage. This possible alternative (i.e., whether we use the
recoding into the 5-bit letter representation and start from it) can be considered as an additional heuristic. Also we can use
arithmetical (Huffman) coding (see [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] etc), which also can be considered here as a heuristic. Remark that usually this coding is
used after the main algorithm, but we use this coding (a subsidiary transformation) before it.
†† Remember that we are working not only with the given texts, but also with some their representations.
characteristics of the given texts.
      </p>
      <p>In the compressing algorithms, this process finishes when the real compressing ends. And in our
case, there is better to choose the number of bits before the starting algorithm; but let us remark, that the
best choosing number (i.e., for obtaining best possible characteristics) is usually equal to the number when
the compressing ends.</p>
    </sec>
    <sec id="sec-6">
      <title>A short description of special clustering algorithm</title>
      <p>In this section, we consider a short description of our clustering algorithm; the used notation is
standard for such area. Thus, let Rij be minimal distance* between elements belonging to the clusters having
numbers i and j, and ri is the maximal distance between 2 elements of cluster number i. The standard
clustering algorithms and also heuristical algorithms for clustering quality usually use values
where f and g are special increasing functions (depending on concrete problem, on concrete
algorithm, etc), and i and j have all the possible values.</p>
      <p>Unlike standard algorithms, we use the same formula, but values ri have other sentences. Namely,
let:
i be the considered cluster;
m1,...,mn be all the its elements;
lmn be the distance between its elements having numbers m and n;
(M1,..., MN) be some finite sequence of the elements m1,...,mn (certainly, all these elements belong
to the considered cluster) and including each of them at least 1 time.</p>
      <p>Then we set</p>
      <p>Remark that the given definition is not an algorithm, but we can simply obtain some algorithms (for
obtaining ri,) constructed on its base.</p>
      <p>The practical programming† shows the acceptability of this algorithm in various problems. As we
said before, we use it to obtain some texts which seem to belong to the same author.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Let us consider a generalization of previous sections.</p>
      <p>By the author’s opinion, the main weakness of previous approaches for determination of the
author's literary style is the over- passion of syntactical analysis of the texts, and at first (as the main
characteristic of the considered texts) storing and processing trivial frequency statistic of the used words.
And we focus the main attention to the analysis of the morphological formal constructions of the author’s
sentences, at first, to the special characteristics of used morphemes. We also are going to continue the
described before multiheuristical approach to discrete optimization problems; we mean here its applying
to some sub-problems considered in this paper.</p>
      <p>Thus, we have considered the brief description of our method for the possible technique
for determination of the author's literary style. And, as we said before, we are going to publish some result
of practical programming in the next papers.</p>
      <p>
        References
* Let us also remark, that we have to have a priory defined metrics there. However, we have indirectly said about such possible metrics
in Introduction: we said that we apply our multiheuristical approach solving some discrete optimization problems to special
comparing of two long string representing two considered texts. And this comparing forms the needed metrics.
† See some references in [
        <xref ref-type="bibr" rid="ref11 ref14">11, 14</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Markov</surname>
          </string-name>
          :
          <article-title>On an application of statistical method</article-title>
          .
          <source>- Izvestia Imperatorskoy Akademii Nauk (Russian Academy of Sciences Ed.)</source>
          ,
          <year>1916</year>
          , Ser.VI, Vol.X, No.
          <volume>4</volume>
          ,
          <fpage>239</fpage>
          -
          <lpage>239</lpage>
          (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. E.Charniak:
          <article-title>Statistical language learning</article-title>
          . MIT Press,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. http://www.rusf.ru/books/analysis/history.htm (in Russian).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. http://www.computerra.ru/offline/2000/338/3010/ (in Russian).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Кjetsaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gustavsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Beckman</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Gil: The Authorship of The Quiet Don</article-title>
          . - Solum
          <string-name>
            <surname>Forlag</surname>
            <given-names>A.S.</given-names>
          </string-name>
          , Oslo and Humaities Press,
          <string-name>
            <surname>N.J.</surname>
          </string-name>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>O.</given-names>
            <surname>Dubasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Melnikov</surname>
          </string-name>
          :
          <article-title>On the generalization of the context-free languages class</article-title>
          . -
          <source>Progtammirovanie (Russian Academy of Sciences Ed.)</source>
          ,
          <year>1995</year>
          , No.
          <volume>6</volume>
          ,
          <fpage>46</fpage>
          -
          <lpage>58</lpage>
          (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Melnikov</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.Kashlakova:</surname>
          </string-name>
          <article-title>Some grammatical structures of programming languages as simple bracketed languages. - Informatica (Lithuanian Acad</article-title>
          .Sci. Ed.),
          <year>2000</year>
          , Vol.
          <volume>11</volume>
          , No.
          <volume>4</volume>
          ,
          <fpage>441</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>B.Melnikov:</surname>
          </string-name>
          <article-title>Some equivalence problems for free monoids and for subclasses of the CF-grammar class</article-title>
          .
          <source>- Number Theoretic and Algebraic Methods in Computer Science</source>
          , World Sci. Publ.,
          <year>1995</year>
          ,
          <fpage>125</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>O.</given-names>
            <surname>Sannikova</surname>
          </string-name>
          :
          <article-title>A special archiver</article-title>
          .
          <source>Master thesis</source>
          , Togliatti State University,
          <year>2008</year>
          (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>B.</given-names>
            <surname>Melnikov</surname>
          </string-name>
          .
          <article-title>Discrete optimization problems - some new heuristic approaches</article-title>
          .
          <source>- 8th Int. Conf. on High Performance Computing and Grid</source>
          , IEEE Comp. Soc. Press Ed.,
          <year>2005</year>
          ,
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>B.</given-names>
            <surname>Melnikov</surname>
          </string-name>
          , E.Melnikova:
          <article-title>Clustering situations in real-time algorithms for discrete optimization problems. - Sistemi upravleniya i informacionnie technologii (Institute of Control Problems of the Russian Academy</article-title>
          of Sciences Ed.),
          <year>2007</year>
          , Vol.
          <volume>28</volume>
          , No.
          <volume>2</volume>
          ,
          <fpage>16</fpage>
          -
          <lpage>19</lpage>
          (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>B.</given-names>
            <surname>Melnikov</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Sciarini-Guryanova: Possible edges of a finite automaton defining the given regular language</article-title>
          . - The
          <source>Korean Journal of Computational and Applied Mathematics</source>
          ,
          <year>2002</year>
          , Vol.
          <volume>9</volume>
          , No.
          <volume>2</volume>
          ,
          <fpage>475</fpage>
          -
          <lpage>485</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>D.Huffman</surname>
          </string-name>
          :
          <article-title>A method for the construction of minimum-redundancy codes</article-title>
          .
          <source>- Proc. Inst. Radio Engineers</source>
          ,
          <year>1952</year>
          , Vol.
          <volume>40</volume>
          , No.
          <volume>9</volume>
          ,
          <fpage>1098</fpage>
          -
          <lpage>1101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>B.</given-names>
            <surname>Melnikov</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Pivneva: Heuristic algorithms of decision-making in humanitarian areas. - News of the Samara centre of science of the Russian academy of sciences, 8 - Samara: Publishing house of the Samara centre of science of the Russian Academy</article-title>
          of Science,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>