<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Text Summarization of Chinese Legal Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmitry Lande</string-name>
          <email>dwlande@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zijiang Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shiwei Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianping Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Research Institute of Shandong Academy of Sciences</institution>
          ,
          <addr-line>Jinan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Information Recording of National Academy of Sciences of Ukraine</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>222</fpage>
      <lpage>238</lpage>
      <abstract>
        <p>The article is devoted to a method of automatic text summarization of legal information written in Chinese. The structure of the abstract and the model of its formation are considered. Two approaches are suggested. The first is to determine the weights of separate hieroglyphs, instead of words, in the texts of documents and abstracts when determining the importance level of sentences. The second is to consider a model of the document as a network of sentences and to detect the most important sentences by the parameters of this network. Various methods of automatic text summarization are implemented and tested. A cosine measure and the Jensen-Shannon divergence are applied as two estimates of the quality of the abstracts that do not require the participation of experts. Compared to other summarization methods, the one based on the suggested network model of the document was the best by the criteria of the cosine measure and the Jensen-Shannon distance for abstracts whose volume exceeds 2 sentences. The suggested approach, with minimal modifications, can be applied to texts on any subject of scientific, technical or news information.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Text Summarization</kwd>
        <kwd>Legal Information</kwd>
        <kwd>Chinese Language</kwd>
        <kwd>Cosine Measure</kwd>
        <kwd>Jensen-Shannon Divergence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The processing of natural languages practically began with the problems of machine
translation and automatic text summarization. The first fundamental works
on automatic text summarization appeared in the middle of the last century [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The task is connected with one of the most important problems – reducing the
volume of information a person must consume and fighting information noise. It is
highly relevant today due to the constant growth of the information space. Automatic
summarization is familiar to all users of web search engines: in response to a query
they receive not only the titles of documents but also their short, automatically created
descriptions (snippets). Mobile users want to see a brief description of an article
before they go on to read more. Persons who make important management decisions
have to familiarize themselves with thousands of documents a day, deliberately
discarding information noise.</p>
      <p>There are now hundreds of industrial systems of automatic text summarization, for
example, such packages as Office Word AutoSummarize, Mac OS X Summarize,
IBM Tivoli Monitoring Summarization and Pruning Agent, Oracle Text, and plug-ins
for the Chrome and Mozilla browsers.</p>
      <p>Numerous approaches to automatic text summarization are known; recently, neural
network technologies and deep learning have been applied more and more widely.
There are also numerous linguistic approaches based on automatic parsing of
sentences in different languages. The traditional type of automatic text summarization
system is extractive (quasi-summarization), in which the abstract consists of separate,
sometimes poorly connected, sentences of the initial document. It is being succeeded
by the abstractive type, in which systems close to artificial intelligence retell the
contents of the initial document "in their own words" in a short form.</p>
      <p>However, it should be noted that today practically all industrial systems of
automatic text summarization still belong to the extractive type.</p>
      <p>It would seem that the subject of automatic text summarization is already rather
well studied and the main results have been obtained. Nevertheless, this article is
about the creation of a new system of automatic text summarization.</p>
      <p>There are several reasons for developing a new system of automatic text
summarization. First, the problem being solved is automatic summarization of legal
information, and these are texts that cannot be considered fully free and unstructured:
separate types of documents have their own structure, and applying the best universal
summarization systems to them does not yield satisfactory results. Secondly, the
authors deal with documents written in Chinese, which significantly narrows the range
of possible ready-to-use systems. Processing Chinese texts, as a rule, requires word
segmentation, because in the Chinese language words are usually not separated by
delimiters.</p>
      <p>Thirdly, a program had to be developed that is capable of processing large data
flows in a corporate system with acceptable performance and quality and that can be
built into the existing document-flow system.</p>
      <p>Besides, retelling documents is unacceptable in this case: no "imagination" or
liberties in the computer's retelling of legal acts is admissible. The only way out is to
develop a hybrid algorithm, and accordingly a program, of the extractive type that is
capable of taking into account the features of legal acts of the People's Republic of
China. At the same time, the program has to be able to process separate documents
that are united into large documentary arrays. It has to be able to find explicitly
specified objects in the parts of documents marked with semantic markers, to reveal
the most important parts of documents (including by statistical criteria), to form
networks of sentences, and to output the required volume of target information into
the abstract.</p>
    </sec>
    <sec id="sec-2">
      <title>The Suggested Approach</title>
      <p>In addressing the problem, two approaches were proposed that can be considered
new in this area. To determine the level of importance of individual parts of
documents (in our case, sentences), it was suggested to compute weight values of
separate hieroglyphs, rather than words, in the texts of documents and abstracts. It
was also suggested to consider the document model as a network of sentences and to
identify the most important sentences from the parameters of this network. The
weight of the link between two sentences in this network is determined by the weights
of the common hieroglyphs they contain.</p>
      <p>
        Within the traditional statistical approach to natural language processing, the
weight of a sentence is usually calculated on the basis of the estimated weights of the
lexical units (words, phrases) it contains [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] - [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this study it is proposed to use separate hieroglyphs as such elements for the
Chinese language.
      </p>
      <p>The transition from the words considered in the classical model to hieroglyphs
avoids the relatively complex procedure of segmenting the text into words, which is
inevitable with all other meaningful methods of automatic analysis of Chinese texts.
Of course, this approach is not applicable to European languages, where the number
of different letters does not exceed several dozen. However, for the purpose of
automatic text summarization of Chinese texts, the proposed approach provides
acceptable results, as will be shown below.</p>
      <p>
        It is known that the Chinese language has more than 40 thousand hieroglyphs;
therefore, each of them (though not always fully reflecting a semantic unit) can be
assigned a weight value calculated by known formulas, for example,
TF·IDF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>TF  IDF (TF — term frequency, IDF — inverse document frequency) is a
statistical measure used to evaluate the importance of a word (in this case, not a word, but a
hieroglyph) in the context of a document that is part of an array of documents. The
weight of some hieroglyph is proportional to the number of its use in the document,
and is inversely proportional to the frequency of occurrence of this character in all
documents of array.</p>
      <p>Thus, the measure TF·IDF depends on the word t (hieroglyph), the document d,
and the whole array of documents D, and is a product of two factors:</p>
      <p>TF·IDF(t, d, D) = tf(t, d) · idf(t, D).</p>
      <p>Here tf(t, d) is the ratio of the number of occurrences of the given hieroglyph to
the total number of characters in the document (to the length of the document,
actually). Thus, the frequency of the hieroglyph within a single document is
estimated.</p>
      <p>The second factor, idf(t, D) (inverse document frequency), is the inversion of the
frequency with which the given hieroglyph occurs in the documents of the array D.</p>
      <p>Accounting for IDF reduces the weight of hieroglyphs that occur very often. There
is only one IDF value for each t within the entire array of documents D:</p>
      <p>idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| ).</p>
      <p>In addition, unlike classical approaches to defining the weight values of
sentences, a new, network model is proposed. Under this model, a non-directed
network is considered whose nodes are the separate sentences of the document; links
between nodes are established if the sentences have common hieroglyphs. The weight
of the link between two sentences is defined as the sum of the weights of the
hieroglyphs common to these sentences. In this network, the weight of each sentence
of the text is calculated as the sum of the weights of all links that emanate from its
node. Naturally, the sentence weights are then normalized, since without this
procedure long sentences would on average have a deliberately greater weight.
Practice has shown that dividing by the logarithm of the length of the corresponding
sentence is a good normalization.</p>
    </sec>
    <sec id="sec-3">
      <title>Automatic Text Summarization of Legal Information</title>
      <p>
        Extractive automatic text summarization procedures are based on determining the
weight values (degree of importance) of separate sentences, which, in turn, depend on
the weights of words. In this study, the classical criterion TF·IDF was used for word
weights, though it is not the only possible approach to the summarization problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Traditionally, two known algorithms were used for determining sentence weights:
in the first case, the weight of a sentence was taken as the sum of the weights of the
words it contains, normalized by the length of this sentence, and in the second case
the so-called symmetric summarization algorithm was used [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In that case the weight of a sentence was defined as the sum of the weights of its
links with the previous and subsequent sentences.
      </p>
      <p>In addition, this paper proposes a network algorithm which, unlike the second
case, calculates the relationships not only between adjacent sentences but between all
the sentences in the text. This approach, of course, is computationally more complex
than the first two but, as practice has shown, leads to better results. At the same time,
the complexity of the algorithm, in the case of the considered approach to
summarizing texts in Chinese, is compensated by the fact that instead of words
(whose segmentation is not required in this case) only separate characters are
considered.
So let us present the basic steps of the three considered algorithms for determining
the weight values of sentences:
Step 1. For each hieroglyph ti the value DF = df(ti, D) is calculated as the number of
documents dj from the documentary array D that contain this hieroglyph, that is</p>
      <p>DF := |{dj ∈ D : ti ∈ dj}|.</p>
      <p>Step 2. For each hieroglyph ti and document d, the frequency TF = tf(ti, d) of
occurrence of this term in the document is calculated:</p>
      <p>TF := #(ti ∈ d) / |d|.</p>
      <p>Then the hieroglyph weight is calculated as</p>
      <p>wi = TF·IDF = TF · log( |D| / DF ).</p>
      <p>Step 3. Segmentation of sentences: the text of the document is divided into
separate sentences pi, after which their weight values wpi are determined. Let us
introduce notation: let the sentence pi of the set of sentences P (pi ∈ P) consist of
hieroglyphs ti,k with weights wi,k. Let us write down in brief form the essence of the
three different algorithms.</p>
      <p>Step 4a. Algorithm of the sum of hieroglyph weights (Σ tf·idf):</p>
      <p>wpi = (1 / |pi|) Σ_{k=1..|pi|} wi,k.</p>
      <p>Step 4b. Symmetric algorithm calculating the strength of connection of the
sentence pi with the nearest sentences (Nearest):</p>
      <p>wpi = (1 / log |pi|) Σ_{k=1..T} ( wi,k · wi-1,k + wi,k · wi+1,k ),</p>
      <p>where T is the total set of hieroglyphs of the array. If a character is not present in
a given sentence, its weight there is taken to be zero.</p>
      <p>Step 4c. Network algorithm calculating the link strength of the sentence
(Network):</p>
      <p>wpi = (1 / log |pi|) Σ_{j=1..|P|, j≠i} Σ_{k=1..T} wi,k · wj,k.</p>
      <p>Step 5. The weight of the sentence is corrected depending on its location in the
document: the weight values of the initial and last sentences of the document are
artificially increased. It should be noted that the specifics of legal information and the
requirements on the structure and volume of the abstract allowed the above-mentioned
universal approaches to be used for the solution of this particular special task.</p>
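      <p>The three weighting schemes of Step 4 can be sketched over sparse hieroglyph
weight vectors as follows (a Python sketch; the function names and data layout are
illustrative assumptions, not the authors' code):</p>

```python
import math

def dot(u, v):
    """Scalar product of two sparse hieroglyph-weight vectors (dicts)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def score_sum(vecs, lengths):
    """Step 4a: sum of hieroglyph weights, normalized by sentence length."""
    return [sum(v.values()) / n for v, n in zip(vecs, lengths)]

def score_nearest(vecs, lengths):
    """Step 4b: link strength with the previous and next sentence only."""
    out = []
    for i, v in enumerate(vecs):
        s = dot(v, vecs[i - 1]) if i > 0 else 0.0
        s += dot(v, vecs[i + 1]) if i + 1 < len(vecs) else 0.0
        out.append(s / math.log(max(lengths[i], 2)))
    return out

def score_network(vecs, lengths):
    """Step 4c: link strength with every other sentence of the text."""
    return [sum(dot(v, w) for j, w in enumerate(vecs) if j != i)
            / math.log(max(lengths[i], 2))
            for i, v in enumerate(vecs)]
```

      <p>The `max(…, 2)` guard in the normalization is our own addition to avoid division
by log 1 = 0 for one-character sentences.</p>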
      <p>The structure and volume of the abstract of a legal document (examples of such
documents can be found on the website http://www.gov.cn/ in the section /zhengce)
are governed by requirements that have found their programmatic implementation:
1. The abstract starts with the title of the document, given almost without changes.
2. The abstract notes the type of the document (announcement "通告", report
"报告", results of work "工作成果", provisions "政策", etc.).
3. If the document indicates its purpose ("目的", "奖补目的", "调整目的",
"普查的目的和意义", etc.), it is also reflected in the abstract.
4. If the first or second sentence of the document identifies the subjects to whom
the document is addressed (which is also visible by special markers), such a sentence
is also included in the abstract.
5. If the title of the document or the designation of its purpose explicitly contains
objects from the list of previously known ones (included in the base table of
objects), these objects should be highlighted in the abstract.
6. If the document belongs to a type not subject to further processing (awards
"表彰", announcements of bids "招标", letters "函", etc.), the abstract is considered
prepared.
7. All sentences containing the objects selected from the title and purpose are
extracted from the text of the document. If there are fewer such sentences than the
required number (given in advance or calculated on the basis of the volume of the
document), they are presented in the abstract in the same sequence as in the primary
document. The abstract is considered prepared.
8. If there are more sentences than the required number, they are weighed according
to the algorithm above (based on the results of testing, the network algorithm was
selected). After that the sentences are ranked by weight and presented in the
abstract in the same sequence as in the primary document. The abstract is
considered prepared.</p>
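      <p>A simplified sketch of how rules 1, 6, 7 and 8 combine is given below. The marker
set is taken from the list above; the function, its signature, and the weighting callback
are hypothetical illustrations, not the production program:</p>

```python
# Markers of types not subject to further processing (rule 6): awards,
# announcements of bids, letters. The real system uses larger tables.
STOP_TYPES = ("表彰", "招标", "函")

def assemble_abstract(title, sentences, objects, required, weigh):
    """Sketch of rules 1, 6, 7 and 8: start with the title; stop for types
    not processed further; otherwise keep sentences containing the known
    objects, ranked by weight when there are too many."""
    abstract = [title]                              # rule 1
    if any(m in title for m in STOP_TYPES):         # rule 6
        return abstract
    hits = [(i, s) for i, s in enumerate(sentences)
            if any(obj in s for obj in objects)]    # rule 7
    if len(hits) > required:                        # rule 8
        hits = sorted(hits, key=lambda p: weigh(p[1]), reverse=True)[:required]
        hits.sort(key=lambda p: p[0])               # restore document order
    abstract.extend(s for _, s in hits)
    return abstract
```

      <p>In the real program the `weigh` callback would be the network sentence-scoring
algorithm selected during testing.</p>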
      <p>In accordance with the stated requirements, a program for automatic text
summarization of legal information written in Chinese was developed. The web
interface of this program is shown in Figure 1.</p>
    </sec>
    <sec id="sec-4">
      <title>Adjacent Tasks</title>
      <p>Automatic text summarization is one of the important problems of deep text
analysis technology, which also includes several other directions, such as information
extraction, the creation of word networks (Language Networks) reflecting the features
of subject domains, and cluster analysis.
The algorithm offered for summarization relies on a set of words prepared in advance
that reflect the main objects presented in legal documents (for example, "人口"
– population, "产业" – industry, "儿童" – children, etc.).</p>
      <p>At the same time, if the word-segmentation algorithm is applied and the resulting
words are ranked, it is easy to identify the most common "extensions" of the starting
objects, for example, to expand the concept of "organization" (组织) to the concepts
of "international organization" (国际组织) and "public organization" (社会组织), and
the concept "事业" to the concept of "people's air defense" (人民防空事业). As a
result, the documents of the array of legal information have been put in
correspondence with the basic concepts that can act as "keywords", descriptors, and a
basis for the construction of domain models.</p>
      <p>A word network whose nodes correspond to separate concepts can be considered
one type of domain model. The following simple rules for building this network, i.e.
rules of connection between nodes, were proposed and implemented:</p>
      <p>1. All objects from the base, pre-prepared list that are included in one document
are connected by links.
2. If two objects occur together in N different documents, the force of the link
between them equals N.
3. Concepts that are extensions of concepts from the starter set are linked with the
corresponding basic concepts.</p>
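      <p>These three rules can be sketched as follows (an illustrative Python sketch; the
function name and data layout are our own assumptions):</p>

```python
from collections import Counter
from itertools import combinations

def build_word_network(documents, base_objects, extensions):
    """Rules 1-3 above: base objects co-occurring in a document are linked,
    the link force equals the number of documents sharing both objects,
    and extended concepts are linked to their base concept."""
    edges = Counter()
    for doc in documents:
        present = sorted(o for o in base_objects if o in doc)
        for a, b in combinations(present, 2):        # rules 1 and 2
            edges[(a, b)] += 1
    for ext, base in extensions.items():             # rule 3
        edges[(ext, base)] += 1
    return edges
```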
      <p>
        The built network was visualized (Figure 2) with the help of the program Gephi (http://gephi.org) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and the following parameters of the network were obtained: number of
nodes: 3364 (number of objects from the starting set – 220); number of links: 10167;
network density: 0.001; number of connected components: 6; average path length:
3.013; average clustering coefficient: 0.859.
      </p>
      <p>
        The topology features of the built network include a very large average clustering
coefficient. This is due, on the one hand, to the large number of concepts connected
only to their parent concept (having no other neighbors) and, on the other hand, to the
strong cohesion of the objects from the starting list. The small average path length
indicates that the network is a Small World [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        With the help of Gephi, lists of the most important nodes according to the
PageRank criterion and of the greatest hubs according to the HITS criterion [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] were also obtained (Figure 3).
      </p>
      <p>The general view of the word network given in Figure 2 clearly demonstrates a
further possibility of clustering the network, i.e. selecting subsets – clusters – of
words (concepts). This procedure makes it possible to distinguish thematic subsets
within the considered subject domain.</p>
      <sec id="sec-4-1">
        <title>General view of words network</title>
      </sec>
      <sec id="sec-4-2">
        <title>A fragment of the words network</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Methods of Results Evaluation</title>
      <p>
        To evaluate the results, two assessments of abstract quality that do not require
experts are applied – the cosine measure and the Jensen-Shannon divergence, the use
of which is substantiated in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <sec id="sec-5-1">
        <title>PageRank</title>
      </sec>
      <sec id="sec-5-2">
        <title>HITS</title>
        <p>Let us explain the use of these approaches. The hieroglyphic dictionary of the
document d is supposed to consist of N elements {t1, t2, ..., tN}. Each hieroglyph is
assigned a weight calculated according to the TF·IDF rule. The array of these weights
can be represented as a vector: d = (w1, w2, ..., wN). Accordingly, the hieroglyphic
dictionary of the abstract r consists of a subset of the dictionary of the document, and
the abstract can also be put in correspondence with a vector of weight values:
r = (ŵ1, ŵ2, ..., ŵN). In this case, we give a natural definition:</p>
        <p>ŵi = wi, if ti ∈ r; ŵi = 0, if ti ∉ r.</p>
        <p>It is known that the scalar product of two nonzero vectors A and B in Euclidean
space is defined by the formula:</p>
        <p>A · B = |A| · |B| · cos θ.</p>
        <p>Here  – a corner between the considered vectors. It is natural if the direction of
vectors coincides, the value  becomes equal to zero (respectively, cos  1 ). I.e.
than closer cos to unit, the direction of vectors is closer to those that is easily
substantially interpreted for a case of the document and its abstract (the short summary).
It is accepted function of proximity between vectors A and B to designate as
Sim  A, B  (from word Similarity). In case of studying of a cosine measure of
proximity we have:
n
 AiBi
i1</p>
        <p>,
Sim A, B   cos </p>
        <p>
          A  B 
thematical statistics, in particular, on the relative entropy of Kullback-Leibler [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
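        <p>The cosine measure applied to a document and its abstract can be sketched as
follows (assuming, as defined above, that the abstract vector keeps the document
weight for every hieroglyph it contains and zero otherwise; names are illustrative):</p>

```python
import math

def cosine_similarity(doc_w, abstract_chars):
    """Sim(d, r): cosine between the document weight vector d and the
    abstract vector r, where r_i = w_i if the hieroglyph occurs in the
    abstract and 0 otherwise."""
    r = {t: (w if t in abstract_chars else 0.0) for t, w in doc_w.items()}
    num = sum(w * r[t] for t, w in doc_w.items())
    den = (math.sqrt(sum(w * w for w in doc_w.values())) *
           math.sqrt(sum(w * w for w in r.values())))
    return num / den if den else 0.0
```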
        <p>The Kullback-Leibler entropy is generally defined as a non-negative functional
that is an asymmetric measure of the distance between two probability distributions
defined on a common space of elementary events.</p>
        <p>The divergence of distribution Q relative to P is designated D(P‖Q). Distribution
Q often serves as an approximation of distribution P. In information theory this
measure of distance is also interpreted as the amount of information lost when the
true distribution P is replaced by the distribution Q. The value of the functional can
be understood as the amount of information of distribution Q left unaccounted for if
it was used to approximate the distribution P.</p>
        <p>For discrete probability distributions P = {p1, p2, ..., pn} and Q = {q1, q2, ..., qn}
the Kullback-Leibler entropy is defined as follows:</p>
        <p>D(P‖Q) = Σ_{i=1..n} pi · log( pi / qi ).</p>
        <p>
          The Kullback-Leibler entropy, substantively close to the concept of distance,
could be called a metric in the space of probability distributions, but this would be
incorrect, since it is not symmetric, D(P‖Q) ≠ D(Q‖P), and does not satisfy the
triangle inequality. In what follows, we will use the Jensen-Shannon divergence
(JSD), which is based on the Kullback-Leibler entropy but is a metric [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], so it is also called the
"Jensen-Shannon distance" [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>The Jensen-Shannon divergence is defined as follows:</p>
        <p>JSD(P‖Q) = ½ ( D(P‖M) + D(Q‖M) ),</p>
        <p>where M = ½ (P + Q).</p>
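        <p>A direct implementation of these definitions (a sketch with the binary logarithm;
terms with zero probability are skipped, following the usual convention 0 · log 0 = 0):</p>

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(P‖Q), binary logarithm."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence JSD(P‖Q) = ½(D(P‖M) + D(Q‖M)),
    where M = ½(P + Q)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

        <p>Unlike the Kullback-Leibler entropy, this quantity is symmetric in P and Q.</p>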
        <p>When the Jensen-Shannon distance is applied to the problem of assessing the
quality of abstracts, the amount of information lost in the abstract in comparison with
the initial document is estimated. As with the cosine measure, it is supposed that the
document d corresponds to the vector of hieroglyph weights d = (w1, w2, ..., wN) and
the abstract r to the vector of weight values r = (ŵ1, ŵ2, ..., ŵN). The "average" vector
used in the Jensen-Shannon method then takes the form M = ½ (d + r).</p>
        <p>Let us consider the given sums over two areas of index values: the first, where
the hieroglyphs of the document and the abstract coincide, and the second, where
they do not coincide, i.e. where ŵi = 0:</p>
        <p>JSD = JSD1 + JSD2.</p>
        <p>In the first area, obviously,</p>
        <p>JSD1 = ½ Σ wi · log( wi / (½ (wi + wi)) ) + ½ Σ wi · log( wi / (½ (wi + wi)) ) = 0.</p>
        <p>In the second area, respectively,</p>
        <p>JSD2 = ½ Σ wi · log( wi / (½ wi) ) + ½ Σ ŵi · log( ŵi / (½ wi) ) = ½ Σ wi.</p>
        <p>Strictly speaking, the second term in the latter formula is not well defined (one
should consider the limit of the expression under the summation sign as ŵi → 0), but
at the same time we can draw a fairly obvious conclusion: the Jensen-Shannon
measure corresponds to the loss of information during summarization and is
proportional to the total weight of the words (in our case, characters) included in the
document but missing from the abstract.</p>
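        <p>This conclusion is easy to check numerically (a sketch with the binary logarithm,
under which log 2 = 1; the example weights are our own):</p>

```python
import math

def jsd_loss(d, r):
    """JSD between the document weight vector d and the abstract vector r
    (r_i = w_i for kept hieroglyphs, 0 otherwise), binary logarithm."""
    total = 0.0
    for wi, ri in zip(d, r):
        mi = (wi + ri) / 2
        if wi > 0:
            total += 0.5 * wi * math.log2(wi / mi)
        if ri > 0:
            total += 0.5 * ri * math.log2(ri / mi)
    return total

# The derivation predicts JSD = ½ of the total weight missing from the abstract.
d = [0.4, 0.3, 0.2, 0.1]
r = [0.4, 0.3, 0.0, 0.0]      # the last two hieroglyphs were lost
assert abs(jsd_loss(d, r) - 0.5 * (0.2 + 0.1)) < 1e-12
```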
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Comparison of Methods</title>
      <p>
        In summarizing, the new idea of determining the weight values of sentences on
the basis of the weights of separate hieroglyphs, rather than words as is standard, was
realized. Therefore, the quality of summarization was checked not only on the basis
of the weights of separate hieroglyphs but also taking into account the weights of the
whole words included in the documents and abstracts, to make sure that the proposed
approach is satisfactory also by the criteria of traditional summarization systems.
Naturally, this required performing the resource-intensive procedure of word
segmentation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
It should be noted that this procedure was performed only to check the quality of the
summarization algorithms and is not a part of these algorithms.
      </p>
      <p>The tests were conducted on a real array of legal information of the People’s
Republic of China comprising 10 thousand documents.</p>
      <p>The results of the conducted tests are shown in Fig. 4-7. Fig. 4 and 6 show the
results when the models of documents and abstracts corresponded to vectors whose
elements are the TF·IDF weights of individual hieroglyphs from the text of the
document. Fig. 5 and 7 show the results when the elements of the vectors correspond
to the weight values of words segmented from the texts of documents and abstracts.
Fig. 4 and 6 give the results according to the cosine measure of proximity of the
document and the abstract, and Fig. 5 and 7 according to the Jensen-Shannon
distance.</p>
      <p>On the horizontal axis in all figures the number of sentences included in the
abstract is marked. The vertical axis shows the values of the corresponding criteria,
averaged over the whole document array. It should be noted that in all examples the
title of the document is included as the first sentence of the abstract, so the values
with argument 1 are the same for all four types of algorithms (Σ tf·idf, Nearest,
Network, Random).</p>
      <p>As can be seen, for comparison with the three above-mentioned algorithms a
Random method is added – compiling the abstract from random sentences of the text
(except for the first sentence, the title).</p>
      <p>The test results can be summarized as follows:</p>
      <p>
        The proposed approaches lead to results whose quality is not lower than that of
the systems presented at the well-known text analysis conference TAC [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>If, by the criterion of the cosine measure of proximity of the document and the
abstract, the best results when taking into account the weight values of individual
hieroglyphs were shown by the Σ tf·idf method (which determined the weight of
sentences by the sum of TF·IDF values, with the most significant included in the
abstract), then by the same criterion the proposed network method was the best when
separate words of natural language were taken into account.</p>
      <p>We introduce a new hybrid method of automatic text summarization, combining
statistical and marker methods and also taking into account the location of sentences
in the text of the document. The offered model of the abstract reflects the information
needs of customers working with legal information.</p>
      <p>We introduced the approach of determining the weights of separate hieroglyphs
instead of segmented words in the texts of documents. This technique avoids the
expensive word-segmentation procedure required by other semantic methods of
Chinese language processing.</p>
      <p>Various methods of automatic text summarization were implemented and tested.
Summarization on the basis of the proposed network model of the document was the
best by the criteria of the cosine measure and the Jensen-Shannon distance for
abstracts whose volume exceeds 2 sentences.</p>
      <p>The proposed approach, with minimal modifications, can be applied to texts on
any subject of scientific, technical, or news information.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Luhn</surname>
            ,
            <given-names>Hans Peter</given-names>
          </string-name>
          (
          <year>1958</year>
          ).“
          <article-title>The automatic creation of literature abstracts”</article-title>
          .
          <source>IBM Journal of research and development</source>
          ,
          <volume>2</volume>
          :
          <fpage>159</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>“Automatic Keyword Extraction from Documents using Conditional Random Fields”</article-title>
          .
          <source>Journal of Computational Information Systems</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1169</fpage>
          -
          <lpage>1180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>“Using tf-idf to determine word relevance in document queries”</article-title>
          .
          <source>Proceedings of the first instructional conference on machine learning</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bharti</surname>, <given-names>Santosh Kumar</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Babu</surname>, <given-names>Korra Sathya</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pradhan</surname>, <given-names>Anima</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>“Automatic Keyword Extraction for Text Summarization in Multi-document e-Newspapers Articles”</article-title>
          .
          <source>European Journal of Advances in Engineering and Technology</source>
          ,
          <volume>4</volume>
          (
          <issue>6</issue>
          ):
          <fpage>410</fpage>
          -
          <lpage>427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chien</surname>
            ,
            <given-names>L.-F.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>“Pat-tree-based keyword extraction for Chinese information retrieval”</article-title>
          .
          <source>ACM SIGIR Forum. 31, ACM</source>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Salton</surname>, <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Buckley</surname>, <given-names>C.</given-names>
          </string-name>
          (
          <year>1988</year>
          ). “
          <article-title>Term-weighting approaches in automatic text retrieval”</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lande</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Snarskii</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yagunova</surname>
            ,
            <given-names>E. V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pronoza</surname>
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2013</year>
          ). “
          <article-title>The Use of Horizontal Visibility Graphs to Identify the Words that Define the Informational Structure of a Text”</article-title>
          .
          <source>12th Mexican International Conference on Artificial Intelligence</source>
          . pp.
          <fpage>209</fpage>
          -
          <lpage>215</lpage>
          . DOI: 10.1109/MICAI.2013.33
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yatsko</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>“Symmetric Summarization: Thematic Foundations and Methods”</article-title>
          .
          <source>Nauchno-Tekh. Inf., Ser. 2</source>
          ,
          <volume>5</volume>
          :
          <fpage>18</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cherven</surname>, <given-names>Ken</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>“Network Graph Analysis and Visualization with Gephi”</article-title>
          .
          <source>Packt Publishing. ISBN: 9781783280131.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>“Navigation in a small world”</article-title>
          .
          <source>Nature</source>
          ,
          <volume>406</volume>
          (
          <issue>6798</issue>
          ):
          <fpage>845</fpage>
          . DOI: 10.1038/35022643
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Langville</surname>, <given-names>Amy N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Meyer</surname>, <given-names>Carl D.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>“Google's PageRank and beyond: the science of search engine rankings”</article-title>
          . Princeton university press.
          <source>ISBN: 978069115266</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Louis</surname>, <given-names>Annie</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nenkova</surname>, <given-names>Ani</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>“Automatic Summary Evaluation without Human Models”</article-title>
          .
          <source>In First Text Analysis Conference (TAC'08)</source>
          , Gaithersburg, MD, USA,
          <fpage>17</fpage>
          -
          <lpage>19</lpage>
          November
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Leibler</surname>, <given-names>R.A.</given-names>
          </string-name>
          (
          <year>1951</year>
          ).
          <article-title>"On information and sufficiency"</article-title>
          .
          <source>Annals of Mathematical Statistics</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          . DOI: 10.1214/aoms/1177729694.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1959</year>
          ),
          <source>Information Theory and Statistics</source>
          , John Wiley &amp; Sons. Republished by Dover Publications in
          <year>1968</year>
          ; reprinted in 1978. ISBN 0-8446-5625-9.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Schütze</surname>, <given-names>Hinrich</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Manning</surname>, <given-names>Christopher D.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <source>Foundations of Statistical Natural Language Processing</source>
          . Cambridge, Mass: MIT Press, p. 304. ISBN 0-262-13360-1.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Dagan</surname>
          </string-name>
          , Ido; Lillian Lee; Fernando Pereira (
          <year>1997</year>
          ).
          <article-title>"Similarity-Based Methods for Word Sense Disambiguation"</article-title>
          .
          <source>Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics</source>
          :
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          . arXiv:cmp-lg/9708010. DOI: 10.3115/979617.979625.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Endres</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ;
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Schindelin</surname>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>"A new metric for probability distributions"</article-title>
          .
          <source>IEEE Trans. Inf. Theory</source>
          .
          <volume>49</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1858</fpage>
          -
          <lpage>1860</lpage>
          . DOI: 10.1109/TIT.2003.813506.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Fuglede</surname>, <given-names>Bent</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Topsøe</surname>, <given-names>Flemming</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>"Jensen-Shannon divergence and Hilbert space embedding”</article-title>
          .
          <source>Proceedings of International Symposium on Information Theory, ISIT</source>
          <year>2004</year>
          , p.
          <fpage>31</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>"Divergence measures based on the Shannon entropy"</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>37</volume>
          (
          <issue>1</issue>
          ):
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          . DOI: 10.1109/18.61115.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Berezin</surname>, <given-names>Boris</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lande</surname>, <given-names>Dmitry</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pavlenko</surname>, <given-names>Oleh</given-names>
          </string-name>
          .
          <article-title>Development, Evaluation and Usage of Word Segmentation Algorithm for National Internet Resources Monitoring Systems</article-title>
          .
          <source>CEUR Workshop Proceedings. Selected Papers of the XVII International Scientific and Practical Conference on Information Technologies and Security (ITS 2017)</source>
          .
          <volume>2067</volume>
          :
          <fpage>16</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>