<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Application of KEA for Semantically Associated Structural Units Search in a Corpus and Text Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Sokolova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Saint Petersburg State University</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the results of research on possible applications of the keyphrase extraction algorithm KEA. Although this algorithm is widely used as an effective and universal tool for keyphrase extraction, our study explores how it can be adapted to two further tasks: the search for translation equivalents and semantic compression, namely extractive summarization. In the first series of experiments we analyzed the output of KEA on a text corpus built from United Nations documents in order to find semantically associated structural units (possible translation equivalents) among Russian and English keyphrases. The second series of experiments uses keyphrases automatically extracted by KEA to compose extracts for short stories. For this purpose we compiled a corpus of short stories written in (or translated into) Russian and adjusted KEA so that ranked sentences containing keyphrases could be used to form previews of the stories.</p>
      </abstract>
      <kwd-group>
        <kwd>keyphrase extraction</kwd>
        <kwd>KEA</kwd>
        <kwd>translation equivalents</kwd>
        <kwd>summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Keyphrases have a wide range of practical applications in quite different fields, such
as document summarization, indexing, information retrieval and library systems.
Being structural units themselves, keyphrases convey the most important information
about the content of the document. That is why automatic keyphrase extraction is one
of the most sought-after tasks today.</p>
      <p>
        There are different approaches to extracting keyphrases from a document [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]:
statistical (TFxIDF, Chi-square, C-value, Log-Likelihood, etc.), linguistic (including
different levels of linguistic analysis), machine learning (Naïve Bayes classifier, SVM,
etc.) and also hybrid algorithms (KEA).
      </p>
      <p>
        In this paper we explore further applications of KEA (Keyphrase Extraction Algorithm),
one of the commonly known keyphrase extraction algorithms, in the wide
field of Natural Language Processing (NLP) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To this end, we conducted a series of
experiments in which we adapted KEA to tasks that combine semantic compression
and text transformations.
      </p>
      <p>To be precise, in the first experiment we try to find out whether KEA is capable of
finding semantically related units, such as translation equivalents, synonyms and hyponyms,
for two different languages, namely Russian and English.</p>
      <p>
        The second experiment is devoted to the possibility of using KEA as an
intermediate tool for an extractive summarization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] algorithm. Keyphrases automatically
extracted by KEA were used to identify salient sentences in the text.
      </p>
      <p>To mark the boundaries of our research, we note that we are not trying to
find new solutions to existing problems in the field of NLP. The subject of our study
is KEA itself, namely how it can be used and for what. Thus, the applications of
KEA considered below represent only one of the possible approaches to these tasks,
and at the same time give new information about KEA’s
abilities. Although the algorithm is not new, we chose
KEA for our experiments because it has proved to be a useful and universal tool in
different fields, but has so far not been used for processing Russian texts.</p>
      <p>We would also like to state in advance that, as a significant part of the research was
conducted manually, in many aspects it is not large-scale.</p>
      <p>The paper is organized as follows. Section 2 briefly describes the structure and
working principles of KEA. Section 3 describes the first possible KEA
application, namely identification of translation equivalents, while Section 4 deals
with the second experiment, which concerns composing extracts for short stories based
on keyphrases extracted by KEA. Section 5 is devoted to general conclusions and
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>KEA Structure</title>
      <p>Once candidate phrases have been identified, two features are calculated for each
candidate: TFxIDF and first occurrence. TFxIDF shows how often a phrase P occurs in the
document D in comparison to its frequency in some large corpus:
        <disp-formula><tex-math>\mathrm{TF{\times}IDF} = \frac{\mathrm{freq}(P,D)}{\mathrm{size}(D)} \times \left(-\log_2 \frac{\mathrm{df}(P)}{N}\right),</tex-math></disp-formula>
where freq(P,D) is the number of times P occurs in D;
size(D) is the number of words in D;
df(P) is the number of documents containing P in some collection of documents or
corpus; and N is the size of the collection or corpus.</p>
      <p>The second feature, first occurrence, is the distance between the first appearance of a phrase
and the beginning of the document, divided by the number of words in the document.
The result is a number between 0 and 1.</p>
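      <p>The two features can be illustrated with a minimal Python sketch. It follows the definitions given in this section; the function names, the toy sentence and the corpus counts are assumptions made for illustration and are not taken from the KEA package, whose actual implementation may differ in details such as tokenization.</p>
      <preformat preformat-type="code">
```python
import math


def tf_idf(freq_in_doc, doc_size, doc_freq, corpus_size):
    """TFxIDF of a phrase P in document D: freq(P,D)/size(D) * -log2(df(P)/N)."""
    return (freq_in_doc / doc_size) * -math.log2(doc_freq / corpus_size)


def first_occurrence(words, phrase_words):
    """Number of words before the first appearance of the phrase,
    divided by the number of words in the document."""
    n = len(phrase_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            return i / len(words)
    return None  # the phrase does not occur in the document


# Hypothetical toy document and corpus counts (not data from the paper).
words = "the paris agreement was adopted in paris in 2015".split()
tfidf = tf_idf(freq_in_doc=1, doc_size=len(words), doc_freq=2, corpus_size=8)
position = first_occurrence(words, ["paris", "agreement"])
print(tfidf, position)
```
</preformat>
      <p>In this toy case the phrase occurs once in a nine-word document and in 2 of 8 corpus documents, so TFxIDF is 1/9 × 2 ≈ 0.22, and the first occurrence value is 1/9 ≈ 0.11.</p>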
      <p>During training, KEA marks each candidate as a keyphrase or a non-keyphrase;
this is the class feature used later by the Naïve Bayes classifier. Then, by applying the
model built at the training stage, KEA selects keyphrases from a new document and,
after some post-processing operations, presents the best keyphrases to the user.</p>
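      <p>When the classifier processes a candidate phrase with feature values t (TF×IDF)
and d (distance), two quantities are calculated:
        <disp-formula><tex-math>P[\mathrm{yes}] = \frac{Y}{Y+N}\, P_{\mathrm{TF \times IDF}}[t \mid \mathrm{yes}]\, P_{\mathrm{distance}}[d \mid \mathrm{yes}]</tex-math></disp-formula>
and the same for P[no], with the prior N/(Y+N) and the probabilities conditioned on ‘no’.
Here Y is the number of positive instances in the training set,
i.e. keyphrases assigned by the author, and N is the number of negative instances, i.e.
candidate phrases which are not keyphrases (not to be confused with the corpus size N used for TFxIDF).</p>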
      <p>The overall probability that a candidate phrase is a keyphrase is then
calculated in the following way:
        <disp-formula><tex-math>p = \frac{P[\mathrm{yes}]}{P[\mathrm{yes}] + P[\mathrm{no}]}</tex-math></disp-formula>
Candidate keyphrases are ranked according to this value, and the first r, where r is the
requested number of keyphrases, are presented to the user.</p>
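      <p>The classification and ranking step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than KEA’s code: the function names are hypothetical, and the likelihood values passed in are made-up numbers standing in for the probabilities that the classifier would estimate from the training data.</p>
      <preformat preformat-type="code">
```python
def keyphrase_probability(p_t_yes, p_d_yes, p_t_no, p_d_no, n_yes, n_no):
    """Combine class priors and feature likelihoods into
    p = P[yes] / (P[yes] + P[no])."""
    total = n_yes + n_no
    score_yes = (n_yes / total) * p_t_yes * p_d_yes
    score_no = (n_no / total) * p_t_no * p_d_no
    return score_yes / (score_yes + score_no)


def rank_candidates(candidates, r):
    """Return the r candidate phrases with the highest probability.
    `candidates` is a list of (phrase, probability) pairs."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:r]]


# Hypothetical likelihoods and training-set counts (Y = 20, N = 180).
p = keyphrase_probability(p_t_yes=0.6, p_d_yes=0.5,
                          p_t_no=0.1, p_d_no=0.2,
                          n_yes=20, n_no=180)
top = rank_candidates([("a", 0.2), ("b", 0.9), ("c", 0.5)], r=2)
print(p, top)
```
</preformat>
      <p>With these illustrative numbers P[yes] = 0.1 × 0.6 × 0.5 = 0.03 and P[no] = 0.9 × 0.1 × 0.2 = 0.018, which gives p = 0.625.</p>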
    </sec>
    <sec id="sec-3">
      <title>Translation equivalents among Russian and English keyphrases automatically extracted by KEA</title>
      <sec id="sec-3-1">
        <title>Collecting and preprocessing text corpora</title>
        <p>
          Besides KEA’s possible practical uses, this experiment was also aimed at verifying
to what extent KEA is a language-independent tool. For us this would mean that it is
capable of identifying what are conventionally ‘the same words’ in the same document written in
several languages. For this purpose we developed a corpus from United Nations
(UN) documents [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], as official papers have the most precise translations and are
written in a formal style.
        </p>
        <p>The corpus contains official letters, declarations, protocols, reports, etc. In
total, it includes 60 documents (~115,000 tokens), of which 30 are written
in English and 30 in Russian. In each subcorpus 25 documents were taken for the
training set, while the remaining 5 formed the test set. The documents in each set were
picked randomly. As the UN documents come without manually assigned keyphrases,
we used document-headline pairs in the training set.</p>
        <p>As already mentioned, KEA is a universal, language-independent algorithm:
the importance of a phrase for the document content does not depend
on any particularities of a language. The implementation of KEA nevertheless allows
external language-dependent modules, such as stemmers, to be plugged in. Its
initial package contains stemmers for some languages, but Russian is not among
them. As using different stemmers for document preprocessing could influence the
resulting list of keyphrases, no linguistic processing of the documents was used in this
experiment. Thus, equal conditions were set up for both languages.</p>
        <p>
          In processing English texts we used the internal list of stopwords created by the
developers of the algorithm, while the stopword list for Russian was compiled
from the Russian National Corpus (RNC) lists of function words and abbreviations [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It
includes the most frequent prepositions, particles, pronouns, interjections, some
parenthetical words, digits and Latin characters.
        </p>
        <p>For each document of the test set we obtained a list of the 20 most relevant
keyphrases (20 being the number recommended by the developers as containing the most
salient keyphrases). After that, the lists were manually analyzed in order to find translation
equivalents.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results and evaluation</title>
        <p>It is worth mentioning that the results obtained in the course of the experiments cannot be
evaluated with high precision, as keyphrase extraction algorithms are
hard to evaluate in general, especially when no manually assigned keyphrases are provided.
Moreover, algorithms like KEA, as a rule, work better on preprocessed documents,
in particular for morphologically rich languages like Russian. As
already noted, we did not preprocess the documents in our study in order to
create conditions that are as equal as possible for both languages. Therefore, for each document we
decided to calculate the percentage of semantically associated structural units for both
outputs combined. The number of units that enter into some kind of
semantic relation was divided by 40 (20 Russian keyphrases and 20
English keyphrases for a document) and multiplied by 100 to get a percentage.
Technically, of course, the Russian and English versions are two different documents, but as our study is of a semantic
nature, we consider this detail unimportant. The results obtained, with examples, are
shown in Table 1.</p>
        <p>As we can see, translation equivalents can indeed be found in the output, which supports
KEA’s language independence and opens new possibilities for research in that area.</p>
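        <p>The percentage described above amounts to a one-line calculation, sketched here in Python. The function name and the count of 10 associated units are hypothetical and serve only to illustrate the arithmetic; they are not results from the experiment.</p>
        <preformat preformat-type="code">
```python
def associated_percentage(associated_units, keyphrases_per_language=20, languages=2):
    """Share of semantically associated units among all extracted keyphrases
    for one document (20 Russian + 20 English keyphrases by default)."""
    total = keyphrases_per_language * languages
    return 100 * associated_units / total


# Hypothetical count: 10 of the 40 keyphrases enter into some semantic relation.
share = associated_percentage(10)
print(share)
```
</preformat>
        <p>For 10 associated units out of 40 the sketch gives 25 percent.</p>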
        <p>Some remarks should be made about these figures, however. Firstly, KEA tends to break
semantically associated units. For instance, for the document G1812398\400 we had
Paris, agreement and Paris Agreement for both languages. This is quite a common issue
for automatic keyphrase extraction, but there is still no convention among researchers
on how to count such cases. In our paper we decided to count
full phrases as well as their parts. So, in the example above, all three units were
considered to be semantically associated.</p>
        <p>Secondly, because of the nature of the texts in our corpus, we mainly dealt with
translation equivalents, and sometimes it is hard to tell whether keyphrases are
equivalent and whether the parts came from the same phrase. For example, for
document N1813943\46 the extracted units included Совет Безопасности (‘Security Council’), Совет Безопасности
напоминает (‘the Security Council recalls’), Security Council and encourages. In such cases we had to turn to the
original text to look at the context, which is not very convenient within the experiment because it was done
manually for each document in the corpus. Even then it is
impossible to tell whether Security Council came from Security Council encourages or Security
Council recalls. As the corpus used was not aligned, looking at the context becomes a
separate problem.</p>
        <p>Therefore, the sometimes high figures are a product of evaluation issues
that arise when processing broken phrases. These breaks may be caused not only by KEA’s
peculiarities, but also by the absence of morphological preprocessing of the texts. It is
well known that ‘messy’ data causes calculation errors, which is why we admit
that our evaluation is rough and does not claim to be the only possible one or highly
precise.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Automatic summarization of short stories</title>
      <sec id="sec-4-1">
        <title>Data preprocessing</title>
        <p>
          In this paper we used KEA to create extracts based on the original text [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ].
According to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], an extract is a collection of passages (ranging from single words to whole
paragraphs) extracted from the input text(s) and produced verbatim as the summary.
        </p>
        <p>
          For this experiment we compiled a corpus of 35 short stories, some written in Russian
and some being Russian translations of famous literary works. Among the authors whose stories
were used are A. Chekhov, O. Henry, D. Kharms and others. The only selection
criterion was small size. Thirty short stories were used for the training set and the remaining
five for the test set. In place of manually assigned keyphrases for training, we took abstracts
of those stories written by users of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>Further actions follow one of two paths:
1. Experiments based on the lemmatized training set:
– lemmatization of the abstracts;
– deleting stopwords;
– lemmatization of the training set;
– lemmatization of the test set;
– extraction of the 20 most relevant keyphrases.
2. Experiments based on the non-lemmatized training set:
– lemmatization of the test set;
– extraction of the 20 most relevant keyphrases;
– lemmatization of the output keyphrases.</p>
        <p>
          The reason for this division is that KEA produces different results
depending on whether the training set has been lemmatized or not. For lemmatization we used
the morphological analyzer pymorphy2 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] in Python.
        </p>
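        <p>A lemmatization step of this kind might look as follows in Python. The sketch is an assumption about how such a step could be organized, not the code used in the experiments: the helper names are hypothetical, and a small toy dictionary stands in for the analyzer so that the example runs without external packages. The commented lines show how pymorphy2 could be plugged in instead.</p>
        <preformat preformat-type="code">
```python
import re


def lemmatize_text(text, lemmatize):
    """Lowercase the text, keep alphabetic tokens only,
    and map each token to its lemma."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [lemmatize(token) for token in tokens]


# With pymorphy2 installed, a real lemmatizer could be defined like this:
#   import pymorphy2
#   morph = pymorphy2.MorphAnalyzer()
#   def lemmatize(word):
#       return morph.parse(word)[0].normal_form

# Toy stand-in dictionary (an assumption for illustration only).
TOY_LEMMAS = {"встряхнул": "встряхнуть", "головой": "голова", "сказал": "сказать"}


def toy_lemmatize(word):
    """Look the word up in the toy dictionary; unknown words stay unchanged."""
    return TOY_LEMMAS.get(word, word)


lemmas = lemmatize_text("Доктор встряхнул головой и сказал", toy_lemmatize)
print(lemmas)
```
</preformat>
        <p>Passing the lemmatizer as a function keeps the tokenization independent of the analyzer, so the same step can be applied to the abstracts, the training set, the test set or the output keyphrases.</p>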
      </sec>
      <sec id="sec-4-2">
        <title>The algorithm</title>
        <p>Once the corpus has been processed and keyphrases for the test set have been extracted, an extract
for a story is automatically composed from the obtained results. We developed and
tested an algorithm, implemented in Python, that consists of
several modules, including preprocessing as well as the module that creates an extract.</p>
        <p>The algorithm contains several stages:
1) the text is split into sentences: as the search for keyphrases in the text is
conducted by lemmas, we later need to find and extract the original sentences;
2) the title and the first sentence are extracted: we need the title to link an
extract to its story, and the first sentence gives it a start;
3) keyphrases are searched for in the sentences: at this point we have
lemmatized original texts and their keyphrases, so the search is conducted by lemmas;
4) candidate sentences are assigned scores (this stage is discussed
below);
5) selected sentences are extracted from the original text, and the first five
(including the first one) with a score greater than or equal to 2 form the extract.
Scores are assigned as follows:</p>
        <p>1, if a keyphrase is included in one of the constructions listed below, and, in the
first two cases, if it is a subject or a predicate of the sentence:
– noun: noun + noun/verb/full adjective/short adjective (within a distance of ±1 from the main word);
– verb: verb + noun/infinitive (within a distance of ±1 from the main word); verb + full adjective + noun;
– adjective: adjective + noun (at a distance of +1 from the main word); verb + adjective + noun;
2, if a keyphrase in the sentence is among the first five from the output list;
3, if a sentence contains more than one keyphrase;
4, 5 and 6 are assigned for combinations of these conditions and if a sentence contains several
keyphrases.</p>
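        <p>The stages above can be sketched in Python as follows. This is a simplified illustration and not the implementation used in the experiments. Several points are assumptions: the sentence splitter is naive, the default lemmatizer is plain lowercasing standing in for a morphological analyzer, the syntactic criterion (score 1) is omitted, and higher scores are modelled by adding the points for conditions 2 and 3. All function names and the toy text are hypothetical.</p>
        <preformat preformat-type="code">
```python
import re


def split_sentences(text):
    """Naive splitter: a sentence ends at a run of '.', '!' or '?'."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]


def score_sentence(lemmas, ranked_keyphrases, top_n=5):
    """Simplified score: +2 if a top-ranked keyphrase occurs,
    +3 if more than one keyphrase occurs (additive combination is assumed)."""
    joined = " " + " ".join(lemmas) + " "
    found = [k for k in ranked_keyphrases if " " + k + " " in joined]
    score = 0
    if any(k in ranked_keyphrases[:top_n] for k in found):
        score += 2
    if len(found) > 1:
        score += 3
    return score


def compose_extract(title, text, ranked_keyphrases, lemmatize=str.lower,
                    max_sentences=5, min_score=2):
    """Title + first sentence + further sentences (in text order) whose score
    reaches min_score, up to max_sentences sentences in total."""
    sentences = split_sentences(text)
    if not sentences:
        return [title]
    extract = [sentences[0]]  # the first sentence is always kept
    for sentence in sentences[1:]:
        if len(extract) == max_sentences:
            break
        lemmas = [lemmatize(w) for w in re.findall(r"\w+", sentence)]
        if score_sentence(lemmas, ranked_keyphrases) >= min_score:
            extract.append(sentence)
    return [title] + extract


# Toy text and keyphrases invented for illustration.
toy_text = ("The doctor was at home. A visitor asked the doctor for help. "
            "It was dark outside. The visitor had a carriage! They left.")
result = compose_extract("Toy story", toy_text, ["doctor", "visitor", "carriage"])
print(result)
```
</preformat>
        <p>On the toy text the sketch keeps the title, the first sentence and the two sentences that contain more than one keyphrase, and it skips the sentences without keyphrases.</p>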
      </sec>
      <sec id="sec-4-3">
        <title>Results and evaluation</title>
        <p>The obtained extracts are presented below.</p>
        <p>Here are the two extracts for ‘Enemies’ by A. Chekhov. The story begins when a visitor
comes to a doctor whose son has just died and asks for help because his wife is sick.
The doctor refuses, saying that he cannot work now, but eventually agrees to come.</p>
        <p>First extract: — Пока ехал к вам, исстрадался душой... Одевайтесь и едемте, ради бога...
Произошло это таким образом. — Верьте, я сумею оценить ваше великодушие, — бормотал Абогин,
подсаживая доктора в коляску. В его осанке, в плотно застегнутом сюртуке, в гриве и в лице
чувствовалось что-то благородное, львиное; ходил он, держа прямо голову и выпятив вперед грудь,
говорил приятным баритоном, и в манерах, с какими он снимал свое кашне или поправлял волосы на
голове, сквозило тонкое, почти женское изящество.</p>
        <p>Second extract: Очень рад, что застал... Бога ради, не откажите поехать сейчас со мной...
У меня опасно заболела жена... И экипаж со мной... По голосу и движениям вошедшего заметно было,
что он находился в сильно возбужденном состоянии. Когда Абогин еще раз упомянул про
Папчинского и про отца своей жены и еще раз начал искать в потемках руку, доктор встряхнул
головой и сказал, апатично растягивая каждое слово: — Извините, я не могу ехать... Минут пять
назад у меня... умер сын... — Неужели?</p>
        <p>In this case, the second extract seems to be more appropriate, as it is more coherent
and does not contain redundant information.</p>
        <p>Now consider a counter-example, the story ‘Tobin’s Palm’ by O. Henry. Two
friends go to Coney Island to cut loose because one of them, Tobin, has just
been deceived and robbed by his girlfriend. There they meet a gipsy who warns Tobin
to stay away from certain people and says that he will meet a person who will bring
him luck. For the rest of the story, Tobin and his friend try to find that person.
Here, the first extract is arguably more successful because it gives the story
a start, while from the second one it is hard to understand what happened to the
characters after they arrived at Coney Island.</p>
        <p>To evaluate the results, we asked six experts to assess the texts
from the following three perspectives:
– which of the two extract variants is better, lemmatized or
non-lemmatized: the better one is assigned a score of 1, while the other gets 0
(only the variant that got 1 at this step was evaluated further);
– meaningfulness: if it is impossible to learn anything about a story from the
extract, the score for this parameter equals 0; if a reader could learn at least
something, 1; and if an extract is for the most part clear, 2;
– preview: whether or not a given extract can be used as a preview for a
short story.</p>
        <p>The average evaluations for each parameter are shown in Table 3. As the first
parameter is a matter of preference and refers to another issue (data preprocessing), the total
score was calculated only for the ‘Meaningfulness’ and ‘Preview’ parameters, so the
highest possible total is 3.</p>
        <p>The results suggest that KEA can be used as an intermediate tool for composing extracts for short
stories: the average total score was greater than or
equal to 1.5 out of 3.</p>
        <p>Interestingly, the experts, as a rule, preferred the version based on non-lemmatized data.
In a way, this confirms our suggestion that stemming from the source package would be
better for data preprocessing.</p>
        <p>In this paper we tried to find and test some further applications of KEA, namely
identifying translation equivalents in the same text written in several languages and
summarizing short stories. As we can see, KEA managed to find equivalents in the
texts and to summarize stories into previews. This means that KEA is capable of
serving as a universal and effective tool for different tasks and may be useful not only
for researchers but for non-expert users as well.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The reported study is supported by the Russian Foundation for Basic Research (RFBR), grant
16-06-00529 «Development of a linguistic toolkit for semantic analysis of Russian
text corpora by statistical techniques».</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kaur</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Effective approaches for extraction of keywords</article-title>
          .
          <source>In: International Journal of Computer Science Issues</source>
          , vol.
          <volume>7</volume>
          , № 6, pp.
          <fpage>144</fpage>
          -
          <lpage>148</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beliga</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Keyword extraction: a review of methods and approaches</article-title>
          . URL: http://langnet.uniri.hr/papers/beliga/Beliga_KeywordExtraction_a_review_of_methods_and_approaches.pdf (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sokolova</surname>
            ,
            <given-names>E. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitrofanova</surname>
            ,
            <given-names>O. A.</given-names>
          </string-name>
          :
          <article-title>Automatic Keyphrase Extraction by applying KEA to Russian texts</article-title>
          .
          <source>In: IMS 2017 Proceedings, St.-Petersburg</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Nenkova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McKeown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Automatic summarization</article-title>
          .
          <source>In: Foundations and Trends in Information Retrieval</source>
          , vol.
          <volume>5</volume>
          , №
          <issue>2-3</issue>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>233</lpage>
          . (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>KEA</given-names>
            <surname>Homepage</surname>
          </string-name>
          , http://www.nzdl.org/Kea/index.html,
          <source>last accessed</source>
          <year>2018</year>
          /05/27.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paynter</surname>
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutwin</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nevill-Manning</surname>
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>KEA: Practical Automated Keyphrase Extraction</article-title>
          . In:
          <article-title>Design and Usability of Digital Libraries: Case Studies in the Asia Pacific</article-title>
          ,
          <source>IGI Global</source>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>152</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. The United Nations Homepage, http://www.un.org/ru/index.html,
          <source>last accessed</source>
          <year>2018</year>
          /05/27.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>RNC</given-names>
            <surname>Homepage</surname>
          </string-name>
          , http://www.ruscorpora.ru/,
          <source>last accessed</source>
          <year>2018</year>
          /05/27.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kazantseva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szpakowicz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Summarizing short stories</article-title>
          .
          <source>In: Computational Linguistics</source>
          , vol.
          <volume>36</volume>
          , № 1, pp.
          <fpage>71</fpage>
          -
          <lpage>109</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Luhn</surname>
            ,
            <given-names>H.P.:</given-names>
          </string-name>
          <article-title>The automatic creation of literature abstracts</article-title>
          .
          <source>In: IBM Journal of research and development</source>
          , vol.
          <volume>2</volume>
          , № 2, pp.
          <fpage>159</fpage>
          -
          <lpage>165</lpage>
          (
          <year>1958</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          :
          <article-title>Automated text summarization and the SUMMARIST system</article-title>
          .
          <source>In: Proceedings of a workshop on held at Baltimore</source>
          ,
          <source>Maryland: October 13-15</source>
          ,
          <year>1998</year>
          , Association for Computational Linguistics, pp.
          <fpage>197</fpage>
          -
          <lpage>214</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. FantLab Homepage, https://fantlab.ru/,
          <source>last accessed</source>
          <year>2018</year>
          /05/27.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Korobov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Morphological Analyzer and Generator for Russian and Ukrainian Languages</article-title>
          .
          <source>In: Analysis of Images, Social Networks and Texts</source>
          , pp
          <fpage>320</fpage>
          -
          <lpage>332</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>