<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Dmytro Lande</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Methodology for extracting of key words and phrases and building directed weighted networks of terms with using Part-of-speech tagging</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Information Recording of National Academy of Sciences of Ukraine</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <volume>1</volume>
      <issue>2</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Today, the rapid globalization of the information space leads to the rise of huge arrays of text data on information resources, including unstructured data. Therefore, developing new and improving existing methods and techniques for finding necessary and relevant information from this text data is important. This article is devoted to solving an urgent and important task related to conceptualization and further formalization in the form of a network of terms of unstructured data contained in thematic information flows distributed on the Internet.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Corpus</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Part-of-Speech (PoS) Tagging</kwd>
        <kwd>Terminological Ontology</kwd>
        <kwd>Network of Terms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Problem statement</title>
      <p>With the rapid development of information technologies and the globalization of the
information space, a huge amount of text data is produced on information resources
every day, and the share of unstructured data within it keeps growing. This
complicates the search for necessary and relevant information. The large
volumes of information flows and dynamic text arrays related to a given
subject domain therefore make it relevant to conceptualize these data and
then formalize them as an ontological model. It is thus important to develop
new methods and improve existing ones for solving this task.</p>
      <p>Many tasks that arise when working with textual information flows lie at the intersection
of the mathematical sciences and linguistics. This opens wide opportunities for
applying well-developed mathematical and linguistic theory. For example, discrete
mathematics makes it possible to represent text data
as a network model that is convenient and effective to use. In terms of
complex networks, texts on a certain theme can be represented as
a network of words and phrases connected by formal semantic
links. A network built from key terms (hereinafter, a network of terms) is one
form of this model. In this network, the nodes correspond to the individual key
terms of a subject domain, and the edges correspond to the links between these
terms. Analysis of such networks can serve as a basis for decision making in the chosen
subject domain.</p>
      <p>However, when building a network of terms, the identification and extraction of the key
objects (key words and phrases) remains an open problem. Because of the sparsity
of text data and the complex semantics of natural language, determining the syntactic
and semantic connections between the nodes that correspond to terms in the text, as well as
the direction of these links and their weights, is
also an open problem of conceptualization. Automating these processes
and visualizing their results are no less important.</p>
      <p>The aim of this work is to propose a new method for extracting the key terms of a
thematic text corpus and for determining the directions of links between nodes in an
undirected network of key words and phrases, in order to build terminological ontologies in the
form of a directed weighted network of terms. These networks can then be used to
draw conclusions about the network structure and its parameters and, on
this basis, to make effective decisions in the subject
domains to which the texts relate.</p>
    </sec>
    <sec id="sec-2">
      <title>Main approaches to natural language processing</title>
      <p>Text data is a part of natural language. Working with natural language raises many
problems, connected first of all with ambiguity, non-compositionality and self-application.</p>
      <p>Ambiguity arises because natural language contains different forms of words (word forms
that share a common stem) and linguistic expressions that are used to express
different content; thus, in a specific situation the meaning of a word depends on
the context. A language with such word forms is also called an inflected language. Non-compositionality
is caused by the lack of rules in natural language that would allow the exact
meaning of a complex statement to be determined from the meanings
of its component words alone, without knowing its context. This is because some
phrases in the statement can be interpreted ambiguously.</p>
      <p>
        When building terminological ontologies of subject domains based on a certain
thematic text document [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the elements of this formal scheme, that is, the terms (words and phrases)
that are used as concept names and accompany the chosen subject domain, must follow
the principle of unambiguity. In other words, a word that is used as a name must be the
name of only one object if this name is singular; a phrase must be a general name for
all objects of one class if this phrase is a general name.
      </p>
      <p>In this work, the most common approaches to preprocessing text data, such as
tokenization and stop-word removal, are used.</p>
      <p>
        Tokenization is used for preliminary lexical analysis and for segmenting the text into
elementary units (tokens). As an independent unit, a token is some form of a word. The
token is also considered together with its possible forms and meanings.
Tokenization is usually the initial stage of text processing because it makes it possible to
work with the word as a separate entity while knowing its context [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Stop-word removal is used to remove from the text, in particular, all articles (for
example, “an”, “the”) and prepositions. Stop words can be considered a source of noise.
Common stop words include prepositions, particles, exclamations, conjunctions, adverbs,
pronouns, introductory words, the numbers from 0 to 9 (single digits), other frequently used
auxiliary and independent parts of speech, symbols, and punctuation marks. Relatively
recently, this list has been supplemented by character sequences such as www, com, HTTP,
etc., which are frequently used on the Internet.</p>
      <p>
        All of the pre-processing methods mentioned above can easily be applied to different types
of texts. This can be done using Python libraries for natural language processing (NLP),
in particular NLTK [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
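      <p>As a minimal sketch of these two pre-processing steps, the following Python fragment tokenizes a sentence and removes stop words. It is an illustration only: the regular expression stands in for a full tokenizer such as the one provided by NLTK, and the short stop list, the helper names “tokenize” and “remove_stop_words” and the sample sentence are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import re

# Tiny illustrative stop list (an assumption); real lists are far longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is",
              "www", "com", "http"}


def tokenize(text):
    """Split text into lowercase tokens made of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and single digits, keeping the original token order."""
    return [t for t in tokens
            if t not in stop_words and not (len(t) == 1 and t.isdigit())]


tokens = tokenize("The little prince lived on a planet.")
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```
]]></preformat>
      <p>Punctuation is discarded by the tokenizer itself in this sketch, so only the stop list has to be maintained separately.</p>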
      <p>In addition, Part-of-Speech tagging (or simply tagging) (fig. 1) is
proposed for capturing the syntax and structure of the text. Tagging is usually the
next step of natural language processing: it is applied after tokenization and
assigns each word in the text (corpus) to a certain part of speech. This step is based on both
the definition of the word and its context, that is, on the
connection of the word with adjacent and related words in a phrase, sentence, or
paragraph. Part-of-Speech tagging is a basic component of almost
any NLP task, which then uses the collection of tags assigned to the words of a sentence.
PoS tagging can be used for word indexing, information retrieval and
many other applications. It is especially useful when some words or tokens
can have multiple tags. Most importantly, tagging simplifies the handling of context that refers
to a subject domain.</p>
      <p>Parts of speech are also known as word classes or lexical categories (which are based
on the syntactic context of a phrase). Each word is then tagged according to its lexical
category using the above method of classifying words by parts of speech.</p>
      <p>
        E. Brill's tagger [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which uses rule-based algorithms, is one of the first and most
widely used part-of-speech taggers for English. In addition to rule-based
algorithms, there are also stochastic algorithms.
      </p>
      <p>
        To extract keywords from the text, it is necessary to assign them a numerical
score (in other words, a statistical indicator of importance). In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] it was shown
that the global term frequency (GTF) is effective
when working with a text corpus. GTF is the ratio of the total
number of occurrences of a term in all corpus documents to the total number of terms
in the corpus documents. GTF shows how important the word is in the global context. It
was shown that, in contrast to the usual statistical indicator TF-IDF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the proposed
measure of term importance makes it possible to find informationally important
elements of the text more effectively when working with a text corpus on a predetermined theme,
in which an important term occurs in almost every document of the
corpus.
      </p>
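      <p>The definition of GTF given above can be sketched in a few lines of Python. This is an illustration of the ratio described in the text, not the implementation used in [7]; the function name “global_term_frequency” and the three toy documents are assumptions.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def global_term_frequency(documents):
    """Return {term: GTF} for a corpus given as a list of token lists.

    GTF(term) = occurrences of the term in all documents
                / total number of term occurrences in all documents.
    """
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Toy corpus (an assumption): three already pre-processed documents.
corpus = [
    ["little", "prince", "planet"],
    ["little", "prince", "rose"],
    ["little", "fox"],
]
gtf = global_term_frequency(corpus)
print(gtf["little"], gtf["prince"], gtf["fox"])
```
]]></preformat>
      <p>Unlike TF-IDF, this score does not penalize a term for appearing in every document, which is why a theme-defining term such as “little” keeps the highest value here.</p>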
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>The directed network of terms is built within each separate sentence.</p>
      <p>In this work, the functions “word_tokenize” and “pos_tag” of NLTK
(Natural Language Toolkit, an open-source Python library) are used for automatic
tokenization and for Part-of-Speech tagging, which assigns a tag to every word,
respectively.</p>
      <p>
        In addition to the standard sets of stop words, which are available
at [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], this paper proposes using a list of stop words formed by experts in
the subject domain under study.
      </p>
      <p>
        The method proposed in this work for determining key words and phrases, as well as
the directions of links in the undirected network of terms, is based on classifying words
by parts of speech and tagging them accordingly (Part-of-Speech tagging).
Practice shows [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that the most used
parts of speech in English are the article (abbreviated as DT), singular or mass
noun (NN), plural noun (NNS), personal pronoun (PRP), verb in base form (VB), adjective
(JJ) and adverb (RB). This work also considers phrases of the form
“NN1 NN2”, “JJ1 JJ2”, “JJ1 JJ2 NN”, “JJ1 JJ2 NN1 NN2”, which may be
important. Although articles, prepositions (IN), coordinating conjunctions (CC),
single verbs, adverbs and pronouns are stop words, phrases of the form
“VB1 to VB2”, “NN1 IN/CC NN2”, “JJ1 IN/CC JJ2”, “JJ NN1 IN/CC NN2”, “JJ1 IN/CC
JJ2 NN”, “JJ1 JJ2 NN1 IN/CC NN2”, “JJ1 IN/CC JJ2 NN1 IN/CC NN2” may be key. After
the terms described above are formed, they are arranged in a sequence in which
phrases with more words are placed before the shorter phrases and words that are
part of them; single stop words (individual articles,
prepositions, conjunctions, some verbs, adverbs and pronouns) are then removed.
      </p>
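      <p>One possible reading of this step is sketched below in Python. The input is a list of (word, tag) pairs of the kind a Part-of-Speech tagger returns. Only the four patterns without IN/CC are encoded, and the function name “build_term_sequence”, the representation of a phrase as words joined by an underscore, the tag string given to a phrase and the sample sentence are all assumptions made for the illustration; the removal of single stop words is left out.</p>
      <preformat><![CDATA[
```python
# A subset of the tag patterns listed in the text, written as tuples of tags.
PATTERNS = [
    ("NN", "NN"),
    ("JJ", "JJ"),
    ("JJ", "JJ", "NN"),
    ("JJ", "JJ", "NN", "NN"),
]


def build_term_sequence(tagged, patterns=PATTERNS):
    """Return (term, tag) pairs with longer phrases placed before their parts.

    At every position the matching patterns are emitted from longest to
    shortest, followed by the single word at that position.
    """
    by_length = sorted(patterns, key=len, reverse=True)
    sequence = []
    for i, (word, tag) in enumerate(tagged):
        for pattern in by_length:
            window = tagged[i:i + len(pattern)]
            if tuple(t for _, t in window) == pattern:
                phrase = "_".join(w for w, _ in window)
                sequence.append((phrase, " ".join(pattern)))
        sequence.append((word, tag))
    return sequence


# Hypothetical tagged sentence fragment.
tagged_sentence = [("small", "JJ"), ("gray", "JJ"), ("planet", "NN"), ("orbit", "NN")]
terms = [term for term, _ in build_term_sequence(tagged_sentence)]
print(terms)
```
]]></preformat>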
      <p>Then the words and phrases in the sequence formed at the previous stage are
statistically weighted using the global term frequency (GTF)
described above.</p>
      <p>A tuple is formed for each term, in the order of its occurrence in the text.
Each tuple consists of three values: the first is the term (a word or
phrase); the second is the tag assigned to the term according to its
part of speech; the last is the numerical GTF value.</p>
      <p>Note that the GTF is calculated with respect to the two previous values: the
word or phrase and the part of speech to which it belongs. The value of the third
element is the number of identical (term, tag) pairs in the whole text, normalized by
the total number of terms formed in this text.</p>
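      <p>The formation of these tuples can be sketched as follows. The point of the example is that the count is taken over (term, tag) pairs, so the same word tagged as different parts of speech is weighted separately. The function name “weight_terms” and the sample sequence are assumptions.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def weight_terms(sequence):
    """Turn (term, tag) pairs into (term, tag, gtf) tuples, keeping text order.

    The third value is the number of identical (term, tag) pairs in the text
    divided by the total number of terms formed for this text.
    """
    counts = Counter(sequence)
    total = len(sequence)
    return [(term, tag, counts[(term, tag)] / total) for term, tag in sequence]


# Hypothetical sequence: "water" occurs twice as a noun and once as a verb.
sequence = [("water", "NN"), ("water", "VB"), ("water", "NN"), ("rose", "NN")]
weighted = weight_terms(sequence)
for item in weighted:
    print(item)
```
]]></preformat>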
      <p>
        At the next step, it is proposed to determine the undirected links between terms in
the text. For this purpose, the Horizontal Visibility Graph (HVG) algorithm, which
transforms a time series into a visibility graph, is used [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In our case, the time series is
the sequence of numerical GTF values formed in the previous step. The
main idea of the HVG algorithm is as follows. Two nodes t<sub>i</sub> and t<sub>j</sub>, which correspond to
the elements x<sub>i</sub> and x<sub>j</sub> of the time series, are in horizontal visibility if and only if
x<sub>k</sub> &lt; min(x<sub>i</sub>, x<sub>j</sub>) for all t<sub>k</sub> such that t<sub>i</sub> &lt; t<sub>k</sub> &lt; t<sub>j</sub>. In our case, the sequence t<sub>i</sub>, i = 1, ..., n, is the sequence
of words within a sentence (n is the number of words left in the sentence after the
above-described pre-processing). HVG makes it possible to build network structures from texts
in which numerical weights are assigned to individual words or
phrases.
      </p>
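      <p>The visibility criterion can be illustrated with a short Python sketch. It is a straightforward quadratic-time implementation of the condition stated above, not the code used in this work; the function name “horizontal_visibility_edges” and the sample series of GTF-like values are assumptions.</p>
      <preformat><![CDATA[
```python
def horizontal_visibility_edges(series):
    """Return undirected HVG links as index pairs (i, j) with i < j.

    Nodes i and j are linked if every value strictly between them is
    smaller than min(series[i], series[j]); neighbours are always linked.
    """
    edges = []
    n = len(series)
    for i in range(n - 1):
        highest_between = float("-inf")  # largest value seen between i and j
        for j in range(i + 1, n):
            if highest_between < min(series[i], series[j]):
                edges.append((i, j))
            highest_between = max(highest_between, series[j])
            if highest_between >= series[i]:
                break  # nothing further to the right is visible from i
    return edges


# Hypothetical GTF values of five consecutive terms in one sentence.
values = [0.5, 0.1, 0.3, 0.2, 0.4]
links = horizontal_visibility_edges(values)
print(links)
```
]]></preformat>
      <p>In this example the first term, which has the largest value, is linked to the last one because every value between them is lower than both, while it is not linked to the fourth term, which is hidden behind the third.</p>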
      <p>If the above algorithm establishes an undirected link between the nodes
t<sub>i</sub> and t<sub>j</sub> of the time series, then:
─ the link is directed from node t<sub>i</sub> to t<sub>j</sub> if, in the sentence, the word (not a
phrase) to which node t<sub>i</sub> corresponds occurs earlier than the term (word or phrase)
to which node t<sub>j</sub> corresponds;
─ the link is directed from node t<sub>j</sub> to t<sub>i</sub> if, in the sentence, the phrase (not
a word) to which node t<sub>j</sub> corresponds occurs earlier than the term (word or phrase)
to which node t<sub>i</sub> corresponds (fig. 2).</p>
      <p>
        Given the above-described principle of forming the sequence of terms and the
proposed rules for determining links, it can be seen that words and phrases will be
parts of the corresponding phrases that have more words. In other words, a significant
part of the phrases with more words is only an extension of the corresponding shorter phrases and
words. A similar principle for building directed networks of words, namely
networks of natural hierarchies of terms, was proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In that
work, the directed network of words and phrases is built on the principle of
inclusion of a term in its corresponding phrase.
      </p>
      <p>
        The weight of links between nodes of the directed network of terms is determined
by the principle proposed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The main idea of this principle is that the
nodes of the directed network built at the previous stage that correspond to the same
term are combined into a single node ("glued"). The number of same-directed
links between the corresponding nodes then determines the weight of the link between these
nodes.
      </p>
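      <p>A minimal sketch of this weighting step is given below. When nodes are identified by the term itself, "gluing" reduces to counting how many times each ordered pair of terms occurs among the directed links collected over all sentences. The function name “glue_and_weigh” and the sample links are assumptions.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def glue_and_weigh(directed_links):
    """Merge nodes with the same term and count same-directed links.

    directed_links: iterable of (source_term, target_term) pairs, one pair
    per link found in any sentence. Returns {(source, target): weight}.
    """
    return Counter(directed_links)


# Hypothetical links collected from several sentences.
sentence_links = [
    ("little", "little_prince"),
    ("little", "prince"),
    ("little", "little_prince"),
    ("prince", "little"),
]
weights = glue_and_weigh(sentence_links)
for (source, target), weight in weights.items():
    print(source, "->", target, weight)
```
]]></preformat>
      <p>Links of opposite direction between the same two terms are kept as separate entries, since only same-directed links contribute to a given weight.</p>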
    </sec>
    <sec id="sec-4">
      <title>Result of research</title>
      <p>The proposed methodology for computerized processing of a text corpus was tested on
the children’s allegorical tale “The Little Prince” by Antoine de
Saint-Exupéry.</p>
      <p>According to the methodology proposed above, the selected text document was
processed and key terms were identified (Table 1).</p>
      <p>
        If all the key terms are arranged in descending order of their GTF values, then
the resulting graph (Fig. 3) follows Zipf's law [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        The obtained directed weighted network of words and phrases was visualized using the
Gephi software [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This software package was used not only for modelling and
visualization but also for analysis of the built network. Figure 2 presents the results of the
proposed methodology. Links with a weight equal to 1 and nodes with
zero in-degree and out-degree were removed from the built network.
      </p>
      <p>The following parameters of the built network were also obtained using the
Gephi software: the number of nodes is 79; the number of links is 117; the average
degree is 1.48; the average path length is 3.74; the average clustering coefficient is
0.012; the network density is 0.019; the number of connected components is 4.</p>
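      <p>Two of these parameters can be cross-checked directly from the reported numbers of nodes and links. The sketch below assumes the usual conventions for a directed graph without self-loops, namely an average degree equal to the number of links divided by the number of nodes and a density equal to the number of links divided by n(n - 1); the variable names are illustrative.</p>
      <preformat><![CDATA[
```python
# Values reported for the built network.
nodes = 79
links = 117

# Directed-graph conventions (an assumption about how the tool reports them).
average_degree = links / nodes
density = links / (nodes * (nodes - 1))

print(round(average_degree, 2), round(density, 3))
```
]]></preformat>
      <p>Under these assumptions both values agree with the reported average degree and density after rounding.</p>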
      <p>The list of the most important links between the corresponding nodes in the network
of terms is presented in Table 2.</p>
      <p>In this work, a new method for extracting key words and phrases from thematic
information flows and a new method for determining the directions of links between
nodes in undirected networks of terms using Part-of-Speech tagging were
proposed. A holistic methodology for computerized processing of text corpora and
for building directed weighted networks of terms (key words and phrases) was also
presented. The key words were extracted on the basis of a prior classification of words
into parts of speech (Part-of-Speech tagging). The proposed methodology for
computerized processing of a text corpus was tested on the children’s allegorical
tale “The Little Prince” by Antoine de Saint-Exupéry. Analysis of the results revealed
the most important links between the nodes in the network of terms that correspond
to certain key concepts in the considered text. The terms “little”, “prince” and
“little_prince” turned out to be key within
the proposed ontological model. These terms also correspond to the title of the
considered text document. As expected, the most important links between the key terms
are "little → little_prince" and "little → prince".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nikonenko</surname>
            <given-names>A.O.</given-names>
          </string-name>
          :
          <article-title>Review of computer-linguistic methods of natural language texts processing</article-title>
          .
          <source>Artificial Intelligence</source>
          , №
          <issue>3</issue>
          ,
          <fpage>174</fpage>
          -
          <lpage>181</lpage>
          (
          <year>2011</year>
          ). (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lande</surname>
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dmytrenko</surname>
            <given-names>O.O.</given-names>
          </string-name>
          :
          <article-title>Creating the Directed Weighted Network of Terms Based on Analysis of Text Corpora</article-title>
          .
          <source>2020 IEEE 2nd International Conference on System Analysis &amp; Intelligent Computing (SAIC)</source>
          (Kyiv, 5-9 Oct.
          <year>2020</year>
          ). DOI: doi.org/10.1109/SAIC51296.2020.9239182
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Information Retrieval</article-title>
          . Cambridge University Press,
          <fpage>22</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ewan</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          :
          <source>Natural Language Processing with Python</source>
          . O'Reilly Media (
          <year>2009</year>
          ). ISBN 0-596-51649-5
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Brill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A simple rule-based part of speech tagger</article-title>
          .
          <source>In: Proceedings of the Third Conference on Applied Natural Language Processing (ANLC '92)</source>
          . Association for Computational Linguistics, Stroudsburg, PA, USA,
          <fpage>152</fpage>
          -
          <lpage>155</lpage>
          (
          <year>1992</year>
          ). DOI: 10.3115/974499.974526
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>Extract Custom Keywords using NLTK POS tagger in python</article-title>
          . https://thinkinfi.com/extractcustom-keywords-using-nltk-pos-tagger-in-python/. Accessed 24 Oct 2020
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lande</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dmytrenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radziievska</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Determining the Directions of Links in Undirected Networks of Terms</article-title>
          .
          <source>In: Selected Papers of the XIX International Scientific and Practical Conference "Information Technologies and Security" (ITS 2019). CEUR Workshop Proceedings (ceur-ws.org)</source>
          , vol.
          <volume>2577</volume>
          ,
          <fpage>132</fpage>
          -
          <lpage>145</lpage>
          (
          <year>2019</year>
          ). ISSN 1613-0073. urn:nbn:de:0074-2318-4. http://ceur-ws.org/Vol-2577/paper11.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ramos</surname>
          </string-name>
          , J.:
          <article-title>Using tf-idf to determine word relevance in document queries</article-title>
          .
          <source>In Proceedings of the first instructional conference on machine learning</source>
          . vol.
          <volume>242</volume>
          ,
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <source>Google Code Archive: Stop-words</source>
          . https://code.google.com/archive/p/stop-words/downloads/. Accessed 24 Oct 2020
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <source>Text Fixer: Common English Words List</source>
          . http://www.textfixer.com/tutorials/commonenglishwords.php. Accessed 24 Oct 2020
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Luque</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lacasa</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Luque</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Horizontal visibility graphs: Exact results for random time series</article-title>
          . Physical Review E,
          <volume>80</volume>
          (
          <issue>4</issue>
          ), (
          <year>2009</year>
          ). DOI: doi.org/10.1103/PhysRevE.80.046103.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lande</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snarskii</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yagunova</surname>
            ,
            <given-names>E. V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pronoza</surname>
            ,
            <given-names>E. V.</given-names>
          </string-name>
          :
          <article-title>The use of horizontal visibility graphs to identify the words that define the informational structure of a text</article-title>
          .
          <source>In: 2014 12th Mexican International Conference on Artificial Intelligence</source>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>215</lpage>
          (
          <year>2014</year>
          ). DOI: doi.org/10.1109/MICAI.2013.33
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Random texts exhibit Zipf's-law-like word frequency distribution</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>38</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1842</fpage>
          -
          <lpage>1845</lpage>
          (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <source>Gephi</source>
          . https://gephi.org. Accessed 02 Dec 2020
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>