Transformations of Texts into the Complex Network with
        Applying Visibility Graphs Algorithms

       Dmitry Lande1[0000-0003-3945-1178], Oleh Dmytrenko1[0000-0001-8501-5313] and

                             Andrei Snarskii2[0000-0002-4468-4542]
  1
    Institute for Information Recording of National Academy of Sciences of Ukraine, Kyiv,
                                           Ukraine
2
  National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv,
                                           Ukraine
  dwlande@gmail.com, dmytrenko.o@gmail.com, asnarskii@gmail.com


      Abstract. In this article, the algorithms of visibility for transforming texts into a
      complex network is proposed. Keywords and concepts from a set of documents
      which describe some subject domain are extracted. Numeric values are assigned
      to each word or phrase using GTF metric, which was proposed in this article in-
      stead ordinary TF-IDF metric, that is intended to reflect how important a word
      is to a document in a collection or corpus. As the result, a time series is con-
      structed. A tool in time series analysis – the visibility graph algorithm is used
      for constructing a graph of the subject domain. In this article, two actual subject
      domains (“Information extraction” and “Complex network”) are considered for
      example. The corpora of documents, which are related to actual subject do-
      mains, were considered from an open access repository of electronic preprints –
      arXiv (https://arxiv.org). The proposed algorithm is used for the set of docu-
      ments, which are related to “Information extraction” and “Complex network”.
      This article shows that applying GTF metric is more expedient compared with
      TF-IDF metric in the case when the set of documents describe one subject do-
      main. Also, the results of applying the visibility graph algorithm and the com-
      pactified horizontal visibility graph algorithm are compared. This article shows,
      that in some case using the compactified horizontal visibility graph algorithm
      gives a network of words with more quantity of connections between concepts
      compared with using the visibility graph algorithm. An open-source visualiza-
      tion and exploration software for all kinds of graphs and networks Gephi and an
      original package of specially developed Python modules are used for simulation
      and visualization as an additional tool. The proposed algorithm can be used for
      visualization some subject domain, and also for information support systems,
      enabling to reveal key components of a subject domain. Also, the results of this
      article can be used for building UI of information retrieval systems, enabling to
      make a process of search a relevant information easier.

      Keywords: Set of Documents, Subject Domain, Time Series, Network of
      Words, TF-IDF, Visibility Graph, Compactified Horizontal Visibility Graph.
96


1       Introduction

The development of the Internet caused a number of problems, which are related, first
of all, with a massive quantity of data in the Web-space, including needless data.
   Today on the Internet there is a huge and dynamic information base which is avail-
able for research and analysis. It turned out, that many tasks, which arise during work-
ing with the network information space, have much in common with mathematical
sciences. This fact opens wide opportunities to applying a powerful mathematical tool
[1,2]. Taking into account the problems of the huge dimensionality and the dynamic
of information resources in global networks, the knowledge based on discrete mathe-
matics (graph theory, networks theory), pattern recognition (classification, clustering),
linguistics, digital signal processing, wavelet analysis and fractal analysis are applied.
   Due to terabytes of textual data, that are distributed in networks and have been ac-
cumulating dynamically, development of new methods and algorithms for analyzing
these data is necessary. But also the advantages and disadvantages of algorithms that
exist for information retrieval must consider.
   A modern development of technologies in some case enable to find relevant infor-
mation. But the problems of further analytical processing of this information, selec-
tion of necessary factual data, detection of development trends in selected subject
domain, the relation between concepts, events, and forecasting remain unresolved.
More of these problems are actual challenges of a semantic processing of huge dy-
namical sets of textual data.


2      Analysis of Recent Researches and Publications

A subject of this study is actual and most commonly found in various articles of for-
eign and domestic scientists. For example, in the works [3,4] the main accent makes
on developing new methods and algorithms, which are appointed to analytical
processing of huge sets of textual data. In the works [5,6] authors consider a linguistic
processing of natural language texts, as one of the central problem of intellectualiza-
tion of information technologies.
   In particular, in the works [7-10] the visibility graphs algorithm is proposed. Also
the method of constructing networks based on the visibility graphs algorithm is pre-
sented in works [11-15].


3      Review of Some Visibility Algorithms

In this work, a network of connections between terms and concepts, which go into
textual data is building. Building networks of words, the nodes of which are elements
of the text, enables to reveal key components of the text. At the same time, the task of
determining, which of the important structural elements of the text are also informa-
tionally important, is actual.
                                                                                              97


   There are several approaches of constructing networks from the texts (so-called
language networks) and different ways of interpreting nodes and connections. It leads,
accordingly, to various kinds of presenting of such networks. Nodes are connected if
corresponding words are either adjacent in the text [16, 17], or are in a single sentence
[18], or are syntactically [19, 20] or semantically [21, 22] connected.

3.1    Visibility Graph Algorithm (VG)
In this article, a tool in time series analysis – the visibility graph algorithm [7, 23, 24]
is used for converting a time series into a graph. This algorithm maps a time series
into a network.
   For example, the derived graph of visibility for the time series {0.125, 0.063,
0.042, 0.104, 00.125, 0.063, 0.042, 0.104, 0.125, 0.063, 0.042, 0.104} is presented in
Ошибка! Источник ссылки не найден.. In the graph, every node corresponds, in
the same order, to series data. The visibility rays between the data define the links
connecting nodes in the graph.


Fig. 1. Example of a time series and the associated graph derived from the visibility algorithm

      There is a connection between nodes if they are in “line of sight” with each other,
i.e. if they can be connected by a line that does not cross any other histogram bar.
More formally, the visibility criteria is described as follows: two arbitrary data values
 (ta , ya ) and (tb, yb ) will have visibility, and consequently will become two connected
nodes of the associated graph, if any other data (tс , yс ) placed between them fulfills:
98


                                                            t t
                                       yс  yb  ( ya  yb ) b с .
                                                            tb  ta
   Also in the article [7] is shown that the structure of the time series is conserved in
the graph topology: periodic series convert into regular graphs, random series into
random graphs, and fractal series into scale-free graphs.

3.2       Compactified Horizontal Visibility Graph Algorithm (CHVG)
In the works [11, 12, 13, 25-27] another algorithm for constructing networks of words
– the compactified horizontal visibility graph algorithm (CHVG) is proposed. In gen-
eral, the process of constructing the language network using the compactified hori-
zontal visibility graph algorithm consists of three stages (Fig. 2). At the first stage, the
set of nodes, which correspond to the set of words in order of occurrence in the text,
are marked on the horizontal axis. At the second stage, the horizontal visibility graph
is built. Two observations made at times ti and t j to be connected in a horizontal
visibility graph (HVG) if and only if
                                             xk  min{ xi , x j }
     for all tk with ti  tk  t j .
    At the third stage, the network, that was obtained at the previous stages, is com-
pactified. As the result, the new network of words – the compactified horizontal visi-
bility graph is obtained.


Fig. 2. The stages of the building of the compactified horizontal visibility graph (CHVG)
                                                                                   99


   In this manner, the compactified horizontal visibility graph algorithm enables to
construct of network structures based on texts, in which numeric values are assigned
in some manner to each word or phrase.


4      Forming of the Time Series

In this article, TF-IDF numeric metric (TF – Term Frequency, IDF — Inverse Docu-
ment Frequency) is used for forming of the time series. It is an example of a function
that assigns a number to a word in the text. TF-IDF is the most frequently applied
weighting scheme. Also this a numerical statistic is intended to estimate how impor-
tant a word is to a document in a collection or corpus [28]. The TF-IDF value increas-
es proportionally to the number of times a word appears in the document and is offset
by the number of documents in the corpus that contain the word, which helps to adjust
for the fact that some words appear more frequently in general. It is often used as a
weighting factor in text mining, information searching, and retrieval. Also, it can be
used as one of the criteria to estimate the relevance of a document to a search query
[29].
   TF (term frequency) is a ratio of the number of the word occurs in a document to
the total number of words in the document. In this manner, the weight of a term
(word) ti that occurs in a document is simply proportional to the term frequency. The
term was proposed by Karen Spärck Jones [30],
                                                 ni
                                     TF                  ,
                                            n
                                             k
                                                      k


   where ni is a number of occurrences of the term (word) i in the document;  nk is
                                                                                k
a total number of words in the document.
   IDF (inverse document frequency) is an inverse function of the number of docu-
ments in which a term occurs. It is the logarithmically scaled inverse fraction of the
documents that contain the word, obtained by dividing the total number of documents
by the number of documents containing the term and then taking the logarithm of that
quotient. Using IDF reduces the weight of widely used terms (words).
                                                      D
                                 IDF  log                    ,
                                             (d i  ti )
      where D is a total number of documents in the corpus; ( di  ti ) is the number
of documents contain a term ti ( ni  0 ).
   In other words, the TF-IDF metric is a product of two members: TF and IDF.
                                 TF  IDF  TF IDF
  A word has high TF-IDF score in a document if it appears in relatively few docu-
ments, but appears in this one, and when it appears in a document it tends to appear
many times.
  After the representation of corpora of documents in a vector view (number of
words determines the dimension of the vector), the visibility graph algorithm, which
was described above, is used.
100


5      Presentation of the Basic Material of the Research

In this article, before using the method to constructing networks from the texts, we
propose to remove stop word. It enables to removing the words, which are informa-
tionally unimportant. We use the stop-dictionary based on various stop-dictionaries,
which are available via the links: https://code.google.com/archive/p/stop-
words/downloads/; http://www.textfixer.com/tutorials/common-english-words.php.
   Also, we propose a global TF metric (GTF), which looks like
                                                ni
                                       GTF         ,
                                                nk
                                                k
   where ni is the number of occurrences of the term (word) i in documents of the
corpus;  nk is a total number of words in the documents of the corpus.
         k
    Words, which are occurred not often within a single document, have a low TF me-
tric. But if they occur in every document of corpora, they at real are informationally
important in a global context for the considered subject domain. That is why in this
article we use a GTF metric.

5.1    Example 1
In this article, the corpora of 292 documents, which are related with an actual domain
– “Information extraction”, were considered from an open access repository of elec-
tronic preprints – arXiv (https://arxiv.org) for a period of time 2000-2010.
   As the result of applying of a proposed method of constructing networks from the
texts, the network of keywords, which are important structural elements of the subject
domain, was obtained (Fig. 3).


Fig. 3. The network of keywords, obtained through the application of the proposed method
                                                                                        101


   Based on the results which presented in the Table 1 we can notice that quantity of
keywords, which are informationally important, is more in case of applying only GTF
metric for the set of documents that describe one subject domain. The keywords, such
as “information” and “extraction”, which are informationally important for the consi-
dered subject domain, are missed in case of using TF-IDF metric (these keywords
have a low TF-IDF). After analyzing the results of research (Table 1) we can make
the conclusion that applying only GTF metric is more expedient compared with TF-
IDF metric in the case when the set of documents describe one subject domain. It can
be explained by the fact that words, which are key for the considered subject domain
and occur in every document of corpora, have a low IDF (as the result a low TF-IDF).
But in fact, these words are informationally important and define the structure of the
text.

 Table 1. TOP-40 largest-weight nodes of the network of words constructed from corpora of
           documents, which describe “Information extraction” subject domain

  Weight (TF-IDF)              Word              Weight (GTF)                 Word
         0.23               feedback                 0.26                  quantum
         0.23               quantum                  0.26                information
         0.22               consider                 0.25                  feedback
        0.218                  state                0.239                  consider
        0.214                control                0.237                    control
        0.211             measurement               0.205                 algorithms
         0.21                problem                0.203                  problem
         0.2                   states               0.192               measurement
         0.18              continuous               0.186                      state
        0.179              algorithms               0.185                 distortions
        0.176              estimation               0.185                  concepts
        0.176               available               0.178                 extraction
        0.176              distortions              0.171                      loss
        0.175               concepts                0.168                    limited
        0.172                 theory                0.166                   variable
        0.165                identify               0.163                 estimation
        0.163                limited                0.162                   analyze
        0.162               paradigm                0.159                     states
        0.155                defined                0.158                 properties
        0.152                  field                0.158                   strategy
        0.149               questions               0.157                   number
       0.1489              considered               0.149                  available
        0.145              properties               0.143                     series
        0.145                   time                0.141                 continuous
        0.143                content                0.137                    theory
         0.14                 results               0.136                   identify
        0.135             entanglement              0.113               entanglement
        0.132               presented               0.103                  temporal
102


          0.128                number                   0.101                 propose
          0.126                 rules                   0.092                paradigm
          0.124               temporal                  0.089                   field
          0.122                propose                  0.085                presented
          0.101                 label                   0.084                   rules
          0.087                discuss                  0.082                 defined
          0.086                process                  0.077                associated
          0.083            corresponding                0.072                   time
          0.076               possible                  0.071                 possible
         0.0745                 series                  0.07                   results
          0.053                 noise                   0.067                   basis
          0.046              evolution                  0.055                 answers


5.2    Example 2
For comparison of the results of applying the visibility graph algorithm and the com-
pactified horizontal visibility graph algorithm, the corpora of 2901 documents, which
are related with an actual subject domain – “Complex network”, were considered
from an open access repository of electronic preprints – arXiv (https://arxiv.org) for a
period of time 2000-2010. As a result of applying of visibility graph algorithms two
different networks of words for the considered subject domain, was obtained (Fig. 4,
Fig. 5).


Fig. 4. The network of keywords, obtained through the application of the visibility graph algo-
rithm
                                                                                          103


Fig. 5. The network of keywords, obtained through the application of the compactified horizon-
tal visibility graph algorithm

    After deriving the associated graphs from the visibility algorithms, all the terms are
sorted descending and weight values of CHVG and VG corresponding nodes accord-
ing to a number of connections with other nodes are calculated. As the weight, for
example, the authority (or hub) calculated by HITS algorithm [31] is used. Because
the graph is not directed, the choice of a form of the weight does not matter.
    Comparing the results (Table 2), it may notice, that in the case of applying the
compactified horizontal visibility graph algorithm (Fig. 5) there are many words,
which have more links than in the case of applying the visibility graph algorithm (Fig.
4).

 Table 2. TOP-40 largest-weight nodes of the network of words constructed from corpora of
             documents, which describe “Complex network” subject domain

    Weight (VG)               Word                Weight (CHVG)                Word
       0.17                 networks                   0.16                  selection
       0.17                    web                     0.16                  networks
       0.17                  systems                   0.16                  removal
       0.17                    cell                    0.16                  systems
       0.17                    key                     0.16                   nodes
       0.17                 scale-free                 0.16                  network
       0.17                  display                   0.16                     cell
       0.17                   error                    0.16                   ability
104


        0.17                complex                0.16              communicate
        0.17             components                0.16                   attacks
        0.169                 nodes                0.16                   display
        0.169                 degree               0.16                     rates
        0.169             robustness               0.16                robustness
        0.168              redundant               0.16                    error
        0.166               network                0.16                     high
       0.1645                  rates               0.16                unaffected
       0.1643                  price               0.16                      role
        0.161           unrealistically            0.16                    price
        0.16                  ability              0.16               connectivity
       0.1582              extremely               0.16                 extremely
        0.158                number               0.158                     web
       0.1574                   high              0.158                   degree
       0.1573           communicate               0.158                 scale-free
        0.156               assuring              0.158                    wide
       0.1547                 social              0.158              unrealistically
       0.1545                  wide               0.157                  number
       0.1542                internet             0.157                  complex
       0.1509             unaffected              0.1573               vulnerable
       0.1509                  class              0.1573                   class
       0.1505            connectivity             0.1573                 internet
       0.1504                  albert             0.1572               large-scale
       0.1502               selection             0.1572                redundant
        0.147                attacks              0.1539                   social
        0.146               removal               0.1539                  failure
        0.146             vulnerable              0.1534                     key
        0.145                   role              0.1534           name.albert-laszlo
        0.138                 failure             0.1533                   albert
        0.135         name.albert-laszlo          0.1533              components
       0.1015             large-scale             0.149                  assuring
       0.1012            organization             0.133               organization

   A general quantity of links is 768 in the case of applying the compactified horizon-
tal visibility graph algorithm, unlike in the case of applying the ordinary visibility
graph algorithm, when a general quantity of links is 703. It should be noted, that ob-
tained networks are very complex. That is why we plan to continue our research in
this sphere.


6      Conclusion

The method of constructing networks from the texts, so-called language networks,
was proposed. Keywords and concepts from the set of documents which describe
some subject domain were retrieved. Numeric values were assigned to each word or
                                                                                           105


phrase using GTF metric, which was proposed in this article instead ordinary TF me-
tric. After analyzing the results of the research we made the conclusion that applying
only GTF metric is more expedient compared with TF-IDF metric in the case when
the set of documents describe one subject domain. As the result, a time series were
constructed. A tool in time series analysis – the visibility graph algorithm was used
for constructing the graph of the subject domain. After analyzing the results of re-
search the important structural elements of the text were found. It should be noted that
these elements of the text also are informationally important and define the structure
of the text. There was discovered, that in some case using the compactified horizontal
visibility graph algorithm gives a network of words with more quantity of connections
between concepts compared with using the visibility graph algorithm. Cause of com-
plexity of obtained networks we plan to continue our research in this sphere.
    The proposed method can be used for visualization some subject domain, and also
for information support systems, enabling to reveal key components of a subject do-
main. Also the results of this article can be used for building UI of information re-
trieval systems, enabling to make a process of search a relevant information easier.


References
 1. D.V. Lande, A.A. Snarskii, and I.V. Bezsudnov, Internetika: Navigation in complex net-
    works: models and algorithms, Moscow, Russia: Librokom, Editorial URSS (in Russian)
    (2009).
 2. D.V. Lande, Knowledge Search in INTERNET. Professional work. Dialectics, Moscow (in
    Russian) (2005).
 3. C.C. Aggarwal, and C.X. Zhai, Mining text data. Springer Science & Business Media
    (2012) 77-128.
 4. G. Miner, J. Elder IV, and T. Hill, Practical text mining and statistical analysis for non-
    structured text data applications. Academic Press (2012).
 5. V.Yu. Taranukha, Intelligent processing of texts, Kiev: electronic publication on the web-
    site of the faculty (in Ukrainian) (2014).
 6. E.I. Bol’shakova, E. S. Klyshinsky, D.V. Lande, A.A. Noskov, O. V. Peskova, and E.V.
    Yagunova, Automatic processing of texts in a natural language and computational linguis-
    tics, Moscow: MIEM Publ (in Russian) (2011).
 7. L. Lacasa, B. Luque, F. Ballesteros, J. Luque, and J.C. Nuño, From time series to complex
    networks: the visibility graph, Proc. Natl. Acad. Sci. USA 105 (2008) 4972–4975.
 8. A.M. Nunez, L. Lacasa, J. P. Gomez, and Luque B. Visibility algorithms: A short review,
    Frontiers in Graph Theory. InTech, (2012) 119 – 152.
 9. В. Luque, L. Lacasa, F. Ballesteros, and J. Luque, Horizontal visibility graphs: Exact re-
    sults for random time series. Physical Review E, 80(4) (2009) 046103.
10. G. Gutin, T. Mansour, and S. Severini, A characterization of horizontal visibility graphs
    and combinatoris on words, Physica A, – 390 (2011) 2421-2428.
11. D.V. Lande, and A.A. Snarskii, Compactified HVG for the Language Network. In: Pro-
    ceedings of the International Conference on Intelligent Information Systems: The Confe-
    rence is dedicated to the 50th anniversary of the Institute of Mathematics and Computer
    Science, 20-23 Aug. 2013, Chisinau, Moldova: Proceedings IIS, Institute of Mathematics
    and Computer Science (2013) 108–113.
12. D.V. Lande, A.A. Snarskii, E.V. Yagunova, and E. Pronoza, The Use of Horizontal Visi-
    bility Graphs to Identify the Words that Define the Informational Structure of a Text. In:
106

    Proceedings of the 12th Mexican International Conference on Artificial Intelligence (2013)
    209–215
13. D.V. Lande, A.A. Snarskii, and E.V. Yagunova, Application of the CHVG-algorithm for
    scientific texts. In: Proceedings of the Open Semantic Technologies for Intelligent Systems
    (OSTIS), February 20 – 22th, Minsk (2014) 199–204
14. D.V. Lande, A.A. Snarskii, and D.Yu. Manko, The Model of Words Cumulative Influence
    in a Text. In: XVIII International Conference on Data Science and Intelligent Analysis of
    Information. Springer, Cham (2018) 249-256.
15. D.V. Lande, A.A. Snarskii, E.V. Yagunova, E. Pronoza, and S. Volskaya, Hierarchies of
    Terms on the Euromaidan Events: Networks and Respondents Perception, 12th Interna-
    tional Workshop on Natural Language Processing and Cognitive Science NLPCS 2015
    127-139.
16. R. Ferrer-i-Cancho, and R.V. Solé, The Small World of Human Language, Proceedings of
    the Royal Society of London B: Biological Sciences 268.1482 (2001) 2261-2265.
17. S.N. Dorogovtsev, and J.F.F. Mendes, Language as an Evolving Word Web, Proceedings
    of the Royal Society of London B: Biological Sciences 268.1485 (2001) 2603-2606.
18. S.M.G. Caldeira, T.C. Petit Lobao, R.F.S. Andrade, A. Neme, and J.G. Miranda, The net-
    work of concepts in written texts, Preprint physics/0508066 (2005).
19. R. Ferrer-i-Cancho, R.V. Solé, and R. Kohler, Patterns in syntactic dependency networks,
    Physical Review E 69.5 (2004) 051915.
20. R. Ferrer-i-Cancho, The variation of Zipf’s law in human language, The European Physi-
    cal Journal B-Condensed Matter and Complex Systems, (2005) 249-257.
21. A.E. Motter, A.P.S. De Moura, Y.C. Lai, and P. Dasgupta, Topology of the conceptual
    network of language, Physical Review E, 65(6) (2002) 065102.
22. M. Sigman, and G.A. Cecchi, Global Organization of the Wordnet Lexicon, Proceedings
    of the National Academy of Sciences 99.3 (2002) 1742-1747.
23. I.V. Bezsudnov, and A.A. Snarskii. From the time series to the complex networks: The pa-
    rametric natural visibility graph, Physica A: Statistical Mechanics and its Applications 414
    (2014) 53-60.
24. X. Li, M. Sun, C. Gao, D. Han, and M. Wang, The parametric modified limited penetrable
    visibility graph for constructing complex networks from time series, Physica A: Statistical
    Mechanics and its Applications, 492 (2018) 1097-1106.
25. M. Wang, H. Xu, L. Tian, and H. E. Stanley, Degree distributions and motif profiles of li-
    mited penetrable horizontal visibility graphs. Physica A: Statistical Mechanics and its Ap-
    plications (2018).
26. M. Wang, A.L. Vilela, R. Du, L. Zhao, G. Dong, L. Tian, and H. E. Stanley, Exact results
    of the limited penetrable horizontal visibility graph associated to random time series and
    its application. Scientific reports, 8(1) (2018) 5130.
27. M. Wang, A.L. Vilela, R. Du, L. Zhao, G. Dong, L. Tian, and H. E. Stanley, Topological
    properties of the limited penetrable horizontal visibility graph family, Physical Review E,
    97(5) (2018) 052117.
28. J.D. Ullman, Data Mining, Mining of massive datasets. Cambridge University Press.
    (2011) 1–17.
29. J. Beel, B. GIPP, S. Langer, and C. Breitinger, Research-paper recommender systems: a li-
    terature survey, International Journal on Digital Libraries. 17(4), (2016) 305-338.
30. K.S. Jones, A statistical interpretation of term specificity and its application in retrieval,
    Journal of Documentation, MCB University Press 60, (2004) 493-502.
31. J.M. Kleinberg, Authoritative sources in a hyperlink environment. Journal of the ACM
    JACM. 46 (5) (1999) 604–632.