Transformations of Texts into the Complex Network with Applying Visibility Graphs Algorithms

0 Institute for Information Recording of National Academy of Sciences of Ukraine, Kyiv, Ukraine
1 National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine

In this article, visibility algorithms for transforming texts into a complex network are proposed. Keywords and concepts are extracted from a set of documents that describe some subject domain. A numeric value is assigned to each word or phrase using the GTF metric, which is proposed in this article in place of the ordinary TF-IDF metric, a statistic intended to reflect how important a word is to a document in a collection or corpus. As a result, a time series is constructed. A tool of time series analysis, the visibility graph algorithm, is used to construct a graph of the subject domain. Two topical subject domains (“Information extraction” and “Complex network”) are considered as examples. The corpora of documents related to these subject domains were taken from arXiv (https://arxiv.org), an open access repository of electronic preprints, and the proposed algorithm was applied to them. The article shows that applying the GTF metric is more expedient than applying the TF-IDF metric when the set of documents describes a single subject domain. The results of applying the visibility graph algorithm and the compactified horizontal visibility graph algorithm are also compared. The article shows that in some cases the compactified horizontal visibility graph algorithm gives a network of words with more connections between concepts than the visibility graph algorithm. Gephi, an open-source visualization and exploration software for all kinds of graphs and networks, and an original package of specially developed Python modules are used for simulation and visualization. The proposed algorithm can be used to visualize a subject domain and in information support systems, where it helps reveal key components of a subject domain. The results of this article can also be used for building the UI of information retrieval systems, making the search for relevant information easier.

Keywords: Set of Documents, Subject Domain, Time Series, Network of Words, TF-IDF, Visibility Graph, Compactified Horizontal Visibility Graph

1 Introduction

The development of the Internet has caused a number of problems, related first of all to the massive quantity of data in the Web space, including needless data.

Today the Internet holds a huge and dynamic information base that is available for research and analysis. It turns out that many tasks that arise when working with the network information space have much in common with the mathematical sciences. This fact opens wide opportunities for applying powerful mathematical tools [1, 2]. To cope with the huge dimensionality and the dynamics of information resources in global networks, knowledge from discrete mathematics (graph theory, network theory), pattern recognition (classification, clustering), linguistics, digital signal processing, wavelet analysis, and fractal analysis is applied.

Because terabytes of textual data are distributed across networks and accumulate dynamically, new methods and algorithms for analyzing these data must be developed. The advantages and disadvantages of the existing information retrieval algorithms must also be considered.

Modern technologies in some cases make it possible to find relevant information. But the problems of further analytical processing of this information remain unresolved: selecting the necessary factual data, detecting development trends in a selected subject domain, identifying the relations between concepts and events, and forecasting. Most of these problems are topical challenges of semantic processing of huge dynamic sets of textual data.

2 Analysis of Recent Research and Publications

The subject of this study is topical and is frequently addressed in articles by foreign and domestic scientists. For example, the works [3, 4] focus on developing new methods and algorithms intended for analytical processing of huge sets of textual data. In the works [5, 6] the authors consider linguistic processing of natural language texts as one of the central problems of the intellectualization of information technologies.

In particular, the visibility graph algorithm is proposed in the works [7-10]. A method of constructing networks based on the visibility graph algorithm is presented in the works [11-15].

3 Review of Some Visibility Algorithms

In this work, a network of connections between the terms and concepts that occur in textual data is built. Building networks of words, whose nodes are elements of the text, makes it possible to reveal key components of the text. At the same time, the task of determining which of the important structural elements of the text are also informationally important remains topical.

There are several approaches to constructing networks from texts (so-called language networks) and different ways of interpreting nodes and connections. This leads, accordingly, to various representations of such networks. Nodes are connected if the corresponding words are adjacent in the text [16, 17], occur in the same sentence [18], or are connected syntactically [19, 20] or semantically [21, 22].

3.1 Visibility Graph Algorithm (VG)

In this article, a tool of time series analysis, the visibility graph algorithm [7, 23, 24], is used to map a time series into a network.

For example, the visibility graph derived from the time series {0.125, 0.063, 0.042, 0.104, 0.125, 0.063, 0.042, 0.104, 0.125, 0.063, 0.042, 0.104} is presented in the accompanying figure. Every node of the graph corresponds, in the same order, to a data point of the series. The visibility rays between the data points define the links connecting nodes in the graph.

Two nodes are connected if they are in the “line of sight” of each other, i.e. if they can be connected by a straight line that does not cross any other histogram bar. More formally, the visibility criterion is as follows: two arbitrary data values $(t_a, y_a)$ and $(t_b, y_b)$ have visibility, and consequently become two connected nodes of the associated graph, if any other data point $(t_c, y_c)$ placed between them fulfills:

$$y_c < y_b + (y_a - y_b)\,\frac{t_b - t_c}{t_b - t_a}.$$
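The criterion can be sketched in Python as follows. This is a minimal illustration, not the authors' package: it assumes that the points are sampled at unit time steps t = 0, 1, 2, ... and uses the inequality from [7], y_c < y_b + (y_a - y_b)(t_b - t_c)/(t_b - t_a). The function name `visibility_graph` is hypothetical.

```python
from itertools import combinations


def visibility_graph(series):
    """Return the edges (a, b), a < b, of the natural visibility graph.

    Assumption: the points are sampled at times t = 0, 1, 2, ...
    """
    edges = set()
    for a, b in combinations(range(len(series)), 2):
        ya, yb = series[a], series[b]
        # Every intermediate point must lie strictly below the line of sight
        # that joins (a, ya) and (b, yb). Adjacent points are always linked,
        # because the range of intermediate points is empty.
        if all(
            series[c] < yb + (ya - yb) * (b - c) / (b - a)
            for c in range(a + 1, b)
        ):
            edges.add((a, b))
    return edges


# The periodic example series from the text.
series = [0.125, 0.063, 0.042, 0.104, 0.125, 0.063, 0.042, 0.104,
          0.125, 0.063, 0.042, 0.104]
edges = visibility_graph(series)
```

In this series node 0 sees node 4, but it does not see node 8, because the bar at node 4 has the same height and blocks the line of sight. The quadratic loop over all pairs is chosen for readability, not speed.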

It is also shown in [7] that the structure of the time series is conserved in the graph topology: periodic series convert into regular graphs, random series into random graphs, and fractal series into scale-free graphs.

3.2 Compactified Horizontal Visibility Graph Algorithm (CHVG)

In the works [11, 12, 13, 25-27] another algorithm for constructing networks of words, the compactified horizontal visibility graph algorithm (CHVG), is proposed. In general, the construction of a language network with this algorithm consists of three stages (Fig. 2). At the first stage, the nodes, which correspond to the words in order of their occurrence in the text, are marked on the horizontal axis. At the second stage, the horizontal visibility graph is built. Two observations made at times $t_i$ and $t_j$ are connected in a horizontal visibility graph (HVG) if and only if

$$x_k < \min\{x_i, x_j\} \quad \text{for all } t_k \text{ with } t_i < t_k < t_j.$$
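The second stage can be sketched in Python as follows. The sketch assumes the strict inequality used for horizontal visibility graphs in [9], i.e. every value between two observations must be smaller than both of them. The function name `horizontal_visibility_graph` is hypothetical.

```python
def horizontal_visibility_graph(values):
    """Return the edges (i, j), i < j, of the horizontal visibility graph."""
    edges = set()
    n = len(values)
    for i in range(n - 1):
        for j in range(i + 1, n):
            lower = min(values[i], values[j])
            # Nodes i and j see each other horizontally only if every
            # value strictly between them is below both endpoints.
            if all(values[k] < lower for k in range(i + 1, j)):
                edges.add((i, j))
    return edges


hvg_edges = horizontal_visibility_graph([3, 1, 2, 4])
# hvg_edges == {(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)}
```

Nodes 1 and 3 are not connected in this example, because the value 2 at node 2 is not below the smaller endpoint value 1.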

At the third stage, the network obtained at the previous stages is compactified. As a result, a new network of words, the compactified horizontal visibility graph, is obtained.
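The compactification step is not spelled out here. The Python sketch below therefore rests on an assumption taken from the description of the CHVG in [11-13]: all positions that carry the same word are merged into one node, and duplicate links and self-loops that appear after the merge are dropped. The function name `compactify` and the toy data are hypothetical.

```python
def compactify(words, edges):
    """Merge positions that carry the same word into a single node.

    Assumption: duplicate links and self-loops created by the merge
    are discarded, so the result is a simple undirected graph.
    """
    merged = set()
    for i, j in edges:
        u, v = words[i], words[j]
        if u != v:  # skip self-loops such as ("graph", "graph")
            merged.add(tuple(sorted((u, v))))
    return merged


# Toy input: word at each position and links between positions.
words = ["graph", "text", "graph", "network"]
position_edges = {(0, 1), (1, 2), (2, 3), (0, 2)}
word_edges = compactify(words, position_edges)
# word_edges == {("graph", "text"), ("graph", "network")}
```

Four positional links collapse into two links between words: the links (0, 1) and (1, 2) coincide after merging, and the link (0, 2) joins two occurrences of the same word.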

In this manner, the compactified horizontal visibility graph algorithm makes it possible to construct network structures from texts in which a numeric value is assigned in some manner to each word or phrase.

4 Forming of the Time Series

In this article, the TF-IDF numeric metric (TF: Term Frequency, IDF: Inverse Document Frequency) is used for forming the time series. It is an example of a function that assigns a number to a word in the text. TF-IDF is one of the most frequently applied weighting schemes. This numerical statistic is intended to estimate how important a word is to a document in a collection or corpus [28]. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. It is often used as a weighting factor in text mining and information retrieval. It can also be used as one of the criteria for estimating the relevance of a document to a search query [29].

TF (term frequency) is the ratio of the number of occurrences of a word in a document to the total number of words in the document. In this manner, the weight of a term (word) $t_i$ that occurs in a document is simply proportional to the term frequency. The term was proposed by Karen Spärck Jones [30].

$$\mathrm{TF} = \frac{n_i}{\sum_k n_k},$$

where $n_i$ is the number of occurrences of the term (word) $i$ in the document and $\sum_k n_k$ is the total number of words in the document.

IDF (inverse document frequency) is an inverse function of the number of documents in which a term occurs. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient. Using IDF reduces the weight of widely used terms (words).

$$\mathrm{IDF} = \log \frac{D}{|(d_i \supset t_i)|},$$

where $D$ is the total number of documents in the corpus and $|(d_i \supset t_i)|$ is the number of documents that contain the term $t_i$ ($n_i \neq 0$).

In other words, the TF-IDF metric is the product of two factors, TF and IDF:

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}.$$

A word has a high TF-IDF score in a document if it appears in relatively few documents but does appear in this one, and if, when it appears in a document, it tends to appear many times.
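The definitions of TF, IDF, and TF-IDF given in this section can be sketched in Python as follows. The sketch assumes that every document is already tokenized and non-empty, and it uses the natural logarithm, since the base of the logarithm is not fixed in the text. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: TF-IDF score} dictionary per document.

    docs is a list of non-empty token lists.
    """
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)  # total number of words in the document
        scores.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return scores


docs = [
    ["graph", "network", "graph"],
    ["network", "text"],
    ["network", "series", "series"],
]
scores = tf_idf(docs)
```

In this toy corpus the word "network" occurs in all three documents, so its IDF is log(3/3) = 0 and its TF-IDF score is zero in every document, while "graph" in the first document receives (2/3) · log 3.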

After the corpus of documents is represented in vector form (the number of words determines the dimension of the vector), the visibility graph algorithm described above is applied.

5 Presentation of the Basic Material of the Research

In this article, we propose to remove stop words before applying the method of constructing networks from texts. This removes the words that are informationally unimportant. We use a stop dictionary compiled from various stop dictionaries, which are available via the links: https://code.google.com/archive/p/stopwords/downloads/; http://www.textfixer.com/tutorials/common-english-words.php.
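Stop-word removal can be sketched in Python as follows. The small `STOP_WORDS` set is purely illustrative and is not the stop dictionary used in this article; the function name `remove_stop_words` is hypothetical, and case-insensitive matching is an assumption.

```python
# Illustrative stop list only; a real stop dictionary is much larger.
STOP_WORDS = {"the", "of", "a", "is", "in", "and", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop dictionary."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The extraction of information is a task".split()
filtered = remove_stop_words(tokens)
# filtered == ["extraction", "information", "task"]
```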

We also propose a global TF metric (GTF), defined as

$$\mathrm{GTF} = \frac{n_i}{\sum_k n_k},$$

where $n_i$ is the number of occurrences of the term (word) $i$ in all documents of the corpus and $\sum_k n_k$ is the total number of words in the documents of the corpus.
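The GTF definition can be sketched in Python as follows. The sketch assumes tokenized documents and a non-empty corpus; the function name `gtf` and the toy corpus are hypothetical.

```python
from collections import Counter


def gtf(docs):
    """Return {term: GTF}, with counts taken over the whole corpus."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())  # total number of words in the corpus
    return {term: n / total for term, n in counts.items()}


docs = [
    ["graph", "network", "graph"],
    ["network", "text"],
    ["network", "series", "series"],
]
weights = gtf(docs)
# weights["network"] == 0.375, weights["graph"] == 0.25
```

The word "network" occurs in every document of this toy corpus and receives the highest GTF value, 3/8. Its IDF, by contrast, would be log(3/3) = 0, so TF-IDF would assign it zero weight.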

Words that do not occur often within a single document have a low TF value. But if they occur in every document of the corpus, they are in fact informationally important in the global context of the considered subject domain. That is why we use the GTF metric in this article.

5.1 Example 1

A corpus of 292 documents related to a topical domain, “Information extraction”, was taken from arXiv (https://arxiv.org), an open access repository of electronic preprints, for the period 2000-2010.

As a result of applying the proposed method of constructing networks from texts, a network of keywords that are important structural elements of the subject domain was obtained (Fig. 3).

Based on the results presented in Table 1, we can see that the number of informationally important keywords is larger when only the GTF metric is applied to a set of documents that describe one subject domain. Keywords such as “information” and “extraction”, which are informationally important for the considered subject domain, are missed when the TF-IDF metric is used (these keywords have a low TF-IDF). After analyzing the results (Table 1), we conclude that applying only the GTF metric is more expedient than applying the TF-IDF metric when the set of documents describes one subject domain. This can be explained by the fact that words that are key for the considered subject domain and occur in every document of the corpus have a low IDF (and, as a result, a low TF-IDF). In fact, however, these words are informationally important and define the structure of the text.

To compare the results of the visibility graph algorithm and the compactified horizontal visibility graph algorithm, a corpus of 2901 documents related to a topical subject domain, “Complex network”, was taken from arXiv (https://arxiv.org) for the period 2000-2010. Applying the two visibility graph algorithms produced two different networks of words for the considered subject domain (Fig. 4, Fig. 5).

After the associated graphs are derived with the visibility algorithms, the weight of each node in the CHVG and the VG is calculated according to its connections with other nodes, and all terms are sorted in descending order of weight. As the weight, for example, the authority (or hub) score calculated by the HITS algorithm [31] is used. Because the graph is undirected, the choice between the authority score and the hub score does not matter.
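The remark about undirected graphs can be checked with a small Python sketch of the HITS power iteration. It is an illustration under stated assumptions, not the implementation used in this article: the graph is assumed to be connected and to contain a triangle (hence non-bipartite), in which case the hub and authority vectors converge to the same scores. The function name `hits`, the number of iterations, and the toy word graph are hypothetical.

```python
def hits(edges, iterations=50):
    """HITS power iteration on an undirected graph given as (u, v) pairs.

    Returns two dictionaries, (hub, authority), each normalized to sum to 1.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    hub = {n: 1.0 for n in neighbours}
    auth = dict(hub)
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the neighbours.
        auth = {n: sum(hub[m] for m in neighbours[n]) for n in neighbours}
        norm = sum(auth.values())
        auth = {n: a / norm for n, a in auth.items()}
        # Hub update: sum of the authority scores of the neighbours.
        hub = {n: sum(auth[m] for m in neighbours[n]) for n in neighbours}
        norm = sum(hub.values())
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Toy word graph: a triangle with one pendant node.
word_edges = [("network", "graph"), ("network", "text"),
              ("graph", "text"), ("text", "series")]
hub, auth = hits(word_edges)
ranking = sorted(auth, key=auth.get, reverse=True)
```

For this graph the hub and authority scores agree to within numerical precision, and "text", the node with the most links, is ranked first.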

Comparing the results (Table 2), we can see that with the compactified horizontal visibility graph algorithm (Fig. 5) many words have more links than with the visibility graph algorithm (Fig. 4).

The total number of links is 768 for the compactified horizontal visibility graph algorithm and 703 for the ordinary visibility graph algorithm. It should be noted that the obtained networks are very complex. That is why we plan to continue our research in this area.

6 Conclusion

A method of constructing networks from texts, so-called language networks, was proposed. Keywords and concepts were extracted from a set of documents that describe some subject domain. A numeric value was assigned to each word or phrase using the GTF metric, which was proposed in this article in place of the ordinary TF metric. After analyzing the results of the research, we concluded that applying only the GTF metric is more expedient than applying the TF-IDF metric when the set of documents describes one subject domain. As a result, a time series was constructed. A tool of time series analysis, the visibility graph algorithm, was used to construct the graph of the subject domain. The analysis revealed the important structural elements of the text. It should be noted that these elements are also informationally important and define the structure of the text. It was found that in some cases the compactified horizontal visibility graph algorithm gives a network of words with more connections between concepts than the visibility graph algorithm. Because of the complexity of the obtained networks, we plan to continue our research in this area.

The proposed method can be used to visualize a subject domain and in information support systems, where it helps reveal key components of a subject domain. The results of this article can also be used for building the UI of information retrieval systems, making the search for relevant information easier.

References

1. D.V. Lande, A.A. Snarskii, and I.V. Bezsudnov, Internetika: Navigation in complex networks: models and algorithms, Moscow, Russia: Librokom, Editorial URSS (in Russian) (2009).

2. D.V. Lande, Knowledge Search in INTERNET. Professional work. Dialectics, Moscow (in Russian) (2005).

3. C.C. Aggarwal, and C.X. Zhai, Mining text data. Springer Science & Business Media (2012) 77-128.

4. Miner, J. Elder, and Hill, Practical text mining and statistical analysis for nonstructured text data applications. Academic Press (2012).

5. Yu. Taranukha, Intelligent processing of texts, Kiev: electronic publication on the website of the faculty (in Ukrainian) (2014).

6. E.I. Bol'shakova, E.S. Klyshinsky, D.V. Lande, A.A. Noskov, O.V. Peskova, and E.V. Yagunova, Automatic processing of texts in a natural language and computational linguistics, Moscow: MIEM Publ (in Russian) (2011).

7. L. Lacasa, B. Luque, F. Ballesteros, J. Luque, and J.C. Nuño, From time series to complex networks: the visibility graph, Proc. Natl. Acad. Sci. USA 105 (2008) 4972-4975.

8. A.M. Nunez, L. Lacasa, J.P. Gomez, and B. Luque, Visibility algorithms: A short review, Frontiers in Graph Theory. InTech (2012) 119-152.

9. B. Luque, L. Lacasa, F. Ballesteros, and J. Luque, Horizontal visibility graphs: Exact results for random time series. Physical Review E, 80 (4) (2009) 046103.

10. G. Gutin, Mansour, and Severini, A characterization of horizontal visibility graphs and combinatorics on words, Physica A 390 (2011) 2421-2428.

11. D.V. Lande, and A.A. Snarskii, Compactified HVG for the Language Network. In: Proceedings of the International Conference on Intelligent Information Systems: The Conference is dedicated to the 50th anniversary of the Institute of Mathematics and Computer Science, 20-23 Aug. 2013, Chisinau, Moldova: Proceedings IIS, Institute of Mathematics and Computer Science (2013) 108-113.

12. D.V. Lande, A.A. Snarskii, E.V. Yagunova, and E. Pronoza, The Use of Horizontal Visibility Graphs to Identify the Words that Define the Informational Structure of a Text. In: Proceedings of the 12th Mexican International Conference on Artificial Intelligence (2013) 209-215.

13. D.V. Lande, A.A. Snarskii, and E.V. Yagunova, Application of the CHVG-algorithm for scientific texts. In: Proceedings of the Open Semantic Technologies for Intelligent Systems (OSTIS), February 20-22, Minsk (2014) 199-204.

14. D.V. Lande, A.A. Snarskii, and Yu. Manko, The Model of Words Cumulative Influence in a Text. In: XVIII International Conference on Data Science and Intelligent Analysis of Information. Springer, Cham (2018) 249-256.

15. D.V. Lande, A.A. Snarskii, E.V. Yagunova, E. Pronoza, and Volskaya, Hierarchies of Terms on the Euromaidan Events: Networks and Respondents Perception, 12th International Workshop on Natural Language Processing and Cognitive Science NLPCS (2015) 127-139.

16. R. Ferrer-i-Cancho, and R.V. Solé, The Small World of Human Language, Proceedings of the Royal Society of London B: Biological Sciences 268.1482 (2001) 2261-2265.

17. S.N. Dorogovtsev, and J.F.F. Mendes, Language as an Evolving Word Web, Proceedings of the Royal Society of London B: Biological Sciences 268.1485 (2001) 2603-2606.

18. S.M.G. Caldeira, T.C. Petit Lobao, R.F.S. Andrade, A. Neme, and J.G. Miranda, The network of concepts in written texts, Preprint physics/0508066 (2005).

19. R. Ferrer-i-Cancho, R.V. Solé, and R. Kohler, Patterns in syntactic dependency networks, Physical Review E 69.5 (2004) 051915.

20. R. Ferrer-i-Cancho, The variation of Zipf's law in human language, The European Physical Journal B-Condensed Matter and Complex Systems (2005) 249-257.

21. A.E. Motter, A.P.S. De Moura, Y.C. Lai, and P. Dasgupta, Topology of the conceptual network of language, Physical Review E, 65 (6) (2002) 065102.

22. M. Sigman, and G.A. Cecchi, Global Organization of the Wordnet Lexicon, Proceedings of the National Academy of Sciences 99.3 (2002) 1742-1747.

23. I.V. Bezsudnov, and A.A. Snarskii, From the time series to the complex networks: The parametric natural visibility graph, Physica A: Statistical Mechanics and its Applications 414 (2014) 53-60.

24. Li, Sun, Gao, D. Han, and Wang, The parametric modified limited penetrable visibility graph for constructing complex networks from time series, Physica A: Statistical Mechanics and its Applications, 492 (2018) 1097-1106.

25. M. Wang, H. Xu, L. Tian, and H.E. Stanley, Degree distributions and motif profiles of limited penetrable horizontal visibility graphs. Physica A: Statistical Mechanics and its Applications (2018).

26. M. Wang, A.L. Vilela, R. Du, L. Zhao, G. Dong, L. Tian, and H.E. Stanley, Exact results of the limited penetrable horizontal visibility graph associated to random time series and its application. Scientific Reports, 8 (1) (2018) 5130.

27. M. Wang, A.L. Vilela, R. Du, L. Zhao, G. Dong, L. Tian, and H.E. Stanley, Topological properties of the limited penetrable horizontal visibility graph family, Physical Review E, 97 (5) (2018) 052117.

28. J.D. Ullman, Data Mining, Mining of massive datasets. Cambridge University Press (2011) 1-17.

29. J. Beel, B. Gipp, S. Langer, and C. Breitinger, Research-paper recommender systems: a literature survey, International Journal on Digital Libraries, 17 (4) (2016) 305-338.

30. K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation, MCB University Press 60 (2004) 493-502.

31. J.M. Kleinberg, Authoritative sources in a hyperlink environment. Journal of the ACM (JACM), 46 (5) (1999) 604-632.