=Paper=
{{Paper
|id=Vol-2744/paper20
|storemode=property
|title=Visual Analysis of Textual Information on the Frequencies of Joint Use of Nouns and Adjectives
|pdfUrl=https://ceur-ws.org/Vol-2744/paper20.pdf
|volume=Vol-2744
|authors=Alexander Bondarev,Alexander Bondarenko,Vladimir Galaktionov,Lev Shapiro
}}
==Visual Analysis of Textual Information on the Frequencies of Joint Use of Nouns and Adjectives==
Alexander Bondarev 1[0000-0003-3681-5212], Alexander Bondarenko 2[0000-0003-4765-6034], Vladimir Galaktionov 1[0000-0001-6460-7539] and Lev Shapiro 1[0000-0002-6350-851X]

1 Keldysh Institute of Applied Mathematics RAS, Moscow, Russia

2 State Res. Institute of Aviation Systems (GosNIIAS), Moscow, Russia

bond@keldysh.ru, cod@fgosniias.ru, vlgal@gin.keldysh.ru, pls@gin.keldysh.ru

Abstract. This paper presents the results of numerical experiments on data volumes consisting of the frequencies of joint use of adjectives and nouns. The data volumes were obtained from samples of text collections in Russian. The aim of the research is to analyze the cluster structure of the studied volume and the semantic proximity of words within clusters and subclusters. The working hypothesis is that words with similar meanings should occur in approximately the same contexts. Consequently, in the feature space they lie relatively close to each other, while words with different meanings lie farther apart. The research is carried out using elastic maps, which are effective tools for the visual analysis of multidimensional data. Constructing elastic maps and their extensions in the space of the first three principal components makes it possible to determine the cluster structure of the studied multidimensional data volumes. The cluster structure of the considered data volume is analyzed, and the influence of transposing the initial data array is considered.

Keywords: Multidimensional Data, Visual Analysis, Elastic Maps, Frequencies of Joint Use, Cluster Structures.

===1 Introduction===
The rapid, widespread transition to digital technologies in the modern world has made the processing, visualization and analysis of multidimensional data an extremely urgent task.
According to modern classifications, multidimensional data can be considered Big Data. The need to process, visualize and analyze such data has led to the intensive development of visual analytics tools [1-8].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This work has been supported by the RFBR grants 19-01-00402 and 20-01-00358.

The approaches and methods of visual analytics are constantly evolving and provide users with sufficiently reliable tools for solving many practical problems in the study of multidimensional data. Such problems include data classification, cluster detection, identification of the key determining parameters, and establishing relationships between key parameters. In essence, the approaches of visual analytics combine algorithms for dimensionality reduction with the visual presentation of multidimensional data on manifolds of lower dimension embedded in the original volume.

Such algorithms include mapping the initial multidimensional volume onto elastic maps [5–8] with different elasticity properties. These methods make it possible, in one way or another, to extract the cluster structure from the initial multidimensional data volume. Elastic maps have proved to be a useful and fairly universal tool that can be applied to multidimensional data volumes of various types and origins.

This work continues research on the development of visual analytics tools for the analysis of multidimensional volumes of numerical and textual information. Studies on this topic are presented in [10-14]. In the course of this research, the construction of elastic maps was tested on a large amount of data of various origins.
This work is devoted primarily to experiments with a multidimensional data volume consisting of the frequencies of joint use of adjectives and nouns. Text corpora and arrays of joint-use frequencies are built with the help of dedicated procedures. Similar studies were carried out earlier for arrays of the “verb + noun” type [10].

===2 Construction of elastic maps===
In this section, we briefly describe the technology of constructing elastic maps as a means of visualizing arbitrary multidimensional data. The underlying ideas and the algorithms for building elastic maps are presented in detail in [5–8].

The description of the construction of elastic maps follows [7]. An elastic map is a system of elastic springs embedded in a multidimensional data space. The approach is based on an analogy with problems of mechanics: the principal manifold passing through the “middle” of the data can be represented as an elastic membrane or plate. The elastic map method is formulated as an optimization problem: a functional that depends on the relative position of the map and the data is minimized.

According to [5–8], the basis for constructing an elastic map is a two-dimensional rectangular grid G embedded in a multidimensional space; the grid approximates the data and has adjustable elastic properties with respect to tension and bending. The location of the grid nodes is found by solving the optimization problem of minimizing the functional

<math>D = \frac{D_1}{|X|} + \lambda \frac{D_2}{m} + \mu \frac{D_3}{m} \rightarrow \min \qquad (1)</math>

where |X| is the number of points in the multidimensional data volume X; m is the number of grid nodes; λ and μ are the elastic coefficients responsible for the tension and curvature of the grid, respectively; and <math>D_1</math>, <math>D_2</math>, <math>D_3</math> are the terms responsible for the properties of the grid.
Here <math>D_1</math> is a measure of the proximity of the grid nodes to the data, <math>D_2</math> is a measure of the stretching of the grid, and <math>D_3</math> is a measure of the curvature of the grid.

The elasticity parameters are varied by constructing elastic maps with successively decreasing elastic coefficients, so that the map becomes softer and more flexible and adapts as closely as possible to the points of the initial multidimensional data volume. After construction, the elastic map can be unfolded onto a plane in order to observe the cluster structure of the studied data volume. On the unfolded map (its extension), the distribution of data density can be shown in color, which in some cases is very useful.

Elastic maps are especially effective when used in conjunction with principal component analysis (PCA). Displaying the elastic map and its extension in the space formed by the first three principal components can dramatically improve the results, especially in clustering and classification problems, and allows us to determine the cluster structure of the studied multidimensional data volumes. The author of [7] developed the ViDaExpert software package [9], which supports the processing of multidimensional data volumes, the construction of elastic maps, and their effective visualization. Elastic mapping and visualization of the results in this study were performed using this software tool.

Based on the construction of elastic maps, a number of studies of various multidimensional data volumes were conducted, and a number of procedures for processing multidimensional data were developed that significantly improved the cluster picture of the studied data volumes [10-14].
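As an illustration of functional (1), the following Python sketch evaluates D for a given grid and projects data onto the first principal components. It is a minimal sketch under stated assumptions, not the ViDaExpert implementation: the concrete forms of the three terms (squared distance from each point to its nearest node, squared edge lengths, squared second differences along grid rows and columns) follow the usual elastic map formulation and are assumed here, since the text describes these terms only qualitatively. The function names `elastic_energy` and `pca_project` are hypothetical.

```python
import numpy as np


def elastic_energy(X, grid, lam, mu):
    """Evaluate D = D1/|X| + lam*D2/m + mu*D3/m for a rectangular grid.

    X    : (n_points, n_dims) data array
    grid : (rows, cols, n_dims) node positions embedded in the data space
    lam  : tension coefficient (lambda in the functional)
    mu   : bending coefficient (mu in the functional)
    """
    rows, cols, dims = grid.shape
    nodes = grid.reshape(-1, dims)
    m = nodes.shape[0]

    # D1 (assumed form): squared distance from each point to its nearest node.
    sq_dist = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    d1 = sq_dist.min(axis=1).sum()

    # D2 (assumed form): squared lengths of edges between neighbouring nodes.
    d2 = (np.diff(grid, axis=0) ** 2).sum() + (np.diff(grid, axis=1) ** 2).sum()

    # D3 (assumed form): squared second differences along rows and columns.
    d3 = ((np.diff(grid, n=2, axis=0) ** 2).sum()
          + (np.diff(grid, n=2, axis=1) ** 2).sum())

    return d1 / X.shape[0] + lam * d2 / m + mu * d3 / m


def pca_project(X, k=3):
    """Project centred data onto its first k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T
```

Re-evaluating `elastic_energy` with smaller `lam` and `mu` between successive optimization runs corresponds to the gradual "softening" of the map described above; the optimization of the node positions itself is not shown.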
===3 Constructing elastic maps for multidimensional data of the “adjective + noun” type===
This section presents the results of constructing elastic maps for a multidimensional data array consisting of the frequencies of joint use of adjectives and nouns. This work continues [10-14], where similar studies were performed for multidimensional data volumes built on the “verb + noun” principle.

To construct the data volume, procedures similar to those described in [10] were used. The same basic hypothesis was adopted: words that are close in meaning should occur in approximately the same contexts. Consequently, in the feature space such words lie relatively close to each other, while words with different meanings lie farther apart. Models of the “adjective + noun” type were investigated. The number of adjectives was taken as the number of dimensions, and the number of nouns as the number of points in the multidimensional space. The coordinates of these points were the frequencies of joint use. In the studies considered below, the array was based on a sample of such frequencies for 300 adjectives and 300 nouns; that is, we consider 300 points, each lying in a 300-dimensional space.

Filtering was carried out at the data preparation stage. As in [10], to cut off noise, all combinations with a frequency of occurrence below a predetermined threshold were discarded. In addition, only those main words (and their corresponding combinations) were selected for which the size (cardinality) of the set of dependent words exceeds a certain threshold value. This is necessary to filter out noise in the combinations extracted from the collection.
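The construction and filtering of the frequency array described above can be sketched as follows. This is an illustrative sketch only: the input format (a list of `(adjective, noun)` pairs), the function name `build_frequency_matrix`, and both threshold parameters are assumptions, since the actual thresholds and the extraction procedures of [10] are not specified here.

```python
from collections import Counter, defaultdict

import numpy as np


def build_frequency_matrix(pairs, min_pair_freq, min_dependents):
    """Return (nouns, adjectives, matrix) for filtered "adjective + noun" pairs.

    pairs          : iterable of (adjective, noun) tuples from a corpus
    min_pair_freq  : hypothetical threshold; combinations occurring fewer
                     times than this are discarded as noise
    min_dependents : hypothetical threshold; a noun is kept only if its
                     number of distinct adjectives exceeds this value
    """
    counts = Counter(pairs)

    # Step 1: drop combinations whose frequency is below the threshold.
    counts = {pair: c for pair, c in counts.items() if c >= min_pair_freq}

    # Step 2: keep only nouns (main words) with enough distinct adjectives.
    dependents = defaultdict(set)
    for adj, noun in counts:
        dependents[noun].add(adj)
    nouns = sorted(n for n, adjs in dependents.items()
                   if len(adjs) > min_dependents)
    adjectives = sorted({a for n in nouns for a in dependents[n]})

    # Rows are nouns (points); columns are adjectives (dimensions).
    row = {n: i for i, n in enumerate(nouns)}
    col = {a: j for j, a in enumerate(adjectives)}
    matrix = np.zeros((len(nouns), len(adjectives)))
    for (adj, noun), c in counts.items():
        if noun in row:
            matrix[row[noun], col[adj]] = c
    return nouns, adjectives, matrix
```

Each row of `matrix` is one noun and each column one adjective, matching the roles of points and dimensions stated above; `matrix.T` swaps the two roles.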
The frequency threshold removes combinations that entered the database by accident, while the threshold on the number of different combinations guarantees sufficient statistics for comparisons.

From the data obtained, elastic maps were constructed while varying the bending and tension coefficients towards maximum “softness”. Let us consider some results. Fig. 1 shows a fragment of the constructed elastic map for the multidimensional data array representing the frequencies of joint use of 300 nouns and 300 adjectives. The fragment is annotated with the noun corresponding to each point.

Fig. 1. A fragment of the constructed elastic map for the considered multidimensional data array with annotations.

Fig. 2 shows an extension of the elastic map in the space of the first two principal components, colored according to data density. The density range is divided into five equal parts, which correspond to colors in ascending order from blue to red. The same coloring is used in Figs. 2-8.

Fig. 2. Extension of the constructed elastic map for the considered multidimensional data array with coloring by data density.

The presented visual image of the multidimensional array of joint-use frequencies for 300 adjectives and 300 nouns reveals five areas of condensation. Three areas are located on the left edge of the extended map, one is located in the upper right corner, and one more, weakly expressed, area of condensation is located in the lower right corner of the image. Let us look at them in turn.

Fig. 3 shows a close-up of the map fragment corresponding to the upper right corner. Here we can see a number of subclusters containing nouns that are similar in meaning. For example, in the left part of Fig.
3, one can trace the closely related nouns ВОПРОС (QUESTION), ПРОБЛЕМА (PROBLEM), ЗАДАЧА (TASK). Another group, located in the middle of the picture, contains the semantically close nouns МОМЕНТ (MOMENT), ПОЛОЖЕНИЕ (STATE), СИТУАЦИЯ (SITUATION), ДЕЛО (CASE), УСЛОВИЕ (CONDITION). The leftmost subcluster in Fig. 3 contains the words ВРЕМЯ (TIME), ГОД (YEAR), ЖИЗНЬ (LIFE), ДЕНЬ (DAY), МИР (PEACE).

Fig. 3. Extension of the constructed elastic map - close-up - upper right corner.

Fig. 4 shows a similar close-up of the map fragment corresponding to the lower right corner. Judging by the density coloring in Fig. 2, there is also a weakly expressed cluster here. Again we can see a number of subclusters containing nouns that are similar in meaning. For example, in the upper part of Fig. 4 we can trace the nouns МЫСЛЬ (THOUGHT) and ИДЕЯ (IDEA), which are close both in meaning and on the elastic map. Another group contains the semantically similar nouns ПАРЕНЬ (GUY), МУЖЧИНА (MAN), ДЕВУШКА (GIRL), ЖЕНЩИНА (WOMAN). A little lower in Fig. 4 one can see a subcluster consisting of the words ЛИЦО (FACE), РУКА (HAND), ГЛАЗ (EYE).

Fig. 4. Extension of the constructed elastic map - close-up - lower right corner.

A complete presentation of all the clusters and subclusters obtained for nouns would take up too much space, so below we restrict ourselves to the most characteristic regions. Fig. 5 shows a close-up of the topmost area of condensation on the left edge. Here one can trace the following groups of concepts located close together on the fragment of the extension. In the upper left corner there is the group АРМИЯ (ARMY), ВОЙСКО (TROOPS), ПРАВИТЕЛЬСТВО (GOVERNMENT), КОМПАНИЯ (COMPANY), ВЛАСТЬ (AUTHORITY), НАРОД (PEOPLE). In the upper right part of Fig.
5, we can see the group ПРОГРАММА (PROGRAM), ТЕХНИКА (TECHNOLOGY), ИССЛЕДОВАНИЕ (RESEARCH), ПРОЕКТ (PROJECT), ЗАДАЧА (TASK). In the middle of the left side of the figure is the group КОМИТЕТ (COMMITTEE), РЫНОК (MARKET), РЕГИОН (REGION), УПРАВЛЕНИЕ (MANAGEMENT), ПОЛИТИКА (POLITICS).

Fig. 5. Extension of the constructed elastic map - close-up - upper left corner.

Thus, the constructed elastic map makes it possible to single out a number of subclusters and groups uniting semantically related words. This opens up a number of possibilities, including searching for words by related words from such a group.

The considered data array was then transposed, as in [10]. In the transposed data array, nouns play the role of dimensions, and adjectives are treated as points in the multidimensional space. The numerical characteristics are again the frequencies of joint use of adjectives and nouns.

An extension of the constructed elastic map colored by data density is shown in Fig. 6. The picture is very similar to that in Fig. 2, with the difference that the weakly expressed region of condensation in the lower right corner practically disappears. The visual image reveals four areas of condensation: three are located on the left edge of the map, and one is located in the upper right corner.

Fig. 6. Extension of the constructed elastic map for a transposed data array with coloring by data density.

As in the previous case, we consider some of the areas of condensation. Fig. 7 shows a close-up of the high-density region in the upper right corner of the extension of the elastic map for the transposed data array. Here one can trace groups of adjectives with similar characteristics.
For example, in the upper right corner there is the group ГОСУДАРСТВЕННЫЙ (STATE), НАЦИОНАЛЬНЫЙ (NATIONAL), ПОЛИТИЧЕСКИЙ (POLITICAL), МЕЖДУНАРОДНЫЙ (INTERNATIONAL), ОБЩЕСТВЕННЫЙ (PUBLIC). Nearby is a group with national-geographical characteristics: РУССКИЙ (RUSSIAN), ЕДИНЫЙ (UNIFIED), ЕВРОПЕЙСКИЙ (EUROPEAN), АМЕРИКАНСКИЙ (AMERICAN), ИНОСТРАННЫЙ (FOREIGN), НЕМЕЦКИЙ (GERMAN), ФРАНЦУЗСКИЙ (FRENCH), ИТАЛЬЯНСКИЙ (ITALIAN), ГЕРМАНСКИЙ (GERMAN).

Fig. 7. Extension of the constructed elastic map - close-up - upper right corner.

We also give an example of a group of words located in the lower right corner of the extension of the elastic map for the transposed array. This fragment is shown in Fig. 8. At the bottom of the figure, one can distinguish a group of adjectives describing size: ОГРОМНЫЙ (HUGE), БОЛЬШОЙ (BIG), МАЛЕНЬКИЙ (SMALL), КРУПНЫЙ (LARGE), НЕБОЛЬШОЙ (LITTLE), МЕНЬШИЙ (SMALLER), БОЛЬШИЙ (LARGER), ДЛИННЫЙ (LONG), УЗКИЙ (NARROW), ШИРОКИЙ (WIDE).

Fig. 8. Extension of the constructed elastic map - close-up - lower right corner.

Thus, summing up the experiments and the results obtained, it can be argued that the original hypothesis of this study is justified. Recall that we assumed that words close in meaning should be located close to each other in the space of frequency characteristics.

The implemented approach makes it possible to process volumes of textual information and to identify groups of nouns and adjectives with similar semantic characteristics.

===4 Conclusion===
To analyze the “visual portrait” of a multidimensional data volume, elastic map construction technologies were used. These technologies are methods for mapping points of the initial multidimensional space onto manifolds of lower dimension embedded in this space.
The extension of such a map, displayed in the space of the first principal components, gives a “visual portrait” of a multidimensional data volume. Such an image can be naturally supplemented by coloring that displays data density.

This work describes the results of constructing elastic maps for analyzing data volumes consisting of the frequencies of joint use of adjectives and nouns. The cluster structure of the considered multidimensional data volume is analyzed, and the effect of transposing the source data is studied. The initial hypothesis that words close in meaning are close in the feature space is confirmed.

===5 Acknowledgment===
This work was supported by RFBR grants 19-01-00402 and 20-01-00358.

===References===
1. Thomas, J., Cook, K.: Illuminating the Path: Research and Development Agenda for Visual Analytics. IEEE-Press, USA (2005).

2. Wong, P., Thomas, J.: Visual Analytics. IEEE Computer Graphics and Applications 24(5), 20-21 (2004).

3. Keim, D., Kohlhammer, J., Ellis, G., Mansmann, F.: Mastering the Information Age – Solving Problems with Visual Analytics. Eurographics Association (2010).

4. Kielman, J., Thomas, J.: Foundations and Frontiers of Visual Analytics. Information Visualization 8(4), 239-314 (2009).

5. Gorban, A. et al.: Principal Manifolds for Data Visualisation and Dimension Reduction. Springer, Berlin – Heidelberg – New York (2007).

6. Gorban, A., Zinovyev, A.: Principal manifolds and graphs in practice: from molecular biology to dynamical systems. International Journal of Neural Systems 20(3), 219–232 (2010).

7. Zinovyev, A.: Visualization of multidimensional data. NGTU, Krasnoyarsk (2000) [in Russian].

8. Zinovyev, A.: Data visualization in political and social sciences. In: Badie, B., Berg-Schlosser, D., Morlino, L. A. (Eds.) International Encyclopedia of Political Science. SAGE (2011).

9. ViDaExpert, http://bioinfo.curie.fr/projects/vidaexpert, last accessed 01 March 2020.

10. Bondarev, A., Bondarenko, A., Galaktionov, V., Klyshinsky, E.: Visual analysis of clusters for a multidimensional textual dataset. Scientific Visualization 8(3), 1-24 (2016).

11. Bondarev, A., Bondarenko, A., Galaktionov, V.: Visual analysis procedures for multidimensional data. Scientific Visualization 10(4), 109-122 (2018). https://doi.org/10.26583/sv.10.4.09

12. Bondarev, A.: The procedures of visual analysis for multidimensional data volumes. ISPRS Archives XLII-2/W12, 17-21 (2019). https://doi.org/10.5194/isprs-archives-XLII-2-W12-17-2019

13. Bondarev, A.: Visual analysis and processing of clusters structures in multidimensional datasets. ISPRS Archives XLII-2/W4, 151-154 (2017).

14. Bondarev, A., Galaktionov, V.: Applying visual analysis procedures to multidimensional medical data. CEUR Workshop Proceedings 2485, 122-126 (2019). https://doi.org/10.30987/graphicon-2019-2-122-126