Features of Big Text Data Visualization for Managerial Decision Making

E. A. Makarova1, D. G. Lagerev1, F. Y. Lozbinev2
m4karova.e@yandex.ru | LagerevDG@mail.ru | flozbinev@yandex.ru
1 Bryansk State Technical University, Bryansk, Russia
2 RANEPA, Bryansk, Russia

This paper describes text data analysis in the course of managerial decision making. The process of collecting textual data for further analysis, as well as the use of visualization in human control over the correctness of data collection, is considered in depth. A modification of the algorithm for creating an "n-gram cloud" visualization is proposed, which can help make the visualization accessible to people with visual impairments. A method for visualizing n-gram vector representation (word embedding) models is also proposed. On the basis of the conducted research, a part of a software package was implemented that is responsible for creating interactive visualizations in a browser and for interacting with them.

Keywords: visualization, natural language processing, web application accessibility.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

With the acceleration of scientific and technological progress, the economic growth rates of both global and local markets are rapidly increasing. According to the study [5], the number of mergers and acquisitions in Russia in 2017 increased by 13%. The number of originated loans is also growing: according to the United Credit Bureau, the annual number of loans issued in Russia has increased by 22%, while lending volume has increased by 53%. In addition to the accelerated capital turnover, growth is also observed in the labor market. In Antal Russia's survey, 27% of employers reported an increase in staff turnover in their companies over the past year [12].

The higher velocity and number of transactions conducted in various spheres of social and economic activity result in a greater burden on managers at various levels. This requires either an increase in decision-making staff or an enhancement of the information systems supporting managerial decision making in order to reduce people's workload. Besides the traditional data used in such systems (e.g., credit history and capital in scoring systems used for loan approval), many researchers and manufacturers of technological solutions use unstructured sources of information about the legal entities and individuals involved in transactions, such as data from mass media, social networks, etc. Some studies have shown that adding the analysis of text data from social media to prediction models improves their accuracy; for example, it helps to increase the accuracy of bankruptcy prediction for legal entities [7].

Hence, one of the stages of using managerial decision-making support systems is loading into them text information about an object of socio-economic activity for further use. Objects of socio-economic relations are widely represented on the Internet, both through official websites and in the form of a digital reputation, i.e., reviews, news, and whatever appears on the network about them without their direct involvement. However, the amount of such data is constantly growing (due to data duplication, borrowing of data from other sources, etc.), which requires optimizing the speed and cost of its collection and processing. As small and medium-sized businesses have to deal with an ever increasing number of counterparties in the course of their activity, the risk of a transaction with legal entities or individuals unreliable in terms of tax or other laws increases, which may entail long-term consequences, such as reputational costs, and may even result in the legal entity's bankruptcy.

On the one hand, decisions need to be made faster and faster, and their number is growing, which can lead to more errors and risks. This problem can be addressed by integrating data mining systems into DSS and using the large volumes of unstructured data available for analysis [10]. On the other hand, the process of collecting and preprocessing these data requires serious human and computational resources, which can nullify the economic benefits obtained by adding unstructured data to the process of managerial decision making.

Currently, there are various analytical systems that work not only with structured data but also with unstructured data, including text data downloaded from social media [11]. In these systems, visualization is rarely used at the stage of collecting and preprocessing big text data. However, the collection and preprocessing of data for such systems is still quite time consuming, and there is a significant risk of using irrelevant documents as data sources. Usually one of two approaches is taken: either a fully automatic analysis of collection and preprocessing results (faster) or a fully manual review of a large array of documents (higher quality). This article discusses a hybrid approach based on vector data visualization that allows adding expert assessment of document relevance at the stage of data collection and preprocessing [18].

2. Extracting information from sources of varying degrees of structuring

Let us consider in more detail the process of collecting and analyzing text information from various sources, presented in Figure 1.

Fig. 1. Overview diagram of collecting and processing data

All the processes presented in the diagram are important for the efficient use of unstructured text data in managerial decision making. However, this paper places the greatest emphasis on the process of collecting information, since the accuracy and resource consumption of further analysis depend on the quality of the collected data.

In the diagram, DBin is an internal database containing trained models for collecting and analyzing information, as well as accumulated information about the objects of analysis. DBout refers to external databases (structured sources) attached by the user.
The conceptual model for collecting text information is

S = <R, M, D, I>,

where R is resources (temporal, material, human); M is information about previous generations; D is the data sent for analysis; I is the amount of information relevant to the task that is available to the system.

The resources are further decomposed as

R = <Rm, Rh, Rp, T>,

where Rm is the money spent on paid services (various APIs and other services); Rp is the number of available experts in the subject area; Rh is the hardware limitations. For some problems, the hardware resource limit only affects the speed of calculations and can accordingly be considered together with the T parameter; but for some language processing tasks, for example, when using vector representations of words, the amount of available RAM is key to the usability of these methods. T is the time spent on collecting information, which, in turn, can be decomposed into the following components:

T = Tu + (Te + Td) + Ta,

where Tu is the time spent by the main user of the system; Te is the time spent by the expert who manually checks and resolves various situations difficult for machine-aided processing; Td is the time delay between the expert's response and the continuation of processing (a correction accounting for the non-round-the-clock availability of the expert); Ta is the time spent on automatic processing.

The task of optimizing data collection consists in reducing the R and D parameters while increasing the I parameter.
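As an illustration, the decomposition of T above can be sketched as a small data structure (a minimal sketch; the class and field names are ours and are not part of the described system):

```python
from dataclasses import dataclass

@dataclass
class CollectionTime:
    """Components of the time T spent on collecting information."""
    t_user: float    # Tu: time spent by the main user of the system
    t_expert: float  # Te: time spent by the expert resolving hard cases manually
    t_delay: float   # Td: delay between the expert's response and processing
    t_auto: float    # Ta: time spent on automatic processing

    def total(self) -> float:
        # T = Tu + (Te + Td) + Ta
        return self.t_user + (self.t_expert + self.t_delay) + self.t_auto
```

Grouping Te and Td together mirrors the model: both terms vanish when no expert intervention is needed.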
It is also assumed that a number of these parameters will decrease with each subsequent use of the system, owing to the training of users and models and the accumulation of useful knowledge about the objects of research.

In improving the efficiency of information collection, there are two extremes: make all the work fully automatic, thereby saving on human resources, or make process control completely manual. In this paper, an "intermediate" option is considered, in which an expert is engaged in evaluating the effectiveness of the collection process, but, owing to the use of various tools such as visualization, the expert's working time is significantly reduced [9].

In addition, the following approaches are used in the developed software package to optimize information collection before analysis:
1) refinement of search queries;
2) ignoring duplicate information;
3) preliminary data analysis, etc.
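Of these approaches, ignoring duplicate information is the most mechanical. A minimal sketch of exact deduplication by normalized content hash follows (an illustration of the general idea only; the described system additionally removes near-duplicates using word embedding models, which this sketch does not attempt):

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def drop_exact_duplicates(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing by content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing keeps the memory cost proportional to the number of distinct documents rather than their total length.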
3. Visualization of big text data for data mining optimization

Let us consider some features that require human interference for more efficient work, and for which various visualization methods have been studied and refined as part of the work on the system [19]. As a data source in this example we will use web-based media, but the methods being developed are applicable, with some adjustment, to all sources of a similar structure.

When setting up the uploading of text documents from a searchable source, users of the system may find that the query does not produce the required result, e.g., when the request has turned out to be too "general" or when information about homonymous objects is present in the same sources. A way out of this situation may be to view a part of the collected text documents, their summaries, or some metadata; however, this is time consuming for the user (a subject matter expert or employee). Another way to familiarize the user with the downloaded data is to visualize it. In [17], it was demonstrated that differences in the content of documents are noticeable in an "n-gram cloud" visualization (an n-gram is a sequence of words), and it was noted that this method requires further refinement. In the current implementation, the visualization has undergone a number of changes, such as combining the weights of words whose semantic proximity is below a certain threshold and excluding "stop words" and words with small weights from the visualization.

One refinement direction for the visualization method is its adaptation for use by various groups of people, including those with disabilities. When developing visualizations, it is important to consider all user groups, not only in terms of compliance with international standards but also in terms of the growing number of potential users. For example, more than 5% of the population suffer from various forms of color vision deficiency, which can prevent the user from interacting with the visualization to the full extent [2].

In recent years, the topic of accessible visualization has gained great interest from researchers and software development service providers [14]. For example, Square, Inc. [1] has published an open-source guide to creating accessible data visualizations; among the visualizations they propose are various types of charts and graphs. Visualizations related to the analysis of text information have been comparatively little studied from this perspective. Next, we will consider two examples of such visualizations that are important for collecting text data in the described software package.

Classic works devoted to the construction of an n-gram cloud (or "tag cloud", "word cloud") [4, 15], which described the algorithms employed by libraries implementing such visualizations, could not take into account the WCAG recommendations on adapting applications for people with visual impairments, since they appeared before these guidelines were developed.

As part of the software package, a client-server subsystem was implemented as a web application that provides interactive visualizations and lets the user apply the results of their analysis to the data collection process. For example, the developed n-gram cloud visualization takes into account the WCAG 2.1 recommendations. Accordingly, the following restrictions and additions have been introduced to the algorithm:
1) a restriction on the contrast of colors;
2) exclusion of vertical text orientation [8];
3) setting the minimum and maximum text sizes;
4) adding advanced user settings.

Considering that the interface of the existing system was developed as a web application, it is reasonable to rely on the algorithms used to create and display tag clouds [4], adapting them to the problem being solved and to the WCAG 2.1 recommendations.
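The n-gram cloud preprocessing mentioned in this section (excluding stop words and low-weight n-grams before rendering) can be sketched as follows. The weighting here is raw frequency, and the function name and parameters are illustrative assumptions rather than the system's actual code:

```python
from collections import Counter

def ngram_weights(tokens: list[str], n: int, stop_words: set[str],
                  min_weight: int) -> dict[str, int]:
    """Count n-grams over a token stream, dropping any n-gram that
    contains a stop word or whose weight falls below min_weight."""
    counts: Counter[str] = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if any(t in stop_words for t in gram):
            continue
        counts[" ".join(gram)] += 1
    return {g: w for g, w in counts.items() if w >= min_weight}
```

The surviving weights would then drive font sizes in the cloud, subject to the minimum and maximum size restrictions listed above.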
Many ready-made visualization tools do not take into account contrast for different groups of people, including those suffering from visual impairment and color vision deficiency. It should be understood, however, that the purpose of creating a tag cloud is often to effectively illustrate an array of information rather than to support detailed analysis [15].

The color contrast requirement according to WCAG 2.1 is

(L1 + 0.05) / (L2 + 0.05) > Cmin,

where L1 and L2 are the relative luminances of the compared colors. Since all words in the visualization are interactive, the required contrast for them should be calculated as for controls, i.e., Cmin = 3 for separately located n-grams. In addition, the contrast of each individual color against the background should be at least Cmin = 4.5 [16]. Calculations show that it is possible to find only two colors that are simultaneously contrastive with the background and with each other.

There are also restrictions on the font size. On the one hand, the minimum size of n-grams should not be less than 16 pt [16]. On the other hand, the same standard requires that all text on a page can be magnified to 200% while maintaining readability, which constrains the maximum possible font size when the page is displayed at 100%. In order to maintain the approximate positions of the text containers when the page is enlarged, CSS Grid technology [3] and the slicing floorplan algorithm [4] were used in the interface design. Besides, the user should be able to apply custom settings for the colors and sizes of the visualization.

Let us consider a specific example. In [17], it is described in detail how visual analysis of a part of the text documents returned by a search query makes it clear whether various search entities need to be added to or excluded from the query. Figure 2 shows the visualization implemented for adjusting data collection for the "BMZ" object (AO UK BMZ, Bryansk Machine-Building Plant); the figures presented demonstrate work with texts in Russian. The user's task is to assess the reputation of this legal entity, for which it is necessary to collect data on the object. The goal of this visualization is to track whether the context of the request, which implied the search for an enterprise located in the Bryansk region, was conveyed correctly. As the user can see from the visualization, the search settings were incorrect, which resulted in the collected data containing many documents related to the activity of a similar enterprise in the Republic of Belarus. Excluding text documents containing the word "Belarus" from the search results significantly increased the accuracy of the collection, also discarding documents with references to such objects as "Africa", "Chad", etc.
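The WCAG 2.1 contrast check discussed in this section can be implemented directly from the standard's definitions of relative luminance and contrast ratio (a sketch assuming colors are given as sRGB triples; production code would also parse CSS color strings):

```python
def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color per the WCAG 2.1 definition."""
    def channel(c: int) -> float:
        s = c / 255
        return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(c1: tuple[int, int, int], c2: tuple[int, int, int]) -> float:
    """(L1 + 0.05) / (L2 + 0.05), with L1 the lighter of the two colors."""
    l1, l2 = sorted((relative_luminance(c1), relative_luminance(c2)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def passes(c1, c2, c_min: float) -> bool:
    """True if the pair meets the required minimum contrast (e.g., 3 or 4.5)."""
    return contrast_ratio(c1, c2) >= c_min
```

Black text on a white background yields the maximum possible ratio of about 21:1; interactive n-grams in the cloud are checked against Cmin = 3 between themselves and Cmin = 4.5 against the background, as described above.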
Fig. 2. N-gram cloud visualization of a text document collection in Russian

In addition, since the visualization is displayed in the browser, all elements are given a "tabindex" attribute, assigned in ascending order as significance in the sample decreases, and an "aria-label" attribute carrying the weight of the element, to facilitate perception by people who have vision deficiencies and use screen-reading software.

The work on this topic [17] demonstrates how word embedding models [6] pre-trained on different collections of text documents group words differently in terms of their semantic proximity. Moreover, errors related to the content of the source data occur in models built on word embeddings. In the described system, these models are used not only to simplify visualization but also to remove duplicate documents during further processing. A canvas-based visualization [13] was developed to give the user an opportunity to edit the acceptable boundaries of semantic proximity (or to cancel n-gram combining if the proposed options are unacceptable for the problem being solved). In the center of the visualization is the word whose position in the word embedding model is being explored. Distances from an n-gram are defined so that the two-dimensional vector is equal to the similarity indicator of this n-gram with respect to the one under study (by default, this value is 0.4). The algorithm then selects positions for the n-grams so as to ensure their readability, following the recommendations described above (no intersections with other elements, horizontal text of an acceptable size). An example of this visualization for the n-grams having the greatest semantic affinity with the word "industry" is presented in Figure 3.

Fig. 3. N-gram based nearest neighbor visualization in the word embedding model
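The neighbor selection underlying Figure 3 can be sketched with plain cosine similarity over a toy vector table (in the real system the vectors come from pre-trained word embedding models [6]; the function below and its 0.4 default threshold follow the description above but are otherwise illustrative):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbors(target: str, vectors: dict[str, list[float]],
                      threshold: float = 0.4) -> list[tuple[str, float]]:
    """N-grams whose similarity to the target meets the threshold,
    sorted from most to least similar."""
    t = vectors[target]
    hits = [(w, cosine(v, t)) for w, v in vectors.items() if w != target]
    return sorted([(w, s) for w, s in hits if s >= threshold],
                  key=lambda p: p[1], reverse=True)
```

The resulting (n-gram, similarity) pairs would then be laid out around the central word at distances proportional to the similarity scores, subject to the readability constraints listed above.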
Typically, solving a data collection configuration task requires viewing (or using a visualization of) 20-30 random documents from the document collection, depending on the amount of available data. An experiment was conducted on the effect of the method used to solve configuration tasks, in which five groups of users participated: users of groups 1 and 2 solved the data collection configuration tasks using the visualization with standard settings, users of groups 3 and 4 used quick skimming of documents, and users of group 5 used the visualization with user settings (the pre-setting time is included in the final calculation). The test results for some tasks are presented in Table 1. Prior to working with the tasks presented, all users were trained on a test task. Some user groups performed only one group of tasks at a time (for example, analyzing entities associated with "BMZ"), while others immediately started the next task after solving the current one.

Table 1. Average time spent by the user on one document (in seconds).

Group                  | BMZ  | BMZ + Bryansk | BMZ + Bryansk + Industry | Ecofrio | Ecofrio + potatoes
Group 1 (one task)     | 12.5 | 13            | 11.5                     | 13      | 12
Group 2 (three tasks)  | 12   | 13.5          | 13                       | 14      | 13
Group 3 (one task)     | 17.5 | 19            | 18.5                     | 16      | 14
Group 4 (three tasks)  | 17   | 15            | 14.5                     | 17.5    | 15
Group 5 (three tasks)  | 11   | 10.5          | 11                       | 12      | 11.5

On average, the time saved using visualization compared to quick skimming of texts varies with the user's familiarity with the system and ranges from 18% to 42%.

Table 2 demonstrates how using the n-gram cloud visualization and applying the analysis results to the search query parameters increase the number of relevant documents received during data collection (for 20 random documents from a search sample).

Table 2. Impact of manual adjustment of the request on the share of relevant documents.

Share of relevant documents | Object 1 "BMZ" | Object 2 "Isoterm" | Object 3 "Ecofrio" | Object 4 "Spetsstroy"
Before adjustment           | 20%            | 30%                | 85%                | 10%
After adjustment            | 85%            | 45%                | 90%                | 20%

On average, an increase in the number of relevant documents by about 24% was registered. The number of relevant documents in the experiment was determined by expert viewing of 20 random documents from a search sample.
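The reported average of about 24% can be reproduced directly from Table 2 as the mean increase in the share of relevant documents across the four objects:

```python
# Shares of relevant documents before and after query adjustment (Table 2).
before = {"BMZ": 20, "Isoterm": 30, "Ecofrio": 85, "Spetsstroy": 10}
after = {"BMZ": 85, "Isoterm": 45, "Ecofrio": 90, "Spetsstroy": 20}

# Mean gain in percentage points over the four objects.
increase = sum(after[k] - before[k] for k in before) / len(before)
print(increase)  # 23.75, i.e., about 24 percentage points
```

Note that the gain is very uneven across objects (from 5 points for "Ecofrio" to 65 points for "BMZ"), so the average should be read as indicative rather than typical.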
4. Conclusion

Adding textual data to the data analyzed in the process of managerial decision making can increase its efficiency. In this paper, special attention is paid to the process of collecting text data from various sources. It is shown that visualization of big text data can significantly reduce the time spent on its human processing: the time savings compared to skimming texts range from 18% to 42%, and the number of relevant documents found increases by about 24%. Besides, a part of a software package has been developed that provides visualization of text data and of vector representation (word embedding) models. When developing visualization algorithms, it is necessary to take into account the international standards for creating web applications for people with disabilities, thus making the applications accessible to a wide range of users.

In the future, it is planned to continue the study of efficient data collection methods for analysis to support managerial decision making. In particular, it is planned to study in more detail the vector representation of n-grams and its use for identifying and deleting duplicate data.

5. References

1. Accessible Colors for Data Visualization. Available at: https://medium.com/@zachgrosser/accessible-colors-for-data-visualization-2ad64ac4ee7e
2. Causes of Colour Blindness. Available at: http://www.colourblindawareness.org/colour-blindness/causes-of-colour-blindness/
3. CSS Grid – Table layout is back. Be there and be square. Available at: https://developers.google.com/web/updates/2017/01/css-grid
4. Kaser O., Lemire D. (2007) Tag-Cloud Drawing: Algorithms for Cloud Visualization. Tagging and Metadata for Social Information Organization, a workshop at WWW2007, pp. 1086-1087.
5. KPMG presents the results of a survey of Russia's mergers and acquisitions market in 2017. Available at: https://home.kpmg/ru/en/home/media/press-releases/2018/03/ma-survey-2017.html
6. Kutuzov A., Kutuzov I. (2015) Texts in, meaning out: neural language models in semantic similarity tasks for Russian. Proceedings of the Dialog 2015 Conference, Moscow, Russia.
7. Mai F., Mai T., Ling C., Ling M. (2018) Deep Learning Models for Bankruptcy Prediction using Textual Disclosures. European Journal of Operational Research. doi: 10.1016/j.ejor.2018.10.024
8. Make your information more accessible. National Disability Authority. Available at: http://nda.ie/Resources/Accessibility-toolkit/Make-your-information-more-accessible/
9. Podvesovskii A.G., Isaev R.A. (2018) Visualization Metaphors for Fuzzy Cognitive Maps. Scientific Visualization, vol. 10, no. 4, pp. 13-29. doi: 10.26583/sv.10.4.02
10. Podvesovskii A.G., Gulakov K.V., Dergachyov K.V., Korostelyov D.A., Lagerev D.G. (2015) The choice of parameters of welding materials on the basis of a fuzzy cognitive model with neural network identification of nonlinear dependence. Proceedings of the 2015 International Conference on Mechanical Engineering, Automation and Control Systems (MEACS) (Tomsk, Russia, December 1-4, 2015), IEEE. doi: 10.1109/MEACS.2015.741490
11. Prangnawarat N., Hulpus I., Hayes C. (2015) Event Analysis in Social Media using Clustering of Heterogeneous Information Networks. The 28th International FLAIRS Conference, AAAI Publications.
12. Staff turnover has started to grow. Available at: https://www.antalrussia.com/news/staff-turnover-has-started-to-grow/
13. The canvas element. Available at: https://html.spec.whatwg.org/multipage/canvas.html#the-canvas-element
14. The Future of Data Visualization: Predictions for 2019 and Beyond. Available at: https://depictdatastudio.com/the-future-of-data-visualization-predictions-for-2019-and-beyond/
15. Viégas F., Wattenberg M., Feinberg J. (2009) Participatory visualization with Wordle. IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1137-1144. doi: 10.1109/TVCG.2009.17
16. Web Content Accessibility Guidelines (WCAG) 2.1. Available at: https://www.w3.org/TR/WCAG21/
17. Zakharova A.A., Lagerev D.G., Makarova E.A. (2019) Evaluation of the semantic value of textual information for the development of management decisions. CPT2019 Conference Proceedings, TzarGrad, Moscow region, Russia.
18. Zakharova A.A., Vekhter E.V., Shklyar A.V. (2017) Methods of Solving Problems of Data Analysis Using Analytical Visual Models. Scientific Visualization, vol. 9, no. 4, pp. 78-88. doi: 10.26583/sv.9.4.08
19. Zhao J., Zhao G., Zhao L., Zhao W. (2014) PEARL: An Interactive Visual Analytic Tool for Understanding Personal Emotion Style Derived from Social Media. IEEE Conference on Visual Analytics Science and Technology (VAST 2014) Proceedings. doi: 10.1109/VAST.2014.7042496