On the Classification of Hidden Language Concepts in Specialized Texts Based on Intelligent Data Processing Methods

Iurii Krak a,b, Valentina Petrovych a, Vladislav Kuznetsov a, Eduard Manziuk c, Olexander Barmak c, and Anatoliy Kulias a

a Glushkov Cybernetics Institute, 40 Glushkov ave., Kyiv, 03187, Ukraine
b Taras Shevchenko National University of Kyiv, 64/13 Volodymyrska str., Kyiv, 01601, Ukraine
c National University of Khmelnytskyi, 11 Institytska str., Khmelnytskyi, 29000, Ukraine

Abstract
This article discusses and solves problems of comparing language concepts in specialized texts, in particular scientific texts in Ukrainian. A corpus of scientific texts, along with dictionaries of stop words and affixes, was formed for processing the specialized texts. The texts were analyzed and converted into a term frequency-inverse document frequency (TF-IDF) feature representation. To transform the original feature vector, it is proposed to use an algorithm for the synthesis of linear systems in combination with t-distributed stochastic neighbor embedding (t-SNE). A series of experiments was performed on test examples to determine the informational density of the text and to classify specialized texts by keywords using the random sample consensus (RANSAC) method. A method for classifying hidden language concepts using clustering (K-means) is proposed. As a result of the experiments, the structure of a classifier of hidden language concepts in structured texts was obtained. The stability of the proposed method is investigated by perturbing the original data with a variational autoencoder. The obtained structure achieves a relatively high recognition accuracy (97%-99%) using decision trees and extreme gradient boosting machines.

Keywords
Text analysis, language concepts, pseudoinversion, clustering, feature extraction

CMIS-2021: The Fourth International Workshop on Computer Modeling and Intelligent Systems, April 27, 2021, Zaporizhzhia, Ukraine
EMAIL: yuri.krak@gmail.com (I. Krak); filonval63@gmail.com (V. Petrovych); kuznetsowwlad@gmail.com (V. Kuznetsov); eduard.em.km@gmail.com (E. Manziuk); alexander.barmak@gmail.com (O. Barmak); anatoly016@gmail.com (A. Kulias)
ORCID: 0000-0002-8043-0785 (I. Krak); 0000-0002-5982-8983 (V. Petrovych); 0000-0002-1068-769X (V. Kuznetsov); 0000-0002-7310-2126 (E. Manziuk); 0000-0003-0739-9678 (O. Barmak); 0000-0003-3715-1454 (A. Kulias)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction and problem statement

One of the important problems in research on specialized texts is comparing texts according to a certain criterion, namely the inclusion of a certain phrase or set of phrases in a given scientific text, in order to obtain results that contain the desired set of phrases [1-3]. This approach is currently used successfully for keyword search, but it has disadvantages: first of all, the need to scan the entire text and match each keyword against the given criterion, which produces search results that are not relevant to the query [2, 4]. Among the many known methods for studying textual information, the most popular are: frequency analysis of terms weighted by inverse document frequency (TF-IDF) [4], the linear support vector machine (LSVM) [5], and the bag-of-words model [6]. The advantages of TF-IDF and bag-of-words include good speed and wide applicability; the main advantage of the LSVM method is accuracy. The disadvantages are slow execution and the "rejection" of stop words, which inevitably leads to loss of meaning, as well as the lack of consideration of the position of each word in the text, which can make it difficult to find the content of a particular text.
Based on the analysis of the subject area and the selection of research issues, the following problems are formulated:
• form a sample of scientific texts on various topics;
• obtain a feature-vector representation of individual sentences in the text;
• analyze the affinity between the abstracts of texts and the source texts;
• investigate the representation of feature vectors using different methods of dimensionality reduction, clustering, and feature grouping;
• assess the information concentration of the content of different text samples based on the feature-vector representation of individual sentences;
• classify individual sentences of texts by subject;
• investigate the selection of hidden concepts by clustering methods;
• evaluate the stability of classification algorithms under transformations of feature dimensionality.

2. Related works

A large number of publications are devoted to the study of specialized and scientific texts. Among them, the following should be noted: in [7], the amplitude and phase characteristics of terms in a text were investigated to assess the concentration of term categories in the text; in [8, 9], an experimental information technology was developed for analyzing the frequency characteristics of semantic terms and word combinations in a text, and a Sub-Verb-Sub model was built. Despite the availability of these studies, a number of problems remain unresolved, in particular: the analysis of the frequencies of individual sentences in a text; the relationship between the meaning of an abstracted text (annotations, abstracts) and the full text; the statistical proximity [10, 11] of different authors writing on one topic; and ways to improve text-classification results in the context of analyzing hidden language concepts, in particular using methods of feature grouping and dimensionality reduction [12, 13, 14].

3. Implementation

3.1. Getting data

For the study, three samples of scientific texts were formed for modeling and recognition of communicative information on three topics: facial expressions, Ukrainian sign language, and texts in Ukrainian (about 1500 sentences) [15]. After detailed processing, these sets will be published as an open dataset. Using the proximity of the topics, it is possible to determine how the representation of these texts changes in the space of characteristic features, which makes it possible to identify the common and distinct language concepts in these texts. To solve this problem, the scientific texts were presented in the form of a matrix in which each row corresponds to a separate sentence, including the title, annotation, captions, conclusions, and other textual elements of the text.

3.2. Data analysis environment

To analyze these data, a module for intelligent analysis of scientific texts was created in Python using the scikit-learn data processing library in the Jupyter development environment [16, 17]. This module implements the following operations:
• parsing of text data;
• deletion of uninformative data and word endings;
• obtaining statistical characteristics of the text;
• transformation of text elements into a vector of characteristic features;
• transformation of the feature vector into a vector of reduced dimension;
• training and testing of methods for classifying text data by a set of features;
• visualization of the vector of characteristic features.
The module was tested on a hardware platform running Windows with an Intel Core i5-6600K processor and 8 GB of DDR4 RAM. A minimal sketch of the preprocessing part of such a pipeline is given below.
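To make the module's operations concrete, the following is a minimal sketch of such a preprocessing pipeline in Python with scikit-learn (the library the paper names). The stop-word and suffix fragments are illustrative placeholders, since the paper's own dictionaries of stop words and affixes are not published, and corpus.txt is an assumed input file with one sentence per line.

```python
# Minimal preprocessing sketch: parse sentences, drop stop words, strip
# affixes, and build the TF-IDF feature matrix. STOP_WORDS and SUFFIXES
# are illustrative fragments, not the paper's dictionaries.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"і", "та", "або", "що", "це"}            # illustrative fragment
SUFFIXES = ("ення", "ість", "ами", "ого", "ою", "ів")  # illustrative fragment

def strip_affix(token: str) -> str:
    """Cut the longest matching suffix from a token (crude affix stripping)."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def preprocess(sentence: str) -> str:
    tokens = re.findall(r"\w+", sentence.lower())
    kept = [strip_affix(t) for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)

with open("corpus.txt", encoding="utf-8") as fh:       # one sentence per line
    sentences = [preprocess(line) for line in fh if line.strip()]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # rows = sentences, columns = terms
print(X.shape)
```

The same vectorizer, fitted on the joint vocabulary of two corpora, also realizes the term cut-off described in Section 3.4: terms absent from one of the texts simply receive zero weight there.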
3.3. Data organisation

To solve this problem, we present the scientific texts in the form of a matrix in which each row corresponds to a separate sentence, including the title, abstract, captions, conclusions, and other elements that characterize the text. Each sentence is represented as a vector of features, where each feature corresponds to a particular term and the frequency of its appearance in the text. To compare several texts with different topics, we form a corpus of scientific texts that delimits the area of interest (for example, scientific articles that contain the required information). The texts (the corpus and the compared text) are analyzed by a parser, which eliminates all words that do not carry significant meaning (stop words) and cuts off affixes (suffixes and endings) of words. Frequency characteristics are formed on the basis of the resulting set of terms. Both texts (or corpora of texts) are then compared, and all terms, and correspondingly all features, that are not present in at least one of the texts are cut off.

3.4. Features extraction

Each row of the matrix was presented in the TF-IDF (term frequency-inverse document frequency) representation [8, 9, 14], where each element corresponds to a separate term and the frequency of its occurrence. The obtained frequency characteristics were compared, and all terms, and correspondingly all features, not present in at least one of the texts were cut off from the feature vector. This preserved the dimensionality of the data and took into account only the ratio of the number of scientific terms common to texts with different topics. In addition, when using some methods of dimensionality reduction (in particular, the Karhunen-Loève transform [9]), this reduced the number of zero elements and the degree of sparseness of the sample matrix.

3.5. Statistic relationship of abstracts in specialized texts

To estimate statistical affinity, in the first group of experiments it was proposed to use the following indicators: Cartesian distance, Pearson's correlation, and standard deviation. These values indirectly indicate the heterogeneity of sentences and therefore, at the preliminary stage, allow an indirect check of the validity of the sample [18, 19]. Tests comparing individual sentences from a single scientific text with sentences from its annotation showed that sentences with different meanings differed in the proposed parameters. The similarity and difference of sentences on the three parameters, using the 1st sentence of the annotation as an example, are given in Table 1.

Table 1
Statistic features of the studied sentences

        Abstract                        Article
 No.  Dist  Pearson   St. Dev    Dist  Pearson   St. Dev
 1    0     1         0.490698   11    0.425414  0.347540
 2    10    0.618363  0.529891   4     0.854699  0.325396
 3    11    0.409801  0.300327   12    0.501629  0.490698
 4    17    0.002661  0.300327   14    0.334416  0.418213
 5    12    0.541444  0.529891   11    0.425414  0.347540

The values in Table 1 show that for sentences with common concepts there is a pattern of similar statistical indicators in texts with similar content. However, for the 1st sentence of the annotation and the 2nd sentence of the main text, which were close in content and had the same Cartesian distance, these values did not allow an unambiguous conclusion about the similarity of their content. A minimal sketch of how these indicators can be computed is given below.
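The following is a minimal sketch of the three indicators in Table 1 for a pair of sentence vectors. The paper does not specify exactly which quantity the "St. Dev" column measures; taking the standard deviation of the element-wise difference of the two TF-IDF vectors is an assumption made here for illustration.

```python
# Affinity indicators between two TF-IDF sentence vectors (cf. Table 1).
import numpy as np
from scipy.stats import pearsonr

def sentence_affinity(a: np.ndarray, b: np.ndarray) -> dict:
    return {
        "dist": float(np.linalg.norm(a - b)),  # Cartesian (Euclidean) distance
        "pearson": float(pearsonr(a, b)[0]),   # Pearson correlation
        "st_dev": float(np.std(a - b)),        # assumed: st. dev of the difference
    }

# Example with two toy 5-term vectors:
a = np.array([0.2, 0.0, 0.5, 0.1, 0.3])
b = np.array([0.1, 0.4, 0.5, 0.0, 0.2])
print(sentence_affinity(a, b))
```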
3.6. Feature vector dimensionality reduction

To reduce the dimensionality of the original vector of characteristic features of individual sentences, the Karhunen-Loève transform was applied. Before decomposing the original data into eigenvectors, insignificant features were discarded, in particular features with a low frequency of occurrence in the text, in order to reduce the sparseness of the matrix and to exclude zero rows or columns in the covariance matrix. As a result of this experiment, feature sets were obtained for the sample and analyzed further. The eigenvalue spectrum for the first corpus of texts showed that 95% of the energy of the eigenvectors was contained within the first three vectors, which made it possible to visualize the spatial arrangement of the feature vectors of the data elements as the three-dimensional diagram shown in Fig. 1.

Figure 1. The studied sentences projected onto the first three eigenvectors of the Karhunen-Loève representation

As can be seen from the diagram, first, the projections onto the first three eigenvectors form separate clusters of points, which allows us to argue that the concepts in these sentences differ. Second, the distance between the elements of the test sample (circled) and the training sample (other points) allows a visual assessment of the proximity of the data samples to each other and can therefore be used to identify hidden text parameters.

3.7. Dimensionality reduction using grouping of features

Since the three-dimensional representation is not convenient for visualization, it was proposed to perform feature grouping for visualization as a flat image, additionally using t-distributed stochastic neighbor embedding (t-SNE) [20] to reduce the feature vector to two dimensions. For clarity, Fig. 2 shows the representation of the feature vectors for the second corpus of texts after t-SNE, feature grouping, and clustering of the points by K-means.

Figure 2. Representation of the feature vectors for the second corpus of texts by t-SNE, with grouping of features and clustering of points by the K-means method

Fig. 3 shows histograms of the distribution of feature values for the first and second corpora of texts, respectively.

Figure 3. Histograms of the distribution of feature values for text corpora 1 and 2

3.8. Informational density of sentences in text

With a relatively large number of samples in the corpus of texts and, accordingly, in the training sample, the sentences and their location in the feature space come to the fore [21]. Accordingly, texts that differ significantly in content will have a large number of sentence elements that do not fall into shared clusters, which indicates a difference in the subject matter of the texts and in their hidden concepts. To test this hypothesis, it was proposed to use a regression function, the random sample consensus (RANSAC) method [22, 23]. Fig. 4 shows the regressions for the three samples of scientific texts in the t-SNE feature representation. The dark gray points in Fig. 4 are those the method accepts as the most informative and uses to build the regression. Comparing the graphs, we can conclude that the texts have a common theme, because the locations of the points intersect. In addition, the dark gray points in Fig. 4 also indicate the areas of greatest informational density of the content of each corpus.

Figure 4. Sentences of corpora 1 to 3 (panels a to c) and their informational density

A combined sketch of the processing chain of Sections 3.6-3.8 is given below.
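The following combined sketch reproduces the processing chain of Sections 3.6-3.8 with scikit-learn, which the paper uses. PCA serves as the standard implementation of the Karhunen-Loève transform; the perplexity, cluster count, and other parameter values are illustrative assumptions rather than the paper's settings, and X is the TF-IDF matrix from the earlier sketch.

```python
# Karhunen-Loeve (PCA) reduction -> t-SNE embedding -> K-means clustering
# -> RANSAC regression marking the most informative (inlier) points.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.linear_model import RANSACRegressor

X_dense = X.toarray()                           # TF-IDF matrix from above

# 3.6: keep the components carrying ~95% of the eigenvector energy.
pca = PCA(n_components=0.95)
X_kl = pca.fit_transform(X_dense)

# 3.7: t-SNE reduction of the feature vectors to two dimensions.
X_2d = TSNE(n_components=2, perplexity=30, init="random").fit_transform(X_kl)

# 3.7: clustering of the embedded points by K-means.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)

# 3.8: RANSAC regression over the embedded points; the inlier mask marks
# the regions of highest informational density.
ransac = RANSACRegressor().fit(X_2d[:, [0]], X_2d[:, 1])
inliers = ransac.inlier_mask_
print("informative points:", inliers.sum(), "of", len(inliers))
```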
3.9. Sentences classification using different text corpora

A series of tests was performed on the t-SNE representations of the texts using intelligent data processing methods [24, 25]. The following methods were studied: random forest, decision tree, k-nearest neighbors, support vector machine, naive Bayes classifier, and a single-layer neural network. In this experiment, the Bayes classifier performed best on the obtained dataset. The learning results of the algorithms are given in Table 2.

Table 2
Precision, recall, f1-score and support for the Bayes classifier on the given datasets

 Dataset        Precision  Recall  F1-score  Support
 Emotion        0.94       0.87    0.90      138
 Gesture        0.82       0.94    0.87      163
 NLP            0.95       0.87    0.91      159
 accuracy       0.90       0.89    0.89      460
 macro avg      0.90       0.89    0.89      460
 weighted avg   0.90       0.89    0.89      460

The experiment showed that the obtained set of TF-IDF features allows texts to be classified with high reliability (87%). The presence of type I and type II recognition errors (see Fig. 4) is explained by the strong affinity of the texts and the closeness of the authors' styles in scientific texts, which involves the use of common terms; this can also be seen from the similarity of the representations in Fig. 3.

3.10. Classification of texts with the help of data clustering

To study the possibility of improving the classification results, the following approach is proposed: since the studied texts include hidden concepts (subsets), i.e., categories of sentences that are close in content, selecting such hidden concepts allows the text classification task to be posed correctly, namely, classifying sentences according to their similarity to a particular concept of the text [26]. For this purpose, it was proposed to use one of the implementations of the K-means method (mini-batch K-means), which is best suited to the experimental data. The obtained data clusters were taken as data labels, which made it possible to move from classification by text subject to classification by cluster [27, 28], taking into account the internal structure of the data, namely the hidden concepts of the language. The following classification methods were chosen: support vector machine with linear and nonlinear hypotheses, single-layer neural network, Bayes classifier, decision trees, and related ensemble methods, including adaptive boosting (AdaBoost) and extreme gradient boosting. The tests showed that, on the obtained dataset, the highest accuracy of recognizing hidden concepts is achieved by the support vector machine with a linear hypothesis (97.4%), the single-layer neural network (97.4%), random forest (99.1%), and the decision tree with extreme gradient boosting (99.1%). A minimal sketch of this cluster-label scheme is given below.
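The following is a minimal sketch of the cluster-label scheme of Section 3.10: mini-batch K-means produces the hidden-concept labels, and classifiers are then trained against them. The classifier set is abbreviated, scikit-learn's GradientBoostingClassifier stands in for the extreme gradient boosting machine, and the cluster count and split ratio are illustrative; X_2d is the t-SNE representation from the previous sketch.

```python
# Hidden concepts: clusters of close-in-content sentences become the labels,
# and the classifiers are then trained to recognize those clusters.
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

concepts = MiniBatchKMeans(n_clusters=5, n_init=10).fit_predict(X_2d)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_2d, concepts, test_size=0.3, random_state=0)

for name, clf in [("linear SVM", SVC(kernel="linear")),
                  ("random forest", RandomForestClassifier()),
                  ("gradient boosting", GradientBoostingClassifier())]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```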
3.11. Algorithm stability testing using perturbations of the feature matrix

To verify the algorithm, an additional experiment was performed to study the stability of class assignment when perturbations are introduced into the vector of characteristic features. The perturbation was performed by converting the original two-dimensional t-SNE representation into a two-dimensional latent feature space in which each data point of the source space corresponds to a location in the latent space. A variational autoencoder (VAE) was used for this purpose [28]. This method minimizes the mean squared error between the input dataset (in the t-SNE representation) and the output dataset (in the latent feature space) and generates additional data elements that have the same distribution as the training sample (Fig. 5). In addition, to achieve a highly accurate data representation, the hidden layer of the autoencoder has a much higher dimension than in typical applications. Using the dimensional transformation obtained by the autoencoder, it was noted that the representation of the features in the latent space (cf. Fig. 1) admits a linear hypothesis for classifying the classes against each other (one class versus another). Fig. 6 shows the hypotheses and data labels for the feature representation in the latent space. Comparing Fig. 5 and Fig. 6, it can be seen that the separation band has decreased. Accordingly, this reduced the recognition accuracy of the classification methods by 12%, to 85%, for the support vector machine and the single-layer neural network, and by 2%, to 97.1%, for random forest and the decision tree with extreme gradient boosting. Despite the decrease in recognition accuracy, this experiment is useful for studying the effect of perturbation on the recognition accuracy of different classification methods.

3.12. Determining the nature of perturbations in a sample matrix using pseudoinversion

Perturbation of the input data affects the convergence of the classifier's learning algorithm. A measure of the perturbation in the original matrix and in the matrix transformed into the latent feature space is the value of entropy. Entropy can be estimated indirectly from the decay of the energy of the eigenvectors of the sample matrix. The variational autoencoder minimizes the mean squared error (MSE) between data elements, and the MSE is also an expression of the variance of the data. Thus, the change in variance can be associated with the character of the decay of the eigenvalues of the source and transformed matrices. Knowing the magnitude of the variance change, we can estimate the number of vectors responsible for the informative part of the data under study. The advantage of the latent representation is the reduction of the distance between data classes (and of the separation band), which ultimately affects the number of iterations of the optimization algorithm and the convergence of the classification algorithm.

Figure 5. (a) The t-SNE representation and (b) its representation in the feature space of higher dimension

Figure 6. Visualization of hypotheses for different classifiers in the latent feature space

A compact sketch of the perturbation experiment is given below.
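The following is a compact sketch of the perturbation experiment of Section 3.11. The paper does not name the framework used for the variational autoencoder, so PyTorch is assumed here; the deliberately wide hidden layer follows the description in the text, while the layer size, loss weighting, learning rate, and epoch count are illustrative. X_2d is the t-SNE representation from the earlier sketches.

```python
# A 2-D -> 2-D variational autoencoder with a deliberately wide hidden layer,
# used to produce the perturbed (latent) representation of the t-SNE points.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, hidden=256):                 # wide hidden layer
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 2)              # 2-D latent space
        self.logvar = nn.Linear(hidden, 2)
        self.dec = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

x = torch.tensor(X_2d, dtype=torch.float32)         # t-SNE points from above
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    recon, mu, logvar = model(x)
    mse = ((recon - x) ** 2).mean()                 # reconstruction error (MSE)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = mse + 1e-3 * kld
    opt.zero_grad()
    loss.backward()
    opt.step()

# Latent (perturbed) representation on which the classifiers are retrained:
with torch.no_grad():
    Z = model.mu(model.enc(x)).numpy()
```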
4. Conclusion

The presented work proposes an approach to text analysis using text mining methods, including the TF-IDF feature extraction method, feature dimensionality reduction based on the Karhunen-Loève transform and the t-SNE representation of the data in two-dimensional space, and classification of the acquired data using decision tree models. The paper also discusses how the stability of the methods is affected by perturbation of the data using a variational autoencoder. To study the proposed method, an experimental dataset was obtained, consisting of three text corpora on three different topics: facial expression recognition, gesture recognition, and text mining. These corpora concern related research topics and thus make it possible to study the proposed methods in conditions where the datasets have some features in common. The experimental results of the algorithms are shown in the tables and figures; the most important results are depicted in Fig. 4, Fig. 5, and Table 2. The given data are separable and have alignment axes in the feature vector in the reduced-dimensionality representation; moreover, the variational autoencoder copes well with data noise and reduces unnecessary variance, but decreases the separability of the data. Feature vectors of individual sentences formed from the occurrence of individual words are very effective for classifying texts by subject, but the presence of common themes in scientific texts and of typical sentences reduces recognition efficiency. The use of dimensionality reduction and grouping methods indicated the presence of concentrations of data points that could be used to assess the author's style via typical sentence constructions. Two results should be pointed out separately: the classification of hidden concepts (clusters of data points) using clustering methods, and the introduction of perturbations into the original data. These results make it possible to determine further directions of research, namely the study of samples of scientific texts with other topics and authorial styles and the use of other methods of classifying structured texts, which will be the goal of our further investigation.

5. References

[1] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille, Keyword extraction: Issues and methods, Natural Language Engineering 26(3) (2020) 259-291. doi:10.1017/S1351324919000457.
[2] J. Ventura, J. Silva, New techniques for relevant word ranking and extraction, in: J. Neves, M.F. Santos, J.M. Machado (Eds.), Progress in Artificial Intelligence, EPIA 2007, Lecture Notes in Computer Science, vol. 4874, Springer, 2007, pp. 691-702. doi:10.1007/978-3-540-77002-2.
[3] M. Ortuno, P. Carpena, P. Bernaola, E. Munoz, A.M. Somoza, Keyword detection in natural languages and DNA, Europhysics Letters 57(5) (2002) 759-764.
[4] B. Das, S. Chakraborty, An improved text sentiment classification model using TF-IDF and next word negation, 2018. arXiv preprint arXiv:1806.06.
[5] M. Labbé, L.I. Martínez-Merino, A.M. Rodríguez-Chía, Mixed integer linear programming for feature selection in support vector machine, Discrete Applied Mathematics 261 (2019) 276-304. doi:10.1016/j.dam.2018.10.025.
[6] B. Heap, M. Bain, W. Wobcke, A. Krzywicki, S. Schmeidl, Word vector enrichment of low frequency words in the bag-of-words model for short text multi-class classification problems, 2017. arXiv:1709.05778.
[7] Y. Krak, O. Barmak, O. Mazurets, The practical implementation of the information technology for automated definition of semantic terms sets in the content of educational material, CEUR Workshop Proceedings 2139 (2018) 245-254. doi:10.15407/pp2018.02.245.
[8] S. Robertson, Understanding inverse document frequency: On theoretical arguments for IDF, Journal of Documentation 60(5) (2004) 503-520.
[9] A. Aizawa, An information-theoretic perspective of tf-idf measures, Information Processing and Management 39(1) (2003) 45-65.
[10] M. Farouk, Measuring sentences similarity: A survey, Indian Journal of Science and Technology 12(25) (2019) 1-11. doi:10.17485/ijst/2019/v12i25/143977.
[11] W.H. Gomaa, A.A. Fahmy, A survey of text similarity approaches, International Journal of Computer Applications 68(13) (2013) 13-18.
[12] A.V. Barmak, Y.V. Krak, E.A. Manziuk, V.S. Kasianiuk, Information technology of separating hyperplanes synthesis for linear classifiers, Journal of Automation and Information Sciences 51(5) (2019) 54-64. doi:10.1615/JAutomatInfScien.v51.i5.50.
[13] Iu.V. Krak, G.I. Kudin, A.I. Kulyas, Multidimensional scaling by means of pseudoinverse operations, Cybernetics and Systems Analysis 55(1) (2019) 22-29. doi:10.1007/s10559-019-00108-9.
[14] E.L. Shimomoto, L.S. Souza, B.B. Gatto, K. Fukui, Text classification based on word subspace with term frequency, 2018. arXiv:1806.03125v1.
[15] Iu.V. Krak, O.V. Barmak, S.O. Romanyshyn, The method of generalized grammar structure for text to gesture computer-aided translation, Cybernetics and Systems Analysis 50(1) (2014) 116-123. doi:10.1007/s10559-014-9598-4.
[16] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O'Reilly Media, 2009.
[17] J. Perkins, Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, 2010.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
[19] A. Globerson, G. Chechik, F. Pereira, N. Tishby, Euclidean embedding of co-occurrence data, Journal of Machine Learning Research 8 (2007) 2265-2295.
[20] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579-2605.
[21] T. Mikolov, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 2013. arXiv:1310.4546.
[22] M.A. Fischler, R.C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24(6) (1981) 381-395. doi:10.1145/358669.358692.
[23] A. Hast, A. Nysjö, A. Marchetti, Optimal RANSAC: Towards a repeatable algorithm for finding the optimal set, Journal of WSCG 21(1) (2013) 21-30.
[24] M.W. Berry (Ed.), Survey of Text Mining I: Clustering, Classification, and Retrieval, Springer, 2004. URL: https://www.springer.com/gp/book/9780387955636.
[25] H.A. Do Prado, E. Ferneda (Eds.), Emerging Technologies of Text Mining: Techniques and Applications, IGI Global, 2007.
[26] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006) 504-507. doi:10.1126/science.1127647.
[27] E.A. Manziuk, A.V. Barmak, Y.V. Krak, V.S. Kasianiuk, Definition of information core for documents classification, Journal of Automation and Information Sciences 50(4) (2018) 25-34.
[28] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015. arXiv:1502.03167.