An Approach to Processing News Text Messages Based
                   on Markeme Analysis

                                     Alexander Sychev 1
                        1 Voronezh State University, Voronezh, Russia


                                      sav@sc.vsu.ru


       Abstract. The complexity of automatically filtering messages retrieved from
       online media platforms and social networks is discussed. A review of
       approaches to document representation, feature weighting schemes and feature
       selection techniques is provided. An approach to text message processing based
       on markeme analysis is proposed. Markeme identification is based on
       calculating the Index of Textual Markedness (InTeM). Markemes are the words
       most important for a particular text: they occur with a frequency higher than
       that of other words of the same length. Preliminary results of an exploratory
       study of the proposed approach, applied to the classification and clustering
       of news messages, are presented and discussed.

       Keywords: markeme, index of textual markedness, word form, term, message,
       skewness coefficient, classification, clustering, feature weighting, feature
       selection.


1      Introduction

Automatic message processing is usually considered a complex problem because of the
following features of the content published on online media platforms and in social
networks:

─ “fuzzy” subject matter of message texts and comments;
─ small length of text in a message or comment;
─ heterogeneity of published texts in terms of stylistics, the level of literacy of the
  authors, etc.;
─ a large volume of published messages and comments per unit of time.

   News messages and user comments are short texts. Short messages are usually
understood as text documents with an average length of less than 2000-3000
characters [1].
   The problem of topic classification (rubrication) of short texts in Russian has been
considered in many papers, for example in [2] and [3]. The paper [2] presents the
results of research on the classification of short text documents and analyzes
classification methods based on the distribution of lexical descriptors of natural
language. It also describes a method for assessing the informational significance of
lexical units in natural language texts. In [3], quality estimates for several methods
of thematic classification (rubrication) of news messages, which use various numerical
estimates of informational significance as features, were obtained experimentally on
the “20 Newsgroups” data set.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

   In general, the text classification pipeline includes the following steps: text
feature selection (extraction), dimensionality reduction, application of known
classification techniques or development of new ones, and evaluation of the
classification model [4].
   Feature selection, which reduces dimensionality, removes irrelevant data and
increases learning accuracy, is essential for tackling problems caused by the high
dimensionality of data, such as the curse of dimensionality and model overfitting [5].
   There are different techniques for feature selection, e.g. TF or TF-IDF [6],
Word2Vec [7], GloVe [8], FastText [9], Contextualized Word Representations [10]
and their modifications. Each of these techniques belongs to one of two general
feature selection approaches: weighted words and word embedding [4].
   Word embedding techniques require a huge corpus of text data for training [4].
In addition, this approach cannot handle words that are missing from the training
data.
   Weighted words techniques use a simplified representation of the text, usually in
the form of a bag-of-words (BOW) vector model, which allows the use of fairly simple
and fast algorithms for processing text documents and messages. In the BOW model,
the text is represented as a set of words, usually without taking into account the
grammar or the word order, but using information about the frequency of words in the
text. In document classification, the frequency of occurrence of a word is used as the
decisive feature for training the classifier. It is precisely the word frequency that
is used in well-known methods for estimating the informational significance of words
in a text, for example in methods based on the TF-IDF metric.
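
   The BOW representation and a TF-IDF weighting can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: tokenisation is a plain
whitespace split, the idf variant is log(N / df), which is one of several common
definitions, and the function names and sample texts are not taken from the paper.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace and count words (word order is discarded)."""
    return Counter(text.lower().split())

def tf_idf(docs):
    """Return one {word: tf * idf} dict per document, with idf = log(N / df)."""
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    # Iterating over a Counter yields each distinct word once per document,
    # so this counts the number of documents containing the word.
    df = Counter(word for bag in bags for word in bag)
    return [
        {w: tf * math.log(n_docs / df[w]) for w, tf in bag.items()}
        for bag in bags
    ]

# Illustrative texts only.
docs = ["the team won the match", "the patient left the clinic"]
weights = tf_idf(docs)
```

   A word that occurs in every document, such as "the" here, receives a zero weight,
while a word confined to one document keeps a positive weight. Note that the idf
factor needs the whole collection, which is not the case for the frequency of a word
within a single text.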
   Some feature selection techniques may be inefficient for specific applications,
depending on the goal and the data set of the application. For example, GloVe does not
perform as well as TF-IDF when used for short text messages [4].
   Several widely used unsupervised and supervised term weighting methods were
evaluated in [11] on benchmark data collections in combination with the SVM and k-NN
algorithms. As stated in [11], a term weight is obtained by multiplying two factors, a
term frequency factor (tf) and a collection frequency factor (idf), in order to
improve both recall and precision. Several collection frequency factors were studied
in the experiments: the multiplier 1, the conventional inverse collection frequency
factor (idf), a probabilistic inverse collection frequency factor (idf-prob), a χ²
factor, an information gain (ig) factor, a gain ratio (gr) factor, an odds ratio (OR)
factor, and the novel relevance frequency (rf) factor proposed by the authors of [11].
In [12], five term scoring methods for automatic term extraction were evaluated on
different types of text collections to investigate the influence of three factors on
the success of a term scoring method: collection size, background collection and the
importance of multi-word terms. One important conclusion from [12] is that none of
the term scoring methods performs well on collections smaller than 1,000 words,
because the frequency criterion prevails in all of them.
   According to [13], dimensionality reduction techniques can be organized into three
groups: feature selection (FS), feature projection, and instance selection. While the
first two types of methods aim to reduce the dimensionality of the feature space, the
third aims to reduce the number of instances used for training.
   In FS methods, the resulting feature set is a subset of the initial feature set.
Feature projection results in a new group of features mapped from the original
features.
   FS methods are usually classified into three categories: filter, wrapper, and
embedded [14]. Filter methods are executed independently of the classifier learning
activity. Wrapper methods use the classifier performance to assess the relevance of
features or to search for the most relevant subset of features. Embedded methods
include FS as part of the training process.
   An important advantage of feature selection is that the resulting feature set is a
subset of the original features, so each resulting feature preserves the meaning of
the original feature.
   In [15], the Index of Textual Markedness of a word form (InTeM) is used to assess
the degree of subjective weight of a word form in the text. The authors of [15] assume
that each word form in the text has two parameters: frequency and length. In their
opinion, the frequency of a word form is a complex subjective-objective indicator,
while its length is a simple objective linguistic one. Hence, the subjective (i.e.
meaningful) weight of a word form can be obtained by subtracting the simple objective
factor (the weight of the word form according to its length) from the complex
subjective-objective factor (the weight of the word form according to its frequency).
The resulting value, the InTeM, indicates the degree of subjective (textual) weight of
a given word form for a given text. Thus, the following indicator is proposed for
assessing the informational significance of the word form $t_i$ in a text message $m$:

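\[
ITM_i = WF_i - WL_i ,
\]

   where

\[
WF_i = \frac{\sum_{j=1}^{N_t} f_j - \sum_{j=1}^{i} f_j}{\sum_{j=1}^{N_t} f_j}, \quad i \le N_t ,
\]

\[
WL_i = \frac{\sum_{j=1}^{L_m} f_j^{(len)} - \sum_{j=1}^{l} f_j^{(len)}}{\sum_{j=1}^{L_m} f_j^{(len)}}, \quad i \le N_t .
\]
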
   The word forms $t_i$ of the text are ranked in descending order of their frequency
$f_i$ in the general list of all word forms of the text. The frequency $f_j^{(len)}$
is the total number of occurrences of all word forms of length $j$ in the text of
message $m$. $N_t$ is the total number of different word forms in the text message
$m$, and $L_m$ is the maximum word form length in $m$. The length of the word form
$t_i$ is denoted by $l$.
   Word forms with the maximum values of $ITM_i$ are called markemes; they form the
set of the word forms most significant for the author of the text.
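
   The InTeM calculation can be sketched as follows. The sketch reads the indicator
as the difference between a frequency weight and a length weight, each being the share
of the total frequency that lies beyond the rank of the word form or beyond its
length. It assumes that the message is already tokenised into word forms, that length
is measured in characters, and that ties in frequency are broken alphabetically (the
text does not specify a tie rule). The function name `intem` and the sample tokens are
illustrative only.

```python
from collections import Counter

def intem(word_forms):
    """Return {word_form: ITM} for one tokenised message."""
    freq = Counter(word_forms)
    total = sum(freq.values())

    # Rank word forms by descending frequency (alphabetical tie-break: an assumption).
    ranked = sorted(freq, key=lambda w: (-freq[w], w))

    # len_freq[j]: number of occurrences of all word forms of length j.
    len_freq = Counter()
    for w, f in freq.items():
        len_freq[len(w)] += f

    # cum_len[l]: sum of len_freq[j] for j = 1..l.
    cum_len, running = {}, 0
    for length in range(1, max(len_freq) + 1):
        running += len_freq[length]
        cum_len[length] = running

    scores, cum_freq = {}, 0
    for w in ranked:
        cum_freq += freq[w]                              # sum of f_j for j = 1..i
        wf = (total - cum_freq) / total                  # weight by frequency
        wl = (total - cum_len[len(w)]) / total           # weight by length
        scores[w] = wf - wl
    return scores

# Illustrative tokens only.
tokens = ["of", "of", "of", "in", "championship", "championship"]
scores = intem(tokens)
ranking = sorted(scores, key=scores.get, reverse=True)
```

   In this toy message the long word "championship" receives a positive index because
it is frequent for its length, whereas the short word "in" receives a negative one.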
   In this paper, the possibility of using the markeme model of texts for the standard
problems of classification, clustering and thematic categorization is considered on
a collection of news text messages, and preliminary results of the exploratory study
are presented.


2      Dataset

For the purposes of the study, a set M containing 760 text messages was formed. The
set M included messages on four topics, in approximately equal proportions, labeled
manually by experts. The average message size was 145.6 words. The dictionary D, built
from the lemmas extracted from the message texts, contained about 12 thousand unique
terms. Of these, the 600 terms with a total frequency of occurrence over the entire
set M of at least 28 were selected for the study. The maximum total frequency of
occurrence in the set M for a term from the dictionary was 1281.
   Figure 1 shows the frequency distribution over term length for the terms of
dictionary D in the set M. Long terms (more than 10-12 characters) occur in messages
with a low frequency, which reflects objective linguistic reality. Within the markeme
approach, if the frequency of occurrence $f_i$ of a specific term of length $l$ in a
text $T$ exceeds the frequency $f_l^{(len)}$ typical for terms of length $l$, this
gives grounds for including the term in the set of markemes $MK_T$ of the text $T$.
Note that the calculation of $ITM_i$ in the markeme approach does not take into
account the form of the frequency distribution over word form length.
   This kind of distribution can also be calculated for each text message individually.
   In the study, all terms with an $ITM_i$ value greater than zero were identified as
markemes.
   Table 1 shows the number of markemes identified in text messages, averaged by
topic category. $N_{MK1}$ is the number of markemes identified using the global
(whole message collection) frequency distribution of terms over length, and $N_{MK2}$
is the number of markemes identified using the local (within an individual message)
frequency distribution of terms over length.
   The markemes most useful for further study are expected to be those with a
relatively high frequency $f_M$ in messages and a relatively large value of the
distribution asymmetry index (skewness) $Sk$ over the entire set of messages M.

              Table 1. Average number of terms and markemes in a message

             Average number of markemes   Average number of     Average sum of term
Topic        <N_MK1>        <N_MK2>       different terms       frequencies
                                          in a message          in a message
Medicine        6.3            3.3              39.2                  54.2
Accidents       8.8            5.1              54.0                  74.1
Politics       10.8            6.2              58.6                  90.0
Sports          8.1            4.2              50.3                  75.3
Mean:           8.6            5.1              51.0                  74.2

   In this study, two indicators of asymmetry were considered for markemes:
─ $Sk_1$ is the skewness of the markeme distribution across the four topics in the
  message set M;
─ $Sk_2$ is the skewness of the markeme distribution over the entire set of messages M
  as a whole.
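
   The text does not name the skewness estimator. One estimator consistent with the
limit value of 2 quoted for the topic skewness in this paper is the adjusted
Fisher-Pearson sample skewness, which equals the square root of the number of values
(that is, 2 for four topics) when all occurrences fall into a single topic. The sketch
below assumes this estimator; the function name and the frequency vectors are
illustrative.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (needs >= 3 values that are not all equal)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

# Hypothetical per-topic frequencies of one markeme across four topics.
sk_one_topic = sample_skewness([12, 0, 0, 0])   # all occurrences in one topic
sk_spread = sample_skewness([3, 3, 4, 2])       # spread over all topics
```

   The same function applied to the 760 per-message frequencies of a markeme would
give the second indicator, under the same assumption about the estimator.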


                Fig.1. Frequency distribution of dictionary terms by length.

   Figure 2 shows the scatter matrix for the 380 markemes selected from the message
set M. It is built from the values of $Sk_1$, $Sk_2$ and the total frequency of
occurrence $f_{M,i}$ (over the entire set M) calculated for each markeme $Mk_i$.
   The $Sk_2 \times Sk_2$ cell of the scatter matrix shows the histogram of the $Sk_2$
values. The highest distribution density is observed near $Sk_2 = 10$, and the
$f \times Sk_2$ cell confirms this observation. For $Sk_1$, the distribution density is
concentrated in the vicinity of the value 2.


                        Fig.2. Scatter matrix for the markemes set.

   One can expect that a good set of markemes will be obtained from candidates for
which:
─ the frequency of occurrence $f_{M,i}$ noticeably exceeds the minimum values;
─ the topic parameter $Sk_1$ tends to its limit value of 2 (good topic specificity), or
  the value of $Sk_2$ is in the vicinity of 10 (a good indicator of the specificity of
  the markeme in M).


3      Experiment

For the experiment, the filter strategy for feature selection was chosen to reduce the
dimensionality of the feature space. Feature filtering was performed using the
parameters $f_{M,i}$, $Sk_1$ and $Sk_2$.
   From the total set of markemes identified in M, the markemes $Mk_i$ satisfying two
conditions, $f_{M,i} \ge 6$ and $Sk_1 \ge 1.9$, were selected. Table 2 lists the 58
markemes selected from the set M in this way.

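     Table 2. Selected markemes, grouped by topic (English glosses in parentheses).


                                      Message topics
Medicine                        Accidents                       Politics                        Sports
заболевание (disease)           авария (crash)                  администрация (administration)  болельщик (fan)
здравоохранение (healthcare)    водитель (driver)               власть (authorities)            ворота (goal)
клинический (clinical)          иномарка (foreign-made car)     возглавлять (to head)           завоевать (to win)
лечение (treatment)             легковушка (passenger car)      глава (head)                    команда (team)
неделя (week)                   личность (identity)             государственный (state)         первенство (championship)
пациент (patient)               мужчина (man)                   депутат (legislator)            сборная (national team)
поликлиника (polyclinic)        очевидец (eyewitness)           должность (post)                сезон (season)
                                пассажир (passenger)            заместитель (deputy)            соперник (opponent)
                                погибнуть (to die)              нацпроект (national project)    соревнование (competition)
                                полицейский (police officer)    начальник (chief)               спорт (sport)
                                случиться (to happen)           образование (education)         спортсмен (athlete)
                                столкновение (collision)        общественный (public)           спортсменка (sportswoman)
                                убийство (murder)               обязанность (duty)              турнир (tournament)
                                экспертиза (examination)        председатель (chairman)         факел (torch)
                                                                президент (president)           футболист (footballer)
                                                                реализация (implementation)     чемпионат (championship)
                                                                руководитель (leader)
                                                                сельский (rural)
                                                                социальный (social)
                                                                территория (territory)
                                                                чиновник (official)
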
   Computational experiments on the classification and clustering of the M message
set were carried out using the set of selected markemes.
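
   The threshold filter used to obtain the markeme list in Table 2 can be sketched as
below. The sketch assumes inclusive thresholds and a precomputed mapping from each
candidate to its total frequency and topic skewness. The function name, the sample
terms and their values are hypothetical and serve only as an illustration.

```python
def select_markemes(candidates, min_freq=6, min_sk1=1.9):
    """Keep candidates whose frequency and topic skewness reach both thresholds."""
    return sorted(
        term
        for term, (freq, sk1) in candidates.items()
        if freq >= min_freq and sk1 >= min_sk1
    )

# Hypothetical (total frequency, topic skewness) pairs, not values from the study.
candidates = {
    "tournament": (14, 2.0),   # frequent and topic-specific: kept
    "week": (9, 1.93),         # kept
    "year": (120, 0.4),        # frequent but spread over topics: dropped
    "torch": (4, 2.0),         # topic-specific but too rare: dropped
}
selected = select_markemes(candidates)
```

   Both thresholds act as tuning parameters: relaxing either one enlarges the feature
set.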


3.1    Messages Classification

For message classification, a naive Bayes classifier was used, and the classification
quality was assessed by 10-fold cross-validation. The estimates obtained are given in
Table 3, where rows correspond to the classifier predictions for each topic and
columns correspond to the true topics in the test data. The Accuracy value was 82%.
Accuracy was calculated as the ratio (number of correct classifier predictions) /
(total number of test examples). For comparison, Table 4 provides the estimates for
the same classification, except that all the terms (600 units) from the dictionary D
were used as attributes of the message frequency vector. In that case the Accuracy
value was 89%.
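
   The evaluation procedure can be sketched in plain Python. The paper does not state
which naive Bayes variant or which fold assignment was used; the sketch below assumes
a multinomial model with add-one (Laplace) smoothing and unshuffled interleaved folds.
All names and the toy frequency vectors are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(vectors, labels, alpha=1.0):
    """Fit multinomial naive Bayes on term-frequency vectors (lists of counts)."""
    n_features = len(vectors[0])
    term_counts = defaultdict(lambda: [0] * n_features)
    for vec, label in zip(vectors, labels):
        for j, f in enumerate(vec):
            term_counts[label][j] += f
    model = {}
    for label, n_docs in Counter(labels).items():
        counts = term_counts[label]
        denom = sum(counts) + alpha * n_features
        log_prior = math.log(n_docs / len(labels))
        log_like = [math.log((c + alpha) / denom) for c in counts]
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, vec):
    """Return the label with the highest posterior log-score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(f * ll for f, ll in zip(vec, log_like))
    return max(model, key=score)

def cross_val_accuracy(vectors, labels, k=10):
    """k-fold cross-validation; fold i holds out items i, i + k, i + 2k, ..."""
    correct = 0
    for fold in range(k):
        train_x = [v for i, v in enumerate(vectors) if i % k != fold]
        train_y = [y for i, y in enumerate(labels) if i % k != fold]
        model = train_nb(train_x, train_y)
        correct += sum(
            1 for i, v in enumerate(vectors)
            if i % k == fold and predict_nb(model, v) == labels[i]
        )
    return correct / len(vectors)

# Toy frequency vectors over two features, for illustration only.
X = [[3, 0], [0, 3], [2, 0], [0, 2], [4, 1], [1, 4]]
y = ["sports", "politics", "sports", "politics", "sports", "politics"]
acc = cross_val_accuracy(X, y, k=3)
```

   With real data, each vector would hold the frequencies of the selected markemes
(or of all dictionary terms) in one message.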

Table 3. Performance evaluation for message classification based on the frequency vector with
                                    markeme attributes.

True/Prediction             True           True          True         True        Class
                           topic 1        topic 2       topic 3      topic 4    precision
Prediction topic 1           153            46            47            7         60.5%
Prediction topic 2             4           153             7            0         93.3%
Prediction topic 3             3             8           131            6         88.5%
Prediction topic 4             0             1             8          186         95.4%
Class recall                95.6%         73.6%         67.9%        93.5%
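
   The Accuracy, class precision and class recall values can be recomputed from the
confusion matrix. The sketch below copies the counts from Table 3 (rows are
predictions, columns are true topics); the function names are illustrative.

```python
# Rows: predicted topics 1-4; columns: true topics 1-4 (counts from Table 3).
confusion = [
    [153, 46, 47, 7],
    [4, 153, 7, 0],
    [3, 8, 131, 6],
    [0, 1, 8, 186],
]

def accuracy(m):
    """Share of correct predictions: diagonal sum over the total count."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

def class_precision(m, k):
    """Correct predictions of topic k over all predictions of topic k (row sum)."""
    return m[k][k] / sum(m[k])

def class_recall(m, k):
    """Correct predictions of topic k over all true examples of topic k (column sum)."""
    return m[k][k] / sum(row[k] for row in m)

acc = accuracy(confusion)              # 623 / 760
prec_1 = class_precision(confusion, 0)  # 153 / 253
rec_1 = class_recall(confusion, 0)      # 153 / 160
```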

It is noteworthy that although, overall, the markeme list representation of messages
lowered the Accuracy value by about 7 percentage points, recall or precision improved
for some topics. For example, the recall for topic 1 ("Medicine") improved
significantly (with a significant decrease in precision), while for topics 2 and 3
("Accidents", "Politics") precision improved (and recall decreased). The decrease in
classification Accuracy (Table 3) is due to the lower class precision for topic 1 and
the lower class recall for topics 2 and 3. This effect can be regarded as the price of
a substantial reduction in the dimensionality of the feature space: the number of
features was reduced more than tenfold, from 600 terms to 58 markemes. As Table 2
shows, the list of selected markeme features is too short to provide a high level of
class accuracy. Perhaps a more flexible scheme for selecting the values of the
parameters $f_{M,i}$, $Sk_1$ and $Sk_2$ could improve the situation.

Table 4. Performance evaluation for message classification based on the frequency vector with
                          term attributes from the dictionary D.

True/Prediction                True        True        True          True          Class
                              topic 1     topic 2     topic 3       topic 4      precision
Prediction topic 1              130          8          16             2           83.3%
Prediction topic 2                9        190          10             0           90.9%
Prediction topic 3               20         10         164             5           82.4%
Prediction topic 4                1          0           3           192           98.0%
Class recall                   81.3%      91.4%       85.0%         96.5%


  Table 5. Performance evaluation for the message classification based on a frequency vector
                     with term attributes from D, filtered by $Sk_1 \ge 1.9$.

True/Prediction                    True        True       True          True        Class
                                  topic 1     topic 2    topic 3       topic 4    precision
Prediction topic 1                  139          29         32            1         69.2%
Prediction topic 2                   10         170         15            0         87.2%
Prediction topic 3                   10           8        139            5         85.8%
Prediction topic 4                    1           1          7          193         95.5%
Class recall                       86.9%       81.7%       72%          97%

   For comparison, messages were also classified using a frequency vector whose
attribute terms were selected by the filter $Sk_1 \ge 1.9$ (the $Sk_1$ factor can, to
some extent, be considered an analogue of the IDF factor in TF-IDF algorithms). In
effect, the Boolean conversion of the $Sk_1$ factor was used as the collection (topic)
frequency factor (IDF). The total number of terms selected from D was 152. The results
of the experiment are given in Table 5. The Accuracy value was 84.3%.


3.2    Messages Clustering
K-means clustering was carried out with the set of markemes given in Table 2 as the
attributes of the message vectors. Table 6 summarizes the results of this experiment.
As the table shows, the markeme set makes it possible to accurately identify the
thematic core of the set of messages for each topic (high precision), but most of the
messages in each topic subset remain thematically vague (low recall). The clustering
recall can be increased by relaxing the constraints on the parameters $f_{M,i}$ and
$Sk_1$ when selecting markemes. It is worth noting that the clustering result is quite
sensitive to the choice of the initial conditions for the clustering algorithm.
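
   A minimal sketch of K-means on frequency vectors is given below. The paper does not
specify the implementation, the distance or the initialisation; the sketch assumes
Lloyd's algorithm with squared Euclidean distance and initial centres supplied by the
caller, which makes the dependence on the initial conditions explicit. The function
name and the toy vectors are illustrative.

```python
def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm; `centroids` are the caller-supplied initial centres."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(n_iter):
        # Assign every point to the nearest centre (squared Euclidean distance).
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Move every centre to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy markeme-frequency vectors, for illustration only.
points = [[3, 0], [2, 0], [0, 3], [0, 2]]
labels, centres = kmeans(points, centroids=[[3, 0], [0, 3]])
```

   Running the function with different initial centres is a simple way to check how
stable the resulting clusters are.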

  Table 6. Performance evaluation for message clustering based on the frequency vector with
                                     markeme attributes.

      Topic              Number of messages                   Recall             Precision
                         in topic subset of M
        1                         160                         36.3%               97.5%
        2                         210                         32.9%               99.5%
        3                         190                         32.6%              100.0%
        4                         200                         56.5%               99.5%

   When clustering with the markeme list representation of messages is applied in a
situation where the topics are not known in advance, the parameter $Sk_1$ cannot be
calculated. In this case, one can use the parameter $Sk_2$ instead, which is not tied
to specific topics. The topics of the clusters identified in this way could be
determined by calculating the correlation coefficients between the frequency-dominant
markemes within the identified clusters. Table 7 shows a fragment of the table of
correlation coefficients for markeme pairs (for $Sk_1 \ge 1.9$). When calculating the
correlation, the frequencies of occurrence of a markeme in the messages (760
frequencies in total) were used as the coordinates of the markeme vector.
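
   The type of correlation coefficient is not stated in the text; the sketch below
assumes the Pearson coefficient between two markeme vectors, each holding the
frequency of one markeme in every message. The function name and the short vectors
are illustrative (in the study each vector would have 760 coordinates).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical frequencies of two markemes in five messages.
freq_a = [2, 0, 1, 0, 3]
freq_b = [1, 0, 1, 0, 2]
r = pearson(freq_a, freq_b)
```

   Markemes that tend to occur in the same messages obtain a coefficient close to 1,
which is what allows pairs with high coefficients to hint at a common topic.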

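   Table 7. Correlation coefficients of markeme pairs ($Sk_1 \ge 1.9$); English glosses
                                  in parentheses.

                     Markeme pair                                          Correlation coefficient
  реализация (implementation)       нацпроект (national project)                   0.77
  клинический (clinical)            здравоохранение (healthcare)                   0.67
  сельский (rural)                  территория (territory)                         0.60
  защитник (defender)               воронежец (Voronezh resident)                  0.59
  случай (case)                     пациент (patient)                              0.54
  факел (torch)                     болельщик (fan)                                0.51
  чемпионат (championship)          спортсменка (sportswoman)                      0.51
  факел (torch)                     воронежец (Voronezh resident)                  0.51
  факел (torch)                     штрафной (penalty)                             0.51
  спортсменка (sportswoman)         завоевать (to win)                             0.50
  болельщик (fan)                   защитник (defender)                            0.49
  болельщик (fan)                   воронежец (Voronezh resident)                  0.49
  штрафной (penalty)                воронежец (Voronezh resident)                  0.48
  образование (education)           муниципальный (municipal)                      0.46
  команда (team)                    болельщик (fan)                                0.46
  иномарка (foreign-made car)       водитель (driver)                              0.45
  защитник (defender)               штрафной (penalty)                             0.44
  ворота (goal)                     штрафной (penalty)                             0.43
  чемпионат (championship)          завоевать (to win)                             0.43
  образование (education)           нацпроект (national project)                   0.43
  факел (torch)                     защитник (defender)                            0.43
  главный (chief)                   оборина (surname)                              0.41
  убийство (murder)                 мужчина (man)                                  0.41
  поликлиника (polyclinic)          здравоохранение (healthcare)                   0.41
  участок (site)                    результат (result)                             0.40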


4       Conclusion

The preliminary results of the study presented in this paper allow us to draw several
conclusions regarding the possible use of the markeme approach in text messages
classification, clustering and thematic categorization.
1. By the way a word form is identified as a markeme, a markeme should reflect the
   degree of subjective (author's) weight of this word form for a particular text.
   Since news text messages come from many different online platforms, it is basically
   impossible to speak of a single authorship in a stream of news messages. In this
   respect, the analysis of the text of a news message differs significantly from the
   analysis of a large text, for example a literary work. Markeme analysis is
   therefore better suited as a working tool for linguistic research on texts.
2. In terms of classification and clustering performance, representing a text message
   by a vector of markeme frequencies and representing it by a vector of term
   frequencies based on the TF factor give quite comparable results. Some degradation
   in classification accuracy can be regarded as the price of a substantial reduction
   in the dimensionality of the feature space. Perhaps a more flexible scheme for
   selecting the values of the parameters $f_{M,i}$, $Sk_1$ and $Sk_2$ could improve
   the situation.
3. From the computational point of view, the markeme model of messages has an
   advantage: to identify the markemes of a text, it is enough to have the text itself
   rather than the entire set of texts, as is required, for example, when computing
   the TF-IDF factor. Of course, the text should be long enough to calculate the term
   frequencies.
4. Markemes as a basis of the feature space could be a good choice for the filter
   strategy in the feature selection procedure, reducing the effects of the curse of
   dimensionality and model overfitting. The choice of markemes based on threshold
   values of the parameters $f_{M,i}$, $Sk_1$ and $Sk_2$ can be used to construct an
   "orthogonal" basis (in some sense) in the feature space of terms, for example for
   evaluating the degree of "blurring" of existing topic sections and the need to
   reorganize their structure. Markemes can also be used for generating keywords and
   annotating news messages.
5. The threshold values for $f_{M,i}$, $Sk_1$ and $Sk_2$ are in fact tuning parameters
   of the feature selection procedure for improving both recall and precision. The
   choice of these values will therefore depend on the target recall and precision
   levels.

   The experiments used a relatively small collection of news texts and only 4 topics,
since the paper presents preliminary results of exploratory research. Further
experiments will extend both the size of the collection and the number of topics.
Additional experiments and a comparison with existing weighting schemes for improving
document representation are also planned.


References
 1. Lande, D., Morozov, A., Darmokhval, A.: An Approach to Identifying Duplicate Messages
    in News Information Streams (2006). URL http://dwl.kiev.ua/art/rdcl/rcdl2006.pdf.
 2. Mbaykodzhi, E., Dral, A., Sochenkov, I.: Short Text Messages Classification Method.
    Journal of Information Technologies and Computing Systems, issue 3, pp. 93-102 (2012).
 3. Zhebel, V., Zharikova, S.-N., Sochenkov, I.: Feature Selection for Text Classification
    of a News Flows Based on Topical Importance Characteristic. Artificial Intelligence
    and Decision Making, issue 3, pp. 52-59 (2019). (in Russian).
    https://doi.org/10.14357/20718594190306.
 4. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.:
    Text classification algorithms: a survey. Inf. Switz. 10 (2019).
    https://doi.org/10.3390/info10040150.
 5. Pintas, J., Fernandes, L., Garcia, A.: Feature Selection Methods for Text
    Classification: a Systematic Literature Review. Artif. Intell. Rev. (2021).
    https://doi.org/10.1007/s10462-021-09970-6.
 6. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval.
    Inf. Process. Manag. 24, pp. 513–523 (1988).
 7. Goldberg, Y., Levy, O.: Word2vec explained: Deriving Mikolov et al.’s
    negative-sampling word-embedding method. arXiv:1402.3722 (2014).
 8. Pennington, J., Socher, R., Manning, C.: GloVe: Global Vectors for Word
    Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural
    Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014, vol. 14,
    pp. 1532–1543 (2014).
 9. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
    subword information. arXiv:1607.04606 (2016).
10. Melamud, O., Goldberger, J., Dagan, I.: context2vec: Learning Generic Context
    Embedding with Bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on
    Computational Natural Language Learning, Berlin, Germany, 11–12 August 2016,
    pp. 51–61 (2016). https://doi.org/10.18653/v1/K16-1006.
11. Lu, Y., Lan, M., Su, J., Tan, C.: Supervised and Traditional Term Weighting Methods
    for Automatic Text Categorization. IEEE Transactions on Pattern Analysis & Machine
    Intelligence, vol. 31, no. 04, pp. 721-735 (2009).
    https://doi.org/10.1109/TPAMI.2008.110.
12. Verberne, S., Sappelli, M., Hiemstra, D., Kraaij, W.: Evaluation and Analysis of Term
    Scoring Methods for Term Extraction. Information Retrieval, 19(5), pp. 510-545 (2016).
    https://doi.org/10.1007/s10791-016-9286-2.
13. Mirończuk, M., Protasiewicz, J.: A Recent Overview of the State-of-the-art Elements of
    Text Classification. Expert Systems with Applications, vol. 106, pp. 36-54 (2018).
    https://doi.org/10.1016/j.eswa.2018.03.058.
14. Kumar, V., Minz, S.: Feature Selection: A Literature Review. The Smart Computing
    Review, vol. 4, pp. 211-229 (2014). https://doi.org/10.6029/smartcr.2014.03.007.
15. Faustov, A., Kretov, A.: The Concept of Markeme and Interim Results of Markeme
    Analysis of Russian Literature. Proceedings of Voronezh State University. Series:
    Linguistics and intercultural communication, issue 4, pp. 16-32 (2017). (in Russian)

