<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Approach to Processing News Text Messages Based on Markeme Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Sychev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Voronezh State University</institution>
          ,
          <addr-line>Voronezh</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>313</fpage>
      <lpage>324</lpage>
      <abstract>
        <p>The complexity problem of automatic filtering of messages retrieved from online media platforms and social networks is discussed. A review of approaches to document representation, feature weighting schemes and feature selection techniques is provided. In the paper, an approach to text message processing based on markeme analysis is suggested. Markeme identification is based on calculating the Index of Textual Markedness (InTeM). Markemes are the words most important for a particular text; they occur with a frequency higher than that of the words of the same length. Preliminary results of an exploratory study of the proposed approach, applied to news message classification and clustering, are presented and discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>markeme</kwd>
        <kwd>index of textual markedness</kwd>
        <kwd>word form</kwd>
        <kwd>term</kwd>
        <kwd>message</kwd>
        <kwd>skewness coefficient</kwd>
        <kwd>classification</kwd>
        <kwd>clustering</kwd>
        <kwd>feature weighting</kwd>
        <kwd>feature selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The problem of automatic message processing is usually considered a complex one
due to the following features of the content published on online media platforms and
in social networks:
─ the “fuzzy” subject matter of message texts and comments;
─ the small length of the text in a message or comment;
─ the heterogeneity of published texts in terms of style, the level of literacy of the
authors, etc.;
─ the large volume of messages and comments published per unit of time.
In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a method for classifying short text messages is proposed; the work also
describes a method for assessing the informational
significance of lexical units in natural language texts. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], quality estimates for several
methods of thematic classification (rubrication) of news messages, using various
numerical estimates of informational significance as features, were obtained
experimentally on the “20 newsgroups” data set.
      </p>
      <p>
        In general, the text classification pipeline includes the following steps: text feature
selection (extraction), dimensionality reduction, application of known classification
techniques or development of new ones, and evaluation of the classification model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Feature selection, which reduces dimensionality, removes irrelevant data,
and increases learning accuracy, is essential for tackling problems
such as the curse of dimensionality and model overfitting, which are caused by the high
dimensionality of the data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        For feature selection, different techniques exist, e.g. TF or TF-IDF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
Word2Vec [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], GloVe [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], FastText [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Contextualized Word Representations [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and their modifications. All these techniques can be assigned to one of two general
feature selection approaches: weighted words and word embeddings [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Word embedding techniques require a huge corpus of text data sets for training [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Moreover, this approach cannot handle words that are missing from these data sets.
      </p>
      <p>The weighted-words technique uses a simplified representation of the text, usually in the
form of a bag-of-words (BOW) vector model, which allows fairly simple
and fast algorithms to be used for processing text documents and messages. In the BOW model,
the text is represented as a set of words, usually without taking into account
grammar or word order, but using information about the frequency
of words in the text. When solving the problem of document classification, the
frequency of word occurrence is used as the decisive feature for training the classifier.
It is precisely the word frequency that is used in well-known methods for estimating the informational
significance of words in a text, for example, in TF-IDF based methods.</p>
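      <p>For concreteness, a minimal Python sketch of the weighted-words representation with
TF-IDF weighting follows; the toy documents, the tokenization, and the particular tf and
idf normalizations are assumptions of this illustration rather than a prescription from
the methods discussed here.</p>
      <preformat>
import math
from collections import Counter

def tfidf(docs):
    """Bag-of-words TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # df: the number of documents in which each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequencies in this document
        weights.append({w: (f / len(doc)) * math.log(n / df[w])
                        for w, f in tf.items()})
    return weights

docs = [["markeme", "analysis", "of", "news", "messages"],
        ["classification", "of", "news", "messages"]]
print(tfidf(docs)[0])
      </preformat>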
      <p>
        Some feature selection techniques may not be efficient for a specific application,
depending on its goal and data set. For example, GloVe does not
perform as well as TF-IDF when used for short text messages [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Several widely used unsupervised and supervised term weighting methods on
benchmark data collections in combination with SVM and k-NN algorithms were
considered in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. As was stated in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] the term weighting assignment is combined
to improve both recall and precision measures by a multiplication operation from two
factors: term frequency factor (tf) and collection frequency factor (idf). Several
different collection frequency factors, namely, the multipliers of 1, a conventional
inverse collection frequency factor (idf), a probabilistic inverse collection frequency
(idf-prob), a  2 factor, an information gain (ig) factor, a gain ratio (gr) factor, an
Odds Ratio (OR) factor, and proposed by authors novel relevance frequency (rf)
factor were studied in experiments. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] five term scoring methods for automatic term
extraction on different types of text collections were evaluated to investigate the
influence of three factors in the success of a term scoring method in term extraction:
collection size, background collection and the importance of multi-word terms. One
important conclusion from [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is that all term scoring methods could not demonstrate
the high level of performance for collections smaller than 1,000 words due to the
prevailing of the frequency criterion in all methods.
      </p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], dimensionality reduction techniques can be organized into
three groups: feature selection (FS), feature projection, and instance selection. While
the first two types of methods aim to reduce the dimensionality of the feature space,
the third aims to reduce the number of instances used for training.
      </p>
      <p>In FS methods, the resulting feature set is a subset of the initial feature set.
Feature projection, by contrast, produces a new group of features mapped from the original features.</p>
      <p>
        FS methods are usually classified into three categories: filter, wrapper, and
embedded [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Filter methods are executed independently of the classifier's learning activity.
Wrapper methods rely on the classifier's performance to assess the relevance of
features or to search for the most relevant subset of features. Embedded methods include
FS as part of the training process.
      </p>
      <p>A relevant advantage of feature selection is that the resulting feature set is a
subset of the original features, so each selected feature preserves the meaning of the
original one.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the InTeM index (the Index of Textual Markedness of a word form) is used
to assess the degree of subjective weight of a word form in a text. The authors of
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] assume that each word form in a text has two parameters: frequency and
length. At the same time, in their view, the frequency of a word form is a
complex subjective-objective indicator, whereas the length of a word form is a simple
objective, linguistic one. Hence, the subjective (i.e. meaningful) weight of a word form
can be obtained by subtracting the simple objective factor (i.e. the weight of the word
form according to its length) from the complex subjective-objective factor (i.e. the
weight of the word form according to its frequency). The resulting value, the Index
of Textual Markedness of a word form (InTeM), indicates the degree of
subjective (textual) weight of a given word form for a given text. Thus, it is in fact
proposed to calculate the following indicator to assess the informational significance of
a word form t_i from a text message m:
      <p>ITM_i = WF_i - WL_i,</p>
      <p>WF_i = \frac{\sum_{j=1}^{N_t} f_j - \sum_{j=1}^{i} f_j}{\sum_{j=1}^{N_t} f_j}, \quad i \le N_t,</p>
      <p>WL_i = \frac{\sum_{j=1}^{L_m} f_j^{(len)} - \sum_{j=1}^{l} f_j^{(len)}}{\sum_{j=1}^{L_m} f_j^{(len)}}, \quad i \le N_t,</p>
      <p>where word forms t_i of the text are ranked in descending order of their
frequency f_i in the overall list of all word forms of the text. The frequency f_j^{(len)} denotes
the number of occurrences of all word forms of length j in the text of
message m, N_t is the total number of distinct word forms in the message m, L_m is
the maximum word form length in the message m, and l denotes the length of
t_i.</p>
      <p>Word forms with the maximum values of ITM_i are called markemes; they form the
set of word forms most significant for the author of the text.</p>
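      <p>As an illustration, the following minimal Python sketch computes ITM_i for every
word form of an already tokenized text, following the definitions above; the
tokenization step and the helper name are assumptions of this illustration.</p>
      <preformat>
from collections import Counter

def intem_scores(tokens):
    """ITM_i = WF_i - WL_i for every distinct word form in a tokenized text."""
    freq = Counter(tokens)                             # f: word form -> frequency
    ranked = sorted(freq, key=freq.get, reverse=True)  # descending frequency order
    total = sum(freq.values())                         # sum of f_j over all N_t forms

    # f^(len)_j: the total frequency of all word forms of length j.
    len_freq = Counter()
    for form, f in freq.items():
        len_freq[len(form)] += f
    max_len = max(len_freq)                            # L_m

    # cum_len[l] = sum of f^(len)_j for j = 1..l; note cum_len[L_m] equals total.
    cum_len = [0] * (max_len + 1)
    for j in range(1, max_len + 1):
        cum_len[j] = cum_len[j - 1] + len_freq.get(j, 0)

    scores, cum_f = {}, 0
    for form in ranked:
        cum_f += freq[form]                            # sum of f_j for j = 1..i
        wf = (total - cum_f) / total                   # weight by frequency
        wl = (total - cum_len[len(form)]) / total      # weight by length
        scores[form] = wf - wl
    return scores

# As in the study below, word forms with a positive index can be kept as markemes:
# markemes = [w for w, s in intem_scores(tokens).items() if s > 0]
      </preformat>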
      <p>In this paper, the possibility of using the markeme model of texts for the standard
problems of classification, clustering and thematic categorization is considered, using
a collection of news text messages as an example, and the preliminary results of an
exploratory study are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>For the purposes of the study, a set M containing 760 text messages on several topics
was formed. The set M included, in approximately equal proportions, messages of four
topics marked up manually by experts. The average message size was 145.6 words.
The total number of unique terms in the dictionary D, built from lemmas
extracted from the message texts, was about 12 thousand units; 600 terms
with a total frequency of occurrence over the entire set M of at least 28 were
selected for the study. The maximum total frequency of occurrence in the set M for a term
from the dictionary was 1281.</p>
      <p>Figure 1 shows the frequency distribution over M of the terms from the dictionary D
by term length. As can be seen, long terms (more than 10-12 characters in length)
occur in messages with a low frequency, which reflects the objective linguistic
realities. Within the framework of the markeme approach, an excess of the frequency
of occurrence in a text T of a specific term f_i of length l relative to the frequency
f_l^{(len)} typical for terms of length l gives grounds for including the term in the set of
markemes MK_T of the given text T. One should note that the calculation of ITM_i within the
framework of the markeme approach does not take into account the shape of the
frequency distribution function over the lengths of word forms.</p>
      <p>This kind of distribution can also be calculated for each text message individually.</p>
      <p>In this study, all terms whose ITM_i value exceeded zero were identified as
markemes.</p>
      <p>Table 1 shows the numbers of markemes identified from the text messages,
averaged by topic category. N_MK1 is the number of markemes identified using the
global (message collection) frequency distribution of terms by length, and N_MK2 is
the number of markemes identified using the local (i.e. inside an individual message)
frequency distribution of terms by length.</p>
      <p>Obviously, the markemes characterized by a relatively high
frequency f_M in messages and a relatively large value of the distribution asymmetry
index (skewness) Sk over the entire set of messages M will be the ones useful in the further study.</p>
      <sec id="sec-2-1">
        <title>Topic</title>
        <p>oAf ivdneifrafaemgreeensntsuatmegrebmesr freAquveenracgieessiunma omfetsesramge
Medicine 6.3 3.3 39.2 54.2
Accidents 8.8 5.1 54.0 74.1
Politics 10.8 6.2 58.6 90.0
Sports 8.1 4.2 50.3 75.3
Mean: 8.6 5.1 51.0 74.2</p>
      <p>In this study, two asymmetry indicators for markemes were considered:
─ Sk1 is the skewness of the markeme distribution across the four topics of the
message set M;
─ Sk2 is the skewness of the markeme distribution over the entire set of messages M
as a whole.</p>
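      <p>To make the two indicators concrete, a small sketch follows; the frequency
values are invented, and the bias-corrected skewness estimator is an assumption,
chosen because its upper bound for four samples is 2, which matches the limit
value discussed below.</p>
      <preformat>
import numpy as np
from scipy.stats import skew

# Sk1 for one markeme: skewness of its four per-topic frequency totals.
# For n = 4 samples the bias-corrected statistic is bounded by 2, and the
# bound is reached when the markeme is concentrated in a single topic.
topic_freq = np.array([46, 1, 1, 1])
print(skew(topic_freq, bias=False))    # 2.0, the limit value

# Sk2: the same statistic over the markeme's frequencies in all 760 messages.
msg_freq = np.zeros(760)
msg_freq[:12] = 4                      # hypothetical: occurs in 12 messages only
print(skew(msg_freq, bias=False))      # a large positive skewness
      </preformat>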
      <p>The Sk2 × Sk2 panel of the scatter matrix shows the histogram of the Sk2 value
distribution. The highest distribution density is observed near the value 10 of the Sk2
variable. The f × Sk2 panel confirms this observation. For the Sk1 variable, the
distribution density is concentrated in the vicinity of the value 2. Thus, the markemes
of interest are those for which:
─ the frequency of occurrence f_Mi noticeably differs from the minimum values;
─ the topic parameter Sk1 tends to the limit value 2 (good topic specificity), or the
Sk2 parameter value is in the vicinity of 10 (a good indicator of the markeme's
specificity in M).</p>
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>For the experiment, the filter strategy of feature selection was chosen to reduce the
dimensionality of the feature space. Feature filtering was performed using the f_Mi, Sk1,
and Sk2 parameters.</p>
      <p>Markemes Mk_i whose parameter values satisfied the two conditions f_Mi ≥ 6
and Sk1 ≥ 1.9 were selected from the total set of markemes identified from M. Table 2
provides the list of the 58 markemes selected from the set M in this way.
For the message classification, a naive Bayes classifier was used, with the quality of
classification assessed by 10-fold cross-validation.
The obtained quality estimates are given in Table 3, where rows correspond to the
true topics of the tested data and columns to the classifier's topic predictions.
The Accuracy value was 82%; Accuracy was calculated as the ratio (number of
correct classifier predictions) / (total number of testing examples). For comparison,
Table 4 provides the estimates for the same classification, except that all the terms
(600 units) from the dictionary D were used as attributes of the frequency vectors of
messages. The Accuracy value was 89%.</p>
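      <p>A minimal sketch of this filter-plus-classifier setup is given below, under stated
assumptions: X is a (messages x features) frequency matrix restricted to the identified
markemes, y holds the expert topic labels, the thresholds are those from the text, and
the bias-corrected skewness estimator and all names are illustrative.</p>
      <preformat>
import numpy as np
from scipy.stats import skew
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def filter_features(X, y, min_freq=6, min_sk1=1.9):
    """Filter strategy: keep the features (markemes) whose total frequency is
    at least min_freq and whose per-topic skewness Sk1 is at least min_sk1."""
    total_freq = X.sum(axis=0)                       # f_M for every feature
    topics = np.unique(y)
    per_topic = np.array([X[y == t].sum(axis=0) for t in topics])
    sk1 = skew(per_topic, axis=0, bias=False)        # skewness across the topics
    return np.logical_and(total_freq >= min_freq, sk1 >= min_sk1)

# Hypothetical usage with a precomputed matrix X and label vector y:
# mask = filter_features(X, y)
# scores = cross_val_score(MultinomialNB(), X[:, mask], y, cv=10)
# print("Accuracy: %.1f%%" % (100 * scores.mean()))
      </preformat>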
      <sec id="sec-3-1">
        <title>True/Prediction</title>
        <sec id="sec-3-1-1">
          <title>Prediction topic 1</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Prediction topic 2</title>
        </sec>
        <sec id="sec-3-1-3">
          <title>Prediction topic 3</title>
        </sec>
        <sec id="sec-3-1-4">
          <title>Prediction topic 4</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Class recall True topic 1</title>
      <p>It is noteworthy that although, overall, the markeme-list representation of
messages worsened the Accuracy value by about 7%, there was an improvement in recall
or precision for some topics. For example, the recall of topic 1 ("Medicine")
improved significantly (with a significant decrease in the precision value), while for topics 2 and 3
("Accidents", "Politics") there was an improvement in the precision value (while the
recall value decreased). The significant decrease of the classification Accuracy (Table 3)
is due to the decrease in the class precision for topic 1 and in the class recall for topics
2 and 3. This effect can be considered the price paid for the essential reduction of the
feature space dimensionality. One can see in Table 2 that the list of selected
markeme features is too short to provide a high level of per-class accuracy. Perhaps a more
flexible scheme for selecting the f_Mi, Sk1 and Sk2 parameter values could improve the
situation. The dimensionality reduction of the feature space achieved for the
message classification problem turned out to be more than tenfold.</p>
      <sec id="sec-3-3">
        <title>True/Prediction</title>
        <sec id="sec-3-3-1">
          <title>Prediction topic 1</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>Prediction topic 2</title>
        </sec>
        <sec id="sec-3-3-3">
          <title>Prediction topic 3</title>
        </sec>
        <sec id="sec-3-3-4">
          <title>Prediction topic 4</title>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Class recall</title>
      </sec>
      <sec id="sec-3-5">
        <title>True</title>
        <p>topic 1
130
9
1
20
True
topic 1</p>
        <p>8
190
10
0
81,3%
91,4%
True
topic 1
16
10
164</p>
        <p>3
85,0%
True
topic 1
2
0
5
192
96,5%</p>
        <p>Class
precision
83,3%
90,9%
82,4%
98,0%</p>
      </sec>
      <sec id="sec-3-6">
        <title>True/Prediction</title>
      <p>For comparison purposes, a message classification was carried out based on
a frequency vector with attribute terms selected using the Sk1 ≥ 1.9 filter (the Sk1
factor could, to some extent, be considered an analogue of the IDF factor in
TF-IDF based algorithms). In fact, a boolean conversion of the Sk1 factor was used as the
collection (topic) frequency factor (IDF). The total number of terms selected from D was
152. The results of this experiment are given in Table 5. The Accuracy value was
84.3%.</p>
      <sec id="sec-3-7">
        <title>Messages Clustering</title>
        <p>K-means clustering was carried out with the set of markemes (given in Table 2) used as
attributes of the message vectors. Table 6 summarizes the results of this experiment. As
can be seen from the table, the markeme set makes it possible to accurately identify the thematic core
of the set of messages for each topic, but at the same time most of the messages in
each topic subset remain thematically vague. An increase in the clustering recall
could be achieved by softening the constraints (on the parameters f_Mi and Sk1) when
selecting markemes. It is worth noting that the clustering result is quite sensitive to the
choice of the initial conditions of the clustering algorithm.</p>
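        <p>A sketch of this clustering setup under stated assumptions follows: the matrix X
stands in for the real (messages x markemes) frequency data, and the fixed seed and
restart count address the noted sensitivity to initial conditions.</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

# X: hypothetical (760 messages x 58 markemes) frequency matrix; in the study
# the selected markemes serve as the attributes of the message vectors.
rng = np.random.default_rng(0)
X = rng.poisson(0.7, size=(760, 58)).astype(float)

# Four clusters, one per expected topic; n_init restarts with a fixed seed
# mitigate the sensitivity to the initial conditions mentioned above.
km = KMeans(n_clusters=4, n_init=20, random_state=0).fit(X)
print(np.bincount(km.labels_))         # cluster sizes
        </preformat>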
        <p>When clustering with a markeme-list representation of messages in a situation where
the topics are not known in advance, the Sk1 parameter cannot be calculated. In this
case, one can instead calculate the Sk2 parameter, which is not
tied to specific topics. The topics of the clusters identified in this way could then be
determined by calculating the correlation coefficients between the frequency-dominant
markemes within the identified clusters. Table 7 shows a fragment of the table of
correlation coefficients for markeme pairs (for Sk1 ≥ 1.9); the coefficients in the
fragment range from 0.77 down to 0.40. When calculating a
correlation, the frequencies of occurrence of the two markemes in the messages (760 frequencies
in total) were used as the coordinates of the markeme vectors.</p>
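        <p>The pairwise correlation computation described above can be sketched as follows;
the frequency matrix F is a random stand-in for the study's data, and the listing of
the strongest pairs is illustrative.</p>
        <preformat>
import numpy as np

# F: hypothetical (markemes x messages) matrix, one row of 760 message
# frequencies per markeme, as described in the text.
rng = np.random.default_rng(1)
F = rng.poisson(0.5, size=(58, 760))

corr = np.corrcoef(F)                        # Pearson correlations of the rows
i, j = np.triu_indices_from(corr, k=1)       # each markeme pair once
top = sorted(zip(corr[i, j], i, j), reverse=True)[:5]
for c, a, b in top:
    print(f"markemes {a} and {b}: r = {c:.2f}")
        </preformat>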
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The preliminary results of the study presented in this paper allow us to draw several
conclusions regarding the possible use of the markeme approach for text message
classification, clustering and thematic categorization.
1. By the method of its identification, a word form marked as a markeme should
reflect the degree of subjective (author's) weight of this word form for a particular
text. Since news text messages come from many different online platforms, it is
essentially impossible to speak of a single authorship in a stream of news
messages. In this respect, the analysis of the text of a news message differs
significantly from the analysis of a large text, for example, a literary work. Markeme
analysis is, of course, better suited as a working tool for linguistic research on
texts.
2. From the point of view of classification and clustering performance,
representing a text message by a vector of markeme frequencies and representing it
by a vector of term frequencies based on the TF factor give quite comparable
results. Some degradation in classification accuracy can be considered the price
paid for the essential reduction of the feature space dimensionality. Perhaps a more
flexible scheme for selecting the f_Mi, Sk1 and Sk2
parameter values could improve the situation.
3. From the computational point of view, the markeme model of messages has an
advantage: to identify the markemes of a text, it is enough to have the body of the
text itself, rather than the entire set of texts, as is required, for example, when
computing the TF-IDF factor. Of course, the text should be large
enough for the term frequencies to be estimated.
4. Markemes as a basis of the feature space can be considered a good choice for the
filter strategy in the feature selection procedure, mitigating the effects of the curse of
dimensionality and model overfitting. The choice of markemes based on
threshold values of the f_Mi, Sk1 and Sk2 parameters can be used to construct an
"orthogonal" basis (in some sense) in the term feature space, for example, for
evaluating the degree of "blurring" of existing topic sections and the need to reorganize
their structure. Markemes can also be used for keyword generation and for annotating
news messages.
5. The threshold values for f_Mi, Sk1 and Sk2 are in fact tuning
parameters of the feature selection procedure for improving both recall and precision
measures. Accordingly, the choice of these values will depend on the
target recall and precision levels.</p>
      <p>Admittedly, a relatively small collection of news texts and only four topics were used in
the experiments; however, the paper presents the preliminary results of exploratory research.
Further experiments will extend both the size of the collection and the number of
topics. More experiments and comparisons with existing weighting schemes for
improving document representation are planned for the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lande</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morozov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darmokhval</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An Approach to Identifying Duplicate Messages in News Information Streams</article-title>
          (
          <year>2006</year>
          ). URL: http://dwl.kiev.ua/art/rdcl/rcdl2006.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mbaykodzhi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dral</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sochenkov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Short Text Messages Classification Method</article-title>
          .
          <source>Journal of Information Technologies and Computing Systems, issue 3</source>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhebel</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zharikova</surname>
            ,
            <given-names>S.-N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sochenkov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Feature Selection for Text Classification of a News Flows Based on Topical Importance Characteristic</article-title>
          .
          <source>Artificial Intelligence and Decision Making, issue 3</source>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          (
          <year>2019</year>
          ). (in Russian). https://doi.org/10.14357/20718594190306.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kowsari</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jafari Meimandi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heidarysafa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnes</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Text classification algorithms: a survey</article-title>
          .
          <source>Information (Switzerland)</source>
          .
          <volume>10</volume>
          . (
          <year>2019</year>
          ). https://doi.org/10.3390/info10040150.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pintas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature Selection Methods for Text Classification: a Systematic Literature Review</article-title>
          .
          <source>Artif.Intell.Rev</source>
          . (
          <year>2021</year>
          ). https://doi.org/10.1007/s10462-021-09970-6.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>Inf. Process. Manag</source>
          ,
          <volume>24</volume>
          , pp.
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          . (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method</article-title>
          .
          <source>arXiv</source>
          , arXiv:1402.3722 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Doha, Qatar, 25-29 October 2014, vol.
          <volume>14</volume>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv</source>
          , arXiv:1607.04606 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Melamud</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dagan</surname>
            ,
            <given-names>I.:</given-names>
          </string-name>
          <article-title>context2vec: Learning Generic Context Embedding with Bidirectional LSTM</article-title>
          .
          <source>In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning</source>
          , Berlin, Germany, 11-12 August 2016
          , pp.
          <fpage>51</fpage>
          -
          <lpage>61</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.18653/v1/K16-1006.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lu</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Supervised and Traditional Term Weighting Methods for Automatic Text Categorization</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis &amp; Machine Intelligence</source>
          , vol.
          <volume>31</volume>
          , no.
          <issue>04</issue>
          , pp.
          <fpage>721</fpage>
          -
          <lpage>735</lpage>
          . (
          <year>2009</year>
          ). https://doi.org/10.1109/TPAMI.2008.110
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Verberne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sappelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kraaij</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Evaluation and Analysis of Term Scoring Methods for Term Extraction</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>19</volume>
          (
          <issue>5</issue>
          ), pp.
          <fpage>510</fpage>
          -
          <lpage>545</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1007/s10791-016-9286-2.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mirończuk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Protasiewicz</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A Recent Overview of the State-of-the-art Elements of Text Classification</article-title>
          .
          <source>Expert Systems with Applications</source>
          , vol.
          <volume>106</volume>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>54</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1016/j.eswa.2018.03.058.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Feature Selection: A literature Review</article-title>
          .
          <source>The Smart Computing Review</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>229</lpage>
          . (
          <year>2014</year>
          ) https://doi.org/10.6029/smartcr.2014.03.007.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Faustov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kretov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Concept of Markeme and Interim Results of Markeme Analysis of Russian Literature</article-title>
          .
          <source>Proceedings of Voronezh State University. Series: Linguistics and Intercultural Communication, issue 4</source>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2017</year>
          ). (in Russian)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>