Modernized Mathematical Model of Text Document Classification

Tetiana Golub1[0000-0001-6024-008X]

1 Zaporizhzhia National Technical University, Zhukovsky str., 64, Zaporizhzhia, 69063, Ukraine
golub.tv6@gmail.com

Abstract. A modernized mathematical model of the main stages of text document classification is proposed. It takes into account the characteristics of individual categories. A mathematical description of the stages of creating the document data set and of classifying a document into categories is given. The principles of reducing the feature space dimension are described, and the choice of the method used for determining term weights is justified. Applying the proposed method reduces the time needed to analyze each document and decide on its category, which in turn reduces the total time for analyzing the entire document set.

Keywords: text document classification, term vector, mathematical model, term weight, SLF parameter

1 Introduction

The amount of information presented in text form increases continuously. Text information accumulates in all areas of human activity, from data stored on personal computers to Big Data. It covers areas such as business, research institutions, government and financial institutions that use technology intensively. Text information contains statistical data, control commands, reference information and the fundamental laws of various processes. A characteristic feature of such information is its lack of structure, which complicates its analysis [1].

Text analytics converts text into numbers. This makes it possible to organize data and helps to identify patterns. Structured data are easier to analyze, so decisions based on them are of higher quality [2, 3]. To find information in a large amount of data, the data must first be classified [4].
This process is the subject of the present study.

Text classification is one of the tasks of computational linguistics. It includes determining the topic of a text, its author, the emotional coloring of a statement, and so on. Organizing documents simplifies the search for the necessary information and is one of the most urgent tasks; text classification is needed to solve it [5]. The classification problem is difficult because the data flow is constantly increasing, which makes its solution relevant.

Many approaches to solving this problem are described in the literature. An overview and comparison of currently relevant methods, organized by the stages of the process, are presented in [1, 6–8]. According to these sources, one of the most important steps of text classification is key feature selection. The works [9–15] are devoted to this problem and cover statistical, frequency-based, latent-semantic and other approaches. However, the described methods consider terms within the entire document collection, so the importance of an individual term cannot be assessed for each category separately.

The classification of text documents is the process of analyzing the content of a document and automatically assigning it to one or several categories [16, 17]. Categories are sets of documents with a common theme. The set of categories is specified by an expert or determined automatically on the basis of a training sample. An automatic classifier is used in an information-analytical system at the document processing stage. It is a program that determines the subject of documents and assigns them to categories [6].

The inverse problem is also relevant. It consists in selecting, from a document set, the documents that belong to a category defined by the user.
The mathematical models presented in the literature do not consider the importance of a term for individual categories. The author offers an improved mathematical model that takes this parameter into account. This reduces the time needed to assess whether a document belongs to particular categories, because the term vector of each category becomes shorter.

2 Task formalization

The document classification process can be described formally as follows.

Text document classification is understood as the task of automatically assigning a document to one or several categories based on its content. A category is understood as a set of documents with a common theme. The set of categories is specified by an expert or determined automatically using a training set. An automatic classifier is used in an information-analytical system at the document processing stage [6].

Mathematical models of the text document classification process given in [18, 19] are common. In this article the author proposes an improvement of the existing variants of term weight determination, as a part of the classification process, that takes the requirements of the task into account.

The following notation is used to describe the process of text document classification formally:
─ T={t1,…,t|T|} – document term set;
─ B={b1,…,b|B|} – term set;
─ D={d1,…,d|D|} – document set;
─ C={c1,…,c|C|} – category set;
─ E={e1,…,e|E|} – category term set.

In the general case, the task of searching for documents that correspond to a particular category is as follows. There is a set of documents D from which it is necessary to choose those documents dj that most likely belong to a category ci selected in advance from the set of categories C. The solution of this problem is considered in this article.

$$\Phi(d_j, c_i) = \begin{cases} 0, & \text{if } d_j \notin c_i \\ 1, & \text{if } d_j \in c_i \end{cases} \qquad (1)$$

3 Document term set creation

Text documents are classified by analyzing their terms. A term is an intuitively defined expression of a formal language; it is the formal name of an object [2]. In this study, a term is understood as a word obtained after stemming. Stemming is the reduction of a word to a normal form by clipping its endings and suffixes. Forming terms is one of the tasks of the preprocessing stage.

To solve the classification problem, the text is represented by a document term set model in which each term has its own weight. Text preprocessing is performed when determining whether a document belongs to a category, considering the importance of each term.

The preprocessing process has the following characteristics suggested by the author:
─ T ⊆ B – all terms of the document are included in the set of possible terms;
─ E ⊆ B – all terms of the category are included in the set of possible terms. Together, the elements of the sets T and E form the set B;
─ TM – a multiset over the set T, which records repeated occurrences of the set elements. Forming the document multiset of one group (category) makes it possible to determine the multiplicity of each term. This parameter is a quantitative measure of the term occurrence;
─ EM – a multiset over the set E, which records repeated occurrences of the set elements. Based on the category multiset, terms can be characterized by the multiplicity of their occurrence. This parameter shows in how many categories of the collection the considered term occurs at least once. It makes it possible to single out terms that are typical of all categories rather than of a particular category. These terms do not contain information for classification.
Therefore, they can be excluded from the analyzed set.

Subsequent text processing is performed based on these characteristics.

All words that appear in the documents can be ordered in some way, for example alphabetically. Then, for each document, the entire set of weights corresponding to the dictionary words can be written out. If a term does not occur in the document, its weight is zero. The resulting vector is

$$d_i = (w_1, w_2, \ldots, w_n), \qquad (2)$$

where d_i is the vector representation of the i-th document, w_k is the weight of the k-th dictionary term in this document, and n is the total number of different terms in all the documents of the collection, that is, the cardinality of the set B [21].

4 Term weight identification

To assess whether a document belongs to a category, the weights of the terms of the set E are determined for each category of the incoming set B.

Many methods for determining term weights are presented in the literature. Some of them are:
─ Boolean weight: w = sign(tf), i.e. 1 if the word occurs in the document and 0 otherwise;
─ w = tf – the number of occurrences of the word in the document [3];
─ w = tf/df – the «tf•idf» coefficient, i.e. the product of the word occurrence frequency (tf) and the reciprocal of the word occurrence frequency in all documents of the collection (inverse df). There are many ways to define the weight wij of the i-th term in the document dj. One of the simplest is wij = tf•log10(1/df). The «tf•idf» formulas solve the problem of common words, where words that carry no meaning receive a high weight;
─ SLF parameter [3];
─ latent semantic analysis [7, 8].

These methods characterize terms either within a single document or within the entire collection as a whole. In both cases, the importance of a term within a single category is ignored.
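The first three weighting schemes listed above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function name `classic_weights` is hypothetical, documents are assumed to be lists of already stemmed terms, and df is assumed to be the share of documents in which a term occurs, so that log10(1/df) is never negative.

```python
import math
from collections import Counter


def classic_weights(docs):
    """Return the dictionary and Boolean, tf and tf-idf vectors for each document.

    docs: list of documents, each a list of stemmed terms (assumed input format).
    """
    # Alphabetically ordered dictionary of all terms in the collection.
    vocab = sorted({term for doc in docs for term in doc})
    doc_sets = [set(doc) for doc in docs]
    # Assumption: df is the fraction of documents that contain the term.
    df = {t: sum(t in s for s in doc_sets) / len(docs) for t in vocab}

    boolean, tf, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        # w = sign(tf): 1 if the term occurs in the document, 0 otherwise.
        boolean.append([1 if counts[t] > 0 else 0 for t in vocab])
        # w = tf: number of occurrences of the term in the document.
        tf.append([counts[t] for t in vocab])
        # w = tf * log10(1/df): terms present in every document get weight 0.
        tfidf.append([counts[t] * math.log10(1 / df[t]) for t in vocab])
    return vocab, boolean, tf, tfidf
```

For example, `classic_weights([["cpu", "cache", "cpu"], ["cpu", "code"]])` gives the term "cpu" a tf-idf weight of zero in both documents, because it occurs in every document of this toy collection.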
The SLF parameter [3], used to determine the weights of the terms of the set E, compensates for this disadvantage. SLF is a coefficient that evaluates terms with regard to their occurrence in the categories. Unlike many other approaches to determining weights, this method considers the importance of each term with respect to the categories.

The following quantities are defined to find the SLF parameter:
1. dftc – the number of documents of category c in which the term t occurs at least once;
2. Nc – the number of documents in the category c;
3. NDFtc – the normalized frequency of occurrence of the term t in the category c. It is the ratio of the number of documents of category c in which the term t occurs at least once to the number of documents in category c. This estimate is local to the category:

$$\mathrm{NDF}_{tc} = \frac{df_{tc}}{N_c} \qquad (3)$$

4. SLFt – the logarithmic sum of the frequencies of the term t:

$$\mathrm{SLF}_t = \log\frac{|C|}{\sum_{c} \mathrm{NDF}_{tc}} \qquad (4)$$

The SLFt indicator eliminates the imbalance between categories with a small and with a large number of documents. The TFSLF weight of each term within the collection is determined according to formula (5):

$$\mathrm{TFSLF}_t = \mathrm{TF}_t(E/t/) \cdot \mathrm{SLF}_t, \qquad (5)$$

where TFt(E/t/) is the frequency of the term belonging to the set B. It is defined as the ratio of the number of occurrences of the term to the total number of terms in the document. Thus, the importance of the term ti within a separate document dj is estimated [9].

As a result, the vector BT with the weight coefficients of the terms of the set T within the entire collection as a whole is obtained. In this case, the significance of terms for a particular category is not fully considered. This lowers the classification quality for texts whose topics are close in meaning and vocabulary.

The SLF parameter considers the importance of a term for the categories within the collection, but does not take into account the importance of terms for each category separately.
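Formulas (3)–(5) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the names `slf_values` and `tfslf` are hypothetical, the training data are assumed to be a dictionary that maps each category to a non-empty list of stemmed documents, and the natural logarithm is used because the base of the logarithm in (4) is not specified.

```python
import math
from collections import Counter


def slf_values(categories):
    """Compute SLF_t for every term, following (3) and (4).

    categories: dict mapping a category name to a non-empty list of documents,
    each document being a list of stemmed terms (assumed input format).
    """
    vocab = {t for docs in categories.values() for doc in docs for t in doc}
    ndf_sum = dict.fromkeys(vocab, 0.0)
    for docs in categories.values():
        n_c = len(docs)  # N_c: number of documents in the category
        doc_sets = [set(doc) for doc in docs]
        for t in vocab:
            df_tc = sum(t in s for s in doc_sets)  # documents of c containing t
            ndf_sum[t] += df_tc / n_c  # NDF_tc, formula (3)
    # Formula (4); natural logarithm is an assumption.
    return {t: math.log(len(categories) / ndf_sum[t]) for t in vocab}


def tfslf(doc, slf):
    """Weight the terms of one document by TF_t * SLF_t, following (5)."""
    counts = Counter(doc)
    # TF_t: occurrences of the term divided by the number of terms in the document.
    return {t: (n / len(doc)) * slf.get(t, 0.0) for t, n in counts.items()}
```

A term that occurs in every document of every category has a sum of normalized frequencies equal to |C|, so its SLF value, and therefore its weight, is zero.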
To take the importance of terms for each category separately into account, the author proposes the following modification of the term weight definition based on the SLF parameter.

The author proposes a sequence of actions for identifying non-informative terms for each category individually, based on the SLF parameter and statistical data, and for subsequently removing these terms from the term vector of the category.

The sequence of actions to determine the weights of the terms of the set E of each category, for each category term ei, is as follows:
─ the coefficient TF for each category term ei is determined;
─ the weight of each term by category is determined;
─ terms uncharacteristic of each category are identified and removed.

The coefficient TF for each category term ei is defined as the ratio of the number of occurrences of the term within a separate category to the number of its occurrences within the collection as a whole (6):

$$\mathrm{TF}(t_i, c_j) = \frac{fr_{ij}}{\sum_{k=1}^{|C|} fr_{ik}}, \qquad (6)$$

where fr_ij is the number of occurrences of the term ti in the category cj, 0 ≤ i ≤ |E|, 0 ≤ j ≤ |C|. The importance of the term ti within a single document dj is evaluated [14].

The weight of each term by category, taking into account its occurrence in the categories of the collection (the set E containing the CTFSLF(ti, cj) values of each category term), is defined as the product of the TF coefficient of the term in the category and the SLF parameter:

$$\mathrm{CTFSLF}(t_i, c_j) = \mathrm{TF}(t_i, c_j) \cdot \mathrm{SLF}_k, \qquad (7)$$

where 0 ≤ i ≤ |E|, 0 ≤ j ≤ |C|, 0 ≤ k ≤ |B|.

The CTFSLF method for determining term weights makes it possible to take into consideration the importance of a term within a particular category.

5 Reducing the feature space dimension

The computational complexity of various classification methods depends directly on the feature space dimension. Therefore, classification is often preceded by a stage that reduces the number of terms used, i.e. reduces the dictionary size of the category from |B| to |B'| << |B|.
The purpose of this stage is to reduce the dimension of the data set. This is achieved by removing terms that are uninformative for classification. It decreases the data size and reduces the computing power requirements of the algorithm [4]. In this case, the term vector of each document undergoes the following preliminary processing:
─ elimination of stop words (frequently used words that carry no semantic load, such as conjunctions) [5];
─ morphological analysis of words [5];
─ application of clustering methods [6].

The author proposes the following method of reducing the term vector size, based on the modification described above. It consists of a stage of identifying terms that are uncharacteristic of individual categories and a stage of removing them.

A value Kj is calculated to serve as the threshold. It is the reciprocal of the number of documents that belong to the analyzed category and is used to remove non-informative terms. The weight of a term describes how strongly it belongs to a certain category. Terms that are found in all categories have low weight. Terms whose weights are below the threshold are excluded.

$$K_j = \frac{1}{D_j}, \qquad (8)$$

where Dj is the number of documents in the category cj, 0 ≤ j ≤ |C|.

Next, the weight of each collection term is compared with the threshold. If the weight is less than the threshold, it is set to zero:

$$\Psi(e_i, c_j) = \begin{cases} 0, & \text{if } e_i < K_j \\ \mathrm{CTFSLF}(t_i, c_j), & \text{if } e_i \geq K_j \end{cases} \qquad (9)$$

where 0 ≤ i ≤ |E|, 0 ≤ j ≤ |C|, and the comparison is applied to the weight of the term ei.

This analysis makes it possible to identify and exclude terms of low informativeness, i.e. terms that occur often across the categories of the document corpus and are not informative for classification. Removing these terms from the feature space reduces the length of the analyzed set and simplifies the classification task.
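The category-level weighting in (6)–(7) and the thresholding in (8)–(9) can be sketched together in Python. The sketch rests on several assumptions that are not stated in the text: the function name `prune_category_terms` is hypothetical, the denominator of (6) is taken as the number of occurrences of the term in all categories (following the verbal definition), SLF_k in (7) is taken to be the SLF value of the same term, and the SLF values are passed in as a precomputed dictionary.

```python
from collections import Counter


def prune_category_terms(categories, slf):
    """Return, for each category, the terms whose CTFSLF weight reaches K_j.

    categories: dict mapping a category name to a non-empty list of documents,
    each document being a list of stemmed terms (assumed input format).
    slf: dict mapping each term to its precomputed SLF value.
    """
    # fr[c][t]: number of occurrences of term t in all documents of category c.
    fr = {c: Counter(t for doc in docs for t in doc)
          for c, docs in categories.items()}
    # Occurrences of each term in the whole collection.
    total = Counter()
    for counts in fr.values():
        total.update(counts)

    kept = {}
    for c, docs in categories.items():
        threshold = 1 / len(docs)  # K_j, formula (8)
        kept[c] = {}
        for t, n in fr[c].items():
            weight = (n / total[t]) * slf[t]  # formulas (6) and (7)
            if weight >= threshold:  # formula (9): lower weights are zeroed
                kept[c][t] = weight
    return kept
```

Terms whose weight falls below the threshold are simply omitted from the returned dictionary, which corresponds to setting their weight to zero in (9) and shortens the term vector of the category.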
The resulting term vector is used to search for documents belonging to a particular category, using the classification process.

6 Document classification into categories

In general, the task of classifying documents into categories is to find the maximum sum of the weights of the document terms that coincide with the terms characterizing a separate category.

The author introduces the following parameter to evaluate this indicator. W is a set that indicates the degree to which a document falls into a separate category. It is defined as the intersection of the document term set T and the term set E of the corresponding category, so all terms that are included in both sets are included in W:

$$W = T \cap E \qquad (10)$$

The degree to which a document belongs to a separate category can be estimated as the sum of the products of the elements of the set W and the corresponding weights Ψ of the terms belonging to the set T. The degree of compliance of a document with a separate category is then determined as follows:

$$NW_d = \sum_{i} W(t_i) \cdot \mathrm{TFSLF}(t_i, e_j), \qquad (11)$$

where NWd is the normalized value of the degree of coincidence between the document term set T and the category term set E.

When a document and a category match, this parameter has its maximum value relative to the other categories. When a document is compared with a foreign category, matches are observed mainly for common words, which can be attributed to several categories and whose significance decreases as the number of these categories increases.

7 Reducing the classification stage time

Applying the method proposed by the author reduces the time spent at the classification stage.

By the additivity property, the time spent on classifying n documents is equal to the sum of the times spent on classifying each document separately.
That is, the total time is obtained by adding up the individual times spent on classifying each document. The following notation is used:
─ A – the total number of documents for classification;
─ S={s1,…,s|A|} – the set of times spent on classifying each document of the analyzed sample;
─ S1 – the set of times spent on classifying each document of the analyzed sample using the base method;
─ S2 – the set of times spent on classifying each document of the analyzed sample using the proposed method.

The total time for classifying all documents using the methods S1 and S2 is

$$S_{\Sigma} = \sum_{i} s_i \qquad (12)$$

By the commutativity and associativity of addition, the elements of the sets S1 and S2 can be divided into two groups. The first group consists of the summands whose total is the same for both sets. The second group consists of the summands whose totals differ. If the values s2i of the second group are less than the corresponding values s1i, then the sum over S2 is less than the sum over S1, as stated in (13):

$$\text{if } s_{1i} > s_{2i} \text{ then } S_{1\Sigma} > S_{2\Sigma} \qquad (13)$$

Thus, the shorter the time spent on classifying each document separately, the shorter the time of the classification as a whole. Since reducing the time spent on classifying an individual document decreases the total classification time, this task is relevant.

8 Proposed method testing

The task of classifying documents into individual categories of class 004 "Computer science and technology. Computing. Data processing" of the UDC classifier was selected for testing the proposed method.
The categories are:
─ 004.0 "Special auxiliary subdivision for computing",
─ 004.2 "Computer architecture",
─ 004.4 "Software",
─ 004.9 "Application-oriented computer-based techniques".

Thirty documents of each category were used as a training sample. The categories of the documents were determined by their authors. Testing was conducted for each category on documents not used for training. The training and testing results are shown in Tables 1 and 2.

Table 1. Term vector size after learning stage

Category  Words in    SLF: terms  SLF: term  CTFSLF: terms  CTFSLF: term  Excluded  Decreasing
          documents   in vector   part       in vector      part          words     part
004.0     148419      22118       14.90%     18450          12.43%        3668      16.58%
004.2     111213      12510       11.25%     8978           8.07%         3532      28.23%
004.4     108077      18752       17.35%     14652          13.56%        4100      21.86%
004.9     104207      17411       16.71%     13473          12.93%        3938      22.62%
Average   –           –           15.05%     –              11.75%        3809      21.53%
result

Table 2. Spent time for testing stage

Category of        Time for   Time for     Decreasing  Decreasing part
document           SLF, s     CTFSLF, s    time, s     of time
004.0              0.03125    0.02500      0.006251    20.00%
004.2              0.018751   0.01250      0.006249    33.33%
004.4              0.021877   0.021875     0.000002    0.01%
004.9              0.028126   0.015627     0.012499    44.44%
Summary /          0.100004   0.075003     0.025001    24.44%
average result

As can be seen from Table 1, with the original SLF method the terms make up on average 15.05% of the total number of words in the documents. With the proposed CTFSLF method this share is 11.75%. On average, 21.53% of the terms are excluded from each category. As a result, the average time for determining the category of a document was reduced by 24.44% (Table 2). This shows the promise of the proposed method.

9 Conclusions

In this article, a modernized mathematical model of the main stages of text document classification that takes into account the characteristics of individual categories is proposed. A mathematical description of the stages of creating the document data set for classifying documents into categories is given.
The principles of reducing the feature space dimension are described, and the choice of the method used for determining term weights is justified.

The purpose of the proposed approach is to identify and exclude terms that are non-informative for a particular category, i.e. to keep the informative terms that characterize the category. This approach reduces the amount of computation performed when searching the general collection for documents belonging to a particular category. As a result, the time needed to classify an individual document is reduced, which in turn reduces the total time for analyzing the entire set of documents.

References

1. Thangaraj M., Sivakami M.: Text classification techniques: A literature review. Interdisciplinary Journal of Information, Knowledge, and Management, 2018, vol. 13, pp. 117-135 (2018)
2. Brindha S., Sukumaran S., Prabha K.: A survey on classification techniques for text mining. 3rd International Conference on Advanced Computing and Communication Systems. IEEE. Coimbatore, India (2016) doi: 10.1109/ICACCS.2016.7586371
3. Daud A., Li J., Zhou L., Muhammad F.: Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of computer science in China, 2010, vol. 4, no. 2, pp. 280–301 (2010)
4. Korde V., Mahender N.: Text classification and classifiers: a survey. International Journal of Artificial Intelligence & Applications (IJAIA), 2012, vol. 3, no. 2, pp. 85–99 (2012)
5. Pankov S. V., Shebanin S. P., Ribakov A. A.: Thematic classification of text. ROOKEE, ROMIP 2010, Kazan', Russia, 2010, pp. 142-147 (2010)
6. Golub T.: The Analysis of text documents classifiers constructing methods. Modern problems of radio engineering, telecommunications, and computer science, 2016, pp. 742-745 (2016)
7. Yang Y., Zhang J., Kisiel B.: A scalability analysis of classifiers in text categorization. ACM SIGIR'03 (2003)
8. Sebastiani F.: Machine learning in automated text categorization. ACM computing surveys (CSUR), 2002, vol. 34, pp. 1-47 (2002)
9. Karpovich S.N.: Multi-valued text documents classification using probabilistic thematic modeling ml-PLSI. SPIIRAS Proceedings. 2016. vol. 4(47), pp. 92-104 (2016) doi: 10.15622/sp.47.5
10. Kuralegov I.: Automatic classification of documents based on latent semantic analysis. 1st International Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections, St-Petersburg, Russia, 1999, pp. 89-96 (1999)
11. Andreev A. M.: Automatic classification of text documents using the neural network algorithms and semantic analysis. Advanced Methods and Technologies, Digital Collections, St-Petersburg, Russia, 2003, pp. 76-86 (2003)
12. Krasnov A., Ilatovskiy A.S., Khomonenko A.D., Arsen'yev V.N.: Evaluation of documents semantic proximity based on latent-semantic analysis with automatic selection of rank values. SPIIRAN proceedings, 2017. no. 5(54), pp. 185-204 (2017)
13. Rehman Abdur, Barbi H., Saeed M.: Feature Extraction for Classification of Text Documents. International Conference on Communications and Information Technology (ICCIT 2012), Hammamet, Tunisia, 2012, pp. 234-239 (2012)
14. Budanitsky A., Hirst G.: Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics. 2006. vol. 32. pp. 13-47 (2006)
15. Bondarchuk D.V.: Vector model of knowledge representation based on semantic proximity of terms. Bulletin of SUSU. Series: Computational Mathematics and Computer Science. 2017. vol. 6 no. 3. pp. 73–83 (2017) doi: 10.14521/cmse170305
16. Tsoumakas G., Katakis I.: Multi-label classification: an overview. International Journal of Data Warehousing & Mining. 2007. vol. 3(3). pp. 1–13 (2007)
17. Rubin T.N., Chambers A., Smyth P., Steyvers M.: Statistical topic models for multilabel document classification. Machine Learning. 2012. vol. 88. no. 1–2. pp. 157–208 (2012)
18. Erpev A.S.: Automatic classification of text documents. Mathematical Structures and Modeling. 2010, vol. 21, pp. 65-81 (2010)
19. Zyuz'kov V. M.: Mathematical logic and theory of algorithms. Tomsk, El Content (2015)
20. Willett P.: The Porter Stemming Algorithm: Then and Now. Program: Electronic Library and Information Systems. 2006. vol. 4, no. 4. pp. 219-223 (2006)
21. Golub T.V., Tyahunova M.YU.: The method of Ukrainian language stemming for the classification of documents based on Porter's algorithm. Scientific papers of the Donetsk National Technical University. Series: Informatics, Cybernetics and Computing 2017, no. 1, pp. 59-63 (2017)
22. Oliynyk YU. O., Katyushchenko D. O.: Analysis of the methods of determining the text documents signs weight. Scientific Review, 2018, 3(46), pp. 112-123 (2018)