<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Modernized Mathematical Model of Text Document Classification</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Tetiana</forename><surname>Golub</surname></persName>
							<email>golub.tv6@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Zaporizhzhia National Technical University</orgName>
								<address>
									<addrLine>Zhukovsky str., 64</addrLine>
									<postCode>69063</postCode>
									<settlement>Zaporizhzhia</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Modernized Mathematical Model of Text Document Classification</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F33F4CF3383E7055987AAC0694A0A8DE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>text document classification</term>
					<term>term vector</term>
					<term>mathematical model</term>
					<term>term weight</term>
					<term>SLF parameter</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A modernized mathematical model of the main stages of text document classification is proposed. It takes into account the characteristics of individual categories. A mathematical description of the stages of creating the document data set and of classifying documents into categories is given. The principles of reducing the dimension of the feature space are described, and the proposed method for determining the term weights is justified. Applying the method proposed in the article reduces the time needed to analyze each document when deciding on its category, which in turn reduces the total time required to analyze the entire document set.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The information amount witch presented in text form increases continuously. Text information is accumulated in all areas of human activity. It is represented from data stored on personal computers to data in the form of Big Data. It covers such areas as business, research institutions, government and financial institutions that use technology intensively. Text information contains statistical data, control commands, reference information and principle laws of different processes. A feature of such information is the lack of its structuredness. It makes more complicated the process of its analysis <ref type="bibr" target="#b0">[1]</ref>.</p><p>Text analytics converts text into numbers. It allows organizing data and helps to identify patterns. Structured data are easier to analyze. Therefore, decisions made on their basis are more quality <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>.</p><p>If it is necessary to find the information in a data large amount, firstly it must be classified <ref type="bibr" target="#b3">[4]</ref>. This process is the consideration subject in the proposed study.</p><p>Text classification refers to one of the computational linguistic tasks. It includes the definition of the text thematic affiliation, the text author, the statement emotional coloring and etc.</p><p>The task of organizing documents is solved to simplify the search for the necessary information. It is one of the most urgent tasks. Text classification is needed to solve this problem. <ref type="bibr" target="#b4">[5]</ref>. It is difficult to solve the classification problem because the data flow is constantly increasing. Therefore, its decision is relevant.</p><p>Many approaches to solving this problem are described in the literature. An overview and comparison of currently relevant methods are presented in accordance with the various stages of this process in <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref>. According to these sources one of the most important points of test classification is key feature selection. The works <ref type="bibr" target="#b8">[9]</ref><ref type="bibr" target="#b9">[10]</ref><ref type="bibr" target="#b10">[11]</ref><ref type="bibr" target="#b11">[12]</ref><ref type="bibr" target="#b12">[13]</ref><ref type="bibr" target="#b13">[14]</ref><ref type="bibr" target="#b14">[15]</ref> were devoted to solving this problem. Various approaches, including statistical, frequency, latent-semantic and others are disclosed there. However, the described methods consider terms within the entire document collection. It is not possible to assess the importance of a separate term for each category separately.</p><p>The classification of text documents is the process of analyzing its content and automatically defining a document into one or several categories <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>. Categories are sets of documents with a common theme. The set of categories is set by the expert or is determined automatically on the basis of the training sample. Automatic classifier is used in the information-analytical system at the stage of processing documents. 
An automatic classifier is a program that determines the subject of documents and assigns them to categories <ref type="bibr" target="#b5">[6]</ref>.</p><p>The inverse problem is also relevant. It consists of document selection from a document set according to the category defined by the user.</p><p>Presented in the literature mathematical models do not consider the term importance for certain categories. The author offers an improved mathematical model that takes into consideration this parameter.</p><p>The proposed in the article model considers this parameter which allows reducing the time for assessing the belonging of a document to certain categories by reducing the size of the term vector of certain categories for the text document classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task formalization</head><p>The classifying document process in a formal form can be described as follows. The text document classification will be understood as the task of automatically defining a document into one or several categories based on its content. The category will be understood as a variety of documents with a general theme. Many categories are set by an expert or determined automatically using a training set. Automatic classifier is used in the information-analytical system at the document processing stage <ref type="bibr" target="#b5">[6]</ref>. Mathematical models of the text document classification process given in <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> are common. The author proposes the improvement of the existing variants of the term weight determining process as a part of the classification process with considering requirements of the task in this article.</p><p>It is proposed next designations to formally describe the process of text documents classifying:</p><formula xml:id="formula_0">─ Т={t 1 ,…t |А| } -document term set; ─ В={b 1 ,…b |B| } -term set; ─ D={d 1 ,…d |D| } -documents set; ─ C={c 1 ,…c |C| } -category set; ─ Е={е 1 ,…е |Е| } -category term set.</formula><p>In the general case, the searching task of documents which corresponding to a particular category is following.</p><p>A set of documents D, from which it is necessary to choose those documents d i , which most likely belong to the category c і determined in advance from the set of categories C exists. The solution of this problem is considered in this article.</p><formula xml:id="formula_1">       i j i j i j c d if c d if c d , 1<label>, 0 ) , ( . (1)</label></formula><p>3</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Document term set creating</head><p>The text document classification is performed using the analysis of the text document terms. A term is an intuitively defined expression of a formal language. It is the formal name of the object <ref type="bibr" target="#b1">[2]</ref>. In this study, the term will be understood as the word obtained after stemming. Stemming is the reduction of a word to a certain normal form using the clipping of its endings and suffixes. The formation of terms is one of the tasks of the preprocessing stage.</p><p>The text is presented in the form of a document term set model for solving the classification problem. Each term has its own weight.</p><p>Text preprocessing is performed when determining whether a document belongs to any category, considering the importance of each term.</p><p>The preprocessing process has the following characteristics suggested by the author:</p><p>─ Т ∈ В -all terms of the document are included in the set of possible terms; ─ Е ∈ В -all terms of the category are included in the set of possible terms;</p><p>The set of elements of the sets T and E forms the set B. The sets T and E constitute the set B. The formation of a document multiset of one group, category, allows its power for each term determines. This parameter estimates the quantitative index of the term occurrence.</p><formula xml:id="formula_2">─ Т M = &lt;n 1 (t 1 ), n 2 (t 2 ), … n /T/ (t /T/ )&gt; -a</formula><p>─ Е M = &lt;n 1 (е 1 ), n 2 (е 2 ), … n /Е/ (е /Е/ )&gt; -a multiset of the set E. It allows collecting the occurrence of the set elements several times.</p><p>Based on the category multiset, it is possible to determine the term indicators by the power of their occurrence. Analyzing this parameter it is possible to determine in how many categories of the collection the considered term occurs at least once. It allows distinguishing terms that are characteristic for all categories and are not characteristic to a particular category. These terms do not contain information for classification. Therefore it is possible to exclude them from the analyzed set. Subsequent text processing is performed based on these characteristics. All words that appear in the documents can be ordered in some way, for example alphabetically. Then, for each document it is possible to write out the entire set of weights matching the dictionary words. If some term is out of the document, then the weight will be zero. That is the vector will be: ) ,..., , (</p><formula xml:id="formula_3">2 1 n i w w w d  , (<label>2</label></formula><formula xml:id="formula_4">)</formula><p>where d i -i-th document vector representation, w i -weight of the i-th document term, n -the total number of different terms in all the documents of the collection -the power of set B <ref type="bibr" target="#b20">[21]</ref> 4</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Term weight identification</head><p>The term weight values of the set E for each category of the incoming set B are determined to assess the occurrence to the document category.</p><p>There are many methods to determine the term weights in the literature are presented. Some of them are: ─ Boolean weight. w = sign(tf), i.e. 1 -if word occurs in the document, 0 -otherwise; ─ w=tf -number of word duplications in the document <ref type="bibr" target="#b2">[3]</ref>; ─ w=tf/df -the coefficient «tf•idf», i.e. the multiplication of the words occur-rence frequency (tf), to the reciprocal value of the words occurrence frequency in all documents of the collection (inverse df). There are many options to define the weight value of the i-th term (wij) in the document dj. One of the simplest options is the following: wij= tf•log10(1/df). When the formulas «tf•idf» are used the problem of common words is solved -when the words with no meaning are of high weight; ─ SLF parameter <ref type="bibr" target="#b2">[3]</ref>; ─ Latent semantic analysis <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>.</p><p>The mentioned methods for determining the term weight values characterize that terms within a single document or within the entire collection as a whole. The importance and significance of a term within a single category is ignored in both cases.</p><p>The SLF parameter <ref type="bibr" target="#b2">[3]</ref>, used to determine the weight values of each term of the set E, compensates for this disadvantage. The parameter SLF is a coefficient that characterizes the assessment of terms with regard to their inclusion in the category. This method considers the importance of each term for a particular category, unlike many other approaches to determining weight values.</p><p>The following parameters were defined to find the SLF parameter:</p><p>1. df tc -the number of documents of category c, in which the term t occurs at least once; 2. N d -the number of documents in the category c; 3. NDF tc -normalized frequency of occurrence of the term t in the category c. It is found as the ratio of the document number of the category c, in which the term t occurs at least once, to the number of documents in the category c. This estimate is local to the category.</p><formula xml:id="formula_5">NDF tc =df tc /N c (3)</formula><p>4. SLF t -logarithmic sum of the term t frequencies:</p><formula xml:id="formula_6">SLF t = log(|C|/ ∑(NDF tc ))<label>(4)</label></formula><p>The SLF t indicator eliminates the imbalance between categories with small and with a large number of documents.</p><p>The SLF parameter for each term within the collection is determined according to the formula ( <ref type="formula" target="#formula_7">5</ref>)</p><formula xml:id="formula_7">TFSLF t = TF t (Е /t/ )  SLF t ,<label>(5)</label></formula><p>where TF t (Е /t/ ) -the frequency of the term belonging to the set B. It is defined as the ratio of the certain term number occurrences to the total number of the document terms. Thus, the importance of the term t i within a separate document d j is estimated <ref type="bibr" target="#b8">[9]</ref>.</p><p>Vector B T , with the weight coefficient values of the set T terms within the entire collection as a whole, will be obtained. In this case, the significance of terms of a particular category is not fully considered. 
It reduces the quality indicators of the classification implementation of texts belonging to similar in meaning and used words topics.</p><p>The SLF parameter considers the term importance for categories within the collection, but does not take into account the importance of terms for each category separately. The following modification of the term weight definition based on the given parameter is proposed by the author for solving this problem.</p><p>The author proposes a sequence of actions for defining non-informative terms for each category individually based on the SLF parameter and statistical data. And further removal of these terms from the term vector of a separate category.</p><p>The sequence of actions to determine the weight values of the terms of the set E of each category for each category term e i : ─ the coefficient tf/df for each category term e i is determined; ─ the value of the weight of each term by categories is determined; ─ uncharacteristic terms for each category are identified and removed.</p><p>The coefficient TF for each category term e i within the collection as a whole is defined as the ratio of the total number of each term within a separate category to the total number of each term within the collection as a whole <ref type="bibr" target="#b5">(6)</ref>.</p><formula xml:id="formula_8">  i ij ij j i fr fr с t TF ) , ( ,<label>(6)</label></formula><p>where</p><formula xml:id="formula_9">0 ≤ i ≤ |Е|, 0 ≤ j ≤ |С|.</formula><p>The importance of the term τ i within a single document d j is evaluated. <ref type="bibr" target="#b13">[14]</ref> The weight of each term by category, taking into account its occurrence in collection categories (the set E, containing the CTFSLF(t i, c j ) values of each category term) is defined as the product of the TF coefficient for each term of individual categories and the SLF parameter:</p><formula xml:id="formula_10">k j i j i SLF с t TF с t CTFSLF * ) , ( ) , (  ,<label>(7)</label></formula><p>where</p><formula xml:id="formula_11">0 ≤ i ≤ |Е|, 0 ≤ j ≤ |С|, 0 ≤ k ≤ |B|.</formula><p>The CTFSLF method for determining the term weights makes it possible to take into consideration the term importance within a particular category.</p></div>
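<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the weighting scheme concrete, the following Python sketch computes NDF (3), SLF (4), the per-category TF (6) as described in the text (the share of a term's occurrences that fall in the given category) and the resulting CTFSLF weight (7). This is the editor's reading of the formulas, with a base-10 logarithm assumed and illustrative data; it is not the author's code.</p><code>
# Editor's sketch of formulas (3), (4), (6), (7); a base-10 logarithm and
# illustrative data are assumed. `collection` maps each category to its
# documents, each document being a list of terms.
import math

collection = {
    "c1": [["soft", "test", "test"], ["soft", "program"]],
    "c2": [["processor", "memory"], ["memory", "architecture", "soft"]],
}

all_terms = {t for docs in collection.values() for d in docs for t in d}

def ndf(term, cat):
    # (3) NDF_tc: share of the documents of the category that contain the term.
    docs = collection[cat]
    return sum(1 for d in docs if term in d) / len(docs)

def slf(term):
    # (4) SLF_t = log(|C| / sum over categories of NDF_tc).
    total = sum(ndf(term, c) for c in collection)
    return math.log10(len(collection) / total) if total > 0 else 0.0

def tf_in_category(term, cat):
    # (6) TF(t_i, c_j): occurrences of the term in the category divided by
    # its occurrences in the whole collection (the textual definition of (6)).
    in_cat = sum(d.count(term) for d in collection[cat])
    in_all = sum(d.count(term) for docs in collection.values() for d in docs)
    return in_cat / in_all if in_all else 0.0

def ctfslf(term, cat):
    # (7) CTFSLF(t_i, c_j) = TF(t_i, c_j) * SLF_t.
    return tf_in_category(term, cat) * slf(term)

for cat in collection:
    print(cat, {t: round(ctfslf(t, cat), 3) for t in sorted(all_terms)})
</code></div>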
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5</head><p>The feature space dimension reducing</p><p>The computational complexity of various classification methods directly depends on the feature space dimension. Therefore, the stage of the used term number reducing, or the stage of reducing the dictionary size of the category |B| to |В'| &lt;&lt; |В|, often is performed for classification problem solving. The purpose of this stage is to reduce the data set dimension. This goal is achieved by removing uninformative for classifying terms. It allows decreasing the data size, to reduce the computing power requirements of the algorithm <ref type="bibr" target="#b3">[4]</ref>.</p><p>In this case, each documents terms vector undergoes the following preliminary processing: ─ elimination of stop words (often used and not carrying a semantic load such as unions) <ref type="bibr" target="#b4">[5]</ref>; ─ performing a morphological analysis of words <ref type="bibr" target="#b4">[5]</ref>; ─ using clustering methods <ref type="bibr" target="#b5">[6]</ref>.</p><p>The following method of terms vector size reducing on the basis of the modernization described previously is proposed by author. It consists of the stage of determining non-characteristic terms for separate categories and the stage of their remove.</p><p>The value of K j is calculated to determine the threshold value. The value of K j is calculated as the inverse value of the number of documents which belongs to the analyzed categories. It is used to remove non-informative terms.</p><p>The term weight describes the property of its belonging to certain category. Terms that are found in all categories are low weight. Terms whose weights are below threshold are excluded.</p><formula xml:id="formula_12">j j D К 1  (<label>8</label></formula><formula xml:id="formula_13">)</formula><p>where 0 ≤ j ≤ |C|. Further, the weight value is compared with the threshold value for each collection term.</p><p>If the value of the term weight is less than the threshold, this value is equaled to zero:</p><formula xml:id="formula_14">       i i j i i i j i k e if c t СTFSLF k e if c e</formula><p>),</p><p>where</p><formula xml:id="formula_16">0 ≤ i ≤ |Е|, 0 ≤ j ≤ |С|.</formula><p>The given analysis allows us to identify and exclude from the analysis such terms with low informativeness, as often encountered in the categories in the document corpus, and which are not informative for classification.</p><p>Thus, the removal of the terms distinguished from the feature space as a result of the analysis will reduce the length of the analyzed set and simplify the classification task.</p><p>The resulting term vector is used to search for documents belonging to a particular category, using the classification process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Document classification into categories</head><p>In general, the task of classifying documents into categories is to find the maxi-mum sum value of the term weighted coefficients that coincide with the terms characterizing a separate category.</p><p>The following parameter is introduced by author to evaluate this indicator. W -a set that indicates the degree which shows this document falls into a separate category. A set is defined as the intersection of the document set T and the corresponding categories set E. All terms that are included in both sets are included in the set W.</p><formula xml:id="formula_17">W = T ∩ Е (<label>10</label></formula><formula xml:id="formula_18">)</formula><p>The estimated value of the belonging degree of document to a separate category can be defined as the sum of the products of the set W elements by the corresponding weight values Ψ for terms belonging to the set T.</p><p>Then the degree of document compliance to a separate category can be determined as follows.</p><formula xml:id="formula_19">   i j i i d e t TFSLF t W NW ) ,<label>( ) ( (11)</label></formula><p>where NW d -the normalized value, the degree of coincidence of the term set belonging to category T to the term set of category E.</p><p>When a document and category match, this parameter will have a maximum value relative to other categories, and when comparing a document with a foreign category, the match will be observed mainly only for common words that can be attributed to several categories and whose significance decreases with increasing number of these categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Classification stage time reducing</head><p>The application of the method proposed by the author will reduce the spent time at the classification stage.</p><p>According to the property of additivity, the resulting value of time spent on the classification of the n documents is equal to the sum of time spent on the classification of each document separately. That is, the resulting value of time is determined by adding the individual time spent on the classification of each document. It is proposed next designations: ─ A -total number of documents for classification; ─ S={s 1 ,…s |А| } -the set containing the time spent on the classification of each document analyzed sample; ─ S1 -the set containing the time spent on the classification of each document analyzed sample using based method; ─ S2 -the set containing the time spent on the classification of each document analyzed sample using proposed method.</p><p>The total time to perform classifications of all documents using the methods S1 and S2 is determined:</p><formula xml:id="formula_20">   i i s S (12)</formula><p>According to the properties of commutativity and associativity for the addition operation, the elements of the sets S1 and S2 can be grouped into two groups. The first group consists of the sum of expenditure time equal in total value for both sets. The second group consists of the summands whose total values differ. If the different total values from the second group of the sample S2 i are less than the different total values of the sample S1 i , then it can be argued that the sum of the sample S2 is less than the sum of the sample S1 that is presented in <ref type="bibr" target="#b12">(13)</ref>.</p><formula xml:id="formula_21">    2 1 2 1 S S than s s if i i (13)</formula><p>Thus, analyzing the obtained results, it can be argued that the shorter the time spent on implementing the classification process of each document separately, the shorter the time value of implementing the classification as a whole. Since reducing the time spent on the classification of a certain document leads to a decrease in the time spent on the classification as a whole. So, this task is relevant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Proposed method testing</head><p>The task of document classifying by individual categories of class 004 " Computer science and technology. Computing. Data processing" of the UDC classifier was selected for testing the proposed method. Certain categories are:</p><p>─ 004.0 " Special auxiliary subdivision for computing", ─ 004.2 " Computer architecture", ─ 004.4 "Software", ─ 004.9 " Application-oriented computer-based techniques".</p><p>30 documents of each category were used as a training sample. Categories of documents were determined by their authors. Testing was conducted on unused for training documents for each category. The training and testing results are shown in tables 1-2.  As can be seen from table 1, the terms average proportion of the words in documents total number according to the original SLF method is 15.05%. The proposed CTFSLF method shows a result of 11, 75%. The average number of terms excluded from each category is 21.53%. As a result, the average time for determining the category of a document was reduced by 24.44% (table <ref type="table" target="#tab_1">2</ref>). This shows the promise of the proposed method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusions</head><p>Thus, this article a modernized mathematical model of the text document classification main stages taking into account the characteristics of certain categories proposed.</p><p>A mathematical description of the document data set creating stages for a document classification into categories is proposed. The principles of reducing the feature space dimension are described and the proposed method using for determining the weights of terms is argued. The purpose of the proposed approach is to identify and exclude non-informative terms for a particular category, i.e. leave inherent informative terms that characterize the category. The using of this approach leads to reduce the amount of computations performed for searching in the general collection of documents belonging to a particular category. As a result, the analysis time to classification of certain document is reduced. This leads to reduce the resulting time for analyzing the entire set of documents.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>multiset of the set T. It allows collecting the occurrence of the set elements several times.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Term vector size after learning stage</figDesc><table><row><cell>Category</cell><cell>Words in docum</cell><cell cols="2">SLF Terms in vector</cell><cell>Term part</cell><cell cols="2">CTFSLF Terms in vector Term part</cell><cell>Ex-cluded words</cell><cell>De-creas-ing part</cell></row><row><cell>004.0</cell><cell>148419</cell><cell cols="5">22118 14,90% 18450 12,43%</cell><cell>3668</cell><cell>16,58%</cell></row><row><cell>004.2</cell><cell>111213</cell><cell cols="3">12510 11,25%</cell><cell>8978</cell><cell>8,07%</cell><cell>3532</cell><cell>28,23%</cell></row><row><cell>004.4</cell><cell>108077</cell><cell cols="5">18752 17,35% 14652 13,56%</cell><cell>4100</cell><cell>21,86%</cell></row><row><cell>004.9</cell><cell>104207</cell><cell cols="5">17411 16,71% 13473 12,93%</cell><cell>3938</cell><cell>22,62%</cell></row><row><cell>Average result</cell><cell>-</cell><cell>-</cell><cell cols="2">15,05%</cell><cell>-</cell><cell>11,75%</cell><cell>3809</cell><cell>21,53%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Spent time for testing stage</figDesc><table><row><cell>Category of</cell><cell>Time for</cell><cell>Time for</cell><cell>Decreasing</cell><cell>Decreasing part of</cell></row><row><cell>document</cell><cell>SLF, s</cell><cell>CTFSLF, s</cell><cell>time, s</cell><cell>time</cell></row><row><cell>004.0</cell><cell>0,03125</cell><cell>0,02500</cell><cell>0,006251</cell><cell>20,00%</cell></row><row><cell>004.2</cell><cell>0,018751</cell><cell>0,01250</cell><cell>0,006249</cell><cell>33,33%</cell></row><row><cell>004.4</cell><cell>0,021877</cell><cell>0,021875</cell><cell>0,000002</cell><cell>0,01%</cell></row><row><cell>004.9</cell><cell>0,028126</cell><cell>0,015627</cell><cell>0,012499</cell><cell>44,44%</cell></row><row><cell>Summary /</cell><cell>0,100004</cell><cell>0,075003</cell><cell>0,025001</cell><cell>24,44%</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Text classification techniques: A literature review</title>
		<author>
			<persName><forename type="first">M</forename><surname>Thangaraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sivakami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Interdisciplinary Journal of Information, Knowledge, and Management</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="117" to="135" />
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey on classification techniques for text mining</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brindha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sukumaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Prabha</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICACCS.2016.7586371</idno>
	</analytic>
	<monogr>
		<title level="m">3rd International Conference on Advanced Computing and Communication Systems</title>
				<meeting><address><addrLine>Coimbatore, Indi</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Knowledge discovery through directed probabilistic topic models: a survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Daud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Muhammad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers of computer science in China</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="280" to="301" />
			<date type="published" when="2010">2010. 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Text classification and classifiers: a survey</title>
		<author>
			<persName><forename type="first">V</forename><surname>Korde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mahender</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Artificial Intelligence &amp; Applications (IJAIA)</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="85" to="99" />
			<date type="published" when="2012">2012. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Pankov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Shebanin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">А</forename><forename type="middle">А</forename><surname>Ribakov</surname></persName>
		</author>
		<title level="m">Thematic classification of text</title>
				<meeting><address><addrLine>Kazan&apos;, Russia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2010">2010. 2010</date>
			<biblScope unit="page" from="142" to="147" />
		</imprint>
	</monogr>
	<note>ROOKEE, ROMIP 2010</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The Analysis of text documents classifiers constructing methods, Modern problems of radio engineering</title>
		<author>
			<persName><forename type="first">T</forename><surname>Golub</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">telecommunications, and computer science</title>
		<imprint>
			<biblScope unit="page" from="742" to="745" />
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A scalability analysis of classifiers in text categorization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kisiel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGIR&apos;</title>
		<imprint>
			<biblScope unit="volume">03</biblScope>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Machine learning in automated text categorization</title>
		<author>
			<persName><forename type="first">F</forename><surname>Sebastiani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM computing surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="1" to="47" />
			<date type="published" when="2002">2002. 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Multi-valued text documents classification using probabilistic thematic modeling ml-PLSI</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Karpovich</surname></persName>
		</author>
		<idno type="DOI">10.15622/sp.47.5</idno>
	</analytic>
	<monogr>
		<title level="j">SPIIRAS Proceedings</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">47</biblScope>
			<biblScope unit="page" from="92" to="104" />
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Automatic classification of documents based on latent semantic analysis</title>
		<author>
			<persName><forename type="first">I</forename><surname>Kuralegov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">1st International Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections</title>
				<meeting><address><addrLine>St-Petersburg, Russia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999. 1999</date>
			<biblScope unit="page" from="89" to="96" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Automatic classification of text documents using the neural network algorithms and semantic analysis</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Andreev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advanced Methods and Technologies, Digital Collections</title>
				<meeting><address><addrLine>St-Petersburg, Russia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003">2003. 2003</date>
			<biblScope unit="page" from="76" to="86" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Evaluation of documents semantic proximity based on latent-semantic analysis with automatic selection of rank values</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krasnov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Ilatovskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Khomonenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">N</forename><surname>Arsen'yev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SPIIRAN proceedings</title>
		<imprint>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="185" to="204" />
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Feature Extraction for Classification of Text Documents</title>
		<author>
			<persName><forename type="first">Rehman</forename><surname>Abdur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barbi</forename><forename type="middle">H</forename><surname>Saeed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Communications and Information Technology (ICCIT 2012)</title>
				<meeting><address><addrLine>Hammamet, Tunisia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012. 2012</date>
			<biblScope unit="page" from="234" to="239" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Evaluating WordNet-based Measures of Lexical Semantic</title>
		<author>
			<persName><forename type="first">A</forename><surname>Budanitsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hirst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Relatedness Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="13" to="47" />
			<date type="published" when="2006">2006. 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Vector model of knowledge representation based on semantic proximity of terms</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">V</forename><surname>Bondarchuk</surname></persName>
		</author>
		<idno type="DOI">10.14521/cmse170305</idno>
	</analytic>
	<monogr>
		<title level="j">Bulletin of SUSU. Series: Computational Mathematics and Computer Science</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="73" to="83" />
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Multi-label classification: an overview</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Katakis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Data Warehousing &amp; Mining</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1" to="13" />
			<date type="published" when="2007">2007. 2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Statistical topic models for multilabel document classification</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Rubin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chambers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Steyvers</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">88</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="157" to="208" />
			<date type="published" when="2012">2012. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Automatic classification of text documents</title>
		<author>
			<persName><forename type="first">А</forename><forename type="middle">S</forename><surname>Erpev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Structures and Modeling</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="65" to="81" />
			<date type="published" when="2010">2010. 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Mathematical logic and theory of algorithms</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M</forename><surname>Zyuz'</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<pubPlace>Tomsk, El Content</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">The Porter Stemming Algorithm: Then and Now</title>
		<author>
			<persName><forename type="first">P</forename><surname>Willett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Program: Electronic Library and Information Systems</title>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="219" to="223" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The method of Ukrainian language stitemming for the classification of documents based on Porter&apos;s algorithm</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">V</forename><surname>Golub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Yu</forename><surname>Tyahunova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific papers of the Donetsk National Technical University. Series: Informatics, Cybernetics and Computing</title>
		<imprint>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="59" to="63" />
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Analysis of the methods of determining the text documents signs weight</title>
		<author>
			<persName><forename type="first">Yu</forename><forename type="middle">O</forename><surname>Oliynyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">O</forename><surname>Katyushchenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Review</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">46</biblScope>
			<biblScope unit="page" from="112" to="123" />
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
