<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The Third Personal Pronoun Anaphora Resolution in Texts from Narrow Subject Domains with Grammatical Errors and Mistypings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daniel</forename><surname>Skatov</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Dictum Ltd</orgName>
								<address>
									<settlement>Nizhny Novgorod</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Sergey</forename><surname>Liverko</surname></persName>
							<email>liverko@dictum.ru</email>
							<affiliation key="aff0">
								<orgName type="institution">Dictum Ltd</orgName>
								<address>
									<settlement>Nizhny Novgorod</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The Third Personal Pronoun Anaphora Resolution in Texts from Narrow Subject Domains with Grammatical Errors and Mistypings</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4352177161348178A443CC287EC29B85</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T01:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Computational linguistics</term>
					<term>natural language processing</term>
					<term>anaphora resolution</term>
					<term>machine learning</term>
					<term>opinion mining</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We discuss third-person pronoun anaphora resolution in texts from Internet sources (forum comments, opinions) within a given subject domain (cars, household appliances, etc.). A concrete solution to the task is offered. High precision with acceptable recall (and vice versa) is demonstrated on opinions about mobile phones.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The problem of third-person pronoun anaphora resolution discussed in this paper consists in replacing pronouns such as "he", "his", "her", "it", … with the nouns (antecedents) these pronouns stand for. Its solution is needed first of all in text-mining applications, such as opinion mining (about goods, people) or fact extraction. Without resolved anaphora, those applications lose recall in their results. The degree of loss depends on the type of processed texts: e.g., in opinions about goods the density of the pronoun "it" (masculine gender in Russian) is 1.5 times higher than in news <ref type="foot" target="#foot_0">1</ref> .</p><p>The known methods of anaphora resolution can be divided into two groups: (1) statistical and (2) syntactic. Methods of class (1) <ref type="bibr" target="#b2">[3]</ref> are based on machine learning and are potentially applicable to texts of significantly different nature. Class (2) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> exploits the syntactic parse tree of a sentence (or semantic graphs derived from it); as a result, the applicability of such methods is limited to relatively «correct» texts (e.g., dossier texts <ref type="bibr" target="#b1">[2]</ref>). This article describes a method that, in a certain sense, combines these two approaches.</p><p>Texts from «real life» are full of typos and specialized slang, with grammar far from correct:</p><p>Ive got a whit ceise and buttons peel gradauly and they becomes gray no cleaning helps or anything likethat..! Weak processor also made upset as well as small memory amount, it works terribly slow. 
<ref type="bibr" target="#b0">(1)</ref> The method of anaphora resolution offered by the authors takes mistypings into account, along with the results of syntactic parsing of text fragments (with mistypings corrected). It is adapted to processing texts from specific subject domains. The method can work with «correct» texts as well as informal ones (such as opinions or notes). To achieve high processing quality for texts from a selected domain, a preliminary adjustment of the method is needed: learning on an unmarked corpus and compiling the operating terminological dictionaries.</p><p>Three modes of the method have been implemented: (A) good precision (70-80%) with high recall (90-95%), (B) approximately equal precision and recall (75-85%), (C) excellent precision (up to 95%) with acceptable recall (40-50%).</p><p>The implementation of the technology is a software module called DictaScope Anaphora, adjusted to processing opinions about mobile phones from Internet sources. Within the bounds of this article, an estimate of the recall-precision ratio for processing such data is carried out. The module is used in a real application for online opinion monitoring. Modes A, B and C were obtained while looking for a solution effective for this application, i.e. one with high precision on possibly intentionally reduced input data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem statement</head><p>Basic statement. For each pronoun pr i , i = 1,…, N from the text, choose the resolving word (antecedent) a i . Remark. In certain cases it is impossible to choose a i , e.g.:</p><p>This mobile phone has a sensor screen. It's very inconvenient. (screen or phone?)</p><p>Resolving such an ambiguity (which can conditionally be called semantic) is a hard task even for a human, as both variants are equally possible. In the current problem statement it is proposed either to choose a concrete antecedent or not to resolve the anaphora at all.</p><p>Advanced statement. It sometimes turns out that an acceptable precision of selecting a sole variant is unreachable. Therefore the following task specification is proposed: for each pronoun pr i , i = 1,…, N form a list of possible resolving variants a i 1 ,…, a i l i sorted in accordance with their ranks (the first one is the best). Then a i can be chosen as a i 1 . In case a requirement of high recall takes place (e.g., for posterior hand processing of results), it is sufficient to ensure high quality of the ranking.</p><p>The variants of resolving antecedents can be supplied with real-valued weights </p><formula xml:id="formula_1">w = w a i k ( ) ∈ 0,1 ( ⎤ ⎦ , i ∈ 1,…, N { } , k ∈ 1,…, l i { } ,</formula><formula xml:id="formula_2">For pronoun pr 1 = «it» the list of variants is formed ( a 1 1 = «*» , a 1 2 = «business» , a 1 3 = «NULL» ) with weights w a 1<label>1</label></formula><p>( ) ≈ 0.65 , w a 1 2 ( ) ≈ 0.237 , w a 1 3 ( ) ≈ 0.1686 (similarly for pr 2 = «it» ). There are also special «*» and «NULL» designations:</p><p>• «*» -«the current object of discourse», the so-called «implicit» antecedent. This is typical for opinions and reviews, i.e. for texts representing direct speech in writing. 
In the example above, the word «phone» (as well as a reference to its concrete model) is not found anywhere before pr 1 = «it» , though the teller means exactly «this phone».</p><p>• «NULL» -a directive «not to resolve the pronoun». If «NULL» is in the first position in the list of variants, the pronoun is left unresolved.</p><p>Thus, there are two cases in the basic problem statement in which the anaphora will not be resolved:</p><p>1. No variants for pronoun resolution are found; 2. «NULL» is the first in the ranked list of variants. It is easy to see that if, in case of semantic ambiguity, the probability of the correct choice of antecedent is less than ½, choosing «NULL» does not reduce precision on average; therefore, in this case the «NULL» variant is justified.</p><p>In example (3) the task in the basic statement is resolved correctly by choosing the first variant for each pronoun. A solution in the basic statement will be evaluated further.</p></div>
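The decision rule of the basic statement can be sketched as follows. This is a minimal illustration, not the authors' implementation: each pronoun carries a ranked list of (variant, weight) pairs, and the anaphora is resolved to the top-ranked variant unless the list is empty or «NULL» is ranked first.

```python
# Sketch (assumed data model) of the basic-statement decision rule.
def resolve(ranked_variants):
    """ranked_variants: list of (antecedent, weight) pairs, best first."""
    if not ranked_variants:
        return None          # case 1: no variants found
    best, _ = ranked_variants[0]
    if best == "NULL":
        return None          # case 2: NULL ranked first -> leave unresolved
    return best

# Weights from the paper's example (3): the implicit antecedent «*» wins.
choice = resolve([("*", 0.652166), ("business", 0.2371), ("NULL", 0.168611)])
```

Under this rule, a semantically ambiguous pronoun whose list starts with «NULL» is simply skipped, which matches the justification given above.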
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Review</head><p>The subject area of this paper is covered in the works of three Russian groups.</p><p>1. Ermakov A.E., RCO. In <ref type="bibr" target="#b1">[2]</ref> empirical regularities of references to persons are shown for texts from Russian mass media; they can be used to build a mechanism for anaphora resolution in text sources of this class (with the help of a natural language syntactic parser). 2. Tolpegin P., Vetrov D., Kropotov D. Article <ref type="bibr" target="#b2">[3]</ref> describes this group's experience in resolving third-person pronoun anaphora in news by machine learning methods. The approach is typical for this type of solver; the precision shown equals 62% on a control collection. 3. Okatiev V., Erechinskaya T., Skatov D. The report <ref type="bibr" target="#b0">[1]</ref> shows how pronoun anaphoras of different types can be resolved by analysis of syntactic parse trees. This approach is well applicable to texts in which most sentences allow building correct syntax trees.</p><p>The specificity of this article -processing texts from narrow subject domains with mistypings and slang -is not touched upon in the works listed above.</p><p>The question discussed is more widely represented in foreign scientific works:</p><p>• among English-language works, the patented system <ref type="bibr" target="#b10">[11]</ref> and the work <ref type="bibr" target="#b7">[8]</ref> (which demonstrates values of the basic indicators at a level of about 80% using a probabilistic model) should be mentioned first; • the authors of <ref type="bibr" target="#b8">[9]</ref> use the maximum entropy method to resolve third-person pronoun anaphora in Chinese, with an F-measure of about 70%; • <ref type="bibr" target="#b9">[10]</ref> describes an application of machine learning to personal pronoun anaphora resolution in Turkish with recall and precision at about 60-70%.</p><p>The overall impression of these 
works is the following: a competent combination of analysis methods and rather full vocabulary data results in recall and precision of not less than 70%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Solution</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Lists of variants and attributes</head><p>After tokenization (during which the lists of grammar values of the tokens are supplemented with mistypings taken into consideration) and division of the text into "conditional" sentences, all the pronouns in the text are looked through from left to right. A concrete pronoun pr i is fixed, i = 1,…, N , and the list var pr i ( ) of possible antecedents is formed:</p><p>1. from all the words located within the window of sentences to the left of pr i , nouns in concordance with pr i by gender and number are selected; 2. from the same words, pronouns which are in concordance with pr i by gender and number are selected, and the list is supplemented with the nouns that resolve these pronouns.</p><p>Possible antecedents can also be found to the right of pr i ; however, this happens no more than ⅓ as often as to the left. Therefore, possible variants located to the right are ignored by the method.</p><p>The proposed scheme has a chain character: pronouns to the left of the given pr i which are close to it and already resolved add to var pr i ( ) antecedents located to the left of the boundary of the window µ = 2 . The scheme presents a certain compromise: the list can be imprecise but remains quite compact. Advancing the window border up to 5 sentences with the chain scheme disabled led to a noticeable decrease in precision during the experiments, so the decision was made to reject the varying left border.</p><p>For the further ranking of the lists, a vector of attributes A a ( ) is calculated for each a ∈var pr i ( ) . 
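The two-step candidate collection above, including its chain character, can be sketched roughly as follows. The token data model and field names are assumptions for illustration only.

```python
# Sketch of forming var(pr_i): nouns from a two-sentence window to the
# left that agree with the pronoun in gender and number, plus antecedents
# inherited from already-resolved pronouns in that window (chain scheme).
MU = 2  # window size in sentences, as in the paper

def candidates(sentences, pron_sent, pron_idx, gender, number, resolved):
    """sentences: list of sentences; each token is a dict with keys
    'text', 'pos', 'gender', 'number'. resolved maps (sent, idx) of an
    earlier pronoun to the antecedent already chosen for it."""
    out = []
    for s in range(max(0, pron_sent - MU), pron_sent + 1):
        for i, tok in enumerate(sentences[s]):
            if s == pron_sent and i >= pron_idx:
                break  # only words strictly to the left of the pronoun
            agrees = tok["gender"] == gender and tok["number"] == number
            if tok["pos"] == "NOUN" and agrees:
                out.append(tok["text"])
            elif tok["pos"] == "PRON" and agrees and (s, i) in resolved:
                out.append(resolved[(s, i)])  # chain: inherit antecedent
    return out
```

A resolved pronoun thus carries its antecedent forward even when that noun lies beyond the µ = 2 boundary of the current pronoun, which is exactly the compromise described in the text.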
Let us mention the following attributes among the operational ones:</p><formula xml:id="formula_3">• IsVoc ∈ 0,1</formula><p>{ } -the belonging of a to a terminological dictionary;</p><p>• Freq ∈N ∪ 0 { } -the number of mentions of the given word (in any form) to the left of pr i ; • Dist ∈N -the distance between the pronoun and the position of a inside the text (measured in words);</p><formula xml:id="formula_4">• IsVerb ∈ 0,1</formula><p>{ } -the presence of a direct father in the form of a verb in the syntax tree of a fragment containing a ;</p><p>• NumNodes ∈N ∪ 0 { } -the number of nodes in the bush subordinate to a .</p><p>The last two attributes have been introduced based on an exploration of the correlation between numeric properties of a tree and the resolving antecedents. For example, greater NumNodes values often corresponded to proper variants of resolution. These attribute values are set to zero in case the tree was not built.</p><p>The distance is measured in words for a number of reasons: (a) obtaining a valid syntactic unit (clause, noun phrase) was not possible (at that moment) due to the laboriousness of adapting the syntactic parser to the special features of the input texts (e.g. the absence of punctuation); (b) a paragraph is too large a unit of measure -the majority of opinions consist of one paragraph; (c) windows are measured in sentences, and a two-sentence range is considered sufficient for the research.</p><p>The IsVoc attribute implements the following idea: taking a subject domain's specificity into account allows obtaining a higher quality of analysis. In fact, IsVoc allows raising the priority of variants related to the subject domain of the text -they are of most interest (though not always).</p></div>
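Computing the attribute vector A(a) for one candidate can be sketched as below. The `tree` dictionary and its keys are assumed names, since the paper does not describe the parser's output format.

```python
# Hedged sketch of the attribute vector A(a) from section 4.1.
# `vocab` is the terminological dictionary; `tree`, if present, exposes
# the candidate's parent part of speech and subtree ("bush") size.
def attributes(cand_pos, pron_pos, words, vocab, tree=None):
    word = words[cand_pos]
    return {
        "IsVoc": 1 if word in vocab else 0,
        "Freq": words[:pron_pos].count(word),  # mentions left of the pronoun
        "Dist": pron_pos - cand_pos,           # distance in words
        "IsVerb": 1 if tree and tree.get("parent_pos") == "VERB" else 0,
        "NumNodes": tree.get("subtree_size", 0) if tree else 0,
    }
```

When no tree was built for the fragment, `IsVerb` and `NumNodes` fall back to zero, as stated above.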
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">The test corpus</head><p>To evaluate the methods, a corpus of 3 MB was built from opinions about mobile phones from sources like <ref type="bibr" target="#b11">[13,</ref><ref type="bibr">14,</ref><ref type="bibr">15]</ref>. Due to the specificity of the application, the corpus was additionally divided into three groups: positive, negative and neutral opinions, each of 0.8-1.2 MB. As a next step it was marked up with the resolved anaphoras according to the following scheme:</p><p>• if the correct antecedent could be chosen directly from the text, its occurrence closest to the left of the pronoun being resolved was marked in a special way; • in case of semantic ambiguity the pronoun was marked with the «NULL» variant;</p><p>• the resolving word was written next to the pronoun in the corresponding case.</p><p>The statistical characteristics of the corpus were estimated.</p><p>• The corpus contains 8.3 thousand opinions comprising 37 thousand unique word forms (including mistypings). • The most frequent opinion length varies from 15 to 35 words; the average opinion length is 54 words; the bulk of the opinions contain 10 to 90 words; opinions of more than 100 words are rare. The length scatter is from 2 to 340 words (Fig. <ref type="figure">1</ref>).</p><p>• Opinions consisting of one sentence are the most frequent; the average opinion length is 4 sentences. The majority of opinions include 1 to 16 sentences; lengths of more than 24 sentences are very rare (Fig. <ref type="figure">2</ref>). • The corpus contains about 6.2 thousand third-person pronouns, including 4.5 thousand of masculine gender, 0.8 thousand of feminine gender and 0.7 thousand plurals. The reason for the great number of masculine pronouns is the subject of the opinions (mobile phones). • Less than 50% of the opinions contain none of the pronouns under research.</p><p>35% contain only one pronoun, about 10% -two of them. 
The maximum is 9 pronouns per opinion (Fig. <ref type="figure">3</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Lexicographical analysis method</head><p>At the initial stage of the study a heuristic method for ranking the options was implemented:</p><p>• a system of priorities is formed on the set of attributes listed in subparagraph 4.1; • attribute values of each option are ordered according to the priorities; • options are sorted lexicographically according to their tuples of attributes.</p><p>The method resolves all the anaphoras for which it has found variants to the left, with a precision of not more than 60%. Experiments with introducing new attributes and varying their priorities were not effective. This led the authors to the idea of filtering the input data in order to achieve a higher precision.</p></div>
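The lexicographic ranking can be sketched as follows. The priority order used here (IsVoc first, then Freq, then closeness) is illustrative; the paper does not publish the exact priority system.

```python
# Sketch of the heuristic lexicographic ranking from section 4.3:
# candidates are compared by tuples of attribute values ordered by an
# assumed priority list; larger IsVoc/Freq rank higher, smaller Dist does.
PRIORITY = ["IsVoc", "Freq", "Dist"]

def rank(variants):
    """variants: list of (word, attr_dict); returns best candidates first."""
    def key(item):
        _, attrs = item
        # negate "bigger is better" attributes so ascending sort works
        return tuple(attrs[p] if p == "Dist" else -attrs[p] for p in PRIORITY)
    return sorted(variants, key=key)
```

Because tuple comparison is lexicographic, a vocabulary hit dominates any frequency or distance difference, which is exactly the over-strong top-priority effect criticised later in section 5.6.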
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">SVM-method based on machine learning</head><p>Let there be a general set of objects Ω , divided into previously unknown classes, and a sample set O ⊂ Ω , for each element of which its class is known. The task of classification is to answer the question "which class does each object from Ω belong to", knowing only the sample set (or to give the probabilities of belonging).</p><p>Let us fix the list var pr i ( ) for one specific pronoun pr i . In this case</p><formula xml:id="formula_5">O i = A a ( ) | a ∈var pr i ( ) { } , i = 1,…, N</formula><p>, and two classes are of interest -"are antecedents" and its inverse. Then the distance to the first class can be taken as w a ( ) . Now we need to generalize the approach to N pronouns. Each set O i represents an independent group consisting of two classes -"is the antecedent for pr i " and its inverse -so for the whole training set the number of classes grows with N. It is impossible to use such a classification in practice with a different number Q ≠ N of pronouns. In order to get exactly two classes for any number of pronouns, it is necessary to construct an acceptable combination of these groups. For this purpose the authors propose adding attributes characterizing the group to each vector ω i ∈O i . Thus, within the same group all its members are additionally provided with the same set of numbers describing the group. The centroid of the group can be taken as these numbers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>After expanding of the group members a sample set</head><formula xml:id="formula_6">O = O i i=1 N </formula><p>with the corresponding universe and a fuzzy classifier K ω ( ) ∈ 0,1 ( ⎤ ⎦ , which determines the distance between ω and the class "are antecedents", are constructed. K is constructed in the form of a so-called probabilistic decision function as described in <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>, based on a classical C-SVM with a nonlinear kernel <ref type="bibr" target="#b6">[7]</ref>. Selection of the kernel and the constants for the SVM was performed by minimizing overtraining on a parameter grid while verifying the recall-precision ratio on the training and control samples. In the end, a polynomial kernel of small degree was chosen.</p><p>Centroids raised the precision of the SVM-method from 70% to 80% (mode A).</p></div>
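The group-combination trick can be sketched independently of the SVM itself: each candidate's attribute vector is extended with the centroid of its group (all candidates of the same pronoun), so one two-class classifier can be trained over any number of pronouns. The function name is an assumption; the classifier is omitted.

```python
# Sketch of centroid augmentation from section 4.4: every member of a
# group receives the same extra numbers -- the group centroid -- so that
# groups of different pronouns become comparable in one feature space.
def augment_group(group):
    """group: list of equal-length numeric feature vectors for one pronoun.
    Returns the vectors with the group centroid appended to each."""
    n, dim = len(group), len(group[0])
    centroid = [sum(vec[k] for vec in group) / n for k in range(dim)]
    return [vec + centroid for vec in group]
```

The augmented vectors from all pronouns can then be pooled into the single sample set O on which the probabilistic C-SVM is trained.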
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Recall-precision regulator</head><p>To reach a precision of 90%, linear discriminant analysis <ref type="bibr" target="#b3">[4]</ref> was used: its aim is to find a direction, in projection onto which the classes are most discernible. With the help of the discriminant, pronouns which should be left unresolved (for the purpose of raising precision) were identified. The combination of this filtration with the SVM-method allowed reaching the desired result (mode C). Along the way, mode B was derived, in which the basic rates are balanced in the region of 75-85%.</p></div>
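The discriminant direction can be illustrated with a small Fisher-style computation. This is a toy two-dimensional version under assumed names, not the authors' regulator: the projection w = S_w⁻¹(m₁ − m₀) separates the classes, and a threshold on the projection could mark pronouns to leave unresolved.

```python
# Toy sketch of a Fisher linear discriminant for 2-D feature vectors,
# inverting the pooled 2x2 within-class scatter matrix directly.
def fisher_direction(class0, class1):
    def mean(vs):
        n = len(vs)
        return [sum(v[0] for v in vs) / n, sum(v[1] for v in vs) / n]
    m0, m1 = mean(class0), mean(class1)
    s = [[0.0, 0.0], [0.0, 0.0]]          # pooled scatter S_w
    for vs, m in ((class0, m0), (class1, m1)):
        for v in vs:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = S_w^{-1} (m1 - m0)
    return [(s[1][1] * dm[0] - s[0][1] * dm[1]) / det,
            (-s[1][0] * dm[0] + s[0][0] * dm[1]) / det]
```

Projecting a pronoun's features onto w and comparing with a cut-off is one plausible way to realise the "leave unresolved" filter; the paper does not detail the threshold choice.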
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Analysis of the results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Quality requirements and evaluation</head><p>Processing of the input set containing third-person pronoun anaphoras is carried out in two steps.</p><p>1. Filtration of anaphoras. From the total number of objects, those for which the algorithm (1) failed to form the set of variants, (2) put «NULL» in the first place in the list of variants, or (3) eliminated from examination due to the regulator's work are deleted. As a result, anaphoras are left for each of which the algorithm can choose an antecedent (not necessarily the correct one). If the set of anaphoras resolved correctly is considered relevant, the recall of this step is the share of anaphoras retained, while the precision is equal to 1, as all chosen objects are included in the relevant ones. 2. Resolution of the remaining anaphoras. In this step the set of anaphoras resolved correctly is considered relevant. The algorithm attempts to resolve them, succeeding in a certain share of cases. Due to the coincidence between the volume of relevant objects and that of those being resolved, the precision and recall of this step are equal.</p><p>Two of the four rates mentioned above (precision and recall for each step) are informative:</p><p>• recall is the portion of pronouns for which the algorithm succeeded in finding an antecedent; • precision is the percentage of this portion containing correctly identified antecedents.</p><p>In the authors' opinion, this approach to evaluation conforms to the quality requirements. In addition, the estimates do not depend on the mechanism of anaphora resolution (including the size of the variant lists).</p></div>
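The two informative rates can be computed as in the sketch below, assuming each pronoun is recorded as the algorithm's output (None if left unresolved) plus a correctness flag; the representation is an assumption for illustration.

```python
# Sketch of the two rates from section 5.1: recall is the share of
# pronouns for which an antecedent was chosen at all, precision the
# share of those choices that are correct.
def recall_precision(results):
    """results: list of (chosen_or_None, is_correct) pairs."""
    attempted = [ok for chosen, ok in results if chosen is not None]
    recall = len(attempted) / len(results) if results else 0.0
    precision = sum(attempted) / len(attempted) if attempted else 1.0
    return recall, precision
```

Note that an unresolved pronoun lowers recall but never precision, which is why the «NULL» directive and the regulator trade recall for precision.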
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">The quality of SVM-method and sensitivity to the sample volume</head><p>Opinions containing at least one of the pronouns under research (4 thousand altogether) were selected from the corpus. To evaluate the SVM-method's sensitivity to the sample volume, this set of opinions underwent q-fold cross-validation.</p><p>Verification was carried out for q = 1,…,300 : q = 1 means verification of the model on the whole 4 thousand opinions, q = 300 on samples of 13 opinions. For each q, the mean of recall and precision over the iterations was calculated, as well as their minimum and maximum, for diagrams reflecting the dependency between quality and the volume of input data.</p><p>Measuring was done for modes A, B and C (Fig. <ref type="figure" target="#fig_1">4</ref>; the abscissa corresponds to q). It can be seen that all the means are stable even for small-sized samples. </p></div>
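The q-fold splitting behind this sensitivity check can be sketched as follows; fold sizes of about 4000/q opinions follow from the description above, with remainders spread over the first folds (an implementation assumption).

```python
# Sketch of the q-fold split from section 5.2: the selected opinions are
# cut into q nearly equal folds; recall/precision are then averaged over
# the per-fold verification runs.
def folds(items, q):
    size, rem = divmod(len(items), q)
    out, start = [], 0
    for i in range(q):
        end = start + size + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out
```

With 4000 opinions, q = 300 indeed yields folds of 13-14 opinions, matching the smallest samples mentioned in the text.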
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">The results of ROC-analysis of SVM-method</head><p>Fig. <ref type="figure" target="#fig_2">5</ref> illustrates ROC-curves for the SVM-method in modes A, B and C. The area under curve A is 0.74 and under curve B 0.76, which is considered "good" according to the expert scale. The area under curve C is 0.81, and this mode is considered "very good". </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">The SVM-method independence of the sentiment of the corpus</head><p>It was additionally verified empirically that the SVM-method is independent of the sentiment of the processed texts, since it cannot be ruled out that anaphoras in negative opinions differ from those in positive ones.</p><p>The "negative" corpus was used as a training set, the "positive" one as a control set (Table <ref type="table" target="#tab_2">2</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Significance of the factors</head><p>Discriminant analysis provides an estimate of the contribution of the attributes to the common decision: the judgment can be made from the coefficients of the corresponding attributes in the linear discriminant and the ranges of the attribute values. It is also possible to estimate how much influence the components of the centroid bring to the solution.</p><p>According to Table <ref type="table" target="#tab_3">3</ref>, the frequency is two times more important than the distance, and the presence of a verb father is more important than the number of nodes in the bush (even after correcting for the wide range of the latter, sometimes up to 10-15 nodes). The picture according to the centroid is consistent on the whole, except for two attributes whose contributions can be estimated as approximately equal. Compiling vocabularies for IsVoc is rather laborious. The authors have discovered that the main coefficients in modes A and C (recall and precision respectively) drop from about 90% to 70% when this attribute is not used; in mode B both coefficients drop by ~10%. It can be stated that it is precisely the IsVoc attribute that allows achieving a precision of 90% and higher.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Evaluation of lexicographical method</head><p>The advantage of this method is that no marked-up corpus is needed for its initialization. The practical use of the SVM-method has shown that a trained classifier copes with texts from domains different from that of the training set, with the rates declining by several percent (with the exception of the IsVoc attribute -new vocabularies are needed). The main flaw of the lexicographical method is the excessively strong influence of the attribute with the highest priority. E.g., using the IsVoc attribute often results in incorrectly choosing a vocabulary word, while not using it results in choosing the word closest to the left.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This paper offers a solution to the problem of third-person pronoun anaphora resolution. A software complex called DictaScope Anaphora was implemented based on the models and methods discussed in this paper. It has the following characteristics:</p><p>• there are three modes, which make it possible to achieve both recall and precision of 80%, or to give preference to one of them and reach 95%; • it is possible to take mistypings and grammatical errors into account, which is important for processing texts from online sources (such as reviews); • in this case an adjustment of the parameters to a specific subject area is needed.</p><p>The features of the internal structure of the system and its mathematical foundation are described; a detailed evaluation of the test data and the quality of its processing is carried out.</p><p>Among the shortcomings, a drop in accuracy on masculine pronouns should be noted. It is caused by the choice of the subject of the opinions (a mobile phone): it is mentioned very often (including implicit mentions), and the main part of the malfunctions consists in choosing an implicit antecedent. In the authors' opinion, the problem can be solved by taking into consideration new attributes connected with the results of syntactic parsing.</p><p>The development plans include applying the system to other domains and improving the recall-precision ratio by introducing new attributes and refining the adjustment of the coefficients.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .Fig. 2 .Fig. 3 .</head><label>123</label><figDesc>Fig. 1. Distribution of opinions lengths in words</figDesc><graphic coords="6,186.00,517.92,223.44,95.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Results for SVM-method cross-validation in A,B,C modes</figDesc><graphic coords="10,124.80,147.36,345.84,257.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. ROC-curves for SVM-method in A, B, C modes</figDesc><graphic coords="11,164.64,147.12,266.16,255.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Example (3)</head><label></label><figDesc></figDesc><table><row><cell>Traits. Let's resort to an example to make the task statement clear:</cell></row><row><cell>bought it for business, very useful because [it] {* = 0.652166, business = 0.2371, NULL = 0.168611} supports two sim cards. Nice, big display, no dead spaces found on [it] {display = 0.466248, * = 0.284525, NULL = 0.0777368, business = 0.0101848} (3)</cell></row><row><cell>The weights correspond to each variant's confidence.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Averaged quality measures for SVM-method</figDesc><table><row><cell></cell><cell>Recall</cell><cell>Precision</cell></row><row><cell>A</cell><cell>97.3%</cell><cell>74.2%</cell></row><row><cell>B</cell><cell>75.4%</cell><cell>80.7%</cell></row><row><cell>C</cell><cell>45.6%</cell><cell>90.3%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Check for SVM-method independency from sentiment</figDesc><table><row><cell>(RECALL %, PRECISION %)</cell><cell>(A)</cell><cell>(B)</cell><cell>(C)</cell></row><row><cell>Negative (training)</cell><cell>(95.1, 80.2)</cell><cell>(77.8, 86.7)</cell><cell>(43.1, 93.2)</cell></row><row><cell>Positive (control)</cell><cell>(96.3, 78.7)</cell><cell>(79.1, 83.4)</cell><cell>(56.2, 89.9)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Valuing the attributes significance according to the results of discriminant analysis</figDesc><table><row><cell>Attribute</cell><cell>Coefficient in linear discriminant</cell><cell>Corresponding coefficient near the component of the centroid</cell></row><row><cell></cell><cell>-2.9</cell><cell>18.8</cell></row><row><cell></cell><cell>9.3</cell><cell>1.1</cell></row><row><cell></cell><cell>-7</cell><cell>35.8</cell></row><row><cell></cell><cell>-0.5</cell><cell>18.9</cell></row><row><cell></cell><cell>-21.5</cell><cell>-1.6</cell></row><row><cell></cell><cell>-10.6</cell><cell>0.1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 .</head><label>4</label><figDesc>Estimation of the lexicographical method quality</figDesc><table><row><cell>With IsVoc</cell><cell>Without IsVoc</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">A random sample of news from [12] (anaphora density 0.34 per 1 K) and a sample of opinions about mobile phones from sources such as<ref type="bibr" target="#b11">[13]</ref> (anaphora density 0.53 per 1 K) were used to perform the measurements, each one of 1 Mb.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Development of a pilot version of syntactical analyzer for the Russian Language</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Okatev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Gergel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">E</forename><surname>Alexeev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">A</forename><surname>Talanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>Barkalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Skatov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Erekhinskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Kotov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Titova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VNTIC Inventory Number 02200803750 // VNTIC</title>
				<meeting><address><addrLine>Moscow</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
	<note>Report on research implementation</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Referencing the designations of persons and organizations in Russian media texts: empirical laws for computer analysis</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Ermakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference &quot;Dialog&apos;2005</title>
				<meeting>the International Conference &quot;Dialog&apos;2005</meeting>
		<imprint>
			<publisher>Computational Linguistics and Intelligent Technologies</publisher>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Algorithm for automated third-person pronoun resolution based on machine learning methods</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">V</forename><surname>Tolpegin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Wind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Kropotov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference &quot;Dialog&apos;2006</title>
				<meeting>International Conference &quot;Dialog&apos;2006<address><addrLine>Moscow</addrLine></address></meeting>
		<imprint>
			<publisher>Izd RGGU</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="504" to="507" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Factor, discriminant and cluster analysis</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Oldenderfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Blashfield</surname></persName>
		</author>
		<editor>
			<persName><forename type="first">Igor</forename><surname>Enyukov</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="1989">1989</date>
			<publisher>Finance and Statistics</publisher>
			<pubPlace>Moscow</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods</title>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">C</forename><surname>Platt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Large Margin Classifiers</title>
				<editor>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Peter</forename><surname>Bartlett</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Bernhard</forename><surname>Schölkopf</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Dale</forename><surname>Schuurmans</surname></persName>
		</editor>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A note on Platt&apos;s probabilistic outputs for support vector machines</title>
		<author>
			<persName><forename type="first">Hsuan-Tien</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chih-Jen</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruby</forename><forename type="middle">C</forename><surname>Weng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="2007-10">October 2007</date>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page" from="267" to="276" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Statistical Learning Theory</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vapnik</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>Wiley</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A statistical approach to anaphora resolution</title>
		<author>
			<persName><forename type="first">Niyu</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth Workshop on Very Large Corpora. COLING-ACL&apos;98</title>
				<meeting>the Sixth Workshop on Very Large Corpora. COLING-ACL&apos;98<address><addrLine>Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The third personal pronoun anaphora resolution in the paroxysmal text of the Chinese web</title>
		<author>
			<persName><forename type="first">Ning</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun-Feng</forename><surname>Shi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Coll. of Appl. Sci.</title>
		<imprint>
			<publisher>Taiyuan Sci. &amp; Technol. Univ</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A machine learning approach to personal pronoun resolution in Turkish</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yıldırım</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kılıçaslan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 20th International FLAIRS Conference, FLAIRS-20</title>
				<meeting>20th International FLAIRS Conference, FLAIRS-20<address><addrLine>Key West, Florida</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Anaphora analyzing apparatus provided with antecedent candidate rejecting means using candidate rejecting decision tree</title>
		<author>
			<persName><forename type="first">P</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kazuhide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Eiichiro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Patent</title>
		<imprint>
			<biblScope unit="volume">6343266</biblScope>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<ptr target="http://market.yandex.ru" />
		<title level="m">Market: search, selection and purchase of goods</title>
				<imprint>
			<publisher>Yandex</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
