The Third Personal Pronoun Anaphora Resolution in Texts from Narrow Subject Domains with Grammatical Errors and Mistypings

Daniel Skatov and Sergey Liverko
Dictum Ltd, Nizhny Novgorod, Russia
{ds,liverko}@dictum.ru

Abstract. Third personal pronoun anaphora resolution is discussed for texts from Internet sources (forum comments, opinions) belonging to a given subject domain (cars, household appliances, etc.). A concrete solution to the task is offered. High precision with acceptable recall (and vice versa) is demonstrated on opinions about mobile phones.

Keywords. Computational linguistics, natural language processing, anaphora resolution, machine learning, opinion mining.

1 Introduction

The problem of third personal pronoun anaphora resolution discussed in this paper consists in replacing pronouns such as "he", "his", "her", "it", … with the nouns (antecedents) that these pronouns stand for. Its solution is needed first of all in text mining applications, such as opinion mining (about goods, people) or fact extraction. Without resolved anaphoras such applications lose recall in their results. The degree of loss depends on the type of texts processed: e.g., in opinions about goods the density of the pronoun "it" (masculine gender in Russian) is 1.5 times higher than in news1.

The known methods of anaphora resolution can be divided into two groups: (1) statistical and (2) syntactical. Methods from class (1) [3] are based on the results of machine learning and are potentially applicable to texts of significantly different nature. Class (2) [1,2] exploits the syntactic parse tree of a sentence (or semantic graphs as its derivatives), so the applicability of such methods is limited to relatively «correct» texts (e.g., dossier texts [2]). This article describes a method that, in a certain sense, combines these two approaches.
1 A random sample of news from [12] (anaphora density 0.34 per 1 K) and a sample of opinions about mobile phones from sources such as [13] (anaphora density 0.53 per 1 K), each of 1 Mb, were used to perform the measurements.

Texts from «real life» are full of typos and specialized slang, and their grammar is far from correct:

Ive got a whit ceise and buttons peel gradauly and they becomes gray no cleaning helps or anything likethat..! Weak processor also made upset as well as small memory amount, it works terribly slow. (1)

The method of anaphora resolution offered by the authors takes into account mistypings and the results of syntactic parsing of text fragments (with the mistypings corrected). It is adapted to process texts from specific subject domains. The method can work with «correct» texts as well as with informal ones (such as opinions or notes). To achieve high processing quality for texts from a selected domain, a preliminary adjustment of the method is needed: learning on an unmarked corpus and composing the operating terminological dictionaries.

Three modes of the method have been implemented: (A) good precision (70-80%) with high recall (90-95%); (B) approximately equally good precision and recall (75-85%); (C) excellent precision (up to 95%) with acceptable recall (40-50%).

The implementation of the technology is a software module called DictaScope Anaphora. It is adjusted to processing opinions about mobile phones from Internet sources. Within the bounds of this article, an estimation of the recall-precision ratio for processing such data is carried out. The module is used in a real application for online opinion monitoring. Modes A, B and C were obtained while looking for a solution effective for this application, i.e. one with high precision on possibly intentionally reduced input data.

2 Problem statement

Basic statement.
For each pronoun pr_i, i = 1,…,N, from the text, choose the resolving antecedent a_i.

Remark. In certain cases it is impossible to choose a_i, e.g.:

This mobile phone has a sensor screen. It's very inconvenient. (screen or phone?) (2)

Resolving such an ambiguity (which can conditionally be called semantic) is a hard task even for a human, as both variants are equally probable. In the current problem statement it is proposed either to choose a concrete antecedent or not to resolve the anaphora.

Advanced statement. It sometimes turns out that an acceptable precision of selecting a sole variant is unreachable. Therefore the following task specification is proposed: for each pronoun pr_i, i = 1,…,N, form a list of possible resolving variants (a_i1, …, a_il_i) sorted in accordance with their ranks (the first one is the best). Then a_i1 can be chosen as a_i. In case a requirement of high recall takes place (e.g., for posterior hand processing of the results), it is sufficient to ensure high quality of the ranking. The variants of resolving antecedents can be supplied with real-valued weights w(a_ik) ∈ (0,1], i ∈ {1,…,N}, k ∈ {1,…,l_i}, which correspond to each variant's confidence.

Traits. Let us resort to an example to make the task statement clear:

bought it for business, very useful because [it] {* = 0.652166, business = 0.2371, NULL = 0.168611} supports two sim cards. Nice, big display, no dead spaces found on [it] {display = 0.466248, * = 0.284525, NULL = 0.0777368, business = 0.0101848} (3)

For the pronoun pr_1 = «it» the list of variants (a_11 = «*», a_12 = «business», a_13 = «NULL») is formed, with weights w(a_11) ≈ 0.65, w(a_12) ≈ 0.237, w(a_13) ≈ 0.1686 (similarly for pr_2 = «it»). There are also the special «*» and «NULL» designations:

• «*» denotes «the current object of discourse», a so-called «implicit» antecedent. This is typical for opinions and reviews, i.e. for texts representing direct speech in writing.
In the example above the word «phone» (as well as a reference to its concrete model) is not found anywhere before pr_1 = «it», though the teller means exactly «this phone».

• «NULL» is a directive «not to resolve the pronoun». If «NULL» is at the first position in the list of variants, the pronoun is left unresolved.

Thus, there are two cases in the basic problem statement in which the anaphora will not be resolved:

1. no variants for pronoun resolution are found;
2. «NULL» is the first in the ranked list of variants.

It is easy to see that if, in case of semantic ambiguity, the probability of the correct choice of an antecedent is less than ½, choosing the «NULL» variant does not decrease the precision on average; therefore, in this case the choice of «NULL» is justified. In example (3) the task in the basic statement is resolved correctly by choosing the first variant for each pronoun. It is the solution in the basic statement that will be estimated further on.

3 Review

The subject area of this paper is covered in the works of three Russian groups.

1. Ermakov A.E., RCO. In [2], empirical regularities of referencing persons are shown for texts from Russian mass media; they can be used to build a mechanism for anaphora resolution in text sources of this class (with the help of a natural language syntactic parser).

2. Tolpegin P., Vetrov D., Kropotov D. Article [3] describes the experience of this group in resolving third personal pronoun anaphora in news by machine learning methods. The approach is typical for this type of solvers; the precision shown equals 62% on a control collection.

3. Okatiev V., Erechinskaya T., Skatov D. In report [1] it is shown how pronoun anaphoras of different types can be resolved with the help of syntax parse tree analysis. This approach is well applicable to texts in which most of the sentences allow building correct syntax trees.
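The variant lists of the problem statement can be pictured as plain weighted lists. The following sketch (illustrative only, not the authors' code) encodes example (3) and the two no-resolution cases of the basic statement:

```python
# Illustrative sketch of the basic statement: ranked variant lists with
# confidence weights, as in example (3). "*" marks the implicit antecedent,
# "NULL" the directive not to resolve the pronoun.
IMPLICIT = "*"
NULL = "NULL"

def resolve_basic(variants):
    """Take the top-weighted variant; leave the pronoun unresolved when
    no variants exist or NULL ranks first."""
    if not variants:
        return None
    best, _weight = max(variants, key=lambda v: v[1])
    return None if best == NULL else best

# Variant lists for the two occurrences of "it" in example (3).
pr1 = [(IMPLICIT, 0.652166), ("business", 0.2371), (NULL, 0.168611)]
pr2 = [("display", 0.466248), (IMPLICIT, 0.284525),
       (NULL, 0.0777368), ("business", 0.0101848)]

print(resolve_basic(pr1))  # "*" (the implicit antecedent)
print(resolve_basic(pr2))  # "display"
```

Under the basic statement the top-ranked variant is taken as the answer, so a «NULL» in first place simply suppresses resolution.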
The specificity of this article (processing texts from narrow subject domains with mistypings and slang) is not touched upon in the works listed above. The question under discussion is more widely represented in foreign works:

• among English-speaking authors, the patented system [11] and work [8] (which demonstrates values of the basic indicators at a level of about 80% using a probability model) are the first to be mentioned;

• the authors of [9] use the maximum entropy method to resolve third personal pronoun anaphora in Chinese, with an F-measure of about 70%;

• [10] describes an application of machine learning to personal pronoun anaphora resolution in Turkish, with recall and precision at about 60-70%.

The overall impression from these works is the following: a competent combination of analysis methods and rather full vocabulary data results in recall and precision of not less than 70%.

4 Solution

4.1 Lists of variants and attributes

After tokenization (during which the lists of grammar values of the tokens are supplemented taking mistypings into consideration) and division of the text into "conditional" sentences, all the pronouns in the text are looked through from left to right. A concrete pronoun pr_i is fixed, i = 1,…,N, and a list var(pr_i) of possible antecedents is formed:

1. from all the words located within the µ sentences to the left of pr_i, nouns in concordance with pr_i by gender and number are selected;

2. from the same words, pronouns which are in concordance with pr_i by gender and number are selected, and the list is supplemented with the nouns that resolve these pronouns.

Possible antecedents can also be found to the right of pr_i; however, not more than 30 examples of this were found in the corpus, and in ⅓ of those cases the correct variant was also found to the left of pr_i. Therefore, variants located to the right are ignored by the method.
The proposed scheme has a chain character: pronouns to the left of the given pr_i which are close to it and already resolved add to var(pr_i) antecedents located to the left of the boundary of the window µ = 2. The scheme presents a certain compromise: the list can be imprecise but remains quite compact. Advancing the window border up to 5 sentences with the chain scheme disabled led to a noticeable decrease in precision during the experiments, so the decision was made to reject varying the left border.

For the further ranking of the lists, a vector of attributes A(a) is calculated for each a ∈ var(pr_i). Let us mention the following attributes among the operational ones:

• IsVoc ∈ {0,1}: whether a belongs to a terminological dictionary;

• Freq ∈ N ∪ {0}: the number of mentions of the given word (in any form) to the left of pr_i;

• Dist ∈ N: the distance between the pronoun pr_i and the position of a in the text (measured in words);

• IsVerb ∈ {0,1}: the presence of a direct father in the form of a verb in the syntax tree of the fragment containing a;

• NumNodes ∈ N ∪ {0}: the number of nodes in the bush subordinate to a.

The last two attributes have been introduced after exploring the correlation between numeric properties of the tree and the resolving antecedents. For example, greater NumNodes values often corresponded to proper variants of resolution. These attribute values are set to zero in case the tree was not formed.

The distance is measured in words for a number of reasons: (a) obtaining a valid syntactical unit (clause, noun phrase) was not possible (at that moment) due to the laboriousness of adapting the syntactical parser to the special features of the input texts (e.g. the absence of punctuation); (b) a paragraph is too large a unit of measure, since the majority of opinions consist of one paragraph; (c) windows are already measured in sentences, and a two-sentence range is considered sufficient for this research.
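The attribute vector described above can be sketched as follows. This is an illustration only; the token and parse-tree representations are assumptions, not the authors' implementation:

```python
# Illustrative computation of A(a) = (IsVoc, Freq, Dist, IsVerb, NumNodes)
# for a candidate antecedent; "parse" is a hypothetical stand-in for the
# syntax-tree fragment containing the candidate.
def attributes(candidate_pos, pronoun_pos, tokens, vocabulary, parse=None):
    word = tokens[candidate_pos].lower()
    is_voc = 1 if word in vocabulary else 0
    # Freq: mentions of the word (any occurrence) to the left of the pronoun.
    freq = sum(1 for t in tokens[:pronoun_pos] if t.lower() == word)
    dist = pronoun_pos - candidate_pos          # distance measured in words
    # Syntax-tree attributes default to zero when no tree could be built.
    is_verb = 1 if parse and parse.get("father_is_verb") else 0
    num_nodes = parse.get("num_nodes", 0) if parse else 0
    return (is_voc, freq, dist, is_verb, num_nodes)

tokens = "nice big display no dead spaces found on it".split()
vocab = {"display", "battery", "screen"}
print(attributes(2, 8, tokens, vocab))  # (1, 1, 6, 0, 0)
```

When a parse of the fragment is available, the two tree attributes are filled in from it; otherwise they stay zero, exactly as stated above.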
The IsVoc attribute implements the following idea: taking the specificity of a subject domain into account allows a higher quality of analysis to be obtained. In effect, IsVoc raises the priority of the variants relating to the subject domain of the text, which are of most interest (though not always).

4.2 The test corpus

To evaluate the methods, a corpus of 3 Mb was built from opinions about mobile phones taken from sources like [13,14,15]. Due to the specificity of the application, the corpus was additionally divided into three groups (positive, negative and neutral opinions) of 0.8–1.2 Mb each. As a next step it was marked up with resolved anaphoras according to the following scheme:

• if the correct antecedent could be chosen directly from the text, its occurrence closest to the left of the pronoun being resolved was marked in a special way;

• in case of semantic ambiguity, the pronoun was marked with the «NULL» variant;

• the resolving word was written next to the pronoun in the corresponding case.

The statistical characteristics of the corpus were estimated.

• The corpus contains 8.3 thousand opinions comprising 37 thousand unique word forms (including mistypings).

• The most frequent opinion length varies from 15 to 35 words; the average opinion length is 54 words; the bulk of the opinions contain 10 to 90 words, and opinions of more than 100 words are rare. The length scatter is from 2 to 340 words (Fig. 1).

• Opinions consisting of one sentence are the most frequent; the average opinion length is 4 sentences. The majority of opinions include 1 to 16 sentences; lengths of more than 24 sentences are very rare (Fig. 2).

• The corpus contains about 6.2 thousand third personal pronouns: 4.5 thousand of masculine gender, 0.8 thousand of feminine gender, and 0.7 thousand plurals. The reason for the great number of masculine pronouns is the subject of the opinions (mobile phones; the Russian word for "phone" is masculine).

• Less than 50% of the opinions do not contain any of the pronouns under research.
35% contain only one pronoun, and about 10% contain two. The maximum is 9 pronouns per opinion (Fig. 3).

Fig. 1. Distribution of opinion lengths in words

Fig. 2. Distribution of opinion lengths in sentences

Fig. 3. Distribution of opinions by the number of pronouns

4.3 Lexicographical analysis method

At the initial stage of the study a heuristic method for ranking the options was implemented:

• a system of priorities is formed on the set of attributes listed in subsection 4.1;

• the attribute values of each option are ordered according to the priorities;

• the options are sorted lexicographically according to their ordered sets of attributes.

The method resolves all the anaphoras for which it has found variants to the left, with a precision rate of not more than 60%. Experiments with introducing new attributes and varying their priorities were not effective. This led the authors to the idea of filtering the input data in order to achieve a higher precision rate.

4.4 SVM-method based on machine learning

Let there be a general set of objects Ω divided into previously unknown classes, and a sample set O ⊂ Ω for each element of which its class is known. The task of classification is to answer the question "which class does each object from Ω belong to" (or to give the probabilities of belonging), knowing only the sample set.

Let us fix the list var(pr_i) for one specific pronoun pr_i, i = 1,…,N. In this case O_i = {A(a) | a ∈ var(pr_i)}, and two classes are of interest: "is an antecedent of pr_i" and its inverse. The distance to the first class can then be taken as w(a). Now the approach needs to be generalized to N pronouns. Each set O_i represents an independent group consisting of two classes, "is the antecedent of pr_i" and its inverse, which gives 2N classes for the whole training set. It is impossible to use such a classification in practice with a different number Q ≠ N of pronouns.
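The lexicographical ranking of subsection 4.3 amounts to comparing candidates attribute by attribute in priority order. A minimal sketch (illustrative; the concrete priority order and comparison directions are assumptions, not the authors' settings):

```python
# Illustrative lexicographical ranking over attribute vectors
# (IsVoc, Freq, Dist, IsVerb, NumNodes). Assumed priority: IsVoc first,
# then Freq, Dist, IsVerb, NumNodes.
def rank_key(attrs):
    is_voc, freq, dist, is_verb, num_nodes = attrs
    # Higher IsVoc/Freq/IsVerb/NumNodes is better; smaller Dist is better.
    # Negating the "higher is better" attributes makes an ascending sort
    # put the best candidate first.
    return (-is_voc, -freq, dist, -is_verb, -num_nodes)

candidates = {
    "display": (1, 1, 6, 0, 0),
    "business": (0, 1, 14, 0, 0),
}
ranked = sorted(candidates, key=lambda name: rank_key(candidates[name]))
print(ranked)  # ['display', 'business']
```

Because the first differing attribute decides the comparison, a top-priority attribute such as IsVoc dominates the ranking, which is exactly the error source noted later in subsection 5.6.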
In order to get exactly two classes for any number of pronouns, an acceptable combination of these groups has to be constructed. For this purpose, the authors propose adding attributes characterizing the group to each element ω ∈ O_i: within one group, all members are additionally provided with the same set of numbers describing the group. The centroid of the group can be taken as these numbers.

After expanding the group members in this way, a sample set O = O_1 ∪ … ∪ O_N with the corresponding universe Ω is constructed, together with a fuzzy classifier K(ω) ∈ (0,1] which determines the distance between ω and the class "are antecedents". K is constructed in the form of a so-called probabilistic decision function, as described in [5,6], based on a classical C-SVM with a nonlinear kernel [7]. The kernel and the constants of the SVM were selected by minimizing overtraining on a parameter grid while verifying the recall-precision ratio on the training and control samples. In the end, a polynomial kernel of small degree was chosen. Centroids raised the precision of the SVM-method from 70% to 80% (mode A).

4.5 Recall-precision regulator

To reach a precision rate of 90%, linear discriminant analysis [4] was used: its aim is to find a line between the classes, in the projection onto which they are most discernible. With the help of the discriminant, pronouns which may be left unresolved (for the purpose of raising the precision rate) were identified. The combination of this filtration with the SVM-method allowed the desired result to be reached (mode C). Along the way, mode B was derived, in which the basic rates are balanced in the region of 75-85%.

5 Analysis of the results

5.1 Quality requirements and evaluation

Processing of an input set of V third personal pronoun anaphoras is carried out in 2 steps.

1. Filtration of anaphoras.
From the total number V of objects, those for which the algorithm (1) failed to form a list of variants, (2) put «NULL» in the first place in the list of variants, or (3) which were eliminated from examination by the regulator, are deleted. As a result, V1 anaphoras are left, for each of which the algorithm can choose an antecedent (not necessarily the correct one). If the whole input set is considered relevant, the recall rate of this step is V1/V, while the precision is equal to 1, as all chosen objects (V1) are included in the relevant ones (V).

2. Resolution of the remaining anaphoras. In this step, the set of anaphoras resolved correctly is considered relevant. The algorithm attempts to resolve all V1 anaphoras, succeeding in V2 cases. Since the volume of relevant objects coincides with the volume of those being resolved, the precision and recall rates of this step are both equal to V2/V1.

Two of the four rates mentioned above (precision and recall of each step) are informative:

• recall V1/V is the portion of pronouns for which the algorithm succeeded in finding an antecedent;

• precision V2/V1 is the percentage of this portion for which the antecedent is identified correctly.

In the authors' opinion, this approach to evaluation conforms to the quality requirements. In addition, the estimations do not depend on the mechanism of anaphora resolution (including the size of the variant lists).

5.2 The quality of the SVM-method and its sensitivity to the sample volume

Opinions containing at least one of the pronouns under research (4 thousand altogether) were selected from the corpus. To evaluate the sensitivity of the SVM-method to the sample volume, this set of opinions underwent q-fold cross-validation. Verification was carried out for q = 1,…,300: q = 1 means verification of the model on the whole 4 thousand opinions, and q = 300 on a sample of 13 opinions.
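The two informative rates of subsection 5.1 can be computed as follows (an illustrative sketch with toy data, not the authors' code):

```python
# Illustrative two-step evaluation: recall is the share of pronouns for
# which some antecedent was chosen (V1/V); precision is the share of those
# choices that are correct (V2/V1).
def evaluate(decisions):
    """decisions: (chosen, gold) pairs; chosen is None when the pronoun was
    filtered out at step 1 (no variants, NULL first, or regulator cut)."""
    total = len(decisions)
    attempted = [(c, g) for c, g in decisions if c is not None]
    correct = sum(1 for c, g in attempted if c == g)
    recall = len(attempted) / total if total else 0.0
    precision = correct / len(attempted) if attempted else 0.0
    return recall, precision

decisions = [("display", "display"), ("phone", "display"),
             (None, "phone"), ("phone", "phone")]
print(evaluate(decisions))  # recall 0.75 (3 of 4 attempted), precision 2/3
```

Note that the estimate indeed ignores how the variant lists were built: only the chosen antecedent and the marked-up gold answer enter the computation.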
For each q, the mean recall and precision over the iterations were calculated, as well as their minima and maxima, for diagrams reflecting the dependency between quality and the volume of the input data. Measurements were made for modes A, B and C (Fig. 4; the abscissa corresponds to q).

Fig. 4. Cross-validation results for the SVM-method in modes A, B, C

It can be seen that all the means are stable even for small-sized samples.

Table 1. Averaged quality measures for the SVM-method

      Recall   Precision
A     97.3%    74.2%
B     75.4%    80.7%
C     45.6%    90.3%

5.3 The results of ROC-analysis of the SVM-method

Fig. 5 illustrates the ROC-curves of the SVM-method in modes A, B and C. The area under curve A is 0.74 and under curve B it is 0.76, which is considered "good" according to the expert scale. The area under curve C is 0.81, so this mode is considered "very good".

Fig. 5. ROC-curves for the SVM-method in modes A, B, C

5.4 Independence of the SVM-method from the sentiment of the corpus

It was additionally verified empirically that the SVM-method is independent of the sentiment of the texts processed, since it cannot be ruled out that anaphoras in negative opinions differ from those in positive ones. The "negative" corpus was used as the training set and the "positive" one as the control set.

Table 2. Check of the SVM-method's independence from sentiment (recall %, precision %)

                      (A)            (B)            (C)
Negative (training)   (95.1, 80.2)   (77.8, 86.7)   (43.1, 93.2)
Positive (control)    (96.3, 78.7)   (79.1, 83.4)   (56.2, 89.9)

5.5 Significance of the factors

Discriminant analysis provides an estimation of the contribution of the attributes to the common decision: the judgment can be made based on the coefficients of the corresponding attributes in the linear discriminant and the ranges of the attribute values. It is also possible to estimate how much influence the components of the centroid bring to the solution.
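This kind of significance reading can be sketched with a minimal two-class Fisher discriminant. The sketch below is a pure-Python illustration on assumed toy data, not the authors' implementation: the magnitude of each coefficient (considered together with the attribute's value range) indicates how much that attribute contributes to separating the classes.

```python
# Minimal two-feature Fisher linear discriminant: w = S_w^{-1} (m1 - m0),
# where S_w is the pooled within-class scatter and m0, m1 the class means.
def fisher_lda_2d(class0, class1):
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        return sxx, sxy, syy

    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sxx, sxy, syy = s0[0] + s1[0], s0[1] + s1[1], s0[2] + s1[2]
    det = sxx * syy - sxy * sxy        # invert the 2x2 scatter matrix
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
neg = [(0, 0), (0, 1), (1, 0), (1, 1)]
pos = [(5, 0), (5, 1), (6, 0), (6, 1)]
w = fisher_lda_2d(neg, pos)
print(w)  # the first coefficient dominates: feature 0 is the significant one
```

The noise feature receives a (near-)zero coefficient, while the separating feature gets a large one, which is the reading applied to Table 3 below.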
According to Table 3, frequency is two times more important than distance, and the presence of a father-verb is more important than the number of nodes in the bush (even after correcting for the wide range of NumNodes, which sometimes reaches 10-15 nodes). The picture according to the centroid is consistent on the whole, except for two of the attributes, whose contributions can be estimated as approximately equal.

Table 3. Significance of the attributes according to the results of discriminant analysis

Attribute   Coefficient in the linear discriminant   Coefficient at the corresponding component of the centroid
            -2.9                                     18.8
            9.3                                      1.1
            -7                                       35.8
            -0.5                                     18.9
            -21.5                                    -1.6
            -10.6                                    0.1

Compiling the vocabularies for IsVoc is rather laborious. The authors have discovered that the main coefficients in modes A and C (recall and precision, respectively) drop from about 90% to 70% when this attribute is not used; in mode B both coefficients drop by about 10%. It can be stated that it is precisely the IsVoc attribute that allows the precision rate of 90% and higher to be achieved.

5.6 Evaluation of the lexicographical method

The advantage of this method is that no marked-up corpus is needed for its initialization. The practical use of the SVM-method has shown that a trained classifier copes with texts from domains different from that of the training set, with the rates declining by several percent; the exception is the IsVoc attribute, for which new vocabularies are needed.

Table 4. Estimation of the lexicographical method quality

(recall %, precision %)   With IsVoc     Without IsVoc
                          (93.7, 51.9)   (93.7, 42.4)

The main error source of the method is the excessively strong influence of the attribute with the highest priority. E.g., using the IsVoc attribute often results in incorrectly choosing a vocabulary word, while not using it results in choosing the word closest to the left.

6 Conclusion

This paper offers a solution to the problem of third personal pronoun anaphora resolution.
The software complex called DictaScope Anaphora was implemented based on the models and methods discussed in this paper. It has the following characteristics:

• there are three modes, which allow either both recall and precision rates of 80% to be achieved, or preference to be given to one of them with a result of 95%;

• mistypings and grammatical errors can be taken into account, which is important for processing texts from online sources (such as reviews);

• in this case an adjustment of the parameters to the specific subject area is needed.

The features of the internal structure of the system and its mathematical foundation have been described; a detailed evaluation of the test data and of the quality of its processing has been carried out.

Among the shortcomings, a drop in accuracy on masculine pronouns should be noted. It is caused by the choice of the subject of the opinions (a mobile phone): it is mentioned very often (including implicit mentions), and the main part of the malfunctions consists in choosing the implicit antecedent «*». In the authors' opinion, the problem can be solved by taking into consideration new attributes connected with the results of syntactical parsing.

The development plans include applying the system to other domains and improving the recall-precision ratio by introducing new attributes and refining the adjustment of the coefficients.

7 References

1. Okatev V.V., Gergel V.P., Alexeev V.E., Talanov V.A., Barkalov K.A., Skatov D.S., Erekhinskaya T.N., Kotov A.E., Titova A.S. Report on research implementation on the topic: "Development of a pilot version of a syntactical analyzer for the Russian language". VNTIC Inventory Number 02200803750. VNTIC, Moscow (2008)
2. Ermakov A.E. Referencing the designations of persons and organizations in Russian media texts: empirical laws for computer analysis. In: Proceedings of the International Conference "Dialog'2005", Computational Linguistics and Intelligent Technologies (2005)
3. Tolpegin P.V., Vetrov D.P., Kropotov D.A. Algorithm for automated third-person pronoun resolution on the basis of machine learning methods. In: Proceedings of the International Conference "Dialog'2006", pp. 504-507. Izd. RGGU, Moscow (2006)
4. Oldenderfer M.S., Blashfield R.K. Factor, discriminant and cluster analysis. Ed. Igor Enyukov. Finance and Statistics, Moscow (1989)
5. Platt J.C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola A.J., Bartlett P., Schölkopf B., Schuurmans D. (eds.) Advances in Large Margin Classifiers. MIT Press (1999)
6. Lin H.-T., Lin C.-J., Weng R.C. A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68(3), 267-276 (October 2007)
7. Vapnik V. Statistical Learning Theory. Wiley (1998)
8. Ge N., Hale J., Charniak E. A statistical approach to anaphora resolution. In: Proceedings of the Sixth Workshop on Very Large Corpora, COLING-ACL'98. Montreal, Canada (1998)
9. Ning Pang, Jun-feng Shi. The third personal pronoun anaphora resolution in the paroxysmal text of the Chinese web. Coll. of Appl. Sci., Taiyuan Sci. & Technol. Univ., Taiyuan, China
10. Yıldırım S., Kılıçaslan Y. A machine learning approach to personal pronoun resolution in Turkish. In: Proceedings of the 20th International FLAIRS Conference, FLAIRS-20. Key West, Florida (2007)
11. Michael P., Kazuhide Y., Eiichiro S. Anaphora analyzing apparatus provided with antecedent candidate rejecting means using candidate rejecting decision tree. Patent US6343266 (2002)
12. Novoteka, news of the day: http://www.novoteka.ru
13. Yandex.Market, search, selection and purchase of goods: http://market.yandex.ru
14. CNews Internet portal: http://zoom.cnews.ru
15. All for Nokia phones: http://www.allnokia.ru