<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The Third Personal Pronoun Anaphora Resolution in Texts from Narrow Subject Domains with Grammatical Errors and Mistypings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daniel</forename><surname>Skatov</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Dictum Ltd</orgName>
								<address>
									<settlement>Nizhny Novgorod</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Sergey</forename><surname>Liverko</surname></persName>
							<email>liverko@dictum.ru</email>
							<affiliation key="aff0">
								<orgName type="institution">Dictum Ltd</orgName>
								<address>
									<settlement>Nizhny Novgorod</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The Third Personal Pronoun Anaphora Resolution in Texts from Narrow Subject Domains with Grammatical Errors and Mistypings</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4352177161348178A443CC287EC29B85</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T01:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Computational linguistics</term>
					<term>natural language processing</term>
					<term>anaphora resolution</term>
					<term>machine learning</term>
					<term>opinion mining</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We discuss third-person pronoun anaphora resolution in texts from Internet sources (forum comments, opinions) within a given subject domain (cars, household appliances, etc.). A concrete solution to the task is offered. High precision with acceptable recall (and vice versa) is demonstrated on opinions about mobile phones.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The problem of third-person pronoun anaphora resolution discussed in this paper consists in replacing pronouns such as "he", "his", "her", "it", … with the nouns (antecedents) these pronouns stand for. Its solution is needed first of all in text-mining applications, such as opinion mining (about goods, people) or fact extraction. Without resolved anaphora, those applications lose recall in their results. The degree of loss depends on the type of processed texts: e.g., in opinions about goods the density of the pronoun "it" (masculine gender in Russian) is 1.5 times higher than in news <ref type="foot" target="#foot_0">1</ref> .</p><p>The known methods of anaphora resolution can be divided into two groups: (1) statistical and (2) syntactic. Methods of class (1) <ref type="bibr" target="#b2">[3]</ref> are based on machine learning and are potentially applicable to texts of significantly different nature. Class (2) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> exploits the syntactic parse tree of a sentence (or semantic graphs derived from it); as a result, the applicability of such methods is limited to relatively «correct» texts (e.g., dossier texts <ref type="bibr" target="#b1">[2]</ref>). This article describes a method that, in a certain sense, combines these two approaches.</p><p>Texts from «real life» are full of typos and specialized slang, with grammar far from correct:</p><p>Ive got a whit ceise and buttons peel gradauly and they becomes gray no cleaning helps or anything likethat..! Weak processor also made upset as well as small memory amount, it works terribly slow. 
<ref type="bibr" target="#b0">(1)</ref> The method of anaphora resolution offered by the authors takes mistypings into account, along with the results of syntactic parsing of text fragments (with mistypings corrected). It is adapted to processing texts from specific subject domains. The method can work with «correct» texts as well as informal ones (such as opinions or notes). To achieve high processing quality for texts from a selected domain, a preliminary adjustment of the method is needed: learning on an unmarked corpus and compiling the operating terminological dictionaries.</p><p>Three modes of the method have been implemented: (A) good precision (70-80%) with high recall (90-95%), (B) approximately equal precision and recall (75-85%), (C) excellent precision (up to 95%) with acceptable recall (40-50%).</p><p>The implementation of the technology is a software module called DictaScope Anaphora, adjusted to processing opinions about mobile phones from Internet sources. Within the bounds of this article, an estimate of the recall-precision ratio for processing such data is carried out. The module is used in a real application for online opinion monitoring. Modes A, B and C were obtained while looking for a solution effective for this application, i.e. one with high precision on possibly intentionally reduced input data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem statement</head><p>Basic statement. For each pronoun pr i , i = 1,…, N from the text, choose the resolving word (antecedent) a i . Remark. In certain cases it is impossible to choose a i , e.g.:</p><p>This mobile phone has a sensor screen. It's very inconvenient. (screen or phone?)</p><p>Resolving such an ambiguity (which can conditionally be called semantic) is a hard task even for a human, as both variants are equally possible. In the current problem statement it is proposed either to choose a concrete antecedent or not to resolve the anaphora at all.</p><p>Advanced statement. It sometimes turns out that an acceptable precision of selecting a sole variant is unreachable. Therefore the following task specification is proposed: for each pronoun pr i , i = 1,…, N form a list of possible resolving variants a i 1 ,…, a i l i sorted in accordance with their ranks (the first one is the best). Then a i can be chosen as a i 1 . In case a requirement of high recall takes place (e.g., for posterior hand processing of results), it is sufficient to ensure high quality of the ranking.</p><p>The variants of resolving antecedents can be supplied with real-valued weights </p><formula xml:id="formula_1">w = w a i k ( ) ∈ 0,1 ( ⎤ ⎦ , i ∈ 1,…, N { } , k ∈ 1,…, l i { } ,</formula><formula xml:id="formula_2">For pronoun pr 1 = «it» the list of variants is formed ( a 1 1 = «*» , a 1 2 = «business» , a 1 3 = «NULL» ) with weights w a 1<label>1</label></formula><p>( ) ≈ 0.65 , w a 1 2 ( ) ≈ 0.237 , w a 1 3 ( ) ≈ 0.1686 (similarly for pr 2 = «it» ). There are also special «*» and «NULL» designations:</p><p>• «*» -«the current object of discourse», the so-called «implicit» antecedent. This is typical for opinions and reviews, i.e. for texts representing direct speech in writing. 
In the example above, the word «phone» (as well as a reference to its concrete model) is not found anywhere before pr 1 = «it» , though the teller means exactly «this phone».</p><p>• «NULL» -a directive «not to resolve the pronoun». If «NULL» is in the first position in the list of variants, the pronoun is left unresolved.</p><p>Thus, there are two cases in the basic problem statement in which the anaphora will not be resolved:</p><p>1. No variants for pronoun resolution are found; 2. «NULL» is the first in the ranked list of variants. It is easy to see that if, in case of semantic ambiguity, the probability of the correct choice of antecedent is less than ½, choosing «NULL» does not reduce precision on average; therefore, in this case the «NULL» variant is justified.</p><p>In example (3) the task in the basic statement is resolved correctly by choosing the first variant for each pronoun. A solution in the basic statement will be evaluated further.</p></div>
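The decision rule of the basic statement can be sketched as follows. This is a minimal illustration, not the authors' implementation: each pronoun carries a ranked list of (variant, weight) pairs, and the anaphora is resolved to the top-ranked variant unless the list is empty or «NULL» is ranked first.

```python
# Sketch (assumed data model) of the basic-statement decision rule.
def resolve(ranked_variants):
    """ranked_variants: list of (antecedent, weight) pairs, best first."""
    if not ranked_variants:
        return None          # case 1: no variants found
    best, _ = ranked_variants[0]
    if best == "NULL":
        return None          # case 2: NULL ranked first -> leave unresolved
    return best

# Weights from the paper's example (3): the implicit antecedent «*» wins.
choice = resolve([("*", 0.652166), ("business", 0.2371), ("NULL", 0.168611)])
```

Under this rule, a semantically ambiguous pronoun whose list starts with «NULL» is simply skipped, which matches the justification given above.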
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Review</head><p>The subject area of this paper is covered in the works of three Russian groups.</p><p>1. Ermakov A.E., RCO. In <ref type="bibr" target="#b1">[2]</ref> empirical regularities of references to persons are shown for texts from Russian mass media; they can be used to build a mechanism for anaphora resolution in text sources of this class (with the help of a natural language syntactic parser). 2. Tolpegin P., Vetrov D., Kropotov D. Article <ref type="bibr" target="#b2">[3]</ref> describes this group's experience in resolving third-person pronoun anaphora in news by machine learning methods. The approach is typical for this type of solver; the precision shown equals 62% on a control collection. 3. Okatiev V., Erechinskaya T., Skatov D. The report <ref type="bibr" target="#b0">[1]</ref> shows how pronoun anaphoras of different types can be resolved by analysis of syntactic parse trees. This approach is well applicable to texts in which most sentences allow building correct syntax trees.</p><p>The specificity of this article -processing texts from narrow subject domains with mistypings and slang -is not touched upon in the works listed above.</p><p>The question discussed is more widely represented in foreign scientific works:</p><p>• among English-language works, the patented system <ref type="bibr" target="#b10">[11]</ref> and the work <ref type="bibr" target="#b7">[8]</ref> (which demonstrates values of the basic indicators at a level of about 80% using a probabilistic model) should be mentioned first; • the authors of <ref type="bibr" target="#b8">[9]</ref> use the maximum entropy method to resolve third-person pronoun anaphora in Chinese, with an F-measure of about 70%; • <ref type="bibr" target="#b9">[10]</ref> describes an application of machine learning to personal pronoun anaphora resolution in Turkish with recall and precision at about 60-70%.</p><p>The overall impression of these 
works is the following: a competent combination of analysis methods and rather full vocabulary data results in recall and precision of not less than 70%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Solution</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Lists of variants and attributes</head><p>After tokenization (during which the lists of grammar values of the tokens are supplemented with mistypings taken into consideration) and division of the text into "conditional" sentences, all the pronouns in the text are looked through from left to right. A concrete pronoun pr i is fixed, i = 1,…, N , and the list var pr i ( ) of possible antecedents is formed:</p><p>1. from all the words located within the window of sentences to the left of pr i , nouns in concordance with pr i by gender and number are selected; 2. from the same words, pronouns which are in concordance with pr i by gender and number are selected, and the list is supplemented with the nouns that resolve these pronouns.</p><p>Possible antecedents can also be found to the right of pr i ; however, this happens no more than ⅓ as often as to the left. Therefore, possible variants located to the right are ignored by the method.</p><p>The proposed scheme has a chain character: pronouns to the left of the given pr i which are close to it and already resolved add to var pr i ( ) antecedents located to the left of the boundary of the window µ = 2 . The scheme presents a certain compromise: the list can be imprecise but remains quite compact. Advancing the window border up to 5 sentences with the chain scheme disabled led to a noticeable decrease in precision during the experiments, so the decision was made to reject the varying left border.</p><p>For the further ranking of the lists, a vector of attributes A a ( ) is calculated for each a ∈var pr i ( ) . 
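The two-step candidate collection above, including its chain character, can be sketched roughly as follows. The token data model and field names are assumptions for illustration only.

```python
# Sketch of forming var(pr_i): nouns from a two-sentence window to the
# left that agree with the pronoun in gender and number, plus antecedents
# inherited from already-resolved pronouns in that window (chain scheme).
MU = 2  # window size in sentences, as in the paper

def candidates(sentences, pron_sent, pron_idx, gender, number, resolved):
    """sentences: list of sentences; each token is a dict with keys
    'text', 'pos', 'gender', 'number'. resolved maps (sent, idx) of an
    earlier pronoun to the antecedent already chosen for it."""
    out = []
    for s in range(max(0, pron_sent - MU), pron_sent + 1):
        for i, tok in enumerate(sentences[s]):
            if s == pron_sent and i >= pron_idx:
                break  # only words strictly to the left of the pronoun
            agrees = tok["gender"] == gender and tok["number"] == number
            if tok["pos"] == "NOUN" and agrees:
                out.append(tok["text"])
            elif tok["pos"] == "PRON" and agrees and (s, i) in resolved:
                out.append(resolved[(s, i)])  # chain: inherit antecedent
    return out
```

A resolved pronoun thus carries its antecedent forward even when that noun lies beyond the µ = 2 boundary of the current pronoun, which is exactly the compromise described in the text.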
Let us mention the following attributes among the operational ones:</p><formula xml:id="formula_3">• IsVoc ∈ 0,1</formula><p>{ } -the belonging of a to a terminological dictionary;</p><p>• Freq ∈N ∪ 0 { } -the number of mentions of the given word (in any form) to the left of pr i ; • Dist ∈N -the distance between the pronoun and the position of a inside the text (measured in words);</p><formula xml:id="formula_4">• IsVerb ∈ 0,1</formula><p>{ } -the presence of a direct father in the form of a verb in the syntax tree of a fragment containing a ;</p><p>• NumNodes ∈N ∪ 0 { } -the number of nodes in the bush subordinate to a .</p><p>The last two attributes have been introduced based on an exploration of the correlation between numeric properties of a tree and the resolving antecedents. For example, greater NumNodes values often corresponded to proper variants of resolution. These attribute values are set to zero in case the tree was not built.</p><p>The distance is measured in words for a number of reasons: (a) obtaining a valid syntactic unit (clause, noun phrase) was not possible (at that moment) due to the laboriousness of adapting the syntactic parser to the special features of the input texts (e.g. the absence of punctuation); (b) a paragraph is too large a unit of measure -the majority of opinions consist of one paragraph; (c) windows are measured in sentences, and a two-sentence range is considered sufficient for the research.</p><p>The IsVoc attribute implements the following idea: taking a subject domain's specificity into account allows obtaining a higher quality of analysis. In fact, IsVoc allows raising the priority of variants related to the subject domain of the text -they are of most interest (though not always).</p></div>
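Computing the attribute vector A(a) for one candidate can be sketched as below. The `tree` dictionary and its keys are assumed names, since the paper does not describe the parser's output format.

```python
# Hedged sketch of the attribute vector A(a) from section 4.1.
# `vocab` is the terminological dictionary; `tree`, if present, exposes
# the candidate's parent part of speech and subtree ("bush") size.
def attributes(cand_pos, pron_pos, words, vocab, tree=None):
    word = words[cand_pos]
    return {
        "IsVoc": 1 if word in vocab else 0,
        "Freq": words[:pron_pos].count(word),  # mentions left of the pronoun
        "Dist": pron_pos - cand_pos,           # distance in words
        "IsVerb": 1 if tree and tree.get("parent_pos") == "VERB" else 0,
        "NumNodes": tree.get("subtree_size", 0) if tree else 0,
    }
```

When no tree was built for the fragment, `IsVerb` and `NumNodes` fall back to zero, as stated above.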
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">The test corpus</head><p>To evaluate the methods, a corpus of 3 MB was built from opinions about mobile phones from sources like <ref type="bibr" target="#b11">[13,</ref><ref type="bibr">14,</ref><ref type="bibr">15]</ref>. Due to the specificity of the application, the corpus was additionally divided into three groups: positive, negative and neutral opinions, each of 0.8-1.2 MB. As a next step it was marked up with the resolved anaphoras according to the following scheme:</p><p>• if the correct antecedent could be chosen directly from the text, its occurrence closest to the left of the pronoun being resolved was marked in a special way; • in case of semantic ambiguity the pronoun was marked with the «NULL» variant;</p><p>• the resolving word was written next to the pronoun in the corresponding case.</p><p>The statistical characteristics of the corpus were estimated.</p><p>• The corpus contains 8.3 thousand opinions comprising 37 thousand unique word forms (including mistypings). • The most frequent opinion length varies from 15 to 35 words; the average opinion length is 54 words; the bulk of the opinions contain 10 to 90 words; opinions of more than 100 words are rare. The length scatter is from 2 to 340 words (Fig. <ref type="figure">1</ref>).</p><p>• Opinions consisting of one sentence are the most frequent; the average opinion length is 4 sentences. The majority of opinions include 1 to 16 sentences; lengths of more than 24 sentences are very rare (Fig. <ref type="figure">2</ref>). • The corpus contains about 6.2 thousand third-person pronouns, including 4.5 thousand of masculine gender, 0.8 thousand of feminine gender and 0.7 thousand plurals. The reason for the great number of masculine pronouns is the subject of the opinions (mobile phones). • Less than 50% of the opinions contain none of the pronouns under research.</p><p>35% contain only one pronoun, about 10% -two of them. 
The maximum is 9 pronouns per opinion (Fig. <ref type="figure">3</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Lexicographical analysis method</head><p>At the initial stage of the study a heuristic method for ranking the options was implemented:</p><p>• a system of priorities is formed on the set of attributes listed in subparagraph 4.1; • attribute values of each option are ordered according to the priorities; • options are sorted lexicographically according to their tuples of attributes.</p><p>The method resolves all the anaphoras for which it has found variants to the left, with a precision of not more than 60%. Experiments with introducing new attributes and varying their priorities were not effective. This led the authors to the idea of filtering the input data in order to achieve a higher precision.</p></div>
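The lexicographic ranking can be sketched as follows. The priority order used here (IsVoc first, then Freq, then closeness) is illustrative; the paper does not publish the exact priority system.

```python
# Sketch of the heuristic lexicographic ranking from section 4.3:
# candidates are compared by tuples of attribute values ordered by an
# assumed priority list; larger IsVoc/Freq rank higher, smaller Dist does.
PRIORITY = ["IsVoc", "Freq", "Dist"]

def rank(variants):
    """variants: list of (word, attr_dict); returns best candidates first."""
    def key(item):
        _, attrs = item
        # negate "bigger is better" attributes so ascending sort works
        return tuple(attrs[p] if p == "Dist" else -attrs[p] for p in PRIORITY)
    return sorted(variants, key=key)
```

Because tuple comparison is lexicographic, a vocabulary hit dominates any frequency or distance difference, which is exactly the over-strong top-priority effect criticised later in section 5.6.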
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">SVM-method based on machine learning</head><p>Let there be a general set of objects Ω , divided into previously unknown classes, and a sample set O ⊂ Ω , for each element of which its class is known. The task of classification is to answer the question "which class does each object from Ω belong to", knowing only the sample set (or to give the probabilities of belonging).</p><p>Let us fix the list var pr i ( ) for one specific pronoun pr i . In this case</p><formula xml:id="formula_5">O i = A a ( ) | a ∈var pr i ( ) { } , i = 1,…, N</formula><p>, and two classes are of interest -"are antecedents" and its inverse. Then the distance to the first class can be taken as w a ( ) . Now we need to generalize the approach to N pronouns. Each set O i represents an independent group consisting of two classes -"is the antecedent for pr i " and its inverse -so for the whole training set the number of classes grows with N. It is impossible to use such a classification in practice with a different number Q ≠ N of pronouns. In order to get exactly two classes for any number of pronouns, it is necessary to construct an acceptable combination of these groups. For this purpose the authors propose adding attributes characterizing the group to each vector ω i ∈O i . Thus, within the same group all its members are additionally provided with the same set of numbers describing the group. The centroid of the group can be taken as these numbers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>After expanding of the group members a sample set</head><formula xml:id="formula_6">O = O i i=1 N </formula><p>with the corresponding universe and a fuzzy classifier K ω ( ) ∈ 0,1 ( ⎤ ⎦ , which determines the distance between ω and the class "are antecedents", are constructed. K is constructed in the form of a so-called probabilistic decision function as described in <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>, based on a classical C-SVM with a nonlinear kernel <ref type="bibr" target="#b6">[7]</ref>. Selection of the kernel and the constants for the SVM was performed by minimizing overtraining on a parameter grid while verifying the recall-precision ratio on the training and control samples. In the end, a polynomial kernel of small degree was chosen.</p><p>Centroids raised the precision of the SVM-method from 70% to 80% (mode A).</p></div>
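The group-combination trick can be sketched independently of the SVM itself: each candidate's attribute vector is extended with the centroid of its group (all candidates of the same pronoun), so one two-class classifier can be trained over any number of pronouns. The function name is an assumption; the classifier is omitted.

```python
# Sketch of centroid augmentation from section 4.4: every member of a
# group receives the same extra numbers -- the group centroid -- so that
# groups of different pronouns become comparable in one feature space.
def augment_group(group):
    """group: list of equal-length numeric feature vectors for one pronoun.
    Returns the vectors with the group centroid appended to each."""
    n, dim = len(group), len(group[0])
    centroid = [sum(vec[k] for vec in group) / n for k in range(dim)]
    return [vec + centroid for vec in group]
```

The augmented vectors from all pronouns can then be pooled into the single sample set O on which the probabilistic C-SVM is trained.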
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Recall-precision regulator</head><p>To reach a precision of 90%, linear discriminant analysis <ref type="bibr" target="#b3">[4]</ref> was used: its aim is to find a direction, in projection onto which the classes are most discernible. With the help of the discriminant, pronouns which should be left unresolved (for the purpose of raising precision) were identified. The combination of this filtration with the SVM-method allowed reaching the desired result (mode C). Along the way, mode B was derived, in which the basic rates are balanced in the region of 75-85%.</p></div>
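The discriminant direction can be illustrated with a small Fisher-style computation. This is a toy two-dimensional version under assumed names, not the authors' regulator: the projection w = S_w⁻¹(m₁ − m₀) separates the classes, and a threshold on the projection could mark pronouns to leave unresolved.

```python
# Toy sketch of a Fisher linear discriminant for 2-D feature vectors,
# inverting the pooled 2x2 within-class scatter matrix directly.
def fisher_direction(class0, class1):
    def mean(vs):
        n = len(vs)
        return [sum(v[0] for v in vs) / n, sum(v[1] for v in vs) / n]
    m0, m1 = mean(class0), mean(class1)
    s = [[0.0, 0.0], [0.0, 0.0]]          # pooled scatter S_w
    for vs, m in ((class0, m0), (class1, m1)):
        for v in vs:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = S_w^{-1} (m1 - m0)
    return [(s[1][1] * dm[0] - s[0][1] * dm[1]) / det,
            (-s[1][0] * dm[0] + s[0][0] * dm[1]) / det]
```

Projecting a pronoun's features onto w and comparing with a cut-off is one plausible way to realise the "leave unresolved" filter; the paper does not detail the threshold choice.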
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Analysis of the results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Quality requirements and evaluation</head><p>Processing of the input set containing third-person pronoun anaphoras is carried out in two steps.</p><p>1. Filtration of anaphoras. From the total number of objects, those for which the algorithm (1) failed to form the set of variants, (2) put «NULL» in the first place in the list of variants, or (3) eliminated from examination due to the regulator's work are deleted. As a result, anaphoras are left for each of which the algorithm can choose an antecedent (not necessarily the correct one). If the set of anaphoras resolved correctly is considered relevant, the recall of this step is the share of anaphoras retained, while the precision is equal to 1, as all chosen objects are included in the relevant ones. 2. Resolution of the remaining anaphoras. In this step the set of anaphoras resolved correctly is considered relevant. The algorithm attempts to resolve them, succeeding in a certain share of cases. Due to the coincidence between the volume of relevant objects and that of those being resolved, the precision and recall of this step are equal.</p><p>Two of the four rates mentioned above (precision and recall for each step) are informative:</p><p>• recall is the portion of pronouns for which the algorithm succeeded in finding an antecedent; • precision is the percentage of this portion containing correctly identified antecedents.</p><p>In the authors' opinion, this approach to evaluation conforms to the quality requirements. In addition, the estimates do not depend on the mechanism of anaphora resolution (including the size of the variant lists).</p></div>
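The two informative rates can be computed as in the sketch below, assuming each pronoun is recorded as the algorithm's output (None if left unresolved) plus a correctness flag; the representation is an assumption for illustration.

```python
# Sketch of the two rates from section 5.1: recall is the share of
# pronouns for which an antecedent was chosen at all, precision the
# share of those choices that are correct.
def recall_precision(results):
    """results: list of (chosen_or_None, is_correct) pairs."""
    attempted = [ok for chosen, ok in results if chosen is not None]
    recall = len(attempted) / len(results) if results else 0.0
    precision = sum(attempted) / len(attempted) if attempted else 1.0
    return recall, precision
```

Note that an unresolved pronoun lowers recall but never precision, which is why the «NULL» directive and the regulator trade recall for precision.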
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">The quality of SVM-method and sensitivity to the sample volume</head><p>Opinions containing at least one of the pronouns under research (4 thousand altogether) were selected from the corpus. To evaluate the SVM-method's sensitivity to the sample volume, this set of opinions underwent q-fold cross-validation.</p><p>Verification was carried out for q = 1,…,300 : q = 1 means verification of the model on the whole 4 thousand opinions, q = 300 on samples of 13 opinions. For each q, the mean of recall and precision over the iterations was calculated, as well as their minimum and maximum, for diagrams reflecting the dependency between quality and the volume of input data.</p><p>Measuring was done for modes A, B and C (Fig. <ref type="figure" target="#fig_1">4</ref>; the abscissa corresponds to q). It can be seen that all the means are stable even for small-sized samples. </p></div>
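The q-fold splitting behind this sensitivity check can be sketched as follows; fold sizes of about 4000/q opinions follow from the description above, with remainders spread over the first folds (an implementation assumption).

```python
# Sketch of the q-fold split from section 5.2: the selected opinions are
# cut into q nearly equal folds; recall/precision are then averaged over
# the per-fold verification runs.
def folds(items, q):
    size, rem = divmod(len(items), q)
    out, start = [], 0
    for i in range(q):
        end = start + size + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out
```

With 4000 opinions, q = 300 indeed yields folds of 13-14 opinions, matching the smallest samples mentioned in the text.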
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">The results of ROC-analysis of SVM-method</head><p>Fig. <ref type="figure" target="#fig_2">5</ref> illustrates ROC-curves for the SVM-method in modes A, B and C. The area under curve A is 0.74 and under curve B 0.76, which is considered "good" according to the expert scale. The area under curve C is 0.81, and this mode is considered "very good". </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">The SVM-method independence of the sentiment of the corpus</head><p>It was additionally verified empirically that the SVM-method is independent of the sentiment of the processed texts, since it cannot be ruled out that anaphoras in negative opinions differ from those in positive ones.</p><p>The "negative" corpus was used as a training set, the "positive" one as a control set (Table <ref type="table" target="#tab_2">2</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Significance of the factors</head><p>Discriminant analysis provides an estimate of the contribution of the attributes to the common decision: the judgment can be made from the coefficients of the corresponding attributes in the linear discriminant and the ranges of the attribute values. It is also possible to estimate how much influence the components of the centroid bring to the solution.</p><p>According to Table <ref type="table" target="#tab_3">3</ref>, the frequency is two times more important than the distance, and the presence of a verb father is more important than the number of nodes in the bush (even after correcting for the wide range of the latter, sometimes up to 10-15 nodes). The picture according to the centroid is consistent on the whole, except for two attributes whose contributions can be estimated as approximately equal. Compiling vocabularies for IsVoc is rather laborious. The authors have discovered that the main coefficients in modes A and C (recall and precision respectively) drop from about 90% to 70% when this attribute is not used; in mode B both coefficients drop by ~10%. It can be stated that it is precisely the IsVoc attribute that allows achieving a precision of 90% and higher.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Evaluation of lexicographical method</head><p>The advantage of this method is that no marked-up corpus is needed for its initialization. The practical use of the SVM-method has shown that a trained classifier copes with texts from domains different from that of the training set, with the rates declining by several percent (with the exception of the IsVoc attribute -new vocabularies are needed). The main flaw of the lexicographical method is the excessively strong influence of the attribute with the highest priority. E.g., using the IsVoc attribute often results in incorrectly choosing a vocabulary word, while not using it results in choosing the word closest to the left.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This paper offers a solution to the problem of third-person pronoun anaphora resolution. A software complex called DictaScope Anaphora was implemented based on the models and methods discussed in this paper. It has the following characteristics:</p><p>• there are three modes, which make it possible to achieve both recall and precision of 80%, or to give preference to one of them and reach 95%; • it is possible to take mistypings and grammatical errors into account, which is important for processing texts from online sources (such as reviews); • in this case an adjustment of the parameters to a specific subject area is needed.</p><p>The features of the internal structure of the system and its mathematical foundation are described; a detailed evaluation of the test data and the quality of its processing is carried out.</p><p>Among the shortcomings, a drop in accuracy on masculine pronouns should be noted. It is caused by the choice of the subject of the opinions (a mobile phone): it is mentioned very often (including implicit mentions), and the main part of the malfunctions consists in choosing an implicit antecedent. In the authors' opinion, the problem can be solved by taking into consideration new attributes connected with the results of syntactic parsing.</p><p>The development plans include applying the system to other domains and improving the recall-precision ratio by introducing new attributes and refining the adjustment of the coefficients.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .Fig. 2 .Fig. 3 .</head><label>123</label><figDesc>Fig. 1. Distribution of opinions lengths in words</figDesc><graphic coords="6,186.00,517.92,223.44,95.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Results for SVM-method cross-validation in A,B,C modes</figDesc><graphic coords="10,124.80,147.36,345.84,257.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. ROC-curves for SVM-method in A, B, C modes</figDesc><graphic coords="11,164.64,147.12,266.16,255.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Example (3)</head><label></label><figDesc></figDesc><table><row><cell>Traits. Let's resort to an example to make the task statement clear:</cell></row><row><cell>bought it for business, very useful because [it] {* = 0.652166, business = 0.2371, NULL = 0.168611} supports two sim cards. Nice, big display, no dead spaces found on [it] {display = 0.466248, * = 0.284525, NULL = 0.0777368, business = 0.0101848} (3)</cell></row><row><cell>The weights correspond to each variant's confidence.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Averaged quality measures for SVM-method</figDesc><table><row><cell></cell><cell>Recall</cell><cell>Precision</cell></row><row><cell>A</cell><cell>97.3%</cell><cell>74.2%</cell></row><row><cell>B</cell><cell>75.4%</cell><cell>80.7%</cell></row><row><cell>C</cell><cell>45.6%</cell><cell>90.3%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Check for SVM-method independency from sentiment</figDesc><table><row><cell>(RECALL %, PRECISION %)</cell><cell>(A)</cell><cell>(B)</cell><cell>(C)</cell></row><row><cell>Negative (training)</cell><cell>(95.1, 80.2)</cell><cell>(77.8, 86.7)</cell><cell>(43.1, 93.2)</cell></row><row><cell>Positive (control)</cell><cell>(96.3, 78.7)</cell><cell>(79.1, 83.4)</cell><cell>(56.2, 89.9)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Valuing the attributes significance according to the results of discriminant analysis</figDesc><table><row><cell>Attribute</cell><cell>Coefficient in linear discriminant</cell><cell>Corresponding coefficient near the component of the centroid</cell></row><row><cell></cell><cell>-2.9</cell><cell>18.8</cell></row><row><cell></cell><cell>9.3</cell><cell>1.1</cell></row><row><cell></cell><cell>-7</cell><cell>35.8</cell></row><row><cell></cell><cell>-0.5</cell><cell>18.9</cell></row><row><cell></cell><cell>-21.5</cell><cell>-1.6</cell></row><row><cell></cell><cell>-10.6</cell><cell>0.1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 .</head><label>4</label><figDesc>Estimation of the lexicographical method quality</figDesc><table><row><cell>With IsVoc</cell><cell>Without IsVoc</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">A random sample of news from [12] (anaphora density 0.34 per 1 K) and a sample of opinions about mobile phones from sources such as<ref type="bibr" target="#b11">[13]</ref> (anaphora density 0.53 per 1 K) were used to perform the measurements, each one of 1 Mb.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Development of a pilot version of syntactical analyzer for the Russian Language</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Okatev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Gergel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">E</forename><surname>Alexeev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">A</forename><surname>Talanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>Barkalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Skatov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Erekhinskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Kotov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Titova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VNTIC Inventory Number 02200803750 // VNTIC</title>
				<meeting><address><addrLine>Moscow</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
	<note>Report on research implementation</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Referencing the designations of persons and organizations in Russian media texts: empirical laws for computer analysis</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Ermakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference &quot;Dialog&apos;2005</title>
				<meeting>the International Conference &quot;Dialog&apos;2005</meeting>
		<imprint>
			<publisher>Computational Linguistics and Intelligent Technologies</publisher>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Algorithm for automated third-person pronoun resolution based on machine learning methods</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">V</forename><surname>Tolpegin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Wind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Kropotov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference &quot;Dialog&apos;2006</title>
				<meeting>International Conference &quot;Dialog&apos;2006<address><addrLine>Moscow</addrLine></address></meeting>
		<imprint>
			<publisher>Izd RGGU</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="504" to="507" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Factor, discriminant and cluster analysis</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Oldenderfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Blashfield</surname></persName>
		</author>
		<editor>
			<persName><forename type="first">Igor</forename><surname>Enyukov</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="1989">1989</date>
			<publisher>Finance and Statistics</publisher>
			<pubPlace>Moscow</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods</title>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">C</forename><surname>Platt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Large Margin Classifiers</title>
				<editor>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Peter</forename><surname>Bartlett</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Bernhard</forename><surname>Schölkopf</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Dale</forename><surname>Schuurmans</surname></persName>
		</editor>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A note on Platt&apos;s probabilistic outputs for support vector machines</title>
		<author>
			<persName><forename type="first">Hsuan-Tien</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chih-Jen</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruby</forename><forename type="middle">C</forename><surname>Weng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="2007-10">October 2007</date>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page" from="267" to="276" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Statistical Learning Theory</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vapnik</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>Wiley</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A statistical approach to anaphora resolution</title>
		<author>
			<persName><forename type="first">Niyu</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth Workshop on Very Large Corpora. COLING-ACL&apos;98</title>
				<meeting>the Sixth Workshop on Very Large Corpora. COLING-ACL&apos;98<address><addrLine>Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The third personal pronoun anaphora resolution in the paroxysmal text of the Chinese web</title>
		<author>
			<persName><forename type="first">Ning</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun-Feng</forename><surname>Shi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Coll. of Appl. Sci.</title>
		<imprint>
			<publisher>Taiyuan Sci. &amp; Technol. Univ</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A machine learning approach to personal pronoun resolution in Turkish</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yıldırım</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kılıçaslan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 20th International FLAIRS Conference, FLAIRS-20</title>
				<meeting>20th International FLAIRS Conference, FLAIRS-20<address><addrLine>Key West, Florida</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Anaphora analyzing apparatus provided with antecedent candidate rejecting means using candidate rejecting decision tree</title>
		<author>
			<persName><forename type="first">P</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kazuhide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Eiichiro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Patent</title>
		<imprint>
			<biblScope unit="volume">6343266</biblScope>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<ptr target="http://market.yandex.ru" />
		<title level="m">Market: search, selection and purchase of goods</title>
				<imprint>
			<publisher>Yandex</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
