The Implementation of the Mention-Ranking Approach to Coreference Resolution in Russian Anna Kupriianova Ivan Shilin annkupriyanova26@gmail.com shilinivan@corp.ifmo.ru Gerhard Wohlgenannt Liubov Kovriguina wolkinger@gmail.com lyukovriguina@corp.ifmo.ru ITMO University Saint-Petersburg, Russian Federation Abstract Coreference resolution is a fundamental ingredient for many downstream tasks in natural language-based applications. For Russian language, the work in coreference resolution is very limited. In this publication, we present a system inspired by the mention-ranking approach, which improves the state-of-the-art F1 score from 0.63 to 0.71, measured with the B 3 metric. We evaluate various sets of feature combinations, and also discuss the limitations of the presented work. Keywords: coreference resolution in Russian, mention-pair model, neural net- work based coreference system, RuCor, FastText model. 1 Introduction Coreference resolution is an important problem in many natural language processing tasks. It can support, i.e., automatic text summarization, knowledge extraction, objects’ identi- fication in dialogue and translation systems [Toldova, Ionov 2017]. The term coreference denotes the relation between the parts of a text (mentions), that refer to the same real world entities, for example the mention of a person name in a text, and of a pronoun that refers to this same person. Thus, the task of coreference resolution is to find and group all the mentions in the text according to their referents. Mentions are typically represented by noun phrases (NPs), named entities and pronouns, except the cases of abstract anaphora, where the anaphoric pronoun refers to the whole preceding sentence and not to noun phrases (NPs) or pronouns [Nedoluzhko, Lapshinova-Koltunski 2016]. In the pair of mentions the first one (full mention) is called the antecedent while the second one is an anafor. In the broader field of coreference resolution, there are two main tasks: mention extraction and coreference resolution in narrower sense (mention cluster- ing) [Sysoev et al. 2017]. Mention extraction finds textual expressions which are possible elements of coreference chains in unstructured data, whereas coreference resolution groups 1 mentions into clusters, which refer to a single real-world entity. The system presented here focuses on the second task, i.e. coreference resolution in the narrower sense. For Russian language, the work on coreference resolution is limited. Results re- ported on the RuCor1 [Toldova et al. 2014] coreference corpus are 60.48 of F1 score for the B 3 metric by Toldova and Ionov [Toldova, Ionov 2017], and 63.12 F1 by Sysoev et al. [Sysoev et al. 2017]. The mentioned work applies rule-based and “classical” machine- learning methods like decision trees or logistic regression. In this work, we present an approach using neural networks based on an adapted version of the mention-ranking model by Clark and Manning [Clark, Manning 2016]. With this architecture, we man- age to outperform previous work with an achieved F1-score of 0.7131. The B 3 met- ric [Bagga, Baldwin 1998, Amigo et al. 2009] is a clustering metric, which evaluates a gold-standard clustering of mentions against a system-produced clustering. The paper is structured as follows: After an overview of related work in Section 2, Section 3 introduces the system architecture, and Section 4 discusses the features used, and how features are combined into three different sets. The following section (Section 5) provides the evaluation details for those feature sets and compares the results to the state-of-the-art. Furthermore, difficult cases are discussed. The paper concludes with Section 6. 2 Related Work Existing approaches to coreference resolution can be divided into heuristic [Hobbs 1978, Boyarski et al. 2013 ] and based on machine learning (ML) algorithms [Rahman, Ng 2009, Ng 2008, Clark, Manning 2016]. Heuristic methods are built upon a handmade set of rules, which is time- and labour-consuming to construct. On the contrary, ML-based approaches are faster and easier to develop, but they depend on the availability of a coreference dataset of sufficient size and quality to apply supervised learning methods. Recent advances in coreference resolution for English language include the work of Clark and Manning [Clark, Manning 2016] on a cluster-ranking algorithm that handles entity-level information and eliminates the disadvantages of the mention-pair models. The main benefit of this neural network-based method is that it can distinguish beneficial cluster merges from harmful ones. Central parts of the system architecture used for this publication are inspired by this work. In general, the state-of-the-art results for Russian language are lower than for En- glish, work on Russian language is rather limited so far [Sysoev et al. 2017]. For Rus- sian language, coreference resolution became a more active research topic with the re- lease of a tagged coreference corpus (RuCor) in 2014 [Toldova et al. 2014]. Toldova and Ionov [Toldova, Ionov 2017] compare rule-based and ML-based methods for coreference resolution using this RuCor dataset, with slightly better performance for the ML-based methods. They reach 31.56 for predicted mentions and 60.48 for gold mentions with the B 3 metric for the RuCor corpus. Predicted mentions refers to coreference resolution for mentions which were automatically extracted with a mention extraction module. Sysoev et al. [Sysoev et al. 2017] tackle both mention extraction and coreference resolution. For mention extraction, they use a number of linguistic, structural, etc., features and apply classifiers such as logistic regression, Jaccard Item Set mining and random forest, and 1 http://rucoref.maimbava.net/ 2 reach an F1 of 63.12 for the gold mentions from the RuCor dataset. In comparison, our approach provides an F1-score about 8 points above previous work. 3 System Architecture We present a coreference resolution system based on the mention-ranking model by Clark and Manning [Clark, Manning 2016]. The core of the system is a feedforward neural network. Its topology is shown in Figure 1. In a nutshell, the network can be divided into two parts: a mention-pair encoder and a mention-ranking model. The implementation of the system is available on github2 . The source code was written in Python, using Keras and Tensorflow for the neural network models. Antecedent Feature Mention Feature Additional Vector Vector Features Input Layer h0 ... ... ... ReLU (W1h0 + b1) Hidden Layer h1 ReLU (W2h1 + b2) Hidden Layer h2 ReLU (W3h2 + b3) Hidden Layer h3 Wmh3 + bm Output Layer sm sm (a, m) Figure 1: Neural Network Topology 3.1 Mention-Pair Encoder The purpose of the mention-pair encoder is to transform a pair of a mention m and its potential antecedent a into their distributed representations. The mention-pair encoder is implemented as a feedforward neural network with three fully-connected hidden layers of rectified linear units (ReLU): hi (a, m) = max(0, Wi hi−1 (a, m) + bi ) (1) where hi (a, m) is an output of the i-th hidden layer for a pair of mention m and its potential antecedent a. Wi is a weight matrix and bi is the bias for the i-th hidden layer. The input layer of the mention-pair encoder takes a vector of features of a mention and its potential antecedent as well as additional pair features (all the features are described in Section 4). The output of the last hidden layer is the distributed representation of the pair which is used as an input to the mention-ranking model. 2 https://github.com/annkupriyanova/Coreference-Resolution 3 3.2 Mention-Ranking Model The purpose of the mention-ranking model is to estimate the score of coreference com- patibility for the pair of a mention m and its potential antecedent a. To compute this score one applies one fully-connected layer to the distributed representation of the pair rm (a, m), sm (a, m) = Wm rm (a, m) + bm (2) where sm (a, m) denotes a score of coreference compatibility of a pair of mention m and its potential antecedent a, rm (a, m) is the distributed representation of this pair. 3.3 Training objective For pretraining, which determines the initial configuration of the model parameters, we used the following objective function, N X  X X  − log p(t, mi ) + log(1 − p(t, mi )) (3) i=1 t∈T (mi ) f ∈F (mi ) where T (mi ) and F (mi ) are sets of true and false antecedents of a mention mi respectively, and p(a, mi ) = sigmoid(s(a, mi )). The main training objective is a slack-rescaled max-margin which penalizes different types of errors: XN max ∆(a, mi )(1 + sm (a, mi ) − sm (tˆi , mi )) (4) a∈A(mi ) i=1 where A(mi ) is a set of candidate antecedents of a mention mi , tˆi is a highest scoring true antecedent of mention mi : tˆi = arg max sm (t, mi ) (5) t∈T (mi ) and ∆(a, mi ) is a cost function for different types of mistakes: ( αF N if a = N A ∧ T (mi ) 6= N A αF A if a 6= N A ∧ T (mi ) = N A ∆(a, mi ) = (6) αW L if a 6= N A ∧ a ∈ / T (mi ) 0 if a ∈ T (mi ) where F N stands for False New mistake, F A for False Anaphor, and W L for Wrong Link. 4 Feature Sets and Models For creating the coreference resolution models, we designed three feature sets. We will compare the results for the individual models in the evaluation section. Table 1 shows how the features are partitioned into our feature sets. The list of features is inspired by previous work on English and Russian coreference resolution. The feature set I is a reduced version of the features used by Clark and Man- ning [Clark, Manning 2016]. In feature set II we removed some of the word embedding features, and added features of explicit indication of morphological characteristics and 4 Table 1: Feature Sets Feature Model Model Model I II III Word embedding of the head word of the men- + + - tion Average word embedding of all words in the + - - mention Cosine similarity between the vectors of men- - - + tions heads Gender + + + Number + + + Animacy + + + Exact string match + + + Head string match + + + Partial string match + + + Distance between the mentions in intervening + + + mentions Distance between the mentions in sentences - + + Gender match - + + Number match - + + Animacy match - + + Both mentions are proper nouns - + + Both mentions are pronouns - + + Antecedent is a pronoun - + + Anafor is a pronoun - + + POS-tag agreement, which is highly relevant for Russian as a morphologically rich lan- guage3 . And in feature set III the cosine similarity between the vectors of mentions heads completely replaces the word embeddings. For each mention, we build a feature vector. A mention consisting of one word, is represented by a single vector (word embedding). If the mention consists of a group of words, is represented by the average of the embedding vectors of each word in the group. We use word embeddings pre-trained with FastText on the Wikipedia corpus4 . 5 Experiments and Results This section describes the experimental setup, especially the dataset, and provides the results of the evaluations for the three models introduced in Section 4 in comparison with existing work. Finally, we discuss some of the difficulties and limitations observed with the current architecture. 3 Morphological annotation, lemmatization and word embedding models were borrowed from[Kovriguina et al. 2017] 4 https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md 5 5.1 Experimental Setup With regards to the dataset, we used the RuCor dataset to train and test the coreference resolution system. RuCor is the first open corpus with coreference annotations for the Russian language. It comprises short texts and text fragments of different genres: news, fiction, scientific papers, etc. All the texts are tokenized, split into sentences and parsed syntactically and morphologically. In the corpus, mentions are limited to NPs that refer to real-world entities. Thus, abstract and generic NPs as well as bridging relations and coreference relations with a split antecedent are not annotated. The RuCor dataset con- tains 181 texts with 156637 tokens and 3638 coreference chains. Based on these numbers, in model training we used a 60-20-20 percent split to create the training, validation, and test datasets. In terms of the settings of the neural network models, we minimized the training ob- jectives with the Adam and RMSProp optimizers. For regularization, we applied dropout with a rate of 0.3 on the output of each hidden layer. 5.2 Experiments For the feature sets three models of the neural network were developed. All the models were pretrained. Pretraining is performed for an initial setup of the weights, and uses binary cross-entropy to compute the loss when training only the mention-ranking model. The results of the experiments are shown in Table 2. Table 2: Experiment Results AUC B 3 metric Precision Recall F1 score Training Model I 0.6538 0.6397 0.6711 0.6550 Model II 0.8280 0.7170 0.7092 0.7131 Model III 0.7870 0.6783 0.6902 0.6842 According to Table 2 Model II shows the best results. We suppose that word em- beddings and explicit indication of the matches in morphological attributes and POS-tags positively influence the results. In Model III word embeddings are replaced with cosine similarity between the embeddings for the mentions’ head words. It appears that is may not be sufficient for the identification of their semantic similarity. Model I has the lowest score which might be explained by the absence of the explicit indication of matches in fea- tures, which, as stated above, lowers the model quality. Moreover, the non-proportional lengths of the vectors for different features, with 300-dimensional vectors for the head word of each mention in a pair and all the words of these mentions, in contrast with only 38-dimensional vector for the other features, might influence the outcome. 5.3 Comparison In Table 3 we compare the results of our system with the state-of-the-art open coreference resolution systems for Russian by Tolodova and Ionov [Toldova, Ionov 2017] and Sysoev 6 et al. [Sysoev et al. 2017]. We compare the B 3 metric on the gold mentions, i.e. mentions from the RuCor corpus (not for mentions automatically extracted from the text). All our models surpass existing work with regards to F1 score, with model II giving the best results. Table 3: Comparison Model B 3 metric Precision Recall F1 score Model I 0.6397 0.6711 0.6550 Model II 0.7170 0.7092 0.7131 Model III 0.6783 0.6902 0.6842 Toldova and Ionov [Toldova, Ionov 2017]: MLUpdated 0.7937 0.4860 0.6029 Toldova and Ionov [Toldova, Ionov 2017]: NamedEntities 0.7937 0.4886 0.6048 Toldova and Ionov [Toldova, Ionov 2017]: Word2vec 0.7925 0.4864 0.6028 Sysoev et al. [Sysoev et al. 2017]: log. regr. + Jaccard Item Set mining 0.6014 0.6103 0.6055 Sysoev et al. [Sysoev et al. 2017]: random forest 0.7389 0.5516 0.6312 The comparison baselines can be briefly described as follows (for details see Tolodova and Ionov [Toldova, Ionov 2017], and Sysoev et al. [Sysoev et al. 2017]): The MLUpdated model implements a ML-based decision tree classifier. The NamedEntities model takes into account semantic information in the form of the lists of possible named entities. This allows it to compare the mentions’ semantic classes. The Word2vec model uses word embeddings for evaluating the semantic compatibility of the mentions’ heads (we used the same feature in our Model I and Model II). Sysoev et al. [Sysoev et al. 2017] use a common set of features, which is fed into various classifiers such as logistic regression with Jaccard Item set mining, or a random forest. 5.4 Error Analysis and Results Discussion Here, we outline some of the problems and errors which have been discovered in the analysis of the predictions of the neural network: 1. Some errors are caused by the wrong annotations in the RuCor coreference corpus, esp. wrong lemmas or morphological attributes in the corpus. For example, for the word “дотком” (dotcom) two different lemmas were found – “доткома” and “доткомом”. 2. Direct speech mistakes: Pronouns "я" (I) and “ты” (you), if used in a dialogue by different speakers, can be coreferential in a certain context. For example, in case of the following dialogue: 7 – “Я сегодня выполнил работу за два дня.” (Today I have done the work for two days.) – “Ты - молодец! ” (You did well!) The pronouns "я" (I) and “ты” (you) have the same referent – the first speaker. However, the neural network makes this kind of mistakes because there is no infor- mation about speakers in the dataset. 3. Context mistakes: They arise when the coreference relation gets evident only after the analysis of the mentions context. For example, the coreference relation between the mentions “Их Сиятельство” (Their Majesty / Highness) and “женщина в черном капоте” (the woman wearing black dressing gown) is not clear without the analysis of the context parts of the text. Such types of mistakes can be explained with the difficulty of formalizing semantic information. One possible solutions for this problem might be the use of word embeddings for a longer context or even for the whole text. 4. Split anafor mistakes: For example, the mentions “они” (they) and “Иван Тихонович и Татьяна Финогеновна” (Ivan Tihonovich and Tatyana Finogenovna) are coreferential. But the network makes the mistake because in the dataset the morphological attributes (gender, number, animacy, etc.) are identified only for the head elements of the mentions. And in the above stated case there are two heads in the second mention and they differ in gender with each other and in number with the head of the first mention. 6 Conclusion We have presented a coreference resolution system implementing the mention-ranking approach, and experimented with different sets of feature combinations and evaluated their impact on system quality with the B 3 metric. Our best model provides an F1 score of 0.7131 on the RuCor dataset, and exceeds existing work by around 8 points. In future work, there are a number of directions to improve model quality: (i) experimenting with other network architectures, (ii) hyperparameter tuning, and (iii) adding more relevant features into the training process, and finally (iv) applying a cluster-ranking approach to capture additional entity-level information. Furthermore, we plan to extend our system with a mention extraction module. Acknowledgments L.Kovriguina acknowledges support from the Russian Fund of Basic Research (RFBR), Grant No. 16-36-60055. Furthermore, the work is supported by the Government of the Russian Federation (Grant 074-U01) through the ITMO Fellowship and Professorship Program. 8 References [Hobbs 1978] Hobbs. J. R (1978) Resolving pronoun references. //Lingua – Volume 44, p. 311–338, 1978. [Boyarski et al. 2013 ] Boyarski K.K., Kanevski E.A., Stepukova A.V. (2013) Viyavle- nie anaforicheskih otnosheni pri avtomaticheskom analize teksta [Identification of anaphoric relations in automatic text analysis]. //Nauchno-tehnicheski vestnik in- formacionnyh tehnologi, mehaniki i optiki [Scientific and Technical Herald of Infor- mation Technologies, Mechanics and Optics], s. 108–112, 2013. (In Russian) = Боярский К. К., Каневский Е. А., Степукова А. В. Выявление анафорических отношений при автоматическом анализе текста. //Научно-технический вестник информационных технологий, механики и оптики, c. 108–112, 2013. [Rahman, Ng 2009] Rahman A., Ng V. (2009) Supervised models for coreference resolu- tion. //Proceedings of the 2009 conference on empirical methods in natural language processing, p. 968–77. Singapore, 2009. [Ng 2008] Ng V. (2008) Unsupervised models for coreference resolution. //Proceedings of the 2008 conference on empirical methods in natural language processing, p. 640–9. Honolulu, 2008. [Clark, Manning 2016] Clark K., Manning C. D. (2016) Improving Coreference Resolu- tion by Learning Entity-Level Distributed Representations. //Association for Compu- tational Linguistics Proceedings – Volume 1, p. 643–653. Berlin, 2016. [Toldova et al. 2014] Toldova S. Ju., Roytberg A., Nedoluzhko A., Kurzukov M., Lady- gina A., Vasilyeva M., Azerkovich I., Grishina Y., Sim G., Ivanova A., Gorshkov D. (2014) Evaluating Anaphora and Coreference Resolution for Russian. //Computational Linguistics and Intellectual Technologies: “DIALOG 2014”, p. 681–695. – M.: RGGU, 2014. [Toldova, Ionov 2017] Toldova S., Ionov M. (2017) Coreference Resolution for Russian: The Impact of Semantic Features. //Computational Linguistics and Intellectual Tech- nologies. International Conference "Dialog 2017" Proceedings, p. 339–349. – М.: М., 2017. [Sysoev et al. 2017] Sysoev A., Andrianov I., Khadzhiiskaia A. (2017) Coreference Res- olution in Russian: State-of-the-Art Approaches Application and Evolvement. //Com- putational Linguistics and Intellectual Technologies. International Conference "Dialog 2017" Proceedings, 16(23):327–347. [Nedoluzhko, Lapshinova-Koltunski 2016] Nedoluzhko A., Lapshinova-Koltunski E. (2016) Abstract Coreference in a Multilingual Perspective: a View on Czech and German. //Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016), p.47–52. [Bagga, Baldwin 1998] Bagga A., Baldwin B. (1998) Entity-based Cross-document Coreferencing Using the Vector Space Model //Proceedings of ACL’98, 1998, Mon- treal, Quebec, Canada, p.79–85. 9 [Amigo et al. 2009] Amigo E., Gonzalo J., Artiles J., Verdejo F. (2009) A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints //Information Retrieval – Volume 12, p. 461–486, 2009. [Kovriguina et al. 2017] Kovriguina, L., Shilin, I., Putintseva, A., Shipilo, A. (2017) Rus- sian Tagging and Dependency Parsing Models for Stanford CoreNLP Natural Language Toolkit. //International Conference on Knowledge Engineering and the Semantic Web, p.101–111. – Springer, 2017. 10