=Paper=
{{Paper
|id=Vol-2203/130
|storemode=property
|title=A Study on Bilingually Informed Coreference Resolution
|pdfUrl=https://ceur-ws.org/Vol-2203/130.pdf
|volume=Vol-2203
|authors=Michal Novák
|dblpUrl=https://dblp.org/rec/conf/itat/Novak18
}}
==A Study on Bilingually Informed Coreference Resolution==
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 130–137, CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Michal Novák

Michal Novák
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25, CZ-11800 Prague 1
mnovak@ufal.mff.cuni.cz

Abstract: Coreference is a basic means to retain coherence of a text that likely exists in every language. However, languages may differ in how a coreference relation is manifested on the surface. A possible way to measure the extent and nature of such differences is to build a coreference resolution system that operates on a parallel corpus and extracts information from both language sides of the corpus. In this work, we build such a bilingually informed coreference resolution system and apply it to Czech-English data. We compare its performance with a system that learns only from a single language. Our results show that the cross-lingual approach outperforms the monolingual one. They also suggest that a system for Czech can exploit the additional English information more effectively than the other way round. The work concludes with a detailed analysis that tries to reveal the reasons behind these results.

===1 Introduction===

Cross-lingual techniques are becoming ever more popular. Even though they do not circumvent the task of Coreference Resolution (CR), the research is mostly limited to cross-lingual projection. Other cross-lingual techniques remain a largely unexplored area for this task.

One of the yet neglected cross-lingual techniques is called bilingually informed resolution. It is an approach in which decisions in a particular task are made based on information from bilingual parallel data. Parallel texts must be available when a method is trained, but also at test time, that is, when a trained model is applied to new data. In real-world scenarios, the availability of parallel data at test time requires the technique to apply a machine translation service to acquire them (MT-based bilingually informed resolution).

Nevertheless, for limited purposes it may pay off to use human-translated parallel data instead (corpus-based bilingually informed resolution). If it outperforms the monolingual approach, it may be used in building automatically annotated parallel corpora. Such corpora with more reliable annotation could be useful for corpus-driven theoretical research (Footnote 1: in case a cross-lingual origin of the annotation does not matter). Furthermore, it can also be used for automatic processing. For instance, improved resolution on big parallel data might be leveraged in a weakly supervised manner to boost the models trained in a monolingual way.

The present work is concerned with corpus-based bilingually informed CR on Czech-English texts. Specifically, it focuses on resolution of pronouns and zeros, as these are the coreferential expressions whose grammatical and functional properties differ considerably across the languages. For instance, whereas in English most non-living objects are referred to with pronouns in neuter gender (e.g. "it", "its"), genders are distributed more evenly in Czech. Information on Czech genders thus may be useful to filter out English candidates that are highly improbable to be coreferential with the pronoun. By comparing its performance with a monolingual approach and by thorough analysis of the results, our work aims at discovering the extent and nature of such differences.

The paper is structured as follows. After mentioning related work (Section 2), we introduce a coreference resolver (Section 3), both its monolingual and cross-lingual variants. Section 4 describes the dataset used in the experiments in Section 5. Before we conclude, the results of the experiments are thoroughly analyzed (Section 6).

===2 Related Work===

Building a bilingually informed CR system requires a parallel corpus with at least the target-language side annotated with coreference. Even these days, very few such corpora exist, e.g. Prague Czech-English Dependency Treebank 2.0 Coref [14], ParCor 1.0 [9] and parts of OntoNotes 5.0 [19].

It is thus surprising that the peak of popularity of this approach was reached around ten years before these corpora had been published. Harabagiu and Maiorano [10] present a heuristics-based approach to CR. The set of heuristics is expanded by exploiting the transitivity property of coreferential chains in a bootstrapping fashion. Moreover, they expand the heuristics even further, following mention counterparts in translations of source English texts to Romanian with coreference annotation. Mitkov and Barbu [13] adjust a rule-based pronoun coreference resolution system to work on a parallel corpus. After providing a linguistic comparison of English and French pronouns and their behavior in discourse, the authors distill their findings into a set of cross-lingual rules to be integrated into the CR system. In evaluation, they observe improvements in resolution accuracy of up to 5 percentage points compared to the monolingual approach.

As for more recent works, the authors of [5] address the task of overt pronoun resolution in Chinese. Among others, they propose an MT-based bilingually informed approach. A model is built on Chinese coreference, exploiting Chinese features. These are augmented with English features extracted from the Chinese texts machine-translated to English. This allows taking advantage of English nouns' gender and number lists, which according to the authors correspond to the distribution of genders and numbers over Chinese nouns.

The experiments of Novák and Žabokrtský [17], the first ones using bilingually informed CR on Czech-English data, are most relevant to the present work. With a focus on English personal pronouns only, their best cross-lingual configuration managed to outperform the monolingual CR by one F-score point. Taking advantage of a more developed version of their CR system, we extend their work in several directions. First, we explore the potential of this approach for a wider range of English coreferential expressions. Next, we perform experiments in the opposite direction, i.e. Czech CR informed by English. And finally, we provide a very detailed analysis of the results, unveiling the nature of the cross-lingual aid.

===3 Coreference Resolution System===

For coreference resolution we adopt a more developed version of the resolver utilized in [17]. This new version builds on the monolingual Treex CR system [15] and augments it with the cross-lingual extension presented in [17]. The difference between the current system and the system in [17] lies mostly in that it can target a wider range of expressions, it exploits a richer feature set, and the pre-processing stage analyzing the text to the tectogrammatical representation is of higher quality. Instead of listing all the changes, we briefly introduce the monolingual (Section 3.1) and the cross-lingual component (Section 3.2) of Treex CR from scratch (Footnote 2: please refer to [15] for more details on the monolingual component of the system).

====3.1 Monolingual Resolution====

Treex CR operates on the tectogrammatical layer. It is a layer of deep syntax based on the theory of Functional Generative Description [20]. The tectogrammatical representation of a sentence is a dependency tree with rich linguistic features, consisting of the content words only. Furthermore, some surface ellipses are restored at this layer. This includes anaphoric zeros (e.g. zero subjects in Czech, unexpressed arguments of non-finite clauses in both English and Czech) that are introduced in the tectogrammatical layer with a newly established node.

The tectogrammatical layer is also the place where coreference relations should be annotated. A relation is technically represented as a link between two coreferential nodes (Footnote 3: a mention is determined only by its head in tectogrammatics; no mention boundaries are specified. Therefore, it is sufficient for a coreference link to determine only two nodes, the mentions' head nodes): the anaphor (the referring expression) and the antecedent (the referred expression).

Each input text must first be automatically pre-processed up to this level of linguistic annotation. The CR system, based on supervised machine learning, then takes advantage of the information available in the annotation.

Pre-processing. The input text must undergo an analysis producing a tectogrammatical representation of its sentences before coreference resolution is carried out. We use the pipelines for analysis of Czech and English available in the Treex framework [18]. The analysis starts with rule-based tokenization, morphological analysis and part-of-speech tagging (e.g. [21] for Czech), dependency parsing to surface trees (e.g. the MST parser [12] for English) and named entity recognition [22]. In addition, the NADA tool [3] is applied to help distinguish referential and non-referential occurrences of the English pronoun "it".

Tectogrammatical trees are created by a transformation from the surface trees. All function words are made hidden, morpho-syntactic information is transferred, and semantic roles are assigned to tectogrammatical nodes [4]. On the tectogrammatical layer, certain types of ellipsis can be restored. The automatic pre-processing focuses only on restoring nodes that might be anaphoric. Such nodes are added by heuristics based on syntactic structures. The restored nodes include Czech zero subjects and both Czech and English zeros in non-finite clauses, e.g. zero relative pronouns and unexpressed arguments of infinitives, past and present participles.

Model design. Treex CR models coreference in a way that can be easily optimized by supervised learning. Particularly, we use logistic regression with stochastic gradient descent optimization implemented in the Vowpal Wabbit toolkit (Footnote 4: https://github.com/JohnLangford/vowpal_wabbit/wiki). The design of the model employs multiple concepts that have proved to be useful and simple at the same time.

Given an anaphor and a set of antecedent candidates, mention-ranking models [6] are trained to score all the candidates at once. On the one hand, a mention-ranking model is able to capture competition between the candidates; on the other hand, features describe solely the actual mentions, not the whole clusters built up to the moment. Antecedent candidates for an anaphor (both positive and negative) are selected from a context window of a predefined size.

No anaphor detection stage precedes the coreference resolution. Unless another measure was taken, this would lead, for instance, to all occurrences of the pronoun "it" being labeled as referential. Nevertheless, the model determines whether the anaphor is referential jointly with selecting its antecedent. This is ensured by adding a dummy candidate representing solely the anaphor itself. By selecting this candidate, the model claims that the anaphor is in fact non-referential.

Diverse properties of various types of coreferential relations (e.g. different referential scopes of personal and relative pronouns) encouraged us to model individual anaphor types separately. A specialized model is built for (1) personal and possessive pronouns in 3rd person (and zero subjects in Czech), (2) reflexive pronouns, (3) relative pronouns, and (4) zeros in non-finite clauses. Treex CR runs them in a sequence.

Features. The pre-processing stage enriches a raw text with a substantial amount of linguistic information. The feature extraction stage then uses this material to yield features consumable by the learning method. Features are always related to at most two nodes – an anaphor candidate and an antecedent candidate.

The features can be divided into three categories. Firstly, location and distance features indicate the positions of the anaphor and the antecedent candidate in a sentence and their mutual distance in terms of words, clauses and sentences. Secondly, a big group of features reflects (deep) morpho-syntactic aspects of the candidates. It includes the mention head's part-of-speech tag and morphological features (e.g. gender, number, person, case), (deep) syntax features (e.g. dependency relation, semantic role) as well as some features exploiting the structure of the syntactic tree. Many of the features are combined by concatenation or by agreement, i.e. indicating whether the anaphor's value agrees with the antecedent's one. Finally, lexical features focus on the lemmas of the mentions' heads and their parents. These are used directly or through frequencies collected in the large data of the Czech National Corpus [1], indexed in a list of noun-verb collocations. Furthermore, all hypernymous concepts of a mention are extracted as features from ontologies (e.g. WordNet [7]), and named entity labels are also employed.

====3.2 Cross-lingual Extension====

The extension enables bilingually informed CR. Like the monolingual CR, it addresses coreference in one target language at a time. However, instead of data in a single language, it must be fed with parallel data in two languages. Both language sides (Czech and English in this case) of the data must first be pre-processed with the pipelines analyzing the texts up to the tectogrammatical layer. Furthermore, to facilitate access to important information in the other language, the pre-processing stage also seeks alignment between tectogrammatical nodes. The bilingually informed approach then augments the monolingual features with those accessing the other side of the parallel data. The design of the model remains the same as for the monolingual approach.

Alignment. It is central for our cross-lingual approach to have the English and Czech texts aligned on the level of tectogrammatical nodes. The alignment is based on unsupervised word alignment performed by MGIZA++ [8] trained on the data from CzEng 1.0 [4], and projected to the tectogrammatical layer. Furthermore, it is augmented with a supervised method [17] addressing selected coreferential expressions, including potentially anaphoric zeros.

Features. Cross-lingual features describe the nodes aligned to the coreferential candidates in the target language – the anaphor candidate and the antecedent candidate. To collect such nodes, we follow the alignment links connected to these two candidates. For each of the nodes, we take at most one of its aligned counterparts. In this way, we obtain at most two nodes aligned to the pair of potentially coreferential nodes, for which we can extract cross-lingual features. If no aligned counterpart is found, no cross-lingual features are added.

We extract two sets of cross-lingual features:

* aligned_all: it consists of all the features contained in a monolingual set for the given aligned language;
* aligned_coref: it consists of a single indicator feature, assigning the true value only if the two aligned nodes belong to the same coreferential entity. This feature can be activated only if there exists a monolingual coreference resolver for the aligned language. We employ Treex CR and its monolingual models for English and Czech, but any CR system, even a rule-based one, could be used.

We do not manually construct features combining both language sides. Nevertheless, such features are formed automatically by the machine-learning tool Vowpal Wabbit.

===4 Datasets===

We employ the Prague Czech-English Dependency Treebank 2.0 Coref [14, PCEDT 2.0 Coref] to train and test our CR systems. It is a Czech-English parallel corpus consisting of almost 50k sentence pairs (its basic statistics are shown in the upper part of Table 1). The English part of the treebank is based on texts from the Wall Street Journal collected for the Penn Treebank [11]. The Czech part was manually translated from English. All texts have been annotated at multiple layers of linguistic representation up to the tectogrammatical layer.

Although PCEDT 2.0 Coref has been extensively annotated by humans, we strip almost all manual annotations and replace them with the output of the pre-processing pipeline (see Sections 3.1 and 3.2). The only manually annotated information that we retain are the coreferential links.

We do not split the data into train and test sections. All the experiments are conducted using 10-fold cross-validation.

Table 1: Basic and coreference statistics of PCEDT 2.0 Coref.

                        Czech       English
  Sentences             49,208      49,208
  Tokens                1,151,150   1,173,766
  Tecto. nodes          931,846     838,212
  Mentions (total)      183,277     188,685
  Personal pron.        3,038       14,887
  Possessive pron.      3,777       9,186
  Refl. poss. pron.     4,389       —
  Reflexive pron.       1,272       484
  Zero subject          16,875      —
  Zero in nonfin. cl.   6,151       29,759
  Relative pron.        15,198      8,170
  Other                 132,577     126,199

===5 Experiments===

The following experiments compare the performance of the monolingual and the bilingually informed system. Both systems are trained on the PCEDT dataset. All the design choices (except for the feature sets) and hyperparameter values are shared by both systems.

Evaluation measure. We expect different mention types to behave differently in the cross-lingual approach. Standard evaluation metrics (e.g. MUC [23], B3 [2]), however, do not allow for scoring only a subset of mentions. Instead, we use the anaphora score, an anaphor-decomposable measure proposed by [15]. The score consists of three components: precision, recall, and F-score as the harmonic mean of the previous two. While precision expresses the success rate of a system averaged over all mentions labeled by the system as anaphoric, recall averages over all truly anaphoric mentions. A decision on an anaphor candidate is correct if the system correctly labels it as non-anaphoric or if the antecedent found by the system really belongs to the same entity as the anaphor. In the following tables, the three components of the anaphora score are reported in the order P, R, F.

As mentioned in Section 3.1, our CR system consists of four models targeting different types of mentions as anaphors. In evaluation, we split the anaphor candidates into even finer categories, namely: (1) personal pronouns, (2) possessive pronouns, (3) reflexive possessive pronouns, (4) reflexive pronouns – all four types of pronouns in the 3rd or ambiguous person – (5) zero subjects, (6) zeros in non-finite clauses, and (7) relative pronouns (the statistics of coreferential mentions are collected in the bottom part of Table 1). Driven by the findings of an analysis of Czech-English correspondences [16], these expressions are very interesting from a cross-lingual point of view, as they often transform to a different type or carry different grammatical properties when translated. We assume this aspect is not so significant in the case of nominal groups, for instance, which represent the majority of the remaining mentions. The other types, grouped under the category Other, are demonstrative pronouns, pronouns in 1st and 2nd person, etc. This category of anaphors is not targeted by our CR method.

Bilingually informed vs. Monolingual CR. Table 2 lists the anaphora scores measured on the output of 10-fold cross-validation. Overall, the cross-lingual models succeed in exploiting additional knowledge from the parallel data and perform better than the monolingual approach. The F-score improvement benefits mainly from a rise in precision, but recall also gets improved. In both languages, personal and possessive pronouns are the types that exhibit the greatest improvement. In Czech, the top-scoring mention types include zero subjects, too. Nevertheless, English as an aligned language seems to have a stronger impact on resolution in Czech (the difference between the systems is 2.5 F-score points) than Czech has on resolution in English (a difference of 1.2 F-score points).

Table 2: Anaphora scores (P R F) of monolingual and bilingually informed coreference resolution.

                 Czech monoling      Czech biling        English monoling    English biling
  Personal       63.84 61.24 62.51   67.82 64.38 66.06   76.34 71.37 73.77   78.57 72.64 75.49
  Possessive     71.93 71.51 71.72   75.73 74.85 75.29   80.07 79.54 79.81   81.46 81.00 81.23
  Refl. poss.    85.61 85.42 85.52   87.70 87.04 87.36   —                   —
  Reflexive      66.91 56.60 61.33   67.24 55.66 60.90   77.31 72.67 74.92   75.88 71.01 73.37
  Zero subj.     73.18 55.46 63.10   78.88 57.64 66.61   —                   —
  Zero nonfin.   78.98 41.51 54.42   81.52 42.63 55.98   71.48 54.62 61.92   73.31 54.75 62.68
  Relative       81.51 79.94 80.72   83.48 81.62 82.54   83.47 76.23 79.69   85.76 77.13 81.21
  Total          76.83 65.17 70.52   80.27 67.09 73.09   75.93 65.26 70.19   77.85 65.95 71.41

===6 Analysis of the Results===

The results of the experiments undoubtedly show the superiority of the cross-lingual CR over the monolingual one. Here, we delve more into the comparison of these two approaches. Firstly, we conduct a quantitative analysis of the resolvers' decisions. It should show how many decision changes the switch to the cross-lingual approach introduces for individual mention types and what the role of anaphoricity in these changes is. Secondly, we inspect randomly sampled examples in a qualitative analysis. We attempt to disclose what the typical examples are when the system benefits from the other language and, on the other hand, whether there is a systematic case when the cross-lingual approach hurts.

====6.1 Quantitative Analysis====

We compare the decisions of the monolingual system (M) and the cross-lingual system (C). Every anaphor candidate falls into one of four categories: both systems decided correctly (Both ✓), both decided incorrectly (Both ✗), or their decisions differ:

* M's decision was correct while C's decision was incorrect (M > C);
* C's decision was correct while M's decision was incorrect (M < C).

A decision is either the assignment of the anaphor candidate to a coreferential entity (Footnote 5: some of the anaphors that were assigned to the same entity (columns Both ✓ and Both ✗) may have in fact been paired with different antecedents by each of the CR algorithms. As our anaphora score is agnostic to such changes, we do not distinguish such cases) or labeling it as non-anaphoric. The tables also distinguish whether the candidate is in fact anaphoric or non-anaphoric. Numbers in the tables represent the proportions (in %) of these categories aggregated over all instances. Every row thus sums to 100%.

[Tables of changed decisions per mention type, with columns Both ✓, Both ✗, M > C and M < C for both anaphoric and non-anaphoric candidates; the numbers were not recoverable from the source text.]

Conditioning on anaphoricity allows us to directly relate this analysis to the anaphora scores shown in Table 2. Note that while resolution on anaphoric mentions may have an effect on both the precision and the recall component of the anaphora score, resolution on non-anaphoric mentions affects only the precision.

Changed decisions account for around 10% in both Czech and English. More importantly, whereas we see over 7% of decisions changed positively in Czech, the corresponding figure in English is 5.5%. This accords with the extent of improvement observed on the anaphora score. In Czech, the difference between improved and worsened decisions is only a bit higher for anaphoric mentions. It means that the positive effect of English on resolution […]

Only where worsened decisions prevail (M > C) does the resolution deteriorate with cross-lingual features. The systems' decisions differ the least for Czech reflexive possessives (7%) and English relative pronouns (6%). Here, we also observe a varied effect on the anaphora score. While the resolution of Czech reflexive possessives is hardly improved by English features, the small amount of changed decisions on English relative pronouns suffices to achieve one of the biggest improvements among English coreferential expressions.

The anaphora scores in Table 2 have already shown that basic reflexive pronouns are the only mention type where the cross-lingual approach falls behind the monolingual one. The quantitative analysis of changed decisions confirms it, especially for anaphoric occurrences.

The gains of the Czech cross-lingual system on non-anaphoric mentions can be attributed mostly to zeros. Also thanks to the resolution on non-anaphoric mentions, the highest margin between the proportion of improved and worsened instances (5%) is observed on Czech zero subjects. It leads to one of the biggest improvements in terms of the anaphora F-score (see Table 2).

====6.2 Qualitative Analysis====

In the following, we scrutinize more closely the typical cases where the cross-lingual system makes a different decision.
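The mention-ranking design with a dummy candidate (Section 3.1) can be sketched as follows. This is an illustrative toy, not the actual Treex CR implementation: the feature names and weights are invented, and the real system learns its weights with Vowpal Wabbit rather than using hand-set ones.

```python
# Sketch of a mention-ranking model: score every antecedent candidate
# of an anaphor at once; index 0 is a dummy candidate whose selection
# means "the anaphor is non-referential". Features are toy indicators.

def score(weights, features):
    """Linear score of one (anaphor, candidate) feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def rank_candidates(weights, candidate_features):
    """candidate_features: one feature dict per candidate, dummy first.
    Returns the index of the best-scoring candidate."""
    scores = [score(weights, f) for f in candidate_features]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy weights: gender/number agreement pushes a candidate up; a bias on
# the dummy models the prior of a non-referential pronoun such as "it".
weights = {"dummy_bias": 0.5, "gender_agree": 1.2, "number_agree": 0.8}

candidates = [
    {"dummy_bias": 1.0},                         # dummy: non-referential reading
    {"gender_agree": 1.0, "number_agree": 1.0},  # candidate in previous clause
    {"number_agree": 1.0},                       # candidate two sentences back
]
print(rank_candidates(weights, candidates))  # 1: the fully agreeing candidate wins
```

When no real candidate outscores the dummy, index 0 is returned and the anaphor is treated as non-anaphoric, which is how the model folds anaphoricity detection into antecedent selection.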
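Section 3.2 notes that combined bilingual features are formed automatically by Vowpal Wabbit rather than by hand. One way VW supports this is through its plain-text input format, where features are grouped into namespaces and quadratic interactions between namespaces can be requested at training time (e.g. `vw -q ma`). The encoder below is a hypothetical sketch of that idea, not the actual Treex CR feature encoding; all feature names are invented.

```python
# Format one candidate as a Vowpal Wabbit input line with two
# namespaces: 'm' (monolingual features) and 'a' (aligned-language
# features). Training with `-q ma` would then cross the namespaces,
# forming conjoined bilingual features automatically.

def vw_example(label, mono_feats, cross_feats):
    """label: training signal; *_feats: dicts of feature -> value."""
    def ns(name, feats):
        pairs = " ".join(f"{k}:{v}" for k, v in sorted(feats.items()))
        return f"|{name} {pairs}"
    return f"{label} {ns('m', mono_feats)} {ns('a', cross_feats)}"

line = vw_example(1, {"gender_agree": 1}, {"aligned_coref": 1})
print(line)  # 1 |m gender_agree:1 |a aligned_coref:1
```

The namespace crossing is what lets a single monolingual feature (say, gender agreement) interact with an aligned-language signal without anyone enumerating the combinations by hand.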
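The cross-lingual feature extraction of Section 3.2 — follow alignment links from the anaphor and antecedent candidates, take at most one counterpart each, and emit the aligned_all and aligned_coref feature sets — can be sketched like this. Node names, the `mono_features` callback, and the `entity_of` lookup are all hypothetical stand-ins for the corresponding Treex CR machinery.

```python
# Sketch of cross-lingual feature extraction for one candidate pair.
# aligned_all: the aligned language's monolingual features, re-prefixed.
# aligned_coref: fires only if a monolingual resolver for the other
# language placed the two aligned nodes in the same coreferential entity.

def cross_lingual_features(anaph, ante, alignment, mono_features, entity_of):
    """alignment: node -> aligned node (or missing);
    mono_features: (anaph, ante) -> feature dict in the other language;
    entity_of: node -> entity id assigned by the other language's resolver."""
    al_anaph = alignment.get(anaph)
    al_ante = alignment.get(ante)
    if al_anaph is None or al_ante is None:
        return {}  # no aligned counterpart: add no cross-lingual features
    feats = {"aligned_all__" + k: v
             for k, v in mono_features(al_anaph, al_ante).items()}
    ent = entity_of(al_anaph)
    if ent is not None and ent == entity_of(al_ante):
        feats["aligned_coref"] = 1.0
    return feats
```

Returning an empty dict when alignment fails mirrors the paper's choice to simply add no cross-lingual features in that case, so the model degrades gracefully to its monolingual evidence.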
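The anaphora score used in Section 5 can be sketched under a simplified reading of its definition: precision averages correct decisions over mentions the system labels anaphoric, recall over truly anaphoric mentions, and F is their harmonic mean (the exact definition in [15] may differ in details). The candidate encoding below is a toy.

```python
# Sketch of the anaphora score. Each candidate records the entity id
# chosen by the system ('sys', None = labeled non-anaphoric) and the
# true entity id ('gold', None = truly non-anaphoric).

def anaphora_score(candidates):
    """Returns (precision, recall, F) in percent."""
    sys_pos = [c for c in candidates if c["sys"] is not None]
    gold_pos = [c for c in candidates if c["gold"] is not None]
    correct = [c for c in sys_pos if c["sys"] == c["gold"]]
    p = len(correct) / len(sys_pos) if sys_pos else 0.0
    r = len(correct) / len(gold_pos) if gold_pos else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return 100 * p, 100 * r, 100 * f
```

As a sanity check on Table 2: the Czech monolingual Total row reports P = 76.83 and R = 65.17, and the harmonic mean 2 · 76.83 · 65.17 / (76.83 + 65.17) ≈ 70.52 indeed matches the reported F.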