=Paper=
{{Paper
|id=Vol-2203/130
|storemode=property
|title=A Study on Bilingually Informed Coreference Resolution
|pdfUrl=https://ceur-ws.org/Vol-2203/130.pdf
|volume=Vol-2203
|authors=Michal Novák
|dblpUrl=https://dblp.org/rec/conf/itat/Novak18
}}
==A Study on Bilingually Informed Coreference Resolution==
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 130–137, CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Michal Novák

Michal Novák
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25, CZ-11800 Prague 1
mnovak@ufal.mff.cuni.cz

Abstract: Coreference is a basic means to retain coherence of a text that likely exists in every language. However, languages may differ in how a coreference relation is manifested on the surface. A possible way to measure the extent and nature of such differences is to build a coreference resolution system that operates on a parallel corpus and extracts information from both language sides of the corpus. In this work, we build such a bilingually informed coreference resolution system and apply it to Czech-English data. We compare its performance with a system that learns only from a single language. Our results show that the cross-lingual approach outperforms the monolingual one. They also suggest that a system for Czech can exploit the additional English information more effectively than the other way round. The work concludes with a detailed analysis that tries to reveal the reasons behind these results.

===1 Introduction===

Cross-lingual techniques are becoming ever more popular. Even though they do not circumvent the task of Coreference Resolution (CR), the research is mostly limited to cross-lingual projection. Other cross-lingual techniques remain a largely unexplored area for this task.

One of the yet neglected cross-lingual techniques is called bilingually informed resolution. It is an approach in which decisions in a particular task are made based on information from bilingual parallel data. Parallel texts must be available when a method is trained, but also at test time, that is, when a trained model is applied to new data. In real-world scenarios, the availability of parallel data at test time requires the technique to apply a machine translation service to acquire them (MT-based bilingually informed resolution).

Nevertheless, for limited purposes it may pay off to use human-translated parallel data instead (corpus-based bilingually informed resolution). If it outperforms the monolingual approach, it may be used in building automatically annotated parallel corpora. Such corpora with more reliable annotation could be useful for corpus-driven theoretical research (Footnote 1: in case a cross-lingual origin of the annotation does not matter). Furthermore, it can also be used for automatic processing. For instance, improved resolution on big parallel data might be leveraged in a weakly supervised manner to boost the models trained in a monolingual way.

The present work is concerned with corpus-based bilingually informed CR on Czech-English texts. Specifically, it focuses on resolution of pronouns and zeros, as these are the coreferential expressions whose grammatical and functional properties differ considerably across the languages. For instance, whereas in English most non-living objects are referred to with pronouns in neuter gender (e.g. "it", "its"), genders are distributed more evenly in Czech. Information on Czech genders thus may be useful to filter out English candidates that are highly improbable to be coreferential with the pronoun. By comparing its performance with a monolingual approach and by thorough analysis of the results, our work aims at discovering the extent and nature of such differences.

The paper is structured as follows. After mentioning related work (Section 2), we introduce a coreference resolver (Section 3), both its monolingual and cross-lingual variants. Section 4 describes the dataset used in the experiments in Section 5. Before we conclude, the results of the experiments are thoroughly analyzed (Section 6).

===2 Related Work===

Building a bilingually informed CR system requires a parallel corpus with at least the target-language side annotated with coreference. Even these days, very few such corpora exist, e.g. Prague Czech-English Dependency Treebank 2.0 Coref [14], ParCor 1.0 [9] and parts of OntoNotes 5.0 [19].

It is thus surprising that the peak of popularity of this approach was reached around ten years before these corpora had been published. Harabagiu and Maiorano [10] present a heuristics-based approach to CR. The set of heuristics is expanded by exploiting the transitivity property of coreferential chains in a bootstrapping fashion. Moreover, they expand the heuristics even further, following mention counterparts in translations of source English texts to Romanian with coreference annotation. Mitkov and Barbu [13] adjust a rule-based pronoun coreference resolution system to work on a parallel corpus. After providing a linguistic comparison of English and French pronouns and their behavior in discourse, the authors distill their findings into a set of cross-lingual rules to be integrated into the CR system. In evaluation, they observe improvements in resolution accuracy of up to 5 percentage points compared to the monolingual approach.

As for more recent works, the authors of [5] address the task of overt pronoun resolution in Chinese. Among others, they propose an MT-based bilingually informed approach. A model is built on Chinese coreference, exploiting Chinese features. These are augmented with English features extracted from the Chinese texts machine-translated to English. This allows taking advantage of English nouns' gender and number lists, which according to the authors correspond to the distribution of genders and numbers over Chinese nouns.

The experiments of Novák and Žabokrtský [17], the first ones using bilingually informed CR on Czech-English data, are most relevant to the present work. With a focus on English personal pronouns only, their best cross-lingual configuration managed to outperform the monolingual CR by one F-score point. Taking advantage of a more developed version of their CR system, we extend their work in several directions. First, we explore the potential of this approach for a wider range of English coreferential expressions. Next, we perform experiments in the opposite direction, i.e. Czech CR informed by English. And finally, we provide a very detailed analysis of the results, unveiling the nature of the cross-lingual aid.

===3 Coreference Resolution System===

For coreference resolution we adopt a more developed version of the resolver utilized in [17]. This new version builds on the monolingual Treex CR system [15] and augments it with the cross-lingual extension presented in [17]. The difference between the current system and the system in [17] lies mostly in that it can target a wider range of expressions, it exploits a richer feature set, and the pre-processing stage analyzing the text to the tectogrammatical representation is of higher quality. Instead of listing all the changes, we briefly introduce the monolingual (Section 3.1) and the cross-lingual component (Section 3.2) of Treex CR from scratch (Footnote 2: please refer to [15] for more details on the monolingual component of the system).

====3.1 Monolingual Resolution====

Treex CR operates on the tectogrammatical layer. It is a layer of deep syntax based on the theory of Functional Generative Description [20]. The tectogrammatical representation of a sentence is a dependency tree with rich linguistic features, consisting of the content words only. Furthermore, some surface ellipses are restored at this layer. This includes anaphoric zeros (e.g. zero subjects in Czech, unexpressed arguments of non-finite clauses in both English and Czech) that are introduced in the tectogrammatical layer with a newly established node.

The tectogrammatical layer is also the place where coreference relations should be annotated. A relation is technically represented as a link between two coreferential nodes (Footnote 3: a mention is determined only by its head in tectogrammatics; no mention boundaries are specified. Therefore, it is sufficient for a coreference link to determine only two nodes, the mentions' head nodes): the anaphor (the referring expression) and the antecedent (the referred expression).

Each input text must first be automatically pre-processed up to this level of linguistic annotation. The CR system, based on supervised machine learning, then takes advantage of the information available in the annotation.

Pre-processing. The input text must undergo an analysis producing a tectogrammatical representation of its sentences before coreference resolution is carried out. We use the pipelines for analysis of Czech and English available in the Treex framework [18]. The analysis starts with rule-based tokenization, morphological analysis and part-of-speech tagging (e.g. [21] for Czech), dependency parsing to surface trees (e.g. the MST parser [12] for English) and named entity recognition [22]. In addition, the NADA tool [3] is applied to help distinguish referential and non-referential occurrences of the English pronoun "it".

Tectogrammatical trees are created by a transformation from the surface trees. All function words are made hidden, morpho-syntactic information is transferred, and semantic roles are assigned to tectogrammatical nodes [4]. On the tectogrammatical layer, certain types of ellipsis can be restored. The automatic pre-processing focuses only on restoring nodes that might be anaphoric. Such nodes are added by heuristics based on syntactic structures. The restored nodes include Czech zero subjects and both Czech and English zeros in non-finite clauses, e.g. zero relative pronouns and unexpressed arguments of infinitives, past and present participles.

Model design. Treex CR models coreference in a way that can be easily optimized by supervised learning. Particularly, we use logistic regression with stochastic gradient descent optimization implemented in the Vowpal Wabbit toolkit (Footnote 4: https://github.com/JohnLangford/vowpal_wabbit/wiki). The design of the model employs multiple concepts that have proved to be useful and simple at the same time.

Given an anaphor and a set of antecedent candidates, mention-ranking models [6] are trained to score all the candidates at once. On the one hand, a mention-ranking model is able to capture competition between the candidates; on the other hand, features describe solely the actual mentions, not the whole clusters built up to the moment. Antecedent candidates for an anaphor (both positive and negative) are selected from a context window of a predefined size.

No anaphor detection stage precedes the coreference resolution. Unless another measure was taken, this would lead, for instance, to all occurrences of the pronoun "it" being labeled as referential. Nevertheless, the model determines whether the anaphor is referential jointly with selecting its antecedent. This is ensured by adding a dummy candidate representing solely the anaphor itself. By selecting this candidate, the model claims that the anaphor is in fact non-referential.

Diverse properties of various types of coreferential relations (e.g. different referential scopes of personal and relative pronouns) encouraged us to model individual anaphor types separately. A specialized model is built for (1) personal and possessive pronouns in 3rd person (and zero subjects in Czech), (2) reflexive pronouns, (3) relative pronouns, and (4) zeros in non-finite clauses. Treex CR runs them in a sequence.

Features. The pre-processing stage enriches a raw text with a substantial amount of linguistic information. The feature extraction stage then uses this material to yield features consumable by the learning method. Features are always related to at most two nodes – an anaphor candidate and an antecedent candidate.

The features can be divided into three categories. Firstly, location and distance features indicate the positions of the anaphor and the antecedent candidate in a sentence and their mutual distance in terms of words, clauses and sentences. Secondly, a big group of features reflects (deep) morpho-syntactic aspects of the candidates. It includes the mention head's part-of-speech tag and morphological features (e.g. gender, number, person, case), (deep) syntax features (e.g. dependency relation, semantic role) as well as some features exploiting the structure of the syntactic tree. Many of the features are combined by concatenation or by agreement, i.e. indicating whether the anaphor's value agrees with the antecedent's one. Finally, lexical features focus on the lemmas of the mentions' heads and their parents. These are used directly or through frequencies collected in the large data of the Czech National Corpus [1], indexed in a list of noun-verb collocations. Furthermore, all hypernymous concepts of a mention are extracted as features from ontologies (e.g. WordNet [7]), and named entity labels are also employed.

====3.2 Cross-lingual Extension====

The extension enables bilingually informed CR. Like the monolingual CR, it addresses coreference in one target language at a time. However, instead of data in a single language, it must be fed with parallel data in two languages. Both language sides (Czech and English in this case) of the data must first be pre-processed with the pipelines analyzing the texts up to the tectogrammatical layer. Furthermore, to facilitate access to important information in the other language, the pre-processing stage also seeks alignment between tectogrammatical nodes. The bilingually informed approach then augments the monolingual features with those accessing the other side of the parallel data. The design of the model remains the same as for the monolingual approach.

Alignment. It is central for our cross-lingual approach to have the English and Czech texts aligned on the level of tectogrammatical nodes. The alignment is based on unsupervised word alignment performed by MGIZA++ [8] trained on the data from CzEng 1.0 [4], and projected to the tectogrammatical layer. Furthermore, it is augmented with a supervised method [17] addressing selected coreferential expressions, including potentially anaphoric zeros.

Features. Cross-lingual features describe the nodes aligned to the coreferential candidates in the target language – the anaphor candidate and the antecedent candidate. To collect such nodes, we follow the alignment links connected to these two candidates. For each of the nodes, we take at most one of its aligned counterparts. In this way, we obtain at most two nodes aligned to the pair of potentially coreferential nodes, for which we can extract cross-lingual features. If no aligned counterpart is found, no cross-lingual features are added.

We extract two sets of cross-lingual features:

* aligned_all: it consists of all the features contained in a monolingual set for the given aligned language;
* aligned_coref: it consists of a single indicator feature, assigning the true value only if the two aligned nodes belong to the same coreferential entity. This feature can be activated only if there exists a monolingual coreference resolver for the aligned language. We employ Treex CR and its monolingual models for English and Czech, but any CR system, even a rule-based one, could be used.

We do not manually construct features combining both language sides. Nevertheless, such features are formed automatically by the machine-learning tool Vowpal Wabbit.

===4 Datasets===

We employ the Prague Czech-English Dependency Treebank 2.0 Coref [14, PCEDT 2.0 Coref] to train and test our CR systems. It is a Czech-English parallel corpus consisting of almost 50k sentence pairs (its basic statistics are shown in the upper part of Table 1). The English part of the treebank is based on texts from the Wall Street Journal collected for the Penn Treebank [11]. The Czech part was manually translated from English. All texts have been annotated at multiple layers of linguistic representation up to the tectogrammatical layer.

Although PCEDT 2.0 Coref has been extensively annotated by humans, we strip almost all manual annotations and replace them with the output of the pre-processing pipeline (see Sections 3.1 and 3.2). The only manually annotated information that we retain are the coreferential links.

We do not split the data into train and test sections. All the experiments are conducted using 10-fold cross-validation.

Table 1: Basic and coreference statistics of PCEDT 2.0 Coref.

                        Czech       English
  Sentences             49,208      49,208
  Tokens                1,151,150   1,173,766
  Tecto. nodes          931,846     838,212
  Mentions (total)      183,277     188,685
  Personal pron.        3,038       14,887
  Possessive pron.      3,777       9,186
  Refl. poss. pron.     4,389       —
  Reflexive pron.       1,272       484
  Zero subject          16,875      —
  Zero in nonfin. cl.   6,151       29,759
  Relative pron.        15,198      8,170
  Other                 132,577     126,199

===5 Experiments===

The following experiments compare the performance of the monolingual and the bilingually informed system. Both systems are trained on the PCEDT dataset. All the design choices (except for the feature sets) and hyperparameter values are shared by both systems.

Evaluation measure. We expect different mention types to behave differently in the cross-lingual approach. Standard evaluation metrics (e.g. MUC [23], B3 [2]), however, do not allow for scoring only a subset of mentions. Instead, we use the anaphora score, an anaphor-decomposable measure proposed by [15]. The score consists of three components: precision, recall, and F-score as the harmonic mean of the previous two. While precision expresses the success rate of a system averaged over all mentions labeled by the system as anaphoric, recall averages over all truly anaphoric mentions. A decision on an anaphor candidate is correct if the system correctly labels it as non-anaphoric or if the antecedent found by the system really belongs to the same entity as the anaphor. In the following tables, the three components of the anaphora score are reported in the order P, R, F.

As mentioned in Section 3.1, our CR system consists of four models targeting different types of mentions as anaphors. In evaluation, we split the anaphor candidates into even finer categories, namely: (1) personal pronouns, (2) possessive pronouns, (3) reflexive possessive pronouns, (4) reflexive pronouns – all four types of pronouns in the 3rd or ambiguous person – (5) zero subjects, (6) zeros in non-finite clauses, and (7) relative pronouns (the statistics of coreferential mentions are collected in the bottom part of Table 1). Driven by the findings of an analysis of Czech-English correspondences [16], these expressions are very interesting from a cross-lingual point of view, as they often transform to a different type or carry different grammatical properties when translated. We assume this aspect is not so significant in the case of nominal groups, for instance, which represent the majority of the remaining mentions. The other types, grouped under the category Other, are demonstrative pronouns, pronouns in 1st and 2nd person, etc. This category of anaphors is not targeted by our CR method.

Bilingually informed vs. Monolingual CR. Table 2 lists the anaphora scores measured on the output of 10-fold cross-validation. Overall, the cross-lingual models succeed in exploiting additional knowledge from the parallel data and perform better than the monolingual approach. The F-score improvement benefits mainly from a rise in precision, but recall also gets improved. In both languages, personal and possessive pronouns are the types that exhibit the greatest improvement. In Czech, the top-scoring mention types include zero subjects, too. Nevertheless, English as an aligned language seems to have a stronger impact on resolution in Czech (the difference between the systems is 2.5 F-score points) than Czech has on resolution in English (a difference of 1.2 F-score points).

Table 2: Anaphora scores (P R F) of monolingual and bilingually informed coreference resolution.

                 Czech monoling      Czech biling        English monoling    English biling
  Personal       63.84 61.24 62.51   67.82 64.38 66.06   76.34 71.37 73.77   78.57 72.64 75.49
  Possessive     71.93 71.51 71.72   75.73 74.85 75.29   80.07 79.54 79.81   81.46 81.00 81.23
  Refl. poss.    85.61 85.42 85.52   87.70 87.04 87.36   —                   —
  Reflexive      66.91 56.60 61.33   67.24 55.66 60.90   77.31 72.67 74.92   75.88 71.01 73.37
  Zero subj.     73.18 55.46 63.10   78.88 57.64 66.61   —                   —
  Zero nonfin.   78.98 41.51 54.42   81.52 42.63 55.98   71.48 54.62 61.92   73.31 54.75 62.68
  Relative       81.51 79.94 80.72   83.48 81.62 82.54   83.47 76.23 79.69   85.76 77.13 81.21
  Total          76.83 65.17 70.52   80.27 67.09 73.09   75.93 65.26 70.19   77.85 65.95 71.41

===6 Analysis of the Results===

The results of the experiments undoubtedly show the superiority of the cross-lingual CR over the monolingual one. Here, we delve more into the comparison of these two approaches. Firstly, we conduct a quantitative analysis of the resolvers' decisions. It should show how many decision changes the switch to the cross-lingual approach introduces for individual mention types and what the role of anaphoricity in these changes is. Secondly, we inspect randomly sampled examples in a qualitative analysis. We attempt to disclose what the typical examples are when the system benefits from the other language and, on the other hand, whether there is a systematic case when the cross-lingual approach hurts.

====6.1 Quantitative Analysis====

We compare the decisions of the monolingual system (M) and the cross-lingual system (C). Every anaphor candidate falls into one of four categories: both systems decided correctly (Both ✓), both decided incorrectly (Both ✗), or their decisions differ:

* M's decision was correct while C's decision was incorrect (M > C);
* C's decision was correct while M's decision was incorrect (M < C).

A decision is either the assignment of the anaphor candidate to a coreferential entity (Footnote 5: some of the anaphors that were assigned to the same entity (columns Both ✓ and Both ✗) may have in fact been paired with different antecedents by each of the CR algorithms. As our anaphora score is agnostic to such changes, we do not distinguish such cases) or labeling it as non-anaphoric. The tables also distinguish whether the candidate is in fact anaphoric or non-anaphoric. Numbers in the tables represent the proportions (in %) of these categories aggregated over all instances. Every row thus sums to 100%.

[Tables of changed decisions per mention type, with columns Both ✓, Both ✗, M > C and M < C for both anaphoric and non-anaphoric candidates; the numbers were not recoverable from the source text.]

Conditioning on anaphoricity allows us to directly relate this analysis to the anaphora scores shown in Table 2. Note that while resolution on anaphoric mentions may have an effect on both the precision and the recall component of the anaphora score, resolution on non-anaphoric mentions affects only the precision.

Changed decisions account for around 10% in both Czech and English. More importantly, whereas we see over 7% of decisions changed positively in Czech, the corresponding figure in English is 5.5%. This accords with the extent of improvement observed on the anaphora score. In Czech, the difference between improved and worsened decisions is only a bit higher for anaphoric mentions. It means that the positive effect of English on resolution […]

Only where worsened decisions prevail (M > C) does the resolution deteriorate with cross-lingual features. The systems' decisions differ the least for Czech reflexive possessives (7%) and English relative pronouns (6%). Here, we also observe a varied effect on the anaphora score. While the resolution of Czech reflexive possessives is hardly improved by English features, the small amount of changed decisions on English relative pronouns suffices to achieve one of the biggest improvements among English coreferential expressions.

The anaphora scores in Table 2 have already shown that basic reflexive pronouns are the only mention type where the cross-lingual approach falls behind the monolingual one. The quantitative analysis of changed decisions confirms it, especially for anaphoric occurrences.

The gains of the Czech cross-lingual system on non-anaphoric mentions can be attributed mostly to zeros. Also thanks to the resolution on non-anaphoric mentions, the highest margin between the proportion of improved and worsened instances (5%) is observed on Czech zero subjects. It leads to one of the biggest improvements in terms of the anaphora F-score (see Table 2).

====6.2 Qualitative Analysis====

In the following, we scrutinize more closely the typical cases where the cross-lingual system makes a different decision.
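The mention-ranking design with a dummy candidate (Section 3.1) can be sketched as follows. This is an illustrative toy, not the actual Treex CR implementation: the feature names and weights are invented, and the real system learns its weights with Vowpal Wabbit rather than using hand-set ones.

```python
# Sketch of a mention-ranking model: score every antecedent candidate
# of an anaphor at once; index 0 is a dummy candidate whose selection
# means "the anaphor is non-referential". Features are toy indicators.

def score(weights, features):
    """Linear score of one (anaphor, candidate) feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def rank_candidates(weights, candidate_features):
    """candidate_features: one feature dict per candidate, dummy first.
    Returns the index of the best-scoring candidate."""
    scores = [score(weights, f) for f in candidate_features]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy weights: gender/number agreement pushes a candidate up; a bias on
# the dummy models the prior of a non-referential pronoun such as "it".
weights = {"dummy_bias": 0.5, "gender_agree": 1.2, "number_agree": 0.8}

candidates = [
    {"dummy_bias": 1.0},                         # dummy: non-referential reading
    {"gender_agree": 1.0, "number_agree": 1.0},  # candidate in previous clause
    {"number_agree": 1.0},                       # candidate two sentences back
]
print(rank_candidates(weights, candidates))  # 1: the fully agreeing candidate wins
```

When no real candidate outscores the dummy, index 0 is returned and the anaphor is treated as non-anaphoric, which is how the model folds anaphoricity detection into antecedent selection.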
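Section 3.2 notes that combined bilingual features are formed automatically by Vowpal Wabbit rather than by hand. One way VW supports this is through its plain-text input format, where features are grouped into namespaces and quadratic interactions between namespaces can be requested at training time (e.g. `vw -q ma`). The encoder below is a hypothetical sketch of that idea, not the actual Treex CR feature encoding; all feature names are invented.

```python
# Format one candidate as a Vowpal Wabbit input line with two
# namespaces: 'm' (monolingual features) and 'a' (aligned-language
# features). Training with `-q ma` would then cross the namespaces,
# forming conjoined bilingual features automatically.

def vw_example(label, mono_feats, cross_feats):
    """label: training signal; *_feats: dicts of feature -> value."""
    def ns(name, feats):
        pairs = " ".join(f"{k}:{v}" for k, v in sorted(feats.items()))
        return f"|{name} {pairs}"
    return f"{label} {ns('m', mono_feats)} {ns('a', cross_feats)}"

line = vw_example(1, {"gender_agree": 1}, {"aligned_coref": 1})
print(line)  # 1 |m gender_agree:1 |a aligned_coref:1
```

The namespace crossing is what lets a single monolingual feature (say, gender agreement) interact with an aligned-language signal without anyone enumerating the combinations by hand.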
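The cross-lingual feature extraction of Section 3.2 — follow alignment links from the anaphor and antecedent candidates, take at most one counterpart each, and emit the aligned_all and aligned_coref feature sets — can be sketched like this. Node names, the `mono_features` callback, and the `entity_of` lookup are all hypothetical stand-ins for the corresponding Treex CR machinery.

```python
# Sketch of cross-lingual feature extraction for one candidate pair.
# aligned_all: the aligned language's monolingual features, re-prefixed.
# aligned_coref: fires only if a monolingual resolver for the other
# language placed the two aligned nodes in the same coreferential entity.

def cross_lingual_features(anaph, ante, alignment, mono_features, entity_of):
    """alignment: node -> aligned node (or missing);
    mono_features: (anaph, ante) -> feature dict in the other language;
    entity_of: node -> entity id assigned by the other language's resolver."""
    al_anaph = alignment.get(anaph)
    al_ante = alignment.get(ante)
    if al_anaph is None or al_ante is None:
        return {}  # no aligned counterpart: add no cross-lingual features
    feats = {"aligned_all__" + k: v
             for k, v in mono_features(al_anaph, al_ante).items()}
    ent = entity_of(al_anaph)
    if ent is not None and ent == entity_of(al_ante):
        feats["aligned_coref"] = 1.0
    return feats
```

Returning an empty dict when alignment fails mirrors the paper's choice to simply add no cross-lingual features in that case, so the model degrades gracefully to its monolingual evidence.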
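The anaphora score used in Section 5 can be sketched under a simplified reading of its definition: precision averages correct decisions over mentions the system labels anaphoric, recall over truly anaphoric mentions, and F is their harmonic mean (the exact definition in [15] may differ in details). The candidate encoding below is a toy.

```python
# Sketch of the anaphora score. Each candidate records the entity id
# chosen by the system ('sys', None = labeled non-anaphoric) and the
# true entity id ('gold', None = truly non-anaphoric).

def anaphora_score(candidates):
    """Returns (precision, recall, F) in percent."""
    sys_pos = [c for c in candidates if c["sys"] is not None]
    gold_pos = [c for c in candidates if c["gold"] is not None]
    correct = [c for c in sys_pos if c["sys"] == c["gold"]]
    p = len(correct) / len(sys_pos) if sys_pos else 0.0
    r = len(correct) / len(gold_pos) if gold_pos else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return 100 * p, 100 * r, 100 * f
```

As a sanity check on Table 2: the Czech monolingual Total row reports P = 76.83 and R = 65.17, and the harmonic mean 2 · 76.83 · 65.17 / (76.83 + 65.17) ≈ 70.52 indeed matches the reported F.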