=Paper=
{{Paper
|id=Vol-2794/paper8
|storemode=property
|title=Confirming the Generalizability of a Chain-Based Animacy Detector
|pdfUrl=https://ceur-ws.org/Vol-2794/paper8.pdf
|volume=Vol-2794
|authors=Labiba Jahan,W. Victor H. Yarlott,Rahul Mittal,Mark A. Finlayson
|dblpUrl=https://dblp.org/rec/conf/ijcai/JahanYMF20
}}
==Confirming the Generalizability of a Chain-Based Animacy Detector==
Labiba Jahan*, W. Victor H. Yarlott, Rahul Mittal and Mark A. Finlayson
School of Computing and Information Sciences
Florida International University, Miami, FL 33199
{ljaha002, wyarl001, rmitt008, markaf}@fiu.edu
Abstract

Animacy is the characteristic of a referent being able to independently carry out actions in a story world (e.g., movement, communication). It is a necessary property of characters in stories, and so detecting animacy is an important step in automatic story understanding; it is also potentially useful for many other natural language processing tasks such as word sense disambiguation, coreference resolution, character identification, and semantic role labeling. Recent work by Jahan et al. [2018] demonstrated a new approach to detecting animacy in which animacy is considered a direct property of coreference chains (and referring expressions) rather than words. Jahan et al. combined hand-built rules and machine learning (ML) to identify the animacy of referring expressions and used majority voting to assign the animacy of coreference chains, reporting high performance of up to 0.90 F1. In this short report we verify that the approach generalizes to two different corpora (OntoNotes and the Corpus of English Novels) and confirm that the hybrid model performs best, with the rule-based model in second place. Our tests apply the animacy classifier to almost twice as much data as Jahan et al.'s initial study. Our results also strongly suggest, as would be expected, that the models depend on coreference chain quality. We release our data and code to enable reproducibility.

* Contact Author
Copyright © 2020 by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Jorge, R. Campos, A. Jatowt, A. Aizawa (eds.): Proceedings of the first AI4Narratives Workshop, Yokohama, Japan, January 2021, published at http://ceur-ws.org

1 Introduction

Animacy is the characteristic of a referent being able to independently carry out actions in a story world (e.g., movement, communication). For example, human beings are animate because they can move or communicate in a realistic story world, but a chair or a table cannot accomplish those actions independently, so they are considered inanimate. Because animacy is a necessary quality of characters in stories (that is, all characters, traditionally conceived, must be animate), animacy is useful to story understanding. Further, animacy is potentially useful in many natural language processing tasks including word sense disambiguation, semantic role labeling, coreference resolution, and character identification.

Most prior approaches assigned animacy as a property of individual words; by contrast, Jahan et al. [2018] introduced a new approach to animacy detection that reconceived animacy as a property of referring expressions and coreference chains. Jahan et al. demonstrated their approach on 142 stories, comprising 156,154 words, that included Russian folktales and Islamist Extremist stories. That work left some questions as to the generalizability of the detector to other story forms. Here we test the generalizability of Jahan et al.'s detector on two new corpora: a news subset of OntoNotes [Weischedel et al., 2013] and a subset of the Corpus of English Novels (CEN) [De Smet, 2008]. We test all three of Jahan et al.'s models, specifically an SVM-based ML model, a rule-based model, and a hybrid model combining both. We show, in agreement with Jahan et al.'s results, that the hybrid model performs best, followed by the rule-based model. Our results also suggest that the animacy models depend strongly on the quality of the coreference chains; in particular, the performance of the models on the CEN data (with automatically computed chains) is much poorer than on OntoNotes and the ProppLearner corpus (with manually corrected chains).

In this paper we first discuss our corpora (§2), followed by the models (§3) created by Jahan et al. [2018]. We then outline the experimental setup (§4) and describe our results (§5). We briefly discuss related work (§6), before finishing with a discussion of the contributions of the paper (§7).

2 Data

We annotated animacy on two new corpora: first, 94 news texts drawn from the OntoNotes Corpus [Weischedel et al., 2013]; second, 30 chapters from 30 novels drawn from CEN. We performed this manual annotation following the same guidelines described by Jahan et al. [2018]. In accordance with their procedure, we annotated the coreference chains of these two corpora as to whether each coreference chain head acted as an animate being in the text. Because the inter-annotator agreement for this annotation was quite high, we only performed single annotation. Details of the corpora are given in Table 1. These corpora contain approximately twice as much data, by count of referring expressions and coreference chains, as the original work.

| Corpus | Texts | Ref. Exps. | Anim. Ref. Exps. | Inanim. Ref. Exps. | Coref. Chains | Anim. Chains | Inanim. Chains |
|--------|-------|-----------|------------------|--------------------|---------------|--------------|----------------|
| Jahan  | 142 | 34,698 | 22,052 | 12,646 | 10,941 | 3,832 | 7,109 |
| OntoN. | 94  | 4,197  | 2,079  | 2,118  | 1,145  | 472   | 673   |
| CEN    | 30  | 70,379 | 20,937 | 49,442 | 17,251 | 2,808 | 14,443 |
| Total  | 124 | 74,576 | 23,016 | 51,560 | 18,396 | 3,280 | 15,116 |

Table 1: Counts of various text types. Ref. Exp. = Referring Expression; Coref. = Coreference; Anim. = Animate; Inanim. = Inanimate. The Total row sums the two new corpora (OntoNotes and CEN).

OntoNotes [Weischedel et al., 2013] is a large corpus containing a variety of genres (e.g., news, conversational telephone speech, broadcast, talk show transcripts) in English, Chinese, and Arabic. We extracted 94 English broadcast news texts that had coreference chain annotations. The first author annotated the animacy of the coreference chains.

The Corpus of English Novels (CEN) [De Smet, 2008] contains 292 English novels written between 1881 and 1922, comprising various genres including drama, romance, and fantasy. We selected 30 novels and listed the characters of these novels from online resources. We then extracted a single chapter of each novel that contains a significant number of characters. We computed coreference chains using Stanford CoreNLP [Manning et al., 2014], and the first author annotated those chains for animacy.

3 Models

Jahan et al.'s animacy model first classifies the animacy of referring expressions, and second classifies each coreference chain as animate or not by taking the majority vote of its constituent referring expressions. In our experiments we ran Jahan et al.'s three referring expression animacy detection models and the single coreference chain animacy detection model (majority vote backed by the different referring expression models, which was determined to be the best coreference model). Jahan et al. released their code, so the models are identical to their work.

SVM Model is a simple supervised SVM classifier [Chang and Lin, 2011] for assigning animacy to referring expressions, with a Radial Basis Function kernel, where the SVM parameters were set at γ = 1, C = 0.5, and p = 1. The features of the best performing model are boolean values indicating whether a given referring expression contains a noun, a grammatical subject, or a semantic subject. Jahan et al. chose these features because animate references tend to appear as nouns, grammatical subjects, or semantic subjects. When training and testing on the same dataset, we used ten-fold cross validation and reported the micro-averages across the performance on test folds.

Rule-Based Model The second approach is a rule-based classifier that marks a referring expression as animate if its last word is: (a) a gendered personal, reflexive, or possessive pronoun (i.e., excluding it, its, itself, etc.); (b) the semantic subject of a verb; (c) a proper noun (excluding the named-entity types LOCATION, ORGANIZATION, and MONEY); or (d) a descendant of LIVING BEING in WordNet. If the last word of a referring expression is a descendant of ENTITY but not a descendant of LIVING BEING in WordNet, the model considers it inanimate.

Hybrid Model is the third approach, in which the hand-built rules are applied first, followed by the ML classifier for those referring expressions not covered by the rules.

Majority Vote Model The coreference model applies majority voting to combine the results of a referring expression animacy model into a coreference chain animacy prediction. For ties, the chain is marked inanimate.

4 Experiments

We investigated four training setups for the SVM and Hybrid referring expression models: training the models on each of the three data sets individually, and training on all three datasets together. For all models (SVM, Hybrid, Rule-Based) we also varied the test corpus. Where the test data was a subset of the training data, we applied ten-fold cross-validation. In all approaches, we used the majority vote classifier to identify the animacy of the coreference chains. These experiments allow us to compare the performance of Jahan et al.'s referring expression models on our new corpora, as well as to determine their performance for coreference chain animacy.

5 Results & Discussion

The results in Table 2 show that the hybrid model outperformed all of the other models in detecting referring expression animacy, which is the same result reported in Jahan et al. [2018]. It performed best on Jahan et al.'s original data, achieving an F1 of 0.88, and is the most useful model as input to the majority vote model for identifying the animacy of coreference chains, achieving an F1 of 0.77.

The rule-based model performs second-best. It performed best on Jahan et al.'s original data for referring expressions, achieving an F1 of 0.88. But the majority vote model achieved the best result (F1 of 0.76) on OntoNotes when the rule-based results are used to detect the chain animacy. We also developed a baseline for chain animacy in which we considered only the first referring expression instead of the majority vote, achieving F1s of 0.69 and 0.43 on OntoNotes and CEN, respectively.

The SVM model performed worst in most cases, especially when its outputs are used for the majority vote model. It performed worst when trained on the Corpus of English Novels and tested on Jahan et al.'s original data, achieving an F1 of only 0.56 for the referring expressions and an F1 of 0.37 when the referring expression results are used for the majority vote model.

The majority vote model performed best when tested on OntoNotes and worst when tested on the Corpus of English Novels (CEN). Besides the text genre, the major difference between these corpora is the quality of the coreference chains: for OntoNotes they are manually corrected, while for CEN we computed them automatically. This strongly suggests that the quality of the coreference chains is a major factor in the performance of the animacy classifier.
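To make the chain-based design of §3 concrete, the following is a minimal Python sketch: classify each referring expression with simplified rules, then take a majority vote over the chain, with ties going to inanimate. It is illustrative only. The lexicons below are hypothetical stand-ins for the WordNet LIVING BEING test, the SVM fallback of the hybrid model is elided (the sketch simply defaults to inanimate), and none of the names come from Jahan et al.'s released code.

```python
from collections import Counter

# Simplified version of rule (a): gendered personal/reflexive/possessive
# pronouns, deliberately excluding "it", "its", "itself".
GENDERED_PRONOUNS = {"he", "she", "him", "her", "his", "hers",
                     "himself", "herself"}

# Hypothetical stand-ins for the WordNet checks in rule (d) and the
# ENTITY-but-not-LIVING-BEING test; a real implementation would walk
# the WordNet hypernym hierarchy instead of consulting small sets.
ANIMATE_NOUNS = {"man", "woman", "girl", "boy", "dog", "wizard"}
INANIMATE_NOUNS = {"table", "chair", "house", "stone"}


def refexp_animacy(refexp: str) -> bool:
    """Rule-based animacy of a referring expression, keyed on its last word."""
    last = refexp.lower().split()[-1]
    if last in GENDERED_PRONOUNS:
        return True
    if last in ANIMATE_NOUNS:        # cf. descendant of LIVING BEING
        return True
    if last in INANIMATE_NOUNS:      # cf. ENTITY but not LIVING BEING
        return False
    return False  # uncovered cases: the hybrid model would call the SVM here


def chain_animacy(chain: list[str]) -> bool:
    """Majority vote over a chain's referring expressions; ties are inanimate."""
    votes = Counter(refexp_animacy(r) for r in chain)
    return votes[True] > votes[False]


print(chain_animacy(["the old woman", "she", "her"]))  # True
print(chain_animacy(["the table", "it"]))              # False
```

Under these toy lexicons a tied chain such as ["the wizard", "the stone"] comes out inanimate, matching the tie-breaking rule described above.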
| | | Referring Expression Results | | | Coreference Chain Results | | |
|--------------|--------------|----------|----------|----------|----------|----------|----------|
| Train Corpus | Test Corpus | SVM F1 (κ) | Hybrid F1 (κ) | Rule-Based F1 (κ) | SVM F1 (κ) | Hybrid F1 (κ) | Rule-Based F1 (κ) |
| Jahan et al. [2018] | Jahan et al. [2018] | 0.84 (0.53) | 0.90 (0.70) | 0.88 (0.60) | 0.46 (0.03) | 0.75 (0.61) | 0.72 (0.51) |
| Jahan et al. [2018] | OntoNotes | 0.70 (0.35) | 0.80 (0.54) | - | 0.60 (0.34) | 0.77 (0.59) | - |
| Jahan et al. [2018] | English Novels | 0.75 (0.53) | 0.80 (0.60) | - | 0.52 (0.40) | 0.54 (0.41) | - |
| OntoNotes | Jahan et al. [2018] | 0.82 (0.51) | 0.88 (0.64) | - | 0.62 (0.44) | 0.72 (0.56) | - |
| OntoNotes | OntoNotes | 0.70 (0.36) | 0.80 (0.54) | 0.76 (0.44) | 0.60 (0.34) | 0.77 (0.59) | 0.73 (0.48) |
| OntoNotes | English Novels | 0.76 (0.54) | 0.80 (0.61) | - | 0.42 (0.40) | 0.54 (0.41) | - |
| English Novels | Jahan et al. [2018] | 0.56 (0.22) | 0.88 (0.64) | - | 0.37 (0.18) | 0.72 (0.56) | - |
| English Novels | OntoNotes | 0.70 (0.37) | 0.80 (0.54) | - | 0.60 (0.34) | 0.77 (0.59) | - |
| English Novels | English Novels | 0.76 (0.55) | 0.80 (0.61) | 0.75 (0.48) | 0.54 (0.43) | 0.54 (0.41) | 0.46 (0.28) |
| All | All | 0.80 (0.53) | 0.84 (0.62) | 0.82 (0.54) | 0.58 (0.42) | 0.60 (0.43) | 0.54 (0.33) |

Table 2: Performance of the referring expression models and of the majority vote coreference chain animacy model backed by each referring expression model, for different training and testing setups. κ = Cohen's kappa [Cohen, 1960], a statistical measure that takes into account the possibility of agreement occurring by chance [Glasser, 2008]. Note that the rule-based model does not require training, and so results are not reported for different training combinations. The first data line gives the results reported by Jahan et al. [2018] (set in italics in the original).
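For reference, the Cohen's kappa values in Table 2 follow the standard definition, comparing the observed agreement $p_o$ with the agreement $p_e$ expected by chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```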
Finally, the results on the combined corpus are reasonable for the referring expression models but poor for the majority vote coreference chain model. This is perhaps to be expected, because CEN is the largest corpus of the three and its coreference chains are of poor quality.

Overall, these results strongly suggest that the features used in Jahan et al. [2018] generalize to domains outside the Russian folklore corpus originally used, as long as high quality coreference chains are available.

6 Related Work

Most prior work classifies animacy as a word- or noun-level property using various supervised and unsupervised approaches. For example, Orasan and Evans [2007] performed animacy classification of senses and nouns and achieved their best performance with a supervised ML method (F1 of 0.94). Similarly, Bowman and Chopra [2012] used a maximum entropy classifier to classify noun phrases into a most probable class (human, animal, place, etc.), which was then used to mark animacy, achieving 94% accuracy. Karsdorp et al. [2015] employed a maximum entropy classifier to label the animacy of Dutch words using different combinations of lemmas, POS tags, dependency tags, and word embeddings. Their best result was an F1 of 0.93. However, the work is language-bound and has not been tested on other natural languages.

Ji and Lin [2009] leveraged gender and animacy properties to detect person mentions with an unsupervised learning model. They reported an F1 of 0.85, which is marginally lower than a supervised learning approach but has higher coverage of low-frequency mentions. More recently, Ardanuy et al. [2020] proposed an unsupervised approach to atypical animacy detection using contextualized word embeddings. Using a masking approach with context, they achieved a best performance of F1 of 0.78 on one dataset, while reporting an F1 of 0.94 on another dataset using a simple BERT classifier on the target expressions in a sentence. Zhu et al. [2019] proposed an animacy detector based on a bi-directional Long Short-Term Memory (bi-LSTM) network with a conditional random field (CRF) layer that marks words in a text sequence with an animate attribute. The work was done in Chinese, and they reported an F1 of 0.38.

There are also some works based on ontologies or other external resources. As an example, Declerck et al. [2012] augmented an existing ontology using nominal phrases found in folktales. They reported an F1 of 0.80 with 79% accuracy. Moore et al. [2013] assigned animacy to words, where multiple models (including WordNet- and WordSim-based models) vote between Animal, Person, and Inanimate, or abstain, and the results are combined using various interpretable voting models. They reported an accuracy of 89% under majority voting and 95% under an SVM scheme.

Generally, however, in contrast to all other prior work on animacy, only Jahan et al. [2018] demonstrated an approach in which animacy is considered a direct property of coreference chains (and referring expressions) rather than words or nouns.

7 Contributions

This paper makes two contributions. First, we have demonstrated the generalizability of a previously reported approach to animacy detection [Jahan et al., 2018] by testing the approach on twice as much data, comprising two additional story genres (news and novels). Second, we release this data for use by the community.¹ These results confirm the best performing models, and also strongly suggest the dependence of the models on the quality of coreference chain annotations.

Acknowledgements

This work was supported by NSF CAREER Award IIS-1749917 and DARPA Contract FA8650-19-C-6017. We would also like to thank the members of the FIU Cognac Lab for their discussions and assistance.

¹ The data and code may be downloaded from https://doi.org/10.34703/gzx1-9v95/FCYIPW
References

[Ardanuy et al., 2020] Mariona Coll Ardanuy, Federico Nanni, Kaspar Beelen, Kasra Hosseini, Ruth Ahnert, Jon Lawrence, Katherine McDonough, Giorgia Tolfo, Daniel CS Wilson, and Barbara McGillivray. Living machines: A study of atypical animacy, 2020.

[Bowman and Chopra, 2012] Samuel R. Bowman and Harshit Chopra. Automatic animacy classification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop (NAACL HLT'12), pages 7–10, Montréal, Canada, 2012.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[Cohen, 1960] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

[De Smet, 2008] Hendrik De Smet. Corpus of English Novels, 2008. https://perswww.kuleuven.be/~u0044428/.

[Declerck et al., 2012] Thierry Declerck, Nikolina Koleva, and Hans-Ulrich Krieger. Ontology-based incremental annotation of characters in folktales. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 30–34, Avignon, France, 2012.

[Glasser, 2008] Stephen Glasser. Research Methodology for Studies of Diagnostic Tests, pages 245–257. Springer Netherlands, Dordrecht, 2008.

[Jahan et al., 2018] Labiba Jahan, Geeticka Chauhan, and Mark Finlayson. A new approach to animacy detection. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1–12, Santa Fe, NM, 2018. Data and code may be found at https://dspace.mit.edu/handle/1721.1/116172.

[Ji and Lin, 2009] Heng Ji and Dekang Lin. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection. In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1, pages 220–229, Hong Kong, 2009.

[Karsdorp et al., 2015] Folgert B. Karsdorp, Marten van der Meulen, Theo Meder, and Antal van den Bosch. Animacy detection in stories. In Proceedings of the 6th Workshop on Computational Models of Narrative (CMN'15), pages 82–97, Atlanta, GA, 2015.

[Manning et al., 2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 55–60, Baltimore, MD, 2014.

[Moore et al., 2013] Joshua Moore, Christopher J.C. Burges, Erin Renshaw, and Wen-tau Yih. Animacy detection with voting models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 55–60, Seattle, Washington, USA, 2013.

[Orasan and Evans, 2007] Constantin Orasan and Richard J. Evans. NP animacy identification for anaphora resolution. Journal of Artificial Intelligence Research, 29:79–103, 2007.

[Weischedel et al., 2013] Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. OntoNotes Release 5.0, 2013. LDC Catalog No. LDC2013T19, https://catalog.ldc.upenn.edu/LDC2013T19.

[Zhu et al., 2019] Yuanqing Zhu, Wei Song, Xianjun Liu, Lizhen Liu, and Xinlei Zhao. Improving anaphora resolution by animacy identification. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), pages 48–51, Dalian, China, 2019.