Confirming the Generalizability of a Chain-Based Animacy Detector

Labiba Jahan*, W. Victor H. Yarlott, Rahul Mittal and Mark A. Finlayson
School of Computing and Information Sciences
Florida International University, Miami, FL 33199
{ljaha002, wyarl001, rmitt008, markaf}@fiu.edu

* Contact Author

Copyright © 2020 by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Jorge, R. Campos, A. Jatowt, A. Aizawa (eds.): Proceedings of the first AI4Narratives Workshop, Yokohama, Japan, January 2021, published at http://ceur-ws.org

Abstract

Animacy is the characteristic of a referent being able to independently carry out actions in a story world (e.g., movement, communication). It is a necessary property of characters in stories, and so detecting animacy is an important step in automatic story understanding; it is also potentially useful for many other natural language processing tasks such as word sense disambiguation, coreference resolution, character identification, and semantic role labeling. Recent work by Jahan et al. [2018] demonstrated a new approach to detecting animacy where animacy is considered a direct property of coreference chains (and referring expressions) rather than words. Jahan et al. combined hand-built rules and machine learning (ML) to identify the animacy of referring expressions and used majority voting to assign the animacy of coreference chains, and reported high performance of up to 0.90 F1. In this short report we verify that the approach generalizes to two different corpora (OntoNotes and the Corpus of English Novels) and confirm that the hybrid model performs best, with the rule-based model in second place. Our tests apply the animacy classifier to almost twice as much data as Jahan et al.'s initial study. Our results also strongly suggest, as would be expected, the dependence of the models on coreference chain quality. We release our data and code to enable reproducibility.

1 Introduction

Animacy is the characteristic of a referent being able to independently carry out actions in a story world (e.g., movement, communication). For example, human beings are animate because they can move or communicate in a realistic story world, but a chair or a table cannot accomplish those actions independently, so they are considered inanimate. Because animacy is a necessary quality of characters in stories (that is, all characters, traditionally conceived, must be animate), animacy is useful to story understanding. Further, animacy is potentially useful in many natural language processing tasks including word sense disambiguation, semantic role labeling, coreference resolution, and character identification.

Most prior approaches assigned animacy as a property of individual words; by contrast, Jahan et al. [2018] introduced a new approach to animacy detection that reconceived animacy as a property of referring expressions and coreference chains. Jahan et al. demonstrated their approach on 142 stories, comprising 156,154 words, that included Russian folktales and Islamist Extremist stories. That work left some questions as to the generalizability of the detector to other story forms. Here we test the generalizability of Jahan et al.'s detector on two new corpora: a news subset of OntoNotes [Weischedel et al., 2013] and a subset of the Corpus of English Novels (CEN) [De Smet, 2008]. We test all three of Jahan et al.'s models, specifically an SVM-based ML model, a rule-based model, and a hybrid model combining both. We show, in agreement with Jahan et al.'s results, that the hybrid model performs best, followed by the rule-based model. Our results also suggest that the animacy models have a strong dependence on the quality of coreference chains; in particular, the performance of the models on the CEN data (with automatically computed chains) is much poorer than on OntoNotes and the ProppLearner corpus (with manually corrected chains).

In this paper we first discuss our corpora (§2), followed by the models (§3) created by Jahan et al. [2018]. We then outline the experimental setup (§4) and describe our results (§5). We briefly discuss related work (§6) before finishing with a discussion of the contributions of the paper (§7).

2 Data

We annotated animacy on two new corpora: first, 94 news texts drawn from the OntoNotes Corpus [Weischedel et al., 2013]; second, 30 chapters from 30 novels drawn from CEN. We performed this manual annotation following the same guidelines described by Jahan et al. [2018]. In accordance with their procedure, we annotated the coreference chains of these two corpora as to whether each coreference chain head acted as an animate being in the text. Because the inter-annotator agreement for this annotation was quite high, we only performed single annotation. Details of the corpora are given in Table 1. These corpora contain approximately twice as much data, by count of referring expressions and coreference chains, as the original work.

Corpus   Texts   Ref. Exps.   Anim. Ref. Exps.   Inanim. Ref. Exps.   Coref. Chains   Anim. Chains   Inanim. Chains
Jahan    142     34,698       22,052             12,646               10,941          3,832          7,109
OntoN.   94      4,197        2,079              2,118                1,145           472            673
CEN      30      70,379       20,937             49,442               17,251          2,808          14,443
Total    124     74,576       23,016             51,560               18,396          3,280          15,116

Table 1: Counts of various text types. Ref. Exp. = Referring Expression; Coref. = Coreference; Anim. = Animate; Inanim. = Inanimate. The Total row sums the two new corpora (OntoNotes and CEN).

OntoNotes [Weischedel et al., 2013] is a large corpus containing a variety of genres, e.g., news, conversational telephone speech, broadcast, talk show transcripts, etc., in English, Chinese, and Arabic. We extracted 94 English broadcast news texts that had coreference chain annotations. The first author annotated the animacy of the coreference chains.

The Corpus of English Novels (CEN) [De Smet, 2008] contains 292 English novels written between 1881 and 1922, comprising various genres including drama, romance, fantasy, etc. We selected 30 novels and listed the characters of these novels from online resources. Then we extracted from each novel a single chapter that contains a significant number of characters. We computed coreference chains using Stanford CoreNLP [Manning et al., 2014], and the first author annotated those chains for animacy.

3 Models

Jahan et al.'s animacy model first classifies the animacy of referring expressions, and second classifies each coreference chain as animate or not by taking the majority vote of its constituent referring expressions. In our experiments we ran Jahan et al.'s three referring expression animacy detection models and the single coreference chain animacy detection model (majority vote backed by the different referring expression models, which Jahan et al. determined to be the best coreference model). Jahan et al. released their code, so the models are identical to their work.

SVM Model is a simple supervised SVM classifier [Chang and Lin, 2011] for assigning animacy to referring expressions, with a Radial Basis Function kernel where the SVM parameters were set at γ = 1, C = 0.5, and p = 1. The features of the best performing model are boolean values of whether a given referring expression contained a noun, a grammatical subject, or a semantic subject. Jahan et al. chose these features because animate references tend to appear as nouns, grammatical subjects, or semantic subjects. When training and testing on the same dataset, we used ten-fold cross-validation and reported the micro-averages across the performance on test folds.

Rule-Based Model is the second approach, a rule-based classifier that marks a referring expression as animate if its last word was: (a) a gendered personal, reflexive, or possessive pronoun (i.e., excluding it, its, itself, etc.); (b) the semantic subject of a verb; (c) a proper noun (excluding the named-entity types LOCATION, ORGANIZATION, and MONEY); or (d) a descendant of LIVING BEING in WordNet. If the last word of a referring expression is a descendant of ENTITY but not a descendant of LIVING BEING in WordNet, the model considers it inanimate.

Hybrid Model is the third approach, in which the hand-built rules are applied first, followed by the ML classifier for those referring expressions not covered by the rules.

Majority Vote Model is the coreference model, which applies majority voting to combine the results of the referring expression animacy model into a coreference chain animacy prediction. For ties, the chain was marked inanimate.

4 Experiments

We investigated four training setups for the SVM and hybrid referring expression models: training the model on each of the three data sets individually, and training on all three datasets together. For all models (SVM, Hybrid, Rule-Based) we also varied the test corpus. Where the test data was a subset of the training data, we applied ten-fold cross-validation. In all approaches, we used the majority vote classifier to identify the animacy of the coreference chains. These experiments compare the performance of Jahan et al.'s referring expression models on our new corpora, and also determine their performance for coreference chain animacy.

5 Results & Discussion

The results in Table 2 show that the hybrid model outperformed all of the other models in detecting referring expression animacy, which is the same result reported in Jahan et al. [2018]. It performed best on Jahan et al.'s original data, achieving an F1 of 0.88, and is the most useful model when its output is used as input to the majority vote model to identify the animacy of coreference chains, achieving an F1 of 0.77.

The rule-based model performs second-best. It performed best on Jahan et al.'s original data for referring expressions, achieving an F1 of 0.88. But the majority vote model achieved its best result (F1 of 0.76) on OntoNotes when the rule-based results are used to detect chain animacy. We also developed a baseline for chain animacy in which we considered only the first referring expression instead of the majority vote, achieving F1 scores of 0.69 and 0.43 on OntoNotes and CEN, respectively.

The SVM model performed worst in most of the cases, especially when its outputs are used for the majority vote model. It performed worst when trained on the Corpus of English Novels and tested on Jahan et al.'s original data, achieving an F1 of only 0.56 for the referring expressions and an F1 of 0.37 when the results of the referring expressions are used for the majority vote model.

The majority vote model performed best when tested on OntoNotes. It performed worst when tested on the Corpus of English Novels (CEN). Besides the text genre, the major difference between these corpora is the quality of the coreference chains: for OntoNotes they are manually corrected, while for CEN we computed them automatically. This strongly suggests that the quality of coreference chains is a major factor in the performance of the animacy classifier.

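To make the rule-based model of §3 concrete, the following is a minimal sketch in Python. It is an illustration only, not Jahan et al.'s released implementation: the `HYPERNYMS` table stands in for a real WordNet hypernym lookup, and the subject and named-entity flags are assumed to be pre-computed by a parser.

```python
# Toy stand-in for a WordNet hypernym lookup; the real model checks
# whether a word is a descendant of LIVING BEING (or ENTITY) in WordNet.
HYPERNYMS = {"dog": "living_being", "girl": "living_being",
             "table": "entity", "chair": "entity"}

# Rule (a): gendered personal/reflexive/possessive pronouns (no "it", "its", ...).
GENDERED_PRONOUNS = {"he", "him", "his", "himself",
                     "she", "her", "hers", "herself"}

# Rule (c) exclusions: proper nouns with these named-entity types stay inanimate.
EXCLUDED_NE_TYPES = {"LOCATION", "ORGANIZATION", "MONEY"}

def rule_based_animacy(last_word, is_semantic_subject=False,
                       is_proper_noun=False, ne_type=None):
    """Animate if the last word of the referring expression fires rules (a)-(d)."""
    w = last_word.lower()
    if w in GENDERED_PRONOUNS:                               # rule (a)
        return True
    if is_semantic_subject:                                  # rule (b)
        return True
    if is_proper_noun and ne_type not in EXCLUDED_NE_TYPES:  # rule (c)
        return True
    return HYPERNYMS.get(w) == "living_being"                # rule (d)
```

A last word that descends from ENTITY but not LIVING BEING (e.g., "table" above) falls through to the final check and is classed inanimate, matching the model's inanimate rule.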

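The majority vote chain model, with its tie-breaking rule, reduces to a few lines. This sketch assumes each referring expression in a chain has already been labeled True (animate) or False (inanimate); the first-expression baseline from §5 is included for comparison.

```python
from collections import Counter

def chain_animacy(expression_labels):
    """Majority vote over a chain's referring-expression labels;
    a tie is resolved as inanimate, per Jahan et al.'s procedure."""
    counts = Counter(expression_labels)
    return counts[True] > counts[False]  # strict majority required for animate

def first_expression_baseline(expression_labels):
    """Baseline from Section 5: the chain takes the label of its
    first referring expression only."""
    return expression_labels[0]
```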

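Likewise, the SVM feature encoding and the hybrid fall-through can be sketched as below. The dictionary keys and the convention that the rule component returns None for uncovered expressions are assumptions made for illustration, not interfaces of the released code.

```python
def feature_vector(expr):
    """The three boolean features of the best-performing SVM model,
    encoded as 0/1: does the referring expression contain a noun,
    a grammatical subject, or a semantic subject?"""
    return [int(expr["has_noun"]),
            int(expr["has_grammatical_subject"]),
            int(expr["has_semantic_subject"])]

def hybrid_animacy(expr, rule_predict, svm_predict):
    """Hybrid model: apply the hand-built rules first, and fall back
    to the SVM for expressions the rules do not cover (None here)."""
    label = rule_predict(expr)
    return label if label is not None else svm_predict(expr)
```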
                                          Referring Expression Results                  Coreference Chain Results
                                          SVM          Hybrid       Rule-Based          SVM          Hybrid       Rule-Based
Train Corpus         Test Corpus          F1     κ     F1     κ     F1     κ            F1     κ     F1     κ     F1     κ
Jahan et al. [2018]  Jahan et al. [2018]  0.84   0.53  0.90   0.70  0.88   0.60         0.46   0.03  0.75   0.61  0.72   0.51
Jahan et al. [2018]  OntoNotes            0.70   0.35  0.80   0.54  -      -            0.60   0.34  0.77   0.59  -      -
Jahan et al. [2018]  English Novels       0.75   0.53  0.80   0.60  -      -            0.52   0.40  0.54   0.41  -      -
OntoNotes            Jahan et al. [2018]  0.82   0.51  0.88   0.64  -      -            0.62   0.44  0.72   0.56  -      -
OntoNotes            OntoNotes            0.70   0.36  0.80   0.54  0.76   0.44         0.60   0.34  0.77   0.59  0.73   0.48
OntoNotes            English Novels       0.76   0.54  0.80   0.61  -      -            0.42   0.40  0.54   0.41  -      -
English Novels       Jahan et al. [2018]  0.56   0.22  0.88   0.64  -      -            0.37   0.18  0.72   0.56  -      -
English Novels       OntoNotes            0.70   0.37  0.80   0.54  -      -            0.60   0.34  0.77   0.59  -      -
English Novels       English Novels       0.76   0.55  0.80   0.61  0.75   0.48         0.54   0.43  0.54   0.41  0.46   0.28
All                  All                  0.80   0.53  0.84   0.62  0.82   0.54         0.58   0.42  0.60   0.43  0.54   0.33

Table 2: Performance of the referring expression and majority vote coreference chain animacy models, backed by different referring expression models, for different training and testing setups. κ = Cohen's kappa [Cohen, 1960], a statistical measure that takes into account the possibility of the agreement occurring by chance [Glasser, 2008]. Note that the rule-based model does not require training, and so results are not reported for different training combinations. The first row repeats the results reported by Jahan et al. [2018].

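For reference, the Cohen's kappa reported in Table 2 corrects the observed agreement p_o by the agreement p_e expected from the two label distributions, κ = (p_o − p_e) / (1 − p_e). A minimal implementation for two parallel annotations, shown only to make the measure concrete:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two parallel label sequences:
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each side's label marginals.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```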

Finally, the results on the combined corpus are reasonable for the referring expression models, but poor for the majority vote coreference chain model. This is perhaps to be expected because CEN is the largest corpus among the three and its coreference chains are poor in quality.

Overall, these results strongly suggest that the features used in Jahan et al. [2018] generalize to domains outside the Russian folklore corpus, as long as high quality coreference chains are available.

6 Related Work

Most prior work classifies animacy as a word- or noun-level property using different supervised and unsupervised approaches. For example, Orasan and Evans [2007] performed animacy classification of senses and nouns and achieved the best performance with a supervised ML method (F1 of 0.94). Similarly, Bowman and Chopra [2012] used a maximum entropy classifier to classify noun phrases into a most probable class (human, animal, place, etc.), which was used to mark animacy, achieving 94% accuracy. Karsdorp et al. [2015] employed a maximum entropy classifier to label the animacy of Dutch words using different combinations of lemmas, POS tags, dependency tags, and word embeddings. Their best result reported an F1 of 0.93. However, the work is language-bound and has not been tested on other natural languages.

Ji and Lin [2009] leveraged gender and animacy properties to detect person mentions with an unsupervised learning model. They reported an F1 of 0.85, which is marginally lower than a supervised learning approach but has higher coverage of low-frequency mentions. More recently, Ardanuy et al. [2020] proposed an unsupervised approach to atypical animacy detection using contextualized word embeddings. Using a masking approach with context, they achieved a best performance of F1 of 0.78 on one dataset, while reporting an F1 of 0.94 on another dataset using a simple BERT classifier on the target expressions in a sentence. Zhu et al. [2019] proposed an animacy detector based on a bi-directional Long Short-Term Memory (bi-LSTM) network with a conditional random field (CRF) layer that marks words in a text sequence with the animate attribute. The work was done in Chinese and they reported an F1 of 0.38.

There are also some works based on ontologies or other external resources. As an example, Declerck et al. [2012] augmented an existing ontology using nominal phrases found in folktales. They reported an F1 of 0.80 with 79% accuracy. Moore et al. [2013] assigned animacy to words, where multiple models (including WordNet- and WordSim-based models) vote for Animal, Person, or Inanimate, or abstain, and the results are combined using various interpretable voting models. They reported an accuracy of 89% under majority voting and 95% under an SVM scheme.

Generally, however, compared to all other prior work on animacy, only Jahan et al. [2018] demonstrated an approach where animacy is considered a direct property of coreference chains (and referring expressions) rather than words or nouns.

7 Contributions

This paper makes two contributions. First, we have demonstrated the generalizability of a previously reported approach to animacy detection [Jahan et al., 2018] by testing the approach on twice as much data, comprising two additional story genres (news and novels). Second, we release this data and our code for use by the community.¹ These results confirm the best performing models, and also strongly suggest the dependence of the models on the quality of coreference chain annotations.

Acknowledgements

This work was supported by NSF CAREER Award IIS-1749917 and DARPA Contract FA8650-19-C-6017. We would also like to thank the members of the FIU Cognac Lab for their discussions and assistance.

¹ The data and code may be downloaded from https://doi.org/10.34703/gzx1-9v95/FCYIPW




References

[Ardanuy et al., 2020] Mariona Coll Ardanuy, Federico Nanni, Kaspar Beelen, Kasra Hosseini, Ruth Ahnert, Jon Lawrence, Katherine McDonough, Giorgia Tolfo, Daniel CS Wilson, and Barbara McGillivray. Living machines: A study of atypical animacy, 2020.

[Bowman and Chopra, 2012] Samuel R. Bowman and Harshit Chopra. Automatic animacy classification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop (NAACL HLT'12), pages 7–10, Montréal, Canada, 2012.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[Cohen, 1960] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

[De Smet, 2008] Hendrik De Smet. Corpus of English Novels, 2008. https://perswww.kuleuven.be/~u0044428/.

[Declerck et al., 2012] Thierry Declerck, Nikolina Koleva, and Hans-Ulrich Krieger. Ontology-based incremental annotation of characters in folktales. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 30–34, Avignon, France, 2012.

[Glasser, 2008] Stephen Glasser. Research Methodology for Studies of Diagnostic Tests, pages 245–257. Springer Netherlands, Dordrecht, 2008.

[Jahan et al., 2018] Labiba Jahan, Geeticka Chauhan, and Mark Finlayson. A new approach to animacy detection. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1–12, Santa Fe, NM, 2018. Data and code may be found at https://dspace.mit.edu/handle/1721.1/116172.

[Ji and Lin, 2009] Heng Ji and Dekang Lin. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection. In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1, pages 220–229, Hong Kong, 2009.

[Karsdorp et al., 2015] Folgert B. Karsdorp, Marten van der Meulen, Theo Meder, and Antal van den Bosch. Animacy detection in stories. In Proceedings of the 6th Workshop on Computational Models of Narrative (CMN'15), pages 82–97, Atlanta, GA, 2015.

[Manning et al., 2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 55–60, Baltimore, MD, 2014.

[Moore et al., 2013] Joshua Moore, Christopher J.C. Burges, Erin Renshaw, and Wen-tau Yih. Animacy detection with voting models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 55–60, Seattle, Washington, USA, 2013.

[Orasan and Evans, 2007] Constantin Orasan and Richard J. Evans. NP animacy identification for anaphora resolution. Journal of Artificial Intelligence Research, 29:79–103, 2007.

[Weischedel et al., 2013] Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. OntoNotes Release 5.0, 2013. LDC Catalog No. LDC2013T19, https://catalog.ldc.upenn.edu/LDC2013T19.

[Zhu et al., 2019] Yuanqing Zhu, Wei Song, Xianjun Liu, Lizhen Liu, and Xinlei Zhao. Improving anaphora resolution by animacy identification. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), pages 48–51, Dalian, China, 2019.


