=Paper=
{{Paper
|id=Vol-2831/paper9
|storemode=property
|title=Automatic recognition of figurative language in biomedical articles
|pdfUrl=https://ceur-ws.org/Vol-2831/paper9.pdf
|volume=Vol-2831
|authors=Dina Demner-Fushman,Willie Rogers,James Mork
|dblpUrl=https://dblp.org/rec/conf/aaai/Demner-FushmanR21
}}
==Automatic recognition of figurative language in biomedical articles==
Dina Demner-Fushman, Willie Rogers, James Mork

National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, 20894

{ddemner,wjrogers,jmork}@mail.nih.gov

No copyright. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Figurative language plays an important role in thought processes and science. Automatic detection of figurative language is gaining momentum in open-domain natural language processing research, but it is hindered in the biomedical domain by the absence of document collections for development and testing of the approaches. Reliable approaches to detection of figurative language could potentially improve automatic indexing of the literature and support clinical applications. We have developed a collection of documents annotated for literal or non-literal use of seven terms that are known to cause errors in automatic indexing of biomedical abstracts. Using the collection, we explore detection of figurative language with CNN-RNN, logistic regression and transformer models. We establish baselines for each of the seven terms, achieving results at the level of the state of the art reported in the open-domain evaluations.

===Introduction===
Figurative language plays an important role in science, with metaphors and idiomatic expressions viewed as foundations for thought processes (Taylor and Dewsbury 2018; Cork, Kaiser, and White 2019). Wide use of figurative language in the biomedical literature presents a significant challenge in automatic text understanding. Consider the term fall in the following phrases:

: A patient who suffered a fall from a wagon.
: Falling off the care wagon.
: Falling off the dopamine wagon.
: Fall from a train wagon.
: Fall from horse-drawn wagon.

Whereas it is relatively easy for people to discern which of these phrases refer to physical falls, biomedical named entity recognition (NER) approaches often treat figurative language as literal and link the word to inappropriate ontology terms as a result. This is specifically a problem in automated indexing, a task that aims to summarize the main points of a publication by assigning terms from a controlled vocabulary created to index the biomedical literature: Medical Subject Headings (MeSH) (NLM 2020a).

In biomedical publications, the problem of recognizing non-literal utterances is intertwined with word sense disambiguation (WSD), and compounded by the importance of the term to the article. The WSD aspects could be illustrated by the following:

: The head of each fish, including the brain and pituitary, was sampled for double-colored FISH analysis.

To many NER approaches, the first occurrence of fish is indistinguishable from FISH, which stands for fluorescent in situ hybridization. The confusion continues in:

: Is being a small fish in a big pond bad for students' psychosomatic health?

Moreover, for food products manufactured from fish, such as fish oil, linking to Fishes also violates indexing rules.

To summarize, to label a biomedical publication with the terms from a terminology, we need to determine whether the terms are used literally, whether the sense in the context corresponds to the sense in the terminology, and whether the term is important enough to be indexed for the article in the MEDLINE/PubMed database, which comprises more than 30 million biomedical abstracts (NLM 2020b). The importance of a term plays a bigger role when we use the existing manual indexing of biomedical abstracts for training and testing: the correct sense of a term could be used literally in the abstract, but the term might not be central enough to the publication to be assigned by the indexer.

Whereas there continues to be steady research in biomedical WSD (Pesaranghader et al. 2019) and in the use of figurative language in biomedicine (Cork, Kaiser, and White 2019), automated understanding of biomedical figurative language is still an under-explored area. Our objectives therefore are to:

# determine which non-literal expressions are prevalent in the biomedical literature and present difficulties to automated understanding,
# create training and test collections for these terms, and
# explore approaches to automated detection of non-literal language.

===Related Work===
The body of work on detection of figurative language in the open domain is significant, and interest in the topic is growing, as evidenced by the workshops and shared tasks on figurative language processing (Klebanov et al. 2020). Veale et al. (2016) provide an overview of the types of figurative language and of the computational approaches to detection and understanding of figurative language. The approaches are mostly formulated as a binary classification task on a limited set of triples, and sometimes as prediction of the class of a token in a sentence (Feldman and Peng 2013; Gao et al. 2018). Taking into account the immediate lexico-syntactic context of the utterance and incorporating discourse features improves recognition of figurative language (Mu, Yannakoudakis, and Shutova 2019). In an end-to-end RNN-based system, Mao et al. (2019) emulated two human approaches to identification of figurative language: 1) noticing a semantic contrast between a target word and its context (Selectional Preference Violation), and 2) identifying whether the literal meaning of a word contrasts with the meaning that word takes in the context (Metaphor Identification Procedure).

To the best of our knowledge, our work is the first to explore the difficulties figurative language poses for automated indexing of the biomedical literature. We also provide the first publicly available biomedical literature dataset annotated for figurative language at the token and sentence level. In addition, leveraging the state-of-the-art approaches explored in the open domain, we establish baselines for detection of figurative language in biomedical abstracts using sentence or token level classification.

===Data Sources and Collections===
We analyzed 870 American English idioms (Bulkes and Tanner 2017) and 464 metaphors (Katz et al. 1988; Campbell and Raney 2016). We searched the Free Dictionary Idioms dictionary (FARLEX 2020) for additional examples of figurative phrases. We then submitted figurative language expressions to MeSH on Demand (NLM 2020c) to identify potential triggers for false-positive linking to MeSH, e.g., cat and mouse in "the game of cat and mouse" could be mapped to Cats and Mice, respectively. We then searched PubMed with these trigger terms to get the frequency of their use in publications. We identified the seven most frequent false-positive triggers, which are shown in Table 1 along with the sizes of the training and test sets for each term.

{| class="wikitable"
|+ Table 1: Sizes of the training and test sets for each term in the PubMed Figurative Language Collection. The Check Tag column indicates if the term is a required term to be added because it pertains to the subject of the study. Check Tags are the most frequently used MeSH terms, which indicates our collection covers a sizable portion of false positive triggers.
! Term (MH) !! Check Tag !! Training !! Test
|-
| fall (Accidental Falls) || no || 45,820 || 895
|-
| fish (Fishes) || no || 18,256 || 513
|-
| juvenile (Adolescent) || yes || 59,176 || 581
|-
| baby (Infant) || yes || 1,065 || 270
|-
| bull (Cattle) || yes || 1,194 || 555
|-
| cat (Cats) || yes || 4,368 || 542
|-
| dog (Dogs) || yes || 19,167 || 905
|}

We then searched PubMed for the exact figurative expressions, and for the abstracts containing trigger terms that were either indexed or not with the corresponding MeSH headings. Abstracts with trigger terms and MeSH headings serve as examples of literal use in the training set, and abstracts without MeSH headings serve as examples of non-literal use. For the test sets, we randomly sampled files from both distributions and manually annotated the sentences containing the terms at the token level. We annotated fine-grained senses corresponding to:

# Full MH: the literal MeSH Heading-appropriate sense, e.g., "a healthy baby at 34 weeks of gestation."
# Partial Literal: MH-appropriate sense, but being a part of an expression, which should not trigger mapping to MeSH, e.g., shaken baby syndrome.
# Literal Other: Literal senses other than MH, e.g., baby hamster is still a baby, but it should not be indexed with Infant, which applies only to human babies.
# Figurative: Non-literal use of the term, e.g., in "There's a Baby in this Bath Water!"

Each document was annotated by two annotators and the differences were reconciled. The labels assigned by the indexers were not shown to the annotators to avoid bias.

===Experiments===
We explored CNN-RNN (Svoboda 2020), Logistic Regression (Pedregosa et al. 2011) and BERT-based (Kaiyinzhou 2020) approaches with various embeddings and the Universal Sentence Encoder (Cer et al. 2018). We used sentences from PubMed abstracts containing the trigger terms and the expressions from the above collections of idioms for training these models. Due to sparseness of the annotations and unavailability of sufficient examples for training and for judging the results, we collapsed the annotations into two classes: figurative or literal MH-appropriate. Any terms that were labeled LiteralOther or PartialLiteral were relabeled as Figurative. For example, in an article about dog owners, dog was considered as non-literal. Terms labeled as Figurative or FullMH remained unchanged.

We then approached the task as binary classification at the sentence or token level.

{| class="wikitable"
|+ Table 2: Results of predicting literal and figurative use of trigger terms. USE = Universal Sentence Encoder, P = Precision, R = Recall, A = Accuracy. The differences in 0.99 accuracy between the CNN-RNN and BERT approaches are in the third decimal point.
! rowspan="3" | Term
! colspan="12" | Sentence level
! colspan="4" | Token level
|-
! colspan="4" | CNN-RNN
! colspan="4" | Logistic regression
! colspan="4" | USE
! colspan="4" | BERT
|-
! P !! R !! F1 !! A !! P !! R !! F1 !! A !! P !! R !! F1 !! A !! P !! R !! F1 !! A
|-
| fall || 0.77 || 0.68 || 0.72 || 0.99 || 0.64 || 0.78 || 0.71 || 0.73 || 0.89 || 0.89 || 0.89 || 0.88 || 0.37 || 0.34 || 0.35 || 0.98
|-
| fish || 0.51 || 0.48 || 0.50 || 0.99 || 0.58 || 0.45 || 0.50 || 0.50 || 0.58 || 0.54 || 0.56 || 0.48 || 0.37 || 0.35 || 0.36 || 0.98
|-
| juvenile || 0.77 || 0.64 || 0.70 || 0.99 || 0.97 || 0.38 || 0.55 || 0.86 || 0.82 || 0.83 || 0.82 || 0.80 || 0.37 || 0.36 || 0.37 || 0.99
|-
| baby || 0.76 || 0.99 || 0.86 || 0.99 || 0.39 || 0.36 || 0.37 || 0.39 || 0.67 || 0.56 || 0.61 || 0.45 || 0.61 || 0.61 || 0.61 || 0.99
|-
| bull || 0.90 || 0.87 || 0.88 || 0.99 || 0.56 || 0.38 || 0.45 || 0.58 || 0.78 || 0.74 || 0.76 || 0.71 || 0.84 || 0.86 || 0.85 || 0.99
|-
| cat || 0.77 || 0.74 || 0.76 || 0.99 || 0.54 || 0.74 || 0.63 || 0.54 || 0.73 || 0.73 || 0.73 || 0.65 || 0.68 || 0.78 || 0.73 || 0.99
|-
| dog || 0.76 || 0.97 || 0.85 || 0.98 || 0.48 || 0.55 || 0.51 || 0.50 || 0.63 || 0.58 || 0.60 || 0.65 || 0.76 || 0.78 || 0.77 || 0.99
|}

To train the CNN-RNN and Logistic Regression models, sentences containing the target trigger terms were extracted from a set of retrieved documents that were labeled using MeSH indexing information as described above. Each extracted sentence was assigned the label of the document from which it was derived. Sentence embeddings were generated using a Doc2Vec (Rehurek and Sojka 2010) model pre-trained on the documents retrieved for the trigger terms.

In the CNN-RNN approach, the embeddings and associated labels served as input to a neural network containing four groups of four layers (convolutional layer, dropout, max-pooling, and dropout), followed by an LSTM layer. The model uses a sigmoid activation function, binary cross-entropy loss and the Adam optimizer. We used the scikit-learn Logistic Regression classifier, with Doc2Vec output as inputs.
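The construction of sentence-level training data described above (each sentence containing a trigger term inherits the label of its document, and the fine-grained senses are collapsed into two classes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the dictionary keys <code>text</code> and <code>mesh</code>, the function names, the naive sentence splitter, the trigger pattern and the toy abstracts are all hypothetical.

```python
import re

# Collapsing rule from the text: PartialLiteral and LiteralOther become
# Figurative; FullMH and Figurative stay unchanged.
COLLAPSE = {
    "FullMH": "FullMH",
    "PartialLiteral": "Figurative",
    "LiteralOther": "Figurative",
    "Figurative": "Figurative",
}


def collapse_label(fine_label):
    """Map a fine-grained sense label onto the two-class scheme."""
    return COLLAPSE[fine_label]


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(abstracts, trigger, heading):
    """Yield (sentence, label) pairs for sentences mentioning the trigger.

    A sentence inherits the label of its document: FullMH if the document
    is indexed with the MeSH heading, Figurative otherwise.
    """
    # Hypothetical pattern: the trigger with an optional plural "s".
    pattern = re.compile(r"\b" + re.escape(trigger) + r"s?\b", re.IGNORECASE)
    for doc in abstracts:
        doc_label = "FullMH" if heading in doc["mesh"] else "Figurative"
        for sentence in split_sentences(doc["text"]):
            if pattern.search(sentence):
                yield sentence, doc_label


# Toy abstracts, made up for illustration only.
example_abstracts = [
    {"text": "A patient suffered a fall from a wagon. He recovered well.",
     "mesh": ["Accidental Falls"]},
    {"text": "Falling off the care wagon. A fall in funding followed.",
     "mesh": []},
]

training_pairs = list(
    label_sentences(example_abstracts, "fall", "Accidental Falls")
)
```

With this simple pattern the inflected form "Falling" is not matched, so only the sentences containing "fall" itself are kept; a real pipeline would need a more careful treatment of word forms. The resulting pairs would then be embedded and passed to a classifier.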
The Universal Sentence Encoder was also applied in the sentence-level classification task. Unlike the Doc2Vec models, the Universal Sentence Encoder was trained on a very large corpus using a variety of sources. In our approach, each sentence vector representation was generated using the Universal Sentence Encoder during training. The vector representation and the sentence label were then passed to a two-layer neural network consisting of a ReLU and a softmax layer. A categorical cross-entropy loss and the Adam optimizer were used when building the model.

We used a BERT encoder extended with a CRF layer for Named Entity Recognition (Kaiyinzhou 2020) for the token-level classification of literal and figurative use of the tokens. We used BIO-style (Beginning-Inside-Outside) features. To train BERT, we tagged the trigger terms with the label of the sentence and all other terms in the sentence as outside.

===Results===
Table 2 summarizes the results obtained for the binary classification approaches to detection of figurative language. The PubMed searches yielded training sets of varying sizes, ranging from 1,065 documents for baby to 59,176 for juvenile. The manually annotated test sets for each of the terms range from 270 to 905 documents. The size of the training set does not seem to be directly correlated with the results, as shown in Figure 1.

Figure 1: The size of the training set does not always directly influence the best F-1 scores obtained in figurative language detection.

===Discussion===
We created a collection of PubMed abstracts automatically annotated for literal and non-literal use of seven terms that proved to be a rich source of false positive linking to terminologies and have sufficient amounts of training documents in PubMed. Interestingly, one of these terms, fall, was also found to be difficult to classify as figurative in the open domain tasks (Stowe et al. 2019).

We explored several state-of-the-art approaches, casting the task as binary classification at the sentence and token level. We hoped to identify one best approach for the task and achieve state-of-the-art performance for all trigger terms. The best results reported in the literature for open-domain figurative language detection and in the shared task on metaphor detection (Klebanov et al. 2020) are around 70% F-1 score, sometimes reaching 80% and above. Although we have obtained F-1 scores above 80% for five of the seven terms, we cannot identify a single approach that will achieve good scores on all trigger terms. The F-1 score for fish is only 56%. This score could probably be explained by the fact that this term often violates the widely used WSD assumption of "one sense per document" (Yarowsky 1995), which we used to create the training set. As can be seen in the following example, two senses of fish are used in the same sentence:

: These preliminary results provide the basis for the further development of a non-GMO approach to modulate fish allergenicity and improve safety of aquaculture fish. (PMID: 31622806)

The indexers labeled this article with both Fishes and Seafood. When the contexts for these occurrences of fish are used in the models as positive examples, they might be too close to the contexts of the articles that present fish only in the context of food and thus serve as negative examples.

With respect to identifying one approach that would work best for all of the trigger terms, we can see that casting the task as sentence-level classification and using the CNN-RNN model produces the majority of best results. Stowe et al. (2019) observe that fall is difficult to classify because the distribution of the literal and metaphoric uses of this word in the open domain is almost even. In our annotations, we also observed frequent use of fall in personification, which might explain why the Universal Sentence Encoder pre-trained on a variety of sources performs much better for fall.

Another interesting observation is that if we want to select a method for automated indexing, we will have to decide whether recall or precision is more important when suggesting the terms. For cat, dog, fish and juvenile, the differences in these two metrics achieved by different approaches are relatively large, although the F-scores are mostly close, showing a typical trade-off between the two metrics. In selecting approaches to support automated indexing, precision often plays an important role, as currently the consensus is that it is better to miss a term than to assign an inappropriate term that will mislead the search engines that rely on MeSH indexing. For that reason, we do not consider accuracy when selecting an approach for supporting automated indexing.

Our work has some limitations that we hope to address in the future. First, we addressed only seven of the hundreds of terms used figuratively in the biomedical literature. Although the seven terms provided enough information to see that no single approach is a winning strategy, additional annotations will be needed for testing approaches to figurative language detection on PubMed scale. We also found that for many remaining terms figurative use in PubMed is infrequent and additional sources of figurative language will be needed for training. For example, butterflies in my stomach is used in PubMed only two times, and butterflies AND stomach 20 times. More data will be needed to train a classifier to distinguish between these two titles:

: Butterflies in My Stomach: Insects in Human Nutrition
: Neurotic butterflies in my stomach: the role of anxiety, anxiety sensitivity and depression in functional gastrointestinal disorders

===Conclusions===
This work presents an initial exploration of the use and detection of figurative language in biomedical publications. On the one hand, figurative language is known to play an important role in thought processes and in science, and is therefore widely used in biomedical publications; on the other hand, automated detection of figurative language in biomedical publications has not yet attracted research. To explore the feasibility of automated detection of figurative language, we created a collection of documents annotated for literal or non-literal use of seven terms that are known to cause errors in automatic indexing of biomedical abstracts with MeSH terms. We then explored sentence and token-level classification approaches to detection of figurative language using CNN-RNN, logistic regression and transformer models. With the exception of one term, fish, our performance is on par with the state of the art achieved in the open domain evaluations. We hope that the interesting problem of detection of figurative language in biomedical text, the dataset, and the automated approach to creation of the training sets outlined in this work will bring about further research in this area.

Data & code: https://ii.nlm.nih.gov/DataSets/index.shtml

===Acknowledgements===
This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health.

We thank Alan Aronson, Francois Lang, Laritza Rodriguez and Sonya Shooshan for judging parts of the collections. We thank Anna Ripple for constructing PubMed searches.

===References===
* Bulkes, N. Z.; and Tanner, D. 2017. "Going to town": Large-scale norming and statistical analysis of 870 American English idioms. Behavior research methods 49(2): 772–783.
* Campbell, S. J.; and Raney, G. E. 2016. A 25-year replication of Katz et al.'s (1988) metaphor norms. Behavior research methods 48(1): 330–340.
* Cer, D.; Yang, Y.; Kong, S.-y.; Hua, N.; Limtiaco, N.; John, R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. 2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 169–174.
* Cork, C.; Kaiser, B. N.; and White, R. G. 2019. The integration of idioms of distress into mental health assessments and interventions: a systematic review. Global Mental Health 6.
* FARLEX. 2020 (accessed November, 2020). 25. The Free Dictionary by FARLEX. Idioms and phrases. URL https://idioms.thefreedictionary.com/.
* Feldman, A.; and Peng, J. 2013. Automatic detection of idiomatic clauses. In International Conference on Intelligent Text Processing and Computational Linguistics, 435–446. Springer.
* Gao, G.; Choi, E.; Choi, Y.; and Zettlemoyer, L. 2018. Neural Metaphor Detection in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 607–613.
* Kaiyinzhou. 2020 (accessed November, 2020). BERT-NER. URL https://github.com/kyzhouhzau/BERT-NER.
* Katz, A. N.; Paivio, A.; Marschark, M.; and Clark, J. M. 1988. Norms for 204 literary and 260 nonliterary metaphors on 10 psychological dimensions. Metaphor and Symbol 3(4): 191–214.
* Klebanov, B. B.; Shutova, E.; Lichtenstein, P.; Muresan, S.; Wee, C.; Feldman, A.; and Ghosh, D., eds. 2020. Proceedings of the Second Workshop on Figurative Language Processing. Online: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.figlang-1.0.
* Mao, R.; Lin, C.; and Guerin, F. 2019. End-to-end sequential metaphor identification inspired by linguistic theories. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3888–3898.
* Mu, J.; Yannakoudakis, H.; and Shutova, E. 2019. Learning Outside the Box: Discourse-level Features Improve Metaphor Identification. In Proceedings of NAACL-HLT, 596–601.
* NLM. 2020a (accessed November, 2020). Medical Subject Headings. URL https://www.nlm.nih.gov/mesh/meshhome.html.
* NLM. 2020b (accessed November, 2020). MEDLINE and PubMed. URL https://pubmed.ncbi.nlm.nih.gov/.
* NLM. 2020c (accessed November, 2020). MeSH on Demand. URL https://meshb.nlm.nih.gov/MeSHonDemand.
* Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
* Pesaranghader, A.; Matwin, S.; Sokolova, M.; and Pesaranghader, A. 2019. deepBioWSD: effective deep neural word sense disambiguation of biomedical text data. Journal of the American Medical Informatics Association 26(5): 438–446.
* Rehurek, R.; and Sojka, P. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer.
* Stowe, K.; Moeller, S.; Michaelis, L.; and Palmer, M. 2019. Linguistic Analysis Improves Neural Metaphor Detection. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 362–371.
* Svoboda, D. 2020 (accessed November, 2020). Doc2VecCNNRNN. URL.
* Taylor, C.; and Dewsbury, B. M. 2018. On the problem and promise of metaphor use in science and science communication. Journal of microbiology & biology education 19(1).
* Veale, T.; Shutova, E.; and Klebanov, B. B. 2016. Metaphor: A computational perspective. Synthesis Lectures on Human Language Technologies 9(1): 1–160.
* Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, 189–196.
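The token-level tagging scheme described in the Experiments section (trigger terms carry the label of the sentence, every other token is tagged as outside) can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' preprocessing: the function name <code>bio_tags</code>, the tag strings such as <code>B-Figurative</code>, and the set of trigger word forms are hypothetical.

```python
def bio_tags(tokens, trigger_forms, sentence_label):
    """Return one BIO-style tag per token.

    Tokens matching a trigger word form receive "B-<sentence_label>";
    all other tokens receive "O". Because each trigger here is a single
    token, no "I-" tags are produced in this sketch.
    """
    forms = {form.lower() for form in trigger_forms}
    return [
        "B-" + sentence_label if token.lower() in forms else "O"
        for token in tokens
    ]


# Example phrase taken from the Introduction; the word forms are assumed.
tokens = ["Falling", "off", "the", "dopamine", "wagon", "."]
tags = bio_tags(tokens, {"fall", "falls", "falling"}, "Figurative")
```

Token and tag sequences of this shape are the usual input format for sequence-labelling models such as a BERT encoder with a CRF layer.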