             Automatic recognition of figurative language in biomedical articles
                                  Dina Demner-Fushman, Willie Rogers, James Mork
                                                    National Library of Medicine
                                                        8600 Rockville Pike
                                                        Bethesda, MD, 20894

                            Abstract

  Figurative language plays an important role in thought processes and science. Automatic
  detection of figurative language is gaining momentum in open-domain natural language
  processing research, but it is hindered in the biomedical domain by the absence of document
  collections for development and testing of the approaches. Reliable approaches to detection
  of figurative language could potentially improve automatic indexing of the literature and
  support clinical applications. We have developed a collection of documents annotated for
  literal or non-literal use of seven terms that are known to cause errors in automatic
  indexing of biomedical abstracts. Using the collection, we explore detection of figurative
  language with CNN-RNN, logistic regression and transformer models. We establish baselines
  for each of the seven terms, achieving results at the level of the state-of-the-art
  reported in the open domain evaluations.

No copyright. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                        Introduction

Figurative language plays an important role in science, with metaphors and idiomatic
expressions viewed as foundations for thought processes (Taylor and Dewsbury 2018; Cork,
Kaiser, and White 2019). Wide use of figurative language in the biomedical literature
presents a significant challenge in automatic text understanding. Consider the term fall in
the following sentences:
   A patient who suffered a fall from a wagon.
   Falling off the care wagon.
   Falling off the dopamine wagon.
   Fall from a train wagon.
   Fall from horse-drawn wagon.
Whereas it is relatively easy for people to discern which of these phrases refer to physical
falls, biomedical named entity recognition (NER) approaches often treat figurative language
as literal and, as a result, link the word to inappropriate ontology terms. This is a
particular problem in automated indexing, which aims to summarize the main points of a
publication by assigning terms from Medical Subject Headings (MeSH), a controlled vocabulary
created to index the biomedical literature (NLM 2020, accessed November 2020).
   In biomedical publications, the problem of recognizing non-literal utterances is
intertwined with word sense disambiguation (WSD), and compounded by the importance of the
term to the article. The WSD aspects can be illustrated by the following:
    The head of each fish, including the brain and pituitary, was sampled for
    double-colored FISH analysis.
   To many NER approaches, the first occurrence of fish is indistinguishable from FISH,
which stands for fluorescent in situ hybridization. The confusion continues in:
    Is being a small fish in a big pond bad for students' psychosomatic health?
   Moreover, for food products manufactured from fish, such as fish oil, linking to Fishes
also violates indexing rules. To summarize, to label a biomedical publication with the terms
from a terminology, we need to determine if the terms are used literally, if the sense in
the context corresponds to the sense in the terminology, and if the term is important enough
to be indexed for the article in the MEDLINE/PubMed database, which comprises more than 30
million biomedical abstracts (NLM 2020, accessed November 2020). The importance of a term
plays a bigger role when we use the existing manual indexing of biomedical abstracts for
training and testing: the correct sense of a term could be used literally in the abstract,
but the term might not be central enough to the publication to be assigned by the indexer.
   Whereas there continues to be steady research in biomedical WSD (Pesaranghader et al.
2019) and in the use of figurative language in biomedicine (Cork, Kaiser, and White 2019),
automated understanding of biomedical figurative language is still an under-explored area.
Our objectives therefore are:
1. to determine which non-literal expressions are prevalent in the biomedical literature and
   present difficulties to automated understanding,
2. to create training and test collections for these terms, and
3. to explore approaches to automated detection of non-literal language.

                        Related Work

The body of work on detection of figurative language in the open domain is significant, and
the interest in the topic is
growing, as evidenced by the workshops and shared tasks on figurative language processing
(Klebanov et al. 2020). Veale et al. (2016) provide an overview of the types of figurative
language and of the computational approaches to detection and understanding of figurative
language. The approaches are mostly formulated as a binary classification task on a limited
set of triples, and sometimes as prediction of the class of a token in a sentence (Feldman
and Peng 2013; Gao et al. 2018). Taking into account the immediate lexico-syntactic context
of the utterance and incorporating discourse features improves recognition of figurative
language (Mu, Yannakoudakis, and Shutova 2019). In an end-to-end RNN-based system, Mao et al.
(2019) emulated two human approaches to identification of figurative language: 1) noticing a
semantic contrast between a target word and its context (Selectional Preference Violation),
and 2) identifying if the literal meaning of a word contrasts with the meaning that word
takes in the context (Metaphor Identification Procedure).
   To the best of our knowledge, our work is the first to explore the difficulties figurative
language poses for automated indexing of the biomedical literature. We also provide the first
publicly available biomedical literature dataset annotated for figurative language at the
token and sentence level. In addition, leveraging the state-of-the-art approaches explored in
the open domain, we establish baselines for detection of figurative language in biomedical
abstracts using sentence or token level classification.

            Data Sources and Collections

We analyzed 870 American English idioms (Bulkes and Tanner 2017), and 464 metaphors (Katz et
al. 1988; Campbell and Raney 2016). We searched the Free Dictionary Idioms dictionary (FARLEX
2020, accessed November 2020) for additional examples of figurative phrases. We then
submitted figurative language expressions to MeSH on Demand (NLM 2020, accessed November
2020) to identify potential triggers for false-positive linking to MeSH, e.g., cat and mouse
in “the game of cat and mouse” could be mapped to Cats and Mice, respectively. We then
searched PubMed with these trigger terms to get the frequency of their use in publications.
We identified the seven most frequent false-positive triggers, which are shown in Table 1
along with the sizes of the training and test sets for each term.

Term (MH)                  Check Tag      Training      Test
fall (Accidental Falls)       no            45,820      895
fish (Fishes)                 no            18,256      513
juvenile (Adolescent)         yes           59,176      581
baby (Infant)                 yes            1,065      270
bull (Cattle)                 yes            1,194      555
cat (Cats)                    yes            4,368      542
dog (Dogs)                    yes           19,167      905

Table 1: Sizes of the training and test sets for each term in the PubMed Figurative Language
Collection. The Check Tag column indicates if the term is a required term to be added because
it pertains to the subject of the study. Check Tags are the most frequently used MeSH terms,
which indicates our collection covers a sizable portion of false positive triggers.

   We then searched PubMed for the exact figurative expressions, and for the abstracts
containing trigger terms that were either indexed or not with the corresponding MeSH
headings. Abstracts with trigger terms and MeSH headings serve as examples of literal use in
the training set, and abstracts without MeSH headings serve as examples of non-literal use.
For the test sets, we randomly sampled files from both distributions and manually annotated
the sentences containing the terms at the token level. We annotated fine-grained senses
corresponding to:
1. Full MH: the literal MeSH Heading-appropriate sense, e.g., “a healthy baby at 34 weeks of
   gestation.”
2. Partial Literal: MH-appropriate sense, but being a part of an expression, which should not
   trigger mapping to MeSH, e.g., shaken baby syndrome.
3. Literal Other: Literal senses other than MH, e.g., baby hamster is still a baby, but it
   should not be indexed with Infant, which applies only to human babies.
4. Figurative: Non-literal use of the term, e.g., in “There’s a Baby in this Bath Water!”
Each document was annotated by two annotators and the differences were reconciled. The labels
assigned by the indexers were not shown to the annotators to avoid bias.

                        Experiments

We explored CNN-RNN (Svoboda 2020, accessed November 2020), Logistic Regression (Pedregosa et
al. 2011) and BERT-based (Kaiyinzhou 2020, accessed November 2020) approaches with various
embeddings and the Universal Sentence Encoder (Cer et al. 2018). We used sentences from
PubMed abstracts containing the trigger terms and the expressions from the above collections
of idioms for training these models. Due to sparseness of the annotations and unavailability
of sufficient examples for training and for judging the results, we collapsed the annotations
into two classes: figurative or literal MH-appropriate. Any terms that were labeled
LiteralOther or PartialLiteral were relabeled as Figurative. For example, in an article about
dog owners, dog was considered as non-literal. Terms labeled as Figurative or FullMH remained
unchanged.
   We then approached the task as binary classification at the sentence or token level.
   To train the CNN-RNN and Logistic Regression models, sentences containing the target
trigger terms were extracted from a set of retrieved documents that were labeled using MeSH
indexing information as described above. Each extracted sentence was assigned the label of
the document from which it was derived. Sentence embeddings were generated using a Doc2Vec
(Rehurek and Sojka 2010) model pre-trained on the documents retrieved for the trigger terms.
   In the CNN-RNN approach, the embeddings and associated labels served as input to a neural
network containing four groups of four layers: convolutional layer, dropout, max-pooling, and
dropout, followed by an LSTM layer.
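The document-level labelling of training sentences and the collapsing of the four annotation
classes into two, both described in this section, can be illustrated with a minimal sketch.
It is not the code used in this work: the function names, the regular-expression sentence
splitter, the prefix match on the trigger term, and the label strings are all assumptions
made for illustration.

```python
import re

# Assumed label strings for the four annotated senses; LiteralOther and
# PartialLiteral are folded into Figurative, FullMH and Figurative are kept.
COLLAPSE = {
    "FullMH": "FullMH",
    "PartialLiteral": "Figurative",
    "LiteralOther": "Figurative",
    "Figurative": "Figurative",
}


def collapse_label(fine_label):
    """Map a fine-grained annotation to one of the two classes."""
    return COLLAPSE[fine_label]


def split_sentences(text):
    """Naive splitter on sentence-final punctuation (an assumption;
    a real pipeline would use a proper sentence segmenter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def weak_label_sentences(abstract, trigger, indexed_with_heading):
    """Give every sentence containing the trigger term the label of its
    document: FullMH if the abstract was indexed with the corresponding
    MeSH heading, Figurative otherwise."""
    doc_label = "FullMH" if indexed_with_heading else "Figurative"
    # Crude prefix match, case-insensitive: "fall" also matches "falls"
    # and "Falling" (and, wrongly, unrelated words sharing the prefix).
    pattern = re.compile(r"\b" + re.escape(trigger) + r"\w*", re.IGNORECASE)
    return [(s, doc_label) for s in split_sentences(abstract) if pattern.search(s)]
```

For example, calling `weak_label_sentences` on a two-sentence abstract in which only the
first sentence mentions fall returns that single sentence, paired with the label derived
from the indexing of the whole abstract.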
                                                Sentence level                                                   Token level
Term                CNN-RNN                    Logistic regression                    USE                          BERT
             P       R      F1       A      P       R       F1      A      P       R       F1      A       P       R      F1       A
fall       0.77     0.68   0.72     0.99   0.64    0.78    0.71   0.73    0.89    0.89    0.89    0.88    0.37    0.34   0.35     0.98
fish       0.51     0.48   0.50     0.99   0.58    0.45    0.50   0.50    0.58    0.54    0.56    0.48    0.37    0.35   0.36     0.98
juvenile   0.77     0.64   0.70     0.99   0.97    0.38    0.55   0.86    0.82    0.83    0.82    0.80    0.37    0.36   0.37     0.99
baby       0.76     0.99   0.86     0.99   0.39    0.36    0.37   0.39    0.67    0.56    0.61    0.45    0.61    0.61   0.61     0.99
bull       0.90     0.87   0.88     0.99   0.56    0.38    0.45   0.58    0.78    0.74    0.76    0.71    0.84    0.86   0.85     0.99
cat        0.77     0.74   0.76     0.99   0.54    0.74    0.63   0.54    0.73    0.73    0.73    0.65    0.68    0.78   0.73     0.99
dog        0.76     0.97   0.85     0.98   0.48    0.55    0.51   0.50    0.63    0.58    0.60    0.65    0.76    0.78   0.77     0.99

Table 2: Results of predicting literal and figurative use of trigger terms. USE = Universal Sentence Encoder, P = Precision,
R = Recall, F1 = F-1 score, A = Accuracy. The differences in 0.99 accuracy between the CNN-RNN and BERT approaches are in the
third decimal place.
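As a reading aid for Table 2, the sketch below recomputes F-1 from precision and recall and
shows, on a toy sentence, how accuracy counted over all tokens can stay high while the scores
for the trigger class are low. It assumes the token-level setup described for BERT in the
Experiments section (trigger term tagged with the sentence label, all other tokens tagged as
outside). The helper names, the tag strings and the prefix match are hypothetical, and the
sketch does not claim to reproduce how the reported scores were computed.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def bio_tags(tokens, trigger, sentence_label):
    """Tag tokens starting with the trigger term as B-<label>, the rest as O
    (assumed tagging scheme, using a crude prefix match)."""
    return ["B-" + sentence_label if t.lower().startswith(trigger) else "O"
            for t in tokens]


def prf_accuracy(gold, pred, positive):
    """Precision, recall and F-1 for one tag, plus accuracy over all tokens."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return precision, recall, f1_score(precision, recall), accuracy


# F-1 recomputed from two precision/recall pairs reported in Table 2.
print(round(f1_score(0.77, 0.68), 2))  # fall, CNN-RNN
print(round(f1_score(0.97, 0.38), 2))  # juvenile, logistic regression

# Toy case: a tagger that misses the only trigger token is still right
# on every outside token.
tokens = "Falling off the dopamine wagon".split()
gold = bio_tags(tokens, "fall", "FIG")
pred = ["O"] * len(tokens)
print(prf_accuracy(gold, pred, "B-FIG"))
```

In the toy case four of the five tokens are tagged correctly, so accuracy is high although
recall for the trigger tag is zero. Under the stated assumption, this is one possible reason
why accuracy alone says little about how well the trigger terms themselves are classified.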

The model uses a sigmoid activation function, binary cross-entropy loss and the Adam
optimizer.
   We used the scikit-learn Logistic Regression classifier, with Doc2Vec output as inputs.
   The Universal Sentence Encoder was also applied in the sentence level classification task.
Unlike the Doc2Vec models, the Universal Sentence Encoder was trained on a very large corpus
using a variety of sources. In our approach, each sentence vector representation was
generated using the Universal Sentence Encoder during training. The vector representation and
the sentence label were then passed to a two-layer neural network consisting of a ReLU and a
softmax layer. A categorical cross-entropy loss and the Adam optimizer were used when
building the model.
   We used a BERT encoder extended with a CRF layer for Named Entity Recognition (Kaiyinzhou
2020, accessed November 2020) for the token-level classification of literal and figurative
use of the tokens. We used BIO-style (Beginning-Inside-Outside) features. To train BERT, we
tagged the trigger terms with the label of the sentence and all other terms in the sentence
as outside.

                           Results

Table 2 summarizes the results obtained for the binary classification approaches to detection
of figurative language. The PubMed searches yielded training sets of varying sizes, ranging
from 1,065 documents for baby to 59,176 for juvenile. The manually annotated test sets for
each of the terms range from 270 to 905 documents. The size of the training set does not seem
to be directly correlated with the results, as shown in Figure 1.

Figure 1: The size of the training set does not always directly influence the best F-1 scores
obtained in figurative language detection.

                        Discussion

We created a collection of PubMed abstracts automatically annotated for literal and
non-literal use of seven terms that proved to be a rich source of false positive linking to
terminologies and have sufficient amounts of training documents in PubMed. Interestingly, one
of these terms, fall, was also found to be difficult to classify as figurative in the open
domain tasks (Stowe et al. 2019).
   We explored several state-of-the-art approaches, casting the task as binary classification
at the sentence and token level. We hoped to identify one best approach for the task and
achieve state-of-the-art performance for all trigger terms. The best results reported in the
literature for open-domain figurative language detection and in the shared task on metaphor
detection (Klebanov et al. 2020) are around 70% F-1 score, sometimes reaching 80% and above.
Although we have obtained F-1 scores above 80% for five of the seven terms, we cannot
identify a single approach that will achieve good scores on all trigger terms. The F-1 score
for fish is only 56%. This score could probably be explained by the fact that this term often
violates the widely used WSD assumption of “one sense per document” (Yarowsky 1995), which we
used to create the training set. As can be seen in the following example, two senses of fish
are used in the same sentence:
    These preliminary results provide the basis for the further development of a non-GMO
    approach to modulate fish allergenicity and improve safety of aquaculture fish.
    (PMID: 31622806)
   The indexers labeled this article with both Fishes and Seafood. When the contexts for
these occurrences of fish are used in the models as positive examples, they might be too
close to the contexts of the articles that present fish only in the context of food and thus
serve as negative examples.
   With respect to identifying one approach that would work best for all of the trigger
terms, we can see that casting the task as sentence-level classification and using the
CNN-RNN model produces the majority of best results. Stowe et al. (2019) observe that fall is
difficult to classify because the distribution of the literal and metaphoric uses of this
word in the open domain is almost even. In our annotations, we also observed frequent use of
fall in personification, which might explain why the Universal Sentence Encoder pre-trained
on a variety of sources performs much better for fall.
   Another interesting observation is that if we want to select a method for automated
indexing, we will have to decide if recall or precision is more important when suggesting the
terms. For cat, dog, fish and juvenile, the differences in these two metrics achieved by
different approaches are relatively large, although the F-scores are mostly close, showing a
typical trade-off between the two metrics. In selecting approaches to support automated
indexing, precision often [...]

[...] open domain evaluations. We hope that the interesting problem of detection of
figurative language in biomedical text, the dataset, and the automated approach to creation
of the training sets outlined in this work will bring about further research in this area.
   Data & code: https://ii.nlm.nih.gov/DataSets/index.shtml

                     Acknowledgements

This work was supported by the intramural research program at the U.S. National Library of
Medicine, National Institutes of Health.
   We thank Alan Aronson, Francois Lang, Laritza Rodriguez and Sonya Shooshan for judging
parts of the collections. We thank Anna Ripple for constructing PubMed searches.

                        References
