<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLM at ImageCLEF 2017 Caption Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asma Ben Abacha</string-name>
          <email>asma.benabacha@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alba G. Seco de Herrera</string-name>
          <email>albagarcia@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soumya Gayen</string-name>
          <email>soumya.gayen@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dina Demner-Fushman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sameer Antani</string-name>
          <email>santani@mail.nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lister Hill National Center for Biomedical Communications, National Library of Medicine</institution>
          ,
          <addr-line>Bethesda</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the U.S. National Library of Medicine (NLM) in the ImageCLEF 2017 caption task. We proposed different machine learning methods using training subsets that we selected from the provided data, as well as retrieval methods using external data. For the concept detection subtask, we used Convolutional Neural Networks (CNNs) and Binary Relevance using decision trees for multi-label classification. We also proposed a retrieval-based approach using the Open-i image search engine and MetaMapLite to recognize relevant terms and associated Concept Unique Identifiers (CUIs). For the caption prediction subtask, we used the recognized CUIs and the UMLS to generate the captions. We also applied Open-i to retrieve similar images and their captions. We submitted ten runs for the concept detection subtask and six runs for the caption prediction subtask. CNNs provided good results considering the size of the selected subsets and the limited number of CUIs used for training. Using the CUIs recognized by the CNNs, our UMLS-based method for caption prediction obtained good results with a 0.2247 mean BLEU score. In both subtasks, the best results were achieved using retrieval-based approaches, which outperformed the runs submitted by all the participants, with a 0.1718 mean F1 score in the concept detection subtask and a 0.5634 mean BLEU score in the caption prediction subtask.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Detection</kwd>
        <kwd>Caption Prediction</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Multi-label Classification</kwd>
        <kwd>Open-i</kwd>
        <kwd>MetaMapLite</kwd>
        <kwd>UMLS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the participation of the U.S. National Library of Medicine (NLM, http://www.nlm.nih.gov) in the ImageCLEF 2017 caption task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. ImageCLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an evaluation campaign organized as part of the CLEF initiative labs (http://clef2017.clef-initiative.eu). In 2017, the caption task consisted of two subtasks: concept detection and caption prediction. A detailed description of the data and the task is presented in Eickhoff et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>1 http://www.nlm.nih.gov</title>
    </sec>
    <sec id="sec-3">
      <p>
        The concept detection subtask consists of identifying the UMLS® (Unified Medical Language System, https://www.nlm.nih.gov/research/umls) Concept Unique Identifiers (CUIs). To address this first challenge of detecting CUIs in a given image from the biomedical literature, we propose several approaches based on multi-label classification and information retrieval. For the multi-label classification, Convolutional Neural Networks (CNNs) and Binary Relevance using Decision Trees (BR-DT) are applied. The information retrieval approach is based on the Open-i Biomedical Image Search Engine (http://Open-i.nlm.nih.gov) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The caption prediction subtask aims to recreate the original image caption. To predict the captions of the images, we proposed a retrieval-based approach using Open-i and a second approach based on the retrieved CUIs and the UMLS® to find the associated terms and semantic groups.</p>
      <p>The rest of the paper is organized as follows. Section 2 describes the data provided for the two subtasks and our method for selecting training subsets. We then present the proposed approaches for concept detection in Section 3 and caption prediction in Section 4. Section 5 provides a description of the submitted runs. Finally, Section 6 presents and discusses our results.</p>
      <sec id="sec-3-1">
        <title>Data Analysis and Selection</title>
        <p>Training, validation and test datasets were provided, containing 164,614, 10,000 and 10,000 biomedical images, respectively. The images were extracted from scholarly articles in PubMed Central (PMC, http://www.ncbi.nlm.nih.gov/pmc).</p>
        <p>For the concept detection subtask, a set of CUIs was provided for each image. For the caption prediction subtask, captions were provided. Figure 1 shows an example from the provided data.</p>
        <sec id="sec-3-1-1">
          <title>Analysis of Concept Detection Data</title>
          <p>We analyzed the task data in order to study the types of methods that could be applied to the concept detection subtask and whether it was necessary to select training data and remove the less frequent CUIs. We also studied whether it is relevant to build rule-based methods and construct patterns for the caption prediction subtask based on the recognized CUIs.</p>
          <p>For the concept detection subtask:
- Training data includes 164,614 images associated with 20,463 CUIs; 19,145 CUIs have fewer than 100 images, including 6,251 CUIs with only one image.
- Validation data includes 10,000 images associated with 7,070 CUIs; 6,981 CUIs have fewer than 100 images, including 3,247 CUIs with only one image.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 https://www.nlm.nih.gov/research/umls</title>
    </sec>
    <sec id="sec-5">
      <title>4 http://Open-i.nlm.nih.gov</title>
    </sec>
    <sec id="sec-6">
      <p>The concepts and caption provided for the example image (Figure 1) are:
Concepts:
- C0016911: Gadolinium
- C0021485: Injection of therapeutic agent
- C0024485: Magnetic Resonance Imaging
- C0577559: Mass of body structure
- C1533685: Injection procedure
Caption: Magnetic resonance imaging. After intravenous injection of gadolinium, the mass showed a progressive, heterogeneous, and delayed enhancement.</p>
      <p>The heterogeneous distribution of CUIs in the training data is not well suited to multi-label classification; we therefore studied data selection methods.</p>
      <p>
        Cho et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] applied deep learning to medical image classification and focused on determining the ideal training data size to achieve high classification accuracy. They trained a CNN using different sizes of training data and tested the models on 6,000 computed tomography (CT) images. Using 200 training samples, the classification accuracy was already near or at 100%. Based on these experiments, we fixed a threshold of 200 training images for each CUI.
      </p>
      <p>In addition to the number of examples per CUI, some CUIs are much more frequent than others in the datasets (the number of training images per CUI varies from 1 to 17,998). We therefore built two different training subsets targeting the most frequent CUIs:
- Subset 1 [92 CUIs with frequency &gt;= 1,500]: We selected the CUIs having at least 1,500 training examples. This subset corresponded to 92 distinct CUIs. For each CUI, we randomly selected 200 training examples from the provided training images.
- Subset 2 [239 CUIs with frequency &gt;= 400]: We selected the CUIs having at least 400 training examples. This subset corresponded to 239 CUIs. For each CUI, we randomly selected 200 training examples.</p>
      <p>We used these two subsets to train our machine learning (ML) methods.</p>
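      <p>As an illustration, the following Python sketch builds such a frequency-based training subset from a file listing image identifiers and their CUIs; the file name, separator and column layout are assumptions for illustration only, not the official distribution format.</p>
      <preformat>import csv
import random
from collections import defaultdict

MIN_FREQ = 400   # 1,500 for subset 1, 400 for subset 2
PER_CUI = 200    # training examples sampled per retained CUI

# Assumed format: one line per image, "image_id&lt;TAB&gt;CUI1;CUI2;..."
images_per_cui = defaultdict(list)
with open("ConceptDetectionTraining2017.csv") as f:
    for image_id, cui_field in csv.reader(f, delimiter="\t"):
        for cui in cui_field.split(";"):
            images_per_cui[cui].append(image_id)

# Keep only the CUIs with at least MIN_FREQ training images
frequent_cuis = [c for c, imgs in images_per_cui.items() if len(imgs) &gt;= MIN_FREQ]

# Randomly sample PER_CUI training examples for each retained CUI
subset = {cui: random.sample(images_per_cui[cui], PER_CUI) for cui in frequent_cuis}

print(len(frequent_cuis), "CUIs retained")</preformat>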
      <sec id="sec-6-1">
        <title>Concept Detection Methods</title>
        <p>For the concept detection subtask, each image can be associated with one or multiple CUIs. We approached the problem in two ways: (1) applying multi-label classification methods and (2) using a retrieval-based approach.</p>
        <p>In the multi-label classification approach, we consider the CUIs in the training set as the labels to be assigned. Thus each image will be assigned one or multiple labels from the predefined label set. Two methods for multi-label classification were applied: Convolutional Neural Networks (CNNs) and Binary Relevance using Decision Trees (BR-DT). To train our ML models, we utilized the high-performance computational capabilities of the Biowulf Linux cluster at the U.S. National Institutes of Health (http://biowulf.nih.gov).</p>
        <p>In the information retrieval approach, we used Open-i to retrieve the most similar images and their associated labels and CUIs.</p>
        <sec id="sec-6-1-1">
          <title>Multi-label Classification with Convolutional Neural Networks (CNNs)</title>
          <p>
            Deep learning methods have been widely applied to image analysis. In particular, CNNs have achieved excellent results for image classification [
            <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7 http://biowulf.nih.gov</title>
      <p>We applied CNNs for multi-label classification and tested different neural networks, such as the GoogLeNet network [7]. GoogLeNet won the classification and object recognition challenges in the 2014 ImageNet LSVRC competition (ILSVRC 2014, http://image-net.org/challenges/LSVRC/2014/eccv2014). In our experiments on the training sets, the GoogLeNet network provided better results than AlexNet [8] and LeNet [9].</p>
      <p>We ran the CNNs using the NVIDIA Deep Learning GPU Training System (DIGITS, http://github.com/NVIDIA/DIGITS). DIGITS is a Deep Learning (DL) training system with a web interface that allows designing custom network architectures and evaluating their effectiveness. It also allows the design of new models by providing the details of the optimization and network architecture. DIGITS can be used for image classification, segmentation and object detection tasks.</p>
      <p>In our final runs, we used the GoogLeNet network. We applied stochastic gradient descent (SGD) and performed 100 training epochs. We used the two training subsets associated with 92 and 239 CUIs, respectively, to train the network (see Section 2.2).</p>
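      <p>As a rough illustration of this setup, the sketch below trains a GoogLeNet-style network for multi-label CUI prediction with SGD over 100 epochs. It uses PyTorch and torchvision rather than the Caffe/DIGITS stack used for the submitted runs; the placeholder tensors stand in for the actual training images and multi-hot CUI labels, and the optimizer hyperparameters shown are illustrative.</p>
      <preformat>import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

NUM_CUIS = 92  # subset 1; use 239 for subset 2

# GoogLeNet-style network with one output logit per CUI (multi-label setting)
model = models.googlenet(num_classes=NUM_CUIS, aux_logits=False, init_weights=True)

# Placeholder data: replace with a Dataset yielding (image, multi-hot CUI vector)
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 2, (16, NUM_CUIS)).float()
loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

criterion = nn.BCEWithLogitsLoss()  # one independent binary decision per CUI
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for epoch in range(100):  # 100 training epochs, as in our runs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# At test time, CUIs whose sigmoid score exceeds a threshold (e.g. 0.5) are returned.</preformat>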
      <sec id="sec-7-1">
        <title>Multi-label Classi cation with Binary Relevance using Decision</title>
      </sec>
      <sec id="sec-7-2">
        <title>Trees (BR-DT)</title>
        <p>The Meka project [10] (http://meka.sourceforge.net) is based on the Weka machine learning library [11] and provides an open-source implementation of methods for multi-label classification. It contains several algorithms, such as Binary Relevance (BR) and Label Powerset.</p>
        <p>Similar to [12], we used BR-DT as implemented in Meka (J48). BR methods create an individual model for each label, so each model addresses a simple binary problem. We used Decision Trees (DT) as the base classifier because DT are able to capture relations between labels. For the experiments, we extracted from the images a visual descriptor commonly used for image classification, the Colour and Edge Directivity Descriptor (CEDD) [13]. The descriptor was provided as input to Meka.</p>
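        <p>For reference, the sketch below applies binary relevance with decision trees over precomputed CEDD vectors, using scikit-learn and scikit-multilearn instead of Meka's J48-based implementation; the random feature and label matrices stand in for the real CEDD descriptors and CUI labels.</p>
        <preformat>import numpy as np
from sklearn.tree import DecisionTreeClassifier
from skmultilearn.problem_transform import BinaryRelevance

# X: one 144-dimensional CEDD vector per training image (assumed precomputed)
# Y: multi-hot CUI matrix, one column per CUI in the training subset
X = np.random.rand(1000, 144)
Y = np.random.randint(0, 2, size=(1000, 92))

# Binary relevance: one independent decision tree is trained per CUI
br_dt = BinaryRelevance(classifier=DecisionTreeClassifier(), require_dense=[True, True])
br_dt.fit(X, Y)

# Predict CUIs for new CEDD vectors; the result is a sparse indicator matrix
X_test = np.random.rand(10, 144)
predictions = br_dt.predict(X_test)</preformat>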
        <p>Before submitting the runs, we also carried out experiments on the training data using the Fuzzy Colour and Texture Histogram (FCTH) [14] as a visual descriptor; however, CEDD provided better results.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Retrieval and Annotation Approach with Open-i and</title>
      </sec>
      <sec id="sec-7-4">
        <title>MetaMapLite</title>
        <p>The Open-i service of the NLM enables search and retrieval of abstracts and images (including charts, graphs and clinical images) from the open-source literature and from biomedical image collections. Open-i provides access to over 3.7 million images from about 1.2 million PubMed Central articles; 7,470 chest x-rays with 3,955 radiology reports; 67,517 images from the NLM History of Medicine collection; 2,064 orthopedic illustrations; and 8,084 medical case images from MedPix (figures as of September 2016).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8 http://image-net.org/challenges/LSVRC/2014/eccv2014</title>
    </sec>
    <sec id="sec-9">
      <p>Open-i combines text processing, image analysis and machine learning techniques to retrieve relevant images for an input image query.</p>
      <p>We submitted each query image to the Open-i search API and selected 10 result images with captions. For each retrieved image, we annotated its caption with MetaMapLite (version 3.1-SNAPSHOT, https://metamap.nlm.nih.gov/MetaMapLite.shtml) to recognize CUIs. MetaMapLite recognizes named entities using the longest match, as well as the associated CUIs. It also allows restricting the CUIs to selected UMLS Semantic Types; we did not use any restriction, as the CUIs in the provided data have heterogeneous semantic types.</p>
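      <p>The overall pipeline can be sketched as follows; openi_similar_captions and metamaplite_cuis are hypothetical placeholders for the Open-i image-similarity query and the MetaMapLite invocation, and only the overall flow (retrieve captions, annotate them, cap the output at 50 CUIs) follows the description above.</p>
      <preformat>def openi_similar_captions(image_path, top_k=10):
    """Placeholder: query Open-i with an image and return the captions
    of the top_k most similar retrieved images (most similar first)."""
    raise NotImplementedError

def metamaplite_cuis(text):
    """Placeholder: run MetaMapLite on a caption and return the CUIs of
    the UMLS concepts recognized with longest-match entity detection."""
    raise NotImplementedError

def detect_concepts(image_path, captions_to_use=1, max_cuis=50):
    # Use the captions of the most similar image(s) returned by Open-i
    captions = openi_similar_captions(image_path)[:captions_to_use]
    cuis = []
    for caption in captions:
        for cui in metamaplite_cuis(caption):
            if cui not in cuis:      # keep retrieval order, drop duplicates
                cuis.append(cui)
    return cuis[:max_cuis]           # task limit: at most 50 CUIs per image</preformat>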
      <sec id="sec-9-1">
        <title>Caption Prediction Methods</title>
        <p>To predict image captions, we used two different methods based on the UMLS® and Open-i.</p>
        <sec id="sec-9-1-1">
          <title>UMLS-based Method</title>
          <p>We used the CUIs recognized in the first (concept detection) subtask to generate the associated UMLS terms and semantic types. We then grouped the recognized UMLS terms using the UMLS groups of their semantic types. The UMLS Semantic Network includes 15 groups: Activities &amp; Behaviors, Anatomy, Chemicals &amp; Drugs, Concepts &amp; Ideas, Devices, Disorders, Genes &amp; Molecular Sequences, Geographic Areas, Living Beings, Objects, Occupations, Organizations, Phenomena, Physiology and Procedures.</p>
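          <p>A simplified sketch of this caption-generation step is given below; the cui_to_term and cui_to_group dictionaries stand in for lookups against the UMLS Metathesaurus and Semantic Network, which are queried separately.</p>
          <preformat>def generate_caption(cuis, cui_to_term, cui_to_group):
    """Group the recognized UMLS terms by semantic group and emit
    'Group: term1, term2.' segments, as in our UMLS-based runs."""
    grouped = {}
    for cui in cuis:
        term = cui_to_term.get(cui)
        group = cui_to_group.get(cui)
        if term is None or group is None:
            continue                 # skip CUIs without a UMLS entry
        grouped.setdefault(group, []).append(term)
    return " ".join(f"{g}: {', '.join(t)}." for g, t in grouped.items())

# Toy lookup tables; the real values come from the UMLS
terms = {"C0024485": "magnetic resonance imaging", "C0016911": "gadolinium"}
groups = {"C0024485": "Procedures", "C0016911": "Chemicals &amp; Drugs"}
print(generate_caption(["C0024485", "C0016911"], terms, groups))
# Procedures: magnetic resonance imaging. Chemicals &amp; Drugs: gadolinium.</preformat>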
          <p>The following are examples of four captions and their corresponding image IDs, generated using the UMLS-based method:
1. 1471-2342-10-23-4: Procedures: diagnostic computed tomography, imaging pet. Anatomy: armpit. Disorders: metastasis. Physiology: uptake.
2. iej-04-20-g007: Procedures: h&amp;e stain. Chemicals &amp; Drugs: haematoxylin, 11445 red, eosin. Disorders: proliferation.
3. 13014 2015 335 Fig1 HTML: Procedures: brain mri, diffusion weighted imaging, bodies weight. Concepts &amp; Ideas: rows. Chemicals &amp; Drugs: gadolinium.
4. fonc-04-00350-g002: Procedures: antineoplastic chemotherapy regimen. Disorders: abnormally opaque structure, condition response. Anatomy: left lung, anterior thoracic region.</p>
        </sec>
        <sec id="sec-9-1-2">
          <title>Open-i-based Method</title>
          <p>For each input image, the Open-i biomedical image search engine returns a list of similar images. In our experiments, we performed several tests with the caption, mention, Medical Subject Headings (MeSH®) terms, three outcomes and medical problems from the retrieved images. In our final runs, we used only the captions of the first and second retrieved images.</p>
          <p>The following are two examples of results provided by Open-i:
1. 1MS-10-20646-g003: Open-i provides the following relevant results:
- Caption: Laryngostenosis in patient with laryngeal tuberculosis. Tracheostomy.
- Problem(s): tuberculoses.
- Concept(s): laryngostenoses; laryngeal tuberculoses.
- Outcomes: (i) Within the group of patients with lymph node tuberculosis in 15 cases there were infected lymph nodes of the 2(nd) and 3(rd) cervical region and in 11 infected lymph nodes of the 1(st) cervical region. (ii) In 5 cases of laryngeal tuberculosis there was detected coexistence of cancer. (iii) Chest X-ray was performed in all cases and pulmonary tuberculosis was identified in 26 (35.6%) cases.
- Mention: Moreover, histopathological examination revealed in 5 cases coexistence of planoepithelial carcinoma with tuberculosis. In all 5 cases total laryngectomy was performed. Chest X-ray was performed in all patients and the evidence of lung tuberculosis was confirmed in 14 (70%) cases. Tuberculin skin test was positive in 10 (66.6%) out of 15 tests performed. Contact history with active tuberculosis was detected in 3 (15%) cases (Figures 2 and 3).
2. 110.1177 2324709614529417- g1: Open-i provides the following results:
- Caption: Magnetic resonance imaging after the onset of isolated adrenocorticotropic hormone deficiency. Magnetic resonance imaging showed no space-occupying lesions in the pituitary gland or hypothalamus.
- Problem(s): isolated adrenocorticotropic hormone deficiency.
- Concept(s): isolated adrenocorticotropic hormone deficiency.
- Outcomes: (i) Although the neutropenia and fever immediately improved, he became unable to take any oral medications and was bedridden 1 week after admission. (ii) His serum sodium level abruptly decreased to 122 mEq/L on the fifth day of hospitalization. (iii) Hydrocortisone replacement therapy was begun at 20 mg/day, resulting in a marked improvement in his anorexia and general fatigue within a few days.
- Mention: CT and magnetic resonance imaging showed no space-occupying lesion or atrophic change in his pituitary gland or hypothalamus (Figure 1).</p>
        </sec>
      </sec>
      <sec id="sec-9-2">
        <title>Runs</title>
        <p>This section provides a detailed description of the runs submitted to the ImageCLEF 2017 caption task. The methods used to implement these runs are described in Sections 3 and 4.</p>
        <sec id="sec-9-2-1">
          <title>Concept Detection</title>
          <p>As specified by the task guidelines, a maximum of 50 UMLS concepts per figure is accepted. Therefore, if the limit of 50 CUIs per image was reached, we kept only the first 50 CUIs for each image. We submitted the following runs to the Concept Detection subtask:
DET 1. DET run 1 Open-i MetaMapLite 1: We used Open-i to find similar images and then extracted CUIs from their captions using MetaMapLite. In this first run, we used the caption of the most similar image according to Open-i. The returned CUIs are all the CUIs recognized by MetaMapLite.</p>
          <p>DET 2. DET run 1 baseline: The same as DET 1, but excluding test images if they are retrieved by Open-i.</p>
          <p>DET 3. DET run 2 Open-i MetaMapLite 2: The same as DET 1, except that we took only the first CUI recognized by MetaMapLite for each term.</p>
          <p>DET 4. DET run 3 Open-i MetaMapLite 3: Similar to DET 1, except that we used the captions from the first and second best images retrieved by Open-i.</p>
          <p>DET 5. DET run 5 Meka CEDD: Multi-label classification method using the Meka software to apply the binary relevance method. CEDD is used as a visual descriptor for the images. Subset 1 of 92 CUIs is used for training.</p>
          <p>DET 6. DET run 6 CNN GoogLeNet 92Cuis: Multi-label classification with a convolutional neural network. We trained the GoogLeNet network using subset 1 of 92 CUIs.</p>
          <p>DET 7. DET run 7 CNN GoogLeNet 239Cuis: We trained the GoogLeNet
network using subset 2 of 239 CUIs.</p>
          <p>DET 8. DET run 8 comb1 CNN2: Fusion of the runs DET 6 and DET 7.</p>
          <p>DET 9. DET run 9 comb2 CNN2Meka: Fusion of the runs DET 5, DET 6 and DET 7.</p>
          <p>DET 10. DET run 10 comb3 CNN2MekaOpen-i: Fusion of the runs DET 1, DET 5, DET 6 and DET 7.</p>
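          <p>The fusion used in runs DET 8 to DET 10 can be sketched as below, assuming a simple union of the per-image CUI lists produced by the individual runs (as noted in the conclusions, the intersection yielded very few CUIs), truncated to the 50-CUI limit.</p>
          <preformat>def fuse_runs(runs, max_cuis=50):
    """Union-style fusion of several concept detection runs.
    Each run maps an image id to an ordered list of predicted CUIs."""
    fused = {}
    image_ids = set().union(*(run.keys() for run in runs))
    for image_id in image_ids:
        merged = []
        for run in runs:
            for cui in run.get(image_id, []):
                if cui not in merged:   # keep first-seen order, no duplicates
                    merged.append(cui)
        fused[image_id] = merged[:max_cuis]
    return fused

# Example: DET 8 fuses the two CNN runs (DET 6 and DET 7)
det6 = {"img1": ["C0024485", "C0016911"]}
det7 = {"img1": ["C0024485", "C0577559"]}
print(fuse_runs([det6, det7]))
# {'img1': ['C0024485', 'C0016911', 'C0577559']}</preformat>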
        </sec>
        <sec id="sec-9-2-5">
          <title>Caption Prediction</title>
          <p>We submitted the following runs to the Caption Prediction subtask:
PRED 1. PRED run 1 Open-iMethod: We used the Open-i Biomedical Image Search Engine to find similar images. In this run, we used the caption of the first retrieved image.</p>
          <p>PRED 2. PRED run 1 baseline: Same as PRED 1, except we excluded the
test images if they are retrieved by Open-i.</p>
          <p>PRED 3. PRED run 2 CNN 92: We used the CUIs recognized by the CNN (run DET 6) and the UMLS semantic groups to generate the captions.</p>
          <p>PRED 4. PRED run 3 CNN 239: We used the CUIs recognized by the CNN (run DET 7) and the UMLS semantic groups to generate the captions.</p>
          <p>PRED 5. PRED run 4 CNN comb: We used the CUIs recognized by the CNN (run DET 8) and the UMLS semantic groups to generate the captions.</p>
          <p>PRED 6. PRED run 5 comb all: We used the CUIs recognized by the hybrid method (run DET 10) and the UMLS® to generate the captions.</p>
        </sec>
      </sec>
      <sec id="sec-9-3">
        <title>Official Results</title>
        <p>In this section we describe and discuss the results obtained by the submitted runs.</p>
        <sec id="sec-9-3-1">
          <title>Concept Detection Results</title>
          <p>The best overall results were obtained by run DET 1, followed by run DET 3; both approaches are based on the Open-i retrieval system. To better understand the results, Table 3 shows the efficiency of the Open-i system on the test set by presenting how many times the query image itself was retrieved and ranked in the first 10 positions when searching the full Open-i collection (3.7 million images). We analyze only the first 10 positions because this is the maximum number of retrieved images that we used in our experiments.</p>
          <p>Open-i was able to find the image among the top 10 results in 61% of the cases, and to extract the relevant information from the image itself.</p>
          <p>For comparison, we performed a second run, called DET 2, which is equivalent to run DET 1 but excludes test images if they are retrieved by Open-i. For run DET 2 the mean F1 score decreased to 0.0162, which we consider a baseline result. The best results using the Open-i based approaches were obtained when using all the CUIs associated with the first retrieved image.</p>
          <p>Without using external resources, the results were poorer. One of the reasons could be that not all the CUIs in the test set were contained in the training and validation sets. Also, we only considered the most frequent CUIs in the training set. With CNNs, a mean F1 score of up to 0.0880 was achieved, and only 0.0012 when applying BR-DT (BR-DT detected at least one CUI for only 2,046 images).</p>
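          <p>For reference, the sketch below computes a mean F1 score over images from predicted and gold CUI sets, assuming the official measure averages the per-image F1 and that images with no correctly predicted CUI contribute zero.</p>
          <preformat>def mean_f1(predictions, gold):
    """predictions, gold: dicts mapping an image id to a set of CUIs."""
    scores = []
    for image_id, gold_cuis in gold.items():
        pred_cuis = predictions.get(image_id, set())
        tp = len(pred_cuis &amp; gold_cuis)   # correctly predicted CUIs
        if tp == 0:
            scores.append(0.0)
            continue
        precision = tp / len(pred_cuis)
        recall = tp / len(gold_cuis)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(gold)

gold = {"img1": {"C0024485", "C0016911"}}
pred = {"img1": {"C0024485", "C0577559"}}
print(round(mean_f1(pred, gold), 4))  # 0.5</preformat>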
          <p>Table 2 also shows the performance of three hybrid methods: run DET 8, run DET 9 and run DET 10.</p>
        </sec>
        <sec id="sec-9-3-2">
          <title>Caption Prediction Results</title>
          <p>The best results were achieved by run PRED 1 using Open-i, with a 0.5634 mean BLEU score, and it was ranked first. As a baseline, we proposed run PRED 2, similar to run PRED 1 but without including test images if they are retrieved by Open-i. Run PRED 2 obtained a 0.2646 mean BLEU score and was the 4th best of the 34 runs submitted by the participating teams.</p>
          <p>CNN approaches achieved good results, with a 0.2247 mean BLEU score, despite the limited number of CUIs used for training and the simple UMLS-based patterns built for caption generation. Two hybrid methods were also presented: run PRED 5 and run PRED 6. In this subtask, run PRED 6 was ranked second.</p>
        </sec>
      </sec>
      <sec id="sec-9-4">
        <title>Conclusions</title>
        <p>This paper describes our participation in the ImageCLEF 2017 caption task. We proposed and compared different approaches for concept detection and caption prediction. Our retrieval methods using Open-i obtained the best results, with a 0.1718 mean F1 score in the concept detection subtask and a 0.5634 mean BLEU score in the caption prediction subtask. We also proposed baseline results by excluding test images if they are found by Open-i; the Open-i baseline was ranked 4th, with a 0.2646 mean BLEU score, in the caption prediction subtask.</p>
        <p>We also performed multi-label classification of CUIs with CNNs and BR-DT. Both methods used selected subsets of the training data. CNNs provided acceptable results considering the limited number of CUIs used for training; the CNN-based method achieved a 0.2247 mean BLEU score in the caption prediction subtask.</p>
        <p>Future improvements could address the Open-i based method, as it does not support images with panels; one option would be to perform panel segmentation before the search. Open-i also limits image size to 2 MB; a better approach would be to resize the image, if needed, before submitting it to the Open-i API. In addition, MetaMapLite provided CUIs that differ from the gold standard even when the labels retrieved by Open-i are correct. Moreover, we only used fusion to combine the results of our different methods for concept detection (the intersection gave very few CUIs). More sophisticated combination methods could be used to improve the results of the hybrid methods.</p>
      </sec>
      <sec id="sec-9-5">
        <title>Acknowledgments</title>
        <p>This research was supported by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM), and Lister Hill National Center for Biomedical Communications (LHNCBC).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , García Seco de Herrera,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , Muller, H.:
          <article-title>Overview of ImageCLEFcaption 2017 - the image caption prediction and concept extraction tasks to understand biomedical images</article-title>
          .
          <source>CLEF working notes</source>
          ,
          <source>CEUR</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dicente Cid</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islam</surname>
          </string-name>
          , Bayzidul and,
          <string-name>
            <surname>K</surname>
          </string-name>
          .V.,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of ImageCLEF 2017: Information extraction from images</article-title>
          .
          <source>In: CLEF 2017 Proceedings. Lecture Notes in Computer Science</source>
          , Dublin, Ireland, Springer (September 11-14,
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simpson</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thoma</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          :
          <article-title>Design and development of a multimodal biomedical information retrieval system</article-title>
          .
          <source>Journal of Computing Science and Engineering</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ) (
          <year>2012</year>
          )
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Medical image deep learning with hospital PACS dataset</article-title>
          .
          <source>CoRR abs/1511.06348</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Se</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Summers</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          :
          <article-title>Anatomy-specific classification of medical images using deep convolutional nets</article-title>
          . In: ISBI, IEEE (
          <year>2015</year>
          )
          <fpage>101</fpage>
          -
          <lpage>104</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Havaei</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warde-Farley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jodoin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
          </string-name>
          , H.:
          <article-title>Brain tumor segmentation with deep neural networks</article-title>
          .
          <source>CoRR abs/1505.03540</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. (2015) 1-9</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25. Curran Associates, Inc. (2012) 1097-1105</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (November 1998) 2278-2324</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Read, J., Reutemann, P., Pfahringer, B., Holmes, G.: MEKA: A multi-label/multi-target extension to Weka. Journal of Machine Learning Research 17(21) (2016) 1-5</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2016)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Tanaka, E.A., Nozawa, S.R., Macedo, A.A., Baranauskas, J.A.: A multi-label approach using binary relevance and decision trees applied to functional genomics. Journal of Biomedical Informatics 54 (2015) 85-95</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and edge directivity descriptor: A compact descriptor for image indexing and retrieval. In: Lecture Notes in Computer Science. Volume 5008. (2008) 312-322</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Chatzichristofis, S.A., Boutalis, Y.S.: FCTH: Fuzzy color and texture histogram: A low level feature for accurate image retrieval. In: Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services. (2008) 191-196</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>