=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-ImageCLEF-KalpathyCramerEt2007
|storemode=property
|title=Medical Image Retrieval and Automatic Annotation: OHSU at ImageCLEF 2007
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-ImageCLEF-KalpathyCramerEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/Kalpathy-CramerH07a
}}
==Medical Image Retrieval and Automatic Annotation: OHSU at ImageCLEF 2007==
Medical Image Retrieval and Automatic Annotation: OHSU at ImageCLEF 2007

Jayashree Kalpathy-Cramer, William Hersh
Department of Medical Informatics & Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA
{kalpathy, hersh}@ohsu.edu

Abstract

Oregon Health & Science University participated in the medical retrieval and medical annotation tasks of ImageCLEF 2007. In the medical retrieval task, we created a web-based retrieval system for the collection built on a full-text index of both image and case annotations. The text-based search engine was implemented in Ruby using Ferret, a port of Lucene, and a custom query parser. In addition to this textual index of annotations, supervised machine learning techniques using visual features were used to classify the images by image acquisition modality, and all images were annotated with the purported modality. Purely textual runs as well as mixed runs using the purported modality were submitted. Our runs performed moderately well on the MAP metric and better on the early precision (P10) metric. In the automatic annotation task, we used the 'gist' technique to create the feature vectors. Using statistics derived from a set of multi-scale oriented filters, we created a 512-dimensional vector; PCA was then used to reduce it to a 100-dimensional vector. This feature vector was fed into a two-layer neural network. Our error score on the 1000 test images was 67.8 using the hierarchical error calculations.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

General Terms

Performance, Image annotation, Experimentation, Algorithms

Keywords

Query parsing, Text retrieval, Image modality classification, Neural networks

1. Medical Image Retrieval

Advances in digital imaging technologies and the increasing prevalence of Picture Archival and Communication Systems (PACS) have led to substantial growth in the number of digital images stored in hospitals and medical systems in recent years. In addition, on-line atlases of images have been created for many medical domains including dermatology, radiology and gastroenterology. Medical images can form an essential component of a patient's health record. Medical image retrieval systems can be important aids in diagnosis and treatment. They can also be highly effective in health care education, for students, instructors and patients alike.

a. Introduction

Image retrieval systems do not currently perform as well as their text counterparts [1]. Medical and other image retrieval systems have historically relied on annotations or captions associated with the images for indexing the retrieval system. The labor-intensive task of indexing and cataloging the images in these collections has traditionally been performed manually, a process that can be subjective and prone to errors. The last few decades have seen numerous advances in the area of content-based image retrieval (CBIR) [2,3]. Although CBIR systems have demonstrated success in fairly constrained medical domains including pathology, dermatology, chest radiology, and mammography, they have performed poorly when applied to databases with a wide spectrum of imaging modalities, anatomies and pathologies [1,4,5,6]. Retrieval performance has shown demonstrable improvement when the results of textual and visual techniques are fused, and this has especially been shown to improve early precision [7,8].
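Such fusion can be as simple as a weighted combination of normalized scores from a textual and a visual result list. The Python sketch below is purely illustrative: the min-max normalization, the weights and the sample run data are assumptions, not the exact combination scheme used in our mixed or FIRE-weighted runs.

```python
# Illustrative late fusion of textual and visual retrieval results.
# The weighted-sum scheme and min-max normalization are assumptions,
# not the exact combination used in the submitted runs.

def normalize(scores):
    """Min-max normalize a {doc_id: score} dictionary to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(text_scores, visual_scores, w_text=0.7, w_visual=0.3):
    """Combine two score dictionaries with a weighted sum and rank the result."""
    text_n, vis_n = normalize(text_scores), normalize(visual_scores)
    docs = set(text_n) | set(vis_n)
    fused = {d: w_text * text_n.get(d, 0.0) + w_visual * vis_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    text_run = {"img_12": 8.1, "img_31": 5.4, "img_07": 2.2}     # hypothetical scores
    visual_run = {"img_31": 0.92, "img_55": 0.80, "img_12": 0.41}
    for doc, score in fuse(text_run, visual_run):
        print(doc, round(score, 3))
```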
The medical image retrieval task within the ImageCLEF (ImageCLEFmed) 2007 campaign is a TREC-style evaluation [9] that provides a forum and a set of test collections for the medical image retrieval community to use to benchmark their algorithms on a set of queries. The ImageCLEF campaign has, since 2003, been a part of the Cross Language Evaluation Forum (CLEF) [9,10,11], which is an offshoot of the Text Retrieval Conference (TREC, trec.nist.gov).

b. System description of our adaptive medical image retrieval system

The ImageCLEF medical retrieval collection consists of about 66,000 medical images and the annotations associated with them. We wanted to create a flexible database schema that could incorporate new collections easily while facilitating retrieval using both text and visual techniques. The text annotations in the collection are currently indexed, and we continue to add indexable fields for incorporating visual information.

Database and web application

We used the Ruby programming language with the open source Ruby on Rails web application framework [http://www.ruby-lang.org, http://www.rubyonrails.org]. A PostgreSQL relational database was used to store the images and annotations. The database contains images from the four collections that were part of the ImageCLEFmed 2006 image retrieval challenge as well as two new collections for 2007. The approximately 66,000 images in these collections reside in cases, with annotations in English, German and/or French. The collections themselves were substantially heterogeneous in their architectures. Some collections had only one image per case while others had many images per case. Annotation fields also differed considerably among the collections: some collections had case-based annotations while others had image-based annotations. This difference is especially significant for text-based retrieval, as images of different modalities, anatomies or pathologies can be linked to the same case annotation. In this situation, even though only one image from a case containing many images might be relevant to a query (based on the annotation), all images for the case would be retrieved in a purely text-based system, reducing the precision of the search. We used the relational database to maintain the mappings between the collections, the cases in the collections, the case-based annotations, the images associated with a collection, and the image-based annotations.

Image Processing and Analysis

The image itself has important visual characteristics such as color and texture that can help in the retrieval process. Images that may have had information about the imaging modality, anatomy or view associated with them as part of the DICOM header can lose that information when the image is compressed to become part of a teaching or on-line collection, as the image format used by these collections is usually compressed JPEG. We created additional tables in the database to store image information that was generated using a variety of image processing techniques in MATLAB (www.mathworks.com). For instance, the images in the collection typically do not contain explicit details about the imaging modality. In previous work [8], we described our modality classifier, which can identify the imaging modality for medical images with a high level of confidence (>95% accuracy on the database used for validation). Grey scale images are classified into a set of modalities including x-ray, CT, MRI, ultrasound and nuclear medicine; color image classes include gross pathology, microscopy, and endoscopy. Each image was annotated in the database with the purported image modality and a confidence value.
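To make the general idea concrete, the sketch below trains a small modality classifier from global color and intensity statistics. The feature set, the k-NN model and the training-data layout are stand-ins chosen for illustration; they are not the MATLAB classifier described in [8].

```python
# Illustrative modality classifier: global color/intensity statistics fed to a
# supervised classifier. Features and the k-NN model are stand-ins, not the
# classifier described in the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def modality_features(rgb):
    """rgb: HxWx3 uint8 array. Returns a small global feature vector."""
    img = rgb.astype(np.float32) / 255.0
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    grey = img.mean(axis=2)
    colorfulness = np.mean(np.abs(r - g)) + np.mean(np.abs(g - b))
    hist, _ = np.histogram(grey, bins=16, range=(0.0, 1.0), density=True)
    return np.concatenate(([colorfulness, grey.mean(), grey.std()], hist))

def train_modality_classifier(labelled_images, n_neighbors=5):
    """labelled_images: hypothetical list of (image_array, modality_label) pairs."""
    X = np.vstack([modality_features(img) for img, _ in labelled_images])
    y = [label for _, label in labelled_images]
    return KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)

def annotate(clf, rgb):
    """Return (purported_modality, confidence) for one image."""
    probs = clf.predict_proba(modality_features(rgb).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best], float(probs[best])
```

The returned (modality, confidence) pair corresponds to the purported modality and confidence value stored in our database tables.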
This purported modality annotation can be extremely useful for queries where the user has specified a desired image modality. An example query from ImageCLEF 2006 was "Show me microscopic images of tissue from the cerebellum". The precision of the results of such a query can be improved significantly by restricting the images returned to those of the desired modality [8]. This is especially useful for eliminating, from the returned list, images of the incorrect modality that are part of a case containing a relevant image. However, this increase in precision may come at the cost of recall if the classification algorithm incorrectly classifies the image modality. We continue to experiment with a variety of image clustering and classification algorithms and to add the resulting numerical data and labels to the database. Clustering visually similar images can likewise be used to improve the precision of image retrieval and to speed up the system by searching only images in the same cluster as the query image (if one is available).

Query parser and Search Engine

The system presents search options to the user including Boolean OR, AND and exact match. There are also options to perform fuzzy searches and custom query parsing. The cornerstone of our system is the query parser, written in Ruby. Ferret, a Ruby port of the popular Lucene system, was used as the underlying search engine [http://ferret.davebalmain.com]. Queries were first analyzed using MedPost, a part-of-speech (POS) tagger created using the Medline corpus and distributed by the National Library of Medicine [ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost] [13]. A simple Bayesian classifier [classifier.rubyforge.org] was trained to discern the desired image modality from the query, if present; the classifier performed extremely well within the constrained vocabulary of imaging modalities. Stop words were then removed from the query. These included standard English stop words as well as a small set of stop words determined by analyzing queries from the last three years, including 'finding', 'showing', 'images', 'including' and 'containing'. The system is also linked to the UMLS Metathesaurus, and the user can choose to perform automatic query expansion using synonyms from the Metathesaurus. A sample query "Show me CT images with a brain infarction" is automatically parsed and the following information is extracted from it: CT -> imaging modality, brain -> anatomic location, infarction -> finding. This information can be used to combine the results of the textual and visual systems more effectively.
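The sketch below mimics this parsing step with a hand-built modality lexicon and stop-word list. It is a simplified Python illustration only: the real system uses MedPost, a Bayesian modality classifier and UMLS expansion in Ruby, and the lexicons shown here are invented for the example.

```python
# Illustrative query parser: stop-word removal plus lexicon lookup to tag the
# desired modality, anatomic location and finding. The small lexicons below are
# hand-built stand-ins for the MedPost/Bayesian/UMLS machinery described above.
STOP_WORDS = {"show", "me", "a", "an", "the", "with", "of", "images", "image",
              "finding", "showing", "including", "containing", "from"}
MODALITIES = {"ct": "CT", "mri": "MRI", "x-ray": "x-ray", "xray": "x-ray",
              "ultrasound": "ultrasound", "microscopic": "microscopy",
              "endoscopic": "endoscopy"}
ANATOMY = {"brain", "chest", "cerebellum", "liver", "knee"}

def parse_query(query):
    tokens = [t.strip(".,?").lower() for t in query.split()]
    parsed = {"modality": None, "anatomy": [], "finding": []}
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if tok in MODALITIES and parsed["modality"] is None:
            parsed["modality"] = MODALITIES[tok]
        elif tok in ANATOMY:
            parsed["anatomy"].append(tok)
        else:
            parsed["finding"].append(tok)
    return parsed

if __name__ == "__main__":
    print(parse_query("Show me CT images with a brain infarction"))
    # -> {'modality': 'CT', 'anatomy': ['brain'], 'finding': ['infarction']}
```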
c. Runs submitted

We submitted a total of 10 runs. These runs included textual and mixed, automatic and manual options. We also submitted runs using different weighted combinations of the FIRE baseline (published by the organizers) with our baseline textual runs. The run labels and short descriptions are given in Table 1.

d. Results and Discussion

Table 2 presents the mean average precision (MAP) and early precision of the OHSU runs as well as that of the best overall run. Our baseline textual system performed quite well, with a MAP of 0.3453. There was only a minimal improvement in performance with the use of the image modality. Our system is tuned to improve precision at the cost of recall, as we believe that most real users are interested in early precision. Consequently, our system demonstrated good early precision, with the highest P10 and high P30 results.

Figure 1: Screen shot of our system displaying user options

Table 1: OHSU run labels and descriptions

run#  Label                                          Description
1     OHSU-iclefmed2007_as_out_1000rev1_c.txt.eval   Baseline textual run using the custom query parser (automatic)
2     OHSU-ohsu_m1.eval                              Mixed run using modality to resort the results of the standard textual baseline, only 50 images returned per topic (automatic)
3     OHSU-OHSU_txt_exp2.eval                        Textual run using UMLS term expansion (automatic)
4     OHSU-oshu_c_e_f_q.eval                         Weighted combination of FIRE and OHSU textual runs (automatic)
5     OHSU-oshu_man2.eval                            Manual query modification, purely textual (manual)
6     ohsu_comb3_ef_wt1_rev1_c.txt.eval              Weighted combination of FIRE and OHSU textual runs (automatic)
7     ohsu_fire_ef_wt2_rev1_c.txt.eval               Weighted combination of FIRE and OHSU textual runs (automatic)
8     ohsu_m2_rev1_c.txt.eval                        Mixed run using modality to resort the results of the standard textual baseline (automatic)
9     ohsu_text_e4_out_rev1.txt.eval                 Textual run using modified query parser (automatic)

Table 2: Results of OHSU runs

Label                                          MAP     P10     P30
OHSU-iclefmed2007_as_out_1000rev1_c.txt.eval   0.3453  0.53    0.4433
OHSU-ohsu_m1.eval                              0.2117  0.52    0.4578
OHSU-OHSU_txt_exp2.eval                        0.3135  0.5867  0.4878
OHSU-oshu_c_e_f_q.eval                         0.1129  0.2     0.1544
OHSU-oshu_man2.eval                            0.3428  0.54    0.44
ohsu_comb3_ef_wt1_rev1_c.txt.eval              0.1134  0.3     0.1833
ohsu_fire_ef_wt2_rev1_c.txt.eval               0.0586  0.2     0.1211
ohsu_m2_rev1_c.txt.eval                        0.3461  0.5567  0.4622
ohsu_text_e4_out_rev1.txt.eval                 0.3321  0.5867  0.4878
LIG-MRIM-LIG_MU_A.eval (best overall run)      0.3962  0.5067  0.46

Table 3: Results of the best OHSU run by topic category

Run Type  MAP
Visual    0.23
Mixed     0.35
Textual   0.47

Although there was significant inter-category variation, our textual runs performed well on textual queries and more poorly on visual queries.

Future Work

We will continue to improve our image retrieval system by adding more image tags using automatic visual feature extraction. Our next goal is to annotate the images with their anatomical location and view attributes.

2. Automatic Image Annotation

The goal of this task was to correctly classify 1000 radiographic medical images using the hierarchical IRMA code. This code classifies an image along the modality, body orientation, body region, and biological system axes. There were 116 unique classes. The task organizers provided a set of 9,000 training images and 1000 development images. The goal of the task was to classify the images to the most precise level possible, with a greater penalty applied for an incorrect classification than for a less specific classification in the hierarchy.

a. Introduction

A supervised machine learning approach using global gist features and a neural network architecture was employed for the task of automatic annotation of medical images with the IRMA code.

b. System Description

The automatic image annotation was based on a neural network classifier using gist features [14]. The classifiers were created in MATLAB using the Netlab toolbox [15]. All images were convolved with a set of 32 multiscale oriented Gabor filters, and we created a 512-dimensional vector using statistics from these filters. Principal component analysis was then used to reduce the dimensionality of the vector to 100. A multilayer perceptron with one hidden layer containing 250-500 nodes was used to create and train a multi-class classifier. The training data set of 10,000 images was used to optimize performance on the development set of 1000 images. The final configuration of the classifier used 300 hidden nodes.
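A rough Python analogue of this pipeline is sketched below using a Gabor filter bank, PCA and a one-hidden-layer MLP. The filter-bank parameters, the grid pooling and the scikit-image/scikit-learn libraries are illustrative assumptions and differ from the MATLAB/Netlab implementation actually used for the submitted runs.

```python
# Rough analogue of the gist pipeline: Gabor filter responses pooled on a grid,
# reduced with PCA, then classified with a one-hidden-layer MLP. Filter-bank
# parameters and libraries are illustrative assumptions, not the MATLAB/Netlab
# configuration used for the submitted runs.
import numpy as np
from skimage.filters import gabor
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def gist_like_features(grey, frequencies=(0.1, 0.2, 0.3, 0.4),
                       n_orientations=8, grid=4):
    """grey: 2-D float array. Pools |Gabor response| over a grid x grid layout,
    giving 4 frequencies * 8 orientations * 16 cells = 512 values."""
    feats = []
    h, w = grey.shape
    for f in frequencies:
        for k in range(n_orientations):
            real, imag = gabor(grey, frequency=f, theta=np.pi * k / n_orientations)
            mag = np.hypot(real, imag)
            for i in range(grid):
                for j in range(grid):
                    cell = mag[i * h // grid:(i + 1) * h // grid,
                               j * w // grid:(j + 1) * w // grid]
                    feats.append(cell.mean())
    return np.asarray(feats)

def build_classifier():
    """PCA down to 100 dimensions followed by an MLP with 300 hidden nodes."""
    return make_pipeline(PCA(n_components=100),
                         MLPClassifier(hidden_layer_sizes=(300,), max_iter=500))

# Usage sketch (train_images: list of 2-D arrays, train_labels: class ids):
# X = np.vstack([gist_like_features(im) for im in train_images])
# clf = build_classifier().fit(X, train_labels)
# preds = clf.predict(np.vstack([gist_like_features(im) for im in test_images]))
```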
Images originally classified as classes 108 and 111 in the old code were commonly misclassified by the neural network classifier described above. To handle this special case, we created a second layer of classification built around a support vector machine (SVM) using scale-invariant feature transform (SIFT) features [16] as inputs. This new binary classifier was used to determine the final class assignments for images in classes 108 and 111.

c. Runs submitted

OHSU submitted two runs for the automatic annotation task. The first run used gist feature vectors to train the multi-layer perceptron. A neural network was used to create a multi-class classifier over 116 classes; these were the original classes from 2006 and did not use the hierarchical nature of the IRMA code. These classes were then converted to the IRMA code, as required for the 2007 submission. The second run used a hierarchical classifier architecture, with the first layer as described above and the second classifier using SIFT features and an SVM.

d. Results and Analysis

The relationship between the semantic and visual hierarchies remains an open area of research. Based on our experiments with this collection of images, the use of the hierarchy of the semantic classes did not improve our automatic annotations. The error counts for our two runs were quite similar, at 67.8 and 67.97 for 1000 images, compared to the best count of 26.84 and the worst count of 505.61. There was only a very slight improvement from using the two-layer classifier. There were 227 errors using the 2006 classes, which corresponds to a classification accuracy of 77.3%. However, of these 227 errors, only 15 were wrong along all 4 axes; 76 were misclassified along two axes (primarily view and anatomy), while 12 were misclassified along 3 axes. 77 of our single-axis misclassifications were along the view axis. A significant portion of these occurred where class 111 was misclassified as 108, an error due to confusion between posterior-anterior and anterior-posterior views of the chest.
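A per-axis breakdown like the one above can be computed by comparing predicted and true IRMA codes axis by axis. The sketch below assumes the codes are given as four hyphen-separated axis fields; that surface format, and the sample codes, are assumptions for illustration rather than part of our method.

```python
# Tally how many axes of the 4-axis IRMA code (modality, orientation, region,
# biological system) differ between predicted and true labels. Assumes codes
# are strings with the four axes separated by hyphens, e.g. "1121-127-700-500";
# that format is an illustrative assumption.
from collections import Counter

AXES = ("modality", "orientation", "region", "biosystem")

def axis_errors(true_code, predicted_code):
    """Return the names of the axes on which the two codes disagree."""
    t_parts, p_parts = true_code.split("-"), predicted_code.split("-")
    return [name for name, t, p in zip(AXES, t_parts, p_parts) if t != p]

def error_breakdown(pairs):
    """pairs: iterable of (true_code, predicted_code). Counts errors by the
    number of wrong axes and by which axis was wrong."""
    by_count, by_axis = Counter(), Counter()
    for true_code, predicted_code in pairs:
        wrong = axis_errors(true_code, predicted_code)
        if wrong:
            by_count[len(wrong)] += 1
            by_axis.update(wrong)
    return by_count, by_axis

if __name__ == "__main__":
    sample = [("1121-127-700-500", "1121-120-700-500"),   # one axis wrong
              ("1121-127-700-500", "1121-127-700-500")]   # fully correct
    print(error_breakdown(sample))
```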
e. Future Work

We would like to further investigate the mapping between the semantic and visual hierarchies of images in the IRMA collection. We also plan to further explore specifying the image classification in the semantic hierarchy only to the level at which the classifier is sufficiently confident.

Acknowledgments

We acknowledge the support of NLM Training Grant 1T15 LM009461 and NSF Grant ITR-0325160. We would also like to thank Steven Bedrick, DMICE, OHSU for his help in creating the web-based image retrieval system.

References

1. Hersh W, Müller H, et al. Advancing biomedical image retrieval: development and analysis of a test collection. J. Am. Med. Inform. Assoc. 13(5):488-496, 2006.
2. Smeulders AWM, Worring M, et al. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12):1349-1380, 2000.
3. Tagare HD, Jaffe C, et al. Medical image databases: a content-based retrieval approach. J. Am. Med. Inform. Assoc. 4(3):184-198, 1997.
4. Aisen AM, Broderick LS, et al. Automated storage and retrieval of thin-section CT images to assist diagnosis: system description and preliminary assessment. Radiology 228:265-270, 2003.
5. Schmid-Saugeon P, Guillod J, et al. Towards a computer-aided diagnosis system for pigmented skin lesions. Computerized Medical Imaging and Graphics 27:65-78, 2003.
6. Müller H, Michoux N, Bandon D, Geissbuhler A. A review of content-based image retrieval systems in medicine - clinical benefits and future directions. Int. J. Med. Inform. 73:1-23, 2004.
7. Hersh W, Kalpathy-Cramer J, et al. Medical image retrieval and automated annotation: OHSU at ImageCLEF 2006. Springer Lecture Notes in Computer Science (LNCS).
8. Kalpathy-Cramer J, Hersh W. Automatic image modality based classification and annotation to improve medical image retrieval. Accepted to MedInfo 2007, Brisbane, Australia, 2007.
9. Braschler M, Peters C. Cross-language evaluation forum: objectives, results, achievements. Information Retrieval 7:7-31, 2004.
10. Müller H, Deselaers T, Lehmann T, Clough P, Hersh W. Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks. Evaluation of Multilingual and Multi-modal Information Retrieval, Seventh Workshop of the Cross-Language Evaluation Forum (CLEF 2006), LNCS, Alicante, Spain, to appear.
11. Müller H, Clough P, et al. Evaluation axes for medical image retrieval systems - the ImageCLEF experience. ACM Int. Conf. on Multimedia, Singapore, November 2005.
12. Florea F, Müller H, Rogozan A, Geissbühler A, Darmoni S. Medical image categorization with MedIC and MedGIFT. Medical Informatics Europe (MIE 2006).
13. Smith L, Rindflesch T, Wilbur W. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics 20(14), 2004.
14. Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Computer Vision 42(3):145-175, 2001.
15. Nabney IT. Netlab: Algorithms for Pattern Recognition. Springer-Verlag, London, 2004.
16. Lowe DG. Distinctive image features from scale-invariant keypoints. Int. J. of Computer Vision 60(2):91-110, 2004.