=Paper= {{Paper |id=Vol-2882/MediaEval_20_paper_21 |storemode=property |title=A Temporal-Spatial Attention Model for Medical Image Detection |pdfUrl=https://ceur-ws.org/Vol-2882/paper21.pdf |volume=Vol-2882 |authors=Hwang Maxwell,Wu Cai,Hwang Kao-Shing,Xu Yong Si,Wu Chien-Hsing |dblpUrl=https://dblp.org/rec/conf/mediaeval/HwangWHXW20 }} ==A Temporal-Spatial Attention Model for Medical Image Detection== https://ceur-ws.org/Vol-2882/paper21.pdf
A Temporal-Spatial Attention Model for Medical Image Detection

Maxwell Hwang1, Cai-Wu2, Kao-Shing Hwang3, Yong Si Xu3, Chien-Hsing Wu3

1 Department of Colorectal Surgery, the Second Affiliated Hospital of Zhejiang University School of Medicine, Zhejiang, China
2 Department of Hematology, the Fourth Affiliated Hospital of Zhejiang University School of Medicine, Zhejiang, China
3 Department of Electrical Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan

himax26@zju.edu.cn, 8013016@zju.edu.cn, himac26@zju.edu.cn, hwang@g-mail.nsysu.edu.tw

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, December 14-15 2020, Online

ABSTRACT
A local region model with attentive temporal-spatial pathways is proposed for automatically learning various target structures. The attentive spatial pathway highlights the salient region to generate bounding boxes and ignores irrelevant regions in an input image. The proposed attention mechanism allows efficient object localization, and the overall predictive performance is increased because there are fewer false positives for the object detection task for medical images with manual annotations. The experimental results show that the proposed models consistently increase the base architecture’s predictive performance on the Medico dataset with satisfactory computational efficiency.

1 INTRODUCTION
This study proposes a simple and effective solution that integrates an attention mechanism into a standard CNN model. The feature maps are utilized more efficiently, and localization does not require processing the entire image. The proposed attentive model, which consists of temporal-spatial pathways, automatically learns to focus on target structures without additional supervision. The spatial pathway generates local region proposals on-the-fly using the salient features for a specific task. The temporal attention model proposes a sequence of locations for the local region search and not the entire image, so the computational overhead is significantly reduced, and many model parameters are omitted, similarly to multi-model frameworks. CNN models that use the proposed attentive model can be trained from scratch using standard methods or transfer learning. Similar attention mechanisms have been proposed for natural image classification and captioning [2, 4] for adaptive feature pooling, where model predictions are conditioned only using a subset of selected image regions. The proposed process assigns attention coefficients to specific local regions.

This study uses a novel hybrid attention model (HAM) as an interface between any feature extractor, such as a CNN, and a decision-making module for end-to-end tasks, such as reinforcement learning (RL), classification, and regression. The proposed module determines spatial pinpoints in feature space using a hard attention pathway. The model also synthesizes the context vector using a soft attention mechanism and a GRU for decision-making downstream. Real images are used to determine the efficacy of the proposed model and are used as a pre-training data set for detection and classification for colonoscopic images [6], which are the subject of this work. The contributions of this work are summarized as follows:

A hybrid attention approach allows an attention mechanism specific to local regions and the subsequent strategy or decision-making process. This improved model performs better than state-of-the-art methods that use global or local search schemes.

An attention interface is used for region proposals and sequential search of glimpses on local regions simultaneously for medical images. The proposed attention interface, which can be trained from end to end, replaces the hard-attention approaches currently used only for image classification. It eliminates the need for the global generation of bounding boxes for a Faster R-CNN [7] and provides better accuracy and greater computational efficiency than a local search scheme method. The study demonstrates that the proposed attention mechanism produces fine-scale attention maps that can be visualized with minimal computational overhead.

A masking scheme is applied to the distribution of attention scores to increase computational efficiency, instead of being imposed directly on the feature map and influencing downstream operations. It ensures better classification performance than the baseline approach. It is shown that attention maps and an observation pinpoint allow fewer glimpses and fewer useful observations. A modification to the standard FPN is used for feature extraction, so the process is sensitive and specific.

2 APPROACH

2.1 Method
The process for the proposed local search method for polyp detection involves two stages [1]. During the first stage, the local region proposal network (RPN) proposes candidate regions of interest (ROIs) from glimpsed regions located in sequence by the HAM. The weighted feature’s attention scores are used to determine a glimpsed region in which target objects may reside. Bounding boxes are generated, and the process then involves classification and position regression for preliminary screening. The confidence index for the classification is used to determine bounding boxes with higher values. Local non-maximum suppression is used to filter out some bounding boxes, and the remaining boxes are used as ROIs that are inputs for the second-stage network, which involves bounding box regression and classification. When the ROIs are generated and accumulated in all the sequences for classification and bounding box regression, an exhaustive search is initiated. This process involves considerable computing resources, so a method that applies a hybrid attention mechanism with RL to the RPN reduces the amount of calculation.

Instead of an exhaustive search over the entire image, the proposed method uses a Faster R-CNN for a sequential search directed by a hybrid attention module (HAM) to determine glimpse regions that are likely to contain an object. ROIs are generated in a restricted area, where target objects are likely to be located. This local search reduces the amount of calculation for insignificant ROIs. The proposed model has four modules: a CNN-based feature extractor, the proposed HAM, a local RPN, and a detector for bounding box regression and object classification. Glimpse regions are pinpointed, and the length of the sequence of glimpses is determined sequentially. The local RPN generates bounding boxes of different sizes and aspect ratios within a glimpsed region. The detector regresses bounding boxes and classifies objects. The architecture of the HAM is shown in Figure 1.

Figure 2: Comparisons between different configurations for the proposed model and peer methods.
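
The local search step of Section 2.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`iou`, `nms`, `local_proposals`), the center-inside-glimpse test, the IoU threshold of 0.5, and the `top_k` cutoff are all assumptions made for illustration. Candidate boxes are kept only if their centers fall inside the current glimpse region, and greedy non-maximum suppression is then applied to the survivors.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[overlaps <= iou_thresh]
    return keep

def local_proposals(boxes, scores, glimpse, iou_thresh=0.5, top_k=10):
    """Hypothetical helper: keep boxes centered inside `glimpse`, then run NMS."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    inside = ((cx >= glimpse[0]) & (cx <= glimpse[2]) &
              (cy >= glimpse[1]) & (cy <= glimpse[3]))
    idx = np.flatnonzero(inside)
    if idx.size == 0:
        return []
    kept = nms(boxes[idx], scores[idx], iou_thresh)
    # Map positions in the filtered subset back to original box indices.
    return [int(idx[k]) for k in kept[:top_k]]

# Toy data (made up): two overlapping boxes, one separate box, one far away.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52],
                  [60, 60, 90, 90], [200, 200, 240, 240]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.95])
glimpse = np.array([0, 0, 100, 100], dtype=float)
rois = local_proposals(boxes, scores, glimpse)
```

With this toy input, the highest-scoring box lies outside the glimpse and is never evaluated, which is the source of the savings that a local search aims for; the overlapping second box is suppressed by the first.
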

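The idea of masking the distribution of attention scores, rather than the feature map itself, can be sketched as follows. This is an assumed formulation, not the paper's: excluded locations have their scores set to negative infinity before a softmax, the soft context vector is the attention-weighted sum of the features, and the hard "pinpoint" is taken to be the location with the largest weight. The names `masked_soft_attention`, `features`, `scores`, and `mask` are hypothetical, and the GRU that consumes the context vector is omitted.

```python
import numpy as np

def masked_soft_attention(features, scores, mask):
    """
    features: (N, D) feature vectors for N spatial locations
    scores:   (N,) unnormalized attention scores
    mask:     (N,) boolean, True where a location may be attended
    Returns (weights, context, pinpoint).
    """
    if not mask.any():
        raise ValueError("mask excludes every location")
    # Mask the scores, not the features: excluded locations get -inf.
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked[mask].max()   # shift for numerical stability
    exp = np.exp(masked)                   # exp(-inf) evaluates to 0
    weights = exp / exp.sum()
    context = weights @ features           # soft attention: weighted sum
    pinpoint = int(np.argmax(weights))     # assumed hard-attention choice
    return weights, context, pinpoint

# Toy data (made up): four locations, the last one is masked out.
features = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
scores = np.array([2.0, 1.0, 0.0, 5.0])
mask = np.array([True, True, True, False])
weights, context, pinpoint = masked_soft_attention(features, scores, mask)
```

Because the mask acts on the score distribution, the feature map passed to downstream layers is left untouched, and the masked location receives exactly zero weight even though it had the highest raw score.
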
Figure 1: The architecture of the local region proposal method.

2.2 Preparation and Data set
The experiments were executed using the Ubuntu 18.04 operating system, Python 3.7, and TensorFlow. The data sets for the experiments are provided by the Medico Challenge [5]. A public data set of real scenes (PASCAL VOC [3]) is used to pre-train the Faster R-CNN framework. The data set contains only images, so data augmentation operations, such as rotation, reflection, and resizing, are used to increase the number of images. Five-fold cross-validation is used for the experiments.

3 RESULTS OF COMPARISONS WITH PEER METHODS
The results for the colonoscopy dataset in Figure 2 show that HAM-beta and HAM-beta-mask are similar to drl-RPN in terms of AP50. They use fewer glimpses on average and cover a smaller average glimpsed area than drl-RPN, and their AP density and glimpse contribution are better than those of peer methods.

The drl-RPN must search three times for important areas before terminating the glimpsing process, which requires more computation time. HAM-beta and HAM-beta-mask accurately locate the correct area in the first search.

4 CONCLUSION AND FUTURE WORK
This study proposes an innovative attention module that uses soft and hard attention. This module can interface with any architecture that involves simultaneous spatial and temporal tasks, such as polyp detection. A global search scans the entire image in an object detection task, but it requires considerable time and resources. The proposed approach obviates the need to use an extra model by learning to highlight salient local regions in images. The proposed temporal-spatial attention module leverages the salient information in the state space for a policy learner, such as reinforcement learning, in addition to object detection in image tasks.

ACKNOWLEDGMENTS
This work is supported by a grant from the Key Project of the Yiwu Science and Technology Plan, China (No. 20-3-067).

REFERENCES
[1] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W Gao, Stefano Realdon, Maxim Loshchenov, Julia A Schnabel, James E East, Georges Wagnieres, Victor B Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. 2020. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports 10, 1 (2020), 2748. https://doi.org/10.1038/s41598-020-59413-5
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6077–6086.
[3] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
[4] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr. 2018. Learn To Pay Attention. CoRR abs/1804.02391 (2018).
[5] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard Johansen, Dag Johansen, Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of the MediaEval 2020 Workshop.
[6] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. 2020. Kvasir-seg: A segmented polyp dataset. In International Conference on Multimedia Modeling. Springer, 451–462.
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc.