<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Temporal-Spatial Attention Model for Medical Image Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Maxwell</forename><surname>Hwang</surname></persName>
							<email>hwang@g-mail.nsysu.edu.tw</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Colorectal Surgery Second Affiliated Hospital</orgName>
								<orgName type="institution">Zhejiang University School of Medicine</orgName>
								<address>
									<settlement>Zhejiang</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><surname>Cai-Wu</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Hematology Fourth Affiliated Hospital</orgName>
								<orgName type="institution">Zhejiang University School of Medicine</orgName>
								<address>
									<settlement>Zhejiang</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">National Sun Yat-sen University</orgName>
								<address>
									<postCode>80424</postCode>
									<settlement>Kaohsiung</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kao-Shing</forename><surname>Hwang</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">National Sun Yat-sen University</orgName>
								<address>
									<postCode>80424</postCode>
									<settlement>Kaohsiung</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yong</forename><forename type="middle">Si</forename><surname>Xu</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">National Sun Yat-sen University</orgName>
								<address>
									<postCode>80424</postCode>
									<settlement>Kaohsiung</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chien-Hsing</forename><surname>Wu</surname></persName>
						</author>
						<title level="a" type="main">A Temporal-Spatial Attention Model for Medical Image Detection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">150357C262ADDDEC289E27CD9C2EE3C5</idno>
					<idno type="DOI">10.1038/s41598-020-59413-5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A local region model with attentive temporal-spatial pathways is proposed for automatically learning various target structures. The attentive spatial pathway highlights the salient region to generate bounding boxes and ignores irrelevant regions in an input image. The proposed attention mechanism allows efficient object localization, and the overall predictive performance is increased because fewer false positives are produced in the object detection task for manually annotated medical images. The experimental results show that the proposed models consistently increase the base architecture's predictive performance on the Medico dataset with satisfactory computational efficiency.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>This study proposes a simple and effective solution that interfaces an attention mechanism with a standard CNN model. The feature maps are utilized more efficiently, and localization does not require processing the entire image. The proposed attentive model, which consists of temporal-spatial pathways, automatically learns to focus on target structures without additional supervision. The spatial pathway generates local region proposals on-the-fly using the salient features for a specific task. The temporal attention model proposes a sequence of locations for the local region search rather than the entire image, so the computational overhead is significantly reduced, and many model parameters are omitted, compared to multi-model frameworks. CNN models that use the proposed attentive model can be trained from scratch using standard methods or transfer learning. Similar attention mechanisms have been proposed for natural image classification and captioning <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4]</ref> for adaptive feature pooling, where model predictions are conditioned on only a subset of selected image regions. The proposed process assigns attention coefficients to specific local regions.</p><p>This study uses a novel hybrid attention model (HAM) as an interface between any feature extractor, such as a CNN, and a decision-making module for end-to-end tasks, such as RL, classification, or regression. The proposed module determines spatial pinpoints in feature space using a hard attention pathway. The model also synthesizes the context vector using a soft attention mechanism and a GRU for decision-making downstream. Real images are used to determine the efficacy of the proposed model and serve as a pre-training data set for the detection and classification of colonoscopic images <ref type="bibr" target="#b5">[6]</ref>, which are the focus of this work. 
The contributions of this work are summarized as follows:</p><p>A hybrid attention approach allows an attention mechanism specific to local regions and the subsequent strategy or decision-making process. This improved model performs better than state-of-the-art methods that use global or local search schemes.</p><p>An attention interface is used for region proposals and a sequential search of glimpses on local regions simultaneously for medical images. The proposed attention interface, which can be trained end to end, replaces the hard-attention approaches currently used only for image classification. It eliminates the need for the global generation of bounding boxes in a Faster R-CNN <ref type="bibr" target="#b6">[7]</ref> and provides better accuracy and greater computational efficiency than a local search scheme method. The study demonstrates that the proposed attention mechanism produces fine-scale attention maps that can be visualized with minimal computational overhead.</p><p>A masking scheme is applied to the distribution of attention scores to increase computational efficiency, instead of being imposed directly on the feature map and influencing downstream operations. It ensures better classification performance than the baseline approach. It is shown that attention maps and an observation pinpoint allow fewer glimpses and more useful observations. A modification to the standard FPN is used for feature extraction, so the process is both sensitive and specific.</p></div>
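The paper describes the soft-attention pathway as synthesizing a context vector from weighted features, with the masking scheme applied to the distribution of attention scores rather than to the feature map itself. The authors' implementation is not given; the following is a minimal NumPy sketch of that idea (function and parameter names are hypothetical):

```python
import numpy as np

def soft_attention_context(features, query, keep_top_k=None):
    """Soft attention over a set of local feature vectors.

    features:   (N, D) array of local feature vectors.
    query:      (D,) query vector.
    keep_top_k: if set, mask all but the top-k attention scores
                before normalization (the masking scheme), so
                low-scoring regions contribute nothing downstream.
    Returns the (D,) context vector and the (N,) attention weights.
    """
    scores = features @ query                       # unnormalized attention scores
    if keep_top_k is not None:
        # Masking scheme on the score distribution, not the feature map.
        threshold = np.sort(scores)[-keep_top_k]
        scores = np.where(scores >= threshold, scores, -np.inf)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax normalization
    context = weights @ features                    # weighted sum = context vector
    return context, weights
```

In the model described above, the resulting context vector would then feed a GRU for downstream decision-making.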
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH 2.1 Method</head><p>The process for the proposed local search method for polyp detection involves two stages <ref type="bibr" target="#b0">[1]</ref>. During the first stage, the local region proposal network (RPN) proposes candidate ROIs from glimpsed regions located in sequence by the HAM. The weighted features' attention scores are used to determine a glimpsed region in which target objects may reside. Bounding boxes are generated, and the process then involves classification and position regression for preliminary screening. The confidence index for the classification is used to determine bounding boxes with higher values. Local non-maximum suppression is used to filter out some bounding boxes as regions of interest (ROIs), and these are used as inputs for the second-stage network, which involves bounding box regression and classification. When the ROIs are generated and accumulated over all the sequences for classification and bounding box regression, an exhaustive search is initiated. This process consumes considerable computing resources, so a method that applies a hybrid attention mechanism with RL to the RPN reduces the computation.</p><p>Instead of an exhaustive search over the entire image, the proposed method uses a Faster R-CNN for a sequential search directed by a hybrid attention module (HAM) to determine glimpse regions that are likely to contain an object. ROIs are generated in a restricted area, where target objects are likely to be located. This local search reduces the amount of calculation for insignificant ROIs. The proposed model has four modules: a CNN-based feature extractor, the proposed HAM, a local RPN, and a detector for bounding box regression and object classification. Glimpse regions are pinpointed, and the length of the sequence of glimpses is determined sequentially. 
The local RPN generates bounding boxes of different sizes and aspect ratios within a glimpsed region. The detector regresses bounding boxes and classifies objects. The architecture of the HAM is shown in Figure <ref type="figure" target="#fig_0">1</ref>. </p></div>
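The local non-maximum suppression step used to filter candidate bounding boxes into ROIs is a standard greedy procedure; a minimal sketch (not the authors' code, and operating on plain `(x1, y1, x2, y2)` boxes) looks like:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence remaining box
        keep.append(best)
        # Drop boxes that overlap the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

In the local search described above, this suppression is applied only within each glimpsed region, which is what keeps the per-step cost low.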
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Preparation and Data set</head><p>The experiments were executed using the Ubuntu 18.04 operating system, Python 3.7, and TensorFlow. The data sets for the experiments are provided by the Medico Challenge <ref type="bibr" target="#b4">[5]</ref>. A public data set of real scenes (PASCAL VOC <ref type="bibr" target="#b2">[3]</ref>) is used to pre-train the Faster R-CNN framework. The data set contains only images, so data augmentation operations, such as rotation, reflection, and resizing, are used to increase the number of images. Five-fold cross-validation is used for the experiments.</p></div>
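The paper does not detail the augmentation or cross-validation code; a minimal sketch of rotation/reflection augmentation and a five-fold split (function names hypothetical, images as NumPy arrays) could be:

```python
import random
import numpy as np

def augment(image):
    """Rotation and reflection augmentations for one image array."""
    variants = []
    for k in range(4):                  # 0/90/180/270-degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # horizontal reflection
    return variants                     # 8 variants per input image

def five_fold_splits(items, seed=0):
    """Yield (train, val) partitions for five-fold cross-validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for k in range(5):
        val = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, val
```

Resizing is omitted from the sketch since it depends on the target input resolution of the backbone.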
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS OF COMPARISONS WITH PEER METHODS</head><p>The results for the colonoscopy dataset in Figure <ref type="figure" target="#fig_1">2</ref> show that the HAM-beta and HAM-beta-mask are similar to drl-RPN in terms of AP50. There are fewer average glimpses and a smaller average glimpsed area than for the drl-RPN, and the AP density and glimpse contribution are better than those of peer methods. The drl-RPN must search three times for important areas before terminating the glimpsing process, requiring more computation time. The HAM-beta and HAM-beta-mask accurately locate the correct region in the first search. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSION AND FUTURE WORK</head><p>This study proposes an innovative attention module that uses soft and hard attention. This module can interface with any architecture that involves simultaneous spatial and temporal tasks, such as polyp detection. A global search scans the entire image in an object detection task, but it requires much time and many resources. The proposed approach obviates the need for an extra model by learning to highlight salient local regions in images. The proposed temporal-spatial attention module leverages the salient information in the state space for a policy learner, such as reinforcement learning, in addition to object detection in image tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The architecture of the local region proposal method.</figDesc><graphic coords="2,53.80,286.43,253.44,134.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Comparisons between different configurations for the proposed model and peer methods.</figDesc><graphic coords="2,327.79,83.69,220.58,146.29" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work is supported by a grant from the Key Project of the Yiwu Science and Technology Plan, China (No. 20-3-067).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy</title>
		<author>
			<persName><forename type="first">Sharib</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felix</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barbara</forename><surname>Braden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Suhui</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guanju</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pengyi</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaoqiong</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maxime</forename><surname>Kayser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roger</forename><forename type="middle">D</forename><surname>Soberanis-Mukul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shadi</forename><surname>Albarqouni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaokang</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chunqing</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Seiryo</forename><surname>Watanabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilkay</forename><surname>Oksuz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qingtian</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shufan</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohammad</forename><surname>Azam Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaohong</forename><forename type="middle">W</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Realdon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maxim</forename><surname>Loshchenov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julia</forename><forename type="middle">A</forename><surname>Schnabel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><forename type="middle">E</forename><surname>East</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georges</forename><surname>Wagnieres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Enrico</forename><surname>Victor B Loschenov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Grisan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Walter</forename><surname>Daul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jens</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><surname>Rittscher</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41598-020-59413-5</idno>
		<ptr target="https://doi.org/10.1038/s41598-020-59413-5" />
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">2748</biblScope>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bottom-up and topdown attention for image captioning and visual question answering</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaodong</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chris</forename><surname>Buehler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Damien</forename><surname>Teney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Gould</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lei</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="6077" to="6086" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The pascal visual object classes (voc) challenge</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Everingham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luc</forename><surname>Van Gool</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">I</forename><surname>Christopher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Winn</surname></persName>
		</author>
		<author>
			<persName><surname>Zisserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of computer vision</title>
		<imprint>
			<biblScope unit="volume">88</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="303" to="338" />
			<date type="published" when="2010">2010. 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Learn To Pay Attention</title>
		<author>
			<persName><forename type="first">Saumya</forename><surname>Jetley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nicholas</forename><forename type="middle">A</forename><surname>Lord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Namhoon</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">H S</forename><surname>Torr</surname></persName>
		</author>
		<idno>CoRR abs/1804.02391</idno>
		<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation</title>
		<author>
			<persName><forename type="first">Debesh</forename><surname>Jha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><forename type="middle">A</forename><surname>Hicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Krister</forename><surname>Emanuelsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Håvard</forename><surname>Johansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dag</forename><surname>Johansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>De Lange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pål</forename><surname>Halvorsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the MediaEval 2020 Workshop</title>
				<meeting>of the MediaEval 2020 Workshop</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Kvasirseg: A segmented polyp dataset</title>
		<author>
			<persName><forename type="first">Debesh</forename><surname>Jha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pia</forename><forename type="middle">H</forename><surname>Smedsrud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pål</forename><surname>Halvorsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dag</forename><surname>Thomas De Lange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Håvard D</forename><surname>Johansen</surname></persName>
		</author>
		<author>
			<persName><surname>Johansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Multimedia Modeling</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="451" to="462" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</title>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>Shaoqing Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ross</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Lawrence</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sugiyama</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">28</biblScope>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
