<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SuperCaptioning: Image Captioning Using Two-dimensional Word Embedding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Baohua Sun</string-name>
          <aff>Gyrfalcon Technology Inc., Milpitas, CA</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Yang</string-name>
          <aff>Gyrfalcon Technology Inc., Milpitas, CA</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Dong</string-name>
          <aff>Gyrfalcon Technology Inc., Milpitas, CA</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenhan Zhang</string-name>
          <aff>Gyrfalcon Technology Inc., Milpitas, CA</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Dong</string-name>
          <aff>Gyrfalcon Technology Inc., Milpitas, CA</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Language and vision are processed as two different modalities in current work on image captioning. However, recent work on the Super Characters method shows the effectiveness of two-dimensional word embedding, which converts a text classification problem into an image classification problem. In this paper, we propose the SuperCaptioning method, which borrows the idea of two-dimensional word embedding from the Super Characters method and processes the information of language and vision together in one single CNN model. The experimental results on Flickr30k data show that the proposed method gives high-quality image captions. An interactive demo is ready to show at the workshop.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Image captioning outputs a sentence related to the input image. Current methods process the image and text
separately [
        <xref ref-type="bibr" rid="ref1 ref13 ref14 ref15 ref2 ref4 ref5 ref6">4, 13, 15, 14, 5, 6, 1, 2</xref>
        ]. Generally, the image is processed by a CNN model to extract the image
feature, and the raw text passes through an embedding layer that converts each word into a one-dimensional
word-embedding vector, e.g. a 300x1 vector. The extracted image feature and the word-embedding vectors
are then fed into another network, such as an RNN, LSTM, or GRU model, to predict the next word in the image
caption sequentially.
      </p>
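      <p>For concreteness, the following is a minimal sketch of this conventional two-branch pipeline (the baseline approach, not the proposed method), assuming PyTorch and torchvision; the class name, feature dimensions, and decoder choice are illustrative only.</p>
      <preformat>
# Minimal sketch of the conventional pipeline: a CNN image feature plus
# one-dimensional word embeddings, decoded by an LSTM (illustrative names only).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=None)
        cnn.fc = nn.Identity()                    # keep the 2048-d pooled image feature
        self.encoder = cnn
        self.img_proj = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)   # 300x1 word-embedding vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, caption_tokens):
        img_feat = self.img_proj(self.encoder(image)).unsqueeze(1)   # (B, 1, E)
        words = self.embed(caption_tokens)                           # (B, T, E)
        seq = torch.cat([img_feat, words], dim=1)                    # image feature first
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                                      # next-word logits
</preformat>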
      <p>
        Super Characters method [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is originally designed for text classification tasks. It has achieved state-of-the-art
results on benchmark datasets for multiple languages, including English, Chinese, Japanese, and Korean. It is a
two-step method. In the first step, the text characters are printed on a blank image, and the generated image
is called the Super Characters image. In the second step, the Super Characters image is fed into a CNN model
for classification. The CNN model is fine-tuned from a pre-trained ImageNet model. The extensions of the Super
Characters method [
        <xref ref-type="bibr" rid="ref11 ref12 ref8">8, 12, 11</xref>
        ] also prove the effectiveness of two-dimensional embedding on different tasks.
      </p>
      <p>Figure 1: Example captions generated by the proposed SuperCaptioning method: (a) “Four men in life jackets are riding in a bright orange boat”; (b) “A woman in a black coat walks down the sidewalk holding a red umbrella”; (c) “A man in a boat on a lake with mountains in the background”; (d) “Four performers are performing with their arms outstretched in a ballet”.</p>
      <p>In this paper, we address the image captioning problem by employing the two-dimensional word embedding
from the Super Characters method, and the resulting method is named the SuperCaptioning method. In this
method, the input image and the raw text are combined together through two-dimensional embedding and
then fed into a CNN model to sequentially predict the words in the image caption. The experimental results
on Flickr30k show that the proposed method gives high-quality image captions. Some examples given by the
SuperCaptioning method are shown in Figure 1.</p>
    </sec>
    <sec id="sec-2">
      <title>The Proposed SuperCaptioning Method</title>
      <p>The SuperCaptioning method is motivated by the success of the Super Characters method on text classification
tasks. The Super Characters method converts text into images, so it is natural to combine the input image
and the image of the text together and feed them into one single CNN model to predict the next word in the image
caption sentence.</p>
      <p>Figure 2 illustrates the proposed SuperCaptioning method. The caption is generated sequentially by predicting
the next word over multiple iterations. At the beginning of the caption prediction, the partial caption is initialized
as null, and the input image is resized to occupy a designated portion (e.g. the top) of a larger blank image, as shown
in Figure 2. The text of the current partial caption is then drawn into the other portion (e.g. the bottom) of
the larger image. The resulting image is called the SuperCaptioning image, which is then fed into a CNN
model to classify the next word in the caption. The CNN model is fine-tuned from the ImageNet pre-trained
model. The iteration continues until the next word is EOS (End Of Sentence) or the word count reaches the
cut-length. Cut-length is defined as the maximum number of words in the caption.</p>
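      <p>A minimal sketch of this prediction loop is given below, assuming a PyTorch classifier fine-tuned as described in this section; the function and variable names (generate_caption, preprocess, vocab, compose) are illustrative, and the image-composition routine passed in as compose is sketched after the configuration paragraph below.</p>
      <preformat>
# Illustrative sketch of the iterative caption prediction loop (assumed names).
import torch

@torch.no_grad()
def generate_caption(model, preprocess, vocab, compose, input_image, cut_length=14):
    """model: fine-tuned CNN classifier; preprocess: its input transform;
    vocab: list of caption words plus "EOS"; compose: SuperCaptioning image builder."""
    words = []
    for _ in range(cut_length):                   # at most cut-length words (14 in the paper)
        sc_image = compose(input_image, " ".join(words))
        logits = model(preprocess(sc_image).unsqueeze(0))   # (1, vocab_size)
        next_word = vocab[logits.argmax(dim=1).item()]
        if next_word == "EOS":                    # End Of Sentence class stops the loop
            break
        words.append(next_word)
    return " ".join(words)
</preformat>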
      <p>The Squared English Word (SEW) method is used to represent each English word in a squared space. For example,
the word “child” occupies the same amount of space as the word “A”, but each of its letters only occupies
{1/ceil[sqrt(N)]}^2 of the word space, where N is five for “child”, which has five letters, sqrt(.) stands for the
square root, and ceil[.] rounds up to the nearest integer.</p>
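      <p>A minimal sketch of this squared layout follows, assuming each word is rendered with PIL into its own square cell; the function name, cell size, and font are illustrative assumptions, not details taken from the SEW reference.</p>
      <preformat>
# Illustrative SEW layout: a word of N letters is drawn on an s x s grid with
# s = ceil(sqrt(N)), so each letter covers {1/ceil[sqrt(N)]}^2 of the word space.
import math
from PIL import Image, ImageDraw, ImageFont

def draw_sew_word(word, word_space=64):
    """Render `word` into a square cell of side `word_space` pixels (size is illustrative)."""
    s = math.ceil(math.sqrt(len(word)))   # grid side, e.g. 3 for "child" (N = 5)
    cell = word_space // s                # each letter occupies a (1/s)^2 share of the cell
    img = Image.new("L", (word_space, word_space), color=255)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for i, letter in enumerate(word):
        row, col = divmod(i, s)
        draw.text((col * cell, row * cell), letter, fill=0, font=font)
    return img
</preformat>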
      <p>The data used for training is from Flickr30k (http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k.html). Each image in Flickr30k has 5 captions written by different people,
and we only keep the longest caption, provided it is less than 14 words, as the ground-truth caption for the training data.
After this filtering, 31,333 of the total 31,783 images are left.</p>
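      <p>A minimal sketch of this filtering step, assuming the Flickr30k captions are available as a mapping from image identifier to its five caption strings; the function and variable names are illustrative.</p>
      <preformat>
# Illustrative caption selection: per image, keep the longest of the five
# captions, and drop the image if that caption is not shorter than 14 words.
def select_ground_truth(captions_per_image, max_words=14):
    """captions_per_image: dict mapping image id to a list of 5 caption strings."""
    selected = {}
    for image_id, captions in captions_per_image.items():
        longest = max(captions, key=lambda c: len(c.split()))
        if len(longest.split()) >= max_words:
            continue                      # longest caption is too long: image is dropped
        selected[image_id] = longest
    return selected                       # 31,333 of the 31,783 images survive this filter
</preformat>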
      <p>After comparing the accuracy of experimental results under different configurations of the font size,
cut-length, and resizing of the input image, we finally set the font size to 31, the cut-length to 14 words, and the
resized input image size to 150x224 within the fixed-size 224x224-pixel SuperCaptioning image, as shown in Figure 2.</p>
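      <p>A minimal sketch of the SuperCaptioning image composition under this configuration, using PIL; the function name, font file, and exact text placement are assumptions for illustration, and the 150x224 size is interpreted as 150 pixels tall by 224 pixels wide since the input image occupies the top of the 224x224 canvas.</p>
      <preformat>
# Illustrative composition of a 224x224 SuperCaptioning image: the resized input
# image on top, the partial caption drawn below with a size-31 font.
from PIL import Image, ImageDraw, ImageFont

def compose_supercaptioning_image(input_image, partial_caption,
                                  canvas_size=224, image_height=150, font_size=31):
    canvas = Image.new("RGB", (canvas_size, canvas_size), "white")
    canvas.paste(input_image.resize((canvas_size, image_height)), (0, 0))
    font = ImageFont.truetype("DejaVuSans.ttf", font_size)   # assumed font file
    ImageDraw.Draw(canvas).text((0, image_height), partial_caption,
                                fill="black", font=font)
    return canvas
</preformat>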
      <p>
        The training data is generated by labeling each SuperCaptioning image as an example of the class indicated
by its next caption word. The SuperCaptioning image is labeled EOS if the caption sentence is finished. The
model used is SENet-154 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] pre-trained on ImageNet (pre-trained model available at https://github.com/hujie-frank/SENet). We fine-tuned this model on our generated data set by
only modifying the last layer to output 11,571 classes, which is the vocabulary size of all the selected captions.
      </p>
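      <p>A minimal sketch of how such training pairs could be generated from one ground-truth caption, under the same assumptions as the composition sketch above; the EOS token string and all names are illustrative.</p>
      <preformat>
# Illustrative generation of (SuperCaptioning image, next-word label) training
# pairs from one ground-truth caption; relies on compose_supercaptioning_image above.
def make_training_examples(input_image, caption, eos_token="EOS"):
    words = caption.split()
    examples = []
    for i in range(len(words)):
        partial = " ".join(words[:i])     # empty partial caption for the first step
        examples.append((compose_supercaptioning_image(input_image, partial), words[i]))
    # final example: the full caption as input, EOS as the label
    examples.append((compose_supercaptioning_image(input_image, caption), eos_token))
    return examples
</preformat>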
      <p>Figure 1 shows that the proposed SuperCaptioning method captions the number of objects in the image, as
shown in Figure 1a “Four men ...”; and it also captions the colors of overlapping objects, as shown in Figure 1b
“A woman in a black coat ... holding a red umbrella”; it captions the big picture of the background, as shown
in Figure 1c “... with mountains in the background”; and it also captions the detailed activity, as shown in
Figure 1d “... with their arms outstretched in a ballet”.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>
        In this paper, we propose the SuperCaptioning method for image captioning using two-dimensional word
embedding. The experimental results on Flickr30k show that the SuperCaptioning method gives high-quality image
captions. The proposed method could be used for on-device image captioning applications as low-power CNN
accelerators become more and more available [
        <xref ref-type="bibr" rid="ref10 ref7">10, 7</xref>
        ]. An interactive demo is ready to show at the workshop.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Anderson</surname>
          </string-name>
          , Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang.
          <article-title>Bottom-up and top-down attention for image captioning and visual question answering</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>6077</fpage>
          <lpage>6086</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>MD</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F</given-names>
            <surname>Sohel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>MF</given-names>
            <surname>Shiratuddin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H</given-names>
            <surname>Laga</surname>
          </string-name>
          .
          <article-title>A comprehensive survey of deep learning for image captioning</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>51</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1</fpage>
          <lpage>36</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gang</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Squeeze-and-excitation networks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <fpage>7132</fpage>
          <lpage>7141</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrej</given-names>
            <surname>Karpathy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>Deep visual-semantic alignments for generating image descriptions</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <fpage>3128</fpage>
          <lpage>3137</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jiasen</given-names>
            <surname>Lu</surname>
          </string-name>
          , Caiming Xiong, Devi Parikh, and Richard Socher.
          <article-title>Knowing when to look: Adaptive attention via a visual sentinel for image captioning</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <fpage>375</fpage>
          <lpage>383</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Steven J.</given-names>
            <surname>Rennie</surname>
          </string-name>
          , Etienne Marcheret, Youssef Mroueh, Jerret Ross, and
          <string-name>
            <given-names>Vaibhava</given-names>
            <surname>Goel</surname>
          </string-name>
          .
          <article-title>Self-critical sequence training for image captioning</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>7008</fpage>
          <lpage>7024</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Baohua</given-names>
            <surname>Sun</surname>
          </string-name>
          , Daniel Liu, Leo Yu,
          <string-name>
            <given-names>Jay</given-names>
            <surname>Li</surname>
          </string-name>
          , Helen Liu, Wenhan Zhang, and
          <string-name>
            <given-names>Terry</given-names>
            <surname>Torng</surname>
          </string-name>
          .
          <article-title>Mram co-designed processing-in-memory cnn accelerator for mobile and iot applications</article-title>
          .
          <source>arXiv preprint arXiv:1811.12179</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Baohua</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Catherine Chi, Wenhan Zhang, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Squared english word: A method of generating glyph to use super characters for sentiment analysis</article-title>
          .
          <source>arXiv preprint arXiv:1902.02160</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Baohua</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Patrick Dong, Wenhan Zhang, Jason Dong, and
          <string-name>
            <given-names>Charles</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Super characters: A conversion from sentiment classification to image classification</article-title>
          .
          <source>In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , pages
          <fpage>309</fpage>
          <lpage>315</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Baohua</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Patrick Dong, Wenhan Zhang, Jason Dong, and
          <string-name>
            <given-names>Charles</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Ultra power-efficient cnn domain specific accelerator with 9.3 tops/watt for mobile and embedded applications</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>
          , pages
          <fpage>1677</fpage>
          <lpage>1685</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Baohua</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Michael Lin, Charles Young, Jason Dong, Wenhan Zhang, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Dong</surname>
          </string-name>
          .
          <article-title>Superchat: Dialogue generation by transfer learning from vision to language using two-dimensional word embedding and pretrained imagenet cnn models</article-title>
          .
          <source>arXiv preprint arXiv:1905.05698</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Baohua</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Wenhan Zhang, Michael Lin, Patrick Dong, Charles Young, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Dong</surname>
          </string-name>
          .
          <article-title>Supertml: Two-dimensional word embedding and transfer learning using imagenet pretrained cnn models for the classifications on tabular data</article-title>
          .
          <source>arXiv preprint arXiv:1903.06246</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Alexander Toshev, Samy Bengio, and
          <string-name>
            <given-names>Dumitru</given-names>
            <surname>Erhan</surname>
          </string-name>
          .
          <article-title>Show and tell: Lessons learned from the 2015 mscoco image captioning challenge</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <fpage>652</fpage>
          <lpage>663</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Ting</given-names>
            <surname>Yao</surname>
          </string-name>
          , Yingwei Pan,
          <string-name>
            <given-names>Yehao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhaofan</given-names>
            <surname>Qiu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tao</given-names>
            <surname>Mei</surname>
          </string-name>
          .
          <article-title>Boosting image captioning with attributes</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pages
          <fpage>4894</fpage>
          <lpage>4902</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Quanzeng</given-names>
            <surname>You</surname>
          </string-name>
          , Hailin Jin, Zhaowen Wang, Chen Fang, and
          <string-name>
            <given-names>Jiebo</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Image captioning with semantic attention</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <fpage>4651</fpage>
          <lpage>4659</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>