We evaluate the proposed News2Images on a big media dataset including more than one million news articles served through a Korean media portal website, NAVER², in 2014. Experimental results show that our method outperforms a baseline method based on word occurrence in terms of both quantitative and qualitative criteria. Moreover, we discuss some future directions for applying News2Images to personalized news recommender systems.

Figure 1. An example of the image-based contents generated from a news document by News2Images. The left box shows an original online news document and the right box shows the contents summarizing the news into three images. The red sentences in the left box are the key sentences extracted by summarization; they are placed in the black rectangles below the retrieved images in the right box.

2. DEEP LEARNING-BASED FEATURE REPRESENTATION

Most news articles consist of a title, a document, and attached images. Mathematically, a news article x is defined as a triple x = {t, S, V}, where t, S, and V denote a title, the set of document sentences, and an image set; V can be an empty set. A title t and a document sentence s, s ∈ S, are represented as vectors of word features such as occurrence frequencies or word embeddings. An image v, v ∈ V, is likewise defined as a vector of visual features such as scale-invariant feature transform (SIFT) features [8] or CNN features. For representing a news article with a feature vector, we use deep learning in this study. Many recent studies have reported that the hidden node values generated by deep learning models such as word embedding networks and CNNs are very useful for diverse problems including image classification [5], image descriptive sentence generation [14], and language modeling [12].

Formally, a word w is represented as a real-valued vector, w ∈ ℝ^d, where d is the dimension of the word vector. The vector value of each word is learned from a large corpus by word2vec [10].
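The triple x = {t, S, V} defined above can be sketched as a small container; this is only an illustrative data holder, and the class and field names are hypothetical, chosen to mirror the definition rather than taken from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NewsArticle:
    """A news article as the triple x = {t, S, V}."""
    title: str                              # t: the article title
    sentences: List[str]                    # S: the set of document sentences
    image_features: List[List[float]] = field(default_factory=list)  # V: may be empty

article = NewsArticle(
    title="Yuna Kim decided to participate in the world figure skating championship",
    sentences=[
        "Yuna Kim will take part in the coming world figure skating championship.",
        "The competition will be held in February.",
    ],
)
# V defaults to the empty set, matching the definition that V can be empty.
assert article.image_features == []
```

In the paper's setting, each sentence would later be replaced by its embedding vector and each image by its CNN feature vector.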
This distributed word representation, called word embedding, not only characterizes semantic and syntactic information but also overcomes the data sparsity problem [6, 10]. That is, two words with similar meanings are located at close positions in the vector space. A sentence or a document can be represented as a real-valued vector as well. Sentence or document vectors can be generated by training deep networks, or they can be calculated by pooling the word vectors included in the sentences. Here a sentence vector is calculated by average pooling:

    s_i = (1/|s|) Σ_{w ∈ s} w_i,    (1)

where w and s denote a word and the set of words included in a sentence, and s_i and w_i are the i-th elements of the embedding vectors s and w corresponding to s and w, respectively. Simple average pooling loses the sequence information of the words. Therefore, the concatenation of multiple word vectors with a sliding window strategy can be used instead of simple pooling.

For extracting key sentences, we define a score considering both the similarity to the core news contents and the diversity needed to cover the entire contents of the news. The similarity and the diversity are computed using sentence embeddings based on word2vec [10]. The image retrieval module searches for the images semantically associated with the sentences extracted by the summarization module. The semantic association between a sentence and an image is defined as the cosine similarity between the sentence and the title of the news article to which the image is attached. Also, we use the hidden node values of the top fully connected layer of convolutional neural networks (CNNs) [4] as the feature of each image. Finally, the image-based content module generates a set of new images by synthesizing each retrieved image with its corresponding sentence. The generated image-based contents can improve readability and enhance the interest of mobile device users, compared to text-based news articles.
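The average pooling of (1) can be sketched in a few lines; the 3-dimensional embedding values below are invented for illustration, not learned by word2vec:

```python
def sentence_vector(words, embeddings):
    """Average-pool word embedding vectors into one sentence vector (Eq. 1)."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    # s_i = (1/|s|) * sum of w_i over the words in the sentence
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional embeddings (illustrative values only).
emb = {
    "figure":       [1.0, 0.0, 2.0],
    "skating":      [0.0, 2.0, 2.0],
    "championship": [2.0, 1.0, 2.0],
}

s = sentence_vector(["figure", "skating", "championship"], emb)
# s is the element-wise mean of the three word vectors: [1.0, 1.0, 2.0]
```

The sliding-window variant mentioned above would concatenate the vectors of |W| consecutive words before pooling instead of averaging single word vectors.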
The proposed News2Images is original in that it generates new contents suitable for mobile services by summarizing a long news document into images rather than sentences, even though many methods exist for summarization [9] or text-to-image retrieval [1]. Figure 1 presents an example of image-based content consisting of three synthesized images generated from a Korean online news article.

² www.naver.com

Figure 2. Overall flow of generating image-based contents from a news article via News2Images. (The diagram shows the flow from HTML news documents through the learned word embedding model, the summarization function with its similarity and diversity criteria, the title-image database, the learned CNN model, the image similarity function, and the synthesis function to the generated image-based contents.)

3. NEWS-TO-IMAGES

News2Images is a method of generating image-based contents from a given news document using summarization and text-to-image retrieval. News2Images consists of three parts: key sentence extraction based on single-document summarization, key-sentence-related image retrieval by associating images with sentences, and image-based content generation by synthesizing sentences and images. Figure 2 shows the overall framework of News2Images.

3.1 News Document Summarization

Document summarization is the task of automatically generating a minority of key sentences from an original document while minimizing the loss of content information [9]. Two approaches are mainly used for document summarization. One is abstraction, which generates a few new sentences; abstraction summarizes a document more precisely but remains a challenging problem. The other is extraction, which selects some core sentences from the document, and we use the extraction approach in this study. Also, the news summarization in this study belongs to single-document summarization [7]. We assume two conditions for the summarization:

i) A news title is the best sentence, consistently representing the entire content of the news.

ii) A news article consists of at least two sentences, and the entire content is built up by composing its sentences' content.

For precisely summarizing a news document, it is thus required that the summarized sentence set consist of sentences that are not only semantically similar to the title but also cover the entire content with diverse words. We call the former similarity and the latter diversity.

Formally, a document S is defined as a set of its sentences, S = {s₁, ..., s_M}, where M denotes the number of sentences included in S. The i-th sentence s_i is represented as a real-valued vector, s_i ∈ ℝ^d, where d is the vector size, by word2vec and average pooling. Then, document summarization is formulated as

    S_k* = argmax_{S_k ⊆ S} [λ f(S_k, S) + (1 − λ) g(S_k, S)]
         = argmax_{S_k ⊆ S} [λ f(S_k, t) + (1 − λ) g(S_k, S)],    (2)

    s.t. f(S_k, S) = Σ_{s ∈ S_k} f(s, S) and g(S_k, S) = Σ_{s ∈ S_k} g(s, S),

where t denotes the title of S, S_k and S_k* are a set of k extracted sentences and the optimal set among the S_k, f(S_k, S) and g(S_k, S) denote the similarity and diversity functions, and λ is a constant moderating the ratio of the two criteria.

The similarity f(s, t) between a given sentence s and a news title t is defined as the cosine similarity between the two sentence embedding vectors:

    f(s, t) = (s · t) / (‖s‖ ‖t‖).    (3)

For calculating the diversity, we partition the sentences of S into multiple subsets using a clustering method. Because a sentence vector implicitly reflects syntactic and semantic information, clustering generates multiple semantically distinctive subsets. For the j-th cluster C_j, we calculate the cosine similarity between each sentence in C_j and the centroid of C_j. Because the cosine similarity can be negative, we treat negative values as zero. This value is defined as the diversity:

    g(s, C_j) = (s · c_j) / (‖s‖ ‖c_j‖),    (4)

where c_j denotes the centroid vector of C_j.

Finally, the k sentences with the largest values defined in (2) are extracted as the summarization set for the given document. Here we set k to three, which means that a news article is summarized into three image-based contents.

3.2 Sentence-to-Image Retrieval

The second subtask is to retrieve images representing semantics similar to the extracted sentences. Because we use the images attached to news articles, the title of a news article containing an image can be used as a description sentence of the image. Therefore, the semantic similarity of an image to an extracted sentence is calculated by measuring the similarity between the image title vector and the sentence vector.

Formally, when an image feature vector set V = {v₁, ..., v_N} is given, the image most similar to an extracted sentence ŝ is retrieved as

    v* = argmax_{v ∈ V} f(ŝ, t(v)) = argmax_{v ∈ V} (ŝ · t(v)) / (‖ŝ‖ ‖t(v)‖),    (5)

where t(v) denotes the title of an image v.

Table 1. Accuracy of the baseline method and News2Images

    Classification     Baseline (TF/IDF)   News2Images
    Correct #          14,020/20,224       18,908/20,224
    Accuracy           0.693               0.935
    Cosine Similarity  0.636               0.866

We set the number of images for averaging in (6), K, to 1 for both methods. The window size of the words is 1. Both methods use news titles when pooling word vectors into sentence vectors.
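The key-sentence scoring of (2)–(4) can be sketched as follows, assuming the sentence vectors, the title vector, and the cluster assignments are already computed (the clustering method itself is not specified here). The function names and toy 2-D vectors are illustrative, not from the paper; each sentence is ranked by λ·f(s, t) + (1 − λ)·g(s, C_j) and the top k sentences are kept:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors; Eq. (3)."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def extract_key_sentences(sent_vecs, title_vec, clusters, centroids, lam=0.9, k=3):
    """Rank sentences by lam*f(s,t) + (1-lam)*g(s,C_j) and return the top-k indices.

    clusters[i] is the cluster index of sentence i; centroids[j] is the
    centroid of cluster j. Negative diversity values are clipped to zero,
    as in Eq. (4).
    """
    scores = []
    for i, s in enumerate(sent_vecs):
        f = cosine(s, title_vec)                          # similarity to the title
        g = max(0.0, cosine(s, centroids[clusters[i]]))   # diversity term
        scores.append((lam * f + (1 - lam) * g, i))
    top = sorted(scores, reverse=True)[:k]
    return sorted(i for _, i in top)          # indices in document order

# Toy 2-D sentence vectors (illustrative, not learned embeddings).
title = [1.0, 0.0]
sents = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
clusters = [0, 1, 0, 1]
cents = [[1.0, 0.1], [0.0, 1.0]]
selected = extract_key_sentences(sents, title, clusters, cents, lam=0.9, k=3)
```

Selecting the k sentences with the largest per-sentence scores maximizes the summed objective in (2), since f(S_k, ·) and g(S_k, ·) decompose into per-sentence sums.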
Due to the diversity criterion, sentences that are not directly related to the title may be extracted as core sentences. Assume that the title is "Yuna Kim decided to participate in the 2013 world figure skating championship", and two extracted sentences are "Yuna Kim will take part in the coming world figure skating championship" and "The competition will be held in February." In this case, the title is not semantically similar to the second sentence, so it is difficult to associate the second sentence with Yuna Kim's images. To overcome this, we can additionally use the title vector of the news article given as a query when pooling word vectors into a sentence vector. The use of the news title does not influence the summarization because the title vector is reflected in all the sentence vectors.

Instead of v*, we can generate a new image vector v̂ by averaging the vectors of the top K images with the largest similarity values. Then, v* is selected as follows:

    v* = argmax_{v ∈ V} f(v̂, v),    (6)

    v̂_i = Σ_{v ∈ V_K} [R(v) / Σ_{v′ ∈ V_K} R(v′)] v_i,    (7)

where v_i is the i-th element of v, V_K denotes the set of the top K retrieved images, and R(v) is a weight function proportional to the similarity rank; an image ranked as more similar has a larger R(v).

3.3 Image-Based Content Generation

Readability is a main issue for mobile content services. Therefore, we generate new image-based contents instead of using the retrieved images directly, to improve readability and enhance users' interest. An image-based content comprises a continuous series of synthesized images in which the retrieved images and their corresponding sentences are merged. Figure 1 illustrates an example of the image-based contents generated from a news document.

4. EXPERIMENTAL RESULTS

4.1 Data and Parameter Setting

We evaluate the proposed News2Images on a big media dataset including over one million Korean news articles, provided by a media portal site, NAVER, in 2014. In detail, the word vectors are learned from all the news documents, and the CNN models for constructing image features are trained on approximately 220 thousand news images related to 100 famous entertainers, movie stars, and sports stars. Also, 6,967 news articles are used as the validation set for evaluating the performance. Three key sentences were extracted from each news article containing more than three sentences, and we used all the sentences of articles consisting of fewer than three sentences. In total, 20,224 image-based contents were generated from the validation news data.

We used word2vec for word embedding and a modified GoogleNet implemented in Caffe for CNN features [4]. The word vector and image feature sizes are 100 and 1024, respectively. For error correction in learning the CNNs, we set the label of an image to the name of the person in the image; thus, the size of the class label set is 100. The learned CNN model for generating image features yields 0.56 and 0.79 as Top-1 and Top-5 classification accuracies, respectively. This indicates that the generated image features are distinguishable enough to be used for associating images and sentences. The number of clusters for the diversity in summarization was set to 3, and the constant λ moderating the similarity and the diversity is 0.9.

For comparison, we used a word occurrence vector based on TF/IDF as a baseline for computing the similarity between sentences and titles, instead of a word embedding vector. TF/IDF has been widely used for text mining, and thus we can verify the effects of deep learning-based word features.

4.2 Content Generation Accuracy

Human effort is still essential for precisely measuring how semantically similar the generated image-based contents are to the news document given as a query. Instead of manual evaluation by humans, we consider a classification problem for the similarity evaluation. That is, for a given extracted news sentence, we consider the retrieved image similar to the sentence when the persons referred to in the sentence appear in the image. This is reasonable because it means the method provides diverse images of a movie star when a user reads news about that star.

Table 1 compares the classification accuracy of the baseline and the proposed method. As shown in Table 1, News2Images outperforms the baseline method. This indicates that the word embedding features used in News2Images represent semantics more precisely than the TF/IDF-based features. Also, we compared the cosine similarity between the titles of the retrieved images and the extracted sentences using their word embedding vectors. The values are averaged over the titles of the 20,224 retrieved images. We find that our method retrieves images that are more semantically similar to the extracted sentences.

Table 2. Accuracies according to the usage of news titles

    News title   Not used        Used
    Correct #    13,896/20,224   18,908/20,224
    Accuracy     0.687           0.935

Table 3. Accuracies according to the number of retrieved images used for generating a new image feature

    Image size   K=1             K=3
    Correct #    18,908/20,224   18,791/20,224
    Accuracy     0.935           0.929

Table 4. Accuracies according to the weight for proper nouns

    Proper noun weight   PW=1.0          PW=10.0
    Correct #            18,908/20,224   19,191/20,224
    Accuracy             0.935           0.950

PW denotes the weight of proper nouns.

Table 5. Accuracies according to word vector window sizes

    Window size         |W|=1           |W|=3
    Correct #           18,908/20,224   18,743/20,224
    Accuracy            0.935           0.927
    Cosine Similarity   0.866           0.833

|W| denotes the number of concatenated word vectors.

4.3 Effects of Parameters on Performance

We compare the accuracies of the generated contents under four parameters: i) the use of the news title for pooling word vectors into a sentence vector, ii) the number of retrieved images averaged into an image feature, iii) the weight for proper nouns, and iv) the size of concatenated word vectors. Table 2 presents the accuracy improvement when the title of the summarized news document is used. The use of the news title dramatically improves the accuracy, by about 30% compared to the case in which titles are not used. Interestingly, News2Images without titles provides performance similar to the baseline method with titles. Table 3 shows the effects of averaging multiple image features on sentence-to-image retrieval; it indicates that generating a new image feature from multiple image features does not enhance the performance. Giving more weight to proper nouns can improve the quality of image-based content generation because proper nouns are likely to carry the key content of the news; the results in Table 4 support this hypothesis. The number of concatenated word vectors rarely influences the accuracy. From Table 5, we conclude that information on word sequences is not essential for classifying the persons in the images.

4.4 Image-Based Contents as News Summarization

Figure 3 illustrates good and bad examples of image-based contents generated from news articles. Most of the images are related to the news contents, but sentences containing polysemous words or too many words are occasionally linked to images that are not relevant to the sentences. This is caused by the fact that one word is represented by only one vector regardless of its meaning. Also, the representation power of pooling-based sentence embedding can be weakened by the property of average pooling when a sentence consists of too many words.

Figure 3. Examples of image-based contents generated from the summarization sentences extracted from news articles by News2Images and the baseline method. Images with a red border are very similar to the sentences. Blue-bordered images include the persons referred to in the given sentences but represent contents different from the sentences.

5. DISCUSSION

We proposed News2Images, a new method for summarizing news articles into image-based contents. These image-based contents are useful for providing news to mobile device users while enhancing readability and interest. The deep learning-based text and image features used in the proposed method improved performance by approximately 24% in classification accuracy and 0.23 in cosine similarity compared to the TF/IDF baseline method. Our study is original in generating new image contents from news documents, even though many studies on summarization or text-to-image retrieval have been reported.

This method can be applied to a personalized news recommender system by adding user preference information, such as subject categories and persons preferred by a user, and feedback information into the method. In detail, we can give weight to words related to the subjects or persons preferred by a user when generating sentence vectors. This strategy allows the sentences the user is likely to find interesting to obtain higher scores in summarization and retrieval, thus exposing the photos the user prefers.

Evaluation should also be improved. Although we evaluate the proposed method with a cosine similarity-based measure and classification accuracy, these have limitations for precisely measuring the similarity between the news articles and the generated image contents. It is necessary to build a ground-truth dataset annotated by humans, which would not only help to evaluate the model performance more precisely but could also serve as a good dataset for recommendation as well as image-text multimodal learning. Furthermore, we will verify the effects of News2Images on improving readability through human experiments in future work.

The proposed method can also be improved by adding a module for efficiently learning a common semantic representation shared between sentences and images using a unified model [14].

ACKNOWLEDGMENTS

6. REFERENCES

[1] Datta, R., Joshi, D., Li, J., and Wang, J. Z. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR). 40, 2. 5.

[2] Hinton, G. et al. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine. 29, 6. 82-97.

[3] Irsoy, O. and Cardie, C. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems 2014. 2096-2104.

[4] Jia, Y. et al. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia 2014. 675-678.

[5] Krizhevsky, A., Sutskever, I., and Hinton, G. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012. 1097-1105.

[6] LeCun, Y., Bengio, Y., and Hinton, G. 2015. Deep learning. Nature. 521, 7553. 436-444.

[7] Lin, C.-Y. and Hovy, E. 2002. From single to multi-document summarization: a prototype system and its evaluation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). 457-464.

[8] Lowe, D. G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 60, 2. 91-110.

[9] McDonald, R. 2007. A study of global inference algorithms in multi-document summarization. Springer Berlin Heidelberg. 557-564.

[10] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 2013. 3111-3119.

[11] Salakhutdinov, R., Mnih, A., and Hinton, G. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007). 791-798.

[12] Socher, R., Lin, C. C.-Y., Ng, A., and Manning, C. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 129-136.

[13] Van den Oord, A., Dieleman, S., and Schrauwen, B. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems 2013. 2643-2651.

[14] Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML '15).