=Paper= {{Paper |id=Vol-1441/recsys2015_poster9 |storemode=property |title=Image Discovery and Insertion for Custom Publishing |pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster9.pdf |volume=Vol-1441 |dblpUrl=https://dblp.org/rec/conf/recsys/LiuLW15 }} ==Image Discovery and Insertion for Custom Publishing== https://ceur-ws.org/Vol-1441/recsys2015_poster9.pdf
Image Discovery and Insertion for Custom Publishing

Lei Liu, Jerry Liu, Shanchan Wu
HP Labs, 1501 Page Mill Rd, Palo Alto, CA 94304
lei.liu2@hp.com, jerry.liu@hp.com, shanchan.wu@hp.com



ABSTRACT

Images in reading materials make the content come alive. Aside from providing additional information to the text, reading material containing illustrations engages our spatial memory and increases retention of the material. However, despite the plethora of available multimedia, adding illustrations to text remains a difficult task for the amateur content publisher. To address this problem, we present a semantic-aware image discovery and insertion system for custom publishing. Compared to image search engines, our system has the advantage of being able to discern among different topics within a long text passage and recommend the most relevant images for each detected topic using semantic "visual words" based relevance.

1. INTRODUCTION

Eye-catching illustrations make the reading experience more attractive, increasing reading engagement and memory retention. However, associating image illustrations with the appropriate text content can be a painful task, especially for the amateur content creator. These content creators may be a blogger sharing her subject matter expertise, a father creating a PTA newsletter, or a teacher authoring her own class material. Such subject matter experts can author text quite fluently but often find locating meaningful illustrations to be a chore requiring significant time and effort. Thus, custom publications from non-professionals often lack the richness of illustration found in their professional counterparts.

To find illustrations relevant to a long text passage, a usual practice is to submit the entire text as a query to an image search engine such as Bing Image. However, since existing search engines are designed to accept only a few words as a query, a long query string will simply produce an error indicating that the query is too long to process. Alternatively, one can manually summarize the long passage into a query of a few words, but this approach is inefficient and may not accurately represent the passage. Another key disadvantage of current image recommendation systems is that although more than one topic may underlie a long query passage, existing search engines fail to consider this factor and treat all of these concepts as a single topic with which to find relevant images. In addition, because search engines usually transform the query and candidate resources into bags or vectors of words, the semantic topics underlying the content are overlooked. Topics are a better choice for truly understanding both the query and the candidate illustrations.

To address these challenges, we created a novel system that recommends illustrations for custom publishing by enabling search with text passages of any length and by recommending a ranked list of images that match the different topics covered within the queried passage. In summary, our system makes these contributions: (1) it recommends images for text queries of any length; (2) it detects the underlying topics in multi-topic content, then recommends illustrations for each topic; (3) it introduces a novel semantic "visual words" based image ranking system: using text content from the web page where an image originated, we derive "visual words" as semantic topic features with a probabilistic topic modeling technique.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
RecSys 2015 Poster Proceedings, September 16–20, 2015, Vienna, Austria. Copyright is held by the owner/author(s). Publication rights licensed to ACM.

2. METHODOLOGIES

We address this problem by developing a semantic-aware image discovery and insertion system for custom publishing. In a nutshell, given query text of any length, our system first detects the underlying topics in the text. We then recommend a list of images that are semantically relevant for each detected topic.

Query Topics Discovery: Given text content of any length as a query, we utilize topic models to discover the abstract topics underlying the query. Intuitively, provided that a selected text content is about one or more topics, one expects particular words to appear in each topic more or less frequently. After generation, each topic is represented by a set of words that frequently occur together. In this paper, we use Latent Dirichlet Allocation (LDA), and we represent each topic with a set of terms that indicate the concept of that single topic.

Topic Compression: Since the number of topics to generate is given as input to LDA, while the true number of topics in the queried passage is unknown, it is possible that multiple generated topics describe similar concepts. To remove such redundancy, we propose topic compression: we consider the word distribution of each topic and remove duplicate topics that discuss similar concepts. To identify whether two topics are about similar concepts, we use Pearson correlation in this paper. Then, for each remaining query topic, we fetch the top K (K=40 in this paper) relevant images from Bing Image, using the topic's representative terms as the query.

Directly using the images with their original ranking from Bing Image is not appropriate for publishing purposes. To illustrate this, we provide an example in Figure 1: the query passage "... explain why fireworks different colors with the knowledge of fireworks chemistry ..." from a chemistry book is submitted to the Bing Image API.
Figure 1: Images directly from the search engine are inappropriate

Both images in Figure 1 are returned. However, the left one is more appropriate for publishing, as it is semantically related to the book content and has illustrative value.

Distinguishing the most semantically relevant images from the others is critical for publishing purposes, since it is important to embed images whose illustrative or semantic content adds explanatory value to the surrounding book content. To perform this, we use the images from Bing as candidates and re-rank them based on semantic "visual words", comparing the relevance between each query topic and the surrounding content of the original page from which each candidate image originated.

Visual Words Generation: To generate "visual words", for each query topic we combine the content of the original pages containing the candidate images with the content representing that query topic into a bucket. For each bucket, we generate a set of topics as the semantic "visual words" with LDA [2] (the number of topics can be selected via cross-validation; in this paper, we generate 50 topics for each content bucket).

Relevance Discovery and Ranking: This yields a "visual words" representation matrix, in which each row is an extracted query topic or a candidate image and each column is a "visual word". We apply cosine similarity to measure relevance in this paper [3]; other distance measures can also be applied [4, 5]. Finally, we select the top n (n < K; n = 20 in the experiment section) images to show for each query topic discovered from the query passage. In addition, our system allows users to customize image preferences, including privacy, sources, formats, resolutions, size, etc.

3. EXPERIMENT

We have implemented the system, which is currently being piloted with multiple local universities and high schools [1]. While the pilot is ongoing, early feedback shows that our system is viewed favorably by users. To show the effectiveness of our method, we randomly select 100 query passages from each of 6 testing books of various grades and subjects. We consider 4 different ways to extract key terms from the selected query (summarized in Table 1) and 2 ways to discover relevant resources (summarized in Table 2). Consequently, we have 8 scheme combinations from Tables 1 and 2.

Table 1: Key Term Extraction Schemes S1–S4
    Scheme    Words           Weighting
    S1        words           Frequency-based weighting
    S2        nouns           Frequency-based weighting
    S3        noun phrases    Phrase weighting
    S4        topic words     Topic word distribution

In S1, the words with the largest tf*idf values are extracted from the selected passage; nouns and noun phrases are identified using an off-the-shelf POS tagger. Both words and nouns are weighted by frequency-based weighting (tf*idf), while noun phrases are weighted by phrase weighting. The details of the weighting methods are as follows:

• Frequency-based weighting: For any term t_i, the frequency-based weight f_t(t_i) is computed using the widely used tf*idf weighting scheme.

• Phrase weighting: Let s be a phrase and t_i ∈ s a term contained in s. Then f_s(s) = f_t(s) · (Σ_{t_i ∈ s} f_i) / |s|, where |s| is the length of the phrase, i.e., its number of terms. The phrase weight f_s(s) considers two factors: f_t(s), the phrase tf*idf score obtained by treating the whole phrase as a single term, and the average frequency of its contained terms, (Σ_{t_i ∈ s} f_i) / |s|. The first factor captures the importance of the phrase as a whole unit; the second reflects its relevance to the document through the average frequency of its contained terms.

To evaluate the performance of the different query extraction schemes, we conducted a user study. For each query passage, we show the top 20 results for each combination of relevance methods (T1–T2) and key term schemes (S1–S4). The results are manually judged as appropriate or not, and precision is used as the evaluation metric. The Stanford POS Tagger¹ is used to extract nouns, and noun phrases are extracted by the regular expression (Adjective|Noun)*(Noun Preposition)?(Adjective|Noun)*Noun.

Table 2: Relevance Discovery T1–T2
    Scheme    Relevance
    T1        Topic features with cosine similarity
    T2        Directly from Bing Image

We repeat the experiment with randomly selected passage queries from 6 testing books (B1–B6). For each book, the average precision over all passage queries is calculated for all 8 combinations (S1–S4 with T1–T2). Figure 2 plots the overall results.

Figure 2: Results of the 8 combinations on B1–B6

From the results, we make the following observations: (1) T1 outperforms T2 with all key term extraction schemes (S1–S4) across all 6 books (B1–B6), which supports our argument that the topics underlying the content offer a better way to truly understand both the passage query and the candidate resources. (2) Nouns and noun phrases achieve similar performance, and both are better than words selected without the POS tagger. (3) Our method (S4 with T1) achieves the best performance for all queries across the 6 testing books.

4. REFERENCES

[1] J. M. Hailpern, R. Vernica, M. Bullock, U. Chatow, J. Fan, G. Koutrika, J. Liu, L. Liu, S. J. Simske, and S. Wu. To print or not to print: hybrid learning with METIS learning platform. In ACM EICS 2015, pages 206–215, 2015.
[2] G. Koutrika, L. Liu, and S. Simske. Generating reading orders over document collections. In 31st IEEE International Conference on Data Engineering, ICDE 2015, pages 507–518, 2015.
[3] L. Liu, G. Koutrika, and S. Wu. LearningAssistant: A novel learning resource recommendation system. In 31st IEEE International Conference on Data Engineering, ICDE 2015, 2015.
[4] L. Liu and P.-N. Tan. A framework for co-classification of articles and users in Wikipedia. In Web Intelligence 2010, pages 212–215, 2010.
[5] P. Mandayam Comar, L. Liu, S. Saha, A. Nucci, and P.-N. Tan. Weighted linear kernel with tree transformed features for malware detection. In CIKM 2012, pages 2287–2290, 2012.

¹ http://nlp.stanford.edu/software/tagger.shtml