=Paper= {{Paper |id=Vol-1441/recsys2015_poster9 |storemode=property |title=Image Discovery and Insertion for Custom Publishing |pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster9.pdf |volume=Vol-1441 |dblpUrl=https://dblp.org/rec/conf/recsys/LiuLW15 }} ==Image Discovery and Insertion for Custom Publishing== https://ceur-ws.org/Vol-1441/recsys2015_poster9.pdf
Image Discovery and Insertion for Custom Publishing

Lei Liu, Jerry Liu, Shanchan Wu
HP Labs, 1501 Page Mill Rd, Palo Alto, CA 94304
lei.liu2@hp.com, jerry.liu@hp.com, shanchan.wu@hp.com



ABSTRACT

Images in reading materials make the content come alive. Aside from providing additional information to the text, reading material containing illustrations engages our spatial memory and increases retention of the material. However, despite the plethora of available multimedia, adding illustrations to text remains a difficult task for the amateur content publisher. To address this problem, we present a semantic-aware image discovery and insertion system for custom publishing. Compared to image search engines, our system has the advantage of being able to discern among different topics within a long text passage and recommend the most relevant images for each detected topic using semantic "visual words" based relevance.

1. INTRODUCTION

Eye-catching illustrations make the reading experience more attractive, increasing reading engagement and memory retention. However, associating image illustrations with the appropriate text content can be a painful task, especially for the amateur content creator. These content creators may be a blogger sharing her subject matter expertise, a father creating a PTA newsletter, or a teacher authoring her own class material. Such subject matter experts can author text quite fluently but often find locating meaningful illustrations to be a chore requiring significant time and effort. Thus, custom publications from non-professionals often lack the richness of illustration found in their professional counterparts.

To find illustrations relevant to a long text passage, a usual practice is to submit the entire text as a query to an image search engine such as Bing Image. However, since existing search engines are designed to accept only a few words as a query, a long query string will simply produce an error indicating that the query is too long to process. Alternatively, one can manually summarize the long passage into a query of a few words, but this approach is inefficient and may not accurately represent the passage. Another key disadvantage of current image recommendation systems is that although more than one topic may underlie a long query passage, existing search engines fail to consider this factor and treat all of these concepts as a single topic with which to find relevant images. In addition, because search engines usually transform the query and candidate resources into bags or vectors of words, the semantic topics underlying the content are overlooked. Topics are a better choice for truly understanding both the query and the candidate illustrations.

To address these challenges, we created a novel system that recommends illustrations for custom publishing by enabling search with text passages of any length and by recommending a ranked list of images that match the different topics covered within the queried passage. In summary, our system makes these contributions: (1) it recommends images for text queries of any length; (2) it detects the underlying topics in multi-topic content, then recommends illustrations for each topic; (3) it introduces a novel semantic "visual words" based image ranking system: using text content from the web page where an image originated, we derive "visual words" as semantic topic features with a probabilistic topic modeling technique.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
RecSys 2015 Poster Proceedings, September 16–20, 2015, Vienna, Austria. Copyright is held by the owner/author(s). Publication rights licensed to ACM.

2. METHODOLOGIES

We address this problem by developing a semantic-aware image discovery and insertion system for custom publishing. In a nutshell, given query text of any length, our system first detects the underlying topics in the text. We then recommend a list of images that are semantically relevant for each detected topic.

Query Topics Discovery: Given text content of any length as a query, we utilize topic models to discover the abstract topics underlying the query. Intuitively, provided that a selected text content is about one or more topics, one expects particular words to appear in each topic more or less frequently. After generation, each topic is represented by a set of words that frequently occur together. In this paper, we use Latent Dirichlet Allocation (LDA), and we represent each topic with a set of terms that indicate the concept of that single topic.

Topic Compression: Since the number of topics to generate is given as input to LDA, while the true number of topics in the queried passage is unknown, it is possible that multiple generated topics describe similar concepts. To remove such redundancy, we propose topic compression: we consider the word distribution of each topic and remove duplicate topics that discuss similar concepts. To identify whether two topics are about similar concepts, we use Pearson correlation in this paper. Then, for each remaining query topic, we fetch the top K (K=40 in this paper) relevant images from Bing Image, using the topic's representative terms as the query.

Directly using the images with their original ranking from Bing Image is not appropriate for publishing purposes. To illustrate this, we provide an example in Figure 1: the query passage "... explain why fireworks different colors with the knowledge of fireworks chemistry ..." from a chemistry book is submitted to the Bing Image API.
Figure 1: Images directly from the search engine are inappropriate

Both images in Figure 1 are returned. However, the left one is more appropriate for publishing, as it is semantically related to the book content and has illustrative value.

Distinguishing the most semantically relevant images from the others is critical for publishing purposes, since it is important to embed images whose illustrative or semantic content adds explanatory value to the surrounding book content. To perform this, we use the images from Bing as candidates and re-rank them based on semantic "visual words", comparing the relevance between each query topic and the surrounding content of the original page from which each candidate image originated.

Visual Words Generation: To generate "visual words", for each query topic we combine the content of the original pages containing the candidate images with the content representing that query topic into a bucket. For each bucket, we generate a set of topics as the semantic "visual words" with LDA [2] (the number of topics can be selected via cross-validation; in this paper, we generate 50 topics for each content bucket).

Relevance Discovery and Ranking: This yields a "visual words" representation matrix, in which each row is an extracted query topic or a candidate image and each column is a "visual word". We apply cosine similarity to measure relevance in this paper [3]; other distance measures can also be applied [4, 5]. Finally, we select the top n (n < K; n = 20 in the experiment section) images to show for each query topic discovered from the query passage. In addition, our system allows users to customize image preferences, including privacy, sources, formats, resolutions, size, etc.

3. EXPERIMENT

We have implemented the system, which is currently being piloted with multiple local universities and high schools [1]. While the pilot is ongoing, early feedback shows that our system is viewed favorably by users. To show the effectiveness of our method, we randomly select 100 query passages from each of 6 testing books of various grades and subjects. We consider 4 different ways to extract key terms from the selected query (summarized in Table 1) and 2 ways to discover relevant resources (summarized in Table 2). Consequently, we have 8 scheme combinations from Tables 1 and 2.

Table 1: Key Term Extraction Schemes S1–S4
    Scheme    Words           Weighting
    S1        words           Frequency-based weighting
    S2        nouns           Frequency-based weighting
    S3        noun phrases    Phrase weighting
    S4        topic words     Topic word distribution

In S1, the words with the largest tf*idf values are extracted from the selected passage; nouns and noun phrases are identified using an off-the-shelf POS tagger. Both words and nouns are weighted by frequency-based weighting (tf*idf), while noun phrases are weighted by phrase weighting. The details of the weighting methods are as follows:

• Frequency-based weighting: For any term t_i, the frequency-based weight f_t(t_i) is computed using the widely used tf*idf weighting scheme.

• Phrase weighting: Let s be a phrase and t_i ∈ s a term contained in s. Then f_s(s) = f_t(s) · (Σ_{t_i ∈ s} f_i) / |s|, where |s| is the length of the phrase, i.e., its number of terms. The phrase weight f_s(s) considers two factors: f_t(s), the phrase tf*idf score obtained by treating the whole phrase as a single term, and the average frequency of its contained terms, (Σ_{t_i ∈ s} f_i) / |s|. The first factor captures the importance of the phrase as a whole unit; the second reflects its relevance to the document through the average frequency of its contained terms.

To evaluate the performance of the different query extraction schemes, we conducted a user study. For each query passage, we show the top 20 results for each combination of relevance methods (T1–T2) and key term schemes (S1–S4). The results are manually judged as appropriate or not, and precision is used as the evaluation metric. The Stanford POS Tagger¹ is used to extract nouns, and noun phrases are extracted by the regular expression (Adjective|Noun)*(Noun Preposition)?(Adjective|Noun)*Noun.

Table 2: Relevance Discovery T1–T2
    Scheme    Relevance
    T1        Topic features with cosine similarity
    T2        Directly from Bing Image

We repeat the experiment with randomly selected passage queries from 6 testing books (B1–B6). For each book, the average precision over all passage queries is calculated for all 8 combinations (S1–S4 with T1–T2). Figure 2 plots the overall results.

Figure 2: Results of the 8 combinations on B1–B6

From the results, we make the following observations: (1) T1 outperforms T2 with all key term extraction schemes (S1–S4) across all 6 books (B1–B6), which supports our argument that the topics underlying the content offer a better way to truly understand both the passage query and the candidate resources. (2) Nouns and noun phrases achieve similar performance, and both are better than words selected without the POS tagger. (3) Our method (S4 with T1) achieves the best performance for all queries across the 6 testing books.

4. REFERENCES

[1] J. M. Hailpern, R. Vernica, M. Bullock, U. Chatow, J. Fan, G. Koutrika, J. Liu, L. Liu, S. J. Simske, and S. Wu. To print or not to print: hybrid learning with METIS learning platform. In ACM EICS 2015, pages 206–215, 2015.
[2] G. Koutrika, L. Liu, and S. Simske. Generating reading orders over document collections. In 31st IEEE International Conference on Data Engineering, ICDE 2015, pages 507–518, 2015.
[3] L. Liu, G. Koutrika, and S. Wu. LearningAssistant: A novel learning resource recommendation system. In 31st IEEE International Conference on Data Engineering, ICDE 2015, 2015.
[4] L. Liu and P.-N. Tan. A framework for co-classification of articles and users in Wikipedia. In Web Intelligence 2010, pages 212–215, 2010.
[5] P. Mandayam Comar, L. Liu, S. Saha, A. Nucci, and P.-N. Tan. Weighted linear kernel with tree transformed features for malware detection. In CIKM 2012, pages 2287–2290, 2012.

¹ http://nlp.stanford.edu/software/tagger.shtml