Toffee – Semantic Media Search Using Topic
          Modeling and Relevance Feedback

        Mikko Koho1, Erkki Heino1, Arttu Oksanen1,2, and Eero Hyvönen1,2
    1 Semantic Computing Research Group (SeCo), Aalto University, Finland
    2 HELDIG – Helsinki Centre for Digital Humanities, University of Helsinki, Finland
                 http://seco.cs.aalto.fi, http://heldig.fi


1       Research Problem Addressed
This paper considers relevance feedback [1, Ch. 5] search on the Web. Here
the information need and query cannot be formulated at the outset, as is typical
in many search situations, but are refined by making a series of queries and
evaluating the results in between. As an instance of such a search,
the following problem setting is considered: since 1981, the Finnish engineering
trade unions TEK and TFiF have given the yearly Finnish Engineering Award3
to a “notable engineering or architectural work which has remarkably advanced
technical competence in Finland”. Would it be possible to devise a search system
that could help the award committee members find candidates for the award
from the news and other materials on the Web?
    This paper presents and demonstrates the first results of our research on
creating such a search service. The novel idea in the proposed approach is to
combine implicit and explicit feedback methods [6] by using topic modeling [2]
for extracting topics from the search results. Extracted topics and user feedback
are used to generate new search keywords, which then guide the iterative search
process. The developed search prototype Toffee is designed to work especially
with Finnish-language content, but can handle documents in any language.


2       Solution: Topical Relevance Feedback Search
To illustrate the idea, Fig. 1 shows the user interface of Toffee, with a search
made to find news related to technology innovations from a web corpus created
by the National Broadcasting Company YLE4. The initial search was based on
the words “innovaatio” (innovation) and “teknologia” (technology). After this,
the search has been repeated with feedback that emphasized news articles about
the clean technology industry. The actual nine search words are shown below
the search field, and the list of results below that. The main topics of each result
are shown on the left of the result title and short description. The colored circles
indicate different topics and the circle sizes depict the importance of the topic,
with a tooltip showing the most important words of a topic.
3 https://www.tek.fi/en/technology-future/finnish-engineering-award
4 http://www.yle.fi
          Fig. 1. Toffee user interface, showing the 3 top results of a search.


    The Toffee source code is available online5 as a multi-container Docker
application. The search logic consists of the following steps (cf. Fig. 2):
1. A broad initial search is conducted with some keywords that are hypothesized
   to produce at least some results of interest.
2. Query expansion [11] is applied by looking up broader and similar entities
   from the Holistic Collaborative Finnish Ontology KOKO [10] via SPARQL
   using the ARPA annotation tool [3]. The related entities are found by
   matching entity labels to search keywords and following SKOS relations to
   other entities.
3. The query is sent to the search service API (currently either Google or
   Elasticsearch) and a maximum of 50 results are received. In the case of web
   search, the resulting web pages are scraped for text content. With
   Elasticsearch, the document contents are returned from the search.
4. All the words in the document contents are then reduced to their base forms
   using the SeCo Lexical Analysis Service [4].
5. Topic modeling is applied to the result set using Latent Dirichlet
   Allocation [2]. In the case of Elasticsearch, the topics of the whole corpus
   have been pre-calculated. In the case of web search, topics are computed
   on the fly, with a small number of iterations.
6. All results are returned to the user interface, where the user can mark each
   individual result as interesting or not interesting. The user can then resend
   the query with the feedback, which is used to reformulate the query, and the
   search process continues from step 2.
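
The query expansion of step 2 can be illustrated with a minimal sketch. It assumes a tiny in-memory ontology in place of KOKO, and it does not use SPARQL or the ARPA tool. The entity identifiers, the labels, the function name expand_query, and the choice of following only "broader" and "related" relations are all hypothetical and are not taken from the Toffee implementation.

```python
# Toy stand-in for an ontology. This is NOT KOKO data; the identifiers,
# labels, and relation names below are hypothetical.
TOY_ONTOLOGY = {
    "ex:cleantech": {
        "labels": ["cleantech"],
        "broader": ["ex:technology"],
        "related": ["ex:renewable_energy"],
    },
    "ex:technology": {"labels": ["teknologia"], "broader": [], "related": []},
    "ex:renewable_energy": {
        "labels": ["uusiutuva energia"],
        "broader": [],
        "related": [],
    },
}


def expand_query(keywords, ontology, relations=("broader", "related")):
    """Add labels of entities linked to the entities whose labels match a keyword."""
    # Index entities by lower-cased label so keywords can be matched to them.
    label_index = {}
    for entity_id, entity in ontology.items():
        for label in entity["labels"]:
            label_index.setdefault(label.lower(), []).append(entity_id)

    expanded = list(keywords)
    seen = {keyword.lower() for keyword in keywords}
    for keyword in keywords:
        for entity_id in label_index.get(keyword.lower(), []):
            # Follow each chosen relation one step and collect neighbour labels.
            for relation in relations:
                for neighbour_id in ontology[entity_id].get(relation, []):
                    neighbour = ontology.get(neighbour_id, {})
                    for label in neighbour.get("labels", []):
                        if label.lower() not in seen:
                            seen.add(label.lower())
                            expanded.append(label)
    return expanded


expanded_keywords = expand_query(["cleantech", "innovaatio"], TOY_ONTOLOGY)
print(expanded_keywords)
```

In this toy setting, a keyword without a matching entity label (here "innovaatio") is passed through unchanged, while a matched keyword contributes the labels of its neighbouring entities.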
5 https://github.com/SemanticComputing/toffee
[Figure 2: workflow diagram. Top row of boxes: Search keywords, Query expansion, Search API, Scrape results, Baseform words, Topic modeling. Bottom row of boxes: Select top words for new query, Calculate word scores based on feedback, Get weighted topic words, Feedback, Results.]

     Fig. 2. Iterative Toffee workflow starting from the initial search keywords.


    The user can mark any of the returned results as relevant or non-relevant, or
leave it undecided. Any of the generated search words of the previous iteration
can be removed, as the search process could produce unwanted keywords.
    The system formulates the query for the next iteration based on the user
feedback. An initial weight S_0 is given to each word present in the previous
query terms Q or in the words of the previous search results R according to
formula 1, where V = Q ∪ R.

                                         
\[
S_0(w) =
\begin{cases}
1 & \text{if } w \in Q \\
0 & \text{if } w \notin Q
\end{cases}
, \quad \forall w \in V \tag{1}
\]
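
As a minimal sketch of formula 1, the following assumes that the query terms Q and the result words R are available as plain lists of base-form strings. The function name initial_weights and the example words are illustrative and are not taken from the Toffee code.

```python
def initial_weights(query_terms, result_words):
    """Return S_0(w) for every w in V = Q ∪ R: 1.0 if w is in Q, else 0.0."""
    q = set(query_terms)
    vocabulary = q | set(result_words)  # V = Q ∪ R
    return {word: (1.0 if word in q else 0.0) for word in vocabulary}


# Illustrative input only: two query words and three words seen in the results.
s0 = initial_weights(
    ["innovaatio", "teknologia"],
    ["teknologia", "energia", "cleantech"],
)
print(sorted(s0.items()))
```

Words of the previous query start with weight 1, so they stay in the next query unless feedback pushes other words above them.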

    The initial weights are then modified based on the feedback given for each
result according to formula 2, where D is the set of result documents from the
previous search, θ_{d,k} is the probability of topic k occurring in document d,
ϕ_{k,w} is the probability of word w occurring in topic k, K is the number of
topics, and f_d is the user feedback for document d. The feedback can be
positive, negative, or zero (meaning no feedback was given about the result),
and the system uses a fixed magnitude for both positive and negative feedback.

\[
S(w) = S_0(w) + \sum_{d \in D} \sum_{k=1}^{K} \theta_{d,k} \cdot \phi_{k,w} \cdot f_d \tag{2}
\]
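
Formula 2 can be sketched as follows, assuming that the topic model output is available as plain Python structures: theta maps each document to its list of K topic probabilities, and phi is a list of K dictionaries mapping words to probabilities. The feedback magnitude of 1.0, the function name feedback_scores, and all numbers in the example are assumptions made for illustration; the source does not state the magnitude used by Toffee.

```python
def feedback_scores(s0, theta, phi, feedback):
    """Compute S(w) = S_0(w) + sum over d and k of theta[d][k] * phi[k][w] * f[d]."""
    scores = dict(s0)
    for doc_id, topic_probs in theta.items():
        f_d = feedback.get(doc_id, 0.0)  # no feedback counts as zero
        if f_d == 0.0:
            continue  # a zero term contributes nothing to the sum
        for k, theta_dk in enumerate(topic_probs):
            for word, phi_kw in phi[k].items():
                scores[word] = scores.get(word, 0.0) + theta_dk * phi_kw * f_d
    return scores


# Made-up numbers: two documents, two topics.
s0 = {"innovaatio": 1.0, "cleantech": 0.0, "energia": 0.0, "urheilu": 0.0}
theta = {"doc1": [0.8, 0.2], "doc2": [0.1, 0.9]}
phi = [{"cleantech": 0.5, "energia": 0.5}, {"urheilu": 1.0}]
feedback = {"doc1": 1.0, "doc2": -1.0}  # assumed magnitude of 1.0

scores = feedback_scores(s0, theta, phi, feedback)
print(scores)
```

In this example the word "urheilu" belongs to the topic that dominates the negatively marked document, so its score becomes negative, while the words of the topic dominating the positively marked document gain weight.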

   The words with the highest weights are then used for the next iteration of
the query, subject to a limit on the maximum number of query words. The user
can iteratively give feedback on the results, receive new results, and direct the
search toward the topics of interest.
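
The selection of the next query words can be sketched as below. The parameter max_words, the alphabetical tie-breaking, and the removed argument (modeling search words that the user has removed) are assumptions for illustration; the source does not state the actual limit or how ties are handled.

```python
def select_query_words(scores, max_words, removed=()):
    """Pick the highest-scoring words, skipping those the user has removed."""
    excluded = set(removed)
    candidates = [(word, score) for word, score in scores.items()
                  if word not in excluded]
    # Sort by descending score; break ties alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:max_words]]


# Illustrative scores only.
example_scores = {
    "innovaatio": 1.0,
    "cleantech": 0.35,
    "energia": 0.35,
    "urheilu": -0.7,
}
next_query = select_query_words(example_scores, max_words=3)
print(next_query)
```

The resulting word list would then be sent back to the query expansion step to start the next iteration.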


3   Related Work and Discussion
Various methods exist for relevance feedback search [1,6]. Teevan et al. [9] enrich
web search with relevance feedback based on a constructed user profile. Peltonen
et al. [5] combine visual intent modeling with exploratory relevance feedback
search. Tang et al. [8] have used topic modeling in academic literature search.
Song et al. [7] employed topic modeling with relevance search, based on implicit
feedback from the topics of the user web search history. However, the idea of
combining topic modeling of the query results with relevance search, as described
in this paper, is to the best of our knowledge new.
    Toffee is still in an early stage of development. No formal evaluation of its
usability from the user’s viewpoint has been made, and there are also challenges
in measuring precision and recall in an application like this. However, based on
our first tests, the idea of providing the user with suggestions for refining the
next search seems promising, and if the suggestions seem inappropriate, the user
is not forced to use them. The system contains many parameters to tune, such as
the number of topics, the number of topic modeling iterations, feedback strength,
and query expansion details, all of which affect system performance, and the full
potential of the approach has not yet been reached. In the future, the system
will be evaluated against the original research problem.
    Acknowledgements. This research was partially funded by Business Finland
and The Media Industry Research Foundation of Finland.


References
 1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval (2nd Ed.).
    Addison-Wesley Longman Publishing Co., Inc. (2011)
 2. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (Apr 2012),
    http://doi.acm.org/10.1145/2133806.2133826
 3. Mäkelä, E.: Combining a REST lexical analysis web service with SPARQL for
    mashup semantic annotation from text. In: Proceedings of the ESWC 2014
    demonstration track, Springer-Verlag (May 2014)
 4. Mäkelä, E.: LAS: an integrated language analysis tool for multiple languages. The
    Journal of Open Source Software 1(6) (Oct 2016)
 5. Peltonen, J., Strahl, J., Floréen, P.: Negative relevance feedback for exploratory
    search with visual interactive intent modeling. In: Proceedings of the 22nd
    International Conference on Intelligent User Interfaces. pp. 149–159. ACM (2017)
 6. Salton, G., Buckley, C.: Improving retrieval performance by relevance feedback.
    Journal of the American Society for Information Science 41(4), 288 (1990)
 7. Song, W., Zhang, Y., Liu, T., Li, S.: Bridging topic modeling and personalized
    search. In: Proceedings of the 23rd International Conference on Computational
    Linguistics: Posters. pp. 1167–1175. Association for Computational Linguistics (2010)
 8. Tang, J., Jin, R., Zhang, J.: A topic modeling approach and its integration into
    the random walk framework for academic search. In: Data Mining, 2008. ICDM’08.
    Eighth IEEE International Conference on. pp. 1055–1060. IEEE (2008)
 9. Teevan, J., Dumais, S.T., Horvitz, E.: Personalizing search via automated analysis
    of interests and activities. In: Proc. of the 28th Annual International ACM SIGIR
    Conference. pp. 449–456. SIGIR ’05, ACM (2005)
10. Viljanen, K., Tuominen, J., Hyvönen, E.: Ontology libraries for production use:
    The Finnish ontology library service ONKI. In: European Semantic Web Conference.
    pp. 781–795. Springer (2009)
11. Voorhees, E.M.: Query expansion using lexical-semantic relations. In: Proceedings
    of the 17th annual international ACM SIGIR conference on Research and
    development in information retrieval. pp. 61–69. Springer-Verlag New York, Inc. (1994)