FacetX: Dynamic Facet Generation for Advanced Information Filtering of Search Results
Work-in-Progress Paper

Raffael Affolter
Institute of Applied Information Technology
Zurich University of Applied Sciences
Winterthur, Switzerland
affolraf@students.zhaw.ch

Andreas Weiler
Institute of Applied Information Technology
Zurich University of Applied Sciences
Winterthur, Switzerland
andreas.weiler@zhaw.ch


              Figure 1: Exemplary demonstration of the FacetX application during a job search for “data scientist”
ABSTRACT
Searching for information is an important part of our daily life. People search for information about jobs, recipes, entertainment, places and much more. Many information systems try to help users find the information most relevant to their information need by providing pre-built categories as a filter mechanism. However, in most systems this support is designed in a very static way and does not consider the dynamic content of the documents in their collections. In this paper we introduce FacetX, an application for the dynamic generation of filter facets for advanced information filtering of search results.

1    INTRODUCTION AND MOTIVATION
The number of searches for information on the internet is steadily increasing. For example, Google [9] receives over 63,000 searches per second on any given day, and people search for jobs in the network of Monster.com [7] about 8,000 times per minute. At the same time the number of results for the search queries increases as well. Most of the time the person issuing the query is overwhelmed by the large number of results and the accompanying information overflow. For example, a search on the platform of Monster.com for the job title “Data Scientist” returns more than 15,000 jobs. One solution for supporting information seekers in finding their requested information in very large search results is to provide them with navigational structures like product categories or price ranges in e-commerce platforms. With the support of so-called faceted search [10], information seekers are able to narrow down their search results to specific properties. However, the facets for filtering the search results are mostly static and do not adapt dynamically to the corresponding search results. For example, the filter facets for the search results on the platform of Monster.com for the job titles “Data Scientist” or “Gardener” are the same (e.g., city or job status) even though the content of the search results is completely different.

Several systems for creating dynamic facets to support users in their search process have been proposed in the past. For example, Kim et al. [3] use semantic web technologies to create facets from the ontologies of the data. After an initial search the system presents the resources and the determined categories to the user. The user then selects a category and one of its values. The system then updates the result collection and presents the newly determined categories and the corresponding resources. This process is repeated until the user finds the desired item. Similar to our approach, the system guides the user through a graph that connects the search results. Another example is proposed by Tvarozek and Bielikova [11], whose approach helps users overcome information overload by generating personalized dynamic facets in multimedia collections. They present how a typical facet browser can be extended to support the generation of dynamic facets. Using a provided domain ontology and an analysis of user behavior, they try to generate the most relevant facets for the corresponding users. However, in contrast to our work, both systems are built on pre-defined ontologies, which are not necessary for FacetX.

In this paper we introduce FacetX, an application for the dynamic generation of filter facets for advanced information filtering of search results. FacetX can be applied to any domain in which the results of a search are returned as a collection of documents. In contrast to previous work, our application generates facets without the need for a predefined domain ontology and is therefore adaptable and usable in any context. In the following, we present the methodology behind our current work-in-progress implementation and several case studies, like searching for a job, recipe, or movie. These case studies show the effect of FacetX on search processes that millions of users undertake daily.

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Figure 2: Process flow of the FacetX application.

2   METHODOLOGY
In this section, we describe the methodology behind the dynamic generation of the facets for the retrieved search results. Our approach follows a process flow with several successive phases, which can be seen in Figure 2. FacetX is implemented with the libraries of the Weka [4] toolkit and uses the StringToWordVector filter and the Rainbow stop word eliminator. The clustering is implemented with the Weka hierarchical clusterer. To extract the topics, we used the parallel topic model of the Machine Learning for Language Toolkit (Mallet [6]). The input of the first phase is the set of results of a user-defined search query which is executed by the information seeker. Therefore, we have a set of N documents per query as the basis for the generation of the dynamic facets. To extract the features for the next step, the content of the search result is tokenized, all stop words are removed from the documents, and then term frequency-inverse document frequency (tf-idf [8]) is applied. In the next phase, we cluster the documents by applying agglomerative hierarchical clustering with complete linkage as the linkage criterion between the individual clusters. The result of this phase is a dendrogram, which acts as a search tree for the creation of the facets for each branch. Since every branch in the dendrogram represents a cluster, we can take all documents of a cluster and extract topics from them with Latent Dirichlet Allocation (LDA [1]). The resulting number of clusters also decides how deep the search tree is. If the number of clusters is equal to the number of documents, each leaf would contain only one document, with the main topics in the parent node. In our application we decided to take the most important word for each topic as a facet suggestion.

3   CASE STUDIES
In the following, we describe three case studies in which FacetX is able to support users in the filtering step of their everyday search processes. The first one uses FacetX to create facets for movies from the internet movie database (IMDB) [5]. In the second case study we create facets for a recipe search on the food platform epicurious [2], and in the third one we show the application of FacetX to a job search on Monster.com [7].

3.1    Movie Search
The internet movie database lists information about millions of movies and TV series on its platform. If a user is interested in movies of a specific genre like action or mystery, the search is very simple and the results are presented to the user as a sorted list (e.g., by user rating or year). The sorting and filter properties are always the same for all genres and other search results on the platform. However, we claim that users could definitely benefit from dynamically generated facets, which are for example based on the story line of the movies. For example, it would be possible to group together more similar movies like Star Wars episode 6 and 7 by extracting facets based on their synopses. For this case study we extracted the 150 top ranked sci-fi movies sorted by user ratings from IMDB. The topic extraction was applied to the contents of the synopses of the movies. We created 25 clusters and used the best term of each of 6 topics as facets. As a result, we obtained some interesting topics (cf. Figure 4). For example, the topic around “astronauts” and “constructions” refers to the movies “Moon”, “The Martian” and “2001: A Space Odyssey”, as seen in Figure 4a. Another interesting example is the topic consisting of “facehuggers”, “salvage”, and “romulans”, which refers to the movies “Alien”, “Aliens”, and “Star Trek”, as seen in Figure 4b.

3.2    Recipe Search
For this case study, we chose to search for recipes on epicurious by using the advanced search functionality of the platform. While epicurious offers a lot of facets to choose from, like technique or ingredient, a search can still result in a very high number of recipes. For example, the search for the technique barbecue returns a total of 1687 results and the search for the ingredient soup/stew returns a total of 1955 results. Although it is possible to reduce the number of results further by selecting additional facets, a high number of recipes mostly still remains in the results.

To show the effectiveness of FacetX for the recipe search on epicurious, we searched for recipes containing chicken as an ingredient and reduced the initial results with the additional facet “healthy” to a final number of 286 recipes. We then used the ingredients for clustering and topic extraction. With 50 clusters and 6 topics per node the results seem very interesting. The resulting topics (cf. Figure 5a) included some useful information about a dish like “boneless”, “skinless” or “skin-on”. Other topics, like “oil”, “tablespoon”, and “cup”, were not as helpful since those are terms which are present in almost every ingredient list. An interesting finding can be seen under the topic “parsley”, “carrots”, and “celery”, where the recipe for “Jambalaya” appears, which is probably unknown to most users. Note that with the current filter settings of epicurious it would not have been possible to select the ingredients “parsley” and “celery”. Another interesting finding can be seen in Figure 5b, where the ingredients “Bok Choy”, “Yams” and “Hoisin” appear in the recipe names; these are probably also unknown to a wide range of users. To improve the generated facets of FacetX it might help to include more information, like how the dish is prepared, how many calories it contains, or the reviews of users about the recipe.

3.3    Job Search
In this case study, we apply FacetX to the domain of job search on the job advertisement platform Monster.com. Figure 3 shows
an example, which is the result of applying FacetX to the search results for the query “data scientist”. From the resulting documents, the job title and the description were extracted for further processing. In this example a cluster size of eight was chosen and only the first four topics per node were extracted. For the modeling of the topics the job descriptions of the results were used. The titles of the job offerings are put at the leaves to give some idea of how many documents would appear after those facets were chosen. After a search for job offerings on any online platform, users are always confronted with the same filter facets, even if they search for very different working areas or specific fields. These filter facets mostly contain the city or region of the workplace, the salary range, the type (e.g., beginner, experienced, manager) of the job offering, or the percentage of the workload. However, to be able to really understand what the job offering contains, the user needs to browse through every job description individually.

For example, if we search with the search term “gardener” without any further restrictions on Monster.com, we would have to browse through a total of about 2,500 job descriptions, and for the search term “data scientist” it would be even more, with a total of about 17,000. Since Monster.com just offers filter facets like company, city, job status, or date posted, we have no possibility to further narrow down the results. But to be able to find the very best match between a user and a job offering, it would be preferable for the job seeker, as well as for the company, to be able to use dynamic filter facets derived from the content of the job offering to drill down in the result set. With FacetX we would be able to dynamically create the filter facets based on the content of the job offerings. Furthermore, as seen in Figure 3, the user would be able to refine the results further by choosing the facets of the sub clusters. In this case study we searched for job offerings as a data scientist on Monster.com and extracted 52 results. We used FacetX to create facets for those results with different numbers of clusters and topics. With this a user can traverse down the hierarchical tree and reduce the search results by selecting the node with the topics which seem the most interesting. To get the best insight into a job offering, the topics at the leaves seemed to be the most meaningful. There were topics like “healthcare”, “federal” or “insurance”, which can help in deciding whether or not to pursue the search further for those job offerings. The more topics are extracted per node, the more information a user might gain about the content of the offerings, but also the more time has to be spent reading through the topics. Also, not all topics might be useful to the user, as for example the terms “job” or “position”.

In this case study we also wanted to evaluate how FacetX performs in a more complex setting. We therefore searched for both jobs (“gardener” or “data scientist”) on Monster.com. Note that with the filter functionality of Monster.com it would not be possible to separate the two very different job types from each other after the first search results are returned to the user. However, with the support of FacetX additional filter facets were generated which successfully separated the job offerings for “data scientist” and “gardener” from each other (cf. Figure 6). Only two job offerings for gardener could be found in the cluster with mainly data scientists, and in the cluster for gardener jobs three for data scientist were found. The extracted topics for the two branches included words like “deep”, “model” and “process” for the data scientist branch and “landscape”, “work” and “tree” for the other one.

Figure 3: Example result of FacetX applied to the search results for the query “data scientist” on Monster.com

Figure 4: Generated facets of one cluster for the IMDB case study, panels (a) and (b).

Figure 5: Sample facets of the recipe search case study for recipes with the ingredient chicken, panels (a) and (b).

Figure 6: The first two branches of the dendrogram for the search term “gardener” and “data scientist” of the job search case study.

4    CONCLUSIONS AND FUTURE WORK
In this work-in-progress paper we introduce FacetX, a tool for the dynamic generation of facets for advanced information filtering. With FacetX it is possible to create dynamic facets for the result of an initial search query and gain more information about the retrieved documents without specific domain knowledge. FacetX could be integrated with any search engine as an additional tool that supports the user's search and limits the information overflow. In those settings where it is not possible to create meaningful facets, the application could still be used to support the users in keeping an overview of the results. Future work includes the improvement of the facet generation by domain-specific stop word lists. Furthermore, other topic extraction techniques could be investigated for documents with less content. For example, the case study with the recipes demonstrated that the support of FacetX is limited by the small number of words in the recipes. Additionally, the number of topics for topic extraction could be chosen dynamically, as the number of documents per cluster shrinks with an increasing number of clusters. The clustering part also needs more research regarding how many clusters would be a good fit for a given number of documents and which linkage criterion is preferable.

REFERENCES
 [1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
 [2] Epicurious.com. 2019. Epicurious Recipes, Menu Ideas, Videos & Cooking Tips. Condé Nast Publications. Retrieved December 9, 2019 from https://www.epicurious.com/
 [3] Hak-Jin Kim, Yongjun Zhu, Wooju Kim, and Taimao Sun. 2014. Dynamic faceted navigation in decision making using Semantic Web technology. Decision Support Systems 61 (2014), 59–68.
 [4] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl. 11, 1 (2009), 10–18.
 [5] IMDB.com. 2019. Internet Movie Database, Ratings and Reviews for New Movies and TV Shows. IMDb.com, Inc. Retrieved December 9, 2019 from http://www.imdb.com/
 [6] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu/
 [7] Monster.com. 2019. Monster.com About. Monster Worldwide, Inc. Retrieved December 9, 2019 from https://www.monster.com/about/
 [8] Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.
 [9] Seotribunal.com. 2019. 63 Fascinating Google Search Statistics. Retrieved December 9, 2019 from https://seotribunal.com/blog/google-stats-and-facts/
[10] Daniel Tunkelang. 2009. Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services 1, 1 (2009), 1–80.
[11] Michal Tvarozek and Maria Bielikova. 2007. Personalized Faceted Navigation for Multimedia Collections. In Second International Workshop on Semantic Media Adaptation and Personalization (SMAP 2007). IEEE, Piscataway, New Jersey, US, 104–109.
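
As an illustration of the process flow described in Section 2 (tokenization and stop word removal, tf-idf weighting, complete-linkage agglomerative clustering, one facet term per cluster), the following minimal Python sketch may be helpful. It is not the FacetX implementation, which is built on Weka and Mallet. Several parts are assumptions made only for this example: the tiny stop word list, the toy documents, the use of cosine distance (the paper does not name a distance measure), and all function names. Most importantly, the topic extraction with LDA is replaced here by a much simpler stand-in that picks the term with the highest summed tf-idf weight in each cluster.

```python
import math
import re
from collections import Counter

# Toy stop word list (assumption; FacetX uses the Rainbow list shipped with Weka).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: tf * log(N / df)} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
            for tokens in tokenized]


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (1.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def complete_linkage_clusters(vectors, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest, until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(cosine_distance(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def facet_for(cluster, vectors):
    """Stand-in for LDA (assumption): the term with the highest summed tf-idf
    weight in the cluster; ties are broken alphabetically."""
    totals = {}
    for i in cluster:
        for term, weight in vectors[i].items():
            totals[term] = totals.get(term, 0.0) + weight
    return max(sorted(totals), key=lambda t: totals[t])


# Hypothetical mini "search result" with two data science and two gardening jobs.
docs = [
    "deep learning model for data pipelines",
    "statistical model and data analysis",
    "landscape work and tree pruning",
    "tree planting and garden landscape work",
]
vectors = tfidf_vectors(docs)
clusters = complete_linkage_clusters(vectors, k=2)
for cluster in clusters:
    print(sorted(cluster), "facet suggestion:", facet_for(cluster, vectors))
```

Stopping the merging at k clusters yields only one level of the search tree. Recording every merge instead would give the full dendrogram, and applying a topic model to the documents below each branch would correspond more closely to the facet generation described in Section 2.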