FacetX: Dynamic Facet Generation for Advanced Information Filtering of Search Results
Work-in-Progress Paper

Raffael Affolter
Institute of Applied Information Technology
Zurich University of Applied Sciences
Winterthur, Switzerland
affolraf@students.zhaw.ch

Andreas Weiler
Institute of Applied Information Technology
Zurich University of Applied Sciences
Winterthur, Switzerland
andreas.weiler@zhaw.ch

Figure 1: Exemplary demonstration of the FacetX application during a job search for “data scientist”

ABSTRACT
Searching for information is an important part of our daily life. People search for information about jobs, recipes, entertainment, places, and much more. Many information systems try to support users in finding the information most relevant to their information need by providing pre-built categories as a filter mechanism. However, in most systems this support is designed in a very static way and does not consider the dynamic content of the documents in their collections. In this paper we introduce FacetX, an application for the dynamic generation of filter facets for advanced information filtering of search results.

1 INTRODUCTION AND MOTIVATION
The number of searches for information on the internet is steadily increasing. For example, Google [9] receives over 63,000 searches per second on any given day, and people search for jobs in the network of Monster.com [7] about 8,000 times per minute. At the same time, the number of results per search query increases as well. Most of the time, the person issuing the query is overwhelmed by the large number of results and the accompanying information overflow. For example, a search on the platform of Monster.com for the job title “Data Scientist” returns more than 15,000 jobs. One solution for supporting information seekers in finding the requested information in very large search results is to provide them with navigational structures, such as product categories or price ranges on e-commerce platforms. With the support of so-called faceted search [10], information seekers are able to narrow down their search results to specific properties. However, the facets for filtering the search results are mostly static and do not adapt dynamically to the corresponding search results. For example, the filter facets for the search results on the platform of Monster.com for the job titles “Data Scientist” and “Gardener” are the same (e.g., city or job status), even though the content of the search results is completely different.

Several systems for creating dynamic facets to support users in their search process have been proposed in the past. For example, Kim Hak-Jin et al. [3] use semantic web technologies to create facets from the ontologies of the data. After an initial search, the system presents the resources and the determined categories to the user. The user then selects a category and one of its values. The system then updates the result collection and presents the newly determined categories and the corresponding resources. This process is repeated until the user finds the desired item. Similar to our approach, the system guides the user through a graph that connects the search results. Another example is proposed by Tvarozek et al. [11], which helps users overcome information overload by generating personalized dynamic facets in multimedia collections. They present how a typical facet browser can be extended to support the generation of dynamic facets. Using a provided domain ontology and an analysis of user behavior, they try to generate the most relevant facets for the corresponding users. However, in contrast to our work, both systems are built on pre-defined ontologies, which are not necessary for FacetX.

In this paper we introduce FacetX, an application for the dynamic generation of filter facets for advanced information filtering of search results. FacetX can be applied to any domain in which the results of a search are returned as a collection of documents. In contrast to previous work, our application generates facets without the need for a predefined domain ontology and is therefore adaptable and usable in any context. In the following, we present the methodology behind our current work-in-progress implementation and several case studies, such as searching for a job, recipe, or movie. These case studies show the effect of FacetX on search processes that are undertaken daily by millions of users.

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

2 METHODOLOGY
In this section, we describe the methodology behind the dynamic generation of the facets for the retrieved search results. Our approach follows a process flow with several successive phases, which can be seen in Figure 2. FacetX is implemented with the libraries of the Weka [4] toolkit and uses the StringToWordVector filter and the Rainbow stop word eliminator. The clustering is implemented with the Weka hierarchical clusterer. To extract the topics, we used the parallel topic model of the Machine Learning for Language Toolkit (Mallet [6]).

Figure 2: Process flow of the FacetX application.

The input of the first phase is the result of a user-defined search query executed by the information seeker. Therefore, we have a set of N documents per query as the basis for the generation of the dynamic facets. To extract the features for the next step, the content of the search result is tokenized, all stop words are removed from the documents, and then term frequency-inverse document frequency (tf-idf [8]) weighting is applied. In the next phase, we cluster the documents by applying agglomerative hierarchical clustering with complete linkage as the linkage criterion between the individual clusters. The result of this phase is a dendrogram, which acts as a search tree for the creation of the facets for each branch. Since every branch in the dendrogram represents a cluster, we can take all documents of a cluster and extract topics from them with Latent Dirichlet Allocation (LDA [1]). The resulting number of clusters also determines how deep the search tree is. If the number of clusters is equal to the number of documents, each leaf would contain only one document, with the main topics in the parent node. In our application, we decided to take the most important word of each topic as a facet suggestion.

3 CASE STUDIES
In the following, we describe three case studies in which FacetX is able to support users in the filtering step of their daily search processes. The first one uses FacetX to create facets for movies from the Internet Movie Database (IMDB) [5]. In the second case study, we create facets for a recipe search on the food platform epicurious [2], and in the third one we show the application of FacetX to a job search on Monster.com [7].

3.1 Movie Search
The Internet Movie Database lists information about millions of movies and TV series on its platform. If a user is interested in movies of a specific genre such as action or mystery, the search is very simple and the results are presented to the user as a sorted list (e.g., by user rating or year). The sorting and filter properties are always the same for all genres and other search results on the platform. However, we claim that users could definitely benefit from dynamically generated facets, which are, for example, based on the story line of the movies. For example, it would be possible to group similar movies, such as Star Wars episodes 6 and 7, by extracting facets based on their synopses. For this case study, we extracted the 150 top-ranked sci-fi movies, sorted by user rating, from IMDB. The topic extraction was applied to the contents of the movie synopses. We created 25 clusters and used the best term of each of 6 topics as facets. As a result, we received some interesting topics (cf. Figure 4). For example, the topic around “astronauts” and “constructions” refers to the movies “Moon”, “The Martian”, and “2001: A Space Odyssey”, as seen in Figure 4a. Another interesting example is the topic consisting of “facehuggers”, “salvage”, and “romulans”, which refers to the movies “Alien”, “Aliens”, and “Star Trek”, as seen in Figure 4b.

Figure 4: Generated facets of one cluster for the IMDB case study (panels a and b).

3.2 Recipe Search
For this case study, we chose to search for recipes on epicurious by using the advanced search functionality of the platform. While epicurious offers many facets to choose from, such as technique or ingredient, a search can still result in a very high number of recipes. For example, the search for the technique barbecue returns a total of 1687 results and the search for the ingredient soup/stew returns a total of 1955 results. Although it is possible to reduce the number of results further by selecting additional facets, a high number of recipes mostly still remains in the results.

To show the effectiveness of FacetX for the recipe search on epicurious, we searched for recipes containing chicken as an ingredient and reduced the initial results with the additional facet “healthy” to a final number of 286 recipes. We then used the ingredients for clustering and topic extraction. With 50 clusters and 6 topics per node, the results seem very interesting. The resulting topics (cf. Figure 5a) included some useful information about a dish, such as “boneless”, “skinless”, or “skin-on”. Other topics, such as “oil”, “tablespoon”, and “cup”, were not as helpful, since those terms are present in almost every ingredient list. An interesting finding can be seen under the topic “parsley”, “carrots”, and “celery”, where the recipe for “Jambalaya” appears, which is probably unknown to most users. Note that with the current filter settings of epicurious it would not have been possible to select the ingredients “parsley” and “celery”. Another interesting finding can be seen in Figure 5b, where the ingredients “Bok Choy”, “Yams”, and “Hoisin” appear in the recipe names, which are probably also unknown to a wide range of users. To improve the facets generated by FacetX, it might help to include more information, such as how the dish is prepared, how many calories it contains, or user reviews of the recipe.

Figure 5: Sample facets of the recipe search case study for recipes with the ingredient chicken (panels a and b).

3.3 Job Search
In this case study, we apply FacetX to the domain of job search on the job advertisement platform Monster.com. Figure 3 shows an example, which is the result of applying FacetX to the search results for the query “data scientist”. From the resulting documents, the job title and the description were extracted for further processing. In this example, a cluster size of eight was chosen and only the first four topics per node were extracted. For the modeling of the topics, the job descriptions of the results were used. The titles of the job offerings are placed at the leaves to give some idea of how many documents would appear after those facets were chosen.

Figure 3: Example result of FacetX applied to the search results for the query “data scientist” on Monster.com

After searching for job offerings with search terms on any online platform, users are always confronted with the same filter facets, even if they search for very different working areas or specific fields.
These filter facets mostly contain the city or region of the workplace, the salary range, the type of the job offering (e.g., beginner, experienced, manager), or the percentage of the workload. However, to really understand what a job offering contains, the user needs to browse through every job description individually.

For example, if we search with the search term “gardener” without any further restrictions on Monster.com, we would have to browse through a total of about 2500 job descriptions, and for the search term “data scientist” it would be even more, with a total of about 17000. Since Monster.com only offers filter facets such as company, city, job status, or date posted, we have no possibility to narrow down the results further. But to find the very best match between a user and a job offering, it would be preferable for the job seeker, as well as for the company, to be able to use dynamic filter facets derived from the content of the job offerings to drill down into the result set. With FacetX, we would be able to dynamically create the filter facets based on the content of the job offerings. Furthermore, as seen in Figure 3, the user would be able to refine the results further by choosing the facets of the sub-clusters. In this case study, we searched for job offerings as a data scientist on Monster.com and extracted 52 results. We used FacetX to create facets for those results with different numbers of clusters and topics. With this, a user can traverse down the hierarchical tree and reduce the search results by selecting the node whose topics seem the most interesting. To get the best insight into a job offering, the topics at the leaves seemed to be the most meaningful. There were topics like “healthcare”, “federal”, or “insurance”, which can help in deciding whether or not to pursue the search further for those job offerings. The more topics are extracted per node, the more information a user might gain about the content of the offerings, but the more time has to be spent reading through the topics. Also, not all topics might be useful to the user, for example the terms “job” or “position”.

In this case study, we also wanted to evaluate how FacetX performs in a more complex setting. We therefore searched for both jobs (“gardener” or “data scientist”) on Monster.com. Note that with the filter functionality of Monster.com it would not be possible to separate the two very different job types from each other after the first search results are returned to the user. However, with the support of FacetX, additional filter facets were generated which successfully separated the job offerings for “data scientist” and “gardener” from each other (cf. Figure 6). Only two job offerings for gardener could be found in the cluster with mainly data scientists, and three data scientist offerings were found in the cluster for gardener jobs. The extracted topics for the two branches included words like “deep”, “model”, and “process” for the data scientist branch and “landscape”, “work”, and “tree” for the other one.

Figure 6: The first two branches of the dendrogram for the search terms “gardener” and “data scientist” of the job search case study.

4 CONCLUSIONS AND FUTURE WORK
In this work-in-progress paper, we introduce FacetX, a tool for the dynamic generation of facets for advanced information filtering. With FacetX, it is possible to create dynamic facets for the result of an initial search query and gain more information about the retrieved documents without specific domain knowledge. FacetX could be integrated with any search engine as an additional tool that supports the user's search and limits the information overflow. In settings where it is not possible to create meaningful facets, the application could still be used to help users keep an overview of the results. Future work includes the improvement of the facet generation by domain-specific stop word lists. Furthermore, other topic extraction techniques could be investigated for documents with less content. For example, the case study with the recipes demonstrated that the support of FacetX is limited by the small number of words in the recipes. Additionally, the number of topics for topic extraction could be made dynamic, as the number of documents per cluster shrinks with increasing cluster numbers. The clustering part also needs more research on how many clusters would be a good fit for a given number of documents and which linkage criterion is preferable.

REFERENCES
[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
[2] Epicurious.com. 2019. Epicurious Recipes, Menu Ideas, Videos & Cooking Tips. Condé Nast Publications. Retrieved December 9, 2019 from https://www.epicurious.com/
[3] Kim Hak-Jin, Zhu Yongjun, Kim Wooju, and Sun Taimao. 2014. Dynamic faceted navigation in decision making using Semantic Web technology. Decision Support Systems 61 (2014), 59–68.
[4] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl. 11, 1 (2009), 10–18.
[5] IMDB.com. 2019. Internet Movie Database, Ratings and Reviews for New Movies and TV Shows. IMDb.com, Inc. Retrieved December 9, 2019 from http://www.imdb.com/
[6] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu/
[7] Monster.com. 2019. Monster.com About. Monster Worldwide, Inc. Retrieved December 9, 2019 from https://www.monster.com/about/
[8] Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.
[9] Seotribunal.com. 2019. 63 Fascinating Google Search Statistics. Retrieved December 9, 2019 from https://seotribunal.com/blog/google-stats-and-facts/
[10] Daniel Tunkelang. 2009. Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services 1, 1 (2009), 1–80.
[11] Michal Tvarozek and Maria Bielikova. 2007. Personalized Faceted Navigation for Multimedia Collections. In Second International Workshop on Semantic Media Adaptation and Personalization (SMAP 2007). IEEE, Piscataway, New Jersey, US, 104–109.
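As a supplementary illustration of the process flow described in Section 2 (tokenization, stop word removal, tf-idf weighting, agglomerative clustering with complete linkage, and one facet term per cluster), the following is a minimal sketch in Python. It is not the FacetX implementation, which relies on Weka and Mallet. Several points are assumptions made only for illustration: the function names, the tiny stop word list, the four toy documents, the use of cosine distance between tf-idf vectors, and above all the facet step, where the term with the highest summed tf-idf weight in a cluster stands in for the LDA topic extraction used in the paper.

```python
import math
import re
from collections import Counter

# Tiny placeholder stop word list (assumption, not the Rainbow list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Toy documents invented for this sketch; they are not data from the paper.
DOCS = [
    "train deep model and evaluate model",
    "deploy model to process data",
    "plant tree and prune tree in landscape",
    "landscape work and tree care",
]


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (tf * ln(N / df))."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
            for tokens in tokenized]


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (1.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def complete_linkage_clusters(vectors, k):
    """Merge clusters bottom-up until k remain. Under complete linkage, the
    distance between two clusters is the largest pairwise member distance."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(cosine_distance(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def facet_for(cluster, vectors):
    """Stand-in for LDA: the term with the highest summed tf-idf weight."""
    totals = Counter()
    for i in cluster:
        totals.update(vectors[i])
    # Sorting first makes ties resolve alphabetically.
    return max(sorted(totals), key=totals.get)


def generate_facets(docs, k):
    """Return a list of (facet term, sorted document indices) per cluster."""
    vectors = tfidf_vectors(docs)
    clusters = complete_linkage_clusters(vectors, k)
    return [(facet_for(c, vectors), sorted(c)) for c in clusters]


if __name__ == "__main__":
    for facet, members in generate_facets(DOCS, 2):
        print(facet, members)
```

On the toy input, the sketch separates the two modeling-related documents from the two gardening-related ones and labels the groups with the terms "model" and "tree". Stopping the merging at k clusters corresponds to cutting the dendrogram at one level; a fuller version would keep the whole merge history so that facets could be generated for every branch of the search tree.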