Historical Clicks for Product Search: GESIS at CLEF LL4IR 2015

Philipp Schaer (1) and Narges Tavakolpoursaleh (1,2) (*)

(1) GESIS – Leibniz Institute for the Social Sciences, 50669 Cologne, Germany, firstname.lastname@gesis.org
(2) University of Bonn, Computer Science / EIS, 53117 Bonn, Germany
(*) Authors are listed in alphabetical order.

Abstract. The Living Labs for Information Retrieval (LL4IR) lab was held for the first time at CLEF, and GESIS participated in this pilot evaluation. We took part in the product search task and describe our system, which is based on the Solr search engine and includes a re-ranking based on historical click data. This brief workshop note also includes some preliminary results, a discussion, and some lessons learned.

1 Introduction

In 2015 the Living Labs for Information Retrieval initiative (LL4IR) organized a lab at the CLEF conference series for the first time [5]. This lab can be seen as a pilot evaluation lab or, as stated by the organizers, "a first round". GESIS took part in this pilot to gain first-hand experience with the lab's API [1] and the rather new evaluation methodology. The main focus was not on winning the implicit competition that every evaluation campaign entails, but on learning more about the procedures and the systems used. Since this lab had no direct predecessor, we could not learn from previous results and best practices. The systems we used in other CLEF labs, namely the CHiC lab on cultural heritage [4], came from a totally different domain and could therefore not be directly applied to the use cases of LL4IR. So, the main objective of this pilot participation was to establish the retrieval environment and to overcome the obvious initial issues. After the initial three tasks were cut down to only two, we took part in the remaining task on product search using the REGIO JÁTÉK e-commerce site. In the following we present our approach and preliminary results from its assessment.

2 Product Search with REGIO JÁTÉK

One of the two remaining of the formerly three different tasks was product search on the e-commerce site REGIO JÁTÉK. As previously noted in the Living Labs Challenge Report [2], this specific task introduces a range of different challenges, issues, and possibilities. The report mentions issues such as the "generally little amount of textual material associated with products (only name, description, and name of categories)". On the other hand, additional information included in the available metadata was listed, among it: (1) historical click information for queries, (2) collection statistics, (3) a product taxonomy, (4) product photos, (5) the date/time when a product first became available, (6) historical click information for products, and (7) sales margins. In our approach we decided to re-use the historical click data for products and a keyword-based relevance score derived from a Solr index of the available product metadata.

2.1 Ranking Approach

We used a Solr search server (version 5.0.0) to index all available metadata provided by the API for every document related to the given queries. For each document (or, more precisely, for each product) we additionally stored the corresponding query number. This way we were able to retrieve all available candidate documents and rank them according to the Solr score based on the query string. Additionally, we added the historical click rates as a weighting factor to the final ranking whenever they were available at query time.
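To illustrate this indexing step, the following minimal sketch assumes the Python client pysolr and a document list that has already been fetched from the LL4IR API; the Solr URL, the field names, and the helper function are illustrative placeholders rather than our exact implementation:

    import pysolr

    # Connect to the Solr core holding the REGIO product metadata
    # (URL and core name are placeholders).
    solr = pysolr.Solr("http://localhost:8983/solr/regio", timeout=10)

    def index_doclist(qid, doclist):
        """Index all candidate products of one query and store the query id
        with every product, so that the candidates can later be retrieved
        and ranked for that query."""
        solr.add([
            {
                "docid": doc["docid"],
                "qid": qid,  # query number stored with every product
                "title": doc.get("title", ""),
                "description": doc.get("description", ""),
            }
            for doc in doclist
        ])

    # Example usage (doclist as returned by the LL4IR API, values illustrative):
    # index_doclist("R-q1", [{"docid": "R-d1", "title": "Barbie baba", "description": "..."}])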
Some query-related documents contained no term that matched the query string, and therefore we were not able to retrieve every query-related document with a mere query-string-based search. To fix this issue we had to add the query number to the query itself as a boolean query part, using Solr's query syntax and the default QParserPlugin (https://wiki.apache.org/solr/SolrQuerySyntax). This syntax allows the use of boolean operators and of different boosting factors, which were both used in the query formulation:

    qid:query[id]^0.0001 OR (qid:query[id]^0.0001 AND query[str])

Using this query string we got a Solr-ranked list of documents for each query, which was then re-ranked using the historical click rates as outlined in Algorithm 1. Essentially, this is a linear combination of a boosted search on the document id (field name docid) and the vector-space-based relevance score of the query string. This is a typical "rich get richer" approach in which formerly successful products are more likely to be ranked high in the result list once again. The approach was inspired by a presentation by Andrzej Bialecki [3].

Algorithm 1: Re-ranking algorithm merging the Solr ranking score and the historical click rates.

    Data:   runs of the production system corresponding to the queries and products of the REGIO JÁTÉK site
    Result: runs of our experimental system ranked according to the documents' fields and click-through rates

    for query in queries do
        run = get_doclist(query)
        ctr = get_ctr(query)
        for doc in run do
            doc_detail = get_docDetail(doc)
            BuildSolrIndex(doc_detail, qid)
        end
        myQuery = docid_1^ctr_1 OR docid_2^ctr_2 OR ... OR docid_n^ctr_n
                  OR (qid^0.0001 AND query[str])
        myRun = solr.search(myQuery)
        update_runs(key, myRun, feedbacks)
    end

2.2 Solr Configuration

As stated above, we used a Solr installation. To keep the system simple, we used the original Solr configuration and imported the REGIO dump using the originally provided schema.xml and the configuration from Table 1. We did not include any language-specific configuration for stemmers or stop word lists, since the Hungarian Solr stemmer returned the same results as the generic stemmer. We used the following standard components for text_general fields:

– StandardTokenizerFactory: A general-purpose tokenizer, which divides a string into tokens of various types.
– StopFilterFactory: Words from the stop word lists included with Solr are discarded.
– LowerCaseFilterFactory: All letters are indexed and queried as lowercase.

The detailed indexing configuration of the given fields is listed in Table 1. We used a very limited set of fields for the first round and basically only searched in the title and description fields. We changed this in the second round, where we included all the available metadata in the search.

3 Results

3.1 Official Run

The results of the campaign were documented by giving numbers on (1) the impressions per query, (2) wins, losses, and ties calculated against the production system, and (3) the calculated outcome (#wins / (#wins + #losses)). As noted in the documentation, a win "is defined as the experimental system having more clicks on results assigned to it by Team Draft Interleaving than clicks on results assigned to the production system". This means that any value below 0.5 can be seen as a performance worse than the production system.
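As a small worked example, the outcome values reported later in Table 3 can be recomputed directly from this definition (ties are ignored); the snippet below is a plain Python illustration, not part of the evaluation platform:

    def outcome(wins, losses):
        # Outcome as defined by LL4IR: #wins / (#wins + #losses); ties do not count.
        return wins / float(wins + losses)

    print(outcome(40, 109))  # round #1: ~0.2685
    print(outcome(80, 97))   # round #2: ~0.4520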
Due to the problem of unavailable items in the shop, the expected outcome had to be corrected to 0.28, as unavailable items were not filtered out for participating systems (for more details see the workshop overview paper [5]).

Our system received 523 impressions in the two-week test period. This makes roughly 37.4 impressions per day and 1.6 impressions per hour. Although we don't have any comparable numbers, we consider these impression rates to be quite low. If we compare ourselves to the other teams, we received the lowest number of impressions, while for example the system UiS-Mira received 202 more impressions in the same time period (725 impressions, which is 38% more than we got). This is not quite in line with the principle of giving fair impression rates to the different teams. Another observation regarding the impressions is that different queries had very different impression rates (see Figure 1). While some of them got more than 50 impressions, others (actually five) were not shown to the users at all.

Table 1. Solr field configuration for the available product metadata. Since different configurations were used for round #1 and round #2, we also report the usage of the fields in the two evaluation rounds.

    field              type          multiValued  round #1  round #2
    title              text_general       X           X         X
    category           text_general                             X
    content            text_general                             X
    creation_time      string                                   X
    docid              text_general                   X         X
    main_category      text_general                             X
    brand              text_general                             X
    product_name       text_general                             X
    photos             string             X                     X
    short_description  text_general                             X
    description        text_general                   X         X
    category_id        string                                   X
    main_category_id   string                                   X
    bonus_price        float                                    X
    available          string                                   X
    age_min            string                                   X
    age_max            string                                   X
    characters         string             X                     X
    queries            text_general       X                     X
    gender             string             X                     X
    arrived            string                                   X
    qid                string                         X         X
    characters         string                                   X
    site_id            string                                   X

Our approach did not perform very well due to some obvious misconfigurations and open issues in our implementation. In fact, we provided the least effective ranking compared to the other three participants [5]. We achieved an outcome of 0.2685 with 40 wins vs. 109 losses and 374 ties. On the other hand, no other participant was able to beat the baseline, which had an outcome rate of 0.4691. The best performing system (UiS-Mira) received an outcome rate of 0.3413 and was better than the expected outcome of 0.28, but below the simple baseline provided by the organizers.

Table 2. Some basic statistics on the results.

                                     round #1  round #2
    test queries                         50        50
    impressions total                   523       816
    impressions per day                37.4      58.1
    queries with no impressions           5         2
    queries with outcome > 0.5            4        13
    queries with outcome > baseline      10        13

Table 3. Outcome, wins, losses, and ties from round #1 and round #2.

                 outcome  wins  losses  ties
    round #1      0.2685    40     109   374
    round #2      0.4520    80      97   639

3.2 Unofficial 2nd Round

We also took part in the 2nd evaluation round and adapted some parameters of the system. Due to a misconfiguration of the Solr system in round #1, we only searched the titles and descriptions of products. We fixed this bug, so that for round #2 we correctly indexed all available metadata fields. Another issue in round #1 was that not all 50 test topics were correctly calculated: we only used the historical click data for 1 test topic and 13 training topics. The other topics were ranked by the standard Solr ranking without any click-history boosting. We fixed this issue for round #2, where all 50 topics are now correctly calculated according to the described boosting approach.
After we corrected these two points, we observed a clear increase in the outcomes. The outcome increased to 0.4520, with 80 wins, 97 losses, and 639 ties. Although the performance increase might partly be due to the fixes introduced by the organizers regarding unavailable items, we could still see some positive effects: the performance of the other teams increased too, but while we were the weakest team in round #1, we now provided the second best system performance. We also outperformed the winning system from round #1. Nevertheless, neither we nor any other system was able to compete with the production system. Comparing the number of impressions, we see a clear increase in queries that are above the 0.5 threshold and above the baseline (13 queries each), and the impressions in total and per day also increased. The issue of unbalanced impression rates remains the same for round #2 (see Figure 1).

Fig. 1. Distribution of impressions per topic for the first and official CLEF round (blue) and the second unofficial test round (red).

4 Lessons Learned, Open Issues, and Future Work

The first prominent issue that arose when processing the data collection was the Hungarian content. Since we don't know Hungarian, we were not able to directly read or understand the content or queries and therefore used a language- and content-agnostic approach. Although the different fields were documented (http://doc.living-labs.net/en/latest/usecase-regio.html#usecase-regio), the content was hidden behind the language barrier, except for obvious brand names like Barbie or He-Man. It would have been really interesting and maybe useful to make use of further provided metadata, for example the classes of the two-level topical categorization system to which the products were assigned. As we don't know more about this categorization system, except for an ad-hoc translation (https://translate.google.com/translate?sl=auto&tl=en&u=http%3A%2F%2Fwww.regiojatek.hu%2Fkategoriak.html), we could only add the category names to the index and leave it at that.

A typical problem with real-world systems was also present in the available queries: real-world users tend to use rather short queries. Of the 100 available query strings only 15 had more than one word and only 2 had more than two words (R-q22: "bogyó és babóca" and R-q50: "my little pony"). The average query length was 1.17 words and 7.16 characters (a short sketch of this computation is given below).

Another factor that we did not think of, although it was clearly stated in the documentation and in the key concepts (http://doc.living-labs.net/en/latest/guide-participant.html#key), was the fact that no feedback data was available during the test phase. As this came to our mind way too late, we were only able to include historical click data for some queries. Therefore the validity of our results from round #1 is weak, as there are too few queries to really judge the influence of historical click data vs. live click data. We were not able to include new feedback data in our rankings after the official upload and the beginning of the test phase. All the uploaded rankings were "final" and only depended on historical clicks. While this is of course due to the experimental setup, it is not truly a "living" component in the living lab environment (metaphorically speaking). On top of that, not every document received clicks, and therefore some documents lack any hint of being relevant at all.
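The query statistics above can be computed with a few lines of Python; the list queries is assumed to hold the 100 query strings obtained from the API, and the sample values shown are only illustrative:

    # queries is assumed to hold the 100 query strings obtained from the API;
    # the short list below is only an illustrative sample.
    queries = ["barbie", "lego duplo", "bogyó és babóca", "my little pony"]

    multi_word = [q for q in queries if len(q.split()) > 1]  # more than one word
    very_long = [q for q in queries if len(q.split()) > 2]   # more than two words

    avg_words = sum(len(q.split()) for q in queries) / float(len(queries))
    avg_chars = sum(len(q) for q in queries) / float(len(queries))

    print(len(multi_word), len(very_long), round(avg_words, 2), round(avg_chars, 2))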
Last but not least, we had to struggle with speed issues of the LL4IR platform itself. As mentioned in the workshop report of 2014, there are known issues with "scaling up with the number of participants and sites talking to the API simultaneously" [2]. Although the organizers state that these bottlenecks have been identified and that they have started addressing them, it still takes considerable time to communicate with the API. To give a feeling for the lack of performance: extracting the final evaluation outcomes, 100 short JSON snippets no longer than the one in Listing 1.1, took roughly 45 minutes. The same is true for the generation of the list of available queries, the result lists, and other data sets. The development team should think about caching this kind of data.

5 Conclusion

To sum up, we succeeded in the primary objective of our participation, which was to learn about the LL4IR API and the evaluation methodology. We could clearly improve our results from round #1 to round #2 and learned a lot during the campaign. We fixed some obvious configuration issues in our Solr system and were therefore eagerly looking forward to the second phase of the evaluation, which started on 15 June 2015. As it turned out, these issues could be solved and the retrieval performance could be clearly improved. Although we cannot simply compare round #1 and round #2 due to the misconfiguration, we can see the positive effect of including historical click data to boost popular products.

Acknowledgements

Besides the previously mentioned (and maybe unfair) complaints about the system's and API's performance, we are really impressed by its functionality and stability. The online documentation and the support by the organizers were clear, direct, and always helpful. Thank you.

Listing 1.1. Sample output of the outcome documentation

    {
      "outcomes": [
        {
          "impressions": 36,
          "losses": 11,
          "outcome": "0.15384615384615385",
          "qid": "R-q60",
          "site_id": "R",
          "test_period": {
            "end": "Sat, 16 May 2015 00:00:00 -0000",
            "name": "CLEF LL4IR Round #1",
            "start": "Fri, 01 May 2015 00:00:00 -0000"
          },
          "ties": 23,
          "type": "test",
          "wins": 2
        }
      ]
    }

References

1. Balog, K., Kelly, L., Schuth, A.: Head first: Living labs for ad-hoc search evaluation. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14), pp. 1815–1818. ACM (2014), http://www.anneschuth.nl/wp-content/uploads/2014/08/cikm2014-lleval.pdf
2. Balog, K., Kelly, L., Schuth, A.: Living Labs for IR Challenge Workshop (2014), http://living-labs.net/wp-content/uploads/2014/05/LLC_report.pdf
3. Bialecki, A.: Implementing click-through relevance ranking in Solr and LucidWorks Enterprise (2011), http://de.slideshare.net/LucidImagination/bialecki-andrzej-clickthroughrelevancerankinginsolrlucidworksenterprise
4. Schaer, P., Hienert, D., Sawitzki, F., Wira-Alam, A., Lüke, T.: Dealing with sparse document and topic representations: Lab report for CHiC 2012. In: CLEF 2012 Labs and Workshop, Notebook Papers: CLEF/CHiC Workshop Notes (2012)
5. Schuth, A., Balog, K., Kelly, L.: Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab 2015. In: CLEF 2015 – 6th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer (September 2015)