<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Historical Clicks for Product Search: GESIS at CLEF LL4IR 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philipp Schaer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narges Tavakolpoursaleh</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>50669 Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bonn</institution>
          ,
          <addr-line>Computer Science / EIS, 53117 Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Living Labs for Information Retrieval (LL4IR) lab was held for the first time at CLEF, and GESIS participated in this pilot evaluation. We took part in the product search task and describe our system, which is based on the Solr search engine and includes a re-ranking based on historical click data. This brief workshop note also includes preliminary results, a discussion, and some lessons learned.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In 2015 the Living Labs for Information Retrieval initiative (LL4IR) organized a lab at the CLEF conference series for the first time [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This lab can be seen as a pilot evaluation lab or, as stated by the organizers, "a first round". GESIS took part in this pilot to get first-hand experience with the lab's API [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the rather new evaluation methodology. The main focus was not on winning the implicit competition that every evaluation campaign is, but on learning more about the procedures and the systems used. Since this lab had no direct predecessor we could not learn from previous results and best practices. Previous systems that we used in other CLEF labs, namely the CHiC lab on cultural heritage [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], were from a totally different domain and could therefore not be directly applied to the use cases of LL4IR. So, the main objective of this pilot participation was to establish the retrieval environment and to overcome the obvious issues in the first place. After the initial three tasks were cut down to only two, we took part in the remaining task on product search using the REGIO JATEK e-commerce site.
      </p>
      <p>In the following we present our approach and the preliminary results of its assessment.</p>
    </sec>
    <sec id="sec-2">
      <title>Product Search with REGIO JATEK</title>
      <p>
        One of the remaining two of the formerly three different tasks was the product search on the e-commerce site REGIO JATEK. As previously noted in the Living Labs Challenge Report [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], this specific task introduces a range of different challenges, issues, and possibilities. The report mentions issues like the "generally little amount of textual material associated with products (only name, description, and name of categories)". On the other hand, additional information included in the available metadata was listed, among them: (1) historical click information for queries, (2) collection statistics, (3) the product taxonomy, (4) product photos, (5) the date/time when a product first became available, (6) historical click information for products, and (7) sales margins.
      </p>
      <p>In our approach we decided to re-use the historical click data for products and
a keyword-based relevance score derived from a Solr indexation of the available
product metadata.
</p>
      <p>2.1 Ranking Approach</p>
      <p>
We used a Solr search server (version 5.0.0) to index all available metadata
provided by the API for every document related to the given queries. For each
document (or more precisely for each product) we additionally stored the
corresponding query number. This way we were able to retrieve all available candidate
documents and rank them according to the Solr score based on the query string.
Additionally, we added the historical click rates as a weighting factor into the
final ranking if they were available at query time.</p>
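      <p>The following minimal sketch illustrates this indexing step. It assumes the Python pysolr client and illustrative core and field names (docid, qid, title, description); the calls that obtain the candidate documents from the LL4IR API are only indicated, not the actual client code.</p>
      <p>
# Indexing sketch (assumptions: pysolr client, hypothetical core and field names).
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/regio", timeout=10)

def index_candidates(qid, doclist):
    """Index the candidate documents of one query together with its query id."""
    solr.add([
        {
            "docid": doc["docid"],
            "qid": qid,  # stored per document so all candidates can later be retrieved per query
            "title": doc.get("title", ""),
            "description": doc.get("description", ""),
        }
        for doc in doclist
    ], commit=True)
      </p>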
      <p>
        Some query-related documents contained no term that could be matched with
the query string, and therefore we were not able to retrieve every query-related
document with a mere query-string-based search. To fix this issue we had to add the query number
to the query itself as a boolean query part, using Solr's query
syntax and the default QParserPlugin (see https://wiki.apache.org/solr/SolrQuerySyntax). This syntax allows the use of boolean
operators and of different boosting factors, which were both used in the query
formulation:
qid:query[id]^0.0001 OR (qid:query[id]^0.0001 AND query[str])
Using this query string we got a Solr-ranked list of documents for each query,
which was then re-ranked using the historical click rates as outlined in
algorithm 1. Basically it is a linear combination of a boosted search on the document
id (field name docid) and the vector space-based relevance score of the query
string. This is a typical "the rich are getting richer" approach, where formerly
successful products are more likely to be once again ranked high in the result
list. The approach was inspired by a presentation by Andrzej Bialecki [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
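      <p>As a small illustration, the query string described above could be assembled as follows (a Python sketch; escaping of the query terms and the exact clause syntax are simplified assumptions):</p>
      <p>
# Round-1 query formulation: the tiny boost on the qid clause ensures that every
# candidate document of the query is matched, while the textual part dominates the score.
def build_query(qid, query_string):
    qid_clause = 'qid:"{}"^0.0001'.format(qid)
    return "{} OR ({} AND ({}))".format(qid_clause, qid_clause, query_string)

# build_query("R-q50", "my little pony")
# -> 'qid:"R-q50"^0.0001 OR (qid:"R-q50"^0.0001 AND (my little pony))'
      </p>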
      <p>2.2 Solr Configuration</p>
      <p>As stated above we used a Solr installation. To keep the system simple, we
used the original Solr configuration and imported the REGIO dump using the
originally provided schema.xml and the configuration from table 1. We did not
include any language-specific configurations for stemmers or stop word lists, since
the Hungarian Solr stemmer returned the same results as the generic stemmer.
We used the following standard components for the text general fields:
- StandardTokenizerFactory: a general-purpose tokenizer which divides a string into tokens of various types.
- StopFilterFactory: words from the stop word lists included with Solr are discarded.
- LowerCaseFilterFactory: all letters are indexed and queried as lowercase.</p>
      <p>Algorithm 1: Re-ranking algorithm merging the Solr ranking score and the historical click rates.
Data: runs of the production system, mapping the given queries to products of the REGIO JATEK site
Result: runs of our experimental system, ranked according to the documents' fields and click-through rates
for query in queries do
    run = get_doclist(query)
    ctr = get_ctr(query)
    for doc in run do
        doc_detail = get_docdetail(doc)
        build_solr_index(doc_detail, qid)
    end
    myQuery = docid_1^ctr_1 OR docid_2^ctr_2 OR ... OR docid_n^ctr_n OR (qid^0.0001 AND query_str)
    myRun = solr.search(myQuery)
    update_runs(key, myRun, feedbacks)
end</p>
      <p>The detailed indexation configuration of the given fields is listed in table 1.
We used a very limited set of fields for the first round and were basically only
searching in the title and description fields. We changed this in the second round,
where we included all the available metadata in the search.</p>
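      <p>To make the construction of the click-boosted query in algorithm 1 concrete, the following Python sketch builds it in the same style (assuming ctr is a dictionary mapping document ids to their historical click-through rates for the current query; clause syntax and escaping are again simplified):</p>
      <p>
# Click-boosted query from algorithm 1: previously clicked documents are boosted by
# their historical click-through rate; all candidates additionally match the small
# qid boost combined with the textual query.
def build_boosted_query(qid, query_string, ctr):
    clauses = ['docid:"{}"^{}'.format(docid, rate) for docid, rate in ctr.items()]
    clauses.append('(qid:"{}"^0.0001 AND ({}))'.format(qid, query_string))
    return " OR ".join(clauses)
      </p>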
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>3.1 Official Run</p>
      <p>
        The results of the campaign were documented by giving numbers on (1) the
impressions per query, (2) the wins, losses, and ties calculated against the production
system, and (3) the calculated outcome (#wins/(#wins+#losses)). As noted in
the documentation, a win "is defined as the experimental system having more
clicks on results assigned to it by Team Draft Interleaving than clicks on results
assigned to the production system". This means that any value below 0.5 can be
seen as a performance worse than the production system. Due to the problem of
unavailable items in the shop, the expected outcome had to be corrected to 0.28,
as unavailable items were not filtered out for participating systems (for more
details check the workshop overview paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
      </p>
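      <p>For illustration, the outcome measure can be computed directly from the reported win/loss counts; the numbers used below are our round #1 results reported later in this section.</p>
      <p>
# outcome = #wins / (#wins + #losses); ties are ignored by the measure.
def outcome(wins, losses):
    return wins / float(wins + losses)

print(outcome(40, 109))  # our round #1 outcome, approximately 0.2685
      </p>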
      <p>Table 1. Indexation configuration of the given fields (field name and Solr field type): title (text general), category (text general), content (text general), creation time (string), docid (text general), main category (text general), brand (text general), product name (text general), photos (string), short description (text general), description (text general), category id (string), main category id (string), bonus price (float), available (string), age min (string), age max (string), characters (string), queries (text general), gender (string), arrived (string), qid (string), characters (string), site id (string).</p>
      <p>Our system received 523 impressions in the two-week test period. This makes
roughly 37.4 impressions per day and 1.6 impressions per hour. Although we
do not have any comparable numbers, we interpret these impression rates to be
quite low. If we compare to the other teams, we received the lowest number of
impressions, while for example system UiS-Mira received 202 more impressions
in the same time period (725 impressions, which is 38% more impressions than
we got). This is not quite in line with the principle of giving fair impression rates
to the different teams. Another observation regarding the impressions is the fact
that different queries had very different impression rates (see figure 1). While
some of them got more than 50 impressions, others (actually 5) were not shown
to the users at all.</p>
      <p>
        Our approach did not perform very well due to some obvious misconfigurations
and open issues in our implementation. In fact we provided the least
efficient ranking compared to the other three participants [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We achieved
an outcome of 0.2685 by getting 40 wins vs. 109 losses and 374 ties. On the
other hand, no other participant was able to beat the baseline either, which reached an outcome
rate of 0.4691. The best performing system received an outcome rate of 0.3413
(system UiS-Mira) and was better than the expected outcome of 0.28,
but below the simple baseline provided by the organizers.
      </p>
      <p>3.2 Unofficial Second Round</p>
      <p>We also took part in the second evaluation round and adapted some parameters of
the system. Due to a misconfiguration of the Solr system in round #1 we
only searched the titles and descriptions of products. We fixed this bug so that
for round #2 we correctly indexed all available metadata fields. Another issue
in round #1 was that not all 50 test topics were correctly calculated. We only
used the historical click data for 1 test topic and 13 training topics; the other
topics were ranked by the standard Solr ranking without any click-history boosting.
We fixed this issue for round #2, where now all 50 topics are correctly calculated
according to the described boosting approach.</p>
      <p>After we corrected these two points we observed a clear increase in the
outcomes. The outcome increased to 0.4520 by getting 80 wins, 97 losses and 639
ties. Although the performance increase might be due to the fixes introduced by
the organizers regarding unavailable items, we could still see some positive
effects: The performance of the other teams increased too, but while we were the
weakest team in round #1, we were now able to provide the second best system
performance. We also outperformed the winning system from round #1.
Nevertheless, neither we nor any other system was able to compete with the production
system.</p>
      <p>Comparing the number of impressions, we see a clear increase in queries that
are above the 0.5 threshold and above the baseline (13 queries each), and the impressions
in total and per day also increased. The issue of unbalanced impression rates
stays the same for round #2 (see figure 1).</p>
      <p>Figure 1. Impressions per query for round #1 and round #2.</p>
    </sec>
    <sec id="sec-4">
      <title>Lessons Learned, Open Issues, and Future Work</title>
      <p>The first prominent issue that arose when processing the data collection was the
Hungarian content. Since we do not know Hungarian we were not able to directly
read or understand the content or queries and therefore used a language-
and content-agnostic approach. Although the different fields were documented
(http://doc.living-labs.net/en/latest/usecase-regio.html#usecase-regio),
the content was hidden behind the language barrier, except for obvious brand
names like Barbie or He-Man.</p>
      <p>It would have been really interesting, and maybe useful, to make use of further
provided metadata, for example the classes of the two-level deep topical
categorization system to which the products were assigned. As we do not know
more about this categorization system, except for an ad-hoc translation
(https://translate.google.com/translate?sl=auto&amp;tl=en&amp;u=http%3A%2F%2Fwww.regiojatek.hu%2Fkategoriak.html),
we could only add the category names to the index and leave it at that.</p>
      <p>A typical problem with real-world systems was also present in the available
queries: real-world users tend to use rather short queries. Of the 100 available
query strings only 15 had more than one word and only 2 had more than two words
(R-q22: "bogyo es baboca" and R-q50: "my little pony"). The average number of words
per query was 1.17 and the average string length was 7.16 characters.</p>
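      <p>A small sketch of how such query statistics can be derived, assuming queries is the list of the 100 raw query strings obtained from the API:</p>
      <p>
# Average number of words and average string length over the available queries.
def query_stats(queries):
    avg_words = sum(len(q.split()) for q in queries) / float(len(queries))
    avg_chars = sum(len(q) for q in queries) / float(len(queries))
    return avg_words, avg_chars
      </p>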
      <p>Another factor that we did not think of, although it was clearly stated in
the documentation and in the key concepts
(http://doc.living-labs.net/en/latest/guide-participant.html#key), was the fact that no feedback data
was available during the test phase. As this came to our mind way too late, we
were only able to include historical click data for some queries. Therefore the
validity of our results from round #1 is weak, as there are too few queries to really
judge the influence of the historical click data vs. live click data. We were not
able to include new feedback data in our rankings after the official upload and
the beginning of the test phase. All the uploaded rankings were "final" and only
depend on historical clicks. While this is of course due to the experimental setup,
it is not a truly "living" component in the living lab environment (metaphorically
speaking). On top of that, not every document received clicks and therefore some
documents are missing any hint of being relevant at all.</p>
      <p>
        Last but not least, we had to struggle with speed issues of the LL4IR platform
itself. As mentioned in the workshop report of 2014, there are known issues with
"scaling up with the number of participants and sites talking to the API
simultaneously" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Although the organizers state that these bottlenecks have been identified
and that they have started addressing them, it still takes some time to
communicate with the API. To get a feeling for the lack of performance:
the extraction of the final evaluation outcomes took roughly 45 minutes to
retrieve 100 short JSON snippets no longer than the one in listing 1.1. The same is true for
the generation of the list of available queries, result lists, and other data sets. The
development team should think about caching these kinds of data.
      </p>
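      <p>A simple client-side cache would already help on the participant side; the following sketch stores each outcome response on disk. The endpoint path, key handling, and function names are illustrative assumptions, not the actual API of the platform.</p>
      <p>
# Hypothetical caching wrapper around a slow HTTP call (assumptions: requests library,
# illustrative endpoint; replace with the real living-labs client calls).
import json
import os
import requests

def get_outcome_cached(base_url, api_key, qid, cache_dir="ll4ir_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "outcome_{}.json".format(qid))
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    resp = requests.get("{}/outcome/{}/{}".format(base_url, api_key, qid))
    resp.raise_for_status()
    data = resp.json()
    with open(path, "w") as f:
        json.dump(data, f)
    return data
      </p>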
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>To sum up, we succeeded in the primary objective of our participation, which
was to learn about the LL4IR API and the evaluation methodology. We could
clearly improve our results from round #1 to round #2 and learned a lot during
the campaign. We fixed some obvious configuration issues in our Solr system and
were therefore eagerly looking forward to the start of the second phase of the
evaluation, which started on 15 June 2015. As it turned out, these issues could be
solved and the performance of the retrieval could be clearly improved. Although
we are not able to simply compare rounds #1 and #2 due to the misconfiguration,
we can see the positive effects of including historical click data to boost
popular products.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Besides the previously mentioned (and maybe unfair) complaints about the system's
and API's performance, we are really impressed by its functionality and stability.
The online documentation and the support by the organizers were clear, direct
and always helpful. Thank you.</p>
      <p>Listing 1.1. Sample output of the outcome documentation
"outcomes": [
    {
        "impressions": 36,
        "losses": 11,
        "outcome": "0.15384615384615385",
        "qid": "R-q60",
        "site_id": "R",
        "test_period": {
            "end": "Sat, 16 May 2015 00:00:00 -0000",
            "name": "CLEF LL4IR Round #1",
            "start": "Fri, 01 May 2015 00:00:00 -0000"
        },
        "ties": 23,
        "type": "test",
        "wins": 2
    },
    {</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Head first: Living labs for ad-hoc search evaluation</article-title>
          .
          <source>In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          . pp.
          <year>1815</year>
          -
          <year>1818</year>
          . CIKM '14,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2014</year>
          ), http://www.anneschuth.nl/wp-content/uploads/2014/08/cikm2014-lleval.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Living labs for IR challenge workshop (</article-title>
          <year>2014</year>
          ), http://living-labs.net/wp-content/uploads/2014/05/LLC_report.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bialecki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Implementing click-through relevance ranking in Solr and LucidWorks Enterprise (</article-title>
          <year>2011</year>
          ), http://de.slideshare.net/LucidImagination/bialecki-andrzej-clickthroughrelevancerankinginsolrlucidworksenterprise
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Schaer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hienert</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sawitzki</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wira-Alam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Luke, T.:
          <article-title>Dealing with sparse document and topic representations: Lab report for chic 2012</article-title>
          .
          <source>In: CLEF 2012 Labs and Workshop</source>
          , Notebook Papers: CLEF/CHiC Workshop-Notes (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Schuth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab 2015</article-title>
          .
          <source>In: CLEF 2015 - 6th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS)</source>
          , Springer (
          <year>September 2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>