<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semantic Perspective on Query Log Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Katja Hofmann</string-name>
          <email>k.hofmann@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edgar Meij</string-name>
          <email>edgar.meij@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
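        <contrib contrib-type="author">
          <string-name>Maarten de Rijke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bouke Huurnink</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISLA, University of Amsterdam</institution>
        </aff>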
      </contrib-group>
      <abstract>
        <p>We present our views on the CLEF log file analysis task. We argue for a task definition that focuses on the semantic enrichment of query logs. In addition, we discuss how additional information about the context in which queries are being made could further our understanding of users' information seeking and how to better facilitate this process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Query log analysis</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Query logs provide an excellent opportunity for gaining insight into how a search engine is used and
what the users’ interests are since they form a complete record of what users searched for in a given time
frame. Particularly appealing is that they are collected unobtrusively, without interrupting users’ normal
interactions with the system. Depending on the specifics of how the data is collected, the logs may contain
additional information, such as identification of users (e.g., via login name, IP-address, or cookie), their
location (by IP-address), or the results that were clicked in response to each query (in this case the names
“click logs” or “click data” are more common).</p>
      <p>
        The information contained in query logs has been used in many different ways, for example to provide
context during search, to classify queries, to infer search intent, to facilitate personalization, to uncover
different aspects of a topic, etc. In various studies, researchers and search engine operators have used
information from query logs to learn about the search process and to improve search engines—from early
studies of the logs created by users of library catalog systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to later studies of the logs of special text
genres [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Web search engines [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], or the user’s intent [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ]. More recent studies have investigated
query logs for online search engines for biomedical publications [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and multimedia search [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Besides
learning about search engines or their users, query logs are also being used to infer semantic concepts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
or relations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Naturally, query log analysis comes with limitations. For example, we cannot identify
the person behind the computer or determine demographic information, and the reason for the search (i.e., the
underlying information need) is not recorded [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Query logs are also increasingly considered as a valuable resource for informing certain aspects of
information retrieval. They provide a specific view on retrieval, for example pinpointing particular types of
information that users typically search for, or helping identify bottlenecks with current technology. In this
way they can inform decisions on what aspects of retrieval technology to focus on.</p>
      <p>Within the Cross Language Evaluation Forum (CLEF), LogCLEF is a new task that targets the
opportunities of query log analysis. The goal of this task is the “analysis and classification of queries in order
to improve search systems.”<sup>1</sup> Below, we briefly repeat the key features of the LogCLEF tasks and then
present our view on what a suitable log analysis task at future editions of LogCLEF should look like.</p>
    </sec>
    <sec id="sec-3">
      <title>LogCLEF 2009</title>
      <p>In its first year, LogCLEF consisted of two subtasks. LADS was a general task, focusing on analyzing log
files. LAGI was more concrete: participants were asked to classify queries as “geographical” or
“non-geographical” and, in a second step, to identify the geographical component (via a unique identifier, e.g., a
Wikipedia page). The query logs that were made available were provided in part by TEL (The European
Library) and in part by Tumba!, a Portuguese web search engine.</p>
      <p>The reason for starting LogCLEF was to inform future developments of the other CLEF tasks.
Questions that the task should address include: (i) What are people looking for? (ii) What should be our focus in
developing new search technology? (iii) Do our methods for evaluating search reflect something that is of
actual value to searchers? In this way, looking at “the user” through query log analysis can be a “real life
sanity check” on whether other CLEF tasks actually make sense from a user perspective. This matters because an
important goal of CLEF is to develop methods and tools that enable people to access multilingual information
more effectively.</p>
      <p>Both LogCLEF subtasks were run for the first time this year. In the remainder of this paper, we offer
our perspective on the tasks and on possible future developments for log analysis at CLEF, without looking at
the specific outcomes of the tasks.</p>
    </sec>
    <sec id="sec-4">
      <title>Understanding Queries</title>
      <p>In order to work towards the stated goal of improving search systems we believe that we need some
understanding of the current search process, including users’ interactions with the systems. The better we
understand this process, the better we can address search systems’ current limitations.</p>
      <p>Ideally, we would like to have a fully explicated user model which contains the user’s personal
context, task context, intent, etc. Since such a complete model is infeasible to construct, we should look for
surrogates or approximations, and query logs are one such approximation.</p>
      <p>The question, then, is how much query log analysis can contribute to a better understanding of the
search process. An advantage is that query logs capture a large amount of activity of many users. This
allows us to statistically analyze the collected data, for example using data mining, to identify patterns that
would not become apparent when studying small sets of users.</p>
      <p>A limitation is that query logs only provide a very narrow view of users’ interactions with the search
system. Any activity can be interpreted in many different ways. For example, someone issuing the query
“girl with the pearl earring” may want to see a photo of a particular painting, find the name of the painter,
or read stories about how it was created.</p>
      <p>As many interpretations are possible, we need to be careful about what kinds of conclusions we can actually
draw from an analysis of query logs. Ultimately, only the person who does an actual search knows what
they are looking for (and sometimes even they have difficulties articulating their need) or can identify the
item in the search results that answered their question.</p>
      <p><sup>1</sup> http://www.uni-hildesheim.de/logclef/</p>
    </sec>
    <sec id="sec-5">
      <title>Towards Semantically Enriched Query Logs</title>
      <p>A complete understanding of the search process is not possible. We should work towards the best possible
use of the resources that we currently have available and aim for robust, scalable and repeatable types of
analysis.</p>
      <p>Our perspective is that LogCLEF should focus on semantically enriching query logs. This means
creating annotations that specify for example the language in which a query was posed, any named entities
contained in the query, and relations between the different parts of the query.</p>
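      <p>As a rough illustration of what such annotations could look like, the following minimal sketch models one enriched log entry. All class names, field names, and the identifier string are hypothetical choices made for this example; they are not part of any LogCLEF specification.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityAnnotation:
    """A named entity mentioned in the query (field names are illustrative)."""
    surface: str                      # the query substring that mentions the entity
    start: int                        # character offset where the mention starts
    end: int                          # character offset just past the mention
    identifier: Optional[str] = None  # e.g., a thesaurus term or Wikipedia page


@dataclass
class RelationAnnotation:
    """A labeled relation between two entities, given as indices into `entities`."""
    head: int
    tail: int
    label: str


@dataclass
class EnrichedQuery:
    """One logged query plus the annotations proposed in the text."""
    text: str
    language: Optional[str] = None    # None when the language cannot be determined
    entities: List[EntityAnnotation] = field(default_factory=list)
    relations: List[RelationAnnotation] = field(default_factory=list)


def annotate_entity(query: EnrichedQuery, surface: str,
                    identifier: Optional[str] = None) -> bool:
    """Attach an entity annotation if `surface` occurs verbatim in the query."""
    start = query.text.find(surface)
    if start < 0:
        return False
    query.entities.append(
        EntityAnnotation(surface, start, start + len(surface), identifier)
    )
    return True


# The identifier below is made up for illustration only.
example = EnrichedQuery(text="girl with the pearl earring")
annotate_entity(example, "girl with the pearl earring",
                identifier="hypothetical:painting-001")
```
]]></preformat>
      <p>Leaving the language field empty by default reflects the observation that the language of a query that consists only of a named entity may be impossible to determine.</p>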
      <p>As an example, we can observe that the most frequent queries contained in the TEL log file are named
entities (cf. Table 1). This is interesting in the context of the multilingual perspective that CLEF addresses.
On the one hand, named entities often have the same name in many languages. This means that little translation
may be necessary, but also that it is hard to detect the language of most queries. On the other hand, variations
exist across languages, and in cases such as book titles, translation using statistical machine translation
methods may be infeasible.</p>
      <p>What we are proposing goes beyond, and is different from, the recognition of geo-information as it is
being examined within the LAGI subtask at LogCLEF. In our view, from a user perspective, it may not
matter whether a painting is located at a museum in the Netherlands, Germany or Poland (although this is
potentially interesting metadata), and is therefore entered into a catalogue in the corresponding language. The
painting should be found in catalogues of any language. This requires that such entities can be identified across
languages and that they can be linked (or “resolved” or “normalized”) to a unique identifier, such as a
thesaurus term or a Wikipedia page. This unique identifier may be decorated with additional information
(other occurrences, type or category information, relations to other resolved entities, etc.) that can be
aggregated to provide us with insights about trends, types of queries, intent, etc., which in turn should
inform search algorithm optimization and interface design.</p>
      <p>What we are proposing, then, is to automatically enrich queries with semantic information by providing
links from queries (in context, ideally with session information) to one or more sources of background
information that are appropriate for the domain from which the log files originate—thesauri, directories,
Wikipedia, Linked Open Data, etc. In the case of the TEL log file, one can think of GTAA and Wikipedia
as suitable target sources.</p>
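      <p>To make the idea of linking concrete, the sketch below resolves parts of a query to identifiers by greedy longest-match lookup in a small lexicon. This is a deliberately simple dictionary baseline chosen for illustration, not a method proposed in this paper; the lexicon entries, the identifier scheme, and the function name are all assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

# Hypothetical background source: lower-cased labels mapped to made-up identifiers.
LEXICON: Dict[str, str] = {
    "girl with the pearl earring": "example:painting/girl-with-the-pearl-earring",
    "pearl": "example:concept/pearl",
    "vermeer": "example:person/johannes-vermeer",
}


def link_query(query: str, lexicon: Dict[str, str],
               max_len: int = 6) -> List[Tuple[str, str]]:
    """Greedy longest-match linking of token n-grams to lexicon identifiers."""
    tokens = query.lower().split()
    links: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                links.append((candidate, lexicon[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1  # no known label starts at this token; move on
    return links


links = link_query("Girl with the pearl earring Vermeer", LEXICON)
```
]]></preformat>
      <p>Preferring the longest match keeps the painting title from being split into its component words. A lookup of this kind cannot resolve ambiguous labels or unseen spelling variants, which is where session context and richer background sources would come in.</p>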
      <p>
        How can this semantic query enrichment task be approached in a broad-coverage and robust manner?
Until recently, approaches to automatic categorization of queries from a search log were mostly based
on pre-defining a list of topically categorized terms, which are then matched against queries from the
log; the construction of the list was done manually [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or semi-automatically [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. While this approach
achieves high accuracy, it tends to achieve very low coverage, e.g., 8% of unique queries for the
semi-automatic method, and 13% for the manual one. Mishne and de Rijke [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] take a different approach
to query categorization, substantially increasing the coverage while sustaining high accuracy levels: their
approach relies on external “categorizers” with access to large amounts of data, namely two category-based web
search services, Yahoo! Directory and Froogle. Meij et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use a high-performing, very broad-coverage
feature-based approach to linking queries to DBpedia in conjunction with search-based and
concept-specific features and apply their method to the transaction logs of Beeld en Geluid, the national Dutch
radio and television archive; the authors also provide guidelines and a test set for this linking task, showing
that ground truth can be established in a reliable manner with relatively little effort. Huurnink et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
show how the resulting information can be used to gain insights into users’ search behavior by aggregating
the information being linked to.
      </p>
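      <p>The coverage figures quoted above refer to the share of unique queries that receive a category. The sketch below shows, under simplifying assumptions, how a term-list categorizer, its coverage, and an aggregate category count could be computed. The term list, the category names, and the sample queries are invented for illustration and do not reproduce any of the cited systems.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical, hand-made term list; real lists are far larger.
TERM_CATEGORIES: Dict[str, str] = {
    "mozart": "music",
    "rembrandt": "art",
    "amsterdam": "places",
}


def categorize(query: str, term_categories: Dict[str, str]) -> Optional[str]:
    """Return the category of the first query token found in the term list."""
    for token in query.lower().split():
        if token in term_categories:
            return term_categories[token]
    return None


def coverage(queries: Iterable[str], term_categories: Dict[str, str]) -> float:
    """Fraction of unique (case-folded) queries that receive any category."""
    unique = {q.strip().lower() for q in queries if q.strip()}
    if not unique:
        return 0.0
    covered = sum(1 for q in unique if categorize(q, term_categories) is not None)
    return covered / len(unique)


def category_counts(queries: Iterable[str],
                    term_categories: Dict[str, str]) -> Counter:
    """Count categories over all query occurrences, skipping uncategorized ones."""
    labels = (categorize(q, term_categories) for q in queries)
    return Counter(label for label in labels if label is not None)


sample_log = ["Mozart requiem", "mozart requiem", "harry potter",
              "rembrandt", "map of amsterdam"]
sample_coverage = coverage(sample_log, TERM_CATEGORIES)
sample_counts = category_counts(sample_log, TERM_CATEGORIES)
```
]]></preformat>
      <p>Coverage is computed over unique queries while the counts are computed over all occurrences, so frequent queries weigh more heavily in the aggregate view than in the coverage figure.</p>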
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We welcome the arrival of a log analysis task at CLEF. Identifying basic statistical patterns in the log
files made available for the task is a valuable first step. A shared task, however, needs to go
beyond this. The recognition of geographic components (as implemented in the LAGI subtask) is a good
example, but it does not seem completely appropriate for the TEL log files, where enrichment of queries
with a broader range of semantic information seems more suitable. Instead, we propose a semantic query
enrichment task that aims to link queries to suitable semantic sources. Recent advances in semantic query
analysis suggest that this has become a feasible task, without having reached the status of a solved problem,
which makes it suitable for collaborative benchmarking in the CLEF setting.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This research was supported by the Netherlands Organization for Scientific Research (NWO) under project
numbers 017.001.190, 640.001.501, 640.002.501, 612.066.512, 612.061.814, 612.061.815, 640.004.802,
and by the Dutch-Flemish research programme STEVIN under project DuOMAn (STE-09-12).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Tiberi</surname>
          </string-name>
          .
          <article-title>Extracting semantic relations from query logs</article-title>
          .
          <source>In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM Press.
          ISBN 9781595936097. doi: 10.1145/1281192.1281204. URL http://dx.doi.org/10.1145/1281192.1281204.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Liliana</given-names>
            <surname>Calderón-Benavides</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cristina</given-names>
            <surname>González-Caro</surname>
          </string-name>
          .
          <article-title>The intention behind web queries</article-title>
          .
          <source>In String Processing and Information Retrieval</source>
          , pages
          <fpage>98</fpage>
          -
          <lpage>109</lpage>
          ,
          <year>2006</year>
          . doi: 10.1007/11880561_9.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
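          <string-name>
            <given-names>Steven M.</given-names>
            <surname>Beitzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eric C.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abdur</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Grossman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ophir</given-names>
            <surname>Frieder</surname>
          </string-name>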
          .
          <article-title>Hourly analysis of a very large topically categorized web query log</article-title>
          .
          <source>In Proceedings SIGIR '04</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>328</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrei</given-names>
            <surname>Broder</surname>
          </string-name>
          .
          <article-title>A taxonomy of web search</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2002</year>
          . ISSN 0163-5840. doi: 10.1145/792550.792552. URL http://dx.doi.org/10.1145/792550.792552.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jorge R.</given-names>
            <surname>Herskovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Len Y.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Elmer V.</given-names>
            <surname>Bernstam</surname>
          </string-name>
          .
          <article-title>A day in the life of pubmed: analysis of a typical day's query log</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          ,
          <volume>14</volume>
          (
          <issue>2</issue>
          ):
          <fpage>212</fpage>
          -
          <lpage>220</lpage>
          ,
          <year>2007</year>
          . ISSN 1067-5027. doi: 10.1197/jamia.M2191. URL http://dx.doi.org/10.1197/jamia.M2191.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Bouke</given-names>
            <surname>Huurnink</surname>
          </string-name>
          , Laura Hollink, Wietske van den Heuvel, and Maarten de Rijke.
          <article-title>Information needs of broadcast professionals at an audiovisual archive: A transaction log analysis</article-title>
          .
          <source>Submitted</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Bernard J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          .
          <article-title>Search log analysis: What it is, what's been done, how to do it</article-title>
          .
          <source>Library &amp; Information Science Research</source>
          ,
          <volume>28</volume>
          (
          <issue>3</issue>
          ):
          <fpage>407</fpage>
          -
          <lpage>432</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Bernard J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Udo</given-names>
            <surname>Pooch</surname>
          </string-name>
          .
          <article-title>A review of web searching studies and a framework for future research</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .,
          <volume>52</volume>
          (
          <issue>3</issue>
          ):
          <fpage>235</fpage>
          -
          <lpage>246</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM Press.
          ISBN 158113567X. doi: 10.1145/775047.775067. URL http://dx.doi.org/10.1145/775047.775067.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Edgar</given-names>
            <surname>Meij</surname>
          </string-name>
          , Marc Bron, Bouke Huurnink, Laura Hollink, and Maarten de Rijke.
          <article-title>Learning semantic query suggestions</article-title>
          .
          <source>In 8th International Semantic Web Conference (ISWC 2009)</source>
          . Springer, October
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Gilad</given-names>
            <surname>Mishne</surname>
          </string-name>
          and Maarten de Rijke.
          <article-title>A study of blog search</article-title>
          . In M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, and A. Yavlinsky, editors,
          <source>Advances in Information Retrieval: Proceedings 28th European Conference on IR Research (ECIR 2006)</source>
          , volume
          <volume>3936</volume>
          of LNCS
          , pages
          <fpage>289</fpage>
          -
          <lpage>301</lpage>
          . Springer, April
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Thomas A.</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <article-title>The history and development of transaction log analysis</article-title>
          .
          <source>Library Hi Tech</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
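          <string-name>
            <given-names>Hsiao Tieh</given-names>
            <surname>Pu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shui Lung</given-names>
            <surname>Chuang</surname>
          </string-name>
          .
          <article-title>Auto-categorization of search terms toward understanding web users' information needs</article-title>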
          .
          <source>In ICADL 2000: Intern. Conference on Asian Digital Libraries</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Dian</given-names>
            <surname>Tjondronegoro</surname>
          </string-name>
          , Amanda Spink, and
          <string-name>
            <given-names>Bernard J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          .
          <article-title>A study and comparison of multimedia web searching: 1997-2006</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol.</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>