<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LogCLEF 2010: the CLEF 2010 Multilingual Logfile Analysis Track Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Mandl</string-name>
          <email>mandl@uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>dinunzio@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Maria Schulz</string-name>
          <email>schulzju@uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Science, University of Hildesheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Log data constitute a relevant aspect in the evaluation process of multilingual search services. Activity logs make it possible to study the usage of search engines and to better adapt them to the needs of their users. The study of multilingual log analysis was promoted by the Cross Language Evaluation Forum (CLEF). The LogCLEF track was conducted for the second time. As in 2009, large log files were obtained from information providers. One log covers 30 months of activities on the website of The European Library (TEL) and the second log shows user activities on the German EduServer. Seven groups explored the data using a variety of approaches. They analyzed the languages of queries, activities within sessions and the success of searches. The data for the track, the evaluation methodology and results are presented and discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Web Search Engines deal with the representation, storage, organization of, and access
to information items which are essentially Web pages. The characterization of the
user information need is not simple, and this problem can roughly be divided into
three aspects: how the user poses his request to the search engine, how the user
interacts with the search engine, and how the search engine organizes the results.</p>
      <p>
        Log data constitute a relevant aspect in the evaluation process of the quality of a
search engine and the quality of a multilingual search service; log data can be used to
study the usage of a search engine, and to better adapt it to the objectives the users
were expecting to reach [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More generally, log data can be used to study the usage of a specific
application and to better adapt it to the objectives the users were expecting to reach.
The analysis of transaction logs for studying automatic information access systems
has a long history that predates the World Wide Web as we know it today.
      </p>
      <p>The interest in multilingual log analysis was promoted by the Cross Language
Evaluation Forum (CLEF)<sup>1</sup> in the track LogCLEF<sup>2</sup>, which was conducted for the first
time in 2009 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and for the second time in 2010. LogCLEF is an evaluation initiative
for the analysis of queries and other logged activities as expressions of user behavior.
The main goal of LogCLEF is the analysis and classification of queries in order to
understand search behavior in multilingual contexts and ultimately to improve search
systems. Another important long-term aim is to stimulate research on user behavior in
multilingual environments and to promote standard evaluation collections of log data.
      </p>
      <p><sup>1</sup> http://www.clef-campaign.org/</p>
      <p><sup>2</sup> http://www.uni-hildesheim.de/logclef/</p>
      <p>LogCLEF differs from other evaluation tracks since its goal is not the production
of a gold standard for a specific task, but the creation of a forum for the creative exploration
of user behavior based on logs.</p>
      <p>The data sets used in 2010 were activity logs derived from The European
Library (TEL) Web site<sup>3</sup> and the German EduServer<sup>4</sup> (Deutscher Bildungsserver,
DBS), maintained by the DIPF, the Leibniz Institute for Educational Research and
Educational Information. The task definition, the data for the track, the evaluation
methodology and some results of submitted experiments are presented in this
overview paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Task Definition</title>
      <p>The main question behind the task definition comes from search service providers
who wonder how they can improve their services. Ultimately, researchers need to
better understand user behavior in order to reach that high level goal. Two objectives
of the analysis of the logs are proposed, one for each set of log files:</p>
      <p>TEL: Investigate the language of queries with respect to successful search sessions. A
search could be defined as successful when the user clicks on an item of the result list
and then performs one of the following actions, which are listed in the right-hand box
of the TEL interface:</p>
      <p>+ Services: availability at the library, link to other services, collection homepage</p>
      <p>+ Options: save in favorites, send by email</p>
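      <p>The success criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the action names in SUCCESS_ACTIONS and the example sessions are hypothetical stand-ins for the services and options listed above, not the action names used in the TEL logs, and participants were free to define success differently.</p>
      <preformat>
```python
# Hypothetical action names standing in for the TEL services and options
# listed above; the real log uses its own action vocabulary.
SUCCESS_ACTIONS = {
    "available_at_library",
    "link_to_service",
    "collection_homepage",
    "save_favorite",
    "send_email",
}

def is_successful(actions):
    """A session counts as successful if it contains any success action."""
    return any(action in SUCCESS_ACTIONS for action in actions)

def success_rate(sessions):
    """Fraction of sessions that contain at least one success action."""
    if not sessions:
        return 0.0
    successful = sum(1 for actions in sessions.values() if is_successful(actions))
    return successful / len(sessions)

# Invented example sessions: session id mapped to its sequence of actions.
example_sessions = {
    "s1": ["search", "view_full", "save_favorite"],
    "s2": ["search"],
    "s3": ["search", "view_brief"],
    "s4": ["search", "view_full", "send_email"],
}

print(success_rate(example_sessions))
```
      </preformat>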
      <sec id="sec-2-1">
        <title>Potential research issues for TEL:</title>
        <p>1. language identification for the queries</p>
        <p>2. initial language vs. country of the IP address</p>
        <p>3. subsequent languages used in the same search</p>
        <p>4. country of the library vs. language of the query vs. language of the interface</p>
        <p>DBS: The objective of the analysis of the DBS logs is the exploration of the
relation between query and viewed content. The analysis can explore formal issues of
the query and content as well as the distribution of words within both.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Potential research issues for DBS:</title>
        <p>1. Are query terms related to the content viewed and/or paths taken within the
system?</p>
        <p>2. Can query modifications be explained by the content viewed?</p>
        <p>3. Develop metrics to identify successful searches</p>
        <p><sup>3</sup> http://www.theeuropeanlibrary.org/</p>
        <p><sup>4</sup> http://www.eduserver.de/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Data Description</title>
      <p>The LogCLEF 2010 collection consists of two large log files from
information providers:</p>
      <p>• The European Library (TEL) logs: As in 2009, a large log of activities from
The European Library is provided. This service provides access to several
national libraries of Europe. Users and content come from many languages.</p>
      <p>• German EduServer (Deutscher Bildungsserver, DBS) logs: The "Deutscher
Bildungsserver" is a quality-controlled Internet directory for educational
resources. A raw server log representing three months of activities on the
portal is made available. The size of all files is 5 GB.</p>
      <p>The following table gives an overview of the log resources which have been made
available at CLEF over the last years.</p>
      <p>TEL is a free service that offers access to the resources of 48 national libraries of
Europe in 35 languages. It aims to provide a vast virtual collection of material from
all disciplines and offers interested visitors simple access to European cultural
heritage. Resources can be both digital (e.g. books, posters, maps, sound recordings,
videos) and bibliographical, and the quality and reliability of the documents are
guaranteed by the 48 collaborating national libraries of Europe.</p>
      <p>The data used for this task are search logs and Web server logs of The European
Library portal.</p>
      <sec id="sec-3-1">
        <title>3.1.1 TEL Action Logs</title>
        <p>Search logs are usually named “action logs” in the context of TEL activities. On the TEL
portal’s home page, a user can initiate a simple keyword search with a default
predefined collection list presenting catalogues from national libraries. From the same
page, a user may perform an advanced search with Boolean operators and/or limit
the search to specific fields like author, language, and ISBN. It is also possible to change
the searched collection by checking the theme categories below the search box. After
the search button is clicked, the result page appears, where results are classified by
collection and the results of the top collection in the list are presented with brief
descriptions. Subsequently, a user may choose to see the result lists of other collections or
move to the next page of records in the current collection’s results. While viewing a
result list page, a user may also click on a specific record to see detailed information
about it. Additional services may be available depending on the
record selected.</p>
        <p>
          All these types of actions and choices are logged and stored by TEL in a relational
table, where each record represents a user action [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The most significant columns of
the table are:
        </p>
        <p>• A numeric id for identifying registered users, or “guest” otherwise;</p>
        <p>• The user’s IP address;</p>
        <p>• An automatically generated alphanumeric string identifying sequential actions of
the same user (sessions);</p>
        <p>• Query contents;</p>
        <p>• Name of the action that a user performed;</p>
        <p>• The corresponding collection’s alphanumeric id;</p>
        <p>• Date and time of the action’s occurrence.</p>
        <p>Action logs distributed to the participants of the task cover the period from January
2007 until June 2008 and from January 2009 until December 2009. The log file
contains user activities and queries entered at the search site of TEL. Examples of
entries in the log file are shown in Table 3.
        </p>
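        <p>As a minimal sketch of how such a table might be processed, the following Python fragment groups action records into sessions and orders each session by time. The key names, action names, timestamp format and all values are assumptions made for illustration; they are not the actual TEL column names or data.</p>
        <preformat>
```python
from collections import defaultdict
from datetime import datetime

# Illustrative records only: key names, action names and values are
# assumptions for this sketch, not the actual TEL columns or data.
records = [
    {"user_id": "guest", "ip": "127.0", "session_id": "s1", "query": "mozart",
     "action": "view_full", "collection": "c1", "timestamp": "2007-01-05 10:02:10"},
    {"user_id": "guest", "ip": "127.0", "session_id": "s1", "query": "mozart",
     "action": "search", "collection": "c1", "timestamp": "2007-01-05 10:01:00"},
    {"user_id": "42", "ip": "127.1", "session_id": "s2", "query": "dante",
     "action": "search", "collection": "c2", "timestamp": "2007-01-05 11:00:00"},
]

def parse_time(row):
    """Parse the (assumed) timestamp format of a record."""
    return datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")

def group_by_session(rows):
    """Group action records by session identifier, ordered by time."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row)
    for actions in sessions.values():
        actions.sort(key=parse_time)
    return dict(sessions)

sessions = group_by_session(records)
for session_id, actions in sessions.items():
    print(session_id, [a["action"] for a in actions])
```
        </preformat>
        <p>Grouping by the session identifier rather than by IP address follows the table description above, in which the alphanumeric identifier marks sequential actions of the same user.</p>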
      </sec>
      <sec id="sec-3-2">
        <title>3.1.2 TEL Web Server Logs</title>
        <p>The Web server log files of TEL cover the same period as the first data set of action
logs, from January 2007 until June 2008. These log files are saved in 18 text files. Each
entry contains the following fields:</p>
        <p>date: year-month-day.</p>
        <p>time: hour:minute:second.</p>
        <p>HTTP method: for example GET, HEAD, POST, etc.</p>
        <p>URI stem: the path of the requested file.</p>
        <p>URI query: the string of the query in the URL, if any.</p>
        <p>IP address: the address of the client (obfuscated, e.g. 127.0).</p>
        <p>User agent: the user agent of the client.</p>
        <p>Cookie: the cookie sent to/by the client.</p>
        <p>Referrer: the URL of the resource which linked the client to TEL.</p>
        <p>The Cookie field is divided into subfields by semicolons (“;”). The subfields
include the following (others are ignored for this task):</p>
        <p>• cTargets: the identifiers of the collections selected by the user;</p>
        <p>• TELSESSID: the identifier of the session. It is the same identifier recorded in
the action logs under the name “sesid”. This makes it an important field for
cross-analyzing action logs and Web server logs. Figure 1 shows an example of how a
user session may be stored in the two different logs.</p>
        <p>The quality-controlled "Deutscher Bildungsserver" is a clearinghouse for educational
resources on the Web<sup>5</sup>. It also contains content provided by the DIPF as well as
descriptions and reviews of Web sites on education. The Internet resources (Web
sites) are described, checked for their quality, manually indexed and classified.</p>
        <p><sup>5</sup> http://www.bildungsserver.de/start_e.html</p>
        <p>The logs were collected between September and November 2009. The
logs are server logs in standard format in which the searches and the results viewed
can be observed. An excerpt is shown in Table 2. The logs have been anonymized by
partially obscuring the IP addresses of users.</p>
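        <p>For the TEL Web server logs, the Cookie field described in this section can be split into its subfields in order to recover the session identifier and link an entry to the action logs. The sketch below assumes that each subfield has the form name=value, which the description above does not state explicitly; the cookie string and the action rows are invented examples.</p>
        <preformat>
```python
def parse_cookie(cookie_field):
    """Split a Cookie field on semicolons into a dict of subfields.

    Assumes each subfield looks like name=value; parts without an
    equals sign are skipped.
    """
    subfields = {}
    for part in cookie_field.split(";"):
        part = part.strip()
        if "=" not in part:
            continue
        name, value = part.split("=", 1)
        subfields[name.strip()] = value.strip()
    return subfields

# Invented example values for illustration.
cookie = "cTargets=a0001,a0002; TELSESSID=abc123; other=1"
action_rows = [
    {"sesid": "abc123", "action": "search"},
    {"sesid": "zzz999", "action": "search"},
    {"sesid": "abc123", "action": "view_full"},
]

parsed = parse_cookie(cookie)
session_id = parsed.get("TELSESSID")

# Cross-analysis: action log rows that belong to the same session.
matching = [row for row in action_rows if row["sesid"] == session_id]
print(session_id, [row["action"] for row in matching])
```
        </preformat>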
        <p>The two upper levels of server names or IP addresses have been hashed. This
allows the reconstruction of sessions within the data. Note that accesses by search
engine bots are still contained in the logs. The logs make it possible to observe two types of user
queries:</p>
        <p>• queries in search engines (in the referrer when DBS files were found using a
search engine)</p>
        <p>• queries within the DBS (see query parameters in metasuche/qsuche)</p>
        <p>The logs also make it possible to observe the browsing behavior within the DBS server
structure. The following pages are of most interest:</p>
        <p>• the descriptions of the educational Web sites within DBS (mlesen)</p>
        <p>• thematic lists of educational Web sites (zeigen, anzeigen, fachlist, listen)</p>
        <p>• a newspaper documentation on articles about education (zeitdok)</p>
        <p>The logs give access to two types of content, which can be compared to the queries:</p>
        <p>• the descriptions of the educational Web sites within DBS</p>
        <p>• the content of the educational Web sites themselves (which might have
changed since the logs were collected) in those cases where the user
might have accessed them</p>
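        <p>The two query types in the DBS logs could be separated along the following lines. This is a hedged sketch: the parameter name "q" for search engine referrers and the example URLs are assumptions, and only the path names metasuche and qsuche are taken from the description above.</p>
        <preformat>
```python
from urllib.parse import urlsplit, parse_qs

def referrer_query(referrer, param="q"):
    """Return the search terms from a search engine referrer URL, or None.

    The parameter name "q" is an assumption; engines differ.
    """
    values = parse_qs(urlsplit(referrer).query).get(param)
    return values[0] if values else None

def is_internal_search(request_path):
    """True if the requested path points to a DBS search page."""
    return "metasuche" in request_path or "qsuche" in request_path

# Invented example values for illustration.
referrer = "http://www.example.com/search?q=lehrplan+mathematik"
print(referrer_query(referrer))
print(is_internal_search("/metasuche/qsuche.html"))
print(is_internal_search("/zeigen.html"))
```
        </preformat>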
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Participants and Results</title>
      <p>The following two sections present the participants of LogCLEF 2010 and
some results. For more detailed results, the reader is referred to the papers by the
participants, which describe the approaches and findings in more detail.</p>
      <sec id="sec-4-1">
        <title>4.1 Participants</title>
        <p>As shown in Table 4, a total of 7 groups submitted results for LogCLEF. Fewer than
half of the 15 registered groups managed to obtain results. The results of the
participating groups are reported in the following section and elaborated in the papers
of the participants. All groups analyzed the TEL logs and none worked with the DBS
logs. This might be due to the nature of a raw Web server log, which requires much
pre-processing. LogCLEF could not provide a pre-processed version due to a lack
of funding.</p>
      </sec>
      <sec id="sec-4-2">
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Participant</th></tr>
            </thead>
            <tbody>
              <tr><td>DAEDALUS</td></tr>
              <tr><td>SINAI</td></tr>
              <tr><td>TCD-DCU</td></tr>
              <tr><td>NII</td></tr>
              <tr><td>Info Foraging Lab</td></tr>
              <tr><td>Info Science</td></tr>
              <tr><td>CELI s.r.l</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-3">
        <title>4.2 Results</title>
        <p>A large variety of approaches was taken to analyze the TEL log files. This can be
considered a success of the open task definition, which encouraged creative
exploration of the data.</p>
        <p>
          Two groups contrasted user behavior at a quality search service like TEL to
common Web search behavior. A group from The Netherlands under the leadership of
the University of Nijmegen contrasted frequent queries and number of queries per
session in the TEL log with data from an MSN log [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Verberne et al. also created a
network of actions within TEL visualizing the frequency of actions as well as
transition probabilities. It can be observed that view actions are more frequent than
search actions and that the full view of a result is selected more often than the brief
view.
        </p>
        <p>
          The NII from Tokyo also compared the TEL logs to Web search logs and theories
developed by exploiting Web search [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Takaku et al. analyzed the two TEL logs
separately and observed few differences between the two time spans. They also
integrated the length of a session into their work. Generally, a high correlation
between the number of actions and the length can be seen, but there are many
exceptions which might be interesting for further exploration. Takaku et al. extracted
the ranks of the documents clicked by the users and compared them with results from Web
search experiments.
        </p>
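        <p>A correlation between the number of actions and the length of a session could be computed as sketched below. The sketch assumes that session length means duration in seconds and uses the Pearson coefficient; neither choice is confirmed by the overview, and the numbers are invented.</p>
        <preformat>
```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented per-session values: number of actions and duration in seconds.
action_counts = [1, 2, 3, 4, 5]
durations = [10, 20, 30, 40, 50]

r = pearson(action_counts, durations)
print(r)
```
        </preformat>
        <p>Sessions that deviate strongly from the overall trend, for example few actions spread over a long time, are the exceptions mentioned above.</p>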
        <p>The DAEDALUS group formally defined success for sessions and queries [<xref ref-type="bibr" rid="ref7">7</xref>]. They
calculated that only 6% of the queries and 10% of the sessions could be labeled as
successful.</p>
        <p>Three groups focused on language issues. The SINAI group showed that most of
the sessions are in English [<xref ref-type="bibr" rid="ref5">5</xref>]. They also found that 50,000 of the sessions exhibit
only one action. More than 80% of the sessions have 10 or fewer actions.</p>
        <p>
          The CELI research institute tried to identify the language of search queries [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
They manually labeled 100 queries and their system managed to correctly identify
over 70%. CELI concludes that the integration of named entity recognition is
necessary.
        </p>
        <p>
          The difficulties of language identification were elaborated by a group from Berlin
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. They manually checked 510 queries for their detailed analysis. The analysis showed that
over 50% of the queries consisted of only a named entity and an additional 8%
included named entities together with another term. Obviously, this complicates
language identification, and even in the manual analysis 38% of the queries could not
be classified as belonging to one language. Another 31% were English. Stiller et al. also
showed that the interface language selected and the origin of the IP address are only weak
indicators for the query language in their subset [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          A group from Dublin [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] also conducted research on the interface language and the
origin of the user. Leveling et al. related these factors to the collection selected by the
user and developed a scoring function which can rerank the result
documents in a way that improves the result quality for the user, based on the clicks
observed in the log file. Leveling et al. also analyzed the content of the queries
in order to develop query performance estimators. They implemented IDF and clarity
score [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
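        <p>To illustrate what an IDF-based query performance estimator can look like, the sketch below averages the inverse document frequency of the query terms over a small collection. It is a generic textbook formulation, not necessarily the one implemented by Leveling et al.; the function names, the toy collection and the convention of scoring unseen terms as 0.0 are assumptions.</p>
        <preformat>
```python
from math import log

def idf(term, documents):
    """Inverse document frequency: log(N / df).

    Terms that occur in no document get 0.0 here, which is a
    convention chosen for this sketch.
    """
    df = sum(1 for doc in documents if term in doc)
    return log(len(documents) / df) if df else 0.0

def average_idf(query, documents):
    """Mean IDF of the whitespace-separated query terms."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    return sum(idf(term, documents) for term in terms) / len(terms)

# Toy collection: each document is a set of index terms (invented data).
documents = [
    {"mozart", "symphony"},
    {"mozart", "opera"},
    {"bach", "fugue"},
    {"library"},
]

print(average_idf("mozart bach", documents))
```
        </preformat>
        <p>Under this formulation, queries made of rare terms receive a higher score than queries made of frequent terms.</p>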
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and Future Work</title>
      <p>Studies on log files are limited by privacy issues. For the first time, LogCLEF
provided evaluation resources for log file analysis which can be used for comparative
system evaluation. The second year of LogCLEF attracted more attention from
participants. The aim is to encourage and facilitate the exchange of resources and
tools developed by the participants of LogCLEF.</p>
      <p>
        In the future, log analysis should be the basis for other evaluation tasks. Logs can
show how users behave and what they need. One example could be the selection of
topics for retrieval evaluation or for question answering systems [11].
      </p>
      <p>The organization of LogCLEF was mainly volunteer work. We want to thank The
European Library (TEL) and DIPF, the Leibniz Institute for Educational Research and
Educational Information, Frankfurt, Germany, for providing the log files.</p>
      <p>At the University of Padua, the work has been partially supported by the TELplus
Targeted Project for digital libraries, as part of the eContentplus Program of the
European Commission (Contract ECP-2006-DILI-510003).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Jansen</surname>, <given-names>B.</given-names></string-name>;
          <string-name><surname>Spink</surname>, <given-names>A.</given-names></string-name> &amp;
          <string-name><surname>Taksa</surname>, <given-names>I.</given-names></string-name> (eds.):
          <source>Handbook of Research on Web Log Analysis</source>
          . Idea Group Reference: Hershey et al.
          <year>2009</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Mandl</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Agosti</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Di Nunzio</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Yeh</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Mani</surname>, <given-names>I.</given-names></string-name>;
          <string-name><surname>Doran</surname>, <given-names>C.</given-names></string-name> &amp;
          <string-name><surname>Schulz</surname>, <given-names>J.</given-names></string-name>:
          <article-title>LogCLEF 2009: the CLEF 2009 Cross-Language Logfile Analysis Track Overview</article-title>
          .
          <source>In: Multilingual Information Access Evaluation I: Text Retrieval Experiments: Proc. 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece. Revised Selected Papers</source>
          . Berlin et al.: Springer [LNCS 6241]. Preprint in Working Notes: http://www.clef-campaign.org/2009/working_notes/LogCLEF-2009-Overview-Working-Notes-2009-09-14.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.M.</given-names></string-name>:
          <article-title>LogCLEF 2009/03/02 v 1.0 Description of the The European Library (TEL) Search Action Log Files</article-title>
          . http://www.uni-hildesheim.de/logclef/Daten/LogCLEF2009_file_description.pdf
          <year>2009</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Bosca</surname>, <given-names>A.</given-names></string-name> &amp;
          <string-name><surname>Dini</surname>, <given-names>L.</given-names></string-name>:
          <article-title>Language Identification Strategies for Cross Language Information Retrieval</article-title>
          . In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Perea-Ortega</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Montejo Ráez</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Garcia Cumbreras</surname>, <given-names>M.</given-names></string-name> &amp;
          <string-name><surname>Ureña-López</surname>, <given-names>L.A.</given-names></string-name>:
          <article-title>SINAI at LogCLEF 2010</article-title>
          . In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Leveling</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Ghorab</surname>, <given-names>M.R.</given-names></string-name>;
          <string-name><surname>Magdy</surname>, <given-names>W.</given-names></string-name>;
          <string-name><surname>Jones</surname>, <given-names>G.</given-names></string-name> &amp;
          <string-name><surname>Wade</surname>, <given-names>V.</given-names></string-name>:
          <article-title>DCU-TCD@LogCLEF 2010: Re-ranking Document Collections and Query Performance Estimation</article-title>
          . In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Lana-Serrano</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Villena-Román</surname>, <given-names>J.</given-names></string-name> &amp;
          <string-name><surname>González-Cristóbal</surname>, <given-names>J-C.</given-names></string-name>:
          <article-title>DAEDALUS at LogCLEF 2010: Analyzing the Success of Search Queries</article-title>
          . In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Verberne</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Hinne</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>van der Heijden</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Hoenkamp</surname>, <given-names>E.</given-names></string-name>;
          <string-name><surname>Kraaij</surname>, <given-names>W.</given-names></string-name> &amp;
          <string-name><surname>van der Weide</surname>, <given-names>T.</given-names></string-name>:
          <article-title>How does the Library Searcher behave?</article-title>
          In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Takaku</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Egusa</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Saito</surname>, <given-names>H.</given-names></string-name>;
          <string-name><surname>Kando</surname>, <given-names>N.</given-names></string-name>;
          <string-name><surname>Teraki</surname>, <given-names>H.</given-names></string-name>;
          <string-name><surname>Miwa</surname>, <given-names>M.</given-names></string-name>:
          <article-title>CRES at LogCLEF 2010: Towards Understanding the User Behaviors through an Analysis of Search Sessions, Search Units and Click Ranks</article-title>
          . In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Stiller</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Gaede</surname>, <given-names>M.</given-names></string-name> &amp;
          <string-name><surname>Petras</surname>, <given-names>V.</given-names></string-name>:
          <article-title>Ambiguity of Queries and the Challenges for Query Language Detection</article-title>
          . In this volume (
          <source>LogCLEF 2010 Working Notes</source>
          , http://clef2010.org/)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Sutcliffe</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Kruschwitz</surname>, <given-names>U.</given-names></string-name> &amp;
          <string-name><surname>Mandl</surname>, <given-names>T.</given-names></string-name>:
          <article-title>Web Logs and Question Answering</article-title>
          .
          <source>In: Proc. Web Logs and Question Answering (WLQA2010) Workshop at the Seventh International Conference on Language Resources and Evaluation (LREC)</source>
          , Malta, 22nd May, pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . http://www.csis.ul.ie/wlqa2010/proceedings.htm
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>