LogCLEF 2010: the CLEF 2010 Multilingual Logfile Analysis Track Overview

Thomas Mandl1, Giorgio Maria Di Nunzio2, Julia Maria Schulz1

1 Information Science, University of Hildesheim, Germany
{mandl,schulzju}@uni-hildesheim.de
2 Department of Information Engineering, University of Padua, Italy
dinunzio@dei.unipd.it

Abstract. Log data constitute a relevant aspect of the evaluation of multilingual search services. Activity logs make it possible to study how search engines are used and to adapt them better to the needs of their users. The study of multilingual log analysis was promoted by the Cross Language Evaluation Forum (CLEF). In 2010, the LogCLEF track was conducted for the second time. As in 2009, large log files were obtained from information providers. One log covers 30 months of activities on the website of The European Library (TEL), and the second log shows user activities on the German EduServer. Seven groups explored the data using a variety of approaches. They analyzed the languages of queries, activities within sessions and the success of searches. The data for the track, the evaluation methodology and the results are presented and discussed.

1 Introduction

Web search engines deal with the representation, storage, organization of, and access to information items, which are essentially Web pages. Characterizing the user's information need is not simple, and the problem can roughly be divided into three aspects: how users pose their requests to the search engine, how users interact with the search engine, and how the search engine organizes the results. Log data constitute a relevant aspect in evaluating the quality of a search engine and of a multilingual search service; log data can be used to study the usage of a search engine and to adapt it better to the objectives the users were expecting to reach [1].
More generally, log data can be used to study the usage of a specific application and to adapt it better to the objectives the users were expecting to reach. The analysis of transaction logs for studying automatic information access systems has a long history that predates the World Wide Web as we know it today.

The interest in multilingual log analysis was promoted by the Cross Language Evaluation Forum (CLEF, http://www.clef-campaign.org/) in the track LogCLEF (http://www.uni-hildesheim.de/logclef/), which was conducted for the first time in 2009 [2] and for the second time in 2010. LogCLEF is an evaluation initiative for the analysis of queries and other logged activities as expressions of user behavior. The main goal of LogCLEF is the analysis and classification of queries in order to understand search behavior in multilingual contexts and ultimately to improve search systems. Another important long-term aim is to stimulate research on user behavior in multilingual environments and to promote standard evaluation collections of log data. LogCLEF differs from other evaluation tracks in that its goal is not to produce a gold standard for a specific task but to create a forum for the creative exploration of user behavior based on logs.

The data sets used in 2010 were activity logs derived from The European Library (TEL) Web site (http://www.theeuropeanlibrary.org/) and the German EduServer (http://www.eduserver.de/), the Deutscher Bildungsserver (DBS), which is maintained by the DIPF, the Leibniz Institute for Educational Research and Educational Information.

The task definition, the data for the track, the evaluation methodology and some results of the submitted experiments are presented in this overview paper.

2 Task Definition

The main question behind the task definition comes from search service providers who wonder how they can improve their services. Ultimately, researchers need to understand user behavior better in order to reach that high-level goal.

Two objectives for the analysis of the logs are proposed, one for each set of log files:

TEL: Investigate the language of queries with respect to successful search sessions. A successful search could be defined as one in which the user performs one of the following actions, which are listed in the right-hand box of the TEL interface when an item in the result list is clicked:
+ Services: Availability at the library, Link to other services, Collection homepage
+ Options: Save in favorites, Send by email

Potential research issues for TEL:
1. language identification for the queries
2. initial language vs. country of the IP address
3. subsequent languages used in the same search
4. country of the library vs. language of the query vs. language of the interface

DBS: The objective of the analysis of the DBS logs is the exploration of the relation between query and viewed content. The analysis can explore formal issues of the query and content as well as the distribution of words within both.

Potential research issues for DBS:
1. Are query terms related to the content viewed and/or the paths taken within the system?
2. Can query modifications be explained by the content viewed?
3. Develop metrics to identify successful searches

3 Data Description

The data collection for LogCLEF 2010 consists of two large log files from information providers:
• The European Library (TEL) logs: As in 2009, a large log of activities from The European Library is provided. This service provides access to several national libraries of Europe. Users and content come from many languages.
• German EduServer (Deutscher Bildungsserver, DBS) logs: The "Deutscher Bildungsserver" is a quality-controlled internet directory for educational resources. A raw server log representing three months of activities on the portal is made available. The size of all files is 5 GB.

The following table gives an overview of the log resources which have been made available at CLEF over the last years.
Table 1: Log file resources at CLEF

Year  Origin   Size               Type
2007  MSN      800,000 queries    Query log
2009  Tumba!   350,000 queries    Query log
2009  TEL      1,870,000 records  Query and activity log
2010  TEL      2,600,000 records  Query and activity log
2010  TEL      1.5 GB (zipped)    Web server log
2010  DIPF.de  5 GB               Web server log

3.1 TEL

TEL is a free service that offers access to the resources of 48 national libraries of Europe in 35 languages. It aims to provide a vast virtual collection of material from all disciplines and offers interested visitors simple access to European cultural heritage. Resources can be both digital (e.g. books, posters, maps, sound recordings, videos) and bibliographical, and the quality and reliability of the documents are guaranteed by the 48 collaborating national libraries of Europe. The data used for this task are search logs and Web server logs of The European Library portal.

3.1.1 TEL Action Logs

Search logs are usually called “action logs” in the context of TEL activities. On the home page of the TEL portal, a user can initiate a simple keyword search with a default predefined collection list presenting catalogues from national libraries. From the same page, a user may perform an advanced search with Boolean operators and/or limit the search to specific fields such as author, language, and ISBN. It is also possible to change the searched collection by checking the theme categories below the search box. After the search button is clicked, the result page appears, where results are grouped by collection and the results of the top collection in the list are presented with brief descriptions. Subsequently, a user may choose to see the result lists of other collections or move to the next page of records in the current collection's results. While viewing a result list page, a user may also click on a specific record to see detailed information about it. Additional services may be available depending on the record selected.
All these types of actions and choices are logged and stored by TEL in a relational table, where each record represents a user action [3]. The most significant columns of the table are:
• A numeric id identifying registered users, or “guest” otherwise;
• The user's IP address;
• An automatically generated alphanumeric string identifying sequential actions of the same user (sessions);
• Query contents;
• Name of the action that a user performed;
• The corresponding collection's alphanumeric id;
• Date and time of the action's occurrence.

Action logs distributed to the participants of the task cover the period from January 2007 until June 2008 and from January 2009 until December 2009. The log file contains user activities and queries entered at the search site of TEL. Examples of entries in the log file are shown in Table 3.

Table 3: Examples from the TEL log (date has been deleted for readability)

id;userid;userip;sesid;lang;query;action;colid;nrrecords;recordposition;sboxid;objurl;date
892989;guest;62.121.xxx.xxx;btprfui7keanue1u0nanhte5j0;en;("plastics mould");view_brief;a0037;31;;;
893209;guest;213.149.xxx.xxx;o270cev7upbblmqja30rdeo3p4;en;("penser leurope");search_sim;;0;-;;;
893261;guest;194.171.xxx.xxx;null;en;(“magna carta”);search_url;;0;-;;;
893487;guest;81.179.xxx.xxx;9rrrtrdp2kqrtd706pha470486;en;("spengemann");view_brief;a0067;1;-;;;
893488;guest;81.179.xxx.xxx;9rrrtrdp2kqrtd706pha470486;en;("spengemann");view_brief;a0000;0;-;;;
893533;guest;85.192.xxx.xxx;ckujekqff2et6r9p27h8r89le6;fr;("egypt france britain");search_sim;;0;-;;;

3.1.2 TEL Web Server Logs

The Web server log files of TEL cover the same period as the first part of the action logs, from January 2007 until June 2008. These log files are stored in 18 zipped text files, one for each month of that period, and each record contains the following fields:
• date: year-month-day.
• time: hour:minute:second.
• HTTP method: for example GET, HEAD, POST, etc.
• URI stem: the path of the requested file.
• URI query: the query string of the URL, if any.
• IP address: the address of the client (obfuscated, e.g. 127.0).
• User agent: the user agent of the client.
• Cookie: the cookie sent to/by the client.
• Referrer: the URL of the resource which linked the client to TEL.

The Cookie field is divided into subfields by semicolons (“;”). Some of the subfields are ignored for this task; the relevant ones include:
• cTargets: the identifiers of the collections selected by the user;
• TELSESSID: the identifier of the session. It is the same identifier recorded in the action logs under the name “sesid”. This is an important field for cross-analyzing the action logs with the Web server logs.

Figure 1 shows an example of how a user session may be stored in the two different logs.

Figure 1. An example of how actions of the user are recorded in the two log data sets. Searching and browsing activities of the same computer are uniquely identified by the TELSESSID field, which is stored both in the action logs and in the cookie field in the HTTP logs.

3.2 EduServer

The quality-controlled "Deutscher Bildungsserver" is a clearinghouse for educational resources on the Web (http://www.bildungsserver.de/start_e.html). It also contains content provided by the DIPF as well as descriptions and reviews of Web sites on education. The Internet resources (Web sites) are described, checked for their quality, manually indexed and classified.

The logs were collected between September and November 2009. They are server logs in standard format in which the searches and the results viewed can be observed. An excerpt is shown in Table 2. The logs have been anonymized by partially obscuring the IP addresses of users: the two upper levels of server names or IP addresses have been hashed. This still allows the reconstruction of sessions within the data. Note that accesses by search engine bots have not been removed from the logs.
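The pre-processing implied by these remarks (parsing the raw lines, setting bot traffic aside and reconstructing sessions from the hashed host names) can be sketched roughly as follows. This is a minimal illustration and not a tool provided by the track. The regular expression assumes lines shaped like the excerpt in Table 2, with or without quotes around the referrer; the bot markers and the 30-minute inactivity cut-off are assumptions chosen for the example, and all function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# One log line: host, timestamp, request, status, size, referrer, user agent.
# The referrer may be quoted (usual server format) or unquoted (as in Table 2).
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"?(?P<referrer>[^" ]*)"? "(?P<agent>[^"]*)"$'
)
BOT_MARKERS = ("bot", "crawler", "spider")  # heuristic, an assumption
SESSION_GAP = timedelta(minutes=30)         # inactivity cut-off, an assumption


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Month names such as "Nov" are parsed with the default (English) locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def is_bot(record):
    """Flag requests whose user agent contains a typical crawler marker."""
    agent = record["agent"].lower()
    return any(marker in agent for marker in BOT_MARKERS)


def sessions(records):
    """Group non-bot requests by (host, agent) and split on long pauses."""
    by_client = defaultdict(list)
    for record in records:
        if not is_bot(record):
            by_client[(record["host"], record["agent"])].append(record)
    result = []
    for requests in by_client.values():
        requests.sort(key=lambda r: r["time"])
        current = [requests[0]]
        for previous, request in zip(requests, requests[1:]):
            if request["time"] - previous["time"] > SESSION_GAP:
                result.append(current)
                current = []
            current.append(request)
        result.append(current)
    return result


# Two lines taken from the excerpt in Table 2.
SAMPLE = [
    'f64.alicedsl.de - - [09/Nov/2009:00:23:09 +0100] '
    '"GET /zeigen.html?seite=5892 HTTP/1.1" 200 22436 '
    'http://www.bildungsserver.de/zeigen.html?seite=2521 '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
    'Gecko/20091102 Firefox/3.5.5"',
    '937.googlebot.com - - [09/Nov/2009:00:27:09 +0100] '
    '"GET /db/ffach2.html?fach=2&Rnum=12&Snum=3 HTTP/1.1" 200 16019 - '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

records = [r for r in map(parse_line, SAMPLE) if r is not None]
print(len(records), [is_bot(r) for r in records], len(sessions(records)))
```

Keying sessions on the hashed host together with the user agent is only one plausible choice; several users behind the same provider prefix may still be merged, so session counts obtained this way should be read as approximations.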
The logs make it possible to observe two types of user queries:
• queries in search engines (in the referrer, when DBS files were found using a search engine)
• queries within the DBS (see the query parameters in metasuche/qsuche)

Table 2: Examples from the DBS log (some data has been modified for readability)

f64.alicedsl.de - - [09/Nov/2009:00:23:09 +0100] "GET /zeigen.html?seite=5892 HTTP/1.1" 200 22436 http://www.bildungsserver.de/zeigen.html?seite=2521 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
80d.superkabel.de - - [09/Nov/2009:00:26:28 +0100] "GET /db/fwulesen.html?Id=200006289 HTTP/1.1" 200 16301 http://www.google.de/search?hl=de&source=hp&q=+landes+filmstelle&btnG=Google-Suche&meta=&aq=f&oq=&fp=6013614429992176 "Mozilla/5.0 (Windows; U; Windows NT 6.0; de; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4 (.NET CLR 3.5.30729)"
937.googlebot.com - - [09/Nov/2009:00:27:09 +0100] "GET /db/ffach2.html?fach=2&Rnum=12&Snum=3 HTTP/1.1" 200 16019 - "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
5bd.ono.com - - [09/Nov/2009:00:30:46 +0100] "GET /db/mlesen.html?Id=42021 HTTP/1.1" 200 180746 - "Java/1.6.0_13"
8f4.primacom.net - - [09/Nov/2009:00:30:45 +0100] "GET /zeigen.html?seite=771 HTTP/1.1" 200 45871 http://www.bildungsserver.de/metasuche/qsuche.html?feldinhalt1=aktive+medienarbeit&bool1=AND&finden=finden&searchall=ja&datenbanken%5B%5D=dbs_seiten&DBS=1&art=einfach "Mozilla/5.0 (Windows; U; Windows NT 6.0; de; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4 (.NET CLR 3.5.30729)"

The logs also make it possible to observe the browsing behavior within the DBS server structure. The following pages are of most interest:
• the descriptions of the educational Web sites within DBS (mlesen)
• thematic lists of educational Web sites (zeigen, anzeigen, fachlist, listen)
• a newspaper documentation on articles about education (zeitdok)

The logs give access to two types of content, which can be compared to the queries:
• the descriptions of the educational Web sites within DBS
• the content of the educational Web sites themselves (which might have changed since the logs were collected), in those cases where the user might have accessed them

4 Participants and Results

The following two sections list the participants of LogCLEF 2010 and present some results. For more detail, the reader is referred to the papers by the participants, which describe the approaches and findings more fully.

4.1 Participants

As shown in Table 4, a total of 7 groups submitted results for LogCLEF. Fewer than half of the 15 registered groups thus managed to obtain results. The results of the participating groups are reported in the following section and elaborated in the papers of the participants.

All groups analyzed the TEL logs and none worked with the DBS logs. This might be due to the nature of a raw Web server log, which requires much pre-processing. LogCLEF could not provide a pre-processed version due to a lack of funding.

Table 4. LogCLEF 2010 participants

Participant        Institution                                                            Country
DAEDALUS           Universidad Politécnica de Madrid & Universidad Carlos III de Madrid   Spain
SINAI              University of Jaen                                                     Spain
TCD-DCU            Trinity College Dublin & Dublin City University                        Ireland
NII                National Institute of Informatics & other institutions                 Japan
Info Foraging Lab  Radboud University Nijmegen & Maastricht University                    The Netherlands
Info Science       Humboldt University Berlin                                             Germany
CELI s.r.l         CELI Research, Torino                                                  Italy

4.2 Results

A large variety of approaches was taken to analyze the TEL log files. This can be considered a success of the open definition of the task, which encouraged creative exploration of the data.

Two groups contrasted user behavior at a quality search service like TEL with common Web search behavior.
A group from The Netherlands under the leadership of the University of Nijmegen contrasted frequent queries and the number of queries per session in the TEL log with data from an MSN log [8]. Verberne et al. also created a network of actions within TEL, visualizing the frequency of actions as well as transition probabilities. It can be observed that view actions are more frequent than search actions and that the full view of a result is selected more often than the brief view.

The NII from Tokyo also compared the TEL logs to Web search logs and to theories developed by exploiting Web search [9]. Takaku et al. analyzed the two TEL logs separately and observed few differences between the two time spans. They also integrated the length of a session into their work. Generally, a high correlation between the number of actions and the session length can be seen, but there are many exceptions which might be interesting for further exploration. Takaku et al. extracted the ranks of the documents clicked by the users and compared the results with those from Web search experiments.

The DAEDALUS group formally defined success for sessions and queries [7]. They calculated that only 6% of the queries and 10% of the sessions could be labeled as successful.

Three groups focused on language issues. The SINAI group showed that most of the sessions are in English [5]. They also report that 50,000 of the sessions consist of only one action and that more than 80% of the sessions have 10 or fewer actions.

The CELI research institute tried to identify the language of search queries [4]. They manually labeled 100 queries, and their system correctly identified the language of over 70% of them. CELI concludes that the integration of named entity recognition is necessary.

The difficulties of language identification were elaborated by a group from Berlin [10]. They manually checked 510 queries for their detailed analysis. The analysis showed that over 50% of the queries consisted of only a named entity and that an additional 8% included named entities together with another term. Obviously, this complicates language identification: even in the manual analysis, 38% of the queries could not be assigned to a single language. Another 31% were English. Stiller et al. also showed that the interface language selected and the origin of the IP address are only weak indicators of the query language in their subset [10].

A group from Dublin [6] also conducted research on the interface language and the origin of the user. Leveling et al. related these factors to the collection selected by the user and developed a scoring function which can rerank the result documents in a way that improves the result quality for the user, based on the clicks observed in the log file. Leveling et al. also analyzed the content of the queries in order to develop query performance estimators. They implemented IDF and clarity score [6].

5 Conclusion and Future Work

Studies on log files are limited by privacy issues. For the first time, LogCLEF provided evaluation resources for log file analysis which can be used for comparative system evaluation. In its second year, LogCLEF attracted more attention from participants. It is intended to encourage and facilitate the exchange of resources and tools generated in the course of participation in LogCLEF.

In the future, log analysis should be the basis for other evaluation tasks. Logs can show how users behave and what they need. One example could be the selection of topics for retrieval evaluation or for question answering systems [11].

Acknowledgments

The organization of LogCLEF was mainly volunteer work. We want to thank The European Library (TEL) and DIPF, the Leibniz Institute for Educational Research and Educational Information, Frankfurt, Germany, for providing the log files.
At the University of Padua, the work has been partially supported by the TELplus Targeted Project for digital libraries, as part of the eContentplus Program of the European Commission (Contract ECP-2006-DILI-510003).

References

1. Jansen, B.; Spink, A. & Taksa, I. (eds.): Handbook of Research on Web Log Analysis. Idea Group Reference: Hershey et al. 2009
2. Mandl, T.; Agosti, M.; Di Nunzio, G.; Yeh, A.; Mani, I.; Doran, C. & Schulz, J.: LogCLEF 2009: the CLEF 2009 Cross-Language Logfile Analysis Track Overview. In: Multilingual Information Access Evaluation I: Text Retrieval Experiments: Proc. 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece. Revised Selected Papers. Berlin et al.: Springer [LNCS 6241]. Preprint in Working Notes: http://www.clef-campaign.org/2009/working_notes/LogCLEF-2009-Overview-Working-Notes-2009-09-14.pdf
3. Di Nunzio, G.M.: LogCLEF 2009 2009/03/02 v 1.0 Description of The European Library (TEL) Search Action Log Files. http://www.uni-hildesheim.de/logclef/Daten/LogCLEF2009_file_description.pdf 2009
4. Bosca, A. & Dini, L.: Language Identification Strategies for Cross Language Information Retrieval. In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
5. Perea-Ortega, J.; Montejo Ráez, A.; Garcia Cumbreras, M. & Ureña-López, L.A.: SINAI at LogCLEF 2010. In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
6. Leveling, J.; Ghorab, M.R.; Magdy, W.; Jones, G. & Wade, V.: DCU-TCD@LogCLEF 2010: Re-ranking Document Collections and Query Performance Estimation. In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
7. Lana-Serrano, S.; Villena-Román, J. & González-Cristóbal, J-C.: DAEDALUS at LogCLEF 2010: Analyzing the Success of Search Queries. In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
8. Verberne, S.; Hinne, M.; van der Heijden, M.; Hoenkamp, E.; Kraaij, W. & van der Weide, T.: How does the Library Searcher behave? In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
9. Takaku, M.; Egusa, Y.; Saito, H.; Kando, N.; Teraki, H. & Miwa, M.: CRES at LogCLEF 2010: Towards Understanding the User Behaviors through an Analysis of Search Sessions, Search Units and Click Ranks. In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
10. Stiller, J.; Gaede, M. & Petras, V.: Ambiguity of Queries and the Challenges for Query Language Detection. In this volume (LogCLEF 2010 Working Notes, http://clef2010.org/)
11. Sutcliffe, R.; Kruschwitz, U. & Mandl, T.: Web Logs and Question Answering. In: Proc. Web Logs and Question Answering (WLQA2010) Workshop at the Seventh International Conference on Language Resources and Evaluation (LREC), Malta, 22nd May. pp. 1-7. http://www.csis.ul.ie/wlqa2010/proceedings.htm