=Paper= {{Paper |id=Vol-512/paper-8 |storemode=property |title=Evaluation of Digital Library Services Using Complementary Logs |pdfUrl=https://ceur-ws.org/Vol-512/paper08.pdf |volume=Vol-512 |dblpUrl=https://dblp.org/rec/conf/sigir/AgostiCN09 }} ==Evaluation of Digital Library Services Using Complementary Logs== https://ceur-ws.org/Vol-512/paper08.pdf
                 Evaluation of Digital Library Services Using
                           Complementary Logs∗

               Maristella Agosti                       Franco Crivellari              Giorgio Maria Di Nunzio
               University of Padua                     University of Padua                 University of Padua
            Via Gradenigo 6/a, 35131                Via Gradenigo 6/a, 35131            Via Gradenigo 6/a, 35131
                  Padua, Italy                            Padua, Italy                        Padua, Italy
              agosti@dei.unipd.it                     crive@dei.unipd.it                dinunzio@dei.unipd.it

ABSTRACT                                                           Log is a concept commonly used in computer science; in
In recent years, the importance of log analysis has grown; log     fact, log data are collected by programs to make a permanent
data constitute a relevant aspect in the evaluation process of     record of events during their usage. The log data can be used
the quality of a digital library system. In this paper, we ad-     to study the usage of a specific application, and to better
dress the problem of log analysis for complex systems such         adapt it to the objectives the users were expecting to reach.
as digital library systems, and how the analysis of search         In the context of the Web, the storage and the analysis of
query logs or Web logs is not sufficient to study users and        Web log files are mainly used to gain knowledge on the users
interpret their preferences. In fact the combination of im-        and improve the services offered by a Web portal, without
plicitly and explicitly collected data improves understanding      the need to bother the users with the explicit collection of
of behavior with respect to the understanding that can be          information.
gained by analyzing the sets of data separately.
                                                                   When research addresses the problem of studying log data
Categories and Subject Descriptors                                 in digital libraries, which are very complex systems, differ-
H.3.7 [Digital Libraries]: User Issues; H.3.3 [Information         ent characteristics regarding library automation systems and
Storage and Retrieval]: Information Search and Retrieval—          digital library systems need to be taken into account. In fact,
Search process; H.3.4 [Systems and Software]: User pro-            for all the different categories of users of a digital library sys-
files and alert services                                           tem, the quality of services and documents the digital library
                                                                   supplies are very important. Log data constitute a relevant
                                                                   aspect in the evaluation process of the quality of a digital li-
General Terms                                                      brary system and of the quality of interoperability of digital
Algorithms, Design, Experimentation                                library services [2, 18]. With this concept in mind, it is also
                                                                   possible to think about new different logging formats which
Keywords                                                           reflect how a generic DL system behaves [14].
Web Log, Search Log, User Study
                                                                   This paper deals with the study of complementary types of
                                                                   logs in complex systems with the aim of finding new ways
1.   INTRODUCTION                                                  of using them to evaluate and personalize digital library ser-
The interaction between the user and an information access
                                                                   vices for the final users. The paper is organized as follows:
system can be analyzed and studied to gather user prefer-
                                                                   Section 2 presents previous related work, Section 3 analyzes
ences and to “learn” what the user likes the most, and to use
                                                                   and presents different facets of the study and use of logs of
this information to personalize the presentation of results.
                                                                   complex systems, Section 4 presents the findings of the case
User preferences can be learned explicitly, for example ask-
                                                                   study conducted in the context of the TELplus project1 for
ing the user to fill-in questionnaires, or implicitly, by study-
                                                                   the evaluation and personalization of the services of The Eu-
ing the actions of the user which are recorded in the search
                                                                   ropean Library, and lastly Section 5 draws conclusions and
log of a system. The second choice is certainly less intrusive
                                                                   indicates directions for the continuation of the work.
but requires more effort to reconstruct each search session a
user made in order to learn his preferences.
∗Copyright is held by the author/owner(s).                         2.   RELATED WORK
                                                                   In the last decade, log analysis has become one of the main
SIGIR’09, July 19-23, 2009, Boston, USA.
                                                                   threads of research for understanding users of search engines
                                                                   as shown by the works presented at three major relevant
                                                                   conferences and that have been analyzed by us2 .

                                                                   Those works study logs in different ways and for different
                                                                   1
                                                                    http://www.theeuropeanlibrary.org/telplus/
                                                                   2
                                                                    The three analyzed major conferences are:
                                                                   SIGIR - http://www.sigir.org/
                                                                   WWW - http://www.iw3c2.org/
                                                                   JCDL - http://www.jcdl.org/
purposes, but they can be divided into two main classes:          by content in the same manner as information retrieval sys-
studies about search query logs, and studies about Web            tems and search engines [1]. In all other types of searches,
server logs. Since most of these research papers concern          either the digital library system makes use of authority data
search engines, the focus of their research is more on improv-    to respond to final users in a more consistent and coherent
ing queries and results and less on surfing the Web. The few      way through a search system that is a sort of a new gener-
exceptions to this classification will be analyzed later in the   ation of online public access catalogue (OPAC) system, or
paper.                                                            the system supports the full content search with a service
                                                                  that gives the final users the facilities of a search engine.
Query search logs can be used for: building knowledge, such
as automatically building a search thesaurus [10], or ac-         Search query logs or Web logs alone give only a partial view
quiring ontological knowledge [24]; refining and expanding        of the stream of information that users produce. [28] show
queries by means of analysis of search logs [4], or by means of   how to combine two different streams of data, search query
correlations between query terms and document terms based         logs and click-streams, in order to analyze re-finding behav-
on search query logs [11]; comparing of query extension tech-     ior of a group of users under observation for a period of one
niques with pseudo-relevance feedback techniques [30]; orga-      year.
nizing search results [29]; studying temporal changes and re-
lationships, such as changes of queries on hourly basis in or-    Moreover, log analysis can be supported and validated by
der to understand how user preferences change over time [5],      user studies which are a valuable method for understanding
analysis of multitasking user searches [6], issues related to     user behavior in different situations. User studies require
ambiguity and freshness of queries [22], studies of causal        a significant amount of time and effort, so an accurate de-
relations between queries [27]; mining queries for extracting     sign of the process has to be carried out. In general, user
news-related queries [20], and association rules to discover      studies and logs are used in a separate way, since they are
related queries [25], or fast query recommendations [32].         adopted with different aims in mind. Ingwersen and Järvelin
                                                                  report in [17] that it seems more scientifically informative to
Web logs can be used for: improving rank of results by re-        combine logs together with observation in naturalistic set-
placing the adjacency matrix of the HITS algorithm with a         tings. Pharo and Järvelin in [23] suggest systematic use of
link matrix which weights connections between nodes based         the triangulation of different data collection techniques as
on the usage data from Web server log traffic [21]; matching      a general approach in order to get better knowledge of the
website organization with visitor expectations by means of        Web information search process. An example of this type of
Web log analysis [26]; finding user navigational patterns [9];    combined studies is [15], where the authors claim that fully
agents’ detection [7].                                            understanding user satisfaction and user intent requires a
                                                                  depth of data unavailable in search query logs but possible
There is also a recent emerging research activity about log       to acquire from other sources of data, such as one-on-one
analysis which tackles cross-lingual issues: [13] extends the     studies or instrumented panels.
notion of query suggestion to cross-lingual query suggestion
studying search query logs; [16] leverages click-through data     The combination of implicitly and explicitly collected data
to extract query translation pairs. The interest in multilin-     improves understanding of behavior with respect to the un-
gual log analysis is also confirmed by initiatives promoted       derstanding that can be gained by analyzing the sets of data
by the TrebleCLEF3 coordination action which supports the         separately. In particular for digital libraries, where the eval-
development and consolidation of expertise in the multidis-       uation of the different services is difficult if logs are used
ciplinary research area of multilingual information access        alone, the combined sets of data provide the opportunity
(MLIA).                                                           of reaching insights towards user personalization of digital
                                                                  library services.
3.     LOGS OF COMPLEX SYSTEMS                                    From this starting point we have developed a method for col-
Present digital library systems are complex software sys-         lecting data derived from the user interaction log, “implicit”
tems, often based on a service-oriented architecture, able to     data, and data collected from user questionnaires, “explicit”
manage complex and diversified collections of digital objects.    data, for analyzing the interaction between users and digital
One significant aspect that still relates present systems to      libraries. This means that the conceived method is based on
the old ones is that the representation of the content of the     the combination and analysis of the following data sources:
digital objects that constitute the collection of interest is     HTTP log which contains the HTTP requests sent by the
still done by professionals. This means that the manage-          Web client to the Web server during a user browsing session;
ment of metadata can still be based on the use of authority       search log which contains the actions performed by the user
control rules in describing author, place names and other rel-    during a search; questionnaire data which are collected at
evant catalogue data. A digital library system can exploit        the end of a user browsing and searching session.
authority data that keep lists of preferred or accepted forms
of names and all other relevant headings. This is a dra-          The possibility of studying and correlating different sources
matic difference between digital library systems and search       of data was envisaged during the study of the Web portal of
engines, and it is usually overcome with the analysis of log      The European Library4 , which provides a vast virtual col-
data. In fact a search engine often becomes a specific com-       lection of material from all disciplines and offers interested
ponent of a digital library system, when the digital library      visitors simple access to European cultural heritage.
system faces the management and search of digital objects
3                                                                 4
    http://www.trebleclef.eu/                                         http://www.theeuropeanlibrary.org/
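
The combination of the three data sources described in Section 3 (HTTP log, search log, and questionnaire data) can be illustrated with a minimal sketch. The paper does not state how records from the different sources are linked, so the shared `session_id` key, the record fields (`url`, `action`), and the names `CombinedSession` and `combine` are all assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CombinedSession:
    """All data gathered for one user session (hypothetical structure)."""
    session_id: str
    http_requests: list = field(default_factory=list)   # "implicit" data
    search_actions: list = field(default_factory=list)  # "implicit" data
    questionnaire: Optional[dict] = None                 # "explicit" data, if any


def combine(http_log, search_log, questionnaires):
    """Merge the three sources into one record per session.

    Assumes every record carries a `session_id` shared across sources.
    """
    combined = {}

    def get(session_id):
        # Create the combined record the first time a session is seen.
        if session_id not in combined:
            combined[session_id] = CombinedSession(session_id)
        return combined[session_id]

    for record in http_log:
        get(record["session_id"]).http_requests.append(record["url"])
    for record in search_log:
        get(record["session_id"]).search_actions.append(record["action"])
    for session_id, answers in questionnaires.items():
        get(session_id).questionnaire = answers
    return combined


# Invented sample data, for illustration only.
http_log = [
    {"session_id": "s1", "url": "/treasures"},
    {"session_id": "s1", "url": "/search?q=maps"},
    {"session_id": "s2", "url": "/"},
]
search_log = [{"session_id": "s1", "action": "simple_search"}]
questionnaires = {"s1": {"likes_images": True}}

combined = combine(http_log, search_log, questionnaires)
```

Keeping the questionnaire optional reflects that only some sessions (here, those of study participants) have explicit data attached, while all sessions have implicit data.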
4.   RESULTS OF THE CASE STUDY
The European Library is a free service that offers access to
the resources of 48 national libraries of Europe in 20 languages
with about 150 million entries across Europe. The European
Library provides a vast virtual collection of material from all
disciplines and offers interested visitors simple access to
European cultural heritage.

Table 1: Summary of statistics for the time of a user session in
minutes calculated in the search logs (between brackets, registered
users only), HTTP logs (between brackets, users who participated in
the study), and the time for filling in the questionnaire.

            Search log    HTTP log       Questionnaire
  Median    2.0 (4.0)     1.3 (30.25)    31.0
  Mean      6.0 (8.0)     4.7 (31.80)    33.0

To validate the proposed method, a study was conducted in
a controlled setting at the end of 2007 – beginning of 2008, in
the computer laboratories of different faculties of the Uni-
versity of Padua, Italy, where students were requested to         session. One of the outcomes of the questionnaire was the
conduct a free navigation and search for information on The       disorientation of the user upon entering The European Library
European Library portal and to fill in a questionnaire specif-    portal for the first time; in particular it seems not to be clear
ically designed to harvest the data that can be used to ex-       what kind of information can be accessed through this por-
tract information on user satisfaction on the use of different    tal. Users are in general ready to search in a Google-like
parts of the portal. A total of 155 students participated in      fashion and obtain documents, in terms of links to pages
the study, mostly Italians, equally distributed between males     or documents online; in the case of The European Library
and females, and with an age range typical of students of         they are essentially in front of an online public access cat-
Bachelor and Master Degree (in most cases between 19 and          alogue which retrieves bibliographic records. Obtaining li-
25 years old).                                                    brary catalogue records after a search is a source of confusion
                                                                  which leaves the user unhappy and willing to leave the portal
The analysis of the results was done in the following order:      quickly.
the analysis of each stream of data - i.e. HTTP log, search
query log, questionnaires - was first conducted, while the        Questionnaires also show that images in particular seem to
analysis of possible interrelation among these sources was        be very appealing for users; both the “treasures” section, a
conducted later. The description of the analysis of each          section which shows high resolution images of ancient doc-
single stream is reported in [3]; here we concentrate on the      uments, and the “exhibition” section, a section which shows
aspects which emerge from the correlation of the different        pictures of the national libraries' buildings, were thoroughly
sources of information.                                           browsed by users even before making any query in the por-
                                                                  tal. This is an important clue which may suggest that there
Table 1 summarizes one of the important features when do-         should be more linking from the images to the catalogue
ing log analysis: session length. In particular, the table        records. The interrelation among the information about
shows how different these lengths are according to the source     users who prefer images and the HTTP log and searches
that is analyzed. The “Search log” column shows the statis-       log is still under investigation. In fact, we would like to
tics of the times, in minutes, of sessions found in the search    see if this willingness expressed in the questionnaire is also
logs, and between brackets the times of sessions of users         reflected in user actions: for example, a user who is inter-
who registered to the portal. This shows that logging on          ested in images clicks more frequently on images or searches
is a clear intention of users who are willing to spend time       for documents like maps or paintings; or a user expresses
in the portal and search more, compared to random users.          this interest in images but actually does not perform any
The “HTTP log” column shows the times of sessions found           action in the portal which confirms this interest.
in the HTTP logs computed in October 2007, and between
brackets the times of the sessions of users who participated      5.   CONCLUSIONS
in the user study at the University of Padua. In this case,       The insights gained by analyzing log data together with data
there is a strong bias of the students of the user study due      from controlled studies are more informative than the results
to the time slot which was about 30/45 minutes. The times         that can be derived by separately analyzing the groups of
of random users are comparable to those found in the search       data. Our studies on logs combined with interviews have
logs. The last column shows the times of sessions for filling-    shown that the results are more scientifically informative
in the questionnaires, which are obviously very similar to        than those obtained when the two types of studies are con-
the times of HTTP sessions of the user study. There is one        ducted alone. This encouraging result constitutes the ground
important aspect which emerges from the data: sessions are        on which we are generalizing and formalizing starting from
very short; browsing and searching activity lasts less than 2     the obtained results. A crucial feature in the future will be
minutes in 50% of the cases. This particular situation can be     making active use also of the information on metadata that
explained only by studying the answers of the users to the        are present in the log, because until now no active way of
questionnaire where there are clear indications about some        using them has been incorporated in the proposed method.
difficulties they found in understanding how to read the list
of the results, and how to use some functions of the inter-       6.   ACKNOWLEDGEMENTS
face. These are also the reasons why they would have left         The work has been partially supported by the TELplus Tar-
the portal sooner if they had not been asked to stay and fill     geted Project for digital libraries, as part of the eContentplus
in the questionnaire.                                             Program of the EC, and by the TrebleCLEF Coordination
                                                                  Action, as part of the 7FP of the EC.
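
The session-length statistics summarized in Table 1 can be illustrated with a minimal sketch of how median and mean session times might be computed from timestamped log records. The paper does not describe how sessions were delimited, so the 30-minute inactivity threshold, the `(client_id, timestamp)` record shape, and the function names below are assumptions, and the sample requests are invented.

```python
from datetime import datetime, timedelta
from statistics import mean, median

# Assumed threshold: the paper does not say how sessions were delimited.
INACTIVITY_GAP = timedelta(minutes=30)


def split_sessions(requests, gap=INACTIVITY_GAP):
    """Group (client_id, timestamp) pairs into per-client sessions.

    A new session starts when a client is silent for longer than `gap`.
    Returns a list of sessions, each a list of ascending timestamps.
    """
    by_client = {}
    for client_id, timestamp in requests:
        by_client.setdefault(client_id, []).append(timestamp)

    sessions = []
    for stamps in by_client.values():
        stamps.sort()
        current = [stamps[0]]
        for timestamp in stamps[1:]:
            if timestamp - current[-1] > gap:
                sessions.append(current)
                current = [timestamp]
            else:
                current.append(timestamp)
        sessions.append(current)
    return sessions


def session_minutes(session):
    """Length of a session in minutes, from first to last request."""
    return (session[-1] - session[0]).total_seconds() / 60.0


def summarize(sessions):
    """Median and mean session length in minutes, as in Table 1."""
    lengths = [session_minutes(s) for s in sessions]
    return {"median": median(lengths), "mean": mean(lengths)}


# Invented sample requests, for illustration only.
t0 = datetime(2007, 10, 1, 9, 0)
requests = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=2)),
    ("b", t0),
    ("b", t0 + timedelta(minutes=1)),
    ("a", t0 + timedelta(hours=2)),
    ("a", t0 + timedelta(hours=2, minutes=6)),
]
sessions = split_sessions(requests)
stats = summarize(sessions)
```

Reporting the median alongside the mean matters here because a few long sessions pull the mean upward, which is consistent with the gap between the two rows of Table 1.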
An important interrelation was found among questionnaires
and log data which may explain the short length of a user
                                                                  7.   REFERENCES
 [1] M. Agosti, editor. Information access through search             Q. Yang. Web query translation via web log mining.
     engines and digital libraries. Springer, Berlin,                 In S.-H. Myaeng, D. W. Oard, F. Sebastiani, T.-S.
     Germany, 2008.                                                   Chua, and M.-K. Leong, editors, SIGIR, pages
 [2] M. Agosti. Log data in digital libraries. In M. Agosti,          749–750. ACM, 2008.
     F. Esposito, and C. Thanos, editors, IRCDL, pages           [17] P. Ingwersen and K. Järvelin. The Turn. Springer,
     115–122. DELOS: an Association for Digital Libraries,            The Netherlands, 2005.
     2008.                                                       [18] T. Koch, A. Ardö, and K. Golub. Browsing and
 [3] M. Agosti, F. Crivellari, and G. M. Di Nunzio. A                 searching behavior in the renardus web service a study
     method for combining and analyzing implicit                      based on log analysis. In H. Chen, H. D. Wactlar,
     interaction data and explicit preferences of users.              C. chih Chen, E.-P. Lim, and M. G. Christel, editors,
     Workshop on Contextual Information Access, Seeking               JCDL, page 378. ACM, 2004.
     and Retrieval Evaluation (ECIR 2009), April 2009.           [19] W. Kraaij, A. P. de Vries, C. L. A. Clarke, N. Fuhr,
 [4] P. G. Anick. Using terminological feedback for web               and N. Kando, editors. SIGIR 2007: Proceedings of
     search refinement: a log-based study. In SIGIR, pages            the 30th Annual International ACM SIGIR
     88–95. ACM, 2003.                                                Conference on Research and Development in
 [5] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. A.                 Information Retrieval, Amsterdam, The Netherlands,
     Grossman, and O. Frieder. Hourly analysis of a very              July 23-27, 2007. ACM, 2007.
     large topically categorized web query log. In               [20] M. Maslov, A. Golovko, I. Segalovich, and
     M. Sanderson, K. Järvelin, J. Allan, and P. Bruza,              P. Braslavski. Extracting news-related queries from
     editors, SIGIR, pages 321–328. ACM, 2004.                        web query log. In Carr et al. [8], pages 931–932.
 [6] N. Buzikashvili. An exploratory web log study of            [21] J. C. Miller, G. Rae, and F. Schaefer. Modifications of
     multitasking. In Efthimiadis et al. [12], pages 623–624.         kleinberg’s hits algorithm using matrix exponentiation
 [7] N. Buzikashvili. Sliding window technique for the web            and weblog records. In W. B. Croft, D. J. Harper,
     log analysis. In Williamson et al. [31], pages                   D. H. Kraft, and J. Zobel, editors, SIGIR, pages
     1213–1214.                                                       444–445. ACM, 2001.
 [8] L. Carr, D. D. Roure, A. Iyengar, C. A. Goble, and          [22] J. Parikh and S. Kapur. Unity: relevance feedback
     M. Dahlin, editors. Proceedings of the 15th                      using user query logs. In Efthimiadis et al. [12], pages
     international conference on World Wide Web, WWW                  689–690.
     2006, Edinburgh, Scotland, UK, May 23-26, 2006.             [23] N. Pharo and K. Järvelin. The SST method: a tool for
     ACM, 2006.                                                       analysing Web information search processes.
 [9] J. Chen and T. Cook. Mining contiguous sequential                Information Processing & Management,
     patterns from web logs. In Williamson et al. [31],               40(4):633–654, July 2004.
     pages 1177–1178.                                            [24] S. Sekine and H. Suzuki. Acquiring ontological
[10] S.-L. Chuang, H.-T. Pu, W.-H. Lu, and L.-F. Chien.               knowledge from query logs. In Williamson et al. [31],
     Auto-construction of a live thesaurus from search term           pages 1223–1224.
     logs for interactive web search. In SIGIR, pages            [25] X. Shi and C. C. Yang. Mining related queries from
     334–336, 2000.                                                   search engine query logs. In Carr et al. [8], pages
[11] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma.                      943–944.
     Probabilistic query expansion using query logs. In          [26] R. Srikant and Y. Yang. Mining web logs to improve
     WWW 2002, pages 325–332, 2002.                                   website organization. In WWW 2001, pages 430–437,
[12] E. N. Efthimiadis, S. T. Dumais, D. Hawking, and                 2001.
     K. Järvelin, editors. SIGIR 2006: Proceedings of the       [27] Y. Sun, K. Xie, N. Liu, S. Yan, B. Zhang, and
     29th Annual International ACM SIGIR Conference on                Z. Chen. Causal relation of queries from temporal
     Research and Development in Information Retrieval,               logs. In Williamson et al. [31], pages 1141–1142.
     Seattle, Washington, USA, 2006. ACM, 2006.                  [28] J. Teevan, E. Adar, R. Jones, and M. A. S. Potts.
[13] W. Gao, C. Niu, J.-Y. Nie, M. Zhou, J. Hu, K.-F.                 Information re-retrieval: repeat queries in yahoo’s
     Wong, and H.-W. Hon. Cross-lingual query suggestion              logs. In Kraaij et al. [19], pages 151–158.
     using query logs of different languages. In Kraaij et al.   [29] X. Wang and C. Zhai. Learn from web search logs to
     [19], pages 463–470.                                             organize search results. In Kraaij et al. [19], pages
[14] M. A. Gonçalves, G. Panchanathan,                               87–94.
     U. Ravindranathan, A. Krowne, E. A. Fox,                    [30] R. W. White, C. L. A. Clarke, and S. Cucerzan.
     F. Jagodzinski, and L. N. Cassel. The xml log                    Comparing query logs and pseudo-relevance
     standard for digital libraries: Analysis, evolution, and         feedback for web-search query refinement. In Kraaij
     deployment. In JCDL, pages 312–314. IEEE                         et al. [19], pages 831–832.
     Computer Society, 2003.                                     [31] C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider,
[15] C. Grimes, D. Tang, and D. M. Russell. Query logs                and P. J. Shenoy, editors. Proceedings of the 16th
     alone are not enough. In E. Amitay and C. G. M. J.               International Conference on World Wide Web, WWW
     Teevan, editors, Query Log Analysis: Social And                  2007, Banff, Alberta, Canada, 2007. ACM, 2007.
     Technological Challenges. A workshop at the 16th            [32] Z. Zhang and O. Nasraoui. Mining search engine
     International World Wide Web Conference (WWW                     query logs for query recommendation. In Carr et al.
     2007), May 2007.                                                 [8], pages 1039–1040.
[16] R. Hu, W. Chen, P. Bai, Y. Lu, Z. Chen, and