=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-LogCLEF-VerberneEt2010
|storemode=property
|title=How does the Library Searcher Behave? A Contrastive Study of Library Search against Ad-hoc Search
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-LogCLEF-VerberneEt2010.pdf
|volume=Vol-1176
}}
==How does the Library Searcher Behave? A Contrastive Study of Library Search against Ad-hoc Search==
<pdf width="1500px">https://ceur-ws.org/Vol-1176/CLEF2010wn-LogCLEF-VerberneEt2010.pdf</pdf>
<pre>
                 How does the Library Searcher behave?
         A contrastive study of library search against ad-hoc search

                     Suzan Verberne, Max Hinne, Maarten van der Heijden,
                     Eduard Hoenkamp, Wessel Kraaij, Theo van der Weide

                       Information Foraging Lab, Radboud University Nijmegen


      Abstract. We investigated the search behaviour of the library searcher using the log
      data from The European Library (TEL). We were especially interested in how this
      behaviour compares to the search behaviour of ad-hoc searchers, represented by log
      data from the MSN search engine. At first sight, the two data sets mainly differ in the
      topics of the queries entered and their multi-lingual vs. mono-lingual content.
      When studying user behaviour, session information is very important: how does the
      user navigate through the engine’s interface? We visualized the TEL users’
      interactions with the system by creating a transition network for the users’ intra-session
      actions. In general, we think that research into user behaviour on the basis of search
      engine logs can be very informative for the evaluation of search engine interfaces.


1   Introduction
For the logCLEF 2010 workshop,^1 the organization released a log file of user activities from
The European Library (TEL). TEL provides access to a number of national libraries in Europe
through a search interface.^2
     The aim of the current paper is to model the search behaviour of the library searcher in contrast
with the ad-hoc searcher. To this end, we collect a number of statistics on the TEL data and on a log
file from a general web search engine: the Microsoft 2006 RFP dataset that was distributed for
the WSCD 2009 workshop.^3 We assume that the TEL data are representative of library search
and the MSN click data are representative of ad-hoc web search.
     In Section 2, we present a number of statistics about both data sets. In Section 3, we discuss
the main differences between them. In Section 4, we focus on one specific characteristic from the
TEL data: the behaviour of the user within one session. In Section 5, we answer our main question:
“How does the Library Searcher behave?”

2   Descriptions of the two data sets
The ad-hoc web search data consists of approximately 12 million clicks by US users of the
Microsoft MSN search engine, recorded during the spring of 2006. For each record, the following
details are available: a query ID, the query itself, the user session ID (more on session information
later), a time-stamp, the URL of the clicked document, the rank of the URL and the number of results.
    The library search data consists of a little under 2 million records collected between
1 January 2007 and 30 June 2008. For these records, the following information is included:
a record ID, a user ID, an obfuscated user IP, a session ID, the chosen interface language, the
query, the user’s action (e.g. simple search or search within search), a document collection ID, the
number of results, the rank of the clicked result, a search box ID, the URL of the object being
viewed and a time-stamp. Some records in the data do not contain a query: these are interactions
such as choosing a specific collection for the next search.
    A number of key statistics of both data sets are shown in Table 1 and Table 2.

^1 http://www.uni-hildesheim.de/logclef/
^2 http://search.theeuropeanlibrary.org/
^3 http://research.microsoft.com/en-us/um/people/nickcr/wscd09/

                            Table 1. Basic statistics of the two data sets

                             TEL                             MSN
                   1 866 330 interactions         12 251 067 clicks
                   1 345 508 queries issued        8 831 280 queries issued
                     220 409 unique queries        3 875 427 unique queries
                         2.2 average query length        2.5 average query length

                      Table 2. The most frequent queries with their frequency.

                   TEL                              MSN
                   Query                 frequency Query                 frequency
                   mozart                    16 605 google                  123 392
                   van gogh                   3 788 yahoo                   111 724
                   meisje met de parel        3 481 myspace                  79 955
                   harry potter               3 451 yahoo.com                60 494
                   einstein                   1 925 myspace.com              49 327
                   pink floyd                 1 664 ebay                     48 691
                   rembrandt                  1 618 mapquest                 31 062
                   shakespeare                1 579 aol.com                  27 345
                   nuremberg                  1 552 google.com               25 304
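
Statistics of the kind reported in Tables 1 and 2 can be derived from a list of query strings. The following is a minimal sketch, not the procedure used for this paper. It assumes that query length is counted in whitespace-separated terms (the unit is not stated in the tables), that records without a query are skipped, and the function name basic_statistics and the sample data are hypothetical.

```python
from collections import Counter

def basic_statistics(queries):
    """Return (queries issued, unique queries, mean length in terms, Counter)."""
    # Lowercase and collapse whitespace, mirroring the lowercased log queries.
    normalized = [" ".join(q.lower().split()) for q in queries]
    # Records without a query (e.g. a collection switch) are skipped.
    normalized = [q for q in normalized if q]
    counts = Counter(normalized)
    # Assumption: query length is the number of whitespace-separated terms.
    total_terms = sum(len(q.split()) for q in normalized)
    mean_length = total_terms / len(normalized) if normalized else 0.0
    return len(normalized), len(counts), mean_length, counts

# Hypothetical sample, not taken from the logs.
sample = ["mozart", "Mozart", "van gogh", "harry potter", "", "van gogh", "mozart"]
issued, unique, mean_length, counts = basic_statistics(sample)
print(issued, unique, mean_length)
print(counts.most_common(2))
```

The Counter keeps the full frequency table, so a ranking such as Table 2 is a call to most_common on the same object.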


3     Differences between the search engines

A quick inspection of the TEL data immediately shows that the search topics are different from
the general-domain web search queries. Some example queries are ‘il vocabolario degli accademici
della crusca’, ‘twain mark’, ‘christie’, ‘maumet’, ‘vvedenskij’, ‘daylight’ and ‘dubois cg’. These
examples confirm that the database is multi-lingual.
    Some topics have a special form related to the metadata contained in the TEL index, for
example ‘(subject all experiential learning)’. Such queries make up only a very small proportion
of the total: 0.13%. In addition, 6.8% of the queries contain boolean operators that combine atomic
queries, for example ‘(title all keywords) and (creator all keywords)’. Note that all queries have
been lowercased.
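
Proportions such as those above could be estimated by matching each query against a pattern for the metadata form. The sketch below is an illustration only: the clause pattern is generalized from the two example queries in the text, it covers only the relation ‘all’, and the operator list (and, or, not) is an assumption, since only ‘and’ appears in the examples. The names classify_query and class_shares are hypothetical.

```python
import re
from collections import Counter

# Assumed clause shape, generalized from the examples: "(<field> all <terms>)".
CLAUSE = r"\(\s*\w+\s+all\s+[^()]+\)"
SINGLE_CLAUSE = re.compile(rf"^\s*{CLAUSE}\s*$")
# Assumed operator set; only "and" is attested in the text.
BOOLEAN_COMBINATION = re.compile(rf"{CLAUSE}\s+(?:and|or|not)\s+{CLAUSE}")

def classify_query(query):
    """Label a query as 'boolean', 'fielded' or 'plain'."""
    q = query.lower()
    if BOOLEAN_COMBINATION.search(q):
        return "boolean"
    if SINGLE_CLAUSE.match(q):
        return "fielded"
    return "plain"

def class_shares(queries):
    """Return the fraction of queries falling in each class."""
    labels = Counter(classify_query(q) for q in queries)
    total = sum(labels.values())
    return {label: n / total for label, n in labels.items()}

examples = [
    "(subject all experiential learning)",
    "(title all keywords) and (creator all keywords)",
    "twain mark",
    "mozart",
]
shares = class_shares(examples)
print(shares)
```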


3.1   Language selection

In contrast to the MSN search engine, the TEL interface gives the user the option to select the
interface language. Not all users use this option, resulting in a relatively large number of records
that use the default language setting, English. In total, the TEL click data contain 41 different
interface languages. The most frequent are shown in Table 3.


Table 3. The most frequent languages selected in the TEL interface with their frequency (number of
queries).

                                         Language  frequency
                                         English   1 600 514
                                         French       60 108
                                         Polish       36 438
                                         German       27 384
                                         Italian      24 068
                                         Spanish      20 074

3.2   Sessions

A more striking difference between the data sets is the way the records are obtained. In the MSN
data set, a record is stored each time a user clicks on a result in the result list. As a consequence,
information about queries that did not result in a click is lost. On the other hand, in the TEL
data set a record is stored each time the user interacts with the engine. Such an interaction can
be the execution of a query (the most common interaction), but also viewing a document, saving
a session or a switch in document collection.
    An important observation is the difference in the grouping of records per session. The MSN
data set considers a new session to start each time a user issues a new query. The TEL data set uses
a more intuitive notion of sessions, where a session contains all subsequent user interactions within
a certain time limit. Unfortunately, because of the way MSN sessions are stored, it is impossible
to directly compare MSN and TEL sessions, for example with respect to query reformulation.
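
The time-limit notion of a session can be illustrated by splitting one user's time-stamps wherever the gap between consecutive interactions exceeds a timeout. This is a sketch under assumptions: the TEL log already carries session IDs, the actual time limit is not stated here, and the 30-minute value and the name split_sessions are hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's time-stamps into sessions.

    A new session starts whenever the gap to the previous interaction
    exceeds `timeout` (a hypothetical value, not the one used by TEL).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # continue the current session
        else:
            sessions.append([ts])     # gap too large: open a new session
    return sessions

# Hypothetical interactions of a single user.
interactions = [
    datetime(2007, 1, 1, 10, 0),
    datetime(2007, 1, 1, 10, 5),
    datetime(2007, 1, 1, 10, 50),
    datetime(2007, 1, 1, 11, 0),
]
sessions = split_sessions(interactions)
print([len(s) for s in sessions])
```

In this sample the 45-minute gap between the second and third interaction separates the two sessions.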
    As a further exploration, we considered the distribution of the number of queries per session as a
basic descriptor of user search behaviour [1], for the TEL data set only. We plotted the cumulative
distribution of the number of queries per session, as well as of the number of unique queries per
session, as shown in Figure 1. The plot shows that there is a substantial number of sessions with
a large number of queries, but that the number of sessions with a large number of unique queries
is much lower.


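       [Figure 1: plot of Pr(Q > q) (vertical axis, ticks from 0.0001 to 1) against the number of
       queries per session q (horizontal axis, ticks from 1 to 10000), with two curves:
       ‘TEL queries’ and ‘TEL unique queries’.]

                  Fig. 1. The cumulative distribution Pr(Q > q) for the number of queries per session q.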
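
A distribution of the kind plotted in Figure 1 can be computed from (session ID, query) pairs. The following is a minimal sketch, not the code used to produce the figure. The record layout, the function names queries_per_session and ccdf, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict

def queries_per_session(records):
    """records: iterable of (session_id, query) pairs.

    Returns two lists with one entry per session: the number of queries
    and the number of unique queries. Records without a query are skipped.
    """
    per_session = defaultdict(list)
    for session_id, query in records:
        if query:
            per_session[session_id].append(query)
    totals = [len(qs) for qs in per_session.values()]
    uniques = [len(set(qs)) for qs in per_session.values()]
    return totals, uniques

def ccdf(values):
    """Return {q: Pr(Q > q)} for every observed value q."""
    n = len(values)
    return {q: sum(v > q for v in values) / n for q in sorted(set(values))}

# Hypothetical records, not taken from the TEL log.
records = [
    ("s1", "mozart"), ("s1", "mozart"), ("s1", "mozart"), ("s1", "van gogh"),
    ("s2", "einstein"),
    ("s3", "rembrandt"), ("s3", "rembrandt"),
]
totals, uniques = queries_per_session(records)
print(ccdf(totals))
print(ccdf(uniques))
```

Plotting both dictionaries on logarithmic axes would give two curves comparable to those in the figure.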


    We found that 16% of the queries are repeated three or more times by the same user in
the same session. We suppose that the user retried the query with different settings in the TEL
interface, or repeatedly interacted with the search engine for accessing a known document. We
further investigate this phenomenon below.
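
A share of repeated queries could be computed along the following lines. This sketch adopts one possible reading of the figure above, which is an assumption: it counts distinct (session, query) pairs and reports the fraction of them that occur at least three times. The name repeated_share and the sample records are hypothetical.

```python
from collections import Counter

def repeated_share(records, min_repeats=3):
    """Fraction of distinct (session, query) pairs seen at least min_repeats times.

    records: iterable of (session_id, query) pairs; empty queries are skipped.
    """
    counts = Counter((s, q) for s, q in records if q)
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c >= min_repeats)
    return repeated / len(counts)

# Hypothetical records, not taken from the TEL log.
log = [
    ("s1", "mozart"), ("s1", "mozart"), ("s1", "mozart"), ("s1", "van gogh"),
    ("s2", "einstein"),
    ("s3", "rembrandt"), ("s3", "rembrandt"),
]
share = repeated_share(log)
print(share)
```

Counting query instances instead of distinct pairs would give a different percentage, so the choice of denominator should be stated alongside any such number.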

4      Intra-session user behaviour in TEL
Ad-hoc web search consists largely of navigational and transactional queries [2].
For the library searcher, these types of queries are much less relevant. This raises the question:
why are there so many repeated queries in the TEL log, if they do not serve navigational or
transactional purposes?
   As discussed earlier, the TEL search engine provides the user with several options to refine
a search. The language options have already been mentioned, but the search engine also offers
several alternatives to basic search, such as:
 – to use either a simple or an advanced search form,
 – to search within results,
 – to continue searching with a specific record as a starting point,
 – to start searching with a URL as query string,
 – to view a short or a long description (title) of a result,
 – to view a retrieved object in the interface of the library, or to see it online.
Some of these actions are not available in web search interfaces. To see how users work with this
functionality, we proceeded as follows. Each time a query was repeated within a session, we kept
track of the particular transition between the previously selected action and the newly selected
action.
    After normalizing so that for each action all possible transitions sum to 1, this resulted in
a transition matrix representing the users’ selection of actions, which is displayed as a directed
graph in Figure 2. The sizes of the circles represent the overall behaviour of the searcher on the
long term: what is the probability that at a given time, a user is involved in each action?
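
The construction described above can be sketched as follows. This is an illustration under assumptions, not the implementation behind Figure 2: each session is taken to be a list of (query, action) pairs, the ‘previously selected action’ is read as the action recorded at the previous occurrence of the same query in that session, actions without observed outgoing transitions are treated as absorbing, and the long-term distribution is approximated by power iteration. The function names and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict

def transition_matrix(sessions):
    """Row-normalized action-to-action transition probabilities.

    sessions: list of sessions, each a list of (query, action) pairs.
    A transition is counted each time a query recurs within a session.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        last_action = {}  # query -> action at its previous occurrence
        for query, action in session:
            if query in last_action:
                counts[last_action[query]][action] += 1
            last_action[query] = action
    # Normalize so that the outgoing transitions of each action sum to 1.
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}

def stationary_distribution(matrix, iterations=1000):
    """Approximate the long-term distribution over actions by power iteration."""
    states = sorted(set(matrix) | {b for row in matrix.values() for b in row})
    dist = {s: 1 / len(states) for s in states}
    for _ in range(iterations):
        nxt = dict.fromkeys(states, 0.0)
        for a in states:
            # Assumption: an action with no observed successor is absorbing.
            row = matrix.get(a) or {a: 1.0}
            for b, p in row.items():
                nxt[b] += dist[a] * p
        dist = nxt
    return dist

# Hypothetical sessions; the action names are illustrative labels.
sessions = [
    [("mozart", "search_sim"), ("mozart", "view_brief"),
     ("mozart", "view_full"), ("mozart", "search_sim")],
    [("einstein", "search_sim"), ("einstein", "search_sim"),
     ("einstein", "view_brief")],
]
matrix = transition_matrix(sessions)
dist = stationary_distribution(matrix)
print(matrix)
print(dist)
```

Power iteration converges here because the sample chain is irreducible and contains a self-loop. For real log data this would need checking, or the distribution could be obtained from an eigenvector computation instead.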

5   Conclusion: How does the library searcher behave?
We have investigated the search behaviour of the library searcher using the log data from The
European Library (TEL). We were especially interested in how this behaviour compares to the
search behaviour of ad-hoc searchers, represented by log data from the MSN search engine.
    Although the average length of the queries does not differ much between the two search
engines, the topics of the queries are very different. In web search there are many navigational and
transactional queries, which can be seen from the most frequent queries “Google”, “Yahoo” and
“Myspace” in the MSN data. Among the most frequent queries in the TEL data are many named
entities such as “mozart”, “van gogh” and “meisje met de parel”. Moreover, the TEL data contain
queries in multiple languages.
    When studying user behaviour, session information is very important: how does a searcher
formulate and reformulate queries, and how does he navigate through the engine’s interface?
Unfortunately, the MSN data does not contain the type of session information that we need for
such an analysis, which makes the two data collections not comparable in this respect.
    Therefore, we investigated the interaction of the user with the search interface of TEL only. The
finding that 16% of the queries are repeated three or more times shows that the library searcher
makes an effort to get relevant results for his information need, often even without reformulating
his query.
    For research into user behaviour on the basis of click data, visualizing the users’ interactions
with the system can be very informative. We did this for the TEL data by creating a transition
network for the users’ intra-session actions. Naturally, the searcher ends his session by viewing
the object he was searching for (‘view full’ or ‘view brief’). But the graph also shows that users
spend more time on simple search than on the other, more advanced, search options.
    We think that research into user behaviour on the basis of search engine logs can be very
informative for the evaluation of search engine interfaces.

References
1. Baeza-Yates, R., Hurtado, C., Mendoza, M., Dupret, G.: Modeling user search behavior. In: LA-WEB
   ’05: Proceedings of the Third Latin American Web Congress. p. 242. IEEE Computer Society,
   Washington, DC, USA (2005)
2. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)

[Figure 2: directed graph of user actions with transition probabilities on the edges. Node labels
include ‘search sim’, ‘search adv’, ‘search res’, ‘search url’, ‘view full’, ‘view brief’,
‘see online’ and ‘jump to page’.]

Fig. 2. The transition network of user search action selection. The sizes of the circles represent the
probabilities for an asymptotic distribution, i.e. the overall behaviour of the searcher in the long term.
Transitions with probability p < 0.2 have been excluded to avoid cluttering. Consequently, not all
transitions sum to 1.

</pre>