<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A contrastive study of library search against ad-hoc search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suzan Verberne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Max Hinne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten van der Heijden</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard Hoenkamp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wessel Kraaij</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theo van der Weide</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Foraging Lab, Radboud University Nijmegen</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We investigated the search behaviour of the library searcher using the log data from The European Library (TEL). We were especially interested in how this behaviour compares to the search behaviour of ad-hoc searchers, represented by log data from the MSN search engine. At first sight, the two data sets mainly differ in the topics of the queries entered and their multi-lingual vs. mono-lingual content. When studying user behaviour, session information is very important: how does the user navigate through the engine's interface? We visualized the TEL users' interactions with the system by creating a transition network for the users' intra-session actions. In general, we think that research into user behaviour on the basis of search engine logs can be very informative for the evaluation of search engine interfaces.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Descriptions of the two data sets</title>
      <p>The ad-hoc web search data consists of approximately 12 million clicks from US users entered into
the Microsoft MSN search engine during the spring of 2006. For each query, the following details
are available: a query ID, the query itself, the user session ID (more on session information later),
a time-stamp, the URL of the clicked document, the rank of the URL and the number of results.</p>
      <p>The library search data consists of a little under 2 million records collected between January
1st of 2007 and the 30th of June of 2008. For these records, the following information is included:
a record ID, a user ID, an obfuscated user IP, a session ID, the chosen interface language, the
query, the user's action (e.g. simple search or search within search), a document collection ID, the
number of results, the rank of the clicked result, a search box ID, the URL of the object being
viewed and a time-stamp. Some records in the data do not contain a query: these are interactions
such as choosing a specific collection for the next search.</p>
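      <p>A number of key statistics of both data sets are shown in Table 1 and Table 2.</p>
      <p>1 http://www.uni-hildesheim.de/logclef/
2 http://search.theeuropeanlibrary.org/
3 http://research.microsoft.com/en-us/um/people/nickcr/wscd09/</p>
      <table-wrap>
        <label>Table 1</label>
        <caption><p>TEL data set</p></caption>
        <table>
          <tbody>
            <tr><td>Interactions</td><td>1 866 330</td></tr>
            <tr><td>Queries issued</td><td>1 345 508</td></tr>
            <tr><td>Unique queries</td><td>220 409</td></tr>
            <tr><td>Average query length</td><td>2.2</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <label>Table 2</label>
        <caption><p>MSN data set</p></caption>
        <table>
          <tbody>
            <tr><td>Clicks</td><td>12 251 067</td></tr>
            <tr><td>Queries issued</td><td>8 831 280</td></tr>
            <tr><td>Unique queries</td><td>3 875 427</td></tr>
            <tr><td>Average query length</td><td>2.5</td></tr>
          </tbody>
        </table>
      </table-wrap>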
    </sec>
    <sec id="sec-3">
      <title>Differences between the search engines</title>
      <p>A quick inspection of the TEL data immediately shows that the search topics are different from
the general-domain web search queries. Some example queries are ‘il vocabolario degli accademici
della crusca’, ‘twain mark’, ‘christie’, ‘maumet’, ‘vvedenskij’, ‘daylight’ and ‘dubois cg’. These
examples confirm that the database is multi-lingual.</p>
      <p>Some queries have a special form related to the metadata contained in the TEL index, for
example ‘(subject all experiential learning)’. These make up only a very small proportion of the queries:
0.13%. Boolean operators that combine atomic queries occur in 6.8% of the queries, for example
‘(title all keywords) and (creator all keywords)’. Note that all queries have been lowercased.</p>
      <p>Language selection. In contrast to the MSN search engine, the TEL interface gives the user the option to select the
interface language. Not all users use this option, so a relatively large number of records
use the default language setting, English. In total, the TEL click data contain 41 different
interface languages. The most frequent are shown in Table 3.</p>
      <p>A more striking difference between the data sets is the way the records are obtained. In the MSN
data set, a record is stored each time a user clicks on a result in the result list. As a consequence,
information about queries that did not result in a click is lost. In the TEL
data set, on the other hand, a record is stored each time the user interacts with the engine. Such an interaction can
be the execution of a query (the most common interaction), but also viewing a document, saving
a session or switching to another document collection.</p>
      <p>An important observation is the difference in the grouping of records per session. The MSN
data set considers a new session to start each time a user issues a new query. The TEL data set uses
a more intuitive notion of sessions, where a session contains all subsequent user interactions within
a certain time limit. Unfortunately, because of the way MSN sessions are stored, it is impossible
to directly compare MSN and TEL sessions, for example with respect to query reformulation.</p>
      <p>
        As a more exploratory analysis, we considered the distribution of the number of queries per session as a
basic descriptor of user search behaviour [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], for the TEL data set only. We plotted the cumulative
distribution of the number of queries per session, as well as the number of unique queries per
session, as shown in Figure 1. The plot shows that there is a significant number of sessions with
a large number of queries, but that the number of sessions with a large number of unique queries
is much lower.
      </p>
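      <p>The cumulative distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (pairs of session ID and query string), the function names and the toy records are hypothetical and are not taken from the TEL log format or from the analysis code used for this study.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def queries_per_session(records):
    """Return (total, unique) query counts per session ID.

    `records` is an iterable of (session_id, query) pairs. This layout is
    an assumption made for illustration, not the TEL log schema.
    """
    total = Counter()
    unique = defaultdict(set)
    for session_id, query in records:
        total[session_id] += 1
        unique[session_id].add(query)
    return dict(total), {s: len(qs) for s, qs in unique.items()}


def ccdf(counts):
    """Empirical Pr(Q > q) for every observed number of queries q."""
    counts = list(counts)
    n = len(counts)
    histogram = Counter(counts)
    remaining = n
    result = {}
    for q in sorted(histogram):
        # After this subtraction, `remaining` is the number of sessions
        # with strictly more than q queries.
        remaining -= histogram[q]
        result[q] = remaining / n
    return result


# Toy records (hypothetical): three sessions, one with a repeated query.
records = [
    ("s1", "mozart"), ("s1", "mozart"), ("s1", "mozart"),
    ("s2", "van gogh"),
    ("s3", "twain mark"), ("s3", "christie"),
]
total, unique = queries_per_session(records)
total_ccdf = ccdf(total.values())
unique_ccdf = ccdf(unique.values())
```
      </preformat>
      <p>In the toy data, session s1 counts three queries but only one unique query, which is the kind of gap between the two curves that the plot makes visible.</p>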
      <p>[Figure 1: cumulative distribution Pr(Q &gt; q) of the number of queries per session, q, on logarithmic axes, with one curve for all TEL queries and one for TEL unique queries.]</p>
      <p>We found that 16% of the queries are repeated three or more times by the same user in
the same session. We suppose that the user retried the query with different settings in the TEL
interface, or repeatedly interacted with the search engine to access a known document. We
further investigate this phenomenon below.</p>
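      <p>One way to compute a repetition share like the one reported above is sketched below. The text does not specify the exact denominator, so the sketch makes an explicit assumption: it counts distinct (session, query) pairs and reports the fraction of those pairs that occur at least three times. The function name, the record layout and the toy data are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def repeated_fraction(records, min_repeats=3):
    """Fraction of distinct (session, query) pairs seen at least `min_repeats` times.

    Assumption: `records` is an iterable of (session_id, query) tuples and
    the share is taken over distinct pairs, not over issued queries.
    """
    pair_counts = Counter(records)
    if not pair_counts:
        return 0.0
    repeated = sum(1 for n in pair_counts.values() if n >= min_repeats)
    return repeated / len(pair_counts)


# Toy records (hypothetical): only ("s1", "mozart") is issued three times.
example_records = [
    ("s1", "mozart"), ("s1", "mozart"), ("s1", "mozart"),
    ("s2", "van gogh"),
    ("s3", "twain mark"), ("s3", "christie"),
]
share = repeated_fraction(example_records)
```
      </preformat>
      <p>Taking the share over issued queries instead of distinct pairs would give a different number, so the choice of denominator should be stated alongside any such figure.</p>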
    </sec>
    <sec id="sec-4">
      <title>Intra-session user behaviour in TEL</title>
      <p>
        Ad-hoc web search consists to a large extent of navigational and transactional queries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
For the library searcher, these types of queries are much less relevant. This raises the question:
why are there so many repeated queries in the TEL log, if they do not serve navigational or
transactional purposes?
      </p>
      <p>As discussed earlier, the TEL search engine provides the user with several options to refine
a search. The language options have already been mentioned, but the search engine also offers
several actions alternative to basic search, such as:
– to use either a simple or an advanced search form,
– to search within results,
– to continue searching with a specific record as a starting point,
– to start searching with a URL as query string,
– to view a short or a long description (title) of a result,
– to view a retrieved object in the interface of the library, or to see it online.</p>
      <p>Some of these actions are not available in web search interfaces. To see how users work with this
functionality, we proceeded as follows. Each time a query was repeated within a session, we kept
track of the particular transition between the previously selected action and the newly selected
action.</p>
      <p>After normalizing so that for each action the probabilities of all possible transitions sum to 1, this resulted in
a transition matrix representing the users' selection of actions, which is displayed as a directed
graph in Figure 2. The sizes of the circles represent the overall behaviour of the searcher in the
long run: what is the probability that, at a given time, a user is involved in each action?</p>
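      <p>The normalization step and the long-run probabilities can be illustrated with a small Python sketch. It rests on several assumptions that are not stated in the text: each input sequence is the ordered list of actions with which one query was repeated in a session, the long-run probability is read as the stationary distribution of the resulting Markov chain, and that distribution is approximated by plain power iteration. The action names and the toy sequence are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict


def transition_probabilities(action_sequences):
    """Row-normalised transition probabilities between consecutive actions."""
    counts = defaultdict(lambda: defaultdict(int))
    for actions in action_sequences:
        for prev, nxt in zip(actions, actions[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, row in counts.items():
        row_total = sum(row.values())
        # Each row sums to 1 after dividing by the row total.
        probs[prev] = {nxt: c / row_total for nxt, c in row.items()}
    return probs


def long_run_distribution(probs, iterations=1000):
    """Approximate the stationary distribution by power iteration.

    Assumption: an action without observed outgoing transitions keeps its
    probability mass. Strictly periodic chains may not converge this way.
    """
    states = sorted(set(probs) | {t for row in probs.values() for t in row})
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        new = {s: 0.0 for s in states}
        for s, mass in dist.items():
            row = probs.get(s)
            if not row:
                new[s] += mass
                continue
            for t, p in row.items():
                new[t] += mass * p
        dist = new
    return dist


# Toy sequence (hypothetical): one query repeated with four actions.
sequences = [["search_sim", "search_sim", "view_full", "search_sim"]]
probs = transition_probabilities(sequences)
dist = long_run_distribution(probs)
```
      </preformat>
      <p>For the toy sequence, the simple search action moves to itself or to the full view with probability 0.5 each, and the long-run distribution puts about two thirds of the mass on simple search. In a drawing such as a transition network, these long-run values would determine the circle sizes and the row probabilities the edge labels.</p>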
    </sec>
    <sec id="sec-5">
      <title>Conclusion: How does the library searcher behave?</title>
      <p>We have investigated the search behaviour of the library searcher using the log data from The
European Library (TEL). We were especially interested in how this behaviour compares to the
search behaviour of ad-hoc searchers, represented by log data from the MSN search engine.</p>
      <p>Although the average length of the queries does not differ much between the two search
engines, the topics of the queries are very different. In web search there are many navigational and
transactional queries, which can be seen from the most frequent queries “Google”, “Yahoo” and
“Myspace” in the MSN data. Among the most frequent queries in the TEL data are many named
entities such as “mozart”, “van gogh” and “meisje met de parel”. Moreover, the TEL data contain
queries in multiple languages.</p>
      <p>When studying user behaviour, session information is very important: how does a searcher
formulate and reformulate queries, and how does he navigate through the engine's interface?
Unfortunately, the MSN data does not contain the type of session information that we need for
such an analysis, which makes the two data collections not comparable in this respect.</p>
      <p>Therefore, we investigated the interaction of the user with the search interface of TEL only. The
finding that 16% of the queries are repeated three or more times shows that the library searcher
makes an effort to get relevant results for his information need, often even without reformulating
his query.</p>
      <p>For research into user behaviour on the basis of click data, visualizing the users' interactions
with the system can be very informative. We did this for the TEL data by creating a transition
network for the users' intra-session actions. Naturally, the searcher ends his session by viewing
the object he was searching for (‘view full’ or ‘view brief’). But the graph also shows that users
spend more time on simple search than on the other, more advanced, search options.</p>
      <p>We think that research into user behaviour on the basis of search engine logs can be very
informative for the evaluation of search engine interfaces.</p>
      <p>[Figure 2: transition network of the TEL users' intra-session actions. Node labels recoverable from the figure include ‘search sim’, ‘search adv’, ‘search res’, ‘see online’ and ‘view full’; the edges carry transition probabilities.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hurtado</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendoza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dupret</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Modeling user search behavior</article-title>
          .
          <source>In: LA-WEB '05: Proceedings of the Third Latin American Web Congress</source>
          . p.
          <fpage>242</fpage>
          . IEEE Computer Society, Washington, DC, USA (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Broder</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A taxonomy of web search</article-title>
          .
          <source>SIGIR Forum</source>
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>10</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>