<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing Search Behavior in Web Archives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miguel Costa</string-name>
          <email>miguel.costa@fccn.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mário J. Silva</string-name>
          <email>mjs@di.fc.ul.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foundation for National Scientific Computing</institution>
          ,
          <addr-line>Lisbon</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Lisbon, Faculty of Sciences</institution>
          ,
          <addr-line>LaSIGE, Lisbon</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>33</fpage>
      <lpage>40</lpage>
      <abstract>
        <p>Web archives are a huge source of information to mine the past. However, tools to explore web archives are still in their infancy, in part due to the limited knowledge that we have of their users. We contribute to this knowledge by presenting the first search behavior characterization of web archive users. We obtained detailed statistics about the users' sessions, queries, terms and clicks from the analysis of their search logs. The results show that users did not spend much time and effort searching the past. They prefer short sessions, composed of short queries and few clicks. Full-text search is preferred to URL search, but both are frequently used. There is strong evidence that users prefer the oldest documents over the newest, but mostly search without any temporal restriction. We discuss all these findings and their implications for the design of future web archives.</p>
      </abstract>
      <kwd-group>
        <kwd>Portuguese Web Archive</kwd>
        <kwd>Search Behavior</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        The web has a democratic nature, where everyone can
publish all kinds of information. News, blogs, wikis,
encyclopedias, interviews and public opinions are just a few
examples. Part of this information is unique and
historically valuable. However, since the web is highly dynamic, a
large amount of information is lost every day. Ntoulas et al.
discovered that 80% of the web pages are not available after
one year [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In a few years they are all likely to disappear,
creating a knowledge gap for future generations. Most of
what has been written today will not persist and, as stated
by UNESCO, this constitutes an impoverishment of the
heritage of all nations [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>Several initiatives of national libraries, national archives
and consortia of organizations started to archive parts of the
web to cope with this problem<sup>1</sup>. Some country code
top-level domains and thematic collections are being archived
regularly<sup>2</sup>. Other collections related to important events,
such as September 11th, are created at particular points
in time<sup>3</sup>. In total, billions of web documents are already
archived and their number is increasing as time passes. The
Internet Archive alone has collected 150 billion documents since
1996. The documents also gain historic interest
as they age, becoming a unique source of past information
for widely diverse areas, such as sociology, history,
anthropology, politics or journalism. However, to make historical
analysis possible, web archives must turn from mere
document repositories into living archives. Innovative solutions
to search and explore them are required.</p>
      <p>
        Current web archives are built on top of web search
engine technology. This seems like the logical solution, since
the web is the main focus of both systems. However, web
archives enable searching over multiple web snapshots of the
past, while web search engines only enable searching over
one snapshot of the near present. Users of both systems
also have different information needs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Hence, we also
expected different search patterns and behaviors, which,
without a proper response, could degrade results and negatively
influence users’ satisfaction. We studied the above issues
and drew the first profile of how web archive users search.
It is based on the quantitative analysis of the Portuguese
Web Archive (PWA) search logs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Our results show that users of both types of systems have
similar behaviors. In general, web search technology can be
adopted to work on web archives. Nonetheless, our
identification of the users’ specificities provides insights into search
behavior that might better support the
architectural design decisions of future web archives. Examples
include optimizing their performance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or designing better
web interfaces [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>This paper is organized as follows. In Section 2, we cover
the related work. In Section 3, we describe the search
environment. The methodology of analysis is explained in
Section 4 and the results are detailed in Section 5. Section 6
closes with a discussion of the results and conclusions.</p>
      <p>Copyright 2011 for the individual papers by the papers’ authors. Copying
permitted only for private and academic purposes. This volume is published
and copyrighted by its editors.</p>
      <p>TWAW 2011, March 28, 2011, Hyderabad, India.</p>
      <p><sup>1</sup> See http://www.nla.gov.au/padi/topics/92.html</p>
      <p><sup>2</sup> See http://www.archive.org/</p>
      <p><sup>3</sup> See http://www.loc.gov/minerva/</p>
    </sec>
    <sec id="sec-3">
      <title>2. RELATED WORK</title>
    </sec>
    <sec id="sec-4">
      <title>2.1 Web Archive User Studies</title>
      <p>
        There are several web archiving initiatives currently
harvesting and preserving the web heritage, but very few studies
about web archive users. The International Internet
Preservation Consortium (IIPC) reported a number of possible
user scenarios over a web archive [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The scenarios are
related to professional scopes and are associated with the technical
requirements necessary to fulfill them. These requirements
include a wide variety of search and data mining
applications that have not been developed yet, but could play an
important role. However, the hypothetical scenarios did not
come directly from web archive users.
      </p>
      <p>
        The National Library of the Netherlands conducted a
usability test on the searching functionalities of its web archive
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Fifteen users participated in the test. One of the
results was a compiled list of the top ten functionalities that
users would like to see implemented. Full-text search was
the first one, followed by URL search. Strangely, time was
not mentioned in any of the top ten functionalities, despite
being present in all the processes of a web archive. The users’
choices can be explained by web archives being mostly based
on web search engine technology. As a result, web archives
offer the same search functionalities. This inevitably
constrains the users’ behaviors. Another explanation is that
Google became the norm, influencing the way users search
in other settings.
      </p>
      <p>
        In a previous publication, we studied the information needs
of web archive users [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We resorted to three instruments to
collect quantitative and qualitative data, namely search logs,
an online questionnaire and a laboratory study. The
observations from the three instruments coincided. Users perform mostly navigational
searches without a temporal restriction. Other findings show
that users prefer full-text over URL search and the oldest
documents over the newest, and that many information needs are
expressed as names of people, places or things. Results also
show that users from web archives and web search engines
have different information needs, which cannot be effectively
supported by the same technology.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Search Log Analysis</title>
      <p>
        Web usage mining focuses on using data mining to
analyze search logs or other activity logs to discover interesting
patterns. Srivastava et al. pointed out five applications for web
usage mining: personalization, for adjusting the results
according to the user’s profile; system improvement, for a fast
and efficient use of resources; site modification, for providing
feedback on how the site is being used; business intelligence,
for knowledge discovery aimed at increasing customer sales;
and usage characterization, to predict users’ behavior [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
We focus on usage characterization. However, our results
can be applied to other purposes, such as the efficiency and
effectiveness improvements of IR systems [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        Search logs capture a large and varied amount of
interactions between users and search engines. This large number
of interactions is less susceptible to bias and enables
identifying stronger relationships among data. Additionally, search
logs can be analyzed at low cost and in a non-intrusive way.
Most users are not aware that their interactions are being
logged. Users also try to fulfill their real information needs,
instead of having tasks assigned by a researcher that can
bias their behaviors. On the other hand, search logs are
limited to what can be registered. They ignore the contextual
information about users, such as their demographic
characteristics, the motivations that lead them to start searching,
and their degree of satisfaction with the system. Qualitative
studies, such as surveys and laboratory studies, can
complement log analysis with information that can explain some of
the patterns found [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Several logs from web search engines were analyzed with
the goal of understanding how these systems were used. A
common observation across these studies is that most users
conduct short sessions with only one or two queries,
composed of one or two terms each [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. When users submit
more than one query, they tend to refine the next query
by changing one term at a time. Most users only see the
first search engine results page (SERP) and rarely use
advanced search operators. These discoveries imply that the
use of web search engines is different from traditional IR
systems, which receive queries three to seven times longer
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Queries for special topics (e.g. sex), special types (e.g.
question-format) and multimedia formats (e.g. images) are
also longer [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This shows that the users’ behavior varies
not only between IR systems, such as search engines, online
catalogs and digital libraries, but also depends on the type
of information and the way users search. Another aspect
that differentiates search behavior is users’ demographics
(i.e. age, gender, ethnicity, income, educational level) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. THE SEARCH ENVIRONMENT</title>
      <p>The PWA preserves the Portuguese web, which is
considered the subset having the most interesting contents for
the Portuguese community. Specifically, we define the
Portuguese web as all the documents<sup>4</sup> satisfying one of the
following rules: (1) hosted on a site under a .PT domain; (2)
hosted on a site under another domain, but embedded in a
document under the .PT domain; (3) suggested by the users
and manually validated by the PWA team. Additionally,
the PWA team integrated web collections from several other
sources, such as the Internet Archive and the Portuguese
National Library. The number of indexed documents has been
growing and there are now more than 180 million documents
accessible by full-text and URL search. As far as we know, this is
the largest web archive collection searchable by full-text and
over such a large time span (from 1996 to 2009). The
experimental version of the PWA has been available as a service
to the general public since 2010 at http://archive.pt/.</p>
      <p>The interaction with the users and the layout of the
results is similar to web search engines, such as Google. In a
typical session, a user can submit a full-text query and
receive a search engine results page (SERP) containing a list
of 10 results matching the query. Figure 1 illustrates this
case. Each result includes the title of the web page and its
crawled date, a snippet of text containing the query terms
and the URL. The user can then click on the results to see
and navigate in the web pages as they were in the past. If
the desired information is not found, the user can repeatedly
modify and resubmit the query. In addition, the user can
click on the navigation links to explore other SERPs or use
the advanced search interface to restrict the query with
advanced search operators. These operators can also be added
to the query directly in the text box.</p>
      <p><sup>4</sup> The terms document and file are used interchangeably in this study.
For instance, a document can be a web page, an image or a PDF file.</p>
      <p>This interface has some specificities. First, the text box is
complemented with a date range filter to narrow the results
to a time period. Second, each result has an associated link
to see all versions throughout time of the respective URL.
When clicked, the PWA presents the same search engine
versions page (SEVP) as when a user submits that URL
in the text box. A table is shown to the user, where each
column contains all the versions of a year sorted by date.
The user can then click on any version to see it as it was on
that date. Figure 2 depicts this interface.</p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Logs Dataset</title>
      <p>Our analysis is based on the logs of the PWA, covering
seven months of search interactions, from June to
December, 2010. By interactions, we mean all queries and clicks
submitted by the users and recorded by the PWA search
engine (server side). The seven-month span has the advantage
of being less likely to be affected by ephemeral trends.</p>
      <p>The logs follow the Apache Common Log Format<sup>5</sup>. Each
entry corresponds to an interaction with the search engine
in the form of an HTTP request. It contains the user’s IP
address and the user’s session identifier. Each entry also
contains a timestamp indicating when the interaction occurred
and the HTTP request line that came from the client.</p>
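      <p>As an illustration of how such entries could be read, the following minimal sketch parses one line in the standard Common Log Format. It is not the code used in this study. The field names are our own, and the position of the session identifier within the entry is not specified here, so the sketch extracts only the standard fields.</p>
      <preformat><![CDATA[
```python
import re
from datetime import datetime

# Standard Common Log Format: host ident authuser [time] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if the line is malformed."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 10/Oct/2010:13:55:36 +0100
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry
```
]]></preformat>
      <p>Returning None for malformed lines makes it straightforward to discard incomplete entries in a later cleansing step.</p>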
      <p>We never used the log data to match a real identity.
However, we geographically mapped the IP addresses for a
better characterization of the users. We counted 72% of PWA’s
users with IP addresses assigned to Portugal. Nearly 89%
of the interactions were submitted through the Portuguese
language interface. The remainder were submitted through
the English language interface. This strongly indicates that
users were mostly Portuguese.</p>
      <p><sup>5</sup> See http://httpd.apache.org/docs/2.0/logs.html</p>
    </sec>
    <sec id="sec-8">
      <title>4. METHODOLOGY</title>
      <p>The analysis focused on four dimensions: sessions, queries,
terms and clicks. We define them in the following way:</p>
      <list list-type="bullet">
        <list-item>
          <p>A session is a set of interactions by the same user when
attempting to satisfy one information need. The
session is the level of analysis in determining the success
or failure of a search. It is composed of one or more
queries and zero or more clicks.</p>
        </list-item>
        <list-item>
          <p>A query is a search request composed of a set of terms.
We define an initial query as the first query submitted
in a session, while all the following queries are defined
as subsequent. An identical query is a query with
exactly the same terms as the previous one submitted in
the same session. A unique query corresponds to one
query regardless of the number of times it was logged.
The set of unique queries is the set of query
variations. An advanced query is a query with at least one
advanced search operator.</p>
        </list-item>
        <list-item>
          <p>A term is a series of characters bounded by white
spaces, such as words, numbers, abbreviations, URLs,
symbols or combinations of them. There are also
advanced search operators, but they do not count as
terms. We define a unique term as one term in the
dataset regardless of the number of times it was logged.
The set of unique terms is the submitted lexicon.</p>
        </list-item>
        <list-item>
          <p>A click in this context refers to the following of a
hyperlink to immediately view a query result (i.e. an archived
web page). It can be a SERP click or a SEVP click,
depending on whether the user clicks in a SERP or a SEVP.</p>
        </list-item>
      </list>
      <p>Next, we briefly present the methods used in the search
log analysis.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Log Preparation</title>
      <p>
        We prepared the log fields for analysis through a series of
data cleansing steps. All incomplete entries, empty queries
and sessions without any query were discarded. Internal
queries submitted by the PWA monitoring system, the example
queries displayed on the PWA entry page and sessions
conducted by clients identified as web crawlers were also
excluded. Additionally, sessions with more than 100 queries
were likely to come from crawlers, so they were removed
too. This cutoff value of 100 was used in some other
studies, thus enabling a more direct comparison with our results
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Requests that resulted from navigation clicks to see
another SERP were not counted as new queries. These are
the same queries parameterized to show more results.
      </p>
      <p>All terms were normalized to lowercase. Extra white spaces
were removed. Since the PWA did not perform stemming,
all variations of a query term were considered as different
terms. The set of query terms also includes misspellings.</p>
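      <p>The normalization and the session cutoff described above can be sketched as follows. This is an illustrative sketch under our own naming, not the PWA code. The function names are hypothetical and the only constant taken from the text is the cutoff of 100 queries.</p>
      <preformat><![CDATA[
```python
MAX_QUERIES_PER_SESSION = 100  # cutoff stated in the text


def normalize_query(raw_query):
    """Lowercase the query and collapse runs of white space into single spaces."""
    return " ".join(raw_query.lower().split())


def query_terms(raw_query):
    """Split a query into terms bounded by white space; no stemming is applied."""
    return normalize_query(raw_query).split()


def keep_session(queries):
    """Keep a session only if it has at least one non-empty query
    and no more than MAX_QUERIES_PER_SESSION of them."""
    non_empty = [q for q in queries if normalize_query(q)]
    return 0 < len(non_empty) <= MAX_QUERIES_PER_SESSION
```
]]></preformat>
      <p>Because no stemming is applied, variations such as "archive" and "archives" remain distinct terms, as in our analysis.</p>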
    </sec>
    <sec id="sec-10">
      <title>4.2 Session Delimitation</title>
      <p>
        Most studies used the users’ IP address and/or session
identifier to delimit sessions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We used these two
parameters to track and delimit user interactions. We also
used a time interval t of inactivity to delimit sessions. Two
consecutive interactions are included in different sessions if
the period of inactivity between them is at least t.
Without this gap, we could have sessions of several days, which
would hardly represent reality. Studies diverge on the
choice of this interval, from 5 to 120 minutes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], while
others argue that no time boundary is effective in segmenting
sessions [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We selected the 30-minute interval, because
this interval has been shown to produce good results, close to the
results produced by SVM classifiers that were designed for
delimiting sessions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>5. LOG ANALYSIS</title>
      <p>Statistics were computed from the logged interactions.
The first pattern that we detected was that users mostly
conducted two types of sessions: with only full-text queries
and with only URL queries, in 59.34% and 31.10% of the
cases, respectively. We defined these as full-text sessions
and URL sessions. For simplicity, the analysis ignores the
remaining 9.57% of sessions, which mixed both query types.</p>
      <p>Table 1 shows the general statistics. The users of the PWA
performed 6,177 full-text sessions, averaging 2.23 queries per
session. The number of query terms per query was 2.84,
with 6.42 characters per term. The users saw 1.44 SERPs
per query and clicked 1.06 times on their hyperlinks to view
a result. They hardly ever clicked to see all versions of a result.
This happened only 0.06 times per query. Overall, these
results mean that for each query, the users saw mostly the
first and sometimes the next SERP, where they clicked once.</p>
      <p>Table 1: General statistics. Rows: Sessions; Queries; Terms; SERPs; Clicks on SERPs; Clicks on SEVPs; Queries per Session; Terms per Query; SERPs per Query; Clicks on SERP per Query; Clicks on SEVP per Query; Characters per Term; Initial Queries; Subsequent Queries (Modified, Identical, Terms Swapped, New); Unique Queries; Unique Terms; Queries never repeated; Terms never repeated.</p>
      <p>The users also submitted 3,237 URL sessions, roughly half
of the full-text sessions. On average, each session had 1.54
queries with 27.27 characters. Half of the URLs submitted,
50.24%, were not found in the PWA. For the URLs found,
the users clicked on 1.56 versions to see them as they were
in the past. Basically, a user submitted a URL and saw one or
two versions of that URL. Next, we will detail our analysis
and explain the remaining results.</p>
    </sec>
    <sec id="sec-12">
      <title>5.1 Session Level Analysis</title>
      <sec id="sec-12-1">
        <title>5.1.1 Session duration</title>
        <p>
          The duration of a session is measured from the time the
first query is submitted until the last time the user interacted
with the PWA. We cannot tell whether the user spent more session time
viewing the archived web pages after the last interaction or
used part of the time doing parallel tasks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. We assigned a
duration of 0 minutes to sessions composed of only one query.
        </p>
        <p>The large majority of sessions ended quickly, as shown in
Table 2. Around 60% of the full-text sessions lasted less
than 1 minute and 89% less than 10 minutes. Only around
3% of the sessions lasted longer than half an hour.
Each session took on average 4 minutes and 8 seconds. URL
sessions took even less time than full-text sessions. On
average, each session took 1 minute and 14 seconds. Around
81% of the sessions lasted less than 1 minute and only 6%
took longer than 5 minutes.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>5.2 Query Level Analysis</title>
      <sec id="sec-13-1">
        <title>5.2.1 Modified queries</title>
        <p>Sometimes users submit sequences of queries as a way to
refine or reformulate the search in a trial and error approach.
We consider that two sequential queries submitted on the
same session have the same information need if they share
at least one term. In this case, we called the second query
a modified query. We ignored stopwords (very common
terms) in this analysis. Thus, a modified query could be a
specialization of the query (adding terms), a generalization
(removing terms) or both at the same time.</p>
        <p>
          Modified queries accounted for 44.53% of all
subsequent full-text queries. Looking at Table 4, we see that
around 71% of the modified queries result from a change of zero
or one in the number of terms. A zero length change
means that the users modified some terms, but their
number remained the same. Users tend to add terms in the
modified queries rather than remove them. We counted
around 42% versus 25%, respectively. PWA’s users tend to go from broad
to narrow queries, as in web search engines [
          <xref ref-type="bibr" rid="ref12 ref22 ref3">3, 12, 22</xref>
          ].
        </p>
        <table-wrap>
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th>advanced operator</th><th>% advanced queries</th></tr>
            </thead>
            <tbody>
              <tr><td>NOT</td><td>3.62%</td></tr>
              <tr><td>PHRASE</td><td>78.10%</td></tr>
              <tr><td>SITE</td><td>12.81%</td></tr>
              <tr><td>TYPE</td><td>5.48%</td></tr>
              <tr><td>total</td><td>100.00%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>A variety of reasons can lead users to repeat queries, such
as a refresh of the SERP or SEVP, a back-button click or
the submission of the same query more than once due to a
network or search engine delay. When analyzing the full-text
queries, we counted 20.35% of identical queries, where
each query has exactly the same terms as the previous one
made in the same session (see Table 1). We also counted the
subsequent queries with the same terms, but written in a
different order, for instance, the query Web Archive followed
by the query Archive Web. Only a small number of subsequent
queries, 3.75%, had the order of the terms swapped. Besides
the modified and identical queries, the users also submitted
31.37% of subsequent queries with only new terms. This
indicates that at most this percentage of subsequent queries
were the result of a new information need.</p>
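        <p>The four categories of subsequent queries discussed above (identical, terms swapped, modified and new) can be illustrated with the following sketch. The stopword list is a small placeholder chosen for the example, not the list used in our analysis, and the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
# Placeholder stopword list for illustration only.
STOPWORDS = frozenset({"a", "o", "de", "da", "do", "e", "the", "of"})


def classify_subsequent(previous, current, stopwords=STOPWORDS):
    """Label `current` relative to the query submitted just before it
    in the same session."""
    prev_terms = previous.lower().split()
    curr_terms = current.lower().split()
    if curr_terms == prev_terms:
        return "identical"
    if sorted(curr_terms) == sorted(prev_terms):
        return "terms swapped"
    # Modified: the two queries share at least one term that is not a stopword.
    shared = (set(prev_terms) & set(curr_terms)) - stopwords
    return "modified" if shared else "new"
```
]]></preformat>
        <p>The order of the checks matters: an identical query also shares terms with its predecessor, so it has to be recognized before the test for modified queries.</p>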
        <p>We divided the subsequent URL queries into identical and
new. Of these, 78.56% were new. The
remaining 21.44% resubmitted the same URL
(see Table 1).</p>
        <p>In the PWA, users could use four advanced search
operators: NOT, to exclude all results with a term in their text
(e.g. -web); PHRASE, to match all results with a phrase
in their text (e.g. “web archive”); SITE, to match all
results from a domain name (e.g. site:wikipedia.org); and TYPE,
to match all results of a media type (e.g. type:PDF).</p>
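        <p>Given the operator syntax in the examples above, advanced queries could be detected with simple patterns such as the following. This is a sketch under the assumption that operators appear exactly as in those examples. The regular expressions are ours and do not handle every corner case (for instance, a minus sign inside a quoted phrase).</p>
        <preformat><![CDATA[
```python
import re

OPERATOR_PATTERNS = {
    "NOT": re.compile(r'(?:^|\s)-\S+'),        # e.g. -web
    "PHRASE": re.compile(r'"[^"]+"'),          # e.g. "web archive"
    "SITE": re.compile(r'(?:^|\s)site:\S+', re.IGNORECASE),
    "TYPE": re.compile(r'(?:^|\s)type:\S+', re.IGNORECASE),
}


def advanced_operators(query):
    """Return the set of operator names found in the query."""
    # Treat typographic quotes like straight quotes.
    normalized = query.replace("\u201c", '"').replace("\u201d", '"')
    return {
        name
        for name, pattern in OPERATOR_PATTERNS.items()
        if pattern.search(normalized)
    }


def is_advanced(query):
    """A query is advanced if it has at least one advanced search operator."""
    return bool(advanced_operators(query))
```
]]></preformat>
        <p>Requiring white space or the start of the query before the minus sign avoids counting hyphenated words, such as full-text, as uses of the NOT operator.</p>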
        <p>
          Table 5 presents the percentages of advanced queries (i.e.
with at least one advanced search operator). It shows that
25.86% of the queries included operators. This is a
significantly higher percentage when compared with studies over
web search engines [
          <xref ref-type="bibr" rid="ref12 ref22 ref9">9, 12, 22</xref>
          ]. The reason is the PHRASE
operator, which represents 78.10% of the choices. The PWA
suggested a URL within quotes for each URL submitted, to
inform the users that they could match the URL in the text.
However, even when ignoring the URLs within quotes, the
percentages are roughly the same. The second most used
operator was SITE, occurring in 12.81% of the advanced
queries. The TYPE and NOT operators were rarely
used when compared to the total number of queries.
        </p>
        <p>
          The distribution of the terms per full-text query listed in
Table 6 shows that the majority of the queries had 1 or 2
terms. This is also reflected in the average of 2.84 terms per
query (see Table 1). Around 87% of the queries had up to
5 terms and only 3% had 10 or more terms. These results
indicate that the users tend to submit short queries. These
values are useful, for instance, to optimize index structures
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] or to determine the adequate length of the input text
boxes on the user interface [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          The users saw on average about 1.44 SERPs per full-text
query. All users saw the first SERP as expected, since the
PWA always returned it after a query. Then, the users
followed the natural order of the SERPs, but in a sharp decline
(see Table 7). For instance, the second SERP was viewed in
14.44% of the queries. This indicates that prefetching the
second SERP would not significantly improve web archive
performance. On the other hand, the close percentages
of the following SERPs indicate that prefetching them can
bring improvements as shown in other studies [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-13-2">
        <title>5.2.6 Clicks on SERPs</title>
        <p>
          About 66% of the clicks occurred on the first SERP, out of
almost one click per query. The users clicked 1.06 times per
query to access an archived web page listed on the SERPs.
We observed that the clicks over the result ranks
follow a power law distribution, with a 0.88 correlation (see
Figure 3). These results are similar to web search engine
studies, which also present a discontinuity in the last
ranking position of each SERP (multiple of 10) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-13-3">
        <title>5.2.7 Query frequency distribution</title>
        <p>
          We ranked the full-text unique queries by their decreasing
frequency and verified that their distribution fits the power
law with a 0.96 correlation. This finding was also observed
in web search engines [
          <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
          ]. It means that a small number of
queries was submitted many times, while a large number of
queries were submitted just a few times. Figure 4 depicts the
cumulative distribution of queries. For instance, by caching
around 27% of the most frequent queries, the PWA could
respond to 50% of the total query volume.
        </p>
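        <p>The cumulative distribution can be read as the share of unique queries, taken from the most frequent downwards, needed to cover a given fraction of the query volume. The following sketch computes that share from a list of logged queries. It illustrates the measure only, is not the code used in the study, and the function name is our own.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def unique_share_for_coverage(queries, target=0.5):
    """Fraction of unique queries (most frequent first) whose summed
    frequency reaches at least `target` of the total query volume."""
    counts = sorted(Counter(queries).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered >= target * total:
            return rank / len(counts)
    return 1.0
```
]]></preformat>
        <p>With a target of 0.5, this value corresponds to the percentage of the most frequent queries that a cache would need to hold to answer half of the query volume.</p>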
        <p>
          We also ranked the URL unique queries by their
decreasing frequency and verified that their distribution, once again,
fits the power law with a 0.96 correlation. By caching around
32% of the most frequent URL queries, the PWA could
respond to 50% of the queries. Although satisfactory, the
percentage of queries that must be cached is much higher than in
previous studies [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This is likely due to the small number of
sessions analyzed, which leads to a reduced repetition.
        </p>
        <p>As a consequence of the users’ queries and clicks
following a power law distribution, the archived pages seen by
the users also follow a power law distribution, with a 0.94
correlation. This applies to both full-text and URL sessions.</p>
        <p>
          Analogous to the query frequency distribution, we ranked
the full-text unique terms by their decreasing frequency.
Their distribution fits the power law with a 0.97 correlation.
As depicted in Figure 5, the cumulative distribution shows
that it is necessary to cache just around 6% of the most
frequent terms to handle 50% of the queries. Much less RAM is
necessary to cache terms than queries for a similar hit rate.
These results are consistent with others presented for web
search engines [
          <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
          ]. However, caching the terms instead of
the queries adds extra processing over the posting lists of
the inverted index, to evaluate the documents matching the
query. A proper trade-off must be found.
        </p>
        <p>The users restricted 23.55% of the full-text
queries by the end date, but only 1.64% by the start date. The start
and end dates were both changed in 12.98% of the queries.
The same pattern exists in URL queries as shown in Table 8,
where the start date was changed almost only when the end
date also was. This indicates that users are more interested
in old documents. The idea is reinforced by the distribution
of the years included in the full-text queries restricted by
date. As can be seen in Figure 6, the older the year, the
more likely it is to be included in queries. However,
the URL queries have an almost constant rate.</p>
      </sec>
      <sec id="sec-13-4">
        <title>5.4.2 Clicks on temporal versions</title>
        <p>Documents tend to have just a few years with archived
versions, so segmenting the number of clicks per year would
likely bias the results. Instead, we computed, for all URL
queries, the percentage of clicks on each year y<sub>i</sub> with at least
one version. We measured it as
<inline-formula><tex-math><![CDATA[\frac{\mathit{clicks}(y_i)}{\mathit{times}(y_i)}]]></tex-math></inline-formula>, where the
denominator represents the number of times the year y<sub>i</sub> was displayed
to the user and the numerator the number of clicks on y<sub>i</sub>.
For instance, the first year y<sub>1</sub> is 1997 if there are no archived
versions of that URL in 1996. Otherwise, y<sub>1</sub> is 1996.</p>
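        <p>The measure described above divides the number of clicks on the i-th year with archived versions by the number of times that year was displayed. The Python sketch below illustrates it under stated assumptions: the input layout (one pair of displayed years and clicked years per result page) and all names are hypothetical, and a year is counted as clicked at most once per display.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def click_rate_by_year_rank(impressions):
    """Compute clicks / times displayed for each rank of year.

    `impressions` is an iterable of (years_with_versions, clicked_years)
    pairs, one per result page shown for a URL query. Rank 1 is the oldest
    year that has an archived version of that URL.
    """
    times = defaultdict(int)
    clicks = defaultdict(int)
    for years, clicked in impressions:
        # Rank the years per URL, so rank 1 may be 1996 for one URL
        # and 1997 for another.
        for rank, year in enumerate(sorted(set(years)), start=1):
            times[rank] += 1
            if year in clicked:
                clicks[rank] += 1
    return {rank: clicks[rank] / times[rank] for rank in sorted(times)}


# Hypothetical toy data, not taken from the PWA logs.
toy = [
    ([1997, 1998, 2000], {1997}),
    ([1996, 1999], {1996, 1999}),
    ([2001], set()),
]
rates = click_rate_by_year_rank(toy)
for rank, rate in rates.items():
    print(f"year rank {rank}: clicked {rate:.0%} of the times displayed")
```
]]></preformat>
        <p>Ranking years per URL rather than by calendar year avoids penalizing years in which a given URL simply has no archived version.</p>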
        <p>Figure 7 shows that users clicked much more on the
first year with archived versions than on the remaining years.
The first year was clicked 55% of the times it was displayed, while each of the
others was clicked at most 20% of the time. With the exception of the
eighth year, the first three years had the highest percentages.
This shows a preference for the older documents.</p>
      </sec>
      <sec id="sec-13-5">
        <title>Implicit temporal queries</title>
        <p>[Figure axis label: rank of years with archived versions]</p>
        <p>We counted the number of queries with temporal
expressions, since they represent a time-dependent intent. We
started by experimenting with named-entity recognition tools for
Portuguese. However, queries are not grammatical, so the
tools had low precision. Instead, we used a simple
match of all the queries against year, month and day patterns.
Then, we classified a random subset of 1,000 queries to
validate our detection patterns. Surprisingly, they worked very
well. The patterns achieved a precision, recall and accuracy
of 89%, 100% and 98%, respectively. The patterns created
some false positives but, unexpectedly, no false negatives.
This was mostly because there were no temporal
expressions without date patterns (e.g. last decade) in the logs.</p>
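        <p>The pattern-based detection and its validation can be sketched as follows. This is a minimal Python illustration, not the patterns actually used: the regular expressions are assumptions (with English month names for readability, whereas the logs are Portuguese), and the labelled sample is invented.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical patterns; the ones used in the study are not listed.
YEAR = r"\b(?:19|20)\d{2}\b"
MONTH = (r"\b(?:january|february|march|april|may|june|july|august|"
         r"september|october|november|december)\b")
DAY = r"\b\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?\b"
TEMPORAL = re.compile("|".join((YEAR, MONTH, DAY)), re.IGNORECASE)


def has_date_pattern(query):
    """True if the query contains a year, month name or day-like pattern."""
    return TEMPORAL.search(query) is not None


def evaluate(predicted, actual):
    """Return (precision, recall, accuracy) for parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Invented sample: (query, manually judged as a temporal expression?)
sample = [
    ("world cup 2006", True),
    ("12/05/2004 news", True),
    ("model 2000 printer", False),  # year-like number: a false positive
    ("lisbon weather", False),
]
predicted = [has_date_pattern(query) for query, _ in sample]
actual = [label for _, label in sample]
precision, recall, accuracy = evaluate(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```
]]></preformat>
        <p>As in the validation described above, simple patterns of this kind tend to err towards false positives, such as numbers that only look like years, rather than false negatives.</p>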
        <p>
          All matches were manually validated, and we
excluded the false positives. In the end, we counted 3.49% of
queries with temporal expressions. Almost all are related
to past events, such as <italic>world cup 2006</italic>. This is a small
percentage, in line with the 1.5% of temporal expressions
found in the logs of the AOL web search engine [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          Search patterns of users of web archives and web search
engines are contrasted in Table 9. Web archive users submit
more single-query sessions, which is reflected in a smaller
number of queries per session. In a nutshell, web archive users
iterate less. This can be explained by most of the
information needs of web archive users being navigational, contrary
to the needs of web search engine users [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Web archive
users search for known items using names, titles and URLs,
some within quotes, which give good clues to the desired
information. Another explanation is that web archive users
submit longer queries, which could lead to better results.
        </p>
        <p>
          On the other hand, the single-term queries, the SERPs
viewed per query and the most seen topic are consistent
with web search engine results [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref16 ref3">3, 10, 11, 12, 16</xref>
          ]. The
classification of searched topics in web archives followed a different
taxonomy, so they are not directly comparable. Still,
Commerce is the most searched topic for navigational queries and
People for informational queries [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Overall, the search patterns of the users of both types of
systems show no evidence precluding the adoption of web
search engine technology for web archive search. This was a
surprise to us, because users of both systems have
different information needs. For instance, users said they wanted
to see the evolution of a page over time, but they
tend to click on only one or two versions of each URL. All
information needs of the users are focused on the past, but most
of the user queries are neither restricted by date nor contain
temporal expressions. Users search as they do in web search engines.
This behavior may be a consequence of our having offered
a similar interface, leading them to search in a similar way.</p>
        <table-wrap>
          <label>Table 9</label>
          <table>
            <tbody>
              <tr><th>IR system</th><td>web search engine</td></tr>
              <tr><th>world region</th><td>U.S.</td></tr>
              <tr><th>name</th><td>Excite [<xref ref-type="bibr" rid="ref11 ref24">11, 24</xref>]</td></tr>
              <tr><th>single query session</th><td>55%-60%</td></tr>
              <tr><th>queries per session</th><td>2.3</td></tr>
              <tr><th>single term queries</th><td>20%-30%</td></tr>
              <tr><th>terms per query</th><td>2.6</td></tr>
              <tr><th>advanced queries</th><td>11%-20%</td></tr>
              <tr><th>SERPs viewed per query</th><td>1.7</td></tr>
              <tr><th>topic most seen</th><td>Commerce, Travel</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Hence, new types of interfaces must be tried, such
as the temporal distribution of documents matching a query
or timelines, which could create a richer perception of time
for the user and eventually trigger different search behaviors.</p>
        <p>Nevertheless, identifying the users’ specificities
might contribute to the development of better adapted web
archives. We observe a strong preference for searching and
viewing the oldest documents over the newest. This finding
can be used in ranking results when no other temporal data
is given. The ranking should also be tuned for navigational
queries when the query type is unknown. Queries, terms,
clicked ranks and seen archived pages follow a power law
distribution. This means that, in each case, a small fraction
is repeated many times, which can be exploited to increase the
performance of web archives.</p>
        <p>The PWA is still experimental and has a much smaller
user base than commercial web search engines. We
believe that the results obtained are general, but studies over
larger datasets and from other web archives are necessary to
confirm this. Our future work will use these results to
improve the architecture and retrieval algorithms of the PWA.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work could not have been done without the help and
infrastructure of the PWA team. We thank Michel da Corte
for her review of the paper and FCT (Portuguese research
funding agency) for its Multiannual Funding Programme.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gionis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. P.</given-names>
            <surname>Junqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Design trade-offs for search engine caching</article-title>
          .
          <source>ACM Transactions on the Web</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hurtado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Dupret</surname>
          </string-name>
          .
          <article-title>Modeling user search behavior</article-title>
          .
          <source>In Proc. of the 3rd Latin American Web Congress</source>
          , page
          <fpage>242</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Costa</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>A search log analysis of a Portuguese web search engine</article-title>
          .
          <source>In Proc. of the 2nd INForum - Simpósio de Informática</source>
          , pages
          <fpage>525</fpage>
          -
          <lpage>536</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Costa</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>Understanding the information needs of web archive users</article-title>
          .
          <source>In Proc. of the 10th International Web Archiving Workshop</source>
          , pages
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          .
          <article-title>Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>51</fpage>
          -
          <lpage>78</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miranda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Costa</surname>
          </string-name>
          .
          <article-title>Introducing the Portuguese web archive initiative</article-title>
          .
          <source>In Proc. of the 8th International Web Archiving Workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. T. W.</given-names>
            <surname>Group</surname>
          </string-name>
          .
          <article-title>Use cases for access to Internet Archives</article-title>
          .
          <source>Technical report, Internet Preservation Consortium</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <article-title>Search User Interfaces</article-title>
          . Cambridge University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hölscher</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Strube</surname>
          </string-name>
          .
          <article-title>Web search behavior of Internet experts and newbies</article-title>
          .
          <source>Computer networks</source>
          ,
          <volume>33</volume>
          (
          <issue>1-6</issue>
          ):
          <fpage>337</fpage>
          -
          <lpage>346</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jansen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          .
          <article-title>An analysis of Web searching by European AlltheWeb.com users</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ):
          <fpage>361</fpage>
          -
          <lpage>381</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jansen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          .
          <article-title>How are we searching the World Wide Web? A comparison of nine search engine transaction logs</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ):
          <fpage>248</fpage>
          -
          <lpage>263</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Saracevic</surname>
          </string-name>
          .
          <article-title>Real life, real users, and real needs: a study and analysis of user queries on the Web</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>207</fpage>
          -
          <lpage>227</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Klinkner</surname>
          </string-name>
          .
          <article-title>Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs</article-title>
          .
          <source>In Proc. of the 17th ACM Conference on Information and Knowledge Management</source>
          , pages
          <fpage>699</fpage>
          -
          <lpage>708</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          .
          <article-title>Methods for evaluating interactive information retrieval systems with users</article-title>
          , volume 3 of
          <source>Foundations and Trends in Information Retrieval</source>
          . Now Publishers Inc.,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lucchese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Mining query logs to optimize index partitioning in parallel web search engines</article-title>
          .
          <source>In Proc. of the 2nd International Conference on Scalable Information Systems</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Markey</surname>
          </string-name>
          .
          <article-title>Twenty-five years of end-user searching, Part 1: Research findings</article-title>
          .
          <source>American Society for Information Science and Technology</source>
          ,
          <volume>58</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1071</fpage>
          -
          <lpage>1081</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ntoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          .
          <article-title>What's new on the web?: the evolution of the web from a search engine perspective</article-title>
          .
          <source>In Proc. of the 13th International Conference on World Wide Web</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>David</surname>
          </string-name>
          .
          <article-title>Use of temporal expressions in web search</article-title>
          .
          <source>In Proc. of the Advances in Information Retrieval, 30th European Conference on IR Research</source>
          , pages
          <fpage>580</fpage>
          -
          <lpage>584</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozmutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ozmutlu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          .
          <article-title>Multitasking Web searching and implications for design</article-title>
          .
          <source>American Society for Information Science and Technology</source>
          ,
          <volume>40</volume>
          (
          <issue>1</issue>
          ):
          <fpage>416</fpage>
          -
          <lpage>421</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Query chains: learning to rank from implicit feedback</article-title>
          .
          <source>In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining</source>
          , pages
          <fpage>239</fpage>
          -
          <lpage>248</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ras</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>van Bussel</surname>
          </string-name>
          .
          <article-title>Web archiving user survey</article-title>
          .
          <source>Technical report, National Library of the Netherlands (Koninklijke Bibliotheek)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Silverstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Marais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Henzinger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Moricz</surname>
          </string-name>
          .
          <article-title>Analysis of a very large web search engine query log</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , volume
          <volume>33</volume>
          , pages
          <fpage>6</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Mining query logs: Turning search usage data into knowledge</article-title>
          , volume 4 of
          <source>Foundations and Trends in Information Retrieval</source>
          . Now Publishers Inc.,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozmutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Ozmutlu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          .
          <article-title>U.S. versus European Web searching trends</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Web usage mining: Discovery and applications of usage patterns from web data</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>23</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          UNESCO.
          <source>Charter on the Preservation of Digital Heritage</source>
          . Adopted at the 32nd session of the General Conference of UNESCO, October 17,
          <year>2003</year>
          . http://portal.unesco.org/ci/en/files/13367/10700115911Charter_en.pdf/Charter_en.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          .
          <article-title>The demographics of web search</article-title>
          .
          <source>In Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>523</fpage>
          -
          <lpage>530</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>