<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Changing Vision for Access to Web Archives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zeynep Pehlivan</string-name>
          <email>zeynep.pehlivan@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anne Doucet</string-name>
          <email>anne.doucet@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Gançarski</string-name>
          <email>stephane.gancarski@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIP6, University P. and M. Curie</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>41</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>Since the late 1990s, there has been a large investment in web archiving, and access to these huge information sources is attracting more and more attention. The profiles of web archive users differ from those of casual web users. Archive users need to analyze, evaluate and compare information, which requires complex queries with a temporal dimension. These queries cannot be performed with the currently proposed access methods: the Wayback Machine, full-text search and navigation. In this paper, we address this requirement by proposing a data model and a temporal query language for web archives that take into account the different topics in web pages and the issues related to web archiving. In our approach, a captured web page is visually segmented into semantic blocks. The notion of a concrete block is introduced to represent these semantic blocks. A concrete block is a triplet: a frame block, which keeps the properties of a block; the content (textual and/or non-textual); and the importance accorded to the block. Each of them is timestamped with a period called validity. A web page, identified by a URL, is a set of concrete blocks, and a web site is a set of pages. Pages and sites are generated dynamically by manipulating concrete blocks when needed. Operators for data manipulation, navigation and ranking are also proposed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Web archives, WACs for short, aim to preserve the
history of large portions of the web. They thus represent a
huge information source, potentially greater than the web
itself. This makes web archiving an active research area with
numerous open issues such as crawler optimization, storage
models, etc. An overview of the main issues is presented by
Masanès [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. It is interesting to note that research
on accessing web archives focuses on extending existing
web access methods to web archives.
      </p>
      <p>Copyright 2011 for the individual papers by the papers’ authors. Copying
permitted only for private and academic purposes. This volume is published
and copyrighted by its editors. TWAW 2011, March 28, 2011, Hyderabad, India.</p>
      <p>
        For instance, the “Wayback Machine” [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], which allows
users to see captured versions of web pages over time, is
the best-known access method for web archives. The
other methods currently provided are well-known web access
methods: full-text search and navigation within the requested
time range. These methods are powerful for casual users,
who search the web for general information and represent
the largest proportion of web users. According to the “Web
Archiving User Survey” [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], web archive users
are mostly historians, journalists, lawyers, students, etc., rather
than casual users. The survey also underlines that the main
reason for using web archives is research activity. Web archive
users need to analyze, compare and evaluate
information. To achieve this, web archive systems must
provide tools to execute complex queries.
      </p>
      <p>
        Consider a researcher who studies how the French media
covered the 2010 earthquake in Haiti over the last year.
At the very beginning of her research, she would like to know
the number of web pages per month, in the .fr domain, that refer
to the earthquake. This kind of query cannot be performed
with the Wayback Machine, full-text search or navigation,
but only with an efficient query language. A number of possible
user scenarios over web archives are listed and illustrated
with technical requirements in “Use cases for access to
Internet Archives” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A significant number of them require
complex query capabilities.
      </p>
      <p>
        Multiple topics and noisy information on a web page affect
search performance. In recent years, there has been an
increasing interest in web search that takes these
different topics as units of retrieval and eliminates noisy
information such as copyright notices, navigation bars, etc. (e.g., [
        <xref ref-type="bibr" rid="ref12 ref15 ref28">12,
15, 28</xref>
        ]). To achieve this, different page segmentation
algorithms have been proposed (e.g., [
        <xref ref-type="bibr" rid="ref14 ref16 ref20">20, 14, 16</xref>
        ]). They partition web
pages into non-overlapping hierarchical blocks, where each
block deals with a different topic or contains only noisy
information. As web archives are temporal collections of web
pages, the block-based approach needs to be extended with a
temporal dimension in order to be used in web archive search.
      </p>
      <p>
        The role of search engines for web users is non-negligible.
On the other hand, as seen in use cases [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], most of the
time they are not sufficient to meet the needs of web archive
users. New approaches to exploring web archives with complex
queries are needed. In particular, formulating complex queries
requires an appropriate query language that
enables temporal search in addition to content queries and structural
queries (not only on the hypertext structure, but also on the
internal structure of web pages). A number of query languages
have been proposed for querying web data (e.g., WebSQL, W3QL,
WebLog, WebOQL, etc.), but most of them are not suitable
for web archives because they lack a temporal dimension and
do not handle the challenges specific to web archives.
Our aim is to integrate the database approach and the
information retrieval (IR) approach.
      </p>
      <p>In this paper, we present a conceptual model for web
archives and the basics of a query language based on this model.
In our approach, visual blocks extracted from a
page, rather than the whole page, are used as the unit of retrieval.
Each element of the model is timestamped with a period
called validity. Operators for data manipulation, navigation
and ranking are proposed. The query language that we
propose for web archives:
        <list list-type="bullet">
          <list-item><p>Enables block-based search</p></list-item>
          <list-item><p>Takes into account incompleteness</p></list-item>
          <list-item><p>Eliminates duplicates</p></list-item>
          <list-item><p>Enables temporal ranking and grouping</p></list-item>
          <list-item><p>Is user-friendly</p></list-item>
        </list>
      </p>
      <p>The paper is organized as follows. In Section 2, we present
the features of a query language for web archives. In
Section 3, we present related work. In Section 4, the
conceptual model is presented with its related operators.
Two use cases are explored in Section 5, before the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. FEATURES OF WAC QUERY LANGUAGE</title>
      <p>The overarching design goal of our approach is to offer a
query language for web archive users. In this section, we
describe the features of this query language and the rationale
behind them.</p>
      <p>Block-Based Search: Today, web pages contain
various topics. A typical example is the web pages of
newspapers/TV channels like www.bbc.co.uk/news (Figure 1)
where multiple blocks with unrelated topics are marked with
different colors. A web page with matched query terms
in the same region is more relevant than a web page with
matched terms distributed over the entire page. Besides
increasing keyword search performance, the segmentation
gives a structure to the web page for structural queries.</p>
      <p>
        Previous works [
        <xref ref-type="bibr" rid="ref14 ref20">20, 14</xref>
        ] show that a page can be
partitioned into multiple blocks and, often, the blocks in a page
have a different importance. The importance weights are
accorded to different blocks inside a web page according to
their location, area size, content, etc. Using a web page as
unit of retrieval does not take into account these different
regions and their importance. In our model, we add this
notion to enrich our query language. With block-based
search and importance, we are able to answer queries like
“Find pages mentioning Obama and Sarkozy in the same block”
or “Find pages mentioning Obama in their most important
blocks”, which return more relevant information and also
help to reduce the number of results.
      </p>
      <p>Incompleteness and Temporal Coherence: Due to
limited resources, all web archives are incomplete (i.e., they do
not contain all possible versions of all the pages on the web),
and we should query them as they are. If a user asks for a
version at t and the archive does not have a version at t
but has versions at t-2 and at t+2, the access model
should either decide which version is closest to t or let the user
choose how it is selected.</p>
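      <p>As a minimal sketch of this choice, the following Python fragment picks a substitute version for a requested time t from a sorted list of capture times. The function names, the integer time units and the tie-breaking rule (the earlier capture wins when two captures are equally distant) are assumptions made for illustration; they are not part of the model.</p>
      <preformat><![CDATA[
```python
from bisect import bisect_left, bisect_right


def nearest_version(times, t):
    """Return the capture time closest to t (ties go to the earlier capture)."""
    if not times:
        return None
    i = bisect_left(times, t)
    # Only the neighbours around the insertion point can be the closest ones.
    candidates = times[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda tx: (abs(t - tx), tx))


def recent_version(times, t):
    """Return the latest capture time at or before t, or None if there is none."""
    i = bisect_right(times, t)
    return times[i - 1] if i > 0 else None


captures = [8, 12, 20]  # sorted capture times in hypothetical units
print(nearest_version(captures, 10))  # captures at t-2 and t+2 are equally close
print(recent_version(captures, 19))
```
]]></preformat>
      <p>Keeping the two strategies as separate functions mirrors the idea that the access model may either decide by itself or expose the choice to the user.</p>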
      <p>Temporal coherence is another issue in web archiving. The
main reason is the dynamic nature of the web, which changes
continuously in an unpredictable and unorganized manner.
Web sites’ politeness constraints and limited
allocated resources do not allow a whole web site to be archived at
once, at the same moment. For example, in Figure 2, for a
temporal navigation starting from the version of p1 at t2, p2
at t1 is coherent, while p2 at t3 is incoherent. The access
model should be able to find the most coherent version.</p>
      <p>Temporal incoherence and incompleteness lead to broken or
defective links, which prevent complete navigation and bring
access to a standstill. An access model for web archives
should take these issues into account.</p>
      <p>
        Duplicates: Web archives have much more duplicated
content than the web itself. In web archives, duplicates
can occur in two different cases: two versions of the
same URL crawled at different times have the same content, or the same content is
pointed to by several different URLs. According to [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], 25%
of the documents in web archives are exact duplicates.
Duplicates complicate the search and the result visualization.
      </p>
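      <p>Both cases can be detected in the same way when duplicates are exact. The sketch below groups captures by a digest of their content; the tuple layout, the function name and the example URLs are hypothetical, and the text does not state how duplicates are actually detected.</p>
      <preformat><![CDATA[
```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(captures):
    """Group (url, crawl_time, content) captures that share identical content."""
    groups = defaultdict(list)
    for url, crawl_time, content in captures:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        groups[digest].append((url, crawl_time))
    # Keep only the digests that were seen more than once.
    return [members for members in groups.values() if len(members) > 1]


captures = [
    ("http://example.org/a", 1, "same text"),
    ("http://example.org/a", 2, "same text"),   # same URL, later crawl
    ("http://example.org/b", 2, "same text"),   # different URL, same content
    ("http://example.org/c", 2, "other text"),
]
print(group_exact_duplicates(captures))
```
]]></preformat>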
      <p>Temporal Ranking and Grouping: Ranking and
grouping are the most common ways to deliver query results for
large-scale data. For web archives, the temporal dimension must
be included in both the ranking and the grouping process. Ranking
also has a dynamic nature in WACs. For example, a page
archived in 1999 that mentions “Obama” should not have the
same rank when queried in 2000 and in 2010.</p>
      <p>Temporal Logic: Support for temporal logic operators
enriches the language. For example, a researcher who wants
to analyze the effects of the arrest of Julian Assange can
execute a query: “Find pages linked to wikileaks.org after
Julian Assange’s arrest in London”.</p>
      <p>User-Friendliness: This query language will be used
by researchers and casual web users, so it should enable
users to find information without long-term training. Simple
queries (like keyword search) should be expressed
straightforwardly. Complex syntax should be needed only to express
more complex queries. We believe that, with an advanced
GUI, users can avoid writing code.</p>
    </sec>
    <sec id="sec-3">
      <title>3. RELATED WORKS</title>
      <p>In this section, we briefly summarize related work in three
different areas concerning access to web archives: web
archiving, query languages for the web, and block-based search.</p>
      <sec id="sec-3-1">
        <title>Web Archiving</title>
        <p>
          To explore web archives, traditional access methods, i.e.,
navigation and full-text search, are offered by web
archiving initiatives [
          <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
          ]. In the fall of 2001, the Internet Archive (IA)
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] launched, in collaboration with Alexa Internet, a project
called the “Wayback Machine” [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. It allows users to go back
in time and view earlier versions of a web page for a given
URL. The drawback of this method is that the exact URL
must be known. There is another way of navigating in
web archives: navigation between different versions. Some
web archive initiatives propose a navigation tool, like the UK
Web Archives (UKWAC) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as seen in Figure 3 to facilitate
the navigation between versions. By using the cursor, users
can browse different versions easily.
        </p>
        <p>
          The increasing number of national web archives and the diversity
of existing work led to the establishment of the
International Internet Preservation Consortium (IIPC) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in Paris
in 2003. Its aim is to develop common standards, tools and
techniques for web archiving. One of the current projects of the
IIPC, called WERA, is an archive access solution for
searching and navigating web archives. It allows full-text search
in addition to Wayback Machine style search, and it is based on the
NWA (Nordic Web Archive) toolset [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and the NutchWAX
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] full-text indexer. NutchWAX is an extension of Nutch (an
open source search engine, based on Lucene Java, for searching
and indexing) for searching web archive collections.
        </p>
        <p>Today, most web archive initiatives use the Wayback
Machine to support URL indexing and search, and NutchWAX
to enable full-text indexing and search. Our
motivation is to enable complex queries that cannot
be performed with existing methods.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Query Languages for Web</title>
        <p>
          A number of web query languages have been developed
in the past (e.g., WebSQL, W3QL, WebLog, WebOQL, etc.)
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. All those languages are intended for online queries
on the web. From the perspective of web archiving, the
most important inconvenience of those query languages is
their lack of temporal dimension and the lack of handling
challenges related to web archives like temporal coherence,
incompleteness etc.
        </p>
        <p>
          WebBase [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] project at Stanford University is a web
repository project that aims to manage large collections of web
pages and to enable web-based search. A web warehouse
is interpreted simultaneously as a document collection, as
a directed graph and as a set of relations. For web-based
search, a query language with the notions of ranking and
ordering is proposed. However, only one copy of each page
is archived at a time; thus, no temporal dimension is provided in
that project.
        </p>
        <p>
          A web warehousing system called WHOWEDA (Warehouse
of Web Data) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], proposed by the Web Warehousing and Data
Mining group at the Nanyang Technological University in
Singapore, aims to store and manipulate web information. It
stores extracted web information as web tables and provides
web operators to manipulate those tables. Its data model
is based on node (page) and link (hyperlink) objects.
Links do not have any time-related attribute. Any change
in the last-modified time of a web document results in
a new node. The content of a web document is represented as a
“node data tree”. For HTML documents, it is an HTML DOM
tree. The flexibility of HTML syntax might cause mistakes
in the DOM tree structure. In addition, although the DOM tree is
powerful for presentation in the browser, it was not designed
to describe the semantic structure of a web page. In
WHOWEDA, the internal semantic structure of a page is
not modeled; thus, queries can be specified only on
the whole content of a document. Moreover, WHOWEDA
focuses only on the web sites a user is interested in, which constitute a much
smaller scale than web archives.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Block-Based Search</title>
        <p>
          Using semantic blocks in web pages as a unit of
information retrieval is an active research area. The vector space
model customized with importance and permeability (the
indexing of neighboring blocks) is proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] without
temporal dimension. In [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], after segmenting each page into
non overlapping blocks, an importance value is assigned to
each block which is used to weight the links in the ranking
computation. A block-based language model is proposed in
[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. As far as we know, there is no approach with temporal
dimension in block-based search, ranking and language
modeling. Our approach is based on visual page segmentation
of web pages [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and uses an importance model proposed
in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>
          In conclusion, the Wayback Machine, full-text search and
navigation are the only solutions currently applied for accessing web archives.
Most web query languages do not include a temporal
dimension for querying historical data. Different topics in a
web page are not handled in search, except in recent works like
[
          <xref ref-type="bibr" rid="ref12 ref13 ref15">12, 15, 13</xref>
          ] which suffer from lack of temporal dimension to
be used for web archives. In our approach, we propose a
data model which takes into account different regions of a
web page with associated importance and temporal
dimension.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. WAC QUERY LANGUAGE</title>
      <p>The details of the WAC query language are described in this
section. After briefly introducing the interpretation of the
temporal dimension in Section 4.1, we describe our data model
in Section 4.2 and list the operators in Section 4.3.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Temporal Dimension</title>
      <sec id="sec-5-1">
        <title>4.1.1 Time on the Web</title>
        <p>Integrating time into data models, query languages and
database management system implementations still attracts
the attention of researchers. We focus here on two questions:
which time to use and how to represent it.</p>
        <p>In the context of web archiving, there are different types
of temporal information: time in content, HTTP headers,
crawled time.</p>
        <p>
          Temporal expressions can be found embedded in the
content of a web page. Those expressions are explained in three
categories in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]: explicit, implicit and relative. Temporal
expressions that can be converted directly to chronons are
explicit expressions (e.g., “December 24, 2010”). Names of
holidays or events that can be anchored on the timeline, like
“Christmas Day 2010”, are considered implicit expressions.
Relative expressions, like “today” or “on Wednesday”, cannot be
anchored on the timeline directly. Extraction of temporal
expressions from documents is an active research field.
Temporal information in content is excluded from our model.
        </p>
        <p>
          The temporal fields in HTTP headers include Date,
Last-Modified and Expires. Servers send the Last-Modified header
with the date and time at which the content was last
changed. The Last-Modified header is mostly used as a weak
validator. According to the HTTP specification,
a validator that does not always change when the resource
changes is a weak validator. The meaning of Last-Modified
depends on the implementation of the origin server and the
nature of the original resource (files, database gateways,
virtual objects, etc.) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The Expires header indicates when
a document stops being fresh. If it is used correctly, this
header indicates a validity period for the document and an
exact date for recrawling. However, it is an unreliable tool: many
servers do not provide the header, or provide it with a zero or
very short expiration delay.
        </p>
        <p>The HTTP header Date represents the date and time at which
a web page is returned by an HTTP server upon request
by an HTTP client. Crawled time is the time at which a crawler
captures the snapshot of a page. Unless they are configured otherwise,
web crawlers set the value of the Date field in crawled documents
to the date on which the document is crawled.
It is the most trustworthy time because it does not depend
on the configuration of host servers. In our model, crawled time
is used as the basic time dimension, but we should underline
that our model also supports the other choices.</p>
        <p>
          Our model uses Allen’s interval-based representation of
temporal data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], where intervals are treated as primitive
time elements. Each element is temporally stamped with a
period, called validity, defined as a time interval [t<sub>s</sub>, t<sub>e</sub>), where
t<sub>s</sub> is the starting time point and t<sub>e</sub> is the ending time
point. The value now represents the current time and is assumed to
be larger than any timestamp.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.1.2 Time in queries</title>
        <p>There are two kinds of temporal queries in temporal
information retrieval: time-point and time-interval. In
time-point queries, a time point t is given in the query and the
results should contain the data whose validity contains t. In
time-interval queries, an interval [t1, t2) is given in the query
and the data whose validity overlaps with [t1, t2) are returned
as the result. A time-interval query is also called a time-slice
query.</p>
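        <p>The two query kinds reduce to a containment test and an overlap test on half-open intervals. The following sketch illustrates them in Python; the Interval class, the helper names and the sample data are hypothetical, and widening a calendar date into a one-day interval is shown only as one possible convention.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Interval:
    """Half-open validity interval [start, end)."""
    start: date
    end: date

    def contains(self, t):
        return self.start <= t < self.end

    def overlaps(self, other):
        return self.start < other.end and other.start < self.end


def day_interval(d):
    """Widen a calendar date into the one-day interval [d, d + 1 day)."""
    return Interval(d, d + timedelta(days=1))


def time_point_query(items, t):
    """Names of the items whose validity contains the time point t."""
    return [name for name, validity in items if validity.contains(t)]


def time_interval_query(items, q):
    """Names of the items whose validity overlaps the query interval q."""
    return [name for name, validity in items if validity.overlaps(q)]


items = [
    ("v1", Interval(date(2010, 8, 1), date(2010, 8, 19))),
    ("v2", Interval(date(2010, 8, 19), date(2010, 9, 1))),
]
print(time_point_query(items, date(2010, 8, 19)))
print(time_interval_query(items, day_interval(date(2010, 8, 19))))
print(time_interval_query(items, Interval(date(2010, 8, 10), date(2010, 8, 25))))
```
]]></preformat>
        <p>Because the intervals are half-open, a version whose validity ends on a given day is not returned for a query on that day, while the version that starts on it is.</p>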
        <p>In our model, all queries are treated as time-interval queries.
For example, a time-point query like “2010/08/19” is
converted to the time interval [2010/08/19, 2010/08/20).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4.2 Conceptual Model</title>
      <p>In our model, each concept has a temporal and a
non-temporal definition. Non-temporal definitions do not
contain temporal attributes and are denoted by concept<sup>−</sup>. We
follow the example in Figure 4 to illustrate the main features
of the model. In this example, we have a web page crawled
at t1, t2 and t3 without structural changes. The page is
segmented into three blocks marked in green, blue and pink. In
our approach, visually segmented blocks are called concrete
blocks. Each concrete block has a frame block, content and
importance. A page is a set of concrete blocks and a website
is a set of pages.</p>
      <p>Frame Block</p>
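      <p>Web page segmentation returns a set of non-overlapping
hierarchical blocks. A frame block (fb) keeps the properties of a
block: the URL to which the block belongs, a Dewey identifier
that indicates its place in the page structure, and its validity
interval. If a frame block disappears and reappears later, it
is considered a new frame block. A frame block is defined
as follows:
<disp-formula><tex-math>\mathit{fb} = (\mathit{URL}, \mathit{DeweyID}, [\mathit{fb}_{ts}, \mathit{fb}_{te}))</tex-math></disp-formula>
<disp-formula><tex-math>\mathit{fb}^{-} = (\mathit{URL}, \mathit{DeweyID})</tex-math></disp-formula></p>
      <p>For the example in Figure 4, the frame blocks are:</p>
      <list list-type="simple">
        <list-item><p>blue: (www.bbc.co.uk/news, 1, [t1, now))</p></list-item>
        <list-item><p>pink: (www.bbc.co.uk/news, 2.1, [t1, now))</p></list-item>
        <list-item><p>green: (www.bbc.co.uk/news, 2.2, [t1, now))</p></list-item>
      </list>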
      <p>Content</p>
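      <p>There are two kinds of content: non-textual and textual.
Non-textual content consists of the images, videos, etc. in a web page.
Treating non-textual content is outside the scope of our
model; thus, when a block has non-textual content, a binary value
is introduced to provide support for future work. Textual
content is treated as the bag of words of a block (after elimination of
stop words, stemming, etc.). A validity attribute
makes it possible to trace the content changes of a frame block; this
validity must be included in the validity of the frame block.
A block b can contain only one textual content but several
non-textual contents. The content of a block b is defined as follows:</p>
      <list list-type="bullet">
        <list-item><p>if b has only textual content:
<disp-formula><tex-math>c = (\mathit{text}, [c_{ts}, c_{te}))</tex-math></disp-formula>
<disp-formula><tex-math>c^{-} = (\mathit{text})</tex-math></disp-formula></p></list-item>
        <list-item><p>if b has only non-textual content:
<disp-formula><tex-math>c = \{(\mathit{binary}_1, [c_{ts}, c_{te})), \ldots, (\mathit{binary}_n, [c_{ts}, c_{te}))\}</tex-math></disp-formula>
<disp-formula><tex-math>c^{-} = \{(\mathit{binary}_1), \ldots, (\mathit{binary}_n)\}</tex-math></disp-formula></p></list-item>
        <list-item><p>if b has textual and non-textual content:
<disp-formula><tex-math>c = \{(\mathit{text}, [c_{ts}, c_{te})), (\mathit{binary}_1, [c_{ts}, c_{te})), \ldots, (\mathit{binary}_n, [c_{ts}, c_{te}))\}</tex-math></disp-formula>
<disp-formula><tex-math>c^{-} = \{(\mathit{text}), (\mathit{binary}_1), \ldots, (\mathit{binary}_n)\}</tex-math></disp-formula></p></list-item>
      </list>
      <p>In the example in Figure 4, the contents are as follows:</p>
      <list list-type="simple">
        <list-item><p>blue: ((News), [t1, now)), (binary, [t1, now))</p></list-item>
        <list-item><p>pink: ((Obama, America, gun, laws...), [t1, t2)), (binary, [t1, t2))</p></list-item>
        <list-item><p>pink: ((Sarkozy, Obama, US, French), [t2, t3)), (binary, [t2, t3))</p></list-item>
        <list-item><p>pink: ((Sarkozy, Carla, Woody, Allen...), [t3, now)), (binary, [t3, now))</p></list-item>
        <list-item><p>green: ((French, hostage, Niger...), [t1, t2)), (binary, [t1, t2))</p></list-item>
        <list-item><p>green: ((Woody, Allen, ageing...), [t2, now)), (binary, [t3, now))</p></list-item>
      </list>
      <p>Importance</p>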
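      <p>The blocks in a page have different importances, as
explained in Section 2. We define importance as:
<disp-formula><tex-math>i = (\mathit{alpha}, [i_{ts}, i_{te}))</tex-math></disp-formula>
<disp-formula><tex-math>i^{-} = (\mathit{alpha})</tex-math></disp-formula></p>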
      <p>
        The importance of a block, denoted alpha, depends on its
location, area size, content, etc. It is calculated according
to the model proposed in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The validity of an importance is
not necessarily equal to the validity of the content in the same block:
even if its content stays unchanged, the importance of a
block can change when links or images are added or deleted, or when
the block’s size in the page is updated.
      </p>
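      <p>For the example in Figure 4, the importances are as
follows:</p>
      <list list-type="simple">
        <list-item><p>blue: (0.4, [t1, now))</p></list-item>
        <list-item><p>pink: (0.2, [t1, t2)), (0.6, [t2, now))</p></list-item>
        <list-item><p>green: (0.1, [t1, t2)), (0.3, [t2, t3)), (0.4, [t3, now))</p></list-item>
      </list>
      <p>Concrete Block</p>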
      <p>A concrete block is a region in a web page. It is defined as
follows:
<disp-formula><tex-math>\mathit{cb} = (\mathit{fb}, \{c\}, \{i\})</tex-math></disp-formula></p>
      <p>In this triplet, fb is a frame block, {c} is the set of its contents
ordered by validity, and {i} is the set of its
importances ordered by validity.</p>
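      <p>In Figure 4, the concrete block for the green block is
described as follows:
<disp-formula><tex-math>\left(\mathit{green},
\left\{\begin{array}{l}
((\mathit{French}, \mathit{hostage}, \ldots), [t_1, t_2)) \\
(\mathit{binary}, [t_1, t_2)) \\
((\mathit{Woody}, \mathit{ageing}, \ldots), [t_2, \mathit{now})) \\
(\mathit{binary}, [t_3, \mathit{now}))
\end{array}\right\},
\left\{\begin{array}{l}
(0.1, [t_1, t_2)) \\
(0.3, [t_2, t_3)) \\
(0.4, [t_3, \mathit{now}))
\end{array}\right\}\right)</tex-math></disp-formula></p>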
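      <p>If we want to eliminate temporal nesting in the data, for
example to find the different versions, the T-FLAT operator can be
used. A concrete block to which the T-FLAT operator has been applied is
called a temporally flattened concrete block and is denoted
<inline-formula><tex-math>\check{\mathit{cb}}</tex-math></inline-formula>. It consists of the non-temporal triplet of the frame block, its content
and its importance, together with a validity equal to the
intersection of the validities of the elements in the triplet:
<disp-formula><tex-math>\check{\mathit{cb}} = \left((\mathit{fb}^{-}, c^{-}, i^{-}), [\check{\mathit{cb}}_{ts}, \check{\mathit{cb}}_{te})\right)</tex-math></disp-formula></p>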
      <p>For the example above, the T-FLAT operator returns the following
temporally flattened concrete blocks:</p>
      <list list-type="simple">
        <list-item><p>(green, {(French, hostage...), (binary)}, 0.1, [t1, t2))</p></list-item>
        <list-item><p>(green, (Woody, ageing...), 0.3, [t2, t3))</p></list-item>
        <list-item><p>(green, {(Woody, ageing...), (binary)}, 0.4, [t3, now))</p></list-item>
      </list>
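      <p>One possible reading of T-FLAT can be sketched as follows: the timeline is cut at every validity boundary, and each resulting period keeps the contents and the importance that are valid throughout it. The function name, the tuple layout, the integer stand-ins for t1, t2, t3 and the use of infinity for now are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
import math

NOW = math.inf  # stands in for the open-ended bound "now"


def t_flat(fb, contents, importances):
    """Split a concrete block into periods with constant content and importance.

    contents and importances are lists of (value, (ts, te)) pairs whose
    validities are half-open intervals [ts, te).
    """
    bounds = sorted({b for _, (s, e) in contents + importances for b in (s, e)})
    flat = []
    for s, e in zip(bounds, bounds[1:]):
        valid_c = tuple(v for v, (cs, ce) in contents if cs <= s and e <= ce)
        valid_i = [a for a, (i_s, i_e) in importances if i_s <= s and e <= i_e]
        if not valid_c or not valid_i:
            continue  # the block is not fully described during this period
        state = (fb, valid_c, valid_i[0])
        if flat and flat[-1][0] == state and flat[-1][1][1] == s:
            # Same state as the previous period: extend it instead of adding one.
            flat[-1] = (state, (flat[-1][1][0], e))
        else:
            flat.append((state, (s, e)))
    return flat


t1, t2, t3 = 1, 2, 3
contents = [
    ("French, hostage...", (t1, t2)),
    ("binary", (t1, t2)),
    ("Woody, ageing...", (t2, NOW)),
    ("binary", (t3, NOW)),
]
importances = [(0.1, (t1, t2)), (0.3, (t2, t3)), (0.4, (t3, NOW))]
for flattened in t_flat("green", contents, importances):
    print(flattened)
```
]]></preformat>
      <p>On the green block of Figure 4, this sketch yields three flattened blocks, one per period [t1, t2), [t2, t3) and [t3, now), which is the number of versions listed above.</p>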
      <p>A snapshot of a concrete block at a given time t is the
non-temporal triplet valid at t:
<disp-formula><tex-math>\mathit{cb}^{t} = (\mathit{fb}^{-}, c^{-}, i^{-})</tex-math></disp-formula></p>
      <p>Page</p>
      <p>A page is a set of concrete blocks. It is built dynamically
from the concrete blocks of a given URL when needed. The validity
of a page is the union of the validities of all its concrete blocks. A page
is defined as:
<disp-formula><tex-math>p_{url} = \{\mathit{cb}_1, \mathit{cb}_2, \mathit{cb}_3, \ldots\}</tex-math></disp-formula></p>
      <p>In our example (Figure 4), we have one page with three
versions. This page is built from the concrete blocks of its blue, pink and green blocks:
<disp-formula><tex-math>p_{url} = \{\mathit{cb}_{blue}, \mathit{cb}_{pink}, \mathit{cb}_{green}\}</tex-math></disp-formula>
where each concrete block combines the frame block, the set of contents and the set of importances of its block.</p>
      <p>A snapshot of a page is the set of concrete block snapshots
for a given URL that are valid at the requested time. It is defined as
follows:
<disp-formula><tex-math>p_{url}^{t} = \{\mathit{cb}_1^{t}, \mathit{cb}_2^{t}, \mathit{cb}_3^{t}, \ldots, \mathit{cb}_n^{t}\}</tex-math></disp-formula></p>
      <p>Site</p>
      <p>A website is a set of web pages that are addressed relative
to a common URL. In our approach, websites, like
pages, are built dynamically, using regular expressions to
determine which pages belong to which websites:
<disp-formula><tex-math>s_{regex} = \{p_{url_1}, p_{url_2}, p_{url_3}, \ldots, p_{url_n}\}</tex-math></disp-formula>
A snapshot of a website at t is the set of its pages valid at t:
<disp-formula><tex-math>s_{regex}^{t} = \{p_{url_1}^{t}, p_{url_2}^{t}, p_{url_3}^{t}, \ldots, p_{url_n}^{t}\}</tex-math></disp-formula></p>
      <p>Links</p>
      <p>A link represents a hyperlink in a page. It points from a
frame block to a page and is defined as follows:
<disp-formula><tex-math>l = (\mathit{label}, \mathit{type}, \mathit{from}, \mathit{to}, [l_{ts}, l_{te}))</tex-math></disp-formula>
where:</p>
      <list list-type="bullet">
        <list-item><p>label is the link label, as in &lt;a href="URL"&gt;link label&lt;/a&gt;.</p></list-item>
        <list-item><p>type distinguishes interior, local and global
hyperlinks. A hypertext link in an HTML document
is said to be:</p>
          <list list-type="bullet">
            <list-item><p>interior if the destination document coincides with
the source document (e.g., href="#anchorname")</p></list-item>
            <list-item><p>local if the destination and the source documents
are different but in the same domain
(e.g., href="/news/art.html")</p></list-item>
            <list-item><p>global if the destination and the source documents
are located on different servers.</p></list-item>
          </list>
        </list-item>
        <list-item><p>the from attribute corresponds to a frame block.</p></list-item>
        <list-item><p>the to attribute corresponds to a URL.</p></list-item>
      </list>
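      <p>The three link types can be told apart by resolving the href against the URL of the source page. The sketch below uses Python’s standard urllib.parse module; the function name is hypothetical, and comparing host names is an assumption standing in for “same domain” and “different servers”.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urldefrag, urljoin, urlsplit


def link_type(source_url, href):
    """Classify a hyperlink as 'interior', 'local' or 'global'."""
    target = urljoin(source_url, href)  # resolve relative references
    if urldefrag(target).url == urldefrag(source_url).url:
        return "interior"  # same document, only the fragment differs
    if urlsplit(target).netloc.lower() == urlsplit(source_url).netloc.lower():
        return "local"     # different document on the same host
    return "global"        # document on another host


page = "http://www.bbc.co.uk/news"
print(link_type(page, "#anchorname"))
print(link_type(page, "/news/art.html"))
print(link_type(page, "http://example.org/"))
```
]]></preformat>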
      <p>In our example (Figure 4), the links are as follows:</p>
      <list list-type="simple">
        <list-item><p>(“Mobile”, local, blue, “/news/mobile”, [t1, now))</p></list-item>
        <list-item><p>(“News”, interior, blue, “#”, [t1, now))</p></list-item>
        <list-item><p>(“Why America’s gun laws won’t change”, local, pink, “/news/politics25698422”, [t1, t2))</p></list-item>
        <list-item><p>(“US-French push for Iran sanctions”, local, pink, “/news/politics2457913”, [t2, t3))</p></list-item>
        <list-item><p>(“Bruni to star in Woody Allen film”, local, pink, “/news/cinema18698422”, [t3, now))</p></list-item>
        <list-item><p>(“Two French hostages in Niger”, local, green, “/news/world-4598422”, [t1, t2))</p></list-item>
        <list-item><p>(“Woody Allen on ageing and death”, local, green, “/news/cinema2659874”, [t2, now))</p></list-item>
      </list>
    </sec>
    <sec id="sec-7">
      <title>4.3 Operators</title>
      <p>The proposed language consists of classical set operators
with their temporal extensions, relational operators,
navigation operators, interval-based temporal logic operators,
aggregate operators, ranking and grouping operators. Due
to the lack of space, only query operators related to web
archive issues are detailed in this section. Table 1 lists the
suite of operators.</p>
      <p>InBlock</p>
      <p>InBlock is one of our logical full-text operators, like
or-select (OR), and-select (AND), not-select (NOT), etc. It finds
matches that satisfy a full-text selection within the same block. It
is used in combination with the other logical full-text operators.</p>
      <p>For example, assume that we are looking for information
about Woody Allen’s new movie in which Carla Bruni Sarkozy
participates (Figure 4). A query like “Sarkozy and Allen”
will return the versions at t2 and at t3, although the
information in the version at t2 is not relevant. A query like
“Sarkozy InBlock-AND Allen”, in contrast, will return only the version at t3,
which contains the more relevant information.</p>
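      <p>The difference between a page-level AND and InBlock-AND can be sketched in a few lines of Python. A page version is represented here as a plain list of block texts; the function names and the block texts are hypothetical and only loosely modelled on the example above.</p>
      <preformat><![CDATA[
```python
def page_and(page_blocks, *terms):
    """Page-level AND: every term occurs somewhere in the page."""
    text = " ".join(page_blocks).lower()
    return all(term.lower() in text for term in terms)


def inblock_and(page_blocks, *terms):
    """InBlock-AND: every term occurs inside one and the same block."""
    return any(
        all(term.lower() in block.lower() for term in terms)
        for block in page_blocks
    )


# Hypothetical block texts for the versions at t2 and t3.
version_t2 = [
    "Sarkozy: US-French push for Iran sanctions",
    "Woody Allen on ageing and death",
]
version_t3 = [
    "Carla Bruni Sarkozy to star in Woody Allen film",
    "Woody Allen on ageing and death",
]
```
]]></preformat>
      <p>With these inputs the page-level test accepts both versions, whereas the block-level test accepts only the version at t3, where both terms share a block.</p>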
      <p>Wayback</p>
      <p>It returns all the versions of a page identified by its URL
for a given period.</p>
      <p>WAYBACK(URL, [ts, te))</p>
      <p>Fixdate</p>
      <p>This operator is used at the beginning of a session to fix
the date interval for all subsequent queries. For example,
if the user wants to work on the data of January 2000,
after calling FIXDATE([2000-01-01, 2000-02-01)), she
can call other operators without specifying the temporal
attributes (e.g., WAYBACK(url)).</p>
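      <p>The interplay of WAYBACK and FIXDATE can be sketched as a small session object in Python. The class name Session, its storage layout and the version labels are hypothetical; the sketch also assumes that "the versions for a given period" are those whose validity overlaps the requested half-open interval.</p>
      <preformat><![CDATA[
```python
from datetime import date


class Session:
    """Toy archive session; `versions` maps a URL to (start, end, label) tuples."""

    def __init__(self, versions):
        self.versions = versions
        self.fixed = None  # interval set by fixdate, if any

    def fixdate(self, ts, te):
        """Fix the interval [ts, te) for all later calls in this session."""
        self.fixed = (ts, te)

    def wayback(self, url, period=None):
        """Return the labels of the versions of `url` overlapping the period."""
        period = period or self.fixed
        if period is None:
            raise ValueError("no period given and no FIXDATE in effect")
        ts, te = period
        # Half-open intervals [s, e) and [ts, te) overlap iff s < te and ts < e.
        return [label for (s, e, label) in self.versions.get(url, [])
                if s < te and ts < e]


archive = {"http://example.org/": [
    (date(1999, 12, 1), date(2000, 1, 15), "v1"),
    (date(2000, 1, 15), date(2000, 2, 1), "v2"),
    (date(2000, 2, 1), date(2000, 3, 1), "v3"),
]}
session = Session(archive)
session.fixdate(date(2000, 1, 1), date(2000, 2, 1))
january = session.wayback("http://example.org/")
```
]]></preformat>
      <p>Because the interval is half-open, the version that starts on 2000-02-01 is not part of the January result.</p>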
      <p>Nearest/Recent/Both</p>
      <p>These operators are used to deal with the incompleteness
explained in Section 2. For example, in Figure 4, if the version
at t is requested, where t1 &lt; t &lt; t2, we need to make an
assumption about the version at t. Three different operators
are proposed:
• NEAREST: it returns the nearest time by minimizing
|t − tx|.
• RECENT: it returns the closest time before t. It is the
default operator if the user does not specify another
one.
• BOTH: it returns a time interval constructed from the
closest times before t and after t.</p>
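      <p>The three operators above can be sketched over a sorted list of capture times. This is a minimal illustration under stated assumptions: the helper following is not an operator of the language, timestamps are plain integers, and a tie in NEAREST is resolved in favour of the earlier capture, a choice the text does not specify.</p>
      <preformat><![CDATA[
```python
from bisect import bisect_right


def recent(captures, t):
    """Closest capture time at or before t, or None if there is none."""
    i = bisect_right(captures, t)
    return captures[i - 1] if i > 0 else None


def following(captures, t):
    """Closest capture time strictly after t, or None (helper, not in the paper)."""
    i = bisect_right(captures, t)
    return captures[i] if i < len(captures) else None


def nearest(captures, t):
    """Capture time minimizing |t - tx|; ties go to the earlier capture."""
    candidates = [c for c in (recent(captures, t), following(captures, t))
                  if c is not None]
    return min(candidates, key=lambda c: abs(t - c)) if candidates else None


def both(captures, t):
    """Pair (closest before, closest after) that brackets t."""
    return (recent(captures, t), following(captures, t))


captures = [10, 20, 30]  # sorted capture times standing in for t1, t2, t3
```
]]></preformat>
      <p>The binary search keeps each lookup logarithmic in the number of captures, which matters when a page has been crawled many times.</p>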
      <p>For Figure 4, a query WAYBACK(URL,t) will be
executed:
• with NEAREST as WAYBACK(URL,t2), assuming that
|t − t2| &lt; |t − t1|
• with RECENT as WAYBACK(URL,t1)
• with BOTH as WAYBACK(URL, t1) ∪ WAYBACK(URL, t2)</p>
      <p>Navigation</p>
      <p>Operator outb finds the set of pages pointed to in one step
from a set of concrete blocks (CB) by following any of the
links valid at a given period. Operator out uses outb to find
the set of pages reachable in one step from a set of pages
(P) at a given period. To do so, it finds the set of concrete
blocks for each page in P and calls outb for each CB.</p>
      <p>Operator in finds the set of concrete blocks that point to
a set of pages through links valid for a requested period.</p>
      <p>Operators jump+ and jump− return the set of pages
reachable in, respectively, the incoming and outgoing direction within one
to n steps by following links valid at a given period. They are a
combination of the in and out operators with iteration.</p>
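      <p>The navigation operators can be sketched over a flat list of links. Everything below is a hypothetical encoding: the Link tuple, the block_page mapping from blocks to pages, and the reading of "valid at a given period" as an overlap of half-open intervals are assumptions. Only the outgoing iteration is shown, under the neutral name jump_out; the incoming variant would iterate in_ in the same way.</p>
      <preformat><![CDATA[
```python
from collections import namedtuple

# Hypothetical flat encoding: a link leaves a block and targets a URL.
Link = namedtuple("Link", "from_block to_url start end")


def overlaps(link, ts, te):
    """True if the link validity [start, end) intersects the period [ts, te)."""
    return link.start < te and ts < link.end


def outb(links, blocks, ts, te):
    """Pages pointed to in one step from a set of concrete blocks."""
    return {lk.to_url for lk in links
            if lk.from_block in blocks and overlaps(lk, ts, te)}


def out(links, block_page, pages, ts, te):
    """Pages reachable in one step from a set of pages, via their blocks."""
    blocks = {b for b, p in block_page.items() if p in pages}
    return outb(links, blocks, ts, te)


def in_(links, pages, ts, te):
    """Concrete blocks that point to any of the given pages."""
    return {lk.from_block for lk in links
            if lk.to_url in pages and overlaps(lk, ts, te)}


def jump_out(links, block_page, pages, n, ts, te):
    """Pages reachable by following outgoing links for one to n steps."""
    reached, frontier = set(), set(pages)
    for _ in range(n):
        frontier = out(links, block_page, frontier, ts, te) - reached
        if not frontier:
            break  # nothing new can be reached
        reached |= frontier
    return reached


block_page = {"b1": "/a", "b2": "/b", "b3": "/c"}
links = [Link("b1", "/b", 1, 5), Link("b2", "/c", 1, 5), Link("b3", "/a", 6, 9)]
```
]]></preformat>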
      <p>Collapse/Expand</p>
      <p>COLLAPSE, also referred to in the
literature as coalesce, combines tuples that have the same
non-temporal values and consecutive or overlapping
validities into one tuple whose validity is the union of the
constituent validities.</p>
      <p>EXPAND expands a tuple into several tuples by splitting
its validity into consecutive validities at a given scale. In
our approach, this scale is one of the following keywords: YEAR,
MONTH, DAY. These operators can be combined with an IN
period to limit the range. In Figure 5, EXPAND BY YEAR
IN [2001,2004) returns the first three tuples.</p>
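      <p>A minimal Python sketch of both operators follows. Tuples are simplified to (value, start, end) with integer years and half-open validities; the function names and the sample data are hypothetical and do not reproduce Figure 5.</p>
      <preformat><![CDATA[
```python
def collapse(tuples):
    """Coalesce (value, start, end) tuples with equal values whose
    half-open validities [start, end) touch or overlap."""
    result = []
    # Sorting groups equal values together and orders them by start time.
    for value, start, end in sorted(tuples):
        if result and result[-1][0] == value and start <= result[-1][2]:
            last = result[-1]
            result[-1] = (value, last[1], max(last[2], end))
        else:
            result.append((value, start, end))
    return result


def expand_by_year(tup, period=None):
    """Split one (value, start_year, end_year) tuple into one-year tuples,
    optionally clipped to the half-open period [lo, hi)."""
    value, start, end = tup
    if period is not None:
        start, end = max(start, period[0]), min(end, period[1])
    return [(value, y, y + 1) for y in range(start, end)]


merged = collapse([("a", 2001, 2003), ("a", 2003, 2005),
                   ("b", 2001, 2002), ("a", 2007, 2008)])
years = expand_by_year(("a", 2000, 2006), period=(2001, 2004))
```
]]></preformat>
      <p>On this encoding the two functions are inverse to each other for a single value: collapsing the output of expand_by_year yields one tuple covering the clipped range.</p>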
    </sec>
    <sec id="sec-8">
      <title>USE CASES</title>
      <p>In this section, to illustrate how the different operators
work in our approach, we use two examples and construct
the corresponding queries.</p>
      <p>Example 1: We extend the example given in the
introduction. A social researcher who studies how French
media covered the 2010 Haiti earthquake over the
last year wants to know, for each month, the number of distinct
regions in web pages of the .fr domain that refer to the earthquake,
with duplicates eliminated. In this setting, if at time t there were two
different articles on “lemonde.fr”, they are counted as 2 rather
than 1 (the whole page). Figure 6 shows the query graph for this
example.</p>
      <p>First, we need to find all concrete blocks in the .fr domain
that are valid in 2010. LIKE is used for string
comparison. Then, with the CONTAINS operator, we find the contents
that mention the given keywords (Haiti, earthquake).
The EXPAND operator is used to group by month over the validity
of the content. The COUNT and DISTINCT operators are used to
find the number of different regions. By using the GROUP
BY operator with url, we can limit the count to web pages.</p>
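      <p>The steps of this query can be mimicked in plain Python. This is a sketch under several assumptions: blocks are flat tuples with (year, month) validity bounds, a regular expression stands in for LIKE, CONTAINS requires all keywords, and the block texts are invented. It does not reproduce the query graph of Figure 6.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict


def months_in(start, end):
    """Yield the (year, month) pairs covered by the half-open range [start, end)."""
    y, m = start
    while (y, m) < end:
        yield (y, m)
        y, m = (y, m + 1) if m < 12 else (y + 1, 1)


def count_blocks_per_month(blocks, url_pattern, keywords, year):
    """blocks: iterable of (url, block_id, text, start, end).
    Returns {(url, month): number of distinct matching blocks}."""
    regex = re.compile(url_pattern)                       # stands in for LIKE
    seen = defaultdict(set)
    for url, block_id, text, start, end in blocks:
        if not regex.search(url):
            continue
        if not all(k.lower() in text.lower() for k in keywords):  # CONTAINS
            continue
        for (y, m) in months_in(start, end):              # EXPAND BY MONTH
            if y == year:
                seen[(url, m)].add(block_id)              # DISTINCT
    return {key: len(ids) for key, ids in seen.items()}   # COUNT, GROUP BY url


blocks = [
    ("http://lemonde.fr/", "b1", "Haiti earthquake: rescue efforts", (2010, 1), (2010, 3)),
    ("http://lemonde.fr/", "b2", "Earthquake in Haiti, donations rise", (2010, 1), (2010, 2)),
    ("http://lemonde.fr/", "b3", "Football results", (2010, 1), (2010, 12)),
    ("http://example.com/", "b4", "Haiti earthquake", (2010, 1), (2010, 2)),
]
counts = count_blocks_per_month(
    blocks, r"^https?://[^/]+\.fr(/|$)", ["Haiti", "earthquake"], 2010)
```
]]></preformat>
      <p>For January the page contributes two matching blocks and is therefore counted as 2, in line with the counting rule stated in the example.</p>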
      <p>Example 2: Our second example consists of finding the
broken links from a given url X in a given period (2000 in our
example). Figure 7 illustrates the query graph for this
example. We underline that these broken links
are not the result of incompleteness but of HTTP 404 errors
encountered while crawling.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>In this paper, we addressed the problem of accessing
information in web archives. We presented a conceptual model
as the basis of a query language for web archives and described
the operators that support queries. In our model,
we take into account the different topics in web pages by
using visual blocks, with their accorded importance, as the unit
of retrieval. Block-based approaches are used for information
retrieval on the web; however, as far as we know, they have never been
used with a temporal dimension. Navigation operators with a
temporal dimension let users execute queries over the temporal
hyperlink structure of web archives. The model and
operators, enriched with the temporal dimension, allow expressive
querying of web archives.</p>
      <p>
        Our approach is at an early stage of development. Our
first priority is to express the language in algebraic form.
The next step will be an implementation with an appropriate
user-friendly syntax. We underline that
this paper formally clarifies the requirements of a WAC query
language, which can be implemented as a new query
language or as an extension of an existing query language. We
will also work on ranking functions that take into account
the block-based structure and the temporal dimension. By using
the existing temporal indexing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and block-based
indexing approaches [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we intend to propose a hybrid indexing
model. Once the proposed query language is fully
implemented, our attention will focus on query optimization
strategies.
      </p>
      <p>OPERATOR
Select</p>
      <p>
        DESCRIPTION
• corresponds to well-known select operator in the relational algebra. It returns a subset of data which satisfies given predicates. Temporal select operation is the same as the non-temporal selection but it is extended with additional predicates for temporal comparison
• corresponds to its well-known non-temporal version in the relational algebra
• applies an estimation function over the archive and finds more coherent version for a given page
• applies a ranking function over a bag
• computes the Cartesian product of two bags
• returns a temporal Cartesian product of two bags
• returns the union of two bags
• returns the union of temporal elements where their validities overlap
• returns the intersection of two bags
• returns the intersection of temporal elements where their validity overlap
• returns all the elements of the first bag which are not in the second bag
• in addition to non-temporal Difference, it checks and removes overlapping temporal parts from bags
• performs an inner-join on two non-temporal bags based on predicates
• performs an inner-join on two bags based on predicates
• removes duplicates for non temporal bags
• in addition to non-temporal Distinct, it removes duplicates by taking into account overlapping temporal parts
• creates groups of tuples sharing some attribute value
• COUNT, MIN, MAX are supported
• for a given element returns the version after/before
• eliminates the temporal nesting in a bag
• OR, AND, MILD-NOT, NOT, ORDERED, IN-BLOCK are supported
• realizes keyword search over index servers
• sorts a set into a specific order
• Allen’s thirteen temporal operators [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] are supported
• finds differences between two versions of a web page
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Html protocol, http://www.w3.org/protocols/rfc2616/rfc2616-sec14.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Iipc, international internet preservation consortium, http://netpreserve.org/about/index.php.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Internet archive, http://www.archive.org/index.php.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Pandora, australia's web archive, http://pandora.nla.gov.au/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Uk web archive, http://pandora.nla.gov.au/.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Use cases for access to internet archives</article-title>
          .
          <source>Use case report, IIPC</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Allen</surname>
          </string-name>
          .
          <article-title>Maintaining knowledge about temporal intervals</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>26</volume>
          :
          <fpage>832</fpage>
          -
          <lpage>843</lpage>
          , Nov.
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gertz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          .
          <article-title>On the value of temporal information in information retrieval</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>41</volume>
          :
          <fpage>35</fpage>
          -
          <lpage>41</lpage>
          , Dec.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Berberich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bedathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>A time machine for text search</article-title>
          .
          <source>In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>519</fpage>
          -
          <lpage>526</lpage>
          . ACM Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Bhowmick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Madria</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. K.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <source>Web Data Management</source>
          . Springer,
          <edition>1</edition>
          edition, Nov.
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Faessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Maitre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Scholl</surname>
          </string-name>
          .
          <article-title>Indexing by permeability in block structured web pages</article-title>
          .
          <source>In Proceedings of the 9th ACM symposium on Document engineering, DocEng '09</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Faessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Maitre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Scholl</surname>
          </string-name>
          .
          <article-title>Blockweb: An ir model for block structured web pages</article-title>
          .
          <source>In Content-Based Multimedia Indexing, 2009. CBMI '09. Seventh International Workshop on</source>
          , pages
          <fpage>219</fpage>
          -
          <lpage>224</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Block-level link analysis</article-title>
          .
          <source>In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <source>SIGIR '04</source>
          , pages
          <fpage>440</fpage>
          -
          <lpage>447</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>VIPS: a vision-based page segmentation algorithm</article-title>
          .
          <source>Technical Report MSR-TR-2003-79</source>
          , Microsoft Research,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Block-based web search</article-title>
          .
          <source>In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>456</fpage>
          -
          <lpage>463</lpage>
          . ACM Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Extraction of informative blocks from web pages</article-title>
          .
          <source>In Proceedings of the 2008 International Conference on Advanced Language Processing and Web Information Technology</source>
          , pages
          <fpage>544</fpage>
          -
          <lpage>549</lpage>
          , Washington, DC, USA,
          <year>2008</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mendelzon</surname>
          </string-name>
          .
          <article-title>Database techniques for the World-Wide web: A survey</article-title>
          .
          <source>SIGMOD RECORD</source>
          ,
          <volume>27</volume>
          :
          <fpage>59</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Santos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>Managing duplicates in a web archive</article-title>
          .
          <source>In Proceedings of the 2006 ACM symposium on Applied computing, SAC '06</source>
          , pages
          <fpage>818</fpage>
          -
          <lpage>825</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hallgrimsson</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bang</surname>
          </string-name>
          . Nordic web archive,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kovacevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diligenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maggini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Milutinovic</surname>
          </string-name>
          .
          <article-title>Recognition of common areas in a web page using visual information: a possible application in a page classification</article-title>
          .
          <source>In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02)</source>
          , page
          <fpage>250</fpage>
          . IEEE Computer Society
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Masanès</surname>
          </string-name>
          .
          <source>Web Archiving</source>
          . Springer Berlin Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pehlivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben-Saad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gançarski</surname>
          </string-name>
          .
          <article-title>Vi-DIFF: understanding web pages changes</article-title>
          .
          <source>In Database and Expert Systems Applications</source>
          , volume
          <volume>6261</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer Berlin / Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Complex queries over web repositories</article-title>
          .
          <source>In Proceedings of the 29th international conference on Very large data bases -</source>
          Volume
          <volume>29</volume>
          , VLDB '
          <year>2003</year>
          , pages
          <fpage>33</fpage>
          -
          <lpage>44</lpage>
          . VLDB Endowment,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ras</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>van Bussel</surname>
          </string-name>
          .
          <article-title>Web archiving user survey</article-title>
          .
          <source>Technical report, National Library of the Netherlands (Koninklijke Bibliotheek)</source>
          , Netherlands,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Learning block importance models for web pages</article-title>
          .
          <source>In Intl. World Wide Web Conf. WWW</source>
          , pages
          <fpage>203</fpage>
          -
          <lpage>211</lpage>
          . ACM Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stack</surname>
          </string-name>
          .
          <article-title>Full text search of web archive collections</article-title>
          . In IWAW,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tofel</surname>
          </string-name>
          .
          <article-title>Wayback for accessing web archives</article-title>
          .
          <source>In IWAW'07</source>
          , Vancouver, British Columbia, Canada,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xue</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Block-Based language modeling approach towards web search</article-title>
          .
          <source>In Web Technologies Research and Development - APWeb</source>
          <year>2005</year>
          , volume
          <volume>3399</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>170</fpage>
          -
          <lpage>182</lpage>
          . Springer Berlin / Heidelberg,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>