<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Anchor Text as Proxy for Real User Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thaer Samar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjen P. de Vries</string-name>
        </contrib>
        <aff>Centrum Wiskunde &amp; Informatica, Amsterdam; firstname.lastname@cwi.nl</aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>49</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>Web archives preserve the fast-changing web. While we can archive web pages, the popularity of queries in the past has usually not been preserved. Previous studies have observed the importance of anchor text for improving the quality of text search, and have shown that anchor text is similar to real user queries and document titles. Other studies have shown that document titles are similar to real user queries. In this paper, we propose an approach that uses temporal anchor text to reconstruct the information that a query log from the past would have provided. First, we study the link graph of four years of a Web archive in order to show how the target hosts and anchor text evolve over time. Second, we investigate the importance of anchor text over time. Our approach is to rank anchor texts by their popularity in the archive at a specific time. Then, we check the importance of the top ranked anchor texts on the public Web at the same time. To achieve this, we use the WikiStats dataset, which aggregates page views of Wikipedia pages. Using exact string matching between the top ranked anchor texts and the Wikipedia titles in the WikiStats dataset, we find a high percentage of overlap (approximately 57%). Our data strengthens the hypothesis that anchor text may be used as a proxy for actual query volume.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The World Wide Web (WWW) is today the largest and most important source of information,
because publishing and sharing data on it is easy. However, the Web is dynamic,
and data on the Web can easily be lost. Ntoulas et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] found that 80% of Web pages
are no longer available after one year. Many national libraries and organizations have realized the
importance of Web archives for future cultural heritage. Memory and heritage
institutions increasingly recognize that such born-digital data are as easily deleted as they are
published, thereby introducing unprecedented risks to the world’s digital cultural
heritage [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. A list of Web archiving initiatives undertaken by national libraries,
national archives, and national and international organizations for preserving the Web is given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Despite the important attempts to preserve parts of the web by archiving, a large part
of the web’s content is unarchived and hence lost forever. In practice, it is not feasible to
archive the entire web due to its ever-increasing size and rapidly changing content. The
overall consequence is that our web archives are highly incomplete. On the other hand,
a Web archive holds more than page content alone: it contains additional information about a
Web page, such as the archive date, outlinks, and anchor text.</p>
      <p>Queries that represent the past interests of real users, using the archived Web as it
was, are usually not available, because they were not preserved. Motivated by studies
showing that anchor text is similar to document titles and real user queries, we
propose to use important (popular) anchor text as a proxy for queries in the past. In
this paper, we study how the link graph evolves over time; specifically, we focus on
target hosts and anchor text. We investigate the evolution of anchor text over time in
order to understand what was important on the Web.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Methods based on link structure analysis, such as PageRank [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and HITS [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], have been widely used, especially in information retrieval. The links that define the
structure of the Web consist of a source URL, a destination URL, and anchor text which
is the text used to describe the target page in the link. Anchor text is a well-known
resource to enrich the representations of web page content to improve Web retrieval.
Craswell et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] first experimented with site finding using aggregated anchor text.
Aggregated anchor text for a link target has been used as surrogate documents, instead
of the target pages’ actual content. They concluded that anchor text can be more useful
than content words for navigational queries. Eiron and McCurley [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have investigated
the properties of anchor text in a large intranet corpus in order to understand why using
anchor text improves the quality of Web search. First, they showed empirically that
anchor text exhibits characteristics similar to real user queries. Second, they hypothesize
that anchor text is similar to web page titles, based on the observation by Jin et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
that titles can be used as an approximation of queries. They found that anchor text is
indeed similar to document titles.
      </p>
      <p>
        Work in this area led to advanced models that combine various representations of
page content, anchor text, and link evidence [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Kraft and Zien [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] showed that
anchor text can produce higher quality query refinement suggestions than content text.
Fujii [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed a model for classifying queries into navigational and informational.
Their retrieval system used content-based or anchor-based retrieval methods, depending
on the query type. Based on their experimental results, they concluded that content of
web pages is useful for informational query types, whereas anchor text information and
links are useful for navigational query types. Koolen and Kamps [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] concluded that
anchor text has added value for ad hoc informational search as well, and can lead to
significant improvements in retrieval effectiveness. Dou et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Kleinberg [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] took
the relationship between source and anchor text into account. Their model distinguished
between links from the same website and links from related sites to better estimate the
importance of anchor text. Similarly, Metzler et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] overcame the problem of
anchor text sparsity by smoothing the influence of anchor text originating from within
the same domain by using ‘external’ anchor text: the aggregated anchor text from all
pages that link to a page in the same domain as the page to be enriched.
      </p>
      <p>
        In the context of Web archiving, link evidence and anchor text can be used to
locate missing webpages whose original URL is no longer accessible. Klein and
Nelson [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] computed lexical signatures of lost webpages, using the top n words of link
anchors, and used these and other methods to retrieve alternative URLs for lost
webpages. Huurdeman et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] first used the link structure extracted from archived Web
pages to uncover target URLs that are not archived. Links extracted from the archived
pages contain evidence of the existence of unarchived target URLs. Second, they used
link evidence to reconstruct basic representations of target URLs. This evidence
includes the aggregated anchor text, crawl date, and source URLs.
      </p>
      <p>
        So far, we have described work that studied the structure of the Web and how
link structure analysis was exploited to improve retrieval effectiveness. However, all
of it focused on a single snapshot of archived websites. We now summarize
studies that focused on Web evolution by studying link development over time.
Web link structure is very dynamic and grows following a power law [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In the IR
community, several works used the temporal information of archived material to
improve search effectiveness. Li and Croft [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] proposed a time-based language model
based on studying the correlation between time and relevance. Based on the heuristic
that the probability of a document being relevant is higher for the most recent
documents, they boosted the relevance of recent documents. Jones and Diaz [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] exploited
the distribution of document versions over the timeline as an indication of the interval
of time relevant to a query. Elsas and Dumais [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] found that documents that are more
dynamic over time tend to be more relevant. Finally, Dai and Davison [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] quantified
anchor text importance by differentiating pages with different incoming link creation
rate over time and different historical incoming link context. They concluded that
incorporating the importance of anchor text over time in the ranking model improves
the performance, but they also point to the lack of available archived resources (few
encountered links were actually available in the Internet Archive).
      </p>
      <p>
        Costa et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] improved the effectiveness of searching Web archives by
incorporating temporal features such as the number of versions available for a document in
the archive and the lifespan between the first and last version of the document. They studied
the relation between Web document persistence and relevance. They presented an
approach that learns and combines multiple ranking models, each specific to a period of
time, based on their belief that a single generic ranking model cannot predict the
variance of Web characteristics over a long period of time. They worked on a test collection
constructed from the Portuguese Web Archive (PWA) to serve as ground truth
for Web Archive Information Retrieval (WAIR) research [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The dataset is publicly
available at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], including 269,801 assessed Web document versions. The assessed
documents were returned by different ranking models in response to 50 navigational
queries. Queries were randomly sampled from the PWA’s query log. The PWA consists
of archived documents from the Portuguese Web in the period from 1996 to 2009. They
found that there is no correlation between lifespan and number of versions, but both are
correlated with the relevance of documents. They found that 36% of documents have a
lifespan of less than one year; note that this percentage differs from the 80%
found by Ntoulas et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        Kanhabua and Nejdl [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] studied the evolution of anchor text extracted from edit
history of Wikipedia. First, they identified a set of entities using the approach introduced
by Bunescu and Pasca [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for each Wikipedia snapshot. The snapshots were generated
by partitioning revisions of Wikipedia pages based on one-month granularity. Then,
they generate a set of entity-anchor relationships, based on the anchor text derived from
links pointing to the entities. They found that anchor texts with temporal information
can be candidates for capturing and tracing the evolution of entities.
      </p>
      <p>In the context of Web archives, the queries that were used are usually not available,
especially when the archive was not available for search. Given all the previous work
that shows the similarity between anchor text and real user queries, and the similarity
between anchor text and titles, we propose to investigate the evolution of anchor text in
the past to gain insight into what was important and to reconstruct queries over time.</p>
    </sec>
    <sec id="sec-3">
      <title>Setup</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>
          This study uses data from the Dutch Web archive at the National Library of the
Netherlands (KB). The KB currently archives a pre-selected (seed) set of more than 5,000
websites [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. Websites for preservation are selected by the library per category related
to Dutch historical, social, and cultural heritage. Our snapshot of the Dutch Web archive
consists of 76,828 ARC files, which contain aggregated web content. Each ARC file
contains multiple archived records (content plus response header). A total of
148M documents were harvested between February 2009 and December 2012,
resulting in more than 7 terabytes of data. Basic harvest metadata is available (crawl
dates, page modification dates, etc.). Additional metadata is available in separate
documentation, which includes the KB’s selection list, dates of selection, and UNESCO codes
assigned manually by curators of the KB. Table 1 summarizes the number of
websites added to the selection list and the total number of Web objects archived over
the years.
        </p>
        <p>
We extract a link structure from the archived objects that have text/html as their
MIME type. The majority (approximately 70% per year) of the archived web objects
are HTML-based textual content. To extract the links from the archive, we use
MapReduce to process all archived web objects contained in the archive’s ARC files.
During processing, JSoup (http://jsoup.org/) is used to extract anchor links
from these text/html objects. For each anchor link found,
we keep the source URL (the URL of the page that contains the link), the target URL
(the URL of the page that the link points to), and the anchor text of the
link (a short text describing the target page). Each archived page comes with metadata
such as the crawl date. We combine the year and the month of
the crawl date (YYYYMM) with the link information. In addition, we keep the
MD5 hash of the source page. More precisely, we keep the following information:
(sourceURL, targetURL, linkType, anchorText, crawlDate, sourceHash).
The link type (linkType) indicates whether the link is internal or external. An
internal link has the same domain name for both source and target (intra-domain), while
for an external link the domain name of the source URL differs from that of the target
URL (inter-domain).
        </p>
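        <p>The link record described above can be sketched as follows. This is a minimal illustration, not the authors' MapReduce job: the names Link, host_of, link_type, and make_link are hypothetical, and comparing lower-cased host names with a leading "www." removed is a simplifying assumption standing in for the domain-name comparison.</p>
        <preformat>
```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Link:
    source_url: str
    target_url: str
    link_type: str    # "internal" or "external"
    anchor_text: str
    crawl_date: str   # YYYYMM
    source_hash: str  # MD5 hash of the source page


def host_of(url: str) -> str:
    """Lower-cased host name without a leading 'www.' (simplifying assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def link_type(source_url: str, target_url: str) -> str:
    """Internal when source and target share a host, external otherwise."""
    same = host_of(source_url) == host_of(target_url)
    return "internal" if same else "external"


def make_link(source_url, target_url, anchor_text, crawl_timestamp, source_hash):
    """Build one record; crawl_timestamp is assumed to start with YYYYMM."""
    return Link(source_url, target_url, link_type(source_url, target_url),
                anchor_text.strip(), crawl_timestamp[:6], source_hash)
```
        </preformat>
        <p>A real implementation would likely compare registered domains rather than host names, which requires a public suffix list.</p>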
        <p>Different seeds are harvested at different frequencies; while most sites are harvested
only once a year, some sites are crawled more frequently. Therefore, we deduplicate the
links based on their values for source, target, anchor text, year, and the hash of the source’s
content. We focus on the external links, and partition these links at one-year and
one-month granularity.</p>
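        <p>The deduplication and partitioning steps could look roughly like the sketch below. It assumes each link is a tuple (source, target, anchor text, crawl date as YYYYMM, source hash); the function names deduplicate and partition are hypothetical.</p>
        <preformat>
```python
from collections import defaultdict


def deduplicate(links):
    """Keep one link per (source, target, anchor text, year, source hash)."""
    seen = set()
    unique = []
    for source, target, anchor, crawl_date, source_hash in links:
        key = (source, target, anchor, crawl_date[:4], source_hash)
        if key not in seen:
            seen.add(key)
            unique.append((source, target, anchor, crawl_date, source_hash))
    return unique


def partition(links, granularity="year"):
    """Group links by crawl year (YYYY) or crawl month (YYYYMM)."""
    width = 4 if granularity == "year" else 6
    parts = defaultdict(list)
    for link in links:
        parts[link[3][:width]].append(link)
    return dict(parts)
```
        </preformat>
        <p>Because the deduplication key uses the year only, a page crawled unchanged in several months of the same year contributes each of its links once.</p>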
      </sec>
      <sec id="sec-3-2">
        <title>Wikipedia Page Views Statistics</title>
        <p>
          As evidence of query volume in the past, we used the WikiStats project dataset [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ],
which is aggregated from the page view statistics for Wikimedia
projects (footnote 2) and keeps the request history of articles from Wikipedia and
other projects. For each article, it keeps the title and the number of requests. WikiStats
consists of weekly absolute views for Wikipedia pages in the period from January 2008
to January 2015. It gives the number of page views for each Wikipedia page, the
top-level domain (TLD) of the page (such as NL for the Netherlands), and the page’s
title. Because our snapshot of the Dutch Web archive covers the period between
February 2009 and December 2012, we focus on the same period of the WikiStats dataset.
We partition the dataset in this period at one-month and one-year
granularity, keeping only Wikipedia titles that have more than 1,000 page views.
        </p>
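        <p>As a rough sketch of this filtering step, the code below sums weekly views per month and keeps titles above the threshold. The input layout (title, week start date as YYYY-MM-DD, views), the assignment of a week to the month of its start date, and the application of the 1,000-view threshold per partition are all assumptions made for illustration; popular_titles is a hypothetical name.</p>
        <preformat>
```python
from collections import defaultdict


def popular_titles(weekly_views, min_views=1000):
    """Sum weekly views per (month, title) and keep titles above min_views."""
    totals = defaultdict(int)
    for title, week_start, views in weekly_views:
        month = week_start[:4] + week_start[5:7]  # YYYY-MM-DD to YYYYMM
        totals[(month, title)] += views
    kept = defaultdict(dict)
    for (month, title), views in totals.items():
        if views > min_views:
            kept[month][title] = views
    return dict(kept)
```
        </preformat>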
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Hosts Evolution</title>
      <p>In Section 3.2, we introduced our approach for extracting the link graph from the archived
text/html pages and combining it with metadata such as the crawl date, generating
partitions at different granularities. In this section, we study the importance of hosts in
the archive over time.</p>
      <p>First, we experiment with partitions at one-year granularity. For each partition,
we derive the host of both the source page and the target page of each link, where multiple
links from the same source host are counted once. We then aggregate the
links by target host. Finally, we rank the target hosts by their number of incoming
links, which corresponds to the number of unique source hosts pointing to the target
host. Table 3 shows the top ranked hosts per year. We observe that the ranks of the
top hosts vary over the years. Considering the top 1,000 hosts per year, we find no
correlation (using Kendall’s τ) between the ranked lists of hosts in different years; the
strongest negative correlation was 0.982, between 2011 and 2012. Table 2 shows the
percentage of new hosts in our crawls over the years, for different thresholds
of the top hosts. Here, a host is considered new in a particular year if it does not appear
in any previous year.</p>
      <p>Footnote 2: http://dumps.wikimedia.org/other/pagecounts-raw/</p>
      <sec id="sec-4-1">
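        <p>The host ranking described above can be illustrated with a short sketch. It assumes links are given as (source URL, target URL) pairs for one partition and uses the lower-cased host name of each URL; rank_target_hosts is a hypothetical helper, not the authors' code.</p>
        <preformat>
```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def rank_target_hosts(links):
    """Rank target hosts by the number of unique source hosts linking to them."""
    sources_per_target = defaultdict(set)
    for source_url, target_url in links:
        # A set collapses multiple links from the same source host into one.
        sources_per_target[host_of(target_url)].add(host_of(source_url))
    counts = {host: len(sources) for host, sources in sources_per_target.items()}
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```
        </preformat>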
        <table-wrap>
          <label>Table 3</label>
          <caption>
            <p>Top ranked target hosts per year (2009 to 2012); the column and rank alignment is not preserved.</p>
          </caption>
          <preformat>z24.nl, twitter.com, wikipedia.org, belastingdienst.nl, vriezenveners.nl, mi-website.es,
startpagina.nl, fd.nl, blogspot.com, deviantart.com, co.uk, volkskrant.nl, gencircles.com,
sitestat.com, belastingdienst.nl, web-log.nl, startkabel.nl, imageshack.us, readspeaker.com,
google.com, hva.nl, digischool.nl, nrc.nl, trouw.nl, wordpress.com, photobucket.com, ugo.com,
hetutrechtsarchief.nl, wikipedia.org, twitter.com, wikipedia.org, hetutrechtsarchief.nl,
hetutrechtsarchief.nl, europa-nu.nl, bibe.library.uu.nl, europa.eu, vriezenveners.nl,
startpagina.nl, minszw.nl, uva.nl, readspeaker.com, blogspot.com, co.uk, google.com,
sitestat.com, amazon.com, wordpress.com, youtube.com, ebay.com, omroep.nl, volkskrant.nl,
web-log.nl, ligfiets.net, nrc.nl, biblion.nl, twitter.com, europa-nu.nl, europa.eu,
bibe.library.uu.nl, blogspot.com, youtube.com, co.jp, co.uk, wordpress.com, leidenuniv.nl,
google.com, belastingdienst.nl, startpagina.nl, vriezenveners.nl, amazon.com, readspeaker.com,
ligfiets.net, zijpermuseum.nl, co.cc, ebay.com, kennisnet.nl, tue.nl, wikipedia.org, europa.eu,
bibe.library.uu.nl, wordpress.com, blogspot.com, europa-nu.nl, youtube.com, vriezenveners.nl,
co.uk, google.com, leidenuniv.nl, ebay.com, rijksoverheid.nl, marktplaats.nl, overheid.nl,
co.jp, knaw.nl, volkskrant.nl, nuzakelijk.nl, zie.nl, startpagina.nl, facebook.com, tue.nl</preformat>
        </table-wrap>
        <p>Next, we experiment with aggregating links by target host at one-month
granularity. Table 4 and Table 5 show the top hosts per month in 2009, illustrating that
the top hosts vary over the months as well. The number of target hosts varies per month,
with an average of 53,215 hosts per month, of which 25% are new.</p>
        <p>In this section, we look into the usage of anchor text over time. For each partition A<sub>t</sub>
at a given time granularity, we aggregate links by anchor text. The number of links
using anchor text a represents the frequency of a in partition A<sub>t</sub>. We use the relative
frequency to represent the importance of anchor text a in the archive at a specific time
granularity t (archive-based popularity), computing the importance of the
anchor text as follows:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[I(a, A_t) = \frac{f(a, A_t)}{\mathit{max}_{A_t}}]]></tex-math>
        </disp-formula>
        <disp-formula>
          <label>(2)</label>
          <tex-math><![CDATA[\mathit{max}_{A_t} = \max_{a} f(a, A_t)]]></tex-math>
        </disp-formula>
        <p>where f(a, A<sub>t</sub>) is the frequency of anchor text a in partition A<sub>t</sub>, and max<sub>A<sub>t</sub></sub> is the
maximum frequency of any anchor text in partition A<sub>t</sub>.</p>
        <p>First, we investigate the evolution of anchor text over time. For the anchor
text in partition A<sub>t</sub>, we compute the percentage of new anchor text at time t. An
anchor text is considered new in A<sub>t</sub> if it does not appear in any previous partition:</p>
        <disp-formula>
          <label>(3)</label>
          <tex-math><![CDATA[\mathit{new}(a, t) = \begin{cases} 1, & \text{if } a \notin \bigcup_{i<t} A_i \\ 0, & \text{otherwise} \end{cases}]]></tex-math>
        </disp-formula>
        <p>In the definition of new(a, t), A<sub>i</sub> denotes any partition that precedes A<sub>t</sub> in time.
Based on the partitions of one-year granularity, with an average of 999,695
distinct anchor texts per year, we find that on average 59% of the anchor texts
in a year are new. Based on the partitions of one-month granularity, 17,024
links with distinct anchor text exist per month. The average percentage of new anchor
text per month is 34%.</p>
        <table-wrap>
          <caption>
            <p>Top target hosts per month in 2009 (cf. Tables 4 and 5), one rank position per row.</p>
          </caption>
          <preformat>volkskrantblog.nl seniorweb.nl readspeaker.com mi-website.es vriezenveners.nl hva.nl
anwb.nl fietsersbond.nl belastingdienst.nl startpagina.nl gencircles.com startpagina.nl
wordpress.com archined.nl cwi.nl startkabel.nl startkabel.nl blogspot.com
adobe.com begraafplaats.org artsennet.nl fd.nl startpagina.nl wikipedia.org
google.com co.uk wordpress.com volkskrant.nl deviantart.com google.com
anwbentreebewijs.nl sitestat.com w3.org z24.nl ugo.com co.uk
waverunner.nl drenthe.nl europa.eu digischool.nl readspeaker.com web-log.nl
wikipedia.org wikipedia.org knzb.nl wikipedia.org belastingdienst.nl twitter.com
postbus51.nl site-id.nl wikipedia.org sitestat.com wikipedia.org wikimedia.org
pharosreizen.nl google.com imageshack.us trouw.nl photobucket.com lexius.nl
live.com overheid.nl google.com web-log.nl imageshack.us blogger.com
w3.org knhb.nl oreilly.com nrc.nl twitter.com omroep.nl
volkskrant.nl nai.nl co.uk co.uk youtube.com wordpress.com
amsterdam.nl amsterdam.nl overheid.nl szw.nl blogspot.com creativecommons.org
telekom.at xs4all.nl google.nl members.lycos.co.uk blogger.com greenpeace.org
vrom.nl uitvaartmedia.com photobucket.com ifrance.com avs.nl hanze.nl
gelderlander.nl leidenuniv.nl myspace.com kennisnet.nl google.com technorati.com
google.nl tudelft.nl businessweek.com lycos.nl co.uk youtube.com
youtube.com uitvaartinformatie.nl uitvaart.nl ad.nl wikimedia.org hszuyd.nl
belastingdienst.nl volkskrant.nl blogspot.com blogspot.com independer.nl vpro.nl</preformat>
        </table-wrap>
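        <p>The importance score and the percentage of new anchor text can be computed along the following lines. This is a small sketch under stated assumptions: each partition is a plain list of anchor text strings, partitions are keyed by YYYY or YYYYMM strings so that string comparison follows time order, and the function names importance and percentage_new are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def importance(anchor_texts):
    """Frequency of each anchor text divided by the maximum frequency in the partition."""
    freq = Counter(anchor_texts)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {a: f / max_freq for a, f in freq.items()}


def percentage_new(partitions, t):
    """Percentage of distinct anchor texts in partition t unseen in earlier partitions."""
    current = set(partitions[t])
    earlier = set()
    for key, texts in partitions.items():
        if t > key:  # keys are YYYY or YYYYMM strings, so this means "earlier"
            earlier.update(texts)
    if not current:
        return 0.0
    return 100.0 * len(current - earlier) / len(current)
```
        </preformat>
        <p>With this normalization, the most frequent anchor text in a partition always has importance 1.</p>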
        <p>
          We have discussed a series of studies that showed that document titles are close to
real user queries, and that anchor text is similar to both document titles and real user
queries. We therefore hypothesize that we may be able to reconstruct query volume in
the past based on anchor text used in the past. Similar to the use of Wikipedia in [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ],
we use the WikiStats dataset (described in Section 3.3) to find out how the
important anchor text in the archive relates to popular queries in the past on the public
Web. We consider the number of page views of Wikipedia titles that match an anchor
text to represent the importance of that anchor text on the public Web (web-based
popularity). We study the similarity between anchor text and Wikipedia titles at
varying temporal granularity. We use exact string matching to match anchor text with titles
of Wikipedia pages in the WikiStats dataset, using the same time granularity.
Matching is done after transforming both anchor text and Wikipedia titles to lower case.
For each partition at time t, we rank the anchor text by archive-based
popularity, after which we check, at different thresholds k, how many of the top-k anchor texts
occur in the WikiStats dataset (in the partition at time t of the WikiStats dataset).
Table 6 summarizes the percentage of anchor text that has a matching Wikipedia
title. As the table shows, a high percentage of the top ranked anchor text has a
matching Wikipedia title. For example, 56% of the top-1k anchor texts in
the 2009 partition were also found in the 2009 partition of the WikiStats dataset. We
observe that the percentage of overlap between anchor text and the WikiStats dataset
partitions decreases as we increase the threshold k. The percentage reaches
26% (averaged across all partitions) when we consider all anchor text in the one-year
partition.
        </p>
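        <p>The top-k overlap check can be sketched as below. The sketch assumes the anchor texts of one partition are given as a list of strings and the WikiStats titles of the same partition as an iterable of strings; top_k_overlap is a hypothetical name. Ranking is done on the anchor text as given, and lower-casing is applied only for matching.</p>
        <preformat>
```python
from collections import Counter


def top_k_overlap(anchor_texts, wiki_titles, k):
    """Percentage of the k most frequent anchor texts that exactly match a title."""
    titles = {title.lower() for title in wiki_titles}
    counts = Counter(anchor_texts)
    top = [text for text, _ in counts.most_common(k)]
    if not top:
        return 0.0
    matched = sum(1 for text in top if text.lower() in titles)
    return 100.0 * matched / len(top)
```
        </preformat>
        <p>Exact matching is strict by design here, so an anchor text such as "nunl" would not match a title such as "nu.nl".</p>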
        <p>Table 7 shows a comma-separated sample of anchor text taken from the top-1k
popular anchor text in 2012 that has no matching Wikipedia title in the 2012 partition
of the WikiStats dataset. Some of these are uninformative and serve a specific purpose,
such as ‘login’ to proceed. Some anchor text has no match because of the limitations
of our approach of looking for an exact string match between the anchor text and the
Wikipedia titles. For example, the anchor text ‘filmpje’ has no match, but the WikiStats
dataset contains a page with the title ‘filmpje!’. Likewise, ‘nunl’ has no exact match, although
there is a Wikipedia page with the title ‘nu.nl’. In the future, our approach should handle
these cases by applying additional pre-processing steps such as stemming and stopping,
and by generalizing from exact matches to matches with a low edit distance. The list of anchor
text in the top-1k in 2012 that has a matching Wikipedia title is shown in Table 8. We
observe that some of these anchor texts correspond to major cities in the Netherlands, such as
Amsterdam, Rotterdam, Groningen, Utrecht, and Den Haag. Another category of the top anchor text is related to social websites such
as twitter, linkedin, flickr, and vimeo. A different category of anchor text consists of
the major Dutch daily newspapers, such as de Volkskrant, Telegraaf, Trouw, and NRC
Handelsblad. The ‘uitzending gemist’ occurrence is related to a web service of the Dutch
Public Broadcasting (NPO) that offers free on-demand video of national broadcasts.
The ‘belastingdienst’ anchor text refers to the Dutch
national tax office.</p>
        <p>At one-month granularity, on average 26% of all anchor text over all
months has an exact match with a Wikipedia title (using all domains). The highest
percentage of Wikipedia titles that match the anchor text originates from the ‘NL’ domain
(around 55%). Ranking the anchor text in each one-month partition by
archive-based popularity, we find that 42.5% of the anchor text in the top-1k has a match
with Wikipedia titles.</p>
        <p>Table 7. Sample of anchor text from the top-1k popular anchor text in 2012 without a matching Wikipedia title:
ga naar website van de fabrikant, word vaste donateur of doneer online via
de website van dit goede doel, create your own free blog on wordpresscom,
filmpje, vacatures, log in to proceed, wordpresscom, view more information,
grotere kaart weergeven, inlichtingen, routebeschrijving, powered by
wordpresscom, more information, projectinformatie, volg ons op twitter, nunl,
eigen homepage, inschrijven</p>
        <p>In this study, we looked into the viability of a new approach that uses the evolution
of anchor text over time to reconstruct information similar to real user
queries in the past. Our hypothesis is based on studies that have shown that anchor text
behaves similarly to both real user queries and document titles. We used the link structure
extracted from the Dutch Web archive to identify the most popular target hosts over
time, and to obtain the most popular anchor text over time. The link structure was extracted
from archived text/html pages in the Dutch Web archive in the period
between February 2009 and December 2012. To understand the importance
of the anchor text, we relied on the WikiStats dataset, which provides an aggregation of
page views of Wikipedia pages. We investigated the exact matches between anchor text
and Wikipedia titles, where both datasets (the link structure and WikiStats) were
partitioned at one-month and one-year granularity. Our analysis of the target
hosts shows that target hosts evolve significantly. At one-month granularity,
on average 25% of all hosts per month are new. We experimented with finding popular
anchor text per time granularity, ranking anchor texts by their popularity in the
archive. We find that a high percentage of the anchor text in the top ranks has a match with
Wikipedia titles in the WikiStats dataset. At one-year granularity, we found
that 57% of the top-1k anchor texts have matching Wikipedia titles. We conclude from
our data that the most important anchor text provides a view of the important entities in
the Netherlands. We cannot, however, conclude that the evolution of anchor text serves as a
proxy for past query logs. There are some limitations that we will consider in future
work. First, the analysis of matching anchor text and Wikipedia titles suggests room for
improving our approach by applying additional pre-processing steps such as stemming and
stopping, and by generalizing from exact matches to matches with a low edit distance. Second,
we tested our approach on a ‘deep crawl’ that is based on a few thousand seeds. In
the future, we will test our approach on a ‘breadth-first crawl’ such as the Common Crawl
dataset (footnote 3).</p>
      </sec>
      <sec id="sec-4-2">
        <p>Footnote 3: https://commoncrawl.org/</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Dataset for learning to rank for WAIR research</article-title>
          . https://code.google.com/p/pwa-technologies/wiki/L2R4WAIR.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>List of Web archiving initiatives</article-title>
          . http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Brin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <article-title>The anatomy of a large-scale hypertextual web search engine</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>30</volume>
          (
          <issue>1-7</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>117</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Razvan C.</given-names>
            <surname>Bunescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marius</given-names>
            <surname>Pasca</surname>
          </string-name>
          .
          <article-title>Using encyclopedic knowledge for named entity disambiguation</article-title>
          . In Diana McCarthy and Shuly Wintner, editors,
          <source>EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference</source>
          , April 3-7, 2006, Trento, Italy. The Association for Computer Linguistics,
          <year>2006</year>
          . ISBN 1-932432-59-0. URL http://acl.ldc.upenn.edu/E/E06/E06-1002.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Miguel</given-names>
            <surname>Costa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mário J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>Evaluating web archive search systems</article-title>
          . In Xiaoyang Sean Wang, Isabel F. Cruz, Alex Delis, and Guangyan Huang, editors,
          <source>WISE</source>
          , volume
          <volume>7651</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>440</fpage>
          -
          <lpage>454</lpage>
          . Springer,
          <year>2012</year>
          . ISBN 978-3-642-35062-7.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Miguel</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Francisco M.</given-names>
            <surname>Couto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mário J.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>Learning temporal-dependent ranking models</article-title>
          . In Shlomo Geva, Andrew Trotman, Peter Bruza, Charles L. A. Clarke, and Kalervo Järvelin, editors,
          <source>SIGIR</source>
          , pages
          <fpage>757</fpage>
          -
          <lpage>766</lpage>
          . ACM,
          <year>2014</year>
          . ISBN 978-1-4503-2257-7.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Nick</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Hawking</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>Effective site finding using link anchor information</article-title>
          . In
          <source>SIGIR</source>
          , pages
          <fpage>250</fpage>
          -
          <lpage>257</lpage>
          . ACM,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Na</given-names>
            <surname>Dai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Brian D.</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <article-title>Mining anchor text trends for retrieval</article-title>
          . In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, Suzanne Little, Thomas Roelleke, Stefan M. Rüger, and Keith van Rijsbergen, editors,
          <source>ECIR</source>
          , volume
          <volume>5993</volume>
          of LNCS, pages
          <fpage>127</fpage>
          -
          <lpage>139</lpage>
          . Springer,
          <year>2010</year>
          . ISBN 978-3-642-12274-3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Zhicheng</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ruihua</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jian-Yun</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ji-Rong</given-names>
            <surname>Wen</surname>
          </string-name>
          .
          <article-title>Using anchor texts with their hyperlink structure for web search</article-title>
          . In James Allan, Javed A. Aslam, Mark Sanderson, ChengXiang Zhai, and Justin Zobel, editors,
          <source>SIGIR</source>
          , pages
          <fpage>227</fpage>
          -
          <lpage>234</lpage>
          . ACM,
          <year>2009</year>
          . ISBN 978-1-60558-483-6.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Nadav</given-names>
            <surname>Eiron</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin S.</given-names>
            <surname>McCurley</surname>
          </string-name>
          .
          <article-title>Analysis of anchor text for web search</article-title>
          . In
          <source>SIGIR</source>
          , pages
          <fpage>459</fpage>
          -
          <lpage>460</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jonathan L.</given-names>
            <surname>Elsas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <article-title>Leveraging temporal dynamics of document content in relevance ranking</article-title>
          . In Brian D. Davison, Torsten Suel, Nick Craswell, and Bing Liu, editors,
          <source>WSDM</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . ACM,
          <year>2010</year>
          . ISBN 978-1-60558-889-6.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Atsushi</given-names>
            <surname>Fujii</surname>
          </string-name>
          .
          <article-title>Modeling anchor text and classifying queries to enhance web document retrieval</article-title>
          . In Huai et al. [13], pages
          <fpage>337</fpage>
          -
          <lpage>346</lpage>
          . ISBN 978-1-60558-085-2.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Jinpeng</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robin</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hsiao-Wuen</given-names>
            <surname>Hon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yunhao</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei-Ying</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Tomkins</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaodong</given-names>
            <surname>Zhang</surname>
          </string-name>
          , editors.
          <source>Proceedings of the 17th International Conference on World Wide Web, WWW 2008</source>
          , Beijing, China, April 21-25, 2008. ACM,
          <year>2008</year>
          . ISBN 978-1-60558-085-2.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
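          <string-name>
            <given-names>Hugo C.</given-names>
            <surname>Huurdeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jaap</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thaer</given-names>
            <surname>Samar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Arjen P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anat</given-names>
            <surname>Ben-David</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Richard A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          .
          <article-title>Lost but not forgotten: finding pages on the unarchived web</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          ,
          <year>2015</year>
          . ISSN 1432-5012. doi: 10.1007/s00799-015-0153-3. URL http://dx.doi.org/10.1007/s00799-015-0153-3.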
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Rong</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander G.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Title language model for information retrieval</article-title>
          . In
          <source>SIGIR 2002: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , August 11-15, 2002, Tampere, Finland, pages
          <fpage>42</fpage>
          -
          <lpage>48</lpage>
          . ACM,
          <year>2002</year>
          . doi: 10.1145/564376.564386. URL http://doi.acm.org/10.1145/564376.564386.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Rosie</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Diaz</surname>
          </string-name>
          .
          <article-title>Temporal profiles of queries</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jaap</given-names>
            <surname>Kamps</surname>
          </string-name>
          .
          <article-title>Web-centric language models</article-title>
          . In Otthein Herzog, Hans-Jörg Schek, Norbert Fuhr, Abdur Chowdhury, and Wilfried Teiken, editors,
          <source>CIKM</source>
          , pages
          <fpage>307</fpage>
          -
          <lpage>308</lpage>
          . ACM,
          <year>2005</year>
          . ISBN 1-59593-140-6.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Nattiya</given-names>
            <surname>Kanhabua</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <article-title>On the value of temporal anchor texts in Wikipedia</article-title>
          . In
          <source>SIGIR 2014 Workshop on Temporal, Social and Spatially-aware Information Access (TAIA'2014)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          .
          <article-title>Moved but not gone: an evaluation of real-time methods for discovering replacement web pages</article-title>
          .
          <source>Int. J. on Digital Libraries</source>
          ,
          <volume>14</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>17</fpage>
          -
          <lpage>38</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Jon M.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ):
          <fpage>604</fpage>
          -
          <lpage>632</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Jon M.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ):
          <fpage>604</fpage>
          -
          <lpage>632</lpage>
          ,
          <year>1999</year>
          . ISSN 0004-5411. doi: 10.1145/324133.324140.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Marijn</given-names>
            <surname>Koolen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jaap</given-names>
            <surname>Kamps</surname>
          </string-name>
          .
          <article-title>The importance of anchor text for ad hoc search revisited</article-title>
          . In Fabio Crestani, Stéphane Marchand-Maillet, Hsin-Hsi Chen, Efthimis N. Efthimiadis, and Jacques Savoy, editors,
          <source>SIGIR</source>
          , pages
          <fpage>122</fpage>
          -
          <lpage>129</lpage>
          . ACM,
          <year>2010</year>
          . ISBN 978-1-4503-0153-4.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Reiner</given-names>
            <surname>Kraft</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Zien</surname>
          </string-name>
          .
          <article-title>Mining anchor text for query refinement</article-title>
          . In
          <source>Proceedings of the 13th International Conference on World Wide Web, WWW '04</source>
          , pages
          <fpage>666</fpage>
          -
          <lpage>674</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM. ISBN 1-58113-844-X. doi: 10.1145/988672.988763.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jon M.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christos</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          .
          <article-title>Graph evolution: Densification and shrinking diameters</article-title>
          .
          <source>TKDD</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <year>2007</year>
          . doi: 10.1145/1217299.1217301. URL http://doi.acm.org/10.1145/1217299.1217301.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Xiaoyan</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Time-based language models</article-title>
          . In
          <source>CIKM</source>
          , pages
          <fpage>469</fpage>
          -
          <lpage>475</lpage>
          . ACM,
          <year>2003</year>
          . ISBN 1-58113-723-0.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Donald</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jasmine</given-names>
            <surname>Novak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Cui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Srihari</given-names>
            <surname>Reddy</surname>
          </string-name>
          .
          <article-title>Building enriched document representations using aggregated anchor text</article-title>
          . In
          <source>SIGIR</source>
          , pages
          <fpage>219</fpage>
          -
          <lpage>226</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM. ISBN 978-1-60558-483-6. doi: 10.1145/1571941.1571981.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          . Wikistats - Wikipedia page views,
          <year>2013</year>
          . URL http://wikistats.ins.cwi.nl.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Ntoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Junghoo</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Olston</surname>
          </string-name>
          .
          <article-title>What's new on the web?: the evolution of the web from a search engine perspective</article-title>
          . In Stuart I. Feldman, Mike Uretsky, Marc Najork, and Craig E. Wills, editors,
          <source>WWW</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . ACM,
          <year>2004</year>
          . ISBN 1-58113-844-X.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ras</surname>
          </string-name>
          .
          <article-title>Eerste fase webarchivering</article-title>
          .
          <source>Technical report, Koninklijke Bibliotheek</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] UNESCO.
          <source>Charter on the preservation of digital heritage (article 3.4)</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Stewart</given-names>
            <surname>Whiting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joemon M.</given-names>
            <surname>Jose</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Omar</given-names>
            <surname>Alonso</surname>
          </string-name>
          .
          <article-title>Wikipedia as a time machine</article-title>
          . In
          <source>23rd International World Wide Web Conference, WWW '14</source>
          , Seoul, Republic of Korea, April 7-11, 2014, Companion Volume, pages
          <fpage>857</fpage>
          -
          <lpage>862</lpage>
          ,
          <year>2014</year>
          . doi: 10.1145/2567948.2579048. URL http://doi.acm.org/10.1145/2567948.2579048.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>