=Paper=
{{Paper
|id=None
|storemode=property
|title=Dynamics of Search Engine Rankings – A Case Study
|pdfUrl=https://ceur-ws.org/Vol-703/paper1.pdf
|volume=Vol-703
|dblpUrl=https://dblp.org/rec/conf/www/Bar-IlanLM04
}}
==Dynamics of Search Engine Rankings – A Case Study==
Judit Bar-Ilan
The Hebrew University of Jerusalem and Bar-Ilan University, Israel
judit@cc.huji.ac.il
Mark Levene and Mazlita Mat-Hassan
School of Computer Science and Information Systems
Birkbeck, University of London
{mark, azy}@dcs.bbk.ac.uk
Abstract
The objective of this study was to characterize the changes in the rankings of the top-n results
of major search engines over time and to compare the rankings between these engines. We
considered only the top-ten results, since users usually inspect only the first page returned by
the search engine, which normally contains ten results. In particular, we compare rankings of
the top ten results of the search engines Google and AlltheWeb on identical queries over a
period of three weeks. The experiment was carried out twice, in October 2003 and in January
2004, in order to assess changes to the top ten results of some of the queries over a three-month
period. Results show that the rankings of AlltheWeb were highly stable over each
period, while the rankings of Google underwent constant yet minor changes, with occasional
major ones. Changes over time can be explained by the dynamic nature of the Web or by
fluctuations in the search engines’ indexes (especially when frequent switches in the rankings
are observed). The top ten results of the two search engines have surprisingly low overlap.
With such small overlap (occasionally only a single URL) the task of comparing the rankings
of the two engines becomes extremely challenging, and additional measures are needed to
assess rankings in such situations.
Introduction
The Web is growing continuously; new pages are published on the Web every day. However,
it is not enough to publish a Web page – this page must also be locatable. Currently the
primary tools for locating information on the Web are the search engines, and by far the most
popular search engine is Google (Nielsen/NetRatings, 2003; Sullivan & Sherman, 2004).
Google reportedly covers over 4.2 billion pages as of mid-February 2004 (Google, 2004;
Price, 2004), a considerable jump from the more than 3.3 billion pages reported from August
2003 until mid-February 2004. Some of the pages indexed by Google are not from the traditional
“publicly indexable Web” (Lawrence & Giles, 1999), for example records from OCLC’s
WorldCat (Quint, 2003). Currently the second largest search engine in terms of the reported
number of indexed pages is AlltheWeb with over 3.1 billion pages (AlltheWeb, 2004). At the
time of our data collection, the two search engines were of similar size. There are no recent
studies on the coverage of Web search engines, but the 1999 study of Lawrence and Giles
found that the then-largest search engine (NorthernLight) covered only about 16% of the
Web. Today, authors of Web pages can influence the inclusion of their pages through the
paid-inclusion services. AlltheWeb has a paid-inclusion service, and even though Google
doesn’t, one’s chances of being crawled are increased if the pages appear in major directories
(which do have paid-inclusion services) (Sullivan, 2003a).
However, it is not enough to be included in the index of a search engine; placement is also
crucial, since most Web users do not browse beyond the first ten or twenty results (Silverstein
et al., 1999; Spink et al., 2002). Paid inclusion is not supposed to influence the placement of
the page. The SEOs (Search Engine Optimizers) offer their services to increase the ranking of
your pages on certain queries (see for example Search Engine Optimization, Inc,
http://www.seoinc.com/) – Google (Google, 2003a) warns against careless use of such
services. Thus it is clear that the top ten results retrieved for a given query have the best
chance of being visited by Web users. This was the main motivation for the research we
present herein, in addition to examining the changes over time in the top ten results for a set
of queries on the two currently largest search engines, Google and AlltheWeb. In parallel to
this line of enquiry, we also studied the similarity (or rather non-similarity) between the top
ten results of these two tools.
For this study, we could not analyze the ranking algorithms of the search engines, since these
are kept secret, both because of the competition between the different tools and in order to
avoid misuse of the knowledge of these algorithms by users who want to be placed high on
specific queries. For example, Google is willing to disclose only that its ranking algorithm
involves more than 100 factors, but “due to the nature of our business and our interest in
protecting the integrity of our search results, this is the only information we make available to
the public about our ranking system” (Google, 2003b). Thus we had to use empirical methods
to study the differences in the ranking algorithms and the influence of time on the rankings of
search engines.
The usual method of evaluating rankings is through human judgment. In an early study by Su
et al. (1998), users were asked to choose and rank the five most relevant items from the first
twenty results retrieved for their queries. In their study, Lycos performed better on this
criterion than the other three search engines examined. Hawking et al. (1999) compared
precision at 20 of five commercial search engines with precision at 20 of six TREC systems.
The results for the commercial engines were retrieved from their own databases, while the
TREC engines’ results came from an 18.5 million pages test collection of Web pages.
Findings showed that the TREC systems outperformed the Web search engines, and the
authors concluded that “the standard of document rankings produced by public Web search
engines is by no means state-of-the-art.” On the other hand, Singhal and Kaszkiel (2001)
compared a well-performing TREC system with four Web search engines and found that “for
finding the web page/site of an entity, commercial web search engines are notably better than
a state-of-the-art TREC algorithm.” They were looking for home pages of the entity and
evaluated the search tool by the rank of the URL in the search results that pointed to the
desired site. In Fall 1999, Hawking et al. (2001) evaluated the effectiveness of twenty public
Web search engines on 54 queries. One of the measures used was the reciprocal rank of the
first relevant document – a measure closely related to ranking. The results showed significant
differences between the search engines and high intercorrelation between the measures.
Chowdhury and Soboroff (2002) also evaluated search effectiveness based on the reciprocal
rank – this time of the URL of a known item.
Evaluations based on human judgments are unavoidably subjective. Voorhees (2000)
examined this issue, and found very high correlations among the rankings of the systems
produced by different relevance judgment sets. The paper considers rankings of the different
systems and not rankings within the search results, and despite the fact that the agreement on
the ranking performance of the search tools was high, the mean overlap between the relevance
judgments on individual documents of two judges was below 50% (binary relevance
judgments were made). Soboroff et al. (2001), based on the finding that differences in human
relevance judgments do not affect the relative evaluated performance of the different
systems, proposed a method for ranking systems based on randomly selecting “pseudo-relevant”
documents. In a recent study, Vaughan (to appear) compared human rankings of 24
participants with those of three large commercial search engines, Google, AltaVista and
Teoma on four search topics. The highest average correlation between the human-based
rankings and the rankings of the search engines was for Google, where the average correlation
was 0.72. The average correlation for AltaVista was 0.49.
Fagin et al. (2003) proposed a method for comparing the top-k results retrieved by different
search engines. One of the applications of the metrics proposed by them was comparing the
rankings of the top 50 results of seven public search tools (AltaVista, Lycos, AlltheWeb,
HotBot, NorthernLight, AOLSearch and MSNSearch - some of them received their results
from the same source, e.g., Lycos and AlltheWeb) on 750 queries. The basic idea of their
method was to assign some reasonable, virtual placement to documents that appear in one of
the lists but not in the other. The resulting measures were proven to be metrics, which is a
major point they stress in their paper.
The studies we have mentioned concentrate on comparing the search results of several
engines at one point in time. In contrast, this study examines the temporal changes in search
results over a period of time within a single engine and between different engines. In
particular, we concentrate on the results of two of the largest search engines, Google and
AlltheWeb, using three different measures described below.
Methodology
Data Collection
The data for this study was collected during two time periods of approximately three weeks
each, the first during October 2003 and the second during January 2004. The data
collection for the first period was a course assignment at Birkbeck, University of London. Each
student was required to choose a query from a list of ten queries and also to choose an
additional query of his/her own liking. These two queries were to be submitted to Google
(google.com) and AlltheWeb (alltheweb.com) twice a day (morning and evening) during a
period of three weeks. The students were to record the ranked list of the top ten retrieved
URLs for each search point. Overall, 34 different queries were tracked by twenty-seven
students (some of the queries were tracked by more than one student). The set of all queries
that were processed, with the numbering assigned to them, appears in Table 1. For the first
period, queries q01-q05 were analyzed.
The process was repeated at the beginning of January 2004. We picked 10 queries from the list
of 34 queries. This time we queried Google.com, Google.co.uk, Google.co.il and Alltheweb
in order to assess the differences between the different Google sites as well. In this
experiment, at each data collection point all the searches were carried out within a 20-minute
timeframe. The reason for rerunning the searches was to study the effect of time on the top
ten results. Between the two parts of the experiment, Google most likely introduced a major
change into its ranking algorithm (called the “Florida Google Dance” - (Sullivan, 2003b)),
and we were interested to study the effects of this change. For the second period queries q01-
q10 were analyzed. The search terms were not submitted as phrases at either stage.
Query ID Query
q01 Modern architecture
q02 Web data mining
q03 world rugby
q04 Web personalization
q05 Human Cloning
q06 Internet security
q07 Organic food
q08 Snowboarding
q09 dna evidence
q10 internet advertising techniques
Table 1: The queries
The Measures
We used three measures in order to assess the changes over time in the rankings of the search
engines and to compare the results of Google and AlltheWeb. The first and simplest measure
is the size of the overlap between two top ten lists.
The second measure was Spearman’s rho. Spearman’s rho is applied to two rankings of the
same set, thus if the size of the set is N, all the rankings must be between 1 and N (ties are
allowed). Since the top ten results retrieved by two search engines on a given query, or
retrieved by the same engine on two consecutive days are not necessarily identical, the two
lists must be transformed before Spearman’s rho can be computed. First the non-overlapping
URLs were eliminated from both lists, and then the remaining lists were reranked: each URL
was given its relative rank in the set of remaining URLs in its list. After these
transformations Spearman’s rho could be computed:
r = 1 − (6 ∑ d_i²) / ((n² − 1)n)
where d_i is the difference between the rankings of URL i in the two lists. The value of r is
between -1 and 1, where -1 indicates that the two lists have opposite rankings, and 1 indicates
perfect correlation. Note that Spearman’s rho is based on the reranked lists; thus, for
example, if the original ranks of the URLs that appear in both lists (the overlapping pairs) are
(1,8), (2,9) and (3,10), the reranked pairs will be (1,1), (2,2) and (3,3) and the value of
Spearman’s rho will be 1 (perfect correlation).
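To make the transformation concrete, here is a minimal Python sketch (illustrative only; the function and variable names are ours, not taken from the study) of the reranked Spearman’s rho described above: non-overlapping URLs are dropped, the remaining URLs are reranked by their relative order in each list, and the standard formula is applied.

def reranked_spearman(list_a, list_b):
    # Keep only the URLs that occur in both ranked lists.
    overlap = [url for url in list_a if url in list_b]
    n = len(overlap)
    if n < 2:
        return None  # rho is undefined for fewer than two common URLs
    # Relative (re)rank of each overlapping URL within its own list.
    rank_a = {url: i for i, url in enumerate(u for u in list_a if u in overlap)}
    rank_b = {url: i for i, url in enumerate(u for u in list_b if u in overlap)}
    d_squared = sum((rank_a[url] - rank_b[url]) ** 2 for url in overlap)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

On the worked example above, the overlapping URLs originally ranked (1,8), (2,9) and (3,10) are reranked to (1,1), (2,2) and (3,3), so the function returns 1. The sketch ignores ties, which the measure in principle allows.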
The third measure utilized by us was one of the metrics introduced by Fagin et al. (2003). It is
relatively easy to compare two rankings of the same list of items – for this well-known
statistical measures such as Kendall’s tau or Spearman’s rho can be easily utilized. The
problem arises when the two search engines that are being compared rank non-identical sets
of documents. To cover this case (which is the usual case when comparing top-k lists created
by different search engines), Fagin et al. (2003) extended the previously mentioned metrics.
Here we discuss only the extension of Spearman’s footrule (a variant of Spearman’s rho
which, unlike Spearman’s rho, is a metric), but the extensions of Kendall’s tau are shown in
the paper to be equivalent to the extension of Spearman’s footrule. A major point in their
method was to develop measures that are either metrics or “near” metrics. Spearman’s
footrule is the L1 distance between two permutations (rankings of identical sets can be
viewed as permutations): F(σ1, σ2) = ∑_i |σ1(i) − σ2(i)|. This metric is extended to the
case where the two lists are not identical: each document appearing in one of the lists but not
in the other is assigned an arbitrary placement (larger than the length of the list) in the
second list – when comparing lists of length k this placement can be k+1 for all the
documents not appearing in the list. The rationale for this extension is that the ranking of
those documents must be k+1 or higher – Fagin et al. do not take into account the possibility
that those documents are not indexed at all by the other search engine. The extended metric
becomes:
F^(k+1)(τ1, τ2) = 2(k − z)(k + 1) + ∑_{i∈Z} |τ1(i) − τ2(i)| − ∑_{i∈S} τ1(i) − ∑_{i∈T} τ2(i)
where Z is the set of overlapping documents, z is the size of Z, S is the set of documents
that appear only in the first list, and T is the set of documents that appear only in the second list. A
problem with the measures proposed by Fagin et al. is that when the two lists have little in
common, the non-common documents have a major effect on the measure. Our experiments
show that usually the overlap between the top ten results of two search engines for an
identical query is very small, and the non-overlapping elements have a major effect.
F^(k+1) was normalized by Fagin et al. so that the values lie between 0 and 1. For k=10 the
normalization factor is 110. Since F^(k+1) is a distance measure, the smaller the value, the more
similar the two lists are; for Spearman’s rho, in contrast, the nearer the value is to 1, the more
similar the two lists are. In order to be able to compare the two measures, we computed
two measures, we computed
G^(k+1) = 1 − F^(k+1) / max F^(k+1)
which we refer to as the G metric.
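Under the same definitions, the extended footrule F^(k+1) and the normalised G metric can be sketched in a few lines of Python (again an illustration with names of our own choosing, not the code used in the study). Every URL missing from the other list implicitly receives the virtual rank k+1, which is already folded into the closed form of the formula above.

def fagin_footrule(list_a, list_b):
    # F^(k+1) between two top-k lists of equal length k (ranks are 1..k).
    k = len(list_a)
    rank_a = {url: i + 1 for i, url in enumerate(list_a)}
    rank_b = {url: i + 1 for i, url in enumerate(list_b)}
    z = set(list_a) & set(list_b)   # Z: overlapping URLs
    s = set(list_a) - z             # S: URLs only in the first list
    t = set(list_b) - z             # T: URLs only in the second list
    return (2 * (k - len(z)) * (k + 1)
            + sum(abs(rank_a[u] - rank_b[u]) for u in z)
            - sum(rank_a[u] for u in s)
            - sum(rank_b[u] for u in t))

def g_metric(list_a, list_b):
    # G^(k+1) = 1 - F^(k+1) / max F^(k+1). For two disjoint lists
    # F^(k+1) = 2k(k+1) - 2(1 + ... + k) = k(k+1), i.e. 110 for k = 10,
    # which is the normalization factor quoted above.
    k = len(list_a)
    return 1 - fagin_footrule(list_a, list_b) / (k * (k + 1))

Identical lists give F^(k+1) = 0 and hence G = 1, while disjoint lists give F^(k+1) = k(k+1) and G = 0.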
Data analysis
For a given search engine and a given query we computed these measures on the results for
consecutive data collection points. When comparing two search engines we computed the
measures on the top ten results retrieved by both engines on the given data collection point.
The two periods were compared on five queries - here we calculated the overlap between the
two periods and assessed the changes in the rankings of the overlapping elements based on
the average rankings.
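As an illustration of this procedure, the per-query statistics could be produced along the following lines (a sketch under our own naming assumptions: snapshots stands for a hypothetical list of top-ten URL lists, one per data collection point, and the helper functions are the sketches given in the previous section).

def consecutive_stats(snapshots):
    # Compare every pair of consecutive data collection points for one engine.
    overlaps, rhos, gs = [], [], []
    for prev, curr in zip(snapshots, snapshots[1:]):
        overlaps.append(len(set(prev) & set(curr)))
        rhos.append(reranked_spearman(prev, curr))
        gs.append(g_metric(prev, curr))
    return overlaps, rhos, gs

# Comparing the two engines at a single data collection point t:
# overlap = len(set(google_snapshots[t]) & set(alltheweb_snapshots[t]))
# rho     = reranked_spearman(google_snapshots[t], alltheweb_snapshots[t])
# g       = g_metric(google_snapshots[t], alltheweb_snapshots[t])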
Results and Discussion
A Single Engine over Time
AlltheWeb was very stable during both phases on all queries, as can be seen in Table 2. There
were almost no changes either in the set of URLs retrieved or in the relative placement of
these URLs in the top ten results. Some of the queries were monitored by several students,
thus the number of data comparisons (comparing the results of consecutive data collection
points) was high. For each query we present the total number of URLs identified during the
period, and the average and minimum number of URLs that were retrieved at both of two
consecutive data collection points (overlap). The maximum overlap was 10 for each of the
queries, and an overlap of 10 was rather frequent, thus we computed the percentage of the
comparisons where the set of URLs was not identical at both of the points being compared
(% of points with overlap less than 10). In addition, Table 2 displays the percentage
of comparisons where the relative ranking of the overlapping URLs changed and the minimal
values of Spearman’s rho and of G (the maximal values were 1 in all cases). Finally, in order
to assess the changes in the top-ten URLs over a longer period of time, we also present the
number of URLs that were retrieved in both the first and the last data collection points.
query | # days monitored | # data comparisons | # URLs identified during period | average overlap | min overlap | % of points with overlap less than 10 | % of points where relative ranking changed | min Spearman | min G | overlap between first and last point
q01 | 12 | 20 | 10 | 10 | 10 | 0% | 0% | 1 | 1 | 10
q02 | 17 | 34 | 11 | 9.91 | 9 | 9% | 0% | 1 | 1 | 10
q03 | 26 | 109 | 12 | 9.86 | 9 | 14% | 2% | 0.9 | 0.8 | 10
q04 | 24 | 100 | 15 | 9.8 | 9 | 20% | 0% | 1 | 0.873 | 8
q05 | 21 | 41 | 10 | 10 | 10 | 0% | 0% | 1 | 1 | 10
Table 2: AlltheWeb – first period
When considering the data for Google we see somewhat larger variability, but still the
changes between two consecutive data points are rather small. Note that for the query number
3 (world rugby), there were frequent changes in the placement of the top ten URLs.
query | # days monitored | # data comparisons | # URLs identified during period | average overlap | min overlap | % of points with overlap less than 10 | % of points where relative ranking changed | min Spearman | min G | overlap between first and last point
q01 | 12 | 20 | 11 | 9.95 | 9 | 5% | 10% | 0.95 | 0.891 | 9
q02 | 17 | 34 | 12 | 9.88 | 9 | 12% | 3% | 0.983 | 0.933 | 9
q03 | 26 | 109 | 14 | 9.86 | 8 | 10% | 35% | 0.548 | 0.8 | 8
q04 | 24 | 100 | 14 | 9.36 | 7 | 57% | 0% | 1 | 0.691 | 6
q05 | 21 | 41 | 10 | 10 | 10 | 0% | 54% | 0.891 | 0.927 | 10
Table 3: Google.com – first period
[Figure 1: line chart entitled "Google – Web Personalization – First Period"; x-axis: data capture points; y-axis: G, ranging from 0.65 to 1.00]
Figure 1: Time series of the G metric for the query web personalization, submitted to Google.com
Figures 1 and 2 present time-series for query 4, web personalization. The x-axis for both
graphs shows consecutive time-ordered data capture points. In Figure 1 we see that, during
the observed period, the G metric fluctuates mainly between 0.9 and 1.0, apart from a
significant drop to 0.7 during three data capture points during the middle of the period. This is
due to the decrease in the size of the overlap (from 9 to 7) and changes in the ranking of the
top-ten URLs observed.
[Figure 2: line chart; x-axis: data capture points; y-axis: ranking, positions 5 to 10; chart title: Google – Web personalization – for the URL "Web personalization – Computer World" (www.computerworld.com/news/1999/story/0,11280,43546,00.html)]
Figure 2: Time series of Google ranking for the "Web personalization – Computer World" page
Figure 2 illustrates the change in Google’s ranking of one web page entitled “Web
personalization – Computer World”, which contains an article from the Computer World
website. The ranking of this page was stable at position 6 for the first twenty-three data points
observed. The ranking then fluctuated between positions 8, 9 and 10 from data capture points
25 to 35. It is interesting to observe that, during this period, the rank of this Web page
changed twice a day, in the morning and in the evening. The page then settled at
position 9, and disappeared completely from the top-ten result list three days before the
end of the observed period.
A similar analysis was carried out for the queries during the second period. The results appear
in Tables 4 and 5. During the second period as well, the results and the rankings of AlltheWeb
were highly stable. Google.com exhibited considerable variability, even though the average
overlap was above 9 for all ten queries. Unlike with AlltheWeb, the relative placements
of the URLs changed quite often.
Perhaps the most interesting case for Google.com was query 10 (internet advertising
techniques), where all except two of the previous hits were replaced by completely new ones
(and the relative rankings of the two remaining URLs were swapped); from this point on
the search engine presented this new set of results. This was not accidental: the same behavior
was observed on Google.co.uk and Google.co.il as well. We do not display the results for
Google.co.uk and Google.co.il here, since the descriptive statistics are very similar, even
though there are slight differences between the result sets. We shall discuss this point more
extensively when we compare the results of the different engines.
query | # days monitored | # comparisons | # URLs identified during period | average overlap | min overlap | % of points with overlap less than 10 | % of points where relative ranking changed | min Spearman | min G | overlap between first and last point
q01 | 22 | 44 | 11 | 9.97 | 9 | 2% | 0% | 1 | 0.84 | 9
q02 | 22 | 44 | 11 | 9.97 | 9 | 2% | 0% | 1 | 0.945 | 9
q03 | 22 | 44 | 11 | 9.97 | 9 | 2% | 0% | 1 | 0.818 | 9
q04 | 22 | 44 | 13 | 9.76 | 8 | 21% | 0% | 1 | 0.89 | 8
q05 | 22 | 44 | 10 | 10 | 10 | 0% | 0% | 1 | 1 | 10
q06 | 22 | 44 | 10 | 10 | 10 | 0% | 0% | 1 | 1 | 10
q07 | 22 | 44 | 10 | 10 | 10 | 0% | 0% | 1 | 1 | 10
q08 | 22 | 44 | 11 | 9.97 | 9 | 2% | 0% | 1 | 0.98 | 9
q09 | 22 | 44 | 13 | 9.9 | 9 | 14% | 0% | 1 | 0.927 | 9
q10 | 22 | 44 | 13 | 9.97 | 8 | 5% | 0% | 1 | 0.872 | 8
Table 4: AlltheWeb – second period
query | # days monitored | # data comparisons | # URLs identified during period | average overlap | min overlap | % of points with overlap less than 10 | % of points where relative ranking changed | min Spearman | min G | overlap between first and last point
q01 | 22 | 43 | 20 | 9.56 | 6 | 35% | 28% | 0.889 | 0.636 | 5
q02 | 22 | 43 | 17 | 9.65 | 7 | 30% | 12% | 0.929 | 0.836 | 6
q03 | 22 | 43 | 17 | 9.65 | 8 | 28% | 23% | 0.842 | 0.818 | 7
q04 | 22 | 43 | 28 | 8.37 | 5 | 54% | 21% | 0.4 | 0.418 | 7
q05 | 22 | 43 | 13 | 9.88 | 9 | 12% | 26% | 0.903 | 0.909 | 9
q06 | 22 | 43 | 14 | 9.77 | 9 | 23% | 2% | 0.933 | 0.818 | 8
q07 | 22 | 43 | 15 | 9.81 | 8 | 16% | 58% | 0.612 | 0.854 | 8
q08 | 22 | 43 | 19 | 9.49 | 7 | 35% | 23% | 0.905 | 0.745 | 6
q09 | 22 | 43 | 14 | 9.77 | 9 | 23% | 14% | 0.85 | 0.855 | 9
q10 | 22 | 43 | 20 | 9.7 | 2 | 14% | 12% | -1 | 0.109 | 1
Table 5: Google.com – second period
Comparing Two Engines
At the time of the data collection the two search engines reportedly indexed approximately
the same number of documents (about 3 billion). In spite of this, the results show that the
overlap between the top ten results is extremely small (see Tables 6 and 7). The small
positive and the negative values of Spearman’s rho indicate that the relative rankings of the
overlapping elements are considerably different – thus even for those URLs that are
considered highly relevant to the given topic by both search engines, the agreement on the
relative importance of these documents is rather low.
query | # days monitored | # comparisons | average overlap | min overlap | max overlap | average Spearman | min Spearman | max Spearman | average G | min G | max G
q01 | 12 | 21 | 2 | 2 | 2 | -1 | -1 | -1 | 0.145 | 0.145 | 0.145
q02 | 17 | 35 | 4 | 4 | 4 | 0.2978 | 0.266 | 0.311 | 0.4 | 0.4 | 0.4
q03 | 26 | 110 | 4.43 | 4 | 6 | -0.139 | -0.8 | 0.527 | 0.387 | 0.245 | 0.472
q04 | 24 | 101 | 1 | 1 | 1 | n/a | n/a | n/a | 0.177 | 0.173 | 0.182
q05 | 21 | 42 | 3 | 3 | 3 | 0.5 | 0.5 | 0.5 | 0.220 | 0.200 | 0.267
Table 6: Comparing the search results for AlltheWeb and Google.com – first period
query | # days monitored | # comparisons | average overlap | min overlap | max overlap | average Spearman | min Spearman | max Spearman | average G | min G | max G
q01 | 22 | 44 | 2 | 2 | 2 | -1 | -1 | -1 | 0.133 | 0.109 | 0.145
q02 | 22 | 44 | 3.48 | 2 | 4 | 0.361 | 0.2 | 1 | 0.352 | 0.255 | 0.418
q03 | 22 | 44 | 3.75 | 3 | 5 | 0.545 | 0 | 0.8 | 0.317 | 0.291 | 0.345
q04 | 22 | 44 | 1.05 | 1 | 2 | n/a | -1 | n/a | 0.140 | 0.127 | 0.236
q05 | 22 | 44 | 1.82 | 1 | 2 | n/a | n/a | 1 | 0.216 | 0.182 | 0.236
q06 | 22 | 44 | 5 | 5 | 5 | 0.698 | 0.6 | 0.7 | 0.6 | 0.509 | 0.616
q07 | 22 | 44 | 4.95 | 4 | 5 | 0.202 | 0.1 | 0.5 | 0.416 | 0.4 | 0.472
q08 | 22 | 44 | 3.3 | 2 | 4 | 0.493 | -1 | 1 | 0.309 | 0.218 | 0.509
q09 | 22 | 44 | 3.09 | 3 | 4 | 0.527 | 0.5 | 0.8 | 0.438 | 0.436 | 0.455
q10 | 22 | 44 | 1.55 | 1 | 3 | n/a | n/a | 0.5 | 0.109 | 0.036 | 0.273
Table 7: Comparing the search results for AlltheWeb and Google.com – second period
There are two possible reasons why a given URL does not appear in the top ten results of a
search engine: either it is not indexed by the search engine or the engine ranks it after the first
ten results. We checked whether the URLs identified by the two search engines during the
second period are indexed by each of the engines (we ran this check in February 2004). We
defined three cases: the URL was in the top ten list of the engine some time during the period
(called “top-ten”); it was not in the top ten but is indexed by the search engine (“indexed”);
or it is not indexed at all (“not indexed”). The results for queries 1-5 appear in Table 8. The
results for these five queries show that both engines index most of the URLs located (between
67.6% and 96.6% of the URLs – top-ten and indexed combined), thus it seems that the
ranking algorithms of the two search engines are highly dissimilar.
query | URLs identified | AlltheWeb: top-ten | AlltheWeb: indexed | AlltheWeb: not indexed | Google.com: top-ten | Google.com: indexed | Google.com: not indexed
q01 | 28 | 35.7% | 42.9% | 21.4% | 71.4% | 25.0% | 3.6%
q02 | 24 | 45.8% | 45.8% | 8.4% | 70.8% | 25.0% | 4.2%
q03 | 22 | 50.0% | 31.8% | 18.2% | 77.3% | 13.6% | 9.1%
q04 | 39 | 33.3% | 35.9% | 30.8% | 71.8% | 12.8% | 15.4%
q05 | 20 | 50% | 25% | 25% | 60% | 30% | 10%
Table 8: URLs indexed by both engines
During the second period we collected data not only from Google.com, but from
Google.co.uk and Google.co.il as well. Overall the results are rather similar, but there are
some differences, as can be seen from the results for five randomly chosen queries comparing
Google.co.il with AlltheWeb (Table 9 – compare with Table 7) and comparing Google.com
with Google.co.il (Table 10).
query | # days monitored | # comparisons | average overlap | min overlap | max overlap | average Spearman | min Spearman | max Spearman | average G | min G | max G
q02 | 22 | 44 | 3.27 | 3 | 4 | 0.42 | 0.2 | 0.5 | 0.37 | 0.327 | 0.418
q04 | 22 | 44 | 1.02 | 1 | 2 | n/a | -1 | n/a | 0.1 | 0.127 | 0.236
q06 | 22 | 44 | 5 | 5 | 5 | 0.602 | 0.6 | 0.7 | 0.6 | 0.509 | 0.618
q07 | 22 | 44 | 4.98 | 4 | 5 | 0.237 | 0.2 | 0.8 | 0.406 | 0.364 | 0.455
q08 | 22 | 44 | 3.51 | 3 | 4 | 0.749 | 0.4 | 1 | 0.383 | 0.309 | 0.436
Table 9: Comparing the search results for AlltheWeb and Google.co.il – second period
query | # days monitored | # comparisons | average overlap | min overlap | max overlap | average Spearman | min Spearman | max Spearman | average G | min G | max G
q01 | 22 | 44 | 9.6 | 9 | 10 | 0.998 | 0.964 | 1 | 0.97 | 0.909 | 1
q02 | 22 | 44 | 8.3 | 3 | 10 | 0.987 | 0.429 | 1 | 0.837 | 0.382 | 1
q05 | 22 | 44 | 9.75 | 9 | 10 | 0.944 | 0.745 | 1 | 0.95 | 0.836 | 1
q06 | 22 | 44 | 9.91 | 9 | 10 | 1 | 1 | 1 | 0.995 | 0.909 | 1
q10 | 22 | 44 | 9.98 | 9 | 19 | 0.996 | 0.903 | 1 | 0.996 | 0.927 | 1
Table 10: Comparing the search results for Google.com and Google.co.il – second period
Table 10 shows that usually the correlation between google.com and google.co.il is very high
– for some reason query 2 (Web data mining) was an exception.
Comparing Two Periods
The second period of data collection took place about three months after the first one. We
tried to assess the changes in the top ten lists of the two search engines. The findings are
summarized in Table 11. Here we see again that AlltheWeb is less dynamic than Google,
except for query 4 (web personalization), where considerable changes were recorded for
AlltheWeb as well.
query | AlltheWeb: URLs (two periods) | AlltheWeb: overlap | AlltheWeb: URLs missing from second set | AlltheWeb: min change in average ranking | AlltheWeb: max change in average ranking | Google: URLs (both periods) | Google: overlap | Google: URLs missing from second set | Google: min change in average ranking | Google: max change in average ranking
q01 | 11 | 10 | 1 | 0 | 0.75 | 22 | 9 | 2 | 0 | 2.72
q02 | 11 | 10 | 0 | 0 | 1 | 19 | 10 | 2 | 0 | 5.61
q03 | 22 | 8 | 4 | 0 | 2.45 | 19 | 12 | 2 | 0.18 | 3.64
q04 | 19 | 7 | 7 | 0 | 2.68 | 32 | 10 | 4 | 0 | 2.52
q05 | 10 | 10 | 0 | 0 | 0 | 13 | 10 | 0 | 0 | 1.40
Table 11: Comparing the two periods
Discussion and Conclusions
In this paper, we computed several measures in order to assess the changes that occur
over time in the rankings of the top ten results on a number of queries for two search engines.
We used several measures because none of them is satisfactory on its own for such an
assessment. Overlap does not assess rankings at all, while Spearman’s rho
ignores the non-overlapping elements and takes into account relative placement only.
Moreover, Fagin’s measure gives too much weight to the non-overlapping elements. The
three measures together provide a better picture than any of these measures alone. Since none
of these measures are completely satisfactory, we recommend experimenting with additional
measures in the future.
The results indicate that the top ten results usually change gradually. Abrupt changes were
observed only very occasionally. Overall, AlltheWeb seems to be much less dynamic than
Google. The ranking algorithms of the two search engines seem to be highly dissimilar: even
though both engines index most of the URLs that appeared in the top ten lists, the differences
in the top ten lists are large (the overlap is small and the correlations between the rankings of
the overlapping elements are usually small, sometimes even negative). One reason Google
is more dynamic may be that its search indexes are unsynchronised while they are
being updated, together with the non-deterministic nature of query processing that follows
from its distributed architecture.
An additional area for further research, along the lines of the research carried out by Vaughan
(to appear), is comparing the rankings provided by the search engines with human judgments
placed on the value of the retrieved documents.
References
AlltheWeb (2004). Retrieved February 18, 2004 from http://www.alltheweb.com
Chowdhury, A. and Soboroff, I. (2002). Automatic evaluation of World Wide Web Search
Services. In Proceedings of the 25th Annual International ACM SIGIR Conference, 421-
422.
Fagin, R., Kumar, R. and Sivakumar, D. (2003). Comparing top k lists. SIAM Journal on
Discrete Mathematics, 17(1), 134-160.
Google. (2003a). Google information for Webmasters. Retrieved February 18, 2004, from
http://www.google.com/webmasters/seo.html
Google. (2003b). Google information for Webmasters. Retrieved February 18, 2004, from
http://www.google.com/webmasters/4.html
Google. (2004) Retrieved February 18, 2004 from http://www.google.com
Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001). Measuring search engine
quality. Information Retrieval, 4, 33-59.
Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (1999). Results and challenges in
Web search evaluation. In Proceedings of the 8th International World Wide Web
Conference, May 1999, Computer Networks, 31(11-16), 1321-1330, Retrieved February
18, 2004, from http://www8.org/w8-papers/2c-search-discover/results/results.html
Lawrence, S., & Giles, L. (1999). Accessibility of information on the Web. Nature, 400, 107-
109.
Nielsen/NetRatings (2003). NetView usage metrics. Retrieved February 18, 2004, from
http://www.netratings.com/news.jsp?section=dat_to
Price, G. (2004). Google ups total page count. In Resourceshelf. Retrieved February 18, 2004,
from
http://www.resourceshelf.com/archives/2004_02_01_resourceshelf_archive.html#107702
946623981034
Quint, B. (2003). OCLC Project Opens WorldCat Records to Google. In Information Today.
Retrieved February 18, 2004, from http://www.infotoday.com/newsbreaks/nb031027-
2.shtml
Silverstein, C., Henzinger, M., Marais, H and Moricz, M. (1999). Analysis of a very large
Web search engine query log. ACM SIGIR Forum, 33(1). Retrieved February 18, 2004
from http://www.acm.org/sigir/forum/F99/Silverstein.pdf
Singhal, A., and Kaszkiel, M. (2001). A case study in Web search using TREC algorithms. In
Proceedings of the 10th International World Wide Web Conference, May 2001, 708-716.
Retrieved February 18, 2004 from http://www10.org/cdrom/papers/pdf/p317.pdf
Spink, A., Ozmutlu, S., Ozmutlu, H., C., & Jansen, B. J. (2002). U.S. versus European Web
searching trends. SIGIR Forum, Fall 2002. Retrieved February 18, 2004 from
http://www.acm.org/sigir/forum/F2002/spink.pdf
Soboroff, I., Nicholas, C. and Cahan, P. (2001). Ranking retrieval systems without relevance
judgments. In Proceedings of the 24th annual international ACM SIGIR conference, 66-
72.
Su, L. T., Chen, H.L. and Dong, X. Y. (1998). Evaluation of Web-based search engines from
the end-user's perspective: A pilot study. In Proceedings of the ASIS Annual Meeting, 35,
348-361.
Sullivan, D. (2003a). Buying your way in: Search engine advertising chart. Retrieved
February 18, 2004, from
http://www.searchenginewatch.com/webmasters/article.php/2167941
Sullivan, D. (2003b). Florida Google dance resources. Retrieved February 18, 2004 from
http://www.searchenginewatch.com/searchday/article.php/3285661
Sullivan, D., & Sherman, C. (2004). 4th Annual Search Engine Watch 2003 Awards.
Retrieved February 18, 2004, from
http://www.searchenginewatch.com/awards/article.php/3309841
Vaughan, L. (to appear). New measurements for search engine evaluation proposed and
tested. To appear in Information Processing & Management. doi:10.1016/S0306-
4573(03)00043-8
Voorhees, E. M. (2000). Variations in relevance judgments and the measurement of retrieval
effectiveness. Information Processing and Management, 36, 697-716.