<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamics of Search Engine Rankings - A Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Judit Bar-Ilan</string-name>
          <email>judit@cc.huji.ac.il</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Levene</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mazlita Mat-Hassan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Information Systems, Birkbeck, University of London</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Hebrew University of Jerusalem and Bar-Ilan University, Israel</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The objective of this study was to characterize the changes in the rankings of the top-n results of major search engines over time and to compare the rankings between these engines. We considered only the top-ten results, since users usually inspect only the first page returned by the search engine, which normally contains ten results. In particular, we compare rankings of the top ten results of the search engines Google and AlltheWeb on identical queries over a period of three weeks. The experiment was repeated twice, in October 2003 and in January 2004, in order to assess changes to the top ten results of some of the queries over a three-month period. Results show that the rankings of AlltheWeb were highly stable over each period, while the rankings of Google underwent constant yet minor changes, with occasional major ones. Changes over time can be explained by the dynamic nature of the Web or by fluctuations in the search engines' indexes (especially when frequent switches in the rankings are observed). The top ten results of the two search engines have surprisingly low overlap. With such small overlap (occasionally only a single URL) the task of comparing the rankings of the two engines becomes extremely challenging, and additional measures are needed to assess rankings in such situations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Web is growing continuously; new pages are published on the Web every day. However, it is not enough to publish a Web page: the page must also be locatable. Currently the primary tools for locating information on the Web are the search engines, and by far the most popular search engine is Google (Nielsen/NetRatings, 2003; Sullivan &amp; Sherman, 2004). Google reportedly covers over 4.2 billion pages as of mid-February 2004 (Google, 2004; Price, 2004), a considerable jump from the more than 3.3 billion pages reported between August 2003 and mid-February 2004. Some of the pages indexed by Google are not from the traditional “publicly indexable Web” (Lawrence &amp; Giles, 1999), for example records from OCLC's WorldCat (Quint, 2003). Currently the second largest search engine in terms of the reported number of indexed pages is AlltheWeb, with over 3.1 billion pages (AlltheWeb, 2004). At the time of our data collection, the two search engines were of similar size. There are no recent studies on the coverage of Web search engines, but the 1999 study of Lawrence and Giles found that the then-largest search engine (NorthernLight) covered only about 16% of the Web.</p>
      <p>
        Today, authors of Web pages can influence the inclusion of their pages through paid-inclusion services. AlltheWeb has a paid-inclusion service, and even though Google does not, one's chances of being crawled are increased if the pages appear in major directories (which do have paid-inclusion services) (Sullivan, 2003a). However, it is not enough to be included in the index of a search engine; placement is also crucial, since most Web users do not browse beyond the first ten or twenty results (Silverstein et al., 1999; Spink et al., 2002). Paid inclusion is not supposed to influence the placement of the page. SEOs (Search Engine Optimizers) offer their services to increase the ranking of your pages on certain queries (see for example Search Engine Optimization, Inc., http://www.seoinc.com/) – Google
        <xref ref-type="bibr" rid="ref19 ref4 ref5">(Google, 2003a)</xref>
        warns against careless use of such services. Thus it is clear to all that the top ten results retrieved on a given query have the best chance of being visited by Web users. This was the main motivation for the research we present herein, in addition to examining the changes over time in the top ten results for a set of queries on the currently two largest search engines, Google and AlltheWeb. In parallel to this line of enquiry, we also studied the similarity (or rather dissimilarity) between the top ten results of these two tools.
      </p>
      <p>
        For this study, we could not analyze the ranking algorithms of the search engines, since these
are kept secret, both because of the competition between the different tools and in order to
avoid misuse of the knowledge of these algorithms by users who want to be placed high on
specific queries. For example, Google is willing to disclose only that its ranking algorithm
involves more than 100 factors, but “due to the nature of our business and our interest in
protecting the integrity of our search results, this is the only information we make available to
the public about our ranking system”
        <xref ref-type="bibr" rid="ref12 ref19 ref4 ref5">(Google, 2003b)</xref>
        . Thus we had to use empirical methods
to study the differences in the ranking algorithms and the influence of time on the rankings of
search engines.
      </p>
      <p>
        The usual method of evaluating rankings is through human judgment. In an early study by
        <xref ref-type="bibr" rid="ref17">Su
et al. (1998)</xref>
        , users were asked to choose and rank the five most relevant items from the first
twenty results retrieved for their queries. In their study, Lycos performed better on this
criterion than the other three examined search engines.
        <xref ref-type="bibr" rid="ref8">Hawking et al. (1999)</xref>
        compared
precision at 20 of five commercial search engines with precision at 20 of six TREC systems.
The results for the commercial engines were retrieved from their own databases, while the
TREC engines’ results came from an 18.5 million pages test collection of Web pages.
Findings showed that the TREC systems outperformed the Web search engines, and the
authors concluded that “the standard of document rankings produced by public Web search
engines is by no means state-of-the-art.” On the other hand,
        <xref ref-type="bibr" rid="ref14">Singhal and Kaszkiel (2001)</xref>
        compared a well-performing TREC system with four Web search engines and found that “for
finding the web page/site of an entity, commercial web search engines are notably better than
a state-of-the-art TREC algorithm.” They were looking for home pages of the entity and
evaluated the search tool by the rank of the URL in the search results that pointed to the
desired site. In Fall 1999,
        <xref ref-type="bibr" rid="ref7">Hawking et al. (2001)</xref>
        evaluated the effectiveness of twenty public
Web search engines on 54 queries. One of the measures used was the reciprocal rank of the
first relevant document – a measure closely related to ranking. The results showed significant
differences between the search engines and high intercorrelation between the measures.
        <xref ref-type="bibr" rid="ref2">Chowdhury and Soboroff (2002)</xref>
        also evaluated search effectiveness based on the reciprocal
rank – this time of the URL of a known item.
      </p>
      <p>
        Evaluations based on human judgments are unavoidably subjective.
        <xref ref-type="bibr" rid="ref22">Voorhees (2000)</xref>
        examined this issue, and found very high correlations among the rankings of the systems
produced by different relevance judgment sets. The paper considers rankings of the different
systems and not rankings within the search results, and despite the fact that the agreement on
the ranking performance of the search tools was high, the mean overlap between the relevance
judgments on individual documents of two judges was below 50% (binary relevance
judgments were made).
        <xref ref-type="bibr" rid="ref16">Soboroff et al. (2001)</xref>, based on the finding that differences in human
judgments of relevance do not affect the relative evaluated performance of the different
systems, proposed a method for ranking systems by randomly selecting “pseudo-relevant”
documents. In a recent study, Vaughan (to appear) compared human rankings of 24
participants with those of three large commercial search engines, Google, AltaVista and
Teoma on four search topics. The highest average correlation between the human-based
rankings and the rankings of the search engines was for Google, where the average correlation
was 0.72. The average correlation for AltaVista was 0.49.
        <xref ref-type="bibr" rid="ref3">Fagin et al. (2003)</xref>
        proposed a method for comparing the top-k results retrieved by different
search engines. One of the applications of the metrics proposed by them was comparing the
rankings of the top 50 results of seven public search tools (AltaVista, Lycos, AlltheWeb,
HotBot, NorthernLight, AOLSearch and MSNSearch - some of them received their results
from the same source, e.g., Lycos and AlltheWeb) on 750 queries. The basic idea of their
method was to assign some reasonable, virtual placement to documents that appear in one of
the lists but not in the other. The resulting measures were proven to be metrics, which is a
major point they stress in their paper.
      </p>
      <p>The studies we have mentioned concentrate on comparing the search results of several
engines at one point in time. In contrast, this study examines the temporal changes in search
results over a period of time within a single engine and between different engines. In
particular, we concentrate on the results of two of the largest search engines, Google and
AlltheWeb, using the three measures described below.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <sec id="sec-2-1">
        <title>Data Collection</title>
        <p>The data for this study was collected during two time periods, each approximately three
weeks long: the first during October 2003 and the second during January 2004. The data
collection for the first period was a course assignment at Birkbeck, University of London. Each
student was required to choose a query from a list of ten queries and also to choose an
additional query of his/her own liking. These two queries were to be submitted to Google
(google.com) and AlltheWeb (alltheweb.com) twice a day (morning and evening) during a
period of three weeks. The students were to record the ranked list of the top ten retrieved
URLs for each search point. Overall, 34 different queries were tracked by twenty-seven
students (some of the queries were tracked by more than one student). The set of all queries
that were processed, with the numbering assigned to them, appears in Table 1. For the first
period, queries q01-q05 were analyzed.</p>
        <p>
          The process was repeated at the beginning January 2004. We picked 10 queries from the list
of 34 queries. This time we queried Google.com, Google.co.uk, Google.co.il and Alltheweb
in order to assess the differences between the different Google sites as well. In this
experiment, at each data collection point all the searches were carried out within a 20-minute
timeframe. The reason for rerunning the searches was to study the effect of time on the top
ten results. Between the two parts of the experiment, Google most likely introduced a major
change into its ranking algorithm (called the “Florida Google Dance” -
          <xref ref-type="bibr" rid="ref12 ref18 ref19 ref20">(Sullivan, 2003b)</xref>
          ),
and we were interested in studying the effects of this change. For the second period, queries
q01-q10 were analyzed. The search terms were not submitted as phrases at either stage.
        </p>
        <p>Query ID and query (Table 1): q01 - Modern architecture; q02 - Web data mining; q03 - world rugby; q04 - Web personalization; q05 - Human Cloning; q06 - Internet security; q07 - Organic food; q08 - Snowboarding; q09 - dna evidence; q10 - internet advertising techniques.</p>
        <p>We used three measures in order to assess the changes over time in the rankings of the search
engines and to compare the results of Google and AlltheWeb. The first and simplest measure
is the size of the overlap between two top-ten lists.</p>
        <p>The second measure was Spearman’s rho. Spearman’s rho is applied to two rankings of the
same set, thus if the size of the set is N, all the rankings must be between 1 and N (ties are
allowed). Since the top ten results retrieved by two search engines on a given query, or
retrieved by the same engine on two consecutive days are not necessarily identical, the two
lists must be transformed before Spearman’s rho can be computed. First the non-overlapping
URLs were eliminated from both lists, and then the remaining lists were reranked, each URL
was given its relative rank in the set of remaining URLs in each list. After these
transformations Spearman’s rho could be computed:
r = 1 − 6Σdᵢ² / (n(n² − 1))
where dᵢ is the difference between the rankings of URLᵢ in the two lists. The value of r is
between -1 and 1, where -1 indicates that the two lists have opposite rankings, and 1 indicates
perfect correlation. Note that Spearman’s rho is based on the reranked lists; thus, for
example, if the original ranks of the URLs that appear in both lists (the overlapping pairs) are
(1,8), (2,9) and (3,10), the reranked pairs will be (1,1), (2,2) and (3,3) and the value of
Spearman’s rho will be 1 (perfect correlation).</p>
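The eliminate-and-rerank procedure described above can be sketched in Python (a hypothetical helper for illustration, not the authors' code); the example reproduces the (1,8), (2,9), (3,10) case from the text:

```python
def rerank_spearman(list_a, list_b):
    """Spearman's rho on the overlapping URLs of two top-10 lists.

    Non-overlapping URLs are dropped and the survivors are re-ranked
    1..n by their relative order within each list, as described above.
    """
    common = [u for u in list_a if u in list_b]   # overlap, in list_a order
    n = len(common)
    if n < 2:
        return None                               # rho is undefined for n < 2
    # relative rank of each overlapping URL within each original list
    rank_a = {u: r for r, u in enumerate(common, start=1)}
    order_b = sorted(common, key=list_b.index)
    rank_b = {u: r for r, u in enumerate(order_b, start=1)}
    d2 = sum((rank_a[u] - rank_b[u]) ** 2 for u in common)
    return 1 - 6 * d2 / (n * (n * n - 1))

# u1, u2, u3 ranked 1, 2, 3 by one engine and 8, 9, 10 by the other:
a = ["u1", "u2", "u3"]
b = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "u1", "u2", "u3"]
print(rerank_spearman(a, b))  # → 1.0 (perfect correlation after reranking)
```

After reranking, the (1,8), (2,9), (3,10) pairs become (1,1), (2,2), (3,3), so all dᵢ vanish and rho is 1, exactly as the text notes.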
        <p>
          The third measure utilized by us was one of the metrics introduced by
          <xref ref-type="bibr" rid="ref3">Fagin et al. (2003)</xref>
          . It is
relatively easy to compare two rankings of the same list of items – for this well-known
statistical measures such as Kendall’s tau or Spearman’s rho can be easily utilized. The
problem arises when the two search engines that are being compared rank non-identical sets
of documents. To cover this case (which is the usual case when comparing top-k lists created
by different search engines),
          <xref ref-type="bibr" rid="ref3">Fagin et al. (2003)</xref>
          extended the previously mentioned metrics.
Here we discuss only the extension of Spearman’s footrule (a variant of Spearman’s rho
which, unlike Spearman’s rho, is a metric); the extensions of Kendall’s tau are shown in
the paper to be equivalent to the extension of Spearman’s footrule. A major point of their
method was to develop measures that are either metrics or “near” metrics. Spearman’s
footrule is the L1 distance between two permutations (rankings on identical sets
can be viewed as permutations): F(σ₁, σ₂) = Σᵢ |σ₁(i) − σ₂(i)|. This metric is extended to the
case where the two lists are not identical: documents appearing in one of the lists but not in
the other are assigned an arbitrary placement (larger than the length of the list) in the
second list; when comparing lists of length k this placement can be k+1 for all the
documents not appearing in the list. The rationale for this extension is that the ranking of
those documents must be k+1 or higher – Fagin et al. do not take into account the possibility
that those documents are not indexed at all by the other search engine. The extended metric
becomes:
        </p>
        <p>F^(k+1)(τ₁, τ₂) = 2(k − z)(k + 1) + Σ_{i∈Z} |τ₁(i) − τ₂(i)| − Σ_{i∈S} τ₁(i) − Σ_{i∈T} τ₂(i)
where Z is the set of overlapping documents, z is the size of Z, S is the set of documents
that appear only in the first list and T is the set of documents that appear only in the second list. A
problem with the measures proposed by Fagin et al. is that when the two lists have little in
common, the non-common documents have a major effect on the measure. Our experiments
show that usually the overlap between the top ten results of two search engines for an
identical query is very small, and so the non-overlapping elements have a major effect.
F^(k+1) was normalized by Fagin et al. so that the values lie between 0 and 1. For k = 10 the
normalization factor is 110. Since F^(k+1) is a distance measure, the smaller the value, the more
similar the two lists; for Spearman’s rho, by contrast, the more similar the two lists are, the
nearer the value of the measure is to 1. In order to be able to compare the
two measures, we computed</p>
        <p>G^(k+1) = 1 − F^(k+1) / max F^(k+1)</p>
        <p>which we refer to as the G metric.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Data analysis</title>
        <p>For a given search engine and a given query we computed these measures on the results for
consecutive data collection points. When comparing two search engines we computed the
measures on the top ten results retrieved by both engines on the given data collection point.
The two periods were compared on five queries: we calculated the overlap between the
two periods and assessed the changes in the rankings of the overlapping elements based on
their average rankings.</p>
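The per-query analysis over consecutive data collection points amounts to a simple pairwise loop; a minimal sketch (hypothetical data layout, not the authors' scripts):

```python
def consecutive_overlaps(snapshots):
    """Given a list of top-10 URL lists (one per data collection point),
    return the overlap size between each pair of consecutive points."""
    return [
        len(set(prev) & set(curr))
        for prev, curr in zip(snapshots, snapshots[1:])
    ]

# three capture points for one query on one engine (toy data)
snaps = [["a", "b", "c"], ["a", "b", "d"], ["d", "e", "f"]]
print(consecutive_overlaps(snaps))  # → [2, 1]
```

The same loop structure applies to the Spearman and G computations: each measure is evaluated on a (previous, current) pair of top-ten lists.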
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <sec id="sec-3-1">
        <title>A Single Engine over Time</title>
        <p>AlltheWeb was very stable during both phases on all queries, as can be seen in Table 2. There
were almost no changes either in the set of URLs retrieved or in the relative placement of
these URLs in the top ten results. Some of the queries were monitored by several students,
thus the number of comparisons (comparing the results of consecutive data collection
points) was high. For each query we present the total number of URLs identified during the
period, and the average and minimum number of URLs that were retrieved at both of two
consecutive data collection points (overlap). The maximum overlap was 10 for each of the
queries, and an overlap of 10 was rather frequent; thus we computed the percentage of the
comparisons where the set of URLs was not identical at the two points being
compared (% of points with overlap less than 10). In addition, the table displays the percentage
of comparisons where the relative ranking of the overlapping URLs changed, and the minimal
values of Spearman’s rho and of G (the maximal values were 1 in all cases). Finally, in order
to assess the changes in the top-ten URLs over a longer period of time, we also present the
number of URLs that were retrieved at both the first and the last data collection points.
When considering the data for Google we see somewhat larger variability, but still the
changes between two consecutive data points are rather small. Note that for query
3 (world rugby), there were frequent changes in the placement of the top ten URLs.</p>
        <sec id="sec-3-1-1">
          <title>Google - Web Personalization - First Period</title>
          <p>[Figure: rankings of the top ten URLs for this query over the data capture points of the first period.]</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Second Period</title>
          <p>Similar analysis was carried out for the queries during the second period. The results appear
in Tables 4 and 5. Also during the second period the results and the rankings of AlltheWeb
were highly stable. Google.com exhibited considerable variability, even though the average
overlap was above 9 for all ten queries. Unlike AlltheWeb, quite often the relative placements
of the URLs changed.</p>
          <p>Perhaps the most interesting case for Google.com was query 10 (internet advertising
techniques), where all except two of the previous hits were replaced by completely new ones
(and the relative rankings of the two remaining URLs were swapped), and from this point on
the search engine presented this new set of results. This was not accidental: the same behavior
was observed on Google.co.uk and Google.co.il as well. We do not display the results for
Google.co.uk and Google.co.il here, since the descriptive statistics are very similar, even
though there are slight differences between the result sets. We shall discuss this point more
extensively when we compare the results of the different engines.</p>
          <p>Comparing Two Engines</p>
          <p>At the time of the data collection the two search engines reportedly indexed approximately
the same number of documents (approximately 3 billion). In spite of this, the
results show that the overlap between the top ten results is extremely small (see Tables 6 and
7). The small positive and the negative values of Spearman’s rho indicate that the relative
rankings of the overlapping elements are considerably different – thus even for those URLs
that are considered highly relevant for the given topic by both search engines, the agreement
on the relative importance of these documents is rather low.</p>
          <p>
There are two possible reasons why a given URL does not appear in the top ten results of a
search engine: either it is not indexed by the search engine or the engine ranks it after the first
ten results. We checked whether the URLs identified by the two search engines during the
second period are indexed by the search engine
            <xref ref-type="bibr" rid="ref1 ref6">(we ran this check in February 2004)</xref>
            . We
defined three cases: the URL was in the top-ten list of the engine at some time during the period
(called “top-ten”); it was not in the top ten, but is indexed by the search engine (“indexed”);
or it is not indexed at all (“not indexed”). The results for queries 1-5 appear in Table 8. The
results for these five queries show that both engines index most of the URLs located (between
67.6% and 96.6% of the URLs – top-ten and indexed combined), thus it seems that the
ranking algorithms of the two search engines are highly dissimilar.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Comparing Two Periods</title>
        <p>The second period of data collection took place about three months after the first one. We
tried to assess the changes in the top ten lists of the two search engines. The findings are
summarized in Table 11. Here we see again that AlltheWeb is less dynamic than Google,
except for query 4 (web personalization), where considerable changes were recorded for
AlltheWeb as well.</p>
        <p>[Table 11: for each query, the number of URLs identified over the two periods, the overlap between the periods, the number of URLs missing from the second set, and the minimum and maximum change in average ranking, for AlltheWeb and for Google.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Conclusions</title>
      <p>In this paper, we computed a number of measures in order to assess the changes that occur
over time to the rankings of the top ten results on a number of queries for two search engines.
We used several measures, since none of them is satisfactory as a standalone
measure for such assessment. Overlap does not assess rankings at all, while Spearman’s rho
ignores the non-overlapping elements and takes into account relative placement only.
Moreover, Fagin’s measure gives too much weight to the non-overlapping elements. The
three measures together provide a better picture than any of these measures alone. Since none
of these measures are completely satisfactory, we recommend experimenting with additional
measures in the future.</p>
      <p>The results indicate that the top ten results usually change gradually. Abrupt changes were
observed only very occasionally. Overall, AlltheWeb seems to be much less dynamic than
Google. The ranking algorithms of the two search engines seem to be highly dissimilar: even
though both engines index most of the URLs that appeared in the top ten lists, the differences
in the top ten lists are large (the overlap is small and the correlations between the rankings of
the overlapping elements are usually small, sometimes even negative). One reason for Google
being more dynamic may be that its search indexes are unsynchronised while they are
being updated, together with the non-deterministic nature of query processing due to its
distributed architecture.</p>
      <p>An additional area for further research, along the lines of the research carried out by Vaughan
(to appear), is comparing the rankings provided by the search engines with human judgments
placed on the value of the retrieved documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>AlltheWeb</surname>
          </string-name>
          (
          <year>2004</year>
          ).
          <source>Retrieved February 18</source>
          ,
          <year>2004</year>
          from http://www.alltheweb.com
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Automatic evaluation of World Wide Web Search Services</article-title>
          .
          <source>In Proceedings of the 25th Annual International ACM SIGIR Conference</source>
          ,
          <volume>421</volume>
          -
          <fpage>422</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Fagin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sivakumar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Comparing top k lists</article-title>
          .
          <source>SIAM Journal on Discrete Mathematics</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ),
          <fpage>134</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Google.</surname>
          </string-name>
          (
          <year>2003a</year>
          ).
          <article-title>Google information for Webmasters</article-title>
          .
          <source>Retrieved February 18</source>
          ,
          <year>2004</year>
          , from http://www.google.com/webmasters/seo.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Google.</surname>
          </string-name>
          (
          <year>2003b</year>
          ).
          <article-title>Google information for Webmasters</article-title>
          .
          <source>Retrieved February 18</source>
          ,
          <year>2004</year>
          , from http://www.google.com/webmasters/4.html
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Google.</surname>
          </string-name>
          (
          <year>2004</year>
          ) Retrieved February 18,
          <year>2004</year>
          from http://www.google.com
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hawking</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craswell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bailey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Measuring search engine quality</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>4</volume>
          ,
          <fpage>33</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Hawking</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craswell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thistlewaite</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Results and challenges in Web search evaluation</article-title>
          .
          <source>In Proceedings of the 8th International World Wide Web Conference</source>
          , May
          <year>1999</year>
          , Computer Networks,
          <volume>31</volume>
          (
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          ),
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          , Retrieved February 18,
          <year>2004</year>
          , from http://www8.org/w8-papers/2c-search-discover/results/results.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Accessibility of information on the Web</article-title>
          .
          <source>Nature</source>
          ,
          <volume>400</volume>
          ,
          <fpage>107</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          Nielsen/NetRatings (
          <year>2003</year>
          ).
          <article-title>NetView usage metrics</article-title>
          .
          Retrieved February 18
          ,
          <year>2004</year>
          , from http://www.netratings.com/news.jsp?section=dat_to
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Price</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Google ups total page count</article-title>
          .
          In
          <source>Resourceshelf</source>
          . Retrieved February 18,
          <year>2004</year>
          , from http://www.resourceshelf.com/archives/2004_02_01_resourceshelf_archive.html#107702946623981034
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Quint</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>OCLC Project Opens WorldCat Records to Google</article-title>
          . In
          <source>Information Today</source>
          . Retrieved February 18
          ,
          <year>2004</year>
          , from http://www.infotoday.com/newsbreaks/nb031027-2.shtml
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Silverstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henzinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marais</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Moricz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Analysis of a very large Web search engine query log</article-title>
          .
          <source>ACM SIGIR Forum</source>
          ,
          <volume>33</volume>
          (
          <issue>1</issue>
          ).
          Retrieved February 18
          ,
          <year>2004</year>
          from http://www.acm.org/sigir/forum/F99/Silverstein.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kaszkiel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>A case study in Web search using TREC algorithms</article-title>
          .
          <source>In Proceedings of the 10th International World Wide Web Conference</source>
          , May
          <year>2001</year>
          ,
          <fpage>708</fpage>
          -
          <lpage>716</lpage>
          . Retrieved February 18,
          <year>2004</year>
          from http://www10.org/cdrom/papers/pdf/p317.pdf
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Spink</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozmutlu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozmutlu</surname>
            ,
            <given-names>H. C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>U.S. versus European Web searching trends</article-title>
          .
          <source>SIGIR Forum</source>
          , Fall 2002.
          Retrieved February 18
          ,
          <year>2004</year>
          from http://www.acm.org/sigir/forum/F2002/spink.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicholas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cahan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Ranking retrieval systems without relevance judgments</article-title>
          .
          <source>In Proceedings of the 24th annual international ACM SIGIR conference</source>
          ,
          <fpage>66</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>L. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X. Y.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Evaluation of Web-based search engines from the end-user's perspective: A pilot study</article-title>
          .
          <source>In Proceedings of the ASIS Annual Meeting</source>
          ,
          <volume>35</volume>
          ,
          <fpage>348</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2003a</year>
          ).
          <article-title>Buying your way in: Search engine advertising chart</article-title>
          .
          Retrieved February 18
          ,
          <year>2004</year>
          , from http://www.searchenginewatch.com/webmasters/article.php/2167941
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2003b</year>
          ).
          <article-title>Florida Google dance resources</article-title>
          .
          Retrieved February 18
          ,
          <year>2004</year>
          from http://www.searchenginewatch.com/searchday/article.php/3285661
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sherman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>4th Annual Search Engine Watch 2003 Awards</article-title>
          . Retrieved February 18
          ,
          <year>2004</year>
          , from http://www.searchenginewatch.com/awards/article.php/3309841
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Vaughan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (to appear).
          <article-title>New measurements for search engine evaluation proposed and tested</article-title>
          . To appear in
          <source>Information Processing &amp; Management</source>
          . doi:10.1016/S0306-4573(03)00043-8
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Variations in relevance judgments and the measurement of retrieval effectiveness</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          ,
          <fpage>697</fpage>
          -
          <lpage>716</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>