Position Paper: A Study of Web Search Engine Bias and its Assessment

Ing-Xiang Chen, Cheng-Zen Yang
Dept. of Computer Sci. and Eng., Yuan Ze University
135 Yuan-Tung Road, Chungli, Taiwan, 320, ROC
sean@syslab.cse.yzu.edu.tw, czyang@syslab.cse.yzu.edu.tw

ABSTRACT
Search engine bias has drawn serious attention in recent years. Several pioneering studies have reported that bias perceivably exists even with respect to the URLs in the search results. The potential bias in the content of the search results, however, has not been comprehensively studied. In this paper, we propose a two-dimensional approach that assesses both the indexical bias and the content bias in search results. Statistical analyses are further performed to show the significance of the bias assessment. The results show that content bias and indexical bias are both influential in bias assessment, and that they complement each other to provide a panoramic view in the two-dimensional representation.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance Evaluation

General Terms
Measurement

Keywords
search engine bias, indexical bias, content bias, information quality, automatic assessment

1. INTRODUCTION
In recent years, an increasingly huge amount of information has been published and pervasively communicated over the World Wide Web (WWW). Web search engines have accordingly become the most important gateway to access the WWW, and even an indispensable part of today's information society. According to [3][7], most users get used to a few particular search interfaces and thus mainly rely on these Web search engines to find information. Unfortunately, due to limitations of current search technology, different operating strategies, or even political and cultural factors, Web search engines have their own preferences and prejudices toward Web information [10][11][12]. As a result, the information sources and content types indexed by different Web search engines are unbalanced. In past studies [10][11][12], such unbalanced item selection in Web search engines is termed search engine bias.
In our observations, search engine bias can arise from three aspects. The first source is the diverse operating policies and business strategies adopted by each search engine company. As mentioned in [1], this type of bias is more insidious than advertising. A recent piece of news demonstrates this type of bias: Google in China distorted the reality of "Falun Gong" by removing the searched results. In this example, Google agreed to comply with filtering its results shown in China to guard its business profits [4]. Second, the limitations of crawling, indexing, and ranking techniques may result in search engine bias. An interesting example is that the phrase "Second Superpower" was once Googlewashed in only six weeks because webloggers spun an alternative meaning and produced sufficient PageRank to flood Google [9][13][17]. Third, the information provided by search engines may be biased in some countries because of opposed political standpoints, diverse cultural backgrounds, and different social customs. The blocking and filtering of Google in China [20][21] and the information filtering on Google in Saudi Arabia, Germany, and France are cases in which politics biases the Web search engine [19][20].

As a search engine is an essential tool in the current cyber society, people may be influenced by search engine bias without awareness when cognizing the information a search engine provides. For example, some people may never get information about certain popular brands when inquiring about the term "home refrigerators" via a search engine [11]. From the viewpoint of the entire information society, the marginalization of certain information limits the Web space and confines its functionality to a limited scope [6]. Consequently, many search engine users are unknowingly deprived of the right to fairly browse and access the WWW.

Recently, the issue of search engine bias has been noticed, and several studies have investigated its measurement. In [10][11][12], an effective method is proposed to measure search engine bias by comparing the URL of each indexed item retrieved by a search engine with those retrieved by a pool of search engines. The result of such an assessment is termed the indexical bias. Although assessing indexed URLs is an efficient and effective approach to predicting search engine bias, it provides only a partial view of search engine bias. In our observations, two search engines with the same degree of indexical bias may return different page content and thus reveal semantic differences. In such a case, the potential difference of overweighing specific content may result in significant content bias that cannot be detected by simply assessing the indexed URLs. In addition, if a search result contains redirection links to other URLs that are absent from the search result, those absent URLs can still be accessed via the redirection links. In this case, the search engine only reports the mediate URLs, and it may thus appear to have a poor indexical bias performance when that is not true. Analyzing the page content, however, helps reveal a panoramic view of search engine bias.

Copyright is held by IW3C2.
WWW 2006, May 22–26, 2006, Edinburgh, UK.
In this paper, we examine real bias events in the current Web environment and study the influences of search engine bias upon the information society. We assert that assessing the content bias, through the content majorities and minorities existing in Web search engines, as the other dimension can help evaluate search engine bias more thoroughly. Therefore, a two-dimensional assessment mechanism is proposed to assess search engine bias. In the experiments, the two-dimensional bias distribution and the statistical analyses sufficiently expound the bias performance of each search engine.

2. LITERATURE REVIEW
Recently, some pioneering studies have discussed search engine bias by measuring the URLs retrieved by Web search engines. In 2002, Mowshowitz and Kawaguchi first proposed measuring the indexed URLs of a search engine to determine search engine bias, since they asserted that a Web search engine is a retrieval system containing a set of items that represent messages [10][11][12]. In their method, a vector-based statistical analysis measures search engine bias by selecting a pool of Web search engines as an implicit norm and comparing the occurrence frequencies of the URLs retrieved by each search engine with those in the norm. Bias is thus assessed by calculating the deviation of the URLs retrieved by a Web search engine from those of the norm.
In [11], a simple example illustrates the assessment of the indexical bias of three search engines with two queries and the top ten results of each query. A total of 60 URL entries were retrieved and analyzed, and 44 distinct URLs with their occurrence frequencies were transformed into the basis vector. The similarity between two basis vectors was then calculated with a cosine metric. The bias result is obtained by subtracting the cosine value from one, giving a value between 0 and 1 that represents the degree of bias.
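The URL-based assessment described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it pools the URL lists of the norm engines into a frequency vector and reports one minus the cosine similarity to that norm, so a result list identical to the pool's yields a bias near 0 and a disjoint one yields 1.

```python
from collections import Counter
from math import sqrt

def indexical_bias(engine_urls, pool_results):
    # engine_urls: URLs returned by the engine under assessment.
    # pool_results: one URL list per engine in the norm pool.
    # The norm vector counts how often each URL occurs across the pool;
    # bias is one minus the cosine similarity to that norm.
    norm = Counter(u for result in pool_results for u in result)
    engine = Counter(engine_urls)
    dot = sum(engine[u] * norm[u] for u in engine)
    mag = sqrt(sum(v * v for v in engine.values())) * \
          sqrt(sum(v * v for v in norm.values()))
    return 1.0 - (dot / mag if mag else 0.0)
```

With two pool engines returning ["a", "b"] and ["a", "c"], an engine returning ["a", "b"] overlaps heavily with the norm and scores a low bias, while one returning only unseen URLs scores 1.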
Vaughan and Thelwall further used such a URL-based approach to investigate the causes of search engine coverage bias in different countries [18]. They asserted that the language of a site does not affect the search engine coverage bias, but the visibility of the indexed sites does. If a Web search engine indexes many highly visible sites, i.e., sites linked by many other Web sites, the search engine has a high coverage ratio. Since they calculated the coverage ratio based on the number of URLs retrieved by a search engine, the assessment still cannot clearly show how much information is covered. Furthermore, the experimental sites were retrieved from only three search engines, with domain names from four countries and Chinese and English pages, so such few samples may not guarantee a universal truth for other countries.

In 2003, Chen and Yang used an adaptive vector model to explore the effects of content bias [2]. Since their study targeted the Web contents retrieved by each search engine, the content bias was normalized to present the bias degree. Although the assessment appropriately reveals content bias, the study ignores the normalization influences of contents among the retrieved items. Consequently, the content bias may be over-weighted by some rich-context items. Furthermore, the study cannot determine whether the results are statistically significant.

From the past literature on search engine bias assessment, we argue that without considering the Web content, a bias assessment tells users only part of the reality. Besides, how to appropriately assess search engine bias from both views needs advanced study. In this paper, we propose an improved assessment method for content bias and further present a two-dimensional strategy for bias assessment.
3. THE BIAS ASSESSMENT METHOD
To assess the bias of a search engine, a norm must first be generated. In traditional content analysis studies, the norm is usually obtained through careful examinations by subject experts [5]. However, manually examining Web page content to obtain the norm is impossible because the Web space changes rapidly and the number of Web pages is extremely large. Therefore, an implicit norm is generally used in current studies [10][11][12]. The implicit norm is defined by a collection of the search results of several representative search engines. To avoid unfairly favoring certain search engines, a search engine is not considered if it uses another search engine's kernel without any refinement, or if its index is not comparably large.

Since assessing the retrieved URLs of search engines cannot represent the whole view of search engine bias, the assessment scheme needs to consider other expressions to fill the gap. In the current cyber-society, information is delivered to people through various Web pages. Although these Web pages are presented with photos, animations, and various multimedia technologies, the main content still consists of hypertextual information composed of different HTML tags [1]. Therefore, in our approach, the hypertextual content is assessed to reveal another bias aspect. To appropriately represent Web contents, we use a weighted vector approach to represent Web pages and compute the content bias. The following subsections elaborate the generation of an implicit bias norm, the two-dimensional assessment scheme, and the weighted vector approach for content bias assessment.

3.1 Bias Norm Generation
Following the definition of bias in [10][11][12], the implicit norm used in our study is generated from the vector collection of a set of comparable search engines to approximate the ideal. The main reason for this approximation is that changes in the Web space are extremely frequent and divergent, so traditional methods of manually generating norms by subject experts are time-consuming and impractical. On the other hand, search engines can be implicitly viewed as experts in reporting search results. The norms can therefore be generated by selecting some representative search engines and synthesizing their search results. However, the representative search engines should be selected cautiously to avoid generating biased norms that favor specific search engines.

The selection of representative search engines is based on the following criteria:
1. The search engines are generally designed for different subject areas. Search engines for special domains are not considered. In addition, search engines designed for specific users, e.g., localized search engines, are also disregarded.
2. The search engines are comparable to each other and to the search engines to be assessed. Search engines are excluded if the number of indexed pages is not large enough.
3. Search engines are not considered if they use another search engine's core without any refinement. For example, Lycos started to use the crawling core provided by FAST in 1999; if both were selected to form the norm, their bias values would be unfairly lowered. However, if a search engine uses another's engine kernel but incorporates individual searching rules, it is still under consideration, for it may provide different views.
4. Metasearch engines are under consideration if they have their own processing rules. We assume that these rules are not prejudiced in favor of certain search engines. In fact, if prejudices exist, they will be revealed after the assessment, and the biased metasearch engine will be excluded.
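Under the assumption that the norm is synthesized by accumulating the scores of each representative engine's vocabulary vector (the exact synthesis step is not fixed here; Section 4 mentions a weighted TF-IDF selection), the norm generation might be sketched as:

```python
def build_norm(rvvs):
    # rvvs: one dict per representative engine, mapping feature word -> score S.
    # Assumption: the norm entry for a word is its accumulated score over
    # all representative engines, so widely reported words dominate the norm.
    norm = {}
    for rvv in rvvs:
        for word, s in rvv.items():
            norm[word] = norm.get(word, 0.0) + s
    return norm
```

A word reported by many representative engines thus receives a larger norm score, which matches the intuition that the pool's consensus approximates the ideal.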
3.2 The Two-dimensional Assessment Scheme
Since both indexical bias and content bias are important to represent the bias performance of a search engine, we assess search engine bias from both aspects and present it in a two-dimensional view. Figure 1 depicts the two-dimensional assessment process. For each query string, the corresponding query results are retrieved from the Web search engines. The URL locator then parses the search results and fetches the Web pages. The document parser extracts the feature words and computes the content vectors; stop words are filtered out in this stage. Finally, the feature information is stored in the database for the subsequent bias measurement.

Figure 1: The assessment process of measuring search engine bias. (Query results from the search engines flow through the URL locator and document parser into vocabulary entries; the bias assessor produces the bias report.)

The bias assessor collects two kinds of information: the URL indexes and the representative vocabulary vectors (RVV) of the corresponding Web contents. The URL indexes are used to compute the indexical bias, and the RVV vectors are used to compute the content bias. After the assessment, the assessor generates bias reports.
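To make the flow of Figure 1 concrete, the following is a toy end-to-end sketch. The names (`parse_document`, `assess`), the tiny stop-word list, and the pooling of the norm from the assessed engines themselves are all illustrative assumptions; the real system fetches live pages and stores features in a database, which is omitted here.

```python
from collections import Counter
from math import sqrt

STOP_WORDS = {"the", "a", "of", "and"}  # tiny illustrative stop list

def cosine(u, v):
    # Cosine similarity of two sparse vectors given as Counters/dicts.
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    mag = sqrt(sum(x * x for x in u.values())) * \
          sqrt(sum(x * x for x in v.values()))
    return dot / mag if mag else 0.0

def parse_document(text):
    # Document-parser stage: extract feature words, dropping stop words.
    return Counter(w.lower() for w in text.split()
                   if w.lower() not in STOP_WORDS)

def assess(search_results):
    # search_results: {engine: [(url, page_text), ...]} for one query.
    # Returns {engine: (indexical_bias, content_bias)}; the norm is pooled
    # from all assessed engines, a simplification of the paper's norm.
    url_norm = Counter(u for pages in search_results.values() for u, _ in pages)
    rvvs = {e: sum((parse_document(t) for _, t in pages), Counter())
            for e, pages in search_results.items()}
    content_norm = sum(rvvs.values(), Counter())
    report = {}
    for e, pages in search_results.items():
        urls = Counter(u for u, _ in pages)
        report[e] = (1 - cosine(urls, url_norm),
                     1 - cosine(rvvs[e], content_norm))
    return report
```

Two engines returning identical pages score near zero on both dimensions, while divergent URL lists or divergent page content move an engine along the indexical or content axis respectively.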
3.3 The Weighted Vector Model
Web contents are mainly composed of different HTML tags that represent their own specific meanings in Web pages. For example, a title tag carries the name of a Web page, which is shown in the browser window caption bar. Different headings represent differing importance in a Web page: HTML has six levels of headings, where H1 is the most important, H2 slightly less important, and so on down to H6, the least important [14]. In content bias assessment, how a Web document is represented plays an important role in reflecting the reality of the assessment.

Here we adopt a weighted vector approach to measure content bias [8]. It is based on a vector space model [15] but adapted to emphasize the feature information in Web pages. Because the features in Web pages carry such differing importance, the weighted vector is appropriate to represent and assess the contents of Web documents.

Since the search results are query-specific, query strings on different subjects are used to obtain the corresponding representative vocabulary vectors (RVV) of the search engines. Each RVV represents the search content of a search engine and is determined by examining the first m URL entries in the search result list. Every word in the URL entries is parsed to filter out stop words and to extract feature words. The RVV consists of a series of vocabulary entries VEi with eight fields: the i-th feature word, its overall frequency f, its document frequency d, the number of documents n, its title frequency t, its H1 frequency H, its H2 frequency h, and its score S. The score S is determined as follows:

S = (f + t·wt + H·wH + h·wh) × log(n/d)    (1)

where wt, wH, and wh are the respective tag weights. The scores are used in the similarity computations.

After all RVV vectors are computed, empty entries are inserted where necessary to make the entries in each RVV correspond exactly to the entries in the norm for the similarity computation. The cosine function is then used to compute the similarity between the RVVi of the i-th search engine and the norm N:

Sim(RVVi, N) = cos(RVVi, N) = Σj SRVVi,j · SN,j / ( sqrt(Σj S²RVVi,j) · sqrt(Σj S²N,j) )    (2)

where SRVVi,j is the j-th entry score of RVVi and SN,j is the j-th entry score of the norm. Finally, the content bias value CB(RVVi, N) is defined as

CB(RVVi, N) = 1 − Sim(RVVi, N)    (3)
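Equations (1)–(3) transcribe directly into code. The tag weights wt, wH, and wh below are illustrative defaults (their values are not specified in this section), and the natural logarithm is assumed for log(n/d):

```python
from math import log, sqrt

def score(f, t, H, h, n, d, wt=2.0, wH=1.5, wh=1.2):
    # Equation (1): S = (f + t*wt + H*wH + h*wh) * log(n/d).
    # wt, wH, wh are assumed example tag weights; natural log is assumed.
    return (f + t * wt + H * wH + h * wh) * log(n / d)

def content_bias(rvv_scores, norm_scores):
    # Equations (2)-(3): CB = 1 - cosine similarity of the aligned score
    # vectors (empty entries already inserted, so both have equal length).
    dot = sum(a * b for a, b in zip(rvv_scores, norm_scores))
    mag = sqrt(sum(a * a for a in rvv_scores)) * \
          sqrt(sum(b * b for b in norm_scores))
    return 1.0 - (dot / mag if mag else 0.0)
```

A word appearing once in one of ten documents, with no title or heading occurrences, scores log(10); an RVV identical to the norm yields a content bias of 0, and an orthogonal one yields 1.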
4. EXPERIMENTS AND DISCUSSIONS
We have conducted experiments to study the bias of currently famous search engines with the proposed two-dimensional assessment scheme. Ten general search engines are included in the assessment: About, AltaVista, Excite, Google, Inktomi, Lycos, MSN, Overture, Teoma, and Yahoo. To compute the RVV vectors, the top m = 10 URLs of the search results are processed, because it has been shown that the first result screen is requested for 85% of queries [16], and it usually shows the top ten results. To generate the norm, we used a weighted term-frequency-inverse-document-frequency (TF-IDF) strategy to select the feature information from the ten search engines. The size of N is thus adaptive to different queries to appropriately represent the norm.

The indexical bias is assessed according to the approach proposed by Mowshowitz and Kawaguchi [10][11][12], and the content bias according to the proposed weighted vector model. In the experiments, queries on different subjects were tested. Two of the experimental results are