=Paper=
{{Paper
|id=Vol-190/paper-10
|storemode=property
|title=A Study of Web Search Engine Bias and its Assessment
|pdfUrl=https://ceur-ws.org/Vol-190/paper10.pdf
|volume=Vol-190
|authors=Ing-Xiang Chen and Cheng-Zen Yang
}}
==A Study of Web Search Engine Bias and its Assessment==
Position Paper: A Study of Web Search Engine Bias and its Assessment

Ing-Xiang Chen, Dept. of Computer Sci. and Eng., Yuan Ze University, 135 Yuan-Tung Road, Chungli, Taiwan, 320, ROC, sean@syslab.cse.yzu.edu.tw

Cheng-Zen Yang, Dept. of Computer Sci. and Eng., Yuan Ze University, 135 Yuan-Tung Road, Chungli, Taiwan, 320, ROC, czyang@syslab.cse.yzu.edu.tw
===ABSTRACT===
Search engine bias has drawn serious attention in recent years. Several pioneering studies have reported that bias perceivably exists even with respect to the URLs in the search results. On the other hand, the potential bias with respect to the content of the search results has not been comprehensively studied. In this paper, we propose a two-dimensional approach to assess both the indexical bias and the content bias existing in the search results. Statistical analyses have further been performed to present the significance of bias assessment. The results show that content bias and indexical bias are both influential in bias assessment, and that they complement each other to provide a panoramic view in the two-dimensional representation.

===Categories and Subject Descriptors===
H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance Evaluation

===General Terms===
Measurement

===Keywords===
search engine bias, indexical bias, content bias, information quality, automatic assessment.

Copyright is held by IW3C2. WWW 2006, May 22–26, 2006, Edinburgh, UK.

===1. INTRODUCTION===
In recent years, an increasingly huge amount of information has been published and pervasively communicated over the World Wide Web (WWW). Web search engines have accordingly become the most important gateway to the WWW and an indispensable part of today’s information society. According to [3][7], most users get used to a few particular search interfaces, and thus mainly rely on these Web search engines to find information. Unfortunately, due to some limitations of current search technology, different considerations of operating strategies, or even some political or cultural factors, Web search engines have their own preferences and prejudices toward Web information [10][11][12]. As a result, the information sources and content types indexed by different Web search engines are unbalanced. In past studies [10][11][12], such unbalanced item selection in Web search engines is termed search engine bias.

In our observations, search engine bias can arise from three sources. The first is the diverse operating policies and business strategies adopted by each search engine company. As mentioned in [1], this type of bias is more insidious than advertising. A recent, widely reported news item illustrates it: Google in China distorts the reality of “Falun Gong” by removing search results. In this example, Google agreed to restrict what it shows in China to guard its business profits [4]. Second, the limitations of crawling, indexing, and ranking techniques may result in search engine bias. An interesting example is the phrase “Second Superpower”, which was Googlewashed in only six weeks because webloggers spun the alternative meaning to produce sufficient PageRank to flood Google [9][13][17]. Third, the information provided by search engines may be biased in some countries because of opposed political standpoints, diverse cultural backgrounds, and different social customs. The blocking and filtering of Google in China [20][21] and the information filtering on Google in Saudi Arabia, Germany, and France are cases in which politics biases the Web search engine [19][20].

As a search engine is an essential tool in the current cyber society, people are probably influenced by search engine bias without being aware of it when they take in the information provided by the search engine. For example, some people may never get information about certain popular brands when querying the term “home refrigerators” via a search engine [11]. From the viewpoint of the entire information society, the marginalization of certain information limits the Web space and confines its functionality to a limited scope [6]. Consequently, many search engine users are unknowingly deprived of the right to fairly browse and access the WWW.

Recently, the issue of search engine bias has been noticed, and several studies have investigated the measurement of search engine bias. In [10][11][12], an effective method is proposed to measure search engine bias by comparing the URL of each indexed item retrieved by a search engine with those retrieved by a pool of search engines. The result of such an assessment is termed the indexical bias. Although the assessment of indexed URLs is an efficient and effective approach to predict search engine bias, assessing the indexical bias provides only a partial view of search engine bias. In our observations, two search engines with the same degree of indexical bias may return different page content and reveal semantic differences. In such a case, the potential difference of overweighting specific content may result in significant content bias that cannot be presented by simply assessing the indexed URLs. In addition, if a search result contains redirection links to other URLs that are absent from the search result, these absent URLs can still be accessed via the redirection links. In this case, a search engine only reports the intermediate URLs, and the search
engine may thus appear to have a poor indexical bias performance although that is not true. However, analyzing the page content helps reveal a panoramic view of search engine bias.

In this paper, we examine real bias events in the current Web environment and study the influences of search engine bias upon the information society. We assert that assessing the content bias through the content majorities and minorities existing in Web search engines as the other dimension can help evaluate search engine bias more thoroughly. Therefore, a two-dimensional assessment mechanism is proposed to assess search engine bias. In the experiments, the two-dimensional bias distribution and the statistical analyses expound the bias performance of each search engine.

===2. LITERATURE REVIEW===
Recently, some pioneering studies have been conducted to discuss search engine bias by measuring the retrieved URLs of Web search engines. In 2002, Mowshowitz and Kawaguchi first proposed measuring the indexed URLs of a search engine to determine the search engine bias, since they asserted that a Web search engine is a retrieval system containing a set of items that represent messages [10][11][12]. In their method, a vector-based statistical analysis is used to measure search engine bias by selecting a pool of Web search engines as an implicit norm, and comparing the occurring frequencies of the URLs retrieved by each search engine in the norm. Bias is therefore assessed by calculating the deviation of the URLs retrieved by a Web search engine from those of the norm.

In [11], a simple example illustrates the assessment of the indexical bias of three search engines with two queries and the top ten results of each query. Thus, a total of 60 URL entries were retrieved and analyzed, and 44 distinct URLs with occurring frequencies were transformed into the basis vector. The similarity between the two basis vectors was then calculated by using a cosine metric. The bias value is obtained by subtracting the cosine value from one, which yields a result between 0 and 1 that represents the degree of bias.

Vaughan and Thelwall further used such a URL-based approach to investigate the causes of search engine coverage bias in different countries [18]. They asserted that it is not the language of a site that affects the search engine coverage bias but the visibility of the indexed sites. If a Web search engine has many highly visible sites, meaning Web sites that are linked by many other Web sites, the search engine has a high coverage ratio. Since they calculated the search engine coverage ratio based on the number of URLs retrieved by a search engine, the assessment still cannot clearly show how much information is covered. Furthermore, the experimental sites were retrieved only from three search engines with domain names from four countries with Chinese and English pages, and such few samples may not guarantee a universal truth in other countries.

In 2003, Chen and Yang used an adaptive vector model to explore the effects of content bias [2]. Since their study was targeted on the Web contents retrieved by each search engine, the content bias was normalized to present the bias degree. Although the assessment appropriately reveals content bias, the study ignores the normalization influences of contents among the retrieved items. Consequently, the content bias may be over-weighted by some content-rich items. Furthermore, the study cannot determine whether the results are statistically significant.

From the past literature on search engine bias assessment, we argue that without considering the Web content, the bias assessment tells users only part of the reality. Besides, how to appropriately assess search engine bias from both views needs further study. In this paper, we propose an improved assessment method for content bias and further present a two-dimensional strategy for bias assessment.

===3. THE BIAS ASSESSMENT METHOD===
To assess the bias of a search engine, a norm should first be generated. In traditional content analysis studies, the norm is usually obtained through careful examination by subject experts [5]. However, manually examining Web page content to get the norm is impossible because the Web space is rapidly changing and the number of Web pages is extremely large. Therefore, an implicit norm is generally used in current studies [10][11][12]. The implicit norm is defined by a collection of search results of several representative search engines. To avoid unfairly favoring certain search engines, a search engine is not considered if it uses another search engine's kernel without any refinement, or if its index is not comparably large.

Since assessing the retrieved URLs of search engines cannot represent the whole view of search engine bias, the assessment scheme needs to consider other expressions to make up for this lack. In the current cyber-society, information is delivered to people through various Web pages. Although these Web pages are presented with photos, animations, and various multimedia technologies, the main content still consists of hypertextual information that is composed of different HTML tags [1]. Therefore, in our approach, the hypertextual content is assessed to reveal another bias aspect.

To appropriately present Web contents, we use a weighted vector approach to represent Web pages and compute the content bias. The following subsections elaborate the generation of an implicit bias norm, a two-dimensional assessment scheme, and a weighted vector approach for content bias assessment.

====3.1 Bias Norm Generation====
Following the definition of bias in [10][11][12], the implicit norm used in our study is generated from the vector collection of a set of comparable search engines to approximate the ideal. The main reason for this approximation is that changes in the Web space are extremely frequent and divergent, and thus traditional methods of manually generating norms by subject experts are time-consuming and impractical. On the other hand, search engines can be implicitly viewed as experts in reporting search results. The norms can be generated by selecting some representative search engines and synthesizing their search results. However, the representative search engines should be selected cautiously to avoid generating biased norms that favor some specific search engines.

The selection of representative search engines is based on the following criteria:

1. The search engines are generally designed for different subject areas. Search engines for special domains are not considered. In addition, search engines designed for specific users, e.g. localized search engines, are also disregarded.

2. The search engines are comparable to each other and to the search engines to be assessed. Search engines are excluded if the number of indexed pages is not large enough.

3. Search engines will not be considered if they use other search
engine's core without any refinement. For example, Lycos started to use the crawling core provided by FAST in 1999. If both are selected to form the norms, their bias values are unfairly lower. However, if a search engine uses another engine's kernel but incorporates individual searching rules, it is still under consideration because it may provide different views.

4. Metasearch engines are under consideration if they have their own processing rules. We assume that these rules are not prejudiced in favor of certain search engines. In fact, if such prejudices exist, they will be revealed after the assessment, and the biased metasearch engine will be excluded.

====3.2 The Two-dimensional Assessment Scheme====
Since both indexical bias and content bias are important to represent the bias performance of a search engine, we assess search engine bias in a two-dimensional view. Figure 1 depicts the two-dimensional assessment process. For each query string, the corresponding query results are retrieved from Web search engines. Then the URL locator parses the search results and fetches the Web pages. The document parser extracts the feature words and computes the content vectors. Stop words are also filtered out in this stage. Finally, feature information is stored in the database for the following bias measurement.

[Figure 1: The assessment process of measuring search engine bias. Components shown: Query, Search Engines, Web Pages, URL Locator, Document Parser, Vocabulary Entries, Bias Assessor, Bias Report.]

The bias assessor collects two kinds of information: the URL indexes and the representative vocabulary vectors (RVV) for the corresponding Web contents. The URL indexes are used to compute the indexical bias, and the RVV vectors are used to compute the content bias. After the assessment, the assessor generates bias reports.

====3.3 The Weighted Vector Model====
Web contents are mainly composed of different HTML tags that each carry a specific meaning in Web pages. For example, a title tag represents the name of a Web page, which is shown in the browser window caption bar. Different headings represent differing importance in a Web page. In HTML there are six levels of headings. H1 is the most important; H2 is slightly less important, and so on down to H6, the least important [14]. In content bias assessment, how a Web document is represented plays an important role in reflecting the reality of the assessment.

Here we adopt a weighted vector approach to measure content bias [8]. It is based on a vector space model [15] but adapted to emphasize the feature information in Web pages. Because the features in title, H1, or H2 tags usually indicate important information and are used more often in Web documents, features in these tags are appropriately weighted to represent Web contents. Since the total number of Web documents can only be estimated by sampling or assumption, this model is more appropriate to represent and assess the contents of Web documents.

Since the search results are query-specific, query strings in different subjects are used to get the corresponding representative vocabulary vectors RVV for the search engines. Each RVV represents the search content of a search engine and is determined by examining the first m URL entries in the search result list. Every word in the URL entries is parsed to filter out stop words and to extract feature words. The RVV consists of a series of vocabulary entries VE<sub>i</sub> with eight fields: the i-th feature word, its overall frequency f, its document frequency d, the number of documents n, its title frequency t, its H1 frequency H, its H2 frequency h, and its score S. The score S is determined as follows:

:<math>S = (f + t \cdot w_t + H \cdot w_H + h \cdot w_h) \times \log\left(\frac{n}{d}\right)</math> (1)

where w<sub>t</sub>, w<sub>H</sub>, and w<sub>h</sub> are the respective tag weights. The scores are used in similarity computations.

After all RVV vectors are computed, necessary empty entries are inserted so that the entries in each RVV correspond exactly to the entries in the norm for similarity computation. Then the cosine function is used to compute the similarity between RVV<sub>i</sub> of the i-th search engine and the norm N:

:<math>Sim(RVV_i, N) = \cos(RVV_i, N) = \frac{\sum_j S_{RVV_i,j} \cdot S_{N,j}}{\sqrt{\sum_j S_{RVV_i,j}^2} \, \sqrt{\sum_j S_{N,j}^2}}</math> (2)

where S<sub>RVVi,j</sub> is the j-th entry score of RVV<sub>i</sub>, and S<sub>N,j</sub> is the j-th entry score of the norm. Finally, the content bias value CB(RVV<sub>i</sub>, N) is defined as

:<math>CB(RVV_i, N) = 1 - Sim(RVV_i, N)</math> (3)

===4. EXPERIMENTS AND DISCUSSIONS===
We have conducted experiments to study bias in currently well-known search engines with the proposed two-dimensional assessment scheme. Ten search engines are included in the assessment studies: About, AltaVista, Excite, Google, Inktomi, Lycos, MSN, Overture, Teoma, and Yahoo. To compute RVV vectors, the top m=10 URLs from the search results are processed because it has been shown that the first result screen is requested for 85% of the queries [16], and it usually shows the top ten results. To generate the norm, we used a weighted term-frequency-inverse-document-frequency (TF-IDF) strategy to select the feature information from the ten search engines. The size of N is thus adaptive to different queries to appropriately represent the norm.

The experiments measure the biases of ten general search engines. The indexical bias is assessed according to the approach proposed by Mowshowitz and Kawaguchi [10][11][12]. The content bias is assessed according to the proposed weighted vector model. In the experiments, queries from different subjects were tested. Two of the experimental results are reported and discussed here. The first is a summarization of ten hot queries. This study shows the average bias performance of Web search engines according to their content bias and indexical bias values. The second is a case study on the overwhelming redefinition power of search engines reported in [13]. In this
experiment, the two-dimensional assessment shows that most search engines report similar indexical and content bias rankings, except Overture.

====4.1 The Assessment Results of Hot Queries====
In this experiment, we randomly chose ten hot queries from Lycos 50 [22]. For each of them, we collected 100 Web pages from ten search engines. The queries are “Final Fantasy”, “Harry Potter”, “Iraq”, “Jennifer Lopez”, “Las Vegas”, “Lord of the Rings”, “NASCAR”, “SARS”, “Tattoos”, and “The Bible”. The assessment results of their indexical bias and content bias values are shown in Table 1 and Table 2.

[Figure 2: The two-dimensional analysis of the ten hot queries from Lycos 50. Horizontal axis: Content Bias (0.0 to 0.7); vertical axis: Indexical Bias (0.0 to 0.7); one point per search engine: About, AltaVista, Excite, Google, Inktomi, Lycos, MSN, Overture, Teoma, Yahoo!.]

In Figure 2, the average bias performance is further displayed in a two-dimensional diagram. In the figure, two additional dotted lines represent the respective statistical mean values of bias. The results show that Google has the lowest indexical and content bias values, which means that Google outperforms the others in bias performance. The best bias performance of Google indicates that both the sites and the contents it retrieved are the majority on the Web and may satisfy most user needs. From the average results, we found that most of the search engines show similar bias rankings in both indexical bias and content bias.

However, when we review the bias performance of Yahoo!, we can see that it has quite good content bias performance, ranked second best, but only a medium indexical bias ranking. Such inconsistent bias performance shows that Yahoo! can discover similar major contents from different Web sites. However, such differences cannot be revealed when users consider only the indexical bias as the panorama of search engine bias.

In our experiments, a one-way analysis of variance (ANOVA) was conducted to analyze the statistical significance of the bias performance among the search engines. The ANOVA analyses in Table 5 and Table 6 indicate that the content bias of Yahoo! is more statistically significant than its indexical bias.

In Table 3 and Table 4, the ANOVA results of the averaged indexical bias and content bias are presented to display the statistical significance between the experimental search engines. Both ANOVA results reveal statistical significance among the ten search engines over the hot query terms (p ≤ 0.05). The p-values in the tables measure the credibility of the null hypothesis. The null hypothesis here is that there is no significant difference between the search engines. If the p-value is less than or equal to the widely accepted value 0.05, the null hypothesis is rejected.

Since there is a significant difference among the search engines, we further analyze the variance across different hot query terms. Table 5 and Table 6 show the ANOVA results of indexical bias and content bias for each search engine over the ten hot query terms. Table 5 indicates that About, AltaVista, Google, Lycos, and Overture are significant, and Table 6 shows that About, Google, MSN, and Yahoo! are significant. From the ANOVA analyses, the original indexical bias of MSN and Yahoo! is less significant, but the content bias assessment can reveal the complementary information. The two-dimensional assessment scheme gives users a panoramic view of search engine bias.
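The ranking comparison discussed in this section can be retraced from the "Average" rows of Table 1 and Table 2. The following is a minimal sketch, not the authors' tooling: the helper names `rank` and `rank_table` are illustrative, and rank 1 is assumed to mean the lowest (best) average bias value.

```python
# Average bias values copied from the "Average" rows of Table 1 and Table 2.
ENGINES = ["About", "AltaVista", "Excite", "Google", "Inktomi",
           "Lycos", "MSN", "Overture", "Teoma", "Yahoo!"]
INDEXICAL_AVG = [0.5618, 0.3388, 0.5349, 0.2655, 0.3634,
                 0.3298, 0.4760, 0.6445, 0.4224, 0.3696]
CONTENT_AVG = [0.4927, 0.3547, 0.4494, 0.2671, 0.3530,
               0.3507, 0.4798, 0.4462, 0.3882, 0.2997]


def rank(values):
    """Return 1-based ranks; rank 1 is the lowest (best) bias value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_table():
    """Map each engine to its (indexical rank, content rank) pair."""
    indexical_rank = rank(INDEXICAL_AVG)
    content_rank = rank(CONTENT_AVG)
    return {engine: (indexical_rank[i], content_rank[i])
            for i, engine in enumerate(ENGINES)}


if __name__ == "__main__":
    for engine, (ir, cr) in rank_table().items():
        # A large gap between the two ranks marks an engine whose
        # indexical bias alone would give a misleading picture.
        print(f"{engine:10s} indexical rank {ir:2d}  content rank {cr:2d}  gap {ir - cr:+d}")
```

With these averages, Google is ranked first in both dimensions, while Yahoo! is second in content bias but fifth in indexical bias, which is the gap the text attributes to similar content being found on different Web sites.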
{| class="wikitable"
|+ Table 1: The indexical bias of ten hot queries randomly chosen from Lycos 50.
! Queries !! About !! AltaVista !! Excite !! Google !! Inktomi !! Lycos !! MSN !! Overture !! Teoma !! Yahoo!
|-
| Final Fantasy || 0.5895 || 0.1876 || 0.5194 || 0.1876 || 0.3488 || 0.2403 || 0.4339 || 0.7054 || 0.4573 || 0.2713
|-
| Harry Potter || 0.5669 || 0.3098 || 0.5837 || 0.2253 || 0.3098 || 0.3275 || 0.4299 || 0.7758 || 0.3755 || 0.4181
|-
| Iraq || 0.7231 || 0.2560 || 0.5328 || 0.3252 || 0.2733 || 0.3771 || 0.4809 || 0.3771 || 0.4463 || 0.4290
|-
| Jennifer Lopez || 0.5878 || 0.3681 || 0.5835 || 0.2606 || 0.3864 || 0.2448 || 0.5123 || 0.3078 || 0.3550 || 0.2134
|-
| Las Vegas || 0.6985 || 0.3439 || 0.5921 || 0.1488 || 0.2375 || 0.3793 || 0.5744 || 0.8049 || 0.3261 || 0.2552
|-
| Lord of the Rings || 0.5493 || 0.2558 || 0.5659 || 0.2074 || 0.2924 || 0.2093 || 0.4418 || 0.7829 || 0.3953 || 0.2093
|-
| NASCAR || 0.3745 || 0.3897 || 0.4318 || 0.2982 || 0.3816 || 0.4150 || 0.4652 || 0.7493 || 0.4819 || 0.2829
|-
| SARS || 0.4206 || 0.4902 || 0.3309 || 0.2874 || 0.4743 || 0.4902 || 0.3526 || 0.6655 || 0.5691 || 0.5018
|-
| Tattoos || 0.5017 || 0.3355 || 0.6543 || 0.3995 || 0.5633 || 0.2903 || 0.4177 || 0.5847 || 0.4177 || 0.4905
|-
| The Bible || 0.6059 || 0.4518 || 0.5546 || 0.3148 || 0.3662 || 0.3245 || 0.6511 || 0.6917 || 0.3995 || 0.6247
|-
! Average
| 0.5618 || 0.3388 || 0.5349 || 0.2655 || 0.3634 || 0.3298 || 0.4760 || 0.6445 || 0.4224 || 0.3696
|}
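The indexical bias values in Table 1 follow the URL-based approach of [10][11][12]: each engine's retrieved URLs are turned into a frequency vector, the pooled results of all engines form the implicit norm, and the bias is one minus the cosine similarity. The sketch below illustrates that idea under simplifying assumptions: the norm is taken as the plain sum of all engines' URL counts, the function names are illustrative, and the URLs in the usage example are made up.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def indexical_bias(results_by_engine):
    """Compute 1 - cosine(engine URL vector, pooled norm) for every engine.

    results_by_engine maps an engine name to a list of result lists,
    one list of URLs per query.
    """
    # URL frequency vector of each engine over all queries.
    vectors = {
        engine: Counter(url for urls in per_query for url in urls)
        for engine, per_query in results_by_engine.items()
    }
    # Assumption: the implicit norm is the sum of all engine vectors.
    norm = Counter()
    for vector in vectors.values():
        norm.update(vector)
    # Clamp to [0, 1] to absorb floating-point rounding.
    return {
        engine: min(1.0, max(0.0, 1.0 - cosine(vector, norm)))
        for engine, vector in vectors.items()
    }


if __name__ == "__main__":
    # Hypothetical results for one query from three engines.
    sample = {
        "EngineA": [["example.org/a", "example.org/b"]],
        "EngineB": [["example.org/a", "example.org/b"]],
        "EngineC": [["example.org/c", "example.org/d"]],
    }
    for engine, bias in indexical_bias(sample).items():
        print(f"{engine}: {bias:.4f}")
```

In the usage example, the engine whose URLs are shared by no other engine deviates most from the pooled norm and therefore receives the highest bias value.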
{| class="wikitable"
|+ Table 2: The content bias of ten hot queries randomly chosen from Lycos 50.
! Queries !! About !! AltaVista !! Excite !! Google !! Inktomi !! Lycos !! MSN !! Overture !! Teoma !! Yahoo!
|-
| Final Fantasy || 0.5629 || 0.4535 || 0.3315 || 0.3507 || 0.5545 || 0.2724 || 0.4396 || 0.2961 || 0.5030 || 0.3481
|-
| Harry Potter || 0.5315 || 0.3028 || 0.4498 || 0.3181 || 0.4985 || 0.3555 || 0.4461 || 0.4346 || 0.3332 || 0.5443
|-
| Iraq || 0.4301 || 0.1651 || 0.5557 || 0.2250 || 0.1605 || 0.2213 || 0.5390 || 0.4403 || 0.2461 || 0.1711
|-
| Jennifer Lopez || 0.4723 || 0.4193 || 0.4524 || 0.3150 || 0.5921 || 0.3450 || 0.3959 || 0.2441 || 0.3914 || 0.3138
|-
| Las Vegas || 0.4656 || 0.4252 || 0.3303 || 0.1831 || 0.1971 || 0.2080 || 0.5267 || 0.5286 || 0.2201 || 0.2036
|-
| Lord of the Rings || 0.5853 || 0.2030 || 0.2622 || 0.1516 || 0.1801 || 0.1966 || 0.5129 || 0.4509 || 0.2440 || 0.1573
|-
| NASCAR || 0.3318 || 0.2210 || 0.4724 || 0.1743 || 0.1995 || 0.2195 || 0.5005 || 0.6139 || 0.2515 || 0.1950
|-
| SARS || 0.4373 || 0.6965 || 0.5769 || 0.3784 || 0.6521 || 0.7361 || 0.4259 || 0.5443 || 0.6819 || 0.3854
|-
| Tattoos || 0.5270 || 0.4733 || 0.4989 || 0.3351 || 0.3145 || 0.3425 || 0.3472 || 0.3732 || 0.3907 || 0.4654
|-
| The Bible || 0.5829 || 0.1874 || 0.5639 || 0.2394 || 0.1815 || 0.6096 || 0.6647 || 0.5358 || 0.6202 || 0.2126
|-
! Average
| 0.4927 || 0.3547 || 0.4494 || 0.2671 || 0.3530 || 0.3507 || 0.4798 || 0.4462 || 0.3882 || 0.2997
|}
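The content bias values in Table 2 are obtained with the weighted vector model of Section 3.3: a score per vocabulary entry (equation 1), a cosine similarity against the norm (equation 2), and one minus that similarity (equation 3). The following sketch shows how these three steps fit together. It rests on several assumptions that the paper does not state: the tag weight defaults (3.0, 2.0, 1.5) are placeholders, the logarithm is taken as the natural logarithm, and the norm is passed in as a ready-made score vector rather than derived by the TF-IDF selection step.

```python
import math
from dataclasses import dataclass


@dataclass
class VocabularyEntry:
    word: str  # the feature word
    f: int     # overall frequency
    d: int     # document frequency (documents containing the word, >= 1)
    n: int     # number of documents examined
    t: int     # frequency inside title tags
    H: int     # frequency inside H1 tags
    h: int     # frequency inside H2 tags


def score(entry, w_t=3.0, w_H=2.0, w_h=1.5):
    """Equation (1). The tag weights are hypothetical defaults."""
    weighted = entry.f + entry.t * w_t + entry.H * w_H + entry.h * w_h
    # Assumption: natural logarithm; the paper does not name the base.
    return weighted * math.log(entry.n / entry.d)


def content_bias(rvv_scores, norm_scores):
    """Equations (2) and (3) on two {word: score} mappings.

    A word missing from one side counts as 0, which plays the role of
    the "empty entries" inserted to align an RVV with the norm.
    """
    words = set(rvv_scores) | set(norm_scores)
    dot = sum(rvv_scores.get(w, 0.0) * norm_scores.get(w, 0.0) for w in words)
    len_rvv = math.sqrt(sum(s * s for s in rvv_scores.values()))
    len_norm = math.sqrt(sum(s * s for s in norm_scores.values()))
    if len_rvv == 0 or len_norm == 0:
        return 1.0  # no overlap can be measured, treat as maximal bias
    similarity = dot / (len_rvv * len_norm)
    return min(1.0, max(0.0, 1.0 - similarity))


if __name__ == "__main__":
    # Hypothetical entries for one engine and a hypothetical norm.
    entries = [
        VocabularyEntry("potter", f=12, d=6, n=10, t=4, H=2, h=1),
        VocabularyEntry("movie", f=5, d=2, n=10, t=0, H=1, h=0),
    ]
    rvv = {e.word: score(e) for e in entries}
    norm = {"potter": 10.0, "movie": 6.0, "book": 4.0}
    print(f"content bias: {content_bias(rvv, norm):.4f}")
```

Because the cosine is scale-invariant, an engine that returns more text than the others is not penalized for sheer volume; only the relative weight of its feature words against the norm affects the bias value.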
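{| class="wikitable"
|+ Table 3: ANOVA result of the indexical bias between Web search engines
! Source !! Sum of Squares !! Degrees of Freedom !! Mean Square !! F-ratio !! p-value
|-
| Between Groups || 1.301 || 9 || 0.145 || 12.687 || 0.000
|-
| Within Groups || 1.025 || 90 || 0.011 || ||
|-
| Total || 2.326 || 99 || || ||
|}

{| class="wikitable"
|+ Table 4: ANOVA result of the content bias between Web search engines
! Source !! Sum of Squares !! Degrees of Freedom !! Mean Square !! F-ratio !! p-value
|-
| Between Groups || 0.527 || 9 || 0.059 || 3.036 || 0.003
|-
| Within Groups || 1.736 || 90 || 0.019 || ||
|-
| Total || 2.263 || 99 || || ||
|}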
{| class="wikitable"
|+ Table 5: ANOVA result of the indexical bias across hot terms
! Engine !! About !! AltaVista !! Excite !! Google !! Inktomi !! Lycos !! MSN !! Overture !! Teoma !! Yahoo!
|-
! p-value
| 0.002 || 0.023 || 0.089 || 0.000 || 0.072 || 0.014 || 0.163 || 0.000 || 0.429 || 0.092
|}

{| class="wikitable"
|+ Table 6: ANOVA result of the content bias across hot terms
! Engine !! About !! AltaVista !! Excite !! Google !! Inktomi !! Lycos !! MSN !! Overture !! Teoma !! Yahoo!
|-
! p-value
| 0.010 || 0.232 || 0.089 || 0.003 || 0.221 || 0.206 || 0.021 || 0.101 || 0.499 || 0.025
|}
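The between-engine ANOVA of Table 3 can be retraced from the per-query values in Table 1, treating each search engine as one group of ten observations. The sketch below is an illustration in plain Python rather than the statistical package the authors used, which the paper does not name; the function name `one_way_anova` and the returned dictionary keys are illustrative. It computes the sums of squares, degrees of freedom, and F-ratio; the p-value is omitted because it requires the F distribution.

```python
# Indexical bias values from Table 1: one row per query, one column per
# engine (About, AltaVista, Excite, Google, Inktomi, Lycos, MSN,
# Overture, Teoma, Yahoo!).
TABLE_1 = [
    [0.5895, 0.1876, 0.5194, 0.1876, 0.3488, 0.2403, 0.4339, 0.7054, 0.4573, 0.2713],
    [0.5669, 0.3098, 0.5837, 0.2253, 0.3098, 0.3275, 0.4299, 0.7758, 0.3755, 0.4181],
    [0.7231, 0.2560, 0.5328, 0.3252, 0.2733, 0.3771, 0.4809, 0.3771, 0.4463, 0.4290],
    [0.5878, 0.3681, 0.5835, 0.2606, 0.3864, 0.2448, 0.5123, 0.3078, 0.3550, 0.2134],
    [0.6985, 0.3439, 0.5921, 0.1488, 0.2375, 0.3793, 0.5744, 0.8049, 0.3261, 0.2552],
    [0.5493, 0.2558, 0.5659, 0.2074, 0.2924, 0.2093, 0.4418, 0.7829, 0.3953, 0.2093],
    [0.3745, 0.3897, 0.4318, 0.2982, 0.3816, 0.4150, 0.4652, 0.7493, 0.4819, 0.2829],
    [0.4206, 0.4902, 0.3309, 0.2874, 0.4743, 0.4902, 0.3526, 0.6655, 0.5691, 0.5018],
    [0.5017, 0.3355, 0.6543, 0.3995, 0.5633, 0.2903, 0.4177, 0.5847, 0.4177, 0.4905],
    [0.6059, 0.4518, 0.5546, 0.3148, 0.3662, 0.3245, 0.6511, 0.6917, 0.3995, 0.6247],
]


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and the F-ratio."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f_ratio": f_ratio,
    }


# Transpose the table so that each group holds one engine's ten values.
ENGINE_GROUPS = [list(column) for column in zip(*TABLE_1)]

if __name__ == "__main__":
    result = one_way_anova(ENGINE_GROUPS)
    for key, value in result.items():
        print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

The degrees of freedom (9 between groups, 90 within) match Table 3, and the between-group sum of squares can be compared directly with the reported 1.301. If SciPy is available, `scipy.stats.f.sf(f_ratio, df_between, df_within)` gives the corresponding p-value.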
====4.2 The Case of “Second Superpower”====
To further assess bias events happening on the Web, we used a real Googlewashed event to assess the bias performance of Web search engines. In this experiment, we retrieved the search results and the Web pages from these ten search engines about one month after the event happened. As reported in [13], Tyler's original concept of “Second Superpower” was flooded on Google by Moore's alternative definition in six weeks. As a matter of fact, the idea of “second superpower” first appeared in the New York Times, written by Tyler to describe the global anti-war protests [17]. After a while, Moore's essay used the term to describe a totally different meaning, the influence of the Internet and other interactive media [9].

In Figure 3, the two-dimensional assessment result shows that the Googlewashed effect indeed lowers the bias performance of Google. The two-dimensional analysis also reflects that the Googlewashed effect was perceptible in both Google and Yahoo!, since Yahoo! cooperated with Google at that time (in fact, Yahoo! returned the same results as Google for this query).

[Figure 3: The bias result of “Second Superpower”. Horizontal axis: Content Bias (0.0 to 0.8); vertical axis: Indexical Bias (0.0 to 0.8); one point per search engine: About, AltaVista, Excite, Google, Inktomi, Lycos, MSN, Overture, Teoma, Yahoo.]

Interestingly, Figure 3 shows that the indexical bias ranking of Overture is relatively higher than its content bias ranking. After manually reviewing all 100 Web pages for this query, we discovered that there are actually several definitions of “Second Superpower,” not just Tyler’s and Moore’s. Although most contents retrieved by Overture point to the major viewpoints appearing in the norm, they are retrieved from diverse URLs rather than mirror sites, and thus the search results incur a high indexical bias value. This study shows that the indexical bias cannot tell us the whole story, but a two-dimensional scheme reflects a more comprehensive view of search engine bias.

===5. CONCLUSION===
Since Web search engines have become an essential gateway to the Internet, their favor or bias toward Web contents has deeply affected users' browsing behavior and may influence how they view the Web. Recently, some studies of search engine bias have been proposed to measure the deviation of the sites retrieved by a Web search engine from the norm for each specific query. These studies have presented an efficient way to assess search engine bias. However, such an assessment method ignores the content information in Web pages and thus cannot present the search engine bias thoroughly.

In this paper, we assert that both indexical bias and content bias are important to present search engine bias. Therefore, we study the content bias existing in current popular Web search engines and propose a two-dimensional assessment scheme to complement what indexical bias lacks. The experimental results have shown that such a two-dimensional scheme can expose the blind spot of a one-dimensional bias assessment approach and provide users with a more thorough view of search engine bias. Statistical analyses further show that such a two-dimensional scheme can fulfill the task of bias assessment and reveal more detailed information about search engine bias.

===6. REFERENCES===
[1] Brin, S., and Page, L., The Anatomy of Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference (Brisbane, Australia, 1998), ACM Press, New York, 107-117.

[2] Chen, I.-X. and Yang, C.-Z., Evaluating Content Bias and Indexical Bias in Web Search Engines. In Proceedings of International Conference on Informatics, Cybernetics and Systems (ICICS 2003) (Kaohsiung, Taiwan, ROC, 2003), 1597-1605.

[3] Gikandi D., Maximizing Search Engine Positioning (April 2, 1999); www.webdevelopersjournal.com/articles/search_engines.html

[4] Google Censors Itself for China. BBC News (Jan. 26, 2006); news.bbc.co.uk/1/hi/technology/4645596.stm.

[5] Holsti, O.R., Content Analysis for the Social Science and Humanities. 1st ed. Addison-Wesley Publishing Co., 1969.

[6] Introna, L. and Nissenbaum, H., Shaping the Web: Why the Politics of Search Engines Matters, The Information Society, 16, 3 (2000), 1-17.

[7] iProspect Search Engine User Attitudes (April-May, 2004); www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf.

[8] Jenkins, C., and Inman, D., Adaptive Automatic Classification on the Web. In Proceedings of the 11th International Workshop on Database and Expert Systems Applications (Greenwich, London, U.K., 2000), 504-511.

[9] Moore, J.F., The Second Superpower Rears its Beautiful Head (March 31, 2003); cyber.law.harvard.edu/people/jmoore/secondsuperpower.html.

[10] Mowshowitz, A., and Kawaguchi, A., Assessing Bias in Search Engines. Information Processing & Management, 38, 1 (Jan. 2002), 141-156.

[11] Mowshowitz, A., and Kawaguchi, A., Bias on the Web. Commun. ACM, 45, 9 (Sep. 2002), 56-60.

[12] Mowshowitz, A., and Kawaguchi, A., Measuring Search Engine Bias. Information Processing & Management, 41, 5 (Sep. 2005), 1193-1205.

[13] Orlowski, A., Anti-war Slogan Coined, Repurposed and Googlewashed . . . in 42 Days. The Register (April 3, 2003); www.theregister.co.uk/content/6/30087.html.

[14] Raggett, D., Getting Started with HTML, W3C Consortium (May 24, 2005); www.w3.org/MarkUp/Guide/.

[15] Salton, G., Wong, A., and Yang, C. S., A Vector Space Model for Automatic Indexing. Commun. ACM, 18, 11 (Nov. 1975), 613-620.

[16] Silverstein, C., Henzinger, M., Marais, H., and Moricz, M., Analysis of a Very Large AltaVista Query Log, ACM SIGIR Forum, 33, 1 (Fall 1999), 6-12.

[17] Tyler, P.E., A New Power in the Streets. New York Times (Feb. 17, 2003); foi.missouri.edu/voicesdissent/newpower.html.

[18] Vaughan, L. and Thelwall, M., Search Engine Coverage Bias: Evidence and Possible Causes, Information Processing & Management, 40, 4, (July 2004), 693-707.

[19] Zittrain, J. and Edelman, B., Documentation of Internet Filtering in Saudi Arabia, (Sep. 12, 2002); cyber.law.harvard.edu/filtering/saudiarabia/.

[20] Zittrain, J. and Edelman, B., Localized Google search result exclusions, (Oct. 26, 2002); cyber.law.harvard.edu/filtering/google/.

[21] Zittrain, J. and Edelman, B., Internet Filtering in China. IEEE Internet Computing, 7, 2 (March/April, 2003), 70-77.

[22] 50.lycos.com.