=Paper=
{{Paper
|id=Vol-190/paper-10
|storemode=property
|title=A Study of Web Search Engine Bias and its Assessment
|pdfUrl=https://ceur-ws.org/Vol-190/paper10.pdf
|volume=Vol-190
|authors=Ing-Xiang Chen and Cheng-Zen Yang
}}
==A Study of Web Search Engine Bias and its Assessment==
<pdf width="1500px">https://ceur-ws.org/Vol-190/paper10.pdf</pdf>
<pre>
   Position Paper: A Study of Web Search Engine Bias and
                       its Assessment
                      Ing-Xiang Chen                                                        Cheng-Zen Yang
  Dept. of Computer Sci. and Eng., Yuan Ze University                   Dept. of Computer Sci. and Eng., Yuan Ze University
            135 Yuan-Tung Road, Chungli                                           135 Yuan-Tung Road, Chungli
                 Taiwan, 320, ROC                                                      Taiwan, 320, ROC
            sean@syslab.cse.yzu.edu.tw                                           czyang@syslab.cse.yzu.edu.tw

ABSTRACT                                                               aspects. The first source is from the diverse operating policies and
                                                                       the business strategies adopted in each search engine company.
Search engine bias has drawn serious attention in recent years.        As mentioned in [1], this type of bias is more insidious than
Several pioneering studies have reported that bias perceivably         advertising. A recent news story illustrates this type of bias:
exists even with respect to the URLs in the search results. On the     Google in China distorts the reality of “Falun Gong” by removing
other hand, the potential bias with respect to the content of the      the relevant search results. In this example, Google agreed to
search results has not been comprehensively studied. In this paper,    comply with censorship requirements in China to guard its
we propose a two-dimensional approach to assess both the               business profits [4]. Second, the limitations of crawling, indexing,
indexical bias and content bias existing in the search results.        and ranking techniques may result in search engine bias. An
Statistical analyses have been further performed to present the        interesting example shows that the phrase “Second Superpower”
significance of bias assessment. The results show that the content     was once Googlewashed in only six weeks because webloggers
bias and indexical bias are both influential in the bias assessment,   spun the alternative meaning to produce sufficient PageRank to
and they complement each other to provide a panoramic view             flood Google [9][13][17]. Third, the information provided by the
with the two-dimensional representation.                               search engines may be biased in some countries because of the
                                                                       opposed political standpoints, diverse cultural backgrounds, and
Categories and Subject Descriptors                                     different social customs. The blocking and filtering of Google in
H.3.4 [Information Storage and Retrieval]: Systems and                 China [20][21] and the information filtering on Google in Saudi
Software – Performance Evaluation                                      Arabia, Germany, and France are cases in which politics biases
                                                                       Web search engines [19][20].
General Terms                                                          As a search engine is an essential tool in the current cyber society,
Measurement                                                            people are probably influenced by search engine bias without
                                                                       noticing it when reading the information provided by the search
Keywords                                                               engine. For example, some people may never get the information
search engine bias, indexical bias, content bias, information          about certain popular brands when inquiring about the term
quality, automatic assessment.                                         “home refrigerators” via a search engine [11]. From the viewpoint
                                                                       of the entire information society, the marginalization of certain
                                                                       information limits the Web space and confines its functionality to
1. INTRODUCTION                                                        a limited scope [6]. Consequently, many search engine users are
In recent years, an increasingly huge amount of information has        unknowingly deprived of the right to fairly browse and access the
been published and pervasively communicated over the World             WWW.
Wide Web (WWW). Web search engines have accordingly
                                                                       Recently, the issue of search engine bias has been noticed, and
become the most important gateway to access the WWW and
                                                                       several studies have been proposed to investigate the
even an indispensable part of today’s information society as well.
                                                                       measurement of search engine bias. In [10][11][12], an effective
According to [3][7], most users get used to a few particular search
                                                                       method is proposed to measure the search engine bias through
interfaces, and thus mainly rely on these Web search engines to
                                                                       comparing the URL of each indexed item retrieved by a search
find the information. Unfortunately, due to some limitations of
                                                                       engine with that by a pool of search engines. The result of such
current search technology, different considerations of operating
                                                                       search engine bias assessment is termed the indexical bias.
strategies, or even some political or cultural factors, Web search
                                                                       Although the assessment of indexed URLs is an efficient and
engines have their own preferences and prejudices to the Web
                                                                       effective approach to predict search engine bias, assessing the
information [10][11][12]. As a result, the information sources and
                                                                       indexical bias only provides a partial view of search engine bias.
content types indexed by different Web search engines are
                                                                       In our observations, two search engines with the same degree of
exhibited in an unbalanced condition. In the past studies
                                                                       indexical bias may return different page content and reveal the
[10][11][12], such unbalanced item selection in Web search
                                                                       semantic differences. In such a case, the potential difference of
engines is termed search engine bias.
                                                                       overweighing specific content may result in significant content
In our observations, search engine bias can be incurred from three     bias that cannot be presented by simply assessing the indexed
                                                                       URLs. In addition, if a search result contains redirection links to
 Copyright is held by IW3C2.                                           other URLs that are absent from the search result, these absent
 WWW 2006, May 22–26, 2006, Edinburgh, UK.                             URLs still can be accessed via the redirection links. In this case, a
                                                                       search engine only reports the intermediate URLs, and the search
engine may thus show poor indexical bias performance even though       From the past literature on search engine bias assessment, we
that is not the case. Analyzing the page content helps reveal a        argue that without considering the Web content, the bias
panoramic view of search engine bias.                                  assessment only tells users part of the reality. Besides, how to
In this paper, we examine the real bias events in the current Web      appropriately assess search engine bias from both views needs
environment and study the influences of search engine bias upon        further study. In this paper, we propose an improved
the information society. We assert that assessing the content bias     assessment method for content bias and further present a two-
through the content majorities and minorities existing in Web          dimensional strategy for bias assessment.
search engines as the other dimension can help evaluate search
engine bias more thoroughly. Therefore, a two-dimensional              3. THE BIAS ASSESSMENT METHOD
assessment mechanism is proposed to assess search engine bias.         To assess the bias of a search engine, a norm should be first
In the experiments, the two-dimensional bias distribution and the      generated. In traditional content analysis studies, the norm is
statistical analyses sufficiently expound the bias performance of      usually obtained with careful examinations of subject experts [5].
each search engine.                                                    However, manually examining Web page content to get the
                                                                       norm is impossible because the Web space is rapidly changing
2. LITERATURE REVIEW                                                   and the number of Web pages is extremely large. Therefore, an
Recently, some pioneering studies have been conducted to discuss       implicit norm is generally used in current studies [10][11][12].
search engine bias by measuring the retrieved URLs of Web              The implicit norm is defined by a collection of search results of
search engines. In 2002, Mowshowitz and Kawaguchi first                several representative search engines. To avoid unfairly favoring
proposed measuring the indexed URLs of a search engine to              certain search engines, a search engine is not considered if it
determine the search engine bias since they asserted that a Web        uses another search engine's kernel without any refinement, or if
search engine is a retrieval system containing a set of items that     its number of indexed pages is not comparably large.
represent messages [10][11][12]. In their method, a vector-based       Since assessing the retrieved URLs of search engines cannot
statistical analysis is used to measure search engine bias by          represent the whole view of search engine bias, the assessment
selecting a pool of Web search engines as an implicit norm, and        scheme needs to consider other representations to fill this gap. In
comparing the occurring frequencies of the retrieved URLs by           the current cyber-society, information is delivered to people
each search engine in the norm. Therefore, bias is assessed by         through various Web pages. Although these Web pages are
calculating the deviation of URLs retrieved by a Web search            presented with photos, animations, and various multimedia
engine from those of the norm.                                         technologies, the main content still consists of hypertextual
In [11], a simple example is illustrated to assess indexical bias of   information that is composed of different HTML tags [1].
three search engines with two queries and the top ten results of       Therefore, in our approach, the hypertextual content is assessed to
each query. Thus, a total of 60 URL entries were retrieved and         reveal another bias aspect.
analyzed, and 44 distinct URLs with occurring frequencies were         To appropriately present Web contents, we use a weighted vector
transformed into the basis vector. The similarity between the two      approach to represent Web pages and compute the content bias.
basis vectors was then calculated by using a cosine metric. The        The following subsections elaborate the generation of an implicit
result of search engine bias is obtained by subtracting the cosine     bias norm, a two-dimensional assessment scheme, and a weighted
value from one and gains a result between 0 and 1 to represent the     vector approach for content bias assessment.
degree of bias.
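
The URL-based measure summarized above can be sketched in a few lines. This is
only an illustration under stated assumptions: the engine names, the URLs, and
the helper names cosine and indexical_bias are hypothetical, and the norm is
taken to be the pooled URL frequencies of all engines, as the description of
[11] suggests. It is not the exact procedure of [10][11][12].

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse frequency vectors stored as dicts."""
    dot = sum(count * v.get(url, 0) for url, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def indexical_bias(engine, pool):
    """Bias of one engine: 1 - cos(engine URL vector, pooled norm vector).

    pool maps an engine name to the list of URLs it returned (all queries).
    """
    engine_vector = Counter(pool[engine])
    norm_vector = Counter(url for urls in pool.values() for url in urls)
    return 1.0 - cosine(engine_vector, norm_vector)


# Hypothetical results: three engines, three URLs each.
pool = {
    "A": ["u1", "u2", "u3"],
    "B": ["u1", "u2", "u4"],
    "C": ["u1", "u5", "u6"],
}
for name in pool:
    print(name, round(indexical_bias(name, pool), 4))
```

In this toy pool, engine C shares fewer URLs with the other engines than A
does, so its vector lies further from the norm and its bias value is higher.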
                                                                       3.1 Bias Norm Generation
Vaughan and Thelwall further used such a URL-based approach            Following the definition of bias in [10][11][12], the implicit norm in
to investigate the causes of search engine coverage bias in            our study is generated from the vector collection of a set of
different countries [18]. They asserted that the language of a site    comparable search engines to approximate the ideal. The main
does not affect search engine coverage bias, but the visibility of     reason for this approximation is that the changes in Web space
the indexed sites does. If a Web search engine has many highly         are extremely frequent and divergent, and thus traditional
visible sites, i.e., sites linked by many other Web sites,             methods of manually generating norms by subject experts are
the search engine has a high coverage ratio. Since they calculated     time-consuming and become impractical. On the other hand,
the search engine coverage ratio based on the number of URLs           search engines can be implicitly viewed as experts in reporting
retrieved by a search engine, the assessment still cannot clearly      search results. The norms can be generated by selecting some
show how much information is covered. Furthermore, the                 representative search engines and synthesizing their search results.
experimental sites were retrieved only from three search engines       However, the selection of the representative search engines
with domain names from four countries with Chinese and English         should be cautiously considered to avoid generating biased norms
pages, and thus so few samples may not guarantee a universal           that will show favoritism toward some specific search engines.
truth in other countries.
                                                                       The selection of representative search engines is based on the
In 2003, Chen and Yang used an adaptive vector model to explore        following criteria:
the effects of content bias [2]. Since their study was targeted on
the Web contents retrieved by each search engine, the content          1. The search engines are generally designed for different subject
bias was normalized to present the bias degree. Although the              areas. Search engines for special domains are not considered.
assessment appropriately reveals content bias, the study ignores          In addition, search engines, e.g. localized search engines,
the normalization influences of contents among each retrieved             designed for specific users are also disregarded.
item. Consequently, the content bias may be over-weighted with         2. The search engines are comparable to each other and to the
some rich-context items. Furthermore, the study cannot determine          search engines to be assessed. Search engines are excluded if
whether the results are statistically significant.                        the number of the indexed pages is not large enough.
                                                                       3. Search engines will not be considered if they use another search
   engine's core without any refinement. For example, Lycos                             appropriate to represent and assess the contents of Web
   started to use the crawling core provided by FAST in 1999. If                        documents.
   both are selected to form the norms, their bias values are                           Since the search results are query-specific, query strings in
   unfairly lower. However, if a search engine uses another engine's                    different subjects are used to get corresponding representative
   kernel but incorporates individual searching rules, it is still                      vocabulary vectors RVV for search engines. Each RVV represents
   under consideration because it may provide different views.                          the search content of a search engine and is determined by
4. Metasearch engines are under consideration if they have their                        examining the first m URL entries in the search result list. Every
   own processing rules. We assume that these rules are not                             word in URL entries is parsed to filter out stop words and to
   prejudiced in favor of certain search engines. In fact, if there                     extract feature words. The RVV consists of a series of vocabulary
   exist prejudices, they will be revealed after the assessment, and                    entries VEi with eight fields: the i-th feature word, its overall
   the biased metasearch engine will be excluded.                                       frequency f, its document frequency d, the number of documents
                                                                                        n, its title frequency t, its H1 frequency H, its H2 frequency h, and
3.2 The Two-dimensional Assessment Scheme                                               its score S. The score S is determined as follows:
Since both indexical bias and content bias are important to
represent the bias performance of a search engine, we assess
search engine bias from both aspects and present search engine                          S = (f + t \cdot w_t + H \cdot w_H + h \cdot w_h) \times \log(n / d)        (1)
bias in a two-dimensional view. Figure 1 depicts the two-
dimensional assessment process. For each query string, the                              where w_t, w_H, and w_h are the respective tag weights. The scores are
corresponding query results are retrieved from Web search                               used in similarity computations.
engines. Then the URL locator parses the search results and
fetches the Web pages. The document parser extracts the feature                         After all RVV vectors are computed, necessary empty entries are
words and computes the content vectors. Stop words are also                             inserted to make the entries in RVV exactly corresponding to the
filtered out in this stage. Finally, feature information is stored in                   entries in the norm for similarity computation. Then the cosine
the database for the following bias measurement.                                        function is used to compute the similarity between RVV_i of the i-th
                                                                                        search engine and the norm N:

                                                                                        Sim(RVV_i, N) = \cos(RVV_i, N)
                                                                                          = \frac{\sum_j S_{RVV_i,j} \cdot S_{N,j}}{\sqrt{\sum_j S_{RVV_i,j}^2} \sqrt{\sum_j S_{N,j}^2}}        (2)

                                                                                        where S_{RVV_i,j} is the j-th entry score of RVV_i, and S_{N,j} is the j-th
                                                                                        entry score of the norm. Finally, the content bias value
                                                                                        CB(RVV_i, N) is defined as

                                                                                        CB(RVV_i, N) = 1 - Sim(RVV_i, N)        (3)

[Figure 1 diagram. Components: Query, Search Engines, URL Locator, Web Pages,
 Document Parser, Vocabulary Entries, Bias Assessor, Bias Report]
Figure 1: The assessment process of measuring search engine bias
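
Equations (1) to (3) can be made concrete with a small sketch. Several details
below are assumptions rather than values from this paper: the tag weights
W_TITLE, W_H1, and W_H2 are hypothetical, the logarithm is taken as the natural
log because the base is not stated, and the feature words and counts are toy
data. Words missing from one vector are treated as zero-score entries, which
mirrors the insertion of empty entries before the similarity computation.

```python
import math

# Hypothetical tag weights; the values actually used are not given in the text.
W_TITLE, W_H1, W_H2 = 3.0, 2.0, 1.5


def entry_score(f, t, h1, h2, n, d, w_t=W_TITLE, w_h1=W_H1, w_h2=W_H2):
    """Equation (1): tag-weighted frequency multiplied by log(n / d)."""
    return (f + t * w_t + h1 * w_h1 + h2 * w_h2) * math.log(n / d)


def similarity(rvv, norm):
    """Equation (2): cosine over entry scores keyed by feature word."""
    dot = sum(score * norm.get(word, 0.0) for word, score in rvv.items())
    len_rvv = math.sqrt(sum(s * s for s in rvv.values()))
    len_norm = math.sqrt(sum(s * s for s in norm.values()))
    if len_rvv == 0.0 or len_norm == 0.0:
        return 0.0
    return dot / (len_rvv * len_norm)


def content_bias(rvv, norm):
    """Equation (3): CB = 1 - Sim."""
    return 1.0 - similarity(rvv, norm)


# Toy vocabulary entries (feature word -> score S).
rvv = {
    "potter": entry_score(f=12, t=2, h1=1, h2=0, n=10, d=4),
    "movie": entry_score(f=5, t=0, h1=0, h2=1, n=10, d=2),
}
norm = {
    "potter": entry_score(f=90, t=11, h1=6, h2=3, n=100, d=45),
    "book": entry_score(f=40, t=3, h1=2, h2=2, n=100, d=30),
}
print(round(content_bias(rvv, norm), 4))
```

Note that a word appearing in every document (d = n) gets a score of zero under
equation (1), so it contributes nothing to the similarity.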

The bias assessor collects two kinds of information: the URL
indexes and the representative vocabulary vectors (RVV) for                             4. EXPERIMENTS AND DISCUSSIONS
corresponding Web contents. The URL indexes are used to                                 We have conducted experiments to study bias in currently popular
compute the indexical bias, and the RVV vectors are used to                             search engines with the proposed two-dimensional assessment
compute the content bias. After the assessment, the assessor                            scheme. Ten search engines are included in the assessment studies:
generates bias reports.                                                                 About, AltaVista, Excite, Google, Inktomi, Lycos, MSN,
                                                                                        Overture, Teoma, and Yahoo. To compute RVV vectors, the top
                                                                                        m=10 URLs from search results are processed because it is shown
3.3 The Weighted Vector Model                                                           that the first result screen is requested for 85% of the queries [16],
Web contents are mainly composed of different HTML tags that
                                                                                        and it usually shows the top ten results. To generate the norm, we
respectively represent their own specific meanings in Web pages.
                                                                                        used a weighted term-frequency-inverse-document-frequency (TF-
For example, a title tag represents the name of a Web page, which
                                                                                        IDF) strategy to select the feature information from the ten search
is shown in the browser window caption bar. Different headings
                                                                                        engines. The size of N is thus adaptive to different queries to
represent differing importance in a Web page. In HTML there are
                                                                                        appropriately represent the norm.
six levels of headings. H1 is the most important; H2 is slightly
less important, and so on down to H6, the least important [14]. In                      We have conducted experiments to measure the biases of ten
content bias assessment, how to represent a Web document plays                          general search engines. The indexical bias is assessed according
an important role to reflect the reality of assessment.                                 to the approach proposed by Mowshowitz and Kawaguchi
Here we adopt a weighted vector approach to measure content                             [10][11][12]. The content bias is assessed according to the
bias [8]. It is based on a vector space model [15] but adapted to                       proposed weighted vector model. In the experiments, queries from
emphasize the feature information in Web pages. Because the                             different subjects were tested. Two of the experimental results are
features in <title>, <H1>, or <H2> tags usually indicate important                      reported and discussed here. The first is a summarization of ten
information and are used more often in the Web documents,                               hot queries. This study shows the average bias performance of
features in these tags are appropriately weighted to represent Web                      Web search engines according to their content bias and indexical
contents. Since the number of the total Web documents can only                          bias values. The second is a case study on the overwhelming
be estimated by sampling or assumption, this model is more                              redefinition power of search engines reported in [13]. In this
                                                                                        experiment, the two-dimensional assessment shows that most
search engines report similar indexical and content bias ranking                                    However, when we review the bias performance of Yahoo!, we
except Overture.                                                                                    can see that it has quite good content bias performance, which is
                                                                                                    ranked as the second best, but only has a medium indexical bias
                                                                                                    ranking. Such inconsistent bias performance shows that Yahoo! can
4.1 The Assessment Results of Hot Queries
                                                                                                    discover other similar major contents from different Web sites.
In this experiment, we randomly chose ten hot queries from Lycos
                                                                                                    However, such differences cannot be revealed when users only
50 [22]. For each of them, we collected 100 Web pages from ten
                                                                                                    consider the indexical bias as the panorama of search engine bias.
search engines. The queries are “Final Fantasy”, “Harry Potter”,
                                                                                                    In our experiments, a one-way analysis of variance (ANOVA)
“Iraq”, “Jennifer Lopez”, “Las Vegas”, “Lord of the Rings”,
                                                                                                    was conducted to analyze the statistical significance on bias
“NASCAR”, “SARS”, “Tattoos”, and “The Bible”. The
                                                                                                    performance among each search engine. The ANOVA analyses in
assessment results of their indexical bias and content bias values
                                                                                                    Table 5 and Table 6 indicate that the content bias of Yahoo! is
are shown in Table 1 and Table 2.
                                                                                                    more statistically significant than the indexical bias.

[Figure 2 plot: the ten search engines (About, AltaVista, Excite, Google, Inktomi,
 Lycos, MSN, Overture, Teoma, Yahoo!) plotted by Content Bias (x-axis, 0.0 to 0.7)
 and Indexical Bias (y-axis, 0.0 to 0.7)]
                                                                                                    In Table 3 and Table 4, the ANOVA results of the averaged
                                                                                                    indexical bias and content bias are presented to display the
                                                                                                    statistical significance between the experimental search engines.
                                                                                                    Both of the ANOVA results reveal statistical significance of the
                                                                                                    ten search engines over the hot query terms (p ≤ 0.05). The p-
                                                                                                    values in the table measure the credibility of the null hypothesis.
                                                                                                    The null hypothesis here means that there is no significant
                                                                                                    difference between each search engine. If the p-value is less than
                                                                                                    or equal to the widely accepted value 0.05, the null hypothesis is
                                                                                                    rejected.
                                                                                                    Since there is significant difference among the search engines, we
                                                                                                    further analyze the variance across different hot query terms.
                                                                                                    Table 5 and Table 6 show the ANOVA results of indexical bias
Figure 2: The two-dimensional analysis of the ten hot queries                                       and content bias between each search engine over the ten hot
from Lycos 50                                                                                       query terms. Table 5 further indicates that About, AltaVista,
In Figure 2, the average bias performance is further displayed in a                                 Google, Lycos, and Overture are significant, and Table 6 presents
two-dimensional diagram. In the figure, two additional dotted                                       that About, Google, MSN, and Yahoo! are significant. From the
lines are used to represent the respective statistical mean values of                               ANOVA analyses, the original indexical bias of MSN and Yahoo!
bias. The results show that Google has the lowest indexical and                                     is less significant, but the content bias assessment can reveal the
content bias values, which means that Google outperforms others                                     complementary information. The two-dimensional assessment
in bias performance. The best bias performance in Google                                            scheme gives users a panoramic view of search engine bias.
indicates that both the sites and the contents it retrieved are the
majority on the Web and may satisfy most user needs. From
the average results, we found that most of the search engines
show similar bias rankings in both indexical bias and content bias.
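
The one-way ANOVA mentioned above can be illustrated with the indexical bias
values of Table 1. The sketch below is an assumption about the design, not a
reproduction of Tables 3 to 6: it treats each search engine as one group with
its ten per-query values as observations, and it only computes the F statistic
and the degrees of freedom. The function name one_way_anova_f is hypothetical;
obtaining the p-value requires the F distribution from a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of observation lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Indexical bias per engine, one value per hot query (columns of Table 1).
TABLE1 = {
    "About": [0.5895, 0.5669, 0.7231, 0.5878, 0.6985, 0.5493, 0.3745, 0.4206, 0.5017, 0.6059],
    "AltaVista": [0.1876, 0.3098, 0.2560, 0.3681, 0.3439, 0.2558, 0.3897, 0.4902, 0.3355, 0.4518],
    "Excite": [0.5194, 0.5837, 0.5328, 0.5835, 0.5921, 0.5659, 0.4318, 0.3309, 0.6543, 0.5546],
    "Google": [0.1876, 0.2253, 0.3252, 0.2606, 0.1488, 0.2074, 0.2982, 0.2874, 0.3995, 0.3148],
    "Inktomi": [0.3488, 0.3098, 0.2733, 0.3864, 0.2375, 0.2924, 0.3816, 0.4743, 0.5633, 0.3662],
    "Lycos": [0.2403, 0.3275, 0.3771, 0.2448, 0.3793, 0.2093, 0.4150, 0.4902, 0.2903, 0.3245],
    "MSN": [0.4339, 0.4299, 0.4809, 0.5123, 0.5744, 0.4418, 0.4652, 0.3526, 0.4177, 0.6511],
    "Overture": [0.7054, 0.7758, 0.3771, 0.3078, 0.8049, 0.7829, 0.7493, 0.6655, 0.5847, 0.6917],
    "Teoma": [0.4573, 0.3755, 0.4463, 0.3550, 0.3261, 0.3953, 0.4819, 0.5691, 0.4177, 0.3995],
    "Yahoo!": [0.2713, 0.4181, 0.4290, 0.2134, 0.2552, 0.2093, 0.2829, 0.5018, 0.4905, 0.6247],
}

f_stat, df_between, df_within = one_way_anova_f(list(TABLE1.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F value means that the variation between engine means is large relative
to the variation within each engine's ten queries, which is the condition under
which the null hypothesis of no difference between engines is rejected.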


Table 1: The indexical bias of ten hot queries randomly chosen from Lycos 50.
Queries                                 About            AltaVista       Excite        Google      Inktomi    Lycos       MSN        Overture     Teoma       Yahoo!
Final Fantasy                           0.5895             0.1876        0.5194           0.1876    0.3488     0.2403     0.4339       0.7054      0.4573      0.2713
Harry Potter                            0.5669             0.3098        0.5837           0.2253    0.3098     0.3275     0.4299       0.7758      0.3755      0.4181
Iraq                                    0.7231             0.2560        0.5328           0.3252    0.2733     0.3771     0.4809       0.3771      0.4463      0.4290
Jennifer Lopez                          0.5878             0.3681        0.5835           0.2606    0.3864     0.2448     0.5123       0.3078      0.3550      0.2134
Las Vegas                               0.6985             0.3439        0.5921           0.1488    0.2375     0.3793     0.5744       0.8049      0.3261      0.2552
Lord of the Rings                       0.5493             0.2558        0.5659           0.2074    0.2924     0.2093     0.4418       0.7829      0.3953      0.2093
NASCAR                                  0.3745             0.3897        0.4318           0.2982    0.3816     0.4150     0.4652       0.7493      0.4819      0.2829
SARS                                    0.4206             0.4902        0.3309           0.2874    0.4743     0.4902     0.3526       0.6655      0.5691      0.5018
Tattoos                                 0.5017             0.3355        0.6543           0.3995    0.5633     0.2903     0.4177       0.5847      0.4177      0.4905
The Bible                               0.6059             0.4518        0.5546           0.3148    0.3662     0.3245     0.6511       0.6917      0.3995      0.6247
Average:                                0.5618             0.3388        0.5349           0.2655    0.3634     0.3298     0.4760       0.6445      0.4224      0.3696
Table 2: The content bias of ten hot queries randomly chosen from Lycos 50.
Queries                 About       AltaVista       Excite      Google      Inktomi     Lycos       MSN         Overture        Teoma      Yahoo!
Final Fantasy           0.5629        0.4535         0.3315      0.3507      0.5545      0.2724      0.4396       0.2961        0.5030      0.3481
Harry Potter            0.5315        0.3028         0.4498      0.3181      0.4985      0.3555      0.4461       0.4346        0.3332      0.5443
Iraq                    0.4301        0.1651         0.5557      0.2250      0.1605      0.2213      0.5390       0.4403        0.2461      0.1711
Jennifer Lopez          0.4723        0.4193         0.4524      0.3150      0.5921      0.3450      0.3959       0.2441        0.3914      0.3138
Las Vegas               0.4656        0.4252         0.3303      0.1831      0.1971      0.2080      0.5267       0.5286        0.2201      0.2036
Lord of the Rings       0.5853        0.2030         0.2622      0.1516      0.1801      0.1966      0.5129       0.4509        0.2440      0.1573
NASCAR                  0.3318        0.2210         0.4724      0.1743      0.1995      0.2195      0.5005       0.6139        0.2515      0.1950
SARS                    0.4373        0.6965         0.5769      0.3784      0.6521      0.7361      0.4259       0.5443        0.6819      0.3854
Tattoos                 0.5270        0.4733         0.4989      0.3351      0.3145      0.3425      0.3472       0.3732        0.3907      0.4654
The Bible               0.5829        0.1874         0.5639      0.2394      0.1815      0.6096      0.6647       0.5358        0.6202      0.2126
Average:                0.4927        0.3547         0.4494      0.2671      0.3530      0.3507      0.4798       0.4462        0.3882      0.2997


Table 3: ANOVA result of the indexical bias between Web search engines
                             Sum of Squares                   Degrees of Freedom              Mean Square           F-ratio              p-value
Between Groups                      1.301                             9                           0.145                12.687            0.000
Within Groups                       1.025                            90                           0.011
Total                               2.326                            99


Table 4: ANOVA result of the content bias between Web search engines
                             Sum of Squares                   Degrees of Freedom              Mean Square           F-ratio              p-value
Between Groups                      0.527                             9                           0.059                3.036             0.003
Within Groups                       1.736                            90                           0.019
Total                               2.263                            99
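
The mean squares and F-ratios in Table 3 and Table 4 follow directly from the sums of squares
and the degrees of freedom (10 engines give 9 between-group degrees of freedom; 100 observations
minus 10 groups give 90 within-group degrees of freedom). The sketch below only re-derives those
columns from the published sums of squares; the function name anova_f is illustrative, and the
p-values are not recomputed because that would require an F-distribution routine outside the
Python standard library.

```python
def anova_f(ss_between, df_between, ss_within, df_within):
    """Return (mean square between, mean square within, F-ratio) for a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Sums of squares and degrees of freedom as printed in Table 3 and Table 4.
TABLE_3 = anova_f(1.301, 9, 1.025, 90)  # indexical bias between engines
TABLE_4 = anova_f(0.527, 9, 1.736, 90)  # content bias between engines

if __name__ == "__main__":
    for name, (ms_b, ms_w, f_ratio) in (("Table 3", TABLE_3), ("Table 4", TABLE_4)):
        print(f"{name}: MS between={ms_b:.3f}, MS within={ms_w:.3f}, F={f_ratio:.3f}")
```

Because the printed sums of squares are rounded to three decimals, the recomputed F-ratios agree
with the reported 12.687 and 3.036 only up to rounding.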


Table 5: ANOVA result of the indexical bias across hot terms
Engine          About     AltaVista         Excite      Google        Inktomi         Lycos       MSN         Overture     Teoma          Yahoo!
p-value         0.002       0.023           0.089        0.000            0.072       0.014       0.163        0.000           0.429       0.092


Table 6: ANOVA result of the content bias across hot terms
Engine          About     AltaVista         Excite      Google        Inktomi         Lycos       MSN         Overture     Teoma          Yahoo!
p-value         0.010       0.232           0.089        0.003            0.221       0.206       0.021        0.101           0.499       0.025
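
The lists of significant engines quoted in the text can be recovered from Table 5 and Table 6 by
thresholding the p-values. The sketch below assumes a significance level of 0.05, which is not
stated in this section but is consistent with the engines named in the text; ALPHA and
significant are illustrative names.

```python
ENGINES = ["About", "AltaVista", "Excite", "Google", "Inktomi",
           "Lycos", "MSN", "Overture", "Teoma", "Yahoo!"]

# p-values copied from Table 5 (indexical bias) and Table 6 (content bias).
P_INDEXICAL = [0.002, 0.023, 0.089, 0.000, 0.072, 0.014, 0.163, 0.000, 0.429, 0.092]
P_CONTENT = [0.010, 0.232, 0.089, 0.003, 0.221, 0.206, 0.021, 0.101, 0.499, 0.025]

ALPHA = 0.05  # assumed significance level, not taken from the paper


def significant(p_values, alpha=ALPHA):
    """Return the engines whose p-value is below alpha, in table order."""
    return [engine for engine, p in zip(ENGINES, p_values) if p < alpha]


if __name__ == "__main__":
    idx_sig = significant(P_INDEXICAL)
    con_sig = significant(P_CONTENT)
    print("indexical:", idx_sig)
    print("content:", con_sig)
    # Engines flagged only by the content dimension.
    print("content only:", [e for e in con_sig if e not in idx_sig])
```

Under this assumption, MSN and Yahoo! are flagged only in the content dimension, which is the
complementary information the two-dimensional scheme is meant to expose.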


4.2 The Case of “Second Superpower”
To further assess bias on the Web, we used a real Googlewashed event to examine the bias
performance of Web search engines. In this experiment, we retrieved the search results and the
Web pages from the ten search engines about one month after the event happened. As reported in
[13], Tyler's original concept of “Second Superpower” was flooded by Google with Moore's
alternative definition in seven weeks. In fact, the idea of “second superpower” first appeared
in a New York Times article written by Tyler to describe the global anti-war protests [17].
Later, Moore's essay used the term with a totally different meaning: the influence of the
Internet and other interactive media [9].

In Figure 3, the two-dimensional assessment result shows that the Googlewashed effect indeed
degrades the bias performance of Google. The two-dimensional analysis also reflects that the
Googlewashed effect was perceptible in both Google and Yahoo!, since Yahoo! was cooperating with
Google at that time (in fact, the Yahoo! results are the same as Google's for this query).

[Figure 3 plot: indexical bias (vertical axis, 0.0 to 0.8) against content bias (horizontal
axis, 0.0 to 0.8) for About, AltaVista, Excite, Google, Inktomi, Lycos, MSN, Overture, Teoma,
and Yahoo!]
Figure 3: The bias result of “Second Superpower”

Interestingly, Figure 3 shows that the indexical bias ranking of Overture is relatively higher
than its content bias ranking. After manually reviewing all 100 Web pages for this query, we
discovered that there are actually several definitions of “Second Superpower,” not just Tyler’s
and Moore’s. Although most of the contents retrieved by Overture reflect the major viewpoints
appearing in the norm, they are retrieved from diverse URLs rather than mirror sites, and thus
the search results incur a high indexical bias value. This case shows that indexical bias alone
cannot tell the whole story, whereas a two-dimensional scheme gives a more comprehensive view of
search engine bias.

5. CONCLUSION
Because Web search engines have become an essential gateway to the Internet, their favor or bias
toward Web contents deeply affects users' browsing behavior and may influence how users view the
Web. Recently, some studies of search engine bias have proposed measuring the deviation of the
sites retrieved by a Web search engine from the norm for each specific query. These studies
present an efficient way to assess search engine bias. However, such an assessment method
ignores the content information in Web pages and thus cannot present search engine bias
thoroughly.

In this paper, we assert that both indexical bias and content bias are important for presenting
search engine bias. Therefore, we study the content bias existing in current popular Web search
engines and propose a two-dimensional assessment scheme to compensate for the shortcomings of
indexical bias. The experimental results show that such a two-dimensional scheme can expose the
blind spots of a one-dimensional bias assessment approach and provide users with a more thorough
view of search engine bias. Statistical analyses further show that such a two-dimensional scheme
can fulfill the task of bias assessment and reveal more detailed information about search engine
bias.

6. REFERENCES
[1]  Brin, S., and Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine. In
     Proceedings of the 7th International World Wide Web Conference (Brisbane, Australia, 1998),
     ACM Press, New York, 107-117.
[2]  Chen, I.-X. and Yang, C.-Z., Evaluating Content Bias and Indexical Bias in Web Search
     Engines. In Proceedings of International Conference on Informatics, Cybernetics and Systems
     (ICICS 2003) (Kaohsiung, Taiwan, ROC, 2003), 1597-1605.
[3]  Gikandi, D., Maximizing Search Engine Positioning (April 2, 1999);
     www.webdevelopersjournal.com/articles/search_engines.html
[4]  Google Censors Itself for China. BBC News (Jan. 26, 2006);
     news.bbc.co.uk/1/hi/technology/4645596.stm.
[5]  Holsti, O.R., Content Analysis for the Social Sciences and Humanities. 1st ed.
     Addison-Wesley Publishing Co., 1969.
[6]  Introna, L. and Nissenbaum, H., Shaping the Web: Why the Politics of Search Engines Matters,
     The Information Society, 16, 3 (2000), 1-17.
[7]  iProspect Search Engine User Attitudes (April-May, 2004);
     www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf.
[8]  Jenkins, C., and Inman, D., Adaptive Automatic Classification on the Web. In Proceedings of
     the 11th International Workshop on Database and Expert Systems Applications (Greenwich,
     London, U.K., 2000), 504-511.
[9]  Moore, J.F., The Second Superpower Rears its Beautiful Head (March 31, 2003);
     cyber.law.harvard.edu/people/jmoore/secondsuperpower.html.
[10] Mowshowitz, A., and Kawaguchi, A., Assessing Bias in Search Engines. Information Processing
     & Management, 38, 1 (Jan. 2002), 141-156.
[11] Mowshowitz, A., and Kawaguchi, A., Bias on the Web. Commun. ACM, 45, 9 (Sep. 2002), 56-60.
[12] Mowshowitz, A., and Kawaguchi, A., Measuring Search Engine Bias. Information Processing &
     Management, 41, 5 (Sep. 2005), 1193-1205.
[13] Orlowski, A., Anti-war Slogan Coined, Repurposed and Googlewashed . . . in 42 Days. The
     Register (April 3, 2003); www.theregister.co.uk/content/6/30087.html.
[14] Raggett, D., Getting Started with HTML, W3C Consortium (May 24, 2005);
     www.w3.org/MarkUp/Guide/.
[15] Salton, G., Wong, A., and Yang, C. S., A Vector Space Model for Automatic Indexing. Commun.
     ACM, 18, 11 (Nov. 1975), 613-620.
[16] Silverstein, C., Henzinger, M., Marais, H., and Moricz, M., Analysis of a Very Large
     AltaVista Query Log, ACM SIGIR Forum, 33, 1 (Fall 1999), 6-12.
[17] Tyler, P.E., A New Power in the Streets. New York Times (Feb. 17, 2003);
     foi.missouri.edu/voicesdissent/newpower.html.
[18] Vaughan, L. and Thelwall, M., Search Engine Coverage Bias: Evidence and Possible Causes,
     Information Processing & Management, 40, 4 (July 2004), 693-707.
[19] Zittrain, J. and Edelman, B., Documentation of Internet Filtering in Saudi Arabia
     (Sep. 12, 2002); cyber.law.harvard.edu/filtering/saudiarabia/.
[20] Zittrain, J. and Edelman, B., Localized Google search result exclusions (Oct. 26, 2002);
     cyber.law.harvard.edu/filtering/google/.
[21] Zittrain, J. and Edelman, B., Internet Filtering in China. IEEE Internet Computing, 7, 2
     (March/April, 2003), 70-77.
[22] 50.lycos.com.
</pre>