Profiling Topics on the Web

Aditya K. Sehgal
Department of Computer Science
The University of Iowa
Iowa City, IA 52242, USA
aditya-sehgal@uiowa.edu

Padmini Srinivasan
School of Library and Information Science & Department of Computer Science
The University of Iowa
Iowa City, IA 52242, USA
padmini-srinivasan@uiowa.edu

Copyright is held by the author/owner(s). WWW2007, May 8–12, 2007, Banff, Canada.

ABSTRACT
The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and to identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific entities, such as people, and not on topics in general. They are also less flexible in how they represent topics/entities. In this paper we study existing methods and describe preliminary research on a different approach, based on profiles, for representing general topics. Topic profiles consist of different types of features. We compare different methods for building profiles and evaluate them in terms of their information content and their ability to predict relationships between topics. Our results suggest that profiles derived from the full text of multiple pages are the most informative, and that profiles derived from multiple pages are significantly better at predicting topic relationships than profiles derived from single pages.

Keywords
text mining, web mining, profiles, information management

1. INTRODUCTION
The past two decades have seen the Web evolve from a specialized, closed-group medium to a very general, ubiquitous source of information. It is generally agreed that no source of information today compares to the Web in terms of sheer size and diversity. As it stands, the Web offers billions of pages containing information on almost every area that might be of interest to someone. The scale of the information available provides tremendous motivation for the development of automatic algorithms to "mine" the Web for interesting information. As a result, web mining has become a vibrant area of research.

A principal goal in web mining is to facilitate analysis of web data relevant to a topic of interest as well as to identify (or predict) relationships between topics. The existing literature offers many different approaches in this regard. An analysis reveals that most (e.g., [1, 13, 2]) focus on specific entities, mostly people entities. The focus is seldom on topics in general.

We define a topic as any subject of interest to a user. Bill Clinton, A1BG Gene, Rainfall in the United States, and Cancer in Children are all examples. Observe that while the first two are also entities, the latter two are not. In general, one can liken any web search query to a topic. Topics may also be identified by other types of text units, such as one or more sentences. Our interest is in web mining methods that are not constrained to particular varieties of topics. As an example, given an arbitrary group of topics, we would like to explore implicit links between them and identify new relationships.
Another observation that motivates our research concerns the differences between various web mining approaches along two major dimensions, as depicted in Figure 1. The first dimension is the number of pages used to represent a topic. This can range from a single page, such as a home page, to several hundred pages retrieved from a search engine. The second dimension is the level of data on a given page from which features descriptive of the topic are extracted. Two options are repeatedly used in the literature. The first, designated the instance level, uses text data including and surrounding individual mentions (instances) of the topic (more specifically, the entity) in a page. The second option, the page level, extracts features from the entire page.

Figure 1: Different combinations of the number of pages and the level of data used to generate representations. The number of documents varies along the horizontal axis and the level of data along the vertical axis.

Figure 1 shows the different combinations of these two dimensions. The first quadrant (SPI) depicts approaches (e.g., [2]) that use instance-level data from a single page to derive features. The second quadrant (SPP) depicts approaches (e.g., [1]) that use full text from a single page. The third quadrant (MPI) depicts approaches (e.g., [13]) that use instance-level data from multiple pages. Fourth-quadrant (MPP) methods (e.g., [12]) use full text from multiple pages to derive features. Importantly, at this point the relative merits of these combinations are unknown.

Note that approaches based on SPI and MPI data can only be applied to entities and not to topics in general, as it would be challenging to find complex topics explicitly represented by specific phrases. Also, approaches based on SPP and SPI data utilize information from only a single page. Our own inclination is to explore methods from the MPP quadrant. This is because while a single page may contain information relevant to a topic, it is unlikely to contain all the relevant information. Topical information is likely scattered across multiple pages, each potentially addressing different relevant aspects. Moreover, it is quite possible for relevant sentences to appear distant from sentences that contain an instance of the topic. For example, in our dataset a document relevant to the topic 'Hurricane Andrew' has the sentence "Hurricane Andrew was the most destructive United States hurricane of record", which is relevant and contains an instance of the topic. But four sentences away there is: "The vast majority of damage in Florida was due to the winds." This sentence is also relevant but does not contain an instance of the topic.

Figure 2: Topic-based web network vs. page-based web network. In the upper image topics are the unit of analysis and links are directly between topics. In the lower image pages are the basic units and links are explicit hyperlinks.

Another significant dimension that differentiates various web mining approaches is how topics/entities are represented, more specifically, the kinds of features used to represent them. Example representations include features composed of words and named entities, as in [12]. In [1], subscribed mailing lists, words and links are the features representing people entities. Although our current research is restricted to weighted word stems, links and named entities, it is worth outlining the kind of representation we envisage. Our long-term interest is in building extensible representations from the web that can accommodate features such as key concepts, relations and links. We have realized some of these interests in the context of topic profiles extracted from MEDLINE (www.ncbi.nlm.nih.gov/entrez/) using Manjal, our prototype biomedical text mining system [14]. An example for the web is given in Figure 3, with Bill Clinton as the topic. Formally, a topic profile is a composite vector of sub-vectors, each containing features of a particular type. Each feature is weighted to reflect its relative importance to the topic.

Figure 3: Example profile for the topic 'Bill Clinton' with 3 types of features.
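To make the profile structure just defined concrete, the sketch below shows one way such a composite vector could be represented, with one sparse, weighted sub-vector per feature type and a similarity that can be restricted to a single type. This is a minimal illustration only; the class, the function names and the example feature weights are ours, not the paper's.

```python
# Minimal sketch of a topic profile: one sparse, weighted sub-vector per
# feature type. The example weights below are illustrative, not from the paper.
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict

@dataclass
class TopicProfile:
    topic: str
    subvectors: Dict[str, Dict[str, float]] = field(default_factory=dict)

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def typed_similarity(p: TopicProfile, q: TopicProfile, ftype: str) -> float:
    """Similarity between two topics restricted to one feature type."""
    return cosine(p.subvectors.get(ftype, {}), q.subvectors.get(ftype, {}))

clinton = TopicProfile("Bill Clinton", {
    "words": {"presid": 0.42, "arkansa": 0.17, "impeach": 0.08},
    "links": {"http://www.whitehouse.gov": 0.10},
    "named_entities": {"Hillary Clinton": 0.31, "Little Rock": 0.12},
})
# typed_similarity(clinton, other_profile, "named_entities")
```

Keeping each feature type in its own sub-vector is what permits the fine-grained, per-type relationship analysis discussed below.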
In summary, our observations regarding the characteristics of current web mining research and our own interest in topics beyond entities motivate this work. We seek a framework for automatically representing topics of all kinds using profiles and for analyzing relationships between them. Generally, our topical perspective leads us to a view of a higher-level web, one where topics, represented by their profiles, and not pages are the fundamental unit of information. This is illustrated in Figure 2. A key advantage is that relationships between topics can be analyzed independently of links between individual pages. Relationships may also be fine grained, i.e., restricted to specific subsets of feature types. We believe that our higher-level topical web view is largely consistent with the "Entity-Centric Information and Knowledge Management" emphasis of the workshop. Additionally, our topics include more abstract topics besides entities.

In the next section we briefly discuss related research. Following that we outline our methods for building SPI, SPP, MPI and MPP based profiles. Three different types of features are explored: words, links and named entities. We compare profile building methods through two experiments. First we use gold standards to evaluate the content of the derived profiles (Section 4.1). Second we compare these profiles in their ability to predict relationships (Section 4.2). We then analyze potential sources of error (Section 5) and finally discuss our results and outline future steps.

2. RELATED RESEARCH
The problem of representing topics/entities using web data and predicting relationships between them is well known in the web mining community. As mentioned before, a number of approaches proposed in this regard can be found in the literature.

Most of the existing web-based research focuses on specific types of entities, mainly people entities, e.g., Ben-Dov et al. [2], Raghavan et al. [13], and Adamic and Adar [1]. Far less research [12] is seen with general topics. As mentioned in the introduction, most approaches fall under four categories, depending upon how many pages they use to derive representations and what data they use within a page.

Using a single page, such as a home page, is a standard approach. For example, Adamic and Adar [1] represent students using data extracted from their home pages. Ben-Dov et al. [2] use an instance within a single page to represent a person entity. Few approaches [3, 13] use data from multiple pages to represent entities. In [13] Raghavan et al. use instance-level data extracted from multiple pages to derive entity representations. Such efforts risk missing potentially useful information outside these text windows; the example given earlier regarding the Hurricane topic illustrates this.
A further disadvantage is that such representations cannot be applied to general topics such as Rainfall in the United States: relevant pages may not contain this phrase while still dealing with the topic. Even fewer approaches use full-text data from multiple web pages. In [12] Newman et al. represent topics using words and named entities, and entities using associated topics. However, their choice of features is somewhat arbitrary and their representations are fixed. In contrast, we explore approaches that exploit multiple pages, are not instance-based, and can accommodate a variety of features.

A number of approaches have been proposed to predict and analyze relationships between entities on the Web. Most rely on explicit indicators, such as shared hyperlinks [7, 4] or co-occurrence [6, 2, 11], to infer a relationship between two entities. Others [1, 13], while independent of this requirement, are limited by the use of instance- or page-level data to represent entities. Our topic profiles provide a framework for analyzing relationships between topics/entities based on various types of common features. Each type of feature provides a distinct thread that potentially binds two topics together. Consequently, different forms of relationships can be analyzed between a pair of topics using our profiles. Also, profile-based relationships do not depend on shared hyperlinks or co-occurrence, which allows for the analysis of more implicit relationships.

Interestingly, the literature reveals existing structures that are similar to ours but have been designed for different purposes or are limited in other ways. Li et al. [8] create entity profiles limited to people, with only two types of features, viz., salient concepts, such as name and organization, and relationships to other entities (people). Glance et al. [5] generate high-level summaries of products; features are limited to four measures generated from blog and message board data. Liu et al. [10] generate descriptions of topics consisting of subtopics and their corresponding URLs, which is analogous to topic profiles with only two types of features. Adamic and Adar [1] represent entities (students) using features, such as hyperlinks, text, and subscribed mailing lists, extracted from their home pages. Newman et al. [12] represent topics using word and named-entity features extracted from multiple pages, and entities using topic features; they generate social networks for entities based on the similarity of their representations.

While these are similar to our topic profiles, there are substantial differences. Most representations are limited to specific types of entities, such as people [8, 1] or products [5]. These structures also consist of only specific kinds of features. Importantly, all the above efforts except Adamic's and Newman's do not go beyond creating topic/entity synopses, while we study topic representations in the context of using them for higher-level web mining applications.

3. METHODS
Our major goal is to compare different methods for building topic profiles. Following Figure 1, we create various types of profiles differing in terms of the number of pages (Single or Multiple) and the level of data used (Instance or Page). Note that we use the same feature extraction strategy in each case.

We compare profiles in two different ways via two separate experiments. First, we compare the quality of information in each type of representation; second, we compare their ability to predict relationships between topics. In the first experiment we build different types of profiles for general topics using the SPI, SPP, MPI and MPP approaches and compare them with profiles created from a known high-quality source of information. Our topics are a mix of celebrities, important events and large corporations. (Our choice of topics is influenced by the fact that SPI and MPI profiles can only be derived for entities and not general topics; thus, in order to compare all quadrants in Figure 1, all the topics we chose are in fact entities.) In the second set of experiments we build the four types of profiles (SPI, SPP, MPI and MPP) for topics representing human proteins, and predict interactions between pairs of proteins based on how similar their profiles are. We evaluate the accuracy of these predictions using a high-quality knowledge base of protein interactions.

The profile building process consists of two steps, the first of which is sketched below. In the first step we retrieve relevant pages using the Google search API and extract the title text, body text, anchor text, and hyperlinks from the individual pages. While this step mostly relies on the accuracy of the search engine to retrieve relevant pages, the retrieved pages may be further filtered (as is done in the first experiment, described in detail below). Multiple-page profiles are derived from the top N filtered pages while single-page profiles are derived from the home page; if there is no home page, the top-ranked filtered page is used. For instance-level profiles, the documents are further processed to eliminate sentences that do not contain an instance of the topic. We evaluated two sentence boundary detection tools, viz., Lingua EN (http://search.cpan.org/dist/Lingua-EN-Sentence/) and MxTerminator (www.id.cbs.dk/~dh/corpus/tools/MXTERMINATOR.htm), and use the latter as we found it to be more accurate in preliminary experiments.
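The sketch below illustrates this first step under simplifying assumptions: the `requests` and BeautifulSoup libraries stand in for the page download and field extraction, a naive regular-expression splitter stands in for MxTerminator, and the search call that produces the URLs is omitted. None of these choices are the authors'; they only show the shape of the pipeline.

```python
# Sketch of step 1: download candidate pages, pull out the fields used for
# features and, for instance-level (SPI/MPI) profiles, keep only sentences
# that mention the topic. The libraries and the regex splitter are stand-ins
# for the Google search API and MxTerminator used in the paper.
import re
from typing import Dict, List

import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> Dict[str, object]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.get_text(" ", strip=True) if soup.title else "",
        "body": soup.get_text(" ", strip=True),
        "anchor": " ".join(a.get_text(" ", strip=True) for a in soup.find_all("a")),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

def split_sentences(text: str) -> List[str]:
    """Very rough sentence splitting; MxTerminator plays this role in the paper."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def instance_sentences(text: str, synonyms: List[str]) -> List[str]:
    """Keep only sentences containing an instance of the topic (any synonym)."""
    pats = [re.compile(re.escape(s), re.IGNORECASE) for s in synonyms]
    return [s for s in split_sentences(text) if any(p.search(s) for p in pats)]

# pages = [fetch_page(u) for u in top_urls]   # top_urls from a search API
# mpi_text = " ".join(s for p in pages
#                     for s in instance_sentences(p["body"], ["Hurricane Andrew"]))
```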
The second step consists of identifying the individual features that form part of the profiles and assigning each a weight. Three types of features are explored: words, links and named entities, extracted from the title, body and anchor text. Individual words are stemmed and stop words are removed. Each feature is assigned a tf*idf weight. We describe the individual profiles used in each experiment in more detail below.
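As a rough illustration of this weighting step, the sketch below computes one common tf*idf variant over already tokenized (stemmed, stop-word-filtered) pages. The specific tf*idf formula is an assumption on our part; the paper only states that features receive tf*idf weights.

```python
# Sketch of step 2: assign tf*idf weights to the extracted features.
# The log-scaled variant below is assumed; the paper does not specify one.
import math
from collections import Counter
from typing import Dict, List

def tfidf_profile(topic_docs: List[List[str]],
                  collection_docs: List[List[str]]) -> Dict[str, float]:
    """topic_docs: tokenized pages retrieved for one topic.
    collection_docs: the larger collection used to estimate document frequency."""
    n = len(collection_docs)
    df = Counter()
    for doc in collection_docs:
        df.update(set(doc))
    tf = Counter()
    for doc in topic_docs:
        tf.update(doc)
    return {
        term: (1.0 + math.log(count)) * math.log(n / (1 + df[term]))
        for term, count in tf.items()
    }

# word_subvector = tfidf_profile(pages_for_topic, all_retrieved_pages)
```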
4. EXPERIMENTS & RESULTS

4.1 Experiment 1: Information Content
In this experiment we evaluate topic profiles based on their information content. Specifically, we build SPI, SPP, MPI and MPP topic profiles and compare each with profiles for the same topics built from a known high-quality source of information. We choose Wikipedia as our source of gold standard information. Wikipedia is an online repository of information on various topics in different languages. The English version of the site contains descriptions of more than a million topics (as of November 2006). A Wikipedia entry for a topic typically contains a summary, a small table listing prominent characteristics, a table of important relations, relevant external links, and references. A Wikipedia entry can be created by anyone and can also be edited by anyone; thus, well-developed entries tend to contain the viewpoints of many people. Wikipedia is the largest collaborative journalism effort to date and is viewed as a highly regarded reference site [9]. It has been reported that the quality of Wikipedia articles is high; they are referenced by many teachers (http://meta.wikimedia.org/wiki/Research) and are also frequently cited by newspapers [15]. We also use the online version of the Encyclopedia Britannica (EB) as another source of high-quality information. In contrast to Wikipedia, the data in EB is controlled, being manually compiled by trained personnel.

We compiled a list of 50 topics belonging to three categories, viz., famous people, important events of the 20th century, and large companies. We randomly selected 16 people from the Forbes celebrity list for 2006 and 21 companies from the Fortune 500 list for the same year. The 'event' topics were randomly compiled from a list of 30 major 20th-century events (http://history1900s.about.com/cs/majorevents/). For each topic, we downloaded its Wikipedia page and derived profiles from word, link, and named-entity features extracted from the text. EB contained entries for only 26 of the 50 topics; we processed these pages in the same way.

For each topic, we manually identified relevant synonyms and generated boolean (OR) web search queries. These were used to retrieve the top 100 web pages for each topic using the Google Search API. The retrieved sets were filtered to exclude pages from approximately 600 web sites known to mirror Wikipedia content (http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks), including Wikipedia itself. We then created the four different types of profiles and computed the cosine similarity between each profile and the corresponding Wikipedia and EB profiles.
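A small sketch of how this retrieval set could be assembled for one topic is given below: an OR query is built from the manually chosen synonyms, and results hosted on known Wikipedia mirrors are dropped. The `search_top_pages` call is a hypothetical stand-in for the Google Search API, and the mirror list shown is truncated.

```python
# Sketch of the Experiment 1 retrieval setup: boolean OR query from synonyms,
# then removal of pages hosted on Wikipedia mirror sites (and Wikipedia itself).
# search_top_pages() is a hypothetical stand-in for the Google Search API.
from typing import Iterable, List, Set
from urllib.parse import urlparse

def build_or_query(synonyms: Iterable[str]) -> str:
    # ['Bill Clinton', 'William Jefferson Clinton'] ->
    #   '"Bill Clinton" OR "William Jefferson Clinton"'
    return " OR ".join(f'"{s}"' for s in synonyms)

def filter_mirrors(urls: Iterable[str], mirror_hosts: Set[str]) -> List[str]:
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if not any(host == m or host.endswith("." + m) for m in mirror_hosts):
            kept.append(url)
    return kept

# query = build_or_query(["Hurricane Andrew"])
# urls = search_top_pages(query, n=100)           # hypothetical search API call
# urls = filter_mirrors(urls, {"wikipedia.org"})  # plus ~600 known mirror hosts
```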
Figure 4 shows the average similarity scores (along with 95% confidence intervals) of profiles gauged against the Wikipedia profiles. Two types of profiles are created for each approach: those derived from word features extracted from the title, body, and anchor text (MPP, MPI, SPP, SPI), and those additionally containing link features (MPP-L, MPI-L, SPP-L, SPI-L). In the link-inclusive profiles (*-L), text information was assumed to be more important and was thus assigned a higher weight (0.9) than link information (0.1).

Figure 4: Performance of two variations of profiles against Wikipedia. The *-L profiles contain link features in addition to the word features in the other types of profiles.

First, we see that profiles derived from multiple pages are significantly better (statistically, at the 0.05 level) than single-page profiles. For MPP the difference between the two variations is also significant, with the non-link version being better; the same is not the case for the other three types of profiles. For these, link features are not useful, and assigning a lower weight to links did not change this observation. Consequently, in subsequent experiments we exclude link information from profiles. Interestingly, the average similarity for MPP profiles is quite high (0.88). A key observation is that MPP profiles are significantly better than MPI profiles.

Figure 5 compares MPP and MPI profiles derived from varying numbers of relevant pages, with Wikipedia as the gauge. Interestingly, the difference between the two types of profiles decreases as the number of pages increases. At each point, however, MPP is statistically better than MPI at the 0.05 level. We note that the average score stabilizes at 10 pages for MPP profiles and at 15 pages for MPI profiles.

Figure 5: Performance of profiles by varying numbers of documents, against Wikipedia.

Figure 6 compares the different profiles with the corresponding Wikipedia (WK subscript) and EB (EB subscript) profiles. We see that average scores are much higher when Wikipedia is the gauge than when EB is, and the variation in scores is also lower. Hence we have greater confidence in our Wikipedia-based observations. We find no significant difference between MPP_EB and MPI_EB, but the two are again significantly better than the single-page profiles.

Figure 6: Performance of profiles against Wikipedia and EB.

This figure also reveals interesting (and significant) differences between using Wikipedia and EB as gold standards. For example, there is a consistent difference in the length of the confidence bars. In an effort to understand why this was so, we closely analyzed the EB entries, which revealed a significant variation in the amount of text they contain. Some entries are very long (e.g., over 50 pages for WWII) while others are very small (e.g., only a small paragraph for Bank of America). Also, most EB entries are quite small in comparison to the corresponding Wikipedia entries. We believe that the average scores for the different profiles, particularly those derived from multiple pages (MPP and MPI), are significantly affected by the smaller size of EB entries. To confirm this hypothesis, we segregated the topics into two groups: those with EB entries less than 5 KB in size (14 topics) and those with EB entries greater than or equal to 5 KB (12 topics). We then repeated our analysis for these two groups.

Figure 7: Performance of profiles for topics with EB entries less than 5 KB in size.

Figure 8: Performance of profiles for topics with EB entries greater than 5 KB in size.

Figures 7 and 8 show that average scores are generally higher for topics with larger EB entries, and the confidence intervals are considerably smaller. Note that the Wikipedia results are also shown for the two topic subsets. The difference between MPP_EB and MPI_EB profiles is also significant at the 0.05 level for topics with larger EB entries, while this is not the case for topics with smaller EB entries. But in both topic subsets the average scores with Wikipedia are much higher than with EB. This may be because, in general, Wikipedia entries have more text than the corresponding EB entries. We also see that profiles derived from multiple pages are always significantly better than profiles derived from single pages.
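The paper reports significance at the 0.05 level but does not state which test was used. Purely as an illustration of how such a comparison could be run, the sketch below applies a paired t-test (scipy's `ttest_rel`) to per-topic similarity scores for two profile types; this choice of test is our assumption, not the authors'.

```python
# Illustration only: testing whether one profile type scores higher than
# another across the same topics. The paired t-test is an assumed choice;
# the paper does not name the significance test it used.
from scipy.stats import ttest_rel

def compare_profile_types(scores_a, scores_b, alpha=0.05):
    """scores_a, scores_b: per-topic similarities, aligned by topic."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return {
        "mean_a": sum(scores_a) / len(scores_a),
        "mean_b": sum(scores_b) / len(scores_b),
        "p_value": p_value,
        "significant_at_alpha": p_value < alpha,
    }

# compare_profile_types(mpp_similarities, mpi_similarities)
```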
Figure 9 shows the average scores for MPP and MPI profiles derived from varying numbers of relevant pages, with EB as the gold standard. As was the case in Figure 5, the difference between their average similarities decreases as the number of pages increases; however, in this case the decrease is much more rapid. These differences are significant for profiles derived from 20 or fewer relevant pages. We also note that the average similarity score stabilizes at 10 pages for both MPP and MPI.

Figure 9: Performance of profiles by varying numbers of documents, against EB.

Figure 10 offers the same analysis but for topics with EB entries greater than 5 KB. Here again we see the difference between the average scores decreasing with an increasing number of pages, but not as rapidly as in Figure 9. Here MPP and MPI profiles are significantly different for 30 or fewer pages, and the average score stabilizes at 15 pages for both types of profiles.

Figure 10: Performance of profiles for topics with EB entries greater than 5 KB, by varying numbers of documents, against EB.

We now take a different approach and extract named-entity features. Figure 11 shows the average similarity scores for profiles with such features. Three types of named entities, Person, Location and Organization, were extracted from the retrieved web pages. We tried two named-entity recognition tools, viz., Stanford-NER (http://nlp.stanford.edu/software/CRF-NER.shtml) and Lingpipe, and found the former to be more accurate in preliminary experiments; hence we use the Stanford-NER system. We see that while MPP profiles have the highest average similarity, the difference between them and MPI profiles is not statistically significant. However, profiles derived from multiple pages continue to be significantly better than single-page profiles. We note a considerable drop in average similarity scores compared with word-based profiles (e.g., from 0.88 to 0.6 for MPP).

Figure 11: Performance of profiles with named-entity features against Wikipedia.

4.2 Experiment 2: Prediction Ability
In the second set of experiments we evaluate the different types of profiles based on their ability to predict known relationships between proteins. We randomly compiled a list of 82 proteins from the Database of Interacting Proteins (DIP, http://dip.doe-mbi.ucla.edu/). According to DIP, 90 pairs of proteins within this set are known to interact with each other; this forms our positive gold standard set of interactions. The remaining pairs are considered the negative set of interactions. For each protein, we identified synonyms from the SwissProt protein database (http://expasy.org/sprot/) and excluded those that are English words (and thus a cause of ambiguity). We then created web search queries from these and for each query downloaded the top 100 pages using the Google API. For each protein topic, we build the various types of profiles (MPP, MPI, SPP, and SPI) and compute similarities between pairs of profiles. Two types of features were extracted from the documents, word features and named entities, from the title, body and anchor text. As before, each feature is assigned a tf*idf weight.

We predict relationships between proteins based on the similarity of their profiles, using the standard IR cosine similarity score to measure similarity between profiles. Two proteins are predicted to be related if their similarity score is above a certain threshold. Based on the predictions made we measure precision, recall and f-score. In our experiments we use the f-score (weighting precision and recall equally) as the primary measure to evaluate the different types of profiles.

We adopt a 5-fold cross-validation approach. The 82 proteins we selected yielded 3403 unique pairs (taking symmetry into account, so that only P1->P2 is considered and P2->P1 is not). These were randomly split into five sets of equal size, each with the same ratio of positives and negatives. We consider the union of four of the splits (folds, in standard machine learning parlance) as our training set and the remaining split as our test set. We choose a threshold that optimizes the training f-score, then apply this threshold to the test set and compute the test f-score. This process is repeated five times, each time with a different combination of splits used as the training data. Different profile building methods are compared based on the average of their five test f-scores.
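The sketch below shows this protocol in miniature for a single train/test split: the threshold that maximizes the training f-score is selected and then applied to the held-out pairs. The exhaustive threshold grid and the helper names are simplifications of the procedure described above, not the authors' code.

```python
# Sketch of the prediction protocol for one train/test split: pick the
# similarity threshold with the best training f-score, then score the test
# pairs with it. The threshold grid is an illustrative simplification.
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

def f_score(pred: Set[Pair], gold: Set[Pair]) -> float:
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def predict(sim: Dict[Pair, float], pairs: List[Pair], threshold: float) -> Set[Pair]:
    """A protein pair is predicted to interact if its profile similarity
    reaches the threshold."""
    return {p for p in pairs if sim[p] >= threshold}

def tune_and_test(sim: Dict[Pair, float], train: List[Pair], test: List[Pair],
                  gold: Set[Pair]) -> float:
    gold_train, gold_test = gold & set(train), gold & set(test)
    candidates = sorted({sim[p] for p in train})
    best_t = max(candidates,
                 key=lambda t: f_score(predict(sim, train, t), gold_train))
    return f_score(predict(sim, test, best_t), gold_test)

# Averaging tune_and_test() over the five splits gives the cross-validated
# test f-score used to compare the profile types.
```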
For our first set of runs we created profiles consisting of word features found in the title, body and anchor text of pages. As before, we consider four types of profiles (MPP, MPI, SPP, and SPI). Table 1 shows the average training and testing precision, recall and f-scores for each type of profile across all five iterations. We see that the average test f-scores for MPP, MPI and SPI (approximately 0.25) are very close, and thus these profiles are similar in their ability to predict relationships. However, all three are significantly better (at the 0.05 level) than SPP profiles. Also, the training f-scores are in general similar to the testing f-scores, suggesting that the thresholds computed using the training data were general enough to apply to new (test) data.

Table 1: Average training and test precision, recall and f-scores for the four types of profiles, averaged over the five folds. LCI and UCI denote the lower and upper 95% confidence intervals of the test f-score, respectively.

Type | Train Precision | Train Recall | Train Fscore | Test Precision | Test Recall | Test Fscore | LCI (95%) | UCI (95%)
MPP  | 0.2205 | 0.3528 | 0.2704 | 0.2046 | 0.3333 | 0.2529 | 0.1826 | 0.3232
MPI  | 0.2188 | 0.3222 | 0.2604 | 0.2115 | 0.3111 | 0.2479 | 0.2061 | 0.2897
SPP  | 0.1183 | 0.2702 | 0.1628 | 0.1011 | 0.2303 | 0.1357 | 0.0963 | 0.1751
SPI  | 0.2190 | 0.3222 | 0.2607 | 0.2101 | 0.3111 | 0.2501 | 0.1885 | 0.3117

Figure 12 shows the average f-scores for MPP and MPI profiles when the number of relevant pages used to build the profiles is varied. Here again, profiles consist of word features extracted from the title, body and anchor text of relevant pages. We see that varying the number of pages has little effect on the average f-scores for either type of profile, and in general we see no significant difference in performance between the two. We also note that the average f-score for both stabilizes at 5 documents.

Figure 12: Average prediction f-scores for profiles by varying numbers of pages.

We next created profiles containing more specific features, viz., named entities. Named entities were extracted from relevant pages using the Lingpipe (www.alias-i.com/lingpipe/) named entity extraction package. This package comes with a model pre-trained on the GENIA corpus (www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), which is what we used for this experiment. Figure 13 shows the average f-scores for the four different types of profiles. Here again, we see no significant difference between MPP and MPI profiles; however, MPP profiles are significantly better than single-page profiles. Comparing named-entity profiles with word profiles, we see that the average MPP, MPI, and SPP f-scores are higher for the former than the corresponding f-scores for the latter (shown in Table 1), while the reverse is true for SPI.

Figure 13: Average prediction f-scores for profiles with named-entity features.
5. POTENTIAL SOURCES OF ERRORS
Our approach relies upon several underlying technologies, such as document retrieval, sentence detection, and named-entity recognition. Each of these is a potential source of error. First, we rely on the accuracy of search engines to retrieve relevant documents for topics, from which profiles are derived. While modern search engines are fairly accurate, they are still vulnerable to problems such as ambiguity. Despite our attempts, some ambiguity remained; e.g., 'NEMO' is a synonym for a protein and is also the name of a popular cartoon character.

In this paper we have created profiles from sentences in relevant documents containing individual instances of the topics. Sentence boundary detection in web pages is a hard problem, and many off-the-shelf tools (we use MxTerminator) are not optimized for web pages. Due to the tag structure of web pages, many incoherent sentences are identified.

In this research we have also evaluated profiles containing named-entity features extracted from relevant pages. Named-entity recognition is a challenging problem and depends primarily on the data used to train the model. Our use of models pre-trained on newswire data in the first experiment and on GENIA data in the second resulted in some errors, because there are significant differences between general web text and newswire or GENIA text. For example, EBI (European Bioinformatics Institute) was recognized as a protein by Lingpipe.

6. DISCUSSION
Our experiments have yielded interesting results. In Section 4.1 our results clearly show that profiles derived from full text in multiple pages are more informative than single-page profiles. They are also better than profiles derived from multiple pages but restricted to instance-level text windows. This holds when Wikipedia is used as the gold standard; it is also true with EB as the gold standard for topics whose entries are at least 5 KB in size. These results support our intuition that relevant information tends to be scattered over multiple pages and is not necessarily instance bound.

Our results also suggest that adding the link information present in relevant pages as features does not improve the information content of a profile. In fact, assigning a higher weight to links had a detrimental effect. However, this could be attributed to the relatively small number of links present in Wikipedia pages, which would bring down the average similarity score.

The difference between MPP and MPI profiles was much less prominent when they contained more specific features, such as named entities. The average similarity of such profiles with respect to Wikipedia was also notably lower than for the corresponding profiles containing word features.

Also, we found that the difference between MPP and MPI was greatest when few relevant pages were used. As the number of relevant pages increased, the difference between the two shrank. This suggests that when few relevant pages are available, as is quite often the case, it is better to use the full text to create representations. A key observation is that, in general, 10 to 15 pages are sufficient for MPP, our best strategy, to achieve its highest similarity with the gold standard.
As an aside, our experiments also revealed interesting differences between the two gold standard sources of information we considered, Wikipedia and EB. In general we found the former to be more comprehensive for the topics we selected, while the latter contained comprehensive information only for a few popular topics, such as WWI and WWII, and cursory information for the rest.

While significant differences between MPP and MPI profiles were detected in the first set of experiments, this was not the case with the second set of DIP experiments testing prediction ability. In general, irrespective of the type of features contained in the profile, the performance of MPP and MPI profiles in predicting known DIP relationships was very similar. However, MPP profiles have the important advantage that they can model general topics, while MPI profiles can only be built for entities; thus it would be preferable to use the former. Again, protein profiles derived from multiple pages were always significantly better than profiles derived from single pages, confirming our intuition that the latter do not contain enough information for a good representation. We also found that varying the number of relevant pages did not have any effect on the prediction ability of MPP and MPI profiles.

As a final point, our DIP-based experiments also let us test the idea of implementing a bioinformatics web mining application. Although the results are moderate, they offer a reasonable starting point. In future work we will also compare these results with similar experiments run against MEDLINE. Given the vastness of the Web, it is fast becoming a data and knowledge source for domain-specific applications.

In conclusion, this research contributes to our long-term goal, which is to represent arbitrary topics on the web with topic profiles consisting of weighted features of different types. We see value in pursuing a higher-level topical web where the object (node) of interest is the topic and the links represent inter-topic relationships. Such a web has the potential to more effectively support individual information needs as well as web mining applications seeking to discover novel connections between topics.
7. ACKNOWLEDGMENT
This material is based upon work supported by the National Science Foundation under Grant No. 0312356 awarded to Padmini Srinivasan. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES
[1] L. Adamic and E. Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2001.
[2] M. Ben-Dov, W. Wu, P. Cairns, and R. Feldman. Improving knowledge discovery by combining text-mining and link analysis techniques. In Proceedings of the SIAM International Conference on Data Mining, 2004.
[3] J. Conrad and M. Utt. A system for discovering relationships by feature extraction from text databases. In Proceedings of the 17th Annual International ACM-SIGIR Conference (Special Issue of the SIGIR Forum), pages 260–270, 1994.
[4] G. Flake, S. Lawrence, and C. Giles. Efficient identification of web communities. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160, 2000.
[5] N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo. Deriving marketing intelligence from online discussion. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), pages 419–428, 2005.
[6] H. Kautz, B. Selman, and M. Shah. Referral Web: combining social networks and collaborative filtering. Communications of the ACM, 40(3):63–65, 1997.
[7] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In Proceedings of the 8th International World Wide Web Conference (WWW-1999), pages 403–415, 1999.
[8] W. Li, R. Srihari, C. Niu, and X. Li. Entity profile extraction from large corpora. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING-2003), 2003.
[9] A. Lih. Wikipedia as participatory journalism: reliable sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, 2004.
[10] B. Liu, C. Chin, and H. Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of the 12th International World Wide Web Conference (WWW-2003), pages 251–260, 2003.
[11] Y. Matsuo, J. Mori, M. Hamasaki, K. Ishida, T. Nishimura, H. Takeda, K. Hasida, and M. Ishizuka. Polyphonet: an advanced social network extraction system from the web. In Proceedings of the 15th International World Wide Web Conference (WWW-2006), pages 397–406, 2006.
[12] D. Newman, C. Chemudugunta, P. Smyth, and M. Steyvers. Analyzing entities and topics in news articles using statistical topic models. In IEEE International Conference on Intelligence and Security Informatics, pages 93–104, 2006.
[13] H. Raghavan, J. Allan, and A. McCallum. An exploration of entity models, collective classification and relation description. In Proceedings of the KDD Workshop on Link Analysis and Group Detection, 2004.
[14] A. Sehgal and P. Srinivasan. Manjal - a text mining system for MEDLINE (demonstration). In Proceedings of the 28th Annual International ACM SIGIR Conference, page 680, 2005.
[15] J. Voss. Measuring Wikipedia. In Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics, 2005.