<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling Topics on the Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aditya K. Sehgal</string-name>
          <email>sehgal@uiowa.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Padmini Srinivasan</string-name>
          <email>srinivasan@uiowa.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, The University of Iowa</institution>
          ,
          <addr-line>Iowa City, IA 52242</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Library and Information Science &amp; Department of Computer Science, The University of Iowa</institution>
          ,
          <addr-line>Iowa City, IA 52242</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <abstract>
        <p>The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific entities, such as people, and not on topics in general. They are also less flexible in how they represent topics/entities. In this paper we study existing methods as well as describe preliminary research on a different approach, based on profiles, for representing general topics. Topic profiles consist of different types of features. We compare different methods for building profiles and evaluate them in terms of their information content and ability to predict relationships between topics. Our results suggest that profiles derived from the full text present in multiple pages are the most informative and that profiles derived from multiple pages are significantly better at predicting topic relationships than profiles derived from single pages.</p>
      </abstract>
      <kwd-group>
        <kwd>text mining</kwd>
        <kwd>web mining</kwd>
        <kwd>profiles</kwd>
        <kwd>information management</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The past two decades have seen the Web evolve from a specialized, closed-group medium to a very general, ubiquitous source of information. It is generally agreed that no source of information today compares to the Web in terms of sheer size and diversity. As it stands, the Web offers billions of pages containing information on almost every area that might be of interest to someone. The scale of the information available provides tremendous motivation for the development of automatic algorithms to "mine" the Web for interesting information. As a result, web mining has become a vibrant area of research.</p>
      <p>
        A principal goal in web mining is to facilitate analysis of web data relevant to a topic of interest as well as to identify (or predict) relationships between topics. The existing literature offers many different approaches in this regard. An analysis reveals that most (e.g., [
        <xref ref-type="bibr" rid="ref1 ref13 ref2">1, 13, 2</xref>
        ]) focus on specific entities, mostly people entities. The focus is seldom on topics in general.
      </p>
      <p>We define a topic as any subject of interest to a user. Bill Clinton, A1BG Gene, Rainfall in the United States, and Cancer in Children are all examples. Observe that while the first two are also entities, the latter are not. In general one can liken any web search query to a topic. Topics may also be identified by other types of text units, such as by one or more sentences. Our interest is in web mining methods that are not constrained to particular varieties of topics. As an example, given an arbitrary group of topics, we would like to explore implicit links between them and identify new relationships.</p>
      <p>Another observation that motivates our research regards the differences between various web mining approaches along two major dimensions, as depicted in figure 1. The first dimension represents the number of pages used to represent a topic. This can range from a single page, such as a home page, to several hundred pages retrieved from a search engine. The second dimension refers to the level of data on a given page from which features descriptive of the topic are extracted. Two options are repeatedly used in the literature. The first, designated the instance level, uses text data including and surrounding individual mentions (instances) of the topic (more specifically, entity) in a page. The second option, page level, extracts features from all of the page.</p>
      <p>
        Figure 1 shows the different combinations of these two dimensions. The first quadrant (SPI) depicts approaches (e.g., [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) that use instance-level data from a single page to derive features. The second quadrant (SPP) depicts approaches (e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) that use full text from a single page. The third quadrant (MPI) depicts approaches (e.g., [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) that use instance-level data from multiple pages. Fourth quadrant (MPP) methods (e.g., [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) use full text from multiple pages to derive features. Importantly, at this point the relative merits of these combinations are unknown.
      </p>
      <p>Note that approaches based on SPI and MPI can only be applied to entities and not topics in general, as it would be challenging to find complex topics explicitly represented by specific phrases. Also, approaches based on SPP and SPI data utilize information in only a single page. Our own inclination is to explore methods from the MPP quadrant. This is because while a single page may contain information relevant to a topic, it is unlikely to contain all the relevant information. Topical information is likely scattered across multiple pages, each potentially addressing different relevant aspects. Moreover, it is quite possible for relevant sentences to appear distant from sentences that contain an instance of the topic. E.g., in our dataset a document relevant to the topic `Hurricane Andrew' has the sentence Hurricane Andrew was the most destructive United States hurricane of record, which is relevant and contains an instance of the topic. But four sentences away there is: The vast majority of the damage in Florida was due to the winds. This sentence is also relevant but does not contain an instance of the topic.</p>
      <p>
        Another significant dimension that differentiates various web mining approaches relates to how topics/entities are represented. More specifically, the difference is in the kinds of features used to represent topics. Example representations include features composed of words and named entities as in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] mailing lists subscribed to, words, and links are features representing people entities. Although our current research is restricted to weighted word stems, links and named entities, it is worth outlining the kind of representation we envisage in our research. Our long-term interest is in building extensible representations from the web that can accommodate features such as key concepts, relations and links. We have realized some of these interests in the context of topic profiles extracted from MEDLINE (www.ncbi.nlm.nih.gov/entrez/) using Manjal, our prototype biomedical text mining system [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. An example for the web is given in Figure 3 with Bill Clinton as the topic. Formally, a topic profile is a composite vector of sub-vectors, each containing features of a particular type. Each feature is weighted to reflect its relative importance to the topic.
      </p>
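The composite-vector structure described above can be sketched as follows. This is a minimal illustration; the class name, feature names, and weights are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class TopicProfile:
    """A topic profile: one weighted feature sub-vector per feature type."""
    topic: str
    # Maps a feature type ("word", "link", "entity") to {feature: weight}.
    subvectors: dict = field(default_factory=dict)

    def add(self, ftype: str, feature: str, weight: float) -> None:
        self.subvectors.setdefault(ftype, {})[feature] = weight

# Hypothetical example in the spirit of Figure 3.
p = TopicProfile("Bill Clinton")
p.add("word", "president", 0.9)
p.add("entity", "Hillary Clinton", 0.7)
p.add("link", "whitehouse.gov", 0.4)
```

Because each feature type lives in its own sub-vector, relationships can later be computed over any subset of feature types independently.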
      <p>In summary, our observations regarding the characteristics of current web mining research and our own interests in topics beyond entities motivate our research. We seek a framework for automatically representing topics of all kinds using profiles and for analyzing relationships between them. Generally, our topical perspective leads us to a view of a higher-level web, one where topics, represented by their profiles, and not pages are the fundamental unit of information. This is illustrated in figure 2. A key advantage is that relationships between topics can be analyzed independent of links between individual pages. Relationships may also be fine grained, i.e., restricted to specific subsets of feature types. We believe that our higher-level topical web view is largely consistent with the "Entity-Centric Information and Knowledge Management" emphasis of the workshop. Additionally, our topics include more abstract topics besides entities.</p>
      <p>In the next section we briefly discuss related research. Following that we outline our methods for building SPI, SPP, MPI and MPP based profiles. Three different types of features are explored: words, links and named entities. We compare profile building methods through two experiments. First we use gold standards to evaluate the content of the derived profiles (section 4.1). Second we compare these profiles in their ability to predict relationships (section 4.2). We then analyze potential sources of error (section 5) and finally discuss our results and outline future steps.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED RESEARCH</title>
      <p>The problem of representing topics/entities using web data
and predicting relationships between them is very well known
in the web mining community. As mentioned before a
number of approaches proposed in this regard can be found in
the literature.</p>
      <p>
        Most of the existing web-based research focuses on specific types of entities, mainly people entities; e.g., Ben-Dov et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Raghavan et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and Adamic and Adar [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Far less research [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is seen with general topics. As mentioned in the introduction, most approaches fall under four categories, depending upon how many pages they use to derive representations and what data they use within a page.
      </p>
      <p>
        Using a single page, such as a home page, is a standard approach. For example, Adamic and Adar [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] represent students using data extracted from their home pages. Ben-Dov et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] use an instance within a single page to represent a person entity. Few approaches [
        <xref ref-type="bibr" rid="ref13 ref3">3, 13</xref>
        ] use data from multiple pages to represent entities. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] Raghavan et al. use instance-level data extracted from multiple pages to derive entity representations. Such efforts risk missing potentially useful information outside these text windows. The example given earlier regarding the Hurricane topic illustrates this. A key disadvantage is also that such representations cannot be applied to general topics such as Rainfall in the United States: relevant pages may not contain this phrase while still dealing with the topic. Even fewer approaches use full-text data from multiple web pages. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] Newman et al. represent topics using words and named entities, and entities using associated topics. However, their choice of features is somewhat arbitrary and their representations are fixed. In contrast, we explore approaches that exploit multiple pages, are not instance-based, and are able to accommodate a variety of features.
      </p>
      <p>
        A number of approaches have been proposed to predict and analyze relationships between entities on the Web. Most rely on explicit indicators, such as shared hyperlinks [
        <xref ref-type="bibr" rid="ref4 ref7">7, 4</xref>
        ] or co-occurrence [
        <xref ref-type="bibr" rid="ref11 ref2 ref6">6, 2, 11</xref>
        ], to infer a relationship between two entities. Others [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ], while independent of this requirement, are limited by the use of instance- or page-level data to represent entities. Our topic profiles provide a framework for analyzing relationships between topics/entities based on various types of common features. Each type of feature provides a distinct thread that potentially binds two topics together. Consequently, different forms of relationships can be analyzed between a pair of topics using our profiles. Also, profile-based relationships do not depend on shared hyperlinks or co-occurrence, which allows for analysis of more implicit relationships.
      </p>
      <p>
        Interestingly, the literature reveals existing structures that are similar to ours but have been designed for different purposes or are limited in different respects. Li et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] create entity profiles limited to people, with only two types of features, viz., salient concepts, such as name and organization, and relationships to other entities (people). Glance et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] generate high-level summaries of products; features are limited to four measures generated using data from blogs and message boards. Liu et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] generate descriptions of topics consisting of subtopics and their corresponding URLs, which is analogous to topic profiles with only two types of features. Adamic and Adar [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] represent entities (students) using features, such as hyperlinks, text, and subscribed mailing lists, extracted from their home pages. Newman et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] represent topics using word and named-entity features extracted from multiple pages, and entities using topic features. They generate social networks for entities based on the similarity of their representations.
      </p>
      <p>
        While these are similar to our topic profiles, there exist substantial differences. Most representations are limited to specific types of entities, such as people [
        <xref ref-type="bibr" rid="ref1 ref8">8, 1</xref>
        ] or products [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These structures also consist of only specific kinds of features. Importantly, none of the above efforts, except Adamic's and Newman's, goes beyond creating topic/entity synopses, while we study topic representations in the context of using them for higher-level web mining applications.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. METHODS</title>
      <p>Our major goal is to compare different methods for building topic profiles. Following figure 1, we create various types of profiles differing in terms of the number of pages (Single or Multiple) and the level of data used (Instance or Page). Note that we use the same feature extraction strategy in each case.</p>
      <p>We compare profiles in two different ways via two separate experiments. First, we compare the quality of information in each type of representation; second, we compare their ability to predict relationships between topics. In the first experiment we build different types of profiles for general topics using SPI, SPP, MPI and MPP and compare them with profiles created from a known high quality source of information. Our topics are a mix of celebrities, important events and large corporations (our choice of topics is influenced by the fact that SPI and MPI profiles can only be derived for entities and not general topics; thus, in order to compare all quadrants in figure 1, all the topics we choose are in fact entities). In the second set of experiments we build the four types of profiles (SPI, SPP, MPI and MPP) for topics representing human proteins, and predict interactions between pairs of proteins based on how similar their profiles are. We evaluate the accuracy of these predictions using a high quality knowledge base of protein interactions.</p>
      <p>The profile building process consists of two steps. In the first step we retrieve relevant pages using the Google search API and extract the title text, body text, anchor text, and hyperlinks from the individual pages. While this step mostly relies on the accuracy of the search engine to retrieve relevant pages, the retrieved pages may be further filtered (as is done in the first experiment, described in detail below). Multiple-page profiles are derived from the top N filtered pages, while single-page profiles are derived from the home page. In case there is no home page, the top ranked filtered page is used. For instance-level profiles, the documents are further processed to eliminate sentences that do not contain an individual instance of the topic. We evaluated two sentence boundary detection tools, viz., Lingua EN (search.cpan.org/dist/Lingua-EN-Sentence/) and MxTerminator (www.id.cbs.dk/~dh/corpus/tools/MXTERMINATOR.htm), and use the latter as we found it to be more accurate through preliminary experimentation.</p>
      <p>The second step consists of identifying the individual features that form part of the profiles and assigning each a weight. Three types of features are explored, words, links and named entities, extracted from the title, body and anchor text. Individual words are stemmed and stop words are removed. Each feature is assigned a tf*idf weight. We describe the individual profiles in each experiment in more detail below.</p>
    </sec>
    <sec id="sec-4">
      <title>4. EXPERIMENTS &amp; RESULTS</title>
    </sec>
    <sec id="sec-5">
      <title>4.1 Experiment 1: Information Content</title>
      <p>
        In this experiment we evaluate topic profiles based on their information content. Specifically, we build SPI, SPP, MPI and MPP topic profiles and compare each with profiles for the same topics built from a known high quality source of information. We choose Wikipedia as our source of gold standard information. Wikipedia is an online repository of information on various topics in different languages. The English version of the site contains descriptions of more than a million topics (as of November 2006). A Wikipedia entry for a topic typically contains a summary, a small table listing prominent characteristics, a table of important relations, relevant external links, and references. A Wikipedia entry can be created by anyone and can also be edited by anyone. Thus, well-developed entries tend to contain the viewpoints of many people. Wikipedia is the largest collaborative journalism effort to date and is viewed as a highly regarded reference site [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It has been reported that the quality of Wikipedia articles is high, that they are referenced by many teachers (meta.wikimedia.org/wiki/Research), and that they are frequently cited by newspapers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We also use the online version of the Encyclopedia Britannica (EB) as another source of high quality information. In contrast to Wikipedia, the data in EB is controlled, being manually compiled by trained personnel.
      </p>
      <p>We compiled a list of 50 topics belonging to three categories, viz., famous people, important events of the 20th century, and large companies. We randomly selected 16 people from the Forbes celebrity list for 2006 and 21 companies from the Fortune 500 list for the same year. The `event' topics were randomly compiled from a list of 30 major 20th century events (history1900s.about.com/cs/majorevents/). For each topic, we downloaded its Wikipedia page and derived profiles from word, link, and named-entity features extracted from the text. EB contained entries for only 26 of the 50 topics. We processed these pages in the same way.</p>
      <p>For each topic, we manually identified relevant synonyms and generated boolean (OR) web search queries. These were used to retrieve the top 100 web pages for each topic using the Google Search API. The retrieved sets were filtered to exclude pages from approximately 600 web sites known to mirror Wikipedia content (en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks), including Wikipedia itself. We then created the four different types of profiles and computed the cosine similarity between each profile and the corresponding Wikipedia and EB profiles.</p>
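The comparison uses standard IR cosine similarity over sparse weighted feature vectors, which can be sketched as:

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse weighted vectors ({feature: weight})."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    # Vectors with no features in common (or empty vectors) score 0.
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

A score of 1 means the profile is proportional to the gold-standard profile; 0 means the two share no weighted features.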
      <p>[Figure: average similarity scores for MPP, MPI, SPP, and SPI profiles and their link-inclusive variants (MPP-L, MPI-L, SPP-L, SPI-L); y-axis: average similarity.]</p>
      <sec id="sec-5-1">
        <title>Results</title>
        <p>The figure above shows the average similarity (with 95% confidence intervals) of profiles gauged against Wikipedia profiles. Two types of profiles are created for each approach: those derived from word features extracted from the title, body, and anchor text (MPP, MPI, SPP, SPI), and those additionally containing link features (MPP-L, MPI-L, SPP-L, SPI-L). In the link-inclusive profiles (*-L), text information was assumed to be more important and thus assigned a higher weight (0.9) than link information (0.1).</p>
        <p>First we see that profiles derived from multiple pages are significantly better (statistically, at the 0.05 level) than single-page profiles. For MPP the difference between both variations is also significant, with the non-link version being better. But the same is not the case for the other three types of profiles; for these, link features are not useful. Assigning a lower weight to links did not change this observation. Consequently, in subsequent experiments, we exclude link information from profiles. Interestingly, the average similarity for MPP profiles is quite high (0.88). A key observation to note is that MPP profiles are significantly better than MPI profiles.</p>
        <p>Figure 5 compares MPP and MPI profiles derived from varying numbers of relevant pages, with Wikipedia as the gauge. Interestingly, the difference between the two types of profiles decreases as the number of pages increases. At each point, however, MPP is statistically better than MPI at the 0.05 level. We note that the average score for MPP profiles stabilizes at 10 pages and for MPI profiles at 15 pages.</p>
        <sec id="sec-5-1-18">
          <title>Further analysis</title>
          <p>[Figures: average similarity scores gauged against Wikipedia (WK) and Encyclopedia Britannica (EB) for MPP, MPI, SPP, and SPI profiles; y-axis: average similarity.] Most EB entries are quite small in comparison to corresponding Wikipedia entries. We believe that the average scores for different profiles, particularly those derived from multiple pages (MPP and MPI), are significantly affected by the smaller size of EB entries. To confirm this hypothesis, we segregated the topics into two groups: those with EB entries less than 5 KB (14 topics) and those with EB entries greater than or equal to 5 KB (12 topics). We then repeated our analysis for these two groups.</p>
          <p>Figures 7 and 8 show that average scores are generally higher for topics with larger EB entries. The confidence intervals are considerably smaller. Note that the Wikipedia results are also shown for the two topic subsets. The difference between MPPEB and MPIEB profiles is also significant at the 0.05 level for topics with larger EB entries, while this is not the case for topics with smaller EB entries. But in both topic subsets the average scores with Wikipedia are much higher than with EB. This may be because, in general, Wikipedia entries have more text than corresponding EB entries. We also see that profiles derived from multiple pages are always significantly better than profiles derived from single pages.</p>
          <p>Figure 9 shows the average scores for MPP and MPI profiles derived from varying numbers of relevant pages, with EB as the gold standard. As was the case in figure 5, the difference between their average similarities decreases as the number of pages increases. However, in this case the decrease is much more rapid. These differences are significant for profiles derived from 20 or fewer relevant pages. We also note that the average similarity score stabilizes at 10 pages for both MPP and MPI.</p>
          <p>Figure 10 offers the same analysis but for topics with EB entries greater than 5 KB. Here again we see the difference between the average scores decreasing with increasing number of pages, but not as rapidly as in figure 9. Here MPP and MPI profiles are significantly different for 30 or fewer pages. Also, the average score stabilizes at 15 pages for both types of profiles.</p>
          <p>We now take a different approach and extract named-entity features. Figure 11 shows the average similarity scores for profiles with such features. Three types of named entities, Person, Location, and Organization, were extracted from retrieved web pages. We tried two named-entity recognition tools, viz., Stanford-NER (nlp.stanford.edu/software/CRF-NER.shtml) and Lingpipe, and found the former to be more accurate through preliminary experimentation. Hence we use the Stanford-NER system. We see that while MPP profiles have the highest average similarity, the difference between them and MPI profiles is not statistically significant. However, profiles derived from multiple pages continue to be significantly better than single-page profiles. We note a considerable drop in average similarity scores compared with word-based profiles (e.g., from 0.88 to 0.6 for MPP).</p>
          <p>In the second set of experiments we evaluate the different types of profiles based on their ability to predict known relationships between proteins. We randomly compiled a list of 82 proteins from the Database of Interacting Proteins (DIP, dip.doe-mbi.ucla.edu). According to DIP, 90 pairs of proteins within this set are known to interact with each other. This forms our positive gold standard set of interactions. The remaining pairs are considered as the negative set of interactions. For each protein, we identified synonyms from the SwissProt protein database (expasy.org/sprot/).</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2 Experiment 2: Predicting Relationships</title>
        <p>tein database10 and excluded those that are English words
(and thus cause for ambiguity). We then created web search
queries from these and for each query downloaded the top
100 pages using the Google API. For each protein topic,
we build the various types of pro les (MPP, MPI, SPP, and
SPI) and compute similarities between pairs of pro les. Two
types of features were extracted from documents, word
features and named-entities, from the title, body and anchor
text. As before each feature is assigned a tf*idf weight.</p>
        <p>We predict relationships between proteins based on the similarity of their profiles. We use the standard IR cosine similarity score to measure similarity between profiles. Two proteins are predicted to be related if their similarity score is above a certain threshold. Based on the predictions made, we measure precision, recall and f-score. In our experiments we use the f-score (with equal weight to precision and recall) as the primary measure to evaluate different types of profiles.</p>
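A minimal sketch of this threshold-based prediction and its evaluation (the pair scores below are hypothetical, for illustration only):

```python
def evaluate(scores, gold_positive, threshold):
    """Predict 'related' for pairs whose similarity meets the threshold,
    then score the predictions against the gold standard.

    scores: {pair: similarity}; gold_positive: set of truly related pairs.
    Returns (precision, recall, f_score) with equal weight to P and R.
    """
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    tp = len(predicted & gold_positive)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_positive) if gold_positive else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical usage: three scored protein pairs, one known interaction.
scores = {("P1", "P2"): 0.9, ("P1", "P3"): 0.2, ("P2", "P3"): 0.7}
gold = {("P1", "P2")}
p, r, f = evaluate(scores, gold, 0.5)
```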
        <p>We adopt a 5-fold cross-validation approach. The 82 proteins we selected yielded 3403 unique pairs (taking symmetry into account, so that only P1-&gt;P2 is considered and P2-&gt;P1 is not). These were randomly split into five sets of equal size, each with the same ratio of positives and negatives. We consider the union of four of the splits (folds in standard machine learning parlance) as our training set and the remaining split as our test set. We choose a threshold that optimizes the training f-score and then apply this threshold to the test set and compute the test f-score. This process is repeated five times, each time with a different combination of splits considered as the training data. Different profile building methods are compared based on the average of their five test f-scores.</p>
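The cross-validation loop can be sketched as follows. This is a simplified version: candidate thresholds are taken from the observed training scores, and unlike the paper's splits this one is not stratified by the positive/negative ratio:

```python
def fscore(pairs, labels, scores, t):
    """F-score of threshold t over the given pairs (equal-weight P and R)."""
    tp = sum(1 for p in pairs if scores[p] >= t and labels[p])
    pred = sum(1 for p in pairs if scores[p] >= t)
    pos = sum(1 for p in pairs if labels[p])
    prec = tp / pred if pred else 0.0
    rec = tp / pos if pos else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def cross_validate(pairs, labels, scores, k=5):
    """k-fold CV: tune the threshold on the training folds (best training
    f-score), then measure the f-score on the held-out fold; return the
    average test f-score across folds."""
    folds = [pairs[i::k] for i in range(k)]
    test_fscores = []
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        # Candidate thresholds: the similarity scores seen in training.
        candidates = sorted({scores[p] for p in train})
        best_t = max(candidates, key=lambda t: fscore(train, labels, scores, t))
        test_fscores.append(fscore(folds[i], labels, scores, best_t))
    return sum(test_fscores) / k
```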
        <p>For our first experiment we created profiles consisting of word features found in the title, body and anchor text of pages. As before, we consider four types of profiles (MPP, MPI, SPP, and SPI). Table 1 shows the average training and testing precision, recall and f-scores for each type of profile across all five iterations. We see that the average test f-scores for MPP, MPI and SPI (approximately 0.25) are very close, and thus they are similar in their ability to predict relationships. However, all three are significantly (at the 0.05 level) better than SPP profiles. Also, the training f-scores are in general similar to the testing f-scores, suggesting that the thresholds computed using the training data were general enough to apply to new (test) data.</p>
        <p>Figure 12 shows the average f-scores for MPP and MPI profiles when the number of relevant pages used to build the profiles is varied. Here again profiles consist of word features extracted from the title, body and anchor text of relevant pages. We see that varying the number of pages has little effect on the average f-scores for both types of profiles, and in general we see no significant difference in performance between the two. We also note that the average f-score for both stabilizes at 5 documents.</p>
        <p>We next created profiles that contained more specific features, viz., named entities. Named entities were extracted from relevant pages using the Lingpipe (www.alias-i.com/lingpipe/) named entity extraction package. This package comes with a model pretrained on the GENIA corpus (www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), which is what we used for this experiment. Figure 13 shows the average f-scores for the four different types of profiles. Here again, we see no significant difference between MPP and MPI profiles. However, MPP profiles are significantly better than single-page profiles. Comparing named-entity profiles with word profiles, we see that the average MPP, MPI, and SPP f-scores are higher for the former than the corresponding f-scores for the latter (shown in table 1), while the reverse is true for SPI.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Average training recall and average testing f-score, with 95% lower confidence interval (LCI), for word-feature profiles.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Type</th><th>Training Recall</th><th>Testing F-score</th><th>LCI (95%)</th></tr>
            </thead>
            <tbody>
              <tr><td>MPP</td><td>0.3528</td><td>0.2529</td><td>0.1826</td></tr>
              <tr><td>MPI</td><td>0.3222</td><td>0.2479</td><td>0.2061</td></tr>
              <tr><td>SPP</td><td>0.2702</td><td>0.1357</td><td>0.0963</td></tr>
              <tr><td>SPI</td><td>0.3222</td><td>0.2501</td><td>0.1885</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>POTENTIAL SOURCES OF ERRORS</title>
      <p>Our approach relies upon several underlying technologies,
such as document retrieval, sentence detection, and
named-entity recognition. Each of these is a potential source of
error. First, we rely on the accuracy of search engines to
retrieve relevant documents for topics, from which profiles
are derived. While modern search engines are fairly
accurate, they are still vulnerable to problems such as ambiguity.
Despite our attempts, some ambiguity remained. E.g.,
`NEMO' is a synonym for a protein and is also the name of
a popular cartoon character.</p>
      <p>In this paper we have created profiles from sentences in
relevant documents containing individual instances of the
topics. Sentence boundary detection in web pages is a hard
problem, and many off-the-shelf tools (we use
MxTerminator) are not optimized for web pages. Due to the tag
structure of web pages, many incoherent sentences are identified.</p>
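The failure mode described above can be reproduced with a deliberately naive splitter. This stdlib-only sketch is not MxTerminator: it strips markup and splits on sentence-final punctuation, so text from adjacent block elements (a heading followed by a paragraph, list items, table cells) fuses into a single incoherent "sentence".

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content of an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def sentences(html):
    """Strip markup, then split on sentence-final punctuation.

    Naive on purpose: block-level tags become plain whitespace, so a
    heading and the paragraph after it can fuse into one "sentence" --
    the kind of incoherent output discussed above.
    """
    parser = TextExtractor()
    parser.feed(html)
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

For example, `sentences("<h1>Results</h1><p>All profiles were similar.</p>")` yields the single fused string "Results All profiles were similar.", because the heading carries no sentence-final punctuation of its own.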
      <p>In this research we have also evaluated profiles
containing named-entity features extracted from relevant pages.
Named-entity recognition is a challenging problem, and
performance primarily depends on the data used to train the model.
Our use of models pretrained on newswire data in the first
experiment and GENIA data in the second resulted in some
errors, because there are significant differences
between general web text and newswire and GENIA text.
E.g., EBI (European Bioinformatics Institute) was
recognized as a protein by LingPipe.</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION</title>
      <p>Our experiments have yielded interesting results. In
Section 4.1 our results clearly show that profiles derived from
full text in multiple pages are more informative than
single-page profiles. They are also better than profiles derived from
multiple pages but restricted to instance-level text windows.
This is when Wikipedia is used as the gold standard. It is
also true with EB as the gold standard for topics whose
entries are at least 5 KB in size. These results support
our intuition that relevant information tends to be scattered
over multiple pages and is not necessarily instance bound.</p>
      <p>Our results also suggest that adding the link information
present in relevant pages as features does not improve the
information content of a profile. In fact, assigning a higher
weight to links had a detrimental effect. However, this could
be attributed to the relatively small number of links present
in Wikipedia pages, which would bring down the average
similarity score.</p>
      <p>The difference between MPP and MPI profiles was much
less prominent when they contained specific features, such as
named entities. The average similarity of such profiles w.r.t.
Wikipedia was also notably lower than for corresponding
profiles containing word features.</p>
      <p>Also, we found that the difference between MPP and MPI
was greatest when few relevant pages were present. As the
number of relevant pages increased, the difference between
the two shrank. This suggests that when few relevant pages
are available, as is quite often the case, it would be better to
use the full text to create representations. A key observation
is that in general 10 to 15 pages are sufficient for MPP,
our best strategy, to achieve its highest similarity with the
gold standard.</p>
      <p>As an aside, our experiments also revealed interesting
differences between the two gold-standard sources of
information we considered, Wikipedia and EB. In general we
found the former to be more comprehensive for the topics we
selected, while the latter contained comprehensive
information only for a few popular topics, such as WWI and WWII,
and cursory information for the rest.</p>
      <p>While significant differences between MPP and MPI
profiles were detected in the first set of experiments, this was
not the case with the second set of DIP experiments testing
prediction ability. In general, irrespective of the type of
features contained in the profile, the performance of MPP and
MPI profiles in predicting known DIP relationships was very
similar. However, MPP profiles have the important
advantage that they can model general topics, while MPI profiles
can only be built for entities. Thus it would be preferable to
use the former. Again, protein profiles derived from multiple
pages were always significantly better than profiles derived
from single pages, confirming our intuition that the latter
did not contain enough information for a good
representation. We also found that varying the number of relevant
pages did not have any effect on the prediction ability of
MPP and MPI profiles.</p>
      <p>As a final point, our DIP-based experiments also let us
test the idea of implementing a bioinformatics web-mining
application. Although the results are moderate, they
offer a reasonable starting point. In future work we will also
compare these results with similar experiments run against
MEDLINE. Given the vastness of the Web, it is fast
becoming a data and knowledge source for domain-specific
applications.</p>
      <p>In conclusion, this research contributes to our long-term
goal, which is to be able to represent arbitrary topics on the
web with topic profiles consisting of weighted features of
different types. We see value in pursuing a higher-level topical
web where the object (node) of interest is the topic and the
link represents inter-topic relationships. Such a web has the
potential to more effectively support individual information
needs as well as web-mining applications seeking to discover
novel connections between topics.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENT</title>
      <p>This material is based upon work supported by the
National Science Foundation under Grant No. 0312356 awarded
to Padmini Srinivasan. Any opinions, findings, and
conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the
views of the National Science Foundation.</p>
    </sec>
  </body>
  <back>
  </back>
</article>