Profiling Topics on the Web

Aditya K. Sehgal
Department of Computer Science
The University of Iowa
Iowa City, IA 52242, USA
aditya-sehgal@uiowa.edu

Padmini Srinivasan
School of Library and Information Science & Department of Computer Science
The University of Iowa
Iowa City, IA 52242, USA
padmini-srinivasan@uiowa.edu

Copyright is held by the author/owner(s). WWW2007, May 8–12, 2007, Banff, Canada.

ABSTRACT
The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and to identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific entities, such as people, and not on topics in general. They are also less flexible in how they represent topics/entities. In this paper we study existing methods and describe preliminary research on a different approach, based on profiles, for representing general topics. Topic profiles consist of different types of features. We compare different methods for building profiles and evaluate them in terms of their information content and their ability to predict relationships between topics. Our results suggest that profiles derived from the full text of multiple pages are the most informative, and that profiles derived from multiple pages are significantly better at predicting topic relationships than profiles derived from single pages.

Keywords
text mining, web mining, profiles, information management

1. INTRODUCTION
The past two decades have seen the Web evolve from a specialized, closed-group medium to a very general, ubiquitous source of information. It is generally agreed that no source of information today compares to the Web in terms of sheer size and diversity. As it stands, the Web offers billions of pages containing information on almost every area that might be of interest to someone. The scale of the information available provides tremendous motivation for the development of automatic algorithms to "mine" the Web for interesting information. As a result, web mining has become a vibrant area of research.

A principal goal in web mining is to facilitate analysis of web data relevant to a topic of interest as well as to identify (or predict) relationships between topics. The existing literature offers many different approaches in this regard. An analysis reveals that most (e.g., [1, 13, 2]) focus on specific entities, mostly people entities. The focus is seldom on topics in general.

We define a topic as any subject of interest to a user. Bill Clinton, A1BG Gene, Rainfall in the United States, and Cancer in Children are all examples. Observe that while the first two are also entities, the latter two are not. In general, one can liken any web search query to a topic. Topics may also be identified by other types of text units, such as one or more sentences. Our interest is in web mining methods that are not constrained to particular varieties of topics. As an example, given an arbitrary group of topics, we would like to explore implicit links between them and identify new relationships.
Another observation that motivates our research concerns the differences between various web mining approaches along two major dimensions, as depicted in Figure 1. The first dimension is the number of pages used to represent a topic. This can range from a single page, such as a home page, to several hundred pages retrieved from a search engine. The second dimension is the level of data on a given page from which features descriptive of the topic are extracted. Two options are repeatedly used in the literature. The first, designated the instance level, uses text data including and surrounding individual mentions (instances) of the topic (more specifically, the entity) in a page. The second option, the page level, extracts features from the entire page.

Figure 1: Different combinations of the number of pages and the level of data used to generate representations. The number of documents varies along the horizontal axis and the level of data along the vertical axis.

Figure 1 shows the different combinations of these two dimensions. The first quadrant (SPI) depicts approaches (e.g., [2]) that use instance-level data from a single page to derive features. The second quadrant (SPP) depicts approaches (e.g., [1]) that use full text from a single page. The third quadrant (MPI) depicts approaches (e.g., [13]) that use instance-level data from multiple pages. Fourth-quadrant (MPP) methods (e.g., [12]) use full text from multiple pages to derive features. Importantly, at this point the relative merits of these combinations are unknown.

Note that approaches based on SPI and MPI data can only be applied to entities and not to topics in general, as it would be challenging to find complex topics explicitly represented by specific phrases. Also, approaches based on SPP and SPI data utilize information from only a single page. Our own inclination is to explore methods from the MPP quadrant. This is because while a single page may contain information relevant to a topic, it is unlikely to contain all the relevant information. Topical information is likely scattered across multiple pages, each potentially addressing different relevant aspects. Moreover, it is quite possible for relevant sentences to appear distant from sentences that contain an instance of the topic. For example, in our dataset a document relevant to the topic 'Hurricane Andrew' has the sentence "Hurricane Andrew was the most destructive United States hurricane of record", which is relevant and contains an instance of the topic. But four sentences away there is: "The vast majority of damage in Florida was due to the winds." This sentence is also relevant but does not contain an instance of the topic.

Figure 2: Topic-based web network vs. page-based web network. In the upper image topics are the unit of analysis and links are directly between topics. In the lower image pages are the basic units and links are explicit hyperlinks.

Another significant dimension that differentiates various web mining approaches is how topics/entities are represented, more specifically, the kinds of features used to represent them. Example representations include features composed of words and named entities, as in [12]. In [1], subscribed mailing lists, words and links are the features representing people entities. Although our current research is restricted to weighted word stems, links and named entities, it is worth outlining the kind of representation we envisage. Our long-term interest is in building extensible representations from the web that can accommodate features such as key concepts, relations and links. We have realized some of these interests in the context of topic profiles extracted from MEDLINE (www.ncbi.nlm.nih.gov/entrez/) using Manjal, our prototype biomedical text mining system [14]. An example for the web is given in Figure 3, with Bill Clinton as the topic. Formally, a topic profile is a composite vector of sub-vectors, each containing features of a particular type. Each feature is weighted to reflect its relative importance to the topic.

Figure 3: Example profile for the topic 'Bill Clinton' with 3 types of features.
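To make the profile structure just defined concrete, the sketch below shows one way such a composite vector could be represented, with one sparse, weighted sub-vector per feature type and a similarity that can be restricted to a single type. This is a minimal illustration only; the class, the function names and the example feature weights are ours, not the paper's.

```python
# Minimal sketch of a topic profile: one sparse, weighted sub-vector per
# feature type. The example weights below are illustrative, not from the paper.
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict

@dataclass
class TopicProfile:
    topic: str
    subvectors: Dict[str, Dict[str, float]] = field(default_factory=dict)

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def typed_similarity(p: TopicProfile, q: TopicProfile, ftype: str) -> float:
    """Similarity between two topics restricted to one feature type."""
    return cosine(p.subvectors.get(ftype, {}), q.subvectors.get(ftype, {}))

clinton = TopicProfile("Bill Clinton", {
    "words": {"presid": 0.42, "arkansa": 0.17, "impeach": 0.08},
    "links": {"http://www.whitehouse.gov": 0.10},
    "named_entities": {"Hillary Clinton": 0.31, "Little Rock": 0.12},
})
# typed_similarity(clinton, other_profile, "named_entities")
```

Keeping each feature type in its own sub-vector is what permits the fine-grained, per-type relationship analysis discussed below.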
In summary, our observations regarding the characteristics of current web mining research and our own interest in topics beyond entities motivate this work. We seek a framework for automatically representing topics of all kinds using profiles and for analyzing relationships between them. Generally, our topical perspective leads us to a view of a higher-level web, one where topics, represented by their profiles, and not pages are the fundamental unit of information. This is illustrated in Figure 2. A key advantage is that relationships between topics can be analyzed independently of links between individual pages. Relationships may also be fine grained, i.e., restricted to specific subsets of feature types. We believe that our higher-level topical web view is largely consistent with the "Entity-Centric Information and Knowledge Management" emphasis of the workshop. Additionally, our topics include more abstract topics besides entities.

In the next section we briefly discuss related research. Following that we outline our methods for building SPI, SPP, MPI and MPP based profiles. Three different types of features are explored: words, links and named entities. We compare profile building methods through two experiments. First we use gold standards to evaluate the content of the derived profiles (Section 4.1). Second we compare these profiles in their ability to predict relationships (Section 4.2). We then analyze potential sources of error (Section 5) and finally discuss our results and outline future steps.

2. RELATED RESEARCH
The problem of representing topics/entities using web data and predicting relationships between them is well known in the web mining community. As mentioned before, a number of approaches proposed in this regard can be found in the literature.

Most of the existing web-based research focuses on specific types of entities, mainly people entities, e.g., Ben-Dov et al. [2], Raghavan et al. [13], and Adamic and Adar [1]. Far less research [12] is seen with general topics. As mentioned in the introduction, most approaches fall under four categories, depending upon how many pages they use to derive representations and what data they use within a page.

Using a single page, such as a home page, is a standard approach. For example, Adamic and Adar [1] represent students using data extracted from their home pages. Ben-Dov et al. [2] use an instance within a single page to represent a person entity. Few approaches [3, 13] use data from multiple pages to represent entities. In [13] Raghavan et al. use instance-level data extracted from multiple pages to derive entity representations. Such efforts risk missing potentially useful information outside these text windows; the example given earlier regarding the Hurricane topic illustrates this.
A further disadvantage is that such representations cannot be applied to general topics such as Rainfall in the United States: relevant pages may not contain this phrase while still dealing with the topic. Even fewer approaches use full-text data from multiple web pages. In [12] Newman et al. represent topics using words and named entities, and entities using associated topics. However, their choice of features is somewhat arbitrary and their representations are fixed. In contrast, we explore approaches that exploit multiple pages, are not instance-based, and can accommodate a variety of features.

A number of approaches have been proposed to predict and analyze relationships between entities on the Web. Most rely on explicit indicators, such as shared hyperlinks [7, 4] or co-occurrence [6, 2, 11], to infer a relationship between two entities. Others [1, 13], while independent of this requirement, are limited by the use of instance- or page-level data to represent entities. Our topic profiles provide a framework for analyzing relationships between topics/entities based on various types of common features. Each type of feature provides a distinct thread that potentially binds two topics together. Consequently, different forms of relationships can be analyzed between a pair of topics using our profiles. Also, profile-based relationships do not depend on shared hyperlinks or co-occurrence, which allows for the analysis of more implicit relationships.

Interestingly, the literature reveals existing structures that are similar to ours but have been designed for different purposes or are limited in other ways. Li et al. [8] create entity profiles limited to people, with only two types of features, viz., salient concepts, such as name and organization, and relationships to other entities (people). Glance et al. [5] generate high-level summaries of products; features are limited to four measures generated from blog and message board data. Liu et al. [10] generate descriptions of topics consisting of subtopics and their corresponding URLs, which is analogous to topic profiles with only two types of features. Adamic and Adar [1] represent entities (students) using features, such as hyperlinks, text, and subscribed mailing lists, extracted from their home pages. Newman et al. [12] represent topics using word and named-entity features extracted from multiple pages, and entities using topic features; they generate social networks for entities based on the similarity of their representations.

While these are similar to our topic profiles, there are substantial differences. Most representations are limited to specific types of entities, such as people [8, 1] or products [5]. These structures also consist of only specific kinds of features. Importantly, all the above efforts except Adamic's and Newman's do not go beyond creating topic/entity synopses, while we study topic representations in the context of using them for higher-level web mining applications.

3. METHODS
Our major goal is to compare different methods for building topic profiles. Following Figure 1, we create various types of profiles differing in terms of the number of pages (Single or Multiple) and the level of data used (Instance or Page). Note that we use the same feature extraction strategy in each case.

We compare profiles in two different ways via two separate experiments. First, we compare the quality of information in each type of representation; second, we compare their ability to predict relationships between topics. In the first experiment we build different types of profiles for general topics using the SPI, SPP, MPI and MPP approaches and compare them with profiles created from a known high-quality source of information. Our topics are a mix of celebrities, important events and large corporations. (Our choice of topics is influenced by the fact that SPI and MPI profiles can only be derived for entities and not general topics; thus, in order to compare all quadrants in Figure 1, all the topics we chose are in fact entities.) In the second set of experiments we build the four types of profiles (SPI, SPP, MPI and MPP) for topics representing human proteins, and predict interactions between pairs of proteins based on how similar their profiles are. We evaluate the accuracy of these predictions using a high-quality knowledge base of protein interactions.

The profile building process consists of two steps, the first of which is sketched below. In the first step we retrieve relevant pages using the Google search API and extract the title text, body text, anchor text, and hyperlinks from the individual pages. While this step mostly relies on the accuracy of the search engine to retrieve relevant pages, the retrieved pages may be further filtered (as is done in the first experiment, described in detail below). Multiple-page profiles are derived from the top N filtered pages while single-page profiles are derived from the home page; if there is no home page, the top-ranked filtered page is used. For instance-level profiles, the documents are further processed to eliminate sentences that do not contain an instance of the topic. We evaluated two sentence boundary detection tools, viz., Lingua EN (http://search.cpan.org/dist/Lingua-EN-Sentence/) and MxTerminator (www.id.cbs.dk/~dh/corpus/tools/MXTERMINATOR.htm), and use the latter as we found it to be more accurate in preliminary experiments.
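The sketch below illustrates this first step under simplifying assumptions: the `requests` and BeautifulSoup libraries stand in for the page download and field extraction, a naive regular-expression splitter stands in for MxTerminator, and the search call that produces the URLs is omitted. None of these choices are the authors'; they only show the shape of the pipeline.

```python
# Sketch of step 1: download candidate pages, pull out the fields used for
# features and, for instance-level (SPI/MPI) profiles, keep only sentences
# that mention the topic. The libraries and the regex splitter are stand-ins
# for the Google search API and MxTerminator used in the paper.
import re
from typing import Dict, List

import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> Dict[str, object]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.get_text(" ", strip=True) if soup.title else "",
        "body": soup.get_text(" ", strip=True),
        "anchor": " ".join(a.get_text(" ", strip=True) for a in soup.find_all("a")),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

def split_sentences(text: str) -> List[str]:
    """Very rough sentence splitting; MxTerminator plays this role in the paper."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def instance_sentences(text: str, synonyms: List[str]) -> List[str]:
    """Keep only sentences containing an instance of the topic (any synonym)."""
    pats = [re.compile(re.escape(s), re.IGNORECASE) for s in synonyms]
    return [s for s in split_sentences(text) if any(p.search(s) for p in pats)]

# pages = [fetch_page(u) for u in top_urls]   # top_urls from a search API
# mpi_text = " ".join(s for p in pages
#                     for s in instance_sentences(p["body"], ["Hurricane Andrew"]))
```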
The second step consists of identifying the individual features that form part of the profiles and assigning each a weight. Three types of features are explored: words, links and named entities, extracted from the title, body and anchor text. Individual words are stemmed and stop words are removed. Each feature is assigned a tf*idf weight. We describe the individual profiles used in each experiment in more detail below.
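As a rough illustration of this weighting step, the sketch below computes one common tf*idf variant over already tokenized (stemmed, stop-word-filtered) pages. The specific tf*idf formula is an assumption on our part; the paper only states that features receive tf*idf weights.

```python
# Sketch of step 2: assign tf*idf weights to the extracted features.
# The log-scaled variant below is assumed; the paper does not specify one.
import math
from collections import Counter
from typing import Dict, List

def tfidf_profile(topic_docs: List[List[str]],
                  collection_docs: List[List[str]]) -> Dict[str, float]:
    """topic_docs: tokenized pages retrieved for one topic.
    collection_docs: the larger collection used to estimate document frequency."""
    n = len(collection_docs)
    df = Counter()
    for doc in collection_docs:
        df.update(set(doc))
    tf = Counter()
    for doc in topic_docs:
        tf.update(doc)
    return {
        term: (1.0 + math.log(count)) * math.log(n / (1 + df[term]))
        for term, count in tf.items()
    }

# word_subvector = tfidf_profile(pages_for_topic, all_retrieved_pages)
```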
4. EXPERIMENTS & RESULTS

4.1 Experiment 1: Information Content
In this experiment we evaluate topic profiles based on their information content. Specifically, we build SPI, SPP, MPI and MPP topic profiles and compare each with profiles for the same topics built from a known high-quality source of information. We choose Wikipedia as our source of gold standard information. Wikipedia is an online repository of information on various topics in different languages. The English version of the site contains descriptions of more than a million topics (as of November 2006). A Wikipedia entry for a topic typically contains a summary, a small table listing prominent characteristics, a table of important relations, relevant external links, and references. A Wikipedia entry can be created by anyone and can also be edited by anyone; thus, well-developed entries tend to contain the viewpoints of many people. Wikipedia is the largest collaborative journalism effort to date and is viewed as a highly regarded reference site [9]. It has been reported that the quality of Wikipedia articles is high; they are referenced by many teachers (http://meta.wikimedia.org/wiki/Research) and are also frequently cited by newspapers [15]. We also use the online version of the Encyclopedia Britannica (EB) as another source of high-quality information. In contrast to Wikipedia, the data in EB is controlled, being manually compiled by trained personnel.

We compiled a list of 50 topics belonging to three categories, viz., famous people, important events of the 20th century, and large companies. We randomly selected 16 people from the Forbes celebrity list for 2006 and 21 companies from the Fortune 500 list for the same year. The 'event' topics were randomly compiled from a list of 30 major 20th-century events (http://history1900s.about.com/cs/majorevents/). For each topic, we downloaded its Wikipedia page and derived profiles from word, link, and named-entity features extracted from the text. EB contained entries for only 26 of the 50 topics; we processed these pages in the same way.

For each topic, we manually identified relevant synonyms and generated boolean (OR) web search queries. These were used to retrieve the top 100 web pages for each topic using the Google Search API. The retrieved sets were filtered to exclude pages from approximately 600 web sites known to mirror Wikipedia content (http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks), including Wikipedia itself. We then created the four different types of profiles and computed the cosine similarity between each profile and the corresponding Wikipedia and EB profiles.
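A small sketch of how this retrieval set could be assembled for one topic is given below: an OR query is built from the manually chosen synonyms, and results hosted on known Wikipedia mirrors are dropped. The `search_top_pages` call is a hypothetical stand-in for the Google Search API, and the mirror list shown is truncated.

```python
# Sketch of the Experiment 1 retrieval setup: boolean OR query from synonyms,
# then removal of pages hosted on Wikipedia mirror sites (and Wikipedia itself).
# search_top_pages() is a hypothetical stand-in for the Google Search API.
from typing import Iterable, List, Set
from urllib.parse import urlparse

def build_or_query(synonyms: Iterable[str]) -> str:
    # ['Bill Clinton', 'William Jefferson Clinton'] ->
    #   '"Bill Clinton" OR "William Jefferson Clinton"'
    return " OR ".join(f'"{s}"' for s in synonyms)

def filter_mirrors(urls: Iterable[str], mirror_hosts: Set[str]) -> List[str]:
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if not any(host == m or host.endswith("." + m) for m in mirror_hosts):
            kept.append(url)
    return kept

# query = build_or_query(["Hurricane Andrew"])
# urls = search_top_pages(query, n=100)           # hypothetical search API call
# urls = filter_mirrors(urls, {"wikipedia.org"})  # plus ~600 known mirror hosts
```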
Figure 4 shows the average similarity scores (along with 95% confidence intervals) of profiles gauged against the Wikipedia profiles. Two types of profiles are created for each approach: those derived from word features extracted from the title, body, and anchor text (MPP, MPI, SPP, SPI), and those additionally containing link features (MPP-L, MPI-L, SPP-L, SPI-L). In the link-inclusive profiles (*-L), text information was assumed to be more important and was thus assigned a higher weight (0.9) than link information (0.1).

Figure 4: Performance of two variations of profiles against Wikipedia. The *-L profiles contain link features in addition to the word features in the other types of profiles.

First, we see that profiles derived from multiple pages are significantly better (statistically, at the 0.05 level) than single-page profiles. For MPP the difference between the two variations is also significant, with the non-link version being better; the same is not the case for the other three types of profiles. For these, link features are not useful, and assigning a lower weight to links did not change this observation. Consequently, in subsequent experiments we exclude link information from profiles. Interestingly, the average similarity for MPP profiles is quite high (0.88). A key observation is that MPP profiles are significantly better than MPI profiles.

Figure 5 compares MPP and MPI profiles derived from varying numbers of relevant pages, with Wikipedia as the gauge. Interestingly, the difference between the two types of profiles decreases as the number of pages increases. At each point, however, MPP is statistically better than MPI at the 0.05 level. We note that the average score stabilizes at 10 pages for MPP profiles and at 15 pages for MPI profiles.

Figure 5: Performance of profiles by varying numbers of documents, against Wikipedia.

Figure 6 compares the different profiles with the corresponding Wikipedia (WK subscript) and EB (EB subscript) profiles. We see that average scores are much higher when Wikipedia is the gauge than when EB is, and the variation in scores is also lower. Hence we have greater confidence in our Wikipedia-based observations. We find no significant difference between MPP_EB and MPI_EB, but the two are again significantly better than the single-page profiles.

Figure 6: Performance of profiles against Wikipedia and EB.

This figure also reveals interesting (and significant) differences between using Wikipedia and EB as gold standards. For example, there is a consistent difference in the length of the confidence bars. In an effort to understand why this was so, we closely analyzed the EB entries, which revealed a significant variation in the amount of text they contain. Some entries are very long (e.g., over 50 pages for WWII) while others are very small (e.g., only a small paragraph for Bank of America). Also, most EB entries are quite small in comparison to the corresponding Wikipedia entries. We believe that the average scores for the different profiles, particularly those derived from multiple pages (MPP and MPI), are significantly affected by the smaller size of EB entries. To confirm this hypothesis, we segregated the topics into two groups: those with EB entries less than 5 KB in size (14 topics) and those with EB entries greater than or equal to 5 KB (12 topics). We then repeated our analysis for these two groups.

Figure 7: Performance of profiles for topics with EB entries less than 5 KB in size.

Figure 8: Performance of profiles for topics with EB entries greater than 5 KB in size.

Figures 7 and 8 show that average scores are generally higher for topics with larger EB entries, and the confidence intervals are considerably smaller. Note that the Wikipedia results are also shown for the two topic subsets. The difference between MPP_EB and MPI_EB profiles is also significant at the 0.05 level for topics with larger EB entries, while this is not the case for topics with smaller EB entries. But in both topic subsets the average scores with Wikipedia are much higher than with EB. This may be because, in general, Wikipedia entries have more text than the corresponding EB entries. We also see that profiles derived from multiple pages are always significantly better than profiles derived from single pages.
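The paper reports significance at the 0.05 level but does not state which test was used. Purely as an illustration of how such a comparison could be run, the sketch below applies a paired t-test (scipy's `ttest_rel`) to per-topic similarity scores for two profile types; this choice of test is our assumption, not the authors'.

```python
# Illustration only: testing whether one profile type scores higher than
# another across the same topics. The paired t-test is an assumed choice;
# the paper does not name the significance test it used.
from scipy.stats import ttest_rel

def compare_profile_types(scores_a, scores_b, alpha=0.05):
    """scores_a, scores_b: per-topic similarities, aligned by topic."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return {
        "mean_a": sum(scores_a) / len(scores_a),
        "mean_b": sum(scores_b) / len(scores_b),
        "p_value": p_value,
        "significant_at_alpha": p_value < alpha,
    }

# compare_profile_types(mpp_similarities, mpi_similarities)
```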
Figure 9 shows the average scores for MPP and MPI profiles derived from varying numbers of relevant pages, with EB as the gold standard. As was the case in Figure 5, the difference between their average similarities decreases as the number of pages increases; however, in this case the decrease is much more rapid. These differences are significant for profiles derived from 20 or fewer relevant pages. We also note that the average similarity score stabilizes at 10 pages for both MPP and MPI.

Figure 9: Performance of profiles by varying numbers of documents, against EB.

Figure 10 offers the same analysis but for topics with EB entries greater than 5 KB. Here again we see the difference between the average scores decreasing with an increasing number of pages, but not as rapidly as in Figure 9. Here MPP and MPI profiles are significantly different for 30 or fewer pages, and the average score stabilizes at 15 pages for both types of profiles.

Figure 10: Performance of profiles for topics with EB entries greater than 5 KB, by varying numbers of documents, against EB.

We now take a different approach and extract named-entity features. Figure 11 shows the average similarity scores for profiles with such features. Three types of named entities, Person, Location and Organization, were extracted from the retrieved web pages. We tried two named-entity recognition tools, viz., Stanford-NER (http://nlp.stanford.edu/software/CRF-NER.shtml) and Lingpipe, and found the former to be more accurate in preliminary experiments; hence we use the Stanford-NER system. We see that while MPP profiles have the highest average similarity, the difference between them and MPI profiles is not statistically significant. However, profiles derived from multiple pages continue to be significantly better than single-page profiles. We note a considerable drop in average similarity scores compared with word-based profiles (e.g., from 0.88 to 0.6 for MPP).

Figure 11: Performance of profiles with named-entity features against Wikipedia.

4.2 Experiment 2: Prediction Ability
In the second set of experiments we evaluate the different types of profiles based on their ability to predict known relationships between proteins. We randomly compiled a list of 82 proteins from the Database of Interacting Proteins (DIP, http://dip.doe-mbi.ucla.edu/). According to DIP, 90 pairs of proteins within this set are known to interact with each other; this forms our positive gold standard set of interactions. The remaining pairs are considered the negative set of interactions. For each protein, we identified synonyms from the SwissProt protein database (http://expasy.org/sprot/) and excluded those that are English words (and thus a cause of ambiguity). We then created web search queries from these and for each query downloaded the top 100 pages using the Google API. For each protein topic, we build the various types of profiles (MPP, MPI, SPP, and SPI) and compute similarities between pairs of profiles. Two types of features were extracted from the documents, word features and named entities, from the title, body and anchor text. As before, each feature is assigned a tf*idf weight.

We predict relationships between proteins based on the similarity of their profiles, using the standard IR cosine similarity score to measure similarity between profiles. Two proteins are predicted to be related if their similarity score is above a certain threshold. Based on the predictions made we measure precision, recall and f-score. In our experiments we use the f-score (weighting precision and recall equally) as the primary measure to evaluate the different types of profiles.

We adopt a 5-fold cross-validation approach. The 82 proteins we selected yielded 3403 unique pairs (taking symmetry into account, so that only P1->P2 is considered and P2->P1 is not). These were randomly split into five sets of equal size, each with the same ratio of positives and negatives. We consider the union of four of the splits (folds, in standard machine learning parlance) as our training set and the remaining split as our test set. We choose a threshold that optimizes the training f-score, then apply this threshold to the test set and compute the test f-score. This process is repeated five times, each time with a different combination of splits used as the training data. Different profile building methods are compared based on the average of their five test f-scores.
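The sketch below shows this protocol in miniature for a single train/test split: the threshold that maximizes the training f-score is selected and then applied to the held-out pairs. The exhaustive threshold grid and the helper names are simplifications of the procedure described above, not the authors' code.

```python
# Sketch of the prediction protocol for one train/test split: pick the
# similarity threshold with the best training f-score, then score the test
# pairs with it. The threshold grid is an illustrative simplification.
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

def f_score(pred: Set[Pair], gold: Set[Pair]) -> float:
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def predict(sim: Dict[Pair, float], pairs: List[Pair], threshold: float) -> Set[Pair]:
    """A protein pair is predicted to interact if its profile similarity
    reaches the threshold."""
    return {p for p in pairs if sim[p] >= threshold}

def tune_and_test(sim: Dict[Pair, float], train: List[Pair], test: List[Pair],
                  gold: Set[Pair]) -> float:
    gold_train, gold_test = gold & set(train), gold & set(test)
    candidates = sorted({sim[p] for p in train})
    best_t = max(candidates,
                 key=lambda t: f_score(predict(sim, train, t), gold_train))
    return f_score(predict(sim, test, best_t), gold_test)

# Averaging tune_and_test() over the five splits gives the cross-validated
# test f-score used to compare the profile types.
```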
For our first set of runs we created profiles consisting of word features found in the title, body and anchor text of pages. As before, we consider four types of profiles (MPP, MPI, SPP, and SPI). Table 1 shows the average training and testing precision, recall and f-scores for each type of profile across all five iterations. We see that the average test f-scores for MPP, MPI and SPI (approximately 0.25) are very close, and thus these profiles are similar in their ability to predict relationships. However, all three are significantly better (at the 0.05 level) than SPP profiles. Also, the training f-scores are in general similar to the testing f-scores, suggesting that the thresholds computed using the training data were general enough to apply to new (test) data.

Table 1: Average training and test precision, recall and f-scores for the four types of profiles, averaged over the five folds. LCI and UCI denote the lower and upper 95% confidence intervals of the test f-score, respectively.

Type | Train Precision | Train Recall | Train Fscore | Test Precision | Test Recall | Test Fscore | LCI (95%) | UCI (95%)
MPP  | 0.2205 | 0.3528 | 0.2704 | 0.2046 | 0.3333 | 0.2529 | 0.1826 | 0.3232
MPI  | 0.2188 | 0.3222 | 0.2604 | 0.2115 | 0.3111 | 0.2479 | 0.2061 | 0.2897
SPP  | 0.1183 | 0.2702 | 0.1628 | 0.1011 | 0.2303 | 0.1357 | 0.0963 | 0.1751
SPI  | 0.2190 | 0.3222 | 0.2607 | 0.2101 | 0.3111 | 0.2501 | 0.1885 | 0.3117

Figure 12 shows the average f-scores for MPP and MPI profiles when the number of relevant pages used to build the profiles is varied. Here again, profiles consist of word features extracted from the title, body and anchor text of relevant pages. We see that varying the number of pages has little effect on the average f-scores for either type of profile, and in general we see no significant difference in performance between the two. We also note that the average f-score for both stabilizes at 5 documents.

Figure 12: Average prediction f-scores for profiles by varying numbers of pages.

We next created profiles containing more specific features, viz., named entities. Named entities were extracted from relevant pages using the Lingpipe (www.alias-i.com/lingpipe/) named entity extraction package. This package comes with a model pre-trained on the GENIA corpus (www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), which is what we used for this experiment. Figure 13 shows the average f-scores for the four different types of profiles. Here again, we see no significant difference between MPP and MPI profiles; however, MPP profiles are significantly better than single-page profiles. Comparing named-entity profiles with word profiles, we see that the average MPP, MPI, and SPP f-scores are higher for the former than the corresponding f-scores for the latter (shown in Table 1), while the reverse is true for SPI.

Figure 13: Average prediction f-scores for profiles with named-entity features.
5. POTENTIAL SOURCES OF ERRORS
Our approach relies upon several underlying technologies, such as document retrieval, sentence detection, and named-entity recognition. Each of these is a potential source of error. First, we rely on the accuracy of search engines to retrieve relevant documents for topics, from which profiles are derived. While modern search engines are fairly accurate, they are still vulnerable to problems such as ambiguity. Despite our attempts, some ambiguity remained; e.g., 'NEMO' is a synonym for a protein and is also the name of a popular cartoon character.

In this paper we have created profiles from sentences in relevant documents containing individual instances of the topics. Sentence boundary detection in web pages is a hard problem, and many off-the-shelf tools (we use MxTerminator) are not optimized for web pages. Due to the tag structure of web pages, many incoherent sentences are identified.

In this research we have also evaluated profiles containing named-entity features extracted from relevant pages. Named-entity recognition is a challenging problem and depends primarily on the data used to train the model. Our use of models pre-trained on newswire data in the first experiment and on GENIA data in the second resulted in some errors, because there are significant differences between general web text and newswire or GENIA text. For example, EBI (European Bioinformatics Institute) was recognized as a protein by Lingpipe.

6. DISCUSSION
Our experiments have yielded interesting results. In Section 4.1 our results clearly show that profiles derived from full text in multiple pages are more informative than single-page profiles. They are also better than profiles derived from multiple pages but restricted to instance-level text windows. This holds when Wikipedia is used as the gold standard; it is also true with EB as the gold standard for topics whose entries are at least 5 KB in size. These results support our intuition that relevant information tends to be scattered over multiple pages and is not necessarily instance bound.

Our results also suggest that adding the link information present in relevant pages as features does not improve the information content of a profile. In fact, assigning a higher weight to links had a detrimental effect. However, this could be attributed to the relatively small number of links present in Wikipedia pages, which would bring down the average similarity score.

The difference between MPP and MPI profiles was much less prominent when they contained more specific features, such as named entities. The average similarity of such profiles with respect to Wikipedia was also notably lower than for the corresponding profiles containing word features.

Also, we found that the difference between MPP and MPI was greatest when few relevant pages were used. As the number of relevant pages increased, the difference between the two shrank. This suggests that when few relevant pages are available, as is quite often the case, it is better to use the full text to create representations. A key observation is that, in general, 10 to 15 pages are sufficient for MPP, our best strategy, to achieve its highest similarity with the gold standard.
As an aside, our experiments also revealed interesting differences between the two gold standard sources of information we considered, Wikipedia and EB. In general we found the former to be more comprehensive for the topics we selected, while the latter contained comprehensive information only for a few popular topics, such as WWI and WWII, and cursory information for the rest.

While significant differences between MPP and MPI profiles were detected in the first set of experiments, this was not the case with the second set of DIP experiments testing prediction ability. In general, irrespective of the type of features contained in the profile, the performance of MPP and MPI profiles in predicting known DIP relationships was very similar. However, MPP profiles have the important advantage that they can model general topics, while MPI profiles can only be built for entities; thus it would be preferable to use the former. Again, protein profiles derived from multiple pages were always significantly better than profiles derived from single pages, confirming our intuition that the latter do not contain enough information for a good representation. We also found that varying the number of relevant pages did not have any effect on the prediction ability of MPP and MPI profiles.

As a final point, our DIP-based experiments also let us test the idea of implementing a bioinformatics web mining application. Although the results are moderate, they offer a reasonable starting point. In future work we will also compare these results with similar experiments run against MEDLINE. Given the vastness of the Web, it is fast becoming a data and knowledge source for domain-specific applications.

In conclusion, this research contributes to our long-term goal, which is to represent arbitrary topics on the web with topic profiles consisting of weighted features of different types. We see value in pursuing a higher-level topical web where the object (node) of interest is the topic and the links represent inter-topic relationships. Such a web has the potential to more effectively support individual information needs as well as web mining applications seeking to discover novel connections between topics.
7. ACKNOWLEDGMENT
This material is based upon work supported by the National Science Foundation under Grant No. 0312356 awarded to Padmini Srinivasan. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES
[1] L. Adamic and E. Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2001.
[2] M. Ben-Dov, W. Wu, P. Cairns, and R. Feldman. Improving knowledge discovery by combining text-mining and link analysis techniques. In Proceedings of the SIAM International Conference on Data Mining, 2004.
[3] J. Conrad and M. Utt. A system for discovering relationships by feature extraction from text databases. In Proceedings of the 17th Annual International ACM-SIGIR Conference (Special Issue of the SIGIR Forum), pages 260–270, 1994.
[4] G. Flake, S. Lawrence, and C. Giles. Efficient identification of web communities. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160, 2000.
[5] N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo. Deriving marketing intelligence from online discussion. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), pages 419–428, 2005.
[6] H. Kautz, B. Selman, and M. Shah. Referral Web: combining social networks and collaborative filtering. Communications of the ACM, 40(3):63–65, 1997.
[7] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In Proceedings of the 8th International World Wide Web Conference (WWW-1999), pages 403–415, 1999.
[8] W. Li, R. Srihari, C. Niu, and X. Li. Entity profile extraction from large corpora. In Proceedings of the Pacific Association for Computational Linguistics 2003 (PACLING-2003), 2003.
[9] A. Lih. Wikipedia as participatory journalism: reliable sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, 2004.
[10] B. Liu, C. Chin, and H. Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of the 12th International World Wide Web Conference (WWW-2003), pages 251–260, 2003.
[11] Y. Matsuo, J. Mori, M. Hamasaki, K. Ishida, T. Nishimura, H. Takeda, K. Hasida, and M. Ishizuka. Polyphonet: an advanced social network extraction system from the web. In Proceedings of the 15th International World Wide Web Conference (WWW-2006), pages 397–406, 2006.
[12] D. Newman, C. Chemudugunta, P. Smyth, and M. Steyvers. Analyzing entities and topics in news articles using statistical topic models. In IEEE International Conference on Intelligence and Security Informatics, pages 93–104, 2006.
[13] H. Raghavan, J. Allan, and A. McCallum. An exploration of entity models, collective classification and relation description. In Proceedings of the KDD Workshop on Link Analysis and Group Detection, 2004.
[14] A. Sehgal and P. Srinivasan. Manjal - a text mining system for MEDLINE (demonstration). In Proceedings of the 28th Annual International ACM SIGIR Conference, page 680, 2005.
[15] J. Voss. Measuring Wikipedia. In Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics, 2005.