Using Document Structure in Retrieving Web Pages at Web-CLEF 2006

Syntia Wijaya, Bimo Widhi, Tommy Khoerniawan, and Mirna Adriani
Faculty of Computer Science, University of Indonesia
Depok 16424, Indonesia
{swd20, bimo20, tokh20}@mhs.cs.ui.ac.id, mirna@cs.ui.ac.id

Abstract. We present a report on our participation in the mixed monolingual web task of the 2006 Cross-Language Evaluation Forum (CLEF). We compared the results of web-page retrieval based on the page content, the page title, and the anchor texts. The retrieval effectiveness of the combination of page content, page title, and anchor texts was better than that of the combination of page content and page title only. Applying pseudo-relevance feedback to the combination of page content and page title improved the retrieval performance of the queries.

Keywords: web retrieval

1 Introduction

The fast-growing amount of information on the web has motivated many researchers to look for ways of dealing with such information efficiently [2, 5]. Information retrieval forums such as the Cross-Language Evaluation Forum (CLEF) have included research on the web area; since 2005, CLEF has included a web IR topic as one of its research tracks. This year we, the University of Indonesia IR group, participated in the mixed monolingual web IR task of CLEF 2006.

2 The Retrieval Process

The mixed monolingual task searches for web pages in a number of languages. The queries and the documents were processed using the Lemur information retrieval system (http://www.lemurproject.org). Stop-word removal, as is done by many IR systems, was applied only to the English queries and documents.

2.1 Web-page Scoring Techniques

We employed several different techniques for scoring the relevance of documents (web pages) in the collection, based on combinations of the content of the page, the title of the page, and the anchor texts that appear on the page. The first technique takes into account only the content of a web page to find the pages most relevant to the query. We used the language model [3] to estimate the probability of the query given a page. The second technique considers the title of the web page as the only source for finding the relevant pages. The third technique uses both the content and the title of the page.

3 Experiment

The web collection used in this task, EUROGOV, contains over two million documents. In these experiments, we used the Lemur information retrieval system to index and retrieve the documents. Lemur is based on the language modeling approach [3]. We indexed the web pages according to their page content, page titles, and anchor texts. Stopwords were removed from the collection, but word stemming was not applied.

4 Results

We were very surprised by the results of our participation this year. All of the results that we submitted were very low compared to last year's results. In 2005, we indexed the collection using a different information retrieval system, Lucene (http://lucene.apache.org/), which is based on the vector space model [1, 4]. The first result is shown in Table 1. In this run, we computed the total relevance score by summing the relevance scores based on the page content, the page title, and the anchor texts found on the web pages; a minimal sketch of this combination is given below.
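The per-field scores themselves came from Lemur's built-in language model, and we do not reproduce its exact formulation here. The following sketch only illustrates the idea under simplifying assumptions: a unigram query-likelihood model with Jelinek-Mercer smoothing, illustrative field names (content, title, anchor) and function names, and collection statistics (collection_tf, collection_len) that are assumed to be precomputed.

    import math
    from collections import Counter

    def field_lm_score(query_terms, field_text, collection_tf, collection_len, lam=0.5):
        # Query likelihood of one document field under a unigram language model
        # with Jelinek-Mercer smoothing (a simplified stand-in for Lemur's model).
        doc_tf = Counter(field_text.split())
        doc_len = sum(doc_tf.values()) or 1
        score = 0.0
        for term in query_terms:
            p_doc = doc_tf[term] / doc_len
            p_coll = collection_tf.get(term, 0) / collection_len
            p = lam * p_doc + (1.0 - lam) * p_coll
            score += math.log(p) if p > 0.0 else math.log(1e-10)  # floor for unseen terms
        return score

    def combined_score(query, doc, collection_tf, collection_len):
        # Total relevance score: the sum of the per-field scores for the page
        # content, the page title, and the anchor texts, as described above.
        terms = query.lower().split()
        return sum(field_lm_score(terms, doc.get(field, ""), collection_tf, collection_len)
                   for field in ("content", "title", "anchor"))

Ranking the pages by this combined score corresponds to the run reported in Table 1; dropping the anchor field from the sum gives the content-plus-title combination reported in Table 2.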
Table 1. Mean Reciprocal Rank (MRR) of the combined relevance score for page content, page title, and anchor texts on a web page.

  Task: Mixed Monolingual      UI1DTA
  MRR                          0.0404
  Average success at 1         0.0258
  Average success at 5         0.0531
  Average success at 10        0.0707

Table 2 shows the result of combining the relevance scores based on page content and page title. As can be seen, the MRR dropped from 0.0404 (see Table 1) to 0.0116.

Table 2. Mean Reciprocal Rank (MRR) of the combined relevance score for page content and page title.

  Task: Mixed Monolingual      UI4DTW
  MRR                          0.0116
  Average success at 1         0.0067
  Average success at 5         0.0150
  Average success at 10        0.0201

The third technique applies pseudo-relevance feedback to the retrieval that uses the combined score of page content, page title, and anchor texts. As shown in Table 3, the feedback reduced the performance of these queries: the MRR dropped to 0.0253. The pseudo-relevance feedback was performed using the top-5 documents retrieved; a sketch of this expansion step is given after Table 3.

Table 3. Mean Reciprocal Rank (MRR) of the combined score of page content, page title, and anchor texts with top-5 document pseudo-relevance feedback.

  Task: Mixed Monolingual      UI3DTAF
  MRR                          0.0253
  Average success at 1         0.0160
  Average success at 5         0.0309
  Average success at 10        0.0423
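The feedback itself was performed with Lemur's facilities; the sketch below only illustrates the general shape of top-5 pseudo-relevance feedback as query expansion. The frequency-based term selection, the choice of ten expansion terms, and the set of fields mined for terms are illustrative assumptions, not the exact parameters of our runs.

    from collections import Counter

    def pseudo_relevance_feedback(query, ranked_docs, k=5, n_terms=10,
                                  fields=("content", "title", "anchor"),
                                  stopwords=frozenset()):
        # Expand the query with frequent terms from the top-k retrieved documents
        # (an illustrative stand-in for Lemur's feedback mechanism).
        original = query.lower().split()
        counts = Counter()
        for doc in ranked_docs[:k]:                      # top-5 documents in our runs
            for field in fields:                         # fields depend on the run
                counts.update(t for t in doc.get(field, "").lower().split()
                              if t not in stopwords and t not in original)
        expansion = [t for t, _ in counts.most_common(n_terms)]
        return " ".join(original + expansion)

    # Usage sketch: retrieve once, expand the query, then retrieve again.
    # first_pass = rank(collection, query)               # hypothetical ranking step
    # expanded   = pseudo_relevance_feedback(query, first_pass, k=5)
    # final_pass = rank(collection, expanded)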
Finally, the last result was obtained by applying pseudo-relevance feedback to the combined relevance score of page content and page title only. As shown in Table 4, this gave the highest retrieval performance, with an MRR of 0.0918.

Table 4. Mean Reciprocal Rank (MRR) of the combined score of page content and page title with top-5 document pseudo-relevance feedback.

  Task: Mixed Monolingual      UI1DTF
  MRR                          0.0918
  Average success at 1         0.0634
  Average success at 5         0.1202
  Average success at 10        0.1516

To investigate the cause of our poor retrieval performance, we conducted some further experiments. We took the queries from last year's task and ran them against the same index built with Lemur. The result, shown in Table 5, is much better than that for this year's queries. However, we found signs of an indexing error: there were some domains that Lemur was unable to index, and as a result it could not retrieve any documents for a number of queries. We also suspect that the index for documents in languages with non-Latin characters was corrupt, as documents in some domains, such as the Russian and Greek ones, were never retrieved.

Table 5. Mean Reciprocal Rank (MRR) of the combined relevance score of page content, page title, and anchor texts using the 2005 query topics.

  Task: Mixed Monolingual      DTA-2005
  MRR                          0.2069
  Average success at 1         0.1444
  Average success at 5         0.2742
  Average success at 10        0.3254

5 Summary

Our results demonstrate that, without feedback, combining the page content, the page title, and the anchor texts resulted in a better mean reciprocal rank (MRR) than searching with the page content and page title only. The pseudo-relevance feedback that we employed increased the retrieval performance when the relevance score combined only the page content and the page title, but it lowered the performance when the anchor texts were also included. However, we had problems indexing the collection, which resulted in the poor retrieval performance of our participation this year. We hope to improve our results in the future by exploring other methods.

References

1. Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern Information Retrieval. New York: Addison-Wesley, 1999.
2. Hawking, David. Overview of the TREC-9 Web Track. In NIST Special Publication: The Ninth Text REtrieval Conference (TREC-9), 2001.
3. Ponte, J. and Croft, W.B. A Language Modeling Approach to Information Retrieval. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275-281. ACM, 1998.
4. Salton, Gerard, and McGill, Michael J. Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
5. Zobel, J. How Reliable Are the Results of Large-Scale Information Retrieval Experiments? In Proceedings of ACM SIGIR'98, Melbourne, Australia, August 1998.