Introduction

University of Indonesia Participation at WEBIR-CLEF 2005

0 Mirna Adriani and Rama Pandugita Faculty of Computer Science University of Indonesia Depok 16424 , Indonesia

We present a report on our participation in the mixed monolingual web task of the 2005 Cross-Language Evaluation Forum (CLEF). We compared the result of web page retrieval based on the content of the page, the target domain and the page content, and a combination of the page title and the target domain. The result shows that combining the page title and the target domain resulted in better retrieval performance than using only the page content or the target domain and the page content.

Introduction

The mixed monolingual task searches for web pages in a number of languages. The queries and the documents were processed using the Lucene information retrieval system (see http://lucene.apache.org). Stopword removal was applied only to the English queries and documents. We used three different techniques for indexing the documents in the collection, i.e., based only on the content of the page, based on the target domain and the page content, and based on the combination of the page title and the target domain.

The first technique considers the content of the page in order to find the most relevant web pages to the query. We used the vector space model [1, 2] to find the similarity value between the query and the pages. Since the collection contains documents not only in English, we also used the information in the metadata of the target document. The information about the target domain of the query is matched with the domain metadata of the pages. We then applied the vector space model to obtain the most similar pages to the query based on the page content. For example: <topic_id> <domain> <LANGUAGE> <query> WC0001 eu EN road safety in europe

WC0003 nl NL list of dutch eco-labels Query number 1 (WC0001) was used for searching only in the eu (European Union) domain and query number 3 (WC0003) was used for searching only in the nl (Netherland) domain.

The third technique uses a combination of the page title and the target domain. We used only the terms in the page title as the query terms once the domain of the target pages is found. We then match the title of the pages with the query using the vector space model. 3

Experiment

The web collection contains over two million documents from the EUROGOV collection. The collection is divided into 27 European language domains. In this mixed monolingual task, the queries are in various languages and used to find documents in the same language as the queries. There are 547 queries to be used for searching in two categories, namely, the name page search and the homepage search. The average number of words in the queries is 6.29 words.

In these experiments, we used the Lucene information retrieval system to index and retrieve the documents. Lucene retrieval system is based on the vector space model. The documents were indexed in separate indexing files according to the domain. For example, documents from internet domain ‘uk’ are indexed separately from documents from internet domain ‘de’. Lucene has the capability to build separate indexes and to search according to the specified index.

Lucene is also capable of indexing documents using two separate fields such as the title page and the content, and then searching can be done using either the title page or the content page. 4

Results

The results that we have submitted were produced using the three techniques, namely, based on the content only, based on the target domain and page content, and based on a combination of the title page and the target domain.

The average of success values of each technique in several ranks (Table 2) show some differences. The best result was achieved by using the combination of the target domain and the page title, with average success at 1: 0.2249, which is consistent with the MRR result. The retrieval performance of the combination of the target domain and the page title was 33.34% better than that of using the content only and 15.47% better than that of using the target domain and the content. However, the average of success of using the target domain and the page content shows better results at higher ranks than the other two techniques.

Our result is similar to the work by Westerveld et al. [2] who obtained better results by using other information in addition to the content. Average success at 1: Average success at 5: Average success at 10: Average success at 20: Average success at 50: 0.1499 0.2834 0.3583 0.3931 0.4826

Target domain + title page 0.2249 0.3583 0.4186 0.4662 0.5320

Since this was our first participation in the WEB task, it took us quite a lot of effort to cope with such large collections. There were several document sets that were damaged, possibly in the process of downloading the files. As a result, we could not index those corrupt files. It is possible that those files were relevant to some of the queries.

The other problem that we had was that we did not prepare Lucene to handle non-Latin characters, and so, the retrieval of documents using queries containing such characters was erroneous.

Our results demonstrate that combining the target domain metadata and the page title resulted in better mean reciprocal rank (MRR) compared to searching using the content only and using the target domain metadata and the page content. However the combination of the target domain metadata and the page title achieved best performance only at 1 rank. For the other ranks, using the target domain metadata and page content showed better results compared to the other two techniques. We hope to improve our results in the future by exploring still other methods. 4 5

Baeza-Yates , Richardo, and Berthier Ribeiro-Neto. Modern Information Retrieval , New York: AddisonWesley, 1999 .

Salton , Gerard, and McGill , Michael J . Introduction to Modern Information Retrieval, New York: McGrawHill, 1983 .

Westerveld , Thisj, Wessel Kraaij and Djoerd

Hiemstra . Retrieving Web Pages using Content, Links, URLs, and Anchors . In NIST Special Publication: The 10th Text Retrieval Conference (TREC-10) . 2001 .