=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-WebCLEF-FiguerolaEt2007
|storemode=property
|title=REINA at WebCLEF 2007. Selecting Useful Snippets
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-WebCLEF-FiguerolaEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/FiguerolaBRR07
}}
==REINA at WebCLEF 2007. Selecting Useful Snippets==
REINA at WebCLEF 2007. Selecting Useful Snippets

Carlos G. Figuerola, José L. Alonso Berrocal, Ángel F. Zazo Rodríguez, Emilio Rodríguez
REINA Research Group, University of Salamanca
reina@usal.es

Abstract

The task for this year consists in retrieving snippets, or pieces of text, from web documents about several topics. The extraction of such snippets can be approached in several ways, as can the selection of the most useful ones. We describe the segmentation process adopted and the selection of snippets carried out.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages, Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Web retrieval, Text segmentation

1 Introduction

This year, the WebCLEF track focuses on retrieving text snippets or fragments of web pages that provide information about a topic; additionally, snippets must be in a language from a set of accepted ones. As a starting point, we have a set of topics, each with a title and a short description, as well as several documents or known sources about the topic. Additionally, for each topic, we have one or several searches in Google, with the first 1000 documents retrieved.

Our general approach consists of considering, for each topic, the documents retrieved by Google as the document collection to work with. Since the task requires obtaining snippets, these documents must be divided into fragments or snippets, each of which is treated as an independent document. As the query for each topic, we can use its description. Additionally, this query can be enriched with more terms from the known sources. We can also use the available anchors that point at the documents retrieved by Google. Finally, it is possible to apply filters or restrictions that eliminate retrieved documents that are not in one of the accepted languages for each topic.

In this way, the task can be approached as a classic retrieval problem and, consequently, conventional techniques can be applied.

2 Building the database of documents

As we said, the collection or database of documents is formed by snippets from the documents retrieved by Google for each topic. For each of these topics, one or more searches were run in Google, and (more or less) the first 1000 retrieved documents were taken from each of those searches. This implies a variable number of documents per topic. We have given equal value to all the Google searches for the same topic.

So, for each of the documents retrieved by Google, we would have to obtain the original document, convert it to text, split it into fragments, extract the terms of each fragment and compute their weights. The organizers of the task have already solved the first part of these operations, since they have provided us with the original documents as well as their conversions to plain text. In general, the conversion to plain text is good (this is a non-trivial aspect). Nevertheless, the character encodings used are inconsistent, although the plain text is claimed to be encoded in UTF-8.

For languages that use characters not contained in standard ASCII, encoding and decoding those characters is a source of headaches; even detecting which encoding was used is problematic in many cases. As an example, we have used the Universal Encoding Detector (chardet) [2], a Python module based on Mozilla's detection libraries, which surprisingly indicates that most of the plain text versions are encoded in Latin-2.
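As an illustration, a minimal sketch of running this kind of detection over the provided plain-text files; the directory name and the output format are assumptions of ours, not part of the WebCLEF data:

  # Guess the character encoding of each plain-text file with chardet.
  # The directory name 'plaintext' is illustrative.
  import os
  import chardet

  def detect_encoding(path):
      with open(path, 'rb') as f:
          raw = f.read()
      guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.87, ...}
      return guess['encoding'], guess['confidence']

  for name in sorted(os.listdir('plaintext')):
      encoding, confidence = detect_encoding(os.path.join('plaintext', name))
      print(name, encoding, confidence)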
2.1 Segmentation of the text

Diverse techniques can be applied to segment documents into fragments or short text passages. Basically, some are based on size in bytes or in words, while others split at sentence or paragraph boundaries [4]. The former produce pieces that are more homogeneous in size, but often devoid of sense, since the partition point is blind. The latter tend to produce fragments of very different sizes. In addition, applying them is not always simple; in many cases the conversion of a web document to plain text loses the separations between paragraphs, does not distinguish between soft and hard line feeds, or blurs structural elements such as tables. A simplistic approach, such as choosing an orthographic character like the period (.) as the reference for fragmenting the text, tends to produce passages that are too short and, therefore, of little use for the objectives of this task.

In our case, we adopted a mixed approach. After several tests, we decided that a suitable size for each fragment was around 1500 bytes, but as we wanted fragments with informative sense, our fragmenter looks for the period closest to the 1500-byte mark and splits at that point.
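A minimal sketch of such a fragmenter, assuming the text has already been decoded; it measures length in characters as an approximation of bytes, and the function and variable names are our own:

  # Mixed segmentation strategy: aim for fragments of about 1500 bytes, but cut
  # at the period closest to that mark so fragments keep some informative sense.
  TARGET = 1500

  def segment(text, target=TARGET):
      fragments = []
      start = 0
      while start < len(text):
          rest = text[start:]
          if len(rest) <= target:
              fragments.append(rest.strip())
              break
          before = rest.rfind('.', 0, target)   # last period before the target size
          after = rest.find('.', target)        # first period at or after it
          if before == -1 and after == -1:
              cut = target                      # no period at all: fall back to a blind cut
          elif before == -1:
              cut = after + 1
          elif after == -1:
              cut = before + 1
          else:
              # choose the period closest to the 1500-byte mark
              cut = before + 1 if (target - before) <= (after - target) else after + 1
          fragments.append(rest[:cut].strip())
          start += cut
      return [f for f in fragments if f]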
2.2 Other processes of lexical analysis

Some other transformations were carried out: conversion to lower case, removal of accents, removal of stopwords (using a long stopword list covering all the accepted languages), and application of a simple s-stemmer. Each fragment thus obtained and transformed was considered an independent document. Terms were extracted and weighted according to the ATU scheme (slope = 0.2) [3], applying the well-known vector space model.

2.3 Formation of queries

The objective is, somehow, to solve the task using conventional or already known retrieval techniques. From the document collection formed by snippets, we must select those that are most useful for each topic. The key lies in composing suitable queries that can produce this selection. As sources of information to compose those queries, we have the topics, each with a short title and a brief description. Additionally, for each topic, we also have a few documents, denominated known sources, in full text. We also have the queries submitted to Google but, since the document collection comes exclusively from the answers to those queries, the information they contain has already been exploited.

So we can use the topics (title and description) as the nucleus of each query and enrich it with terms coming from the known sources. The known sources are complete documents, some of them very long, which can contain many terms. One may ask whether this will introduce too much noise into the query; one possibility is to weight the terms coming from those known sources differently from the terms coming from the title and the description of the topics.

Additionally, it is also possible to consider the different structural fields of the known sources (title, body, headings, meta tags, etc.). Previous experiments in earlier editions of CLEF [1] show the importance of some of those fields, as well as the little interest of others. The most interesting field is the anchor text of backlinks. In this sense, since we have a very reduced document set, we do not have many backlinks to work with; nevertheless, those that point from the known sources to some of the documents retrieved by Google seem especially important. Thus, we have used in the queries the terms of the topics (title and description), plus the terms of the above-mentioned anchors. To this, we have added the terms of the known sources, weighted in different ways.

In previous editions of WebCLEF we worked on the use of different sources of information in retrieval and on how to mix or fuse these sources. This time, we have chosen to modify the weights of the terms by operating on their frequency in each document. The weighting scheme chosen for the queries is also ATU (slope = 0.2), so the weight is directly proportional to the frequency of the term in the document; thus we have set a coefficient by which to multiply this frequency. The runs carried out vary according to that coefficient: one of them keeps the frequency unchanged, so the terms of the known sources weigh just like those of the topics; another run weights the terms of the known sources by a quarter (frequency x 0.25); and a third does not use these terms at all. The idea is to assess to what extent such terms are useful or, on the contrary, introduce noise.
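A minimal sketch of how such query frequencies can be assembled; the coefficient corresponds to the three submitted runs (1, 0.25 and 0), while the function names, the simple tokenizer and the way the contributions are summed are simplifications of our own, and the subsequent ATU weighting is not shown:

  # Build the raw term frequencies of a query from the topic (title and
  # description), the anchors pointing at retrieved documents, and the known
  # sources, whose frequencies are scaled by a coefficient (1.0, 0.25 or 0.0).
  from collections import Counter

  def tokenize(text):
      # stand-in for the real lexical analysis (lowercasing, accent and
      # stopword removal, s-stemming)
      return text.lower().split()

  def query_frequencies(topic_title, topic_description, anchors, known_sources,
                        source_coef=0.25):
      freqs = Counter()
      for text in [topic_title, topic_description] + list(anchors):
          freqs.update(tokenize(text))
      weighted = {term: float(tf) for term, tf in freqs.items()}
      for doc in known_sources:
          for term, tf in Counter(tokenize(doc)).items():
              weighted[term] = weighted.get(term, 0.0) + source_coef * tf
      return weighted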
3 Results

              run 0    run 0.25   run 1
  Precision   0.1415   0.1599     0.1624
  Recall      0.1796   0.2030     0.2061

  Table 1: Official runs and results

The results of the three runs submitted, shown in Table 1, differ little from one another. It seems that using terms from the known sources is more useful than not using them. But we must note that several topics (about half of them) produced no useful results at all. We did not apply any restriction or filter based on the accepted languages; restrictions based on the type of information contained in the snippets, however, might be desirable. For example, several of the snippets were references to other sources of information (bibliographic references, academic courses on the topic, etc.). It seems that this type of information is not very useful for this task.

4 Conclusions

We based our work on building queries with terms from the known sources, as well as with terms from the description of the topics. Using the terms from the known sources produces better results. Nevertheless, the process of obtaining the text segments, and of selecting them based on their content type and language, seems even more interesting.

References

[1] Carlos G. Figuerola, José Luis Alonso Berrocal, Ángel F. Zazo Rodríguez, and Emilio Rodríguez. REINA at WebCLEF 2006: Mixing fields to improve retrieval. In A. Nardi, C. Peters, and J. L. Vicedo, editors, ABSTRACTS CLEF 2006 Workshop, 20-22 September, Alicante, Spain. Results of the CLEF 2006 Cross-Language System Evaluation Campaign, 2006.

[2] Mark Pilgrim. Universal Encoding Detector. http://chardet.freeparser.org.

[3] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 18-22, 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pages 21-29. ACM, 1996.

[4] Ángel F. Zazo, Carlos G. Figuerola, José Luis Alonso Berrocal, and Emilio Rodríguez. Reformulation of queries using similarity thesauri. Information Processing & Management, 41(5):1163-1173, 2005.