=Paper=
{{Paper
|id=Vol-1171/CLEF2005wn-WebCLEF-FiguerolaEt2005
|storemode=property
|title=REINA at the WebCLEF Task: Combining Evidences and Link Analysis
|pdfUrl=https://ceur-ws.org/Vol-1171/CLEF2005wn-WebCLEF-FiguerolaEt2005.pdf
|volume=Vol-1171
|dblpUrl=https://dblp.org/rec/conf/clef/FiguerolaBRR05
}}
==REINA at the WebCLEF Task: Combining Evidences and Link Analysis==
Carlos G. Figuerola, José L. Alonso Berrocal, Ángel F. Zazo Rodríguez, Emilio Rodríguez
REINA Research Group, University of Salamanca
reina@usal.es
Abstract
The participation of the REINA Research Group in WebCLEF 2005 focuses on the
monolingual mixed task. Queries or topics are of two types: named pages and home pages.
For both, we first perform a search by thematic content; for the same query, we also
search several information elements of every page (title, some meta tags, text of
backlinks) and then combine the results. For queries about home pages, we try to detect
them with a method based on some keywords and their patterns of use. Afterwards, the
results of the thematic content retrieval are re-ranked, based on Page-Rank and
Centrality coefficients.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information
Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database
Management]: Languages - Query Languages
General Terms
Measurement, Performance, Experimentation
Keywords
Information Retrieval, Web Search, Link Analysis, Search Fusion
1 Introduction
Our participation in WebCLEF 2005 focuses on the monolingual (Spanish) mixed task. This
task has two goals: to find named web pages and home web pages. Every query has only one right
answer; both kinds of queries are mixed, and we do not know in advance which kind each query is.
In principle, the basic approach consists of finding the pages whose content is most similar to
each query; the hope is that the valid answer is among the first retrieved pages, and how good
the ranking is depends on the techniques applied in this search.
For the queries searching for a home page we apply a procedure that rearranges the list of
retrieved documents, considering, in addition to similarity with the query, several pieces of
evidence about which documents can be home pages. An additional problem is that we do not know
a priori which queries or topics look for home pages and which do not, so we have to include a
procedure that analyzes the queries and determines which pursue a home page and which do not.
This paper is organized as follows: Section 2 describes the part of the document collection
which we have worked with. Section 3 describes our approach to the task; next, we present the
submitted runs and their results; finally, conclusions are given.
Format Number of docs.
PDF 4040
MS Word 315
empty docs 6
Table 1: Blacklist for .es domain
2 The collection of documents
Our participation this year is limited to the .es domain of the EuroGov collection. This domain
has 35,168 documents; not all of them are HTML pages, and it is not always easy to identify the
format of every document. This year, all the topics refer to HTML pages; the organizers provide a
blacklist of 4,365 documents (in the .es domain) which are not HTML.
Nevertheless, there are documents in other formats that are not included in the blacklist. Thus,
of the 35,168 documents of the .es domain, 8,642 do not contain the tag.
On the other hand, documents seem to be truncated at a size close to 64 K; in binary files, as is
the case of some PDFs, chr(0) characters seem to have been replaced by a space (chr(32)).
2.1 Topics
There are 118 topics in Spanish, 59 searching for home pages and 59 for named pages. The concept
of home page, however, is somewhat fuzzy; whether some of the searched pages count as home
pages is quite debatable.
In addition, there are some mistakes in the topic set. Some topics are duplicated, or
even triplicated, some of them with a different correct page as the answer in the qrels file.
Some topics are formulated too broadly. For example, topic WC0098: Consejería de Educación y
Cultura; there are 17 Autonomous Communities in Spain, and every one of them has a Council of
Education and Culture. Besides, we have found that many embassies also have a Consejería de
Educación y Cultura, and there are a lot of embassies. Which of all these is the right answer?
A few topics have as correct answer a page which is not in the .es domain. This may be
right; but, since we work only in the .es domain, we cannot find the correct page anyway.
3 Our approach
As we said before, the basic idea is to find the pages most similar to every query and, for the
home page queries, to rearrange the list of retrieved documents, boosting those more likely to be
home pages. This leads us, in addition, to analyze the queries to determine their type.
The first part, finding the pages most similar to every query, can be solved by a classic
information retrieval approach. Nevertheless, web pages have informative elements other than the
plain text which we see in the browser's window. We can use these elements to improve
retrieval.
3.1 Combining elements
The list of elements that could be taken into account in web pages is extensive, but we focused
on:
* the body field, which seems the most important
* the title field
* the contents of some META tags, namely Description and Keywords
* the text of the backlinks, that is, the links in other documents which point to the page we are analyzing
All these elements are pieces of evidence that we can combine to find the pages most similar to
every query. There are several ways to do the fusion or combination of these elements; a first
issue is whether to do the fusion before or after running the query.
Our choice is to do it afterwards, so the procedure we applied is as follows:
* build an index for each of the elements we take into account
* run the query against each of these indexes
* combine the results obtained with each index
For the first step, we used our software Karpanta [5], based on the well-known vector space
model, and we built indexes of: body, title, meta description, meta keywords and text of backlinks.
Term weights are computed in a classic way, based on the tf x IDF scheme known as atc. In all
cases stop words (from a standard list of about 300 Spanish words) were removed, and an enhanced
s-stemmer was applied [6].
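As an illustration of the weighting scheme, the sketch below reads atc in the usual SMART sense: augmented term frequency (a), inverse document frequency (t) and cosine normalization (c). Whether Karpanta uses exactly these variants is an assumption, and the function name `atc_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def atc_vectors(docs):
    """docs: {doc_id: [tokens]} -> {doc_id: {term: atc weight}}."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        if not tf:
            vectors[doc_id] = {}
            continue
        max_tf = max(tf.values())
        # a: augmented tf in [0.5, 1]; t: idf = log(N / df).
        raw = {t: (0.5 + 0.5 * f / max_tf) * math.log(n_docs / df[t])
               for t, f in tf.items()}
        # c: cosine normalization, so every vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors[doc_id] = {t: w / norm for t, w in raw.items()} if norm else raw
    return vectors

# Toy example (hypothetical documents, already stop-worded and stemmed).
toy = {"d1": ["ministerio", "cultura", "cultura"],
       "d2": ["ministerio", "industria"]}
weights = atc_vectors(toy)
```

In the toy example a term that occurs in every document gets weight 0, because its idf is log(1).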
The size of the indexes differs, as do the fields on which they are based. Almost
all HTML pages have a body field (some of them contain only JavaScript and the like), but this
is not the case for the other indexes. Thus, 71.5 % of the pages in the .es domain have a title
field, and the average size of the titles is about 40 characters; that is, titles are, in
general, very short.
The META Description tag is present in only 16.9 % of the documents, with an average
size of 38.6 characters. Of the documents with a META Description tag, in 7.4 % the
content of the tag is identical to the title field.
Keywords (META Keywords tag) are present in 24.7 % of the documents, with
7.7 keywords per document on average (a keyword is not a term, but every expression delimited
by a semicolon inside the tag; so there are keywords which are multiword expressions).
24.7 % of the documents do not receive any link (from the documents in the collection);
documents with backlinks receive an average of 9 per document. The text of these backlinks is
very short (18.7 characters on average) but, perhaps, very significant.
So it seems clear that, except for the body field, the other elements have limited
importance, as they are absent from many documents.
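For the fusion of the lists produced by the retrieval on each index, a z-score normalization
of the similarity values [2] was performed and then the lists were merged with the CombMNZ
algorithm [7], adapted to weight in different ways the results obtained with each index:
<math>Score = \left( \sum_{i=1}^{n} score_i \cdot k_i \right) \cdot \left| \{\, i : score_i \neq 0 \,\} \right|</math>
where score_i is the normalized similarity of the document in index i, k_i is the weight
assigned to that index, and the second factor is the number of indexes in which the document
obtained a nonzero score.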
There are several combination procedures [7], [11], [14], [1]. Most of them are based on
combining the similarity values obtained after running the query on each index; nevertheless,
we can also work with the rank positions in the lists of documents retrieved from each index
[12]. This latter approach has the advantage of simplicity, since it is not even necessary to
normalize the similarity values.
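The score-based fusion described above can be sketched as follows. This is a minimal illustration, not the Karpanta code: the function names, the index names and the weights are hypothetical, and a document is counted as a hit when it appears in a result list (rather than testing its score against zero), because a z-score can legitimately be zero.

```python
from statistics import mean, pstdev

def z_scores(run):
    """Normalize one result list {doc_id: similarity} to z-scores."""
    values = list(run.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {doc: 0.0 for doc in run}
    return {doc: (s - mu) / sigma for doc, s in run.items()}

def weighted_comb_mnz(runs, weights):
    """runs: {index_name: {doc_id: similarity}}, weights: {index_name: k}.

    Returns [(doc_id, fused_score)] sorted from best to worst.
    """
    total, hits = {}, {}
    for name, run in runs.items():
        if not run:          # an index may return nothing for a query
            continue
        for doc, z in z_scores(run).items():
            total[doc] = total.get(doc, 0.0) + weights[name] * z
            hits[doc] = hits.get(doc, 0) + 1
    # CombMNZ: weighted sum multiplied by the number of lists containing doc.
    fused = {doc: total[doc] * hits[doc] for doc in total}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical similarities for one query in two of the indexes.
runs = {"body": {"a": 3.0, "b": 1.0}, "title": {"a": 2.0, "c": 1.0}}
ranking = weighted_comb_mnz(runs, {"body": 1.0, "title": 1.0})
```

With plain z-scores a document retrieved with below-average similarity contributes a negative term; shifting the normalized scores so that they are non-negative is a common variation.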
3.2 Finding home pages
First we must determine which queries are about home pages. The concept of home page is
nevertheless fuzzy; not everybody would consider the correct answers to some queries to be
home pages.
In an exploratory phase, we manually examined several home pages from the .es domain; in
particular, we examined the title field, since we think that a query searching for such a page
is probably quite similar to its title. Besides, we examined the home page queries used in
TREC. They are in English but, once translated into Spanish, they can approximate the structure
and characteristics of this kind of query.
In this exploratory phase we observed some common elements in the structure of the home
page queries. This structure relies on the use of certain terms related to the searched home
page. Pages of this kind are entry pages to the web sites of certain institutions: ministries,
institutes, centers, etc. So these terms will be present in the query [2].
Besides, they appear in certain positions inside the query, and they are preceded and
followed by certain auxiliary words (articles and other connectors). This allowed us to
build a set of home page query patterns, to which we added a simple heuristic: the presence of
expressions such as home page, portal, etc.
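The actual patterns are not listed in the text, so the following is only a sketch of the idea under stated assumptions: the institution terms, the cue expressions and the single regular expression are all hypothetical stand-ins for the real pattern set.

```python
import re

# Hypothetical vocabulary; the real lists used in the runs are not given.
INSTITUTION_TERMS = ("ministerio", "instituto", "centro")
CUE_EXPRESSIONS = ("home page", "portal", "página principal")

# Optional article + institution term + connector, at the start of the query.
_PATTERN = re.compile(
    r"^(?:(?:el|la|los|las)\s+)?(?:%s)s?\s+(?:de|del|para)\b"
    % "|".join(INSTITUTION_TERMS)
)

def looks_like_home_page_query(query):
    """Return True when the query matches a home page pattern or cue."""
    q = " ".join(query.lower().split())   # lowercase, collapse whitespace
    if any(cue in q for cue in CUE_EXPRESSIONS):
        return True
    return _PATTERN.search(q) is not None

for topic in ("Ministerio de Cultura", "portal del ciudadano", "ley de aguas 1985"):
    print(topic, looks_like_home_page_query(topic))
```

Queries that match nothing are simply left unclassified and treated as named page queries in this sketch.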
With this technique we correctly identified 32 home page queries, 4 were erroneously
considered as home page queries, and 27 could not be classified.
Once the presumed home page queries had been identified in this way, the results of a retrieval
made with the fusion of evidence described above were re-ranked so that the relevant
pages most likely to be home pages came first.
There are several techniques to determine which retrieved pages can be home pages. These
techniques are not mutually exclusive and can be combined. The best-known ones use two types
of information: the structure of the page URL, and link analysis.
Techniques based on URL structure work with the URL depth. [10] studied the statistical
distribution of home pages over several URL depth levels, as did [2]. [13] also use techniques
based on URL length, as does [15].
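The runs described in this paper do not use URL evidence; purely to make the notion of URL depth concrete, a minimal sketch follows. The function name and the rule that ignores a trailing index or default file are assumptions, not taken from the cited works.

```python
from urllib.parse import urlsplit

def url_depth(url):
    """Number of path segments below the site root (0 for a root page)."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # Assumption: a trailing index.* or default.* file names the directory itself.
    if segments and segments[-1].lower().startswith(("index.", "default.")):
        segments.pop()
    return len(segments)

print(url_depth("http://www.example.es/"))
print(url_depth("http://www.example.es/a/b/page.html"))
```

Under such a measure, entry pages tend to sit at small depths, which is the regularity the cited studies exploit.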
Techniques based on link analysis are also widely used. Although considered less useful
in searches by content, they seem effective for retrieving home pages [8]. Several coefficients
are used, from the simple in- and out-degrees [18] to the more sophisticated Page-Rank [19] or
HITS [4] algorithms.
We have tried Page-Rank [3] and Centrality [9], both based on backlinks.
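A minimal sketch of the two link-based coefficients and of a re-ranking step is given below. Several points are assumptions: Centrality is read here as a plain backlink count (in-degree), Page-Rank is computed by a textbook power iteration with damping 0.85, and, since the text does not say how the coefficient is mixed with content similarity, `rerank` simply re-sorts the top k retrieved documents by the coefficient. All names are hypothetical.

```python
def all_pages(links):
    """links: {page: [pages it links to]} -> set of every page mentioned."""
    return set(links) | {t for targets in links.values() for t in targets}

def in_degree(links):
    """Backlink count per page (self-links and repeated links ignored)."""
    counts = {p: 0 for p in all_pages(links)}
    for src, targets in links.items():
        for t in set(targets):
            if t != src:
                counts[t] += 1
    return counts

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Page-Rank; dangling pages spread rank uniformly."""
    pages = sorted(all_pages(links))
    n = len(pages)
    out = {p: [t for t in set(links.get(p, ())) if t != p] for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not out[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            if out[p]:
                share = damping * rank[p] / len(out[p])
                for t in out[p]:
                    new[t] += share
        rank = new
    return rank

def rerank(results, prior, top_k=20):
    """Re-sort the first top_k retrieved documents by a link-based prior."""
    head = sorted(results[:top_k], key=lambda d: prior.get(d, 0.0), reverse=True)
    return head + results[top_k:]

# Hypothetical four-page site in which every page links back to "home".
links = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "c": ["home"]}
print(rerank(["a", "home", "c"], in_degree(links), top_k=3))
```

Because `sorted` is stable, documents with equal coefficients keep their original content-based order.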
4 Runs submitted
Our goal is to determine which elements or pieces of evidence are useful in a content-based
search, and also to test the effectiveness of coefficients based on link analysis for finding
home pages.
Official results are given in Table 2. Run USAL0 acts as baseline; it consists of queries in
Spanish against the pages of the .es domain of the EuroGov collection. In this run, we work
with the body field only.
Run USAL1 combines the results of the fields body, title, META Description and text of backlinks
of every page.
Run USAL2 adds the field META Keywords to USAL1. Runs USAL3 and USAL4 apply specific
techniques to find home pages. On the documents retrieved by run USAL1, an attempt is made to
detect the home page topics, and the results for those topics are then re-ranked with Page-Rank
(USAL3) and Centrality (USAL4).
4.1 Evaluation
Table 2 shows the results of the official evaluation of the submitted runs. However, we have
already seen some problems with the queries (duplicated ones, right answers in other domains).
So we have carried out an unofficial evaluation (Table 3), removing erroneous topics: duplicated
(even triplicated) ones, those whose right answer is outside the .es domain, and badly
formulated queries. The classification into home and named pages, although debatable, has been
left as it was.
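Since every topic has a single right answer, the measures reported in the tables (success at k and MRR) reduce to simple functions of the rank of that answer. The sketch below shows one way to compute them; the function names and the toy data are hypothetical, and this is not the official evaluation script.

```python
def reciprocal_rank(ranked, correct):
    """1/position of the correct page in the ranked list, or 0 if absent."""
    for position, doc in enumerate(ranked, start=1):
        if doc == correct:
            return 1.0 / position
    return 0.0

def evaluate(runs, qrels, cutoffs=(1, 5, 10, 20, 50)):
    """runs: {topic: [doc_ids in rank order]}, qrels: {topic: correct doc_id}."""
    n = len(qrels)
    rr = [reciprocal_rank(runs.get(topic, []), answer)
          for topic, answer in qrels.items()]
    report = {"MRR": sum(rr) / n}
    for k in cutoffs:
        # The answer is within the first k positions iff 1/position >= 1/k.
        report["success at %d" % k] = sum(1 for r in rr if r >= 1.0 / k) / n
    return report

# Toy data: the answer is first for t1 and third for t2.
toy_runs = {"t1": ["x", "y"], "t2": ["p", "q", "r"]}
toy_qrels = {"t1": "x", "t2": "r"}
print(evaluate(toy_runs, toy_qrels))
```

Removing a topic from the qrels, as done for the unofficial evaluation, changes both the numerators and the denominator n of every measure.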
4.2 Results
Measure         USAL0   USAL1   USAL2   USAL3   USAL4
success at 1    0.1343  0.1642  0.1567  0.1940  0.1567
success at 5    0.3134  0.4254  0.3657  0.4776  0.4179
success at 10   0.3731  0.5000  0.4776  0.5522  0.4925
success at 20   0.3955  0.5970  0.5821  0.6493  0.6269
success at 50   0.6269  0.7463  0.7090  0.7537  0.7313
MRR             0.2193  0.2796  0.2553  0.3214  0.2776
Table 2: Results of the Official Evaluation
Measure         USAL0   USAL1   USAL2   USAL3   USAL4
success at 1    0.1622  0.1982  0.1892  0.2162  0.1892
success at 5    0.3694  0.5135  0.4414  0.5586  0.5045
success at 10   0.4324  0.6036  0.5676  0.6486  0.5946
success at 20   0.4595  0.6847  0.6667  0.7207  0.7117
success at 50   0.7117  0.8378  0.7928  0.8468  0.8378
MRR             0.2611  0.3339  0.3045  0.3667  0.3255
Table 3: Unofficial Evaluation
It seems clear that working with more elements, in addition to the body field, improves retrieval.
This is true of the title, META Description and the text of the backlinks. However,
including META Keywords makes the results worse. This may be surprising (some simplistic
retrieval systems are based only on this field), but if we examine how pages use this field,
we see that the use is, at the very least, strange. Table 4 shows the most used keyword
expressions (not individual terms) in the .es domain.
Most of them are very generic expressions, of little use for searches on a
governmental collection. Some are included in pages that are also translated into English; some
are included directly in English, with no Spanish version (although the language of the rest of
the page is Spanish).
A manual examination of some pages of the collection shows that there are pages (especially
home pages of certain institutions) with, literally, hundreds of keywords. In some cases, these
lists of keywords are inherited without variation by the rest of the pages of the site. This
probably has to do with some myths that circulate about the way in which search engines
find and rank pages. Some pages repeat the same keyword many times, in the hope that search
engines will place them in the first positions of the list.
As for locating home pages, it seems that using patterns to distinguish home page
queries and treating them specifically works, since runs USAL3 and USAL4 improve on the
previous ones. Of these two, Centrality produces the better results for detecting home pages.
Centrality is simpler and does not discriminate between backlinks, but it seems that home pages
are not necessarily the most prestigious pages.
5 Conclusions
We have described our participation in WebCLEF 2005, based on content retrieval by
means of the fusion or combination of different elements, as well as on the use of coefficients
coming from link analysis to locate home pages.
The use of information elements such as the TITLE or the text of backlinks clearly improves
retrieval, even though many pages lack a TITLE or backlinks, and even though the texts of many
backlinks are very short. Nevertheless, the keywords introduced by the authors of the pages are
of little help and do not produce good results.
Coefficients based on link analysis, like Page-Rank or the simple Centrality coefficient,
help to locate home pages.
keyword times
cultura 1864
ministerio 1624
investigacion 1202
spain 1174
administracion 1171
politica 1169
informacion 1169
policy 1168
ministry 1168
research 1168
telecommunications 1168
information 1157
españa 1157
industria 1126
turismo 1119
comercio 1080
energia 1012
telecomunicaciones 990
industry 962
trade 962
commerce 962
energy 962
tourism 962
parques nacionales 658
Table 4: Most frequent keywords in .es
References
[1] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked
retrieval systems. In Proceedings of the 17th Annual International ACM-SIGIR Conference on
Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special
Issue of the SIGIR Forum). ACM/Springer-Verlag, 1994.
[2] Steve Beitzel, Eric Jensen, Rebecca Cathey, Ling Ma, David Grossman, Ophir Frieder, Abdur
Chowdury, Greg Pass, and Herman Vandermolen. Task classification and document structure
for known-item search. In TREC12 [16].
[3] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search
engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[4] Mohamed Farah and Daniel Vanderpooten. Novel approaches in text information retrieval.
Experiments in the web track of TREC-2004. In TREC13 [17].
[5] C. G. Figuerola, Á. F. Zazo Rodríguez, J. L. Alonso Berrocal, and E. Rodríguez. Karpanta: Un
motor de búsqueda para la investigación experimental en recuperación de la información. In
IBERSID 2003, Zaragoza, Spain, 2003.
[6] Carlos G. Figuerola, Ángel F. Zazo, Emilio Rodríguez Vázquez de Aldana, and José Luis
Alonso Berrocal. La recuperación de información en español y la normalización de términos.
Revista Iberoamericana de Inteligencia Artificial, 8(22):135-145, 2004.
[7] E. A. Fox and J. A. Shaw. Combination of multiple searches. In Overview of the Third Text
REtrieval Conference (TREC-3), pages 243-252. NIST Special Publication 500-226, 1994.
[8] David Hawking and Nick Craswell. Very large scale retrieval and web search. In Ellen Voorhees
and Donna Harman, editors, TREC: Experiment and Evaluation in Information Retrieval.
MIT Press, 2005. http://es.csiro.au/pubs/trecbook for website.pdf
[9] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S.
Tomkins. The web as a graph: measurements, models, and methods. Lecture Notes in
Computer Science, 1627, 1999.
[10] W. Kraaij, T. Westerveld, and D. Hiemstra. The importance of prior probabilities for entry
page search. In 25th Annual International ACM SIGIR Conference, pages 27-34. Association
for Computing Machinery, 2002.
[11] Joon Ho Lee. Combining multiple evidence from different relevance feedback methods.
Technical report, Center for Intelligent Information Retrieval (CIIR), Department of Computer
Science, University of Massachusetts, 1996.
[12] Joon Ho Lee. Analyses of multiple evidence combination. In SIGIR '97: Proceedings of the
20th annual international ACM SIGIR conference on Research and development in information
retrieval, pages 267-276, New York, NY, USA, 1997. ACM Press.
[13] V. Plachouras, I. Ounis, C. J. van Rijsbergen, and F. Cacheda. University of Glasgow at the
web track: Dynamic application of hyperlink analysis using the query scope. In TREC12
[16], page 646.
[14] P. Thompson. A combination of expert opinion approach to probabilistic information retrieval,
part 1: The conceptual model. Information Processing and Management, 26(3):371-382, 1990.
[15] Stephen Tomlinson. Robust, web and terabyte retrieval with Hummingbird SearchServer at
TREC 2004. In TREC13 [17].
[16] The Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, Maryland, 2003. NIST
Special Publication 500-255, 2003.
[17] The Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland (USA). NIST
Special Publication 500-261, 2004.
[18] K. Yang and D. Albertson. WIDIT in TREC 2004 genomics, HARD, robust and web tracks. In
TREC13 [17].
[19] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft
Cambridge at TREC-13: Web and HARD tracks. In TREC13 [17].