
Pruning Texts with NLP and Expanding Queries with an Ontology: TagSearch

Gil FRANCOPOULO www.tagmatica.com

Our basic approach is, first, to use natural language processing to prune the texts and the query and, second, to use an ontology to expand the queries. The system described here is based on the one used last year for CLEF-2002, but most components have been improved and some new steps have been added. The system, named TagSearch, is based on three main components:

- A chunker, named TagChunker.
- Lucene, a good open-source search engine written by Doug Cutting and his friends.
- An ontology, named TagDico.

The first two components were used last year. The use of an ontology is new.

connected by an AND. But instead of building one big query, every combination of the expansions is built, producing many queries. For each query, the number of terms is computed, and evaluation starts with the queries that have the fewest terms.
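The combination step can be sketched as follows. This is only a minimal illustration, not the TagSearch code: the function name `expansion_queries`, the sample expansions, and the `AND`-joined string form are assumptions made for the example.

```python
from itertools import product


def expansion_queries(groups):
    """Build one AND-query per combination of expansions, fewest terms first.

    `groups` holds one list per query word; each entry of that list is an
    alternative expansion, given as a tuple of one or more terms.
    """
    combos = []
    for choice in product(*groups):
        # Flatten the chosen expansions into a single list of terms.
        terms = [term for expansion in choice for term in expansion]
        combos.append(terms)
    # Stable sort: queries with the fewest terms are evaluated first.
    combos.sort(key=len)
    return [" AND ".join(terms) for terms in combos]


# Hypothetical expansions for a two-word query.
groups = [
    [("syndicats",)],
    [("Royaume", "Uni"), ("Europe",), ("Italie",)],
]
queries = expansion_queries(groups)
for query in queries:
    print(query)
```

Because each expansion may contain several terms, the generated queries differ in length, which is what makes the ordering by number of terms meaningful.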

Step-3: a query is built as in step-1, but with OR instead of AND between words.

Step-4: queries are built as in step-2, but with OR instead of AND between groups.

A few words on expansion

The goal of expansion is to find documents that are on the subject we are searching for but that do not contain the exact word used in the query. Take, for instance, the query “Les syndicats en Europe” (“Trade unions in Europe”, query C156). Imagine that the pool of texts contains a text about “syndicats en Italie” that does not contain the word “Europe”. If the word “Europe” is not expanded, this document cannot be found. Meronymy expansion, which adds the parts of “Europe” such as “Italie”, is therefore the only way to find this document.
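The effect of meronymy expansion on this example can be illustrated with a small sketch. The `PARTS_OF` table is a hypothetical stand-in for the ontology, and the whitespace tokenization and the `expand` and `matches` helpers are assumptions made for the illustration; they do not describe TagDico or TagChunker.

```python
# Hypothetical part-of table: whole -> parts (stand-in for the ontology).
PARTS_OF = {"europe": ["italie", "france", "espagne"]}


def expand(word):
    """Return the word itself followed by its known parts, if any."""
    return [word] + PARTS_OF.get(word, [])


def matches(text, groups):
    """True if every group has at least one alternative present in the text."""
    words = set(text.lower().split())
    return all(any(alt in words for alt in group) for group in groups)


document = "Les syndicats en Italie"
query_words = ["syndicats", "europe"]

plain = [[word] for word in query_words]
expanded = [expand(word) for word in query_words]

found_plain = matches(document, plain)
found_expanded = matches(document, expanded)
print(found_plain, found_expanded)
```

Without expansion the document is missed because it never mentions “Europe”; with the expanded group, “Italie” satisfies the second query word.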

A few words on document ranking

The documents must be ranked, and a document found during step-1 must have a higher rank than one found during step-2 or step-3.

I use the score produced by Lucene as a basis and multiply it by a factor less than one that reflects the rank of the query that retrieved the document.
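A minimal sketch of this weighting is given below. The factor values in `STEP_FACTOR` are hypothetical (the actual numbers are not given here, and leaving step-1 at 1.0 is an assumption), as is the rule of keeping the score from the earliest step in which a document appears.

```python
# Hypothetical per-step factors; later steps are penalized more heavily.
STEP_FACTOR = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.125}


def merge_results(results_by_step):
    """Merge per-step hits into a single ranked list.

    `results_by_step` maps a step number to a list of
    (doc_id, engine_score) pairs returned by the search engine.
    """
    best = {}
    for step in sorted(results_by_step):
        for doc_id, score in results_by_step[step]:
            weighted = score * STEP_FACTOR[step]
            # Assumption: keep the score from the earliest step only.
            if doc_id not in best:
                best[doc_id] = weighted
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


ranked = merge_results({
    1: [("doc-a", 0.5)],
    2: [("doc-a", 0.75), ("doc-b", 0.75)],
    3: [("doc-c", 0.75)],
})
for doc_id, score in ranked:
    print(doc_id, score)
```

Note that multiplication alone does not strictly guarantee that every step-1 document outranks every step-2 document: a step-1 hit with a very low engine score could still fall below a strong step-2 hit unless the factors are chosen small enough for the score range at hand.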

Results

As far as I can tell from the comparative results, our results seem reasonable: we are not among the worst-performing systems.

Future

Because the lexicon is rather rich, a problem occurs: in cases of polysemy, the system does not prune the meanings that are described in the lexicon but that do not fit the context of the sentence. As a result, the expansion is sometimes noisy.

The system could be improved by adding a semantic disambiguation component.