Introduction

Query Expansion by Mining User Logs. In: IEEE Transactions on Knowledge and Data Engineering

A Search Engine based on Query Logs, and Search Log Analysis at the University of Sunderland

Michael Oakes

michael.oakes@sunderland.ac.uk 0

Yan Xu

yan.xu-1@sunderland.ac.uk 0

Free Keywords: Automatic Language Identification

Multilingual Search Engine

Query Logs.

0 0 University of Sunderland, Dept. of Computing, Engineering and Technology , DGIC, St. Peter's Campus, Sunderland SR6 0DD, England

2002

15 4 829 839

This work describes a variation on the traditional Information Retrieval paradigm, where instead of text documents being indexed according to their content, they are indexed according to the search terms previous users have used in finding them. We determine the effectiveness of this approach by indexing a sample of query logs from the European Library, and describe its usefulness for multilingual searching. In our analysis of the search logs, we determine the language of the past queries automatically, and annotate the search logs accordingly. From this information, we derive matrices to show that a) users tend to persist with the same query language throughout a query session, and b) submit queries in the same language as the interface they have selected, except in a large number of cases where the English interface is used to submit Latin queries. ACM Categories and Subject Descriptors: H Information Systems; H3 Information Storage and Retrieval; H3.3 Information Search and Retrieval; Search Process. A number of authors have previously used search logs to improve search engine performance. Hoy and Lyu (2004) used search logs for relevance feedback, a process whereby judgements on the quality of documents initially retrieved by a search engine are used to retrieve even more relevant documents at a second stage. Normally these judgements would be made by the user who had submitted the original query, a task which many search engine users find laborious. In this approach however, the judgements are effectively made by previous users who, as recorded in the search logs, chose to either download or not download the same documents. Cui et al. (2003) identified correlations between query terms and document terms by analysing search logs with a probabilistic model. The hypothesis behind search log-based approaches to search engine design is that previous users' choices are of interest to new users who input similar queries. We take an extreme position on this, by rather than indexing documents based on their content as in conventional search engines, we index documents solely with the terms past users have used in searching for them, as found in the search logs. Collated under each downloaded document ID will be the terms of every query ever submitted in a session leading up to the downloading of that document. Thus query terms from separate sessions leading to the retrieval of the same document will all become index terms for that document. This approach is beneficial for multilingual searching, since each previously downloaded document is indexed by all search terms which have ever been used in a search for that image, irrespective of which language they were in. Thus a document might be indexed by search terms in various languages. The user can then submit search terms in the language of his or her choice. If some of the index terms of a relevant document are in that language, there is a chance that these will match the query terms, allowing the user to retrieve this document in its original language. The limitation of this query log approach is that if previous users have never downloaded a particular image, then that image can never be retrieved by this technique. This is a practical limitation of the training data, not a theoretical limitation, but shows the need for large amounts of search log training data.

Introduction

session in the LogCLEF search logs. Query records for which the seventh field was “search xxx”, “available_at” or “see_online” were assumed to indicate the downloading of the relevant URL cited in field 12. These URLs would then be indexed by all the search terms submitted in either that session or any other session which resulted in the downloading of that same URL.

The queries were not stop listed, since they were multilingual, which would require a stop list for each language and increase the danger that a stop word in one language might be meaningful in another. Instead, a weighting scheme was used so that each query term would be given a weight reflecting its importance with respect to each document. This measure, called TF.IDF, assigns low weights to words with low information content. The formula is as follows: ⎛ NDoc ⎞ wkd = fkd ⋅ log⎝⎜⎜ Dk ⎠⎟⎟ In a conventional search engine, where a document is indexed based on its content, f kd is the frequency of term k in document d, Dk is the number of documents which include that term, NDoc is the total number of documents in the collection, and wkd is the weight reflecting the importance of term k to document d. In our approach, f kd is the number of times query term k has been used in a search for document d. Dk is number of documents which have been accessed using that query term, and NDoc is the total number of documents downloaded at some stage in the search logs. Thus wkd is the weight reflecting the typicality of term k with respect to session d. The highest TF.IDF scores are given to those terms which are often used in searches for the document of interest, but are not used in searching for many other documents.

Now that all the previously downloaded documents have been indexed, we can match the queries of future users against the document index terms using the cosine similarity coefficient, as in a conventional search engine. The documents are ranked according to their similarity to the user’s query, and the best matching documents are presented to the user. The architecture of our query log-based search engine is shown in Figure 1. For training the search engine we used 1399747 records (the first 75%) and for testing we used 466582 (the remaining 25%). It was necessary to sort the logs into chronological order in order to separate the test and the training set. If we define a search session as all records with a common session ID, the LogCLEF file contained 225376 sessions. Average session length according to this definition was 1399747 / 225376 = 6.21 lines. For the search engine approach, however, we defined a search session as being a sequence of records with the same ID which included one of the following commands, indicative of the retrieval of a relevant record: search xxx, see online, available at. Some of these sessions recorded more than one retrieved URL.

For each URL in the test set, we combined all the query terms used in the same session into a single text query. We then matched this query against each of the query term sets collated under URLs in the test set. We assumed that the URL retrieved in the test session was the gold standard, and wished to determine the search engine’s ability to retrieve the session (if there was one) with the same URL from the training set. There were 8586 URLs in the test set sessions, and 284 of them matched previous records retrieving the same URL in the top 100 best matching training set sessions. Thus the percentage of matched URLs was 3.32%.

Analysis of the Search Logs

Since the indexing process described in section 2 requires the set of search logs to be read in, it would be possible to incorporate modules (at present written as separate programs) into the search engine for the analysis of these logs. The overall procedure we have followed for search log analysis is as follows: The search logs are read in, and we find the most likely language of the query terms on each line. Each line of the search log is then annotated with the name of the query language on that line, or “null” if no query is present, as described in section 5. The frequencies of each query language used in the first 100000 lines of the logs are given in Section 6. Given the sequence of query languages in the logs, we determine the likelihood of a query in one language (or a new session) being followed by a query in each of the other languages, another query in the same language, or the end of the session. This program is described in section 7. For each interface language, we determine the frequency of the query languages used, as described in section 8. 5

Automatic Language Identification

Souter et al. (2004) describe a technique for automatic language identification, based on trigram (sequences of three adjacent characters) frequencies. They found that using knowledge of the frequencies of the trigrams occurring in 9 different languages, they were able to accurately identify a language from a sample of just 175 characters. We implemented this method to determine the language of each of the queries recorded in the search log. The method gave subjectively reasonable results, although we anticipate less than 100% accuracy, since our query samples were shorter than those used by Souter et al., typically about two words long. To estimate the trigram frequencies typical of a set of languages, we used the Europarl corpus which contains transcripts of meetings of the European Parliament in each of 11 languages. The languages of Europarl are Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish. We counted the frequency of each trigram in a sample of just over one million characters for each language. The frequency of each trigram was divided by the total number of characters (minus two) in the sample of that language, to give the probabilility of that trigram being selected randomly from a text in that language, and stored in an external file. For example, the sequence “TZE” was found 37 times in the sample of 1,058,761 characters (1,058,759 trigrams) of German, giving a probability of occurrence of about 3.49 E -5 (3.49 times 10 to the power -5). To prevent trigrams which were not found in the million word sample being regarded as impossible in that language, we set a default value of 0.5 divided by the number of trigrams in the sample as the probability of that trigram. In our system, queries to be analysed had to consist of at least one character. We automatically assigned initial and terminal blank characters to each query.

As an example, the overall probability of encountering the sequence _KATZE_ (where the underscore denotes a space character) was found by multiplying the five constituent trigram probabilities for each language (shown in Table 1), and multiplying them together to give an overall probability. Since the overall probability was much greater for German than the other two languages, KATZE is more likely to be a German word than an English or French one. Using this method, each of the first 100000 queries in the original search logs was annotated with the most likely language of that query. Only the query field in the search logs was required for this process. According to the automatic language recognition program, the languages of the 100000 individual queries were (from the set of 11 Europarl languages) were as shown in Table 2. English was the most commonly used query language, followed by Italian, but as discussed in section 8, many of the “Italian” queries may in fact have been in Latin. In almost 10% of the query lines, no text was submitted.

Language

English Italian German NULL French Dutch Spanish

Finnish Portuguese

Swedish Greek The purpose of this experiment was to use the search logs annotated with the most likely language of the query to find whether users tended to stick with one language throughout a search session, or whether they tended to change languages in mid-session as part of the query reformulation process. If they did seem to change languages mid-session, which languages did they most commonly change from and to? The time-ordered set of query logs was scanned, and each time the session ID did not match the session ID of the previous query, it was assumed that a new query had begun. Otherwise, if the previous and current session IDs were the same, entry [previous_state][current_state] in the matrix was incremented by 1. The results are shown in Table 3, where the earlier states are on the vertical axis, while the later states are found on the horizontal axis. “new” denotes the start or end of a session, and “null” indicates no query was submitted at this stage. The most common language for the first query of a session was English, followed by German. In the vast majority of cases, as shown by the high values on the principal diagonal, users having submitted a query in one language tended to use the same language for the next query. If users did change language mid-session, there was a slight tendency to change into English, shown by the slightly higher values in the final column. The other values in the matrix may represent a “noise floor” due to incorrect assignments by the automatic language identifier.

Table 3 can be transformed into a Markov model, by dividing each entry by the row total. In a Markov model, events such as the submission of a query in a particular language are regarded as states (denoted by the country A variant of this program was written to calculate the proportion of sessions in which more than one language was used. Here “null”, or no query submitted, was not considered as a language. The relative proportions of sessions consisting of zero, one or more than one language are shown in Table 5. The purpose of this experiment was to answer the question: Do users tend to submit queries in the same language as the interface they have chosen? First, the number of sessions conducted in each interface was collated, as shown in Table 6.

Language code Sr Sk Cs

Nl Lt El Fi

Conclusion

We have developed a search engine, where previously accessed documents are indexed by all the search terms, derived from search logs, that have ever been submitted in the same sessions as those in which that document was downloaded. New queries are matched against the old query terms in the indexes, and documents are ranked by the degree of match between their index terms and the new query. In order to learn about multilingual searching behaviour, we have performed automatic language identification using trigram frequencies at the time the queries are indexed. We were then able to record the sequence of languages used, and build a Markov model to store the relative frequencies of sequences of query languages.

Acknowledgements References

This work was supported by the EU-Funded VITALAS project (project number FP6-045389) http://vitalas.ercim.org Hoi, C-H.; Lyu, M. R. (2004): A Novel Log-Based Relevance Feedback Technique in Content-Based Image

Retrieval. In: ACM Multimedia, pp. 24-31.

Souter D; Churcher G; Hayes, G.; et al (2004) : Natural Language Identification Using Corpus-Based Models.

In: HERMES Journal of Linguistics 13, pp. 183-203.