<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Query Expansion by Mining User Logs. In: IEEE Transactions
on Knowledge and Data Engineering</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Search Engine based on Query Logs, and Search Log Analysis at the University of Sunderland</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Oakes</string-name>
          <email>michael.oakes@sunderland.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Xu</string-name>
          <email>yan.xu-1@sunderland.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Free Keywords: Automatic Language Identification</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Multilingual Search Engine</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Query Logs.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Sunderland, Dept. of Computing, Engineering and Technology</institution>
          ,
          <addr-line>DGIC, St. Peter's Campus, Sunderland SR6 0DD, England</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2002</year>
      </pub-date>
      <volume>15</volume>
      <issue>4</issue>
      <fpage>829</fpage>
      <lpage>839</lpage>
      <abstract>
        <p>This work describes a variation on the traditional Information Retrieval paradigm, where instead of text documents being indexed according to their content, they are indexed according to the search terms previous users have used in finding them. We determine the effectiveness of this approach by indexing a sample of query logs from the European Library, and describe its usefulness for multilingual searching. In our analysis of the search logs, we determine the language of the past queries automatically, and annotate the search logs accordingly. From this information, we derive matrices to show that a) users tend to persist with the same query language throughout a query session, and b) submit queries in the same language as the interface they have selected, except in a large number of cases where the English interface is used to submit Latin queries. ACM Categories and Subject Descriptors: H Information Systems; H3 Information Storage and Retrieval; H3.3 Information Search and Retrieval; Search Process. A number of authors have previously used search logs to improve search engine performance. Hoy and Lyu (2004) used search logs for relevance feedback, a process whereby judgements on the quality of documents initially retrieved by a search engine are used to retrieve even more relevant documents at a second stage. Normally these judgements would be made by the user who had submitted the original query, a task which many search engine users find laborious. In this approach however, the judgements are effectively made by previous users who, as recorded in the search logs, chose to either download or not download the same documents. Cui et al. (2003) identified correlations between query terms and document terms by analysing search logs with a probabilistic model. The hypothesis behind search log-based approaches to search engine design is that previous users' choices are of interest to new users who input similar queries. We take an extreme position on this, by rather than indexing documents based on their content as in conventional search engines, we index documents solely with the terms past users have used in searching for them, as found in the search logs. Collated under each downloaded document ID will be the terms of every query ever submitted in a session leading up to the downloading of that document. Thus query terms from separate sessions leading to the retrieval of the same document will all become index terms for that document. This approach is beneficial for multilingual searching, since each previously downloaded document is indexed by all search terms which have ever been used in a search for that image, irrespective of which language they were in. Thus a document might be indexed by search terms in various languages. The user can then submit search terms in the language of his or her choice. If some of the index terms of a relevant document are in that language, there is a chance that these will match the query terms, allowing the user to retrieve this document in its original language. The limitation of this query log approach is that if previous users have never downloaded a particular image, then that image can never be retrieved by this technique. This is a practical limitation of the training data, not a theoretical limitation, but shows the need for large amounts of search log training data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>session in the LogCLEF search logs. Query records for which the seventh field was “search xxx”, “available_at”
or “see_online” were assumed to indicate the downloading of the relevant URL cited in field 12. These URLs
would then be indexed by all the search terms submitted in either that session or any other session which resulted
in the downloading of that same URL.</p>
      <p>The queries were not stop listed, since they were multilingual, which would require a stop list for each language
and increase the danger that a stop word in one language might be meaningful in another. Instead, a weighting
scheme was used so that each query term would be given a weight reflecting its importance with respect to each
document. This measure, called TF.IDF, assigns low weights to words with low information content. The
formula is as follows:
⎛ NDoc ⎞
wkd = fkd ⋅ log⎝⎜⎜ Dk ⎠⎟⎟
In a conventional search engine, where a document is indexed based on its content, f kd is the frequency of term
k in document d, Dk is the number of documents which include that term, NDoc is the total number of
documents in the collection, and wkd is the weight reflecting the importance of term k to document d. In our
approach, f kd is the number of times query term k has been used in a search for document d. Dk is number of
documents which have been accessed using that query term, and NDoc is the total number of documents
downloaded at some stage in the search logs. Thus wkd is the weight reflecting the typicality of term k with
respect to session d. The highest TF.IDF scores are given to those terms which are often used in searches for the
document of interest, but are not used in searching for many other documents.</p>
      <p>Now that all the previously downloaded documents have been indexed, we can match the queries of future users
against the document index terms using the cosine similarity coefficient, as in a conventional search engine. The
documents are ranked according to their similarity to the user’s query, and the best matching documents are
presented to the user. The architecture of our query log-based search engine is shown in Figure 1.
For training the search engine we used 1399747 records (the first 75%) and for testing we used 466582 (the
remaining 25%). It was necessary to sort the logs into chronological order in order to separate the test and the
training set. If we define a search session as all records with a common session ID, the LogCLEF file contained
225376 sessions. Average session length according to this definition was 1399747 / 225376 = 6.21 lines. For the
search engine approach, however, we defined a search session as being a sequence of records with the same ID
which included one of the following commands, indicative of the retrieval of a relevant record: search xxx, see
online, available at. Some of these sessions recorded more than one retrieved URL.</p>
      <p>For each URL in the test set, we combined all the query terms used in the same session into a single text query.
We then matched this query against each of the query term sets collated under URLs in the test set. We assumed
that the URL retrieved in the test session was the gold standard, and wished to determine the search engine’s
ability to retrieve the session (if there was one) with the same URL from the training set. There were 8586 URLs
in the test set sessions, and 284 of them matched previous records retrieving the same URL in the top 100 best
matching training set sessions. Thus the percentage of matched URLs was 3.32%.</p>
    </sec>
    <sec id="sec-2">
      <title>Analysis of the Search Logs</title>
      <p>Since the indexing process described in section 2 requires the set of search logs to be read in, it would be
possible to incorporate modules (at present written as separate programs) into the search engine for the analysis
of these logs. The overall procedure we have followed for search log analysis is as follows:
The search logs are read in, and we find the most likely language of the query terms on each line. Each line of
the search log is then annotated with the name of the query language on that line, or “null” if no query is present,
as described in section 5. The frequencies of each query language used in the first 100000 lines of the logs are
given in Section 6. Given the sequence of query languages in the logs, we determine the likelihood of a query in
one language (or a new session) being followed by a query in each of the other languages, another query in the
same language, or the end of the session. This program is described in section 7. For each interface language, we
determine the frequency of the query languages used, as described in section 8.
5</p>
    </sec>
    <sec id="sec-3">
      <title>Automatic Language Identification</title>
      <p>Souter et al. (2004) describe a technique for automatic language identification, based on trigram (sequences of
three adjacent characters) frequencies. They found that using knowledge of the frequencies of the trigrams
occurring in 9 different languages, they were able to accurately identify a language from a sample of just 175
characters. We implemented this method to determine the language of each of the queries recorded in the search
log. The method gave subjectively reasonable results, although we anticipate less than 100% accuracy, since our
query samples were shorter than those used by Souter et al., typically about two words long.
To estimate the trigram frequencies typical of a set of languages, we used the Europarl corpus which contains
transcripts of meetings of the European Parliament in each of 11 languages. The languages of Europarl are
Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.
We counted the frequency of each trigram in a sample of just over one million characters for each language. The
frequency of each trigram was divided by the total number of characters (minus two) in the sample of that
language, to give the probabilility of that trigram being selected randomly from a text in that language, and
stored in an external file. For example, the sequence “TZE” was found 37 times in the sample of 1,058,761
characters (1,058,759 trigrams) of German, giving a probability of occurrence of about 3.49 E -5 (3.49 times 10
to the power -5). To prevent trigrams which were not found in the million word sample being regarded as
impossible in that language, we set a default value of 0.5 divided by the number of trigrams in the sample as the
probability of that trigram. In our system, queries to be analysed had to consist of at least one character. We
automatically assigned initial and terminal blank characters to each query.</p>
      <p>As an example, the overall probability of encountering the sequence _KATZE_ (where the underscore denotes a
space character) was found by multiplying the five constituent trigram probabilities for each language (shown in
Table 1), and multiplying them together to give an overall probability. Since the overall probability was much
greater for German than the other two languages, KATZE is more likely to be a German word than an English or
French one. Using this method, each of the first 100000 queries in the original search logs was annotated with
the most likely language of that query. Only the query field in the search logs was required for this process.
According to the automatic language recognition program, the languages of the 100000 individual queries were
(from the set of 11 Europarl languages) were as shown in Table 2. English was the most commonly used query
language, followed by Italian, but as discussed in section 8, many of the “Italian” queries may in fact have been
in Latin. In almost 10% of the query lines, no text was submitted.</p>
      <sec id="sec-3-1">
        <title>Language</title>
        <p>English
Italian
German
NULL
French
Dutch
Spanish</p>
        <p>Finnish
Portuguese</p>
        <p>Swedish
Greek
The purpose of this experiment was to use the search logs annotated with the most likely language of the query
to find whether users tended to stick with one language throughout a search session, or whether they tended to
change languages in mid-session as part of the query reformulation process. If they did seem to change
languages mid-session, which languages did they most commonly change from and to? The time-ordered set of
query logs was scanned, and each time the session ID did not match the session ID of the previous query, it was
assumed that a new query had begun. Otherwise, if the previous and current session IDs were the same, entry
[previous_state][current_state] in the matrix was incremented by 1. The results are shown in Table 3, where the
earlier states are on the vertical axis, while the later states are found on the horizontal axis. “new” denotes the
start or end of a session, and “null” indicates no query was submitted at this stage.
The most common language for the first query of a session was English, followed by German. In the vast
majority of cases, as shown by the high values on the principal diagonal, users having submitted a query in one
language tended to use the same language for the next query. If users did change language mid-session, there
was a slight tendency to change into English, shown by the slightly higher values in the final column. The other
values in the matrix may represent a “noise floor” due to incorrect assignments by the automatic language
identifier.</p>
        <p>Table 3 can be transformed into a Markov model, by dividing each entry by the row total. In a Markov model,
events such as the submission of a query in a particular language are regarded as states (denoted by the country
A variant of this program was written to calculate the proportion of sessions in which more than one language
was used. Here “null”, or no query submitted, was not considered as a language. The relative proportions of
sessions consisting of zero, one or more than one language are shown in Table 5.
The purpose of this experiment was to answer the question: Do users tend to submit queries in the same language
as the interface they have chosen? First, the number of sessions conducted in each interface was collated, as
shown in Table 6.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Language code Sr Sk Cs</title>
        <p>Nl
Lt
El
Fi</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have developed a search engine, where previously accessed documents are indexed by all the search terms,
derived from search logs, that have ever been submitted in the same sessions as those in which that document
was downloaded. New queries are matched against the old query terms in the indexes, and documents are ranked
by the degree of match between their index terms and the new query. In order to learn about multilingual
searching behaviour, we have performed automatic language identification using trigram frequencies at the time
the queries are indexed. We were then able to record the sequence of languages used, and build a Markov model
to store the relative frequencies of sequences of query languages.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements References</title>
      <p>This work was supported by the EU-Funded VITALAS project (project number FP6-045389)
http://vitalas.ercim.org
Hoi, C-H.; Lyu, M. R. (2004): A Novel Log-Based Relevance Feedback Technique in Content-Based Image</p>
      <p>Retrieval. In: ACM Multimedia, pp. 24-31.</p>
      <p>Souter D; Churcher G; Hayes, G.; et al (2004) : Natural Language Identification Using Corpus-Based Models.</p>
      <p>In: HERMES Journal of Linguistics 13, pp. 183-203.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>