<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhisheng Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chong Wang</string-name>
          <email>chwang@microsoft.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xing Xie</string-name>
          <email>xingx@microsoft.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei-Ying Ma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Geographical Information Retrieval, Query Parsing, Task Design, Evaluation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Sci. &amp; Tech. of China</institution>
          ,
          <addr-line>Hefei, Anhui, 230026</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft Research Asia</institution>
          ,
          <addr-line>4F</addr-line>
          ,
          <institution>Sigma Center</institution>
          ,
          <addr-line>No.49, Zhichun Road, Beijing, 100080</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The geo-query parsing task is a sub-task of GeoCLEF 2007 run by Microsoft Research Asia. We provided a query set of 800,000 real queries (in English) from MSN search. The task requires participants to first identify the local queries in the provided query set and then analyze the different components of those local queries. We also provided a sample labeled query set of 100 queries for training. There were six valid submissions from six teams. We selected 500 queries to form our evaluation set. Under a rather strict evaluation criterion and metric, the Miracle team achieves the highest F1-score, 0.488, as well as the highest recall, 0.566, while the Ask team achieves the highest precision, 0.625.</p>
      </abstract>
      <kwd-group>
        <kwd>Geographical Information Retrieval</kwd>
        <kwd>Query Parsing</kwd>
        <kwd>Task Design</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. Task design</title>
      <p>Our goal in this query tagging task is to identify the local queries and extract the corresponding three components
described above. Moreover, we define three types according to the “what” terms, namely “Yellow page”, “Map”
and “Information”. Here we restrict local queries to queries containing EXPLICIT locations.
In our query set, a common local query structure is “what” + “geo-relation” + “where”. The keywords in the
“what” component indicate what users want to search for; “where” indicates the geographic area users are interested in;
“geo-relation” stands for the relationship between “what” and “where”. There are also non-local queries in our
query set, which also need to be recognized.</p>
      <p>For example, for the local query “Restaurant in Beijing, China”, “what” = “Restaurant”, “where” = “Beijing, China”,
and “geo-relation” = “IN”. For another query, “Mountains in the south of United States”, “what” = “Mountains”,
“where” = “United States”, and “geo-relation” = “SOUTH_OF”.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2.1 Task Description</title>
      <p>1) Detect whether the query is a local query or not. A query is defined to be “local” if it contains at least
a “where” component. For example, “pizza in Seattle, WA” is a local query, while “Microsoft software” is a
non-local query. For non-local queries, no further processing is needed.</p>
      <p>2) If the query is local, extract the “where” component and output the corresponding latitude/longitude. For
example, in the query “pizza in Seattle, WA”, “Seattle, WA” will be extracted and the lat/long value (47.59,
-122.33) will be output. Sometimes the terms in the “where” component are ambiguous; in this case, the
participant should output the lat/long value with the highest confidence. A few queries contain multiple
locations, for example, “bus lines from US to Canada”; we tried our best to keep such queries out of our
query set.</p>
      <p>3) Extract the “geo-relation” component from the local query and transform it into a pre-defined relation type.
A suggested relation type list is shown in Table 1. If the relation type you find is not defined in Table 1, you
should categorize it as “UNDEFINED”.</p>
      <p>4) Extract the “what” component from the local query and categorize it into one of three predefined types,
listed below (a minimal parsing sketch follows this list):
a. Map type: users are looking for natural points of interest, like rivers, beaches, mountains, monuments, etc.
b. Yellow page type: users are looking for businesses or organizations, like hotels, restaurants, hospitals, etc.
c. Information type: users are looking for textual information, like news, articles, blogs, etc.</p>
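      <p>To make the intended output concrete, the following is a minimal Python sketch of a rule-based parser under toy assumptions: a hypothetical three-entry gazetteer and a four-entry geo-relation table. It only illustrates the task output and is not the method of any participant.</p>
      <preformat>
# Minimal sketch of a rule-based geo-query parser. The gazetteer and the
# relation table below are toy placeholders, not task-provided resources.
GAZETTEER = {                       # hypothetical: location name -> (lat, long)
    "seattle, wa": (47.59, -122.33),
    "beijing, china": (40.24, 116.42),
    "florida": (28.38, -81.75),
}

GEO_RELATIONS = {                   # surface form -> pre-defined relation type
    "in": "IN",
    "near": "NEAR",
    "south of": "SOUTH_OF",
    "west of": "WEST_OF",
}

def parse_query(query):
    """Return the LOCAL/WHAT/GEO-RELATION/WHERE/LAT-LONG fields for one query."""
    text = query.lower().strip()
    for name, latlong in GAZETTEER.items():
        idx = text.find(name)
        if idx == -1:
            continue                              # no explicit location found yet
        what = text[:idx].rstrip()
        relation = "UNDEFINED"
        # try longer relation phrases first ("south of" before "in")
        for surface, rel_type in sorted(GEO_RELATIONS.items(), key=lambda kv: -len(kv[0])):
            if what.endswith(" " + surface):
                relation = rel_type
                what = what[: -len(surface)]
                break
        return {"LOCAL": "YES", "WHAT": what.strip(),
                "GEO-RELATION": relation, "WHERE": name, "LAT-LONG": latlong}
    return {"LOCAL": "NO"}                        # no gazetteer match: non-local

# parse_query("Restaurant in Beijing, China")
# -> {'LOCAL': 'YES', 'WHAT': 'restaurant', 'GEO-RELATION': 'IN',
#     'WHERE': 'beijing, china', 'LAT-LONG': (40.24, 116.42)}
      </preformat>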
    </sec>
    <sec id="sec-2">
      <title>2.2 Data Set</title>
      <p>We provided a query data set of 800,000 queries. The queries were selected from MSN search logs collected over
fifteen days in August 2006. The queries can be classified into four types according to whether or not they contain locations
and geo-relations. Table 2 shows the number of queries of each type in the data set. Here “relation” means the types
listed in Table 1.
We also provided a sample labeled query set of 100 queries for participants. The format is described in the following
section.</p>
    </sec>
    <sec id="sec-3">
      <title>2.3 Format</title>
      <sec id="sec-3-1">
        <title>2.3.1 Data Set Format</title>
        <p>The query set is provided in XML format. Each query has two fields: &lt;QUERYNO&gt; and &lt;QUERY&gt;. Example:
&lt;QUERYNO&gt;1&lt;/QUERYNO&gt;
&lt;QUERY&gt;Restaurant in Beijing, China&lt;/QUERY&gt;</p>
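        <p>A minimal Python sketch for reading such a file is shown below; it assumes only the &lt;QUERYNO&gt;/&lt;QUERY&gt; pairs illustrated above and makes no assumption about the enclosing root element, which is not shown in this excerpt.</p>
        <preformat>
# Hypothetical reader for the query data set; the file path and the exact
# container structure are assumptions, only the field names come from the task.
import re

PAIR = re.compile(r"&lt;QUERYNO&gt;(.*?)&lt;/QUERYNO&gt;\s*&lt;QUERY&gt;(.*?)&lt;/QUERY&gt;", re.S)

def read_queries(path):
    with open(path, encoding="utf-8") as f:
        return PAIR.findall(f.read())   # list of (queryno, query) pairs
        </preformat>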
      </sec>
      <sec id="sec-3-2">
        <title>2.3.2 Training Set and Result</title>
        <p>The sample labeled set and the results are in the following format. There are six more fields: &lt;LOCAL&gt;,
&lt;WHAT&gt;, &lt;WHAT-TYPE&gt;, &lt;GEO-RELATION&gt;, &lt;WHERE&gt; and &lt;LAT-LONG&gt;.
&lt;QUERYNO&gt;1&lt;/QUERYNO&gt;
&lt;QUERY&gt;Restaurant in Beijing, China&lt;/QUERY&gt;
&lt;LOCAL&gt;YES&lt;/LOCAL&gt;
&lt;WHAT&gt;Restaurant&lt;/WHAT&gt;
&lt;WHAT-TYPE&gt;Yellow page&lt;/WHAT-TYPE&gt;
&lt;GEO-RELATION&gt;IN&lt;/GEO-RELATION&gt;
&lt;WHERE&gt;Beijing, China&lt;/WHERE&gt;
&lt;LAT-LONG&gt;40.24, 116.42&lt;/LAT-LONG&gt;
&lt;QUERYNO&gt;2&lt;/QUERYNO&gt;
&lt;QUERY&gt;Lottery in Florida&lt;/QUERY&gt;
&lt;LOCAL&gt;YES&lt;/LOCAL&gt;
&lt;WHAT&gt;Lottery&lt;/WHAT&gt;
&lt;WHAT-TYPE&gt;Information&lt;/WHAT-TYPE&gt;
&lt;GEO-RELATION&gt;IN&lt;/GEO-RELATION&gt;
&lt;WHERE&gt;Florida, United States&lt;/WHERE&gt;
&lt;LAT-LONG&gt;28.38, -81.75&lt;/LAT-LONG&gt;
If a submission from a team does not contain all queries or all fields, the absent queries or fields will be treated
as errors.</p>
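        <p>The following Python sketch renders one record in the submission format above. The helper name and the layout of the input dictionary are hypothetical; only the field names and their ordering follow the format shown here.</p>
        <preformat>
# Hypothetical formatter for one result record; `parsed` is assumed to hold
# the fields produced by a participant's own parser.
def format_record(queryno, query, parsed):
    lines = ["&lt;QUERYNO&gt;%s&lt;/QUERYNO&gt;" % queryno,
             "&lt;QUERY&gt;%s&lt;/QUERY&gt;" % query,
             "&lt;LOCAL&gt;%s&lt;/LOCAL&gt;" % parsed["LOCAL"]]
    if parsed["LOCAL"] == "YES":
        lines += ["&lt;WHAT&gt;%s&lt;/WHAT&gt;" % parsed["WHAT"],
                  "&lt;WHAT-TYPE&gt;%s&lt;/WHAT-TYPE&gt;" % parsed["WHAT-TYPE"],
                  "&lt;GEO-RELATION&gt;%s&lt;/GEO-RELATION&gt;" % parsed["GEO-RELATION"],
                  "&lt;WHERE&gt;%s&lt;/WHERE&gt;" % parsed["WHERE"],
                  "&lt;LAT-LONG&gt;%.2f, %.2f&lt;/LAT-LONG&gt;" % parsed["LAT-LONG"]]
    return "\n".join(lines)
        </preformat>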
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Evaluation</title>
      <p>The contest is open to any party planning to attend GeoCLEF 2007. A person can participate in only one group.
Multiple submissions are allowed before the deadline, but we only evaluated the last submission from each team.
The participants take the responsibility of obtaining any permission to use any algorithms/tools/data that are the
intellectual property of third parties.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Evaluation Set</title>
      <p>To evaluate the performance of the submissions, we chose a set of queries from the query set to form an evaluation
set. However, if all the queries were chosen randomly, there would be several problems: 1) there are some
typos in the queries, e.g. “beuty”; 2) some queries are ambiguous and difficult to understand, for example “Cambridge”
or “daa files”; 3) many geo-relations do not appear very often, e.g. “NORTH_EAST_TO”, “NORTH_OF”, so it is
difficult to include such cases in the evaluation set if the queries are chosen randomly. We therefore took the
following steps to construct a final evaluation set covering as many different types as possible:
1) choose 800 queries randomly from the query set;
2) manually remove the queries with typos and the ambiguous queries from these 800;
3) manually select queries with special geo-relations from the remaining queries in the query set and add
them to the evaluation set;
4) select 500 queries for the final evaluation set.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Distribution</title>
      <p>Figure 1 shows the distribution of the evaluation set. The three query types map, information and
yellow page together constitute the local queries, which account for 61.4% of the evaluation set.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Labeling approach</title>
      <sec id="sec-7-1">
        <title>3.3.1 Labeling tool</title>
        <p>To improve labeling efficiency, we designed a labeling tool. With its help we can easily identify each part of a
query. Figure 2 shows the interface of the tool.</p>
      </sec>
      <sec id="sec-7-2">
        <title>3.3.2 Label process</title>
        <p>Two experts identified the six fields of these queries according to the task description, including the &lt;LOCAL&gt;,
&lt;WHAT&gt;, &lt;WHERE&gt;, &lt;GEO-RELATION&gt;, &lt;WHAT-TYPE&gt; and &lt;LAT-LONG&gt; of the location in the query. For the
&lt;LOCAL&gt; field, we label a query as “local” only if it contains explicit locations. For the &lt;WHAT&gt; field, we keep
all the terms in the query after extracting the &lt;GEO-RELATION&gt; and the &lt;WHERE&gt; fields. For example, the
&lt;WHAT&gt; field of the query “ambassador suite hotel in Atlanta” is “ambassador suite hotel”. For the &lt;WHAT-TYPE&gt;
field, we define three types, Map type, Yellow page type, and Information type, as described above.
For the &lt;WHERE&gt; field, if the locations are ambiguous, we choose the location with the highest confidence score.
The format is “location name + its upper location name”, e.g. “Atlanta, United States”. Meanwhile, we label the
latitude and longitude of the location point. This value is only for reference here, because the lat-long values from
different participants may vary greatly, especially for “big” locations, e.g. “Asia”, “Canada”.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.4 Evaluation Method</title>
      <p>To evaluate the performance of the participants on the query tagging task, we perform the following three steps. First we
pre-process the submissions of the participants to fix problems such as format errors or missing data. Then we
choose the subset of each submission with the same query numbers as in the evaluation set. Finally, since the format of the
&lt;WHERE&gt; field is not unique, we do not use automatic checking; three experts checked all the submissions
independently and reached a final decision through discussion.</p>
      <sec id="sec-8-1">
        <title>3.4.1 Criterion</title>
        <p>We consider the following criteria in the evaluation process:
1) the &lt;LOCAL&gt; field should be the same as the answer;
2) the terms in the &lt;WHAT&gt; field should be the same as the answer;
3) the &lt;WHAT-TYPE&gt; and &lt;GEO-RELATION&gt; fields should be the same as the answer;
4) the &lt;WHERE&gt; field should contain the location in the original query, no matter whether its upper location is
correct or not;
5) we ignore the &lt;LAT-LONG&gt; field in this evaluation.
If one record in the submission meets all of the above criteria, it is correct; otherwise it is wrong.
</p>
      </sec>
      <sec id="sec-8-2">
        <title>3.4.2 Metric</title>
        <p>We evaluate the submissions based on several evaluation metrics, including Precision, Recall, and F1-score. The
participants do not know which queries will be used for evaluation. Here are the measures we use to evaluate the
results submitted by the participants:</p>
        <p>Precision = Correct_tagged_query_num / all_tagged_query_num</p>
        <p>Recall = Correct_tagged_query_num / all_local_query_num</p>
        <p>F1-score = (2 * Precision * Recall) / (Precision + Recall)</p>
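        <p>For concreteness, a minimal Python sketch of this scoring is shown below; the input representation (a list of per-record correctness flags for one submission) is an assumption, while the formulas follow the definitions above.</p>
        <preformat>
# Minimal scoring sketch; the inputs are assumptions about how a submission
# is represented, only the formulas come from the task definition.
def score(correct_flags, all_tagged_query_num, all_local_query_num):
    correct_tagged_query_num = sum(correct_flags)   # records meeting all criteria
    precision = correct_tagged_query_num / all_tagged_query_num
    recall = correct_tagged_query_num / all_local_query_num
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
        </preformat>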
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. Results</title>
      <p>We only give the results for the local queries. The Miracle team achieves the highest F1-score, 0.488,
as well as the highest Recall, 0.566, while the Ask team achieves the highest Precision, 0.625. Table 3 shows the results
of all participants.</p>
    </sec>
    <sec id="sec-10">
      <title>5. Discussions</title>
      <p>In total, six teams participated in this query tagging task. We found several technical problems during the evaluation.</p>
      <p>1) Failure to classify the local queries. Some local queries were classified as non-local by a few teams, so the recall
for the local queries drops significantly.</p>
      <p>2) The &lt;WHAT&gt; field is incomplete. Some terms in the query are missing. For example, for “apartments to rent
in Cyprus”, the &lt;WHAT&gt; field should be “apartments to rent”, but some participants just output
“apartments”. And for “homer Alaska real estate”, the &lt;WHAT&gt; field should be “homer real estate”, not “homer”
or “real estate”.</p>
      <p>3) Failure to classify the &lt;WHAT-TYPE&gt;, especially for “Yellow Page” and “Information”: a few teams
classified “Yellow Page” queries as “Information”. Frankly speaking, it is sometimes really hard to
differentiate “Yellow Page” from “Information” because of the ambiguity. For example, for “Kansas state
government”, if you want to know information about the state government, it can be classified as
“Information”; if you want to find its location, it can be classified as “Yellow Page”. Moreover, some
teams did not output the &lt;WHAT-TYPE&gt; for the local queries. Though their extraction precision for other
fields is quite high, we have no choice but to label those as wrong cases.</p>
      <p>4) Failure to identify the &lt;GEO-RELATION&gt; field correctly. Most of the teams can recognize the geo-relation “IN”,
but for the others, like “SOUTH_OF”, “SOUTH_WEST_OF”, “NORTH_WEST_OF”, few teams can
identify them correctly. For example, for “bank west of nevada”, the &lt;GEO-RELATION&gt; should be
“WEST_OF”.</p>
      <p>5) Failure to find the correct &lt;WHERE&gt;. A few teams failed to extract the locations from the queries and labeled
them as non-local queries. We guess the reason is that their gazetteers are not big enough. Moreover, some teams
failed to disambiguate the locations and did not output the locations with the highest confidence scores.</p>
      <p>6) Although we do not consider the &lt;LAT-LONG&gt; field this time, we find that some participants do not output
&lt;LAT-LONG&gt; at all. Maybe they do not have such information.</p>
      <p>Most teams employed a sophisticated gazetteer for location extraction, containing millions of geographical
references. Their approaches for analyzing and classifying queries were mainly based on pre-defined rules. The
system from the Miracle team, which achieved the best F1-score, was composed of three modules, namely a geo-entity
identifier, a query analyzer and a two-level multi-classifier. Other systems followed similar designs. Generally
speaking, the performance for most teams is not high. We list several possible reasons as follows:</p>
      <p>1) New task. This query tagging task was totally new for the participants, and the time available from the very
beginning was rather short, just two months.</p>
      <p>2) Strict standard. Our criteria for judging the results are quite strict. Some teams are good at the extraction but
fail to identify the &lt;WHAT-TYPE&gt;. The overall performance is thus affected.</p>
      <p>3) Queries are ambiguous. We find three kinds of ambiguity here. a. Local/Non-Local ambiguity: some queries,
like “airport” and “space needle”, are defined as non-local here because they do not contain explicit locations.
b. Yellow Page/Information ambiguity: due to the lack of background knowledge, it is hard to say whether some
queries, like “Atlanta medical”, are Yellow Page or Information; we defined them as Yellow Page in this task.</p>
      <p>4) Culture understanding problem. When we were labeling the queries, we found that we lacked the necessary
background in the culture of western countries. In such a situation, the labeling process may still contain some
errors, even though we tried our best to avoid them.</p>
    </sec>
    <sec id="sec-11">
      <title>6. Conclusions</title>
      <p>In this report, we summarized the configuration and the results of the new query parsing task in GeoCLEF 2007. The
main purpose of organizing this task is to bring together researchers with similar interests. We first discussed the
motivation of this task. Then we described the task design and evaluation methods. We also reported the evaluation
results for all the participants.</p>
    </sec>
  </body>
</article>