<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Data Collection to Analysis - Exploring Regional Linguistic Variation in Route Directions by Spatially-Stratified Web Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sen Xu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anuj Jaiswal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiao Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Klippel</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasenjit Mitra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan MacEachren</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information Science and Technology, Pennsylvania State University</institution>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Engineering, Pennsylvania State University</institution>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>GeoVista Center, Department of Geography, Pennsylvania State University</institution>
          ,
          <country country="US">U.S.A.</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>How does spatial language vary regionally? This study investigates the possibility of exploring regional linguistic variation in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because the World Wide Web makes sharing content fast, it has quickly become a hotbed for volunteered spatial language text, such as directions on hotels' websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of this data source are appealing, but collecting and analyzing large quantities of spatially distributed data remains challenging. Through automated crawling, classifying, and geo-referencing of web documents containing route directions, the SARD Corpus has been built to cover the U.S., the U.K., and Australia. We implement a semantic categorical analysis scheme to explore regional variations in cardinal versus relative direction usage. Preliminary results show both similarities and differences at the national level and geographic patterns at the regional level. The design and implementation of building a geo-referenced large-scale corpus from Web documents offer a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.</p>
      </abstract>
      <kwd-group>
        <kwd>Spatial language analysis</kwd>
        <kwd>volunteered spatial information</kwd>
        <kwd>geo-referenced web sampling</kwd>
        <kwd>regional linguistic variation</kwd>
        <kwd>cardinal directions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Spatial language is an important medium through which we study the representation, perception,
and communication of spatial information. Research has approached spatial language from various
perspectives. From the cognitive perspective, research has focused on group or individual
differences, on how language affects way-finding behaviour, or on how regional context affects
spatial language usage. From the computational perspective, modelling and reasoning have been
applied to spatial language interpretation. The spatial language samples used in these studies have
mostly been collected by individuals via time-consuming experiments or interviews. This data
collection method provides samples that support the understanding of small-scale phenomena
through manual interpretation by analysts.</p>
      <p>However, studying regional linguistic patterns in spatial language, such as regional
variations in route directions, requires a spatially distributed corpus. Spatial language data
available from the WWW has great potential for such a study because of its unrivaled coverage and
easy accessibility. For example, hotels, companies, and institutions commonly offer
route directions on their websites, which provide way-finding instructions to travelers coming from
different places. Harnessing these human-generated route directions online and analyzing them is
the major focus of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Methods</title>
      <p>To harness route direction documents from the WWW and ensure the spatial coverage of the
resulting corpus, a data collection scheme involving web crawling, text classification, and
geo-referencing has been developed. Computational tools have been applied to assist in processing the
Spatially-strAtified Route Direction Corpus (SARD Corpus) and in interpreting the results.</p>
      <p>
        Collecting route direction documents from the WWW poses two main challenges. First, route
directions have a high linguistic complexity that makes it difficult to separate route direction
documents from a variety of irrelevant web documents. This challenge can be addressed by applying a
machine learning algorithm for text classification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The precision of the route direction
document classifier used in this study reaches 93% (of 438 positively classified documents, 407 were
confirmed by hand to be spatial language documents). Second, exploring regional variation in spatial
language usage requires geo-referencing each document in the corpus, which is not an easy task
(i.e., it requires geographic named entity disambiguation). However, the postal code, which commonly appears
in destination addresses in route directions, can be used to coarsely geo-reference a route direction
document at the postal code level. The data collection scheme first uses lists of postal codes for
crawling web documents. The returned web documents are fed into the route direction classifier,
and only positively classified route direction documents are stored in the resulting corpus. This data
collection scheme maximizes the spatial coverage of the SARD Corpus at the postal code level. To
prepare the corpus for extracting regional linguistic attributes, the SARD Corpus is organized first by
nation, then by region (states in the U.S. and Australia, postal districts in the U.K.).
      </p>
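      <p>The postal-code step described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the regular expression covers only U.S. ZIP codes, the prefix-to-state table is a three-entry stand-in for a real postal-code gazetteer, and the rule that a document is kept only when its seed postal code occurs in the text is an assumption made for illustration. All function and variable names are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict

# U.S. ZIP codes: five digits, optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")

# Hypothetical stand-in for a postal-code gazetteer: 3-digit ZIP prefix -> state.
ZIP_PREFIX_TO_STATE = {"168": "PA", "100": "NY", "606": "IL"}


def georeference(text, seed_zip):
    """Return the state for a document crawled with seed_zip, or None.

    Assumption: the document is accepted only if the seed ZIP code occurs
    in its text (for example in a destination address).
    """
    found = {m.group(1) for m in ZIP_RE.finditer(text)}
    if seed_zip not in found:
        return None
    return ZIP_PREFIX_TO_STATE.get(seed_zip[:3])


def organize_by_region(docs):
    """Group (seed_zip, text) pairs into a {state: [text, ...]} mapping."""
    corpus = defaultdict(list)
    for seed_zip, text in docs:
        state = georeference(text, seed_zip)
        if state is not None:
            corpus[state].append(text)
    return dict(corpus)
```
]]></preformat>
      <p>In this sketch a document is geo-referenced only as finely as its postal code allows, which mirrors the coarse, postal-code-level referencing described in the text; documents that pass this step would still need to go through the route direction classifier.</p>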
      <p>
        The data analysis of spatial language usage in route directions focuses on regional linguistic
variation, which is addressed by analyzing the semantic usage of cardinal directions (i.e., north,
south, east, west, northeast, northwest, southeast, and southwest) and relative directions (i.e., left and
right). The semantic categories used are detailed in Table 1. The scale and size of the corpus make
corpus linguistic tools a necessity for processing the regional linguistic characteristics. The
TermTree tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a text processing tool that can handle regular expressions, is
used to assist an analyst in manually evaluating the semantic usage of direction terms. The
resulting semantic categorical data are treated as the regional linguistic characteristics of each region in the
SARD Corpus. The Visual Inquiry Toolkit [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is used for geovisualization of the regional linguistic
characteristics (Fig. 3) to interpret the analysis results.
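      </p>
      <table-wrap id="tbl-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Direction type</th><th>Semantic category</th><th>Examples</th></tr>
          </thead>
          <tbody>
            <tr><td>Relative</td><td>1. Change of direction</td><td>take a left, bear right</td></tr>
            <tr><td>Relative</td><td>2. Static spatial relationship</td><td>see a landmark on your right, the destination is left to a landmark</td></tr>
            <tr><td>Relative</td><td>3. Driving aid</td><td>keep to the left lane, merge to the right lane</td></tr>
            <tr><td>Cardinal</td><td>1. Change of direction</td><td>veer southwest on US Hwy 24, turn north</td></tr>
            <tr><td>Cardinal</td><td>2. Static spatial relationship</td><td>2 blocks east of landmark</td></tr>
            <tr><td>Cardinal</td><td>3. Traveling direction</td><td>head north, traveling south</td></tr>
            <tr><td>Cardinal</td><td>4. General origin</td><td>from North, if coming from South of New York</td></tr>
            <tr><td>Cardinal</td><td>* Used in POI names</td><td>North Atherton Street, West Street</td></tr>
          </tbody>
        </table>
      </table-wrap>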
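      <p>A regular-expression pass of the kind that supports this manual evaluation can be sketched as follows. This is not TermTree and does not reproduce its interface; it is a hypothetical keyword-in-context helper, written in Python, that locates cardinal and relative direction terms and returns the surrounding text so that an analyst can assign a semantic category by hand.</p>
      <preformat><![CDATA[
```python
import re

CARDINAL = ("northeast", "northwest", "southeast", "southwest",
            "north", "south", "east", "west")
RELATIVE = ("left", "right")

# Whole-word, case-insensitive match on any direction term.
TERM_RE = re.compile(r"\b(" + "|".join(CARDINAL + RELATIVE) + r")\b",
                     re.IGNORECASE)


def concordance(text, width=30):
    """Yield (term, kind, context) for each direction term found in text.

    context is the match plus up to `width` characters on either side,
    which is what an analyst would read to decide the semantic category.
    """
    for match in TERM_RE.finditer(text):
        term = match.group(1).lower()
        kind = "cardinal" if term in CARDINAL else "relative"
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        yield term, kind, text[start:end]


if __name__ == "__main__":
    sample = "Head north on North Atherton Street, then take a left."
    for term, kind, context in concordance(sample):
        print(kind, term, "|", context)
```
]]></preformat>
      <p>The pattern deliberately over-matches: in the sample sentence, "north" in "North Atherton Street" is a POI name rather than a traveling direction, which is exactly the kind of case that still requires the analyst's judgment.</p>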
      <p>As a result of the data collection, the SARD Corpus has been built with 11,254 web documents
covering the U.S., the U.K., and Australia. An overview of the workflow is presented in Fig. 1.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>Regional pattern analysis demonstrates how cardinal and relative direction usage varies at both
the national level (Fig. 2) and the regional level (Fig. 3). At the national level, relative directions in all three
nations are mostly used to represent “change of direction” (the blue bar on the left). Similarly,
cardinal directions are mostly used to represent “travelling direction” (the white bar on the right).
On the other hand, the preference for relative directions when representing “change of direction” is
much stronger in the U.K. than in the U.S. and Australia. Correspondingly, we find that
cardinal directions are used more often in the U.S. and Australia than in the U.K. (the blue bars on
the right) to represent “change of direction”.</p>
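      <p>Because the national sub-corpora differ greatly in size (10,055 documents for the U.S., 710 for the U.K., and 489 for Australia), comparisons of this kind are more meaningful on proportions than on raw token counts. The Python sketch below shows one way to normalise per-category counts. The nation labels and counts are purely illustrative placeholders, not values from the SARD Corpus, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
def category_proportions(counts):
    """Convert {category: token count} into {category: share of all tokens}."""
    total = sum(counts.values())
    if total == 0:
        # No tokens observed: report zero shares instead of dividing by zero.
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


# Illustrative counts only; these are NOT figures from the SARD Corpus.
relative_counts = {
    "nation_a": {"change of direction": 80,
                 "static spatial relationship": 15,
                 "driving aid": 5},
    "nation_b": {"change of direction": 9,
                 "static spatial relationship": 1,
                 "driving aid": 0},
}

proportions = {nation: category_proportions(counts)
               for nation, counts in relative_counts.items()}

for nation, shares in proportions.items():
    print(nation, {category: round(share, 2) for category, share in shares.items()})
```
]]></preformat>
      <p>The same normalisation can be applied per state or postal district to obtain the regional proportions that are mapped at the regional level.</p>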
      <sec id="sec-3-16">
        <p>[Fig. 1 (image not recovered). Recoverable workflow stage headings: Seed data; Crawling; Classification, Categorization, Text Processing; Analysis, Visual Analytics. Recoverable box labels: list of postal codes; data indexer (search engine); web document containing zip codes; trained maximum entropy classifier; location validation; spatial language corpora; analyzing semantic category usage; k-means clustering; Moran’s I; regional patterns.]</p>
        <p>[Fig. 2 (image not recovered). Recoverable labels: vertical axes, token occurrence count and proportion; horizontal axis, nation (corpus size): US (10,055), UK (710), Australia (489); legend for relative directions: change of direction, static spatial relationship, driving aid; legend for cardinal directions: general origin, static spatial relationship, change of direction, traveling direction.]</p>
        <p>To get a better understanding of the regional variation in relative versus cardinal direction
usage, the proportion of each semantic category is plotted on a map for comparison. The
map provides geographical knowledge about the regions, such as adjacency, which helps the
analyst detect regional patterns. Fig. 3 shows that the two most dominant usages noted at the
national level (relative directions used for “change of direction”, cardinal directions used for
“travelling direction”) also dominate in most states in the U.S. For cardinal direction
usage, there is a geographic pattern (South Dakota to Kansas, Wyoming to Iowa, blue circle) that
differs from its surrounding states in every semantic category. The regional pattern detected is
comparable to the Colorado West and Central West region in the map of U.S. dialects [
        <xref ref-type="bibr" rid="ref4">4</xref>
        , p. 186]. A
possible explanation for this observation may lie in a correlation between regional linguistic
preference and regional geographical features, which is yet to be investigated.</p>
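        <p>Whether neighbouring regions tend to share similar proportions can also be quantified with a spatial autocorrelation statistic such as global Moran’s I. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used in this study: it uses binary adjacency weights (1 for adjacent regions, 0 otherwise), assumes the adjacency lists are symmetric, and runs on four hypothetical regions arranged in a line with made-up values.</p>
        <preformat><![CDATA[
```python
def morans_i(values, neighbors):
    """Global Moran's I with binary adjacency weights.

    values:    {region: attribute value, e.g. a category proportion}
    neighbors: {region: iterable of adjacent regions}, assumed symmetric
    """
    regions = list(values)
    n = len(regions)
    mean = sum(values.values()) / n
    dev = {r: values[r] - mean for r in regions}
    denom = sum(d * d for d in dev.values())

    weight_sum = 0   # total of all w_ij (each adjacent ordered pair counts 1)
    cross = 0.0      # sum of w_ij * dev_i * dev_j
    for r in regions:
        for s in neighbors.get(r, ()):
            if s in dev:
                weight_sum += 1
                cross += dev[r] * dev[s]

    if denom == 0 or weight_sum == 0:
        raise ValueError("Moran's I is undefined for constant values or no neighbors")
    return (n / weight_sum) * (cross / denom)


# Hypothetical regions A-B-C-D arranged in a line; values are made up.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
clustered = {"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0}
alternating = {"A": 1.0, "B": 0.0, "C": 1.0, "D": 0.0}

print(morans_i(clustered, line))    # positive: similar values are adjacent
print(morans_i(alternating, line))  # negative: dissimilar values are adjacent
```
]]></preformat>
        <p>Values above zero indicate that similar proportions cluster among adjacent regions, and values below zero indicate that adjacent regions tend to differ. Assessing significance would additionally require a permutation test or an analytical variance, which this sketch omits.</p>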
        <p>[Fig. 3 (image not recovered). Recoverable labels: maps for relative directions (change of direction, spatial relationship) and cardinal directions (change of direction, spatial relationship, traveling direction, general origin); legend: proportion, from high to low.]</p>
      </sec>
    </sec>
    <sec id="sec-summary">
      <title>4 Summary</title>
      <p>This paper presents a first step toward an effective and scalable data collection method for spatial
language studies. It enables spatial cognition researchers to scale up spatial language data sets and
answer spatial cognitive questions (such as regional differences in spatial language) at a large scale.
This study shows promise for effective spatial cognition research through processing and analyzing
volunteered spatial language data, an alternative to collecting data through experiments
involving human participants. The presented workflow can also be extended to languages
other than English to assist in cross-language comparisons.</p>
      <p>Language preferences at both the national and regional levels are explored, offering 1) a
better understanding of how people tend to use spatial language to communicate spatial information;
2) insight into how people from different regions differ in using spatial language; and 3) a guideline for developing
a localized, use-specific natural language generation system for navigational devices. Regional
patterns of cardinal and relative direction usage in route directions are observed and analyzed,
offering a novel perspective for spatial linguistic studies. The design and implementation of
building a geo-referenced large-scale corpus from Web documents in this study offer a
methodological contribution to corpus linguistics, spatial cognition, and the GISciences.</p>
    </sec>
    <sec id="sec-4">
      <title>5 Acknowledgement</title>
      <p>Research for this paper is based upon work supported by the National Geospatial-Intelligence
Agency (NGA) through the NGA University Research Initiative (NURI) program. The views,
opinions, and conclusions contained in this document are those of the authors and should not be
interpreted as necessarily representing the official policies or endorsements, either expressed or
implied, of the National Geospatial-Intelligence Agency, or the U.S. Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaiswal</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klippel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacEachren</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>Extracting route directions from web pages</article-title>
          . In:
          <source>Twelfth International Workshop on the Web and Databases (WebDB 2009)</source>
          , Providence, Rhode Island, USA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Turton</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacEachren</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Visualizing unstructured text documents using trees and maps</article-title>
          . In: GIScience workshop, Park City, Utah (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacEachren</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Visual inquiry toolkit - an integrated approach for exploring and interpreting space-time, multivariate patterns</article-title>
          .
          <source>Technical report</source>
          , GeoVista Center and Department of Geography Pennsylvania State University, Department of Geography University of South Carolina (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Bum bags and fanny packs: a British-American, American-British dictionary</article-title>
          . Carroll &amp; Graf Publishers (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>