=Paper=
{{Paper
|id=None
|storemode=property
|title=From Data Collection to Analysis - Exploring Regional Linguistic Variation in Route Directions by Spatially-Stratified Web Sampling
|pdfUrl=https://ceur-ws.org/Vol-620/paper8.pdf
|volume=Vol-620
}}
==From Data Collection to Analysis - Exploring Regional Linguistic Variation in Route Directions by Spatially-Stratified Web Sampling==
<pdf width="1500px">https://ceur-ws.org/Vol-620/paper8.pdf</pdf>
<pre>
       From Data Collection to Analysis – Exploring Regional
   Linguistic Variation in Route Directions by Spatially-Stratified
                           Web Sampling

                      Sen Xu1, Anuj Jaiswal2, Xiao Zhang3, Alexander Klippel1,
                               Prasenjit Mitra2 and Alan MacEachren1
             1 GeoVista Center, Department of Geography, Pennsylvania State University, U.S.A.
           2 College of Information Science and Technology, Pennsylvania State University, U.S.A.
          3 Department of Computer Science and Engineering, Pennsylvania State University, U.S.A.


      Abstract. How does spatial language vary regionally? This study investigates the possibility of
      exploring regional linguistic variation in spatial language by collecting and analyzing a
      Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language
      text on the Web. Because the World Wide Web makes content sharing fast, it has quickly become
      a hotbed for volunteered spatial language text, such as directions on hotels’ websites. These
      route directions can serve as a representation of everyday spatial language usage on the WWW.
      The spatial coverage and abundance of this data source are appealing, but collecting and
      analyzing large quantities of spatially distributed data remains challenging. Through
      automated crawling, classification, and geo-referencing of web documents containing route
      directions, the SARD Corpus has been built to cover the U.S., the U.K., and Australia. We
      implement a semantic categorical analysis scheme to explore regional variations in the usage
      of cardinal versus relative directions. Preliminary results show both similarities and
      differences at the national level and geographic patterns at the regional level. The design
      and implementation of building a geo-referenced large-scale corpus from Web documents offer a
      methodological contribution to corpus linguistics, spatial cognition, and the GISciences.

      Keywords: Spatial language analysis, volunteered spatial information, geo-referenced web
      sampling, regional linguistic variation, cardinal directions


1 Introduction

Spatial language is an important medium through which we study the representation, perception,
and communication of spatial information. Research has approached spatial language from various
perspectives. From the cognitive perspective, research has focused on group or individual
differences, on how language affects way-finding behaviour, or on how regional context affects
spatial language usage. From the computational perspective, modelling and reasoning have been
applied to spatial language interpretation. The spatial language samples used in these studies have
mostly been collected by individuals via time-consuming experiments or interviews. This data
collection method provides samples that, through manual interpretation by analysts, offer an
understanding of small-scale phenomena.
   However, studying regional linguistic patterns in spatial language, such as regional
variations in route directions, requires a spatially distributed corpus. Spatial language data
available from the WWW has great potential for this kind of study because of its unrivaled coverage
and easy accessibility. For example, it is common to find hotels, companies, and institutions
offering route directions on their websites, which provide way-finding instructions to travelers
coming from different places. Harnessing and analyzing these human-generated online route
directions is the major focus of this study.


2 Methods

To harness route direction documents from the WWW and ensure the spatial coverage of the
resulting corpus, a data collection scheme involving web crawling, text classification, and
geo-referencing has been developed. Computational tools have been applied to assist in processing
the Spatially-strAtified Route Direction Corpus (SARD Corpus) and interpreting the results.
    Collecting route direction documents from the WWW poses two main challenges. First, route
directions have a high linguistic complexity that makes it difficult to separate route direction
documents from the variety of irrelevant web documents. This challenge can be addressed by applying
a machine learning algorithm for text classification [1]. The precision of the route direction
document classifier used in this study reaches 93% (of 438 positively classified documents, 407
were confirmed by hand to be spatial language documents). Second, exploring regional variation in
spatial language usage requires geo-referencing each document in the corpus, which is not an easy
task (the problem of Geographic Named Entity Disambiguation). However, the postal code, which
commonly appears in the destination address of route directions, can be used to coarsely
geo-reference a route direction document at the postal code level. The data collection scheme first
uses lists of postal codes for crawling web documents. The returned web documents are fed into the
route direction classifier, and only positively classified route direction documents are stored in
the resulting corpus. This data collection scheme maximizes the spatial coverage of the SARD Corpus
at the postal code level. To prepare the corpus for extracting regional linguistic attributes, the
SARD Corpus is organized first by nation, then by region (states in the U.S. and Australia, postal
districts in the U.K.).
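
The collection scheme described above can be sketched roughly as follows. This is a minimal
illustration, not the authors' implementation: `search`, `is_route_direction`, and the
(nation, region, postal_code) seed layout are hypothetical stand-ins for the search engine, the
trained classifier [1], and the postal code lists. The check that a returned page contains the
postal code is likewise an assumption. The `precision` helper reproduces the 407/438 figure
reported above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical seed record: (nation, region, postal_code).
Seed = Tuple[str, str, str]
Corpus = Dict[str, Dict[str, List[Tuple[str, str]]]]


def build_corpus(
    seeds: Iterable[Seed],
    search: Callable[[str], Iterable[str]],
    is_route_direction: Callable[[str], bool],
) -> Corpus:
    """Crawl by postal code, keep positively classified pages, group by nation and region."""
    corpus: Corpus = defaultdict(lambda: defaultdict(list))
    for nation, region, postal_code in seeds:
        for page_text in search(postal_code):
            # Assumption: the postal code must appear in the page to geo-reference it.
            if postal_code not in page_text:
                continue
            if is_route_direction(page_text):
                corpus[nation][region].append((postal_code, page_text))
    return corpus


def precision(confirmed: int, predicted_positive: int) -> float:
    """Share of positively classified documents that were confirmed by hand."""
    return confirmed / predicted_positive


if __name__ == "__main__":
    # Toy stand-ins for the search engine and the classifier (not real services).
    pages = {
        "16801": [
            "Directions: head north on Atherton St. State College, PA 16801",
            "Pizza menu, State College, PA 16801",
        ]
    }
    corpus = build_corpus(
        [("US", "PA", "16801")],
        search=lambda code: pages.get(code, []),
        is_route_direction=lambda text: "directions" in text.lower(),
    )
    print(len(corpus["US"]["PA"]))
    print(round(precision(407, 438), 2))
```

Passing the search and classification steps in as functions keeps the sketch independent of any
particular search engine or classifier.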
    The data analysis of spatial language usage in route directions focuses on regional linguistic
variation, which is addressed by analyzing the semantic usages of cardinal directions (i.e., north,
south, east, west, northeast, northwest, southeast, and southwest) and relative directions (i.e.,
left and right). The semantic categories used are detailed in Table 1. The scale and size of the
corpus make corpus linguistic tools a necessity for processing the regional linguistic
characteristics. The TermTree tool [2], a text processing tool that can handle regular
expressions, is used to assist an analyst in manually evaluating the semantic usages of direction
terms. The resulting semantic categorical data are treated as the regional linguistic
characteristics of each region in the SARD Corpus. The Visual Inquiry Toolkit [3] is used for
geovisualization of the regional linguistic characteristics (Fig. 3) to interpret the analysis
results.
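
To illustrate the kind of regular-expression matching that can support this manual evaluation,
the sketch below finds the ten direction terms listed above and returns each hit with a short
context window for an analyst to read. It does not reproduce TermTree's interface; the pattern,
the function name `direction_hits`, and the window size are assumptions made for illustration.
Assigning a semantic category (or discarding a POI name such as "West Street") remains a manual
step.

```python
import re
from typing import List, Tuple

# The eight cardinal terms and two relative terms named in the text.
CARDINAL = r"north(?:east|west)?|south(?:east|west)?|east|west"
RELATIVE = r"left|right"
PATTERN = re.compile(
    rf"\b(?P<cardinal>{CARDINAL})\b|\b(?P<relative>{RELATIVE})\b",
    re.IGNORECASE,
)


def direction_hits(text: str, window: int = 30) -> List[Tuple[str, str, str]]:
    """Return (kind, term, context) for every direction term found in text."""
    hits = []
    for match in PATTERN.finditer(text):
        kind = match.lastgroup  # "cardinal" or "relative"
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        hits.append((kind, match.group(0).lower(), text[start:end]))
    return hits


if __name__ == "__main__":
    sample = "Head north on US-322, then take a left onto North Atherton Street."
    for kind, term, context in direction_hits(sample):
        print(f"{kind:9s} {term:10s} ...{context}...")
```

The word boundaries keep words such as "northern" from matching, while a street name like "North
Atherton Street" is still returned so that the analyst can mark it as a POI name.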

Table 1. Semantic categories for cardinal directions and relative directions.

 Direction   Semantic category                Examples
 Relative    1. Change of direction           take a left, bear right
             2. Static spatial relationship   see a landmark on your right, the destination is
                                              left to a landmark
             3. Driving aid                   keep to the left lane, merge to the right lane
 Cardinal    1. Change of direction           head north, traveling south
             2. Static spatial relationship   veer southwest on US Hwy 24, turn north
             3. Traveling direction           2 blocks east of landmark
             4. General origin                from North, if coming from South of New York
             *used in POI names               North Atherton Street, West Street

  As a result of the data collection, the SARD Corpus has been built with 11,254 web documents
covering the U.S., the U.K., and Australia. An overview of the workflow is presented in Fig. 1.

[Figure 1 placeholder: workflow diagram. Stage labels: Seed data, Crawling, Classification,
Categorization, Text Processing, Analysis, Visual Analytics. Component labels: List of postal
codes; Data indexer (search engine); Web document containing zip codes; Trained Maximum Entropy
Classifier; Location Validation; Spatial language corpora; Analyzing semantic category usage;
K-means Clustering; Moran’s I; Regional patterns.]


Fig. 1. Overview of the data collection and analysis schemes for building and analyzing the SARD Corpus
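
The workflow in Fig. 1 names Moran’s I as one of the analysis steps, but the text does not say how
it was computed. As a hedged illustration only, the sketch below uses the textbook definition of
global Moran’s I with binary adjacency weights (1 for neighboring regions, 0 otherwise). The
weighting scheme, the function name `morans_i`, and the toy regions A to D are assumptions, not
details taken from the study.

```python
from typing import Dict, Iterable, Tuple


def morans_i(values: Dict[str, float], neighbors: Iterable[Tuple[str, str]]) -> float:
    """Global Moran's I with binary, symmetric adjacency weights.

    values    : one attribute value per region (e.g., a category proportion)
    neighbors : undirected pairs of adjacent regions, each listed once
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {region: v - mean for region, v in values.items()}

    cross = 0.0    # sum of w_ij * dev_i * dev_j over ordered pairs
    total_w = 0.0  # sum of all weights w_ij
    for a, b in neighbors:
        # Each undirected pair counts twice because w_ab = w_ba = 1.
        cross += 2 * dev[a] * dev[b]
        total_w += 2

    squared_dev = sum(d * d for d in dev.values())
    # Undefined (division by zero) if all values are equal or no neighbors are given.
    return (n / total_w) * (cross / squared_dev)


if __name__ == "__main__":
    # Made-up values for four hypothetical regions arranged in a chain A-B-C-D.
    chain = [("A", "B"), ("B", "C"), ("C", "D")]
    clustered = {"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0}
    alternating = {"A": 1.0, "B": 0.0, "C": 1.0, "D": 0.0}
    print(morans_i(clustered, chain))    # positive: similar values are adjacent
    print(morans_i(alternating, chain))  # negative: dissimilar values are adjacent
```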


3 Results

   Regional pattern analysis demonstrates how the usage of cardinal and relative directions varies
at both the national level (Fig. 2) and the regional level (Fig. 3). At the national level,
relative directions in all three nations are mostly used to represent “change of direction” (the
blue bar on the left). Similarly, cardinal directions are mostly used to represent “traveling
direction” (the white bar on the right). On the other hand, the preference for relative directions
when representing “change of direction” is much stronger in the U.K. than in the U.S. and
Australia. Correspondingly, we find that cardinal directions are used to represent “change of
direction” more often in the U.S. and Australia than in the U.K. (the blue bars on the right).

[Figure 2 placeholder: two panels of bar charts, relative directions (left) and cardinal directions
(right). Upper charts: token occurrence count (left axis up to 80,000; right axis up to 20,000).
Lower charts: proportion (0% to 100%). Horizontal axis: nation (corpus size): US (10,055),
UK (710), Australia (489). Legend, left panel: change of direction, static spatial relationship,
driving aid. Legend, right panel: general origin, static spatial relationship, change of direction,
traveling direction.]


Fig. 2. Nation-level comparison of relative directions and cardinal directions usage

   To better understand the regional variation in relative versus cardinal direction usage, the
proportion of each semantic category is plotted on a map for comparison. The map provides
geographical knowledge about the regions, such as adjacency, which helps the analyst detect
regional patterns. Fig. 3 shows that the two dominant usages noted at the national level
(relative directions used for “change of direction”, cardinal directions used as “traveling
direction”) are used more frequently in most states in the U.S. For cardinal direction usage, there
is a geographic pattern (South Dakota to Kansas, Wyoming to Iowa, blue circle) that differs from
its surrounding states in every semantic category. The detected regional pattern is comparable to
the Colorado West and Central West region in the map of U.S. dialects [4, p.186]. A possible
explanation for this observation may lie in a correlation between regional linguistic preference
and regional geographical features, which is yet to be investigated.
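
The per-region proportions that are plotted on the map can be derived from category counts along
the following lines. This is a minimal sketch under stated assumptions: the function names are
hypothetical, and the counts in the usage example are made up purely to show the arithmetic. They
are not values from the SARD Corpus.

```python
from typing import Dict


def category_proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category token counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # No direction terms observed in this region.
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


def regional_proportions(
    counts_by_region: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Apply category_proportions to every region."""
    return {
        region: category_proportions(counts)
        for region, counts in counts_by_region.items()
    }


if __name__ == "__main__":
    # Made-up counts for two hypothetical regions (relative direction categories).
    toy_counts = {
        "Region A": {"change of direction": 6, "static spatial relationship": 3, "driving aid": 1},
        "Region B": {"change of direction": 2, "static spatial relationship": 2, "driving aid": 0},
    }
    for region, proportions in regional_proportions(toy_counts).items():
        print(region, proportions)
```

Working with proportions rather than raw counts makes regions with very different corpus sizes
comparable on the same map.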
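
[Figure 3 placeholder: maps of the U.S. showing the proportion of each semantic category (legend:
proportion, high to low). Relative direction maps: change of direction, spatial relationship.
Cardinal direction maps: change of direction, spatial relationship, traveling direction, general
origin.]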


Fig. 3. Regional-level comparison of relative directions and cardinal directions usage (the U.S.).


4 Summary

   This paper presents a first step toward an effective and scalable data collection method for
spatial language study. It enables spatial cognition research to scale up spatial language data
sets and answer spatial cognition questions (such as regional differences in spatial language) at a
large scale. This study shows promise for effective spatial cognition research through processing
and analyzing volunteered spatial language data, an alternative to collecting data through
experiments involving human participants. The presented workflow can also be extended to languages
other than English to assist in cross-language comparisons.
      Language preferences at the national and regional levels are both explored, offering 1) a
better understanding of how people tend to use spatial language to communicate spatial information;
2) insight into how people from different regions differ in using spatial language; and 3) a
guideline for developing a localized, use-specific natural language generation system for
navigational devices. Regional patterns of cardinal and relative direction usage in route
directions are observed and analyzed, offering a novel perspective for spatial linguistic studies.
The design and implementation of building a geo-referenced large-scale corpus from Web documents in
this study offer a methodological contribution to corpus linguistics, spatial cognition, and the
GISciences.


5 Acknowledgement

   Research for this paper is based upon work supported by the National Geospatial-Intelligence
Agency (NGA) through the NGA University Research Initiative (NURI) program. The views,
opinions, and conclusions contained in this document are those of the authors and should not be
interpreted as necessarily representing the official policies or endorsements, either expressed or
implied, of the National Geospatial-Intelligence Agency, or the U.S. Government.


References

[1] Zhang, X., Mitra, P., Xu, S., Jaiswal, A.R., Klippel, A., MacEachren, A.M.: Extracting route directions from
    web pages. In: Twelfth International Workshop on the Web and Databases (WebDB 2009), Providence,
    Rhode Island, USA. (2009)
[2] Turton, I., MacEachren, A.: Visualizing unstructured text documents using trees and maps. In: GIScience
    workshop, Park City, Utah (2008)
[3] Chen, J., MacEachren, A.M., Guo, D.: Visual inquiry toolkit - an integrated approach for exploring and
    interpreting space-time, multivariate patterns. Technical report, GeoVista Center and Department of
    Geography Pennsylvania State University, Department of Geography University of South Carolina (2007)
[4] Smith, J.: Bum bags and fanny packs: a British-American, American-British dictionary. Carroll & Graf
    Publishers (2006)

</pre>