EXPLORING REGIONAL VARIATION IN SPATIAL LANGUAGE: A CASE STUDY ON SPATIAL ORIENTATION BY USING VOLUNTEERED SPATIAL LANGUAGE DATA

Open Access
- Author:
- Xu, Sen
- Graduate Program:
- Geography
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- None
- Committee Members:
- Alexander Klippel, Thesis Advisor/Co-Advisor
Alan Maceachren, Thesis Advisor/Co-Advisor
- Keywords:
- text classification
geo-referenced web sampling
volunteered geographic information
information extraction
machine learning
spatial language analysis
regional linguistic differences
cardinal directional usage
visual analytics
- Abstract:
- This thesis asks how spatial language varies from region to region within a single language. Spatial language, such as route directions, is language that describes spatial situations and spatial relationships between objects. It is an important medium for studying how humans represent, perceive, and communicate spatial information. Existing spatial language studies mostly rely on data collected through time-consuming experiments, which restricts them to small sample sizes and limits their ability to detect how spatial language varies from one region to another. More recently, larger samples have become possible thanks to the abundance of volunteered spatial language data on the World Wide Web (WWW), such as directions on hotels’ websites. These data are a potential source for scaling up the analysis of spatial language. This thesis develops a scheme for collecting spatial language data from the WWW. Automated web crawling, classification of spatial language text documents based on computational linguistic methods, and geo-referencing of text documents are used to build a spatially stratified corpus. Focusing on route directions on the WWW, the thesis builds the Spatially-strAtified Route Direction Corpus (the SARD Corpus), which contains more than 10,000 spatially distributed documents covering three countries (the United States, the United Kingdom, and Australia). As a case study on the SARD Corpus, a linguistic analysis scheme assisted by computational linguistic tools is designed around the often-raised question of cardinal versus relative direction term usage. Semantic usages of cardinal and relative directions are identified as regional linguistic characteristics, and a visual analytic toolkit is used to detect regional variations in the SARD Corpus.
Analysis results and possible indications of linguistic variation at the national and regional scales are presented and discussed, contributing to research on spatial language use. At the national level, the analysis shows both similarities and differences in directional term usage; at the regional level, geographic patterns emerge in the usage of these terms. The findings contribute to the field of spatial cognition, and the design and implementation of a large-scale geo-referenced corpus built from documents crawled from the WWW offers a methodological contribution to corpus linguistics, spatial cognition, and GIScience.
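
The cardinal-versus-relative comparison described in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual analysis scheme: the term lists, the function names (`direction_counts`, `cardinal_share_by_region`), and the sample documents are all hypothetical, and the simple token matching used here ignores the semantic disambiguation the abstract mentions (for example, "right" in "right away" or "left" as a verb).

```python
import re
from collections import Counter

# Illustrative term lists (assumptions, not the lexicon used in the thesis).
CARDINAL = {"north", "south", "east", "west"}
RELATIVE = {"left", "right"}

TOKEN_RE = re.compile(r"[a-z]+")


def direction_counts(text):
    """Count cardinal and relative direction tokens in one document."""
    counts = Counter()
    for token in TOKEN_RE.findall(text.lower()):
        if token in CARDINAL:
            counts["cardinal"] += 1
        elif token in RELATIVE:
            counts["relative"] += 1
    return counts


def cardinal_share_by_region(docs):
    """Given (region, text) pairs, return the cardinal share per region.

    The share is cardinal / (cardinal + relative); regions with no
    direction terms at all are omitted.
    """
    totals = {}
    for region, text in docs:
        totals.setdefault(region, Counter()).update(direction_counts(text))
    shares = {}
    for region, counts in totals.items():
        total = counts["cardinal"] + counts["relative"]
        if total:
            shares[region] = counts["cardinal"] / total
    return shares


# Hypothetical geo-referenced route directions, tagged by country.
sample_docs = [
    ("US", "Head north on Main Street, then turn left."),
    ("US", "Go east for two miles and continue north."),
    ("UK", "Turn right at the roundabout, then left at the lights."),
]
shares = cardinal_share_by_region(sample_docs)
print(shares)
```

In a corpus-scale setting, the region key would come from the geo-referencing step and the per-region shares would feed a map or other visual analytic display; both are outside the scope of this sketch.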