Multidisciplinary Applications of U.S. Soils Datasets: Machine Learning Models, Data Mining, and Land Use Analyses

Open Access
Author:
Ramcharan, Amanda Maria
Graduate Program:
Agricultural and Biological Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
March 02, 2017
Committee Members:
  • Thomas L. Richard, Dissertation Advisor
  • Thomas L. Richard, Committee Chair
  • Heather Gall, Committee Member
  • Douglas Miller, Committee Member
  • Tomislav Hengl, Outside Member
Keywords:
  • soil
  • machine learning
  • agroecosystem modeling
  • digital soil mapping
  • soil pedotransfer functions
  • random forest
  • data mining
Abstract:
U.S. efforts to understand and map soils have revolutionized the understanding of soil genesis, morphology, classification, behavior, and mapping. The United States maintains one of the largest publicly available soils datasets in the world, and the value of soils information continues to extend to many aspects of society, such as agriculture, tax assessment, the design and construction of highways and buildings, and land management. With a growing human population, soil is being subjected to increasing demands for food, energy, and water. This has resulted in extensive soil erosion, with soil degrading faster than it is naturally replenished. Current agricultural practices are therefore a threat to soil natural resources. To assess and manage changing demands on these resources, soil data must be adapted to new computational tools that can provide knowledge for decision-makers and stakeholders seeking to preserve them. This dissertation harnesses a high-performance computing environment to change the design and application of U.S. national soil datasets.
First, we demonstrated how the National Cooperative Soil Survey (NCSS) Soil Characterization Database can be used to create a machine-learning pedotransfer function (PTF) to predict soil bulk density, a crucial soil variable that is often missing from soil databases. The results of a 5-fold cross-validation showed that the average root mean squared prediction error (RMSPE) was 0.13 g cm⁻³ and the mean prediction error (MPE) was -0.001 g cm⁻³. Performance of the PTF was also illustrated by estimating soil organic carbon (SOC) stocks for four representative pedons across a temperature-precipitation gradient; these estimates were within 5% of measured SOC stocks.
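As an illustration of how cross-validated RMSPE and MPE figures like those above are typically computed, the following is a minimal Python sketch under stated assumptions. It does not reproduce the dissertation's PTF: the one-variable least-squares line is a stand-in model, the data are synthetic, the function names are hypothetical, and MPE is assumed to mean the average of predicted minus observed values (the abstract does not state the sign convention).

```python
import math
import random


def mean_prediction_error(observed, predicted):
    """MPE, taken here as mean(predicted - observed); the sign is an assumption."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_prediction_error(observed, predicted):
    """RMSPE: square root of the mean squared prediction error."""
    return math.sqrt(
        sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def fit_line(x, y):
    """Ordinary least squares y = a + b*x; a stand-in for a real PTF model."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def cross_validate(x, y, k=5, seed=0):
    """Return (average RMSPE, average MPE) over k held-out folds."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    rmspe_per_fold, mpe_per_fold = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        # Fit on the training folds only, then predict the held-out fold.
        model = fit_line([x[i] for i in train], [y[i] for i in train])
        observed = [y[i] for i in held_out]
        predicted = [model(x[i]) for i in held_out]
        rmspe_per_fold.append(root_mean_squared_prediction_error(observed, predicted))
        mpe_per_fold.append(mean_prediction_error(observed, predicted))
    return sum(rmspe_per_fold) / k, sum(mpe_per_fold) / k


# Synthetic example: bulk density loosely decreasing with organic carbon content.
# These numbers are invented for illustration and are not dissertation data.
rng = random.Random(42)
organic_carbon = [rng.uniform(0.5, 10.0) for _ in range(100)]
bulk_density = [1.6 - 0.05 * oc + rng.gauss(0.0, 0.1) for oc in organic_carbon]
avg_rmspe, avg_mpe = cross_validate(organic_carbon, bulk_density, k=5)
print(f"average RMSPE = {avg_rmspe:.3f} g cm^-3, average MPE = {avg_mpe:.3f} g cm^-3")
```

RMSPE summarizes the typical size of the prediction error, while an MPE close to zero indicates that the model does not systematically over- or under-predict.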
Second, we demonstrated how existing legacy soil datasets, namely the NCSS database, the National Soil Information System (NASIS), the Gridded Soil Survey Geographic (gSSURGO) database, and the Rapid Carbon Assessment (RaCA) database, can be used in a high-performance computing system to create soil class and soil property maps at 100 m spatial resolution. We overlaid and merged soil point data with soil characteristics extracted from gSSURGO, terrain attribute layers describing the land surface, maps of vegetation indices and land surface temperature derived from MODIS satellite data, and climatic and hydrologic datasets, and then constructed models of soil classes and properties using data mining algorithms. Ten-fold cross-validation was used to calculate the performance metrics of the models. The average R² for the soil property models was 72%, while the soil class models had, on average, a 37% out-of-bag prediction error. We then showed that these models could be used to extend predictions to the conterminous U.S. using environmental predictors with exhaustive coverage.
Third, we used the NCSS database in land use analyses to assess the on-field environmental impacts of a double-cropping system that provides both food and biomass from agricultural cropland in the mid-Atlantic region of the United States. Simulating different combinations of soil, fertilizer, harvest, and location scenarios (n = 144), we quantified trade-offs among producing a low-carbon energy source, improving soil carbon, and reducing nitrogen losses. Results showed that benefits to soil carbon, biomass yields, and in many cases profitability will be greater if winter crops are fertilized, creating a trade-off between these benefits and nitrogen losses.
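The out-of-bag prediction error reported for the soil class models can be illustrated with a small Python sketch. This is an assumption-laden toy, not the dissertation's method: it bags a simple 1-nearest-neighbour classifier instead of growing random forest trees, the data are synthetic, and the function names are hypothetical. The shared idea is that each sample is scored only by the ensemble members whose bootstrap sample did not contain it.

```python
import random
from collections import Counter


def nearest_label(point, samples):
    """Label of the training sample closest to `point` (1-nearest neighbour)."""
    closest = min(
        samples,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(point, s[0])),
    )
    return closest[1]


def out_of_bag_error(features, labels, n_models=25, seed=0):
    """Fraction of samples misclassified by a majority vote of out-of-bag models."""
    rng = random.Random(seed)
    n = len(features)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_models):
        # Bootstrap: draw n indices with replacement to form this model's bag.
        in_bag = [rng.randrange(n) for _ in range(n)]
        bag = [(features[i], labels[i]) for i in in_bag]
        in_bag_set = set(in_bag)
        for i in range(n):
            # Only samples left out of the bag get a vote from this model.
            if i not in in_bag_set:
                votes[i][nearest_label(features[i], bag)] += 1
    scored = [i for i in range(n) if votes[i]]
    if not scored:
        raise ValueError("no sample was ever out of bag; increase n_models")
    wrong = sum(1 for i in scored if votes[i].most_common(1)[0][0] != labels[i])
    return wrong / len(scored)


# Synthetic two-class example with overlapping clusters (illustrative only).
rng = random.Random(1)
features = [(rng.gauss(0.0, 1.5), rng.gauss(0.0, 1.5)) for _ in range(30)]
features += [(rng.gauss(2.0, 1.5), rng.gauss(2.0, 1.5)) for _ in range(30)]
labels = ["class_a"] * 30 + ["class_b"] * 30
oob = out_of_bag_error(features, labels)
print(f"out-of-bag prediction error = {oob:.0%}")
```

Because every prediction comes from models that never saw the sample, the out-of-bag error behaves like a built-in validation estimate and needs no separate hold-out set.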