Towards Secure Solutions for Cloud Applications

Open Access
Liao, Cong
Graduate Program:
Information Sciences and Technology
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
August 27, 2018
Committee Members:
  • Anna Cinzia Squicciarini, Dissertation Advisor
  • Anna Cinzia Squicciarini, Committee Chair
  • Peng Liu, Committee Member
  • Sencun Zhu, Committee Member
  • David Jonathan Miller, Outside Member
Keywords:
  • Cloud Computing
  • MapReduce
  • Provenance Logging
  • Invariant Violation Detection
  • Hadoop File System
  • Distributed Storage
  • Location-Awareness
  • Machine Learning
  • Adversarial Machine Learning
  • Security
  • Convolutional Neural Networks
With the rapid growth of Internet-enabled services and technologies, data is being generated at unprecedented velocity and volume. In light of this explosion of services and data, end users and industry players have begun to explore the potential of big data when properly utilized. The rapid development of cloud computing further advances big data analytics in the cloud. As a result, ever more data is moving to the cloud, driven by the proliferation of cloud services and open source software that provide solutions for processing, storing, and managing big data. In particular, data analytics based on machine learning has prevailed by harnessing the power of learning models to analyze massive data, succeeding at challenging tasks that include security-critical applications. However, the migration of data to the cloud raises security and privacy concerns, as users generally lack full control over how their data are managed, which hinders trust between cloud users and service providers. In addition, the application of machine learning models to sensitive data and services has opened new attack venues for adversaries, requiring further investigation to better understand the vulnerabilities and improve the robustness of learning models. In this dissertation, we aim to mitigate these concerns and to foster trust between cloud users and service providers by giving users a degree of control over how their data are used. Specifically, we examine the Hadoop open-source software stack and present innovative mechanisms that extend users' control over how data flow during storage in the Hadoop Distributed File System (HDFS) and processing with MapReduce. Further, we investigate the vulnerabilities that machine learning models expose in adversarial environments, and study two types of attacks against such models in particular.
Our main contributions in these three subtopics are as follows. First, MapReduce enables parallel and distributed data processing in a cluster, but it is subject to threats posed by malicious workers that may tamper with data and computation. We investigate the computational data flow of MapReduce to detect anomalies in this process. Accordingly, we develop a computational provenance system that captures provenance data related to MapReduce computation within Hadoop's MapReduce framework. In particular, we identify a set of invariants over the aggregated provenance information, which are then analyzed to uncover anomalies indicating possible tampering with data and computation. Second, cloud storage has gained increasing popularity in recent years. Despite its benefits, the lack of knowledge of, and control over, the physical locations of data can raise legal and regulatory issues, especially for sensitive data that laws require to remain within certain geographic boundaries. We study the data flow in Hadoop's underlying storage system and address the data placement control problem by supporting policy-driven, location-aware storage in HDFS as files are uploaded and managed for load balancing. User policy is consistently enforced, and data movement is monitored to detect placement violations. Third, machine learning models have been found vulnerable to well-crafted input samples that cause misclassification (evasion attacks) and to malicious samples that contaminate the training process and degrade the model's efficacy (poisoning attacks). Building on these known attacks, we explore two new attacks, named model manipulation and backdoor injection, whose goal is to make the model misclassify certain malicious samples while still performing well on normal ones. The former manipulates the model parameters rather than the input sample; the latter injects a backdoor into the model, to be exploited later by well-crafted inputs, by poisoning its training process with malicious samples.
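The idea behind the first contribution, checking invariants over aggregated provenance, can be illustrated with a minimal sketch. The record format, field names, and the particular invariant (pairs emitted by mappers must equal pairs consumed by reducers) are illustrative assumptions, not the dissertation's actual provenance schema.

```python
# Hypothetical sketch: detect tampering by checking a count invariant
# over per-task provenance records (illustrative format, not the real system).

def check_count_invariant(map_records, reduce_records):
    """Return True if the key-value pairs emitted by all mappers
    match the pairs consumed by all reducers; a mismatch suggests
    data or computation was tampered with."""
    emitted = sum(r["emitted"] for r in map_records)
    consumed = sum(r["consumed"] for r in reduce_records)
    return emitted == consumed

# An honest run: 120 + 80 emitted, 150 + 50 consumed.
honest = check_count_invariant(
    [{"task": "map_0", "emitted": 120}, {"task": "map_1", "emitted": 80}],
    [{"task": "red_0", "consumed": 150}, {"task": "red_1", "consumed": 50}],
)

# A tampered run: a malicious reducer silently dropped records.
tampered = check_count_invariant(
    [{"task": "map_0", "emitted": 120}, {"task": "map_1", "emitted": 80}],
    [{"task": "red_0", "consumed": 150}, {"task": "red_1", "consumed": 30}],
)
```

A real system would check several such invariants (counts, checksums, key ranges) and attribute violations to individual worker tasks.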
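The second contribution, policy-driven location-aware placement, amounts to restricting which datanodes may receive a user's blocks. The sketch below is a simplified assumption of how such a filter might look; the node dictionaries, region tags, and function name are hypothetical, not HDFS APIs.

```python
# Hypothetical sketch: filter candidate datanodes by a user's
# location policy before block placement (illustrative, not HDFS code).

def eligible_datanodes(datanodes, policy_regions):
    """Return only the datanodes whose region satisfies the
    user's placement policy; blocks are never placed elsewhere."""
    return [dn for dn in datanodes if dn["region"] in policy_regions]

nodes = [
    {"host": "dn1", "region": "eu-west"},
    {"host": "dn2", "region": "us-east"},
    {"host": "dn3", "region": "eu-central"},
]

# Policy: this user's data must stay within EU regions.
allowed = eligible_datanodes(nodes, {"eu-west", "eu-central"})
```

In HDFS itself this logic would live in a custom block placement policy, and the same check would be re-applied when the balancer moves replicas, so that load balancing cannot silently violate the policy.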
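The backdoor injection attack can be sketched at the data level: a fraction of training samples is stamped with a trigger pattern and relabeled to the attacker's target class, so a model trained on the poisoned set associates the trigger with that class. The function below is a toy illustration under assumed names and a single-feature trigger, not the dissertation's attack implementation.

```python
# Hypothetical sketch: poison a training set to plant a backdoor
# (toy feature vectors; the real attack stamps a pixel pattern on images).

def inject_backdoor(samples, labels, trigger_index, target_label, n_poison):
    """Return a poisoned copy of the training set in which the first
    n_poison samples carry the trigger (one feature set to 1) and are
    relabeled to the attacker's target class."""
    poisoned_x, poisoned_y = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        x = list(x)  # copy so the clean set is left untouched
        if i < n_poison:
            x[trigger_index] = 1
            y = target_label
        poisoned_x.append(x)
        poisoned_y.append(y)
    return poisoned_x, poisoned_y

px, py = inject_backdoor(
    [[0, 0], [0, 0], [0, 0]], [0, 1, 0],
    trigger_index=1, target_label=1, n_poison=2,
)
```

Because only a small fraction of samples is altered, a model trained on the poisoned set can retain near-normal accuracy on clean inputs while misclassifying any input that carries the trigger, which is exactly the stealth property the attack aims for.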