Defense against test-time evasion attacks and backdoor attacks

Open Access
- Author: Wang, Hang
- Graduate Program: Electrical Engineering
- Degree: Doctor of Philosophy
- Document Type: Dissertation
- Date of Defense: September 28, 2023
- Committee Members:
  - Madhavan Swaminathan, Program Head/Chair
  - George Kesidis, Co-Chair & Dissertation Advisor
  - Constantino Lagoa, Major Field Member
  - David Miller, Co-Chair, Major Member & Dissertation Advisor
  - Jia Li, Outside Unit & Field Member
  - Vishal Monga, Major Field Member
- Keywords:
  - adversarial machine learning
  - backdoor attack
  - Trojan
  - maximum margin
  - adversarial examples
  - GANs
  - activation clipping
- Abstract:

Deep neural networks (DNNs) have been successfully applied in many areas. However, they have been shown to be vulnerable to adversarial attacks. One representative adversarial attack is the test-time evasion (TTE) attack, also known as the adversarial example attack, which modifies a test sample with a small, sample-specific, human-imperceptible perturbation so that it is misclassified by the DNN classifier. The backdoor (Trojan) attack is another type of adversarial attack that has emerged recently. A backdoor attacker aims to inject a backdoor trigger (typically a universal pattern) into an attacked DNN classifier, such that the classifier misclassifies a test sample to a pre-designated target class whenever the backdoor trigger is present. A backdoor attack can be launched either by poisoning the training dataset or by controlling the training process. Both types of attacks are very harmful, especially in high-risk applications (such as facial recognition for authorization and traffic sign recognition in self-driving cars) where misclassifications lead to serious consequences.

Defending against these attacks is important and challenging. To defend against TTE attacks, one can either robustify the DNN or detect the adversarial examples. A DNN can be robustified through adversarial training, certified training, or DNN embedding; alternatively, some adversarial examples can be identified using internal-layer activation features. Defenses against backdoor attacks can be mounted at different stages. Pre-training (or during-training) defenses aim to obtain a clean model from a potentially poisoned training set. Post-training defenses aim either to detect whether a model is attacked or to repair a potentially poisoned model so as to avoid misclassifications. Inference-time defenses aim to detect, or robustly classify, a test sample containing the backdoor trigger.

In this thesis, we propose several defenses against TTE attacks and backdoor attacks. For TTE attacks, we propose a conditional generative adversarial network based anomaly detection method (ACGAN-ADA). For backdoor attacks, we propose a pre-training data cleansing method based on contrastive learning, which cleanses the training set by filtering and relabeling out-of-distribution training samples. Several post-training defense schemes are also proposed: a maximum classification margin based backdoor detection method (MM-BD) detects whether a model is attacked. MM-BD is based on the observation that an attacked model overfits to the backdoor trigger and is thus overconfident in decisions made on samples containing the trigger. MM-BD makes no assumption about the backdoor pattern type.
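
To make the TTE threat model described above concrete, below is a minimal sketch of a gradient-based evasion attack in the spirit of the well-known fast gradient sign method (FGSM); the PyTorch framework, the perturbation budget `epsilon`, and the [0, 1] pixel range are illustrative assumptions, not the specific attacks studied in the dissertation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft a small, human-imperceptible perturbation that aims to flip the
    classifier's decision on x (illustrative FGSM sketch, not the thesis method)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, bounded per pixel by epsilon.
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    return torch.clamp(x_adv, 0.0, 1.0)  # keep the image in its valid range
```

The resulting `x_adv` differs from `x` by at most `epsilon` per pixel, yet is often enough to change the classifier's decision; this is the kind of small, sample-specific perturbation that the TTE defenses in the thesis aim to detect.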
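The maximum-margin idea behind MM-BD can likewise be sketched: for each putative target class, estimate how large the classification margin can be driven over the input domain; an attacked model, having overfit to the trigger, tends to admit an anomalously large maximum margin for the backdoor target class. The gradient-ascent estimator below, with its input shape, step count, and learning rate, is an illustrative assumption rather than the dissertation's exact procedure.

```python
import torch

def max_margin_statistic(model, target_class, input_shape=(1, 3, 32, 32),
                         steps=200, lr=0.1):
    """Gradient-ascent estimate of how large the classification margin
    (target logit minus largest non-target logit) can be made for one class.
    An anomalously large value for some class suggests backdoor overfitting."""
    x = torch.rand(input_shape, requires_grad=True)  # no clean data needed
    optimizer = torch.optim.Adam([x], lr=lr)
    margin = torch.tensor(0.0)
    for _ in range(steps):
        logits = model(torch.clamp(x, 0.0, 1.0)).squeeze(0)
        non_target = torch.cat([logits[:target_class], logits[target_class + 1:]])
        margin = logits[target_class] - non_target.max()
        optimizer.zero_grad()
        (-margin).backward()  # ascend on the margin
        optimizer.step()
    return margin.item()
```

Per-class statistics obtained this way would then be screened for an outlier to decide whether the model is backdoored; the thesis develops the actual margin-estimation and anomaly-detection procedure.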