A Flow Classifier with Tamper-resistant Features and an Evaluation of Its Portability to New Domains

Open Access
Author:
Zou, Guixi
Graduate Program:
Electrical Engineering
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
None
Committee Members:
  • David Jonathan Miller, Thesis Advisor
Keywords:
  • machine learning
  • traffic classification
  • portability
Abstract:
Flow classification techniques can be applied to network and end-host security, e.g., a flow whose destination port number indicates one application type, but whose features reflect another, is an anomaly that may indicate malicious activity. Also, network planners may wish to digest [7],[31] the types and quantities of packet-flows they handle in order to decide how to expand their network to better accommodate them. Moreover, any legal terms-of-use policies may need to be enforced by administrators of private-enterprise networks and ISPs. Flow classification by application type is motivated by on-line anomaly detection, off-line network planning, and on-line enforcement of terms-of-use policies by public ISPs or by administrators of private-enterprise networks. Both signature matching and a variety of feature-based pattern recognition methods have been applied to address this problem. In this thesis, we propose a TCP flow classifier that employs neither packet header information that is protocol-specific (including port numbers) nor packet-payload information. Techniques based on the former are readily evadable, while detailed yet scalable inspection of packet payloads is difficult to achieve, may violate privacy laws, and is defeated by data encryption. Our classifier is tested on two contemporary publicly available datasets recorded in similar networking contexts. We consider the often-encountered scenario where ground-truth labels, necessary for supervised classifier training, are unavailable for a domain where flow classification needs to be applied. In this case, one must "port over" a classifier trained on one domain to make decisions on another. We address issues in reconciling differences in class definitions between the two domains.
Our results also demonstrate that domain differences in the class-conditional feature distributions, which will exist in practice, can lead to substantial losses in classification accuracy on the new domain. Finally, we propose and evaluate a hypothesis testing approach to detect port spoofing by exploiting confusion.