Detecting and Recovering from Large-scale Failures in The Internet

Open Access
Zheng, Qiang
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
May 18, 2012
Committee Members:
  • Guohong Cao, Dissertation Advisor
  • Thomas F Laporta, Committee Member
  • Sencun Zhu, Committee Member
  • Tao Yao, Committee Member
  • Internet
  • large-scale failures
  • detection
  • recovery
The Internet has become a significant and widely used infrastructure for a wide range of communication and other services. Internet reliability and availability are crucial to many applications such as financial transactions, online games, Voice over IP, and video services. However, large-scale failures, caused by events like natural disasters and intentional attacks, are common in the Internet. Unlike sporadic and isolated link failures, large-scale failures usually lead to serious routing disruption and severe packet loss. Therefore, efficiently monitoring the Internet to detect large-scale failures and quickly recovering from failures are particularly important for enhancing Internet reliability and availability. The specific goal of this dissertation is to provide comprehensive solutions for detecting large-scale failures in the Internet and quickly recovering from large-scale failures. First, we design an approach for monitoring link performance using tomography-based end-to-end probe. Given a set of links to monitor, the objective is to select the minimum number of probing paths to uniquely determine the performance of all identifiable links and cover all unidentifiable links. Through this, we can identify the links with abnormal performance. Second, we propose a two-phase approach for detecting and localizing large-scale router failures using traceroute-like active probes. To detect large-scale router failures, the detection phase is periodically invoked to probe all routers. When detecting large-scale router failures, the localization phase is triggered to identify the failed routers. We reduce the probing cost by avoiding three types of useless probes. For the routers whose status cannot be identified by probes, we develop a distance-based method to estimate their failure probability. Third, we design an approach called Reactive Two-phase Rerouting (RTR) for intra-domain routing to quickly recover from geographically correlated failures with the shortest recovery paths. To recover a failed routing path, RTR first forwards packets around the failure area to collect information on failures. Then, in the second phase, RTR calculates a new shortest path and forwards packets along it through source routing. RTR can deal with geographically correlated failures associated with areas of any shape and location. Finally, we propose two cross-layer approaches built on backup paths to recover intra-domain routing from correlated link failures. Based on the mapping between the IP layer and optical layer topologies, we develop a correlated failure probability (CFP) model and a probabilistically correlated failure (PCF) model to quantify the impact of IP link failure on the reliability of backup paths. With the CFP model, we propose two algorithms to select a backup path to protect each IP link. The first algorithm aims at choosing the backup paths with minimum failure probability, and the second algorithm further considers the bandwidth constraint and aims at minimizing the traffic disruption caused by failures. With the PCF model, we propose an algorithm to choose multiple reliable backup paths for each IP link to minimize the routing disruption caused by failures.