NCSU Libraries

Title page for ETD etd-10312005-114614


Type of Document Master's Thesis
Author Parthasarathy, Sailashri
Author's Email Address spartha@ncsu.edu
URN etd-10312005-114614
Title Improving Transient Fault Tolerance of Slipstream Processors
Degree Master of Science
Graduate Program Computer Engineering
Advisory Committee
  Advisor Name         Title
  Dr. Eric Rotenberg   Committee Chair
  Dr. Jun Xu           Committee Member
  Dr. Suleyman Sair    Committee Member
Keywords
  • transient fault tolerance
  • slipstream processors
Date of Defense 2005-10-28
Availability unrestricted
Abstract
PARTHASARATHY, SAILASHRI. Improving Transient Fault Tolerance of Slipstream Processors. (Under the direction of Dr. Eric Rotenberg)

A slipstream processor runs two copies of a program, one slightly ahead of the other, to achieve both higher single-program performance and transient fault tolerance. The leading copy of the program, or the Advanced Stream (A-stream), is accelerated by executing only a key subset of all instructions. The partial A-stream is speculative. Therefore, a second, complete copy of the program, called the Redundant Stream (R-stream), receives and checks all A-stream outcomes. The R-stream is also accelerated in this process. Together, the A-stream and R-stream finish faster than a single program copy would.

The partial redundancy between the A-stream and R-stream enables detection of, and recovery from, transient faults. A transient fault that affects a redundantly executed instruction is easily detected, because its two instances will differ. However, a transient fault that affects a singly executed instruction (an instruction removed from the A-stream) is difficult to detect directly, because there is no redundant counterpart for comparison.

Actually, a fault in a singly executed instruction is indirectly detectable via a redundantly executed consumer. However, such a fault is unrecoverable since the fault is attributed to the consumer. Recovery is initiated too late, from the consumer instead of the faulty producer.

We propose a mechanism that conservatively attributes a detected fault, not to the redundantly executed instruction that detected it, but to its singly executed producer. Accordingly, recovery is initiated safely from the singly executed producer. Our approach works by forming a forward slice for each singly executed instruction, terminating in its direct/indirect redundantly executed consumers. Now, a consumer can mark its singly executed producer as faulty when its comparison mismatches.
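The attribution rule described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism from the thesis: the Instr record, the register-name bookkeeping, and the recovery_point function are hypothetical names introduced here, and the model assumes that a slice ends at a redundantly executed consumer whose comparison matches.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical trace record; not a structure taken from the thesis."""
    idx: int                 # position in program order
    dest: Optional[str]      # destination register, if any
    srcs: Tuple[str, ...]    # source registers
    redundant: bool          # True if executed in both A-stream and R-stream


def recovery_point(trace: List[Instr], mismatched: Set[int]) -> Optional[int]:
    """Return the instruction index from which recovery should start.

    `mismatched` holds the indices of redundantly executed instructions
    whose A-stream and R-stream results differ. Returns None if no
    mismatch is observed.
    """
    # For each register: singly executed producers whose forward slice
    # reaches the value currently held in that register.
    pending: Dict[str, FrozenSet[int]] = {}

    for ins in trace:
        inherited: FrozenSet[int] = frozenset().union(
            *(pending.get(reg, frozenset()) for reg in ins.srcs)
        )
        if ins.redundant:
            if ins.idx in mismatched:
                # Conservative attribution: blame the oldest singly
                # executed producer feeding this consumer, if there is one.
                return min(inherited) if inherited else ins.idx
            if ins.dest is not None:
                # Assumption: a matching comparison terminates the slice.
                pending.pop(ins.dest, None)
        elif ins.dest is not None:
            # A singly executed instruction extends the slices of its
            # sources and starts a slice of its own.
            pending[ins.dest] = inherited | {ins.idx}
    return None


trace = [
    Instr(0, "r1", (), redundant=False),       # singly executed producer
    Instr(1, "r2", ("r1",), redundant=False),  # singly executed, in slice of 0
    Instr(2, "r3", ("r2",), redundant=True),   # redundantly executed consumer
]
print(recovery_point(trace, mismatched={2}))
```

In this example the mismatch is observed at instruction 2, but recovery is directed to instruction 0, the oldest singly executed producer in the slice. Choosing the oldest producer is one conservative policy among several possible ones.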

A singly executed branch does not have a forward slice and thus is not checkable by consumers. However, the branch was removed from the A-stream precisely because its branch prediction is highly confident and hence very likely correct. This likely correct branch prediction is treated as a second execution for the corresponding singly executed branch, different from true execution but nearly as effective for detecting faults.
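The idea of using a confident prediction as a stand-in for a second execution can be sketched as follows. The BranchCheck record, the saturating confidence counter, and the threshold value are all hypothetical illustrations; the abstract does not specify how confidence is measured.

```python
from dataclasses import dataclass

# Hypothetical threshold for a saturating confidence counter.
CONFIDENCE_THRESHOLD = 15


@dataclass(frozen=True)
class BranchCheck:
    """Hypothetical record for one dynamic branch in the R-stream."""
    executed_taken: bool   # outcome computed by the single execution
    predicted_taken: bool  # outcome supplied by the branch predictor
    confidence: int        # predictor confidence for this branch


def check_single_branch(b: BranchCheck,
                        threshold: int = CONFIDENCE_THRESHOLD) -> str:
    """Compare a singly executed branch against its confident prediction."""
    if b.confidence < threshold:
        # Not confident enough to have been removed from the A-stream,
        # so this check does not apply to the branch.
        return "not-removed"
    # The confident prediction plays the role of the second execution.
    return "ok" if b.executed_taken == b.predicted_taken else "suspect"


print(check_single_branch(BranchCheck(True, False, 15)))
```

A "suspect" result does not prove a fault, since even a confident prediction can be wrong; it only signals a disagreement that would otherwise go unnoticed for a singly executed branch.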

In fact, the observation about confident branches extends to all redundantly executed instructions since the A-stream is predictive as a whole. All A-stream instructions are speculative, yet most likely correct in the fault-free case. This reveals an intriguing predictive checking paradigm.

Experiments using the SPEC95 and SPEC2K benchmarks show that coverage improves from 81% for baseline slipstream to 99% with only a small decrease in speedup. To obtain the same performance as baseline slipstream, we propose a relaxed checking model, which still achieves a much higher coverage of 95%.

Files
  Filename  Size       Approximate Download Time (Hours:Minutes:Seconds)
                       28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
  etd.pdf   953.54 Kb  00:04:24    00:02:16   00:01:59      00:00:59       00:00:05