Defect/Fault Tolerant Systems and Design for Testability

Friday, January 29, 2010 - 12:00pm

By Rudrajit Datta, Graduate Research Assistant, Prof. Nur Touba’s Group

With the increasing design complexity of modern-day electronic systems, fault tolerance is becoming more and more important for guaranteeing reliable operation under all operating conditions. Dr. Nur Touba and graduate students have been focusing on the design of fault tolerant systems. Fault tolerance finds application in a wide variety of scenarios, ranging from satellites to modern microprocessors. Fault tolerant systems can withstand defects and still produce the specified output despite faults that are occurring or have already occurred. Similarly, design for testability (DFT) is a set of design techniques that makes complicated electronic systems easier to test. Researchers at the Computer Aided Testing (CAT) Lab at CERC have been developing modern techniques for DFT as well as for the design of reliable systems. Neither of these is just an academic concept anymore. Our research often leads to the exchange of ideas with leading companies in our field, such as Intel Corporation. Ideas and techniques developed at the CAT Lab have been implemented by companies including Intel Corporation and LogicVision (recently acquired by Mentor Graphics), and have led to joint publications at top conferences.

Fault tolerance is usually achieved by means of redundancy, which can take several forms, such as information redundancy and hardware redundancy. One of the most common methods of fault tolerance through information redundancy is the use of parity bits. Parity computed over a set of bits can be used to detect and correct erroneous data. This is particularly useful for protecting data stored in semiconductor memory, where transient disturbances such as radiation and power supply noise can cause bit flips. To protect the data integrity of the memory, an error correcting code (ECC) is employed. ECCs range from simple single-error detection to highly complex multi-error detection and multi-error correction codes. We have been working to develop newer and more effective types of ECC to counter the reliability issues that grow with continued voltage scaling. In some cases, we have tried to augment the reliability of ECC with hardware redundancy, using spare rows and columns in the memory array. As operating power becomes a growing concern, our work can help maintain the data integrity of memories at low power by improving upon existing ECC. In DFT, we have developed newer techniques to mitigate the problem of X’s, or unknowns, which arise during testing from uninitialized memory elements, bus contention, etc. Compacting output streams that contain unknown ‘X’ values is a major issue for test compression and built-in self-test (BIST), because X’s corrupt the final signature, making it unknown. At the CAT Lab, we have developed techniques shown to achieve better X-compaction than existing methods. Our proposed schemes have also been implemented for industrial designs in conjunction with Intel Corporation.
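
To make the idea of parity-based error correction concrete, the following is a minimal sketch of the textbook Hamming(7,4) code, which protects 4 data bits with 3 parity bits and corrects any single bit flip. It is an illustration only and not one of the ECC schemes developed at the CAT Lab; the function names encode and decode are hypothetical.

```python
# Illustrative Hamming(7,4) code (textbook example, assumed for illustration).
# Codeword positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.

def encode(data):
    """Encode a list of 4 data bits into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position).

    error_position is the 1-based position of the corrected bit,
    or 0 if the parity checks found no error.
    """
    c = list(codeword)
    # Recompute each parity check over the received bits.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The failing checks, read as a binary number, give the flipped position.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    corrupted = list(stored)
    corrupted[4] ^= 1  # simulate a transient bit flip at position 5
    data, position = decode(corrupted)
    print(data, position)
```

A code of this kind corrects one flipped bit per codeword; two flips in the same codeword are miscorrected, which is one reason stronger multi-error codes and spare rows and columns are of interest.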

DFT and the design of fault tolerant systems continue to face further challenges as more and more transistors are packed onto a chip, with the count growing well beyond a billion. There is tremendous scope for active research in both of these topics and in related topics on reliability and test compression. Some of the most important test challenges now center on the more subtle historical missions of manufacturing testing: reliability and yield learning. It is also important to note that these challenges affect not only the manufacturing test process itself but the entire semiconductor business, in terms of both enabling the timely delivery of future processes and cost-effective products and meeting customer expectations for reliability.