Production-Run Software Failure Diagnosis via Adaptive Communication Tracking

abstract

Software failure diagnosis techniques work either by sampling events at production-run time or by applying bug detection algorithms. Some of these techniques require the failure to be reproduced multiple times. Those that do not are not adaptive enough when the execution platform, environment, or code changes. We propose ACT, a diagnosis technique for production-run failures that uses the machine intelligence of neural hardware. ACT learns invariants (e.g., data communication invariants) on-the-fly using the neural hardware and records any potential violation of them. Because ACT learns invariants on-the-fly, it can adapt to any change in execution setting or code. Because it records only the potentially violated invariants, the postprocessing phase can pinpoint the root cause fairly accurately without needing to observe the failure again. ACT works seamlessly for many sequential and concurrency bugs. The paper provides a detailed design and implementation of ACT in a typical multiprocessor system. It uses a three-stage pipeline for partially configurable neural networks with one hidden layer. We have evaluated ACT on a variety of programs from popular benchmarks as well as open source programs. ACT diagnoses failures caused by 16 bugs from these programs with accurate ranking. Compared to existing learning-based and sampling-based approaches, ACT has better diagnostic ability. For the default configuration, ACT has an average execution overhead of 8.2%.

name of conference

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

authors

Muzahid, Abdullah

published proceedings

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

author list (cited authors)

Alam, M., & Muzahid, A.

citation count

2

complete list of authors

Alam, Mohammad Mejbah Ul||Muzahid, Abdullah

publication date

August 2016

publisher

Institute of Electrical and Electronics Engineers (IEEE)

published in

Proceedings / Annual International Symposium on Computer Architecture. International Symposium on Computer Architecture

publication type

Conference Paper