Production-Run Software Failure Diagnosis via Adaptive Communication Tracking Conference Paper

abstract

  • © 2016 IEEE. Software failure diagnosis techniques work either by sampling events at production-run time or by applying bug detection algorithms. Some of these techniques require the failure to be reproduced multiple times. Those that do not are not adaptive enough when the execution platform, environment, or code changes. We propose ACT, a diagnosis technique for production-run failures that uses the machine intelligence of neural hardware. ACT learns invariants (e.g., data communication invariants) on the fly using the neural hardware and records any potential violations of them. Because ACT learns invariants on the fly, it can adapt to any change in execution setting or code. Because it records only the potentially violated invariants, the postprocessing phase can pinpoint the root cause fairly accurately without needing to observe the failure again. ACT works seamlessly for many sequential and concurrency bugs. The paper provides a detailed design and implementation of ACT in a typical multiprocessor system. It uses a three-stage pipeline for partially configurable one-hidden-layer neural networks. We have evaluated ACT on a variety of programs from popular benchmarks as well as open-source programs. ACT diagnoses failures caused by 16 bugs from these programs with accurate ranking. Compared to existing learning- and sampling-based approaches, ACT has better diagnostic ability. For the default configuration, ACT has an average execution overhead of 8.2%.
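The idea in the abstract of learning invariants online with a one-hidden-layer network and recording only potential violations can be sketched in software as follows. This is a minimal illustration, not the paper's hardware design: the class `OneHiddenLayerNet`, the function `record_potential_violations`, the two-element feature encoding, the 0.5 threshold, and the training rule (plain stochastic gradient descent on squared error) are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OneHiddenLayerNet:
    """Hypothetical one-hidden-layer network trained one sample at a time."""

    def __init__(self, n_in, n_hidden, lr=0.5, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0
        self.lr = lr

    def _forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def predict(self, x):
        """Score in (0, 1); higher means 'looks like a learned invariant'."""
        return self._forward(x)[1]

    def train(self, x, target):
        """One SGD step on squared error for a single observed event."""
        hidden, out = self._forward(x)
        d_out = (out - target) * out * (1.0 - out)
        for j, h in enumerate(hidden):
            # Hidden-layer gradient uses the output weight before its update.
            d_hidden = d_out * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= self.lr * d_out * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= self.lr * d_hidden * xi
            self.b1[j] -= self.lr * d_hidden
        self.b2 -= self.lr * d_out


def record_potential_violations(net, events, threshold=0.5):
    """Keep only events the network scores below the threshold."""
    log = []
    for index, features in enumerate(events):
        score = net.predict(features)
        if score < threshold:
            log.append((index, features, score))
    return log


# Usage: features are an assumed encoding of a (writer, reader) pair.
net = OneHiddenLayerNet(n_in=2, n_hidden=3)
for _ in range(50):
    net.train([0.2, 0.7], 1.0)  # a communication seen in correct runs
suspects = record_potential_violations(net, [[0.2, 0.7], [0.9, 0.1]])
```

Only the entries in `suspects` would be handed to a postprocessing step, which mirrors the abstract's point that recording just the potentially violated invariants keeps the later diagnosis focused.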

name of conference

  • 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

published proceedings

  • 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

author list (cited authors)

  • Alam, M., & Muzahid, A.

citation count

  • 2

complete list of authors

  • Alam, Mohammad Mejbah Ul||Muzahid, Abdullah

publication date

  • August 2016