Efficient distributed recovery using message logging Conference Paper uri icon

abstract

  • Various distributed algorithms are presented, that allow nodes in a distributed system to recover from crash failures efficiently. The algorithms are independent of the application programs running on the nodes. The algorithms log messages and checkpoint states of the processes to stable storage at each node. Both logging of messages and checkpointing of process states can be done asynchronously with the execution of the application. Upon restarting after a failure, a node initiates a procedure in which the nodes use the logs and checkpoints on stable storage to roll back to earlier local states, such that the resulting global state is maximal and consistent. The first algorithm requires adding extra information of size O(n) to each application message (where n is the number of nodes); for each failure, O(n2) messages are exchanged, but no node rolls back more than once. The second algorithm only requires extra information of size O(1) on each application message, but requires O(n3) messages per failure. Both the above algorithms require that each process should be able to send messages to each of the other processes. We also present algorithms for recovery on networks, in which each process only communicates with its neighbors. Finally, we show how to decompose large networks into smaller networks so that each of the smaller network can use a different recovery procedure.

published proceedings

  • Proceedings of the Annual ACM Symposium on Principles of Distributed Computing

author list (cited authors)

  • Sistla, A. P., & Welch, J. L.

complete list of authors

  • Sistla, AP||Welch, JL

publication date

  • December 1989