Shankar, Anusha (2014-12). Lock Prediction to Reduce the Overhead of Synchronization Primitives. Master's Thesis.

abstract

  • The advent of chip multiprocessors has increased computational performance in recent years, and efficient parallel algorithms have become important for harnessing the full potential of multiple cores. One of the major productivity limitations in parallel programming arises from the use of synchronization primitives, which enforce mutual exclusion on critical-section data. Most shared-memory multiprocessor architectures provide hardware support for mutually exclusive access to shared data structures through lock and unlock operations. These operations are implemented in hardware as a set of instructions that atomically read and then write a single memory location. Good synchronization techniques should reduce network bandwidth usage, keep lock acquisition latency low, and grant requests fairly.
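As a rough illustration of the atomic read-then-write operations described above, the following sketch builds a spinlock on an emulated test-and-set. It is not taken from the thesis: the names `SpinLock` and `run_counter` are hypothetical, and a `threading.Lock` stands in for the atomicity that real hardware provides in a single instruction.

```python
import threading
import time


class SpinLock:
    """Spinlock built on an emulated atomic test-and-set (illustrative only)."""

    def __init__(self):
        self._flag = False               # the single "memory location"
        self._atomic = threading.Lock()  # assumption: stands in for hardware atomicity

    def _test_and_set(self):
        # Atomically read the old value and write True.
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        # Spin until the old value was False, meaning this thread set the flag.
        while self._test_and_set():
            time.sleep(0)                # yield so the current holder can run

    def release(self):
        with self._atomic:
            self._flag = False


def run_counter(num_threads=4, increments=200):
    """Increment a shared counter under the spinlock and return the total."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1                # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Because every increment happens inside the critical section, `run_counter(4, 200)` should return exactly 800. On a real multiprocessor, each failed test-and-set would also generate coherence traffic, which is part of the overhead the abstract is concerned with.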

    In a typical directory-controller-based locking scheme, each thread communicates with the directory controller for every lock request and lock release. Because each step of lock acquisition involves the directory controller, these transactions have high latency, and a significant amount of time is spent on communication compared with the actual operation.
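The communication pattern described above can be pictured with a toy model in which every request, grant, and release passes through the home node. This is a sketch under stated assumptions, not the protocol evaluated in the thesis: the class `HomeNode`, its FIFO waiter queue, and the one-message-per-step accounting are all hypothetical simplifications.

```python
from collections import deque


class HomeNode:
    """Toy directory controller for a single lock (hypothetical model)."""

    def __init__(self):
        self.holder = None
        self.waiters = deque()  # assumption: FIFO queue, so grants are fair
        self.messages = 0       # messages sent to or from the home node

    def request(self, thread_id):
        self.messages += 1      # thread -> home: lock request
        if self.holder is None:
            self._grant(thread_id)
        else:
            self.waiters.append(thread_id)

    def release(self, thread_id):
        assert self.holder == thread_id, "only the holder may release"
        self.messages += 1      # thread -> home: lock release
        self.holder = None
        if self.waiters:
            self._grant(self.waiters.popleft())

    def _grant(self, thread_id):
        self.messages += 1      # home -> thread: lock grant
        self.holder = thread_id
```

In this simplified accounting, each acquire/release pair costs three home-node messages regardless of how predictable the next requestor is, which is the kind of overhead that direct forwarding schemes try to avoid.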

    Previous work has focused on reducing communication with the home node through various techniques. One technique of interest is Implicit Queue on Lock Bit (IQOLB), in which the lock is forwarded directly from the thread currently holding it to the requestor, without communication through the home node. The method has two limitations: forwarding can take place only after the current lock holder has received information about the new requestor from the home node, and the cache coherence protocol must be modified to distinguish a regular memory read request from a synchronization request.

    Very little research has been performed in the area of lock prediction. Based on data analysis, we believe that lock communication is predictable and that prediction can improve performance significantly. This research focuses on predicting the sequence in which locks are acquired, so that the thread currently holding a lock can preemptively invalidate the locked cache line and forward it to subsequent requestors, reducing the time taken to acquire the lock. The predictor is adaptive: whenever a lock is biased towards a thread, the lock remains in that thread's cache and no invalidation takes place. The benefits of the technique include a reduction in the number of messages exchanged with the home node, without any modification to the cache coherence protocol (it does not distinguish a regular memory read request from a synchronization request). Evaluation of the lock predictor on the PARSEC benchmark suite shows an average improvement in overall performance of 9% over the base case.
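The abstract does not describe the predictor's internals, so the following is only a minimal sketch of one way a per-lock successor predictor could behave. It assumes a simple history table that counts which thread acquired a lock after each holder. The class `LockPredictor` and its methods are hypothetical names, not the thesis's design.

```python
from collections import Counter, defaultdict


class LockPredictor:
    """Toy per-lock successor predictor (hypothetical, for illustration)."""

    def __init__(self):
        # history[lock_id][holder] counts which thread acquired the lock next.
        self._history = defaultdict(lambda: defaultdict(Counter))
        self._last_holder = {}

    def record_acquire(self, lock_id, thread_id):
        """Update the history table when thread_id acquires lock_id."""
        prev = self._last_holder.get(lock_id)
        if prev is not None:
            self._history[lock_id][prev][thread_id] += 1
        self._last_holder[lock_id] = thread_id

    def predict_next(self, lock_id, holder):
        """Return the most frequent successor of holder, or None if unknown."""
        successors = self._history[lock_id][holder]
        if not successors:
            return None
        return successors.most_common(1)[0][0]

    def on_release(self, lock_id, holder):
        """Return the thread to forward the lock line to, or None to keep it.

        Assumption: when the predicted next acquirer is the holder itself
        (the lock is biased towards it), the line stays in the holder's
        cache and no invalidation is triggered.
        """
        nxt = self.predict_next(lock_id, holder)
        if nxt is None or nxt == holder:
            return None
        return nxt
```

For a lock acquired in the round-robin order 0, 1, 2, 0, 1, 2, 0, `on_release` for holder 0 would name thread 1 as the forwarding target. For a lock acquired repeatedly by one thread, it would return `None` and leave the line where it is.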

publication date

  • December 2014