Albarakat, Laith Mohammad (2017-08). Multithreading Aware Hardware Prefetching for Chip Multiprocessors. Master's Thesis.

abstract

  • To take advantage of the processing power of a Chip Multiprocessor,
    applications must be divided into semi-independent processes that can run
    concurrently on multiple cores within a system. Programmers must therefore
    insert thread synchronization primitives (i.e., locks, barriers, and
    condition variables) to synchronize data access between processes. In
    practice, threads spend a long time waiting to acquire the lock of a
    critical section. In addition, a processor has to stall execution to wait
    for load data accesses to complete. Furthermore, there are often
    independent instructions, including load instructions, beyond the
    synchronization point that could be executed in parallel while a thread
    waits on the synchronization primitive. The convenience of cache memories
    comes at extra cost in Chip Multiprocessors. Cache Coherence mechanisms
    address the memory consistency problem, but they add considerable overhead
    to memory accesses. Aggressive prefetchers on the different cores of a
    Chip Multiprocessor can lead to significant system performance degradation
    when running multi-threaded applications. This degradation results from
    prefetch-demand interference: when a prefetcher in one core pulls shared
    data from a producing core before that data has been written, the cache
    block ends up transitioning back and forth between the cores, producing
    useless prefetches, saturating memory bandwidth, and substantially
    increasing the latency to critical shared data.

    We present a hardware prefetcher that enables large performance
    improvements from prefetching in Chip Multiprocessors by significantly
    reducing prefetch-demand interference. Furthermore, it utilizes the time a
    thread spends waiting on synchronization primitives to run ahead of the
    critical section, speculating on and prefetching the data of independent
    load instructions beyond the synchronization point.
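    The situation the abstract describes can be illustrated with a minimal
    pthreads sketch (names and data are illustrative assumptions, not code
    from the thesis): while a thread blocks waiting for a lock, the loads
    after the critical section do not depend on the lock-protected data and
    are exactly the kind of independent accesses a runahead prefetcher could
    fetch early.

    ```c
    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical shared state; names are illustrative. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter = 0;
    static int independent_data[4] = {10, 20, 30, 40};

    /* Loads of independent_data do not depend on shared_counter,
     * so in principle they could be prefetched while a thread is
     * still blocked on the mutex above the critical section. */
    static int sum_independent(void)
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += independent_data[i];
        return sum;
    }

    static void *worker(void *arg)
    {
        int id = *(int *)arg;

        /* The thread may stall here waiting for the lock. */
        pthread_mutex_lock(&lock);
        shared_counter += 1;          /* critical section */
        pthread_mutex_unlock(&lock);

        /* Independent loads beyond the synchronization point. */
        printf("thread %d: sum = %d\n", id, sum_independent());
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("shared_counter = %d\n", shared_counter);
        return 0;
    }
    ```

    In this sketch the loop in sum_independent touches data no other thread
    writes, so prefetching it during the lock wait cannot trigger the
    prefetch-demand interference described above; prefetching shared_counter
    from the producing core before it is written would.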

publication date

  • August 2017