Graphics Processing Units (GPUs) have been widely adopted for general-purpose applications owing to their massive degree of parallelism, and demand for large-scale GPUs that process enormous volumes of data at high throughput is rising rapidly. However, the performance of massively parallel workloads typically suffers from multiple constraints, such as limited memory bandwidth, high memory latency, and power/energy cost, and designing a bandwidth-efficient network for large-scale GPUs remains challenging. In this research, we focus on mitigating network bottlenecks by reducing the size of packets transferred through the interconnection network, thereby improving overall system performance. We first investigate the unused fraction of each L1 data cache block across a variety of benchmark suites to expose inefficient cache usage. Then, categorizing memory access patterns into several types, we introduce the micro-architectural enhancements needed to filter unnecessary words out of packets along the reply path. A compression scheme well suited to packet compression, Dual Pattern Compression (DPC), is employed to further reduce the size of reply packets. We demonstrate that our scheme effectively improves system performance: combined with DPC, it yields a 39% IPC improvement over the baseline across heterogeneous computing and text processing benchmarks. Compared with DPC alone, we achieve a 5% IPC improvement across the full benchmark suites and a 20% IPC increase for workloads favorable to this scheme.
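To make the word-filtering idea concrete, the following is a minimal sketch in C, not the paper's implementation: given a 128-byte L1 data cache line (32 four-byte words) and a bitmask of the words a warp's access pattern actually touches, only the used words are packed into the reply packet. The 32-word line geometry, the used_mask name, and the filter_reply helper are illustrative assumptions, not details taken from this work.

/* Illustrative sketch of reply-path word filtering (assumptions noted above). */
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_LINE 32  /* 128-byte cache line / 4-byte words */

/* Pack only the used words of `line` into `packet`; return packet size in words. */
static int filter_reply(const uint32_t line[WORDS_PER_LINE],
                        uint32_t used_mask,
                        uint32_t packet[WORDS_PER_LINE])
{
    int n = 0;
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if (used_mask & (1u << w))
            packet[n++] = line[w];   /* keep only the words the warp needs */
    return n;
}

int main(void)
{
    uint32_t line[WORDS_PER_LINE], packet[WORDS_PER_LINE];
    for (int w = 0; w < WORDS_PER_LINE; w++)
        line[w] = 0x1000u + w;       /* dummy cache-line contents */

    /* Example: a strided access pattern touching every fourth word. */
    uint32_t used_mask = 0x11111111u;
    int n = filter_reply(line, used_mask, packet);
    printf("reply shrinks from %d to %d words\n", WORDS_PER_LINE, n);
    return 0;
}

For this strided pattern the reply packet shrinks from 32 words to 8, which is the kind of reduction the reply-path filtering (before any DPC compression) aims to achieve.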