Graphics Processing Units (GPUs) once served the limited function of rendering graphics. With technological advances, these devices gained new purposes beyond graphics. Most modern GPUs expose APIs that allow processing of data beyond the display, leading to a shift in computing where instructions and intensive tasks can be offloaded to these now General Purpose Graphics Processing Units (GPGPUs). Many compute- and memory-intensive tasks use GPGPUs for acceleration, and these devices are especially prevalent in the financial, pharmaceutical, and automotive industries. While computing resources have increased exponentially, memory resources have not, creating a limiting factor known as the memory wall. GPUs were designed as application-specific processing units for the streaming data access patterns found in graphical applications. They are successful at their original purpose, but when extended to general-purpose problems they meet the same memory-wall data access problem as their CPU counterparts; they can be even more susceptible to the effects of latency due to the locality and concurrency of instructions operating on the data. This thesis reviews the current GPGPU landscape, including the design of current scheduling systems, GPGPU architecture, and simulators, as well as a way of computing and describing the memory access penalty with Concurrent Average Memory Access Time (C-AMAT). We have devised a solution that manipulates the number of scheduled thread groups so that a GPGPU's processing units match their current memory state as defined by C-AMAT.
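As a point of reference, C-AMAT extends the classic AMAT metric by dividing hit time and miss penalty by their respective concurrencies; the sketch below follows the published formula C-AMAT = H/C_H + pMR * pAMP/C_M (Sun and Wang). All numeric values in the example are illustrative assumptions, not measurements from this thesis.

```python
def c_amat(hit_cycles, hit_concurrency, pure_miss_rate,
           pure_miss_penalty, miss_concurrency):
    """Concurrent Average Memory Access Time, in cycles per access.

    hit_cycles        (H):    latency of a cache hit
    hit_concurrency   (C_H):  average number of overlapping hits
    pure_miss_rate    (pMR):  fraction of accesses that are pure misses
    pure_miss_penalty (pAMP): average penalty of a pure miss, in cycles
    miss_concurrency  (C_M):  average number of overlapping pure misses
    """
    return (hit_cycles / hit_concurrency
            + pure_miss_rate * pure_miss_penalty / miss_concurrency)

# Illustrative example: 1-cycle hits served two at a time, with 5% of
# accesses being 200-cycle pure misses, four of which overlap on average.
print(c_amat(1.0, 2.0, 0.05, 200.0, 4.0))  # 0.5 + 2.5 = 3.0 cycles
```

With concurrencies of 1 the expression degenerates to AMAT, which is why C-AMAT is a natural fit for GPGPUs, where memory-level parallelism hides much of the raw miss penalty.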
Our solution increases instructions per cycle (IPC), reduces C-AMAT, and decreases memory misses. Its effect varies across types of computing problems, with the highest improvements achieved on compute-intensive memory patterns: as much as a 12% improvement in IPC and a 14% reduction in C-AMAT.
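The core idea of throttling thread groups against the observed memory state can be sketched as a simple feedback rule: reduce active thread-level parallelism when C-AMAT is rising (the memory system is saturating) and restore it when C-AMAT falls. The function below is a minimal illustration under assumed names and bounds, not the thesis's actual scheduler implementation.

```python
def adjust_active_groups(active, c_amat_now, c_amat_prev,
                         min_groups=1, max_groups=48):
    """One step of a hypothetical C-AMAT-driven throttle.

    active: number of thread groups the scheduler currently keeps
    eligible; min_groups/max_groups are illustrative hardware bounds.
    """
    if c_amat_now > c_amat_prev:
        # Memory access cost is rising: back off to reduce contention.
        active = max(min_groups, active - 1)
    elif c_amat_now < c_amat_prev:
        # Memory access cost is falling: expose more concurrency.
        active = min(max_groups, active + 1)
    return active

# Rising C-AMAT throttles down; falling C-AMAT throttles back up.
print(adjust_active_groups(10, 5.0, 4.0))  # 9
print(adjust_active_groups(10, 3.0, 4.0))  # 11
```

In a real scheduler this decision would run per sampling interval and per core, using C-AMAT measured from hardware counters rather than scalar inputs.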