Tuesday, May 10, 2016

performance - Is GPU locality of reference worth worrying about?



Does locality of reference make as much of a difference to GPU performance as it does CPU performance?


For example, if I send 200 draw commands to the GPU, will I see a (potentially) noticeable difference if the data for each command is contiguous in memory instead of jumping around the buffers/texture maps?


Side question: I'm assuming the GPU guards against false-sharing issues by keeping most resources immutable. But in the cases where they're not, is that why threads always do four fragments' worth of work?



Answer



Locality of reference does matter, but you don't have to worry that much...because you don't have absolute control.


When using OpenGL/DirectX you usually have limited control over memory layout; the driver does the rest. For example, you can try multiple vertex buffer layouts, such as interleaved or non-interleaved vertex data, and performance will vary depending on your data, driver, and GPU. Profile and choose what best fits your application.
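As a rough sketch of what "interleaved" means in practice (assuming an OpenGL 3+ context and a loader such as GLAD; the struct and function names here are made up for illustration), an interleaved layout packs each vertex's attributes into one contiguous block, so fetching a vertex touches a single cache-friendly region:

    #include <glad/glad.h>   // hypothetical loader choice; any GL loader works
    #include <cstddef>       // offsetof

    // Interleaved vertex: position, normal and uv sit next to each other,
    // so one vertex fetch reads one contiguous 32-byte block.
    struct Vertex {
        float position[3];
        float normal[3];
        float uv[2];
    };

    void uploadInterleaved(GLuint vbo, const Vertex* vertices, GLsizei count)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, count * sizeof(Vertex), vertices, GL_STATIC_DRAW);

        // All attributes share one buffer and one stride; note the driver
        // still decides where this buffer actually lives in GPU memory.
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                              (void*)offsetof(Vertex, position));
        glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                              (void*)offsetof(Vertex, normal));
        glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                              (void*)offsetof(Vertex, uv));
        glEnableVertexAttribArray(0);
        glEnableVertexAttribArray(1);
        glEnableVertexAttribArray(2);
    }

The non-interleaved alternative would be separate buffers (or separate regions of one buffer), one per attribute; which layout wins depends on your data, driver and GPU, which is why profiling is the real answer.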


For instance, in the GPU Gems chapter on pipeline optimization, locality of reference is mentioned twice. The first:



Access vertex data in a relatively sequential manner. Modern GPUs cache memory accesses when fetching vertices. As in any memory hierarchy, spatial locality of reference helps maximize hits in the cache, thus reducing bandwidth requirements.




And the second:



Optimize for the post-T&L vertex cache. Modern GPUs have a small first-in, first-out (FIFO) cache that stores the result of the most recently transformed vertices; a hit in this cache saves all transform and lighting work, along with all work done earlier in the pipeline. To take advantage of this cache, you must use indexed primitives, and you must order your vertices to maximize locality of reference over the mesh. There are tools available—including D3DX and NVTriStrip (NVIDIA 2003)—that can help you with this task.



In my opinion those recommendations follow what I was saying: they imply that you don't have absolute control over memory layout, yet what you do control, for example how the vertices in each VBO are laid out, can have an effect on performance.
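To make the second recommendation concrete, here is a minimal, hypothetical sketch of drawing with indexed primitives (the function name and the assumption of an already-configured VAO are mine, not from the book):

    // Draw with an index buffer so repeated vertices can hit the
    // post-transform (post-T&L) cache instead of being re-shaded.
    // Assumes a bound VAO with vertex attributes already set up.
    void drawIndexed(GLuint ebo, const GLuint* indices, GLsizei indexCount)
    {
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
        glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexCount * sizeof(GLuint),
                     indices, GL_STATIC_DRAW);

        // Index order matters: reordering triangles so nearby triangles reuse
        // recently emitted indices (what NVTriStrip / D3DX-style optimizers do)
        // improves post-T&L cache hits.
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr);
    }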


If your application is taking a performance hit, you should first locate the bottleneck. It might not be a data locality problem at all; it might be that you are pushing a huge amount of data with no culling, for example because you are not performing frustum culling, etc. You can check my answer here on the topic.
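Purely as an illustration of that kind of culling (a standard sphere-versus-frustum test in plain C++, with hypothetical Plane/Sphere types and plane normals assumed to point inward), something like this runs before the draw calls are issued:

    // Plane equation: nx*x + ny*y + nz*z + d = 0, normal pointing into the frustum.
    struct Plane  { float nx, ny, nz, d; };
    struct Sphere { float cx, cy, cz, radius; };

    bool insideFrustum(const Plane planes[6], const Sphere& s)
    {
        for (int i = 0; i < 6; ++i) {
            float dist = planes[i].nx * s.cx + planes[i].ny * s.cy +
                         planes[i].nz * s.cz + planes[i].d;
            if (dist < -s.radius)
                return false;   // completely behind this plane: cull it
        }
        return true;            // inside or intersecting all planes: draw it
    }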


I think you should worry more about locality of reference when using OpenCL/CUDA, where you often have absolute control over memory layout.
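For example (plain C++ just to show the layout idea, not an actual kernel), the same particle data can be stored as an array of structures or as a structure of arrays; in CUDA/OpenCL the structure-of-arrays form lets consecutive threads read consecutive floats, which coalesces into fewer, wider memory transactions:

    #include <vector>

    // Array of structures: thread i reading particlesAoS[i].x produces a
    // strided access pattern (stride = sizeof(ParticleAoS)).
    struct ParticleAoS { float x, y, z, mass; };
    std::vector<ParticleAoS> particlesAoS;

    // Structure of arrays: thread i reading x[i] sits right next to thread
    // i+1 reading x[i+1], so on a GPU the loads coalesce.
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;
    };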

