I've been researching processors and graphics cards, and I discovered that GPUs are way faster than CPUs. I read in one article that a two-year-old Nvidia GPU outperformed a 3.2 GHz Intel Core i7 processor by 14 times in certain circumstances. If GPUs are that fast, why don't developers use them for every function in a game? Is it possible for GPUs to do anything other than graphics?
Answer
"I've read that F1 cars are faster than those we drive on the streets... why people don't use F1 cars then?" Well... The answer to this question is simple: F1 cars can't break or turn as fast as most cars do (the slowest car could beat an F1 in that case). The case of GPUs is very similar, they are good at following a straight line of processing, but they are not so good when it comes to choosing different processing paths.
A program executed on the GPU makes sense when it must be executed many times in parallel, for instance when you have to blend all the pixels from Texture A with the pixels from Texture B and put the result in Texture C. On a CPU this task would be processed as something like this:
for( int i = 0; i < nPixelCount; i++ )
    TexC[i] = TexA[i] + TexB[i];
But this is slow when you have a lot of pixels to process, so instead of using the code above, the GPU just uses this one:
TexC[i] = TexA[i] + TexB[i];
and then it populates all the cores with this program (essentially copying the program to each core), assigning a different value of i to each one. This is where the GPU's magic comes in: it makes all the cores execute the program at the same time, performing many operations far faster than the linear CPU program could.
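To make that concrete, here is a minimal sketch of what such a per-pixel program can look like as a CUDA kernel. The names (blendKernel, texA, texB, texC, nPixelCount) are hypothetical and only mirror the example above; the point is that every thread runs the same tiny body with its own value of i.

__global__ void blendKernel( const float* texA, const float* texB,
                             float* texC, int nPixelCount )
{
    // Each thread computes its own index i from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < nPixelCount )
        texC[i] = texA[i] + texB[i];
}

// Host side: launch enough threads so that every pixel gets one, e.g.
// blendKernel<<< (nPixelCount + 255) / 256, 256 >>>( texA, texB, texC, nPixelCount );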
This way of working is fine when you have to process a very large number of small inputs in the same way, but it is really bad when the program has conditional branching. Let's first see what the CPU does when it hits a condition check (a sketch follows the list below):
- 1: Execute the program until the first logical operation
- 2: Evaluate
- 3: Continue executing from the memory address that results from the comparison (as with a JNZ asm instruction)
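As a hypothetical C sketch (names made up for illustration), the comparison compiles down to a compare followed by a conditional jump, and the CPU simply resumes at whichever address the jump selects:

int process( int value )
{
    if( value != 0 )       // steps 1-2: execute up to the comparison and evaluate it
        return value * 2;  // step 3: continue from one address...
    return -1;             // ...or from the other, depending on the result
}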
For the CPU this is about as fast as setting an index, but for the GPU to do the same is a lot more complicated. Because the GPU's power comes from executing the same instruction on many cores at the same time (they are SIMD cores), the cores must stay synchronized to take advantage of the chip architecture. Preparing the GPU to deal with branches implies more or less the following (see the sketch after this list):
- 1: Make a version of the program that follows only branch A, and populate all the cores with this code.
- 2: Execute the program until the first logical operation
- 3: Evaluate all elements
- 4: Continue processing all elements that follow branch A, and enqueue all the ones that chose path B (for which there is no program in the core!). All the cores that chose path B are now idle; the worst case is a single core executing while every other core just waits.
- 5: Once all As are finished processing, activate the branch B version of the program (by copying it from the memory buffers to some small core memory).
- 6: Execute branch B.
- 7: If required, blend/merge both results.
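Here is a hypothetical CUDA sketch of the kind of branch that triggers the procedure above. Threads that disagree on the condition cannot run both paths at once, so one path runs while the threads on the other path sit idle, and then they swap:

__global__ void divergentKernel( const float* in, float* out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= n )
        return;

    if( in[i] > 0.5f )
        out[i] = in[i] * 2.0f;  // path A: only the threads that took it are active
    else
        out[i] = 0.0f;          // path B: runs afterwards, while the path-A threads wait
}

On current hardware this is handled per warp with an execution mask rather than by reloading a whole program, but the net effect is the one described above: threads on the untaken path do no useful work.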
This method may vary based on a lot of things (i.e. some very small branches can run without the need for this distinction), but now you can already see why branching is an issue. GPU caches are very small, so you can't simply execute a program from VRAM in a linear way; the GPU has to copy small blocks of instructions to the cores to be executed, and if there are enough branches your GPU will spend more time stalled than executing code. That makes no sense for a program that only follows one branch, as most programs do, even if they run in multiple threads. Compared to the F1 example, this would be like having to open braking parachutes at every corner, then get out of the car and pack them back in until the next corner where you want to turn again or hit a red light (most likely the next corner).
Then of course there is the fact that other architectures are already very good at logical operations: far cheaper, more reliable, standardized, better known, and power-efficient. Newer video cards are hardly compatible with older ones without software emulation; they use different asm instructions even when they come from the same manufacturer. And for the time being most computer applications do not require this type of parallel architecture, and even if they do need it, they can use it through standard APIs such as OpenCL (as mentioned by eBusiness) or through the graphics APIs. Perhaps in a few decades we will have GPUs that can replace CPUs, but I don't think it will happen any time soon.
I recommend the AMD APP documentation, which explains a lot about their GPU architecture; I also read about NVIDIA's in the CUDA manuals, which helped me a lot in understanding this. I still don't understand some things and I may be mistaken; someone who knows more can probably confirm or correct my statements, which would be great for us all.