directx - Information about rendering, batches, the graphics card, performance, etc. + XNA?


I know the title is a bit vague and it's hard to describe exactly what I'm looking for, but here goes.


When it comes to CPU-side work, performance is mostly easy to estimate and straightforward, but when it comes to the GPU, I'm clueless due to my lack of technical background. I'm using XNA, so it would be nice if the theory could be related to that.


So what I actually want to know is: what happens when and where (CPU/GPU) when you perform specific draw actions? What is a batch? What influence do effects, projections, etc. have? Is data persisted on the graphics card, or is it transferred over every step? When people talk about bandwidth, do they mean the graphics card's internal bandwidth, or the pipeline from CPU to GPU?
Note: I'm not actually looking for information on how the drawing process itself happens - that's the GPU's business. I'm interested in all the overhead that precedes it.


I'd like to understand what's going on when I do action X, so that I can adapt my architecture and practices to it.


Any articles (ideally with code examples), information, links or tutorials that give more insight into how to write better games are very much appreciated. Thanks :)



Answer




I like to think of performance in terms of "limits". It's a handy way to conceptualise a fairly complicated, interconnected system. When you have a performance problem, you ask the question: "What limits am I hitting?" (Or: "Am I CPU/GPU bound?")


You can break it down into multiple levels. At the highest level you have the CPU and the GPU. You might be CPU bound (GPU sitting idle waiting for CPU), or GPU bound (CPU is waiting on GPU). Here is a good blog post on the topic.


You can break it down further. On the CPU side, you might be using all your cycles on data already in the CPU cache. Or you might be memory limited, leaving the CPU idle waiting for data to come in from main memory (so optimise your data layout). You could break it down further still.
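
For example, here's a minimal sketch (the Particle type is just something I made up) of why data layout matters:

```csharp
using Microsoft.Xna.Framework;

// Hypothetical particle system: a struct array keeps the data in one
// contiguous block, so a linear update sweeps through the cache nicely.
struct Particle                      // value type: stored inline in the array
{
    public Vector2 Position;
    public Vector2 Velocity;
}

class ParticleSystem
{
    Particle[] particles = new Particle[10000];

    public void Update(float dt)
    {
        for (int i = 0; i < particles.Length; i++)
            particles[i].Position += particles[i].Velocity * dt;
    }
}
// If Particle were a class instead, the array would hold references to
// objects scattered across the heap, and the same loop would spend its
// time chasing pointers and waiting on main memory.
```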


(While I'm giving a broad overview of XNA performance, I'll point out that allocating a reference type (a class, not a struct), while normally cheap, can trigger the garbage collector, which will burn a lot of cycles - especially on Xbox 360. See here for details.)
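
As a concrete illustration (the Enemy type and allEnemies list here are hypothetical, and this sits inside your Game class), the usual fix is to allocate once and reuse:

```csharp
// Allocate collections once, at load time, and reuse them - allocating a
// new List every frame is exactly the kind of garbage that eventually
// triggers a collection (a full, frame-spiking one on Xbox 360).
List<Enemy> visibleEnemies = new List<Enemy>(256);

protected override void Update(GameTime gameTime)
{
    // Bad:  var visible = new List<Enemy>();   // new garbage every frame
    visibleEnemies.Clear();                     // Good: reuse the same storage
    foreach (Enemy e in allEnemies)
    {
        if (e.IsOnScreen)
            visibleEnemies.Add(e);
    }
}
```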


On the GPU side, I'll start out by pointing you to this excellent blog post which has lots of details. If you want an insane level of detail on the pipeline, read this series of blog posts. (Here's a simpler one).


To put it simply, some of the big ones are: the "fill limit" (how many pixels you can write to the backbuffer - often a question of how much overdraw you can afford), the "shader limit" (how complicated your shaders can be and how much data you can push through them), and the "texture-fetch/texture-bandwidth limit" (how much texture data you can access).


And now we come to the big one - the one you're really asking about - where the CPU and GPU have to interact (via the various APIs and drivers). Loosely speaking, there are the "batch limit" and the "bandwidth limit". (Note that part one of the series I mentioned earlier goes into extensive detail.)


But basically, a batch (as you already know) happens whenever you call one of the GraphicsDevice.Draw* functions (or when part of XNA, like SpriteBatch, does it for you). As you've no doubt already read, you only get a few thousand of these per frame. This is a CPU limit, so it competes with your other CPU usage. It's basically the driver packaging up everything about what you have told it to draw and sending it off to the GPU.
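
To make that concrete, here's a rough sketch of where batches come from - assuming effect, vertexBuffer, indexBuffer, spriteBatch, texture and the counts were all set up at load time:

```csharp
// Each DrawIndexedPrimitives call below is one batch - one trip through
// the driver, paid for in CPU time.
GraphicsDevice.SetVertexBuffer(vertexBuffer);
GraphicsDevice.Indices = indexBuffer;

foreach (EffectPass pass in effect.CurrentTechnique.Passes)
{
    pass.Apply();
    GraphicsDevice.DrawIndexedPrimitives(
        PrimitiveType.TriangleList,
        0,               // baseVertex
        0,               // minVertexIndex
        vertexCount,     // numVertices
        0,               // startIndex
        primitiveCount); // number of triangles
}

// SpriteBatch batches on your behalf: sprites drawn between Begin and End
// that share a texture get merged into as few draw calls as it can manage.
spriteBatch.Begin();
spriteBatch.Draw(texture, Vector2.Zero, Color.White);
spriteBatch.Draw(texture, new Vector2(64, 0), Color.White); // same texture, same batch
spriteBatch.End();
```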


And then there is the bandwidth to the GPU: how much raw data you can transfer to it. This includes all the state information that goes with batches - everything from setting render state and shader constants/parameters (which includes things like the world/view/projection matrices) to the vertices themselves when using the DrawUser* functions. It also includes any calls to SetData and GetData on textures, vertex buffers, and so on.
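
Roughly, that per-batch traffic looks like this in code (a sketch using BasicEffect; the matrices, modelPosition, vertices array and triangleCount are assumed to exist):

```csharp
// Shader constants travel with the batch...
basicEffect.World = Matrix.CreateTranslation(modelPosition);
basicEffect.View = viewMatrix;
basicEffect.Projection = projectionMatrix;

foreach (EffectPass pass in basicEffect.CurrentTechnique.Passes)
{
    pass.Apply();
    // ...and DrawUserPrimitives copies the whole vertex array across the
    // bus every single call - convenient, but it costs bandwidth.
    GraphicsDevice.DrawUserPrimitives(
        PrimitiveType.TriangleList, vertices, 0, triangleCount);
}
```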


At this point I should say that anything that you can call SetData on (textures, vertex and index buffers, etc), as well as Effects - remains in GPU memory. It is not constantly re-sent to the GPU. A draw command that references that data is simply sent with a pointer to that data.
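
In code, that pattern is simply (a sketch; assume the vertices array was filled in beforehand):

```csharp
// Upload once, at load time...
protected override void LoadContent()
{
    vertexBuffer = new VertexBuffer(GraphicsDevice, typeof(VertexPositionColor),
                                    vertices.Length, BufferUsage.WriteOnly);
    vertexBuffer.SetData(vertices);          // one transfer into GPU memory
}

// ...then every frame just references the data already on the GPU:
protected override void Draw(GameTime gameTime)
{
    GraphicsDevice.SetVertexBuffer(vertexBuffer);  // no re-upload happens here
    // ... pass.Apply() and a Draw* call as usual ...
}
```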



(Also: you can only send draw commands from the main thread, but you can SetData on any thread.)
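
For instance (a sketch - DecodePixelsSomehow is a made-up stand-in for whatever produces your data):

```csharp
// SetData is safe from a worker thread, which makes it handy for
// streaming content in the background...
ThreadPool.QueueUserWorkItem(_ =>
{
    Color[] pixels = DecodePixelsSomehow();   // hypothetical decode step
    texture.SetData(pixels);                  // OK off the main thread
});

// ...but every GraphicsDevice.Draw* call must stay on the main thread.
```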


XNA complicates things somewhat with its render state classes (BlendState, DepthStencilState, etc). This state data is sent per draw call (in each batch). I am not 100% sure, but I am under the impression that it is sent lazily (it only sends state that changes). Either way, state changes are cheap to the point of free, relative to the cost of a batch.
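
In practice, that means you can switch freely between the built-in, immutable state objects (a sketch):

```csharp
// State changes are cheap next to the batch itself - but prefer the
// built-in state objects over new'ing up BlendState instances each frame,
// which only manufactures garbage.
GraphicsDevice.BlendState = BlendState.AlphaBlend;
GraphicsDevice.DepthStencilState = DepthStencilState.None;  // e.g. for a 2D overlay
GraphicsDevice.SamplerStates[0] = SamplerState.LinearClamp;

// ... draw the transparent stuff ...

GraphicsDevice.BlendState = BlendState.Opaque;              // back to 3D defaults
GraphicsDevice.DepthStencilState = DepthStencilState.Default;
```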


Finally, the last thing to mention is the internal GPU pipeline. You don't want to force it to flush by writing to data that it still needs to read, or reading data that it still needs to write. A pipeline flush means it waits for operations to finish, so that everything is in a consistent state when data is accessed.


There are two particular cases to watch out for. The first is calling GetData on anything dynamic - particularly on a RenderTarget2D that the GPU may be writing to. This is extremely bad for performance - don't do it.
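
The anti-pattern looks innocent enough (a sketch; assume renderTarget is one you have just drawn into):

```csharp
// Reading back a render target the GPU may still be writing forces a full
// pipeline flush: the CPU sits and waits for the GPU to finish everything.
Color[] pixels = new Color[renderTarget.Width * renderTarget.Height];
renderTarget.GetData(pixels);   // stalls the pipeline - never do this per frame
```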


The other case is calling SetData on vertex/index buffers. If you need to do this often, use a DynamicVertexBuffer (also DynamicIndexBuffer). These allow the GPU to know that they will be changing often, and to do some buffering magic internally to avoid the pipeline flush.


(Also note that dynamic buffers are faster than the DrawUser* methods - but they have to be pre-allocated at the maximum required size.)
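
Put together, a dynamic buffer looks something like this (a sketch; MaxVertices is whatever upper bound your data needs):

```csharp
// Created once, pre-allocated at the maximum size you'll ever need:
dynamicVB = new DynamicVertexBuffer(GraphicsDevice, typeof(VertexPositionColor),
                                    MaxVertices, BufferUsage.WriteOnly);

// Refilled each frame. The SetDataOptions hint is what lets the GPU avoid
// the flush: Discard says "give me a fresh region, throw the old one away",
// while NoOverwrite promises you won't touch data the GPU is still reading.
dynamicVB.SetData(vertices, 0, vertexCount, SetDataOptions.Discard);
GraphicsDevice.SetVertexBuffer(dynamicVB);
```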


... And that's pretty much everything I know about XNA performance :)

