In my game, every bit of geometry is a textured quad billboard in 3D. I have many thousands of these things on screen at once, sometimes with overdraw. As I understand it, there are a few ways I can go about drawing these:
1) Have a single VBO per type of quad (i.e., per unique set of texture coordinates). Batch the VBO and texture, but make multiple draw calls. It looks something like this:
foreach (Quad quad in quadTypes)
{
    // The VBO just contains 4-6 vertices.
    GraphicsDevice.SetVertexBuffer(quad._vertexBuffer);
    Shader.SetTexture(quad._texture);
    // Draw each instance using multiple draw calls
    foreach (QuadInstance instance in quad._instances)
    {
        Shader.SetWorldMatrix(instance._worldMatrix);
        foreach (ShaderPass pass in Shader._passes)
        {
            pass.Apply();
            GraphicsDevice.DrawPrimitive(TRIANGLES);
        }
    }
}
2) Make a single draw call by somehow combining all the triangles of all the instances of a quad type. Looks something like this:
foreach (Quad quad in quadTypes)
{
    Shader.SetTexture(quad._texture);
    // Probably can cache a vertex buffer for when quads are created/destroyed instead.
    VertexBuffer vertexBuffer = new VertexBuffer();
    foreach (QuadInstance instance in quad._instances)
    {
        // Add all of the (world transformed) triangles of this instance to
        // the vertex buffer
        vertexBuffer.AddAll(instance.CreateTriangles());
    }
    // Draw all the quads with a single draw call and vertex buffer.
    GraphicsDevice.SetVertexBuffer(vertexBuffer);
    foreach (ShaderPass pass in Shader._passes)
    {
        pass.Apply();
        GraphicsDevice.DrawPrimitive(TRIANGLES);
    }
}
3) True hardware instancing. This is currently how I do it in my (XNA HiDef profile) game, but I can't figure out a way to get it to work in MonoGame. With true hardware instancing, each instance is represented as a vertex, and the GPU creates all the necessary duplicate vertices in a single draw call.
In general, which one of these is most desirable? 1) makes many draw calls, but saves the most memory; 2) has a single draw call, but requires creating a vertex buffer for all of the instances, which seems really inefficient; and 3) has the best of both worlds but is not widely supported. Is there a fourth way that I'm missing?
Answer
Your question title explicitly states "thousand of quads". That is really not a lot of geometry. Unless you expect millions, or are targeting mobile, I suggest going with simple batching to reduce draw calls. It is the easiest to implement and should do the job admirably.
If you really do need more geometry, read on...
Since MonoGame genuinely does not support hardware instancing yet, I can only suggest extending it using the underlying OpenTK, or looking through the MonoGame issues to see if someone else has already made headway on hardware instancing and is likely to provide source that you can adapt.
I can say with confidence that OpenGL 3 hardware instancing is not much harder to implement than rendering single geometries; however, integrating it into MonoGame properly would probably be a lot harder. If you do implement hardware instancing in OpenTK, allow me to save you some time by strongly suggesting that you avoid the approach that indexes uniform arrays with gl_InstanceID for instance data: it is outdated, inefficient, and limited. The best way is to use a VAO plus a vertex buffer containing all your per-instance data (e.g. model matrices), with uniforms for whatever is shared by all instances (e.g. the view-projection matrix and textures).
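As a rough illustration of the VAO plus per-instance vertex buffer setup, here is a minimal sketch using OpenTK's OpenGL bindings. It assumes an OpenGL 3.3 or later context is already current, and a vertex shader that declares `layout(location = 0) in vec3 position` and `layout(location = 1) in mat4 world`. The class and method names are hypothetical, and the instance data is passed as a flat `float[]` of 16 floats per quad to avoid depending on any particular matrix type.

```csharp
using OpenTK.Graphics.OpenGL4;

// Hypothetical helper, not part of MonoGame or OpenTK.
public static class InstancedQuads
{
    // quadCorners: 4 corners x 3 floats, ordered for a triangle strip.
    // instanceMatrices: 16 floats per instance, one 4x4 world matrix each.
    public static int CreateVao(float[] quadCorners, float[] instanceMatrices)
    {
        int vao = GL.GenVertexArray();
        GL.BindVertexArray(vao);

        // Shared geometry, read once per vertex.
        int cornerVbo = GL.GenBuffer();
        GL.BindBuffer(BufferTarget.ArrayBuffer, cornerVbo);
        GL.BufferData(BufferTarget.ArrayBuffer, quadCorners.Length * sizeof(float),
                      quadCorners, BufferUsageHint.StaticDraw);
        GL.EnableVertexAttribArray(0);
        GL.VertexAttribPointer(0, 3, VertexAttribPointerType.Float, false,
                               3 * sizeof(float), 0);

        // Per-instance data, read once per instance.
        int instanceVbo = GL.GenBuffer();
        GL.BindBuffer(BufferTarget.ArrayBuffer, instanceVbo);
        GL.BufferData(BufferTarget.ArrayBuffer, instanceMatrices.Length * sizeof(float),
                      instanceMatrices, BufferUsageHint.DynamicDraw);

        // A mat4 attribute occupies four consecutive locations, one vec4 each.
        for (int column = 0; column < 4; column++)
        {
            int location = 1 + column;
            GL.EnableVertexAttribArray(location);
            GL.VertexAttribPointer(location, 4, VertexAttribPointerType.Float, false,
                                   16 * sizeof(float), column * 4 * sizeof(float));
            GL.VertexAttribDivisor(location, 1); // advance once per instance
        }

        GL.BindVertexArray(0);
        return vao;
    }

    // Draws every instance of the quad with a single call.
    public static void Draw(int vao, int instanceCount)
    {
        GL.BindVertexArray(vao);
        GL.DrawArraysInstanced(PrimitiveType.TriangleStrip, 0, 4, instanceCount);
        GL.BindVertexArray(0);
    }
}
```

The sketch does not keep the buffer handles around, so a real implementation would also need to store them in order to update the instance data and to delete the buffers later.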
Your approaches
It depends on cost. If calculating all transformations of all vertices on the CPU costs less than issuing all the draw calls, it's definitely worth doing simplistic batching, i.e. method 2. Otherwise you may be better off sticking with method 1, because you'll need to get per-instance matrices across without the cost of applying them all on the CPU. However, there is a middle ground: see the section below for more information on getting these across individually within a single draw call.
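To give a feel for the CPU work that method 2 involves, here is a minimal sketch that transforms the four corners of every instance into one shared vertex array and builds a matching index array. It uses `System.Numerics` rather than the XNA types so that it stands alone; `QuadBatcher`, `Build`, and the unit-quad layout are assumptions made for illustration, and texture coordinates are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical helper, not part of XNA or MonoGame.
public static class QuadBatcher
{
    // 16-bit indices can address 65536 vertices, i.e. 16384 quads.
    public const int MaxQuadsPerBatch = 16384;

    // Corners of a unit quad centered on the origin, in local space.
    private static readonly Vector3[] Corners =
    {
        new Vector3(-0.5f, -0.5f, 0f),
        new Vector3(-0.5f,  0.5f, 0f),
        new Vector3( 0.5f,  0.5f, 0f),
        new Vector3( 0.5f, -0.5f, 0f),
    };

    // Produces 4 world-space positions and 6 indices per instance.
    public static void Build(IReadOnlyList<Matrix4x4> worldMatrices,
                             out Vector3[] positions, out ushort[] indices)
    {
        int count = worldMatrices.Count;
        if (count > MaxQuadsPerBatch)
            throw new ArgumentException("Too many quads for 16-bit indices.");

        positions = new Vector3[count * 4];
        indices = new ushort[count * 6];

        for (int i = 0; i < count; i++)
        {
            int v = i * 4; // first vertex of this quad
            int n = i * 6; // first index of this quad

            // This per-vertex transform is the cost being traded for draw calls.
            for (int c = 0; c < 4; c++)
                positions[v + c] = Vector3.Transform(Corners[c], worldMatrices[i]);

            // Two triangles: (0, 1, 2) and (0, 2, 3).
            indices[n + 0] = (ushort)(v + 0);
            indices[n + 1] = (ushort)(v + 1);
            indices[n + 2] = (ushort)(v + 2);
            indices[n + 3] = (ushort)(v + 0);
            indices[n + 4] = (ushort)(v + 2);
            indices[n + 5] = (ushort)(v + 3);
        }
    }
}
```

The index array only changes when quads are created or destroyed, so it can be cached; the positions would need rebuilding for any instance whose world matrix changes.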
Best practices
Some sources outlining best practices (and their pros and cons) for where hardware instancing is not available:
Putting your per-instance matrices etc. in a texture can be very fast if you can wrangle it. You can also use a uniform array, but I believe there is a fairly low limit on the number of individual uniforms you can have in a given shader program. Either way, you will need to index into this data yourself, given the lack of glVertexAttribDivisor and gl_InstanceID. This requires repeating the appropriate instance index for every single vertex of every instance, which is a bit costly but can't be avoided, and it gets the job of reducing draw calls done.
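The CPU side of that idea can be sketched as follows. This is only an illustration under assumptions: `InstanceData` and its methods are hypothetical names, the index is stored as a float on the assumption that the target shading language may not accept integer vertex attributes, and the packing assumes a floating-point RGBA texture format is available so that each matrix row fits in one texel.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical helper, not part of XNA, MonoGame, or OpenTK.
public static class InstanceData
{
    // One entry per vertex: the index of the instance that vertex belongs to.
    // The vertex shader would use it to look up that instance's matrix.
    public static float[] BuildInstanceIndices(int instanceCount, int verticesPerInstance = 4)
    {
        var result = new float[instanceCount * verticesPerInstance];
        for (int i = 0; i < instanceCount; i++)
            for (int v = 0; v < verticesPerInstance; v++)
                result[i * verticesPerInstance + v] = i;
        return result;
    }

    // Packs each matrix into 16 floats, i.e. four RGBA texels (one per row),
    // ready to upload as texture data or as a uniform array.
    public static float[] PackMatrices(IReadOnlyList<Matrix4x4> worlds)
    {
        var texels = new float[worlds.Count * 16];
        for (int i = 0; i < worlds.Count; i++)
        {
            Matrix4x4 m = worlds[i];
            int o = i * 16;
            texels[o + 0] = m.M11; texels[o + 1] = m.M12; texels[o + 2] = m.M13; texels[o + 3] = m.M14;
            texels[o + 4] = m.M21; texels[o + 5] = m.M22; texels[o + 6] = m.M23; texels[o + 7] = m.M24;
            texels[o + 8] = m.M31; texels[o + 9] = m.M32; texels[o + 10] = m.M33; texels[o + 11] = m.M34;
            texels[o + 12] = m.M41; texels[o + 13] = m.M42; texels[o + 14] = m.M43; texels[o + 15] = m.M44;
        }
        return texels;
    }
}
```

The index array only needs rebuilding when the number of instances changes, whereas the packed matrices would be re-uploaded whenever instances move.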