I have been developing a game engine using OpenGL 3 and C++ (and GLFW for window management). I have come a long way and have most things done, except sound entities and optimizations. The engine uses deferred shading, and since deferred shading is itself a demanding process for an average GPU, I want to optimize the rendering process as much as possible.
The current system consists of a Scene containing a Renderer and the current World, and the World holds entities and lighting entities as separate std::vectors.
So basically, every time the Scene's ->render() gets called, it calls the Renderer, passing the World as a parameter; the Renderer gets the entity iterators from the World, draws the entities to the FBO, and then goes through the lighting entities for the second pass. And I think this is not enough.
My current algorithm iterates through everything, even entities that are not in screen space. I am thinking of a way to optimize the current rendering algorithm so that it calls the API functions only for the visible objects, so what are the common techniques for optimizing such a renderer?
Answer
Deferred shading is only a technique to "defer" the actual shading operation to a later stage. This can be great for reducing the number of passes needed, for example to render 10 lights that would otherwise need 10 passes. My point is that regardless of the rendering technique you are using, there are certain rendering optimizations that reduce the number of objects (vertices, normals, etc.) that your rendering pipeline needs to process.
There is no de facto standard for rendering optimizations, but rather a number of techniques that can be used interchangeably or together to achieve certain performance characteristics. Whether each technique pays off highly depends on the nature of the scene being rendered.
Deferred rendering tries to solve the problem of an increasing number of lights, which in forward rendering can make the number of passes explode.
Those techniques do not directly optimize the deferred shading part, but according to your description, the deferred shading part is NOT your problem. Your problem is that you are submitting the whole scene to the rendering pipeline. So your engine has to process (for example) all 100 million vertices in your scene just to be able to submit the result to the G-buffer, while most of those 100 million vertices could be trivially culled away and never submitted to the vertex and fragment stages at all.
In the case of a forward renderer, the N vertices will be processed by the vertex stage a total of vertex count * light count times, and by the fragment stage a total of fragment count * light count times. Deferred shading effectively reduces this to only vertex count for the vertex stage and fragment count for the fragment stage, before resolving the actual shading. But N can still be too much to process, especially when most of it can be trivially culled.
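To put numbers on it: a scene of 1 million vertices lit by 8 lights would cost 8 million vertex-stage invocations per frame in a multi-pass forward renderer, but only 1 million in a deferred one.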
This makes culling even more effective in the case of forward rendering/multiple passes. But keep in mind that most engines use a dual rendering approach anyway, because deferred shading alone cannot resolve transparent objects, and that makes using those optimizations a must. I don't know of any commercial engine that doesn't do all of them.
Frustum Culling
Only objects that are fully or partially inside the view frustum ever need to be submitted to the rendering pipeline. This is the basic concept of frustum culling. Unfortunately, checking whether a mesh is in or out of the view frustum can be an expensive operation, so engine designers instead use an approximate bounding volume such as an axis-aligned bounding box (AABB) or a bounding sphere. Even though this is not as accurate as testing the actual mesh, the accuracy difference isn't worth the cost of checking against the actual mesh.
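As an illustration, here is a minimal sketch of a bounding-sphere test against the six frustum planes. The Plane/Frustum types and the inward-facing plane convention are my own assumptions, not something from your engine:

```cpp
// Minimal sketch: sphere-vs-frustum test. Assumes the six planes were
// extracted from the view-projection matrix, normalized, and stored with
// their normals pointing inward (into the frustum).
struct Plane { float a, b, c, d; };   // a*x + b*y + c*z + d = 0

struct Frustum {
    Plane planes[6]; // left, right, bottom, top, near, far

    bool intersectsSphere(const float center[3], float radius) const {
        for (const Plane& p : planes) {
            float dist = p.a * center[0] + p.b * center[1]
                       + p.c * center[2] + p.d;
            if (dist < -radius)
                return false; // completely outside this plane -> culled
        }
        return true; // inside or intersecting the frustum
    }
};
```

Before submitting an entity, you would test its bounding sphere and simply skip the draw call when the test fails.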
Even with bounding volumes, you don't really need to check each one individually; alternatively, you can construct a bounding volume hierarchy to cull whole groups of objects earlier. Whether this is worth it is highly dependent on the complexity of the scene.
This is a good and simple technique for a smaller engine, and it is used in almost every engine I have ever worked with. I recommend using a plain bounding volume/frustum check without hierarchies if your engine does not need to render very complex scenes.
Back face culling
This is a must: why draw faces that won't be visible anyway? Rendering APIs provide an interface to turn back-face culling on and off. Unless you have a strong reason not to turn it on, such as some CAD applications that need to draw back faces in certain circumstances, this is a must-do.
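In OpenGL this is a few calls during renderer setup (the values shown are the API defaults, listed here just to make the convention explicit):

```cpp
// Enable back-face culling once during renderer setup.
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);   // cull back faces (the default)
glFrontFace(GL_CCW);   // counter-clockwise winding is front-facing (the default)
```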
Occlusion Culling
Using the Z-buffer you can resolve visibility determination, but the problem is that the Z-buffer isn't always great in terms of performance: since the Z test can only be resolved at a later stage of the pipeline, occluded objects still have to go through vertex processing and rasterization, and their fragments may even be written to the Z-buffer and the color buffer, before a closer object overwrites them.
Occlusion culling solves this by doing early tests to cull occluded objects that are inside the view frustum. One practical implementation of occlusion culling uses point-based queries, checking whether certain objects are visible from a specific point of view. This can also be used to cull lights that do not contribute to the final image, which is especially useful in a deferred renderer.
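For illustration, here is a rough sketch of a hardware occlusion query in OpenGL: draw cheap proxy geometry (a bounding box) with color and depth writes disabled, and ask the GPU how many samples passed the depth test. drawBoundingBox() and drawEntity() are hypothetical helpers, not real API calls; also note that reading the result back immediately stalls the pipeline, so real engines typically reuse last frame's result instead:

```cpp
GLuint query;
glGenQueries(1, &query);

glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE); // no color writes
glDepthMask(GL_FALSE);                               // no depth writes
glBeginQuery(GL_SAMPLES_PASSED, query);
drawBoundingBox(entity);                             // cheap proxy geometry
glEndQuery(GL_SAMPLES_PASSED);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

GLuint samplesPassed = 0;
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed); // blocks!
if (samplesPassed > 0)
    drawEntity(entity); // part of the bounding box is visible -> draw the mesh
glDeleteQueries(1, &query);
```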
A great real-world example of this technique is GTA5, where the skyscrapers strategically placed at the center of the city are not only decoration: they also work as occluders, effectively hiding the rest of the city and preventing it from being rasterized.
Level of Detail
Level of detail is a widely used technique: the idea is to use a simpler version of a mesh when the mesh contributes less to the scene. There are two common implementations. One simply switches the mesh for a simpler one when it no longer contributes much; the mesh is selected based on some factor like the distance and the number of pixels (area on the screen) the mesh occupies. The other version dynamically tessellates the mesh; this is widely used in terrain rendering.
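As a sketch, the first variant can be as simple as walking a sorted list of detail levels. LODLevel, Mesh, and the distance thresholds here are placeholders, not types from your engine:

```cpp
#include <vector>

struct Mesh; // the engine's mesh type (placeholder)

struct LODLevel {
    Mesh* mesh;
    float maxDistance; // use this mesh while the camera is closer than this
};

// levels are assumed to be sorted from most to least detailed.
Mesh* selectLOD(const std::vector<LODLevel>& levels, float distanceToCamera) {
    for (const LODLevel& level : levels)
        if (distanceToCamera <= level.maxDistance)
            return level.mesh;
    return levels.back().mesh; // farther than all thresholds: coarsest mesh
}
```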
What if all of these didn't work?
Well, that's a good question.
The first thing you need to do is profile your application using a graphics profiler and determine where the bottleneck is. Keep in mind that the bottleneck may move as the content being rendered changes. Bottlenecks might also be in the code running on the CPU, so you need to measure that too.
After that you need to optimize the bottleneck. Keep in mind that there is no single right answer for this, and it will differ from one piece of hardware to another.
Some common GPU optimization tricks:
- Avoid branching in shaders.
- Try different vertex structures, for example {VNT} interleaved in the same array, or {V}, {N}, {T} in separate arrays (see the sketch after this list).
- Draw the scene front to back.
- Turn off the Z-buffer at some points, for example when drawing an image that doesn't need Z testing.
- Use compressed textures.
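For the vertex-structure point above, here is a minimal sketch of the interleaved {VNT} layout in OpenGL; the Vertex struct is an assumption, and vbo, vertices, and vertexCount are assumed to already exist in your engine:

```cpp
#include <cstddef> // offsetof

// One VBO with position, normal and texcoord packed per vertex.
struct Vertex {
    float position[3];
    float normal[3];
    float texcoord[2];
};

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * sizeof(Vertex), vertices,
             GL_STATIC_DRAW);

glEnableVertexAttribArray(0); // position
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, position));
glEnableVertexAttribArray(1); // normal
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, normal));
glEnableVertexAttribArray(2); // texcoord
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, texcoord));
```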
Some common CPU optimization tricks:
- Use inline functions for small functions.
- Use SIMD (Single instruction multiple data) when possible.
- Avoid cache unfriendly memory jumps.
- Use VBOs with the "right" amount of data per buffer (this depends on your hardware), but usually fewer draw calls are better.
But what if my bottleneck was in the deferred shading?
In this case, since deferred shading is mostly concerned with lights, the most obvious part to optimize is the actual shading calculations. Some points to keep an eye on:
- Render only the lights that actually affect the final image; in other words, cull the lights that don't contribute. This can be implemented effectively using the occlusion culling I mentioned before (see the sketch after this list).
- Does this light need the specular or some other components? Maybe not.
- Does this light cast shadows? Some lights don't need to cast shadows.
- Can this light's contribution be pre-computed? If it is not moving, some aspects can probably be pre-computed.
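As a sketch of the first point, lights can be culled with the same bounding-sphere test from the frustum-culling section, using each light's radius of influence; Light, world.lightingEntities, and shadeLight() are placeholders for your engine's own types:

```cpp
// Second pass: shade only lights whose sphere of influence can reach
// something inside the view frustum.
for (const Light& light : world.lightingEntities) {
    float center[3] = { light.x, light.y, light.z };
    if (!frustum.intersectsSphere(center, light.radius))
        continue;        // this light cannot touch any visible pixel
    shadeLight(light);   // run the deferred shading pass for this light only
}
```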