Per-triangle occlusion on the GPU in 1ms with Unity

This is a post about 3D rendering techniques; it requires intermediate knowledge of real-time 3D engines.

Purpose & Context

There are many different reasons to build an occlusion system, which is why there are a lot of very different occlusion algorithms and solutions out there. This is by no means a superior algorithm to Umbra or whatever Unreal is using, but it can be useful in any context similar to mine. Also, a warning: I’m no expert in this area!

Here is my context:

  • Figure out which mesh triangles are visible from a given “eye” point (origin)

  • My origin is not the player’s eye or the main camera (it will be a heat source), but it’s a point with a position / direction / angle spread / max distance, so you could say my origin is a frustum

  • I have a small 3D scene (I don’t expect to ever exceed 100k triangles) and it’s bounded to a limited space (it’s not an open-world game, or a game with levels or large areas)

  • It is almost entirely procedurally generated, so baking (Umbra) was not a possible solution

  • My triangle density is relatively homogeneous (I don’t have giant objects with low density vs tiny objects with high density of polygons)

  • The geometry changes a lot, so it has to be dynamic, potentially be refreshed every few frames, but not necessarily every frame

  • It’s completely fine if the solution is not perfect; it just needs to be about 95% accurate. If I miss some visible / non-visible triangles it’s totally ok

  • It’s ok if the solution requires shader target 5.0; this is not for an ultra casual game on mobile

You can see that the context is quite specific: I’m fine with mistakes, I don’t need the visibility to always be up to date, and I have a very particular 3D scene setting. And it’s not even used for the main camera culling.

On the left, the game view with the render texture: this is what the occlusion camera sees, and the shades of red and green encode the indexes of the visible triangles. On the right, the scene view with the occlusion camera selected: we see its frustum, and each visible triangle is marked by a colored gizmo.

Unity Project & C# Code

You can preview & download the code & sample Unity scene from GitHub. I added a lot of comments so you should be able to understand the code on your own, but I suggest you read the sections below: they explain a few tricks and give a high-level view of the algorithm itself. More information on how to use the sample can be found on GitHub.

Unity Pipeline

I’m using the built-in render pipeline for this project, but if you’re on HDRP I would recommend looking into Arbitrary Output Variables (AOVs), which can be used to output the result of a custom pass or compute shader, exactly what we’re doing here!

Occlusion Algorithm

The core idea originally comes from Garrett Johnson, a NASA engineer who did some pretty interesting Unity rendering investigations in 2017; one of them used an occlusion algorithm that I improved a bit and customized for my needs. Here is how it works:

  • A Unity camera is set up at the origin; this will be our “occlusion camera”, and it’s disabled because we control its rendering via code

  • When we want to run the occlusion algorithm, we build a List<MeshFilter> of all meshes whose visibility must be checked and draw them into a render texture using our occlusion camera, Graphics.DrawProceduralNow and a special shader

  • The special shader doesn’t draw the meshes normally; we completely ignore their true material. Instead it transforms each triangle index into a color using smart bitwise operations and draws it into the texture. As a result the render texture becomes an image of (most of) the visible triangle indexes viewed from the origin

  • Then we have two compute shader kernels that very quickly parse the render texture and reconstruct a list of all triangle indexes that are visible, and that’s pretty much it (a minimal sketch of this flow follows the list)!
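To make these steps concrete, here is a minimal, hedged C# sketch of the flow. The field names, the _ViewProj property and the overall wiring are illustrative assumptions, not necessarily how the sample project is organized:

using System.Collections.Generic;
using UnityEngine;

public class OcclusionCheckSketch : MonoBehaviour
{
    public Camera occlusionCamera;          // disabled camera sitting at the origin
    public Material triangleIdMaterial;     // special shader: triangle index -> color
    public ComputeShader occlusionCompute;  // the two parsing kernels
    public RenderTexture idTexture;         // e.g. 1024x1024

    public void CheckVisibility(List<MeshFilter> meshes)
    {
        // 1. Render every candidate triangle as an index color, seen from the origin.
        Graphics.SetRenderTarget(idTexture);
        GL.Clear(true, true, Color.clear);
        Matrix4x4 viewProj = occlusionCamera.projectionMatrix * occlusionCamera.worldToCameraMatrix;
        triangleIdMaterial.SetMatrix("_ViewProj", viewProj); // assumed property name
        foreach (MeshFilter mf in meshes)
        {
            // In the real project the mesh data is bound to the material as buffers
            // (see Trick 1); here we only show the call sequence.
            triangleIdMaterial.SetPass(0);
            Graphics.DrawProceduralNow(MeshTopology.Triangles, mf.sharedMesh.triangles.Length);
        }

        // 2. Parse the texture with the two compute kernels (see Trick 3), then
        // 3. read the visible triangle indexes back asynchronously (see Trick 4).
    }
}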

What I like about this solution is that it leverages many technologies already built into Unity, and that it’s really, really fast. It’s also super easy to make it faster (or conversely more precise) by lowering (or increasing) the render texture resolution.

In the title I said it runs in 1ms, and it’s true, but as always it depends heavily on your settings: your 3D scene, the number of polygons, the render texture resolution and whether you can even accept requiring shader target 5.0.

In its current state the algorithm fits my needs so I don’t need to stress test it further, but there are many limits (polygon count, maximum object count, maximum polygon density, texture resolution…) that probably make this algorithm completely unusable for many other projects.

With a resolution of 1024x1024 we already get great per-triangle results. Look at the walls that are barely visible on the right: we still hit all of their triangles in the visibility check.

Performance

In my base test with 38k triangles the core part of the algorithm CheckVisibilityAsync() takes 0.14ms. This includes the procedural drawing and compute shader dispatch.

The call to DrawProceduralNow() takes 0.001ms and the compute shader dispatch takes 0.003ms; at this triangle count pretty much everything is instant.

However we need to wait 3 frames to receive the result of the compute shader from the GPU using AsyncGPUReadback. We’re at 850+ fps in the editor, so that’s about 1ms with the deep profiler running. Note that AsyncGPUReadback does not stall the main thread, the CPU or the GPU, so during this 1ms everything else keeps running smoothly.

After waiting 1ms / 3 frames for the GPU to give the result, we finish everything in an additional 0.07ms to read the data, save it into a native array and finally into a managed memory array.

Total: 0.14 + 0.07 = 0.21ms of processing, and about 1ms of waiting for the GPU

Trick 1: Procedural Drawing

The strategy to generate the render texture is to use Graphics.DrawProceduralNow to render just the triangles whose visibility we want to check. For this we use a shader that doesn’t calculate lighting but instead draws the triangle index of each triangle. Using DrawProceduralNow allows us to draw as many triangles as we want, and we can stream the work over many frames if we have too many of them (a rough sketch follows the figure).

DrawProceduralNow.jpg
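To give a rough idea of how the procedural draw can be fed, here is a hedged sketch where the vertex positions and triangle indices are uploaded as ComputeBuffers that the index shader reads through SV_VertexID. The buffer and shader property names are assumptions, not the sample project’s:

using UnityEngine;

public static class ProceduralDrawSketch
{
    public static void DrawMeshTriangleIds(Mesh mesh, Material triangleIdMaterial, Matrix4x4 localToWorld)
    {
        Vector3[] vertices = mesh.vertices;
        int[] indices = mesh.triangles;

        // In a real project these buffers would be created once and cached.
        var vertexBuffer = new ComputeBuffer(vertices.Length, sizeof(float) * 3);
        var indexBuffer  = new ComputeBuffer(indices.Length, sizeof(int));
        vertexBuffer.SetData(vertices);
        indexBuffer.SetData(indices);

        triangleIdMaterial.SetBuffer("_Vertices", vertexBuffer);     // assumed names
        triangleIdMaterial.SetBuffer("_Indices", indexBuffer);
        triangleIdMaterial.SetMatrix("_LocalToWorld", localToWorld);
        triangleIdMaterial.SetPass(0);

        // One vertex shader invocation per index; the shader derives
        // idTriangle = idVertex / 3 from SV_VertexID (see Trick 2).
        Graphics.DrawProceduralNow(MeshTopology.Triangles, indices.Length);

        vertexBuffer.Release();
        indexBuffer.Release();
    }
}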

Trick 2: Triangle Index to Color to Triangle Index

In the shader used to draw the triangle indexes we transform each triangle index (idTriangle) into a color (o.id):

o.id = float4(
   ((idTriangle >> 0) & 0xFF) / 255.0,
   ((idTriangle >> 8) & 0xFF) / 255.0,
   ((idTriangle >> 16) & 0xFF) / 255.0,
   ((idTriangle >> 24) & 0xFF) / 255.0
);

These are bitwise operators, and two of them are used here:

  • The right shift >> n, which removes the lowest n bits

  • & 0xFF, which keeps only the lowest 8 bits (everything else is masked out) to obtain a value between 0 and 255, which is our color resolution!

Note: 0xF is hexadecimal for 15 (int), which is 1111 in binary, and 0xFF is hexadecimal for 255 (int), which is 11111111 in binary; exactly what we need to keep the lowest 8 bits using the & operator.

To summarize, we take the triangle index (a 32-bit unsigned integer) and split it into four 8-bit numbers which become our color components between 0 and 1.0. The red component uses bits 1 to 8, green uses bits 9 to 16, blue uses bits 17 to 24 and alpha uses bits 25 to 32. We cover all colors from black ((0,0,0,0) for idTriangle = 0) to white ((1,1,1,1) for idTriangle = 4294967295).

Note: we will actually never reach the white color because in the shader code we retrieve idTriangle using idVertex / 3, and idVertex is also a uint. Still… we have enough uint resolution for more than 1 billion unique triangle indexes, which should be enough :)

It’s a clever trick because there is no loss of information: every 32-bit uint maps to a unique set of four 8-bit numbers. And more importantly, the operation is reversible. Finally, in the compute shader used to parse the render texture, we convert the color back into an index:

uint idxTriangle =
   (((int)(px.r * 255) & 0xFF) << 0) |
   (((int)(px.g * 255) & 0xFF) << 8) |
   (((int)(px.b * 255) & 0xFF) << 16) |
   (((int)(px.a * 255) & 0xFF) << 24);
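If you want to convince yourself that the round trip is lossless, here is a small C#-side check (mine, not part of the project’s shaders). It rounds instead of truncating to stay safe against float precision:

using UnityEngine;

public static class TriangleIdPackingCheck
{
    // Packs a triangle index into four normalized channels and unpacks it again.
    public static void Check(uint id)
    {
        float r = ((id >> 0)  & 0xFF) / 255.0f;
        float g = ((id >> 8)  & 0xFF) / 255.0f;
        float b = ((id >> 16) & 0xFF) / 255.0f;
        float a = ((id >> 24) & 0xFF) / 255.0f;

        uint decoded = (((uint)(r * 255f + 0.5f) & 0xFF) << 0)
                     | (((uint)(g * 255f + 0.5f) & 0xFF) << 8)
                     | (((uint)(b * 255f + 0.5f) & 0xFF) << 16)
                     | (((uint)(a * 255f + 0.5f) & 0xFF) << 24);

        Debug.Assert(decoded == id); // e.g. Check(123456u) gets 123456 back
    }
}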

Trick 3: Compute Shader

We use one compute shader with two kernels:

  • AccumulateTriangles parses the render texture (fully multithreaded over the texture width and height) and creates a boolean array indexed by triangle index, where an entry is true when that triangle is visible

  • MapTris parses the previous boolean array and transforms it into the result: an array of the triangle indexes of the visible triangles (a dispatch sketch is shown below the figure)

ComputeShader.jpg
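On the C# side the dispatch could look roughly like this. The kernel names match the ones above, but the property names, buffer layout and thread group sizes (8x8 and 64) are assumptions of this sketch:

using UnityEngine;

public static class OcclusionDispatchSketch
{
    public static void Dispatch(ComputeShader cs, RenderTexture idTexture,
                                ComputeBuffer visibleFlags,  // one entry per triangle, 0 or 1
                                ComputeBuffer visibleIds,    // compacted list of visible indexes
                                int triangleCount)
    {
        int kAccumulate = cs.FindKernel("AccumulateTriangles");
        int kMap        = cs.FindKernel("MapTris");

        // Kernel 1: one thread per pixel, decodes the color back into a triangle
        // index and flags that triangle as visible.
        cs.SetTexture(kAccumulate, "_IdTexture", idTexture);
        cs.SetBuffer(kAccumulate, "_VisibleFlags", visibleFlags);
        cs.Dispatch(kAccumulate, idTexture.width / 8, idTexture.height / 8, 1);

        // Kernel 2: one thread per triangle, compacts the flag array into the
        // final list of visible triangle indexes.
        cs.SetBuffer(kMap, "_VisibleFlags", visibleFlags);
        cs.SetBuffer(kMap, "_VisibleIds", visibleIds);
        cs.Dispatch(kMap, Mathf.CeilToInt(triangleCount / 64f), 1, 1);
    }
}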

Trick 4: GPU to CPU Async Data Request

The compute shaders are dispatched once the render texture is drawn, and then there is one last trick: AsyncGPUReadback. It’s important to understand that both the render texture and the compute buffers reside only in VRAM, and most of this happens on the GPU, asynchronously.

To process a render texture (in native memory) you have to convert it to a Texture2D (in managed memory), and here with the compute buffer it’s almost the same process. We create an asynchronous GPU read request, wait for it to complete (while + yield), then we get a NativeArray (in native memory) that we convert into a managed memory array. See the documentation of AsyncGPUReadbackRequest for more info, but unfortunately I couldn’t find any official example.

AsyncGPUReadback.jpg
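Here is a minimal coroutine sketch of that readback, assuming the visible triangle indexes end up in a ComputeBuffer; the names are illustrative:

using System.Collections;
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

public class ReadbackSketch : MonoBehaviour
{
    public int[] VisibleTriangles { get; private set; }

    public IEnumerator ReadVisibleTriangles(ComputeBuffer visibleIds)
    {
        AsyncGPUReadbackRequest request = AsyncGPUReadback.Request(visibleIds);

        // Polling + yield instead of WaitForCompletion() keeps the main thread,
        // the CPU and the GPU from stalling while the data travels back.
        while (!request.done)
            yield return null;

        if (request.hasError)
            yield break;

        NativeArray<int> native = request.GetData<int>(); // still native memory
        VisibleTriangles = native.ToArray();              // now a managed array
    }
}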